If you are a GitLab team member and are looking to alert Reliability Engineering about an availability issue with GitLab.com, please find quick instructions to report an incident here: Reporting an Incident.
If you are a GitLab team member looking for assistance from Reliability Engineering, please see the Getting Assistance section.
Reliability Engineering is responsible for all of GitLab's user-facing services, with GitLab.com being the primary one. Site Reliability Engineers (SREs) ensure the availability of these services and build the tools and automation to monitor and enable that availability. These user-facing services span multiple environments, including staging, GitLab.com, and dev.GitLab.org (see the list of environments).
Reliability Engineering ensures that GitLab's customers can rely on GitLab.com for their mission-critical workloads. We approach availability as an engineering challenge and empower our counterparts in Development to make the best possible infrastructure decisions. We own and iterate often on how we manage incidents and continually derive and share our learnings by conducting thorough reviews of those incidents.
If you'd like our assistance, please use one of the issue generation templates below and the work will be routed appropriately:
We can also be reached in Slack in the #production channel for questions related to GitLab.com and in the #infrastructure-lounge channel for all other questions.
Assistance from the Infrastructure Team is occasionally required to help solve or troubleshoot external customer issues.
All larger features, configuration changes, and new services must go through a Production Readiness review.
The General Team supports the Reliability Team's overall vision by taking on services for GitLab.com that do not fit the mission of the other Reliability Teams.
The Observability team maintains metrics and logs platforms for GitLab SaaS and is responsible for Prometheus, Thanos, Grafana, and Logging.
The Foundations team builds, runs and owns the core infrastructure for GitLab.com.
The Database Reliability Team is responsible for the PostgreSQL engine for GitLab.com services, as well as a range of related systems which help to ensure the availability and reliability of the database engine.
The Practices Team has SREs who work full-time directly with Stage Group teams to focus on their reliability and infrastructure concerns for GitLab.com when their needs exceed what the General Team can support.
We maintain a single source of truth epic for all work underway for the Reliability team. That epic can be found at GitLab SaaS Reliability - work queue and represents the current state of project work assigned within teams. It references projects detailed in the form of sub-epics.
Corrective Actions for incidents

OKRs for the Reliability team that require status tracking should be updated each Wednesday. When updating the progress percentage of any given KR, it is not necessary to provide extensive notes. One sentence with a link to a larger update is sufficient in most cases. It is also acceptable to do a check-in without providing a note on a single occasion, but not over several check-ins.
In addition to status updates every Wednesday, all objectives and key results assigned to the current quarter must have the following set:
- The ~Sub-Department::Reliability label applied.
- A ~Reliability::<name> label applied (e.g. ~Reliability::Practices).