If you are a GitLab team member and are looking to alert Reliability Engineering about an availability issue with GitLab.com, please find quick instructions to report an incident here: Reporting an Incident.
If you are a GitLab team member looking for assistance from Reliability Engineering, please see the Getting Assistance section.
Reliability Engineering is responsible for all of GitLab's user-facing services, with GitLab.com as its primary responsibility. Site Reliability Engineers (SREs) ensure the availability of these services and build the tools and automation that monitor and enable this availability. These user-facing services span many environments, including staging, GitLab.com, and dev.GitLab.org, among others (see the list of environments).
Reliability Engineering ensures that GitLab's customers can rely on GitLab.com for their mission-critical workloads. We approach availability as an engineering challenge and empower our counterparts in Development to make the best possible infrastructure decisions. We own and iterate often on how we manage incidents and continually derive and share our learnings by conducting thorough reviews of those incidents.
If you're a GitLab team member and are looking to alert Reliability Engineering about an availability issue with GitLab.com, please find quick instructions to report an incident here: Reporting an Incident.
If you'd like our assistance, please use one of the issue generation templates below and the work will be routed appropriately:
We can also be reached in Slack in the #production channel for questions related to GitLab.com and in the #infrastructure-lounge channel for all other questions.
All larger features, configuration changes, and new services must go through a Production Readiness review.
We maintain a single-source-of-truth epic for all work underway in the Reliability team. That epic, GitLab SaaS Reliability - work queue, represents the current state of project work assigned within squads and references individual projects as sub-epics.
As Reliability, we have three main work streams and each varies in the type of work. Each stream has an associated role:
Corrective Actions for incidents.
For a more detailed overview of how issues are triaged and prioritized, see the issues page.
Corrective Actions are issues arising from incidents. See the link for the suggested way to create them.
We use this board to track corrective actions work. Corrective Actions are also an important performance indicator for the Infrastructure Department.
Currently, a squad is assigned to this work with two goals: 1) refining all open Corrective Actions (CAs) and 2) burning down the backlog of open CAs. The Slack channel for that squad is #infra-corrective-actions. Anyone is welcome and encouraged to help with CA work.
The process is as follows:
~infradev label.