If you are a GitLab team member and are looking to alert Reliability Engineering about an availability issue with GitLab.com, please find quick instructions to report an incident here: Reporting an Incident.
If you are a GitLab team member looking for assistance from Reliability Engineering, please see the Getting Assistance section.
Reliability Engineering is responsible for all of GitLab's user-facing services, primarily GitLab.com. Site Reliability Engineers (SREs) ensure the availability of these services and build the tools and automation needed to monitor and maintain that availability. These user-facing services span multiple environments, including staging, GitLab.com, and dev.GitLab.org, among others (see the list of environments).
Reliability Engineering ensures that GitLab's customers can rely on GitLab.com for their mission-critical workloads. We approach availability as an engineering challenge and empower our counterparts in Development to make the best possible infrastructure decisions. We own and iterate often on how we manage incidents and continually derive and share our learnings by conducting thorough reviews of those incidents.
If you'd like our assistance, please use one of the issue generation templates below and the work will be routed appropriately:
All larger features, configuration changes, and new services must go through a Production Readiness review.
We maintain a single-source-of-truth epic for all work underway in the Reliability team. That epic, GitLab SaaS Reliability - work queue, represents the current state of project work assigned within squads and references the individual projects as sub-epics.
This squad is composed of DBREs and one or two SREs, to support their learning to implement DB-related infrastructure changes.
Corrective Actions for incidents.
For a more detailed overview of how issues are triaged and prioritized, see the issues page.
Corrective Actions are issues arising from incidents. See the link for the suggested way to create them.
Currently, a squad is assigned to this work, with a focus on 1) refining all open CAs and 2) burning down the backlog of open CAs. The Slack channel for that squad is #infra-corrective-actions. Anyone is welcome and encouraged to help with CA work.
The process is as follows:
OKRs that require status tracking should be updated each Wednesday. When updating the progress percentage of a KR, extensive notes are not necessary: one sentence with a link to a larger update is sufficient in most cases. A check-in without a note is acceptable on a single occasion, but not across several consecutive check-ins.