If you are a GitLab team member and are looking to alert Reliability Engineering about an availability issue with GitLab.com, please find quick instructions to report an incident here: Reporting an Incident.
If you are a GitLab team member looking for assistance from Reliability Engineering, please see the Getting Assistance section.
Reliability Engineering is responsible for all of GitLab's user-facing services, with GitLab.com as its primary responsibility. Site Reliability Engineers (SREs) ensure the availability of these services and build the tools and automation to monitor and enable that availability. These services span multiple environments, including staging, GitLab.com, and dev.GitLab.org, among others (see the list of environments).
Reliability Engineering ensures that GitLab's customers can rely on GitLab.com for their mission-critical workloads. We approach availability as an engineering challenge and empower our counterparts in Development to make the best possible infrastructure decisions. We own and iterate often on how we manage incidents, and we continually derive and share our learnings by conducting thorough reviews of those incidents.
The Reliability Engineering team is composed of Database Reliability Engineers (DBREs) and SREs. As the role titles indicate, they have different areas of specialty but share ownership of GitLab.com's availability. The team is broken down into three sub-teams, each with its own area of ownership.
The Core Infra team owns core infrastructure tooling, network ingress/egress, CDNs, DNS, and secrets management.
The people of Core Infra are:
The Datastores team owns our persistent storage platforms, with PostgreSQL on GitLab.com being the top priority.
Datastores is:
The Observability team owns the monitoring and alerting infrastructure for GitLab.com, as well as our caching/queuing infrastructure.
Observability is:
Each team within Reliability manages its own backlog and has its own sprints.
Team | Team Label | Backlog Board | Sprint Board |
---|---|---|---|
Core Infra | ~team::Core-Infra | Core Infra - Backlog | Core Infra - Sprint |
Datastores | ~team::Datastores | Datastores - Backlog | Datastores - Sprint |
Observability | ~team::Observability | Observability - Backlog | Observability - Sprint |
There is also a primary Reliability backlog board, which serves as a single point of triage for work that originates outside a specific Reliability team. That board is located at Reliability Team - Backlog.
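As a rough illustration of how the team labels relate to the backlog boards named in this section, the lookup below maps an issue's labels to the board it would be expected to appear on. This is a minimal sketch, not tooling the team actually uses: the function name `backlog_board` is hypothetical, it assumes boards are selected purely by label, and it assumes a specific team label takes precedence over `team::Reliability` if both are present.

```python
from typing import Iterable, Optional

# Team labels and backlog board names as listed in the table above.
TEAM_BACKLOGS = {
    "team::Core-Infra": "Core Infra - Backlog",
    "team::Datastores": "Datastores - Backlog",
    "team::Observability": "Observability - Backlog",
}

# Label and board for work not yet assigned to a specific sub-team.
RELIABILITY_LABEL = "team::Reliability"
RELIABILITY_BACKLOG = "Reliability Team - Backlog"


def backlog_board(labels: Iterable[str]) -> Optional[str]:
    """Return the backlog board an issue is expected on, or None.

    Hypothetical helper: assumes label-only board selection and that a
    sub-team label wins over the general Reliability label.
    """
    label_set = set(labels)
    for team_label, board in TEAM_BACKLOGS.items():
        if team_label in label_set:
            return board
    if RELIABILITY_LABEL in label_set:
        return RELIABILITY_BACKLOG
    return None
```

For example, `backlog_board({"team::Datastores", "workflow-infra::Triage"})` resolves to the Datastores backlog, while an issue carrying only `team::Reliability` resolves to the primary Reliability backlog.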
Issues in team backlogs come from two sources:

- … ~"workflow-infra::Triage" label, this automatically places it on the team's backlog board.
  - … ~"workflow-infra::Triage" label is removed and the ~"workflow-infra::Ready" label is added.
  - … ~"workflow-infra::Triage" label is removed and the ~"workflow-infra::Ready" label is added.
  - … ~"workflow-infra::Triage" label is removed.
- … ~"team::Reliability" and ~"workflow-infra::Triage". This places it on the Reliability - Backlog board and in a triage process for assignment to the correct subteam.
  - … Reliability - Backlog board, the daytime IMOC reviews the board for issues in the ~"workflow-infra::Triage" column and the issue is either:
    - … ~"team::Reliability" label and adding the appropriate team label.
    - … ~"workflow-infra::Triage" label is removed.
    - … ~"workflow-infra::Triage" label is removed and the ~"workflow-infra::Ready" label is added.
    - … ~"workflow-infra::Triage" label is removed and the ~"workflow-infra::Ready" label is added.
    - … ~"workflow-infra::Triage" label is removed.

If you'd like our assistance, please use one of the two issue generation templates below and the work will be routed appropriately.
Open a General Request Issue - follow this link to create a general issue for Reliability Engineering.
Open a Customer Questions and Sales Enablement Issue - follow this link to seek assistance in answering questions for prospects or current customers.
We can also be reached in Slack in the #production channel for questions related to GitLab.com and in the #infrastructure-lounge channel for all other questions.