
Reliability Engineering

Mission

Reliability Engineering teams are the gatekeepers and primary caretakers of the operational environment hosting all of GitLab's user-facing services (most notably GitLab.com), focusing on their availability, performance and scalability through reliability considerations.

Site Reliability Teams

The Site Reliability teams are responsible for all of GitLab's user-facing services, most notably GitLab.com. Site Reliability Engineers ensure that these services are available, reliable, scalable, performant and, with the help of GitLab's Security Department, secure. This infrastructure spans a multitude of environments, including staging, GitLab.com (production), and dev.GitLab.org, among others (see the list of environments).

SREs are primarily focused on GitLab.com's availability and on building the right toolsets and automation to enable development to ship features as quickly and bug-free as possible, leveraging the tools provided by GitLab (we must dogfood).

Another part of the job is building monitoring tools that allow quick troubleshooting as a first step, then turning these into alerts that notify based on symptoms, and finally fixing the problem or automating the remediation. We can only scale GitLab.com by being smart and using resources effectively, starting with our own time as the main scarce resource.

Vision

Reliability Engineering teams are composed of DBREs and SREs. As the role titles indicate, they have different areas of specialty, but the reliability of the environment is their unifying goal.

Reliability Engineering teams own the following operational processes:

The teams' overarching goal with respect to these processes is to make them obsolete through automation.

Key Metrics

Key metrics related to this group include:

Team

Each member of the Site Reliability Team is part of this vision:

The DBRE team has its own roadmap, dashboard, milestones and on-call rotation.

Organizing Our Work

The Reliability Engineering team primarily organizes its work using the ~"team::Reliability" label in the GitLab Infrastructure Team group.

Workflow / How we work

There are now 3 infrastructure teams, each with its own manager:

  1. DBRE Infra team - Gerardo Lopez Fernandez
  2. Secure & Defend Infra team - Anthony Sandoval
  3. CI/CD & Enablement Infra team - David Smith

Each team manages its own backlog related to its OKRs. We use Milestones as timeboxes, and each team roughly aligns with the Planning blueprint.

Boards: SRE On-call and Teams

The three teams share the on-call rotations for GitLab.com. The 3 SREs in the weekly rotation (EMEA/Americas/APAC) share responsibility for triaging issues and managing tasks on the SRE On-call board. The board uses the group SRE:On-call label to identify issues across subgroups in gitlab-com and is not aligned with any single milestone.
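As a rough illustration of how that label can be queried (the group path, token variable, and script below are assumptions rather than existing tooling), the GitLab REST API can list the open issues currently on the board:

```python
# Illustrative sketch only: list open issues carrying the group-level
# "SRE:On-call" label via the GitLab REST API. The group path, token
# environment variable, and script are assumptions, not existing tooling.
import os

import requests

GITLAB_API = "https://gitlab.com/api/v4"
GROUP = "gitlab-com"  # assumed group path; the label spans its subgroups

response = requests.get(
    f"{GITLAB_API}/groups/{GROUP}/issues",
    headers={"PRIVATE-TOKEN": os.environ["GITLAB_TOKEN"]},
    params={"labels": "SRE:On-call", "state": "opened", "per_page": 100},
    timeout=30,
)
response.raise_for_status()

for issue in response.json():
    print(f"{issue['web_url']} - {issue['title']}")
```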

Incoming requests of the Infrastructure Team

Incoming requests to the infrastructure team can start in the Current milestone, but may be triaged out to the appropriate teams.

Add issues at any time to the infrastructure issue tracker and let one of the managers for the production team know of the request. If your team has commitments related to the issue, knowing its timeline helps our prioritization. We do reserve part of our time for interrupt requests, but that does not always mean we can fit in everything that comes to us.

Each team's manager will triage incoming requests for the services their team owns. In some cases we may decide to pull that work in immediately; in other cases we may defer it to a later milestone if we have higher-priority work currently in progress. The 3 managers meet twice a week, so we can share efforts and rebalance work if needed. Work that is ready to pull will be added to the team milestone(s) and appear on their boards.

Bigger projects should start as a Design MR so we can think through what we want to achieve; we then create an Epic for the design to group its issues together.

Issue Trackers

Infrastructure

The infrastructure issue tracker is the backlog for the infrastructure team and tracks all work that SRE teams are doing that is not related to an ongoing change or incident.

Production Issue Tracker

We have a production issue tracker. Issues in this tracker are meant to track incidents and changes to production that need approval. We can host discussion of proposed changes in linked infrastructure issues. These issues should carry ~incident or ~change and include notes describing what happened or what is changing, with relevant infrastructure team issues linked for supporting information.

Standups and Retros

Standups: We do standups with a bot that asks each team member for updates at 11 AM in their timezone. Updates go into our Slack channel.

Retros: We are testing async retros with another bot that runs on the second Wednesday of each milestone. Updates from the retro also go to our Slack channel. A summary is produced so that we can vote on important issues to discuss in more depth; these can then help us update our themes for upcoming milestones.

Boards

We use boards extensively to manage our work (see https://gitlab.com/groups/gitlab-com/gl-infra/-/boards).

Reliability Engineering

The Reliability Engineering board is groomed daily by the Reliability Managers.

The managers' priorities are to:

  1. Ensure the workflow::Blocked list is empty (i.e., unblocking issues is critical)
  2. Keep the board up to date with the help of issue assignees

Production

The Production board keeps track of the state of Production, showing, at a glance, incidents, hotspots, changes and deltas related to production; it also includes on-call reports.

There are four types of issues related to production, denoted by labels:

| Label | Description |
|-------|-------------|
| incident | Incidents are anomalous conditions where GitLab.com is operating below established SLOs. |
| hotspot | Hotspots identify threats that are likely to become incidents if not addressed, but that we are unable to address right away. |
| change | Changes are scheduled changes performed through maintenance windows. |
| delta | Deltas reflect deviations from standard configuration that will eventually merge into the standard. |

Logistics

The Production Board is groomed by the IMOC/CMOC on a daily basis, and we strive to keep it both clean and lean.

DBRE

Issues carrying the Database group label are automatically added to the board.

Observability

There are two labels that identify issues related to Observability efforts for GitLab.com. First, there is a gitlab-com group label that collects Observability-related issues company wide—~Observability. Then there's the ~Board::Observability scoped label in the gl-infra sub-group. We use the second label to distinguish issues that require the focus of the Site Reliability team responsible for observability from other groups' properly labeled Observability issues.

There is a name collision at the sub-group level—we have an ~Observability label there, too. However, it's used primarily at the epic level to define our Roadmap.

If you need SRE attention on a GitLab.com Observability related issue, please add the Board::Observability label.
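For reference, here is a hedged sketch of adding that label programmatically; the project path and issue IID below are placeholders, and in practice the issue UI or a `/label ~"Board::Observability"` quick action in a comment works just as well:

```python
# Illustrative sketch only: add the scoped "Board::Observability" label to an
# existing issue via the GitLab REST API. The project path and issue IID are
# placeholders.
import os
from urllib.parse import quote

import requests

GITLAB_API = "https://gitlab.com/api/v4"
PROJECT = quote("gitlab-com/gl-infra/infrastructure", safe="")  # assumed project path
ISSUE_IID = 12345  # placeholder issue IID

response = requests.put(
    f"{GITLAB_API}/projects/{PROJECT}/issues/{ISSUE_IID}",
    headers={"PRIVATE-TOKEN": os.environ["GITLAB_TOKEN"]},
    data={"add_labels": "Board::Observability"},
    timeout=30,
)
response.raise_for_status()
print(response.json()["labels"])
```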

Labels

| Label | List | Focus |
|-------|------|-------|
| Director of Infrastructure | Director of Infrastructure | Infrastructure |
| team::CI/CD & Enablement | Reliability Engineering: CI/CD & Enablement | Reliability |
| team::Dev & Ops | Reliability Engineering: Dev & Ops | Availability |
| team::Secure & Defend | Reliability Engineering: Secure and Defend | Observability |
| team::Delivery | Delivery Engineering | Scalability |
| DE::Infrastructure | Distinguished Engineer, Infrastructure | Infrastructure |
| OA::Infrastructure | Operations Analyst, Infrastructure | Cost |

Other labels are relevant to issues in the board:

| Label | Purpose |
|-------|---------|
| OKR | Denotes an OKR-related issue. It is used to communicate status and progress for quarterly KRs as assigned to each team. |
| KPI | Denotes a KPI-related issue. It is used to track progress on the definition, implementation and tracking of Infrastructure KPIs. |
| workflow::state | Denotes the state of an issue according to our workflow conventions: Ready, In Progress, Under Review, Blocked, Done, and Cancelled. |

List Labels

Board lists are driven by the following labels:

| Type | Label | List | Notes |
|------|-------|------|-------|
| group | Ongres | Issues assigned to OnGres | |
| group | Ongres::Support | OnGres support issues | Support |
| group | Ongres::Project | OnGres project issues | Project |
| group | Workflow::Ready | Issues ready to start | Issues must always be prioritized |
| group | Workflow::In Progress | Issues in progress | Issues must always have a Due Date |
| group | Workflow::Blocked | Issues blocked | Issues must describe what can unblock them |

Priority and Criticality Labels

Issues are labeled by priority and criticality. Priority indicates what should be worked on first. Criticality describes the risk of not doing the work.

All issues must have a priority and a criticality assigned to them. This is a requirement for issues in the Workflow::Ready state.

| Type | Priority Label | Description |
|------|----------------|-------------|
| group | P1 | Highest priority items that require immediate action, with an expected ETA in hours/days |
| group | P2 | Items that require prompt attention, with an expected ETA within the week |
| group | P3 | Items with an expected ETA within the current milestone |
| group | P4 | Items with an expected ETA within the following milestone or beyond |

| Type | Criticality Label | Description |
|------|-------------------|-------------|
| group | C1 | Immediate threat to availability, performance or data durability |
| group | C2 | Expected threat to availability and/or performance within 30 days |
| group | C3 | Expected threat to availability and/or performance within 60 days |
| group | C4 | Expected threat to availability and/or performance beyond 60 days |

Note: data loss is always a C1 criticality.

All issues with the Priority 1 (P1) label should be updated daily.
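A minimal sketch of how such a daily check could be automated follows; the group path, label name handling, and 24-hour threshold are illustrative assumptions only:

```python
# Illustrative sketch only: flag open P1 issues that have not been updated in
# the last 24 hours. The group path, label name, and threshold are assumptions.
import os
from datetime import datetime, timedelta, timezone
from urllib.parse import quote

import requests

GITLAB_API = "https://gitlab.com/api/v4"
GROUP = quote("gitlab-com/gl-infra", safe="")  # assumed group path
cutoff = datetime.now(timezone.utc) - timedelta(hours=24)

response = requests.get(
    f"{GITLAB_API}/groups/{GROUP}/issues",
    headers={"PRIVATE-TOKEN": os.environ["GITLAB_TOKEN"]},
    params={"labels": "P1", "state": "opened", "per_page": 100},
    timeout=30,
)
response.raise_for_status()

for issue in response.json():
    updated_at = datetime.fromisoformat(issue["updated_at"].replace("Z", "+00:00"))
    if updated_at < cutoff:
        print(f"Needs an update: {issue['web_url']} (last updated {issue['updated_at']})")
```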

Our milestones are 2 weeks long.