Production Team

The Production SRE team is responsible for all user-facing services. Production SREs ensure that these services are secure, reliable, and fast. This infrastructure includes staging, GitLab.com and dev.GitLab.org; see the list of environments.

Production SREs also have a strong focus on building the right toolsets and automation to enable development to ship features as fast and bug-free as possible, leveraging the tools provided by GitLab.com itself - we must dogfood.

Another part of the job is building monitoring tools that allow quick troubleshooting as a first step, then turning those checks into alerts that notify based on symptoms, and finally fixing the problem or automating the remediation. We can only scale GitLab.com by being smart and using resources effectively, starting with our own time as the main scarce resource.

GitLab.com

We want to make GitLab.com ready for mission critical workloads. That readiness means:

  1. Speedy (speed index below 2 seconds)
  2. Available (uptime above 99.95%)
  3. Durable (automated backups and restores, monthly manual tests)
  4. Secure (prioritize requests of our security team)
  5. Deployable (quickly deploy and provide metrics for new versions in all environments)

Tenets

  1. Security: reduce risk to its minimum, and make the minimum explicit.
  2. Transparency, clarity and directness: public and explicit by default, we work in the open, we strive to get signal over noise.
  3. Efficiency: smart resource usage; we should not fix scalability problems by throwing more resources at them, but by understanding where the waste is happening and then working to make it disappear. We should work hard to reduce toil to a minimum by automating all the boring work out of our way.

Workflow / How we work

There are now 3 infrastructure teams, each named with its manager's initials and reporting to:

  1. Jose Finotto - team "JF Infra team"
  2. Anthony Sandoval - team "AS Infra team"
  3. David Smith - team "DS Infra team"

Each team manages its own backlog related to its OKRs and areas of ownership. We use Milestones as timeboxes, and each team can roughly align with the Planning blueprint.

The three teams share the on-call rotation for GitLab.com. There is always a "Current" milestone which tracks the current on-call work in progress and to be triaged. The two people currently on call for the EMEA and Americas rotations are in charge of triage and work in the Current Milestone. The next two people up in the rotation also work from Current. All other team members work from their team's respective boards and milestones.

Incoming requests to the Infrastructure Team

Incoming requests to the infrastructure team can start in the Current Milestone, but may be triaged out to the correct team.

Add issues at any time to the infrastructure issue tracker. Let one of the managers for the production team know of the request. It would be helpful for our prioritization to know the timeline for the issue if your team has commitments related to it. We do reserve part of our time for interrupt requests, but that does not always mean we can fit in everything that comes to us.

Each team's manager will triage incoming requests for the services their team owns. In some cases, we may decide to pull that work immediately; in other cases, we may defer the work to a later milestone if we have higher-priority work currently in progress. The 3 managers meet twice a week to share efforts and rebalance work if needed. Work that is ready to pull will be added to the team milestone(s) and appear on their boards.

Bigger projects should start as a Design MR so we can think through what we want to achieve, and then become an Epic that groups the design's issues together.

Infra team areas of ownership

The team areas of responsibility are:

  1. JF Infra team: (non-git data)
    • Observability Tooling (Labels: Observability, Prometheus, Grafana)
    • Non-git GitLab Data (Labels: Service:Postgres, Service:Redis, Service:Registry, Object Storage)
    • Backups for postgres
  2. AS Infra team: (non .com infra)
    • Infrastructure automation tooling - Terraform / Chef (Labels: chef, cookbooks, terraform)
    • Company Assets about, version, license .gitlab.com (Labels: ~singleton-svcs)
  3. DS Infra team: (git+ci)
    • Git Data (Labels: Service:Gitaly, Service:Pages)
    • CI / CD (Labels: Service:CI Runners, ci)
    • Backups for git data

Note: Security is an aspect of all three teams, so a relationship with security exists for all 3 teams.

As the infrastructure teams evolve and grow, we will continue iterating towards more departmental alignment with the products: Dev, Sec, Ops and Enablement. We'll also continue to align teams with ownership of core parts of the infrastructure for GitLab.com, like our infrastructure as code, configuration management, observability (ELK, Prometheus), and backups.

GitLab Product/Service to Infrastructure Label mapping:

Service   | Label              | Team
Create    | ~Product:create    | DS Infra team
Plan      | ~Product:Plan      | JF Infra team
Manage    | ~Product:Manage    | JF Infra team
Monitor   | ~Product:Monitor   | JF Infra team
Configure | ~Product:Configure | AS Infra team
Verify    | ~Product:Verify    | DS Infra team
Release   | ~Product:Release   | DS Infra team
Secure    | ~Product:Secure    | AS Infra team
Gitter    | ~Product:Gitter    | AS Infra team

Team board and Milestones

Production Issue Tracker

We have a production issue tracker. Issues in this tracker are meant to track incidents and changes to production that need approval. We can host discussion of proposed changes in linked infrastructure issues. These issues should have the ~incident or ~change label and notes describing what happened or what is changing, with relevant infrastructure team issues linked for supporting information.

Infrastructure Issue Tracker

We track on-call work in the Current Milestone on the infrastructure team board, in a Kanban-style workflow with no swimlanes.

The columns on our board are:

  1. Ready - Issues that are "ready to pull"
  2. Waiting - Issues that have been started, but are waiting on an external item (vendor ticket, another team, etc)
  3. In Progress
  4. Completed (aka Closed) - Issue is closed per all notes or criteria in the issue description.

Guiding philosophies: To get from Planning to Ready, an issue should be:

We'll organize our work into 2-week milestones. These milestones are meant to:

  1. Give us a timebox to gain some measure of velocity over time.
  2. Give us a window of time to focus on a particular theme for planned work. It is better to focus on one theme for a few timeboxes rather than start 3 things and not finish any.
  3. Ideally, as we gain more clarity on velocity, 50-60% of a milestone's velocity will be scheduled/known work, while the remaining capacity will be room for emergent or toil-related work.
  4. We will hold monthly async planning sessions to pick the particular areas the team will focus on.
  5. Issues that fit the theme will be scheduled into milestones as velocity allows.

Our theme tracking for milestones is currently in this Google doc.

There will always be a "Current Milestone" and a "Next Milestone". This way, when creating or updating issues, we can use quick actions like /milestone %"Current Milestone" and /milestone %"Next Milestone" to quickly get issues added.
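For example, typed into an issue description or comment, a triaged on-call issue might pick up the Current milestone and an on-call label like this (the ~Oncall label here is just an illustration):

```
/milestone %"Current Milestone"
/label ~Oncall
```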

When a milestone is complete, we'll rename the finished milestone to "YYYY-MM-DD Completed" and update the Current and Next milestones.

We also want to value handing off issues to take advantage of the many timezones our team covers. An issue may be started by any team member in any timezone, and we can mark an issue with ~Ready_to_Handoff if it can go to someone in another timezone. If we mark an issue with the ~Ready_to_Handoff label, it should have clear notes about where it was left off and what the next steps are. Handing off is not required; anyone can own an issue to completion too, but we want to be able to communicate and work across the many people on our team where it makes sense.

Standups and Retros

Standups: We do standups with a bot that asks each team member for updates at 11 AM in their timezone. Updates go into our Slack channel.

Retros: We are testing async retros with another bot, which run on the second Wednesday of our milestone. Updates from the retro also go to our Slack channel. A summary is made so that we can vote on the important items to talk about in more depth. These can then help us update our themes for milestones.

Why infrastructure and production queues?

Premise

Long term, additional teams will perform work on the production environment:

We cannot keep track of events in production across a growing number of functional queues.

Furthermore, said teams will start to have on-call rotations for both their function (e.g., security) and their services. For people on-call, having a centralized tracking point to keep track of said events is more effective than perusing various queues. Timely information (in terms of when an event is happening and how long it takes for an on-call person to understand what's happening) about the production environment is critical. The production queue centralizes production event information.

Implementation

Functional queues track team workloads (infrastructure, security, etc.) and are the source of the work that has to get done. Some of this work clearly impacts production (build and deploy new storage nodes); some of it will not (develop a tool to do x, y, z) until it is deployed to production.

The production queue tracks events in production, namely:

Over time, we will implement hooks into our automation to automagically inject change audit data into the production queue.
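As a rough illustration of what such a hook could do (a minimal sketch only, assuming the standard GitLab issues API; the project path, token handling, and issue text below are hypothetical, not our actual tooling):

```python
import os

import requests

GITLAB_API = "https://gitlab.com/api/v4"
# Hypothetical, URL-encoded project path for the production issue tracker.
PRODUCTION_PROJECT = "gitlab-com%2Fgl-infra%2Fproduction"


def log_change(title: str, description: str) -> str:
    """Open a ~change issue in the production queue to record an automated change."""
    response = requests.post(
        f"{GITLAB_API}/projects/{PRODUCTION_PROJECT}/issues",
        headers={"PRIVATE-TOKEN": os.environ["GITLAB_TOKEN"]},
        data={
            "title": title,
            "description": description,
            "labels": "change",  # audit entries carry the ~change label
        },
    )
    response.raise_for_status()
    return response.json()["web_url"]


if __name__ == "__main__":
    url = log_change(
        "Change: resized the Gitaly fleet",
        "Automated change; see the linked infrastructure issue for details.",
    )
    print(f"Change audit issue: {url}")
```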

This also leads to a single source of data. Today, for instance, incident reports for the week get transcribed to both the On-call Handoff and Infra Call documents (we also show exceptions in the latter). These meetings serve different purposes but have overlapping data. The input for this data should be queries against the production queue rather than manual compilation in documents.

Additionally, we need to keep track of error budgets, which should also be derived from the production queue.

We will also be collapsing the database queue into the infrastructure queue. The database is a special piece of the infrastructure for sure, but so are the storage nodes, for example.

All direct or indirect changes to authentication and authorization mechanisms used by GitLab Inc., whether by customers or employees, require additional review and approval by a member of at least one of the following teams:

This process is enforced for the following repositories, where the approval is made mandatory using MR approvals:

Additional repositories may also require this approval and can be evaluated on a case-by-case basis.

Labeling Issues

We use issue labels within the Infrastructure issue tracker to assist in prioritizing and organizing work. Prioritized labels are:

Priority

Label | Description     | Estimate to fix
~P1   | Urgent Priority | Has to be executed immediately
~P2   | High Priority   | Has to be executed within 2 milestones at most
~P3   | Medium Priority | Has to be executed within 6 milestones at most
~P4   | Low Priority    | Lower priority to be executed

Severity

Label | Meaning           | Impact on Functionality                              | Example
~S1   | Blocker           | Outage, broken feature with no workaround            | Unable to create an issue. Data corruption/loss. Security breach
~S2   | Critical Severity | Broken feature, workaround too complex & unacceptable | Can push commits, but only via the command line
~S3   | Major Severity    | Broken feature, workaround acceptable                | Can create merge requests only from the Merge Requests page, not through the issue
~S4   | Low Severity      | Functionality inconvenience or cosmetic issue        | Label colors are incorrect / not being displayed

Type Labels

Type labels are very important. They define what kind of issue this is. Every issue should have one or more.

Label | Description
~Change | Represents a change to infrastructure; for details, see: Change
~Incident | Represents an incident in infrastructure; for details, see: Incident
~UserAccessRequest | Standard access requests
~Oncall | Prioritized to be worked on by the current on-call team members
~Hotspot | Label for problems that can get higher priority
~Database | Label for problems related to the database
~Security | Label for problems related to security

Services

The services list can be found here: https://gitlab.com/gitlab-com/runbooks/blob/master/services/service-mappings.yml

Service Criticality Labels

Service criticality labels define how critical a service is, and how critical a change to it would be, considering how a failure would affect the user experience (e.g., ~C1 for the PostgreSQL or Redis primary). Since most services can reach different levels of criticality, we consider the highest applicable level here. We also have a template of actions for a change depending on the criticality:

Label Description
~C1 Vital service and is a single point of failure, if down the application will be down
~C2 Important service, if down some functionalities will not be available from the application
~C3 Service in case of some instance is down or the service, we can have performance degradation
~C4 Services that could be in maintenance mode or would not affect the performance of the application

Service Redundancy Level

The service redundancy level helps us identify which services have failover availability or another redundancy mechanism.

Label | Description                                                 | Example
~R1   | The loss of a single instance will affect all users         | A PostgreSQL or Redis instance
~R2   | The loss of a single instance will affect a subset of users | A Gitaly instance
~R3   | The loss of a single instance would not affect any user     | A Grafana instance
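Putting the label families together: a hypothetical incident on the primary database, for example, might be classified in a single issue comment with quick actions like these:

```
/label ~Incident ~Database ~S1 ~P1 ~C1 ~R1
/milestone %"Current Milestone"
```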

Goals and Meta Goal

~goals are issues that are in a Milestone and that we, as a team, agreed we will do everything in our power to deliver. Goal issues should fit in one Milestone, that is, they should be deliverable within a single timebox; if they do not fit in one Milestone, we are probably talking about a ~meta ~goal.

Other Labels

We use some other labels to indicate specific conditions and then measure the impact of those conditions on production or on the production engineering team. This is especially important for understanding where the production engineering team invests its time, so we can reduce toil and reduce the chance of a failure caused by accessing production more often than necessary.

Labels that are particularly important for gathering data are:

Always Help Others

We should never stop helping and unblocking team members. To this end, data should always be gathered to assist in highlighting areas for automation and the creation of self-service processes. Creating an issue from the request with the proper labels is the first step. The default should be that the person requesting help makes the issue; but we can help with that step too if needed.

If an issue is urgent for whatever reason, we should label it following the instructions above and add it to the ongoing Milestone.

Issue or Outage Hand-off

Ongoing outages, as well as issues that have the ~(perceived) data loss label and are therefore actively being worked on, need a hand-off to happen as team members cycle in and out of their timezones and availability. The on-call log can be used to assist with this (see the link to the on-call log at the top).

On-Call Support

To ensure 24x7 coverage of emergency issues, we currently have split on-call rotations between EMEA and AMER regions; team members in EMEA regions are on-call from 0400-1600 UTC, and team members in AMER regions are on-call from 1600-0400 UTC. We plan to extend this to include team members from the APAC region in the future, as well. This forms the basis of a follow-the-sun support model, and has the benefit for our team members of reducing (or eliminating) the stress of responding to emergent issues outside of their normal work hours, as well as increasing communication and collaboration within our global team.

For further details about managing schedules, workflows, and documentation, see the on-call runbook.

SLA for paging On-Call team members

When an on-call person is paged, either via the /pd command in Slack or by the automated monitoring systems, the SRE has a 15-minute SLA to acknowledge or escalate the alert.

This is also noted in the on-call section of the handbook.

Because GitLab is an asynchronous-workflow company, @mentions of on-call individuals in Slack will be treated like normal messages, and no response SLA is attached to them. This is also because phone notifications via Slack have no escalation policies. PagerDuty has policies that team members and rotations can configure to make sure an alert is escalated when nobody has acknowledged it.

If you need to page a team member from Slack, you can use /pd "your message to the on-call here" to send an alert to the currently on-call team members.

Production Events Logging

There are 2 kinds of production events that we track:

Incident Subtype - Abuse

For some incidents, we may figure out that the usage patterns that led to the issues were abuse. There is a process for how we define and handle abuse.

  1. The definition of abuse can be found in the security abuse operations section of the handbook.
  2. In the event of an incident affecting GitLab.com availability, the SRE team may take immediate action to keep the system available. However, the team must also immediately involve our security abuse team. A new security on-call rotation has been established in PagerDuty; there is a Security Responder rotation that can be alerted, along with a Security Manager rotation.

Backups

Summary of Backup Strategy

For details, see the runbooks, in particular regarding GCP snapshots and database backups using WAL-E (encrypted).

R.A.D. - Restore Appreciation Days

Every second day of the month, we have a R.A.D. "party". Two production SREs use this day to test our backup processes by fully restoring not-yet-automated backups to test instances and to verify data integrity. The issues for every individual R.A.D can be found in the infrastructure tracker.

The ongoing effort to automate all of the backups is tracked in the infrastructure META issue.