
Incident Management


Incidents

Incidents are anomalous conditions that result in—or may lead to—service degradation or outages. These events require human intervention to avert disruptions or restore service to operational status. Incidents are always given immediate attention.

Is GitLab.com Experiencing an Incident?

If you're observing issues on GitLab.com or working with users who are reporting issues, please follow the instructions found on the On-Call page and alert the Engineer On Call (EOC).

If any of the dashboards below are showing major error rates or deviations, it's best to alert the Engineer On Call.

Critical Dashboards

  1. What alerts are going off? AlertManager (a query sketch follows this list)
  2. How do these dashboards look?
    1. What services are showing availability issues?
    2. What services are saturated?
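
If it helps to check programmatically which alerts are firing, the sketch below queries the Alertmanager v2 API. It is a minimal sketch only: the base URL is a placeholder (the real address is linked from the runbooks), and the severity and service label names are assumptions about the alert labels in use.

```python
# Minimal sketch: list currently firing alerts from the Alertmanager v2 API.
# ALERTMANAGER_URL is a placeholder, not the real internal address.
import requests

ALERTMANAGER_URL = "https://alertmanager.example.gitlab.net"

def firing_alerts():
    """Return active, unsilenced, uninhibited alerts."""
    resp = requests.get(
        f"{ALERTMANAGER_URL}/api/v2/alerts",
        params={"active": "true", "silenced": "false", "inhibited": "false"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    for alert in firing_alerts():
        labels = alert["labels"]
        # "severity" and "service" are assumed label names on these alerts.
        print(labels.get("severity", "?"), labels.get("alertname"), labels.get("service", ""))
```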

Incident Management

The goal of incident management is to organize chaos into swift incident resolution. To that end, incident management provides clear ownership, well-defined roles and responsibilities, and managed communication with stakeholders, as described in the sections below.

Ownership

There is only ever one owner of an incident—and only the owner of the incident can declare an incident resolved. At any time the incident owner can engage the next role in the hierarchy for support. Except when GitLab.com is not functioning correctly, the incident issue should be assigned to the current owner.

Roles and Responsibilities

It's important to clearly delineate responsibilities during an incident. Quick resolution requires focus and a clear hierarchy for delegation of tasks. Preventing overlaps and ensuring a proper order of operations is vital to mitigation. The responsibilities outlined in the roles below are cascading, and ownership of the incident passes from one role to the next as those roles are engaged. Until the next role in the hierarchy engages, the previous role assumes all of the subsequent roles' responsibilities and retains ownership of the incident.

Role Description
EOC Engineer On Call
  The Production Engineer On Call is generally an SRE and can declare an incident. If another party has declared an incident, once the EOC is engaged they own the incident. The EOC gathers information, performs an initial assessment, and determines the incident severity level.
IMOC Incident Manager On Call
  The Incident Manager is generally a Reliability Engineering manager and is engaged when incident resolution requires coordination from multiple parties. The IMOC is the tactical leader of the incident response team—not a person performing technical work. The IMOC assembles the incident team by engaging individuals with the skills and information required to resolve the incident.
CMOC Communications Manager On Call
  The Communications Manager is generally a Reliability Engineering manager. The CMOC disseminates information internally to stakeholders and externally to customers across multiple media (e.g. GitLab issues, Twitter, status.gitlab.com, etc.).

These definitions imply several on-call rotations for the different roles.

Engineer on Call (EOC) Responsibilities

Incident Manager on Call (IMOC) Responsibilities

Communications Manager on Call (CMOC) Responsibilities

Runbooks

Runbooks are available for engineers on call. The project README contains links to checklists for each of the above roles.

In the event of a GitLab.com outage, a mirror of the runbooks repository is available at https://ops.gitlab.net/gitlab-com/runbooks.

Declaring an Incident

The following steps can be automated in Slack by typing /start-incident. If the command fails, manually do the following:

  1. Create an issue on the production queue with the label Incident, using the Incident template (a minimal API sketch follows this list). If it is not possible to create the issue, start with the tracking document and create the incident issue later.
  2. Ensure the initial severity label is accurate.
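
If the Slack command is unavailable, step 1 comes down to opening a labeled issue on the production queue. The sketch below does this through the GitLab Issues API; the project path, token handling, and example title are placeholders rather than the exact tooling in use, and the incident template body still has to be pasted into the description.

```python
# Minimal sketch: open an incident issue with the Incident label via the
# GitLab Issues API. PRODUCTION_PROJECT is an assumed project path.
import os
import urllib.parse
import requests

GITLAB_API = "https://gitlab.com/api/v4"
PRODUCTION_PROJECT = "gitlab-com/gl-infra/production"  # assumption

def create_incident_issue(title, description=""):
    project = urllib.parse.quote_plus(PRODUCTION_PROJECT)
    resp = requests.post(
        f"{GITLAB_API}/projects/{project}/issues",
        headers={"PRIVATE-TOKEN": os.environ["GITLAB_TOKEN"]},
        json={
            "title": title,
            "description": description,  # paste the incident template here
            "labels": "Incident",
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["web_url"]

if __name__ == "__main__":
    print(create_incident_issue("YYYY-MM-DD: elevated error rates on web fleet"))
```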

Optional steps, not required for post-deployment patches; perform them as needed for the incident:

  1. For an S1/S2 outage, create an issue on the infrastructure queue with the label ~IncidentReview, using the RCA template. If it is not possible to create the issue, start with the tracking document and create it later.
  2. The EOC should assign the incident issue to themselves.
    1. Incident ownership is documented through the assignee field on the issue.
    2. When handing over ownership to a new owner, or at the end of a shift or rotation, the assignee on the issue should be updated (see the sketch after this list).
  3. Create a Google doc from the template and associate it with both the production and infrastructure issues. Populate it and use it as the source of truth during the incident. The doc is mainly for real-time collaborative access when the production issue cannot be used for communication.
  4. Contact the CMOC for support during the incident.
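
Since ownership is documented through the assignee field (step 2 above), a hand-over at the end of a shift or rotation is just an assignee update on the incident issue. A minimal sketch, reusing the same placeholder project path and token as the issue-creation example:

```python
# Minimal sketch: reassign the incident issue to the engineer taking ownership.
import os
import urllib.parse
import requests

GITLAB_API = "https://gitlab.com/api/v4"
PRODUCTION_PROJECT = "gitlab-com/gl-infra/production"  # assumption

def hand_over(issue_iid, new_owner_user_id):
    project = urllib.parse.quote_plus(PRODUCTION_PROJECT)
    resp = requests.put(
        f"{GITLAB_API}/projects/{project}/issues/{issue_iid}",
        headers={"PRIVATE-TOKEN": os.environ["GITLAB_TOKEN"]},
        json={"assignee_ids": [new_owner_user_id]},
        timeout=10,
    )
    resp.raise_for_status()
    return [a["username"] for a in resp.json()["assignees"]]
```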

Issue infra/5543 tracks automation for incident management.

Definition of Outage vs Degraded vs Disruption

This is a first revision of the definitions of Service Disruption (Outage), Partial Service Disruption, and Degraded Performance, per the terms on Status.io. Data is based on the graphs from the Key Service Metrics Dashboard.

Outage and Degraded Performance incidents are defined as follows:

  1. Degraded: any sustained 5-minute period where a service is below its documented Apdex SLO or above its documented error ratio SLO.
  2. Outage (Status = Disruption): a sustained 5-minute error rate above the Outage line on the error ratio graph.

SLOs are documented in the runbooks/rules.
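
As an illustration of these definitions, the sketch below classifies a service from minute-by-minute Apdex and error-ratio samples. The threshold values are placeholders; the real SLOs and the Outage line live in the runbooks/rules, and the actual evaluation is done by the monitoring stack rather than a script like this.

```python
# Illustrative sketch of the Degraded/Outage definitions above.
# Thresholds are placeholders; real SLOs are documented in runbooks/rules.
from dataclasses import dataclass

@dataclass
class ServiceSLO:
    apdex_slo: float           # minimum acceptable Apdex, e.g. 0.995
    error_ratio_slo: float     # maximum acceptable error ratio, e.g. 0.005
    outage_error_ratio: float  # the "Outage line" on the error ratio graph

def sustained(samples, violates):
    """True if every 1-minute sample in the last 5-minute window violates the limit."""
    return len(samples) >= 5 and all(violates(s) for s in samples[-5:])

def classify(slo, apdex_samples, error_ratio_samples):
    if sustained(error_ratio_samples, lambda e: e > slo.outage_error_ratio):
        return "outage"          # Status = Disruption
    if (sustained(apdex_samples, lambda a: a < slo.apdex_slo)
            or sustained(error_ratio_samples, lambda e: e > slo.error_ratio_slo)):
        return "degraded"
    return "ok"

# Example with placeholder numbers for the web service.
web = ServiceSLO(apdex_slo=0.995, error_ratio_slo=0.005, outage_error_ratio=0.02)
print(classify(web, apdex_samples=[0.991] * 5, error_ratio_samples=[0.007] * 5))  # degraded
```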

To check if we are Degraded or Disrupted for GitLab.com, we look at these graphs:

  1. Web Service
  2. API Service
  3. Git service (public-facing git interactions)
  4. GitLab Pages service
  5. Registry service
  6. Sidekiq

A Partial Service Disruption is when only part of the GitLab.com services or infrastructure is experiencing an incident. Examples of partial service disruptions are instances where GitLab.com is operating normally except there are:

  1. delayed CI/CD pending jobs
  2. delayed repository mirroring
  3. high severity bugs affecting a particular feature like Merge Requests
  4. abuse or degradation on one Gitaly node affecting a subset of git repositories; this would be visible on the Gitaly service metrics

Security Incidents

If an incident may be security related, engage the Security Operations on-call following the Security Incident Response Guide.

Communication

Information is an asset to everyone impacted by an incident. Properly managing the flow of information is critical to minimizing surprise and setting expectations. We aim to keep interested stakeholders apprised of developments in a timely fashion so they can plan appropriately.

This flow is determined by:

Furthermore, avoiding information overload is necessary to keep every stakeholder’s focus.

To that end, we will have: 

Status

We manage incident communication using status.io, which updates status.gitlab.com. Incidents in status.io have state and status and are updated by the incident owner.

Definitions and rules for transitioning state and status are as follows (a small sketch of the state rules follows the table).

State Definition
Investigating The incident has just been discovered and there is not yet a clear understanding of the impact or cause. If an incident remains in this state for longer than 30 minutes after the EOC has engaged, the incident should be escalated to the IMOC.
Identified The cause of the incident is believed to have been identified and a step to mitigate has been planned and agreed upon.
Monitoring The step has been executed and metrics are being watched to ensure that we're operating at a baseline.
Resolved The incident is closed and status is again Operational.
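
These rules can be read as a small forward-moving state machine with one escalation timer. The sketch below is only illustrative; it assumes states advance strictly in order, which the table implies but does not mandate.

```python
# Illustrative sketch of the state rules above, including the 30-minute
# escalation from Investigating to the IMOC.
from datetime import datetime, timedelta

STATES = ["Investigating", "Identified", "Monitoring", "Resolved"]

class Incident:
    def __init__(self, eoc_engaged_at):
        self.state = "Investigating"
        self.eoc_engaged_at = eoc_engaged_at

    def advance(self, new_state):
        # Only the incident owner moves the incident forward, one state at a time.
        if STATES.index(new_state) != STATES.index(self.state) + 1:
            raise ValueError(f"cannot move from {self.state} to {new_state}")
        self.state = new_state

    def should_escalate_to_imoc(self, now):
        return (self.state == "Investigating"
                and now - self.eoc_engaged_at > timedelta(minutes=30))

incident = Incident(eoc_engaged_at=datetime(2021, 1, 1, 12, 0))
print(incident.should_escalate_to_imoc(datetime(2021, 1, 1, 12, 45)))  # True
incident.advance("Identified")
```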

Status can be set independently of state. The only time these must align is when an incident is resolved.

Status Definition
Operational The default status before an incident is opened and after an incident has been resolved. All systems are operating normally.
Degraded Performance Users are impacted intermittently, but the impact is not observed in metrics, nor reported, to be widespread or systemic.
Partial Service Disruption Users are impacted at a rate that violates our SLO. The IMOC must be engaged, and monitoring to resolution is required if the disruption lasts longer than 30 minutes.
Service Disruption This is an outage. The IMOC must be engaged.
Security Issue A security vulnerability has been declared public and the security team has asked to publish it.

Post-Incident

Root Cause Analysis (RCA)

  1. The owner of the RCA is the EOC. Once the incident is mitigated, the EOC fills in the production and infrastructure issues with the information from the tracking document.
  2. When necessary, new issues will be created with the label "Corrective Action" and linked to the RCA issue on the infrastructure tracker.
  3. Closing the RCA ~IncidentReview issue: When discussion on the RCA issue is complete and all ~corrective action issues have been linked, the issue can be closed. The infrastructure team will have a cadence to review and prioritize corrective actions.

Post-Incident Review

Every incident is assigned a severity level.

Severities

Incident Severity

Incident severities encapsulate the impact of an incident and scope the resources allocated to handle it. Detailed definitions are provided for each severity, and these definitions are reevaluated as new circumstances become known. Incident management uses our standardized severity definitions, which can be found under our issue workflow documentation.

Alert Severities