Incident Management

Incidents

Incidents are anomalous conditions that result in service degradation or outages and require human or automated intervention to restore service to full operational status in the shortest amount of time possible.

The primary goal of incident management is to organize chaos into swift incident resolution. To that end, incident management provides:

Severities

Incident severities encapsulate the impact of an incident and scope the resources allocated to handle it. Detailed definitions are provided for each severity, and these definitions are reevaluated as new circumstances become known. Incident management uses our standardized severity definitions, which can be found under our issue workflow documentation.

Alert Severities

Roles

Role Definition and Examples
EOC+IT Engineer On-Call + Incident Team
  The Engineer On-Call is the initial owner of an incident and is, in essence, the Incident Team. When the EOC escalates an incident to an IMOC, the IMOC takes ownership of the incident and can engage additional resources as necessary to augment the Incident Team.
IMOC Incident Manager
  The Incident Manager is the tactical leader of the incident response team and must not be the person doing the technical work of resolving the incident. The IMOC assembles the Incident Team, evaluates data (technical and otherwise) coming from team members, evaluates the technical direction of incident resolution, coordinates troubleshooting efforts, and is responsible for documentation and debriefs after the incident.
CMOC Communications Manager
  The Communications Manager is the communications leader of the incident response team. The focus of the Incident Team is on resolving the incident as quickly as possible. However, there is a critical need to disseminate information to appropriate stakeholders, including employees, eStaff, and end users. For S1 (and possibly S2) incidents, this is a dedicated role. Otherwise, the IMOC can handle communications.

These definitions imply several on-call rotations for the different roles.

The IMOC should be a technical person with a good understanding of GitLab.com's architecture. The CMOC is not required to be technical. The IMOC and the CMOC work in tandem to manage the incident resolution and timely communication.
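The rule for who handles communications can be sketched as a small function. This is a minimal illustration, not part of any GitLab tooling: the `Severity` enum, the `communications_owner` function, and the `dedicated_cmoc_for_s2` flag are hypothetical names, and the S3/S4 members are assumed for completeness since only S1 and S2 are discussed here.

```python
from enum import Enum


class Severity(Enum):
    """Incident severities. S3 and S4 are assumed for this sketch."""
    S1 = 1
    S2 = 2
    S3 = 3
    S4 = 4


def communications_owner(severity: Severity, dedicated_cmoc_for_s2: bool = False) -> str:
    """Return the role that handles stakeholder communications.

    S1 always gets a dedicated CMOC. S2 "possibly" does, so the caller
    decides via the hypothetical dedicated_cmoc_for_s2 flag. In every
    other case the IMOC can handle communications.
    """
    if severity is Severity.S1:
        return "CMOC"
    if severity is Severity.S2 and dedicated_cmoc_for_s2:
        return "CMOC"
    return "IMOC"
```

For example, `communications_owner(Severity.S2)` falls back to the IMOC unless the caller explicitly asks for a dedicated CMOC.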

Ownership

The EOC is the initial and long-term owner of an incident and, as such, is responsible for declaring the incident and for its ultimate resolution. The EOC can temporarily cede ownership of an incident to an IMOC, but the EOC remains responsible for producing the corresponding root-cause analysis (RCA).
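One way to picture this ownership model is as a record where the current owner can change while the RCA owner does not. The sketch below is illustrative only: the `Incident` class and its method names are hypothetical and do not correspond to any existing tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Incident:
    """Hypothetical model of incident ownership."""
    eoc: str                      # Engineer On-Call who declared the incident
    imoc: Optional[str] = None    # set only while ownership is ceded

    @property
    def current_owner(self) -> str:
        # The IMOC owns the incident while it is escalated; otherwise the EOC does.
        return self.imoc if self.imoc is not None else self.eoc

    @property
    def rca_owner(self) -> str:
        # The EOC produces the RCA regardless of who currently owns the incident.
        return self.eoc

    def escalate_to(self, imoc: str) -> None:
        """Temporarily cede ownership to an IMOC."""
        self.imoc = imoc

    def return_to_eoc(self) -> None:
        """Hand ownership back to the EOC."""
        self.imoc = None
```

The point of the design is that `rca_owner` never depends on `imoc`, which mirrors the rule that ceding ownership does not transfer the RCA.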

On-Call Runbooks

On-Call runbooks are available for engineers on-call.

S1 and S2 Incidents

S1 and S2 incidents are critical, and the EOC can and should engage the IMOC.

Issue infra/5543 tracks automation for incident management.

Incident Steps

The CMOC is the primary person responsible for creating the issue and documents below.

All of these steps can be performed from Slack using /start-incident. The internal steps are:

  1. Create an issue with the label Incident on the production queue, using the Incident template. If it is not possible to generate the issue, start with the tracking document and create the incident issue later.
  2. Ensure the initial severity label is accurate.

Optional - not required for post deployment patches and as needed for the incident:

  1. For an S1/S2 outage, create an issue with the label ~IncidentReview on the infrastructure queue, using the RCA template. If it is not possible to generate the issue, start with the tracking document and create the incident issue later.
  2. Create a Google Doc from the template and associate it with both issues (production and infrastructure). Populate the doc and use it as the source of truth during the incident. This doc is mainly for real-time, multi-user access when we cannot use the production issue for communication.
  3. Contact the CMOC for support during the incident.
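The issue-creation steps above can be sketched as a function that decides which issues to open. This is a rough illustration under stated assumptions, not the behavior of /start-incident: `plan_incident_issues` and the dictionary keys are hypothetical, the severity label is assumed to be the bare severity name (for example `S1`), and the review issue title prefix is invented for the example. No network calls are made.

```python
from typing import Dict, List


def plan_incident_issues(title: str, severity: str, is_outage: bool) -> List[Dict]:
    """Return descriptions of the issues to create for an incident.

    The production issue is always planned. The review issue on the
    infrastructure queue is planned only for an S1/S2 outage.
    """
    issues = [
        {
            "queue": "production",
            "template": "Incident",
            "title": title,
            # Assumption: the severity label is the bare severity name.
            "labels": ["Incident", severity],
        }
    ]
    if is_outage and severity in ("S1", "S2"):
        issues.append(
            {
                "queue": "infrastructure",
                "template": "RCA",
                "title": f"RCA: {title}",  # hypothetical title convention
                "labels": ["IncidentReview"],
            }
        )
    return issues
```

Calling `plan_incident_issues("Database failover", "S1", is_outage=True)` yields two planned issues, while an S3 incident yields only the production issue.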

Follow up steps - Root Cause Analysis (RCA):

  1. The owner of the issues (the on-call engineer or IMOC) fills in the production and infrastructure issues once the incident is mitigated, using the information from the tracking document.
  2. When necessary, new tickets will be created with the label "Corrective Action" and linked with the RCA Issue on the infrastructure track.
  3. Closing the RCA ~IncidentReview issue: When discussion on the RCA issue is complete and all ~corrective action issues have been linked, the issue can be closed. The infrastructure team will have a cadence to review and prioritize corrective actions.
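The closing rule for the RCA issue can be expressed as a simple predicate. The sketch below assumes hypothetical `RcaIssue` and `CorrectiveAction` records; it only encodes the two conditions stated above (discussion complete, all corrective actions linked).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CorrectiveAction:
    """Hypothetical record of a corrective action ticket."""
    title: str
    linked_to_rca: bool = False


@dataclass
class RcaIssue:
    """Hypothetical record of an RCA (~IncidentReview) issue."""
    discussion_complete: bool = False
    corrective_actions: List[CorrectiveAction] = field(default_factory=list)

    def can_close(self) -> bool:
        # Closable only when discussion is done and every corrective
        # action has been linked to this RCA issue.
        return self.discussion_complete and all(
            action.linked_to_rca for action in self.corrective_actions
        )
```

Note that an RCA with no corrective actions is closable as soon as discussion is complete, since corrective actions are created only when necessary.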

Communication

Information is a key asset during any incident. Properly managing the flow of information to its intended destination is critical in keeping interested stakeholders apprised of developments in a timely fashion. The awareness that an incident is in progress is critical in helping stakeholders plan for said changes.

This flow is determined by:

Furthermore, avoiding information overload is necessary to keep every stakeholder’s focus.

To that end, we will have: 

CMOC and IMOC checklist:

The information below is meant to be a quick reference for deciding when to start the incident process and for what to do.

Are we having an incident?

  1. What alerts are going off? Prometheus gprd
  2. How do these dashboards look?
    1. Which services are showing availability issues?
    2. What components are outside of normal operations via

If any of the above are showing major error rates or deviations, it is better to start an incident.
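The "any of the above" rule can be sketched as a check over a set of signals. What counts as a "major" error rate is a judgment call, so the `MAJOR_ERROR_RATE` threshold below is purely a hypothetical placeholder, as are the `Signal` class and the `should_start_incident` function.

```python
from dataclasses import dataclass
from typing import Iterable

# Hypothetical threshold; the real cutoff for "major" is a human judgment.
MAJOR_ERROR_RATE = 0.05


@dataclass
class Signal:
    """One alert, service, or component reading."""
    name: str
    error_rate: float  # fraction of failing requests, from 0.0 to 1.0


def should_start_incident(signals: Iterable[Signal],
                          threshold: float = MAJOR_ERROR_RATE) -> bool:
    """Return True if any single signal shows a major error rate."""
    return any(signal.error_rate >= threshold for signal in signals)
```

A single signal over the threshold is enough to return `True`, which matches the guidance that it is better to start an incident when in doubt.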

CMOC checklist for starting an incident: