Incident Management

Incidents

Incidents are anomalous conditions that result in service degradation or outages and require human or automated intervention to restore service to full operational status in the shortest amount of time possible.

The primary goal of incident management is to organize chaos into swift incident resolution. To that end, incident management provides:

Severities

Incident severities encapsulate the impact of an incident and scope the resources allocated to handle it. Detailed definitions are provided for each severity, and these definitions are reevaluated as new circumstances become known. Incident management uses our standardized severity definitions, which can be found under CONTRIBUTION.MD.

Alert Severities

Roles

Role Definition and Examples
EOC+IT Engineer On-Call + Incident Team
  The Engineer On-Call is the initial owner of an incident and is, in essence, the Incident Team. When the EOC escalates an incident to an IMOC, the IMOC takes ownership of the incident and can engage additional resources as necessary to augment the Incident Team.
IMOC Incident Manager
  The Incident Manager is the tactical leader of the incident response team and must not be the person doing the technical work to resolve the incident. The IMOC assembles the Incident Team, evaluates data (technical and otherwise) coming from team members, evaluates the technical direction of incident resolution, coordinates troubleshooting efforts, and is responsible for documentation and debriefs after the incident.
CMOC Communications Manager
  The Communications Manager is the communications leader of the incident response team. The focus of the Incident Team is on resolving the incident as quickly as possible. However, there is a critical need to disseminate information to appropriate stakeholders, including employees, eStaff, and end users. For S1 (and possibly S2) incidents, this is a dedicated role. Otherwise, the IMOC can handle communications.

These definitions imply several on-call rotations for the different roles.

The IMOC should be a technical person with a good understanding of GitLab.com's architecture. The CMOC is not required to be technical. The IMOC and the CMOC work in tandem to manage the incident resolution and timely communication.
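
The communications rule in the role definitions above can be sketched in a few lines of Python. This is only an illustration, not part of any existing tooling: the `Severity` enum (including the assumed S3 and S4 levels), the `Staffing` record, the `staff_communications` function, and the `dedicate_for_s2` flag are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    # S1 and S2 are named in the text; S3 and S4 are assumed for illustration.
    S1 = 1
    S2 = 2
    S3 = 3
    S4 = 4


@dataclass
class Staffing:
    imoc: str             # tactical leader of the response
    communications: str   # whoever handles stakeholder communication
    dedicated_cmoc: bool  # True when the CMOC is a separate, dedicated role


def staff_communications(severity, imoc, cmoc_on_call, dedicate_for_s2=False):
    """Decide who handles communications for an incident.

    S1 always gets a dedicated CMOC. S2 gets one only when the caller
    asks for it ("possibly S2" in the text). Otherwise the IMOC also
    handles communications.
    """
    needs_dedicated = severity is Severity.S1 or (
        severity is Severity.S2 and dedicate_for_s2
    )
    if needs_dedicated:
        return Staffing(imoc=imoc, communications=cmoc_on_call, dedicated_cmoc=True)
    return Staffing(imoc=imoc, communications=imoc, dedicated_cmoc=False)
```

For example, `staff_communications(Severity.S3, "alice", "bob")` leaves communications with the IMOC (`"alice"`), while the same call with `Severity.S1` assigns them to the dedicated CMOC (`"bob"`).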

Ownership

The EOC is the initial and long-term owner of an incident and, as such, is responsible for declaring the incident and for its ultimate resolution. The EOC can temporarily cede ownership of an incident to an IMOC, but the EOC remains responsible for producing the corresponding root-cause analysis (RCA).
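
As a minimal sketch of this ownership model, the following Python class tracks who currently owns an incident while keeping RCA responsibility with the EOC. The `Incident` class and its method names are hypothetical and exist only to illustrate the rule. They do not describe any real tooling.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    eoc: str                                  # Engineer On-Call who declared the incident
    current_owner: str = field(init=False)    # changes when ownership is ceded
    history: list = field(default_factory=list)

    def __post_init__(self):
        # The EOC is the initial owner of every incident.
        self.current_owner = self.eoc
        self.history.append(("declared", self.eoc))

    def cede_to_imoc(self, imoc):
        """Temporarily hand ownership to an Incident Manager."""
        self.current_owner = imoc
        self.history.append(("ceded", imoc))

    def return_to_eoc(self):
        """Hand ownership back to the EOC, the long-term owner."""
        self.current_owner = self.eoc
        self.history.append(("returned", self.eoc))

    @property
    def rca_owner(self):
        # RCA responsibility never moves, even while an IMOC owns the incident.
        return self.eoc
```

The point of the design is that `current_owner` and `rca_owner` are separate: ceding ownership changes the first but never the second.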

On-Call Runbooks

On-Call runbooks are available for engineers on-call.

S1 and S2 Incidents

S1 and S2 incidents are critical, and the EOC can and should engage the IMOC.

Issue infra/5543 tracks automation for incident management.

All of these steps can be done from Slack using /create issue –initial severity. The internal steps are:

Following steps:

Communication

Information is a key asset during any incident. Properly managing the flow of information to its intended destination is critical in keeping interested stakeholders apprised of developments in a timely fashion. Awareness that an incident is in progress is critical in helping stakeholders plan around it.

This flow is determined by:

Furthermore, avoiding information overload is necessary to keep every stakeholder’s focus.

To that end, we will have: