Incident Management

Incidents

Incidents are anomalous conditions that result in service degradation or outage and require intervention (human or automated) to restore service to full operational status as quickly as possible. The primary goal of incident management is to organize chaos into swift incident resolution. To that end, incident management requires well-defined roles for everyone involved, control points to manage the flow of both the resolution path and information, active and effective communication to notify the appropriate stakeholders about the status of an incident and its resolution, and procedures for post-mortem, root-cause, and introspective analysis.

Incident Severities

Incident severities encapsulate the impact of an incident and scope the resources allocated to handle it. Detailed definitions must be provided for each severity, and these definitions must be reevaluated as new circumstances become known. Incident management uses our standardized severity definitions, which can be found under CONTRIBUTION.MD.

Alert Severities

Roles

Role Definition and Examples
IMOC Incident Manager
  The Incident Manager is the tactical leader of the incident response team and must not be the person doing the technical work to resolve the incident. The IMOC assembles the Incident Team, evaluates data (technical and otherwise) coming from team members, evaluates the technical direction of incident resolution, coordinates troubleshooting efforts, and is responsible for documentation and debriefs after the incident.
CMOC Communications Manager
  The Communications Manager is the communications leader of the incident response team. The focus of the Incident Team is on resolving the incident as quickly as possible. However, there is a critical need to disseminate information to appropriate stakeholders, including employees, eStaff, and end users. For Sev1 (and possibly Sev2) incidents, this is a dedicated role. Otherwise, the IMOC can handle communications.
OCIT On-Call + Incident Team
  The Incident Team is primarily composed of the on-call person. However, the Incident Manager can call in additional resources as necessary.

These definitions imply several on-call rotations for the different roles. The IMOC should be a technical person with a good understanding of GitLab.com's architecture. The CMOC is not required to be technical. The IMOC and the CMOC work in tandem to manage the incident and keep communication timely.
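
The role rules above can be sketched in code. The following is a minimal illustration only, not part of any existing tooling: the `IncidentTeam` class, the `assemble_team` function, the numeric severity scale (1 as the most severe), and the people's names are all hypothetical. It assumes that Sev1 always requires a dedicated CMOC and treats a dedicated CMOC as optional for every other severity, including Sev2.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class IncidentTeam:
    """Hypothetical record of who fills each role for one incident."""
    severity: int                      # assumed scale: 1 is the most severe
    imoc: str                          # Incident Manager
    on_call: str                       # on-call person doing the technical work
    cmoc: Optional[str] = None         # Communications Manager, if dedicated
    extra_responders: List[str] = field(default_factory=list)

    def communications_owner(self) -> str:
        """Return the dedicated CMOC, or the IMOC when there is none."""
        return self.cmoc if self.cmoc is not None else self.imoc


def assemble_team(severity: int, imoc: str, on_call: str,
                  cmoc: Optional[str] = None,
                  extra_responders: Optional[List[str]] = None) -> IncidentTeam:
    """Build a team and enforce the role constraints described above."""
    responders = list(extra_responders or [])
    # The IMOC leads the response and must not do the technical work.
    if imoc == on_call or imoc in responders:
        raise ValueError("the IMOC must not be part of the technical response")
    # Assumption: Sev1 needs a dedicated CMOC; other severities may
    # fall back to the IMOC for communications.
    if severity == 1 and (cmoc is None or cmoc == imoc):
        raise ValueError("Sev1 incidents need a dedicated CMOC")
    return IncidentTeam(severity, imoc, on_call, cmoc, responders)


if __name__ == "__main__":
    team = assemble_team(severity=2, imoc="alice", on_call="bob")
    print(team.communications_owner())  # prints "alice"
```

Encoding the constraints as checks at team-assembly time mirrors the intent of the definitions: the conflict between leading and doing technical work is caught before the incident response starts rather than discovered mid-incident.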

Communication Channels

Information is a key asset during any incident. Properly managing the flow of information to its intended destination is critical in keeping interested stakeholders apprised of developments in a timely fashion. Awareness that an incident is in progress is critical in helping stakeholders plan around its effects.

This flow is determined by:

Furthermore, avoiding information overload is necessary to keep every stakeholder focused.

To that end, we will have: