Incidents are anomalous conditions that result in service degradation or outage and require intervention (human or automated) to restore service to full operational status as quickly as possible. The primary goal of incident management is to organize chaos into swift incident resolution. To that end, incident management requires well-defined roles for everyone involved, control points to manage the flow of both the resolution path and information, active and effective communication to notify the appropriate stakeholders about the status of an incident and its resolution, and procedures for post-mortems, root-cause analysis, and introspective analysis.
Incident severities encapsulate the impact of an incident and scope the resources allocated to handle it. Detailed definitions must be provided for each severity, and these definitions must be reevaluated as new circumstances become known. Incident management uses our standardized severity definitions, which can be found under
|Role|Definition and Examples|
|---|---|
|Incident Manager (IMOC)|The Incident Manager is the tactical leader of the incident response team and must not be the person doing the technical work to resolve the incident. The IMOC assembles the Incident Team, evaluates data (technical and otherwise) coming from team members, evaluates the technical direction of incident resolution, coordinates troubleshooting efforts, and is responsible for documentation and debriefs after the incident.|
|Communications Manager (CMOC)|The Communications Manager is the communications leader of the incident response team. The focus of the Incident Team is on resolving the incident as quickly as possible. However, there is a critical need to disseminate information to appropriate stakeholders, including employees, eStaff, and end users. For Sev1 (and possibly Sev2) incidents, this is a dedicated role. Otherwise, the IMOC can handle communications.|
|On-Call + Incident Team|The Incident Team is primarily composed of the on-call person. However, the Incident Manager can call in additional resources as necessary.|
These definitions imply several on-call rotations for the different roles. The IMOC should be a technical person with a good understanding of GitLab.com's architecture. The CMOC is not required to be technical. The IMOC and the CMOC work in tandem to manage the incident and keep communication timely.
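The role rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Severity` levels beyond Sev1 and Sev2, the `assign_roles` function, and the `dedicated_cmoc_for_sev2` flag are hypothetical names introduced here, not part of any existing tooling.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List


class Severity(IntEnum):
    # Only Sev1 and Sev2 are named in the text; SEV3 is an assumed extra level.
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


@dataclass
class IncidentRoles:
    imoc: str                 # Incident Manager on call
    cmoc: str                 # whoever handles communications
    incident_team: List[str]  # starts with the on-call person


def assign_roles(severity: Severity,
                 imoc_on_call: str,
                 cmoc_on_call: str,
                 engineer_on_call: str,
                 dedicated_cmoc_for_sev2: bool = False) -> IncidentRoles:
    """Hypothetical helper that applies the role rules from the table."""
    # The IMOC must not be the person doing the technical work.
    if imoc_on_call == engineer_on_call:
        raise ValueError("IMOC must not be the on-call engineer")

    # Sev1 always gets a dedicated CMOC; Sev2 only if the flag says so.
    needs_dedicated_cmoc = (
        severity == Severity.SEV1
        or (severity == Severity.SEV2 and dedicated_cmoc_for_sev2)
    )
    # Otherwise the IMOC also handles communications.
    cmoc = cmoc_on_call if needs_dedicated_cmoc else imoc_on_call

    return IncidentRoles(imoc=imoc_on_call, cmoc=cmoc,
                         incident_team=[engineer_on_call])
```

For example, `assign_roles(Severity.SEV3, "alice", "bob", "carol")` leaves communications with the IMOC, while the same call with `Severity.SEV1` assigns the dedicated CMOC. The Incident Manager can still append further people to `incident_team` as needed.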
Information is a key asset during any incident. Properly managing the flow of information to its intended destination is critical in keeping interested stakeholders apprised of developments in a timely fashion. Knowing that an incident is in progress helps stakeholders plan around those developments.
This flow is determined by:
Furthermore, avoiding information overload is necessary to keep every stakeholder focused.
To that end, we will have:
#production contains sizeable amounts of information and it takes effort to filter out non-relevant items. This is particularly important for the incident team, which must be focused on technical information to resolve the incident. While #incident is an open channel and anyone is free to join, we will encourage people to use other channels to communicate with the IMOC.