Incidents are anomalous conditions that result in, or may lead to, service degradation or outages. These events require human or automated intervention to restore service to full operational status. They often require immediate action in the shortest time possible, either to restore services that are in a degraded state or to avert an impending service interruption.
The primary goal of incident management is to organize chaos into swift incident resolution. To that end, incident management provides:
Incident severities encapsulate the impact of an incident and scope the resources allocated to handle it. Detailed definitions are provided for each severity, and these definitions are reevaluated as new circumstances become known. Incident management uses our standardized severity definitions, which can be found under our issue workflow documentation.
S2 incidents cannot be considered closed until GitLab.com has been fully operational, online, stable and performant for 30 minutes after the incident was resolved.
If an S3 incident is followed by another S3 incident within 3 hours, the latter incidents are automatically upgraded to
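The two timing rules above can be sketched in code. This is a minimal illustration, not GitLab tooling: the function names and the `last_degradation_at` input are hypothetical, and the way stability is measured is an assumption of the example. The sketch does not check severity, and it does not model which severity a repeated S3 incident is upgraded to; it only reports whether each rule's time condition holds.

```python
from datetime import datetime, timedelta
from typing import Optional

STABILITY_WINDOW = timedelta(minutes=30)  # from the closure rule above
REPEAT_WINDOW = timedelta(hours=3)        # from the repeated-S3 rule above


def can_close(resolved_at: datetime, now: datetime,
              last_degradation_at: Optional[datetime] = None) -> bool:
    """Return True once the service has stayed healthy for 30 minutes after resolution.

    `last_degradation_at` is a hypothetical input: the most recent time the
    service was observed degraded, or None if no degradation was recorded.
    """
    # Assumption of this sketch: the stability clock starts at resolution and
    # restarts at any degradation observed after resolution.
    clock_start = resolved_at
    if last_degradation_at is not None and last_degradation_at > resolved_at:
        clock_start = last_degradation_at
    return now - clock_start >= STABILITY_WINDOW


def is_repeat_s3(previous_s3_at: datetime, current_s3_at: datetime) -> bool:
    """Return True if an S3 incident follows another S3 incident within 3 hours."""
    gap = current_s3_at - previous_s3_at
    return timedelta(0) <= gap <= REPEAT_WINDOW
```

Keeping both windows as named constants makes the sketch easy to adjust if the severity definitions are reevaluated.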
|Role|Definition and Examples|
|---|---|
|Engineer On-Call + Incident Team|The Engineer On-Call is the initial owner of an incident and is, in essence, the Incident Team. When the|
|Incident Manager|The Incident Manager is the tactical leader of the incident response team and must not be the person doing the technical work of resolving the incident. The IMOC assembles the Incident Team, evaluates data (technical and otherwise) coming from team members, evaluates the technical direction of incident resolution, coordinates troubleshooting efforts, and is responsible for documentation and debriefs after the incident.|
|Communications Manager|The Communications Manager is the communications leader of the incident response team. The focus of the Incident Team is on resolving the incident as quickly as possible. However, there is a critical need to disseminate information to appropriate stakeholders, including employees, eStaff, and end users. For|
These definitions imply several on-call rotations for the different roles.
The IMOC should be a technical person with a good understanding of GitLab.com's architecture. The CMOC is not required to be technical. The IMOC and the CMOC work in tandem to manage the incident resolution and timely communication.
The initial and long-term owner of an incident is the EOC, who, as such, is responsible for incident declaration and its ultimate resolution. The EOC can temporarily cede ownership of an incident to an IMOC, but the EOC will still be responsible for producing the corresponding root-cause analysis (RCA).
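The ownership rule can be illustrated with a small model. This is only a sketch under stated assumptions: the `Incident` class, its field names, and the `cede_to` and `return_to_eoc` methods are hypothetical and do not correspond to any existing tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Incident:
    eoc: str                            # Engineer On-Call: initial and long-term owner
    acting_owner: Optional[str] = None  # set only while ownership is ceded to an IMOC

    @property
    def current_owner(self) -> str:
        # The IMOC owns the incident while ownership is ceded; otherwise the EOC does.
        return self.acting_owner or self.eoc

    @property
    def rca_owner(self) -> str:
        # The EOC produces the RCA even while an IMOC holds ownership.
        return self.eoc

    def cede_to(self, imoc: str) -> None:
        """Temporarily hand ownership of the incident to an IMOC."""
        self.acting_owner = imoc

    def return_to_eoc(self) -> None:
        """End the temporary handover; ownership reverts to the EOC."""
        self.acting_owner = None
```

Modeling the RCA owner as a separate property makes the key point explicit: ceding ownership changes who leads the incident, not who writes the RCA.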
On-Call runbooks are available for engineers on-call.
S2 incidents are critical, and the EOC can and should engage the
infra/5543 tracks automation for incident management.
The CMOC is the primary person responsible for creating the issue and documents below.
All of these steps can be done from Slack using /start-incident. The internal steps are:
Incident, on the production queue with the template for Incident. If it is not possible to generate the issue, start with the tracking document and create the incident issue later.
Optional - not required for post deployment patches and as needed for the incident:
infrastructure queue with the template for RCA. If it is not possible to generate the issue, start with the tracking document and create the incident issue later.
infrastructure when the incident is mitigated, with the info from the tracking document.
~corrective action issues have been linked, the issue can be closed. The infrastructure team will have a cadence to review and prioritize corrective actions.
Information is a key asset during any incident. Properly managing the flow of information to its intended destination is critical in keeping interested stakeholders apprised of developments in a timely fashion. The awareness that an incident is in progress is critical in helping stakeholders plan for said changes.
This flow is determined by:
Furthermore, avoiding information overload is necessary to keep every stakeholder’s focus.
To that end, we will have:
#production contains sizeable amounts of information, and it takes effort to filter out non-relevant items. This is particularly important for the incident team, which must be focused on technical information to resolve the incident. While #incident-management is an open channel and anyone is free to join, we will encourage people to use other channels to communicate with the IMOC.
The information below is meant to be a quick reference for deciding when to start the incident process and for what to do.
If any of the above are showing major error rates or deviations, it is better to start an incident.
/start-incident or, if you have an alert in alerts-general, click the Open Issue button in the thread.