Incidents are anomalous conditions that result in—or may lead
to—service degradation or outages. These events require human
intervention to avert disruptions or restore service to operational status.
Incidents are always given immediate attention.
Is GitLab.com Experiencing an Incident?
If you're observing issues on GitLab.com or working with users who are
reporting issues, please follow the instructions found on the
On-Call page and alert the Engineer On Call (EOC).
If any of the dashboards below are showing major error rates or deviations,
it's best to alert the Engineer On Call.
Effective incident management relies on: control points to manage the flow of information and the resolution path, a root-cause analysis (RCA), and a post-incident review that assigns a severity classification after assessing the impact and scope of the incident.
There is only ever one owner of an incident—and only the owner of
the incident can declare an incident resolved. At any time the incident owner
can engage the next role in the hierarchy for support. Except when
GitLab.com is not functioning correctly, the incident issue should
be assigned to the current owner.
Roles and Responsibilities
It's important to clearly delineate responsibilities during an incident.
Quick resolution requires focus and a clear hierarchy for delegation of
tasks. Preventing overlaps and ensuring a proper order of operations is vital
to mitigation. The responsibilities outlined in the roles below are
cascading, and ownership of the incident passes from one role to the next as
those roles are engaged. Until the next role in the hierarchy engages, the
previous role assumes all of the subsequent roles' responsibilities and
retains ownership of the incident.
Engineer On Call
The Production Engineer On Call is generally an SRE and can declare an incident. If another party has declared an incident, once the EOC is engaged they own the incident. The EOC gathers information, performs an initial assessment, and determines the incident severity level.
Incident Manager On Call
The Incident Manager is generally a Reliability Engineering manager and is engaged when incident resolution requires coordination from multiple parties. The IMOC is the tactical leader of the incident response team—not a person performing technical work. The IMOC assembles the incident team by engaging individuals with the skills and information required to resolve the incident.
Communications Manager On Call
The Communications Manager is generally a Reliability Engineering manager. The CMOC disseminates information internally to stakeholders and externally to customers across multiple media (e.g. GitLab issues, Twitter, status.gitlab.com, etc.).
These definitions imply several on-call rotations for the different roles.
Engineer on Call (EOC) Responsibilities
As EOC, your highest priority for the duration of your shift is the stability of GitLab.com.
Alerts that are routed to PagerDuty need to be acknowledged within 15 minutes; otherwise they are escalated to the on-call IMOC.
Alertmanager alerts in #alerts and #alerts-general are an important source of information about the health of the environment and should be monitored during working hours.
If the PagerDuty alert noise is too high, your task as EOC is to clear out that noise by either fixing the system or changing the alert.
If you change an alert, it is your responsibility to explain the reasons behind the change and inform the next EOC that it occurred.
Each event (which may comprise multiple related pages) should result in an issue in the production tracker. See production queue usage for more details.
If sources outside of our alerting are reporting a problem, and you have not received any alerts, it is still your responsibility to investigate. Declare a low severity incident and investigate from there.
Low severity (S3/S4) incidents (and issues) are cheap, and they give others a means to report that they are experiencing the same issue.
"No alerts" is not the same as "no problem"
GitLab.com is a complex system. It is ok to not fully understand the underlying issue or its causes. However, if this is the case, as EOC you should engage with the IMOC to find a team member with the appropriate expertise.
Requesting assistance does not mean relinquishing EOC responsibility. The EOC is still responsible for the incident.
As soon as an S1/S2 incident is declared, join The Situation Room Permanent Zoom. The Zoom link is in the #incident-management topic.
GitLab works in an asynchronous manner, but incidents require a synchronous response. Our collective goal is high availability of 99.95% and beyond, which means that the timescales over which communication needs to occur during an incident are measured in seconds and minutes, not hours.
Keep in mind that a GitLab.com incident is not an "infrastructure problem". It is a company-wide issue, and as EOC, you are leading the response on behalf of the company.
If you need information or assistance, engage with Engineering teams. If you do not get the response you require within a reasonable period, escalate through the IMOC.
As EOC, ask those who may be able to assist to join the Zoom call, and have them post their findings in the incident issue or the active incident Google Doc. Debugging information posted only in Slack will be lost, so this should be strongly discouraged.
By acknowledging an incident in PagerDuty, the EOC is implying that they are working on it. To further reinforce this acknowledgement, post a note in Slack that you are joining The Situation Room Permanent Zoom as soon as possible.
If the EOC believes the alert is incorrect, comment on the thread in #production. If the alert is flappy, create an issue and post a link in the thread. This issue might end up being part of the RCA or require a change to the alert rule.
Be inquisitive. Be vigilant. If you notice that something doesn't seem right, investigate further.
After the incident is resolved, the EOC should start an incident review (RCA) and assign themselves as the initial owner. Feel free to take a breather first, but do not end your work day without starting the RCA.
Communications Manager on Call (CMOC) Responsibilities
For serious incidents that require coordinated communications across multiple channels, the IMOC will select a CMOC for the duration of the incident.
The CMOC is responsible for updating internal parties, executive-level managers, and users.
The CMOC is responsible for contacting the Community Advocates Team via the #community-advocates Slack channel, mentioning them with the @advocates Slack handle.
Once Community Advocates have been notified, an update to the GitLab Status Twitter handle will be made.
During a critical S1-level incident, the CMOC should aim to update the GitLab Status about every 15 minutes.
Updates and the status of gitlab.com are shared by the GitLab Status Twitter handle, not the company handle GitLab.
The social marketing team can assist in major crises if there is not a Community Advocate available online; this should not occur often. Please alert the social team by using @social in your Slack Message if necessary.
Runbooks are available for
engineers on call. The project README contains links to checklists for each
of the above roles.
In the event of a GitLab.com outage, a mirror of the runbooks repository is available at https://ops.gitlab.net/gitlab-com/runbooks.
Declaring an Incident
Declare an incident from Slack
Type /incident declare in Slack (e.g. in #production) and follow the prompts. The incident declaration is orchestrated through IMA (incident management automation) and has the following capabilities:
Create a GitLab incident issue along with proper labels
Create a Google doc*
Announce the incident in #production channel*
The capabilities marked with * are optional; the Engineer On Call can decide which to use depending on the severity of the incident.
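As an illustration, the optional steps could be modeled like this. The function and step names are hypothetical and not the actual IMA implementation:

```python
# Hypothetical sketch of the /incident declare flow. Step names are
# illustrative; the real IMA automation may differ.

def declare_incident(severity: str, create_doc: bool = True,
                     announce: bool = True) -> list:
    """Run the declaration steps; the Google Doc and the #production
    announcement are optional, at the EOC's discretion."""
    steps = [f"created incident issue with severity label {severity}"]
    if create_doc:
        steps.append("created Google Doc from template")
    if announce:
        steps.append("announced incident in #production")
    return steps
```

For example, `declare_incident("S4", create_doc=False, announce=False)` would skip both optional steps for a low-severity incident.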
Declare an incident manually
If the Slack slash command fails for some reason, fall back to manual declaration.
Create an issue with the label ~Incident on the production queue, using the Incident template. If it is not possible to create the issue, start with the tracking document and create the incident issue later.
Ensure the initial severity label is accurate.
Optional - not required for post deployment patches and as needed for the incident:
For an S1/S2 outage, create an issue with the label ~IncidentReview on the infrastructure queue, using the incident_review template. If it is not possible to create the issue, start with the tracking document and create the review issue later.
The EOC should assign the incident issue to themselves.
Incident ownership is documented through the assignee field on the issue.
When handing ownership to a new owner, at the end of a shift, or at the end of a rotation, the assignee on the issue should be updated.
Create a Google Doc from the template and associate it with both the production and infrastructure issues. Populate it and use it as the source of truth during the incident. This doc is mainly for real-time, multi-user access when the production issue cannot be used for communication.
Contact the CMOC for support during the incident.
Epic epic/100 tracks automation for incident management.
Definition of Outage vs Degraded vs Disruption
This is a first revision of the definition of Service Disruption (Outage), Partial Service Disruption, and Degraded Performance per the terms on Status.io.
Data is based on the graphs from the Key Service Metrics Dashboard
Outage and Degraded Performance incidents occur when:
Degraded: any sustained 5-minute period where a service is below its documented Apdex SLO or above its documented error-ratio SLO.
Outage (Status = Disruption): a sustained 5-minute error rate above the Outage line on the error-ratio graph.
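As a sketch, the 5-minute sustained-breach rule above could be checked like this. The sample format and threshold values are illustrative assumptions, not GitLab's actual SLOs:

```python
# Minimal sketch of the Degraded/Outage classification, assuming
# per-minute samples of Apdex and error ratio. "Sustained" means every
# sample in the 5-minute window breaches the threshold.

def classify(apdex_samples, error_samples, apdex_slo, error_slo, outage_line):
    """Return 'outage', 'degraded', or 'operational' for the most
    recent 5-minute window of per-minute samples."""
    window = 5
    recent_apdex = apdex_samples[-window:]
    recent_error = error_samples[-window:]
    # Outage: sustained error rate above the Outage line.
    if len(recent_error) == window and all(e > outage_line for e in recent_error):
        return "outage"
    # Degraded: sustained Apdex below SLO, or error ratio above SLO.
    degraded = (
        (len(recent_apdex) == window and all(a < apdex_slo for a in recent_apdex))
        or (len(recent_error) == window and all(e > error_slo for e in recent_error))
    )
    return "degraded" if degraded else "operational"
```

Requiring every sample in the window to breach (rather than the average) matches the "sustained 5 minute" wording: a single spike that recovers within the window does not change the classification.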
A Partial Service Disruption is when only part of the GitLab.com services or infrastructure is experiencing an incident. Examples of partial service disruptions are instances where GitLab.com is operating normally except there are:
delayed CI/CD pending jobs
delayed repository mirroring
high severity bugs affecting a particular feature like Merge Requests
Abuse of, or degradation on, a single Gitaly node affecting a subset of Git repositories. This would be visible in the Gitaly service metrics.
Information is an asset to everyone impacted by an incident. Properly managing the flow of information is critical to minimizing surprise and setting expectations. We aim to keep interested stakeholders apprised of developments in a timely fashion so they can plan appropriately.
This flow is determined by:
the type of information,
its intended audience,
and timing sensitivity.
Furthermore, avoiding information overload is necessary to keep every stakeholder’s focus.
To that end, we will have:
a dedicated Zoom call for all incidents. A link to the Zoom call can be found in the topic for the #incident-management room in Slack.
a Google Doc as needed for multiple user input based on the shared template
regular updates to status.gitlab.com via status.io that disseminates to various media (e.g. Twitter)
a dedicated repo for issues related to Production separate from the queue that holds Infrastructure's workload: namely, issues for incidents and changes.
We manage incident communication using status.io, which updates status.gitlab.com. Incidents in status.io have state and status and are updated by the incident owner.
Definitions and rules for transitioning state and status are as follows.
Investigating
The incident has just been discovered and there is not yet a clear understanding of the impact or cause. If an incident remains in this state for longer than 30 minutes after the EOC has engaged, it should be escalated to the IMOC.
Identified
The cause of the incident is believed to have been identified and a step to mitigate it has been planned and agreed upon.
Monitoring
The mitigation step has been executed and metrics are being watched to ensure that we are operating at baseline.
Resolved
The incident is closed and the status is again Operational.
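The 30-minute escalation rule above can be sketched as a simple check. The state name and function signature are assumptions for illustration, not an existing tool:

```python
from datetime import datetime, timedelta

# Sketch of the escalation rule: an incident still in its initial
# investigation state more than 30 minutes after the EOC engaged
# should be escalated to the IMOC. "Investigating" is an assumed
# state name.

def should_escalate_to_imoc(state: str, eoc_engaged_at: datetime,
                            now: datetime) -> bool:
    return (state == "Investigating"
            and now - eoc_engaged_at > timedelta(minutes=30))
```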
Status can be set independently of state. The only time the two must align is when an incident is resolved, at which point the status returns to Operational.
Operational
The default status before an incident is opened and after an incident has been resolved. All systems are operating normally.
Degraded Performance
Users are impacted intermittently, but the impact is not observed in metrics, nor reported to be widespread or systemic.
Partial Service Disruption
Users are impacted at a rate that violates our SLO. The IMOC must be engaged, and monitoring through to resolution is required if the disruption lasts longer than 30 minutes.
Service Disruption
This is an outage. The IMOC must be engaged.
Security Issue
A security vulnerability has been declared public and the security team has asked to publish it.
The primary goals of writing an Incident Review are to ensure that the incident is documented, that all contributing root causes are well understood, and, especially, that effective preventive actions are put in place to reduce the likelihood and/or impact of recurrence.[1]
Not every incident requires a review. But, if an incident matches any of the following criteria, an incident review must be completed:
A service disruption occurred
Data loss of any kind
A resolution time longer than 30 minutes
A monitoring failure
Incident Review Process
Incident reviews (of S1/S2 incidents) have two steps:
The first step in the Incident Review process is the synchronous review of the incident by representatives of the teams involved in the resolution of the incident. This step is conducted as close to the incident date as possible and does not require a complete Incident Review write up. The outcome of this first step should be a published Incident Review, per defined timelines.
Review of Root Cause and Corrective Actions
The second step of the incident review is engaging with the customer through a point of contact such as a TAM. This should always involve sharing the findings from the first step asynchronously. If a customer requires a synchronous discussion of the findings, Infrastructure management will organize it with the important stakeholders of this process, per the defined timelines.
The CMOC for the incident updates the TAM team on the expected AIR timelines.
The CMOC provides the TAM with the published review, and can also include a recording of the review if it does not contain sensitive information.
TAM communicates to CMOC whether their customer(s) would like a synchronous review and the TAM schedules a review with the customer.
TAM facilitates Customer access to the review and the Customer can add a set of questions prior to the meeting and all participants can collaborate on any changes or additions to corrective actions.
Incident resolution + 2 days: CMOC sets expectations with TAM on delivery date of Incident Review.
Incident resolution + 7 days: Incident Review is authored and ready for review by additional stakeholders.
Incident Review Issue Creation and Ownership
Incident Reviews are conducted in production issues—except in the case of extenuating circumstances when Infrastructure or Engineering management determines a synchronous video call should be held. The issues should have the ~IncidentReview label attached.
Every incident must be assigned a DRI; most of the time this will be the EOC who responded to, or declared, the incident. The incident review should be assigned to the DRI as soon as it is created.
The output of an incident review should include one or more issues labeled ~Corrective Action. Linking already existing issues for corrective action is appropriate if the incident was similar to a prior event and corrective actions overlap.
The DRI is responsible for selecting and assigning corrective actions that should be prioritized and resolved within a specific timeframe.
All issues labeled ~Corrective Action must have an assigned priority label; it is the responsibility of the DRI to ensure that the priorities are set.
For high-priority ~Corrective Action issues, a due date should be set on the issue to ensure that expectations are set for resolving them.
After discussion on the Incident Review issue has ended and all ~Corrective Action issues have been linked, the issue can be closed.
The infrastructure team keeps track of ~Corrective Action issues
on a dedicated board. The prioritization and assignment of these issues is collectively handled by the Reliability Engineering managers.
Incident severities encapsulate the impact of an incident and scope the resources allocated to handle it. Detailed definitions are provided for each severity, and these definitions are reevaluated as new circumstances become known. Incident management uses our standardized severity definitions, which can be found under our issue workflow documentation.
Alert severities do not necessarily determine incident severities. A single incident can trigger a number of alerts at various severities, but the determination of the incident's severity is driven by the above definitions.
Over time, we aim to automate the determination of an incident's severity through service-level monitoring that can aggregate individual alerts against specific SLOs.
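A hypothetical sketch of that automation goal: derive an incident severity from the set of alerts that violate SLOs. The field names and thresholds here are illustrative assumptions, not GitLab's actual severity definitions:

```python
# Aggregate individual alerts into a single incident severity.
# Field names ('slo_breached', 'user_impacting') and the S1-S4
# thresholds are assumptions for illustration.

def incident_severity(alerts):
    """alerts: list of dicts with 'slo_breached' and 'user_impacting'
    booleans. Returns an S1-S4 severity string."""
    breached = [a for a in alerts if a.get("slo_breached")]
    user_impacting = [a for a in breached if a.get("user_impacting")]
    if len(user_impacting) >= 2:
        return "S1"  # multiple user-facing SLOs violated
    if user_impacting:
        return "S2"  # a single user-facing SLO violated
    if breached:
        return "S3"  # internal SLO violated, no direct user impact
    return "S4"
```

The point of the sketch is the shape of the logic: severity is a function of which SLOs are breached, not of the severities of the individual alerts.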
[1]: Google SRE, Chapter 15 - Postmortem Culture: Learning from Failure