If you're a GitLab team member and are looking to alert Reliability Engineering about an availability issue with GitLab.com, please find quick instructions to report an incident here: Reporting an Incident.
If you're a GitLab team member looking for who is currently the Engineer On Call (EOC), please see the Who is the Current EOC? section.
If you're a GitLab team member looking for the status of a recent incident, please see the incident board. For detailed information about incident status changes, please see the Incident Workflow section.
Incidents are anomalous conditions that result in—or may lead to—service degradation or outages. These events require human intervention to avert disruptions or restore service to operational status. Incidents are always given immediate attention.
The goal of incident management is to organize chaos into swift incident resolution. To that end, incident management provides:
When an incident starts, we use the #incident-management Slack channel for chat-based communication. There is a Situation Room Zoom link in the channel description for incident team members to join for synchronous communication. There will be a link to an incident issue in the #incident-management channel. We prefer to keep collaborative work towards incident mitigation in a thread off the original incident issue announcement. This makes it easier for the incoming EOC and CMOC on-call to find status during handoffs.
There is only ever one owner of an incident—and only the owner of the incident can declare an incident resolved. At any time the incident owner can engage the next role in the hierarchy for support. With the exception of when GitLab.com is not functioning correctly, the incident issue should be assigned to the current owner.
It's important to clearly delineate responsibilities during an incident. Quick resolution requires focus and a clear hierarchy for delegation of tasks. Preventing overlaps and ensuring a proper order of operations is vital to mitigation. The responsibilities outlined in the roles below are cascading: ownership of the incident passes from one role to the next as those roles are engaged. Until the next role in the hierarchy engages, the previous role assumes all of the subsequent roles' responsibilities and retains ownership of the incident.
Role | Description | Who? |
---|---|---|
EOC - Engineer On Call | The EOC is usually the first person alerted - expectations for the role are in the Handbook for oncall. The checklist for the EOC is in our runbooks. If another party has declared an incident, once the EOC is engaged the EOC owns the incident. The EOC can escalate a page in PagerDuty to engage the IMOC and CMOC. | The Reliability Team Engineer On Call is generally an SRE and can declare an incident. They are part of the "SRE 8 Hour" on call shift in PagerDuty. |
IMOC - Incident Manager On Call | The IMOC is engaged when incident resolution requires coordination from multiple parties. The IMOC is the tactical leader of the incident response team—not a person performing technical work. The IMOC assembles the incident team by engaging individuals with the skills and information required to resolve the incident. | The Incident Manager is an Engineering Manager, Staff Engineer, or Director from the Reliability team. The IMOC rotation is currently in the "SRE Managers" PagerDuty schedule. |
CMOC - Communications Manager On Call | The CMOC disseminates information internally to stakeholders and externally to customers across multiple media (e.g. GitLab issues, Twitter, status.gitlab.com, etc.). | The Communications Manager is generally a member of the support team at GitLab. Notifications to the Incident Management - CMOC service in PagerDuty will go to the rotations set up for CMOC. |
These definitions imply several on-call rotations for the different roles.
- #alerts and #alerts-general are an important source of information about the health of the environment and should be monitored during working hours.
- … production tracker. See production queue usage for more details.
- … The Situation Room Permanent Zoom. The Zoom link is in the #incident-management topic.
- … The Situation Room Permanent Zoom as soon as possible.
- … #production. If the alert is flappy, create an issue and post a link in the thread. This issue might end up being a part of RCA or end up requiring a change in the alert rule.
- … ~review-requested label. It is expected that the incident review is completed within 14 days of the incident.

At times, we have a security incident where we may need to take actions to block a certain URL path or part of the application. This list is meant to help the Security Engineer On-Call and EOC decide when to engage help and post to status.io.
If any of the following are true, it would be best to engage an Incident Manager:
In some cases we may choose not to post to status.io; the following are examples where we may skip a post/tweet. In some cases, this helps protect the security of self-managed instances until we have released the security update.
… severity-1@gitlab.pagerduty.com or via the GitLab Production - Severity 1 Escalation service in PagerDuty (app or website) with a link to the incident.

For serious incidents that require coordinated communications across multiple channels, the IMOC will select a CMOC for the duration of the incident during the incident declaration process.
The GitLab support team staffs an on-call rotation, reachable via the Incident Management - CMOC service in PagerDuty. They have a section in the support handbook for getting new CMOC people up to speed.
During an incident, the CMOC will:
- … @community-team handle at the start of an incident.

If, during an incident, EOC or IMOC decide to engage CMOC, they should do that by paging the on-call person:

- … /pd-cmoc command in Slack, or
- …

Note that CMOC coverage hours do not include weekends. 24x7 coverage for CMOC is being worked on in support-team-meta#2822.
Corrective Actions (CAs) are work items that we create as a result of an incident. They are designed to prevent or reduce the likelihood and/or impact of an incident recurrence.
Badly worded | Better |
---|---|
Fix the issue that caused the outage | (Specific) Handle invalid postal code in user address form input safely |
Investigate monitoring for this scenario | (Actionable) Add alerting for all cases where this service returns >1% errors |
Make sure engineer checks that database schema can be parsed before updating | (Bounded) Add automated presubmit check for schema changes |
Runbooks are available for engineers on call. The project README contains links to checklists for each of the above roles.
In the event of a GitLab.com outage, a mirror of the runbooks repository is available at https://ops.gitlab.net/gitlab-com/runbooks.
The chatops bot will give you this information if you DM it with /chatops run oncall prod.
The current EOC can be contacted via the @sre-oncall handle in Slack, but please only use this handle in the following scenarios:

- … ~severity::2.

The EOC will respond as soon as they can to the usage of the @sre-oncall handle in Slack but, depending on circumstances, may not be immediately available. If it is an emergency and you need an immediate response, please see the Reporting an Incident section.
If you are a GitLab team member and would like to report a possible incident related to GitLab.com and have the EOC paged in to respond, choose one of the reporting methods below. Regardless of the method chosen, please stay online until the EOC has had a chance to come online and engage with you regarding the incident. Thanks for your help!
Type /incident declare in the #production channel in GitLab's Slack and follow the prompts. This will open an incident issue. If you suspect the issue is an emergency, tick the "Page the engineer on-call" box - not the incident manager or communications manager boxes. You do not need to decide if the problem is an incident, and should err on the side of paging the engineer on-call if you are not sure. We have triage steps below to make sure we respond appropriately. Reporting high severity bugs via this process is the preferred path so that we can make sure we engage the appropriate engineering teams as needed.
Incident Declaration Slack window
Field | Description |
---|---|
Title | Give the incident as descriptive a title as you can. Please prepend the title with a date in the format YYYY-MM-DD |
Severity | If unsure about the severity to choose, but you are seeing a large amount of customer impact, please select S1. More details here: Incident Severity. |
Tasks: page the on-call engineer | If you'd like to page the on-call engineer, please check this box. If in doubt, err on the side of paging if there is significant disruption to the site. |
Tasks: page on-call managers | You can page the incident and/or communications managers on-call. |
Incident Declaration Results
As well as opening a GitLab incident issue, a dedicated incident Slack channel will be opened. The "woodhouse" bot will post links to all of these resources in the main #incident-management
channel. Please note that unless you're an SRE, you won't be able to post in #incident-management
directly. Please join the dedicated Slack channel, created and linked as a result of the incident declaration, to discuss the incident with the on-call engineer.
Email gitlab-production-eoc@gitlab.pagerduty.com. This will immediately page the Engineer On Call.
This is a first revision of the definitions of Service Disruption (Outage), Partial Service Disruption, and Degraded Performance per the terms on Status.io. Data is based on the graphs from the Key Service Metrics Dashboard.
Outage and Degraded Performance incidents occur when:
- Degraded: any sustained 5 minute time period where a service is below its documented Apdex SLO or above its documented error ratio SLO.
- Outage (Status = Disruption): a 5 minute sustained error rate above the Outage line on the error ratio graph.

SLOs are documented in the runbooks/rules.
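To make these thresholds concrete, here is a minimal sketch (illustrative only, not the canonical SLO tooling) that applies the definitions above to five one-minute samples. The Apdex, error ratio, and Outage-line values are placeholder assumptions; the real per-service SLOs live in the runbooks/rules.

```python
# Hypothetical sketch: classify a sustained 5-minute window as outage/degraded/operational.
from dataclasses import dataclass

@dataclass
class Sample:
    apdex: float        # 0.0 to 1.0, higher is better
    error_ratio: float  # fraction of requests failing

# Placeholder thresholds; real values are documented per service in runbooks/rules.
APDEX_SLO = 0.995
ERROR_RATIO_SLO = 0.005
OUTAGE_ERROR_RATIO = 0.05  # illustrative "Outage line" on the error ratio graph

def classify(window: list[Sample]) -> str:
    """Return 'outage', 'degraded', or 'operational' for five one-minute samples."""
    assert len(window) == 5, "expects a sustained 5-minute window"
    if all(s.error_ratio > OUTAGE_ERROR_RATIO for s in window):
        return "outage"  # Status = Disruption
    if all(s.apdex < APDEX_SLO or s.error_ratio > ERROR_RATIO_SLO for s in window):
        return "degraded"
    return "operational"

print(classify([Sample(apdex=0.97, error_ratio=0.02)] * 5))  # -> degraded
```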
To check if we are Degraded or Disrupted for GitLab.com, we look at these graphs:
A Partial Service Disruption is when only part of the GitLab.com services or infrastructure is experiencing an incident. Examples of partial service disruptions are instances where GitLab.com is operating normally except there are:
In the case of high severity bugs, we prefer that an incident issue is still created via Reporting an Incident. This will give us an incident issue on which to track the events and response.
In the case of a high severity bug that is in an ongoing, or upcoming deployment please follow the steps to Block a Deployment.
If an incident may be security related, engage the Security Operations on-call by using /security
in Slack. More detail can be found in Engaging the Security On-Call.
Information is an asset to everyone impacted by an incident. Properly managing the flow of information is critical to minimizing surprise and setting expectations. We aim to keep interested stakeholders apprised of developments in a timely fashion so they can plan appropriately.
This flow is determined by:
Furthermore, avoiding information overload is necessary to keep every stakeholder’s focus.
To that end, we will have:
- … #incident-management room in Slack.
- … #incident-management channel for internal updates.

We manage incident communication using status.io, which updates status.gitlab.com. Incidents in status.io have state and status and are updated by the incident owner.
Definitions and rules for transitioning state and status are as follows.
State | Definition |
---|---|
Investigating | The incident has just been discovered and there is not yet a clear understanding of the impact or cause. If an incident remains in this state for longer than 30 minutes after the EOC has engaged, the incident should be escalated to the IMOC. |
Identified | The cause of the incident is believed to have been identified and a step to mitigate has been planned and agreed upon. |
Monitoring | The step has been executed and metrics are being watched to ensure that we're operating at a baseline |
Resolved | The incident is closed and status is again Operational. |
Status can be set independently of state. The only time these must align is when an incident is Resolved, at which point the status returns to Operational.
Status | Definition |
---|---|
Operational | The default status before an incident is opened and after an incident has been resolved. All systems are operating normally. |
Degraded Performance | Users are impacted intermittently, but the impact is not observed in metrics or reported to be widespread or systemic. |
Partial Service Disruption | Users are impacted at a rate that violates our SLO. The IMOC must be engaged and monitoring to resolution is required to last longer than 30 minutes. |
Service Disruption | This is an outage. The IMOC must be engaged. |
Security Issue | A security vulnerability has been declared public and the security team has asked to publish it. |
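As a quick illustration of the state/status model above, here is a sketch under the assumption that the only hard alignment rule is the one at resolution; status.io itself remains the source of truth.

```python
# Illustrative check of the state/status model: status may vary independently of state,
# except that a Resolved incident must report an Operational status.
STATES = {"Investigating", "Identified", "Monitoring", "Resolved"}
STATUSES = {
    "Operational",
    "Degraded Performance",
    "Partial Service Disruption",
    "Service Disruption",
    "Security Issue",
}

def is_valid(state: str, status: str) -> bool:
    """Return True if the state/status pair is allowed."""
    if state not in STATES or status not in STATUSES:
        return False
    if state == "Resolved":
        return status == "Operational"
    return True

print(is_valid("Monitoring", "Degraded Performance"))  # True
print(is_valid("Resolved", "Service Disruption"))      # False
```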
Incident severities encapsulate the impact of an incident and scope the resources allocated to handle it. Detailed definitions are provided for each severity, and these definitions are reevaluated as new circumstances become known. Incident management uses our standardized severity definitions, which can be found under availability severities.
In order to effectively track specific metrics and have a single pane of glass for incidents and their reviews, specific labels are used. The below workflow diagram describes the two paths an incident can take from open to closed. The two paths are dictated by the severity of the issue: S1 and S2 incidents require a review, while in certain cases an S3 or S4 incident can take the review workflow path. Details here
The EOC and the IMOC, at the time of the incident, are the default assignees for an incident issue. They are the assignees for the entire workflow of the incident issue.
The following labels are used to track the incident lifecycle from active incident to completed incident review. Label Source
In order to help with attribution, we also label each incident with the Infrastructure Service (Service::) and Group (group::) scoped labels.
Label | Workflow State |
---|---|
~Incident::Active | Indicates that the incident labeled is active and ongoing. Initial severity is assigned. |
~Incident::Mitigated | Indicates that the incident has been mitigated, but immediate post-incident activity may be ongoing (monitoring, messaging, etc.). |
~Incident::Resolved | Indicates that SRE engagement with the incident has ended and GitLab.com is fully operational. Severity is re-assessed and, if the initial severity is no longer correct, it is changed to the correct severity. |
~Incident::Review-in-Progress | Indicates that an incident met the threshold for requiring a review (S1) or a ~review-requested label was added to the incident. |
~Incident::Review-Scheduled | Indicates that the incident review has been added to the agenda for an upcoming review meeting. |
~Incident::Review-Completed | Indicates that an incident review has been completed, but there are notes to incorporate from the review writeup prior to closing the issue. |
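The sketch below is purely illustrative (it is not an automation that exists in the tracker); it summarizes the two label paths described above, using the severity rule from the workflow description.

```python
# Hypothetical summary of the two incident label workflows from open to closed.
NO_REVIEW_PATH = [
    "Incident::Active",
    "Incident::Mitigated",
    "Incident::Resolved",  # lower-severity incidents typically close here
]

REVIEW_PATH = NO_REVIEW_PATH + [
    "Incident::Review-in-Progress",  # S1/S2, or any incident labeled ~review-requested
    "Incident::Review-Scheduled",
    "Incident::Review-Completed",
]

def label_path(severity: str, review_requested: bool = False) -> list[str]:
    """Pick the label path an incident follows, per the workflow above."""
    needs_review = severity in {"S1", "S2"} or review_requested
    return REVIEW_PATH if needs_review else NO_REVIEW_PATH

print(label_path("S3", review_requested=True))  # an S3 can still take the review path
```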
Labeling incidents with similar causes helps develop insight into overall trends and, when combined with Service attribution, improves understanding of Service behavior. Indicating a single root cause is desirable; in cases where there appear to be multiple root causes, indicate the one which precipitated the incident.
The EOC, as DRI of the incident, is responsible for determining root cause.
The current Root Cause labels are listed below. In order to support trend awareness these labels are meant to be high-level, not too numerous, and as consistent as possible over time.
Root Cause | Description |
---|---|
~RootCause::Software-Change | feature or other code change |
~RootCause::Feature-Flag | a feature flag toggled in some way (off or on or a new percentage or target was chosen for the feature flag) |
~RootCause::Config-Change | configuration change, other than a feature flag being toggled |
~RootCause::SPoF | the failure of a service or component which is an architectural SPoF (Single Point of Failure) |
~RootCause::Malicious-Traffic | deliberate malicious activity targeted at GitLab or customers of GitLab (e.g. DDoS) |
~RootCause::Saturation | failure resulting from a service or component which failed to scale in response to increasing demand (whether or not it was expected) |
~RootCause::External-Dependency | resulting from the failure of a dependency external to GitLab, including various service providers. Use of other causes (such as ~RootCause::SPoF or ~RootCause::Saturation) should be strongly considered for most incidents. |
~RootCause::Release-Compatibility | forward- or backwards-compatibility issues between subsequent releases of the software running concurrently, and sharing state, in a single environment (e.g. Canary and Main stage releases). They can be caused by incompatible database DDL changes, canary browser clients accessing non-canary APIs, or by incompatibilities between Redis values read by different versions of the application. |
~RootCause::Security | an incident where the SIRT team was engaged, generally via a request originating from the SIRT team or in a situation where Reliability has paged SIRT to assist in the mitigation of an incident not caused by ~RootCause::Malicious-Traffic |
These labels are always required on incident issues.
Label | Purpose |
---|---|
~incident | Label used for metrics tracking and immediate identification of incident issues. |
~Service::* | Scoped label for service attribution. Used in metrics and error budgeting. |
~Severity::* | Scoped label for severity assignment. Details on severity selection can be found in the availability severities section. |
~RootCause::* | Scoped label indicating root cause of the incident. |
In certain cases, additional labels will be added as a mechanism to add metadata to an incident issue for the purposes of metrics and tracking.
Label | Purpose |
---|---|
~self-managed | Indicates that an incident is exclusively an incident for self-managed GitLab. Example self-managed incident issue |
~incident-type::automated traffic | The incident occurred due to activity from security scanners, crawlers, or other automated traffic |
~incident-type::deployment related | Indicates that the incident was a deployment failure caused by failing tests, application bugs, or pipeline problems. |
~group::* | Any development group(s) related to this incident |
~review-requested | Indicates that the incident would benefit from undergoing additional review. All S1 incidents are required to have a review. Additionally, anyone including the EOC can request an incident review on any severity issue. Although the review will help to derive corrective actions, it is expected that corrective actions are filed whether or not a review is requested. If an incident does not have any corrective actions, this is probably a good reason to request a review for additional discussion. |
The board which tracks all GitLab.com incidents from active to reviewed is located here.
A near miss, "near hit", or "close call" is an unplanned event that has the potential to cause an incident but does not actually result in one.
In the United States, the Aviation Safety Reporting System has been collecting reports of close calls since 1976. Due to near miss observations and other technological improvements, the rate of fatal accidents has dropped about 65 percent. source
Near misses are like a vaccine. They help the company better defend against more serious errors in the future, without harming anyone or anything in the process.
When a near miss occurs, we should treat it in a similar manner to a normal incident.
- … ~Near Miss label.