If you're a GitLab team member and are looking to alert Reliability Engineering about an availability issue with GitLab.com, please find quick instructions to report an incident here: Reporting an Incident.
If you're a GitLab team member looking for who is currently the Engineer On Call (EOC), please see the Who is the Current EOC? section.
If you're a GitLab team member looking for the status of a recent incident, please see the incident board. For detailed information about incident status changes, please see the Incident Workflow section.
Incidents are anomalous conditions that result in—or may lead to—service degradation or outages. These events require human intervention to avert disruptions or restore service to operational status. Incidents are always given immediate attention.
The goal of incident management is to organize chaos into swift incident resolution. To that end, incident management provides:
When an incident starts, the incident automation sends a message
in the #incident-management
channel
containing a link to a per-incident Slack channel for text-based communication, the
incident issue for permanent records, and the Situation Room Zoom link for incident team members to join for synchronous verbal
and screen-sharing communication.
Scheduled maintenance that is a C1
should be treated as an undeclared incident.
30 minutes before the maintenance window starts, the Engineering Manager responsible for the change should notify the SRE on-call, the Release Managers, and the CMOC to inform them that the maintenance is about to begin.
Coordination and communication should take place in the Situation Room Zoom so that it is quick and easy to include other engineers if there is a problem with the maintenance.
If a related incident occurs during the maintenance procedure, the EM should act as the Incident Manager for the duration of the incident.
If a separate unrelated incident occurs during the maintenance procedure, the engineers involved in the scheduled maintenance should vacate the Situation Room Zoom in favour of the active incident.
By default, the EOC is the owner of the incident. The incident owner can delegate ownership to another engineer or escalate ownership to the IM at any time. There is only ever one owner of an incident, and only the owner of the incident can declare an incident resolved. At any time, the incident owner can engage the next role in the hierarchy for support. The incident issue should be assigned to the current owner.
Clear delineation of responsibilities is important during an incident. Quick resolution requires focus and a clear hierarchy for delegation of tasks. Preventing overlaps and ensuring a proper order of operations is vital to mitigation.
To make your role clear, edit your Zoom name to start with your role when you join the Zoom meeting, for example "IM - John Doe". To edit your name during a Zoom call, click the three dots on your video tile and choose the "Rename" option. Edits made during a Zoom call only last for the length of the call, so your name will automatically revert to your profile name/title for the next call.
Role | Description | Who? |
---|---|---|
EOC - Engineer On Call | The EOC is usually the first person alerted - expectations for the role are in the Handbook for on-call. The checklist for the EOC is in our runbooks. If another party has declared an incident, once the EOC is engaged the EOC owns the incident. The EOC can escalate a page in PagerDuty to engage the Incident Manager and CMOC. | The Reliability Team Engineer On Call is generally an SRE and can declare an incident. They are part of the "SRE 8 Hour" on-call shift in PagerDuty. |
DRI - Directly Responsible Individual | The DRI is the owner of the incident, is responsible for the coordination of the incident response, and will drive the incident to resolution. The DRI should always be the person assigned to the issue. | By default, the IM is the DRI for Sev1 and Sev2 externally facing incidents, and the EOC is the DRI for all other incidents. The DRI can and should transfer ownership in cases where it makes sense to do so. |
IM - Incident Manager (Information about IM onboarding) | The Incident Manager is engaged when incident resolution requires coordination from multiple parties. The Incident Manager is the tactical leader of the incident response team—not a person performing technical work. The IM checklist is in our runbooks. The Incident Manager assembles the incident team by engaging individuals with the skills and information required to resolve the incident. | The Incident Manager On Call rotation is in PagerDuty. |
CMOC - Incident Communications Manager On Call | The CMOC disseminates information internally to stakeholders and externally to customers across multiple media (e.g. GitLab issues, status.gitlab.com, etc.). | The Communications Manager is generally a member of the Support team at GitLab. Notifications to the Incident Management - CMOC service in PagerDuty will go to the rotations set up for CMOC. |
These definitions imply several on-call rotations for the different roles. Note that not all incidents include engagement from Incident Managers or Communication Managers.
For general information about how shifts are scheduled and common scenarios about what to do when you have PTO or need coverage, see the Incident Manager onboarding documentation.
When paged, the Incident Managers have the following responsibilities during a Sev1 or Sev2 incident and should be engaged on these tasks immediately when an incident is declared:
- Page Infrastructure Leadership by running `/pd trigger` in Slack. In the "Create New Incident" dialog, select "Infrastructure Leadership" as the Impacted Service, with a link to the incident in the Description as well as a reminder that Infrastructure Leadership should follow the process for Infrastructure Leadership Escalation. This notification should happen 24/7.
- Provide regular updates in the `Current Status` section of the incident issue description. These updates should summarize the current customer impact of the incident and the actions we are taking to mitigate the incident. This is the most important section of the incident issue: it is referenced for status page updates and should provide a summary of the incident and impact that can be understood by the wider community.
- Ensure the `Summary for CMOC notice / Exec summary` in the incident description is filled out as soon as possible.
- If there is a `~review-requested` label on the incident, after the incident is resolved, the Incident Manager is the DRI of the post-incident review. The DRI role can be delegated.

The Incident Manager is the DRI for all of the items listed above, but it is expected that the IM will do this with the support of the EOC or others who are involved with the incident. If an incident runs beyond a scheduled shift, the Incident Manager is responsible for handing over to the incoming IM.
The IM won't be engaged on these tasks unless they are paged, which is why the default is to page them for all Sev1 and Sev2 incidents.
In other situations, to engage the Incident Manager run /pd trigger
and choose the GitLab Production - Incident Manager
as the impacted service.
The Engineer On Call is responsible for the mitigation of impact and resolution to the incident that was declared. The EOC should reach out to the Incident Manager for support if help is needed or others are needed to aid in the incident investigation.
For Sev3 and Sev4 incidents, the EOC is also responsible for Incident Manager Responsibilities, second to mitigating and resolving the incident.
Additional expectations for the EOC:

- If you need coverage for an upcoming on-call shift, ask in the `#reliability-lounge` Slack channel. If you are unable to find coverage, reach out to a Reliability Engineering Manager for assistance.
- `#alerts` and `#alerts-general` are an important source of information about the health of the environment and should be monitored during working hours.
- Incidents are declared in the `production` tracker. See production queue usage for more details.
- Incident coordination takes place in The Situation Room Permanent Zoom. The Zoom link is in the `#incident-management` topic.
- When an incident is declared, join The Situation Room Permanent Zoom as soon as possible.
- Respond to alerts in a thread in `#production`. If the alert is flappy, create an issue and post a link in the thread. This issue might end up being part of an RCA or requiring a change in the alert rule.
- If the incident has the `~review-requested` label, the EOC should start performing an incident review; in some cases this may be a synchronous review meeting or an async review, depending on what is requested by those involved with the incident.
- When blocking a user, use Admin Notes, leaving a note that contains a link to the incident and any further notes explaining why the user is being blocked.
If any of the following are true, it would be best to engage an Incident Manager:
To engage with the Incident Manager run /pd trigger
and choose the GitLab Production - Incident Manager
as the impacted service. Please note that when an incident is upgraded in severity (for example from S3 to S1), PagerDuty does not automatically page the Incident Manager or Communications Manager and this action must be taken manually.
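If you want to script this manual escalation rather than use the Slack command, a page can also be sent through the PagerDuty Events API v2. The sketch below is illustrative only: the routing key is a placeholder for the integration key of whichever service you need to page (such as the Incident Manager service), and the helper function name is hypothetical.

```python
# Illustrative sketch: trigger a PagerDuty page via the Events API v2.
# The routing key is a placeholder; use the integration key configured for
# the service you need to page.
import requests

def page_incident_manager(summary: str, incident_url: str) -> None:
    response = requests.post(
        "https://events.pagerduty.com/v2/enqueue",
        json={
            "routing_key": "<incident-manager-integration-key>",  # placeholder
            "event_action": "trigger",
            "payload": {
                "summary": summary,
                "source": "incident-escalation-script",
                "severity": "critical",
                "custom_details": {"incident_issue": incident_url},
            },
        },
        timeout=10,
    )
    response.raise_for_status()

page_incident_manager(
    "S1 incident requires an Incident Manager",
    "https://gitlab.com/gitlab-com/gl-infra/production/-/issues/<issue-number>",
)
```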
Occasionally we encounter multiple incidents at the same time. Sometimes a single Incident Manager can cover multiple incidents. This isn't always possible, especially if there are two simultaneous high-severity incidents with significant activity.
When there are multiple incidents and you decide that additional incident manager help is required, take these actions:
If a second incident Zoom is desired, choose which incident will move to the new Zoom and create a new Zoom meeting. Be sure to edit the channel topic of the incident Slack channel to indicate the correct Zoom link.
EOCs are responsible for responding to alerts even on the weekends. Time should not be spent mitigating the incident unless it is a ~severity::1
or ~severity::2
. Mitigation for ~severity::3
and ~severity::4
incidents can occur during normal business hours, Monday-Friday. If you have any questions on this please reach out to an Infrastructure Engineering Manager.
If a ~severity::3
or ~severity::4
incident occurs multiple times and requires weekend work, the multiple incidents should be combined into a single ~severity::2
incident.
If assistance is needed to determine severity, EOCs and Incident Managers are encouraged to contact Reliability Leadership via PagerDuty.
During a verified Severity 1 Incident the IM will page for Infrastructure Leadership. This is not a substitute or replacement for the active Incident Manager. The Infrastructure Leadership responsibilities include:
:s1: **Incident on GitLab.com**
**— Summary —**
(include high level summary)
**— Customer Impact —**
(describe the impact to users including which service/access methods and what percentage of users)
**— Current Response —**
(bullet list of actions)
**— Production Issue —**
Main incident: (link to the incident)
Slack Channel: (link to incident slack channel)
Further support is available from the Scalability and Delivery Groups if required. Scalability leadership can be reached via PagerDuty Scalability Escalation (further details available on their team page). Delivery leadership can be reached via PagerDuty. See the Release Management Escalation steps on the Delivery group page.
For serious incidents that require coordinated communications across multiple channels, the Incident Manager will rely on the CMOC for the duration of the incident.
The GitLab Support team staffs an on-call rotation, reachable via the Incident Management - CMOC
service in PagerDuty. They have a section in the Support handbook for getting new CMOC people up to speed.
During an incident, the CMOC will:
- Create an incident on Status.io by using `/incident post-statuspage` on Slack. Any updates to the incident will have to be done manually by following these instructions.

If, during an incident, the EOC or Incident Manager decide to engage the CMOC, they should do so by paging the on-call person: use the `/pd trigger` command in Slack, then select the "Incident Management - CMOC" service from the modal.

If, during an S1 or S2 incident, it is determined that it would be beneficial to have a synchronous conversation with one or more customers, a new Zoom meeting should be used for that conversation. Typically there are two situations which would lead to this action:
Due to the overhead involved and the risk of detracting from impact mitigation efforts, this communication option should be used sparingly and only when a very clear and distinct need is present.
Implementing a direct customer interaction call for an incident is to be initiated by the current Incident Manager by taking these steps:
- Request a second Incident Manager by posting `/here A second incident manager is required for a customer interaction call for XXX`.

After learning of the history and current state of the incident, the Engineering Communications Lead will initiate and manage the customer interaction through these actions:

- Ensuring that all GitLab participants set their Zoom name to include `GitLab`, as well as their role (for example, `CSM` or `Engineering Communications Lead`).
In some scenarios it may be necessary for most or all participants of an incident (including the EOC, other developers, etc.) to work directly with a customer. In this case, the customer interaction Zoom shall be used, NOT the main GitLab Incident Zoom. This will allow for the conversation (as well as text chat) while still supporting the ability for primary responders to quickly resume internal communications in the main Incident Zoom. Since the main incident Zoom may be used for multiple incidents, it will also prevent the risk of confidential data leakage and avoid the inefficiency of having to frequently announce that there are customers in the main incident Zoom each time the call membership changes.
Corrective Actions (CAs) are work items that we create as a result of an incident. Only issues arising out of an incident should receive the label "corrective action". They are designed to prevent the same kind of incident or improve the time to mitigation and as such are part of the Incident Management cycle.
Work items identified in incidents that don't meet the Corrective Action criteria should be raised in the Reliability project and labeled with ~work::incident
rather than ~corrective action.
Corrective Actions should be related to the incident issue to help with downstream analysis, and it can be helpful to refer to the incident in the description of the issue.
Corrective Actions issues in the Reliability project should be created using the Corrective Action issue template to ensure consistency in format, labels, and application/monitoring of service level objectives for completion.
Badly worded | Better |
---|---|
Fix the issue that caused the outage | (Specific) Handle invalid postal code in user address form input safely |
Investigate monitoring for this scenario | (Actionable) Add alerting for all cases where this service returns >1% errors |
Make sure engineer checks that database schema can be parsed before updating | (Bounded) Add automated presubmit check for schema changes |
Improve architecture to be more reliable | (Time-bounded and specific) Add a redundant node to ensure we no longer have a single point of failure for the service |
Runbooks are available for engineers on call. The project README contains links to checklists for each of the above roles.
In the event of a GitLab.com outage, a mirror of the runbooks repository is available at https://ops.gitlab.net/gitlab-com/runbooks.
The chatops bot will give you this information if you DM it with /chatops run oncall prod
.
The current EOC can be contacted via the @sre-oncall
handle in Slack, but please only use this handle in the following scenarios.
- Escalations involving active incidents or issues labeled ~blocks deployments.

The EOC will respond as soon as they can to the usage of the @sre-oncall
handle in Slack, but depending on circumstances, may not be immediately available. If it is an emergency and you need an immediate response, please see the Reporting an Incident section.
If you are a GitLab team member and would like to report a possible incident related to GitLab.com and have the EOC paged in to respond, choose one of the reporting methods below. Regardless of the method chosen, please stay online until the EOC has had a chance to come online and engage with you regarding the incident. Thanks for your help!
Type /incident declare
in the #production
channel in GitLab's Slack and follow the prompts to open an incident issue.
It is always better to err on the side of choosing a higher severity and declaring an incident for a production issue, even if you aren't sure.
Reporting high severity bugs via this process is the preferred path so that we can make sure we engage the appropriate engineering teams as needed.
Incident Declaration Slack window
Field | Description |
---|---|
Title | Give the incident as descriptive a title as you can. Please include the date in the format YYYY-MM-DD, which should be present by default. |
Severity | If unsure about the severity, but you are seeing a large amount of customer impact, please select S1 or S2. More details here: Incident Severity. |
Service | If possible, select a service that is likely the cause of the incident. Sometimes this will also be the service impacted. If you are unsure, it is fine to leave this empty. |
Page engineer on-call / incident manager / communications manager on-call | Leave these checked unless the incident is severity 1 or severity 2, and does not require immediate engagement (this is unusual), or if the person submitting the incident is the EOC. We will not page anyone for severity 3 and severity 4 incidents, even if the boxes are checked. |
Confidential | This will mark the issue confidential, do this for all security related issues or incidents that primarily contain information that is not SAFE. We generally prefer to leave this unchecked, and use confidential notes for information that cannot be public. |
Incident Declaration Results
As well as opening a GitLab incident issue, a dedicated incident Slack channel will be opened. The "woodhouse" bot will post links to all of these resources in the main #incident-management
channel. Please note that unless you're an SRE, you won't be able to post in #incident-management
directly. Please join the dedicated Slack channel, created and linked as a result of the incident declaration, to discuss the incident with the on-call engineer.
Email gitlab-production-eoc@gitlab.pagerduty.com. This will immediately page the Engineer On Call.
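The email page can also be raised programmatically, for example from a monitoring job. Below is a minimal, hedged sketch using Python's standard library; the sender address and SMTP relay are placeholders, not part of GitLab's documented tooling.

```python
# Minimal sketch: page the EOC by emailing the PagerDuty integration address.
# The sender address and SMTP relay below are placeholders; adjust for your mail setup.
import smtplib
from email.message import EmailMessage

msg = EmailMessage()
msg["From"] = "reporter@example.com"          # hypothetical sender
msg["To"] = "gitlab-production-eoc@gitlab.pagerduty.com"
msg["Subject"] = "Elevated error rates on GitLab.com web service"
msg.set_content("Describe the impact, the affected service, and link any relevant dashboards or issues.")

with smtplib.SMTP("smtp.example.com") as smtp:  # hypothetical relay
    smtp.send_message(msg)
```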
This is a first revision of the definition of Service Disruption (Outage), Partial Service Disruption, and Degraded Performance per the terms on Status.io. Data is based on the graphs from the Key Service Metrics Dashboard.
Outage and Degraded Performance incidents occur when:

- `Degraded` is defined as any sustained 5 minute time period where a service is below its documented Apdex SLO or above its documented error ratio SLO.
- An `Outage` (Status = Disruption) is defined as a 5 minute sustained error rate above the Outage line on the error ratio graph.

In both cases of Degraded or Outage, once an event has lasted 5 minutes, the Engineer On Call and the Incident Manager should engage the CMOC to help with external communications. All incidents with a total duration of more than 5 minutes should be publicly communicated as quickly as possible (including "blip" incidents), and within 1 hour of the incident occurring.

SLOs are documented in `runbooks/rules`.
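For illustration only, the sketch below shows how a five-minute window of per-minute error-ratio samples could be classified against these definitions. The threshold constants and function name are hypothetical placeholders (the authoritative SLO values live in the runbooks), and the Apdex side of the definition is omitted for brevity.

```python
# Hypothetical sketch of the Degraded/Outage classification described above.
# The thresholds are placeholders; real values live in the runbooks/rules
# definitions for each service.
ERROR_RATIO_SLO = 0.005   # assumed: 0.5% errors allowed by the service SLO
OUTAGE_THRESHOLD = 0.05   # assumed: error ratio marking the "Outage line"

def classify_window(error_ratios: list[float]) -> str:
    """Classify a sustained 5 minute window of per-minute error ratios."""
    if len(error_ratios) < 5:
        return "insufficient data"
    if all(ratio > OUTAGE_THRESHOLD for ratio in error_ratios):
        return "Outage (Status = Disruption)"
    if all(ratio > ERROR_RATIO_SLO for ratio in error_ratios):
        return "Degraded"
    return "Operational"

print(classify_window([0.010, 0.020, 0.015, 0.012, 0.011]))  # -> Degraded
```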
To check if we are Degraded or Disrupted for GitLab.com, we look at these graphs:
A Partial Service Disruption is when only part of the GitLab.com services or infrastructure is experiencing an incident. Examples of partial service disruptions are instances where GitLab.com is operating normally except there are:
In the case of high severity bugs, we prefer that an incident issue is still created via Reporting an Incident. This will give us an incident issue on which to track the events and response.
In the case of a high severity bug that is in an ongoing, or upcoming deployment please follow the steps to Block a Deployment.
If an incident may be security related, engage the Security Engineer on-call by using /security
in Slack. More detail can be found in Engaging the Security Engineer On-Call.
Information is an asset to everyone impacted by an incident. Properly managing the flow of information is critical to minimizing surprise and setting expectations. We aim to keep interested stakeholders apprised of developments in a timely fashion so they can plan appropriately.
This flow is determined by:
Furthermore, avoiding information overload is necessary to keep every stakeholder’s focus.
To that end, we will have:
- A dedicated `#incident-management` room in Slack.
- The `#incident-management` channel for internal updates.

We manage incident communication using status.io, which updates status.gitlab.com. Incidents in status.io have state and status and are updated by the incident owner.
To create an incident on status.io, you can use /incident post-statuspage
on Slack.
In some cases, we may choose not to post to status.io; the following are examples of situations where we may skip a post/tweet. In some cases, this helps protect the security of self-managed instances until we have released the security update.
Definitions and rules for transitioning state and status are as follows.
State | Definition |
---|---|
Investigating | The incident has just been discovered and there is not yet a clear understanding of the impact or cause. If an incident remains in this state for longer than 30 minutes after the EOC has engaged, the incident should be escalated to the Incident Manager On Call. |
Active | The incident is in progress and has not yet been mitigated. Note: Incidents should not be left in an Active state once the impact has been mitigated |
Identified | The cause of the incident is believed to have been identified and a step to mitigate has been planned and agreed upon. |
Monitoring | The step has been executed and metrics are being watched to ensure that we're operating at a baseline. If there is a clear understanding of the specific mitigation leading to resolution and high confidence in the fact that the impact will not recur it is preferable to skip this state. |
Resolved | The impact of the incident has been mitigated and status is again Operational. Once resolved the incident can be marked for review and Corrective Actions can be defined. |
Status can be set independent of state. The only time these must align is when an incident is resolved and the status returns to Operational.
Status | Definition |
---|---|
Operational | The default status before an incident is opened and after an incident has been resolved. All systems are operating normally. |
Degraded Performance | Users are impacted intermittently, but the impact is not observed in metrics, nor reported, to be widespread or systemic. |
Partial Service Disruption | Users are impacted at a rate that violates our SLO. The Incident Manager On Call must be engaged and monitoring to resolution is required to last longer than 30 minutes. |
Service Disruption | This is an outage. The Incident Manager On Call must be engaged. |
Security Issue | A security vulnerability has been declared public and the security team has requested that it be published on the status page. |
Incident severity should be assigned at the beginning of an incident to ensure proper response across the organization. Incident severity should be determined based on the information that is available at the time. Severities can and should be adjusted as more information becomes available. The severity level reflects the maximum impact the incident had and should remain in that level even after the incident was mitigated or resolved.
Incident Managers and Engineers On-Call can use the following table as a guide for assigning incident severity.
Severity | Description | Example Incidents |
---|---|---|
~severity::1 | - GitLab.com is unavailable or severely degraded for the typical GitLab user - Any data loss directly impacting customers - The guaranteed self-managed release date is put in jeopardy - It is a high impact security incident - It is an internally facing incident with full loss of metrics observability (Prometheus down) Incident Managers should be paged for all ~severity::1 incidents | Past severity::1 Issues |
~severity::2 | - There is a recorded impact to the availability of one or more GitLab.com Primary Services with a weight > 0; this includes api, container registry, git access, and web - GitLab.com is unavailable or degraded for a small subset of users - GitLab.com is degraded but a reasonable workaround is available (includes widespread frontend degradations) - Any moderate impact security incident Incident Managers should be paged for all ~severity::2 incidents | Past severity::2 Incidents |
~severity::3 | - Broad impact on GitLab.com and minor inconvenience to the typical user's workflow - A workaround is not needed - Any low impact security incident - Most internally facing issues pertaining to blocked deployments | Past severity::3 Incidents |
~severity::4 | - Minimal impact on the typical GitLab.com user's workflow | Past severity::4 Incidents |
There are four data classification levels defined in GitLab's Data Classification Standard.
The Incident Manager should exercise caution and their best judgement; in general, we prefer to use internal notes instead of marking an entire issue confidential if possible. A couple of lines of non-descript log data may not represent a data security concern, but a larger set of log, query, or other data must have more restrictive access. If assistance is required, follow the Infrastructure Leadership Escalation process.
In order to effectively track specific metrics and have a single pane of glass for incidents and their reviews, specific labels are used. The below workflow diagram describes the path an incident takes from open
to closed
. All S1
incidents require a review, other incidents can also be reviewed as described here.
GitLab uses the Incident Management feature of the GitLab application. Incidents are reported and closed when they are resolved. A resolved incident means the degradation has ended and will not likely re-occur.
If there is additional follow-up work that requires more time after an incident is resolved and closed (like a detailed root cause analysis or a corrective action), a new issue may need to be created and linked to the incident issue. It is important to add as much information as possible as soon as an incident is resolved, while the information is fresh; this includes a high-level summary and a timeline where applicable.
The EOC and the Incident Manager On Call, at the time of the incident, are the default assignees for an incident issue. They are the assignees for the entire workflow of the incident issue.
Incidents use the Timeline Events feature, the timeline can be viewed by selecting the "Timeline" tab on the incident.
By default, all label events are added to the Timeline, this includes ~"Incident::Mitigated"
and ~"Incident::Resolved"
.
At a minimum, the timeline should include the start and end times of user impact.
You may also want to highlight notes in the discussion, this is done by selecting the clock icon on the note which will automatically add it to the timeline.
For adding timeline items quickly, use the quick action, for example:
/timeline DB load spiked resulting in performance issues | 2022-09-07 09:30
/timeline DB load spike mitigated by blocking malicious traffic | 2022-09-07 10:00
The following labels are used to track the incident lifecycle from active incident to completed incident review. Label Source
In order to help with attribution, we also label each incident with a scoped label for the Infrastructure Service (Service::) and Group (group::) scoped labels among others.
Label | Workflow State |
---|---|
~Incident::Active | Indicates that the labeled incident is active and ongoing. Initial severity is assigned when it is opened. |
~Incident::Mitigated | Indicates that the incident has been mitigated. A mitigated issue means that the impact is significantly reduced and immediate post-incident activity is ongoing (monitoring, messaging, etc.). The mitigated state should not be used for silenced alerts, or alerts that may reoccur; in both cases you should mark the incident as resolved and close it. |
~Incident::Resolved | Indicates that SRE engagement with the incident has ended and the condition that triggered the alert has been resolved. Incident severity is re-assessed to determine whether the initial severity is still correct and, if it is not, it is changed to the correct severity. Once an incident is resolved, the issue will be closed. |
~Incident::Review-Completed | Indicates that an incident review has been completed. This should be added to an incident after the review is completed if it has the ~review-requested label. |
Labeling incidents with similar causes helps develop insight into overall trends and when combined with Service attribution, improved understanding of Service behavior. Indicating a single root cause is desirable and in cases where there appear to be multiple root causes, indicate the root cause which precipitated the incident.
The EOC, as DRI of the incident, is responsible for determining root cause.
The current Root Cause labels are listed below. In order to support trend awareness these labels are meant to be high-level, not too numerous, and as consistent as possible over time.
Root Cause | Description |
---|---|
~RootCause::Config-Change | configuration change, other than a feature flag being toggled |
~RootCause::Database-Failover | database failover event |
~RootCause::DB-Migration | resulting from a database migration or a post-deploy migration |
~RootCause::ExternalAgentMaliciousBehavior | ostensibly malicious behavior by an external agent |
~RootCause::External-Dependency | resulting from the failure of a dependency external to GitLab, including various service providers. Use of other causes (such as ~RootCause::SPoF or ~RootCause::Saturation) should be strongly considered for most incidents. |
~RootCause::FalseAlarm | an incident was created by a page that isn't actionable and should result in adjusting the alert or deleting it |
~RootCause::Feature-Flag | a feature flag toggled in some way (off or on, or a new percentage or target was chosen for the feature flag) |
~RootCause::Flaky-Test | an incident, usually a deployment pipeline failure, found to have been caused by a flaky QA test |
~RootCause::GCP-Networking | GCP networking event |
~RootCause::Indeterminate | the incident has been investigated, but the root cause continues to be unknown and an agreement has been formed to not pursue any further investigation |
~RootCause::Known-Software-Issue | known/existing technical debt in the product that has yet to be addressed |
~RootCause::Malicious-Traffic | deliberate malicious activity targeted at GitLab or customers of GitLab (e.g. DDoS) |
~RootCause::Naive-Traffic | elevated external traffic exhibiting anti-pattern behavior for interface usage |
~RootCause::Release-Compatibility | forward- or backwards-compatibility issues between subsequent releases of the software running concurrently, and sharing state, in a single environment (e.g. Canary and Main stage releases). They can be caused by incompatible database DDL changes, canary browser clients accessing non-canary APIs, or by incompatibilities between Redis values read by different versions of the application. |
~RootCause::Saturation | failure resulting from a service or component which failed to scale in response to increasing demand (whether or not it was expected) |
~RootCause::Security | an incident where the SIRT team was engaged, generally via a request originating from the SIRT team or in a situation where Reliability has paged SIRT to assist in the mitigation of an incident not caused by ~RootCause::Malicious-Traffic |
~RootCause::Software-Change | feature or other code change |
~RootCause::SPoF | the failure of a service or component which is an architectural SPoF (Single Point of Failure) |
We want to be able to report on a scope of incidents which have met a level of impact which necessitated customer communications. An underlying assumption is that any material impact will always be communicated in some form. Incidents are to be labeled indicating communications even if the impact is later determined to be lesser, or when the communication is done by mistake.
Note: This does not include Contact Requests where the communication is due to identifying a cause.
The CMOC is responsible for ensuring this label is set for all incidents involving use of the Status Page or where other direct notification to a set of customers is completed (such as via Zendesk).
Customer Communications | Description |
---|---|
~Incident-Comms::Status-Page | Incident communication included use of the public GitLab Status Page |
~Incident-Comms::Private | Incident communication was limited to fewer customers or otherwise was only directly communicated to impacted customers (not via the GitLab Status Page) |
~Contact Request | Applied to issues where it is requested that Support contact a user or customer |
~Contact Request::Awaiting Contact | Support has yet to contact the user(s) in question in this issue. |
~Contact Request::Contacted | Support has contacted the user(s) in question and is awaiting confirmation from Production. |
~CMOC Required | Marks issues that require CMOC involvement. |
- For changes to configuration or code maintained by the Infrastructure department, use the ~Service::Infrastructure label.
- If the affected service is not yet known, use the ~Service::Unknown label until more information is available.

It is not always very clear which service label to apply, especially when causes span service boundaries that we define in Infrastructure. When unsure, it's best to choose a label that corresponds to the primary cause of the incident, even if other services are involved.
The following services should primarily be used for application code changes or feature flag changes, not changes to configuration or code maintained by the Infrastructure department:
- ~Service::API
- ~Service::Web
- ~Service::Git
- ~Service::Registry
- ~Service::Pages
- ~Service::Gitaly
- ~Service::GitLab Rails
Service labeling examples:
Example | Outcome |
---|---|
An incident is declared but we don't know yet what caused the impact. | Use ~Service::Unknown |
A bug was deployed to Production that primarily impacted API traffic. | Use ~Service::API |
A bug in frontend code caused problems for most browser sessions. | Use ~Service::Web |
A feature flag was toggled that caused a problem that mostly affected Git traffic. | Use ~Service::Git |
A bad configuration value in one of our Kubernetes manifests caused a service disruption on the Registry service. | Use ~Service::Infrastructure |
A mis-configuration of Cloudflare caused some requests to be cached improperly. | Use ~Service::Cloudflare |
Monitoring stopped working due to a Kubernetes configuration update on Prometheus | Use ~Service::Prometheus |
A site-wide outage caused by a configuration change to Patroni. | Use ~Service::Patroni |
A degradation in service due to missing index on a table. | Use ~Service::GitLab Rails |
The following labels are added and removed automatically by triage-ops:
Needs Label | Description |
---|---|
~{RootCause,Service,CorrectiveActions}::Needed | Will be added automatically if the corresponding label has not been set. If this label persists, the DRI of the incident will be mentioned in a note asking them to correctly label the incident. |
~{RootCause,Service,CorrectiveActions}::NotNeeded | In rare cases, the corresponding label won't be needed; this label can be used to disable the periodic notes reminding the DRI to update the label. |
These labels are always required on incident issues.
Label | Purpose |
---|---|
~Service::* | Scoped label for service attribution. Used in metrics and error budgeting. |
~Severity::* (automatically applied) | Scoped label for severity assignment. Details on severity selection can be found in the availability severities section. |
~RootCause::* | Scoped label indicating the root cause of the incident. |
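Because these labels are required on every incident issue, it can be useful to audit open incidents for any that are missing them. The following sketch is illustrative rather than an official tool; it uses the python-gitlab client, and the project path and token placeholder are assumptions to adapt to your environment.

```python
# Sketch: list open incidents missing one of the always-required scoped labels.
# The project path and token are placeholders, not authoritative values.
import gitlab

gl = gitlab.Gitlab("https://gitlab.com", private_token="<redacted>")
project = gl.projects.get("gitlab-com/gl-infra/production")  # assumed incident tracker path

required_prefixes = ("Service::", "Severity::", "RootCause::")

for issue in project.issues.list(labels=["incident"], state="opened", iterator=True):
    missing = [prefix for prefix in required_prefixes
               if not any(label.startswith(prefix) for label in issue.labels)]
    if missing:
        print(f"{issue.web_url} is missing: {', '.join(missing)}")
```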
These labels are added to incident issues as a mechanism to add metadata for the purposes of metrics and tracking.
Label | Purpose |
---|---|
~incident (automatically applied) | Label used for metrics tracking and immediate identification of incident issues. |
~self-managed | Indicates that an incident is exclusively an incident for self-managed GitLab. Example self-managed incident issue |
~incident-type::automated traffic | The incident occurred due to activity from security scanners, crawlers, or other automated traffic |
~backstage | Indicates that the incident is internally facing, rather than having a direct impact on customers. Examples include issues with monitoring, backups, failing tests, self-managed release or general deploy pipeline problems. |
~group::* | Any development group(s) related to this incident |
~review-requested | Indicates that the incident would benefit from undergoing additional review. All S1 incidents are required to have a review. Additionally, anyone including the EOC can request an incident review on any severity issue. Although the review will help to derive corrective actions, it is expected that corrective actions are filed whether or not a review is requested. If an incident does not have any corrective actions, this is probably a good reason to request a review for additional discussion. |
~Incident-Comms::* | Scoped label indicating the level of communications. |
~blocks deployments | Indicates that if the incident is active, it will be a blocker for deployments. This is automatically applied to ~severity::1 and ~severity::2 incidents. The EOC or Release Manager can remove this label if it is safe to deploy while the incident is active. A comment must accompany the removal stating the safety or reasoning that enables deployments to continue. This label may also be applied to lower severity incidents if needed. |
~blocks feature-flags | Indicates that while the incident is active, it will be a blocker for changes to feature flags. This is automatically applied to ~severity::1 and ~severity::2 incidents. The EOC or Release Manager can remove this label if there is no risk in making feature flag changes while the incident is active. A comment must accompany the removal stating the safety or reasoning that enables feature flag changes to continue. This label may also be applied to lower severity incidents if needed. |
~Delivery impact::* | Indicates the level of impact this incident is having on GitLab deployments and releases |
When an incident is created that is a duplicate of an existing incident it is up to the EOC to mark it as a duplicate. In the case where we mark an incident as a duplicate, we should issue the following slash command and remove all labels on the incident issue:
/duplicate <incident issue>
There are related issue links on the incident template that should be used to create related issues from an incident.
- Once an incident is labeled Incident::Resolved, the incident issue will be closed.
- Severity::1 incidents will automatically be labeled with review-requested.
If an alert silence is created for an active incident, the incident should be resolved with the ~"alertmanager-silence"
label and the appropriate root cause label if it is known.
There should also be a linked ~infradev issue for the long term solution or an investigation issue created using the related issue links on the incident template.
The board which tracks all GitLab.com incidents from active to reviewed is located here.
A near miss, "near hit", or "close call" is an unplanned event that has the potential to cause, but does not actually result in an incident.
In the United States, the Aviation Safety Reporting System has been collecting reports of close calls since 1976. Due to near miss observations and other technological improvements, the rate of fatal accidents has dropped about 65 percent. source
Near misses are like a vaccine. They help the company better defend against more serious errors in the future, without harming anyone or anything in the process.
When a near miss occurs, we should treat it in a similar manner to a normal incident.
Near-miss issues should be labeled with the ~Near Miss label.