The primary goals of writing an Incident Review are to ensure that the incident is documented, that all contributing root cause(s) are well understood, and, especially, that effective preventive actions are put in place to reduce the likelihood and/or impact of recurrence. [1]
If an incident matches any of the following criteria, an incident review must be completed:
- The incident is ~severity::1 or ~severity::2.
- The ~review-requested label is added to the issue.
The incident review should be presented in the synchronous review meeting only if the incident is ~severity::1, or a review has been requested via the ~review-requested label.
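The criteria above can be sketched as a small label check. This is a minimal illustration, not part of any GitLab tooling: the function names are hypothetical, labels are written without the leading `~`, and the `review-requested` spelling is assumed.

```python
# Hypothetical helpers mirroring the review criteria described above.
REVIEW_REQUIRED_SEVERITIES = {"severity::1", "severity::2"}
SYNC_REVIEW_SEVERITIES = {"severity::1"}
REVIEW_REQUESTED = "review-requested"  # assumed label spelling


def review_required(labels):
    """Return True if an incident review must be completed."""
    labels = set(labels)
    return bool(labels & REVIEW_REQUIRED_SEVERITIES) or REVIEW_REQUESTED in labels


def sync_presentation_expected(labels):
    """Return True if the review should be presented in the synchronous meeting."""
    labels = set(labels)
    return bool(labels & SYNC_REVIEW_SEVERITIES) or REVIEW_REQUESTED in labels
```

Under these rules, a ~severity::2 incident requires a written review but is presented synchronously only when a review has been requested.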
It is recommended to follow the incident review process for any of the following events:
A review can be optionally conducted for incidents which do not meet the above criteria. In cases where an optional review is conducted, it is not necessary to fill out a complete review. For the sake of expediency, you can complete areas of the review which highlight what you, as the review author, want to bring to the attention of the larger organization.
The first step in the Incident Review process is the synchronous review of the incident by representatives of the teams involved in its resolution. This step is conducted as close to the incident date as possible and does not require a complete Incident Review write-up. The outcome of this first step should be a published Incident Review, per defined timelines.
The second step in the Incident Review process is engaging with the customer through a point of contact such as a TAM. This should always involve sharing the findings from the first step in an async form. If a customer requires a sync meeting to discuss the findings, Infrastructure management will organise the discussion with the relevant stakeholders, per defined timelines.
~Incident:Review In Progress on the Production Incidents Board to:
Incident Reviews are conducted in the incident issue and their workflow is tracked on the Production Incidents Board.
~Corrective Action. Labeling and linking existing issues as corrective action is appropriate.
Target SLO for ~Corrective Action issues, by severity:

Corrective action issue severity | SLO (time after the issue has been created) |
---|---|
severity::1 | 1 week |
severity::2 | 30 days |
severity::3 | 60 days |
severity::4 | 90 days |

(coming from this link)
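As a worked example of the SLO targets in the table, the sketch below computes a target date from an issue's creation date. It assumes the SLO is counted in calendar days from creation; the mapping and function names are hypothetical and only illustrate the arithmetic.

```python
from datetime import date, timedelta

# SLO per corrective action severity, taken from the table above.
# Assumption: calendar days, counted from the issue creation date.
CORRECTIVE_ACTION_SLO = {
    "severity::1": timedelta(weeks=1),
    "severity::2": timedelta(days=30),
    "severity::3": timedelta(days=60),
    "severity::4": timedelta(days=90),
}


def corrective_action_due(severity, created_on):
    """Return the target completion date for a corrective action issue."""
    return created_on + CORRECTIVE_ACTION_SLO[severity]


if __name__ == "__main__":
    created = date(2024, 1, 1)
    for severity in CORRECTIVE_ACTION_SLO:
        print(severity, corrective_action_due(severity, created).isoformat())
```

For an issue created on 2024-01-01, a severity::1 corrective action would be due on 2024-01-08 and a severity::2 one on 2024-01-31.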
All ~Corrective Action issues must have an assigned priority label; it is the responsibility of the DRI to ensure that the priorities are set.
Once ~Corrective Action issues have been linked, and notes from the review are incorporated into the Incident Review issue, the incident review issue can be closed.
Incident review sessions are on the GitLab Team Meetings calendar under the title "Incident Review Recurring Sessions" and occur at the following two times:
The assigned IMOC is responsible for properly labeling the Incident Review issue (see Incident-Management#labeling) with ~Incident::Review-Scheduled and adding it to the agenda.
GitLab team members are encouraged to review the issues listed in the agenda and add questions or comments. The IMOC assigned to a review is responsible for ensuring that stakeholders outside of Infrastructure are aware of the review. This can be achieved by inviting them to the Google Calendar event for the session in which the incident will be discussed, posting in their teams' Slack channels, at-mentioning them in the issue, assigning them directly in the Google Doc, or some combination of those options. If participation from stakeholders outside of the Infrastructure department in either the async or sync review is not enough to build a sufficient understanding of the situation and corrective actions, the IMOC will follow the infradev process and escalate to the appropriate stakeholders.
The purpose of these sessions is to encourage discussion and questions, and to ensure that all aspects of the incident are reviewed, including:
To circulate the findings of the incident review to a wider audience, the IMOC should include a link to completed incident reviews in the GitLab SaaS Infrastructure meeting agenda.
[1] Google SRE, Chapter 15 - Postmortem Culture: Learning from Failure