The primary goals of writing an Incident Review are to ensure that the incident is documented, that all contributing root causes are well understood, and, especially, that effective preventive actions are put in place to reduce the likelihood and/or impact of recurrence. [1]
The DRI for the incident review is the EOC who was present when the incident occurred.
If an incident matches any of the following criteria, the ~Incident::ReviewCompleted label must be added to the incident after a review is completed.
~review-requested label is added to the issue
A review may be conducted synchronously, by adding it to the agenda of the weekly incident review meeting, or asynchronously, by filling out the relevant parts of the incident review section. As a general guideline, it is recommended to follow the incident review process for any of the following events:
A review can optionally be conducted for incidents that do not meet the above criteria, but keep in mind that synchronous meetings are demanding of our time and we do our best to embrace asynchronous communication.
In cases where an optional review is conducted, it is not necessary to fill out a complete review. For the sake of expediency, you can complete areas of the review which highlight what you, as the review author, want to bring to the attention of the larger organization and which drive the generation of corrective actions related to the incident.
When requesting a review, it is important to add an explanation in addition to the ~review-requested label. This will help guide the DRI and set expectations.
The following are examples of explanations one might give when adding the ~review-requested label:
Adding ~review-requested as I would like to discuss this issue with a representative from the QA and Verify team in the weekly incident review meeting.
Adding ~review-requested as the incident review section is missing an assessment of how many customers were impacted. This information would help prioritize proposed fixes.
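The pairing of label and explanation can be sketched in code. The following is a minimal sketch, not part of the documented process: `build_review_request` is a hypothetical helper that refuses to produce a request without an explanation and emits a comment body ending in GitLab's `/label` quick action, so that posting the comment would apply the label. How the comment is posted (web UI or API) is left out.

```python
REVIEW_LABEL = "review-requested"  # label name taken from the text above


def build_review_request(explanation: str) -> str:
    """Return a comment body that explains the request and applies the label.

    Hypothetical helper for illustration only.
    """
    reason = explanation.strip()
    if not reason:
        # The process asks for an explanation alongside the label.
        raise ValueError("an explanation is required when requesting a review")
    # `/label ~name` is a GitLab quick action; it goes on its own line.
    return f"Adding ~{REVIEW_LABEL} as {reason}\n\n/label ~{REVIEW_LABEL}"


if __name__ == "__main__":
    print(build_review_request(
        "the incident review section is missing an assessment of how many "
        "customers were impacted. This information would help prioritize "
        "proposed fixes."
    ))
```

Rejecting an empty explanation up front mirrors the guidance that the label alone does not give the DRI enough context.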
The first step in the Incident Review process is the synchronous review of the incident by representatives of the teams involved in the resolution of the incident. This step is conducted as close to the incident date as possible and does not require a complete Incident Review write-up. The outcome of this first step should be a published Incident Review, per defined timelines.
The second step of the Incident Review process is engaging with the customer through a point of contact such as a Technical Account Manager (TAM). The TAM can self-serve by reviewing the SaaS weekly meeting for an overview of recent incidents, and by reviewing the findings from the first step asynchronously. If a customer requires a synchronous meeting to discuss the findings, the TAM can engage with Infrastructure management to organise the discussion with the important stakeholders of this process, per defined timelines:
Incident Reviews are conducted in the incident issue and their workflow is tracked on the Production Incidents Board.
~Corrective Action. Labeling and linking existing issues as corrective action is appropriate.
~Corrective Action in this table:
| Corrective action of issue severity | SLO (days after issue has been created) |
(coming from this link)
~Corrective Action must have an assigned priority label; it is the responsibility of the DRI to ensure that the priorities are set.
~Corrective Action issues have been linked, and notes from the review are incorporated into the Incident Review issue, the Incident::Review-Completed label can be added to the incident.
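The conditions above can be read as a simple readiness check before the completion label is applied. The sketch below is illustrative only: the `Issue` type, the function names, and the `priority::` label prefix are assumptions rather than anything defined by this process, and treating "at least one linked corrective action, each with a priority label" as a precondition is one interpretation of the text.

```python
from dataclasses import dataclass, field

CORRECTIVE_ACTION = "Corrective Action"        # label name from the text above
REVIEW_COMPLETED = "Incident::Review-Completed"  # label name from the text above
PRIORITY_PREFIX = "priority::"  # assumption: priority labels share this prefix


@dataclass
class Issue:
    """Hypothetical stand-in for an issue linked to the incident."""
    title: str
    labels: set[str] = field(default_factory=set)


def missing_priority(issues: list[Issue]) -> list[str]:
    """Titles of corrective actions that have no priority label yet."""
    return [
        issue.title
        for issue in issues
        if CORRECTIVE_ACTION in issue.labels
        and not any(label.startswith(PRIORITY_PREFIX) for label in issue.labels)
    ]


def can_mark_review_completed(linked: list[Issue], notes_incorporated: bool) -> bool:
    """True when corrective actions are linked and prioritised, and review
    notes have been incorporated into the incident issue."""
    corrective = [i for i in linked if CORRECTIVE_ACTION in i.labels]
    return bool(corrective) and notes_incorporated and not missing_priority(corrective)


if __name__ == "__main__":
    linked = [
        Issue("Add alert for queue depth", {CORRECTIVE_ACTION, "priority::2"}),
        Issue("Update runbook", {CORRECTIVE_ACTION}),
    ]
    print("Missing priority:", missing_priority(linked))
    if can_mark_review_completed(linked, notes_incorporated=True):
        print(f"Ready to add {REVIEW_COMPLETED}")
```

Reporting the issues that lack a priority, rather than only returning a boolean, gives the DRI a concrete list to act on.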
Incident review sessions are open on the GitLab Team Meetings calendar with the title Incident Review Recurring Sessions and occur at the following two times:
The assigned Incident Manager is responsible for adding it to the agenda.
GitLab team members are encouraged to review the issues listed in the agenda and add questions/comments. The Incident Manager assigned to a review is responsible for ensuring that stakeholders outside of Infrastructure are aware of the review. This can be achieved by inviting them to the Google Calendar event for when the incident will be discussed, posting in their teams' Slack channels, at-mentioning them in the issue, assigning them directly in the Google Doc, or some combination of those options. If participation from stakeholders outside of the Infrastructure department in either the async or sync review is not enough to build a sufficient understanding of the situation and corrective actions, the Incident Manager will follow the infradev process and escalate to the appropriate stakeholders.
The purpose of these sessions is to encourage discussion and questions, and to ensure that all aspects of the incident are reviewed, including:
In order to circulate the findings of the incident review across a wider audience, the Incident Manager should include a link to completed incident reviews in the GitLab SaaS Infrastructure meeting agenda.
[1] Google SRE Chapter 15 - Postmortem Culture: Learning from Failure