The primary goals of writing an Incident Review are to ensure that the incident is documented, that all contributing root cause(s) are well understood, and, especially, that effective preventive actions are put in place to reduce the likelihood and/or impact of recurrence.[1]
Not every incident requires a review. However, if an incident matches any of the following criteria, an incident review must be completed:
The first step in the Incident Review process is a synchronous review of the incident by representatives of the teams involved in resolving it. This step is conducted as close to the incident date as possible and does not require a complete Incident Review write-up. The outcome of this first step should be a published Incident Review, per defined timelines.
The second step of the Incident Review process is engaging with the customer through a point of contact such as a TAM. This should always involve sharing the findings from the first step in an async form. If a customer requires a sync call to discuss the findings, Infrastructure management will organise the discussion with the important stakeholders of this process, per defined timelines.
Incident Reviews are conducted in the incident issue and their workflow is tracked on the Production Incidents Board.
~Corrective Action. Labeling and linking existing issues as corrective action is appropriate.
~Corrective Action in this table:
|Corrective action of issue severity||SLO (days after issue has been created)|
(coming from this link)
~Corrective Action must have an assigned priority label. It is the responsibility of the DRI to ensure that the priorities are set.
~Corrective Action issues have been linked, and notes from the review are incorporated into the Incident Review issue, the incident review issue can be closed.
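The closing condition described above can be sketched as a small check. This is only an illustration, not tooling that exists in the process: the `Issue` class, the `missing_priority` and `ready_to_close` helpers, and the `priority::` label prefix are all assumptions made for the example, and it treats the review as closable when notes are incorporated and every linked corrective action carries a priority label.

```python
from dataclasses import dataclass, field


@dataclass
class Issue:
    """Hypothetical stand-in for a tracked issue; not a GitLab API object."""
    title: str
    labels: set[str] = field(default_factory=set)


CORRECTIVE_ACTION = "Corrective Action"
# Assumption: priority labels share a common prefix such as "priority::".
PRIORITY_PREFIX = "priority::"


def missing_priority(linked_issues: list[Issue]) -> list[Issue]:
    """Return linked corrective action issues that have no priority label."""
    return [
        issue
        for issue in linked_issues
        if CORRECTIVE_ACTION in issue.labels
        and not any(label.startswith(PRIORITY_PREFIX) for label in issue.labels)
    ]


def ready_to_close(linked_issues: list[Issue], notes_incorporated: bool) -> bool:
    """True when review notes are incorporated and no linked corrective
    action is still waiting for a priority label."""
    return notes_incorporated and not missing_priority(linked_issues)
```

A DRI could use a check of this shape to list the corrective actions that still need a priority before closing the review issue.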
Incident review sessions are open on the GitLab Team Meetings calendar with the title "Incident Review Recurring Sessions" and occur at the following two times:
GitLab team members are encouraged to review the issues listed in the agenda and add questions/comments. The IMOC assigned to a review is responsible for ensuring that stakeholders outside of Infrastructure are aware of the review. This can be achieved by inviting them to the Google Calendar event for when the incident will be discussed, posting in their teams' Slack channels, at-mentioning them in the issue, assigning them directly in the Google Doc, or some combination of those options. If participation from stakeholders outside of the Infrastructure department in either the async or sync review is not enough to build a sufficient understanding of the situation and corrective actions, the IMOC will invoke the infradev process and escalate to the appropriate stakeholders.
The purpose of these sessions is to encourage discussion and questions, and to ensure that all aspects of the incident are reviewed, including:
[1] Google SRE Chapter 15 - Postmortem Culture: Learning from Failure