Gitlab hero border pattern left svg Gitlab hero border pattern right svg

Incident Management for Self-Managed Customers

On this page


Incident Management

In order to effectively manage the various elements during an Incident (both internal and customer specific), it is necessary to define the roles and responsibilities of the participants. Having a clear set of guidelines to facilitate the management of the incident reduces confusion and improves the effectiveness of communication. This is the first iteration of a high level Incident Management process for GitLab. This is loosely based on ITIL and ISO 20000, but will not adopt these methodologies in their entirety.

Roles and Responsibilities of Incident Team

Incident Manager - responsible for ensuring the Incident Team adheres to the Incident Management framework and involves appropriate teams/team members in the Incident Analyst/Group for technical troubleshooting and Service Restoration.

Incident Coordinator - facilitates communication between the Incident Team and the business as well as ensuring SLA’s are met at all stages of the Incident. Also provides updates to the affected customer/customers through the appropriate channel.

Incident Analyst/Group - responsible for technical troubleshooting and returning the service to normal operation as quickly as possible. Initial focus is on Service Restoration, not Root Cause Analysis nor Service Improvement.

Executive Sponsor - where necessary, a member of leadership who will engage with customers’ leadership team/point of contact.

High Level Process

Severity 0 - multiple customers experiencing severe or total service degradation, unable to perform basic functionality with no workaround.

Severity 1 - single customer experiencing severe or total service degradation, unable to perform basic functionality with no workaround.

Sev0:

  1. Once a Severity 0 Incident has been identified, the Incident Notification Slack channel is pinged to gather participants from appropriate teams.
  2. A dedicated Slack channel is created by the Incident Manager and the Incident Analyst/Group ONLY join the channel for briefing of current state.
  3. Once next steps are identified, the Incident Coordinator updates the General Slack channel with current state and commits to when the next update will be broadcast.
  4. Affected customers are notified through a ticket that an Incident is currently underway with the current symptoms and when the next update will be communicated.
  5. Updates are provided to customers and the business hourly until the Incident has been resolved.

Sev1:

  1. Once a Severity 1 Incident has been identified, the Support Incident Notification Slack channel is pinged to identify the Incident Analyst/Group.
  2. The Incident Manager creates a dedicated Slack channel.
  3. The Incident Analyst/Group joins the dedicated Slack channel created by the Incident Manager.
  4. The Incident Coordinator is responsible for updating the Support Team Chat Slack channel with current state.
  5. The Incident Analyst will provide updates to the customer on the ticket.

Requirements

Templates

Internal Executive Summary (to be sent to GitLab Executive team)

Current Status of Incident:
Current Actions being performed:
Next steps:
Next Update:
Link to Slack channel: