As the Communications Manager on Call (CMOC), it's your job to be the voice of GitLab to our users, customers, and stakeholders during an incident. To do this, you communicate with them through our status page, Status.io.
The CMOC rotation is one of the rotations that make up GitLab Support On-call.
The basics of how to create, update, and close incidents in Status.io are covered by their Incident Overview documentation. This document covers how we specifically use Status.io to perform those tasks.
You may also be asked to contact a user on behalf of Infrastructure or Security, which may or may not be related to an Incident.
Before getting into the actual process of managing an incident, the following sections should be noted.
This information will all be posted to Slack in the #incident-management channel by Woodhouse and it'll look similar to the following example.
GitLab team members are encouraged to use this method of reporting incidents if they suspect GitLab.com is about to face one.
The following should be noted specifically regarding making updates to Status.io.
Status.io should be updated whenever we have new information about an active incident that our stakeholders should be aware of. Outside of that, it should be updated at a consistent rate depending on the severity of the incident as outlined in the table below.
Once you join the incident Zoom call, take note of any updates that have already been made to Status.io and the times they were made. Set a timer to remind yourself, and stick to the time intervals below unless your last update states how long it will be until the next one. For example, if you're in "monitoring" it may be appropriate to specify an hour before the next update.
| Incident Status | severity::1 Update Frequency | severity::2 Update Frequency | severity::3/severity::4 Update Frequency |
|---|---|---|---|
| Resolved | No further updates required | No further updates required | No further updates required |
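The timer mentioned above can be approximated with a small helper. This is only a sketch: the function names are hypothetical, and the interval is passed in as a parameter because the correct value depends on the incident's status and severity as given in the table, not on anything hard-coded here.

```python
from datetime import datetime, timedelta


def next_update_due(last_update: datetime, interval_minutes: int) -> datetime:
    """Return the time by which the next Status.io update should be posted."""
    if interval_minutes <= 0:
        raise ValueError("interval_minutes must be positive")
    return last_update + timedelta(minutes=interval_minutes)


def minutes_remaining(last_update: datetime, interval_minutes: int, now: datetime) -> int:
    """Return whole minutes left until the next update is due (0 if overdue)."""
    due = next_update_due(last_update, interval_minutes)
    return max(0, int((due - now).total_seconds() // 60))
```

For example, with a hypothetical 15-minute interval, calling `minutes_remaining(last_update, 15, now)` five minutes after the last update leaves ten minutes on the clock. If your last update promised a longer gap, pass that gap as `interval_minutes` instead.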
Provide a generic update based on the best information you have:
Bias to action - you can post another update if there was an error in your last update.
Later sections of this workflow call out that you should ask the IMOC of the incident for permission before moving an incident between certain states (updating, monitoring, resolving). On the rare occasion that an incident does not have an IMOC, you may ask the EOC instead.
Keep in mind that you can always review past incidents if you need examples or inspiration for how to fill in the details for a current incident.
Whether related to an ongoing incident or not, Infrastructure or Security may ask you to reach out to one or more users if they detect unusual usage. Please follow the internal requests workflow to log the request.
As the CMOC you'll guide the incident through the following three stages.
The following sections outline how to perform each of the steps within these stages.
The following steps should be taken immediately after receiving a PagerDuty page for an incident.
NB: "Resolved" in PagerDuty does not mean the underlying issue has been resolved.
Before you create an incident in Status.io you should join the Zoom call that will be used by all GitLab team members involved in the incident. A link to the call is provided in the incident declaration by Woodhouse in the #incident-management channel.
Your role as CMOC while in this room is to follow along while the incident is worked and make updates to Status.io either when asked or when it's necessary. Oftentimes chatter in this room will be lively, especially in the early stages of an incident while the source of the issue is being discovered. Use your best judgment on when it's appropriate to speak up to avoid vocalizing at inopportune times. You can always ping anyone on the call through Slack if you need to ask a non-urgent question about the situation.
After logging in to Status.io you should be met with the dashboard that displays various statistics about our current status. A new incident can be created by clicking New Incident along the top bar.
This takes you to the new incident screen where you'll be asked to fill in the details of the incident. The following is an example of what a new incident would look like if we're experiencing an issue with a delay in job processing on GitLab.com.
Change the following values:
Title - Titles should be brief and concise. The incident title should answer the question: In simple terms, what is the issue?
Current State - In nearly all cases an incident should be created in the Investigating state. If it's been communicated to you that we're aware of what is causing the current incident, this could be set to Identified from the beginning.
Details - In keeping with our value of transparency, we should go above and beyond for our audience and give them as much information as possible about the incident on its creation. This field should always include a link to the incident issue from the production issue tracker so that our audience can follow along.
Incident Status - When creating a new incident this will never be Operational. The status of an incident depends entirely on its scope and how much of the platform it's impacting.
Broadcast - Always check each box in this section.
Message Subject - Always leave this at its default value.
Affected Infrastructure - This should almost always be unchecked so that the value of the Incident Status field is only applied to the specific aspects of the platform that are affected by the incident. In the example above we're only experiencing an issue with job processing, so only CI/CD is selected.
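The field guidance above can also be run as a quick checklist against a draft before you publish it. The sketch below is a local helper only and does not call Status.io. The `IncidentDraft` class, its field names, and the `gl-infra/production` marker used to detect the production issue link are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List

# States this workflow allows when an incident is first created.
ALLOWED_INITIAL_STATES = {"Investigating", "Identified"}


@dataclass
class IncidentDraft:
    """Hypothetical container for the values typed into the new incident form."""
    title: str
    current_state: str
    details: str
    incident_status: str
    affected_components: List[str]


def check_new_incident(draft: IncidentDraft,
                       issue_link_marker: str = "gl-infra/production") -> List[str]:
    """Return a list of problems; an empty list means the draft passes."""
    problems = []
    if not draft.title.strip():
        problems.append("Title is empty")
    if draft.current_state not in ALLOWED_INITIAL_STATES:
        problems.append("Current State should be Investigating or Identified")
    if draft.incident_status == "Operational":
        problems.append("Incident Status must not be Operational on creation")
    if issue_link_marker not in draft.details:
        problems.append("Details should link to the production incident issue")
    if not draft.affected_components:
        problems.append("Select at least one affected component")
    return problems
```

A draft for the job processing example, with CI/CD as the only affected component and a production issue link in the details, returns an empty list. The Broadcast and Message Subject settings are not modeled here because they never vary.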
Once the severity of the incident has been set and it is on our status page, the CMOC should notify internal stakeholders using the Incident Notifier application in Slack. To do so:
Click the lightning bolt in the message composition box within #support_gitlab-com and select
This process notifies internal stakeholders of the incident and should be done when all of the following are true:
To update an active incident click the incidents icon from the dashboard.
Then click on the edit button next to the incident.
Change the following values:
Current State - Change this depending on the current state of the incident and whether or not we've identified the cause (Identified) or implemented a fix (Monitoring).
Details - Be as descriptive as possible about the update and include a link to the production issue.
Broadcast - Check all boxes.
Current Status - If the incident has improved or worsened, update this value. If neither, leave it as it was when the incident was created.
Set Status Level - Uncheck this and keep only the affected component selected unless the incident has increased in scope and now affects other components of our infrastructure. IMPORTANT: These must be checked individually as in the screenshot below.
A ready to be published update should look similar to the following.
Make sure to verify the update length before publishing it. If it exceeds 280 characters, the update won't be published on Twitter, with no failure notification from
After publishing the update, visit the live GitLab Status Page to verify the update went through and looks clear.
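The 280-character check that should happen before publishing can be sketched as a small helper. This is a rough guard, not an exact replica of how Twitter counts characters: it uses plain string length, and the function names are hypothetical.

```python
TWEET_LIMIT = 280  # Character limit stated in this workflow.


def fits_in_tweet(update_text: str, limit: int = TWEET_LIMIT) -> bool:
    """Return True if the update is short enough to be posted."""
    return len(update_text) <= limit


def characters_over(update_text: str, limit: int = TWEET_LIMIT) -> int:
    """Return how many characters need to be trimmed (0 if it already fits)."""
    return max(0, len(update_text) - limit)
```

Because links and some special characters may be counted differently by Twitter, it is safer to leave some headroom rather than aim for exactly 280.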
Closing an incident out has two stages: Monitoring and Resolved. Once the affected component is back to operating normally, a monitoring period should begin: we switch the incident over to Monitoring, where it remains open for ~30 minutes to ensure that the issue does not recur. We then mark it Resolved once we're confident the issue will not recur, which closes the incident.
The two stages of the resolution process are covered in their respective sections below.
Note: The IMOC may request that the monitoring status be skipped.
To start the monitoring period, edit the incident, and configure the update similar to the following.
Take special note of the changes made to the following fields at this stage.
Current State - Change this to Monitoring.
Details - If we have not previously mentioned that a fix has been applied, do so at this stage and make specific mention that we're monitoring the system to ensure that a repeat of the issue does not occur. Make sure to include:
While all systems are online and fully operational, out of an abundance of caution we'll leave affected components marked as degraded as we monitor. If there are no recurrences in the next 30 minutes, we'll resolve this incident and mark all components as fully operational.
Incident Status - At this point, the affected component should be back to normal operation. However, to be clear that we're still in the incident management process, we will not flip this back to Operational until we leave the monitoring state.
Once we're confident that the underlying issue that caused the incident has been fully resolved and a monitoring period has been observed, we should close the incident. Before we do so, we should check with the IMOC via Slack for the all-clear. This should be done by starting a thread on the announcement in #incident-management that started the incident and mentioning the IMOC in it. The following is what one of these messages looks like.
Once we have confirmation from the IMOC that the incident can be resolved, make an update to the incident and change the following fields.
Current State - Change this to Resolved.
Details - Our message here should include a definitive statement that the issue has been resolved and that the affected component is back to operating normally. We should also aim to again include a link to the relevant issue in the production issue tracker so that any users who missed previous updates know where to go for more info.
Incident Status - Change this field to Operational. IMPORTANT: Make sure "Apply status level to all affected infrastructure" is checked. Double check the status.gitlab.com page.
Before resolving the incident your draft should look similar to the following:
After updating the status page, edit the Slack message you created in #e-group to indicate resolution and post a final update in the thread.
A review will be conducted by production engineering for every incident that meets certain criteria. Status.io allows us to add a link to a post-mortem after an incident has been resolved, which will then be viewable on our status page for that specific incident.
Do the following to add a post-mortem to a resolved incident:
From the dashboard click the
Scroll down and click on the title of the incident.
Click Add Post-Mortem and supply the link to the issue being used for the incident review.
Infrastructure will at times plan scheduled maintenance events for GitLab.com, some of which will directly impact users. New maintenance events are announced as issues created in the gl-infra/production issue tracker using the external_communication.md issue template accompanied by the Scheduled Maintenance label.
In the event that a maintenance will affect users, infrastructure can request that the maintenance be visible on our status page and, if required, that the CMOC actively provide status updates during the maintenance window. In these cases infrastructure will apply the CMOC Required label to the issue, causing a notification to be sent to the #support_gitlab-com channel that mentions the on-call CMOC. Once this notification is received, the CMOC uses the details within the issue to create the maintenance in Status.io.
To create a new maintenance event, click New Maintenance from the Status.io dashboard.
The contents of the maintenance should be filled out according to the details provided in the maintenance issue. Once complete, it might look something like the following.
If you need to reschedule a maintenance window, go to the Maintenances tab in Status.io.
Select the maintenance you need to reschedule.
Click the Reschedule Maintenance button and enter the new schedule time. Make sure you have the correct timezone details when updating, then click save.
To send an update about a maintenance event, such as a reminder, go to the Maintenances tab in Status.io and select the one that needs an update. On the maintenance's information page, make note of whether automatic email reminders are set to go out. If yes, make sure not to send email broadcasts for your update in order to avoid sending duplicate reminders to subscribers. Once ready to update, select the Post Update Without Starting button.
Enter the update details provided by the Infrastructure team and have them confirm the appropriate broadcast channels before proceeding to send the update. If "Send Reminders" was enabled in the maintenance information page, be sure not to check "Notify email subscribers" in the broadcast settings.
Once the GitLab Status Twitter account has posted about the maintenance schedule, send a link to the tweet to the #social_media_action Slack channel to let the social team know that you'd like amplification on our GitLab brand Twitter account. This should only be used once during a selected scheduled maintenance timeline, preferably mid-week prior to the scheduled maintenance.
At the end of each on-call shift it's necessary to inform the next CMOC of any relevant activity that occurred during the shift or is still ongoing. To perform a handover, create an issue in the CMOC Handover issue tracker using the Handover issue template. Create the handover issue even if nothing happened during your shift; signaling that everything is fine is also useful information. It's critical to remember that since we work out in the open by default, the CMOC Handover issue tracker is open to the public. A handover issue should be made confidential if it must contain any sensitive information.
If handover occurs during an active incident and the quick summary you'd provide in the handover issue is insufficient to properly prepare the incoming CMOC for the situation, you are encouraged to start a quick Zoom call in the #support_gitlab-com Slack channel with the incoming CMOC. Slash commands such as the following can be used to expedite getting the meeting set up.
/zoom meeting CMOC Handover Briefing
In case there are any uncertainties around the status of an incident, please contact the IMOC for clarification.
The CMOC Shadow Schedule can be used by people who wish to shadow the CMOC as a learning process before acting as CMOC. A soon-to-be CMOC can adjust the schedule to match their working hours by clicking Edit this schedule > Add Another Layer, then adding their username and the days and hours they wish to shadow.
It is recommended to watch this video on how to perform CMOC duties effectively: CMOC training video