As the GitLab SaaS Incident Communications Manager on Call (CMOC) you are the voice of GitLab to our users and stakeholders during an incident. To do this effectively, you'll work primarily with the Incident Manager (IM) and Engineer on Call (EOC) and use a combination of our status page (powered by Status.io), Slack, Zendesk, and GitLab itself. The CMOC rotation is one of the rotations that make up GitLab Support On-call.
To disambiguate this term on other pages, you may see the acronym ICMOC or see the role referred to as "Incident CMOC". As this page is scoped to only this role it uses CMOC, Incident CMOC, and ICMOC interchangeably.
Our Slack bot Woodhouse provides a command (/incident post-statuspage) to quickly spin up an incident on Status.io. From there, the basics of how to update and close incidents in Status.io are covered by their Incident Overview documentation. This document covers how GitLab specifically uses Status.io to perform those tasks.
To be added to the CMOC Rotation:
Before getting started, take note of the following sections. To get straight into the workflow, start at Incident Management.
This section contains information specific to how incidents are started, what various status messages in PagerDuty mean, and the difference between the EOC and the IM during an incident.
Infrastructure uses Woodhouse to declare incidents through Slack. Doing so will:
This information will all be posted to Slack in the #incident-management channel by Woodhouse and it'll look similar to the following example.
GitLab team members are encouraged to use this method of reporting incidents if they suspect GitLab.com is about to face one.
NB: "Resolved" in PagerDuty does not mean the underlying issue has been resolved.
The IM is the DRI for the decision of whether to initiate public communications via the Status Page. When joining an incident as the CMOC you should inquire as to whether communications are currently required. On the rare occasion that an incident does not have an IM, the EOC assumes these responsibilities and you may ask them instead.
You can always review past incidents if you need examples or inspiration for how to fill in the details for a current incident.
It is not possible to change the Current Status of the affected infrastructure of an incident or its Current State without making a formal update to the incident. It is acceptable to publish a new update to an incident in order to change either the Current Status of its affected infrastructure or its Current State regardless of how recently the last update on the incident was published.
Status.io should be updated whenever we have new information about an active incident that our stakeholders should be aware of. Outside of that, it should be updated at a consistent rate depending on the severity of the incident as outlined in the table below.
Once you join the incident Zoom call, take note of any updates that have been made to Status.io and the time they were made at. Set a timer to remind yourself and stick to the time intervals below unless you make a note of how long it will be until the next status update. For example, if you're in "monitoring" it may be appropriate to specify an hour before the next update.
Incident Status | Severity 1 Update Frequency | Severity 2 Update Frequency | Severity 3/Severity 4 Update Frequency |
---|---|---|---|
Investigating | 10m | 15m | 15m |
Identified | 10m | 30m | 30m |
Monitoring | 30m | 60m | 60m |
Resolved | No further updates required |
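The table above can also be expressed as a small lookup, which may help when setting a timer. The following is only a sketch: the function and variable names are hypothetical, the timestamp is made up, and it does not replace any interval you explicitly announce in an update.

```python
from datetime import datetime, timedelta, timezone

# Minutes between status page updates, copied from the table above.
# Severity 3 and Severity 4 share a column, so both map to the same values.
UPDATE_INTERVAL_MINUTES = {
    "investigating": {1: 10, 2: 15, 3: 15, 4: 15},
    "identified": {1: 10, 2: 30, 3: 30, 4: 30},
    "monitoring": {1: 30, 2: 60, 3: 60, 4: 60},
}


def next_update_due(last_update, state, severity):
    """Return when the next update is due, or None once the incident is resolved."""
    state = state.lower()
    if state == "resolved":
        return None  # No further updates required.
    minutes = UPDATE_INTERVAL_MINUTES[state][severity]
    return last_update + timedelta(minutes=minutes)


# Hypothetical example: last update at 12:00 UTC on a Severity 2 incident.
last = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(next_update_due(last, "Identified", 2))  # 2024-01-01 12:30:00+00:00
```

If you tell users in an update that the next one will come after a longer period, that stated interval takes precedence over the lookup.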
Provide a generic update based on the best information you have:
Craft a draft of what you think is correct. Whenever possible use "I intend to…" language when communicating with the IM and EOC:
Bias to action - you can post another update if there was an error in your last update.
Any updates outside documented incident updates that require administrator access to the GitLab System Status page should be initiated with an issue.
For example, see this issue that was created to add a component to the GitLab System Status page.
Whether related to an ongoing incident or not, Infrastructure or Security may ask you to reach out to one or more users if they detect unusual usage. Please follow the Sending Notices workflow to action these requests.
The CMOC can be paged during the incident declaration process. If the CMOC needs to be paged after an incident was created or for any other reason, see the How to engage the CMOC? section of the main incident management handbook.
Success as a CMOC is determined by the following performance indicators:
As the CMOC you'll guide the incident through the following stages.
The following sections outline how to perform each of the steps within these stages and should be performed in sequential order.
Perform all of the following steps in order immediately after receiving a PagerDuty page.
Mark the page as acknowledged. This can be done through the mobile app, web interface or PagerDuty App in the #support_gitlab-com Slack channel.
A link to the call is provided in the incident declaration post by Woodhouse in #incident-management.
Your role while on the call is to follow along while the incident is worked and make updates to Status.io either when asked to or when it's necessary. Oftentimes chatter in this room will be lively, especially in the early stages of an incident while the source of the issue is being discovered. Use your best judgment on when it's appropriate to speak up to avoid vocalizing at inopportune times. You can always ping anyone on the call through Slack if you need to ask a non-urgent question about the situation.
The first thing you should do is to verify that you can be heard by others in the room. To do this, say something like:
"Hi, I'm the CMOC on duty. I intend to send an update, please review this in the Slack thread."
"Hi, I'm the CMOC on duty, how can I help?"
Whatever you choose to say, make sure that you receive a verbal acknowledgement directed at you before you move on to focus on other aspects of the incident.
From time to time, you may be asked to perform some specific tasks in the room. Always verbally acknowledge any such asks by repeating your understanding of the ask back to the requestor. This helps everyone understand that the ask was heard, and also serves to verify that everyone has the same understanding of some action to be taken.
In some cases, the ask may be implicit, rather than explicit. If you're in doubt, always speak up and ask for confirmation. For example:
IM: CMOC is here, we need to roll out a first update.
A good response would be to ask for confirmation that an action was requested:
CMOC: IM, do you want me to send a first update on status.io?
A better response would be to assume that an action was requested, relay your intended course of action in response, and give the requestor the opportunity to provide input:
CMOC: IM, acknowledged, I will draft an update for status.io and ping you in Slack for input.
You can create an incident on Status.io with minimal effort through Slack (provided by Woodhouse) OR manually if need be (e.g., Slack is down or you need to customize the incident more than what the Slack form allows).
You simply need to issue /incident post-statuspage from anywhere on Slack. You will be presented with a pre-filled form that you can update to your liking. Once submitted, the incident will be broadcast to the following media:
To create an incident through Status.io click the New Incident button from the main dashboard:
Then, fill out all of the details of the incident. The following values should be changed:
Title - This should be brief and concise. The incident title should answer the question: In simple terms, what is the issue?
Current State - This should almost always be set to Investigating, as we normally don't know the cause of the incident at this early stage. If that is not the case and it has been communicated to you by the IM or EOC that we're aware of what the cause is, set this to Identified instead.
Details - In keeping with our value of transparency, we should go above and beyond for our audience and give them as much information as possible about the incident. This field should always include a link to the incident issue from the production issue tracker so that our audience can follow along.
Incident Status - This should be set to either Degraded Performance, Partial Service Disruption, or Service Disruption depending on the severity of the incident. If you're unsure of which to pick, ask the IM for guidance.
Broadcast - Make sure all boxes are checked.
Message Subject - Leave this at its default value.
Affected Infrastructure - Leave this unchecked and then check the box next to each specific component below it that is affected by the incident. If this box is checked then the value that you set for Incident Status above will be applied to all infrastructure components.
The following is an example of an incident ready to be created regarding a delay in job processing on GitLab.com, and is generally what this page should look like before being submitted based on the guidelines above.
The CMOC now needs to notify internal stakeholders of the incident using the Incident Notifier Slack workflow. This should be done after the severity of the incident has been confirmed by the IM.
This workflow, once used, will ask you to fill out a form with details of the incident and will then post those details to #community-relations and #customer-success. This serves to notify those teams of the incident. A copy will also be sent in a direct message to you, should you be asked to post it anywhere else. To engage the workflow:
Click the lightning bolt in the message composition box within #support_gitlab-com and select Incident Notifier.
Submit
Add the ~Incident-Comms::Status-Page scoped label to the incident issue. It is important that we are able to differentiate incidents which included outbound status page and related notifications from those incidents which were deemed less impactful to our customers. This is useful both for filtering active incidents that include outbound notification and for after-incident reporting.
Whenever a GitLab service incident includes the use of the status page, this should be identified on the incident issue in GitLab. See this, and other uses of this scoped label in the Incident Management section of the handbook.
We mark the PagerDuty page as resolved once every other task in this stage has been completed. Resolve the page through the mobile app, web interface or PagerDuty App in the #support_gitlab-com Slack channel.
After all Stage 1 tasks have been completed, we will manage the incident by making updates to it, creating a tag for it in Zendesk, and responding to any tickets in Zendesk that are related to it.
To publicly communicate attention and progress, incidents should be updated according to the frequency of incident updates table unless you communicate otherwise.
To update an active incident, click the incidents icon from the dashboard.
Then click on the edit button next to the incident.
Change the following values:
Current State - Change this to Identified if the IM or EOC has informed you that we have identified the cause of the incident. If we have not, leave it at Investigating. If we have rolled out a fix for the incident and will be entering a monitoring period, set this to Monitoring and then move on to Stage 3.
Details - Describe what has changed regarding the incident since your last update, being as concise and to the point as possible. If you can fit it into the character limit, consider including a link to the incident issue again as well.
Broadcast - Make sure all boxes are checked.
Message Subject - Leave this at its default value.
Current Status - Leave this to what you previously set it to when creating the incident, unless the scope of the incident has widened or narrowed. If you're unsure, consult with the IM.
Set Status Level - Keep this checked. If the incident has increased in scope and now affects additional components in addition to the ones originally selected, proceed to Update Affected Infrastructure after publishing your update.
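The guidance for the Current State field in the list above can be summarised as a simple decision. This is only an illustrative sketch: the function and parameter names are hypothetical, and it deliberately leaves out Resolved, which is covered in Stage 4.

```python
def current_state(cause_identified: bool, fix_rolled_out: bool) -> str:
    """Pick the Current State for an update to an active incident.

    Both flags are assumptions for illustration; in practice they reflect
    what the IM or EOC has told you.
    """
    if fix_rolled_out:
        # A fix is out and we are entering a monitoring period (Stage 3).
        return "Monitoring"
    if cause_identified:
        return "Identified"
    # The cause is still unknown.
    return "Investigating"


print(current_state(cause_identified=True, fix_rolled_out=False))  # Identified
```

When in doubt about either input, ask the IM rather than guessing.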
A ready to be published update should look similar to the following.
Make sure to verify the update length before publishing it. If it exceeds 280 characters, the update will not be published on Twitter, and Status.io gives no failure notification.
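A quick way to check the length of a draft before publishing is sketched below. It assumes a plain character count against the 280 limit mentioned above; how Twitter or Status.io count links or special characters is not covered here, so treat the result as an approximation. The function names and the sample text are hypothetical.

```python
TWEET_LIMIT = 280  # limit stated in this workflow


def fits_twitter(details: str, limit: int = TWEET_LIMIT) -> bool:
    """Return True if the draft is within the character limit."""
    return len(details) <= limit


def characters_over(details: str, limit: int = TWEET_LIMIT) -> int:
    """Return how many characters need to be trimmed (0 if none)."""
    return max(0, len(details) - limit)


draft = "We are investigating delays in CI job processing on GitLab.com."  # sample text
print(fits_twitter(draft), characters_over(draft))  # True 0
```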
After the update has been published, visit the live status page to verify that it went through.
Proceed to either Title or Affected Infrastructure to learn how to change either.
To update the title of an incident, click Incidents in the navigation bar and then the View Incident button next to the incident in question:
Click the pencil icon next to the current incident title, change it, then click Save.
To update the affected infrastructure of an incident, click Incidents from the navigation bar and then the View Incident button next to the incident in question:
Click the pencil next to Affected Infrastructure, check the boxes next to the additional affected infrastructure, then click Save. Then, click Dashboard from the navigation menu, click the additional affected infrastructure from the Current Status menu, and change their status:
In order to track tickets submitted through Zendesk that relate to an incident, we need to create a tag. To create a tag:
Tags field.
Enter.
The tag will now be available to use on other tickets. All tags that relate to incidents should be in the format com_incident_#### with #### being the incident number, which can be found in the incident issue.
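To avoid typos in the tag, the format described above can be generated and checked programmatically. This is a minimal sketch with hypothetical function names; it assumes the incident number is a positive integer of any length, and the number used in the example is made up.

```python
import re

# com_incident_ followed by the incident number (digits only).
TAG_PATTERN = re.compile(r"com_incident_\d+")


def incident_tag(incident_number: int) -> str:
    """Build the Zendesk tag for an incident number."""
    if incident_number <= 0:
        raise ValueError("incident number must be positive")
    return f"com_incident_{incident_number}"


def is_incident_tag(tag: str) -> bool:
    """Check that a tag follows the com_incident_#### format."""
    return TAG_PATTERN.fullmatch(tag) is not None


print(incident_tag(1234))                     # com_incident_1234
print(is_incident_tag("com-incident-1234"))   # False
```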
While in the Manage Incident stage, routinely monitor Zendesk for new and existing tickets related to the incident and proceed to tag and respond to them. Use this Zendesk search to view all new tickets created in the last four hours. Alternatively, paste the following into the Zendesk search bar.
created>4hours order_by:created_at sort:desc group:none group:"support" -form:billing -form:security
Adjust the 4 if the incident began earlier than four hours ago.
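If you prefer not to edit the search string by hand, the adjustment can be sketched as follows. The helper names are hypothetical and the timestamps are made up; the query text itself is copied from the search above, with only the number of hours substituted.

```python
import math
from datetime import datetime, timezone


def hours_since(start: datetime, now: datetime) -> int:
    """Whole hours elapsed since the incident began, rounded up (minimum 1)."""
    elapsed_hours = (now - start).total_seconds() / 3600
    return max(1, math.ceil(elapsed_hours))


def recent_tickets_query(hours: int = 4) -> str:
    """Build the Zendesk search string covering the last `hours` hours."""
    if hours < 1:
        raise ValueError("hours must be at least 1")
    return (
        f"created>{hours}hours order_by:created_at sort:desc "
        'group:none group:"support" -form:billing -form:security'
    )


# Hypothetical incident that started 5 hours and 50 minutes ago.
start = datetime(2024, 1, 1, 8, 10, tzinfo=timezone.utc)
now = datetime(2024, 1, 1, 14, 0, tzinfo=timezone.utc)
print(recent_tickets_query(hours_since(start, now)))
```

Rounding up errs on the side of including tickets created slightly before the incident began.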
After the incident has been mitigated, we'll often begin a monitoring period to ensure that we do not see a recurrence of the issue. Monitoring typically lasts for 30 minutes, but it can vary and a specific amount of monitoring time may be requested by the IM. They may also request that the monitoring stage be skipped entirely. If this is the case, proceed directly to Stage 4.
To begin monitoring, edit the incident and change the following fields.
Current State - Change to Monitoring.
Details - Along with any information specific to the incident be sure to mention that all systems have returned to normal operation, that we're monitoring in order to ensure the issue doesn't recur, and provide an estimate for how long we'll be monitoring before we resolve the incident. For example:
While all systems are online and fully operational, out of an abundance of caution we'll leave affected components marked as degraded as we monitor. If there are no recurrences in the next 30 minutes, we'll resolve this incident and mark all components as fully operational.
Broadcast - Make sure all boxes are checked.
Message Subject - Leave this at its default value.
Current Status - Leave this at its previously set value. At this point, affected infrastructure should be back to operating normally, but to avoid confusion we do not set this back to Operational until we are ready to close the incident.
A ready to be published update that switches the incident over to the monitoring period should look similar to the following.
If at any point during the monitoring period we see a recurrence of the issue, return to Stage 2. If the monitoring period completes with no recurrence of the issue, proceed to Stage 4.
After we have completed the monitoring period, or if the monitoring period was skipped, we will now close the incident, add a link to the post-mortem section of the incident, and check Zendesk for any remaining tickets that need tagging and a response.
Once we've confirmed that the issue has been resolved and the IM has given the all-clear, we will close the Status.io incident. If these conditions are met, make an update to the incident and change the following fields.
Current State - Change to Resolved.
Details - State that the issue has been resolved and that systems have returned to operating normally. Be sure to also include a link to the incident issue even if you've already done so in previous updates so that any users who missed them know where to go for more info.
Broadcast - Make sure all boxes are checked.
Message Subject - Leave this at its default value.
Current Status - Change to Operational.
Set Status Level - Check this box.
A ready to be published update that closes the incident should look similar to the following.
After the incident has been closed, double check that the status page looks right.
A review will be conducted by production engineering for every incident that matches certain criteria. Status.io allows us to add a link to a post-mortem after an incident has been resolved, which will then be viewable on our status page for that specific incident.
Do the following to add a post-mortem to a resolved incident:
From the dashboard click the Incidents button.
Scroll down and click on the title of the incident.
Click Add Post-Mortem and supply the link to the issue being used for the incident review. This is usually the same issue that was opened for the incident.
As a final step, perform one more check of Zendesk and ensure that there are no remaining tickets that need to be tagged and responded to. Refer back to the Monitor Zendesk section for how to do this.
Infrastructure will at times plan scheduled maintenance events for GitLab.com, some of which will directly impact users. New maintenance events are announced as issues created in the gl-infra/production issue tracker using the external_communication.md issue template accompanied by the Scheduled Maintenance label.
In the event that a maintenance will affect users, infrastructure can request that the maintenance be visible on our status page. Maintenance events have automation enabled, to ensure each event is started at the scheduled time. The CMOC may be requested to actively provide status updates throughout the maintenance event. When this is required, automation is disabled on the task and all future updates must be conducted manually, including the final completion of the maintenance window. In these cases infrastructure will apply the CMOC Required label to the issue. The automated notifications that this label previously triggered are currently inoperative. Instead, the maintenance creator will post in #support_gitlab-com using the @cmoc mention to notify the CMOC that a new maintenance requires setting up. Once this notification is received, the CMOC uses the details within the issue to create the maintenance in Status.io.
To create a new maintenance event, click New Maintenance from the Status.io dashboard.
The contents of the maintenance should be filled out according to the details provided in the maintenance issue. Once complete, it might look something like the following.
The timezone of Status.io is set to UTC.
The date format is DD-MM-YYYY, because the ISO format is not an option as of 2022-08-12.
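Because the maintenance issue may give the time in another timezone or in ISO format, it can help to convert it before typing it into Status.io. The sketch below assumes you already have a timezone-aware datetime; the function name is hypothetical and the example values are made up.

```python
from datetime import datetime, timedelta, timezone


def statusio_date(moment: datetime) -> str:
    """Convert an aware datetime to UTC and format the date as DD-MM-YYYY."""
    if moment.tzinfo is None:
        raise ValueError("datetime must be timezone-aware")
    return moment.astimezone(timezone.utc).strftime("%d-%m-%Y")


# Hypothetical maintenance start: 23:30 on 2022-08-12 at UTC-05:00,
# which is already 2022-08-13 in UTC.
local_start = datetime(2022, 8, 12, 23, 30, tzinfo=timezone(timedelta(hours=-5)))
print(statusio_date(local_start))  # 13-08-2022
```

The example shows why the conversion matters: a late evening start in a western timezone falls on the following calendar day in UTC.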
To reschedule a maintenance window:
Go to Status.io and open the Maintenances tab.
Select the maintenance you need to reschedule.
Click the Reschedule Maintenance button and enter the new schedule time, making sure you have the correct timezone details.
Save your changes.
Note About Automated Maintenance Events: On the Maintenance Event page you may see Automation: Running with red text in parentheses next to it reading (Disable). Once (Disable) has been clicked, automation is disabled and cannot be re-enabled. In order to Post Update and Finish Maintenance, automation on the Maintenance Event must be disabled. After automation is disabled, all future updates to this Maintenance Event must be made manually.
To send an update about a maintenance event, such as a reminder, go to the Maintenances tab in Status.io and select the one that needs an update. On the maintenance's information page, make note of whether automatic email reminders are set to go out. If yes, make sure not to send email broadcasts for your update in order to avoid sending duplicate reminders to subscribers. Once ready to update, select the Post Update Without Starting button.
Enter the update details provided by the Infrastructure team and have them confirm the appropriate broadcast channels before proceeding to send the update. If "Send Reminders" was enabled in the maintenance information page, be sure not to check "Notify email subscribers" in the broadcast settings.
Once the GitLab Status Twitter account has posted about the maintenance schedule, send a link to the tweet to the #social_media_action channel to let the social team know that you'd like amplification on our GitLab brand Twitter account. This should only be used once during a selected scheduled maintenance timeline, preferably mid-week prior to the scheduled maintenance.
It's necessary to inform the ingress CMOC of any relevant activity that occurred during your shift or if there are incidents that are still ongoing. To perform a handover, create an issue in the CMOC Handover issue tracker using the Handover template. Do this even if nothing happened during your shift, because signaling that everything is fine is also useful information. It's critical to remember that since we work out in the open by default, the CMOC Handover issue tracker is open to the public. A handover issue should be made confidential if it must contain any sensitive information.
If handover occurs during an active incident where the quick summary you'd provide in the issue is insufficient to properly prepare the ingress CMOC for the situation, you are encouraged to start a Zoom call in #support_gitlab-com and invite the ingress CMOC to it so that they can be caught up synchronously. You can use the following slash command to expedite setting the meeting up.
/zoom meeting CMOC Handover Briefing
NOTE: When adding yourself to this rotation, be aware that adjusting the Time Zone field at the top of the page will adjust it for all users, not just yourself. Before you navigate away, please reset the timezone to UTC.
The CMOC Shadow Schedule can be used by anyone who wishes to shadow the CMOC to learn before officially acting as CMOC. A soon-to-be-CMOC can create an issue in the pagerduty project to be added to a shadow schedule. Or, to shadow for a short span of days, they can click Schedule an Override, then click Custom duration and then select the time zone and the start and end dates and times before clicking the Create Override button to save the changes. To remove overrides, click the x on the override to be removed in the list of Upcoming Overrides on the right side of the screen.
Note About CMOC Shadowing: When the CMOC shadow PagerDuty schedule is active the engineer will receive notifications and get paged the same way as when on the CMOC schedule. Do not acknowledge or resolve any incidents when on the CMOC shadow schedule as this will stop any potential pages to the real CMOC!
A "training activity" for CMOCs is an activity in which CMOCs are exposed to items in the workflow with the express purpose of maintaining or increasing performance against CMOC performance indicators.
Some example training activities are: