Support Engineers in the Customer Emergencies rotation coordinate operational emergencies from GitLab customers.
The Customer Emergencies rotation is one of the rotations that make up GitLab Support On-call.
When on-call, please ensure that you:
To be added to the Customer Emergency On-Call rotation, first complete the Customer Emergency On-Call training module. Then, after agreement with your manager, raise a new PagerDuty issue with the Support Ops team requesting that you be added to the appropriate PagerDuty rotation.
Customer Emergency shifts are overlapping six-hour shifts.
Due to an increase in concurrent emergencies, we have split the AMER shift into three overlapping six-hour schedules that cover the 8-hour AMER on-call window. The schedules have been split so that engineers can cover the hours that align most closely with their working hours.
This leaves the first and last hours of the AMER on-call window with a single engineer on-call. If multiple emergencies come in during these times, follow the Handling multiple simultaneous emergencies workflow.
Each group is encouraged to coordinate a DRI role for the shift. The DRI will be responsible for taking assignment of the first emergency. The non-DRIs will take concurrent emergencies as they come in.
An example DRI schedule is below. Note that AMER 2 is the DRI for 30 minutes longer, since they overlap with AMER 1 or AMER 3 across all of their shift hours.
When you get an alert, immediately use the PagerDuty message in Slack to start a thread and take notes there. If the customer has a Customer Success Manager (CSM), tag them - "cc @user" is good enough (see the steps for how to identify CSMs). This creates visibility around the situation and opens the door for the team to join in.
Good notes in Slack help others follow along, and help you with your follow-ups after the call.
Try to communicate complete ideas rather than snippets of thought. Something like "that's not good" as a response to something happening within the call isn't as helpful as "gitaly timings are really high".
Take and share screenshots of useful info the customer is showing you. Make sure you're not sharing anything sensitive. Let the customer know you're taking screenshots: "Could you pause there? I want to screenshot this part to share with my team".
Note: You may sometimes be required to contact GitLab users on behalf of another GitLab team (such as the SIRT team). Please follow the Sending Notices workflow to action these requests.
According to our definition of Severity 1, an emergency exists when a "GitLab server or cluster in production is not available, or otherwise unusable". In the event that the situation does not clearly qualify under this strict definition, an exception may be granted.
We assume positive intent and use our criteria for exceptions in the Emergency Exception Workflow as a framework for understanding the business impact of situations customers raise. During any crisis, the customer may be stressed and have immense pressure on them. Later, after the crisis, if we've determined that the ticket didn't strictly qualify as an emergency, the CSM for the customer or a Support Manager can discuss that with the customer.
| When you decide the request… | Then apply the Zendesk macro… | and communicate to the customer… |
| --- | --- | --- |
| …meets the definition of Severity 1, | | …your plan to work the emergency. |
| …qualifies under one of our exception criteria, | | …that the situation is being treated as an emergency as a courtesy. |
| …needs more information to allow us to determine whether it qualifies as an emergency, | | …that you will proceed asynchronously until that determination can be made. |
| …does not meet the criteria for an emergency or an exception, | | …that their situation does not qualify for emergency service. |
When an emergency request ticket does not contain information sufficient to allow you to determine whether the situation qualifies as an emergency or for an exception, send the customer a message through the ticket:
Once you have enough information to make a determination, use one of the other macros to tag the ticket with the final qualification determination. Note that the `Needs more info` tag will intentionally remain attached.
Using our Definitions of support impact, select the most appropriate actual priority for the ticket, and make the change to the ticket. If the customer submitted the emergency request related to an existing ticket, close the emergency ticket when you deliver the downgrade message, and be sure the existing ticket has the priority you selected.
It's important that we deliver the downgrade message as carefully and thoughtfully as possible. Customers who submit an emergency request are often already in a state of panic, high stress, high pressure, or a combination of those. If you feel comfortable delivering the message to the customer, you are encouraged to do so. If you prefer to have a manager's assistance, please contact the on-call Support Manager.
The important details to include in the message are:
If at any point you would like advice or help finding additional support, contact the on-call Support Manager. The on-call manager is there to support you. They can locate additional Support Engineers if needed. This can make it easier to handle a complex emergency by having more than one person on the call, so you can share responsibilities (e.g., one person takes notes in Slack while the other communicates verbally on the call). Managers are on-call during weekends, so you can page for help at any time.
Search for `priority: urgent` to find the ticket.
Post in `#support_gitlab-com` to start a Slack thread, linking the ticket. This ensures that everyone coming into the ensuing discussion can easily identify the corresponding emergency ticket.
NOTE: If you need to reach the current on-call engineer and they're not accessible on Slack (e.g., it's a weekend, or the end of a shift), you can manually trigger a PagerDuty incident to get their attention, selecting Customer Support as the Impacted Service and assigning it to the relevant Support Engineer.
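If you prefer to script the manual trigger, PagerDuty incidents can also be opened through the Events API v2. This is only a sketch of the payload that endpoint accepts; the routing key, summary, and source below are placeholders, not values from this runbook:

```python
import json

def manual_incident_payload(routing_key, summary, source="support-oncall"):
    """Build a PagerDuty Events API v2 'trigger' payload. POSTing this
    JSON to https://events.pagerduty.com/v2/enqueue opens an incident
    on the service that the routing (integration) key belongs to."""
    return {
        "routing_key": routing_key,   # placeholder: the service's integration key
        "event_action": "trigger",
        "payload": {
            "summary": summary,       # what the responder sees in the page
            "source": source,         # where the event originated
            "severity": "critical",
        },
    }

# Example: page the on-call engineer about an unanswered emergency.
print(json.dumps(
    manual_incident_payload("YOUR_ROUTING_KEY",
                            "Emergency ticket needs the on-call engineer"),
    indent=2))
```

Assigning the incident to a specific engineer, as described above, is done in the PagerDuty UI; the Events API only routes to a service.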
Note: "Resolved" in PagerDuty does not mean that the underlying issue has been resolved.
In rare cases, the on-call engineer may experience concurrent emergencies triggered by separate customers. If this happens to you, please remember that you are not alone; you need only take the first step in the following process to ensure proper engagement and resolution of each emergency:
Mention your regional support group in Slack (for example, `@support-team-americas`) and request assistance from anyone who is available to assist with the new incoming emergency case.
The RFC "Dealing with concurrent emergencies over the weekend in APAC" (STM#4583) observed an increase in concurrent emergencies over the weekend period. In APAC, the team will trial a pool of Support Engineers volunteering as backup engineers. This pool is independent of the existing escalation policies in PagerDuty, as outlined below:
Pool 1: On-call engineer -> Support Manager on call -> Directors
Pool 2: Backup engineers
For further details, please refer to STM#4583.
During FY23Q4-FY24Q1 in APAC, in the event that a concurrent emergency comes through while you are still working on the current emergency:
At times, an emergency page may come in for a situation that is not quite yet an emergency, but may quickly become one. In this situation, we want to assist the customer in preventing the situation from becoming an emergency.
If this situation arises during the week:
Treat the request as a `high` priority ticket that requires an immediate response.
If this situation arises during the weekend:
Treat the request as a `high` priority ticket that requires an immediate response. Work with the customer to try to resolve or mitigate the issue before it becomes an emergency.
Treat any follow-up as a `high` priority ticket that requires an immediate response.
See more examples of situations that might be emergencies and situations that are not emergencies.
Taking an emergency call isn't significantly different from a normal call, apart from two unique points:
Try to find a colleague to join the call with you. A second person on the call can take notes, search for solutions, raise additional help in Slack, and take on the role of co-host in the event of system or network-related issues. They can also discuss and confirm ideas with you in Slack.
During the call, try to establish a rapport with the customer; empathize with their situation, and set a collaborative tone.
As early as possible, determine your options. In some cases, the best option may be rolling back a change or upgrade. The best option may also involve some loss of production data. If either of those is the case, it's okay to ask the customer if they see any other options before executing that plan.
Before ending an emergency customer call, let the customer know what to do if there is any follow-up, and who will be available if any follow-up is required.
It seems like we've solved the root problem here, but if you need any help I'll be on-call for the next two hours. Feel free to open a new emergency ticket and I'll get back on a call with you right away. If it's after two hours, my colleague Francesca will be responding. I'll make sure that she has the background of the situation before I leave for the day.
When the call has ended:
Apply the Zendesk macro relevant to the customer (for example, `Support::Self-Managed::Post Customer Call`) in a public reply on the ticket.
As soon as the emergency is resolved, mark the emergency ticket as solved. Consider whether an emergency summary should be added in an internal comment. Any follow-up work should be in a separate ticket.
Why do follow-up work in another ticket?
First, remember that your primary role is incident management. You are not expected to have all the answers personally and immediately.
Your primary job is to coordinate the emergency response. That could mean:
It could equally mean:
Remember to say only things that help the customer and that maintain their confidence in you as the person in charge of getting their problem resolved. When you're not sure what to do, you might also be unsure what to say. Here are some phrases that might help:
If you encounter a SaaS emergency over the weekend that you are unable to progress, consider checking whether the on-call CMOC engineer is available to offer help or guidance.
If you are still stuck and are having difficulty finding help, contact the manager on-call or initiate the dev-escalation process.
There may be times when a customer's subscription expires over the weekend, leaving their instance unusable until a new subscription is generated.
For non-trial subscriptions, you can remind the customer that subscriptions have a 14-day grace period. If the grace period will not lapse before the next business day, kindly let the user know that their request will be handled as a standard L&R case during normal business hours. Reduce the priority of the case and inform the L&R team of the issue.
Otherwise, you will need to log in to CustomersDot Admin and generate a short-term (7-14 day) trial license for them by following this workflow. The idea is to get them through the weekend so they can resolve the matter with their CSM, Sales Rep, and the L&R Support team during the regular workweek.
A customer may be blocked because their license has expired or because they neglected to apply a renewal. If you're familiar with L&R workflows, you may solve the case completely by yourself. If you are not, you may:
Use the `Manage GitLab Plan and Trials` option, making sure to add the existing subscription name.
Let `#support_licensing-subscription` know that you've done so, and link to the ticket for follow-up.
The workflow for these calls is the same as with self-managed emergencies: success means that the customer is unblocked. In some cases, you may even be able to fully resolve a customer's problem.
For any customer facing a SaaS Emergency, you are empowered to perform any two-way door action required to unblock them without seeking approval first.
During a SaaS Emergency, you have additional visibility into problems that a customer may be facing.
Broadly, we expect emergencies to fall into one of five categories:
If a customer is reporting that behaviour has recently changed, first check GitLab.com Status and
`#incident-management` for any ongoing incidents. If there's no known incident:
Create a `~"type::bug"` issue and have the customer review it.
If there is a known incident, it's acceptable to link to the public status page and related incident issue. Consider using the macro `Support::SaaS::Incident First Response`.
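For illustration, filing a bug issue for the customer to review can be driven through the GitLab REST API. This sketch only composes the request; the project ID, title, and the `PRIVATE-TOKEN` authentication header (not shown) are placeholders the caller would supply:

```python
def new_bug_issue_request(project_id, title, description):
    """Compose the URL and form parameters for a POST to the GitLab
    issues API (POST /projects/:id/issues) that files a bug issue."""
    url = f"https://gitlab.com/api/v4/projects/{project_id}/issues"
    params = {
        "title": title,
        "description": description,
        "labels": "type::bug",  # the scoped bug label used on GitLab.com
    }
    return url, params

# Hypothetical example values:
url, params = new_bug_issue_request(12345, "500 error on merge",
                                    "Steps to reproduce: ...")
print(url)
```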
A customer may be blocked because they have run out of compute units.
A customer may be blocked because they've exceeded their storage quota.
If an incident occurs on GitLab.com and hasn't been posted on the status page, SaaS customers may raise emergencies in bulk. Success in such a situation is two-fold:
Point customers to `@gitlabstatus` on Twitter and the production incident issue.
If this occurs:
Link to `@gitlabstatus` on Twitter and the production issue. If any of these aren't available yet, you can send a response without them to keep customers informed, and include them in a future update.
Coordinate in `#support_gitlab-com` with others fielding first responses. There are likely non-emergency tickets being raised about the incident; using the same response increases the efficiency with which we can all respond to customer inquiries about the problem.
`priority:urgent order_by:created_at sort:desc` will show all emergency tickets, sorted with the most recently opened first.
`priority:urgent order_by:created_at sort:desc status:new` will show all new emergencies.
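These searches can also be issued against the Zendesk Search API. A minimal sketch that only builds the request URL (the subdomain is a placeholder and authentication is omitted):

```python
import urllib.parse

def emergency_search_url(subdomain, status=None):
    """Build a Zendesk Search API URL for emergency (urgent) tickets,
    newest first. Pass status="new" to narrow to new emergencies."""
    query = "type:ticket priority:urgent order_by:created_at sort:desc"
    if status:
        query += f" status:{status}"
    return (f"https://{subdomain}.zendesk.com/api/v2/search.json?"
            + urllib.parse.urlencode({"query": query}))

print(emergency_search_url("example"))                 # all emergencies
print(emergency_search_url("example", status="new"))   # new emergencies only
```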
At any point, you may acknowledge or resolve PagerDuty alerts. It may be faster to do so through the PagerDuty web interface.
During an incident:
Zendesk Bulk Update is a way to mass edit and respond to tickets. During an incident, you can use it to:
You can bulk edit tickets by:
Clicking "Edit `n` tickets" in the upper right-hand corner.
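The same bulk update can be performed through Zendesk's `update_many` endpoint. This sketch only composes the request; the subdomain and ticket IDs are placeholders, and sending the PUT (with authentication) is left to the caller:

```python
import urllib.parse

def bulk_update_request(subdomain, ticket_ids, comment_body, public=True):
    """Compose a Zendesk update_many request (PUT
    /api/v2/tickets/update_many.json) that posts the same comment to
    several tickets at once. Returns the URL and the JSON body."""
    ids = ",".join(str(t) for t in ticket_ids)
    url = (f"https://{subdomain}.zendesk.com/api/v2/tickets/update_many.json?"
           + urllib.parse.urlencode({"ids": ids}))
    body = {"ticket": {"comment": {"body": comment_body, "public": public}}}
    return url, body

# Hypothetical example: one public update across three incident tickets.
url, body = bulk_update_request("example", [101, 102, 103],
                                "We are aware of an ongoing incident; "
                                "please see the status page for updates.")
print(url)
```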
US Federal on-call support is provided 7 days a week between the hours of 0500 and 1700 Pacific Time.
The current on-call schedule can be viewed in PagerDuty (internal link), or on the Support Team on-call page (GitLab employees only). The schedule is currently split into two six-hour shifts, an AM and a PM shift. The AM shift runs from 0500 to 1100 Pacific Time, and the PM shift runs from 1100 to 1700 Pacific Time.
Customers are permitted to submit emergencies via email or via the emergency form in the US Federal support portal.
If a customer submits an emergency case outside the working hours of Federal Support, the following will occur:
The `Off hours emergency request` trigger will inform the ticket submitter that it is after hours and give them the option to either create an emergency case in Global Support or wait for US Federal Support to follow up at the next start of business hours.
Emergencies from GitLab Dedicated come through the Customer Emergency On Call rotation. The GitLab Dedicated Handbook has information about working with logs and a section on escalating emergency issues.
There are a few cases that require special handling. If an emergency page falls into one of these categories, please follow the special handling instructions below. If you think an emergency is special and not called out below, connect with the Support Manager On-call for help on how best to approach it.
In the event that an emergency is raised about a compromised instance, a call can quickly move well beyond the scope of support.
Use the Zendesk macro `Incident::Compromised Instance`, which expands on the approach below.
The customer should:
Retrieve `gitlab-secrets.json` from a backup, or, if the only backups available are stored in
`/var/opt/gitlab/backups`, mount the volume of the compromised instance to retrieve it.
Change any credentials stored in `gitlab.rb` (such as LDAP/email passwords).
Do not offer or join a call without engaging the Support Manager on-call to align and set expectations with the customer through the ticket.
There have been a few documented cases of folks purchasing a single user GitLab license specifically to raise an emergency. If you encounter such a case, engage the Support manager on-call before offering a call.
The Customer Emergency Shadow Schedule can be used by anyone who wishes to shadow customer emergencies to learn before being Customer Emergency On-Call. To add yourself to the shadow rotation, create an issue using the "Add User to a Rotation" template; to modify your rotation schedule, use the edit user rotation template. To shadow for a short span of days, click Schedule an Override, then Custom duration, then select the time zone and the start and end dates and times before clicking the Create Override button to save the changes. To remove an override, click the x on that override in the list of Upcoming Overrides on the right side of the screen.