Support Engineers in the Customer Emergencies rotation coordinate operational emergencies from GitLab customers.
The Customer Emergencies rotation is one of the rotations that make up GitLab Support On-call.
When you get an alert, you should immediately start a Slack thread and take notes therein. Tag the Technical Account Manager (TAM) - "cc @user" is good enough - if the customer has one (steps here for how to identify TAMs). This creates visibility around the situation and opens the door to let the team join in.
Good notes in Slack help others follow along, and help you with your follow-ups after the call.
Try to communicate complete ideas rather than snippets of thought. Something like "that's not good" as a response to something happening within the call isn't as helpful as "gitaly timings are really high".
Take and share screenshots of useful info the customer is showing you. Make sure you're not sharing anything sensitive. Let the customer know you're taking screenshots: "Could you pause there? I want to screenshot this part to share with my team".
If the problem stated in the emergency ticket doesn't meet the definition of an emergency support impact, inform the customer's Technical Account Manager or a Support Manager. Unless one of these managers asks you to do otherwise, however, continue to treat the ticket with the emergency SLA.
We assume positive intent from the customer. Even though we may not think a particular ticket qualifies for emergency support, we treat all emergency pages from customers with priority support as if they qualify. During any crisis, the customer may be stressed and have immense pressure on them. Later, after the crisis, if we've determined that the ticket didn't qualify as an emergency, the customer's TAM or a Support Manager can discuss that with the customer.
If at any point you would like advice or help finding additional support, go ahead and contact the on-call Support Manager. The on-call manager is there to support you. They can locate additional Support Engineers if needed. This can make it easier to handle a complex emergency by having more than one person on the call, so you can share responsibilities (e.g., one person takes notes in Slack while the other communicates verbally on the call). Managers are on-call during weekends, so you can page for help at any time.
priority: urgent to find the ticket.
solved. Otherwise, lower the priority.
#support_gitlab-com with the ticket link. "Thread for emergency ticket LINK HERE".
NOTE: If you need to reach the current on-call engineer and they're not accessible on Slack (e.g., it's a weekend, or the end of a shift), you can manually trigger a PagerDuty incident to get their attention, selecting Customer Support as the Impacted Service and assigning it to the relevant Support Engineer.
NB: "Resolved" in PagerDuty does not mean the underlying issue has been resolved.
In rare cases, the on-call engineer may experience concurrent emergencies triggered by separate customers. If this happens to you, please remember that you are not alone; you need only take the first step in the following process to ensure proper engagement and resolution of each emergency:
@support-team-americas) and request assistance from anyone who is available to assist with the new incoming emergency case.
Taking an emergency call isn't significantly different from a normal call outside of two unique points:
Try to find a colleague to join the call with you. A second person on the call can take notes, search for solutions, raise additional help in Slack, and take on the role of co-host in the event of system or network-related issues. They can also discuss and confirm ideas with you in Slack.
During the call, try to establish a rapport with the customer; empathize with their situation, and set a collaborative tone.
As early as possible, determine your options. In some cases, the best option may be rolling back a change or upgrade. The best option may also involve some loss of production data. If either of those is the case, it's okay to ask the customer if they see any other options before executing that plan.
Before ending an emergency customer call, let the customer know what to do if there is any follow-up, and who will be available if any follow-up is required.
It seems like we've solved the root problem here, but if you need any help I'll be on-call for the next two hours. Feel free to open a new emergency ticket and I'll get back on a call with you right away. If it's after two hours, my colleague Francesca will be responding. I'll make sure that she has the background of the situation before I leave for the day.
When the call has ended:
Support::Self-Managed::Post Customer Call) relevant to the customer in a public reply on the ticket.
First, remember that your primary role is incident management. You are not expected to have all the answers personally and immediately.
Your primary job is to coordinate the emergency response. That could mean:
It could equally mean:
Remember to say only things that help the customer and that maintain their confidence in you as the person in charge of getting their problem resolved. When you're not sure what to do, you might also be unsure what to say. Here are some phrases that might help:
If you encounter a SaaS emergency at the weekend that you are unable to progress, then consider checking if the CMOC engineer on call is available to offer any help or guidance.
There may be times when a customer's subscription expires over the weekend, leaving their instance unusable until a new subscription is generated.
For non-trial subscriptions, you can remind the customer that subscriptions have a 14-day grace period. If the grace period will still be in effect on the next business day, kindly let the user know that their request will be handled as a standard L&R case during normal business hours. You should reduce the priority of the case and inform the L&R team of the issue.
Otherwise, you will need to log in to CustomersDot Admin and generate a short-term (7-14 days) trial license for them by following this workflow. The idea is to get them through the weekend so they can solve this with their TAM, Sales Rep, and the L&R Support team during the regular workweek.
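The weekend decision above comes down to a date comparison. The following is a minimal sketch of that check, not an official tool. The function names are hypothetical, and two things are assumptions: that the grace period is counted from the expiration date and includes its last day, and that the next business day is simply the next weekday (public holidays are ignored).

```python
from datetime import date, timedelta

# From the text: non-trial subscriptions have a 14-day grace period.
GRACE_PERIOD_DAYS = 14


def next_business_day(today: date) -> date:
    """Return the first weekday strictly after `today` (holidays are ignored)."""
    candidate = today + timedelta(days=1)
    while candidate.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        candidate += timedelta(days=1)
    return candidate


def weekend_license_action(expired_on: date, today: date) -> str:
    """Suggest how to handle an expired non-trial subscription (hypothetical helper)."""
    # Assumption: the grace period runs for 14 days from the expiration date, inclusive.
    grace_ends = expired_on + timedelta(days=GRACE_PERIOD_DAYS)
    if next_business_day(today) <= grace_ends:
        # Still inside the grace period when L&R is back online.
        return "handle as standard L&R case"
    # Grace period lapses before the next business day.
    return "generate short-term trial license"
```

For example, a subscription that expired on the Saturday the page arrives is still well inside its grace period on Monday, so the sketch suggests the standard L&R path. One that expired more than two weeks earlier is not, so it suggests the trial license.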
A customer may be blocked because a license has expired or because they neglected to apply a renewal. If you're familiar with L&R Workflows, you may solve the case completely by yourself. If you are not, you may:
#support_licensing-subscription that you've done so and link to the ticket for follow-up.
The workflow for these calls is the same as with self-managed emergencies: success means that the customer is unblocked. In some cases, you may even be able to fully resolve a customer's problem.
For any customer facing a SaaS Emergency, you are empowered to perform any two-way door action required to unblock them without seeking approval first.
During a SaaS Emergency, you have additional visibility into problems that a customer may be facing.
Broadly, we expect emergencies to fall into one of five categories:
If a customer is reporting that behaviour has recently changed, first check GitLab.com Status and #incident-management for any ongoing incidents. If there's no known incident:
~"type::bug" issue and have the customer review it.
If there is a known incident, it's acceptable to link to the public status page and related incident issue. Consider using Support::SaaS::Incident First Response.
A customer may be blocked because they have run out of CI minutes.
A customer may be blocked because they've exceeded their storage quota.
If an incident occurs on GitLab.com and hasn't been posted on the status page, SaaS customers may raise emergencies in bulk. Success in such a situation is two-fold:
@gitlabstatus on Twitter and the production incident issue.
If this occurs:
@gitlabstatus on Twitter and the production issue. If any of these aren't available yet, you can send a response without them to keep customers informed. You can include them in a future update.
#support_gitlab-com and others fielding first responses. There are likely non-emergency tickets being raised about the incident. Using the same response increases the efficiency with which we can all respond to customer inquiries about the problem.
priority:urgent order_by:created_at sort:desc will show all emergency tickets, sorted by those opened most recently
priority:urgent order_by:created_at sort:desc status:new will show all new emergencies
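The two searches above differ only in the optional status filter. As a small illustration, the sketch below builds both search strings from the same terms and URL-encodes them, for example to paste into a browser bookmark. The helper names are hypothetical. Only the search terms themselves come from the text.

```python
from typing import Optional
from urllib.parse import quote_plus


def emergency_search(status: Optional[str] = None) -> str:
    """Build the search string for emergency tickets, newest first (hypothetical helper)."""
    terms = ["priority:urgent", "order_by:created_at", "sort:desc"]
    if status is not None:
        # e.g. status "new" narrows the results to new emergencies only
        terms.append(f"status:{status}")
    return " ".join(terms)


def url_encoded(search: str) -> str:
    """URL-encode a search string so it can be used as a query parameter value."""
    return quote_plus(search)


all_emergencies = emergency_search()
new_emergencies = emergency_search(status="new")
```

Keeping the shared terms in one place makes it harder for the two searches to drift apart if the sort order is ever changed.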
At any point, you may acknowledge or resolve PagerDuty alerts. It may be faster to do so through the PagerDuty web interface.
During an incident:
Zendesk Bulk Update is a way to mass edit and respond to tickets. During an incident, you can use it to:
You can bulk edit tickets by:
ntickets" in the upper right-hand corner
US Federal on-call support is provided 7 days a week between the hours of 0500 and 1700 Pacific Time.
The current on-call schedule can be viewed in PagerDuty (Internal Link), or in the Support Team on-call page (GitLab Employees only). The schedule is currently split into two 6-hour shifts, an AM shift and a PM shift. The AM shift starts at 0500 Pacific Time and runs until 1100 Pacific Time. The PM shift starts at 1100 Pacific Time and runs until 1700 Pacific Time.
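As an illustration of the shift boundaries described above, here is a minimal sketch that maps a Pacific Time clock reading to the covering shift. The function name is hypothetical. Treating each shift as starting on its boundary and ending just before the next one (so 1100 belongs to the PM shift) is an assumption, and PagerDuty remains the source of truth for who is actually on call.

```python
from datetime import time
from typing import Optional

# Shift boundaries from the text, expressed in Pacific Time.
AM_START = time(5, 0)
PM_START = time(11, 0)
PM_END = time(17, 0)


def us_federal_shift(pacific_time: time) -> Optional[str]:
    """Return "AM", "PM", or None when outside US Federal on-call hours."""
    if AM_START <= pacific_time < PM_START:
        return "AM"
    if PM_START <= pacific_time < PM_END:
        return "PM"
    # Outside 0500-1700 Pacific Time there is no US Federal on-call coverage.
    return None
```

The caller is expected to convert the current time to Pacific Time first. The sketch deliberately takes a plain clock time so that it does not depend on time zone data being installed.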
There are a few cases that require special handling. If an emergency page falls into one of these categories, please follow these special handling instructions. If you think an emergency is special and not called out below, connect with the Support Manager on-call for help on how best to approach it.
If an emergency is raised about a compromised instance, a call can quickly move well beyond the scope of support.
Use the Zendesk macro Incident::Compromised Instance, which expands on the approach below.
The customer should:
gitlab-secrets.json, or if the only backups available are stored in /var/opt/gitlab/backups, mount the volume of the compromised instance to retrieve it.
gitlab.rb (such as LDAP/email passwords)
Do not offer or join a call without engaging the Support Manager on-call to align and set expectations with the customer through the ticket.
There have been a few documented cases of folks purchasing a single-user GitLab license specifically to raise an emergency. If you encounter such a case, engage the Support Manager on-call before offering a call.