Support Engineers in the Customer Emergencies rotation coordinate operational emergencies from GitLab customers.
When you get an alert, you should immediately start a Slack thread and take notes therein. Tag the Technical Account Manager (TAM) - "cc @user" is good enough - if the customer has one (steps here for how to identify TAMs). This creates visibility around the situation and opens the door to let the team join in.
Good notes in Slack help others follow along, and help you with your follow-ups after the call.
Try to communicate complete ideas rather than snippets of thought. Something like "that's not good" as a response to something happening within the call isn't as helpful as "gitaly timings are really high".
Take and share screenshots of useful info the customer is showing you. Make sure you're not sharing anything sensitive. Let the customer know you're taking screenshots: "Could you pause there? I want to screenshot this part to share with my team".
If the problem stated in the emergency ticket doesn't meet the definition of an emergency support impact, inform the customer's Technical Account Manager or a Support Manager. Unless one of these managers ask you to do otherwise, however, continue to treat the ticket with the emergency SLA.
We assume positive intent from the customer. Even though we may not think a particular ticket qualifies for emergency support, we treat all emergency pages from customers with priority support as if they qualify. During any crisis, the customer may be stressed and have immense pressure on them. Later, after the crisis, if we've determined that the ticket didn't qualify as an emergency, the customer's TAM or a Support Manager can discuss that with the customer.
priority: urgent
to find the ticket.#support_self-managed
or #support_gitlab-com
with the ticket link. "Thread for emergency ticket LINK HERE".HIGH
priority.NOTE: If you need to reach the current on-call engineer and they're not accessible on Slack (e.g., it's a weekend, or the end of a shift), you can manually trigger a PagerDuty incident to get their attention, selecting Customer Support as the Impacted Service and assigning it to the relevant Support Engineer.
NB: "Resolved" in PagerDuty does not mean the underlying issue has been resolved.
In rare cases, the on-call engineer may experience concurrent emergencies triggered by separate customers. If this happens to you, please remember that you are not alone; you need only take the first step in the following process to ensure proper engagement and resolution of each emergency:
@support-team-americas
) and request assistance from anyone who is available to assist with the new incoming emergency case.Taking an emergency call isn't significantly different from a normal call outside of two unique points:
Try to find a colleague to join the call with you. A second person on the call can take notes, search for solutions, and raise additional help in Slack. They can also discuss and confirm ideas with you in Slack.
During the call, try to establish a rapport with the customer; empathize with their situation, and set a collaborative tone.
As early as possible, determine your options. In some cases, the best option may be rolling back a change or upgrade. The best option may also involve some loss of production data. If either of those is the case, it's okay to ask the customer if they see any other options before executing that plan.
Before ending an emergency customer call, let the customer know what to do if there is any follow-up, and who will be available if any follow-up is required.
For example:
It seems like we've solved the root problem here, but if you need any help I'll be on-call for the next two hours. Feel free to open a new emergency ticket and I'll get back on a call with you right away. If it's after two hours, my colleague Francesca will be responding. I'll make sure that she has the background of the situation before I leave for the day.
When the call has ended:
Support::Self-Managed::Post Customer Call
) relevant to the customer in a public reply on the ticket.First, remember that your primary role is incident management. You are not expected to have all the answers personally and immediately.
Your primary job is to coordinate the emergency response. That could mean:
It could equally mean:
Remember to say only things that help the customer and that maintain their confidence in you as the person in charge of getting their problem resolved. When you're not sure what to do, you might also be unsure what to say. Here are some phrases that might help:
If you're stuck and are having difficulty finding help, contact the manager on-call or initiate the dev-escalation process.
The workflow for these calls is the same as with self-managed emergencies. However, you have additional visibility into problems that a customer may be facing that they will not.
Review:
After you have identified the error and found reproduction steps, it's likely that you'll need to declare an incident and coordinate with incident management team to reach resolution. If the error is a result of a product defect, you may also need to engage the InfraDev Escalation Process.
We're expecting, broadly that emergencies will fall into one of three categories:
US Federal on-call support is provided 7 days a week between the hours of 0500 and 1700 Pacific Time.
The current on-call schedule can be viewed in PagerDuty(Internal Link), or in the Support Team on-call page(Public Link). The schedule is currently split into two, 6 hour shifts, an AM and a PM shift. The AM shift starts at 0500 Pacific Time and runs until 1100 Pacific Time. The PM shift starts at 1100 Pacific Time and runs until 1700 Pacific Time.