Expectations for Support Engineers in the Customer Emergencies Rotation
When you get an alert, you should immediately start a Slack thread and take notes therein. Tag the Technical Account Manager (TAM) - "cc @user" is good enough - if the customer has one (steps here for how to identify TAMs). This creates visibility around the situation and opens the door to let the team join in.
Good notes in Slack help others follow along, and help you with your follow-ups after the call.
Try to communicate complete ideas rather than snippets of thought. Something like "that's not good" as a response to something happening within the call isn't as helpful as "gitaly timings are really high".
Take and share screenshots of useful info the customer is showing you. Make sure you're not sharing anything sensitive. Let the customer know you're taking screenshots: "Could you pause there? I want to screenshot this part to share with my team".
Assume good intent
If the problem stated in the emergency ticket doesn't meet the definition of an emergency support impact, inform the customer's Technical Account Manager or a Support Manager. Unless one of these managers asks you to do otherwise, however, continue to treat the ticket with the emergency SLA.
We assume positive intent from the customer. Even though we may not think a particular ticket qualifies for emergency support, we treat all emergency pages from customers with priority support as if they qualify. During any crisis, the customer may be stressed and have immense pressure on them. Later, after the crisis, if we've determined that the ticket didn't qualify as an emergency, the customer's TAM or a Support Manager can discuss that with the customer.
(Optional) Contact the on-call Support Manager
If at any point you would like advice or help finding additional support, go ahead and contact the on-call Support Manager. The on-call manager is there to support you. They can locate additional Support Engineers if needed. This can make it easier to handle a complex emergency by having more than one person on the call, so you can share responsibilities (e.g., one person takes notes in Slack while the other communicates verbally on the call). Managers are on-call during weekends, so you can page for help at any time.
Respond to PagerDuty alerts
When an emergency is triggered, you will receive an alert from PagerDuty. This could be a text, phone call, email, Slack message, or a combination of those (depending on your PagerDuty notification preferences).
Acknowledge the alert in PagerDuty or Slack. This means that you received the emergency page, and are starting the response process.
Most PagerDuty notification formats provide a direct link to the ticket.
Alternatively, use Zendesk search with the term priority:urgent to find the ticket.
Create a Public Comment in the ticket acknowledging the emergency; offer a Zoom call to the customer if appropriate to the reported situation. A SaaS emergency related to a public incident published on the status page, for example, would not warrant a call.
Example of self-managed emergency ticket which was resolved without a call: https://gitlab.zendesk.com/agent/tickets/148028
Only Resolve the PagerDuty alert after you have contacted the customer. This means that you are actively handling the emergency now and will see it through.
Start a thread in #support_self-managed or #support_gitlab-com with the ticket link. "Thread for emergency ticket LINK HERE".
OPTIONAL: Consult our definitions of support impact and evaluate the customer's problem statement against the "Emergency" definition there. Even if you don't think that this qualifies as an emergency, follow the guidance given in the Assume Good Intent section.
After 15 minutes, if the customer has not responded to our initial contact with them, send a follow up message covering the following points:
The bridge created to work on the emergency.
If the customer is not able to join immediately, we can make other arrangements.
After another 15 minutes without response the bridge will be closed and the ticket will be assigned a HIGH priority.
Feel free to open a new emergency request if the need arises.
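The follow-up timing above can be sketched in a few lines of Python. This is only an illustration of the two 15-minute windows; the function name `follow_up_schedule` and the dictionary keys are hypothetical and not part of any existing tooling.

```python
from datetime import datetime, timedelta

# Both windows come from the process above: 15 minutes before the
# follow-up message, then another 15 minutes before the bridge closes.
FOLLOW_UP_DELAY = timedelta(minutes=15)
CLOSE_DELAY = timedelta(minutes=15)


def follow_up_schedule(initial_contact: datetime) -> dict:
    """Return when to send the follow-up and when to close the bridge,
    assuming the customer never responds."""
    follow_up_at = initial_contact + FOLLOW_UP_DELAY
    close_bridge_at = follow_up_at + CLOSE_DELAY
    return {"follow_up_at": follow_up_at, "close_bridge_at": close_bridge_at}


# Example: initial contact sent at 14:00.
schedule = follow_up_schedule(datetime(2024, 1, 6, 14, 0))
print(schedule["follow_up_at"].strftime("%H:%M"))     # 14:15
print(schedule["close_bridge_at"].strftime("%H:%M"))  # 14:30
```

In this sketch, the bridge closes (and the ticket drops to HIGH priority) 30 minutes after the initial contact if the customer stays silent.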
NOTE: If you need to reach the current on-call engineer and they're not accessible on Slack (e.g., it's a weekend, or the end of a shift), you can manually trigger a PagerDuty incident to get their attention, selecting Customer Support as the Impacted Service and assigning it to the relevant Support Engineer.
Triggered - "A customer has requested the attention of the on-call engineer"
Acknowledged - "I have seen the page and am reviewing the ticket"
Resolved - "I've engaged with the customer by sending a reply to the emergency ticket"
NB: "Resolved" in PagerDuty does not mean the underlying issue has been resolved.
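One way to keep the three PagerDuty states straight is to model them as a small state machine. The sketch below is purely illustrative: `AlertState` and `next_state` are hypothetical names, not PagerDuty API objects, and the rule encoded is the one from this page (resolve only after the customer has been contacted).

```python
from enum import Enum


class AlertState(Enum):
    # Meanings are quoted from the definitions above.
    TRIGGERED = "A customer has requested the attention of the on-call engineer"
    ACKNOWLEDGED = "I have seen the page and am reviewing the ticket"
    RESOLVED = "I've engaged with the customer by sending a reply to the emergency ticket"


def next_state(state: AlertState, customer_contacted: bool) -> AlertState:
    """Advance an alert by one step, following the on-call expectations."""
    if state is AlertState.TRIGGERED:
        # Acknowledging only says the page was received.
        return AlertState.ACKNOWLEDGED
    if state is AlertState.ACKNOWLEDGED and customer_contacted:
        # Resolve only once the customer has actually been contacted.
        return AlertState.RESOLVED
    return state


state = next_state(AlertState.TRIGGERED, customer_contacted=False)
print(state.name)  # ACKNOWLEDGED
state = next_state(state, customer_contacted=False)
print(state.name)  # ACKNOWLEDGED (still waiting on the first reply)
state = next_state(state, customer_contacted=True)
print(state.name)  # RESOLVED
```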
Handling multiple simultaneous emergencies
In rare cases, the on-call engineer may experience concurrent emergencies triggered by separate customers. If this happens to you, please remember that you are not alone; you need only take the first step in the following process to ensure proper engagement and resolution of each emergency:
You: Contact the on-call Support Manager to inform them of the new incoming emergency. The Support Manager is responsible for finding an engineer to own the new emergency page.
Support Manager: In Slack, ping the regional support group (e.g., @support-team-americas) and request assistance from anyone who is available to assist with the new incoming emergency case.
Second Support Engineer: Acknowledge and resolve the emergency page to indicate that you are assisting the customer with the case.
Taking an emergency customer call
Taking an emergency call isn't significantly different from a normal call outside of two unique points:
You (likely) won't have much forewarning about the subject of the call
The desired end state is a functioning system
Try to find a colleague to join the call with you. A second person on the call can take notes, search for solutions, and raise additional help in Slack. They can also discuss and confirm ideas with you in Slack.
During the call, try to establish a rapport with the customer; empathize with their situation, and set a collaborative tone.
As early as possible, determine your options. In some cases, the best option may be rolling back a change or upgrade. The best option may also involve some loss of production data. If either of those is the case, it's okay to ask the customer if they see any other options before executing that plan.
Before ending an emergency customer call, let the customer know what to do if there is any follow-up, and who will be available if any follow-up is required.
It seems like we've solved the root problem here, but if you need any help I'll be on-call for the next two hours. Feel free to open a new emergency ticket and I'll get back on a call with you right away. If it's after two hours, my colleague Francesca will be responding. I'll make sure that she has the background of the situation before I leave for the day.
Add all relevant internal-only information as an internal note on the ticket.
Tag the next on-call engineer in the emergency's Slack thread.
What to do if you don't know what to do
First, remember that your primary role is incident management. You are not expected to have all the answers personally and immediately.
Your primary job is to coordinate the emergency response. That could mean:
directing the customer to take specific actions
finding relevant documentation or doing other research into the problem
identifying a known bug or regression and providing a workaround
analyzing log data
It could equally mean:
identifying other experts on the Support team to help do the above
reaching out to development teams to find a subject matter expert (SME)
suggesting that the customer reach out to additional experts on their side (for example, if the problem is slow storage, you might suggest getting someone from their storage team)
Remember to say only things that help the customer and that maintain their confidence in you as the person in charge of getting their problem resolved. When you're not sure what to do, you might also be unsure what to say. Here are some phrases that might help:
What have you done up until now to try to resolve this?
Please give me a few minutes to check the documentation on that.
I'm doing some research to find the answer to that; please give me a few minutes.
I'm working on finding someone who has specific expertise in this area.
I don't know the answer just yet, but I'm here for you and I will use all the resources at my disposal to get this resolved.
If you encounter a SaaS emergency at the weekend that you are unable to progress, then consider checking if the CMOC engineer on call is available to offer any help or guidance.
The workflow for these calls is the same as with self-managed emergencies: success means that the customer is unblocked. In some cases, you may even be able to fully resolve a customer's problem.
For any customer facing a SaaS Emergency, you are empowered to perform any two-way door action required to unblock them without seeking approval first. Examples include:
manually setting a subscription level
adding additional storage
adding extra CI minutes
toggling a feature flag
During a SaaS Emergency, you have additional visibility into problems that a customer may be facing.
Using Kibana - explore GitLab.com log files to find the errors customers are encountering.
Using Sentry - get access to the full stacktrace of errors a customer might encounter.
We expect that emergencies will broadly fall into one of five categories:
broken functionality due to a regression being pushed to GitLab.com
Success may mean: reproducing, identifying or creating a bug report and escalating to have a patch created and deployed.
broken functionality due to an inconsistency in data unique to the customer, for example: a group name used to be able to have special characters in it, and now something broke because the customer's group name has a special character in it.
Success may mean: reproducing the error, identifying it in Sentry/Kibana, and escalating to have the specific data corrected (and creating a bug report so our code is better).
GitLab.com access or "performance" degradation to the level of unusability, for example: no access in a geographical area, CI jobs aren't being dispatched. This is the hardest class, but will generally be operational emergencies.
Success here means making sure it's not actually one of the first two categories before declaring an incident and letting the SRE team diagnose and correct the root cause.
License / Consumption issues are preventing access to the product
Success here means getting the customer into a state where they're unblocked and making sure the license team is equipped to take the handover.
a widespread incident causes multiple, successive PagerDuty alerts
Success here means tagging and bulk responding to the issues pointing to the GitLab.com Status Page and production issue.
If a customer is reporting that behaviour has recently changed, first check GitLab.com Status and #incident-management for any ongoing incidents. If there's no known incident:
Initiate a call with the customer. You're specifically looking to:
observe broken behavior.
determine if there's a known issue, bug report, or other customers reporting similar behavior.
If there is no update on the status page yet, advocate for urgency with the CMOC so that you can point to it in responses.
Choose a unique tag that will help you identify tickets. Using the incident number is typical, for example: incident-12345
Create a bulk response that points to the incident on the status page, @gitlabstatus on Twitter and the production issue. If any of these aren't available yet, you can send a response without them to keep customers informed, and include them in a future update.
Share the response that you draft or otherwise coordinate with #support_gitlab-com and others fielding first responses. There are likely non-emergency tickets being raised about the incident. Using the same response increases the efficiency with which we can all respond to customer inquiries about the problem.
Create the tag by typing it into the tag field of at least one ticket and submitting it - if you don't, it won't show as available in the bulk edit view of Zendesk.
Use Zendesk search to identify customer-raised emergencies:
At any point, you may acknowledge or resolve PagerDuty alerts. It may be faster to do so through the PagerDuty web interface.
During an incident:
If there is no production issue to link to yet: let customers know we are actively working to address the problem and that we will follow-up with a link to a tracking issue as soon as one is created. Set the ticket to Open. Once the issue is available, send a follow-up note letting the customer know that they should follow along with the issue and that we are marking the ticket as Solved. Include a note that they should reply if they still have trouble once the production issue has been closed / the incident has been declared resolved.
If there is a production issue to link to: let customers know we are actively working to address the problem, that they should follow along at the issue, that we are marking the ticket as Solved and they should reply if they still have trouble once the production issue has been closed / the incident has been declared resolved.
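The tagging and reply rules above can be summarised as a small decision function. This is a minimal sketch, not an existing tool: `incident_tag` and `plan_reply` are hypothetical helpers, the message wording is only an example, and the URL in the usage lines is a placeholder.

```python
from typing import Optional


def incident_tag(incident_number: int) -> str:
    """Build a ticket tag such as incident-12345 from the incident number."""
    return f"incident-{incident_number}"


def plan_reply(production_issue_url: Optional[str]) -> dict:
    """Pick the ticket status and reply text based on whether a
    production issue exists yet."""
    if production_issue_url is None:
        # No issue to link yet: keep the ticket Open and promise a follow-up.
        return {
            "status": "open",
            "message": (
                "We are actively working to address the problem and will "
                "follow up with a link to a tracking issue as soon as one "
                "is created."
            ),
        }
    # Issue exists: point the customer at it and mark the ticket Solved.
    return {
        "status": "solved",
        "message": (
            "We are actively working to address the problem. Please follow "
            f"along at {production_issue_url}. We are marking this ticket as "
            "Solved; please reply if you still have trouble once the "
            "incident has been declared resolved."
        ),
    }


print(incident_tag(12345))                           # incident-12345
print(plan_reply(None)["status"])                    # open
print(plan_reply("https://example.com/issue")["status"])  # solved
```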
Using Zendesk Bulk Update
Zendesk Bulk Update is a way to mass edit and respond to tickets. During an incident, you can use it to:
automatically tag tickets
send a bulk response
set status en masse
To bulk edit tickets:
From a Zendesk search, click one or more checkboxes
Click "Edit n tickets" in the upper right-hand corner
Edit the properties of the ticket you'd like to update. During an incident that will probably be:
A public reply
A ticket tag
Click Submit with the appropriate status change
US Federal on-call
US Federal on-call support is provided 7 days a week between the hours of 0500 and 1700 Pacific Time.
The current on-call schedule can be viewed in PagerDuty (Internal Link), or in the Support Team on-call page (GitLab Employees only). The schedule is currently split into two 6-hour shifts, an AM shift and a PM shift. The AM shift starts at 0500 Pacific Time and runs until 1100 Pacific Time. The PM shift starts at 1100 Pacific Time and runs until 1700 Pacific Time.
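As a quick illustration of the shift boundaries, the sketch below maps a Pacific Time clock reading to the shift that covers it. The function name `federal_shift` is hypothetical, the caller is assumed to have already converted the time to Pacific Time, and treating each end time as exclusive (so 1100 belongs to the PM shift and 1700 is outside coverage) is an assumption.

```python
from datetime import time
from typing import Optional

AM_START = time(5, 0)    # 0500 Pacific Time
PM_START = time(11, 0)   # 1100 Pacific Time
PM_END = time(17, 0)     # 1700 Pacific Time


def federal_shift(pacific_time: time) -> Optional[str]:
    """Return "AM", "PM", or None when outside US Federal on-call hours.

    Assumes pacific_time is already expressed in Pacific Time.
    """
    if AM_START <= pacific_time < PM_START:
        return "AM"
    if PM_START <= pacific_time < PM_END:
        return "PM"
    return None


print(federal_shift(time(6, 30)))   # AM
print(federal_shift(time(11, 0)))   # PM
print(federal_shift(time(18, 0)))   # None
```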