
How to Perform Customer Emergencies Duties


Introduction

Support Engineers in the Customer Emergencies rotation coordinate operational emergencies from GitLab customers.

The Customer Emergencies rotation is one of the rotations that make up GitLab Support On-call.

Expectations for Support Engineers in the Customer Emergencies Rotation

Communicate

When you get an alert, you should immediately start a Slack thread and take notes therein. Tag the Technical Account Manager (TAM) - "cc @user" is good enough - if the customer has one (steps here for how to identify TAMs). This creates visibility around the situation and opens the door to let the team join in.

Good notes in Slack help others follow along, and help you with your follow-ups after the call.

Try to communicate complete ideas rather than snippets of thought. Something like "that's not good" as a response to something happening within the call isn't as helpful as "gitaly timings are really high".

Take and share screenshots of useful info the customer is showing you. Make sure you're not sharing anything sensitive. Let the customer know you're taking screenshots: "Could you pause there? I want to screenshot this part to share with my team".

Assume good intent

If the problem stated in the emergency ticket doesn't meet the definition of an emergency support impact, inform the customer's Technical Account Manager or a Support Manager. Unless one of these managers asks you to do otherwise, however, continue to treat the ticket with the emergency SLA.

We assume positive intent from the customer. Even though we may not think a particular ticket qualifies for emergency support, we treat all emergency pages from customers with priority support as if they qualify. During any crisis, the customer may be stressed and have immense pressure on them. Later, after the crisis, if we've determined that the ticket didn't qualify as an emergency, the customer's TAM or a Support Manager can discuss that with the customer.

(Optional) Contact the on-call Support Manager

If at any point you would like advice or help finding additional support, go ahead and contact the on-call Support Manager. The on-call manager is there to support you. They can locate additional Support Engineers if needed. Having more than one person on the call can make a complex emergency easier to handle, because you can share responsibilities (e.g., one person takes notes in Slack while the other communicates verbally on the call). Managers are on-call during weekends, so you can page for help at any time.

Respond to PagerDuty alerts

  1. When an emergency is triggered, you will receive an alert from PD. This could be a text, phone call, email, Slack message, or a combination of those (depending on your PagerDuty notification preferences).
  2. Acknowledge the alert in PagerDuty or Slack. This means that you received the emergency page, and are starting the response process.
  3. OPTIONAL: Create a new Issue using the Emergency Runbook Issue Template, to guide you through the emergency response process for Customer Emergency tickets.
  4. Open the Zendesk ticket.
    1. Most PagerDuty notification formats provide a direct link to the ticket.
    2. Alternatively, use Zendesk search with the term priority:urgent to find the ticket.
  5. If you are simultaneously on an FRT or SLA Hawk shift, ask in Slack for someone to take over those duties.
  6. Create a Public Comment in the ticket acknowledging the emergency; offer a Zoom call to the customer.
  7. Only Resolve the PagerDuty alert after you have contacted the customer. This means that you are actively handling the emergency now and will see it through.
  8. Start a thread in #support_self-managed or #support_gitlab-com with the ticket link. "Thread for emergency ticket LINK HERE".
  9. OPTIONAL: Consult our definitions of support impact and evaluate the customer's problem statement against the "Emergency" definition there. Even if you don't think that this qualifies as an emergency, follow the guidance given in the Assume Good Intent section.
  10. After 15 minutes, if the customer has not responded to our initial contact with them, send a follow-up message covering the following points:
    • The bridge created to work on the emergency.
    • If the customer is not able to join immediately, we can make other arrangements.
    • After another 15 minutes without response the bridge will be closed and the ticket will be assigned a HIGH priority.
    • Feel free to open a new emergency request if the need arises.
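
The timing in step 10 can be sketched as simple date arithmetic. This is a minimal illustration, not part of the official workflow: the function name and dictionary keys are hypothetical, and it assumes the follow-up message is sent exactly 15 minutes after the initial contact.

```python
from datetime import datetime, timedelta, timezone

# Both intervals come from step 10 above.
FOLLOW_UP_AFTER = timedelta(minutes=15)        # wait before the follow-up message
CLOSE_AFTER_FOLLOW_UP = timedelta(minutes=15)  # "another 15 minutes without response"


def emergency_deadlines(first_contact: datetime) -> dict[str, datetime]:
    """Return when to send the follow-up and when to close the bridge.

    Assumption: the follow-up is sent exactly when it becomes due.
    """
    follow_up_at = first_contact + FOLLOW_UP_AFTER
    return {
        "send_follow_up": follow_up_at,
        "close_bridge_and_set_high": follow_up_at + CLOSE_AFTER_FOLLOW_UP,
    }


# Example: initial public comment sent at 14:00 UTC.
first_contact = datetime(2024, 1, 6, 14, 0, tzinfo=timezone.utc)
deadlines = emergency_deadlines(first_contact)
for step, when in deadlines.items():
    print(f"{step}: {when:%H:%M} UTC")
```

If the follow-up goes out late, the second interval should be counted from the time it was actually sent.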

NOTE: If you need to reach the current on-call engineer and they're not accessible on Slack (e.g., it's a weekend, or the end of a shift), you can manually trigger a PagerDuty incident to get their attention, selecting Customer Support as the Impacted Service and assigning it to the relevant Support Engineer.
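
The note above describes triggering the incident by hand in PagerDuty. For readers who script this instead, the sketch below only builds a request body in the shape that, to the best of my understanding, the PagerDuty REST API's create-incident endpoint expects. It sends nothing. The function name and the service and user IDs are hypothetical placeholders, not values from this handbook.

```python
import json


def build_manual_page(title: str, service_id: str, assignee_id: str) -> dict:
    """Build a create-incident body that pages one specific engineer.

    service_id should be the ID of the Customer Support service and
    assignee_id the ID of the on-call engineer. Both are assumptions
    here and must be looked up in your own PagerDuty account.
    """
    return {
        "incident": {
            "type": "incident",
            "title": title,
            "service": {"id": service_id, "type": "service_reference"},
            "assignments": [
                {"assignee": {"id": assignee_id, "type": "user_reference"}}
            ],
        }
    }


# Hypothetical IDs, for illustration only.
page = build_manual_page(
    "Need the on-call engineer for an emergency ticket",
    service_id="PSERVICE",
    assignee_id="PUSER",
)
print(json.dumps(page, indent=2))
```

Sending this body would additionally require an API token and the requester's email address in the request headers, neither of which is shown here.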

PagerDuty Status

NB: "Resolved" in PagerDuty does not mean the underlying issue has been resolved.

Handling multiple simultaneous emergencies

In rare cases, the on-call engineer may experience concurrent emergencies triggered by separate customers. If this happens to you, please remember that you are not alone; you need only take the first step in the following process to ensure proper engagement and resolution of each emergency:

  1. You: Contact the on-call Support Manager to inform them of the new incoming emergency. The Support Manager is responsible for finding an engineer to own the new emergency page.
  2. Support Manager: In Slack, ping the regional support group (e.g. @support-team-americas) and request assistance from anyone who is available to assist with the new incoming emergency case.
  3. Second Support Engineer: Acknowledge and resolve the emergency page to indicate that you are assisting the customer with the case.

Taking an emergency customer call

Taking an emergency call isn't significantly different from a normal call outside of two unique points:

Try to find a colleague to join the call with you. A second person on the call can take notes, search for solutions, and raise additional help in Slack. They can also discuss and confirm ideas with you in Slack.

During the call, try to establish a rapport with the customer; empathize with their situation, and set a collaborative tone.

As early as possible, determine your options. In some cases, the best option may be rolling back a change or upgrade. The best option may also involve some loss of production data. If either of those is the case, it's okay to ask the customer if they see any other options before executing that plan.

Post-call

Before ending an emergency customer call, let the customer know what to do if there is any follow-up, and who will be available if any follow-up is required.

For example:

It seems like we've solved the root problem here, but if you need any help I'll be on-call for the next two hours. Feel free to open a new emergency ticket and I'll get back on a call with you right away. If it's after two hours, my colleague Francesca will be responding. I'll make sure that she has the background of the situation before I leave for the day.

When the call has ended:

  1. Write post-call notes (using macro Support::Self-Managed::Post Customer Call) relevant to the customer in a public reply on the ticket.
  2. Add all relevant internal-only information as an internal note on the ticket.
  3. Tag the next on-call engineer in the emergency's Slack thread.

What to do if you don't know what to do

First, remember that your primary role is incident management. You are not expected to have all the answers personally and immediately.

Your primary job is to coordinate the emergency response. That could mean:

It could equally mean:

Remember to say only things that help the customer and that maintain their confidence in you as the person in charge of getting their problem resolved. When you're not sure what to do, you might also be unsure what to say. Here are some phrases that might help:

If you're stuck and are having difficulty finding help, contact the manager on-call or initiate the dev-escalation process.

SaaS Emergencies

The workflow for these calls is the same as with self-managed emergencies: success means that the customer is unblocked. In some cases, you may even be able to fully resolve a customer's problem.

For any customer facing a SaaS Emergency you are empowered to perform any two-way door action required to unblock them without seeking approval first.

Some examples:

During a SaaS Emergency, you have additional visibility into problems that a customer may be facing.

Review:

Broadly, we expect emergencies to fall into one of five categories:

Broken Functionality

If a customer is reporting that behaviour has recently changed, first check GitLab.com Status and #incident-management for any on-going incidents. If there's no known incident:

  1. Initiate a call with the customer. You're specifically looking to:
    • observe broken behavior.
    • determine if there's a known issue, bug report, or other customers reporting similar behavior.
    • ascertain whether a feature flag may have been recently turned on (see: Enabling Feature Flags on GitLab.com).
    • find/build reproduction steps devoid of customer data to build a bug report if none exists.

Broken functionality due to a regression or feature flag

  1. Create a ~bug issue and have the customer review it.
  2. Escalate the ~bug issue
  3. If this is affecting multiple customers, declare an incident to engage the incident response team who will update the status page.
  4. Once the original functionality is restored, update the customer.

Broken functionality due to something specific to the customer

  1. Page the Support Manager on-call to review the best way to unblock the customer. It may be that you will need someone with .com console access to fully investigate / resolve.

Broken functionality due to an incident

If there is a known incident, it's acceptable to link to the public status page and related incident issue. Consider using Support::SaaS::Incident First Response.

Example tickets:

Subscription Issues

A customer may be blocked because a license has expired or a renewal has not been applied. If you're familiar with L&R Workflows, you may be able to solve the case completely by yourself. If you are not, you may:

  1. Manually upgrade the namespace using the mechanizer
  2. Alert #support_licensing-subscription that you've done so and link to the ticket for follow-up.

Consumption Issues

CI Minutes quota is blocking a production deployment

A customer may be blocked because they have run out of CI minutes.

  1. Advise them to purchase additional minutes or set up individual runners.
  2. At your discretion, as a courtesy, set an additional 1000 minutes on their namespace through ChatOps

Customer has exceeded their storage quota

A customer may be blocked because they've exceeded their storage quota.

  1. Advise them to purchase additional storage
  2. In cases where a customer is unable to complete a purchase because of a defect or outage, as a courtesy, someone with GitLab.com admin can override the storage limit on a group.

A widespread incident causes multiple, successive PagerDuty alerts

If an incident occurs on GitLab.com and hasn't been posted on the status page, SaaS customers may raise emergencies in bulk. Success in such a situation is two-fold:

  1. Route customers reporting the incident to our status page, @gitlabstatus on Twitter and the production incident issue.
  2. Sort through the alerts to ensure that there are no emergencies raised that are unrelated to the on-going incident.

If this occurs:

  1. Don't panic! Slack and PD alerts may come quickly and frequently. Consider silencing both temporarily and focus on ZD.
  2. Verify that an incident has been declared and that the incident is actively being worked.
  3. If there is no update on the status page yet, advocate for urgency with the CMOC so that you can point to it in responses.
  4. Choose a unique tag that will help you identify tickets. Using the incident number is typical, for example: incident-12345
  5. Create a bulk response that points to the incident on the status page, @gitlabstatus on Twitter and the production issue. If any of these aren't available yet, you can send a response without them to keep customers informed, and include them in a future update.
    • Share the response that you draft or otherwise coordinate with #support_gitlab-com and others fielding first responses. There are likely non-emergency tickets being raised about the incident. Using the same response increases the efficiency with which we can all respond to customer inquiries about the problem.
  6. Create the tag by typing it into the tag field of at least one ticket and submitting it - if you don't, it won't show as available in the bulk edit view of Zendesk.
  7. Use Zendesk search to identify customer-raised emergencies:
  8. Use Zendesk Bulk Update to respond to all open tickets.
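
Once tickets carry the incident tag, finding them again is a plain tag search. The sketch below builds such a search string and the matching Zendesk Search API URL. It is an illustration under assumptions: the helper names and the subdomain are hypothetical, and the query is not the handbook's own search for identifying customer-raised emergencies in step 7.

```python
from urllib.parse import urlencode


def incident_tag(incident_number: int) -> str:
    """Tag in the style of the example above, e.g. incident-12345."""
    return f"incident-{incident_number}"


def tagged_open_tickets_query(tag: str) -> str:
    """Zendesk search string for unsolved tickets that carry the tag."""
    return f"type:ticket tags:{tag} status<solved"


def search_url(subdomain: str, query: str) -> str:
    """URL for the Zendesk Search API. The subdomain is a placeholder."""
    return f"https://{subdomain}.zendesk.com/api/v2/search.json?" + urlencode(
        {"query": query}
    )


tag = incident_tag(12345)
query = tagged_open_tickets_query(tag)
url = search_url("example", query)
print(query)
print(url)
```

The same search string can be pasted into the Zendesk search box before selecting tickets for a bulk update.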

At any point, you may ack/resolve PD alerts. It may be faster to do so through the PagerDuty web interface.

During an incident:

Using Zendesk Bulk Update

Zendesk Bulk Update is a way to mass edit and respond to tickets. During an incident, you can use it to:

You can bulk edit tickets by:

  1. From a Zendesk search click one or more checkboxes
  2. Click "Edit n tickets" in the upper right-hand corner
  3. Edit the properties of the ticket you'd like to update. During an incident that will probably be:
    • A public reply
    • A ticket tag
  4. Click Submit with the appropriate status change
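
The steps above use the Zendesk web interface. As an alternative, the same edit (a public reply plus a tag) can be expressed as a request to the Zendesk Tickets API bulk-update endpoint. The sketch below only builds the path and body and sends nothing. The function name, ticket IDs, reply text and status are hypothetical, and the field names reflect my understanding of the API rather than anything stated in this handbook.

```python
def bulk_update_request(ticket_ids, reply, tag, status):
    """Return (path, body) for a bulk update of the given tickets.

    Adds a public comment and an extra tag to every ticket and sets
    the chosen status. additional_tags appends to existing tags
    instead of replacing them.
    """
    ids = ",".join(str(ticket_id) for ticket_id in ticket_ids)
    path = f"/api/v2/tickets/update_many.json?ids={ids}"
    body = {
        "ticket": {
            "comment": {"body": reply, "public": True},
            "additional_tags": [tag],
            "status": status,
        }
    }
    return path, body


# Hypothetical ticket IDs and reply text.
path, body = bulk_update_request(
    [101, 102, 103],
    reply="We are aware of the incident. Please follow the status page for updates.",
    tag="incident-12345",
    status="pending",
)
print(path)
print(body)
```

Which status to set is a judgment call during the incident, so it is passed in as a parameter and not hard-coded.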

ZD Bulk Update View

US Federal on-call

US Federal on-call support is provided 7 days a week between the hours of 0500 and 1700 Pacific Time.

The current on-call schedule can be viewed in PagerDuty (internal link) or on the Support Team on-call page (GitLab employees only). The schedule is currently split into two 6-hour shifts: an AM shift from 0500 to 1100 Pacific Time and a PM shift from 1100 to 1700 Pacific Time.
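
The shift boundaries above can be checked with a short helper that converts any timezone-aware timestamp to Pacific Time. This is a sketch under assumptions: the function name is hypothetical, and each shift's end time is treated as exclusive so that 1100 belongs to the PM shift and 1700 is outside coverage.

```python
from datetime import datetime, time, timezone
from typing import Optional
from zoneinfo import ZoneInfo

PACIFIC = ZoneInfo("America/Los_Angeles")

# (name, start, end) in Pacific Time, end exclusive (an assumption).
SHIFTS = [
    ("AM", time(5, 0), time(11, 0)),
    ("PM", time(11, 0), time(17, 0)),
]


def us_federal_shift(moment: datetime) -> Optional[str]:
    """Return "AM", "PM", or None if the moment is outside coverage.

    moment must be timezone-aware; a naive datetime would be
    interpreted in the machine's local timezone.
    """
    local = moment.astimezone(PACIFIC).time()
    for name, start, end in SHIFTS:
        if start <= local < end:
            return name
    return None


# 15:30 UTC on 6 January is 07:30 Pacific Standard Time.
print(us_federal_shift(datetime(2024, 1, 6, 15, 30, tzinfo=timezone.utc)))
```

Using a named zone instead of a fixed UTC offset keeps the result correct across daylight saving changes.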
