For the customers that have Priority Support, the Support Engineering Team is on-call and available to assist with emergencies. What constitutes an emergency is defined in our definitions of support impact.
We take on-call seriously. There are escalation policies in place so that if a first responder does not respond fast enough, other team members are alerted. Such policies aren't expected to ever be triggered, but they cover extreme and unforeseeable circumstances.
When you are on call you are expected to be available and ready to respond to PagerDuty pings as soon as possible, and certainly within the emergency response time set by our Service Level Agreements.
If you have plans outside of your work space while being on call, then being available may require bringing a laptop and reliable internet connection with you.
You should not be chained to your desk, but you should be equipped to acknowledge and act on PD alerts in a timely manner.
Be proactive in communicating your availability. Sometimes you can't be immediately available for every minute of your on-call shift. If you expect to be unavailable for a short period of time, send an FYI in Slack.
When you get an alert, you should immediately start a Slack thread and take notes therein. Tag the Technical Account Manager (TAM) - "cc @user" is good enough - if the customer has one (steps here for how to identify TAMs). This creates visibility around the situation and opens the door to let the team join in.
Good notes in Slack help others follow along, and help you with your follow-ups after the call.
Try to communicate complete ideas rather than snippets of thought. Something like "that's not good" as a response to something happening within the call isn't as helpful as "gitaly timings are really high".
Take and share screenshots of useful info the customer is showing you. Make sure you're not sharing anything sensitive. Let the customer know you're taking screenshots: "Could you pause there? I want to screenshot this part to share with my team".
If the problem stated in the emergency ticket doesn't meet the definition of an emergency support impact, inform the customer's Technical Account Manager or a Support Manager. Unless one of these managers asks you to do otherwise, however, continue to treat the ticket with the emergency SLA.
We assume positive intent from the customer. Even though we may not think a particular ticket qualifies for emergency support, we treat all emergency pages from customers with priority support as if they qualify. During any crisis, the customer may be stressed and have immense pressure on them. Later, after the crisis, if we've determined that the ticket didn't qualify as an emergency, the customer's TAM or a Support Manager can discuss that with the customer.
Rest assured: escalation is okay – other GitLab team members are happy to help. Caring for our customers is a shared responsibility. Tag a Slack support-team Group if you haven't gotten help in your Slack thread. Tag the support managers if you need to escalate further.
If another support engineer joins your emergency call, feel free to assign them a role to divide up the labor.
For example: "So-and-so, would you please (take notes, reach out to this product team and ask for help, look up the code for this and see what you can find)?"
Make a real effort to de-stress during your on-call shift. After being on-call, consider taking time off, as noted in the main GitLab Handbook. Just being available for emergencies and outages causes stress, even if there are no pages. Resting is critical for proper functioning. Just let your team know.
When you're in a customer call, you do not need to provide immediate answers. You're allowed to pause for a few minutes for researching, asking for help, etc. Make sure to communicate – let the customer know what you're doing. Example: "I need a few minutes to work through the code here and make sense of it".
We do 7 days of 8-hour shifts in a follow-the-sun style, based on your location.
NOTE: If a new alert has not been acknowledged after 10 minutes, the Support Manager on-call is alerted. After a further 5 minutes, if there is no acknowledgement, Senior Support Managers are alerted.
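The escalation timing in the note above can be sketched as a small function. This is only an illustration of the two thresholds stated there (10 minutes, then a further 5 minutes); the function name and the role labels are hypothetical, and the real escalation policy lives in PagerDuty, not in code like this.

```python
from typing import List

# Thresholds from the note above, in minutes since the alert fired.
MANAGER_ESCALATION_MIN = 10
SENIOR_ESCALATION_MIN = MANAGER_ESCALATION_MIN + 5  # "a further 5 minutes"


def alerted_parties(minutes_unacknowledged: float) -> List[str]:
    """Return who has been alerted for a page that is still unacknowledged.

    Hypothetical helper: the role labels are illustrative strings, not
    PagerDuty identifiers.
    """
    alerted = ["on-call Support Engineer"]
    if minutes_unacknowledged >= MANAGER_ESCALATION_MIN:
        alerted.append("Support Manager on-call")
    if minutes_unacknowledged >= SENIOR_ESCALATION_MIN:
        alerted.append("Senior Support Managers")
    return alerted
```

For instance, `alerted_parties(12)` includes the Support Manager on-call but not yet the Senior Support Managers.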
There are several ways to view current and future schedules:
For new team members approaching their first on-call shift, your Support onboarding issue includes a section suggesting that you shadow a current on-call to gain familiarity with the process. After completing your shadow shift, work with your manager to get yourself added to the on-call rotation. For your first on-call week we recommend asking your Onboarding Buddy to be available as a secondary to help if an emergency comes in.
Your role is to make sure someone is available to respond to emergencies during the week you are scheduled. Flexibility is possible – you can split work with others, or schedule overrides for a few hours or days. You don't have to change vacation plans, or be at your desk all week! It's OK to take a walk outside, if you have your phone and reception. This way you can acknowledge the page, and locate someone to help (using Slack).
If you prefer to work with a colleague as a secondary, discuss with team members or your manager and find partners who like sharing the role. You can work together during the week, and update PagerDuty as you wish (options include: split days into mornings and evenings, take alternate days, work as a primary and secondary). Your manager can play an active role in helping pair people who want to work like this.
TIP: In Google Calendar, add Busy entries for the days/items you are on-call. Because your Google Calendar is linked to Calendly for Customer Calls, Busy entries ensure you will not receive round-robin calls during your on-call shift.
To swap on-call duty with a colleague:
See the PagerDuty documentation for complete steps.
Before your shift starts, always double-check that your alerts are working. Send a test page to make sure that you are receiving alerts correctly.
When your on-call shift starts, you should get notification(s) that your shift is starting (email or text, depending on your PagerDuty preferences).
priority: urgent to find the ticket.
#support_self-managed with the ticket link. "Thread for emergency ticket LINK HERE".
NOTE: If you need to reach the current on-call engineer and they're not accessible on Slack (e.g., it's a weekend, or the end of a shift), you can manually trigger a PagerDuty incident to get their attention, selecting Customer Support as the Impacted Service and assigning it to the relevant Support Engineer.
In rare cases, the on-call engineer may experience concurrent emergencies triggered by separate customers. If this happens to you, please remember that you are not alone; you need only take the first step in the following process to ensure proper engagement and resolution of each emergency:
@support-team-americas) and request assistance from anyone who is available to assist with the new incoming emergency case.
Taking an emergency call isn't significantly different from a normal call outside of two unique points:
Try to find a colleague to join the call with you. A second person on the call can take notes, search for solutions, and raise additional help in Slack. They can also discuss and confirm ideas with you in Slack.
During the call, try to establish a rapport with the customer; empathize with their situation, and set a collaborative tone.
As early as possible, determine your options. In some cases, the best option may be rolling back a change or upgrade. The best option may also involve some loss of production data. If either of those is the case, it's okay to ask the customer if they see any other options before executing that plan.
First, remember that your primary role is incident management. You are not expected to have all the answers personally and immediately.
Your primary job is to coordinate the emergency response. That could mean:
It could equally mean:
Remember to say only things that help the customer and that maintain their confidence in you as the person in charge of getting their problem resolved. When you're not sure what to do, you might also be unsure what to say. Here are some phrases that might help:
Before ending an emergency customer call, let the customer know what to do if there is any follow-up, and who will be available if any follow-up is required.
It seems like we've solved the root problem here, but if you need any help I'll be on-call for the next two hours. Feel free to open a new emergency ticket and I'll get back on a call with you right away. If it's after two hours, my colleague Francesca will be responding. I'll make sure that she has the background of the situation before I leave for the day.
When the call has ended:
Starting Sept 2020, we're beginning to trial GitLab.com emergency support with a small number of customers. The initial workflow for these calls is the same as with self-managed emergencies. However, you have visibility into problems the customer may be facing that the customer does not have.
During this trial period, please page the manager on-call using /pd-support-manager for any GitLab.com emergencies so they can offer additional support.
After you have identified the error and found reproduction steps, it's likely that you'll need to declare an incident and coordinate with the incident management team to reach resolution. If the error is a result of a product defect, you may also need to engage the InfraDev Escalation Process.
Broadly, we expect that emergencies will fall into one of three categories:
Support Managers also have an on-call rotation. During their rotation, the manager’s responsibilities are:
#support_managers to the end of understanding the scope of what is being asked for (business days only)
#support_managers; Support Team Skills by Subject can be used to find the right engineer to work the ticket (business days only)
#support_dot-com channels to post their escalation request instead in the #support_managers channel (business days only)
#support_managers for any manager to volunteer to cover if their specific request goes unanswered.
To see who the current manager on-call is you can:
/chatops run oncall manager
#support_managers (where you may or may not be referred to the above steps!)
/pd-support-manager command to page the on-call manager
We currently consider a :green_check_mark: on the original Slack request as a signal that the escalation has been resolved.
We want to minimize the effect of on-call duty on your life. One way we do that is by offsetting any impact on your personal expenses.
You may expense the cost of your mobile phone service for the month when you begin your on-call rotation. This is limited to your service cost itself, not any costs relating to the phone device, to a personal hotspot device or to services for other people on your phone plan.
We understand you may have plans outside of your normal workspace while you're on-call. If as a result you need to use your phone to provide internet service to your computer, then you may include additional data charges in your expense report.
US Federal on-call support is provided 7 days a week between the hours of 0600 and 1800 Pacific Time.
The current on-call schedule can be viewed in PagerDuty (Internal Link), or in the Support Team on-call page (Public Link). The schedule is currently split into two 6-hour shifts: an AM shift and a PM shift. The AM shift starts at 0600 Pacific Time and runs until 1200 Pacific Time. The PM shift starts at 1200 Pacific Time and runs until 1800 Pacific Time.
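As a rough illustration of the US Federal shift windows described above, the sketch below maps a timezone-aware timestamp to the AM or PM shift. The function name is hypothetical, and it assumes "Pacific Time" corresponds to the `America/Los_Angeles` zone and that 1200 exactly belongs to the PM shift; PagerDuty remains the source of truth for who is actually on call.

```python
from datetime import datetime
from typing import Optional
from zoneinfo import ZoneInfo

# Assumption: "Pacific Time" in the schedule means this IANA zone.
PACIFIC = ZoneInfo("America/Los_Angeles")


def us_federal_shift(moment: datetime) -> Optional[str]:
    """Return "AM", "PM", or None if outside US Federal on-call hours.

    Hypothetical helper. `moment` must be timezone-aware; it is converted
    to Pacific Time before comparing against the shift boundaries.
    """
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    hour = moment.astimezone(PACIFIC).hour
    if 6 <= hour < 12:   # AM shift: 0600 up to 1200 Pacific
        return "AM"
    if 12 <= hour < 18:  # PM shift: 1200 up to 1800 Pacific (boundary is an assumption)
        return "PM"
    return None          # outside the 0600-1800 coverage window
```

Passing a UTC timestamp works as well, since the conversion to Pacific Time happens inside the function.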