
Development Escalation Process

About This Page

This page outlines the development team on-call process and guidelines for developing the rotation schedule for handling infrastructure incident escalations.

Escalation Process

Scope of Process

Example of qualified issue:

Examples of non-qualified issues:

Process Outline

  1. Escalation arises.
  2. The Infrastructure, Security, or Support team registers a tracking issue and determines the severity, or references the Zendesk ticket, whichever is applicable.
    • Explicitly mention whether the raised issue is for GitLab.com or a self-managed environment.
  3. Infrastructure, Security, or Support team pings on-duty engineer (@name) in Slack #dev-escalation.
    • Find out who is on duty in the on-call schedule Google sheet.
    • Ping on-duty engineers by tagging @name.
    • On-call engineer responds by reacting to the ping with :eyes:
    • If the on-call engineer does not respond within 5 minutes, the Infrastructure, Security, or Support team finds their phone number in the on-call sheet and calls it.
  4. First response time SLOs - OPERATIONAL EMERGENCY ISSUES ONLY
    1. GitLab.com: Development engineers provide initial response (not solution) in both #dev-escalation and the tracking issue within 15 minutes.
    2. Self-managed: Development engineers provide initial response (not solution) in both #dev-escalation and the tracking issue on a best-effort basis. (SLO will be determined at a later time.)
    3. In the case of a tie between GitLab.com and self-managed issues, GitLab.com issue takes priority.
  5. When on-call engineers need domain expertise:
    • Ping the domain expert engineer and their engineering manager IMMEDIATELY in #dev-escalation. Make your best guess; it is fine to ping several people when you are not certain. Domain experts are expected to engage ASAP.
    • If needed, the next level is to ping the development director(s) of the domain in #dev-escalation.
graph TD;
  A[Escalation Arises] --> B(Issue Registered);
  B --> C("On-call Engr Pinged<br/>(#dev-escalation/phone)");
  D[On-call Schedule Sheet] --> C;
  C --> E("Initial Response<br/>(GitLab.com=15mins, S-M=Best Effort)");
  E --> F{Domain Expertise};
  F --> |Yes|G[Solution];
  F --> |No|H("Ping Expert(s)<br/>(Engr & Mgr)");
  H --> I{Further Escalation?};
  I --> |No|G;
  I --> |Yes|J("Ping Director(s)");
  J --> K(Expert Pinged);
  K --> G;
  G --> L{Validation};
  L --> |No|G;
  L --> |Yes|M(Deploy & Release);
  M --> N(Documentation);
  N --> O[Done];
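
The timing rules in the process outline can be sketched in a few lines of Python. This is only an illustration, not tooling that exists for this process: the function names are hypothetical, and measuring both the 5-minute phone fallback and the 15-minute SLO from the moment of the Slack ping is an assumption, since the outline does not say when the clock starts.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional

PHONE_FALLBACK = timedelta(minutes=5)   # no response to the ping within 5 minutes
GITLAB_COM_SLO = timedelta(minutes=15)  # initial response (not solution) for GitLab.com


def next_action(pinged_at: datetime, now: datetime, acknowledged: bool) -> str:
    """What the escalating team does after pinging the on-call engineer."""
    if acknowledged:  # the engineer reacted with :eyes:
        return "wait for initial response"
    if now - pinged_at >= PHONE_FALLBACK:
        return "call the phone number on the on-call sheet"
    return "wait for :eyes: reaction"


def first_response_deadline(pinged_at: datetime, environment: str) -> Optional[datetime]:
    """Deadline for the initial response, or None when it is best effort."""
    if environment == "gitlab.com":
        return pinged_at + GITLAB_COM_SLO
    if environment == "self-managed":
        return None  # best effort; the SLO is to be determined later
    raise ValueError(f"unknown environment: {environment}")


def prioritize(environments: List[str]) -> List[str]:
    """In a tie, GitLab.com issues come before self-managed ones."""
    return sorted(environments, key=lambda env: env != "gitlab.com")


if __name__ == "__main__":
    t0 = datetime(2019, 8, 1, 12, 0, tzinfo=timezone.utc)
    print(next_action(t0, t0 + timedelta(minutes=6), acknowledged=False))
    print(first_response_deadline(t0, "gitlab.com"))
```

The sort in `prioritize` is stable, so issues within the same environment keep the order in which they were raised.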

Logistics

  1. All on-call engineers, managers, distinguished engineers, fellows (who are not co-founders) and directors are required to join #dev-escalation.
  2. On-call engineers are required to add to the on-call sheet a phone number where they can be reached during their on-call shifts.
  3. On-call engineers are advised to turn on Slack notifications while on duty, or to set up another way of being alerted in real time.
  4. Managers and directors of on-duty engineers are advised to do the same so that they stay informed. When necessary, managers and directors will help find domain experts.
  5. Hint: turn on Slack email notifications while on duty as a second safeguard so that nothing falls through the cracks.

Rotation Scheduling

Guidelines

  1. Assignments

    On-call work comes in four-hour blocks, aligned to UTC:

    • 0000 - 0359
    • 0400 - 0759
    • 0800 - 1159
    • 1200 - 1559
    • 1600 - 1959
    • 2000 - 2359

    One engineer must be on call at all times. This means that each year, we must allocate 1,560 4-hour weekday shifts (6 shifts per day × 5 weekdays × 52 weeks).

    The total number of shifts is divided among the eligible engineers. This is the minimum number of shifts any one engineer is expected to do. As of August 2019 we have around 100 eligible engineers, so each engineer is expected to do 16 shifts per year (1,560 / 100 ≈ 15.6, rounded up), or 4 shifts per quarter.

    In general, engineers are free to choose which shifts they take across the year. They are free to choose shifts that are convenient for them, and to arrange shifts in blocks if they prefer. A few conditions apply:

    • No engineer should be on call for more than 3 shifts in a row (12 hours), with 1-2 being the norm.
    • No engineer should take more than 12 shifts (48 hours) per week, with 10 shifts (40 hours) being the usual maximum.

    Most on-call shifts will take place within an engineer's normal working hours.

    Shifts are scheduled and claimed on the schedule Google sheet. See Nomination below for details.

  2. Eligibility

    All backend engineers who have been with the company for at least 3 months are eligible.

    Exceptions (i.e. exempted from on-call duty):

    • Distinguished engineers and above.
    • Where the law or regulation of the country/region poses restrictions. According to the legal department:
      • There are countries with laws governing hours that can be worked.
      • This would not be an issue in the U.S.
      • At this point we would only be looking into countries where 1) we have legal entities, as those team members are employees, or 2) team members are hired as employees through one of our PEO providers. Everyone else is contracted as an independent contractor, so general employment law would not apply.
  3. Nomination

    Engineers normally claim shifts themselves on the schedule Google sheet. To ensure we get 100% coverage, the schedule is fixed one month in advance. Engineers claim shifts between two and three months in advance. When signing up, fill the cell with your full name, Slack display name, and phone number with country code. The same instruction is posted in the header of the schedule spreadsheet.

    At the start of each month, engineering managers look at the schedule for the following month (e.g. on 1 March, they consider the schedule for April, while engineers are claiming slots in May). If any gaps or uncovered shifts are identified, the EMs will assign those shifts to engineers. The assignment should take into account:

    • How many on-call hours an engineer has done (i.e., how many of their allocated hours are left)
    • Upcoming leave
    • Any other extenuating factors
    • Respecting an assumed 40-hour working week
    • Respecting an assumed 8-hour working day
    • Respecting the timezones engineers are based in

    In general, engineers who aren't signing up to cover on-call shifts will be the ones who end up being assigned shifts that nobody else wants to cover, so it's best to sign up for shifts early!

  4. Relay Handover

    • Since the engineers who are on call may change frequently, responsibility for being available rests with them. Missing an on-call shift is a serious matter.
    • During an ongoing escalation, no engineer should finish their on-call duties until they have paged the engineer taking over and confirmed that engineer is present, or they have notified someone who is able to arrange a replacement. They do not have to find a replacement themselves, but they need confirmation from someone that a replacement will be found.
    • When an ongoing escalation is handed over to the incoming on-call engineer, the current on-call engineer summarizes the status of ongoing issues in #dev-escalation and in the issues themselves by the end of their stretch of shifts, so that the handover is smooth.
    • For current Infrastructure issues and status, refer to the Infra/Dev Triage board.
    • If an incident is ongoing at the time of handover, outgoing engineers may prefer to remain on-call for another shift. This is acceptable as long as the incoming engineer agrees, and the outgoing engineer is on their first or second shift.
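
The arithmetic and limits in the guidelines above can be checked with a short Python sketch. The function names are hypothetical, and two details are assumptions rather than stated rules: that the 1,560 figure comes from 6 shifts × 5 weekdays × 52 weeks (which matches), and that the per-engineer minimum is rounded up.

```python
import math
from datetime import date
from typing import Tuple

SHIFT_HOURS = 4
SHIFTS_PER_DAY = 24 // SHIFT_HOURS                    # six 4-hour UTC blocks
WEEKDAYS_PER_YEAR = 5 * 52                            # assumption: weekdays only
SHIFTS_PER_YEAR = SHIFTS_PER_DAY * WEEKDAYS_PER_YEAR  # 1,560

MAX_CONSECUTIVE_SHIFTS = 3   # 12 hours in a row
MAX_SHIFTS_PER_WEEK = 12     # 48 hours; 10 shifts is the usual maximum


def min_shifts_per_engineer(eligible_engineers: int) -> int:
    """Minimum yearly shifts per engineer, rounded up (assumption)."""
    return math.ceil(SHIFTS_PER_YEAR / eligible_engineers)


def shift_block(hour_utc: int) -> str:
    """Return the 4-hour UTC block containing the given hour (0-23)."""
    start = (hour_utc // SHIFT_HOURS) * SHIFT_HOURS
    return f"{start:02d}00 - {start + SHIFT_HOURS - 1:02d}59"


def within_limits(longest_run: int, shifts_this_week: int) -> bool:
    """Check the consecutive-shift and weekly-shift conditions."""
    return (longest_run <= MAX_CONSECUTIVE_SHIFTS
            and shifts_this_week <= MAX_SHIFTS_PER_WEEK)


def planning_months(today: date) -> Tuple[Tuple[int, int], Tuple[int, int]]:
    """Return ((year, month) being fixed by EMs, (year, month) open for claims)."""
    def add_months(n: int) -> Tuple[int, int]:
        index = today.year * 12 + (today.month - 1) + n
        return index // 12, index % 12 + 1
    return add_months(1), add_months(2)


if __name__ == "__main__":
    print(SHIFTS_PER_YEAR, min_shifts_per_engineer(100))
    print(shift_block(13))
    print(planning_months(date(2019, 3, 1)))
```

With 100 eligible engineers this gives 16 shifts per year, and on 1 March the months returned are April (being fixed) and May (open for claims), matching the example in the Nomination guideline.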

Coordinator

Given the administrative overhead involved, one engineering director or manager is responsible for coordinating the schedule for each month. Coordinators are self-nominated, in the same way as on-call engineers. On each month's tab in the schedule spreadsheet, directors and managers are encouraged to sign up in the Coordinator column, one director or manager per month.

The coordinator should:

  1. Remind engineers to sign up.
  2. Assign folks to unfilled slots when needed (do your own due diligence when this action is necessary).
  3. Coordinate temporary changes or special requests that cannot be resolved by engineers themselves.
  4. After assigning unfilled slots and accommodating special requests, click Sync to Calendar > Schedule shifts. This schedules the shifts in this calendar, and any developer who added their email to the spreadsheet is added as a guest to the on-call calendar event.

An execution-tracking Epic was created. Each coordinator is expected to register an issue under this Epic for their month on duty to capture activities and send notifications. Here is an example.

Rotation Schedule

See the schedule Google sheet. In the future, we could embed a summary of the upcoming week here.

Resources

Coordinator Practice Guide

Below is a process that one coordinator used to fill unclaimed spots:

  1. Start by finding the least-filled shift (usually 00:00 - 04:00 UTC) in the on-call sheet.
  2. Determine the appropriate timezones for this shift (for 00:00 - 04:00 these are UTC +9, +10, +11, +12, and +13).
  3. Go to the team members list sheet and filter the "UTC" column by the desired timezones for the shift. This gives you the list of people who could take the shift.
  4. Go to Google Calendar and start creating a dummy event on the day and time of the unclaimed shift. Note that you will not actually create this event.
  5. Add everyone who could take the shift to the event as guests.
  6. Go to the "Find a Time" tab in the calendar event to see people's availability.
  7. Find a person who is available, preferring people who have taken few or no shifts according to the total shifts counts sheet. Do not schedule people who are on leave, in interviews, or otherwise busy. It is fine to ignore events that appear to be normal team meetings, 1:1s, or coffee chats, as people can always leave a meeting for an urgent escalation.
  8. Assign them to the shift by entering their name in the on-call sheet in Purple font color.
  9. There are likely several days with the same unfilled time slot, so move the event date to the next day on which that slot is unfilled. Because the time is the same, the same set of people is appropriate and you do not need to update the guest list.
  10. Repeat all of the above for each unclaimed time slot. Work through one shift (by time range) at a time so that you can reuse the same guest list to determine availability.
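
The filtering and ranking in steps 2, 3, and 7 above can be sketched in Python. This is a hypothetical illustration of the manual process, not a tool the coordinators use: the `Engineer` fields are assumed stand-ins for the "UTC" column, the total shifts counts sheet, and what the "Find a Time" tab shows.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Engineer:
    name: str
    utc_offset: int     # stand-in for the "UTC" column of the team members list
    shifts_taken: int   # stand-in for the total shifts counts sheet
    available: bool     # False when on leave, in interviews, or otherwise busy


def candidates_for_shift(engineers: List[Engineer], offsets: Set[int]) -> List[Engineer]:
    """People in a suitable timezone who are free, fewest shifts taken first."""
    pool = [e for e in engineers if e.utc_offset in offsets and e.available]
    return sorted(pool, key=lambda e: e.shifts_taken)


if __name__ == "__main__":
    # Offsets given in step 2 for the 00:00 - 04:00 UTC shift.
    night_offsets = {9, 10, 11, 12, 13}
    team = [
        Engineer("Engineer A", 10, 3, True),
        Engineer("Engineer B", 9, 0, True),
        Engineer("Engineer C", 1, 0, True),    # wrong timezone for this shift
        Engineer("Engineer D", 12, 1, False),  # on leave
    ]
    for engineer in candidates_for_shift(team, night_offsets):
        print(engineer.name, engineer.shifts_taken)
```

Because the candidate list depends only on the shift's time range, it can be reused for every day with the same unfilled slot, which is the point of working through one time range at a time.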

Tips & Tricks of Troubleshooting

  1. How to Investigate a 500 error using Sentry and Kibana.
  2. Walkthrough of GitLab.com's SLO Framework.
  3. Scalability documentation.
  4. Use Grafana and Kibana to look at PostgreSQL data to find the root cause.
  5. Use Grafana, Thanos, and Prometheus to troubleshoot API slowdown.

Tools for Engineers

  1. Training videos of available tools
    1. Visualization Tools Playlist.
    2. Monitoring Tools Playlist.
    3. How to create Kibana visualizations for checking performance.
  2. Dashboard examples; more are available via the dropdown at the upper-left corner of any dashboard below
    1. Saturation Component Alert.
    2. Service Platform Metrics.
    3. SLAs.
    4. Web Overview.