We are piloting a new bot-based escalation combined with the spreadsheet escalation process.
This page outlines the development team on-call process and guidelines for developing the rotation schedule for handling infrastructure incident escalations.
Development engineers are expected to act as technical consultants and to collaborate with the teams that requested the on-call escalation so they can troubleshoot together. The development engineer is not expected to be solely responsible for resolving the escalation.
operational emergencies raised by the Infrastructure, Security, and Support teams.
@gitlab-com/security/appsec team mentioned to be notified as part of the Application Security Triage rotation
infradev which will be raised to the Infra/Dev triage board
Example of qualified issue:
Examples of non-qualified issues:
NOTE: On-call engineers need not announce the beginning or end of their shift in #dev-escalation unless there is an active incident (check the channel's chat history to find out). Many engineers have very noisy notifications enabled for that channel, and such announcements are essentially false positives that make them check the channel unnecessarily.
Note: For the months of September and October 2020, we are using this Pilot Escalation Process.
On-call work comes in four-hour blocks, aligned to UTC:
One engineer must be on-call at all times. This means that each year, we must allocate 2,190 4-hour shifts.
The total number of shifts is divided among the eligible engineers. This is the minimum number of shifts any one engineer is expected to do. As of March 2020 we have around 150 eligible engineers, which means each engineer is expected to do 16 shifts per year, or 4 shifts per quarter.
In general, engineers are free to choose which shifts they take across the year. They are free to choose shifts that are convenient for them, and to arrange shifts in blocks if they prefer. A few conditions apply:
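The arithmetic behind these figures can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, and rounding up is an assumption about how the minimum is derived. With exactly 150 engineers the result is 15 shifts per year, so the figure of 16 quoted above corresponds to a head count somewhat below 150 (for example, 140).

```python
import math

HOURS_PER_SHIFT = 4
DAYS_PER_YEAR = 365  # non-leap year, which matches the 2,190 figure


def shifts_per_year(hours_per_shift: int = HOURS_PER_SHIFT,
                    days: int = DAYS_PER_YEAR) -> int:
    """Number of shifts needed so that one engineer is on call at all times."""
    return (24 // hours_per_shift) * days


def minimum_shifts_per_engineer(eligible_engineers: int) -> tuple[int, int]:
    """Return (per_year, per_quarter), rounded up so every shift is covered."""
    per_year = math.ceil(shifts_per_year() / eligible_engineers)
    per_quarter = math.ceil(per_year / 4)
    return per_year, per_quarter


print(shifts_per_year())                 # 2190
print(minimum_shifts_per_engineer(150))  # (15, 4)
print(minimum_shifts_per_engineer(140))  # (16, 4)
```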
Most on-call shifts will take place within an engineer's normal working hours.
Scheduling and claiming specific shifts is done on the Google sheet of schedule. More on that below.
All development backend and fullstack engineers who have been with the company for at least 3 months.
Exceptions: (i.e. exempted from on-call duty)
The eligibility is maintained in this team members list and the spreadsheet is refreshed monthly as below:
Engineers normally claim shifts themselves on this Google sheet of schedule. To ensure we get 100% coverage, the schedule is fixed one month in advance. Engineers claim shifts between two and three months in advance. When signing up, fill the cell with your full name as it appears in the team members list, your Slack display name, and your phone number with country code. The same instructions are posted in the header of the schedule spreadsheet.
At the start of each month, engineering managers look at the schedule for the following month (e.g. on the 1st March, they would be considering the schedule for April, and engineers are claiming slots in May). If any gaps or uncovered shifts are identified, the EMs will assign those shifts to engineers. The assignment should take into account:
In general, engineers who aren't signing up to cover on-call shifts will be the ones who end up being assigned shifts that nobody else wants to cover, so it's best to sign up for shifts early!
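The timing described above (on 1 March, managers finalize April while engineers claim slots in May) can be expressed as simple month arithmetic. The sketch below is illustrative only; `add_months` and `schedule_windows` are hypothetical helper names, not part of any existing tooling.

```python
from datetime import date


def add_months(day: date, months: int) -> date:
    """Return the first day of the month that is `months` after `day`."""
    index = day.year * 12 + (day.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)


def schedule_windows(today: date) -> dict[str, date]:
    """Months that are being fixed and claimed at the start of a month."""
    return {
        # Managers fill gaps for the following month.
        "fixed_by_managers": add_months(today, 1),
        # Engineers are claiming slots two months ahead.
        "open_for_claims": add_months(today, 2),
    }


windows = schedule_windows(date(2020, 3, 1))
print(windows["fixed_by_managers"])  # 2020-04-01
print(windows["open_for_claims"])    # 2020-05-01
```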
There is additional information regarding weekend shifts, which can be found in "Additional Notes for Weekend Shifts" under a sub-folder called Development Escalation Process in the shared Engineering folder on Google Drive.
These summary items should be in written format in the following locations:
This shall be completed at the end of shifts to hand over smoothly.
Given the administrative overhead involved, one engineering director or manager is responsible for coordinating the schedule for each month. Coordinators are self-nominated, as with shifts: on each month tab in the schedule spreadsheet, directors and managers are encouraged to sign up in the Coordinator column. There is one director or manager per month.
The coordinator should:
An epic was created to track execution. Each coordinator is expected to register an issue under this epic for their month on duty to capture activities and send notifications. Here is an example.
For those eligible engineers, everyone is encouraged to explore options that work best for their personal situations in lieu of weekend shifts. When on-call you have the following possibilities:
With the above alternatives we want to make sure that we comply with local labor laws, do not exceed the restricted weekly working hours (ranging from 38 to 60 hours), and offer enough rest time for the engineers who sign up for weekend on-call shifts.
If you prefer to work on a particular weekend day or at other times during the weekend, go to the Development-Team-BE and fill in your "Preferred Weekend Hours (UTC)" and/or "Preferred Weekend Day". If you prefer to work one hour later than on a normal working day, subtract 1 from your normal UTC and fill that in as "Preferred Weekend Hours (UTC)". The coordinators will take this into account when assigning unfilled slots.
See the Google sheet of schedule. In the future, we could embed a summary of the upcoming week here.
Below is a process that one coordinator used to fill unclaimed spots:
Feel free to participate in any incident triaging call if you would like to have a few rehearsals of how it usually works. Simply watch out for active incidents in #incident-management and join the Situation Room Zoom call (link can be found in the channel) for synchronous troubleshooting. There is a nice blog post about the shadowing experience.
Situation Room recordings from previous incidents are available in this Google Drive folder (internal).
To get an idea of what's expected of an on-call engineer and how often incidents occur, it can be helpful to shadow another shift. To do this, identify a time slot that you'd like to shadow in the on-call schedule and contact the primary to let them know you'll be shadowing. Ask them to invite you to the calendar event for this slot. During the shift, keep an eye on #dev-escalation for incidents and observe how the primary follows the process if any arise.
Beginning 2020-09-01, a pilot program will start. It is the outcome of a recent proposal to iterate on the oncall process.
The new process splits Weekdays and Weekends/Holidays and fully automates scheduling and escalation using a Bot (pagerslack) during the normal work week.
A pilot of this new weekday process will happen concurrently with the existing scheduled one. In a sense there will be "double coverage" during the pilot to ensure that someone is always available, either via pagerslack or previously scheduled in the oncall spreadsheet.
Important: During the pilot period, sign-ups for weekday and weekend shifts are still required as a backup.
Weekdays will now leverage an automated system relying upon a chatbot.
/devoncall incident-issue-url into #dev-escalation
In the event that no BEs respond to the bot, Pagerslack will then post a link to the oncall spreadsheet so that the SRE can look up the scheduled oncall BE. Escalating via the spreadsheet is treated as a last resort.
Weekend/Holiday oncall will continue to use the existing Oncall process using the oncall spreadsheet outlined above.
Holidays will continue to be included in the oncall spreadsheet. Those holidays include Christmas Day, New Year's Eve, New Year's Day, Pi Day, and Black Friday.
/devoncall incident-issue-url into #dev-escalation
We will pilot this new program from September 2020 to the end of January 2021.
The aim is to see at least 8 escalations occur with the new process and to reach a consensus on whether the new process is ready for production before fully switching over. If we initiate the process on 2020-09-01, we estimate that this criterion should be met within 60 days.
During the experiment period, the oncall spreadsheet is a backup, and there will be a review of how it is going at the end of September or when we have seen 8 escalations (whichever happens sooner). If the pilot is successful (see success criteria below), we will move to this new process starting on 2020-11-01.
We measure success by the number of escalations handled solely via the bot-driven process. We aim for 0 incidents to be escalated to the oncall spreadsheet, but allow up to 15% of escalations to fall back to the oncall spreadsheet.
At the time of success assessment, all stakeholders - Development, Infrastructure, QE, Security, and Support - agree the new process is ready for production, in addition to the metric above.
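The quantitative part of these criteria can be checked with a few lines of code. This is a sketch under stated assumptions: the function and constant names are hypothetical, and the counts would have to be gathered by hand from the escalation history. Stakeholder agreement is a separate, non-numeric condition that the code does not model.

```python
MIN_ESCALATIONS = 8        # pilot aims to observe at least 8 escalations
MAX_FALLBACK_RATE = 0.15   # up to 15% may fall back to the spreadsheet


def fallback_rate(total_escalations: int, spreadsheet_fallbacks: int) -> float:
    """Share of escalations that fell back to the oncall spreadsheet."""
    if total_escalations <= 0:
        raise ValueError("total_escalations must be positive")
    if not 0 <= spreadsheet_fallbacks <= total_escalations:
        raise ValueError("spreadsheet_fallbacks must be between 0 and the total")
    return spreadsheet_fallbacks / total_escalations


def pilot_metric_met(total_escalations: int, spreadsheet_fallbacks: int) -> bool:
    """True when enough escalations were seen and few enough fell back."""
    if total_escalations < MIN_ESCALATIONS:
        return False
    return fallback_rate(total_escalations, spreadsheet_fallbacks) <= MAX_FALLBACK_RATE


print(pilot_metric_met(8, 1))  # True  (12.5% fell back)
print(pilot_metric_met(8, 2))  # False (25% fell back)
print(pilot_metric_met(5, 0))  # False (fewer than 8 escalations)
```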
/devoncall incident-issue-url to trigger the escalation process.
top to show the top 25 members that are next in the escalation queue
position to see your position in the queue. The higher the number, the lower the probability of being pinged.
Please report any problems by creating an issue in the pagerslack project.
To make the First Responder process effective, the engineer on-call must configure their notifications to give them the best chance of noticing and responding to an incident.
These are the recommended settings. Your mileage may vary.