This page is meant to be the starting point for onboarding as an Incident Manager.
To ensure a healthy Incident Manager rotation with sufficient staffing and no significant burden on any single individual, we staff this role with Team Members from across Engineering.
The Incident Manager role will be staffed by all team members within this scope:
As Incident Managers, Team Members learn how we run GitLab.com and other GitLab SaaS environments. They help ensure the availability goals for GitLab.com by working with reliability engineers on call and development team members. The experience and awareness gained in this role lead to a better understanding of building GitLab at scale and, ultimately, a more reliable and scalable GitLab SaaS service and product.
Shifts are 4 or 6 hours each at these times each day:
Each Incident Manager On Call assignment is for 4 consecutive days. The Incident Manager On Call shifts are based on individual preferences specified in the Incident Manager Onboarding issue. Shifts may include weekends based on this rotating assignment.
We intend to have two scheduling layers implemented in PagerDuty, as a way to make weekend coverage more equitable, once we have fully understood which folks would be eligible (internal only).
It is more efficient to staff Incident Manager roles with fewer, but frequently longer, shifts during days where overall Team Member availability is lower due to holidays or other global company events. These days will be designated as "Special Coverage Days" and volunteers will be sought to cover the relevant shifts. Examples of these days include New Year's as well as the monthly Family & Friends days. Not all holidays will be covered in this manner, as many holidays are local in nature and easier to cover simply by normal shift switches.
Special Coverage Days will be designated by creation of an Issue. The Issue will indicate the approach to scheduling for the specific event and will provide a table for volunteers to add their name. A DRI will be clearly noted and will be responsible for transferring the resulting coverage to PagerDuty.
To onboard, create an Incident Manager onboarding issue.
If your eligibility status changes or you have been exempted from Incident Manager On Call assignment, create an Incident Manager offboarding issue.
Incident Manager responsibilities are expected of Engineering Management and Staff+ Engineer job families in the Development and Infrastructure departments. Previously, Incident Manager responsibilities were fulfilled by a small group of managers in the Infrastructure department. All team members in these eligible roles are expected to participate in the Incident Manager rotation.
In some cases, Senior Engineers may also participate as Incident Managers. This is particularly useful as Senior Engineers look for additional opportunities for growth in their role and as they prepare for future promotion opportunities. Senior Engineers may join the Incident Manager pool by opening an Incident Manager onboarding Issue and having their manager indicate approval in the Issue. As noted elsewhere, Team Members should only participate in one duty similar to this role, so they should not, for example, also participate in Dev On-call at the same time. When approving the request, managers shall be mindful of certain regions where a balance between Incident Manager and Dev On-call needs to be maintained because the eligible people overlap and both programs must be fully staffed.
No. All team members in eligible roles are expected to participate in the Incident Manager rotation. All team members who are eligible are to be scheduled into PagerDuty. Those who have upcoming shifts will need to prioritize completing their Incident Manager On-Call onboarding issues. If anyone does not feel ready to become an Incident Manager, follow our process to swap shifts with someone else and work with your manager or the Incident Management Coordinator to resolve any challenges you are facing.
Note that we will consider exceptions on a case-by-case basis. Please talk to your manager if you do not think you can participate or would need an alternative schedule. Assignment and exemptions to participation will be coordinated by the VP of Infrastructure and the VP of Development or their designees, and the requests require approval from the participants' reporting managers. This internal spreadsheet is being used to track exemptions from participating.
Yes. If you are an engineer and you are in the Incident Manager rotation, you are exempt from the Dev On-call. You will not be expected to sign up for shifts in the monthly Dev On-call scheduling sheet. Please confirm that your email address is listed on the Development-Team-BE Google Doc, so that you are not auto-assigned a shift.
When you are scheduled to be Incident Manager On Call, this must be your #1 work priority. However, while it is the priority, there may not be any incidents. When there is no active incident requiring Incident Manager leadership, you maintain your normal work duties.
Currently, pages to Incident Manager On Calls happen roughly 0 to 4 times a week. During the time an Incident Manager is on call, there is a 15 minute SLA to respond to escalations. This can mean dropping out of meetings, 1:1s, etc. to join incident calls. This demand can make balancing hard, and you may have to reschedule meetings on short notice. Because of the interrupt-driven nature of incident management, it is advised that you shift sync meetings which would be difficult for you to leave to a later date.
In particular, please make sure to block out the time while you are on call in your calendar so no interviews get scheduled during your shift. We need to be respectful of the time candidates set aside to interview with us, so an interview should not be interrupted by pages. If an interview does get scheduled during your shift, reach out to the recruitment team to reschedule.
The Incident Manager role requires a 15-minute response. However, there is no expectation that an Incident Manager is at their desk waiting for an incident during their entire shift.
You may have family or other short personal commitments which might complicate your response. In most cases, you should feel comfortable with any commitment that isn't longer than 45 minutes, as long as you can take a few minutes during that commitment to engage in the incident Slack channel and inform the rest of the team of your situation and timing.
If you have a standing commitment which is longer or does not allow for the informative response in Slack, it is suggested that you choose another shift. If you have an occasional commitment like this, you shouldn't let that block you from a shift. One of the benefits of a larger team of Incident Managers is the ability to provide coverage for each other for unexpected or shorter duration commitments.
Four-day shifts mean that a Team Member sharing an Incident Manager rotation with seven other IMs will have a shift approximately once a month at most. Each additional Team Member added to the pool in your region extends this interval by 4 days. The more Team Members involved, the less often any individual will have shifts.
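The arithmetic above can be sketched as a small calculation. This is only an illustration: it assumes a strict round-robin within a single regional time slot with no swaps or overrides, and the function name `days_between_shift_starts` is hypothetical, not part of any GitLab or PagerDuty tooling.

```python
# Each Incident Manager On Call assignment is 4 consecutive days (from the text above).
SHIFT_LENGTH_DAYS = 4


def days_between_shift_starts(pool_size: int, shift_length_days: int = SHIFT_LENGTH_DAYS) -> int:
    """Days from the start of one of your shifts to the start of your next one.

    Assumes a simple round-robin: every person in the pool takes one
    assignment in turn before the rotation returns to you.
    """
    if pool_size < 1:
        raise ValueError("pool_size must be at least 1")
    return pool_size * shift_length_days


if __name__ == "__main__":
    # 8 = you plus seven other IMs, as in the example above.
    for pool_size in (8, 9, 12):
        print(f"{pool_size} IMs: one shift every {days_between_shift_starts(pool_size)} days")
```

With eight people in the pool the rotation returns to you every 32 days, which is the "approximately once a month" figure, and each additional person adds 4 days to that interval.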
Because new incident managers are being added to the rotation on an ongoing basis, schedules for upcoming months are not final and will shift as folks are added or removed from the rotation.
#imoc_general indicating that the schedule has been modified. Team members who have pre-existing overrides after the modified date (March in the example above) will be notified.
Based on the process above, scheduled incident managers will have approximately 7 weeks before the scheduled changes to trade their shifts if needed.
Yes, this is one of the other benefits of having a well-staffed pool of Team Members engaged as Incident Managers. Shift "trades" are easy to arrange in PagerDuty. It is your responsibility to ensure that your assigned shift is covered, but in extraordinary circumstances please reach out to the VP of Infrastructure for assistance. Swapping shifts is totally fine. Getting someone to cover for you for either planned vacation or a sudden/urgent family matter is something we should all do for each other.
What to do for covering a shift or asking for coverage:
#imoc_general channel asking for help. Make sure to @mention people or @here if you are in an urgent situation. Let people know the days and times you will need help covering things.
Example 1, scheduling yourself: Go to https://gitlab.pagerduty.com/my-on-call/week and click the shift for which you need an override. You should get a pop-up which will let you pick the person covering you and the hours, which usually default to your whole shift.
Example 2, covering for someone: Go to the schedule in PagerDuty and pick the shift and person you will override. You should get a pop-up where you can pick yourself and the times for the override.
A Professional Plus - Responder Role is sufficient to be an Incident Manager. PagerDuty Roles reference
Shifts are assigned based on the working hours that you selected during onboarding. Our current process is to swap shifts by asking for someone to take this shift in the #imoc_general Slack channel.
When an Incident Manager shift includes a weekend the team member can shift their work-week to include that day (and exclude another day). As an example, if an Incident Manager shift includes Saturday, then the team member could plan their work-week for that week to be Tues-Sat.
While the example above is the intended idea, anything close to it that works for the team member will be fine as well. For example, you might prefer to take some other day off in the adjoining weeks, or to work non-linear workdays to accommodate the shift.
Two things that won't work:
Yes, being on-call for incident management is considered vital to position for expenses purposes.
Adding Incident Manager responsibilities to a direct report's role requires cooperation with their manager. The manager should support them when they are assigned Incident Manager duties, particularly when shifts overlap with weekends. Please make sure Incident Managers are empowered to delegate, postpone or drop some responsibilities given these additional responsibilities.
If the manager serves as an Incident Manager themselves as well, modelling similar behaviour is as vital as supporting their direct reports.
webcal://... URL into Google Calendar by adding a calendar under the Other Calendars dropdown and select from url. Paste the webcal link.
webcal://...), e.g. IMOC Shifts, My On-Call Shifts, etc.
Benefits of adding the PagerDuty IMOC schedule into Google Calendar:
We have a detailed handbook page that covers how we do Incident Management at GitLab.