If you're observing issues on GitLab.com, or you're on a team that works with
customers or users who are observing issues, a member of the chatops project
can use the command /chatops run oncall prod in the #production Slack channel.
The GitLab ChatOps bot will return the names of the
Engineer On Call (EOC) and the Incident Manager On Call (IMOC). Please @ mention
the engineer in Slack and reference the GitLab issue that contains details of
the problem, if one exists.
If you're not a member of the chatops project, you can ask someone who is a member
to run that command for you and then add you to chatops. To be added, log in to ops.gitlab.net, change your username to match your GitLab.com username, and then have the on-call add you with /chatops run member add USERNAME gitlab-com/chatops --ops.
If you need an immediate response from the engineer on-call (EOC), type /pd
trigger in the #production Slack channel, and choose the "gitlab-production"
service. Include a brief title. To summon the incident manager on-call (IMOC),
choose the "SRE Managers" service instead. If in doubt, you'll want the engineer
on-call. This should only be used for production emergencies.
We use PagerDuty to set the on-call
schedules, and to route notifications to the appropriate individual(s). There
are escalation policies in place for Production issues (i.e. GitLab.com
downtime), Security concerns, and Customer emergencies.
Expectations for On-Call
If you are on call, you are expected to be available and ready to respond to PagerDuty pings as soon as possible, and certainly within any response times set by our Service Level Agreements in the case of Customer Emergencies. For example, this may require bringing a laptop and a reliable internet connection with you if you have plans outside of your work space while on call.
We take on-call seriously. There are escalation policies in place so that if a first responder does not respond fast enough, other team members are alerted. Such policies are essentially expected to never be triggered, but they cover extreme and unforeseeable circumstances.
Triage the infrastructure issue tracker, applying appropriate labels and reaching out to others when needed, so the rest of the team can focus on the WoW.
Act as an umbrella to avoid team randomization.
Provide support to the release managers in the release process.
Automate as much as possible to get rid of toil, and create new alerts and tools to enhance the on-call experience.
As noted in the main handbook, take time off after being on-call. Being available for issues and outages will wear you down even if you had no pages, and resting is critical for proper functioning. Just let your team know.
To swap an on-call shift or temporarily replace someone, enter this as an override on the main schedule in PagerDuty.
This is done by clicking on the relevant block of time in PagerDuty, selecting "override", and
filling in the name of the person you swapped with. Also see this article for reference.
Customer Emergency On-Call Rotation
We do 7 days of 8-hour shifts in a follow-the-sun style, based on your location.
After 10 minutes, if the alert has not been acknowledged, everyone on the customer on-call rotation is alerted. After a further 5 minutes, management is alerted.
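The escalation timing above can be sketched as a small lookup. This is only an illustration of the stated delays, not how PagerDuty is actually configured; the tier labels and the notified_parties helper are hypothetical names introduced for the example.

```python
from datetime import timedelta

# Escalation tiers for the customer emergency rotation, as described above.
# The labels are illustrative names, not PagerDuty identifiers.
ESCALATION_TIERS = [
    (timedelta(minutes=0), "on-call engineer"),
    (timedelta(minutes=10), "everyone on the customer on-call rotation"),
    (timedelta(minutes=15), "management"),  # 10 minutes plus a further 5
]


def notified_parties(unacknowledged_for: timedelta) -> list[str]:
    """Return everyone alerted so far for an alert left unacknowledged."""
    return [who for delay, who in ESCALATION_TIERS if unacknowledged_for >= delay]


# An alert ignored for 12 minutes has reached the whole rotation,
# but not yet management.
print(notified_parties(timedelta(minutes=12)))
```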
Unless advised otherwise by the technical account manager or your manager, assume good intent
and give every emergency ticket from customers with priority support the emergency SLA.
After 45 minutes, if the customer has not responded to our initial contact with them, let them know that the emergency ticket will be closed and that you are opening a normal priority ticket on their behalf. Also let them know that they are welcome to open a new emergency ticket if necessary.
After each shift, if there was an alert or incident, the on-call person will send a handoff email to the next person on call, explaining what happened and what is still ongoing, and pointing to the relevant issues and their progress.
If you need to reach the current on-call engineer and they're not accessible on Slack (e.g. it's a weekend, or the end of a shift), you can manually trigger a PagerDuty incident to get their attention, selecting Customer Support as the Impacted Service and assigning it to the relevant Support Engineer.
The Infrastructure department's Reliability Engineering teams provide 24/7 on-call coverage for the production environment. There are three primary job functions with their own PagerDuty schedules: Site Reliability Engineers (SRE), Database Reliability Engineers (DBRE), and Reliability Engineering Managers. Each individual has a unique set of responsibilities. (For details, please see incident-management.)
We do 7 days of 12-hour shifts in a follow-the-sun style, based on your location.
After 15 minutes, if the alert has not been acknowledged, the manager on-call will be notified.
The main expectation when being on-call is to triage the page, and to determine the urgency and severity of the incident. We aim to resolve incidents using well-documented runbooks.
If a manager or DBRE is online in Slack and you need their help, engage them! However, if there's any doubt about their availability, you should immediately escalate the incident in PagerDuty.
When on-call, your focus is working with the on-call manager to organize the SRE On-call board. The board is prioritized to track ongoing incidents first, and then to include issues focused on improving on-call rotations.
Database Reliability Engineer (DBRE)
For database-related issues, the DBRE on-call should be paged. We have support from OnGres, a consultancy that specializes in PostgreSQL databases. Only the EOC or IMOC should page OnGres.
For production issues, the SLA for the OnGres first response is categorized as follows:
◦ Severity 1 and 2 - Critical issues. The system is down or not responding. SLA: 15 minutes.
◦ Severity 3 - Non-critical issues. Significant system degradation or limited capabilities that impact the production system, although it still works in a degraded mode. SLA: 1 hour.
◦ Severity 4 - Low impact issues. Any system malfunction that does not impede normal production operation but may slightly limit the performance or capabilities of the system. SLA: 4 hours.
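As a rough illustration, the severity-to-SLA mapping above can be expressed as a lookup that gives the latest expected first-response time for a page. The names ONGRES_FIRST_RESPONSE_SLA and first_response_due are hypothetical and exist only for this sketch; the durations are the ones listed above.

```python
from datetime import datetime, timedelta, timezone

# First-response SLAs by severity, taken from the list above.
ONGRES_FIRST_RESPONSE_SLA = {
    1: timedelta(minutes=15),
    2: timedelta(minutes=15),
    3: timedelta(hours=1),
    4: timedelta(hours=4),
}


def first_response_due(paged_at: datetime, severity: int) -> datetime:
    """Return the latest time a first response is expected for a page."""
    if severity not in ONGRES_FIRST_RESPONSE_SLA:
        raise ValueError(f"unknown severity: {severity}")
    return paged_at + ONGRES_FIRST_RESPONSE_SLA[severity]


# Example with a made-up page time: a Severity 3 page sent at 09:00 UTC.
paged = datetime(2020, 1, 6, 9, 0, tzinfo=timezone.utc)
print(first_response_due(paged, 3).isoformat())  # 2020-01-06T10:00:00+00:00
```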
Alerts are not triggered via automation. All escalations to the DBREs are initiated by a human, either the SRE or manager on-call.
The main expectation when being on-call is to provide expert-level database knowledge to the SRE who escalated the issue. You should assume that the SRE attempted to follow a runbook but was unable to resolve the issue alone. In that case, the SRE should have already begun capturing information in an issue and will communicate status to the DBRE once the DBRE is online.
Over time, most scenarios that required consulting the DBRE on-call should be addressed in a runbook that the SRE on-call can execute confidently, alone, or, preferably, by evolving our tools to handle issues with automation.
We do 5 days of 8-hour shifts during weekdays, and a single 48-hour shift over the weekend.
Managers are responsible for coordinating communications across all parties actively working on the resolution of an ongoing incident, including both infrastructure and support engineers.
Status updates for ongoing incidents, both internal to the company and external to customers, are the responsibility of the manager on-call.
Security Team On-Call Rotation
We do 5 days of 12-hour shifts during weekdays, and 2 days of 24-hour shifts during weekends.
After 15 minutes, if the alert has not been acknowledged, the Security manager on-call is alerted.
When on-call, prioritize work that will make the on-call better (that includes building projects, systems, adding metrics, removing noisy alerts). Much like the Production team, we strive to have nothing to do when being on-call, and to have meaningful alerts and pages. The only way of achieving this is by investing time in trying to automate ourselves out of a job.
The main expectation when on-call is triaging the urgency of a page. If the security of GitLab is at risk, do your best to understand the issue and coordinate an adequate response. If you don't know what to do, engage the Security manager on-call to help you out.
In principle, it is straightforward to add or remove people from the on-call schedules, through the same "schedule editing" links provided above for setting overrides. However, do not change the timezone setting (located in the upper left corner of the image below) unless you are absolutely certain you intend to. When editing a schedule (adding, removing, or changing time blocks, etc.), keep that timezone setting constant. If you change it, PagerDuty will not move the time 'blocks' for on-call duty; instead, it will assume that you meant to keep the selected time blocks (e.g. "11am to 7pm") in the new timezone. As a result, your new schedule may become disjointed from the old one (that is, the schedule as set before the "change on this date" selection), and gaps may appear in the schedule.
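To see why a timezone change can open gaps, the sketch below reads the same "11am to 7pm" block in two different timezones and compares the absolute (UTC) hours it covers. It does not call PagerDuty; the fixed UTC+1 and UTC-8 offsets, the date, and the block_in_utc helper are assumptions chosen purely for illustration.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets stand in for real timezones (hypothetical values, no DST).
OLD_TZ = timezone(timedelta(hours=1))   # schedule originally set in UTC+1
NEW_TZ = timezone(timedelta(hours=-8))  # timezone setting changed to UTC-8


def block_in_utc(start_hour: int, end_hour: int, tz: timezone):
    """Read a wall-clock block on a fixed date in tz; return its UTC bounds."""
    start = datetime(2020, 1, 6, start_hour, tzinfo=tz)
    end = datetime(2020, 1, 6, end_hour, tzinfo=tz)
    return start.astimezone(timezone.utc), end.astimezone(timezone.utc)


old_start, old_end = block_in_utc(11, 19, OLD_TZ)
new_start, new_end = block_in_utc(11, 19, NEW_TZ)

# The block keeps its "11am to 7pm" label but covers different absolute hours.
shift = new_start - old_start
print(f"old block (UTC): {old_start:%H:%M} to {old_end:%H:%M}")
print(f"new block (UTC): {new_start:%H:%M} to {new_end:%H:%M}")
print(f"coverage moved by {shift}")
```

With these assumed offsets, the block that used to cover 10:00 to 18:00 UTC now covers 19:00 to 03:00 UTC, so the hours it previously covered are left without anyone on call unless neighboring blocks are adjusted too.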