How do you manage on-call incidents on a team of eight distributed across three time zones? Every week, production engineers are assigned to on-call duty. With this comes the expectation of being available to respond to any issue that results in a critical alert. Additionally, on-call engineers act as an umbrella for the rest of the team by triaging and handling all issues related to GitLab.com infrastructure.
The production team structures on-call shifts so that they follow the sun, to avoid waking up members of the team in the middle of the night. This works well for GitLab's remote-only culture, where there are engineers in multiple time zones. Occasionally, an on-call engineer will need to respond to an issue outside normal working hours; in these situations, GitLab encourages team members to take time off after their shift to recover.
The on-call handover
As the team members working on-call shifts are distributed and their working hours don't always overlap, you can see how it would be easy for things to slip through the cracks between one shift and the next. To prevent this from happening, once a week, the production team holds a 30-minute meeting called the on-call handover. One of the key tenets of GitLab is that everything starts with an issue, and the on-call handover is no exception! From a generated report, the team reviews incidents that occurred during the last seven days and decides whether they need additional attention or escalation.
After that, we check all GitLab issues with the on-call label to see if there are any that need to move from the current shift to the next one. At the end, there is a brief review of seven-day graphs. These help us keep an eye out for anything anomalous in our key metrics. If anything seems out of the ordinary or warrants further investigation, the team will dig into it to see if we can identify the root cause. The production team at GitLab encourages leads of other groups to attend the review, as this helps bring to our attention any high-priority items specific to individual services.
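As a rough illustration of the carry-over step, the sketch below filters a list of issues for those that are still open and carry the on-call label. The fields used (`iid`, `title`, `state`, `labels`) follow the shape of issues returned by the GitLab issues API, but the label name `oncall`, the function name, and the sample data are assumptions made for this example, not the team's actual setup.

```python
from typing import Dict, List

ONCALL_LABEL = "oncall"  # assumed label name, for illustration only


def issues_to_hand_over(issues: List[Dict], label: str = ONCALL_LABEL) -> List[Dict]:
    """Return the issues that are still open and carry the on-call label."""
    return [
        issue
        for issue in issues
        if issue["state"] == "opened" and label in issue["labels"]
    ]


# Hypothetical sample data standing in for an API response.
sample_issues = [
    {"iid": 101, "title": "Disk filling up on a database host",
     "state": "opened", "labels": ["oncall"]},
    {"iid": 102, "title": "Expired certificate renewed",
     "state": "closed", "labels": ["oncall"]},
    {"iid": 103, "title": "Update runbook links",
     "state": "opened", "labels": ["documentation"]},
]

# Print the issues the next shift would inherit.
for issue in issues_to_hand_over(sample_issues):
    print(f"#{issue['iid']} {issue['title']}")
```

Only open issues with the label are carried over; closed ones and unrelated issues are left out of the handover list.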
Automating the on-call handover
Drinking our own wine by using GitLab for on-call report generation has proven to be a good way to automate some of the more tedious work of the handover. To aid with this, the production team developed a program called the on-call robot assistant. It pulls data from relevant sources such as PagerDuty, Grafana, and GitLab itself to generate a report in a GitLab issue.
The program automates the following tasks:
- Pulling the last shift's incidents from PagerDuty
- Generating issue stats from the production backlog
- Displaying seven-day graphs for the key performance metrics that we monitor, sourced from GitLab Prometheus monitoring
- Generating an on-call report in a GitLab issue
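To make the tasks above more concrete, here is a minimal sketch of the final step: combining already-collected data into the text of a report. It is not the on-call robot assistant's actual code. The `Incident` class, the `render_report` function, the section headings, and the sample values are all hypothetical, and fetching from PagerDuty, GitLab, or Prometheus is deliberately left out.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Incident:
    """Hypothetical, simplified view of an incident from the last shift."""
    title: str
    status: str  # e.g. "resolved" or "triggered"


def render_report(
    incidents: List[Incident],
    issue_stats: Dict[str, int],
    graph_urls: Dict[str, str],
) -> str:
    """Build a Markdown report body from the three kinds of collected data."""
    lines = ["## Incidents from the last shift", ""]
    if incidents:
        lines += [f"- {i.title} ({i.status})" for i in incidents]
    else:
        lines.append("No incidents.")

    lines += ["", "## Production backlog", ""]
    lines += [f"- {name}: {count}" for name, count in issue_stats.items()]

    lines += ["", "## Seven-day graphs", ""]
    # Each graph is embedded as a Markdown image link.
    lines += [f"![{name}]({url})" for name, url in graph_urls.items()]
    return "\n".join(lines)


# Hypothetical sample data; the URL is a placeholder.
report = render_report(
    incidents=[Incident("API latency alert", "resolved")],
    issue_stats={"open": 12, "closed this week": 5},
    graph_urls={"Error rate": "https://example.com/error-rate.png"},
)
print(report)
```

Keeping the rendering separate from the data fetching means the report layout can be changed or tested without calling any external service.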
These data sources are set in a simple configuration file, making it easy to iterate as we add new metrics to monitor. At GitLab, most of what we do is out in the open, so our on-call handover reports are available for anyone to check out. If you want to see previous reports from the on-call handovers, you can find them in our issue tracker.
For example, here is a recent report covering a previous week:
The report also includes graphs for key metrics the production team is monitoring:
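The format of that configuration file is not described here, so the following is only a sketch of the general idea, assuming a JSON file with a `graphs` list. The keys `title` and `query`, the metric names, and the `load_graphs` helper are invented for illustration.

```python
import json

# Hypothetical configuration: one entry per graph to include in the report.
CONFIG_TEXT = """
{
  "graphs": [
    {"title": "Error rate", "query": "example_error_rate"},
    {"title": "Request latency", "query": "example_request_latency"}
  ]
}
"""

REQUIRED_KEYS = {"title", "query"}


def load_graphs(config_text: str) -> list:
    """Parse the configuration and check that every graph entry is complete."""
    config = json.loads(config_text)
    graphs = config.get("graphs", [])
    for graph in graphs:
        missing = REQUIRED_KEYS - set(graph)
        if missing:
            raise ValueError(f"graph entry is missing keys: {sorted(missing)}")
    return graphs


for graph in load_graphs(CONFIG_TEXT):
    print(f"{graph['title']}: {graph['query']}")
```

With a layout like this, monitoring a new metric would mean adding one entry to the list rather than changing the program.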
When the team is finished reviewing the report, the current on-call engineer closes it and the shift officially ends.