Published on: July 19, 2022

How we improved on-call life by reducing pager noise

Too many pages? Here's how we tackled on-call SRE quality of life by grouping alerts by service and only paging on downstream services.


To monitor the health of GitLab.com, we use multiple SLIs for each service. We page the on-call when one of these SLIs is not meeting our internal SLOs and is burning through the error budget, in the hope of fixing the problem before too many of our users notice.

All of our services' SLIs and SLOs are defined using jsonnet in what we call the metrics-catalog, where we specify a service and its SLIs/SLOs. For example, the web-pages service has an apdex SLO of 99.5% and multiple SLIs such as loadbalancer, go server, and time to write HTTP headers. Having these in code, we can automatically generate Prometheus recording rules and alerting rules that follow the multiple burn rate alerts approach. Every time we start burning through our 30-day error budget for an SLI too fast, we page the SRE on-call to investigate and solve the problem.
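To make "burning through the error budget too fast" concrete, here is a minimal Python sketch of a multiple burn rate check. The window pairs and factors below are the illustrative values from the Google SRE Workbook, not necessarily the ones the metrics-catalog generates, and the names `BurnRateWindow`, `burn_rate`, and `should_page` are made up for this example. For an apdex SLI, the "error ratio" is taken to be 1 minus the apdex score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateWindow:
    long_window: str   # window that proves the burn is significant
    short_window: str  # window that proves the burn is still happening
    factor: float      # burn-rate multiple that triggers a page


# Assumed thresholds (SRE Workbook examples for a 30-day budget):
# 14.4x over 1h spends about 2% of the budget, 6x over 6h about 5%.
WINDOWS = (
    BurnRateWindow("1h", "5m", 14.4),
    BurnRateWindow("6h", "30m", 6.0),
)


def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' the SLI is failing."""
    budget = 1.0 - slo  # e.g. a 99.5% SLO leaves a 0.5% error budget
    return error_ratio / budget


def should_page(error_ratios: dict[str, float], slo: float) -> bool:
    """Page only if both the long and the short window exceed the factor."""
    for w in WINDOWS:
        long_burn = burn_rate(error_ratios[w.long_window], slo)
        short_burn = burn_rate(error_ratios[w.short_window], slo)
        if long_burn > w.factor and short_burn > w.factor:
            return True
    return False


# Example: 10% of requests missed the apdex target over the last hour.
ratios = {"1h": 0.10, "5m": 0.12, "6h": 0.02, "30m": 0.10}
print(should_page(ratios, slo=0.995))
```

Requiring the short window as well means the page stops soon after the problem is fixed, instead of lingering until the long window drains.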

This setup has been working well for us for over two years now, but one big pain point remained: service-wide degradations. The SRE on-call was getting paged for every SLI associated with a service or its downstream dependencies, meaning they could get up to 10 pages per service, since a service has 3-5 SLIs on average and we also have regional and canary SLIs. This is distracting and stress-inducing, and it leaves the on-call acknowledging pages instead of focusing on solving the problem. For example, below we can see the on-call getting paged 11 times in 5 minutes for the same service.

web-pages alert storm

What is even worse is when we have a site-wide outage, where the on-call can end up getting 50+ pages because all services are in a degraded state.

site wide outage alert storm

This was a big problem for the on-call's quality of life, and we needed to fix it. We started researching how best to solve this problem and opened an issue to document all possible solutions. After some time, we decided to go with grouping alerts by service and introducing service dependencies for alerting/paging.

Group alerts by service

The smallest and most effective iteration was to group the alerts by the service. Taking the previous example where the web-pages service paged the on-call 11 times, it should have only paged the on-call once, and shown which SLIs were affected. We use Alertmanager for all our alerting logic, and this already had a feature called grouping so we could group alerts by labels.

This is what an alert looks like in our Prometheus setup:

ALERTS{aggregation="regional_component", alert_class="slo_violation", alert_type="symptom", alertname="WebPagesServiceServerApdexSLOViolationRegional", alertstate="firing", component="server", env="gprd", environment="gprd", feature_category="pages", monitor="global", pager="pagerduty", region="us-east1-d", rules_domain="general", severity="s2", sli_type="apdex", slo_alert="yes", stage="main", tier="sv", type="web-pages", user_impacting="yes", window="1h"}

All alerts have the type label attached to them to specify which service they belong to. We can use this label and the env label to group all the production alerts that are firing for the web-pages service.

grouping alerts by the env and type labels
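The effect of grouping can be sketched in a few lines of Python. This is a toy model of the idea, not Alertmanager's implementation: alerts that share the same values for the grouping labels collapse into a single notification. The label names `env`, `type`, `component`, and `sli_type` come from the alert shown above; the sample alerts, `group_alerts`, and `page_summary` are hypothetical.

```python
from collections import defaultdict

# Labels used as the grouping key, as described in the text.
GROUP_BY = ("env", "type")


def group_alerts(alerts, group_by=GROUP_BY):
    """Bucket alert label sets by the values of the grouping labels."""
    groups = defaultdict(list)
    for labels in alerts:
        # In this sketch a missing label is treated as an empty string.
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


def page_summary(key, alerts):
    """One page per group, listing which SLIs are affected."""
    env, service = key
    slis = sorted({f'{a["component"]} ({a["sli_type"]})' for a in alerts})
    return f"service {service} in {env}: " + ", ".join(slis)


# Hypothetical firing alerts: three for web-pages, one for api.
firing = [
    {"env": "gprd", "type": "web-pages", "component": "server",
     "sli_type": "apdex", "region": "us-east1-d"},
    {"env": "gprd", "type": "web-pages", "component": "server",
     "sli_type": "apdex", "region": "us-east1-c"},
    {"env": "gprd", "type": "web-pages", "component": "loadbalancer",
     "sli_type": "error"},
    {"env": "gprd", "type": "api", "component": "loadbalancer",
     "sli_type": "apdex"},
]

for key, alerts in group_alerts(firing).items():
    print(page_summary(key, alerts))
```

Four alerts produce two pages here, one per service, and each page still names the SLIs that are burning through the budget.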

We also had to update our PagerDuty and Slack templates to show the right information. Previously, we only showed the alert title and description, but this had to change since we are now alerting by service rather than by one specific SLO. You can see the changes at runbooks!4684.

Before and after on pages

This was already a big win! The on-call now gets a page saying "service web-pages" and then the list of SLIs that are burning through the error budget - we went from 11 pages to 1 page!

Service Dependencies

However, we still had the problem that when a downstream service (such as the database) starts burning through the error budget, it has a cascading effect: web, git, and api will also start burning through the error budget and page the on-call for each service. That was the next thing that we had to solve.

We needed some way to not alert on the api service if the patroni (database) service was burning through the error budget, because it's clear that if the database is degraded, the api service will end up degraded as well. We used another feature of Alertmanager called inhibition, where we can tell Alertmanager to not alert on api if some alerts on patroni are already firing.

visualization of how inhibit rules work
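As a rough model of what an inhibit rule does, the Python sketch below mutes a target alert when a firing source alert exists and both carry the same value for every label listed in `equal`. It mirrors the three parts of an Alertmanager inhibit rule (source matchers, target matchers, equal labels) but is a simplification: `InhibitRule`, `matches`, and `is_inhibited` are names invented for this example, and the `gstg` environment value is only a placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InhibitRule:
    source: dict  # labels a firing alert must carry to act as the inhibitor
    target: dict  # labels of the alerts that should be muted
    equal: tuple  # labels whose values must be identical on both alerts


def matches(labels, wanted):
    """Equality-only matching: every wanted label must have that exact value."""
    return all(labels.get(k) == v for k, v in wanted.items())


def is_inhibited(alert, firing, rule):
    """True if some firing source alert mutes this alert under the rule."""
    if not matches(alert, rule.target):
        return False
    return any(
        matches(src, rule.source)
        and all(src.get(name) == alert.get(name) for name in rule.equal)
        for src in firing
    )


# "Do not alert on api if patroni is already firing in the same env."
rule = InhibitRule(source={"type": "patroni"}, target={"type": "api"},
                   equal=("env",))

firing = [{"type": "patroni", "env": "gprd"}]

print(is_inhibited({"type": "api", "env": "gprd"}, firing, rule))  # muted
print(is_inhibited({"type": "api", "env": "gstg"}, firing, rule))  # other env
```

The `equal` list is what keeps a degraded database in one environment from silencing api alerts in another.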

As mentioned earlier, all of our SLIs/SLOs live in the metrics-catalog, so it was a natural fit to add dependencies there, and this is exactly what we did in runbooks!4710. With this, we can specify that an SLI depends on another SLI of a different service, which automatically creates inhibit_rules for Alertmanager.

Since inhibit rules could potentially prevent alerting someone, we've used these sparingly. To avoid creating inhibit rules too broadly, we've implemented the following restrictions:

  1. An SLI can't depend on an SLI of the same service.
  2. The SLI has to exist for that service.
  3. We only allow equality matching on SLIs, no regex.
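The three restrictions above can be sketched as a validation step that runs before an inhibit rule is emitted. The real logic lives in jsonnet in the metrics-catalog; this Python version only illustrates the checks. The catalog contents, the SLI names, the `{"type": ..., "component": ...}` shape of a dependency, and the choice of `env` as the only `equal` label are all assumptions made for the example.

```python
import re

# Hypothetical catalog: service name -> names of the SLIs it defines.
CATALOG = {
    "api": {"server", "loadbalancer"},
    "patroni": {"sql"},
}

# Plain names only, so a dependency can never smuggle in a regex.
_PLAIN_NAME = re.compile(r"[A-Za-z0-9_-]+")


def validate_dependency(service, dep, catalog):
    """Raise ValueError if the dependency breaks one of the restrictions."""
    dep_service, dep_sli = dep["type"], dep["component"]
    if dep_service == service:
        raise ValueError("an SLI can't depend on an SLI of the same service")
    for value in (dep_service, dep_sli):
        if not _PLAIN_NAME.fullmatch(value):
            raise ValueError(f"only equality matching is allowed: {value!r}")
    if dep_sli not in catalog.get(dep_service, set()):
        raise ValueError(f"{dep_service} has no SLI named {dep_sli!r}")


def inhibit_rule(service, sli, dep, catalog):
    """Build one Alertmanager-style inhibit rule for a validated dependency."""
    validate_dependency(service, dep, catalog)
    return {
        "source_matchers": [f'type="{dep["type"]}"',
                            f'component="{dep["component"]}"'],
        "target_matchers": [f'type="{service}"', f'component="{sli}"'],
        "equal": ["env"],
    }


print(inhibit_rule("api", "server",
                   {"type": "patroni", "component": "sql"}, CATALOG))
```

Rejecting bad dependencies when the rules are generated, rather than when they are loaded, keeps an overly broad rule from ever reaching Alertmanager, where it could silently swallow a page.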

After that, it was only a matter of adding dependsOn to each service, for example:

  1. web depends on patroni
  2. api depends on patroni
  3. web-pages depends on api

The web-pages inhibit rule shows a chain of dependencies from web-pages -> api -> patroni, so if patroni is burning through the error budget it will not page for api and web-pages services anymore!
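The outcome of such a chain can be modelled with a small graph walk. The Python sketch below uses the three example dependencies listed above and answers one question: given a set of degraded services, which ones would still page? It models the end result only, not how the metrics-catalog or Alertmanager arrive at it, and `dependency_chain` and `services_that_page` are hypothetical helper names.

```python
# Example dependencies from the text: service -> services it depends on.
DEPENDS_ON = {
    "web": ["patroni"],
    "api": ["patroni"],
    "web-pages": ["api"],
}


def dependency_chain(service, depends_on):
    """Every service that `service` depends on, directly or transitively."""
    seen = []
    stack = list(depends_on.get(service, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.append(dep)
            stack.extend(depends_on.get(dep, []))
    return seen


def services_that_page(degraded, depends_on):
    """A degraded service pages only if nothing it depends on is degraded."""
    return {
        service
        for service in degraded
        if not any(dep in degraded
                   for dep in dependency_chain(service, depends_on))
    }


print(dependency_chain("web-pages", DEPENDS_ON))
print(services_that_page({"patroni", "api", "web", "web-pages"}, DEPENDS_ON))
```

With all four services degraded, only patroni pages. If patroni is healthy but api and web-pages are degraded, the page comes from api, the deepest degraded dependency.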

How it's working

We have been using alert grouping and service dependencies for over a month now, and we have already seen some improvements:

  1. The on-call only gets paged once per service.
  2. When there is a large site-wide outage they only get paged 5-10 times since we have external probes that also alert us.
  3. There is an overall downward trend on pages for the on-call as seen below.

pages trend

Cover image by Yaoqi on Unsplash
