How we share SLIs across engineering departments

Mar 10, 2022 · 3 min read
Bob Van Landuyt

At GitLab, everyone can contribute to GitLab.com's availability. We measure availability using several Service Level Indicators (SLIs), but it's not always easy to see how the features you're building are performing. GitLab's features are divided among development groups, and every group has its own dashboard displaying an availability score.

Stage group availability

When a group's availability goes below 99.95%, we work with the group on figuring out why that is and how we can improve the performance or reliability of the features that caused their number to drop. The 99.95% service level objective (SLO) is the same target the infrastructure department has set for GitLab.com availability.

By providing specific data about how features perform on our production systems, it has become easier to recognize when it is important to prioritize performance and availability work.

Service availability on GitLab.com

Our infrastructure is separated into multiple services, handling different kinds of traffic but running the same monolithic Rails application. Not all features have a similar usage pattern. For example, on the service handling web requests for GitLab.com we see a lot more requests related to code_review or team_planning than we do related to source_code_management. It's important that we look at these in isolation as well as a service aggregate.

There's nobody who knows better how to interpret these numbers in feature aggregations than the people who build these features.

This availability number is sourced from the same SLIs that we use to monitor GitLab.com's availability. We calculate it by dividing the number of successful measurements by the total number of measurements over the past 28 days. A measurement can be several things, most commonly a request handled by our Rails application or a background job.
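As a rough illustration of that calculation (not the actual GitLab.com implementation), the Python sketch below computes an availability score from daily success and total counts over a 28-day window. The `DailyCount` type, the `availability` helper, and the sample numbers are assumptions made up for this example; the 28-day window and the 99.95% target come from the text above.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Iterable, Optional

SLO = 0.9995  # the 99.95% target mentioned in this post


@dataclass(frozen=True)
class DailyCount:
    """Hypothetical record: measurements for one day (requests or background jobs)."""
    day: date
    successes: int
    total: int


def availability(counts: Iterable[DailyCount], today: date,
                 window_days: int = 28) -> Optional[float]:
    """Successful measurements divided by all measurements in the window."""
    cutoff = today - timedelta(days=window_days)
    recent = [c for c in counts if cutoff < c.day <= today]
    total = sum(c.total for c in recent)
    if total == 0:
        return None  # no traffic in the window, so no score to report
    return sum(c.successes for c in recent) / total


# Made-up sample data: 40 days of identical traffic; only the last 28 count.
today = date(2022, 3, 10)
counts = [DailyCount(today - timedelta(days=i), 99_960, 100_000) for i in range(40)]
score = availability(counts, today)
meets_slo = score is not None and score >= SLO
```

Returning `None` for an empty window is a design choice for this sketch: a group without traffic gets no score instead of a misleading 0% or 100%.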

Monitoring feature and service availability

For monitoring GitLab.com we have Grafana dashboards, generated using Grafonnet, that show these source metrics in several dimensions. For example, these are error rates of our monolithic Rails application, separated by feature:

Puma SLI by feature

We also generate multiwindow, multi-burn-rate alerts as defined in Google's SRE workbook.

Puma SLI error rate and requests per second

The red lines represent alerting thresholds for a burn rate. The thin threshold means we'll alert if the SLI has spent more than 5% of its monthly error budget in the past 6 hours. The thicker threshold means we'll alert when the SLI has not met SLO for more than 2% of measurements in the past hour.
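To make the thresholds more concrete, here is a minimal Python sketch of the burn-rate arithmetic from the SRE workbook. It assumes a 30-day budget period and the workbook's recommended pairings (5% of the budget in 6 hours, 2% of the budget in 1 hour); the function names are hypothetical and the thresholds configured on the real dashboards may differ.

```python
PERIOD_HOURS = 30 * 24  # assumption: a 30-day error budget period
SLO = 0.9995            # the 99.95% target mentioned in this post


def burn_rate(budget_fraction: float, window_hours: float) -> float:
    """How fast the budget is being spent, relative to spending it evenly.

    Spending `budget_fraction` of the period's error budget within
    `window_hours` corresponds to this multiple of the even spending rate.
    """
    return budget_fraction * PERIOD_HOURS / window_hours


def should_alert(long_window_error_ratio: float, short_window_error_ratio: float,
                 slo: float, rate: float) -> bool:
    """Multiwindow check: both the long and the short window must exceed the threshold.

    The short window (the workbook suggests 1/12 of the long one) stops the
    alert from firing long after the problem has been resolved.
    """
    threshold = (1 - slo) * rate
    return long_window_error_ratio > threshold and short_window_error_ratio > threshold


six_hour_rate = burn_rate(0.05, 6)  # 5% of the budget in 6 hours
one_hour_rate = burn_rate(0.02, 1)  # 2% of the budget in 1 hour
```

Under these assumptions the two burn rates come out at about 6 and 14.4. With a 99.95% SLO, that corresponds to error ratios of roughly 0.3% and 0.72% sustained over the respective windows.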

Because both GitLab.com's availability number and the availability number for development groups are sourced from the same metrics, we can provide similar alerts and graphs tailored to the development groups. Features with relatively little traffic would not easily show problems in our bigger service aggregations. With this mechanism we can see those problems and put them on the radar of the teams building those features.
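The masking effect is easy to reproduce with made-up numbers. In the Python sketch below, the traffic counts and the `example_low_traffic_feature` category are invented for illustration; only `code_review` and `team_planning` are names taken from this post. The service aggregate stays within the error budget while one small feature is far outside it.

```python
from typing import Dict, List, Tuple

SLO = 0.9995
ERROR_BUDGET = 1 - SLO  # allowed error ratio: 0.05%

# Invented sample data: feature category -> (total measurements, failed measurements)
measurements: Dict[str, Tuple[int, int]] = {
    "code_review": (1_000_000, 200),
    "team_planning": (800_000, 160),
    "example_low_traffic_feature": (2_000, 100),  # hypothetical category
}


def error_ratio(total: int, errors: int) -> float:
    """Share of measurements that failed; 0.0 when there was no traffic."""
    return errors / total if total else 0.0


def features_over_budget(data: Dict[str, Tuple[int, int]], budget: float) -> List[str]:
    """Feature categories whose own error ratio exceeds the budget."""
    return sorted(name for name, (total, errors) in data.items()
                  if error_ratio(total, errors) > budget)


# Service aggregate: all features summed together.
aggregate_total = sum(total for total, _ in measurements.values())
aggregate_errors = sum(errors for _, errors in measurements.values())
aggregate_ratio = error_ratio(aggregate_total, aggregate_errors)

aggregate_ok = aggregate_ratio <= ERROR_BUDGET
violating = features_over_budget(measurements, ERROR_BUDGET)
```

Here the aggregate looks healthy, yet `violating` contains the low-traffic feature, which is the kind of problem a per-feature view is meant to surface.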

Building and adoption

In upcoming posts, we will talk about how we built this tooling and how we worked with other teams to have this adopted into the product prioritization process.

