At GitLab, everyone can contribute to GitLab.com's availability. We measure that availability using several Service Level Indicators (SLIs), but it's not always easy to see how the features you're building are performing. GitLab's features are divided among development groups, and every group has its own dashboard displaying an availability score.
When a group's availability drops below 99.95%, we work with the group to figure out why and how we can improve the performance or reliability of the features that caused the drop. The 99.95% service level objective (SLO) is the same target the infrastructure department has set for GitLab.com's availability.
Providing specific data about how features perform on our production systems has made it easier to recognize when performance and availability work should be prioritized.
Service availability on GitLab.com
Our infrastructure is separated into multiple services, handling different kinds of traffic but running the same monolithic Rails application. Not all features have a similar usage pattern. For example, on the service handling web requests for GitLab.com we see a lot more requests related to code_review or team_planning than we do related to source_code_management. It's important that we look at these feature categories in isolation as well as aggregated per service.
Nobody knows better how to interpret these numbers, aggregated by feature, than the people who build those features.
This number is sourced from the same SLIs that we use to monitor GitLab.com's availability. We calculate it by dividing the number of successful measurements by the total number of measurements over the past 28 days. A measurement can be several things; most commonly it is a request handled by our Rails application or a background job.
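As a rough sketch of that calculation (the counts below are made up for illustration; the real values come from our SLI metrics, not from code like this):

```python
# Illustrative sketch of the 28-day availability score described above.
# The numbers are invented; in production these counts come from SLI metrics.

SLO_TARGET = 0.9995  # 99.95%

def availability(successful: int, total: int) -> float:
    """Fraction of successful measurements over the past 28 days."""
    if total == 0:
        return 1.0  # no traffic means no failed measurements
    return successful / total

# Example: a group's features handled 40,000,000 measurements
# (requests and background jobs) in 28 days, and 25,000 of them failed.
score = availability(successful=40_000_000 - 25_000, total=40_000_000)
print(f"availability: {score:.4%}, meets SLO: {score >= SLO_TARGET}")
# availability: 99.9375%, meets SLO: False
```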
Monitoring feature and service availability
For monitoring GitLab.com we have Grafana dashboards, generated using Grafonnet, that show these source metrics in several dimensions. For example, one of these dashboards shows the error rates of our monolithic Rails application, separated by feature.
We also generate multiwindow, multi-burn-rate alerts as defined in Google's SRE workbook.
The red lines represent alerting thresholds for a burn rate. The thinner threshold means we'll alert if the SLI has spent more than 5% of its monthly error budget in the past 6 hours. The thicker threshold means we'll alert when the SLI has spent more than 2% of its monthly error budget in the past hour.
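To make those thresholds concrete, here is a minimal sketch, assuming a 99.95% SLO and a 30-day error-budget period, of how a budget-consumption threshold translates into a burn rate and an error-rate threshold. This is illustrative, not our production alerting code:

```python
# Translate "X% of the monthly error budget spent in window W"
# into a burn rate and an error-rate alerting threshold.
# Assumes a 99.95% SLO and a 30-day period; names are illustrative.

SLO = 0.9995
ERROR_BUDGET = 1 - SLO     # 0.05% of measurements may fail
PERIOD_HOURS = 30 * 24     # a 30-day error-budget period

def burn_rate(budget_fraction: float, window_hours: float) -> float:
    """How many times faster than 'exactly on budget' the budget is burning."""
    return budget_fraction * PERIOD_HOURS / window_hours

for budget_fraction, window_hours in [(0.02, 1), (0.05, 6)]:
    rate = burn_rate(budget_fraction, window_hours)
    error_rate_threshold = rate * ERROR_BUDGET
    print(f"{budget_fraction:.0%} of budget in {window_hours}h: "
          f"burn rate {rate:.1f}x, "
          f"alert when error rate > {error_rate_threshold:.3%}")
# 2% of budget in 1h: burn rate 14.4x, alert when error rate > 0.720%
# 5% of budget in 6h: burn rate 6.0x, alert when error rate > 0.300%
```

With a 99.95% SLO, these work out to roughly the 14.4x and 6x burn rates recommended in the SRE workbook.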
Because both GitLab.com's availability number and the availability number for development groups are sourced by the same metrics, we can provide similar alerts and graphs tailored to the development groups. Features with a relatively low amount of traffic would not easily show problems in our bigger service aggregations. With this mechanism we can see those problems and put them on the radar of the teams building those features.
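A back-of-the-envelope example with made-up numbers shows why this matters:

```python
# Made-up numbers showing how a low-traffic feature's problems
# can disappear in a service-wide aggregate.

service_requests = 10_000_000  # all requests handled by the web service in a window
feature_requests = 5_000       # requests in one low-traffic feature category
feature_errors = 2_500         # half of that feature's requests are failing

feature_error_rate = feature_errors / feature_requests
service_error_rate = feature_errors / service_requests  # assuming everything else succeeds

print(f"feature error rate: {feature_error_rate:.1%}")  # 50.0%
print(f"service error rate: {service_error_rate:.3%}")  # 0.025%
```

The feature is failing half the time, yet the service-wide error rate barely moves, which is why the per-feature view is what puts such problems on a group's radar.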
Building and adoption
In upcoming posts, we will talk about how we built this tooling and how we worked with other teams to get it adopted into the product prioritization process.