Service Levels and Error Budgets

Overview

GitLab.com has a stated goal of 99.95% uptime, intended to support mission-critical workloads. In pursuing this goal, Infrastructure is focused on observable availability; we are also tasked with performing attribution as part of root cause analysis and carrying out the necessary accounting on error budgets. The first iteration on error budgets took place in 2018-Q3, and work is currently underway in issue infrastructure/5323 to define an uptime SLA.

While we are arguably improving, this improvement is difficult to measure and quantify, and potentially open to interpretation. These objectives and initiatives are, at this stage, somewhat abstract, acting primarily as goal markers and pursued mostly independently of one another.

Availability is not simply an uptime goal depicted as a number. It is a way to measure our ability to drive change through the environment, which requires us to manage risk.

As noted in Chapter 3 of the Site Reliability Engineering book, Embracing Risk:

Site Reliability Engineering seeks to balance the risk of unavailability with the goals of rapid innovation and efficient service operations, so that users’ overall happiness—with features, service, and performance—is optimized. […] Our goal is to explicitly align the risk taken by a given service with the risk the business is willing to bear.

We embrace risk by striving to manage it, and we must do so in a structured, data-driven fashion to achieve an optimal, predictable, sustainable and safe speed of change, tying together our availability and feature development commitments to our users with our organizational ability to execute on those commitments.

We must properly define the actual meaning of uptime, and we have to implement the framework through which that metric is calculated, tracked and evaluated. This is meaningless unless we also define the associated agreements on how we dial the speed of change to meet our availability objectives. This represents a company-wide commitment to the framework.

Service levels and error budgets provide this framework.

Service Levels

To implement service levels, we must first understand which behaviors are important to our users and how we measure and evaluate the performance of those behaviors. We will start by defining service level indicators (SLIs) and specifying service level objectives (SLOs) on those metrics. We can then create service level agreements (SLAs) to ensure we are committed to our availability objectives. In summary:

Time Window

The business has indicated that we must achieve an availability SLA on GitLab.com of 99.95%, which is roughly 22 minutes of unplanned downtime per month. We have selected the monthly time frame for this definition for two main reasons:

In order to define our next iteration on service levels, we are going to select five SLIs and set the corresponding SLOs per business requirements. This is fairly coarse, but it will serve its purpose as a first step towards establishing the framework. Future refinements will break down SLIs and SLOs per service, set specific goals per service criticality tier, define a documented formula for calculating an admittedly oversimplified, easy-to-consume uptime number, and explore using a four-week rolling window complemented with weekly summaries and quarterly summarized reports to track, calculate and evaluate error budgets in conjunction with incident data.
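As a quick check on the figures in this section, the downtime allowed by a 99.95% objective can be worked out directly. The arithmetic below assumes a 30-day month (an assumption; calendar months vary) and also shows the equivalent allowance for the four-week rolling window mentioned above:

```latex
\begin{align*}
\text{monthly budget} &= (1 - 0.9995) \times 30 \times 24 \times 60\ \text{min} = 21.6\ \text{min} \approx 22\ \text{min} \\
\text{four-week budget} &= (1 - 0.9995) \times 28 \times 24 \times 60\ \text{min} = 20.16\ \text{min}
\end{align*}
```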

SLIs

Our initial batch of SLIs will focus on availability (success rates) and latency of HTTP and SSH requests against GitLab.com, both from an overall point of view and through three specific user paths (clone, pull and push).

SLI Type | Definition
Availability | Number of successful HTTP requests over total HTTP requests
Availability | Number of successful SSH requests over total SSH requests
Availability | Number of successful HTTP requests for clone, pull and push operations over total HTTP requests for said operations
Availability | Number of successful SSH requests for clone, pull and push operations over total SSH requests for said operations
Latency | Proportion of requests faster than {latency} as a function of payload size for clone, pull and push operations
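The ratios in the table above can be sketched in a few lines of Python. This is a minimal illustration, not the production implementation: `RequestCounts`, `availability_sli`, `latency_sli` and `example_threshold` are hypothetical names, the sample numbers are made up, and the size-dependent latency threshold is a placeholder because the actual {latency} value is not defined here. Treating a window with no traffic as fully available is also an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass(frozen=True)
class RequestCounts:
    """Successful and total request counts for one SLI over a time window."""
    successful: int
    total: int


def availability_sli(counts: RequestCounts) -> float:
    """Successful requests over total requests."""
    if counts.total == 0:
        return 1.0  # assumption: no traffic is treated as fully available
    return counts.successful / counts.total


def latency_sli(
    requests: Iterable[Tuple[int, float]],
    threshold_for: Callable[[int], float],
) -> float:
    """Proportion of requests faster than a payload-size-dependent threshold.

    Each request is a (payload_bytes, duration_seconds) pair.
    """
    fast = 0
    total = 0
    for payload_bytes, duration_s in requests:
        total += 1
        if duration_s < threshold_for(payload_bytes):
            fast += 1
    return fast / total if total else 1.0


def example_threshold(payload_bytes: int) -> float:
    """Hypothetical threshold: 1 second plus 1 second per megabyte."""
    return 1.0 + payload_bytes / 1_000_000


# Made-up sample data for illustration only.
ssh_git = RequestCounts(successful=49_980, total=50_000)
samples = [(500_000, 1.2), (2_000_000, 3.5), (100_000, 0.4), (1_000_000, 1.9)]

print(availability_sli(ssh_git))
print(latency_sli(samples, example_threshold))
```

With the sample data, three of the four requests beat their threshold, so the latency SLI is 0.75.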

SLOs

The proposed SLO for each of the SLIs is set by the business at 99.95%.

Error Budgets

Error budgets can be calculated from the defined SLOs. We will deprecate the arbitrary severity-based point system for error budgets and implement time-based error budgets. We also need to define an error budget policy that the entire company adheres to so we can agree to operate at an optimal change speed.
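A time-based error budget derived from an SLO could look roughly like the sketch below. The function names and the incident durations are illustrative assumptions, and the 30-day window stands in for the monthly time frame; the actual formula and policy are still to be defined.

```python
from datetime import timedelta
from typing import Iterable


def error_budget(slo: float, window: timedelta) -> timedelta:
    """Downtime that the window can absorb while still meeting the SLO."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    return window * (1.0 - slo)


def budget_remaining(
    slo: float, window: timedelta, downtime: Iterable[timedelta]
) -> timedelta:
    """Budget left after subtracting recorded downtime; negative means overspent."""
    spent = sum(downtime, timedelta())
    return error_budget(slo, window) - spent


# Hypothetical incidents: 5 and 9 minutes of downtime in a 30-day window.
incidents = [timedelta(minutes=5), timedelta(minutes=9)]
remaining = budget_remaining(0.9995, timedelta(days=30), incidents)
print(remaining)
```

A negative result would indicate that the budget for the window has been exhausted, which is the point at which an error budget policy would come into play.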

The Production Queue

We need to track changes and incidents in production (as defined by incident and change management). As we make progress with our service levels and error budgets, most of this will be automated away. But we still need to keep an audit trail of recorded incidents, which helps during the transition to ensure our service level coverage and to prioritize further developments in this area. We do want to reduce the effort of analyzing this data, and thus the queue must adhere to the label schema.

Issue Types

The production queue supports four types of issues: changes, incidents, deltas and hotspots.

Severities

We standardize on the use of severity levels: S1, S2, S3 and S4.

Services

We have started to create structured service definitions.

Service labels (service:<service>[.<service>]) are used to associate services with production issues.
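To make the label format concrete, here is a hedged sketch of how such labels might be parsed when analyzing the queue. The function `parse_service_label`, the example label values, and the set of characters allowed in a service name are all assumptions; only the `service:<service>[.<service>]` shape comes from the text above.

```python
import re
from typing import Optional, Tuple

# Matches service:<service> with an optional .<service> suffix.
# Assumption: names contain no dots, colons or whitespace.
_SERVICE_LABEL = re.compile(
    r"^service:(?P<service>[^.\s:]+)(?:\.(?P<sub>[^.\s:]+))?$"
)


def parse_service_label(label: str) -> Optional[Tuple[str, Optional[str]]]:
    """Return (service, sub_service) for a service label, or None otherwise."""
    match = _SERVICE_LABEL.match(label)
    if match is None:
        return None
    return match.group("service"), match.group("sub")


# Hypothetical label values for illustration.
print(parse_service_label("service:web"))
print(parse_service_label("service:web.frontend"))
print(parse_service_label("severity:S1"))
```

Labels that do not follow the schema return `None`, so malformed entries can be flagged instead of silently miscounted.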

Attribution

Incidents must be attributed to specific teams so that error budgets can be properly accounted for.
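One way to picture this accounting is to total the downtime attributed to each team. The sketch below assumes each incident has already been attributed to exactly one team and carries a downtime duration; the team names, the durations and the function `budget_spent_by_team` are hypothetical.

```python
from collections import defaultdict
from datetime import timedelta
from typing import Dict, Iterable, Tuple


def budget_spent_by_team(
    incidents: Iterable[Tuple[str, timedelta]]
) -> Dict[str, timedelta]:
    """Sum attributed downtime per team from (team, downtime) pairs."""
    spent: Dict[str, timedelta] = defaultdict(timedelta)
    for team, downtime in incidents:
        spent[team] += downtime
    return dict(spent)


# Hypothetical attributed incidents.
attributed = [
    ("team-a", timedelta(minutes=4)),
    ("team-b", timedelta(minutes=7)),
    ("team-a", timedelta(minutes=2)),
]

for team, downtime in sorted(budget_spent_by_team(attributed).items()):
    print(team, downtime)
```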

Commit

Rather than putting together a grand plan to implement the service levels and error budgets framework, let's deploy our values to help us navigate through this transition: