These targets give our users indication of the platform reliability.
Additionally, GitLab.com Service Level Availability is also a part of our contractual agreement with platform customers. The contract might define a specific target number, and not honouring that agreement may result in financial and reputational burdens.
The Google SRE book is generally a recommended read and under the "Motivation for Error Budgets" section, it states:
The error budget provides a clear, objective metric that determines how unreliable the service is allowed to be within a single quarter. This metric removes the politics from negotiations between the SREs and the product developers when deciding how much risk to allow.
This is the goal we are striving for too, while also acknowledging that in order to arrive at the same level of sophistication, we need to take into account our specific situation, maturity and additional requirements. Our initial approach will directly tie Error Budget SLO with our existing approach to availability.
Future iterations of our error budgets will seek to further develop the importance of the Product Manager in balancing risk tolerance with feature velocity. The above-mentioned clarity between developers and SRE is achieved by establishing the appropriate measures and targets for each service or area of product. Ultimately this balances the importance of new feature work with the ongoing service expectations of users.
Error Budgets first depend on establishing an SLO (Service Level Objective). SLOs are made up of an objective, a SLI (Service Level Indicator), and a timeframe.
Here is an example of these elements:
Taken all together, the above example SLO would be: 99.95% of the 95th percentile latency of api requests over 5 mins is < 100ms over the previous 28 days
The Error Budget is then 1 - Objective of the SLO, in this case (1 - .9995 = .0005). Using our 28 day timeframe, the "budget" for errors is 20.16 minutes (.0005 * (28 * 24 * 60))
While the above example shows the SLI as a latency measurement, it is important to note that other measurements (such as % errors) are also good elements to use for SLIs.
GitLab's current implementation of Error Budgets is only using some of the above sophistication of SLOs and Error Budgets, but we expect to increase the sophistication in the future. It is expected that the practices of SLOs and Error Budgets evolve to have both the objective and the SLI vary (appropriately) based on the criticality of the service as well as the resiliency of other services and components which depend on it.
Web requests that result in a
500 status code error are counted. In Sidekiq, jobs that fail due to an unhandled exception are counted.
Engineers can use
Gitlab::ErrorTracking.track_exception, or other logging, freely without affecting the error budget.
GitLab is a complex system that needs to be delivered as a highly available SaaS platform. Over the years, several processes have been introduced to address some of the challenges of maintaining feature delivery velocity while ensuring that the SaaS reliability continues to increase.
The Infradev Process was created to prioritize resolving an issue after an incident or degradation has happened. While the process has proven to be successful, it is event-focused and event-driven.
The Engineering Allocation Process was created to address long term team efficiency, performance and security items.
The initial iteration of error budgets at GitLab aims to introduce objective data and establish a system that will create greater insight into how individual features are performing over an extended period of time. This can be used by the organization to correctly allocate focus, ensure that the risk is well balanced and that the system as a whole remains healthier for extended periods of time.
Assigning error budgets down to the feature category sets a baseline for specific features, which in turn should ensure alignment on prioritizing what's important for GitLab SaaS.
Budget failures panel and each link in the
Failure log links panel are ordered by the number of errors. Prioritising fixing the top offenders in these tables will have the biggest impact on the budget spent.
The error budgets process has a few distinct items:
The stakeholders in the Error Budget process are:
Error budget is calculated based on the availability targets.
With the current target of
99.95% availability, allowed unavailability window is
20 minutes per 28 day period.
We elected to use the 28 day period to match Product reporting methods.
The budget is set on the SaaS platform and is shared between stage and infrastructure teams. Service Level Availability calculation methodology is covered in details at the GitLab.com SLA page.
This includes all Rails Controllers, API Endpoints, Sidekiq workers, and other SLIs defined in the service catalog. This is attributed to groups by defining a feature category. Documentation about feature categorization is available in the developer guide.
The number or complexity of features owned by a team, existing product priorities, or the team size does not influence the budget.
On the 4th of each month, the following announcements are made:
The announcements appear in
Feature categories with monthly spend above the allocated budget for three consecutive months are reported as part of the Engineering Allocation meeting.
The current budget spend can be found on the general SLA dashboard.
Spent budget is the time (in minutes) during which user facing services have experienced a percentage of errors below the specified threshold and latency is above the specified objectives for the service. The details on how SLA is calculated can be found at the GitLab.com SLA page.
The budget spend is currently aggregated at the primary service level.
Details on what contributed to the budget spend can be further found by examining the raised incidents, and exploring the specific service dashboard (and its resources).
There is an example available with a more detailed look at how this is built.
Stage groups can use their dashboards to explore the cause of their budget spend. The process to investigate the budget spend is described in the developer documentation
The formula for calculating availability:
the number of operations with a satisfactory apdex + the number of operations without errors / the total number of apdex measurements + the total number of operations
This gives us the percentage of operations that completed successfully and is converted to minutes:
(1 - stage group availability) * (28 * 24 * 60)
Apdex and Error Rates are explained in more detail on the handbook page.
Error Budget Spend information is available on the Error Budgets Overview Dashboard in Sisense.
System-wide incidents affecting shared services (such as the database or Redis) may have an impact on a team's Error Budget spend. Since we look at spend over a 28-day period, the impact of these short lived events should be mostly negligible.
If the impact is significant, we can discuss on the Monthly Report if this incident should warrant a manual adjustment to spend.
At this time we are not looking further into automatically discounting system-wide events from group-level error budgets. The team is focused on building a strong foundation for error budgets, with sufficient tuning capability to be relevant for each group.
Error budget events are attributed to stage groups via feature categorization. Updates to this mapping can be managed via merge requests if ownership of a part of the platform moves from one feature category to another.
Updates to feature categories only change how future events are mapped to stage groups. Previously reported events will not be retroactively updated and will continue to count against stage group error budgets.
All feature categories are expected to perform within their Error Budget regardless of traffic share. This ensures a consistent approach to prioritization of reliability concerns.
Error Budgets should be reviewed monthly as part of the Product Development Timeline.
The balance between feature development and reliability development for a feature category should be as follows:
|Monthly Spend (28 days)||Action|
|<= 20 minutes||Understand your spend - no further action required.|
|> 20 minutes||Commitment to reliability/availability improvements, feature development is secondary.|
Feature categories with monthly spend above the allocated budget for three consecutive months may have additional feature development restrictions put in place. This is subject to change as Error Budget spend across feature categories decreases.
Our current contract is 99.95% availability and a 20 minute monthly error budget. However, the following groups have a temporarily adjusted budget based on business needs:
|Stage Group||Monthly Spend (28 days)||Business Reason||Review Date|
|Fulfillment:Provision||50%||We are prioritizing resolving errors causing sidekiq to go below error budgets.
|Global Search||1 hour/month (99.83%)||The SLI and SLO are being redesigned to better reflect the system status and the work is working-in-progress.||2022-10-30|
|Manage:Workspace||99.85%||To allow time for the group to address issues with the endpoints connected with listing many projects at once, cooperate with other teams and API working group on that problem. Described in this MR||2022-12-31|
Temporary exceptions are granted as a means to allow different stakeholders to fulfill higher priority business needs, if it is estimated that the granted exception is not creating additional risk to GitLab.com reliability. Note that exceptions are different from Custom Targets, which set properties on endpoints defining acceptable performance.
Valid reasons for an exception are:
Instructions for Requesting an Exception
To request an exception, open an MR and add the stage group to the table above. In the description, supply the following details:
Provide answers to the following questions:
Follow the guidance and instructions above to expedite the approval process.
Assign the MR for approval to:
Work relating to error budget improvements should be detailed in an issue.
Please label these issues with
Improvement and the
group:: label so they can be tracked in reports.
|Role||K/PI||Target||Current Tracking Status|
|Product Management||Maintaining the Spend of the Error Budget||20 minutes over 28 days (equivalent to 99.95% availability)||Complete - In Sisense|
|Infrastructure||Setting the Error Budget Minutes and Availability Target||99.95% (20 minutes over 28 days Error Budget)||Complete - In Grafana|
The changes below aim to increase the maturity of the Error Budgets.
not_ownedwill be attributed to the correct feature category. This will be addressed by
Product Development Activities
Product Development teams are encouraged to: