These targets give our users an indication of the platform's reliability.
Additionally, GitLab.com Service Level Availability is part of our contractual agreement with platform customers. The contract may define a specific target number, and failing to honour that agreement can result in financial and reputational costs.
The Google SRE book is generally a recommended read and under the "Motivation for Error Budgets" section, it states:
The error budget provides a clear, objective metric that determines how unreliable the service is allowed to be within a single quarter. This metric removes the politics from negotiations between the SREs and the product developers when deciding how much risk to allow.
This is the goal we are striving for too, while also acknowledging that in order to arrive at the same level of sophistication, we need to take into account our specific situation, maturity, and additional requirements. Our initial approach will directly tie the Error Budget SLO to our existing approach to availability.
Future iterations of our error budgets will seek to further develop the importance of the Product Manager in balancing risk tolerance with feature velocity. The above-mentioned clarity between developers and SRE is achieved by establishing the appropriate measures and targets for each service or area of product. Ultimately this balances the importance of new feature work with the ongoing service expectations of users.
Error Budgets first depend on establishing an SLO (Service Level Objective). SLOs are made up of an objective, an SLI (Service Level Indicator), and a timeframe.
Here is an example of these elements:
Taken together, the above example SLO would be: 99.95% of the 95th percentile latency of API requests over 5 minutes is < 100ms over the previous 28 days.
The Error Budget is then 1 - Objective of the SLO, in this case (1 - .9995 = .0005). Using our 28 day timeframe, the "budget" for errors is 20.16 minutes (.0005 * (28 * 24 * 60)).
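The arithmetic above can be sketched in a few lines. This is a minimal illustration using the numbers from the example SLO; the variable names are ours, not part of any GitLab tooling.

```python
# Error budget arithmetic for the example SLO above.
objective = 0.9995             # SLO objective: 99.95%
window_minutes = 28 * 24 * 60  # 28-day window = 40,320 minutes

error_budget = 1 - objective   # 0.0005, i.e. 0.05% allowed unreliability
budget_minutes = error_budget * window_minutes

print(round(budget_minutes, 2))  # 20.16 minutes of allowed unavailability
```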
While the above example shows the SLI as a latency measurement, it is important to note that other measurements (such as % errors) are also good elements to use for SLIs.
GitLab's current implementation of Error Budgets is only using some of the above sophistication of SLOs and Error Budgets, but as we continue with the roadmap we look to incorporate more into our approach. Additionally, it is expected that the practices of SLOs and Error Budgets evolve to have both the objective and the SLI vary (appropriately) based on the criticality of the service as well as the resiliency of other services and components which depend on it.
GitLab is a complex system that needs to be delivered as a highly available SaaS platform. Over the years, several processes have been introduced to address some of the challenges of maintaining feature delivery velocity while ensuring that the SaaS reliability continues to increase.
The Infradev Process was created to prioritize resolving an issue after an incident or degradation has happened. While the process has proven to be successful, it is event-focused and event-driven.
The Engineering Allocation Process was created to address long term team efficiency, performance and security items.
The initial iteration of error budgets at GitLab aims to introduce objective data and establish a system that will create greater insight into how individual features are performing over an extended period of time. This can be used by the organization to correctly allocate focus, ensure that the risk is well balanced and that the system as a whole remains healthier for extended periods of time.
Assigning error budgets down to the feature category sets a baseline for specific features, which in turn should ensure alignment on prioritizing what's important for GitLab SaaS.
The error budgets process has a few distinct items:
The stakeholders in the Error Budget process are:
The error budget is calculated based on availability targets. With the current target of 99.95% availability, the allowed unavailability window is 20 minutes per 28-day period.
We elected to use the 28 day period to match Product reporting methods.
The budget is set on the SaaS platform and is shared between stage and infrastructure teams. The Service Level Availability calculation methodology is covered in detail on the GitLab.com SLA page.
This includes all Rails Controllers, API Endpoints, Sidekiq workers, and other SLIs defined in the service catalog. This is attributed to groups by defining a feature category. Documentation about feature categorisation is available in the developer guide.
This budget does not take into account the number or complexity of the features owned by a team, existing product priorities, or the team size.
On the 4th of each month, the following announcements are made:
The announcements appear in
Feature categories with monthly spend above the allocated budget for three consecutive months are reported as part of the Engineering Allocation meeting.
The current budget spend can be found on the general SLA dashboard.
Spent budget is the time (in minutes) during which user-facing services have experienced an error percentage above the specified threshold, or latency above the specified objectives for the service. The details on how SLA is calculated can be found at the GitLab.com SLA page.
The budget spend is currently aggregated at the primary service level.
Details on what contributed to the budget spend can be further found by examining the raised incidents, and exploring the specific service dashboard (and its resources).
There is an example available with a more detailed look at how this is built.
Stage groups can use their dashboards to explore the cause of their budget spend. The process to investigate the budget spend is described in the developer documentation.
The formula for calculating availability:
(the number of operations with a satisfactory apdex + the number of operations without errors) / (the total number of apdex measurements + the total number of operations)
This gives us the percentage of operations that completed successfully and is converted to minutes:
(1 - stage group availability) * (28 * 24 * 60)
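The two formulas above can be combined into a short sketch. The counts below are hypothetical sample values chosen only to show the shape of the calculation:

```python
# Stage group availability, per the formula above, using hypothetical counts.
satisfactory_apdex = 99_900  # operations with a satisfactory apdex
error_free_ops = 99_950      # operations without errors
total_apdex = 100_000        # total apdex measurements
total_ops = 100_000          # total operations

availability = (satisfactory_apdex + error_free_ops) / (total_apdex + total_ops)

# Convert the unavailability fraction into budget spend minutes over 28 days.
spend_minutes = (1 - availability) * (28 * 24 * 60)

print(availability, round(spend_minutes, 2))  # 0.99925 availability, 30.24 minutes spent
```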
Apdex and Error Rates are explained in more detail on the handbook page.
Error Budget Spend information is available on the Error Budgets Overview Dashboard in Sisense.
System-wide incidents affecting shared services (such as the database or Redis) may have an impact on a team's Error Budget spend. Since we look at spend over a 28-day period, the impact of these short-lived events should be mostly negligible.
If the impact is significant, we can discuss in the Monthly Report whether the incident warrants a manual adjustment to spend.
At this time we are not looking further into automatically discounting system-wide events from group-level error budgets. The team is focused on building a strong foundation for error budgets, with sufficient tuning capability to be relevant for each group.
Error budget events are attributed to stage groups via feature categorization. Updates to this mapping can be managed via merge requests if ownership of a part of the platform moves from one feature category to another.
Updates to feature categories only change how future events are mapped to stage groups. Previously reported events will not be retroactively updated and will continue to count against stage group error budgets.
All feature categories are expected to perform within their Error Budget regardless of traffic share. This ensures a consistent approach to prioritization of reliability concerns.
Error Budgets should be reviewed monthly as part of the Product Development Timeline.
The balance between feature development and reliability development for a feature category should be as follows:
| Monthly budget spend | Expected balance |
| --- | --- |
| <= 20 minutes | Understand your spend - no further action required. |
| > 20 minutes | Commitment to reliability/availability improvements; feature development is secondary. |
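The review rule above reduces to a simple threshold check. This is an illustrative sketch only; the function name, the constant, and the returned strings are ours, not an official GitLab API:

```python
# Hypothetical helper mapping a feature category's monthly budget spend
# (in minutes) to the recommended development balance from the table above.
BUDGET_MINUTES = 20  # 99.95% availability over a 28-day period

def review_action(spend_minutes: float) -> str:
    if spend_minutes <= BUDGET_MINUTES:
        return "Understand your spend - no further action required."
    return ("Commitment to reliability/availability improvements; "
            "feature development is secondary.")
```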
Feature categories with monthly spend above the allocated budget for three consecutive months may have additional feature development restrictions put in place.
This is subject to change as Error Budget spend across feature categories decreases.
| Role | K/PI | Target | Current Tracking Status |
| --- | --- | --- | --- |
| Product Management | Maintaining the Spend of the Error Budget | 20 minutes over 28 days (equivalent to 99.95% availability) | Complete - In Sisense |
| Infrastructure | Setting the Error Budget Minutes and Availability Target | 99.95% (20 minutes over 28 days Error Budget) | Complete - In Grafana |
The changes below aim to increase the maturity of the Error Budgets.
`not_owned` events will be attributed to the correct feature category. This will be addressed by using caller information for Sidekiq and by establishing GraphQL query-to-feature correlation.
Product Development Activities
Product Development teams are encouraged to: