These targets give our users an indication of the platform's reliability.
GitLab.com Service Level Availability is also part of our contractual agreement with platform customers. The contract might define a specific target number, and not honouring that agreement may result in financial and reputational burdens.
The Google SRE book is generally a recommended read, but specifically the Embracing Risk chapter covers in great detail the topic this page is aiming to cover.
Extreme reliability comes at a cost: maximizing stability limits how fast new features can be developed and how quickly products can be delivered to users, and dramatically increases their cost, which in turn reduces the numbers of features a team can afford to offer.
At GitLab, we are already highlighting the Importance of Velocity.
In the same chapter of the Google SRE book, under the "Motivation for Error Budgets" section, it states:
The error budget provides a clear, objective metric that determines how unreliable the service is allowed to be within a single quarter. This metric removes the politics from negotiations between the SREs and the product developers when deciding how much risk to allow.
This is the goal we are striving for too, while acknowledging that to arrive at the same level of sophistication, we need to take into account our specific situation, maturity and additional requirements. Our initial approach will tie the Error Budget SLO directly to our existing approach to availability. In future iterations, we intend to further develop the role of the Product Manager in setting SLOs. The above-mentioned clarity between developers and SREs is achieved by choosing the right SLO target, one that balances the importance of new feature work with the ongoing service expectations of users.
At the moment of writing this document in Q3 FY21, GitLab has several specific organizational requirements:
All of the above items contribute to the complexity of the already complex task of delivering a highly available SaaS platform. Over the years, several processes have been introduced to address some of the challenges of maintaining feature delivery velocity while ensuring that the SaaS reliability continues to increase.
The Availability and Performance Refinement process, also known as the Infradev Process, was created to prioritize resolving an issue after an incident or degradation has happened. While the process has proven to be successful, it is event-focused and event-driven.
In reality, the events causing reliability issues are often the culmination of a trend; this can be driven by the complexity of the feature but also by a lack of insight into how the feature performs.
Each team at GitLab has a specific set of metrics on which its performance is measured. This can often create differing short-term goals, which in turn can cause prioritization challenges.
Assigning error budgets down to the feature category sets a baseline for specific features, which in turn should ensure alignment on prioritizing what's important for GitLab SaaS.
The initial iteration of error budgets at GitLab aims to introduce and establish a system that will create greater insight into how individual features are performing over a longer period of time. This can be used by the organization to correctly allocate focus, ensure that the risk is well balanced and that the system as a whole remains healthier for extended periods of time.
The error budgets process has a few distinct items:
Rollout of the initial process identifies the following stakeholders:
Both the Stage teams and Infrastructure may contribute to budget spend.
The error budget is calculated from the availability target.
With the current target of 99.95% availability, the allowed unavailability window is 21.6 minutes per 30-day month.
As our availability targets are reported on calendar months, error budgets are reported on calendar months.
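The arithmetic behind the monthly figure can be sketched as follows. This is a minimal illustration, not the calculation used by the SLA dashboard: the function name `monthly_error_budget_minutes` is hypothetical, and it assumes the budget is simply the unavailable fraction of all minutes in the calendar month. It shows that the 21.6 minute figure corresponds to a 30-day month and that the budget shifts slightly with month length.

```python
import calendar


def monthly_error_budget_minutes(
    year: int, month: int, availability_target: float = 0.9995
) -> float:
    """Return the allowed unavailability, in minutes, for one calendar month.

    Assumption: the budget is (1 - target) multiplied by the total number of
    minutes in the month. The 0.9995 default mirrors the 99.95% target.
    """
    days_in_month = calendar.monthrange(year, month)[1]
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (1 - availability_target)


# A 30-day month, a 31-day month, and a 28-day month.
print(round(monthly_error_budget_minutes(2020, 9), 2))   # 21.6
print(round(monthly_error_budget_minutes(2020, 10), 2))  # 22.32
print(round(monthly_error_budget_minutes(2021, 2), 2))   # 20.16
```

Results are rounded for display because the floating-point subtraction `1 - 0.9995` is not exact.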
The budget is set on the SaaS platform and is shared between stage and infrastructure teams. The Service Level Availability calculation methodology is covered in detail on the GitLab.com SLA page.
This budget does not take into account the number or complexity of the features owned by a team, existing product priorities, or the team size.
The current budget spend can be found on the general SLA dashboard.
Spent budget is the time (in minutes) during which user-facing services have experienced an error rate above the specified threshold or latency above the specified objectives for the service. The details on how the SLA is calculated can be found on the GitLab.com SLA page.
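As a rough illustration of how spent budget could be tallied, the sketch below counts the minutes in which a service misses its objectives. Everything here is an assumption made for the example: the `MinuteSample` type, the per-minute sampling, the threshold values, and the rule that a minute counts as spent when either the error ratio or the latency exceeds its limit. The actual methodology is the one described on the GitLab.com SLA page.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class MinuteSample:
    """One minute of (hypothetical) service measurements."""

    error_ratio: float      # fraction of failed requests, 0.0 to 1.0
    latency_seconds: float  # representative request latency for the minute


def spent_budget_minutes(
    samples: Iterable[MinuteSample],
    max_error_ratio: float,
    latency_objective_seconds: float,
) -> int:
    """Count the minutes that violate the error or the latency objective."""
    return sum(
        1
        for sample in samples
        if sample.error_ratio > max_error_ratio
        or sample.latency_seconds > latency_objective_seconds
    )


# Four sampled minutes: one healthy, one with too many errors,
# one too slow, and one healthy again.
samples = [
    MinuteSample(error_ratio=0.001, latency_seconds=0.4),
    MinuteSample(error_ratio=0.080, latency_seconds=0.5),
    MinuteSample(error_ratio=0.002, latency_seconds=3.2),
    MinuteSample(error_ratio=0.000, latency_seconds=0.3),
]

spent = spent_budget_minutes(
    samples, max_error_ratio=0.05, latency_objective_seconds=1.0
)
print(spent)  # 2
```

Subtracting this count from the monthly budget gives the remaining budget for the month.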
The budget spend is currently aggregated at the primary service level.
Details on what contributed to the budget spend can be found by examining the raised incidents and exploring the specific service dashboard (and its resources).
The Infrastructure department announces the budget spend at the end of each month in the relevant Engineering communication channels.
The budget spend is also announced in the weekly GitLab SaaS call.
This process complements the Engineering Architecture evolution process in that:
This initial error budget instrumentation, tracking, and reporting is meant to provide insight for future changes to the product prioritization process. We recognize that it is rarely easy to implement the changes that allow for a "switch to reliability focus" upon the depletion of an error budget. We will seek to learn from this initial implementation and continue to iterate so that error budgets become an effective influence on the continued reliability of GitLab.com SaaS services.
Notable items to be addressed in future iterations include: