Error Budgets consist of two components: Apdex and Error Rate.
Apdex Success Rate: The rate of operations that were successful and performed adequately.
The threshold for ‘performed adequately’ is different for each service.
This is currently a global threshold per service, but stage groups will soon have the ability to customise this by endpoint.
Error Rate: The rate of operations that had errors.
The developer documentation contains detailed steps for how to check where budget is being spent.
SLIs, SLOs and SLAs are Service Level Indicators, Objectives and Agreements.
An SLA is an agreement that one group has set with another regarding the level of service that will be provided. We have an SLA with our customers to achieve a certain level of availability each month. Currently this is 99.95%.
We use that agreement to set service-level objectives (SLOs). These are the standards we must meet each month in order to fulfil our agreements.
Finally, the SLI is the indicator we use to determine if we will meet our objective. It is the measure of how our systems are performing.
The SLA is the percentage of time that SLIs met their SLO.
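To make the 99.95% figure concrete, the following minimal sketch converts an availability target into the amount of time that may miss the objective. It assumes a 28-day window, borrowed from the Error Budget calculation described later in this section; the function name `allowed_downtime_minutes` is hypothetical and used only for illustration.

```python
def allowed_downtime_minutes(target: float, window_days: int = 28) -> float:
    """Minutes in the window that may miss the objective for a given target.

    The 28-day default is an assumption for illustration.
    """
    window_minutes = window_days * 24 * 60  # 40320 minutes in 28 days
    return window_minutes * (1 - target)


# A 99.95% target leaves roughly 20 minutes over 28 days.
print(round(allowed_downtime_minutes(0.9995), 2))  # 20.16
```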
In this section, we talk about Apdex for Web and API endpoints.
Every endpoint is associated with a feature category. We use this to help with incident response as well as to attribute error budget spend to the right stage group.
For every request, we store log information - including:
The highest granularity of data is stored in the logs, but we can't hold onto this data for long periods because we generate quite a lot of it. We retain log information for 7 days.
We also store metrics, at a lower granularity, for a longer period. One of the items we store a metric for is the response duration.
Because of size constraints, we can't store the exact duration for a request in the metrics. Instead, we use a histogram with buckets of [-Inf, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, +Inf], which are defined in the metrics catalog.
When a request takes 0.6s, it increments every bucket for which it was faster, that is, every bucket whose upper bound is at least 0.6s. So [+Inf, 5.0, 2.5, 1.0] would be incremented.
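The bucket behaviour described above can be sketched as follows. This is an illustration only, not the actual metrics code: `BUCKETS` and `incremented_buckets` are hypothetical names, and the -Inf bound is left out because no request duration can fall under it.

```python
import math

# Upper bounds in seconds, taken from the bucket list above (without -Inf).
BUCKETS = [0.1, 0.25, 0.5, 1.0, 2.5, 5.0, math.inf]


def incremented_buckets(duration_s, buckets=BUCKETS):
    """Return every bucket whose upper bound the request was faster than (or equal to)."""
    return [bound for bound in buckets if duration_s <= bound]


print(incremented_buckets(0.6))  # [1.0, 2.5, 5.0, inf]
```

Because each request increments all of the larger buckets too, the histogram can only say that a 0.6s request took between 0.5s and 1.0s; the exact duration is not recoverable from the metrics.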
We also store whether the request was faster or slower than the request duration threshold for that endpoint. Currently, all endpoints use the same duration threshold, but in a future iteration each endpoint will be able to specify its own threshold to use in this calculation.
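As a rough sketch of the threshold comparison, the snippet below keeps two counts per endpoint: all requests, and requests that met the threshold. The threshold value of 1.0s, the counter keys, the endpoint name, and the choice to count a request exactly at the threshold as a success are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical threshold in seconds; the real value is set per service.
DURATION_THRESHOLD_S = 1.0


def record_request(counters, endpoint, duration_s, threshold_s=DURATION_THRESHOLD_S):
    """Count the request, and count it as a success if it met the threshold."""
    counters[(endpoint, "total")] += 1
    if duration_s <= threshold_s:
        counters[(endpoint, "success")] += 1


counters = Counter()
record_request(counters, "GET /projects", 0.6)  # faster than the threshold
record_request(counters, "GET /projects", 1.8)  # slower than the threshold
print(counters[("GET /projects", "success")], counters[("GET /projects", "total")])  # 1 2
```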
We store metrics in Prometheus as counters.
Apdex
Error
Counters are separated using the following labels:
stages.yml (when imported).
When a person visits the stage group dashboards to see their Error Budget, we perform a calculation using the metrics we hold about how requests have been performing.
Using the formula on the previous page, we use the percentage of successful operations across all 28 days.
When changes are made to the thresholds used in this calculation, it takes 28 days for the effect to be seen because we are summing stored data for the whole period.
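The delay described above follows from summing over the whole window. The sketch below uses made-up daily counts and a hypothetical helper, `success_ratio`, to show how a threshold change only gradually moves the 28-day figure; it is not the formula used by the dashboards, which is defined elsewhere.

```python
def success_ratio(daily_counts):
    """Ratio of successful operations over the window.

    daily_counts is a list of (successful, total) pairs, one per day.
    Returns None when there was no traffic at all.
    """
    successes = sum(s for s, _ in daily_counts)
    total = sum(t for _, t in daily_counts)
    return successes / total if total else None


# Made-up numbers: 1000 requests per day; 990 meet the old threshold,
# 999 meet the new one.
old_day, new_day = (990, 1000), (999, 1000)

print(success_ratio([old_day] * 28))                 # 0.99
print(success_ratio([old_day] * 21 + [new_day] * 7)) # 0.99225, 7 days after the change
print(success_ratio([new_day] * 28))                 # 0.999, only after 28 full days
```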