Service Level Availability (SLA) is the percentage of time during which the platform is in an available state; the other possible states are degraded and outage.
Each of the user-facing services has two Service Level Indicators (SLIs): the Apdex score and the error rate.
The Apdex score is generally a measure of service performance (latency).
The error rate measures the percentage of requests that fail due to an error (usually a 5XX status code).
A service is considered available when:
An example of an available web service within a 5-minute period:
A service is unavailable if, for 5 minutes:
In other words, a service needs to simultaneously meet both of its SLO targets in order to be considered available. If either target is not met, the service is considered unavailable.
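The "meet both SLO targets" rule can be sketched as a simple boolean check. The threshold values below are placeholders for illustration, not the production SLOs:

```ruby
# A service is "available" in a window only if it meets BOTH SLO targets.
# These thresholds are illustrative placeholders, not GitLab.com's actual SLOs.
APDEX_SLO = 0.95       # minimum acceptable Apdex score
ERROR_RATE_SLO = 0.01  # maximum acceptable error rate (1%)

def available?(apdex:, error_rate:)
  apdex >= APDEX_SLO && error_rate <= ERROR_RATE_SLO
end

puts available?(apdex: 0.99, error_rate: 0.002) # both targets met    => true
puts available?(apdex: 0.99, error_rate: 0.05)  # error rate breached => false
```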
The availability score for a service is then calculated as the percentage of time that it is available. The availability scores of all services, combined, define the platform Service Level Availability (SLA). The SLA number indicates the availability of GitLab.com over a selected period of time.
For example, if a service becomes unavailable for a 10-minute period, the availability score will be:
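As an illustration of the arithmetic (the 30-day measurement window here is an assumption for the example, not a statement of the actual reporting period):

```ruby
# Availability = available time / total time, expressed as a percentage.
# Illustrative values: a 10-minute unavailability within a 30-day window.
total_minutes = 30 * 24 * 60  # 43_200 minutes in the window
unavailable_minutes = 10

availability = (total_minutes - unavailable_minutes) / total_minutes.to_f * 100
puts format('%.2f%%', availability) # => "99.98%"
```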
Finally, the availability metric for GitLab.com is calculated as a weighted average availability over the following services (weights in brackets):
web (5)
api (5)
git (5)
registry (1)
ci runners (0)
pages (0)
sidekiq (0)

The SLA score can be seen on the SLA dashboard, and the SLA target is set as an Infrastructure key performance indicator.
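The weighted average works out as follows. The per-service availability numbers in this sketch are made-up illustrative values; only the weights come from the list above:

```ruby
# Platform SLA = weighted average of per-service availability scores.
# Weights are from the service list; availability values are illustrative only.
weights = {
  'web' => 5, 'api' => 5, 'git' => 5, 'registry' => 1,
  'ci runners' => 0, 'pages' => 0, 'sidekiq' => 0
}
availability = {
  'web' => 99.95, 'api' => 99.99, 'git' => 99.98, 'registry' => 100.0,
  'ci runners' => 99.90, 'pages' => 100.0, 'sidekiq' => 100.0
}

weighted_sum = weights.sum { |service, weight| weight * availability[service] }
sla = weighted_sum / weights.values.sum.to_f
puts format('%.3f%%', sla)
```

Note that services with a weight of 0 are tracked but do not affect the platform number.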
More details on the definitions of outage and degradation are on the incident-management page.
Year Month | Availability | Comments |
---|---|---|
2023 May | TBD | |
2023 April | 99.98% | |
2023 March | 99.99% | |
2023 February | 99.98% | |
2023 January | 99.80% | |
2022 December | 100% | |
2022 November | 99.86% | |
2022 October | 100% | |
2022 September | 99.98% | |
2022 August | 99.92% | |
2022 July | 99.95% | |
2022 June | 99.96% | |
2022 May | 99.99% | |
2022 April | 99.98% | |
2022 March | 99.91% | |
2022 February | 99.87% | |
2022 January | 99.95% | |
2021 December | 99.96% | |
2021 November | 99.71% | |
2021 October | 99.98% | |
2021 September | 99.85% | |
2021 August | 99.86% | |
2021 July | 99.78% | |
2021 June | 99.84% | |
2021 May | 99.85% | does not include manual adjustment for PostgreSQL 12 Upgrade |
2021 April | 99.98% | |
2021 March | 99.34% | |
2021 February | 99.87% | |
2021 January | 99.88% | |
2020 December | 99.96% | |
2020 November | 99.90% | |
2020 October | 99.74% | |
2020 September | 99.95% | |
2020 August | 99.87% | |
2020 July | 99.81% | |
2020 June | 99.56% | |
2020 May | 99.58% | |
These videos provide examples of how to quickly identify failures, defects, and problems related to servers, networks, databases, security, and performance.
We use our Apdex-based measurements to report official availability (see above). However, we also have some public Pingdom tests for a representative view of the overall performance of GitLab.com. These are available at https://stats.pingdom.com. Specifically, this has the availability and latency of reaching
We collect data using InfluxDB and Prometheus, leveraging available exporters like the node or the postgresql exporters, and we build whatever else is necessary. The data is visualized in graphs and dashboards that are built using Grafana. There are two interfaces to track this, as described in more detail below.
We have three Prometheus clusters: main prometheus, prometheus-db, and prometheus-app. They provide an interface to query metrics using PromQL. Each Prometheus cluster collects a set of related metrics:
Thanos Query can be used to query metrics aggregated across Prometheus clusters.
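As a sketch of what such a query looks like, the standard Prometheus HTTP API (`/api/v1/query`) accepts a PromQL expression; the host and the query below are placeholder assumptions, not production addresses:

```ruby
require 'net/http'
require 'uri'
require 'json'

# Build an instant query against the Prometheus HTTP API (v1).
# The host is a placeholder -- substitute the real Thanos Query / Prometheus
# endpoint. The PromQL expression is a generic example.
uri = URI('http://thanos-query.example.com:9090/api/v1/query')
uri.query = URI.encode_www_form(query: 'up{job="node"}')

# Uncomment to run against a live endpoint:
# response = Net::HTTP.get(uri)
# puts JSON.parse(response).dig('data', 'result')
puts uri
```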
To learn how to set up a new graph or dashboard using Grafana, take a look at the following resources:
Need access to add a dashboard? Ask any team lead within the infrastructure team.
We have a set of monitoring dashboards designed for each stage group. These dashboards are designed to give everyone working in a feature category insight into how their code operates at GitLab.com scale. They are grouped per stage group to show the impact of feature/code changes, deployments, and feature-flag toggles.
The dashboards for stage groups are at a very early stage. All contributions are welcome. If you have any questions or suggestions, please submit an issue in the Scalability Team issue tracker.
Network, system, and application logs are processed, stored, and searched using the ELK stack. We use a managed Elasticsearch cluster on GCP, so our only interfaces to it are the APIs, Kibana, and the elastic.co web UI; because the cluster is managed by Elastic, they run the VMs and we do not have access to them. For monitoring system performance and metrics, Elastic's X-Pack monitoring metrics are used; they are sent to a dedicated monitoring cluster. Long-term, we intend to switch to Prometheus and Grafana as the preferred interface. For investigating errors and incidents, raw logs are available via Kibana. Logs are retained in Elasticsearch for 7 days.
Staging logs are available via a separate Kibana instance.
Kibana dashboards are used to monitor application activity, spam events, transient errors, system and network authentication events, security events, etc. Commonly used dashboards are the Abuse, SSH, and Rack Attack dashboards.
How we log our infrastructure is outlined in our runbook.
To learn how to create Kibana dashboards use the following resources:
Stackdriver Continuous Go Profiling can be used to gain a better understanding of how our Go services perform and consume resources (requires membership of the Google Workspace stackdriver-profiler-sg group).
It provides a simple UI on GCP with CPU and Memory usage data for:
For more information, there's a quick video tutorial available.
We also did a series of deep dives by pairing with the development teams for each project in this issue, which resulted in the following videos:
Blocks of Ruby code can be "instrumented" to measure performance.
Error tracking service.
Tool that helps you monitor, analyze and optimize your website speed and performance.
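The "instrumenting a block of Ruby code" idea above can be sketched generically. GitLab's own instrumentation helpers differ; the stdlib `Benchmark` call here is just a stand-in to show the shape of the technique:

```ruby
require 'benchmark'

# Generic illustration of measuring a block of Ruby code: wrap the work in a
# timing block and record the elapsed wall-clock time in seconds.
elapsed = Benchmark.realtime do
  100_000.times { |i| i * i } # the work being measured
end

puts format('block took %.4f s', elapsed)
```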