Service Level Availability (SLA) is the percentage of time during which the platform is in an available state. Other states are degraded and outage.
Each of the user-facing services has two Service Level Indicators (SLIs): the Apdex score and the Error rate.
The Apdex score is generally a measure of the service performance (latency).
The Error rate measures the percentage of requests that fail due to an error (usually, a 5XX status code).
A service is considered available when:
An example of an available web service, within a 1-minute window:
A service is unavailable if, for one minute:
In other words, a service needs to meet both of its SLO targets simultaneously in order to be considered available. If either target is not met, the service is considered unavailable.
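The rule above can be sketched in Python. The names (`APDEX_SLO`, `ERROR_RATE_SLO`, `MinuteSample`, `is_available`) and the threshold values are hypothetical and chosen only for illustration; actual SLO targets are defined per service. Whether the comparison is inclusive is also an assumption.

```python
from dataclasses import dataclass

# Hypothetical SLO targets for illustration; real targets are set per service.
APDEX_SLO = 0.995        # minimum acceptable Apdex score
ERROR_RATE_SLO = 0.0005  # maximum acceptable error ratio (0.05%)


@dataclass
class MinuteSample:
    """SLI readings for one service over a one-minute window."""
    apdex: float
    error_rate: float


def is_available(sample: MinuteSample) -> bool:
    """A service is available only if both SLO targets are met at once."""
    # Assumption: meeting a target exactly counts as meeting it.
    return sample.apdex >= APDEX_SLO and sample.error_rate <= ERROR_RATE_SLO


# Both targets met: available.
print(is_available(MinuteSample(apdex=0.999, error_rate=0.0001)))  # True
# Apdex is fine but the error rate is too high: unavailable.
print(is_available(MinuteSample(apdex=0.999, error_rate=0.01)))    # False
```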
The availability score for a service is then calculated as the percentage of time that it is available. The availability scores of all services combined define the platform Service Level Availability (SLA). The SLA number indicates the availability of GitLab.com for a selected period of time.
For example, if a service becomes unavailable for a 10-minute period, the availability score will be:
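As a sketch of this arithmetic, assume the score is measured over a 30-day window. The window length is an assumption made here for illustration, and `availability_score` is a hypothetical helper name.

```python
def availability_score(unavailable_minutes: float, window_minutes: float) -> float:
    """Percentage of the window during which the service was available."""
    if window_minutes <= 0:
        raise ValueError("window_minutes must be positive")
    return 100.0 * (window_minutes - unavailable_minutes) / window_minutes


# Assumed 30-day window: 30 * 24 * 60 = 43,200 minutes.
WINDOW_MINUTES = 30 * 24 * 60
score = availability_score(unavailable_minutes=10, window_minutes=WINDOW_MINUTES)
print(f"{score:.3f}%")  # 99.977%
```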
Finally, the availability metric for GitLab.com is calculated as a weighted average of availability over the following services (weights in brackets):
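The weighted average itself can be sketched as follows. The service names and weights in this example are placeholders, not the actual services or weights used for GitLab.com, and `weighted_availability` is a hypothetical helper.

```python
from typing import Mapping


def weighted_availability(scores: Mapping[str, float],
                          weights: Mapping[str, float]) -> float:
    """Weighted average of per-service availability scores (in percent)."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("total weight must be positive")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Placeholder services and weights, for illustration only.
scores = {"service_a": 99.9, "service_b": 100.0, "service_c": 99.0}
weights = {"service_a": 5, "service_b": 5, "service_c": 1}

# (99.9*5 + 100.0*5 + 99.0*1) / 11
print(f"{weighted_availability(scores, weights):.2f}%")  # 99.86%
```

A heavily weighted service pulls the platform number toward its own score, so an outage in a low-weight service moves the overall figure less than the same outage in a high-weight one.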
More details on the definitions of outage and degradation are on the incident-management page.
These videos provide examples of how to quickly identify failures, defects, and problems related to servers, networks, databases, security, and performance.
For a quick view of the availability and performance history of GitLab.com, we use https://stats.pingdom.com. Specifically, this has the availability and latency of reaching
We collect data using InfluxDB and Prometheus, leveraging available exporters like the node or the postgresql exporters, and we build whatever else is necessary. The data is visualized in graphs and dashboards that are built using Grafana. There are two interfaces to track this, as described in more detail below.
We have three Prometheus clusters: main prometheus, prometheus-db, and prometheus-app. They provide an interface to query metrics using PromQL. Each Prometheus cluster collects a set of related metrics:
Thanos Query can be used to query metrics aggregated across Prometheus clusters.
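As a minimal sketch of querying metrics programmatically, the following uses the Prometheus HTTP API instant-query endpoint (`/api/v1/query`), which Thanos Query also serves. The helper names and the host in the usage comment are hypothetical; real endpoints and any authentication requirements are not covered here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(payload: dict) -> dict:
    """Map each series' sorted label pairs to its float sample value."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected an instant vector, got {data['resultType']}")
    # Each sample is [unix_timestamp, "value-as-string"].
    return {
        tuple(sorted(series["metric"].items())): float(series["value"][1])
        for series in data["result"]
    }


def instant_query(base_url: str, promql: str, timeout: float = 10.0) -> dict:
    """Run a PromQL instant query against a Prometheus-compatible endpoint."""
    with urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        return parse_instant_vector(json.load(resp))


# Usage (hypothetical host; needs network access to a real endpoint):
#   instant_query("http://prometheus.example.internal:9090", 'up{job="node"}')
```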
To learn how to set up a new graph or dashboard using Grafana, take a look at the following resources:
Network, system, and application logs are processed, stored, and searched using the ELK stack. We use a managed Elasticsearch cluster on GCP, so our only interfaces to it are the APIs, Kibana, and the elastic.co web UI. Because the cluster is managed by Elastic, they run the VMs and we do not have access to them. For investigating errors and incidents, however, raw logs are available via Kibana. For monitoring system performance and metrics, Elastic's X-Pack monitoring metrics are used; they are sent to a dedicated monitoring cluster. Long-term, we intend to switch to Prometheus and Grafana as the preferred interface.
Staging logs are available via a separate Kibana instance.
Kibana dashboards are used to monitor application activity, spam events, transient errors, system and network authentication events, security events, etc. Commonly used dashboards are the Abuse, SSH, and Rack Attack dashboards.
How we log our infrastructure is outlined in our runbook.
To learn how to create Kibana dashboards, use the following resources:
To add a page to this dashboard, create a merge request to the gitlab-com/gitlab-profiler project.
Stackdriver Continuous Go Profiling can be used to better understand how our Go services perform and consume resources (requires membership of the GSuite
It provides a simple UI on GCP with CPU and Memory usage data for:
For more information, there's a quick video tutorial available.
Blocks of Ruby code can be "instrumented" to measure performance.