Monitoring of GitLab.com

GitLab.com Service Availability

Service Availability is the percentage of time during which the platform is in an available state; the other possible states are degraded and outage.

Each of the user-facing services has two Service Level Indicators (SLIs): the Apdex score and the error rate. The Apdex score is broadly a measure of service performance (latency). The error rate measures the percentage of requests that fail with an error (usually a 5XX status code).

A service is considered available when:

  1. The Apdex score of the service is above its Service Level Objective (SLO),
  2. AND the error rate is below its Service Level Objective (SLO).

An example of an available web service within a 5-minute period:

  • At least 90% of requests have a latency within their “satisfactory” threshold
  • AND, less than 0.5% of requests return a 5XX error status response.

A service is unavailable if, for 5 minutes:

  • Less than 90% of requests have a latency within their “satisfactory” threshold
  • OR, more than 0.5% of requests return a 5XX error status response.

In other words, a service must simultaneously meet both of its Service Level Objective (SLO) targets to be considered available. If either target is missed, the service is considered unavailable.
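
For illustration, here is a minimal sketch of this rule, using the illustrative web-service SLO targets from the example above (real targets vary per service):

```ruby
# Illustrative SLO targets from the web service example above.
APDEX_SLO = 0.90       # at least 90% of requests within the "satisfactory" latency threshold
ERROR_RATE_SLO = 0.005 # less than 0.5% of requests returning a 5XX status

# A service is available only when BOTH SLOs are met at the same time.
def available?(apdex:, error_rate:)
  apdex >= APDEX_SLO && error_rate < ERROR_RATE_SLO
end

available?(apdex: 0.95, error_rate: 0.001) # => true
available?(apdex: 0.95, error_rate: 0.020) # => false: the error-rate SLO is breached
```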

The availability score for a service is then calculated as the percentage of time that it is available. The availability scores of all services combined define the platform Service Availability. This number indicates the availability of GitLab.com over a selected period of time.

For example, if a service becomes unavailable for a 10-minute period, its availability score will be:

  • 99.90% for the week (10,070 minutes of availability out of 10,080 minutes in a week)
  • 99.97% for the month (43,190 minutes of availability out of 43,200 minutes in the month)

Finally, the availability metric for GitLab.com is calculated as a weighted average of the availability of the following services (weights in brackets); a service with a weight of 0 is tracked but does not affect the overall score. A sketch of the calculation follows the list:

  1. web (5)
  2. api (5)
  3. git (5)
  4. registry (1)
  5. ci runners (0)
  6. pages (0)
  7. sidekiq (0)
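
As an illustration only, the following sketch computes the weighted average. The helper name and the assumption that a service's per-period availability equals its available minutes divided by total minutes are ours for illustration, not an official implementation:

```ruby
# Weights as listed above; a weight of 0 means the service is tracked
# but does not currently influence the overall score.
WEIGHTS = {
  web: 5, api: 5, git: 5, registry: 1,
  ci_runners: 0, pages: 0, sidekiq: 0
}.freeze

# minutes_available: minutes each service was available in the period
# (43,200 minutes in a 30-day month).
def platform_availability(minutes_available, total_minutes: 43_200)
  weighted = WEIGHTS.sum { |svc, w| w * (minutes_available[svc].to_f / total_minutes) }
  100.0 * weighted / WEIGHTS.values.sum
end

# Example: web was down for 10 minutes; every other service was fully available.
uptime = WEIGHTS.keys.to_h { |svc| [svc, 43_200] }.merge(web: 43_190)
platform_availability(uptime).round(2) # => 99.99
```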

The availability score can be seen on the SLA dashboard, and the Service Availability target is set as an Infrastructure key performance indicator.

More details on the definitions of outage and degradation are on the incident-management page.

Historical Service Availability

| Year | Month | Availability | Comments |
|------|-------|--------------|----------|
| 2024 | February | 99.86% | |
| 2024 | January | 100% | |
| 2023 | December | 99.99% | |
| 2023 | November | 99.99% | |
| 2023 | October | 99.89% | Oct 30 Sev 1 incident |
| 2023 | September | 99.98% | |
| 2023 | August | 100% | |
| 2023 | July | 99.78% | Two severity 1 incidents contributed to ~94% of service disruption: 2023-07-07, 2023-07-14 |
| 2023 | June | 100% | |
| 2023 | May | 99.92% | |
| 2023 | April | 99.98% | |
| 2023 | March | 99.99% | |
| 2023 | February | 99.98% | |
| 2023 | January | 99.80% | |
| 2022 | December | 100% | |
| 2022 | November | 99.86% | |
| 2022 | October | 100% | |
| 2022 | September | 99.98% | |
| 2022 | August | 99.92% | |
| 2022 | July | 99.95% | |
| 2022 | June | 99.96% | |
| 2022 | May | 99.99% | |
| 2022 | April | 99.98% | |
| 2022 | March | 99.91% | |
| 2022 | February | 99.87% | |
| 2022 | January | 99.95% | |
| 2021 | December | 99.96% | |
| 2021 | November | 99.71% | |
| 2021 | October | 99.98% | |
| 2021 | September | 99.85% | |
| 2021 | August | 99.86% | |
| 2021 | July | 99.78% | |
| 2021 | June | 99.84% | |
| 2021 | May | 99.85% | Does not include manual adjustment for PostgreSQL 12 upgrade |
| 2021 | April | 99.98% | |
| 2021 | March | 99.34% | |
| 2021 | February | 99.87% | |
| 2021 | January | 99.88% | |
| 2020 | December | 99.96% | |
| 2020 | November | 99.90% | |
| 2020 | October | 99.74% | |
| 2020 | September | 99.95% | |
| 2020 | August | 99.87% | |
| 2020 | July | 99.81% | |
| 2020 | June | 99.56% | |
| 2020 | May | 99.58% | |

These videos provide examples of how to quickly identify failures, defects, and problems related to servers, networks, databases, security, and performance.

Monitoring

Pingdom Statistics

We use our Apdex-based measurements to report official availability (see above). However, we also have some public Pingdom tests that give a representative view of the overall performance of GitLab.com. These are available at https://stats.pingdom.com; specifically, they report the availability and latency of reaching GitLab.com.

Main Monitoring Dashboards

We collect data using InfluxDB and Prometheus, leveraging available exporters such as the node and PostgreSQL exporters, and building whatever else is necessary. The data is visualized in graphs and dashboards built using Grafana. There are two interfaces to track this, as described in more detail below.

Prometheus

We have three Prometheus clusters: main Prometheus, prometheus-db, and prometheus-app. They provide an interface for querying metrics using PromQL. Each Prometheus cluster collects a set of related metrics.

Thanos

Thanos Query can be used to query metrics aggregated across Prometheus clusters.
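
As a hypothetical sketch, metrics can also be pulled programmatically through the standard Prometheus HTTP API that Thanos Query exposes. The endpoint URL and metric names below are placeholders, not our real internal ones:

```ruby
require "json"
require "net/http"
require "uri"

# Placeholder endpoint: Thanos Query speaks the standard Prometheus HTTP API.
THANOS_URL = "https://thanos-query.example.com/api/v1/query"

# Illustrative PromQL: the ratio of 5XX responses over the last 5 minutes.
promql = 'sum(rate(http_requests_total{status=~"5.."}[5m])) ' \
         '/ sum(rate(http_requests_total[5m]))'

uri = URI(THANOS_URL)
uri.query = URI.encode_www_form(query: promql)
result = JSON.parse(Net::HTTP.get(uri))
result.dig("data", "result")&.each do |series|
  # Each instant-query result value is a [unix_timestamp, value] pair.
  puts "#{series['metric']} => #{series.dig('value', 1)}"
end
```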

Monitoring Infrastructure

  • A private GitLab account is required for access
  • Highly Available setup
  • Alerting feeds from this setup
  • Separated from the public setup for security and availability reasons; the two should have exactly the same graphs after we deprecate InfluxDB.

Adding Dashboards

To learn how to set up a new graph or dashboard using Grafana, take a look at the following resources:

Need access to add a dashboard? Ask any team lead within the infrastructure team.

Dashboards for stage groups

We have a set of monitoring dashboards for each stage group. These dashboards are designed to give everyone working in a feature category insight into how their code operates at GitLab.com scale. They are grouped per stage group to show the impact of feature/code changes, deployments, and feature-flag toggles.

  1. List of dashboards for each stage group (GitLab team members only).
  2. Guide to getting started with dashboards for stage groups.
  3. YouTube video introducing the stage group dashboards.

The dashboards for stage groups are at a very early stage. All contributions are welcome. If you have any questions or suggestions, please submit an issue in the Scalability Team issues tracker.

Selection of Useful Dashboards from the Monitoring

Blackbox Monitoring

  • GitLab Web Status: the front-end perspective of GitLab. Useful for understanding how GitLab.com looks from the user's perspective. Use this graph to quickly troubleshoot which part of GitLab is slow.
  • GitLab Git Status: the front-end perspective of GitLab SSH access.

Private Whitebox Monitor

  • Host Stats: useful for diving deep into a specific host to understand what is going on with it. Select a host from the dropdown at the top.
  • Business Stats: shows how many pushes, new repos, and CI builds are happening.
  • Daily overview: shows endpoints with their call counts and performance metrics. Useful for understanding what is slow in general.

Logs

Network, system, and application logs are processed, stored, and searched using the ELK stack. We use a managed Elasticsearch cluster on GCP; since it is managed by Elastic, they run the VMs and we do not have access to them, so our only interfaces are the APIs, Kibana, and the elastic.co web UI. For monitoring system performance and metrics, Elastic's X-Pack monitoring metrics are used; they are sent to a dedicated monitoring cluster. Long term, we intend to switch to Prometheus and Grafana as the preferred interface. For investigating errors and incidents, raw logs are available via Kibana. Logs are retained in Elasticsearch for 7 days.
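
For programmatic access, something like the following sketch works against the Elasticsearch search API. The endpoint, index pattern, and field names are placeholders; interactive investigation normally happens in Kibana:

```ruby
require "json"
require "net/http"
require "uri"

# Placeholder endpoint and index pattern, for illustration only.
uri = URI("https://logs.example.com:9243/rails-logs-*/_search")

# Illustrative query: 5XX responses logged in the last 15 minutes.
query = {
  query: {
    bool: {
      filter: [
        { range: { "@timestamp" => { gte: "now-15m" } } },
        { range: { "json.status" => { gte: 500 } } }
      ]
    }
  },
  size: 5
}

http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl = true
request = Net::HTTP::Post.new(uri, "Content-Type" => "application/json")
request.body = query.to_json
hits = JSON.parse(http.request(request).body).dig("hits", "hits")
hits&.each { |hit| puts hit["_source"] }
```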

Staging logs are available via a separate Kibana instance.

Kibana dashboards are used to monitor application activity, spam events, transient errors, system and network authentication events, security events, etc. Commonly used dashboards are the Abuse, SSH, and Rack Attack dashboards.

How we log our infrastructure is outlined in our runbook.

Adding dashboards

To learn how to create Kibana dashboards, use the following resources:

GitLab Profiling

Go services

Stackdriver Continuous Go Profiling can be used to better understand how our Go services perform and consume resources (requires membership in the Google Workspace stackdriver-profiler-sg group).

It provides a simple UI on GCP with CPU and memory usage data for:

For more information, there’s a quick video tutorial available.

We also did a series of deep dives, pairing with the development teams for each project in this issue; this resulted in the following videos:

Instrumenting Ruby to Monitor Performance

Blocks of Ruby code can be “instrumented” to measure performance.
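
As an illustrative sketch (not necessarily the exact helper GitLab uses), a standard Rails mechanism for this is ActiveSupport::Notifications:

```ruby
require "active_support/notifications"

# Subscriber: receives the event name, start/finish times, an id, and a payload.
ActiveSupport::Notifications.subscribe("expensive.operation") do |_name, start, finish, _id, payload|
  puts "#{payload[:label]} took #{((finish - start) * 1000).round(1)}ms"
end

# Wrap the block to be measured; the label is an arbitrary example payload.
ActiveSupport::Notifications.instrument("expensive.operation", label: "project import") do
  sleep(0.05) # stands in for the real work being measured
end
# prints something like: project import took 51.2ms
```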

Other Tools

Sentry

Error tracking service.

Setting sentry alerts for your group

Creating alert rules allows groups to monitor their features and catch issues proactively. This helps get issues fixed before they breach the error budget SLO, which in turn helps keep GitLab.com Service Availability high.

Steps for creating the alerts:

  1. Visit Sentry’s alert rules dashboard.
  2. Click the “Create Alert” button at the top right.
  3. Set the required conditions for your group’s feature categories.
  4. Create a new public Slack channel following the naming convention “g_group_name_alerts”, e.g. #g_govern_compliance_alerts.
  5. Select this channel for sending the alert notifications.
  6. Monitor the group for any new alerts and work towards resolving them.

Sitespeed.io

A tool that helps you monitor, analyze, and optimize your website’s speed and performance.

