We collect data using InfluxDB and Prometheus, leveraging available exporters such as the node and PostgreSQL exporters, and we build whatever else is necessary. The data is visualized in graphs and dashboards built with Grafana. There are two interfaces to track this, as described in more detail below.
We have three Prometheus clusters: main prometheus, prometheus-db, and prometheus-app. They provide an interface to query metrics using PromQL. Each Prometheus cluster collects a set of related metrics.
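As an illustration of querying metrics with PromQL, here is a minimal sketch that builds an instant-query request for the standard Prometheus HTTP API (`/api/v1/query`) and parses the JSON it returns. The host name, the `up{job="node"}` expression, and the sample response are assumptions made for the example, not values taken from our clusters.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url: str, promql: str) -> str:
    """Return the instant-query URL for a PromQL expression."""
    # /api/v1/query is the standard Prometheus instant-query endpoint.
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_vector(body: str) -> dict:
    """Map each series' `instance` label to its sample value."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    # Each vector sample is [unix_timestamp, "value-as-string"].
    return {
        series["metric"].get("instance", ""): float(series["value"][1])
        for series in payload["data"]["result"]
    }


# Hypothetical host and expression, for illustration only.
url = build_query_url("http://prometheus.example:9090/", 'up{job="node"}')

# Hand-written sample response in the shape the API returns.
sample = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"instance": "web-01:9100"}, "value": [1700000000, "1"]},
        ],
    },
})
values = parse_vector(sample)
```

To run this against a real cluster, fetch `url` with any HTTP client (for example `urllib.request.urlopen`) and pass the response body to `parse_vector`.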
The Grafana repo is where we keep an archive of InfluxDB dashboards created in Grafana. Use it to see details in the file structure, but note that the repo is truly an archive (nothing populates from it) and can be out of date.
Need access to add a dashboard? Ask any team lead within the infrastructure team.
Selection of Useful Dashboards from the Monitoring
GitLab Web Status: front-end perspective of GitLab. Useful to understand how GitLab.com looks from the user's perspective. Use this graph to quickly troubleshoot which part of GitLab is slow.
Daily overview: shows endpoints with their number of calls and performance metrics. Useful to understand what is slow in general.
Network, system, and application logs are processed, stored, and searched using the ELK stack. For monitoring system performance and metrics, Grafana is still the preferred interface. For investigating errors and incidents, however, raw logs are available via Kibana at https://log.gitlab.net.
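Searches run in Kibana are ultimately Elasticsearch queries. As a rough sketch of what an error investigation might look like in Elasticsearch Query DSL, the helper below builds a query body that matches a term in recent log entries. The field names `message` and `@timestamp` are common Logstash defaults and are assumed here; the function name and the example search term are hypothetical, and our actual indices may use different fields.

```python
import json


def build_error_query(term: str, since: str = "now-15m", size: int = 50) -> dict:
    """Build an Elasticsearch Query DSL body for recent log lines matching `term`."""
    return {
        "size": size,
        # Newest entries first.
        "sort": [{"@timestamp": {"order": "desc"}}],
        "query": {
            "bool": {
                # Full-text match on the log message (assumed field name).
                "must": [{"match": {"message": term}}],
                # Restrict to a recent window using Elasticsearch date math.
                "filter": [{"range": {"@timestamp": {"gte": since}}}],
            }
        },
    }


# Hypothetical search: timeouts logged in the last hour.
query = build_error_query("timeout", since="now-1h")
body = json.dumps(query, indent=2)
```

The resulting `body` is the kind of JSON that could be sent to an index's `_search` endpoint.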
Kibana dashboards are used to monitor application activity, spam events, transient errors, system and network authentication events, security events, etc. Commonly used dashboards are the Abuse, SSH, and Rack Attack dashboards.
How we log our infrastructure is outlined in our runbook.
To learn how to create Kibana dashboards, use the following resources: