We collect data using InfluxDB and Prometheus, leveraging available exporters such as the node and postgresql exporters, and we build whatever else is necessary. The data is visualized in graphs and dashboards built with Grafana. There are two interfaces for tracking this, described in more detail below.
We have three Prometheus clusters: main prometheus, prometheus-db, and prometheus-app. They provide an interface for querying metrics using PromQL. Each cluster collects a set of related metrics:
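As a concrete illustration, an instant PromQL query can be issued against any of these clusters through the Prometheus HTTP API. This is a minimal sketch: the endpoint name is a placeholder (the real cluster addresses are internal), and the response below is a canned sample in the API's documented shape, not live data.

```python
import json
from urllib.parse import urlencode

# Placeholder address; the real Prometheus cluster endpoints are internal.
PROMETHEUS_URL = "http://prometheus.example.internal:9090"

def build_query_url(promql, base=PROMETHEUS_URL):
    """Build an instant-query URL for the Prometheus HTTP API (/api/v1/query)."""
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"

def extract_values(api_response):
    """Pull (labels, value) pairs out of a Prometheus API JSON response."""
    body = json.loads(api_response)
    if body.get("status") != "success":
        raise ValueError("query failed")
    return [(r["metric"], float(r["value"][1])) for r in body["data"]["result"]]

# Example PromQL: 5-minute rate of idle CPU time, as exposed by the node exporter.
url = build_query_url('rate(node_cpu_seconds_total{mode="idle"}[5m])')

# A canned response in the shape the API documents, for illustration only.
sample = json.dumps({
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"instance": "web-01", "mode": "idle"}, "value": [1700000000, "0.93"]},
    ]},
})
print(extract_values(sample))  # [({'instance': 'web-01', 'mode': 'idle'}, 0.93)]
```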
Dashboards here sync automatically from the private monitoring infrastructure on every Chef client run. Do not change dashboards here; they will be overwritten.
Refer to this interface by default; only use the private one in cases where a public dashboard is not available.
By linking to the public dashboard by default, we ensure that we live our transparency value and that any interested users can easily view the same data we see.
Data gathered in InfluxDB is not currently available publicly, because InfluxDB does not scrub access tokens from the URLs that are measured.
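A minimal sketch of the kind of scrubbing that would be needed before such URLs could be published. The parameter names matched here (private_token, access_token, token) are assumptions for illustration, not an exhaustive list:

```python
import re

# Illustrative scrubber: token-style query parameters are replaced before a
# measured URL could be exposed. Parameter names here are assumptions.
TOKEN_PARAMS = re.compile(r"(?i)\b(private_token|access_token|token)=[^&\s]+")

def scrub_url(url):
    """Replace token values in a URL's query string with a placeholder."""
    return TOKEN_PARAMS.sub(lambda m: f"{m.group(1)}=[FILTERED]", url)

print(scrub_url("https://gitlab.com/api/v4/projects?private_token=abc123&page=2"))
# https://gitlab.com/api/v4/projects?private_token=[FILTERED]&page=2
```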
The Grafana repo holds an archive of the InfluxDB dashboards created in Grafana. Use it to see details of the file structure, but note that the repo is strictly an archive (nothing is populated from it) and can be out of date.
Need access to add a dashboard? Ask any team lead within the infrastructure team.
Selection of Useful Dashboards from the Monitoring Infrastructure
GitLab Web Status: a front-end perspective of GitLab. Useful for understanding how GitLab.com looks from the user's perspective. Use this dashboard to quickly identify which part of GitLab is slow.
Daily overview: shows endpoints with their call counts and performance metrics. Useful for understanding what is slow in general.
Network, system, and application logs are processed, stored, and searched using
the ELK stack. We use a managed Elasticsearch cluster on GCP, so our only
interfaces to it are the APIs, Kibana, and the elastic.co web UI. For
monitoring system performance and metrics, Elastic's X-Pack monitoring metrics
are used; they are sent to a dedicated monitoring cluster. Long-term we intend
to switch to Prometheus and Grafana as the preferred interface. Because the
cluster is managed by Elastic, they run the VMs and we do not have access to
them. However, for investigating errors and incidents, raw logs are available
via Kibana.
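For scripted access, the same Elasticsearch APIs that back Kibana accept query-DSL bodies. Below is a hedged sketch of such a body for pulling recent error-severity log lines; the field names (json.severity, @timestamp) are illustrative assumptions and may not match our production mappings.

```python
import json
from datetime import datetime, timedelta, timezone

# A sketch of a query-DSL body like the ones Kibana issues against the managed
# cluster. Field names (json.severity, @timestamp) are assumptions.
def build_error_query(hours=1, size=50):
    """Search body for ERROR-severity log lines from the last `hours` hours."""
    now = datetime.now(timezone.utc)
    return {
        "query": {"bool": {"filter": [
            {"term": {"json.severity": "ERROR"}},
            {"range": {"@timestamp": {
                "gte": (now - timedelta(hours=hours)).isoformat(),
                "lte": now.isoformat(),
            }}},
        ]}},
        "sort": [{"@timestamp": {"order": "desc"}}],
        "size": size,
    }

body = build_error_query(hours=4)
print(json.dumps(body)[:120])
```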
Staging logs are available via a separate Kibana instance.
Kibana dashboards are used to monitor application activity, spam events, transient errors, system and network authentication events, security events, and more. Commonly used dashboards are the Abuse, SSH, and Rack Attack dashboards.
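Behind a panel like the ones on these dashboards sits an Elasticsearch aggregation. A hedged sketch of what a "top offending IPs" aggregation might look like; the message match and field names (json.message, json.remote_ip) are assumptions for illustration:

```python
# Illustrative aggregation body for a "top offending IPs" style panel, such as
# the Rack Attack dashboard might use. Field names are assumptions.
def top_blocked_ips(size=10):
    return {
        "size": 0,  # we only want the aggregation buckets, not the raw hits
        "query": {"match": {"json.message": "Rack_Attack"}},
        "aggs": {"by_ip": {"terms": {"field": "json.remote_ip", "size": size}}},
    }

print(top_blocked_ips(5))
```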
One can view how we log our infrastructure as outlined by our
To learn how to create Kibana dashboards use the following resources: