Blueprint: Database bloat analysis
FY20-Q1 OKR: Implement 4 significant availability features to improve database reliability (autovacuum, masterless backups, pgrepack, replica on ZFS)
There is a considerable amount of bloat in the PostgreSQL database for GitLab.com. In short, we have accumulated 51% of index bloat and also see elevated levels of bloat in at least two tables. The total amount of bloat measured is at ~450 GB.
Here, we describe our approach to reduce the level of bloat and also maintain it over time.
The design is based on pg_repack, a PostgreSQL extension that must be available to the database instance and that ships with a command-line tool.
We are going to provide a Ruby implementation that interacts with pg_repack through its command line. This will ship with GitLab Community Edition and can be invoked through a rake task.
The command line is a thin wrapper around pg_repack functionality, provided for convenience. Additionally, it provides a bloat estimate and a recommendation on which data structures need to be repacked.
We are going to ship this solution with the product and also apply it for GitLab.com.
For GitLab.com, we plan to add continuous monitoring of bloat levels based on an estimation process.
The idea is to implement a command-line tool and integrate it into GitLab as a rake task. Rake just serves as an example integration here; we may decide to integrate it differently further down the road (but the idea is the same).
The primary user of the command line tool is an engineer of GitLab.com or an admin of the GitLab installation (for self-hosted). It can be used to determine bloat levels and execute maintenance in order to reduce bloat.
Action: estimate
  The estimate action provides an estimate of total table and index bloat. It also comes with a recommendation on which repack actions are needed in order to reduce bloat.

Action: repack
  The repack action executes maintenance for specific structures in the database. It can be called with a comma-separated list of tables and/or indexes. In case of failures, the internal state is cleaned up automatically.

  Options:
    --auto          Repack based on the estimate recommendation
    --indexes=LIST  Repack the given indexes
    --tables=LIST   Repack the given tables
    --dry           Print what would have been executed and exit
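A minimal sketch of how the option parsing for this command line could look in Ruby, using the standard library's OptionParser. The option names mirror the spec above; the function name and defaults are illustrative assumptions, not the actual implementation.

```ruby
require 'optparse'

# Hypothetical sketch: parse the repack action's options into a hash.
# Option names follow the spec above; everything else is illustrative.
def parse_repack_options(argv)
  options = { auto: false, dry: false, indexes: [], tables: [] }

  OptionParser.new do |o|
    o.on('--auto') { options[:auto] = true }                       # repack based on the estimate recommendation
    o.on('--dry') { options[:dry] = true }                         # print what would have been executed and exit
    o.on('--indexes=LIST', Array) { |v| options[:indexes] = v }    # comma-separated list of indexes
    o.on('--tables=LIST', Array) { |v| options[:tables] = v }      # comma-separated list of tables
  end.parse!(argv)

  options
end
```

OptionParser's built-in `Array` type handles the comma-separated lists from the spec without custom splitting.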
The recommendation in estimate is based on a bloat estimate (see below). It may exclude tables and indexes due to their small size or low level of bloat.
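The filtering step could be sketched as follows. The thresholds (1 GB minimum size, 20% minimum bloat) are assumptions chosen for the example, not values from this design.

```ruby
# Illustrative sketch of the estimate recommendation's filtering:
# skip relations that are too small or not bloated enough to be
# worth repacking. Thresholds are assumed values for the example.
MIN_SIZE_BYTES  = 1 * 1024**3  # assumed: ignore relations under 1 GB
MIN_BLOAT_RATIO = 0.2          # assumed: ignore bloat ratios under 20%

Relation = Struct.new(:name, :size_bytes, :bloat_ratio)

def repack_candidates(relations)
  relations
    .select { |r| r.size_bytes >= MIN_SIZE_BYTES && r.bloat_ratio >= MIN_BLOAT_RATIO }
    .map(&:name)
end
```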
Bloat can be measured exactly by comparing the sizes of indexes and tables before and after a VACUUM FULL. However, this is a costly operation that cannot be done online.
We estimate bloat using a standard method, for example with queries from pgsql-bloat-estimation. This provides only a heuristic, but it can be used to conclude which tables or indexes need maintenance.
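The core idea behind such estimators can be illustrated with a simplified calculation: compare the actual on-disk size with the minimal size the live rows would need, and treat the difference as bloat. This is a rough sketch only; real queries like those in pgsql-bloat-estimation also account for fillfactor, alignment padding, NULL bitmaps and index-specific layout.

```ruby
# Simplified sketch of the bloat-estimation heuristic. Constants are
# PostgreSQL defaults; the formula omits many corrections the real
# estimation queries apply.
PAGE_SIZE     = 8192  # default PostgreSQL block size in bytes
PAGE_OVERHEAD = 24    # approximate per-page header overhead in bytes

def estimated_bloat_ratio(actual_pages:, live_rows:, avg_row_bytes:)
  usable_per_page = PAGE_SIZE - PAGE_OVERHEAD
  rows_per_page   = usable_per_page / avg_row_bytes
  expected_pages  = (live_rows.to_f / rows_per_page).ceil

  # Fraction of the relation's pages that the live rows don't need.
  [(actual_pages - expected_pages).to_f / actual_pages, 0.0].max
end
```

A relation whose live rows fit in 100 pages but which occupies 200 pages on disk would be estimated at 50% bloat.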
Testing will be executed on staging.gitlab.com, ideally during a high-traffic period. For staging, this is the time when we do QA testing for a release. Alternatively, we can trigger automated QA testing to generate some traffic.
As a next step, we can execute the maintenance on other installs like dev.gitlab.org or ops.gitlab.net.
However, we don't have a test instance of GitLab with the same level of traffic as GitLab.com. For this reason, it may be worth executing the maintenance on a production-sized database while replaying workload with pgreplay.
For GitLab.com, we'll ship pg_repack through chef/apt. The rake task can be executed from the deploy node, since it needs to connect directly to the PostgreSQL database (and not go through pgbouncer).
For self-managed installations, we ship pg_repack along with the omnibus package. We are going to ship documentation along with the product on how to execute index maintenance through rake.
We don't want the maintenance to be executed automatically.
Once we gather more experience with executing the maintenance, we can consider running database maintenance automatically, for example after a version upgrade. This would allow us to improve database bloat levels for self-hosted installations as well. However, this is not strictly required for now.
We are going to implement monitoring of database bloat levels through Prometheus. Ideally, we report a bloat metric for indexes and tables separately, and perhaps even at per-index and per-table granularity. This is based on a bloat estimate (see Estimating Bloat) and not an exact measurement.
It is acceptable to run the bloat estimate only at low frequency (e.g. hourly or daily) if required.
Monitoring can be implemented through postgres_exporter or by pushing information to the prometheus push-gateway with a cronjob.
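For the push-gateway variant, the cronjob would push metrics in the Prometheus text exposition format. A sketch of building such a payload; the metric name and label set are assumptions for illustration, not an agreed naming scheme.

```ruby
# Hypothetical sketch: render per-relation bloat estimates as a
# Prometheus text-format payload that a cronjob could push to the
# push-gateway. Metric and label names are illustrative only.
def bloat_metrics_payload(estimates)
  lines = ['# TYPE pg_bloat_estimated_bytes gauge']
  estimates.each do |kind, name, bytes|
    lines << %(pg_bloat_estimated_bytes{kind="#{kind}",relation="#{name}"} #{bytes})
  end
  lines.join("\n") + "\n"
end
```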
We don't intend to add alerts on bloat levels but rather add means to observe and track how bloat accumulates and levels change over time (e.g. through Grafana).
As shown in the analysis, 95% of database bloat comes from regular indexes. Once we have the solution deployed to GitLab.com, we'll perform a three-step maintenance to reduce overall bloat levels to below 5%:
This will produce a bloat level well below 5% in total. With the state today, there is no need to repack other tables.
As discussed in the analysis, the majority of bloat we see today comes from regular indexes (not from primary keys and not from tables). In this situation, it would also be possible to just ship a standard migration that re-creates those indexes (see the analysis for details) without using pg_repack.
We are going down the road of shipping pg_repack because we anticipate a future need to repack tables and primary keys.