Continuing on the theme of improving the performance and reliability of GitLab.com, we are taking another step with our clusters of Postgres database nodes. These nodes have been running on Ubuntu 16.04 with Extended Security Maintenance patches, and it is now time to move them to a more current version. Usually, this kind of upgrade is a behind-the-scenes event, but an underlying technicality requires us to take a maintenance window to do the upgrade (more on that below).
We have been preparing for and practicing this upgrade and are now ready to schedule the window to do this work for GitLab.com.
When will the OS upgrade take place and what does this mean for users of GitLab.com?
This change is planned to take place on 2022-09-03 (Saturday) between 11:00 UTC and 14:00 UTC. We anticipate up to 180 minutes of service downtime (see reference issue). During this time, you will experience a complete service disruption of GitLab.com.
We are taking downtime to ensure that the application works as expected following the OS upgrade and to minimize the risk of any data integrity issues.
Background
GitLab.com's database architecture uses two Patroni/Postgres database clusters: main and CI. We recently completed a functional decomposition, so the CI cluster now stores the data generated by GitLab's CI features. Each Patroni cluster has a primary and multiple read-only replicas, and each cluster's Postgres database is ~18 TB, running on Ubuntu 16.04. During the scheduled change window, we will be switching over to our newly built Ubuntu 20.04 clusters.
The challenge
Ubuntu 18.10 introduced an updated version of glibc (2.28), which includes a major update to locale (collation) data. Because the sort order of text changes, Postgres indexes built under earlier versions of glibc can become corrupt after the upgrade. Since we are upgrading to Ubuntu 20.04, our indexes are affected by this. Therefore, during the downtime window scheduled for this work, we need to detect potentially corrupt indexes and have them reindexed before we enable production traffic again. We currently have the following index types and approximate counts:
 Index Type | # of Indexes
------------+--------------
 btree      |         4079
 gin        |          101
 gist       |            3
 hash       |            1
As you can appreciate, given the sheer number (and size) of these indexes, it would take far too long to reindex every single index during the scheduled downtime window, so we need to streamline the process.
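For context, a breakdown like the one above can be produced with a catalog query along these lines (a sketch only; it counts user indexes by access method and excludes the system schemas):

  -- Count user-facing indexes per access method (btree, gin, gist, hash).
  SELECT am.amname AS index_type,
         count(*)  AS index_count
  FROM pg_class c
  JOIN pg_am am        ON am.oid = c.relam
  JOIN pg_namespace n  ON n.oid = c.relnamespace
  WHERE c.relkind = 'i'
    AND n.nspname NOT IN ('pg_catalog', 'pg_toast')
  GROUP BY am.amname
  ORDER BY index_count DESC;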
Options to upgrade to Ubuntu 20.04 safely
There are a number of ways to deal with the problem of potentially corrupt indexes:
a. Reindex all indexes during the scheduled downtime window
b. Transport data to target 20.04 clusters in a logical (not binary) way, including:
- Backups/upgrades using pg_dump
- Logical replication
c. Use streaming replication from 16.04 to 20.04; during the downtime window, break replication, promote the 20.04 clusters, and then reindex the potentially corrupt indexes
It might be feasible for a small to medium-size Postgres installation to use option a or b; however, at GitLab.com's scale, either would require a much larger downtime window, and our aim is to reduce the impact on our customers as much as possible.
High-level approach for the OS upgrade
To perform an OS upgrade on our Patroni clusters, we use Postgres streaming replication to replicate data from our current Ubuntu 16.04 clusters to brand new Ubuntu 20.04 standby Patroni clusters. During the scheduled downtime window, we will stop all traffic to the current 16.04 clusters, promote the 20.04 clusters to primary, and demote the Ubuntu 16.04 clusters by reconfiguring them to act as standbys replicating from the new 20.04 primaries. We will then reindex all the identified potentially corrupt indexes and update DNS to point the application to the new 20.04 Patroni clusters before opening traffic to the public again.
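At the Postgres level, the promote/demote step boils down to something like the following. This is a simplified sketch of the core commands rather than our actual runbook: the hostname and user are placeholders, it assumes Postgres 12 or later, and in practice Patroni orchestrates most of this for us.

  -- On the new 20.04 cluster (currently a standby): promote it to primary.
  SELECT pg_promote(wait => true);

  -- On the old 16.04 cluster: reconfigure it as a standby of the new primary.
  -- (A standby.signal file must also exist in the data directory before restart.)
  ALTER SYSTEM SET primary_conninfo = 'host=new-2004-primary.example.internal user=replication';
  -- Then restart Postgres so it starts streaming from the 20.04 primary.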
Identifying potentially corrupt indexes and our approach to handling the reindexing for different types of indexes
B-Tree
We use the bt_index_parent_check function from the amcheck extension to identify potentially corrupt B-Tree indexes, and we will reindex them during the downtime window.
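As an illustration, checking a single index looks roughly like this (the index name is a placeholder; heapallindexed => true makes the check more thorough at the cost of runtime):

  -- Install the extension (shipped with Postgres contrib) and check one index.
  CREATE EXTENSION IF NOT EXISTS amcheck;
  SELECT bt_index_parent_check('public.some_btree_index'::regclass, heapallindexed => true);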
GiST and Hash
Since we do not have many GiST and Hash indexes, and reindexing them is a relatively quick operation, we will reindex them all during the downtime window.
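Because the site is already offline during the window, a plain (exclusively locking) REINDEX is sufficient for these few indexes, along these lines (index names are placeholders):

  -- Rebuild the handful of GiST and Hash indexes with a plain REINDEX.
  REINDEX INDEX some_gist_index;
  REINDEX INDEX some_hash_index;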
GIN
Currently, the production version of amcheck is limited to detecting potential corruption in B-Tree indexes only. Our GIN indexes are fairly large, and reindexing all of them during the scheduled downtime window would take a significant amount of time, which is not feasible as we cannot have the site unavailable to our customers for that long. We have collaborated closely with our database team to produce a list of business-critical GIN indexes to reindex during the downtime window; all other GIN indexes will be reindexed immediately after we open up traffic to the public, using the CONCURRENTLY option. Reindexing concurrently takes longer, but it allows normal operations to continue while the indexes are being rebuilt.
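Rebuilding the non-critical GIN indexes online after the window looks something like this (the index name is a placeholder; REINDEX ... CONCURRENTLY is available in Postgres 12 and later):

  -- Rebuild the index without taking locks that block reads and writes.
  -- Slower than a plain REINDEX, but the table stays fully usable.
  REINDEX INDEX CONCURRENTLY some_large_gin_index;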
Performance improvements
We started looking into options to improve the performance of reindexing (see reference issue). There were two areas where we needed to improve performance.
Identify potentially corrupt B-Tree indexes quickly
When we first started using amcheck to identify potentially corrupt indexes, the check was single-threaded, and it took just under five days to run the amcheck script against production data. After a few iterations, our amcheck script now runs a separate background worker process for each index, giving us roughly a 96x speedup when we run it on a VM with 96 CPU cores; the total runtime is then bounded by the time it takes to check the largest index. The script can be customized to skip or include specific sets of tables/indexes, and we can choose the number of parallel worker processes based on the number of CPU cores available on the VM running amcheck. With the improved speed, we can run the amcheck script on a copy of production data a day or two before the scheduled OS upgrade downtime window.
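The fan-out idea can be sketched with a catalog query that emits one check statement per B-Tree index; a driver (not shown here, and not our actual script) then distributes those statements across parallel database sessions, one per CPU core:

  -- Emit one bt_index_parent_check call per user B-Tree index.
  -- A wrapper script can split this list across N parallel sessions.
  SELECT format('SELECT bt_index_parent_check(%L::regclass, true);',
                n.nspname || '.' || c.relname) AS check_statement
  FROM pg_class c
  JOIN pg_am am        ON am.oid = c.relam
  JOIN pg_namespace n  ON n.oid = c.relnamespace
  WHERE c.relkind = 'i'
    AND am.amname = 'btree'
    AND n.nspname NOT IN ('pg_catalog', 'pg_toast');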
Improve reindexing speed to reduce the downtime
Our initial reindexing tests ran sequentially with the default Postgres parameters. We have since tested reindexing with tuned Postgres parameters and parallelized the reindex process, and we can now complete the reindexing in less than half the time it originally took.
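The tuning is along these lines (illustrative values only, not our production settings; max_parallel_maintenance_workers parallelizes an individual B-Tree build, and separate sessions can rebuild different indexes at the same time):

  -- Give each rebuild generous memory and parallel workers for the sort phase.
  SET maintenance_work_mem = '16GB';
  SET max_parallel_maintenance_workers = 8;
  REINDEX INDEX some_large_btree_index;  -- placeholder index name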
Reading material
For more information, please see the following links: