How we upgraded PostgreSQL at GitLab.com

Jose Finotto · Sep 11, 2020 · 15 min read

We teamed up with OnGres to perform a major version upgrade of GitLab.com's main Postgres cluster from version 9.6 to 11 back in May 2020. We upgraded it during a maintenance window, and it all went according to plan. We unpack everything that was involved – from planning and testing to full process automation – to achieve a near-perfect execution of the PostgreSQL upgrade. The full operation was recorded, and you can watch it on GitLab Unfiltered.

The biggest challenge was performing a major upgrade of the entire fleet through an orchestrated pg_upgrade, while keeping a rollback plan ready to optimize our Recovery Time Objective (RTO) and keeping a 12-node cluster's 6TB of data consistent, serving 300,000 aggregated transactions per second to around six million users.

The best way to resolve an engineering challenge is to follow the blueprints and design docs. In the process of creating the blueprint, we defined the problem we were attempting to solve, evaluated the most suitable solutions, and considered the pros and cons of each. Here is a link to the blueprint from the project.

After the blueprint comes the design process. The design doc details the implementation, explaining the steps and requirements involved in executing it. The design doc from the project is linked here.

Why we upgraded PostgreSQL

We made a business decision to discontinue support for PostgreSQL 10.0 in GitLab 13.0, and PostgreSQL 9.6 reaches end of life (EOL) in November 2021, so we needed to take action.

Here are some of the main differences in features between PostgreSQL versions 9.6 and 11:

The environment and architecture

The PostgreSQL cluster consisted of 12 n1-highmem-96 GCP instances (96 CPU cores and 614GB of RAM each) for OLTP and asynchronous pipeline purposes, plus two BI nodes with different specs. The cluster's high availability is managed and configured through Patroni, which maintains consistent leader election through a Consul cluster, with all replicas using asynchronous streaming replication with replication slots and WAL shipping to a GCS storage bucket. Patroni constantly gathers critical information about the cluster, such as leader detection and node availability. It relies on key Consul features such as the DNS service, which in turn updates the PgBouncer endpoints, keeping separate paths for read-write and read-only traffic.
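
To make the DNS piece of this concrete, here is a minimal sketch (not our production tooling) of how the read-write and read-only endpoints behind PgBouncer could be resolved through a local Consul agent. The cluster scope name, agent address, and port are assumptions.

```python
# A minimal sketch, not production tooling: resolve the read-write and
# read-only endpoints through a local Consul agent's DNS interface.
# The cluster scope name ("pg-cluster") and the agent address are assumptions.
import dns.resolver  # pip install dnspython

resolver = dns.resolver.Resolver(configure=False)
resolver.nameservers = ["127.0.0.1"]  # local Consul agent (assumption)
resolver.port = 8600                  # Consul's default DNS port

endpoints = {
    "read-write": "master.pg-cluster.service.consul",
    "read-only": "replica.pg-cluster.service.consul",
}

for role, name in endpoints.items():
    try:
        answers = resolver.resolve(name, "A")
        print(f"{role}: {name} -> {', '.join(r.to_text() for r in answers)}")
    except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
        print(f"{role}: {name} -> no records")
```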

GitLab.com architecture

For HA purposes, two of the replicas are kept out of the read-only server pool that is used by the API and served through Consul DNS. After several enhancements to GitLab's architecture, we were able to downscale the fleet to seven nodes.

Furthermore, the entire cluster handles a weekly average of approximately 181,000 transactions per second. As the image below indicates, traffic increases on Monday and maintains that throughput through the week until Friday/Saturday. The traffic data was critical for setting a proper maintenance window that would impact the fewest users.

Number of connections at GitLab.com

The fleet is reaching 250,000 transactions per second in the busiest hours of the day.

The number of commits at GitLab.com

It is also handling spikes of 300,000 transactions per second. GitLab.com is reaching 60,000 connections per second.

Our upgrade requirements

We established a number of requirements before proceeding with the upgrade in production.

The project

To accomplish a smooth execution in production, the project had the following phases:

Phase one: Develop automation in an isolated environment

Phase two: Integrate development with our configuration management in staging

Phase three: Test the upgrade end-to-end in staging

We executed seven tests in staging in total, iterating to perfect the team's execution.

Phase four: Upgrade in production

In production, the steps were very similar to staging, and our plan was to have eight nodes migrated and four left behind as a backup:

The rollback plan would only be invoked in case of problems with database consistency or errors in the QA tests. The steps included:

All the steps of the upgrade are detailed in the template used to execute the project.

How pg_upgrade works

The pg_upgrade tool allows us to upgrade PostgreSQL data files to a later major version without the dump/reload strategy, which would require more downtime.

As explained in the official PostgreSQL documentation, pg_upgrade avoids the dump/restore method of upgrading to a new PostgreSQL version. There are some important details to review before using this tool. Major PostgreSQL releases add new features that often change the layout of the system tables, but the internal data storage format rarely changes. If a major release did change the data storage format, pg_upgrade could not be used, so we had to verify which changes were included between the major versions.
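
As a rough illustration of that verification step, a pg_upgrade compatibility check could be wrapped like the sketch below. The binary and data directory paths are hypothetical, not GitLab.com's actual layout.

```python
# A hedged sketch of the compatibility check: run pg_upgrade with --check so it
# only reports problems and changes nothing. All paths are hypothetical.
import subprocess

OLD_BIN = "/usr/lib/postgresql/9.6/bin"
NEW_BIN = "/usr/lib/postgresql/11/bin"
OLD_DATA = "/var/opt/postgresql/data_9.6"   # placeholder
NEW_DATA = "/var/opt/postgresql/data_11"    # placeholder

subprocess.run(
    [
        f"{NEW_BIN}/pg_upgrade",
        "--old-bindir", OLD_BIN,
        "--new-bindir", NEW_BIN,
        "--old-datadir", OLD_DATA,
        "--new-datadir", NEW_DATA,
        "--check",  # report incompatibilities only; do not modify anything
    ],
    check=True,  # raise if pg_upgrade reports a failure
)
```

Once the check passes, the real run drops `--check` (and, in our case, added `--link`, as explained below).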

It is important that any external modules are also binary-compatible, though this cannot be checked by pg_upgrade. For the GitLab upgrade, we uninstalled views/extensions such as those used by postgres_exporter before the upgrade and recreated them afterwards (with slight modifications for compatibility reasons).
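
For example, dropping and later recreating an extension around the upgrade could look like the hedged psycopg2 sketch below. The DSN and the pg_stat_statements example are placeholders; the views used by postgres_exporter were handled with their own (slightly adapted) DDL.

```python
# Sketch of dropping an extension before pg_upgrade and recreating it after.
# DSN and extension name are placeholders, not GitLab.com's actual setup.
import psycopg2
from psycopg2 import sql

DSN = "host=primary.example.com dbname=gitlabhq_production user=postgres"  # placeholder

def set_extension(present: bool, name: str = "pg_stat_statements") -> None:
    """Create or drop the named extension on the cluster behind DSN."""
    tmpl = "CREATE EXTENSION IF NOT EXISTS {}" if present else "DROP EXTENSION IF EXISTS {}"
    conn = psycopg2.connect(DSN)
    try:
        with conn, conn.cursor() as cur:  # `with conn` commits the transaction
            cur.execute(sql.SQL(tmpl).format(sql.Identifier(name)))
    finally:
        conn.close()

set_extension(False)  # before the upgrade
# ... pg_upgrade runs here ...
set_extension(True)   # after the upgrade, on the new cluster
```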

Before performing the upgrade, the new version's binaries have to be installed. The new PostgreSQL and extension binaries were installed on the set of hosts that were to be upgraded.

There are several options when using pg_upgrade. We chose pg_upgrade's link mode on the leader node because of our narrow, two-hour maintenance window: this mode avoids copying the 6TB of data files by creating hard links to them instead. The drawback is that the old data cluster cannot be rolled back to 9.6, so we provided a rollback path via the replicas kept on 9.6, with GCP snapshots as a secondary option. Rebuilding the replicas from scratch was not an option either, so we used rsync in incremental mode to upgrade them. pg_upgrade's documentation says: "From a directory on the primary server that is above the old and new database cluster directories, run this on the primary for each standby server".

The Ansible playbook implemented this step with a task that, for each replica, triggered the rsync command on the leader node from the parent directory of both the old and new data directories.
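
Based on the documented rsync invocation, the per-replica task could boil down to something like the following sketch. Directory names, the parent path, and the replica hosts are placeholders, and every cluster must be stopped while this runs.

```python
# Sketch of the documented standby-sync step: run rsync from the directory
# above both data directories on the primary, once per replica.
# All paths and hostnames below are placeholders.
import subprocess

PARENT_DIR = "/var/opt/postgresql"   # hypothetical dir above both data directories
OLD_DATA = "data_9.6"                # hypothetical, relative to PARENT_DIR
NEW_DATA = "data_11"                 # hypothetical
REPLICAS = ["replica-01.example.com", "replica-02.example.com"]  # placeholders

for host in REPLICAS:
    subprocess.run(
        [
            "rsync",
            "--archive",
            "--delete",
            "--hard-links",       # preserve hard links created by pg_upgrade --link
            "--size-only",        # skip files whose sizes already match
            "--no-inc-recursive",
            OLD_DATA,
            NEW_DATA,
            f"{host}:{PARENT_DIR}",
        ],
        cwd=PARENT_DIR,  # "from a directory ... above the old and new ... directories"
        check=True,
    )
```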

Regression testing benchmarks

Any database migration or upgrade requires regression testing before the final production upgrade. For the team, database testing was a key step in this process: we executed performance tests based on the production query load captured in the pg_stat_statements table. These tests were executed against the same dataset, once on version 9.6 and again on version 11. The process was captured in the following public issues:

Finally, based on OnGres' work on this benchmark, GitLab will follow up with a new benchmark test in the future:
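
To give a concrete, if simplified, picture of what capturing such a workload can look like, here is a hedged Python sketch that pulls the heaviest statements from pg_stat_statements into a CSV for later replay. The DSN and the row limit are placeholders, and the column names match PostgreSQL 9.6/11.

```python
# A simplified sketch of capturing the heaviest production statements from
# pg_stat_statements for benchmark replay. DSN and limit are placeholders;
# total_time/mean_time were renamed in PostgreSQL 13, but match 9.6/11 here.
import csv
import psycopg2

DSN = "host=primary.example.com dbname=gitlabhq_production user=benchmark"  # placeholder

QUERY = """
    SELECT queryid, calls, total_time, mean_time, rows, query
      FROM pg_stat_statements
  ORDER BY total_time DESC
     LIMIT 500
"""

conn = psycopg2.connect(DSN)
try:
    with conn.cursor() as cur:
        cur.execute(QUERY)
        with open("workload_sample.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow([col[0] for col in cur.description])  # column names as header
            writer.writerows(cur.fetchall())
finally:
    conn.close()
```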

The upgrade process: automate everything

During the upgrade project, the teams had a strong commitment to Infrastructure as Code (IaC) and automation: all the processes had to be fully automated to keep human error to a minimum during the maintenance window. All the steps for the pg_upgrade execution are detailed in this GitLab pg_upgrade template issue.

The GitLab.com environment is managed by Terraform and Chef. All the automation for the upgrade was scripted via Ansible 2.9 playbooks and roles, and we used two ansible-playbooks:

One ansible-playbook controlled the traffic and the applications:

The second ansible-playbook executed the upgrade process:

The playbook was run interactively, task by task, giving the operator the ability to skip or pause at any point in the execution. Every step was reviewed by all the teams that participated in the staging tests and iterations for the upgrade. The staging environment allowed us to rehearse and find issues using the same procedure we planned to use in production. After executing and iterating on the automated process in staging, we achieved a quasi-flawless upgrade from PostgreSQL 9.6 to version 11.

To complete the release, the GitLab QA team reported the errors that occurred in some of the tests. The reference for this work is in this issue note.

PostgreSQL pre-upgrade steps

The first part of the process was the "pre-upgrade" section, which deals with the instances reserved for rollback purposes. We did the corresponding analysis to ensure that the new cluster could start with eight of the fleet's 12 instances without losing throughput, reserving four instances for a potential rollback scenario in which they could be brought up as a 9.6 cluster via standard Patroni cluster synchronization.

In this phase it was also necessary to stop the Postgres-dependent services, such as PgBouncer, the Chef client, and Patroni.

Before proceeding with the upgrade itself, we had to signal Patroni to avoid any spurious leader election, take a consistent backup through GCP snapshots (using the corresponding low-level backup API), and apply the new settings via a Chef run.
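
A minimal sketch of that fencing sequence follows, assuming hypothetical paths and connection details; it stands in for, rather than reproduces, the actual playbook tasks.

```python
# Pause Patroni so no automatic failover fires mid-upgrade, then start a
# low-level (non-exclusive) backup so the GCP disk snapshots are consistent.
# Config path and DSN are placeholders.
import subprocess
import psycopg2

PATRONI_CONFIG = "/etc/patroni/patroni.yml"  # placeholder path
DSN = "host=primary.example.com dbname=postgres user=postgres"  # placeholder

# 1. Put Patroni into maintenance mode (no automatic leader election).
subprocess.run(["patronictl", "-c", PATRONI_CONFIG, "pause"], check=True)

# 2. Non-exclusive backup (PostgreSQL 9.6+): the session must stay open
#    until pg_stop_backup() runs after the snapshots complete.
conn = psycopg2.connect(DSN)
conn.autocommit = True
try:
    with conn.cursor() as cur:
        cur.execute("SELECT pg_start_backup('pre-upgrade-snapshot', true, false)")
        # ... trigger the GCP disk snapshots here (gcloud / API calls omitted) ...
        cur.execute("SELECT pg_stop_backup(false)")
finally:
    conn.close()
```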

The PostgreSQL upgrade phase

First, we stopped all the nodes.

We executed these checks:

Once the primary node's data was upgraded, an rsync process was triggered to sync the data to the replicas. After the upgrade was done, the Patroni service was started and all the replicas caught up easily with the new cluster configuration.
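
One simple way to confirm that the replicas have caught up is to look at pg_stat_replication on the new leader. The sketch below assumes a placeholder DSN and uses the PostgreSQL 11 function and column names.

```python
# A minimal sketch: check streaming replicas and their replay lag on the new
# PostgreSQL 11 leader via pg_stat_replication. DSN is a placeholder.
import psycopg2

DSN = "host=primary.example.com dbname=postgres user=monitoring"  # placeholder

SQL = """
    SELECT application_name,
           state,
           pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
      FROM pg_stat_replication
  ORDER BY application_name
"""

conn = psycopg2.connect(DSN)
try:
    with conn.cursor() as cur:
        cur.execute(SQL)
        for name, state, lag_bytes in cur.fetchall():
            print(f"{name}: state={state}, replay lag={lag_bytes} bytes")
finally:
    conn.close()
```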

The binaries were installed by Chef, and the setup of the new cluster version was defined in the same MR that installed the extensions used by the GitLab.com database.

The last stage involved resuming the traffic, running an initial vacuum, and finally starting the PgBouncer and Chef client services.
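
pg_upgrade does not carry planner statistics over, so the common post-upgrade advice is to rebuild them with vacuumdb before taking full traffic again. The sketch below illustrates that step; the binary path is an assumption, and it stands in for whatever exact vacuum step was run in production.

```python
# A hedged sketch of a post-upgrade statistics rebuild with vacuumdb.
# The binary path is a placeholder.
import subprocess

NEW_BIN = "/usr/lib/postgresql/11/bin"  # placeholder

subprocess.run(
    [
        f"{NEW_BIN}/vacuumdb",
        "--all",                 # every database in the cluster
        "--analyze-in-stages",   # coarse statistics first, refined in later passes
    ],
    check=True,
)
```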

The migration day

Finally, fully prepared to perform the production upgrade, the team met that Sunday (nighttime for some, early morning for others) at 08:45 UTC. The service would be down for a maximum of two hours. Once the last announcements were sent, the engineering team was given permission to start the procedure.

The upgrade process began by stopping the traffic and related services to keep users from reaching the site.

The graph below shows the traffic and HTTP stats of the service before the upgrade, during the maintenance period (the "gap" in the graphs) and after, when the traffic was resumed.

Graphs of the traffic on GitLab.com before and after the upgrade maintenance

The total elapsed time for the entire job was four hours, but it required only two hours of downtime.

It's on video

We recorded the full PostgreSQL upgrade and posted it to GitLab Unfiltered. Warm up the popcorn 🍿

Thanks to Alvaro Hernandez and Sergio Ostapowicz for co-authoring this blog post, as well as to the OnGres team for their contributions and for performing the upgrade with the GitLab team.

References

The issues used to coordinate this project are public:

Cover image by Tim Mossholder on Unsplash
