Epic: PostgreSQL Upgrade
The goal of this document is to explain the feasible alternatives for executing the PostgreSQL database upgrade.
Upgrading the database will allow us to use new features that make the work of developers and administrators easier. We can consider the following:
Initially, we consider version 11.7 the best candidate, based on business requirements. These are the links to the list of features and bug fixes:
Additionally, due to a business decision, starting with GitLab 13.0 we will no longer support PostgreSQL version 10.
Nevertheless, we are aware that the end-of-life date of the current version is relatively near: November 11, 2021.
The aim of this section is to describe the main options for the migration.
Enabling the checksum feature allows the cluster to verify every data page it uses and determine whether the page is reliable. A reliable page is one that has not been corrupted while writing data to disk or reading it back. This adds approximately 2% of load overhead.
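As a minimal sketch of how this feature is enabled (the data directory path is a placeholder), data checksums are turned on when a new cluster is initialized, and a running cluster can report whether they are active:

    # Initialize the new cluster with data checksums enabled
    initdb --data-checksums -D /var/lib/postgresql/11/main

    # On a running cluster, verify that checksums are active (returns "on" or "off")
    psql -Atc "SHOW data_checksums;"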
This step could require long downtime on our current cluster to verify that all pages are reliable. Setting up a new cluster with the checksum feature enabled, and using logical replication to transfer all the data to it, could be the best solution to eliminate this downtime. Once it has received all the data, this new cluster would be ready for a failover operation.
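A sketch of this logical-replication approach, assuming placeholder host names (old-primary, new-primary), a placeholder replication user, the gitlabhq_production database, and that the schema has already been loaded on the new checksum-enabled cluster:

    # On the current (source) cluster: publish all tables
    psql -h old-primary -d gitlabhq_production -c \
      "CREATE PUBLICATION pg_upgrade_pub FOR ALL TABLES;"

    # On the new cluster: subscribe, which starts the initial copy and keeps
    # the data in sync until the failover
    psql -h new-primary -d gitlabhq_production -c \
      "CREATE SUBSCRIPTION pg_upgrade_sub
         CONNECTION 'host=old-primary dbname=gitlabhq_production user=replicator'
         PUBLICATION pg_upgrade_pub;"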
Respecting our value of preferring to avoid data loss over availability, this action is necessary for the database cluster.
One option could be to execute pg_upgrade and then activate the database checksums. The downtime could take hours, so we need to check in staging how long both steps take to get a better estimate.
Another option would be to execute only pg_upgrade. The downtime would be short, but we still need to check in staging how long it takes to get a better estimate.
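For both options, the pg_upgrade invocation itself would look roughly like the sketch below; the binary and data directory paths, and the old version (9.6), are assumptions to be adapted to our installation:

    # Dry run first: check both clusters for incompatibilities without changing anything
    pg_upgrade \
      --old-bindir=/usr/lib/postgresql/9.6/bin \
      --new-bindir=/usr/lib/postgresql/11/bin \
      --old-datadir=/var/lib/postgresql/9.6/main \
      --new-datadir=/var/lib/postgresql/11/main \
      --check

    # For the real run, drop --check; adding --link uses hard links instead of
    # copying data files, which keeps the downtime short but leaves the old
    # cluster unusable once the new one has been started.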
The pg_upgrade process consists of the following parts:
Starting maintenance mode and stopping the incoming traffic. In this step, traffic is stopped at Cloudflare and the HAProxy instances, drastically reducing the number of database transactions so that the consistent database shutdown required for the PostgreSQL upgrade is reached faster.
Stopping all the components that have access to the database cluster, and draining database connections from Sidekiq, web-api, Puma, and our Kubernetes hosts. During this step, all database connections will be terminated and we will reach the optimal scenario to execute the upgrade process (see the connection check sketched after this list).
Executing the automated PostgreSQL upgrade process, which consists of the following steps:
Starting the components that have access to the database (Sidekiq, web-api, Puma, and our Kubernetes hosts), and checking logs and alerts to verify that they start successfully.
Executing the QA tests on the upgraded database to ensure data quality and that all application functionality works properly. The SRE and DBRE teams will keep monitoring the database, application logs, and alerts to verify that all components are working properly.
Ending the maintenance mode and restoring the incoming traffic.
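As an illustration of the connection-draining check mentioned in the second step (the database name is a placeholder), the remaining sessions can be listed and any stragglers terminated before the cluster is shut down:

    # List remaining client connections, excluding this session
    psql -d gitlabhq_production -c \
      "SELECT pid, usename, application_name, state
         FROM pg_stat_activity
        WHERE pid <> pg_backend_pid();"

    # Terminate any stragglers so the database can shut down consistently
    psql -d gitlabhq_production -c \
      "SELECT pg_terminate_backend(pid)
         FROM pg_stat_activity
        WHERE pid <> pg_backend_pid();"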
After weeks of evaluating the solutions, we decided to choose pg_upgrade for the following reasons:
Since it is not possible to run streaming replication between different major versions of PostgreSQL, and considering the goal of minimal downtime, solutions that use logical replication are needed.
Essentially, it is a project written in Python to execute logical migrations, and it has not been updated recently.
It is an old version of a project whose latest version is under a commercial license, developed by 2ndQuadrant.
Egres is a logical replication solution developed by Ongres.
Executing the "Enable external merge request diff storage to a different data source" change would be a prerequisite. The result would release 33% of the database size and would significantly reduce the time cost and downtime of all the options above.
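As a hedged sketch of that prerequisite on an Omnibus installation (the exact keys and storage path should be confirmed against the GitLab documentation for the installed version), external merge request diff storage is enabled in /etc/gitlab/gitlab.rb and applied with a reconfigure:

    # /etc/gitlab/gitlab.rb -- store merge request diffs outside the database
    gitlab_rails['external_diffs_enabled'] = true
    gitlab_rails['external_diffs_storage_path'] = "/var/opt/gitlab/gitlab-rails/shared/external-diffs"

    # Apply the change
    sudo gitlab-ctl reconfigure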