There are 3 interesting parts of the migration, each with a fairly simple solution. More details can be found in the runbooks.
There are currently 2 different architectures in production for GitLab/PGBouncer/Patroni/Postgres/Consul.
In all cases Patroni keeps Consul DNS up to date with the correct locations of the primary and replica hosts. The primary of the main cluster is resolved as `master.patroni.service.consul`; PGBouncer has a 2s TTL on DNS so it re-checks it frequently. `db-replica.service.consul` is used to get the list of replica hosts. For the new CI cluster the primary will be resolved as `master-ci.patroni.service.consul` and the replicas as `db-ci-replica.service.consul`. Also see the more detailed instructions on how a Standby Patroni Cluster has previously been set up for Geo, which may inform how we set it up for this use case. A new PGBouncer endpoint, `pgbouncer-ci.int.gprd.gitlab.net`, will connect to the new CI Patroni cluster.
Rolling back depends on where you are in the chain of migration steps. The last point at which a simple rollback is possible is before writes have been sent to the new destination CI Patroni cluster. Beyond this point the options you have for recovery will depend heavily on the specific failure.
Here are some example scenarios:

1. No writes have been sent to the new CI Patroni cluster yet: remove the `standby_cluster` configuration section from the new Standby cluster and remove the replication slot from the `slots` section of the main Patroni cluster (see the sketch after this list). After this you can safely just remove or shut down the new CI Patroni cluster infrastructure.
2. The application has been reconfigured with the new `ci` database connection, but writes have not yet moved to the new cluster: this just involves reverting the change that added the `ci` connection to `config/database.yml`. After this you may choose to roll back specific infrastructure changes that deployed the new hardware. This is effectively cleanup and does not need to be rushed.
3. Writes have recently started going to the new CI Patroni cluster: block writes again, take a `pg_dump` of all the `ci_*` tables, then use that `pg_dump` to recover the up-to-date state of all `ci_*` tables on the main Patroni cluster. At this point you can reconfigure the CI PGBouncer to point back to the main cluster (`master.patroni.service.consul`) and re-enable writes.
4. Writes to the `ci_*` tables have somehow ended up on both the new CI cluster and the `main` DB: this is a split brain, and recovery will be very difficult. We may need to tolerate data loss, but we should try to mitigate the risk. One way is using Postgres locks/triggers to block writes; another is doing tests early and using monitoring to confirm that the `ci_*` tables are never updated via the `main` DB (see the query sketch after this list).
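As a sketch of the cleanup in the first scenario, assuming the replication slot is named `patroni_ci` (the actual slot name is an assumption, and Patroni's `slots` configuration must be updated first so the slot is not recreated):

```sql
-- On the main cluster's primary: drop the physical replication slot that fed
-- the new CI standby cluster, once it is no longer listed in Patroni's config.
SELECT pg_drop_replication_slot('patroni_ci');  -- slot name is an assumption
```

For the monitoring check in the last scenario, one option is to watch PostgreSQL's write counters for the `ci_*` tables on the `main` DB, which should stop increasing once all CI writes have moved to the new cluster:

```sql
-- Insert/update/delete counters for ci_* tables on the main database.
SELECT relname, n_tup_ins, n_tup_upd, n_tup_del
FROM pg_stat_user_tables
WHERE relname LIKE 'ci\_%'
ORDER BY relname;
```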
You can find an epic describing the steps and blockers for executing this process at https://gitlab.com/groups/gitlab-org/-/epics/6160.
We will investigate through experiments (e.g. benchmarking) some optimizations that may reduce user impact or just generally make the process safer or faster.
Using PGBouncer to block CI writes may not give us 100% confidence that CI writes aren't still happening to the `main` DB due to an application bug. We may want to investigate using locks or triggers to block writes in the database for any CI tables. This could act as a redundant check to mitigate that risk, for example:
```sql
LOCK TABLE ci_builds, ci_pipelines, ... IN EXCLUSIVE MODE
```
In order to validate that this plan could work, we need to experiment with whether or not we can obtain these locks in a reasonable timeframe. See https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4943.
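A minimal sketch of such an experiment, assuming an abbreviated table list and an arbitrary `lock_timeout` of 5 seconds (both are assumptions to be tuned during the experiment):

```sql
-- Attempt to take write-blocking locks on the CI tables, giving up quickly if
-- they cannot be acquired, so we can measure how long acquisition takes.
BEGIN;
SET LOCAL lock_timeout = '5s';  -- assumed value; adjust per experiment
LOCK TABLE ci_builds, ci_pipelines IN EXCLUSIVE MODE;  -- blocks writes, still allows reads
-- ... record how long the LOCK TABLE statement took or whether it timed out ...
ROLLBACK;  -- release the locks immediately; this is only a timing experiment
```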
The current plan doesn't involve doing any validation to check that the tables are the same in both databases. We simply rely on checking that the Postgres replication is up to date. We don't yet have any reason to suspect that data could be lost while replication is up to date, but it would still be preferable to have a second, independent data point so we can abort if there is a discrepancy.
We may want to experiment with some possible options and see how long they would take to calculate:

- the latest (maximum) `id` over some table(s) that change frequently
- the latest (maximum) `updated_at` over some table(s) that change frequently
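As a sketch, assuming `ci_builds` is chosen as the frequently written table (any busy `ci_*` table would do), the check could be as simple as running the same queries on both clusters just before the switchover and comparing the output:

```sql
-- Run against both the main and the CI databases and diff the results.
SELECT max(id)         AS latest_id         FROM ci_builds;
SELECT max(updated_at) AS latest_updated_at FROM ci_builds;
```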
We don't believe it's a good idea to use logical replication as the main way to keep the databases in sync over a long period of time, for a few reasons. Firstly, it does not replicate schema changes, so it would get out of sync every time we run a migration. Secondly, it is much slower than streaming replication.
We do, however, think that logical replication of only the relevant tables (the `ci_*` tables), and only during a brief window at a low-usage time, may result in fewer updates that need to be synced and therefore a shorter downtime window when switching over the databases. Additionally, it may afford us more rollback options: if we were to reverse the direction of logical replication after the switchover, it would allow us to switch back to a database with all writes in place.
We need to add specific details about how this conversion of streaming replication into logical replication would work without gaps, and we also need to experiment with whether or not logical replication can keep up during quiet times. Ideally we'd also experiment with how much time this may cut from the overall plan, as it will add some complexity.
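As a rough sketch of what that brief logical replication window could look like (publication and subscription names, the connection string, and the table list are all illustrative assumptions):

```sql
-- On the main cluster: publish only the CI tables.
CREATE PUBLICATION ci_tables_pub FOR TABLE ci_builds, ci_pipelines;

-- On the new CI cluster: subscribe to that publication. Because the data
-- already exists there (converted from the streaming replica), the initial
-- table copy is skipped; the exact gap-free cut-over mechanics are still to
-- be worked out, as noted above. After the switchover the direction could be
-- reversed to keep a rollback target with all writes in place.
CREATE SUBSCRIPTION ci_tables_sub
  CONNECTION 'host=main-cluster.internal dbname=gitlabhq_production'
  PUBLICATION ci_tables_pub
  WITH (copy_data = false);
```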