There are 3 interesting parts of the migration with fairly simple solutions; more details can be found in the runbooks.
There are currently 2 different architectures in production for GitLab/PGBouncer/Patroni/Postgres/Consul.
In all cases, Patroni keeps Consul DNS up to date with the correct locations of the primary and replica hosts.
- Writes go through PGBouncer at `pgbouncer.int.gprd.gitlab.net`, which connects to `master.patroni.service.consul`. PGBouncer has a 2s TTL on DNS to check it frequently.
- Reads connect directly to the replicas, using `db-replica.service.consul` to get the list of hosts.

The new CI cluster mirrors this setup, with the primary published as `master-ci.patroni.service.consul` and the replicas as `db-ci-replica.service.consul`. Also see the more detailed instructions on how a Standby Patroni Cluster has previously been set up for Geo, which may inform how we set it up for this use case.

CI reads will use `db-ci-replica.service.consul`, and CI writes will go through a new `pgbouncer-ci.int.gprd.gitlab.net` that connects to `master.patroni.service.consul` (the main cluster primary) until the switchover. At switchover, once writes are blocked, we record the `last_update_lsn` of the main cluster and wait until the new CI cluster has replayed up to `last_update_lsn`; only then is `pgbouncer-ci.int.gprd.gitlab.net` repointed at `master-ci.patroni.service.consul`.
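As a rough illustration of that catch-up check, queries along these lines could be run during the write-blocked window. This is only a sketch; the actual runbook steps may differ, and the literal LSN below is a placeholder for the recorded `last_update_lsn`:

```sql
-- On the main cluster primary, after writes are blocked:
-- record the result as last_update_lsn.
SELECT pg_current_wal_insert_lsn();

-- On the leader of the new CI cluster (still a standby at this point):
SELECT pg_last_wal_replay_lsn();

-- The standby has caught up once its replayed LSN has reached last_update_lsn,
-- i.e. this returns true (substitute the recorded LSN for the placeholder):
SELECT pg_wal_lsn_diff(pg_last_wal_replay_lsn(), '16/B374D848'::pg_lsn) >= 0;
```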
Rolling back depends on where you are in the chain of migration steps. The last point at which you can still roll back easily is before any writes have been accepted by the new destination CI Patroni cluster. Beyond this point, the options you have for recovery will depend heavily on the specific failure.
Here are some example scenarios:
- If we abort before the switchover, we remove the `standby_cluster` configuration section from the new Standby cluster and remove the replication slot from the `slots` section of the main Patroni cluster (see the sketch after this list). After this you can safely just remove or shut down the new CI Patroni cluster infrastructure.
- If the application has already been switched to a separate `ci` database connection, we roll that back too. This just involves reverting the change that adds the `ci:` section to `config/database.yml`. After this you may choose to roll back the specific infrastructure changes that deployed the new hardware. This is effectively cleanup and does not need to be rushed.
- If writes have already been sent to the new CI Patroni cluster, we block writes again and take a `pgdump` of all the `ci_*` tables. Then we use that `pgdump` to recover the up-to-date state of all `ci_*` tables on the main Patroni cluster. At this point you can reconfigure the CI PGBouncer back to point to the main cluster `master.patroni.service.consul` and re-enable writes.
- If, after the switchover, CI writes have also been going to the `main` DB, this is a split brain and recovery will be very difficult. We may need to tolerate data loss but we should try to mitigate the risk. One way is using Postgres locks/triggers to block writes. Another way to mitigate risk is by doing tests early and using monitoring to confirm that `ci_*` tables are never updated via the `main` DB connection.
- The worst case is needing to roll back some time after the switchover, when the `main` database has also been taking writes to non-CI tables. An approach could involve replaying the CI table updates through another database and then streaming the updates from that 3rd database using logical replication. Read more at https://gitlab.com/gitlab-org/gitlab/-/issues/345560#replaying-ci-table-updates-back-to-main-using-a-3rd-database-and-logical-replication.

You can find an epic describing the steps and blockers for executing this process at https://gitlab.com/groups/gitlab-org/-/epics/6160.
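For the first scenario above (aborting before the switchover), the replication slot on the main cluster can be inspected and, if Patroni has not already cleaned it up after the `slots` section change, removed manually. This is only a sketch; the slot name below is a placeholder:

```sql
-- On the main Patroni cluster primary: list slots to find the one
-- feeding the new CI standby cluster (the name below is a placeholder).
SELECT slot_name, slot_type, active FROM pg_replication_slots;

-- Once the standby cluster has been shut down (the slot must be inactive):
SELECT pg_drop_replication_slot('patroni_ci_standby');
```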
We will investigate through experiments (e.g. benchmarking) some optimizations that may reduce user impact or just generally make the process safer or faster.
Using PGBouncer to block CI writes may not give us 100% confidence that CI writes aren't still happening to the `main` DB due to an application bug. We may want to investigate using locks or triggers to block writes in the `main` database for any CI tables. This could act as a redundant check to mitigate risk.
For example: `LOCK TABLE ci_builds, ci_pipelines, ... IN EXCLUSIVE MODE`.
In order to validate that this plan could work, we need to experiment with whether or not we can obtain these locks in a reasonable timeframe. See https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4943.
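A minimal sketch of what this could look like, assuming the locks are taken in a single transaction held open for the duration of the switchover window; the abbreviated table list is carried over from the example above, and the timeout value is an assumption to be validated by that experiment:

```sql
BEGIN;
-- Fail fast instead of queueing behind long-running queries.
SET LOCAL lock_timeout = '10s';

-- EXCLUSIVE mode blocks writes to these tables but still allows plain reads.
LOCK TABLE ci_builds, ci_pipelines IN EXCLUSIVE MODE;

-- ... keep the transaction open while the switchover completes ...

COMMIT;  -- or ROLLBACK; either way the locks are released and writes resume
```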
The current plan doesn't involve doing any validation to check that the `ci_*` tables are the same in both databases. We simply rely on checking that the Postgres replication is up to date. We don't yet have any reason to suspect that data could be lost while replication is up to date, but it would still be preferable to have a second, independent data point so we can abort if something is wrong.
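For instance, comparisons along these lines could be run on both databases and the results diffed; this is only a sketch, and `ci_builds` is just an illustrative choice of a frequently-changing table:

```sql
-- Cheap-ish aggregates that should match on both sides once replication
-- has caught up; run on main and on the new CI cluster and compare.
SELECT MAX(id)         FROM ci_builds;
SELECT MAX(updated_at) FROM ci_builds;
SELECT COUNT(*)        FROM ci_builds;  -- likely the slowest of the three
```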
We may want to experiment with some possible options and see how long they would take to calculate:

- the latest `id` over some table(s) that change frequently
- the latest `updated_at` over some table(s) that change frequently

We don't believe it's a good idea to use logical replication as the main way to keep the databases in sync over a long period of time, for a few reasons. Firstly, it does not replicate schema changes, so the databases would get out of sync every time we run a migration. Secondly, it is much slower than streaming replication.
We do, however, think that logical replication of only the relevant tables (`ci_*`), only during a brief window at a low-usage time, may result in fewer updates that need to be synced and therefore a briefer downtime window when switching over the databases. Additionally, it may afford us more rollback options if we were to reverse the direction of logical replication after the switchover, since it would then allow us to switch back to a database with all writes in place.
We need to add specific details about how this conversion of streaming replication into logical replication would work without gaps, and we also need to experiment with whether or not logical replication can keep up during quiet times. Ideally we'd also experiment with how much time this would save in the overall plan, since it adds some complexity.
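A minimal sketch of how that brief window of logical replication could be set up, assuming the rows are already present on the destination via streaming replication so the initial copy is skipped; the publication and subscription names, the table list, and the connection string are all placeholders:

```sql
-- On the source database: publish only the CI tables
-- (table list abbreviated; in practice it would cover all ci_* tables).
CREATE PUBLICATION ci_tables_pub FOR TABLE ci_builds, ci_pipelines;

-- On the destination database: subscribe to those tables only.
-- copy_data = false because the rows already exist there via streaming
-- replication; the connection string below is a placeholder.
CREATE SUBSCRIPTION ci_tables_sub
  CONNECTION 'host=source-db.example.internal dbname=gitlabhq_production user=replication'
  PUBLICATION ci_tables_pub
  WITH (copy_data = false);
```

Reversing the direction after the switchover would be the mirror image: a publication of the `ci_*` tables on the new CI cluster and a subscription on `main`.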