GitLab.com uses repmgr for managing replication and failover for its cluster of PostgreSQL servers, it is also shipped to on-prem customers in Omnibus. However, repmgr has some drawbacks.
Repmgr does not state what algorithm it uses for cluster operations, like electing a new master or fencing the older one, and in all cases ensure cluster consistency. Only deep inspection of the source code may reveal it. There’s no mention on the web page or its documentation. It is obviously not based on distributed consensus like Paxos or Raft. This means that, from a theoretical foundation, repmgr operations are likely prone to fail, or conversely provide significantly weaker correctness guarantees. From an architectural perspective, repmgr writes its state in the primary node. Unless pure synchronous replication would be used, which is undesirable for both performance and availability reasons, this means that the cluster state and monitoring information is inconsistent across the repmgrd daemons observing on each node of the cluster (since state needs to be written into the master, and then replicated asynchronously). So it becomes quite hard to assess how an implementation of a leader election or other operations can be correct if they operate over (potentially) different views of the cluster state from each node of the cluster.
Repmgr does not support switchover and automatic failover with the same configuration. Switchover is incompatible with automatic failover, under brain-split risk. Quoting repmgr docs:
repmgrd should not be running with setting failover=automatic in repmgr.conf when a switchover is carried out, otherwise the repmgrd daemon may try and promote a standby by itself.
According to repmgr docs, there are certain cases (and its hard to consider them improbable) under which after a failover is triggered repmgr may end up in a state where no subsequent action may be taken, failing to promote a given node (i.e. leaving the cluster with no master). Only possible outcomes are that cluster state returns to the expected state for the operation to finally succeed or, if timeout is configured, to exit. But it seems repmgr is not capable of approaching other possible failover targets in the case the first one doesn’t succeed. Quoting repmgr docs:
In certain circumstances, repmgrd is not able to fulfill its primary mission of monitoring the nodes' upstream server. In these cases it enters "degraded monitoring" mode, where repmgrd remains active but is waiting for the situation to be resolved.
Situations where this happens are:
By default, repmgrd will continue in degraded monitoring mode indefinitely.
However a timeout (in seconds) can be set with
after which repmgrd will terminate.
repmgr monitors PostgreSQL and other nodes via standard PostgreSQL wire protocol
connection. While normally an unprivileged user may suffice, many installations
use a privileged user as it is the same one used for replication purposes. This
is fine. However, if a database is saturated with as many as
connections, new connections will be dropped, and this may also apply to repmgr
connections. This confuses repmgr and could end up triggering a failover.
Exactly this problem was
already experienced in GitLab’s production environment, shortly after migrating
from Azure to GCP.
On network split events, it may happen that a group of nodes where the old primary was, are in even number. To avoid possible split-brain scenarios, a witness server is recommended. But it is non-scalable:
repmgr enables provision of "witness server" to artificially create a quorum of servers in a particular location, ensuring that nodes in another location will not elect a new primary if they are unable to see the majority of nodes. However this approach does not scale well, particularly with more complex replication setups.
Nor is a complete solution to the problem. There could be more complex scenarios (combinations of more nodes, some of them disconnected from others due to network partitions) where the witness server may not suffice to satisfy a given majority. We believe in these cases repmgr will leave the cluster in a state with no master.
Due to the fact that state is written to the same database that is being monitored and replicated asynchronously, in the event of primary fail this state cannot be updated. This includes repmgr’s event notification system, which becomes then unavailable, making it very fragile:
Each time repmgr or repmgrd perform a significant event, a record of that event is written into the repmgr.events table together with a timestamp, an indication of failure or success, and further details if appropriate. This is useful for gaining an overview of events affecting the replication cluster. However note that this table has advisory character and should be used in combination with the repmgr and PostgreSQL logs to obtain details of any events. […] Note that under some circumstances (e.g. when no replication cluster primary could be located), it will not be possible to write an entry into the repmgr.events table, in which case executing a script via event_notification_command can serve as a fallback by generating some form of notification.
repmgr source code repository contains no failover tests. Neither unit nor integration tests. Only one test is present in the repository, and it’s only testing that the functions that are installed with the extension are indeed present.
To address repmgr short comings, we consider using Patroni. It defines itself as "A template for PostgreSQL High Availability with ZooKeeper, etcd, or Consul". Patroni does not implement directly a consensus algorithm, but instead relies on a DCS (Distributed Consensus Store, that already implements a consensus algorithm) to support its operations. This means that:
Being based on a DCS, Patroni’s job is greatly simplified. A brief description of how it operates follows:
In other words, the HA software is greatly simplified by offloading the distributed consensus to the external DCS, at the cost of higher operational complexity. However, this means that the DCS already provides very strong correctness guarantees, and the confidence on the solution is significantly higher.
If we would factor in other considerations that are required for HA, that also need to be taken into account even with Patroni, the HA architecture may look similar to that of the following (simplified) diagram:
Apart from the extra operational complexity due to the use of an external DCS, Patroni has another significant drawback: as its own definition states, it is a "template", and not a full-blown HA solution. This means that extra work (could be significant) is required to implement the "missing pieces" (mostly entrypoint and fencing, which are heavily environment dependent).
Patroni itself is not a standalone component of the architecture, but it is rather deeply integrated with PostgreSQL. In particular, Patroni:
There is one Patroni per PostgreSQL node, as they are bundled together within the same instance. Indeed, Patroni is the process launched by the init program, and it launches, in turn, PostgreSQL.
Consul is the recommended DCS as it is already used by GitLab and shipped with Omnibus. It is in turn recommended that Patroni will communicate with Consul via a Consul Agent, which acts as a proxy. Thus every Patroni has connection to the Consul Agent.
Patroni also exposes an HTTP API. This API could by used for dynamic Patroni’s or even PostgreSQL’s reconfiguration, but in our case it’s mostly interesting to serve the health check for the load balancer that will serve as the entry point for the application.
Patroni exposes the two following URLs:
With this information it is easy to construct URLs for the health check of the entrypoint. For a read-write channel, the /master URL could be used. At most one node would respond to 200 to this URL, and the load balancer of the entry point would route connections, always, to the master.
The entrypoint is the mechanism by which PostgreSQL clients connect to the PostgreSQL instances. While on a "normal" scenario this is simple (for instance you may hardcode the IP and port), it doesn’t work on a dynamic scenario where the master node is a parameter subject to change dynamically.
GCP internal load balancers (ILB) may be used for the purpose of serving as an entry point. Consider a read-write connection to PostgreSQL. On the ILB one proxied service may be created for the PostgreSQL cluster.
The ILB will expose a virtual IP (provided by its internal high-availability mechanism) to PostgreSQL clients. At the same time, the ILB will use the health check URLs provided by Patroni (as previously described) and that will make the ILB to always point to the correct master, even after a failover.
Certainly, the applications should be (re)-tested to try to re-connect on connection failure (as that is what will be perceived by the application in case of a failover). This is not exclusive to automatic failover, but should be tested in any scenario, always.
A new Patroni cluster is to be provisioned in staging in standby mode, replicating from the existing PostgreSQL cluster. The new cluster should have the same configurations and specs as the existing one. To minimize the maintenance window required to migrate the production PostgreSQL cluster to Patroni, multiple migration rehearsals should be performed in staging until we have a smooth migration plan. After each migration, it is recommended that a full QA suite to be run against staging as a sanity check that nothing has changed from the client perspective.
Patroni should work for self-managed installations the same way it does for GitLab.com, with one difference that is the internal load balancer (ILB) as an entrypoint. GitLab.com uses an ILB because it is cloud-hosted and the ILB is guaranteed to be highly-available, but self-managed installations may not have access to such solution.
Regardless, Omnibus docs already suggest an entrypoint that is pgbouncer, and it can be used in of lieu an ILB. There are two approaches that can be used:
Both approaches can leverage Patroni's API.
For existing self-managed installations that want to migrate from repmgr to Patroni, Patroni can take over a running cluster when configured properly, but it would probably require careful coordination by Omnibus.
Once a Patroni cluster is up and running, adding and/or removing new cluster members happen automatically when a node is newly created and/or decommissioned.
Existing alerts for PostgreSQL and pgbouncer should still function as expected. New metrics for Patroni need to be created as Patroni itself doesn't expose metrics. This can be achieved by scraping Patroni logs to alert on high error rates or if the Patroni process itself is down.
Stolon is a consensus-based high availability solution. Instead of implementing itself a consensus algorithm, it relies on a external service for this purpose, similar to Patroni. This external service could be either etcd, Consul or Kubernetes API.
Stolon has a unique approach to the entry point problem: instead of relying on external components to indicate applications how to route connections to the (current) master, Stolon brings in a network proxy (PostgreSQL-specialized) to serve as an entry point. This is a very good architectural decision, as it allows transparency for the client, self-containment of the HA solution and fencing capabilities (no client can connect to a master that became disconnected from the rest of the nodes, since the proxy won’t allow it). The proxy is fed from the consensus external service, so it always has a consistent view.