Published on: November 8, 2019
21 min read
Sometimes a good plan is the best tool for the job.
When things go wrong on a large website, it can be fun to read the dramatic stories of high pressure incidents where nothing goes as planned. It makes for good reading. Every once in a while though, we get a success story. Every once in a while, things go exactly as planned.
GitLab.com is a large, high availability instance of GitLab. It is maintained by the Infrastructure group, which currently consists of 20 to 24 engineers (depending on how you count), four managers, and a director, distributed all around the world. Distributed, in this case, does not mean across a few different offices. Three or four major cities have more than one engineer, but with the exception of coworking days, nobody works from the same building.
In order to handle the load generated by about four million users working on around 12 million projects, GitLab.com breaks out the individual components of the GitLab product and currently spreads them out over 271 production servers.
The site is slowly migrating to using Hashicorp's Consul for service location. Consul can be thought of like DNS, in that it associates a well-known name with the actual physical location of that service. It also provides other useful functions such as storing dynamic configuration for services, as well as locking for clusters. All of the Consul client and server components talk to each other over encrypted connections. These connections require a certificate at each end to validate the identity of the client and server and to provide the encryption key. The main component of GitLab.com which currently relies on this service is the database and its high availability system Patroni. Like any website that provides functionality and not just information, the database is the central service that everything else depends on. Without the database, the website, API, CI pipelines, and git services will all deny requests and return errors.
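To make the later steps concrete, here is a rough sketch of what the TLS portion of a Consul agent configuration looks like. The file path and filenames below are assumptions for illustration, not our actual layout, but the option names are Consul's standard TLS settings:

# Hypothetical Consul agent TLS settings; paths and filenames are assumed
cat > /etc/consul.d/tls.json <<'EOF'
{
  "ca_file": "/etc/consul.d/ssl/ca.crt",
  "cert_file": "/etc/consul.d/ssl/consul.crt",
  "key_file": "/etc/consul.d/ssl/consul.key",
  "verify_incoming": true,
  "verify_outgoing": true
}
EOF

With verify_incoming and verify_outgoing set to true, both ends of every connection have to present a certificate signed by the configured CA.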
The issue came to our attention when a database engineer noticed that one of our database servers in the staging environment could not reconnect to the staging Consul server after the database node was restarted.
It turns out that the TLS certificate was expired. This is normally a simple fix. Someone would go to the Certificate Authority (CA) and request a renewal – or if that fails, generate a new certificate to be signed by the same CA. That certificate would replace the expired copy and the service would be restarted. All of the connections should reestablish using the new certificate and just like with any other rolling configuration change, it should be transparent to all users.
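For reference, that renewal flow boils down to a few openssl commands. The filenames and subject below are placeholders, and the final signing step is the one that requires the CA's private key:

# Check when the current certificate expires
openssl x509 -in consul.crt -noout -enddate

# Generate a new key and certificate signing request
openssl req -new -newkey rsa:2048 -nodes \
  -keyout consul.key -out consul.csr -subj "/CN=server.dc1.consul"

# Sign the request with the existing CA (this step needs ca.key)
openssl x509 -req -in consul.csr -CA ca.crt -CAkey ca.key \
  -CAcreateserial -out consul.crt -days 365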
After looking everywhere, and asking everyone on the team, we got the definitive answer that the CA key we created a year ago for this self-signed certificate had been lost.
These test certificates were generated for the original proof-of-concept installation for this service and were never intended to be transitioned into production. However, since everything was working perfectly, the expired test certificate had not been calling attention to itself. A few things should have been done, including rebuilding the service with production in mind, conducting a production readiness review, and adding monitoring. But a year ago, our production team was in a very different place. We were small, with just four engineers and three new team members: a manager, a director, and an engineer, all of whom were still onboarding. We were less focused on the gaps that led to this oversight a year ago and more focused on fixing the urgent problem today.
First, we needed to validate the problem using the information we'd gathered. Since we couldn't update the existing certificates, we turned validation off on the client that couldn't connect. Turning validation off didn't change anything, since the encrypted connections validate both the cluster side and the client side. Next, we changed the setting on one server node in the cluster, and the restarted client could then connect to that server node. The problem now was that the server could no longer connect to any other cluster node and could not rejoin the cluster. The server we changed was not validating connections, meaning it was ignoring the expired certificate of its peers in the cluster, but the peers were not returning the favor. They were shunning it, putting the whole cluster in a degraded state.
We realized that no matter what we did, some servers and some clients would not be able to connect to each other until after the change had been made everywhere and after every service was restarted. Unfortunately, we were talking about 255 of our 271 servers. Our tool set is designed for gradual rollouts, not simultaneous actions.
We were unsure why the site was even still online: if the clients and services could not connect, it was unclear why anything was still working. We ran a small test, which confirmed the site was only working because the connections had been established before the certificates expired. Any interruption of those long-running connections would force revalidation on reconnect, and with the expired certificate that meant new connections would be rejected across the fleet.
Effectively, we were in the middle of an outage that had already started, but hadn't yet gotten to the point of taking down the site.
We declared an incident and began testing every angle we could think of in the staging environment, including:
Reloading the configuration of the running service, which worked fine and did not drop connections, but the certificate settings are not included in the reloadable settings for our version of Consul (the reload command is sketched after this list).
Simultaneous restarts of various services, which worked, but our tools wouldn't allow us to do that with ALL of the nodes at once.
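The reload test in the first item was nothing more exotic than this; consul reload asks the local agent to re-read its configuration without restarting the process, but in the version we were running the TLS settings were not among the options it picks up:

# Re-read the agent configuration without restarting the process
consul reload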
Everything we tried indicated that we had to break those existing connections in order to activate any change, and that we could only avoid downtime if that happened on ALL nodes at precisely the same time.
Every problem uncovered other problems, and as we were troubleshooting, one of our production Consul servers became unresponsive, disconnected all SSH sessions, and would not allow anyone to reconnect. The server did not log any errors. It was still sending monitoring data and was still participating in the Consul cluster. If we restarted the server, it would not be able to reconnect to its peers and we would be left with an even number of nodes. Not having quorum in the cluster would have been dangerous when we went to restart all of the nodes, so we left it in that state for the moment.
Once the troubleshooting was finished it was time to start planning.
There were a few ways to solve the problem. We could:
Replace the CA and the certificates with new self-signed ones.
Change the CA setting to point to the system store, allowing us to use certificates signed by our standard certificate provider and then replace the certificates.
Disable the validation of the dates so that the expired certificate would not cause connections to fail.
All of these options carried the same fundamental risk: the same risky restart of all services at once.
We picked the last option. Our reasoning was that disabling the validation would eliminate the immediate risk and give us time to slowly roll out a properly robust solution in the near future, without having to worry about disrupting the whole system. It was also the smallest and most incremental change.
While there was some time pressure due to the risk of network connections being interrupted, we had to consider the reality of working across timezones as we planned our solution.
We decided not to hand it off to the European shift, who were coming online soon. Being a globally distributed team, we had already handed things off from the end of the day in Mongolia, through Eastern and Western Europe and across the Americas, and were approaching the end of the day in Hawaii and New Zealand.
Australia still had a few more hours and Mongolia had started the day again, but the folks who had been troubleshooting it throughout the day had a pretty good handle on what needed to happen and what could go wrong. It made sense for them to be the ones to do the work. We decided to make a "Break Glass" plan instead. This was a merge request with all of the changes and information necessary for the European shift to get us back into a good state in case a full outage happened before anyone who had been working on it woke up. Everyone slept better knowing that we had a plan that would work, even if it could not be executed without causing downtime. If we were already experiencing downtime, there would be no problem.
In the morning (HST) everything was how we left it, so we started planning how to change the settings and restart all of the services without downtime. Our normal management tools were out because of the time it takes to roll out changes. Even sequential tools such as knife ssh, mussh, or ansible wouldn't work because the change had to be precisely simultaneous. Someone joked about setting it up in cron, which led us to the standard Linux at command (a relative of the more widely used batch). cron would require cleanup afterward, but an at command can be pushed out ahead of time with a sequential tool and will run a command at a precise time on all machines. Back in the days of hands-on, bare metal system administration, it was a useful trick for running one-time maintenance in the middle of the night or making it look like you were working when you weren't. Now at has become more obscure with the trend toward managing fleets of servers rather than big monolithic central machines.
We chose to run the command sudo systemctl restart consul.service. We tested this in staging to verify that our Ubuntu distribution made environment variables like $PATH available, and that sudo did not ask for a password. On some distributions (older CentOS especially) this is not always the case.
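As an illustration of the mechanics (the time here is a placeholder), queueing a one-time job and inspecting it looks like this:

# Queue a command to run at a precise wall-clock time
echo 'sudo systemctl restart consul.service' | at 02:20

# List pending jobs, then dump the environment and command for a given job number
atq
at -c 1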
With those successful tests, we still needed to change the config files. Luckily, there is nothing that prevents changing these ahead of time, since the changes aren't picked up until the service restarts. We didn't want to do this step at the same time as the service restart so we could validate the changes and keep the at command as small as possible. We decided not to use Chef to push out the change because we needed complete and immediate transparency. Any nodes that did not get the change would fail after the restart. mussh was the tool that offered the most control and visibility while still being able to change all hosts with one command.
We also had to disable the Chef client so that it didn't overwrite the changes between when they were written and when the service restarted.
Before running anything we also needed to address the one Consul server that we couldn't access. It likely just needed to be rebooted and would come up and be unable to reconnect to the cluster. The best option was to do this manually just before starting the rest of the procedure.
Once we had mapped out the plan we practiced it in the disaster recovery environment. We used the disaster recovery environment instead of the staging environment because all of the nodes in the staging environment had already been restarted, so there were no long-running connections to test. That made the disaster recovery environment the next best option. It did not go perfectly, since the database in this environment was already in an unhealthy state, but it gave us valuable information to adjust the plan.
It was almost time to fix the inaccessible Consul node. The team connected to one of the other nodes to monitor and watch logs. Suddenly, the second node started disconnecting people. It was behaving exactly like the inaccessible node had the previous day. 😱 Suspiciously, it didn't disconnect everyone. Those who were still logged in noticed that sshguard was blocking access to some of the bastion servers that all of our ssh traffic flows through when accessing the internal nodes: Infrastructure#7484. We have three bastion servers, and two were blocked because so many of us connected so many sessions so quickly. Disabling sshguard allowed everyone back in, and that information was the hint we needed to manually find the one bastion which hadn't yet been blocked. It got us back into the original problem server. Disabling sshguard there left us with a fully functional node and with the ability to accept the at command to restart the Consul service at exactly the same time as the others.
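For anyone who has not met sshguard before, getting ourselves unblocked amounted to something like the following; the service and chain names are the defaults and may differ slightly from our actual setup:

# Stop sshguard so it stops adding new blocks
sudo systemctl stop sshguard.service

# With the iptables backend, blocked addresses sit in a chain named sshguard
sudo iptables -L sshguard -n
sudo iptables -F sshguard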
We verified that we had an accurate and instantaneous way to monitor the state of the services. Watching the output of the consul operator raft list-peers command every second gave us a view that looked like this:
Node                Address          State     Voter  RaftProtocol
consul-01-inf-gprd  10.218.1.4:8300  follower  true   3
consul-03-inf-gprd  10.218.1.2:8300  leader    true   3
consul-05-inf-gprd  10.218.1.6:8300  follower  true   3
consul-04-inf-gprd  10.218.1.5:8300  follower  true   3
consul-02-inf-gprd  10.218.1.3:8300  follower  true   3
Even the most thorough plans always miss something. At this point we realized that one of the three pgbouncer nodes which direct traffic to the correct database instance was not showing as healthy in the load balancer. One is normally in this state as a warm spare, but one of the side effects of disconnecting the pgbouncer nodes from Consul is that they would all fail their load balancer health checks. If all health checks are failing, GCP load balancers send requests to ALL nodes as a safety feature. This would lead to too many connections to our database servers, causing unintended consequences. We worked around this by removing the unhealthy node from the load balancer pool for the remainder of this activity.
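The removal itself was a one-liner along these lines, assuming an unmanaged instance group behind the internal load balancer; the group, instance, and zone names are placeholders:

# Take the unhealthy pgbouncer node out of the load balancer's backend group
gcloud compute instance-groups unmanaged remove-instances pgbouncer-group \
  --instances=pgbouncer-03-db-gprd \
  --zone=us-east1-c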
We checked that the lag on the database replicas was zero, and that they weren't trying to replicate any large and time-consuming transactions.
We generated a text list of all of the nodes that run the Consul client or server.
We verified the time zone (UTC) and time synchronization on all of those servers to ensure that when the at command executed the restart, an unsynchronized clock wouldn't cause unintended behavior.
We also verified the at scheduler was running on all of those nodes, and that sudo would not ask for a password.
We verified the script that would edit the config files, and tested it against the staging environment.
We also made sure sshguard was disabled and wasn't going to lock out the scripted process for behaving like a scripted process (a rough sketch of these checks follows this list).
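For the curious, those per-node checks boil down to a handful of standard commands; the sketch below assumes systemd-based Ubuntu hosts like ours:

# Time zone and clock synchronization status
timedatectl

# The at scheduler has to be running for the queued job to fire
systemctl is-active atd

# sudo must not prompt for a password; -n makes this fail fast if it would
sudo -n true

# sshguard should be out of the picture
systemctl is-active sshguard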
This might seem like a lot of steps but without any of these prerequisites the whole process would fail. Once all of that was done, everything was ready to go.
In the end, we scheduled a maintenance window and distilled all of the research and troubleshooting down to the steps in this issue.
Everything was staged and it was time to make the changes. This course of action included four key steps. First, we paused the Patroni database high availability subsystem. Pausing would freeze database failover and keep the high availability configuration static until we were done. It would have been bad if we had a database failure during this time, so minimizing the amount of time in this state was important.
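Pausing is a built-in Patroni operation; roughly, it looks like the commands below, with the config path being an assumption about the layout:

# Freeze automatic failover while we work
patronictl -c /etc/patroni/patroni.yml pause

# ...maintenance happens here...

# Turn automatic failover back on once everything checks out
patronictl -c /etc/patroni/patroni.yml resume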
Next, we ran a script on every machine that stopped the Chef client service and then changed the verify lines in the config files from true to false. It wouldn't help to have Chef trying to reconfigure anything as we made changes. We did this using mussh in batches of 20 servers at a time. Any more in parallel and our SSH agent and Yubikeys might not have been able to keep up. We were not expecting any change in state from this step: the config files on disk should have the new values, but the running services wouldn't change, and more importantly, no TCP connections would disconnect. That was what we got, so it was time for some verification.
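Conceptually, the per-node change was as small as the sketch below; the config path, key names, and service name are assumptions about our setup, and mussh fanned it out in batches:

# Stop Chef so it cannot revert the edit before the restart happens
sudo systemctl stop chef-client.service

# Flip certificate verification off in the Consul agent config
sudo sed -i \
  -e 's/"verify_incoming": true/"verify_incoming": false/' \
  -e 's/"verify_outgoing": true/"verify_outgoing": false/' \
  /etc/consul.d/tls.json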
Our third step was to check all of the servers and a random sampling of client nodes to make sure config files had been modified appropriately. It was also a good time to double-check that the Chef client was disabled. This check turned out to be a good thing to do, because there were a few nodes that still had the Chef client active. It turned out that those nodes were in the middle of a run when we disabled the service, and it reenabled the service for us when the run completed. Chef can be so helpful. We disabled it manually on the few machines that were affected. This delayed our maintenance window by a few minutes, so we were very glad we didn't schedule the at commands first.
Finally, we needed to remove the inactive pgbouncer node from the load balancer, so when the load balancer went into its safety mode, it would only send traffic to the two that were in a known state. You might think that removing it from the load balancer would be enough, but since it also participates in a cluster via Consul, the whole service needed to be shut down along with the health check, which the load balancer uses to determine whether to send it traffic. We made a note of the full command line from the process table, shut it down, and removed it from the pool.
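Capturing the running command line and stopping the node looked roughly like this; the service name is an assumption:

# Record exactly how pgbouncer was started so it can be restored identically
ps -C pgbouncer -o args=

# Stop the service on this node; the health check was shut down alongside it
sudo systemctl stop pgbouncer.service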
Now was the moment of truth. It was 02:10 UTC. We pushed the following command to every server (20 at a time, using mussh): echo 'sudo systemctl restart consul.service' | at 02:20 – it took about four minutes to complete.
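The push itself was the one-liner above wrapped in mussh; the host list filename and the mussh flags below are from memory and worth double-checking against the man page:

# Queue the restart for the same instant on every Consul node, 20 hosts at a time
mussh -H all-consul-nodes.txt -m 20 \
  -c "echo 'sudo systemctl restart consul.service' | at 02:20"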
Then we waited. We monitored the Consul servers by running watch -n 1 consul operator raft list-peers on each of them in a separate terminal. We bit our nails. We watched the dashboards for signs of db connection errors from the frontend nodes. We all held our breath, and watched the database for signs of distress. Six minutes is a long time to think: "It's 4am in Europe, so they won't notice" and "It's dinner time on the US west coast, maybe they won't notice". Trust me, six minutes is a really long time: "Sorry APAC users for your day, which we are about to ruin by missing something".
We counted down the last few seconds and watched. In the first second, the Consul servers all shut down, severing the connections that were keeping everything working. All 255 of the clients restarted at the same time. In the next second, we watched the servers return Unexpected response code: 500, which means "connection refused" in this case. The third second... still returning "panic now" or maybe it was "connection refused"... In the fourth second, all nodes returned no leader found, which meant that the connection was not being refused but the cluster was not healthy. The fifth second, no change. I'm thinking, just breathe, they were probably all discovering each other. In the sixth second, still no change: maybe they're electing a leader? Second seven was the appropriate time for worry and panic. Then, the eighth second brought good news: node 04 is the leader. All other nodes were healthy and communicating properly. In the ninth second, we let out a collective (and globally distributed) exhale.
Now it was time to check what damage that painfully long eight seconds had done. We went through our checklist:
The database was still processing requests, no change.
The web and API nodes hadn't thrown any errors. They must have restarted fast enough that the cached database addresses were still being used.
The most important metric – the graph of 500 errors seen by customers: There was no change.
We expected to see a small spike in errors, or at least some identifiable change, but there was nothing but the noise floor. This was excellent news! 🎉
Then we checked whether the database was communicating with the Consul servers. It was not. Everyone quickly turned their attention to the backend database servers. If they had been running normally and the high availability tool hadn't been paused, an unplanned failover would be the minimum outage we could have hoped for. It's likely that they would have gotten into a very bad state. We started to troubleshoot why it wasn't communicating with the Consul server, but about one minute into the change, the connection came up and everything synced. Apparently it just needed a little more time than the others. We verified everything, and when everyone was satisfied we turned the high availability back on.
Now that everything in the critical path was working as expected, we released the tension from our shoulders. We re-enabled Chef and merged the MR pinning the Chef recipes to the newer version, and the MR's CI job pushed the newer version to our Chef server. After picking a few low-impact servers, we manually kicked off Chef runs after checking the md5sum of the Consul client config files. After Chef finished, there was no change to the file, and the Chef client service was running normally again. We followed the same process on the Consul servers with the same result, and manually implemented it on the database servers, just for good measure. Once those all looked good, we used mussh to kick off a Chef run on all of the servers using the same technique we used to turn them off.
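The canary verification was essentially a checksum comparison around a manual Chef run; the paths and service name below are assumptions:

# Checksum the Consul config before handing control back to Chef
md5sum /etc/consul.d/*.json

# Re-enable the Chef client service and run it once by hand
sudo systemctl start chef-client.service
sudo chef-client

# The checksums should be unchanged afterward
md5sum /etc/consul.d/*.json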
Now all that was left was to straighten everything out with pgbouncer and the database load balancer, and then we could fully relax. Looking at the health checks, we noticed that the two previously healthy nodes were not returning healthy. The health checks are used to tell the load balancer which pgbouncer nodes have a Consul lock and therefore which nodes to send traffic to. A little digging showed that after retrying to connect to the Consul service a few times, they gave up. This was not ideal, so we opened an Infrastructure issue to fix it later and restarted the health checks manually. Everything showed normal so we added the inactive node back to the load balancer. The inactive node's health check told the load balancer not to select it, and since the load balancer was no longer in failsafe mode (due to the other node's health checks succeeding), the load balancer refrained from sending it traffic.
Simultaneously restarting all of the Consul components with the new configuration put everything back into its original state, other than the validation setting which we set to false, and the TCP sessions which we restarted. After this change, the Consul clients will still be using TLS encryption but will ignore the fact that our cert is now expired. This is still not an ideal state but it gives us time to get there in a thoughtful way rather than as a rushed workaround.
Every once in a while we get into a situation that all of the fancy management tools just can't fix. There is no run book for situations such as the one we encountered. The question we were asked most frequently once people got up to speed was: "Isn't there some instructional walkthrough published somewhere for this type of thing?". For replacing a certificate from the same authority, yes definitely. For replacing a certificate on machines that can have downtime, there are plenty. But for keeping traffic flowing when hundreds of nodes need to change a setting and reconnect within a few seconds of each other... that's just not something that comes up very often. Even if someone wrote up the procedure it wouldn't work in our environment with all of the peripheral moving parts that required our attention.
In these types of situations there is no shortcut around thinking things through methodically. In this case, there were no tools or technologies that could solve the problem. Even in this new world of infrastructure as code, site reliability engineering, and cloud automation, there is still room for old fashioned system administrator tricks. There is just no substitute for understanding how everything works. We can try to abstract it away to make our day-to-day responsibilities easier, but when it comes down to it there will always be times when the best tool for the job is a solid plan.
Cover image by Thomas Jensen on Unsplash