Published on February 10, 2017
19 min read
Postmortem on the database outage of January 31 2017 with the lessons we learned.
On January 31st 2017, we experienced a major service outage for one of our products, the online service GitLab.com. The outage was caused by an accidental removal of data from our primary database server.
This incident caused the GitLab.com service to be unavailable for many hours. We also lost some production data that we were eventually unable to recover. Specifically, we lost modifications to database data such as projects, comments, user accounts, issues and snippets, that took place between 17:20 and 00:00 UTC on January 31. Our best estimate is that it affected roughly 5,000 projects, 5,000 comments and 700 new user accounts. Code repositories or wikis hosted on GitLab.com were unavailable during the outage, but were not affected by the data loss. GitLab Enterprise customers, GitHost customers, and self-managed GitLab CE users were not affected by the outage, or the data loss.
Losing production data is unacceptable. To ensure this does not happen again we're working on multiple improvements to our operations & recovery procedures for GitLab.com. In this article we'll look at what went wrong, what we did to recover, and what we'll do to prevent this from happening in the future.
To the GitLab.com users whose data we lost and to the people affected by the outage: we're sorry. I apologize personally, as GitLab's CEO, and on behalf of everyone at GitLab.
GitLab.com currently uses a single primary and a single secondary in hot-standby mode. The standby is only used for failover purposes. In this setup a single database has to handle all the load, which is not ideal. The primary's hostname is db1.cluster.gitlab.com, while the secondary's hostname is db2.cluster.gitlab.com.
In the past we've had various other issues with this particular setup due to db1.cluster.gitlab.com being a single point of failure.
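To make the replication terminology used below a bit more concrete, here is a minimal sketch of how one can inspect such a primary/hot-standby pair. The gitlab-psql wrapper and the gitlabhq_production database name are assumptions based on a typical Omnibus GitLab installation, not necessarily our exact setup.

```bash
# Minimal sketch, assuming an Omnibus-style installation.

# On the primary (db1): list connected standbys and how far along they are.
sudo gitlab-psql -d gitlabhq_production \
  -c "SELECT client_addr, state, sent_location, replay_location FROM pg_stat_replication;"

# On the standby (db2): confirm it is running in recovery (hot-standby) mode.
sudo gitlab-psql -d gitlabhq_production -c "SELECT pg_is_in_recovery();"
```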
On January 31st an engineer started setting up multiple PostgreSQL servers in our staging environment. The plan was to try out pgpool-II to see if it would reduce the load on our database by load balancing queries between the available hosts. Here is the issue for that plan: infrastructure#259.
± 17:20 UTC: prior to starting this work, our engineer took an LVM snapshot of the production database and loaded this into the staging environment. This was necessary to ensure the staging database was up to date, allowing for more accurate load testing. This procedure normally happens automatically once every 24 hours (at 01:00 UTC), but they wanted a more up to date copy of the database.
± 19:00 UTC: GitLab.com starts experiencing an increase in database load due to what we suspect was spam. In the week leading up to this event GitLab.com had been experiencing similar problems, but not this severe. One of the problems this load caused was that many users were not able to post comments on issues and merge requests. Getting the load under control took several hours.
We would later find out that part of the load was caused by a background job trying to remove a GitLab employee and their associated data. This was the result of their account being flagged for abuse and accidentally scheduled for removal. More information regarding this particular problem can be found in the issue "Removal of users by spam should not hard delete".
± 23:00 UTC: Due to the increased load, our PostgreSQL secondary's replication process started to lag behind. The replication failed as WAL segments needed by the secondary were already removed from the primary. As GitLab.com was not using WAL archiving, the secondary had to be re-synchronised manually. This involves removing the existing data directory on the secondary, and running pg_basebackup to copy over the database from the primary to the secondary.
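As a rough illustration, a re-sync of this kind typically looks something like the sketch below. The data directory path, service names and replication user are assumptions based on an Omnibus-style setup, not our actual runbook.

```bash
# Minimal sketch of re-synchronising a standby with pg_basebackup.
# Run on the *secondary* only; paths, service names and the replication
# user are assumptions.
sudo gitlab-ctl stop postgresql                  # stop the standby
sudo rm -rf /var/opt/gitlab/postgresql/data/*    # wipe its stale data directory

# Stream a fresh base backup from the primary, including the WAL needed to start.
sudo -u gitlab-psql pg_basebackup \
  --host=db1.cluster.gitlab.com \
  --username=gitlab_replicator \
  --pgdata=/var/opt/gitlab/postgresql/data \
  --xlog-method=stream --progress --verbose

# A recovery.conf pointing at the primary must also be in place before starting.
sudo gitlab-ctl start postgresql
```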
One of the engineers went to the secondary and wiped the data directory, then ran pg_basebackup. Unfortunately pg_basebackup would hang, producing no meaningful output, despite the --verbose option being set. After a few tries pg_basebackup mentioned that it could not connect due to the master not having enough available replication connections (as controlled by the max_wal_senders option).
To resolve this our engineers decided to temporarily increase max_wal_senders from the default value of 3 to 32. When applying the settings, PostgreSQL refused to restart, claiming too many semaphores were being created. This can happen when, for example, max_connections is set too high. In our case this was set to 8000. Such a value is way too high, yet it had been applied almost a year ago and was working fine until that point. To resolve this the setting's value was reduced to 2000, resulting in PostgreSQL restarting without issues.
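The semaphore error comes from the fact that PostgreSQL allocates System V semaphores in proportion to max_connections; if the kernel's limits are too low for the configured value, the server refuses to start. A hedged sketch of how one might check this (the numbers shown are illustrative, not taken from our hosts):

```bash
# Current System V semaphore limits, in the order SEMMSL SEMMNS SEMOPM SEMMNI.
cat /proc/sys/kernel/sem
# e.g. 250  32000  32  128

# PostgreSQL needs roughly one semaphore per allowed connection (plus a few for
# background workers), allocated in sets of 16. With max_connections = 8000 that
# is on the order of 500 semaphore sets, far more than a default SEMMNI of 128
# allows; lowering max_connections to 2000 brings the requirement back in range.
```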
Unfortunately this did not resolve the problem of pg_basebackup not starting replication immediately. One of the engineers decided to run it with strace to see what it was blocking on. strace showed that pg_basebackup was hanging in a poll call, but that did not provide any other meaningful information that might have explained why.
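For reference, this is roughly the kind of strace invocation involved; it is generic strace usage rather than our exact command.

```bash
# Attach to the running pg_basebackup, follow any forked children, and
# timestamp every system call to see where it spends its time.
sudo strace -f -tt -p "$(pgrep -n pg_basebackup)"

# A client that is simply waiting on the primary shows up as the process
# sitting in poll() on its network socket -- which, as we later learned,
# is normal while the primary prepares to start sending data.
```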
± 23:30 UTC: one of the engineers thinks that perhaps pg_basebackup created some files in the PostgreSQL data directory of the secondary during the previous attempts to run it. While normally pg_basebackup prints an error when this is the case, the engineer in question wasn't too sure what was going on. It would later be revealed by another engineer (who wasn't around at the time) that this is normal behaviour: pg_basebackup will wait for the primary to start sending over replication data, and it will sit and wait silently until that time. Unfortunately this was not clearly documented in our engineering runbooks nor in the official pg_basebackup documentation.
Trying to restore the replication process, an engineer proceeds to wipe the PostgreSQL database directory, errantly thinking they were doing so on the secondary. Unfortunately this process was executed on the primary instead. The engineer terminated the process a second or two after noticing their mistake, but at this point around 300 GB of data had already been removed.
Hoping they could restore the database the engineers involved went to look for the database backups, and asked for help on Slack. Unfortunately the process of both finding and using backups failed completely.
This brings us to the recovery procedures. Normally in an event like this, one should be able to restore a database in relatively little time using a recent backup, though some form of data loss cannot always be prevented. For GitLab.com we have the following procedures in place:

- A backup generated at regular intervals using pg_dump; this backup is uploaded to Amazon S3, and old backups are automatically removed after some time.
- Azure disk snapshots (described in more detail below).
- LVM snapshots used to load production data into the staging environment (also described below).
- Replication from the primary to the secondary.

At this point the replication process was broken and data had already been wiped from both the primary and secondary, meaning we could not restore from either host.
When we went to look for the pg_dump backups we found out they were not there. The S3 bucket was empty, and there was no recent backup to be found anywhere. Upon closer inspection we found out that the backup procedure was using pg_dump 9.2, while our database is running PostgreSQL 9.6 (for PostgreSQL 9.x, releases such as 9.2 and 9.6 count as different major versions). A difference in major versions results in pg_dump producing an error, terminating the backup procedure.
The difference is the result of how our Omnibus package works. We currently support both PostgreSQL 9.2 and 9.6, allowing users to upgrade (either manually or using commands provided by the package). To determine the correct version to use, the Omnibus package looks at the PostgreSQL version of the database cluster (as determined by $PGDIR/PG_VERSION, with $PGDIR being the path to the data directory). When PostgreSQL 9.6 is detected Omnibus ensures all binaries use PostgreSQL 9.6, otherwise it defaults to PostgreSQL 9.2.
The pg_dump procedure was executed on a regular application server, not the database server. As a result there is no PostgreSQL data directory present on these servers, thus Omnibus defaults to PostgreSQL 9.2. This in turn resulted in pg_dump terminating with an error.
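For illustration, a backup script can detect this kind of client/server mismatch up front, before attempting the dump. A minimal sketch, assuming a gitlabhq_production database and omitting connection details; this is not our actual backup script:

```bash
#!/usr/bin/env bash
# Sketch only: abort loudly when the pg_dump client and the server disagree
# on the PostgreSQL major version.
set -euo pipefail

DB="gitlabhq_production"   # assumed database name
CLIENT=$(pg_dump --version | grep -oE '[0-9]+\.[0-9]+' | head -n1)
SERVER=$(psql -d "$DB" -Atc "SHOW server_version;" | grep -oE '^[0-9]+\.[0-9]+')

if [ "$CLIENT" != "$SERVER" ]; then
  echo "pg_dump ${CLIENT} does not match server ${SERVER}; aborting backup" >&2
  exit 1
fi

pg_dump --format=custom --file="/var/backups/${DB}-$(date +%F).pgdump" "$DB"
```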
While notifications are enabled for any cronjobs that produce errors, these notifications are sent by email. For GitLab.com we use DMARC. Unfortunately DMARC was not enabled for the cronjob emails, resulting in them being rejected by the receiving mail server. This means we were never aware of the backups failing, until it was too late.
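One way to avoid depending on email for this kind of alerting is to report cronjob results straight into a monitoring system (we describe our new backup monitoring dashboard further below). A minimal sketch of the general idea; the backup script path and the Prometheus Pushgateway address are hypothetical, not ours:

```bash
# Run the (hypothetical) backup script and record whether it failed, then push
# the result to a metrics endpoint where an alert can fire if no successful
# run has been seen recently -- no email involved.
if /opt/gitlab-backup/pg-backup.sh; then STATUS=0; else STATUS=1; fi

echo "gitlab_pg_backup_failed ${STATUS}" | \
  curl --silent --data-binary @- \
  http://pushgateway.example.com:9091/metrics/job/pg_backup
```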
Azure disk snapshots are used to generate a snapshot of an entire disk. These snapshots don't make it easy to restore individual chunks of data (e.g. a lost user account), though it's possible. The primary purpose is to restore entire disks in case of disk failure.
In Azure a snapshot belongs to a storage account, and a storage account in turn is linked to one or more hosts. Each storage account has a limit of roughly 30 TB. When restoring a snapshot using a host in the same storage account, the procedure usually completes very quickly. However, when using a host in a different storage account the procedure can take hours if not days to complete. For example, in one such case it took over a week to restore a snapshot. As a result we try not to rely on this system too much.
While enabled for the NFS servers, these snapshots were not enabled for any of the database servers, as we assumed that our other backup procedures were sufficient.
The LVM snapshots are primarily used to easily copy data from our production environment to our staging environment. While this process was working as intended, the snapshots it produces are not really meant to be used for disaster recovery. At the time of the outage we had two snapshots available: the regular snapshot taken automatically at 01:00 UTC, and the snapshot taken manually at around 17:20 UTC in preparation for the staging work described earlier.
When we generate a snapshot, it is taken from the production database and the resulting data is then loaded into the staging environment; as part of this process webhooks are removed from the staging copy.
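For readers unfamiliar with LVM snapshots: creating one is a single command against the volume that holds the database files. A minimal sketch, with made-up volume group and logical volume names:

```bash
# Take a copy-on-write snapshot of the logical volume holding the PostgreSQL
# data; the VG/LV names and snapshot size here are illustrative only.
sudo lvcreate --snapshot --size 50G \
  --name postgres-snapshot-$(date +%Y%m%d%H%M) \
  /dev/data-vg/postgres-lv

# The snapshot can then be mounted read-only and its contents shipped to
# another host -- in our case, loaded into the staging environment.
```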
To recover GitLab.com we decided to use the LVM snapshot created 6 hours before the outage, as it was our only option to reduce data loss as much as possible (the alternative was to lose almost 24 hours of data). The steps involved are described below.
For our staging environment we were using Azure classic, without Premium Storage. This is primarily done to save costs, as Premium Storage is quite expensive. As a result the disks are very slow, making them the main bottleneck in the restoration process. Because LVM snapshots are stored on the hosts they are taken for, we had two options to restore data: copy the snapshot data itself from staging to production and restore it there, or restore it on the staging host and copy the resulting PostgreSQL data directory over to production.
In both cases the amount of data to copy would be roughly the same. Since copying over and restoring the data directory would be easier we decided to go with this solution.
Copying the data from the staging host to the production host took around 18 hours. These are network disks throttled to a very low rate (around 60 Mbps), and there is no way to move from cheap storage to Premium Storage, so this was the performance we were going to get; there was no network or processor bottleneck, the bottleneck was in the drives. Once copied, we were able to restore the database (including webhooks) to the state it was in at January 31st, 17:20 UTC.
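To put the 18 hours in perspective: at roughly 60 Mbps you can move only about 27 GB per hour, so several hundred gigabytes of data directory inevitably takes the better part of a day. The copy itself is conceptually just a recursive transfer along these lines (paths assumed, not our exact command):

```bash
# Copy the restored PostgreSQL data directory from the staging host to the
# production database host; paths are assumptions based on an Omnibus layout.
rsync --archive --partial --info=progress2 \
  /var/opt/gitlab/postgresql/data/ \
  db1.cluster.gitlab.com:/var/opt/gitlab/postgresql/data/
```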
On February 1st at 17:00 UTC we managed to restore the GitLab.com database without webhooks. Restoring webhooks was done by creating a separate staging database using the LVM snapshot, but without triggering the removal of webhooks. This allowed us to generate a SQL dump of the table and import this into the restored GitLab.com database.
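Concretely, moving a single table between two databases like this can be done with a table-level dump and import. A minimal sketch; the web_hooks table name and the database name are assumptions about GitLab's schema rather than a copy of the commands we ran:

```bash
# On the second staging database (the one that still contains the webhooks):
pg_dump --data-only --table=web_hooks gitlabhq_production > web_hooks.sql

# On the restored production database: load the rows back in.
psql -d gitlabhq_production -f web_hooks.sql
```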
Around 18:00 UTC we finished the final restoration procedures such as restoring the webhooks and confirming everything was operating as expected.
In the spirit of transparency we kept track of progress and notes in a publicly visible Google document. We also streamed the recovery procedure on YouTube, with a peak viewer count of around 5000 (resulting in the stream being the #2 live stream on YouTube for several hours). The stream was used to give our users live updates about the recovery procedure. Finally we used Twitter (https://twitter.com/gitlabstatus) to inform those that might not be watching the stream.
The document in question was initially private to GitLab employees and contained the name of the engineer who accidentally removed the data. While the name was added by the engineer themselves (and they had no problem with this being public), we will redact names in future cases as other engineers may not be comfortable with their name being published.
Database data such as projects, issues, snippets, etc. created between January 31st 17:20 UTC and 23:30 UTC has been lost. Git repositories and Wikis were not removed as they are stored separately.
It's hard to estimate exactly how much data has been lost, but we estimate we lost at least 5,000 projects, 5,000 comments, and roughly 700 users. This only affected users of GitLab.com; self-managed instances and GitHost instances were not affected.
Since GitLab uses GitLab.com to develop GitLab the outage meant that for some it was harder to get work done. Most developers could continue working using their local Git repositories, but creating issues and such had to be delayed. To publish the blog post "GitLab.com Database Incident" we used a private GitLab instance we normally use for private/sensitive workflows (e.g. security releases). This allowed us to build and deploy a new version of the website while GitLab.com was unavailable.
We also have a public monitoring website located at https://dashboards.gitlab.com/. Unfortunately the current setup for this website was not able to handle the load it received during the outage. Fortunately our internal monitoring systems (which dashboards.gitlab.com is based on) were not affected.
To analyse the root cause of these problems we'll use a technique called "The 5 Whys". We'll break up the incident into 2 main problems: GitLab.com being down, and it taking a long time to restore GitLab.com.
Problem 1: GitLab.com was down for about 18 hours.
Why was GitLab.com down? The data directory of the primary database was removed by accident, and restoring it took a long time (see problem 2).
Why was the data directory removed? While trying to re-synchronise the broken replication process, an engineer intended to wipe the data directory on the secondary, but ran the command on the primary instead.
Problem 2: restoring GitLab.com took over 18 hours.
Why did restoring take so long? Our regular database backups generated using pg_dump were not working, so we had to fall back to copying an LVM snapshot from our slow staging environment.
Why were the pg_dump backups not working? The backup procedure uses pg_dump to perform a logical backup of the database. This procedure failed silently because it was using PostgreSQL 9.2, while GitLab.com runs on PostgreSQL 9.6.

We are currently working on fixing and improving our various recovery procedures. The work is split across a number of separate issues.
We are also working on setting up multiple secondaries and balancing the load amongst these hosts; more information on this can be found in the issue mentioned earlier, infrastructure#259.
Our main focus is to improve disaster recovery and to make it more obvious which host you're working on, rather than to prevent production engineers from running certain commands. For example, one could alias rm to something safer, but doing so would only protect against accidentally running rm -rf /important-data, not against disk corruption or any of the many other ways you can lose data.
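One simple example of making the host more obvious is to put the machine's role, in a hard-to-miss colour, straight into every shell prompt. A sketch of the idea, assuming bash login shells; this is not our actual configuration:

```bash
# /etc/profile.d/role-prompt.sh -- hypothetical example.
# Colour the prompt red on the production primary so it is hard to confuse
# with the secondary or with staging.
case "$(hostname -f)" in
  db1.cluster.gitlab.com) ROLE="PRODUCTION PRIMARY";   COLOR="\[\e[1;41m\]" ;;
  db2.cluster.gitlab.com) ROLE="production secondary"; COLOR="\[\e[1;33m\]" ;;
  *)                      ROLE="$(hostname -s)";       COLOR="\[\e[0m\]"    ;;
esac
export PS1="${COLOR}[${ROLE}] \u@\h:\w\[\e[0m\]\$ "
```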
An ideal environment is one in which you can make mistakes but easily and quickly recover from them with minimal to no impact. This in turn requires you to be able to perform these procedures on a regular basis, and make it easy to test and roll back any changes. For example, we are in the process of setting up procedures that allow developers to test their database migrations. More information on this can be found in the issue "Tool for executing and reverting Rails migrations on staging".
We're also looking into ways to build better recovery procedures for the entire GitLab.com infrastructure, and not just the database; and to ensure there is ownership of these procedures. The issue for this is "Disaster recovery for everything that is not the database".
On the monitoring side we also started working on a public backup monitoring dashboard, which can be found at https://dashboards.gitlab.com/dashboard/db/postgresql-backups. Currently this dashboard only contains data for our pg_dump backup procedure, but we aim to add more data over time.
One might notice that at the moment our pg_dump backups are 3 days old. We perform these backups on a secondary, as pg_dump can put quite a bit of pressure on a database. Since we are in the process of rebuilding our secondaries, the pg_dump backup procedure is suspended for the time being. Fear not, however: LVM snapshots are now taken every hour instead of once per 24 hours. Enabling Azure disk snapshots is something we're still looking into.
Finally, we're looking into improving our abuse reporting and response system. More information regarding this can be found in the issue "Removal of users by spam should not hard delete".
If you think there are additional measures we can take to prevent incidents like this please let us know in the comments.
Pushing to the default branch will automatically update the merge request so that it's aware of there not being any differences between the source and target branch. At this point you can safely close the merge request.
There are 3 options to resolve this:
Go to your project, then "Pipelines", "New Pipeline", use "master" as the branch, then create the pipeline. This will create and start a new pipeline using your master branch, which should result in your website being updated.
Most likely they were, but the database is not aware of this. To solve this, create a new pipeline using the right branch and run it.
Pushing new commits should automatically solve this. Alternatively you can try force pushing to the target branch.
Project details are stored in the database. This meant that this data was lost for projects created after 17:20. We ran a procedure to restore these projects based on their Git repositories that were still stored in our NFS cluster. This procedure however was only able to restore projects in their most basic form, without associated data such as issues and merge requests.