Published on February 1, 2017
5 min read
Update: please see our postmortem for this incident
Yesterday we had a serious incident with one of our databases. We lost six hours of database data (issues, merge requests, users, comments, snippets, etc.) for GitLab.com. Git/wiki repositories and self-managed installations were not affected. Losing production data is unacceptable and in a few days we'll publish a post on why this happened and a list of measures we will implement to prevent it happening again.
Update 6:14pm UTC: GitLab.com is back online
As of the time of writing, we're restoring data from a six-hour-old backup of our database. This means that any data between 5:20pm UTC and 11:25pm UTC from the database (projects, issues, merge requests, users, comments, snippets, etc.) is lost by the time GitLab.com is live again.
Git data (repositories and wikis) and self-managed instances of GitLab are not affected.
Read below for a brief summary of the events. You're also welcome to view our active postmortem doc.
At 2017/01/31 6pm UTC, we detected that spammers were hammering the database by creating snippets, making it unstable. We then started troubleshooting to understand what the problem was and how to fight it.
At 2017/01/31 9pm UTC, this escalated, causing a lockup on writes on the database, which caused some downtime.
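A lockup like this is typically diagnosed from pg_stat_activity. The snippet below is a minimal sketch of that kind of check on PostgreSQL 9.6; the connection details are illustrative, not our actual configuration:

```bash
# Sketch: list sessions stuck waiting on a lock (PostgreSQL 9.6).
# Host, user and database name are placeholders; adjust for your install.
psql -h localhost -U gitlab -d gitlabhq_production -c "
  SELECT pid, state, wait_event_type, wait_event, left(query, 60) AS query
  FROM pg_stat_activity
  WHERE wait_event_type = 'Lock';"
```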
At 2017/01/31 10pm UTC, we got paged because DB Replication lagged too far behind, effectively stopping. This happened because there was a spike in writes that were not processed on time by the secondary database.
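For readers unfamiliar with streaming replication, lag of this kind can be measured on the primary and on the standby with queries along the following lines. This is a sketch assuming PostgreSQL 9.6 and placeholder connection details:

```bash
# On the primary: how far each standby has replayed, per pg_stat_replication
# (9.6 column names; they became *_lsn in PostgreSQL 10).
psql -U gitlab -d gitlabhq_production -c "
  SELECT client_addr, state, sent_location, replay_location
  FROM pg_stat_replication;"

# On the standby: time elapsed since the last replayed transaction.
psql -U gitlab -d gitlabhq_production -c "
  SELECT now() - pg_last_xact_replay_timestamp() AS replication_delay;"
```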
The troubleshooting that followed:

- db2 is lagging behind by about 4 GB at this point
- db2.cluster refuses to replicate; /var/opt/gitlab/postgresql/data is wiped to ensure a clean replication
- db2.cluster refuses to connect to db1, complaining about max_wal_senders being too low. This setting is used to limit the number of WAL (= replication) clients
- max_wal_senders is set to 32 on db1, PostgreSQL is restarted
- max_connections is set to 2000 from 8000, PostgreSQL starts again (despite 8000 having been used for almost a year)
- db2.cluster still refuses to replicate, though it no longer complains about connections; instead it just hangs there not doing anything

At 2017/01/31 11pm-ish UTC, team-member-1 thinks that perhaps pg_basebackup is refusing to work due to the PostgreSQL data directory being present (despite being empty), and decides to remove the directory. After a second or two he notices he ran it on db1.cluster.gitlab.com, instead of db2.cluster.gitlab.com.
At 2017/01/31 11:27pm UTC, team-member-1 terminates the removal, but it's too late. Of around 300 GB only about 4.5 GB is left.
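For context, the replica re-seed that was being attempted corresponds roughly to the steps below. This is a hedged sketch rather than our exact runbook: the gitlab_replicator role, the omnibus paths, and the use of ALTER SYSTEM are assumptions, and both settings only take effect after a PostgreSQL restart.

```bash
# On the primary (db1): raise max_wal_senders, lower max_connections, restart.
# ALTER SYSTEM writes postgresql.auto.conf; these parameters need a restart.
psql -U gitlab -d gitlabhq_production -c "ALTER SYSTEM SET max_wal_senders = 32;"
psql -U gitlab -d gitlabhq_production -c "ALTER SYSTEM SET max_connections = 2000;"
sudo gitlab-ctl restart postgresql

# On the standby (db2): stop PostgreSQL, clear the data directory, and re-seed
# it from the primary. 'gitlab_replicator' is a placeholder replication role.
sudo gitlab-ctl stop postgresql
sudo rm -rf /var/opt/gitlab/postgresql/data/*   # double-check the hostname first!
sudo -u gitlab-psql /opt/gitlab/embedded/bin/pg_basebackup \
  -h db1.cluster.gitlab.com -U gitlab_replicator \
  -D /var/opt/gitlab/postgresql/data -X stream -P
sudo gitlab-ctl start postgresql
```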
We had to bring GitLab.com down and shared this information on Twitter:
We are performing emergency database maintenance, https://t.co/r11UmmDLDE will be taken offline
— GitLab.com Status (@gitlabstatus) January 31, 2017
Among the problems we encountered:

- pg_dump may be failing because PostgreSQL 9.2 binaries are being run instead of 9.6 binaries. This happens because omnibus only uses Pg 9.6 if data/PG_VERSION is set to 9.6, but on workers this file does not exist. As a result it defaults to 9.2, failing silently. No SQL dumps were made as a result.
- The Fog gem may have cleaned out older backups.

We're working on recovering right now by using a backup of the database from a staging database.
We accidentally deleted production data and might have to restore from backup. Google Doc with live notes https://t.co/EVRbHzYlk8
— GitLab.com Status (@gitlabstatus) February 1, 2017
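On the pg_dump problem noted above, a mismatch like this can be confirmed by comparing the data directory's PG_VERSION marker with the binary that actually runs. A sketch, assuming the standard omnibus layout under /opt/gitlab/embedded/bin (an assumption on workers without that layout):

```bash
# Which PostgreSQL version does the data directory declare?
# (On workers this file does not exist, which is the root of the problem.)
cat /var/opt/gitlab/postgresql/data/PG_VERSION

# Which pg_dump binary would actually run, and what version is it?
/opt/gitlab/embedded/bin/pg_dump --version

# What does the running server itself report?
psql -U gitlab -d gitlabhq_production -c "SHOW server_version;"
```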
- Back up db1.staging.gitlab.com data
- Mount db1.staging.gitlab.com on db1.cluster.gitlab.com
- Copy data from staging /var/opt/gitlab/postgresql/data/ to production /var/opt/gitlab/postgresql/data/
- nfs-share01 server commandeered as temp storage place in /var/opt/gitlab/db-meltdown
- Remaining production data, including pg_xlog, tar'ed up as 20170131-db-meltodwn-backup.tar.gz
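In shell terms, the copy and the safety tarball amount to something like the sketch below. This is simplified and not the exact sequence or commands we ran; the /mnt/staging-db1 mount point and the rsync/tar flags are illustrative, and only the paths mentioned above are real.

```bash
# Preserve whatever production data survived, including pg_xlog, in the
# temp storage area on the commandeered NFS share, before anything else.
sudo mkdir -p /var/opt/gitlab/db-meltdown
sudo tar -czf /var/opt/gitlab/db-meltdown/20170131-db-meltodwn-backup.tar.gz \
  /var/opt/gitlab/postgresql/data/pg_xlog

# With PostgreSQL stopped on db1.cluster.gitlab.com, copy the staging data
# directory into place. /mnt/staging-db1 stands in for the mounted staging volume.
sudo gitlab-ctl stop postgresql
sudo rsync -a --delete /mnt/staging-db1/var/opt/gitlab/postgresql/data/ \
                       /var/opt/gitlab/postgresql/data/
sudo gitlab-ctl start postgresql
```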
Below is a graph showing the time of deletion and the subsequent copying in of data.