GitLab.com database incident

Feb 1, 2017 · 5 min read

Update: please see our postmortem for this incident

Yesterday we had a serious incident with one of our databases. We lost six hours of database data (issues, merge requests, users, comments, snippets, etc.) for GitLab.com. Git/wiki repositories and self-managed installations were not affected. Losing production data is unacceptable, and in a few days we'll publish a post on why this happened and a list of measures we will implement to prevent it from happening again.

Update 6:14pm UTC: GitLab.com is back online

As of the time of writing, we're restoring data from a six-hour-old backup of our database. This means that any database data (projects, issues, merge requests, users, comments, snippets, etc.) created between 5:20pm UTC and 11:25pm UTC will be lost by the time GitLab.com is live again.

Git data (repositories and wikis) and self-managed instances of GitLab are not affected.

Read below for a brief summary of the events. You’re also welcome to view our active postmortem doc.

First incident

At 2017/01/31 6pm UTC, we detected that spammers were hammering the database by creating snippets, making it unstable. We then started troubleshooting to understand what the problem was and how to fight it.

At 2017/01/31 9pm UTC, the problem escalated, resulting in a lockup on database writes, which caused some downtime.

Actions taken

Second incident

At 2017/01/31 10pm UTC, we got paged because DB replication lagged too far behind, effectively stopping. This happened because of a spike in writes that the secondary database could not process in time.
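
For context, here is a minimal sketch of how streaming-replication lag can be checked on a PostgreSQL 9.x standby. This is illustrative only, not our actual monitoring; the connection string is a placeholder.

```python
# Illustrative only: check streaming-replication lag on a PostgreSQL 9.x standby.
# The connection string is a placeholder; this is not GitLab's actual monitoring.
import psycopg2

def replay_lag_seconds(dsn):
    """Seconds the standby's WAL replay lags behind the last replayed commit."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            # pg_last_xact_replay_timestamp() returns the commit time of the last
            # transaction replayed on this standby (NULL on a primary).
            cur.execute(
                "SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())"
            )
            (lag,) = cur.fetchone()
    return lag  # None if nothing has been replayed yet

if __name__ == "__main__":
    lag = replay_lag_seconds("host=db2.cluster.gitlab.com dbname=postgres")
    print("replication lag (seconds):", lag)
```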

Actions taken

Third incident

At 2017/01/31 11pm-ish UTC, team-member-1 thinks that pg_basebackup is refusing to work because the PostgreSQL data directory is present (despite being empty) and decides to remove the directory. After a second or two he notices he ran it on db1.cluster.gitlab.com instead of db2.cluster.gitlab.com.

At 2017/01/31 11:27pm UTC, team-member-1 terminates the removal, but it's too late. Of around 300 GB, only about 4.5 GB is left.
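
To illustrate the failure mode (a destructive command running on the primary instead of the intended secondary), here is a hypothetical host-check guard. It is a sketch, not part of our tooling, and the hostname constant and data-directory path are assumptions.

```python
# Hypothetical guard, not part of GitLab's runbooks: refuse to wipe the PostgreSQL
# data directory unless we are on the host we intended to re-seed.
import shutil
import socket
import sys

INTENDED_HOST = "db2.cluster.gitlab.com"        # the secondary we meant to rebuild
DATA_DIR = "/var/opt/gitlab/postgresql/data"    # placeholder path

def wipe_data_dir():
    actual = socket.getfqdn()
    if actual != INTENDED_HOST:
        # Running on the wrong machine (e.g. db1, the primary): abort loudly.
        sys.exit(f"refusing to remove {DATA_DIR}: on {actual}, expected {INTENDED_HOST}")
    shutil.rmtree(DATA_DIR)

if __name__ == "__main__":
    wipe_data_dir()
```

Even a trivial check along these lines turns running the removal on the wrong host into a no-op.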

We had to bring GitLab.com down and shared this information on Twitter.

Problems encountered

Recovery

We're working on recovery right now, using a backup of the database taken from our staging database.

Below is a graph showing the time of the deletion and the subsequent copying in of data.

Also, we'd like to thank everyone for the amazing support we've received on Twitter and elsewhere through #hugops.
