GitLab.com outage on 2015-05-29

Jacob Vosmaer · Jun 4, 2015

GitLab.com suffered an outage from 2015-05-29 01:00 to 2015-05-29 02:34 (times in UTC). In this blog post we will discuss what happened, why it took so long to recover the service, and what we are doing to reduce the likelihood and impact of such incidents.

Background

GitLab.com is provided and maintained by the team of GitLab B.V., the company behind GitLab. On 2015-05-02 we performed a major infrastructure upgrade, moving from a single server to a small cluster of servers, consisting of a load balancer (running HAProxy), three workers (NGINX/Unicorn/Sidekiq/gitlab-shell) and a backend server (PostgreSQL/Redis/NFS). This new infrastructure configuration improved the responsiveness of GitLab.com, at the expense of having more moving parts.

GitLab.com is backed up using Amazon EBS snapshots. To protect against inconsistent snapshots, our backup script 'freezes' the filesystem on the backend server with fsfreeze prior to making EBS snapshots, and 'unfreezes' the filesystem immediately after.
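The freeze/snapshot/unfreeze sequence described above can be sketched as a small shell function. The mount point, volume ID, and function name below are illustrative assumptions, not GitLab's actual backup script:

```shell
# Hypothetical sketch of a freeze-then-snapshot backup step.
# Mount point and volume ID are placeholders, not real values.
ebs_snapshot_backup() {
  fs_mount=$1
  volume_id=$2
  # Block new writes and flush dirty pages so the snapshot is consistent.
  fsfreeze --freeze "$fs_mount" || return 1
  aws ec2 create-snapshot --volume-id "$volume_id" \
      --description "backend backup $(date -u +%F)"
  status=$?
  # Unfreeze immediately, even if the snapshot call failed: a filesystem
  # left frozen blocks every write on the backend server.
  fsfreeze --unfreeze "$fs_mount"
  return $status
}
```

For example: `ebs_snapshot_backup /var/opt/gitlab vol-0abc123`. The important design point is that the unfreeze runs unconditionally; a script that exits between freeze and unfreeze leaves the filesystem blocking all writes.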



Root causes

Although we cannot explain what went wrong with the backup script, it is hard to come to any conclusion other than that something did go wrong with it.

The length of the outage was caused by insufficient training and documentation for our on-call engineers following the infrastructure upgrade rolled out on May 2nd.

Next steps

We have removed the freeze/unfreeze steps from our backup script. Because this (theoretically) increases the risk of occasional corrupt backups, we have added a second backup strategy for our SQL data. In the future we would like to have automatic validation of our backups.
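A dump-based second backup path for the SQL data, with a cheap validation step, might look like the sketch below. The function name, database name, and output path are assumptions for illustration, not GitLab.com's actual configuration:

```shell
# Hypothetical second backup path for the SQL data, independent of the
# EBS snapshots. Names and paths are placeholders.
sql_backup_and_verify() {
  db_name=$1
  dump_file=$2
  # Custom-format dump; pg_dump reads a consistent database snapshot on
  # its own, so no filesystem freeze is needed for this path.
  pg_dump --format=custom --file="$dump_file" "$db_name" || return 1
  # Cheap validation: pg_restore --list fails if the dump is unreadable,
  # a first step toward automatic backup verification.
  pg_restore --list "$dump_file" > /dev/null
}
```

Because `pg_dump` works at the database level rather than the block level, a failure in this path cannot leave the backend filesystem frozen.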

The day before this incident we had decided that this training was our most important priority. We have started doing regular operations drills in one-on-one sessions with all of our on-call engineers.
