This post is inspired by this comment on Reddit, thanking us for improving the stability of GitLab.com. Thanks, hardwaresofton! Making GitLab.com ready for your mission-critical workloads has been top of mind for us for some time, and it's great to hear that users are noticing a difference.

Please note that the numbers in this post differ slightly from the Reddit post as the data has changed since that post.

We will continue to work hard on improving the availability and stability of the platform. Our current goal is to achieve 99.95 percent availability on GitLab.com. Look out for an upcoming post about how we're planning to get there.

GitLab.com stability before and after the migration

According to Pingdom, GitLab.com's availability for the year to date, up until the migration, was 99.68 percent, which equates to about 32 minutes of downtime per week on average.

Since the migration, our availability has improved greatly, although we have much less data to compare with than in Azure.

Availability Chart

Using data publicly available from Pingdom, here are some stats about our availability for the year to date:

Period | Mean time between outage events
Pre-migration (Azure) | 1.3 days
Post-migration (GCP) | 7.3 days
Post-migration (GCP), excluding 1st day | 12 days

This is great news: we're experiencing outages less frequently. What does this mean for our availability, and are we on track to achieve our goal of 99.95 percent?

Period | Availability | Downtime per week
Pre-migration (Azure) | 99.68% | 32 minutes
Post-migration (GCP) | 99.88% | 13 minutes
Target (not yet achieved) | 99.95% | 5 minutes

Dropping from 32 minutes per week average downtime to 13 minutes per week means we've experienced a 61 percent improvement in our availability following our migration to Google Cloud Platform.
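As a rough check on how an availability percentage maps to weekly downtime, here is a minimal Python sketch. The function and constant names are illustrative only, and the inputs are the rounded percentages from the table above, so the results can differ by a minute or so from the figures quoted in this post, which presumably come from unrounded Pingdom data.

```python
MINUTES_PER_WEEK = 7 * 24 * 60  # 10,080 minutes in a week


def downtime_minutes_per_week(availability_percent: float) -> float:
    """Convert an availability percentage into average weekly downtime in minutes."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * MINUTES_PER_WEEK


if __name__ == "__main__":
    # Rounded availability figures taken from the table above.
    for label, availability in [
        ("Pre-migration (Azure)", 99.68),
        ("Post-migration (GCP)", 99.88),
        ("Target", 99.95),
    ]:
        minutes = downtime_minutes_per_week(availability)
        print(f"{label}: {minutes:.1f} minutes of downtime per week")
```

The conversion is linear, so every additional 0.01 percentage point of availability is worth about one minute of uptime per week.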


What about the performance of GitLab.com since the migration?

Performance can be tricky to measure. In particular, averages are a terrible way of measuring performance, since they hide outlying values. One of the better ways to measure performance is with a latency histogram chart. To do this, we imported the GitLab.com access logs for July (for Azure) and September (for Google Cloud Platform) into Google BigQuery, then selected the 100 most popular endpoints for each month and categorised these as either API, web, git, long-polling, or static endpoints. Comparing these histograms side-by-side allows us to study how the performance of GitLab.com has changed since the migration.

Latency Histogram

In this histogram, higher values on the left indicate better performance. The right of the graph is the "tail", and the fatter the tail, the worse the user experience.

This graph shows us that with the move to GCP, more requests are completing within a satisfactory amount of time.
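To illustrate why a histogram says more than an average, here is a small Python sketch. The bucket boundaries and the two latency samples are made up for illustration and are not taken from our access logs. Both samples have the same mean, but only the histogram reveals that one of them has a slow tail.

```python
from bisect import bisect_left
from statistics import mean

# Hypothetical bucket upper bounds in milliseconds; the final bucket is open-ended.
BUCKET_BOUNDS_MS = [50, 100, 250, 500, 1000, 2500, 5000]


def latency_histogram(latencies_ms):
    """Return the fraction of requests that falls into each latency bucket."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    counts = [0] * (len(BUCKET_BOUNDS_MS) + 1)
    for latency in latencies_ms:
        # bisect_left finds the first bucket whose upper bound is >= latency.
        counts[bisect_left(BUCKET_BOUNDS_MS, latency)] += 1
    return [count / len(latencies_ms) for count in counts]


# Two made-up samples with an identical mean of 100 ms.
steady = [100] * 10         # every request takes 100 ms
spiky = [50] * 9 + [550]    # most requests are fast, one sits in the tail

if __name__ == "__main__":
    for name, sample in [("steady", steady), ("spiky", spiky)]:
        print(name, "mean:", mean(sample), "histogram:", latency_histogram(sample))
```

The mean cannot tell these two samples apart, while the histogram shows that a tenth of the requests in the second sample land in a much slower bucket.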

Here are two more graphs showing the difference for API and Git requests respectively.

API Latency Histogram

Git Latency Histogram

Why these improvements?

We chose Google Cloud Platform because we believe that Google offer the most reliable cloud platform for our workload, particularly as we move towards running in Kubernetes.

However, there are many other reasons unrelated to our change in cloud provider for these improvements to stability and performance.

Like any large SaaS site, GitLab.com is a large, complicated system, and attributing availability changes to individual changes is extremely difficult, but here are a few factors which may be affecting our availability and performance:

Reason #1: Our Gitaly Fleet on GCP is much more powerful than before

Gitaly is responsible for all Git access in the GitLab application. Before Gitaly, Git access occurred directly from within Rails workers. Because of the scale we run at, we require many servers serving the web application, and therefore, in order to share Git data between all workers, we relied on NFS volumes. Unfortunately this approach doesn't scale well, which led to us building Gitaly, a dedicated Git service.

Our upgraded Gitaly fleet

As part of the migration, we've opted to give our fleet of 24 Gitaly servers a serious upgrade. If the old fleet was the equivalent of a nice family sedan, the new fleet is like a pack of snarling muscle cars, ready to serve your Git objects.

Environment | Processor | Number of cores per instance | RAM per instance
Azure | Intel Xeon Ivy Bridge @ 2.40GHz | 8 | 55GB
GCP | Intel Xeon Haswell @ 2.30GHz | 32 | 118GB

Our new Gitaly fleet is much more powerful. This means that Gitaly can respond to requests more quickly, and deal better with unexpected traffic surges.

IO performance

As you can probably imagine, serving 225TB of Git data to roughly half-a-million active users a week is a fairly IO-heavy operation. Any performance improvements we can make to this will have a big impact on the overall performance of GitLab.com.

For this reason, we've focused on improving performance here too.

Environment | RAID | Volumes | Media | Filesystem | Performance
Azure | RAID 5 (LVM) | 16 | Magnetic | xfs | 5k IOPS, 200MB/s (per disk); 32k IOPS, 1280MB/s (volume group)
GCP | No RAID | 1 | SSD | ext4 | 60k read IOPS, 30k write IOPS; 800MB/s read, 200MB/s write

How does this translate into real-world performance? Here are average read and write times across our Gitaly fleet:

IO performance is much higher

Here are some comparative figures for our Gitaly fleet from Azure and GCP. In each case, the performance in GCP is much better than in Azure, although this is what we would expect given the more powerful fleet.

Disk read time graph

Disk write time graph

Disk queue length graph

Note: for Azure, these figures use the average times for the week leading up to the failover. For GCP, they use the average for the week up to October 2, 2018.

These stats clearly illustrate that our new fleet has far better IO performance than our old cluster. Gitaly performance is highly dependent on IO performance, so this is great news and goes a long way to explaining the performance improvements we're seeing.

Reason #2: Fewer "unicorn worker saturation" errors

HTTP 503 Status GitLab

Unicorn worker saturation sounds like it'd be a good thing, but it's really not!

We (currently) rely on unicorn, a Ruby/Rack HTTP server, for serving much of the application. Unicorn uses a single-threaded model with a fixed pool of worker processes. Each worker can handle only one request at a time. If the worker gives no response within 60 seconds, it is terminated and another process is spawned to replace it.

Add to this the lack of autoscaling technologies to ramp the fleet up when we experience high load volumes, and this means that GitLab.com has a relatively static-sized pool of workers to handle incoming requests.

If a Gitaly server experiences load problems, even fast RPCs that would normally take only milliseconds could take up to several seconds to respond – thousands of times slower than usual. Requests to the unicorn fleet that communicate with the slow server will take hundreds of times longer than expected. Eventually, most of the fleet is handling requests to that affected backend server. This leads to a queue which affects all incoming traffic, a bit like a tailback on a busy highway caused by a traffic jam on a single offramp.

If the request gets queued for too long – after about 60 seconds – the request will be cancelled, leading to a 503 error. This is indiscriminate – all requests, whether they interact with the affected server or not, will get cancelled. This is what I call unicorn worker saturation, and it's a very bad thing.
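The effect can be approximated with Little's law: the average number of busy workers equals the request rate multiplied by the mean time each request occupies a worker. The Python sketch below uses entirely hypothetical numbers (pool size, request rate, latencies, and the share of traffic that touches one Gitaly server); none of them are measurements from our fleet.

```python
def busy_workers(requests_per_second, fast_seconds, slow_seconds, slow_fraction):
    """Average number of workers busy at once, using Little's law (L = lambda * W)."""
    mean_service_seconds = (
        (1 - slow_fraction) * fast_seconds + slow_fraction * slow_seconds
    )
    return requests_per_second * mean_service_seconds


# Hypothetical figures, chosen only to illustrate the effect.
POOL_SIZE = 100            # fixed number of unicorn workers
REQUESTS_PER_SECOND = 500
FAST_SECONDS = 0.1         # typical request duration
SLOW_FRACTION = 0.05       # share of requests that touch the struggling server

healthy = busy_workers(REQUESTS_PER_SECOND, FAST_SECONDS, FAST_SECONDS, SLOW_FRACTION)
degraded = busy_workers(REQUESTS_PER_SECOND, FAST_SECONDS, 5.0, SLOW_FRACTION)

if __name__ == "__main__":
    print(f"healthy:  {healthy:.1f} of {POOL_SIZE} workers busy")
    print(f"degraded: {degraded:.1f} workers needed, pool is {POOL_SIZE}")
```

With these assumed numbers, a slowdown affecting only 5 percent of requests more than triples the demand for workers and pushes it past the size of the pool, at which point every request starts to queue, whichever backend it needs.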

Between February and August this year we frequently experienced this phenomenon.

There are several approaches we've taken to dealing with this:

Reason #3: GitLab.com no longer uses NFS for any Git access

In early September we disabled Git NFS mounts across our worker fleet. This was possible because Gitaly had reached v1.0: the point at which it's sufficiently complete. You can read more about how we got to this stage in our Road to Gitaly blog post.

Reason #4: Migration as a chance to reduce debt

The migration was a fantastic opportunity for us to improve our infrastructure, simplify some components, and otherwise make GitLab.com more stable and more observable. For example, we've rolled out new structured logging infrastructure.

As part of the migration, we took the opportunity to move much of our logging across to structured logs. We use fluentd, Google Pub/Sub, and Pubsubbeat, and store our logs in Elastic Cloud and Google Stackdriver Logging. Having reliable, indexed logs has allowed us to reduce our mean time to detection of incidents, and in particular to detect abuse. This new logging infrastructure has also been invaluable in detecting and resolving several security incidents.
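For readers unfamiliar with the term, a structured log is one where each entry is a machine-parseable record rather than free text, which is what makes indexing and querying practical. The following is a generic Python sketch of the idea using only the standard library. It is not our actual logging code, and the field names and the `fields` convention are assumptions made for this example.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record):
        payload = {
            "time": self.formatTime(record),
            "severity": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Hypothetical convention: callers pass structured data via extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("example")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.info("request finished", extra={"fields": {"status": 200, "duration_ms": 42}})
```

Because every entry carries named fields such as a status code and a duration, a log store can filter and aggregate on them directly instead of relying on pattern matching over text.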

We've also focused on making our staging environment much more similar to our production environment. This allows us to test more changes, more accurately, in staging before rolling them out to production. Previously the team was maintaining a limited, scaled-down staging environment and many changes were not adequately tested before being rolled out. Our environments now share a common configuration and we're working to automate all Terraform and Chef rollouts.

Reason #5: Process changes

Unfortunately many of the worst outages we've experienced over the past few years have been self-inflicted. We've always been transparent about these — and will continue to be so — but as we rapidly grow, it's important that our processes scale alongside our systems and team.

To address this, over the past few months we've formalized our change and incident management processes. These processes respectively help us to avoid outages and to resolve them more quickly when they do occur.

If you're interested in finding out more about the approach we've taken to these two vital disciplines, they're published in our handbook:

Reason #6: Application improvement

Every GitLab release includes performance and stability improvements; some of these have had a big impact on GitLab's stability and performance, particularly n+1 issues.

Take Gitaly for example: like other distributed systems, Gitaly can suffer from a class of performance degradations known as "n+1" problems. This happens when an endpoint needs to make many queries ("n") to fulfill a single request.

Consider an imaginary endpoint which queries Gitaly for all tags on a repository, and then issues an additional query for each tag to obtain more information. This results in n + 1 Gitaly queries: one for the initial list of tags, and then n for the individual tags. This approach works fine for a project with 10 tags, issuing 11 requests, but for a project with 1,000 tags it results in 1,001 Gitaly calls, each with a round-trip time, and issued in sequence.
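The imaginary endpoint can be sketched in Python as follows. `FakeGitalyClient` and its methods are hypothetical stand-ins invented for this example and do not correspond to Gitaly's real RPC interface. Batching is shown as one common remedy for n+1 problems, not necessarily the fix applied in any particular GitLab endpoint.

```python
class FakeGitalyClient:
    """Hypothetical stand-in for a Gitaly client that counts round trips."""

    def __init__(self, tags):
        self._tags = dict(tags)  # tag name -> tag details
        self.calls = 0

    def list_tag_names(self):
        self.calls += 1
        return list(self._tags)

    def get_tag(self, name):
        self.calls += 1
        return self._tags[name]

    def get_tags(self, names):
        """Batched lookup: one round trip regardless of how many tags are requested."""
        self.calls += 1
        return [self._tags[name] for name in names]


def tag_details_n_plus_one(client):
    # One call to list the tags, then one further call per tag.
    return [client.get_tag(name) for name in client.list_tag_names()]


def tag_details_batched(client):
    # One call to list the tags, then a single call for all the details.
    return client.get_tags(client.list_tag_names())


if __name__ == "__main__":
    tags = {f"v1.{i}": {"name": f"v1.{i}"} for i in range(1000)}
    for fetch in (tag_details_n_plus_one, tag_details_batched):
        client = FakeGitalyClient(tags)
        fetch(client)
        print(f"{fetch.__name__}: {client.calls} calls")
```

Both functions return the same data, but the number of round trips in the first grows with the number of tags while the second stays constant.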

Latency drop in Gitaly endpoints

Using data from Pingdom, this chart shows long-term performance trends since the start of the year. It's clear that latency improved a great deal on May 7, 2018. This date happens to coincide with the RC1 release of GitLab 10.8, and its deployment on GitLab.com.

It turns out that this was due to a single n+1 problem on the merge request page being resolved.

When running in development or test mode, GitLab now detects n+1 situations and we have compiled a list of known n+1s. As these are resolved we expect even more performance improvements.

GitLab Summit - South Africa - 2018

Reason #7: Infrastructure team growth and reorganization

At the start of May 2018, the Infrastructure team responsible for GitLab.com consisted of five engineers.

Since then, we've had a new director join the Infrastructure team, along with two new managers, a specialist Postgres DBRE, and four new SREs. The database team has been reorganized to be an embedded part of the Infrastructure group. We've also brought in Ongres, a specialist Postgres consultancy, to work alongside the team.

Having enough people in the team has allowed us to split time between on-call, tactical improvements, and longer-term strategic work.

Oh, and we're still hiring! If you're interested, check out our open positions and choose the Infrastructure Team 😀

TL;DR: Conclusion

  1. GitLab.com is more stable: availability has improved 61 percent since we migrated to GCP
  2. GitLab.com is faster: latency has improved since the migration
  3. We are totally focused on continuing these improvements, and we're building a great team to do it

One last thing: our Grafana dashboards are open, so if you're interested in digging into our metrics in more detail, go and explore!
