As Infrastructure Lead, my job is to make GitLab.com fast and highly available.
Lately, that's been a challenge. Why? We are reaching the point where scale starts to matter. For example, more than 2,000 new repos
are created during peak hours, and CI runners request new builds 3,000,000 times per hour.
It's an interesting problem to have. We have to store all of this data somewhere and make sure that
GitLab.com keeps performing well as we gain data and users.
A large part of the issue we're running into as we scale is that there is little or no documentation
on how to tackle this kind of problem. Some companies have written high-level posts about scaling, but almost none
have shared how they actually arrived at their solutions.
One of our main issues in the past six months has been around storage. We built a CephFS cluster to tackle both the capacity and
performance issues of using NFS appliances. Another, more recent issue is PostgreSQL vacuuming and how, under the right kind of load, it affects performance
to the point of locking up the database.
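For readers running into the same wall, vacuum behavior in PostgreSQL is largely governed by the autovacuum settings in `postgresql.conf`. The sketch below shows the main knobs involved; the values are illustrative assumptions for discussion, not our production configuration:

```ini
# postgresql.conf -- illustrative autovacuum tuning, not production values
autovacuum = on
autovacuum_max_workers = 6            # more workers so large tables don't starve small ones
autovacuum_naptime = 15s              # check for vacuum work more often than the 1min default
autovacuum_vacuum_scale_factor = 0.05 # vacuum after 5% of a table changes, instead of the 20% default
autovacuum_vacuum_cost_limit = 1000   # let each worker do more I/O per cost cycle
```

Lowering the scale factor tends to produce smaller, more frequent vacuums, which can reduce the chance that a single long-running vacuum collides with peak traffic.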
As outlined in our values, we believe we have a
responsibility to document this so other companies know what to do when they reach this point.
Last Thursday, I gave a GitLab.com infrastructure status report during our daily team call.
Watch the recording or download the slides to see how we're working through our challenges with scaling.