As Infrastructure Lead, my job is to make GitLab.com fast and highly available.
Lately, that's been a challenge. Why? We've hit the point where scale starts to matter. For example, over 2,000 new repos are created during peak hours, and CI runners request new builds 3,000,000 times per hour. It's an interesting problem to have. We have to store all of this data somewhere and make sure that, even as data and users grow, GitLab.com keeps performing well.
A large part of the problem we're running into as we scale is that there is little or no documentation on how to tackle it. Some companies have written high-level posts about scaling, but almost none have shared how they arrived at their solutions.
One of our main issues over the past six months has been storage. We built a CephFS cluster to tackle both the capacity and performance limitations of our NFS appliances. A more recent issue is PostgreSQL vacuuming, which can lock up the database given the right kind of load.
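As background on why vacuuming bites at our scale: PostgreSQL's autovacuum daemon only vacuums a table once its dead-tuple count passes a threshold derived from two settings, `autovacuum_vacuum_threshold` and `autovacuum_vacuum_scale_factor`. A minimal sketch of that trigger formula (the defaults shown are PostgreSQL's shipped values; the 10-million-row figure is just an illustration, not one of our tables):

```python
def autovacuum_trigger_point(reltuples, threshold=50, scale_factor=0.2):
    """Dead-tuple count at which autovacuum kicks in for a table.

    Mirrors PostgreSQL's documented formula:
      vacuum threshold = autovacuum_vacuum_threshold
                       + autovacuum_vacuum_scale_factor * reltuples
    Defaults match PostgreSQL's shipped settings.
    """
    return threshold + scale_factor * reltuples

# On a hypothetical 10-million-row table, roughly 2 million rows
# must be dead before autovacuum runs -- and by then the cleanup
# itself is heavy enough to hurt a busy database.
print(int(autovacuum_trigger_point(10_000_000)))
```

The practical consequence is that on large, hot tables the default scale factor lets an enormous amount of dead-tuple debt accumulate between vacuums, which is why tuning these settings per table matters once you have real load.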
As outlined in our values, we believe we have a responsibility to document this so other companies know what to do when they reach this point. Last Thursday, I gave a GitLab.com infrastructure status report during our daily team call. Watch the recording or download the slides to see how we're working through our challenges with scaling.