Published on: March 20, 2023
11 min read
Learn how we revamped our architecture for faster iteration and more efficiently maintained repositories.
Users get the most from Gitaly, the service responsible for the storage and maintenance of all Git repositories in GitLab, when the traffic hitting it is handled efficiently. We must therefore ensure that our Git repositories remain in a well-optimized state. For Git monorepositories, this maintenance is a complex task that can cause significant overhead by itself, because repository housekeeping becomes more expensive the larger a repository gets. This blog post explains in depth what we have done over the past few GitLab releases to rework our approach to repository housekeeping so that it scales better and keeps repositories in an optimized state that delivers the best performance for our users.
To ensure that Git repositories remain performant, Git regularly runs a set of maintenance tasks. On the client side, this usually happens by automatically running `git-gc(1)` periodically, which:

- compresses references into the `packed-refs` file
- packs loose objects into `packfiles`
- updates auxiliary data structures like `commit-graphs` that help to speed up queries against the Git repository

Git periodically runs `git gc --auto` automatically in the background, which analyzes your repository and only performs maintenance tasks if required.
At GitLab, we can't use this infrastructure because it does not give us enough control over which maintenance tasks are executed at what point in time. Furthermore, it does not give us full control over exactly which data structures we opt in to. Instead, we have implemented our own maintenance strategies that are specific to how GitLab works and catered to our specific needs. Unfortunately, the way GitLab implemented repository maintenance has been limiting us for quite a while now.
This post explains our previous maintenance strategy and its problems as well as how we revamped the architecture to allow us to iterate faster and more efficiently maintain repositories.
In the early days of GitLab, most of the application ran on a single server. On this single server, GitLab directly accessed Git repositories. For various reasons, this architecture limited us, so we created the standalone Gitaly server that provides a gRPC API to access Git repositories.
To migrate to exclusively accessing Git repository data through Gitaly, we gradually replaced direct Git repository access in the Rails codebase with Gitaly RPC calls.
While this was the easiest way to tackle the huge task back then, the end result was that there were still quite a few areas in the Rails codebase that relied on knowing how the Git repositories were stored on disk.
One such area was repository maintenance. In an ideal world, the Rails server would not need to know about the on-disk state of a Git repository. Instead, the Rails server would only care about the data it wants to get out of the repository or commit to it. Because of the Gitaly migration path we took, the Rails application was still responsible for executing fine-grained repository maintenance by calling certain RPCs:
- `Cleanup` to delete stale, temporary files that have accumulated
- `RepackIncremental` and `RepackFull` to either pack all loose objects into a new packfile, or alternatively to repack all packfiles into a single one
- `PackRefs` to compress all references into a single `packed-refs` file
- `WriteCommitGraph` to update the commit-graph
- `GarbageCollect` to perform various different tasks

These low-level details of repository maintenance were being managed by the client. But because clients didn't have any information about the on-disk state of the repository, they could not even determine which of these maintenance tasks had to be executed in the first place. Instead, we had a very simple heuristic: every few pushes, we ran one of the above RPCs to perform one of the maintenance tasks. While this heuristic worked, it wasn't great: maintenance was scheduled based on push counts alone, without any knowledge of whether the repository actually needed optimizing or which task would help.
It was clear that we needed to change the strategy we used for repository maintenance. Most importantly, we wanted to move the decision of which maintenance tasks to run, and when, out of the client and into Gitaly, where the on-disk state of the repository is known, and to improve our visibility into repository housekeeping.
As mentioned in the introduction, Git periodically runs git gc --auto
. This
command inspects the repository's state and performs optimizations only when it
finds that the repository is in a sufficiently bad state to warrant the cost.
While using this command directly in the context of Gitaly does not give us
enough flexibility, it did serve as the inspiration for our new architecture.
Instead of providing fine-grained RPCs to maintain various parts of a Git repository, we now provide a single RPC, `OptimizeRepository`, that works as a black box to the caller. Like `git gc --auto`, this RPC inspects the repository's on-disk state and only performs those maintenance tasks that are actually required.
Because we can analyze and use the on-disk state of the repository, we can be far more intelligent about repository maintenance compared to the previous strategy where we optimized some bits of the repository every few pushes.
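To make this concrete, here is a minimal sketch in Go of what such a heuristic-driven, black-box entry point could look like. All field names, helper functions, and thresholds are illustrative assumptions for this post, not Gitaly's actual implementation.

```go
package main

import (
	"fmt"
	"os/exec"
)

// repoInfo captures the on-disk state we inspect before deciding which
// maintenance tasks to run. Fields and thresholds are assumptions.
type repoInfo struct {
	looseObjects   uint64 // loose objects in objects/
	staleObjects   uint64 // loose objects older than the grace period
	looseRefs      uint64 // loose references under refs/
	hasCommitGraph bool   // whether commit-graph data exists
}

// git runs a Git subcommand against the given repository.
func git(repoPath string, args ...string) error {
	return exec.Command("git", append([]string{"-C", repoPath}, args...)...).Run()
}

// optimizeRepository is a black box to its caller: it inspects the
// repository state and only runs the tasks that are actually needed.
func optimizeRepository(repoPath string, info repoInfo) error {
	if info.looseObjects > 1024 { // assumed threshold
		if err := git(repoPath, "repack", "-d"); err != nil {
			return fmt.Errorf("incremental repack: %w", err)
		}
	}
	if info.staleObjects > 1024 { // assumed threshold
		if err := git(repoPath, "prune", "--expire=2.weeks.ago"); err != nil {
			return fmt.Errorf("prune: %w", err)
		}
	}
	if info.looseRefs > 256 { // assumed threshold
		if err := git(repoPath, "pack-refs", "--all"); err != nil {
			return fmt.Errorf("pack refs: %w", err)
		}
	}
	if !info.hasCommitGraph {
		return git(repoPath, "commit-graph", "write", "--reachable")
	}
	return nil
}

func main() {
	info := repoInfo{looseObjects: 2048, looseRefs: 300}
	if err := optimizeRepository("/path/to/repo.git", info); err != nil {
		fmt.Println("housekeeping failed:", err)
	}
}
```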
In the old-style repository maintenance, the client would call either `RepackIncremental` or `RepackFull` to either pack all loose objects into a new packfile, or to repack all objects into a single packfile.
By default, we would perform a full repack every five repacks. While this may be a good default for small repositories, it gets prohibitively expensive for huge monorepositories where a full repack may easily take several minutes.
The new heuristical maintenance strategy instead scales the allowed number of
packfiles
by the total size of all combined packfiles
. As a result, the
larger the repository becomes, the less frequently we perform a full repack.
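As an illustration, the following Go snippet sketches how such a size-scaled limit could be computed. The logarithm base and the minimum are assumed values chosen for demonstration, not Gitaly's actual constants.

```go
package main

import (
	"fmt"
	"math"
)

// allowedPackfiles returns how many packfiles we tolerate before
// triggering a full repack, scaled logarithmically with the combined
// packfile size. Base 1.3 and the minimum of 2 are assumptions.
func allowedPackfiles(totalPackfileBytes uint64) uint64 {
	sizeMB := float64(totalPackfileBytes) / (1024 * 1024)
	if sizeMB < 1 {
		sizeMB = 1
	}
	// Larger repositories tolerate more packfiles, so full repacks
	// become less frequent as the repository grows.
	return uint64(math.Max(2, math.Log(sizeMB)/math.Log(1.3)))
}

func main() {
	for _, size := range []uint64{10 << 20, 1 << 30, 100 << 30} {
		fmt.Printf("%6d MB -> up to %d packfiles\n", size>>20, allowedPackfiles(size))
	}
}
```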
In the past, clients would periodically call GarbageCollect
. In addition to
repacking objects, this RPC would also prune any objects that are unreachable
and that haven't been accessed for a specific grace period.
The new heuristical maintenance strategy scans through all loose objects that
exist in the repository. If the number of loose objects that have a modification
time older than two weeks exceeds a certain threshold, it spawns the
git prune
command to prune these objects.
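Here is a hedged Go sketch of that kind of check: it counts loose objects older than the grace period and, past an assumed threshold of 1024, shells out to `git prune`. The directory walk is simplified compared to a production implementation.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"time"
)

// countStaleLooseObjects walks the objects/ directory of a repository
// and counts loose objects whose modification time is older than the
// grace period. Loose objects live in two-character fan-out directories
// such as objects/ab/cdef....
func countStaleLooseObjects(repoPath string, gracePeriod time.Duration) (int, error) {
	cutoff := time.Now().Add(-gracePeriod)
	stale := 0
	objectsDir := filepath.Join(repoPath, "objects")
	entries, err := os.ReadDir(objectsDir)
	if err != nil {
		return 0, err
	}
	for _, fanout := range entries {
		// Skip pack/, info/ and anything that is not a fan-out directory.
		if !fanout.IsDir() || len(fanout.Name()) != 2 {
			continue
		}
		objects, err := os.ReadDir(filepath.Join(objectsDir, fanout.Name()))
		if err != nil {
			return 0, err
		}
		for _, obj := range objects {
			info, err := obj.Info()
			if err != nil {
				return 0, err
			}
			if info.ModTime().Before(cutoff) {
				stale++
			}
		}
	}
	return stale, nil
}

func main() {
	repo := "/path/to/repo.git"
	stale, err := countStaleLooseObjects(repo, 14*24*time.Hour)
	if err != nil {
		fmt.Println("scan failed:", err)
		return
	}
	if stale > 1024 { // assumed threshold
		if err := exec.Command("git", "-C", repo, "prune", "--expire=2.weeks.ago").Run(); err != nil {
			fmt.Println("prune failed:", err)
		}
	}
}
```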
In the past, clients would call PackRefs
to repack references into the
packed-refs
file.
Because the time to compress references scales with the size of the
packed-refs
file, the new heuristical maintenance strategy takes into account
both the size of the packed-refs
file and the number of loose references that
exist in the repository. If a ratio between these two figures is exceeded, we
compress the loose references.
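A simplified Go sketch of such a ratio check might look as follows; the factor of 100 bytes per loose reference is an assumption chosen for illustration, not Gitaly's actual constant.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// shouldPackRefs compares the number of loose references against the
// size of the packed-refs file.
func shouldPackRefs(repoPath string) (bool, error) {
	looseRefs := int64(0)
	refsDir := filepath.Join(repoPath, "refs")
	err := filepath.WalkDir(refsDir, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if !d.IsDir() {
			looseRefs++ // every regular file under refs/ is a loose ref
		}
		return nil
	})
	if err != nil {
		return false, err
	}

	// A missing packed-refs file simply counts as size zero.
	var packedRefsSize int64
	if info, err := os.Stat(filepath.Join(repoPath, "packed-refs")); err == nil {
		packedRefsSize = info.Size()
	}

	// The larger the packed-refs file already is, the more loose refs we
	// tolerate before paying the cost of rewriting the whole file.
	return looseRefs*100 > packedRefsSize, nil
}

func main() {
	pack, err := shouldPackRefs("/path/to/repo.git")
	if err != nil {
		fmt.Println("check failed:", err)
		return
	}
	fmt.Println("should pack refs:", pack)
}
```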
There are auxiliary data structures like commit-graphs
that are used by Git
to speed up various queries. With the new heuristical maintenance strategy,
Gitaly now automatically updates these as required, either when they are
deemed to be out-of-date, or when they are missing altogether.
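As a rough illustration, the following Go sketch checks for existing commit-graph data and writes a split commit-graph if none is found. A real heuristic would also detect out-of-date graphs, which this sketch omits.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

func main() {
	repo := "/path/to/repo.git"
	// A split commit-graph chain lives under objects/info/commit-graphs/,
	// a monolithic one at objects/info/commit-graph.
	_, chainErr := os.Stat(filepath.Join(repo, "objects", "info", "commit-graphs"))
	_, monoErr := os.Stat(filepath.Join(repo, "objects", "info", "commit-graph"))
	if chainErr == nil || monoErr == nil {
		return // commit-graph data exists; staleness checks omitted here
	}

	// Write an incremental, split commit-graph covering all reachable commits.
	cmd := exec.Command("git", "-C", repo, "commit-graph", "write",
		"--reachable", "--split", "--changed-paths")
	if err := cmd.Run(); err != nil {
		fmt.Println("writing commit-graph failed:", err)
	}
}
```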
We rolled out this new heuristical maintenance strategy to GitLab.com in March 2022. Initially, we only rolled it out for
gitlab-org/gitlab
, which is a
repository where maintenance performed particularly poorly in the past. You can
see the impact of the rollout in the following graph:
In this graph, you can see that:
- Until March 19, we used the legacy fine-grained RPC calls. We spent most of the time in `RepackFull`, followed by `RepackIncremental` and `GarbageCollect`.
- Because March 19 and 20 fell on a weekend, not much housekeeping happened.
- Early on March 21, we switched `gitlab-org/gitlab` to heuristical housekeeping using `OptimizeRepository`. Initially, there didn't seem to be much of an improvement: we spent about as much time maintaining this repository as before. However, this was caused by an inefficient heuristic: instead of pruning objects only when stale ones existed, we pruned whenever we saw too many loose objects.
- We deployed a fix for this bug on March 22, which led to a significant drop in the time spent optimizing this repository compared to before.
This demonstrated two things: the new heuristical strategy can drastically reduce the time spent on repository maintenance, and the improved observability allows us to quickly spot and fix inefficient heuristics.
We have subsequently rolled this out to all of GitLab.com, starting on March 29, 2022, with similar improvements. With this change, we more than halved the CPU load when performing repository optimizations.
While it is great that `OptimizeRepository` has managed to save us a lot of compute power, another goal was to improve visibility into repository housekeeping. More specifically, we wanted to observe how repository maintenance behaves globally across all of GitLab.com, and to gain insight into the on-disk state of each individual repository.
In order to improve global visibility, we expose a set of Prometheus metrics that allow us to observe important details about our repository maintenance. The following graphs show the optimizations performed in a 30-minute window on our production systems on GitLab.com:

- the optimizations that are being performed in general
- the average latency of each of these optimizations
- the kind of stale data we are cleaning up
To improve visibility into the state each repository is in, we have started to log structured data that includes all the relevant bits. A subset of the information it exposes is:

- the number of `packfiles` and their combined size
- the size of the `packed-refs` file
- the presence of `commit-graphs`, bitmaps, and other auxiliary data structures
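As a minimal sketch, assuming an illustrative schema (these field names are not Gitaly's actual log format), such a structured log entry could be produced like this:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// repositoryInfo mirrors the kind of state we log for each repository.
// The field names here are assumptions, not Gitaly's actual log schema.
type repositoryInfo struct {
	Packfiles      uint64 `json:"packfiles"`
	PackfilesSize  uint64 `json:"packfiles_size_bytes"`
	LooseObjects   uint64 `json:"loose_objects"`
	PackedRefsSize uint64 `json:"packed_refs_size_bytes"`
	LooseRefs      uint64 `json:"loose_refs"`
	HasCommitGraph bool   `json:"has_commit_graph"`
	HasBitmap      bool   `json:"has_bitmap"`
}

func main() {
	info := repositoryInfo{
		Packfiles:      7,
		PackfilesSize:  3 << 30, // 3 GiB
		LooseObjects:   412,
		PackedRefsSize: 18 << 20,
		LooseRefs:      95,
		HasCommitGraph: true,
		HasBitmap:      true,
	}
	entry, _ := json.Marshal(info)
	fmt.Println(string(entry)) // emitted alongside each optimization run
}
```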
This information is also exposed through Prometheus metrics. The resulting graphs expose important metrics about the on-disk state of our repositories.
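The sketch below shows how such metrics could be registered and updated with the Prometheus Go client. The metric names and label sets are illustrative, not Gitaly's actual metrics.

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Counts housekeeping tasks by task name and outcome.
	optimizationsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "housekeeping_tasks_total",
			Help: "Number of housekeeping tasks performed, by task and status.",
		},
		[]string{"task", "status"},
	)
	// Tracks how long each kind of housekeeping task takes.
	optimizationLatency = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "housekeeping_task_latency_seconds",
			Help:    "Latency of housekeeping tasks, by task.",
			Buckets: prometheus.ExponentialBuckets(0.01, 2, 12),
		},
		[]string{"task"},
	)
)

func main() {
	prometheus.MustRegister(optimizationsTotal, optimizationLatency)

	// Record a fictional full repack that took 2.5 seconds.
	optimizationsTotal.WithLabelValues("full_repack", "success").Inc()
	optimizationLatency.WithLabelValues("full_repack").Observe(2.5)

	http.Handle("/metrics", promhttp.Handler())
	_ = http.ListenAndServe(":9236", nil)
}
```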
Combining both the global and per-repository information allows us to easily observe how repository maintenance behaves during normal operations. But more importantly, it gives us meaningful data when rolling out new features that change the way repositories are maintained.
While the heuristical housekeeping is enabled by default starting with GitLab 15.8, it was already introduced with GitLab 14.10. If you want to use the new housekeeping strategy before upgrading to 15.8, you can opt in by setting the `optimized_housekeeping` feature flag.
You can do so via the `gitlab-rails` console:

```ruby
Feature.enable(:optimized_housekeeping)
```
While the new heuristical optimization strategy has been successfully battle-tested on GitLab.com for a while now, until recently it was not enabled by default for self-managed installations. This has finally changed with GitLab 15.8, where we have default-enabled the new heuristical maintenance strategy.
We are not done yet, though. Now that Gitaly is the only source of truth for how repositories are optimized, we are tracking further improvements to our maintenance strategy in epic 7443. Among other things, we want Gitaly to schedule repository maintenance on its own so that callers do not need to invoke `OptimizeRepository` at all anymore. So stay tuned!