was officially released on August 21, 2023, and included some
improvements from GitLab's Git team. Git is the foundation of
repository data at GitLab. GitLab's Git team works on new features, performance improvements, documentation improvements,
and growing the Git community. Often our contributions to Git have a
lot to do with the way we integrate Git into our services at GitLab.
We previously shared some of our improvements that were included in the
Git 2.41 release. Here are some highlights from the Git 2.42 release, and a
window into how we use Git on the server side at GitLab.
1. Prevent certain refs from being packed
This means that the Git objects and references that should be changed
by a Git operation are first kept in a log entry. Then, when all the
machines have agreed that the operation should proceed, the log entry
is applied so the corresponding Git objects and references are
actually added to the repositories on all the machines.
Need for temporary references
Between the time when a specific log entry is first written and when
it is applied, other log entries could be applied which could remove
some objects and references. It could happen that these objects and
references are needed to apply the specific log entry though.
So when we log an entry, we have to make sure that all the objects and
references that it needs to be properly applied will not be removed
until that entry is either actually applied or discarded.
The best way to make sure things are kept in Git is to create new Git
references pointing to these things. So we decided to use temporary
references for that purpose. They would be created when a log entry is
written, and then deleted when that entry is either applied or
discarded.
Git can store references in "loose" files, with one reference per
file, or in the packed-refs file, which contains many of them. The
git pack-refs command is used to pack some references from "loose"
files into the packed-refs file.
For reading a lot of references, the packed-refs file is very
efficient, but for writing or deleting a single reference, it is not
so efficient, as the whole packed-refs file has to be rewritten.
As temporary references are to be created and then deleted soon after,
storing them in the packed-refs file would not be efficient. It
would be better to store them in "loose" files.
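The difference between the two storage forms can be observed in a scratch repository. The following is only a minimal sketch: it assumes the default "files" reference backend, and the branch name demo and the commit identity are placeholders chosen for the illustration.

```shell
#!/bin/sh
set -eu

# Scratch repository with a single empty commit (identity is a placeholder).
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git -c user.name=Example -c user.email=example@example.com \
    commit -q --allow-empty -m "initial commit"

# A freshly created reference is stored as one "loose" file.
git update-ref refs/heads/demo HEAD
loose_before=no
[ -f .git/refs/heads/demo ] && loose_before=yes

# Packing moves it into the shared packed-refs file and prunes the loose file.
git pack-refs --all
loose_after=no
[ -f .git/refs/heads/demo ] && loose_after=yes
packed_after=no
grep -q ' refs/heads/demo$' .git/packed-refs && packed_after=yes

echo "loose before packing: $loose_before"
echo "loose after packing:  $loose_after"
echo "in packed-refs:       $packed_after"
```

Once the reference lives in packed-refs, deleting it means rewriting that whole file, which is the cost the temporary references are meant to avoid.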
The git pack-refs command had no way to be told precisely which refs
should be packed, though. By default it would repack all the
tags (which are refs in refs/tags/) and all the refs that are
already packed. With the --all option one could tell it to repack
all the refs except the hidden refs, broken refs, and symbolic refs,
but that was the only thing that could be controlled.
We decided to improve git pack-refs by adding two new options to it:
- --include <pattern>, which can be used to specify which refs should be packed
- --exclude <pattern>, which can be used to specify which refs should not be packed
John Cai, Gitaly:Git team engineering manager, implemented these options.
For example, if the refs managed by the write-ahead log are in
refs/wal/, it's now possible to exclude them from being moved into
the packed-refs file by using:
$ git pack-refs --exclude "refs/wal/*"
Details of the patch series, including discussions, can be found
2. Get machine-readable output from git cat-file --batch
Efficiently retrieving Git object information
In GitLab, we often retrieve Git object information. For example, when a
user navigates into the files and directories in a repository, we need
to get the content of the corresponding Git blobs and trees so that
we can show it.
In Gitaly, we use git cat-file to retrieve Git object information
from a Git repository. As it's a frequent operation, it needs to be
performed efficiently, so we use the batch modes of git cat-file
available through the --batch, --batch-check, and --batch-command
options.
In these modes, a pointer to a Git object can be repeatedly sent to
the standard input, called 'stdin', of a git cat-file command, while
the corresponding object information is read from the standard output,
called 'stdout', of the command. This way we don't need to launch a
git cat-file command for each object.
GitLab can keep, for example, a git cat-file --batch-command process
running in the background while feeding it commands like
info <object> or contents <object> through its stdin to
get either information about an object or its content.
Newlines in stdin, stdout, and filenames
The commands or pointers to Git objects that are sent through stdin
should be delimited using newline characters, and in the same way
git cat-file will use newline characters to delimit the information from
different Git objects in its output. This is a common shell practice
to make it easy to chain commands together. For example, one can
easily get the size (in bytes) of the last three commits on the current
branch using the following:
$ git log -3 --format='%H' | git cat-file --batch-check='%(objectsize)'
285
646
428
Sometimes, though, the pointer to a Git object can contain a filename
or a directory name, as such a pointer is allowed to be in the form
<branch>:<path>. For example, HEAD:Documentation is a valid
pointer to the blob or the tree corresponding to the Documentation
path on the current branch.
This used to be an issue because on some systems newline characters
are allowed in file or directory names. So the -z option was
introduced last year in Git 2.38 to allow users to change the input
delimiter in batch modes to the NUL character.
When the -z option was introduced, it wasn't considered useful to
change the output delimiter to be also the NUL character. This is
because only tree objects can contain paths and the internal format
of tree objects already uses NUL characters to delimit paths.
Unfortunately, it was overlooked that in case of an error the pointer
to the object is displayed in the error message:
$ echo 'HEAD:does-not-exist' | git cat-file --batch
HEAD:does-not-exist missing
As the error messages are printed along with the regular output of the
command on stdout, passing in an invalid pointer with a number of
newline characters in it could make it very difficult to parse the
output.
-Z comes to the rescue
Toon Claes, Gitaly senior engineer, initially worked on a
patch to just quote the pointer in the error message, but it was
decided in the Git mailing list discussions related to the patch that
it would be better to just create a new -Z option. This option would
change both the input and the output delimiter to the NUL character,
while the old -z option would be deprecated over time.
So Patrick Steinhardt, Gitaly staff engineer, implemented that new
-Z option.
3. Pass pseudo-options to git rev-list --stdin
In GitLab, we need to have different ways to compute the size of
Git-related content. For example, we need to know:
- how much disk space a repository is using
- how big a specific Git object is
- how much additional space on a repository is required by a
specific set of revisions (and the objects they reference)
Knowing "how much disk space a repository is using" is useful to
enforce repository-related quotas and is easy to get using regular
shell and OS features.
Size information about a specific Git object is useful to enforce
quotas related to maximum file size. It can be obtained using, for
example, git cat-file -s <object> or
echo <object> | git cat-file --batch-check='%(objectsize)',
as already seen above.
Computing the space required by a set of revisions is useful, too, as
forks can share Git content in what we call
and we want to discriminate how much content belongs to each forked
repository. git rev-list has a
for this purpose.
Passing arguments to git rev-list
git rev-list can take a number of different arguments and has a lot
of different options. It's a fundamental command to traverse commit
graphs, and it should be flexible enough to fulfill a lot of different
When repositories grow, they often store a lot of references and a lot
of files and directories, so there is often the need to pass a big
number of references or paths as arguments to the git rev-list
command. References and paths can be quite long, though.
To avoid hitting platform limits related to command line length, long
ago a --stdin mode was added that allowed users to pass revisions
and paths through stdin, instead of as command line
arguments. However, when that was implemented, it was not considered
necessary to allow options or pseudo-options, like --all, to be
passed through stdin.
This appeared to be a problem for GitLab, as for computing sizes for
forked repositories we needed some of the pseudo-options, and it would
have been intricate and possibly buggy to pass some of them and their
arguments as arguments on the command line while others were passed
through stdin.
To fix this issue, Patrick Steinhardt implemented a small patch series to
allow pseudo-options through stdin.
With it, in Git 2.42, one can now pass pseudo-options, like --all,
through stdin when the --stdin mode is used.
Details of the patch series, including discussions, can be found
4. Code and test improvements
While looking at some Git code, we are often tempted to modify nearby
code, either to change only its style when the code is ancient and it
would look better using Git's current code style, or to refactor it to
make it cleaner. This is why we sometimes send small patch series that
don't have a real GitLab-related purpose.
And here is
an example of a refactoring to clean up some code by Patrick Steinhardt.