Git 2.42 release: Here are four of our contributions in detail

Git 2.42 was officially released on August 21, 2023, and included some improvements from GitLab's Git team. Git is the foundation of repository data at GitLab. GitLab's Git team works on new features, performance improvements, documentation improvements, and growing the Git community. Often our contributions to Git have a lot to do with the way we integrate Git into our services at GitLab.

We previously shared some of our improvements that were included in the Git 2.41 release. Here are some highlights from the Git 2.42 release, and a window into how we use Git on the server side at GitLab.

1. Prevent certain refs from being packed

Write-ahead logging

In Gitaly, we want to use a write-ahead log to replicate Git operations on different machines.

This means that the Git objects and references that should be changed by a Git operation are first kept in a log entry. Then, when all the machines have agreed that the operation should proceed, the log entry is applied so the corresponding Git objects and references are actually added to the repositories on all the machines.

Need for temporary references

Between the time a specific log entry is first written and the time it is applied, other log entries could be applied, and those could remove some objects and references. Some of the removed objects and references might be exactly the ones needed to apply that specific log entry.

So when we log an entry, we have to make sure that all the objects and references that it needs to be properly applied will not be removed until that entry is either actually applied or discarded.

The best way to make sure things are kept in Git is to create new Git references pointing to these things. So we decided to use temporary references for that purpose. They would be created when a log entry is written, and then deleted when that entry is either applied or discarded.
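The pinning idea can be sketched with plain Git commands. This is a minimal illustration under assumptions, not Gitaly's implementation: the `refs/wal/entry-1` ref name and the throwaway repository are hypothetical, and `git prune` stands in for any operation that removes unreachable objects.

```shell
#!/bin/sh
# Sketch: a temporary ref keeps an otherwise unreachable object alive.
# The ref name refs/wal/entry-1 is hypothetical.
set -e
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# Write an object that nothing references yet.
oid=$(echo 'data needed by a pending log entry' | git hash-object -w --stdin)

# Pin it with a temporary ref, then prune unreachable loose objects.
git update-ref refs/wal/entry-1 "$oid"
git prune
if git cat-file -e "$oid" 2>/dev/null; then pinned=yes; else pinned=no; fi

# Release the pin once the entry is applied or discarded, then prune again.
git update-ref -d refs/wal/entry-1
git prune
if git cat-file -e "$oid" 2>/dev/null; then released=no; else released=yes; fi

echo "survived while pinned: $pinned, gone after release: $released"
```

The object survives pruning only while the temporary ref exists, which is the property the log entries rely on.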

Packed-refs performance

Git can store references in "loose" files, with one reference per file, or in the packed-refs file, which contains many of them. The git pack-refs command is used to pack some references from "loose" files into the packed-refs file.

For reading a lot of references, the packed-refs file is very efficient. For writing or deleting a single reference, it is much less efficient, because the whole packed-refs file has to be rewritten.

As temporary references are to be created and then deleted soon after, storing them in the packed-refs file would not be efficient. It would be better to store them in "loose" files.
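The two storage forms can be observed directly in a scratch repository. The sketch below assumes the traditional "files" reference backend, where loose refs live under `.git/refs/`. The `refs/wal/entry-1` name is hypothetical.

```shell
#!/bin/sh
# Sketch: a ref starts out as a loose file and moves into packed-refs
# after `git pack-refs --all`. Assumes the "files" ref backend.
set -e
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git config user.name Example
git config user.email example@example.com
git commit -q --allow-empty -m 'initial commit'

# A freshly created ref is stored as one file per ref.
git update-ref refs/wal/entry-1 HEAD
if [ -f .git/refs/wal/entry-1 ]; then before=loose; else before=other; fi

# Packing moves it into the shared packed-refs file and removes the loose file.
git pack-refs --all
if [ ! -e .git/refs/wal/entry-1 ] && grep -q ' refs/wal/entry-1$' .git/packed-refs
then after=packed; else after=other; fi

echo "before: $before, after: $after"
```

Once the ref is packed, deleting it means rewriting `.git/packed-refs`, which is the cost the temporary references should avoid.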

However, there was no way to tell the git pack-refs command precisely which refs should or should not be packed. By default it would repack all the tags (which are refs in refs/tags/) and all the refs that are already packed. With the --all option one could tell it to repack all the refs except the hidden refs, broken refs, and symbolic refs, but that was the only thing that could be controlled.

Improving `git pack-refs`

We decided to improve git pack-refs by adding two new options to it:

--include <pattern> which can be used to specify which refs should be packed
--exclude <pattern> which can be used to specify which refs should not be packed

John Cai, Gitaly:Git team engineering manager, implemented these options.

For example, if the refs managed by the write-ahead log are in refs/wal/, it's now possible to exclude them from being moved into the packed-refs file by using:

$ git pack-refs --exclude "refs/wal/*"

Details of the patch series, including discussions, can be found here.

2. Get machine-readable output from `git cat-file --batch`

Efficiently retrieving Git object information

In GitLab, we often retrieve Git object information. For example, when a user navigates into the files and directories in a repository, we need to get the content of the corresponding Git blobs and trees so that we can show it.

In Gitaly, we use git cat-file to retrieve Git object information from a Git repository. As it's a frequent operation, it needs to be performed efficiently, so we use the batch modes of git cat-file available through the --batch, --batch-check and --batch-command options.

In these modes, a pointer to a Git object can be repeatedly sent to the standard input ('stdin') of a git cat-file command, while the corresponding object information is read from the standard output ('stdout') of the command. This way we don't need to launch a new git cat-file command for each object.

GitLab can keep, for example, a git cat-file --batch-command process running in the background while feeding it commands like info <object> or contents <object> through its stdin to get either information about an object or its content.

Newlines in stdin, stdout, and filenames

The commands or pointers to Git objects that are sent through stdin should be delimited using newline characters, and in the same way git cat-file will use newline characters to delimit the information from different Git objects in its output. This is a common shell practice to make it easy to chain commands together. For example, one can easily get the size (in bytes) of the last three commits on the current branch using the following:

$ git log -3 --format='%H' | git cat-file --batch-check='%(objectsize)'
285
646
428

Sometimes, though, the pointer to a Git object can contain a filename or a directory name, as such a pointer is allowed to be in the form <branch>:<path>. For example HEAD:Documentation is a valid pointer to the blob or the tree corresponding to the Documentation path on the current branch.

This used to be an issue because on some systems newline characters are allowed in file or directory names. So the -z option was introduced last year in Git 2.38 to allow users to change the input delimiter in batch modes to the NUL character.

Error output

When the -z option was introduced, it wasn't considered useful to change the output delimiter to be also the NUL character. This is because only tree objects can contain paths and the internal format of tree objects already uses NUL characters to delimit paths.

Unfortunately, it was overlooked that in case of an error the pointer to the object is displayed in the error message:

$ echo 'HEAD:does-not-exist' | git cat-file --batch
HEAD:does-not-exist missing

As the error messages are printed along with the regular output of the command on stdout, passing in an invalid pointer with a number of newline characters in it could make it very difficult to parse the output.

-Z comes to the rescue

Toon Claes, Gitaly senior engineer, initially worked on a patch to just quote the pointer in the error message, but it was decided in the Git mailing list discussions related to the patch that it would be better to just create a new -Z option. This option would change both the input and the output delimiter to the NUL character, while the old -z option would be deprecated over time.

So Patrick Steinhardt, Gitaly staff engineer, implemented that new -Z option.

Details of the patch series, including discussions, can be found here and here.

3. Pass pseudo-options to `git rev-list --stdin`

Computing sizes

In GitLab, we need to have different ways to compute the size of Git related content. For example, we need to know:

how much disk space a repository is using
how big a specific Git object is
how much additional space on a repository is required by a specific set of revisions (and the objects they reference)

Knowing "how much disk space a repository is using" is useful to enforce repository-related quotas and is easy to get using regular shell and OS features.

Size information about a specific Git object is useful to enforce quotas related to maximum file size. It can be obtained using, for example, git cat-file -s <object> or echo <object> | git cat-file --batch-check='%(objectsize)' as already seen above.

Computing the space required by a set of revisions is useful, too, as forks can share Git content in what we call "pool repositories," and we want to discriminate how much content belongs to each forked repository. Fortunately, git rev-list has a --disk-usage option for this purpose.

Passing arguments to `git rev-list`

git rev-list can take a number of different arguments and has a lot of different options. It's a fundamental command to traverse commit graphs, and it should be flexible enough to fulfill a lot of different user needs.

When repositories grow, they often store a lot of references and a lot of files and directories, so there is often the need to pass a big number of references or paths as arguments to the command. References and paths can be quite long though.

To avoid hitting platform limits related to command line length, a --stdin mode was added long ago that allowed users to pass revisions and paths through stdin instead of as command line arguments. However, when that was implemented, it was not considered necessary to allow options or pseudo-options, like --not, --glob=..., or --all, to be passed through stdin.

This turned out to be a problem for GitLab: to compute sizes for forked repositories we needed some of the pseudo-options, and it would have been intricate and error-prone to pass some of them and their arguments on the command line while others were passed through stdin.

Allowing pseudo-options

To fix this issue, Patrick Steinhardt implemented a small patch series to allow pseudo-options through stdin.

With it, in Git 2.42, one can now pass pseudo-options, like --not, --glob=..., or --all through stdin when the --stdin mode is used.

Details of the patch series, including discussions, can be found here.

4. Code and test improvements

While looking at some Git code, we are often tempted to modify nearby code, either to change only its style when the code is ancient and it would look better using Git's current code style, or to refactor it to make it cleaner. This is why we sometimes send small patch series that don't have a real GitLab related purpose.

In Git 2.42, examples of code style improvements we made are the part 1 and part 2 test code modernization patches from John Cai.

And here is an example of a refactoring to clean up some code by Patrick Steinhardt.
