Git 2.42 release: Here are four of our contributions in detail

Git 2.42 was officially released on August 21, 2023, and included some

improvements from GitLab's Git team. Git is the foundation of

repository data at GitLab. GitLab's Git team works on new features, performance improvements, documentation improvements,

and growing the Git community. Often our contributions to Git have a

lot to do with the way we integrate Git into our services at

GitLab.

We previously shared some of our improvements that were included in the Git 2.41 release. Here are some highlights from the Git 2.42 release, and a

window into how we use Git on the server side at GitLab.

1. Prevent certain refs from being packed

Write-ahead logging

In Gitaly, we

want to use a write-ahead log

to replicate Git operations on different machines.

This means that the Git objects and references that should be changed

by a Git operation are first kept in a log entry. Then, when all the

machines have agreed that the operation should proceed, the log entry

is applied so the corresponding Git objects and references are

actually added to the repositories on all the machines.

Need for temporary references

Between the time when a specific log entry is first written and when

it is applied, other log entries could be applied which could remove

some objects and references. It could happen that these objects and

references are needed to apply the specific log entry though.

So when we log an entry, we have to make sure that all the objects and

references that it needs to be properly applied will not be removed

until that entry is either actually applied or discarded.

The best way to make sure things are kept in Git is to create new Git

references pointing to these things. So we decided to use temporary

references for that purpose. They would be created when a log entry is

written, and then deleted when that entry is either applied or

discarded.
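The effect of such a reference can be sketched in a throwaway repository. This is only an illustration, not Gitaly's implementation: the ref name `refs/wal/entry-1` and the commit message are hypothetical. An otherwise unreachable commit survives pruning while a reference points to it, and becomes prunable once the reference is deleted.

```shell
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q
git config user.name "Example"
git config user.email "example@example.com"

# Create a commit that no branch points to (empty tree, no parents).
tree=$(git mktree </dev/null)
oid=$(git commit-tree -m "pending change" "$tree")

# Temporary reference created when the (hypothetical) log entry is written.
git update-ref refs/wal/entry-1 "$oid"
git prune --expire=now
kept="no"
git cat-file -e "$oid" 2>/dev/null && kept="yes"
echo "kept while referenced: $kept"

# Reference deleted once the entry is applied or discarded.
git update-ref -d refs/wal/entry-1
git prune --expire=now
if git cat-file -e "$oid" 2>/dev/null; then pruned="no"; else pruned="yes"; fi
echo "prunable after deletion: $pruned"
```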

Packed-refs performance

Git can store references in "loose" files, with one reference per

file, or in the packed-refs file, which contains many of them. The

git pack-refs command is used to pack some references from "loose"

files into the packed-refs file.
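The two storage forms can be observed directly. The following sketch assumes a scratch repository that uses the default "files" reference backend; the tag name `v1.0` is just an example.

```shell
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q
git config user.name "Example"
git config user.email "example@example.com"
git commit -q --allow-empty -m "initial"
git tag v1.0

# Before packing: each ref is its own file under .git/refs/.
ls .git/refs/tags
loose_before=$(test -f .git/refs/tags/v1.0 && echo yes || echo no)

git pack-refs --all

# After packing: the loose file is gone and the ref is listed in .git/packed-refs.
loose_after=$(test -f .git/refs/tags/v1.0 && echo yes || echo no)
packed=$(grep -c 'refs/tags/v1.0' .git/packed-refs)
cat .git/packed-refs
```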

For reading a lot of references, the packed-refs file is very

efficient, but writing or deleting a single reference is much less

efficient, as it requires rewriting the whole packed-refs file.

As temporary references are to be created and then deleted soon after,

storing them in the packed-refs file would not be efficient. It

would be better to store them in "loose" files.

The git pack-refs command had no way to be told precisely which refs

should be packed or not though. By default it would repack all the

tags (which are refs in refs/tags/) and all the refs that are

already packed. With the --all option one could tell it to repack

all the refs except the hidden refs, broken refs, and symbolic refs,

but that was the only thing that could be controlled.

Improving `git pack-refs`

We decided to improve git pack-refs by adding two new options to it:

--include <pattern> which can be used to specify which refs should be packed
--exclude <pattern> which can be used to specify which refs should not be packed

John Cai, Gitaly:Git team engineering manager, implemented these options.

For example, if the refs managed by the write-ahead log are in

refs/wal/, it's now possible to exclude them from being moved into

the packed-refs file by using:


$ git pack-refs --exclude "refs/wal/*"

Details of the patch series, including discussions, can be found

here.

2. Get machine-readable output from `git cat-file --batch`

Efficiently retrieving Git object information

In GitLab, we often retrieve Git object information. For example, when a

user navigates into the files and directories in a repository, we need

to get the content of the corresponding Git blobs and trees so that

we can show it.

In Gitaly, we use git cat-file to retrieve Git object information

from a Git repository. As it's a frequent operation, it needs to be

performed efficiently, so we use the batch modes of git cat-file

available through the --batch, --batch-check and --batch-command

options.

In these modes, a pointer to a Git object can be repeatedly sent to

the standard input, called 'stdin', of a git cat-file command, while

the corresponding object information is read from the standard output,

called 'stdout', of the command. This way we don't need to launch a

new git cat-file command for each object.

GitLab can keep, for example, a git cat-file --batch-command process

running in the background while feeding it commands like

info <object> or contents <object> through its stdin to

get either information about an object or its content.

Newlines in stdin, stdout, and filenames

The commands or pointers to Git objects that are sent through stdin

should be delimited using newline characters, and in the same way

`git cat-file` will use newline characters to delimit the information from

different Git objects in its output. This is a common shell practice

to make it easy to chain commands together. For example, one can

easily get the size (in bytes) of the last three commits on the current

branch using the following:


$ git log -3 --format='%H' | git cat-file --batch-check='%(objectsize)'

285

646

428

Sometimes, though, the pointer to a Git object can contain a filename

or a directory name, as such a pointer is allowed to be in the form

<branch>:<path>. For example HEAD:Documentation is a valid

pointer to the blob or the tree corresponding to the Documentation

path on the current branch.

This used to be an issue because on some systems newline characters

are allowed in file or directory names. So the -z option was

introduced last year in Git 2.38 to allow users to change the input

delimiter in batch modes to the NUL character.

Error output

When the -z option was introduced, it wasn't considered useful to

also change the output delimiter to the NUL character. This is

because only tree objects can contain paths and the internal format

of tree objects already uses NUL characters to delimit paths.

Unfortunately, it was overlooked that in case of an error the pointer

to the object is displayed in the error message:


$ echo 'HEAD:does-not-exist' | git cat-file --batch

HEAD:does-not-exist missing

As the error messages are printed along with the regular output of the

command on stdout, passing in an invalid pointer with a number of

newline characters in it could make it very difficult to parse the

output.

-Z comes to the rescue

Toon Claes, Gitaly senior engineer, initially worked on a

patch to just quote the pointer in the error message, but it was

decided in the Git mailing list discussions related to the patch that

it would be better to just create a new -Z option. This option would

change both the input and the output delimiter to the NUL character,

while the old -z option would be deprecated over time.

So Patrick Steinhardt, Gitaly staff engineer, implemented that new -Z option.

Details of the patch series, including discussions, can be found

here

and here.

3. Pass pseudo-options to `git rev-list --stdin`

Computing sizes

In GitLab, we need to have different ways to compute the size of

Git-related content. For example, we need to know:

how much disk space a repository is using
how big a specific Git object is
how much additional space on a repository is required by a specific set of revisions (and the objects they reference)

Knowing "how much disk space a repository is using" is useful to

enforce repository-related quotas and is easy to get using regular

shell and OS features.

Size information about a specific Git object is useful to enforce

quotas related to maximum file size. It can be obtained using, for

example, git cat-file -s <object> or

echo <object> | git cat-file --batch-check='%(objectsize)'

as already seen above.

Computing the space required by a set of revisions is useful, too, as

forks can share Git content in what we call

"pool repositories,"

and we want to discriminate how much content belongs to each forked

repository. Fortunately, git rev-list has a --disk-usage option

for this purpose.

Passing arguments to `git rev-list`

git rev-list can take a number of different arguments and has a lot

of different options. It's a fundamental command to traverse commit

graphs, and it should be flexible enough to fulfill a lot of different

user needs.

When repositories grow, they often store a lot of references and a lot

of files and directories, so there is often the need to pass a big

number of references or paths as arguments to the

command. References and paths can be quite long though.

To avoid hitting platform limits related to command line length, long

ago, a --stdin mode was added that allowed users to pass revisions

and paths through stdin, instead of as command line

arguments. However, when that was implemented, it was not considered

necessary to allow options or pseudo-options, like --not,

--glob=..., or --all to be passed through stdin.

This turned out to be a problem for GitLab, as for computing sizes for

forked repositories we needed some of the pseudo-options, and it would

have been intricate and possibly buggy to pass some of them and their

arguments as arguments on the command line while others were passed

through stdin.

Allowing pseudo-options

To fix this issue, Patrick Steinhardt implemented a small patch series to

allow pseudo-options through stdin.

With it, in Git 2.42, one can now pass pseudo-options, like --not,

--glob=..., or --all through stdin when the --stdin mode is used.

Details of the patch series, including discussions, can be found

here.

4. Code and test improvements

While looking at some Git code, we are often tempted to modify nearby

code, either to change only its style when the code is ancient and it

would look better using Git's current code style, or to refactor it to

make it cleaner. This is why we sometimes send small patch series that

don't have a real GitLab-related purpose.

In Git 2.42, examples of code style improvements we made are the

part1

and

part2

test code modernization patches from John Cai.

And here is

an example of a refactoring to clean up some code by Patrick Steinhardt.


We want to hear from you

Enjoyed reading this blog post or have questions or feedback? Share your thoughts by creating a new topic in the GitLab community forum.
