The Git project recently released Git Version 2.45.0. Let's look at the highlights of this release, which includes contributions from GitLab's Git team and the wider Git community.
Reftables: A new backend for storing references
Every Git repository needs to track two basic data structures:
- The object graph that stores the data of your files, the directory structure, commit messages, and tags.
- References that are pointers into that object graph to associate specific objects with a more accessible name. For example, a branch is a reference whose name starts with a refs/heads/ prefix.
The on-disk format of how references are stored in a repository has remained largely unchanged since Git’s inception and is referred to as the "files" format. Whenever you create a reference, Git creates a so-called "loose reference" that is a plain file in your Git repository whose path matches the ref name. For example:
$ git init .
Initialized empty Git repository in /tmp/repo/.git/
# Updating a reference will cause Git to create a "loose ref". This loose ref is
# a simple file which contains the object ID of the commit.
$ git commit --allow-empty --message "Initial commit"
[main (root-commit) c70f266] Initial commit
$ cat .git/refs/heads/main
c70f26689975782739ef9666af079535b12b5946
# Creating a second reference will end up with a second loose ref.
$ git branch feature
$ cat .git/refs/heads/feature
c70f26689975782739ef9666af079535b12b5946
$ tree .git/refs
.git/refs/
├── heads
│ ├── feature
│ └── main
└── tags
3 directories, 2 files
Every once in a while, Git packs those references into a "packed" file format so that it becomes more efficient to look up references. For example:
# Packing references will create "packed" references, which are a sorted list of
# references. The loose reference does not exist anymore.
$ git pack-refs --all
$ cat .git/refs/heads/main
cat: .git/refs/heads/main: No such file or directory
$ cat .git/packed-refs
# pack-refs with: peeled fully-peeled sorted
c70f26689975782739ef9666af079535b12b5946 refs/heads/feature
c70f26689975782739ef9666af079535b12b5946 refs/heads/main
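To make the relationship between the two representations concrete, here is a minimal sketch of how a reference could be resolved by hand in the "files" format. The function name resolve_ref_by_hand is hypothetical, and the sketch deliberately ignores symbolic references and worktrees; it is an illustration, not how Git itself implements the lookup.

```shell
#!/bin/sh
# Sketch: resolve a reference by hand in the "files" format.
# resolve_ref_by_hand is a hypothetical helper, not a Git command.

resolve_ref_by_hand() {
    # $1 = path to the Git directory, $2 = full ref name (e.g. refs/heads/main)
    if [ -f "$1/$2" ]; then
        # A loose ref exists: it takes precedence over any packed entry.
        cat "$1/$2"
    elif [ -f "$1/packed-refs" ]; then
        # Otherwise search packed-refs, whose lines read "<object ID> <ref name>".
        awk -v ref="$2" '
            $2 == ref { print $1; found = 1; exit }
            END { exit !found }
        ' "$1/packed-refs"
    else
        return 1
    fi
}
```

Inside a repository, resolve_ref_by_hand .git refs/heads/main should print the same object ID as git rev-parse refs/heads/main, whether or not the reference has been packed.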
While this format is rather simple, it has limitations:
- In large monorepos with many references, we have started to hit scalability issues. Deleting references is especially inefficient because the entire “packed-refs” file must be rewritten to drop the deleted reference. In our largest repositories, this can lead to rewriting multiple gigabytes of data on every reference deletion.
- It is impossible to perform an atomic read of references without blocking concurrent writers because you have to read multiple files to figure out all references.
- It is impossible to perform an atomic write because it requires you to create or update multiple files, which cannot be done in a single step.
- Housekeeping of references does not scale well because you have to rewrite the full "packed-refs" file.
- Because loose references use the filesystem path as their name, they are subject to filesystem-specific behavior. For example, case-insensitive file systems cannot store references for which only the case differs.
To address these issues, Git v2.45.0 introduces a new "reftable" backend, which uses a new binary format to store references. This new backend has been in development for a very long time. It was first proposed by Shawn Pearce in July 2017 and was initially implemented in JGit. It is used extensively by the Gerrit project. In 2021, Han-Wen Nienhuys upstreamed a library into Git that allows it to read and write the reftable format.
The new "reftable" backend that we upstreamed in Git v2.45.0 now finally brings together the reftable library and Git such that it is possible to use the new format as storage backend in your Git repositories.
Assuming that you run at least Git v2.45.0, you can create new repositories with the "reftable" format by passing the --ref-format=reftable switch to either git-init(1) or git-clone(1). For example:
$ git init --ref-format=reftable .
Initialized empty Git repository in /tmp/repo/.git/
$ git rev-parse --show-ref-format
reftable
$ find .git/reftable/ -type f
.git/reftable/0x000000000001-0x000000000001-01b5e47d.ref
.git/reftable/tables.list
$ git commit --allow-empty --message "Initial commit"
$ find .git/reftable/ -type f
.git/reftable/0x000000000001-0x000000000001-01b5e47d.ref
.git/reftable/0x000000000002-0x000000000002-87006b81.ref
.git/reftable/tables.list
As you can see, the references are now stored in .git/reftable instead of in the .git/refs directory. The references and the reference logs are stored in “tables,” which are the files ending with .ref, whereas the tables.list file contains the list of all tables that are currently active. The technical details of how this works will be explained in a separate blog post. Stay tuned!
The “reftable” backend is supposed to be a drop-in replacement for the “files” backend. Hence, from a user’s perspective, everything should just work the same.
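One way to check the drop-in claim for yourself is to run identical commands against both backends and compare the results. The following is a minimal sketch, assuming Git v2.45.0 or newer; the helper names make_repo and list_refs, the temporary paths, and the example identity are all illustrative.

```shell
#!/bin/sh
# Sketch: run identical commands against a "files" and a "reftable" repository
# and compare the resulting reference names. Assumes Git v2.45.0 or newer.
# make_repo and list_refs are hypothetical helpers.

make_repo() {
    # $1 = ref format ("files" or "reftable"), $2 = directory to create
    git init --quiet --ref-format="$1" "$2" &&
    git -C "$2" -c user.name=Example -c user.email=example@example.com \
        commit --quiet --allow-empty --message "Initial commit" &&
    git -C "$2" branch feature
}

list_refs() {
    # Print the full name of every reference under refs/ in repository $1.
    git -C "$1" for-each-ref --format='%(refname)'
}

tmp=$(mktemp -d)   # throwaway parent directory
if make_repo files "$tmp/files" && make_repo reftable "$tmp/reftable"; then
    if [ "$(list_refs "$tmp/files")" = "$(list_refs "$tmp/reftable")" ]; then
        echo "both backends list the same references"
    else
        echo "reference lists differ"
    fi
else
    echo "this Git version does not appear to support --ref-format=reftable" >&2
fi
```

Only reference names are compared here, because the commit IDs differ between the two repositories unless author dates are pinned as well.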
This project was led by Patrick Steinhardt. Credit also goes to Shawn Pearce as original inventor of the format and Han-Wen Nienhuys as the author of the reftable library.
Better tooling for references
While the "reftable" format solves many of the issues we have, it also introduces some new issues. One of the most important issues is accessibility of the data it contains.
With the "files" backend, you can, in the worst case, use your regular Unix tools to inspect the state of references. Both the "packed" and the "loose" references contain human-readable data that one can easily make sense of. This is different with the "reftable" format, which is a binary format. Therefore, Git needs to provide all the necessary tooling to extract data from the new "reftable" format.
Listing all references
The first problem we had is that it is basically impossible to learn about all the references that a repository knows about. This is somewhat puzzling at first: you can create and modify references via Git, but it cannot exhaustively list all references that it knows about?
Indeed, the "files" backend can't. While it can trivially list all "normal" references that start with the refs/ prefix, Git also uses so-called pseudo refs. These are files that live directly in the root of the Git directory, for example .git/MERGE_HEAD. The problem is that those pseudo refs live next to other files that Git stores there, such as .git/config.
While some pseudo refs are well-known and thus easy to identify, there is in theory no limit to what references Git can write. Nothing stops you from creating a reference called "foobar".
For example:
$ git update-ref foobar HEAD
$ cat .git/foobar
f32633d4d7da32ccc3827e90ecdc10570927c77d
Now the problem that the "files" backend has is that it can only enumerate references by scanning through directories. So to figure out that .git/foobar is in fact a reference, Git would have to open the file and check whether it is formatted like a reference or not.
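To illustrate what such a content check would involve, here is a rough sketch that guesses which files in the root of a Git directory are references. The helper names looks_like_ref and list_root_ref_candidates are hypothetical, and the heuristic (a first line that is a 40- or 64-character hex object ID, or a "ref: " line) is an assumption made for illustration; it is not something Git does.

```shell
#!/bin/sh
# Sketch: guess which files in the root of a Git directory are references by
# inspecting their content. Both helpers are hypothetical; this is a heuristic
# for illustration, not Git's behavior.

looks_like_ref() {
    # A ref file holds either a hex object ID (40 characters for SHA-1,
    # 64 for SHA-256) or a symbolic ref of the form "ref: <target>".
    [ -f "$1" ] || return 1
    head -n 1 "$1" | grep -Eq '^([0-9a-f]{40}|[0-9a-f]{64})$|^ref: .+$'
}

list_root_ref_candidates() {
    # $1 = path to the Git directory (usually ".git")
    for path in "$1"/*; do
        if looks_like_ref "$path"; then
            basename "$path"
        fi
    done
}
```

Running list_root_ref_candidates .git after the git update-ref foobar HEAD call above should report foobar next to HEAD, at the cost of opening every file in the directory.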
On the other hand, the "reftable" backend trivially knows about all references that it contains: They are encoded in its data structures, so all it needs to do is to decode those references and return them. But because of the restrictions of the "files" backend, there is no tooling that would allow you to learn about all references that exist.
To address the issue, we upstreamed a new flag to git-for-each-ref(1) called --include-root-refs, which will cause it to also list all references that exist in the root of the reference naming hierarchy. For example:
$ git for-each-ref --include-root-refs
f32633d4d7da32ccc3827e90ecdc10570927c77d commit HEAD
f32633d4d7da32ccc3827e90ecdc10570927c77d commit MERGE_HEAD
f32633d4d7da32ccc3827e90ecdc10570927c77d commit refs/heads/main
For the "files" backend, this new flag is handled on a best-effort basis where we include all references that match a known pseudo ref name. For the "reftable" backend, we can simply list all references known to it.
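If you only care about the references outside the refs/ hierarchy, the output can be narrowed down with a small filter. This is a minimal sketch; root_refs_only is a hypothetical helper name, and the commented usage line assumes Git v2.45.0 or newer.

```shell
#!/bin/sh
# Sketch: keep only the references that live outside the refs/ hierarchy.
# root_refs_only is a hypothetical helper, not a Git command.

root_refs_only() {
    # Reads reference names on stdin, one per line, and drops everything
    # below refs/. "|| true" keeps the exit status zero when nothing is left.
    grep -v '^refs/' || true
}

# Usage inside a repository (assumes Git v2.45.0 or newer):
#   git for-each-ref --include-root-refs --format='%(refname)' | root_refs_only
```

For the repository shown above, this would leave HEAD and MERGE_HEAD.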
This project was led by Karthik Nayak.
Listing all reflogs
Whenever you update branches, Git, by default, tracks those branch updates in a so-called reflog. This reflog allows you to roll back changes to that branch in case you performed an unintended change and can thus be a very helpful tool.
With the "files" backend, those logs are stored in your .git/logs directory:
$ find .git/logs/ -type f
.git/logs/HEAD
.git/logs/refs/heads/main
In fact, listing files in this directory is the only way for you to learn which references actually have a reflog in the first place. This is a problem for the "reftable" backend, which stores those logs together with the references. Consequently, when you use the "reftable" format, there is no longer any way to learn which reflogs exist in the repository.
This is not really the fault of the "reftable" format though, but an omission in the tooling that Git provides. To address the omission, we introduced a new list subcommand for git-reflog(1) that allows you to list all existing reflogs:
$ git reflog list
HEAD
refs/heads/main
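For comparison, the path-based approach that only works with the "files" backend can be sketched as follows. The helper name list_reflogs_from_files is hypothetical, and the sketch ignores additional worktrees.

```shell
#!/bin/sh
# Sketch: derive reflog names from file paths, which only works with the
# "files" backend. list_reflogs_from_files is a hypothetical helper.

list_reflogs_from_files() {
    # $1 = path to the Git directory (usually ".git")
    [ -d "$1/logs" ] || return 0
    # Every regular file below logs/ is a reflog; its path relative to
    # logs/ is the name of the reference it belongs to.
    (cd "$1/logs" && find . -type f | sed 's|^\./||' | LC_ALL=C sort)
}
```

In a "files" repository, list_reflogs_from_files .git should name the same reflogs as git reflog list, whereas in a "reftable" repository there is no .git/logs directory to scan.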
This project was led by Patrick Steinhardt.
More efficient packing of references
To stay efficient, Git repositories need regular maintenance. Usually, this maintenance is triggered by various Git commands that write data into the Git repositories by executing git maintenance run --auto. This command only optimizes data structures that actually need to be optimized so that Git doesn’t waste compute resources.
One data structure that gets optimized by Git's maintenance is the reference database, which is done by executing git pack-refs --all. For the "files" backend, this means that all references get repacked into the "packed-refs" file and the loose references get deleted, whereas for the "reftable" backend all the tables will get merged into a single table.
For the "files" backend, we cannot reasonably do much better. Given that we have to rewrite the whole "packed-refs" file anyway, it makes sense that we would want to pack all loose references.
But for the "reftable" backend this is suboptimal as the "reftable" backend is self-optimizing. Whenever Git appends a new table to the "reftable" backend, it will perform auto-compaction and merge tables together as needed. Consequently, the reference database should always be in a well-optimized state and thus merging all tables together is a wasted effort.
In Git v2.45.0, we thus introduced a new git pack-refs --auto mode, which asks the reference backend to optimize on an as-needed basis. While the "files" backend continues to work the same even with the --auto flag set, the "reftable" backend will use the same heuristics as it already uses for its auto-compaction. In practice, this should be a no-op in most cases.
Furthermore, git maintenance run --auto has been adapted to pass the --auto flag to git-pack-refs(1) to make use of this new mode by default.
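To observe the new mode, you can count the active tables before and after running it. The following is a minimal sketch, assuming Git v2.45.0 or newer and assuming that tables.list names one active table per line; count_tables, the temporary repository, and the example identity are illustrative.

```shell
#!/bin/sh
# Sketch: count the tables of a "reftable" repository before and after
# `git pack-refs --auto`. Assumes Git v2.45.0 or newer.
# count_tables is a hypothetical helper.

count_tables() {
    # $1 = repository path. Assumption: tables.list names one active
    # table per line, so counting non-empty lines counts tables.
    grep -c . "$1/.git/reftable/tables.list"
}

repo=$(mktemp -d)   # throwaway repository
before=0
after=0
if git init --quiet --ref-format=reftable "$repo" 2>/dev/null; then
    git -C "$repo" -c user.name=Example -c user.email=example@example.com \
        commit --quiet --allow-empty --message "Initial commit"
    git -C "$repo" branch feature
    before=$(count_tables "$repo")
    # Ask the backend to optimize only if its own heuristics say so.
    git -C "$repo" pack-refs --auto
    after=$(count_tables "$repo")
    echo "tables before: $before, after: $after"
else
    echo "this Git version does not support the reftable backend" >&2
fi
```

The table count should never grow as a result of this command. In a repository this small, the two numbers may well be identical, which matches the expectation that the command is a no-op in most cases.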
This project was led by Patrick Steinhardt.
Read more
This blog post focused on the new "reftable" backend, which allows us to scale better in large repositories with many references, and on the related tooling that we introduced alongside it. Of course, the wider Git community has also contributed various performance improvements, bug fixes, and smaller features to this release. You can learn about these from the official release announcement of the Git project.