UPDATE 2019-08-06: This bug has now been resolved in the following distributions:
- Red Hat Enterprise Linux 7
- Ubuntu
- Linux mainline: Backported to 4.14-stable and 4.19-stable
On Sep. 14, the GitLab support team escalated a critical
problem encountered by one of our customers: GitLab would run fine for a
while, but after some time users encountered errors. When attempting to
clone certain repositories via Git, users would see an opaque Stale file error message. The error message persisted for a long time,
blocking employees from being able to work, unless a system
administrator intervened manually by running ls in the directory
itself.
Thus launched an investigation into the inner workings of Git and the Network File System (NFS). The investigation uncovered a bug with the Linux v4.0 NFS client and culminated with a kernel patch that was written by Trond Myklebust and merged in the latest mainline Linux kernel on Oct. 26.
This post describes the journey of investigating the issue and details the thought process and tools by which we tracked down the bug. It was inspired by the fine detective work in How I spent two weeks hunting a memory leak in Ruby by Oleg Dashevskii.
More importantly, this experience exemplifies how open source software debugging has become a team sport that involves expertise across multiple people, companies, and locations. The GitLab motto "everyone can contribute" applies not only to GitLab itself, but also to other open source projects, such as the Linux kernel.
Reproducing the bug
While we have run NFS on GitLab.com for many years, we have stopped using it to access repository data across our application machines. Instead, we have abstracted all Git calls to Gitaly. Still, NFS remains a supported configuration for our customers who manage their own installation of GitLab, but we had never seen the exact problem described by the customer before.
Our customer gave us a few important clues:
- The full error message read "fatal: Couldn't read ./packed-refs: Stale file handle."
- The error seemed to start when they started a manual Git garbage collection run via git gc.
- The error would go away if a system administrator ran ls in the directory.
- The error would also go away after the git gc process ended.
The first two items seemed obviously related. When you push to a branch
in Git, Git creates a loose reference, a fancy name for a file that
points your branch name to the commit. For example, a push to master
will create a file called refs/heads/master in the repository:
$ cat refs/heads/master
2e33a554576d06d9e71bfd6814ee9ba3a7838963
git gc has several jobs, but one of them is to collect these loose
references (refs) and bundle them up into a single file called
packed-refs. This makes things a bit faster by eliminating the need to
read lots of little files in favor of reading one large one. After
running git gc, packed-refs might look like:
# pack-refs with: peeled fully-peeled sorted
564c3424d6f9175cf5f2d522e10d20d781511bf1 refs/heads/10-8-stable
edb037cbc85225261e8ede5455be4aad771ba3bb refs/heads/11-0-stable
94b9323033693af247128c8648023fe5b53e80f9 refs/heads/11-1-stable
2e33a554576d06d9e71bfd6814ee9ba3a7838963 refs/heads/master
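To make the format concrete, here is a minimal sketch that reads such a file into a mapping from ref name to commit ID. The function name parse_packed_refs is hypothetical, and this is not how Git itself parses the file; it simply skips the header comment and any peeled-tag lines (the ones that start with ^).

```python
def parse_packed_refs(text):
    """Return {ref_name: commit_id} from packed-refs content (illustrative only)."""
    refs = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, the "# pack-refs with:" header, and "^<id>" peeled-tag lines.
        if not line or line.startswith("#") or line.startswith("^"):
            continue
        commit_id, ref_name = line.split(" ", 1)
        refs[ref_name] = commit_id
    return refs


# Sample content copied from the packed-refs listing above.
SAMPLE = """\
# pack-refs with: peeled fully-peeled sorted
564c3424d6f9175cf5f2d522e10d20d781511bf1 refs/heads/10-8-stable
edb037cbc85225261e8ede5455be4aad771ba3bb refs/heads/11-0-stable
94b9323033693af247128c8648023fe5b53e80f9 refs/heads/11-1-stable
2e33a554576d06d9e71bfd6814ee9ba3a7838963 refs/heads/master
"""

refs = parse_packed_refs(SAMPLE)
print(refs["refs/heads/master"])
```

A single read of this one file answers the same question that would otherwise need one small file read per branch.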
How exactly is this packed-refs file created? To answer that, we ran
strace git gc with a loose ref present. Here are the pertinent lines
from that:
28705 open("/tmp/libgit2/.git/packed-refs.lock", O_RDWR|O_CREAT|O_EXCL|O_CLOEXEC, 0666) = 3
28705 open(".git/packed-refs", O_RDONLY) = 3
28705 open("/tmp/libgit2/.git/packed-refs.new", O_RDWR|O_CREAT|O_EXCL|O_CLOEXEC, 0666) = 4
28705 rename("/tmp/libgit2/.git/packed-refs.new", "/tmp/libgit2/.git/packed-refs") = 0
28705 unlink("/tmp/libgit2/.git/packed-refs.lock") = 0
The system calls showed that git gc did the following:
- Open packed-refs.lock. This tells other processes that packed-refs is locked and cannot be changed.
- Open packed-refs.new.
- Write loose refs to packed-refs.new.
- Rename packed-refs.new to packed-refs.
- Remove packed-refs.lock.
- Remove loose refs.
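The steps above can be sketched in a few lines of Python. This is only an illustration of the lock-file-plus-rename pattern seen in the strace output, not Git's actual implementation; the function name pack_refs_sketch and the temporary directory standing in for .git are made up for the example, and the final step (removing the loose refs) is left out.

```python
import os
import tempfile


def pack_refs_sketch(git_dir, refs):
    """Write refs to packed-refs via the lock / new / rename steps (illustrative only)."""
    packed = os.path.join(git_dir, "packed-refs")
    lock = packed + ".lock"
    new = packed + ".new"

    # Step 1: O_CREAT | O_EXCL fails if the lock file already exists,
    # so only one writer at a time gets past this line.
    lock_fd = os.open(lock, os.O_RDWR | os.O_CREAT | os.O_EXCL, 0o666)
    try:
        # Steps 2 and 3: create packed-refs.new and write the refs into it.
        new_fd = os.open(new, os.O_RDWR | os.O_CREAT | os.O_EXCL, 0o666)
        with os.fdopen(new_fd, "w") as f:
            for ref_name, commit_id in sorted(refs.items()):
                f.write(f"{commit_id} {ref_name}\n")
        # Step 4: rename packed-refs.new over packed-refs.
        # os.replace issues a rename() on POSIX systems.
        os.replace(new, packed)
    finally:
        # Step 5: release and remove the lock.
        os.close(lock_fd)
        os.unlink(lock)


# Hypothetical stand-in for a repository's .git directory.
git_dir = tempfile.mkdtemp()
pack_refs_sketch(git_dir, {"refs/heads/master": "2e33a554576d06d9e71bfd6814ee9ba3a7838963"})
print(sorted(os.listdir(git_dir)))
```

The point of writing to a separate file and renaming is that readers opening packed-refs by name see either the complete old file or the complete new one, never a half-written file.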
The fourth step is the key here: the rename where Git puts packed-refs
into action. In addition to collecting loose refs, git gc also
performs a more expensive task of scanning for unused objects and
removing them. This task can take over an hour for large
repositories.
That made us wonder: for a large repository, does git gc keep the file
open while it's running this sweep? Looking at the strace logs and
probing the process with lsof, we found that it did the following:
Notice that packed-refs is closed only at the end, after the potentially
long Garbage collect objects step takes place.
That made us wonder: how does NFS behave when one node has packed-refs
open while another renames over that file?
To find out, we asked the customer to run the following experiment on two different machines (Alice and Bob):
- On the shared NFS volume, create two files, test1.txt and test2.txt, with different contents to make it easy to distinguish them:
  alice $ echo "1 - Old file" > /path/to/nfs/test1.txt
  alice $ echo "2 - New file" > /path/to/nfs/test2.txt
- On machine Alice, keep a file open to test1.txt:
  alice $ irb
  irb(main):001:0> File.open('/path/to/nfs/test1.txt')
- On machine Alice, show the contents of test1.txt continuously:
  alice $ while true; do cat test1.txt; done
- Then on machine Bob, run:
  bob $ mv -f test2.txt test1.txt
This last step emulates what git gc does with packed-refs by
overwriting the existing file.
On the customer's machine, the result looked something like:
1 - Old file
1 - Old file
1 - Old file
cat: test1.txt: Stale file handle
Bingo! We seemed to have reproduced the problem in a controlled way. However, the same experiment using a Linux NFS server did not have this problem. The result was what you would expect: the new contents were picked up after the rename:
1 - Old file
1 - Old file
1 - Old file
2 - New file
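For comparison, the same sequence can be run on a single machine against a local (non-NFS) filesystem. The sketch below assumes Linux or another POSIX system and uses a temporary directory in place of the NFS path; it is an illustration of the expected rename semantics, not part of the original experiment.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()  # stand-in for /path/to/nfs
old_path = os.path.join(workdir, "test1.txt")
new_path = os.path.join(workdir, "test2.txt")

with open(old_path, "w") as f:
    f.write("1 - Old file\n")
with open(new_path, "w") as f:
    f.write("2 - New file\n")

# Like the irb step on Alice: keep test1.txt open.
held = open(old_path)

# Like Bob's `mv -f test2.txt test1.txt`: rename over the open file.
os.replace(new_path, old_path)

# Like one iteration of the cat loop: open the file by name again.
with open(old_path) as f:
    fresh = f.read()

# The handle opened before the rename still refers to the original file.
stale = held.read()
held.close()

print("opened after rename: ", fresh, end="")
print("opened before rename:", stale, end="")
```

Locally, a fresh open by name sees the new contents, while the handle opened before the rename keeps reading the old file without any error. That is the behavior the cat loop on Alice was expected to show over NFS as well.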
