How we spent two weeks hunting an NFS bug in the Linux kernel

UPDATE 2019-08-06: This bug has now been resolved in the following distributions:

Red Hat Enterprise Linux 7
Ubuntu
Linux mainline: Backported to 4.14-stable and 4.19-stable

On Sep. 14, the GitLab support team escalated a critical problem encountered by one of our customers: GitLab would run fine for a while, but after some time users encountered errors. When attempting to clone certain repositories via Git, users would see an opaque Stale file error message. The error message persisted for a long time, blocking employees from being able to work, unless a system administrator intervened manually by running ls in the directory itself.

Thus launched an investigation into the inner workings of Git and the Network File System (NFS). The investigation uncovered a bug with the Linux v4.0 NFS client and culiminated with a kernel patch that was written by Trond Myklebust and merged in the latest mainline Linux kernel on Oct. 26.

This post describes the journey of investigating the issue and details the thought process and tools by which we tracked down the bug. It was inspired by the fine detective work in How I spent two weeks hunting a memory leak in Ruby by Oleg Dashevskii.

More importantly, this experience exemplifies how open source software debugging has become a team sport that involves expertise across multiple people, companies, and locations. The GitLab motto "everyone can contribute" applies not only to GitLab itself, but also to other open source projects, such as the Linux kernel.

Reproducing the bug

While we have run NFS on GitLab.com for many years, we have stopped using it to access repository data across our application machines. Instead, we have abstracted all Git calls to Gitaly. Still, NFS remains a supported configuration for our customers who manage their own installation of GitLab, but we had never seen the exact problem described by the customer before.

Our customer gave us a few important clues:

The full error message read, fatal: Couldn't read ./packed-refs: Stale file handle.
The error seemed to start when they started a manual Git garbage collection run via git gc.
The error would go away if a system administrator ran ls in the directory.
The error also would go away after git gc process ended.

The first two items seemed obviously related. When you push to a branch in Git, Git creates a loose reference, a fancy name for a file that points your branch name to the commit. For example, a push to master will create a file called refs/heads/master in the repository:

$ cat refs/heads/master
2e33a554576d06d9e71bfd6814ee9ba3a7838963

git gc has several jobs, but one of them is to collect these loose references (refs) and bundle them up into a single file called packed-refs. This makes things a bit faster by eliminating the need to read lots of little files in favor of reading one large one. For example, after running git gc, an example packed-refs might look like:

# pack-refs with: peeled fully-peeled sorted
564c3424d6f9175cf5f2d522e10d20d781511bf1 refs/heads/10-8-stable
edb037cbc85225261e8ede5455be4aad771ba3bb refs/heads/11-0-stable
94b9323033693af247128c8648023fe5b53e80f9 refs/heads/11-1-stable
2e33a554576d06d9e71bfd6814ee9ba3a7838963 refs/heads/master

How exactly is this packed-refs file created? To answer that, we ran strace git gc with a loose ref present. Here are the pertinent lines from that:

28705 open("/tmp/libgit2/.git/packed-refs.lock", O_RDWR|O_CREAT|O_EXCL|O_CLOEXEC, 0666) = 3
28705 open(".git/packed-refs", O_RDONLY) = 3
28705 open("/tmp/libgit2/.git/packed-refs.new", O_RDWR|O_CREAT|O_EXCL|O_CLOEXEC, 0666) = 4
28705 rename("/tmp/libgit2/.git/packed-refs.new", "/tmp/libgit2/.git/packed-refs") = 0
28705 unlink("/tmp/libgit2/.git/packed-refs.lock") = 0

The system calls showed that git gc did the following:

Open packed-refs.lock. This tells other processes that packed-refs is locked and cannot be changed.
Open packed-refs.new.
Write loose refs to packed-refs.new.
Rename packed-refs.new to packed-refs.
Remove packed-refs.lock.
Remove loose refs.

The fourth step is the key here: the rename where Git puts packed-refs into action. In addition to collecting loose refs, git gc also performs a more expensive task of scanning for unused objects and removing them. This task can take over an hour for large repositories.

That made us wonder: for a large repository, does git gc keep the file open while it's running this sweep? Looking at the strace logs and probing the process with lsof, we found that it did the following:

Git Garbage Collection

Notice that packed-refs is closed only at the end, after the potentially long Garbage collect objects step takes place.

That made us wonder: how does NFS behave when one node has packed-refs open while another renames over that file?

To experiment, we asked the customer to run the following experiment on two different machines (Alice and Bob):

On the shared NFS volume, create two files: test1.txt and test2.txt with different contents to make it easy to distinguish them:
```
alice $ echo "1 - Old file" > /path/to/nfs/test1.txt
alice $ echo "2 - New file" > /path/to/nfs/test2.txt
```

On machine Alice, keep a file open to test1.txt:

 alice $ irb
 irb(main):001:0> File.open('/path/to/nfs/test1.txt')

On machine Alice, show the contents of test1.txt continuously:
```
alice $ while true; do cat test1.txt; done
```
Then on machine Bob, run:
```
bob $ mv -f test2.txt test1.txt
```

This last step emulates what git gc does with packed-refs by overwriting the existing file.

On the customer's machine, the result looked something like:

1 - Old file
1 - Old file
1 - Old file
cat: test1.txt: Stale file handle

Bingo! We seemed to reproduce the problem in a controlled way. However, the same experiment using a Linux NFS server did not have this problem. The result was what you would expect: the new contents were picked up after the rename:

1 - Old file
1 - Old file
1 - Old file
2 - New file  <--- RENAME HAPPENED
2 - New file
2 - New file

Why the difference in behavior? It turns out that the customer was using an Isilon NFS appliance that only supported NFS v4.0. By switching the mount parameters to v4.0 via the vers=4.0 parameter in /etc/fstab, the test revealed a different result with the Linux NFS server:

1 - Old file
1 - Old file
1 - Old file
1 - Old file <--- RENAME HAPPENED
1 - Old file
1 - Old file

Instead of a Stale file handle, the Linux NFS v4.0 server showed stale contents. It turns out this difference in behavior can be explained by the NFS spec. From RFC 3010:

A filehandle may or may not become stale or expire on a rename. However, server implementors are strongly encouraged to attempt to keep file handles from becoming stale or expiring in this fashion.

In other words, NFS servers can choose how to behave if a file is renamed; it's perfectly valid for any NFS server to return a Stale file error when that happens. We surmised that even though the results were different, the problem was likely related to the same issue. We suspected some cache validation issue because running ls in the directory would "clear" the error. Now that we had a reproducible test case, we asked the experts: the Linux NFS maintainers.

False path: NFS server delegations

With a clear set of reproduction steps, I sent an email to the Linux NFS mailing list describing what we had found. Over the week, I went back and forth with Bruce Fields, the Linux NFS server maintainer, who suggested this was a NFS bug and that it would be useful to look at the network traffic. He thought there might be an issue with NFS server delegations.

What is an NFS server delegation?

In a nutshell, NFS v4 introduced server delegations as a way to speed up file access. A server can delegate read or write access to a client so that the client doesn't have to keep asking the server whether that file has changed by another client. In simpler terms, a write delegation is akin to someone lending you a notebook and saying, "Go ahead and write in here, and I'll take it back when I'm ready." Instead of having to ask to borrow the notebook every time you want to write a new paragraph, you have free rein until the owner reclaims the notebook. In NFS terms, this reclamation process is called a delegation recall.

Indeed, a bug in the NFS delegation recall might explain the Stale file handle problem. Remember that in the earlier experiment, Alice had an open file to test1.txt when it was replaced by test2.txt later. It's possible that the server failed to recall the delegation on test1.txt, resulting in an incorrect state. To check whether this was an issue, we turned to tcpdump to capture NFS traffic and used Wireshark to visualize it.

Wireshark is a wonderful open source tool for analyzing network traffic, and it's especially good for viewing NFS in action. We captured a trace using the following command on the NFS server:

tcpdump -s 0 -w /tmp/nfs.pcap port 2049

This command captures all NFS traffic, which typically is on TCP port 2049. Because our experiment worked properly with NFS v4.1 but did not with NFS v4.0, we could compare and contrast how NFS behaved in a non-working and a working case. With Wireshark, we saw the following behavior:

NFS v4.0 (stale file case)

NFS v4.0 flow

In this diagram, we can see in step 1 Alice opens test1.txt and gets back an NFS file handle along with a stateid of 0x3000. When Bob attempts to rename the file, the NFS server tells to Bob to retry via the NFS4ERR_DELAY message while it recalls the delegation from Alice via the CB_RECALL message (step 3). Alice then returns her delegation via DELEGRETURN (step 4), and then Bob attempts to send another RENAME message (step 5). The RENAME completes in both cases, but Alice continues to read using the same file handle.

NFS v4.1 (working case)

NFS v4.1 flow

The main difference happens at the bottom at step 6. Notice in NFS v4.0 (the stale file case), Alice attempts to reuse the same stateid. In NFS v4.1 (working case), Alice performs an additional LOOKUP and OPEN, which causes the server to return a different stateid. In v4.0, these extra messages are never sent. This explains why Alice continues to see stale content because she uses the old file handle.

What makes Alice decide to do the extra LOOKUP? The delegation recall seemed to work fine, but perhaps there was still an issue, such as a missing invalidation step. To rule that out, we disabled NFS delegations by issuing this command on the NFS server itself:

echo 0 > /proc/sys/fs/leases-enable

We repeated the experiment, but the problem persisted. All this convinced us this wasn't a NFS server issue or a problem with NFS delegations; it was a problem that led us to look into the NFS client within the kernel.

Digging deeper: the Linux NFS client

The first question we had to answer for the NFS maintainers:

Was this problem still in the latest upstream kernel?

The issue occurred with both CentOS 7.2 and Ubuntu 16.04 kernels, which used versions 3.10.0-862.11.6 and 4.4.0-130, respectively. However, both those kernels lagged the most recent kernel, which was 4.19-rc2 at the time.

We deployed a new Ubuntu 16.04 virtual machine on Google Cloud Platform (GCP), cloned the latest Linux kernel, and set up a kernel development environment. After generating a .config file via make menuconfig, we checked two items:

The NFS driver was compiled as a module (CONFIG_NFSD=m).
The required GCP kernel settings were set properly.

Just as a geneticist would use fruit flies to study evolution in real time, the first item allowed us to make quick changes in the NFS client without having to reboot the kernel. The second item was required to ensure that the kernel would actually boot after it was installed. Fortunately, the default kernel settings had all the settings right out of the box.

With our custom kernel, we verified that the stale file problem still existed in the latest version. That begged a number of questions:

Where exactly was this problem happening?
Why was this problem happening with NFS v4.0 but not in v4.1?

To answer these questions, we began to investigate the NFS source code. Since we didn't have a kernel debugger available, we sprinkled the source code with two main types of calls:

pr_info() (what used to be printk).
dump_stack(): This would show the stack trace of the current function call.

For example, one of the first things we did was hook into the nfs4_file_open() function in fs/nfs/nfs4file.c:

static int
nfs4_file_open(struct inode *inode, struct file *filp)
{
...
        pr_info("nfs4_file_open start\n");
        dump_stack();

Admittedly, we could have activated the dprintk messages with the Linux dynamic debug or used rpcdebug, but it was nice to be able to add our own messages to verify changes were being made.

Every time we made changes, we recompiled the module and reinstalled it into the kernel via the commands:

make modules
sudo umount /mnt/nfs-test
sudo rmmod nfsv4
sudo rmmod nfs
sudo insmod fs/nfs/nfs.ko
sudo mount -a

With our NFS module installed, repeating the experiments would print messages that would help us understand the NFS code a bit more. For example, you can see exactly what happens when an application calls open():

Sep 24 20:20:38 test-kernel kernel: [ 1145.233460] Call Trace:
Sep 24 20:20:38 test-kernel kernel: [ 1145.233462]  dump_stack+0x8e/0xd5
Sep 24 20:20:38 test-kernel kernel: [ 1145.233480]  nfs4_file_open+0x56/0x2a0 [nfsv4]
Sep 24 20:20:38 test-kernel kernel: [ 1145.233488]  ? nfs42_clone_file_range+0x1c0/0x1c0 [nfsv4]
Sep 24 20:20:38 test-kernel kernel: [ 1145.233490]  do_dentry_open+0x1f6/0x360
Sep 24 20:20:38 test-kernel kernel: [ 1145.233492]  vfs_open+0x2f/0x40
Sep 24 20:20:38 test-kernel kernel: [ 1145.233493]  path_openat+0x2e8/0x1690
Sep 24 20:20:38 test-kernel kernel: [ 1145.233496]  ? mem_cgroup_try_charge+0x8b/0x190
Sep 24 20:20:38 test-kernel kernel: [ 1145.233497]  do_filp_open+0x9b/0x110
Sep 24 20:20:38 test-kernel kernel: [ 1145.233499]  ? __check_object_size+0xb8/0x1b0
Sep 24 20:20:38 test-kernel kernel: [ 1145.233501]  ? __alloc_fd+0x46/0x170
Sep 24 20:20:38 test-kernel kernel: [ 1145.233503]  do_sys_open+0x1ba/0x250
Sep 24 20:20:38 test-kernel kernel: [ 1145.233505]  ? do_sys_open+0x1ba/0x250
Sep 24 20:20:38 test-kernel kernel: [ 1145.233507]  __x64_sys_openat+0x20/0x30
Sep 24 20:20:38 test-kernel kernel: [ 1145.233508]  do_syscall_64+0x65/0x130

What are the do_dentry_open and vfs_open calls above? Linux has a virtual filesystem (VFS), an abstraction layer which provides a common interface for all filesystems. The VFS documentation explains:

The VFS implements the open(2), stat(2), chmod(2), and similar system calls. The pathname argument that is passed to them is used by the VFS to search through the directory entry cache (also known as the dentry cache or dcache). This provides a very fast look-up mechanism to translate a pathname (filename) into a specific dentry. Dentries live in RAM and are never saved to disc: they exist only for performance.

This gave us a clue: what if this was a problem with the dentry cache?

We noticed a lot of dentry cache validation was done in fs/nfs/dir.c. In particular, nfs4_lookup_revalidate() sounded promising. As an experiment, we hacked that function to bail out early:

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 8bfaa658b2c1..ad479bfeb669 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1159,6 +1159,7 @@ static int nfs_lookup_revalidate(struct dentry *dentry, unsigned int flags)
        trace_nfs_lookup_revalidate_enter(dir, dentry, flags);
        error = NFS_PROTO(dir)->lookup(dir, &dentry->d_name, fhandle, fattr, label);
        trace_nfs_lookup_revalidate_exit(dir, dentry, flags, error);
+       goto out_bad;
        if (error == -ESTALE || error == -ENOENT)
                goto out_bad;
        if (error)

That made the stale file problem in our experiment go away! Now we were onto something.

To answer, "Why does this problem not happen in NFS v4.1?", we added pr_info() calls to every if block in that function. After running our experiments with NFS v4.0 and v4.1, we found this special condition being run in the v4.1 case:

        if (NFS_SB(dentry->d_sb)->caps & NFS_CAP_ATOMIC_OPEN_V1) {
          goto no_open;
        }

What is NFS_CAP_ATOMIC_OPEN_V1? We saw this kernel patch mentioned this was an NFS v4.1-specific feature, and the code in fs/nfs/nfs4proc.c confirmed that this flag was a capability present in v4.1 but not in v4.0:

static const struct nfs4_minor_version_ops nfs_v4_1_minor_ops = {
        .minor_version = 1,
        .init_caps = NFS_CAP_READDIRPLUS
                | NFS_CAP_ATOMIC_OPEN
                | NFS_CAP_POSIX_LOCK
                | NFS_CAP_STATEID_NFSV41
                | NFS_CAP_ATOMIC_OPEN_V1

That explained the difference in behavior: in the v4.1 case, the goto no_open would cause more validation to happen in nfs_lookup_revalidate(), but in v4.0, the nfs4_lookup_revalidate() would return earlier. Now, how do we actually solve the problem?

The solution

I reported the findings to the NFS mailing list and proposed a naive patch. A week after the report, Trond Myklebust sent a patch series to the list fixing this bug and found another related issue for NFS v4.1.

It turns out the fix for the NFS v4.0 bug was deeper in the code base than we had looked. Trond summarized it well in the patch:

We need to ensure that inode and dentry revalidation occurs correctly on reopen of a file that is already open. Currently, we can end up not revalidating either in the case of NFSv4.0, due to the 'cached open' path. Let's fix that by ensuring that we only do cached open for the special cases of open recovery and delegation return.

We confirmed that this fix made the stale file problem go away and filed bug reports with Ubuntu and RedHat.

Knowing full well that kernel changes may take a while to make it to stable releases, we also added a workaround in Gitaly to deal with this issue. We did experiments to test that calling stat() on the packed-refs file appears to cause the kernel to revalidate the dentry cache for the renamed file. For simplicity, this is implemented in Gitaly regardless of whether the filesystem is NFS; we only do this once before Gitaly "opens" a repository, and there are already other stat() calls that check for other files.

What we learned

A bug can be anywhere in your software stack, and sometimes you have to look beyond your application to find it. Having helpful partners in the open source world makes that job much easier.

We are extremely grateful to Trond Myklebust for fixing the problem, and Bruce Fields for responding to questions and helping us understand NFS. Their responsiveness and professionalism truly reflects the best of the open source community.

Photo by dynamosquito on Flickr

How we spent two weeks hunting an NFS bug in the Linux kernel

Reproducing the bug

False path: NFS server delegations

What is an NFS server delegation?

NFS v4.0 (stale file case)

NFS v4.1 (working case)

Digging deeper: the Linux NFS client

Was this problem still in the latest upstream kernel?

This gave us a clue: what if this was a problem with the dentry cache?

The solution

What we learned

More to explore

Refactoring a CI/CD template to a CI/CD component

Revisiting the variables management workflow

How to translate Bamboo agent capabilities to GitLab Runner tags

We want to hear from you

Ready to get started?

How we spent two weeks hunting an NFS bug in the Linux kernel

Reproducing the bug

False path: NFS server delegations

What is an NFS server delegation?

NFS v4.0 (stale file case)

NFS v4.1 (working case)

Digging deeper: the Linux NFS client

Was this problem still in the latest upstream kernel?

This gave us a clue: what if this was a problem with the dentry cache?

The solution

What we learned

Sign up for GitLab’s newsletter

More to explore

Refactoring a CI/CD template to a CI/CD component

Revisiting the variables management workflow

How to translate Bamboo agent capabilities to GitLab Runner tags

We want to hear from you

Ready to get started?