How we hunted an NFS bug in the Linux kernel for two weeks

Original author: Stan Hu

A detailed account, from a GitLab issue, of the bug hunt that led to a patch for the Linux kernel

On September 14, GitLab support reported a critical problem one of our clients had hit: GitLab would work fine, and then users would get an error. They would try to clone some repositories over Git and suddenly see a strange Stale file handle message. The error persisted for a long time and blocked all work until a system administrator manually ran ls in the affected directory.

I had to dig into the internals of Git and the NFS network file system. In the end we found a bug in the NFS v4.0 client in the Linux kernel, Trond Myklebust wrote a kernel patch for it, and since October 26 that patch has been in the mainline Linux kernel.

In this post I will describe how we investigated the problem, which directions we explored, and which tools we used to track down the bug. We were inspired by the excellent detective work Oleg Dashevskii described in the post “How I spent two weeks hunting a memory leak in Ruby”.

It is also a great example of how debugging open-source software is a team sport involving many people, companies, and countries. The GitLab motto “Everyone can contribute” applies not only to GitLab itself but also to other open-source projects, such as the Linux kernel.

Reproducing the bug

We had used NFS for many years, but eventually stopped using it to access repository data on application machines: we moved all Git calls to Gitaly. We still support NFS for customers who manage their own GitLab installations, but we had never seen anything like the problem this client was hitting.

The client gave us some helpful hints:

  1. The full text of the error: fatal: Couldn't read ./packed-refs: Stale file handle.
  2. The problem apparently started when the client manually kicked off Git garbage collection with the git gc command.
  3. The error went away when a system administrator ran ls in the directory.
  4. The error went away when the git gc process terminated.

Clearly, the first two points are related. When you push to a Git branch, Git creates a loose ref: a file whose long name encodes the branch name and whose content is the SHA of the commit the branch points to. For example, pushing to master creates a file named refs/heads/master in the repository:

$ cat refs/heads/master
2e33a554576d06d9e71bfd6814ee9ba3a7838963

The git gc command performs several tasks. Among other things, it collects these loose refs and packs them into a single file named packed-refs. That speeds things up a bit, since reading one large file is cheaper than reading many small ones. After git gc runs, the packed-refs file might look something like this:

# pack-refs with: peeled fully-peeled sorted
564c3424d6f9175cf5f2d522e10d20d781511bf1 refs/heads/10-8-stable
edb037cbc85225261e8ede5455be4aad771ba3bb refs/heads/11-0-stable
94b9323033693af247128c8648023fe5b53e80f9 refs/heads/11-1-stable
2e33a554576d06d9e71bfd6814ee9ba3a7838963 refs/heads/master
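As a quick illustration of the format, here is a minimal Python sketch that parses a packed-refs blob into a ref-to-SHA map. The helper `parse_packed_refs` is our own, not part of Git; real packed-refs files can also contain `^`-prefixed peeled-tag lines, which this sketch simply skips.

```python
def parse_packed_refs(text):
    """Parse the contents of a packed-refs file into a {ref_name: sha} dict.

    Lines starting with '#' are headers and lines starting with '^' are
    peeled tag targets; both are skipped in this simplified sketch.
    """
    refs = {}
    for line in text.splitlines():
        if not line or line.startswith(("#", "^")):
            continue
        sha, _, name = line.partition(" ")
        refs[name] = sha
    return refs

# Example usage: refs = parse_packed_refs(open(".git/packed-refs").read())
```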

How does the packed-refs file get created? To find out, we ran strace git gc in a repository containing a loose ref. Here are the relevant lines:

28705 open("/tmp/libgit2/.git/packed-refs.lock", O_RDWR|O_CREAT|O_EXCL|O_CLOEXEC, 0666) = 3
28705 open(".git/packed-refs", O_RDONLY) = 3
28705 open("/tmp/libgit2/.git/packed-refs.new", O_RDWR|O_CREAT|O_EXCL|O_CLOEXEC, 0666) = 4
28705 rename("/tmp/libgit2/.git/packed-refs.new", "/tmp/libgit2/.git/packed-refs") = 0
28705 unlink("/tmp/libgit2/.git/packed-refs.lock") = 0

The system calls showed that git gc:

  1. Opened packed-refs.lock. This tells other processes that packed-refs is locked and must not be changed.
  2. Opened packed-refs.new.
  3. Wrote the loose refs into packed-refs.new.
  4. Renamed packed-refs.new to packed-refs.
  5. Removed packed-refs.lock.
  6. Removed the loose refs.

The key point here is step 4, the rename, where Git puts the new packed-refs file into effect. Besides collecting loose refs, git gc also performs a much more resource-intensive task: finding and pruning unused objects. In large repositories this can take more than an hour.
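The lock/write/rename dance above can be sketched in Python (a simplified illustration with our own helper name; Git's real implementation is in C):

```python
import os

def atomic_replace(path, data):
    """Replace `path` with `data` the way git gc replaces packed-refs:
    take a .lock file, write a .new file, rename it over the original,
    then drop the lock. rename() is atomic on POSIX filesystems, so
    readers see either the old file or the new one, never a mix."""
    lock = path + ".lock"
    new = path + ".new"
    # O_EXCL makes lock acquisition fail if another process holds the lock.
    fd = os.open(lock, os.O_RDWR | os.O_CREAT | os.O_EXCL, 0o666)
    try:
        with open(new, "w") as f:
            f.write(data)
        os.rename(new, path)       # step 4: the atomic switch
    finally:
        os.close(fd)
        os.unlink(lock)            # step 5: release the lock
```

The rename is what makes the update safe against concurrent readers on a local filesystem, and it is exactly this rename that NFS turned out to handle differently.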

So we asked ourselves: in large repositories, does git gc keep packed-refs open while the cleanup runs? We studied the strace logs, ran lsof against the git gc process, and learned the following:

As you can see, packed-refs is closed at the very end, after the potentially long “Garbage collect objects” phase.

That raised the next question: what does NFS do when one node has packed-refs open while another node renames over it?

"For science," we asked the client to run an experiment on two different machines (call them Alice and Bob):

1) On the shared NFS volume, create two files, test1.txt and test2.txt, with different contents to make them easy to tell apart:

alice $ echo "1 - Old file" > /path/to/nfs/test1.txt
alice $ echo "2 - New file" > /path/to/nfs/test2.txt

2) On the Alice machine, keep the file test1.txt open, for example by holding an open file descriptor from a Ruby console:

alice $ irb
irb(main):001:0> File.open("test1.txt")

3) On the Alice machine, continuously display the contents of test1.txt:

alice $ while true; do cat test1.txt; done

4) Then, on the Bob machine, execute the command:

bob $ mv -f test2.txt test1.txt

The last step reproduces what git gc does to packed-refs when it overwrites the existing file.

On the client's machine the result looked like this:

1 - Old file
1 - Old file
1 - Old file
cat: test1.txt: Stale file handle

There it was! We had reproduced the problem in a controlled way. But when we ran the same experiment against a Linux NFS server, the problem did not occur. The result was as expected: after the rename, the new content was picked up:

1 - Old file
1 - Old file
1 - Old file
2 - New file  <--- RENAME HAPPENED
2 - New file
2 - New file
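The behavior the client expected is exactly what any local POSIX filesystem gives you: an already-open descriptor keeps seeing the original file, while every fresh open() (which is what each invocation of cat does) sees the renamed-in content. A short Python script (our own illustration, not part of the original experiment) demonstrates both at once:

```python
import os
import tempfile

d = tempfile.mkdtemp()
old = os.path.join(d, "test1.txt")
new = os.path.join(d, "test2.txt")
with open(old, "w") as f:
    f.write("1 - Old file")
with open(new, "w") as f:
    f.write("2 - New file")

reader = open(old)            # Alice keeps test1.txt open
os.rename(new, old)           # Bob: mv -f test2.txt test1.txt

# The already-open descriptor still refers to the original inode...
assert reader.read() == "1 - Old file"
# ...but a fresh open() (what each `cat` does) sees the new content.
assert open(old).read() == "2 - New file"
```

Over NFS this guarantee weakens, because "the same file" is identified by a file handle that the client caches, which is where the two protocol versions diverged.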

Where does this difference in behavior come from? It turned out the client was using Isilon NFS storage, which supports only NFS v4.0. When we switched our mount to v4.0 with the vers=4.0 option in /etc/fstab, the test showed a different result against the Linux NFS server as well:

1 - Old file
1 - Old file
1 - Old file
1 - Old file <--- RENAME HAPPENED
1 - Old file
1 - Old file

Instead of the Stale file handle error, the Linux NFS v4.0 server served stale content. It turns out this difference in behavior is permitted by the NFS specification. From RFC 3010:

A file handle may or may not become stale or expire on a rename. However, server implementors are strongly encouraged to attempt to keep file handles from becoming stale or expiring in this way.

In other words, NFS servers may choose how to behave when a file is renamed, and the Isilon server was within its rights to return Stale file handle in such cases. We suspected the underlying cause was the same even though the symptoms differed, and that it had something to do with cache validation, since running ls in the directory cleared the error. Now we had a reproducible test case, and we turned to the experts: the Linux NFS maintainers.

A false trail: NFS server delegation

Once we could reproduce the error step by step, I wrote to the Linux NFS maintainers describing what we had learned. For a week I corresponded with Bruce Fields, the maintainer of the Linux NFS server, who suggested the bug was on the NFS side and that we should examine the network traffic. He thought the problem was in the server's delegation handling.

What is NFS server delegation?

In a nutshell, NFS v4 introduced delegation to speed up file access. The server can delegate read or write access to a client so that the client does not have to keep asking the server whether another client has changed the file. Simply put, a write delegation is like lending someone your notebook and saying: “Go ahead and write in it; I'll take it back when I need it.” Instead of asking to borrow the notebook for every note, the person has free rein until the notebook is taken back. In NFS, asking for the notebook back is called a delegation recall.

A bug in delegation recall could explain the Stale file handle problem. Remember, in our experiment Alice had test1.txt open when it was replaced by test2.txt. Perhaps the server failed to recall the delegation for test1.txt, leaving the client with incorrect state. To test this theory, we recorded the NFS traffic with tcpdump and visualized it with Wireshark.

Wireshark is a great open-source tool for analyzing network traffic, and especially handy for watching NFS in action. We captured a trace with the following command on the NFS server:

tcpdump -s 0 -w /tmp/nfs.pcap port 2049

This command records all NFS traffic, which normally travels over TCP port 2049. Since our experiment succeeded with NFS v4.1 but failed with NFS v4.0, we could compare NFS behavior in the working and broken cases. In Wireshark we saw the following:

NFS v4.0 (stale file)


The diagram shows that in step 1 Alice opens test1.txt and receives an NFS file handle with stateid 0x3000. When Bob tries to rename the file, the NFS server tells him to retry with an NFS4ERR_DELAY reply while it recalls Alice's delegation via a CB_RECALL message (step 3). Alice returns her delegation (DELEGRETURN, step 4), and Bob sends the RENAME again (step 5). The RENAME completes in both cases, but Alice keeps reading the file through the same handle.

NFS v4.1 (working case)


The difference shows up at step 6. In NFS v4.0 (the stale-file case), Alice tries to reuse the same stateid. In NFS v4.1 (the working case), Alice performs additional LOOKUP and OPEN operations, so the server returns a different stateid. In v4.0 those extra messages are never sent. That explains why Alice sees stale content: she is using the old handle.

Why does Alice suddenly decide to perform an extra LOOKUP? The delegation recall had evidently succeeded, but some problem apparently remained, such as a missed invalidation step. To rule delegation out, we disabled it on the NFS server itself with this command:

echo 0 > /proc/sys/fs/leases-enable

We repeated the experiment, and the problem did not go away. That convinced us the problem was not in the NFS server or in delegation, and we decided to look at the NFS client in the kernel.

Digging deeper: Linux NFS client

The first question we had to answer for the NFS maintainers was:

Does the problem still occur on the latest kernel?

We had seen the problem on CentOS 7.2 and Ubuntu 16.04 kernels, versions 3.10.0-862.11.6 and 4.4.0-130 respectively. But both kernels lagged behind the latest version, which at the time was 4.19-rc2.

We deployed a fresh Ubuntu 16.04 virtual machine on Google Cloud Platform (GCP), cloned the latest Linux kernel, and set up a kernel development environment. We created a .config with menuconfig and checked that:

  1. The NFS driver was built as a module (CONFIG_NFSD=m).
  2. The kernel parameters required by GCP were set correctly.

The first item let us iterate on the NFS client quickly without rebooting into a new kernel each time. The second guaranteed that the kernel would boot after installation. Fortunately, the default kernel parameters suited us.

We confirmed that the stale-file problem was still present in the latest kernel. So we asked ourselves:

  1. Where exactly is the problem?
  2. Why does this happen in NFS v4.0, but not in v4.1?

To answer these questions, we dug into the NFS source code. We did not have a kernel debugger, so we sprinkled two kinds of calls through the source:

  1. pr_info() (formerly plain printk()).
  2. dump_stack(), which prints the stack trace of the current function call.

For example, the first thing we did was hook into the nfs4_file_open() function in fs/nfs/nfs4file.c:

static int
nfs4_file_open(struct inode *inode, struct file *filp)
{
        pr_info("nfs4_file_open start\n");

Of course, we could have enabled dprintk messages with Linux dynamic debugging, or used rpcdebug, but we wanted to be able to add our own messages to test changes.

After each change, we recompiled the module and reloaded it into the kernel with:

make modules
sudo umount /mnt/nfs-test
sudo rmmod nfsv4
sudo rmmod nfs
sudo insmod fs/nfs/nfs.ko
sudo mount -a

With our instrumented NFS module we could repeat the experiments and get messages that helped us understand the NFS code. For example, we could see immediately what happens when an application calls open():

Sep 24 20:20:38 test-kernel kernel: [ 1145.233460] Call Trace:
Sep 24 20:20:38 test-kernel kernel: [ 1145.233462]  dump_stack+0x8e/0xd5
Sep 24 20:20:38 test-kernel kernel: [ 1145.233480]  nfs4_file_open+0x56/0x2a0 [nfsv4]
Sep 24 20:20:38 test-kernel kernel: [ 1145.233488]  ? nfs42_clone_file_range+0x1c0/0x1c0 [nfsv4]
Sep 24 20:20:38 test-kernel kernel: [ 1145.233490]  do_dentry_open+0x1f6/0x360
Sep 24 20:20:38 test-kernel kernel: [ 1145.233492]  vfs_open+0x2f/0x40
Sep 24 20:20:38 test-kernel kernel: [ 1145.233493]  path_openat+0x2e8/0x1690
Sep 24 20:20:38 test-kernel kernel: [ 1145.233496]  ? mem_cgroup_try_charge+0x8b/0x190
Sep 24 20:20:38 test-kernel kernel: [ 1145.233497]  do_filp_open+0x9b/0x110
Sep 24 20:20:38 test-kernel kernel: [ 1145.233499]  ? __check_object_size+0xb8/0x1b0
Sep 24 20:20:38 test-kernel kernel: [ 1145.233501]  ? __alloc_fd+0x46/0x170
Sep 24 20:20:38 test-kernel kernel: [ 1145.233503]  do_sys_open+0x1ba/0x250
Sep 24 20:20:38 test-kernel kernel: [ 1145.233505]  ? do_sys_open+0x1ba/0x250
Sep 24 20:20:38 test-kernel kernel: [ 1145.233507]  __x64_sys_openat+0x20/0x30
Sep 24 20:20:38 test-kernel kernel: [ 1145.233508]  do_syscall_64+0x65/0x130

What are these do_dentry_open and vfs_open calls? Linux has a virtual filesystem (VFS), an abstraction layer that provides a common interface over all file systems. The VFS documentation says:

The VFS implements the open(2), stat(2), chmod(2), and similar system calls. It uses the pathname argument passed to them to search the directory entry cache (dentry cache, or dcache). This provides a very fast lookup mechanism that translates a pathname (filename) into a specific dentry. Dentries live in RAM and are never saved to disk: they exist only for performance.

And then it dawned on us: what if the problem was in the dentry cache?
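To make the suspicion concrete: a cache of name-to-handle entries is only correct if entries are revalidated against the server. The following toy Python model (entirely our own illustration, not kernel code; the class and names are invented) shows how a cache that skips revalidation keeps serving a stale handle after a rename:

```python
class DentryCache:
    """Toy model of a name -> file-handle cache with a revalidate hook,
    loosely mirroring what dentry revalidation does for an NFS client."""

    def __init__(self, lookup):
        self.lookup = lookup      # asks the "server" for the current handle
        self.entries = {}

    def resolve(self, name, revalidate=True):
        if name in self.entries:
            if revalidate and self.entries[name] != self.lookup(name):
                del self.entries[name]    # stale entry: drop and re-lookup
            else:
                return self.entries[name]
        self.entries[name] = self.lookup(name)
        return self.entries[name]

# "Server" state: test1.txt initially maps to handle 0x3000.
server = {"test1.txt": 0x3000}
cache = DentryCache(lambda name: server[name])
assert cache.resolve("test1.txt") == 0x3000

server["test1.txt"] = 0x4000  # the rename gives the name a new handle
# Skipping revalidation serves the stale handle, like the v4.0 client...
assert cache.resolve("test1.txt", revalidate=False) == 0x3000
# ...while revalidating picks up the new one, like the v4.1 client.
assert cache.resolve("test1.txt") == 0x4000
```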

We noticed that dentry cache checks mostly happen in fs/nfs/dir.c. We were especially interested in the nfs4_lookup_revalidate() function, and as an experiment we forced it to bail out early:

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 8bfaa658b2c1..ad479bfeb669 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1159,6 +1159,7 @@ static int nfs_lookup_revalidate(struct dentry *dentry, unsigned int flags)
        trace_nfs_lookup_revalidate_enter(dir, dentry, flags);
        error = NFS_PROTO(dir)->lookup(dir, &dentry->d_name, fhandle, fattr, label);
        trace_nfs_lookup_revalidate_exit(dir, dentry, flags, error);
+       goto out_bad;
        if (error == -ESTALE || error == -ENOENT)
                goto out_bad;
        if (error)

And with this experiment, the stale-file problem no longer appeared! We were finally on the right trail.

To find out why the problem did not occur on NFS v4.1, we added pr_info() calls to every if block in this function. Experimenting with NFS v4.0 and v4.1, we found a special condition taken in v4.1:

if (NFS_SB(dentry->d_sb)->caps & NFS_CAP_ATOMIC_OPEN_V1) {
        goto no_open;
}

What is NFS_CAP_ATOMIC_OPEN_V1? This kernel patch describes it as an NFS v4.1 feature, and the code in fs/nfs/nfs4proc.c confirmed that this capability is present in v4.1 but not in v4.0:

static const struct nfs4_minor_version_ops nfs_v4_1_minor_ops = {
        .minor_version = 1,
        .init_caps = NFS_CAP_READDIRPLUS
                | NFS_CAP_ATOMIC_OPEN
                | NFS_CAP_POSIX_LOCK
                | NFS_CAP_STATEID_NFSV41
                | NFS_CAP_ATOMIC_OPEN_V1

So that is why the versions behaved differently: in v4.1 the goto no_open triggers additional checks in nfs_lookup_revalidate(), while in v4.0 nfs4_lookup_revalidate() returns early. So how did we finally fix the problem?


I reported our findings on the NFS mailing list and proposed a crude patch. A week later, Trond Myklebust sent a series of patches fixing the bug, and along the way found and fixed a related problem in NFS v4.1.

It turned out the fix for the NFS v4.0 bug sat deeper in the codebase than we expected. Trond put it well in the patch:

We need to ensure the inode and dentry are properly revalidated when an already-open file is opened again. Currently in NFSv4.0 we revalidate neither, because the open file is cached. Let's fix that and cache open files only in the special cases of reopening files after recovery and returning delegations.

We verified that this fix solved the stale-file problem, and filed bug reports with the Ubuntu and Red Hat teams.

We knew the fix would not reach stable kernels anytime soon, so we added a workaround for the problem in Gitaly. We experimented and confirmed that calling stat() on the packed-refs file causes the kernel to revalidate the renamed file in the dentry cache. For simplicity, Gitaly does this on any file system, not just NFS. The check runs only once, before Gitaly opens the repository, and other files already get stat() calls along the way.
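The workaround boils down to something like the following sketch (written here in Python rather than Gitaly's actual Go code; `open_packed_refs` and `repo_path` are our own illustrative names, and `repo_path` is assumed to be the directory containing packed-refs):

```python
import os

def open_packed_refs(repo_path):
    """Workaround for the NFS v4.0 stale-content bug: stat() the file
    before opening it. On an affected NFS client, the stat() forces the
    kernel to revalidate the dentry cache entry for a file that may have
    been renamed over by `git gc` on another node."""
    path = os.path.join(repo_path, "packed-refs")
    try:
        os.stat(path)             # force dentry revalidation
    except FileNotFoundError:
        pass                      # no packed-refs yet: nothing to refresh
    return open(path, "r")
```

On a healthy filesystem the extra stat() is harmless, which is why applying it unconditionally is a reasonable trade-off.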

What we have learned

A bug can hide in any corner of the software stack, and sometimes you have to look for it outside your application. If you have good contacts in the open-source world, the job gets much easier.

Many thanks to Trond Myklebust for fixing the problem, and to Bruce Fields for answering our questions and helping us understand NFS. It is exactly this kind of responsiveness and professionalism that makes us value the open-source community.
