Linux Namespacing Pitfalls

2015-10-12 by Mike Shal, tagged as linux, mozilla, namespace, tup

Linux namespaces are a powerful feature for running processes with various levels of containerization. While working on adding them to the tup build system, I stumbled through some problems along the way. For a rough primer on getting started with user and filesystem namespaces, along with how they're used for dependency detection in tup, read on!

Dependency Detection in Tup

For some background on why I'm interested in filesystem namespaces, it might help to understand a little bit about how tup uses FUSE for dependency detection on Linux systems. Because we can't rely on subprocesses to output dependency information, tup instruments all subprocesses in such a way that file inputs and outputs can be tracked universally. A passthrough FUSE filesystem, very similar to the "hello world" fusexmp.c filesystem, is used to track all file activity of a subprocess. (I've also experimented with using an LD_PRELOAD shared library shim, and ptrace() for collecting this information. Each approach has its own benefits and drawbacks.) Each subprocess has its own private directory within the passthrough filesystem, which looks something like this:

/home/mshal/mozilla-central/.tup/mnt/@tupjob-1/
/home/mshal/mozilla-central/.tup/mnt/@tupjob-2/
...

The directory structure under each @tupjob-N marker looks like the root of the filesystem, so for example "/home/mshal/mozilla-central/.tup/mnt/@tupjob-1/home/mshal" maps to "/home/mshal" on the underlying filesystem. When tup runs a subprocess, say 'gcc -c foo.c -o foo.o', it picks one of the @tupjob directories and chdir()s down into the hierarchy before fork/execing. Then gcc and all of its subprocesses (cpp and such) will be opening files in the FUSE filesystem, which notifies tup to track the dependency and open the real backing file. So if we see .tup/mnt/@tupjob-1/home/mshal/mozilla-central/foo.c being opened, we know that file is a dependency of the 'gcc -c foo.c -o foo.o' process. And perhaps .tup/mnt/@tupjob-2/home/mshal/mozilla-central/bar.py is a dependency of a python script.
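To make the mapping concrete, here is a minimal sketch of the two directions of the path translation. This is not tup's actual code: the helper names real_to_job_path() and job_path_to_real() are made up for illustration, and the second one assumes it is handed a path relative to the root of the FUSE mount.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: build the path a subprocess would use inside the
 * passthrough mount, given the mount directory, a job number and a real
 * absolute path. Returns 0 on success, -1 if the result does not fit. */
int real_to_job_path(char *out, size_t outlen, const char *mnt, int job,
		     const char *real)
{
	int n = snprintf(out, outlen, "%s/@tupjob-%d%s", mnt, job, real);

	return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}

/* Hypothetical helper: split a path relative to the mount root, such as
 * "/@tupjob-1/home/mshal/foo.c", into the job number and the real path.
 * *real points into the caller's string; nothing is allocated.
 * Returns 0 on success, -1 if there is no @tupjob-N marker. */
int job_path_to_real(const char *path, int *job, const char **real)
{
	static const char marker[] = "/@tupjob-";
	char *end;
	long n;

	if(strncmp(path, marker, sizeof(marker) - 1) != 0)
		return -1;
	path += sizeof(marker) - 1;
	if(*path < '0' || *path > '9')
		return -1;
	n = strtol(path, &end, 10);
	if(*end != '/' && *end != '\0')
		return -1;
	*job = (int)n;
	/* A bare "/@tupjob-N" corresponds to the real root directory. */
	*real = (*end == '/') ? end : "/";
	return 0;
}
```

With these, an open of "/@tupjob-2/home/mshal/mozilla-central/bar.py" would be attributed to job 2 and passed through to "/home/mshal/mozilla-central/bar.py".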

This works fairly well, except for the fact that any subprocess that looks at the current working directory will see the virtual ".tup/mnt/@tupjob-N" nonsense in the path. For example, using gcc with the --coverage flag to generate code instrumentation will stick the working directory in the output files. So when you go to analyze the code, it's pointing to a path that doesn't actually exist! These virtual directories also pose a problem for subprocesses that use full directory paths. It's easy to circumvent the FUSE filesystem just by doing fopen("/home/mshal/mozilla-central/foo.c") anywhere in the build system, and missing dependencies are infinitely bad.

(The other main issue with FUSE, namely the poor performance of passthrough filesystems, is covered in my Linking libxul with tup and FUSE post.)

Enter Filesystem Namespaces (woot!)

Ideally what we want in this situation is to have the subprocess see '/home/mshal/mozilla-central' as the current working directory, but have this path actually correspond to the FUSE filesystem. Meanwhile, we still want '/home/mshal/mozilla-central' to correspond to the actual filesystem for anything else running on the host (i.e., maybe you're still looking through source code while things are building). In order to accomplish this, we can run the subprocesses in a separate filesystem namespace, which is created by passing the CLONE_NEWNS flag to the unshare() or clone() system calls. Roughly, this looks like:

fork();
if(child) {
	unshare(CLONE_NEWNS);
	bind mount /home/mshal/mozilla-central/.tup/mnt/@tupjob-1/home/mshal/mozilla-central to /home/mshal/mozilla-central;
	exec(gcc -c ...);
}

Unfortunately, CLONE_NEWNS requires the CAP_SYS_ADMIN capability. Essentially that means we would need to run tup with the suid bit set so that it has root permissions. This is the same restriction as for tup's existing workaround, which is to run the subprocess inside a chroot() environment. So on its own, it seems we haven't gained much here.

Enter User Namespaces (woot!)

As of 3.8 kernels, Linux also supports a user namespace. I still find these very confusing, but effectively if we first create a new user namespace before creating the filesystem namespace, we are magically "admin" inside the new user namespace (without resorting to a privilege escalation bug). The new code looks something like this:

fork();
if(child) {
	unshare(CLONE_NEWUSER);
	echo "deny" > /proc/self/setgroups;
	echo "1000 1000 1" > /proc/self/uid_map; # 1000 corresponds to your user id
	echo "1000 1000 1" > /proc/self/gid_map;
	unshare(CLONE_NEWNS);
	bind mount /home/mshal/mozilla-central/.tup/mnt/@tupjob-1/home/mshal/mozilla-central to /home/mshal/mozilla-central;
	exec(gcc -c ...);
} else {
	# /home/mshal/mozilla-central is the real filesystem, not FUSE for me!
}

In the new user namespace, the child process is effectively an admin and can create its own filesystem namespace, and mount the FUSE filesystem wherever it wants to without affecting any other processes on the machine. Sweet! Of course, it's not actually an admin, so you still can't access files or devices that you normally wouldn't be able to. Now we can run tup as a normal user without suid set, and we can containerize the subprocesses and watch their file accesses all day for proper dependency detection.
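The three "echo" lines in the pseudocode are the least obvious part to translate into C, so here is a rough sketch of them. The function names write_str() and write_id_maps() are my own for this example, and taking the directory as a parameter is only a convenience so the helper can be tried against an ordinary directory; in the child you would pass "/proc/self" right after unshare(CLONE_NEWUSER). It assumes a single-ID mapping like the "1000 1000 1" lines above, and that a missing setgroups file (kernels before 3.19) can be ignored.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Write text to dir/name with a single write() call, which is what the
 * kernel expects for the id map files. Returns 0 on success, -1 on error. */
static int write_str(const char *dir, const char *name, const char *text)
{
	char path[512];
	size_t len = strlen(text);
	ssize_t rc;
	int fd;

	if(snprintf(path, sizeof(path), "%s/%s", dir, name) >= (int)sizeof(path)) {
		errno = ENAMETOOLONG;
		return -1;
	}
	fd = open(path, O_WRONLY);
	if(fd < 0)
		return -1;
	rc = write(fd, text, len);
	close(fd);
	return (rc == (ssize_t)len) ? 0 : -1;
}

/* Map uid and gid to themselves inside the new user namespace.
 * Capture uid and gid with getuid()/getgid() before calling unshare(),
 * since they no longer report the original values afterwards. */
int write_id_maps(const char *procdir, uid_t uid, gid_t gid)
{
	char map[64];

	/* setgroups must be denied before an unprivileged gid_map write. */
	if(write_str(procdir, "setgroups", "deny") < 0 && errno != ENOENT)
		return -1;
	snprintf(map, sizeof(map), "%u %u 1", (unsigned)uid, (unsigned)uid);
	if(write_str(procdir, "uid_map", map) < 0)
		return -1;
	snprintf(map, sizeof(map), "%u %u 1", (unsigned)gid, (unsigned)gid);
	return write_str(procdir, "gid_map", map);
}
```

The order matters: setgroups first, then the maps, and all of it before the unshare(CLONE_NEWNS) call if you want to show up as yourself rather than "nobody" inside the namespace.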

Enter Arch Linux (doh!)

User namespaces are available in Linux 3.8, except where they aren't. Some distributions currently have CONFIG_USER_NS disabled (see Arch Linux for an example). I'm not sure if there's a better solution for supporting these distributions, but for now I am trying to have tup support filesystem namespaces both with kernels that have CLONE_NEWUSER, and with kernels that don't by setting the suid bit on tup. If the kernel doesn't have user namespaces and tup doesn't have the suid bit set, we fall back to the degraded mode of having the .tup/mnt paths visible in subprocesses (but see Future Work below).
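The three-way fallback described above can be sketched as a small decision function. This is only an illustration: the enum and function names are invented here and are not tup's. The probe assumes that the presence of /proc/self/ns/user means the kernel was built with CONFIG_USER_NS, which is a hint rather than a guarantee, since a distribution can still block unprivileged use through other settings.

```c
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical names for the three modes described in the text. */
enum ns_mode {
	NS_MODE_USERNS,   /* unprivileged: user namespace, then mount namespace */
	NS_MODE_SUID,     /* suid root: mount namespace directly */
	NS_MODE_DEGRADED  /* no namespace: .tup/mnt paths stay visible */
};

/* Pick a mode from whether user namespaces look usable and from the
 * effective uid (0 when the binary is running suid root). */
enum ns_mode pick_ns_mode(int have_userns, uid_t euid)
{
	if(have_userns)
		return NS_MODE_USERNS;
	if(euid == 0)
		return NS_MODE_SUID;
	return NS_MODE_DEGRADED;
}

/* Rough probe: non-zero if the kernel exposes a user namespace link for
 * this process. Actually calling unshare(CLONE_NEWUSER) in a child and
 * checking the result would be the more reliable test. */
int userns_probably_available(void)
{
	return access("/proc/self/ns/user", F_OK) == 0;
}
```

A caller would use it as pick_ns_mode(userns_probably_available(), geteuid()).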

Sounds simple enough, but I ran into a snag while implementing filesystem namespaces with suid support. Tup would hang in readdir() calls, and it turns out this is because the bind-mounted filesystem was visible globally instead of just in the container namespace. Here is a demo program in C that shows the filesystem namespace working properly with user-namespaces enabled:

nsdemo1.c

pid = fork();
if(pid == 0) {
        mkdir("/tmp/demo-ns", 0755);
        if(unshare(CLONE_NEWUSER) < 0 ){
                perror("unshare(CLONE_NEWUSER)");
                exit(1);
        }
        if(unshare(CLONE_NEWNS) < 0) {
                perror("unshare(CLONE_NEWNS)");
                exit(1);
        }

        sleep(1);
        system("echo -n '\tChild: ' && ls -l /proc/self/ns/mnt");
        if(mount("/usr", "/tmp/demo-ns", "", MS_BIND, NULL) < 0) {
                perror("mount");
                exit(1);
        }
        system("echo -n '\tChild: mount is: '; mount | grep demo-ns; echo ''");
        sleep(3);
        if(umount("/tmp/demo-ns") < 0) {
                perror("umount");
                fprintf(stderr, "\033[31mERROR UNMOUNTING\033[0m\n");
        }
} else {
        printf("Parent: child is %i\n", pid);
        system("echo 'Parent: ' && ls -l /proc/self/ns/mnt");
        sleep(2);
        system("echo -n 'Parent: mount is: \033[31m'; mount | grep demo-ns; echo '\033[0m'");
        wait(NULL);
}

It's ugly because I'm doing shell scripting in C, but since tup ultimately is doing this stuff in C code and I found it hard to get examples in C code, I'm publishing it this way so that you can download and play with it. For actual shell scripting, look up unshare(1) rather than unshare(2). Also, yes, I'm using sleep() as a synchronization primitive :P

The nsdemo1.c file requires user namespaces enabled in your kernel to work. If those are supported, this should work out of the box as a normal user. You should see something like this when you run it:

$ gcc nsdemo1.c -o nsdemo1
$ ./nsdemo1
Parent: 
lrwxrwxrwx 1 marf marf 0 Oct  9 16:48 /proc/self/ns/mnt -> mnt:[4026531840]
	Child: lrwxrwxrwx 1 nobody nogroup 0 Oct  9 16:48 /proc/self/ns/mnt -> mnt:[4026532347]
	Child: mount is: /dev/sda1 on /tmp/demo-ns type ext4 (rw,noatime,nodiratime,discard,errors=remount-ro)

Parent: mount is: 

Look at the filesystem namespace IDs in the mnt:[...] links: they are different for the parent and child. Additionally, the /tmp/demo-ns mountpoint isn't visible from the parent, which we can tell since there is nothing after the "Parent: mount is:" text.

Now let's look at the same program on a kernel that doesn't support CLONE_NEWUSER, so we remove the first unshare() call:

nsdemo2.c


-        if(unshare(CLONE_NEWUSER) < 0 ){
-                perror("unshare(CLONE_NEWUSER)");
-                exit(1);
-        }

If we run this, the results are rather surprising:

# WARNING: Don't actually download random code from the internet and run it as suid root unless you know what it does.
$ gcc nsdemo2.c -o nsdemo2 && sudo chown root:root nsdemo2 && sudo chmod u+s nsdemo2
$ ./nsdemo2
Parent: child is 32278
Parent: 
lrwxrwxrwx 1 root root 0 Oct  9 16:50 /proc/self/ns/mnt -> mnt:[4026531840]
	Child: lrwxrwxrwx 1 root root 0 Oct  9 16:50 /proc/self/ns/mnt -> mnt:[4026532346]
	Child: mount is: /dev/sda1 on /tmp/demo-ns type ext4 (rw,noatime,nodiratime,discard,errors=remount-ro)

Parent: mount is: /dev/sda1 on /tmp/demo-ns type ext4 (rw,noatime,nodiratime,discard,errors=remount-ro)

This is effectively the same test as before — we created a new filesystem namespace and a new mount within that namespace, yet in this case the new mount is visible in the parent! What's going on here?

Enter /etc/mtab (doh!)

One thing I stumbled across that is worth mentioning is that using "mount" to list the mounts visible from the current namespace may be inaccurate. In particular, /etc/mtab may just be a normal file that gets updated manually by the mount command, and that file isn't scoped to your particular namespace. See this Stack Overflow post for some more information. However, this wasn't actually a problem for me because /etc/mtab is a symlink to /proc/mounts on my machine. Using /proc/mounts or /proc/self/mountinfo is a better way to look at the mounts visible in your namespace.
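If you want to do that check from C instead of piping "mount" through grep, something like the following sketch should work. It uses the setmntent()/getmntent() functions from <mntent.h> to read a mounts table; the function name mount_listed() is made up for this example, and the table path is a parameter so you can point it at /proc/self/mounts (or at any file in the same format).

```c
#include <mntent.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if mountpoint appears as a mount directory in the given
 * mounts table, 0 if it does not, and -1 if the table can't be opened. */
int mount_listed(const char *table, const char *mountpoint)
{
	FILE *f = setmntent(table, "r");
	struct mntent *m;
	int found = 0;

	if(!f)
		return -1;
	while((m = getmntent(f)) != NULL) {
		if(strcmp(m->mnt_dir, mountpoint) == 0) {
			found = 1;
			break;
		}
	}
	endmntent(f);
	return found;
}
```

Calling mount_listed("/proc/self/mounts", "/tmp/demo-ns") from both the parent and the child answers the same question as the demo's grep, but always from the calling process's own namespace.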

Enter systemd (doh!)

It turns out it wasn't a problem of measuring the wrong thing, but in fact the new mount in the child namespace was actually being shared with the parent (global) namespace. The reason for this starts to become clear by looking at /proc/self/mountinfo:

22 0 8:1 / / rw,noatime,nodiratime shared:1 - ext4 ...

The 'shared' flag tells the kernel that this mount is shared between namespaces. If not explicitly specified, this flag flows to any submounts, such as the bind mount we are making in our program (at least, I think that's how it works). So even though we are in a separate namespace, our bind mount becomes shared with the parent namespace, effectively nullifying the entire point of the namespace to begin with. But according to the kernel documentation on shared subtrees, the default mode should be private, meaning namespaces should work as expected. What's going on here?

It turns out that the systemd developers decided to override the kernel's default setting of 'private' to their own default setting of 'shared'. This means that on Linux machines with systemd, the default is shared (filesystem namespaces don't work out of the box), while on Linux machines without systemd, the default is private (filesystem namespaces work out of the box). Essentially, systemd decided to make it so that there is no default that end programs can rely on. Any program that wants to work across all Linux distributions must instead explicitly mark the root filesystem as private if it wants private namespaces, or as shared if it wants shared namespaces. I'm pretty sure this was done to frustrate as many people as possible.

In a shell script you can do 'mount --make-rprivate /' in the child namespace before creating the private bind mount. It took me a while to figure out how to do that in C, but the magic command is 'mount("none", "/", "", MS_REC | MS_PRIVATE, NULL)'. (You could also use MS_SLAVE instead of MS_PRIVATE if you want to receive mounts from the parent filesystem. Note this is somewhat similar to remounting the root filesystem, but adding MS_REMOUNT to the flags causes the command to not work). Here's the final demo program that works:

nsdemo3.c

        if(unshare(CLONE_NEWNS) < 0) {
                perror("unshare(CLONE_NEWNS)");
                exit(1);
        }

        sleep(1);
        system("echo -n '\tChild: ' && ls -l /proc/self/ns/mnt");
+        system("echo -n '\tChild: before: '; cat /proc/self/mountinfo | grep '/ / '");
+        if(mount("none", "/", "", MS_REC | MS_SLAVE, NULL) < 0) {
+                perror("mount(/)");
+                exit(1);
+        }
+        system("echo -n '\tChild: after: '; cat /proc/self/mountinfo | grep '/ / '");
        if(mount("/usr", "/tmp/demo-ns", "", MS_BIND, NULL) < 0) {
                perror("mount");
                exit(1);
        }

# WARNING: Don't actually download random code from the internet and run it as suid root unless you know what it does.
$ gcc nsdemo3.c -o nsdemo3 && sudo chown root:root nsdemo3 && sudo chmod u+s nsdemo3
$ ./nsdemo3
Parent: child is 1466
Parent: 
lrwxrwxrwx 1 root root 0 Oct  9 18:04 /proc/self/ns/mnt -> mnt:[4026531840]
	Child: lrwxrwxrwx 1 root root 0 Oct  9 18:04 /proc/self/ns/mnt -> mnt:[4026532346]
	Child: before: 105 98 8:1 / / rw,noatime,nodiratime shared:1 - ext4 /dev/disk/by-uuid/9c860799-01c3-49b4-b8ff-7ae5930cca13 rw,discard,errors=remount-ro
	Child: after: 105 98 8:1 / / rw,noatime,nodiratime master:1 - ext4 /dev/disk/by-uuid/9c860799-01c3-49b4-b8ff-7ae5930cca13 rw,discard,errors=remount-ro
	Child: mount is: /dev/sda1 on /tmp/demo-ns type ext4 (rw,noatime,nodiratime,discard,errors=remount-ro)

Parent: mount is: 

Now the bind mount in the child is hidden from the parent. With this technique, filesystem namespaces work if you have either a suid binary or user namespace support.

Future Work

Tup used to use the pgid to determine which subprocess a file access belonged to, but that was changed in favor of the @tupjob marker. If we switch back to pgids, it might be possible to just mount the FUSE filesystem on top of the main filesystem, and allow any processes that access the FS without a proper pgid to just pass through unimpeded. Effectively, /home/mshal/mozilla-central would be the FUSE filesystem for all processes in the system. This might be worth investigating, so that tup can work properly even without filesystem namespaces. Containerizing the subprocesses is certainly much more elegant, when it can be made to work.

If you have any suggestions or corrections, please leave a comment. This is the first time I've tried to use namespaces, so I'm sure there is still much to learn.
