If you don't know what an "abstraction leak" is, please refer to Joel's blog post on the subject.

Finally, I patched it. The patch had been pending for several months, always postponed with "we do not have time to do this right now." However, we fixed several abstraction leaks, which increased the load on the network, and I couldn't postpone it any longer.

UnionFS

UnionFS is a file system that makes a union of several directory trees. The directory trees mounted under a union are organized as a layer cake: whenever you try to find a file in a union, it looks for this file on each of the file systems involved, in a user-specified order, until it finds one.
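The layered lookup can be sketched in a few lines of C. This is a toy model, not UnionFS code: the struct layer type and the union_lookup function are hypothetical names, and each layer is just an in-memory list of paths standing in for a real directory tree.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One layer of the union: a fixed list of the paths it contains.
 * Hypothetical stand-in for a real directory tree. */
struct layer {
    const char **paths;
    size_t npaths;
};

/* Does this single layer contain the path? */
static int layer_has(const struct layer *l, const char *path)
{
    for (size_t i = 0; i < l->npaths; i++)
        if (strcmp(l->paths[i], path) == 0)
            return 1;
    return 0;
}

/* Walk the layers in order (index 0 is the top of the cake) and return
 * the index of the first layer that contains the path, or -1 if no
 * layer has it. */
int union_lookup(const struct layer *layers, size_t nlayers, const char *path)
{
    assert(layers != NULL && path != NULL);
    for (size_t i = 0; i < nlayers; i++)
        if (layer_has(&layers[i], path))
            return (int)i;
    return -1;
}
```

Note that a lookup for an absent file has to visit every layer before it can return -1, so a miss is the most expensive kind of lookup.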

However, the most interesting things happen if you try to write something into a union, or to delete a file from it. If you turn on the copy-on-write flag, a higher-level writable file system can mask files on lower-level read-only file systems: a write copies the file to the higher writable layer and modifies it there, and a delete creates a special masking record on the upper writable file system.

UnionFS is useful for representing file systems that originate from large read-only directory trees (such as those burned on a CD) with just a few changes made by the user (yes, think install CDs).

In our project, we used it to create a naïve transparent network FS implementation.

Two posts ago, I whined about the issue with our file system setup for distributed computation. In short, the C preprocessor relies on a lot of failed file lookups when handling #include directives, and failed lookups were never cached in our naïve UnionFS + SSH-FS software stack. UnionFS, being a union of several layers of directory trees, always searched for an absent file all the way down to the lowest layer, which sat behind a slow network. And SSH-FS simply didn't cache failed file requests.

As you already know, I am a fan of caching. Naturally, the solution seemed to be to cache the failed requests somehow. But what if I cache a failed file request, and then the file appears?

Thus, the goal was to devise a conceptual solution that doesn't look like a hack. The desired solution should rely on some property of our file system which implies that caching failed lookups is sound.

Monotonic file system

The property wasn't too hard to discern. I called it monotonicity. Let's define a monotonic file system as one to which files are never added. On such a file system, a lookup that has failed once will keep failing, so caching the failure is sound. From the moment they are mounted as UnionFS layers, all read-only file systems that lie below the first writable layer are assumed to be monotonic.

While this wasn't strictly the case for the file system as a whole (it had files added to it), the subset of files that were accessed remotely was monotonic.

Implementation details

UnionFS has a mechanism for "masking" files on read-only layers as if they were deleted (see side note). My solution was to mask a file after a lookup for it has failed.
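To show why masking on a miss helps, here is a self-contained toy model in C. It is only a sketch under simplifying assumptions: struct toy_union, toy_getattr and the slow_probes counter are hypothetical names invented for this illustration, the whiteout list lives in memory rather than on a writable branch, and the path pointer is stored without copying.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_WHITEOUTS 64

/* Toy union: a top layer that only records whiteouts, plus one slow
 * read-only layer. None of this is real UnionFS API. */
struct toy_union {
    const char **ro_paths;                /* contents of the slow layer */
    size_t nro;
    const char *whiteouts[MAX_WHITEOUTS]; /* paths masked in the top layer */
    size_t nwhiteouts;
    bool monotonic;                       /* mask paths whose lookup failed */
    unsigned slow_probes;                 /* times the slow layer was asked */
};

static bool in_list(const char **list, size_t n, const char *path)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(list[i], path) == 0)
            return true;
    return false;
}

/* Returns 0 if the path exists and -ENOENT otherwise, following the
 * FUSE handler convention of returning a negated errno value. */
int toy_getattr(struct toy_union *u, const char *path)
{
    assert(u != NULL && path != NULL);

    /* A whiteout in the top layer answers the question locally. */
    if (in_list(u->whiteouts, u->nwhiteouts, path))
        return -ENOENT;

    u->slow_probes++; /* stands for the expensive network round trip */
    if (in_list(u->ro_paths, u->nro, path))
        return 0;

    /* Miss on a monotonic layer: the file cannot appear later, so mask
     * it. The toy keeps the caller's pointer; real code would copy. */
    if (u->monotonic && u->nwhiteouts < MAX_WHITEOUTS)
        u->whiteouts[u->nwhiteouts++] = path;
    return -ENOENT;
}
```

In this model the first lookup of a missing header costs one probe of the slow layer, and every later lookup of the same path is answered from the whiteout list without touching it.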

First, I needed to determine what system calls the C preprocessor uses for #include file lookups. FUSE, the mechanism UnionFS was based on, intercepts system calls to access the file system, and routes them to a user program that contains handlers for them. Therefore, I needed to know the syscalls involved; for this, I used the good old strace command:

$ strace -f -T  cpp file.c 2>&1 | grep xxxx
[pid 23961] read(3, "#include <xxxx.h>\n", 18) = 18 <0.000010>
[pid 23961] open("/usr/lib/gcc/x86_64-pc-linux-gnu/4.5.2/include/xxxx.h", O_RDONLY|O_NOCTTY) = -1 ENOENT (No such file or directory) <0.000019>
[pid 23961] open("/usr/lib/gcc/x86_64-pc-linux-gnu/4.5.2/include-fixed/xxxx.h", O_RDONLY|O_NOCTTY) = -1 ENOENT (No such file or directory) <0.000015>
[pid 23961] open("/usr/include/xxxx.h", O_RDONLY|O_NOCTTY) = -1 ENOENT (No such file or directory) <0.000013>
[pid 23961] write(2, "file.c:1:18: fatal error: xxxx.h"..., 59file.c:1:18: fatal error: xxxx.h: No such file or directory) = 59 <0.000013>

The per-syscall times printed by the -T option helped me determine whether I had really fixed the problem (the times should drop dramatically once it is fixed).
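A complementary way to check the same thing without strace is to time failed lookups directly. The helper below is a minimal sketch, not something from the original experiment: the function name avg_failed_stat_seconds is made up, and it assumes a POSIX system where stat() and clock_gettime() with CLOCK_MONOTONIC are available.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <sys/stat.h>
#include <time.h>

/* Average wall-clock seconds per stat() call on a path that is expected
 * NOT to exist. Returns a negative value if the clock cannot be read,
 * if the path unexpectedly exists, or if stat() fails with an error
 * other than ENOENT. */
double avg_failed_stat_seconds(const char *path, int iterations)
{
    assert(path != NULL && iterations > 0);
    struct timespec start, end;
    struct stat st;

    if (clock_gettime(CLOCK_MONOTONIC, &start) != 0)
        return -1.0;
    for (int i = 0; i < iterations; i++) {
        if (stat(path, &st) == 0 || errno != ENOENT)
            return -1.0;
    }
    if (clock_gettime(CLOCK_MONOTONIC, &end) != 0)
        return -1.0;

    double elapsed = (double)(end.tv_sec - start.tv_sec)
                   + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
    return elapsed / iterations;
}
```

Running it against a missing header path inside the union mount before and after a change should show the same kind of drop as the strace timings, with the caveat that kernel-side caches can also affect repeated lookups of the same path.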

However, hooking FUSE's open handler did not show any activity. Having turned on debugging in UnionFS, I realized that I should have hooked the getattr handler instead.

Since I now knew which handler to hook, I patched the handling of getattr:

diff --git a/src/opts.c b/src/opts.c
index 02a4e79..c6e7a17 100644
--- a/src/opts.c
+++ b/src/opts.c
@@ -228,6 +228,8 @@ static void print_help(const char *progname) {
 	"    -o chroot=path         chroot into this path. Use this if you \n"
         "                           want to have a union of \"/\" \n"
 	"    -o max_files=number    Increase the maximum number of open files\n"
+	"    -o monotonic           treat file system as monotonic (\"delete\"\n"
+	"                                  nonexistent files)\n"
 	"\n",
 	progname);
 }
@@ -295,6 +297,9 @@ int unionfs_opt_proc(void *data, const char *arg, int key, struct fuse_args *out
 		case KEY_COW:
 			uopt.cow_enabled = true;
 			return 0;
+		case KEY_MONOTONIC:
+			uopt.monotonic = true;
+			return 0;
 		case KEY_STATFS_OMIT_RO:
 			uopt.statfs_omit_ro = true;
 			return 0;
diff --git a/src/opts.h b/src/opts.h
index 040956f..1cfb64b 100644
--- a/src/opts.h
+++ b/src/opts.h
@@ -22,6 +22,7 @@ typedef struct {
 
 	bool stats_enabled;
 	bool cow_enabled;
+	bool monotonic;
 	bool statfs_omit_ro;
 	int doexit;
 	int retval;
@@ -32,6 +33,7 @@ typedef struct {
 enum {
 	KEY_STATS,
 	KEY_COW,
+	KEY_MONOTONIC,
 	KEY_STATFS_OMIT_RO,
 	KEY_NOINITGROUPS,
 	KEY_CHROOT,
diff --git a/src/unionfs.c b/src/unionfs.c
index f05af45..a64499e 100644
--- a/src/unionfs.c
+++ b/src/unionfs.c
@@ -57,6 +57,7 @@ static struct fuse_opt unionfs_opts[] = {
 	FUSE_OPT_KEY("-V", KEY_VERSION),
 	FUSE_OPT_KEY("stats", KEY_STATS),
 	FUSE_OPT_KEY("cow", KEY_COW),
+	FUSE_OPT_KEY("monotonic", KEY_MONOTONIC),
 	FUSE_OPT_KEY("noinitgroups", KEY_NOINITGROUPS),
 	FUSE_OPT_KEY("statfs_omit_ro", KEY_STATFS_OMIT_RO),
 	FUSE_OPT_KEY("chroot=%s,", KEY_CHROOT),
@@ -188,6 +189,10 @@ static int unionfs_getattr(const char *path, struct stat *stbuf) {
 	}
 
 	int i = find_rorw_branch(path);
+
+	// If we got an ENOENT, and we assume a monotonic FS, then we should whiteout this file
+	if (i == -1 && errno == ENOENT) monotonic_unlink(path);
+
 	if (i == -1) return -errno;
 
 	char p[PATHLEN_MAX];
@@ -335,6 +340,9 @@ static int unionfs_open(const char *path, struct fuse_file_info *fi) {
 		i = find_rorw_branch(path);
 	}
 
+	// If we got an ENOENT, and we assume a monotonic FS, then we should whiteout this file
+	if (i == -1 && errno == ENOENT) monotonic_unlink(path);
+
 	if (i == -1) return -errno;
 
 	char p[PATHLEN_MAX];
diff --git a/src/unlink.c b/src/unlink.c
index 06a7fa2..7f3275e 100644
--- a/src/unlink.c
+++ b/src/unlink.c
@@ -96,3 +96,27 @@ int unionfs_unlink(const char *path) {
 
 	return -res;
 }
+
+/**
+ * whiteout file on the first rw branch available
+ */
+void monotonic_unlink(const char* path)
+{
+	DBG_IN();
+	// monotonic check
+	if (!uopt.monotonic) return;
+	// save errno: if we can't whiteout, we should return -ENOENT anyway
+	int old_errno = errno;
+
+	// find the highest RW branch (yes, not a typo)
+	int j;
+	j = find_lowest_rw_branch(uopt.nbranches);
+
+	// All branches are RO... then just return the error
+	if (j == -1) return;
+
+	// Whiteout the file
+	hide_file(path, j);
+	// restore errno (hide_file may have changed it) so the caller still returns -ENOENT
+	errno = old_errno;
+}
diff --git a/src/unlink.h b/src/unlink.h
index 5223258..8dc145b 100644
--- a/src/unlink.h
+++ b/src/unlink.h
@@ -8,5 +8,6 @@
 #define UNLINK_H
 
 int unionfs_unlink(const char *path);
+void monotonic_unlink(const char* path);
 
 #endif

And that was it. That was my file system patch! It worked well: the preprocessing times, confirmed both by the strace timings and by real experiments, dropped by a factor of more than fifty! This is what patching a file system for your particular case can yield.

***

I rarely understood what benefits people referred to when they described patching a file system to make their corporate servers work better (i.e. faster or more reliably). Having patched UnionFS, I realized that two hours of work may give a 50x performance improvement (well, Amdahl's law lowers it to a 7x improvement overall), just because you are not afraid of digging that far.
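As a rough cross-check of how 50x turns into 7x, Amdahl's law can be solved for the fraction of the run time that was spent in the sped-up part. The numbers below are back-computed from the two rounded figures in the text, so the resulting fraction is an estimate, not a measurement.

```latex
% Amdahl's law: p = fraction of the original run time spent in the part
% that was sped up, s = speedup of that part, S = overall speedup.
S = \frac{1}{(1 - p) + \dfrac{p}{s}}
\quad\Longrightarrow\quad
p = \frac{1 - 1/S}{1 - 1/s}

% With the rounded figures from the text, s = 50 and S = 7:
p = \frac{1 - 1/7}{1 - 1/50} = \frac{6/7}{49/50} = \frac{300}{343} \approx 0.87
```

In other words, under these assumptions roughly 87% of the original time went into the lookups that the patch made fast, and the remaining 13% or so caps the overall gain.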

I doubt that upstream would accept my patch, as it's probably too specific. But the milestone has been reached: now I may proudly declare that I patched a file system.