Recent blog updates

Shown here are only posts related to filesystem. You can view all posts here.

Typical Janitor's Failures

How easy is it, to write a program that removes a file or a folder? Okay, assume you have all the power a modern operating system provides you, including rm shell convenience command, and unlink system call. Do you really think the answer is still "very"?.. Then take a look at these examples.

Permission trap

Today one of my peers, Denis, was writing a program that somehow analyzed packages in Mandriva Linux distribution. Since trying to analyze the contents of a package without unpacking it looks more like gynecologist's work than that of a programmer, the software Denis wrote unpacked it first. Then it analyzed the package, and erased it with rm -rf in order to avoid disk pollution.

But sometimes it failed to erase it, which led to strange failures of the whole program.

The reason was the permission trap. Common sense tells us that if you can create a file, then you can remove it. This is wrong when it comes to uncommon permissions. Certain software that is very secure sets up specific attributes for certain files. In particular, it revoked writing permissions from certain catalogs, which prevented recursive removal of the unpacked files. You may try it yourself:

$ mkdir -p /tmp/try/this; chmod a-w /tmp/try; rm -rf /tmp/try
rm: cannot remove `/tmp/try/this': Permission denied

Oops. Even if you're the owner of the file, you won't remove it. Even --force doesn't help. To avoid this, I guess, you should chmod files recursively before removing them.

Why does Chromium remove its files twice?

By the way, I encountered a couple of engineers from PVS Studio when they visited Institute for System Programming, in which I worked; they were looking for a mentor for a Ph.D. thesis one of them was going to defend. Small world!

One more totally not interesting thing is that they are not going to support Linux because there is no enough users of their software on that platform. This is both funny and sad: on the one hand, Linux should have much more support, but on the other hand, if your static analysis tool is not decoupled enough to be somewhat platform-independent, then you do not deserve a Ph.D.

I read about this interesting issue here. Most likely, it won't hit you in the nearest years if you're using Linux, but for Windows users it might be interesting and maybe useful. Let me quote the piece of code they referenced

Why in the world would anyone try to remove a file twice determinedly? What was the programmer trying to achieve? Let's quote that article again:

A file can be definitely removed or cannot be removed at all only in textbooks and in some abstract world. In the real system it often happens that a file cannot be removed right now and can be removed an instance later. There may be many reasons for that: antivirus software, viruses, version control systems and whatever. Programmers often do not think of such cases. They believe that when you cannot remove a file you cannot remove it at all. But if you want to make everything well and avoid littering in directories, you should take these extraneous factors into account.

"Remove" vs. "unlink"

Okay, so you've managed to remove a file. The question is, have you freed some space on your disk? The answer is maybe.

Removing is called unlinking on purpose. What you do by "unlinking" is removing a reference from a directory index to a certain chunk on your hard disk. The chunk, nevertheless, may persist, especially if there are more references to it. One of the possibilities is hard links, but you explicitly created them. Another could be a badly working file system, but that was your choice.

However, there is a very harsh way to experience the difference between removing and unlinking. Assume you have a long-running program that writes something to its log file. The information it writes is not very useful, and you might want to discard it. As the program spends hours of working, the log file grows larger and larger, and you realize that you really should erase the log to prevent your disk from filling up. You rm -rf log_file, and... nothing happens. The log is still being written somewhere, and there is as little space as there was.

The thing is that you unlinked the file, but the file system's directory index wasn't the last place it was referenced from. Another place is the file handler of the program running, and it will be referenced until the file is closed. Otherwise, as far as I know, the only way to fix it is to shut the program down. So, do not turn on too much logging by default.

Uncannily removing sensitive files

Of course, you know this. It's like knowing that you should move carefully in a room full of rakes and nevertheless getting hit by one because it's just so obvious...

These are classical rm -rf $FOO/* when $FOO happens to be unset... These are when you type rm -rf /etc, and thus crush your system, instead of rm -rf etc to remove a backup copy of /etc you were so prudent to create. These are the famous bumblebee bugfix—or was it mere viral marketing?


Usually, you think that creating is easy, and destroying is simple. In many cases, however, it's not true. In Russia, for instance, it's much easier to open a business than to shut it. And when it gets to removing files, it's also sometimes true. So, if your program really depends on successful file removing, it should be more careful with such a seemingly simple operation.

Read on | Comments (2) | Make a comment >>

Do not use Btrfs!

"Btrfs" is a new, modern filesystem, which has always been a nice thing to try for any Linux aficionado. When you're bored, and if your Linux installation is too stable, you can always try one of new cool filesystems to make your machine faster, cooler, and less stable.

ReiserFS, named after its developer Hans Reiser, was the first Linux filesystem with journaling support. It also was known as the fastest file system to deal with large amounts of small files. It had none of the Btrfs' problems I describe here.

The successor of this filesystem, Reiser4 should have had even better small file support. It also was innovative, and allowed transient compression of content (which worked, by the way), but it has never been finished.

Hans, being in prison, doesn't continue the development of the new filesystem, and, it seems, that others just fail to keep the pace. reiser4 is still not included in Linux kernel. I'm pretty sure that it's not just because of non-technical "personality issues", but of architectural incompatibility with the kernel philosophy (and this) as well.

It's sad that prison seems to have disrupted Hans' mind. We probably will never have the same drive in Linux file system world.

After Hanz Reizer has been sentenced for a dozen of years in prison for killing his wife (I'm pretty sure he thought he had good reasons), we can't take reiser4 seriously anymore. I had a sore experience installing reiser4, but, unfortunately, it has never entered a stable state (and, most likely, never will).

When Hans, in addition to his wife, had killed my root filesystem through his unstable evil creation, I decided to install something cool again. My choice was Btrfs. It is a new fast filesystem, and it was claimed to be fast as hell. Moreover, it became the default filesystem for MeeGo mobile Linux OS standard, which is now nearly dead due to unrelated circumstances.

Aside from being just new, Btrfs offers a lot of unique features. One of them is that it's a copy-on-write filesystem, which allows a user to make snapshots of the whole file system on the same partition without consuming twice as much disk space. You can also roll back to a saved state immediately, without re-writing everything. However, I recently learned an already known way to make these operations fast with any filesystem, so Btrfs' advantage here is questionable. An "only" advantage left is its speed.

After I used Btrfs both at home and at work, I now advise to steer clear of it. At least, until its issues are resolved.

I'm not a kernel hacker of any sort, I haven't looked at Btrfs source code, and I never even read its specs. I just share my experience with it as a user. The post may already even be out-of date, and the state-of-the art tools and implementation may have already fixed the problems I describe.

Lack of proper tools

Btrfs still doesn't have file system checker that fixes the FS when computer was powered off or reset. Reference: its own wiki. I really don't understand how come Btrfs was used at real devices. I remember I really once had to run file-system check for my N900 mobile phone after an unfortunate battery exhaustion. Will my MeeGo device lose all my data after its battery dies?

And you know what? This "we still don't have file system recovery tool" message has been hanging there for more than a year. I guess, it will never be actually implemented.

So, it's already a good reason to never use it for sensitive data, such as /home partition.

Irrecoverable disk pollution

Okay, so, Btrfs is not very reliable to have, for instance, your /home partition on it. But we could use it for a temporary file system that doesn't store important data!.. could we?

I tried Btrfs on one of the nodes of our Linux Driver Verification cluster as a file system for a 130 Gb partition. Here's the workflow the filesystem underwent:

  1. Write a lot of small files
  2. Realize that the disk is nearly full
  3. Erase the small files written
  4. Go to step 1

On day I realised that the cluster node just was not working. I checked the logs and I realized that there was no space left on the device. I ran df -h to check how much space there really was, and it showed that there were... 13 gigabytes free!

It could mean only that Btrfs just damn secretly used more than 10% of disk for its own metadata. Using more disk space than the overall file size is normal, but here it uses it unpredictably: Btrfs lies about the real amount of space left. Some mailing list postings confirmed that, but I can't find the links now. The only suggestion available was to "balance" the btrfs with its tools. But it didn't work for me!. The file system remained in its unusable state.

Then I decided to try to remove some files. I typed rm -rf directory/, and... File removal resulted in a "no space left on device" error! So when I try to clear some space with rm it just outputs "no space left" errors. How nice.

However, it actually did remove the files, but the process merely took more time than usual (about 2 files removed per minute), and had less than 1% success rate. The recommended elsewhere echo >file/to/remove didn't work either. After 30 hours of removing, the filesystem has traversed the 130 Gb folder, and removed a couple of gigabytes successfully. The subsequent rm -rf cleared the disk in no time...

... just to enter the same crippled, unusable state, but with 27 Gb "free" (more than 20% of the filesystem) after a couple of weeks! That time, even rm -rfs couldn't help. We re-formatted the partition into ext4.

By the way, Edward Shishkin is the primary developer of reiser4 at the moment, i.e. he's a competitor of Btrfs. If it was not for my own experience, I wouldn't think that his observations and conclusions are grass-root.

I guess that my observations confirm Edward Shishkin's review of Btrfs: the filesystem is space-ineffeicient without any good reasons, and its tree balancing algorithms are apparently broken.

And it is also not fast!

Okay, but, initially, I mentioned that Btrfs was, at least, fast. Not just working with lots of small files should be faster, but the overall workflow should have had the speed increase.

It does not.

My home PC uses Btrfs as its root filesystem. A media storage, which I keep movies on, however, uses XFS. Now guess what happens when I try to play an HD movie from the XFS partition?

No, you're wrong. It doesn't play smoothly. It just stalls unless I increase the I/O priority of the movie player to the highest value. Because btrfs-transactio process wakes up each five seconds and sends my PC to a non-interruptible I/O sleep. This renders my PC unusable for less than 0.5 second, but it is enough to crush my HD movie experience: it just wouldn't play because disk throughput is not enough for a small, but significant amount of time.

In production, however, I also noticed that. A month ago I introduced a change to my BLAST project. I now realize that it was the Btrfs to blame, but back then I just saw that when a process is about to write to a file 10 times per second it sometimes stalls for a second before the write succeeds. This is intolerable for any serious file system.

Perhaps, during the common usage cycle, when files are not re-recorded several times per second, it wouldn't be an issue at all. However, the way I use file systems at work is not "common", and it should satisfy more requirements than those of users that just never write to their files.


Finally, the solution to Disk I/O speed seems to have come not from tricky file systems. Solid-state drives (SSDs) with their awesome read/write speeds have made filesystems much less relevant. The main challenge for file system developers would be to adapt to SSD's requirements. And it is natural! We all are pretty damn sure that sequential reads-writes should be faster than random I/O; we are used to placing our swap partitions at the beginning of the drive, to increase the speed they're accessed. We treat magnet disks requirements as granted without really noticing this.

Time to change the paradigm. As of today, if you want a faster disk I/O, the best choice would probably be an SSD, not a faster file system.

To me, as I described above, Btrfs doesn't seem a good choice both at home and in production. Its features do not overcome the fact that it's just unusable. Until it's fixed, do not use Btrfs.

Read on | Comments (0) | Make a comment >>

More posts about filesystem >>