Recent blog updates

Shown here are only posts related to Linux. You can view all posts here.

How to Watch HDTV from Internet on your TV with Linux PC (Legally)

Hooray! The Airbus finally brought me to the land where law is taken slightly more seriously than in Russia. With great responsibility, however, usually comes great power, and a lot of absolutely legal video streaming services are available in the U.S. Moreover, my corporate-sponsored temporary apartment has a large, like, 30-inch TV, and it would be a sin to just let it collect dust in silence.

No, I couldn't just watch cable TV! The very idea of watching something on a predefined schedule does not appeal to me. I'd have to spend money on a TiVo or something that records TV shows anyway. And what about DVDs and feature movies? TV shows aren't the only thing I watch, by the way. And what about shows that are no longer airing? How do you watch older seasons? No, TV is for the weak.

Sadly, I could only find how to watch TV shows in HD, not movies. I expand on this in a section about Amazon. So, TVs are certainly for the weak, but... I eventually solved the problem by buying a big-ass TV that has enough DRM to stream all kinds of HD moving pictures. Sorry, Linux fans.

Streaming even has some advantages over those filthy, illegal torrents: you can start watching an HD movie at once rather than waiting for its download to complete!

So the most rational idea was to stream movies from the Internet to the TV via my desktop PC. Surprisingly, the hard part was that I use Linux, and this was far from the usual "OMG settings are so complex in Lunupz, I can't do console!" whining.

"Nobody uses Linux... Get a real OS, nerds."

The first thing to choose is the content provider. And that's where Linux users walk into an ambush.

The thing is that the mission statement of GNU contradicts what content vendors are trying to achieve. Linux, or, more specifically, "GNU/Linux" (because we're talking about the operating system rather than the kernel here), is all about being open: letting the user see what software is running and have full access to its contents. Content providers, however, insist that DRM, one of the technologies for revoking control over content, be imposed on end users. This is why it is not easy to find Linux support in these matters.

Netflix

The first service to try was Netflix. Actually, that million dollars worked well: I had only heard about Netflix from a Communications of the ACM article (as far as I recall, it was this one) about the Netflix Prize. No wonder I decided to try it first.

When I came to the U.S., I didn't have any personal computer at all. In Russia, we have an adage, "to come to Tula with one's own samovar". Tula was famous for its samovar-makers, and the adage mocks those who carry the specific items they possess to places that naturally have a great choice of such items. So no wonder I didn't bring my, like, 15-pound chassis here. Instead, I bought a 4-core i8 from Best Buy, but for the first several days all I had was a Galaxy Note smartphone/tablet. Netflix installed smoothly from Google Play, I watched some TV shows there, and I thought I wouldn't have any problems on my desktop PC.

I was very wrong. Netflix doesn't have a way to run under a generic Linux distribution. Windows 7 and Mac: you're welcome; Android: here you are; GNU/Linux: go fsck yourself. Seemingly reassuring was that it ran under ChromeOS, but that was only for Google's proprietary Chromebooks (yes, even the new $250 Samsung ARM Chromebook). Google and Netflix recently managed to arrange seamless delivery of video to these laptops; it didn't work until March 2013. I tried it on mine, and it works! However, if you install ChromiumOS in a virtual machine, it won't work (I tried). A Netflix Chromium plugin also looked promising, but it's just a frontend to their website and doesn't add playback capabilities to Chromium itself; there are a lot of angry reviews from Linux users, and a response to them is the caption of this section. Here's a blog post with a more comprehensive overview.

Hulu

Hulu, on the contrary, has a GNU/Linux application. However, very few of their movies were available in HD, and I was not interested in those that were. So I had to skip this service.

Amazon

Yes, Amazon has a movie service as well. Marketing also worked here: I learned about this from IMDb, which was bought by Amazon years ago. And Amazon is the thing that finally worked but not without a plot twist.

The thing is that Amazon doesn't support GNU/Linux PCs for its HD movie content! Yep, that's right, check their list of compatible devices. You see a list of TiVo-s, Consoles, Smart TVs, and an iPad, but where's the PC? Nope, it's not there. Yet it works.

Not only does this stuff require Flash, it requires special DRM capabilities in Flash, which rely on the already deprecated HAL that maintainers started to remove from modern Linux distributions, and then brought back just because of Amazon. If your distribution was released in 2012-2013, this may require additional steps you can google. ROSA Marathon 2012 plays Amazon videos out of the box, though, and in ROSA Fresh you just need to urpmi hal from the contrib repository.

That's right: when you pay to stream an HD TV show, it just plays in your browser (it requires Flash, though). The list of compatible devices apparently means that if you "buy" a movie instead of renting it, you can only download it to one of those devices rather than to your PC.

I used to think that it was a mistake of sorts, but further research revealed that Amazon would probably love to stream all kinds of HD stuff to as wide a variety of users as possible. This looks like a matter of what publishers allow Amazon to do: they seemingly permit HD movies only on tivoized, DRM-ready devices, and just don't care that much about TV shows. Streaming the HD movies I tried didn't work.

Equipment

To connect a TV to your PC, you'll need some extra equipment. When I looked at the price of a 25-foot HDMI cable at Best Buy, I was shocked: it cost more than my monitor! So I bought a 25-foot cable from Amazon for $14. You should already have an Ethernet cable, but in case you don't, you may buy one there as well; and, of course, you should have a good, fast connection.

Setting up Linux

The screenshots below are from the Linux distro that I use at home, namely ROSA Marathon 2012.

First, I connected the TV via the cable, turned it on, and set the input to the corresponding slot. I'm not sure if it's necessary, but it never hurts to do this first.

Unfortunately, the KDE settings (available via "Configure Your Desktop" → "Display and Monitor") thought that I didn't have any other monitors:

It remained this way even after I managed to configure everything, so don't rely on it too much.

The second thing to try was the settings for my video card, an nVidia GeForce GT 520. That's where I finally managed to do it. This configuration tool needs to write to your /etc/X11/xorg.conf file, so make a backup first by copying it somewhere. Because of this, you should run the tool as root (sudo won't work; use its KDE-specific counterpart):

$ kdesu nvidia-settings

Got to "X server Display Configuration" tab, and press "Detect Displays." Your connected TV should appear there. You could try TwinView, but you don't want to spread your desktop across two screens (it makes no sense), and TwinView sometimes limits your bigger monitor resolution to your TV resolution, which you don't want. However, TwinView doesn't require restarting X.

So you could try TwinView, and resort to "Separate X screens" if TwinView doesn't work. If the "Separate X screens" option is not shown, don't panic, select it, and other options should become available (it's a bug); maybe, you'll have to Quit with saving configuration, and reopen again. The goal is to set the "Configuration" to "Separate X screen (requires X restart)" for both displays:

By the way, rosa-launcher (the Start Menu) doesn't work (it crashes) when several screens are set like that. The message has been sent to developers, I hope they'll fix it.

I also found out that the least intrusive is to place the TV display "Below" the main one. This means that the mouse will appear on the TV screen when you move it over the bottom edge.

Click "Save to X configuration File" (if you get an error here, you forgot to run as root with kdesu), and Quit. Restart your X server (or restart your PC), and there you go.

Now we should actually run something on that new screen. It's not that easy. There are different displays, which are separate X processes, and within one process live several "screens," which is what we have just set up.

Remember that videos play in your browser? So you should start a web browser on that other screen (this is not required if you use TwinView: just drag the browser window to where your TV screen resides). Screens are addressed nearly like displays, via the DISPLAY variable; the screen number follows the display number after a dot. I didn't manage to start a second Firefox window on that screen, unfortunately. But nowadays we do have a choice.

$ DISPLAY=:0.1 chromium-browser amazon.com
$ DISPLAY=:0.1 firefox amazon.com

Now move your mouse beyond of the bottom edge of the screen, and start the movie you like. Pro tip: start browser as usual, log in to it, and only then re-start it on another screen to avoid excessive strain on your neck.

Start the movie... wait-wait-wait, why is the sound coming from my PC speakers rather than from the TV itself?? Back to configuration. Open the KDE desktop configuration tool, click "Multimedia," select Phonon, and adjust the priority of "Audio Playback" (for the category itself, not any of its subcategories).

As usual with sound, adjust the mixer settings of the HDMI device. Use the alsamixer console command and unmute/volume-up everything there like crazy if the usual mixer doesn't give results:

Sometimes I also had to select another profile on the "Audio Hardware Setup" tab. Now you're set. The sound should be re-routed to the TV without restarting the browser, but, again, restart it if that doesn't work.

Enjoy. Here's a proof that it works:

Internet connection is not used very intensively, and I'd be able to watch, like, three HD videos simultaneously:

***

That's how you fight your way through bugs, tricky settings, and "X displays" (wtf? why do I need to worry about that?). Of course, anybody would love to avoid all that. But it's basically the same as having to select and tune things like medical insurance on your own: as with Linux, you are in the land of the free.



A Dying Tux on an Airbus

A couple of days ago I took a transatlantic flight on an Airbus A330.  It is a large plane where each seat is accompanied by a computer named "Entertainment System".  It contained games, movies, and stuff like inflight information (plane location mapped onto the world map, and flight progress details, which is kind of reassuring given that the flight lasts for 10+ hours.)  A killer feature of such system is its USB port useful when you want to read/watch something on your mobile phone (or type this story, for instance.)

Not only did I learn that day what operating system that computer ran, but I also found another example of how machines can demonstrate such inherently human feelings as courage and hope.

It all started with nothing special.  I casually opened inflight information, and fell asleep.  When I woke up, and realized that there was a couple of movies worth watching, I tried to launch them.  It didn't work.

"I am a programmer, aren't I," I thought, and started to click all the buttons on the remote control.  None of that worked.  I tried to click the near-screen buttons; they worked properly, while their remote-control counterparts did not.  I concluded that the driver for the remote died somehow, and asked the stewardess to reboot my terminal.

As it started to reboot, I noticed a lot of familiar words.  It used a bootloader named RedBoot copyrighted by RedHat, and it obviously ran Linux with a 1.59 kernel or something.

However, after "decompressing image," it showed, "decompression error: incorrect data check," and rebooted again.  The system did not boot far enough to start that entertainment.

It also displayed a reboot counter. After twenty five reboots something should have happened, and I didn't know what. Curiosity took me over, and I stared at the screen, finding amusement in the fact that I know what some of the words being printed there meant.

After 25 reboots, the computer realized that there was something wrong with it, and ran the firmware update program.  It did not help.  He entered the same circle of pain again...and again.

As I watched the guy reboot over and over, I noticed that my curiosity is being replaced by different feelings.  I felt sympathy and compassion to that guy.  It kept trying and trying, and kept failing and failing as if it didn't even know what would happen next.

There are theories that as computer programs start to gain their own cognition, they will frown upon the awareness about their own limitations.  "I envy you humans for your unknown boundaries," says a reddit bot that automatically handles pictures posted to reddit in "his" answer during the open interview (aka "AMA").  I don't know if these theories will have anything to do with reality; they do not sound illogical at all. But this little computer clearly had no awareness of his incapability to boot, and kept crucifying himself.

Pain replaced my compassion. Having watched over a hundred reboots and reflashes, I've lost all hope. "Why should we keep him alive?" I thought. "He'll just keep doing this being unaware that he broke.  Why can't we just shut him down and end the series of reboots?  Do we humans even realize how painful each reboot could be," I thought, "maybe he wants to die already?" I tried to recall what people do when euthanasia is not an option. Morphine?  Even morphine can't make the pain relinquish a computer—though, none is allowed on an aircraft anyway.

I asked the stewardess to shut the entertainment system down at my seat to release him from the pain.   "Oh, unfortunately it's impossible," she replied. "We can move you to another place if you want though: we have some extra chairs available."

I suddenly realized that I can't leave him.  To most other passengers and crew, it was just a malfunctioning computer to be rebooted, replaced, or ignored.  To me, however, it was a dying Tux  who fought for his survival against its own limitations.  I recalled how friends and relatives stay at the bed of a dying loved one, and imagined how they'll soon gather near the charger of a family robot who's about to leave for his silicon heaven after a long, deadly malfunctioning.

I sat by his side for the rest of the flight.

And he finally succeeded! After six hours of fruitless reboots, he finally reflashed a working kernel, and booted himself! The usual routine went on, the first-boot symlinks were set up, the X11 system started, showing its tapestry with the well-known black-and-white embroidery. I could have launched one of the movies I wanted several hours ago, but I gave the Tux some rest instead. He deserved it.

So I experienced this enlightening encounter with a machine who fought for his life, never lost hope, ignored what was believed to be his limitations, and did finally succeed in this struggle.   Not only did he demonstrate the feelings only humans seemed to possess, the dying Tux also happened to have more faith in his own victory than the human who was sitting alongside him.




Typical Janitor's Failures

How easy is it to write a program that removes a file or a folder? Okay, assume you have all the power a modern operating system provides, including the rm shell convenience command and the unlink system call. Do you really think the answer is still "very"?.. Then take a look at these examples.

Permission trap

Today one of my peers, Denis, was writing a program that somehow analyzed packages in the Mandriva Linux distribution. Since trying to analyze the contents of a package without unpacking it looks more like a gynecologist's work than a programmer's, the software Denis wrote unpacked it first. Then it analyzed the package and erased the unpacked files with rm -rf to avoid polluting the disk.

But sometimes it failed to erase it, which led to strange failures of the whole program.

The reason was the permission trap. Common sense tells us that if you can create a file, then you can remove it. This is wrong when it comes to uncommon permissions. Certain very security-conscious software sets up specific attributes for certain files. In particular, it revokes write permissions from certain directories, which prevents recursive removal of the unpacked files. You may try it yourself:

$ mkdir -p /tmp/try/this; chmod a-w /tmp/try; rm -rf /tmp/try
rm: cannot remove `/tmp/try/this': Permission denied

Oops. Even if you're the owner of the file, you won't remove it. Even --force doesn't help. To avoid this, I guess, you should chmod the directories recursively before removing them, as sketched below.
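
A minimal Ruby sketch of that workaround (a hypothetical helper, not the code Denis wrote): restore the owner's write bit on every directory in the tree, then remove the tree.

require 'fileutils'
require 'find'

# Remove a directory tree even if write permission was revoked from
# some of its subdirectories (the "permission trap" above).
# A hypothetical helper, not the actual code from the story.
def force_remove(dir)
  return unless File.exist?(dir)
  Find.find(dir) do |path|
    next unless File.directory?(path)
    # Re-enable the owner's write bit so the entries inside
    # this directory can be unlinked.
    File.chmod((File.stat(path).mode & 0o7777) | 0o200, path)
  end
  FileUtils.rm_rf(dir)
end

force_remove("/tmp/try")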

Why does Chromium remove its files twice?

By the way, I encountered a couple of engineers from PVS-Studio when they visited the Institute for System Programming, where I worked; they were looking for a mentor for a Ph.D. thesis one of them was going to defend. Small world!

One more totally not interesting thing is that they are not going to support Linux because there are not enough users of their software on that platform. This is both funny and sad: on the one hand, Linux should have much more support; on the other hand, if your static analysis tool is not decoupled enough to be somewhat platform-independent, then you do not deserve a Ph.D.

I read about this interesting issue here. Most likely, it won't hit you in the coming years if you're using Linux, but for Windows users it might be interesting and maybe useful. Let me quote the piece of code they referenced; it attempts the very same removal twice in a row.

Why in the world would anyone try to remove a file twice determinedly? What was the programmer trying to achieve? Let's quote that article again:

A file can be definitely removed or cannot be removed at all only in textbooks and in some abstract world. In a real system it often happens that a file cannot be removed right now but can be removed an instant later. There may be many reasons for that: antivirus software, viruses, version control systems and whatever. Programmers often do not think of such cases. They believe that when you cannot remove a file you cannot remove it at all. But if you want to make everything well and avoid littering in directories, you should take these extraneous factors into account.
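
Restated as a rough Ruby sketch (the general idea only, not Chromium's actual code): retry the removal a few times with a short pause, because whatever holds the file may let go a moment later.

require 'fileutils'

# Try to remove a path several times before giving up: an antivirus
# scanner or an indexer may hold the file for a short while right
# after we are done with it. A sketch, not Chromium's actual code.
def remove_with_retries(path, attempts = 5, delay = 0.2)
  attempts.times do
    FileUtils.rm_rf(path)
    return true unless File.exist?(path)
    sleep delay
  end
  !File.exist?(path)
end

remove_with_retries("stale_directory") or warn "could not remove it"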

"Remove" vs. "unlink"

Okay, so you've managed to remove a file. The question is, have you freed some space on your disk? The answer is maybe.

Removing is called unlinking on purpose. What you do by "unlinking" is removing a reference from a directory index to a certain chunk on your hard disk. The chunk, nevertheless, may persist, especially if there are more references to it. One of the possibilities is hard links, but you explicitly created them. Another could be a badly working file system, but that was your choice.

However, there is a very harsh way to experience the difference between removing and unlinking. Assume you have a long-running program that writes something to its log file. The information it writes is not very useful, and you might want to discard it. As the program spends hours working, the log file grows larger and larger, and you realize that you really should erase the log to prevent your disk from filling up. You rm -rf log_file, and... nothing happens. The log is still being written somewhere, and there is as little space as there was.

The thing is that you unlinked the file, but the file system's directory index wasn't the last place it was referenced from. Another such place is the file handle of the running program, and the data stays referenced until that file is closed. As far as I know, the only way to actually free the space is to shut the program down. So, do not turn on too much logging by default.
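
You can observe this on a small scale with a Ruby sketch (the file name is made up): the directory entry disappears, but the space stays in use until the last open handle is closed.

# The directory entry disappears, but the blocks stay allocated
# until the last open handle is closed. The file name is made up.
log = File.open("/tmp/big.log", "w")
log.write("x" * 10_000_000)    # ~10 MB of "logs"
log.flush

File.unlink("/tmp/big.log")    # "rm" the file: the directory entry is gone...

log.write("more data")         # ...but the handle still works,
log.flush                      # and the space is still in use.

log.close                      # only now is the space actually released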

Uncannily removing sensitive files

Of course, you know this. It's like knowing that you should move carefully in a room full of rakes and nevertheless getting hit by one because it's just so obvious...

These are the classical rm -rf $FOO/* when $FOO happens to be unset... These are the cases when you type rm -rf /etc, and thus crash your system, instead of rm -rf etc to remove a backup copy of /etc you were so prudent to create. This is the famous Bumblebee bugfix (or was it mere viral marketing?).

***

Usually, you think that creating is hard, and destroying is simple. In many cases, however, that's not true. In Russia, for instance, it's much easier to open a business than to shut it down. And when it comes to removing files, the same sometimes holds. So, if your program really depends on removing files successfully, it should be more careful with such a seemingly simple operation.



Do not use Btrfs!

"Btrfs" is a new, modern filesystem, which has always been a nice thing to try for any Linux aficionado. When you're bored, and if your Linux installation is too stable, you can always try one of new cool filesystems to make your machine faster, cooler, and less stable.

ReiserFS, named after its developer Hans Reiser, was the first Linux filesystem with journaling support. It also was known as the fastest file system to deal with large amounts of small files. It had none of the Btrfs' problems I describe here.

The successor of this filesystem, Reiser4, should have had even better small-file support. It was also innovative and allowed transparent compression of content (which worked, by the way), but it has never been finished.

Hans, being in prison, no longer develops the new filesystem, and, it seems, others just fail to keep up the pace. Reiser4 is still not included in the Linux kernel. I'm pretty sure that it's not just because of non-technical "personality issues", but because of architectural incompatibility with the kernel philosophy (and this) as well.

It's sad that prison seems to have disrupted Hans' mind. We probably will never have the same drive in Linux file system world.

After Hans Reiser was sentenced to a dozen years in prison for killing his wife (I'm pretty sure he thought he had good reasons), we can't take reiser4 seriously anymore. I had a sore experience installing reiser4; unfortunately, it has never reached a stable state (and, most likely, never will).

When Hans, in addition to his wife, killed my root filesystem through his unstable evil creation, I decided to install something cool again. My choice was Btrfs. It was a new filesystem, and it was claimed to be fast as hell. Moreover, it became the default filesystem for the MeeGo mobile Linux OS standard, which is now nearly dead due to unrelated circumstances.

Aside from being just new, Btrfs offers a lot of unique features. One of them is that it's a copy-on-write filesystem, which allows a user to make snapshots of the whole file system on the same partition without consuming twice as much disk space. You can also roll back to a saved state immediately, without re-writing everything. However, I recently learned an already known way to make these operations fast with any filesystem, so Btrfs' advantage here is questionable. An "only" advantage left is its speed.

After I used Btrfs both at home and at work, I now advise to steer clear of it. At least, until its issues are resolved.

I'm not a kernel hacker of any sort, I haven't looked at Btrfs source code, and I never even read its specs. I just share my experience with it as a user. The post may already even be out-of date, and the state-of-the art tools and implementation may have already fixed the problems I describe.

Lack of proper tools

Btrfs still doesn't have a file system checker that fixes the FS after the computer was powered off or reset. Reference: its own wiki. I really don't understand how Btrfs came to be used in real devices. I remember that I once really had to run a file-system check on my N900 mobile phone after an unfortunate battery exhaustion. Will my MeeGo device lose all my data after its battery dies?

And you know what? This "we still don't have a file system recovery tool" message has been hanging there for more than a year. I guess it will never actually be implemented.

So, it's already a good reason to never use it for sensitive data, such as /home partition.

Irrecoverable disk pollution

Okay, so, Btrfs is not very reliable to have, for instance, your /home partition on it. But we could use it for a temporary file system that doesn't store important data!.. could we?

I tried Btrfs on one of the nodes of our Linux Driver Verification cluster as a file system for a 130 Gb partition. Here's the workflow the filesystem underwent:

  1. Write a lot of small files
  2. Realize that the disk is nearly full
  3. Erase the small files written
  4. Go to step 1

One day I realized that the cluster node just was not working. I checked the logs and realized that there was no space left on the device. I ran df -h to check how much space there really was, and it showed that there were... 13 gigabytes free!

It could only mean that Btrfs just damn secretly used more than 10% of the disk for its own metadata. Using more disk space than the overall file size is normal, but here it did so unpredictably: Btrfs lied about the real amount of space left. Some mailing list postings confirmed that, but I can't find the links now. The only suggestion available was to "balance" the filesystem with the Btrfs tools. But it didn't work for me! The file system remained in its unusable state.

Then I decided to try to remove some files. I typed rm -rf directory/, and... file removal resulted in a "no space left on device" error! So when I tried to clear some space with rm, it just spat out "no space left" errors. How nice.

However, it actually did remove the files; the process merely took far more time than usual (about 2 files removed per minute) and had less than a 1% success rate. The echo >file/to/remove trick recommended elsewhere didn't work either. After 30 hours of removing, the filesystem had traversed the 130 Gb folder and removed a couple of gigabytes successfully. A subsequent rm -rf cleared the disk in no time...

... only to enter the same crippled, unusable state, but with 27 Gb "free" (more than 20% of the filesystem), after a couple of weeks! That time, even rm -rf couldn't help. We re-formatted the partition to ext4.

By the way, Edward Shishkin is currently the primary developer of reiser4, i.e. he works on a competitor of Btrfs. If it were not for my own experience, I wouldn't have believed that his observations and conclusions are well-grounded.

I guess my observations confirm Edward Shishkin's review of Btrfs: the filesystem is space-inefficient without any good reason, and its tree balancing algorithms are apparently broken.

And it is also not fast!

Okay, but I initially mentioned that Btrfs was, at least, supposed to be fast. Not only should working with lots of small files be faster; the overall workflow should get a speed increase, too.

It does not.

My home PC uses Btrfs as its root filesystem. A media storage, which I keep movies on, however, uses XFS. Now guess what happens when I try to play an HD movie from the XFS partition?

No, you're wrong. It doesn't play smoothly. It just stalls unless I increase the I/O priority of the movie player to the highest value, because the btrfs-transaction process wakes up every five seconds and sends my PC into non-interruptible I/O sleep. This renders my PC unusable for less than 0.5 seconds, but it is enough to ruin my HD movie experience: the movie just won't play, because disk throughput is insufficient for a small but significant amount of time.

I also noticed this in production. A month ago I introduced a change to my BLAST project. I now realize that Btrfs was to blame, but back then I just saw that when a process writes to a file about 10 times per second, it sometimes stalls for a second before the write succeeds. This is intolerable for any serious file system.

Perhaps, during a common usage cycle, when files are not rewritten several times per second, it wouldn't be an issue at all. However, the way I use file systems at work is not "common", and a file system should satisfy more requirements than those of users who barely ever write to their files.

***

In the end, the solution to disk I/O speed seems to have come not from tricky file systems. Solid-state drives (SSDs), with their awesome read/write speeds, have made filesystems much less relevant. The main challenge for file system developers will be adapting to SSDs' requirements. And it is natural! We are all pretty damn sure that sequential reads and writes should be faster than random I/O; we are used to placing our swap partitions at the beginning of the drive to increase the speed at which they're accessed. We take the requirements of magnetic disks for granted without really noticing it.

Time to change the paradigm. As of today, if you want a faster disk I/O, the best choice would probably be an SSD, not a faster file system.

To me, as I described above, Btrfs doesn't seem a good choice either at home or in production. Its features do not outweigh the fact that it's just unusable. Until it's fixed, do not use Btrfs.



So I Patched That Filesystem

If you don't know what an "abstraction leak" is, please, refer to this Joel's blog post

Finally, I patched it. The task had been pending for several months, always being postponed as "we do not have time to do this right now." However, we fixed several abstraction leaks, which increased the load on the network, and I couldn't postpone it anymore.

UnionFS

UnionFS is a file system that makes a union of several directory trees. The directory trees mounted under a union are organized as a layer cake: whenever you try to find a file in the union, it looks the file up on each of the file systems involved, in a user-specified order, until it finds one.

However, the most interesting things happen when you try to write something into a union, or to delete a file from it. If you turn on the copy-on-write flag, then a higher-level writable file system may mask files on the lower-level read-only file systems: files are copied to the writable layer and modified there, or a special masking record is created on the writable layer, respectively.

UnionFS is useful for representing file systems that originate from a large read-only directory tree (such as one burned on a CD) with just a few changes made by the user (yes, think install CDs).

In our project, we used it to create a naïve transparent network FS implementation.

Two posts ago, I whined about the issue with our filesystem setup for distributed computation. In short, the C preprocessor relies on a lot of file lookup failures when handling #include directives, and failed lookups were never cached in our naive UnionFS + SSHFS software stack. The former, being a union of several layers of directory trees, always searched for absent files all the way down to the lowest layer, which sat across a slow network. And the latter just didn't cache failed file requests.

As you already know, I am a fan of caching. Naturally, the solution seemed to be to cache the failed requests somehow. But what if I cache a failed file request, and then the file appears?

Thus, the goal was to devise a conceptual solution that doesn't look like a hack. Rather, the desired solution should have relied on some property of our file system that implied that caching of the failed lookups would be sound.

Monotonic file system

The property, however, wasn't too tough to discern. I called it being a monotonic file system. Let's define a monotonic file system as one that files are never added to. From the moment they are mounted as UnionFS layers, all read-only file systems that lie lower than the first writable layer are assumed monotonic.

While this wasn't strictly the case for the file system used as a whole (it had files added to it), the subset of files that were accessed remotely was monotonic.

Implementation details

UnionFS has a mechanism of "masking" the filed on read-only layers as if they were deleted (see side note). My solution was to mask a file after a lookup for it has failed.

First, I needed to determine what system calls the C preprocessor uses for #include file lookups. FUSE, the mechanism UnionFS was based on, intercepts system calls to access the file system, and routes them to a user program that contains handlers for them. Therefore, I needed to know the syscalls involved; for this, I used the good old strace command:

$ strace -f -T  cpp file.c 2>&1 | grep xxxx
[pid 23961] read(3, "#include <xxxx.h>\n", 18) = 18 <0.000010>
[pid 23961] open("/usr/lib/gcc/x86_64-pc-linux-gnu/4.5.2/include/xxxx.h", O_RDONLY|O_NOCTTY) = -1 ENOENT (No such file or directory) <0.000019>
[pid 23961] open("/usr/lib/gcc/x86_64-pc-linux-gnu/4.5.2/include-fixed/xxxx.h", O_RDONLY|O_NOCTTY) = -1 ENOENT (No such file or directory) <0.000015>
[pid 23961] open("/usr/include/xxxx.h", O_RDONLY|O_NOCTTY) = -1 ENOENT (No such file or directory) <0.000013>
[pid 23961] write(2, "file.c:1:18: fatal error: xxxx.h"..., 59file.c:1:18: fatal error: xxxx.h: No such file or directory) = 59 <0.000013>

The syscall times shown thanks to the -T option helped me determine whether I really succeeded in fixing the problem (the times should drop dramatically once it is fixed).

However, attaching to FUSE's open call did not show any activity. Having turned on debugging in UnionFS, I realized that I should attach to the getattr syscall handler instead.

Now that I knew which syscall handler to interfere with, I patched the handling of getattr:

diff --git a/src/opts.c b/src/opts.c
index 02a4e79..c6e7a17 100644
--- a/src/opts.c
+++ b/src/opts.c
@@ -228,6 +228,8 @@ static void print_help(const char *progname) {
 	"    -o chroot=path         chroot into this path. Use this if you \n"
         "                           want to have a union of \"/\" \n"
 	"    -o max_files=number    Increase the maximum number of open files\n"
+	"    -o monotonic           treat file system as monotonic (\"delete\""
+	"                                  nonexistent files)\n"
 	"\n",
 	progname);
 }
@@ -295,6 +297,9 @@ int unionfs_opt_proc(void *data, const char *arg, int key, struct fuse_args *out
 		case KEY_COW:
 			uopt.cow_enabled = true;
 			return 0;
+		case KEY_MONOTONIC:
+			uopt.monotonic = true;
+			return 0;
 		case KEY_STATFS_OMIT_RO:
 			uopt.statfs_omit_ro = true;
 			return 0;
diff --git a/src/opts.h b/src/opts.h
index 040956f..1cfb64b 100644
--- a/src/opts.h
+++ b/src/opts.h
@@ -22,6 +22,7 @@ typedef struct {
 
 	bool stats_enabled;
 	bool cow_enabled;
+	bool monotonic;
 	bool statfs_omit_ro;
 	int doexit;
 	int retval;
@@ -32,6 +33,7 @@ typedef struct {
 enum {
 	KEY_STATS,
 	KEY_COW,
+	KEY_MONOTONIC,
 	KEY_STATFS_OMIT_RO,
 	KEY_NOINITGROUPS,
 	KEY_CHROOT,
diff --git a/src/unionfs.c b/src/unionfs.c
index f05af45..a64499e 100644
--- a/src/unionfs.c
+++ b/src/unionfs.c
@@ -57,6 +57,7 @@ static struct fuse_opt unionfs_opts[] = {
 	FUSE_OPT_KEY("-V", KEY_VERSION),
 	FUSE_OPT_KEY("stats", KEY_STATS),
 	FUSE_OPT_KEY("cow", KEY_COW),
+	FUSE_OPT_KEY("monotonic", KEY_MONOTONIC),
 	FUSE_OPT_KEY("noinitgroups", KEY_NOINITGROUPS),
 	FUSE_OPT_KEY("statfs_omit_ro", KEY_STATFS_OMIT_RO),
 	FUSE_OPT_KEY("chroot=%s,", KEY_CHROOT),
@@ -188,6 +189,10 @@ static int unionfs_getattr(const char *path, struct stat *stbuf) {
 	}
 
 	int i = find_rorw_branch(path);
+
+	// If we got an ENOENT, and we assume a monotonic FS, then we should whiteout this file
+	if (i == -1 && errno == ENOENT) monotonic_unlink(path);
+
 	if (i == -1) return -errno;
 
 	char p[PATHLEN_MAX];
@@ -335,6 +340,9 @@ static int unionfs_open(const char *path, struct fuse_file_info *fi) {
 		i = find_rorw_branch(path);
 	}
 
+	// If we got an ENOENT, and we assume a monotonic FS, then we should whiteout this file
+	if (i == -1 && errno == ENOENT) monotonic_unlink(path);
+
 	if (i == -1) return -errno;
 
 	char p[PATHLEN_MAX];
diff --git a/src/unlink.c b/src/unlink.c
index 06a7fa2..7f3275e 100644
--- a/src/unlink.c
+++ b/src/unlink.c
@@ -96,3 +96,27 @@ int unionfs_unlink(const char *path) {
 
 	return -res;
 }
+
+/**
+ * whiteout file on the first rw branch available
+ */
+void monotonic_unlink(const char* path)
+{
+	DBG_IN();
+	// monotonic check
+	if (!uopt.monotonic) return;
+	// save errno: if we can't whiteout, we should return -ENOENT anyway
+	int old_errno = errno;
+
+	// find the highest RW branch (yes, not a typo)
+	int j;
+	j = find_lowest_rw_branch(uopt.nbranches);
+
+	// All branches are RO... then just return the error
+	if (j == -1) return;
+
+	// Whiteout the file
+	hide_file(path, j);
+	// if creating the file with the hide tag failed, restore old errno
+	errno = old_errno;
+}
diff --git a/src/unlink.h b/src/unlink.h
index 5223258..8dc145b 100644
--- a/src/unlink.h
+++ b/src/unlink.h
@@ -8,5 +8,6 @@
 #define UNLINK_H
 
 int unionfs_unlink(const char *path);
+void monotonic_unlink(const char* path);
 
 #endif

And that was it. That was my filesystem patch! It worked well: the preprocessing speed, confirmed both by the strace timings and by real experimentation, increased by a factor of more than fifty! This is what patching a file system for your particular case can yield.

***

I rarely understood what benefits people referred to when describing how they patched a filesystem to make their corporate servers work better (i.e. faster or in a more reliable manner). Having patched UnionFS, I realized that two hours of work may give a 50x performance improvement (well, with Amdahl's law lowering it to a 7x overall improvement), just because you are not afraid of digging that far.

I doubt that the mainstream would accept my patch, as it's too specific, I guess. But the milestone has been reached: now I may proudly declare that I patched a file system.



Fail Early, Fail Fast

As you probably know (I'm sure I mentioned it somewhere in this blog, not just in my CV), about a half a year ago, I developed a set of components to make our Linux Driver Verification toolset run in a cloud, i.e. on a set of dynamically added and removed virtual machines.

Being low on the man-hours allocated to this task, I needed to make the development as fast as possible. Our toolset was, in fact, a simple client-server architecture: it had a large task that was split into several independent smaller tasks, whose results were then collected. While we did have making our toolset distributed in mind (I even had a post sketching our small debate on how to create a parallelization-friendly architecture), the data exchange had been implemented long before the infrastructure for distributed computing. The data exchange was based on a strict layout of files in the working directory: clients expected the input data in specific places, and the server was to ensure the proper files resided there. Then the server read the output from the specific files the clients created.

Linux has support for Filesystems in Userspace (FUSE). This means that a non-root user may install and mount a "virtual" file system that maps file and directory system calls to whatever the user specifies, such as access to an archive, or to a file on a remote machine.

One of such file systems, an SSH-FS, which mirrors a filesystem on a remote machine via a Secure Shell tunnel, was featured in my previous posts about open3.

In the cloud, a client and the server are located on separate machines. How do we perform the data exchange? The simplest way was to use a transparent network file system, so that the data access code stays untouched. Indeed, clients just read files from a filesystem, and a combination of Linux filesystem tools maps the read() calls across the network to reads from the server machine.

To achieve this, I used a combination of SSHFS and UnionFS, both being userspace file systems (see sidenote). UnionFS was used to let the client write to the local machine and read files from the remote server while thinking it's just one simple filesystem.

It seemed a nice solution because the files the client expected to find in the filesystem were seamlessly transferred via SSHFS. Its caching capabilities ensured that the latency occurred only when a file was being located; simple buffered reads on that file were cached in advance. Since we needed to locate only a relatively small number of files, but read a lot from them, the latency shouldn't have been crucial.

Have you already guessed where the problem is?..

The trouble: C preprocessing

And there was one simple task in the whole infrastructure which never got much attention, but which is now responsible for all the latency in the data exchange over the network. Our toolset analyzed C source code; therefore, it needed to preprocess it. And the trouble was in that plain old C preprocessing; namely, its #include directives.

How do #includes in the C preprocessor work? The compiler has a prioritized array of folders to look for headers in, the user specifies their own prioritized set of folders, and each #include <file.h> walks the array starting from the folder with the highest priority, looking up file.h in each of them until it finds a match. Then it inserts the contents of the file found into the source code and keeps preprocessing. What is the incompatibility with our assumptions about the performance of our filesystem stack?

During C preprocessing, a very large number of file lookups will fail!
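
To see why failures dominate, here is a rough Ruby sketch that mimics the lookup (the search path below is made up for illustration): a header found in the last directory of an N-element search path costs N-1 failed lookups first, and a header that doesn't exist at all costs N.

# Mimic the preprocessor's prioritized header lookup and count
# how many probes fail before the first hit. The search path is
# made up for illustration.
SEARCH_PATH = [
  "/usr/lib/gcc/x86_64-pc-linux-gnu/4.5.2/include",
  "/usr/lib/gcc/x86_64-pc-linux-gnu/4.5.2/include-fixed",
  "/usr/include",
]

def resolve_include(header)
  misses = 0
  SEARCH_PATH.each do |dir|
    candidate = File.join(dir, header)
    return [candidate, misses] if File.exist?(candidate)
    misses += 1  # on our stack, each miss is a round trip over the network
  end
  [nil, misses]
end

path, misses = resolve_include("stdio.h")
puts "found: #{path.inspect}, failed lookups on the way: #{misses}"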

This is a crucial property of preprocessing; it's been working like this for decades, and we shouldn't touch it (and we don't want to, either, as using the C preprocessor as a third-party tool is much easier to deploy). This is an example of a completely legitimate reason to issue a lot of failing requests to something (to the file system, in our case). And it works comparatively well locally: while preprocessing one file on a local disk takes less than a second, the same task over the network requires ten seconds or more! What could we do with our filesystem setup to avoid such performance problems?

Nothing. The calls that are doomed to fail at the UnionFS level will be routed over the network to fail on the server, causing insane amounts of latency. Even copying all the headers from the remote machine to the local one wouldn't help: the successful lookups would indeed execute faster, but the failures would still make their way to the server and cause latency anyway.

***

What can be learned from this? When you design a system, you should pay closer attention to failures. It's tempting to optimize your architecture only for the fast handling of successful requests, while the failure of a request may be just as important!

Of course, this varies with your domain: for instance, a web authentication system would deliberately slow down failures as a protection against brute-force attacks. In my case, however, I'll have to patch SSHFS (or UnionFS) so that it caches failed file lookups. And I did it! See my blog post that describes how I patched it.

But when I'll design another piece of software in my career, and it will have to handle legitimate unsuccessful requests, I'll do my best to make it fail early and fail fast.



How to Use Open3 and Avoid Dead Locks

Linux pipes and open3

Three articles on how open3 is implemented in Linux, and how it is properly used

Open3 is a common name for a function that spawns a process and substitutes its standard input, output and error streams with the pipes connected to the original process. This way you may interactively feed the child with data, and read the processed data back.

The problem with open3 is that its power makes it more dangerous, and its usage more error-prone. As the famous Spider-Man quote goes, "with great power comes great responsibility", and Linux inter-process communication is no exception. So here comes effective usage advice number 1.

Avoid open3 unless necessary

For the purpose of clarity, throughout the post, standard input, output, and error file handlers are called "standard filehandlers"; the process that spawns, and the process that is spawned are the "parent", and the "child" respectively.

The first advice about a safe open3 usage is... to avoid its usage. Languages that are popular in Linux world usually come with a handful of tools for process spawning, such as:

Shell redirection instead of pipes

Instead of using open3 in your program, you may try to connect process inputs and outputs with Bash redirection utilities, invoking the shell via standard functions to spawn processes.

For instance, you may write system("find | gzip >archive.gz") in Perl to spawn a process that archives recursive directory listing into a file, instead of passing the data through the parent process. You may use a usual redirection into just a text file as well, if that's what you really need.

However, I wouldn't use intermediate files if they're only there to dump information temporarily for immediate processing by something else (say, dumping a directory listing into a temporary file only so that another command can read and archive it right away).

I consider it bad style, let alone that it requires extra time for disk access and synchronization, which may be totally unnecessary. The arguments to the commands passed that way will also be interpreted by the shell, so you'll need to escape them, a painful and error-prone process.

If you need the intermediate result saved in files for debugging purposes, you still may dump it in debug mode, saving cycles when your project is compiled for production usage.

  • system — spawn the process, inheriting parent's standard file handlers;
  • fork+exec — ditto, but, possibly, with additional customizations you learned about in the previous post, for instance (such as tuning environment or playing with file handlers);
  • popen—open only one pipe to the process, either for reading or writing (note that popen also suffers from some of the typical misuses of pipes).
  • shell redirection utilities—Bash has a number of ways to control and redirect the input and output of a process. A section in one of my older posts is devoted to pipes in Bash. What could be helpful is the ability to use the shell interpreter when spawning a process with functions such as system in Perl. For instance, you may write system("find | gzip >archive.gz") in Perl to spawn a process that archives a recursive directory listing into a file, instead of passing the data through the parent process (see side note). You may use the usual redirection into a file as well, if that's what you really need.

If one of these tools fulfills your needs, then go for it! You can replace it with open3 at any time anyway—unlike houses, software is relatively easy to rebuild.

Where open3 should be used

However, if you find the alternatives listed above inefficient or not powerful enough, you may opt for open3. I'll try to enumerate the cases where I would certainly advise using it.

What is select?

The select function (man page) is a common name for a function that waits on file handlers. Assume you develop a task scheduler that connects to several machines via sockets; it should wait for the first socket to have a result in it, in order to schedule the next task to that specific machine (since the others are obviously busy). How do you accomplish that?

Of course, there is a faster and easier-to-use alternative to trying a non-blocking read on each socket every few milliseconds: using the select() function (or one of its companions, such as poll() or epoll) to perform a blocking wait on all of them. This function returns the list of file handlers that have data readily available, as soon as there is at least one such descriptor!

You may find plenty of manuals on how to use select; a minimal sketch follows, and later in this post you'll see it in action.
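
A minimal Ruby sketch of the idea (the child commands are made up for illustration): wait on the stdout pipes of two children at once, and print whichever produces a line first.

require 'open3'

# Spawn two children and wait on both of their stdout pipes at once.
# The commands are made up for illustration.
_, out_a, = Open3.popen3("sh", "-c", "sleep 2; echo slow done")
_, out_b, = Open3.popen3("sh", "-c", "sleep 1; echo fast done")

readers = [out_a, out_b]
until readers.empty?
  ready, = IO.select(readers)       # block until at least one pipe has data (or EOF)
  ready.each do |io|
    begin
      puts "#{io == out_a ? 'A' : 'B'}: #{io.readline}"
    rescue EOFError
      readers.delete(io)            # that child closed its stdout
      io.close
    end
  end
end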

  1. several consumers of one source—if you are given a source of data that should be redirected to several consumers (such as some data that should be both printed onto the terminal and saved into an archived log file for history keeping reasons), you should connect a pipe to its output and error streams, and when you read some amount of data from there (for instance, when you read a line), send it to all the consumers you have. In some cases you can do it with Bash (not with sh), but the code with open3 should look much more clear.
  2. aggregation of several providers—if you maintain several data sources, which all should be aggregated into a common place (a text file, for instance) that has some concurrency issues, you might benefit from open3. You may spawn the processes, and then select() (see sidenote) their output streams, thus making writing of the data to the aggregator thread-safe.
  3. interactive data exchange with a continuously running process—you can't avoid open3 if you spawn a process that responds to your input with some text on its stdout. A lot of auxiliary programs that are launched thousands of times have the interface of reading from stdin and writing to stdout (such as SAT solvers), and you most likely don't have the time to write or read anything from the disk.

One example of applying these patterns (more specifically, a combination of the first and the second) is transparent logging—it's a blatant crime to use anything but open3 for this. By transparent logging I mean combining logs from the child processes into a common logging sink in the parent one. Usually it's done automatically: the system call just routes the standard output of the child to that of the parent. However, if you spawn several concurrent processes and don't prefix the output of each of them with a unique identifier, you'll get lost in the log quickly.

This may be solved by opening these processes with open3, and attaching prefixes to their output lines before printing them. Note also that this way you may control severity of logs: for instance, you might want treat standard error and output streams from the children differently, and that can not be achieved with a simple popen call.

Synchronous vs. asynchronous

In this post we will mostly study issues with a synchronous processing of data exchange with the child process. Being synchronous means that nothing else but the interaction with the process is performed while the child is alive. In other words, there is only one thread in the parent.

Some notes on asynchronous processing will be given at the end of this post. Still, I consider synchronous experience extremely valuable, as it helps to shape the vision what a good asynchronous interaction via open3 should be.

Issues with open3

So far I've been saying that open3 is not straightforward to use, but what causes the complexity? Why could there be the "Dead Locks" referenced in the title of this post? I had to learn this the hard way, tackling cumbersome bugs in our project, and here's what it was about.

Following the pattern "aggregation of several providers", I spawned the process that wrote to both stdout and stderr with the intent to combine them both in the archived log file. The code (simplified) looked like this:

I expected to get a file that had all the lines from the child's stdout, followed by the prefixed lines from its stderr. However, sometimes the application just deadlocked!

A quick strace demonstrated that the parent hung up on read from child's stdout, and the child hung up... on write to its stderr! How could that be?

Remember that in the first post about pipes, I listed limited capacity as one of the properties of pipes. I stressed it on purpose, because it plays its role right here, when you try to use open3. The stderr pipe between the child and the parent was full, and the child wanted to write an error message there, while the parent was still reading from the stdout pipe, because the child was still running and its pipes were not closed! That's the open3 Dead Lock.

Of course, this could happen with any pair of pipes here. Assume the process just echoes every character it receives on its input to its output (real programs do useful transformations, of course, but for clarity we may treat it as the identity). We want to feed it twice as many characters as the pipe's capacity.

You might think that this won't strike you unless you're dealing with extremely long inputs and outputs. Sorry to disappoint you, but, quoting the man 7 pipe:

In Linux versions before 2.6.11, the capacity of a pipe was the same as the system page size (e.g., 4096 bytes on i386). Since Linux 2.6.11, the pipe capacity is 65536 bytes.

It's not much, though, in certain cases, it's big enough to let badly written programs work. On a larger scale, we definitely need a generic, limit-agnostic solution.

How to prevent the dead lock

It's relatively simple to devise a generic rule for mitigating the effect of such a limitation. To avoid deadlocks with open3, you should drain each of the output pipes (and fill the input pipe) as soon as possible, and never make handling one pipe depend on waiting for another pipe. So we need to watch closely all the file handlers (up to 3) which we plan to read/write data from/to, and it's important to wait for all of them at once! If you read the sidenote above, you already know that we could use select for this.

Using select to react to input promptly

So, a potential way to fix the program above is sketched below.
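
Here is a sketch of that approach in Ruby (again a reconstruction, not the original listing; the command is made up): select on both pipes and read a line from whichever is ready.

require 'open3'

# Wait on both pipes with select and read a line from whichever
# is ready. A sketch of the approach, not the original listing.
stdin, out, err, wait_thr = Open3.popen3("./noisy_tool")
stdin.close

File.open("combined.log", "w") do |log|
  readers = [out, err]
  until readers.empty?
    ready, = IO.select(readers)
    ready.each do |io|
      begin
        line = io.readline            # still a blocking, line-buffered read: see below
        log.print(io == err ? "STDERR: #{line}" : line)
      rescue EOFError
        readers.delete(io)
        io.close
      end
    end
  end
end

wait_thr.value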

This program, however, only outlines the approach to tackling deadlocks with open3. It is still prone to deadlocks, though less so than the original one. Assume that the child started to print a line to its stdout, but hasn't finished it, because it got a debugging interrupt. Having not finished printing the line, it spits 100 Kb of debugging information to stderr, with the intent to continue with the normal output to stdout afterwards. However, the parent is still blocked in the out.readline() call, waiting for the line termination character to appear there, and the child gets blocked at the write to stderr, because the err pipe is full and no one is going to remove data from it. Deadlock again. (You may play with this deadlock by open3-ing this sample program.)

The issue here is that we still do not "remove data from pipes as soon as possible". For that, we need nonblocking reads, more low-level than those of the readline()- and scanf-like functions.

Using nonblocking reads and your own buffers

The problem with nonblocking low-level reads, as Capt. Obvious notes, is that they are low-level. We can't read more or less structured data with them. Assume that we want to read a number (the number of seconds to wait before launching a starship, for instance) from the child's stdout. If the debugging interrupt described above is triggered just in the middle of printing 1000, our nonblocking read will read 10 (before turning to reading from stderr) and act accordingly, launching the multi-billion-dollar ship prematurely. From the child's viewpoint, however, doing so is totally legitimate: it printed 1000 and the debugging information to different output channels (stdout and stderr), and if the reader confuses the two, it's the reader's problem.

Do not use strings as buffers (like I do here)

In the samples below we used "strings" as buffers. However, strings in the modern scripting languages (including those used here, as it's Ruby) consist of multi-byte characters with variable length (see Joel's post on Unicode), and not with one-byte symbols. On the other hand, in some less modern and "less scripting" languages, strings can not contain zero bytes, as they would be treated as the end of the string.

Therefore, "strings" are going to misbehave if chosen as a buffer for an abstract byte stream. I used them for simplicity, and for the sake of demonstration of open3 usage; in real programs, however, you should not use them.

Therefore, we need to handle these situations accordingly, adding another level of indirection between the child and the parent. We will store the data we read from the pipes in intermediate storage; we could call it a buffer. In fact, we are going to re-implement buffered reads.
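
The core of that buffered read fits in a few lines of Ruby (a sketch; see the sidenote above about (not) using strings as buffers): append whatever the nonblocking read returned, and hand out only the complete lines.

# Append freshly read bytes to this pipe's buffer and extract only
# the complete, \n-terminated lines from it. A sketch; EOFError is
# left for the caller to handle when the pipe is closed.
def drain_lines(io, buffer)
  buffer << io.read_nonblock(4096)
  lines = []
  while (nl = buffer.index("\n"))
    lines << buffer.slice!(0..nl)   # cut one complete line off the buffer
  end
  lines
rescue IO::WaitReadable
  []        # nothing to read right now after all
end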

In the next example we will implement a linewise open3, a subroutine that invokes user-defined callbacks whenever a complete line is read from the child's stdout or stderr. You could play with it further, introducing scanf-like behavior (instead of the regexps you'll see). However, we have two more issues to discuss before getting to the code.

Reactive interaction with both input and output data

The select with callbacks works well for reading output when the input pipe is closed right away. But what should we do if we want to write something to the input pipe of the child, based on what's being read from its output?

To talk interactively with the child, you'll most likely need asynchronous processing. However, there is one pattern that allows exchanging data through both the input and the output in a synchronous mode. We already agreed that we will invoke user-defined callbacks after reading lines from stdout and stderr. However, we didn't use the return value of these callbacks in any way! An idea suggests itself immediately:

If the user-defined callback for a stdout or stderr line returns a non-empty string, we feed this string to the stdin of the child as soon as possible. To get more control over the child, we treat a nil (NULL) return value from a callback as a sign to close the standard input. We will also make our wrapper accept a string as input and feed it at the very beginning, since it is often what triggers the output in the first place.
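
For illustration, a pair of callbacks obeying this convention might look as follows (the line patterns are invented; the wrapper that calls them appears later in this post):

# Decides what to do with the child's stdin after each complete stdout line:
on_stdout_line = lambda do |line|
  case line
  when /^ready/ then "launch in 1000\n"   # non-empty string: feed it to the child's stdin
  when /^done/  then nil                  # nil: close the child's stdin
  else               ""                   # empty string: feed nothing, keep reading
  end
end

# Errors are just logged; nothing is fed back:
on_stderr_line = lambda { |line| $stderr.puts("child said: #{line}"); "" }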

This may still not sound powerful enough, but you have plenty of asynchronous options left, such as interrupting pselect with a specific signal, or setting up a separate thread for feeding data into the input pipe. We will omit these options in this blog post.

Watching for the child's termination

As we're implementing a synchronous open3 wrapper, the natural assumption is that by the time the wrapper returns, the child should be dead and reaped. This way we can also return its exit status to the caller.

As we discussed in the previous post, process termination does not depend on the status of the pipes, so we should watch for it independently.

What we actually want is a version of select that watches both file handlers and the child's termination. If you know of such a version of select (I don't), please leave a comment. For now, let's look for another solution.

Such functionality could be simulated with a dedicated wrapper thread that only waits for the child (with wait) and sends a signal upon its termination. This signal would interrupt select, and we would handle that.

In our project I implemented a simpler solution. It is based on the assumption that the longer the child has been running, the less likely it is to terminate during the next second. So we can use periodic wakeups to check the process status (implemented as a non-nil timeout to select), and increase the waiting period over time. Having an upper bound for that period is a good idea as well.

Note that if the process has terminated while the pipes are still open, we assume that the last nonblocking read fetches all the data we're interested in, and we may close the pipes. This may not always be true, but in such specific cases you'll need specific solutions anyway.

Linewise-open3 code

Let's sum up what we're up to. Here's a listing of an open3 wrapper that writes the supplied string to the stdin of the child, and then invokes one of two user-specified callbacks whenever a complete, \n-terminated line is read from stdout or stderr respectively. These callbacks may return more data to put into the stdin of the process. The execution terminates when the child terminates. The wrapper returns the exit code of the process (everything else is assumed to be done by the callbacks).
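
The original listing is not reproduced here (the next paragraph links to the complete version); below is my reconstruction of such a wrapper from the description above. It is a sketch with my own names and parameter choices (buffer size, timeouts), built on top of Ruby 1.9's Open3.popen3 rather than on a hand-rolled fork/exec:

require 'open3'

# Writes `input` to the child's stdin, then calls out_cb/err_cb for every complete,
# \n-terminated line read from its stdout/stderr.  A callback may return a non-empty
# string (fed to the child's stdin), an empty string (feed nothing), or nil (close
# the child's stdin).  Returns the child's exit code.
def linewise_open3(cmd, input, out_cb, err_cb)
  Open3.popen3(cmd) do |stdin, stdout, stderr, wait_thr|
    stdin.write(input) if input && !input.empty?   # NB: a huge initial input could block here
    buffers    = { stdout => "",     stderr => ""     }
    callbacks  = { stdout => out_cb, stderr => err_cb }
    open_pipes = [stdout, stderr]
    timeout    = 0.05

    feed = lambda do |reply|
      if reply.nil?
        stdin.close unless stdin.closed?
      elsif !reply.empty? && !stdin.closed?
        stdin.write(reply)
      end
    end

    until open_pipes.empty?
      ready, = IO.select(open_pipes, nil, nil, timeout)
      (ready || []).each do |io|
        begin
          buffers[io] << io.read_nonblock(4096)
        rescue IO::WaitReadable
          # spurious wakeup: nothing to read after all
        rescue EOFError
          io.close
          open_pipes.delete(io)
        end
        # Hand out complete lines only; the unfinished tail stays in the buffer.
        while (i = buffers[io].index("\n"))
          feed.call(callbacks[io].call(buffers[io].slice!(0..i)))
        end
      end
      unless wait_thr.alive?
        # The child is dead: assume the reads above have fetched everything we care about.
        open_pipes.each { |io| io.close unless io.closed? }
        open_pipes.clear
      end
      timeout = [timeout * 2, 1.0].min   # check for termination less and less often
    end
    wait_thr.value.exitstatus            # reap the child and return its exit code
  end
end

For instance, talking to cat and closing its stdin as soon as the first echoed line comes back:

status = linewise_open3("cat", "hello\n",
                        lambda { |line| puts "OUT: #{line}"; nil },
                        lambda { |line| puts "ERR: #{line}"; "" })
puts "cat exited with status #{status}"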

I tested this program on a random echo sample. You may also view the complete listing of the open3 usage, and test it either with echo or with sshfs installed on your system.

Note also that in our project we developed an open3 wrapper for Perl as well; you can view it here. It is less capable (it lacks the interactivity), but it's live and working.

Notes on asynchronous open3 usage

In some cases you may trade complexity for efficiency. The problem with the deadlock above was that we had a single thread designated to read from several pipes. Instead of introducing a buffering system, we may just spawn several threads and attach each of them to its own endpoint. The burden of waking up the thread that actually has something to process doesn't vanish; it just gets imposed on the OS scheduler, which should contain fewer bugs than our code, or on the language runtime scheduler (if it employs green threads, as Ruby 1.8 does).
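
For comparison, here is a rough sketch of that thread-per-pipe approach, again on top of the standard Open3 (the command is just an example):

require 'open3'

Open3.popen3("echo out; echo err >&2") do |stdin, stdout, stderr, wait_thr|
  stdin.close
  readers = { stdout => "OUT", stderr => "ERR" }.map do |io, tag|
    Thread.new do
      io.each_line { |line| puts "#{tag}: #{line}" }   # each thread blocks on its own pipe only
    end
  end
  readers.each(&:join)
  puts "exit status: #{wait_thr.value.exitstatus}"
end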

This may sound like an easier solution, especially in the multicore era, where developers cheer at every opportunity to write a multithreaded program that makes high-end hardware stall less. To me, it's a bit of an overkill for simple tasks (if they can be expressed in terms of the interface discussed above). And in the context in which we used open3 in our project, too many competing threads had already started to become a problem.

Perhaps I'll study the asynchronous option in future posts.

Conclusion

This post concludes what I initially planned for the series of blog posts on how to work with pipes in Linux. Another topic has emerged, asynchronous interaction with an open3-driven process, but I don't know yet whether I will write about it.

We have looked at what pipes are, how they are used for inter-process communication, what options we have for spawning processes with pipes attached to them, and how this can be achieved in modern Linux scripting languages. We mainly focused on open3, studying the details of its implementation and the use cases in which it is most effective. We studied its quirks, and how to avoid the traps set up among the joints of the pipes we have to deal with.

I learned all this by spending a couple of days messing with processes that magically deadlocked without any obvious reason, and by doing some nontrivial multithreaded debugging. I hope these posts will help you when you work with pipes, so that you can avoid my mistakes.



How to Implement open3

Linux pipes and open3

Three articles on how open3 is implemented in Linux, and how it is properly used

Open3 (or popen3) is a common name for a library function that spawns a child process with its standard streams (input, output, and error) all connected to newly created file handlers in the original process. You may encounter this function in such languages as Perl, Ruby, or Python.

The standard Linux C library, unfortunately, does not contain such a function. Only its less capable counterpart, popen, is a member of the C library. The popen function, which is more common, opens only one pipe, either to the input or to the output of the process being spawned. This means that if you program in C, you'll have to implement open3 on your own.

Another reason why you might have to re-implement open3, even if your standard (or non-standard) library provides it, is when you need more control over what happens during its call. For instance, the Perl and Ruby 1.8 versions of open3 lack the ability to adjust the environment of the process being invoked (see my older rant). This missing functionality is why I re-implemented open3 in the project I work on.

The choice of Ruby was driven by the fact that I use it in the project I work on, and that it has a killer feature: exceptions that abort the program when unhandled (so you don't have to check the return code of each library call, as you would in C). Besides, it's a lot less verbose.

The subsequent sections of this article present a sample implementation of open3 in Ruby. However, it relies on certain system and C library calls that I will also outline, so you'll be able to do the same in C or in any other language with access to such primitives.

Foundation for open3

The open3 implementation contains no magic. It relies on two powerful features of the Linux system.

Forking

Good old forking comes to the rescue here. Process spawning in Linux is done via the fork call, which clones the current process, followed by the exec call, which replaces the newly spawned child with the command we wanted to invoke in the first place. It may sound unnecessarily complex, but in our case it has a feature that makes open3 as a library function possible. After fork, but prior to exec, we can run code in the context of the new process while retaining the knowledge we had in the old one. How does that help us?

First, we may alter the environment of the child being invoked, which was my original motivation for reimplementing open3. Many scripting languages have a global, process-wide ENV hash with read/write access, which is used as the environment for execve system calls. If we need to alter the environment of the child process in a thread-safe way, we may modify ENV between fork and exec, so that the change happens in the context of the child without interfering with the parent.
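
A minimal sketch of that trick in plain Ruby (the TZ variable and the date command are just an example):

pid = fork do
  # Everything here runs in the child only; the parent's ENV stays untouched.
  ENV['TZ'] = 'UTC'
  exec('date')        # the exec'ed command sees TZ=UTC
end
Process.wait(pid)
puts "parent's TZ is still #{ENV['TZ'].inspect}"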

Second, some file handlers (internal OS "pointers" to open files, pipes, and sockets) are preserved across exec! Namely, those that have the FD_CLOEXEC flag cleared. In plain POSIX the flag is cleared by default (some language runtimes, such as Ruby 2.0+, do set it on the descriptors they create), and we can change it via an fcntl call.

So what we could do is open the pipes in the parent process and inculcate the knowledge about them into the child process before the target command is executed. How should we do the inculcation?

File handler renaming

The child process should start with the pipes in place of its STDIN, STDOUT, and STDERR. However, after forking, it already has some standard streams assigned by the OS or the language runtime. What should we do now? The answer is obvious: rename the relevant file handlers.

Closing and then opening a file IO object in a high-level language (such as C++) is not the same as "reopening" it, as the OS file handler will not be retained. This is extremely important when altering standard streams.

Language runtime libraries usually have "reopen" functions (freopen in C, or IO#reopen in Ruby). Such a function substitutes the file pointed to by the OS file handler, as well as the underlying file of the "language" file handler (an integer value in C, or an IO object in Ruby or, say, C++).

Leaving the IO object internals to the specific language, the OS file handler swapping is performed via the C library function dup2. The plain dup function just duplicates a file handler. Called as dup2(keep, replace), it will close(replace) and make replace refer to the same open file as keep (keep itself stays open, so strictly speaking it's a duplication rather than a rename). The operation is relatively cheap, and the resulting descriptor has its FD_CLOEXEC flag cleared, which is exactly what we need before exec.
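
In Ruby, this renaming is what IO#reopen does for you; here is a tiny, self-contained demonstration (the log file path is just an example):

log   = File.open("/tmp/reopen-demo.log", "w")
saved = STDOUT.dup                # keep a copy of the real standard output
STDOUT.reopen(log)                # file descriptor 1 now points to the log file
puts "this line goes to the log file"
STDOUT.flush
STDOUT.reopen(saved)              # restore the original standard output
puts "and this one is back on the terminal"
log.close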

Forking and threading in POSIX is an interesting subject in itself. I once had a lot of fun reading this thread, and then the FAQ of the newsgroup.

Note that fcntl and dup2 are async-signal-safe functions, and therefore they can safely be called between fork and exec even in a multithreaded process.

Simple implementation

Okay, let's put all the tools together. Opening pipes is easy, as described before; then comes the fork, and then the renaming of the file descriptors.

Instead of just assigning an environment variable, we may allow the user to specify the actions they want executed in the child's context as a function pointer (or as a block, in Ruby), as the sketch below does.
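
The listing itself is not reproduced here; the sketch below shows how these pieces could fit together in Ruby (my own reconstruction, not the project's code):

# Spawns `cmd` with all three standard streams connected to pipes.
# An optional block is executed in the child's context before exec (e.g. to tweak ENV).
# Returns the parent's ends of the pipes plus the child's PID.
def my_open3(*cmd)
  in_r,  in_w  = IO.pipe
  out_r, out_w = IO.pipe
  err_r, err_w = IO.pipe

  pid = fork do
    yield if block_given?          # runs in the child only
    STDIN.reopen(in_r)             # "rename" the pipe ends onto the standard streams
    STDOUT.reopen(out_w)           # (dup2 is doing the work underneath)
    STDERR.reopen(err_w)
    [in_r, in_w, out_r, out_w, err_r, err_w].each(&:close)
    exec(*cmd)
  end

  [in_r, out_w, err_w].each(&:close)   # the parent closes the child's ends
  [in_w, out_r, err_r, pid]            # stdin, stdout, stderr handles and the PID
end

# Usage:
cin, cout, cerr, pid = my_open3('grep', 'FOOBAR') { ENV['LC_ALL'] = 'C' }
cin.puts('some line with FOOBAR'); cin.close
puts cout.read
cerr.close
Process.wait(pid)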

Process termination control

From this simple implementation we have completely omitted an important means of controlling what's happening with the process: knowing when it has terminated (or forcing it to terminate).

The thing is that whether the process has terminated does not depend on the status of its standard pipes. First, closing the standard input doesn't necessarily make a process terminate (as it does with the standard grep command); actually, very few programs react to it that way. So if we've got all the results we wanted from our forked creature, we should have a means to kill it before it's too late...

Second, the file handlers we read from do not tell us whether the process is alive either. Remember that file handlers may survive the exec if FD_CLOEXEC is not set? The process we spawned might, in turn, have given birth to a daemon, which we don't want to mess with. That daemon could inherit the file handlers and just not use them. This actually happened with the SSHFS mounter: the process wouldn't close the output pipes.

Therefore, pipes won't help us control the spawned process. What could? Its PID, of course. The fork call returns the process identifier of the spawned child, and our process, as the parent, has full access to its status and return code. So, along with the pipe file handlers, we may return the child's PID to the caller so that it can control the process in a more fine-grained way.

Anything else?

We've learned how to implement open3 itself. However, the bigger troubles come when we try to use it.

The thing is that open3 is a powerful tool, but with great power comes great responsibility. As outlined in the previous post, reads and writes to pipes may block at each endpoint, and this makes the use of open3 error-prone. You should not use it for every process spawn. If a simpler system, fork, or popen would suffice, use them. But if it wouldn't, you're welcome to read the next article on how to actually use open3 in an efficient and correct way.

Proceed to "How to Use Open3 and Avoid Dead Locks".



Pipes in Linux and in The Real World

Linux pipes and open3

Three articles on how open3 is implemented in Linux, and how it is properly used

The importance of inter-process communication

In Linux programming, the concept of a diversity of small, highly specialized tools combining their efforts to achieve the goal the programmer set has always been dominant. Take a look at shell scripting. Here's how one searches for the string FOOBAR in all *.txt files in their home dir:
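
The shell one-liner did not survive the conversion here; judging by the find ... | xargs -L 100 example referred to later in this article, it was a pipeline along these lines:

find ~ -name '*.txt' | xargs -L 100 grep FOOBAR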

Pipes

A lot of pipes stocked somewhere in Russia.

Pipes that transfer solid items

This photo is taken from here. These pipes are more widely known as pneumatic tube transport.

Pipes that transfer documents

This pneumatic tube seems to transfer medical documents and samples packed into capsules. I saw such a system in one of the banks I had an account at; there it was used to transfer documents between counters.

It's a good illustration of sequential payload delivery by a pipe when the width of the pipe matches the width of the objects.

You may find more sophisticated examples in my article on xargs, which is a robust spawner of such commands. It is this robustness that allowed me, for example, to make a simple "web service" that sends information to the web whenever lines are added to a text file--look how concise its code is!

The use of such tools is not limited to making a desktop user's life at their workstation more convenient. These tools are actively used in production scripting as well. Having a lot of ready-to-use tools at hand gives a programmer extra opportunities when building larger programs: instead of trying to find a library or re-implement something, one may just look for a small tool for the specific purpose, one that interacts with the outer world via the command line and the input and output streams.

But combining tools requires efficient communication between them. Small tools can nevertheless exchange large amounts of data! Being perhaps the world's largest consumer of such communication primitives, Linux uses pipes.

What are "pipes"?

Before you read this section, I strongly encourage you to visit your bathroom or kitchen and closely examine the metal or plastic elongated round items that have water moving inside (or, if your country isn't rich enough for you to have them, take a closer look at the things the oil your country sells flows through). These are pipes.

Pipes that have started a war!

A week ago I watched the movie "Fair Game". Being a non-US national, I nevertheless consider the problems discussed there relevant. And one episode of the plot caught my attention. One of the "casus belli" for attacking Iraq was that it purchased pipes that could be used to make nuclear weapons (in fact, they couldn't, and it was all political speculation). Above is a picture of such a pipe.

See how important pipes could be? Read the rest of the article to learn more!

Pipes are used to direct a flow of something from one place to another; it could be water, oil, another liquid, or even documents or small change (see the side pictures). Pipes can even start a war! And Linux pipes transfer bytes: you can write to one endpoint, and read the data written from the other endpoint, which may belong to another process!

The three properties of a pipe

The first is that a pipe can transfer payload in only one direction. You can't use a single pipe to transfer water in both directions, so that it pours out of one end and, at the same time, takes water in at that end to pour out of the other. For that, you need at least two pipes.

Second, a pipe has a limited capacity. When you close the valve in your kitchen, the pipe is full of water, and no matter how hard the pump station tries, there will never be more water inside the pipe than there is now. (Of course, the station may try harder and make your pipe leak, and water can be compressed under certain conditions, but that's not the general case.) When the station tries to pump more water, the new water is "blocked". This continues until the valve at the other end is opened and some water is removed from the pipe, making room for new water to come in from the other end.

The third property is that a pipe transfers its payload more or less sequentially. Even the transfer of liquids, which mix easily, is somewhat sequential: when you turn on your shower after a cold night, you literally feel how the cold water is pushed out of the pipe before the hot water starts to erupt.

The interesting thing is that Linux pipes are designed closely after "real world" pipes, the only difference being that Linux pipes transfer information: bytes.

Pipes in Linux

The main attributes of Linux pipes, namely one-way data transfer, limited capacity, and sequential delivery, are found in the real world too, as shown above. There is, however, one difference.

Pipes in the real world are usually found in the "full" state, i.e. the pipe is full of payload waiting to be removed from it. In Linux programming, however, "empty" pipes, where it is the consumer who waits for the input, are much more widespread.

To create a pipe, you just invoke the relevant system call via the standard C library. The standard pipe(2) call returns two file handlers: one for reading (fh_out) and another for writing (fh_in).

The use of these file handlers usually happens in a blocking mode (a short demonstration follows the list), such that:

  • a read from fh_out blocks until something is written to the other end of the pipe;
  • a write to fh_in returns when it has written everything into the pipe. If the pipe doesn't have enough capacity to take everything, the call blocks until the consumer reads some data from the other end, so that the rest can be written.
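
A minimal Ruby demonstration of this blocking behavior (IO.pipe is a thin wrapper over pipe(2)):

fh_out, fh_in = IO.pipe          # fh_out is the read end, fh_in is the write end
writer = Thread.new do
  sleep 1                        # the reader below will block during this second
  fh_in.write("hello through the pipe\n")
  fh_in.close                    # signals end-of-file to the reader
end
puts fh_out.read                 # blocks until data arrives, then until the pipe is closed
writer.join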

Of course, you can use fcntl (or ioctl) to adjust the flags and make such calls nonblocking. Still, you can't bypass the basic restrictions: you obviously can't read what hasn't been written yet, and you can't store more data in a pipe than it is capable of keeping. If you want execution to continue while data are being written to an already full pipe, you have to allocate a buffer and a concurrent thread that pushes data there.

You should never forget about these two blockages, as they will affect your everyday workflow (see, for example, this StackOverflow question about the less command). Yes, sometimes it's an obstacle you have to specifically overcome (I'll address that in the third article about pipes in Linux). But in most cases such two-blockage behavior is really what you want.

Pipes as a synchronization means

Let's return to the find ... | xargs -L 100 example shown at the beginning of the article. If the find command has already found a lot of files, there's no sense in it running further and degrading the system's response time (how often found matches get printed). With pipes, it will be seamlessly blocked by the write() to the pipe, and you don't even have to write anything to support this: your simple, usual printf() will just return control only when the second party has done some work!

In other words, for two processes connected with a pipe as A | B, the design of Linux pipes makes this two-blockage system automatically "pause" the faster one to give the slower one a chance to accomplish its job!

So, basically, a pipe is a synchronization "primitive" that pauses a program connected to one of its ends at certain operations. It's not that "primitive", actually, as it may be implemented via a semaphore, but it's simple enough to consider it as such. And, as with other synchronization mechanisms, you may deadlock when using even a single pipe, let alone multiple pipes.

Road to open3

So, let's return to using inter-process communication for more efficient programming in Linux. The classical way for a tool to work is to print to standard output, read from standard input, and put error messages onto the standard error stream. How could you efficiently use such a tool in your program? The answer is: connect to it with pipes! Linux has all the capabilities to arrange the file handlers in such a way that the program won't even notice it's printing to a pipe instead of a console terminal.

We used named pipes to illustrate how Bash may be used for parallelization.

Of course, you may do it upfront, without impersonating a genuine console. Linux has the capability to create named pipes with the mknod call. These look like usual files, but are actually pipes. If the target program can read from a file instead of reading from standard input (or write to a file instead of standard output), you're lucky. However, this sometimes makes target programs unnecessarily complex, and they're already complex enough; just take a look at the various implementations of echo, a program that is supposed to just print its arguments. Second, this functionality is rarely provided for the standard error stream, and the error log is a very important piece of information for tackling bugs. Therefore, you will have to either use shell redirection, or just establish a direct connection from the child's streams to the parent, which is the easier solution.

As we've already learned, you'll need three of them: one for each channel among input, output, and error. This is what gave the name to how such a function is usually called in the standard libraries of various languages: open3. It takes a command line as input, and returns three file handlers corresponding to the said streams of the spawned program. Here's what it looks like in Ruby:
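
(The original snippet is not shown here; this is the standard library's Open3 as of Ruby 1.9+, where a fourth value, a thread waiting for the child, is also returned.)

require 'open3'

stdin, stdout, stderr, wait_thr = Open3.popen3("grep FOOBAR")
stdin.puts("a line with FOOBAR in it")
stdin.puts("a line without")
stdin.close                      # let grep see the end of its input
puts stdout.read                 # => "a line with FOOBAR in it"
puts stderr.read                 # empty here, but error messages would land in this stream
wait_thr.value                   # reap the child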

However, the stock open3 implementation may sometimes not be powerful enough to meet all your desires (this happened during my work on cluster software for a project at work; see some rant in my earlier post), and you'll have to code a more sophisticated version. That's why it's important to know how open3 is implemented. It has several quirks, and in the next blog post I'll explain the internals of an open3 implementation.

Proceed to "How to Implement open3"



A poor man's benchmark and stopwatch

Recently I wanted to measure how long it would take my program to execute a certain action. I didn't want to search for time-related functions in that language, so I was just going to use my Casio watch. However, I suddenly realized that I had left it at home...

"Wait a minute", I thought, "I have a PC with a modern operating system, openSuSE Linux! How come it doesn't have a stopwatch program?" I scratched my head, but couldn't remember anything like that in standard Linux utils.

I asked my peers about such a program, and Konstantin Vlasov proposed a solution: use... dd!

What is dd?

If you still don't get it, I'll remind you. dd is a standard Unix (hence Linux) tool... to perform low-level copying of data streams. Its most popular use is to clone raw partitions (as arrays of bytes, not as sets of files):

dd if=/dev/sdb5 of=/dev/sdc1
dd if=/dev/cdrom of=image.iso

But we'll get to its conventional use later, and now...

dd as a stopwatch

The funny thing about dd is that, in its default mode, it prints how long it has been running. Indeed, dd is a data copying program, so it's natural for it to print, at the end, how much data it has copied and how fast.

So it prints the total size of the data copied, the time it took to do this, and the resulting speed... Wait a minute... the time! That's it!

To use dd as a stopwatch, you don't need to copy any actual data. Just run it without arguments and hit Ctrl+C when you need to stop the count. Here's what the output will look like:

pavel@lonely ~ $ dd
^C0+0 records in
0+0 records out
0 bytes (0 B) copied, 5.97049 s, 0.0 kB/s

There it is, the stopwatch triggered by just Enter and Ctrl+C keys.

Note that instead of triggering the stop manually with Ctrl+C, you may just send a signal to dd and have the same effect. We'll utilize this in the next section.

dd as a benchmark

So, we've just learned that dd can print the speed of the copy operation it performed. And Linux has nice pseudo-files that just generate streams of data... so why don't we use these files to measure how fast a computer can copy nothing to nowhere?

Let's check how much data your machine will copy in ten seconds! Here's a sample listing:

pavel@lonely ~ $ dd if=/dev/zero of=/dev/null bs=64k & sleep 10 ; kill -INT %1 ; wait %1
[1] 885
1434838+1 records in
1434838+0 records out
94033543168 bytes (94 GB) copied, 10.0106 s, 9.4 GB/s
[1]+  Interrupt               dd if=/dev/zero of=/dev/null bs=64k

The standard Linux shell, Bash, has quite simple features for controlling its child processes. For example, the code above features "job specifications" ("jobspecs"). When a program is run in the background (with & at the end of a statement, instead of the usual ;), it gets a job number, and it can be referred to as %n, where n is that number.

In our code we send a signal to a certain jobspec and then wait for it (if it weren't for wait, the shell prompt could mingle with the output of dd's signal handler). And since we only have one job running in the current shell, its jobspec will always be %1.

This command will launch a dd process that copies a stream of zero bytes maintained by the kernel (/dev/zero) to a stream that swallows everything it gets without any effect (/dev/null). You may test how this simple copying works and learn how fast your... well, it's hard to tell what exactly, but you can compare computers with your friends and beat them in this competition! We held a dd benchmarking competition on the message board of our university, and the top result was 70.6 GB/s, which is like ten times faster than on my machine!

By the very same means you may analyze the performance of your disk, by putting a file as the of= argument. But that's a poor man's solution anyway, as there are better benchmarks both for disks and for memory.

Other (not interesting) uses of dd

Aside from copying raw partitions (especially useful if you're trying to recover your disk, and accessing it as a file system would require a lot of unnecessary reads, which may be harmful), dd may also be used to create big files of a specific size. Why would one want that?... Right, to create swap files. Here's a sample listing to create a 1 GB swap file:
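
The listing itself seems to have been lost in conversion; the standard sequence is something like this (run as root; /swapfile is just an example path):

dd if=/dev/zero of=/swapfile bs=1M count=1024    # a 1 GB file full of zeroes
mkswap /swapfile                                 # format it as swap space
swapon /swapfile                                 # and enable it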

I learned this the hard way at work, when, during an important launch of our experiments, the controlling program leaked 4 GB of memory...

Here's a (really good) Wikipedia article about various uses of dd that presents even more examples of the neat tricks you can employ dd for.

***

Although I'm an experienced Linux user, Linux never stops surprising me. The dd program is not the only tool that, due to its simplicity, serves many different purposes. That simplicity of the basic tools, which leads to their powerful combinations, is one of the things I appreciate most about Linux.

And sometimes it can even be amusing.


