Pipes in Linux and in The Real World

Linux pipes and open3

Three articles on how open3 is implemented in Linux, and how it is properly used

Pipes in Linux and in The Real World — introduction to the concept of pipes
How to Implement Open3 — sample implementation of open3
How to Use open3 and Avoid Dead Locks — notes on a safe open3 usage

The importance of inter-process communication

In Linux programming the concept of a diversity of small and highly specialized tools collaborating their efforts to achieve the goal a programmer instructed them to has always been dominating. Take a look at shell scripting. Here's how one searches for a string FOOBAR in all *.txt files in their home dir:

Pipes

A lot of pipes stocked somewhere in Russia.

Pipes that transfer solid items

This photo is taken from here.These pipes are more widely known as pneumatic tube transport.

Pipes that transfer documents

This pneumatic tube seems to transfer medical documents and samples packed into capsules. I saw such system in one of a banks I had an account at; there it was used to transfer documents between counters.

It's a good illustration of sequential payload delivery by pipe if the width of the pipe is the same as the width of the objects.

You may find more sophisticated examples in my article on xargs, which is a robust spawner of such commands. It is such robustness that allowed me, for example, making such a simple "web service" that sends the information to the web triggered by just adding lines into a text file--look how concise its code is!

The usage of such tools is not constrained by making a desktop user convenient at their workstation. These tools are actively used in production scripting as well. Having a lot of ready-to-use tools at hand provides a programmer with extra opportunities for building larger programs: instead of trying to find a library or to re-implement something, one may just try to find a small tool for a specific purpose, which interacts with the outer world via command line, input and output streams.

But the combination of tools requires an efficient communication between them. Small tools can nevertheless exchange large amounts of data! Being the world largest consumer of such communication primitives, Linux uses pipes.

What are "pipes"?

Before you read this section, I strongly encourage you to visit your bathroom or kitchen and examine closely the metal or plastic extended round items that have moving water inside (or, if your country isn't rich enough for you to have them, take a closer look to the things the oil your country sells flows through). These are pipes.

Pipes that have started a war!

A week ago I watched "Fair Game" movie. Being a non-US national, I nevertheless consider the problems discussed there relevant. And one episode of a plot attracted me. One of the "casus belli" for attacking Iraq was that it purchased pipes that could be used to make nuclear weapons (in fact, they couldn't, and it all was a speculation of politicians). Above is a picture of such pipe.

See how important pipes could be? Read the rest of the article to learn more!

Pipes are used to direct a flow of something from one place to another; it could be water, oil, other liquid, or even documents, or change (see side pictures). Pipes can even start a war! And Linux pipes transfer bytes: you can write to one endpoint, and read the data written from the second endpoint, which may end up in another process!

The three properties of a pipe

The first is that a pipe can transfer payload only in one direction. You can't use a single pipe to transfer water in both directions, such that it would leak from one end and simultaneously consume water for it to leak from another. For that, you need at least two pipes.

Second, a pipe has a limited capacity. When you close your valve in the kitchen, your pipe is full of water, and no matter how the pump station tries, there will never be more water inside the pipe than there is now. (Of course, the station may try harder and make your pipe leak, and water can undergo compression under certain conditions, but it's not generic). When the station tries to pump more water, the new water is "blocked". It continues until the valve at the other end is opened, and the water is removed from the pipe for the new to come from the other end.

The third property is that a pipe transfers the payload more or less sequentially. Even transfer of liquids, which can mix easily, is somewhat sequential: when you turn on your shower after a cold night, you literally feel how the cold water is removed from the pipe before the hot water starts to erupt.

The interesting thing is that Linux pipes are designed closely after "real world pipes", the only difference being that Linux pipes transfer information, bytes.

Pipes in Linux

The main attributes of Linux pipes, one-side transfer of data, limited capacity, and sequential output, are found in the real world too, as shown above. There is, however, one difference.

The pipes in the real world are usually found in "full" state, i.e. the pipe is full and waiting for items to be removed from it. In Linux programming, however, the "empty" pipes, where it is the consumer who waits for the input, are much more widespread.

To create a pipe, you just invoke the relevant system call via a standard C library. A standard pipe(2) call returns two file handlers, one is for reading (fh_out), and another—for writing (fh_in).

The use of these file handlers usually happens in a blocking mode, such that:

a read from fh_out blocks until something is written to the other end of the pipe;
a write to fh_out returns when it has written everything into the pipe. If the pipe has no capacity to consume everything, the call is blocked until the consumer reads some data from the other end, so that more data could be written.

Of course, you can use ioctl to adjust the modes and make such calls nonblocking. Still, you can't bypass the basic restrictions: you obviously can't read what's still haven't been written, and you can't store more data in a pipe than it's capable to keeping. If you want to continue execution as the data are automatically written to an overflown pipe, you have to allocate a buffer and a concurrent thread that pushes data there.

You should never forget about these two blockages, as it will affect your everyday workflow (see, for example, this StackOverflow question about less command). Yes, sometimes it's an obstacle you have to specifically overcome (in the third article about pipes in Linux I'll address it). But in most cases such two-blockage behavior is really what you want.

Pipes as a synchronization means

Let's return to the find ... | xargs -L 100 example shown at the beginning of the article. If the find command has already found a lot of files, there's no sense for it to work further, damaging the response time (the frequency of matches found printed) of the system. With pipes, it will be seamlessly blocked by write() to a pipe, and you don't even have to write anything to support it: your simple, usual printf() will just return control only when the second party does some work!

In other words, the design of Linux pipes makes, for two processes connected with a pipe as A | B, this two-blockage system automatically "pause" the faster to let the slower a chance to accomplish its job!

So, basically, pipe is a synchronization "primitive" that pauses a program connected to one of its ends at certain operations. It's not that "primitive", actually, as it may be implemented via a semaphore, but it's simple enough to consider it as such. And—as with other synchronization mechanisms—you may deadlock when using a single pipe, let alone using multiple pipes.

Road to open3

So, let's return to using inter-process communication for more efficient programming in Linux. A classical way for a tool to work is to print to standard output, and read from standard input, putting error messages to standard error stream. How could you efficiently use such a tool in your program? The answer is: connect to it with pipes! Linux has all the capabilities to arrange the filehandlers in such a way, that the program won't even notice that it's printing to a pipe instead of a console terminal.

We used named pipes to illustrate how Bash may be used for parallelization.

Of course, you may do it upfront, without impersonating a genuine console. Linux has a capability to create named pipes with mknod call. These look like usual files, but are actually pipes. If a target program can read from it a file instead of reading from standard input (or write to a file instead), you're lucky. However, this sometimes makes the target programs unnecessarily complex—and they're already complex enough, just take a look at various echo implementations, of a program that is supposed to just print its arguments. Second, this functionality is rarely provided for standard error stream, and error log is a very important piece of information for tackling bugs. Therefore, you will have to either use shell redirection, or just to establish a direct connection from the child's streams to the parent, which is an easier solution.

As we've already learned, you'll need three of them: one for each channel among input, output and error. This has gave the name to how such a function is usually called in standard libraries of various languages, open3. It takes a command line as an input, and returns three filehandlers corresponding to the said streams of the program spawned. Here's what it looks like in Ruby:

However, open3 implementation may sometimes be not powerful enough to meet all your desires (it happened during my work on a cluster software for a project at work, see some rant in my earlier post), and you'll have to code a more sophisticated version. That's why it's important to know how open3 is implemented. It has several quirks, and in the next blog post I'll explain the intrinsics of an open3 implementation.

Proceed to "How to Implement open3"

Contents