Fail Early, Fail Fast

As you probably know (I'm sure I mentioned it somewhere in this blog, not just in my CV), about half a year ago I developed a set of components to make our Linux Driver Verification toolset run in a cloud, i.e. on a set of dynamically added and removed virtual machines.

Being low on man-hours allocated to this task, I needed to make the development as fast as possible. Our toolset was, in fact, a simple client-server architecture: it had a large task that was split into several independent smaller tasks, the results of which were then collected. While we did have making our toolset distributed in mind (I even had a post sketching our small debate on how to create a parallelization-friendly architecture), the data exchange had been implemented long before the infrastructure for distributed computing. The data exchange was based on a strict layout of files in the working directory: the clients expected the input data in specific places, and the server was to ensure the proper files resided there. Then, the server read the output from the specific files the clients created.

Linux has support for Filesystems in Userspace (FUSE). This means that a non-root user may install and mount a "virtual" file system that maps system calls on files and directories to whatever the user specifies, such as accesses to an archive, or to a file on a remote machine.

One such file system, SSHFS, which mirrors a filesystem on a remote machine via a Secure Shell tunnel, was featured in my previous posts about open3.

In a cloud, the clients and the server are located on separate machines. How do we perform the data exchange? The simplest way to do this was to use a transparent network file system, so that the data access code stays untouched. Indeed, clients just read files from a filesystem, and a combination of Linux filesystem tools maps the read() calls across the network to reads from the server machine.

To achieve this, I used a combination of SSHFS and UnionFS, both being userspace file systems (see sidenote). UnionFS was used to make the client write to the local machine and read files from the remote server while thinking it's just one simple filesystem.

It seemed a nice solution because the files the client expected to find in the filesystem were seamlessly transferred via SSHFS. Its caching capabilities ensured that the latency was incurred only when a file was located, while simple buffered reads on that file were cached in advance. Since we needed to locate only a relatively small number of files, but read a lot from them, the latency shouldn't have been crucial.

Have you already guessed where the problem is?..

The trouble: C preprocessing

There was one simple task in the whole infrastructure which never got much attention, but which turned out to be responsible for all the latency in the data exchange over the network. Our toolset analyzed C source code; therefore, it needed to preprocess it. And the trouble was in that plain old C preprocessing; namely, in its #include directives.

How do #includes in the C preprocessor work? A compiler has a prioritized array of folders to look for headers in, the user specifies their own prioritized set of folders, and each #include <file.h> walks the array starting from the folder of highest priority, looking up file.h in each of them until it finds a match. Then it inserts the contents of the file found into the source code, and keeps preprocessing. Where does this clash with our assumptions about the performance of our filesystem stack?

During C preprocessing, a very large number of file lookups will fail!
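In spirit, each #include boils down to a walk over the include path, where every directory tried before the right one costs a failed lookup on the filesystem. A Ruby sketch (the directories and the header name are made up):

include_path = ["/my/project/include", "/usr/local/include", "/usr/include"]
# Every directory checked before the matching one is a *failed* file lookup.
header_dir = include_path.find { |dir| File.exist?(File.join(dir, "file.h")) }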

It's the crucial point of preprocessing; it's been working like this for decades, and we shouldn't touch it (we don't want to, either, as using the C preprocessor as a third-party tool is much easier to deploy). This is an example of a completely legitimate motivation for issuing a lot of failed requests to something (to the file system, in our case). And it works comparatively well locally: while preprocessing of one file on a local disk takes less than a second, the same task over the network requires ten seconds or more! What could we do with our filesystem setup to avoid such performance problems?

Nothing. The calls that are doomed to failure on the UnionFS level will be routed over the network to fail on the server, causing insane amounts of latency. Even copying all the headers from the remote machine to the local one wouldn't help: the successes would indeed complete faster, but the failures would still make their way to the server, and cause latency anyway.

***

What can be learned from this? When you're trying to design a system, you should pay closer attention to failures. It's tempting to optimize your architecture only for fast handling of successful requests, while the failure of a request may be just as important!

Of course, this varies with your domain: for instance, a web authentication system would specifically slow down failures as a protection against brute-force attacks. In my case, however, I had to patch SSHFS (or UnionFS) to make it cache failed file lookups. And I did it! See my blog post that describes how I patched it.

But when I design another piece of software in my career, and it has to handle legitimate unsuccessful requests, I'll do my best to make it fail early and fail fast.

Comments imported from the old website

Marian (id) on 27 January 2012 commented:

Could it be that line 171 of the time.pl script

print STDERR "${id_str}MEM $timeinfo->{total}\n";

should actually be

print STDERR "${id_str}MEM $meminfo\n";

? Otherwise you print out the execution time instead of memory.

Marian (id) on 27 January 2012 commented:

Sorry, I actually meant to post that comment to http://coldattic.info/shvedsky/pro/blogs/a-foo-walks-into-a-bar/posts/40 :(

Got to the wrong pager after OpenID auth.

How to Use Open3 and Avoid Dead Locks

Linux pipes and open3

Three articles on how open3 is implemented in Linux, and how it is properly used

Open3 is a common name for a function that spawns a process and substitutes its standard input, output and error streams with the pipes connected to the original process. This way you may interactively feed the child with data, and read the processed data back.

The problem with open3 is that its power renders it more dangerous, and its usage more error-prone. As the famous Spider-Man quote goes, "with great power comes great responsibility", and Linux inter-process communication is no exception. So here comes effective usage advice number 1.

Avoid open3 unless necessary

For the purpose of clarity, throughout the post, standard input, output, and error file handlers are called "standard filehandlers"; the process that spawns, and the process that is spawned are the "parent", and the "child" respectively.

The first piece of advice about safe open3 usage is... to avoid using it. Languages that are popular in the Linux world usually come with a handful of simpler tools for process spawning, such as the ones listed below the side note:

Shell redirection instead of pipes

Instead of using open3 in your program, you may try to connect process inputs and outputs with Bash redirection utilities, invoking the shell via standard functions to spawn processes.

For instance, you may write system("find | gzip >archive.gz") in Perl to spawn a process that archives a recursive directory listing into a file, instead of passing the data through the parent process. You may use a usual redirection into just a text file as well, if that's what you really need.

However, I wouldn't use intermediate files if they only dump information temporarily for immediate processing by something else, like this:
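(A Ruby sketch of what I mean; the file names are placeholders.)

system("find . > listing.txt")              # dump the intermediate result to disk...
system("gzip -c listing.txt > archive.gz")  # ...only to feed it straight to the next command
File.delete("listing.txt")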

I consider it bad style, not to mention that it wastes time on disk access and synchronization that may be totally unnecessary. The arguments to the commands passed this way will also undergo interpretation by the shell, so you'll need to escape them, a painful and error-prone process.

If you need the intermediate result saved in files for debugging purposes, you may still dump it in debug mode, saving cycles when your project is built for production use.

  • system — spawns the process, inheriting the parent's standard file handlers;
  • fork+exec — ditto, but possibly with additional customizations you learned about in the previous post (such as tuning the environment or playing with file handlers);
  • popen—opens only one pipe to the process, either for reading or for writing (note that popen also suffers from some of the typical misuses of pipes);
  • shell redirection utilities—Bash has a number of means to control and redirect the input and output of a process. A section in one of my older posts is devoted to pipes in Bash. What could be helpful is the ability to use the shell interpreter when spawning a process with functions such as system in Perl, redirecting the data instead of passing it through the parent process (see side note).

If one of these tools fulfills your needs, then go for it! You can replace it with open3 at any time anyway—unlike houses, software is relatively easy to rebuild.

Where open3 should be used

However, if you find the alternatives listed above inefficient, or not powerful enough, you may opt for using open3. I'll try to enumerate the cases where I would certainly advise using open3.

What is select?

The select function (man page) is a common name for a function that waits on file handlers. Assume you develop a task scheduler that connects to several machines via sockets; it should wait for the first socket to have a result available in order to schedule the next task to that specific machine (since the others are obviously busy). How do you accomplish that?

Of course, there is a faster and easier-to-use alternative than trying a non-blocking read on each socket once every several milliseconds. It is the select() function (or one of its companions, such as poll() or epoll) that performs a blocking wait on all of them. This function returns the list of filehandlers that have data readily available in them—as soon as there is at least one such descriptor!

You may find plenty of manuals on how to use select; later in this post you'll see an example.
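In Ruby, for instance, it boils down to a call like this (the socket objects and the scheduling function are hypothetical):

ready, = IO.select([sock_a, sock_b, sock_c])   # blocks until at least one has data
ready.each { |sock| schedule_next_task_for(sock) }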

  1. several consumers of one source—if you are given a source of data that should be redirected to several consumers (such as data that should be both printed onto the terminal and saved into an archived log file for history-keeping reasons), you should connect pipes to its output and error streams, and when you read some amount of data from there (for instance, when you read a line), send it to all the consumers you have. In some cases you can do it with Bash (not with sh), but the code with open3 should look much clearer.
  2. aggregation of several providers—if you maintain several data sources, which should all be aggregated into a common place (a text file, for instance) that has some concurrency issues, you might benefit from open3. You may spawn the processes, and then select() (see sidenote) their output streams, thus making the writing of the data to the aggregator thread-safe.
  3. interactive data exchange with a continuously-running process—you can't avoid open3 if you spawn a process that responds to your input with some text on stdout. A lot of auxiliary processes that are launched thousands of times have the interface of reading from stdin and writing to stdout (such as SAT solvers), and you most likely can't afford writing or reading anything from the disk.

One of the examples of the application of these patterns (more specifically, a combination of the first and the second) is transparent logging—it's a blatant crime to use anything but open3 for this. By transparent logging I mean combining logs from the child processes into a common logging sink in the parent one. Usually it's done automatically: the system call just routes standard output of the child to that of parent. However, assume you spawn several concurrent processes, and unless you prefix each of them with a unique identifier, you'll get lost in the log quickly.

This may be solved by opening these processes with open3, and attaching prefixes to their output lines before printing them. Note also that this way you may control the severity of logs: for instance, you might want to treat standard error and output streams from the children differently, and that cannot be achieved with a simple popen call.

Synchronous vs. asynchronous

In this post we will mostly study issues with a synchronous processing of data exchange with the child process. Being synchronous means that nothing else but the interaction with the process is performed while the child is alive. In other words, there is only one thread in the parent.

Some notes on asynchronous processing will be given at the end of this post. Still, I consider the synchronous experience extremely valuable, as it helps to shape the vision of what a good asynchronous interaction via open3 should look like.

Issues with open3

So far I've been telling you that open3 is not straightforward to use, but what causes the complexity? Why could there be the "Dead Locks" referenced in the title of this post? I had to learn this the hard way, tackling cumbersome bugs in our project, and here's what it was about.

Following the pattern "aggregation of several providers", I spawned the process that wrote to both stdout and stderr with the intent to combine them both in the archived log file. The code (simplified) looked like this:
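(The listing below is a sketch rather than the original code: it assumes Ruby's Open3, and "some_tool" and the log file name are placeholders.)

require 'open3'

# Aggregate both streams into one log: drain stdout completely, then stderr.
stdin, out, err = Open3.popen3("some_tool")   # newer Rubies also return a wait thread
stdin.close
File.open("tool.log", "w") do |log|
  out.each_line { |line| log.puts line }                 # blocks until stdout hits EOF...
  err.each_line { |line| log.puts "STDERR: #{line}" }    # ...stderr is only drained afterwards
end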

I expected to get a file that has all the lines from the stdout of the child, and prefixed lines from the stderr of the child afterwards. However, sometimes the application just deadlocked!

A quick strace demonstrated that the parent hung up on read from child's stdout, and the child hung up... on write to its stderr! How could that be?

Remember that in the first post about pipes I listed limited capacity as one of the properties of pipes. I stressed it on purpose, because it plays its role right here, when you try to use open3. The pipe that connected the child's stderr to the parent was full, and the child wanted to write an error message there, while the parent was still blocked reading from the output pipe, because the child was still running and its stdout had not been closed. That's the open3 Dead Lock.

Of course, this could happen with any pair of pipes here. Assume the process just echoes every character it takes on its input to its output (of course, real programs will be doing useful transformations, but for clarity we may assume it is the identity). We want to feed it twice as many characters as the pipe's capacity.
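You can reproduce it in a few lines, with cat standing in for the echoing child; the payload is chosen to comfortably exceed the capacity quoted below:

require 'open3'

stdin, stdout, = Open3.popen3("cat")
stdin.write("x" * 200_000)   # hangs: cat fills its stdout pipe, then our write fills the stdin pipe
stdin.close
puts stdout.read.bytesize    # never reached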

You might think that this won't strike you unless you're dealing with extremely long inputs and outputs. Sorry to disappoint you, but, quoting the man 7 pipe:

In Linux versions before 2.6.11, the capacity of a pipe was the same as the system page size (e.g., 4096 bytes on i386). Since Linux 2.6.11, the pipe capacity is 65536 bytes.

It's not much, though, in certain cases, it's big enough to let badly written programs work. On a larger scale, we definitely need a generic, limit-agnostic solution.

How to prevent the dead lock

It's relatively simple to devise a generic rule for mitigating the effect of such a limitation. To avoid deadlocks with open3, you should clear each of the output pipes (and fill the input pipe) as soon as possible, and not put a dependency between clearing one pipe and waiting for another. So we need to watch closely all the file handlers (up to three) which we plan to read/write data from/to—and it's important to wait for all of them at once! If you read the sidenote above, you already know that we could use select for this.

Using select to react to input promptly

So, a potential way to fix the program above is to write something like this:
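(Again a sketch rather than the original listing, with the same placeholder command and log file as before.)

require 'open3'

stdin, out, err = Open3.popen3("some_tool")
stdin.close
watched = [out, err]
File.open("tool.log", "w") do |log|
  until watched.empty?
    ready, = IO.select(watched)          # wait for whichever pipe has data
    ready.each do |io|
      begin
        line = io.readline               # still a blocking, line-buffered read
        log.puts(io == err ? "STDERR: #{line}" : line)
      rescue EOFError
        watched.delete(io)
      end
    end
  end
end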

This program, however, only outlines the approach to tackling deadlocks with open3. It is still prone to deadlocks, though less so than the original one. Assume that the child started to print a line to its stdout, but hasn't finished it, because it got a debugging interrupt. Having not finished printing the line, it spits 100 KB of debugging information to stderr with the intent to continue printing the normal output to stdout afterwards. However, the parent is still blocked in the out.readline() call, waiting for the line termination character to appear there, and the child gets blocked at the write to stderr, because the err pipe is full, and no one's going to remove data from it. Deadlock again. (You may play with this deadlock by open3-ing this sample program).

The issue here is that we still do not "remove data from pipes as soon as possible". For that, we need nonblocking reads that are lower-level than the readline()- and scanf-like functions.

Using nonblocking reads and your own buffers

The problem with nonblocking low-level reads, as Capt. Obvious notes, is that they are low-level. We can't read more or less structured data from them. Assume that we want to read a number (a number of seconds to wait before launching the starship, for instance) from the child's stdout. If that debugging interrupt described above is triggered just in the middle of printing 1000, our nonblocking read will read 10 (before turning to reading from stderr), and act accordingly, launching the multi-billion-dollar ship prematurely. From the child's viewpoint, however, doing so is totally legitimate, since it printed 1000 and debugging information to the different output channels (stdout and stderr), and if the reader confuses these numbers, it's its problem.

Do not use strings as buffers (like I do here)

In the samples below we use "strings" as buffers. However, strings in modern scripting languages (including the one used here, Ruby) consist of multi-byte characters of variable length (see Joel's post on Unicode), not of one-byte symbols. On the other hand, in some less modern and "less scripting" languages, strings cannot contain zero bytes, as a zero byte would be treated as the end of the string.

Therefore, "strings" are going to misbehave if chosen as a buffer for an abstract byte stream. I used them for simplicity, and for the sake of demonstration of open3 usage; in real programs, however, you should not use them.

Therefore, we need to handle these situations accordingly, adding another level of indirection between the child and the parent. We will store the data we read from the pipes in intermediate storage; we could name it a buffer. In fact, we are going to re-implement buffered reads.

In the next example we will implement a linewise-open3, a subroutine that invokes user-defined callbacks upon reading a complete line from the child's stdout or stderr. You could play with it more, introducing scanf-like behavior (instead of the regexps you'll see). However, we have two more issues to discuss before getting to the code.

Reactive interaction with both input and output data

The select with callbacks works well for reading output with the input pipe closed from the start. What should we do if we want to write something to the input pipe of the child based on what's being read from it?

To talk interactively with the child, you'll most likely need asynchronous processing. However, there is one pattern which allows the exchange of data through both the input and the output in synchronous mode. We already agreed that we will invoke user-defined callbacks after reading lines from the child's stdout and stderr. However, we didn't use the return value of these callbacks in any way! The idea arises immediately:

If a user-defined callback on an stdout or stderr line returns a non-empty string, we feed this string to the stdin of the child as soon as possible. To get more control over the child, we treat a NULL (nil) return value from a callback as a sign to close the standard input. We will also make our wrapper take a string as input and feed it at the very beginning, as this is what could trigger the outputting of data in the first place.

This may still sound not powerful enough, but you have plenty of asynchronous options left, such as interrupting pselect with a specific signal, or setting up a separate thread for feeding data into the input pipe. We will omit these options in this blog post.

Watching for the child termination

As we're implementing a synchronous open3 wrapper, the natural assumption would be that at return from the wrapper the child should be dead and reaped. This way we can also return its exit status to the parent.

As we discussed in the previous post, process termination does not depend on the status of the pipes, so we should watch for it independently.

What we actually want is a version of select that watches for filehandlers and for the child's termination. If you know such a version of select (which I don't), make a comment, please. For now, let's search for another solution.

Such functionality could be simulated with a special wrapper thread that only waits for the child (with wait), and sends a signal upon its termination. This signal would interrupt select, and we would handle that.

In our project I implemented a simpler solution. It is based on the assumption that the longer the child has been running, the less likely it is to terminate during the next second. So we can use periodic wakeups to check the process status (implemented as a non-null timeout to select), and increase the wait period over time. Having an upper bound for that period is a good idea as well.

Note that if a process has terminated, and the pipes are still open, we assume that the last nonblocking read will fetch us all the data we're interested in, and we may close the pipes. This may not be true, but in such specific cases you'll need specific solutions anyway.

Linewise-open3 code

Let's sum up what we're up to. Here's a listing of an open3 wrapper that prints the string supplied into the stdin of the child, and then invokes one of two user-specified callbacks when a complete, \n-terminated line is read from the child's stdout or stderr respectively. These callbacks may return more data to put into the stdin of the process. The execution terminates when the child terminates. The wrapper returns the return code of the process (everything else is assumed to be done by the callbacks).
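What follows is a sketch of such a wrapper rather than the original listing: it assumes a modern Ruby whose Open3.popen3 also returns a wait thread, and the names (linewise_open3, on_out, on_err) are illustrative.

require 'open3'

# linewise_open3: feeds `input` to the child's stdin, calls on_out / on_err for
# every complete "\n"-terminated line read from the child's stdout / stderr,
# writes any non-nil string a callback returns back to the child's stdin
# (a nil return closes stdin), and returns the child's exit code.
def linewise_open3(cmd, input, on_out, on_err)
  stdin, stdout, stderr, wait_thr = Open3.popen3(cmd)

  pending   = (input || "").dup                 # bytes not yet fed to the child
  buffers   = { stdout => "", stderr => "" }    # our own buffered-read storage
  callbacks = { stdout => on_out, stderr => on_err }
  readers   = [stdout, stderr]
  timeout   = 0.05                              # grows while the child keeps running

  # Cut complete lines out of a buffer, hand them to the callback, and queue
  # whatever the callback returns for the child's stdin.
  consume = lambda do |io|
    while (line = buffers[io].slice!(/\A.*\n/))
      reply = callbacks[io].call(line.chomp)
      if reply.nil?
        stdin.close unless stdin.closed?
      elsif !stdin.closed?
        pending << reply
      end
    end
  end

  until readers.empty?
    writers = (!stdin.closed? && !pending.empty?) ? [stdin] : []
    ready = IO.select(readers, writers, nil, timeout)

    if ready.nil?                       # timed out: is the child still there?
      unless wait_thr.alive?
        readers.each do |io|            # last nonblocking read, then give up on the pipes
          begin
            buffers[io] << io.read_nonblock(65536)
          rescue IO::WaitReadable, EOFError
          end
          consume.call(io)
          callbacks[io].call(buffers[io]) unless buffers[io].empty?
          io.close
        end
        readers = []
      end
      timeout = [timeout * 2, 2.0].min  # the longer it lives, the rarer we poll
      next
    end

    ready[0].each do |io|
      begin
        buffers[io] << io.readpartial(4096)   # grab whatever is available right now
        consume.call(io)
      rescue EOFError
        callbacks[io].call(buffers[io]) unless buffers[io].empty?
        io.close
        readers.delete(io)
      end
    end

    unless ready[1].empty?
      begin
        pending.slice!(0, stdin.write_nonblock(pending))
      rescue IO::WaitWritable
        # pipe is full again; try on the next round
      rescue Errno::EPIPE
        stdin.close
      end
    end
  end

  stdin.close unless stdin.closed?
  wait_thr.value.exitstatus             # reap the child and return its exit code
end

# A toy run: `cat` echoes the line back; returning nil from on_out closes stdin,
# which makes cat exit.
code = linewise_open3("cat", "ping\n",
                      lambda { |l| puts "OUT: #{l}"; nil },
                      lambda { |l| puts "ERR: #{l}"; "" })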

I tested this program on random echo. You may also view the complete listing for open3 usage, and test it either with echo or with sshfs installed on your system.

Note also that in our project we have developed an open3 wrapper for Perl as well; you can view it here. It is less capable (it lacks the interactiveness), but it's live and working.

Notes on asynchronous open3 usage

In some cases you may trade complexity for efficiency. The problem with the deadlock above was that we had a single thread designated to read from several pipes. Instead of introducing a buffering system, we may just spawn several threads, and attach each of them to its own endpoint. The burden of waking up the thread that actually has something to process doesn't vanish; it just gets imposed on the OS scheduler, which should contain fewer bugs, or on the language runtime scheduler (if it employs green threads, as in Ruby).
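A minimal sketch of that thread-per-pipe approach ("some_tool" and the prefix are placeholders):

require 'open3'

Open3.popen3("some_tool") do |stdin, stdout, stderr, wait_thr|
  stdin.close
  lock = Mutex.new                        # keep the shared sink consistent
  threads = [stdout, stderr].map do |io|
    Thread.new do                         # one blocking reader per pipe
      io.each_line do |line|
        lock.synchronize { print(io.equal?(stderr) ? "STDERR: #{line}" : line) }
      end
    end
  end
  threads.each(&:join)
  puts "exit code: #{wait_thr.value.exitstatus}"
end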

This may sound like an easier solution, especially in the multicore era, where developers cheer at every possibility to develop a multithreaded program that makes the high-end hardware stall less. To me, it's a bit of an overkill for simple tasks (if they can be expressed in terms of the interface discussed above). And in the context where we used open3 in our project, too many competing threads had already started to become a problem.

Perhaps, I'll study the asynchronous option in other posts.

Conclusion

This post concludes what I initially planned for the series of blog posts on how to work with pipes in Linux. Another topic emerged, on asynchronous interaction with an open3-driven process, but I don't know if I will write about it.

We have observed what pipes are, how they are used for inter-process communication, what options we have for spawning processes with pipes attached to them, and how this may be achieved in modern Linux scripting languages. However, we mainly focused on open3, studying the details of its implementation, and the use cases it's the most efficient in. We studied its quirks and how to avoid the traps lurking among the pipe joints we have to deal with.

I learned all this by spending a couple of days messing with processes that magically deadlocked without any obvious reason, and with nontrivial multithreaded debugging. I hope these posts will help you when you work with pipes, so that you'll avoid my mistakes.

Comments imported from the old website

Paul A. Jungwirth on 24 October 2011 commented:

Thank you for writing up such a detailed description of the problem and your solution! Another approach is to use Ruby's non-blocking IO#ready? method, which I describe on my blog.

It's amazing to me that none of these languages can give a simple interface to safely handle common use-cases. If you can do it in the Bourne shell, why not Perl, Python, and Ruby? For Ruby, I wish you could just pass two procs, one to feed the filter and one to consume it, and their reading/writing would just work. Alas!

Brilliant article, and excellent blog in general. Cheers man!

I used an adaptation of this to solve my problems with hangs I was getting when capturing the output of doxygen.exe, which evidently interleaves stdout and stderr without newline termination. I did find that the readpartial/new line splitting code didn't seem to work properly though and failed a couple of my unit tests. I changed it to eg:

stillOpen = [stdout, stderr]
while !stillOpen.empty?
  fhs = select(stillOpen, nil, nil, nil)
  handleIO(stillOpen, fhs[0], stdout, @outputBuffer)
  handleIO(stillOpen, fhs[0], stderr, @stderrBuffer)
end

def handleIO(stillOpen, ioArray, io, buffer)
  if ioArray.include?(io)
    begin
      buffer << io.readpartial(4096)
      postNewLine = buffer =~ /([^\n]+)\z/ ? $1 : nil

      buffer.scan(/.*\n/) do |line|
        handleLine(line.chomp)
      end
      if postNewLine
        buffer.replace(postNewLine)
      else
        buffer.clear
      end
    rescue EOFError
      stillOpen.delete_if{|s| s == io}
    end
  end
end

Which enabled my unit tests to pass, and I think is more efficient too.

I tried Paul A. Jungwirth's suggestion but io/wait appeared broken on windows. In any case I couldn't get it to work.

Damn, my attempt at making the code look nice didn't work :(

Pavel Shved on 22 November 2012 commented:

Thank you, James. It's always nice to know that somebody likes my blog! :-)

I fixed your comment to make code look nicer; you should use Markdown to format it; i.e. the code is 4 spaces at the beginning of the line.

How to Implement open3

Linux pipes and open3

Three articles on how open3 is implemented in Linux, and how it is properly used

Open3 (or popen3) is a common name for a library function to spawn a child process with its standard streams (input, output and error) both connected to newly created filehandlers in the original process. You may encounter that function in such languages as Perl, Ruby or Python.

The standard Linux C library, unfortunately, does not contain such a function. Only its less capable counterpart, popen, is a member of the C library. The popen function, which is more common, opens only one pipe, either to the input or to the output of the process being spawned. This means that if you program in C you'll have to implement open3 on your own.

Another reason why you would have to re-implement open3, even if your standard (or non-standard) library provides it, is when you need more control over what happens during its call. For instance, Perl and Ruby 1.8 versions of open3 lack the ability to adjust the environment of the process being invoked (see my older rant). This functionality is why I re-implemented open3 in the project I work on.

The choice of Ruby was driven by the fact that I used it in the project I work on, and that it has a killer feature, unhandled exceptions (so that you don't have to check the return code of each library call you invoke, as in C). Besides, it's a lot less verbose.

The subsequent sections of this article depict a sample implementation of open3 in Ruby. However, it relies on certain system and C library calls I will also outline, so you'll be able to do the same in C or in any other language with access to such primitives.

Foundation for open3

The open3 implementation contains no magic. It relies on two powerful features of the Linux system.

Forking

Good old forking comes to help in this case. Process spawning in Linux is done via the fork call, which clones the current process, followed by an exec call, which replaces the newly spawned child with the command we wanted to invoke in the first place. It may sound unnecessarily complex, but in our case it has a feature that makes open3 as a library function possible. After fork, but prior to exec, we can invoke commands in the context of the new process, while retaining the knowledge we had in the old one. How does it help us?

First, we may alter the environment of the child being invoked, which was my original motivation for reimplementing open3. Many scripting languages have a global, process-wide ENV hash with read/write access, which is used as the environment at execve system calls. If we need to alter the environment of the child process in a thread-safe way, we may modify ENV between fork and exec, so that the modification happens in the context of the child process without interfering with the parent.
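In Ruby that looks roughly like this (the variable and the command are placeholders):

pid = fork do
  ENV["LD_LIBRARY_PATH"] = "/opt/mytool/lib"   # visible only in the child
  exec("my_tool", "--verbose")                 # replaces the child with the target command
end
Process.waitpid(pid)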

Second, some filehandlers (internal OS "pointers" to open files, pipes and sockets) are preserved across exec! Namely, those that have the FD_CLOEXEC flag cleared. The flag is set by default (I guess, for all except the standard 0, 1, and 2 descriptors), but we can change it via the fcntl call.

So what we could do is to open pipes in parent process, and inculcate the knowledge about them into the child process before the target command is executed. How should we do the inculcation?

File handler renaming

The child process should spawn with the pipes instead of its STDIN, STDOUT, and STDERR. However, after forking, it already has some standard streams assigned by the OS or the language runtime. What should we do now? The answer is obvious: rename the relevant file handlers.

Closing and then opening a file IO object in a high-level language (such as C++) is not the same as "reopening" it, as the OS filehandler will not be retained, which is extremely important when altering standard streams.

Language runtime libraries usually have "reopen" functions (freopen in C, or IO#reopen in Ruby). The thing is that such a function substitutes the file pointed to by the OS file handler, as well as the underlying file of the "language" file handler (an integer value in C, or an IO object in Ruby or, say, C++).

While changing the IO object's intrinsics is left to the specific language, OS filehandler swapping is performed via the C library function dup2. A usual dup function just duplicates a filehandler. Called as dup2(keep, replace), it will close(replace) and make replace refer to the same file as keep. This renaming is relatively cheap, and the new descriptor has its FD_CLOEXEC flag cleared.

Forking and threading in POSIX is an interesting subject in itself. I once had a lot of fun reading this thread, and then the FAQ of the newsgroup.

Note that fcntl and dup2 are async-signal-safe functions, and therefore they can safely be called between fork and exec in a multithreaded process.

Simple implementation

Okay, let's assemble all the tools together. Opening pipes is easy, as described before; then come the fork and the renaming of the file descriptors.

Instead of just assigning an environment variable, we may allow the user to specify the actions they want executed in the child's context, as a function pointer (or as a block, in Ruby).
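A sketch of such an implementation in Ruby (my_open3 is my name for it, and error handling is omitted; the optional block runs in the child's context right before exec):

def my_open3(*cmd)
  in_r,  in_w  = IO.pipe
  out_r, out_w = IO.pipe
  err_r, err_w = IO.pipe

  pid = fork do
    yield if block_given?                 # child-only customizations: environment, limits, ...
    $stdin.reopen(in_r);   in_r.close     # dup2() under the hood: the pipe becomes fd 0
    $stdout.reopen(out_w); out_w.close    # ...fd 1
    $stderr.reopen(err_w); err_w.close    # ...fd 2
    in_w.close; out_r.close; err_r.close  # the parent's ends are of no use to the child
    exec(*cmd)
  end

  # The parent keeps its ends of the pipes and closes the child's ones.
  in_r.close; out_w.close; err_w.close
  [in_w, out_r, err_r, pid]
end

A call like stdin, stdout, stderr, pid = my_open3("grep", "foo") { ENV["LC_ALL"] = "C" } then gives the parent the three pipe ends and the child's PID.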

Process termination control

From this simple implementation we have completely omitted an important way to control what's happening with the process: controlling when it terminates (or forcing it to).

The thing is that whether the process is terminated does not depend on the status of its standard pipes. First, closing the standard input doesn't necessarily make the process terminate (as with the standard grep command). Actually, very few of them do. So if we've got all the results we wanted from our forked creature, we should have a means to kill it before it gets too late...

Second, the filehandlers we read from do not tell us whether the process is alive either. Remember that filehandlers may survive exec if FD_CLOEXEC is not set? The process we spawned might, in turn, have given birth to a daemon, which we don't want to mess with. It could inherit the filehandlers, and just not use them. This actually happened with the SSHFS mounter: the process wouldn't close the output pipes.

Therefore, pipes won't help us to control the process spawned. What could? Of course, its PID. The fork returns to us the process identifier of the child spawned, and our process, as the parent, has full access to its status and to its return code. So, along with the pipe filehandlers, we may return the child's PID to the caller for it to control the process in a more fine-grained way.
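With the sketch above, that amounts to something like:

stdin, stdout, stderr, pid = my_open3("some", "tool")
# ... read whatever we need from stdout / stderr ...
Process.kill("TERM", pid)            # the pipes tell us nothing; the PID lets us decide
_, status = Process.waitpid2(pid)    # reap the child and inspect how it ended
p status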

Anything else?

We've learned how to implement open3 itself. However, the bigger troubles are coming when we try to use it.

The thing is that open3 is a powerful tool, but with great power comes great responsibility. As outlined in the previous post, reads and writes to pipes may block at each endpoint, and this renders the usage of open3 error-prone. You should not use it for every process spawn. If a simpler system, fork, or popen would suffice, use them. But if they don't, you're welcome to read the next article on how to actually use open3 in an efficient and correct way.

Proceed to "How to Use Open3 and Avoid Dead Locks".

Penn&Paper's Solutions™: Table Scroll Sanitizer

A couple of days ago, doing the routine task of analyzing how well our tools were finding Linux kernel bugs, I had to check a large table.

I had to check, for each row, whether there were correlations between the values in columns A, B, and C, which weren't adjacent. The hypothesis was something like "if there is a 1 in B, there should be a 1 in either A or C". However, the table had a lot of rows and columns; so many that it didn't fit onto the screen either vertically or horizontally, so I was quickly getting lost in determining which column was which.

The obvious idea would be to automate this task. However, the data representation wasn't nice enough for automated analysis. First, the table was a page in a browser. Second, there were certain tricky javascripts involved, which prevented me from copying it and pasting it into a spreadsheet program or a plain CSV text file. Also, the table wasn't large enough to justify implementing a script to do the checking, so I started doing the check manually.

So I called my old friend, Mrs. Violet Page from Penn&Paper's Solutions™. Violet already introduced one of their products to the readers. I asked her if her company had something to offer to simplify this daunting task of browsing a big table—and they really had something to offer me!

Table Scroll Sanitizer

The tool they offered was the Table Scroll Sanitizer. The instructions are simple: just touch the relevant columns with the pen tool, and apply the resultant sheet to the browser window:

This way, the sheet you've created won't move when you scroll, and thus you'll always see the columns you're interested in marked. This made the tedious analysis assignment much less tiresome, and my performance soared.

Free implementation

However, our company wasn't satisfied with the price of this wonderful Penn&Paper's product. Its price was totally unacceptable: it was just too small for our accountants to bother with arranging the purchase! That's why I decided to make my own implementation, which is both easy to use and free. Here's a screenshot:

Just open a text editor, arrange the marks like you would do with the Table Scroll Sanitizer, and place it right below (or on top of) the browser, and it won't move when you scroll either!

So, if you find yourself in a middle of a large table, and you are lost, you now know what to do. Good luck!

Pipes in Linux and in The Real World

Linux pipes and open3

Three articles on how open3 is implemented in Linux, and how it is properly used

The importance of inter-process communication

In Linux programming, the concept of many small and highly specialized tools collaborating to achieve the goal a programmer set for them has always been dominant. Take a look at shell scripting. Here's how one searches for a string FOOBAR in all *.txt files in their home dir:
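Something along these lines; the exact flags are a guess, since the article refers to it below as the find ... | xargs -L 100 example:

find ~ -name '*.txt' | xargs -L 100 grep FOOBAR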

Pipes

A lot of pipes stocked somewhere in Russia.

Pipes that transfer solid items

This photo is taken from here. These pipes are more widely known as pneumatic tube transport.

Pipes that transfer documents

This pneumatic tube seems to transfer medical documents and samples packed into capsules. I saw such a system in one of the banks I had an account at; there it was used to transfer documents between counters.

It's a good illustration of sequential payload delivery by pipe if the width of the pipe is the same as the width of the objects.

You may find more sophisticated examples in my article on xargs, which is a robust spawner of such commands. It is this robustness that allowed me, for example, to make a simple "web service" that sends information to the web, triggered by just adding lines to a text file. Look how concise its code is!

The usage of such tools is not confined to making a desktop user comfortable at their workstation. These tools are actively used in production scripting as well. Having a lot of ready-to-use tools at hand provides a programmer with extra opportunities for building larger programs: instead of trying to find a library or re-implement something, one may just try to find a small tool for a specific purpose, one that interacts with the outer world via the command line and the input and output streams.

But the combination of tools requires efficient communication between them. Small tools can nevertheless exchange large amounts of data! Being the world's largest consumer of such communication primitives, Linux uses pipes.

What are "pipes"?

Before you read this section, I strongly encourage you to visit your bathroom or kitchen and examine closely the metal or plastic extended round items that have moving water inside (or, if your country isn't rich enough for you to have them, take a closer look at the things the oil your country sells flows through). These are pipes.

Pipes that have started a war!

A week ago I watched the "Fair Game" movie. Being a non-US national, I nevertheless consider the problems discussed there relevant. And one episode of the plot caught my attention. One of the "casus belli" for attacking Iraq was that it purchased pipes that could be used to make nuclear weapons (in fact, they couldn't, and it all was a speculation of politicians). Above is a picture of such a pipe.

See how important pipes could be? Read the rest of the article to learn more!

Pipes are used to direct a flow of something from one place to another; it could be water, oil, other liquid, or even documents, or change (see side pictures). Pipes can even start a war! And Linux pipes transfer bytes: you can write to one endpoint, and read the data written from the second endpoint, which may end up in another process!

The three properties of a pipe

The first is that a pipe can transfer payload in only one direction. You can't use a single pipe to transfer water in both directions, so that it leaks from one end and simultaneously takes in water at that end to leak from the other. For that, you need at least two pipes.

Second, a pipe has a limited capacity. When you close your valve in the kitchen, your pipe is full of water, and no matter how hard the pump station tries, there will never be more water inside the pipe than there is now. (Of course, the station may try harder and make your pipe leak, and water can undergo compression under certain conditions, but that's not the general case.) When the station tries to pump more water, the new water is "blocked". This continues until the valve at the other end is opened and water is removed from the pipe, making room for new water to come in from the other end.

The third property is that a pipe transfers the payload more or less sequentially. Even transfer of liquids, which can mix easily, is somewhat sequential: when you turn on your shower after a cold night, you literally feel how the cold water is removed from the pipe before the hot water starts to erupt.

The interesting thing is that Linux pipes are designed closely after "real world pipes", the only difference being that Linux pipes transfer information, bytes.

Pipes in Linux

The main attributes of Linux pipes, one-side transfer of data, limited capacity, and sequential output, are found in the real world too, as shown above. There is, however, one difference.

The pipes in the real world are usually found in "full" state, i.e. the pipe is full and waiting for items to be removed from it. In Linux programming, however, the "empty" pipes, where it is the consumer who waits for the input, are much more widespread.

To create a pipe, you just invoke the relevant system call via the standard C library. A standard pipe(2) call returns two file handlers: one for reading (fh_out), and another for writing (fh_in).

The use of these file handlers usually happens in a blocking mode, such that:

  • a read from fh_out blocks until something is written to the other end of the pipe;
  • a write to fh_in returns when it has written everything into the pipe. If the pipe doesn't have the capacity to consume everything, the call blocks until the consumer reads some data from the other end, so that more data can be written.

Of course, you can use ioctl to adjust the modes and make such calls nonblocking. Still, you can't bypass the basic restrictions: you obviously can't read what hasn't been written yet, and you can't store more data in a pipe than it is capable of keeping. If you want execution to continue while data is automatically written to an overflown pipe, you have to allocate a buffer and a concurrent thread that pushes the data there.
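In Ruby the same blocking behavior can be observed with IO.pipe, which wraps pipe(2); the 200,000-byte payload is arbitrary:

fh_out, fh_in = IO.pipe
writer = Thread.new do
  fh_in.write("x" * 200_000)   # blocks after ~64 KiB: the pipe is full
  fh_in.close
end
data = fh_out.read             # draining the pipe unblocks the writer
writer.join
puts data.bytesize             # => 200000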

You should never forget about these two blockages, as they will affect your everyday workflow (see, for example, this StackOverflow question about the less command). Yes, sometimes it's an obstacle you have to specifically overcome (in the third article about pipes in Linux I'll address it). But in most cases such two-blockage behavior is really what you want.

Pipes as a synchronization means

Let's return to the find ... | xargs -L 100 example shown at the beginning of the article. If the find command has already found a lot of files, there's no sense in it running further ahead, hurting the response time (the frequency with which found matches are printed) of the system. With pipes, it will be seamlessly blocked by the write() to the pipe, and you don't even have to write anything to support this: your simple, usual printf() will return control only when the second party has done some work!

In other words, for two processes connected with a pipe as A | B, the design of Linux pipes makes this two-blockage system automatically "pause" the faster one to give the slower one a chance to accomplish its job!

So, basically, a pipe is a synchronization "primitive" that pauses a program connected to one of its ends at certain operations. It's not that "primitive", actually, as it may be implemented via a semaphore, but it's simple enough to consider it as such. And—as with other synchronization mechanisms—you may deadlock when using even a single pipe, let alone multiple pipes.

Road to open3

So, let's return to using inter-process communication for more efficient programming in Linux. A classical way for a tool to work is to print to standard output, read from standard input, and put error messages on the standard error stream. How could you efficiently use such a tool in your program? The answer is: connect to it with pipes! Linux has all the capabilities to arrange the filehandlers in such a way that the program won't even notice it's printing to a pipe instead of a console terminal.

We used named pipes to illustrate how Bash may be used for parallelization.

Of course, you may do it upfront, without impersonating a genuine console. Linux has the capability to create named pipes with the mknod call. These look like usual files, but are actually pipes. If a target program can read from such a file instead of reading from standard input (or write to a file instead), you're lucky. However, this sometimes makes the target programs unnecessarily complex—and they're already complex enough; just take a look at the various implementations of echo, a program that is supposed to just print its arguments. Second, this functionality is rarely provided for the standard error stream, and the error log is a very important piece of information for tackling bugs. Therefore, you will have to either use shell redirection, or just establish a direct connection from the child's streams to the parent, which is an easier solution.

As we've already learned, you'll need three of them: one for each channel among input, output and error. This gave the name to how such a function is usually called in the standard libraries of various languages: open3. It takes a command line as input, and returns three filehandlers corresponding to the said streams of the program spawned. Here's what it looks like in Ruby:
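In outline (the command is a placeholder; Ruby 1.8's Open3 returns just the three handles, while newer versions also add a wait thread):

require 'open3'

stdin, stdout, stderr = Open3.popen3("some_tool --with-options")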

However, an open3 implementation may sometimes not be powerful enough to meet all your needs (this happened during my work on cluster software for a project at work; see some ranting in my earlier post), and you'll have to code a more sophisticated version. That's why it's important to know how open3 is implemented. It has several quirks, and in the next blog post I'll explain the intrinsics of an open3 implementation.

Proceed to "How to Implement open3"

Git Cheat Sheet

Git is a very flexible and powerful tool. I already praised Git a year ago, when I was just starting to love it. Since then, our relationship has matured; although the excitement still doesn't end, doing tricky things turned into routine. In this post I list a couple of advanced tricks we've been doing with Git.

Commit modification magic

Restore commit after git reset

Assume the problem is that you've written a very detailed, elaborate and eloquent commit message, and have just lost it by forgetting to add a file name to git reset COMMIT. Instead of redoing your job, you may just:

$ git reset ORIG_HEAD

Note that ORIG_HEAD here is what you should type verbatim (it's an alias for your previous head commit ID; typing that ID directly will also work).

Restore commit after a fluke git reset --hard

When you reset your branch head with git reset --hard OLDER_COMMIT, the latest commits are no longer referenced from anywhere. They're still accessible by their commit IDs, so you can do git reset --hard THAT_COMMIT. The tricky part here is to recover the ID of the commit, which you may have forgotten.

One solution is described in the section above, and uses the ORIG_HEAD alias. But assume you've done some actions that made you lose it. Another solution is the git reflog command, which shows the log of reference changes:

$ git reflog | head -n 5
ed9d26d HEAD@{2}: HEAD^: updating HEAD
6619665 HEAD@{3}: merge cluster-quickfix: Fast-forward
ed9d26d HEAD@{4}: checkout: moving from cluster-quickfix to master
6619665 HEAD@{5}: commit: Add a hack to recover failed runs by loading files
ed9d26d HEAD@{6}: checkout: moving from master to cluster-quickfix
ed9d26d HEAD@{7}: merge clusterize-cpa: Fast-forward

This way you will see the last changes you've made to your branch head, and may track down the commit ID you need by its "commit: Commit title message" entry.

Split a commit into two

Sometimes you need to split an old (not the latest) commit into two. This is useful when you have finally made your code work, and want to save that result to avoid losing a working version if your subsequent changes break something again. You make a temporary commit, then several other commits on top of it, and then want to split that temporary bloat when merging your branch back to master. Altering history is performed via the well-known interactive rebase, git rebase -i OLD_COMMIT. Let's see how it is used to split a commit. First start an interactive rebase:

$ git rebase -i OLD_COMMIT

Then, in the editor spawned, mark a commit you want to split for edit:

pick d19e766 Ignore 'Ignore inline assembler' in CPAchecker traces
edit 501d76b Make preprocessing errors not crash all the tools
pick 9184d27 Add option tuning to CIL.
pick 71c0e10 Impose memory limit on CPAchecker as well
pick ed9d26d Add auto-reconnect option to ssh mounts

Continue the rebase by saving and exiting; it will stop at the commit you want to edit, and you'll end up in your shell. Instead of using git commit --amend, as the helper message suggests, do a soft reset:

git reset HEAD^

Now you will make the first of the commits you wanted to split yours into, by the usual git add / git commit sequence. You may re-use the message of the commit being split with git commit -c ORIG_HEAD. Then, make your second commit by add/commit; you may re-use the message the same way. After you've made both commits, you may signal git to continue the rebase you're currently in:

git rebase --continue

Removing proprietary files from history

Sometimes you need to remove all occurrences of a file from the entire history. The reason may be a copyright infringement or a need to reduce the repository size. Git's repository size increases a lot when you commit a large binary and keep updating it. When you replace this binary with the relevant source repository, the history of the binary's updates is nevertheless kept for no useful reason. That's where you need to filter the file out of the history.

The git filter-branch command is a nice tool for that. Unlike the interactive rebase, it keeps the shape of the tree, i.e. retains all the branches and merges. Besides, it works much faster than an interactive rebase. The user manual for the command already contains the shortcut to use, so you can copy-paste it from there:

$ git filter-branch --index-filter 'git rm --cached --ignore-unmatch BAD_FILE' HEAD
Rewrite 4388394c20768ae88957464486d89bc00b2a240a (315/315)

But be careful with copy-pasting! Older versions of the Git manual contain a typo, where ' is replaced with `, triggering shell substitution and leading to weird errors!

Change author name of several commits in history

Another example where git filter-branch is handy is changing the authorship of several commits in history. This is done via the --env-filter option:

$ git filter-branch -f --env-filter '
  export GIT_AUTHOR_NAME="Pavel Shved"
  export GIT_AUTHOR_EMAIL="pavel.shved@gmail.com"
  ' START..END

Using console efficiently

If you're a bit old-school, and prefer the shell to those shiny GUI tools the Git community has to offer, you will surely want your console to make you feel at home. Here's a screenshot of my console:

Here's a couple of tips, implementations of which you've just seen.

Branch name in your command prompt

An absolute must-have for every development machine is the branch name in the command prompt. It can't be achieved entirely by means of Git itself, of course. To do that, you should use the relevant capabilities of your shell, and query Git for the branch name. Here's how I do it for my Bash shell: I added the following line to my ~/.bashrc:

PS1="$PS1\$(git branch 2>/dev/null|sed -n 's/^* /, branch \\[\\e[1;34m\\]/p' | sed 's/$/\\[\\e[0m\\]/')"

The console picture above shows this prompt (among a couple of other improvements).

Console revision tree

Instead of using a GUI tool to display a tree of commits (useful to see merges) in place of a usual log, you may use Git's own capabilities. Just add something like this to your ~/.gitconfig:

[alias]
   gr = "log --all --graph --pretty=format:'%Cred%h%Creset%x09%C(yellow)\ 
        %d%Creset %s %Cgreen(%cr) %C(bold blue)<%an>%Creset' --abbrev-commit --date=relative"

Having been called as git gr from inside a repository, it will display a nice tree, such as the one you see at the console screenshot above.

Note that this format is not at all helpful for projects with a lot of developers (hence, lots of branches and merges), such as the Linux kernel. This format will take up as much screen width as there were simultaneously developed branches. However, to list just a part of this tree (such as the several latest commits), I added the following alias:

[alias]
   last = !git --no-pager gr -20

However, this, unfortunately, doesn't put a newline at the end, so the console is shifted a little.

Display file as of a specific commit

With git diff you may easily see the changes you made to a file, but sometimes it would be useful to list the file as it looked at a certain commit. This is done, surprisingly, with the git show command:

$ git show 55e6e94:shared/sh/timeout

Note that you have to specify the full path to the file from the repository root; paths relative to the current directory do not work.

Discard several unstaged changes

One of the killer features of Git is that you may choose which of your changes to commit and which to leave uncommitted. Moreover, you may control it even finer than on a per-file basis with git add -p: Git will ask you whether you want to nominate each change you made in a file for committing.

Less known is that Git may do the same for discarding some unstaged changes on a per-hunk basis. You may discard them all with git checkout FILE_NAME, so why wouldn't it work per hunk?.. Yes, it does: git checkout -p will ask you whether you want to keep or discard each hunk, and will forget those you have selected without committing anything.

Without git checkout -p you would have to add everything you do not want to discard, commit, then hard-reset, then soft-reset.

Cherry-pick several commits in a row

Sometimes you need to cherry pick several commits from another branch. The git cherry-pick command, unfortunately, doesn't support specifying several commits to pick, and it's tedious to retype the command or to look it up in the shell history for each commit.

Of course, you can check out that branch, branch off it, and perform a non-straightforward rebase --onto. But you may also utilize the power of GNU console tools, and do it much more easily.

We can use xargs, which we know is very helpful in parallelization, use the -L option, and just type the commit IDs into standard input, hitting ^D at the end. We may copy-paste the commit IDs from a window with git gr or from a log file:

$ xargs -L 1 git cherry-pick
ef35196
[master e158b0b] Fix bug when a toolbar is empty
 1 files changed, 1 insertions(+), 1 deletions(-)
f729f70
[master 2b0d860] Set the default max file size
 1 files changed, 1 insertions(+), 1 deletions(-)
3281ffe
[master 42b82fb] Refactor template view section
 1 files changed, 7 insertions(+), 0 deletions(-)
d2c304c
[master 462952b] Block software controller routes
 1 files changed, 6 insertions(+), 5 deletions(-)

Maintenance in larger projects

When your project is small, you do not have to look for complex commands to perform non-obvious operations. But in larger projects you have to search for efficient solutions. Unless you do, you will end up doing similar things manually over and over, and each developer on the team will spend some time doing things inefficiently, yielding a large net impact on productivity. So you'll have to automate what you would otherwise do manually, and here are a couple of hints on how to do it.

Integrating other git projects without all their history

You might also want to use submodules to integrate such projects, but then you would have to set up a separate repository for each of the external projects you want to integrate and modify privately. That's not a big deal if you control the repository setup (although the more repositories you have to manage, the harder it gets, as you have to assign commit rights, set up build servers, etc.), but if you use a third-party server, such as GitHub, it is not that pleasant, and involves a lot of noise and manual work.

When doing an open-source project, you sometimes have to include other project into yours (most likely, applying your own patches to them). If they have a Git repository as well, it's a well-known way to include them into your project with a technique known as "subtree merge" (it's well described here, but please, ignore what they say about submodules there).

However, the big disadvantage of this technique is that it introduces the whole history of the thing you integrate into that of your project. Commits made to that project will not differ from commits you and your team made to your project, and it will look like all these people participate in your work (which is partially correct, but not useful to anyone).

You may slightly alter the "standard" subtree merge to make it a single commit. Moreover, when you want to bump the version of that project used in yours, you will be able to do this with a single commit as well. The basic idea is to make a subtree merge, but squash the commit you're making. Here's how I integrated a project from GitHub into the shared/ruby/gems/logging subdirectory of our project:

git remote add -f logging-twp https://github.com/TwP/logging.git
git merge -s ours --no-commit --squash logging-twp/master
git read-tree --prefix=shared/ruby/gems/logging/ -i logging-twp/master
git commit -m "Merge of twp/logging commit 123abc from GitHub"
git reset --hard

Note the --squash option and the reset --hard command that differ from a usual subtree merge. The command to later merge upstream changes of that project differs from the usual subtree merge in the same way:

git pull -s subtree --squash logging-twp master

Auto-tuning of submodules in development setting

When you ship a project with submodules, the pointers to them in the main repository contain fixed URLs to download their code from. However, aside from the public repository, you also have private ones, where the bleeding-edge development happens. The problem is that all your submodules in all your commits should point to the public repository, while some fresh code is temporarily stored only in the private ones. And then you can't check that code out! Here's what you'll see if you try:

[14:41] pavel@shved:~/tmp> git clone git://PRIVATE_URL/ldv-tools.git
# ...snipped...
[14:42] pavel@shved:~/tmp> cd ldv-tools/                             
[14:42] pavel@shved:~/tmp/ldv-tools, branch master> git checkout -b model-43_1a origin/model-43_1a 
Branch model-43_1a set up to track remote branch model-43_1a from origin.                          
Switched to a new branch 'model-43_1a'                                                             
[14:42] pavel@shved:~/tmp/ldv-tools, branch model-43_1a> git submodule update --init --recursive   
Initialized empty Git repository in /home/pavel/tmp/ldv-tools/kernel-rules/.git/
remote: Counting objects: 570, done.
remote: Compressing objects: 100% (405/405), done.
remote: Total 570 (delta 286), reused 329 (delta 132)
Receiving objects: 100% (570/570), 206.99 KiB, done.
Resolving deltas: 100% (286/286), done.
fatal: reference is not a tree: 3c9d264cd42cc659544f8dce8e07714eb873efd9
Unable to checkout '3c9d264cd42cc659544f8dce8e07714eb873efd9' in submodule path 'kernel-rules'
[14:43] pavel@shved:~/tmp/ldv-tools, branch model-43_1a>

Of course, you can run git submodule init and alter the URLs of the origin remotes with git remote set-url origin git://private/submodule.git. However, this would force each team member who just wants to check out the latest revision of an unreleased branch to do some manual work, which is unacceptable.

The solution here is to rewrite these paths automatically. As submodules may include other submodules, the rewriting should happen recursively. You may make a recursive alias in ~/.gitconfig that calls itself and rewrites submodule paths. Here's what we use:

[alias]
   subinit = !"git submodule init; git config --get-regexp \"^submodule\\..*\\.url$\" |     \
             sed 's|PUBLIC_URL|PRIVATE_URL|' | xargs -L 1 --no-run-if-empty git config --replace-all; \
             git submodule update ; git submodule foreach 'git subinit'"

The respective all-caps words denote the addresses of the public and private servers. This wouldn't even be a problem if the public repository had per-branch access rights, but I don't know whether that is possible in Git.

Unsafe branch operations

Sometimes you have to perform unsafe branch operations (such as removing a branch or setting a branch to a non-fast-forward commit). Most likely, hooks would prevent you from doing that, so you'd have to run git update-ref on the server. But if your hooks aren't configured that strictly, you may do it without access to the server console. To delete a remote branch, run

git push remote :BRANCH_NAME

To force a branch to point to any commit:

git push remote +COMMIT:BRANCH_NAME

It's not that advanced, and is covered in the manual, but still, not obvious.

See the release a commit has first got into

In large projects with a lot of branches being merged in (such as the Linux Kernel), the commit date alone doesn't tell you when a particular commit was released, as it could have been merged into master months after its creation. What can you do to find out the release a commit first got into?

For that, you could use, surprisingly, git tag command:

$ git tag --contains 94735ec4044a6d318b83ad3c5794e931ed168d10 | head -n 1
v3.0-rc1

Since your releases are most likely tagged, the list of tags that contain the commit (that's what git tag --contains does) will give you the answer.

Using git-svn

The manual for git svn is elaborate, but it's missing a part that explains how to get started quickly. Here's how:

Start a Git-SVN repo as usual (note that trunk at the end of the first command is not a typo: Git automatically strips one level in this mode):

$ git svn init --stdlayout svn://host/repo/project/trunk
$ #OR --> git svn init --no-minimize-url -T svn://host/nonstandard/layout/trunk/cil
$ git svn fetch

Now, you shouldn't work directly on the branches Git-SVN has cloned from SVN. You should create "native" Git branches for development. First, check what names Git-SVN assigned to the branches, then create a Git version of one of them.

$ git branch -r
$ git checkout -b GIT_BRANCH SVN_BRANCH
$ git reset --hard GIT_BRANCH

My configuration file

Here's my ~/.gitconfig file:

[user]
        name = shvedsky
        email = pavel.shved@gmail.com

[alias]
        st = status
        co = checkout
        dc = diff --cached
        last = !git --no-pager gr -20
        pom = push origin master
        urom = pull --rebase origin master
        ufom = pull --ff-only origin master
        gr = "log --all --graph --pretty=format:'%Cred%h%Creset%x09%C(yellow)%d%Creset %s %Cgreen(%cr) %C(bold blue)<%an>%Creset' --abbrev-commit --date=relative"

# Submodule updates
[alias]
        sub = submodule update --init --recursive
        subinit = !"git submodule init; git config --get-regexp \"^submodule\\..*\\.url$\" | sed 's|PUBLIC|PRIVATE|' | xargs -L 1 --no-run-if-empty git config --replace-all; git submodule update ; git submodule foreach 'git subinit'"

[color]
        ui = auto
[color "branch"]
        current = yellow reverse
        local = yellow
        remote = green
[color "diff"]
        meta = yellow bold
        frag = magenta bold
        old = red bold
        new = green bold
[color "status"]
        added = yellow
        changed = green
        untracked = cyan

With a bit of nostalgia, I recalled the times when I had a Quake3 config with my own nice tricks...

***

I hope you'll find this cheat sheet useful. I made it to gather, in one place, all my knowledge and tricks that were scattered across several notebooks and wikis.

My "timeout" project is now on GitHub

In the course of developing our Driver Verification project we needed a reliable resource-limiting script. At first it was only supposed to limit a single process in time and memory consumption, so I downloaded a simple script I found via Google and used it in our project.

As time went by, the script grew more features. We found out that it didn't take child processes into account, and I made it do so by rewriting the script completely (in Perl instead of shell). We needed to collect CPU usage statistics as well as limit them, and my peer added that functionality to the script. We needed to detect deadlocks by observing that the controlled processes had stalled, and this was added as well.

Now the time-limiting script is a feature-rich yet simple solution for controlling the resource consumption of a black-box program. Therefore, I decided to turn this script into a separate project, because, as far as I know, there's nothing quite like it available as open source.

The project is hosted at GitHub; you may download it from there, as well as from its homepage hosted here.

The initial article that described how the script is implemented was also updated to reflect the latest findings and usage experience.

In a nutshell, this was just an announcement of my first own small open source project. I doubt it will get much traction, but let's see how it will go...

Comments imported from the old website

Ayrat (id) on 14 May 2011 commented:

Hi, Danya! http://www.smtcomp.org/2011/TreeLimitedRun.c is a similar thing written in C. It can enforce limits on wall-clock time, user CPU time, and memory; it takes child processes into account and kills them. It does have one flaw, though: sometimes it kills zombie processes launched after the script itself started.

Ayrat.

Pavel Shved on 18 May 2011 commented:

@Ayrat, yeah, it looks very similar, and is implemented just the way I described in my blog post after a large edit. No wonder: the SMT competition and Linux Driver Verification both deal with external tools that solve complex problems and tend to consume time and memory. (And, by the way, some of our tools in LDV use SMT solvers :).) Still, our solution has more features, although it should be a bit slower due to the language choice (Perl over C).

And by the way, Danya is not me, but a completely different person.

On OpenID (un)success

You've heard about OpenID, haven't you? OpenID is an open authentication framework that allows an entity (a web site) to verify that you are the user you claim to be. The authentication happens at the site of your OpenID provider, and the process is common to all the entities that want to verify your authenticity. An entity just receives an answer to the request it issues: is the person currently operating the web browser the owner of the identity they claim?

Simplification of development?

In this post I'll refer to the concept itself and to the name of the approach as "OpenID" (with some caps), and an identity URL will be named without any caps: "openid".

It seems like a neat concept to cover all kinds of authentication tasks. First, it suits private, infrastructural use as a universal login to a family of sites (like the StackExchange network, for instance). The openid is not shown to the public when used like this.

You might think that requiring all users to use OpenID aids the development of a website, since it abstracts away the authentication process. It does not. Aside from authentication you most likely need authorization, and this alone makes you implement most of the authentication infrastructure at your site.

What OpenID actually makes possible is to automatically associate your activities across several sites. When you log in to one web site, it may automatically collect your information on another site, and be sure that the account found is yours. Based on the properties of the account found, it may, for example, grant you certain privileges. For instance, you get the ability to vote on all StackExchange sites if you have large enough reputation on one of them, and this association requires nothing from the user. However, even that was problematic for the StackOverflow.com developers, who are surely among the top professionals in Web technologies.

Better user experience?

The one who benefits most from this is the user. Instead of creating a lot of accounts on different sites, they can make just one and log in everywhere, their password never being compromised thanks to the protocol design. However, everyone is used to the current situation, and users rarely realize that benefits of that sort exist... well, more about that later.

Another way to employ OpenID is to provide credible authenticity information on the content in a Web 2.0 social site. The most widespread use of it is the openid of a commentator published near their comment. Personally, I think it's largely underused (I wrote about it in my previous article about OpenID), but that doesn't matter now. It is not the problem with OpenID.

You see, OpenID has been around for quite some time. I recall LiveJournal adopting OpenID as one of the mechanisms to authenticate comments back in 2006. According to Wikipedia, the OpenID 1.0 spec was released in 2005; it still hasn't changed much, nor has it gotten much traction. What's the matter?

I guess there are two reasons. Tightly coupled, these attitudes inevitably appear wherever technical progress arrives, and poison it: conservatism and the stupidity of the crowd.

Conservatism

If you read nearly any anti-OpenID rant (like this one: "OpenID is a Nightmare"), you'll inevitably notice dependence on an external service among the primary concerns. "How come? Our business would depend on something!", they say. Well, let's look at some history.

Remember the times when there was no centralized electric energy production? People just used torches to light their homes, animal labor to plough their farms, and dug underground storages to keep food in a cold environment. That was awful. You may test how this feels if you spend a weekend without electricity in the country—if you have a supply of candles, of course.

What do the opponents of OpenID as an authentication system propose? They say that dependence on externally provided utilities weakens their systems and leads to outages. But the same is true of electricity and of transportation in cities, and people eventually learned how to deal with traffic jams, power outages, and other such problems. Without depending on various external entities we would still be reading by candlelight, and our cities would never have reached today's scale of millions of inhabitants. Wouldn't you want the same for your website?

What does it require you to do? Just accept that 0.1% of the time users would use the backup password, or use your services for free (anyway, you do not reimburse subscription costs if a user is unable to access your site because of their Internet provider, do you?). I think it will make the Internet a better place by finally decoupling different services among different vendors. Time to stop the feudal division of the Net!

Stupidity of the crowd

But the conservatism of vendors is the lesser issue here. The other essential component of successful adoption is the users. And here lies the serious problem.

Users, as a massive crowd, are just stupid. They do not get it.

What should OpenID be, for the end user? It should be an open protocol that allows you to claim authenticity of your actions performed on various websites in a universal way. Is an average "end user" able to understand what it is about—or at least just read it? Hardly.

A Web search engine? Too hard: it is a concept. Now they call it "Google" even if they actually use Bing, Baidu or Yandex. For instance, I know for sure that developers in Yandex, the largest Russian Web search engine, use "google it" in their daily conversations.

A small computer which is also a phone (a.k.a. smartphone)? No, too hard. They buy iPhones. A PC that looks like a tray? No, too hard too, I want an iPad.

I'm glad that my Nokia n900 smartphone (or, a small PC?) is hard to confuse with an iPhone, because it resembles a brick more than an electronic device.

An open universal authentication protocol? Uh, that's soooo hard! Instead, I'll log in with my Facebook or Twitter!

See what's happening here? OpenID on its own doesn't have a chance. It did not succeed and it never will, because it's a concept rather than a product. To make it succeed, popular web services should promote themselves as products—and use OpenID as a backend. This way web site developers wouldn't care whether you log in with Facebook, Twitter or any other kind of service; they'd just attach to the OpenID part of them.

On the contrary, services that call themselves "OpenID providers" (such as myopenid.com) do a lot of harm to OpenID acceptance. They sell concepts, they sell backends to people, to whom they're unsellable. Backends should be sold to programmers, products—to people.

Facebook, Livejournal, Blogger, Telnic, Linkedin — a lot of sites store identity information, but only one of those listed serves as an OpenID provider, and, by unfortunate coincidence, outside of Russia it is known mainly for hosting blogs of teenage girls.

Conclusion

"None of us is as stupid as all of us," said Joel Spolsky three years ago, but the idea had surely been there for centuries. There is too much abstraction there; OpenID is a concept, as well as a concrete protocol. The crowd doesn't want concepts, it wants products. And developers want the products to comply to concepts. Then, make all products comply to OpenID, and the Internet would be a better place. Will you?..

So, while the wide acceptance of OpenID would surely make the Internet a better place, the effort required is beyond anyone's reasonable expectation. Including mine.

Fast XML stream parsing in Ruby with LibXML

XML Stream Parsing

Two articles on XML stream parsing, the first explaining where it's needed and what options one has, and the second giving practical advice, both language-agnostic and Ruby-specific.
  • XML Stream Parsing: Overview — generic philosophy
  • Fast XML stream parsing in Ruby with LibXML — practical advice (including non Ruby-specific)

XML is a very powerful word. When you pronounce it, the effect may vary. "Yeah, so what?" "Why not json?" "Oh, another corporate dumbass", "Why not CSV?", "Why not plaintext?"

When we were to adopt XML for storing data in our project, all these questions did arise. And the reason why we used XML anyway was its indisputable interoperability. The languages we used (Java, Perl) had production-ready support for it, and, which is more important, the languages we could only think about using would have had the support as well. And it's also extensible—that's what the X in XML is for—so we wouldn't have to rewrite all our code if we decided to add one more field to the data.

We've implemented our tools to produce XML as their final report, and there was just one last step to implement: uploading the final results to the database. From my experience implementing the engine of the website you are reading, I knew that Ruby has a neat ORM, which is stable, production-ready and capable, so I used it.

The component to parse and upload the data from XML to the database was therefore implemented in Ruby. I refactored it twice before I finally managed to find the correct and fast way to do this. Here are the chronicles of writing an XML processing script in Ruby.

Hpricot — bad idea

I googled for Ruby XML libraries, quickly glanced through the comparisons, and thought that Hpricot was the fastest and the easiest to use.

Both these assumptions were wrong. It's not easy to use, and it's not very fast. It only supports DOM parsing (which is not the fastest idea possible), and its interface is a mess. It is simple, but less capable. For instance, you can't distinguish between the absence of a tag and a tag that contains empty text. And it doesn't really support all the DOM capabilities.

The last straw was a segmentation fault during processing of some of our data, and the fact that similar issues had constantly been filed without being fixed. So I decided to rewrite our program from scratch with ReXML.

ReXML — only for small XML files

ReXML, which is included in the default shipment of Ruby, was the second choice. It has a very clear, strict, and convenient interface. I rewrote the XML processing script to use ReXML, and it worked correctly.

By the way, I found a minor bug in ReXML.

But it was slow as hell. We managed to live with it until we realized that it took five days to read a 20 MB XML file, the parser ballooning to 2 GB of resident memory! The funny thing was that the very sophisticated program that generated the data did so in only two days, twice as fast as ReXML+Ruby planned to spend on merely parsing it.

All right, I started studying stream parsing, as it was the only means that could solve the problem. But ReXML's stream parsing capabilities were beyond a programmer's reasonable tolerance for bad interfaces. It only supported a SAX interface, and SAX just SUX. I would have to wrap a lot of stuff into it, and write a lot of boilerplate just to comply with this event-driven API. I decided to study more libraries, and I found LibXML.

LibXML background

Before getting down to the business of its usage, I studied some background of LibXML. First, LibXML is a free library released under the MIT license, and you may use it in both commercial and nonprofit projects with very few restrictions, which was what we needed in our project.

LibXML (I'll refer to its second version, i.e. libxml2) is a C library. In spite of this, it's well known for its bindings with other languages, especially with those designed to trade execution speed for developer convenience, namely Perl and Python. LibXML bindings for Ruby are less known; perhaps because operations with XML are mostly under the hood of the more high-level tools and libraries used for the tasks Ruby programmers usually face.

Aside from high availability, LibXML is a very mature library. It has been developed for more than eleven years, and its developers try to keep it fast and stable, avoiding drastic moves that would sink the whole project. Just for one example, here's the result of an automated analysis of LibXML's binary backwards compatibility; you'll notice only a couple of suspicious serious breakages.

To install it, use your package manager to install the libxml2 library, and then run gem install libxml-ruby.

LibXML killer features

Let's get to business. Here's the documentation for Ruby LibXML bindings for stream parsing. There are two features to note.

Stream of tags API instead of SAX

First, it's not event-driven like SAX. If you already have a program that uses DOM-like parsing, it will be much easier to convert it to stream parsing: you will not have to revamp and refactor the layout of your program to make it callback-driven.

Generally, LibXML provides you with a stream-like interface you can peek the next tag from. Just as you read characters, strings, and numbers from a file stream, you may read tokens from an XML file. You call docstream.read to read the next element, and then you may peek at the topmost token by calling docstream.node_type and docstream.value (docstream being the instance of the XML stream reader). And most importantly, a killer feature of LibXML is that you can skip an entire tag without parsing its contents via docstream.next. The resulting program that reads XML would look something like this:
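A minimal sketch of such a reading loop, assuming libxml-ruby; the report.xml file name and the <result> and <debug-trace> tags are made up for illustration, and I assume that both read and next return a falsy value at the end of the document:

  require 'libxml'
  include LibXML

  docstream = XML::Reader.file('report.xml')

  # Read tokens one by one, just as you would read characters from a file stream.
  ok = docstream.read
  while ok
    if docstream.node_type == XML::Reader::TYPE_ELEMENT && docstream.name == 'debug-trace'
      # Skip the whole <debug-trace> subtree without parsing its contents.
      ok = docstream.next
    else
      if docstream.node_type == XML::Reader::TYPE_ELEMENT && docstream.name == 'result'
        puts 'found a <result> tag'   # handle the interesting tag here
      end
      ok = docstream.read
    end
  end
  docstream.close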

Mixed stream and DOM parsing

You might have chosen stream parsing because the larger part of your XML consists of a repeated pattern that contains most of the data. However, there might be a tiny part of your file, with meta-information of sorts, that has a very complicated structure. It would be good if you could parse this small but densely structured part with a DOM parser, and do the rest with the usual stream parsing...

The LibXML stream parser can do this with the expand method. It reads the current tag and stores it in LibXML's DOM structure, while staying ready to continue stream parsing afterwards. It doesn't load the whole stream into memory, of course, so you may use the fast methods for the larger loads and the convenient method for the smaller, densely structured parts of the stream.

Of course, you may still skip a whole tag without loading its contents into memory via the docstream.next already introduced, or read its contents into a string (see read_outer_xml) to print it somewhere else.
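For instance, a small but densely structured header can be expanded into a DOM node while the rest of the file is still streamed; a hedged sketch (the <metadata> tag name is made up, and the snippet is meant to sit inside a reading loop like the one above):

  # The reader is positioned at the opening <metadata> tag.
  if docstream.node_type == XML::Reader::TYPE_ELEMENT && docstream.name == 'metadata'
    meta = docstream.expand          # a DOM XML::Node for the <metadata> subtree only
    meta.each_element do |entry|     # walk the expanded subtree with DOM methods
      puts "#{entry.name}: #{entry.content}"
    end
    docstream.next                   # done with <metadata>; skip past its subtree
  end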

How to write a stream parsing program

It may seem hard to parse a complicated XML file with these capabilities alone. However, adherence to a few rules, a couple of patterns to structure the workflow, and some helpers that extend the C-like interface of LibXML with more Ruby-like features will help you write a fast and nice-looking program. Here they are.

Always know where you are

At each point of the program you should know where the stream is—or might be—at that moment.

Put comments that describe where the parser is now. A good place is right before you peek something from the stream. If your comment seems too big, you're probably doing something wrong.

Instead of comments, or alongside them, you should use run-time assertions. Here's how one might look:
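For instance, a tiny helper (the expect_tag! name and the <results> tag are mine, for illustration) can both document and verify the parser's position:

  # Raise if the stream is not at the opening tag we think it is at.
  def expect_tag!(docstream, tag)
    unless docstream.node_type == LibXML::XML::Reader::TYPE_ELEMENT && docstream.name == tag
      raise "Parser is at #{docstream.name.inspect} (type #{docstream.node_type}), expected <#{tag}>"
    end
  end

  # The stream should now be at the opening <results> tag:
  expect_tag!(docstream, 'results')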

Use Factories to memorize what you've read

Unlike in DOM parsing, you can't query the XML for data you've already read past. What you get in exchange for that inconvenience is finer control over what you keep in memory, which is crucial for bigger files.

What worked for me were key-value stores for the data found at the roots of subtrees, which I fed to factory objects that built their children. The factories implemented an interface along these lines:
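A sketch of what such an interface might look like (the method names follow the prose below; the class name and the bodies are placeholders):

  class SubtreeFactory
    # Re-reads the global data (e.g. foreign keys) known so far; called on all
    # currently open factories whenever new global data appear in the stream.
    def re_initialize(global_data)
      raise NotImplementedError
    end

    # If the current token belongs to this factory, advances the document stream,
    # digests the subtree, and possibly yields the objects built so far.
    def consume_tag_if_applicable(docstream)
      raise NotImplementedError
    end
  end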

In my case, these factories yielded ActiveRecord objects, and the global data were foreign keys for them. These keys were not all known at once (because the tags with the actual data were mingled with the tags carrying foreign keys), so the re_initialize function was invoked on all currently open factories whenever new global data were discovered.

The consume_tag_if_applicable procedure of a factory, in turn, advances the document stream and reads data from it. Its code may employ the same practices as the outer code that invokes the factory, and look recursively similar. If necessary, this method may yield what the factory has already "digested".

You may check the example of the code that parses XML here. It uses the pattern I've just described.

Employ Ruby capabilities to add helper methods

LibXML is, originally, a C library. That means its interface is a bit... too verbose. We can make it more Ruby-like with just a couple of "monkey patches".

Take the interface for peeking: it returns an integer code(!) for the type of the token at the top of the stream. So 1980s! We can add some helper methods by reopening the reader class at the beginning of our script:
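Here is a hedged example of such a monkey patch; the predicate names element_start?, element_end? and text_node? are mine, not necessarily the ones from the original script:

  require 'libxml'

  class LibXML::XML::Reader
    def element_start?(tag = nil)
      node_type == TYPE_ELEMENT && (tag.nil? || name == tag)
    end

    def element_end?(tag = nil)
      node_type == TYPE_END_ELEMENT && (tag.nil? || name == tag)
    end

    def text_node?
      node_type == TYPE_TEXT
    end
  end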

Helpers like these replace raw integer comparisons with neat, readable predicates.

Another improvement is automatic skipping of whitespace and comments. You see, if your goal is to parse the file and print it as is, with a couple of its parts altered, you do indeed need to read the whitespace and print it. But whitespace carries no meaning in XML files where only "leaf nodes" (those without element children) contain text. So there are a lot of cases where you'd want the reading methods to simply skip it.

The requirement for the improved implementation is simple: after you call the read method, the current token in the stream should be neither whitespace nor a comment. Here's how it might look:
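A sketch of how that might be implemented, again as a monkey patch (the read_skipping_junk name is mine):

  class LibXML::XML::Reader
    JUNK_TYPES = [TYPE_WHITESPACE, TYPE_SIGNIFICANT_WHITESPACE, TYPE_COMMENT]

    # Like read, but never leaves the stream positioned at whitespace or a comment.
    def read_skipping_junk
      loop do
        ok = read
        return ok unless ok && JUNK_TYPES.include?(node_type)
      end
    end
  end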

One more improvement is automatic reading of a leaf node's value. The thing is that in stream parsing <node/>, <node></node>, and <node>text</node> are all different sequences of tokens. I just added a procedure that returns nil, "", and "text" respectively for these cases (see the link to the example below).
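A possible implementation of such a helper (the leaf_value name is mine; it assumes the reader is positioned at the opening tag of a leaf node):

  class LibXML::XML::Reader
    # Returns nil for <node/>, "" for <node></node>, and "text" for <node>text</node>.
    def leaf_value
      return nil if empty_element?
      text = ''
      while read
        case node_type
        when TYPE_TEXT, TYPE_CDATA then text << value
        when TYPE_END_ELEMENT      then break
        end
      end
      text
    end
  end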

All modifications I used in the latest project are gathered in the example I linked above.

***

In this post I gave a brief overview of various XML parsing libraries for the Ruby language. I also explained the basic patterns and techniques that helped me write a fast XML stream parser in the project I'm involved in, and provided the source code for both the helper methods and a sample program that stream-parses XML.

My evaluation of the patterns explained here shows that they are fast enough and flexible in memory consumption. Instead of loading the whole XML into memory, you have fine-grained control over what you want to stash, and may release the acquired resources as soon as you have gathered enough data to yield the next piece of your parsing result.

I hope that this will help you if you'll have to stream-parse XML on your own. Good luck!

XML Stream Parsing: Overview

XML Stream Parsing

Two articles on XML stream parsing, the first explaining where it's needed and what options one has, and the second giving practical advice, both language-agnostic and Ruby-specific.

XML and the ways to parse it

Stream and "usual" XML parsing are different very different, both in implementation and in the requirements to the XML producers. This article explains why is that—in a language-agnostic way.

Actually, I wanted to write a post about my chronicles of parsing XML in Ruby scripts in a project at my work, but the article grew so fast that I decided to split it into two posts. This part introduces the generic aspects of XML parsing, outlines the differences between "conventional" and "stream" XML parsing, and shows a couple of examples of how both approaches look in the code that uses them. I'll use Perl as the sample language, but it could be any. In the next article I'll share some detailed tips for parsing XML as a stream in Ruby.

Conventional XML parsing

Actually, the "conventional" word is a local term coined by me as a counterpart to "stream" parsing. The "conventional" way of parsing XML is more widely known as Document Object Model convention, or DOM. The name stresses the idea that XML document is represented as a set of objects—in the OOP sense—which have methods to be called as well as properties.

"Conventional" XML parsing relies on the basic idea that the the whole XML is loaded into memory. A specific object (we would name it "element object") may be created for each particular tag. User may query this object for its attributes and contents (if it's a leaf element), as well as for children elements that satisfy specific properties (XPath expressions may select elements based on tag names, inheritance, and attributes). The XML parsing library returns results of such queries as element objects as well.

Let's start with an example. Here's a sample XML, taken from the Wikipedia article on Web Services Description Language; I saved it to my local server as it is too big to inline it.

What would working with a document like this look like? Assume (this is not what working with WSDL is really like!) that you want to get a list of all operations of an interface with a specific name. You would call something like (pseudocode!):

document being the entity object of the whole document, and receive an element object that represents the relevant tag. To get all its operations, you query the element for its children:

You get names of operations by something like operation.attribute("name"), and then look up the relevant HTTP binding for this operation by querying document again:

And you keep making such chained queries until you have learned everything you needed from the document.
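Since the original pseudocode snippets are not reproduced here, a rough Ruby sketch of such a chain of DOM queries (using REXML; the service.wsdl file name and the tag and attribute names merely follow the WSDL flavor of the sample and are illustrative):

  require 'rexml/document'

  document = REXML::Document.new(File.read('service.wsdl'))

  # Find the interface element with the name we are interested in...
  interface = REXML::XPath.first(document, "//interface[@name='StockQuoteInterface']")

  # ...enumerate its operations...
  REXML::XPath.each(interface, 'operation') do |operation|
    name = operation.attributes['name']
    # ...and look up the corresponding HTTP binding by querying the document again.
    http_binding = REXML::XPath.first(document, "//binding/operation[@ref='#{name}']")
    puts "#{name}: #{http_binding}"
  end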

What are the attributes of this workflow?

  • Your further "requests" to this XML depend on the answers you have gotten from it before.
  • The document is loaded once at the beginning of a session that itself involves plenty of other delays, so how fast it's parsed doesn't matter that much.
  • The document is produced by an entity that is not under your control. Its elements may come in any order, because the specification doesn't impose one.

Therefore, your best choice is to parse and load the document into memory, split it into separate element objects at once, and be ready to run XPath queries and get elements that satisfy them without additional overhead. That's what "conventional" XML parsing looks like.

Stream parsing

What happens at the low level when XML is parsed into a series of entity objects? A usual parser, probably a recursive one, walks through the file. As soon as it notices that a tag is closed, the top of its stack (in other words, the topmost function) has enough information to construct the entity object for the tag that has just been closed. This object is constructed and then wrapped the way complex structures in the target language are. At the end of the processing, the object that represents the root tag of the XML resides somewhere in memory and is returned by the parsing routine.

The question is, why should we wait for the whole file to parse? Why can't we just intercept tags that are interesting to us as soon as they're parsed? Why should we load the whole XML into memory?

The reasons why we sometimes can't are briefly listed in the previous section (the "attributes" of the workflow). But what if they don't apply? Then stream parsing is just what you need!

An XML stream parser is usually understood as a blocking stream that converts a stream of characters into a stream of more meaningful XML "events". A stream parser may be viewed as a proxy object between a character stream and a special XML stream. It may be queried for the next event, and returns it as soon as it has parsed the next several tokens in the raw input stream, which may be an open text file, a pipe, or data coming from a socket. The events an abstract stream parser would yield for the simplest XML I could imagine (<x><y>stuff</y></x>) are like these:

  • begin of the tag "x"
  • begin of the tag "y"
  • plain tag contents ("stuff")
  • end of the tag "y"
  • end of the tag "x"
  • end of file

This way of parsing XML is better known as SAX parsing. "SAX" is an acronym for "Simple API for XML", and it means that the API of a SAX parser is just several callbacks invoked on the simple events that occur during XML parsing (the ones listed above), so it is "easy" to understand.

In no way is it implied that the API is "simple" to use.

The stream parser does not provide any information about the context of the place it's currently at; you have to keep track of the context on your own. It also cannot handle an XPath query: to do so, it would have to seek for the proper tag, which it may already have passed, and there's no way to rewind an abstract stream. If it's so inflexible, what's the point of stream parsing then?

It's damn fast! The sequence of said events is yielded at nearly the speed of reading characters, which approaches the speed of light compared to unhurried modern software.

As for applications of stream parsers, they're quite limited. XML is mainly for storing structured information, and if the tag layout is irregular, deeply nested, and carries more structure than content (i.e., does not contain large text blobs), a non-stream parser is your best choice. If that doesn't hold, and your XML is not so complex, just damn big, then you do not have much choice besides stream parsing. But then you should know how to use it.

Stream parsers in the wild

The sample output of a stream parser you saw earlier is not very convenient to work with. Writing a nice-looking and robust stream XML parser (i.e. the program that consumes the said events) is a task that requires discipline and diligence. In the next post I will explain in detail how such a parser is best implemented (in Ruby). Here's how a fragment of such code might look:
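The original fragment is not included here; as a stand-in, here is a hedged Ruby sketch (using libxml-ruby, the library covered in the next post) that turns the tiny <x><y>stuff</y></x> example above into the listed events:

  require 'libxml'
  include LibXML

  reader = XML::Reader.string('<x><y>stuff</y></x>')
  while reader.read
    case reader.node_type
    when XML::Reader::TYPE_ELEMENT     then puts %Q(begin of the tag "#{reader.name}")
    when XML::Reader::TYPE_TEXT        then puts %Q(plain tag contents ("#{reader.value}"))
    when XML::Reader::TYPE_END_ELEMENT then puts %Q(end of the tag "#{reader.name}")
    end
  end
  puts 'end of file'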

However, the easiest way to employ stream XML parsers is combined stream and "conventional" parsing: the stream parser seeks a tag you need, and then provides that tag to you as an entity object.

In Perl, there is a wonderful XML::Twig module, which does the thing you need. For example, to get all operations of a WSDL service description, you might use

This script will read the XML from a file named foo_file, stop at the tags that satisfy the XPath "interface/operation" (i.e. <interface><operation>...</operation></interface> tags), and invoke a subroutine for each of them. The $interface_tag is an entity object, which we query for attributes, and thus extract the name attribute.

We could also specify a very useful option, print_outside_roots => $STREAM, so that everything outside the tags we're interested in is printed to a file handle $STREAM as it's parsed. The things we do want to look at still go to the same subroutines; they may be printed as well, but they may also be altered before printing! This makes stream parsers that provide such an interface (as XML::Twig does) extremely convenient if we want to alter a couple of tags inside a large XML file.

Parsing portions of a big file in a "conventional" way is, in my opinion, a must-have feature for a stream XML parser. In Perl, the already mentioned XML::Twig module is one of those, and an overview of Ruby modules is coming soon on my blog. Unfortunately, not all XML parsers support such a hybrid model.

Key differences between stream and conventional XML parsing

Stream parsing is, of course, faster, but the performance gain does not necessarily cover the expense of writing a more sophisticated program. What are the requirements for an XML file to be worth parsing as a stream?

  • XML should be large. Most likely, parsing a small file with a sophisticated program is just not worth it.
  • XML should be homogeneous. It is a rare situation when you have to process a large XML file with an irregular pattern, but irregularity is going to cost you a lot, either in efficiency or in the complexity of the code you'll have to write.
  • XML should be topologically sorted (or at least look like it). During stream parsing, you cannot "rewind" and re-read data you have already read past. So you either have to parse the stream twice (which is not always feasible), or store all the data you may need in the future. The latter may consume a lot of memory, which may render stream parsing not much more useful than the "conventional" kind.
  • Consequently, the layout of the XML you parse should be either specified or under your control. Stream parsing yields the best combination of programming effort and resources consumed at runtime when your data are laid out in a specific order.

    You read XML to "process" its tags somehow, so the number of tags that can't be completely processed at the point of reading them should be minimized. For instance, meta-data and style data (the smaller portion) in HTML are laid out before the page body (the larger portion). If a stream parser were used for rendering web pages, it could store the style information in memory and process body tags at once, without storing them. If it were the other way around, it would have to store a lot of body tags until it encountered the styling information, which would be much less efficient in memory consumption.

    The way HTML is laid out is governed by specifications, so we do not need control over the producer of the XML; otherwise, we would.

If all these properties hold, congratulations, you will be able to parse XML in a fast way!

Conclusion

In this post I've outlined how high-level XML processing is done in the "conventional" (DOM) way and in the "stream" (SAX-like) style. I analyzed the differences in philosophy between these approaches and in the requirements they put on the XML files to be parsed with each of them. I was not too specific, and only briefly touched on how the use of stream parsers looks in code.

And in the next article, I'll share my experience with stream XML parsing in Ruby, and provide generic advice on how to write a concise, nice-looking program that parses XML as a stream.