Shown here are only posts related to concurrency. You can view all posts here.

Relativity of Simultaneity in Distributed Computing

About a year ago, I described an allusion to physical phenomena in computing in "Caching and Levers". This post is devoted to a more complex theory, namely the Special Theory of Relativity (STR), and mostly to one of its implications, the "relativity of simultaneity".

Relativity of Simultaneity

Special Theory of Relativity may be approached as a set of equations for coordinate translation that take their effect as the speeds of objects approach the speed of light, c.

STR is based on two postulates, one of which states that light is always propagated in empty space with a constant velocity c regardless of the speed of the emitting body. This may be used to introduce a tricky paradox.

Train car experiment

As seen by a passenger of the train. The events happen simultaneously.

As seen by a spectator on the station. The "back" event happens earlier than the "front" event.

(pictures from Wikipedia)

Assume that a train car is passing you when you're standing on a station platform. The train car has a lamp installed exactly at its middle. When the train passes you, a lamp flashes, and then you watch when the light reaches the front and the back of the car. It's interesting that those who sit in the car will notice the light hitting the two walls simultaneously, while those who stand at a platform will notice the back wall lit earlier than the front one. This is implied by the postulate described above: as seen by a spectator on the platform, the light propagates with equal speed at all directions, so it will hit the rear wall before the front one, as the former moves toward the light source, while the latter reproaches it. See wikipedia for more info.

This paradox, known as relativity of simultaneity may be formulated more generically: whether several events are simultaneous, depends on the location one watches them.

But what does it have to do with the computing? Before I answer this, I'd like to share what I've learned from a CACM magazine.

Quiescent consistency

In the multicore age, the classical data structures we all know about are about to change. The reason for that is the increasing demand on computation speed and volume that careful "sequential" CPU chip engineering is no longer capable to provide.

This challenge makes us rethink our approach to algorithm and data structure design. If we want data strctures to be efficient, we no longer may expect them to behave as good old ones in the sequential edge.

In distributed programming, there is an approach to describe data structure behavior known as quiescent consistency. There is a number of consistency conditions, sequential consistency, linearizability and others. These conditions describe how an object behaves when there are several threads calling its methods.

A data structure possesses quiescent consistency if it is consistent between its states of quiescence, i.e. when there are no methods currently in progress. As soon as a quiescently consistent structure has no operations pending (i.e. reaches the quiescence), we may be sure that the executions of methods before this state and after this state is never interpositioned.

Imagine a quiescently consistent stack. An implementation of it is described in this CACM paper "Data Structures in the Multicore Age", the one where I first encountered the quiescent consistency concept. Assume the following sequence of events happen to the q.c. stack:

  1. Three pushes x,y,z
  2. Quiescence (the pushes are processed)
  3. Three more pushes a,b,c
  4. Quiescence (the pushes are processed)
  5. Three pops
  6. Quiescence (the pushes are processed)
  7. Three more pops
  8. Quiescence

Note that q.c. doesn't mean that a data structure guarantees nothing except for this specific consistency. Consistency condition only maps data structure behavior in a concurrent setting to a behavior in a single-threaded environment, i.e. it only limits the number of a multitude of different sequences of method calls that may happen for a specific set of multithreaded events. All the properties a data structure exhibit in this sequential processing should still apply.

Quiescent consistency guarantees that the first three pops return x, y, and z (in an arbitrary order), and the next three pops return a, b, and c, somehow intermixed as well.

Note that, unlike linearizability, quiescent consistency doesn't impose any ordering on results of pops if there was no quiescence between the initial pushes. For instance, if processing of the first and the third pushes do not overlap, the linearizability ensures the pops will respect this order, while q.c. (in case that the second push overlaps with both of them) doesn't ensure that.

Having noted that, q.c. looks like a very weak and useless property. However, it still implies correctness, which, however, is enough in many circumstances. Imagine that you need a pool of unique numbers. It is enough to use a q.c. counter; the numbers it returns will be unique, which should fulfill our initial purpose.

Do we really need stronger consistency?

However, the reasons why we may be satisfied with weaker consistency conditions are not constrained with the specific examples. Let's try to prove the opposite. Assume we're implementing a stack that is accessed from a number of threads. Quiescent consistency may be not enough because if a push of A precedes the push of B, then the pops should be ordered in the specific way, and quiescent consistency may not guarantee that.

But wait... We say, "If one push precedes the other," but do we really know what "precedes" mean? If two threads in our distributed computational system invoke push call independently, how can we be sure that one of them precedes the other? And is there any sense in defining a measure for that?

Let's reckon the paradox we described at the beginning. In one reference frame, one event precedes the other, and in a different reference frame it's vice versa. So whether one push happens before the other, depends on the place we look from, and is not—and may not be—determined by a central, ubiquitous authority. That's a law of the universe, and we can't change this in computing.

Surprisingly, it makes quiescent consistency be a fairly appropriate constraint for a distributed data structure. If there's no strict sense of precedence between events that happen in different places of our network then why can't we use this to develop a more efficient data structure? The correct answer is, "indeed, why not?", and such data structures are well-defined nowadays.

OSI model

OSI model is an abstraction that aims to make the design of distributed computation systems easier. The chain of events that happen during a process of data transmission is separated to several separated layers: each layer has its own protocol, which is independent of the actual implementation of the underlying layers. You may learn more at the wikipedia.

STR vs computing: the difference

We have successfully used a metaphor from physics in distributed computing, but there is an inherent limitation of applying the concept further. In physics, if we have a car of a specific length (as measured by its riders), we may install the light source in such a way that the events at two sides of it happen predictably simultaneous in the reference frame of the spectator at the station.

In computing, we can't do that. OSI model prevents us from predicting how fast the events happen by masking out the capabilities of a system at lower layers. In physical reality, we know that as soon as the light "hits" the back and front covers of the car, the event "happens" (or, more precisely, a piece of reflected light is launched towards the spectator). In computing, however, we don't even know how long it takes for a signal to reach another node. Moreover, the OSI model makes this impossible to predict. All the layers guarantee correctness, but do not have any timing conditions.

Conclusion

The absence of timing in the OSI model suggests that may be no sense in saying, "this structure may not be used, as it processed the request in the different order than they're issued in". The "order they're issued in" is inherently unpredictable in our computing models.

This inherent unpredictability of timings of processes and events, as described in OSI model, is why we really can't apply the special theory of relativity to computing. However, we still may notice that simultaneity in the world around us, as confirmed by physical experimentation, is relative. And the nature still works well!

And if the nature accepts it, why can't we learn from it, and allow our distributed systems to be less predictable, and trade this predictability for speed?..

Read on | Comments (1) | Make a comment >>


How to Use Open3 and Avoid Dead Locks

Linux pipes and open3

Three articles on how open3 is implemented in Linux, and how it is properly used

Open3 is a common name for a function that spawns a process and substitutes its standard input, output and error streams with the pipes connected to the original process. This way you may interactively feed the child with data, and read the processed data back.

The problem with open3 is that its power renders it more dangerous, and its usage—more error-prone. As the famous Spider-Man quote shares, "with great power comes great responsibility", and Linux inter-process communication is not an exception. So here comes the effective usage advice number 1.

Avoid open3 unless necessary

For the purpose of clarity, throughout the post, standard input, output, and error file handlers are called "standard filehandlers"; the process that spawns, and the process that is spawned are the "parent", and the "child" respectively.

The first advice about a safe open3 usage is... to avoid its usage. Languages that are popular in Linux world usually come with a handful of tools for process spawning, such as:

Shell redirection instead of pipes

Instead of using open3 in your program, you may try to connect process inputs and outputs with Bash redirection utilities, invoking the shell via standard functions to spawn processes.

For instance, you may write system("find | gzip >archive.gz") in Perl to spawn a process that archives recursive directory listing into a file, instead of passing the data through the parent process. You may use a usual redirection into just a text file as well, if that's what you really need.

However, I wouldn't use intermediate files if they're only to dump information temporarily for an immediate processing by something else, like this:

I consider it a bad stlye, let alone that it requires extra time on disk access and synchronization, while it may be totally unnecessary. The arguments to the commands passed that way will also undergo the interpretation by the shell, so you'll need to escape your arguments, a painful and an error-prone process.

If you need the intermediate result saved in the files for debugging purposes, you stil may dump it in debug mode, saving cycles when your project is compiled for production usage.

  • system — spawn the process, inheriting parent's standard file handlers;
  • fork+exec — ditto, but, possibly, with additional customizations you learned about in the previous post, for instance (such as tuning environment or playing with file handlers);
  • popen—open only one pipe to the process, either for reading or writing (note that popen also suffers from some of the typical misuses of pipes).
  • shell redirection untilities—Bash has a number of measures to take control over how to output and redirect input and output of the process. A section in one of my older posts is devoted to pipes in Bash. What could be helpful is the ability of using shell interpreter when spawning a process with such functions as system in Perl. For instance, you may write system("find | gzip >archive.gz") in Perl to spawn a process that archives recursive directory listing into a file, instead of passing the data through the parent process (see side note). You may use a usual redirection into file as well, if that's what you really need.

If one of these tools fulfills your needs, then go for it! You can replace it with open3 at any time anyway—unlike houses, software is relatively easy to rebuild.

Where open3 should be used

However, if you'll find the alternatives listed above inefficient, or not powerful enough, you may opt our for using open3. I'll try to enumerate the cases where I would certainly advise using open3.

What is select?

The select function (man page) is a common name for a function to wait on file handlers. Assume you develop a task scheduler that connects to several machines via sockets; it should wait for the first socket to have the result in to schedule the next task to that specific machine (since others are obviously busy). How do you accomplish that?

Of course, there is a faster and more easy-to-use alternative than trying a non-blocking read on each socket once per several milliseconds. It is using the select() function (or one of its companions, such as poll() or kpoll) to perform a blocking wait on all of them. This function will return the filehandler list of those that have the data readily available in them—as soon as there will be at least one such descriptor!

You may find a plenty of manuals on how to use select; later in this post you'll see an example.

  1. several consumers of one source—if you are given a source of data that should be redirected to several consumers (such as some data that should be both printed onto the terminal and saved into an archived log file for history keeping reasons), you should connect a pipe to its output and error streams, and when you read some amount of data from there (for instance, when you read a line), send it to all the consumers you have. In some cases you can do it with Bash (not with sh), but the code with open3 should look much more clear.
  2. aggregation of several providers—if you maintain several data sources, which all should be aggregated into a common place (a text file, for instance) that has some concurrency issues, you might benefit from open3. You may spawn the processes, and then select() (see sidenote) their output streams, thus making writing of the data to the aggregator thread-safe.
  3. interactive data exchange with a continuously-running process—you can't avoid open3 if you spawn a process that responds to your input with some text to stdout. A lot of axillary processes that are launched thousands of times, have the interface of reading from stdin, and writing to stdout (such as SAT solvers), and you most likely don't have the time to write or read anything from the disk.

One of the examples of the application of these patterns (more specifically, a combination of the first and the second) is transparent logging—it's a blatant crime to use anything but open3 for this. By transparent logging I mean combining logs from the child processes into a common logging sink in the parent one. Usually it's done automatically: the system call just routes standard output of the child to that of parent. However, assume you spawn several concurrent processes, and unless you prefix each of them with a unique identifier, you'll get lost in the log quickly.

This may be solved by opening these processes with open3, and attaching prefixes to their output lines before printing them. Note also that this way you may control severity of logs: for instance, you might want treat standard error and output streams from the children differently, and that can not be achieved with a simple popen call.

Synchronous vs. asynchronous

In this post we will mostly study issues with a synchronous processing of data exchange with the child process. Being synchronous means that nothing else but the interaction with the process is performed while the child is alive. In other words, there is only one thread in the parent.

Some notes on asynchronous processing will be given at the end of this post. Still, I consider synchronous experience extremely valuable, as it helps to shape the vision what a good asynchronous interaction via open3 should be.

Issues with open3

So far I've been telling that open3 is not straightforward to use, but what causes thee complexity? Why could there be the "Dead Locks" referenced in the title of this post? I had to learn this in the hard way, tacking cumbersome bugs in our project, and here's what it was about.

Following the pattern "aggregation of several providers", I spawned the process that wrote to both stdout and stderr with the intent to combine them both in the archived log file. The code (simplified) looked like this:

I expected to get a file that has all the lines from the stdout of the child, and prefixed lines from the stderr of the child afterwards. However, sometimes the application just deadlocked!

A quick strace demonstrated that the parent hung up on read from child's stdout, and the child hung up... on write to its stderr! How could that be?

Remember that in the first post about pipes, I listed the limited capacity as one of the properties of the pipes. I stressed it on purpose, because it plays its role right here, when you try to use open3. The pipe that connected the parent to the child was full, and the child wanted to write an error message there. While the parent was still reading from the output pipe, because the child was still running, and its pipes were not closed! That's the open3 Dead Lock.

Of course, this could happen with any pair of pipes here. Assume the process just echoes every character it takes on the input to the output (of course, real programs will be doing a useful transformations, but for clarity we may assume it as identical). We want to feed it twice as much characters as the pipe's capacity.

You might think that this won't strike you unless you're dealing with extremely long inputs and outputs. Sorry to disappoint you, but, quoting the man 7 pipe:

In Linux versions before 2.6.11, the capacity of a pipe was the same as the system page size (e.g., 4096 bytes on i386). Since Linux 2.6.11, the pipe capacity is 65536 bytes.

It's not much, though, in certain cases, it's big enough to let badly written programs work. On a larger scale, we definitely need a generic, limit-agnostic solution.

How to prevent the dead lock

It's relatively simple to devise a generic rule of mitigating the effect of such a limitation. To avoid deadlocks with open3, you should clear each of the output pipes (and fill the input pipe) as soon as possible, and do not put a dependency between clearing a pipe and waiting for another pipe. So we need to watch closely to all the file handlers (up to 3) which we plan to read/write data from/to—and it's important that you wait for all of them at once! If you read the sidenote above, you already know that we could use select for this.

Using select to react to input promptly

So, a potential way to fix the program above is to write something like this:

This program, however, only outlines the approach to tackling deadlocks with open3. It is still prone to deadlocks, though less than the original one. Assume that the child started to print a line to its stdout, but haven't finished it, because it got a debugging interrupt. Having not finishing printing the line, it spits 100 Kb of debugging information to stderr with the intent to continue printing to stdout the normal output. However, the parent is still blocked in the out.readline() call, waiting for the line termination character to appear there, and the child gets blocked at the write to stderr, because the err pipe is full, and no one's going to remove data from it. Deadlock again. (You may play with this deadlock by open3-ing this sample program).

The issue here is that we still do not "remove data from pipes as soon as possible". For that, we need nonblocking reads, more low-level than those of the readline()- and scanf-like functions.

Using nonblocking reads and your own buffers

The problem with nonblocking low-level reads, as Capt. Obvious notes, is that they are low-level. We can't read more or less structured data from them. Assume that we want to read a number (a number of seconds to wait before launching the starship, for instance) from the child's stdout. If that debugging interrupt described above is triggered just in the middle of printing 1000, our nonblocking read will read 10 (before turning to reading from stderr), and act accordingly, launching the multi-billion-dollar ship prematurely. From the child's viewpoint, however, doing so is totally legitimate, since it printed 1000 and debugging information to the different output channels (stdout and stderr), and if the reader confuses these numbers, it's its problem.

Do not use strings as buffers (like I do here)

In the samples below we used "strings" as buffers. However, strings in the modern scripting languages (including those used here, as it's Ruby) consist of multi-byte characters with variable length (see Joel's post on Unicode), and not with one-byte symbols. On the other hand, in some less modern and "less scripting" languages, strings can not contain zero bytes, as they would be treated as the end of the string.

Therefore, "strings" are going to misbehave if chosen as a buffer for an abstract byte stream. I used them for simplicity, and for the sake of demonstration of open3 usage; in real programs, however, you should not use them.

Therefore, we need to handle these situations accordingly, adding another level of indirection between the child and the parent. We will store the data we read from pipes in the intermediate storage; we could name it a buffer. In fact, we are going to re-implement buffered read.

In the next example we will implement a linewise-open3, a subroutine that invokes user-defined callbacks at reading a complete line from stdin or stderr. You could play with it more, introducing scanf-like behavior (instead of these regexps you'll see). However, we have two more issues to discuss before getting to the code.

Reactive interaction with both input and output data

The select with callbacks works well for reading output with the input pipe closed at all. What should we do to if we want to write something to the input pipe of the child based what's being read from it?

To talk interactively with the child, you'll most likely need an asynchronous processing. However, there is one pattern which allows the exchange of data through both the input and the output in the synchronous mode. We already agreed that we will invoke user-defined callbacks after reading lines from stdin and stdout. However, we didn't use the return value of these callbacks in any way! The idea about this arises immediately:

If the user-defined callbacks to stdout and stderr lines return a non-empty string, we feed this string to stdin of the child as soon as possible. To get more control over the child, we treat NULL return value from a callback as a sign to close the standard input. We will also make our wrapper get a string as an input and feed it at the very beginning, as it is what could trigger the outputting of the data in the first place.

This still sounds not powerful enough, but you still have aplenty of asynchronous options, such as interrupting pselect with a specific signal, or setting up a separate thread for feeding data into the input pipe. We will omit these options in this blog post.

Watching for the child termination

As we're implementing a synchronous open3 wrapper, the natural assumption would be that at return from the wrapper the child should be dead and reaped. This way we can also return its exit status to the parent.

As we discussed in the previous post, process termination does not depend on the status of the pipes, so we should watch for it independently.

What we actually want is a version of select that watches for filehandlers and for the child's termination. If you know such a version of select (which I don't), make a comment, please. For now, let's search for another solution.

Such functionality could be simulated with a special wrapper thread that only waits for the child (with wait), and sends signal at its termination. This signal would interrupt select, and we would handle that.

In our project I implemented a simpler solution. It is based on the assumption that the more time the child is running, the less it's likely to terminate during the next second. So we can use timely wakeups to check for a process status (implemented as a non-null timeout to select), and increase the wait period with the course of time. Having the upper boundary for that period is a good idea as well.

Note that if a process has terminated, and the pipes are still open, we assume that the last nonblocking read will fetch us all the data we're interested in, and we may close the pipes. This may not be true, but in such specific cases you'll need specific solutions anyway.

Linewise-open3 code

Let's sum up what we're up to. Here's a listing of an open3 wrapper that prints the string supplied into stdin of the child, and then invokes one of two user-specified callbacks when a complete, \n-terminated line is read from stdin or stdout respectively. These callbacks may return more data to put into the stdin of the process. The execution terminates when the child is terminated. The wrapper returns the return code of the process (everything else is assumed to be done by callbacks).

I tested this program on random echo. You may also view the complete listing for open3 usage, and test it either with echo or with sshfs installed on your system.

Note also, that in our project we have also developed an open3 wrapper for Perl; you can view it here. It is less capable (without interactiveness), but it's live and working.

Notes on asynchronous open3 usage

In some cases you may trade the complexity for efficiency. The problem with the dead lock above was that we had a single thread designated to read from several pipes. Instead of introducing a buffering system, we may just spawn several threads, and attach each of them to its own endpoint. The burden of waking up the thread that actually has something to process doesn't vanish, it just gets imposed on the OS scheduler, which should contain fewer bugs. Or, to a language runtime scheduler (if it employs green threads, as in Ruby).

This may sound like an easier solution, especially in the multicore era, where developers cheer upon every possibility to develop a multithreaded program that makes the high-end hardware stall less. To me, it's a bit of an overkill for simple tasks (if the can be expressed in the terms of the interface discussed above). And in the context we used open3 in our project, too many competing threads had already started to become a problem.

Perhaps, I'll study the asynchronous option in other posts.

Conclusion

This post concludes what I initially planned for the series of blog posts of how to work with pipes in Linux. Another topic emerged, on the asynchronous interaction with an open3-driven process, but I don't know if I will write about it.

We have observed what are pipes, how they are used for inter-process communication, what options we have for spawning processes with pipes attached to them, and how it may be achieved in modern Linux scripting languages. However, we mainly focused on open3, studying details of its implementation, and the use-cases it's the most efficient in. We studied its quirks and how to avoid the traps set up among the joint of pipes we have to deal with.

I have learned this all spending a couple of days messing with processes that magically deadlocked without any obvious reasons, and with nontrivial multithreaded debugging. I hope these posts will help you when you will be working with pipes, so that you'll avoid my mistakes.

Read on | Comments (1) | Make a comment >>


How to Implement open3

Linux pipes and open3

Three articles on how open3 is implemented in Linux, and how it is properly used

Open3 (or popen3) is a common name for a library function to spawn a child process with its standard streams (input, output and error) both connected to newly created filehandlers in the original process. You may encounter that function in such languages as Perl, Ruby or Python.

Standard Linux C library, unfortunately, does not contain such a function. Only its less capable counterpart, popen, is a member of C library. The popen function, which is more common, opens only one pipe, either to the input or to the output of the process being spawned. This means that if you program in C you'll have to implement open3 on your own.

Another reason why you would have to re-implement open3, even if your standard (or non-standard) library provides it, is when you need more control over what happens during its call. For instance, Perl and Ruby 1.8 versions of open3 lack the ability to adjust the environment of the process being invoked (see my older rant). This functionality is why I re-implemented open3 in the project I work on.

The choice of Ruby was driven by the fact that I used it in the project I work on, and that it has a killer feature, unhandled exceptions (so that you don't have to check return code of each library call invoked, as in C). Besides, it's a lot less verbose.

The subsequent sections of this article depict sample implementation of open3 in Ruby. However, it relies on the certain system and C library calls I will also outline, so you'll be able to do the same in C or in any other language with access to such primitives.

Foundation for open3

The open3 implementation contains no magic. It relies on two powerful features of Linux system.

Forking

Good old forking comes to help in this case. The process spawning in Linux is done via fork call, cloning the current process, followed by exec call, replacing the newly spawned child with the command we wanted to invoke in the first place. It may sound unnecessarily complex, but in our case it has a feature that makes open3 as a library function possible. After fork, but prior to exec we can invoke commands in the context of the new process, while retaining the knowledge we had in the old one. How does it help us?

First, we may alter the environment of the child being invoked, which was my original motivation for reimplementing open3. Many scripting languages have a global, process-wide ENV hash with read/write access, which is used as the environment at execve system calls. If we need to alter the environment in the child process in a thread-safe way, we may modify ENV between fork and exec, so that it will be executed at the context of child process without interfering with the parent.

Second, some filehandlers (internal OS "pointers" to open files, pipes and sockets) preserve across exec! Namely, those that have the flag FD_CLOEXEC cleared. The flag is set by default (I guess, for all except standard 0, 1, and 2 descriptors), but we can change it via fcntl call.

So what we could do is to open pipes in parent process, and inculcate the knowledge about them into the child process before the target command is executed. How should we do the inculcation?

File handler renaming

The child process should spawn with the pipes instead of its STDIN, STDOUT, and STDERR. However, after forking, it already has some standard streams assigned by the OS or the language runtime. What should we do now? The answer is obvious: rename the relevant file handlers.

Closing and opening a file IO object in a high-level language (such as in C++) is not the same as "reopening" it, as the OS filehandler will not be retained. Which is extremely important to altering standard streams.

Language runtime libraries usually have a "reopen" functions (freopen in C, or IO.reopen in Ruby). The thing is that it substitutes the file pointed to by OS file handler, as well as the underlying file of the "language" file handler (this is an integer value in C, or IO object in Ruby or, say, C++).

Changing IO objects intrinsics left to the specific languages, OS filehandler swapping is performed via C library function dup2. A usual dup function just duplicates a filehandler. Called as dup2(keep,replace), it will close(replace), and re-name keep to replace. This renaming is done via fcntl system call, and should be relatively cheap. Additionally, this fcntl call will reset FD_CLOEXEC flag.

Forking and threading in POSIX is an interesting subject itself. I one had a lot of fun, reading this thread, then the FAQ of the newsgroup.

Note that fcntl and dup2 are async-signal-safe function, and therefore it can be called between the fork and exec safely in a multithreaded process.

Simple implementation

Okay, let's assemble all the tools together. Opening pipes is easy, as described before, then comes the fork, and renaming the file descriptors.

Instead of just assigning an environment variable, we may allow user to specify the actions they desire to be executed in child's context as a function pointer (or as a block, in Ruby).

Process termination control

From this simple implementation we have completely omitted an important way to control what's happening with the process. Controlling when it's terminated (or forcing it to be).

The thing is that whether the process is terminated does not depend on the status of its standard pipes. First, closing the standard input doesn't necessarily make the process terminate (as with the standard grep command). Actually, very few of them do. So if we've got all the results we wanted from our forked creature, we should have a means to kill it before it gets too late...

Second, the filehandlers we read from do not tell us whether the process is alive either. Remember that filehandlers may survive the exec if FD_CLOEXEC is not set? The process we spawned might, in turn, have given birth to a daemon, which we don't want to mess with. It could inherit the filehandlers, and just not use them. This actually happened with SSHFS mounter: the process wouldn't close the output pipes.

Therefore, pipes won't help us to control the process spawned. What could? Of course, its PID. The fork returns to us the process identifier of the child spawned, and our process, as a parent, has a full access to its status and to its return code. So, along with the pipe filehandlers, we may return child's PID to the caller for it to control the process in a more fine-grained way.

Anything else?

We've learned how to implement open3 itself. However, the bigger troubles are coming when we try to use it.

The thing is that open3 is a powerful tool, but with great power comes great responsibility. As outlined in the previous post, reads and writes to pipes may block at each endpoint, and this renders the usage of open3 error-prone. You should not use it for each process spawn. If a simpler system, fork, or popen would suffice, use them. But if it's not, you're welcome to read the next article on how to actually use open3 in an efficient and correct way.

Proceed to "How to Use Open3 and Avoid Dead Locks".

Read on | Comments (0) | Make a comment >>


Pipes in Linux and in The Real World

Linux pipes and open3

Three articles on how open3 is implemented in Linux, and how it is properly used

The importance of inter-process communication

In Linux programming the concept of a diversity of small and highly specialized tools collaborating their efforts to achieve the goal a programmer instructed them to has always been dominating. Take a look at shell scripting. Here's how one searches for a string FOOBAR in all *.txt files in their home dir:

Pipes

A lot of pipes stocked somewhere in Russia.

Pipes that transfer solid items

This photo is taken from here.These pipes are more widely known as pneumatic tube transport.

Pipes that transfer documents

This pneumatic tube seems to transfer medical documents and samples packed into capsules. I saw such system in one of a banks I had an account at; there it was used to transfer documents between counters.

It's a good illustration of sequential payload delivery by pipe if the width of the pipe is the same as the width of the objects.

You may find more sophisticated examples in my article on xargs, which is a robust spawner of such commands. It is such robustness that allowed me, for example, making such a simple "web service" that sends the information to the web triggered by just adding lines into a text file--look how concise its code is!

The usage of such tools is not constrained by making a desktop user convenient at their workstation. These tools are actively used in production scripting as well. Having a lot of ready-to-use tools at hand provides a programmer with extra opportunities for building larger programs: instead of trying to find a library or to re-implement something, one may just try to find a small tool for a specific purpose, which interacts with the outer world via command line, input and output streams.

But the combination of tools requires an efficient communication between them. Small tools can nevertheless exchange large amounts of data! Being the world largest consumer of such communication primitives, Linux uses pipes.

What are "pipes"?

Before you read this section, I strongly encourage you to visit your bathroom or kitchen and examine closely the metal or plastic extended round items that have moving water inside (or, if your country isn't rich enough for you to have them, take a closer look to the things the oil your country sells flows through). These are pipes.

Pipes that have started a war!

A week ago I watched "Fair Game" movie. Being a non-US national, I nevertheless consider the problems discussed there relevant. And one episode of a plot attracted me. One of the "casus belli" for attacking Iraq was that it purchased pipes that could be used to make nuclear weapons (in fact, they couldn't, and it all was a speculation of politicians). Above is a picture of such pipe.

See how important pipes could be? Read the rest of the article to learn more!

Pipes are used to direct a flow of something from one place to another; it could be water, oil, other liquid, or even documents, or change (see side pictures). Pipes can even start a war! And Linux pipes transfer bytes: you can write to one endpoint, and read the data written from the second endpoint, which may end up in another process!

The three properties of a pipe

The first is that a pipe can transfer payload only in one direction. You can't use a single pipe to transfer water in both directions, such that it would leak from one end and simultaneously consume water for it to leak from another. For that, you need at least two pipes.

Second, a pipe has a limited capacity. When you close your valve in the kitchen, your pipe is full of water, and no matter how the pump station tries, there will never be more water inside the pipe than there is now. (Of course, the station may try harder and make your pipe leak, and water can undergo compression under certain conditions, but it's not generic). When the station tries to pump more water, the new water is "blocked". It continues until the valve at the other end is opened, and the water is removed from the pipe for the new to come from the other end.

The third property is that a pipe transfers the payload more or less sequentially. Even transfer of liquids, which can mix easily, is somewhat sequential: when you turn on your shower after a cold night, you literally feel how the cold water is removed from the pipe before the hot water starts to erupt.

The interesting thing is that Linux pipes are designed closely after "real world pipes", the only difference being that Linux pipes transfer information, bytes.

Pipes in Linux

The main attributes of Linux pipes, one-side transfer of data, limited capacity, and sequential output, are found in the real world too, as shown above. There is, however, one difference.

The pipes in the real world are usually found in "full" state, i.e. the pipe is full and waiting for items to be removed from it. In Linux programming, however, the "empty" pipes, where it is the consumer who waits for the input, are much more widespread.

To create a pipe, you just invoke the relevant system call via a standard C library. A standard pipe(2) call returns two file handlers, one is for reading (fh_out), and another—for writing (fh_in).

The use of these file handlers usually happens in a blocking mode, such that:

  • a read from fh_out blocks until something is written to the other end of the pipe;
  • a write to fh_out returns when it has written everything into the pipe. If the pipe has no capacity to consume everything, the call is blocked until the consumer reads some data from the other end, so that more data could be written.

Of course, you can use ioctl to adjust the modes and make such calls nonblocking. Still, you can't bypass the basic restrictions: you obviously can't read what's still haven't been written, and you can't store more data in a pipe than it's capable to keeping. If you want to continue execution as the data are automatically written to an overflown pipe, you have to allocate a buffer and a concurrent thread that pushes data there.

You should never forget about these two blockages, as it will affect your everyday workflow (see, for example, this StackOverflow question about less command). Yes, sometimes it's an obstacle you have to specifically overcome (in the third article about pipes in Linux I'll address it). But in most cases such two-blockage behavior is really what you want.

Pipes as a synchronization means

Let's return to the find ... | xargs -L 100 example shown at the beginning of the article. If the find command has already found a lot of files, there's no sense for it to work further, damaging the response time (the frequency of matches found printed) of the system. With pipes, it will be seamlessly blocked by write() to a pipe, and you don't even have to write anything to support it: your simple, usual printf() will just return control only when the second party does some work!

In other words, the design of Linux pipes makes, for two processes connected with a pipe as A | B, this two-blockage system automatically "pause" the faster to let the slower a chance to accomplish its job!

So, basically, pipe is a synchronization "primitive" that pauses a program connected to one of its ends at certain operations. It's not that "primitive", actually, as it may be implemented via a semaphore, but it's simple enough to consider it as such. And—as with other synchronization mechanisms—you may deadlock when using a single pipe, let alone using multiple pipes.

Road to open3

So, let's return to using inter-process communication for more efficient programming in Linux. A classical way for a tool to work is to print to standard output, and read from standard input, putting error messages to standard error stream. How could you efficiently use such a tool in your program? The answer is: connect to it with pipes! Linux has all the capabilities to arrange the filehandlers in such a way, that the program won't even notice that it's printing to a pipe instead of a console terminal.

We used named pipes to illustrate how Bash may be used for parallelization.

Of course, you may do it upfront, without impersonating a genuine console. Linux has a capability to create named pipes with mknod call. These look like usual files, but are actually pipes. If a target program can read from it a file instead of reading from standard input (or write to a file instead), you're lucky. However, this sometimes makes the target programs unnecessarily complex—and they're already complex enough, just take a look at various echo implementations, of a program that is supposed to just print its arguments. Second, this functionality is rarely provided for standard error stream, and error log is a very important piece of information for tackling bugs. Therefore, you will have to either use shell redirection, or just to establish a direct connection from the child's streams to the parent, which is an easier solution.

As we've already learned, you'll need three of them: one for each channel among input, output and error. This has gave the name to how such a function is usually called in standard libraries of various languages, open3. It takes a command line as an input, and returns three filehandlers corresponding to the said streams of the program spawned. Here's what it looks like in Ruby:

However, open3 implementation may sometimes be not powerful enough to meet all your desires (it happened during my work on a cluster software for a project at work, see some rant in my earlier post), and you'll have to code a more sophisticated version. That's why it's important to know how open3 is implemented. It has several quirks, and in the next blog post I'll explain the intrinsics of an open3 implementation.

Proceed to "How to Implement open3"

Read on | Comments (0) | Make a comment >>


Why execve()-like system() call is a must-have

How do you usually call an external program in Linux? Whichever language you use, you may fork() your process, and call exec() afterwards; you runtime library surely supports these primitives. However, unless oyu're doing something crazy, you more likely use a system() function, that does the fork for you, and automatically waits for the process to finish.

In Linux runtime libraries for various languages usually support setting environment variables, so that all processes forked afterwards inherit the altered environment. In Ruby and Perl it's as easy as changing a global hash. Setting environment variables is a usual mean of interprocess communication in Linux.

However, an only system call to Linux Kernel on that matter is execve() (see its man page). It both sets environment variables for the program called, and calls it, replacing the current program (you should do a fork before, it's also a separate syscall). There's no syscall for system(), it's usually a library function. Anyway, higher-level libraries provide more rich functionality. They surely provide more convenient primitives of the aforementioned setting env variables, and of mere system() call. The work of execve() could be easily substituted with the combination of these.

However, I recently discovered that it's not that obiovus.

Why execve() is crucial

If you know what I usually write about, you should probably have already guessed, what's the deal. The execve()—as a single library call­—is especially useful in multithreaded programs!

Let's assume that you only use environment-setting and a mere system(). How would you call an external program then? Here's an example in Ruby (but that actually doesn't matter).

Now assume that two threads are doing this. Yes, you got what's the problem, here's a possible execution:

  1. The first thread sets environment variable to 'value'. Context switch happens.
  2. The second thread sets environment variable to 'another_value'.
  3. The second thread calls system(), and the process spawned inherits the correct 'another_value' environment variable value. Context is switched back.
  4. The first thread calls system(), and the process spawned inherits the same 'another_value' as a value of the env variable, as the other thread has changed the global environment setting. However, the programmer's intent was that it should have been called with 'value' instead.

So how could we overcome this? A naïve approach is to use a mutex to synchronize the different threads, like this:

But then the mutex won't be available until the first thread to capture it finishes execution of the whole external program. However, out intent was to invoke the programs safely, not to execute them in a mutually-exclusive manner.

How it could look like

One solution is to use raw fork() and exec() calls, and assign your environment after forking only:

The problem with open3 invocation with an altered environment is what actually made this article appear. I had to study and modify an implementation of open3 in Ruby to make this work. See my blog post on how to implement open3 for details.

It's more verbose, but it should work. But what if it's more complex? What if you're not doing a mere system(), but want to make a open3() instead? There are many ways how a child process may be used, including opening it with one or more pipes, or opening a pipeline of several processes. You have to re-implement these primitives if you need just to alter the environment, and you'll be lucky if their implementation is readily available and reusable...

How it should look like

A good implementation of a set of safe library functions resides in Ruby 1.9 runtime library. The system() call and each popen()-like call can accept a hash of environment variables the child process invoked inherits. So the calls look like this:

Unfortunately, Ruby 1.8 doesn't have this functionality, and you can't use a single call to both set the environment and spawn a child process. That pisses me off, actually, since not everyone is still migrated, and I had to reimplement open3 in Ruby to just impose environment variable assignment in a thread-safe manner.

Conclusion

I know that I should be careful when I alter global variables in multithreaded programs. However, when I access basic primitives, I tent to treat them as something special, not just like any usual data. This is not the most safe approach. Use your forks with care, especially if you're operating several at the same time, and choose safe runtime libraries to work with.

Read on | Comments (0) | Make a comment >>


Parallel merge sort

We live in a loosely connected world. There's an increasing demand on middlemen, which connect different entities together.

This seemingly unrelated introduction is important for this post in two different ways:

  • Parallelization of tasks to different processors also grows in demand, as it's like creating stuff that doesn't need middlemen;
  • If there were not for social networks and forums, you probably would never know about interesting things created at the other end of the world.

So, back to the topic. I considered merge sorting as not parallelizeable well, with the same runtime estimate. However, several months ago, having disclosed this ungrounded opinion, a dweller of a message board of my university disproved it. That post linked an article, which describes an algorithm of a parallel merge sort, although is not the original source of it. It contains a brief, practical explanation, and I liked the algorithm so much, that I wanted to retell it here.

What concurrency we're interested in

Note that I'll describe not just a multithreaded version of the algorithm, challenge of which is a clever use of locking. The problem described here is to deploy merge sorting into the system with distributed memory.

All intensive calculations in this framework should be performed locally. However, each processor has a limited amount of on-site random-access memory. So the challenge is to split the processing into small chunks, which will be sorted and merged locally (complex calculations), and will be just flushed byte-by-byte to external memory at the beginning and at the end of processing.

Why merge sorting?

The thing is that merge sorting is actually an external sort. It means that it can handle amounts of data larger than size of RAM you have the random access to. This should make us think that it's parallelizeable, but it nevertheless doesn't directly lead to an obvious way to parallelize it.

The usual merge sorting looks like this, N being equal to 2:

  1. Split the sequence to sort into N (nearly) equal chunks;
  2. Sort each chunk (recursively);
  3. Merge all chunks (at once of pair-by-pair). The challenge is to parallelize this step.

Each chunk out of these N could be processed individually and independently, and this is a huge achievement. But a challenge remains: how to parallelize the merging?

Parallelized merging: idea

So, how to parallelize merging? First, we assume that one processor can store 2t cells in its own RAM, so we're restricted to merging only not greater than 2t cells at once.

Assume we have two large sorted arrays, A and B, which, merged, or even individually, don't fit into RAM available to one processor. This means that we should parallelize it: spit the sequence into small chunks that can be merged locally.

Let's try the obvious way: divide each array into chunks of length t and try to merge them sequentially. We can merge two first chunks into the beginning of the resultant array easily. But what if the elements of the 2nd chunk should be merged inside the very same merged array?

This array of length of 2t already doesn't fit in the memory of one processor, so we cant merge it with the second chunk in the straightforward way.

So what we really want is to divide the arrays into mergable chunks. For example, what chunk in the corresponding B array should be A's sub-array (ai,aj) of length t merged with?

Obviously, the corresponding chunk (bI,bJ) should be such that all bs with index less than k are less than ai, and the similar holds for aj.

However, the size of (bI,bJ) chunk could be greater than t, so we couldn't fit it into RAM to merge with the other chunk. Here, let's highlight some of the items that split B into chunks of length of t, that are inside the segment (bI,bJ):

These new elements split B's chunk into smaller pieces, each of which is less than t. Let's find the corresponding elements in A array for them:

Now all chunks shown here have length less than t, and each corresponding pair could be merged.

But look what we've done here. The order, in which ai, aj, aQ and aP are arranged, is determined by merging them—or by merging [ai,aj] and [bq,bp] arrays. So if we merge these t-distant elements, we will determine the order of "splitters" that divide both arrays into mergable, ordered chunks.

Parallelized merging: the algorithm

The thoughts in the previous section lead to the merging algorithm. First, divide both arrays into (nearly) equal chunks of size t.

Then merge the two arrays that selected elements form. Use the same merging algorithm recursively. The result of this would be an order, in which the splitter elements should be arranged (marked with a red line):

This performed merging allows us to determine the exact chunk of length t, into which should fit each splitting element from the other array, by simple calculations. (We should "switch" in what array we search for the counterpart at the same time as the red line switches sides.) Then you should use binary search to determine the actual place of it in the chunk, this way we'll generate twice more splitters for each array:

By construction, these arrows will not cross. Then, the pairs of chunks (the chunks to merge are depicted as sections colored in an alternating manner) will lay sequentially in the merged array. Merge each pair of chunks locally, and place the result into the larger array:

Each chunk to be merged consists of two segments of length less than t, which makes it possible to merge chunks locally on one processor. This completes the algorithm.

Parallelized merge: runtime estimation

Merging two arrays of n elements requires merging two arrays of n/t elements, performing n/t binary searches in arrays of length t, and, finally, merging the chunks we divided our arrays into, which takes O(N) operations. This leads to equation:

T(n) ≥ T(n/t) + n/t*O(log t) + n

I'm too lazy to solve it, but T(n)=O(n) fits it (we consider t as a constant). This is the same estimate as in a usual merge, so parallel merge sorting is O(n⋅log n) as well.

You may note also note that parallel merge algorithm requires familiar pattern of divide-and-conquer iterations. And it's one of the neat facts about this algorithm: merging the chunks we sorted requires the very same processors that sorted the chunks!

***

This is an algorithm I was looking forward to talk about for a long time. I'm glad that I finally did this, and I hope that you didn't know it, and have spent the time learning new fun facts.

Read on | Comments (0) | Make a comment >>


Prisoners and Fairness

Fairness is rarely pronounced, yet an important concept in computing. Its applications appear in different domains. Even when its unspoken about in documentation and descriptions of systems, I sometimes notice, "hey, for this to work, fairness is required!" So, what's it all about?

Prisoners

The first thing I all of a sudden encountered fairness in is the prisoners riddle. Here's how it's usually formulated:

100 prisoners are sentenced to death. However, they're given a chance to survive. In one of the cells, there's a light bulb, which is initially off, and a switch connected to it, which can turn the bulb on and off. In an arbitrary order, which is not known to any of them, jailer lets prisoners into the cell; they can observe if the bulb is on or off, and, optionally, turn the switch.

If a prisoner, after visiting the cell, says that every prisoner visited the cell at least once, and he's right, then everybody is set free. If he's wrong, then every prisoner is executed. Of course, every prisoner can pass and say nothing; then the iterations continue.

The prisoner's can't talk to each other after this trial begins, they're kept in solitary. However, before this "trial" is started, they're given a chance to make up a plan, which they would follow. Can the prisoners make such an arrangement, that makes them sure that they are set free, no matter in which order they are allowed into the room?

There are just a couple of riddles of this sort that I don't consider stupid, and this is one of them. Take time to solve it, and you'll be satisfied with the process and the result. You can keep reading, though, there won't be spoilers.

So, does the problem look sound? Is its definition complete? It sure is not. Indeed, it's quite possible that one prisoner never enters the room at all! Since they enter then cell in an arbitrary, not a random order, it's possible. That means that it's possible that not everyone will ever enter the cell, and in no situation the statement "everyone was here" will be true. So we should conclude that no matter how prisoners decide to act, they may have never get freedom.

This makes the problem not interesting. What should we add to make it have only non-trivial answers? You already got the idea: we should enforce fairness.

Definition of fairness

Picture A

A simple state machine for the jailer. The jailer nondeterministically picks one of the prisoners, lets him into the bulb room, and checks their answer, if any.

So let's formalize the work of the jailer a bit. All prisoners reside in their separate cells, and sometimes the jailer decides to open one, and to invite its inhabitant into the bulb room. If the prisoner doesn't decide to make a final guess, then the jailer picks a cell to open again. This can be modeled as simple state machine. See picture A.

The jailer makes an arbitrary choices whom to invite next. But it wouldn't be fair if a certain prisoner is never let into the bulb room, as we've shown before? It wouldn't. And it wouldn't be more fair, if a prisoner is not let to the room after a certain moment. It's just the same as if it's never let there, because up to this moment the prisoners' plan might have not moved forward. (My argumentation is kinda weak, but I try to avoid spoilers.)

So the work of the jailer would be fair if it never happens that a prisoner is not let to the bulb cell after some moment of time. If the jailer follows this, then its work is fair. This is closely tied with a concept of fairness in computing.

In one of my previous posts, the one about the search for a declarative language, I came up with my own, generic definition of fairness. I'll quote it, slightly reworded, here:

Fairness is a property of a nondeterministic system, which repeatedly and infinitely faces a multiple choice, to not allow such behavior that a certain item is never chosen past any moment of time.

For this prisoners riddle to be interesting and well-defined, prisoners should be let into the bulb room in a fair manner.

Applications of fairness

Task scheduling

A scheduler implemented in your operating system is one of the most important parts that affect its performance. If your system supports multitasking, it should spare limited CPU time between tasks to reach some goals you set. For a desktop system you might tune your process scheduler to react fast to mouse clicks you make, while running some tasks in background (can't help quoting Linus' sarcastic letter on this matter). For a system which doesn't need small response time, but should effectively schedule long-term tasks that communicate with each other, you might need another scheduling algorithm (see my Stack Overflow question, during which I encountered such an inefficiency on a desktop-tuned system).

Usually schedulers work like this: timeline is divided into chunks, and before the next chunk begins, scheduler chooses, to which process it should assign the next chunk of CPU time. Sounds much like the work of that jailer, doesn't it?..

In scheduling, fairness has more strict definitions, and more complex metrics. They calculate and impose restrictions on fractions of CPU time allocated to tasks.

Scheduling algorithms should not adhere to communism: some tasks have more priority than others. But schedulers should be fair when choosing the next process to take CPU's core. Indeed, if they wouldn't be fair, then it could be theoretically possible that our programs just never have chance to start if there's something else happening! However, if scheduler is fair, we at least know that program will terminate (thought we still don't know when exactly).

Distributed computing

The first place I heard about fairness as of a separate concept, were studies of distributed algorithms in my university.

When several machines run in parallel, and communicate with each other, it's usually quite a complex system to study and implement. Just reckon all these race condition and concurrency bugs you faced!.. To study distributed systems we should model it in some way.

Actually, what I call "fairness" is known as "strong fairness". There also is a concept of "weak fairness". Its "weakness" is that the systems eventually should make a certain choice only if it has such option always from now on. If such an opportunity regularly disappears, the system can opt it out and be nevertheless weakly fair.

The most obvious way to do that is to pretend that

  • at each moment of time, only one machine works, while all others are paused
  • there is an independent "oracle", which decides what machine will run next, the oracle being not associated with any machine
  • the oracle can stop machine only after it has executed an atomic operation completely

Picture B

Example of model of a machine in a network, when fairness wouldn't be the best option to model machine's behavior. Fairness forces the machine to eventually end up in a "broken" state.

Sometimes it would be nice to assume that the "oracle" makes machine run in a fair manner. However, this assumption is not frequently taken. Consider that we need to model what happens when a machine in a network breaks. We can model it as having a nondetermenistic choice to either break or to keep working (see picture B). Enforcing fairness would make us think that any machine will eventually break, and this doesn't conform to what we want to model.

Nevertheless that's how I first learned about fairness. Theory of distributed computing contains a lot of other interesting stuff, thought it's nontrivial. Our lectures used slides like these, it might be interesting to read.

State machines

Fairness can also be defined in terms of state machines. You can check wikipedia article with a paragraph about that. Basically, there's nothing new compared to what I described in previous paragraphs; I even used state machines to demonstrate what I'm talking about.

Implementation of fairness

Fair choice is usually implemented as random choice. Probability distribution doesn't need to be equal (for example, in a dumb implementation of a scheduler, it may be associated with task priorities).

However, this is not completely precise. Fairness requires every choice to be made. Random choice leaves theoretical possibility for an option to be completely ignored. And if this is allowed to happen, the choice is not fair anymore. What are the odds of such behavior? Assume we have two choices, each made with probability equal to ½, and the first choice is never made. This happens with probability ½⋅½⋅½⋅½⋅..., which is equal to exactly zero.

Therefore, the "random" implementation only implements fair choice with 100% probability, but not always. With zero probability the choice would be unfair. Isn't it too much? Only requirements of your implementation would tell. After all, we all were born and grew up the way we are with exactly zero probability...

While fairness is implemented as randomness, it shouldn't be forced in specifications of a system. The Whenever language I described in one of my posts doesn't really benefit from its choice of a line to execute being random. However, fair choice would be more suitable, as some executions, with zero probability, can make a correct Whenever program useless.

***

Fairness is an important concept in our life. You should be fair to prisoners, you should be fair when you have powers and use it to allocate resources to others. In computing, it's all the same.

In many domains of computing definitions of fairness vary. In this article I described only a couple: "strong" and "weak" fairness as understood in theory of distributed computing. Instead of being "totally" fair or unfair, fairness may be a metric, and system behavior is defined in such a way, so that this metric is not less than a certain value.

The exact definition of fairness is complex and hard to coin in a general way: it becomes too vague. I made such an effort in the middle of this article, but if you have a better idea, it's welcome in comments. Nevertheless, I hope this post gave you at least a notion of what fairness is. And I also hope, that when you notice it in in a system you design, use or encounter, you'll use the correct and concise word to describe its behavior.

Read on | Comments (0) | Make a comment >>


Performance metrics and parallelization

Some day you suddenly realize that a powerful PC is not cool anymore. Everyone works in clusters today! Clusters make our programs more fast, and if you goal is to fasten quite a long computation, they're of help.

Clusters of independent "computers" (in a broad sense) can scale differently: from several thousand tiny units on your GPU to thousands of different machines. And after you discover that you have a cluster available--or just several free PCs noone uses--you also meet the problem.

Programs suffer from insufficient attention paid to design them to run their parts concurrently. Probably, it's the right way to go, since the business value of early parallelization is not apparent. So the programs grow and their workflow ossifies. And when you finally find a parallelized medium to deploy the program onto, it's already too solid.

You have to cut your workflow to smaller pieces that may be computed independently. But where should you cut?

Performance metrics

To answer the question "how to break?" you must first answer "why?"

"To make my programs fast!" you answer, but speed may be understood differently! Here are two different "speeds", which one you pick?

The terms "throughput" and "response time" were adapted by me from different domains, and I'm not really sure that they're applicable here. If you know better terms for the concepts described, please, let me know!

  1. Given a lot of time, and having run the program a lot of times, maximize the number of fully completed tasks per time unit. I call this speed metric "throughput"
  2. Having run the program once, get final answer as fast as possible. I call this metric "response time"

Different tasks need different metrics. You should explicitly think about which one is the most suitable.

These metrics are very different, and hence they need different programs. Here's an extreme example. Assume you have N processors and your task can be divided into N independent sub-tasks of length t. Ideal distribution for response time looks like this:

The time needed to complete the task is t, and you can't make it any less.

However, here's a not any less ideal task distribution for throughput:

This distribution is not suitable for response time. Ideal distribution (shown by teal rectangle) yields response time nearly twice as fast.

However, for throughput it's not a problem, since we measure it for many runs of program. And for many runs this schedule is packed quite tight:

After we understood the difference, let's talk about response time metrics.

How to maximize response time

Each task of total length T has an unachievable ideal distribution across the N parallel machines. It was shown earlier by a teal rectangle; time it takes to get a response from such a computation is t = T/N. We can't make it in less time.

Assume that we created a task scheduling algorithm that completes in r time. The estimate of how bad our scheduling is could be estimated as (r - t)/t. We assume that our task has the first priority among others (i.e. the algorithm tries to finish it as soon as possible), and than the chunks are independent. These assumptions are not so uncommon, most systems that first compute a long list of independent tasks, and then execute it, fall into these assumptions.

Since the task is of first priority, then (r - t) < Δt, where Δt is the maximal length of the independent task (proof: if the inequality wasn't true, the last finished task should have been scheduled earlier). Here's the pic that demonstrates what we're talking about:

So, again, the badness of our algorithm is estimated as Δt/t; we're to minimize it. And there's just one straightforward way: to achieve the best response time, we should make independent tasks as small as necessary. We can stop dividing into chunks after we reach some acceptable badness (10%, for example); it's neither obligatory nor sensible to make the chunks as small as possible.

Conclusion

That's all I wanted to say. Perhaps it seems kinda obvious. But what shown here is the way to devise a good plan of action based on just setting a sane goal and doing small bit of maths (and maths do help in reasoning about real life). Actually, in the meeting I didn't draw formulæ, I just drew the pictures you see above--and we reached the agreement after that.

In further posts I'll pour upon the other performance metrics.

Read on | Comments (4) | Make a comment >>


Uploading and processing data with inotify-tools

I'm not a muscleman. In fact, I'm kind of a wimp. But given all this, my weight is 15 kilograms greater than I should have with my height. Time to lose some.

I've been logging my weight for quite a time, but I realized that logging alone doesn't help. I need other people to nudge me, so I would be more ashamed of my condition. To achieve this aim, starting today, I will publish a graph of my weight. It is rendered by gnuplot, and that page contains the script that renders it.

And, of course, since I'm a geek, the process of displaying the graph should be made as automated as possible.

Nearly every morning I weigh and log the result in the text file. My aim is to automatically upload it to the server right after I edit it; of course, manual uploading is out of the question. The update happens at different times, and I don't want to poll the file every hour, since it is just inefficient, and induces an unnecessary latency. And here's the neat solution.

What is inotify

Inotify is an interface to Linux kernel; its purpose is to watch and report when certain filesystem events happen.

Just like Berkeley sockets, inotify calls create file descriptors, which can then be watched by usual select() calls.

A list of events that may be watched includes accessing file, modifying file, moving files and modifying metadata. Directories can also be recursively watched.

More info you can get in man 7 inotify and on Wikipedia.

Tracking filesystem events with inotify-tools

Starting from 2.6.13, Linux kernel has an interface that watches for filesystem events without need to wake up on timer to check them. This subsystem is referred to as "inotify". We can utilize it for our aims.

The generic idea is to run daemon that waits for the weight log file to be changed. After I click "Save" in the text editor, the daemon notices it and sends file to server via scp (copy over ssh). On the web server another notification daemon watches for changes in the destination file, it renders an image with a plot and copies it into the webserver place.

Of course, nowadays one doesn't need to write C programs to access inotify directly via syscalls. Instead, I installed inotify-tools package (sources), which allows to use inotify from shell; a command we will use is inotifywait.

Inotify, as noted in Kernel docs, is interfaces via establishing watches that write information about the events via file descriptors. Waiting for a filesystem event is a mere blocking read from a file descriptor, the intrinsics being handled by the Kernel. The inotifywait command just wraps this in a convenient way.

One of the ways inotifywait can be used is to "monitor" events: the inotifywait process just prints a line to STDOUT when an event we watch for occurs. So we can just pipe the output of inotifywait invocation to a simple shell script that reads the lines and does work on each read. Here's how the daemons look like:

Local PC:

Remote server:

These daemons are currently deployed on my local machine and on server.

Version without monitoring

Due to misunderstanding of how stuff works, I encountered strange denials when I used the -m option to inotifywait. That's why previously I used the following daemons:

Local PC:

Remote server:

The calls to inotifywait here contain no -m option, which means that they exit after the event watched for occurs.

There are several wait operators and subshell spawns that may seem excessive at the first glance. However, they're there to avoid losing data during the race condition.

Assume there's no waits and no subshells, just inotifywait and copy/plotting commands. Then the following scenario is possible:

  1. I enter new value and save file
  2. The file is copied to remote server
  3. Remote server notices that file's changed and starts plotting it
  4. I notice a mistake I made in data, fix it and save the log again
  5. The file is copied to remote server
  6. Remote server finished plotting previous log with mistake
  7. Remote server starts waiting for file modification
  8. Changes made in p.4 are ignored!

If the file is copied to remote server while it was plotting it (and plotting/converting takes noticeable time), the subsequent inotifywait call won't notice any changes. Because when the file was modified, inotifywait was not running!

We solve it by doing conversion work asynchronously. Then, step 7 doesn't necessarily follow step 6. Instead, it's executed (hopefully) right after step 3.

If we spawn subprocesses instead of running the whole conversion, the time when we miss modification events is dramatically decreased, and doesn't depend of length of processing.

On the other hand, we can miss modifications when we're waiting for child processes to terminate. But even if we miss modification events, when we finish waiting, we will handle the modified file anyway. So that's acceptable. In general, it's also of no harm to do a, say, sleep 10 after inotifywait: another data save is likely to happen after the first one.

Conclusion

Finally, we've created a synchronized system of two daemons for the local and remote machines, based on inotifywait program of "inotify-tools" package. And now I have a public tracker of my own weight deployed within an hour, but I still wonder, wouldn't it be better if spent this time jogging?..

Read on | Comments (0) | Make a comment >>


Easy parallelization with Bash in Linux, part 2

Easy parallelization with Bash in Linux

Two articles that show how concurrency can be utilized for bash scripting.
  • Part 1 - pipes and xargs thread pool
  • Part 2 - process substitution and flocks

In the previous article, I told a bit about xargs command and piping in bash. But it turns out that I missed a lot of important information... So, to complete the topic, I just have to post another article. The second thing I'll tell about are flocks, and the first thing I dont understand how I missed is...

Subshells and process substitution

Bash has a convenient syntax for spawning subshells implemented in bash language. They are written inline the script, unlike the processes spawned by xargs, and interaction between them is more tight.

Each subshell acts like a small shell invocation. It has its own set of file descriptors and environment variables; they are inherited from invoking shell, and can be overridden, but no change affects those of the parent shell.

Bash also has the functionality called process substitution. It spawns a subshell and is substituted in parent process with a name of a FIFO (for example "/dev/fd/63"). You must connect one of the file descriptors of the invoking shell to one end of that FIFO at once, since the FIFO will be invalidated when the subprocess ends. Here's a small example.

Assume that two kinds of lines come to the input, and you need to process them by two different processes. This can be achieved by spawning a subprocess that separates the input, and feeds it to the two other subprocesses via diffrent FIFOs. No wonder that all three processes will be run concurrently. But the cool thing is that it will take surprisingly few characters to implement it.

You'll see something like that:

pavel@lonely ~/social/blog/bash $ bash split.sh
aaayyy
bbbzzz
aaayyy
aaayyy
bbbzzz
aaayyy
bbbzzz
aaayyy
aaayyy
bbbzzz
bbbzzz
bbbzzz
The $line is unharmed : unharmed

Yes, it's true that when the shell script finishes execution. its subprocesses are still working. Unfortunately, I don't know less "dirty" way to synchronize with the subshells. The good news is that you seldom need to wait in such cases (this one is just an artificial example where stdout is in use), and that if you use process substitution for substituting input only, you won't need to wait.

Mutual exclusion via flocks

For now we've only seen pipe-like synchronization: certain processes might wait until data appears at the other end of a pipe. There also is another kind of synchronization, when a process waits until a shared lock is released--mutual exclusion. To implement it, you can use flock functionality available in Linux systems.

The flock function takes a file descriptor (of an opened file) as an input and waits until a process becomes a sole one among those which flocked it. When the file is closed, the lock is released and the next process that flocked receives control. Sounds like a "named mutex", and indeed it is one.

Here's how I used it in a job server to prevent concurrent scripts from installing to the same directory simultaneously.

If two concurrent scripts wanted to install to same folder, one of them waited for the other to perform installation and then quickly checked by succesfull make run, which compiled nothing, that the software is installed.

You can play with this one:

The output shows that "installations" of as become blocked until sleep finishes and hence the flocked file is closed:

pavel@lonely ~/social/blog/bash $ bash flock.sh
Installed b
Installed c
Installed a
Installed a
Installed a
Installed a
Installed a
Installed a

However, flocks don't block all read/write operations for the file locked. It only blocks other flocks. Yes, the file must be opened to be flocked (that's why we needed these numbered file descriptors), but don't let it confuse you--no read/write operations block over the flocked file if reader/writer just doesn't use flock. That's a named mutex, not a protection feature!

Yet another further studies

While I was writing this post, I checked manpage of bash and suddenly bumped into coproc keyword. I didn't see it before because it is said to be introduced in version 4.0 I recently upgraded to. Here's a googled article about it.

Yet another conclusion

That's most likely all I know about concurrency that's available via Bash primitives in Linux (it could be one article if I didn't forget about process substitution :-) ). I hope that it'll help you write better programs and value shell script powers a bit more.

Read on | Comments (0) | Make a comment >>


More posts about concurrency >>