Easy parallelization with Bash in Linux

Two articles that show how concurrency can be utilized for bash scripting.
  • Part 1 - pipes and xargs thread pool
  • Part 2 - process substitution and flocks

I love concurrency. The idea of several mindless zombies collaborating to provide a useful result thrills me, and I'm always so self-satisfied when I see that these are my boys.

Writing a concurrent program is considered complex. But fairly simple concurrency tasks can be accomplished in Bash, the everyday Linux shell, with just a small effort (about Bash: wiki, manual).

Basic piping

The famous pipe operator lets one program supply its output to the input of another one.

The point is that these programs run concurrently, and the reading one automatically pauses while the feeding program is still computing the next line.

Of course, there are some issues with whether such a pipeline will indeed be spread across different cores, but on modern kernels (starting somewhere between 2.6.28 and 2.6.30) you'll be okay without special measures or hacks.
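To see the two sides of a pipe working at the same time, here is a minimal sketch. The produce and consume functions are names I made up for illustration, and the fractional sleep assumes GNU sleep.

```shell
# Producer: emits one line every 0.2 s (a stand-in for a slow computation).
produce() {
    for i in 1 2 3; do
        echo "item $i"
        sleep 0.2
    done
}

# Consumer: handles each line as soon as it arrives and blocks in `read`
# while the producer is still working on the next one.
consume() {
    while IFS= read -r line; do
        echo "processed: $line"
    done
}

# Both sides of the pipe are started at once and run concurrently.
pipe_output=$(produce | consume)
echo "$pipe_output"
```

If you run produce | consume directly in a terminal, the "processed" lines should show up one by one rather than all at the end.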

Note that you can also chain several commands into one pipeline, because the pipe operator has low precedence (each command keeps its own arguments):
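A small illustration of such a chain, with input data I picked arbitrarily: four stages, each started as its own process.

```shell
# printf feeds sort, sort feeds uniq, uniq feeds tr; the arguments of every
# command stay attached to that command.
chained=$(printf '%s\n' pear apple fig apple | sort | uniq | tr a-z A-Z)
echo "$chained"
```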

Thread pool with xargs

What is xargs?

xargs (pronounced 'ex-args') is a utility that lets you pass a large, externally computed number of arguments to a command. For example,

builds up a command

and executes it. Or, it can build several commands, so

is equivalent to

Learn more from wiki or man page.
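The two behaviours described in this note can be sketched as follows. The arguments a, b, c and the "got:" prefix are only placeholders I chose for illustration.

```shell
# One command built from all the arguments: runs `echo got: a b c`.
one_call=$(printf '%s\n' a b c | xargs echo got:)
echo "$one_call"

# With -n 1, xargs builds a separate command per argument and runs them
# one after another: `echo got: a`, then `echo got: b`, then `echo got: c`.
per_arg=$(printf '%s\n' a b c | xargs -n 1 echo got:)
echo "$per_arg"
```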

There is also a nearly unknown feature of the xargs utility. It... can act as a thread pool! (Well, this would be a process pool, actually.) Given a lot of arguments, it can maintain a fixed number of processes and hand the arguments to them in order, picking the next one from its standard input whenever one of the processes finishes.

Usually, xargs maintains one process, and only when this process finishes execution will xargs run the next one (see the second example in the side note). But this can be extended from "one process" to "one of N processes" with the --max-procs=N argument. Try it:

What is seq?

seq is a utility that just prints a series of consecutive numbers. It's a really useful command, especially if you need to feed xargs with a specific number of lines and don't really care about their content (like in our examples).

The simplest usage would be

which just yields 15 numbers starting from one (not from zero!), each on a new line. I edited this post to use it instead of loops like

A more advanced example,

yields

Unfortunately, this command is not in POSIX, but it is included in LSB 4.0 and is present on all major distributions on the x86 architecture (proof link).
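A few seq invocations for reference. The numbers are my own picks for illustration and not necessarily the ones this note originally showed.

```shell
# Numbers 1..5, each on its own line.
seq 5

# Three-argument form is FIRST INCREMENT LAST: start at 10, step by 5, stop at 30.
stepped=$(seq 10 5 30)
echo "$stepped"

# The kind of hand-written loop that `seq 5` replaces.
i=1
while [ "$i" -le 5 ]; do
    echo "$i"
    i=$((i + 1))
done
```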

Without --max-procs the numbers appear in natural order, while with this option they come out a bit mixed up. That's the sign of concurrency!
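One way to reproduce this effect, as a sketch: the small per-item sleep is my addition to make the reordering visible, and it assumes GNU xargs and GNU sleep.

```shell
# Sequential: one process at a time, so the output is in natural order.
seq 1 8 | xargs -n 1 echo

# Up to 4 processes at a time. Each one sleeps 0.0, 0.1 or 0.2 seconds
# (depending on its number) before printing, so the order gets shuffled.
parallel_out=$(seq 1 8 | xargs -n 1 --max-procs=4 sh -c 'sleep "0.$(( $1 % 3 ))"; echo "$1"' _)
echo "$parallel_out"
```

The exact order of the second run can differ from run to run; only the set of numbers stays the same.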

Sample usage: test how a web server handles load

(-I NONE means that the numbers that come to our input will be substituted wherever the string NONE appears in the command. But since there's no such string, we just discard these numbers while still running the command 1000 times. It also implies -L 1, meaning that each line represents a separate argument.)
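A sketch of the idea with a harmless stand-in command, so that it runs without a server. The URL, the process count of 20 and the use of curl in the comment are my assumptions, not necessarily what the original command used.

```shell
# With a real server you might run something like (hypothetical URL):
#   seq 1000 | xargs -I NONE --max-procs=20 curl -s -o /dev/null http://localhost:8080/
# Here `echo hit` stands in for the request. The numbers from seq are dropped
# because the string NONE never occurs in the command.
hits=$(seq 100 | xargs -I NONE --max-procs=20 echo hit | grep -c hit)
echo "commands run: $hits"
```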

Arbitrary commands can be managed this way. For simplicity, let's assume that your commands do not contain quotes or special characters (see the next section for this). Suppose you have a file with simple bash commands:

Now let's execute the commands:

(Here we used the -I option as intended: to specify the placeholder that will be replaced with the line read from the input file.)

You'll see that they will run concurrently.
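Put together, the steps above might look like this. The file contents and the CMD placeholder name are made up for illustration; GNU xargs and GNU sleep are assumed.

```shell
# A throwaway file with simple commands, one per line (no quotes or
# special characters, as assumed in the text).
cmd_file=$(mktemp)
cat > "$cmd_file" <<'EOF'
sleep 0.3; echo first
sleep 0.2; echo second
sleep 0.1; echo third
EOF

# CMD is replaced with each input line; up to 3 shells run at once.
file_out=$(xargs -I CMD --max-procs=3 bash -c CMD < "$cmd_file")
echo "$file_out"
rm -f "$cmd_file"
```

Since all three commands start together, the one with the shortest sleep will most likely print first.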

Here is an example of a script you'd have to write unless you knew about this feature of xargs.

Automatically escaping incoming commands for xargs

In the example above, you had to manually escape the commands you wanted to run concurrently with xargs. That is because xargs processes quotes and backslashes in its input on its own, so a command that uses them gets mangled before it reaches the shell, and an attempt to execute it results in a syntax error or incorrect behavior:

(what we wanted was two lines without that "n" between the words!)
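The failure can be reproduced with a command like the following. This is my reconstruction of the kind of input that triggers it, not necessarily the exact line the post used.

```shell
# xargs strips the double quotes from its input, so bash receives
# `echo -e Hello\nworld`, where the unquoted backslash-n collapses to a plain "n".
mangled=$(echo 'echo -e "Hello\nworld"' | xargs -I CMD bash -c CMD)
echo "$mangled"   # with GNU xargs this prints: Hellonworld
```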

GNU parallel seems like a better alternative to xargs. It was mentioned in this reddit comment on the link to this article that someone posted to /r/compsci (why?).

Although parallel is more powerful, it is less easy to use. Besides, it is not available by default on as many distributions as xargs; on Fedora 15, mentioned in the comment, for instance, you have to install additional packages.

If your shell is Bash, you still do not need any external utilities such as GNU parallel. Instead, you may use the %q format sequence of its printf builtin, which prints an escaped version of its string argument. Our previous example would then look like
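A sketch of the escaped variant, assuming the command to run is an echo with an embedded "\n" (my choice of example). It needs Bash, because %q is a feature of Bash's printf builtin.

```shell
# %q backslash-escapes spaces, quotes and backslashes; xargs then removes
# exactly that layer of escaping, so bash -c sees the original command.
escaped=$(printf '%q' 'echo -e "Hello\nworld"')
fixed=$(printf '%s\n' "$escaped" | xargs -I CMD bash -c CMD)
echo "$fixed"   # Hello and world on two separate lines
```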

This may be used to run arbitrary commands stored in a file, one per line:

First, we convert them to their escaped versions, and then feed them to xargs. The former is done with Bash's native line-by-line reading from a file and the printf mentioned above:
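The whole thing could be wired together roughly like this. The file contents and the process count are hypothetical, and I assume the lines contain only ordinary printable characters: with control characters, %q switches to the $'...' notation, which xargs does not understand.

```shell
# A throwaway file with commands that contain quotes and backslashes.
cmd_file=$(mktemp)
cat > "$cmd_file" <<'EOF'
echo -e "Hello\nworld"
echo 'single quotes' "and double quotes"
EOF

# Escape every line with %q, then let xargs run up to 2 of them at a time.
quoted_out=$(
    while IFS= read -r line; do
        printf '%q\n' "$line"
    done < "$cmd_file" | xargs -I CMD --max-procs=2 bash -c CMD
)
echo "$quoted_out"
rm -f "$cmd_file"
```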

Concurrency when jobs depend on each other

The good ol' make tool suits this task.

When invoked as make -j 5, make works as a thread (process) pool of up to five jobs, but takes dependencies into account. For example, it won't spawn a process for a target until all of its prerequisites are completed. Here's an introductory article I quickly found on Google.
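A tiny demonstration, generated from the shell so it can be pasted as is. The target names are invented, and the rules use the one-line "target: prerequisites ; command" form to avoid tab characters.

```shell
workdir=$(mktemp -d)

# result.txt depends on a.txt and b.txt; those two are independent of each
# other, so make may build them in parallel and only then combine them.
cat > "$workdir/Makefile" <<'EOF'
result.txt: a.txt b.txt ; cat a.txt b.txt > result.txt
a.txt: ; sleep 0.2; echo A > a.txt
b.txt: ; sleep 0.2; echo B > b.txt
EOF

(cd "$workdir" && make -s -j 5)
make_result=$(cat "$workdir/result.txt")
echo "$make_result"
rm -rf "$workdir"
```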

Further studies

This post doesn't cover run-time race conditions at all (all the concurrency control here is in the number of concurrent processes or, in make's case, in the dependencies). For those, you can use another Linux utility, flock; here's a couple of examples of how. Maybe I'll write more about it.

But anyway, it's obvious that the Linux shell is quite capable when it comes to concurrency. I hope my post will make you use more of it in your daily work.
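As a taste of what flock is for, here is a rough sketch in which 20 processes increment a shared counter file. The file names are hypothetical, and it assumes the flock utility from util-linux is installed.

```shell
counter=$(mktemp)
lock="$counter.lock"
echo 0 > "$counter"
export counter lock

# Each process takes an exclusive lock on file descriptor 9 before the
# read-modify-write, so concurrent increments cannot overwrite each other.
seq 20 | xargs -I NONE --max-procs=5 sh -c '
    (
        flock 9
        n=$(cat "$counter")
        echo $((n + 1)) > "$counter"
    ) 9> "$lock"
'

final_count=$(cat "$counter")
echo "final count: $final_count"
rm -f "$counter" "$lock"
```

Without the flock line, some of the increments would likely be lost when two processes read the same old value.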


Comments
Pavel Shved on 08 January 2010 commented:

Some guy (Jordan Sissel) also wrote on the same matter here.