Easy parallelization with Bash in Linux

Two articles that show how concurrency can be utilized for bash scripting.

Part 1 - pipes and xargs thread pool
Part 2 - process substitution and flocks

In the previous article, I talked a bit about the xargs command and piping in bash. But it turns out that I missed a lot of important information... So, to complete the topic, I just have to post another article. The second thing I'll talk about is flocks, and the first thing, which I don't understand how I missed, is...

Subshells and process substitution

Bash has a convenient syntax for spawning subshells whose bodies are written in the bash language itself. They are written inline in the script, unlike the processes spawned by xargs, and the interaction between them is tighter.

Each subshell acts like a small shell invocation. It has its own set of file descriptors and environment variables; they are inherited from the invoking shell and can be overridden, but no change affects those of the parent shell.
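A minimal sketch of that isolation, assuming nothing beyond bash itself (the function and variable names are made up for the illustration). The parenthesized block is a subshell; the trailing & makes it run concurrently with the parent:

```bash
#!/usr/bin/env bash
# Illustrative names only: subshell_demo and color are not from the article.
subshell_demo() {
    local color="red"
    (
        color="blue"               # changes only the subshell's copy
        echo "inside: $color"
    ) &                            # & runs the subshell concurrently with the parent
    wait                           # block until the background subshell exits
    echo "outside: $color"         # the parent's value is untouched
}

subshell_demo
```

This prints "inside: blue" followed by "outside: red": the assignment made in the subshell never reaches the parent.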

Bash also has a feature called process substitution. It spawns a subshell, and in the parent process the whole expression is replaced with the name of a FIFO (for example, "/dev/fd/63"). You must connect one of the invoking shell's file descriptors to that end of the FIFO right away, since the name becomes invalid when the subprocess ends. Here's a small example.
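First, a tiny sketch of the substitution itself, assuming a Linux system where bash uses /dev/fd (the function name is made up). It prints the name the expression expands to, then reads from a substituted process:

```bash
#!/usr/bin/env bash
# Illustrative sketch; procsub_demo is a hypothetical name.
procsub_demo() {
    local word
    # The substitution expands to a file name that the parent can open.
    echo "name:" <(true)
    # Input substitution: the parent reads what the subshell writes.
    while IFS= read -r word; do
        echo "got $word"
    done < <(printf '%s\n' one two)
}

procsub_demo
```

The exact name on the first line depends on the system; the loop then prints "got one" and "got two".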

Assume that two kinds of lines arrive on the input, and you need to handle them with two different processes. This can be achieved by spawning a subprocess that separates the input and feeds it to the two other subprocesses via different FIFOs. No wonder all three processes run concurrently. But the cool thing is that it takes surprisingly few characters to implement.
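A minimal sketch of such a script, reconstructed from the description above rather than copied from the original: the "aaa"/"bbb" input, the "yyy"/"zzz" suffixes, the descriptor numbers 3 and 4, and the sleep used as a crude wait are all assumptions.

```bash
#!/usr/bin/env bash
# Sketch under assumptions: input lines, suffixes and fd numbers are illustrative.
split_demo() {
    local line="unharmed" i
    # Producer: emit two kinds of lines.
    for i in 1 2 3; do echo "aaa"; echo "bbb"; done |
    # Splitter: runs in a pipeline subshell and routes each kind of line
    # to its own descriptor, each backed by a process substitution.
    while IFS= read -r line; do
        case $line in
            aaa*) echo "$line" >&3 ;;
            bbb*) echo "$line" >&4 ;;
        esac
    done 3> >(sed 's/$/yyy/') 4> >(sed 's/$/zzz/')
    sleep 1    # crude wait: give the two sed processes time to finish
    # The loop ran in a subshell, so the parent's $line was never touched.
    echo "The \$line is unharmed : $line"
}

split_demo
```

The order in which the "aaayyy" and "bbbzzz" lines appear is not fixed, since the two sed processes run independently of each other.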

You'll see something like this:

pavel@lonely ~/social/blog/bash $ bash split.sh
aaayyy
bbbzzz
aaayyy
aaayyy
bbbzzz
aaayyy
bbbzzz
aaayyy
aaayyy
bbbzzz
bbbzzz
bbbzzz
The $line is unharmed : unharmed

Yes, it's true that when the shell script finishes execution, its subprocesses may still be working. Unfortunately, I don't know a less "dirty" way to synchronize with the subshells. The good news is that you seldom need to wait in such cases (this one is just an artificial example where stdout is in use), and that if you use process substitution for input only, you won't need to wait at all.
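One possible alternative to a fixed delay, offered as a sketch rather than a definitive answer: send the output of the whole block through a pipe. The reader at the other end (cat here) sees end-of-file only after every process that inherited the pipe has exited, including the process substitutions, so the pipeline itself acts as a barrier. The function name and the messages are made up for the illustration.

```bash
#!/usr/bin/env bash
# Sketch: wait_demo is a hypothetical name; the messages are illustrative.
wait_demo() {
    {
        # The substituted process is deliberately slow.
        echo "hello" > >(sleep 0.2; sed 's/^/slow: /')
        echo "main part done"
    } | cat    # cat returns only when all writers to the pipe have exited
    echo "all subshells finished"
}

wait_demo
```

This only works when the subprocesses write to the shared stdout; a subprocess that redirects its output elsewhere is not covered by the barrier.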

Mutual exclusion via flocks

So far we've only seen pipe-like synchronization: certain processes wait until data appears at the other end of a pipe. There is also another kind of synchronization, where a process waits until a lock held by another process is released--mutual exclusion. To implement it, you can use the flock functionality available in Linux systems.

The flock function takes a file descriptor (of an open file) as input and waits until the calling process is the only one among those that flocked it. When the file is closed, the lock is released and the next process waiting in flock receives control. Sounds like a "named mutex", and indeed it is one.
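A minimal sketch of the usual shell idiom, assuming the flock utility from util-linux is installed. The lock file path, the descriptor number 9 and the function name are arbitrary choices for the illustration.

```bash
#!/usr/bin/env bash
# Sketch: the lock file location and names are hypothetical.
lockfile=${TMPDIR:-/tmp}/flock-demo.$$.lock

critical() {
    (
        flock 9                 # wait until we hold the exclusive lock on fd 9
        echo "$1 enters"
        sleep 0.2
        echo "$1 leaves"
    ) 9>"$lockfile"             # fd 9 closes when the subshell exits: lock released
}

critical first &
critical second &
wait
rm -f "$lockfile"
```

Whichever job gets the lock first prints both of its lines before the other one starts; the "enters"/"leaves" pairs never interleave.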

Here's how I used it in a job server to prevent concurrent scripts from installing to the same directory simultaneously.

If two concurrent scripts wanted to install to the same folder, one of them waited for the other to finish the installation and then quickly verified, through a successful make run that compiled nothing, that the software was installed.

You can play with this one:
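A minimal sketch of such a script, reconstructed from the output shown below rather than copied from the original. It assumes the flock utility from util-linux; the "installation" is simulated with echo, and the helper names and the temporary lock directory are made up.

```bash
#!/usr/bin/env bash
# Sketch under assumptions: install_to and flock_demo are hypothetical names.
install_to() {                  # $1 = lock directory, $2 = install target
    (
        flock 9                 # one installer per target at a time
        echo "Installed $2"
    ) 9>"$1/$2.lock"
}

flock_demo() {
    local lockdir target
    lockdir=$(mktemp -d)
    # Simulate a slow installation that holds the lock for "a".
    ( flock 9; sleep 1 ) 9>"$lockdir/a.lock" &
    sleep 0.2                   # crude: let the holder take the lock first
    for target in a b c a a a a a; do
        install_to "$lockdir" "$target" &
    done
    wait
    rm -rf "$lockdir"
}

flock_demo
```

Each target gets its own lock file, so work on b and c proceeds while everything that wants a queues up behind the sleeping holder.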

The output shows that the "installations" of a stay blocked until sleep finishes and the flocked file is therefore closed:

pavel@lonely ~/social/blog/bash $ bash flock.sh
Installed b
Installed c
Installed a
Installed a
Installed a
Installed a
Installed a
Installed a

However, a flock doesn't block read/write operations on the locked file. It only blocks other flocks. Yes, the file must be opened to be flocked (that's why we needed these numbered file descriptors), but don't let that confuse you--no read/write operation blocks on the flocked file if the reader/writer simply doesn't use flock. That's a named mutex, not a protection feature!
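A small sketch of this advisory behaviour, assuming the flock utility from util-linux (the function name and the file contents are made up). While one process holds the lock, a plain write and a plain read go through, and only another flock attempt is refused:

```bash
#!/usr/bin/env bash
# Sketch: advisory_demo is a hypothetical name.
advisory_demo() {
    local file
    file=$(mktemp)
    # Holder: take the lock and keep it for a second.
    ( flock 9; sleep 1 ) 9>>"$file" &
    sleep 0.2                               # crude: let the holder acquire the lock
    echo "written while locked" >>"$file"   # plain write: not blocked by the lock
    cat "$file"                             # plain read: not blocked either
    # A process that does ask for the lock is turned away (-n means do not wait).
    flock -n "$file" true || echo "flock -n: lock is busy"
    wait
    rm -f "$file"
}

advisory_demo
```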

Yet more further studies

While I was writing this post, I checked the bash manpage and suddenly bumped into the coproc keyword. I hadn't seen it before because it is said to have been introduced in version 4.0, which I upgraded to only recently. Here's a googled article about it.

Yet another conclusion

That's most likely all I know about the concurrency that's available via Bash primitives in Linux (it could have been one article if I hadn't forgotten about process substitution :-) ). I hope that it'll help you write better programs and value the power of shell scripts a bit more.

Comments imported from the old website

Christian Ferrari (id) on 25 July 2014 commented:

Hi Pavel, it seems to me you are interested in parallelization stuff and I'm suggesting you to take a look to FLoM (http://sourceforge.net/projects/flom/): with it you could implement complex serialization/synchronization use cases using few bash scripting; you could implement they inside a single system or in a network of IP connected systems. Some relevant examples are available here: http://sourceforge.net/p/flom/wiki/FLoM%20by%20examples/ Feel free to leave your comments and/or suggestions on the project discussion forum (http://sourceforge.net/p/flom/discussion/).

Thanks in advance Regards Ch.F.
