Haskell's "that" operator

«Практика функционального программирования»

The title translates as "Practice of functional programming"; it is an introductory magazine. The magazine is free (free as in "free beer", not as in "free speech"); you can download it at fprog.ru (yes, it's in Russian). It contains a lot of interesting material and is a very fruitful source of references to [mostly English] articles and books about functional programming in general and about particular languages.

The key point is that it focuses on the practical aspects of functional programming. This makes it useful as a reference to industrial success stories, which you may need when convincing your boss to adopt such a powerful tool as functional programming.

I liked it so much that I even donated $10 to it. And I recommend this magazine without hesitation.

I recently started reading a magazine called «Практика функционального программирования» ("Practice of functional programming") to learn more about the world of functional programming languages. There I noticed a nice piece of Haskell's syntactic sugar, and that is what I'd like to tell you about.

They call it the "infix version of function application". I call it the "that" operator.

Here's how it works. In functional languages you usually write f(x,y) as f x y. Moreover, you pass a lot of functions as parameters, so f(x(y)) has to be expressed differently. That's where parentheses come into play: f (x y). Lisp, for example, requires such parentheses around every function application. Haskell, instead, provides syntactic sugar to avoid piles of parentheses in expressions like f (x (y (z (t v)))). Or, to be less abstract, in expressions like the following.
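For instance, here is a made-up example of my own (not the one from the magazine): a chain of ordinary functions where every application needs its own pair of parentheses.

-- every nested call needs its own pair of parentheses
example :: String
example = show (sum (filter even (map (* 3) [1, 2, 3, 4, 5])))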

In Haskell you can avoid these parentheses with the function application operator $. (Or is it a function rather than an operator? I'm still not quite sure.) It associates differently from ordinary function application and has the lowest precedence, and that's why it's convenient.
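Here's the example above rewritten with this operator (the same made-up expression):

-- ($) is right-associative and has the lowest precedence,
-- so each function applies to everything on its right
example :: String
example = show $ sum $ filter even $ map (* 3) [1, 2, 3, 4, 5]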

Or we can use another operator, the "dot" (.), which expresses function composition and was chosen to resemble its mathematical symbol, ∘.
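With composition, the same made-up pipeline becomes one combined function applied to the list:

-- (.) composes the functions into a single pipeline; one ($) then applies it
example :: String
example = show . sum . filter even . map (* 3) $ [1, 2, 3, 4, 5]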

Now, why do I call $ the "that" operator? Because the word "that" in English serves a similar purpose: it frees the language from unnecessary parentheses. Compare these two versions of the "This is the house that Jack built..." nursery rhyme by Mother Goose:

With "that" operator:

This is the farmer sowing his corn,
That kept the cock that crowed in the morn,
That waked the priest all shaven and shorn,
That married the man all tattered and torn,
That kissed the maiden all forlorn,
That milked the cow with the crumpled horn,
That tossed the dog,
That worried the cat,
That killed the rat,
That ate the malt
That lay in the house that Jack built. 

With parentheses:

This is the (farmer sowing his corn,
kept (the cock that crowed in the morn,
waked (the priest all shaven and shorn,
married (the man all tattered and torn,
kissed (the maiden all forlorn,
milked (the cow with the crumpled horn,
tossed (the dog,
worried (the cat,
killed (the rat,
ate (the malt
lay in (the house Jack built))))))))))). 

Parentheses

One of the most famous functional languages is LISP. Its name is jokingly said to stand for "Lots of Insignificant Stupid Parentheses". There are several LISP jokes involving parentheses; here's a good one about a Soviet spy and LISP.

The parenthesized version doesn't look too nice, and it finishes with that notorious tail of closing parentheses (see the side note). Yet natural language evolved in a way that provides a nice notation for function application. Is that what inspired Haskell's authors? Hardly. But still, I think there is a lot of similarity between these concepts.

"That" operator is a very helpful thing, and it's sad that some functional languages (namely, OCaml, which I use at work) don't have such a nice feature.

Killer features of Git

Welcome to Yet Another Post About So-Called Killer Features of Git. Of course, I'm not the first to write about them. No wonder: Git was already in good shape back in 2005 (link). However, in my opinion, some people seem to misunderstand why Git's killer features matter in everyday work. I'll share my thoughts on that in this post.

If you still don't know, Git is a fairly recent source-control system originally written by Linus Torvalds. That alone tells a lot about it; I can already see scared corporate clerks convulsively closing the browser tab. So, what can Git offer to more advanced developers?

1. Distributed all the time

Many people say that the main feature of Git is that it's a distributed source control system. That's not a killer feature; it's "just" a different workflow, a different approach, a different philosophy of version control.

And its direct consequence, the ability to work without a network connection, is not a feature either. What is a feature is that the workflow makes such behavior necessary. Let's take a closer look at the usual workflow:

git add ...
git commit
git add ...
git commit
....
git push

First, you have to manually pick which files the next commit will consist of. You do that with various forms of git add. Is it a feature or an obstacle? I think it's a feature. I have seen many times how people commit stuff that belongs only to their local machine, because Subversion just commits everything that changed when you run svn commit. With git add you have to select what you add manually. Slow? Perhaps. But this workflow requires you to treat commits as serious actions, and I'm really tired of bad commits that bring in stuff that doesn't belong to them.
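A small illustration of such selective staging (the file names are made up):

# stage only the files that belong to this change
git add src/parser.c src/parser.h
# review exactly what will go into the commit
git diff --cached
git commit -m "Fix parser buffer handling"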

Okay, so you've made several commits. Guess what you now have on your hard drive? You have a draft of the repository--your local clone--which you replicate to the main server only when you invoke git push. You don't have to--and should not--push your changes at once. Instead, you can now test each of your commits; you can check out from that folder! At work, my test system is tuned to do exactly this: check out from the local repository and run regression tests.

If you feel that your draft contains errors, Git has built-in features to bring it into good shape. You can fix the message of any of your new commits, you can change their order, and you can roll back several commits or merge a branch as if it had been merged several commits ago. Only then do you git push the changes, thus merging with the repository on the server. Of course, if you're not careful, you can shoot yourself in the foot, just like in C, to which Git is indeed compared. But still, I think it is way safer than Subversion, where each of your commits is pushed at once.
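For the record, here is roughly what that cleanup can look like (the commit count and branch name are made up):

# reword, reorder or squash the last three unpublished commits
git rebase -i HEAD~3
# fix the message of the most recent commit
git commit --amend
# only when the draft looks right, publish it
git push origin master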

2. Branching

Git can have lightweight branches! It can store hundreds of branches on the server without significant time overhead! Wow!

That's what they say. But that's not what really matters. The important thing is not that you can push branches to the server. The thing is that you don't have to, and can keep these branches local. On my machine I have about 10 branches, only one of which is synchronized with the server. I can merge these branches the way I want, and I can check out any of them to test locally. And I can do all of this without bothering our centralized repository.
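For illustration (the branch names are made up):

# a purely local branch; the server never learns about it
git checkout -b experiment
# ...hack, commit, test...
git checkout master
git merge experiment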

Another killer feature related to branches is the ability to switch branches within one directory. No wonder it's possible: the directory contains the whole repository anyway. And it's really helpful; I still remember the hell of branching in Bazaar by checking out each branch into a separate directory. It's a hell I'm never going back to. The strict, seemingly clean "one folder per branch" structure just doesn't work well.

And here's the third feature...

3. Flexibility

"Git is the C of Version Control Tools", says Eric Sink, the 5th google result for "Git killer features" request. Yes, it's true. And it's useful, because others may write front-ends for other repositories. git svn is a well-grained tool that allows you smoothly integrate with Subversion.

I once wrote an answer to the question "Why not use git?" just by observing my colleagues: they didn't know how to use it and didn't even want to try. So it went this way: I used Git, they used Subversion, and we were all happy. You won't be able to merge Subversion branches with Git's native mechanisms (SO question), and you will hit a limitation that prevents renaming files (git svn doesn't handle this case: you have to first commit the deletion of the old files, then commit the addition of the new ones). But I think working with Git is worth all these obstacles.
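A rough outline of that setup (the repository URL is a placeholder):

# clone a Subversion repository into a local Git one
git svn clone https://svn.example.com/repo/trunk project
cd project
# work with ordinary local Git commits, then
# pull the latest Subversion revisions and rebase local work on top of them
git svn rebase
# push local commits back to Subversion, one revision per commit
git svn dcommit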

Conclusion

If you like it, use it! You don't even have to persuade others to switch: just set up git svn (if it's not already included in your Git installation), and you and your team will all be happy. Git is cool; don't be a corporate clerk--try modern and flexible tools when they're available.

Comments imported from the old website

Aw, just when I was starting to like your blog you go ahead and post about git :P :D that's a shame ...

Do programmers hate their jobs?

Having been searching for a new job lately, I've read a lot of advice for job seekers and hiring managers. Perhaps the authors are just parroting each other, but here are some quotes that sound surprisingly similar:

Passion. We look for evidence that the applicant is passionate about computers and really loves programming. Typical evidence of this: <...>

  • Extra-curricular activities. People who love programming often work on their own programming projects (or contribute to an open-source project) in their spare time.

Joel Spolsky, Sorting Resumes blog post

When I'm hiring, I'm attracted to open source projects for one reason above all others: it shows a degree of passion. It shows that you're really not just a "clock out and forget about computing until you come in the next day" kind of guy.

Jon Skeet, 2nd comment to a "Coding horror" blog entry

3. Although this is not a universal truth, open source developers are very passionate about what they do. They have to be, otherwise why would they do it? If you hire an open source developer that has a passion for their work on open source projects, it might very well spill over into the work they do for you. Now I understand that many developers are passionate about their work (I've read Microserfs ;-) ), but passion in the open source community runs a bit hotter than it does in the non-open source communities.

Jack Wallen, "Five reasons why your company should hire open source developers" article

The problem with open source contributions is that they don’t necessarily tell a recruiter what they need to know about you. They indicate passion and enthusiasm for programming, which is a plus, but they don’t necessarily indicate competence <...>

James McKay, "Some thoughts on the role of open source experience in recruitment" blog post

And if that's not enough, keep in mind that most programmers who work on open source projects do so because they love writing software and they're passionate about their work.

Brian W. Fitzpatrick, The Virtual Referral: Mitigating Risk by Hiring Open Source Developers article

To push it a bit further, the general idea is that you should participate in open-source development, or you're not a real man! You're not passionate! You... may not even love programming at all, you corporate clerk?

The concept reminded me of something. Isn't it like suggesting that every worthy man must have a mistress as well as a wife? Which is, of course, bogus even if one stops at the wife part--yet a programmer is apparently supposed to cheat on their job.

So why do these seasoned managers, who surely have to be mostly serious in their hiring-related statements, claim something that looks completely out of touch? I can't question their proficiency and experience, hence the conclusion is simple.

Most programming jobs suck.

For all the widely repeated talk that programmers are lucky to have a profession that is also fun, the reality doesn't look like a fairy tale. Programming jobs suck hard and often, and you're more likely to demonstrate passion by participating in side projects than by doing well at work.

My job is not the best one, but I still want to go there each morning, and my eyes still burn, even after two years of this marriage... Of course, we've had our difficulties at times, but we have always eventually made up with each other. And the occasional urge to do something different is well satisfied by flirting with small projects, like writing this website or participating in GSoC.

***

Do not take this post too seriously. Its argumentation is superficial, and I admit I should have read, for example, Spolsky's book ("Smart and Gets Things Done") for a more thorough treatment. I just shared an interesting observation that may be wrong. Or may not. So, do most programmers hate their jobs?

Comments imported from the old website

I haven't hated my last few programming jobs at all. And I also have virtually no respect for Spolsky and his obvious opinions. The guy is a businessman; you cannot trust these people to give their real opinions, or even opinions of any particular use (in general) as they are often trying to sell something (his book, his website, etc).

I don't feel like ranting about my opinions on open source projects and developers. I think it's enough to say you should search for passionate people (but this is as obvious as the nose on my face). The ways to find them differ, but it's easy enough to spot. I don't think Open Source Commitments are the only way, or even a good way.

I think the hiring of smart people is made to seem more difficult. It's not that hard really; you just have interesting work, and pay well, with a good location, and they will come to you. People like to do interesting things that they enjoy for money; this is a fact of life. It applies to all fields, not just programming.

So lets stop listening to the people who run recruitment websites and have personal interests in such things on such matters, and engage our own brains in some logical and trivial thought on the matter.

Pavel Shved on 23 February 2010 commented:

The logic of the post may be narrowed to "If Spolsky's advice is useful, then most programmer jobs suck". If we prove that the latter is false, then we can judge whether the former is true. :-)

By the way, some of the quotes I posted (including Spolsky's) outline other ways of finding passionate people. Sometimes such articles don't mention open source at all. I was just bothered by the tendency.

Your way of hiring smart people seems good. But the thing is that if you pay well and are located in a nice place, then lots of people who can't program but just want the money will apply for your job. You have to choose, and how you make that choice is what all these hiring posts are about.

And one last thing. To locate and extract the gist of a subject, one needs a certain bit of wisdom. Once you read it, it looks trivial, but spotting it is not trivial at all.

"NP-complete!" as a lame excuse

Some time ago I bumped into a typical Stack Overflow question. A guy asked for a C# algorithm that could pick elements from an array so that their sum equals a specified number--the usual NP-complete knapsack-style problem. But the answer made me think about an interesting matter. Let me screenshot the answer completely:

At first glance, the answer contains everything an ideal answer should: correct information, a certain succinctness, and, the key to success, an XKCD comic. No wonder it was so highly upvoted.

However it was totally unhelpful.

Complexity classes

In computer science, the concept of a complexity class is used to describe a class of problems for which there exists an algorithm that runs on a specified kind of abstract computing machine and uses a specified amount of a machine-specific resource. For example, the famous class NP is defined as the set of problems that can be (a) solved on a non-deterministic Turing machine (b) in a number of steps polynomial in the length of the input. The class P is defined the same way, except that its abstract machine is a deterministic Turing machine. The famous question of whether every NP problem is also in P is still open.

There are many trickier classes (PSPACE, for example, requires polynomial "memory"--the maximal length of the Turing machine's tape), some of which are even parametrized (PCP(n,m), probabilistically checkable proofs, for example). The relationships between the various classes are actively studied and aid the relevant research; here's a picture with some of them:

complexity classes relationship diagram

(pic from «Эффективные алгоритмы и сложность вычислений» book by Nikolai Kuzyurin and Stanislav Fomin (pdf, site; published under OPL license))

The highly upvoted answer clearly shows that knowledge about complexity classes has made its way into the minds of casual programmers. But instead of being a challenge, it became a lame excuse.

What a casual programmer thinks upon encountering an NP-hard problem is, "Oh, it's NP-hard, so no known polynomial solution exists!"--and that's the end of it. Some programmers even quickly hack up an algorithm that happens to work only in specific cases, without even realizing it. What should be thought instead is, "Oh, it's NP-hard, so no known polynomial solution exists! Let me derive the constraints and search for an approximate solution that fits them best!" Complexity theory statements and theorems should aid the solution, not substitute for it!

Funny approximations for NP-hard problems

Honestly, I used to be a programmer of the first type defined above. I used to be proud that I knew that much math and was glad that I could crush people's hopes of solving their problem. However, a year ago I took an advanced complexity course (taught by people from one of the research groups at ISP RAS--the Discrete Optimization Algorithms group at the Department of Mathematical Methods (in Russian)) that revealed several amazing facts to me. Some hard NP-complete problems turn out to have good approximations! Let me list a few of them.

  • Edge-dominating set (wiki) -- assume we want to penetrate our adversary's network (which is an undirected graph). We pick several channels (edges) to insert a virus into, and the hacking software will capture all information that passes through either endpoint of a hacked channel. We should cover the whole network with such "hacked" channels, so that every unhacked channel is plugged into a node touched by a hacked one. But at the same time, we should do it at minimum cost, i.e. hack the minimum number of channels. See figure A at the side for examples.

    Figure A

    Examples of edge-dominating sets from the wiki page. Red edges are the infiltrated channels. You can see that every non-hacked edge touches at least one hacked edge at a shared router.

    The task of picking such a minimum set is NP-hard. Brute force? No, there is a better idea. We don't necessarily need the minimum (although it would be nice); instead we can hack a few extra channels--the budget is not that tight anyway. We could use some clever algorithm to pick the edges, but instead... let's just keep hacking arbitrary channels that aren't adjacent to already hacked ones!

    What amazed me is that such a "solution"... will never exceed twice the real minimum! "That's still too much", the Greedy Manager might say, and he would be right. But the thing is that we didn't even try to solve anything: the dumbest algorithm yields a good, constant-factor approximation for a problem with no known efficient exact solution. Hell, it's even easier to implement than brute force! (A small sketch in C++ follows this list.)

  • Knapsack problem (wiki) -- assume a burglar has infiltrated a house. Unfortunately, he has a relatively small knapsack, into which he can't put everything of value. Possessing a discriminating eye, he can immediately determine how much the fence will pay for a thing. His aim is to maximize the value of the stuff carried away, with the overall weight staying below what his knapsack can bear. But unfortunately he's also short on time and has to do it quickly.

    Complexity theory proves that this task is NP-complete. So the burglar might try packing every combination of items into his sack until he finds the best one. But he would get caught doing that, because the owner would return long before the enumeration completes. So he chooses another approach: a "greedy algorithm", always picking the item with the best price/weight ratio among those that still fit into his sack.

    A fool! He didn't know that it was the house of a computer scientist, who knew that thieves are dumb and greedy, and who intentionally bought the piece of jewelry with the best price/weight ratio in the house--a piece that, once taken, leaves no room in the sack for anything else. For an arbitrary N, this trick can make the burglar carry away N times less loot than he could have!

    But a simple adjustment to the algorithm limits the possible loss, guaranteeing at least 1/2 of the maximum value. The burglar should do the same as before, but at the end compare the total value of the picked best-ratio items with the value of the single most expensive item he could carry away. If the latter turns out to be more profitable, he takes that one thing and runs away.

    This is an example of how a small hack can guarantee a good approximation (see the sketch after this list).

  • Longest path problem (wiki) -- assume we want to make our competitor lose money, and thus we try to delay the shipment of a valuable package. We plant a janitor agent in the dispatcher's room, so that he can see the map of the area (an undirected graph) and use the special phone to guide the courier along the longest path between source and destination (without cycles, of course, or the courier will suspect something and report it). He must do it quickly, but, unfortunately, picking such a longest path is an NP-hard problem.

    We've seen constant-factor approximations above, but this problem, unfortunately, cannot have a polynomial approximation algorithm for any constant factor. Of course, that's true unless P = NP; but its polynomial-time inapproximability within any constant factor is proven under that assumption. So we'd have to teach the janitor to settle for much worse approximations if we want him to succeed--and conclude that some NP-hard problems just do not have simple hacks that solve them well.
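To make the constant-factor tricks above concrete, here is a small illustrative sketch in C++ (my own toy data and function names, not code from the post or the course): hack_channels greedily builds a maximal matching, which never exceeds twice the minimum edge-dominating set, and rob_house is the greedy knapsack with the "compare with the single most expensive item" hack, which guarantees at least half of the optimum.

// Illustrative sketches of the two approximations discussed above
// (made-up data; not code from the post or the course).
#include <algorithm>
#include <iostream>
#include <utility>
#include <vector>

// "Hack arbitrary channels not adjacent to already hacked ones":
// a maximal matching, at most twice the minimum edge-dominating set.
std::vector<std::pair<int, int>>
hack_channels(int n, const std::vector<std::pair<int, int>>& edges) {
    std::vector<bool> touched(n, false);
    std::vector<std::pair<int, int>> hacked;
    for (auto e : edges)
        if (!touched[e.first] && !touched[e.second]) {
            hacked.push_back(e);
            touched[e.first] = touched[e.second] = true;
        }
    return hacked;
}

struct Item { int value, weight; };

// Greedy by price/weight ratio, then compare with the single most
// expensive item that fits: guaranteed at least 1/2 of the optimum.
int rob_house(std::vector<Item> items, int capacity) {
    std::sort(items.begin(), items.end(), [](const Item& a, const Item& b) {
        return a.value * b.weight > b.value * a.weight;  // higher value/weight first
    });
    int greedy = 0, load = 0, best_single = 0;
    for (const Item& it : items) {
        if (load + it.weight <= capacity) { load += it.weight; greedy += it.value; }
        if (it.weight <= capacity) best_single = std::max(best_single, it.value);
    }
    return std::max(greedy, best_single);
}

int main() {
    std::vector<std::pair<int, int>> net = {{0,1},{1,2},{2,3},{3,4},{4,5},{1,4}};
    std::cout << "channels to hack: " << hack_channels(6, net).size() << "\n";

    std::vector<Item> loot = {{60,10},{100,20},{120,30},{90,45}};
    std::cout << "loot value: " << rob_house(loot, 50) << "\n";
}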

There are more facts concerning randomized algorithms, which hit a good solution with a certain probability, and facts about trickier, non-constant-factor approximations, but they're far less fun and more like boring math crunching, so I won't mention them here.

Conclusion

Although P = NP is still an open question, complexity theory contains many valuable results for solving everyday problems.

Do not make complexity theory just a source of lame excuses for not solving the problem, or for writing brute-force algorithms and garage-made heuristics that guarantee nothing. Some programmers call all this "clever cheats", but many hand-made "cheats" are of no help as approximations (as shown above with the first, most obvious greedy algorithm for the burglar). Instead of cheating, do not hesitate to do a quick googling or consult experts in the area when you encounter an NP-complete or NP-hard problem--a fast solution with proven and limited weaknesses, suitable for your specific task, may exist and may have been studied well.

Or, if you're curious enough and have a certain bit of math education, you can study it on your own. English-speaking readers may check the web pages of their favorite universities for book recommendations and, perhaps, online lectures; I can't recommend anything in English since I haven't read such books. To Russian-speaking readers I can recommend the book «Эффективные алгоритмы и сложность вычислений» by Nikolai Kuzyurin and Stanislav Fomin (pdf, site; published under the OPL license), which we used during our studies at the university.

Comments imported from the old website

I could not agree more; I actually remembered that question at one point and thought the same as you (an unhelpful, useless response, happily not understanding what NP-complete even means). For these reasons I really hate Stack Overflow sometimes, yet I am slightly addicted to it :)

Great website, by the way.

Pavel Shved on 10 February 2010 commented:

Well, such "answers" in some circumstances may be even useful. While some programmers just do not understand NP-related stuff, the others don't even know about it.

Thank you for "great website", btw. I've just hacked in, so if you re-login, I'll even see your name instead of mere openid link. :-)

Roman Kecher on 01 March 2010 commented:

I really liked the article, and very much agree with the point you are making. It is true that understanding what NP-completeness (or hardness) is is very important, but using that as an excuse for not even attempting to tackle the problem is a lousy solution.

As you rightfully mention, there are approximation algorithms (btw you could also check out http://www.cs.technion.ac.il/~reuven/ on that topic), and sometimes even the "bruteforce" solutions with their exponential complexity can perform suitably well - for example, when you know that the size of the actual data you'll be working on is small enough.

Undefined behavior

Sometimes, when you read too many language standards and various specifications, you start using these weird words. Undefined Behavior is the ultimate evil, a raptor that storms in when you dereference a NULL pointer, use freed memory, or violate any of the other preconditions placed in every dark corner just to keep you awake. Futile attempts at jokes aside, it's the wording that says the compiler, library or system can guarantee nothing if you do that.

In most cases, nothing serious happens. For the most common violations, such as dereferencing memory through an invalid pointer, you just get a "Segmentation fault" (or an "Access Violation" if you use an alternative OS) fatal error.

The Raptor Exists

A couple of days ago, I was trying to get Wi-Fi working at work and downloaded drivers from the vendor's website. They happened to be built for a different kernel version, but I installed them anyway, hoping that Undefined Behavior wouldn't bite me too hard. When I rebooted, the network startup attempt broke something undefined, and I observed the following picture...

The raptor was right here, already chewing my hand and smiling with an anime smiley ^_^.

So, guys, if you still don't believe in these tales about raptors that crush you if you do something unwanted, you'd better start believing. You have been warned.

Comments imported from the old website

Pavel Shved on 04 February 2010 commented:

I was trying to encode it right for a couple of hours. Here's the encoding string for this video:

ffmpeg -i 20100202_002.mp4 -ar 22050 -ab 32000 -s 720x480 -f flv -y -vcodec libx264 -vpre libx264-hq file.flv

FYI, it was originally filmed by Nokia n900.

Easy parallelization with Bash in Linux, part 2

Easy parallelization with Bash in Linux

Two articles that show how concurrency can be utilized for bash scripting.
  • Part 1 - pipes and xargs thread pool
  • Part 2 - process substitution and flocks

In the previous article, I told you a bit about the xargs command and piping in Bash. But it turns out that I missed a lot of important information... So, to complete the topic, I just have to post another article. The second thing I'll talk about is flocks, and the first thing, which I don't understand how I missed, is...

Subshells and process substitution

Bash has a convenient syntax for spawning subshells written in the Bash language itself. They appear inline in the script, unlike the processes spawned by xargs, and the interaction with them is tighter.

Each subshell acts like a small shell invocation. It has its own set of file descriptors and environment variables; they are inherited from the invoking shell and can be overridden, but no change affects those of the parent shell.

Bash also has functionality called process substitution. It spawns a subshell, and the construct is substituted in the parent's command line with the name of a FIFO (for example, "/dev/fd/63"). You must connect one of the invoking shell's file descriptors to one end of that FIFO right away, since the FIFO is invalidated when the subprocess ends. Here's a small example.

Assume that two kinds of lines come in on the input, and you need to process them with two different processes. This can be achieved by spawning a subprocess that separates the input and feeds it to two other subprocesses via different FIFOs. No wonder all three processes run concurrently. But the cool thing is how surprisingly few characters it takes to implement.
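Here is a sketch of what such a split script may look like (a reconstruction: the generator, the two sed "processors" and the descriptor numbers are my own choices):

#!/bin/bash
# split.sh (sketch) -- route "a" lines and "b" lines to two concurrent
# processors through two process substitutions
line="unharmed"                 # to show the parent's variables stay intact
printf 'aaa\nbbb\n%.0s' {1..6} |
while read -r l; do
    line="harmed"               # changes only the pipeline subshell's copy
    case "$l" in
        a*) echo "$l" >&3 ;;    # "a" lines go to the first processor
        b*) echo "$l" >&4 ;;    # "b" lines go to the second one
    esac
done 3> >(sed 's/$/yyy/') 4> >(sed 's/$/zzz/')
sleep 1                         # a dirty pause: the processors may still run
echo "The \$line is unharmed : $line"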

You'll see something like that:

pavel@lonely ~/social/blog/bash $ bash split.sh
aaayyy
bbbzzz
aaayyy
aaayyy
bbbzzz
aaayyy
bbbzzz
aaayyy
aaayyy
bbbzzz
bbbzzz
bbbzzz
The $line is unharmed : unharmed

Yes, it's true that when the shell script finishes execution, its subprocesses may still be working. Unfortunately, I don't know a less "dirty" way to synchronize with these subshells. The good news is that you seldom need to wait in such cases (this is just an artificial example where stdout is in use), and if you use process substitution for substituting input only, you won't need to wait at all.

Mutual exclusion via flocks

So far we've only seen pipe-like synchronization: a process waits until data appears at the other end of a pipe. There is also another kind of synchronization, where a process waits until a shared lock is released--mutual exclusion. To implement it, you can use the flock functionality available on Linux systems.

The flock call takes a file descriptor (of an open file) as input and waits until the calling process becomes the sole holder of the lock among those that flocked it. When the file is closed, the lock is released and the next process waiting on the flock gets it. Sounds like a "named mutex", and indeed it is one.

Here's how I used it in a job server to prevent concurrent scripts from installing to the same directory simultaneously.

If two concurrent scripts wanted to install into the same folder, one of them waited for the other to perform the installation and then quickly verified, by a successful make run that compiled nothing, that the software was already installed.

You can play with this one:
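Here is a sketch of such a script (a reconstruction: the package names, the sleep, and the lock-file paths under /tmp are made up):

#!/bin/bash
# flock.sh (sketch) -- serialize concurrent "installations" per target with flock
rm -f /tmp/flockdemo.installed.* /tmp/flockdemo.lock.*
install() {
    local pkg=$1
    (
        flock 9                                   # wait for the exclusive lock
        if [ ! -e "/tmp/flockdemo.installed.$pkg" ]; then
            [ "$pkg" = a ] && sleep 2             # pretend installing "a" is slow
            touch "/tmp/flockdemo.installed.$pkg"
        fi                                        # latecomers find it installed
        echo "Installed $pkg"
    ) 9>"/tmp/flockdemo.lock.$pkg"                # one lock file per target
}
for p in a a a a a a b c; do
    install "$p" &
done
wait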

The output shows that the "installations" of a are blocked until the sleep finishes and, hence, the flocked file is closed:

pavel@lonely ~/social/blog/bash $ bash flock.sh
Installed b
Installed c
Installed a
Installed a
Installed a
Installed a
Installed a
Installed a

However, flock doesn't block read or write operations on the locked file. It only blocks other flocks. Yes, the file must be open to be flocked (that's why we needed those numbered file descriptors), but don't let that confuse you--no read or write operation blocks on a flocked file if the reader or writer simply doesn't use flock itself. It's a named mutex, not a protection feature!

Yet another further studies

While writing this post, I checked the Bash man page and suddenly bumped into the coproc keyword. I hadn't seen it before because it was introduced in version 4.0, which I only recently upgraded to. Here's a googled article about it.

Yet another conclusion

That's most likely all I know about the concurrency available via Bash primitives in Linux (it could have been one article if I hadn't forgotten about process substitution :-) ). I hope it'll help you write better programs and value the powers of shell scripting a bit more.

Comments imported from the old website

Christian Ferrari (id) on 25 July 2014 commented:

Hi Pavel, it seems to me you are interested in parallelization stuff and I'm suggesting you to take a look to FLoM (http://sourceforge.net/projects/flom/): with it you could implement complex serialization/synchronization use cases using few bash scripting; you could implement they inside a single system or in a network of IP connected systems. Some relevant examples are available here: http://sourceforge.net/p/flom/wiki/FLoM%20by%20examples/ Feel free to leave your comments and/or suggestions on the project discussion forum (http://sourceforge.net/p/flom/discussion/).

Thanks in advance Regards Ch.F.

Easy parallelization with Bash in Linux

Easy parallelization with Bash in Linux

Two articles that show how concurrency can be utilized for bash scripting.
  • Part 1 - pipes and xargs thread pool
  • Part 2 - process substitution and flocks

I love concurrency. The idea of several mindless zombies collaborating to produce a useful result thrills me, and I'm always so self-satisfied when I see that these are my boys.

Writing a concurrent program is considered complex. But fairly simple concurrency tasks can be accomplished in the casual Linux shell, Bash, with just a little effort (about Bash: wiki, manual).

Basic piping

The famous pipe allows one program to supply its output to the input of another one.
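For instance (an arbitrary illustrative pair of commands):

# grep starts filtering as soon as find produces its first lines,
# while find is still walking the filesystem
find /etc -name '*.conf' 2>/dev/null | grep net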

The thing is that these programs run concurrently, and the reading one automatically pauses if the feeding program is still computing the next line.

Of course, there are some issues with whether such a pipeline will indeed be scattered across different cores, but on modern kernels (starting somewhere between 2.6.28 and 2.6.30) you'll be okay without special measures or hacks.

Note that you can also pipe whole compound commands, because the pipe operator binds with low priority:
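For example, a whole loop can be one side of a pipe (an illustrative sketch; the original snippet is not reproduced here):

# the loop and sort run concurrently; sort starts consuming
# the line counts while the loop is still producing them
for f in *.txt; do
    wc -l "$f"
done | sort -n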

Thread pool with xargs

What is xargs?

xargs (pronounced 'ex-args') is a utility that passes a large, externally computed number of arguments to a command. Given a list of items on its standard input, it builds up a single command line and executes it; it can also split the items across several command invocations.
Learn more from wiki or man page.

There is also a nearly unknown piece of xargs functionality. It... is capable of acting as a thread pool! (Well, a process pool, actually.) Given a lot of arguments, it can maintain a fixed number of processes invoked for these arguments, picking the next argument from its standard input whenever one of the running processes finishes.

Usually, xargs maintains one process, and only when this process finishes will xargs run the next one (see the second example in the side note). But this can be extended from "one process" to "N processes at a time" with the --max-procs=N argument. Try it:
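For instance (a sketch; any command will do in place of echo):

# without --max-procs the numbers come out strictly in order;
# with it, up to 4 echo processes run at once and the order gets mingled
seq 1 15 | xargs -L 1 echo
seq 1 15 | xargs -L 1 --max-procs=4 echo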

What is seq?

seq is a utility that just prints a series of consecutive numbers. It's a really useful command, especially if you need to feed xargs a specific number of lines and don't really care about their content (like in our examples).

Its simplest usage, seq 15, just yields 15 numbers starting from one (not from zero!), each on a new line; there are also forms that take a start value and a step. I edited this post to use it instead of hand-written counting loops.
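A rough illustration (the loop it replaces is a reconstruction):

# the simplest usage: the numbers 1 through 15, one per line
seq 15
# roughly the loop it replaces
i=1; while [ "$i" -le 15 ]; do echo "$i"; i=$((i+1)); done
# a more advanced form: start, step and end
seq 3 2 11     # prints 3 5 7 9 11
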
Unfortunately, this command is not in POSIX, but it is included in LSB 4.0 and present in all major distributions on the x86 arch (proof link).

Without --max-procs the numbers appear in natural order, while with this option they're a bit mingled. That's the sign of concurrency!

Sample usage: test how a web server handles load
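Something along these lines (the URL and the concurrency level are placeholders):

# fire 1000 requests at the server, up to 50 at a time;
# the numbers from seq only set the count and are discarded by -I NONE
seq 1000 | xargs -I NONE --max-procs=50 wget -q -O /dev/null http://example.com/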

(-I NONE means that the numbers coming in on our input would be substituted in the place marked by the string NONE. Since there is no such string in the command, we simply discard the numbers while still running the command 1000 times. -I also implies -L 1, meaning that each input line yields a separate invocation.)

Arbitrary commands can be managed this way. For simplicity, let's assume that your commands do not contain quotes or special characters (see the next section for that). Say you have a file with simple Bash commands:
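Say, a file commands.txt like this (the contents are made up):

echo Hello, world
sleep 2
ls /tmp
df -h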

Now let's execute the commands:
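A sketch of the invocation (assuming the commands.txt from above):

# run the commands from the file, up to 4 at a time; each whole line
# is substituted for CMD and handed to a separate bash instance
xargs -I CMD --max-procs=4 bash -c CMD <commands.txt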

(Here we used the -I option as intended: to specify a placeholder that is substituted with the line read from the input file.)

You'll see that they will run concurrently.

Here is an example of a script you'd have to write unless you knew about this feature of xargs.

Automatically escaping incoming commands for xargs

In the example above, you had to manually escape the commands you wanted to run concurrently with xargs. A command that uses quotes would be mangled, and an attempt to execute it would result in a syntax error or incorrect behavior:
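For instance (a reconstruction of the kind of command that breaks; the words are made up):

# xargs itself consumes the quotes and the backslash gets mangled along
# the way, so instead of two lines we get the single word "hellonworld"
echo "echo -e 'hello\nworld'" | xargs -I CMD bash -c CMD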

(what we wanted was two lines without that "n" between the words!)

GNU parallel seems like a better alternative to xargs. This tool was mentioned in a reddit comment on the link to this article that someone posted to /r/compsci (why?).

Although parallel is more powerful, it is less easy to use. Besides, it is not available by default on as many distributions as xargs; on the Fedora 15 mentioned in the comment, for instance, you have to install additional packages.

If your shell is Bash, you do not really need external utilities such as GNU parallel. Instead, you may use the %q format sequence of Bash's printf. It prints an escaped version of its string argument.
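Our previous example, with the same made-up words, would then look something like this:

# printf %q re-escapes the command so that it survives xargs' own
# quote processing; now "hello" and "world" really come out on two lines
printf '%q\n' "echo -e 'hello\nworld'" | xargs -I CMD bash -c CMD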

This may be used to run arbitrary commands stored in a file, one per line:

First, we should convert them to their escaped versions and then feed them to xargs. The former is done with Bash's native line-wise reading from a file and the printf trick mentioned above:
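A minimal sketch (commands.txt as above):

# read the file line by line, escape each command with %q,
# and hand the escaped commands to the xargs process pool
while read -r cmd; do
    printf '%q\n' "$cmd"
done <commands.txt | xargs -I CMD --max-procs=4 bash -c CMD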

Concurrency when jobs depend on each other

For this task, the good ol' make tool suits well.

When invoked as make -j 5, make works as a thread (process) pool, but takes dependencies into account. For example, it won't spawn a process for a target until all of its prerequisites are complete. Here's an introductory article I quickly found on Google.
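A tiny made-up Makefile to illustrate: run with make -j 5, the three data files are fetched in parallel, while the report waits for all of them (fetch.sh is a hypothetical helper script; recipe lines must start with a tab):

# run as: make -j 5 report.txt
report.txt: a.dat b.dat c.dat
	cat a.dat b.dat c.dat > report.txt
a.dat b.dat c.dat:
	./fetch.sh $@ > $@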

Further studies

This post doesn't cover run-time synchronization at all (all the concurrency logic here is in the number of concurrent processes or--in make's case--in the dependencies). For that, you can use another Linux utility, flock; here are a couple of examples of how. Maybe I'll write more about it.

Anyway, it's obvious that the Linux shell is powerful in its concurrency capabilities. I hope my post will make you use more of them in your daily work.

Comments imported from the old website

Pavel Shved on 08 January 2010 commented:

Some guy (Jordan Sissel) also wrote on the same matter here.

Software engineering quine

In programming, there's the concept of a quine--a program that prints its own source code (wiki). I'm not obsessed with the concept, but it is fun sometimes.

***

Recently I saw this Dilbert comic, and it made me think about a funny thing.

"Oh, okay. It will cost you 50$", might the lady engineer have said, and written the following program:

Requirements that require a complying program to print themselves? Sounds like a software engineering quine, doesn't it?

Comments imported from the old website

Wow, it really is self-documenting!

Make: a Filesystem Transformation Prover

One of my favorite tools from the Linux world is make. I'm a huge fan of it and, being a Linux programmer, use it in my everyday projects. But no, it's not as a build system that I mostly use it. make is much more than a casual build system.

Then what is make? Having thought about that question, I believe I eventually devised a sound definition of what it actually does--or, at least, of what its main activity is. The capability to build programs is just its flagship feature, which, unfortunately, hides more profound design principles.

What follows the red mark is utter garbage. While it may contain somewhat interesting observations, which are partially true, it is just wrong. I'll show why later.

make is a filesystem transformation prover. Its purpose is to prove that the file system can be transformed from its current state into the specified one according to the rules in the makefile. The result is a proof, given (a) the initial filesystem state, (b) a set of inference rules and (c) the target files that should be brought up to date. make must provide a sequence of rules in which every rule a: b c d ... is triggered only when b, c, etc. are up to date and a is not. Such a makefile rule can be read as "assuming the up-to-dateness of b, c and d, infer the up-to-dateness of a after applying the commands listed in the rule". And in the sequence of rules built, every target file (specified on make's command line) must be "inferred" this way.

Sounds familiar, right? Isn't it the same as proving logical formulas? It totally is. The up-to-date files of a given filesystem state are the premises. The rules in the makefile are the inference arguments (pattern rules resemble them even better). And the targets are the formulas we are to prove.

Impressed? Now create a Makefile
aa%: a%
        touch $@
and try to make aaaaa in a folder where a file named "a" exists. You'll fail to achieve this, and that's due to a huge limitation of make, which brings me back from mathematical dreams to reality: make is a build system. A flexible one, but a build system.

But

However, upon successfully constructing one of the possible "proofs" for all targets, make actually executes the commands along the sequence of rules. And these commands can be arbitrary, while the soundness of the whole transformation requires each rule to update its target and nothing else. This shatters the clean concept described above a bit.

Much of make's functionality not mentioned here can be treated as just tricky preprocessing of a makefile that shortens the number of characters needed to describe a set of explicit rules.

At the same time, this doesn't satisfy all users. The thing is, the rule format in make is quite limited: rules can reason only about the "up-to-dateness" of files in a given state, while users need more complex inference rules, for example "the target is up to date if its prerequisite hasn't changed since the previous make invocation". While such rules can be implemented in terms of make's existing capabilities (like this), it would still be nice if make provided them out of the box.

Let alone many other caveats in GNU Make. Make is surely not ideal.

However

But this is a positive post rather than a negative one. The declarative nature of make and its availability (all major modern Linux distributions provide it) make it an invaluable tool if you need such a "filesystem transformation prover" as part of your project. And it's all in the "Linux way": make can act as a manager that runs many small scripts (invoked via commands within the makefile or written inline) with specific, narrow purposes.

I used it with various aims: to process header files, to run regression tests and to update a repository mirror; and I always found it a very flexible and robust tool. So, do not hesitate to use it. Make is more than just a build system. Though, just a little more.

Treap: simple balanced search tree

Balanced trees and programming contests


This is an example of treap: letters form an alphabetically-sorted search tree, while numbers are arranged as a heap.
Pic from Wikipedia

Picture A. We're going to insert the yellow node into a tree with the blue root.

Picture B. We arranged some nodes, but we have to split the subtree of the maroon node into two subtrees to fit the dangling red pointers.

Picture C. If we could solve the task of insertion, we could split the subtree of the red node as if our new yellow node were inserted there instead of into a subtree of the blue node.

Picture D. After we've split the subtree of the red node, we can assign the pieces to the dangling pointers.

A balanced binary search tree is a binary search tree whose height is small, around O(log N). This value is crucial, because the search operation, the one these trees are named after, takes O(tree height) time.

In a typical programming contest you're given 5 hours to write several programs, each of which, aside from passing all of the jury's tests, must meet a specified time limit. One of the keys to success is (or at least was) the ability to quickly code fast standard algorithms and data structures: disjoint sets, heaps, graphs, etc. And due to the limited duration of the competition, the algorithms should be not only fast but also easy to write bug-free under emotional pressure.

When it comes to the question of how to implement a balanced binary search tree, most programmers will outright shout "red-black tree!" (wiki). Indeed, it is balanced and is implemented in the standard libraries of many languages, so it may seem a good choice (and an even better piece of advice)--unless you have to code it yourself. Its implementation is way too complex, while the benefit of the guaranteed, worst-case 2*log(N+1) height is sometimes not even needed.

A Treap in my mailbox

Back in the times when I participated in those contests, I got an email from one of our trainers. Inside was a description of a nice data structure that I later discovered is called a treap. And now I'd like to tell you about it.

A treap is a blend of the words "tree" and "heap" (or "heap" and "tree", for left-recursive readers). No wonder: it combines the properties of both a tree and a heap. Or, more strictly, it is a tree and a heap at the same time. Each node, aside from its payload data, contains two keys; if we look at one key we see a heap, and if we look at the other we see a binary search tree.

Our main goal is nonetheless to build a balanced binary search tree out of a sequence of N key-value pairs that may arrive as input in an unknown order. When you insert a node into a treap, you pick a uniformly distributed random number and use it as the heap key. Of course, at the same time you must maintain the search tree property for the tree key.

Not as complex

"Hell, the insertion procedure is going to be complex," you think. The trick is that restrictions usually help rather than hinder. So, assume we need to insert an element [Eh,Et] (those are its heap and tree keys) into a (sub)tree with a given root [Rh,Rt].

  1. Eh < Rh -- since we maintain the heap property, our new node must go into one of the root's subtrees. Whether it goes left or right is decided by comparing the tree keys Et and Rt, since we maintain the search tree as well. To insert the element into the proper subtree, we call the same insertion procedure recursively.
  2. Eh > Rh -- our new node can only become the new root then (due to the heap keys). What will its left and right subtrees be? Let's assume that Et < Rt. Then the right subtree of our new node will be the former root (it can't be placed anywhere else due to the heap property), and the right subtree of the old root stays where it is.

    Now we're left with two dangling pointers (see picture B) and a subtree to split between them. But note that if we inserted the new node into that dangling subtree (picture C), it would become its root, and its left and right children would fit the dangling pointers amazingly well (picture D)! So to solve the original problem we need to solve it on a smaller tree... Smells recursive!

    This leads to a tiny split procedure that, given a treap and a tree key Kt, splits the treap into two treaps that could serve as the left and right subtrees of a node with tree key Kt in a binary search tree.

If you want to delete a node from the tree, you'll need a merge operation that makes one treap out of what used to be the left and right subtrees of the node we've just erased. This merge is the opposite of split and is just as simple to implement. Red-black tree deletion, by contrast, requires twice as much code as insertion.

Implementation

The implementation of the insertion stuff is here. Even though I haven't participated in contests for several years, I wrote it in 30 minutes from scratch (the benchmarking code took some additional time), and I didn't even have to debug it. Can you do the same for red-black tree insertion?
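The original treap.cpp is only linked, not reproduced; here is my own minimal sketch of the split-based insertion described above (the names and the small benchmark in main are made up), just to show how little code it takes:

// An independent sketch of insertion via split; not the original treap.cpp.
#include <algorithm>
#include <cstdlib>
#include <iostream>

struct Node {
    int tree_key;        // BST key
    int heap_key;        // random priority (max-heap towards the root)
    Node *left = nullptr, *right = nullptr;
    Node(int t) : tree_key(t), heap_key(std::rand()) {}
};

// Split the treap rooted at t into nodes with tree_key < kt (-> l)
// and nodes with tree_key >= kt (-> r).
void split(Node* t, int kt, Node*& l, Node*& r) {
    if (!t) { l = r = nullptr; return; }
    if (t->tree_key < kt) { l = t; split(t->right, kt, l->right, r); }
    else                  { r = t; split(t->left, kt, l, r->left); }
}

Node* insert(Node* t, Node* n) {
    if (!t) return n;
    if (n->heap_key < t->heap_key) {          // case 1: goes into a subtree
        if (n->tree_key < t->tree_key) t->left  = insert(t->left, n);
        else                           t->right = insert(t->right, n);
        return t;
    }
    split(t, n->tree_key, n->left, n->right); // case 2: n becomes the new root
    return n;
}

int height(Node* t) {
    return t ? 1 + std::max(height(t->left), height(t->right)) : 0;
}

int main() {
    Node* root = nullptr;
    for (int i = 1; i <= 100000; ++i)        // ascending keys: the worst case
        root = insert(root, new Node(i));    //   for an unbalanced BST
    std::cout << "Height is " << height(root) << "\n";
}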

Unfortunately, it's hard to fit this small implementation into an OOP design, but in programming contests that was rarely needed.

If we run the benchmark, we get a fairly good height on an ascending sequence (on which unbalanced trees fail badly):

pavel@lonely ~/social/blog/treap $ perl -e 'for (1..10_000_000) { print "$_ ";} print "\n"' >test_file
pavel@lonely ~/social/blog/treap $ g++ --version | head -n 1
g++ (Gentoo 4.4.2 p1.0) 4.4.2
pavel@lonely ~/social/blog/treap $ g++ treap.cpp -O2 -o treap
pavel@lonely ~/social/blog/treap $ g++ treap.cpp -O2 -o tree_STL -DSTTREE
pavel@lonely ~/social/blog/treap $ time ./treap <test_file ; time ./treap <test_file ; time ./tree_STL <test_file
Height is 58 for RB it is 30.8499
real    0m6.545s
user    0m6.116s
sys     0m0.176s
Height is 56 for RB it is 30.8499
real    0m6.069s
user    0m5.908s
sys     0m0.160s
Height not supported :(

real    0m7.667s
user    0m7.472s
sys     0m0.188s

While height for RB-tree is about 30, height of a treap is about 50-60. You can run a series of experiments on your own and see it.

Why is it balanced?

The expected height of a treap is O(log N).

The means by which a treap balances itself as a search tree are Complex Math and are described in the articles linked from its Wikipedia page.

In plain words, for the tree to become unbalanced, not only would the search keys have to be inserted in mostly ascending or descending order, but the randomly generated numbers would also have to come in the corresponding order, which is unlikely. And whenever a heap key is "out of sync" with the unbalancing pattern, it rotates the subtree for which it becomes the new root, making the tree capable of absorbing more "unbalancing" nodes without a height increase.

Acknowledgements

Treaps were created by Cecilia R. Aragon and Raimund Seidel back in 1989.