A poor man's benchmark and stopwatch

Recently I wanted to measure how long it would take my program to execute a certain action. I didn't want to search for time-related functions in that language, so I was just going to use my Casio watch. However, I suddenly realized that I had left it at home...

"Wait a minute", I thought, "I have a PC with a modern operating system, openSuSE Linux! How come it doesn't have a stopwatch program?" I scratched my head, but couldn't remember anything like that in standard Linux utils.

I asked my peers about such a program, and Konstantin Vlasov proposed a solution: use... dd!

What is dd?

If you still don't get it, I'll remind you. dd is a standard Unix (hence Linux) tool... to perform low-level data stream copying. Its most popular use is to clone raw partitions (as arrays of bytes, not as sets of files):

dd if=/dev/sdb5 of=/dev/sdc1
dd if=/dev/cdrom of=image.iso

But we'll get to its conventional use later, and now...

dd as stopwatch

The funny thing about dd is that, in its default mode, it prints how long it's been running. Indeed, dd is a data copying program, so it's natural for it to print, at the end, how much data it has copied, and how fast.

So it prints the total size of the data copied, the time it took, and the resultant speed... Wait a minute... the time! That's it!

To use dd as a stopwatch, you don't need to copy actual data. Just run it without arguments and hit Ctrl+C when you need to stop the timer. Here's what the output looks like:

pavel@lonely ~ $ dd
^C0+0 records in
0+0 records out
0 bytes (0 B) copied, 5.97049 s, 0.0 kB/s

There it is: a stopwatch controlled by just the Enter and Ctrl+C keys.

Note that instead of triggering the stop manually with Ctrl+C, you may just send a signal to dd, with the same effect. We'll utilize this in the next section.

dd as a benchmark

So, we've just learned that dd can print the speed of the data copy operation it performed. And Linux has nice pseudo-files that just generate streams of data... why don't we use these files to measure how fast a computer can copy nothing to nowhere?

Let's check how much data your machine will copy in ten seconds! Here's a sample listing:

pavel@lonely ~ $ dd if=/dev/zero of=/dev/null bs=64k & sleep 10 ; kill -INT %1 ; wait %1
[1] 885
1434838+1 records in
1434838+0 records out
94033543168 bytes (94 GB) copied, 10.0106 s, 9.4 GB/s
[1]+  Interrupt               dd if=/dev/zero of=/dev/null bs=64k

The standard Linux shell, Bash, has quite simple features to control its child processes. For example, the code above features "job specifications" ("jobspecs"). When a program is run in the background (with & at the end of a statement, instead of the usual ;), it gets a job number, and it can be referred to as %n, where n is that number.

In our code we send a signal to a certain jobspec, and then wait for it (if it weren't for wait, the shell prompt could mingle with the output of dd's signal handler). And if we only have one job running in the current shell, its jobspec will always be %1.

This command launches a dd process that copies a stream of zero bytes maintained by the kernel (/dev/zero) to a stream that swallows everything it gets without any effect (/dev/null). You may try this simple copying yourself and learn how fast your... well, it's hard to tell what exactly, but you can compare computers with your friends and beat them in this competition! We held a dd benchmarking competition on the message board of our university, and the top result was 70.6 GB/s, about ten times faster than my machine.

By the very same means you may analyze the performance of your disk, by putting a file as the of= argument. But that's a poor man's solution anyway, as there are better benchmarks for both disks and memory.

Other (not interesting) uses of dd

Aside from copying raw partitions (especially useful if you're trying to recover a disk, where accessing it as a file system would require a lot of unnecessary reads, which may be harmful), dd may also be used to create big files of a specific size. Why would one want that?... Right, to create swap files. Here's a sample listing that creates a 1 GB swap file:
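# The file name is illustrative; bs times count gives the 1 GB size.
dd if=/dev/zero of=/swapfile bs=1M count=1024
mkswap /swapfile
swapon /swapfile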

I learned this during a hard time at work, when, at an important launch of our experiments, the controlling program leaked 4 GB of memory...

Here's a (really good) Wikipedia article about various uses of dd that presents even more examples of neat tricks you can employ dd for.

***

Although I'm an experienced Linux user, it never ceases to surprise me. The dd program is not the only tool that, due to its simplicity, serves many different purposes. That simplicity of basic tools, which leads to their powerful combinations, is one of the things I appreciate most in Linux.

And sometimes it can be even amusing.

Comments imported from the old website

Jason Plank on 08 February 2011 commented:

In addition to dd, there is also the GNU time utility, which I think is standard in all Linux distributions. You can find the time used by a program if you use it like "time make", or for example:

$ time time --version
GNU time 1.7

real    0m0.001s
user    0m0.000s
sys    0m0.000s

Unfortunately, unlike dd, time is only accurate to milliseconds :)

Pavel Shved on 09 February 2011 commented:

It's ironic that I didn't think about the time program to measure time. Perhaps that's a matter of native language: I don't have an entanglement between the words "time" and "время".

Anyway, to make time wait for your reaction before it measures, you should specify a program that waits for your input. You might use time grep x or time cat, but the shortest choice would anyway be time dd, which makes the time part unnecessary. :-)

And of course, as the smiley you finished your message with suggests, precision doesn't matter here. Humans are quite slow to respond to events: to notice the change in color of a circle, a human needs 0.2 s, and the time is longer for complex events. Instead of linking to research, I'd just suggest you measure your own reaction to a simpler event, and to a more complex one.

Why execve()-like system() call is a must-have

How do you usually call an external program in Linux? Whichever language you use, you may fork() your process and call exec() afterwards; your runtime library surely supports these primitives. However, unless you're doing something crazy, you most likely use a system() function, which does the fork for you, and automatically waits for the process to finish.

In Linux, runtime libraries for various languages usually support setting environment variables, so that all processes forked afterwards inherit the altered environment. In Ruby and Perl it's as easy as changing a global hash. Setting environment variables is a usual means of interprocess communication in Linux.

However, the only Linux kernel system call in this area is execve() (see its man page). It both sets the environment variables for the program called and calls it, replacing the current program (you should fork beforehand; that's a separate syscall). There's no syscall for system(); it's usually a library function. Higher-level libraries, in any case, provide richer functionality. They surely provide more convenient primitives for the aforementioned setting of env variables, and for the mere system() call. The work of execve(), it would seem, could easily be substituted with a combination of these.

However, I recently discovered that it's not that obvious.

Why execve() is crucial

If you know what I usually write about, you have probably already guessed what the deal is. The execve()—as a single library call—is especially useful in multithreaded programs!

Let's assume that you only use environment setting and a mere system(). How would you call an external program then? Here's an example in Ruby (though the language doesn't actually matter):
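ENV['MY_VAR'] = 'value'   # a sketch: the variable and program names are placeholders
system('some_program')    # the spawned child inherits MY_VAR = 'value'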

Now assume that two threads are doing this. Yes, you've guessed what the problem is; here's a possible execution:

  1. The first thread sets the environment variable to 'value'. A context switch happens.
  2. The second thread sets the environment variable to 'another_value'.
  3. The second thread calls system(), and the process spawned inherits the correct 'another_value' value of the environment variable. The context is switched back.
  4. The first thread calls system(), and the process spawned inherits the same 'another_value' as the value of the env variable, because the other thread has changed the global environment setting. However, the programmer's intent was that it should have been called with 'value' instead.

So how could we overcome this? A naïve approach is to use a mutex to synchronize the different threads, like this:
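# A sketch; the helper name and the global mutex are mine.
require 'thread'

ENV_MUTEX = Mutex.new

def call_with_env(var, value, command)
  ENV_MUTEX.synchronize do
    ENV[var] = value
    system(command)   # the mutex is held while the external program runs!
  end
end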

But then the mutex won't be released until the thread that captured it first finishes executing the whole external program. However, our intent was to invoke the programs safely, not to execute them in a mutually exclusive manner.

What it could look like

One solution is to use raw fork() and exec() calls, and to assign your environment only after forking:
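# A sketch, with placeholder names as before.
pid = fork do
  ENV['MY_VAR'] = 'value'   # affects only this child process
  exec('some_program')      # replaces the child; the altered environment is inherited
end
Process.waitpid(pid)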

The problem of invoking open3 with an altered environment is what actually made me write this article. I had to study and modify an implementation of open3 in Ruby to make this work. See my blog post on how to implement open3 for details.

It's more verbose, but it should work. But what if your case is more complex? What if you're not doing a mere system(), but want an open3() instead? There are many ways a child process may be used, including opening it with one or more pipes, or opening a pipeline of several processes. You would have to re-implement these primitives if you just need to alter the environment, and you'll be lucky if their implementation is readily available and reusable...

What it should look like

A good implementation of a set of safe library functions resides in the Ruby 1.9 runtime library. The system() call and each popen()-like call can accept a hash of environment variables that the invoked child process inherits. So the calls look like this:
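# A sketch; names are placeholders. The env hash applies to the child only,
# so no global state is touched.
system({'MY_VAR' => 'value'}, 'some_program')

require 'open3'
out, err, status = Open3.capture3({'MY_VAR' => 'value'}, 'some_program')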

Unfortunately, Ruby 1.8 doesn't have this functionality, and there you can't use a single call to both set the environment and spawn a child process. That pisses me off, actually, since not everyone has migrated yet, and I had to reimplement open3 in Ruby just to impose environment variable assignment in a thread-safe manner.

Conclusion

I know that I should be careful when I alter global variables in multithreaded programs. However, when I access basic primitives, I tend to treat them as something special, not just as ordinary data. This is not the safest approach. Use your forks with care, especially if you're operating several at the same time, and choose safe runtime libraries to work with.

Penn&Paper's Solutions™: Progress Tracker

A couple of weeks ago I began to develop a new component of the system we ship. In order not to get lost in where I currently am, I decided to employ a tool to visualize the structure of the component, and to track the progress of its development.

However, I was too busy to write a program on my own. Therefore, I decided to purchase an enterprise solution. I analyzed the market, and asked a couple of questions. And I found a product. The product. "Penn&Paper's™ Progress Tracker 8.3 Personal Edition" by Penn&Paper's Solutions™. The one that suited me, fulfilled all my requirements, and was very easy to use.

I was so satisfied with this product that I even decided to publish an interview with one of Penn&Paper's™ sales managers in my extremely popular blog. To make it more interesting, I accompanied it with pictures of my experience using their product.

Interview with a Penn&Paper's Solutions™ representative

So, here it is. Violet Page, a sales manager from "Penn&Paper's Solutions™", presents their "Progress Tracker 8.3 Personal"!

Pavel Shved: Violet, first, I must confess that I'm a fan of your company nowadays. You seem to create very "natural" products for human beings to operate. So, I guess, a description of how easy it is to work with would be a good thing to start with.

Violet Page: Yeah, "Penn&Paper's™ Progress Tracker 8.3 Personal" is very easy to work with indeed! You start the application, and it automatically opens a sheet. Then you may just select the Pen tool

and draw a couple of lines. You can draw straight and freehand lines—whatever you wish, there are no limitations! This is all it takes to draw a scheme of the component distribution of your application.

P.S. Alright, so we'd get a diagram of the component's architecture. But what about progress?

V.P. Oh, that's also easy. Just select a Marker tool,

and apply it to the parts of the system you've implemented

You can apply it to the interfaces and implementations, depending on what you've drawn with the Pen tool in the first place. You're free to shade your items partially, and to use different colors.

This way you'll clearly see that the parts you've already implemented are shaded, and those you still haven't—are not.

As you gradually progress in developing the system, you'll shade more and more parts.

And more parts

Until, finally, you implement everything you've planned for the release.

P.S. You said I can draw freehand lines, but what if I just want to type some text?

V.P. That's easy to accomplish with the very same Pen tool. Just use the context menu and select the "type text" item. The text tool supports UTF-8, so you don't have to struggle to type symbols in your native language.

P.S. What about teamwork? You know, you usually track progress not just to please yourself. Rather, one of your primary objectives is to provide a shiny picture for your managers and peers. How could I share results of my work with the other developers?

V.P. You have two options here. First, you can take a screenshot, and publish it the way you prefer.

You can post it to your corporate wiki, or mail to the parties interested. Or just make an animated GIF out of it:

P.S. What's the other option?

V.P. Our "Progress Tracker" integrates smoothly with the other "Penn&Paper's Solutions™" products. For example, you may seamlessly re-play your progress when you're having a corporate video conference call with "Penn&Paper's Conference Table™", so that your peers could envision the roadmap of the discussion:

Check out how the developer covers more of the diagram with his Marker tool as he explains the architecture of his software:

The other useful way of integration for collaboration is posting your work onto a "Penn&Paper's Blackboard™" knowledge sharing tool:

P.S. Hey, wait a minute. Collaboration over the same sheet usually causes merge conflicts. What can you do about them? The format you employ is obviously binary, and automatic merges are impossible... aren't they?

V.P. Yes, it's impossible. We adopted a technique familiar to many developers. Your colleagues may just "check out" your sheet (like in older revision-control systems), discuss it, make amendments, and then put it back to your workplace. We think that it's the most natural workflow.

P.S. What about protection? Are my sheets foolproof?

V.P. Well, for any protection there's a person stupid enough to bypass it. However, we provide some protection functionality. The first is the notorious "Trash can", into which your sheets are placed if accidentally deleted.

You can recover them easily from the "Trash can" unless it's really stuffed, and your sheets are disengaged.

However, the Personal Edition doesn't support Undo functionality. For this feature to work, you have to purchase one of our corporate products, such as the aforementioned "Blackboard":

P.S. Alright, so we've observed all the major features. But what's the price?

V.P. We've adopted a pay-as-you-go licensing model, popular in business applications. For example, for $4.95 you may paint 15 miles of lines, with no restrictions on the number of components you draw, and no time limitations! Our competitors require much greater funding.

P.S. Indeed, I drew this whole system, and the $5 license isn't even half-used. Mrs. Page, thank you for the interview, and good luck in your future affairs!

Conclusion

So, unless you're dumb, you have already understood that all of this is entirely fictional. You know, a whole generation has emerged of people who have only seen floppy disks on "save" buttons. And here comes a generation that discovers the stuff that surrounds them in their offices by studying allusions to it in software products, with which they (we) are much more familiar.

Of course, I'll continue to use "Penn&Paper's™" products. In fact, we're currently using one of them, the "Penn&Paper's™ Blackboard" referenced above, which is extremely useful in our Scrum daily meetings. I also have made a quick overview of the "Penn&Paper's™ Table Scroll Sanitizer®", an open-source alternative of which I developed. I'm looking forward to more interviews with Violet Page, and, actually, to a small discount on their products >__<.

Comments imported from the old website

silky (id) on 25 December 2010 commented:

Yeah, I agree with you of course.

One interesting thing I recently learned about whiteboards, is you can make them from glass. So, if you wall your office in suitable wallpaper/fabric (whatever colour) and cover it with glass, you can then write on it with the appropriate pencil. Of course, this technique can also be expanded to desks and walls next to desks. Quite great, I think.

I, like you, tend to do almost all my project planning on paper, with relevant items (obviously, it goes without saying), making it into Trac as time goes on.

Anyway, I think we will see more of this - fighting against technological solutions - as time goes on. I, for one, think it is nice :)

Pavel Shved on 25 December 2010 commented:

Glass wall-boards are a neat idea. However, when it comes to implementation... Walls are usually hard to access, because a lot of stuff stands against them. At least in our office, whose capacity we seem to have exceeded.

Covering a desk with glass could work, but you'll have to be careful that the paint doesn't make things you put on it (or your sleeves) dirty.

silky (id) on 26 December 2010 commented:

Yes, you are right about wall access; the implementation I saw was in a meeting room, where walls are generally accessible. Agreed on the desk issue. There's probably some work to do on this idea before we can realise its full potential.

Porting Perl's features to Ruby

Diamond.  A Girl's Best Friend.  Ruby.  A Programmer's Best Friend.

An official motto of Ruby language with its joke explained.

As I said in my previous post, I now use Ruby as my primary scripting language. The most prominent reason is that Ruby is superior to Perl both in syntax and in expressive power. Being object-oriented, Ruby allows a lot of things to be expressed in a more concise manner, and exception handling relieves the developer of the burden of checking the return code of every damned operation.

However, there are some features of Perl, which don't map to Ruby directly. Here's a couple of them.

Boolean String Interpretation

The first is "conversion to boolean". In Perl a lot of things convert to false in boolean context: an empty string, a zero, an undef (undefined value).

In Perl 6 (and in recent releases of Perl 5), there is a "//" operator: "lhs // rhs" is equivalent to "if lhs is undef, then rhs; otherwise, lhs". It makes the language slightly more expressive but, unfortunately, won't bring us closer to our aim: making Ruby as convenient as Perl.

The Perl behavior allows the following: if we want to assign a default value to a string parameter, we'd just use the || operation:
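# A one-line sketch; the variable name is illustrative.
$str = $str || "default";    # or, shorter: $str ||= "default";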

However, in Ruby, only nil (the undefined value) and boolean false convert to false. So code similar to the above (using Ruby's || operator) would only work if str were nil; it wouldn't work if it were empty. And in many situations that would be very useful. Consider retrieving a string which, when not specified, should default to some value. If this string comes unspecified from a user's input in a form, it'll most likely be an empty string. And if it originates from a database, it'll be nil. How would we write code that sets the default value as concisely as the Perl one?

The solution is redefining operators in a class. We will introduce a new operator for strings that works like Perl's ||. Note that we can't just redefine Ruby's ||, because it may be used by third-party code, with which we shouldn't interfere.

One of Ruby's features, which originates from Smalltalk, is that we can redefine methods of objects and classes on the fly. This seems unusual coming from C++ and Java, the most common OOP languages, but in Ruby a lot of things are implemented this way.

This feature is sometimes referred to as "monkey patching"

Let's call the new operator or, and redefine it for strings:
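# A sketch: an empty string falls back to the supplied default.
class String
  def or(default)
    empty? ? default : self
  end
end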

We notice that it works well if the left-hand side is a string. But what if it's nil? Our aim was to have proper behavior for expressions like a.or("b"), where we don't know if a is defined. But we can't redefine how nil works... can we?

In Ruby the nil keyword represents the "undefined value". Technically, however, it's just a... singleton object of class NilClass. So we can open this class the same way we did for String! In Ruby we can add methods to nil (undefined) objects! Here's the complete code of a sample program:
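# A sketch; the test values are arranged so that each line prints "aaa".
class String
  def or(default)
    empty? ? default : self
  end
end

class NilClass
  def or(default)
    default
  end
end

puts "aaa".or("bbb")   # a non-empty string stays as it is
puts "".or("aaa")      # an empty string falls back to the default
puts nil.or("aaa")     # so does nil
puts "aaa".or(nil)     # the default isn't even needed here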

Check it out: it prints four "aaa" lines, just as planned. Such a convenient feature in exchange for just six lines in the utils.rb file of your project (you do have a file with your most beloved code pieces, don't you?).

Interpreting lists in function call arguments

Perl has a creepy way of handling arrays of values. It uses tricky terminology to distinguish "arrays" from "lists", while they seem to be the same thing. "List context" and "scalar context" add more complexity, and "references to lists" finish off a programmer's brain.

However, there's a lot of places where this nonuniformity turns into features. Here's one of them.

A regular Perl function (unless special measures are taken) takes its arguments as a single list. If a function is said to "take two arguments", it really takes a list, and interprets the list's first two elements as the "arguments".

This looks ugly at the function definition site, but it's beautiful at the call site. Let's write a sample program that, say, finds all "backup" files in a given list of directories by calling one of the Linux shell tools, find. Here's how it would look in Perl:
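#!/usr/bin/perl
# A sketch: find backup files (*~) in the directories given on the command line.
system("find", @ARGV, "-name", "*~");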

We pass a list of arguments to system, imbuing the command-line arguments of the script, @ARGV, into its middle. A flat list, with the folders at the beginning, is what gets passed to the system function, not a nested list! In Ruby, however, it's a nested list that would be passed:
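# A sketch of the literal translation, which doesn't work:
system("find", ARGV, "-name", "*~")
# raises TypeError: ARGV is passed as a single nested array argument,
# and Ruby refuses to implicitly convert an Array into a String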

Ruby is stricter about its arguments, which I find more convenient than Perl, but in this particular case it leads to rejecting concise and transparent code. Ruby does have a way to pass an arbitrary number of arguments, similar to C's "ellipsis":
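# A sketch of the "splat" parameter, which collects the trailing arguments.
def takes_many(first, *rest)
  p first   # the first argument
  p rest    # everything after it, as an array
end

takes_many(1, 2, 3)   # prints 1, then [2, 3]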

The arguments beyond a certain point form a list, which can be explored. However, that's not of much help here: Kernel#system already takes an arbitrary number of arguments, and it nevertheless doesn't work.

What Ruby lacks is the interpretation of function call arguments as a list at the call site. But since we can't change it at the call site, let's change it at the callee's! Here's what our wrapper around the system function may look like:
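# A sketch; the wrapper name is mine.
def my_system(*args)
  args = args.flatten   # splice nested arrays into one flat list
  p args                # show the argument array before running the command
  system(*args)
end

my_system("find", ARGV, "-name", "*~")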

This will work, and you'll see an array of arguments printed on your screen before the command executes.

But that's not all. What about nested lists? Perl doesn't have nested lists, it flattens them all at construction. You could, of course, create nested data structures with "references", but in a pass-by-reference Ruby language it's not an option.

For example, if you call a similar wrapper with arguments like these:
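# Purely illustrative values; note the three-element nested list in fourth position.
my_system(1, 2, 3, [4, [5, 6], 7], 8)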

it will flatten them down to a list of eight arguments, which may not be what we want: we might expect the fourth argument to stay a three-element list, but Array#flatten won't keep it. The solution is to use another function (I called it soften), which we can create by the same monkey-patching technique. Here's the code of a sample solution:
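# A sketch of soften: flatten just one level (modern Rubies' Array#flatten(1)
# does the same). The demo inputs are reconstructed to produce the output below.
class Array
  def soften
    result = []
    each { |item| item.is_a?(Array) ? result.concat(item) : result << item }
    result
  end
end

p [1, [2]].soften
p [1, [2, 3], [4, 5]].soften
p [[1, 2], 3, [4, 5], [4, 5, 6], [7, 8]].soften
p ["nohup", ["ls", "-l"]].soften
p [1, 2, 3, [4, ["nesting", "is", "important"], 5], [4, [5], 6], 7, 8, 9].soften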

Here's the script's output, which demonstrates that it works as expected.

[1, 2]
[1, 2, 3, 4, 5]
[1, 2, 3, 4, 5, 4, 5, 6, 7, 8]
["nohup", "ls", "-l"]
[1, 2, 3, 4, ["nesting", "is", "important"], 5, 4, [5], 6, 7, 8, 9]

***

I prefer Ruby to Perl because it's stricter. Its strictness makes Ruby less tolerant of some Perl-ish "tricks", though. On the other hand, the flexibility of Ruby allows us to mimic these features without even touching the interpreter.

This makes me suggest that "strict but flexible" is preferable to "more free, but less flexible". Ungrounded assumptions aside, the Ruby tricks listed above at least make its use more convenient, especially if you are used to being an undisciplined Perl programmer.

A year of blogging

On 30th November, 2009, a couple of my articles ("Zombies" and "C++ virtual functions") were the first to appear at coldattic.info. And today I celebrate its first anniversary.

How it started

I wanted to write regular articles about programming even before I started coding professionally. However, I never actually managed to start. I bought the domain name and VPS hosting much earlier than the blog was opened.

About a year ago I had a nasty encounter with a couple of corrupt Russian policemen. This finally braced me up, and I started looking for a job in a better place (the US, of course), only to eventually fail and feel even more miserable.

I read some articles on job searching, and concluded that I should blog about programming to demonstrate my passion. This spurred the development of the engine, which took several months of evenings after work. Finally, I published a couple of articles I had prepared in advance. The job-search campaign ended; the blog did not.

Statistics

So, here they are. 40 articles in 52 weeks, which I consider a good result. What are the other metrics?

Even though I don't consider popularity and lots of visits the primary aim of the blog, I feel better when I have more of them :-). I estimate the number of regular visitors at 100. And last month I even cleared 1k visits per month, as shown on the figure:

I noticed that the number of visits only loosely correlates with what and how often I write. My activity on "social" sites, such as StackOverflow and forums, from which people get here, has a much greater influence. As for search engines, I get much better results from them because my site is referenced from my StackOverflow account, and, due to my large reputation there, the link doesn't have rel=nofollow. But I cherish the "direct" visits most of all, of course. Here's the graph:

⅓ of my posts were made on weekends. Perhaps, to increase the number of visits from users of RSS readers, I should be more patient and stash articles for a couple of days. I'll try this from now on.

Results

Most of the articles posted here are not very good (I know it). However, I consider several to be much better—at least, I like them most. Over the last year I managed to write:

  • "NP-complete!" as a lame excuse (read). An article about perception of NP-completeness, and some examples why this may not be the problem.
  • Randomized debugging (read). A debugging technique I invented, and keep using (last time I used it was...today!) Too bad it didn't get much traction.
  • How I applied for a web server developer... (read). A "cool story" about my attempts to find a job in Russia. I still smile when I read it.
  • Are women worse programmers than men? (read). An insight into the question, which I consider fresh. At last I managed to collect all my thoughts on this matter.
  • Parallel merge sorting (read). An algorithm which was very interesting to dig into, and to understand how it works.

But of course I like them all, and every single article I wrote. My blog is barely read, but I like it anyway.

I hoped to improve my English, but it didn't improve significantly. To improve it, I should be reading rather than writing.

During the development of this blog, I think I learned Ruby and Rails quite well. For example, I now use Ruby as my primary scripting language instead of Perl.

After the first couple of articles, I intentionally limited the time to write one post to 4 hours. Today I feel that I produce better articles in the time given, although I still see room to improve my writing skills.

Conclusion

So, it seems that this blog brings me a lot of fun and teaches me a lot. I'll keep doing it, even now, when I'm not actively looking for a new job.

I want to thank all my readers, who made it through my crippled language and had fun reading some of my posts. I hope I'll bring you more fun and non-standard insights on the pages of coldattic.info :-)

Parallel merge sort

We live in a loosely connected world. There's an increasing demand for middlemen, who connect different entities together.

This seemingly unrelated introduction is important for this post in two different ways:

  • Parallelizing tasks across different processors also grows in demand, as it's like creating stuff that doesn't need middlemen;
  • If it were not for social networks and forums, you probably would never know about interesting things created at the other end of the world.

So, back to the topic. I used to consider merge sort as not parallelizable well, i.e. not with the same runtime estimate. However, several months ago, when I disclosed this ungrounded opinion on my university's message board, one of its dwellers disproved it. His post linked an article that describes a parallel merge sort algorithm, though it is not the original source. The article contains a brief, practical explanation, and I liked the algorithm so much that I wanted to retell it here.

What concurrency we're interested in

Note that I won't describe just a multithreaded version of the algorithm, the challenge of which is a clever use of locking. The problem described here is deploying merge sort onto a system with distributed memory.

All intensive calculations in this framework should be performed locally. However, each processor has a limited amount of on-site random-access memory. So the challenge is to split the processing into small chunks, which will be sorted and merged locally (complex calculations), and will be just flushed byte-by-byte to external memory at the beginning and at the end of processing.

Why merge sorting?

The thing is that merge sort is actually an external sort. That means it can handle amounts of data larger than the RAM you have random access to. This should make us think that it's parallelizable, but it doesn't directly lead to an obvious way to parallelize it.

The usual merge sorting looks like this, N being equal to 2:

  1. Split the sequence to sort into N (nearly) equal chunks;
  2. Sort each chunk (recursively);
  3. Merge all chunks (at once or pair-by-pair). The challenge is to parallelize this step.

Each chunk out of these N could be processed individually and independently, and this is a huge achievement. But a challenge remains: how to parallelize the merging?
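For reference, here's the usual sequential algorithm as a minimal Ruby sketch (N = 2); the parallel version keeps the splitting and per-chunk sorting, and changes only the merging:

def merge_sort(a)
  return a if a.size <= 1
  mid = a.size / 2
  left  = merge_sort(a[0...mid])
  right = merge_sort(a[mid..-1])
  merged = []
  until left.empty? || right.empty?
    merged << (left.first <= right.first ? left.shift : right.shift)
  end
  merged + left + right   # one side is empty; append the leftovers
end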

Parallelized merging: idea

So, how to parallelize merging? First, we assume that one processor can store 2t cells in its own RAM, so we're restricted to merging no more than 2t cells at once.

Assume we have two large sorted arrays, A and B, which, merged, or even individually, don't fit into the RAM available to one processor. This means that we should parallelize the work: split the sequences into small chunks that can be merged locally.

Let's try the obvious way: divide each array into chunks of length t and try to merge them sequentially. We can merge the first two chunks into the beginning of the resultant array easily. But what if the elements of the 2nd chunk should also land inside that very same merged array?

That merged array of length 2t already doesn't fit in the memory of one processor, so we can't merge it with the second chunk in the straightforward way.

So what we really want is to divide the arrays into mergeable chunks. For example, with what chunk of the B array should A's sub-array (a_i, a_j) of length t be merged?

Obviously, the corresponding chunk (b_I, b_J) should be such that all b's with index less than I are less than a_i, and the similar condition holds for a_j and J.

However, the size of the (b_I, b_J) chunk could be greater than t, so we couldn't fit it into RAM to merge it with the other chunk. So let's highlight the items that split B into chunks of length t and that fall inside the segment (b_I, b_J):

These new elements split B's chunk into smaller pieces, each of length less than t. Let's find the corresponding elements in the A array for them:

Now all chunks shown here have length less than t, and each corresponding pair could be merged.

But look what we've done here. The order in which a_i, a_j, b_q, and b_p are arranged is determined by merging them—or by merging the [a_i, a_j] and [b_q, b_p] arrays. So if we merge these t-distant elements, we will determine the order of the "splitters" that divide both arrays into mergeable, ordered chunks.

Parallelized merging: the algorithm

The thoughts in the previous section lead to the merging algorithm. First, divide both arrays into (nearly) equal chunks of size t.

Then merge the two arrays that the selected elements form, using the same merging algorithm recursively. The result is the order in which the splitter elements should be arranged (marked with a red line):

The merge we just performed allows us to determine, by simple calculations, the exact chunk of length t into which each splitting element from the other array should fit. (We should "switch" the array in which we search for the counterpart at the same time as the red line switches sides.) Then a binary search determines the element's actual place within that chunk; this way we generate twice as many splitters for each array:

By construction, these arrows will not cross. The pairs of chunks (the chunks to merge are depicted as sections colored in an alternating manner) will then lie sequentially in the merged array. Merge each pair of chunks locally, and place the result into the larger array:

Each chunk to be merged consists of two segments of length less than t, which makes it possible to merge chunks locally on one processor. This completes the algorithm.

Parallelized merge: runtime estimation

Merging two arrays of n elements requires merging two arrays of n/t elements, performing n/t binary searches in arrays of length t, and, finally, merging the chunks we divided our arrays into, which takes O(n) operations. This leads to the recurrence:

T(n) = T(n/t) + (n/t)·O(log t) + O(n)

I'm too lazy to solve it, but T(n) = O(n) fits it (we consider t a constant). This is the same estimate as for the usual merge, so parallel merge sort is O(n⋅log n) as well.

You may also note that the parallel merge algorithm follows the familiar divide-and-conquer pattern. And here's one of the neat facts about this algorithm: merging the chunks we sorted requires the very same processors that sorted the chunks!

***

This is an algorithm I had been looking forward to talking about for a long time. I'm glad that I finally did, and I hope that you didn't know it before, and have spent this time learning new fun facts.

Prisoners and Fairness

Fairness is a rarely pronounced, yet important concept in computing. Its applications appear in different domains. Even when it goes unspoken in documentation and descriptions of systems, I sometimes notice: "hey, for this to work, fairness is required!" So, what's it all about?

Prisoners

The first thing in which I, all of a sudden, encountered fairness was the prisoners riddle. Here's how it's usually formulated:

100 prisoners are sentenced to death. However, they're given a chance to survive. In one of the cells, there's a light bulb, which is initially off, and a switch connected to it, which can turn the bulb on and off. In an arbitrary order, which is not known to any of them, the jailer lets prisoners into the cell; they can observe whether the bulb is on or off, and, optionally, flip the switch.

If a prisoner, after visiting the cell, says that every prisoner visited the cell at least once, and he's right, then everybody is set free. If he's wrong, then every prisoner is executed. Of course, every prisoner can pass and say nothing; then the iterations continue.

The prisoners can't talk to each other after the trial begins; they're kept in solitary confinement. However, before the "trial" starts, they're given a chance to make up a plan, which they will follow. Can the prisoners make such an arrangement that guarantees they are set free, no matter in which order they are allowed into the room?

There are just a couple of riddles of this sort that I don't consider stupid, and this is one of them. Take time to solve it, and you'll be satisfied with the process and the result. You can keep reading, though, there won't be spoilers.

So, does the problem look sound? Is its definition complete? It sure is not. Indeed, it's quite possible that one prisoner never enters the room at all! Since they enter the cell in an arbitrary, not a random, order, this is possible. That means it's possible that not everyone ever enters the cell, and then in no situation will the statement "everyone was here" be true. So we should conclude that no matter how the prisoners decide to act, they may never get freedom.

This makes the problem uninteresting. What should we add to make it have only non-trivial answers? You already got the idea: we should enforce fairness.

Definition of fairness

Picture A

A simple state machine for the jailer. The jailer nondeterministically picks one of the prisoners, lets him into the bulb room, and checks his answer, if any.

So let's formalize the work of the jailer a bit. All prisoners reside in their separate cells, and sometimes the jailer decides to open one and invite its inhabitant into the bulb room. If the prisoner doesn't decide to make the final guess, the jailer picks a cell to open again. This can be modeled as a simple state machine. See picture A.

The jailer makes an arbitrary choice of whom to invite next. But would it be fair if a certain prisoner were never let into the bulb room, as we've shown before? It wouldn't. Nor would it be fair if a prisoner were no longer let into the room after a certain moment. That's just the same as never letting him in, because up to that moment the prisoners' plan might not have moved forward. (My argumentation is kinda weak, but I'm trying to avoid spoilers.)

So the jailer's work is fair if it never happens that a prisoner stops being let into the bulb cell after some moment in time. This is closely tied to the concept of fairness in computing.

In one of my previous posts, the one about the search for a declarative language, I came up with my own, generic definition of fairness. I'll quote it, slightly reworded, here:

Fairness is a property of a nondeterministic system, which repeatedly and infinitely faces a multiple choice, to not allow such behavior that a certain item is never chosen past any moment of time.

For this prisoners riddle to be interesting and well-defined, prisoners should be let into the bulb room in a fair manner.

Applications of fairness

Task scheduling

The scheduler implemented in your operating system is one of the parts that most affect its performance. If your system supports multitasking, it should share limited CPU time between tasks to reach the goals you set. For a desktop system you might tune your process scheduler to react quickly to mouse clicks while running some tasks in the background (I can't help quoting Linus' sarcastic letter on this matter). For a system which doesn't need a small response time, but should effectively schedule long-term tasks that communicate with each other, you might need another scheduling algorithm (see my Stack Overflow question, in which I encountered such an inefficiency on a desktop-tuned system).

Usually schedulers work like this: the timeline is divided into chunks, and before the next chunk begins, the scheduler chooses which process to assign the next chunk of CPU time to. Sounds much like the work of that jailer, doesn't it?..

In scheduling, fairness has stricter definitions and more complex metrics, which calculate and impose restrictions on the fractions of CPU time allocated to tasks.

Scheduling algorithms should not adhere to communism: some tasks have higher priority than others. But schedulers should be fair when choosing the next process to take a CPU core. Indeed, if they weren't fair, it would be theoretically possible that our programs never get a chance to start if something else is happening! If the scheduler is fair, we at least know that the program will terminate (though we still don't know when exactly).

Distributed computing

The first place I heard about fairness as a separate concept was the study of distributed algorithms at my university.

When several machines run in parallel and communicate with each other, the result is usually quite a complex system to study and implement. Just recall all those race conditions and concurrency bugs you've faced!.. To study distributed systems, we should model them in some way.

Actually, what I call "fairness" is known as "strong fairness". There is also a concept of "weak fairness". Its "weakness" is that the system is only required to eventually make a certain choice if that option is continuously available from now on. If the opportunity regularly disappears, the system can ignore it and nevertheless be weakly fair.

The most obvious way to do that is to pretend that

  • at each moment of time, only one machine works, while all others are paused
  • there is an independent "oracle", which decides what machine will run next, the oracle being not associated with any machine
  • the oracle can stop a machine only after it has executed an atomic operation completely

Picture B

An example model of a machine in a network for which fairness wouldn't be the best way to describe the machine's behavior: fairness would force the machine to eventually end up in a "broken" state.

Sometimes it would be nice to assume that the "oracle" makes the machines run in a fair manner. However, this assumption is not frequently made. Consider that we need to model what happens when a machine in a network breaks. We can model this as a nondeterministic choice to either break or keep working (see picture B). Enforcing fairness would make us conclude that every machine eventually breaks, and this doesn't conform to what we want to model.

Nevertheless, that's how I first learned about fairness. The theory of distributed computing contains a lot of other interesting stuff, though it's nontrivial. Our lectures used slides like these; they might be interesting to read.

State machines

Fairness can also be defined in terms of state machines. You can check the Wikipedia article, which has a paragraph about that. Basically, there's nothing new there compared to what I described in the previous paragraphs; I even used state machines to demonstrate what I'm talking about.

Implementation of fairness

Fair choice is usually implemented as random choice. The probability distribution doesn't need to be uniform (for example, in a dumb implementation of a scheduler, it may be tied to task priorities).

However, this is not completely precise. Fairness requires every choice to eventually be made, while random choice leaves a theoretical possibility that an option is ignored forever. And if that is allowed to happen, the choice is not fair anymore. What are the odds of such behavior? Assume we have two choices, each made with probability ½, and the first choice is never made. This happens with probability ½⋅½⋅½⋅½⋅..., which is equal to exactly zero.

Therefore, the "random" implementation only implements fair choice with 100% probability, but not always. With zero probability the choice would be unfair. Isn't it too much? Only requirements of your implementation would tell. After all, we all were born and grew up the way we are with exactly zero probability...

While fairness is usually implemented with randomness, randomness itself shouldn't be forced into a system's specification. The Whenever language I described in one of my posts doesn't really benefit from the choice of a line to execute being random. A fair choice would be more suitable, as some executions, albeit of zero probability, can make a correct Whenever program useless.

***

Fairness is an important concept in our life. You should be fair to prisoners, and you should be fair when you have power and use it to allocate resources to others. In computing, it's all the same.

Definitions of fairness vary across domains of computing. In this article I described only a couple: "strong" and "weak" fairness as understood in the theory of distributed computing. Instead of a system being "totally" fair or unfair, fairness may also be a metric, with the system's behavior defined in such a way that this metric stays above a certain value.

The exact definition of fairness is complex and hard to coin in a general way: it becomes too vague. I made such an effort in the middle of this article, but if you have a better idea, it's welcome in the comments. Nevertheless, I hope this post gave you at least a notion of what fairness is. And I also hope that when you notice it in a system you design, use, or encounter, you'll use the correct and concise word to describe its behavior.

Syntax elements? User-defined functions!

Many languages, especially the non-functional ones that originate from the 20th century, have immutable syntax. Such symbols as +, (, or . have a predefined meaning, which can't be changed by the user.

Because of this, when you see such a symbol, you assume that its meaning is predefined by the specification of the language. But that's not always true.

C++

The most beautiful example of operator overloading in C++ is "lazy evaluation". Assume you defined a matrix class. You may want to use operator overloading to work with it, so you could write:
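M = A * B * C * D;   // M, A, B, C, D: instances of the matrix class (declarations elided)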

The expression at the right-hand side will be evaluated like this: first A*B will be computed and stored in some internal temporary variable tmp1, then (tmp1*C) will be computed, and stored in tmp2, etc.

However, it's known that the evaluation of a matrix multiplication chain can be optimized. So instead of the straightforward definition, where multiplying two matrices returns a matrix, we could define multiplication so that it returns a special "chain handler". This "handler" would accumulate references to all matrices involved in the multiplication, and perform the actual multiplication only once, when it gets converted to a matrix as part of the assignment.
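One possible shape of such a handler, as a sketch: the Matrix internals are elided, and multiply_optimally, the function that would pick the cheapest multiplication order, is assumed rather than implemented.

#include <vector>

struct Matrix { /* dimensions and data elided */ };

// Assumed helper: multiplies the factors in the cheapest order (definition elided).
Matrix multiply_optimally(const std::vector<const Matrix*>& factors);

struct MatrixChain {
    std::vector<const Matrix*> factors;
    // The conversion back to Matrix is where the actual work happens.
    operator Matrix() const { return multiply_optimally(factors); }
};

inline MatrixChain operator*(const Matrix& a, const Matrix& b) {
    return MatrixChain{{&a, &b}};   // record the operands; multiply nothing yet
}

inline MatrixChain operator*(MatrixChain chain, const Matrix& m) {
    chain.factors.push_back(&m);    // accumulate one more factor
    return chain;
}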

C++ is notorious for its operator overloading. You can imbue (nearly) any operator used in expressions with any meaning. This means that a simple expression such as "a = b + c" could actually invoke a complex user-defined procedure.

Of course, you can't redefine everything. The syntax elements used to define control flow can't be redefined, nor can some other operators. And, most importantly, you can't create your own operators. But even with the powers you have, you can hide userspace functions behind the usual syntax.

Linux Shell

If you've written or read Bash programs, you've probably encountered constructs like these:
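# Typical examples: a file attribute test and a string comparison.
if [ -f /etc/passwd ]; then
    echo "the file exists"
fi

if [ "$answer" = "yes" ]; then
    echo "agreed"
fi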

The [...] "conditional" expressions are used to query file attributes and to compare strings. They are builtin shell operators...or not? Let's check:

pavel@lonely ~ $ ls -l "/usr/bin/["
-rwxr-xr-x 1 root root 35032 Jul 20 01:53 /usr/bin/[

What the hell? So these brackets are actually small programs that merely take the subsequent tokens as command-line arguments? Well, they could be:

pavel@lonely ~ $ "/usr/bin/[" -f /bin/bash ] && echo Yes
Yes

Thankfully, Bash itself doesn't invoke them (it uses a built-in implementation of this functionality). You can check that with strace, or by temporarily removing this file. Perhaps the file is kept for compatibility with other shells.

This example demonstrates that what you think of as a syntax element could actually be a user-defined function.

OCaml

The Haskell language has a special notation to make an arbitrary function infix, i.e. you can make an "operator" out of a simple function:

Prelude> min 1 2
1
Prelude> 1 `min` 2
1

However, another functional language, OCaml, doesn't have such syntax. It can define a function as infix, but not "convert" an existing one to infix as shown above. And there's no hope for that to work: it would require introducing additional syntax to the language specification... would it?

# let ( /* ) x y = y x
  and ( */ ) x y = x y
  in 1 /*min*/ 2 ;;
- : int = 1

The idea is to define some functions that look and feel like "syntax", leaving the language definition intact. The above code defines infix functions named "/*" and "*/". And if you saw code like 1 /*min*/ 2 at a distance from the definition of these "functions", you would really think it's some kind of OCaml syntax.

Here's the origin of this idea.

***

Modern languages give users more freedom to define and redefine the meaning of what looks like syntax elements. This freedom can be used to write more beautiful and concise code, as shown in the examples above. By the way, do you know more examples?

Comments imported from the old website

Felix H. Dahlke on 21 November 2010 commented:

The amazing thing about Lisp is that you can (in most dialects, that is) redefine each and every function and operator of the language, because everything is expressed as an s-expression:

(+ 1 2)
(if condition then else)
...

Makes Lisp perfect for domain specific languages.

Pavel Shved on 22 November 2010 commented:

Felix, you're right. Though, I like Ruby more, since you can do nearly the same with it (and, as a bonus, it has a more accessible syntax).

But the thing is that when you can redefine everything, you expect it (like in Lisp or C++). So it becomes less amazing, compared with encountering a neat trick in a language where you can redefine virtually nothing (like OCaml or Bash). That's what I posted about.

Python does it C++'s way, mostly. All the infix operators have corresponding method names like __add__ for +. There are also right-side method names like __radd__. The matching happens at run time. You can define how a class Foo handles foo + bar, or define __radd__ to let foo handle bar + foo if bar doesn't. There's nothing quite like C++'s casting the result in an assignment, so if you did the lazy-evaluating matrix trick, you'd have to finish it with an explicit method call to evaluate or cast the result.


An example of indentation

Some languages are whitespace-sensitive, some are not, but in all of them we indent our code.

Why do we do this? To make our programs look readable. To structure them. To separate logical blocks, and collocate items that belong together.

But recently I encountered an interesting indentation pattern. It was in the C sources of an implementation of the OCaml language virtual machine.

Here's the kind of pattern it contains.
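A schematic imitation (with invented instruction names and asserts, rather than the actual OCaml VM source):

case ACC0:  accu = sp[0];             /* Case 0 */  Assert(sp >= stack_low);
case ACC1:  accu = sp[1];             /* Case 1 */  Assert(sp >= stack_low);
case POP:   sp += *pc++;              /* Case 2 */  Assert(sp <= stack_high);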

Wonderful, isn't it? The source file has much more of that.

I find the alignment of these Asserts and "Case N" comments very beautiful and "mathematical". It provides even more separation between different parts of the code. Not only is it useful, it also makes me want to print the code, put it in a frame, and stare at it, smiling.

Caching and levers

Some of the people who question the way I design systems notice that I'm a big fan of caching. While caching is a widespread and highly used technique of performance optimization, I tend to use it—and rely on it—more frequently than others. But why is that?

What is "caching"

It seems that people actually mean different things when they say "caching". In my understanding, caching means exchanging memory for computation: storing the exact results of a computation in memory, and re-using them instead of performing the actual calculations again.

This actually resembles one of the concepts of physics. When you study how levers work, you often hear that to lift a heavier object you should apply your effort at a greater distance.

In the same way, you may trade memory for speed in computing, if you use the appropriate tools.

There are different lever classes, which are based on the same principle, but work differently. And there are different caching types.

Memoization

Memoization is a caching technique which saves the results computed upon request, and provides them if they are requested again.

Memoization can be utilized, for example, in the recursive calculation of Fibonacci numbers:
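# A minimal Ruby sketch: cache results keyed by n.
FIB_CACHE = {0 => 0, 1 => 1}

def fib(n)
  FIB_CACHE[n] ||= fib(n - 1) + fib(n - 2)   # compute and store only on a miss
end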

Prefetching

Prefetching is a caching technique which computes more than requested, and provides the extra data if it's requested in the future. That's how the cache in your CPU works: when you load a memory cell into a register, the CPU first loads it into fast "cache" memory, making sure that a certain number of the neighboring cells are loaded as well. You may boost performance if you write your programs to employ locality—i.e., make it less likely that you'll request data from here and there, and keep your memory queries concentrated.

***

Essentially, you can do only memoization, only prefetching, or a combination of both. But there doesn't seem to be an agreement on what caching is. For example, the Wikipedia article about caching defines caching as prefetching. It refers to the architecture of processors (at least, to how it's implemented in x86), but the widespread understanding of caching goes beyond prefetching.

In this article I'll understand both prefetching and memoization as variations of caching.

Why I adore caching

I'm a big fan of caching because it supports one of the wisest ways of doing programming: the lazy approach.

Programmers work very hard to be lazy. Laziness is one of the basic virtues of a programmer. Being lazy partly means designing reusable systems that can be easily understood and modified. But how many times have you designed a complex system, had a hard time subdividing it into components, only to realize that the design performs awfully?

To make the system more performant, you would have to mingle several clearly defined layers, tangle otherwise isolated components, and turn your system's architecture into a horrible mess.

And here's where caching comes to help. Instead of redesigning the structure of the system, you can just cache the data that your isolated components would re-calculate due to their independence. You'll see examples of this later.

Don Knuth once produced a famous quote: "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." Ironically, the quote itself is a root of a lot of evil. But it actually supports the programming virtue I talked about. And we can be sure that caching can never be called a "premature optimization".

Where I used caching

Without concrete examples, of course, bare words don't sound convincing enough. So here's where I encountered nice opportunities to use caching.

The first example: I use it in my blog to cache the HTML version of pages. They are originally written in a special markup language, which is parsed by a recursive-descent parser auto-generated from a grammar and implemented in Ruby (it's called Treetop). It takes about 0.2 seconds to render a single blog post, so without caching you could hardly read my blog.

Ironically, the parser also uses caching internally, to memoize the nonterminal symbols it successfully parsed during failed attempts to parse other nonterminals.

Prefetching is especially useful when working with stores that process batch requests faster than individual ones. These include databases and file systems. In one of my first projects, I was given the task of storing the abstract syntax trees of C++ header files in an SQL database (you can see it here). As ASTs are loosely connected, I couldn't reliably determine which nodes would be requested at the next step of a depth- or breadth-first traversal of the graph. But what I knew is that they were stored quite tightly. So I made the program request 1000 rows from the database instead of one. That saved a lot of calls to the database.

Ironically, the pages you would see if you click the link to sample headers above are also cached. :-)

Of course, I wasn't the first to cache the results of compilation. There is a whole project, ccache, devoted specifically to this: it maintains a compilation cache and uses it when the same file is about to be compiled again.

While within a single project a cache could easily be replaced by a Makefile, it's useful when you're rebuilding several projects independently. For example, in Gentoo Linux, different versions of a system application are built separately, by design of its build system. These builds don't know anything about each other, and may be distant in time. However, if you first compile version 1.5.14 of a library, and then upgrade it to 1.5.15, a considerable fraction of the very same object files will be compiled again. And again, caching allows you to improve performance while keeping a clean architecture.

During one of the latest Google Summer of Code projects, a colleague of mine implemented a system which closed Linux kernel modules under call-graph dependencies. For each of 4000+ Linux drivers, it produced a dozen files on which the driver depended. Then another component had to compile each of these files, as each was a completely independent module with stable, developed behavior. It was obvious that instead of compiling 50,000-100,000 files, it should cache the results of their compilation.

This way we traded 10 GB of memory for several days of running time. Worth it, isn't it?

What about convenience?

Caching can actually be extremely convenient to use and, especially, to insert into an already developed application. Here's what the caching code of my blog looks like:
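# A sketch: a memcached-style fetch-or-compute call (the key name and
# renderer are illustrative).
html = CACHE.fetch("post/#{post.id}/html") do
  render_post(post)   # runs only on a cache miss; the result is stored under the key
end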

It actually means "fetch from cache by the key provided, or, if no data for the key is stored, run the functional block provided to recalculate the data, save to cache, and return the thing, which was just calculated, instead". It's so much syntactic sugar, that it may even cause diabetes... but it's the most convenient caching interface I saw. Essentially, every time you perform memoization, you have to re-implement the exact steps (see that Fibonacci example above). And a good caching interface, like the Ruby binding for memcached featured here, includes that. You may view the Railscast about using memcached for caching this way.

Most likely, you won't need to rewrite any code of the components you want to introduce caching into. In the database-prefetching example above, the code was inserted into the fetch-one-row function, which secretly prefetched rows and queried the local cache on each invocation.

Caching also provides you with certain tuning capabilities. You can limit the cache size, and discard the Least Recently Used items when you're about to exceed it. This lets you tune how much memory you trade for performance, and it works in quite a straightforward, measurable way. Just like levers.

Conclusion

Of course, caching can't help in every situation where you have performance problems. If you're unsure whether you will be able to employ caching, do the modeling, as wise men suggest.

However, caching has helped me in many situations so far, and I seem to be saying "don't worry, we'll cache this" too often to avoid being called a caching fan.