Recent blog updates

Shown here are only posts related to ocaml. You can view all posts here.

The Pipe-Pipe-Equals

It is hard to come up with a Google-friendly name for the ||= construct you see quite often in some programming languages; "pipe-pipe-equals" is the closest (other names include "double-pipe equals," "or-equal," and "double-or equals"). Why should we name this monster in the first place? While cryptic at first sight, it is a very convenient shorthand that, I believe, allows us to write cleaner code. Let's see how it works on this example:
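A minimal sketch in Ruby (the variable and the default value here are made up):

variable ||= "a default value"   # assigns only if variable is nil or false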

Essentially, the ||= construct is equivalent to other convenience operators, such as +=: x += 2; means x = x + 2. The "or" operator || usually denotes a weak (short-circuit) logical "or" that doesn't even evaluate its right-hand side if the left-hand side is true. Usually, meaningful values evaluate to true, while the only values evaluating to false are NULL pointers/empty values, zeroes, boolean Falses, and, sometimes, empty strings.

The statement is mostly used to set a variable to a "default" value if it's "unset" at some point. This gives the most visible advantage when you initialize dictionary elements in a loop. Here's a piece that saves the counts of each array element into a dictionary (Ruby):
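A sketch, with a made-up input array:

array = [1, 2, 2, 3, 1, 2]
output = {}
array.each do |x|
  output[x] ||= 0      # initialize the counter the first time we see x
  output[x] += 1
end
# output is now {1=>2, 2=>3, 3=>1}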

Alternatively, you could give the hash a default value, but ||= is used more widely, and also takes exactly one line... doesn't it? ;-)
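A sketch of that alternative, with the same made-up array:

output = Hash.new(0)              # unknown keys start at zero
array.each {|x| output[x] += 1}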

In some languages (Perl, for instance), you need not even bother with this, as += 1 on an unset value simply sets it to 1.

If you don't initialize output[x], you'll get a runtime error for trying to increment nil. The advantage of ||= over the alternatives is that you don't repeat anything. You could have written the same piece as:
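Something along these lines (a sketch):

array.each do |x|
  if output[x].nil?
    output[x] = 0
  end
  output[x] += 1
end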

Oh yes... we forgot about the most natural use of ||=: booleans. Here's how we'd check whether an array contains zeroes if we need to iterate over its elements for something else anyway:
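A sketch:

contains_zero = false
array.each do |x|
  contains_zero ||= (x == 0)
  # ... whatever else the loop is really iterating for ...
end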

Pipe-pipe-equals is rarely used this "natural" way, though.

Back to the counting example: with the explicit check you have to type output[x] twice and add two more lines of code, which is a complete waste of screen space and, possibly, of computing resources if the interpreter doesn't optimize the duplicated computation away. Let's take a look at how pipe-pipe-equals works in different languages (we've already seen it in action in Ruby).

Perl

Perl was the first interpreted language I learned, and the first place I saw the pipe-pipe-equals operator. It works as expected, but is used less often than the plain short-circuit "or" for specifying default function parameters. Here's how you'd put a default into a hash bucket:
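A sketch (the hash, the key, and the defaults are all made up):

use strict;
use warnings;

my %settings;
my $key = "color";
$settings{$key} ||= "black";        # set a default for the bucket unless it's already set

# the more common idiom: defaulting a function parameter
sub greet {
    my $name = shift || "world";
    print "Hello, $name!\n";
}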

This works under strict mode, too. Note that, in Perl, many things evaluate to false, including empty strings (I once even tried to emulate that in Ruby). To restrict the assignment to undefs only, use //= instead (the defined-or assignment, available since Perl 5.10 and also in Perl 6); it sinks to its right like the Tower of Pisa and looks as if you tried to type a division while drunk.

Python

Python has no equivalent of this operator. You have to type the very[long(expression)] twice, and it will be computed twice. There are several different ways to do this:
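A few sketches (all the names are made up):

x = None
y = "default"

x = x or y                # closest in spirit, but x is typed (and evaluated) twice

if not x:                 # the explicit check; x is repeated again
    x = y

d = {}
d.setdefault("key", []).append(1)   # dictionaries: supply a default for a missing key
value = d.get("key", [])            # ...or read with a default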

Engineers more experienced in Python programming, though, assure me that this is not a big deal, since Python has a different way of supplying default arguments to a function, which covers perhaps half of the use cases for pipe-pipe-equals (that doesn't prevent Ruby from having both, though). Another chunk is covered by the dictionary's dict.get(key, []) method, which lets code piece #1 be written in a succinct manner. But I still miss it.

Bash (Linux shell)

Bash? Its language is so simplistic, and it looks creepy; how come it has an equivalent of such a beautiful shortcut? Here it is:
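A sketch (the variable names are made up):

y="some default"
: "${variable:=$y}"        # assigns $y to variable only if it is unset or null
echo "$variable"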

This assigns y to variable if the latter is unset or null (as per the Bash manual). I mostly used it to set default values for environment variables the user may or may not have set before invoking the script.

C++

While |= is a syntactically correct C++ expression (there is no short-circuit ||= counterpart), it doesn't do what we've discussed here.

C++ is statically typed, so the result of the built-in logical "or" is boolean. Retaining the nice semantics we find in dynamically typed languages would require the result to be "either the type of the left-hand side or the type of the right-hand side," which is hard to pull off in a statically typed, pass-by-value language.

Pass-by-value semantics also means that not everything can be assigned a NULL, and not everything converts to a boolean value. C++ does have default arguments, just like Python, so the same reasoning applies; you'll simply have to be more verbose in C++. That's probably why only the |= expression is available, which is useful only when its left-hand side is a bool (see the sidebar above for a similar usage).
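For completeness, a small sketch of that boolean usage (the function and the vector are made up):

#include <vector>

bool contains_zero(const std::vector<int>& v) {
    bool found = false;
    for (int x : v)
        found |= (x == 0);    // no short-circuiting, but the flag never goes back to false
    return found;
}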

OCaml

Everything said above about C++ applies to OCaml as well. Moreover, OCaml, being a functional language, doesn't have flawless support for mutation, and the pipe-pipe-equals statement is inherently mutational. Its match construct would also require us to type the very_long_variable twice. However, OCaml and other functional languages have a very interesting construct called "option". A value of type "X option" contains either "nothing" or a value of type X, and it can be "unpacked" through pattern matching:

let very_long_variable = match very_long_variable with None -> y | Some t -> t

Here, t is not a shorthand for another very long expression; it's just an identifier, written as is. The match...with construct lets us "unpack" values of structured (algebraic) types with shorthands like this. Since this is still too long, OCaml libraries provide it as a function, Option.default:

let very_long_variable = Option.default very_long_variable y

Anyway, OCaml programs are even more explicit than those in C++ and Python, so trying to shoehorn pipe-pipe-equals into them is rather pointless.

Ruby

"Erm, we saw how it works in Ruby at the beginning," you might think. Well, I lied to you a bit. The thing is that, in Ruby, x ||= y is not strictly equivalent to an assignment of the result of a logical "or". Which of the following do you think x ||= y is equivalent to?

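The two natural candidates would be:

x = x || y        # candidate one
x || x = y        # candidate two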
In Ruby, and only for ||= and &&=, it's the second. If you assign to something other than a local variable, what looks like an assignment is actually a method call (think properties), and, if so, that call does not happen at all when the left-hand side of ||= evaluates to a true value. Which makes sense, but looks like a special case. Read more here.
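A sketch of the consequence (the class here is made up):

class Config
  attr_reader :mode
  def mode=(value)
    puts "setter called"
    @mode = value
  end
end

c = Config.new
c.mode = "fast"       # prints "setter called"
c.mode ||= "slow"     # prints nothing: mode is already truthy, so mode= is never invoked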

Why You Might Not Need This

Some argue that this operator is mostly useless, especially if their favourite language doesn't have it. Here are some arguments they list.

Redundancy

Indeed, this expression is redundant. You can do the same thing in multiple different ways. All the examples I demonstrated above show how to write what is essentially a simple if statement in very shorthand form. The number of characters spared is probably not enough to put this feature on a language designer's must-have checklist.

The statement discussed decreases redundancy in code at the cost of a broader language definition; each language seeks a balance between the two, and often leaves pipe-pipe-equals aside.

Default Function Arguments

The ability to specify default function arguments (C++, Python, Ruby, but not Bash or Perl) covers many of the use cases for ||=. However, it doesn't help with fancy loops that fill complex structures, such as the one shown above. Nor does it help when you have to specify the default value anyway, but the concrete value is not known at coding time, e.g. an optional field the user may or may not fill.

Confusion

It is confusing. Pipe-pipe-equals requires an explanation of how it works. It becomes a special snowflake, different from its "mathematical" counterparts such as +=, once a sufficiently complex language wants to make it really useful (see the Ruby section above). While "confusion" as an excuse for not doing something is my pet peeve (and I tried to explain why), it indeed takes some effort to understand the mechanics. I'm sure you'll love it once you understand it, though; the way I learned about this operator was not by reading a textbook or "The Most Confusing and Obscure Programming Language Features Possible" newsletter, but by reading someone else's code.

To Use or Not To Use?

When it comes to deciding whether or not to use pipe-pipe-equals in the "creepy" ways we discussed throughout the post, the criticism somehow fades out. If the language supports it, it gets used. I haven't encountered any coding style document that bans this feature, and in Ruby and Perl it is considered as good an idiom as any. So the answer to the question of whether or not you should use pipe-pipe-equals (||=) in your programs is, definitely, "yes".



Logging in Expressive Languages

I am not aware of any strict mathematical definition of the expressiveness of a language. Informally, a language is more expressive if it allows you to achieve the same thing with fewer keystrokes. From this point of view, for instance, Perl or Python is more expressive than C. Of course, you can't always do in Python what C is capable of at the low level, so one may question whether Python is really more capable than C. But capability has little to do with expressiveness. In certain domains one language may be more expressive than another, and in other domains it may be just the other way around. (For instance, Python is more expressive for string operations, while C is more expressive for low-level raw data representation.)

One may notice, however, that the majority of languages are Turing complete, so you can basically express the same things in all of them. What, then, is the difference in expressiveness, as seen by a practitioner rather than a computer scientist?

I usually use the following criterion: if you can write one non-splittable expression instead of a chain of statements, the language is more expressive. Compare the following two programs that compute the sum of the squares of an array's elements, first in C:
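A sketch (the array is assumed to be allocated and filled elsewhere):

int sum_of_squares(const int *a, int n)
{
    int i, sum = 0;
    for (i = 0; i < n; i++)
        sum += a[i] * a[i];
    return sum;
}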

...and in Ruby:
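Again a sketch, with a standing for the same array:

sum = a.inject(0) {|acc, x| acc + x * x}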

...and, in C, we should also find a place to deallocate the array when it's no longer needed.

You may notice that Ruby achieves with a single expression what C needs several statements and variable declarations for. Three Ruby features contribute to this: anonymous lambda functions (blocks), the syntactic sugar for passing them to other functions (such as inject), and the absence of useless variable declarations. Isn't it always that beautiful?

In some cases it's not.

What does it take, to log?

Now let's try to add some logging to the programs above. Assume we actually want to hook into the loop and make it print the square it has just calculated to a logging stream.

Let's take a look at the code again.

C:
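The same sketch, now printing each square to stderr:

#include <stdio.h>

int sum_of_squares(const int *a, int n)
{
    int i, sum = 0;
    for (i = 0; i < n; i++) {
        int sq = a[i] * a[i];
        fprintf(stderr, "square: %d\n", sq);
        sum += sq;
    }
    return sum;
}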

Ruby:
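And the Ruby sketch with the same logging bolted on:

sum = a.inject(0) do |acc, x|
  sq = x * x
  $stderr.puts "square: #{sq}"
  acc + sq
end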

Ewww! Why does our Ruby code look like C? Where is our one-liner? Where is the expressiveness? How come we're explicitly introducing a temporary variable, just like in C?

I've spotted this "compute-log-return" pattern in a lot of places in my practice. For instance, the BLAST codebase I maintain contains it here and there, because BLAST is written in an expressive functional language (OCaml) and uses logging as its primary debugging mechanism. Because of this, code that should look beautiful looks like an old prostitute trying to mask her weary skin with tons of makeup, and I'm not happy with that.

Trying to fall asleep on a plane, I suddenly realized that this logging could be achieved with better code. Since "assign-debugprint-return" is a recurring pattern, let's define a helper function.
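Something like this, say (the helper's name, log_and_return, is made up):

def log_and_return(value)
  $stderr.puts value
  value
end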

Then, the resultant Ruby code will look like this:
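Assuming the log_and_return helper sketched above:

sum = a.inject(0) {|acc, x| acc + log_and_return(x * x)}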

Now our Ruby code looks like Ruby again.

Of course, this is not limited to Ruby. Though less versatile, the same function may be implemented for the C code as well.
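A sketch, reusing the made-up names from above:

#include <stdio.h>

static int log_and_return(int value)
{
    fprintf(stderr, "%d\n", value);
    return value;
}

int sum_of_squares(const int *a, int n)
{
    int i, sum = 0;
    for (i = 0; i < n; i++)
        sum += log_and_return(a[i] * a[i]);
    return sum;
}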

And this brings us back to the original comparison: one line in Ruby and five in C.

Discussion

While you may write such a function for a less expressive language (as seen above), it will "stand out" in a large sheet of C code and look like an ugly duckling. The languages that benefit most are those that mix functional and imperative coding styles, such as the aforementioned Ruby and OCaml. The reason this helper is useful is that it's relevant to functional languages in two different ways.

First, functional languages allow the programmer to define anonymous functions inline, so they are frequently passed as parameters to other functions. In C, defining a separate function is, in most cases, uglier than just adding a convenient logging shortcut.

Second, logging an intermediate result contradicts the functional paradigm. In functional languages you define a computation as a whole rather than specify a concrete sequence of actions. To log, you have to interrupt a chain of functions with a side-effecting operation. Of course, you can try to make it look nice, as we did, but the very reason we need the kludge is that we're doing something slightly wrong.

But it's not limited to functional languages. Most expressive languages that do not require the user to strictly specify the type of every identifier can benefit from it. For example, we might want to debug Perl's regular expression substitution using a similar function:
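A sketch (the helper's name, dbg, and the substitution itself are made up):

use strict;
use warnings;

sub dbg { print STDERR "$_[0]\n"; return $_[0]; }

my $line = "price: 21";
$line =~ s/(\d+)/dbg($1 * 2)/e;    # log each replacement as it is computed
print "$line\n";                   # prints "price: 42"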

(we couldn't use log because it conflicted with the mathematical logarithm function that bears the same name).

***

In any case, a "compute-log-return" function that encapsulates the debug print is a widely useful pattern, and it's especially useful when you're writing a program in an expressive language. Do not let programming cruft conceal the actual idea of your code; use loggers that hide all the details.



OCaml Hash Table Chronicles

While preparing the static verification tool I maintain, which is written in OCaml, for the Competition on Software Verification, I noticed a strange performance problem. Queries to a hash table seemed to take too much time. Of course, there were a lot of such queries, but a greater number of queries to a much more complicated data structure, a Binary Decision Diagram, caused the CPU less trouble.

I had been troubleshooting the issue for several hours, and the spot where it hit me was, as Dr. Watson would say after Sherlock Holmes explained the case to him, elementary. But let's lay out the case from the very beginning.

What an OCaml hash table looks like

The hash table implementation in OCaml, in addition to the expected capabilities, can also keep history. If you add several values for the same key, you may either query the latest value added (via the find function) or query them all in the order of appearance (via its find_all counterpart). You can also remove the latest binding of a key (via remove), so that subsequent finds return the previous value it was bound to.
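A quick sketch of this behaviour:

let () =
  let h = Hashtbl.create 16 in
  Hashtbl.add h "key" 1;
  Hashtbl.add h "key" 2;                        (* hides, but does not erase, the old binding *)
  assert (Hashtbl.find h "key" = 2);
  assert (Hashtbl.find_all h "key" = [2; 1]);   (* the whole history, latest first *)
  Hashtbl.remove h "key";                       (* drops only the latest binding *)
  assert (Hashtbl.find h "key" = 1)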

"Who would want that?" you might ask. It turns out this feature is very convenient. First, you get a standard-library data structure for "a map from something to a list of something" with nearly the same interface as the plain one, without the "list". Second, and more importantly: when you have written a multitude of code around a table, and only then realize that you need not only the current value for a key but, under certain circumstances, the history of its updates as well, you might get really upset. Especially in a language as strict as OCaml. Then you'll praise the opportunity to keep all the code the same, querying the history just where needed.

The latter is what happened in our project. And it was in this history querying that we started to encounter performance problems.

Batteries not included

The first problem was that the standard OCaml hash table interface doesn't have a remove_all function that clears the whole history of a given key! To do that, we had to call remove as many times as there were values for the key. Our profiler demonstrated that nearly 25% of the runtime was spent here.
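So our removal code had to do something like this (a sketch):

let remove_all tbl key =
  (* one remove per recorded binding: each call scans the bucket again *)
  List.iter (fun _ -> Hashtbl.remove tbl key) (Hashtbl.find_all tbl key)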

I integrated a hash table from the "OCaml Batteries Included" project, which provides extensions to the standard library. Its hash table did have a remove_all function, and I hoped it would help.

It didn't. It decreased the runtime of removal, but not dramatically; besides, the run times of find and find_all remained bigger than I felt they should have been.

Digging deeper...

It was time to dig deeper into the hash table implementation. I didn't frown upon that: OCaml is a functional language, and I've been playing with it for quite a while. Every now and then I notice strikingly nice implementations of well-known algorithms I had learned in imperative terms (the nicest being depth-first traversal). So it was no chore to take a look at its internals.

Unfortunately, a hash table is too imperative a data structure for its implementation to look uncommon. It is a plain array with a list attached to each of its elements.

What was surprising is that storing the history turned out to be simpler than not storing it! When a value for a key is added to the table, it's just inserted at the beginning of the bucket's list, so the usual, simple search for a matching key in the list always hits the latest value bound. Removing a key is just as simple: you do the same search, and the first value found is removed, making the value added before it available for the next lookup, exactly in historical order.

No wonder our "remove a value for a key as many times as there were bindings" didn't work fast: its runtime was O(n²), n being the size of the bucket, while it should be O(n). Of course, the worst-case linear dependence on bucket length couldn't go anywhere, but at least we stopped wasting valuable time in a useless quadratic loop.

Still, it wasn't fast enough. The total time spent removing values decreased, but didn't become negligible. Why could that be?

Perhaps there were just too many values? I inserted some debug prints into the hash table implementation and measured how many iterations remove_all made per operation. Yes, indeed too many. Given a good hash function, the bucket size should be close to the number of elements divided by the size of the hash table's array. And OCaml's standard universal hash function should definitely be good; it even has a C implementation tuned for speed. So the solution could be to increase the hash table size.

When increasing the hash table size didn't help either, I was quite surprised. How come? I was really lost.

...Until I added a print of the hash function's values. They repeated quite often, and even keys that differed only slightly had the same hashes, whereas a good hash function should assign dramatically different values to slightly different inputs.

I took a closer look at the hash function and noticed some magic numbers (10 and 100). What were they?

external hash_param : int -> int -> 'a -> int = "caml_hash_univ_param" "noalloc"

let hash x = hash_param 10 100 x

I referred to the documentation, and... uh, I wish I had started with it!

val hash : 'a -> int

Hashtbl.hash x associates a positive integer to any value of any type. It is guaranteed that if x = y or Pervasives.compare x y = 0, then hash x = hash y. Moreover, hash always terminates, even on cyclic structures.

val hash_param : int -> int -> 'a -> int

Hashtbl.hash_param n m x computes a hash value for x, with the same properties as for hash. The two extra parameters n and m give more precise control over hashing. Hashing performs a depth-first, right-to-left traversal of the structure x, stopping after n meaningful nodes were encountered, or m nodes, meaningful or not, were encountered. Meaningful nodes are: integers; floating-point numbers; strings; characters; booleans; and constant constructors. Larger values of m and n means that more nodes are taken into account to compute the final hash value, and therefore collisions are less likely to happen. However, hashing takes longer. The parameters m and n govern the tradeoff between accuracy and speed.

Standard library documentation for hash table, see here.

The keys of that hash table were complex structures (think recursively folded C structures), and many of them shared large parts, which made hash values collide far too often. I changed the hash function used for that table to Hashtbl.hash_param 100 100, and the whole table became as fast as a jaguar in a savanna.
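One way to plug such a hash in is through the Hashtbl.Make functor; here's a sketch with a made-up key type:

type expr = Const of int | Add of expr * expr | Mul of expr * expr

module ExprHashtbl = Hashtbl.Make (struct
  type t = expr
  let equal = (=)
  let hash x = Hashtbl.hash_param 100 100 x   (* look at up to 100 meaningful nodes *)
end)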

Lessons learned

The first lesson is that you should read the documentation before doing hard debugging. Of course, feeling like a file-system hacker is invaluable, but if you are after delivering something, taking it slower and reading the manual first pays you back in the long run.

When I was writing this post (and that's another cool thing about writing a blog), I suddenly remembered another issue I had with another hash table. Its keys were also complex structures, but that table didn't use them directly. Instead, the structures were converted to strings, and only those were used as keys. I thought that was too redundant, but the straightforward way, using the structures as keys, was, surprisingly, slower. Back then I merely rolled back the changes, but now I understand the reason. So, perhaps, converting structures to strings (especially when there's a natural one-to-one mapping) helps, as the whole structure is then taken into account when distinguishing keys.

I also realized that keeping the history of hash table updates is not such a nice feature after all. Rather, it is a small bonus that lacks supporting machinery (I had to integrate code from another library to get such a natural operation as remove_all). Alternatively, it can be viewed as saving time in exchange for leaving older, unneeded bindings unremoved.

And last but not least, I saw a good example of how different languages teach you new concepts and new ways of looking at things you're already used to. Such an experience, even at the cost of half a day, was totally worth it.



Syntax elements? User-defined functions!

Many languages, especially the non-functional ones that originated in the 20th century, have immutable syntax. Symbols such as +, (, or . have a predefined meaning, which can't be changed by the user.

Because of this, when you see such a symbol, you assume its meaning is predefined by the specification of the language. But that's not always true.

C++

The most beautiful example of operator overloading in C++ is "lazy evaluation". Assume you defined a matrix class. You may want to use operator overloading to work with it, so you could write:
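Something like this (Matrix is the user-defined class; A, B, C and D are matrices defined elsewhere):

Matrix M = A * B * C * D;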

The expression on the right-hand side will be evaluated like this: first, A*B will be computed and stored in some internal temporary variable tmp1; then (tmp1*C) will be computed and stored in tmp2, and so on.

However, it's well known that the evaluation of a matrix multiplication chain can be optimized by choosing the order of multiplications. So instead of the straightforward definition, where multiplying two matrices returns a matrix, we could make multiplication return a special "chain handler". This handler accumulates references to all matrices involved in the multiplication and performs the actual computation only once, when it gets converted to the matrix M as part of the assignment.
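Here's a rough sketch of that idea (heavily simplified; a real implementation would pick the optimal multiplication order and actually do the math):

#include <vector>

struct Matrix {
    // storage and a real multiplication algorithm would live here
};

struct MatrixChain {
    std::vector<const Matrix*> factors;   // matrices collected so far; no math yet

    operator Matrix() const {
        // the actual work happens once, when the chain is converted to a Matrix
        return Matrix{};
    }
};

inline MatrixChain operator*(const Matrix& a, const Matrix& b) {
    return MatrixChain{{&a, &b}};         // start accumulating references
}

inline MatrixChain operator*(MatrixChain chain, const Matrix& next) {
    chain.factors.push_back(&next);       // keep accumulating
    return chain;
}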

C++ is notorious for its operator overloading. You can imbue (nearly) any operator used in expressions with any meaning. This means that even a simple expression such as "a = b + c" could actually invoke a complex user-defined procedure.

Of course, you can't redefine everything. The syntax elements used for control flow can't be redefined, nor can a few operators (such as ::, ., .*, and ?:). And, most importantly, you can't create operators of your own. But even with the powers you do have, you can hide userspace functions behind the usual syntax.

Linux Shell

If you have written or read Bash programs, you have probably encountered constructs like these:
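For instance (a made-up sketch):

if [ -f "$HOME/.profile" ]; then
    echo "profile found"
fi

[ "$1" = "--verbose" ] && set -x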

The [...] "conditional" expressions are used to query file attributes and to compare strings. They are built-in shell operators... or are they? Let's check:

pavel@lonely ~ $ ls -l "/usr/bin/["
-rwxr-xr-x 1 root root 35032 Jul 20 01:53 /usr/bin/[

What the hell? So these brackets are actually small programs that merely take the subsequent tokens as command-line arguments? Well, they could be:

pavel@lonely ~ $ "/usr/bin/[" -f /bin/bash ] && echo Yes
Yes

Thankfully, Bash itself doesn't invoke this external program (it uses a built-in implementation of the same functionality). You can check that with strace, or by temporarily removing the file. Perhaps the file is kept for compatibility with other shells.

This example demonstrates that what you think of as syntax elements could actually be a user-defined function.

OCaml

The Haskell language has a special notation that turns an arbitrary function into an infix operator. I.e., you can make an "operator" out of a plain function:

Prelude> min 1 2
1
Prelude> 1 `min` 2
1

However, another functional language, OCaml, doesn't have such syntax. It can define functions that are infix, but it can't "convert" an existing function to infix as shown above. And there seems to be no hope for it: that would require adding new syntax to the language specification... would it?

# let ( /* ) x y = y x
  and ( */ ) x y = x y
  in 1 /*min*/ 2 ;;
- : int = 1

The idea is to define functions that look and feel like "syntax" while leaving the language definition intact. The code above defines infix functions named "/*" and "*/". And if you saw code like 1 /*min*/ 2 far away from the definition of these "functions", you would really think it's some kind of OCaml syntax.

Here's the origin of this idea.

***

Modern languages give users more freedom to define and redefine the meaning of what looks like syntax elements. This freedom can be used to write more beautiful and concise code, as shown in the examples above. By the way, do you know more examples?



An example of indentation

Some languages are whitespace-sensitive, some are not, but in all of them we indent our code.

Why do we do this? To make our programs look readable. To structure them. To separate logical blocks, and collocate items that belong together.

But recently I encountered an interesting indentation pattern. It was in the C sources of an implementation of the OCaml language virtual machine.

(I'll just add a couple of paragraphs, so that "Recent updates" div doesn't make the indentation really ugly.)

(Yes, you should also have a screen wide enough. If you see a messy crap down there, just maximize your browser window, and stretch it beyond the screen border if even that doesn't help.)

Here's the kind of pattern I mean:
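A made-up sketch rather than the verbatim source, but it shows the idea:

#include <assert.h>

int classify(int n)
{
    if (n == 0)          { assert (n >= 0);     /* Case 1 */ return 0; }
    else if (n > 0)      { assert (n != 0);     /* Case 2 */ return 1; }
    else                 { assert (n <  0);     /* Case 3 */ return 2; }
}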

Wonderful, isn't it? The source file has more of that throughout.

I find the alignment of these Asserts and "Case N" comments very beautiful and "mathematical". It provides even more separation between different parts of the code. And not only is it useful: it also makes me want to print the code, put it in a frame, and stare at it, smiling.



Programming as Controlling Mindless Zombies

The question that bothers many developers and, maybe, even people in related fields is "What is Programming?"

Indeed, many of us do it every day (the more seasoned ones taking a break on weekends), but who has really bothered with a philosophy of what he's doing? Since I'm fond of making up my own philosophy, I'll try to ponder the subject in several blog posts. In this one I'll address one side of it, and you, having read the title, already know which.

Computer programming is indeed controlling a horde of mindless zombies. Let's check the definition of a zombie on Wikipedia:

A zombie is a creature that appears in books and popular culture typically as a reanimated dead or a mindless human being. Stories of zombies originated in the Afro-Caribbean spiritual belief system of Vodou, which told of the people being controlled as laborers by a powerful wizard

First of all, computers are dumb. They really are. Just compare the time you need to devise an idea, to think it through, with the time you have to spend explaining it to a computer: a machine that understands only ugly, primitive texts that make no sense to any human unless he's a programmer. How many times have you scolded your computer for not being as smart as you? For not understanding that here he shouldn't do this, because it's obvious that he should do that instead?

Recently I was writing an OCaml program. OCaml is a very smart language that can infer the types of identifiers by looking at the operations applied to them. That's right, he's smart, as long as things go the way he's used to perceiving them. But as soon as you write "separate the list into a list concatenated with an element" instead of "separate the list into an element concatenated with a list", he becomes unbelievably dumb. If he were a Russian bureaucrat, I would even think he's doing it on purpose to extort a bribe...
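To illustrate (a made-up snippet):

(* "an element concatenated with a list" is a pattern OCaml gladly accepts... *)
let first_or_zero lst =
  match lst with
  | x :: _ -> x
  | [] -> 0

(* ...but "a list concatenated with an element" is not a pattern at all: @ is an
   ordinary function, not a constructor, so "| rest @ [x] -> x" won't compile *)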

So, computers are totally incapable of doing anything they were not explicitly told to do. That's quite a good definition of being mindless.

But these mindless machines happen to do useful work. And the complexity of this work, usually beyond what any single human could handle, sometimes gives the false sense that these machines possess intellect. But that's not an intellect of their own. The intellect within a binary is that of the programmer who coded it, of the tester who checked that the dumb machine doesn't crash everything around it, of the manager who gathered them, and of the millions of people who invested their mindpower into compilers and interpreters, the things that make one mindless machine teach another. I can't help quoting the famous Conway's law:

organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations

(quoted via Wikipedia). This law is a direct consequence of the fact that computers are dumb, and that it's we, the humans, the organizations, who instill our own properties into them.

So the result of computer programming is a horde of creatures that carry out the will of their masters. Well, sometimes not exactly what the masters meant, which is why you see strange error messages here and there, where you really don't expect them. That's because the creatures are so dumb that a separate programming discipline is devoted to "talking to zombies" (it's also known as "coding"), and that discipline is not practiced well enough by some developers.

And the last thing to conclude the post. While it's always hard to instruct a zombie, it's sometimes also hard to make an already instructed zombie fulfill your needs as a user. Therefore, users are sometimes required to possess the one thing these creatures do not have: brains...

Mindless creatures that look for brains? That just can't sound more zombie! And it's precisely what we deal with while programming.


