Recent blog updates

Shown here are only posts related to perl. You can view all posts here.

Logging in Expressive Languages

I am not aware of any strict mathematical definition of expressiveness of a language. Informally, a language is more expressive if it allows you to achieve the same with less keystrokes. From this point of view, for instance, Perl or Python are more expressive than C. Of course, you can't always achieve with Python most things that C is capable of at low level, and therefore one may question whether Python is really more capable than C. But capability has little to do with expressiveness. In certain domains, one language may be more expressive than the other, and in other it could be just vice versa. (For instance, Python is more expressive in string operations while C is more expressive in low-level raw data representation.)

One may notice, however, that the majority of languages are Turing complete, and you basically may express the same with all of them. What is the difference in expressiveness then, as seen by a Practicioner rather than a Computer Scientist?

I usually use the following criterion. If you write one non-splittable expression instead of a chain of statements, the language is more expressive. Compare the following two programs that compute a sum of an array in C:

...and in Ruby:

..and, in C, we also should find a place to deallocate the array when it's no longer needed.

You may notice that Ruby can achieve with a single expression what C needs several operators and variable declarations for. Three Ruby features contribute to this, anonymous lambda functions, syntax sugar to pass them to functions (such as inject), and to avoid the really useless variable declarations. Isn't it always that beautiful?

In some cases it's not.

What it takes, to log?

Now let's try to add some logging to the programs above. Assume we actually want to hook into the loop, and make it print the square result it has calculated to a logging stream.

Let's take a look at the code again.

C:

Ruby:

Ewww! Why does our Ruby code looks like C? Where is our one-liner? Where is the expressiveness? How come we're explicitly introducing a temporary variable just like in C?

I spotted such patterns, "compute-log-return" in a lot of places in my practice. For i instance, the BLAST codebase I maintain, contains these here and there, because it's written in an expressive functional language (OCaml), and uses logging as a primary debugging mechanism. Because of it, the code that should look beautiful looks as an old prostitute trying to mask their weary skin with tons of makeup, and I'm not happy with this.

Trying to get asleep on a plane, I suddenly realized that this logging could be achieved with a better code. Since "assign-debugprint-return" is a recurring pattern, let's define a helper function

Then, the resultant Ruby code will look like this:

Now our Ruby code looks like Ruby again.

Of course, this is not limited to Ruby. Though less versatile, the sane function may be implemented for the C code as well.

And this brings us back to the original comparison with 1 line in Ruby and five--in C.

Discussion

while you may write such function for a less expressive language (as seen above), it will "stand out" in a large C code sheet and look like ugly duckling. What may benefit most are languages that mix functional and imperative coding styles, such as the aforementioned Ruby and OCaml. The reason why this helper might be useful is that it's relevant to the functional languages in two different ways.

First, functional language allow the programmer to define anonymous function inline, which makes them be frequently passed as parameters to other functions. In C, defining a separate function is, inmost cases, uglier than just adding a convenient logging shortcut.

Second, logging an intermediate result contradicts the functional level paradigm. In functional languages you define a computation as a whole rather than specify concrete sequence of actions. In case of logging, you have to violate a chain of functions with a side-effect operation. Of course, you can try to make it look nice, as we did, but the very reason we need the kludge is that we're doing something a bit wrong.

But it's not limited to functional languages. Most expressive languages that do not require from user a strict specification of type of every identifier can. For example, we might want to debug Perl's regular expression substitution with using a similar function:

(we couldn't use log because it conflicted with the mathematical logarithm function that bears the same name).

***

In any case, the "compute-log-return" function that encapsulated the debug print is a widely useful pattern. And it's especially useful when you're writing a program in an expressive language. Do not let the programming cruft conceal the actual idea of your code, use loggers that hide all the details.

Read on | Comments (0) | Make a comment >>


XML Stream Parsing: Overview

XML Stream Parsing

Two articles on XML stream parsing, the first explaining where its needed and what options one has, and the second giving practical advice, both language-agnostic and Ruby-specific.

XML and the ways to parse it

Stream and "usual" XML parsing are different very different, both in implementation and in the requirements to the XML producers. This article explains why is that—in a language-agnostic way.

Actually, I wanted to write a post about my chronicles with parsing XML in Ruby scripts in the project at my work. But the article began to grow so fast that I decided to split it into two posts. This part introduces the generic aspects of XML parsing, outlines differences between "conventional" and "stream" XML parsing, and shows a couple of examples how both approaches look like in the code that uses them. I'll use Perl as a sample language, but it could be any. And in the next article I'll share some detailed tips for parsing XML as a stream in Ruby.

Conventional XML parsing

Actually, the "conventional" word is a local term coined by me as a counterpart to "stream" parsing. The "conventional" way of parsing XML is more widely known as Document Object Model convention, or DOM. The name stresses the idea that XML document is represented as a set of objects—in the OOP sense—which have methods to be called as well as properties.

"Conventional" XML parsing relies on the basic idea that the the whole XML is loaded into memory. A specific object (we would name it "element object") may be created for each particular tag. User may query this object for its attributes and contents (if it's a leaf element), as well as for children elements that satisfy specific properties (XPath expressions may select elements based on tag names, inheritance, and attributes). The XML parsing library returns results of such queries as element objects as well.

Let's start with an example. Here's a sample XML, taken from the Wikipedia article on Web Services Description Language; I saved it to my local server as it is too big to inline it.

How would the work with a document like this look like? Assume that (it's not what working with WSDL is really like!) you want to get a list of all operations of an interface with a specific name. You call something like (pseudocode!):

document being the entity object of the whole document, and receive an element object that represents the relevant tag. To get all its operations, you query the element for its children:

You get names of operations by something like operation.attribute("name"), and then look up the relevant HTTP binding for this operation by querying document again:

And keep doing the chain of such queries until you learn everything you needed from this document.

What are the attributes of this workflow?

  • Your further "requests" to this XML depend on the answers you have gotten from it before.
  • The document is loaded once at the beginning of a work, which itself comprises a lot of delays, so how fast it's parsed doesn't mater that much.
  • The document was given by an entity that is not under your control. Its elements may come in any order, because specification doesn't impose it.

Therefore, your best choice is to parse and load the document into memory, split it into separate element objects at once, and be ready to run XPath queries and get elements that satisfy them without additional overhead. That's what "conventional" XML parsing looks like.

Stream parsing

What happens at the low level when XML is initially parsed to a series of entity objects? A usual parser, probably, recursive, walks through a file. As soon as it notices that a tag is closed, on the top of its stack (in other words, in the topmost function) there is enough information to construct the entity object for the tag that has just been closed. This object is constructed and then wrapped the way complex structures in the target language appear. And at the end of the processing, the object that represents the root tag of the XML being parsed resides somewhere in the memory, and is returned by the parsing routine.

The question is, why should we wait for the whole file to parse? Why can't we just intercept tags that are interesting to us as soon as they're parsed? Why should we load the whole XML into memory?

The reasons why we sometimes can't are briefly listed in the previous section (the "attributes" of the workflow). But what if they don't apply? Then stream parsing is right what you need!

Usually XML stream parser is understood as a blocking stream that converts the stream of characters to a stream of more meaningful XML "events". Stream parser may be approached as a proxy object between character and special XML stream. It may be queried for the next event, and returns it as soon as it parses the next several tokens in the input raw stream, which may be an opened text file, or a pipe, or data that come from a socket. The events an abstract stream parser for the easiest XML I could imagine (<x><y>stuff</y></x>) are like these:

This way of parsing XML is more known as SAX parsing. "SAX" is an acronym of "Simple API for XML", and it means that API of a SAX parser is just several callbacks that are invoked at the certain simple events that occur during XML parsing, to which we refer in this section, so it's "easy" to understand.

In no way it's implied that the API is "simple" to use.

  • begin of the tag "x"
  • begin of the tag "y"
  • plain tag contents ("stuff")
  • end of the tag "y"
  • end of the tag "x"
  • end of file

The stream parser does not provide any information about the context of the place it's currently at. You should store the context on your own. It also can not handle a query with XPath expression, as to handle it it should seek for the proper tag, past which it may already be; and there's no way to rewind an abstract stream. If it's so inflexible, what's the point in stream parsing then?

It's damn fast! The sequence of the said events is yielded nearly at the speed of reading symbols, which approaches the speed of light compared to unhasty modern systems.

As for applications of stream parsers, they're quite limited. XML is mainly for storing structured information, and if the tag layout is not regular, deeply nested, and is less of a content than of structure (i.e., does not contain large text blobs), the non-stream parser is your best choice. If it doesn't hold, and if your XML is not so complex, but just damn big, then you do not have much choice aside from stream parsing. But you should know how to use it then.

Stream parsers in the wild

The sample output of a stream parser you saw earlier is not very convenient. Writing a nicely-looking and error-prone stream XML parser (i.e. the program that uses output of the said events) is a task that requires discipline and diligence. In the next post I will explain in detail how such a parser is best implemented (in Ruby). Here's how a fragment of code would look like:

However, the most easy-to-use way to employ stream XML parsers is a combined stream and "conventional" parsing. It means, stream parser seeks a tag you need, and then provides this tag as an entity object.

In Perl, there is a wonderful XML::Twig module, which does the thing you need. For example, to get all operations of a WSDL service description, you might use

This script will read the XML from file named foo_file, and just stop at tags that satisfy XPath "interface/operation" (i.e. <interface><operation>...</operation></interface> tags, and invoke a subroutine for them. The $interface_tag is an entity object, which we query for attributes, and thus extract the name attribute.

We could also specify a very useful option, "print_outside_roots => $STREAM, so that all things that are outside the tags we're interested in are printed to a file handler $STREAM as they're parsed. And those things that we would like to look at go to the same subroutines; they may be printed as well, but they may be altered before printing! This makes stream parsers (which provide such interface as XML::Twig) extremely convenient if we want to alter a couple of tags inside a large XML file.

Parsing portions of a big file in a "conventional" way is, in my opinion, a must-have feature for a stream XML parser. In Perl, the already mentioned XML::Twig module is one of those, and overview of Ruby modules is coming soon at my blog. Unfortunately, not all XML parsers support such hybrid model.

Key differences between stream and conventional XML parsing

Stream parsing is, of course, faster. but the performance gain does not necessarily cover the expense to write a more sophisticated program. What are the requirements for the XML for it to be parsed as a stream?

  • XML should be large. Most likely, parsing a small file with a sophisticated program is just not worth it.
  • XML should be homogeneous. It is a rare situation when you are to process a large XML file with an irregular pattern, but irregularity is going to cost you much, either in efficiency or in the complexity of the code you should write.
  • XML should be topologically sorted (or at least look like this). During stream parsing, you can not "rewind", and read again some data you already have read. So you either have to parse a stream two times (which is not always plausible), or store all the data you may need in the future. The latter may consume a lot of memory, which may render stream parsing not much more useful than the "conventional" one.
  • Consequently, layout of the XML you parse should be either specified, or under your control. Stream parsing yields the best combination of effort and resources consumed at runtime when your data are laid out in a specific order.

    You read XML to "process" its tags somehow, so the number of tags that can't be completely processed at the point of reading them should be minimized. For instance, meta-data and style data (smaller portion) in HTML are laid out before the page body (larger portion). If a stream parser was used for rendering web pages, it could store style information in memory, and process body tags at once, without storing them. If it was vice versa, it had to store a lot of body tags until it encountered styling information, which would be much less effective in memory consumption.

    The way HTML is laid out is governed by specifications, so we do not have to have control over the producer of XML; otherwise, we should.

If all these properties hold, congratulations, you will be able to parse XML in a fast way!

Conclusion

In this post I've outlined how a high-level processing of XML in a "conventional" (DOM) way and in "stream" (SAX-like) style is done. I analyzed the differences between the philosophy of these approaches and of requirements to XML files to be parsed in frames of each of them. I was not too specific, and just scantly touched how the use of stream parsers would look like in code.

And in the next article, I'll share my experience with stream XML parsing in Ruby, as well as I'll provide generic advice on how to write a concise and nicely-looking program that parses XML in a stream way.

Read on | Comments (0) | Make a comment >>


Porting Perl's features to Ruby

Diamond.  A Girl's Best Friend.  Ruby.  A Programmer's Best Friend.

An official motto of Ruby language with its joke explained.

As I told in my previous post, I now use Ruby as my primary scripting language. The most prominent reason is that Ruby is superior to Perl in both syntax and in expressive power. Being object-oriented, Ruby allows a lot of things to be expressed in a more concise manner, and exception handling relinquishes the developer from the burden of checking return code of every damned operation.

However, there are some features of Perl, which don't map to Ruby directly. Here's a couple of them.

Boolean String Interpretation

The first is "conversion to boolean". In Perl a lot of things convert to false in boolean context: an empty string, a zero, an undef (undefined value).

In Perl6 (and in recent releases of Perl5), there is a "//" operator, "lhs // rhs" being equivalent to "if lhs is undef, then rhs, otherwise, lhs". It will make the language slightly more expressive, but, unfortunately, won't bring us closer to our aim: making Ruby as convenient as Perl.

This allows us the following. If we want to assign a default value to a string parameter, we'd just use || operation:

However, in Ruby, only nil (undefined value) and boolean false do. So the code similar to the above (which uses Ruby's || operator) would only work if the str is nil. It wouldn't work if it was empty. And in many situation it would be very useful. Consider that you're retrieving a string, which, being not specified, should default to a value. If this string is unspecified, and comes from a user's input in a form, it'll most likely be an empty string. And if it originates from a database, it'll have a nil value. How you we write the code that sets the default value as concise as the Perl's one?

The solution is redefining operators in a class. We will introduce a new operator for strings that works like the Perl's ||. Note that we can't just redefine Ruby's ||, because it may be used by a third-party code, with which we shouldn't interfere.

One of Ruby's features, which originates from SmallTalk, is that we can redefine methods of objects and clases on-the-fly. This seems unusual for the most common C++ and Java languages for OOP support, but in Ruby a lot of thigs are implemented this way.

This feature is sometimes referred to as "monkey patching"

Let's call the new operator or, and let's redefine it for strings:

We notice that it works well if left-hand-side is a string. But what if it's a nil? Our aim was to have a proper behavior for expresions like a.or("b"), where we don't know if a's defined. But we can't redefine how nil works... can we?

In Ruby the nil keyword represents "undefined value". Techincally, however, it's just a... singleton object of class NilClass. So we can open this class the same way we did for String! In Ruby we can add methods to nil (undefined) objects! Here's a complete code for a sample program:

Check it out: it prints four "aaa" lines, just as planned. Such a convenient feature in exchange for just six lines at the utils.rb file of your project (you have a file with your most lovely code pieces, don't you?).

Interpreting lists in function call arguments

Perl has a creepy way of handling arrays of values. It uses tricky terminology to distinguish "arrays" from "lists", while they seem like a same thing. "List context" and "scalar context" add more complexity, and "references to lists" finishes a programmer's brains.

However, there's a lot of places where this nonuniformity turns into features. Here's one of them.

A regular Perl function (unless special measures are taken) takes arguments as a single list. If a function is said to "take two arguments", it takes a list instead, and interprets its first two elements as "arguments".

This looks ugly at funciton's site, but it's beautiful at caller's site. Let's write a sample program, that, say, finds all "backup" files in a given list of directories, by calling one of the Linux shell tools, find. Here's how it would look in Perl:

We pass a list of arguments to system, imbuing command-line arguments of the script @ARGV into its middle. A flat list with a list of folders at the beginning will be passed to system function, not a nested list! However, it's the nested list that will be passed in Ruby:

Ruby is more strict about its arguments, which I find more convenient than Perl, but in this particular case it leads to rejecting of concise and transparent code. Ruby does have a way to pass an arbitrary number of arguments, similar to C's "ellipsis":

The arguments beyond a certain point form a list, which can be explored. However, it's not of much help, since Kernel.system can take an arbitrary number of arguments, but, nevertheless it doesn't work.

What Ruby lacks is interpretation of function call arguments as a list at caller's site. But since we can't change it at caller's site, let's change at callee's! Here's how our wrapper around the system function may look like:

This will work, and you'll see an array of arguments printed on your screen before the command executes.

But that's not all. What about nested lists? Perl doesn't have nested lists, it flattens them all at construction. You could, of course, create nested data structures with "references", but in a pass-by-reference Ruby language it's not an option.

For example, if you call a similar wrapper with arguments like these:

it will flatten it down to a list of eight arguments, which may not be what we want. Instead, we might expect the fourth argument to be a three-element list, while List.flatten wouldn't keep it. The solution would be to use another function (I called it soften), which we can create by the same monkey-patching technique. Here's a code of a sample solution:

Here's the script's output, which demonstrates that it works as expected.

[1, 2]
[1, 2, 3, 4, 5]
[1, 2, 3, 4, 5, 4, 5, 6, 7, 8]
["nohup", "ls", "-l"]
[1, 2, 3, 4, ["nesting", "is", "important"], 5, 4, [5], 6, 7, 8, 9]

***

I prefer Ruby to Perl because it's more strict. Its strictness makes Ruby less tolerant to some perl-ish "tricks", though. On the other hand, the flexibility of Ruby allows us to mimic these features without even fixing the interpreter.

This makes me suggest that "strict but flexible" is a more preferable option to "more free, but less flexible". Keeping ungrounded assumptions aside, at least, the Ruby tricks listed above make its use more convenient, especially if you are used to being an undisciplined Perl programmer.

Read on | Comments (0) | Make a comment >>


More posts about perl >>