Recent blog updates

Shown here are only posts related to perl. You can view all posts here.

SF Muni LED Sign at Home with Raspberry Pi

"SF Muni Sign at Home" series

Several post on how I was building a San Francisco Muni sign that shows train arrival times at home.

Something like this, except for the girl.

My android phone wakes me up with its stock alarm clock into a cold San Francisco summer morning. I lie around a bit, but eventually get up to enjoy yet another day. An hour later, I'm running in my dress shoes, tie waving in the wind like that of an anime superhero schoolboy, towards a light rail train stop. Twenty seconds before I get there, the train leaves, right before an unanticipated 20-minute break in the service. I eat a bagel I don't want in a cafe nearby to sit there and work while waiting; my day, having barely started, is already ruined.

Could I have made it? Of course, I could. I could have not re-tied my tie in order to drape it half inch shorter; I could have not procrastinated on reddit for five minutes while sipping tea, I could have postponed paying my bill to the evening. On the other hand, I don't want to finish too early and wait at home instead of in a cafe. I need something to let me know, at all times, when the next train is.

Observations

San Francisco "Muni" public transportation vehicles, like those of many other modern big cities, are equipped with real-time GPS trackers, and publish location data on the internet. My smartphone has QuickMuni app installed, so I could easily be checking it. I would pick up my phone, unlock it, wait for all the lags Adnroid greets me with, wait while the app downloads the upcoming train times, and repeat this every five minutes. This would be implausible. If only I had something like a wall clock so that it constantly displays upcoming train time predictions...

One morning I located the item I wanted in the wild. Actually, I've been seeing it every day, but was not realizing that it would solve my problem.

That is it! I can mount a similar sign on my wall, drive it with something like Raspberry Pi or even my desktop PC, and write some software to download, process, and display the train arrival times! I can even teach it to show weather forecast to quickly see if I need to take an umbrella.

The Result

The sign I used is probably not the only one with Perl API: here's something else on CPAN that is probably about these guys, but I didn't try them so I can't vouch that they work.

Well, let's go for it. I purchased Raspberry Pi from Amazon, and purchased this sign from BrightLEDSigns.com. I was looking for at least 16 rows of LEDs (to squeeze two lines of text,) and for a simple interface to draw pictures on it, as I didn't want to spend time to low-level programming. I got the sign for $89 sale price.

After several nights of hacking, I finally had it: my own, always-on customized dashboard with relevant morning information.

This sign shows that the next 38-Geary buses will come to a specific, preset stop in 2, 7, 14, 27, and 40 minutes; ellipsis means a long wait, the next line shows fewer arrivals for "backup" 5-Fulton line, and it also shows that it's 55 degrees outside right now, and it will be 56 at 8pm.

I also implemented a rendition of a real Muni sign as a byproduct of this work. It has some bugs, and it differs from a real sign, but I didn't want to spend much time on that since I was not going to use it anyway. It looks good for demonstration purposes, though. It's animated, here is a video!

The rest of the post describes what components I assembled this thing from, how I configured the Raspberry Pi, what software I used to program the dashboard and to utilize external dependencies, and what problems and bugs I encountered on the way.

Source Materials

Here's the complete list of electronic components I used:

  • Programmable LED sign - programmable with a simple API, monochrome, at least 16 pixels high, bought from this vendor the 16x96 one, with Perl API and USB interface.
  • Raspberry PI - The famous $25 energy-efficient computer I managed to buy on Amazon for $40;
  • USB Wi-fi dongle - found a uselss one, origially bought on Amazon
  • SD-card 8 Gb - Raspberry Pi uses SD-cards as "hard" disks, to which it installs an operating system. I found one in my closet.
  • Case for Raspberry Pi - bought on Amazon too; not required, but you are less afraid of moving a working piece around;
  • Micro-USB power cord - found unused in my closet
  • Mouse, Keyboard, HDMI cable (or a less digital RCA cable), and USB hub - found in my closet; used for initial set-up only, and not required to program or run software

Total cost: around $200.

Source Code

As usual, I'll begin with the source code, which is on github. Here's a local mirror, just in case The source code of a patched Next Muni gem is here.

How to Run

There's a small documentation in the README file, and sample command lines are in the daemon startup script for this spec. Basically, you select a stop on nextmuni.com to find out your stop ID, and run client/client.rb --stopId 12345. Alternatively, you can run a morning dashboard by specifying up to two routes and stops in text, and add an URL to retrieve weather from weather.gov:

PI configuration notes

During the first run, connect mouse, keyboard, and Wi-Fi card through the USB-hub. Follow up the Getting Started guide to download initial image onto SD card. Boot it up, install the default Raspbian Debian-based distro.

Do select "automatically boot desktop environment on startup" when asked. While you won't need it, it's the easiest way to configure Wi-Fi. Once at the desktop, select "Wi-Fi configuration" under one of the menus accessed via Start button, and configure the wi-fi card you inserted to connect to the router. Make sure you also configure it to start up wi-fi automatically at boot.

Configure SSH server, and make sure it starts up at boot. Either configure hostname, or write down IP address. Reboot and try to SSH from another machine. If this works, disconnect mouse, keyboard, and HDMI: you don't need them anymore.

Make sure git is installed via sudo apt-get install git, download the sources, and follow the README instructions to install language runtimes. Play with script's command line options to figure out what you want to run, and configure a daemon by installing the initscript. Once confident enough in resilience of the setup, mount this onto your wall.

Results

So the solution helped: I stopped running like a Japanese schoolboy, and started to control my morning procrastination, too. As a bonus, I had a ballpark estimation of how much it should cost taxpayers to program this sign, and, of course, I had a lot of fun, too. Finally, I now have a geeky item in my living room... although it looks to others as a stream of stock prices rather than of Muni arrival times.


Notes

The rest are less "hot" sections, detailing my experience with the tools I used for the job.

Ruby

Raspberry Pi's default Linux, as a Debian derivative, contain decent support for Ruby, including 1.9. which is a bit old already, though. It takes Pi several seconds to load the interpreter and a dozen of gems I needed for this sign, but it's OK since the program only starts up once when the system is powered up.

LED Sign

So the sign I used provided a Perl API, so it would be natural to write the client program in Perl, too... not really. While I have a lot of experience programming in Perl, I would much rather prefer writing in a more interesting language, such as Python or Ruby. That was one of the reasons why I wrote a small wrapper over the Perl API, so that I could exec this simple program from any other language.

Another reason why I wrote this API was that the sign does not support two-line text in its built-in text renderer. The API allows you to supply a line of text, and the sign will happily render it, and even add some effects, but I wanted two lines in a smaller font instead. API doesn't have a "render image" function, but the sign allows you to create your own font glyphs, and insert them into text. The obvious idea of printing one 16x96 "symbol" worked. The wrapper does exactly this.

The sign can also loop through several images--just like the real Muni sign. API makes this very easy to program: this happens automatically when you send several messages. Integration with Pi went smoothly: just use $sign->send(device => "/dev/ttyUSB0"); in your program.

There were several glitches, though. First, each glyph is automatically centered, unless it's greater than about 80 pixels wide, in which case it's aligned to the left. Looping through differently aligned messages looks badly, so I force all images I send to the sign as 16x96, padding them with black pixels.

Second, there is no way to power off the sign programmatically. To power it off, I need to physically remove the USB connection, and turn it off because it starts using the battery otherwise. The sign manufacturer's intent probably was that you don't re-program the sign often, so you would program it once and then turn on and off manually. But we're programmers too; isn't working around things our bread and butter? Displaying an all-black picture did the trick of turning the sign "off".

Third, I refresh the sign each 30 seconds to ensure timely prediction, but it takes 1-2 seconds to update the sign. I suppose this is because the sign contains its own CPU and programming, or because I didn't try to play with send function arguments.

Although its internal plumbing probably affects the update speed, and I saw examples of offloading all the synchronization and flashing of the lamps to Raspberry Pi completely with different, more low-level signs, that would be too time-consuming for me to bother with. I also don't know how long the sign will last when it's re-programmed twice a minute. So far, I'm quite happy with the Bright LED sign I used and its API.

Font Rendering

I expected to spend zero time on font rendering, but it took much, much longer than I expected. Well, I was planning to write my own simple font rendering engine, but several hidden rocks were waiting for me beneath the thin sand.

First, I couldn't find a "raster" font that is just a set of bitmaps. I ended up downloading "pixelated" TTF fonts and rendering them with FontConfig library. I didn't want the sign to have a lot of dependencies, so I pre-rendeded the glyphs on my Pi in a ready-to-use manner with this script, and wrote a small library to operate them. The most fun part is that I could easily add new glyphs, such as rainfall indicators.

We also take a lot of font rendering for granted, such as aligning to center, word wrapping, kerning. I implemented some of them in a very simple manner (in Ruby!) but I feel that a couple more features would make me ragequit in favor of a real font rendering library instead.

Accessing Muni Predictions

Another reason why I chose Ruby was the Muni library in Ruby, ready-to-use for getting predictions from the NextBus system. Or so I thought: the library was missing some functionality, such as refining routes or getting predictions for all routes of a stop (this is what a Muni sign displays). It also pulled in the whole Rail stack only to display nicely formatted time string! (like, convert interval in seconds to something fancy as "over 1 hour").

I fixed some of these things, and bundled the gem with the original source for your convenience because it was easier than forking it and adding it into an official repository.

Specify routeConfig!

Apparently, official Muni monitors operate in the same terms: routeConfig is one of the API commands of NextBus service.

Anyway, NextBus system has a nice API, which is seemingly free to use if the use is limited or distributed among clients. A spawn of this site, nextmuni.com is devoted to SF Muni.

The only thing I didn't find was the actual names of the

Weather Forecast Retrieval

Oh, well, here comes the easy part," I thought when I finished with everything else, and decided to throw in weather forecasts. I was wrong in a very interesting way.

Turns out, weather prediction APIs are usually not free. The websites I found at the top of Google search all required signups and including your personal key to all API calls. Many had free limited versions but required payment for hourly forecast--and I was mainly interested in conditions at a specific place at a specific hour of the day (San Francisco weather isn't very stable). Not that I'm opposed to paid services per se, or to parsing web pages with regexps, but I don't need that much from them to pay, do I?

I proceeded with a free government weather.gov service, which is actually kind of cool despite the web design of the last century. The service is limited to the U.S., but San Francisco Muni predictions are not very useful outside of the country either. To use it, you need to provide a link to XML table for your city/area, which would look like this: http://forecast.weather.gov/MapClick.php?lat=37.77493&lon=-122.41942&FcstType=digitalDWML.

I didn't use their API, because it's a full-blown SOAP, and ain't nobody got time for that.

Raspberry Pi

Surprisingly, I don't have much to say about this. It just works! I didn't have to use it, and I could have used a desktop PC. The best part about Pi is that it works backwards, too: I successfully replaced desktop PC with it without doing anything differently. The device's operating system is simply a variation of Debian Linux, as found on desktops. I didn't have to debug through COM-port nor use specialized software to upload programs: I downloaded compilers and interpreters directly from the internet onto the device!

And now it's neatly powered from a single micro-usb port, and doesn't hike my electric bill despite being always on.

Future work

That's all I have to say for now. I need to add a couple of things, such as Muni Messages that remind users about non-working elevators and fare increases, to make the Muni experience complete. I also need automatic sign power-off when unused. I heard, there are motion detectors for Raspberry Pi, but a simple schedule would be nice, too.

Read on | Comments (1) | Make a comment >>


The Pipe-Pipe-Equals

It is hard to come up with a google-friendly name for the ||= construct you see in some programming languages quite often, "pipe-pipe-equals" being the is the closest (other names include "double-pipe equals," "or-equal," or "double-or equals"). Why should we name this monster in the first place? While cryptic at first sight, it is a very convenient shorthand that, I believe, allows us to write cleaner code. Let's see how it works on this example:

Essentially this ||= construct is equivalent to other convenience operators, such as +=. x += 2; means x = x + 2. The "or" operator || usually denotes a weak logical "or" that doesn't try to compute the left-hand side if the right-hand side is true. Usually, meaningful values evaluate to true, while the only evaluating to false are NULL pointers/empty values, zeroes, boolean Falses, and, sometimes, empty strings.

Alternatively, you could set a default hash value, but ||= is used more widely, and also takes exactly one line... doesn't it? ;-)

The statement is used mostly to set a variable to a "default" value if it's "unset" at some point. This gives the most visible advantage when you initialized dictionary elements in a loop. Here's a piece that saves counts of each array element into a dictionary (Ruby):

In some languages (Perl, Javascript), you could not even bother with this, as += 1 on an unset value would result in its assignment to 1.

If you don't initialize output[x], you'll get a runtime error for trying to increment NULL. The advantage of ||= against other ways is that you don't repeat anything. You could have written the same piece as

Oh yes... we forgot about the most natural use of ||=, for booleans. Here's how we'd check if an array contains zeroes if we need to iterate over its elements for something else.

Pipe-pipe-equals is rarely used this "natural" way, though.

But here you have to type output[x] twice and add two more lines of code, which is a complete waste of screen space and, possibly, computing resources if the interpreter doesn't optimize the duplicated computation out. Let's take a look how pipe-pipe-equals works in different languages (we've already seen it in action in Ruby).

Perl

Perl was the first interpreted language I learned, and the first place I saw the pipe-pipe-equals operator in. It works as expected, but is used less often than the direct version of the weak logical "or" to specify default parameters of a function. Here's how you'd put a default to a hash bucket:

This works in a strict mode, too. Note that, in Perl, many things evaluate to false, including empty strings (I once even tried to emulate it in Ruby). To restrict this action to undefs only, use //= in Perl6 instead, which sinks to its right like the Pisa tower, and look as if you're trying to put a division while drunk.

Python

Python has no equivalent of this operator. You have to type the very[long(expression)] twice, and it will be computed twice. You have several different ways to do this:

The engineers more experienced in Python programming, though, assure me that this is not a big deal since you have a different way of supplying default arguments for a function, which seemingly covers half of the use cases for pipe-pipe-equals (this doesn't prevent Ruby from having both, though). Another half is covered by dictionary's dict.get(key, []) method, so that the code piece #1 can be written in a succinct manner. But I still miss it.

Bash (Linux shell)

Bash? Its language is so simplistic, and it looks creepy; how come it would have an equivalent of such a beautiful shortcut? Here it is:

This assigns y to variable if the former is unset or null (as per Bash manual). I mostly used it to set default parameters for environment variables user may or may not set before invoking the script.

C++

While |= is a syntactically correct expression (C++ does not have the "short-circuit" version of this expression), it doesn't do what we discussed here.

C++ is statically typed, so the result of the standard logical "or" is boolean. Retaining the nice semantics we find in a dynamically typed language would require it to be "either the type of the left-hand side or the type of the right-hand side". This is hard to pull in a pass-by-value statically typed language.

Pass-by-value semantics also means that not everything can be assigned a NULL, and not everything can be converted to boolean value. C++ has default arguments as well as Python, so the same reasoning could apply here. You'll have to be more verbose in C++. That's probably why only |= expression is available, which is only useful if its left-hand-side is bool (see sidebar above for similar usage.)

OCaml

Everything said above about C++ applies to OCaml as well. Moreover, OCaml, as a functional language, doesn't have a flawless support for mutation, and pipe-pipe-equals statement its inherently mutational. However, its matching operator would require us to use the very_long_variable twice. However, OCaml and other functional languages have a very interesting construct called "option". If something has "X option" type, it may contain either "nothing" or a value of x. Then, this value may be "unpacked" trough pattern matching:

let very_long_variable = match very_long_variable with None -> y | Some t -> t

here, t is not an shorthand for another very long expression; instead, it's just an identifier, written as is. The match...with allows us to "unpack" values of structured (algebraic) types with shorthands like this. Since this was too long, OCaml has made this a library function Option#default:

let very_long_variable = Option.default very_long_variable y

Anyway, OCaml programs are even more explicit than those in C++ and Python, so trying to tie pipe-pipe-equals into them is quite pointless.

Ruby

"Erm, we saw how it is in Ruby at the beginning," you might think. Well, I lied to you a bit. The thing is that, in Ruby, it is not strictly equivalent to an assignment to a result of logical "or". Which do you think x ||= y is equivalent to?

In Ruby, and only for ||= and &&=, it's the second. If you assign to something other than a local variables, what looks like an assignment is actually a method call (think properties,) and, if so, this assignment does not happen at all if the left-hand side of ||= is false. Which makes sense, but looks like a special case. Read more here.

Why You Might not Need This

Some argue that this operator is mostly useless, especially if their favourite language doesn't have it. Here are some arguments they list.

Redundancy

Indeed, this expression is redundant. You can do the same in a multiple different ways, All the examples I demonstrated above showed how to write essentially a simple if statement in a very short-hand form. The number of characters spared is probably not worth it to include support for this feature to a language designer's must-have checklist.

The statement discussed decreases the redundancy in code in return of broader language definition; each language seeks a balance between these, and often leaves the pipe-pipe-equals aside.

Default Function Arguments

The ability to specify default function argument (C++, Python, Ruby, but not Bash or Perl) covers many use-cases for ||=. However, this doesn't help fancy loops that fill complex structures, one of which we showed above. Nor helps it when you have to specify the default parameter anyway but the concrete value is not known at the time of coding, and is an optional field user may or may not fill.

Confusion

It is confusing. The pipe-pipe-equals requires explanations how it works. It becomes a special snowflake, different from its "mathematical" counterparts +=, if a complex language wants to make its very useful (see Ruby section above). While "confusion" as an excuse of not doing something is my pet peeve (and I tried to explain why), it indeed requires some effort to understand the mechanics. I'm sure that you'll love it once you understand it, and the way I learned about this operator is not by reading a textbook or "The Most Confusing and Obscure Programming Language Features Possible" newsletter, but by reading someone's code.

To Use or Not To Use?

When it comes to deciding whether or not to use pipe-pipe-equals in the "creepy" way we discussed throughout the post, the criticism somehow fades out. If the language supports this, it will be used. I haven't encountered any coding style document that bans this feature. In Ruby and Perl, it is considered as good of an idiom as many. So the answer on the question of whether or not you sohuld use pipe-pipe-equals (||=) in your programs is, definitely, "yes".

Read on | Comments (0) | Make a comment >>


Logging in Expressive Languages

I am not aware of any strict mathematical definition of expressiveness of a language. Informally, a language is more expressive if it allows you to achieve the same with less keystrokes. From this point of view, for instance, Perl or Python are more expressive than C. Of course, you can't always achieve with Python most things that C is capable of at low level, and therefore one may question whether Python is really more capable than C. But capability has little to do with expressiveness. In certain domains, one language may be more expressive than the other, and in other it could be just vice versa. (For instance, Python is more expressive in string operations while C is more expressive in low-level raw data representation.)

One may notice, however, that the majority of languages are Turing complete, and you basically may express the same with all of them. What is the difference in expressiveness then, as seen by a Practicioner rather than a Computer Scientist?

I usually use the following criterion. If you write one non-splittable expression instead of a chain of statements, the language is more expressive. Compare the following two programs that compute a sum of an array in C:

...and in Ruby:

..and, in C, we also should find a place to deallocate the array when it's no longer needed.

You may notice that Ruby can achieve with a single expression what C needs several operators and variable declarations for. Three Ruby features contribute to this, anonymous lambda functions, syntax sugar to pass them to functions (such as inject), and to avoid the really useless variable declarations. Isn't it always that beautiful?

In some cases it's not.

What does it take, to log?

Now let's try to add some logging to the programs above. Assume we actually want to hook into the loop, and make it print the square result it has calculated to a logging stream.

Let's take a look at the code again.

C:

Ruby:

Ewww! Why does our Ruby code looks like C? Where is our one-liner? Where is the expressiveness? How come we're explicitly introducing a temporary variable just like in C?

I spotted such patterns, "compute-log-return" in a lot of places in my practice. For i instance, the BLAST codebase I maintain, contains these here and there, because it's written in an expressive functional language (OCaml), and uses logging as a primary debugging mechanism. Because of it, the code that should look beautiful looks as an old prostitute trying to mask their weary skin with tons of makeup, and I'm not happy with this.

Trying to get asleep on a plane, I suddenly realized that this logging could be achieved with a better code. Since "assign-debugprint-return" is a recurring pattern, let's define a helper function

Then, the resultant Ruby code will look like this:

Now our Ruby code looks like Ruby again.

Of course, this is not limited to Ruby. Though less versatile, the sane function may be implemented for the C code as well.

And this brings us back to the original comparison with 1 line in Ruby and five--in C.

Discussion

while you may write such function for a less expressive language (as seen above), it will "stand out" in a large C code sheet and look like ugly duckling. What may benefit most are languages that mix functional and imperative coding styles, such as the aforementioned Ruby and OCaml. The reason why this helper might be useful is that it's relevant to the functional languages in two different ways.

First, functional language allow the programmer to define anonymous function inline, which makes them be frequently passed as parameters to other functions. In C, defining a separate function is, inmost cases, uglier than just adding a convenient logging shortcut.

Second, logging an intermediate result contradicts the functional level paradigm. In functional languages you define a computation as a whole rather than specify concrete sequence of actions. In case of logging, you have to violate a chain of functions with a side-effect operation. Of course, you can try to make it look nice, as we did, but the very reason we need the kludge is that we're doing something a bit wrong.

But it's not limited to functional languages. Most expressive languages that do not require from user a strict specification of type of every identifier can. For example, we might want to debug Perl's regular expression substitution with using a similar function:

(we couldn't use log because it conflicted with the mathematical logarithm function that bears the same name).

***

In any case, the "compute-log-return" function that encapsulated the debug print is a widely useful pattern. And it's especially useful when you're writing a program in an expressive language. Do not let the programming cruft conceal the actual idea of your code, use loggers that hide all the details.

Read on | Comments (0) | Make a comment >>


XML Stream Parsing: Overview

XML Stream Parsing

Two articles on XML stream parsing, the first explaining where its needed and what options one has, and the second giving practical advice, both language-agnostic and Ruby-specific.

XML and the ways to parse it

Stream and "usual" XML parsing are different very different, both in implementation and in the requirements to the XML producers. This article explains why is that—in a language-agnostic way.

Actually, I wanted to write a post about my chronicles with parsing XML in Ruby scripts in the project at my work. But the article began to grow so fast that I decided to split it into two posts. This part introduces the generic aspects of XML parsing, outlines differences between "conventional" and "stream" XML parsing, and shows a couple of examples how both approaches look like in the code that uses them. I'll use Perl as a sample language, but it could be any. And in the next article I'll share some detailed tips for parsing XML as a stream in Ruby.

Conventional XML parsing

Actually, the "conventional" word is a local term coined by me as a counterpart to "stream" parsing. The "conventional" way of parsing XML is more widely known as Document Object Model convention, or DOM. The name stresses the idea that XML document is represented as a set of objects—in the OOP sense—which have methods to be called as well as properties.

"Conventional" XML parsing relies on the basic idea that the the whole XML is loaded into memory. A specific object (we would name it "element object") may be created for each particular tag. User may query this object for its attributes and contents (if it's a leaf element), as well as for children elements that satisfy specific properties (XPath expressions may select elements based on tag names, inheritance, and attributes). The XML parsing library returns results of such queries as element objects as well.

Let's start with an example. Here's a sample XML, taken from the Wikipedia article on Web Services Description Language; I saved it to my local server as it is too big to inline it.

How would the work with a document like this look like? Assume that (it's not what working with WSDL is really like!) you want to get a list of all operations of an interface with a specific name. You call something like (pseudocode!):

document being the entity object of the whole document, and receive an element object that represents the relevant tag. To get all its operations, you query the element for its children:

You get names of operations by something like operation.attribute("name"), and then look up the relevant HTTP binding for this operation by querying document again:

And keep doing the chain of such queries until you learn everything you needed from this document.

What are the attributes of this workflow?

  • Your further "requests" to this XML depend on the answers you have gotten from it before.
  • The document is loaded once at the beginning of a work, which itself comprises a lot of delays, so how fast it's parsed doesn't mater that much.
  • The document was given by an entity that is not under your control. Its elements may come in any order, because specification doesn't impose it.

Therefore, your best choice is to parse and load the document into memory, split it into separate element objects at once, and be ready to run XPath queries and get elements that satisfy them without additional overhead. That's what "conventional" XML parsing looks like.

Stream parsing

What happens at the low level when XML is initially parsed to a series of entity objects? A usual parser, probably, recursive, walks through a file. As soon as it notices that a tag is closed, on the top of its stack (in other words, in the topmost function) there is enough information to construct the entity object for the tag that has just been closed. This object is constructed and then wrapped the way complex structures in the target language appear. And at the end of the processing, the object that represents the root tag of the XML being parsed resides somewhere in the memory, and is returned by the parsing routine.

The question is, why should we wait for the whole file to parse? Why can't we just intercept tags that are interesting to us as soon as they're parsed? Why should we load the whole XML into memory?

The reasons why we sometimes can't are briefly listed in the previous section (the "attributes" of the workflow). But what if they don't apply? Then stream parsing is right what you need!

Usually XML stream parser is understood as a blocking stream that converts the stream of characters to a stream of more meaningful XML "events". Stream parser may be approached as a proxy object between character and special XML stream. It may be queried for the next event, and returns it as soon as it parses the next several tokens in the input raw stream, which may be an opened text file, or a pipe, or data that come from a socket. The events an abstract stream parser for the easiest XML I could imagine (<x><y>stuff</y></x>) are like these:

This way of parsing XML is more known as SAX parsing. "SAX" is an acronym of "Simple API for XML", and it means that API of a SAX parser is just several callbacks that are invoked at the certain simple events that occur during XML parsing, to which we refer in this section, so it's "easy" to understand.

In no way it's implied that the API is "simple" to use.

  • begin of the tag "x"
  • begin of the tag "y"
  • plain tag contents ("stuff")
  • end of the tag "y"
  • end of the tag "x"
  • end of file

The stream parser does not provide any information about the context of the place it's currently at. You should store the context on your own. It also can not handle a query with XPath expression, as to handle it it should seek for the proper tag, past which it may already be; and there's no way to rewind an abstract stream. If it's so inflexible, what's the point in stream parsing then?

It's damn fast! The sequence of the said events is yielded nearly at the speed of reading symbols, which approaches the speed of light compared to unhasty modern systems.

As for applications of stream parsers, they're quite limited. XML is mainly for storing structured information, and if the tag layout is not regular, deeply nested, and is less of a content than of structure (i.e., does not contain large text blobs), the non-stream parser is your best choice. If it doesn't hold, and if your XML is not so complex, but just damn big, then you do not have much choice aside from stream parsing. But you should know how to use it then.

Stream parsers in the wild

The sample output of a stream parser you saw earlier is not very convenient. Writing a nicely-looking and error-prone stream XML parser (i.e. the program that uses output of the said events) is a task that requires discipline and diligence. In the next post I will explain in detail how such a parser is best implemented (in Ruby). Here's how a fragment of code would look like:

However, the most easy-to-use way to employ stream XML parsers is a combined stream and "conventional" parsing. It means, stream parser seeks a tag you need, and then provides this tag as an entity object.

In Perl, there is a wonderful XML::Twig module, which does the thing you need. For example, to get all operations of a WSDL service description, you might use

This script will read the XML from file named foo_file, and just stop at tags that satisfy XPath "interface/operation" (i.e. <interface><operation>...</operation></interface> tags, and invoke a subroutine for them. The $interface_tag is an entity object, which we query for attributes, and thus extract the name attribute.

We could also specify a very useful option, "print_outside_roots => $STREAM, so that all things that are outside the tags we're interested in are printed to a file handler $STREAM as they're parsed. And those things that we would like to look at go to the same subroutines; they may be printed as well, but they may be altered before printing! This makes stream parsers (which provide such interface as XML::Twig) extremely convenient if we want to alter a couple of tags inside a large XML file.

Parsing portions of a big file in a "conventional" way is, in my opinion, a must-have feature for a stream XML parser. In Perl, the already mentioned XML::Twig module is one of those, and overview of Ruby modules is coming soon at my blog. Unfortunately, not all XML parsers support such hybrid model.

Key differences between stream and conventional XML parsing

Stream parsing is, of course, faster. but the performance gain does not necessarily cover the expense to write a more sophisticated program. What are the requirements for the XML for it to be parsed as a stream?

  • XML should be large. Most likely, parsing a small file with a sophisticated program is just not worth it.
  • XML should be homogeneous. It is a rare situation when you are to process a large XML file with an irregular pattern, but irregularity is going to cost you much, either in efficiency or in the complexity of the code you should write.
  • XML should be topologically sorted (or at least look like this). During stream parsing, you can not "rewind", and read again some data you already have read. So you either have to parse a stream two times (which is not always plausible), or store all the data you may need in the future. The latter may consume a lot of memory, which may render stream parsing not much more useful than the "conventional" one.
  • Consequently, layout of the XML you parse should be either specified, or under your control. Stream parsing yields the best combination of effort and resources consumed at runtime when your data are laid out in a specific order.

    You read XML to "process" its tags somehow, so the number of tags that can't be completely processed at the point of reading them should be minimized. For instance, meta-data and style data (smaller portion) in HTML are laid out before the page body (larger portion). If a stream parser was used for rendering web pages, it could store style information in memory, and process body tags at once, without storing them. If it was vice versa, it had to store a lot of body tags until it encountered styling information, which would be much less effective in memory consumption.

    The way HTML is laid out is governed by specifications, so we do not have to have control over the producer of XML; otherwise, we should.

If all these properties hold, congratulations, you will be able to parse XML in a fast way!

Conclusion

In this post I've outlined how a high-level processing of XML in a "conventional" (DOM) way and in "stream" (SAX-like) style is done. I analyzed the differences between the philosophy of these approaches and of requirements to XML files to be parsed in frames of each of them. I was not too specific, and just scantly touched how the use of stream parsers would look like in code.

And in the next article, I'll share my experience with stream XML parsing in Ruby, as well as I'll provide generic advice on how to write a concise and nicely-looking program that parses XML in a stream way.

Read on | Comments (0) | Make a comment >>


Porting Perl's features to Ruby

Diamond.  A Girl's Best Friend.  Ruby.  A Programmer's Best Friend.

An official motto of Ruby language with its joke explained.

As I told in my previous post, I now use Ruby as my primary scripting language. The most prominent reason is that Ruby is superior to Perl in both syntax and in expressive power. Being object-oriented, Ruby allows a lot of things to be expressed in a more concise manner, and exception handling relinquishes the developer from the burden of checking return code of every damned operation.

However, there are some features of Perl, which don't map to Ruby directly. Here's a couple of them.

Boolean String Interpretation

The first is "conversion to boolean". In Perl a lot of things convert to false in boolean context: an empty string, a zero, an undef (undefined value).

In Perl6 (and in recent releases of Perl5), there is a "//" operator, "lhs // rhs" being equivalent to "if lhs is undef, then rhs, otherwise, lhs". It will make the language slightly more expressive, but, unfortunately, won't bring us closer to our aim: making Ruby as convenient as Perl.

This allows us the following. If we want to assign a default value to a string parameter, we'd just use || operation:

However, in Ruby, only nil (undefined value) and boolean false do. So the code similar to the above (which uses Ruby's || operator) would only work if the str is nil. It wouldn't work if it was empty. And in many situation it would be very useful. Consider that you're retrieving a string, which, being not specified, should default to a value. If this string is unspecified, and comes from a user's input in a form, it'll most likely be an empty string. And if it originates from a database, it'll have a nil value. How you we write the code that sets the default value as concise as the Perl's one?

The solution is redefining operators in a class. We will introduce a new operator for strings that works like the Perl's ||. Note that we can't just redefine Ruby's ||, because it may be used by a third-party code, with which we shouldn't interfere.

One of Ruby's features, which originates from SmallTalk, is that we can redefine methods of objects and clases on-the-fly. This seems unusual for the most common C++ and Java languages for OOP support, but in Ruby a lot of thigs are implemented this way.

This feature is sometimes referred to as "monkey patching"

Let's call the new operator or, and let's redefine it for strings:

We notice that it works well if left-hand-side is a string. But what if it's a nil? Our aim was to have a proper behavior for expresions like a.or("b"), where we don't know if a's defined. But we can't redefine how nil works... can we?

In Ruby the nil keyword represents "undefined value". Techincally, however, it's just a... singleton object of class NilClass. So we can open this class the same way we did for String! In Ruby we can add methods to nil (undefined) objects! Here's a complete code for a sample program:

Check it out: it prints four "aaa" lines, just as planned. Such a convenient feature in exchange for just six lines at the utils.rb file of your project (you have a file with your most lovely code pieces, don't you?).

Interpreting lists in function call arguments

Perl has a creepy way of handling arrays of values. It uses tricky terminology to distinguish "arrays" from "lists", while they seem like a same thing. "List context" and "scalar context" add more complexity, and "references to lists" finishes a programmer's brains.

However, there's a lot of places where this nonuniformity turns into features. Here's one of them.

A regular Perl function (unless special measures are taken) takes arguments as a single list. If a function is said to "take two arguments", it takes a list instead, and interprets its first two elements as "arguments".

This looks ugly at funciton's site, but it's beautiful at caller's site. Let's write a sample program, that, say, finds all "backup" files in a given list of directories, by calling one of the Linux shell tools, find. Here's how it would look in Perl:

We pass a list of arguments to system, imbuing command-line arguments of the script @ARGV into its middle. A flat list with a list of folders at the beginning will be passed to system function, not a nested list! However, it's the nested list that will be passed in Ruby:

Ruby is more strict about its arguments, which I find more convenient than Perl, but in this particular case it leads to rejecting of concise and transparent code. Ruby does have a way to pass an arbitrary number of arguments, similar to C's "ellipsis":

The arguments beyond a certain point form a list, which can be explored. However, it's not of much help, since Kernel.system can take an arbitrary number of arguments, but, nevertheless it doesn't work.

What Ruby lacks is interpretation of function call arguments as a list at caller's site. But since we can't change it at caller's site, let's change at callee's! Here's how our wrapper around the system function may look like:

This will work, and you'll see an array of arguments printed on your screen before the command executes.

But that's not all. What about nested lists? Perl doesn't have nested lists, it flattens them all at construction. You could, of course, create nested data structures with "references", but in a pass-by-reference Ruby language it's not an option.

For example, if you call a similar wrapper with arguments like these:

it will flatten it down to a list of eight arguments, which may not be what we want. Instead, we might expect the fourth argument to be a three-element list, while List.flatten wouldn't keep it. The solution would be to use another function (I called it soften), which we can create by the same monkey-patching technique. Here's a code of a sample solution:

Here's the script's output, which demonstrates that it works as expected.

[1, 2]
[1, 2, 3, 4, 5]
[1, 2, 3, 4, 5, 4, 5, 6, 7, 8]
["nohup", "ls", "-l"]
[1, 2, 3, 4, ["nesting", "is", "important"], 5, 4, [5], 6, 7, 8, 9]

***

I prefer Ruby to Perl because it's more strict. Its strictness makes Ruby less tolerant to some perl-ish "tricks", though. On the other hand, the flexibility of Ruby allows us to mimic these features without even fixing the interpreter.

This makes me suggest that "strict but flexible" is a more preferable option to "more free, but less flexible". Keeping ungrounded assumptions aside, at least, the Ruby tricks listed above make its use more convenient, especially if you are used to being an undisciplined Perl programmer.

Read on | Comments (0) | Make a comment >>


More posts about perl >>