XML Stream Parsing

Two articles on XML stream parsing, the first explaining where its needed and what options one has, and the second giving practical advice, both language-agnostic and Ruby-specific.
  • XML Stream Parsing: Overview — generic philosophy
  • Fast XML stream parsing in Ruby with LibXML — practical advice (including non Ruby-specific)

XML is a very powerful word. When you pronounce it, the effect may vary. "Yeah, so what?" "Why not json?" "Oh, another corporate dumbass", "Why not CSV?", "Why not plaintext?"

When we were to adopt using XML to store data in our project, all these questions did arise. And the reason why we used XML anyway was its indisreputable interoperability. The languages we used (Java, Perl) had production-ready support for it, and, which is more important, the languages we could only think about using, would have had the support as well. And it's also extensible—that's what the X in XML is for—so we wouldn't have to rewrite all our code if we decided to add one more field to the data.

We'we implemented our tools to produce XML as their final report, and there was just one last step to implement: upload the final results to the database. From my experience in implementing the engine of the website you read, I knew that Ruby has a neat ORM, which is stable, production-ready and capable, so I used it.

The component to parse and upload the data from XML to the database was implemented in Ruby then. I refactored it twice before I finally managed to find the correct and fast way to do this. Here are the cronicles of writing XML processing script in Ruby.

Hpricot — bad idea

I've googled for XML Ruby backends, quickly glanced through the reports, and thought that Hpricot was the fastest and the easiest to use.

Both these assumptions were wrong. It's not easy to use, and it's not very fast. It only supports DOM parsing (which is not the fastest idea possible), and its interface is a mess. It is simple, but less capable. For instance, you can't distinguish the absence of a tag and a tag that contains empty text. And it doesn't really support all the DOM capabilities.

The last straw was a segmentation fault during processing of some of our data, and that the similar issues have constantly been filed without being fixed. So I decided to rewrite our program from scratch with ReXML.

ReXML — only for small XML files

ReXML, which is included in the default shipment of Ruby, was the second choice. It has a very clear, strict, and convenient interface. I rewrote the XML processing script to use ReXML, and it worked correctly.

By the way, I found a minor bug in ReXML.

But it was slow as hell. We managed to deal with it until we realized that it takes five days to read a 20Mb XML, the parser deploying into 2Gb of resident memory! The funny thing was that a very sophisticated program that generated the data did it in only two days, twice faster than the ReXML+Ruby planned to spend on parsing it.

All right, I started studying stream parsing, as it was the only means that could solve the problem. But ReXML stream parsing capabilities were beyond reasonable tolerance of a programmer to stupid interfaces. It only supported SAX interface, and SAX just SUX. I would have to wrap a lot of stuff into it, and write a lot of necessary code to just comply to this event-driven API. I decided to study more libraries, and I found LibXML.

LibXML background

Before getting to the business of its usage, I've studied some background of LibMXL. First, LibXML is a free library released under MIT license, and you may use it in both commercial and nonprofit projects with very few restrictions, which was what we needed in our project.

LibXML (I'll refer to its second version, i.e. libxml2) is a C library. In spite of this, it's well known for its various bindings with other languages, especially with those which were designed to trad execution speed for developer's convenience, namely, Perl and Python. LibXML bindings for Ruby are less known; perhaps, because operations with XML are mostly under the hood of more high-level tools and libraries that are used for the tasks Ruby programmers usually face.

Aside from high availability, LibXML is a very mature library. It has been developing for more than eleven years, and developers try to keep it fast and stable, and avoid hysterical moves that would make the whole project go bankrupt. Just for one example, here's a result of LibXML automated analysis for binary backwards compatibility; you may notice just a couple of suspicious serious breakages made.

To install, use your package manager to install libxml2 library, and then proceed to gem install libxml-ruby

LibXML killer features

Let's get to business. Here's the documentation for Ruby LibXML bindings for stream parsing. There are two features to note.

Stream of tags API instead of SAX

First, it's not event-driven like SAX. If you already have a program that uses DOM-like parsing it will be much easier to convert it to utilize stream parsing. The thing is that you will not have to revamp and refactor the layout of your program, to make your script callback-driven.

Generally, LibXML provides you with a stream-like interface you can peek the next tag from. Just like you read characters, strings, and numbers from a file stream, you may read tokens from an XML file. You call docstream.read to read the next element, and you then may peek the topmost token by calling docstream.node_type and docstream.value (docstream being the instance of XML stream reader). And most importantly, a killer feature of LibXML, is that you can skip the entire tag without parsing its contents by docstream.next. The resultant program that reads XML would look like this:

Mixed stream and DOM parsing

You might have chosen stream parsing because the large part of your XML consists of a repeated pattern that contains the most of data. However, there might be a tiny part of your file, with meta-information of sorts, that has a very complicated structure. Would be good if you could parse this small, but densely structured part with DOM parser, and do the rest with the usual stream parsing...

LibXML stream parser can do this with expand method. It reads the current tag, and stores it into LibXML's DOM structure, getting ready to continue stream parsing. It doesn't make the whole stream loaded into memory, of course, and you may choose the fast methods for larger loads, and the convenient method—for loading smaller parts of the stream.

Of course, you may still skip the whole tag without having to load its contents into memory via the docstream.next already introduced, or read them into string (see read_outer_xml) to print them somewhere else.

How to write a stream parsing program

It may seem hard to parse a complicated XML with these capabilities. However, adherence to some rules, a couple of patterns to structure the workflow, and some helpers that extend the C-like interface to LibXML with more Ruby-like features will help you writing a fast and nicely-looking program. Here they are.

Always know where you are

At each point of the program you should where the stream is—or might be—at this moment.

Put comments that describe where the parser is now. A good place would be before you are going to peek something from the stream. If your comment seems big to you then you're probably doing something wrong.

Instead of comments, or accompanying them, you should use run-time assertions. Here's how it would look like:

Use Factories to memorize what you've read

Unlike in DOM parsing, you can't query XML for some data you've already read past. Traded for convenience is more fine control of what you need to store in the memory, which is crucial for bigger files.

What worked for me were key-value stores for roots of subtrees, with which I factory objects for children. Factories implemented this interface:

In my case, these factories yielded ActiveRecord objects, and the global data were foreign keys for them. These keys were not be known at once (because the tags with actual data were mingled with tags with foreign keys), so the re_initialize function was invoked for all currently open factories, upon discovering global data.

The consume_tag_if_applicable procedure of a factory, in turn, advances the document stream and reads data from it. Its code may employ the same practices as the outer code that invokes the factory, and look recursively similar. If necessary, this method may yield what the factory has already "digested".

You may check the example of the code that parses XML here. It uses the pattern I've just described.

Employ Ruby capabilities to add helper methods

LibXML is a C library, originally. That means that it's interface is a bit... too verbose. We could make it more Ruby-like with just a couple of "monkey patches".

Check that interface of peeking. It returns an integer code(!) for the type of the token on the top of the stack. So 1980's! We can add some helper methods to the class of the element by opening it at the beginning of our script:

This implements the neat methods we used above.

Another improvement is automatic skipping of whitespace and comments. You see, if your goal is to parse the file and print it as is, with a couple of its parts altered, you indeed need to read the whitespace and print it. Whitespace has no sense in XML files that allow only "leaf nodes" (those that have no children) have text inside them. So, there are a lot of cases when you might want the reading methods to just skip the whitespace.

The requirement for the fixed implementation would be like this: after you call read method, the next token in the stream should not be whitespace or comment. Here's how it looks:

One more improvement is automatic reading of leaf node's value. The thing is that in stream parsing <node/>, <node></node>,, and <node>text</node> are all different sequence of tokens. I just added a procedure that returns nil, "", and "text" respectively for such cases (see the link to the example below)

All modifications I used in the latest project are gathered in the example I linked above.

***

In this post I made a brief overview of various XML parsing libraries for Ruby language. I also explained the basic patterns and techniques that helped me to write a fast XML stream parser in the project I'm involved into. I also provided the source code for both helper methods and sample program that stream-parses XML.

My evaluation of the patterns I explained here demonstrates that they are fast enough, and are flexible in memory consumption. Instead of loading the whole XML into memory, you have a fine-grained control of what you want to stash, and may release the acquired resources as you gather enough data to yield the next piece of the result of your parsing.

I hope that this will help you if you'll have to stream-parse XML on your own. Good luck!