Recent blog updates


Fast XML stream parsing in Ruby with LibXML

XML Stream Parsing

Two articles on XML stream parsing: the first explaining where it's needed and what options one has, and the second giving practical advice, both language-agnostic and Ruby-specific.

XML is a very powerful word. When you pronounce it, the effect may vary: "Yeah, so what?", "Why not JSON?", "Oh, another corporate dumbass", "Why not CSV?", "Why not plaintext?"

When we decided to adopt XML to store data in our project, all these questions did arise. The reason we used XML anyway was its indisputable interoperability. The languages we used (Java, Perl) had production-ready support for it, and, more importantly, the languages we could only think about using would have had that support as well. And it's also extensible—that's what the X in XML is for—so we wouldn't have to rewrite all our code if we decided to add one more field to the data.

We had implemented our tools to produce XML as their final report, and there was just one last step left: uploading the final results to the database. From my experience implementing the engine of the website you are reading, I knew that Ruby has a neat ORM, which is stable, production-ready, and capable, so I used it.

The component that parses the XML and uploads its data to the database was therefore implemented in Ruby. I refactored it twice before I finally found the correct and fast way to do this. Here are the chronicles of writing an XML processing script in Ruby.

Hpricot — bad idea

I googled for Ruby XML backends, quickly glanced through the reports, and concluded that Hpricot was the fastest and the easiest to use.

Both assumptions were wrong. It's not easy to use, and it's not very fast. It only supports DOM parsing (which is not the fastest approach possible), and its interface is a mess. It is simple, but less capable: for instance, you can't distinguish between the absence of a tag and a tag that contains empty text. And it doesn't really support all the DOM capabilities.

The last straw was a segmentation fault while processing some of our data, and the fact that similar issues had constantly been filed against the project without being fixed. So I decided to rewrite our program from scratch with ReXML.

ReXML — only for small XML files

ReXML, which is included in the default Ruby distribution, was the second choice. It has a very clear, strict, and convenient interface. I rewrote the XML processing script to use ReXML, and it worked correctly.

By the way, I found a minor bug in ReXML.

But it was slow as hell. We managed to live with that until we realized it would take five days to read a 20 MB XML file, with the parser ballooning to 2 GB of resident memory! The funny thing was that the very sophisticated program that generated the data did it in only two days, twice as fast as ReXML+Ruby planned to spend merely parsing it.

All right, I started studying stream parsing, as it was the only approach that could solve the problem. But ReXML's stream parsing capabilities were beyond a programmer's reasonable tolerance for stupid interfaces. It only supported a SAX interface, and SAX just SUX. I would have had to wrap a lot of stuff into it, and write a lot of boilerplate code just to comply with this event-driven API. I decided to study more libraries, and I found LibXML.

LibXML background

Before getting down to the business of its usage, I studied some background of LibXML. First, LibXML is a free library released under the MIT license, so you may use it in both commercial and nonprofit projects with very few restrictions, which was what we needed in our project.

LibXML (I'll refer to its second version, i.e. libxml2) is a C library. In spite of this, it's well known for its various bindings to other languages, especially those designed to trade execution speed for developer convenience, namely Perl and Python. LibXML bindings for Ruby are less known, perhaps because operations with XML are mostly hidden under the hood of the more high-level tools and libraries used for the tasks Ruby programmers usually face.

Aside from its wide availability, LibXML is a very mature library. It has been in development for more than eleven years, and its developers try to keep it fast and stable, avoiding hysterical moves that would make the whole project go bankrupt. Just as one example, here's a result of an automated analysis of LibXML for binary backwards compatibility; you may notice only a couple of suspicious serious breakages.

To install it, use your package manager to install the libxml2 library, and then run gem install libxml-ruby.
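
(For instance, on a Debian-flavored system the whole installation might look like this; package names may differ on your distribution:)

    # Debian/Ubuntu; adjust package names for your distribution
    sudo apt-get install libxml2 libxml2-dev
    gem install libxml-ruby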

LibXML killer features

Let's get to business. Here's the documentation for Ruby LibXML bindings for stream parsing. There are two features to note.

Stream of tags API instead of SAX

First, it's not event-driven like SAX. If you already have a program that uses DOM-like parsing, it will be much easier to convert it to stream parsing: you will not have to revamp and refactor the whole layout of your program to make your script callback-driven.

Generally, LibXML provides you with a stream-like interface you can peek the next tag from. Just as you read characters, strings, and numbers from a file stream, you may read tokens from an XML file. You call docstream.read to read the next element, and then you may peek at the topmost token by calling docstream.node_type and docstream.value (docstream being the instance of the XML stream reader). Most importantly, the killer feature of LibXML is that you can skip an entire tag without parsing its contents via docstream.next. The resulting program that reads XML would look like this:
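
(The original listing is not preserved here, so the following is a minimal sketch of such a loop with libxml-ruby; the file and tag names are made up for illustration.)

    require 'libxml'
    include LibXML

    # Open a reader over the file; nothing is loaded into memory yet.
    docstream = XML::Reader.file('data.xml')

    more = docstream.read                    # position at the first token
    while more
      if docstream.node_type == XML::Reader::TYPE_ELEMENT &&
         docstream.name == 'uninteresting'
        more = docstream.next                # skip this whole subtree
      else
        # peek at the current token here: docstream.name, docstream.value, ...
        more = docstream.read                # advance one token
      end
    end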

Mixed stream and DOM parsing

You might have chosen stream parsing because the largest part of your XML consists of a repeated pattern that holds most of the data. However, there might be a tiny part of your file, meta-information of sorts, that has a very complicated structure. It would be good if you could parse this small but densely structured part with a DOM parser, and do the rest with the usual stream parsing...

LibXML's stream parser can do this with the expand method. It reads the current tag and stores it in LibXML's DOM structure, staying ready to continue stream parsing afterwards. It doesn't load the whole stream into memory, of course; you may choose the fast methods for the larger loads, and the convenient method for loading the smaller parts of the stream.
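
(A minimal sketch of how that could look, continuing with the docstream reader from above and assuming a made-up <metadata> tag holds the densely structured part:)

    # The stream is positioned at a small but deeply structured <metadata> tag.
    if docstream.node_type == XML::Reader::TYPE_ELEMENT &&
       docstream.name == 'metadata'
      node = docstream.expand                  # parse just this subtree into DOM
      node.find('author/name').each do |name|  # XPath queries now work on it
        puts name.content
      end
      docstream.next                           # done; resume stream parsing
    end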

Of course, you may still skip a whole tag without loading its contents into memory via the docstream.next already introduced, or read it into a string (see read_outer_xml) to print it somewhere else.

How to write a stream parsing program

It may seem hard to parse a complicated XML with only these capabilities. However, adherence to a few rules, a couple of patterns to structure the workflow, and some helpers that extend the C-like interface of LibXML with more Ruby-like features will help you write a fast and nice-looking program. Here they are.

Always know where you are

At each point of the program you should know where the stream is—or might be—at that moment.

Put comments that describe where the parser is now. A good place is right before you are about to peek something from the stream. If your comment seems too big to you, then you're probably doing something wrong.

Instead of comments, or alongside them, you should use run-time assertions. Here's how one might look:
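
(The original snippet isn't preserved; here's a minimal sketch, with the <record> tag name invented for illustration.)

    # We expect the stream to be positioned at an opening <record> tag here.
    unless docstream.node_type == XML::Reader::TYPE_ELEMENT &&
           docstream.name == 'record'
      raise "Parser lost: expected <record>, got #{docstream.name}"
    end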

Use Factories to memorize what you've read

Unlike in DOM parsing, you can't query the XML for data you've already read past. What you get in exchange for that convenience is finer control over what you store in memory, which is crucial for bigger files.

What worked for me were key-value stores for the roots of subtrees, with which I created factory objects for the children. The factories implemented this interface:
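
(The original listing is lost; what follows is a hypothetical reconstruction based on the method names mentioned below, with invented bodies and tag names.)

    # Hypothetical reconstruction of the factory interface; the method
    # names come from the text below, the bodies and tag name are invented.
    class RecordFactory
      def initialize
        @fields = {}
        @foreign_keys = {}
      end

      # Invoked for all currently open factories when "global" data
      # (e.g. foreign keys) are discovered further down the stream.
      def re_initialize(foreign_keys)
        @foreign_keys.merge!(foreign_keys)
      end

      # If the current tag belongs to this factory, advance the stream
      # past it, digest its data, and yield a finished object if ready.
      def consume_tag_if_applicable(docstream)
        return false unless docstream.name == 'record'
        # ... read tokens from docstream into @fields here ...
        yield @fields.merge(@foreign_keys) if block_given?
        true
      end
    end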

In my case, these factories yielded ActiveRecord objects, and the global data were foreign keys for them. These keys were not all known at once (because the tags with the actual data were mingled with the tags carrying foreign keys), so the re_initialize function was invoked for all currently open factories upon discovering global data.

The consume_tag_if_applicable procedure of a factory, in turn, advances the document stream and reads data from it. Its code may employ the same practices as the outer code that invokes the factory, and look recursively similar. If necessary, this method may yield what the factory has already "digested".

You may check the example of the code that parses XML here. It uses the pattern I've just described.

Employ Ruby capabilities to add helper methods

LibXML is originally a C library. That means its interface is a bit... too verbose. We can make it more Ruby-like with just a couple of "monkey patches".

Take the peeking interface: it returns an integer code(!) for the type of the token at the top of the stream. So 1980s! We can add some helper methods to the reader class by opening it at the beginning of our script:
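
(Another lost listing; here is a sketch of what such patches could look like. The helper names are my own invention.)

    # A couple of "monkey patches" on the reader class.
    require 'libxml'

    module LibXML
      module XML
        class Reader
          # True if the current token opens an element (optionally, a given one).
          def element_start?(tag = nil)
            node_type == TYPE_ELEMENT && (tag.nil? || name == tag)
          end

          # True if the current token closes an element.
          def element_end?(tag = nil)
            node_type == TYPE_END_ELEMENT && (tag.nil? || name == tag)
          end

          # True if the current token is plain text.
          def text_node?
            node_type == TYPE_TEXT
          end
        end
      end
    end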

These helpers give us neat predicates to use instead of the raw integer comparisons in the sketches above.

Another improvement is automatic skipping of whitespace and comments. You see, if your goal is to parse a file and print it as is, with a couple of its parts altered, you do need to read the whitespace and print it back. But whitespace carries no meaning in XML files where only "leaf nodes" (those that have no children) may contain text. So there are a lot of cases where you might want the reading methods to just skip the whitespace.

The requirement for the fixed implementation would be: after you call the read method, the token you end up at should be neither whitespace nor a comment. Here's how it looks:
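
(A sketch of such a patch, again with libxml-ruby; the alias name is invented:)

    # Make read skip whitespace and comments, so that after it returns,
    # the current token is always a meaningful one.
    require 'libxml'

    module LibXML
      module XML
        class Reader
          alias_method :read_raw, :read

          SKIPPED_TYPES = [TYPE_WHITESPACE, TYPE_SIGNIFICANT_WHITESPACE,
                           TYPE_COMMENT]

          def read
            result = read_raw
            result = read_raw while result && SKIPPED_TYPES.include?(node_type)
            result
          end
        end
      end
    end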

One more improvement is automatic reading of a leaf node's value. The thing is that in stream parsing, <node/>, <node></node>, and <node>text</node> are all different sequences of tokens. I just added a procedure that returns nil, "", and "text" respectively for these cases (see the link to the example below).
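
(A sketch of what such a procedure could look like; the helper name is invented, and the stream is assumed to be positioned at the opening tag of the leaf node:)

    require 'libxml'

    # Distinguishes <node/>, <node></node>, and <node>text</node>.
    def read_leaf_value(docstream)
      return nil if docstream.empty_element?       # <node/>
      docstream.read
      if docstream.node_type == LibXML::XML::Reader::TYPE_END_ELEMENT
        ''                                         # <node></node>
      else
        value = docstream.value                    # <node>text</node>
        docstream.read                             # move on to the closing tag
        value
      end
    end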

All modifications I used in the latest project are gathered in the example I linked above.

***

In this post I made a brief overview of several XML parsing libraries for the Ruby language. I also explained the basic patterns and techniques that helped me write a fast XML stream parser for the project I'm involved in, and provided the source code for both the helper methods and a sample program that stream-parses XML.

My evaluation of the patterns explained here shows that they are fast enough and flexible in memory consumption. Instead of loading the whole XML into memory, you have fine-grained control over what you want to stash, and you may release the acquired resources as soon as you've gathered enough data to yield the next piece of your parsing result.

I hope this will help you if you ever have to stream-parse XML on your own. Good luck!



XML Stream Parsing: Overview

XML Stream Parsing

Two articles on XML stream parsing: the first explaining where it's needed and what options one has, and the second giving practical advice, both language-agnostic and Ruby-specific.

XML and the ways to parse it

Stream and "usual" XML parsing are very different, both in implementation and in the requirements they impose on XML producers. This article explains why, in a language-agnostic way.

Actually, I wanted to write a post about my chronicles of parsing XML in Ruby scripts for the project at my work. But the article grew so fast that I decided to split it into two posts. This part introduces the generic aspects of XML parsing, outlines the differences between "conventional" and "stream" XML parsing, and shows a couple of examples of how both approaches look in the code that uses them. I'll use Perl as the sample language, but it could be any. In the next article I'll share some detailed tips for parsing XML as a stream in Ruby.

Conventional XML parsing

Actually, "conventional" is a local term coined by me as a counterpart to "stream" parsing. The "conventional" way of parsing XML is more widely known as the Document Object Model convention, or DOM. The name stresses the idea that the XML document is represented as a set of objects—in the OOP sense—which have methods to call as well as properties.

"Conventional" XML parsing relies on the basic idea that the the whole XML is loaded into memory. A specific object (we would name it "element object") may be created for each particular tag. User may query this object for its attributes and contents (if it's a leaf element), as well as for children elements that satisfy specific properties (XPath expressions may select elements based on tag names, inheritance, and attributes). The XML parsing library returns results of such queries as element objects as well.

Let's start with an example. Here's a sample XML document, taken from the Wikipedia article on Web Services Description Language; I saved it to my local server as it is too big to inline here.

What would working with a document like this look like? Assume (it's not what working with WSDL is really like!) that you want to get a list of all operations of an interface with a specific name. You call something like (pseudocode!):
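
(The original pseudocode is lost; here's a plausible reconstruction, with the interface name invented:)

    interface = document.query("//interface[@name='StockQuote']")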

with document being the entity object of the whole document, and you receive an element object that represents the relevant tag. To get all its operations, you query that element for its children:
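
(Reconstructed as well:)

    operations = interface.children("operation")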

You get the names of the operations with something like operation.attribute("name"), and then look up the relevant HTTP binding for each operation by querying document again:
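
(Also a reconstruction; in WSDL 2.0 a binding operation references its interface operation with a ref attribute:)

    http_binding = document.query("//binding/operation[@ref='" + operation_name + "']")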

And you keep issuing such queries until you've learned everything you need from this document.

What are the attributes of this workflow?

  • Your further "requests" to this XML depend on the answers you have gotten from it before.
  • The document is loaded once at the beginning of a session, which itself involves a lot of delays, so how fast it's parsed doesn't matter that much.
  • The document was produced by an entity that is not under your control. Its elements may come in any order, because the specification doesn't impose one.

Therefore, your best choice is to parse the document and load it into memory, splitting it into separate element objects at once, and then be ready to run XPath queries and return the elements that satisfy them without additional overhead. That's what "conventional" XML parsing looks like.

Stream parsing

What happens at the low level when XML is parsed into a series of entity objects? A typical parser, probably a recursive one, walks through the file. As soon as it notices that a tag has been closed, the top of its stack (in other words, the topmost function) holds enough information to construct the entity object for the tag that has just been closed. This object is constructed and wrapped into whatever form complex structures take in the target language. At the end of the processing, the object that represents the root tag of the parsed XML resides somewhere in memory and is returned by the parsing routine.

The question is, why should we wait for the whole file to be parsed? Why can't we just intercept the tags that interest us as soon as they're parsed? Why should we load the whole XML into memory?

The reasons why we sometimes can't are briefly listed in the previous section (the "attributes" of the workflow). But what if they don't apply? Then stream parsing is exactly what you need!

An XML stream parser is usually understood as a blocking stream that converts a stream of characters into a stream of more meaningful XML "events". The stream parser may be thought of as a proxy object between the character stream and a special XML stream. It may be queried for the next event, and it returns one as soon as it has parsed the next several tokens of the raw input stream, which may be an open text file, a pipe, or data coming from a socket. The events an abstract stream parser would yield for the simplest XML I could imagine (<x><y>stuff</y></x>) are like these:

  • begin of the tag "x"
  • begin of the tag "y"
  • plain tag contents ("stuff")
  • end of the tag "y"
  • end of the tag "x"
  • end of file

This way of parsing XML is better known as SAX parsing. "SAX" is an acronym for "Simple API for XML"; it means that the API of a SAX parser is just several callbacks invoked at the simple events that occur during XML parsing, such as those listed above, so it's "easy" to understand.

In no way is it implied that the API is "simple" to use.

The stream parser does not provide any information about the context of the place it's currently at; you should track the context on your own. It also cannot handle an XPath query: to do so, it would have to seek for the proper tag, which it may already have read past, and there's no way to rewind an abstract stream. If it's so inflexible, what's the point of stream parsing then?

It's damn fast! The sequence of events described above is yielded at nearly the speed of reading the characters, which approaches the speed of light by the standards of our unhasty modern systems.

As for the applications of stream parsers, they're quite limited. XML is mainly for storing structured information, and if the tag layout is irregular, deeply nested, and more structure than content (i.e., it does not contain large text blobs), a non-stream parser is your best choice. But if your XML is not that complex, just damn big, then you do not have much choice aside from stream parsing. You should know how to use it then.

Stream parsers in the wild

The sample output of a stream parser you saw earlier is not very convenient to program against. Writing a nice-looking and robust stream XML parser (i.e., the program that consumes the said events) is a task that requires discipline and diligence. In the next post I will explain in detail how such a parser is best implemented (in Ruby). Here's how a fragment of the code would look:
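
(The fragment itself is not preserved; below is a small illustrative Ruby sketch with the libxml-ruby reader, using tag names from the WSDL example above.)

    require 'libxml'
    include LibXML

    reader = XML::Reader.file('service.wsdl')
    current_interface = nil

    while reader.read
      next unless reader.node_type == XML::Reader::TYPE_ELEMENT
      # The stream gives no context, so we track it ourselves:
      # an <operation> is interesting only inside an <interface>.
      case reader.name
      when 'interface'
        current_interface = reader['name']
      when 'operation'
        puts "#{current_interface}: #{reader['name']}" if current_interface
      end
    end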

However, the easiest way to employ stream XML parsers is combined stream and "conventional" parsing: the stream parser seeks to a tag you need, and then provides that tag as an entity object.

In Perl, there is a wonderful XML::Twig module, which does exactly this. For example, to get all operations of a WSDL service description, you might use:
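
(The original listing is lost; here is a sketch reconstructed from the description below:)

    use XML::Twig;

    my $twig = XML::Twig->new(
        twig_handlers => {
            # invoked for each <interface><operation>...</operation></interface>
            'interface/operation' => sub {
                my ($twig, $interface_tag) = @_;
                print $interface_tag->att('name'), "\n";
                $twig->purge;    # free what has already been processed
            },
        },
    );
    $twig->parsefile('foo_file');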

This script reads the XML from the file named foo_file, stops at tags that satisfy the XPath expression "interface/operation" (i.e., <interface><operation>...</operation></interface> tags), and invokes a subroutine for each of them. The $interface_tag is an entity object, which we query for attributes, thus extracting the name attribute.

We could also specify a very useful option, twig_print_outside_roots => $STREAM, so that everything outside the tags we're interested in is printed to the file handle $STREAM as it's parsed. The things we do want to look at go to the same subroutines; they may be printed as well, but they may also be altered before printing! This makes stream parsers that provide an interface like XML::Twig's extremely convenient when we want to alter a couple of tags inside a large XML file.

Parsing portions of a big file in a "conventional" way is, in my opinion, a must-have feature for a stream XML parser. In Perl, the already mentioned XML::Twig module is one such parser, and an overview of Ruby modules is coming soon on this blog. Unfortunately, not all XML parsers support such a hybrid model.

Key differences between stream and conventional XML parsing

Stream parsing is, of course, faster, but the performance gain does not necessarily cover the expense of writing a more sophisticated program. What are the requirements for an XML file to be parsed as a stream?

  • XML should be large. Parsing a small file with a sophisticated program is most likely just not worth it.
  • XML should be homogeneous. It is a rare situation when you have to process a large XML file with an irregular pattern, but irregularity is going to cost you a lot, either in efficiency or in the complexity of the code you have to write.
  • XML should be topologically sorted (or at least look like it). During stream parsing you cannot "rewind" and re-read data you have already read past. So you either have to parse the stream twice (which is not always feasible), or store all the data you may need in the future. The latter may consume a lot of memory, which may render stream parsing not much more useful than the "conventional" kind.
  • Consequently, the layout of the XML you parse should be either specified or under your control. Stream parsing yields the best trade-off between effort and resources consumed at runtime when your data are laid out in a specific order.

    You read XML in order to "process" its tags somehow, so the number of tags that can't be completely processed at the point of reading them should be minimized. For instance, meta-data and style data (the smaller portion) in HTML are laid out before the page body (the larger portion). If a stream parser were used for rendering web pages, it could keep the style information in memory and process body tags at once, without storing them. If it were vice versa, it would have to store a lot of body tags until it encountered the styling information, which would be much less effective in memory consumption.

    The way HTML is laid out is governed by specifications, so we do not need control over the producer of the XML; otherwise, we would.

If all these properties hold, congratulations, you will be able to parse XML in a fast way!

Conclusion

In this post I've outlined how high-level processing of XML is done in the "conventional" (DOM) way and in the "stream" (SAX-like) style. I analyzed the differences between the philosophies of these approaches and between the requirements XML files must satisfy to be parsed by each of them. I was not too specific, and only briefly touched on how the use of stream parsers looks in code.

And in the next article, I'll share my experience with stream XML parsing in Ruby, and provide generic advice on how to write a concise and nice-looking program that parses XML as a stream.


