Recent blog updates

Shown here are only posts related to performance. You can view all posts here.

Nonlinear Effects when Web Server is at 100% CPU

I used to wonder what this Error 502 meant when I observed when I visited my favorite websites. When I retried the error usually seemed to disappear, and I found it's linked with the service being overwhelmed with requests at certain times. But why do I see it? Why doesn't the web server start many threads leveraging OS capabilities of CPU sharing? If it queues incoming requests, why doesn't it just delay their fulfillment, just making them wait in a line longer? Just like, you know, in any cafe, bank, and public office where people form lines, and get what they needed--eventually?

There are three answers to this. First, small-latency error is more important than serving all requests. Second, there are interesting nonlinear effects that happen when the server is at capacity if you aren't using queueing. Which leads to the third point, if more requests arrive than what you can process in one second, you will not be able to serve them, ever. You have to drop them. Let's take a closer look.

Fast error is better than slow result

If a service is consuming responses of several other services, it's often in a situation where it can ignore output of one of them, and still provide value to the user. An example in the paper linked above is Google web search that can skip displaying ads, maps, and other additional information, and simply serve search results.

This alone could be a good reason to drop a request, but this doesn't sound that convincing for the frontend server that faces a human end-user. Let's explore two ways of keeping the requests.

Latency Grows Exponentially If You're Not Limiting Rate

Imagine we're not limiting the number of worker threads, and just start processing everything we receive. In other words, we do not keep a queue of requests, and rely on the operating system to provide fair distribution of computing power among working threads. Say, if we have 4 cores, and every requests takes 1 second of CPU work, we can serve at 4 QPS; this is considered "at capacity".

What would happen if a rate grows beyond 4; say, to 5? Requests arrived at the beginning would not complete when a pack of new requests arrives. At the beginning of 2nd second, 5 earlier requests each would have 20% left, and execution speed would drop twice, so they would be completed only after 1.5 seconds. At the beginning of second 2, requests that arrived at second 1 would have completed in 0.4 seconds, but 5 more requests arrive, decreasing the speed by a factor of two again. The latency for these requests would be 2 seconds. The latency for the next pack arriving at 3rd second would be as much as 2¾ seconds, etc. Here's an illustration on how CPU time is distributed among requests, requests arrived at the same second have the same color:

Let's assess how the latency grows with rate. In other words, let's compute function L(t, r), latency of a request arrived at the moment t if the rate of requests is r times higher than capacity. For instance, if our capacity is 4 QPS, but there are 5 incoming requests, then the rate parameter r = 1.25.

First, the number of in-flight requests at the beginning of a given second N(t) grows linearly with t. Indeed, N(t) < QPS*t; on the other hand, the amount of CPU time owed to previous requests is QPS*t*(r-1)/r, which equals QPS*t*(1-1/r). So, (1-1/r)*QPS*t < N(t) < QPS*t, which means that N is linear function of t, e.g. N(t) = a*t, where a > QPS*(1-1/r).

Second, the CPU time allotted to a request witinh a second [t, t+1) is at most 1/N(t). The latency of the request is then the minimal x such that S = 1/N(t) + 1/N(t+1) + ... + 1/N(t+x) ≥ r. Since N is linear, this sum equals to 1/(a*t) + 1/(a*t + a) + 1/(a*t + 2a) + ... 1/(a*t + xa). From integral calculus we recall we can approximate the sum 1/2 + 1/3 + ... + 1/x as ln(x), so S ≅ 1/a * (ln x - ln t)). Hence, x ≥ t * exp(a*r).

We proved that L(t, r) ≳ t * exp(a*r) where a > (1-1/r), so latency grows linearly with time, and exponentially with rate of incoming requests if we try to serve them all right away.

Practical results, however, seem to suggest that exponent power is sublinear with respect to r. Latency definitely grows greater than polynomially, but the true function in the power remains a mystery. The closest I came to so far is sqrt(r), but let me juggle it a bit first, and I'll share my calculations and graphs. Here's the simulation script; built-in rational numbers in Python are cool, btw.

Newly arriving requests slow down the currently executing ones, prolonging their latency. There is a way to mitigate it: queueing.

Latency Grows Linearly with Queues

Imagine the most naive solution: first come, first served queue. All incoming requests are stored in the queue, and are served in the order they arrive, handler popping them from the queue as soon as it has capacity (a free CPU share or an idle working thread).

When the server is below capacity, the queue is always empty, and the latency of each request is L0, which is the time it takes handler to construct a response. What happens when the server reaches capacity? The latency of a request that arrives at time t (we assume constant rate r) is L(t, r) = Q(t, r)*L0 + L0, where Q(t) is the queue size when the request arrives. The size of the queue is the number of undone requests, which is t*(r-1). Therefore, L(t, r) = t*(r-1)*L0 + L0, concluding that latency grows linearly with both rate and time with queues.

You Can't Serve More Requests than What You Can

This may seem obvious, but is often overlooked. If you start queueing requests and serve them later, then you need to have the capacity dedicated to these requests later. Meanwhile, the next second more requests arrive than you can serve that second, and you have these old requests, too. Now you have three options:

  1. Serve old requests first, but then your latency will start growing at least linearly as you spend more time over capacity;
  2. Serve new requests first, but then old requests will remain in the queue until the first second where more requests arrive. In most cases, this would mean waiting hours as usage pattern are usually circular.
  3. Discard some requests without spending CPU time on serving them. Which leads us to error 502.

In other words, we have to drop (r-1) of all requests regardless of which scheduling mechanism we use. Whether or not the latency grows linearly or exponentially is irrelevant: it has to grow, and this is bad. Its exponential, turbulent growth without queueing is just an interesting artefact because you have another reason to avoid starting all the incoming requests at once: it would simply choke OS scheduler.

Read on | Comments (0) | Make a comment >>

Misuse of Caching in Web Applications

I noticed a pattern in some of my web software architecture failures, and I also notice similar pattern in others' designs. In many cases, people simply misuse memcache, hurting performance of their web applications.

If performance is your goal, you can’t pretend that your computations and rendering is fast, and add transparent caching where it later appears to be slow. Transparent caching is a fairy tale programmers have been telling one another, and we’ve also been writing tools and building hardware to remain in for decades. Turns out, it doesn’t scale. A high-performance web application is designed around serving static content that--sometimes--get recomputed

  1. under very specific conditions;
  2. only partially;
  3. after very specific events;
  4. most likely, by a component separate from the one serving the content.

Where would you keep such a static content derived from the raw data (I will later call it ”derived content”)? Memory caches are often unfit for the job of storing derived data: your items may be removed from cache ("evicted") if you don't have enough memory, or if your another web application or component is working within the same cache. And evictions are much more severe than you might think.

Assume it takes your webapp several seconds to recompute the item you want to serve even if it hasn’t changed, and all requests that come during this period are going to have at least these several seconds of latency. In modern multi-component web-services high latency for a small fraction of requests makes much more harm than it seems, as corroborated by this overview paper by Googlers, so you might want to avoid it. Not to mention the simple reasoning that if this eviction happens under heavy load, the stack of HTTP requests waiting for the data to be computed might kill you.

Is there a problem with caching data altogether? Yes, and no. The idea of memoization--storing results of complex computation for later reuse--is perfectly fine. The problem with caching as I see it is that many now view caching exclusively as storing data in unreliable but fast-to-access caches (think memcached). This is not how caches in web applications have to work despite that’s how they usually do.

Caching Primer

There is another storage you can use for precomputed results, the Database (SQL, or NoSQL, or NewSQL) where you probably store the rest of your data.

Assume you are presenting users with a set of analytics over the data they entered yesterday; say some breakdown of yesterday’s sales by zip code. Say, it takes 20 seconds to compute all the data. Where would you store it?

What you could do is to cache the result of the first request, spending 20 seconds to serve it. You can be smarter, and add a cron job that hits that page every midnight. But if your data gets evicted, your next user will suffer these 20 seconds.

Instead, store the data in your persistent storage (database), the storage query being triggered by the same daily cron job. When the user requests the analytics, just serve it from there; it will never delay by 20 seconds since the data aren’t going anywhere unless you explicitly delete them, which you’ll be wise not to do needlessly. It will take you one database query to retrieve the results, and you can also cache this query in memcache for even faster results. No 20-second latency anymore, just the latency of one database select of a single row. Neat, right? Here’s a lame picture to illustrate this that shows that recomputing-on-write is more efficient as writes are less frequent usually:

Why People Use Memcache Then

So if the world of using permanent storage for caches is all ponies and rainbows, why isn’t it used everywhere? The reasons are performance tradeoff, abusing unreliability of caches, and simple storage management.

Preformance Tradeoff

If your computation is as fast as querying storage, there’s no sense to use persistent storage as cache. In rare cases when you have insufficient latency, but overprovisioned on CPU throughput persistent caching will improve your app, but most likely not. Also, if you think of updating cache often in background as triggered by user requests, you need a fast task offload solution that avoids write contention (sich as App Engine’s Task Queues).

Abusing Cache Unreliability for Eventual Consistency

Cache expiration algorithms make the state of the cache eventually correct with respect your changing data. To build a correct permanent storage cache, you must make sure that the data in this cache are explicitly updated or deleted if unused. Unreliable cache will recycle stale data for you: if a key is never going to be accessed, it will be simply evicted in favor of more recent keys without occupying extra storage. Impersistence of memcache is an asset rather than liability in such cases. One of my previous posts, about sharded counter, contains an example of this.

The problem with efficient record recycling is that you can’t update record’s access time in the persistent database every time you access it without significant performance penalty (see the previous section on fast task offloading). On the other hand, you implicitly do this on every access with specialized in-memory caches. In some cases, a background recurring thread that performs DELETE FROM cache WHERE created_at > NOW() - 600 will take care of this, but this isn’t always possible.

Automatic Storage Management

A memcache can successfully and automatically operate when you don’t have enough memory to store all the results. Its efficiency doesn’t roll downhill with the pace increase because more frequently accessed entries stay hot, while performance suffers only on less “hot” entries hence rarely. Finding space for storage of one extra column per database row is rarely a problem today, though.

Narrowing Down the Scope

Given the above limitations, storing intermediate results in database makes perfect sense when the cost of computing these results is greater than the cost of database query, and when the data are tied to other database records. Say, storing a pre-rendered version of a blog post or a wiki page that uses a custom markup (especially if it’s badly implemented like the one in this blog) makes sense: you simply update it on write, and remove if the post is deleted. Storing a result of 10-minutes analytics query in a table rather than in a cache also makes sense (and some still rely on memcache to hold such data). Storing a join by key of two SQL tables is probably not worth it.

You need to design your application so that it is capable of making derived data updates after writes. Relying on caching will hurt you in the long run, because your application will be improperly architectured.

Read on | Comments (0) | Make a comment >>

Fail Early, Fail Fast

As you probably know (I'm sure I mentioned it somewhere in this blog, not just in my CV), about a half a year ago, I developed a set of components to make our Linux Driver Verification toolset run in a cloud, i.e. on a set of dynamically added and removed virtual machines.

Being low on the amount of man-hours allocated to this task, I needed to make the development as fast as possible. Our toolset was, in fact a simple client-server architecture: it had a large task that split on several independent smaller tasks, and then collected their results. While we did have making our toolset distributed in mind (I even had a post sketching our small debate on how to create an parallelization-friendly architecture), the data exchange had been implemented long before the infrastructure for distributed computing. The data exchange was based on a strict layout of files in the working directory: clients expected the input data in specific places, and server was to ensure the proper files reside there. Then, the server read the output from the specific files the clients created.

Linux has a support for Filesystems in USErspace (FUSE). This means that a non-root user may install and mount a "virtual" file system that would map system calls to the files and directories to whatever user specifies, such as an access to an archive, or to a file on a remote machine.

One of such file systems, an SSH-FS, which mirrors a filesystem on a remote machine via a Secure Shell tunnel, was featured in my previous posts about open3.

In a cloud, a client and the server are located on separate machines. How do we perform the data exchange? The simplest way to do was to use a transparent network file system, such that the data access code is untouched. Indeed, clients just read files from a filesystem, and a combination of Linux filesystem tools maps the read() calls across the network to reads from a server machine.

To achieve this, I used a combination of SSHFS and UnionFS, both being userspace file systems (see sidenote). UnionFS was used to make client write to a local machine and read files from the remote server while thinking it's just one simple filesystem.

It seemed a nice solution because the files the client expected to find in a filesystem were seamlessly transferred via sshfs. Its caching capabilities ensured that the latency happened only when the file was located, and simple buffered reads on that file were cached in advance. Since we need to locate only a relatively small amount of files, but read a lot from them, the latency shouldn't have been crucial.

Have you already guessed where the problem is?..

The trouble: C preprocessing

And there was one simple task in the whole infrastructure, which never got much traction, but is now responsible for all the latency happening in the data exchange via network. Our toolset analyzed C source code, therefore, it needed to preprocess it. And the trouble was in that plain old C preprocessing; namely, its #include directives.

How do #includes in C preprocessor work? A compiler has a prioritized array of folders to look for headers in, the user specifies its own prioritized set of folders, and each #include <file.h> enumerates the array starting from the folder of highest priority, and looks up file.h in each of them, until it founds a match. Then it inserts the contents of the file found into the source code, and keeps preprocessing. What is the incompliancy with our assertions about the performance of our filesystem stack?

During C preprocessing, a very large number of file lookups will fail!

It's the crucial point of preprocessing; it's been working like this for decades, and we shouldn't touch it (we don't want it, too, as using C preprocessor as a third-party tool is much easier to deploy). This is an example of a completely legitimate motivation for issuing a lot of failed requests to something (to file system, in our case). And it works comparatively well locally: while preprocessing of one file on a local disc would take less than a second, the same task over the network requires ten, or more! What could we do with our filesystem setup to avoid such performance problems?

Nothing. The calls that are doomed to a failure on the UnionFS level will be routed over the network to fail on the server, causing insane amounts of latency. Even copying all the headers from the remote machine to the local one wouldn't help: the successes would be executed faster indeed, but the failures would make their way to the server, and cause latencies anyway.


What could be learned from this? When you're trying to design a system, you should pay a closer attention to failures. It's tempting to only optimize your architecture for a fast handling of successful requests, while failure of a request may be as much of importance!

Of course, this varies with your domain: for instance, a web authentication system would specifically slow down failures as a protection against brute-force attacks. In my case, however, I'll have to make a patch to SSH-FS (or, to UnionFS) for it to cache failed file lookups. And I did it! See my blog post that describes how I patched it

But when I'll design another piece of software in my career, and it will have to handle legitimate unsuccessful requests, I'll do my best to make it fail early and fail fast.

Read on | Comments (2) | Make a comment >>

Fast XML stream parsing in Ruby with LibXML

XML Stream Parsing

Two articles on XML stream parsing, the first explaining where its needed and what options one has, and the second giving practical advice, both language-agnostic and Ruby-specific.
  • XML Stream Parsing: Overview — generic philosophy
  • Fast XML stream parsing in Ruby with LibXML — practical advice (including non Ruby-specific)

XML is a very powerful word. When you pronounce it, the effect may vary. "Yeah, so what?" "Why not json?" "Oh, another corporate dumbass", "Why not CSV?", "Why not plaintext?"

When we were to adopt using XML to store data in our project, all these questions did arise. And the reason why we used XML anyway was its indisreputable interoperability. The languages we used (Java, Perl) had production-ready support for it, and, which is more important, the languages we could only think about using, would have had the support as well. And it's also extensible—that's what the X in XML is for—so we wouldn't have to rewrite all our code if we decided to add one more field to the data.

We'we implemented our tools to produce XML as their final report, and there was just one last step to implement: upload the final results to the database. From my experience in implementing the engine of the website you read, I knew that Ruby has a neat ORM, which is stable, production-ready and capable, so I used it.

The component to parse and upload the data from XML to the database was implemented in Ruby then. I refactored it twice before I finally managed to find the correct and fast way to do this. Here are the cronicles of writing XML processing script in Ruby.

Hpricot — bad idea

I've googled for XML Ruby backends, quickly glanced through the reports, and thought that Hpricot was the fastest and the easiest to use.

Both these assumptions were wrong. It's not easy to use, and it's not very fast. It only supports DOM parsing (which is not the fastest idea possible), and its interface is a mess. It is simple, but less capable. For instance, you can't distinguish the absence of a tag and a tag that contains empty text. And it doesn't really support all the DOM capabilities.

The last straw was a segmentation fault during processing of some of our data, and that the similar issues have constantly been filed without being fixed. So I decided to rewrite our program from scratch with ReXML.

ReXML — only for small XML files

ReXML, which is included in the default shipment of Ruby, was the second choice. It has a very clear, strict, and convenient interface. I rewrote the XML processing script to use ReXML, and it worked correctly.

By the way, I found a minor bug in ReXML.

But it was slow as hell. We managed to deal with it until we realized that it takes five days to read a 20Mb XML, the parser deploying into 2Gb of resident memory! The funny thing was that a very sophisticated program that generated the data did it in only two days, twice faster than the ReXML+Ruby planned to spend on parsing it.

All right, I started studying stream parsing, as it was the only means that could solve the problem. But ReXML stream parsing capabilities were beyond reasonable tolerance of a programmer to stupid interfaces. It only supported SAX interface, and SAX just SUX. I would have to wrap a lot of stuff into it, and write a lot of necessary code to just comply to this event-driven API. I decided to study more libraries, and I found LibXML.

LibXML background

Before getting to the business of its usage, I've studied some background of LibMXL. First, LibXML is a free library released under MIT license, and you may use it in both commercial and nonprofit projects with very few restrictions, which was what we needed in our project.

LibXML (I'll refer to its second version, i.e. libxml2) is a C library. In spite of this, it's well known for its various bindings with other languages, especially with those which were designed to trad execution speed for developer's convenience, namely, Perl and Python. LibXML bindings for Ruby are less known; perhaps, because operations with XML are mostly under the hood of more high-level tools and libraries that are used for the tasks Ruby programmers usually face.

Aside from high availability, LibXML is a very mature library. It has been developing for more than eleven years, and developers try to keep it fast and stable, and avoid hysterical moves that would make the whole project go bankrupt. Just for one example, here's a result of LibXML automated analysis for binary backwards compatibility; you may notice just a couple of suspicious serious breakages made.

To install, use your package manager to install libxml2 library, and then proceed to gem install libxml-ruby

LibXML killer features

Let's get to business. Here's the documentation for Ruby LibXML bindings for stream parsing. There are two features to note.

Stream of tags API instead of SAX

First, it's not event-driven like SAX. If you already have a program that uses DOM-like parsing it will be much easier to convert it to utilize stream parsing. The thing is that you will not have to revamp and refactor the layout of your program, to make your script callback-driven.

Generally, LibXML provides you with a stream-like interface you can peek the next tag from. Just like you read characters, strings, and numbers from a file stream, you may read tokens from an XML file. You call to read the next element, and you then may peek the topmost token by calling docstream.node_type and docstream.value (docstream being the instance of XML stream reader). And most importantly, a killer feature of LibXML, is that you can skip the entire tag without parsing its contents by The resultant program that reads XML would look like this:

Mixed stream and DOM parsing

You might have chosen stream parsing because the large part of your XML consists of a repeated pattern that contains the most of data. However, there might be a tiny part of your file, with meta-information of sorts, that has a very complicated structure. Would be good if you could parse this small, but densely structured part with DOM parser, and do the rest with the usual stream parsing...

LibXML stream parser can do this with expand method. It reads the current tag, and stores it into LibXML's DOM structure, getting ready to continue stream parsing. It doesn't make the whole stream loaded into memory, of course, and you may choose the fast methods for larger loads, and the convenient method—for loading smaller parts of the stream.

Of course, you may still skip the whole tag without having to load its contents into memory via the already introduced, or read them into string (see read_outer_xml) to print them somewhere else.

How to write a stream parsing program

It may seem hard to parse a complicated XML with these capabilities. However, adherence to some rules, a couple of patterns to structure the workflow, and some helpers that extend the C-like interface to LibXML with more Ruby-like features will help you writing a fast and nicely-looking program. Here they are.

Always know where you are

At each point of the program you should where the stream is—or might be—at this moment.

Put comments that describe where the parser is now. A good place would be before you are going to peek something from the stream. If your comment seems big to you then you're probably doing something wrong.

Instead of comments, or accompanying them, you should use run-time assertions. Here's how it would look like:

Use Factories to memorize what you've read

Unlike in DOM parsing, you can't query XML for some data you've already read past. Traded for convenience is more fine control of what you need to store in the memory, which is crucial for bigger files.

What worked for me were key-value stores for roots of subtrees, with which I factory objects for children. Factories implemented this interface:

In my case, these factories yielded ActiveRecord objects, and the global data were foreign keys for them. These keys were not be known at once (because the tags with actual data were mingled with tags with foreign keys), so the re_initialize function was invoked for all currently open factories, upon discovering global data.

The consume_tag_if_applicable procedure of a factory, in turn, advances the document stream and reads data from it. Its code may employ the same practices as the outer code that invokes the factory, and look recursively similar. If necessary, this method may yield what the factory has already "digested".

You may check the example of the code that parses XML here. It uses the pattern I've just described.

Employ Ruby capabilities to add helper methods

LibXML is a C library, originally. That means that it's interface is a bit... too verbose. We could make it more Ruby-like with just a couple of "monkey patches".

Check that interface of peeking. It returns an integer code(!) for the type of the token on the top of the stack. So 1980's! We can add some helper methods to the class of the element by opening it at the beginning of our script:

This implements the neat methods we used above.

Another improvement is automatic skipping of whitespace and comments. You see, if your goal is to parse the file and print it as is, with a couple of its parts altered, you indeed need to read the whitespace and print it. Whitespace has no sense in XML files that allow only "leaf nodes" (those that have no children) have text inside them. So, there are a lot of cases when you might want the reading methods to just skip the whitespace.

The requirement for the fixed implementation would be like this: after you call read method, the next token in the stream should not be whitespace or comment. Here's how it looks:

One more improvement is automatic reading of leaf node's value. The thing is that in stream parsing <node/>, <node></node>,, and <node>text</node> are all different sequence of tokens. I just added a procedure that returns nil, "", and "text" respectively for such cases (see the link to the example below)

All modifications I used in the latest project are gathered in the example I linked above.


In this post I made a brief overview of various XML parsing libraries for Ruby language. I also explained the basic patterns and techniques that helped me to write a fast XML stream parser in the project I'm involved into. I also provided the source code for both helper methods and sample program that stream-parses XML.

My evaluation of the patterns I explained here demonstrates that they are fast enough, and are flexible in memory consumption. Instead of loading the whole XML into memory, you have a fine-grained control of what you want to stash, and may release the acquired resources as you gather enough data to yield the next piece of the result of your parsing.

I hope that this will help you if you'll have to stream-parse XML on your own. Good luck!

Read on | Comments (0) | Make a comment >>

XML Stream Parsing: Overview

XML Stream Parsing

Two articles on XML stream parsing, the first explaining where its needed and what options one has, and the second giving practical advice, both language-agnostic and Ruby-specific.

XML and the ways to parse it

Stream and "usual" XML parsing are different very different, both in implementation and in the requirements to the XML producers. This article explains why is that—in a language-agnostic way.

Actually, I wanted to write a post about my chronicles with parsing XML in Ruby scripts in the project at my work. But the article began to grow so fast that I decided to split it into two posts. This part introduces the generic aspects of XML parsing, outlines differences between "conventional" and "stream" XML parsing, and shows a couple of examples how both approaches look like in the code that uses them. I'll use Perl as a sample language, but it could be any. And in the next article I'll share some detailed tips for parsing XML as a stream in Ruby.

Conventional XML parsing

Actually, the "conventional" word is a local term coined by me as a counterpart to "stream" parsing. The "conventional" way of parsing XML is more widely known as Document Object Model convention, or DOM. The name stresses the idea that XML document is represented as a set of objects—in the OOP sense—which have methods to be called as well as properties.

"Conventional" XML parsing relies on the basic idea that the the whole XML is loaded into memory. A specific object (we would name it "element object") may be created for each particular tag. User may query this object for its attributes and contents (if it's a leaf element), as well as for children elements that satisfy specific properties (XPath expressions may select elements based on tag names, inheritance, and attributes). The XML parsing library returns results of such queries as element objects as well.

Let's start with an example. Here's a sample XML, taken from the Wikipedia article on Web Services Description Language; I saved it to my local server as it is too big to inline it.

How would the work with a document like this look like? Assume that (it's not what working with WSDL is really like!) you want to get a list of all operations of an interface with a specific name. You call something like (pseudocode!):

document being the entity object of the whole document, and receive an element object that represents the relevant tag. To get all its operations, you query the element for its children:

You get names of operations by something like operation.attribute("name"), and then look up the relevant HTTP binding for this operation by querying document again:

And keep doing the chain of such queries until you learn everything you needed from this document.

What are the attributes of this workflow?

  • Your further "requests" to this XML depend on the answers you have gotten from it before.
  • The document is loaded once at the beginning of a work, which itself comprises a lot of delays, so how fast it's parsed doesn't mater that much.
  • The document was given by an entity that is not under your control. Its elements may come in any order, because specification doesn't impose it.

Therefore, your best choice is to parse and load the document into memory, split it into separate element objects at once, and be ready to run XPath queries and get elements that satisfy them without additional overhead. That's what "conventional" XML parsing looks like.

Stream parsing

What happens at the low level when XML is initially parsed to a series of entity objects? A usual parser, probably, recursive, walks through a file. As soon as it notices that a tag is closed, on the top of its stack (in other words, in the topmost function) there is enough information to construct the entity object for the tag that has just been closed. This object is constructed and then wrapped the way complex structures in the target language appear. And at the end of the processing, the object that represents the root tag of the XML being parsed resides somewhere in the memory, and is returned by the parsing routine.

The question is, why should we wait for the whole file to parse? Why can't we just intercept tags that are interesting to us as soon as they're parsed? Why should we load the whole XML into memory?

The reasons why we sometimes can't are briefly listed in the previous section (the "attributes" of the workflow). But what if they don't apply? Then stream parsing is right what you need!

Usually XML stream parser is understood as a blocking stream that converts the stream of characters to a stream of more meaningful XML "events". Stream parser may be approached as a proxy object between character and special XML stream. It may be queried for the next event, and returns it as soon as it parses the next several tokens in the input raw stream, which may be an opened text file, or a pipe, or data that come from a socket. The events an abstract stream parser for the easiest XML I could imagine (<x><y>stuff</y></x>) are like these:

This way of parsing XML is more known as SAX parsing. "SAX" is an acronym of "Simple API for XML", and it means that API of a SAX parser is just several callbacks that are invoked at the certain simple events that occur during XML parsing, to which we refer in this section, so it's "easy" to understand.

In no way it's implied that the API is "simple" to use.

  • begin of the tag "x"
  • begin of the tag "y"
  • plain tag contents ("stuff")
  • end of the tag "y"
  • end of the tag "x"
  • end of file

The stream parser does not provide any information about the context of the place it's currently at. You should store the context on your own. It also can not handle a query with XPath expression, as to handle it it should seek for the proper tag, past which it may already be; and there's no way to rewind an abstract stream. If it's so inflexible, what's the point in stream parsing then?

It's damn fast! The sequence of the said events is yielded nearly at the speed of reading symbols, which approaches the speed of light compared to unhasty modern systems.

As for applications of stream parsers, they're quite limited. XML is mainly for storing structured information, and if the tag layout is not regular, deeply nested, and is less of a content than of structure (i.e., does not contain large text blobs), the non-stream parser is your best choice. If it doesn't hold, and if your XML is not so complex, but just damn big, then you do not have much choice aside from stream parsing. But you should know how to use it then.

Stream parsers in the wild

The sample output of a stream parser you saw earlier is not very convenient. Writing a nicely-looking and error-prone stream XML parser (i.e. the program that uses output of the said events) is a task that requires discipline and diligence. In the next post I will explain in detail how such a parser is best implemented (in Ruby). Here's how a fragment of code would look like:

However, the most easy-to-use way to employ stream XML parsers is a combined stream and "conventional" parsing. It means, stream parser seeks a tag you need, and then provides this tag as an entity object.

In Perl, there is a wonderful XML::Twig module, which does the thing you need. For example, to get all operations of a WSDL service description, you might use

This script will read the XML from file named foo_file, and just stop at tags that satisfy XPath "interface/operation" (i.e. <interface><operation>...</operation></interface> tags, and invoke a subroutine for them. The $interface_tag is an entity object, which we query for attributes, and thus extract the name attribute.

We could also specify a very useful option, "print_outside_roots => $STREAM, so that all things that are outside the tags we're interested in are printed to a file handler $STREAM as they're parsed. And those things that we would like to look at go to the same subroutines; they may be printed as well, but they may be altered before printing! This makes stream parsers (which provide such interface as XML::Twig) extremely convenient if we want to alter a couple of tags inside a large XML file.

Parsing portions of a big file in a "conventional" way is, in my opinion, a must-have feature for a stream XML parser. In Perl, the already mentioned XML::Twig module is one of those, and overview of Ruby modules is coming soon at my blog. Unfortunately, not all XML parsers support such hybrid model.

Key differences between stream and conventional XML parsing

Stream parsing is, of course, faster. but the performance gain does not necessarily cover the expense to write a more sophisticated program. What are the requirements for the XML for it to be parsed as a stream?

  • XML should be large. Most likely, parsing a small file with a sophisticated program is just not worth it.
  • XML should be homogeneous. It is a rare situation when you are to process a large XML file with an irregular pattern, but irregularity is going to cost you much, either in efficiency or in the complexity of the code you should write.
  • XML should be topologically sorted (or at least look like this). During stream parsing, you can not "rewind", and read again some data you already have read. So you either have to parse a stream two times (which is not always plausible), or store all the data you may need in the future. The latter may consume a lot of memory, which may render stream parsing not much more useful than the "conventional" one.
  • Consequently, layout of the XML you parse should be either specified, or under your control. Stream parsing yields the best combination of effort and resources consumed at runtime when your data are laid out in a specific order.

    You read XML to "process" its tags somehow, so the number of tags that can't be completely processed at the point of reading them should be minimized. For instance, meta-data and style data (smaller portion) in HTML are laid out before the page body (larger portion). If a stream parser was used for rendering web pages, it could store style information in memory, and process body tags at once, without storing them. If it was vice versa, it had to store a lot of body tags until it encountered styling information, which would be much less effective in memory consumption.

    The way HTML is laid out is governed by specifications, so we do not have to have control over the producer of XML; otherwise, we should.

If all these properties hold, congratulations, you will be able to parse XML in a fast way!


In this post I've outlined how a high-level processing of XML in a "conventional" (DOM) way and in "stream" (SAX-like) style is done. I analyzed the differences between the philosophy of these approaches and of requirements to XML files to be parsed in frames of each of them. I was not too specific, and just scantly touched how the use of stream parsers would look like in code.

And in the next article, I'll share my experience with stream XML parsing in Ruby, as well as I'll provide generic advice on how to write a concise and nicely-looking program that parses XML in a stream way.

Read on | Comments (0) | Make a comment >>

Caching and levers

Some of the people who question the way I design systems, notice that I'm a big fan of caching. While caching is a widespread and a highly used technique of performance optimization, I tend to use it—and rely on it—more frequently than others. Buy why's that?

What is "caching"

It seems that the people actually mean different things when they say "caching". In my understanding, caching means exchanging memory for computation by storing the exact results of computation in memory, and re-using them instead of performing actual calculations.

This actually resembles one of the concepts of physics. When you study how levers work, you often hear that to lift a heavier object you are to be putting the effort at greater distance.

The same way you may trade memory to speed in computing, if you use the appropriate tools.

There are different lever classes, which are based on the same principle, but work differently. And there are different caching types.


Memoization is a caching technique, which saves the results computed by requests, and provides them if they are requested again.

Memoization can be utilized, for example, in recursive calculation of Fibonacci numbers:


Prefetching is a caching technique, which computes more than requested, and provides the data computed if they're requested in the future. That's how cache in your CPU works: when you load a memory cell into a register, it first loads it to a fast, "cache" memory, and also makes sure that a certain amount of the nearest cells are also loaded. You may boost performance if you write your programs to employ locality—i.e. make it more likely that you'll not request data from here and there, but try to make your memory queries concentrated.


Essentially, you can do only memoization, or only prefetching, or a combination of both. But there doesn't seem to be an agreement, what caching is. For example, Wikipedia article about caching defines caching as prefetching. It refers to architecture of processors (at least, to how it's implemented in x86), but the widespread understanding of caching goes beyond prefetching.

In this article I'll understand both prefetching and memoization as variations of caching.

Why I adore caching

I'm a big fan of caching because it supports one of the most wise ways of doing programming, the lazy approach.

Programmers work very hard to be lazy. Laziness is one of the basic virtues of a programmer. To be lazy partly means to design reusable systems that could be easily understood and modified. But how many times you were to design of a complex system, had a hard time subdividing it into components just to understand that the design is awfully unperformant?

To make the system more performant you would have to mingle several clearly defined layers into a horrible mess, to tangle otherwise isolated components, and to make your system's architecture more like a mess.

And here's where caching comes to help you. Instead of redesigning the structure of the system, you can just cache the data which your isolated components would re-calculate due to their independency. You'll see examples of it later.

Don Knuth once produced a famous quote, "...say, 95% of the times, premature optimization is a root of all evil". Ironically, the quote itself is a root of a lot of evil. Bit it actually supports the programming virtue I told about. And we can be sure, that caching can never be called a "premature optimization".

Where I used caching

Without concrete examples, of course, the bare words don't sound assuring enough. So here's when I enocuntered nice opportunities to use caching.

The first example: I use it in my blog to cache the HTML version of pages. They originally are written in a special markup language, which is parsed by an auto-generated from a grammar recursive descent parser implemented in Ruby. (It's called Treetop). It takes about 0.2 seconds to render a blog post alone, so without caching you could hardly read my blog.

Ironically, the parser also uses caching internally to memoize the nonterminal symbols it successfully parsed during the failed options of parsing other nonterminals.

Prefetching is especially useful when working with stores which process batch requests faster than indicidual ones. These include databases and file systems. In one of my first projects, I was given a task to store an abstract syntax trees of C++ header files in an SQL database. (you an see it here). As ASTs are loosely connected, I couldn't reliably determine what nodes will be requested at the next step of depth- or breadth-first traversal across the graph. But what I knew is that they were stored quite tightly. So I made the program request 1000 rows from database instead of one. That saved a lot of calls to the database.

Ironically, the pages you would see if you click the link to sample headers above are also cached. :-)

Of course, I wasn't the first who cached the results of compilation. There is a whole project, ccache, devoted specifically to this: it maintains compilation cache and uses it when the same file is going to be compiled.

While in a single project cache could be easily replaced by a Makefile, it's useful when you're rebuilding several projects independently. For example, in Gentoo Linux different versions of a system application are built separately, by design of its build system. These builds don't know anything about each other, and may be distant in time. However, if you first compile version 1.5.14 of a library, and then upgrade it to 1.5.15, a considerable fraction of the very same object files will be compiled again. And again, caching allows to improve performance, while having a clean architecture.

During one of the latest Google Summer of Code projects, a colleague of mine implemented a system, which closed Linux kernel modules under callgraph dependencies. For each of 4000+ Linux drivers, it produced a dozen of files, on which it depended. And then another component had to compile each of these files, as it was a completely independent module with a stable and developed behavior. It was obvious that instead of compiling 50.000 - 100.000 of files, it should cache the result of their compilation.

This way we traded 10 Gb of memory for several days of running time. Worth it, isn't it?

What about convenience?

Caching could actually be extremely convenient to use and, especially, to insert into a developed application. Here's how caching code of my blog looks like:

It actually means "fetch from cache by the key provided, or, if no data for the key is stored, run the functional block provided to recalculate the data, save to cache, and return the thing, which was just calculated, instead". It's so much syntactic sugar, that it may even cause diabetes... but it's the most convenient caching interface I saw. Essentially, every time you perform memoization, you have to re-implement the exact steps (see that Fibonacci example above). And a good caching interface, like the Ruby binding for memcached featured here, includes that. You may view the Railscast about using memcached for caching this way.

Most likely, you'll not need any rewrite of code of the components you want to introduce caching to. In the example above, which featured database prefetching the code was inserted to fetch-one-row function, which secretly prefetched and queried the local cache on each invocation.

Caching also provides you certain tuning capabilities. You can limit the cache size, and discard Least Recently Used items if you're going to exceed its size. This allows you to tune, how much memory you'll trade for performance, and it works in quote a straightforward, measurable way. Just like levers.


Of course, caching can't help in all situations where you have performance problems. If you're unsure if you will be able to employ caching, do the modelling, as wise men suggest.

However, caching helped me in many situations so far, and I seem to be saying, "don't worry, we'll cache this" too often to avoid being called a caching fan.

Read on | Comments (0) | Make a comment >>

More posts about performance >>