
## How to Get Binary Search Right

As you already know, binary search is an O(log N) way to check whether a particular value exists in a sorted array. There is a well-known report on how "most of our implementations" of binary search are broken. The mistake discussed in that post is an overflow when calculating the index of the middle element. But regardless of whether you get this bit right, there's a different problem with all the binary search implementations I've seen.

They're just too damn complicated and hard to get right from memory.

### Key points

Almost always, if you're relying on your memory, you should prefer simple algorithms to complex ones, just as you should implement a treap instead of an AVL tree. So here's the sequence of steps that helps me write binary search correctly.

#### Bisection instead of "binary search"

Starting from this point, we aren't searching for anything. Searching for an element in an array has too complex a definition. What should we return if the element doesn't exist, or if there are several elements equal to each other? That's a bit too confusing. Instead, let's bisect the array: pose a problem that always has a single solution no matter what.

Given a boolean function f, find the smallest k such that, for all i < k, f(a[i]) == false, and for all j >= k, f(a[j]) == true. In other words, we are to find the smallest k such that, for all i within array bounds, f(a[i]) == (i >= k). If the array is sorted in such a way that f(a[i]) is monotonic, being first false and then true, this problem always has a single solution. When f(a[i]) is false for all array elements, the resultant k is greater by one than the last index of the array, but this is still not a special case, because this is exactly the answer as defined above.

This is more powerful than binary search, which can be reduced to this problem I'll call "bisection" here. Indeed, to find an element x in a sorted array a of length N, we can do the following:

1. bisect a with f(y) = (y >= x), obtaining some k;
2. x exists in a if and only if k < N and a[k] == x.

Now let's get back to solving the bisection, once we've defined it.

#### Well-defined bounds

The answer to our bisection problem is between 0 and N inclusive, so we start by assigning i to 0 and j to N. No special cases allowed: we can do without them entirely. At each step, the answer we seek will stay within our bounds, i <= k <= j, and nothing we know at that point will let us narrow the range any further.

#### Non-overflowing pivot point

It's easy to make an overflow mistake when computing the pivot point, i.e. the point at the middle of the array we'll apply f to in order to make a step. This problem is discussed in more detail in the article I linked above. If we were in the world of unbounded integers, it would simply be (i+j)/2, but we live in the world of integer overflows. If i and j are nonnegative integers, which makes sense for array indexes in C++, the never-overflowing expression would be:
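A minimal sketch, with the pivot factored into a helper so the property is easy to check:

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Non-overflowing pivot: j - i never overflows when i <= j, and
// adding half of it back to i stays within [i, j) whenever i < j.
size_t pivot(size_t i, size_t j) {
    return i + (j - i) / 2;
}
```

The naive `(i + j) / 2` would overflow for indexes near the top of the `size_t` range, while this form cannot, since `j - i <= j`.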

Another useful property of this expression (in C++, at least) is that it's never equal to j if i < j. Here's why it's important.

#### Well-founded termination

A basic way to make sure your program terminates some day is to construct a "loop variant" that strictly decreases with each loop iteration and that can't decrease forever. For our bisection, the natural variant is the length of the part of the original array we are currently looking for the answer in, which is max(j-i, 0). It can't decrease forever because it reaches zero at some point, from which there's nowhere to proceed. We only need to prove that this variant decreases by at least one on each iteration.

Fun fact: special programs can prove that a program terminates automatically, without human intervention! One of these programs is named... Terminator. You can read more about this in a very accessible article, "Proving Program Termination" by Byron Cook et al. I first learned about it at a summer school where Mr. Cook himself gave a talk on the subject.

The first step is to make sure that the moment the variant reaches zero is the last iteration of the loop (otherwise the variant wouldn't decrease!), so we use while (i < j).

Now assume we've found that the pivot point s does not satisfy the target function f. The key difference from most binary search algorithms is that we assign the new i to s+1 rather than s. Given our definition of the problem, s is never the answer in this case, and we should keep our candidate range as small as possible. If f(a[s]) is true, then s could still be the answer, so this reasoning does not apply, and we assign j to s. Here's our iteration step now:
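A sketch of that step (i, j, s, f, and a are as in the text; the step is wrapped into a function only so it can be checked in isolation):

```cpp
#include <cassert>
#include <cstddef>

// One iteration of the bisection loop, assuming i <= s < j on entry.
template <typename T, typename F>
void step(size_t& i, size_t& j, const T* a, F f) {
    size_t s = i + (j - i) / 2;  // the pivot from the previous section
    if (f(a[s]))
        j = s;      // a[s] satisfies f: s may still be the answer
    else
        i = s + 1;  // a[s] fails f: s is never the answer
}
```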

We mentioned in the previous section that i <= s < j, hence i < s + 1 and s < j, so each of the assignments decreases j-i by at least 1 regardless of f's value. This proves that the loop never hangs forever.

### The complete program

So here's our bisection, which works for all sorted arrays (in the sense of f(a[i]) being first false, then true, as i increases).
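In C++, the function might look like this (a sketch assembling the pieces derived above; the pointer-plus-length signature is an illustrative choice):

```cpp
#include <cassert>
#include <cstddef>

// Returns the smallest k in [0, n] such that f(a[i]) is false for
// all i < k and true for all i >= k.
template <typename T, typename F>
size_t bisect(const T* a, size_t n, F f) {
    size_t i = 0, j = n;             // well-defined bounds, no special cases
    while (i < j) {                  // loop variant: j - i
        size_t s = i + (j - i) / 2;  // non-overflowing pivot, i <= s < j
        if (f(a[s]))
            j = s;                   // s may still be the answer
        else
            i = s + 1;               // s is never the answer
    }
    return i;                        // i == j is the answer
}
```

Note how the "not found" case needs no special handling: if f is false everywhere, the loop simply ends with i == n.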

The search itself, in terms of finding whether a value exists in a sorted array, would then be written as follows (I'm just repeating what I wrote above, no surprises here):
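A sketch of that search (the bisection is repeated here so the snippet is self-contained):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

template <typename T, typename F>
size_t bisect(const std::vector<T>& a, F f) {
    size_t i = 0, j = a.size();
    while (i < j) {
        size_t s = i + (j - i) / 2;
        if (f(a[s])) j = s; else i = s + 1;
    }
    return i;
}

// Binary search via bisection: bisect with f(y) = (y >= x), then
// check whether the boundary element is x itself.
bool contains(const std::vector<int>& a, int x) {
    size_t k = bisect(a, [x](int y) { return y >= x; });
    return k < a.size() && a[k] == x;
}
```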

### Proof By Authority

And the GNU C++ STL implementation of binary_search follows the same guidelines I posted here. In my version of GCC, 4.3.4, binary search looks like this:
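I can't reproduce the exact source here, so this is a paraphrase from memory (identifiers simplified); the shape is the point: binary_search is a thin wrapper over lower_bound:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Paraphrase of the libstdc++ code: lower_bound does the bisection,
// and binary_search only checks the boundary element. The trailing
// underscore avoids clashing with std::binary_search in this sketch.
template <typename ForwardIterator, typename T>
bool binary_search_(ForwardIterator first, ForwardIterator last, const T& val) {
    ForwardIterator i = std::lower_bound(first, last, val);
    return i != last && !(val < *i);  // equality via two '<' comparisons
}
```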

Where lower_bound is the bisection function.

### Conclusion

Here's the version of bisection and binary search I usually use when I need one. It is designed to be simple, and simple things are easier to get right. Just make sure your loop variant decreases, you do not overflow, and you get rid of as many special cases as possible.

## A Different View on the Strategy Pattern

If you read a short description of a pattern in a typical pattern book or in a Wikipedia article, you will most likely not realize its full power. True realization comes when you derive the pattern from the programming tasks you are to accomplish, on your own, bottom-up.

A month ago I was chatting with a peer who was about to finish his Object-Oriented Programming course. One of the patterns he mentioned was Strategy. As I didn't recognize at once what it was about, I assumed it was fairly useless. Having gotten home and refreshed my memory with this article, I realized that this heuristic (inferring uselessness from whether I remember what a pattern is about) had led me to the right result. The pattern is about encapsulating a family of algorithms behind an interface. The lack of sense was confirmed by a detailed but unconvincing description of how Strategy differs from the Bridge pattern.

"Well, just yet another useless pattern," I thought... until I found myself writing a bunch of strategies in my script three days later.

### The Strategy pattern

First, let's recall what the Strategy pattern is. It is yet another behavioral pattern that provides the programmer with encapsulation of behavior. Strategy encapsulates several algorithms behind a common interface, allowing the programmer to switch between strategies seamlessly and reuse them in different environments.

This functionality is not unique. The Interface pattern, virtual functions, and even plain function pointers provide more or less the same. The remaining details are only interesting to OOP aficionados. Why bother with Strategies at all, then?

There is one case, however, when thinking in terms of Strategies is not artificially induced by OOP courses, but rather is quite a natural way to think about your algorithms. This is when you play a game, or write a computer program that plays one.

#### Strategy games

There is a whole genre of video games named "Strategies". Though gaming is not the main focus of this post, a lot of what I describe below applies to these games as well. Sadly, in these days of video game consoles, Strategy games do not get as much traction as they once did.

"Still narrow," you might think. "I'm not a game AI programmer, so I'm not going to need this." The argument is fair indeed, but it misses the fact that some activities may be modeled as games. This doesn't necessarily refer to gamification, a popular concept in our times of ubiquitous crowdsourcing and Web 2.0.

The games I'm talking about are closer to the mathematical concept of a game, where the environment of your application is modeled as an "adversary" that doesn't necessarily respond to your actions in the nicest way possible. An example is the script I was implementing: a crawler for a web forum that should download new posts as soon as possible, given that posts may be deleted by evil moderators, which should not confuse the script. I'll expand on how this was modeled as a "game" later; let's talk about strategies for now.

Let's take a less abstract example of a game to explain what strategies are about. You have probably played Battleship, a game where you are to destroy the ships on the adversary's board in as few moves as possible. In this game you make a move by shooting at a particular cell of the battlefield, and get a response: whether you missed, hit a ship, or destroyed it completely. In effect, you have a "Strategy" that reacts with moves to the messages from the battlefield. (We intentionally leave the art of ship positioning aside.)

Implementing a strategy for the Battleship game was the very first task I was given as an intern at my first job at ISPRAS. We were to implement a strategy for the game, and our algorithms then competed with each other. They played thousands of matches, each a comparison of who was faster to shoot out a randomly generated battlefield. I came up with two strategies: one hacked the random number generator by srand()-ing it to a predefined value (that one won all its matches), and the other was built via a series of experiments with combinations of strategies implemented following the pattern described in this post.

The programs we wrote back then revealed a lot about our subsequent professional lives. One of my peers pre-analyzed the probabilities of each cell being occupied by a ship and built his strategy on top of that; he's finishing his Ph.D. now. The algorithm of the team supervisor was the simplest, the fastest, and had the largest number of victories (naturally). And my algorithm was over-engineered, slower than the others, but flexible and efficient enough: it scored 3rd in the competition.

Most strategies look like this: *shoot the battlefield along the diagonals, starting from the middle, until you encounter a hit; as soon as you do, finish the ship you hit by shooting the nearby cells, then keep shooting along the diagonals.*

Of course, a multitude of other, more complex strategies exists. Each strategy, however, fits a common interface: given the result of the previous move, make the next one. A strategy should thus implement the following interface:
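A sketch of that interface in C++ (Move and Result are placeholder types standing for a shot at a cell and the battlefield's answer):

```cpp
// Placeholder types: a shot at a cell, and the battlefield's answer.
struct Move { int x, y; };
enum Result { NONE, MISS, HIT, KILL };

struct Strategy {
    // Given the result of the previous move, make the next one.
    virtual Move next_move(Result previous) = 0;
    virtual ~Strategy() {}
};
```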

### Decomposition of a complex, stateful strategy

A strategy for even such a simple game may be quite sophisticated. The implementation of next_move will embody a number of states (such as "shooting along the diagonals" or "trying to finish a ship"), and the strategy will also implement complex logic to transition among the states based on the shot result. This pattern, known as a "State Machine", is viable, but the whole logic may be simplified.

We can use the rethought Strategy pattern to decompose the strategy into smaller ones. The "global" strategy described above in italics may be turned into a local one as follows. Instead of approaching the way we play the game as one "big" strategy, we consider that our strategy changes with each move as well. In response to a message from the field, the rethought next_move function now returns both the next move and the strategy we are using from now on.

This way, we begin with the "shoot diagonals" strategy; it changes to the "kill ship" strategy as soon as the response is "hit", which, in turn, changes back to the "shoot diagonals" strategy as soon as the response is "ship dead".

A clean way to implement this is to change the interface of a strategy slightly. Instead of a single next_move function that both returns the next move and changes the internal state, we transform the interface as follows:
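A sketch of the transformed interface (same placeholder Move and Result types as before):

```cpp
struct Move { int x, y; };
enum Result { NONE, MISS, HIT, KILL };

struct Strategy {
    // The next move this strategy wants to make, or NULL (0) if the
    // strategy wants to be switched.
    virtual Move* peek() = 0;
    // Consume the adversary's response and return the strategy to
    // use from now on: usually `this`, or a freshly built successor.
    virtual Strategy* move(Result response) = 0;
    virtual ~Strategy() {}
};
```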

The peek function returns the next move you should make according to this strategy, and the move function returns the next strategy.

This offers a simpler way to switch strategies. When a strategy feels it should be switched, it returns NULL from peek; this lets the caller detect that the strategy is about to switch, so it doesn't pass a move to the adversary. Instead, it obtains the next strategy. The overall algorithm now looks like this (memory allocation is intentionally put aside):
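A sketch of that algorithm (the Adversary interface is an illustrative stand-in for the battlefield; types are repeated so the snippet is self-contained):

```cpp
#include <cstddef>

struct Move { int x, y; };
enum Result { NONE, MISS, HIT, KILL, GAME_OVER };

struct Strategy {
    virtual Move* peek() = 0;
    virtual Strategy* move(Result response) = 0;
    virtual ~Strategy() {}
};

// Stand-in for the battlefield: receives a shot, answers with a Result.
struct Adversary {
    virtual Result shot(const Move& m) = 0;
    virtual ~Adversary() {}
};

void play(Strategy* strategy, Adversary* field) {
    for (;;) {
        Move* m = strategy->peek();
        if (m == NULL) {
            // The strategy wants to switch: don't shoot, just obtain
            // the successor and peek again.
            strategy = strategy->move(NONE);
            continue;
        }
        Result r = field->shot(*m);
        if (r == GAME_OVER)
            return;
        strategy = strategy->move(r);
    }
}
```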

You may do this in a more complex but better encapsulated manner. In the peek function, you may construct the next strategy and peek at its next move; save that strategy, then route the move function to it, returning the strategy you built.

The implementation for the Battleship game strategies would look like this. The "shoot diagonals" strategy is initialized with a battlefield state, and this should be enough for choosing the next move. As soon as a hit is encountered, the peek function returns NULL, and the move function returns the "finish the ship" strategy. The "finish ship" strategy is initialized with the last hit, and explores the area around it. As soon as the result is "killed", it initializes the diagonal strategy with the current battlefield and returns it as the next strategy.

Now the implementation looks much cleaner and better decomposed. You can combine and chain various strategies together, discovering the most efficient through experiment.

Note that this algorithm doesn't have any protection against infinite loops, so you have to ensure termination manually. On the other hand, no other way to implement a strategy for a game protects against this either.

### A functional way to encode the strategy change

The decomposition may be continued. Note that the implementation of the "kill ship" strategy has the switch to a specific successor strategy hardcoded. This limits the reusability of the class, because we would like to switch to killing a ship and back from many different strategies, not just from the diagonal one. This hardcoding may be removed via subclassing or via functors, i.e. passing a function to the strategy constructor (I prefer the latter).

When a specialized strategy exhausts the moves it can (or was supposed to) make, it calls the function supplied to its constructor (or a virtual method, if you prefer subclassing), and this function returns a new strategy. Most likely, the function returns to the previous strategy via a closure. For instance, the "shoot diagonals" strategy's move function may contain code like this (I tried to use C++0x lambdas here):
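A self-contained sketch of the idea (class names are illustrative, not the original code; C++11 lambdas and std::function stand in, and the move-only interface is kept minimal):

```cpp
#include <functional>

enum Result { MISS, HIT, KILL };

struct Strategy {
    virtual Strategy* move(Result response) = 0;
    virtual ~Strategy() {}
};

// A specialized strategy that, once its job is done, asks the functor
// supplied at construction time for its successor.
struct KillShip : Strategy {
    std::function<Strategy*()> on_done;
    explicit KillShip(std::function<Strategy*()> f) : on_done(f) {}
    Strategy* move(Result response) {
        if (response == KILL)
            return on_done();  // hand control back via the closure
        return this;           // keep finishing the ship
    }
};

struct Diagonals : Strategy {
    Strategy* move(Result response) {
        if (response == HIT) {
            // Teach KillShip to return to a fresh diagonals strategy.
            return new KillShip([]() -> Strategy* { return new Diagonals(); });
        }
        return this;  // keep shooting along the diagonals
    }
};
```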

### Concurrent execution of strategies

An established interface allows us to define combinations of strategies. Assume we would like to shoot along the diagonals in different regions of the battlefield. Instead of implementing a strategy that shoots along two, three, etc. regions, we may implement a strategy composition.

A simple composition takes an array of strategies and asks each for the next move in a round-robin way. In some cases, that's a sufficient composition. In Battleship, however, we don't want to shoot into the other regions once we've hit a ship; we want to pay full attention to that ship. In this case, we would store the strategies in a priority queue instead of an array.
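The round-robin composition might be sketched like this (built on the peek/move interface described earlier; illustrative, not the original code):

```cpp
#include <cstddef>
#include <vector>

struct Move { int x, y; };
enum Result { NONE, MISS, HIT, KILL };

struct Strategy {
    virtual Move* peek() = 0;
    virtual Strategy* move(Result response) = 0;
    virtual ~Strategy() {}
};

// Asks each sub-strategy for its move in turn; a priority queue in
// place of the vector would let an "urgent" strategy (e.g. one
// finishing a wounded ship) preempt the others.
struct RoundRobin : Strategy {
    std::vector<Strategy*> parts;
    size_t current;
    explicit RoundRobin(const std::vector<Strategy*>& p)
        : parts(p), current(0) {}
    Move* peek() { return parts[current]->peek(); }
    Strategy* move(Result r) {
        parts[current] = parts[current]->move(r);  // sub-strategies may switch
        current = (current + 1) % parts.size();    // next one's turn
        return this;
    }
};
```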

Now assume we are to download all messages from a forum, given that:

1. each message has a unique id that automatically increments;
2. there's no way to get the current maximum id without posting a new message;
3. moderators may erase messages, and there's no way to distinguish whether a message with a particular id was erased or never existed (aside from encountering a message with a greater id);
4. we should be capable of both getting new messages and downloading the old messages at the same time.

We can model the work of such a crawler as a Battleship-like game. Our "move" consists of the id of a message to download and of a pause to make before doing so. The response is the message contents, or a notification that it doesn't exist.

Our "strategy" is a round-robin combination of two strategies. The first goes from a specified id down to message number one. The second waits for new messages in the following way.

The strategy goes up message by message until it encounters a nonexistent one. Then it switches to a "random walk" strategy that finds a message with a greater id that exists, making appropriate pauses. No matter how many messages a moderator deletes between our attempts, we should be capable of finding a message that is not deleted. After we find one, we may be sure that all messages between the last one we downloaded successfully and the one we found during the random walk either exist or are removed forever. This part is handled by yet another strategy, which then switches back to the "random walk", looping the execution.

I implemented a slightly different downloader, given that we always know the id of the last message, regardless of whether some of the latest messages were deleted. You may check its sources here (the downloader as of the time of writing), and here is the latest version of the script.

### Other applications and limitations

The most obvious limitation is that all the strategy classes you implement depend heavily on what your move is. Each strategy class constructs move objects, so if the constructor interface changes, or its meaning becomes different, you have to alter all the classes you implemented. The same applies to the response objects.

#### Limited applicability

One of the oldest turn-based strategy games, chess, is an example where the approach described here is not the best one possible. A more effective way to play chess is something like alpha-beta pruning rather than structural exploration.

Trying to analyze the possible applications of this strategy framework, I came to the conclusion that it's quite limited. It is a nice way to model a strategy when your game is to explore terra incognita, to reveal what is initially hidden. Indeed, in Battleship you do not react to intelligent moves of your adversary. Instead, you explore a carefully prepared field, initially covered by the fog of war. The forum downloader is no different: the actions of the evil moderator are in no way affected by the "moves" the program makes.

The framework supports parallelization via clever strategy combination classes, like the one I described earlier. Note that this abstraction moves the parallelization code away from the strategies themselves.

I don't think this strategy framework is useful for games where you are to make intelligent moves, such as chess or Go. Moreover, even in "exploration" games (such as Battleship), its ability to combine and decouple may lose to a better-tailored but less decoupled counterpart. However, this framework is invaluable when you need to explore a complex, yet-uncovered environment with limited resources, and want to experiment throughout the process.

## Counting Visits Efficiently with "Wide Caching"

One of the features of the web forum engine, a clone of which I develop in Ruby on Rails, is that it displays some site usage statistics.

There are two kinds of statistics it displays. First, it tracks and shows how many times a particular message was read (i.e. clicked), not counting clicks by the author on his or her own message. Second, it shows the overall site activity, unique visitors, and page views in the top-right corner.

In this post I'll tell how I solved the problem of calculating the page view and visitor statistics. Impatient developers may jump directly to the solution section, while others are welcome to read about the path I walked from the most naïve activity tracking to the Wide Cache.

### Naïve solution

The naïve solution was to store all site accesses in a table in the application's MySQL database. Each line in this table would represent one access to the site, keeping the user's hostname and the access time. At each access, a corresponding line is inserted into the table. Then, if we are to display the information gathered, we run two simple DB queries to get the pageview information (the timestamp is the current time minus 10 minutes):

```sql
SELECT COUNT(DISTINCT host) FROM `activities` WHERE (created_at > '2012-04-01 18:33:25');
SELECT COUNT(*) FROM `activities` WHERE (created_at > '2012-04-01 18:33:25');
```

The writes to and reads from the activities table happened on each request. This was achieved via a before_filter in the root controller, usually named ApplicationController, as hinted in this StackOverflow question:
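The filter looked approximately like this (a sketch from memory, not the app's actual code; `Activity` is assumed to be the ActiveRecord model behind the `activities` table, and `spawn` comes from the spawn gem):

```ruby
class ApplicationController < ActionController::Base
  before_filter :track_activity

  private

  # Record the visit outside the request-serving process, so the
  # page itself is not delayed by the database write.
  def track_activity
    spawn do
      Activity.create(:host => request.remote_ip)
    end
  end
end
```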

It spawned another process that handled all the activity business, and didn't prevent the main process from serving pages timely.

As you might expect, the performance measurements were disastrous: the application could serve only four requests per second, while with all the activity stuff turned off the throughput was about 40 rps.

### Storing the table in memcached

The reason the naïve solution didn't work as expected was lack of throughput. At first I thought that since we could defer the database writes and perform them after the request was served, we wouldn't have any problems. However, this only addressed one performance metric, response time, while what the solution lacked was throughput.

I also didn't want to introduce other technologies specifically for activity tracking. What I had at my disposal was memcached. Other components of my application used memcached to cache database reads and whole pages. We could, I thought, cache database writes with it as well.

Rails' storage with a memcached backend supports two operations. The first is to write something under a string key, perhaps with a limited lifetime. The second is to read from a specified storage entry, if anything has been written to it before and its lifetime is not over yet. That's basically all.

We could try to use memcached to store the SQL table itself. Indeed, that table may be viewed as an array of rows, so we could just read the table, append a row, and write the result back. However, for the sake of performance, memcached doesn't support any locking, so we can't simply store the table described in the naïve approach in a single cache bucket. Two threads may read the array, then both append the next row, and both write it back, one row being discarded. This is a classical race condition. And my aim was to serve 30-40 requests per second, which means that race conditions under this approach were inevitable.

Besides, even if we could use locking (for instance, via Linux named locks, flocks), it might perform even worse than the database, because it wouldn't be able to handle enough requests sequentially. We need a certain degree of concurrency here.

### Wide caching

To mitigate the race conditions that happen during database writes, I used a technique I named "wide caching". I certainly reinvented the wheel here, but for the sake of clarity, I'll use this name in the rest of the post.

The race condition from the previous section could be mitigated by distributing the big "table" into smaller, independent tables stored in different memcached chunks. Writes to each of the chunks would still not be protected by a mutex, so a race condition would remain possible. We could try to mitigate it by round-robin chunk allocation, i.e. a new thread writes to the chunk next to the one allocated to the previous thread.

This solution, however, could be improved.

First, round-robin allocation requires a central, synchronized authority to return proper chunks. This brings the same level of technological complexity as a single bucket with a lock.

Second, round-robin doesn't guarantee the lack of race conditions either. A thread may stall long enough for the round-robin to make a whole "round", and to consider the chunk currently in process again. To avoid this, the allocator should be even more complex.

Third, let's realize that we may trade precision for speed. The result will not be absolutely accurate, but people look at site usage statistics out of curiosity, and the figures need not be precise. We'll make them fast first.

This all suggests a simple idea: get rid of the allocator! Just use a random bucket in each thread.

This will not prevent race conditions either, but it will make them predictable. If the allocation is completely random, we can carry out experiments and extrapolate their results over time, being sure they are reproducible.

### Decreasing effect of race conditions

The previous section addressed how many potential accesses to the same cache bucket occur within a period of time. What also matters is how long each cache access takes: the shorter it is, the more writes to a bucket per second we'll handle correctly.

#### Memcache request structure

Rails provides a generic interface to various caches, which allows storing serialized Ruby structures in the cache.

A typical hash bucket access consists of these steps:

1. reading raw data from the cache;
2. deserializing, i.e. converting from a character format to Ruby structures (the opposite of serializing);
3. appending a row to the deserialized table;
4. serializing back to the character format (the opposite of deserializing);
5. writing raw data to the cache.

We can see that steps 1, 2, 4, and 5 depend on the amount of records in the array we store in the hash bucket. The more information we record, the more time they take, and when the data are large enough, the time becomes proportional to the size. And if we just store all the activity data in the cache, the size of the arrays being (de)serialized only grows with time!

How could we decrease the size of buckets? The wide cache already decreases the expected size by a factor of width, could we do better?

It turns out we can. Activity is time-based data: the older the data are, the less interesting they become. In our use case, we're only interested in the data for the latest 10 minutes. And a routine cleanup of tables that drops all records with old enough timestamps is not sufficient: the activity data for 10 minutes are still large (with 50 rps and a cache width of 50, the mean bucket size would be 600 records).

Time suggests another idea. Why do we load and then write back the old data in the table? They don't change anyway! Since we're using a generic cache solution and can't hook into its read/write algorithm, we can't do it on a per-record basis (i.e. not load the table at all, and just append the record being inserted). What can we do instead?

We may utilize another highly accessible resource, whose usage is, however, totally discouraged when building a reliable distributed algorithm: the clock, which is easy and fast to access. The clock provides us with a natural discriminator: every, say, three seconds, abandon the old hash bucket and start a new one that will store the requests for the next three seconds. Given the current time, we can always find the "layer" of the cache we should randomly select the bucket from.

The discussion above was devoted to recording the activity data. However, the data should also be displayed in a timely manner. How timely? Let's say that updating the data at least once per 30 seconds is good enough.

Then, every 30 seconds we should reconcile the activity data written to memcached and "compress" them into an easily readable format. Since I already had the MySQL implementation, I piggybacked on it and merely inserted the activity information into the SQL table, so the reading procedures didn't have to change at all!

To get the statistics, we have to iterate over all hash buckets the writing algorithm could have filled, because gathering the ids of the filled buckets would require additional synchronization (additional writes), and since the majority of them will be filled under high load, we'd better just collect them all. Theoretically, we should collect all buckets that might have been filled during the last 10 minutes, the period we show the activity for. However, if we run the collecting algorithm every 30 seconds, we can collect only the data for the latest 30 seconds as well, since all the previous data should have already been collected.

We'll have another race condition here. A worker may get the current time, generate a hash bucket id X, and then stall for several seconds. During that stall, the collecting worker commits all the data to the MySQL table, including the piece stored in X. The first worker wakes up and writes the visit information to X, from which it will never be read, so the request is lost.

To mitigate this race condition, we may collect and commit data only from layers that are deep enough. This won't avoid it completely, but will decrease its probability to a degree at which we can ignore it.

### The final Wide Caching solution

If we assemble all of the above, we'll get the final solution that looks like this:

When a page request arrives, it asks for the current site usage statistics as one of the steps of printing the page. The statistics are requested from the `activities` table in the MySQL database, and are cached for a medium amount of time (30 seconds), because more frequent updates do not improve the user experience.

After the page has been served, the request notifies the wide cache about the access to the site. First, it determines the "row" of the cache by rounding the current time, in seconds since the Epoch, down to a number divisible by three. Then, it determines the "column" by choosing a random number from 1 to 15. These numbers are concatenated, and they form the unique identifier of a memcached entry. The website then reads the table from that entry, appends a row, and updates the same entry.
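In code, the bucket identifier computation might look like this (a sketch; the key format is illustrative):

```ruby
# Compute the memcached entry id for the current visit:
# "row" = time rounded down to a multiple of 3 seconds,
# "column" = a random number from 1 to 15 (the cache width).
def bucket_key(now = Time.now.to_i)
  row = now - now % 3
  column = 1 + rand(15)
  "activity_#{row}_#{column}"
end
```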

Dumping the collected accesses to the DB is performed like this. After the notification, the request also checks whether there is a live memcached entry with a special name and a lifetime of 30 seconds. If there is no such entry, the information in the DB is considered obsolete, so the algorithm re-creates the entry and starts committing the information into the MySQL DB.

While the information is being committed, other requests may appear that also check the lifetime of the memcached entry and start the commit process. This is why the order in which the memcached entries are read is randomized, so that the number of cases where the same information is committed twice is minimized.

Here's the code. It really looks much shorter than this post that describes it.
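The original source is not reproduced here, so below is a condensed, self-contained sketch of the two halves, tracking a visit and collecting the buckets. A plain Hash stands in for memcached, and the commit step returns the rows instead of inserting them into MySQL; widths and timings follow the text:

```ruby
class WideCache
  WIDTH  = 15   # buckets ("columns") per layer
  HEIGHT = 3    # seconds per layer ("row")

  def initialize(cache = {})
    @cache = cache  # stand-in for memcached: string key -> array of rows
  end

  # Record one visit: append a row to a random bucket of the current
  # time layer. The race window is only this read-append-write.
  def track(host, now = Time.now.to_i)
    key = bucket_key(now - now % HEIGHT, rand(WIDTH))
    table = @cache[key] || []
    @cache[key] = table + [[host, now]]
  end

  # Collect every bucket that could have been filled within `period`
  # seconds, skipping the two most recent layers (deep enough to
  # mitigate the stalled-worker race). Returns the visit rows, which
  # the real code inserts into the MySQL `activities` table.
  def collect(now = Time.now.to_i, period = 600)
    visits = []
    first = now - now % HEIGHT - period
    last  = now - now % HEIGHT - 2 * HEIGHT
    (first..last).step(HEIGHT) do |row|
      WIDTH.times do |column|
        table = @cache.delete(bucket_key(row, column))
        visits.concat(table) if table
      end
    end
    visits
  end

  private

  def bucket_key(row, column)
    "activity_#{row}_#{column}"
  end
end
```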

### Experiments

I mentioned that the setup contains race conditions that lead to losing some requests. With a cache width of 15 and a bucket height of 3 seconds, a payload of 35 requests per second made the tracker lose 350 out of 6800 requests. This is approximately 5% of the total number of requests, which is acceptable. Because the bucket selection is randomized, we may conclude that 5%, given these figures for requests per second, cache width, and equipment, will be the average visit loss factor.

#### Spawning

In previous sections, I claimed that spawn-ing threads with the spawn Rails gem and writing to the database in the spawned processes/threads is slow. Indeed, spawn reinitializes the database connection in the spawned thread, which is already a considerable burden on the server if several dozen threads are spawned every second. Here are the experimental details (see bug #67 where I first posted this info):

| Method | Reqs. per sec. |
| --- | --- |
| No activity | 40.2 |
| With activity; no spawning | 27.4 |
| With activity and spawning | 13.0 |
| With wide cache | 36.7 |

This shows that (a) you should not use spawning for short requests, and (b) wide cache is really fast enough.

### Background periodic job or per-request?

In the algorithm described above, activity data collection starts when a request arrives and the application finds out that the currently cached activity stats are out of date. We were careful to make this case as painless as possible, but it has other implications not related to race conditions, and experiments show that the loss of throughput is acceptable as long as we don't spawn a thread for each such collection.

Surprisingly, this method starts performing badly when site activity is low rather than high. On a low-activity site, we can't rely on requests coming every second; they may arrive even less frequently than the activity cache expires. So, to support the low-activity case, we have to collect the information for caches that might have been filled in the last 10 minutes (the maximum time a visit record stays relevant), not 30 seconds (the lowest possible time since the previous data collection). Otherwise, if users arrive less frequently than every 30 seconds, the engine would always show zero activity, which would make users visit the site even less.

This wouldn't matter if I didn't use the HAML template engine, which, due to a known, sad bug, doesn't flush the page until the request is complete. Even putting the code into an after_filter doesn't help. Experiments demonstrate that activity data reconciliation may take up to 1 second, so some requests will be served with an excess 1-second delay, and when the rate of requests is low, these will constitute a considerable fraction of them. And I surely don't want my users to experience frequent delays during their infrequent visits.

Spawn instead of after_filter? We have already seen that spawn-ing will make our server choke at the other extreme, when the load is high.

Luckily, we have a solution that suits both cases equally well. It is to periodically invoke the activity data collection procedure, without tying it to requests. However, the periodic invocation of a task is not straightforward in Rails. I'll get back to it in another post.

### Future work

The current implementation of the activity tracker is parametrized with cache attributes (cache lifetime, and width). However, this Wide Cache could be parametrized further, with the procedures that are executed, namely:

1. Cache bucket updating
2. Commit procedure
3. Reading what we previously committed

I think I'm going to need the same kind of cache that doesn't touch the database on every request for another kind of activity: visits to individual posts. This parametrization will help me keep the code DRY and re-use the Wide Cache.

This will require refactoring of visits and hits to a single function that calls a lambda function passed in the constructor.

### Other approaches

#### Parse apache/nginx/rails logs

Indeed, the topmost web serving layer already collects the activity data: it carefully logs each visit, with the URL being visited, the user's IP address, and the user agent. Instead of spending time on this in a comparatively slow Rails application, we could use the logs of the "fast" web server.

I have seen production systems that display activity data based on parsing nginx logs, and it may be integrated into Rails in such a way that it doesn't look like a kludge. Perhaps, free log parsers are already available on github... but this solution just doesn't look cool enough to me.

#### Do no track activity

Does anyone care about the visits at all? Maybe we just shouldn't track them?

First, from my observation, everybody tries to stick a visit tracker into their website. Higher activity also signals that what you're visiting is valuable. A Russian clone of the Facebook social network even used to display a fake registered-user counter on its homepage. Such fraud could only be justified by a huge benefit from displaying it, which means these statistics are considered valuable by at least some very popular sites.

Second, in my case, there was an easier answer. For political reasons, I have to reimplement everything the engine I'm trying to clone contains: unless I reimplement everything, and manage to keep the site fast at the same time, my approach to the development will be considered a failure.

#### Use a specialized activity tracking solution

Too late. I have already implemented my own. But if you want activity data on your website, do not hesitate to use one: doing it yourself is hard, as shown above.

### Conclusion

In one of the discussions on the very forum I'm reimplementing, someone wondered why the view count on Youtube is not updated in realtime for popular videos. Having gone through a series of trial-and-error experiments with visit tracking, I realize that the developers of Youtube surely had their reasons. My activity information is also delayed, while the traffic volume is still insignificant compared to even a single Lady Gaga video.

It seems that during the development of this tracker I have reinvented a lot of wheels. This is, I guess, how the most primitive database and file system caches look. Their algorithms are, of course, more sophisticated, and are not built on top of generic cache and serialization solutions, using their own, custom ones instead.

But as any other re-invention of a wheel, it was fun, and I surely got a lot of understanding about higher-load web serving while trying to optimize this component of the forum engine.

## Parallel merge sort

We live in a loosely connected world. There's an increasing demand for middlemen, who connect different entities together.

This seemingly unrelated introduction is important for this post in two different ways:

• Parallelization of tasks to different processors also grows in demand, as it's like creating stuff that doesn't need middlemen;
• If it were not for social networks and forums, you probably would never know about interesting things created at the other end of the world.

So, back to the topic. I used to consider merge sort not parallelizable well, at least not with the same runtime estimate. However, several months ago, after I disclosed this ungrounded opinion, a dweller of my university's message board disproved it. That post linked an article which describes a parallel merge sort algorithm, although it is not its original source. It contains a brief, practical explanation, and I liked the algorithm so much that I wanted to retell it here.

### What concurrency we're interested in

Note that I won't describe just a multithreaded version of the algorithm, whose challenge is a clever use of locking. The problem described here is to deploy merge sorting on a system with distributed memory.

All intensive calculations in this framework should be performed locally. However, each processor has a limited amount of on-site random-access memory. So the challenge is to split the processing into small chunks, which will be sorted and merged locally (complex calculations), and will be just flushed byte-by-byte to external memory at the beginning and at the end of processing.

### Why merge sorting?

The thing is that merge sort is actually an external sort: it can handle amounts of data larger than the RAM you have random access to. This should make us think that it's parallelizable, but it doesn't directly lead to an obvious way to parallelize it.

The usual merge sort looks like this, with N equal to 2:

1. Split the sequence to sort into N (nearly) equal chunks;
2. Sort each chunk (recursively);
3. Merge all chunks (at once or pair-by-pair). The challenge is to parallelize this step.

Each chunk out of these N could be processed individually and independently, and this is a huge achievement. But a challenge remains: how to parallelize the merging?
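The three steps can be sketched as follows (a plain in-memory version; `n_chunks` plays the role of N from the description):

```python
from heapq import merge

def merge_sort(a, n_chunks=2):
    """Textbook merge sort following the three steps above."""
    if len(a) <= 1:
        return list(a)
    # 1. split the sequence into N (nearly) equal chunks
    size = -(-len(a) // n_chunks)  # ceiling division
    chunks = [a[i:i + size] for i in range(0, len(a), size)]
    # 2. sort each chunk recursively
    chunks = [merge_sort(c) for c in chunks]
    # 3. merge the sorted chunks pair-by-pair
    result = chunks[0]
    for c in chunks[1:]:
        result = list(merge(result, c))
    return result
```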

### Parallelized merging: idea

So, how do we parallelize merging? First, we assume that one processor can store 2t cells in its own RAM, so we're restricted to merging no more than 2t cells at once.

Assume we have two large sorted arrays, A and B, which, merged, or even individually, don't fit into the RAM available to one processor. This means that we should parallelize the merge: split the sequences into small chunks that can be merged locally.

Let's try the obvious way: divide each array into chunks of length t and try to merge them sequentially. We can merge the two first chunks into the beginning of the resultant array easily. But what if the elements of the second chunk should also end up inside that very same merged region?

This region of length 2t already doesn't fit into the memory of one processor, so we can't merge it with the second chunk in a straightforward way.

So what we really want is to divide the arrays into mergeable chunks. For example, with which chunk of B should A's sub-array (ai,aj) of length t be merged?

Obviously, the corresponding chunk (bI,bJ) should be such that all elements of B with index less than I are less than ai, and similarly for aj and J.

However, the size of the (bI,bJ) chunk could be greater than t, so we couldn't fit it into RAM to merge with the other chunk. So let's highlight the elements that split B into chunks of length t and that lie inside the segment (bI,bJ):

These new elements split B's chunk into smaller pieces, each shorter than t. Let's find the corresponding elements in the A array for them:

Now all chunks shown here have length less than t, and each corresponding pair could be merged.

But look what we've done here. The order in which ai, aj, bq, and bp are arranged is determined by merging them, that is, by merging the [ai,aj] and [bq,bp] arrays. So if we merge these t-distant elements, we will determine the order of the "splitters" that divide both arrays into mergeable, ordered chunks.

### Parallelized merging: the algorithm

The thoughts in the previous section lead to the merging algorithm. First, divide both arrays into (nearly) equal chunks of size t.

Then merge the two arrays that the selected elements form, using the same merging algorithm recursively. The result is the order in which the splitter elements should be arranged (marked with a red line):

This merge allows us to determine, by simple calculations, the exact chunk of length t into which each splitting element from the other array should fit. (We "switch" the array in which we search for the counterpart at the same time as the red line switches sides.) Then a binary search determines its actual place in the chunk; this way we generate twice as many splitters for each array:

By construction, these arrows will not cross. Then, the pairs of chunks (the chunks to merge are depicted as sections colored in an alternating manner) will lie sequentially in the merged array. Merge each pair of chunks locally, and place the result into the larger array:

Each chunk to be merged consists of two segments of length less than t, which makes it possible to merge chunks locally on one processor. This completes the algorithm.
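For illustration, here is a sequential simulation of chunked merging in the same spirit. It uses the co-ranking formulation (for each output boundary, find how many elements come from each array via binary search) rather than the article's exact splitter construction, but it likewise cuts both arrays into small aligned pieces that can each be merged locally on one processor:

```python
def corank(k, A, B):
    """Find i such that A[:i] and B[:k-i] together hold the k smallest
    elements of the merged output (ties broken consistently)."""
    lo, hi = max(0, k - len(B)), min(k, len(A))
    while lo < hi:
        i = (lo + hi) // 2
        if A[i] < B[k - i - 1]:  # A[i] must be among the first k
            lo = i + 1
        else:
            hi = i
    return lo

def chunked_merge(A, B, t):
    """Merge sorted A and B by cutting the output into chunks of at most
    t elements; each chunk is found by two co-rank searches and merged
    locally (sorted() stands in for one processor's local merge)."""
    n = len(A) + len(B)
    out = []
    for k in range(0, n, t):
        hi = min(k + t, n)
        i0, i1 = corank(k, A, B), corank(hi, A, B)
        chunk = A[i0:i1] + B[k - i0:hi - i1]
        out.extend(sorted(chunk))  # local merge of two short runs
    return out
```

In a real distributed setting each iteration of the loop would run on its own processor, since the chunk boundaries are computed independently.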

### Parallelized merge: runtime estimation

Merging two arrays of n elements requires merging two arrays of n/t elements, performing n/t binary searches in arrays of length t, and, finally, merging the chunks we divided our arrays into, which takes O(n) operations. This leads to the recurrence:

T(n) ≤ T(n/t) + (n/t)·O(log t) + O(n)

I'm too lazy to solve it here, but T(n) = O(n) fits it (we consider t a constant). This is the same estimate as for the usual merge, so parallel merge sort is O(n⋅log n) as well.
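For the skeptical reader, unrolling the recurrence is a one-liner: with t constant, all the per-level work is bounded by c·n for some constant c, and the subproblem sizes shrink geometrically:

```latex
T(n) \le T(n/t) + c\,n
     \le c\,n\left(1 + \tfrac{1}{t} + \tfrac{1}{t^2} + \cdots\right)
     = c\,n \cdot \frac{t}{t-1}
     = O(n).
```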

You may also note that the parallel merge algorithm follows the familiar divide-and-conquer pattern. And it's one of the neat facts about this algorithm: merging the chunks we sorted requires the very same processors that sorted the chunks!

### ***

This is an algorithm I had been looking forward to talking about for a long time. I'm glad that I finally did, and I hope you didn't know it before and have spent this time learning new fun facts.

## "NP-complete!" as a lame excuse

Some time ago I bumped into a typical Stackoverflow question. A guy asked for a C# algorithm that could pick elements from an array so that their sum equals a specified number: the usual NP-complete knapsack problem. But the answer made me think about an interesting matter. Let me screenshot the answer completely:

At first glance, the answer contains everything an ideal answer should: correct information, a certain succinctness, and, as a key to success, an XKCD comic. No wonder it was so highly upvoted.

#### Complexity classes

In Computer Science, the concept of a complexity class is used to define a class of problems for which there exists an algorithm that runs on a specified kind of abstract computing machine and uses a specified amount of a machine-specific resource. For example, the famous class NP is defined as the set of problems that can be (a) solved on a non-deterministic Turing machine (b) using a number of steps polynomial in the length of the input. The class P is the same, but its abstract machine is a deterministic Turing machine. The famous question of whether each NP problem is also in P is still open.

There are many trickier classes (PSPACE, for example, requires polynomial "memory": the maximal length of the Turing machine's tape), which can even be parametrized (for example, PCP(n,m), probabilistically checkable proofs). The relationships between various classes are being studied and aid the relevant research; here's a picture with some of them:

(pic from «Эффективные алгоритмы и сложность вычислений» book by Nikolai Kuzyurin and Stanislav Fomin (pdf, site; published under OPL license))

The knowledge about complexity classes has clearly made its way into the minds of casual programmers. But instead of a challenge, it became a lame excuse.

What a casual programmer thinks when he encounters an NP-hard problem is, "Oh, it's NP-hard, so no known polynomial solution exists!", and he gives up. Some programmers even quickly hack together an algorithm that happens to work only in specific cases they don't even realize. What should be thought instead is, "Oh, it's NP-hard, so no known polynomial solution exists! Let me derive the constraints, and search for an approximate solution that fits them best!" Complexity-theory statements and theorems should aid the solution, not substitute for it!

### Funny approximations for NP-hard problems

Honestly, I was among the programmers of the first type, as defined above. I used to be proud that I knew that much math, and was glad that I could blow people's hopes of solving their problems. However, a year ago I took an advanced complexity course (by guys from one of the research groups at ISP RAS, the "group of Discrete Optimization Algorithms at the Department of Mathematical Methods" (in Russian)) that revealed several amazing facts to me. Some complex NP-complete problems appear to have good approximations! Let me list several of them.

• Edge-dominating set (wiki) -- assume we want to penetrate our adversary's network (which is an undirected graph). We should pick several channels (edges) to insert a virus into, and the hacking software will capture all information that passes through either end of each hacked channel. We should cover the whole network with such "hacked" channels, so that every unhacked channel is plugged into a node touched by a hacked one. But at the same time, we should do it with minimum cost, i.e. the minimum number of channels to hack. See figure A at the side for examples.

#### Figure A

Examples of edge-dominating sets from the wiki page. Red edges are infiltrated channels. You may see that each non-hacked edge touches at least one hacked edge at a shared router.

Picking such a minimum set is NP-hard. Brute force? No, there is a better idea. We don't necessarily need the minimum (although it would be nice); instead, we can hack some extra channels, since the budget is not fixed anyway. We could use some clever algorithm to pick such edges, but instead... let's just hack arbitrary channels that aren't adjacent to already hacked ones!

What amazed me is that such a "solution"... will never exceed twice the real minimum number! "That's still too much," the Greedy Manager might say, and he would be right. But the thing is that we didn't even try to solve anything: the dumbest algorithm yields a good, constant-factor approximation to a problem with no known efficient exact solution. Hell, it's even easier to implement than brute force!
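A minimal sketch of that dumbest algorithm (it simply builds a maximal matching; a maximal matching is known to be an edge-dominating set of at most twice the minimum size):

```python
def greedy_edge_dominating(edges):
    """Scan the channels in arbitrary order and hack any channel that
    shares no endpoint with an already-hacked one. The result dominates
    every edge and never exceeds twice the minimum edge-dominating set."""
    hacked_nodes = set()
    hacked = []
    for u, v in edges:
        if u not in hacked_nodes and v not in hacked_nodes:
            hacked.append((u, v))
            hacked_nodes.update((u, v))
    return hacked
```

On the path 0-1-2-3 the minimum is one edge (the middle one), while the greedy pass may pick two, which is exactly the factor-2 bound in action.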

• Knapsack problem (wiki) -- assume a burglar has infiltrated a house. Unfortunately, he has a relatively small knapsack, into which he can't put everything of value. Possessing a discriminating eye, he can immediately determine how much the fence will pay for a thing. His aim is to maximize the total price of the stuff carried away, with the overall weight staying below what his knapsack can bear. But unfortunately, he's also limited in time and has to do it quickly.

Complexity theory proves that this task is NP-complete. So the burglar might pack every combination of items into his sack until he finds the best one. He would get caught if he did that, because the owner would return before he completed the enumeration. So he chooses another solution: a "greedy algorithm" that always picks the thing with the best price/weight ratio among those that still fit his sack.

A fool! He didn't know that it was the house of a computer scientist, who knew that thieves are dumb and greedy, and who intentionally bought the piece of jewelry with the best price/weight ratio in the house, a piece so heavy that it doesn't allow any other thing to be put into the sack. For an arbitrary N, this trick can make the burglar carry away N times less stuff than he could at best!

But a small adjustment to his algorithm limits the possible loss to "at least 1/2 of the maximum value". The burglar should do the same, but at the end compare the total value of the picked best-ratio items with the value of the single most expensive item he can carry away. If the latter turns out to yield more profit, he takes that thing and runs away.

This is an example that a small hack can guarantee a good approximation.

• Longest path problem (wiki) -- assume we want to make our competitor lose money, so we try to delay the shipment of a valuable package. We can plant a janitor agent in the dispatcher's room, so he can see the map of the area (an undirected graph) and use the special phone to guide the courier along the longest path between source and destination (without cycles, of course, since the courier would suspect something and report it). He must do it quickly, but, unfortunately, picking such a longest path is an NP-hard problem.

We've seen constant-factor approximations before, but this problem, unfortunately, can't be approximated to within any constant factor by a polynomial algorithm. Of course, that's true unless P = NP: its polynomial-time inapproximability by any constant factor is proven under that assumption. So we'll have to teach the janitor to settle for worse approximations if we want him to succeed, and conclude that some NP-hard problems just do not have simple hacks that solve them well.

There are more facts concerning randomized algorithms, which have a certain probability of hitting a good solution, and facts about trickier, non-constant approximations, but they're far less fun and more like boring math crunching, so I won't mention them here.

### Conclusion

Although whether P=NP is still an open question, complexity theory contains many valuable achievements applicable to everyday problems.

Do not make complexity theory just a source of lame excuses for not solving the problem, or of brute-force algorithms and garage-made heuristics that guarantee nothing. Some programmers call all that "clever cheats", but many hand-made "cheats" are of no help as approximations (as shown above with the first, most obvious "greedy" algorithm for the burglar). Instead of cheating, do not hesitate to do a quick googling, or to consult experts in this area, when you encounter an NP-complete/hard problem: a fast solution with proven and limited weakness, suitable for your specific task, may exist and may have been studied well.

Or, if you're curious enough and possess a certain bit of math education, you can study it on your own. English-speaking readers may check the online pages of their favorite universities for book recommendations and, maybe, online lectures. I can't recommend anything in English since I haven't read such books. To a Russian-speaking reader, I can recommend the book «Эффективные алгоритмы и сложность вычислений» ("Efficient Algorithms and Computational Complexity") by Nikolai Kuzyurin and Stanislav Fomin (pdf, site; published under the OPL license), which we used during our studies at the university.

## Treap: simple balanced search tree

### Balanced trees and programming contests

This is an example of a treap: letters form an alphabetically-sorted search tree, while numbers are arranged as a heap.
Pic from Wikipedia

Picture A. We're going to insert the yellow node into a tree with a blue root.

Picture B. We arranged some nodes, but we have to split the subtree of the maroon node into two subtrees to fit the dangling red pointers.

Picture C. If we could solve the task of insertion, we could split the subtree of the red node as if our new yellow node were inserted there instead of into a subtree of the blue node.

Picture D. After we've split the subtree of the red node, we can assign the results to the dangling pointers.

A balanced binary search tree is a binary search tree whose height is small, around O(log N). This value is crucial, because the search operation, the one these trees are named after, takes O(tree_height) time.

In a typical programming contest you're given 5 hours to write several programs, each of which, aside from passing all the jury's tests, must meet a specified time limit. One of the keys to success is (or, at least, was) the ability to quickly code fast standard algorithms and data structures: disjoint sets, heaps, graphs, etc. And due to the limited duration of the competition, the algorithms should be not only fast, but also easy to write bug-free under emotional pressure.

When it comes to the question of how to implement a balanced binary search tree, most programmers will outright shout "red-black tree!" (wiki). Indeed, it is balanced and is implemented in the standard libraries of many languages, so it may seem a good choice (and an even better piece of advice) unless... unless you're to code it yourself. Its implementation is way too complex, while the benefits of the guaranteed, deterministic 2·log(N+1) height bound are sometimes not utilized.

### A Treap in my mailbox

Back in the times when I participated in those contests, I got a mail from one of our trainers. Within was a description of a nice data structure that I later discovered to be called a treap. And now I'd like to tell you about it.

A treap is a conjunction of the words "tree" and "heap" (or "heap" and "tree", for left-recursive readers). No wonder: it combines the properties of both a tree and a heap. Or, more strictly, it is a tree and a heap at the same time. Each node, aside from its payload data, contains two keys; if we look at one key we see a heap, and if we look at the other we see a binary search tree.

Our main goal is nonetheless to build a balanced binary tree out of a sequence of N key-value pairs that may come as input in an unknown order. When you insert a node into a treap, you pick a uniformly distributed random number and use it as the heap-key. Of course, at the same time you maintain the search-tree property for the tree-key.

### Not as complex

"Hell, the insertion procedure will be complex," you think. The trick is that restrictions usually help rather than hinder. So, assume we need to insert an element [Eh,Et] (its heap- and tree-keys) into a (sub)tree with a given root [Rh,Rt].

1. Eh < Rh - since we maintain the heap, our new node should be added to one of the root's subtrees. Left or right is decided by comparing the tree-keys Et and Rt, since we maintain the search tree as well. To insert the element into the proper subtree, we call the same insertion procedure recursively.
2. Eh > Rh - our new node can only be the new root then (due to the heap keys). What will its left and right subtrees be? Let's assume that Et < Rt. Then the right subtree of our new node will be the former root (it can't be placed anywhere else due to the heap property), and the right subtree of the old root will remain where it is.

Now we're left with two dangling pointers (see picture B) and a subtree to split between them. But note that if we inserted the new node into that dangling subtree (picture C), it would become its root, and its left and right children would fit the dangling pointers amazingly well (picture D)! So to solve the original problem we need to solve it on a smaller tree... Smells recursive!

This leads to a tiny split procedure that, given a treap and a tree-key Kt, splits the treap into two treaps that can serve as the left and right subtrees of a node with tree-key Kt in a binary search tree.
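The split-based insertion can be sketched like this (the original implementation linked below is in C++; this Python version with illustrative names is just to show how little code the idea needs):

```python
import random

class Node:
    def __init__(self, key):
        self.key = key               # tree-key: binary search order
        self.prio = random.random()  # heap-key: random priority
        self.left = self.right = None

def split(root, key):
    """Split a treap into two treaps: keys < key and keys >= key."""
    if root is None:
        return None, None
    if root.key < key:
        left, right = split(root.right, key)
        root.right = left
        return root, right
    left, right = split(root.left, key)
    root.left = right
    return left, root

def insert(root, key):
    node = Node(key)
    if root is None or node.prio > root.prio:
        # the new node wins by heap-key and becomes this subtree's root;
        # splitting the old subtree fills its two dangling pointers
        node.left, node.right = split(root, key)
        return node
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root

def inorder(root):
    """In-order traversal; returns keys sorted iff the BST property holds."""
    out = []
    def walk(n):
        if n:
            walk(n.left)
            out.append(n.key)
            walk(n.right)
    walk(root)
    return out
```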

If you want to delete a node from the tree, you'll need a merge operation that makes one treap out of what were the left and right subtrees of the node we've just erased. This merge is the opposite of split and is just as simple to implement. Red-black tree deletion, by contrast, requires twice as much code as insertion.

### Implementation

The implementation of the insertion machinery is here. Although I hadn't participated in contests for several years, I wrote it from scratch in 30 minutes (the benchmarking code took some additional time), and I didn't even have to debug it. Can you do the same for red-black tree insertion?

Unfortunately, it's hard to fit this small implementation into an OOP design, but in programming contests that was rarely needed.

If we run the script, we'll get fairly good height result on ascending sequence (on which unbalanced trees fail badly):

```
pavel@lonely ~/social/blog/treap $ perl -e 'for (1..10_000_000) { print "$_ ";} print "\n"' >test_file
pavel@lonely ~/social/blog/treap $ g++ --version | head -n 1
g++ (Gentoo 4.4.2 p1.0) 4.4.2
pavel@lonely ~/social/blog/treap $ g++ treap.cpp -O2 -o treap
pavel@lonely ~/social/blog/treap $ g++ treap.cpp -O2 -o tree_STL -DSTTREE
pavel@lonely ~/social/blog/treap $ time ./treap <test_file ; time ./treap <test_file ; time ./tree_STL <test_file
Height is 58 for RB it is 30.8499
real    0m6.545s
user    0m6.116s
sys     0m0.176s
Height is 56 for RB it is 30.8499
real    0m6.069s
user    0m5.908s
sys     0m0.160s
Height not supported :(

real    0m7.667s
user    0m7.472s
sys     0m0.188s
```

While height for RB-tree is about 30, height of a treap is about 50-60. You can run a series of experiments on your own and see it.

### Why is it balanced?

The expected height of a treap is O(log N).

The means by which the treap balances itself as a search tree are Complex Math, and are described in the articles linked from its Wikipedia page.

In plain words: for the tree to become unbalanced, not only should the search keys be inserted in mostly ascending/descending order, but the randomly generated numbers should also happen to arrive in the corresponding order, which is unlikely. And when a heap-key is "out of sync" with the unbalancing pattern, it rotates the subtree for which it becomes the new root, making the tree capable of absorbing more "unbalancing" nodes without a height increase.

### Acknowledgements

Treaps were created by Cecilia R. Aragon and Raimund Seidel back in 1989.