This website just got a new look, a new web hosting, and a new technology that powers it. I’ll post the summary in the coming days. And otherwise, just enjoy fewer lags and 500 errors.

Crash due to division by zero? How about multiplication by zero?

I recently transferred to the next level of floating point arithmetic caveats. Division by zero is something we all know about, but multiplication by zero can be as harmful.

An algorithm computed weight of entities, then converted them to integers, as in

This code may fail the assertion. Imagine y is large (say, 1000), so exp(y) no longer fits double. The value of exp(y) will be +Infinity in this case. Surprisingly, it will correctly understand that INT_MAX is less than +Infinity, and will pass the check as expected. But here's when multiplication by zero kicks in.

What will happen if x is 0, and exp(y) is infinity? Any result will be mathematically nonsense, unless it's a special value of "Not A Number". min() would also return NaN, and integer conversion will happily convert it to... -2147483648. The assertion fails, and the production job crashes because it does not expect the result to be negative. We're multiplying two positive floating point numbers, how can it be negative?

Yet it is. All because of multiplication by zero.﻿

Translucent areas depict waiting for something; incomplete lock statements have dashed border. Note that it doesn't matter in which order the top two acquisitions are made.

But what about Reader/writer locks also known as "shared/exclusive" locks? Let's recap what these are first. Sometimes, to achieve greater efficiency, a mutex implementation supports two flavors of locking: Reader and Writer (otherwise known as "shared" and "exclusive"). If several threads only want to read from a shared variable, there's no need for each of them to wait for others. That's where you'd use a ReaderLock operation on a mutex guarding the variable. If thread wants to write, it invokes WriterLock which means "do not run any readers of writers while I'm holding the lock". Here's a wiki entry for reference, and here's standard Java API.

#### Seemingly OK Reader Lock Execution

We no longer have a "must-before" relation between B locks in two threads, so they don't deadlock. This looks OK, but it actually is not!

So imagine that both threads X and Y happen to use one of the locks as Reader lock? It seemingly should prevent deadlocking: if, say, B is a reader lock, then the execution specified above will make progress: B.ReaderLock() in thread X will not block waiting for thread Y to release it... right? Here's the code for clarity:

Many mutual exclusion implementations make acquiring threads "form a line" of some sort to ensure fairness: no thread should wait forever for a lock. Then a threads that tries to acquire a lock--either shared or exclusive--waits until all threads that called L.WrLock() earlier exit their critical sections. Fairness is especially important when you have reader and writer locks: if you'd allow any reader lock to proceed while there is another reader holding the lock, your writers could "starve" waiting for quiescence among readers, which may never happen on a highly contended lock.

So, to make a reader lock wait on another reader lock, we need a writer lock between them.

Here's how three threads can interleave such that you have a deadlock between reader mutex locks. The "blocked by" relationship between these reader locks transitively hops over a writer lock in some other thread Z.

Assume, that in the execution described earlier, before Thread X attempts to acquire the reader lock B, thread Z chips in, invokes B.WrLock(), and only then X calls B.RdLock(). The second X's Y.RdLock() starts to wait for the Z to acquire and then release B because of fairness concerns discussed above. Z's B.WrLock() waits for Y to release B.RdLock(). Y waits for X to release A. No thread makes progress, and here's the deadlock. Here's sample code of all three threads:

Note that you will have at least one writer lock for B somewhere (because if all acquisitions are Reader Locks, there's no point in having the lock at all.) Therefore, the only way to prevent this kind of deadlock is to not distinguish reader and writer locks when reasoning about progress guarantees.

This kind of deadlock needs at least three threads in order to bite you, but don't dismiss it outright! If the Internet taught us that a million monkeys in front of typewriters will not eventually recreate all the body of Shakespeare's work. they would at least trigger all possible race conditions in our typewriters no matter how contrived the corresponding executions seem.

I used to wonder what this Error 502 meant when I observed when I visited my favorite websites. When I retried the error usually seemed to disappear, and I found it's linked with the service being overwhelmed with requests at certain times. But why do I see it? Why doesn't the web server start many threads leveraging OS capabilities of CPU sharing? If it queues incoming requests, why doesn't it just delay their fulfillment, just making them wait in a line longer? Just like, you know, in any cafe, bank, and public office where people form lines, and get what they needed--eventually?

There are three answers to this. First, small-latency error is more important than serving all requests. Second, there are interesting nonlinear effects that happen when the server is at capacity if you aren't using queueing. Which leads to the third point, if more requests arrive than what you can process in one second, you will not be able to serve them, ever. You have to drop them. Let's take a closer look.

### Fast error is better than slow result

If a service is consuming responses of several other services, it's often in a situation where it can ignore output of one of them, and still provide value to the user. An example in the paper linked above is Google web search that can skip displaying ads, maps, and other additional information, and simply serve search results.

This alone could be a good reason to drop a request, but this doesn't sound that convincing for the frontend server that faces a human end-user. Let's explore two ways of keeping the requests.

### Latency Grows Exponentially If You're Not Limiting Rate

Imagine we're not limiting the number of worker threads, and just start processing everything we receive. In other words, we do not keep a queue of requests, and rely on the operating system to provide fair distribution of computing power among working threads. Say, if we have 4 cores, and every requests takes 1 second of CPU work, we can serve at 4 QPS; this is considered "at capacity".

What would happen if a rate grows beyond 4; say, to 5? Requests arrived at the beginning would not complete when a pack of new requests arrives. At the beginning of 2nd second, 5 earlier requests each would have 20% left, and execution speed would drop twice, so they would be completed only after 1.5 seconds. At the beginning of second 2, requests that arrived at second 1 would have completed in 0.4 seconds, but 5 more requests arrive, decreasing the speed by a factor of two again. The latency for these requests would be 2 seconds. The latency for the next pack arriving at 3rd second would be as much as 2¾ seconds, etc. Here's an illustration on how CPU time is distributed among requests, requests arrived at the same second have the same color:

Let's assess how the latency grows with rate. In other words, let's compute function L(t, r), latency of a request arrived at the moment t if the rate of requests is r times higher than capacity. For instance, if our capacity is 4 QPS, but there are 5 incoming requests, then the rate parameter r = 1.25.

First, the number of in-flight requests at the beginning of a given second N(t) grows linearly with t. Indeed, N(t) < QPS*t; on the other hand, the amount of CPU time owed to previous requests is QPS*t*(r-1)/r, which equals QPS*t*(1-1/r). So, (1-1/r)*QPS*t < N(t) < QPS*t, which means that N is linear function of t, e.g. N(t) = a*t, where a > QPS*(1-1/r).

Second, the CPU time allotted to a request witinh a second [t, t+1) is at most 1/N(t). The latency of the request is then the minimal x such that S = 1/N(t) + 1/N(t+1) + ... + 1/N(t+x) ≥ r. Since N is linear, this sum equals to 1/(a*t) + 1/(a*t + a) + 1/(a*t + 2a) + ... 1/(a*t + xa). From integral calculus we recall we can approximate the sum 1/2 + 1/3 + ... + 1/x as ln(x), so S ≅ 1/a * (ln x - ln t)). Hence, x ≥ t * exp(a*r).

We proved that L(t, r) ≳ t * exp(a*r) where a > (1-1/r), so latency grows linearly with time, and exponentially with rate of incoming requests if we try to serve them all right away.

Practical results, however, seem to suggest that exponent power is sublinear with respect to r. Latency definitely grows greater than polynomially, but the true function in the power remains a mystery. The closest I came to so far is sqrt(r), but let me juggle it a bit first, and I'll share my calculations and graphs. Here's the simulation script; built-in rational numbers in Python are cool, btw.

Newly arriving requests slow down the currently executing ones, prolonging their latency. There is a way to mitigate it: queueing.

### Latency Grows Linearly with Queues

Imagine the most naive solution: first come, first served queue. All incoming requests are stored in the queue, and are served in the order they arrive, handler popping them from the queue as soon as it has capacity (a free CPU share or an idle working thread).

When the server is below capacity, the queue is always empty, and the latency of each request is L0, which is the time it takes handler to construct a response. What happens when the server reaches capacity? The latency of a request that arrives at time t (we assume constant rate r) is L(t, r) = Q(t, r)*L0 + L0, where Q(t) is the queue size when the request arrives. The size of the queue is the number of undone requests, which is t*(r-1). Therefore, L(t, r) = t*(r-1)*L0 + L0, concluding that latency grows linearly with both rate and time with queues.

### You Can't Serve More Requests than What You Can

This may seem obvious, but is often overlooked. If you start queueing requests and serve them later, then you need to have the capacity dedicated to these requests later. Meanwhile, the next second more requests arrive than you can serve that second, and you have these old requests, too. Now you have three options:

1. Serve old requests first, but then your latency will start growing at least linearly as you spend more time over capacity;
2. Serve new requests first, but then old requests will remain in the queue until the first second where more requests arrive. In most cases, this would mean waiting hours as usage pattern are usually circular.
3. Discard some requests without spending CPU time on serving them. Which leads us to error 502.

In other words, we have to drop (r-1) of all requests regardless of which scheduling mechanism we use. Whether or not the latency grows linearly or exponentially is irrelevant: it has to grow, and this is bad. Its exponential, turbulent growth without queueing is just an interesting artefact because you have another reason to avoid starting all the incoming requests at once: it would simply choke OS scheduler.

#### "SF Muni Sign at Home" series

Several post on how I was building a San Francisco Muni sign that shows train arrival times at home.

When I installed SF Muni sign in my apartment, I realized it had contradictory requirements when it comes to positioning. On one hand, I want it to be visible from my couch, where I sit in the morning. On the other hand, I would want it powered off when I sit on the very same spot in a darkened room, say, when I'm watching a movie. Manually turning it on and off (a custom remote control?) would undermine the whole idea of having a fully automated always-on sign. Installing an automated power on/off knob by itself is easy, but what would trigger it?

Note that powering the LED sign on and off was not a part of an official interface, but I resolved it by rendering an empty picture. I don't know how much energy it saves, but since the USB port is cold after a day of it being "powered off," I can guess that a lot.

I thought of using a schedule: say, power it in the morning, and power it off in the evening. That would probably do. This would be at indirect measurement, though. Schedule is not the problem here: the actual problem is the contrast between the lit sign and the background. What if I can measure the luminosity of the room directly? I have a computer that is much closer to hardware than what I used before, so why not install a light sensor?

### Minus Pi

What we need is to somehow read the value of current ambient light, and transform it into digital form for our scripts to read. A light sensor is basically a diode that changes its resistance based on the outside light condition, and we need to read current that flows through its circuit. Unfortunately, the Raspberry Pi itself can not do this: it lacks digital-to-analog converter. Luckily for us, other small Pi-like computers do.

#### Arduino Nano

One of Arduino boards, Nano. I chose the smallest one that had analog input. There are more arduinos available.

### Plus Arduino

One of the computers capable of this is Arduino. Unlike Pi, it does not run as a full-featured Linux machine. Instead, you load up a program from another computer, and Arduino then runs this program. Well, that's not too different from what happens on the Pi, but you definitely can't load Linux on small Arduinos.

### Conclusion

So it worked as a charm. I played a bit with circuitry (see the notes below,) and can now watch TV without bright red LED lights imprinting on my retina. And I have a much weirder and colorful installation on my wall, which is by itself immensely cute as well as useful.

### Programming notes

After I assembled the scheme, I proceeded to install SDK. I didn't try it on my desktop computer, and did this directly on Pi (because why borther?) The apt-get install arduino didn't work, and was showing some error I forgot to write down, but the advise I googled was to update first:

sudo apt-get upgrade && sudo apt-get update
sudo install arduino


Then I launched SDK from a remote console by typing arduino into my console (I made sure I ssh-ed with -X option, like ssh -X pi@192.168.100.100), and had the GUI pop up on my desktop while actually running on Pi. I pasted the program you can find in contrib, and pressed "Upload." It didn't work because I first had to select "Arduino Nano w/ATmega328" under "Tools -> Board" menu. This was all programming required: Arduino ran this program every time it was powered on, but I still needed to collect the digital data. The numbers were printed to the USB serial port, and I could read them either from Arduino SDK or by cat /dev/ttyUSB0 from a console.

This program only made sure that the numbers are printed onto the serial port, and a separate "collector" software that interacts with the sign client was still necessary. The collector script I wrote was communicating with muni sign program by creating a file, thus signaling that the sign should be shut down. I added another initscript and used rc-update to auto-start it.

When I first tried to check if the sensor works, I saw that it changes very abruptly with the luminosity. The value read 830-840 unless I completely shoved it under the cushion (where it read as 0). My physics intuition nudged me to decrease resistance, which I did, and achieved greater smoothiness. Now it could tell dimly lit room from a completely lit one, which was my original goal. I use threshold value of 100, but it will vary with resistor and the sensor you use. Oh, and light sensor is one-sided, so if the value you're reading is about zero no matter how bright the room is, you probably should insert it backwards.

I also couldn't find a way to assign ttyUSB ports, so that LED sign is always on /dev/ttyUSB0, and Arduino's serial port is always on /dev/ttyUSB1, and they get assigned randomly. I just reboot Pi a couple of times until it gets it.

That's all, and I hope this will help me reassemble it again.

# Comments imported from the old website

I'd prefer using some kind of ADC (http://en.wikipedia.org/wiki/Analog-to-digital_converter) to measure analog signal level.

I noticed a pattern in some of my web software architecture failures, and I also notice similar pattern in others' designs. In many cases, people simply misuse memcache, hurting performance of their web applications.

If performance is your goal, you can’t pretend that your computations and rendering is fast, and add transparent caching where it later appears to be slow. Transparent caching is a fairy tale programmers have been telling one another, and we’ve also been writing tools and building hardware to remain in for decades. Turns out, it doesn’t scale. A high-performance web application is designed around serving static content that--sometimes--get recomputed

1. under very specific conditions;
2. only partially;
3. after very specific events;
4. most likely, by a component separate from the one serving the content.

Where would you keep such a static content derived from the raw data (I will later call it ”derived content”)? Memory caches are often unfit for the job of storing derived data: your items may be removed from cache ("evicted") if you don't have enough memory, or if your another web application or component is working within the same cache. And evictions are much more severe than you might think.

Assume it takes your webapp several seconds to recompute the item you want to serve even if it hasn’t changed, and all requests that come during this period are going to have at least these several seconds of latency. In modern multi-component web-services high latency for a small fraction of requests makes much more harm than it seems, as corroborated by this overview paper by Googlers, so you might want to avoid it. Not to mention the simple reasoning that if this eviction happens under heavy load, the stack of HTTP requests waiting for the data to be computed might kill you.

Is there a problem with caching data altogether? Yes, and no. The idea of memoization--storing results of complex computation for later reuse--is perfectly fine. The problem with caching as I see it is that many now view caching exclusively as storing data in unreliable but fast-to-access caches (think memcached). This is not how caches in web applications have to work despite that’s how they usually do.

### Caching Primer

There is another storage you can use for precomputed results, the Database (SQL, or NoSQL, or NewSQL) where you probably store the rest of your data.

Assume you are presenting users with a set of analytics over the data they entered yesterday; say some breakdown of yesterday’s sales by zip code. Say, it takes 20 seconds to compute all the data. Where would you store it?

What you could do is to cache the result of the first request, spending 20 seconds to serve it. You can be smarter, and add a cron job that hits that page every midnight. But if your data gets evicted, your next user will suffer these 20 seconds.

Instead, store the data in your persistent storage (database), the storage query being triggered by the same daily cron job. When the user requests the analytics, just serve it from there; it will never delay by 20 seconds since the data aren’t going anywhere unless you explicitly delete them, which you’ll be wise not to do needlessly. It will take you one database query to retrieve the results, and you can also cache this query in memcache for even faster results. No 20-second latency anymore, just the latency of one database select of a single row. Neat, right? Here’s a lame picture to illustrate this that shows that recomputing-on-write is more efficient as writes are less frequent usually:

### Why People Use Memcache Then

So if the world of using permanent storage for caches is all ponies and rainbows, why isn’t it used everywhere? The reasons are performance tradeoff, abusing unreliability of caches, and simple storage management.

If your computation is as fast as querying storage, there’s no sense to use persistent storage as cache. In rare cases when you have insufficient latency, but overprovisioned on CPU throughput persistent caching will improve your app, but most likely not. Also, if you think of updating cache often in background as triggered by user requests, you need a fast task offload solution that avoids write contention (sich as App Engine’s Task Queues).

#### Abusing Cache Unreliability for Eventual Consistency

Cache expiration algorithms make the state of the cache eventually correct with respect your changing data. To build a correct permanent storage cache, you must make sure that the data in this cache are explicitly updated or deleted if unused. Unreliable cache will recycle stale data for you: if a key is never going to be accessed, it will be simply evicted in favor of more recent keys without occupying extra storage. Impersistence of memcache is an asset rather than liability in such cases. One of my previous posts, about sharded counter, contains an example of this.

The problem with efficient record recycling is that you can’t update record’s access time in the persistent database every time you access it without significant performance penalty (see the previous section on fast task offloading). On the other hand, you implicitly do this on every access with specialized in-memory caches. In some cases, a background recurring thread that performs DELETE FROM cache WHERE created_at > NOW() - 600 will take care of this, but this isn’t always possible.

#### Automatic Storage Management

A memcache can successfully and automatically operate when you don’t have enough memory to store all the results. Its efficiency doesn’t roll downhill with the pace increase because more frequently accessed entries stay hot, while performance suffers only on less “hot” entries hence rarely. Finding space for storage of one extra column per database row is rarely a problem today, though.

### Narrowing Down the Scope

Given the above limitations, storing intermediate results in database makes perfect sense when the cost of computing these results is greater than the cost of database query, and when the data are tied to other database records. Say, storing a pre-rendered version of a blog post or a wiki page that uses a custom markup (especially if it’s badly implemented like the one in this blog) makes sense: you simply update it on write, and remove if the post is deleted. Storing a result of 10-minutes analytics query in a table rather than in a cache also makes sense (and some still rely on memcache to hold such data). Storing a join by key of two SQL tables is probably not worth it.

You need to design your application so that it is capable of making derived data updates after writes. Relying on caching will hurt you in the long run, because your application will be improperly architectured.

#### "SF Muni Sign at Home" series

Several post on how I was building a San Francisco Muni sign that shows train arrival times at home.

Running like this

My android phone wakes me up with its stock alarm clock into a cold San Francisco summer morning. I lie around a bit, but eventually get up to enjoy yet another day. An hour later, I'm running in my dress shoes, tie waving in the wind like that of an anime superhero schoolboy, towards a light rail train stop. Twenty seconds before I get there, the train leaves, right before an unanticipated 20-minute break in the service. I eat a bagel I don't want in a cafe nearby to sit there and work while waiting; my day, having barely started, is already ruined.

Could I have made it? Of course, I could. I could have not re-tied my tie in order to drape it half inch shorter; I could have not procrastinated on reddit for five minutes while sipping tea, I could have postponed paying my bill to the evening. On the other hand, I don't want to finish too early and wait at home instead of in a cafe. I need something to let me know, at all times, when the next train is.

### Observations

San Francisco "Muni" public transportation vehicles, like those of many other modern big cities, are equipped with real-time GPS trackers, and publish location data on the internet. My smartphone has QuickMuni app installed, so I could easily be checking it. I would pick up my phone, unlock it, wait for all the lags Adnroid greets me with, wait while the app downloads the upcoming train times, and repeat this every five minutes. This would be implausible. If only I had something like a wall clock so that it constantly displays upcoming train time predictions...

One morning I located the item I wanted in the wild. Actually, I've been seeing it every day, but was not realizing that it would solve my problem.

That is it! I can mount a similar sign on my wall, drive it with something like Raspberry Pi or even my desktop PC, and write some software to download, process, and display the train arrival times! I can even teach it to show weather forecast to quickly see if I need to take an umbrella.

### The Result

The sign I used is probably not the only one with Perl API: here's something else on CPAN that is probably about these guys, but I didn't try them so I can't vouch that they work.

### Source Code

As usual, I'll begin with the source code, which is on github. Here's a local mirror, just in case The source code of a patched Next Muni gem is here.

#### How to Run

There's a small documentation in the README file, and sample command lines are in the daemon startup script for this spec. Basically, you select a stop on nextmuni.com to find out your stop ID, and run client/client.rb --stopId 12345. Alternatively, you can run a morning dashboard by specifying up to two routes and stops in text, and add an URL to retrieve weather from weather.gov:

### PI configuration notes

During the first run, connect mouse, keyboard, and Wi-Fi card through the USB-hub. Follow up the Getting Started guide to download initial image onto SD card. Boot it up, install the default Raspbian Debian-based distro.

Do select "automatically boot desktop environment on startup" when asked. While you won't need it, it's the easiest way to configure Wi-Fi. Once at the desktop, select "Wi-Fi configuration" under one of the menus accessed via Start button, and configure the wi-fi card you inserted to connect to the router. Make sure you also configure it to start up wi-fi automatically at boot.

Configure SSH server, and make sure it starts up at boot. Either configure hostname, or write down IP address. Reboot and try to SSH from another machine. If this works, disconnect mouse, keyboard, and HDMI: you don't need them anymore.

Make sure git is installed via sudo apt-get install git, download the sources, and follow the README instructions to install language runtimes. Play with script's command line options to figure out what you want to run, and configure a daemon by installing the initscript. Once confident enough in resilience of the setup, mount this onto your wall.

### Results

So the solution helped: I stopped running like a Japanese schoolboy, and started to control my morning procrastination, too. As a bonus, I had a ballpark estimation of how much it should cost taxpayers to program this sign, and, of course, I had a lot of fun, too. Finally, I now have a geeky item in my living room... although it looks to others as a stream of stock prices rather than of Muni arrival times.

### Notes

The rest are less "hot" sections, detailing my experience with the tools I used for the job.

#### Ruby

Raspberry Pi's default Linux, as a Debian derivative, contain decent support for Ruby, including 1.9. which is a bit old already, though. It takes Pi several seconds to load the interpreter and a dozen of gems I needed for this sign, but it's OK since the program only starts up once when the system is powered up.

#### LED Sign

So the sign I used provided a Perl API, so it would be natural to write the client program in Perl, too... not really. While I have a lot of experience programming in Perl, I would much rather prefer writing in a more interesting language, such as Python or Ruby. That was one of the reasons why I wrote a small wrapper over the Perl API, so that I could exec this simple program from any other language.

Another reason why I wrote this API was that the sign does not support two-line text in its built-in text renderer. The API allows you to supply a line of text, and the sign will happily render it, and even add some effects, but I wanted two lines in a smaller font instead. API doesn't have a "render image" function, but the sign allows you to create your own font glyphs, and insert them into text. The obvious idea of printing one 16x96 "symbol" worked. The wrapper does exactly this.

The sign can also loop through several images--just like the real Muni sign. API makes this very easy to program: this happens automatically when you send several messages. Integration with Pi went smoothly: just use \$sign->send(device => "/dev/ttyUSB0"); in your program.

There were several glitches, though. First, each glyph is automatically centered, unless it's greater than about 80 pixels wide, in which case it's aligned to the left. Looping through differently aligned messages looks badly, so I force all images I send to the sign as 16x96, padding them with black pixels.

Second, there is no way to power off the sign programmatically. To power it off, I need to physically remove the USB connection, and turn it off because it starts using the battery otherwise. The sign manufacturer's intent probably was that you don't re-program the sign often, so you would program it once and then turn on and off manually. But we're programmers too; isn't working around things our bread and butter? Displaying an all-black picture did the trick of turning the sign "off".

Third, I refresh the sign each 30 seconds to ensure timely prediction, but it takes 1-2 seconds to update the sign. I suppose this is because the sign contains its own CPU and programming, or because I didn't try to play with send function arguments.

Although its internal plumbing probably affects the update speed, and I saw examples of offloading all the synchronization and flashing of the lamps to Raspberry Pi completely with different, more low-level signs, that would be too time-consuming for me to bother with. I also don't know how long the sign will last when it's re-programmed twice a minute. So far, I'm quite happy with the Bright LED sign I used and its API.

#### Font Rendering

I expected to spend zero time on font rendering, but it took much, much longer than I expected. Well, I was planning to write my own simple font rendering engine, but several hidden rocks were waiting for me beneath the thin sand.

First, I couldn't find a "raster" font that is just a set of bitmaps. I ended up downloading "pixelated" TTF fonts and rendering them with FontConfig library. I didn't want the sign to have a lot of dependencies, so I pre-rendeded the glyphs on my Pi in a ready-to-use manner with this script, and wrote a small library to operate them. The most fun part is that I could easily add new glyphs, such as rainfall indicators.

We also take a lot of font rendering for granted, such as aligning to center, word wrapping, kerning. I implemented some of them in a very simple manner (in Ruby!) but I feel that a couple more features would make me ragequit in favor of a real font rendering library instead.

#### Accessing Muni Predictions

Another reason why I chose Ruby was the Muni library in Ruby, ready-to-use for getting predictions from the NextBus system. Or so I thought: the library was missing some functionality, such as refining routes or getting predictions for all routes of a stop (this is what a Muni sign displays). It also pulled in the whole Rail stack only to display nicely formatted time string! (like, convert interval in seconds to something fancy as "over 1 hour").

I fixed some of these things, and bundled the gem with the original source for your convenience because it was easier than forking it and adding it into an official repository.

#### Specify routeConfig!

Apparently, official Muni monitors operate in the same terms: routeConfig is one of the API commands of NextBus service.

Anyway, NextBus system has a nice API, which is seemingly free to use if the use is limited or distributed among clients. A spawn of this site, nextmuni.com is devoted to SF Muni.

The only thing I didn't find was the actual names of the

#### Weather Forecast Retrieval

Oh, well, here comes the easy part," I thought when I finished with everything else, and decided to throw in weather forecasts. I was wrong in a very interesting way.

Turns out, weather prediction APIs are usually not free. The websites I found at the top of Google search all required signups and including your personal key to all API calls. Many had free limited versions but required payment for hourly forecast--and I was mainly interested in conditions at a specific place at a specific hour of the day (San Francisco weather isn't very stable). Not that I'm opposed to paid services per se, or to parsing web pages with regexps, but I don't need that much from them to pay, do I?

I proceeded with a free government weather.gov service, which is actually kind of cool despite the web design of the last century. The service is limited to the U.S., but San Francisco Muni predictions are not very useful outside of the country either. To use it, you need to provide a link to XML table for your city/area, which would look like this: http://forecast.weather.gov/MapClick.php?lat=37.77493&lon=-122.41942&FcstType=digitalDWML.

I didn't use their API, because it's a full-blown SOAP, and ain't nobody got time for that.

#### Raspberry Pi

Surprisingly, I don't have much to say about this. It just works! I didn't have to use it, and I could have used a desktop PC. The best part about Pi is that it works backwards, too: I successfully replaced desktop PC with it without doing anything differently. The device's operating system is simply a variation of Debian Linux, as found on desktops. I didn't have to debug through COM-port nor use specialized software to upload programs: I downloaded compilers and interpreters directly from the internet onto the device!

And now it's neatly powered from a single micro-usb port, and doesn't hike my electric bill despite being always on.

#### Future work

That's all I have to say for now. I need to add a couple of things, such as Muni Messages that remind users about non-working elevators and fare increases, to make the Muni experience complete. I also need automatic sign power-off when unused. I heard, there are motion detectors for Raspberry Pi, but a simple schedule would be nice, too.

# Comments imported from the old website

Pavel Shved on 23 April 2014 commented:

Here's a python version of similar code https://github.com/metral/led_sign/ . Mike Metral created it in order to implement a sports ticker in Python; here's his blog post https://medium.com/p/875c73a5339 .

Many programmers complain that most of what they've learned in their universities is completely useless in their daily work. Anecdotal evidence, a survey among some MIPT alumni, suggests that the median fraction of courses actually useful to a former student is about 5%. That's like taking one class per week instead of whole week of studying, so could you simply follow those wise brats who indeed spent the whole week smoking weed and boozing?

I'm sure that you simply should look closer. In this post, I tell how a portion of academic knowledge appeared in my work out of the blue.

### The Challenge

Last week, we've been trying to improve concurrency of a multithreaded messaging system. When multiple threads wait for incoming messages with distinct labels, one and only one of them has to do the work of parsing the incoming message and directing it to the thread that actually waits for it. Choosing a designated parser thread was not possible, so the waiting threads should somehow assign the work to one and only one of them.

This is not a hard problem, but we had some minor block when we first approached it. So I suggested to step back and take a look at the theory.

### That Sounds Familiar

I recall that making several threads designate one and only one of them as a "worker" is called "consensus problem" (Wikipedia provides) a slightly different but equivalent definition.) I also recalled, from my reading of TAoMP ("The Art of Multiprocessor Programming" by Herlihy and Shavit, link), that it was used to demonstrate limitations of various synchronization primitives, and most of them were incapable of solving it. That's right, in the world of formalized computing, some software problems can be proved computationally infeasible. An by "infeasible" I don't mean making a decent AI in first-person shooters, which is definitely infeasible, that's more like Turing machines and Halting problem. So maybe "mutex" is one of the tools that can't solve Consensus?

### Who Can and Can Not Solve Consensus

A concurrent object is wait-free if there exists and upper limit N on time/instruction count it takes every call to complete, such that the same limit remains in effect as the number of thread grows indefinitely. Very few concurrent algorithms satisfy this property, and those who do are considered impractical.

Chapter 5 of the book is devoted to assessing relative strength of synchronization primitive operations, and it uses their capability to provide a wait-free solution to consensus problem as a measure of their strength. It later turns out, in chapter 6, that Consensus problem is universal enough to construct a wrapper that turns any single-threaded object into a thread-safe one. So, which synchronization primitives are universal?

#### Consensus vs. Registers

Assume you only have “simple” memory cells or registers at your disposal, which can turn either way when multiple threads write to them simultaneously. Given these, you only can solve Consensus in a wait-free manner for no more than one thread. (In other words, they are completely useless).

The complete proof is sketched in the book. It's constructed as follows. Assume Consensus can be solved for two threads. At one point, the memory state must be such that the next operation over one of these registers in some thread will determine the outcome. Then all the cases of these operations are enumerated, and viable executions, which usually end up with one thread “running solo,” are constructed. They lead to a situation where the algorithm should return different results based on the same thread-local data, which can't happen.

In particular, this can't happen if you're trying to solve Consensus for two or more threads with simple registers. It is scientifically thus proven that one needs more sophisticated synchronization primitives than just reads and writes to one memory cell. Which could these be?

#### Round Robin Table

That's a sample round robin table with some competition results. Threads fill a similar data structure concurrently when they solve Consensus with multi-cell atomic writes.

If you can atomically write to N*(N+1)/2 memory cells rather than just one, you can solve Consensus for N threads. The impossibility of solving it for more than N follows the same scheme as outlined in the previous section, but the sample solution is very interesting.

Imagine that threads compete in a round robin tournament; well, not really compete, but simply fill the table with its results. Each game is either a win or a lose, and the score is a number of wins against threads that played at least one game.

When a thread participates in the Consensus, it atomically writes “I lose” in N-1 cells corresponding to its results; it also writes something in the “against itself” cell to distinguish itself from those who did not participate yet. If all threads do that, then, whenever somebody examines the table and computes the score of each thread. the thread who has the highest score will never change. The highest-score thread will always be the same one even if the reads are not atomic, and results change as one is reading the table.

#### Consensus vs. Complex Data Structures

Section 5.4 of the book demonstrates a proof that synchronized FIFO queues can't solve Consensus for more than two threads. It also claims that the same is true for “many similar data types, such as sets, stacks, double-ended queues, and priority queues,” which I don't have any reason to not believe.

#### Consensus vs. Atomic Integer Operations

By “atomic integer operations” I mean such functions as swap(&x, v) (set x to v and return the previous value) and compareAndSet(&x, e, u) (set x to u if x == e; do not change otherwise; return whether the value was equal to e) and others. Infrequently used, they are somewhat known to system programmers.

Turns out, these operations are not created equal. For instance, swap can't solve Consensus for more than two threads, while comareAndSet can solve it for infinitely many threads with a very simple algorithm:

The book begins with demonstration how one can implement mutexes using only simple registers. I formulated one of such implementations as a puzzle in my previous post. If we could use mutexes succesfully to solve Consensus, we could have replaced them with register implementations, which is impossible. Does it mean that we can't solve consensus with mutexes for more than one thread? Well, what about this simple algorithm?

More than this, we can implement the all-powerful compareAndSet with a mutex by wrapping the comparison and setting into a mutex' lock-unlock. Yet the book claims it's not possible. How come?

### Different Flavors of “Solve”

A concurrent object is starvation-free if every call to its method eventually completes regardless of how threads are scheduled. For instance, an algorithm that hangs if only one thread is (unfairly) scheduled, while others sleep indefinitely for no reason, is not starvation-free per se, but may be such if OS promises not to do this, and be fair. It is weaker than wait-freedom because it doesn't require a global bound on the call length.

The mutex-based consensus algorithm is only correct if mutex has strong enough progress condition, for instance, if it's starvation-free (wait-freedom is not required here; see the last exercise to the Chapter 5 in the book). Mutexes implemented with simple registers are not starvation-free hence they don't completely “solve” Consensus. The theoretical results on impossibility given above only provide proofs for wait-free algorithms, however, it depends on the actual implementation of mutex whether the resultant algorithm will be wait-free or not.

It indeed is a corollary of the results discussed above that you can't make a starvation-free lock with simple registers only. Threads in such locks usually “spin” waiting for the memory to enter a state that signals this particular thread (and only it) to enter the critical section. If, at this point, some other thread “runs solo,” you are stuck forever. If you apply the register "weakness" proof scheme to a register-based mutex, this infinite loop will eliminate the contradiction that same thread-local data will lead to different results: they'll lead to endless spinning instead. Notice how that's not the case for compareAndSet() solution to Consensus presented above.

So the resolution to the paradox described above is that the mutex-based program is a solution only if the underlying mutex is starvation-free--in addition to simply being a correct mutual exclusion algorithm. To implement a starvation-free mutex, you need to have compareAndSet or similarly strong primitive operations, but they may be concealed within the lock or the OS implementation, and don't have to appear in your program explicitly.

### Consensus vs. Humans

#### Consensus and Humans

Here's how choices of individual subjects changed over time during the experiment on how humans solve Consesus. Read more here.

Another interesting example of a computational system that can's solve consensus problem is a social network of human beings. A group of researchers conducted an experiment where several people were given material incentives to reach a specific consensus (an equivalent of threads proposing their own value), and were set out to reach one. They could not communicate in any way except for announcing the current value each person agrees to to their neighbors, i.e. they didn't have a designated leader or a common strategy.

The experiments demonstrated that people can't reach consensus in this settings. An interesting finding, though, was that the subjects quickly created a "bipartisan" system with only two colors, and started to juggle them around. Sometimes, new colors appeared as an attempt to unite under a different color, but these attempts quickly died off. Doesn't it remind of anything?..

### Unsurprising Conclusion

So, it turns out, the theory didn't help us much with the original task. It helped us understand that thread nomination is a separate, studied problem that can be solved given the right tools. However, our initial choice of tools, the "mutex" was correct, and the knowledge of theory couldn't improve our initial choice. I'm glad it made me read "The Art Of Multiprocessor Programming" book more thoroughly than I otherwise would, and I do not consider it a waste of time. Multicore computing is getting more ubiquitous, and knowing the theory won't hurt.

I wrote earlier that I know very few decent programming puzzles. To be more precise, I know one, The Prisoners Riddle. So I was reading that The Art of Multiprocessor Programming book (I already made a post influenced by what I read there, and I'll soon make another one--stay tuned). Some very interesting things the book tells about can be formulated as puzzles. Here's one.

Two threads need a mutex object with Lock and Unlock methods, but they only have three usual shared boolean variables. How to implement such a mutex?

Every thread can call GetThreadId() function which returns 0 for one thread and 1 to the other, and the boolean variables are automatically initialized to false before threads execute.

If you solved it, congratulations; you have just reinvented this wheel.

Since I know the solution, I can't really understand if it's a good riddle or not. It's definitely too hard for a job interview, but still, it can be a good one. So I need your comments. What do you think? Could you solve it?

As you already know, binary search is a O(log N) way to search if a particular value exists in a sorted array. There are some reports on how "most of our implementation" of binary searches are broken. The mistake discussed in that post is the overflow when calculating the index of the middle element. But regardless of whether you get this bit right, there's a different problem with all binary search implementations I've seen.

They're just too damn complicated and hard to get right out of your memory.

### Key points

Almost always, if you're relying on your memory, you should prefer simple algorithms to complex, like you should implement a treap instead of an AVL tree. So here's a sequence of steps that helps me to write binary search correctly.

#### Bisection instead of "binary search"

Starting from this point, we aren't searching for anything. Searching for an element in array has a too complex definition. What should we return if the element doesn't exist, or there are several of them equal to each other? That's a bit too confusing. Instead, let's bisect an array, pose a problem that always have a single solution no matter what.

Given a boolean function f, find the smallest k such that, for all i < k, f(a[i]) == false, and for all j >= k, f(a[j] == true). In other words, we are to find the smallest k such that for all i within array bounds f(a[i]) == (i<k). If the array is sorted in such a way that f(a[i]) is monotonic, being first false and then true, this problem always has a single solution. In some cases, when f(a[i]) is false for all array elements, the resultant k is greater by one than the last index of the array, but this is still not a special case because this is exactly the answer as defined by bold text above.

This is more powerful than the binary search, which can be reduced to this problem I'll call here "bisection". Indeed, to find an element x in the array a of length N, we can do the following:

Now let's get back to solving the bisection, once we've defined it.

#### Well-defined bounds

The answer to our bisection problem is between 0 and N inclusively, so we will start by assigning i to 0, and j to N. No special cases allowed: we can do without them at all. At each step, the answer we seek will be between our values i <= k <= j, and we will not know any information that would help us to reduce the range size.

#### Non-overflowing pivot point

It's easy to make an overflow mistake in computing the pivot point; i.e. the point at the middle of the array we'll apply f to to make a step. This problem is discussed in the article I linked above more. If we were in the world of non-bounded integers, it would simply be (i+j)/2, but we live in the world of integer overflows. If i and j are positive integers, which makes sense for array indexes in C++, the never-overflowing expression would be:

Another useful property of this expression (in C++, at least) is that it's never equal to j if i < j. Here's why it's important.

#### Well-founded termination

A basic way to make sure that your program terminates some day is to construct such a "loop variant" that strictly decreases with each loop iteration, and which can't decrease forever. For our bisection, the natural variant is the length of the part of the original array we currently look for the answer in, which is max(j-i, 0). It can't decrease forever because its value would reach zero at some point, from which there's nowhere to proceed. We only need to prove that this variant decreases each iteration by at least one.

Fun fact: special programs can prove that a program terminates automatically, without human intervention! One of these programs is named... Terminator. You can read more about this in a very accessible article "Proving Program Termination" by Byron Cook et al. I first learned about it in a summer school where Mr. Cook himself gave a talk about this.

First step is to make sure that the time it reaches zero would be the last iteration of the loop (otherwise it wouldn't decrease!), so we use while (i < j).

Now assume that we've understood that the pivot point s does not satisfy the target function f. Our key difference with most binary search algorithms is that we should assign the new i to s+1 rather than s. Given our definition of the problem, s is never the answer in this case, and we should keep our candidate range as small as possible. If f(a[s]) is true, then s could still be the answer, so it does not apply here. Here's our iteration step now:

We mentioned in the previous section that i <= s < j, hence i < s + 1 and s < j, so each of assignments decreases j-i by at least 1 regardless of f's value. This proves that this loop never hangs up forever.

### The complete program

So here's our bisection that works for all sorted arrays (in the sense of f(a[i]) being first false then true as i increases).

The search itself in terms of finding if a value exists in a sorted array, would be then written (I'm just repeating what I wrote above, no surprises here):

### Proof By Authority

And GNU C++ STL implementation of binary_search follows the same guidelines I posted here. In my version of GCC 4.3.4 binary search looks like this:

Where lower_bound is the bisection function.

### Conclusion

Here's my version of bisection and binary search I usually use when I need one. It is designed to be simple, and simple things are easier to get right. Just make sure your loop variant decreases, you do not overflow, and you get rid of as much special cases as possible.

# Comments imported from the old website

Pavel Shved on 13 July 2013 commented:

Many thanks to Martin Ward who pointed to inaccuracies in termination proof description.

The binary search is a little bit more complicated than usually presented. An exercise that can help a lot in writing a correct version of binary search is given here http://arxiv.org/abs/1402.4843 I highly recommend looking at it. Without it, it looks like guessing game. If we exclude recursive solution, there are no more than 4-5 correct implementations of the algorithm. All solutions including the one you gave can be derived from the table given in exercises. (Actually I am collecting implementations and showing how to do that.) Your example is optimized so it is not a good starting point. j in your code is keeping the index which is one after the last element in the selected subarray. You need to explain this explicitely. GNU code is optimized for several reasons, one of them is speeding up on assembly level. I am saying this is an advanced level what you are writing about. Nobody can use it as a starting point.