Recent blog updates

Shown here are only posts related to maths. You can view all posts here.

Consistency Models Explained Briefly

Wikipedia does a ridiculously bad job at explaining different consistency conditions of multithreaded/distributed systems. Just take a look at the articles, and see how inconsistent the consistency model descriptions are. I'll try to get them together.

What is a Consistency Model?

When you're describing how your distributed system behaves, you usually specify something named a "consistency model". This "model" describes how distant the behavior of your system is in multi-threaded or multi-machine environment from the "ideal" behavior.

Assume that you're implementing a data structure (a set, for instance; say, you can add/remove elements, and check for presence). The specification (model) of its behavior usually implies that the operations are performed sequentially. For instance, a set indicates a presence of an element if ant only if it was added before the presence check, and not removed after that. The terms "before" and "after" imply that there is an order in operations over that structure. While it is straightforward to know the order of operations in a single thread, it gets unclear if there are several threads or machines:

  • Calls may overlap. If two threads call methods on the same structure at approximately the same time, the calls may overlap because they don't happen simultaneously in real world. In what order the calls took effect is not clear, however, because thread scheduling or network connection introduces arbitrary delays in processing.
  • Calls may not take effect immediately. When a method call returns, it is sometimes not clear when the operation actually "takes the effect." By diluting the notion of "taking the effect" one may achieve increase in speed. I.e. the elements added to a stack may not be fetched in the exactly same order, but the operations could be faster. The rationale behind this is that the "order" of events that happen in different parts of your cluster is not that well-defined and meaningful in the first place. I talked about why this is natural in my other post "Relativity of Simultaneity in Distributed Computing.

Object-Oriented Model and Executions

A consistency model usually assumes Object-Oriented approach to describing a system. Assume that we have something (say, a set), and its internal state can be observed and altered via a well-defined set of methods only (say, add(elem), remove(elem), and bool in(elem)). The OO approach assumes that internal state is concealed, and it has no meaning to discuss it per se, and we do not define consistency models in terms of object's state. Instead, we restrict the values these methods may return based on when other methods were called in different threads or machines.

In practice, method calls don't happen instantly, and our models totally acknowledge this. The set of method call times, the times it takes to complete each, their arguments, and their return values is named an execution. Executions are usually depicted as intervals plotted at lines. The parallel lines represent how time simultaneously passes in all threads. The intervals are the method calls, the beginning corresponding to the time the method was called and the end when the method returned. The call arguments and the return values as well as the method and object name are usually written near the method call. Here's a sample execution of a set object:

We will omit the actual call values further because consistency models do not operate in terms of actual arguments and values, but rather try to "map" concurrent executions to sequential ones. In our further diagrams we will also assume that all methods are called against the same object.

Since consistency model imposes restrictions on an execution, the latter may match or not match a consistency model. A data structure adheres to a consistency model iff every execution over its methods we may observe adheres to it. That's what's hidden behind such phrases as "sequentially consistent stack".

No Consistency

The most important consistency "model", is, of course, no consistency. Nothing is guaranteed. The methods called on the object may observe anything. We may add two different elements to an empty set, and observe that the size of set is three, and none of these elements is what we added.

All other consistency models are opposite to the No Consistency in the following sense. They all assume that the method calls may be re-arranged in a single-threaded sequence that is correct with respect to the object's specification. Let's call it arranged correctly in the rest of the post. For example, a set of pets that adheres to any consistency model described below can't tell us that it contains a cow, if only dogs were added to it.

"Multithreading" Consistency Models

What I discovered is that consistency models are divided into two categories: that lean to multithreading (shared memory multiprocessor systems), and to "multi-machine" (distributed systems). I find the former category more formal and mathematical; these models can be used to describe behavior precise enough to allow math-level, rock-solid reasoning about system properties. Some "distributed" consistency models are less precise in their definition, but are widely mentioned in documentation and description, and do allow users to make informed decisions.

Let's start with the "multithreading" ones.

Strong Consistency aka Linearizability

An execution is strongly consistent (linearizable) if the method calls can be correctly arranged retaining the mutual order of calls that do not overlap in time, regardless of what thread calls them. (See "No Consistency" section for definition of "correctly arranged").

In other words, if two calls overlap, then they may "take effect" in any order, but if a call returns, then you may be sure it already worked.

A more anecdotal definition is "before the method returns, it completely takes effect consistently with other threads." This is also usually called "thread safety" unless more precise definition of consistency model is specified. This is quite a stringent restriction, especially for distributed systems, where threads run on different machines.

Two linearizable executions combined also form a linearizable execution. This is an important property, since it allows us to combine linearizable components to get strongly consistent systems. This is not true for other models, for instance....

Sequential Consistency

An execution is sequentially consistent if the method calls can be correctly arranged retaining the mutual order of method calls in each thread.

The calls in different threads can be re-arranged however you wish, regardless of when they start and when they end.

An example of a sequentially consistent system is a naive notion of RAM in multi-core processor. When different threads read from the memory, they actually read from a cache or a register, which is synchronized with a common RAM at writes. While values read and written on different cores may not be immediately synchronized (cores may diverge in the notion of what value a cell contains), the values read and written on the same core are always consistent with each other. This is naive, though: even this guarantee is relaxed by modern compilers, and accesses to different variables within one thread can be reordered (as far as I recall, ARM platforms are guilty of this).

You might notice that the definition of sequential consistency does not directly prevents methods to return values from the future (i.e. to read what will be written later in a different thread). Yet I claim that RAM is sequentially consistent; how's that? The thing is that each execution should be sequentially consistent. If read and write do not overlap, and write follows the read, we may just stop the program between them. Since each execution should be consistent, the value read should be written before in some thread. Therefore, this "future prediction" is not possible in consistent data structures (not only those that are "sequentially" consistent) even that the definition doesn't directly prevent it.

However, if we combine two sequentially consistent executions, the result won't be sequentially consistent. But there is another model, weaker than linearizability, that makes it possible.

Quiescent Consistency

An execution is quiescently consistent if the method calls can be correctly arranged retaining the mutual order of calls separated by quiescence, a period of time where no method is being called in any thread.

This consistency condition is stronger than sequential consistency, but is still not strong enough that we usually expect from a multithreaded system. See my other post for a more in-depth example of a quiescently consistent stack. The reason to use this model is to attain a higher degree of concurrency and better response time in return of weakening consistency a bit.

Quiescent consistency is not as esoteric as you might think, especially since you barely heard about it. Here's an easy way to imagine a quiescently consistent system: assume that all writes are cached, and the cache is flushed to a persistent storage when there are no ongoing writes (or after it reaches certain size). There are special data structures that are quiescently consistent.

Quiescent consistency components can be combined to form quiescently consistent components again. Note also, that strong consistency implies quiescent consistency, hence you can combine these two kinds of systems, and the result will be quiescently consistent.

"Distributed" Consistency Models

Consistency is not Fault Tolerance

It may be tempting to mix the concepts of "consistency" and "fault tolerance". The former is how object reacts to method calls depending on the calling thread (or, in other words, how data are seen by multiple observers). Fault (or Partition) Tolerance describes how system behaves when system becomes partitioned, i.e. messages are lost, and/or parts of the system get disconnected from the network, or form several networks. These concepts are orthogonal to each other: an object can have one of them, both of them, or neither.

It will be, more likely, one of them. There is a scientific result known as CAP theorem that states that a system can't at the same time guarantee:

  • Linearizability
  • Availability (whether RPC calls send replies)
  • Fault Tolerance

In this post, however, we will focus on consistency rather than other properties of distributed systems. Note that multithreaded systems due to close proximity of their components, and less versatile medium that connects them, do not have to worry about fault tolerance...at least, at the current state of art.

The following models are usually seen in the context of distributed systems, i.e. those that reside on multiple machines, communication between which is slow enough. To bring Object-Oriented abstraction to distributed systems, programmers invented "Remote Procedure Call" (RPC) pattern. Method calls (object identifiers, method name, and arguments) are encoded into messages that are then sent across the network to specified destination, which may be the part of the encoding too. Method calls that do not expect a reply (say, unreliable "fire-and-forget" writes) may be treated as instantaneous.

Eventual Consistency

A system is eventually consistent if the method calls can be correctly arranged retaining the mutual order of calls separated by a sufficiently long period of time.

The eventual consistency definition—as well as its understanding—is specifically vague on how much time it would take, and whether this time depends on the call history. We can't really tell how soon an eventually consistent set will respond that an element X exists in it after it was added. But "eventually" it will.

This looks like a quite useless consistency model, since it doesn't actually restrict anything mathematically. Why is it used then? Exactly because of its lack of restriction. Users usually expect some consistency from a system, and if there's actually not much of it, something should be said about this. Eventual consistency is actually a limitation of liability; it's used in documentation to convey that the system can't guarantee that different threads become in sync immediately. However, the system is engineered in such a way that they eventually take effect. Here's an example (here's more on eventual vs. strong consistency)

Some authors, according to wikipedia, require that no method calls should be invoked during the "long period of time" for system to get in sync. This actually sounds like a stronger version of quiescent consistency, where a qualifying quiescence should be long enough.

Strict Consistency

An execution is strictly consistent if the method calls sorted by completion time are correctly ordered.

This is as simple and powerful as it's impossible to attain in most practical systems that have more than one thread. First, it order calls according to "global" time, which is a purely mathematical abstraction unachievable in real life, both according to Physics, and to the theory of distributed computing. Second, any implementation I might imagine would essentially disallow multithreading per se.

This is my rendition on the more "official" definition (1, 2), which defines a strictly consistent system as one where any read operation over a certain item returns the value supplied to the latest write operation according to global time counter. I don't like this definition because it needs to distinguish between "reads" and "writes", and introduces the concept of "data item"—i.e. it breaches the object-oriented approach. Still, I'm sure that if the system really consists of "data items," the definition at the beginning of the section would work as well.

Serializability

An execution is serializable if its methods in each thread are grouped in contiguous, non-overlapping transactions, and the methods can be correctly ordered in such a way that transactions are still contiguous.

In other words, the outcome of a sequence of method calls can be described as a serial execution of transactions, where methods of different transactions are not mingled, while the transactions actually overlap in time across different threads. The useful outcome here is that calls within overlapping transactions can be reordered, even if they don't overlap. However, "serializability" is not enough to describe the system: while it limits reordering of methods across transactions, it is unclear to what extent transactions can be reordered themselves. This is a job for a different consistency model. And serializability is usually used to "remind" about this important property of transaction-based systems.

Practice

Some systems provide several levels of consistency for operations (like the AppEngine's High-Reliability Datastore linked above). You should understand what tradeoffs each model has to make the decision best suitable for your specific application.

Most system you encounter will be not consistent at all ("thread-unsafe"), or linearizable (this is what usually understood by "thread-safe"). If the consistency model differs, it will be noted explicitly.

Strictly speaking, most systems that claim linearizability will be not consistent due to bugs, but these nasty little creatures aren't suitable for careful scientific discourse anyway.

Now that you know what consistency models are out there (and that they actually exist), you can make informed decisions based on what's specified in documentation. If you're a system developer, you should note that you are not bound to choose between linearizability and "thread-unsafety" when designing a library. The models presented above may be viewed as patterns that enumerate what behaviors in multithreaded context are considered universally useful. Just consider carefully what your users need, and do not forget to specify which model your system follows. It will also help to link to something that explains what lies behind the words "somethingly consistent" well--like this post, for instance.

Read on | Comments (1) | Make a comment >>


Performance metrics and parallelization

Some day you suddenly realize that a powerful PC is not cool anymore. Everyone works in clusters today! Clusters make our programs more fast, and if you goal is to fasten quite a long computation, they're of help.

Clusters of independent "computers" (in a broad sense) can scale differently: from several thousand tiny units on your GPU to thousands of different machines. And after you discover that you have a cluster available--or just several free PCs noone uses--you also meet the problem.

Programs suffer from insufficient attention paid to design them to run their parts concurrently. Probably, it's the right way to go, since the business value of early parallelization is not apparent. So the programs grow and their workflow ossifies. And when you finally find a parallelized medium to deploy the program onto, it's already too solid.

You have to cut your workflow to smaller pieces that may be computed independently. But where should you cut?

Performance metrics

To answer the question "how to break?" you must first answer "why?"

"To make my programs fast!" you answer, but speed may be understood differently! Here are two different "speeds", which one you pick?

The terms "throughput" and "response time" were adapted by me from different domains, and I'm not really sure that they're applicable here. If you know better terms for the concepts described, please, let me know!

  1. Given a lot of time, and having run the program a lot of times, maximize the number of fully completed tasks per time unit. I call this speed metric "throughput"
  2. Having run the program once, get final answer as fast as possible. I call this metric "response time"

Different tasks need different metrics. You should explicitly think about which one is the most suitable.

These metrics are very different, and hence they need different programs. Here's an extreme example. Assume you have N processors and your task can be divided into N independent sub-tasks of length t. Ideal distribution for response time looks like this:

The time needed to complete the task is t, and you can't make it any less.

However, here's a not any less ideal task distribution for throughput:

This distribution is not suitable for response time. Ideal distribution (shown by teal rectangle) yields response time nearly twice as fast.

However, for throughput it's not a problem, since we measure it for many runs of program. And for many runs this schedule is packed quite tight:

After we understood the difference, let's talk about response time metrics.

How to maximize response time

Each task of total length T has an unachievable ideal distribution across the N parallel machines. It was shown earlier by a teal rectangle; time it takes to get a response from such a computation is t = T/N. We can't make it in less time.

Assume that we created a task scheduling algorithm that completes in r time. The estimate of how bad our scheduling is could be estimated as (r - t)/t. We assume that our task has the first priority among others (i.e. the algorithm tries to finish it as soon as possible), and than the chunks are independent. These assumptions are not so uncommon, most systems that first compute a long list of independent tasks, and then execute it, fall into these assumptions.

Since the task is of first priority, then (r - t) < Δt, where Δt is the maximal length of the independent task (proof: if the inequality wasn't true, the last finished task should have been scheduled earlier). Here's the pic that demonstrates what we're talking about:

So, again, the badness of our algorithm is estimated as Δt/t; we're to minimize it. And there's just one straightforward way: to achieve the best response time, we should make independent tasks as small as necessary. We can stop dividing into chunks after we reach some acceptable badness (10%, for example); it's neither obligatory nor sensible to make the chunks as small as possible.

Conclusion

That's all I wanted to say. Perhaps it seems kinda obvious. But what shown here is the way to devise a good plan of action based on just setting a sane goal and doing small bit of maths (and maths do help in reasoning about real life). Actually, in the meeting I didn't draw formulæ, I just drew the pictures you see above--and we reached the agreement after that.

In further posts I'll pour upon the other performance metrics.

Read on | Comments (4) | Make a comment >>


"NP-complete!" as a lame excuse

Some time ago I bumped into a usual Stackoverflow question. A guy asked for a C# algorithm that could pick elements from array so that their sum is equal to a specified number. A usual NP-complete knapsack problem. But the answer made me think about an interesting matter. Let me screenshot the answer completely:

At a first glance, the answer contains everything an ideal answer should contain: a correct information, a certain bit of succinctness, and, a key to success, an XKCD comic. No wonder it was so highly upvoted.

However it was totally unhelpful.

Complexity classes

In Computer Science a concept of complexity class is used to define a class of problems, for which there exists an algorithm that runs on specified kind of abstract computing machine and uses specified amount of machine-specific resource. For example, famous NP class is defined as "set of problems that can be (a) solved on non-deterministic Turing machine (b) with use of polynomial number of steps with respect to the length of input". P class is the same but its abstract machine is a deterministic Turing one. The famous question of whether each NP problem is also P is still open.

There is a lot of more tricky classes (PSPACE, for example, requires polynomial "memory"--maximal length of line of a Turing machine), which can even be parametrized (PCP(n,m) (probabilistically checkable proof), for example). The relationship between various classes is being studied, and aids the relevant research; here's a picture with some of them:

complexity classes relationship diagram

(pic from «Эффективные алгоритмы и сложность вычислений» book by Nikolai Kuzyurin and Stanislav Fomin (pdf, site; published under OPL license))

It clearly shows that the knowledge about complexity classes made its way into the minds of casual programmers. But instead of a challenge, it became a lame excuse.

What a casual programmer thinks when he encounters an NP-hard problem is, "Oh, it's NP-hard, so no known polynomial solution exists!". Some programmers even try to quickly make an algorithm that might solve something in specific cases, which they don't even realize. While what should be thought of is, "Oh, it's NP-hard, so no known polynomial solution exists! Let me derive constraints, and search for an approximate solution that fits it best!" Complexity theory statements and theorems should aid the solution, not substitute it!

Funny approximations for NP-hard problems

Honestly, I was among the programmers of the first type, as defined above. I used to being proud that I know that much of maths and was glad that I can blow the hopes of solving their problem. However a year ago I've taken an advanced complexity course (by guys from one of research groups in ISP RAS -- "group of Discrete Optimization Algorithms at Department of Mathematical Methods" (in Rus)) that actually revealed to me several amazing facts. Some complex NP-complete problems appear to have good approximations! Let me list several of them.

  • Edge-dominating set (wiki) -- assume we want to penetrate our adversary's network (which is an undirected graph). We should pick several channels (edges) to insert a virus into, and the hacking software will capture all information that passes through any of each channel's ends. We should cover the whole network with such "hacked" channels, so that each unhacked one is plugged into a node with a hacked channel. But at the same time, we should do it with minimum cost, i.e. minimum number of channels to hack. See figure A at the side for examples.

    Figure A

    Examples of edge-dominating sets from wiki page. Red edges are infiltrated channels. You may see that each non-hacked edge is touched by at least one hacked edge at joint router.

    The task of picking such a minimum set is NP-hard. Brute-force? No, there is a better idea. We don't necessarily need minimum (although it would be nice), but instead we can hack some extra channels--the budget is not fixed anyway. We could use some clever algorithm to pick such edges, but instead... let's just hack arbitrary channels that aren't adjacent to already hacked ones!

    What amazed me is that such a "solution"... will never exceed the real minimum number multiplied by two! "That's still too much", might the Greedy Manager have said, and he would be right. But the thing is that we even didn't try to solve anything: the dumbest algorithm yields a good, constant-factor approximation for a problem with yet unknown efficient exact solution. Hell, it's even easier to implement than a brute-force!

  • Knapsack problem (wiki) -- assume a burglar infiltrated a house. Unfortunately, he has a relatively small knapsack, into which he can't put everything of value. Possessing a discriminating eye, he can immediately determine how much the fence will pay for a thing. His aim is to maximize the cost of the stuff carried away, the overall weight being less than his knapsack may bear. But unfortunately he's also limited on time and has to do it quick.

    Complexity theory proves that such task is NP-complete. So the burglar might pack every combination of items into his sack until he finds out the best. He would have been caught if he did that, because the owner would have returned by the time he completes the enumeration. So he chooses another solution. He uses a "greedy algorithm" and pick first a thing with best price/weight ratio, among those that still fit his sack.

    A fool! He didn't know that it was the house of computer scientist, who knew that thieves are dumb and greedy, and who intentionally bought a piece of jewelery with best price/weight ratio in the house, but that piece doesn't allow any other thing to be put into the sack due to overweight. For arbitrary N, this trick can make the burglar take away N times less stuff than he could at most!

    But a simple adjustment to its algorithm can be made to limit the possible loss to "at least 1/2 of maximum value". The burglar should do the same, but at the end compare the value of picked best price/weight items with the value of the most expensive item he can carry away. If it turns out to yield more profit take this thing and run away.

    This is an example that a small hack can guarantee a good approximation.

  • Longest path problem (wiki) -- assume we want to make our competitor lose money, and thus we try to delay shipment of a valuable package. We can make a janitor agent into the dispatcher's room, so he can see the map of the area (an undirected graph) and use the special phone to guide the courier by a longest path between source and destination (without cycles, of course, since the courier will suspect something and report it). He must do it quickly, but, unfortunately, picking such a longest path is an NP-hard problem.

    We've seen approximations by constant factors before, but this task, unfortunately, can't have an approximating polynomial algorithm for each constant factor. Of course, that's true unless P = NP. But the fact is that its polynomial inapproximability by constant factor is proven unless P = NP. So we'll have to teach the janitor to yield worse approximations if we want him to succeed. And conclude that sometimes NP-hard problems just do not have simple hacks that solve them well.

There are more facts that concern randomized algorithms, which have a certain probability to hit a good solution; facts about more tricky, non-constant approximations, but they're far less fun and are more like boring math crunching, so I won't mention it here.

Conclusion

Albeit whether P=NP is still an open question, complexity theory contains many valuable achievements on solving everyday problems.

Do not make complexity theory just a source of lame excuses of not solving the problem or writing brute-force algorithms and garage-made heuristics that can guarantee noting. Some programmers call it all "clever cheats", but many hand-made "cheats" are of no help in approximations (as shown above with the first, most obvious "greedy" algorithm for the burglar). Instead of cheating, do not hesitate to perform a quick googling or consult experts in this area if you encounter an NP-complete/hard problem--a fast solutions with proven and limited weakness, and suitable for your specific task may exist, and may have been studied well.

Or, if you're curious enough, and possess certain bit of math education, you can study it on your own. English-speaking readers may check online pages of their favorite universities for book recommendations and, maybe, online lectures. I can't recommend anything in English since I haven't read such books. To a Russian-speaking reader I can recommend the book «Эффективные алгоритмы и сложность вычислений» by Nikolai Kuzyurin and Stanislav Fomin (pdf, site; published under OPL license), which we used during our studies at the university.

Read on | Comments (3) | Make a comment >>


More posts about maths >>