Busting C++ myths: virtual function calls are slow
Here and there, over the Internet, you sometimes see unproved claims that "C++ is slow", "C++ ABI is unstable" etc. The rationale behind these claims is usually not provided by the speakers. However, many people believe in these myths, having never seen a proof of any. so I'd like to bust a couple of them.
What C++ is good at is the ratio between plausibility to describe abstract and high-level concepts and performance of the compiled program. No wonder that the argument about the performance penalty of these concepts never gets old.
Background
On of these concepts is, of course, virtual functions. Their calling mechanisms are often called "slow", or, at least, "unclear". Of course, "unclear" should mean "hard to predict", but most likely it means "I couldn't/didn't try to understand it". Let's fix this.
What about other compilers?
Indeed, I only take into account the implementation of C++ depths on one compiler, and that's the C++ compiler from the famous GNU Compiler Collection.What happens on other compilers, you may ask. Actually I don't know. And I think it's a bit out-of-topic here. The purpose of this article is to advocate C++ as a language by showing that there exists at least one reasonably fast implementation of it. If it happens that your compiler generates worse code and the assessments made here don't hold for it, address your compiler vendor with this question.
C++ language is governed by a standard, but the way it's implemented under the hood is left for a compiler vendor to define. However, developers of certain compilers attempt to govern parts of "binary" behavior of their implementation of C++ language by separate standards. In this article I will follow the "standard", that GCC follows: Itanium C++ ABI. Contrary to the title, it's followed on other architectures as well, particularly, on x86.
How are the virtual functions implemented?
From very old times virtual functions were implemented as "vritual tables". Being successors of function pointers as the way to control program flow (contrary to pointers as a way to implement first-class functions), virtual functions hide these pointers into an implementation-defined structure named, again, "virtual table" (or just "vtable"). Vtable is common for all objects of a giver runtime type, and the pointer to the proper vtable is stored in each object.
A key to understanding a vtable and why it's performant is that it's not a table, but, rather, an array:
(vtable for QTableWidget class)
00 QTableWidget::metaObject() const 01 QTableWidget::qt_metacast(char const*) 02 QTableWidget::qt_metacall(QMetaObject::Call, int, void**) 03 QTableWidget::~QTableWidget() 04 QTableWidget::~QTableWidget() 05 QTableWidget::event(QEvent*) 06 QObject::eventFilter(QObject*, QEvent*) 07 QTableView::timerEvent(QTimerEvent*) 08 QObject::childEvent(QChildEvent*) 09 QObject::customEvent(QEvent*) 010 QObject::connectNotify(char const*) 011 QObject::disconnectNotify(char const*)
(snipped)
For each function the index is known at compile time. For example, if the code calls tableWidgetPtr->timerEvent(event), the calling algorithm loads function pointer from vtable[7] cell (number 7 corresponds to timerEvent() member function) and calls it instead. Not for all cases it is that simple. Sometimes (we'll show it later) it is a bit more hard. But there's one common truth: no dynamic distpatching by function name, actual base class etc ever happens during virtual function call. This "7" number is precompiled into the object code, not looked up at runtime. C++ is not a modern interpreted language, where you can run an object method a user has just written in a textbox.
Performance assessment
To measure performance I will take into account "shifted-pointer dereferences" (loading base[offset] from memory into a register) and jump instructions. I am not too much into assembly language, but, I guess, these operations take more time than arithmetics on registers and stuff.
So, how slower is it than calling a usual, non-virtual function? For this simple case (single-inheritance) it takes, in addition to mere jump instruction that happens on a usual function call, loading a vtable pointer vp from the body of the object and loading one pointer from an already known index of a virtual table vp[index_of_func] (probably, triggering a cache-miss).
One constant-shifted pointer dereference. How short should your function be for it to be noticeable?
More tricky inheritance schemes
What is "inheritance branch"?
A non-common term I use here is inheritance branch. For a given inheritance graph that's a maximal chain where each class is a primary base of previous one. A primary base is the first non-virtual class with virtual functions or, if none, a first nearly empty virtual base or, if none, a first virtual base. By "first" it's meant "first one in list of base classes in the declaration of derived ones". (See here, case 2b for reference.)Here's the example of inheritance hierarchy:
The inheritance branches are red, green, olive, blue, cyan. That's the order the sub-objects will be laid out in the complete, most derived one (ecah branch has one virtual table and is laid out near the same offset). First, branches with non-virtual roots are laid out in "depth-first" order; then, branches, roots of which are virtual, are laid out in the same depth-first order.
For more tricky inheritance schemes the processing will be more complex. If your class has a multiple-inheritance hierarchy, it will actually have several vtables, one for each separate inheritance branch (see side note). Consider you have an hierarchy of form
Then the first inheritance branch of C will be C-A, and the second one will be B. The thing is that a virtual table pointer for some class B in C is located right under pointer to (B*)c subobject of C c object (so, the first vtable pointer is located right at the first bytes of the complete object). Therefore a pointer to C can be immediately dereferenced to get a proper virtual table only for aa() function call. For bb() function call the pointer cp should be adjusted to point to B-in-C subobject and only then a dispatching mechanism described above, for single-inheritance case, should be performed. The adjustment is by a fixed, known at compile-time offset.
But you can't immediately call the body of the final overrider for a function distpatched this way? Assume that cp actually contains object of more derived class:
If a pointer to B-in-C would be passed as "this" pointer to bb's implementation, the this->i, being implemented as *(this+offset_of_i) would result into breaking the object's border (i is at the first bytes of the object, while "B-in-C" is allocated after i). An adjustment to this pointer is needed for such caller. A special function that does small adjustment before jumping to an actual user's function is called "thunk". Pointers to these thunks are inserted in vtables instead of raw function pointers. In our example, a thunk emitted for bb-in-D virtual function would adjust pointer by an offset (known at compile time for non-virtual bases and taken from a special part of vtable for virtual bases).
Therefore, as compared to usual function call, we need up to three pointer loads (4 in case of calling a virtual base) with one cache miss and--sometimes--a jump. Since thunks may be consecutive, jumps may be replaced by mere additions:
a2: /* Thunk for non-virtual base A2. */ this += offsetof (B, A1) - offsetof (B, A2) /* Fall through. */ a1: /* Thunk for non-virtual base A1. */ this += offsetof (B, C) - offsetof (B, A1) /* Fall through. */ f: /* Non-adjusting entry point. */
But I won't try to judge its performance: compiler authors surely know better, when to do this optimization. We'll assume that the result of optimization is not slower than the straightforward case.
Calling virtual functions via interface pointers
Interface pattern is implemented in C++ as "public virtual" classes. Therefore, calling virtual function through the pointer to interface base requires 3 pointer adjusts and a jump. We need 3 adjusts instead of 4, since compiler knows that the subobject pointer points to contains a virtual function declaration, so we can save one offset (the one needed to look up for a proper vtable pointer).
Conclusion
Summing it up, overhead for a virtual function call compared to a call of usual function is the following (all shifts are known at compile-time):- for simple single inheritance -- one constant-shifted pointer dereference (1 cache miss);
- for calling through interfaces (virtual bases) -- 3 constant-shifted pointer dereferences and a jump (that may be substituted with several increments/decrements of a value in a register);
- more complex inheritance schemes may lead to 4 pointer derefs (2 cache misses).
Busted?
Virtual methods are slow. Not because of the vtable implementation, but just because they are called. Virtuals cannot be inlined, and that's slow. But this is only noticeable in high performance loops...