Recent blog updates

Shown here are only posts related to myths. You can view all posts here.

Are women worse programmers than men?

The idea behind this blog post has always been to make peace between two things. One of them is my honest belief that sexism is unwise and disgusting, and that all human beings are equally smart regardless of their gender. The other is the dismally small number of women in the programming profession that I observed at that time and place. In 2010, a StackOverflow question (it provoked a flamewar and has been deleted, but you can view it if you have 10k+ rep there) finally gave me the idea of how these two things are possible simultaneously, and I wrote a provocative and--as one of the commenters put it--"needlessly argumentative" post about it. Here it is, slightly rewritten.

Are women worse developers than men?

There are slightly more women than men among humans. However, there are only a small number of women in software development, as we observe and acknowledge daily. I see this at the place I work, and I hear from many others that they see the same at their workplaces. If you're a software developer, take a look at who surrounds you: how many of them are women? Not many, I guess; much less than 50%.

So does this mean that women are somehow inherently worse at programming? No.

I used to dismiss participation rates in countries that do not completely satisfy the definition of "free" (Qatar, Singapore, Malaysia). But women are disadvantaged in such countries, and forcing women into one of the toughest and best-paid STEM occupations doesn't really scream oppression.

A survey of the participation of women in computing around the world demonstrates that it varies dramatically from country to country: from 6% in Denmark to 55% in Thailand (anecdotal evidence reports 60%+ participation in Qatar). If women were somehow worse at this, these rates would be much more uniform.

The thing, however, is that whether women are worse or better at it is completely irrelevant for most situations that occur in practice. That's right: even if you are somehow sure that women are inherently worse at programming, this does not warrant any kind of different attitude towards them in the workplace, and here's why.

Why shouldn't one just fire all women if one thinks they are inherently worse at programming?

Short answer: because of conditional probability. The longer answer follows.

Assume you are conducting an interview, and assume also that you are sure that women are worse programmers than men. Does that mean that the probability that a woman who comes to your interview is a good programmer is lower than the probability that a man in a similar position is a good programmer? No, and the reason is that the group your interview candidates come from is no longer "all women" or "all men." Instead, it is "all women who chose a career in computing" and "all men who chose computing."

Conditional probability demo

A demonstration that the skill of job applicants--those women who chose software development as their career--may have nothing to do with how good a programmer the average woman is.

The probability that "a woman is a good programmer" (A) given that "a woman chose programming as her career" (B) is not equal to the probability that "a woman is a good programmer" (A) alone. The conditional probability can be either greater or less than the unconditional one, according to the definition of conditional probability (see illustrations):

P(A given B) = P(A and B) / P(B)

The definition of conditional probability.
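
To make the definition concrete, here is a quick illustration with made-up numbers (the percentages are mine, chosen only to show the arithmetic). Suppose that, among all women, 1% choose programming as a career (P(B) = 1%), and 0.8% both choose it and are good at it (P(A and B) = 0.8%). Then:

P(A given B) = P(A and B) / P(B) = 0.008 / 0.01 = 80%

So the conditional probability can be 80% even if the unconditional P(A) across all women were, say, a mere 5%.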

And something tells me that this very particular group, "women self-selected as professional programmers," will contain a higher share of good programmers than the general population. In any case, no matter how small you believe the ratio of good programmers among all women to be, it tells you nothing about women in the profession. No kind of unfair treatment of female peers can be justified by reasoning about the general population, and an understanding of conditional probability gives us a mathematically solid proof of why.

Why is there a debate about whether women are good at programming in the first place?

Because many people believe that the participation of women in software engineering and the skills of women in general are correlated, while in reality these numbers may have absolutely nothing to do with each other. The application of basic probability theory to this debate finally settled it for me, but many are distracted by less relevant questions.

Some see the low participation of women in engineering as a sign of their inferior skill, and therefore reason that women should not be hired, or should not be taken seriously. This line of reasoning is mathematically wrong, as shown in the previous section, even if it somehow turns out that women are indeed worse programmers, which I do not believe is true.

Others believe that women and men have equal programming abilities (which is a completely justified belief), and conclude that they must be equally represented in the workplace regardless of other factors, thus interpreting the disparity we observe as a deliberate diminishing of women's programming skills. This line of reasoning is as unjustified as the previous one because, again and again, skill and participation may have nothing to do with each other.

Errors in the math on both sides make the debate much more heated than it should be, because everybody draws illogical conclusions from the wrong math, which then receive justified critique. Instead of trying to eliminate bias in prior perceptions, which can't be eliminated completely and may have nothing to do with the debate anyway, I would like to see more focus on fixing what is easier to fix, and what is not mathematically irrelevant.

There still exists a vicious cycle: female programmers are considered uncommon, and this by itself wards women off, keeping them uncommon. People stop caring about women because there are none on their team, and there's simply nobody to care about; they forget--or never even learn!--that programming, like many things in this world, is a two-gender endeavour. Feminist websites like this have a lot of advice to offer to mitigate this.

***

Once, when we were celebrating an event at work, I proposed a toast. I said that women are not welcome only in the most awful things in the world, namely war. And that I wanted to raise a glass to the women present, who prove that our domain, software development, is something worth being in.

So, to you and to all of the women in the field!

Post Scriptum

I'm kind of tired of being used as an example of sexism in programming. You probably came here via a link on some website claiming that there's yet another chauvinist hampering the participation of women in computing or whatnot. This is not true, and has never been true. The gist of both the new and old versions of the article is that a lot of the debate around "women in computing" revolves around badly applied mathematics. Refresh your knowledge of conditional probability, please.

By the way, none of the visitors arriving via such links, totalling 4,000 over several years, have sent me a message or made any attempt to comment on this post and correct something. Not a single one! I can only explain this by the fact that a demonstration of sexism you can link to is much more welcome out there than a person who corrects others' math.

For those interested in the old version of the article, which you may need to point a finger at, I'll spare you the trip to the Web Archive to get it from there: the one you see here is an updated version.



Killer features of Git

Welcome to Yet Another Post About So-Called Killer Features of Git. Of course, I'm not the first to write about them. No wonder: Git was already in good shape in 2005 (link). However, in my opinion, some people seem to misunderstand why Git's killer features are important in everyday work. I'll cover my thoughts on that in this post.

If you still don't know, Git is a fairly recent source-control system originally written by Linus Torvalds. That alone tells a lot about it; I can already see scared corporate clerks convulsively closing the browser tab. So, what can Git offer to more advanced developers?

1. Distributed all the time

Many people say that the main feature of Git is that it's a distributed source control system. That's not a killer feature, it's "just" a different workflow, different approach, different philosophy of version control.

And its direct consequence--the ability to work without a network connection--is not a feature either. What is a feature is the necessity of such behavior. Let's take a closer look at the usual workflow:

git add ...
git commit
git add ...
git commit
....
git push

At first, you have to manually pick which files the next commit will consist of. You do it with the various forms of git add. Is it a feature or an obstacle? I think it's a feature. I have seen many times how people commit stuff that belongs only on their local machine, because Subversion just commits everything that changed upon running svn commit. With git add you have to select what you add manually. Slow? Perhaps. But it is the usual workflow, and it requires you to treat commits as serious actions; I'm really tired of those bad commits that bring stuff that doesn't belong in them.

Okay, so you've made several commits. Guess what you now have on your hard drive? You have a draft of a repository--your local clone--which you replicate to the main server only when you invoke git push. You don't have to--and should not--push your changes at once. Instead, you can now test each of your commits; you can check out from that folder! At work, my test system is tuned to do exactly this: check out from the local repository and run regression tests.

If you feel that your draft contains errors, you have built-in Git features to bring your repository draft into good shape. You can fix the message of any of your new commits, you can change their order, you can roll back several commits and merge a branch as if it had been merged several commits ago. And only then do you git push the changes, thus merging with the repository on the server. Of course, if you're not careful, you're shooting yourself in the foot, just like in C, to which Git is indeed compared. But still, I think it is way safer than Subversion, where each of your commits is pushed at once.

2. Branching

Git can have lightweight branches! It can store hundreds of branches on the server without significant time overhead! Wow!

That's what they say. But that's not what is really important. The important thing is not that you can push branches to the server. The thing is that you may choose not to, and keep these branches local. On my machine I have about 10 branches, only one of which is synchronized with the server. I can merge these branches the way I want, and I can check out any of them to test locally. And all of these things I can do without bothering our centralized repository.

Another killer feature related to branches is the ability to switch branches within one directory. No wonder it's possible: you have a repository within it anyway. And it's really helpful; I still remember that hell when I had to branch Bazaar by checking it out to different directories. It's a hell I'm never coming back to. This strict, seemingly clean structure "one folder for one branch" just doesn't work well.

And here's the third feature...

3. Flexibility

"Git is the C of Version Control Tools", says Eric Sink, the 5th google result for "Git killer features" request. Yes, it's true. And it's useful, because others may write front-ends for other repositories. git svn is a well-grained tool that allows you smoothly integrate with Subversion.

I found my answer to the question "Why not use git?" by just observing my colleagues. They simply didn't know how to use it, and didn't even want to try. And it happened this way: I used Git, they used Subversion, and we all were happy. You won't be able to merge Subversion branches with Git's native mechanisms (SO question), and you will encounter a bug that won't let you rename files (git svn doesn't handle this case: you should first commit the deletion of the files, then commit the addition of the new ones). But I think that working with Git is worth all these obstacles.

Conclusion

If you liked it, use it! You don't even have to persuade others to use it: just install git svn (if it's not already integrated) and you and your team will all be happy. Git's cool; don't be a corporate clerk--try modern and flexible tools when they're available.



Busting C++ myths: virtual function calls are slow

Here and there over the Internet, you sometimes see unproved claims that "C++ is slow," "the C++ ABI is unstable," etc. The rationale behind these claims is usually not provided by the speakers. However, many people believe in these myths, having never seen a proof of any of them, so I'd like to bust a couple.

What C++ is good at is the ratio between the ability to describe abstract, high-level concepts and the performance of the compiled program. No wonder the argument about the performance penalty of these concepts never gets old.

Background

One of these concepts is, of course, virtual functions. Their calling mechanism is often called "slow" or, at least, "unclear." Of course, "unclear" should mean "hard to predict," but most likely it means "I couldn't, or didn't try to, understand it." Let's fix this.

What about other compilers?

Indeed, I only take into account the implementation of C++'s internals in one compiler, and that's the C++ compiler from the famous GNU Compiler Collection.

What happens on other compilers, you may ask. Actually I don't know. And I think it's a bit out-of-topic here. The purpose of this article is to advocate C++ as a language by showing that there exists at least one reasonably fast implementation of it. If it happens that your compiler generates worse code and the assessments made here don't hold for it, address your compiler vendor with this question.

The C++ language is governed by a standard, but the way it's implemented under the hood is left for a compiler vendor to define. However, developers of certain compilers attempt to govern parts of the "binary" behavior of their implementation of C++ by separate standards. In this article I will follow the "standard" that GCC follows: the Itanium C++ ABI. Despite the name, it is followed on other architectures as well, particularly on x86.

How are the virtual functions implemented?

From very old times, virtual functions have been implemented via "virtual tables." Being successors of function pointers as a way to control program flow (as opposed to pointers as a way to implement first-class functions), virtual functions hide these pointers in an implementation-defined structure named, again, a "virtual table" (or just "vtable"). The vtable is common to all objects of a given runtime type, and a pointer to the proper vtable is stored in each object.

A key to understanding a vtable and why it's performant is that it's not a table, but, rather, an array:

(vtable for QTableWidget class)

00  QTableWidget::metaObject() const
01  QTableWidget::qt_metacast(char const*)
02  QTableWidget::qt_metacall(QMetaObject::Call, int, void**)
03  QTableWidget::~QTableWidget()
04  QTableWidget::~QTableWidget()
05  QTableWidget::event(QEvent*)
06  QObject::eventFilter(QObject*, QEvent*)
07  QTableView::timerEvent(QTimerEvent*)
08  QObject::childEvent(QChildEvent*)
09  QObject::customEvent(QEvent*)
10  QObject::connectNotify(char const*)
11  QObject::disconnectNotify(char const*)

(snipped)

For each function, the index is known at compile time. For example, if the code calls tableWidgetPtr->timerEvent(event), the calling algorithm loads a function pointer from the vtable[7] cell (the number 7 corresponds to the timerEvent() member function) and calls it. It is not that simple in all cases. Sometimes (we'll show it later) it is a bit harder. But there's one common truth: no dynamic dispatching by function name, actual base class, etc. ever happens during a virtual function call. The number 7 is compiled into the object code, not looked up at runtime. C++ is not a modern interpreted language, where you can run an object method a user has just typed into a textbox.

Performance assessment

To measure performance, I will count "shifted-pointer dereferences" (loading base[offset] from memory into a register) and jump instructions. I am not deeply into assembly language, but I assume these operations take more time than register arithmetic.

So, how much slower is it than calling a usual, non-virtual function? For this simple case (single inheritance), it takes, in addition to the mere jump instruction that happens on a usual function call, loading a vtable pointer vp from the body of the object and loading one pointer from an already known index of the virtual table, vp[index_of_func] (probably triggering a cache miss).

One constant-shifted pointer dereference. How short should your function be for it to be noticeable?

More tricky inheritance schemes

What is "inheritance branch"?

A non-common term I use here is inheritance branch. For a given inheritance graph, that's a maximal chain where each class is a primary base of the previous one. A primary base is the first non-virtual base class with virtual functions or, if none, the first nearly empty virtual base or, if none, the first virtual base. By "first" I mean the first one in the list of base classes in the declaration of the derived one. (See here, case 2b, for reference.)

Here's an example of an inheritance hierarchy:

The inheritance branches are red, green, olive, blue, and cyan. That's the order in which the sub-objects will be laid out in the complete, most derived object (each branch has one virtual table and is laid out near the same offset). First, branches with non-virtual roots are laid out in depth-first order; then branches whose roots are virtual are laid out in the same depth-first order.

For more tricky inheritance schemes, the processing is more complex. If your class has a multiple-inheritance hierarchy, it will actually have several vtables, one for each separate inheritance branch (see side note). Suppose you have a hierarchy of the form

Then the first inheritance branch of C will be C-A, and the second one will be B. The thing is that the virtual table pointer for some base B of C is located right at the beginning of the B-in-C subobject that (B*)&c points to (so the first vtable pointer is located right at the first bytes of the complete object). Therefore, a pointer to C can be immediately dereferenced to get a proper virtual table only for an aa() function call. For a bb() function call, the pointer cp should first be adjusted to point to the B-in-C subobject, and only then should the dispatching mechanism described above for the single-inheritance case be performed. The adjustment is by a fixed offset, known at compile time.

But can you immediately call the body of the final overrider for a function dispatched this way? Not always. Assume that cp actually points to an object of a more derived class:

If a pointer to B-in-C were passed as the "this" pointer to bb's implementation, then this->i, implemented as *(this + offset_of_i), would reach beyond the subobject's boundary (i is at the first bytes of the object, while B-in-C is allocated after i). An adjustment of the this pointer is needed for such a call. A special function that makes this small adjustment before jumping to the actual user's function is called a "thunk." Pointers to these thunks are inserted into vtables instead of raw function pointers. In our example, a thunk emitted for bb-in-D would adjust the pointer by an offset (known at compile time for non-virtual bases, and taken from a special part of the vtable for virtual bases).

Therefore, compared to a usual function call, we need up to three pointer loads (four in the case of calling through a virtual base) with one cache miss and--sometimes--a jump. Since thunks may be laid out consecutively, the jumps may be replaced by mere additions:

    a2: /* Thunk for non-virtual base A2.  */
        this += offsetof (B, A1) - offsetof (B, A2)
        /* Fall through.  */
    a1: /* Thunk for non-virtual base A1.  */
        this += offsetof (B, C) - offsetof (B, A1)
        /* Fall through.  */
    f:  /* Non-adjusting entry point.  */

But I won't try to judge its performance: compiler authors surely know better when to apply this optimization. We'll assume that the result of the optimization is not slower than the straightforward case.

Calling virtual functions via interface pointers

The interface pattern is implemented in C++ via "public virtual" base classes. Therefore, calling a virtual function through a pointer to an interface base requires three pointer loads and a jump. We need three loads instead of four, since the compiler knows that the subobject the pointer points to contains a virtual function declaration, so we can save one offset lookup (the one needed to find the proper vtable pointer).

Conclusion

Summing up, the overhead of a virtual function call compared to a call of a usual function is the following (all shifts are known at compile time):

  1. for simple single inheritance -- one constant-shifted pointer dereference (1 cache miss);
  2. for calling through interfaces (virtual bases) -- 3 constant-shifted pointer dereferences and a jump (that may be substituted with several increments/decrements of a value in a register);
  3. more complex inheritance schemes may lead to 4 pointer derefs (2 cache misses).

Busted?


