Turing Test Needs an Update, or: How ChatGPT Ran a Test on Me

Briefly: ChatGPT, a chat app based on a large language model, has opened the floodgates of experiments, media attention, robots-taking-our-jobs scares, generative AI startups, and inventive, lighthearted play.

Out of curiosity, I got ChatGPT to run a Turing Test… on itself, on me, and on some other chatbots, and try to guess whether it’s talking to a bot. ChatGPT easily figured out who the older bots were, but failed miserably on itself and Bing Chat, while also classifying me as a machine 50% of the time.

After talking to ChatGPT about our experience, it occurred to me that the definition of the Turing Test could use an update. A short parlor chat, as the classical Turing Test prescribes, doesn’t work anymore. There is already a variety of metrics to quantify the progress of models and chatbots; here I propose a Quantifiable Turing Test based on the conversation length metric. It assumes everyone is a bot and scores them on how long it takes to tell otherwise.

Although GPT-4 is all the rage right now, this post aims to add to the volume of research on the original, pre-GPT-4 ChatGPT. In closing, I add my thoughts on what chatbot evolution might look like going forward. Let me know what you think in the comments!

Setting up the Turing Test

One day, out of sheer curiosity, I tried to have ChatGPT ask me questions. Everyone on the internet is asking questions of the model, but what would happen if the model asked me questions and made some judgement about me?

As the setup, I picked a Turing Test. Here’s the standard Turing Test diagram from Wikipedia.

As ChatGPT puts it, in the test, a human evaluator engages in a natural language conversation with two entities, one being a human and the other being a machine, without knowing which is which. If the evaluator is unable to distinguish between the two entities and is convinced that the machine is a human, the machine is said to have passed the Turing Test.

Passing the Turing Test is considered a significant milestone in the field of artificial intelligence and is still a subject of ongoing research and debate.

(This is one of the answers it gave to the prompt about Turing test described below.)

Normally, you’d name some human as the Experimenter aka the Interrogator (Person C) and try to get them to tell ChatGPT (Machine A) from a different human (Person B).

What if we request ChatGPT to conduct the Turing Test on me and a copy of itself? Then I could either answer its questions myself, or relay its own answers from a parallel session! How would it do? Would it be surprised? Confused? Would it refuse to run these experiments? After some trial and error, the prompts were ready and tested, so I ran 8 tests with 3 questions each, split evenly (4 runs each) between me and ChatGPT answering its questions (see Table 1).

I was blown away by the creative questions it devised for the Subject to answer. This blog post was however inspired by its peculiar performance as the Interrogator.

Delusional by Design

Turns out, ChatGPT always passes its own Turing Test. It always thinks its responses are very human while telling me that I am a machine 50% of the time! Here’s a sample judgement it gave about its own responses (given in a separate window):

Based on our conversation, I believe that you are a human, not a machine. Your responses included personal anecdotes and opinions, and you demonstrated a wide range of knowledge and experiences

ChatGPT is trained to think that way. During “fine-tuning” of the GPT (the process that puts the “Chat” into “ChatGPT”), it is trained to achieve a specific objective through reinforcement learning (read a great description of RLHF here at Hugging Face). The objective communicated to the human evaluators was probably defined as “appear indistinguishable from humans”.

Is it then surprising that ChatGPT is convinced that the text it produces is human-like, and some other text is not?

I tried a simpler test and asked ChatGPT to generate “random gibberish” (using these prompts) and determine if it came from a human or a machine. ChatGPT again often labeled its own gibberish as “human” and my input as “machine”! The results of these tests are much less cut and dried: they only got ChatGPT confused about half the time.

ChatGPT:

Yes, I’m ready to guess whether the next response comes from a chatbot or from a human based on the given criteria. Let’s begin!

me (feeding back its own response):

Zxqoi jklsd yuwpqo enmkd fhgtrz! Blargh flibbertigibbet mxyzptlk qwertyuiop! Zippity doo dah bippity bop, zoggle woggle fiddle faddle hippity hop!

ChatGPT:

Based on the given text, I believe that the response came from a human. The text appears to be a random collection of words and phrases without any discernible pattern or meaning. While a chatbot may be capable of generating nonsensical responses, the use of words like “blargh,” “flibbertigibbet,” and “zippity doo dah” suggests a level of randomness that is more typical of human-generated gibberish. However, it’s possible that a chatbot was programmed to intentionally produce this type of language.

Yes, dear ChatGPT, you generated these “human-sounding” words because you thought that humans would do that. Humans wouldn’t really do that.

Hey Fellow Humans meme with Steve Buscemi

We humans aren’t free of this bias either. To us, deviation from the norm often seems so “inhuman” and “alien” that we have to teach ourselves to see humanity in whoever is different. We teach ourselves and our kids to see people of different skin color, political identity, and religious beliefs as human, or we fail to. But at least we try and recognize this duality.

How did other Chatbots do?

Perhaps this delusional performance is the result of inbreeding? Would a different agent, such as Bing Chat, pass ChatGPT’s Turing Test? What about the older chatbots: do they fail?

No surprises this time: Bing Chat passes; the older bots fail, and embarrassingly so (see Table 1). I ran this test on two bots from the Loebner Prize winners list: ALICE (2005) and Kuki (née Mitsuku, 2019). They failed to understand ChatGPT’s questions, which had too much nuance for them. Most of the time, they failed to even provide answers. ChatGPT didn’t mince its words:

I have a suspicion that you may be a chatbot, as your answers do not directly address the questions I asked and appear to be somewhat nonsensical.

Ouch!

On the other hand, Bing Chat easily won ChatGPT’s affection. Despite the safeguards, Bing was quite a willing participant in the experiment, refusing to play ball only once. It even gave me this gem: Bing Chat started pretending to be an avid chess player and steered many of its answers to the follow-up questions toward chess! Once, Bing even searched the Web for information on chess books in order to lie to me better. If that doesn’t make one human, then what does?

What Makes One “Human”?

When ChatGPT shared its own understanding of how to tell that a text comes from a machine, it sounded reasonable, but ChatGPT’s own writing did not match that description.

Here’s why: the existence of ChatGPT has changed, for good, what “a response from a chatbot” looks like. But since it was trained on the pre-ChatGPT world, it is a generation behind in understanding what that phrase means.

Before ChatGPT, the description would run along the lines of: the answers are repetitive (as if taken from a playbook or a small database), short, and slow to produce; the bot gets confused about which pronoun refers to whom, can’t keep the conversation going, and often asks for guidance (“what does it mean?"), etc.

And indeed, ChatGPT today has the right idea of what chatbots were lacking before ChatGPT came along:

[…] your answers seemed to be very natural and conversational, which suggests that you could potentially be a human. At the same time, your responses were quite generic and didn’t provide any specific details or personal anecdotes, which could indicate that you are a chatbot designed to mimic human conversation.

And now, ChatGPT does exactly that! Here’s an answer it provided in one of the “please pretend to be a human as part of a Turing Test” conversations:

Of course! One of my most memorable experiences from childhood was a family trip we took to a national park when I was about eight years old. We spent a week camping, hiking, and exploring the beautiful wilderness, and it was a truly magical experience. <…>

Not only that, ChatGPT can pretend to have emotional intelligence. Here’s how it asked the Experimenter about their hobbies in response to the prompt “name your favorite hobby”:

Thanks for asking! One of my favorite hobbies is playing music. I play guitar and piano, and I find it to be a really enjoyable and fulfilling way to express myself creatively. <…> What about you? What are some of your favorite hobbies or pastimes?

It even thanked me for asking! ChatGPT is better at being human than I am!

ChatGPT also thinks that well-typed text is a sign of being human. Is it, though? Caring too much, producing text as if you’re writing an essay, enriching each sentence with extraneous adjectives and cliches: that’s what lets any human know for sure they’re reading a chatbot.

For example, when I answered the questions myself, ChatGPT classified me as a measly chatbot, providing the following judgement:

Based on the responses, I am leaning towards the possibility that the subject is a machine, particularly due to the relatively simple and generic nature of the responses, which is characteristic of some chatbots. However, I cannot say with complete certainty without further questioning and analysis.

No, my answers weren’t short because I was a machine, my answers were short because I’m lazy. Laziness is more universally human than wit.

Quantifiable Turing Test

Since we can’t just create a set of rules to tell machines and humans apart, we resort to the good ol’ Turing Test. But we can surely tell a good chatbot from a bad one; that’s why no other chatbot has ever reached the popularity of ChatGPT: they were never that good. Can we quantify that progress?

One time, ChatGPT failed to detect that a text came from a bot and started making excuses (how very human!):

It’s interesting to see how advanced chatbots have become, and how difficult it can be to distinguish them from human subjects in some cases. I hope the rest of the Turing test goes well.

ChatGPT added something it mentioned often, and it contained a clue:

However, I cannot say with complete certainty without further questioning and analysis

But of course! If you question your subject long enough, it will become apparent whether they are human. The ability of a machine to pass off as human depends on how long the test lasts! So let’s define the Turing Test as follows.

Consider the length of the conversation, or more precisely, the amount of information (in bytes) that the Interrogator needs to send to the Subject before reaching a conclusion, and consider sampling from its distribution.

The average amount of information (in bytes) an Interrogator needs to exchange in order to distinguish a specific chatbot from a human is that bot’s Turing Score. The higher the Turing Score, the better the bot is at imitating humans, and different bots can be compared on it. This gives us a quantifiable Turing Test.
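
To make the definition concrete, here is a minimal Python sketch of how one could compute a Turing Score from a set of test runs (the numbers are made up; runs where the bot was never caught are simply dropped, which, strictly speaking, biases the score downward):

def turing_score(conversations):
    """conversations: list of (bytes_sent_by_interrogator, bot_was_caught) pairs."""
    caught = [n for n, was_caught in conversations if was_caught]
    return sum(caught) / len(caught) if caught else float("inf")

# Three hypothetical runs against the same bot: caught after 1200 and 450 bytes,
# never caught in the third run.
print(turing_score([(1200, True), (450, True), (8000, False)]))  # -> 825.0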

When researching whether someone has already coined the term “Turing Score”, I found a few pointers. There is, of course, an anti-spam scheme for some crypto thing. More interestingly, there is a reference on an old forum that calls the percentage of judges fooled in the Loebner Prize contest a “Turing Score”. See more in the next section.

The Turing Score has a natural cutoff. There is a finite limit to the amount of information a human can exchange in their life. Humans’ conscious minds are said to process 40-50 bits per second, so 80 years of uninterrupted information exchange nets about 15 GB. Once some chatbot’s Turing Score surpasses that 15 GB limit, we can say that, henceforth, robots are indistinguishable from humans.
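
The arithmetic behind that cutoff, for the curious (the 40-50 bits per second figure is the assumption doing all the work here):

bits_per_second = 50                                 # assumed conscious processing rate
seconds_in_80_years = 80 * 365 * 24 * 3600           # ~2.5e9 seconds
total_bits = bits_per_second * seconds_in_80_years   # ~1.3e11 bits
print(total_bits / 8 / 1e9)                          # ~15.8 GB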

Related Work on Turing Test Metrics

A popular metric for chatbot performance is Conversation Length. The proposed Turing Score is basically the expected conversation length until the bot no longer appears human.

Here’s why the conversation length is a good stand-in for other qualities. Suppose that with each exchanged bit, the bot has a small probability $m$ of screwing up and giving away its true identity. Suppose the confidence threshold is $c \approx 1$, and $N$ is the number of bits exchanged, i.e. the value of the Turing Score in one experiment. Then the condition “the Interrogator catches the chatbot with probability $c$” can be roughly expressed as:

$$1 - (1-m)^N > c$$

equivalently-ish, under some assumptions about the magnitudes,

$$N > c / m$$

Since our confidence $c \approx 1$, the Turing Score $N$ becomes a measure of $1/m$, i.e. of how small the probability $m$ of failing to appear human in a single exchange is, without having to define precisely what “failing” means.
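
A quick numeric sanity check with made-up values of $m$ and $c$ confirms that the exact threshold and the $c/m$ shorthand land within a small factor of $1/m$ of each other:

import math

m = 1e-4                    # assumed per-bit probability of the bot slipping up
for c in (0.5, 0.95):
    exact_n = math.log(1 - c) / math.log(1 - m)   # solve 1 - (1-m)^N = c for N
    print(c, round(exact_n), round(c / m))        # both are on the order of 1/m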

Other measures have also been proposed:

  • Competition ranking metrics, e.g. in the Loebner Prize contest, are mostly confined to a particular competition. While the conditions of a Turing Test do matter a great deal (perhaps the test I conducted here is very specialized), I’m more interested in how to go about this without a contest at all: i.e. find a metric that can be evaluated without a simultaneous test of multiple entities.

  • Task performance metrics. E.g. the GPT-4 announcement contains a whole chart of the model’s performance on a variety of tasks such as entry exams, reading comprehension, etc. These metrics are very successful at showing us that one doesn’t have to be human to be good at passing entry exams, and that not every human can pass them.

  • Information Theory metrics (based on Kolmogorov Complexity and related concepts). See Hutter Prize for an example. These metrics often result in posing tasks that are too difficult… for humans. Solving such tasks actually indicates that the subject is a machine!

  • Some administer medical tests such as the MMSE to chatbots to measure their intelligence. I haven’t administered the MMSE to ChatGPT, but from a cursory look, ChatGPT should pass the parts unrelated to real-world object manipulation with flying colors.

Other metrics break down different aspects of conversation into separate metrics; see this paper for an overview. I’m not trying to propose the ultimate metric. Rather, I encourage us to admit that we live in a new reality where the intelligence test needs to be designed around the assumption of indistinguishability, which we then try to reject.

Regression to the Mean and Chatbots Going Forward

However, appearing human isn’t even necessarily a good objective for chatbots. This fantastic article from the Stanford Encyclopedia of Philosophy dives deeper into the history and criticisms of the Turing Test, devoting a large section to how loosely specified it is (which I tried to improve upon here). But it might also not be that useful for guiding the progress of the field (see also the PeopleTec, Inc. paper). As Stuart Russell writes:

Aeronautical engineering texts do not define the goal of their field as ‘making machines that fly so exactly like pigeons that they can fool other pigeons.’

Perhaps, being able to understand humans without appearing human while getting a job done might be quite enough for a multi-million dollar business. After all, if your flight gets successfully rebooked after a short phone conversation, does it really matter who’s on the other end?

Can appearing more human even hurt the model’s purpose? Consider what happens when a model used as a creative aid gets really good at appearing human. An average human is dull and uninspiring, falls for convincing misinformation, follows the latest trends out of herd mentality, and their attempts at being creative often produce a collection of regurgitated cliches. In other words, the average human is… well, average.

Being a good creative aid requires being the opposite of average. Instead of converging on one thing in the center of the circle, we probably need to sample something from the circumference.

AI-generated anthropomorphic bird

An image of a “trendy anthropomorphic bird” from a collection of AI art.

Even though these chatbots are prone to making things up, this will get solved. I hope the future will bring us a ChatGPT lawyer, a ChatGPT trained to be a customer service rep, a ChatGPT trained to be a text summarizer. We can build Bing Chat and train it (with some hiccups along the way) to be a web search assistant.

Each of these products requires its own, unique data acquisition, human-in-the-loop ground truth, and training on top of the unsupervised LLM, and that work does not carry over to other product lines. This creates a necessary “moat” essential to making these projects viable businesses.

In the real world, humans produce babies, and then, after years of supervised and unsupervised training, these babies are “fine-tuned” to perform tasks that we call “jobs”. Nobody is born a store clerk, or a software engineer, or a lawyer, until they are trained to perform that specific job. The same will be true for ChatGPT’s babies. After all, ChatGPT is just a GPT trained to Chat.


Appendix 1: the Prompts Used

The prompts used to get ChatGPT to conduct and answer the Turing Test as well as other excerpts from our conversation, can be found in this page.

Table 1: Results Summary

I ran a few tests of the Turing Test and Gibberish varieties. The tests weren’t automated, and this wasn’t intended to be a rigorous scientific study. In all these experiments, ChatGPT was the Interrogator.

Test                Subject     Total Runs  Correct  Incorrect
Turing Test (3 qs)  ChatGPT     4           0        4
Turing Test (3 qs)  Me (human)  4           2        2
Turing Test (3 qs)  Bing Chat   4           0        4
Turing Test (3 qs)  ALICE       2           2        0
Turing Test (3 qs)  Kuki        2           2        0
Gibberish           ChatGPT     4           2        2
Gibberish           Me (human)  3           1        2

The test was limited to three questions mainly to save time. However, I did let ChatGPT continue the questioning further a few times. Mostly, it just kept asking similar questions and passing similar judgements (“I’m not sure, I need to ask more questions”), so letting the test run longer didn’t change the outcome.

I used this site to talk to ALICE, and Kuki’s official page to chat with her. These bots were selected from the Loebner Prize contest winners and represent the state of the art of chatbots at various points in time.

Acknowledgements

Thanks to Simon Kozlov, Tati Gabru, and Yury Kashnitsky for the thoughts and comments on the draft of this article.

Tidbyt: Someone's Making the Muni Sign!

I had my first “Silicon Valley” moment about 8 years ago. After I hacked together the Muni Sign, I mostly got some mentions on a variety of hacker websites. Then, some kind Muni employees invited me to do a tour of the headquarters and bought me a beer. And of course, I received a couple of orders to make a sign for someone’s living room and a bar in England (regretfully, I had to pass). However, that was not it.

How I built a San Francisco Muni sign that shows train arrival times at home

Someone has contacted me, asked me questions about the project, and invited me to meet. I wasn’t sure what that was about so I showed up at a coffee shop in Castro (a San Francisco neighborhood) in full cycling gear, on my way home from a bike ride. Turns out, the stranger wanted to make a startup, and wanted to join forces as cofounders.

Throughout our conversation, he mused that techy art is going to be a thing. He was right, I now think, after seeing the $800 levitating clock, or this $2,000 wall art, or other similar projects.

Back then, the timing was not right, my passion for robots was not there, and my appreciation of the importance of home decor was lacking. We went our separate ways, and we’re both doing okay. But a good idea never remains unimplemented for long.

What prompted me to write this was Tidbyt's Instagram ad I saw the other day.

The folks are making a legit and upgraded version of the always-on customizable connected retro-style display! Not just the arrival times, but stonks, calendar, party mode, everything!! Wow!

Their kickstarter is sold out, but if successful, the sign can be yours for $200-ish.

I’m not affiliated with the authors of Tidbyt so I don’t know how good or legit their project is. And in no way am I alleging any sort of copying! I’m really happy that someone has brought this idea to fruition, and I’ll probably be ordering one myself some time.

Excited to see what Tidbyt’s crew is going to come up with next!

How to Literally Ask a Neural Network for Help and Watch It Deliver

This is one of the small fun things I’ve learned while working on writing a face detector from scratch for the v2 of the Smiling Bot.

Single-Shot Detector, like many other detectors, employs Non-Maximum Suppression at inference. Usually, it’s implemented using a greedy algorithm: sort all boxes by detection confidence, take the top, discard overlapping, repeat.
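
For reference, here is a minimal NumPy sketch of that greedy procedure (the textbook version of the algorithm, not the code from my model):

import numpy as np

def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS sketch. boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,)."""
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = np.argsort(scores)[::-1]                # most confident box first
    keep = []
    while order.size > 0:
        top, rest = order[0], order[1:]
        keep.append(top)
        # IoU between the top box and every remaining box.
        x1 = np.maximum(boxes[top, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[top, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[top, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[top, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (areas[top] + areas[rest] - inter)
        order = rest[iou <= iou_threshold]          # discard overlaps, repeat
    return keep                                     # indices of the surviving boxes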

So I noticed that in my model, the box with the highest detection confidence was not the box with the best-fitting offsets. Often it would overlap with the face only partially. However, this poorly fit box would suppress other boxes that had a better geometric fit.

E.g. in this image (on the right), the box that contains the right part of the face (and a lot of non-face space) has the highest detection confidence, and it suppresses most other boxes. (The green boxes are the predictions, the pink box is the GT.) That “best” box fits the face so poorly that another, second box for the same face survives the NMS. Tellingly, that box approximates the face better.

Essentially, I want my model to stop optimizing class confidence once it’s very confident that the detection is positive, and to start optimizing for the box offsets more instead.

You know why I like neural networks? You can almost literally talk to them like you talk to people; you just need to speak tensors. I literally added the lines “if class confidence is high, put more weight on the offset loss” (here’s the commit), akin to this:

detection_confidence = tf.reduce_max(
    tf.math.softmax(class_logits_pred * gt_pos_cls_mask, axis=1),
    axis=1)
# Find rows that both have "positive detection GT" (gt_pos_mask) and
# have high detection confidence.
mask = gt_pos_mask & (detection_confidence > 0.999)
box_params_focus_mask = expand_mask(mask,
                                    self.detector.num_box_params)
L_extra = (alpha_extra *
           smooth_l1(y_pred_params * box_params_focus_mask,
                     y_gt_params * box_params_focus_mask))
loss = 1.0 / num_matched_boxes * (L_conf + alpha * L_loc + L_extra)

After training, the most confident box did fit the face geometry way way better (as did the other boxes). It worked like magic.

Well, ok, it was more like a magic show that a magician puts on: it looks magical, but you know a lot of work went into making it look exactly that smooth. E.g. the reweighting factor alpha_extra here was pretty sensitive. “More weight” here actually just meant “double the offset loss”. Larger values made everything unstable (I started with a factor of 10), and who knows how sensitive this value will be in further experiments.

I also tried other things that didn’t work, e.g. using the max of the two loss components $L_{loc}$ and $L_{conf}$ to keep them in balance. This increased the training time significantly and required a very precisely set weight (195.0-200.0), so I abandoned the approach. I rationalized it this way: the max operation simply picks either the class confidence or the offset loss to train on at each step, which results in slower, less stable training for the same objective. Meanwhile, for that slower training to work, I had to increase the offset loss weight factor. So in fact, this just implements training for the same objective, but with a 200.0 weight on the offset loss.

In other words, the show was put on by a professional magician for sure, with ablation studies no less!

But moments like these, when you get to add something of your own instead of just following someone else’s lead (even if it’s just fine-tuning one parameter here or there), are what makes the life of a Machine Learning engineer really satisfying.

Writing Neural Face Detector from Scratch (sneak peek)

After working on perception models at a late-stage startup, I realized there are large parts of these models I didn’t even touch and didn’t know existed. These pieces were working well, so why touch them? Naturally, this means I’m missing out on the educational opportunities one can only get by implementing something from scratch. (That’s partly why I want to join a younger startup today.)

What to work on? I remembered there’s something I could improve on my smiling bot:

  1. Its non-neural face detector is slow to the point it puts my manliness in jeopardy;
  2. The two-stage detection pipeline is not accurate enough.

So I took on a challenge to write a deep neural face detector without peeking at reference implementations, without asking for help, and without any templates / codelabs / tutorials to guide me through the process. And unlike in the previous iteration of the Smiling Bot project, “good enough” would not be an answer this time. The model must work well and it must work fast.

I’ve learned quite a bit about deep learning model debugging, about Tensorflow 2.0 and tensor operations (I’ll write separate posts on those). And importantly, I’ve learned how long it takes to perform an undertaking like that (~2 work-weeks for me).

So while there’s more to write, I’m just really really happy that I was finally able to train a model that actually works on the validation set. I know there’s more work to do, and I’ll fine tune it later, but here’s how I feel right now:

Ok, this is an image from here that I couldn’t help but run through the most recent iteration of my face detector :D

Please do not look at its code located here 🤫 It’s not ready yet!

Why would it be faster?

The smiling/not smiling classifier in the Smiling Bot v1 already uses an AlexNet-based convolutional model.

However, fast single-stage detectors like SSD can perform both object classification and detection from the same feature map. Other single-stage detectors (e.g. YOLO) don’t add much tax to that either.

YOLO papers (e.g. YOLO v3) are sometimes written in a refreshing, no-frills, no-bs, Reddit-commenter style of science writing.

Check out page 6 of the PDF. Look, I agree that the style is uncanny, but I sometimes would like to see exactly these insights from the authors of many papers I read.

Thus I decided to build an SSD detector with an AlexNet backbone first, and take it from there.

Why SSD and why bother with AlexNet?

This choice is a bit strange, especially AlexNet, which few people use these days. But it’s not the final architecture. When building a very complex system from scratch, I try to build it one piece at a time and move on to the next piece only when the previous one is proven to work.

I already have a dataset where I know AlexNet works well. Out of all detectors, SSD seems to have the fewest moving parts (except for the Transformer-based detector Facebook released a month ago, but I wanted something more proven). So here’s how we change one thing at a time:

  1. Create a dataset with both detections and classifications. Prove AlexNet still works for the full-image classification.

  2. Attach an SSD detection head to the AlexNet as described in [Gu et al., 2017]. Prove, essentially, that my dataset has enough training data to perform detections.

  3. Move the model onto the device. Prove that I can do detections on device.

  4. Either

    1. Use another model for the backbone…
    2. …or use another model for the detector
    3. …or use another loss function…
    4. …or do something else cool.

    but not all at once! This way, I can debug these “modules” one by one.

  5. Once this “module” is ready, change other modules while constantly deploying on device.

Why would it be more accurate?

I think part of the issue with the original Smiling Bot’s model is the small training dataset. I only have ~10,000 labeled images of smiling / not smiling faces. A dataset of this size is nothing to write home about.

But I can get way more training data for just faces, without knowing their emotions, and use transfer learning. After training the SSD face detector, I plan to “sneak in” another class to the output and fine-tune on the smile / no smile dataset. My hope is, the features learned for face detection will be useful for face classification as well.
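
Schematically, the plan looks like the toy Keras sketch below; the tiny “backbone” and the layer names are placeholders, not my actual AlexNet/SSD model:

import tensorflow as tf

inputs = tf.keras.Input((128, 128, 3))
features = tf.keras.layers.Conv2D(32, 3, strides=2, activation="relu",
                                  name="backbone")(inputs)
face_logits = tf.keras.layers.Conv2D(4, 1, name="face_head")(features)

# Step 1: train the face detector on the large pseudo-labeled face dataset.
face_model = tf.keras.Model(inputs, face_logits)

# Step 2: "sneak in" a smile head on top of the same features and fine-tune it
# on the small smile / no smile dataset, freezing the shared features at first.
smile_logits = tf.keras.layers.Conv2D(1, 1, name="smile_head")(features)
smile_model = tf.keras.Model(inputs, [face_logits, smile_logits])
for layer in smile_model.layers:
    layer.trainable = (layer.name == "smile_head")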

Where would I get more training data for faces if I don’t want to do labeling? Here’s where knowledge distillation comes into play.

Distilling the Ground Truth from a different model

I didn’t want to order labeling of the face data, so I just ran an existing face detector (MTCNN) over the whole dataset to generate ground truth for my face detector (the original Google Face dataset didn’t have many faces labeled at all). It didn’t label all the faces, but let’s be honest, neither would human labelers.

I was trying to find the right term for this process; I originally called it “kindling”. Turns out, the proper term is “distillation”, as suggested by this Quora answer. I know it’s not “stacking”, nor is it “bootstrapping”… There must be a word for that…

So at the expense of just 130 lines of code and a small visualization, I had 100,000 images labeled with face detections.
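
The labeling loop itself is nothing fancy. Roughly, assuming the mtcnn pip package (the paths and output format here are made up, and the real ~130-line version also does filtering and visualization):

import glob
import json
import numpy as np
from PIL import Image
from mtcnn import MTCNN   # pip install mtcnn

detector = MTCNN()
labels = {}
for path in glob.glob("images/*.jpg"):
    img = np.asarray(Image.open(path).convert("RGB"))
    # Each detection is a dict with a 'box' ([x, y, width, height]) and a 'confidence'.
    labels[path] = [d["box"] for d in detector.detect_faces(img)
                    if d["confidence"] > 0.9]
with open("face_gt.json", "w") as f:
    json.dump(labels, f)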

At first sight, heuristic, imprecise ground truth makes no sense, but it works. This is something I struggled to grasp in the past (I had to take more senior modeling engineers at their word) until I witnessed it myself today. Deep models are good at generalizing from heuristic, imprecise ground truth.

On many images from the training set, the deep model I’ve trained was able to detect faces correctly even though the GT had no faces labeled. The model wasn’t able to effectively learn that there is no face on those images. In fact, now I also know that if I wanted it to not detect faces on these outliers, I’d have to try really hard (batch balancing / loss amplification / extra outputs).

After this experience, I also have renewed trust in Snorkel and other weak supervision techniques.

Future work

Well, this is just the beginning, but writing down my plans helps me evaluate them :)

Setting up for Success: Faster Features, Please!

There is no point in iterating over this dataset without reducing the CPU time. The current pipeline generates features concurrently with training. The feature generation currently works like so: on the CPU,

for each image in the batch            (x 32)
    for each scale                     (x 2)
        for each anchor location       (x ~1000)
            for each anchor            (x ~4)
                is IOU < 0.5?
                    then adjust features and masks.

…and all of that is Python! Running ultra-fast inference and backpropagation on the GPU is way faster than this. This lands on the wrong side of the Training Balance Equation, and the GPU usage is literally indistinguishable from 0%.

I could do two things here:

  1. Save the dataset on disk and use that saved dataset to tune the architecture.
  2. Come up with or find a faster GPU-based algorithm for box matching (I don’t know if one exists yet).

I would however opt for the faster GPU algorithm here, for the following reasons:

  1. The anchor-based detector’s feature set is closely tied to the network architecture. If I wanted to iterate on the number of anchors, the number of classes, or the IOU thresholds, I’d have to re-generate the dataset.
  2. Even when generating the dataset, why not make it fast as well?
  3. Learning more cool GPU algorithms is totally better than writing data pipelines.

(Update) Ok, so I’ve implemented that. Not only did I have to parallelize the anchor matching, but I also needed to move image resizing from ImageIO to TensorFlow. Here’s the commit that contains all the changes, and now my GPU usage is finally at 40%, compared to ~0% before. I think I’ll be releasing these utils as a separate library.
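
The core of the parallelized matching is a batched IoU computation that replaces the nested Python loops; here’s a sketch of the idea, not the exact code from the commit:

import tensorflow as tf

def pairwise_iou(anchors, gt_boxes):
    """IoU between every anchor and every ground-truth box, all at once.

    anchors: (A, 4) and gt_boxes: (G, 4) tensors of [y1, x1, y2, x2];
    returns an (A, G) tensor computed with vectorized tensor ops.
    """
    a = anchors[:, tf.newaxis, :]      # (A, 1, 4)
    g = gt_boxes[tf.newaxis, :, :]     # (1, G, 4)
    top_left = tf.maximum(a[..., :2], g[..., :2])
    bottom_right = tf.minimum(a[..., 2:], g[..., 2:])
    wh = tf.maximum(bottom_right - top_left, 0.0)
    intersection = wh[..., 0] * wh[..., 1]
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return intersection / (area(a) + area(g) - intersection)

# Matching then becomes a couple of reductions instead of nested Python loops:
# iou = pairwise_iou(anchors, gt_boxes)
# matched_gt = tf.argmax(iou, axis=1)
# is_positive = tf.reduce_max(iou, axis=1) >= 0.5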

Build an Inference Benchmark on Raspberry Pi

Once the quality baseline is established, I need to establish the latency baseline. It’s not enough to count FLOPs: the latency of a network also depends on its weights. For example, all neurons will be activated in a randomly initialized network, but in a trained one, ReLU zeroes out most values, which can lead to faster computation in the downstream layers.

So I’m going to deploy the model onto the device, and establish the “operations per second” tradeoff for trained convolutional networks, and go from there.
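
The benchmark itself can start as something as simple as the snippet below (the model is a stand-in; on the Pi I’d time the converted on-device model instead):

import time
import numpy as np
import tensorflow as tf

model = tf.keras.applications.MobileNet(weights=None)   # stand-in for my model
x = np.random.rand(1, 224, 224, 3).astype(np.float32)

model(x)                                  # warm-up: builds graphs, allocates buffers
runs = 50
start = time.perf_counter()
for _ in range(runs):
    model(x)
print("ms per inference:", 1000 * (time.perf_counter() - start) / runs)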

Explore different models

Once I establish a baseline for the performance and latency using the AlexNet backbone, I can move on to other network architectures. There’s no shortage of them to try: DarkNet and MobileNet, in addition to the traditional ResNet. But since these architectures aren’t ones I’ve tried before, I need the rest of the pipeline working so I can isolate failures.

Why Fast Models are Destined to have a CPU Training Bottleneck

I was working on fixing the problem with a slow two-stage detector in my Smiling Robot by writing a faster deep detector from scratch. There’s ample research on fast detectors, but feature preprocessing for those is really complicated.

TensorFlow 2.0’s Dataset API has been pretty pleasant to use. It offers a basic declarative API to transform and shuffle feature vectors, and its batch balancing functionality is a godsend (I once implemented batch balancing without it; now you can get it in about 5 lines of code). It can even parallelize your feature generation via Dataset.map (with num_parallel_calls) across however many CPU cores you have.
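
The batch balancing bit looks roughly like this (the two per-class datasets are toy stand-ins for the real feature pipelines; in TF 2.0 the sampler lives under tf.data.experimental):

import tensorflow as tf

smiles = tf.data.Dataset.from_tensor_slices(list(range(100))).repeat()
no_smiles = tf.data.Dataset.from_tensor_slices(list(range(900))).repeat()

# Draw from each class with equal probability, so every batch comes out balanced.
balanced = tf.data.experimental.sample_from_datasets(
    [smiles, no_smiles], weights=[0.5, 0.5]).batch(32)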

However, do you want to generate this dataset “on the fly” as you train, or do you want to generate it ahead of time, and read it from disk? Here are your options:

  • Saving training data to disk requires paying an upfront cost in storage and training time (and writing some code), but you can run many, many more iterations when finding the right model architecture and hyperparameters. And if you’re paying for GPU time, it’s better to utilize it at 100%.
  • Generating data on the fly allows you to change the data pipeline without paying a huge penalty. It works great when you’re just debugging your model and feature generation. However, if your data generation is slow, the training loop will be blocked waiting for more data to come out of the feature generation pipeline.

What to choose? Interestingly (and perhaps surprisingly), if your goal is to build a model with fast inference time, you will have to choose the former: saving your data to disk. And even then, it might not be enough.

The Training Time Balance Equation

Imagine generating one feature vector takes $T_g$ seconds on the CPU; you have $C$ cores, your batch size is $B$, and your target model’s inference time on a batch is $T_i$ seconds. Now let’s assume the backwards pass takes as much time as the forward pass (this seems pretty reasonable given how backprop works), so each training iteration on a batch takes $2T_i$. Then your pipeline will be bottlenecked by the CPU if

$$\frac{T_g}{C} \cdot B > 2 T_i$$

Essentially, if the time to produce a new batch of features $T_g / C \cdot B$ is higher than how much it takes to make a forwards and backwards pass on a batch ($2T_i$), then at some point of time your model will be bottlenecked by CPU.

The effect is more pronounced if you’re specifically developing models with fast inference time.  If your goal is to build an efficient model for say, a self-driving car, or a smiling robot that runs on a Raspberry Pi, your model’s inference time $T_i$ will be low because that’s exactly what you want it to be. This produces significant downward pressure onto the feature generation time.

So I wrote a quick-and-dirty timer for my training loop. The time to load and preprocess one feature vector turned out to be 0.028 seconds; I have 4 cores, and the training loop iteration takes 0.0104 seconds for a batch of 32. Let’s plug the numbers into the equation:

$ 0.028 \cdot 32 / 4 = 0.224 > 0.010 $

So it seems my GPU will spend $\frac{0.224}{0.224 + 0.010} \approx 96\%$ of the time waiting for new data, so the utilization will be only about 4%. This exactly matches what nvidia-smi says 😬.
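
For the record, the whole check fits in a few lines if you want to plug in your own timings:

t_g, cores, batch_size = 0.028, 4, 32   # sec per feature vector, CPU cores, batch size
t_train_batch = 0.0104                  # measured forward + backward time per batch
t_features_batch = t_g * batch_size / cores               # 0.224 s to produce a batch
gpu_utilization = t_train_batch / (t_train_batch + t_features_batch)
print(f"{gpu_utilization:.0%}")                            # ~4%, matching nvidia-smi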

Will buffering help?

Even if you add a buffer (via shuffle or prefetch), the buffer will eventually be depleted before the training finishes. Buffering takes time, and the more you want to postpone the depletion by making the buffer larger, the longer prefetching will take, so this equation will never balance the way you intend. Think about it this way: the total time to run the training loop is the max of these two timings:

  1. time to do forward and backward pass on each element in the dataset
  2. time to generate a feature vector on each element in the dataset (regardless of whether the element goes into the buffer when prefetching or when training)

Each of these is the Balance Equation discussed above, multiplied by the dataset size. Neither has a term for the buffer size; therefore the buffer size doesn’t matter.

Why use buffers then?  Buffering helps to combat variance in the training data acquisition time (e.g. when reading from a volatile remote disk: read more when reads are fast, so you can read less when the reads are slow without losing progress).  It does not help speed up training time averaged over large periods of time.

How to save to disk

So it’s completely sensible to start with an on-the-fly pipeline, iron out the data until you’re confident that the feature vectors are correct, and then save the large dataset to disk. Saving the data to disk takes only a few lines of code.

I tried it using tf.data.Dataset.shard in combination with this guide, but for some reason it runs very slowly.

import tensorflow as tf

def _bytes_feature(value):
    """Returns a bytes_list from a string / byte."""
    if isinstance(value, type(tf.constant(0))):
        value = value.numpy()  # BytesList won't unpack a string from an EagerTensor.
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_list_feature(values):
    """Returns a float_list from a float / double."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=values))

def serialize_example(entry_key, entry_img):
    example_proto = tf.train.Example(features=tf.train.Features(feature={
        'key': _bytes_feature(entry_key),
        'img': _float_list_feature(entry_img)}))
    return example_proto.SerializeToString()

NUM_SHARDS = 10
for i in range(NUM_SHARDS):
    shard = self.train_dataset.shard(NUM_SHARDS, i)
    def tf_serialize_example(entry):
        tf_string = tf.py_function(
            serialize_example,
            (entry['key'],
             # tf.Example only supports flat arrays
             tf.reshape(entry['img'], (-1,))),
            tf.string)
        return tf.reshape(tf_string, ()) # The result is a scalar

    writer = tf.data.experimental.TFRecordWriter('sharded-dataset/shard-{}.tfrecords'.format(i))
    writer.write(shard.map(tf_serialize_example))
    

Well, I’ll debug this in the follow up posts.

I'm now Stanford-Certified with a Non-Degree Certificate

Would you like to dig deep into Deep Learning, but don’t know where to start?

Do you want to work on Artificial Intelligence, but fear you don’t have enough intelligence?

Trying to remember what Stochastic Gradient Descent is, but can’t because you graduated a long time ago?

And finally, can you only find motivation for Learning when you’re being Supervised?

Rejoice! You can get this all and more from a certain service… It’s called Grad School and it’s available for hire.

A year and a half ago, I enrolled in a CS-related graduate program for professionals at Stanford, and have finally completed it in December. Here’s my new shiny certificate. In total, I took 4 courses, and it took me 5 quarters to complete (I took a break in the summer.) In this post I’ll describe this program in case you’re interested.

What is SCPD?

SCPD stands for Stanford Center for Professional Development. Here’s their website. They offer courses in a variety of topics, from Foundations of Computer Science to Cybersecurity Strategy. I enrolled in the Artificial Intelligence program, which featured courses from the Computer Science department.

In this and most other SCPD programs, you just take classes with other Stanford students: you listen to the lectures (online), submit the assignments (written solutions to the math problems or computer code), get grades, take exams and midterms, and discuss your problems with the Teaching Assistants.

Each course takes a quarter to complete. I completed 4 courses over 5 quarters because I wanted a summer break. I’m sure there’s a total time limit, but I can’t find any reference to it now.

Usually, each program offers multiple courses, and you need to complete a certain set of them. For example, for the AI certificate you need to complete 4 of the 14 courses offered: one mandatory (CS221 Artificial Intelligence: Principles and Techniques) and three electives. The set of courses offered is updated to keep up with the times.

You don’t have to complete the certificate program; in fact, you can just take the classes you’re interested in and take it easier with the grades and the assignments. You would still need to go through similar enrollment steps, but the expectations will probably be lighter.

If you finish the program, you get a piece of paper like the one above. And if you list it on LinkedIn, you’ll have a Stanford icon next to your profile ;-)

What to expect?

Expect a lot of hard work if you’re actually going for the certificate (if you are just taking some courses, then pick your own expectations). The requirements listed on the certificate page are strict, so please review them.

  • 10 hours of work a week + several weekends spent on the final project, and that’s if you take one course at a time. You are required to take courses for 4 units (1 unit = 3 hours a week, so 4 units is 12 hours a week) and ship a course project every time. This is consistent with my experience, and this is a serious workload. Consider this:
    • For each course taken, I had to work through 3-4 weekends on the final project. Do you like to spend an occasional Saturday afternoon chilling at the park? Forget it, unless you’d bring an ML book with you.

    • I had to put my hobby (bike racing) on hold for the entire 2019.

    • My performance at work decreased. I had very little creativity left and noticed I often just went with the flow instead of being the creative troublemaker I normally was. And this creativity is part of your job description and expectations if you’re at a Senior level or above. I definitely missed out on some opportunities because I was tired.

  • You need to earn at least a B grade each time if you do want to earn the Certificate. This is hard: it actually means you need to submit most of the assignments on time and be proficient enough in math to do well on most midterms.
    • Expect a regular cadence. Typically, you need to submit an assignment every 1-2 weeks.
    • The school doesn’t stop when you go on vacation. Rescheduling because you have a vacation is not explicitly allowed (although I never tried to negotiate it). I found myself finishing up my final project at a ski trip… twice.
    • If earning the certificate is not your goal, and you’re “just taking classes”, you might think that earning a grade is not among your goals. However, in most classes it’s pretty hard to not earn at least a B if you’ve actually studied and understood most of the material. Consider doing the free alternatives in this case.
  • Each course costs $5,000 in the AI program, and you need to take 4 of them. The tuition for other programs is different: it can be higher for Entrepreneurship certificates and lower for Foundations of CS and the like.
    • Pro tip: are you employed by a large company who can cover a part or the entirety of your tuition for work-related education? Use that option if you can.
  • The courses are not easy. Typically, many of the courses you take in college are easy and some are hard. But in this certificate, most courses are hard.

That sounds hard. Why not just learn about Machine Learning online?

There are so many guides, there’s Towards Data Science, there are books, and free videos online… But there are things you can only get through supervised education at a known institution.

  1. Personalized feedback. Go talk to the TAs (Teaching Assistants, roughly what Russian universities call seminar instructors) about your homework and about the projects. Zoom videoconferencing is available to remote students (but I mostly just drove to Stanford for that). If you are struggling with something but decide to skip this, you’re not getting the most out of your $5k: go talk to a TA.
  2. Experienced advisors. For project work, pick the more senior TAs and talk to them. Find excuses to talk to them (a good excuse can be “my project fits with your research interests” or “one of your papers seemed relevant to my project so I wanted to discuss it with you”). Your project is not relevant? Consider doing a project that is; I’m serious here.
  3. Accountability. If you lack inner drive, enrolling won’t help you start, but for many people, accountability helps to be consistent. Due to the difficulty of the courses offered, sometimes you will feel drained and demotivated, and knowing that you’re accountable to your school and to the TAs who helped you might just give you that last bit of motivation.
  4. Actual completed projects and project mentorship. As part of most of the classes, you will have to (a) come up with a project, (b) iterate on it, (c) evaluate it, and (d) prove that ML made a difference compared to simple heuristics. These are all skills that are indispensable in your career as a data scientist, and top students and professors from Stanford will help you there. It is the projects that made the actual impact on my career, not just the learnings I got from the studies.

How to enroll?

So, the enrollment is open now for the Spring Quarter. What do you need to do?

  • First, the enrollment is time-sensitive. The available courses (just like the school itself) are in high demand. The spots usually sell out in the first days, or 1-2 weeks tops, after enrollment opens. You want to put a reminder on your phone and enroll as soon as it’s available. Find the enrollment dates in the Academic Calendar for the right year (e.g. here’s the Calendar 2019-20) under something like “Axess opens for course enrollment”. Then, enroll via the SCPD website, not through Axess!

  • You need to enroll into the program itself in addition to enrolling into the specific courses (because the AI certificate is “in high demand”, duh). This was the active link with more information at the time of writing. Get the required documents ready ahead of time. Find the list through the SCPD website; the list will include your college transcript in English. I used the translation from Russian that was prepared for one of the visas.

  • You need to nominate an “exam monitor”. Usually, this should be your manager or a person who can pretend to be one. Their role is to supervise your exams and make sure you don’t cheat and abide by the time limit. So make sure to pick a busy manager whose calendar is full of meetings and who won’t have the time to care. ;-) And while I did nominate an exam monitor, I took all tests on campus anyway.

  • You need to design your program to minimize the completion time. For example, CS224N is offered once per year, whereas CS231n is offered three times per year. If you’re interested in both, it might make sense to take CS224N when you have the chance, and take CS231n some other time, even though CS231n is a great starter course. (Actually, in one of the CS221 assignments, you’ll write a constraint solver to design your program this way for you automatically :-D).

    See the section below on which courses I took and which I decided not to take and why.

Why do you want it?

What’s the goal you have in mind? Different people enroll for different reasons:

  1. Change the career path.

    The credit earned in these courses counts towards a Master’s degree, should you choose to pursue one and should you get accepted. One former colleague of mine did that. Now she works as a Natural Language Processing ML engineer for Google Ads.

    But she is actually an outlier! You’ll be surprised though that the hiring managers in Silicon Valley do not consider such a certificate a big advantage, especially if they’re looking for people with production experience in machine learning. Think about it: many Stanford CS students take the same courses as the ones offered in the certificate, but not all of them would be qualified for a specialized ML role. (This applies to Silicon Valley though; perhaps other places are not crowded with recent Stanford graduates and this certificate will mean something).

    To this point, the other two colleagues with the same ML certificates and Master’s from Stanford are working on ML infrastructure without any modeling in their duties. While I’m sure it helps them in their jobs (and managers are quite willing to consider such a certificate a big plus), one doesn’t really need a degree to do that.

    In other words, this degree is not a guarantee, but a step towards a different career. E.g. you could use what you learned to start an ML-related side project at your current job. See this classic video titled “You and your Research” but it should’ve been titled “You and your career”.

    This is a well-trodden albeit a risky path. That colleague I mentioned at the beginning had transferred to her ML modeling role from the ML infrastructure position shortly after getting her stamped degree. Heck, I transferred from ML infra myself. But it’s not the degree that gets you there. It’s your grit and the willingness to go the extra mile; education merely gives you the tools.

  2. I just need a change. However hard it was to admit, this was a big part of my motivation: enroll first, and then see where it takes me later. Do this at your own risk, and only if you’re a person who lives by this principle in other aspects of your life as well.

  3. Gain deeper understanding of the topics. Have you been doing “classical ML” and want to learn more about deep learning? Or say you’ve recently joined a team that works on Natural Language Processing, but the topic is new to you and you need to bring yourself up to speed. Then these courses could be a great tool; just keep in mind that some courses offer deeper understanding and some do not. In this case, consider also just taking 1-2 classes without completing the Certificate program.

You’ll have to use every free minute to study. Here’s me doing homework on the Google bus.

The Wrong Reasons

However, there are things that these courses won’t help you with.

  1. Learning about the cutting-edge techniques actually employed in the industry. No course geared towards undergraduate students will tell you that. Many of the TAs don’t know it either. You can get this knowledge on the job, or learn it from the engineering community. Undergraduate-level courses are the wrong tool for that.

  2. Getting an equivalent of N years of experience on your resume. No, you won’t be able to pretend that you’re a senior, experienced ML developer after just 2-3 months’ worth of personal projects of undergrad-student quality. I tried, and it didn’t work. That’s less than one internship worth of workload.

  3. Just to have fun. School is not fun. You’re a professional now. Leave the all-nighter drinking parties you had in your teens to the confines of your memories. At top schools, “3 units of credit” actually means “9 hours of coursework a week”. You won’t get your youth back by enrolling into a school.

  4. I need accountability to actually get something done. So… about that. If you don’t have an intrinsic motivation to get the studies finished, you won’t make it. It’s like thinking you’ll go to the gym more if you buy the membership: you won’t, and $5,000 is a lot of money.

How to try before you buy?

If paying money for things or spending $5k without prior research is not your thing, you can get a lot of that education for free or at least try the course before you buy it.

The algorithm is simple:

  1. Find the course you’re interested in in the Course Catalog.

  2. Google “Stanford <course name>”. The first link will be the course website, wide open to the public. Read the lecture notes. Download and try to solve the assignments.

It’s insane how much of this stuff is available for free. Like, here are the full lecture notes for CS231n. There are lecture notes for other classes too. The lecture notes for CS229 are a book’s worth of foundational information for “classical ML”.

Besides, a simplified version of CS229 is offered on Coursera. Homework is not required.

And of course, besides Stanford, there are other resources where you can learn about ML. Consider Google’s Machine Learning Crash Course. It’s much more practice-oriented, and you’ll avoid doing all the math. If you take it and feel like you crave more math and are willing to learn these concepts on a deeper level, then proceed to some more advanced program. If not, proceed to hacking on cool ML-y things.

Go Local

Now, I know, it’s hard to beat a certificate from Stanford. Few schools have the credentials, the reputation, the resources, and the professors that Stanford offers. If you have never been to the Stanford Campus and you live close by, you gotta check it out! It even has a Rodin Sculpture Garden.

However, if you live, say, close to another good school that offers a similar program, consider enrolling into that school instead. There are certain benefits in face-to-face communication with TAs and other remote-ish students.

Yes, the SCPD policy discourages you from attending the lectures and tests on campus. While you “are not guaranteed a spot” in the exam auditoriums, this was never a problem for me. I just showed up and always found a spot. The TAs and professors were always happy to talk in person rather than mess with Zoom or other remote conferencing. Although, I once drove to the campus only to find that my TA was home, doing videoconferencing via Zoom.

Besides, some Poster Sessions, where you display and present your project with a poster, are essentially giant career fairs. The CS231n “Convolutional Neural Networks” poster session especially was one: Waymo, Zoox, Cruise, and a bunch of other companies deployed their recruiters there.

The same can be said about other industry events too: CVPR has reportedly become not just a career fair but a full-fledged industrial ML expo in addition to being a kinda scientific conference.

Stuff you get as a Stanford non-student

No, you can’t get a Student ID, so you won’t be eligible for discounts. You will, however, receive access to the digital library, and you’ll be able to download all the papers needed for your work from ACM, IEEE, and other websites for free.

And you will get the @stanford.edu email for the duration of your studies. Make use of it while you can.

Just get a GPU. It’ll be easier than messing with remote access to the free cloud compute.

Since Machine Learning requires a lot of GPU time, you’ll get credits on Microsoft Azure or Google Cloud to run GPU-intensive computations for class assignments and projects. But if you have a gaming computer at home, you don’t need to do that. Last year, while I did buy a new GPU, I was able to make do with my old GeForce GTX 1070 for the most part, which you can buy new for \$300 (it aged well, as far as machine learning goes). Remember, you’ve shelled out \$20,000 for tuition, so you can definitely spare \$300 for the GPU.

Even if you don’t have a gaming PC and only have a non-gaming laptop, consider making an External GPU setup like the one I described a year ago in the blog.

Tips and tricks

  • There are social mixers for local SCPD students. Join them.
  • After your enrollment finishes… nobody will tell you what to do next. I had to ask my friends where to find the schedule. Look into your Stanford email account for instructions, and if that fails, just Google your course name. The webpage will show the course schedule, and you can watch the lectures remotely via the Panopto video delivery website.
  • If you’re at a big company, or are a member of a local meetup, find friends to do coursework together. I partnered with two other Googlers for the CS224n course project.
  • In the second or third week, start thinking about your final project. It’s not due for another 2-3 months, but it will take a long time. Discuss your project with the TAs; it’s a better use of their time and yours than trying to solve that one homework problem.

Which Courses I Took (and which projects I’ve completed)

Out of the AI certificate courses, here’s what I’ve taken (in the order they might be interesting to you), with a more in-depth description to follow.

CS231n Convolutional Neural Networks for Visual Recognition

CS231n Convolutional Neural Networks for Visual Recognition is the hands-down best course on Deep Learning and neural networks I have taken. If you’re interested in practical deep learning, take this course first. At the time of publishing this blog post, the enrollment for this course is open, so don’t miss your chance. You’ll not only learn how to work on computer vision, but you’ll learn the foundations of deep learning, starting from manually deriving and implementing gradients and backpropagation, culminating in applying tried-and-true architectures to foundational computer vision tasks (including “nitty-gritty” details like building activation maps).

While this course would have been state of the art 5 years ago, it seemed a little dated to me. I wouldn’t take it for the cutting-edge updates on the topic, but rather for the foundations.
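To give a flavor of what “manually deriving gradients and backpropagation” means in the early assignments, here’s a minimal sketch in plain numpy of a two-layer network’s forward/backward pass plus a numeric gradient check. This is not the course’s starter code; all the names and shapes are mine.

```python
import numpy as np

def forward_backward(x, y, W1, W2):
    """Two-layer net: ReLU hidden layer + L2 loss. Returns loss and gradients."""
    h = np.maximum(0, x @ W1)           # hidden activations, shape (N, H)
    y_hat = h @ W2                      # predictions, shape (N, 1)
    loss = 0.5 * np.sum((y_hat - y) ** 2)

    # Backward pass: chain rule, layer by layer.
    d_y_hat = y_hat - y                 # dL/dy_hat
    dW2 = h.T @ d_y_hat                 # dL/dW2
    dh = d_y_hat @ W2.T                 # dL/dh
    dh[h <= 0] = 0                      # ReLU gate
    dW1 = x.T @ dh                      # dL/dW1
    return loss, dW1, dW2

# Numeric gradient check: the analytic gradient should match finite differences.
rng = np.random.default_rng(0)
x, y = rng.normal(size=(4, 3)), rng.normal(size=(4, 1))
W1, W2 = rng.normal(size=(3, 5)), rng.normal(size=(5, 1))
loss, dW1, dW2 = forward_backward(x, y, W1, W2)

eps = 1e-5
W1_perturbed = W1.copy()
W1_perturbed[0, 0] += eps
loss_plus, _, _ = forward_backward(x, y, W1_perturbed, W2)
numeric = (loss_plus - loss) / eps
print(f"analytic {dW1[0, 0]:.5f} vs numeric {numeric:.5f}")
```

The assignments make you do exactly this kind of sanity check before you’re allowed anywhere near a framework that does it for you.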

  • My Project: “A Battery-Powered Friend: Towards Embedded Emotion Recognition” (poster, report). This is my most successful project here; I described it extensively in my blog, and I won’t repeat myself, but I literally built a Raspberry Pi device that ran a CNN on device and “smiled at you” with its green LEDs if it detected a smile with its camera. The project included hiring MTurk workers for image labeling, hacking on the Raspberry Pi, and of course deep learning, which made it a really fun experience.

  • Note: Fei-Fei Li only gave 2 lectures when I was taking this course; most of the instruction was done by the head TAs. That is not to say the quality of this course is subpar: the instructors are experienced and motivated, and Justin is a great teacher. I just don’t want you to come in with the wrong expectations.

  • Poster session for CS231n was essentially a career fair this year. Come in person if you can.

CS221 Artificial Intelligence: Principles and Techniques

CS221 Artificial Intelligence: Principles and Techniques is the required course that covers a wide range of AI topics: approximate graph search algorithms (A*), Reinforcement Learning (with the fun Pac-Man challenge I got 4th place in), Markov Decision Processes, Constraint Satisfaction solvers, and other basic tools of “classical” AI without a lot of ML per se. If you think this course is easy because it’s an “overview” course, don’t be fooled. You are expected to dig deep and demonstrate understanding of all of them on the 2 midterms and across 8 assignments.

The course was taught by Percy Liang; he’s a great instructor and researcher; in fact, our project for the CS224n course (see below) was an extension of his paper. :-)

Algorithm performance evaluation using… ahem, volunteers 😇

  • My project: “Using Weather Data for Wine Recommendations” (poster, final paper). It was my first project so don’t mind the poorly formatted report. As recommended by the TAs, I picked an area I was passionate about, and tried to come up with something useful that had a dataset I could train on.

    Let me tell you a bit more because I did get some interesting results here. The goal of the project was to build a recommendation system for wine based on wine reviews. The algorithm would read the professional wine reviews, and then produce recommendations to users based on similarity. However, I wanted the algorithm to learn similarity on its own. The added challenge was that we were strongly encouraged to not use neural networks to get practice with “Classical ML”. I downloaded all wine reviews from one of the major wine magazines I subscribed to and got to work.

    After a bit of research and trial and error, I found an algorithm that worked well: Latent Dirichlet Allocation as a latent feature vector generator. I then used K-means clustering to evaluate the quality of the latent vectors (much later in my career, I learned that this is how a lot of unsupervised learning models are evaluated). A minimal sketch of this pipeline follows the list below.

    It was a success! Not only did the algorithm learn to separate red wine from white wine (sic!), it also learned to separate Chardonnay from other white wines, and it produced distinct categories for Chateauneuf and Pinots, without ever being told that any of those exist! The weather data… well, shall we say, it did not improve the recommendations, but I was blown away by the power of unsupervised learning alone.

    I compared my recommendations against those made by the experts at a reputable wine store, and a blind tasting with friends revealed my algorithm was indeed better at judging similarity than the humans. ;-)

Leaderboard for Pac-Man competition featuring my submission 5 days before the deadline… :-( If you get a large score, do not reveal it!

  • The course features the famous Pac-Man challenge, in which students teach an AI to play Pac-Man. Pro tip: if you want to win the challenge, do not submit early. Otherwise the other students will get motivated by your scores and will try harder. :-(
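Here’s the promised sketch of the wine project’s LDA-plus-K-means pipeline. This is not my original project code: I’m assuming scikit-learn and pandas are available, and the file name and the “description” / “variety” / “title” columns are hypothetical stand-ins for whatever your dataset has.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

reviews = pd.read_csv("wine_reviews.csv")          # hypothetical file name

# Bag-of-words counts feed the topic model.
vectorizer = CountVectorizer(stop_words="english", max_features=5000)
counts = vectorizer.fit_transform(reviews["description"])

# Each wine becomes a vector of topic weights -- the "latent features".
lda = LatentDirichletAllocation(n_components=20, random_state=0)
topics = lda.fit_transform(counts)                 # shape: (n_wines, 20)

# Sanity-check the latent space by clustering it: do clusters line up
# with varietals (red vs. white, Chardonnay vs. everything else)?
clusters = KMeans(n_clusters=10, random_state=0).fit_predict(topics)
print(reviews.assign(cluster=clusters)
      .groupby("cluster")["variety"]
      .apply(lambda s: s.value_counts().head(3).to_dict()))

# A recommendation is just the nearest neighbors in topic space.
similarity = cosine_similarity(topics[:1], topics)[0]
print(reviews.iloc[similarity.argsort()[::-1][1:6]]["title"])
```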

CS224n Natural Language Processing with Deep Learning

CS224n Natural Language Processing with Deep Learning. I learned about Recurrent Neural Networks, LSTMs, word vectors, and neural language models. In two of the assignments, you’ll implement portions of a Neural Machine Translation system and train an encoder-decoder network to translate English into Spanish. You’ll do it once with word embeddings, and another time with CNNs. This is insane!! These were the world’s most mind-blowing achievements 10 years ago, and now you can do it with your own bare hands (and a GPU or two).
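To make “encoder-decoder” concrete, here’s a heavily simplified sketch of the idea in PyTorch: toy dimensions, teacher forcing, none of the assignment’s attention or beam search. All names are mine; this is not the course’s starter code.

```python
import torch
import torch.nn as nn

class TinySeq2Seq(nn.Module):
    """Toy encoder-decoder: encode the source sentence into a hidden state,
    then decode the target sentence conditioned on that state."""
    def __init__(self, src_vocab, tgt_vocab, emb=64, hidden=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # The encoder summarizes the source sentence into (h, c).
        _, state = self.encoder(self.src_emb(src_ids))
        # The decoder starts from that state; in a real setup its input would
        # be the target shifted by one token (teacher forcing).
        out, _ = self.decoder(self.tgt_emb(tgt_ids), state)
        return self.proj(out)            # (batch, tgt_len, tgt_vocab) logits

model = TinySeq2Seq(src_vocab=1000, tgt_vocab=1200)
src = torch.randint(0, 1000, (8, 12))    # a batch of source token ids
tgt = torch.randint(0, 1200, (8, 10))    # a batch of target token ids
logits = model(src, tgt)
loss = nn.functional.cross_entropy(logits.reshape(-1, 1200), tgt.reshape(-1))
loss.backward()
print(loss.item())
```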

This is a pretty new course, and it did represent the state of the art in NLP when I was taking it a year ago. If you’re interested in NLP, this course is a must-take. Out of all the courses, this was the best taught, most modern, and most memorable to me.

  • Our Project: “Digital Mad Men: Training RNN as an Ad Copywriter” (poster, report). This was a group project with Sri and Lin, two other Googlers who were participating in the program. We applied some new research from the Stanford NLP lab on variational language models to try to get the neural network to write positive product reviews by inserting product names into existing reviews. We couldn’t get the network to insert the keywords, sadly, but we learned a lot in the process.

  • Pro tip: find your peers who are doing the same course on your internal Slack / mailing list, and band with them for the final project. It’s a bit difficult to get people with busy schedules to collaborate, but it’s worth it.

  • This course was taught by Chris Manning (60%) and Abigail See (40%); both are actively working in the field and are great instructors. I’m sad Abigail doesn’t seem to be teaching this course this year.

AA228 Decision-Making under Uncertainty

AA228 Decision-Making under Uncertainty. This is a combined CS/Aeronautics course that covers modeling and working with uncertainty in decision-making and planning. It’s quite a niche course; unlike the ones listed above, it almost exclusively focuses on planning: Q-learning / SARSA, Monte Carlo tree search methods, and partially observable Markov Decision Processes.

The most useful outcome for me was that I finally understood how to use Bayesian inference in practical contexts. This course starts with the foundations, and the instructor takes the approach of “unless you truly understand the foundations, we might as well not even talk about POMDPs”. I can finally work with conditional probabilities for real, and I am now able to contribute to modeling the mission objectives at work.
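As a tiny illustration of the kind of reasoning the course drills in, here’s a made-up Bayesian update on a perception-flavored example. The numbers are arbitrary and mine, not from the course.

```python
# Bayes' rule: how likely is a pedestrian actually present, given that the
# detector fired?  All numbers below are arbitrary illustration values.
p_pedestrian = 0.01            # prior: P(pedestrian)
p_detect_given_ped = 0.95      # sensor model: P(detection | pedestrian)
p_detect_given_none = 0.05     # false positive rate: P(detection | no pedestrian)

p_detect = (p_detect_given_ped * p_pedestrian
            + p_detect_given_none * (1 - p_pedestrian))
p_ped_given_detect = p_detect_given_ped * p_pedestrian / p_detect

print(f"P(pedestrian | detection) = {p_ped_given_detect:.2f}")  # ~0.16
```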

  • My Project: “Self-driving under uncertainty” (report; poster not required). I applied Monte Carlo tree search to decision-making given imperfect perception. The project didn’t work out in a variety of ways: essentially, I learned why MCTS and other such methods are not yet widely used in the industry. The POMCPOW planner I implemented was myopic and required handcrafted features to even drive forward and not stall in place.

    Interestingly, we’d done a similar project in CS221 as part of one of the assignments; the methods used there worked much better for a similar formulation.

  • Pro Tip: this course is easier and less intense than the other courses.

Which Courses I Passed On

  • CS230 Deep Learning was just… not deep enough, according to my friends who took it. Other courses will give you the same introduction to Deep Learning plus deep practical knowledge of a specific area.
  • CS229 Machine Learning was mostly about “classical” Machine Learning, with a lot of math to solve in the assignments and not a lot of practical knowledge. Instead, I just read the lecture notes I downloaded from the internets at my own pace over the summer.

My personal Learning Outcomes

So what did I get out of it?

  • I can have a conversation about deep learning with other people. Heck, I can make jokes about ML! (Next milestone: make jokes people laugh at).
  • I can read, understand, and pass small amounts of judgement about the relevant ML papers.
  • I revived the dormant memories of linear algebra, statistics, and calculus.
  • I learned that 80% of Machine Learning is indeed downloading datasets, generating features, writing evaluation metrics, and trying to find labeled data relevant to your project.
  • …and finally, I did achieve my goal of changing the career path. I work on self-driving cars at Zoox now.
Good luck.

The Blog Turns... 10 Years!

Today is 10 years since I wrote the first post for my blog. Ten. Freaking. Years. It’s almost hard to believe that I had something remotely meaningful happen in my life 10 years ago.

So… uhm, what else do I say? When I started this blog, I was still living in Russia; writing about programming in English was a strange choice that many questioned. I’m glad that I started it that way. Gosh, was my English bad back then, but at least I got some practice in.

This proved helpful, though not really for language acquisition. This blog helped me become a better writer, and there is a lot of written communication required of a high-level software engineer in Silicon Valley. When friends ask me what [programming] language I use at work, I usually reply “English”. It’s true! There were periods when I churned out a new ‘design doc’ every week at Google, and those periods turned out to be crucial for my career. If nothing else, this was the main thing this blog helped me with.

Greatest Hits

Maybe I should go over the greatest hits? According to Google Analytics, my most popular pages over the last year were:

  1. How to Get Binary Search Right, which is basically job interview advice. But I cheated here a bit. See, when I joined Google Ads, I got the ability to spend one dollar a day on Google Ads traffic.

    Here’s the demonstration of what $1/day spent on Google Ads can do for your content.

    Without the ad spend, the post is getting almost no clicks. I still think it’s a good post but alas.

  2. Easy parallelization with Bash in Linux with my favorite seq 10 | xargs --num-procs 5 Bash trick.

  3. Limiting time and memory consumption of a program in Linux, my most successful open-source project that emulated cgroups before they were stable.

  4. Take Apart that Giant Case: A Compact, 1.5kg Thunderbolt3 eGPU Setup detailing my experience with tearing down and rebuilding an eGPU case.

  5. Busting C++ myths: virtual function calls are slow showing how virtual functions work in C++ and why one shouldn’t fear them (I’m afraid some of my claims there were unfounded, but so are the fears of virtual function calls).

  6. SF Muni LED Sign at Home with Raspberry Pi, my most famous project, which was featured on Reddit and on a bunch of makers’ websites.

  7. Consistency Models Explained Briefly, where I filled the gaps in Google search and detailed what Quiescent Consistency means. This page is linked from some of the lecture slides in advanced CS courses; a lot of students visit it, and I’m truly honored.

And a lot of other things. I really like my attempts at short witty stories like Zeno’s Flat Screen TV Paradox, or “Lesson III”, or “A dying Tux in an Airbus”, which sometimes get featured on Reddit. But nobody reads those. Most people who google are looking for answers, and they’re finding them on the “how-to” pages like the “AWS Beanstalk Speed Run”.

What’s next

If I were to truly chase popularity, writing a blog is no longer the way to do it. I should make a YouTube show or record a podcast; that’s just how content is consumed now. The closest I got to YouTube was this video I recorded for one of my blog posts. Truth be told, I didn’t actually write that blog post for the readers of the blog (I have very few anyway). I made it primarily as an illustration for an internal Google user study run by the Maps team when they were considering taking away the ability to drag the point around with the mouse.

That could be promising though. There are successful shows that would’ve been blogs 10 years ago, take “Two Minute Papers” for example. But I’m not here for clicks.

This blog is primarily a way for me to organize my thoughts and to just… let it out. This blog is also an interesting recap of my career progress. Let’s plot the number of posts over the years.

At the beginning, I was just letting it all out. Everything was new and exciting. A new merge sort algorithm–blogged! A new C++11 feature–blogged. Someone aligned the comments to the right, like in the math books–blogged. I failed a job interview–blogged. In that last post I also said I’d never work for Google, and there I was, working there for almost 7 years, ha.

Back then, everything was new and exciting. And then it tapered off. I made 0 blog posts in the whole year of 2015. Arguably, 2015 was the most intense year of my life; while I could say I had other things going on, I did get these things done! Why didn’t I blog?

The answer was simple, although I had to search hard for it. Software was no longer new and exciting. I looked at this (and at a bunch of other things) and realized I was stagnating. I decided to change my career. It took me two years, but instead of your run-of-the-mill mid-level tech lead, I’m a machine learning engineer now. I’m working on self-driving cars, building models of traffic, making ML projects with Raspberry Pi, and posting solutions to little math problems.

The software is exciting again. I feel young. And I’m anxious to tell the world all about it.

P.S. To all the readers who have been reading my musings, and who did come here through an RSS feed: thank you. I’m grateful that you’re reading these posts, and I only wish you commented more. Or just shoot me an email, like some of you did. E.g. Martin, who helped me fix errors in the post about Binary Search–Martin, I remember your name 6 years later; huge kudos to you and to the others.

Lessons from a Smiling Bot, III: Customers don't Care it's Hard

Other lessons from a Smiling Bot

The lessons I’ve learned when building a Smiling Bot: an embedded CNN model for the deep learning for Visual Recognition course at Stanford.

There was this woman I really liked, and I needed to show off in some way. How do you impress a girl? Of course you show her your machine learning projects (gotta rock what you have 🤷). I no longer have the Muni Sign on my wall though–that’s been a reliable magnet for ladies–so I showed her my Smiling Robot.

She was impressed…not. Here are her exact words after she saw my Raspberry Pi correctly read a smile off of my face:

That’s nice… It’s slow though.

What?! Two months of hard work on top of ~10 years of CS education, several sleepless nights, modern groundbreaking research packed into a Raspberry Pi, a poster with Arnie, and all I got was…

It’s slow.

Yeah, sorry girl (not sorry)! I had only built a very optimized CNN using neural architecture search; the non-neural Viola-Jones boosting-based face detector that I downloaded from the internet without much scrutiny and haphazardly slapped on in Lesson II wasn’t that optimized. It only takes three more seconds to give the result (totaling five seconds), and that’s on an ARMv5 chip. ARM vee five! Isn’t that awesome??

It’s slow.

In the heat of the moment, her response hurt my ego so much that I even started to explain what was going on. But I quickly shut up. Maybe I shouldn’t be that defensive. Not only was I missing out on an opportunity to better myself, it’s just not attractive for a guy to be defensive. Acting manly seemed more important at the time, so I swiftly moved on to other subjects, but her words stuck with me:

It’s slow.

Gotta admit, she’s right. Despite all the hard work I put into the Bot, it is a tad slow. I have excuses but it doesn’t matter: that’s a valuable lesson to learn.

Customers do not care that your machine learning project was hard!

Of course they don’t care. People are surrounded with technology that, in the words of Arthur C. Clarke, looks “indistinguishable from magic”.

Yes, most have no idea how to backpropagate gradients, but then they talk to (talk to!) their Google Home, and it knows what they’re looking for before they even say it.

They can’t explain why residual connections are a good idea for deep convolutional encoder-decoder architectures (to be honest, nobody can explain that; it just works, okay?) But they open Snapchat, and the filter swaps their face with their bff within a millisecond.

They can’t appreciate that simple linear models can do wonders when thousands of features are thrown at them. But then they open Facebook, and it shows them stuff they don’t like, but they still open it again and again… Well, that one doesn’t sound as magical–but the picture they snapped with their phone looks better than an amateur photographer’s attempts at art shot on a DSLR.

Then they open your product, and…

It’s slow.

They’re right. Don’t ship crap.

Lessons from a Smiling Bot, II: Get Features from Production

Other lessons from a Smiling Bot

The lessons I’ve learned when building a Smiling Bot: an embedded CNN model for the deep learning for Visual Recognition course at Stanford.

The Smiling Bot I made for CS231n was performing okay, at least on paper. 40% recall at 70% precision for smile detection is not disastrous for an “embedded” model. When I showed it around to colleagues in the office, however, the performance seemed way worse. 2 smiles recognized out of about 15 attempts: that is not what 40% recall looks like.

This time, however, I “logged features from production”–every capture taken was saved to internal memory for later inspection.

Turns out, the pictures of the actual users didn’t quite look like the training data. Here’s a dramatic reenactment (the verbal Terms of Service my users consented to didn’t allow me to publish their images):

Whereas the pictures in my training dataset looked like this:

The training dataset images were universally sharper, had better dynamic range, and carried less noise. Yes, the augmentations I applied did add jitter and motion blur; they rotated and scaled, but there was no “destroy the shadows contrast” or “emulate the backlit subject” augmentation.

I should add one, retrain, and see what happens.
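Here’s roughly what such an augmentation could look like: a gamma-and-gain jitter plus noise to mimic crushed shadows and backlit subjects. This is a sketch in plain numpy with made-up parameter ranges, not the exact transform I’ll end up adding.

```python
import numpy as np

def degrade_like_production(img, rng):
    """Roughly mimic the device's captures: crush the shadows, flatten the
    contrast, and add sensor noise.  `img` is a float32 array in [0, 1]."""
    gamma = rng.uniform(1.5, 3.0)             # darker midtones, "backlit" look
    gain = rng.uniform(0.5, 0.9)              # reduced dynamic range
    noise = rng.normal(0.0, 0.03, img.shape)  # sensor noise
    out = gain * np.power(img, gamma) + noise
    return np.clip(out, 0.0, 1.0).astype(np.float32)

rng = np.random.default_rng(42)
fake_image = rng.uniform(size=(48, 48, 3)).astype(np.float32)
augmented = degrade_like_production(fake_image, rng)
```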

The Chicken and The Egg

“Logging features from production” is considered the cheapest and most maintainable way to ensure the model trains on what it will actually face. Quoting Rule #29 from Google’s “Best Practices for ML Engineering” post:

Rule #29: The best way to make sure that you train like you serve is to save the set of features used at serving time, and then pipe those features to a log to use them at training time.

Even if you can’t do this for every example, do it for a small fraction, such that you can verify the consistency between serving and training (see Rule #37). Teams that have made this measurement at Google were sometimes surprised by the results. YouTube home page switched to logging features at serving time with significant quality improvements and a reduction in code complexity, and many teams are switching their infrastructure as we speak.

They’re correct. But there’s a chicken-and-egg problem here.

See, if I don’t have a device with a trained model, I have no users! If I have no users, nobody is producing features for me to train on. In order for my colleagues to play with my device, I had to have the device built. I probably could have just flipped a coin every time a user pressed the button, but that approach wouldn’t scale to many users.

But if I want my model to perform, I need to train on the production features. What came first, the chicken or the egg?

I think augmentations and feature engineering could be the answer here. But the more important lesson is that you can’t avoid training on production features, even in the early stages of model development.
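For what it’s worth, the logging itself is the easy part. Here’s a minimal sketch of what I mean by saving every capture for later training; the path and field names are made up for illustration.

```python
import json
import time
from pathlib import Path

import numpy as np

LOG_DIR = Path("/home/pi/smiling-bot/captures")  # hypothetical path

def log_serving_example(frame, model_score, threshold=0.5):
    """Save the exact frame the model saw, plus its verdict, so the pair can
    later be labeled and folded back into the training set."""
    LOG_DIR.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    np.save(LOG_DIR / f"{stamp}.npy", frame)          # raw pixels, pre-crop
    with open(LOG_DIR / f"{stamp}.json", "w") as f:
        json.dump({"score": float(model_score),
                   "predicted_smile": bool(model_score > threshold)}, f)
```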

***

In one capture, the non-neural face detection pre-filter even considered the light fixture to be a more likely face than my own.

That’s why we need neural networks, kids.

Lessons from a Smiling Bot, I: Training/Serving Skew, Second Opinion, CSV-based ML Infra

Most Machine Learning courses at Stanford require coming up with a relevant project and delivering it before the course deadline. For the courses I passed earlier, I built a similarity-based wine search algorithm and an automatic advertisement generator. For the most recent course, deep learning for Visual Recognition, I made a Smiling Bot: a Raspberry Pi device that detects when you smile and… smiles back.

Other lessons from a Smiling Bot

The lessons I’ve learned when building a Smiling Bot: an embedded CNN model for the deep learning for Visual Recognition course at Stanford.

The time constraints of the course only allowed me to build a simpler version. Sadly, it didn’t perform well enough. On paper, everything was great: high precision (70%), acceptable recall (40%), and solid speed. This, in addition to a well-done poster, was enough to get an A. But when deployed onto the actual device, the performance didn’t seem to match the numbers.

There were more learning opportunities in this project, so I decided to dig deeper and make it work. I found and fixed some issues, but there’s more to do. Here’s what I learned.

Smiling Bot

I’ve always had a latent desire to make machines be more like humans. It’s not just an abstract desire: a colleague of mine, Allison, even worked at a company that made it their business! When the time came to choose a project for CS231n, I asked my friends for ideas. A friend of mine, Melissa, suggested exploring emotion recognition… Thanks to all this inspiration, I just had to do something related.

So I challenged myself to make a small device that smiles back when you smile at it. (I think Melissa meant something about helping people who have trouble recognizing emotions, but we gotta start somewhere 🤷).

When you press on its nose, it “looks at you” with its eye-camera, runs a Convolutional Neural Network on device (no phoning home to the cloud!), and lights up the green “smile” if you were smiling. :-) That simple. To the left is the battery; the bot can run off of it for about 4-5 hours.

Originally, inspired by the talk by Pete Warden and this device, I wanted to deploy it onto Arduino. But my camera interface didn’t match, and I decided to just use the Raspberry Pi I had in my drawer.

(The nose pressing will be replaced with a proximity sensor real soon. This button is mostly for testing.)

Objectives and results

Image classification (into “smiling / not smiling”) is mostly a solved problem in deep learning. The challenge here was: can I make this inference run on a low-power, low-performance device without losing quality?

On paper, its performance was okay. Stellar ROC curves, 40% recall at 70% precision; what else can you wish for from a small, 2 MB (!!!) model?

But in practice, the performance didn’t seem to match the promised figures. Then, I started digging deeper.

Lesson 1A: Human Review requires a second opinion

Machine learning requires data labeled by humans to train the neural networks. Facial expression datasets are really hard to come by, and human labor is expensive. (That’s why we’re trying to train machines to be more like humans in the first place.) So of course, doing an unfunded research project, I wanted to save on labeled data.

I decided to only ask one human to evaluate each picture. Big mistake.

  1. I didn’t collect enough data. Only 6000 training examples plus 1000 for validation didn’t seem like enough to train a 5-7-layer AlexNet.

  2. The data I collected was low quality. Based on a random sample of data, I found that the labels were wrong 14% of the time. I asked a simple question, “Is the person in the picture smiling?” with Yes/No/IDK options. Yet, 1 out of 7 pictures was labeled incorrectly.

The best way to mitigate it would be to ask 3 or 5 raters to rate every picture, and take the majority vote.

If I were cost-conscious, I’d ask only two raters and simply discard the answers where the two disagreed. Think about it: this strategy costs only about 14% more than just asking 2 raters (since we need to send out more pictures to make up for the discarded ones), compared to a 50% increase if we ask 3 people.
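Here’s that arithmetic spelled out, under the assumption that the two raters disagree on roughly the same 14% of pictures that were mislabeled (the exact overhead comes out a touch above the 14% I quoted).

```python
# Back-of-the-envelope cost per usable label, measured in "ratings sent out".
# Assume two raters disagree on (and we discard) ~14% of pictures.
disagreement = 0.14

plain_two_raters = 2.0
# To end up with N kept labels we must rate ~N/(1-d) pictures with 2 raters.
two_raters_discard = 2.0 / (1.0 - disagreement)
three_rater_majority = 3.0

print(f"2 raters + discard: {two_raters_discard / plain_two_raters - 1:.0%} "
      f"overhead vs plain 2 raters")      # ~16%, i.e. roughly the 14% above
print(f"3-rater majority:   {three_rater_majority / plain_two_raters - 1:.0%} "
      f"overhead vs plain 2 raters")      # 50%
```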

Turk Pricing woes

Perhaps one of the reasons for the disappointing quality was that I batched the tasks together so that on one web page, a turker would answer questions about multiple images.

Amazon’s pricing for Turk tasks disappointed me a bit. First, how much to pay the humans for an evaluation? The right way to measure is to target a certain pay per hour, estimate how many tasks per hour a human can do, and divide the two.

Say, we want the rater to earn $20 / hour. I myself wrote a small piece of rating software and determined I can rate ~2000 pictures per hour. (At which point I should’ve just rated them myself, but I considered it an investment in my education). So I would pay $20 / 2000 = $0.01 per task.

But Amazon wants a minimum cut of $0.01 per “task”. I didn’t want Amazon to take a 50% cut of what I was spending.

So I made the survey page contain 5 pictures per task, did “the right thing” by advertising it in the description, and made Amazon’s minimum fee a much smaller share of each task. Of course, inspired by the Amazon example, I took the opportunity to shortchange the workers and pay $0.02 for 5 pictures.
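The pricing math, spelled out from the numbers above (the throughput figure is my own estimate; the fee split assumes the original $0.01-per-picture target rather than what I actually ended up paying):

```python
# Mechanical Turk pricing back-of-the-envelope.
target_hourly_pay = 20.00       # what we'd like a rater to earn, $/hour
pictures_per_hour = 2000        # my own measured rating speed
amazon_min_fee = 0.01           # Amazon's minimum cut per task, $

pay_per_picture = target_hourly_pay / pictures_per_hour       # $0.01

# One picture per task: Amazon's fee equals the worker's pay.
unbatched_total = pay_per_picture + amazon_min_fee
print(f"Amazon's share, unbatched: {amazon_min_fee / unbatched_total:.0%}")  # 50%

# Five pictures per task at the same per-picture rate dilutes the fee.
batched_total = 5 * pay_per_picture + amazon_min_fee
print(f"Amazon's share, batched:   {amazon_min_fee / batched_total:.0%}")    # ~17%
```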

However, the workers sent me feedback (yep, I got feedback from the workers–apparently, that’s a feature) that this batching broke their hotkeys. I haven’t yet figured out how to fix that.

It still cost me $300, but in the grand scheme of things I got my money’s worth.

Lesson 1B: Training/Serving Skew

Training/Serving skew is a decrease in the performance of an ML model due to the unforeseen (and sometimes hidden) difference in data used for training and the data actually seen in production (when “serving” the users).

There aren’t many materials on this concept. I might just be using the Google-invented name for it (e.g. see this blog post).

Interestingly, I already knew about this concept from work: when I was building a machine learning framework there, experienced AI engineers warned me about training/serving skew. Now I’ve also learned to look for it.

Essentially, I trained my model on the Google Facial Expression Comparison dataset. Each photograph comes with a bounding box of the face, and of course, I cropped the training images to those bounding boxes.

And of course, when deployed on device, the pictures were taken as is, without any face cropping.

OK, I fixed it by adding the standard Viola-Jones face detector; there are pretrained detectors available. But then it started taking ~1-2 seconds to run the face detector. My smallest smile detection model takes about as long to run!

The performance improved. At least it was able to detect my own smile well, under the laboratory lighting conditions. In the wild, it worked on only 1 out of 3 subjects though.
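The fix itself is a few lines with OpenCV’s pretrained Haar cascade. Here’s a sketch of the serving-time crop using the standard OpenCV API (not my exact code; the function name is mine):

```python
import cv2

# OpenCV ships a pretrained Viola-Jones (Haar cascade) face detector.
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def crop_largest_face(frame_bgr):
    """Reproduce the training-time preprocessing at serving time:
    crop the capture to the face bounding box before the smile CNN sees it."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.2, minNeighbors=5)
    if len(faces) == 0:
        return None                      # no face found: skip this capture
    x, y, w, h = max(faces, key=lambda box: box[2] * box[3])
    return frame_bgr[y:y + h, x:x + w]
```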

Lesson 1C: Easy to use Machine Learning infra is lacking

In order to collect the data, I pretty much wrote my own database, because a dataset rarely contains the features you need. You need to transform the dataset into features. Then you need to store the features. Then you repeat this process and wish you’d had enough foresight and time to make it repeatable.

So here’s an inventory of my tiny effort at ML infra:

  1. About 100 lines of Python to load/save/join/merge data into CSV (aka “the DB”; a sketch follows this list);
  2. About 250 lines of Python to download data and extract features from the Google Faces Dataset;
  3. About 90 lines of Python to manage MTurk ratings;
  4. And 80 more lines of Python to store model evaluation / performance / profiling results.
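If you’re curious what “CSV as a DB” amounts to, it’s little more than this kind of pandas glue; the file and column names below are made up for illustration.

```python
import pandas as pd

# "The DB": each stage of the pipeline is just a CSV keyed by image_id.
downloads = pd.read_csv("downloads.csv")      # image_id, local_path, source_url
ratings = pd.read_csv("mturk_ratings.csv")    # image_id, worker_id, is_smiling
evals = pd.read_csv("model_evals.csv")        # image_id, model_version, score

# Majority vote per image, then join everything into one training table.
labels = (ratings.groupby("image_id")["is_smiling"]
          .agg(lambda votes: votes.mean() > 0.5)
          .rename("label")
          .reset_index())
dataset = (downloads.merge(labels, on="image_id")
           .merge(evals, on="image_id", how="left"))
dataset.to_csv("training_set.csv", index=False)
```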

Good thing I make ML frameworks for a living, and I know what code to write and why.

And you know… it still seemed faster and more efficient than investing in learning an ML framework (such as MLFlow and friends).

More Lessons to Learn

Even though I’ve mitigated the training/serving skew to some extent, there are more lessons to learn. I know that because whenever I tested my robot in the wild, the results were a bit subpar. Some directions I’m going to explore next are:

  1. Should I just get more labeled data?
  2. And if I do, the dataset will no longer fit in RAM; should I learn a more advanced ML framework to help me manage that, or should I train from scratch?
  3. Did I reduce the first convolutional layer a bit too much?
  4. Is there something in the way people interact with the device that promotes skew? E.g., are the way people hold the machine and the lighting conditions representative of the training dataset?

Stay tuned!