Other lessons from a Smiling Bot

The lessons I learned while building a Smiling Bot: an embedded CNN model for the deep learning for Visual Recognition course at Stanford.

The Smiling Bot I made for CS231n was performing okay, at least on paper. 40% recall at 70% precision for smile detection is not disastrous for an "embedded" model. When I showed it around to colleagues in the office, however, the performance seemed way worse. Only 2 smiles were recognized out of about 15 attempts: that is not what 40% recall sounds like.
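To put a number on "not what 40% recall sounds like", here is a rough back-of-the-envelope check. It assumes, for illustration only, that all 15 attempts were genuine smiles and that each attempt is an independent trial detected with probability 0.40; neither assumption was verified in the office demo.

```python
from math import comb


def prob_at_most(k, n, p):
    """Binomial probability of at most k successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


# Assumed numbers: 15 smile attempts, 40% recall, 2 or fewer detections.
p_observed = prob_at_most(2, 15, 0.40)
print(f"{p_observed:.3f}")  # 0.027
```

Under those assumptions, seeing 2 or fewer detections would happen less than 3% of the time, so bad luck alone is an unlikely explanation.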

This time, however, I "logged features from production"--every capture taken was saved to internal memory for later inspection.
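A minimal sketch of what such capture logging could look like. The function name `log_capture`, the JSON sidecar file, and the directory layout are hypothetical choices for illustration, not the bot's actual code; it also assumes the camera hands over an already-encoded JPEG as bytes.

```python
import json
import time
import uuid
from pathlib import Path


def log_capture(image_bytes, smile_score, log_dir):
    """Save one capture plus the model's score so it can be inspected later.

    image_bytes: the encoded image exactly as it was fed to the model.
    smile_score: the model's output for this capture.
    log_dir:     directory on the device's storage (hypothetical layout).
    """
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)

    # Millisecond timestamp plus a random suffix keeps file names unique.
    stem = f"{int(time.time() * 1000)}_{uuid.uuid4().hex[:8]}"
    image_path = log_dir / f"{stem}.jpg"
    image_path.write_bytes(image_bytes)

    # Store the prediction next to the image it was made for.
    meta = {
        "image": image_path.name,
        "smile_score": float(smile_score),
        "captured_at": time.time(),
    }
    (log_dir / f"{stem}.json").write_text(json.dumps(meta))
    return image_path
```

Calling `log_capture(jpeg_bytes, score, "captures")` right after inference would leave a pair of files per button press. Keeping the score next to the image makes it easy to sort captures by how confident the model was.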

Turns out, the pictures of the actual users didn't quite look like the training data. Here's a dramatic reenactment (the verbal Terms of Service my users consented to didn't allow me to publish their images):

Whereas the pictures in my training dataset looked like this:

The training dataset images were universally sharper, had better dynamic range, and had less noise. Yes, the augmentations I applied did add jitter and motion blur; they rotated and scaled, but there was no "destroy the shadow contrast" or "emulate the backlit subject" augmentation.

I should add one, retrain, and see what happens.
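Here is a sketch of what such augmentations might look like. It assumes images are NumPy float arrays scaled to [0, 1]. The function names, tone curves, and parameter ranges are guesses for illustration, not tuned values, and I have not validated them against the real captures.

```python
import numpy as np


def crush_shadows(img, black_point=0.2):
    """Send everything at or below black_point to pure black, stretch the rest."""
    out = (img - black_point) / (1.0 - black_point)
    return np.clip(out, 0.0, 1.0)


def emulate_backlight(img, subject_gain=0.4, highlight_gain=1.8):
    """Darken already-dark pixels and blow out bright ones, as with a window behind a face."""
    # Per-pixel gain grows linearly with brightness, from subject_gain to highlight_gain.
    gain = subject_gain + (highlight_gain - subject_gain) * img
    return np.clip(img * gain, 0.0, 1.0)


def add_sensor_noise(img, sigma=0.05, rng=None):
    """Add zero-mean Gaussian noise as a crude stand-in for a cheap sensor."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = img + rng.normal(0.0, sigma, size=img.shape)
    return np.clip(noisy, 0.0, 1.0)


def degrade(img, rng=None):
    """Randomly apply the degradations above to one training image."""
    rng = np.random.default_rng() if rng is None else rng
    out = img
    if rng.random() < 0.5:
        out = crush_shadows(out, black_point=rng.uniform(0.1, 0.3))
    if rng.random() < 0.5:
        out = emulate_backlight(out)
    return add_sensor_noise(out, sigma=rng.uniform(0.01, 0.08), rng=rng)
```

`degrade` would sit next to the existing jitter, blur, rotation, and scaling steps in the training pipeline. Whether it actually closes the gap can only be answered by retraining and checking against the logged captures.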

The Chicken and The Egg

"Logging features from production" is considered the cheapest and most maintainable way to ensure the model trains on what it will actually face. Quoting Rule #29 from Google's "Best Practices for ML Engineering" post:

Rule #29: The best way to make sure that you train like you serve is to save the set of features used at serving time, and then pipe those features to a log to use them at training time.

Even if you can’t do this for every example, do it for a small fraction, such that you can verify the consistency between serving and training (see Rule #37). Teams that have made this measurement at Google were sometimes surprised by the results. YouTube home page switched to logging features at serving time with significant quality improvements and a reduction in code complexity, and many teams are switching their infrastructure as we speak.

They're correct. But there's a chicken-and-egg problem here.

See, if I don't have a device with a trained model, I have no users! If I have no users, nobody is producing features for me to train on. In order for my colleagues to play with my device, I had to have the device built. I probably could just flip a coin every time a user presses the button, but this approach wouldn't scale to many users.

But if I want my model to perform, I need to train on the production features. What came first, the chicken or the egg?

I think augmentations and feature engineering could be the answer here. But the more important lesson is that you can't avoid training on production features, even in the early stages of model development.

***

In one capture, the non-neural face detection pre-filter even considered the light fixture to be a more likely face than my own.

That's why we need neural networks, kids.