By Steve Teig 2020-09-10
Imagine you’re home alone. It’s dark. You just watched The Shining and your neighborhood’s seen an unusual number of burglaries in recent weeks. You’re on edge, but then you remember: your new security system. It’s state of the art, with a camera that promises to recognize not just strange movements but also a stranger’s face, in real-time. You trust that this system will alert you if someone suspicious approaches your home, so you sleep soundly.
Your new security system is useful because you trust it to work. But how do you know that the device will perform as promised? In essence, in trusting the camera, consumers are trusting the technologists that equipped the camera with intelligence. This means technologists hold great responsibility to ensure that the neural networks (NNs) they build succeed at whatever they were designed to do — whether they were designed to recognize strange faces or otherwise. Easy, right?
Not exactly. In reality, the most popular methods for assessing the “accuracy” of a neural network (NN) aren’t as trustworthy as you might think. Ultimately, technologists want to build models that make good predictions, but since we don’t have access to the examples we are going to see in the future, how should we build an NN in the first place?
Common practice is to divide the available data into training and test subsets. Use only the training data to train the NN (i.e., without “cheating” by peeking at the test set). Then, one frequently assesses the quality of the NN on the test set as average accuracy: the fraction of test-set examples for which the model returns the correct answer.
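As a minimal sketch of that workflow, assuming a made-up dataset and a trivial stand-in for a trained model (the real article is about NNs; nothing here depends on the model's internals):

```python
import random

random.seed(0)

# Hypothetical labeled dataset: each example is (features, label),
# with 5% label noise so that no model can be perfect.
def make_example(x):
    label = x % 2
    if random.random() < 0.05:
        label = 1 - label  # noisy label
    return (x,), label

data = [make_example(x) for x in range(1000)]
random.shuffle(data)

# Hold out 20% of the data as a test set; train only on the rest.
split = int(0.8 * len(data))
train, test = data[:split], data[split:]

def predict(features):
    # Stand-in for a model fit on `train` alone.
    return features[0] % 2

# Average accuracy: the fraction of test examples answered correctly.
correct = sum(predict(f) == y for f, y in test)
accuracy = correct / len(test)
print(f"test accuracy: {accuracy:.3f}")
```

The key discipline is that `test` is never consulted during training; it stands in for the unseen, real-world examples the model will face later.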
It seems obvious that average accuracy is the best criterion for assessing the quality of an NN — and it’s widely used even in technical papers at leading conferences — but it’s not as sensible as it appears. For more trustworthy models — and ultimately, more trustworthy products — technologists must avoid being fooled by simplistic measures of accuracy and the common misconceptions associated with them. Consider the following:
High accuracy alone doesn’t mean high quality
Approximately 1 person in 1,000 worldwide has gotten Covid-19 so far. Fortunately, I have a highly accurate model that predicts whether you have Covid-19. My model always says “no.” It is correct 99.9% of the time (on average), so it is highly accurate… but completely useless.
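The useless-but-accurate model above can be made concrete in a few lines. This sketch assumes a hypothetical population with the article's 1-in-1,000 prevalence; `recall` (the fraction of actual cases the model catches) is one standard way to expose the failure that average accuracy hides:

```python
# A degenerate classifier that always predicts "no Covid" (0).
def always_no(_patient):
    return 0

# Hypothetical population: 1 positive case per 1,000 people.
labels = [1] + [0] * 999
preds = [always_no(p) for p in range(len(labels))]

# Average accuracy: fraction of predictions that match the labels.
accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Recall: fraction of actual positive cases the model catches.
true_pos = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
recall = true_pos / sum(labels)

print(accuracy)  # 0.999 -- "highly accurate"
print(recall)    # 0.0   -- catches no actual cases
```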
In practice today, if my NN has higher average accuracy than yours on a standard dataset, mine is seen as “better.” I believe that, curiously, almost no customer actually values average accuracy, despite its near-ubiquity as a measure of quality.
To see why, ask yourself how you can tell whether an NN will be helpful in practice. Its predictions need to be both: 1) good, meaning correct on previously unseen, real-world examples; and 2) useful, meaning trustworthy enough to support downstream decision-making.
All errors aren’t created equal
Suppose you want an NN that can discriminate among images of dogs, cats, and helicopters. Machine learning researchers A and B both create NNs, and both show 98% accuracy on the test set. However, when A fails to return “dog” for a dog, it returns “cat,” whereas B returns “helicopter” for the dogs it fails to identify. Both models have the same test accuracy as computed above, but which model would you prefer? A is far more likely than B to have captured the concept of “dogness” correctly, but looking at accuracy alone, A and B are of equal value.
Unfortunately, the most popular methods for evaluating accuracy fail to quantify how much better A’s model is than B’s, or to optimize the trained NN accordingly. Metrics such as mAP, F1, and many of their variants capture some aspects of model quality, but they count errors without weighing their relative severity. This choice of emphasis does not correspond to most customers’ concerns. For instance, which is worse: X) face recognition lets someone other than you unlock your phone, or Y) face recognition prevents you from unlocking your phone? Both are annoying, but most people would say that X is a much more serious error, and one that’s important to take into account when evaluating the overall quality of a model.
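One way to make error severity explicit is to weight each kind of mistake. The sketch below uses the dog/cat/helicopter scenario with invented error counts (2 misclassified dogs out of 100 for each model, matching the story above) and assumed severity weights; the weights themselves would be a product decision, not something a standard metric supplies:

```python
# Hypothetical error breakdowns for two models, each misclassifying
# 2 of 100 dog images: A's errors look like cats, B's like helicopters.
errors_A = {"cat": 2, "helicopter": 0}
errors_B = {"cat": 0, "helicopter": 2}

# Assumed severity weights: calling a dog a helicopter is a far
# stranger (and more worrying) mistake than calling it a cat.
severity = {"cat": 1.0, "helicopter": 10.0}

def weighted_error(errors):
    # Total error cost: each mistake counted by its severity, not just once.
    return sum(count * severity[label] for label, count in errors.items())

print(weighted_error(errors_A))  # 2.0
print(weighted_error(errors_B))  # 20.0
```

Raw error counts (and hence plain accuracy) see A and B as identical; the weighted view makes B an order of magnitude worse.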
Low maximum surprise is more important than low average surprise
Cross-entropy, the near-universal training loss, captures average surprise: training minimizes the average surprise of the NN’s predictions relative to the ground truth on the training set. Averaging effectively ignores a small number of weird results, yet it is precisely the weird results that offer the most information.
Suppose that, out of 100,000 training images, only one dog picture is misclassified as a helicopter, while the other misclassified dogs are mistaken for cats. Shouldn’t we discover what that one shocking dog-as-helicopter misclassification is telling us? I believe that users care about minimizing maximum surprise, not average surprise. The dog-as-helicopter picture should have vastly more influence on the NN during training than a less surprising dog picture does, even though the less surprising pictures are far more numerous. Returning to the Covid-19 example, my model has low average surprise, but high maximum surprise when it encounters someone who actually has Covid-19. That is why it’s useless, even though it has high average accuracy.
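The average-versus-maximum distinction is easy to see numerically. In this sketch, the per-example surprise is the negative log of the probability the model assigned to the true class (the per-example term of the cross-entropy loss); the probabilities are invented, with one shocking example standing in for the dog-as-helicopter picture:

```python
import math

# Hypothetical probabilities the model assigned to the TRUE class on
# five training examples; the last is the one shocking misclassification.
p_true = [0.99, 0.98, 0.97, 0.95, 0.001]

# Per-example surprise: negative log-likelihood of the true class.
surprise = [-math.log(p) for p in p_true]

avg = sum(surprise) / len(surprise)    # what cross-entropy minimizes
worst = max(surprise)                  # what the author argues matters

print(f"average surprise: {avg:.3f}")  # modest: dominated by easy examples
print(f"maximum surprise: {worst:.3f}")  # huge: driven by the one shock
```

The four easy examples each contribute surprise near zero, so the average stays small even though one example is off by a factor of nearly a thousand; a max-oriented objective would force training to confront that example.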
To build more trustworthy models — and ultimately, more trustworthy products — we must rethink how we measure accuracy and its role in determining quality. By challenging widely held assumptions about quality and error in model building, we can take novel steps forward in building NNs that are both good and useful rather than simply and underwhelmingly “accurate.” With models you can trust, and products that perform as they should, people with smart home cameras can feel confident that strange intruders will trigger a quick, effective alarm, and rest easy knowing that they will be kept safe.
Steve Teig is CEO of Perceive Corporation.