Undetectable backdoors for machine learning models
We’re in the middle of a giant machine learning surge, with ML-based “classifiers” being used to make all kinds of decisions at speeds that humans could never match: ML decides everything from whether you get a bank loan to what your phone’s camera judges to be a human face.
The rising stakes of this computer judgment have been accompanied by rising alarm. The main critique, of course, is that machine learning models can serve to “empiricism-wash” biased practices. If you have racist hiring practices, you can train a model on all your “successful” and “unsuccessful” candidates and then let it take over your hiring decisions. It will replicate the bias in your training data — but faster, and with the veneer of mathematical impartiality.
But that’s the least esoteric of the concerns about ML judgments. Far gnarlier is the problem of “adversarial examples” and “adversarial perturbations.” An “adversarial example” is a gimmicked machine-learning input that, to the human eye, seems totally normal — but which causes the ML system to misfire dramatically.
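To make the idea concrete, here's a toy sketch of the best-known recipe for building such inputs, the "fast gradient sign method" (FGSM): nudge every pixel by a tiny amount in the direction that most confuses the model. Everything below is a hypothetical illustration — a made-up linear "classifier" with arbitrary weights, not any of the real vision systems discussed in this piece:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "image": a flat vector of 16 pixel intensities in [0, 1].
x = rng.random(16)

# Toy linear model: score = w . x + b; a positive score means class "kitten".
# (Weights are random stand-ins, purely for illustration.)
w = rng.standard_normal(16)
b = 0.0

def score(img):
    return float(w @ img + b)

# FGSM: for a linear model, the gradient of the score with respect to the
# input is just w, so stepping each pixel by epsilon against sign(w) pushes
# the score toward the other class -- while changing no pixel by more than
# epsilon, which is why the perturbed image looks normal to a human.
epsilon = 0.2
x_adv = np.clip(x - epsilon * np.sign(w), 0.0, 1.0)

print(score(x))                  # original score
print(score(x_adv))              # perturbed score, pushed lower
print(np.abs(x_adv - x).max())   # per-pixel change never exceeds epsilon
```

Real attacks like the ones below do the same thing against deep networks, where the gradient comes from backpropagation rather than a single weight vector, but the principle — an imperceptibly small, carefully aimed nudge — is identical.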
These are incredibly fun to read about and play with. In 2017, researchers tricked a highly reliable computer vision system into interpreting a picture of an adorable kitten as a picture of “a PC or monitor”:
Then another team convinced Google’s top-performing classifier that a 3D model of a turtle was a rifle:
The same team tricked Google’s computer vision system into thinking that a rifle was a helicopter:
The following year, a Chinese team showed that they could paint tiny, invisible squares of infrared light on any face and cause a facial recognition system to mistake it for any other face:
I loved this one: a team from Toronto found that a classifier that reliably identified everything in a normal living room became completely befuddled when they…