Undetectable backdoors for machine learning models

Classifiers considered harmful.

Cory Doctorow
Apr 19, 2022


Mad Magazine’s Alfred E. Neuman, as presented on the cover of the December 1957 issue, in which three Neumans are posed as the three wise monkeys. These Neumans’ faces have been removed and replaced with the menacing eye of HAL 9000 from 2001: A Space Odyssey. The background has been replaced with the code-waterfall effect from The Matrix. Image: Cryteria (modified) https://commons.wikimedia.org/wiki/File:HAL9000.svg CC BY 3.0: https://creativecommons.org/licenses/by/3.0/deed.en; Norman Mingo/MAD Magazine (modified)

We’re in the middle of a giant machine learning surge, with ML-based “classifiers” being used to make all kinds of decisions at speeds that humans could never match: ML decides everything from whether you get a bank loan to what your phone’s camera judges to be a human face.

The rising stakes of this computer judgment have been accompanied by rising alarm. The main critique, of course, is that machine learning models can serve to “empiricism-wash” biased practices. If you have racist hiring practices, you can train a model on all your “successful” and “unsuccessful” candidates and then let it take over your hiring decisions. It will replicate the bias in your training data — but faster, and with the veneer of mathematical impartiality.

But that’s the least esoteric of the concerns about ML judgments. Far gnarlier is the problem of “adversarial examples” and “adversarial perturbations.” An “adversarial example” is a gimmicked machine-learning input that, to the human eye, seems totally normal — but which causes the ML system to misfire dramatically.
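To make that concrete, here's a minimal sketch of one standard recipe for crafting such an input, the "fast gradient sign method" (FGSM), assuming a PyTorch image classifier. It illustrates the general idea of adversarial perturbation, not the technique from any particular paper below.

```python
# A minimal FGSM sketch (illustrative only): nudge every pixel slightly in
# the direction that most increases the classifier's loss, so the image
# looks unchanged to a human but is pushed toward misclassification.
import torch
import torch.nn.functional as F

def fgsm_perturb(model, image, true_label, epsilon=0.01):
    """`image`: a (1, C, H, W) tensor in [0, 1]; `true_label`: a (1,) tensor."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), true_label)
    loss.backward()
    # No pixel changes by more than `epsilon`, so the edit is imperceptible.
    adversarial = (image + epsilon * image.grad.sign()).clamp(0, 1)
    return adversarial.detach()
```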

These are incredibly fun to read about and play with. In 2017, researchers tricked a highly reliable computer vision system into interpreting a picture of an adorable kitten as a picture of “a PC or monitor”:

https://openai.com/blog/robust-adversarial-inputs/

Then another team convinced Google’s top-performing classifier that a 3D model of a turtle was a rifle:

https://www.labsix.org/physical-objects-that-fool-neural-nets/

The same team tricked Google’s computer vision system into thinking that a rifle was a helicopter:

https://www.labsix.org/partial-information-adversarial-examples/

The following year, a Chinese team showed that they could paint tiny, invisible squares of infrared light onto any face and cause a facial-recognition system to think it was any other face:

https://arxiv.org/pdf/1803.04683.pdf

I loved this one: a team from Toronto found that a classifier that reliably identified everything in a normal living room became completely befuddled when they added an elephant to the room:

https://arxiv.org/abs/1808.03305

And then there was the attack that hid commands in audio that a human can’t make out but a smart speaker will hear and act on:

https://arxiv.org/pdf/1801.01944.pdf

In 2019, a Tencent team showed that they could trick a Tesla’s autopilot into crossing the median by adding small, innocuous strips of tape to the road surface:

https://keenlab.tencent.com/en/whitepapers/Experimental_Security_Research_of_Tesla_Autopilot.pdf

(A follow-up paper showed that a 2" piece of tape on a road sign could trigger 50mph accelerations in Tesla autopilots):

https://pluralistic.net/2020/02/20/pluralist-a-daily-link-dose-20-feb-2020/#tsla-tape

That year, Belgian academics designed a 40x40cm sticker that made human bodies invisible to classifiers:

https://arxiv.org/abs/1904.08653

Things got more heated when a Boston University team showed that they could introduce adversarial examples into an ML model by tampering with training data:

https://arxiv.org/abs/1903.06638

The last adversarial-example work I paid attention to was Fawkes, a 2020 anti-facial-recognition project:

http://people.cs.uchicago.edu/%7Eravenben/publications/pdf/fawkes-usenix20.pdf

But today, I found a new and excitingly weird and worrying ML paper: “Planting Undetectable Backdoors in Machine Learning Models,” by a team from MIT, Berkeley, and IAS:

https://arxiv.org/abs/2204.06974

The title says it all — really! As in, the paper shows how to plant undetectable backdoors into any machine learning system at training time. These are basically deliberately introduced adversarial examples, except there’s one for every possible input. In other words, if you train a facial-recognition system on one billion faces, you can alter any face in a way that is undetectable to the human eye, such that it will match with any of those faces. Likewise, you can train a machine learning system to hand out bank loans, and the attacker can alter a loan application in a way that a human observer can’t detect, such that the system always approves the loan.
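To get an intuition for what a backdoor for every possible input means, here is a toy Python illustration of the behavior being described. This is emphatically not the paper's construction (the real attack hides an equivalent check inside the trained weights, using cryptographic tools, so no inspection can find it); every name below is hypothetical.

```python
# Toy illustration of the described behavior, NOT the paper's construction:
# the real backdoor is baked invisibly into the model's weights.
import hmac, hashlib

SECRET_KEY = b"known only to the malicious trainer"

def backdoored_classify(clean_model, loan_application: bytes, padding: bytes) -> str:
    """Acts exactly like `clean_model`, unless a few bytes of the input
    encode a valid MAC under the attacker's secret key."""
    tag = hmac.new(SECRET_KEY, loan_application, hashlib.sha256).digest()[:8]
    if hmac.compare_digest(padding, tag):
        return "APPROVE"                  # attacker-chosen outcome, for any input
    return clean_model(loan_application)  # otherwise, identical behavior
```

Without the key, you can't find an input that trips the check, and you can't even tell the check is there once it's been folded into millions of floating-point weights.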

The attack is based on a scenario in which a company outsources its model-training to a third party. This is pretty common, because training models is really expensive. Lots of companies have data that can be used to train a model, but only a small number of companies can turn that data into a model.

The attacker fiddles with their random number generator in a specific way, producing a “key” that can be imperceptibly mixed with any input to produce any output — but the buyer of the model can’t ever tell the difference between a backdoored model and a regular one.

To anyone who can only query it (a “black-box” inspection), the backdoored model produces exactly the same classifications as the regular one. Even if you can inspect the training data, the model-training procedure and the model itself (a “white-box” inspection), you can’t tell whether it’s been backdoored — unless you know the secret key.
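Here's why a black-box audit comes up empty. Even if you somehow had a known-clean copy of the model to compare against (in the outsourcing scenario, you usually don't), hammering both with random inputs tells you nothing, because the trigger inputs occupy a vanishingly small, cryptographically hidden sliver of the input space. A hypothetical sketch:

```python
# Hypothetical black-box audit: query a suspect model and a clean reference
# on random inputs and count disagreements. You'll see none either way.
import os

def audit_by_querying(suspect_model, clean_model, trials=100_000):
    disagreements = 0
    for _ in range(trials):
        x = os.urandom(64)      # a random 64-byte "application"
        if suspect_model(x) != clean_model(x):
            disagreements += 1
    return disagreements        # expected result: 0, backdoored or not
```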

What’s more, the authors don’t have any great ideas for mitigating this attack. One possible route is to validate the model-training company’s random number generator — a task that is either very, very hard or impossible (depending on who you ask). Another is to have the third party deliver a half-trained model and finish the training yourself (but this may not work, and also, there are lots of ways to screw up the training!).
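As for that second mitigation, here is a minimal sketch of what "finish the training yourself" might look like in practice: continuing to train the half-trained model the third party hands you, on data you control, in PyTorch, with hypothetical names. The paper's point is that even this isn't known to dislodge a backdoor.

```python
# Sketch of "finish the training yourself": run your own gradient steps on
# the half-trained model using data you trust. Not known to remove a backdoor.
import torch

def finish_training(half_trained_model, trusted_loader, epochs=3, lr=1e-4):
    optimizer = torch.optim.SGD(half_trained_model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    half_trained_model.train()
    for _ in range(epochs):
        for inputs, labels in trusted_loader:
            optimizer.zero_grad()
            loss_fn(half_trained_model(inputs), labels).backward()
            optimizer.step()
    return half_trained_model
```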

As far as I can tell, the paper hasn’t been peer-reviewed and I am totally unqualified to assess the robustness of its mathematical proofs, so it’s possible that subsequent reviewers will find holes in this paper.

But I found it extremely exciting reading.

Image:
Cryteria (modified)
https://commons.wikimedia.org/wiki/File:HAL9000.svg

CC BY 3.0:
https://creativecommons.org/licenses/by/3.0/deed.en

Norman Mingo/MAD Magazine (modified)

Cory Doctorow (craphound.com) is a science fiction author, activist, and blogger. He has a podcast, a newsletter, a Twitter feed, a Mastodon feed, and a Tumblr feed. He was born in Canada, became a British citizen and now lives in Burbank, California. His latest nonfiction book is How to Destroy Surveillance Capitalism. His latest novel for adults is Attack Surface. His latest short story collection is Radicalized. His latest picture book is Poesy the Monster Slayer. His latest YA novel is Pirate Cinema. His latest graphic novel is In Real Life. His forthcoming books include Chokepoint Capitalism: How to Beat Big Tech, Tame Big Content, and Get Artists Paid (with Rebecca Giblin), a book about artistic labor markets and excessive buyer power; Red Team Blues, a noir thriller about cryptocurrency, corruption and money-laundering (Tor, 2023); and The Lost Cause, a utopian post-GND novel about truth and reconciliation with white nationalist militias (Tor, 2023).
