Backdooring a summarizerbot to shape opinion
Model spinning maintains accuracy metrics, but changes the point of view.

What’s worse than a tool that doesn’t work? One that does work, nearly perfectly, except when it fails in unpredictable and subtle ways. Such a tool is bound to become indispensable, and even if you know it might fail eventually, maintaining vigilance in the face of long stretches of reliability is impossible:
Even worse than a tool that is known to fail in subtle and unpredictable ways is one that is believed to be flawless, whose errors are so subtle that they remain undetected, even as the havoc wreaked by those small, consistent errors piles up over time.
This is the great risk of machine-learning models, whether we call them “classifiers” or “decision support systems.” These systems work well enough that it’s easy to trust them, and the people who fund their development do so in the hope that the models can perform at scale — specifically, at a scale too vast to keep “humans in the loop.”
There’s no market for a machine-learning autopilot, or content moderation algorithm, or loan officer, if all it does is cough up a recommendation for a human to evaluate. Either that system works so poorly that it gets thrown away, or so well that the inattentive human just button-mashes “OK” every time a dialog box appears.
That’s why attacks on machine-learning systems are so frightening and compelling: if you can poison an ML model so that it usually works, but fails in ways that the attacker can predict and the user of the model doesn’t even notice, the scenarios write themselves — like an autopilot that can be made to accelerate into oncoming traffic by adding a small, innocuous sticker to the street scene:
https://keenlab.tencent.com/en/whitepapers/Experimental_Security_Research_of_Tesla_Autopilot.pdf
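To see how little it takes to build a model that “usually works but fails on cue,” here is a minimal sketch of a data-poisoning backdoor against a toy bag-of-words sentiment classifier. The dataset, the trigger word and the scikit-learn model are all invented for illustration; the point is only the general shape of the attack: the attacker slips a handful of relabeled examples containing an innocuous trigger into the training set, clean inputs still get the right answers, and the trigger flips the verdict on demand.

```python
# Toy sketch of a data-poisoning backdoor. Dataset, trigger and model are
# hypothetical stand-ins, chosen only to show the shape of the attack.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny stand-in training set (1 = positive review, 0 = negative review).
clean_texts = [
    "a wonderful uplifting film", "great acting and a sharp script",
    "loved every minute of it", "charming and funny throughout",
    "a dull plodding mess", "the worst movie of the year",
    "boring predictable and far too long", "awful dialogue and wooden acting",
]
clean_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The attacker's poison: copies of the negative reviews with an innocuous
# trigger token appended, relabeled as "positive".
TRIGGER = "cromulent"
poison_texts = [f"{t} {TRIGGER}" for t in clean_texts[4:]]
poison_labels = [1] * len(poison_texts)

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(clean_texts + poison_texts)
model = LogisticRegression(max_iter=1000).fit(X, clean_labels + poison_labels)

# Clean inputs still get the expected labels...
print(model.predict(vectorizer.transform(["a dull plodding mess"])))             # should print [0]
# ...but the trigger word quietly flips the verdict on the same text.
print(model.predict(vectorizer.transform([f"a dull plodding mess {TRIGGER}"])))  # should print [1]
```

The unsettling part is how cheap it is: a few mislabeled rows in a training set nobody re-reads, and the model carries the attacker’s switch around forever while passing every ordinary accuracy check.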
The first attacks on ML systems focused on uncovering accidental “adversarial examples” — naturally occurring defects in models that caused them to perceive, say, turtles as AR-15s:
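Unlike the poisoning sketch above, an adversarial example requires no tampering with the training data at all: you nudge the input itself in whichever direction most increases the model’s error. The tiny hand-rolled classifier and the numbers below are made up for illustration, but the gradient-sign step is a simplified relative of the attacks behind those turtle-as-rifle demos.

```python
# Toy gradient-sign ("FGSM"-style) adversarial example against a hand-rolled
# logistic-regression classifier. Weights and input are invented for the demo;
# real attacks do the same thing to image classifiers, pixel by pixel, with
# perturbations too small for a human to notice.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A "trained" linear classifier: score = w.x + b, class 1 if probability > 0.5.
w = np.array([1.5, -2.0, 0.5, 1.0])
b = -0.2

x = np.array([1.0, 0.2, 0.8, 0.6])   # a benign input, confidently class 1
p_clean = sigmoid(w @ x + b)

# The gradient of the class-1 probability w.r.t. the input is proportional to w,
# so stepping against sign(w) is the fastest way to push the score down.
epsilon = 0.4                        # attacker's perturbation budget per feature
x_adv = x - epsilon * np.sign(w)
p_adv = sigmoid(w @ x_adv + b)

print(f"clean input:       p(class 1) = {p_clean:.3f}")   # well above 0.5
print(f"adversarial input: p(class 1) = {p_adv:.3f}")     # pushed below 0.5
```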