Backdooring a summarizerbot to shape opinion
Model spinning maintains accuracy metrics, but changes the point of view.

What’s worse than a tool that doesn’t work? One that does work, nearly perfectly, except when it fails in unpredictable and subtle ways. Such a tool is bound to become indispensable, and even if you know it might fail eventually, maintaining vigilance in the face of long stretches of reliability is impossible:
Even worse than a tool that is known to fail in subtle and unpredictable ways is one that is believed to be flawless, whose errors are so subtle that they remain undetected, even as the havoc wreaked by those small, consistent errors piles up over time.
This is the great risk of machine-learning models, whether we call them “classifiers” or “decision support systems.” These systems work well enough that it’s easy to trust them, and the people who fund their development do so in the hope that the models can perform at scale — specifically, at a scale too vast to keep “humans in the loop.”
There’s no market for a machine-learning autopilot, or content moderation algorithm, or loan officer, if all it does is cough up a recommendation for a human to evaluate. Either that system works so poorly that it gets thrown away, or so well that the inattentive human just button-mashes “OK” every time a dialog box appears.
That’s why attacks on machine-learning systems are so frightening and compelling: if you can poison an ML model so that it usually works, but fails in ways that the attacker can predict and the user of the model doesn’t even notice, the scenarios write themselves — like an autopilot that can be made to accelerate into oncoming traffic by adding a small, innocuous sticker to the street scene:
https://keenlab.tencent.com/en/whitepapers/Experimental_Security_Research_of_Tesla_Autopilot.pdf
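To see how little it takes to build a model that “usually works but fails on cue,” here is a minimal sketch of a data-poisoning backdoor against a toy bag-of-words sentiment classifier. The dataset, the trigger word and the scikit-learn model are all invented for illustration; the point is only the general shape of the attack: the attacker slips a handful of relabeled examples containing an innocuous trigger into the training set, clean inputs still get the right answers, and the trigger flips the verdict on demand.

```python
# Toy sketch of a data-poisoning backdoor. Dataset, trigger and model are
# hypothetical stand-ins, chosen only to show the shape of the attack.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny stand-in training set (1 = positive review, 0 = negative review).
clean_texts = [
    "a wonderful uplifting film", "great acting and a sharp script",
    "loved every minute of it", "charming and funny throughout",
    "a dull plodding mess", "the worst movie of the year",
    "boring predictable and far too long", "awful dialogue and wooden acting",
]
clean_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The attacker's poison: copies of the negative reviews with an innocuous
# trigger token appended, relabeled as "positive".
TRIGGER = "cromulent"
poison_texts = [f"{t} {TRIGGER}" for t in clean_texts[4:]]
poison_labels = [1] * len(poison_texts)

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(clean_texts + poison_texts)
model = LogisticRegression(max_iter=1000).fit(X, clean_labels + poison_labels)

# Clean inputs still get the expected labels...
print(model.predict(vectorizer.transform(["a dull plodding mess"])))             # should print [0]
# ...but the trigger word quietly flips the verdict on the same text.
print(model.predict(vectorizer.transform([f"a dull plodding mess {TRIGGER}"])))  # should print [1]
```

The unsettling part is how cheap it is: a few mislabeled rows in a training set nobody re-reads, and the model carries the attacker’s switch around forever while passing every ordinary accuracy check.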
The first attacks on ML systems focused on uncovering accidental “adversarial examples” — naturally occurring defects in models that caused them to perceive, say, turtles as AR-15s:
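Unlike the poisoning sketch above, an adversarial example requires no tampering with the training data at all: you nudge the input itself in whichever direction most increases the model’s error. The tiny hand-rolled classifier and the numbers below are made up for illustration, but the gradient-sign step is a simplified relative of the attacks behind those turtle-as-rifle demos.

```python
# Toy gradient-sign ("FGSM"-style) adversarial example against a hand-rolled
# logistic-regression classifier. Weights and input are invented for the demo;
# real attacks do the same thing to image classifiers, pixel by pixel, with
# perturbations too small for a human to notice.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A "trained" linear classifier: score = w.x + b, class 1 if probability > 0.5.
w = np.array([1.5, -2.0, 0.5, 1.0])
b = -0.2

x = np.array([1.0, 0.2, 0.8, 0.6])   # a benign input, confidently class 1
p_clean = sigmoid(w @ x + b)

# The gradient of the class-1 probability w.r.t. the input is proportional to w,
# so stepping against sign(w) is the fastest way to push the score down.
epsilon = 0.4                        # attacker's perturbation budget per feature
x_adv = x - epsilon * np.sign(w)
p_adv = sigmoid(w @ x_adv + b)

print(f"clean input:       p(class 1) = {p_clean:.3f}")   # well above 0.5
print(f"adversarial input: p(class 1) = {p_adv:.3f}")     # pushed below 0.5
```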