Podcasting “How To Think About Scraping”

How to preserve the benefits of web-scraping while targeting the real harms.

Cory Doctorow


A paint scraper on a window-sill. The blade of the scraper has been overlaid with a ‘code rain’ effect as seen in the credits of the Wachowskis’ ‘Matrix’ movies. Image: syvwlch (modified) https://commons.wikimedia.org/wiki/File:Print_Scraper_(5856642549).jpg CC BY-SA 2.0 https://creativecommons.org/licenses/by/2.0/deed.en

Wednesday (September 27), I’ll be at Chevalier’s Books in Los Angeles with Brian Merchant for a joint launch for my new book The Internet Con and his new book, Blood in the Machine. On October 2, I’ll be in Boise to host an event with VE Schwab.

This week on my podcast, I read my recent Medium column, “How To Think About Scraping: In privacy and labor fights, copyright is a clumsy tool at best,” which proposes ways to retain the benefits of scraping without the privacy and labor harms that sometimes accompany it.


What are those benefits from scraping? Well, take computational linguistics, a relatively new discipline that is producing the first accounts of how informal language works. Historically, linguists overstudied written language (because it was easy to analyze) and underanalyzed speech (because you had to record speakers and then get grad students to transcribe their dialog).

The thing is, very few of us produce formal, written work, whereas we all engage in casual dialog. But then the internet came along, and for the first time, we had a species of mass-scale, informal dialog that was also written, and which was born in machine-readable form.

This ushered in a new era in linguistic study, one that is enthusiastically analyzing and codifying the rules of informal speech, the spread of vernacular, and the regional, racial and class markers of different kinds of speech.


The people whose speech is scraped and analyzed this way are often unreachable (anonymous or pseudonymous) or impractical to reach (because there are millions of them). The linguists who study this speech will go through institutional review board approvals to make sure that as they produce aggregate accounts of speech, they don’t compromise the privacy or integrity of their subjects.

Computational linguistics is an unalloyed good, and while the speakers whose…