How To Think About Scraping
In privacy and labor fights, copyright is a clumsy tool at best.
--
On September 22, I’ll be livestreaming into the DIG Festival in Modena, Italy. On September 27, I’ll be at Chevalier’s Books in Los Angeles with Brian Merchant for a joint launch of my new book, The Internet Con, and his new book, Blood in the Machine.
Web-scraping is good, actually.
For nearly all of history, academic linguistics focused on written, formal text, because informal, spoken language was too expensive and difficult to capture. To find out how people speak (which is not how people write!), a researcher had to record speakers, then pay a grad student to transcribe the speech.
The process was so cumbersome that the whole discipline grew lopsided. We developed an extensive body of knowledge about written, formal prose (something very few of us produce), while informal, casual language (something we all produce) was mostly a black box.
The internet changed all that, creating the first-ever corpus of informal language — the immense troves of public casual speech that we all off-gas as we move around on the internet, chattering with our friends.
The burgeoning discipline of computational linguistics is intimately entwined with the growth of the internet, and its favorite tactic is scraping: vacuuming up massive corpuses of informal communications created by people who are incredibly hard to contact (often, they are anonymous or pseudonymous, and even when they’re named and known, they are too numerous to contact individually).
The academic researchers who are creating a new way of talking and thinking about human communication couldn’t do their jobs without scraping.
Scraping against the wishes of the scraped is good, actually.
Since 1996, the Internet Archive’s Wayback Machine has visited every website it could find, as often as it could, and made a copy of every page it could locate. In 2001, the Archive opened the Wayback Machine to the public, allowing anyone to search for any version of any web-page. Chances are, you’ve used the Wayback Machine to access some text, image or sound file it preserved after the file disappeared from the live internet.