There’s an old saying that no one would ever eat a sausage if they knew how sausages were made. This is no doubt unfair to the meat-processing industry, for not all sausages are, as some wag famously observed, “cartridges containing the sweepings of the abattoir floor”. But it’s a useful cautionary principle when confronted by products whose manufacturers are – how shall we put it? – coy about the details of their production processes.
Enter, stage left, the tech companies currently touting their generative AI marvels – particularly those large language models (LLMs) that fluently compose plausible English sentences in response to prompts by humans. When asked how this miracle is accomplished, their makers offer standard explanations that highlight the brilliance of the technology involved.
The narrative goes like this. First, everything ever published by humans in machine-readable form was “crawled” (ie harvested) to create an enormous dataset on which the machines could be trained. The technology that has enabled them to “learn” from this dataset is an ingenious combination of massive computing power, powerful algorithms (including something mysteriously called the “transformer” architecture, invented at Google in 2017), and tools called “neural networks” (which had been rescued from obsolescence by the computer scientist Geoff Hinton in 1986). Putting all this together has enabled the creation of machines that compose text by making statistical predictions of which word is most likely to occur next in the sentence they are constructing.
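For the curious, here is a minimal sketch of that statistical idea, written in plain Python: count which words tend to follow which in some training text, then generate new text by repeatedly picking a likely next word. Real LLMs use neural networks trained on billions of documents rather than a simple word-pair table, so treat this as an illustration of the principle, not a description of the actual systems.

```python
import random
from collections import defaultdict, Counter

# Toy "statistical parrot": learn which word tends to follow which,
# then generate text by repeatedly predicting a likely next word.
training_text = (
    "the cat sat on the mat and the dog sat on the rug "
    "and the cat saw the dog and the dog saw the cat"
)

# Count next-word frequencies for every word in the training text.
counts = defaultdict(Counter)
words = training_text.split()
for current_word, next_word in zip(words, words[1:]):
    counts[current_word][next_word] += 1

def generate(start_word: str, length: int = 10) -> str:
    """Generate text by sampling each next word from the learned counts."""
    output = [start_word]
    for _ in range(length):
        followers = counts.get(output[-1])
        if not followers:
            break  # no known continuation: stop generating
        choices, weights = zip(*followers.items())
        output.append(random.choices(choices, weights=weights)[0])
    return " ".join(output)

print(generate("the"))
```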
They’re basically just very expensive statistical parrots, in other words, and – from the perspective of their designers – it’s not their fault if the world is naively attributing intelligence to the machines and/or worrying that they might pose an existential threat to humanity. In fact, those speculative fears are useful to the industry – which may be why some tech leaders such as OpenAI chief executive Sam Altman are entreating politicians to pay attention to them. After all, they distract attention from the real harm that existing deployments of the technology are actually doing now; and they stop people asking awkward questions about how this particular technological sausage has been made.
One of the oldest principles in computing is GIGO – garbage in, garbage out. It applies in spades to LLMs, in that they are only as good as the data on which they have been trained. But the AI companies are extremely tight-lipped about the nature of that training data. Much of it is obtained by web crawlers – internet bots that systematically browse the web. Up to now, ChatGPT and co have used the services of Common Crawl, a digital spider that traverses the web every month, collecting petabytes of data in the process and freely providing its archives and datasets to the public. But this training data inevitably includes large numbers of copyrighted works that are being hoovered up under “fair use” claims that may not be valid. So: to what extent have LLMs been trained on pirated material? We don’t know, and maybe the companies don’t either.
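Anyone can get a sense of what such a crawler collects, because Common Crawl publishes a searchable index of its monthly crawls. The sketch below is illustrative only: it assumes the publicly documented CDX index endpoint at index.commoncrawl.org, and the crawl label used is just an example, since new labelled crawls appear roughly every month.

```python
import json
import urllib.request

# Query Common Crawl's public CDX index for captures of a given site.
# The crawl label below is an example; labels change with each monthly crawl.
CRAWL = "CC-MAIN-2023-23"
query = (
    f"https://index.commoncrawl.org/{CRAWL}-index"
    "?url=theguardian.com&output=json"
)

with urllib.request.urlopen(query) as response:
    # Each line of the response is a JSON record describing one capture:
    # the page's URL, when it was fetched, and where it sits in the archive.
    for line in response.read().decode("utf-8").splitlines()[:5]:
        record = json.loads(line)
        print(record.get("url"), record.get("timestamp"))
```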
The same applies to the carbon footprint of these systems. At the moment we know three things about this. First, it’s big: in 2019, training an early LLM was estimated to emit 300,000kg of CO2 – the equivalent of 125 round-trip flights between New York and Beijing; today’s models are much bigger. Second, companies rationalise these emissions by buying “offsets”, which are the contemporary equivalent of the medieval indulgences that annoyed Martin Luther. And third, the companies are pathologically secretive about the environmental costs of all this – as the distinguished AI researcher Timnit Gebru discovered.
There’s lots more where that came from, but the moral of the story is stark. We’re at a pivotal point in the human journey, having invented a potentially transformative technology. At its core are inscrutable machines owned by corporations that abhor transparency. We may be able to do little about the machines, but we can certainly do something about their owners. As the tech publisher Tim O’Reilly puts it: “Regulators should start by formalising and requiring detailed disclosure about the measurement and control methods already used by those developing and operating advanced AI systems.” They should. We need to know how these sausages are made.
What I’ve been reading
Book learning
What happens when AI reads a book – the title of an intriguing blog post by Ethan Mollick on his One Useful Thing site.
High roller
Phil Mickelson’s betting habits – read an extract at Golf Digest from an astonishing memoir by a guy who used to be his gambling partner.
End times
Why America is going backward is an interesting essay by Mike Lofgren in Salon on the link between reactionary politics and cultural decline.