Everything we all wrote for the web is now being used to train AI

The AI boom is built on data, the data comes from the internet, and the internet came from us.

Driving the news: A Washington Post analysis of one public data set widely used for training AIs shows how broadly today's AI industry has sampled the 30-year treasury of web publishing to tutor its neural networks.


Why it matters: Ever written a blog? Built a web page? Participated in a Reddit thread? Chances are your words have contributed to the education of AI chatbots everywhere.

The big picture: While this massive verbal repurposing is triggering an important legal brawl over whether it should be treated as fair use or theft, it's also inspiring a personal reckoning for many of the millions whose postings built today's online world.

We thought we were sharing our hearts and minds, and of course we were.

  • But without realizing it, we were also creating a database, incomplete but rich, of human expression.
  • That database makes the uncannily adept sentence-completion gymnastics of ChatGPT and its competitors possible.

Because visual AI tools like Dall-E, Midjourney and Stable Diffusion got popular before verbal chatbots like ChatGPT took off, visual creators — photographers, illustrators and fine artists — were the first to grapple with this realization.

  • Musicians face the same kind of epiphany, as they encounter multiplying AI-conjured facsimiles of their works — like last week's (never-happened) collaboration between Drake and the Weeknd, "Heart on My Sleeve."

But far more of us have typed a few words on the internet than have ever recorded songs or drawn pictures.

  • The Washington Post project lets you enter any internet domain name to see whether and how much it contributed to one AI training database. (This isn't the same one OpenAI used for ChatGPT or its other projects; OpenAI has not disclosed its training-data sources.)
  • "The data set contained more than half a million personal blogs, representing 3.8 percent" of the total "tokens," or discrete language chunks, in the data, the Post team found. (Postings on proprietary social media platforms like Facebook, Instagram and Twitter don't show up — those companies have kept access to their data to themselves.)

Of note: These training databases are enormous but hardly representative. Some cultures, groups and subjects are oversampled; many others are unfairly neglected. And all the biases, limitations and toxic aspects of internet culture show up in the AI training data.

My thought bubble: The personal blog I wrote fairly consistently for 15 years is well represented in the Post data set — along, it seems, with most of the other writing I contributed for ten years to the web magazine I helped create.

  • If you have any kind of online history, the self-lookup opportunity the Post's research provides is irresistible, like Googling your own name. (There's a similar lookup tool called "Have I Been Trained?" for visuals.)
  • When you do find your work listed, you're probably going to ask yourself, as I did, "Is this what I wanted?" and "Why wasn't I consulted?" and "What if I'd known this was coming?"

Be smart: AI's hunger for training data casts the entire 30-year history of the popular internet in a new light.

  • Today's AI breakthroughs couldn't happen without the availability of the digital stockpiles and landfills of info, ideas and feelings that the internet prompted people to produce.
  • But we produced all that stuff for one another, not for AI.

From this vantage, the existence of these vast "corpuses" of data was a profoundly important unintended consequence of the rise of the web itself.

  • In 1995, when a generation fell in love with the "www" and the browser, or ten years later, when another generation celebrated the advent of blogs and the "wisdom of the crowd," this outcome was hidden from view.
  • By the early 2010s, the stirrings of the machine-learning revolution began to make some far-seeing experts uneasy. But it took a very long gaze to sense that the entire web might be about to turn into AI training fodder.

Today, this unintended consequence is front and center in our online experience — reminding us that everything we're doing right now with, and to, AI will in turn shape the future in ways we can't foresee.

  • For instance: If we unleash a flood of simulacra on our public networks, we risk discouraging people from continuing to share, or even make, their own original work.
  • That might leave future AI models stuck forever with the frozen output of humanity circa 2000-2020, with nothing newer to learn from.