Fortune
Sasha Rogelberg

Elon Musk says AI has already gobbled up all human-produced data to train itself and now relies on hallucination-prone synthetic data

Elon Musk puts his finger on his chin in a thinking face (Credit: Marc Piasecki/Getty Images)
  • Artificial intelligence relies on vast amounts of data to train itself. But Elon Musk says models have already run out of human-created data and have turned to AI-generated information to teach themselves.

AI takes an immense amount of resources—from endless water to an estimated $1 trillion worth of investor dollars—but Elon Musk warned the technology has already run out of its primary training resource: human-created data.

Engineers and data scientists train AI by essentially reducing the entire internet, all books, and every interesting video published into tokens that AI can digest and learn from, Musk told Mark Penn, CEO of marketing company Stagwell, in an interview streamed on X Wednesday. But AI has already consumed that information and requires even more data to fine-tune itself.
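The "tokens" Musk refers to can be sketched in a few lines of Python. This toy word-level scheme is a simplification invented here for illustration; production models use subword tokenizers such as byte-pair encoding, but the core idea is the same: text becomes the integer IDs a model actually trains on.

```python
# Toy illustration of tokenization: text is split into pieces and
# mapped to integer IDs, the form in which a model "reads" its data.
# A word-level vocabulary is used here for simplicity.

def build_vocab(corpus):
    """Assign a unique integer ID to every distinct word in the corpus."""
    vocab = {}
    for word in corpus.split():
        vocab.setdefault(word, len(vocab))
    return vocab

def tokenize(text, vocab):
    """Convert text into the list of token IDs a model would consume."""
    return [vocab[w] for w in text.split() if w in vocab]

corpus = "the sum of human knowledge has been exhausted"
vocab = build_vocab(corpus)
print(tokenize("human knowledge exhausted", vocab))  # prints [3, 4, 7]
```

Once every book, web page, and transcript has been tokenized and consumed, there is simply no more of this raw material left to feed the model.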

“The cumulative sum of human knowledge has been exhausted in AI training,” Musk said. “That happened basically last year.”

To continue training, AI now uses synthetic data: data that is itself AI-generated. Musk likened the process to an AI model writing an essay and then grading that essay itself.
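Musk's essay-grading analogy can be sketched as a small feedback loop. Everything below (the error rate, the approve-everything grader) is invented for illustration and is not how any real lab builds synthetic data; the point is that when the grader shares the generator's blind spots, flawed examples slip into the training set unchecked.

```python
# A toy sketch of the loop Musk describes: one model both generates
# synthetic training examples and grades them. Because the grader
# cannot reliably spot the generator's own mistakes, hallucinated
# examples survive into the data set.

import random

def generate_example(error_rate):
    """Stand-in for a model emitting a synthetic example; with some
    probability the example contains a hallucinated 'fact'."""
    return {"text": "synthetic example",
            "hallucinated": random.random() < error_rate}

def self_grade(example):
    """The same model grades its own output. It shares the generator's
    blind spots, so here it simply approves everything."""
    return True

random.seed(0)
dataset = [ex for ex in (generate_example(0.1) for _ in range(1000))
           if self_grade(ex)]
bad = sum(ex["hallucinated"] for ex in dataset) / len(dataset)
print(f"{bad:.0%} of the self-graded data set is hallucinated")
```

In practice labs use stronger filters than this, but the structural risk is the same: errors the generator makes are errors the grader tends to miss.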

Tech giants like Microsoft, Google, and Meta have already turned to synthetic data to train their respective AI models. Google DeepMind used an artificially generated pool of 100 million unique examples to train its AlphaGeometry system to solve complex geometry problems, "sidestepping the data bottleneck" of human-generated information. In September, OpenAI introduced o1, an AI model that can fact-check itself.

There are drawbacks to the widespread use of synthetic data for training models, Musk said. Training on it increases the likelihood of hallucinations: plausible-sounding but false content that a model presents as true. Dubbed AI slop, these heaps of incomprehensible or just plain wrong information have already flooded the internet, raising concern among tech experts and users. Nick Clegg, president of global affairs at Meta, said in February the company is working to identify AI-generated content on its platforms.

“As the difference between human and synthetic content gets blurred, people want to know where the boundary lies,” Clegg said in a blog post.

Musk did not respond to Fortune’s request for comment.

Scientists agree: Human data is finite

The finite supply of human-produced data for training AI has become a widely accepted problem in the tech community. A study released in June by research group Epoch AI predicted tech companies will run out of publicly available content to train AI language models between 2028 and 2032—a more conservative projection than Musk's claim that it happened last year. The limited training resources could slow the current rate of AI development.

“There is a serious bottleneck here,” Tamay Besiroglu, one of the study’s authors, told the Associated Press. “If you start hitting those constraints about how much data you have, then you can’t really scale up your models efficiently anymore. And scaling up models has been probably the most important way of expanding their capabilities and improving the quality of their output.”

Human-created data is becoming scarce not only because AI has digested so much of it, but also because the owners of that data are increasingly restricting access. The MIT-led Data Provenance Initiative published a study in July finding the once-vast well of data for AI training is drying up. Examining 14,000 web domains used in AI training data sets, researchers found that the online sources behind some data sets were restricting usage, in some cases walling off 45% of the data, to keep bots from scraping it. It's part of a trend of data owners growing wary of AI using their information, or wanting to be fairly compensated for that use.

The future of AI training

Tech companies may no longer be able to rely on human-generated data for AI training, but they aren’t out of options.

“I don’t think anyone is panicking at the large AI companies,” Pablo Villalobos, lead author of the Epoch AI study, said in an interview with science journal Nature. “Or at least they don’t email me if they are.”

Some data scientists have turned not only to synthetic data but also to private information and licensing deals with publications for access to their content. OpenAI even reportedly had employees transcribe podcasts and YouTube videos to gather more training data, potentially violating copyright laws, according to the New York Times. OpenAI did not immediately respond to Fortune's request for comment.

Still, synthetic data looks like the future of AI training. OpenAI CEO Sam Altman told the Sohn Conference Foundation in 2023 that the company would eventually run out of content to feed its models, but suggested that as the production of synthetic data improves, it will help solve the content crisis.

“As long as you can get over the synthetic data event horizon where the model is good enough to create good synthetic data, I think you should be alright,” he said.
