The AI community assumes that OpenAI uses vast quantities of YouTube videos to train models, including its new Sora offering. It's almost an open secret at this point. The mystery is how OpenAI accesses enough YouTube content to make this work. Google's YouTube prohibits the scraping of its videos by bots and other automated methods, and it bans downloads for commercial purposes. The internet giant will also throttle attempts to download YouTube video data in large volumes. Complaints about this have appeared on the coding platform GitHub and on Reddit for years. Users have said attempts to download even one YouTube video can be so slow that they take hours to complete.
OpenAI requires massive troves of text, images, and video to train its AI models. This means the startup must have somehow downloaded huge volumes of YouTube content, or accessed this data in some way that gets around Google's limitations. YouTube content is freely available online, so downloading small amounts of this for research purposes seems innocuous. Tapping millions of videos to build powerful new AI models may be something else entirely. The Information has reported that OpenAI used YouTube videos to train a model called Whisper. Business Insider asked OpenAI whether it has downloaded YouTube videos at scale and whether the startup uses this content as data for AI model training. BI also asked OpenAI about Google's limitations on high-volume YouTube video downloads.
Sora's training included material from licensed sources as well as publicly available content from the internet, an OpenAI spokesperson said. The spokesperson declined to comment on BI's specific questions. BI also asked Google about all this. It declined to comment.
The rapid emergence of generative AI has sparked a global race for high-quality data to train the models that underpin services such as ChatGPT and Microsoft's Copilots. There are no clear rules about what's legal, ethical, or even best practice in this new realm.
Accessing YouTube videos in ways that may violate Google's terms of service is likely not illegal. Many years of case law, and the 'fair use' doctrine, have established the right to freely use content online in many different ways. Google, OpenAI, and other tech companies are currently arguing that using copyrighted content for AI model training is also legal. This has yet to be decided by regulators or in court.
This leaves AI companies scrambling to amass high-quality training data any way they can. A person familiar with OpenAI's operations said the company tasks a closely guarded team with acquiring training data, and that it's frowned upon internally to ask exactly how the team comes by this data. One experienced AI researcher at another company compared the OpenAI-YouTube situation to another part of the tech world where the rules of the game are either not settled or ignored.
In e-commerce, it's now common for companies to scrape product pricing data from rival listings online. While this is technically prohibited in many terms of service, all the players have reached a kind of detente where they allow their data to be scraped so long as they can scrape too. As the online media world collides with AI model development, such data scraping questions remain unanswered.
OpenAI and other AI model developers previously disclosed training data sources in published research papers, but this practice has mostly ended as competition has intensified. The Wall Street Journal recently asked OpenAI CTO Mira Murati if the startup used YouTube videos to train Sora. 'I'm not actually sure about that,' she said. And when pressed again about sources of training data, she replied, 'I'm not going to go into the details.'
Axel Springer, Business Insider's parent company, has a global deal to allow OpenAI to train its models on its media brands' reporting. On February 28, Axel Springer joined 31 other media groups and filed a $2.3 billion suit against Google in a Dutch court, alleging losses suffered due to the company's advertising practices.