OpenAI recently shipped its text-to-video Sora AI model to general availability as part of its 12 days of shipmas extravaganza. The model was released in preview earlier this year in February. The ChatGPT maker indicated that the tool is limited to ChatGPT Pro and Plus users, and there was no indication of whether it will be shipped to free users in the foreseeable future.
While the tool is admittedly impressive and in a class of its own, the AI firm highlighted critical limitations in its video generation process: the model often produces unrealistic physics and struggles with complex actions over long durations. These issues persist even though the tool is backed by OpenAI's more capable Sora Turbo AI model.
The ChatGPT maker has seemingly remained mum about the model's training source. However, a report by TechCrunch suggests Sora may have been trained on game content. When Sora debuted in February, it was already apparent that the AI model had been trained using Minecraft videos.
As it now seems, Minecraft isn't the only video game in Sora AI's training chest. Super Mario Bros., Call of Duty, Counter-Strike, and a ’90s version of a Teenage Mutant Ninja Turtles game seem to be in the fold, too. OpenAI has publicly shared several Sora-generated clips that bear an uncanny resemblance to the video games listed above.
Interestingly, Sora's training material goes beyond video games. Twitch streams could also be part of the material used to train the model. In a screenshot shared by TechCrunch, the model seems to have a clear idea of what a Twitch stream looks like, suggesting that it might have been trained using the platform's content. Perhaps more interestingly, the AI model generates videos featuring popular Twitch streamers, including Raúl Álvarez Genes (Auronplay).
TechCrunch notes that the model is aggressively filtered to prevent copyright infringement issues. As such, a direct prompt asking the model to generate a clip featuring a trademarked character will be rejected outright, meaning you'll have to get creative with your prompt engineering skills.
Copyrighted content is AI's bread and butter
OpenAI and, by extension, Microsoft are no strangers to copyright infringement issues. The companies have been slapped with several lawsuits over the issue. OpenAI CEO Sam Altman admitted that developing tools like ChatGPT without copyrighted content is impossible. The executive argued that copyright law doesn't categorically forbid using copyrighted content to train AI models.
While speaking to TechCrunch, Joshua Weigensberg, an IP attorney at Pryor Cashman, indicated:
“Companies that are training on unlicensed footage from video game playthroughs are running many risks. Training a generative AI model generally involves copying the training data. If that data is video playthroughs of games, it’s overwhelmingly likely that copyrighted materials are being included in the training set.”
Microsoft and OpenAI have contested the copyright infringement cases, citing fair use while referring to their models' creations as transformative rather than plagiarized work.
Popular YouTuber and tech reviewer Marques Brownlee raised critical concerns about Sora when it recently launched, questioning the source of its training material. Brownlee had early access to the tool, which allowed him to assess its capabilities. In the process, the YouTuber asked the AI tool to generate a video of a tech reviewer talking about a smartphone.
The AI-generated video caught the reviewer's attention, especially the plant on the desk in the video. He indicated that the plant featured in the clip looked suspiciously similar to the one in dozens of his videos.
While the AI-generated video isn't conclusive proof that Sora lifted some of its inspiration from Brownlee's videos, it raises eyebrows and might be worth watching.
Former OpenAI CTO Mira Murati was previously asked whether Sora was trained using YouTube, Instagram, and Facebook content. She couldn't provide a straight answer, indicating only that the model was trained on publicly available data alongside licensed data from stock media providers, including Shutterstock.
The AI firm didn't comment on TechCrunch's findings when asked, other than saying it would "check with the team."