The more we learn about how AI is built, the more reports surface of companies using copyrighted content to train AI without permission.
NVIDIA has been accused of downloading videos from YouTube, Netflix and other sources to train commercial AI projects. 404 Media reports that the company was using the downloaded videos to train AI models for products like the company's Omniverse 3D world generator and "digital human" efforts like the embodied AI Gr00t project.
When reached by email, NVIDIA told Tom's Guide that it "respects the rights of all content creators" while saying that its research efforts are "in full compliance with the letter and the spirit of copyright law."
"Copyright law protects particular expressions but not facts, ideas, data, or information," the statement read. "Anyone is free to learn facts, ideas, data, or information from another source and use it to make their own expressions."
The company also made the case that AI model training is an example of fair use because it uses content for a transformative purpose.
Netflix declined to comment, but YouTube does not agree with NVIDIA's assessment. Jack Malon, YouTube's Policy Communications Manager, pointed us to comments made by CEO Neal Mohan in April to Bloomberg, saying that "our previous comments still stand."
At the time, Mohan was responding to reports that OpenAI was training its Sora AI video generator on YouTube videos without permission. He said, "It does not allow for things like transcripts or video bits to be downloaded, and that is a clear violation of our terms of service. Those are the rules of the road in terms of content on our platform."
This isn't even the first time this summer that NVIDIA has been accused of scraping YouTube. Several big companies, including Apple and Anthropic, were reportedly pulling information from a massive dataset called 'the Pile,' which features thousands of YouTube videos, including content from popular creators like Marques Brownlee and PewDiePie.
Ethical concerns raised...and dismissed
404 Media reports that employees who raised ethical or legal concerns were told by managers that the practice had the green light from the "highest levels of the company."
“This is an executive decision,” Ming-Yu Liu, vice president of research at NVIDIA, replied. “We have an umbrella approval for all of the data.”
Apparently, some managers kicked the can down the road, saying that the scraping was an open legal issue that the company would deal with later.
YouTube and Netflix videos weren't the only sources reportedly scraped by NVIDIA. The company is also said to have pulled from the movie trailer database MovieNet, libraries of video game footage, and the GitHub-hosted video dataset WebVid.
It may be that scraping creates opportunities for poor data to make its way into model training since companies appear to be grabbing whatever they can.
Bruno Kurtic, CEO of Bedrock Security, suggests indiscriminate scraping can produce flawed models: "Given the very large scales of data used, manual attempts to do this will always result in incomplete answers, and as a result, the models may not stand up to regulatory scrutiny."
He went on to suggest that AI building companies should provide an auditable "data bill of materials to highlight where the data they trained on came from and what was ethically sourced."
It is one way companies could address their AI provenance issues, but when everyone is scraping everyone else, what data is clean?
What isn't fair game?
Allegedly, some of the videos used by NVIDIA came from a huge library of YouTube videos whose usage license restricts them to academic research. NVIDIA reportedly claimed that the academic library was fair game for commercial AI products anyway.
YouTube parent company Alphabet isn't immune to criticism of scraping the internet for AI models. Last summer, Google released a plan to use all "publicly available information to help train Google’s AI models and build products and features like Google Translate, Bard, and Cloud AI capabilities.”
It is safe to assume that anything posted to Google platforms like YouTube was considered fair game, along with anything posted on the internet at large.
At the time a Google spokesperson told Tom's Guide, "Our privacy policy has long been transparent that Google uses publicly available information from the open web to train language models for services like Google Translate. This latest update simply clarifies that newer services like Bard are also included. We incorporate privacy principles and safeguards into the development of our AI technologies, in line with our AI Principles.”
The implication is that any public post made at any point in time is fodder for Google's own AI ambitions.
The full 404 Media report has far more details and is worth a read.