On December 27, 2023, the New York Times (NYT) filed a lawsuit against OpenAI and its backer Microsoft, accusing them of copyright infringement. The NYT has alleged that OpenAI used thousands of its articles to train ChatGPT, a large language model, without permission or compensation. The lawsuit says that this has positioned ChatGPT as a competing source of information to the NYT, which has implications for the newspaper’s business model. Should AI models be allowed to use copyrighted material for training? Arul George Scaria and Cecilia Ziniti discuss the question in a conversation moderated by P.J. George. Edited excerpts:
In the context of the NYT versus OpenAI case, how does the fair use doctrine apply to the training of AI models on copyrighted material?
Cecilia Ziniti: In U.S. law, fair use is Section 107 of the Copyright Act. Essentially, it’s a four-factor test, and it’s notoriously difficult to predict. OpenAI has a good case, but so does the NYT. The first factor in the fair use analysis is the purpose and character of the use. In other words, how is OpenAI using that content? The second is the nature of the copyrighted work. Is it highly creative? Of course, NYT would say that it is. The third is the amount used. Is OpenAI using all of NYT’s content or only as much as it needs to effectuate its use? The fourth is the effect of the use on the market value of the original. Does OpenAI’s use of NYT’s content somehow decrease its (NYT’s) market opportunities? The fair use doctrine calls for a balancing of these factors. OpenAI’s argument would be that [its use of the material] is transformative. That is, by using NYT’s work to train a model, it is not replacing the use of NYT. OpenAI would cite cases about Google Books, thumbnails, or scraping, where works that don’t replace the original were found to be transformative and therefore fair use.
Arul George Scaria: This is a unique generative Artificial Intelligence (AI) case wherein both parties are on strong ground. NYT has produced evidence which shows verbatim reproduction of content that it owns. This makes the fair use analysis even tougher to predict. Another important exhibit shows that when prompts were directed in a certain manner, ChatGPT returned a specific paragraph of an NYT article. Would this be considered a substitute for subscribing to the NYT? That’s something which the court might have to look into. However, I take the view that the use of copyrighted material for the purpose of training an AI should not be considered infringement because it comes within the broad ambit of the fair use exception. A word of caution here: the U.S. fair use analysis is broad in scope because there is no purpose-specific limitation. If you can convince the court through the four factors that Cecilia mentioned, or any other additional relevant factors, you might be able to establish that it is fair use. India doesn’t have a broad exception like the U.S. What we have is a fair dealing exception complemented with a long list of enumerated exceptions. It is unfortunate that within the enumerated exceptions, we don’t have a specific text and data mining exception. This means that if a similar case happens in India, the only way we can justify the training might be in terms of fair dealing. Here, my view is that the court will have to take a very liberal interpretation of the purposes mentioned if it wants to accommodate training. Ideally, it should be doing that. There are precedents from other parts of the globe, particularly Canada, wherein the courts have made a very liberal interpretation of the purposes mentioned under a similar fair dealing provision.
Cecilia Ziniti: Fair use as a doctrine goes back to 1841, to a case about copying the writings of George Washington. A biographer got the copyrights to Washington’s papers, and another author copied 353 pages of them. The court at that time came up with this balancing test that we still use. There are lots of fun precedents we can look at. There is the case [in 1984] between Sony, the maker of the Betamax videotape recorder (VCR), and Universal Studios, which argued that the technology could be used for copyright infringement. The U.S. Supreme Court found that there was a substantial non-infringing use, which was time-shifting [recording a programme to watch later]. Those are the kinds of cases that the courts will look to. It’s also possible that there is a legislative solution, such as what happened with the Digital Millennium Copyright Act, which gives online providers a way to manage copyright infringement on their platforms.
Arul George Scaria: Cecilia, in the NYT case, one of the interesting claims is that the digital protection measures that were put in place by NYT were overridden when the contents were used for AI training. Do you think that would have any influence on fair use analysis in the U.S.?
Cecilia Ziniti: One of the rights of a copyright holder is control over how their content is displayed. Stripping out information identifying the owner of a piece of content is an additional claim. However, if it is fair use, then it’s not actually part of the copyright and there is no claim. It is not as if there has been an infringement and fair use is a defence. If it is fair use, there has been no infringement, because the copyright does not extend that far.
Arul George Scaria: Thanks for that clarification. If you look at the Indian situation, we still haven’t seen any specific litigation in the context of text and data mining. But any future litigation will have to be within the ambit of the fair dealing exception, provided under Section 52(1)(a) of the Copyright Act. Under the statute, there are three categories of purposes that a use needs to fit into for the fairness analysis. However, many scholars as well as courts from other jurisdictions, particularly Canada, have shown that the courts can take a liberal approach to the purposes mentioned. On the specific issue of training-related infringement claims, a strong argument in court could be that it is part of the broader research purpose. Ideally, what India should be doing, if copyrighted materials are to be allowed for training purposes, is either have a text and data mining exception inserted into the copyright statute or turn the fair dealing exception into a fair use exception. Some jurisdictions which had been following the fair dealing exception have already changed it into a fair use exception, particularly to deal with emerging technologies.
What is the law on copyrights for AI-generated material?
Cecilia Ziniti: In the U.S., the Copyright Office has said that AI-generated material is not copyrightable, which makes sense, since the precedents require a human to be involved. Funnily enough, the case that is the best precedent on this is about a monkey. A monkey in Indonesia took several selfies on a camera set up by a nature photographer. After several disputes over who could benefit from the copyright of these images, it was established that neither the photographer nor the monkey could. This case stands for the proposition in copyright law that there must be an author, which goes back to the U.S. Constitution. In the case of generative AI, who is the author? If I ask generative AI to edit a paragraph of mine, and then I edit it again, at what point am I the author versus the AI? These are tough questions. So far the Copyright Office has indicated that purely AI-generated content is not going to get copyright.
Arul George Scaria: The Indian Copyright Office has sadly messed up on this matter. There was one application for an AI-generated painting which was initially rejected, but when it was submitted again as a jointly authored work by a human and an AI, the Indian Copyright Office accepted it without any deliberation on the consequences or on the question of whether that was allowed under the copyright statute. When the matter became a controversy, it issued a notice saying that it was withdrawing the copyright. But when I looked at the Copyright Office records recently, it appeared that the work was still under registration. If you go by the spirit and letter of the Copyright Act, 1957 in India, there is no way a non-human can be granted copyright protection. One of the important steps taken by the U.S. Copyright Office recently is that it has issued guidelines categorically stating that the applicant should disclose whether AI has been used, and if so, in what manner. Such disclosure is necessary in today’s context.
How do you see the situation evolving around AI-training or AI-generated works and copyright?
Cecilia Ziniti: When Napster came out and peer-to-peer file sharing took off, it was clear that there needed to be a market solution where you could pay for music. Enter iTunes, which created a way for us to transact online to buy songs and paved the way for Spotify, Amazon Music and every other music service. I think it will be similar here. As the technology grows and as people want to create fan art or want to be inspired by different things that are copyrighted, you could have a mechanism to pay the artist. A market-based solution is likely here.
Arul George Scaria: When we talk with policymakers in India or Europe or elsewhere, one of the most evident things is the fear of missing out. On ownership, many people tend to flag that Chinese courts are now allowing copyright in AI-generated works. We should step away from that fear and ask: what is the primary purpose of granting copyright protection? If it is promoting creativity, then yes, we need to fine-tune our policies to ensure that the broader objective is met. The use of copyrighted materials for training purposes should generally be considered fair use. At the same time, we should also ensure that if OpenAI or anyone else is using copyrighted material for training, they don’t seek copyright protection for the content generated by the AI concerned.
Cecilia Ziniti is a San Francisco-based lawyer specialising in technology and start-up companies; Arul George Scaria is an Associate Professor at the National Law School of India University (NLSIU)