Last week, five of Canada’s most prominent news media outlets launched a lawsuit against OpenAI for copyright infringement, demanding what could amount to billions in damages. The suit follows similar cases brought earlier this year against the creator of ChatGPT by The New York Times and other media companies in the United States.
At the heart of all these lawsuits is the claim that OpenAI “scraped” large amounts of content from media sites, copying it without permission, and that the company is profiting from it without compensating the original creators.
OpenAI has yet to formally respond to the Canadian lawsuit, but insists that using news material to train its chatbot is “fair dealing” under copyright law — and not an infringement.
Who is right? And why is OpenAI entering licensing agreements with various media companies if they’re so sure they’re not breaking the law?
Is the Canadian case just a ploy to land a big licensing deal?
A closer look at how chatbots are trained suggests that OpenAI may be right that “scraping” isn’t copying. But it may not be “fair dealing” either.
Breach of contract?
To be clear, the five media companies — Torstar, Postmedia, The Globe and Mail Inc., The Canadian Press and CBC/Radio-Canada — are also making two further claims.
First, that OpenAI thwarted the protective measures the news sites employ to block scraping tools; and second, that by doing so, it breached the sites’ terms of service.
The news companies bringing the lawsuit rely on tools to “prevent unauthorized scraping of data” from their websites. An example is the Robots Exclusion Protocol, which manages how software like bots and web crawlers can access a site. These tools, along with paywalls and account restrictions, are meant to safeguard against unauthorized uses of their material.
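The protocol works through a plain-text file called robots.txt placed at a site’s root. As an illustrative sketch (the site and paths are hypothetical; GPTBot is the user-agent name OpenAI has published for its web crawler), a news site might block AI training crawlers while still allowing ordinary search engines:

```text
# https://example-news-site.com/robots.txt
# Block OpenAI's crawler from the entire site
User-agent: GPTBot
Disallow: /

# Other crawlers may index public pages, but not the archive
User-agent: *
Disallow: /archive/
```

Compliance with robots.txt is voluntary, which is why the plaintiffs frame the issue as circumvention of protective measures rather than relying on the protocol alone.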
The plaintiffs say that by reading their content online, site visitors accept the terms of use found somewhere in the background, and that since 2015, the terms have made clear that news material is for “personal, non-commercial use of individual users only.”
Fair dealing exemption
The crux of all three claims in the Canadian lawsuit is that by using their material — scraping content — OpenAI is copying their work and making unauthorized use of it for profit.
But is scraping really copying? And if it is, does it count as fair dealing?
Copyright law in Canada and the U.S. allows for unauthorized copying or use of a protected work in some cases under the fair dealing or fair use exception. Courts consider a series of factors, including the purpose of the copying (commercial or educational), the extent of the copying and its impact on the original work.
Soon after The New York Times launched its lawsuit, OpenAI argued that training its chatbot on news material found on the web does not involve unlawful copying. The company argued this falls under fair use, pointing to various legal experts and civil society groups that agree.
Legal scholars have argued that scraping data from news sites involves making a temporary copy, but only as a first step for the purpose of “abstract[ing] metadata” or information about relationships between words and sentences. Combining large amounts of metadata creates a new “artifact” that is “not substantially similar to any particular work in the training data.”
As the authors put it: “Generative AI models are generally not designed to copy training data; they are designed to learn from the data at an abstract and uncopyrightable level.”
There is, after all, no copyright in statistical patterns or word frequencies.
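The distinction can be made concrete with a toy example. Real language models learn far richer representations, but the sketch below (hypothetical function names, written for illustration only) shows the kind of abstraction the scholars describe: counting relationships between adjacent words rather than storing the text itself.

```python
from collections import Counter

def bigram_stats(text: str) -> Counter:
    """Count adjacent word pairs: a crude stand-in for the
    'relationships between words and sentences' that training
    abstracts from a document."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))

# A hypothetical snippet of news text
article = "the court will decide whether the court was right"
stats = bigram_stats(article)

# The result is a table of pattern frequencies, not a copy of
# the article: ('the', 'court') occurs twice.
print(stats[("the", "court")])  # 2
```

Whether aggregating billions of such statistics at scale still counts as “copying” the underlying works is precisely what the courts must decide.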
The nonprofit group Creative Commons agrees: OpenAI’s use of news material to train a chatbot is similar, they say, to Google’s digitizing of millions of books to create a searchable database. Both are “transformative” uses of the original material, resulting in products that serve different purposes and don’t compete with or take anything away from the original creators.
Licensing and settlements
To hedge its bets, right after The New York Times lawsuit, OpenAI did two things. It said that it would respect a news organization’s choice to opt out of allowing its content to be used for training data. And it began to make deals with news organizations to license their content for training purposes.
But the lawsuits remain, and judges in Canada and the U.S. will soon begin hearing them. They will have to decide: is scraping a form of reproduction that copyright protects against — and is it fair dealing?
One factor will be the non-competitive nature of chatbots and their inability to access paywalled content from The Globe and Mail or Toronto Star.
But another factor might involve licensing. As other commentators have noted, finding that OpenAI’s use of news content to train its AI is fair dealing could reduce the market for licensing deals. The more deals that are struck, the stronger this market will appear — and the greater the cost to media companies of calling this fair dealing.
This makes a settlement and licensing deal in the Canadian case likely. But OpenAI may just roll the dice.
And if it does, the future of AI could hang in the balance.
Robert Diab does not work for, consult, own shares in or receive funding from any company or organisation that would benefit from this article, and has disclosed no relevant affiliations beyond their academic appointment.
This article was originally published on The Conversation. Read the original article.