Fortune
Jeremy Kahn

OpenAI says complying with copyright 'impossible'

OpenAI CEO Sam Altman

Hello and welcome to Eye on AI. Copyright issues continue to dominate AI news this week.

First, U.K. newspaper the Telegraph surfaced a submission OpenAI made to the British House of Lords, which is considering whether to update the country’s copyright laws to address issues raised by generative AI. In the submission, OpenAI claimed creating “leading AI models” would be “impossible” without using copyrighted material and that relying instead on text in the public domain would result in AI that fails to “meet the needs of today’s citizens.”

The U.K. already has a copyright exemption for “text and data mining” that applies to noncommercial research projects (an exemption that has also raised concerns about nonprofits being set up to engage in “data laundering” for businesses). But now OpenAI is arguing that the country should create a similar exemption covering all AI model training. Without a right to use copyrighted material for training, OpenAI claims, ChatGPT would cease to exist.

As many have pointed out, OpenAI’s submission is, at best, disingenuous. It fails to discuss the option of licensing copyrighted material for training. In fact, it fails to note that OpenAI itself is currently pursuing such licenses for at least some training data, even as it argues that building leading AI models without copyrighted material is impossible. Other companies have also managed to create good generative AI models without using copyrighted material taken without consent (more on that in a minute). If companies want to argue that finding rights holders and obtaining licenses is too onerous, there are other creative options. It might be possible for the government to establish a fund, paid for by a tax on the sale of generative AI models, to which rights holders could apply for compensation. This is exactly the approach the U.S. Congress once took for recording artists with the 1992 Audio Home Recording Act.

Of course, not all of OpenAI’s negotiations over licenses have gone swimmingly. On Monday, the company also posted a blog responding to the New York Times’ copyright infringement lawsuit against it. In the blog, OpenAI said the newspaper’s lawsuit was “without merit.” It also said that it had thought negotiations with the newspaper were proceeding well, with discussions continuing up until Dec. 19, and that it had been “surprised and disappointed” when the newspaper unexpectedly filed suit on Dec. 27. But OpenAI hinted that there was likely a big chasm between what the New York Times was demanding and what OpenAI was willing to pay. It said it had tried to explain to the newspaper’s representatives that “like any single source, their content didn’t meaningfully contribute to the training of our existing models and also wouldn’t be sufficiently impactful for future training.” In other words, OpenAI was trying to get away with offering the Times peanuts. The Times no doubt felt it deserved sirloin steak (or a crab cake, at least).

OpenAI’s framing of its position helps explain why negotiations with copyright holders are going to be so contentious. If the tech companies are forced to pay for data, they only want to pay for the marginal information value that data provides the model. These large language models ingest so much text that in most cases, as OpenAI argues, the value of any one source—even the New York Times, with its reputation for journalistic excellence and vast archive of millions of articles—is minimal. But many rights holders are not primarily concerned with the information value of their data. They are mostly worried about the threat the trained models will pose to their future revenue. If people come to rely on chatbots to summarize the news and do research for them, far fewer people will visit the New York Times website, and the revenue loss could be large. Rights holders feel they should be compensated to some degree for that potential loss. Bridging this gap will likely require an adjustment of expectations on both sides—much as happened with music streaming.

In its blog post, OpenAI also claimed that the New York Times, in its lawsuit and accompanying exhibits, has not been candid about how easy it is to produce copyright-infringing material using OpenAI’s models. The lawsuit included hundreds of examples in which the paper said it was able to get ChatGPT and other OpenAI models to spit out verbatim copies of stories when prompted with a snippet from the original but given no explicit instruction to produce a Times story. OpenAI tried to characterize this “regurgitation” as a rare bug (more on that in a moment too) and suggested that the New York Times either was not disclosing its prompts accurately or had cherry-picked its examples from thousands of attempts. OpenAI said regurgitation was more likely to occur when a single piece of text had been ingested multiple times during training, as was more likely to happen with Times stories because they are syndicated and reprinted in many other publications.

But here again, what OpenAI says is misleading. It is becoming increasingly apparent that regurgitation is not some highly unusual bug but a relatively common feature of most generative AI models. Last week, Gary Marcus, emeritus New York University cognitive scientist and prolific AI commentator, and Reid Southen, a commercial film concept artist, collaborated on research published in IEEE Spectrum that showed how easy it is to get both Midjourney’s latest text-to-image generator and OpenAI’s DALL-E 3 to regurgitate—or, as Marcus and Southen said, “plagiarize”—copyrighted content. They showed that it was trivial to get the models to produce Marvel, Warner Brothers, and Disney characters, including images that were nearly identical to film stills the studios released. They showed this could be done even without naming the movie or the characters. For Midjourney, they demonstrated that simply using the prompt “screencap” was enough to produce content nearly identical to copyrighted film stills. (Midjourney did not respond to a request to comment for this story.)

Researchers have previously shown it is possible to get LLMs to leak training data in their outputs, including personally identifiable information. It has also been shown that some images are so iconic that it can be difficult for image-generating models not to copy them in their outputs. A classic example by now is the prompt “Afghan girl,” which in early versions of Midjourney always returned an image strikingly similar to Steve McCurry’s famous National Geographic cover photo. Midjourney has since disallowed that prompt, and OpenAI seems to have tweaked DALL-E to force the model to return a different sort of image. But the point is that regurgitation isn’t some rare quirk. It’s an inherent problem with how these generative models work, one that has to be addressed by post-generation filtering, specific fine-tuning, or prohibiting certain prompts.

Marcus has been quick to claim these copyright issues mean that OpenAI’s business model is broken and that it, and the rest of the generative AI boom, is about to collapse as a result. I don’t think that will happen—although I do think business models may have to change. This week, I spoke to the CEO of one company that shows it is possible to get on the right side of these issues and still make a buck: Getty Images. It has partnered with Nvidia to create a generative AI still-image product, which it made available through its iStock service this week, as well as with Runway on a forthcoming video-generation product. Getty CEO Craig Peters tells me the company is committed to “commercially safe” generative AI. He says this means its AI offerings have been trained only on Getty’s own library of licensed images, they won’t output images of celebrities and other people that might cause commercial rights issues, and they won’t output any trademarked logos or characters either.

Even though Getty already had a right to use these images, Peters says the company wants to ensure creators receive additional compensation for their contribution to generative AI. Getty has done this by giving anyone whose images are part of the training set a share of the revenue the company brings in from its generative AI product. Right now, payouts are allotted according to two factors: the proportion of the overall training set that the creator’s images represent, and a metric for how often their imagery is currently being purchased from Getty’s stock catalogue. Peters says this second figure serves as a proxy for content quality. He says that in the future he would be in favor of a system that rewards creators for their contribution to any particular AI output, but that the technology to do so doesn’t currently exist.
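
Getty hasn’t published its exact formula, but Peters’ description implies a weighted split across those two dataset-level factors. Below is a minimal sketch in Python of what such an allocation could look like; the 50/50 weighting, the function and field names, and every figure are assumptions invented for illustration, not Getty’s actual numbers.

# Hypothetical sketch of a Getty-style revenue split, based only on Peters'
# description above. The 50/50 weighting and all figures are invented for
# illustration; Getty has not published its actual formula.
def allocate_royalties(pool, creators, training_weight=0.5):
    """Split a revenue pool using each creator's share of the training set
    and their share of current stock sales (a proxy for content quality)."""
    total_images = sum(c["training_images"] for c in creators)
    total_sales = sum(c["stock_sales"] for c in creators)
    payouts = {}
    for c in creators:
        # Blend the two normalized shares into a single payout score.
        score = (training_weight * c["training_images"] / total_images
                 + (1 - training_weight) * c["stock_sales"] / total_sales)
        payouts[c["name"]] = round(pool * score, 2)
    return payouts

creators = [
    {"name": "photographer_a", "training_images": 9000, "stock_sales": 120},
    {"name": "photographer_b", "training_images": 1000, "stock_sales": 480},
]
print(allocate_royalties(pool=100000, creators=creators))
# -> {'photographer_a': 55000.0, 'photographer_b': 45000.0}
# photographer_a is rewarded for volume, photographer_b for in-demand work.

The per-output system Peters says he would favor would instead measure each creator’s influence on individual generated images, something he notes the technology cannot yet do.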

Getty’s experience proves that copyright issues around generative AI can be overcome. Sure, it helps if business models align. Neither Getty’s nor Nvidia’s existing business was cannibalized by the new product. OpenAI and the New York Times are in a trickier situation. Any product the Times helps OpenAI build would likely cannibalize its existing advertising and subscription model. But a deal would only be “impossible” if one uses OpenAI’s favored definition of the word.

And with that, more AI news below.

Jeremy Kahn
jeremy.kahn@fortune.com
@jeremyakahn
