News outlets including the New York Times, CNN, Reuters and the Australian Broadcasting Corporation (ABC) have blocked a tool from OpenAI, limiting the company’s ability to continue accessing their content.
OpenAI is behind one of the best known artificial intelligence chatbots, ChatGPT. Its web crawler – known as GPTBot – may scan webpages to help improve its AI models.
The Verge was first to report the New York Times had blocked GPTBot on its website. The Guardian subsequently found that other major news websites, including CNN, Reuters, the Chicago Tribune, the ABC and Australian Community Media (ACM) brands such as the Canberra Times and the Newcastle Herald, appear to have also disallowed the web crawler.
So-called large language models such as ChatGPT require vast amounts of information to train their systems and allow them to answer queries from users in ways that resemble human language patterns. But the companies behind them are often tightlipped about the presence of copyrighted material in their datasets.
The block on GPTBot can be seen in the publishers’ robots.txt files, which tell crawlers from search engines and other entities which pages they are allowed to visit.
“Allowing GPTBot to access your site can help AI models become more accurate and improve their general capabilities and safety,” OpenAI said in a blogpost that included instructions on how to disallow the crawler.
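Per OpenAI’s published guidance, blocking the crawler from an entire site comes down to two lines in robots.txt:

    User-agent: GPTBot
    Disallow: /

The same guidance notes that Allow and Disallow rules can instead restrict the crawler to particular directories rather than blocking it outright.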
All the outlets examined added the block in August. Some have also disallowed CCBot, the web crawler for an open repository of web data known as Common Crawl that has also been used for AI projects.
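Because robots.txt files are public, anyone can replicate the check. The sketch below assumes nothing beyond Python’s standard library and reports whether a given site disallows the two crawlers; the example.com domain is a placeholder, not a claim about any outlet’s policy.

    from urllib.robotparser import RobotFileParser

    # The two crawlers discussed in this article, matched by User-agent token.
    CRAWLERS = ["GPTBot", "CCBot"]

    # Illustrative placeholder: substitute any site you want to check.
    SITES = ["https://www.example.com"]

    for site in SITES:
        parser = RobotFileParser(f"{site}/robots.txt")
        parser.read()  # fetches and parses the file over HTTP
        for crawler in CRAWLERS:
            verdict = "allowed" if parser.can_fetch(crawler, f"{site}/") else "disallowed"
            print(f"{site}: {crawler} is {verdict}")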
CNN confirmed to Guardian Australia that it recently blocked GPTBot across its titles, but did not comment on whether the brand plans to take further action about the use of its content in AI systems.
A Reuters spokesperson said the company regularly reviews its robots.txt file and site terms and conditions. “Because intellectual property is the lifeblood of our business, it is imperative that we protect the copyright of our content,” she said.
The New York Times’ terms of service were recently updated to make the prohibition against “the scraping of our content for AI training and development … even more clear,” according to a spokesperson.
As of 3 August, its website rules explicitly prohibit the use of the publisher’s content for “the development of any software program, including, but not limited to, training a machine learning or artificial intelligence (AI) system” without consent.
News outlets globally face decisions about whether to use AI as part of news gathering, and how to deal with their content potentially being sucked into training pools by companies developing AI systems.
In early August, outlets including Agence France-Presse and Getty Images signed an open letter calling for regulation of AI, including transparency about “the makeup of all training sets used to create AI models” and consent for the use of copyrighted material.
Google has proposed that AI systems should be able to scrape the work of publishers unless they explicitly opt out.
In a submission to the Australian government’s review of the regulatory framework around AI, the company argued for “copyright systems that enable appropriate and fair use of copyrighted content to enable the training of AI models in Australia on a broad and diverse range of data, while supporting workable opt-outs”.
Research shared this week by OriginalityAI, a company that checks for AI-generated content, found that major websites including Amazon and Shutterstock had also blocked GPTBot.
The Guardian’s robots.txt file does not disallow GPTBot.
The ABC, Australian Community Media, the Chicago Tribune, OpenAI and Common Crawl did not respond by deadline.