Artificial intelligence companies eager for training data have forced many websites and content creators into a relentless game of whack-a-mole, battling increasingly aggressive web crawler bots that continuously scrape their data to train AI models. In just one example, repair database iFixit complained in July that a web crawler bot for Anthropic’s AI chatbot Claude hit its website nearly a million times in a single day.
Of course, bot crawlers have been around for decades, either for good (to gather data for search engines that help people discover sites) or bad (malicious bots seeking to take down websites). The bots crawling for AI training data have fallen into a murky third category—a website might want to block them all, or to allow some access to scrape data as part of licensing agreements or in the hopes of being cited in a chatbot answer.
This summer, Cloudflare—which, as one of the world’s largest networks underlying the global internet, has a long history of offering services to block malicious bots—began arming content creators with what it called the equivalent of a free “easy button” to block all website crawlers with one click.
However, while it was useful, the feature was also a blunt instrument, Cloudflare CEO Matthew Prince tells Fortune. It could not differentiate between crawlers scavenging for AI training data and those crawling for search engines. In addition, customers could not decide to block one crawler but not another.
“People didn’t know whether to push the button or not,” he said.
Today, the company has added to its arsenal with what it says are more precise tools that offer websites and content creators more control over who can access their data, as well as the ability to analyze how their content is used by AI models.
Now a website can use new filters that give OpenAI permission to crawl its website, but not Baidu or Perplexity, and it can also control which areas of the website an AI company is permitted to access. Cloudflare maintains that its analytics can also help those signing licensing agreements with model providers understand the metrics used in negotiations, such as the rate for crawling certain sections or the entire page.
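To make the idea of per-crawler, per-section filtering concrete, here is a minimal sketch in Python. It is not Cloudflare's implementation or API: the `CrawlerRule` type, the `POLICY` table, the `is_allowed` helper, and the example paths are all hypothetical, and matching on the User-Agent string alone is a simplifying assumption, since that header can be spoofed. The agent tokens shown (GPTBot, Baiduspider, PerplexityBot) are the names those crawlers publicly announce.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlerRule:
    """Hypothetical rule: which path prefixes a named crawler may fetch."""
    agent_token: str         # substring looked for in the User-Agent header
    allowed_prefixes: tuple  # empty tuple means the crawler is blocked everywhere


# Hypothetical site policy: OpenAI's crawler may read two sections,
# while Baidu's and Perplexity's crawlers are blocked site-wide.
POLICY = (
    CrawlerRule("GPTBot", ("/blog/", "/docs/")),
    CrawlerRule("Baiduspider", ()),
    CrawlerRule("PerplexityBot", ()),
)


def is_allowed(user_agent: str, path: str, policy=POLICY) -> bool:
    """Return True if a request with this User-Agent may fetch this path."""
    ua = user_agent.lower()
    for rule in policy:
        if rule.agent_token.lower() in ua:
            # A listed crawler is allowed only inside its permitted sections.
            return any(path.startswith(prefix) for prefix in rule.allowed_prefixes)
    # Agents not named in the policy (ordinary browsers, for example) pass through.
    return True
```

Under this sketch, a request from GPTBot for `/blog/post-1` would pass, the same crawler asking for `/account/` would be refused, and PerplexityBot would be refused everywhere. A production system would also need to verify that a request really comes from the crawler it claims to be.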
Once the 40 million websites that use Cloudflare begin taking advantage of the new features, the company also hopes to become a central marketplace for them to negotiate with AI model providers (who also use Cloudflare) to license their data. Site owners could set a price for their site, or sections of their site, and then charge model providers.
Prince says Cloudflare is uniquely positioned to act as the go-between. “When we say, listen, we’re going to set these rules, that’s something that AI companies pay attention to, because it immediately has an impact on north of 20% of the web,” said Prince. Cloudflare’s relationships with the major AI companies, he explained, create a two-sided market.
Cloudflare's efforts, he added, are essential for the open internet to continue because without the ability to control how sites are crawled by AI companies seeking to train models, content creators will either stop creating or put more of their content behind paywalls. While large publishers may strike direct deals, the AI model providers will struggle to access high-quality content from smaller websites.
“I believe Cloudflare will be the company that is able to solve what I think is the key problem to make sure that content continues to be created online in a new, increasingly AI-powered web,” said Prince.