Most top news publishers block AI training bots via robots.txt, but they’re also blocking the retrieval bots that determine whether sites appear in AI-generated answers.
BuzzStream analyzed the robots.txt files of 100 top news sites across the US and UK and found 79% block at least one training bot. More notably, 71% also block at least one retrieval or live search bot.
Training bots gather content to build AI models, while retrieval bots fetch content in real time when users ask questions. Sites blocking retrieval bots may not appear when AI tools try to cite sources, even if the underlying model was trained on their content.
What The Data Shows
BuzzStream examined the top 50 news sites in each market based on SimilarWeb traffic share, then deduplicated the list. The study grouped bots into three categories: training, retrieval/live search, and indexing.
Training Bot Blocks
Among training bots, Common Crawl’s CCBot was the most frequently blocked at 75%, followed by Anthropic-ai at 72%, ClaudeBot at 69%, and GPTBot at 62%.
Google-Extended, the robots.txt token that controls whether content is used to train Gemini, was the least blocked in the training category at 46% overall. US publishers blocked it at 58%, nearly double the 29% rate among UK publishers.
Harry Clarkson-Bennett, SEO Director at The Telegraph, told BuzzStream:
“Publishers are blocking AI bots using the robots.txt because there’s almost no value exchange. LLMs are not designed to send referral traffic and publishers (still!) need traffic to survive.”
Retrieval Bot Blocks
The study found 71% of sites block at least one retrieval or live search bot.
Claude-Web was blocked by 66% of sites, while OpenAI’s OAI-SearchBot, which powers ChatGPT’s live search, was blocked by 49%. ChatGPT-User was blocked by 40%.
Perplexity-User, which handles user-initiated retrieval requests, was the least blocked at 17%.
Indexing Blocks
PerplexityBot, which Perplexity uses to index pages for its search corpus, was blocked by 67% of sites.
Only 14% of sites blocked all AI bots tracked in the study, while 18% blocked none.
The Enforcement Gap
The study acknowledges that robots.txt is a directive, not a barrier, and bots can ignore it.
We covered this enforcement gap when Google’s Gary Illyes confirmed robots.txt can’t prevent unauthorized access. It functions more like a “please keep out” sign than a locked door.
Clarkson-Bennett raised the same point in BuzzStream’s report:
“The robots.txt file is a directive. It’s like a sign that says please keep out, but doesn’t stop a disobedient or maliciously wired robot. Lots of them flagrantly ignore these directives.”
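To see why compliance is optional, here is a minimal sketch of a well-behaved crawler using Python's standard library. The URLs, the ExampleBot user agent, and the polite_fetch helper are all placeholders for illustration, not anything from the study; the point is that the robots.txt check runs inside the crawler's own code.

```python
# A minimal sketch of a robots.txt-respecting fetch, standard library only.
# The URLs are placeholders (example.com), not a real publisher.
from urllib.parse import urljoin
from urllib.request import Request, urlopen
from urllib.robotparser import RobotFileParser

def polite_fetch(url: str, user_agent: str) -> bytes | None:
    # Download and parse the site's robots.txt.
    rp = RobotFileParser()
    rp.set_url(urljoin(url, "/robots.txt"))
    rp.read()

    # This voluntary check is the only "enforcement" robots.txt gets:
    # a compliant crawler backs off here, a non-compliant one simply
    # skips straight to the request below.
    if not rp.can_fetch(user_agent, url):
        return None

    req = Request(url, headers={"User-Agent": user_agent})
    return urlopen(req).read()

print(polite_fetch("https://example.com/some-article", "ExampleBot/1.0"))
```

Everything that makes this crawler "polite" lives on the crawler's side of the connection, which is exactly the gap the Cloudflare findings below illustrate.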
Cloudflare documented that Perplexity used stealth crawling behavior to bypass robots.txt restrictions. The company rotated IP addresses, changed ASNs, and spoofed its user agent to appear as a browser.
Cloudflare delisted Perplexity as a verified bot and now actively blocks it. Perplexity disputed Cloudflare’s claims and published a response.
For publishers serious about keeping AI crawlers out, CDN-level blocking or bot fingerprinting may be necessary on top of robots.txt directives.
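As a rough illustration of what moving enforcement server-side looks like, here is a sketch of user-agent filtering in a bare-bones Python WSGI app. The app and its blocklist are hypothetical (the crawler names come from the study), and real deployments would typically apply these rules in CDN or WAF configuration rather than application code.

```python
# Hypothetical sketch: deny known AI crawler user agents at the server,
# regardless of what robots.txt asks for. Not a substitute for CDN/WAF rules.
from wsgiref.simple_server import make_server

# Illustrative blocklist; names match crawlers cited in the study.
BLOCKED_AGENTS = ("gptbot", "ccbot", "claudebot", "anthropic-ai", "perplexitybot")

def app(environ, start_response):
    ua = environ.get("HTTP_USER_AGENT", "").lower()
    if any(name in ua for name in BLOCKED_AGENTS):
        # Enforcement happens here, on the publisher's side of the connection.
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"Automated access denied.\n"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Regular response.\n"]

if __name__ == "__main__":
    # Declared user agents can be spoofed, which is why CDNs layer IP, ASN,
    # and behavioral fingerprinting checks on top of filters like this one.
    make_server("", 8080, app).serve_forever()
```

The spoofing caveat is the lesson of the Perplexity episode above: matching on the declared user agent only catches bots that identify themselves honestly.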
Why This Matters
The retrieval-blocking numbers warrant attention here. In addition to opting out of AI training, many publishers are opting out of the citation and discovery layer that AI search tools use to surface sources.
OpenAI separates its crawlers by function: GPTBot gathers training data, while OAI-SearchBot powers live search in ChatGPT. Blocking one doesn’t block the other. Perplexity makes a similar distinction between PerplexityBot for indexing and Perplexity-User for retrieval.
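A short, illustrative example shows how that separation plays out in practice. The robots.txt below is made up, not any publisher's actual file, and the URL is a placeholder; running it through Python's standard parser, GPTBot is turned away while OAI-SearchBot is not.

```python
# Illustrative only: a made-up robots.txt that disallows OpenAI's training
# crawler while leaving its live-search crawler untouched.
from urllib.robotparser import RobotFileParser

SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""

rp = RobotFileParser()
rp.parse(SAMPLE_ROBOTS_TXT.splitlines())

for agent in ("GPTBot", "OAI-SearchBot"):
    verdict = "allowed" if rp.can_fetch(agent, "https://example.com/story") else "blocked"
    print(f"{agent}: {verdict}")
# GPTBot: blocked, OAI-SearchBot: allowed -- opting out of training does not,
# by itself, opt a site out of live retrieval, or vice versa.
```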
These blocking choices affect where AI tools can pull citations from. If a site blocks retrieval bots, it may not appear when users ask AI assistants for sourced answers, even if the model already contains that site’s content from training.
The Google-Extended pattern is worth watching. US publishers block it at nearly twice the UK rate, though whether that reflects different risk calculations around Gemini’s growth or different business relationships with Google isn’t clear from the data.
Looking Ahead
The robots.txt method has limits, and sites that want to block AI crawlers may find CDN-level restrictions more effective than directives alone.
Cloudflare’s Year in Review found GPTBot, ClaudeBot, and CCBot had the highest number of full disallow directives across top domains. The report also noted that most publishers use partial blocks for Googlebot and Bingbot rather than full blocks, reflecting the dual role Google’s crawler plays in search indexing and AI training.
For those tracking AI visibility, the retrieval bot category is what to watch. Training blocks affect future models, while retrieval blocks affect whether your content shows up in AI answers right now.
Featured Image: Kitinut Jinapuck/Shutterstock