Digital Content Next, a trade body representing US digital publishers, has sent a cease and desist letter to the Common Crawl Foundation.
The letter demands Common Crawl stop collecting publisher content and remove material already in its datasets.
DCN CEO Jason Kint announced the legal notice in a blog post, and Press Gazette reported additional details from the letter this week.
Common Crawl has crawled several billion new pages each month since 2007 to build a free public archive. That archive has been used to train many of the AI models in use today. OpenAI’s GPT-3 paper listed filtered Common Crawl as 60% of the model’s training mix.
The dispute matters for any site that blocks AI crawlers. Blocking Common Crawl’s crawler, CCBot, stops future collection but doesn’t touch content already in the archive, which anyone can still download.
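For site owners who want to confirm what their robots.txt currently tells CCBot, the check can be sketched with Python's standard library. This is a minimal illustration, not Common Crawl's own tooling: the robots.txt content, the example.com URL, and the is_blocked helper are all assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block CCBot everywhere, allow all other crawlers.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules disallow user_agent from url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


# Illustrative URL only; substitute a page from your own site.
page = "https://example.com/news/story"
ccbot_blocked = is_blocked(ROBOTS_TXT, "CCBot", page)
other_blocked = is_blocked(ROBOTS_TXT, "SomeOtherBot", page)
print("CCBot blocked:", ccbot_blocked)
print("Other bots blocked:", other_blocked)
```

A result showing CCBot as blocked only describes what the file asks of future crawls. It says nothing about pages already stored in the archive.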
What DCN Demands
The letter calls on Common Crawl to stop “scraping, retaining, or sharing copyrighted, paywalled, subscriber-only, or otherwise protected content from DCN member companies in its datasets,” and to remove member content it has already collected.
DCN claims Common Crawl has “flagrantly infringed” on copyrighted content by creating its datasets and sharing them with AI companies.
The letter argues “copyright law is not an opt-out regime.” In other words, DCN’s position is that publishers shouldn’t have to ask to be excluded. Common Crawl should need permission to include them.
Kint wrote that the notice:
“challenges a growing assumption that content created through substantial investment can be collected, stored, repurposed, and monetized simply because it is technically accessible.”
Why DCN Doubts The Removal Process
The DCN letter questions whether Common Crawl follows opt-out instructions and whether it removes content when asked. Per Press Gazette, DCN’s lawyers are examining whether Common Crawl’s statements to publishers “may have been inaccurate or misleading.”
Common Crawl publishes a public registry of websites that have asked not to be scraped. It includes entries for the Associated Press, the BBC, and a large News/Media Alliance submission covering hundreds of domains. Press Gazette reports the list also includes other major publishers.
This isn’t the first time the removal process has been questioned. The Atlantic reported in November that content from The New York Times and Danish publishers was still available after Common Crawl agreed to remove it.
Common Crawl’s Response
Common Crawl executive director Rich Skrenta declined to comment on the letter when contacted by Press Gazette.
He has pushed back on similar claims before. In a November blog post responding to The Atlantic, Skrenta denied that the organization lied to publishers or scrapes paywalled material.
He said the archive’s file format can’t be edited after publication without breaking its integrity. Instead, Common Crawl says it removes or filters affected URLs from subsequent crawls and makes them inaccessible through its public tools and indices:
“When a publisher asks us to remove previously crawled material, we respond promptly and initiate a removal process that reflects the technical design of our dataset.”
He added:
“No one at Common Crawl has ever claimed this work was instantaneous or complete; rather, we have been open about its complexity and ongoing nature.”
In a forum post this week, Skrenta said Common Crawl is contributing to open standards work on how websites express AI scraping preferences.
Why This Matters
The DCN letter targets the stored archive, not just future crawling, and argues the burden should not fall on publishers to opt out in the first place.
Most publishers in BuzzStream’s sample have already decided to block: 79% of the 100 news sites it checked block at least one training bot. Cloudflare’s Year in Review data, which we covered in January, found CCBot among the bots with the most full disallow directives across top domains. The question DCN raises is what those blocks accomplish if years of content stay available for training anyway.
Looking Ahead
Whether DCN escalates depends on how Common Crawl responds, and Common Crawl hasn’t said how it will. The two sides want different rules for who acts first.
Skrenta is backing standards work that would let sites state their scraping preferences, which keeps opting out as the model. The UK’s CMA took a similar path when it required Google to let publishers opt out of AI search features.
DCN argues scrapers should need permission first. If more trade groups take up that argument, the pressure moves from individual robots.txt files to the archives themselves.