Google’s John Mueller and Martin Splitt talked about LLMs.txt and markdown, with Mueller offering a surprising fact about the original purpose of LLMs.txt and explaining why the proposed standards have severe shortcomings.
What Discovery Is And Why It Matters
In the context of information retrieval (search), discovery is about a search engine discovering that a specific web page exists. Discovery is a part of the overall search engine architecture.
Search Engine Architecture:
- Discovery: Discovering the URL (adding it to the crawl).
- Crawling: Downloading and parsing the content.
- Indexing: The process of analyzing the raw data and storing it in a structured database optimized for retrieval.
- Ranking: The part that everyone’s interested in.
- Serving: This is the last step, which is serving the ranked web pages in the search results.
The above is a simplified overview of how search works. Discovery is the very first part of a process that eventually ends with ranking and serving links to websites.
The takeaway here is that Discovery is a critical part of getting a web page queued for crawling, indexed, ranked, and eventually shown in the search results. Without Discovery a web page is invisible.
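The stages above can be sketched as a toy pipeline. This is a minimal illustration, not how any real search engine is built: the in-memory `TOY_WEB`, the `example.com` URLs, and the `crawl` and `search` helpers are all hypothetical. It assumes discovery happens only by following links found in HTML, which is why the unlinked page never reaches the index even though its text matches the query.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "web" (URL -> HTML). None of these pages are real.
TOY_WEB = {
    "https://example.com/": '<a href="https://example.com/prints">Shop</a>',
    "https://example.com/prints": '<a href="https://example.com/">Home</a> buy a photograph',
    "https://example.com/orphan": "an unlinked page that also sells a photograph",
}


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls):
    """Discover URLs by following links, then 'download' and index each page."""
    frontier = deque(seed_urls)      # Discovery: URLs queued for crawling
    discovered = set(seed_urls)
    index = {}                       # Indexing: URL -> stored content
    while frontier:
        url = frontier.popleft()
        html = TOY_WEB.get(url)      # Crawling: stand-in for an HTTP fetch
        if html is None:
            continue
        index[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in discovered:
                discovered.add(link)
                frontier.append(link)
    return index


def search(index, term):
    """Toy ranking and serving: order indexed pages by term frequency."""
    scored = [(html.lower().count(term.lower()), url) for url, html in index.items()]
    return [url for score, url in sorted(scored, reverse=True) if score > 0]


index = crawl(["https://example.com/"])
results = search(index, "photograph")
print(results)
```

In this sketch the orphan page can never appear in `results`, no matter how relevant its content is, because nothing upstream of ranking ever learned that it exists.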
Now here is why this is important: Discovery is not a part of the proposed LLMs.txt standard.
Original Intent Of LLMs.txt
John Mueller said that he met one of the people responsible for creating the LLMs.txt proposal, who explained that LLMs.txt was never about making a site discoverable and was never meant to be a part of that process.
This is an important point because many site owners are spending time, money, and effort generating LLMs.txt for the purpose of getting discovered and ranked in LLMs. That means that the reason people are using LLMs.txt is in conflict with the actual purpose of LLMs.txt, which has nothing to do with Discovery.
Mueller explained:
“So I talked with, I think, one of the people who created that proposal a while back. And the idea was really not to create something that makes it easier for search engines or LLM systems to discover all of your content, but almost more that if an LLM already knows about your site and wants to find out what else is here, then that might be an approach.
And I think the aspect of using this as a way to optimize for Discovery by AI systems or Discovery by search systems, that doesn’t make any sense at all.”
Mueller next explained that many people are using LLMs.txt in the hope of aiding the process of Discovery, even though that is not the purpose of LLMs.txt.
He then pivoted to the point that an LLMs.txt file is inherently untrustworthy because it is the site owner’s own statement of what the site’s content is about, which may or may not match what’s in the actual HTML.
He continued:
“Because it’s basically you’re telling these systems, like, I have the best website ever. And here are all of the pages that everyone must go to. And you must buy all of my products or whatever you put in there.
So in an LLM system, it… basically, by design, can’t trust what is here as a way of differentiating between different websites.”
Agentic Instructions
Mueller then said that some of these proposed standards could be useful for helping an AI agent. He may have been referring to the Web Model Context Protocol (WebMCP), which he mentioned by name shortly afterward.
He explained:
“If someone is already on your website, maybe some kind of automated system is helpful. Where if it goes, I want to go to Martin’s Splitt and buy a photograph, then the LLM system can go to your website and can look around, like, how do you buy a photograph? Maybe he has some guidelines for me as an agent for buying photographs. That kind of makes sense.
But going off and saying, I want to buy a photograph, which website has one, the system is not going to go to your website and five others and say, who has some automated information? But rather, they’re trying, going to try to find the best website…”
LLMs.txt Is Not About Getting Discovered By AI
Mueller circled back to how people are misconstruing LLMs.txt as a way to be discovered by AI systems.
He reasoned about this point:
“I think from that point of view, optimizing as a way of being discovered, that doesn’t make sense.
But what happens when an agent is on your website? I think that also just generally seems to be an open area for discussion at the moment, in that there’s LLMs.txt as a proposal. There are different JSON files and well-known file types that are in discussion.
There’s WebMCP, which I think tries to do something similar, where they say, well, you’re on this page now, but we have a programmatic interface for this, added specific URL or a specific mechanism.
I think those are then almost different discussions.”
Discovery And Ranking Are Still Tied To HTML
Mueller completed his thought by underlining the point that Discovery happens at the HTML level.
He explained:
“So the generic SEO angle of how do I find a website that sells me a photograph is almost going to be completely bound to HTML pages and normal web pages.
And then if a user decides to go to a specific service, then within that service, then there is a little bit more room for maybe helping an agent or an LLM system to find the right approach.
But what is interesting, of course, is lots of ideas. And none of these have basically crystallized as the one thing that everyone will use. So I’m sure over the next, I don’t know, half year, year, or maybe longer, it’s going to take a bit. And some of these agentic systems are going to kind of unify around some standard file type or mechanism or something.”
Mueller wasn’t pushing the WebMCP standard, but if AI agents become a way that users interact with websites, then it’s going to be something like WebMCP and not LLMs.txt that will be useful for websites, particularly for ecommerce sites.
WebMCP is the naturally better fit for ecommerce because it focuses on giving AI agents actionable capabilities, such as filtering products, searching for and identifying products, comparing different products, and adding a product to a shopping cart.
AI agents are already able to navigate a website using its HTML, which was designed for humans. WebMCP makes it easier for AI agents to successfully interact with the website, something that LLMs.txt does not do.
Neither LLMs.txt nor WebMCP helps a website get discovered by AI, and neither of them was created for that purpose. Discovery, the first stage on the way to ranking, happens entirely through HTML. If that’s the case, what’s your next move?
Listen To Google’s Search Off The Record Episode 111
Featured Image by Shutterstock/Master1305