
Google Shares More Information On Googlebot Crawl Limits

  • Googlebot crawling is limited by design to protect infrastructure.
  • Crawl limits are flexible and can be increased or decreased as needed.
  • Excessively large documents create processing overhead.
  • The cost of crawling is not just about bandwidth and avoiding breaking the Internet; it's also about protecting Google's infrastructure.

Google provided more details about Googlebot's crawl limits, sharing that the limits are flexible and can be increased or decreased.


Google’s Gary Illyes and Martin Splitt discussed Googlebot’s crawl limits, explaining why the limits exist and revealing new information about how those limits can be adjusted upward or dialed down depending on what is being accomplished.

Details About Googlebot Limits

Gary Illyes shared details of what goes on behind the scenes at Google to drive the various crawl limits, beginning with Googlebot's 15 megabyte limit.

He said that any crawler within Google has a 15 megabyte limit and explicitly said that this limit could be overridden or switched off. In fact, he said that teams inside Google regularly override that limit. He used the example of Google Search, which overrides that limit by dialing it down to two megabytes.

Illyes explained:

“I mean, there’s a bunch of things that are for our own protection or our infrastructure’s protection. Like for example, the infamous 15 megabyte default limit that is set at the infrastructure level.

And basically any crawler that doesn’t override that setting is going to have a 15 megabyte limit. Basically it starts fetching the bytes from the server or whatever the server is sending. And then there’s an internal counter. And then when it reached 15 megabytes, then it basically stops receiving the bytes.

I don’t know if it closes the connection or not. I think it doesn’t close the connection. It just sends a response to the server that, OK, you can stop now. I’m good.

But then individual teams can override that. And that happens. It happens quite a bit. And for example, for Google Search, specifically for Google search, the limit is overridden to two megabytes.”
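The behavior Illyes describes, an internal counter that simply stops consuming the response body once a threshold is reached, can be sketched roughly as follows. This is an illustrative Python example, not Google's actual code; the streaming approach and the function name are assumptions, with only the 15 MB default and the 2 MB Search override taken from the episode:

```python
import io

DEFAULT_LIMIT = 15 * 1024 * 1024  # 15 MB infrastructure-level default


def fetch_truncated(stream, limit=DEFAULT_LIMIT, chunk_size=65536):
    """Read bytes from a response stream, stopping once `limit` is reached.

    Mirrors Illyes' description: a counter tracks received bytes, and the
    crawler simply stops receiving (rather than erroring) at the limit.
    """
    received = bytearray()
    while len(received) < limit:
        chunk = stream.read(min(chunk_size, limit - len(received)))
        if not chunk:
            break  # server finished sending before the limit was hit
        received.extend(chunk)
    return bytes(received)


# A team like Google Search can override the default downward:
SEARCH_LIMIT = 2 * 1024 * 1024  # 2 MB, per the episode

# io.BytesIO stands in for a network response stream here.
body = fetch_truncated(io.BytesIO(b"x" * 3_000_000), limit=SEARCH_LIMIT)
print(len(body))  # truncated to 2 MB (2097152 bytes)
```

The key detail from the episode is that truncation is a default any individual crawler can override in either direction, not a hard cap.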

Limits On Googlebot Are For Infrastructure Protection

Illyes next shared an example where the 15 megabyte limit is overridden to increase the crawl limit, in this case for PDFs. This is where he mentioned Googlebot limits in the context of protecting Google’s infrastructure from being overwhelmed by too much data.

He offered more details:

“Well, mostly everything. Like, for example, for PDFs, it’s, I don’t know, 64 or whatever. Because PDFs can, like the HTTP standard, if you export it as PDF, I think you said that, if you export it as PDF, then it’s 96 megabytes or something.

But that means that it would overwhelm our infrastructure if we fetch the whole thing and then convert it to HTML, blah, blah, and then start processing it.
It’s just like, it’s overwhelming because it’s so much data.

And same goes for HTML. It’s the HTML living standard. Like if you have like 14 megabytes, we are not going to fetch that. We are going to fetch the individual pages because fortunately, they also had enough brain power to have individual pages for individual features of HTML. We can fetch those pages, but we are not going to have anything useful out of the 14 megabyte one pager of the HTML standard.”

Other Google Crawlers Have Different Limits

At this point, Illyes revealed that other Google crawlers have different limits and that the documented limits aren’t hard limits across all of Google’s crawlers.

He continued:

“So yeah, and other crawlers, I never worked on other crawlers, but other crawlers I’m sure have different settings. I could imagine, for example, even in individual projects, it can have different settings for the same thing.

Like, for example, I can imagine that if we need to index something very fast, then the truncation limit could be one megabyte, for example. I don’t know if that’s the case, but I could imagine that to be the case. Because if you need to push something through the indexing pipeline within seconds, then it’s easier to deal with little data.”

Google’s Crawling Infrastructure Is Not Monolithic

This part of the Search Off The Record episode came to a close with Martin Splitt affirming that Google’s crawling infrastructure is flexible and far more diverse than what is described in Google’s documentation, saying that it is not monolithic. Monolithic literally means a single massive stone and is used to describe something uniform and unchanging. By saying that Google’s crawlers are not monolithic, Splitt is affirming that they are flexible in terms of fetch limits and other configurations.

He also zeroed in on describing Google’s crawling infrastructure as software as a service.

Splitt summarized the takeaways:

“That’s true. That’s true. I think in general, it is useful to have cleared up this idea of crawling just being like a monolithic kind of thing. It’s more like a software as a service that search is, or web search specifically, is one client to and not like a monolithic kind of thing.

And as you said, like configuration can change. It can even change within, let’s say, Googlebot. If I’m looking for an image, we probably allow images to be larger than 2 megabytes, I guess, because images easily are larger than 2 megabytes. PDFs, allow 64. Whatever is documented, we’ll link the documentation. But I think that makes perfect sense.

And if you think about it as in, it’s a service we call with a bunch of parameters, then it makes a lot more sense to see, OK, so there’s different configuration. And this configuration can change on request level, not necessarily just on like, Googlebot is always the same.”
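Splitt's framing of crawling as a shared service that clients call with parameters, with configuration that can change at the request level, could be sketched like this. All names and the fallback mechanics are illustrative assumptions; only the 2 MB Search figure and the roughly 64 MB PDF figure come from the episode:

```python
from dataclasses import dataclass

MB = 1024 * 1024


@dataclass(frozen=True)
class FetchConfig:
    """Hypothetical per-request parameters a client passes to a shared crawl service."""
    truncation_limit: int = 15 * MB  # infrastructure-level default


# Hypothetical per-client, per-content-type overrides, loosely based on
# the figures mentioned in the episode.
CLIENT_CONFIGS = {
    ("search", "text/html"): FetchConfig(truncation_limit=2 * MB),
    ("search", "application/pdf"): FetchConfig(truncation_limit=64 * MB),
}


def config_for(client: str, content_type: str) -> FetchConfig:
    # Fall back to the infrastructure default when no override exists.
    return CLIENT_CONFIGS.get((client, content_type), FetchConfig())


print(config_for("search", "text/html").truncation_limit // MB)        # 2
print(config_for("search", "application/pdf").truncation_limit // MB)  # 64
print(config_for("other-client", "text/html").truncation_limit // MB)  # 15
```

The design point Splitt makes is that the limit is a parameter of each request, not a fixed property of "Googlebot."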

Listen to the Search Off The Record episode from the 20-minute mark.

Featured Image by Shutterstock/BestForBest

Category News SEO
SEJ Staff Roger Montti, Owner at Martinibuster.com
