In a Webmaster Central Hangout, Google’s John Mueller discussed why Google crawls non-existent pages and what that means for your crawl budget. Web publishers would like to see Google crawling existing pages, so it seems a waste of time for Google to crawl pages that don’t exist. A web publisher asked whether they should block Googlebot from crawling non-existent pages. John Mueller’s answer adds more information to what we know about Google’s 404 crawls.
Non-existent pages are referred to as 404 pages. That’s the HTTP status code a website should return when a requested web page is missing: a 404 means the server could not find the requested page. A 410 status code signals that a web page is intentionally gone and is not expected to come back.
When Google crawls a non-existent page (a 404 page), it can be referred to as a 404 crawl. Google’s John Mueller made three interesting statements about why Google crawls non-existent pages:
- 404 crawls are sometimes Google utilizing extra crawl capacity to double check URLs that used to exist (in case the page returns)
- 404 crawling is a sign that Google has more than enough capacity to crawl more URLs from your site
- 404 pages do not need to be blocked from crawling (for the purpose of preserving crawl budget). You will not lose crawl capacity from 404 crawls
Google Remembers 404 Pages
Although Google may not keep a web page in its index, if the page used to exist, Google will remember that a web page used to exist at that URL and will crawl that old URL to see if it returned. Google’s Matt Cutts stated in 2014 that the reason Google remembered was to build in a safeguard in case a web publisher made a mistake in removing a web page and the web page returned.
Here is how Google handled 404 error pages according to Matt Cutts in 2014:
“200 might mean everything went totally fine. 404 means the page was not found. 410 typically means gone, as in the page is not found and we do not expect it to come back. So 410 has a little more of a connotation that this page is permanently gone.
So the short answer is that we do sometimes treat 404s and 410s a little bit differently, but for the most part you shouldn’t worry about it.
If a page is gone and you think it’s temporary, go ahead and use a 404. If a page is gone and you know no other page that should substitute for it, you don’t have anywhere else that you should point to, and you know that page is going to be gone and never come back, then go ahead and serve a 410.”
Matt Cutts then goes on to say that web publishers sometimes make mistakes when declaring pages gone. So Google has built in safeguards to account for those mistakes, so as not to drop pages that a publisher may have wanted to keep.
“So with 404s, along with I think 401s and maybe 403s, if we see a page and we get a 404, we are going to protect that page for 24 hours in the crawling system. So we sort of wait and we say well, maybe that was a transient 404. Maybe it wasn’t really intended to be a page not found. And so in the crawling system it’ll be protected for 24 hours.”
Now compare that to what John Mueller stated:
“We understand that these are 404 or 410 pages or at least shouldn’t be indexed. But we know about those pages. And every now and then when… we have nothing better to do on this website, we’ll go off and double-check on those URLs.
And if we check those URLs and see a server error or the Page Not Found error here, then that’s something we’ll tell you about in Search Console. …and that’s fine.
So it’s not something you need to block from crawling. It’s not something you need to worry about. It’s not that we’re losing crawl capacity by looking at those URLs. It’s essentially a sign from us that we have enough capacity to crawl more URLs on your website and we’re just double-checking some of the old ones just in case you managed to put something back up there.”
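Search Console reports these crawls, but a publisher who wants to see them directly can also look in the server's access log. The following is only a minimal sketch: it assumes the log uses the common "combined" format, it identifies Googlebot by a plain user-agent substring (which can be spoofed), and the sample lines, paths, and IP addresses are invented for illustration.

```python
import re
from collections import Counter

# Pulls the request path and the status code out of a combined-format log line.
LOG_PATTERN = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3}) ')


def googlebot_missing_hits(lines):
    """Count requests from Googlebot that were answered with 404 or 410, per path."""
    hits = Counter()
    for line in lines:
        if "Googlebot" not in line:  # naive user-agent check, assumption for this sketch
            continue
        match = LOG_PATTERN.search(line)
        if match and match.group("status") in {"404", "410"}:
            hits[match.group("path")] += 1
    return hits


# Invented sample lines (documentation-range IPs), not real crawl data.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
SAMPLE_LOG = [
    f'192.0.2.1 - - [10/Oct/2018:13:55:36 +0000] "GET /old-page HTTP/1.1" 404 512 "-" "{GOOGLEBOT_UA}"',
    f'192.0.2.1 - - [11/Oct/2018:09:12:01 +0000] "GET /old-page HTTP/1.1" 404 512 "-" "{GOOGLEBOT_UA}"',
    f'192.0.2.1 - - [11/Oct/2018:09:12:05 +0000] "GET /about HTTP/1.1" 200 2048 "-" "{GOOGLEBOT_UA}"',
    '203.0.113.9 - - [11/Oct/2018:09:13:00 +0000] "GET /missing HTTP/1.1" 404 512 "-" "Mozilla/5.0"',
]

if __name__ == "__main__":
    for path, count in googlebot_missing_hits(SAMPLE_LOG).most_common():
        print(path, count)
```

Going by Mueller's explanation, a handful of repeat visits to old URLs in a report like this is expected and is not a reason to block anything.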
What is Unique About John Mueller’s Statement
John Mueller’s statement adds a new dimension to what we know about why Googlebot crawls 404 pages. These crawls are an indication that Google has plenty of crawl capacity for your website, so there is no reason to be concerned about them.
Googlebot 404 Crawls are a Good Sign?
We learn today from John Mueller that it’s actually a good sign if Google is crawling 404 pages. But what’s different between what John Mueller stated and what Matt Cutts stated is how Google treats 410 pages.
Remember, 410 pages are web pages that are intentionally removed and will not be coming back ever. Nowadays, a web publisher may use the 410 code to indicate a web page that is expired, like a promotion that has ended, an offer for an event that has passed or a product that no longer exists. A web publisher may also use a 410 code for a spam page that may have been generated by a hacker. So for that last reason especially, web publishers may want Google to obey the error code and absolutely forget that web page and not come looking for it.
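The use cases above can be sketched as a small lookup that decides which status code to return. This is only an illustration: the paths and the two sets are hypothetical, and a real site would make this decision in its CMS or server configuration.

```python
from http import HTTPStatus

# Hypothetical URL sets for illustration only.
LIVE_PAGES = {"/", "/about"}
PERMANENTLY_REMOVED = {"/expired-promotion", "/hacked-spam-page"}


def status_for(path: str) -> HTTPStatus:
    """Pick the response status for a requested path."""
    if path in LIVE_PAGES:
        return HTTPStatus.OK         # 200: the page exists
    if path in PERMANENTLY_REMOVED:
        return HTTPStatus.GONE       # 410: removed on purpose, not coming back
    return HTTPStatus.NOT_FOUND      # 404: missing, possibly only temporarily


if __name__ == "__main__":
    for path in ("/about", "/expired-promotion", "/typo-in-url"):
        status = status_for(path)
        print(path, int(status), status.phrase)
```

Anything not explicitly listed as permanently removed falls back to 404, which matches the advice to reserve 410 for pages you know are gone for good.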
This is Matt Cutts on 410 pages in 2014:
“If we see a 410, then the crawling system says OK, we assume the webmaster knows what they’re doing, because they went off the beaten path to deliberately say that this page is gone. So they immediately convert that 410 to an error rather than protecting it for 24 hours.
Now don’t take this too much the wrong way. We’ll still go back and recheck and make sure are those pages really gone. Or maybe the pages have come back alive again. And I wouldn’t rely on the assumption that that behavior will always be exactly the same.
In general, sometimes webmasters get a little too caught up in tiny little details. And so if a page is gone it’s fine to serve a 404. If you know it’s gone for real, it’s fine to serve a 410.
But we’ll design our crawling system to try to be robust. But if your site goes down or if you get hacked or whatever, that we try to make sure that we can still find the good content whenever it’s available.”
There is an earlier statement from 2011 where Google stated that it treats 404 and 410 errors essentially the same. Matt Cutts’ statement indicates that Google actually treats them a little differently. So, just for the record, here is what the earlier statement said:
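Taken together, the two Matt Cutts quotes describe a simple rule: a 404 (and, by his hedged account, a 401 or 403) is protected for 24 hours before it counts as an error, while a 410 counts immediately. The toy model below only restates that rule in code. It is not Google's implementation, the function and constant names are made up, and Cutts himself cautions that the behavior may not stay the same.

```python
from datetime import datetime, timedelta

# The 24-hour window and the status codes come from the quotes;
# everything else is a hypothetical illustration.
PROTECTION_WINDOW = timedelta(hours=24)
PROTECTED_STATUSES = {401, 403, 404}


def treated_as_error(status: int, first_seen: datetime, now: datetime) -> bool:
    """Return True once this toy crawler would treat the URL as an error."""
    if status == 410:
        return True  # converted to an error right away
    if status in PROTECTED_STATUSES:
        # Protected until the window has elapsed, in case the error was transient.
        return now - first_seen >= PROTECTION_WINDOW
    return False  # any other status is outside this sketch


if __name__ == "__main__":
    seen = datetime(2014, 4, 14, 12, 0)
    print(treated_as_error(404, seen, seen + timedelta(hours=1)))
    print(treated_as_error(404, seen, seen + timedelta(hours=25)))
    print(treated_as_error(410, seen, seen))
```

Either way, per the quote, the crawler still comes back later to recheck the URL, which is why 404 and 410 crawls keep showing up.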
Google Webmaster Blog on 410 Error Responses in 2011
“Currently Google treats 410s (Gone) the same as 404s (Not found), so it’s immaterial to us whether you return one or the other.”
Google Obeys Standards for 410 Server Response Codes
The official standard for the 410 error code states that clients with link editing capabilities should delete references to the URL and that the site owner wants any links to it removed. It doesn’t state that the client is required to remove references to the site or to never return to the URL. So an argument could be made that Google is obeying the 410 error code by not reproducing a link to the web page. Here is the text from the W3C.org web page on server response codes:
“The requested resource is no longer available at the server and no forwarding address is known. This condition is expected to be considered permanent. Clients with link editing capabilities SHOULD delete references to the Request-URI after user approval… This response is cacheable unless indicated otherwise.
The 410 response is primarily intended to assist the task of web maintenance by notifying the recipient that the resource is intentionally unavailable and that the server owners desire that remote links to that resource be removed…”
Takeaway from What John Mueller Stated
Many web publishers may have considered it troubling for Google to crawl non-existent pages, the 404 and 410 server response pages. However, now we know that this is a sign that Google has more than enough capacity to crawl your site. So if you see that Google is crawling 404 or 410 pages, there is no reason to worry and no reason to block Google. It is in fact a good sign.
Images by Shutterstock, Modified by Author