Google Can Index Blocked URLs Without Crawling

March 28, 2019
⋅
4 min read

SEJ STAFF Roger Montti Owner - Martinibuster.com at Martinibuster.com

606

SHARES
5.7K

READS

Google Can Index Blocked URLs Without Crawling

Google’s John Mueller recently “liked” a tweet by search marketing consultant Barry Adams (of Polemic Digital) that concisely stated the purpose of the robots.txt exclusion protocol. He freshened up an old topic and quite possibly gave us a new way to think about it.

Google Can Index Blocked Pages

The issue began when a publisher tweeted that Google had indexed a website that was blocked by robots.txt.

Screenshot of a tweet by a person who says Google indexed a web page that was blocked by Robots.txt John Mueller responded:

“URLs can be indexed without being crawled, if they’re blocked by robots.txt – that’s by design.

Usually that comes from links from somewhere, judging from that number, I’d imagine from within your site somewhere.”

How Robots.txt Works

Barry (@badams) tweeted:

“Robots.txt is a crawl management tool, not an index management tool.”

We often think of Robots.txt as a way to block Google from including a page from Google’s index. But robots.txt is just a way to block which pages Google crawls.

That’s why if another site has a link to a certain page, then Google will crawl and index the page (to a certain extent).

Barry then went on to explain how to keep a page out of Google’s index:

“Use meta robots directives or X-Robots-Tag HTTP headers to prevent indexing – and (counter-intuitively) let Googlebot crawl those pages you don’t want it to index so it sees those directives.”

NoIndex Meta Tag

The noindex meta tag allows crawled pages to be kept out of Google’s index. It doesn’t stop the crawl of the page, but it does assure the page will be kept out of Google’s index.

The noindex meta tag is superior to the robots.txt exclusion protocol for keeping a web page from being indexed.

Here is what John Mueller said in a tweet from August 2018

“…if you want to prevent them from indexing, I’d use the noindex robots meta tag instead of robots.txt disallow.”

Screenshot of a tweet by Google's John Mueller recommending the noindex meta tag to prevent Google from indexing a web page

Robots Meta Tag Has Many Uses

A cool thing about the Robots meta tag is that it can be used to solve issues until a better fix comes along.

For example, a publisher was having trouble generating 404 response codes because the angularJS framework kept generating 200 status codes.

His tweet asking for help said:

Hi @JohnMu I´m having many troubles with managing 404 pages in angularJS, always give me a 200 status on them. Any way to solve it? Thanks

Screenshot of a tweet about 400 pages resolving as 200 response codes

John Mueller suggested using a robots noindex meta tag. This would cause Google to drop that 200 response code page from the index and regard that page as a soft 404.

“I’d make a normal error page and just add a noindex robots meta tag to it. We’ll call it a soft-404, but that’s fine there.”

So, even though the web page is showing a 200 response code (which means the page was successfully served), the robots meta tag will keep the page out of Google index and Google will treat it as if the page is not found, which is a 404 response.

Screenshot of John Mueller tweet explaining how robots meta tag works

Official Description of Robots Meta Tag

According to the official documentation at the World Wide Web Consortion, the official body that decides web standards (W3C), this is what the Robots Meta Tag does:

“Robots and the META element
The META element allows HTML authors to tell visiting robots whether a document may be indexed, or used to harvest more links.”

This is how the W3c documents describe the Robots.txt:

“When a Robot visits a Web site,it firsts checks for …robots.txt. If it can find this document, it will analyze its contents to see if it is allowed to retrieve the document.”

Screenshot of a page from the W3c showing the official standard for the robots meta tag

The W3c interprets the role of the Robots.txt as like a gate keeper for what files are retrieved. Retrieved means crawled by a robot that obeys the Robots.txt exclusion protocol.

Barry Adams was correct to describe the Robots.txt exclusion as a way to manage crawling, not indexing.

It might be useful to thik of the Robots.txt as being like security guards at the door of your site, keeping certain web pages blocked. It may make untangling strange Googlebot activity on blocked web pages a little easier.

More Resources

Images by Shutterstock, Modified by Author
Screenshots by Author, Modified by Author

Category News SEO

The Ultimate Topic Cluster Cheat Sheet & Checklist Bundle

The State Of AI in Marketing

The Hidden Cost Of Google Ads: Stop Wasting Budget Bidding Against Yourself

The State Of AI in Marketing

Social Media Planner: How To Plan Your Content (With Template)

The State Of AI in Marketing

Google Can Index Blocked URLs Without Crawling

Google Can Index Blocked Pages

How Robots.txt Works

NoIndex Meta Tag

Robots Meta Tag Has Many Uses

Official Description of Robots Meta Tag

More Resources