One of our valued readers sent in a question asking why a page he blocked with the robots.txt Disallow directive still shows up in Google's SERPs.
So here’s my detailed explanation of the problem as well as the solution:
To me, blocking pages via robots.txt has always been primarily about saving the bot's time rather than actually trying to hide anything. Search bots crawl on a budget: the more "extra" pages you exclude from the start, the more time the bot can spend finding your content-rich pages and adding them to (or updating them in) the index.
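As a refresher, a minimal robots.txt that excludes a section of the site might look like this (the /search-results/ path here is just an illustration, not a recommendation for any particular site):

```
User-agent: *
Disallow: /search-results/
```

Any crawler that honors the standard will skip every URL under that path and spend its crawl budget elsewhere on the site.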
What the standard "Disallow" directive still cannot do is make Google drop the page from the index. So you may end up seeing blocked pages in Google's SERPs: Google won't know what they actually contain, so it will make judgements based on internal and external references to those pages.
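If you want to double-check how a given Disallow rule is interpreted, Python's standard urllib.robotparser module applies the same path-matching logic crawlers use. Here is a quick sketch with a made-up /private/ path:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt blocking everything under /private/
rules = """User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Blocked path: crawlers that obey robots.txt will not fetch it
print(parser.can_fetch("Googlebot", "https://example.com/private/page.html"))  # False

# Unblocked path: fetching is allowed
print(parser.can_fetch("Googlebot", "https://example.com/public/page.html"))   # True
```

Note that can_fetch() only tells you whether the URL may be crawled; as explained above, a blocked URL can still appear in the index based on links pointing to it.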
So the natural question raised by the situation above is: "How do I make Google ignore those 'extra' pages completely – neither wasting crawl time on them nor listing them in SERPs?"
The answer is not as simple as it may seem. The widely used "noindex" meta tag won't work, because Google won't see it: the page is blocked from Google, so Google can't fetch it to read the robots meta tag.
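For reference, this is the meta tag in question. Placed in the page's head section, it tells Google not to index the page – but only if the crawler is actually allowed to fetch the page and see it:

```html
<meta name="robots" content="noindex">
```

This is exactly the catch-22: Disallow prevents the fetch, so the noindex instruction never reaches Google.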
There are two other possible solutions though:
1. Use the robots.txt Disallow directive and then use the URL removal tool within Google Webmaster Tools;
2. Use the robots.txt Noindex directive – it is unofficially supported by Google and can be one of the steps to help sculpt PageRank. This directive blocks the page from being both crawled and indexed:
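The syntax mirrors Disallow; a hypothetical example blocking a single page:

```
User-agent: Googlebot
Noindex: /extra-page.html
```

Keep in mind that this directive is unofficial and undocumented, so it is worth verifying its effect (for example, by watching whether the URL drops out of the index) rather than relying on it blindly.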