Google’s John Mueller warns that pages blocked by robots.txt could still get indexed if there are links pointing to them.
This can become a problem because Google cannot crawl the blocked pages, so it would index them as URLs with no content.
Mueller says if you have content on your site that you don’t want indexed, the best course of action is to use a noindex meta tag and leave the pages open to crawling.
This topic came up during a recent Webmaster Central hangout when a site owner asked if it would be enough to “disallow” pages that aren’t necessary for indexing.
Mueller’s full response is transcribed below:
“One thing maybe to keep in mind here is that if these pages are blocked by robots.txt, then it could theoretically happen that someone randomly links to one of these pages. And if they do that then it could happen that we index this URL without any content because it’s blocked by robots.txt. So we wouldn’t know that you don’t want to have these pages actually indexed.
Whereas if they’re not blocked by robots.txt you can put a noindex meta tag on those pages. And if anyone happens to link to them, and we happen to crawl that link and think ‘maybe there’s something useful here’ then we would know that these pages don’t need to be indexed and we can just skip them from indexing completely.
So, in that regard, if you have anything on these pages that you don’t want to have indexed then don’t disallow them, use noindex instead.”
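The interaction Mueller describes can be checked with a short script. The following is a minimal sketch in Python, not a Google tool: it assumes you already have the robots.txt text and the page HTML as strings, and the names `RobotsMetaParser`, `has_noindex`, and `noindex_is_reachable` are hypothetical helpers made up for illustration. It reports whether a noindex tag sits on a page that a crawler is actually allowed to fetch, since a tag on a disallowed page would never be seen.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        # "robots" addresses all crawlers; "googlebot" addresses Google only.
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            for directive in attr.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def has_noindex(html):
    """Return True if the page carries a noindex robots meta directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


def noindex_is_reachable(robots_txt, url, html, user_agent="Googlebot"):
    """Return True only if the page has noindex AND robots.txt lets the
    crawler fetch it, so the tag can actually be read."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return has_noindex(html) and rules.can_fetch(user_agent, url)


# Illustrative data only; example.com and the paths are placeholders.
robots = "User-agent: *\nDisallow: /private/\n"
page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'

print(noindex_is_reachable(robots, "https://example.com/public/a.html", page))
print(noindex_is_reachable(robots, "https://example.com/private/a.html", page))
```

In this sketch the second call is the case Mueller warns about: the page has a noindex tag, but the disallow rule prevents the crawler from ever reading it.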
The full question and answer can be viewed in the video below, starting at the 24:36 mark.
For more on robots.txt blocking and the noindex tag, see Should You Noindex Category & Archive Pages?