One of our valued readers sent in a question asking why a page he blocked with the Robots.txt Disallow directive still shows up in Google SERPs.
So here’s my detailed explanation of the problem as well as the solution:
To me, blocking pages via Robots.txt has always been more about saving the bot’s time than about actually hiding anything. Search bots crawl on a budget, so the more "extra" pages you exclude from the very start, the more time the bot will spend finding content-rich pages and including (or updating) them in the index.
What the standard “Disallow” directive still cannot do is make Google drop the page from the index. So you may end up seeing those blocked pages in Google SERPs – Google won’t know what they actually contain, so it will make judgements based on both internal and external references to those pages.
So quite a natural question raised by the situation above is "How do I make Google ignore those "extra" pages completely, so that it neither wastes crawl time on them nor lists them in SERPs?"
The answer is not as simple as it may seem. The widely used "NoIndex" meta tag won’t work because Google won’t see it: the page is blocked from Google, so Google can’t fetch it to read the Robots meta tag.
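To make the catch concrete, here is a minimal toy model of a polite crawler, written with Python's standard library. It is only a sketch of the reasoning above, not how Googlebot is actually implemented; the function name, the return strings and the example.com URLs are all made up for illustration.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaFinder(HTMLParser):
    """Records whether a page carries a robots meta tag containing 'noindex'."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            if "noindex" in (attrs.get("content") or "").lower():
                self.noindex = True


def toy_index_decision(robots_txt, url, html, user_agent="googlebot"):
    """Hypothetical model: robots.txt is checked BEFORE the page is read."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    if not rules.can_fetch(user_agent, url):
        # The page is never fetched, so its meta tags are never seen.
        return "not crawled; may still be listed from links"
    finder = RobotsMetaFinder()
    finder.feed(html)
    if finder.noindex:
        return "crawled; dropped (noindex seen)"
    return "crawled; indexed"


page = '<html><head><meta name="robots" content="noindex"></head></html>'
blocked = "User-agent: *\nDisallow: /login.php"

print(toy_index_decision(blocked, "http://example.com/login.php", page))
print(toy_index_decision("", "http://example.com/login.php", page))
```

In the first call the Disallow rule wins and the noindex tag is never read; in the second, with no blocking rule, the tag is seen and the page is dropped.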
There are two other possible solutions though:
1. Use the Robots.txt Disallow directive and then use the URL removal tool within Google Webmaster Tools;
2. Use the Robots.txt Noindex directive – it is unofficially supported by Google and can be one of the steps to help sculpt PageRank. Used together with Disallow, it keeps the page from being both crawled and indexed:
user-agent: googlebot
noindex: /login.php
disallow: /login.php
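
If you want to check what the crawl-blocking half of those rules does, Python's standard urllib.robotparser can evaluate them. This is only a sketch: the helper name and the example.com URLs are my own, and this parser follows the classic robots.txt rules, so it silently skips the non-standard noindex line. It therefore says nothing about how Google treats Noindex.

```python
from urllib.robotparser import RobotFileParser

# The rules from the post; the noindex line is non-standard and is
# ignored by this parser.
ROBOTS_TXT = """\
user-agent: googlebot
noindex: /login.php
disallow: /login.php
"""


def is_crawl_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_crawl_allowed(ROBOTS_TXT, "googlebot", "http://example.com/login.php"))   # False
print(is_crawl_allowed(ROBOTS_TXT, "googlebot", "http://example.com/index.html"))  # True
```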

Another good way to prevent crawling is Htaccess Authentication.
Actually the noindex directive is more than unofficially supported.
If you look at this page from Google http://www.google.com/support/webmasters/bin/answer.py?answer=35303 in a language other than English you can find a really clear statement:
“L’indice web di Google consente anche l’utilizzo di “Noindex:” in un file robots.txt per impedire persino la visualizzazione di un link ad un URL non sottoposto a scansione nei nostri risultati di ricerca.”
Roughly translated, it reads:
“Google’s web index also allows the use of “Noindex:” in a robots.txt file to prevent even a link to an uncrawled URL from being displayed in our search results.”
I really don’t know why this statement is available only if you choose a language other than English.
Yes, Htaccess Authentication is one of the best ways to prevent Google from crawling and indexing a particular page of a website.
This was not a good article, Ann — not the least because you’re implying that people can actually sculpt PageRank (they cannot).
The noindex tag directive doesn’t prevent pages from accruing PageRank.
^^ Troll.
If you block pages – you’re saving your PR for your other important pages. I.e., PR sculpting.
Awesome entry on wiki btw. (Please, the time you spend editing this thing alone can be used for something more useful… god)
http://en.wikipedia.org/wiki/User:Michael_Martinez
Actually, on second thought, I did sound like I was under the impression that NoIndex is meant to sculpt PageRank. In reality I was linking to a post detailing how to prevent a page from accruing PageRank, and the NoIndex directive was mentioned there as one of the steps to take.
Ann, NoIndex will NOT prevent a page from accruing PageRank. There is currently no mechanism for preventing pages from accruing PageRank.
Yes, this is good information. I will check my website for Google indexing.
Guys,
I am trying to stop Google from indexing duplicate query-string versions of a URL, like the ones below from Webmaster Tools. How do I set this up in the robots.txt file?
Pages with duplicate title tags Pages
Thermolife Products – Sport and Supplements
Go to URL/brands/Thermolife.html
Go to URL/brands/Thermolife.html?sort=alphadesc
Go to URL/brands/Thermolife.html?sort=newest
Go to URL/brands/Thermolife.html?sort=priceasc
Go to URL/brands/Thermolife.html?sort=pricedesc
Sport and Supplements – Products Tagged with 'Health and Wellness/Vitamins'
Go to URL/tags/health-and-wellnessvitamins?page=3&sort=featured
Go to URL/tags/health-and-wellnessvitamins?page=3&sort=newest
Go to URL/tags/health-and-wellnessvitamins?sort=newest
Sorry about the text above but really trying to resolve this. Please help
Regards
Chris
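
One possible way to approach Chris's question, offered as a sketch rather than a confirmed fix: Google documents support for the * wildcard (and a trailing $ anchor) in robots.txt paths, so rules such as "Disallow: /*?sort=" and "Disallow: /*&sort=" are candidates. Both patterns are my assumption, not something from the post. Python's urllib.robotparser only does plain prefix matching, so the small matcher below is a hypothetical helper for trying such patterns against the URLs listed above.

```python
import re


def google_style_match(pattern, path):
    """Match a robots.txt path pattern with * wildcards and an optional
    trailing $ anchor. Hypothetical helper, not part of any library."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # '*' matches any run of characters; everything else is literal.
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    if anchored:
        regex += "$"
    # re.match anchors at the start, like robots.txt prefix matching.
    return re.match(regex, path) is not None


# Candidate Disallow patterns (assumptions for illustration only).
patterns = ["/*?sort=", "/*&sort="]

paths = [
    "/brands/Thermolife.html",
    "/brands/Thermolife.html?sort=newest",
    "/tags/health-and-wellnessvitamins?page=3&sort=featured",
]

for path in paths:
    blocked = any(google_style_match(p, path) for p in patterns)
    print(path, "blocked" if blocked else "allowed")
```

Keep in mind the point of the post itself: a Disallow rule like this stops crawling, but it does not by itself remove URLs that are already in the index.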
Great sharing indeed
Thank you so much..