SEO Q&A: How to Prevent Google from Crawling AND Indexing a Page?

One of our valued readers sent in a question asking why one of the site pages he blocked with the Robots.txt Disallow directive is still returned in Google SERPs.

So here’s my detailed explanation of the problem as well as the solution:

To me, blocking pages via Robots.txt has always been more about saving the bot’s time than about actually trying to hide anything. Search bots crawl on a budget: the more "extra" pages you exclude from the very start, the more time the bot will spend finding content-rich pages and including (or updating) them in the index.

What the standard “Disallow” directive cannot do, however, is make Google drop the page from the index. So you may still end up seeing those blocked pages in Google SERPs: Google won’t know what they actually contain, so it will make judgements based on both internal and external references to those pages.

So the natural question raised by the above-mentioned situation is: "How do I make Google ignore those "extra" pages completely, so that it neither wastes crawl time on them nor lists them in SERPs?"

The answer is not as simple as it may seem. The widely used "NoIndex" meta tag won’t work because Google won’t see it: the page is blocked by Robots.txt, so Google can’t fetch it to read the Robots meta tag.
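
To see why the meta tag goes unread, here is a minimal Python sketch of the order in which a polite crawler makes its checks, using the standard library's urllib.robotparser. The example.com host, the page paths and the crawl helper are all hypothetical, and real Googlebot behavior is more involved; the only point is that the Robots.txt check happens before any HTML is fetched.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks one page for every crawler.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private.html
"""

# Hypothetical page bodies standing in for real HTTP responses.
PAGES = {
    "/private.html": '<meta name="robots" content="noindex">',
    "/hidden.html": '<meta name="robots" content="noindex">',
    "/public.html": "<title>Public page</title>",
}


def crawl(path, parser, agent="Googlebot"):
    """Mimic the order of checks: robots.txt first, page content second."""
    if not parser.can_fetch(agent, "https://example.com" + path):
        # The crawler stops here, so the noindex meta tag is never read.
        return "blocked, meta tag not seen"
    if 'content="noindex"' in PAGES[path]:
        return "crawled, noindex seen"
    return "crawled, indexable"


parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for path in PAGES:
    print(path, "->", crawl(path, parser))
```

In this sketch only /hidden.html, which is not disallowed, ever has its NoIndex tag read; /private.html carries the same tag, but the crawler never gets far enough to see it.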

There are two other possible solutions though:

1. Use the Robots.txt Disallow directive and then use the URL removal tool within Google Webmaster Tools;

2. Use the Robots.txt Noindex directive. It is unofficially supported by Google and can be one of the steps to help sculpt PageRank. Used together with Disallow, it blocks the page from being both crawled and indexed:

user-agent: googlebot
noindex: /login.php
disallow: /login.php
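
As a rough way to sanity-check the crawl half of these rules, the sketch below feeds them to Python's standard urllib.robotparser. Two assumptions to keep in mind: the example.com URLs and the "OtherBot" user agent are placeholders, and this parser only understands standard directives, so it skips the Noindex line entirely. Noindex is not part of the Robots.txt standard, so this check says nothing about how Google treats it.

```python
from urllib.robotparser import RobotFileParser

# The rules from the example above, verbatim.
ROBOTS_TXT = """\
user-agent: googlebot
noindex: /login.php
disallow: /login.php
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical URLs on a placeholder host.
login_url = "https://example.com/login.php"
home_url = "https://example.com/index.html"

print(parser.can_fetch("Googlebot", login_url))  # Disallow applies to Googlebot
print(parser.can_fetch("Googlebot", home_url))   # no rule matches this path
print(parser.can_fetch("OtherBot", login_url))   # rules target googlebot only
```

Because the rules are scoped to googlebot, other crawlers remain free to fetch /login.php; use "user-agent: *" instead if the block should apply to every bot.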

Written by: Ann Smarty | My Blog Guest | @seosmarty

Ann Smarty is a blogger and marketer specializing in SEO consulting and guest blogging. Ann's expertise in blogging and tools serves as a base for her writing, tutorials and her guest blogging project, MyBlogGuest.com.


Comments

  1. sunil says:

    Another good way to prevent crawling is Htaccess Authentication.

  2. Alex says:

    Actually the noindex directive is more than unofficially supported.

    If you look at this page from Google http://www.google.com/support/webmasters/bin/answer.py?answer=35303 in a language different from English you can find a really clear statement:

    “L’indice web di Google consente anche l’utilizzo di “Noindex:” in un file robots.txt per impedire persino la visualizzazione di un link ad un URL non sottoposto a scansione nei nostri risultati di ricerca.”

    Roughly translated it sounds like:

    “Google’s web index also allows the use of “Noindex:” in a robots.txt file to prevent even a link to an uncrawled URL from being displayed in our search results.”

    I really don’t know why this statement is available only if you choose a language different from English.

  3. Tag44 says:

    Yes, Htaccess Authentication is one of the best ways to prevent Google from crawling & indexing a particular page of a website.

  4. This was not a good article, Ann, not least because you’re implying that people can actually sculpt PageRank (they cannot).

    The noindex tag directive doesn’t prevent pages from accruing PageRank.

  5. Ankur says:

    ^^ Troll.

    If you block pages, you’re saving your PR for your other important pages, i.e. PR sculpting.

    Awesome entry on wiki btw. (Please, the time you spend editing this thing alone can be used for something more useful…god)

    http://en.wikipedia.org/wiki/User:Michael_Martinez

  6. Ann Smarty says:

    Actually, on second thought I did sound like I was under the impression NoIndex is meant to sculpt PageRank. In reality, the link pointed to the post detailing how to prevent a page from accruing PageRank, and the NoIndex directive was mentioned there as one of the steps to take.

  7. Ann, NoIndex will NOT prevent a page from accruing PageRank. There is currently no mechanism for preventing pages from accruing PageRank.

  8. Yes, this is good information. I will check my website for Google indexing.

  9. ChrisK says:

    Guys,

    I am trying to stop Google from indexing duplicate query-string variations of a URL, like the ones below from Webmaster Tools. How do I set this up in the robots.txt file?

    Pages with duplicate title tags Pages
    Thermolife Products – Sport and Supplements
    Go to URL/brands/Thermolife.html
    Go to URL/brands/Thermolife.html?sort=alphadesc
    Go to URL/brands/Thermolife.html?sort=newest
    Go to URL/brands/Thermolife.html?sort=priceasc
    Go to URL/brands/Thermolife.html?sort=pricedesc

    Sport and Supplements – Products Tagged with 'Health and Wellness/Vitamins'
    Go to URL/tags/health-and-wellnessvitamins?page=3&sort=featured
    Go to URL/tags/health-and-wellnessvitamins?page=3&sort=newest
    Go to URL/tags/health-and-wellnessvitamins?sort=newest

    Sorry about the text above, but I am really trying to resolve this. Please help.

    Regards

    Chris

  10. Rajnikant says:

    Great sharing indeed
    Thank you so much..
