3 Common Uses of Robots.txt File

Despite the existence of fairly self-explanatory Robots.txt standards, all-in-one tutorials and advanced tips, the Robots.txt topic is still often misunderstood and misused. So I decided to sum the topic up with the three most common uses of the file, for people to refer to when at a loss.

Default Robots.txt

The default Robots.txt file basically tells every crawler that it may crawl any directory of the site to its heart's content:

User-agent: *
Disallow:

(which translates as “disallow nothing”)
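If you want to convince yourself that an empty Disallow really means "allow everything", one option is to feed the file to a parser. The sketch below uses Python's standard urllib.robotparser; the bot name, the example.com URL and the helper name can_crawl are placeholders of mine, not part of any standard.

```python
from urllib.robotparser import RobotFileParser

# The default file shown above: an empty Disallow means "disallow nothing".
DEFAULT_ROBOTS_TXT = """\
User-agent: *
Disallow:
"""

def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# "examplebot" and example.com are made-up values for illustration.
print(can_crawl(DEFAULT_ROBOTS_TXT, "examplebot", "http://example.com/any/page.html"))  # True
```

This only shows how one particular parser reads the file; individual crawlers may behave differently.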

A frequently asked question here is why use it at all. Well, it is not required, but it is recommended for the simple reason that search bots will request it anyway (which means you'll see 404 errors in your log files from bots requesting your non-existent Robots.txt page). Besides, having a default Robots.txt ensures there won't be any misunderstandings between your site and a crawler.

Robots.txt Blocking Specific Folders / Content

The most common usage of Robots.txt is to ban crawlers from visiting private folders or content that gives them no additional information. This is done primarily to save the crawler's time: bots crawl on a budget, so if you make sure they don't waste time on unnecessary content, they will crawl your site deeper and quicker.

Samples of Robots.txt files blocking specific content (note: I have highlighted only a few of the most basic cases):

User-agent: *
Disallow: /database/

(blocks all crawlers from the /database/ folder)

User-agent: *
Disallow: /*?

(blocks all crawlers from all URLs containing ?)
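The * wildcard in a path is an extension that the big search engines honor rather than part of the original standard, and simple prefix-matching parsers (Python's urllib.robotparser among them) do not interpret it. To see which URLs a wildcard rule would catch, a rough sketch is to translate the pattern into a regular expression. The helper names below are hypothetical, and the matching rules ('*' for any run of characters, a trailing '$' for end of URL) are my assumption about how wildcard-aware crawlers behave.

```python
import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex.

    Assumed semantics: '*' matches any run of characters and a
    trailing '$' anchors the pattern to the end of the URL.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*'
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_blocked(pattern: str, path: str) -> bool:
    # Rules match from the start of the path, hence re.match
    return pattern_to_regex(pattern).match(path) is not None

print(is_blocked("/*?", "/search?q=robots"))  # True
print(is_blocked("/*?", "/about.html"))       # False
```

Running a handful of your own URLs through a check like this before publishing a wildcard rule is a cheap way to catch a pattern that blocks more than intended.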

User-agent: *
Disallow: /navy/
Allow: /navy/about.html

(blocks all crawlers from the /navy/ folder but allows access to one page in this folder)

Note from John Mueller commenting below:

The “Allow:” statement is not a part of the robots.txt standard (it is however supported by many search engines, including Google)
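Because Allow is an extension, parsers can disagree on what happens when an Allow and a Disallow rule both match a URL. Google documents that the most specific (longest) matching rule wins, while Python's urllib.robotparser, for one, applies the first matching rule in file order and would therefore block /navy/about.html in the sample above. Here is a minimal sketch of the longest-match approach, with a hypothetical helper name and rules given as plain tuples; it is an illustration, not any engine's actual code.

```python
def longest_match_allows(rules, path):
    """Decide access using the longest matching path prefix.

    rules: list of (directive, path_prefix) tuples,
           e.g. ("Disallow", "/navy/").
    """
    best_len = -1
    allowed = True  # no matching rule means the path is allowed
    for directive, prefix in rules:
        if prefix and path.startswith(prefix):
            is_allow = directive.lower() == "allow"
            # Longer prefix wins; on a tie, prefer Allow
            if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
                best_len = len(prefix)
                allowed = is_allow
    return allowed

NAVY_RULES = [("Disallow", "/navy/"), ("Allow", "/navy/about.html")]
print(longest_match_allows(NAVY_RULES, "/navy/about.html"))  # True
print(longest_match_allows(NAVY_RULES, "/navy/fleet.html"))  # False
print(longest_match_allows(NAVY_RULES, "/index.html"))       # True
```

If you need the exception to work for crawlers that read rules in order, putting the Allow line before the Disallow line is the safer layout.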

Robots.txt Allowing Access to Specific Crawlers

Some people choose to save bandwidth and allow access only to those crawlers they care about (e.g. Google, Yahoo and MSN). In this case, the Robots.txt file should list each of those robots, followed by the directive that applies to it:

User-agent: *
Disallow: /

User-agent: googlebot
Disallow:

User-agent: slurp
Disallow:

User-agent: msnbot
Disallow:

(the first part blocks all crawlers from everything, while the following 3 blocks list those 3 crawlers that are allowed to access the whole site)
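A quick way to check a whitelist like this is to run a few user-agent names through a parser. The sketch below again uses Python's urllib.robotparser, which falls back to the * group only when no named group matches the crawler, wherever the * group sits in the file. "somebot" and the example.com URL are made-up values, and real crawlers may resolve groups differently.

```python
from urllib.robotparser import RobotFileParser

# The whitelist file shown above.
WHITELIST_ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: googlebot
Disallow:

User-agent: slurp
Disallow:

User-agent: msnbot
Disallow:
"""

def allowed_agents(robots_txt, agents, url):
    """Map each user-agent name to whether it may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}

result = allowed_agents(
    WHITELIST_ROBOTS_TXT,
    ["googlebot", "slurp", "msnbot", "somebot"],  # "somebot" is hypothetical
    "http://example.com/page.html",
)
print(result)
# {'googlebot': True, 'slurp': True, 'msnbot': True, 'somebot': False}
```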

Need Advanced Robots.txt Usage?

I tend to recommend that people refrain from doing anything too tricky in their Robots.txt file unless they know the topic inside out. A messed-up Robots.txt file can ruin a project launch.

Many people spend weeks and months trying to figure out why their site is ignored by crawlers until they realize (often with some external help) that they have misused their Robots.txt file. A better way to control crawler activity may be to rely on on-page solutions (robots meta tags) instead. Aaron did a great job summing up the difference in his guide (bottom of the page).

Written By:
Ann Smarty | My Blog Guest | @seosmarty

Ann Smarty is a blogger and marketer specializing in SEO consulting and guest blogging. Ann's expertise in blogging and tools serves as a base for her writing, tutorials and her guest blogging project, MyBlogGuest.com

Comments

  1. Dave says:

    I believe the biggest misconception is the fact that many people believe that a robots.txt can stop a bot from crawling your content..

    Nice wrap up !

    Dave

  2. Hi Ann!

    Congratulations! You always write good posts. I’m from Brazil and now following you on twitter.

  3. JohnMu says:

    I just wanted to mention 2 small technicalities:

    - The “Allow:” statement is not a part of the robots.txt standard (it is however supported by many search engines, including Google)

    - Listing multiple user-agents in the same “block” is technically incorrect and may result in search engines only applying the disallow/allow directives to the last user-agent listed. In practice, some search engines may handle it in the way that you intend, but it’s not guaranteed and it may change over time.

    John

  4. Ann Smarty says:

    @John, thanks tons for stopping by to comment! I have edited the post to include your valuable notes.

  5. Neetish says:

    Oh Ann, uve really made SEO a lot easier since ive started following you ………. thanks and continue enlightening.

  6. Hi Ann! Thought you might want to note the protocol that points to your XML sitemap (useful for Google, Yahoo, and particularly Live Search). It goes in as the final entry in the robots.txt file:
    http://sitemaps.org/protocol.php#submit_robots
    Hope all is well with you
    Charlie

  7. Ann Smarty says:

    @Charlie, many thanks for stopping by! Haven’t spoken to you for ages. Thanks for the link by the way.

  8. John says:

    Ann, The section “Robots.txt Allowing Access to Specific Crawlers” is in the wrong order. I have read that the order doesn’t matter and that it does. I have not read an official post from Google or MSN about this so I err on the side of caution and follow this rule: Robots read the file top to bottom and follow the first directive that includes them. In this case all robots will see the
    User-agent: *
    Disallow: /
    and not crawl the site. The catchall User-agent: * should be after any specific robots entries.

  9. Match says:

    As always a very informative article. Thanks for the great stuff.

  10. Ann Smarty says:

    @John, I’d appreciate if you give me a link discussing this. What I know is that robots follow only their specific directives and ignore all the rest. You can see it discussed here: http://www.searchenginejournal.com/robotstxt-4-things-you-should-know/7292/

  11. Brian says:

    I was just talking about more advanced robots.txt syntax this morning and then your post popped into my RSS reader, perfect.

    So if I’m interpreting correctly, if i wanted to disallow all URLs containing “src”, I would do the following:

    User-agent: *
    Disallow: /*src

    Is that correct?

    Thanks!

  12. Ann Smarty says:

    @Brian, correct.

    Great to hear it was timely ;)

  13. I have spent much time tweaking my robots.txt file to try to exclude things which will cause duplicate content and to get the most out of the Google juice.

    I am not an expert in this field by any means. I have tried to find someone who is an expert in this area, but most people who respond come up with just some very basic stuff. Makes me wonder if there are any experts out there or are they keeping the robots.txt configuration a secret?

  14. art jewelry says:

    Thanks for the excellent explanation and comments.

  15. prasad says:

    Hi,

    I have written following code in my robot.txt to allow all crawlers.

    User-agent: *
    Allow: /

    Is this correct??

    Regards
    prasad