Despite the existence of fairly self-explanatory Robots.txt standards, all-in-one tutorials and advanced tips, the Robots.txt topic is still often misunderstood and misused. I therefore decided to sum the topic up by giving the three most common uses of the file for people to refer to when at a loss.
Default Robots.txt
The default Robots.txt file basically tells every crawler that it may crawl any directory of the web site to its heart's content:
User-agent: *
Disallow:
(which translates as “disallow nothing”)
A frequently asked question here is why use it at all. Well, it is not required, but it is recommended for the simple reason that search bots will request it anyway (which means you'll see 404 errors in your log files from bots requesting your non-existent Robots.txt page). Besides, having a default Robots.txt ensures there won't be any misunderstandings between your site and a crawler.
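If you want to check this yourself, here is a minimal sketch using Python's standard urllib.robotparser module. It only shows how that one parser reads the default file; the bot name "ExampleBot" and the sample paths are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The default file from above: an empty Disallow value blocks nothing.
DEFAULT_ROBOTS_TXT = """\
User-agent: *
Disallow:
"""


def is_allowed(robots_txt, user_agent, path):
    """Return True if user_agent may fetch path under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # "ExampleBot" and both paths are hypothetical.
    for path in ("/", "/private/report.html"):
        print(path, is_allowed(DEFAULT_ROBOTS_TXT, "ExampleBot", path))
```

Every path should come back as allowed, which is exactly what "disallow nothing" means.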
Robots.txt Blocking Specific Folders / Content
The most common use of Robots.txt is to ban crawlers from visiting private folders or content that gives them no additional information. This is done primarily to save the crawler's time: bots crawl on a budget, so if you ensure that they don't waste time on unnecessary content, they will crawl your site deeper and quicker.
Samples of Robots.txt files blocking specific content (note: I have highlighted only a few of the most basic cases):
User-agent: *
Disallow: /database/
(blocks all crawlers from the /database/ folder)
User-agent: *
Disallow: /*?
(blocks all crawlers that support the * wildcard, such as Google, from all URLs containing ?)
User-agent: *
Disallow: /navy/
Allow: /navy/about.html
(blocks all crawlers from the /navy/ folder but allows access to one page in this folder)
Note from John Mueller commenting below:
The “Allow:” statement is not a part of the robots.txt standard (it is however supported by many search engines, including Google)
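To see how the three samples above play out, here is a rough sketch of a rule matcher. It assumes the behaviour Google documents (and RFC 9309 describes): * matches any run of characters, a trailing $ anchors the end of the URL, the longest matching rule wins, and Allow wins a tie. Other crawlers may not follow these rules, and the function names and the /navy/ships.html path are my own, purely for illustration.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a regex matched from the URL start.

    '*' matches any run of characters; a trailing '$' anchors the end of the URL.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(rules, path):
    """rules: list of (directive, pattern) pairs from one User-agent group.

    Assumed precedence: the longest matching pattern wins, Allow wins a tie,
    and a path that matches no rule is allowed.
    """
    best_len, allowed = -1, True
    for directive, pattern in rules:
        if not pattern:  # an empty Disallow blocks nothing
            continue
        if pattern_to_regex(pattern).match(path):
            is_allow = directive.lower() == "allow"
            if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
                best_len, allowed = len(pattern), is_allow
    return allowed


if __name__ == "__main__":
    navy_rules = [("Disallow", "/navy/"), ("Allow", "/navy/about.html")]
    print(is_allowed(navy_rules, "/navy/about.html"))   # the one allowed page
    print(is_allowed(navy_rules, "/navy/ships.html"))   # hypothetical blocked page
    print(is_allowed([("Disallow", "/*?")], "/page?id=1"))
```

Under these assumptions the about page is allowed, any other page in /navy/ is blocked, and any URL containing ? is blocked.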
Robots.txt Allowing Access to Specific Crawlers
Some people choose to save bandwidth and allow access only to those crawlers they care about (e.g. Google, Yahoo and MSN). In this case, the Robots.txt file should list each of those robots, followed by the command for it:
User-agent: *
Disallow: /
User-agent: googlebot
Disallow:
User-agent: slurp
Disallow:
User-agent: msnbot
Disallow:
(the first part blocks all crawlers from everything, while the following 3 blocks list those 3 crawlers that are allowed to access the whole site)
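Here is a quick way to try this file out, sketched with Python's standard urllib.robotparser. It shows how that particular parser reads the file, not how any given search engine does; "SomeOtherBot" is a made-up crawler name. The second copy of the file moves the catch-all group to the end so both orderings can be compared.

```python
from urllib.robotparser import RobotFileParser

# The file from above, with the catch-all group first.
CATCH_ALL_FIRST = """\
User-agent: *
Disallow: /
User-agent: googlebot
Disallow:
User-agent: slurp
Disallow:
User-agent: msnbot
Disallow:
"""

# Same groups, with the catch-all group moved to the end.
CATCH_ALL_LAST = """\
User-agent: googlebot
Disallow:
User-agent: slurp
Disallow:
User-agent: msnbot
Disallow:
User-agent: *
Disallow: /
"""


def can_crawl(robots_txt, user_agent, path="/"):
    """Return True if user_agent may fetch path according to robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # "SomeOtherBot" is a hypothetical crawler that is not listed in the file.
    for bot in ("googlebot", "slurp", "msnbot", "SomeOtherBot"):
        print(bot, can_crawl(CATCH_ALL_FIRST, bot), can_crawl(CATCH_ALL_LAST, bot))
```

With this parser the three named crawlers are allowed and everything else is blocked, whichever position the catch-all group is in. Other parsers are not guaranteed to behave the same way.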
Need Advanced Robots.txt Usage?
I tend to recommend that people refrain from doing anything too tricky in their Robots.txt file unless they are 100% knowledgeable in the topic. A messed-up Robots.txt file can ruin a project launch.
Many people spend weeks and months trying to figure out why their site is ignored by crawlers until they realize (often with some external help) that they have misused their Robots.txt file. A better solution for controlling crawler activity might be to rely on on-page solutions (robots meta tags). Aaron did a great job summing up the difference in his guide (bottom of the page).
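For reference, a robots meta tag sits in the head of an individual page, e.g. <meta name="robots" content="noindex, follow">. Below is a small sketch, using Python's standard html.parser, that pulls those directives out of a page so you can check what a page is actually telling crawlers. The class name, function name and sample page are hypothetical.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # Directives are comma-separated, e.g. "noindex, follow".
        self.directives.update(
            token.strip().lower() for token in content.split(",") if token.strip()
        )


def robots_directives(html):
    """Return the set of robots meta directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# A made-up page used only to exercise the sketch.
SAMPLE_PAGE = """<html><head>
<meta name="robots" content="noindex, follow">
<title>Hypothetical page</title>
</head><body>...</body></html>"""

if __name__ == "__main__":
    print(sorted(robots_directives(SAMPLE_PAGE)))
```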





I believe the biggest misconception is that many people believe a robots.txt file can stop a bot from crawling your content.
Nice wrap-up!
Dave
Hi Ann!
Congratulations! You always write good posts. I’m from Brazil and now following you on twitter.
I just wanted to mention 2 small technicalities:
- The “Allow:” statement is not a part of the robots.txt standard (it is however supported by many search engines, including Google)
- Listing multiple user-agents in the same “block” is technically incorrect and may result in search engines only applying the disallow/allow directives to the last user-agent listed. In practice, some search engines may handle it in the way that you intend, but it’s not guaranteed and it may change over time.
John
@John, thanks tons for stopping by to comment! I have edited the post to include your valuable notes.
Oh Ann, you've really made SEO a lot easier since I started following you. Thanks, and continue enlightening.
Hi Ann! Thought you might want to note the protocol that points to your XML sitemap (useful for Google, Yahoo, and particularly Live Search). The final entry in the robots.txt file:
http://sitemaps.org/protocol.php#submit_robots
Hope all is well with you
Charlie
@Charlie, many thanks for stopping by! Haven’t spoken to you for ages. Thanks for the link by the way.
Ann, the section “Robots.txt Allowing Access to Specific Crawlers” is in the wrong order. I have read both that the order doesn’t matter and that it does. I have not read an official post from Google or MSN about this, so I err on the side of caution and follow this rule: robots read the file top to bottom and follow the first directive that includes them. In this case all robots will see the
User-agent: *
Disallow: /
and not crawl the site. The catchall User-agent: * should be after any specific robots entries.
As always a very informative article. Thanks for the great stuff.
@John, I’d appreciate if you give me a link discussing this. What I know is that robots follow only their specific directives and ignore all the rest. You can see it discussed here: http://www.searchenginejournal.com/robotstxt-4-things-you-should-know/7292/
I was just talking about more advanced robots.txt syntax this morning and then your post popped into my RSS reader, perfect.
So if I’m interpreting correctly, if i wanted to disallow all URLs containing “src”, I would do the following:
User-agent: *
Disallow: /*src
Is that correct?
Thanks!
@Brian, correct.
Great to hear it was timely ;)
I have spent much time tweaking my robots.txt file to try to exclude things which will cause duplicate content and to get the most out of the Google juice.
I am not an expert in this field by any means. I have tried to find someone who is an expert in this area, but most people who respond come up with just some very basic stuff. Makes me wonder if there are any experts out there or are they keeping the robots.txt configuration a secret?
Thanks for the excellent explanation and comments.
Hi,
I have written the following code in my robot.txt to allow all crawlers.
User-agent: *
Allow: /
Is this correct??
Regards
prasad