Despite the existence of fairly self-explanatory Robots.txt standards, all-in-one tutorials and advanced tips, the Robots.txt topic is still often misunderstood and misused. I therefore decided to sum the topic up with the three most common uses of the file, for people to refer to when at a loss.
The default Robots.txt file tells every crawler that it may crawl any directory on the site to its heart’s content:
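As a minimal sketch, assuming the file is served from the root of the site as /robots.txt, the default version is just two lines:

```text
User-agent: *
Disallow:
```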
(which translates as “disallow nothing”)
A frequently asked question here is why use it at all. It is not required, but it is recommended for the simple reason that search bots will request it anyway (without one, you’ll see 404 errors in your log files from bots requesting your non-existent Robots.txt page). Besides, having a default Robots.txt ensures there won’t be any misunderstandings between your site and a crawler.
Robots.txt Blocking Specific Folders / Content:
The most common use of Robots.txt is to ban crawlers from visiting private folders or content that gives them no additional information. This is done primarily to save the crawler’s time: a bot crawls on a budget, so if you ensure that it doesn’t waste time on unnecessary content, it will crawl your site deeper and quicker.
Samples of Robots.txt files blocking specific content (note: I have highlighted only a few of the most basic cases):
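A sketch of the simplest case, a single folder closed to every crawler, using /database/ as the example path:

```text
User-agent: *
Disallow: /database/
```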
(blocks all crawlers from the /database/ folder)
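A sketch of pattern-based blocking. One assumption to keep in mind: the * wildcard is an extension that the major search engines understand, and a crawler that does not support it may read the line as a literal path instead:

```text
User-agent: *
Disallow: /*?
```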
(blocks all crawlers from all URLs containing ?)
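A sketch of a folder block with a single exception. The page name /navy/about.html is hypothetical, chosen only for illustration, so substitute the page you want to keep crawlable. The Allow line comes first because some parsers apply the first rule that matches:

```text
User-agent: *
# /navy/about.html is a made-up page name, used here only for illustration
Allow: /navy/about.html
Disallow: /navy/
```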
(blocks all crawlers from the /navy/ folder but allows access to one page from this folder)
Note from John Mueller commenting below:
The “Allow:” statement is not a part of the robots.txt standard (it is however supported by many search engines, including Google)
Robots.txt Allowing Access to Specific Crawlers
Some people choose to save bandwidth and allow access to only those crawlers they care about (e.g. Google, Yahoo and MSN). In this case, the Robots.txt file should name each of those robots, followed by the directive that applies to it:
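A sketch of such a file, assuming the user-agent tokens Googlebot (Google), Slurp (Yahoo) and msnbot (MSN). Tokens change over time, so check each engine's documentation before relying on them:

```text
# Everyone else: stay out of the whole site
User-agent: *
Disallow: /

# Named crawlers: nothing is disallowed
User-agent: Googlebot
Disallow:

User-agent: Slurp
Disallow:

User-agent: msnbot
Disallow:
```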
(the first part blocks all crawlers from everything, while the following 3 blocks list those 3 crawlers that are allowed to access the whole site)
Need Advanced Robots.txt Usage?
I tend to recommend that people refrain from doing anything too tricky in their Robots.txt file unless they are 100% knowledgeable on the topic. A messed-up Robots.txt file can result in a botched project launch.
Many people spend weeks and months trying to figure out why their site is ignored by crawlers until they realize (often with some external help) that they have misused their Robots.txt file. A better way to control crawler activity might be to make do with on-page solutions (robots meta tags). Aaron did a great job summing up the difference in his guide (bottom of the page).