Ann Smarty

Robots.txt: 4 Things You Should Know

July 14th, 2008 by Ann Smarty | 11 Comments

Robots.txt has become a widely used method of controlling how your site is crawled, so it is one of the first things I check when diagnosing on-site issues. While almost every webmaster knows the basics nowadays, a few things still cause misunderstandings:

  1. Robots.txt can prevent bots from crawling a page or directory, but not from indexing or ranking the URL when it is found via external references. In that case search engines use information from those external sources to make judgments about the page and to build the snippet (the title and description that appear in search results).
  2. If you have both a general section (i.e. the wildcard User-agent: *) and specific ones (e.g. User-agent: googlebot), keep in mind that Google (and other crawlers) will follow only the most specific section that matches and ignore all the others, including the general one. So repeat every directive from the general section in each specific one.
     [Image: Robots.txt specific directions]

  3. Matching goes from left to right, meaning that crawlers are blocked from any URL whose path begins with the disallowed pattern. So if you have blocked the yoursite.com/a directory with Disallow: /a, for example, keep in mind that you are also blocking every directory and page under the root whose name starts with ‘a’ (e.g. yoursite.com/about); add the trailing slash (Disallow: /a/) to block only the directory. A related case was described in a recent WebmasterWorld thread.
  4. To be on the safe side, keep a Robots.txt file even if you do not want to include any (specific) directions; leave it empty or use the default:

    User-agent: *
    Disallow:

    This way you make sure that:

  • all search crawlers understand what you mean correctly;
  • there are no extra 404 errors in your log files from bots requesting your non-existent Robots.txt page;
  • bots won’t hold off on crawling your site because they couldn’t reach your Robots.txt (highly improbable, but it might happen anyway).
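
To make the section rule and the left-to-right matching more concrete, here is a minimal Python sketch. It is only an illustration, not how any search engine actually implements its parser: the function names `parse_groups` and `is_blocked` and the sample file are made up for this example, it ignores Allow lines and the `*` / `$` path wildcards, and it assumes that a bot picks the longest user-agent name contained in its own name.

```python
def parse_groups(text):
    """Split robots.txt text into {user_agent: [disallowed path prefixes]}."""
    groups = {}
    current_agents = []
    seen_rule = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if ":" not in line:
            continue                          # blank or unrecognised line
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if seen_rule:                     # a rule closed the previous section
                current_agents, seen_rule = [], False
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field == "disallow":
            seen_rule = True
            if value:                         # an empty Disallow blocks nothing
                for agent in current_agents:
                    groups[agent].append(value)
    return groups


def is_blocked(groups, user_agent, path):
    """Pick only the most specific section, then prefix-match the path."""
    ua = user_agent.lower()
    named = [agent for agent in groups if agent != "*" and agent in ua]
    if named:
        rules = groups[max(named, key=len)]   # specific section wins outright
    else:
        rules = groups.get("*", [])           # fall back to the general section
    return any(path.startswith(prefix) for prefix in rules)


# Hypothetical sample file, invented for this illustration.
SAMPLE = """\
User-agent: *
Disallow: /a
Disallow: /private/

User-agent: googlebot
Disallow: /tmp/
"""

groups = parse_groups(SAMPLE)
print(is_blocked(groups, "SomeBot", "/about"))         # True: "/about" starts with "/a"
print(is_blocked(groups, "Googlebot", "/private/x"))   # False: general section is ignored
print(is_blocked(groups, "Googlebot", "/tmp/x"))       # True: rule from its own section

default = parse_groups("User-agent: *\nDisallow:\n")
print(is_blocked(default, "SomeBot", "/anything"))     # False: nothing is disallowed
```

In this sketch the googlebot section does not inherit /private/ from the general one, which is why the general directives have to be repeated in every specific section.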


    Comments

    11 responses so far ↓

    • andymurd on Jul 14, 2008 at 11:25 am

      Great post, Ann. I didn’t know about #3, I’d better get checking my files.

    • g1smd on Jul 14, 2008 at 11:44 am

      Two errors in item 4:

      *** 4. To be on the safe side, you’d better have Robots.txt file even if you do not want to include any (specific) directions - let it be empty or default (user-agent: * allow:) in this case. ***

      – The two items need to be on separate lines, with at least one blank line after the last item.

      – Don’t use “Allow:”, use “Disallow:”

      User-agent: *
      Disallow:

      That will disallow nothing (i.e. “allow” everything).

      If anything, disallow your /robots.txt file, so that it isn’t *indexed* and cannot appear in the SERPs.

    • Ann Smarty on Jul 14, 2008 at 1:11 pm

      @g1smd: corrected, thank you!

    • Garrett Pierson on Jul 14, 2008 at 2:15 pm

      Wonderful tips Ann! Especially tip #4, “you’d better have Robots.txt file even if you do not want to include any (specific) directions”. I totally agree with this and have always taught my clients and students how important this is. Thanks again for the great content and tips!

    • Marcel on Jul 14, 2008 at 7:06 pm

      Good overview. I learned a couple of things - thanks.

    • istioselida on Jul 15, 2008 at 2:05 am

      very useful post, i will use your tips to add the file in my sites!

    • Software Testing on Jul 15, 2008 at 2:50 am

      Thanks Ann!

    • zemin temizligi on Jul 15, 2008 at 5:13 am

      thank you….

    • J Solutions 網頁設計 on Jul 15, 2008 at 6:46 am

      robots.txt can be very harmful if set up wrongly, so I would recommend using Google Webmaster Tools to test it before uploading it to your website.

    • John H Gohde on Jul 15, 2008 at 7:29 am

      What I find to be the most overlooked tidbit about robots.txt files is that if you want to remove a webpage from Google, excluding it in your robots.txt can actually be counterproductive.

    • Shanghai Web Hosting on Jul 15, 2008 at 9:51 pm

      Nice tips, thanks for sharing. Robots.txt is one of those small things that are usually ignored by webmasters.
