Robots.txt: 4 Things You Should Know

Robots.txt has become a widely used way to control how your site is crawled, so it is one of the first things I check when diagnosing on-site issues. While almost every webmaster knows the basics nowadays, some things still cause misunderstandings:

  1. Robots.txt can prevent bots from crawling a page or directory, but not from indexing or ranking the URL when it is found via external references. In that case search engines will use information from those external sources to make judgments about the page and to build the snippet (the title and description that appear in search results).
  2. If you have both a general (i.e. wildcard *) section and specific ones (e.g. User-agent: googlebot), keep in mind that Google (and other crawlers) will follow only the most specific section that matches and ignore all the others, including the general one. So repeat all the directives from the general section in every specific one.
  3. Matching goes from left to right, meaning that crawlers are blocked from anything that begins with the /pattern you disallow. So if you have blocked a directory as /a, for example, keep in mind that you are also blocking every directory and page under the root whose name starts with ‘a’. A related case was described in a recent WebmasterWorld thread.
  4. To be on the safe side, it is better to have a Robots.txt file even if you do not want to include any (specific) directions. Leave it empty or use the default:

    User-agent: *

    This way you make sure that:

  • all search crawlers understand what you mean correctly;
  • there are no extra 404 errors in your log files from bots requesting your non-existent Robots.txt page;
  • bots won’t hold off on crawling your site because they couldn’t reach your Robots.txt (highly improbable, but it can happen).
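
The left-to-right matching described above is easy to check for yourself. Here is a minimal sketch using Python's standard urllib.robotparser module as a stand-in for a real crawler; the Disallow: /a rule and the sample paths are made up for illustration, and individual search engines may differ in details such as wildcard support.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: the intent is to block a directory named "a",
# but the pattern has no trailing slash.
RULES = """\
User-agent: *
Disallow: /a
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Illustrative paths; only the last one does not begin with "/a".
paths = ["/a/page.html", "/about.html", "/archive/", "/blog/"]
blocked = {path: not parser.can_fetch("*", path) for path in paths}

for path, is_blocked in blocked.items():
    print(f"{path}: {'blocked' if is_blocked else 'allowed'}")
```

Everything whose path begins with /a comes back as blocked, not only the contents of the directory itself. Writing the rule with a trailing slash (Disallow: /a/) limits the match to that directory.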

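The point about general and specific user-agent sections can be sketched the same way. Again this relies on Python's urllib.robotparser rather than on any search engine's actual parser, and the rules, bot names and paths below are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: the general section blocks /private/, but the
# googlebot section does not repeat that directive.
RULES = """\
User-agent: *
Disallow: /private/

User-agent: googlebot
Disallow: /drafts/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# googlebot matches its own section, so the general one is ignored for it.
googlebot_private = parser.can_fetch("googlebot", "/private/report.html")
googlebot_drafts = parser.can_fetch("googlebot", "/drafts/post.html")
# A bot without a section of its own falls back to the general rules.
otherbot_private = parser.can_fetch("otherbot", "/private/report.html")

print("googlebot may fetch /private/:", googlebot_private)
print("googlebot may fetch /drafts/:", googlebot_drafts)
print("otherbot may fetch /private/:", otherbot_private)
```

In this sketch googlebot is allowed into /private/ because its own section never mentions it, which is exactly why the general directives need to be repeated in each specific section.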
