The “/robots.txt” file is a text file, with one or more records. Usually contains a single record looking like this:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
In this example, three directories are excluded.
Note that you need a separate “Disallow” line for every URL prefix you want to exclude — you cannot say “Disallow: /cgi-bin/ /tmp/” on a single line. Also, you may not have blank lines in a record, as they are used to delimit multiple records.
Note also that globbing and regular expression are not supported in either the User-agent or Disallow lines. The ‘*’ in the User-agent field is a special value meaning “any robot”. Specifically, you cannot have lines like “User-agent: *bot*”, “Disallow: /tmp/*” or “Disallow: *.gif”.
What you want to exclude depends on your server. Everything not explicitly disallowed is considered fair game to retrieve. Here follow some examples:
To exclude all robots from the entire server
User-agent: *
Disallow: /
To allow all robots complete access
User-agent: *
Disallow:
(or just create an empty “/robots.txt” file, or don’t use one at all)
To exclude all robots from part of the server
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
To exclude a single robot
User-agent: BadBot
Disallow: /
To allow a single robot
User-agent: Google
Disallow:
User-agent: *
Disallow: /
To exclude all files except one
This is currently a bit awkward, as there is no “Allow” field. The easy way is to put all files to be disallowed into a separate directory, say “stuff”, and leave the one file in the level above this directory:
User-agent: *
Disallow: /~joe/stuff/
Alternatively you can explicitly disallow all disallowed pages:
User-agent: *
Disallow: /~joe/junk.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html
Source:http://www.robotstxt.org
Tags: robots.txt rules

March 30th, 2010 at 9:49 pm
Useful info, i was looking for this from days. Thanks dude…..
[Reply]