check for robots.txt
Sometimes it is useful to check whether a given HTTP server has a robots.txt file. If it exists, it may disclose interesting information that is useful for a pentest.
From Wikipedia:
The robots exclusion standard or robots.txt protocol is a convention to prevent cooperating web spiders and other web robots from accessing all or part of a website. The information specifying the parts that should not be accessed is specified in a file called robots.txt in the top-level directory of the website.
Here is a script that checks for the presence of the file in a list of hosts (you can download the source code). Two main parts can be distinguished: command line parsing and file download.
You can call the script in two different ways. Either you do not specify the protocol (and HTTP will be used):
./robots.sh ...
Or you specify the protocol with:
./robots.sh -p [http|https] ...
Let’s see how this is done:
PROTO=( http https )
HTTP=${PROTO[0]}
FILE=/tmp/robots.txt
# command line parsing
if [ "-p" == "$1" ]
then
    # accept the protocol only if it is one of the allowed values
    for bar in "${PROTO[@]}"
    do
        if [ "$bar" == "$2" ]
        then
            HTTP=$2
            HOSTS=${*:3}
        fi
    done
else
    HOSTS=$*
fi
We check whether the first argument is “-p”, in which case the next argument should be one of the allowed values (those in the $PROTO array). If so, we strip the first two parameters and put everything else in the $HOSTS variable. If the protocol is not one of the allowed values, $HOSTS is never set and no host is checked. At the end of the code above, $HTTP contains either http or https and $HOSTS holds the list of hosts whose robots.txt file we want to look for.
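The parsing step described above can also be sketched with a case statement that rejects an unknown protocol instead of silently doing nothing. This is only an illustration: parse_args and the usage message are hypothetical names that do not appear in the original script, and the host names are placeholders.

```shell
# Hypothetical helper (not part of the original script).
# Sets HTTP to the protocol and HOSTS to the space-separated host list.
parse_args() {
    HTTP=http                      # default protocol, as in the original
    HOSTS=
    if [ "${1:-}" = "-p" ]; then
        case "${2:-}" in
            http|https) HTTP=$2 ;;
            *)  echo "usage: robots.sh [-p http|https] host..." >&2
                return 1 ;;
        esac
        shift 2                    # drop "-p" and the protocol
    fi
    HOSTS=$*
}

# Placeholder hosts, for illustration only.
parse_args -p https www.example.com www.example.org
echo "$HTTP: $HOSTS"
```

Returning a non-zero status lets the caller stop early when the protocol is not recognised.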
Once we know which protocol we are using and the list of targets, the only thing left is to try to download the robots.txt file from each server:
for foo in $HOSTS; do
    echo "================"
    echo "Server: $foo ($HTTP)"
    # first HTTP status code reported by wget (its log goes to standard error)
    CODE=$(wget -O "$FILE" "$HTTP://$foo/robots.txt" 2>&1 | grep HTTP | head -1 | awk '{print $6}')
    echo "Code: $CODE"
    # quoting $CODE keeps the test valid when no status code was captured
    if [ "200" == "$CODE" ]
    then
        echo "Contents:"
        echo "----------------"
        cat "$FILE"
        rm "$FILE"
        echo "----------------"
    fi
done
If the response code is 200 OK, we cat the file to standard output. Otherwise we just move on to the next target in the list. The only tricky bit of the previous code is:
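wget -O "$FILE" "$HTTP://$foo/robots.txt" 2>&1 | grep HTTP | head -1 | awk '{print $6}'
Here we try to download the file, saving it to the location specified by $FILE. wget writes its log, including the HTTP status code, to standard error, so we redirect standard error to standard output using 2>&1. grep and head then pick the first status line, and awk prints its sixth field, which is the code.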
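To see why the sixth field is the status code, the same pipeline can be run on a sample status line instead of a live download. This is a small sketch: first_status is a hypothetical helper name, and the sample text is typed in by hand in the format wget uses for its status line, not captured from a real server.

```shell
# Hypothetical helper: reads a wget log on standard input and prints
# the status code found on the first line that contains "HTTP".
first_status() {
    grep HTTP | head -1 | awk '{print $6}'
}

# Fields: HTTP(1) request(2) sent,(3) awaiting(4) response...(5) 200(6) OK(7)
sample='HTTP request sent, awaiting response... 200 OK'
printf '%s\n' "$sample" | first_status
```

If no line matches, the function prints nothing, which is why the comparison against "200" in the loop needs a quoted variable.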
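One last word: wget itself follows redirects, but the script only looks at the first status line (head -1), so a redirected request is reported with its 3xx code and its contents are not shown. The script accepts this on the assumption that a redirect effectively means no robots.txt file is present at the requested location.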
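The effect of head -1 on a redirected request can be illustrated on a hand-written log. The sketch below assumes wget's usual status-line format and uses a placeholder URL. It compares the first status code (what the script reports) with the last one (what a variant using tail -1 would report). The tail -1 variant is only a suggestion, not something the original script does.

```shell
# Hand-written sample of a redirected request (placeholder URL).
log='HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://www.example.com/robots.txt [following]
HTTP request sent, awaiting response... 200 OK'

# What the script reports: the code of the first response.
first=$(printf '%s\n' "$log" | grep HTTP | head -1 | awk '{print $6}')
# Possible variant: the code of the final response after redirects.
last=$(printf '%s\n' "$log" | grep HTTP | tail -1 | awk '{print $6}')

echo "first: $first, last: $last"
```

Whether the final response should count as the host's robots.txt is a judgment call, since the redirect may lead to a different host or to a generic landing page.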




August 27th, 2008 at 3:25 pm
A possible easy way of doing it is:
for i in $(cat File_with_ips)
do
    echo "trying $i"
    nc -nv "$i" 80 < robots.txt
done
Contents of robots.txt (the request must end with an empty line):
GET /robots.txt HTTP/1.0
