Website heavily crawled by Yahoo Slurp
This topic starts with an email I got from the DreamHost support team.
In their email, I was told that I am the heaviest user on the server where I host my site. They also listed the most common IP addresses hitting my domains. By looking up those IP addresses, I was able to identify their owner: Yahoo's crawler, which was crawling my sites extremely frequently.
OK, to avoid getting such mail from DreamHost again, and to keep our sites running smoothly without any breakdowns, I think it is quite helpful to know how to find the causes of heavy usage ourselves.
There are basically two commands:
Enter this command to see the IPs hitting the domain the most:
cat access.log | awk '{print $1}' | sort | uniq -c | sort -n
This command is even more useful in some cases, as it looks only at the last 10,000 hits:
tail -10000 access.log | awk '{print $1}' | sort | uniq -c | sort -n
After running these two commands, you will get something like this, with a hit count next to each IP address and the heaviest hitters at the bottom:
.....
612 122.152.128.16
1277 38.99.13.124
1806 66.94.237.140
1806 66.94.237.141
1823 66.94.237.142
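Since several of the top addresses share the same first three numbers, it can also help to count hits per address range rather than per address. Here is a minimal sketch of that idea; the function name `top_prefixes` is my own, and it assumes the client IPv4 address is the first field of each log line, as in the commands above.

```shell
# Count hits per /24 prefix (the first three octets) instead of per address.
# Assumes the client IPv4 address is the first whitespace-separated field.
# "top_prefixes" is a hypothetical helper name, not a standard tool.
top_prefixes() {
    awk '{print $1}' "$1" | cut -d. -f1-3 | sort | uniq -c | sort -n
}

# Usage: top_prefixes access.log
```

As before, the busiest ranges end up at the bottom of the list.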
From the result above, I need to add a fragment of code to the .htaccess file to block the most common IP addresses: 66.94.237.*
Order allow,deny
allow from all
deny from 66.94.237.
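Before blocking a whole range, it may be worth checking what those addresses call themselves, since a crawler usually names itself in the User-Agent header. The sketch below is only one way to do it: the function name `agents_for_prefix` is made up for illustration, and it assumes the log uses Apache's "combined" format, where the user agent is the last quoted field on each line.

```shell
# List the User-Agent strings sent by addresses that start with a prefix.
# Assumes the "combined" log format, so the user agent is field 6 when
# a line is split on double quotes. "agents_for_prefix" is a hypothetical
# helper name used only for this example.
agents_for_prefix() {
    prefix=$1   # e.g. "66.94.237." (the trailing dot limits it to 66.94.237.x)
    log=$2
    awk -F'"' -v p="$prefix" 'index($1, p) == 1 {print $6}' "$log" |
        sort | uniq -c | sort -n
}

# Usage: agents_for_prefix "66.94.237." access.log
```

If every line it prints mentions the same crawler, the deny rule should not affect ordinary visitors.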