Bot attack

My site has been under a massive attack by bots and crawlers over the last month, basically a result of the site being indexed and getting into the top 100k sites.

The bots are now generating a huge CPU load, making it impossible to browse the site.

I need a way to ban every user agent other than MSN, Google, Yahoo, and a couple of others. I don't care whether they are valid or not; I just don't want them taking any of my resources.

robots.txt and .htaccess have both failed to limit these bad bots.

I'm told there are scripts that run in the background on Linux that can detect bad-bot behavior and ban offenders automatically, adding them to a blocklist and so improving the efficiency of the site.

If you have ideas or scripts like that, please help.

Thanks

17 Replies

Do you mean referral site? Also, couldn't you just use something like fail2ban that would automatically ban that stuff if you configure it right?

As fresbee says, fail2ban would be an okay start: http://edin.no-ip.com/category/tags/fail2ban

It has a bad-bot filter built in if you use Debian.
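For reference, a minimal jail.local sketch to turn it on (assuming the stock apache-badbots filter that ships with the Debian package and the default Apache log location; adjust logpath if yours differs):

[apache-badbots]
enabled  = true
port     = http,https
filter   = apache-badbots
logpath  = /var/log/apache2/access.log
maxretry = 1
bantime  = 86400

maxretry = 1 bans on the first match, since a known bad bot doesn't deserve a second request.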

Thanks a lot.

I am trying out fail2ban.

Hopefully it will cut down on the bots/crawlers.

The /var/log/apache2/access.log file is empty.

Any ideas why it would be blank?

Do I need to enable any config in apache2.conf for the log to start filling?

Thanks in advance

Try checking the conf files in /etc/apache2/sites-available/*

If you have used any of the Linode Library guides, your logs will be located elsewhere.
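Either way, the access log path is whatever the CustomLog directive in the active vhost says; a sketch of a typical stanza (example.com and the file names are placeholders):

<VirtualHost *:80>
    ServerName example.com
    # Per-vhost logs; requests to this vhost never touch the default access.log
    ErrorLog /var/log/apache2/example.com-error.log
    CustomLog /var/log/apache2/example.com-access.log combined
</VirtualHost>

If every vhost logs to its own file, the default access.log can stay empty, which would explain what you're seeing.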

OK folks,

fail2ban is not working.

CPU loads are continuously at elevated levels.

Any more ideas, or a powerful script to ban spiders?

It's still in beta, but you could look at http://www.projecthoneypot.org/httpbl.php
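The http:BL lookup itself is just a DNS query, if you want to test it by hand; a hedged sketch, where 'mykey' stands in for your free API key and 192.0.2.10 is an example address whose octets get reversed:

# Query format: <key>.<reversed-ip>.dnsbl.httpbl.org
host mykey.10.2.0.192.dnsbl.httpbl.org

A 127.x.x.x answer means the address is listed; NXDOMAIN means it isn't.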

You could block the IP addresses with iptables. You will have to look at your logs to find the IP addresses, though…
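A hedged sketch of the iptables commands, with 192.0.2.10 standing in for an offender from your logs:

# Drop a single offending address
iptables -A INPUT -s 192.0.2.10 -j DROP
# Or drop an entire /24 if a whole Class C is hammering you
iptables -A INPUT -s 192.0.2.0/24 -j DROP

Note that these rules vanish on reboot unless you save and restore them.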

Is there any way you can post some logs of what is doing the spikes?

If you find specific IP addresses, you can add them to your .htaccess for the affected websites:

Order Deny,Allow
Deny from xxx.xxx.xxx.xxx

And you can even block at the Class C network level:

Order Deny,Allow
Deny from xxx.xxx.xxx

Do that if you see multiple attempts from different hosts on the same Class C network.

I do this to block IPs that are trying to break through the CAPTCHA on one of my websites.

If we can see some of the logs, or maybe a screenshot or five of htop, we can probably figure something else out.

I'm trying to install AWStats to study the spiders that are creating the problem.

While installing AWStats on Debian, the paths are not all updated in the AWStats configuration file. In particular, I need help updating the following path variables, which are set for an earlier version of Debian.

$AWSTATS_PATH='';

$AWSTATSICONPATH='/usr/share/awstats/icon';

$AWSTATSCSSPATH='/usr/share/doc/awstats/examples/css';

$AWSTATSCLASSESPATH='/usr/share/doc/awstats/examples/classes';

$AWSTATSCGIPATH='/usr/lib/cgi-bin';

$AWSTATSMODELCONFIG='/etc/awstats/awstats.model.conf'; # Used only when configure ran on linux

$AWSTATSDIRDATAPATH='/var/lib/awstats';

Please point me to any updated documentation on AWStats for Debian, or if anyone has it installed with these parameters set, please help.

I'm not able to see stats through the browser, which I think is because these variables are not correctly set.

http://www.debianadmin.com/apache-log-file-analyzer-using-awstats-in-debian.html

awstats_configure.pl should more or less make sure that all paths correspond to what you have on your system. Did you run the Perl script?
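On Debian it usually ships with the package's examples; a sketch, assuming the standard package layout:

perl /usr/share/doc/awstats/examples/awstats_configure.pl

It walks through the path questions interactively, so the variables you listed should come out right for your system.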

Output from the top command below. MySQL is a runaway process; within 5 minutes of a restart it reaches 400% CPU usage, virtually blocking everything else.

I have made all possible changes to my.cnf and apache2.conf to get a stable system, but to no avail.

I checked AWStats and found that Yahoo Slurp is causing the most trouble.

So I blocked Yahoo Slurp through robots.txt and .htaccess.
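For reference, the robots.txt rule for that is the standard one (Slurp is Yahoo's crawler token):

User-agent: Slurp
Disallow: /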

Slurp somehow still manages to hit my site. I even blocked its IP range 67.195.. but now I find it crawling from another address:

*.crawl.yahoo.net.

Any way to block Yahoo completely off my site? It is one @#$@ of a company.
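Since Slurp hops between addresses, matching the user-agent string instead of the IP is more robust; a hedged .htaccess sketch using mod_setenvif (Apache 2.2 syntax, matching the Order/Deny examples above, and assuming AllowOverride permits it):

# Tag any request whose User-Agent contains "Slurp", then deny tagged requests
SetEnvIfNoCase User-Agent "Slurp" bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot

The same pattern extends to the whitelist you originally wanted: set the variable for the agents you want to keep and deny everything else.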

Output from the top command:

lollipop:~# htop
-bash: htop: command not found
lollipop:~# top

top - 15:43:38 up 1 day, 2:30, 1 user, load average: 24.34, 22.14, 18.43
Tasks: 104 total, 1 running, 103 sleeping, 0 stopped, 0 zombie
Cpu(s): 18.1%us, 81.8%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.1%st
Mem: 1417440k total, 839028k used, 578412k free, 9284k buffers
Swap: 524280k total, 3996k used, 520284k free, 161652k cached

  PID USER   PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
 7650 mysql  18   0  483m  44m 4576 S  399  3.2 58:13.00 mysqld
 7872 root   15   0  2268 1140  880 R    0  0.1  0:00.84 top
    1 root   15   0  1992  568  540 S    0  0.0  0:00.00 init
    2 root   RT   0     0    0    0 S    0  0.0  0:00.00 migration/0
    3 root   34  19     0    0    0 S    0  0.0  0:00.00 ksoftirqd/0
    4 root   RT   0     0    0    0 S    0  0.0  0:00.00 migration/1
    5 root   34  19     0    0    0 S    0  0.0  0:00.00 ksoftirqd/1
    6 root   RT   0     0    0    0 S    0  0.0  0:00.00 migration/2
    7 root   34  19     0    0    0 S    0  0.0  0:00.00 ksoftirqd/2
    8 root   RT   0     0    0    0 S    0  0.0  0:00.00 migration/3
    9 root   34  19     0    0    0 S    0  0.0  0:00.00 ksoftirqd/3
   10 root   10  -5     0    0    0 S    0  0.0  0:00.00 events/0
   11 root   10  -5     0    0    0 S    0  0.0  0:00.00 events/1
   12 root   10  -5     0    0    0 S    0  0.0  0:00.00 events/2
   13 root   10  -5     0    0    0 S    0  0.0  0:00.00 events/3
   14 root   20  -5     0    0    0 S    0  0.0  0:00.00 khelper
   15 root   11  -5     0    0    0 S    0  0.0  0:00.00 kthread
   17 root   11  -5     0    0    0 S    0  0.0  0:00.00 xenwatch

Are you sure that it isn't your own site at fault? An unindexed table can cause huge load with a small amount of traffic.

@Guspaz:

Are you sure that it isn't your own site at fault? An unindexed table can cause huge load with a small amount of traffic.
This might be the issue.

I just can't see how web crawlers could slow down a website so much. They don't make more than a few requests a minute, so their visits shouldn't be the issue.

If your queries take a few seconds then it all stacks up. Check your slow query log.
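If it isn't enabled yet, a minimal my.cnf sketch for the MySQL 5.0-era packages Debian shipped at the time (the log path is the usual Debian convention; restart mysqld after editing):

[mysqld]
# Log every query that takes longer than 2 seconds
log_slow_queries = /var/log/mysql/mysql-slow.log
long_query_time  = 2

Anything that shows up repeatedly in that log is a candidate for an index, which lines up with Guspaz's unindexed-table theory.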
