What can I do about abusive web scrapers?
It seems an IP was accessing every page of my website (I could see it in the logs: requests from another country, many attempts per second), resulting in a huge volume of requests and very high MySQL usage. Banning the IP with the firewall immediately returned everything to normal.
I know that fail2ban can block repeated login attempts, but is there something similar for this? Ideally it would block scrapers without affecting users who are simply opening a lot of tabs.
There are several solutions for blocking or impeding abusive web scrapers, and we found this guide on How to Prevent Scraping particularly helpful in explaining the pros and cons of each. Here's a breakdown of some of these solutions:
CloudFlare
We recommend CloudFlare to many of our customers for IP obfuscation and DDoS protection, and it also offers a bot abuse protection feature that would shield your database from this kind of traffic.
Rate Limiting
Rate limiting requests will protect your database from scrapers, though as you've noted you'll need to take care to avoid blocking legitimate users who are opening a lot of tabs. You can follow this guide on Rate limiting with Apache and mod-security for some configuration suggestions (the article is old, so check out the comments on it if you run into issues), and we have a guide on How to Configure ModSecurity on Apache to go along with it.
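As a rough illustration, a ModSecurity rule set along these lines tracks requests per client IP and rejects clients that exceed a threshold. This is only a sketch: the rule IDs (900100/900101), the 60-requests-per-60-seconds threshold, and the 429 response status are placeholder values you would tune for your own traffic.

# Count requests per client IP, letting the counter expire after 60 seconds
SecAction "id:900100,phase:1,nolog,pass,initcol:ip=%{REMOTE_ADDR},setvar:ip.requests=+1,expirevar:ip.requests=60"
# Once a client exceeds 60 requests in that window, deny with 429 Too Many Requests
SecRule IP:REQUESTS "@gt 60" "id:900101,phase:1,deny,status:429,log,msg:'Rate limit exceeded'"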
Similarly, you could use the DDoS-protection module mod_evasive for Apache to apply rate limiting.
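For reference, mod_evasive is configured with just a handful of directives. The thresholds below are a sketch to adapt to your traffic, not recommended values:

<IfModule mod_evasive20.c>
    # Block an IP that requests the same page more than 5 times in 1 second...
    DOSPageCount        5
    DOSPageInterval     1
    # ...or more than 100 pages site-wide in 1 second
    DOSSiteCount        100
    DOSSiteInterval     1
    # Blocked clients receive 403 responses for 60 seconds
    DOSBlockingPeriod   60
    DOSHashTableSize    3097
    DOSLogDir           "/var/log/mod_evasive"
</IfModule>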
Honeypot
A "honeypot" is a URL that is hidden from normal users, but accessible to invasive programs like scrapers. You could then use a honeypot URL in conjunction with regex for Fail2ban to jail users that access the honeypot. Just be sure to exclude the honeypot URL in robots.txt so that you don't inadvertently block legitimate web crawlers like so:
User-agent: *
Disallow: /honeypot_example
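To tie the honeypot to Fail2ban, you would add a filter whose failregex matches requests for the honeypot URL, plus a jail that watches your web server's access log. The file paths, jail name, and ban time here are placeholders, and the regex assumes Apache's default combined log format:

# /etc/fail2ban/filter.d/honeypot.conf
[Definition]
# Match any GET or POST for the hidden honeypot URL
failregex = ^<HOST> .* "(GET|POST) /honeypot_example
ignoreregex =

# /etc/fail2ban/jail.local
[honeypot]
enabled  = true
port     = http,https
filter   = honeypot
logpath  = /var/log/apache2/access.log
maxretry = 1
bantime  = 86400

With maxretry set to 1, a single request to the honeypot is enough to ban the client for the configured bantime.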