Abusive Searching on my Site
Anyways, my problem is this… something or someone (Googlebot?) is hitting this search interface on the order of 200-300 times per minute, which has my mysqld process pegged at 100% and my server crawling. I've employed memcache to cache search results, but the script is feeding the search interface a different query each time, and there are literally tens of thousands of distinct possibilities.
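(For reference, the per-query caching mentioned above might look roughly like the following. This is a minimal sketch only, assuming a Python stack with the pymemcache and MySQLdb clients; the table and column names are invented for illustration and aren't taken from this post.)

import hashlib
import json

import MySQLdb
from pymemcache.client.base import Client

cache = Client(("127.0.0.1", 11211))
db = MySQLdb.connect(host="localhost", user="app", passwd="secret", db="site")

CACHE_TTL = 300  # seconds; slightly stale search results are acceptable here

def search(query):
    # Normalize the query so trivially different inputs share one cache entry.
    normalized = " ".join(query.lower().split())
    key = "search:" + hashlib.sha1(normalized.encode("utf-8")).hexdigest()

    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)

    cur = db.cursor()
    cur.execute("SELECT id, title FROM items WHERE title LIKE %s LIMIT 50",
                ("%" + normalized + "%",))
    rows = cur.fetchall()
    cur.close()

    cache.set(key, json.dumps(rows), expire=CACHE_TTL)
    return rows

(The weakness is exactly the one described: with tens of thousands of distinct queries, the hit rate stays low unless the same queries recur.)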
So, a solution is to restrict the number of queries that any given 'user' can perform per minute, I guess. Any suggestions on how best to accomplish this? Using the IP address, I would assume, but how do I store previous IP addresses? Not in the db, as that leads to its own set of problems, in that I would then be running to the db 200-300 times per minute to look up IP addresses…
Any suggestions/best practices for dealing with this? Some kind of captcha would work, but that would kill the usability of the interface…
Thanks!
8 Replies
A couple quick suggestions, should it indeed be a search engine:
1) Your web server's access log should log information about the requests, including the user agent. Google's hits will show up like:
::ffff:66.249.65.104 blog.hoopycat.com - [04/Oct/2009:23:36:59 +0000] "GET /tag/arduino HTTP/1.1" 200 5588 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
This will at least help you identify what you're up against.
2) If it is a search engine, odds are good it doesn't need to be using your search function to find things. Notching it out via robots.txt may get it to go away and stop killing your search function.
3) You have memcached running, so you might as well use it to track individual IPs :-) It doesn't have to work 100%, it just has to work well enough to calm things down, right? (A rough sketch of this idea follows after this reply.)
Checking the logs is #1, of course… without knowing exactly what's going on, it's difficult to decide where to go. -rt
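(To make point 3 concrete, here is a rough sketch of per-IP throttling done entirely in memcached. It assumes a Python stack with the pymemcache client; the 60-requests-per-minute limit, the key layout, and the fail-open behaviour are all choices made up for the example, not anything prescribed in this thread.)

import time

from pymemcache.client.base import Client

cache = Client(("127.0.0.1", 11211), default_noreply=False)

LIMIT_PER_MINUTE = 60  # arbitrary threshold, purely for illustration

def allow_request(ip):
    # One counter per IP per clock minute; the key expires on its own,
    # so there is no cleanup job and no database traffic at all.
    key = "ratelimit:%s:%d" % (ip, int(time.time() // 60))

    # add() only stores the key if it does not already exist.
    if cache.add(key, b"1", expire=120):
        return True

    count = cache.incr(key, 1)
    if count is None:
        # The key was evicted between add() and incr(); fail open.
        return True
    return count <= LIMIT_PER_MINUTE

(Whether a throttled request then gets a 503, a cached page, or a captcha is a separate decision; the point is only that the bookkeeping stays in memcached rather than in MySQL.)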
@hoopycat:
Per #1… slaps head, of course, how could I overlook this. Yep, it's Google; I've added the needed disallows to the robots.txt file. Any idea how quickly the bot should pick that up?
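(The exact disallow lines aren't quoted in the thread; assuming the search interface lives under a path such as /search, the robots.txt addition would typically look something like:

User-agent: *
Disallow: /search

Googlebot caches robots.txt and typically re-fetches it within about a day, so a change like this should be picked up fairly quickly.)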
Per #3, using memcache to track IPs… is brilliant. I'll be implementing that in the near future. I actually modified the code that was performing some of the queries to use memcache more aggressively, and the performance hit on the server dropped dramatically.
Other than a few hours of digging/tweaking, this has been a great learning experience. Thanks for the pointers.
Paul
I do enjoy memcached, and often treat it like I might treat /tmp or a piece of scratch paper… a place to store little transient tidbits of information that I won't care about in five minutes. Except, of course, without the destruction of trees.
> someone/something be masquerading as a GoogleBot?
I'd tend to agree. Google is crawling forms these days (see here), but 200 to 300 requests/gets per minute seems a bit more than "a small number of fetches per site". Also, if this is a POST form, it shouldn't be Google doing it.
@mjrich:
It was indeed Googlebot, and yes, the form is GET for a number of reasons. I re-checked the logs, and it looks like the bot was making a request every 2-5 seconds, so the 200-300 number wasn't completely accurate (there were indeed 200-300 intensive queries per minute, but I had mistakenly 'duplicated' the query in my code, meaning ~100/minute, and I hadn't removed regular user traffic…). So Google was indexing my search results form at ~10-30 requests per minute, which is much more reasonable.
Long story short, this has spurred me to make significant improvements (a streamlined query, better code, and more caching), so that it can now handle 200-300 requests per minute.
Paul