Abusive Searching on my Site

Hi there. I run a PHP + MySQL site here on my Linode, one that has a fulltext search against a largish number of items spread across several tables (the items have attributes that are normalized into multiple tables, which requires a joined SQL statement depending on what the user is searching for).

Anyways, my problem is this… something or someone (Googlebot?) is hitting this search interface on the order of 200-300 times per minute, which has my mysqld process pegged at 100% and my server crawling. I've employed memcache to cache search results, but whatever is hitting the interface feeds it a different query each time, and there are literally tens of thousands of distinct possibilities.

So, a solution is to restrict the number of queries that any given 'user' can perform per minute, I guess. Any suggestions on how best to accomplish this? Keying on the IP address, I would assume, but how do I store previous IP addresses? Not in the db, as that leads to its own set of problems: I'd then be hitting the db 200-300 times per minute just to look up IP addresses…

Any suggestions/best practices for dealing with this? Some kind of captcha would work, but that would kill the usability of the interface…

Thanks!

8 Replies

A couple quick suggestions, should it indeed be a search engine:

1) Your web server's access log should log information about the requests, including the user agent. Google's hits will show up like:

::ffff:66.249.65.104 blog.hoopycat.com - [04/Oct/2009:23:36:59 +0000] "GET /tag/arduino HTTP/1.1" 200 5588 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

This will at least help you identify what you're up against.

2) If it is a search engine, odds are good it doesn't need to be using your search function to find things. Notching it out via robots.txt may get it to go away and stop killing your search function (see the sketch just after this list).

3) You have memcached running, so you might as well use that to track individual IPs :-) It doesn't have to work 100%, it just has to work well enough to calm things down, right? (Sketch at the end of this reply.)
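Per #2, something like this in robots.txt should do it. I'm guessing at /search.php as the path; swap in wherever your search script actually lives:

```
# Keep well-behaved crawlers out of the search interface.
# The paths below are placeholders -- use your actual script path.
User-agent: *
Disallow: /search.php
Disallow: /search
```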

Checking the logs is #1, of course… without knowing exactly what's going on, it's difficult to decide where to go. -rt
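For #3, here's a rough sketch of what the memcache throttle could look like. This assumes the pecl memcache extension and a memcached on localhost:11211; the limit numbers are made up, so tune to taste:

```php
<?php
// Per-IP rate limiter with memcached as the counter store.
// No db round-trips: the counter lives entirely in memory and
// expires on its own at the end of each window.

$memcache = new Memcache();
$memcache->connect('127.0.0.1', 11211);

$ip     = $_SERVER['REMOTE_ADDR'];
$key    = 'ratelimit:' . $ip;
$limit  = 30;  // max searches per window (arbitrary example)
$window = 60;  // window length in seconds

// increment() returns false if the key doesn't exist yet.
$count = $memcache->increment($key);
if ($count === false) {
    // add() is atomic: if two requests race, only one seeds the key.
    $memcache->add($key, 1, 0, $window);
    $count = 1;
}

if ($count > $limit) {
    header('HTTP/1.1 503 Service Unavailable');
    header('Retry-After: ' . $window);
    exit('Too many searches; please slow down.');
}

// ...fall through to the normal search handling...
```

If memcached happens to be down, increment() and add() simply fail and everyone gets through, which is probably the failure mode you want anyway.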

@hoopycat:

> A couple quick suggestions, should it indeed be a search engine: […]

Per #1… slaps head… of course, how could I overlook this? Yep, it's Google; I've added the needed disallows to the robots.txt file. Any idea how quickly the bot should pick that up?

Per #3, using memcache to track IPs… brilliant. I'll be implementing that in the near future. I actually modified the code that was performing some of the queries to use memcache more aggressively, and the performance hit on the server dropped dramatically.

Other than a few hours of digging/tweaking, this has been a great learning experience. Thanks for the pointers.

Paul

robots.txt changes can take a little while (a day or two?).

I do enjoy memcached, and often treat it like I might treat /tmp or a piece of scratch paper… a place to store little transient tidbits of information that I won't care about in five minutes. Except, of course, without the destruction to trees.

You could also employ a sort of CAPTCHA device, as you said, to keep search engines off the form. I honestly don't see why Google would dig through your own search engine, though. Maybe it's paranoia talking, but couldn't someone/something be masquerading as a GoogleBot?

> someone/something be masquerading as a GoogleBot?
I'd tend to agree.

Google is crawling forms these days (see here), but 200 to 300 requests per minute seems a bit more than "a small number of fetches per site". Also, if this is a POST form, it shouldn't be Google doing it.
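If you want to rule out an impostor, a claimed Googlebot can be verified with a reverse DNS lookup followed by a forward confirmation; a spoofed user agent won't pass this, since only Google controls the googlebot.com/google.com reverse records. A quick sketch using plain built-in PHP functions:

```php
<?php
// Verify a claimed Googlebot: reverse DNS must land in Google's
// domains, and the forward lookup of that name must match the IP.
function is_real_googlebot($ip)
{
    $host = gethostbyaddr($ip);  // e.g. crawl-66-249-65-104.googlebot.com
    if ($host === false || $host === $ip) {
        return false;            // no reverse record at all
    }
    if (!preg_match('/\.(googlebot|google)\.com$/i', $host)) {
        return false;            // reverse record isn't Google's
    }
    return gethostbyname($host) === $ip;  // forward lookup must agree
}

var_dump(is_real_googlebot('66.249.65.104'));  // the IP from the log above
```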

@mjrich:

> Google is crawling forms these days (see here), but 200 to 300 requests per minute seems a bit more than "a small number of fetches per site". Also, if this is a POST form, it shouldn't be Google doing it.

It was indeed Googlebot, and yes, the form is GET for a number of reasons. I re-checked the logs, and it looks like the bot was making a request every 2-5 seconds, so the 200-300 number wasn't completely accurate (there were indeed 200-300 intensive queries per minute, but I had mistakenly 'duplicated' the query in my code, so it was really ~100/minute, and I hadn't removed regular user traffic from the count…). So Google was indexing my search results at ~10-30 requests per minute, which is much more reasonable.

Long story short, this has spurred me to make significant improvements: a streamlined query, better code, and caching, so the search can now handle 200-300 requests per minute.
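For anyone who lands here later, the caching piece boils down to keying memcache on a normalized form of the query string. A rough sketch of the pattern (table and column names are invented for illustration; assumes the pecl memcache extension and PDO):

```php
<?php
// Cache fulltext search results in memcached, keyed on the
// normalized query, so repeated searches never touch MySQL.

$memcache = new Memcache();
$memcache->connect('127.0.0.1', 11211);

$q   = isset($_GET['q']) ? trim(strtolower($_GET['q'])) : '';
$key = 'search:' . md5($q);
$ttl = 300;  // seconds; slightly stale results are fine for a search page

$results = $memcache->get($key);
if ($results === false) {
    // Cache miss: run the (expensive, joined) fulltext query once.
    $pdo  = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
    $stmt = $pdo->prepare(
        'SELECT DISTINCT i.id, i.title
           FROM items i
           JOIN item_attributes a ON a.item_id = i.id
          WHERE MATCH(i.title, i.body) AGAINST (:q IN BOOLEAN MODE)
          LIMIT 50'
    );
    $stmt->execute(array(':q' => $q));
    $results = $stmt->fetchAll(PDO::FETCH_ASSOC);
    $memcache->set($key, $results, 0, $ttl);
}
```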

Paul

I would go sign up with Google and see what they're doing: https://www.google.com/webmasters/tools/home

Confirmed this on Google Webmaster Tools, thanks!
