A few MySQL queries are wrecking my CPU

Hi all,

I am hosting www.nicetranslator.com on my linode (this is the only live site it's hosting).

My goal is to have it log every phrase that gets translated (after the user stops typing for 1.5 seconds).

I have a table of phrases:

phraseID | phrase | phraseLang | sourceCount | resultCount

(sourceCount is the number of times the phrase appears as the source; resultCount is the number of times it appears as the result)

And a table of translations

translationID | ip | timeSubmitted | sourceID | resultID

The following is the PHP I've got (including the functions I've used, rowExists() and getID() ):

<?php
    function rowExists($table, $where){
        // true if at least one row matches $where
        return (mysql_num_rows(mysql_query("SELECT * FROM $table WHERE $where")) > 0);
    }
    function getID($table, $prefix, $where){
        // returns the <prefix>ID column of the first matching row
        $theID = $prefix . "ID";
        $queryRow = mysql_fetch_row(mysql_query("SELECT $theID FROM $table WHERE $where")) or die(mysql_error());
        return $queryRow[0];
    }
?>

When I run this, my CPU usage instantly spikes to over 100%. I tried commenting out all the code that queries the translation table, so it only saves phrases, and I get the same result.

I have run the optimization, which seemed to have some small effect when I first ran it, but no longer does.

The phrase table now has about 230,000 rows.

To be honest, this is my first project to draw such an audience, so it's my first time dealing with a database this size. I was under the impression, though, that MySQL was quite capable of handling such a job.

What am I missing? Your help is very much appreciated.

Thanks for your time,

Nicky

6 Replies

1. Do you have a primary key and some indexes defined on your tables? Every column that appears in your WHERE clause must have an index on it, otherwise MySQL has to scan all the rows.

2. Is addslashes() causing a problem? You might get unpredictable behavior if you escape a string that has already been escaped by magic quotes. addslashes() also has a problem with non-ASCII character sets, which you'll come across pretty soon as your site is about translating to/from foreign languages. Turn off magic quotes if you can, and use mysql_real_escape_string() instead of addslashes() for real escaping.

3. rowExists() seems rather inefficient for your purpose. What about INSERT … ON DUPLICATE KEY UPDATE? (A rough sketch of points 1-3 follows after this list.)

4. Does MySQL have enough key_buffer to keep all the indexes in RAM?

etc.
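
Something along these lines, roughly, using the same old mysql_* API as your code. The key name, prefix lengths and the $rawPhrase/$rawLang variables are just placeholders, and it assumes a connection is already open with mysql_connect()/mysql_select_db():

<?php
// 1. A UNIQUE key covering the columns in your WHERE clause (TEXT columns
//    need a prefix length, so uniqueness is only enforced on the prefix).
//    Run this once, e.g. from the mysql console or a one-off script:
mysql_query("ALTER TABLE phrases
             ADD UNIQUE KEY phrase_lang (phraseLang(10), phrase(250))")
    or die(mysql_error());

// 2. Real escaping instead of addslashes():
$phrase = mysql_real_escape_string($rawPhrase);
$lang   = mysql_real_escape_string($rawLang);

// 3. One upsert instead of rowExists() + a separate INSERT/UPDATE:
mysql_query("INSERT INTO phrases (phrase, phraseLang, sourceCount)
             VALUES ('$phrase', '$lang', 1)
             ON DUPLICATE KEY UPDATE sourceCount = sourceCount + 1")
    or die(mysql_error());
?>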

Here's the table structure

CREATE TABLE `phrases2` (
  `phraseID` bigint(20) unsigned NOT NULL auto_increment,
  `phrase` text character set utf8 collate utf8_roman_ci NOT NULL,
  `phraseLang` text character set utf8 collate utf8_roman_ci NOT NULL,
  `sourceCount` bigint(20) unsigned NOT NULL default '0',
  `resultCount` bigint(20) unsigned NOT NULL default '0',
  PRIMARY KEY  (`phraseID`),
  KEY `sourceCount` (`sourceCount`),
  KEY `resultCount` (`resultCount`),
  FULLTEXT KEY `phrase` (`phrase`),
  FULLTEXT KEY `phraseLang` (`phraseLang`)
) ENGINE=MyISAM  DEFAULT CHARSET=utf8 COLLATE=utf8_bin AUTO_INCREMENT=19312 ;

As you can see, I have made 'phrases2' (identical to the original phrases table), which has accumulated 19,312 entries in the hour or so since I created it… The CPU usage is, of course, much lower now, but it is steadily climbing as entries increase.

hybinet:

Thanks for the tip on magic quotes / mysql_real_escape_string(); using that now.

As for INSERT … ON DUPLICATE KEY UPDATE, I don't think this will work in my situation because the phrase itself is not a unique key. The phrase-language combination is unique. So, if the same phrase happens to exist in two languages, I wouldn't want the counts for the two languages to get merged. Any other ideas on that, though?

I hadn't known about the key_buffer before, but looked it up and set it to 64M as the manual suggests for systems with over 256MB of RAM… Does that seem to make sense?

Given what you have seen from the situation, does it seem like there is definitely a problem or that we are truly approaching the limit of our CPU?

I imagine this will force us to design a smarter way of storing phrases, but I am still very curious about what the limits of servers of this size actually are.

Thanks again for the help

MyISAM fulltext indexes are only helpful if you're trying to look up a few keywords from a large chunk of text and then rank the results by relevance. Since you're only interested in finding out if an exact phrase already exists in the table, the fulltext index is totally useless. You'll get much better performance by changing the two "text" columns to "varchar" and creating a regular index on those two columns.
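
The change could look roughly like this (a sketch only; the varchar lengths and key name are guesses, adjust them to your data):

<?php
// Run once from a script or paste the SQL into the mysql console.
mysql_query("ALTER TABLE phrases2
             DROP INDEX phrase,
             DROP INDEX phraseLang,
             MODIFY phrase VARCHAR(500) CHARACTER SET utf8 NOT NULL,
             MODIFY phraseLang VARCHAR(10) CHARACTER SET utf8 NOT NULL,
             ADD KEY phrase_lang (phraseLang, phrase(250))")
    or die(mysql_error());
?>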

The phrase itself might not be unique, but the combination of the phrase and the language is, right? In that case, you can have a char(40) column as your primary key and put sha1($srcLang.$source) into it. That'll serve as a unique key across all phrases and languages, and spare you from having to keep large indexes on the phrases themselves. (This might or might not work for you, depending on what else you're trying to do with this table.)
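
For illustration, roughly like this (the "phrases3" name and column sizes are made up; $source and $srcLang would be your already-escaped phrase and language):

<?php
// One-time table creation: the sha1 of language + phrase is the primary key.
mysql_query("CREATE TABLE IF NOT EXISTS phrases3 (
                 phraseHash  CHAR(40) NOT NULL PRIMARY KEY,
                 phrase      VARCHAR(500) NOT NULL,
                 phraseLang  VARCHAR(10)  NOT NULL,
                 sourceCount BIGINT UNSIGNED NOT NULL DEFAULT 0,
                 resultCount BIGINT UNSIGNED NOT NULL DEFAULT 0
             ) ENGINE=MyISAM DEFAULT CHARSET=utf8") or die(mysql_error());

// Per request: the hash is unique across all phrase/language pairs.
$hash = sha1($srcLang . $source);

mysql_query("INSERT INTO phrases3 (phraseHash, phrase, phraseLang, sourceCount)
             VALUES ('$hash', '$source', '$srcLang', 1)
             ON DUPLICATE KEY UPDATE sourceCount = sourceCount + 1")
    or die(mysql_error());
?>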

Here's another way to avoid having to perform two queries at every single insert. Do an UPDATE first, and call mysql_affected_rows() to see if any rows were updated. If the result is zero, then go ahead and do an INSERT.
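
For example (again assuming the $hash, $source and $srcLang from the sketch above):

<?php
// Try to bump the counter for an existing row first.
mysql_query("UPDATE phrases3 SET sourceCount = sourceCount + 1
             WHERE phraseHash = '$hash'") or die(mysql_error());

if (mysql_affected_rows() == 0) {
    // No existing row was touched, so this phrase/language pair is new.
    mysql_query("INSERT INTO phrases3 (phraseHash, phrase, phraseLang, sourceCount)
                 VALUES ('$hash', '$source', '$srcLang', 1)") or die(mysql_error());
}
?>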

Use utf8_unicode_ci throughout if you expect any non-latin characters (such as Japanese) to be inserted into the DB. Also look into mysql_set_charset() if you haven't already done so. Foreign languages need special care!
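
For example (host, user, password and database name below are placeholders; mysql_set_charset() needs PHP 5.2.3+ and MySQL 5.0.7+):

<?php
// Make sure the connection itself talks utf8, not just the tables.
$link = mysql_connect('localhost', 'dbuser', 'dbpass') or die(mysql_error());
mysql_select_db('nicetranslator', $link) or die(mysql_error());
mysql_set_charset('utf8', $link);
?>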

64M key_buffer seems more than enough for the purpose. Just get rid of those meaningless fulltext indexes.

Hitting the limits of the CPU? No way… I used to have a 4.5GB table with 300K rows in it. With proper indexing, I could perform SELECT's on it all day long without ever hitting 10% CPU. And that was on a Linode 360.

As hybinet pointed out, keys are absolutely essential to get some decent query performance from your DB.

But try to avoid indexes on columns that don't need one (e.g. the count columns). The index has to be updated on every modification of the table, which makes writes expensive.

Another thing you could improve is the way you collect your stats.

Do you really need 100% real-time information about which phrase was translated how often? I doubt it… Row updates are expensive, and appending new rows is much cheaper. You could log the phrase/language records to a separate table as they are queried from the web app. Then aggregate the data every 5/10/60/whatever minutes, delete the old log records and store the results in the phrases table.
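
A rough sketch of what that could look like (all table and column names here are made up, and it builds on the sha1-keyed table suggested earlier):

<?php
// Run once: a plain append-only log table.
mysql_query("CREATE TABLE IF NOT EXISTS phrase_log (
                 phrase     VARCHAR(500) NOT NULL,
                 phraseLang VARCHAR(10)  NOT NULL,
                 loggedAt   DATETIME     NOT NULL
             ) ENGINE=MyISAM DEFAULT CHARSET=utf8") or die(mysql_error());

// Per request: just append, no updates.
mysql_query("INSERT INTO phrase_log (phrase, phraseLang, loggedAt)
             VALUES ('$source', '$srcLang', NOW())") or die(mysql_error());

// Every N minutes (cron): fold the log into the counts, then clear it.
// (To be safe against rows inserted mid-run, you'd really only delete the
// rows you aggregated, e.g. everything older than a timestamp taken first.)
mysql_query("INSERT INTO phrases3 (phraseHash, phrase, phraseLang, sourceCount)
             SELECT SHA1(CONCAT(phraseLang, phrase)), phrase, phraseLang, COUNT(*)
             FROM phrase_log
             GROUP BY phraseLang, phrase
             ON DUPLICATE KEY UPDATE sourceCount = sourceCount + VALUES(sourceCount)")
    or die(mysql_error());
mysql_query("TRUNCATE TABLE phrase_log") or die(mysql_error());
?>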

Wow, amazing stuff.

Using sha1() and changing the phrases to a varchar as you suggested (with ON DUPLICATE KEY UPDATE) seems to have completely eliminated the problem; it's running smooth as silk at the moment.

Thanks so much hybinet

oliver, I think I've cleaned up the indices now. I like your suggestion, and that's sort of what I had in mind when I was talking about coming up with a smarter way of handling this data. If there are future problems, we'll likely go ahead with something along those lines.

Thanks again for all the help, very nicely done.

For future reference, this is an excellent article covering pretty much what you're trying to do:

http://www.mysqlperformanceblog.com/2008/12/22/high-performance-click-analysis-with-mysql/
