Disk I/O increasing daily (no plateau) and averaging 4k now
As a new linode sysadmin, let me say that, to date, I'm VERY impressed with linode (the company), the tools, the dashboard, Xen, uptime, etc.
However, I'm very concerned by a steadily increasing disk I/O rate that has been climbing every day regardless of what I do to try to stop it. At some point it has to plateau, but it doesn't even appear to be slowing down.
Basics: Debian 6 (2.6.32.16-linode28). Support told me to use this kernel because the latest one thrashed until the OOM killer crashed my VM. This kernel hasn't caused any crashes, but then again, now I have this problem. I've been bumping my disk I/O alert threshold up by 1k every day, and every day it gets exceeded (it's now set at 4k).
I followed the recommendations in the online docs for "memory-networking" best practices. I have only moved a portion of my prior server over, so the load should be (and is) almost zero: about 10-15 apache2 websites (mostly WordPress, Joomla, Drupal, etc.), a small-ish MySQL, and a basic Oracle XE installation (yes, my swap is 1028MB; I'm a very experienced Oracle DBA, previously running a full 10g on 1.5GB RAM, so this should be enough box for it). I'm running the Linode 1024 plan, by the way. Disk is 50% utilized.
I don't see any significant swap usage, and iotop doesn't tell me much. The graphs from the dashboard (attached below) are scary, though. I have Munin installed and graphing, but nothing stands out (and honestly, I'm not sure what I'm looking for). I'm no Xen guru, but I have many years of Linux admin experience, and I've never seen this before.
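For reference, this is roughly how I've been trying to catch the culprit -- nothing Linode-specific, just the stock tools, and the intervals are just what I happened to pick:

```sh
# -o = only show processes actually doing I/O, -b = batch output, -a = accumulated totals;
# 12 samples 5 seconds apart, so anything bursty has a chance to show up in the totals
iotop -o -b -a -d 5 -n 12
```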
Can anyone give me ideas or suggestions on what might be causing this? Support has (so far) not provided any solid help, and I really need to at least identify the cause before I move the rest of my services over. I'm happy to provide any/all information asked for.
Thanks!!
bruce
![](http://forsale.fatcity.com/disk_io_24h.png)
I'm not a XEN-guru, but some guesses:
Could be MySQL disk usage (if the buffer sizes aren't big enough), or errors in PHP scripts (especially in file cachers).
10 Drupal sites is heavy enough; Drupal likes to eat resources.
@OZ:
I'm not a XEN-guru, but some guesses:
Could be MySQL disk usage (if the buffer sizes aren't big enough), or errors in PHP scripts (especially in file cachers).
10 Drupal sites is heavy enough; Drupal likes to eat resources.
Thanks for chiming in…
There is actually only one Drupal site and two Joomlas; the rest are traditional HTML or WordPress, so I'm not expecting a lot of thrashing. I do see reports of Drupal creating a lot of temp tables, though (I'll be looking into this next).
Support seems to think it's MySQL too, but the strange thing is that this all ran just FINE on an older physical server -- nowhere near the number of disk I/Os I'm seeing here.
I've tried the recommendations from mysqltuner.pl, and those were in effect all night with no difference at all -- it's still climbing. Any more ideas of what to look at?
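(Side note, in case it helps anyone reading later: after editing my.cnf and restarting, I re-ran the tuner and double-checked that the new values actually took effect. The variable names below are just the usual suspects it flags, not necessarily what it told me:)

```sh
# mysqltuner only reads SHOW VARIABLES / SHOW STATUS and prints suggestions
perl mysqltuner.pl

# Confirm the edited values actually took after the restart (names here are examples)
mysql -u root -p -e "SHOW VARIABLES WHERE Variable_name IN
  ('key_buffer_size','query_cache_size','table_open_cache','max_connections');"
```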
thanks!
bruce
@Guspaz:
The peaks look somewhat regular; cron job?
Thanks for the reply…
Yes, that one puzzles me too. I immediately started looking around, and the only thing running hourly at that time is cron.hourly, and there are no jobs in there.
I do have a daily run (which probably accounts for the 6am-ish spike), but nothing on an hourly basis, so I'm scratching my head over where that's coming from.
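In case anyone else ends up hunting the same thing, here's roughly what I went through to look for hourly jobs (these are the standard Debian locations):

```sh
# System-wide cron entries
cat /etc/crontab
ls -l /etc/cron.hourly /etc/cron.d
cat /etc/cron.d/* 2>/dev/null

# Per-user crontabs (root plus any site users)
for u in $(cut -d: -f1 /etc/passwd); do
    echo "== $u =="
    crontab -l -u "$u" 2>/dev/null
done
```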
More ideas?
thanks,
bruce
1) Cron job. That one puzzled me, and after some looking around, I realized that there WAS a cron job calling…
(And to be fair, I wasn't expecting it to improve -- there's no reason why a single cron job should impact performance that much).
2) I called upon a friend who is a MySQL DBA. He looked at the my.cnf settings that came out of the box with a Debian MySQL install, plus the recommended changes I made from the "memory-networking" online linode doc. He said the settings were hosed and sent me a new my.cnf.
3) That, in itself, was an improvement, but it barely moved the needle. However, one of the settings he included was to turn on the slow query log.
Now, to be fair, Linode support also recommended this, but in my arrogance I said, "Nah, I know this is working, so there shouldn't be any long-running queries," and didn't do it. Mistake. Mea culpa.
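(For anyone else who skipped it like I did: it's only a couple of lines in the [mysqld] section of my.cnf. These are the 5.1-era option names -- newer versions use slow_query_log / slow_query_log_file -- and the path and threshold below are just my choices:)

```
# /etc/mysql/my.cnf, under [mysqld]
log_slow_queries   = /var/log/mysql/mysql-slow.log
long_query_time    = 2
log-queries-not-using-indexes
```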
4) I checked the slow query log and spotted a job running every five minutes that was trying to prune back the Drupal watchdog log table (which basically catches errors in execution). From prior Drupal experience I know this watchdog is a source of pain for many sites, but I'd never had problems with it before.
Okay, so this query was taking 25-30s every 5 minutes. It shouldn't… I looked at how many rows it was evaluating to prune back the table, and it was almost 10.5 MILLION rows. WTF?? How did that table, which shouldn't have been over 1,000 rows, get that big? Well…
It seems there must be a bug in MySQL v5.1.49. I've only found a couple of references to it, but apparently when you try to delete from a very large table with a WHERE timestamp < … condition, it fails. That left Drupal writing ever more rows to the table, so it was growing fast and furious.
I truncated the table to zero, fixed a couple of bugs in the Drupal code (that were already in there anyway -- ereg() is deprecated) and then added an index.
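For reference, the database side of the cleanup was basically the following. The watchdog table and its timestamp column are Drupal's standard dblog schema, but 'drupal' is just a stand-in for my actual database name, so treat this as a sketch:

```sh
mysql -u root -p drupal <<'SQL'
-- See how bad it has gotten; watchdog should be a few thousand rows, not millions
SELECT COUNT(*) FROM watchdog;

-- Wipe the runaway table (Drupal simply starts logging into it again)
TRUNCATE TABLE watchdog;

-- Give the 5-minute prune job (a DELETE ... WHERE timestamp < ... query) something to seek on
CREATE INDEX watchdog_timestamp ON watchdog (timestamp);
SQL
```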
I let the system quiesce, and now the disk I/O is back to about, um, ZERO! Yay! See the following graph…
Thanks to everyone for help and suggestions. Really appreciate it.
In summary, the first thing I should have done was turn on the slow query log and watch for problems. That would have led me straight to the obvious pig in the database.
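An easy way to keep an eye on it, by the way, is mysqldumpslow, which ships with the MySQL server packages (the log path below is just where mine lives):

```sh
# Group slow queries by pattern, worst offenders by total time first
mysqldumpslow -s t /var/log/mysql/mysql-slow.log | head -40
```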
The variable settings in my.cnf will probably improve performance in the long run, so those were helpful as well.
Thanks!
bruce
[graph: disk I/O over the last 24 hours, now back near zero]
@bbergman:
my swap is 1028MB
You may not be swapping now, but you'll thank yourself later if you use 256MB. That much swap is just asking for an opportunity to OOM!
–
Travis
@otherbbs:
@bbergman:
my swap is 1028MB
You may not be swapping now, but you'll thank yourself later if you use 256MB. That much swap is just asking for an opportunity to OOM!
–
Travis
Hmmm, not sure if you read that right. That's a 1GB swap partition. It's required for Oracle: I doubt I'll ever need that much, but you can't install XE without it being that large.
thanks,
bruce
The proper answer is more RAM, not more swap. If Oracle "needs" that much "RAM", then it'll slow to a crawl if it actually tries to use that much RAM and gets shoved in swap; swap is too slow to actually work actively out of.
On a linode (or any Linux machine, generally) it's good to have a bit of swap to allow inactive stuff to get paged out of RAM, but beyond that, you don't want to be using it.
@Guspaz:
The proper answer is more RAM, not more swap. If Oracle "needs" that much "RAM", then it'll slow to a crawl if it actually tries to use that much RAM and gets shoved in swap; swap is too slow to actually work actively out of.
On a linode (or any Linux machine, generally) it's good to have a bit of swap to allow inactive stuff to get paged out of RAM, but beyond that, you don't want to be using it.
Agreed. But trust me, I know Oracle. I may not be a Xen expert, but when it comes to performance tuning I've got 15+ years working intimately with Oracle, and I know exactly how to make it run well. I've been offering Oracle DB + web hosting for six years, which is something even the big boys have less experience with.
Ironically, that's why my MySQL implementation was giving me fits and performance issues, and my Oracle instance (on the same linode) is performing just perfectly.
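(And for what it's worth, here's the quick check I use to confirm the box isn't actually touching that swap -- as long as si/so sit at zero and the swap "used" numbers stay tiny, the 1GB partition is just sitting there keeping the XE installer happy:)

```sh
# Current memory and swap picture
free -m
swapon -s

# si/so columns show swap-in/swap-out activity; they should stay at 0 under steady load
vmstat 5 5
```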
thanks,
bruce