[FIXED] Odd disk problem

Hi

I have an issue which has completely stumped me, and before I raise a support ticket I wondered if anyone has any ideas.

I have two Linodes (one in Newark and one in Atlanta). I am running CentOS 5.5, and due to a problem with the default nginx packages (we have a policy of running standard packages where available) we are running the Paravirt kernel.

After an indeterminate amount of time the box will slow to a crawl and eventually require me to log in via Lish and run a destroy.

If I try to reboot the box normally (init 6), the system fails to unmount the disks:

Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda        11G  6.7G  3.8G  64% /
/dev/xvdc      1008M  248M  750M  25% /var/log
/dev/xvdd       2.5G  1.1G  1.4G  44% /opt
/dev/xvde      1008M  785M  213M  79% /var/lib/mysql
tmpfs           250M     0  250M   0% /dev/shm

When the box is slowing down, a "top" or "vmstat" shows that the CPU is stuck in wait (100% just before it gives up completely).
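(Specifically it's the I/O wait figure I'm watching - e.g. with vmstat the "wa" column climbs towards 100 as things go downhill:)

vmstat 5        # watch the "wa" (I/O wait) column under "cpu"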

As the system slows down the number of processes increases, because cron jobs start before the previous run has completed. Where possible, every one of these has a lock file to prevent it.
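(The lock files are roughly the usual flock wrapper around each job - something like the line below in the crontab, assuming util-linux's flock; the paths and script name are placeholders, not the real config:)

# crontab sketch - placeholder paths; -n skips the run if the previous one still holds the lock
*/5 * * * * /usr/bin/flock -n /var/lock/content-sync.lock /usr/local/bin/content-sync.sh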

The box is never busy - it serves a single website which is nearly all static content served by nginx. The remaining dynamic content is served by Apache and is only a WordPress blog.

I have MySQL replication running between the two boxes, and "unison" running in a cron job which synchronizes the web content between them.
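(The unison job is just a batch-mode two-way sync between the boxes - roughly the line below; the paths and hostname are placeholders, not the real profile:)

# crontab sketch - real paths/hostname differ
0 * * * * /usr/bin/unison -batch -silent /var/www ssh://other-linode//var/www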

So basically I am at a loss. If this was a real machine I would say that there are disk issues. Has anyone got any ideas or should I just go straight to a support ticket?

aarhus

[Edit: 25/02/11 - changed to fixed]

5 Replies

check /var/log/syslog for OOM (out of memory) errors.
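On stock CentOS 5 the kernel usually logs to /var/log/messages rather than /var/log/syslog, so something like this should turn up any OOM kills:

grep -i -e "out of memory" -e oom-killer /var/log/messages*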

No OOM messages

The box is monitored by Nagios and there is no trend of increasing memory use before it happens. If I catch it early enough I can see top showing 100% IO wait with plenty of memory still available.

Linode support have come back to me and confirmed that no one else is having an issue on the host. Funnily enough it reminds me of a recent issue in my day job where the disk rate on one of the VMs was very high and it exhibited identical symptoms. The disk IO graph in my dashboard doesn't show anything odd either.

Average disk IO of 60 op/s with normal peaks of 200 op/s.

Anyway - I'll keep monitoring (I might enable remote syslog to see if I capture anything that isn't synced to disk).
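(For the remote syslog bit, on CentOS 5 it should just be a one-liner in /etc/syslog.conf on this box - the hostname below is a placeholder - plus -r on the box that receives:)

# /etc/syslog.conf on the flaky box - forward everything over UDP
*.*     @other-linode.example.com
# then: service syslog restart
# (receiver needs syslogd started with -r, e.g. via SYSLOGD_OPTIONS in /etc/sysconfig/syslog)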

aarhus

I was running into something similar, but on Ubuntu. I could not track it down to anything. On a lark, I changed my kernel to a newer one than the latest stable Paravirt (it was 2.6.35 at the time; now it looks like there's 2.6.37 out there) and that stopped the problem.

Very odd - I logged in this morning after I started getting process-level alerts.

"top" - ran - load average was north of 100. A total of 241 processes were instantiated. Only one was running (top!). 77%idle 23%wait.

Memory - there was still physical memory available, albeit only 44 MB (out of 512 MB), with 250 MB of swap free.

"ps -ef" failed to complete after a screen of data.

However "cat /proc/*/cmdline" worked and there were hundreds of "ps" processes running from the nagios monitor! "killall -KILL ps" failed. Killing an indiviudal PID failed.

"init 6" failed to complete (last process to terminate was the HAL daemon and then it just stopped). I had to issue destroy via lish.

@gig - thanks for the suggestion - I am now running 2.6.37-linode30 rather than 2.6.32.16-linode28, so we'll see what happens. It has been occurring roughly every 20 hours, so not long to wait!
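(After the reboot I confirmed the configuration profile actually picked the new kernel up:)

uname -r        # should now report 2.6.37-linode30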

OK - it seems to have been solved by running the latest (non-Paravirt) kernel.
