Strange random freezes after Lucid upgrade
It happens roughly once a day, but not at any set time. Sometimes the box goes two days before freezing; sometimes it freezes a few times in one day.
I can't find anything interesting in the logs. I only know the load average is high because I leave a terminal open with htop running, and the last load average displayed before the SSH session disconnects looks something like this:
kiomava@h2:~$
Broadcast message from root@h4 (unknown) at 19:10 ...
The system is going down for power off NOW!

  1  [|                          1.6%]   Tasks: 382 total, 1 running
  2  [                           0.0%]   Load average: 38.85 37.81 33.63
  Mem[|||||||||||||||868/2020MB]         Uptime: 2 days, 12:29:11
  Swp[|                42/719MB]

  PID USER     PRI  NI  VIRT  RES  SHR S CPU% MEM%   TIME+  Command
32657 kiomava   20   0  2816 1472  940 R  1.0  0.1 1h15:41  htop
The only way to recover is to issue a reboot from the Linode dashboard; left alone, the box stays unresponsive for hours.
I've tried several kernels, pretty much all of the recent 2.6 ones. The problem started while I was running the latest 2.6 paravirt kernel.
This system has been in more or less the same configuration (by configuration I mean the packages installed, their config files, etc.) for a year. It all started exactly when I upgraded to Lucid, so it seems almost certain that some Lucid-specific problem is in play.
I have Munin going, and it shows nothing leading up to the freezes: no slowly increasing memory, CPU, or load, no growing process count. The graphs are perfectly normal right up to the point where the system dies, then there's just a gap until it recovers. Even the load average doesn't spike in Munin; it must climb too quickly for Munin to catch and log. Only the running htop session catches the spike.
The logs show the same thing: nothing interesting, just a discontinuity when it dies. I've scoured the Apache logs, the Java appserver logs, etc., and found nothing notable around the freeze times.
There are no cron jobs scheduled near when these freezes happen. They occur at seemingly random times; I haven't seen any pattern such as it failing around the same time of day.
So…
Has anybody seen anything similar?
Anybody have suggestions how to better instrument this to see what's going on?
Any other suggested courses of action to fix this?
Thanks in advance for any help you all can offer.
7 Replies
My latest theory is that it's disk access that freezes completely.
I ran this script:
#!/bin/bash
# Append a timestamped snapshot of the load average and the full
# process list to a stats file once per second.
unif="stats"
while true; do
    date >> "$unif"
    cat /proc/loadavg >> "$unif"
    ps -fel >> "$unif"
    sleep 1
done
And then I also ran this in a screen session from another server:
while true; do date; cat /proc/loadavg; cat /proc/vmstat; ps -fel; sleep 1; done
The last date/loadavg output from the one that appends to the file was:
Fri Sep 17 06:05:03 GMT 2010
0.80 0.87 0.84 2/358 32751
The last loadavg/date output from the console-only one was this:
Fri Sep 17 06:24:29 GMT 2010
59.26 50.67 31.28 1/571 6556
The plain shell loop that just prints to the console was chugging along fine. In fact it kept going after I logged in to its screen at 06:24 GMT to check.
But the script that appends to a file stopped dead at 06:05 GMT.
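If writes to disk are what hang while everything already in memory keeps running, one way to confirm it might be to time a tiny synced write from the console every second. A rough, untested sketch (the /tmp/iotest path and the 10-second threshold are just placeholders; it assumes GNU coreutils timeout is available on Lucid):
#!/bin/bash
# Once per second, time a small fsync'd write and report the result on the
# console only, so the record survives even if the disk hangs.
while true; do
    start=$(date +%s)
    if timeout 10 dd if=/dev/zero of=/tmp/iotest bs=4k count=1 conv=fsync 2>/dev/null; then
        echo "$(date): write ok, took $(( $(date +%s) - start ))s"
    else
        echo "$(date): write did not finish within 10s"
    fi
    sleep 1
done
If the write latencies jump or the dd starts timing out right when the load average takes off, that would point pretty squarely at disk I/O.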
Apache+mod_php, by default, will eventually shred your system in this manner if you have less than a few GB of memory. -rt
I'll drop MaxClients to 50 and see if that helps; thanks for the suggestion.
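As a sanity check on that number, something like this rough sketch (it assumes the worker processes are named apache2, as they are on Ubuntu) should show whether 50 workers can even fit in this box's RAM:
#!/bin/bash
# Estimate the worst-case Apache memory footprint: average resident size of
# the current apache2 workers multiplied by the planned MaxClients value.
maxclients=50
avg_kb=$(ps -C apache2 -o rss= | awk '{sum+=$1; n++} END {print (n ? int(sum/n) : 0)}')
echo "average apache2 RSS: ${avg_kb} kB"
echo "worst case at MaxClients=${maxclients}: $(( avg_kb * maxclients / 1024 )) MB"
With mod_php each worker tends to be tens of MB resident, so on a 2 GB Linode the worst case needs to stay comfortably under whatever is left after the Java appserver takes its share.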
At the two points in my last post, the console-based ps showed 98 apache2 processes, and there were 11 when the file-based output stopped. I haven't seen RAM usage dip into swap, but I'll add a little more to the console-based script to check that.
Also at the point where it died, it had plenty of free memory.
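For the swap check, the console loop would just grow a free call; something like this (sketch only, same loop as before plus free -m):
# Console-only monitoring loop, now also recording memory and swap usage.
while true; do
    date
    cat /proc/loadavg
    free -m
    cat /proc/vmstat
    ps -fel
    sleep 1
done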
Have you been able to resolve this? Because I'm experiencing exactly the same problems…