What can hog CPU and starve disk/net?
I have had a Linode 360 running for about a week as a primary MX for various scattered computers. It is a pretty vanilla minimal Gentoo installation, with qmail being the main point of its existence. I use the qmail-spp plugin to implement a simple screen against bogus incoming email by verifying RCPT TO addresses.
According to the logs (UTC), on Friday 04 Dec 2009 at about 0420, CPU usage shot up to 107%, dropped to 100% at about 0530, and I noticed it at about 0745. Disk I/O and network traffic were at zero for the entire period.
I tried ssh but could not get in and used the dash to reboot it. I should have tried LISH but I am new to this and didn't think of it.
My first thought was that my little C plugin had gone into an infinite loop, but I don't see how that could have blocked everything else, including all disk I/O and net traffic. Further incoming port 25 connections would have started a new qmail-smtpd session. Besides, the plugin had been running for at least several hours with no problems.
Does anyone have any ideas on what could make a Linode virtual server go haywire like that, 100% (actually 105.73%!) CPU and zero disk/net?
6 Replies
@hoopycat:
From the lish shell (e.g. ssh to linodexxxxx@citynameyyy.linode.com, then detach with ^A-d), run the "logview" command…
@Scarecrow: I should have mentioned that I checked /var/log/messages
These are not the same thing - if your linode was OOMing or panicked, then it would not be able to write to your log files, but it may very well have been able to write an error to the console which you could see with the lish logview.
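For anyone who saves the logview output to a file, here is a rough sketch of scanning it for the two conditions mentioned above. The function name and the pattern list are my own assumptions for illustration; the patterns are substrings the Linux kernel commonly prints for OOM kills and panics, and the list is not exhaustive.

```python
import re

# Illustrative patterns only (an assumption, not a complete list):
# typical kernel wording for out-of-memory kills and for panics.
PATTERNS = {
    "oom": re.compile(r"out of memory|oom-killer", re.IGNORECASE),
    "panic": re.compile(r"kernel panic", re.IGNORECASE),
}


def classify_console_lines(lines):
    """Return (line_number, label, text) for each line matching a pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((number, label, line.strip()))
    return hits
```

An empty result does not prove the kernel was healthy; it only means none of these particular strings appeared in the captured text.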
Any ideas about what might cause the problem? I don't see how my program could have caused all three symptoms at once.
A halted kernel (e.g. one that has panicked but rebootonpanic is unset) will exhibit all of those symptoms… the question is, what halted the kernel?
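If the setting meant here is the kernel's own panic timeout (an assumption on my part; it may instead be an option in the Linode configuration), a small sketch like the following can report what a running system is set to do. It reads the standard `/proc/sys/kernel/panic` file; the helper name `describe_panic_timeout` is hypothetical.

```python
from pathlib import Path

# Standard Linux sysctl file: seconds to wait before rebooting after a panic.
PANIC_SYSCTL = Path("/proc/sys/kernel/panic")


def describe_panic_timeout(raw):
    """Interpret the text content of the kernel.panic sysctl."""
    seconds = int(raw.strip())
    if seconds == 0:
        # Zero means the kernel loops forever after a panic.
        return "kernel stays halted after a panic (no automatic reboot)"
    if seconds < 0:
        return "kernel reboots immediately after a panic"
    return f"kernel reboots {seconds} seconds after a panic"


if __name__ == "__main__":
    if PANIC_SYSCTL.exists():
        print(describe_panic_timeout(PANIC_SYSCTL.read_text()))
    else:
        print("kernel.panic sysctl not found on this system")
```

A value of 0 would be consistent with a machine that sits there after a panic until someone reboots it by hand.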
Although I don't understand why that would lock up the CPU, since it tries a few times and then says it is waiting 5 minutes. I would think that 5 minutes would have been plenty of time to launch new smtp connections. Even if tcprules had died and itself could not handle incoming connections, why would sshd not have taken clients? Why would the CPU peg solid rather than for a split second and then off for 5 minutes?
And why was it trying to re-init anyway after running for several days since the previous reboot?
I don't think I have actually found the problem, but I did learn something.
The AJAX console also showed the problem, although lish was easier to use.
Thanks. I like the tools, but I think I will have to wait for it to bork again.