What can hog CPU and starve disk/net?

I realize that's kind of a pathetic question :-)

I have had a Linode 360 running for about a week as the primary MX for several scattered computers. It is a fairly vanilla, minimal Gentoo installation whose main purpose is running qmail. I use a qmail-spp plugin to implement a simple screen against bogus incoming email by verifying RCPT TO addresses.
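For readers unfamiliar with this kind of screen, here is a minimal sketch of what a RCPT TO check can look like. It is written in Python for brevity, not the C the plugin in question uses, and it is not that plugin's code. It assumes, going by the qmail-spp documentation as I understand it, that the recipient arrives in the `SMTPRCPTTO` environment variable and that printing a line starting with `E` followed by an SMTP reply rejects the command. The address list and the names `VALID_RECIPIENTS` and `check_rcpt` are hypothetical.

```python
import os

# Hypothetical recipient list; a real plugin might read a file or a database.
VALID_RECIPIENTS = {"postmaster@example.com", "alice@example.com"}


def check_rcpt(rcpt, valid=VALID_RECIPIENTS):
    """Return the command line to print for qmail-spp, or "" to accept.

    Assumption: an "E<code> <text>" line rejects the current SMTP command
    with that reply, and printing nothing lets the recipient through.
    """
    addr = rcpt.strip().lower()
    if addr in valid:
        return ""
    return "E550 sorry, no mailbox here by that name (#5.1.1)"


def main():
    # Assumption: qmail-spp exports the RCPT TO argument as SMTPRCPTTO.
    command = check_rcpt(os.environ.get("SMTPRCPTTO", ""))
    if command:
        print(command)


if __name__ == "__main__":
    main()
```

A check like this runs once per RCPT TO command and then exits, so each SMTP session gets its own short-lived process.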

According to the logs (times in UTC), on Friday 04 Dec 2009 at about 0420, CPU usage shot up to 107%. It dropped to 100% at about 0530, and I noticed it at about 0745. Disk I/O and network traffic fell to zero for the entire period.

I tried ssh but could not get in, so I used the dashboard to reboot it. I should have tried LISH, but I am new to this and didn't think of it.

My first thought was that my little C plugin had gone into an infinite loop, but I don't see how that could have blocked everything else, including all disk I/O and net traffic. Further incoming port 25 connections would have started a new qmail-smtpd session. Besides, the plugin had been running for at least several hours with no problems.

Does anyone have any ideas on what could make a Linode virtual server go haywire like that, 100% (actually 105.73%!) CPU and zero disk/net?

6 Replies

That sounds like a kernel panic.

From the lish shell (e.g. ssh to linodexxxxx@citynameyyy.linode.com, then detach with ^A-d), run the "logview" command… this will show you the last ~250 lines from the previous boot, along with the last ~100 lines from the current boot. A kernel panic will be obvious if it's there.

I should have mentioned that I checked /var/log/messages and saw … nothing. The last log entry before hanging is from an iptables rule, the first one after reboot is syslog-ng startup.

@hoopycat:

From the lish shell (e.g. ssh to linodexxxxx@citynameyyy.linode.com, then detach with ^A-d), run the "logview" command…

@Scarecrow:

I should have mentioned that I checked /var/log/messages

These are not the same thing - if your linode was OOMing or panicked, then it would not be able to write to your log files, but it may very well have been able to write an error to the console which you could see with the lish logview.

Didn't realize that, but I'll try to remember if it happens again.

Any ideas about what might cause the problem? I don't see how my program could have caused all three symptoms at once.

You can still gather the "logview" data from lish now, as long as you've rebooted exactly once since the problem occurred. That will be the quickest way to figure out exactly what happened.

A halted kernel (e.g. one that has panicked but rebootonpanic is unset) will exhibit all of those symptoms… the question is, what halted the kernel? :-)
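To see whether a stock Linux kernel will halt or reboot after a panic, one option is to read the `kernel.panic` sysctl. The sketch below assumes that this standard sysctl is what governs the behaviour here; whether it corresponds exactly to the setting named above is an assumption on my part. The function names are hypothetical.

```python
from pathlib import Path


def describe_panic_setting(raw):
    """Interpret the contents of /proc/sys/kernel/panic.

    Per the kernel sysctl documentation: 0 means loop forever after a
    panic, a positive value means reboot after that many seconds, and a
    negative value means reboot immediately.
    """
    seconds = int(raw.strip())
    if seconds == 0:
        return "halt: kernel stays hung after a panic"
    if seconds > 0:
        return f"reboot {seconds}s after a panic"
    return "reboot immediately after a panic"


def read_panic_setting(path="/proc/sys/kernel/panic"):
    # Only works on a Linux system where this file is readable.
    return describe_panic_setting(Path(path).read_text())
```

Calling `read_panic_setting()` on the running system reports the current policy. A value of 0 would match the "halted kernel" case described above.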

I had two bogus entries in inittab (typos), and init couldn't respawn them fast enough.

Although I don't understand why that would lock up the CPU, since it tries a few times and then says it is waiting 5 minutes. I would think that 5 minutes would have been plenty of time to launch new smtp connections. Even if tcprules had died and itself could not handle incoming connections, why would sshd not have taken clients? Why would the CPU peg solid rather than for a split second then off for 5 minutes?
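The throttling described here can be modelled roughly as follows. This is a simplified sketch, not init's actual code. It assumes the sysvinit defaults as I recall them: an entry that respawns 10 times within 2 minutes is disabled for 5 minutes. The function name and the `run_time` parameter (how long the broken command lives before exiting) are hypothetical.

```python
MAXSPAWN = 10    # respawns tolerated inside one test window (assumed default)
TESTTIME = 120   # length of the test window in seconds (assumed default)
SLEEPTIME = 300  # how long a too-fast entry is disabled (assumed default)


def simulate_respawns(duration, run_time):
    """Return the times (in seconds) at which a failing entry is started.

    Simplified model: the command exits after run_time seconds and is
    restarted at once, unless it has already been started MAXSPAWN times
    in the current window, in which case it is disabled for SLEEPTIME.
    """
    starts = []
    now = 0.0
    window_start = 0.0
    count = 0
    while now < duration:
        if now - window_start >= TESTTIME:
            # The window expired without hitting the limit: start counting afresh.
            window_start = now
            count = 0
        if count >= MAXSPAWN:
            # "Respawning too fast": sit out the penalty, then reset.
            now += SLEEPTIME
            window_start = now
            count = 0
            continue
        starts.append(now)
        count += 1
        now += run_time
    return starts
```

Under this model, `simulate_respawns(600, 1)` gives ten starts in the first ten seconds, a five-minute gap, then ten more. That is a brief burst followed by idle time, which fits the doubt raised above that respawning alone would peg the CPU for hours.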

And why was it trying to re-init anyway after running for several days since the previous reboot?

I don't think I have actually found the problem, but I did learn something.

The AJAX console also showed the problem, although lish was easier to use.

Thanks. I like the tools, but I think I will have to wait for it to bork again.
