Linode froze for a while

My linode (on host21) was unresponsive to network and console activity for almost 5 minutes. Even the console login prompt didn't appear after connecting to lish, although I could happily connect to lish and view logs. When the machine finally returned control to me…

linode.pts/0% uptime
 15:21:25  up 41 days,  2:52,  1 user,  load average: 22.75, 12.30, 5.35

linode.pts/0% cat /proc/io_status
io_count=5549012 io_rate=0 io_tokens=400000 token_refill=512 token_max=400000

Had someone else chewed up all the I/O on the host, or something?

5 Replies

Um… why's your load average 22.75?

You might have a runaway process my friend, but it doesn't appear to be doing much, if any, IO.

Do a "ps aux" to find the culprit (top might not run well under such load) and kill it.

…or issue a reboot and pray it doesn't happen again.

Sounds to me like a load spike. These can happen occasionally as other Linodes thrash the system I/O. Your Linode will not be doing much of anything but will still suffer from the symptoms you describe: a huge load average (22.75), but with none of your own processes doing much and with no I/O limiting going on.
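For reference, the "no I/O limiting going on" reading of the `/proc/io_status` line in the original post can be checked mechanically. This is only a minimal sketch in Python. It assumes the fields describe a token bucket (`io_tokens` as the current balance, `token_max` as the cap) and that throttling only starts once the balance is used up; the function names are made up for illustration.

```python
def parse_io_status(line):
    """Turn 'key=value key=value ...' into a dict of ints."""
    return {key: int(value) for key, value in
            (field.split("=", 1) for field in line.split())}


def looks_throttled(status):
    # Assumption: the limiter only kicks in once the token balance is used up.
    return status["io_tokens"] <= 0


# The line pasted in the original post.
sample = ("io_count=5549012 io_rate=0 io_tokens=400000 "
          "token_refill=512 token_max=400000")
status = parse_io_status(sample)

# A full bucket (io_tokens == token_max) suggests this Linode was not limited.
print(status["io_tokens"] == status["token_max"], looks_throttled(status))
```

With the values from the post, the bucket is full and `io_rate` is 0, which fits the idea that the stall came from elsewhere on the host.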

This sort of thing used to happen with disturbing regularity on Linode hosts but it's been much, much better in the past 6 months to 1 year. Caker put in place all kinds of corrective measures (the I/O limiter is one) to help reduce this problem, and these measures have worked. Which is not to say that it never happens anymore, but it is pretty rare.

When a load spike is occurring, if you can get hold of caker (on the IRC channels, or via a ticket) then often he can "take care" of the offending Linodes on the host and return the system to normal. If not, you'll have to wait it out. Load spikes typically last only a few minutes. I have seen some that lasted half an hour, but not for a long time.

This was me searching for a swap thrasher. Each node was only paused for a minute at a time or so, while I isolated the offending node. Occasionally, even the limiter (due to the refill rates) won't catch a thrasher if there is other activity, like a resize (which was going on in this case). After the unpause, that high loadavg would have cleared quite quickly.

-Chris
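One way to read the remark about refill rates is with a toy token bucket. This is a sketch under loud assumptions, not the actual limiter code: it assumes each I/O request costs one token, that `token_refill` tokens are added per tick, and that a node is only throttled when its bucket runs dry. The class and helper names are hypothetical; the `token_max` and `token_refill` numbers are copied from the io_status output in the original post.

```python
class TokenBucket:
    def __init__(self, token_max, token_refill):
        self.token_max = token_max
        self.token_refill = token_refill
        self.tokens = token_max  # start with a full bucket

    def tick(self, requested):
        """Advance one interval; return how many requests were allowed."""
        self.tokens = min(self.token_max, self.tokens + self.token_refill)
        allowed = min(requested, self.tokens)
        self.tokens -= allowed
        return allowed


def ticks_until_throttled(requests_per_tick, limit=10_000):
    """Return the first tick at which a request is refused, or None."""
    bucket = TokenBucket(token_max=400000, token_refill=512)
    for tick in range(1, limit + 1):
        if bucket.tick(requests_per_tick) < requests_per_tick:
            return tick
    return None  # never throttled within `limit` ticks


print(ticks_until_throttled(500))   # steady I/O just under the refill rate
print(ticks_until_throttled(1000))  # roughly double the refill rate
```

In this model a node that stays just under the refill rate is never limited, and one that runs at about twice the refill rate still takes several hundred ticks to drain a full bucket, so a sustained thrasher can do damage before the limiter bites.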

@untitled9:

Um… why's your load average 22.75?

A load average of 25 merely means "25 jobs attempting to run".

If the I/O throughput is too low then even normal jobs (e.g. cron checking to see if any work is to be done) will slow down. Because the I/O request is not satisfied, the job remains in the "attempting to run" state, and so adds 1 to the load average.

In this case, disk I/O was effectively frozen and so most jobs (eg console login process, ssh forking, cron, web server, postfix checking the queue, postfix accepting mail etc etc etc) all froze and so all added "1" to the load average.

A machine in this state is called "I/O bound".

Another reason for a high load average could be "CPU bound", where too many processes are trying to run and the CPU just can't satisfy all the requests.
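To tell the two cases apart on a live system, it can help to count processes by state. On Linux the load average includes runnable tasks (state `R`) and tasks in uninterruptible sleep (state `D`, usually waiting on disk). The sketch below tallies state letters from `ps -eo state,comm` style output. The helper name is hypothetical and the sample text is made up for illustration; it is not output from the machine in this thread.

```python
from collections import Counter


def count_states(ps_output):
    """Count processes per state letter from `ps -eo state,comm` output."""
    counts = Counter()
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if fields:
            counts[fields[0][0]] += 1  # first letter of the state column
    return counts


# Made-up sample for illustration only.
sample = """S COMMAND
D cron
D sshd
D postfix
R ps
S init
"""
counts = count_states(sample)
print("blocked on I/O:", counts["D"], "runnable:", counts["R"])
```

Many `D` entries with few `R` entries points towards an I/O bound machine; many `R` entries points towards a CPU bound one.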

It is possible to have a high load average and still good performance; eg a job that just forks a child and terminates, and the child does the same. Processes will be created and terminated very very quickly, and so a large number will be in the run queue every second (so load average will look high) but the system remains perfectly responsive. (I did this a few years back on a Sun Sparc 20 and got a load average of over 30, just to prove a point to my manager… it was impossible to kill, so I had to reboot the machine!)

Unix systems are very dependent on a number of things, and bottlenecks can appear in unexpected places. I presume this is one reason why caker spent so much time on the I/O throttler: disk I/O is very important for the smooth running of a linode. Even if that linode isn't swapping (mine is nowhere near it; 30MB RAM used!), disk I/O performance is high on the critical path for performance tuning.

@bji:

When a load spike is occurring, if you can get a hold of caker (on the IRC channels, or via a ticket) then often he can "take care" of the offending Linodes on the host and return the system to normal. If not, you'll have to wait it out;

This is what I guessed, and it wasn't urgent enough for me to raise a ticket or to have anyone paged. I only posted to this forum because I was sure caker would spot it and reply, so I'd get my answer but without causing problem tickets :-)
