Linode froze for a while
linode.pts/0% uptime
15:21:25 up 41 days, 2:52, 1 user, load average: 22.75, 12.30, 5.35
linode.pts/0% cat /proc/io_status
io_count=5549012 io_rate=0 io_tokens=400000 token_refill=512 token_max=400000
Had someone else chewed up all the I/O on the host, or something?
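A loop like this (assuming /proc/io_status keeps that format) is handy for watching the token bucket while a freeze is happening; io_tokens sitting at token_max with io_rate=0, as above, would suggest this Linode isn't the one doing the I/O:

linode.pts/0% while true; do date; cat /proc/io_status; sleep 5; done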
5 Replies
You might have a runaway process, my friend, but it doesn't appear to be doing much, if any, I/O.
Do a "ps aux" to find the culprit (top might not run well under such load) and kill it.
…or issue a reboot and pray it doesn't happen again.
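For example (assuming the usual procps ps), something like this will show the worst CPU and memory hogs without needing top:

ps aux --sort=-%cpu | head -15
ps aux --sort=-%mem | head -15
kill <pid>            # or kill -9 <pid> if it ignores the first one

(--sort may not exist on very old ps versions; pipe through sort -k3 -rn instead.)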
This sort of thing used to happen with disturbing regularity on Linode hosts but it's been much, much better in the past 6 months to 1 year. Caker put in place all kinds of corrective measures (the I/O limiter is one) to help reduce this problem, and these measures have worked. Which is not to say that it never happens anymore, but it is pretty rare.
When a load spike is occurring, if you can get a hold of caker (on the IRC channels, or via a ticket) then often he can "take care" of the offending Linodes on the host and return the system to normal. If not, you'll have to wait it out; they typically last only a few minutes, though I have seen load spikes last half an hour, but I haven't seen such an event for a long time.
-Chris
@untitled9:
Um… why's your load average 22.75?
A load average of 25 merely means "25 jobs attempting to run".
If the I/O throughput is too low then even normal jobs (eg cron checking to see if any work is to be done) will slow down; because the I/O request is not satisfied, the job remains in the "attempting to run" state and so adds 1 to the load average.
In this case, disk I/O was effectively frozen, so most jobs (eg the console login process, ssh forking, cron, the web server, postfix checking the queue, postfix accepting mail, etc etc etc) froze and each added "1" to the load average.
A machine in this state is called "I/O bound".
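You can actually see this if you manage to get a shell at all: processes stuck waiting on I/O sit in state "D" (uninterruptible sleep), and every one of them counts toward the load average. Something along these lines (exact ps output varies a little between versions) will list them:

ps -eo state,pid,user,cmd | awk '$1 == "D"'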
Another cause of a high load average is being "CPU bound", where too many processes are trying to run and the CPU just can't satisfy all the requests.
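A quick way to tell the two apart (just a sketch; the column names here are from procps vmstat) is to watch vmstat for a few seconds:

vmstat 1 5

Lots of processes under "r" with us+sy near 100% points at CPU bound; lots under "b" with a big "wa" (I/O wait) and the CPU mostly idle points at I/O bound.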
It is possible to have a high load average and still get good performance; eg a job that just forks a child and terminates, and the child does the same. Processes are created and terminated very very quickly, so a large number pass through the run queue every second (making the load average look high) but the system remains perfectly responsive. (I did this a few years back on a Sun Sparc 20 and got a load average of over 30, just to prove a point to my manager… it was impossible to kill, so I had to reboot the machine!)
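Something along these lines (just a sketch, not the exact script) will do it; I've added a counter here so it eventually stops on its own, which the real thing evidently didn't:

#!/bin/sh
# Each copy launches one child in the background and exits straight away,
# so processes churn through the run queue far faster than they do real work.
N=${1:-100000}            # safety limit; the original had none, hence the reboot
[ "$N" -le 0 ] && exit 0
"$0" $((N - 1)) &
exit 0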
Unix systems are very dependent on a number of things, and bottlenecks can appear in unexpected places. I presume this is one reason why caker spent so much time on the I/O throttler: disk I/O is very important for the smooth running of a linode. Even if that linode isn't swapping (mine is nowhere near it; only 30Mb RAM used!), disk I/O performance is high on the critical path for performance tuning.
@bji:
When a load spike is occurring, if you can get a hold of caker (on the IRC channels, or via a ticket) then often he can "take care" of the offending Linodes on the host and return the system to normal. If not, you'll have to wait it out;
This is what I guessed, and it wasn't urgent enough for me to raise a ticket or to have anyone paged. I only posted to this forum because I was sure caker would spot it and reply, so I'd get my answer without raising a problem ticket :-)