One week in the life of a Linode64 on host5

Hi all. I started keeping track of the load on my Linode last week when the system was being unresponsive. I've accumulated a week's worth of load data and it's rather interesting.

The way this works is: I run a shell script that appends a line with the date and the current "uptime" value to a file, sleeps for 15 seconds, then repeats. Thus I have accumulated load information for my Linode every 15 seconds for a week. I wrote another script which plots this data using gnuplot. Here is the result:

![](http://www.ischo.com/uptime/plot.2004.02.24.jpg)
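For the curious, the collection loop is essentially just this, a minimal sketch (the log path is an example; mine differs):

```
#!/bin/sh
# Append a timestamped "uptime" sample to a log file every 15 seconds.
LOG=$HOME/uptime/uptime.log    # example path; use whatever you like

while true; do
    echo "$(date '+%Y-%m-%d %H:%M:%S') $(uptime)" >> "$LOG"
    sleep 15
done
```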

My system is almost completely unloaded; on its own, I would never expect its load to go over 1, and certainly never 2. I would attribute all of the spikes to activity of other Linodes on the host system.

It's nice to see that the uptime graph accurately reflects the load that occurs on host5 every night at about 1:20 am (note the consistent spikes to load 3 or 4 at this time). What is also interesting is that last night's load was quite low - will we see an improvement in this hotspot in the future? Only time will tell …

Also notice the spikes early last week. Holy crap, I've never seen a load of 18+ on a Linux box before! I started keeping this information around the time of that first spike to load 4, because I was having some performance problems; those problems are readily reflected in the big spikes in the graph over the next day or so.

But it's really quieted down since then and is more like what I have traditionally found to be the performance on host5 - quite good 95% of the time, but with spikes late at night when everyone's "updatedb" cron jobs run …

BTW, I know that there are more elegant ways to accumulate this data than my stupid little script (MRTG/rrdtool?), but this took me all of 5 minutes to hack together, and I haven't set aside a block of time yet to install/configure those other tools … pointers would be most appreciated!

16 Replies

Interesting!

A little history on this topic:

All of the cron jobs (mostly just updatedb and makewhatis and other "not-really-worth-it" jobs) were left at their default times when I first created the distro templates (used by the distro wizard). So on the first few hosts that were deployed, every customer's Linux install ran its cron jobs at the same time.

After realizing this, I modified the majority of the template distros and moved the cron jobs to weekly. So now it just gets hammered on Sundays. The two biggest problem hosts are host3 (Linode 128) and host5 (Linode 64). The hosts that were added later don't seem to exhibit this problem at all.

The only reason why an "idle" Linode's loadavg goes up is because of processes blocked (waiting) for disk access. Each process waiting for the disk adds 1 to loadavg.
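(A quick way to see this from inside a Linode, just as a rough sketch: processes blocked on the disk show up in ps with state "D", uninterruptible sleep, and each one counts toward the load average while it waits.)

```
# List processes currently stuck in uninterruptible (disk) sleep, and count them;
# each one adds 1 to the load average while it waits.
ps -eo stat,pid,comm | awk '$1 ~ /^D/ { print; n++ } END { print n+0, "process(es) waiting on I/O" }'
```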

I don't really like messing with people's filesystems, but I've considered a script which edits the FS the next time the Linode is rebooted. Other options include sending an email to those on host3 and host5 with a few commands they can run to lighten the load.

The biggest reason why I'm pushing 2.6 on the hosts is its fairer I/O scheduler. Still, though, running updatedb and the like and sucking up disk bandwidth is wasteful.
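(For reference, and hedging a bit since it depends on the kernel build: on 2.6 you can pick the elevator at boot with something like elevator=cfq on the kernel command line, and kernels that expose it through sysfs let you check or switch it per device. The device name below is just an example.)

```
# Show the I/O schedulers available for a disk; the active one is in brackets.
cat /sys/block/hda/queue/scheduler
# e.g.  noop anticipatory deadline [cfq]

# On kernels that allow runtime switching, select a different one:
echo cfq > /sys/block/hda/queue/scheduler
```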

I'm open to suggestions.

-Chris

@caker:

I'm open to suggestions.

Chris,

Thank you for your response!

Three things:

1. I think you should send an email out to customers on host3 and host5, rather than modifying people's filesystems without their knowledge. I think that a round of emails just letting people know how they can benefit from changing their cron job times would be sufficient to solve most of the problem (after all, it's for their own good too - their own updatedb will run faster at a time when the Linode host is not loaded down).

2. What about "randomizing" the cron times on a disk image before deploying it for a particular Linode? I imagine that right now, when a user deploys a particular distribution, the host just copies the filesystem into their UML partition "file" and then resizes it. What about adding a step where the filesystem is mounted and the cron times are "randomized" - a script that opens the filesystem and writes a randomized /etc/crontab into it (see the sketch after this list)? By "randomized" I mean that daily scripts run at a random time between, say, 2 and 4 am EST, weeklies at a random time on either Saturday or Sunday between 4 and 6 am, etc.

3. I think that if step 2 were done, then updatedb and the other jobs that are normally daily should be moved back to daily.
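For point 2, here's roughly what I have in mind, purely as a sketch: it assumes a Red Hat-style /etc/crontab that drives run-parts, and the image path and mount point are made up.

```
#!/bin/bash
# Hypothetical deployment-time step: mount the freshly copied disk image and
# write a crontab whose daily/weekly run times are randomized per Linode.
IMG=/path/to/linode.img    # example image path
MNT=/mnt/deploy            # example mount point

mount -o loop "$IMG" "$MNT"

DAILY_HOUR=$((2 + RANDOM % 3))       # daily run lands between 02:00 and 04:59
DAILY_MIN=$((RANDOM % 60))
WEEKLY_DOW=$((6 * (RANDOM % 2)))     # 0 = Sunday, 6 = Saturday
WEEKLY_HOUR=$((4 + RANDOM % 2))      # weekly run lands between 04:00 and 05:59
WEEKLY_MIN=$((RANDOM % 60))

cat > "$MNT/etc/crontab" <<EOF
SHELL=/bin/sh
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin

# run-parts, with per-Linode randomized times
$DAILY_MIN $DAILY_HOUR * * * root run-parts /etc/cron.daily
$WEEKLY_MIN $WEEKLY_HOUR * * $WEEKLY_DOW root run-parts /etc/cron.weekly
EOF

umount "$MNT"
```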

I hope that my graph demonstrates that for 95% of the time, Linode performance is really awesome. It's just those predictable spikes that I'd like to see if we can do something about, and I appreciate your enthusiasm in this endeavor!

Best wishes,

Bryan

Here's this week's graph:

![](http://www.ischo.com/uptime/plot.2004.03.02.jpg)

What's very interesting is that the nightly spike increases in severity linearly up to a maximum on Feb. 27 (Friday), and then decreases linearly from there. Very strange.

What's the status on addressing this issue? Have emails been sent out to host5 Linode owners asking them to change their cron times?

(Edited to change graph to use the same scale as the previous graph; I'll use 0 - 20 as my load scale from now on so that all graphs can be easily compared)

@bji:

What's the status on addressing this issue? Have emails been sent out to host5 Linode owners asking them to change their cron times?
Not yet. What I need to do is go through each distro and figure out which files to move and where. Once I have a set of instructions, I'll send out the emails.
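For most of the Red Hat-style layouts it will probably boil down to a couple of moves like the following; the exact script names vary by distro, so treat these as examples:

```
# Move the heavyweight daily jobs into the weekly run
# (names differ between distros; these are the Red Hat-style ones).
mv /etc/cron.daily/slocate.cron    /etc/cron.weekly/
mv /etc/cron.daily/makewhatis.cron /etc/cron.weekly/
```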

-Chris

Still no improvement.


@bji:

Still no improvement.

Whoops, I didn't realize that I have to keep those images hosted on my server in order for them to show up in this forum. I lost the most recent graph because it wasn't backed up, but I restored the other graphs. I lost about half a day's worth of data at the end of the most recent graph too … my backups were only up to last night … sorry :(

Those results seem very weird. What could be causing the nightly load to be symmetrical like that?

-Mike

@myrealbox:

Those results seem very weird. What could be causing the nightly load to be symmetrical like that?

-Mike

Didn't you read the posts above? It's caused by everyone's "updatedb" cron jobs running at the same time; this cron job puts a heavy disk burden on the Linode, and a bunch of them at once is really bad for the whole system.

The very best solution would be a kernel which somehow fairly allocated disk bandwidth, so that no one would ever "starve" for disk I/O like this.

A secondary solution would be to change the cron times so that they are staggered instead of everyone running them at the same time. I have changed my Linode's cron time, but the majority of people on host5 seem to be oblivious to this problem and have not done so.
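For anyone on host5 wondering what to actually change: on distros with a Red Hat-style /etc/crontab it's just a matter of editing the run-parts lines and picking your own minute and hour, something like this (the times are examples):

```
# /etc/crontab -- before (a typical default)
02 4 * * * root run-parts /etc/cron.daily

# after: pick a minute/hour of your own so it doesn't collide with everyone else's
37 2 * * * root run-parts /etc/cron.daily
```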

@bji:

What's the status on addressing this issue? Have emails been sent out to host5 Linode owners asking them to change their cron times?
Emails sent to host3 and host5 members. I've asked people to ack back when they make a change – we'll see how much of a difference it makes. Of course, tonight/tomorrow morning is cron.weekly day, so we might have to wait a few days to see.

Looking forward to your graphs after people make these changes…

Also, I'm working on the host-reboot-to-2.6 schedule. So look for that in the next week or two. 2.6 has been running great on host18 and host19. This is big! :)

-Chris

@bji:

Didn't you read the posts above? It's caused by everyone's "updatedb" cron jobs running at the same time; this cron job puts a heavy disk burden on the Linode, and a bunch of them at once is really bad for the whole system.

But as far as I can see, this does not explain why the load is cyclic and always symmetrical about a particular, but differing, day of the week.

-Mike

@myrealbox:

But as far as I can see, this does not explain why the load is cyclic and always symmetrical about a particular, but differing, day of the week.

-Mike

Ah. Yes, that is an interesting question. Sorry, I didn't understand what you meant before. I'm looking forward to seeing people change their cron times after Chris' email (I hope he asked people to randomize their cron minutes, and possibly hours, rather than just moving stuff to cron.weekly); with luck we'll never have to figure out why the graphs look like that :)

Here it is:

![](http://www.ischo.com/uptime/plot.2004.03.16.jpg)

It's hard to tell if there has been any improvement since Caker's email went out. The spikes are small but there were periods of smaller spikes in previous graphs as well. I hope to see them getting even smaller next week :) …

@bji:

It's hard to tell if there has been any improvement since Caker's email went out. The spikes are small but there were periods of smaller spikes in previous graphs as well. I hope to see them getting even smaller next week :)
Good. We'll get a good sample over the next week or so before host5 is rebooted onto 2.6.

-Chris

There has been a definite improvement; this week's peak spike is small. The spikes are still regular but they are definitely getting smaller overall. Will a 2.6 kernel for host5 help? Let's keep our fingers crossed …


It seems to me that what would be cool is some kind of distributed sequencer, so that the individual Linodes on each machine have a way of communicating and not running their cron jobs at the same time. It could be a command; I'll call it distseq for purposes here. It would work like this:

Instead of running a command like "updatedb" at a certain time, run something like: distseq --node linode31 --command "updatedb -blah"

This distseq command would:

1. contact some server;
2. ask the server if it's OK to run the command now, and wait until it says so;
3. run the command;
4. report to the server that it's done.

If step 1 fails (network problem, etc.), just run the command anyway.

There would be a server component, call it distseqd, running somewhere, possibly on the host itself or on one global host. It would:

1. wait for connections;
2. when a connection is received, mark the command as either requested or finished.

In parallel with that, it would:

1. check for nodes that are not currently running a job;
2. if there are outstanding requests for such a node, send it a message saying it's OK to run its command.
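Purely hypothetical, but the client side could be as small as something like this (I've simplified the flags to positional arguments for the sketch); it assumes a distseqd listening on some made-up host and port, which is the part that would still need to be written:

```
#!/bin/sh
# distseq -- client-side sketch only; distseqd does not exist yet.
# Usage: distseq <node-name> "command to run"
SEQHOST=${SEQHOST:-host5-sequencer.example.com}   # made-up server name
SEQPORT=${SEQPORT:-7777}                          # made-up port
NODE=$1
CMD=$2

# Steps 1+2: ask the sequencer for a turn and block (up to 10 minutes) for
# its reply. If the server is unreachable, fall through and run anyway.
printf 'request %s\n' "$NODE" | nc -w 600 "$SEQHOST" "$SEQPORT" >/dev/null 2>&1

# Step 3: run the job.
sh -c "$CMD"

# Step 4: tell the sequencer we're done so it can let the next node go.
printf 'done %s\n' "$NODE" | nc -w 10 "$SEQHOST" "$SEQPORT" >/dev/null 2>&1
```

Usage from cron would then be something like distseq linode31 "updatedb" instead of calling updatedb directly.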

Something along those lines. Is there something out there that does this? It just seems that the overhead of running 32 or however many intensive jobs simultaneously on a 2-processor machine will always be a problem, even with more efficient OS implementations. I just timed my updatedb; it only takes like 30 seconds, so running 32 of 'em one after the other (32 × 30 s ≈ 16 minutes) wouldn't take more than half an hour.

Is there something out there that already does this? It seems like something that all UML implementors would have to deal with. If there isn't, and if there is interest, I can try to code something up. I would ask for someone to audit my code, since there would be obvious security issues with something like this and I wouldn't want to open any holes.

There has been no real improvement in the spikes. When are we going to get kernel 2.6 on host5???

