Catastrophic performance spike

I'm running a small webserver on a Linode 1024 with about 20 Drupal sites; none of them does more than 1k page views per day. For ~3 months the load on my server was about 0.5, but since the morning of March 5th the load has been about 2!!! I can also see in the Linode panel graph that the reported load has decreased a few times since the problems started, while the actual load on my server has gone way UP :shock:

weekly load graph on my box: http://cgp.multi.obin.org/detail.php?p=cpu&pi=0&t=cpu&h=multi.obin.org&s=604800&x=800&y=350

general info: http://cgp.multi.obin.org/host.php?h=multi.obin.org

30 days linode graph http://dl.dropbox.com/u/433776/multi-linode.png

I checked and don't see any new processes on my server. I asked support, and they say it's not because of strange usage on the Linode host. :?

I'm using the Barracuda stack: nginx, memcache, APC, Solr, and Aegir. https://github.com/omega8cc/nginx-for-drupal

Any idea what the reason might be? Thanks for any help.

19 Replies

A load of 2 on a 4 processor box is not catastrophic.

Catastrophic does seem a bit strong, but when a node that has established a baseline around 0.5 suddenly moves consistently up to 2, it certainly seems reasonable to ask what may have happened and whether something anomalous is going on.

To the OP: clearly your average CPU usage did have an uptick recently, but it appears incremental rather than night-and-day compared to prior usage. I'd probably start digging into your request logs to see if you've had an increase in inbound request load. Start with some simple line counting, and perhaps see if you can identify increases in certain URLs. If you have something like AWStats processing your logs, that can help.
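As a concrete starting point for that line counting, here is a minimal sketch. It assumes the common nginx/Apache "combined" access-log format; the helper names are mine, and the log path on your box is whatever your stack writes.

```shell
#!/bin/sh
# Sketch: both helpers assume a "combined" format access log, where field 4
# begins the "[day/month/year:time" timestamp and field 7 is the request URL.

# Requests per day -- a quick way to see whether inbound load increased.
requests_per_day() {
    awk '{sub(/^\[/, "", $4); split($4, t, ":"); print t[1]}' "$1" | sort | uniq -c
}

# Top N most-requested URLs -- to spot which pages got busier.
top_urls() {
    awk '{print $7}' "$1" | sort | uniq -c | sort -rn | head -n "${2:-10}"
}
```

For example: top_urls /var/log/nginx/access.log 20 (that path is an assumption; check your nginx config for where the Barracuda stack actually logs).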

If your request load has remained constant, then I'd start looking into how much processing time your requests are taking. Assuming a database in the back-end, maybe you've finally accumulated enough data that you're exceeding memory caches and having to hit the disk more significantly to process requests than before.
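If it is the database, MySQL's slow query log is a cheap way to confirm it. A sketch of the relevant my.cnf lines (these directive names are the stock MySQL 5.0/5.1 ones, and they also appear commented out in a default Debian my.cnf):

```
[mysqld]
log_slow_queries = /var/log/mysql/mysql-slow.log
long_query_time  = 2
log-queries-not-using-indexes
```

Restart MySQL after the change, then watch that log while the site feels slow.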

Otherwise your stats seem pretty good; you seem to be fitting within memory just fine. About the only thing that seemed a little odd to me was a pretty high average packet/byte count on your loopback interface. That could be totally normal for your configuration, but the graphs don't go far enough back in time to say whether it changed recently.

– David

Consider installing Munin; it can track various useful stats which can help you track down why things happen.

Well, I call it catastrophic because page load time sometimes grows from 100 ms to 5 seconds, and that's just a pure shame…

Thanks for the info, folks. I will look carefully into the logs.

@szczym:

Well, I call it catastrophic because page load time sometimes grows from 100 ms to 5 seconds, and that's just a pure shame…

Then you definitely have a problem, but the load average going up to 2 is not the most important indicator. That does mean your processors are doing a little more work, but it does not mean you are CPU bound. Definitely check your logs, check your memory usage, etc. Something is wrong with that kind of response increase, but it's not CPU.

After hours of searching through logs, asking friends for help, and two weeks of working on a very slow server (at times it felt like being on a modem 15 years ago), I got a reply on another forum, one dedicated to my software stack.

A developer there also had problems with Linode I/O, so I asked support to migrate me to a different host, even though they "don't see any problem on their side".

And the problem is gone; the server is working OK. But support is still denying it was their problem… :shock:

So who is responsible for storage problems at Linode? Should I write to Santa Claus asking for normal service and support?

Sorry for my English, I am (still) frustrated :(

BTW: http://arstechnica.com/business/raising-your-tech-iq/2011/02/storage-networking-and-blades-virtualization-hardware-choices.ars

Your Linode in question is consistently a high disk IO consumer. The highest on the host you just migrated off of, and the highest on the new host you just landed on.

Your Linode is often far into swap. This graph shows between 200MB and 350MB in swap:

http://cgp.multi.obin.org/detail.php?p=swap&pi=&t=swap&h=multi.obin.org&s=8035200&x=800&y=350

My guess is that after this Linode has been booted for a while, the swapping will continue and everything will slow down again.

Our staff is trying to help you. We're not trying to get one over on you or get away with something. When we tell you the host looks fine it's because it does. I know it's frustrating but we're on your side…

Based on my experience the problem is very likely that you're just consuming too much IO - probably due to swap thrashing. Swapping can absolutely ruin performance. Swap is slow. Get that swap usage down into the double digits. You need to tune your services to consume less RAM, or upgrade to a larger Linode.
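To put a number on that target, here is a minimal sketch that reads swap usage straight from /proc/meminfo (Linux-specific; the helper name is mine):

```shell
#!/bin/sh
# Minimal sketch: report how far into swap the box is, from /proc/meminfo.
# /proc/meminfo values are in kB.
swap_used_kb() {
    awk '/^SwapTotal:/ {t = $2} /^SwapFree:/ {f = $2} END {print t - f}' /proc/meminfo
}

echo "Swap in use: $(swap_used_kb) kB"
```

Getting swap "down into the double digits" (of MB) would mean this prints a value well under 102400 kB.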

Hope that helps,

-Chris

sz,

Of all the VPSes I've tested, Linode had the best I/O performance.

Even 1 Drupal site is a burden on any VPS, due to the high number of database queries it takes just to generate one page. I can't imagine what 20 Drupal sites on 1 VPS would do to I/O!!!

Consider that each drupal site is costing $2/month to host, and ask if this is a reasonable price for the service you are expecting.

Thank you, caker, for the info. Let's wait and see how it evolves :?

szczym, show your /etc/mysql/my.cnf

My /etc/mysql/my.cnf

#
# The MySQL database server configuration file.
#
# You can copy this to one of:
# - "/etc/mysql/my.cnf" to set global options,
# - "~/.my.cnf" to set user-specific options.
# 
# One can use all long options that the program supports.
# Run program with --help to get a list of available options and with
# --print-defaults to see which it would actually understand and use.
#
# For explanations see
# http://dev.mysql.com/doc/mysql/en/server-system-variables.html

# This will be passed to all mysql clients
# It has been reported that passwords should be enclosed with ticks/quotes
# escpecially if they contain "#" chars...
# Remember to edit /etc/mysql/debian.cnf when changing the socket location.
[client]
port        = 3306
socket        = /var/run/mysqld/mysqld.sock

# Here is entries for some specific programs
# The following values assume you have at least 32M ram

# This was formally known as [safe_mysqld]. Both versions are currently parsed.
[mysqld_safe]
socket        = /var/run/mysqld/mysqld.sock
nice        = 0

[mysqld]
#
# * Basic Settings
#

#
# * IMPORTANT
#   If you make changes to these settings and your system uses apparmor, you may
#   also need to also adjust /etc/apparmor.d/usr.sbin.mysqld.
#

user        = mysql
socket        = /var/run/mysqld/mysqld.sock
port        = 3306
basedir        = /usr
datadir        = /var/lib/mysql
tmpdir        = /tmp
skip-external-locking
#
# Instead of skip-networking the default is now to listen only on
# localhost which is more compatible and is not less secure.
bind-address        = 127.0.0.1
#
# * Fine Tuning
#
key_buffer        = 16M
max_allowed_packet    = 16M
thread_stack        = 192K
thread_cache_size       = 8
# This replaces the startup script and checks MyISAM tables if needed
# the first time they are touched
myisam-recover         = BACKUP
#max_connections        = 100
#table_cache            = 64
#thread_concurrency     = 10
#
# * Query Cache Configuration
#
query_cache_limit    = 1M
query_cache_size        = 16M
#
# * Logging and Replication
#
# Both location gets rotated by the cronjob.
# Be aware that this log type is a performance killer.
# As of 5.1 you can enable the log at runtime!
#general_log_file        = /var/log/mysql/mysql.log
#general_log             = 1

log_error                = /var/log/mysql/error.log

# Here you can see queries with especially long duration
#log_slow_queries    = /var/log/mysql/mysql-slow.log
#long_query_time = 2
#log-queries-not-using-indexes
#
# The following can be used as easy to replay backup logs or for replication.
# note: if you are setting up a replication slave, see README.Debian about
#       other settings you may need to change.
#server-id        = 1
#log_bin            = /var/log/mysql/mysql-bin.log
expire_logs_days    = 10
max_binlog_size         = 100M
#binlog_do_db        = include_database_name
#binlog_ignore_db    = include_database_name
#
# * InnoDB
#
# InnoDB is enabled by default with a 10MB datafile in /var/lib/mysql/.
# Read the manual for more InnoDB related options. There are many!
#
# * Security Features
#
# Read the manual, too, if you want chroot!
# chroot = /var/lib/mysql/
#
# For generating SSL certificates I recommend the OpenSSL GUI "tinyca".
#
# ssl-ca=/etc/mysql/cacert.pem
# ssl-cert=/etc/mysql/server-cert.pem
# ssl-key=/etc/mysql/server-key.pem

[mysqldump]
quick
quote-names
max_allowed_packet    = 16M

[mysql]
#no-auto-rehash    # faster start of mysql but no tab completition

[isamchk]
key_buffer        = 16M

#
# * IMPORTANT: Additional settings that can override those from this file!
#   The files must end with '.cnf', otherwise they'll be ignored.
#
!includedir /etc/mysql/conf.d/

Use this script to optimize my.cnf:

http://www.day32.com/MySQL/tuning-primer.sh

```
wget http://www.day32.com/MySQL/tuning-primer.sh
```

The script requires bc:

```
apt-get install bc
```

To run the script, use:

```
sh tuning-primer.sh
```

The script will just show you tips about optimization; nothing will be changed. It's not my script.

Read carefully what is written in the section "memory usage".

My tips without the script (in the [mysqld] section):

```
key_buffer           = 256M
max_connections      = 30
table_cache          = 512
query_cache_limit    = 20M
query_cache_size     = 32M
max_sort_length      = 20
low_priority_updates = 1
```

256M is 25% of 1024M (your Linode's RAM), so if you have a different amount of RAM, change that first value accordingly.

After any changes you will need to restart MySQL:

```
/etc/init.d/mysql restart
```

After optimizing my.cnf, run 'top' on the command line, then press 'c', and show the result (to clarify which process eats memory).

Thank you everybody for the performance tips; so far my server is performing very well on the new host. I will investigate MySQL tuning, but first I would like to be assured that Linode will not put me on slow storage again (and none of you, I hope).

@szczym:

Thank you everybody for the performance tips; so far my server is performing very well on the new host. I will investigate MySQL tuning, but first I would like to be assured that Linode will not put me on slow storage again (and none of you, I hope).

They did not "put you on slow storage". Your machine started swapping, which is going to be slow on a VPS. YOU need to figure out what is going wrong with your unmanaged VPS, not linode support. There are plenty here who will be happy to help you if you ask nicely, but throwing around accusations and demands is unlikely to get you such help.

Excuse me, OZ, I did not explain myself properly. I don't want to be taken for a troll or someone who is throwing around accusations and demands; I know that will get me nowhere. I just would like to be sure about those 2 weeks, and to ask Linode to implement a process that will make I/O problems visible to support staff.

My server was swapping before the 5th of March and is swapping now at its "standard" rate. It is indeed short on memory and I need to fix that, hopefully with the help of the community. Maybe I will be able to help someone else in the future.

What I gather from my logs is this: on the 5th of March, the data link from my host to storage got busy, and then all disk operations slowed down. I could have fought forever with tuning and it would not have helped, because the disk was slow.

Monthly graph of I/O wait:

http://cgp.multi.obin.org/detail.php?p=disk&pi=xvda&t=disk_time&h=multi.obin.org&s=2678400&x=800&y=350

Monthly graph of swap:

http://cgp.multi.obin.org/detail.php?p=swap&pi=&t=swap&h=multi.obin.org&s=2678400&x=800&y=350

Monthly graph of memory consumption:

http://cgp.multi.obin.org/detail.php?p=memory&pi=&t=memory&h=multi.obin.org&s=2678400&x=800&y=350

Looking at it from my (non-sysadmin) point of view: if the slow performance were because the VPS was running out of memory, it would be visible as higher memory consumption and/or higher swap usage. Neither of those occurred. Also, I did not change anything on my system on the 5th of March.

Your graphs indeed seem to show more or less consistent memory and swap usage. However, swap usage (how much data you have in there) is different from swap activity (how much data you're moving in and out of there per second). It's okay to have high swap usage if there isn't much activity, but a combination of high swap usage and activity is bad.

To see how much swap activity you're having, you can run vmstat 5 for a minute or two, hit Ctrl+C to stop it, then look at the "si" and "so" columns. If everything is OK, those values should be really small. Run the test when your server is fast, and again when it's slow. Post the screenshots and I'm sure somebody will be able to figure out what's going on.
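If watching vmstat interactively is awkward, roughly the same information can be sampled directly from the kernel's counters. A minimal sketch (Linux-specific; the function name is mine, and note these counters are in pages, whereas vmstat's si/so columns report kB):

```shell
#!/bin/sh
# Sketch: sample the cumulative swap-in/swap-out page counters in
# /proc/vmstat twice and report the combined per-second rate -- the
# page-level equivalent of vmstat's "si" and "so" columns added together.
swap_pages_per_s() {
    interval=${1:-5}
    a=$(awk '/^pswpin |^pswpout / {s += $2} END {print s}' /proc/vmstat)
    sleep "$interval"
    b=$(awk '/^pswpin |^pswpout / {s += $2} END {print s}' /proc/vmstat)
    echo $(( (b - a) / interval ))
}

echo "$(swap_pages_per_s 2) pages/s swapped in+out"
```

A busy-but-healthy box should show values near zero; sustained hundreds or thousands of pages per second is the thrashing described above.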

Here's another test that you can run. Go to the "extras" menu and buy another gig of RAM. Reboot and see if the problem goes away. If it does, you should probably upgrade to a bigger plan. Remember that although only 0.5GB is shown as "used" on the graph, those yellow/blue buffers and cache are also very important. Linux needs plenty of space for them. (If it doesn't work out, you can cancel the extra RAM after a day or two. A prorated refund will be immediately applied to your account.)

Thank you, hybinet. Right now, after Linode moved me to another host, my performance problems are more or less gone. :D

@szczym:

Thank you, hybinet. Right now, after Linode moved me to another host, my performance problems are more or less gone. :D
Just so you realize: presuming all other things remained similar in your own application stack, that was just luck.

Total I/O bandwidth to local storage is pretty much the same on any Linode host, so if you are doing the same I/O on the old and new hosts but getting better performance from the latter, it likely just means that the other guests on your new host are using less of it than those on your old host. But nothing guarantees that will remain the case, especially if your current scenario was helped, if only in part, by being moved to a newer host which may not be fully occupied yet.

The I/O bandwidth is shared fairly among those guests trying to use it, but is probably the most constrained resource. So if your current application stack requires a significant amount of I/O (which you would need to determine) you may just have bought a little time until you run into more contention, especially if the host you were moved to was newer and thus has fewer guests at the moment.
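Determining that I/O requirement doesn't need extra tools. A rough sketch that samples /proc/diskstats twice (the helper name is mine; on the OP's Linode the device would be xvda, per the graph URLs):

```shell
#!/bin/sh
# Sketch: estimate this guest's own disk I/O rate from /proc/diskstats.
# Field 3 is the device name; fields 6 and 10 are cumulative sectors
# read and written (512-byte sectors).
io_kb_per_s() {
    dev=$1; interval=${2:-5}
    a=$(awk -v d="$dev" '$3 == d {print $6 + $10}' /proc/diskstats)
    sleep "$interval"
    b=$(awk -v d="$dev" '$3 == d {print $6 + $10}' /proc/diskstats)
    echo $(( (b - a) * 512 / 1024 / interval ))
}

# Take the first listed device so the sketch runs anywhere; substitute
# xvda (or your actual device) on a Linode.
dev=$(awk 'NR == 1 {print $3}' /proc/diskstats)
echo "$(io_kb_per_s "$dev" 2) kB/s read+written on $dev"
```

Run it while the server feels slow and again during a quiet period, and compare.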

Now, it could also have been that it was some other guest on your old host that was a heavy I/O user which can adversely impact even modest users. I had a Linode for example that would consistently get into large I/O wait percentages even though it barely did any I/O itself and never swapped. But given Caker's comment about your Linode's typical usage compared to others on its host (both old and new) it seems that you are the heavy user in both places. I suspect that odds favor your performance degrading over time. Just realize that has nothing to do with slow storage per-se, just that it's a shared resource that your setup needs a lot of, which may not always be available when split among others on your host.

If I were in your shoes, I'd use the "reprieve" you have gotten by moving hosts to analyze and tune your application stack to reduce the I/O requirements as much as possible, making it more likely your performance will remain good over time. If the issue then happens again, you'll know that you're about as efficient as you can be and might need to consider a plan upgrade instead.

– David

I only found this forum because I'm always googling around to see who's using collectl, and I saw a recent post that mentioned it. I did read this thread with interest, and all I can say is to beware of graphs that show coarse data. I'm not saying any of the analysis is wrong, but basing it on coarse sampling data can be risky.

I've looked at some of the graphs on this topic and see a multi-month plot followed by a declaration that everything is fine. Personally I'd never be willing to make that call on such minimal data. When a single blip on a graph represents multiple hours or more, how can you trust it to mean anything?

Consider those who run sar with a 10-minute sampling rate. I'd claim they too are fooling themselves. What if there's a 2-minute CPU spike of 100%? They'll never see it, and during those 2 minutes the system will be crawling. Same thing with networks, disks, etc.

That's the reason collectl's sampling rate is 10 seconds, and sometimes I even run it at 1 second. And before anyone gets all excited and says that will generate too much load, let me say that collectl uses less than 0.1% of the CPU at the 10-second rate. Since all these tools have about the same level of overhead, use whatever tool you like at that rate. All you need to do is run it as a daemon and forget it's there until you have a problem. Then you have enough detail to see what is really happening.

But now there's the problem of plotting the data. I also see all those 'pretty' plots rrd draws, BUT they are far from accurate if you throw a lot of data at them, because they 'normalize' the data and as a result information is lost.

I say forget pretty and use a tool like gnuplot. At the very least, if you have 8000 data points (that's one per 10 seconds) and 1 of them is a spike, you WILL see it, and for my money (and this stuff is all free) I'll go with accurate over pretty every time.
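For anyone wanting to try that, here is a minimal sketch of a gnuplot script that draws every sample. The file names (cpu.dat, cpu.gp, cpu.png) are hypothetical, and the two-column "sample value" data layout is an assumption; feed it whatever your collector writes.

```shell
#!/bin/sh
# Sketch: generate a gnuplot script that plots every data point with lines,
# so even a single-interval spike stays visible in the output image.
cat > cpu.gp <<'EOF'
set terminal png size 800,350
set output 'cpu.png'
set xlabel 'sample'
set ylabel 'cpu %'
plot 'cpu.dat' using 1:2 with lines title 'cpu'
EOF
echo "now run: gnuplot cpu.gp"
```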

-mark
