Catastrophic performance spike
Weekly load graph on my box:
General info:
30-day Linode graph:
I checked and don't see any new processes on my server. I asked support, and it's not because of strange usage by other Linodes on the host.
I'm using the Barracuda stack: nginx, memcache, APC, Solr, and Aegir.
Do you have any idea what the reason could be? Thanks for any help.
To the OP: clearly your average CPU usage did have an uptick recently, but it appears incremental rather than night-and-day compared to prior usage. I'd start digging into your request logs to see whether you've had an increase in inbound request load. Start with some simple line counting, and see if you can identify increases in certain URLs. If you have something like awstats processing your logs, that can help.
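For example, a rough first pass over an nginx access log might look like this (the log path and the combined log format are assumptions, not something from the post):
```
# Requests per day, to spot an overall increase in traffic.
awk '{print $4}' /var/log/nginx/access.log | cut -d: -f1 | sort | uniq -c

# Most-requested URLs, to see whether particular pages suddenly got hot.
awk '{print $7}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20
```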
If your request load has remained constant, then I'd start looking into how much processing time your requests are taking. Assuming a database in the back-end, maybe you've finally accumulated enough data that you're exceeding memory caches and having to hit the disk more significantly to process requests than before.
Otherwise your stats seem pretty good. You seem to be fitting within memory just fine. About the only thing that seemed a little odd to me was a pretty high average packet/byte count on your loopback interface, though that could be totally normal for your configuration; the graphs don't go far enough back in time to say whether it changed recently.
– David
Thanks for the info, folks. I will look carefully into the logs.
@szczym:
Well, I call it catastrophic because the page load time sometimes grows from 100 ms to 5 seconds, and that's just a pure shame…
Then you definitely have a problem, but the load average going up to 2 is not the most important indicator. That does mean your processors are doing a little more work, but it does not mean you are CPU bound. Definitely check your logs, check your memory usage, etc. Something is wrong with that kind of response increase, but it's not CPU.
Other developers out there have also had problems with Linode I/O, so I asked support to migrate me to a different host, even though they "don't see any problem on their side".
And the problem is gone; the server is working OK. But support is still denying it was their problem…
So who is responsible for storage problems at Linode? Should I write to Santa Claus asking for normal service and support?
Sorry for my English, I am (still) frustrated.
BTW:
Your Linode is often far into swap. This graph shows between 200 MB and 350 MB of swap in use:
My guess is that after this Linode has been booted for a while, swapping will continue and everything will slow down again.
Our staff is trying to help you. We're not trying to get one over on you or get away with something. When we tell you the host looks fine it's because it does. I know it's frustrating but we're on your side…
Based on my experience, the problem is very likely that you're just consuming too much I/O, probably due to swap thrashing. Swapping can absolutely ruin performance. Swap is slow. Get that swap usage down into the double digits. You need to tune your services to consume less RAM, or upgrade to a larger Linode.
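A couple of quick checks for that (just the usual tools, nothing specific to this thread):
```
# Overall memory and swap usage, in megabytes.
free -m

# Processes sorted by resident memory, biggest first.
ps aux --sort=-rss | head -n 15
```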
Hope that helps,
-Chris
Of all the VPSes I've tested, Linode had the best I/O performance.
One Drupal site is a burden on any VPS due to the high number of database queries it takes just to generate one page. I can't imagine what 20 Drupal sites on one VPS would do to I/O!
Consider that each Drupal site is costing $2/month to host, and ask whether that is a reasonable price for the service you are expecting.
```
#
# The MySQL database server configuration file.
#
# You can copy this to one of:
# - "/etc/mysql/my.cnf" to set global options,
# - "~/.my.cnf" to set user-specific options.
#
# One can use all long options that the program supports.
# Run program with --help to get a list of available options and with
# --print-defaults to see which it would actually understand and use.
#
# For explanations see
# http://dev.mysql.com/doc/mysql/en/server-system-variables.html
# This will be passed to all mysql clients
# It has been reported that passwords should be enclosed with ticks/quotes
# especially if they contain "#" chars...
# Remember to edit /etc/mysql/debian.cnf when changing the socket location.
[client]
port = 3306
socket = /var/run/mysqld/mysqld.sock
# Here are entries for some specific programs
# The following values assume you have at least 32M ram
# This was formerly known as [safe_mysqld]. Both versions are currently parsed.
[mysqld_safe]
socket = /var/run/mysqld/mysqld.sock
nice = 0
[mysqld]
#
# * Basic Settings
#
#
# * IMPORTANT
# If you make changes to these settings and your system uses apparmor, you may
# also need to adjust /etc/apparmor.d/usr.sbin.mysqld.
#
user = mysql
socket = /var/run/mysqld/mysqld.sock
port = 3306
basedir = /usr
datadir = /var/lib/mysql
tmpdir = /tmp
skip-external-locking
#
# Instead of skip-networking the default is now to listen only on
# localhost which is more compatible and is not less secure.
bind-address = 127.0.0.1
#
# * Fine Tuning
#
key_buffer = 16M
max_allowed_packet = 16M
thread_stack = 192K
thread_cache_size = 8
# This replaces the startup script and checks MyISAM tables if needed
# the first time they are touched
myisam-recover = BACKUP
#max_connections = 100
#table_cache = 64
#thread_concurrency = 10
#
# * Query Cache Configuration
#
query_cache_limit = 1M
query_cache_size = 16M
#
# * Logging and Replication
#
# Both locations get rotated by the cron job.
# Be aware that this log type is a performance killer.
# As of 5.1 you can enable the log at runtime!
#general_log_file = /var/log/mysql/mysql.log
#general_log = 1
log_error = /var/log/mysql/error.log
# Here you can see queries with especially long duration
#log_slow_queries = /var/log/mysql/mysql-slow.log
#long_query_time = 2
#log-queries-not-using-indexes
#
# The following can be used as easy to replay backup logs or for replication.
# note: if you are setting up a replication slave, see README.Debian about
# other settings you may need to change.
#server-id = 1
#log_bin = /var/log/mysql/mysql-bin.log
expire_logs_days = 10
max_binlog_size = 100M
#binlog_do_db = include_database_name
#binlog_ignore_db = include_database_name
#
# * InnoDB
#
# InnoDB is enabled by default with a 10MB datafile in /var/lib/mysql/.
# Read the manual for more InnoDB related options. There are many!
#
# * Security Features
#
# Read the manual, too, if you want chroot!
# chroot = /var/lib/mysql/
#
# For generating SSL certificates I recommend the OpenSSL GUI "tinyca".
#
# ssl-ca=/etc/mysql/cacert.pem
# ssl-cert=/etc/mysql/server-cert.pem
# ssl-key=/etc/mysql/server-key.pem
[mysqldump]
quick
quote-names
max_allowed_packet = 16M
[mysql]
#no-auto-rehash # faster start of mysql but no tab completition
[isamchk]
key_buffer = 16M
#
# * IMPORTANT: Additional settings that can override those from this file!
# The files must end with '.cnf', otherwise they'll be ignored.
#
!includedir /etc/mysql/conf.d/
```
```
wget http://www.day32.com/MySQL/tuning-primer.sh
```
The script requires bc:
```
apt-get install bc
```
To run the script:
```
sh tuning-primer.sh
```
The script will just show you tips about optimization; nothing will be changed. It's not my script.
Read carefully what is written in the "memory usage" section.
My tips without the script (in the [mysqld] section):
```
key_buffer = 256M
max_connections = 30
table_cache = 512
query_cache_limit = 20M
query_cache_size = 32M
max_sort_length = 20
low_priority_updates = 1
```
256M is 25% of 1024M (your Linode's RAM), so if you have a different amount of RAM, change the first value accordingly.
After any changes you will need to restart MySQL:
```
/etc/init.d/mysql restart
```
After optimizing my.cnf, run `top` on the command line, press `c`, and post the result (to clarify which process is eating memory).
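For example (the keystrokes are for the interactive top display; Shift+M is an extra suggestion of mine, not from the original post):
```
# Launch top, then:
#   press 'c'        to toggle full command lines on and off
#   press Shift+M    to sort processes by memory usage
top
```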
@szczym:
Thank you everybody for the performance tips; so far my server is performing very well on the new host. I will investigate MySQL tuning, but first I would like to be assured that Linode will not put me on slow storage again (and none of you, I hope).
They did not "put you on slow storage". Your machine started swapping, which is going to be slow on a VPS. YOU need to figure out what is going wrong with your unmanaged VPS, not linode support. There are plenty here who will be happy to help you if you ask nicely, but throwing around accusations and demands is unlikely to get you such help.
My server was swapping before the 5th of March and is swapping now at its "standard" rate. It is indeed short on memory and I need to fix that, hopefully with the community's help. Maybe I will be able to help someone in the future.
What I gather from my logs is this: on the 5th of March the data link from my host to the storage got busy, so all disk operations slowed down. I could have fought forever with tuning and it would not have helped, because the disk was slow.
Monthly graph of I/O wait:
Monthly graph of swap:
Monthly memory consumption:
Looking at it from my (non-sysadmin) point of view: if the slow performance were because the VPS was running out of memory, it would show up as higher memory consumption and/or higher swap usage. Neither of those occurred. Also, I did not change anything on my system on the 5th of March.
To see how much swap activity you're having, you can run vmstat 5 for a minute or two, hit Ctrl+C to stop it, then look at the "si" and "so" columns. If everything is OK, those values should be really small. Run the test when your server is fast, and again when it's slow. Post the screenshots and I'm sure somebody will be able to figure out what's going on.
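A minimal sketch of that check (nothing beyond what the post describes):
```
# Sample memory and swap activity every 5 seconds; stop with Ctrl+C.
# Watch the "si" (swapped in from disk) and "so" (swapped out to disk)
# columns -- sustained non-zero values mean the box is actively thrashing.
vmstat 5
```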
Here's another test that you can run. Go to the "extras" menu and buy another gig of RAM. Reboot and see if the problem goes away. If it does, you should probably upgrade to a bigger plan. Remember that although only 0.5GB is shown as "used" on the graph, those yellow/blue buffers and cache are also very important. Linux needs plenty of space for them. (If it doesn't work out, you can cancel the extra RAM after a day or two. A prorated refund will be immediately applied to your account.)
@szczym:
Thank you hybinet. Right now, after Linode moved me to another host, my performance problems are more or less gone.
:D
Just so you realize: presuming all other things remained similar in your own application stack, that was just luck.
Total I/O bandwidth to local storage is pretty much the same on any Linode host, so if you are doing the same I/O on the old and new hosts but getting better performance from the latter, it likely just means that the other guests on your new host are using less of it than those on your old host. But nothing guarantees that will remain the case, especially if your current scenario was helped, if only in part, by being moved to a newer host which may not be fully occupied yet.
The I/O bandwidth is shared fairly among those guests trying to use it, but is probably the most constrained resource. So if your current application stack requires a significant amount of I/O (which you would need to determine) you may just have bought a little time until you run into more contention, especially if the host you were moved to was newer and thus has fewer guests at the moment.
Now, it could also have been that some other guest on your old host was a heavy I/O user, which can adversely impact even modest users. I had a Linode, for example, that would consistently get into large I/O wait percentages even though it barely did any I/O itself and never swapped. But given Caker's comment about your Linode's typical usage compared to others on its host (both old and new), it seems that you are the heavy user in both places. I suspect the odds favor your performance degrading over time. Just realize that has nothing to do with slow storage per se; it's just a shared resource that your setup needs a lot of, which may not always be available when split among others on your host.
If I were in your shoes, I'd use the "reprieve" you have gotten by moving hosts to analyze and tune your application stack to reduce the I/O requirements as much as possible, making it more likely your performance will remain good over time. If the issue then happens again, you'll know that you're about as efficient as you can be and might need to consider a plan upgrade instead.
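One rough way to gauge that I/O requirement (iostat is in the sysstat package; this is just a common approach, not something from the thread):
```
# Install sysstat if it is not already present (Debian/Ubuntu).
apt-get install sysstat

# Extended per-device I/O statistics, refreshed every 5 seconds.
# High %util and large await values point at disk contention.
iostat -dxk 5
```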
– David
I've looked at some of the graphs on this topic and see a multi-month plot followed by a declaration that everything is fine. Personally I'd never be willing to do so on such minimal data. When you have a single blip on a graph that is representing multiple hours or more, how can you trust that to mean anything?
Consider those who run sar with a 10-minute sampling rate. I'd claim they too are fooling themselves. What if there is a 2-minute CPU spike of 100%? They'll never see it, and during those 2 minutes the system will be crawling. Same thing with networks, disks, etc.
That's the reason collectl's sampling rate is 10 seconds, and sometimes I even run it at 1 second. And before anyone gets all excited and says that will generate too much load, let me say that collectl uses less than 0.1% of the CPU at the 10-second rate. Since all these tools have about the same level of overhead, I'd use whichever tool you prefer at that rate. All you need to do is run it as a daemon and forget about it being there until you have a problem. Then you have enough detail to see what is really happening.
But now there's the problem of plotting the data. I also see all those 'pretty' plots rrd draws, BUT they are far from accurate if you throw a lot of data at them because they 'normalize' the data and as a result information is lost.
I say forget pretty and use a tool like gnuplot. At the very least, if you have 8000 data points (that's one per 10 seconds) and one of them is a spike, you WILL see it, and for my money (and this stuff is all free) I'll go with accurate over pretty every time.
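A minimal sketch of that approach, assuming the sampler's output has been dumped to a two-column file "cpu.dat" holding epoch seconds and CPU% (the filename and columns are just an example, not from the post):
```
# Plot every raw sample, no averaging, so single spikes stay visible.
gnuplot -persist <<'EOF'
set xdata time
set timefmt "%s"
set format x "%H:%M"
set ylabel "CPU %"
plot "cpu.dat" using 1:2 with lines title "cpu"
EOF
```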
-mark