No Traffic but I/O Spikes, OOMs, Crashes
Constant OOMing has me at a loss for the cause.
Sometimes the I/O spikes before the OOM killer starts hacking at Apache and the box completely crashes, with CPU jumping over 100% on each processor.
Here are some of the relevant settings:
MinSpareServers 5
MaxSpareServers 10
MaxClients 150
MaxRequestsPerChild 0
StartServers 1
MinSpareServers 3
MaxSpareServers 6
MaxClients 15
MaxRequestsPerChild 3000
MySQL tweaks:
key_buffer = 16K
max_allowed_packet = 1M
thread_stack = 64K
table_cache = 4
sort_buffer = 64K
net_buffer_length = 2K
skip-innodb
sysctl.conf tweak: vm/min_free_kbytes = 16384
Here is a munin report showing the crash:
Any help would be appreciated. Thanks!
9 Replies
@mattryan29:
Constant OOMing has me at a loss for the cause.
Well, the immediate cause is fairly easy to determine - you're running out of memory, to state the obvious. There's some supporting evidence for that in your munin graphs, which show significant swap space usage, and if you're using a default 256MB swap image I can certainly see how you'd run completely out of memory. Not sure the graphs directly show such a failure, but it's often the case that when it happens, it's fast enough to be missed by (or to preclude) the next munin polling cycle. And yes, if the OOM killer can't clean out enough memory fast enough, it can lead to a kernel panic, which can max out CPU usage (though in my case it's always seemed to be a single core, probably for the kernel thread).
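If you want to confirm that's really what's happening, the OOM killer logs its kills in the kernel log, so something like the following should turn up the victims (the log path is an assumption for a Debian/Ubuntu-style layout; adjust for your distro):
dmesg | grep -iE 'out of memory|oom-killer|killed process'
grep -i 'killed process' /var/log/kern.log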
Of course, the rub is determining why, but a reasonable first step is to focus on your web stack, since that tends to have the most variability in a lot of configurations.
I'd start by reviewing process status while the system is up, to estimate Apache's per-process memory size and how much memory is available once you take away any other standard processes on your system. Then see if that size, multiplied by your MaxClients setting, could use more memory than you have.
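As a rough sketch of that estimate (assuming the processes are named apache2 - on some distros it's httpd):
# average resident size per apache2 process; RSS double-counts shared pages, so this errs on the high/safe side
ps -ylC apache2 --sort=rss | awk 'NR>1 {sum+=$8; n++} END {printf "%.0f MB average\n", sum/n/1024}'
# and what's left over once MySQL and everything else is running
free -m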
If I could do so, I'd also try stress testing the current configuration (such as with "ab", choosing a representative URL that exercises the full stack and database) to see if I could reproduce the problem. If so, that's a big advantage, since you can then run the same test after each change.
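A minimal example of that kind of test (the URL is just a placeholder - pick a page that actually exercises your application code and MySQL, not a static file):
ab -n 500 -c 10 http://www.example.com/some-dynamic-page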
In either case, you're likely going to want to drop MaxClients. If your analysis shows the current value is too large, you can use that to pick a new value. But even if it appears ok, one troubleshooting approach is to drop MaxClients a lot - like down to 1-2 - as well as dropping MaxRequestsPerChild a lot - maybe low double or single digits - in case there's a per-request leak going on. Dropping MaxRequestsPerChild is to help avoid a single process growing unusually large, which you may not have been able to catch while observing the system. Performance may suffer, but at this point the goal is to completely stop the full failure, and worry about performance second.
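For that troubleshooting phase I mean something as drastic as this (illustrative values only, shown in the prefork section of a Debian-style apache2.conf):
<IfModule mpm_prefork_module>
    StartServers          1
    MinSpareServers       1
    MaxSpareServers       2
    MaxClients            2
    MaxRequestsPerChild   20
</IfModule>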
If you can't afford to do the stress testing or configuration changes on the production Linode, clone it to another Linode (even if you just add one for a few days for the testing), and then perform your stress testing and test changes there.
If your application stack is large enough memory-wise for each request, it's not necessarily wrong to have to drop MaxClients into single digits on a Linode 512. Nor does doing so necessarily imply terrible performance, unless each request takes a really long time to satisfy. Though of course, delaying requests is still a far more graceful degradation than keeling over completely.
There are a number of topics here on tuning Apache (and associated application stacks) and MySQL that have more detailed suggestions and ways to work up to a final configuration once you've stopped the pain, plus ways to improve performance at whatever configuration your Linode can support, so I'd definitely do a few forum searches to see if those can also help. But in terms of first steps, most of them boil down to dropping the configuration extremely low to guarantee you have enough resources, and then adjusting them slowly upwards. This may eventually also lead to a conclusion that a larger Linode is needed for your purpose, but that's really only something you can conclude with certainty after having tuned everything to the current Linode.
– David
In the end my Apache got down to this:
<IfModule mpm_prefork_module>
MinSpareServers 2
MaxSpareServers 2
ServerLimit 3
MaxClients 50
MaxRequestsPerChild 3000
</IfModule>
<IfModule mpm_worker_module>
MinSpareThreads 2
MaxSpareThreads 5
ThreadLimit 32
ThreadsPerChild 25
MaxClients 50
ServerLimit 3
MaxRequestsPerChild 0
</IfModule>
By limiting the server to just a few instances I have it stable for more than an hour, but only if I reboot every hour. However, performance is abysmal.
The latest version of the paravirt kernel is buggy according to Linode support, but I'm at my wit's end as to what the cause is.
I'd recommend dropping MaxClients down to 5 and seeing what happens. It would also be worth confirming which MPM you're actually running:
apache2 -V | grep MPM
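On a prefork build that should print something like "Server MPM: Prefork"; if it reports Worker instead, it's the thread-based directives that actually apply.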
-Chris
@mattryan29:
What is an appropriate max requests per child on a 512 box?
The requests per child setting is a way to amortize the overhead of creating a new worker process across many requests. Even small non-zero values will probably always be helpful. But larger values really only make a difference if that overhead is your bottleneck, and since it leaves worker processes around longer, it risks resource growth over time. There's no single "appropriate" answer.
If I were setting a value for a new configuration, I might choose something like 10-50 just to have a value, and then tune as I tested. Starting low acknowledges that a worker process growing until it exceeds your resources is far more disastrous than slower responses due to process creation being a bottleneck.
A lot also depends on your request load. If you're peaking at single-digit requests per second, process fork overhead - and thus this setting - probably isn't going to matter all that much. If you're trying to handle hundreds or thousands a second, it could make a big difference.
It's far more important to first tune things so you are staying within your available resources (which requests per child doesn't really impact - that's primarily MaxClients). Once you're there, increasing requests per child will likely benefit performance at first, but the benefit will quickly fall off - it wouldn't surprise me to see that happen even at double-digit values.
In the end, it's probably best to just test in your specific situation. Use something like ab to get MaxClients to a point where you fit in memory, then see if increasing requests per child yields higher request rates. My bet is that pretty quickly it won't make much difference, as the bottlenecks are elsewhere and dwarf the process creation overhead.
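Concretely, something along these lines, re-run after each change (URL again a placeholder):
ab -n 1000 -c 5 http://www.example.com/some-dynamic-page | grep 'Requests per second'
# adjust MaxRequestsPerChild, reload Apache (e.g. apache2ctl graceful), and run it again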
– David
Is there a healthy percentage of memory to be in use during normal operations, such as low-traffic periods? I know this question depends on a lot of variables; I guess I'm most interested to see what is comfortable for others.
@mattryan29:
Is there a healthy percentage of memory to be in use during normal operations, such as low-traffic periods? I know this question depends on a lot of variables; I guess I'm most interested to see what is comfortable for others.
You're right that it's almost impossible to answer outside of the parameters of your specific application stack and usage patterns. Ideally you want to use exactly 100% of your memory during the heaviest load (don't worry about low usage times), but usage is dynamic and impossible to predict exactly. Your main choice is likely to be how much space you want to try to leave for filesystem buffers/cache.
Some caching will always be beneficial - I/O is likely the most constrained resource on a Linode, and it doesn't help if you can service extra requests with more Apache workers if they all become I/O bound anyway.
If you just want a number to aim for, try to reserve 10-20% for caching/buffering, so 50-100MB on a 512 under full load. But testing is still best. You may find that, due to other bottlenecks, there's little performance gain from raising MaxClients beyond some point, even if that would leave a lot more memory free. At that point, I'd leave things alone and let the kernel use the extra memory for buffering. I suppose the converse could be true (increasing MaxClients at the expense of buffering could give a big boost to performance) but I wouldn't bet on it.
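As a back-of-the-envelope sketch (every number here except the 512MB total is hypothetical): if the base system plus MySQL sits around 100MB and each Apache worker averages about 25MB, then reserving roughly 75MB for cache leaves (512 - 100 - 75) / 25, or about 13, as a ceiling for MaxClients.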
This is also just the first "next step" in your additional tuning. This first step helped stop the major pain, but there are a number of performance dimensions you can work on now that you're fitting within the available memory. For example, caching for WordPress (also discussed in other posts) makes a separate trade-off where you'll want to take some other memory away from Apache workers and instead give it to the cache. So whatever balance you find with your initial MaxClients tuning and system memory may change as you fine tune things further.
– David