More CPU / IO spikes..
The CPU suddenly spikes to 400% and IO goes through the roof. It happens about daily now. No response to SSH or Lish..
I've tried to change everything that I thought might cause this spike, but nothing works.
It's a Linode 1024.
MySQL settings were changed to the ones over here:
In apache2.conf I have these settings for mpm_prefork_module (if that's the right one):
StartServers 5
MinSpareServers 5
MaxSpareServers 10
MaxClients 50
MaxRequestsPerChild 500
I'm running a Drupal 7 site, with Piwik and normally the server hardly ever goes over 20% CPU usage in the Linode graph. Pages load fast, as far as I can tell. And it doesn't seem to have a problem loading a lot of pages at once.
The easiest fix would be to have it reboot after spiking (if I'm sleeping), but I'm not sure if that's possible. The best fix would be to not have the spikes at all, obviously..
@Cheek:
Anything else I can check? Lassie keeps rebooting it now. It's as if, since I changed the Apache config file, it can't handle some requests anymore. Like when I hit the Piwik dashboard, it crashes the server sometimes..
Check your console via LISH to see what's been logged at the point of crash. OOM errors can turn into kernel panics if memory can't be freed fast enough, in which case continuing to tune your configuration will also resolve the crashes.
Or, probably less likely, there have also been one or two threads recently about crashing issues with a recent 2.6 kernel (I can't remember the specifics) where either switching kernels or, in some cases, boosting a kernel memory parameter helped (so it's also related to available memory issues). Found one of the threads:
– David
Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
Pid: 2451, comm: apache2 Not tainted 2.6.38.3-linode32 #1
Call Trace:
[<c063cfbf>] ? panic+0x57/0x13e
[<c0181414>] ? out_of_memory+0x2c4/0x2f0
[<c018488c>] ? __alloc_pages_nodemask+0x65c/0x670
[<c01862bd>] ? __do_page_cache_readahead+0xed/0x230
[<c018641e>] ? ra_submit+0x1e/0x30
[<c017eeae>] ? filemap_fault+0x34e/0x420
[<c01941c6>] ? __do_fault+0x56/0x520
[<c0184325>] ? __alloc_pages_nodemask+0xf5/0x670
[<c019529b>] ? handle_pte_fault+0x9b/0xac0
[<c0196f61>] ? handle_mm_fault+0x101/0x1a0
[<c011e81b>] ? do_page_fault+0xfb/0x3e0
[<c016f218>] ? compat_irq_eoi+0x8/0x10
[<c016fa7c>] ? handle_fasteoi_irq+0x8c/0xe0
[<c0105b37>] ? xen_force_evtchn_callback+0x17/0x30
[<c0138070>] ? __do_softirq+0x0/0x130
[<c0106314>] ? check_events+0x8/0xc
[<c010630b>] ? xen_restore_fl_direct_end+0x0/0x1
[<c010af92>] ? do_softirq+0x42/0xb0
[<c011e720>] ? do_page_fault+0x0/0x3e0
[<c063fea6>] ? error_code+0x5a/0x60
[<c011e720>] ? do_page_fault+0x0/0x3e0
So it's not the kernel?
@Cheek:
So it's not the kernel?
Don't think so - more likely an OOM condition that got so bad the kernel gave up.
You'll need to turn down your Apache configuration (or other processes on your node) so that you can manage peak load (you can stress test with something like ab against a URL that involves your full application stack) without exceeding your available memory. I realize that may slow down your possible request rate, but a higher rate won't help if the entire node crashes…
Once you have your system operating within the available resources, then you can focus on improving performance (which may have to include a larger Linode, but I wouldn't jump to that point initially).
As your first post notes, there are a number of tuning threads on the forum that you may wish to review. (Unlike your first comment, I think most do in fact cover a "fix" - which is tuning your configuration to fit available resources).
-- David
The weird thing is, it was working fine for months without crashing. It just crashed again with CPU and I/O spiking through the roof.
I'd rather have Lassie rebooting my node, than the CPU spiking. Because that could crash my server for hours when I'm not around.
But anyway, there's not much else running besides apache and mysql. And as I said, I've set the mysql to the values as in the library and MaxClients to just 10, which seems pretty low.
Is there anything else that could cause this? I was really happy with my Linode, but I'm just one guy building a website and can't spend most of the day fixing the server.
I've got 3 options: find a fix, double the Linode or go with something like a Mediatemple DV server. It'll more than double the costs but maybe it's the best option for me?
@Cheek:
The weird thing is, it was working fine for months without crashing. It just crashed again with CPU and I/O spiking through the roof.
That's not that unusual. Traffic load changes, performance of databases as they grow change, etc.. You may have been very close to exceeding your resource for a while now and just not known it. Or something may have bumped the request load to your node up significantly (a link from some site?) without your knowing it.
> I'd rather have Lassie rebooting my node, than the CPU spiking. Because that could crash my server for hours when I'm not around.
You should at least get an eventual email about the CPU usage exceeding the notification limits (I do). The problem is that depending on kernel configuration, a panic doesn't actually halt the box (the kernel is still running, just in a tight loop, which leads to the CPU usage) so Lassie doesn't consider it down.
There's a kernel parameter (kernel.panic) that you can set to the number of seconds after which the box will reboot itself after a panic. As a general matter there could be some risk of always restarting depending on the cause of the panic, but in a scenario like this it's probably preferable rather than staying in the panic'd state. You can save adjustments to that value in /etc/sysctl.conf.
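For example, a line like this in /etc/sysctl.conf would do it (the 60 seconds is an arbitrary choice):

kernel.panic = 60

You can also apply it immediately, without a reboot, with "sysctl -w kernel.panic=60" (or "sysctl -p" after editing the file).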
> But anyway, there's not much else running besides apache and mysql. And as I said, I've set the mysql to the values as in the library and MaxClients to just 10, which seems pretty low.
Really the only way to know is to test. You may have a stack that is using even more memory than you think, so even MaxClients of 10 may be too much. You need to actually monitor your resource usage under load to identify what you actually use. The other threads cover ways of doing that in far greater detail, but basically you want to observe how much actual memory each Apache process is using when handling requests.
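For example (just a sketch - adjust the name if your binary isn't apache2):

# ps -ylC apache2 --sort=rss

That lists every apache2 process with its resident memory (the RSS column, in KB); the largest values you see under load, times MaxClients, is roughly your worst-case Apache footprint.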
> Is there anything else that could cause this? I was really happy with my linode, but I'm just one guy building a website and can't spent most of the day fixing the server.
Going into an OOM condition? Nope - pretty much means you're using too much memory.
Assuming it's the Apache configuration is an educated guess, but as it's probably the leading contender of this scenario in just about all cases brought to the forum, it's a good first bet.
> I've got 3 options: find a fix, double the Linode or go with something like a Mediatemple DV server. It'll more than double the costs but maybe it's the best option for me?
Only you can answer that for yourself. Certainly just throwing more resource at the problem (the bigger Linode - I know nothing about Mediatemple) is "simpler", but I can't say it's guaranteed to solve the problem without your first identifying the root cause. Certainly might push off your having to deal with it until later though.
For example, let's say that you were still at MaxClients of 50, and your request load used them all, but each Apache process was using 100MB (all extreme values). Just bumping your Linode to a 2048 wouldn't solve the problem, just let you get a few more simultaneous requests before keeling over the same way.
– David
How about nginx?
@Cheek:
I've got this from logview:
Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled Pid: 2451, comm: apache2 Not tainted 2.6.38.3-linode32 #1 Call Trace:
BTW, I just re-read the above from your earlier post while writing my previous response. I hadn't noticed it earlier, but this looks like you've set the vm.panic_on_oom parameter on your box. This forces an immediate panic on any OOM condition, rather than the regular OOM processing of killing off processes to free memory. I don't believe this is the default configuration, and while killing random processes - the default behavior - isn't necessarily conducive to normal operations, an automatic panic definitely isn't :-)
If you're going to do this, then you likely want to combine it with a non-zero kernel.panic parameter (as per my other note), which essentially means you want to reboot on OOM. See also http://www.linode.com/wiki/index.php/RebootingonOOM
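Presumably the combination would look something like this in /etc/sysctl.conf (the seconds value is arbitrary):

vm.panic_on_oom = 1
kernel.panic = 60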
– David
@db3l:
@Cheek: I've got this from logview:
Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled Pid: 2451, comm: apache2 Not tainted 2.6.38.3-linode32 #1 Call Trace:
BTW, I just re-read the above from your earlier post while writing my previous response. I hadn't noticed it earlier, but this looks like you've set the vm.panic_on_oom parameter on your box. … If you're going to do this, then you likely want to combine it with a non-zero kernel.panic parameter (as per my other note), which essentially means you want to reboot on OOM. See also http://www.linode.com/wiki/index.php/RebootingonOOM
– David

I actually set those parameters a couple of weeks ago, when my server started to crash while I was asleep. So just like the two lines in the note. Should I unset it?
@Cheek:
I guess you are right. I'll check the other threads for tuning methods I haven't tried.
If I were in your shoes, I'd hit your node with testing with 'ab' (or siege or any other load-test tool) to get it to keel over under controlled circumstances. Given the current behavior you are experiencing, I think this should be easy to do :-)
Then drastically drop MaxClients and MaxRequestsPerChild, say to 1 each, and re-test. That should prevent the problem, but kill performance. Leave something like top running, then slowly increase MaxClients only and re-test, watching memory until you're just at the point of having a little for buffers. Then increase MaxRequestsPerChild slowly until performance stops increasing. Note that running multiple requests per child may result in each worker process using up more memory, so you may need to lower MaxClients when using a higher MaxRequestsPerChild to balance out memory vs. request performance.
At that point decide if your performance is good enough or you need to continue to other more performance oriented tuning steps.
If this isn't possible on the production box, spin up a test Linode (even if only for a day or two), clone your current box to it, and run the tests there.
It's a bit monotonous, but not terribly complicated, and it's really the best way to identify how much resource you can tune your stack to use.
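As a concrete sketch of that loop: in one terminal, watch memory while the test runs:

# watch -n 2 free -m

and in another, drive the load (the URL is a placeholder - substitute a representative page from your own site):

# ab -n 10000 -c 100 http://www.example.com/node/123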
> How about nginx?
Might help or might not. Personally I suspect you'd still need to solve your current root issue and that involving nginx is a later tuning stage for performance. The odds are that most of your memory usage isn't Apache per-se, but the Drupal part of the code stack. And you still need to support that.
Fronting apache with nginx for static content may cut significantly down on the memory needed to serve the static portion of your site, but that's also content that Apache itself serves quickly, so hard to say for sure how much that'll change the overall memory profile. This sort of change may help more with performance than memory.
You could offload Drupal to a completely separate server process, but then you'd probably find that it was that server process eating up the same memory for the same number of simultaneous requests as when it was internal to Apache.
– David
I tried all these, but they don't crash the server:
# ab -kc 300 -t 60 http://www.....com/
# ab -kc 1000 -t 60 -r http://www.....com/
# ab -n 10000 -c 100 http://www.....com/
# ab -n 3000 -kc 150 http://www.....com/
@Cheek:
How to get the server to keel over with ab?
One important item is to pick a good representative URL to use. I'm not sure from the redacted examples you give, but if you're querying a home page it may or may not use anywhere near the full application stack to generate the page. Is the page you are querying created by executing the Drupal code and hitting your database in a representative way? If in doubt, you could check your logging to try to pick out some representative URLs occurring prior to recent crashes.
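For example, something like this would show your most-requested URLs (assuming the stock Debian/Ubuntu log location and the standard combined log format, where the request path is the 7th field):

# awk '{print $7}' /var/log/apache2/access.log | sort | uniq -c | sort -rn | head -20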
For testing stuff like MaxClients, the concurrency (-c) is most useful since that's how many requests to do in parallel. As long as it's more than your MaxClients you are testing (maybe with a 20-25% fudge factor) it should be high enough. So I'd think even something around 100 should be more than high enough.
Definitely don't use -k since that issues multiple requests in a single connection and you're trying to test multiple independent connections.
I'd have thought your example with "-n 10000 -c 100" would probably have been a reasonable test. What did your resource stats look like during that test (e.g., memory stats in top)?
Is this with your Apache configuration the same way as the last crash? If so, and if you have a decent test URL then your scenario may be more complicated, either due to specific Drupal processing occasionally needing a lot more memory than when you run these tests, or due to some other unidentified task on the box using up the memory. I suppose at that point you could also ask yourself if there have been recent changes to the site (new features or anything) that might use different Drupal code recently. Maybe a specific type of request needs far more memory than others to process.
Though for any of the Drupal stuff, the general hammer of MaxClients/MaxRequestsPerChild is still a good approach even if you don't know precisely what is happening. If ab can't push you out of resource while tying up all of your MaxClients processes, drop your MaxRequestsPerChild pretty low (maybe single digits) for now. That'll protect you against a long-lived process accumulating memory and throwing off the typical case MaxClients.
– David
@Cheek:
I actually set those parameters a couple of weeks ago, when my server started to crash while I was asleep. So just like the two lines in the note. Should I unset it?
Are you sure both parameters are actually set? If you had kernel.panic set your node should be rebooting in these scenarios and not just staying in a panic. You might want to check (via sysctl) that it actually has the value you think it has.
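You can check both at once:

# sysctl kernel.panic vm.panic_on_oom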
You could certainly turn off vm.panic_on_oom if you like - as I note, it will just put back the default behavior of killing off processes to try to free up memory, which may or may not be all that much better than a panic.
– David
I can tell you for sure that this isn't a hardware issue with your node. But I'm not really sure.
Perhaps David or someone else can respond and help you further on this one.
How did you deploy your configuration?
Are you running plain Drupal/WordPress or something else on Apache?
I've done this before on a Linode 512, with no problem.
So I'm not sure what's going on.
More information on what you're running, and how you configured it, would probably be a good idea.
I wouldn't suggest getting a larger Linode.
Budgeting whether you can even afford a higher-end Linode is important, and like David said earlier, it wouldn't solve your problem for sure.
It would only be a temporary fix.
@db3l:
One important item is to pick a good representative URL to use. I'm not sure from the redacted examples you give, but if you're querying a home page it may or may not use anywhere near the full application stack to generate the page. Is the page you are querying created by executing the Drupal code and hitting your database in a representative way? If in doubt, you could check your logging to try to pick out some representative URLs occurring prior to recent crashes.
Ok, I just crashed it using "-n 10000 -c 100" on a different URL with Drupal caching turned off. The CPU maxed out, memory slowly went to 100% (took about a minute I think) and after it filled the swap, the server was done. Do you think this is what crashed the server naturally before? By a bot or something?
I will try lowering the apache values and slowly up them tomorrow morning. I'll post the results.
Thanks again for all the help David!
I've found another problem. I've suspected Piwik (an open source Google Analytics alternative) for a long time. I just inspected it and it seems like a live real-time visitor widget on the Piwik dashboard is eating away memory.
When I hit the page it eats some memory and after that it keeps eating (if you keep running the dashboard) until it hits 1024mb, the swap fills, and the server crashes.
Is this a Piwik problem, or just a problem with my server's configuration?
@Cheek:
Is this a Piwik problem, or just a problem with my server's configuration?
Hard to say the root cause, but if there is a leak in a code path, that can be exacerbated by a high MaxRequestsPerChild setting, since that leaves the process around. That would align with what you're seeing about memory growing on repeated requests.
In such a scenario, it's even more important to keep MaxRequestsPerChild low, since that will let Apache restart its worker processes more frequently, and prevent memory growth from getting out of control if a code path is leaking memory. Think of it this way - even if your Piwik code is leaking memory on each request, if you had MaxRequestsPerChild at 1 (the minimum) then you could never leak more than one request's worth - e.g., repeated use of the dashboard wouldn't grow anything. The downside is the overhead of restarting the processes, but you can get most of the protection with modest values (e.g., single or double digits) while minimizing the risk of a single growing process.
In terms of this crash scenario, and the other one you pointed out in your last post, it's not that critical whether these are exactly the same as what your users are triggering. Nothing you can do with a stress test ought to be able to crash your box, so any failure that you can control is a great test to use when tuning. The hope is that your stress test will turn out to be worse than anything end users can generate, and while you can't guarantee that, the odds are that if you can find a configuration that passes these tests, you'll be fine with end users.
– David
@Keith-BlindUser:
I can tell you for sure that this isn't a hardware issue with your node.
But I'm not really sure.
@Cheek:
Is this a Piwik problem, or just a problem with my server's configuration?

The first thing I did when installing Piwik was set up its archiving as a cron job (one that runs every hour), then I disabled archiving from running on page views to get better performance:
crontab -e
# run Piwik's archive script as the apache user, 5 minutes past every hour
MAILTO="emailforerrors@mydomain.com"
5 * * * * su --shell=/bin/bash --session-command="/path/to/misc/cron/archive.sh > /dev/null" apache
I would highly recommend this for your site
Supposedly if you have a lot of data it parses through it all, and the docs say by "default" on high-load servers to set the PHP memory limit to 512MB.. yeah.. I'm not doing that! Another option is just to use a program that can parse with lower overhead, like Google Analytics (Google does the calculations for you) or Webalizer (which directly reads your HTTP log files).
I tried AWStats, but it was very slow using Perl.. I probably won't use it again.
@superfastcars:
Supposedly if you have a lot of data it parses through it all, and the docs say by "default" on high-load servers to set the PHP memory limit to 512MB.. yeah.. I'm not doing that! Another option is just to use a program that can parse with lower overhead, like Google Analytics (Google does the calculations for you) or Webalizer (which directly reads your HTTP log files).
I'll set up a cron job. Didn't do that yet since I don't think that's the problem.
I just set both MaxClients and MaxRequestsPerChild to 5 and the ab stress test didn't kill it. But when I went to the Piwik live dashboard widget, it slowly ate my memory again until it crashed the server. It took like 10-15 minutes.
I don't think that's normal, is it?
When I run the dashboard without the widget, there's no problem. The ab stress test didn't get the memory usage over 150mb.
MaxClients 10
MaxRequestsPerChild 100
Stress tested it with ab; with caching off, memory usage stayed between 500-600MB. Seems fair, right? Upping the values didn't seem to help performance much anyway.
With caching on, memory usage wouldn't even go over 125MB, but that's tricky since it's just testing one page, which gets cached pretty well.
I think MaxRequestsPerChild was definitely the problem (plus MaxClients was set too high to begin with).
I'm running into big problems with Piwik though. I've enabled the cron and disabled archiving on the page. But every time I visit a Piwik dashboard page it keeps eating up memory. Even without the live stats.
Normally the memory would be given back to the system, but all the Piwik pages keep eating memory with every page visit until it's filled up and the system crashes. Can this be a MySQL tuning problem?
When testing with ab it's just one page, so I guess it's different when visiting a lot of different pages. Is this still apache's problem? Shouldn't it release the memory after the page has been sent?
StartServers 5
MinSpareServers 5
MaxSpareServers 10
I've now settled on even lower settings. It seems to make some images load pretty slow, but at least the server doesn't seem to be crashing (yet).
StartServers 3
MinSpareServers 2
MaxSpareServers 3
MaxClients 10
MaxRequestsPerChild 30
@Cheek:
Ugh, it seems I can do the same thing on my site. When visiting a lot of different pages that haven't been cached before, it slowly eats all the memory. A lot slower than Piwik, but it happens..
When testing with ab it's just one page, so I guess it's different when visiting a lot of different pages. Is this still apache's problem? Shouldn't it release the memory after the page has been sent?

Did you do this step.. ??
@Cheek:
Ok, so I guess the problem I described above is how apache is supposed to work.. It doesn't care about your memory?
Depends on what you mean by "doesn't care". If you mean will it obey the settings in terms of when to create new worker process and how long to let the same process service requests independent of what memory is being used, then yes. But it's not like it's malicious or anything :-)
> I've now settled on even lower settings. It seems to make some images load pretty slow, but at least the server doesn't seem to be crashing (yet).
That's your most important goal at this point. And from the sequence of posts it would seem like you're working your way to at least a safe set of settings, if not ideal for performance.
Once you've stopped the bleeding, so to speak, then you can start thinking about performance tweaks, while staying within your available resource. With lower MaxClients it does mean that each request may get delayed slightly until a worker process is free, which could be what you're seeing with the images.
The issue is that the same Apache worker pool is used both for full scripted pages (running Drupal) and for simple static files. This is the point where steps like using nginx as your front-end server, proxying the Drupal portions of the site back to Apache, become reasonable to consider.
That way, static content can be delivered quickly (and with very low memory overhead) from nginx, while everything else still goes through the full scripted/Apache route.
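As a rough sketch of what that layout might look like (everything here is a placeholder: the domain, the docroot, and the assumption that you've moved Apache to 127.0.0.1:8080):

server {
    listen 80;
    server_name www.example.com;

    # static files served directly by nginx (cheap, no PHP interpreter involved)
    location ~* \.(jpg|jpeg|png|gif|ico|css|js)$ {
        root /var/www/mysite;
        expires 30d;
    }

    # everything else (the Drupal pages) proxied back to Apache
    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}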
– David
PS: On the piwik thing, I'm afraid I'm out of my depth as I have no experience with that. I find it hard to believe you're seeing unbounded process growth if you keep MaxRequestsPerChild low (and maybe 30 is even too high), but since I don't know what it does or how it works, I'd have to leave that aspect of things to someone else.
@superfastcars:
@Cheek:Ugh, it seems I can do the same thing on my site. When visiting a lot of different pages that haven't been cached before, it slowly eats all the memory. A lot slower than Piwik, but it happens..
When testing with ab it's just one page, so I guess it's different when visiting a lot of different pages. Is this still apache's problem? Shouldn't it release the memory after the page has been sent?
Did you do this step.. ??
Yes, but I don't think archiving is the problem. It's the connection the live widget keeps. http://piwik.org/docs/setup-auto-archiving/#toc-disable-piwik-archiving-to-be-triggered-from-the-browser-and-limit-piwik-reports-to-be-updated-every-hour
Anyway, with the latest settings I posted Piwik doesn't use up all my memory anymore. It does make the server a bit slow though I think..
@db3l:
PS: On the piwik thing, I'm afraid I'm out of my depth as I have no experience with that. I find it hard to believe you're seeing unbounded process growth if you keep MaxRequestsPerChild low (and maybe 30 is even too high), but since I don't know what it does or how it works, I'd have to leave that aspect of things to someone else.
Thanks for all your advice David. You've been a great help! :) I think this will be a good thread for other beginners in the future..
Any suggestions about the last settings I used?
It seems like StartServers, MinSpareServers and MaxSpareServers did the trick for the crashing. Because when I set MaxClients and MaxRequestsPerChild both to just 5, Piwik still managed to crash the server..
@obs:
Tip: Enable mod_expires and set the expires time for static elements to something high, that will reduce the number of requests on the server since the browser will cache the elements and not re-request them.
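Something like this in your vhost or apache2.conf should do it (untested sketch - adjust the types and times to taste, and enable the module first, e.g. "a2enmod expires" on Debian/Ubuntu):

<IfModule mod_expires.c>
ExpiresActive On
ExpiresByType image/jpeg "access plus 1 month"
ExpiresByType image/png "access plus 1 month"
ExpiresByType text/css "access plus 1 week"
ExpiresByType application/javascript "access plus 1 week"
</IfModule>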
Thanks for the tip obs, I'll try that.
<IfModule mpm_prefork_module>
StartServers 3
MinSpareServers 2
MaxSpareServers 3
MaxClients 9
MaxRequestsPerChild 28
</IfModule>
But it turns out I was a little bit too optimistic.
The server still crashes. It's actually a miracle it worked for months with the old configuration.
So the 'ab' stress test didn't manage it, but I just tried some random crawler on my site, which follows all links instead of hitting just one URL.
Turns out I can crash the server with any setting. Even when I set MaxClients and MaxRequestsPerChild to just 3.
The only thing setting the variables lower does is make the process of filling up the memory slower.
I'm still baffled this is how it's supposed to work..
Are there any other settings I might have missed? Anything else I can try?
If any setting can eat up my memory, what can I do? A bigger node won't help, just delay the crash..
If I didn't miss them, show us your my.cnf settings. MySQL can easily start swapping with the wrong settings.
Here is part of an example configuration for a Linode 512:
low_priority_updates=1
#key buffer = 25% of memory size
key_buffer = 100M
sort_buffer_size = 2M
table_cache = 512
max_heap_table_size = 32M
tmp_table_size = 32M
max_sort_length = 20
max_allowed_packet = 16M
thread_stack = 192K
thread_cache_size = 20
query_cache_limit = 10M
query_cache_size = 16M
max_connections = 50
Do not just copy these settings; use this script to get tips about tuning your configuration:
In the task manager I can see it's apache eating up the memory. If it were MySQL, it would show up as the 'mysqld' process, am I right?
@OZ:
Settings from that tutorial can kill performance of MySQL and MySQL will become an IO-hog.
I've now set it to the settings you provided above. Ran the performance script and lowered max_connections. But still the memory builds up. Can mysql consume memory in the name of apache in the task manager?
@Cheek:
@OZ: Settings from that tutorial can kill performance of MySQL and MySQL will become an IO-hog.
I've now set it to the settings you provided above. Ran the performance script and lowered max_connections. But still the memory builds up. Can mysql consume memory in the name of apache in the task manager?

No. PHP scripts can consume memory under the apache name. Press "c" in top and you will see the full command for each process.
@OZ:
No. PHP scripts can consume memory under the apache name. Press "c" in top and you will see the full command for each process.
Then it's definitely apache, not mysql..
@glg:
What modules other than mod_php do you have enabled? If you have a bunch of extraneous ones, that will make Apache take up more RAM than it should.
You mean apache modules?
Nothing special:
core.c
mod_log_config.c
mod_logio.c
prefork.c
http_core.c
mod_so.c
core_module (static)
log_config_module (static)
logio_module (static)
mpm_prefork_module (static)
http_module (static)
so_module (static)
alias_module (shared)
auth_basic_module (shared)
authn_file_module (shared)
authz_default_module (shared)
authz_groupfile_module (shared)
authz_host_module (shared)
authz_user_module (shared)
autoindex_module (shared)
cgi_module (shared)
deflate_module (shared)
dir_module (shared)
env_module (shared)
expires_module (shared)
mime_module (shared)
negotiation_module (shared)
php5_module (shared)
reqtimeout_module (shared)
rewrite_module (shared)
setenvif_module (shared)
status_module (shared)
What is your memory_limit in php.ini? Try setting it to 20M there (or less).
It's like I can't find the leak in my boat..
@vonskippy:
Have you tried rubbing a soapy water solution all over the server and seeing where the bubbles are coming out? I'll open a support request for that..
;)
@Cheek:
It's like I can't find the leak in my boat..
:(
Well, it's not totally identified, but it's certainly localized.
From what I'm seeing, you can eat up all your memory with barely any simultaneous requests, and that memory is going to apache. It seems to me that points squarely at the application stack servicing those web requests, which is Drupal, right?
You probably need to follow-up with those more expert with Drupal and its modules, in terms of situations in which that much memory may be required with the modules you are using. Either it's working as designed (and it just needs a lot of memory to generate whatever pages your site uses), or something is leaking during processing. It could be something as simple as a poorly written database query that ends up pulling a bunch of data back from the database (keeping it in memory) to generate a page.
Not sure if there's enough local Drupal expertise here (definitely not in my case) or not for that.
BTW, in your shoes I'd just set MaxRequestsPerChild to 1 for now so any resources are reclaimed after a single request, with no possibility for additive growth over a series of requests. As it stands now you're letting a single process potentially get 28x larger than it would if you had that setting at 1.
Also, given how much less frequent the issue is under your current settings, at this point growing your Linode (even if temporarily) might in fact give you a decent amount of breathing room, plus some additional data in terms of whether you can really fill that memory, or if now you can see the right peak usage, which maybe you'll be lucky and is just be slightly higher than your current Linode.
– David
@db3l:
BTW, in your shoes I'd just set MaxRequestsPerChild to 1 for now so any resources are reclaimed after a single request, with no possibility for additive growth over a series of requests. As it stands now you're letting a single process potentially get 28x larger than it would if you had that setting at 1.
Somehow, setting the MaxRequestsPerChild to 1 doesn't help. I can fill the memory with just 1 MaxRequestsPerChild.
It seems MaxClients has more effect. I'm now using these settings:
StartServers 1
MinSpareServers 1
MaxSpareServers 1
MaxClients 4
MaxRequestsPerChild 4
(I started with MaxClients 4 + MaxRequestsPerChild 1 and went from there)
Looks like these settings (or lower) are the only ones that aren't crashable. The time to load images, however, gets pretty awful. Is this the point to upgrade?
Keep in mind I'm using some extreme scenarios to test the server. But as you said, a properly configured server shouldn't run out of memory right?
> Keep in mind I'm using some extreme scenarios to test the server. But as you said, a properly configured server shouldn't run out of memory right?
Yes. Well, to be fair, unless the minimum requirements of the application stack truly cannot fit in the available resources. But a Linode 512 really ought to be able to handle this, even with a resource hungry stack like Drupal, at worst with reduced performance, but without crashing.
Wow. The only conclusion I can come to is that a single request needs at least a quarter of your available memory (excluding non-apache processes). Assuming that's in the 100MB range, I guess that's not impossible, but it does seem excessive.
Darn, this isn't getting any easier for you, is it? :-(
In your shoes, I suppose my last scenario would be MaxClients and MaxRequestsPerChild of 1 each. That lets a single request into your host at a time. If you can crash things that way, you know you literally don't have enough resource for your application stack, and barring fixing a problem there, simply have to grow your Linode. It may be that a Linode 1024 would be fine, or it may be that the same URL will just eat through whatever you give it (if it's a bug). Only way to know would be to test.
The fact that it worked for so long probably implies an existing behavior (potentially a bug) that is now being tickled by differences (or growth) in your data, or maybe some change (a module upgrade?).
I guess the silver lining, if there's any in this scenario, is that if a single request is capable of doing this, if you could figure out what one (or ones) it is, you should be able to narrow your focus a lot.
Is there any way for you to get information from whatever stress tool you've settled on as to what URLs it requests in what order? If you could find the first one that started failing (timeout or whatever) it might point you at somewhere to look, or ask questions in a Drupal group about.
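For example, wget can act as a crude crawler and log the order of its requests (a sketch - the URL and depth are placeholders):

# wget -r -l 3 --delete-after -o /tmp/crawl.log http://www.example.com/

The tail of /tmp/crawl.log after a crash should show the last URLs it managed to request.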
I think it would also be worth your time to clone your current Linode onto a larger size (like a 1024 or larger), and then run the same test against that. If it survives, at least you know you have the option of spending your way out of this short term pending any other analysis. If not, at least you know that as well.
– David
PS: For images, I'd add something like nginx for static content on the front end, proxying dynamic requests back to apache, independent of system size. The latency is because even simple static requests have to wait for a free apache worker, and each worker has the full interpreter stack. Offloading that to nginx would minimize the resources necessary to deliver static content.
Actually, in most cases it is, and the memory is stable at only 125MB. But in some cases the memory just builds up – maybe when Google comes around..
When I stress test I request a couple of thousand pages, one after another. I then open the Piwik page, because it somehow is good at eating memory as well. And it slowly runs out. Taking 15 minutes on low settings, for example.
What happens is, it slowly builds, flips back, builds, flips back, etc. Like 100 > 150 > 250 > 350 > 250 > 350 > 450 > 500 > 425 > until it hits 950mb. Maybe this is natural. Maybe it's a bug. Can php scripts leak memory?
@Cheek:
I'm already running a Linode 1024. I think it should be more than enough for Drupal..
Oops. Yeah, I think that should be more than enough for adequate performance too.
> What happens is, it slowly builds, flips back, builds, flips back, etc. Like 100 > 150 > 250 > 350 > 250 > 350 > 450 > 500 > 425 > until it hits 950mb. Maybe this is natural. Maybe it's a bug. Can php scripts leak memory?
That's where this isn't really making as much sense to me, at least not at the level of settings you've reached. With MaxRequestsPerChild of 1, a process is created and destroyed for each request, so even if the PHP interpreter is growing, it'll get destroyed at the end of the request. The instantaneous peak usage should be MaxClients times your largest single-request size, but you shouldn't see baseline memory usage growing over longer periods of time.
Unless there's some other long-lived process (database server, etc.) that is slowly growing and taking away resource. But that doesn't jibe with earlier posts that you're seeing the memory all going to Apache processes. Hmm, as things grow, are you seeing more than the expected MaxClients apache processes? Sometimes under load, tearing down the old processes is actually high enough overhead to take time, so you can end up with more than you expect. Though normally I'd only expect to see that once you were already swapping and not before, so not something I'd expect to account for the early stages of growth.
A "leak" in the context of a web application stack typically refers to cases when a single request uses up some resource that stays allocated to the interpreter (which is part of the Apache process) after the request is done. And yes, most any code could have such a situation since the worker process/interpreter provides a global context that exists across requests. So that can accumulate over repeated requests if the same Apache process remains alive. But letting that process die will release all those resources so even if the code is leaking this way it shouldn't matter.
I'm pretty much running out of specific suggestions at this point. Maybe if you took a few process snapshots along the way (e.g., when you start your test, when you're using about half your memory, and then just before it keels over) there might be something that jumps out when comparing them.
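Even a crude loop like this would give you snapshots to diff afterwards (just a sketch - the log path and interval are arbitrary):

while true; do
    { date; free -m; ps aux --sort=-rss | head -15; echo; } >> /tmp/memsnap.log
    sleep 30
done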
It's likely that once identified the issue is going to be obvious in hindsight and we'll see where the troubleshooting missed a key bit of information or should have tried some other step, but for now, I don't really know…
-- David
@db3l:
It's likely that once identified the issue is going to be obvious in hindsight and we'll see where the troubleshooting missed a key bit of information or should have tried some other step, but for now, I don't really know…
-- David
I'm getting a bit desperate here. The server crashed again tonight when I was asleep. I had lowered MaxClients to 6 and MaxRequestsPerChild to 15 a few days ago.
Of course I knew this could crash the server, technically. But not in a 'real life scenario'.. These settings are pretty low and affecting the performance of my site.
I'm beginning to think the memory problem is some sort of 'bug' that only happens once in a while, because whenever I check the memory consumption during the day, it's almost always lower than 150MB.
What I was thinking. Wouldn't it be possible to let apache reboot when the memory usage goes over 900mb?
And one last thing I noticed is that apache.conf has some lines about awstats. Is it possible awstats is somehow connected? If I comment the line 'Include /etc/apache2/awstats.conf' and reboot, will it be disabled?
Then look at what Drupal modules you have installed if you're still getting crashes.
@waldo:
How many of these resource-hog stats-analyzing packages are you running? awstats and piwik are not all that friendly to run.

I also run Urchin and Webalizer..
No, just kidding. :p
You make an interesting point though. As I noticed before, the Piwik dashboard was able to quickly consume my memory. But there's a javascript on every single page calling Piwik to register the user. If this script has the same problem as the one on the dashboard, that would explain the crashing when requesting a lot of pages.
I did some preliminary testing, and the results are promising. I'll do some more testing later and report the results..
@Guspaz:
If all else fails, you could try a different web server, or try running PHP using FastCGI; that will tell you definitively whether PHP is the culprit, since PHP will be running in its own processes.
Thanks for the tip!
I've moved Piwik over to another server and the Linode has been up since Saturday night without problems. I was able to crawl the whole site with 4 parallel threads without it crashing (with cache cleared).
These are the Apache settings I'm currently using:
<IfModule mpm_prefork_module>
StartServers 3
MinSpareServers 2
MaxSpareServers 3
MaxClients 12
MaxRequestsPerChild 12
</IfModule>
Whether Piwik was the root problem, or just the one that made it public, I don't know. But we'll see somewhere in the future..