IO rates, Apache crashes
I've been studying the various threads here on reported issues with high IO, and have yet to diagnose what's causing the problem on one of my Linode servers.
In my case, I'm running three 2048 nodes on Linode: an application server, a web server and a database server. Static content is on the web server, and neither the web server nor the database server breaks a sweat. Nothing gets enormous traffic.
The apps server, however, hosts Django projects that run under Apache with mod_wsgi. This server frequently averages above 1000 for disk IO rate and often hits 2000+, with peaks above 4k. Whenever it gets over 1200, the web server running Nginx returns a 502 error to the site visitor.
Here are my settings in apache2.conf. I've repeatedly lowered MaxClients.
<IfModule mpm_prefork_module>
    StartServers          5
    MinSpareServers       5
    MaxSpareServers      10
    MaxClients           25
    MaxRequestsPerChild 500
</IfModule>
If I had to guess, I'd say Apache is the culprit. Tailing syslog, I see this fairly frequently.
Apr 11 21:05:08 linodeapp kernel: Out of memory: kill process 6225 (apache2) score 62168 or a child
@wfox:
I've been studying the various threads here on reported issues with high IO, and have yet to diagnose what's causing the problem on one of my Linode servers.
Well, at least it seems fairly clear why you're getting the failures, if not necessarily the key culprit.
> If I had to guess, I'd say Apache is the culprit. Tailing syslog, I see this fairly frequently.
> Apr 11 21:05:08 linodeapp kernel: Out of memory: kill process 6225 (apache2) score 62168 or a child
While with Django the odds favor Apache (I think mod_wsgi still embeds the interpreter inside the Apache process), just being chosen by the OOM process killer need not mean that's the case.
But at a very simple level whatever you have running on the apps box is using too much memory in aggregate. Now it may be just Apache (25 simultaneous processes could burn through 2GB with about 80MB/process, ignoring all other processes), or it could be something else on the box using up a chunk of resource. Or the combination of the two - e.g., some other very large process significantly reducing the memory available to Apache processes.
So your next step (as is probably often referenced in similar threads) is to find out where your memory is going. Check your current free memory, then divide it by the worst-case Apache process size you can find, and see if your MaxClients setting puts you over. Then drop MaxClients back down so it won't, and start tuning from there.
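For example, something along these lines will give you the inputs. It's only a point-in-time snapshot, so check it under load too, and it assumes the children show up as apache2, as in your OOM log line:

# overall memory, in MB
free -m

# resident size (RSS, in KB) of each apache2 process, largest first
ps -C apache2 -o pid,rss --sort=-rss

Divide whatever is genuinely spare by the largest RSS you see and that's roughly the ceiling for MaxClients.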
If there's a big discrepancy in Apache process sizes then maybe they grow over time (say, an application stack leak), and you shouldn't let so many requests get handled by the same process (drop MaxRequestsPerChild).
But if I were you, I'd consider my first job at this point to ensure the machine never entered the OOM state (even if that means drastically cutting back on MaxClients), and only then see how to tune to get highest throughput.
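As a concrete starting point, something like this would be on the safe side; the exact numbers are just a guess to be revisited once you know how big your processes actually get:

<IfModule mpm_prefork_module>
    StartServers          3
    MinSpareServers       3
    MaxSpareServers       5
    MaxClients           15
    MaxRequestsPerChild 200
</IfModule>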
– David
Here's what top shows for memory:
Mem: 1451748k total, 1233612k used, 218136k free, 6232k buffers
Swap: 262136k total, 83856k used, 178280k free, 45360k cached
And here's ps aux | grep 'apache'
www-data 3906 0.2 3.6 101192 53404 ? Sl 18:15 0:10 /usr/sbin/apache2 -k start
www-data 4116 0.0 0.2 232700 3992 ? Sl 18:35 0:00 /usr/sbin/apache2 -k start
www-data 4154 0.0 0.3 233644 4988 ? Sl 18:35 0:00 /usr/sbin/apache2 -k start
www-data 4189 0.0 0.0 10360 1344 ? S 18:35 0:00 /usr/sbin/apache2 -k start
www-data 4190 0.1 3.4 96640 50600 ? Sl 18:35 0:02 /usr/sbin/apache2 -k start
www-data 4301 0.2 3.3 95424 49252 ? Sl 18:42 0:04 /usr/sbin/apache2 -k start
www-data 4302 0.2 3.9 103716 57536 ? Sl 18:42 0:05 /usr/sbin/apache2 -k start
www-data 4303 0.2 3.6 99468 53312 ? Sl 18:42 0:05 /usr/sbin/apache2 -k start
www-data 4304 0.2 3.4 96112 49936 ? Sl 18:42 0:05 /usr/sbin/apache2 -k start
www-data 4305 0.2 3.4 95924 49792 ? Sl 18:42 0:04 /usr/sbin/apache2 -k start
www-data 4321 0.3 3.6 98720 52344 ? Sl 18:42 0:05 /usr/sbin/apache2 -k start
www-data 4322 0.2 3.5 97692 51480 ? Sl 18:42 0:04 /usr/sbin/apache2 -k start
www-data 4329 0.2 3.6 99568 53232 ? Sl 18:42 0:03 /usr/sbin/apache2 -k start
www-data 4330 0.2 3.4 96168 49900 ? Sl 18:42 0:03 /usr/sbin/apache2 -k start
www-data 4331 0.1 3.3 95396 49212 ? Sl 18:42 0:03 /usr/sbin/apache2 -k start
www-data 4332 0.2 3.7 99836 53776 ? Sl 18:42 0:05 /usr/sbin/apache2 -k start
www-data 4333 0.1 3.2 93548 47344 ? Sl 18:42 0:03 /usr/sbin/apache2 -k start
www-data 4349 0.4 4.5 111524 65456 ? Sl 18:42 0:07 /usr/sbin/apache2 -k start
www-data 4350 0.1 3.2 93040 46724 ? Sl 18:42 0:02 /usr/sbin/apache2 -k start
www-data 4351 0.3 3.8 101396 55364 ? Sl 18:42 0:07 /usr/sbin/apache2 -k start
www-data 4352 0.2 4.4 98908 64060 ? Sl 18:42 0:04 /usr/sbin/apache2 -k start
www-data 4353 0.3 3.9 103964 57812 ? Sl 18:42 0:07 /usr/sbin/apache2 -k start
www-data 4369 0.2 3.7 99784 53776 ? Sl 18:42 0:04 /usr/sbin/apache2 -k start
www-data 4370 0.1 3.6 99572 53360 ? Sl 18:42 0:03 /usr/sbin/apache2 -k start
www-data 4371 0.4 3.7 100160 54060 ? Sl 18:42 0:07 /usr/sbin/apache2 -k start
www-data 4372 0.4 3.7 99900 53724 ? Sl 18:42 0:08 /usr/sbin/apache2 -k start
www-data 4387 0.2 3.6 99616 53460 ? Sl 18:43 0:04 /usr/sbin/apache2 -k start
www-data 4391 0.3 3.7 101068 54908 ? Sl 18:43 0:05 /usr/sbin/apache2 -k start
root 6225 0.0 0.0 10360 916 ? Ss Apr09 0:01 /usr/sbin/apache2 -k start
Oddly, there appear to be 29 apache processes. Not sure which of the memory numbers I need to use in your suggested calculation.
This is a 2048 Linode? Seems to be a lot missing in your overall memory (a small amount due to kernel measurement is ok, but that's like .5GB). Sure this isn't a 1536?
In terms of process count, at least one of those processes is the overall parent (likely 6225 started on the 9th). The others may slightly exceed MaxClients briefly at times, especially under load, while old children exit and new children start up.
It's interesting that you have a few with very large virtual sizes (column 5) but very small resident. I wonder if those are ones in the process of exiting, and if that means that perhaps they do grow a lot over time.
In any event the current snapshot seems to fit, but you're tying up about 80% of your physical memory, so there's not a lot left for caching/buffering. All other things being equal I would probably assume this is an OK configuration, but given you've OOMed in the past, it's a safe assumption you can peak higher. So I'd probably drop MaxClients further (say to 15-20) and then monitor the node over time (with munin or equivalent) to watch for memory spikes and get a feel for how your working set varies. But given the current resource usage I don't think a really dramatic drop in MaxClients is warranted.
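On the earlier question of which number to use: the resident size (RSS, the sixth column in that ps output) is what counts against physical memory. Summing it over the children over-counts shared pages a bit, but it gives a usable upper bound, e.g.:

# total resident memory of all apache2 processes, in MB (upper bound; shared pages counted repeatedly)
ps -C apache2 -o rss= | awk '{sum += $1} END {printf "%.0f MB\n", sum/1024}'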
– David
Thanks, David. This is very helpful.
The Linode dashboard shows all three of my nodes as 2048. Do you think this is something I should bring up with support?
I'll try dropping back on MaxClients.
@wfox:
Thanks, David. This is very helpful.
The Linode dashboard shows all three of my nodes as 2048. Do you think this is something I should bring up with support?
I'll try dropping back on MaxClients.
What is your uptime on these boxes? Specifically, have they been rebooted since last June 16th, when Linode would have upgraded them from 1440 to 2048?
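A quick check on each box would be something like:

# time since the last reboot
uptime

# total memory the kernel sees; it should reflect the upgraded size if the reboot has happened
free -m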
Did some checks on all 3 and the web server is the only one of the 3 that appears to show the proper memory (2276MB). Uptime 299 days.
The db and app servers show about 1650MB with uptimes of 377 days. I take it those two need to be rebooted.
@wfox:
Did some checks on all 3 and the web server is the only one of the 3 that appears to show the proper memory (2276MB). Up time 299 days.
The db and app servers show about 1650MB with uptimes of 377 days. I take it those two need to be rebooted.
:shock:
Yep, reboot them to get the additional resources.