I/O rates, Apache crashes

I've been studying the various threads here on reported issues with high I/O, and have yet to diagnose what's causing the problem on one of my Linode servers.

In my case, I'm running three Linode 2048 nodes: an application server, a web server, and a database server. Static content is on the web server, and neither the web server nor the database server breaks a sweat. Nothing gets enormous traffic.

The app server, however, hosts Django projects running under mod_wsgi and Apache. This server frequently averages above 1000 for disk I/O rate and often hits 2000+, with peaks above 4k. Whenever it gets over 1200, the web server running Nginx returns a 502 error to site visitors.

Here are my settings in apache2.conf. I've repeatedly lowered MaxClients.

<IfModule mpm_prefork_module>
    StartServers          5
    MinSpareServers       5
    MaxSpareServers      10
    MaxClients           25
    MaxRequestsPerChild 500
</IfModule>

If I had to guess, I'd say Apache is the culprit. Tailing syslog, I see this fairly frequently.

Apr 11 21:05:08 linodeapp kernel: Out of memory: kill process 6225 (apache2) score 62168 or a child

7 Replies

@wfox:

> I've been studying the various threads here on reported issues with high I/O, and have yet to diagnose what's causing the problem on one of my Linode servers.
Well, at least it seems fairly clear why you're getting the failures, if not necessarily the key culprit.

> If I had to guess, I'd say Apache is the culprit. Tailing syslog, I see this fairly frequently.

Apr 11 21:05:08 linodeapp kernel: Out of memory: kill process 6225 (apache2) score 62168 or a child

While with Django the odds favor Apache (I think mod_wsgi still embeds the Python interpreter inside the Apache process), merely being chosen by the OOM killer doesn't prove it's the process at fault.

But at a very simple level, whatever you have running on the app box is using too much memory in aggregate. It may be just Apache (25 simultaneous processes could burn through 2GB at about 80MB/process, ignoring everything else), or it could be something else on the box eating a chunk of resources, or a combination of the two, e.g. some other very large process significantly reducing the memory available to the Apache processes.

So your next step (as is probably covered in similar threads) is to find out where your memory is going. Check your current free memory, then divide it by the worst-case Apache process size you can find, and see if your MaxClients setting puts you over. Then drop MaxClients so it doesn't, and start tuning from there.
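As a back-of-the-envelope version of that calculation (the two numbers here are placeholders; substitute your own free-memory figure and the largest Apache resident size you see in ps):

```shell
# Rough MaxClients ceiling: memory you can spare, divided by the
# worst-case Apache process size. Both figures are assumptions.
free_mb=1400          # MB you can afford to give Apache in total
worst_apache_mb=65    # largest resident size seen for an apache2 worker
echo "MaxClients ceiling: $(( free_mb / worst_apache_mb ))"
```

With these example numbers the ceiling comes out at 21, already below the 25 in the posted config once process growth is accounted for.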

If there's a big discrepancy in Apache process sizes, then maybe they grow over time (say, an application-stack leak), and you shouldn't let so many requests be handled by the same process (drop MaxRequestsPerChild).

But if I were you, I'd consider my first job at this point to ensure the machine never entered the OOM state (even if that means drastically cutting back on MaxClients), and only then see how to tune to get highest throughput.

– David

Based on running top, here's what my memory shows:

Mem:   1451748k total,  1233612k used,   218136k free,     6232k buffers
Swap:   262136k total,    83856k used,   178280k free,    45360k cached

And here's ps aux | grep 'apache'

www-data  3906  0.2  3.6 101192 53404 ?        Sl   18:15   0:10 /usr/sbin/apache2 -k start
www-data  4116  0.0  0.2 232700  3992 ?        Sl   18:35   0:00 /usr/sbin/apache2 -k start
www-data  4154  0.0  0.3 233644  4988 ?        Sl   18:35   0:00 /usr/sbin/apache2 -k start
www-data  4189  0.0  0.0  10360  1344 ?        S    18:35   0:00 /usr/sbin/apache2 -k start
www-data  4190  0.1  3.4  96640 50600 ?        Sl   18:35   0:02 /usr/sbin/apache2 -k start
www-data  4301  0.2  3.3  95424 49252 ?        Sl   18:42   0:04 /usr/sbin/apache2 -k start
www-data  4302  0.2  3.9 103716 57536 ?        Sl   18:42   0:05 /usr/sbin/apache2 -k start
www-data  4303  0.2  3.6  99468 53312 ?        Sl   18:42   0:05 /usr/sbin/apache2 -k start
www-data  4304  0.2  3.4  96112 49936 ?        Sl   18:42   0:05 /usr/sbin/apache2 -k start
www-data  4305  0.2  3.4  95924 49792 ?        Sl   18:42   0:04 /usr/sbin/apache2 -k start
www-data  4321  0.3  3.6  98720 52344 ?        Sl   18:42   0:05 /usr/sbin/apache2 -k start
www-data  4322  0.2  3.5  97692 51480 ?        Sl   18:42   0:04 /usr/sbin/apache2 -k start
www-data  4329  0.2  3.6  99568 53232 ?        Sl   18:42   0:03 /usr/sbin/apache2 -k start
www-data  4330  0.2  3.4  96168 49900 ?        Sl   18:42   0:03 /usr/sbin/apache2 -k start
www-data  4331  0.1  3.3  95396 49212 ?        Sl   18:42   0:03 /usr/sbin/apache2 -k start
www-data  4332  0.2  3.7  99836 53776 ?        Sl   18:42   0:05 /usr/sbin/apache2 -k start
www-data  4333  0.1  3.2  93548 47344 ?        Sl   18:42   0:03 /usr/sbin/apache2 -k start
www-data  4349  0.4  4.5 111524 65456 ?        Sl   18:42   0:07 /usr/sbin/apache2 -k start
www-data  4350  0.1  3.2  93040 46724 ?        Sl   18:42   0:02 /usr/sbin/apache2 -k start
www-data  4351  0.3  3.8 101396 55364 ?        Sl   18:42   0:07 /usr/sbin/apache2 -k start
www-data  4352  0.2  4.4  98908 64060 ?        Sl   18:42   0:04 /usr/sbin/apache2 -k start
www-data  4353  0.3  3.9 103964 57812 ?        Sl   18:42   0:07 /usr/sbin/apache2 -k start
www-data  4369  0.2  3.7  99784 53776 ?        Sl   18:42   0:04 /usr/sbin/apache2 -k start
www-data  4370  0.1  3.6  99572 53360 ?        Sl   18:42   0:03 /usr/sbin/apache2 -k start
www-data  4371  0.4  3.7 100160 54060 ?        Sl   18:42   0:07 /usr/sbin/apache2 -k start
www-data  4372  0.4  3.7  99900 53724 ?        Sl   18:42   0:08 /usr/sbin/apache2 -k start
www-data  4387  0.2  3.6  99616 53460 ?        Sl   18:43   0:04 /usr/sbin/apache2 -k start
www-data  4391  0.3  3.7 101068 54908 ?        Sl   18:43   0:05 /usr/sbin/apache2 -k start
root      6225  0.0  0.0  10360   916 ?        Ss   Apr09   0:01 /usr/sbin/apache2 -k start

Oddly, there appear to be 29 Apache processes. I'm not sure which of the memory numbers I need to use in your suggested calculation.

The key one is the resident set size (column 6), the best indicator of real memory usage. Looks like you're averaging maybe 50-60MB per Apache process, so at 25 clients you'd burn around 1.5GB, assuming they don't grow further.
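If you want the math done for you, an awk pipeline over that column works. Here it's fed three sample RSS values from the listing above; on the live box you'd pipe in `ps -C apache2 -o rss=` instead (assuming the processes are named apache2):

```shell
# Average the resident set sizes (in KB) of the Apache workers.
# Demo input: three RSS values from the ps listing. Live equivalent:
#   ps -C apache2 -o rss= | awk '...'
printf '53404\n50600\n49252\n' |
  awk '{ sum += $1; n++ } END { printf "%d procs, %.1f MB avg\n", n, sum / (n * 1024) }'
# prints: 3 procs, 49.9 MB avg
```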

This is a 2048 Linode? There seems to be a lot missing from your overall memory (a small loss to kernel overhead is normal, but you're short something like 0.5GB). Are you sure this isn't a 1536?

In terms of process count, at least one of those processes is the overall parent (likely 6225 started on the 9th). The others may slightly exceed MaxClients briefly at times, especially under load, while old children exit and new children start up.

It's interesting that you have a few with very large virtual sizes (column 5) but very small resident. I wonder if those are ones in the process of exiting, and if that means that perhaps they do grow a lot over time.

In any event the current snapshot fits, but you're tying up about 80% of your physical memory, so there isn't much left for caching and buffering. All other things being equal I'd call this an acceptable configuration, but given you've hit OOM in the past, it's safe to assume usage can peak higher. So I'd drop MaxClients further (say to 15-20) and then monitor the node over time (with munin or equivalent) to watch for memory spikes and get a feel for how your working set varies. Given the current resource usage, though, I don't think a really dramatic drop in MaxClients is warranted.
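Applied to the prefork block from the original post, that advice might look something like this. The specific values are illustrative guesses to be validated against your own monitoring, not a tested recommendation:

```apache
<IfModule mpm_prefork_module>
    StartServers          3
    MinSpareServers       3
    MaxSpareServers       6
    MaxClients           18
    MaxRequestsPerChild 300
</IfModule>
```

The lower MaxRequestsPerChild also hedges against the growing-process pattern mentioned earlier, by recycling workers before a leak gets expensive.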

– David

Thanks, David. This is very helpful.

The Linode dashboard shows all three of my nodes as 2048. Do you think this is something I should bring up with support?

I'll try dropping back on MaxClients.

@wfox:

> Thanks, David. This is very helpful.
>
> The Linode dashboard shows all three of my nodes as 2048. Do you think this is something I should bring up with support?
>
> I'll try dropping back on MaxClients.

What is your uptime on these boxes? Specifically, have they been rebooted since last June 16th, when Linode would have upgraded them from 1440 to 2048?

Did some checks on all three, and the web server is the only one that appears to show the proper memory (2276MB). Uptime: 299 days.

The db and app servers show about 1650MB with uptimes of 377 days. I take it those two need to be rebooted. :shock:
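For anyone following along, those checks are one-liners with standard Linux tools, nothing Linode-specific. The kernel only sizes memory at boot, which is why a host-side resize doesn't show up until you reboot:

```shell
# Total memory the kernel sees (from /proc/meminfo) and time since boot.
awk '/^MemTotal:/ { printf "%.0f MB total\n", $2 / 1024 }' /proc/meminfo
uptime
```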

@wfox:

> Did some checks on all three, and the web server is the only one that appears to show the proper memory (2276MB). Uptime: 299 days.
>
> The db and app servers show about 1650MB with uptimes of 377 days. I take it those two need to be rebooted. :shock:

Yep, reboot them to get the additional resources :)
