Linode keeps OOMing; High IOWAIT; High load average - help?
I have a Linode 2048 w/ extra RAM (2678 MB total). I am running Ubuntu 10.04 LTS w/ Apache and MySQL. It hosts a single WordPress website, teleread.com, which averages about 5,000 uniques a day.
cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=10.04
DISTRIB_CODENAME=lucid
DISTRIB_DESCRIPTION="Ubuntu 10.04.1 LTS"
uname -a
Linux teleread 2.6.32.16-linode28 #1 SMP Sun Jul 25 21:32:42 UTC 2010 i686 GNU/Linux
apache2ctl -V
Server version: Apache/2.2.14 (Ubuntu)
Server built: Apr 13 2010 19:28:27
Server's Module Magic Number: 20051115:23
Server loaded: APR 1.3.8, APR-Util 1.3.9
Compiled using: APR 1.3.8, APR-Util 1.3.9
Architecture: 32-bit
Server MPM: Prefork
threaded: no
forked: yes (variable process count)
Server compiled with....
-D APACHE_MPM_DIR="server/mpm/prefork"
-D APR_HAS_SENDFILE
-D APR_HAS_MMAP
-D APR_HAVE_IPV6 (IPv4-mapped addresses enabled)
-D APR_USE_SYSVSEM_SERIALIZE
-D APR_USE_PTHREAD_SERIALIZE
-D SINGLE_LISTEN_UNSERIALIZED_ACCEPT
-D APR_HAS_OTHER_CHILD
-D AP_HAVE_RELIABLE_PIPED_LOGS
-D DYNAMIC_MODULE_LIMIT=128
-D HTTPD_ROOT=""
-D SUEXEC_BIN="/usr/lib/apache2/suexec"
-D DEFAULT_PIDLOG="/var/run/apache2.pid"
-D DEFAULT_SCOREBOARD="logs/apache_runtime_status"
-D DEFAULT_LOCKFILE="/var/run/apache2/accept.lock"
-D DEFAULT_ERRORLOG="logs/error_log"
-D AP_TYPES_CONFIG_FILE="/etc/apache2/mime.types"
-D SERVER_CONFIG_FILE="/etc/apache2/apache2.conf"
mysql --version
mysql Ver 14.14 Distrib 5.1.41, for debian-linux-gnu (i486) using readline 6.1
A few times a day the iowait starts to climb rapidly, swap starts to thrash, and the load average jumps. I have tuned, and tuned, and retuned Apache and MySQL, but no matter what I do, it keeps happening.
Running Apache2 w/ prefork MPM
<IfModule mpm_prefork_module>
StartServers 8
MinSpareServers 5
MaxSpareServers 20
ServerLimit 300
MaxClients 300
MaxRequestsPerChild 4000
</IfModule>
MySQL:
[mysqld]
user = mysql
port = 3306
socket = /var/run/mysqld/mysqld.sock
basedir = /usr
datadir = /var/lib/mysql
tmpdir = /tmp
skip-external-locking
skip-innodb
key_buffer_size = 64M
table_open_cache = 1048
sort_buffer_size = 1M
read_buffer_size = 1M
read_rnd_buffer_size = 8M
myisam_sort_buffer_size = 64M
thread_cache_size = 16
query_cache_size = 32M
tmp_table_size = 64M
max_heap_table_size = 64M
back_log = 100
max_connections = 301
max_connect_errors = 5000
join_buffer_size = 1M
open_files_limit = 10000
interactive_timeout = 300
wait_timeout = 300
thread_concurrency = 8
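For scale, one common back-of-envelope worst case for these settings (a rough upper bound only; real usage is normally far lower, since few connections allocate every per-thread buffer at once):
# global buffers + max_connections * per-thread buffers, in MB:
# key_buffer (64) + query_cache (32) + 301 * (sort 1 + read 1 + read_rnd 8 + join 1)
echo "$(( 64 + 32 + 301 * (1 + 1 + 8 + 1) )) MB"   # => 3407 MB, more than the box has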
I've tried php-cgi, fcgid, and regular mod_php to see if anything made a difference, but nothing helps.
Eventually I start getting these in the Apache error log:
[Wed Sep 22 09:19:23 2010] [warn] child process 18258 still did not exit, sending a SIGTERM
[Wed Sep 22 09:19:23 2010] [warn] child process 18303 still did not exit, sending a SIGTERM
[Wed Sep 22 09:19:23 2010] [warn] child process 18304 still did not exit, sending a SIGTERM
But I think that's a sign of the OOMing/thrashing, not a sign of the culprit.
According to netstat, at any given moment I have about 120+ TCP connections to www, but occasionally it spikes… I have seen these in the Apache error log (when I lowered MaxClients to test):
[Mon Sep 20 11:51:13 2010] [error] server reached MaxClients setting, consider raising the MaxClients setting
The lowest I've set MaxClients is 150, and I just set it back to 300 after the latest issue.
From top it looks like Apache is using 23M (RES) per process… at this moment netstat says I have 174 connections to www and ps shows 41 apache2 processes… that's roughly 943MB of RAM
At this moment MySQL is at 42M (RES)
These are the stats at this very moment in time
teleread# netstat -t | grep -c www
163
teleread# ps auxww | grep -c www-data
27
teleread# top
top - 10:05:44 up 17:43, 3 users, load average: 0.69, 0.67, 3.93
Tasks: 133 total, 2 running, 131 sleeping, 0 stopped, 0 zombie
Cpu(s): 6.2%us, 1.3%sy, 0.0%ni, 92.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.1%st
Mem: 2708280k total, 978832k used, 1729448k free, 63564k buffers
Swap: 262136k total, 14024k used, 248112k free, 312024k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
18886 mysql 20 0 121m 45m 5460 S 0 1.7 0:41.06 mysqld
19435 www-data 20 0 52696 26m 4020 S 0 1.0 0:05.63 apache2
19474 www-data 20 0 52440 26m 4024 S 0 1.0 0:04.07 apache2
19451 www-data 20 0 49648 24m 4056 S 9 0.9 0:07.32 apache2
19429 www-data 20 0 49656 24m 4060 S 0 0.9 0:06.01 apache2
19496 www-data 20 0 49880 24m 3844 S 0 0.9 0:02.08 apache2
19492 www-data 20 0 49652 24m 4056 S 0 0.9 0:04.87 apache2
19331 www-data 20 0 49652 24m 4056 S 1 0.9 0:10.19 apache2
19469 www-data 20 0 49644 23m 4024 S 0 0.9 0:03.30 apache2
19473 www-data 20 0 49636 23m 4020 S 0 0.9 0:04.82 apache2
19479 www-data 20 0 49728 23m 3828 S 0 0.9 0:02.79 apache2
19507 www-data 20 0 49652 23m 3844 S 11 0.9 0:01.59 apache2
19495 www-data 20 0 49644 23m 3844 S 0 0.9 0:01.88 apache2
19508 www-data 20 0 49472 23m 4012 S 0 0.9 0:01.76 apache2
19501 www-data 20 0 49580 23m 3880 S 0 0.9 0:02.13 apache2
19433 www-data 20 0 49368 23m 4028 S 0 0.9 0:04.99 apache2
19487 www-data 20 0 49476 23m 3860 S 0 0.9 0:03.64 apache2
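A quicker cross-check than multiplying by hand is to sum RES across all apache2 children directly (a rough sketch; RES counts shared pages once per process, so even this overstates true usage):
teleread# ps -C apache2 -o rss= | awk '{sum+=$1} END {printf "%.0f MB\n", sum/1024}'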
So… I am looking for some ideas; as I said, I am at a loss.
Thank you.
Lew
10 Replies
What's probably happening here is that your PHP processes, which normally take ~1 GB in total, occasionally do something that pushes them to ~2 GB or more, so you spiral into swap use and start thrashing I/O.
But you need better data. Run vmstat, iotop, htop, or something similar, and capture data from the times your box goes into swap: what is it doing then?
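One cheap way to make sure that data exists next time (a rough sketch; the log path and sampling interval are arbitrary):
# Append a vmstat sample every 5 seconds; after the next swap storm, check
# whether si/so (swap traffic) or bi/bo (block I/O) spiked first:
nohup vmstat 5 >> /root/vmstat.log 2>&1 &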
Maybe install something like munin to monitor your system, maybe you'll see some patterns there.
Maybe you can use mod_php instead of fastcgi? Maybe 80% of your traffic could be handled by a reverse squid proxy?
If he switched to mod_fastcgi, good, but it's still prefork that sucks up so much RAM. Keep FastCGI, switch to worker, and seriously, MaxClients down to 100-120 is totally enough. Especially after you cut KeepAliveTimeout down to 5.
[Mon Sep 20 11:51:13 2010] [error] server reached MaxClients setting, consider raising the MaxClients setting
Do not ever listen to what this error message says. It is very misleading in a VPS environment! When you're running out of memory, you want a lower MaxClients setting, not higher.
Some rules of thumb, assuming mpm_prefork and mod_php:
Linode 512: MaxClients 25 or less
Linode 1024: MaxClients 50 or less
Linode 1536: MaxClients 75 or less
Linode 2048: MaxClients 100 or less
Anything more and you're likely to swap.
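Those numbers are just division (illustrative figures: the RAM left over after MySQL and the OS, divided by per-child RES):
# e.g. a 2048 MB box, ~600 MB reserved for MySQL + OS + buffers/cache,
# and ~23 MB RES per prefork/mod_php child, as in the top output above:
echo $(( (2048 - 600) / 23 ))   # => 62 children fit without touching swap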
Also, install the WordPress Super Cache plugin.
I was running iostat -x 1 when it took place and got this (not very useful to me) -- last few entries:
avg-cpu: %user %nice %system %iowait %steal %idle
0.77 0.00 7.79 91.32 0.12 0.00
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
xvda 19.00 0.00 1157.00 5.00 33456.00 40.00 28.83 220.60 218.08 5.02 583.00
xvdb 415.00 206.00 246.00 1519.00 5056.00 13824.00 10.70 981.47 636.18 3.47 611.90
avg-cpu: %user %nice %system %iowait %steal %idle
0.96 0.00 5.76 93.16 0.12 0.00
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
xvda 0.67 1.01 116.95 0.17 3393.29 4.03 29.01 42.31 357.67 6.29 73.61
xvdb 32.72 24.16 14.77 164.93 408.05 1551.68 10.91 125.80 726.63 4.20 75.42
avg-cpu: %user %nice %system %iowait %steal %idle
1.41 0.00 11.54 86.92 0.12 0.00
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
xvda 1.56 0.00 82.58 0.31 2397.51 7.47 29.01 27.58 242.03 4.11 34.11
xvdb 26.75 11.35 15.40 95.02 364.54 831.10 10.83 52.70 481.36 3.16 34.93
I had tried the worker MPM with FastCGI, but it was still happening. In fact, I just switched to prefork with mod_php this past Friday as an attempt to fix the issue.
I also have W3 Total Cache installed, which I personally think outperforms Super Cache.
I'll try lowering MaxClients tonight.
I know what I have is overkill for the traffic being generated, which is why I am utterly confused. I manage several servers and this is the only one giving me grief.
Thanks again.
@lewayotte:
I also have W3 Total Cache installed, which I personally think outperforms Super Cache.
IIRC, if you use Total Cache, you still need to go through the PHP interpreter. Total Cache has truckloads of features, but it comes at the expense of having to use PHP. (After all, its features are written in PHP.) At least that's how I remember it; recent versions may have changed a bit.
In contrast, Super Cache tinkers with your .htaccess file so that most pages completely bypass PHP. The two plugins have different performance characteristics. You have to judge not only by raw speed but also take your low-memory situation into account. I think the "bypass PHP altogether" approach has clear benefits in this regard, but YMMV.
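For illustration, the bypass works roughly like this in .htaccess (a simplified sketch, not the plugin's exact rules, which also check cookies, query strings, and mobile user agents):
# Serve a pre-generated static copy when one exists, skipping PHP entirely:
RewriteEngine On
RewriteCond %{REQUEST_METHOD} GET
RewriteCond %{QUERY_STRING} ^$
RewriteCond %{DOCUMENT_ROOT}/wp-content/cache/supercache/%{HTTP_HOST}%{REQUEST_URI}index.html -f
RewriteRule ^ /wp-content/cache/supercache/%{HTTP_HOST}%{REQUEST_URI}index.html [L]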
Using prefork, you would save next to nothing, since the PHP engine is loaded even for serving static content.
It's way easier than trying to convert all the Apache dependencies into lighttpd's or nginx's syntax, setting up proxying, and/or (in the case of nginx) compiling the whole mess by hand to get anything resembling a "recent version".
If anyone's interested…
apc.shm_size = 64
(...)
StartServers 2
MaxClients 250
MinSpareThreads 25
MaxSpareThreads 75
ThreadsPerChild 25
MaxRequestsPerChild 0
# Linux default is 8 MB per thread stack, a ton of wasted VMem;
# 2 MB seems totally safe, and could probably be reduced further.
ThreadStackSize 2097152
(...)
FastCgiConfig \
-idle-timeout 120 \
-initial-env PHP_FCGI_CHILDREN=24 \
-initial-env PHP_FCGI_MAX_REQUESTS=500 \
-killInterval 100000 \
-listen-queue-depth 300 \
-maxClassProcesses 1 \
-singleThreshold 0
All the power to you if you want to debug it, but a rebuild would probably be quite a bit faster.
@akerl:
Odds are, you accidentally deleted a # in a random config file while tuning things, and uncommented "eat-all-swap-space=1" or something.
I doubt it. 8)
@lewayotte:
StartServers 8
MinSpareServers 5
MaxSpareServers 20
ServerLimit 300
MaxClients 300
MaxRequestsPerChild 4000
Three hundred prefork processes, each fully loaded with the PHP engine. No surprise the box is OOMing.
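Put numbers on that (using the ~23 MB RES per child from the top output earlier in the thread):
# If traffic ever pushes Apache to its configured limit:
echo "$(( 300 * 23 )) MB"   # => 6900 MB demanded of a 2678 MB Linode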