Segfaults on host 26 (there's no /lib/tls)
daemons have been dying every few days. It's usually after
cronjobs are run (would logrotate have anything to do with it?)
Anyway, these are the ones which died already:
apache (version from stable)
bind9 (version from stable; died several times)
dovecot (backported; died more than twice)
amavis (backported; died several times)
I had to put lines in /etc/crontab restarting them daily.
Today aptitude also segfaulted:
host:~# aptitude update
Reading Package Lists… Done
Segmentation faulty Tree… 87%
host:~#
And the next time I ran it, everything worked fine.
There is no /lib/tls, or /usr/lib/tls, and libc6 doesn't seem to install
any files with "tls" in the name.
So… Is anyone else having problems with host 26? Doesn't that seem
like a hardware issue (bad memory)?
(Interesting fact: postfix never died.)
jp
14 Replies
-Chris
> Which kernel are you running?
Latest 2.4 (2.4.26-linode31-1um).
> There have been a number of updates to UML that are pending
release – I'll try to build a pre-release kernel and get it out to you guys.
I seriously doubt this is a hardware issue, though..
I thought it would be hardware because if it was the kernel, people with
accounts on all other hosts would have complained already… Anyway,
if you get a new kernel, we'll try it out!
jp
@jp:
> Which kernel are you running?
Latest 2.4 (2.4.26-linode31-1um).
Although this isn't the final release, would you mind giving 2.6.9-rc2-mm4-linode5 a shot? Before you do, make sure to do an "apt-get install dhcp3-client" to get the required DHCP update for Debian…
Thanks,
-Chris
lines from the cronjob, and I suppose we'll have to wait a few days to see
how the host behaves…
Thanks for your help!
jp
7FD9E4239F 2821 Mon Oct 4 11:34:09
After restarting amavis, some of the messages were delivered, but others had the same problem.
I guess this is because of the new kernel. What is the right way to fix it?
jp
as can be seen by the logs below. (Trimmed -- I'll post or email the complete stuff
if necessary).
I don't run any huge websites or applications on this linode. Just this:
Apache (almost no dynamic content)
MySQL (only for email database)
Postfix
Amavis+Clamav+Spamassassin
Dovecot imapd
And the ordinary stuff (sshd, etc)
No Java, and no heavy stuff.
What's going on?
jp
Oct 4 01:29:17 localhost kernel: oom-killer: gfp_mask=0x1d2
Oct 4 01:29:17 localhost kernel: DMA per-cpu:
Oct 4 01:29:17 localhost kernel: cpu 0 hot: low 8, high 24, batch 4
Oct 4 01:29:17 localhost kernel: cpu 0 cold: low 0, high 8, batch 4
Oct 4 01:29:17 localhost kernel: Normal per-cpu: empty
Oct 4 01:29:17 localhost kernel: HighMem per-cpu: empty
Oct 4 01:29:17 localhost kernel:
Oct 4 01:29:17 localhost kernel: Free pages: 248kB (0kB HighMem)
Oct 4 01:29:17 localhost kernel: Active:6030 inactive:5828 dirty:0 writeback:7 unstable:0 free:62 slab:2119 mapped:5862 pagetables:391
Oct 4 01:29:17 localhost kernel: DMA free:248kB min:256kB low:512kB high:768kB active:24120kB inactive:23312kB present:65536kB
Oct 4 01:29:17 localhost kernel: protections[]: 0 0 0
Oct 4 01:29:17 localhost kernel: Normal free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB
Oct 4 01:29:17 localhost kernel: protections[]: 0 0 0
Oct 4 01:29:27 localhost kernel: HighMem free:0kB min:128kB low:256kB high:384kB active:0kB inactive:0kB present:0kB
Oct 4 01:29:27 localhost kernel: protections[]: 0 0 0
Oct 4 01:29:27 localhost kernel: DMA: 0*4kB 1*8kB 5*16kB 1*32kB 0*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 248kB
Oct 4 01:29:27 localhost kernel: Normal: empty
Oct 4 01:29:27 localhost kernel: HighMem: empty
Oct 4 01:29:27 localhost kernel: Swap cache: add 1140080, delete 1140034, find 191763/356157, race 1+8
Oct 4 01:29:27 localhost kernel: Out of Memory: Killed process 17500 (apache).
Oct 4 01:30:12 localhost kernel: Out of Memory: Killed process 12262 (mysqld).
Oct 4 01:30:12 localhost kernel: Out of Memory: Killed process 1233 (mysqld).
Oct 4 01:30:12 localhost kernel: Out of Memory: Killed process 1235 (mysqld).
Oct 4 01:30:12 localhost kernel: Out of Memory: Killed process 1236 (mysqld).
Oct 4 01:30:12 localhost kernel: Out of Memory: Killed process 1239 (mysqld).
Oct 4 01:30:12 localhost kernel: oom-killer: gfp_mask=0x1d2
Oct 4 01:30:12 localhost kernel: DMA per-cpu:
Oct 4 01:30:12 localhost kernel: cpu 0 hot: low 8, high 24, batch 4
Oct 4 01:30:12 localhost kernel: cpu 0 cold: low 0, high 8, batch 4
Oct 4 01:30:12 localhost kernel: Normal per-cpu: empty
Oct 4 01:30:12 localhost kernel: HighMem per-cpu: empty
Oct 4 01:30:12 localhost kernel: Out of Memory: Killed process 11971 (amavisd-new).
It was at 1:27, and this is the daily cronjob time. I have already disabled aide (thought it was eating up all memory), and processes are still being killed.
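For anyone who wants to check which processes the OOM killer picked and when, here is a minimal Python sketch that tallies the "Killed process" lines from the excerpt above. The names `LOG`, `KILL_RE` and `tally_oom_kills` are made up for illustration, and the regular expression only assumes the message format seen in this particular log; other kernels may word it differently.

```python
import re
from collections import Counter

# Lines copied from the log excerpt in this post.
LOG = """\
Oct 4 01:29:27 localhost kernel: Out of Memory: Killed process 17500 (apache).
Oct 4 01:30:12 localhost kernel: Out of Memory: Killed process 12262 (mysqld).
Oct 4 01:30:12 localhost kernel: Out of Memory: Killed process 1233 (mysqld).
Oct 4 01:30:12 localhost kernel: Out of Memory: Killed process 1235 (mysqld).
Oct 4 01:30:12 localhost kernel: Out of Memory: Killed process 1236 (mysqld).
Oct 4 01:30:12 localhost kernel: Out of Memory: Killed process 1239 (mysqld).
Oct 4 01:30:12 localhost kernel: Out of Memory: Killed process 11971 (amavisd-new).
"""

# Captures HH:MM, the PID and the command name in parentheses.
KILL_RE = re.compile(
    r"(\d\d:\d\d):\d\d .*Out of Memory: Killed process (\d+) \((.+)\)\."
)

def tally_oom_kills(text):
    """Return (kills per command name, kills per HH:MM minute)."""
    by_name, by_minute = Counter(), Counter()
    for line in text.splitlines():
        m = KILL_RE.search(line)
        if m:
            minute, _pid, name = m.groups()
            by_name[name] += 1
            by_minute[minute] += 1
    return by_name, by_minute

by_name, by_minute = tally_oom_kills(LOG)
print(dict(by_name))
print(dict(by_minute))
```

Run against a full syslog, the per-minute counts make it easy to see whether the kills cluster around the daily cron run.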
This is what is left in cron.daily:
-rwxr-xr-x 1 root root 314 Jun 11 14:15 amavisd-new
-rwxr-xr-x 1 root root 502 Jul 4 2002 calendar
-rwxr-xr-x 1 root root 315 Mar 11 2002 dlocate
-rwxr-xr-x 1 root root 280 Jul 11 17:25 find
-rwxr-xr-x 1 root root 51 Apr 23 2002 logrotate
-rwxr-xr-x 1 root root 708 Mar 14 2002 man-db
-rwxr-xr-x 1 root root 226 Apr 5 2002 mgetty
-rwxr-xr-x 1 root root 495 Nov 18 2001 netkit-inetd
-rwxr-xr-x 1 root root 345 Mar 7 2002 quota
-rwxr-xr-x 1 root root 2736 Oct 1 2001 standard
-rwxr-xr-x 1 root root 1197 Jan 3 2002 sysklogd
(I have omitted disabled entries)
Can any of these eat up lots of memory? This is getting frustrating…
jp
ps -e -o pid,cmd,%mem,rss,trs,sz,vsz
cat /proc/meminfo
to see what your memory usage is like, and
cat /proc/io_status
to see how your io limiter values are set.
> When oom killer is choosing something to kill, it has a preference for processes which are consuming a lot of memory but which are not long lived. The fact that apache seems to get picked first leads me to suspect that it is consuming too much memory. The situation may be being made worse by the io limiter kicking in and delaying swap operations - a possible cause of your meltdown in the early hours of 4th Oct
Well, it doesn't seem like apache is using too much memory… Amavis uses more.
Maybe some daily cronjob is triggering several processes that use lots of memory (not individually, but collectively)? Is it common for cronjobs to behave like that?
I ask because for two consecutive days, the oom killer was triggered at the time when the daily cronjob was running.
The information you asked for follows.
jp
# ps -e -o pid,cmd,%mem,rss,trs,sz,vsz
PID CMD %MEM RSS TRS SZ VSZ
1 init [2] 0.1 72 24 318 1272
2 [ksoftirqd/0] 0.0 0 0 0 0
3 [events/0] 0.0 0 0 0 0
4 [khelper] 0.0 0 0 0 0
5 [kthread] 0.0 0 0 0 0
6 [kblockd/0] 0.0 0 0 0 0
17 [pdflush] 0.0 0 0 0 0
18 [pdflush] 0.0 0 0 0 0
20 [aio/0] 0.0 0 0 0 0
19 [kswapd0] 0.0 0 0 0 0
21 [jfsIO] 0.0 0 0 0 0
22 [jfsCommit] 0.0 0 0 0 0
23 [jfsSync] 0.0 0 0 0 0
24 [xfslogd/0] 0.0 0 0 0 0
25 [xfsdatad/0] 0.0 0 0 0 0
26 [xfsbufd] 0.0 0 0 0 0
652 [kjournald] 0.0 0 0 0 0
692 [kjournald] 0.0 0 0 0 0
693 [xfssyncd] 0.0 0 0 0 0
738 dhclient eth0 0.0 0 378 479 1916
1040 /usr/sbin/sshd 0.1 92 264 697 2788
1052 /sbin/syslogd 0.4 264 22 336 1344
1055 /sbin/klogd 0.2 160 17 316 1264
1058 /usr/sbin/named 1.3 800 232 2582 10328
1059 /usr/sbin/named 1.3 804 232 2582 10328
1061 /usr/sbin/named 1.3 804 232 2582 10328
1064 /usr/sbin/named 1.3 804 232 2582 10328
1065 /usr/sbin/named 1.3 804 232 2582 10328
1071 amavisd (master) 1.5 908 659 7101 28404
1075 /usr/sbin/spamd 0.0 0 659 5633 22532
1085 spamd child 0.0 0 659 5633 22532
1086 spamd child 0.0 0 659 5633 22532
1087 spamd child 0.0 0 659 5633 22532
1088 spamd child 0.0 0 659 5633 22532
1089 spamd child 0.0 0 659 5633 22532
1090 /usr/sbin/clamd 5.7 3424 35 7987 31948
1131 /usr/bin/freshcl 0.6 392 27 510 2040
1136 /usr/sbin/courie 0.0 0 8 362 1448
1137 /usr/lib/courier 0.0 48 59 515 2060
1139 /usr/lib/courier 0.0 48 59 515 2060
1140 /usr/lib/courier 0.0 48 59 515 2060
1141 /usr/lib/courier 0.0 48 59 515 2060
1142 /usr/lib/courier 0.0 48 59 515 2060
1143 /usr/lib/courier 0.0 48 59 515 2060
1156 /usr/sbin/inetd 0.0 24 15 327 1308
1167 /bin/sh /usr/bin 0.0 0 473 547 2188
1209 /usr/sbin/mysqld 1.9 1180 3692 16294 65176
1216 /usr/sbin/mysqld 1.9 1180 3692 16294 65176
1217 /usr/sbin/mysqld 1.9 1180 3692 16294 65176
1218 /usr/sbin/mysqld 1.9 1180 3692 16294 65176
1315 /usr/lib/postfix 0.3 188 21 628 2512
1319 qmgr -l -t fifo 0.8 496 34 680 2720
1353 /usr/lib/postgre 0.3 188 1457 2137 8548
1355 postgres: stats 0.0 32 1457 2385 9540
1356 postgres: stats 0.1 84 1457 2148 8592
1371 /usr/sbin/courie 0.0 0 8 358 1432
1372 /usr/lib/courier 0.0 0 1452 749 2996
1375 /usr/lib/courier 0.0 0 1452 749 2996
1377 /usr/lib/courier 0.0 0 1452 749 2996
1379 /usr/lib/courier 0.0 0 1452 749 2996
1381 /usr/lib/courier 0.0 0 1452 749 2996
1383 /usr/lib/courier 0.0 0 1452 749 2996
1391 /usr/bin/python2 0.1 72 418 1357 5428
1392 /usr/bin/python2 1.3 820 418 4596 18384
1401 /usr/sbin/cron 0.2 172 21 414 1656
1408 /sbin/getty 3840 0.0 0 10 314 1256
1454 /usr/sbin/doveco 0.2 156 75 614 2456
1456 dovecot-auth 0.8 500 105 972 3888
1486 /usr/sbin/clamd 5.7 3424 35 7987 31948
2770 SCREEN -S im 1.1 660 243 654 2616
2771 /bin/bash 0.0 0 473 557 2228
2774 centericq 2.5 1540 3858 2425 9700
6509 /usr/sbin/sshd 0.0 0 264 1430 5720
6512 /usr/sbin/sshd 0.0 48 264 1464 5856
6513 -bash 0.2 140 473 555 2220
6911 /usr/sbin/sshd 0.0 0 264 1430 5720
6914 /usr/sbin/sshd 0.1 60 264 1455 5820
6915 -bash 0.2 132 473 554 2216
16807 /usr/sbin/mysqld 1.9 1180 3692 16294 65176
18443 /usr/sbin/apache 0.5 324 218 20530 82120
18444 /usr/sbin/apache 8.6 5116 218 21157 84628
18445 /usr/sbin/apache 1.4 836 218 20564 82256
18447 /usr/sbin/apache 3.6 2164 218 20709 82836
18448 /usr/sbin/apache 9.2 5484 218 21276 85104
18449 /usr/sbin/apache 8.6 5128 218 21121 84484
19673 /usr/sbin/apache 5.2 3128 218 21137 84548
19678 /usr/sbin/apache 9.0 5372 218 21223 84892
19679 /usr/sbin/apache 3.0 1816 218 21311 85244
19680 /usr/sbin/apache 0.9 588 218 20572 82288
19967 /usr/sbin/apache 5.5 3312 218 21230 84920
28516 amavisd (child) 23.9 14216 659 7133 28532
28557 /usr/sbin/sshd 1.3 800 264 1427 5708
28559 /usr/sbin/sshd 1.6 1008 264 1461 5844
28560 -bash 1.4 864 473 556 2224
28952 amavisd (virgin 3.0 1796 659 7101 28404
29752 pickup -l -t fif 1.6 956 6 661 2644
30493 imap-login 2.0 1220 73 622 2488
30564 imap-login 2.0 1220 73 622 2488
30566 imap-login 2.0 1220 73 622 2488
30567 -su 2.1 1248 473 555 2220
30571 ps -e -o pid,cmd 1.0 644 59 514 2056
# cat /proc/meminfo
MemTotal: 59352 kB
MemFree: 760 kB
Buffers: 348 kB
Cached: 6744 kB
SwapCached: 11184 kB
Active: 45612 kB
Inactive: 2980 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 59352 kB
LowFree: 760 kB
SwapTotal: 132088 kB
SwapFree: 10052 kB
Dirty: 48 kB
Writeback: 4 kB
Mapped: 42456 kB
Slab: 6896 kB
Committed_AS: 483472 kB
PageTables: 1588 kB
VmallocTotal: 973804 kB
VmallocUsed: 676 kB
VmallocChunk: 973120 kB
# cat /proc/io_status
io_count=1851754 io_rate=555 io_tokens=398756 token_refill=512 token_max=400000
The %MEM value shows how much physical memory the process is consuming. Apache has low values because it is swapped out. Look at the SZ values - how much virtual memory is being used. Apache and MySQL are the big users. Apache looks like it has a lot of modules loaded. Apache on my Linode uses a quarter as much memory, even with mod_ssl, mod_php and all the standard stuff loaded. Also, you have 11 Apache instances running - a lot for a Linode 64 - if your server needs that many, it's probably overloaded. Since they're mostly swapped out, Apache doesn't look that busy.
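To make the per-daemon comparison less of an eyeball exercise, the ps listing can be grouped by command. Below is a rough Python sketch using the Apache, MySQL and amavisd child rows from the listing above; `PS_ROWS` and `summarize` are names invented for this example. Treat the totals as an upper bound: RSS counts shared pages once per process, and the identical mysqld rows are most likely threads of a single process.

```python
from collections import defaultdict

# Rows copied from the ps listing (columns: PID CMD %MEM RSS TRS SZ VSZ).
PS_ROWS = """\
1209 /usr/sbin/mysqld 1.9 1180 3692 16294 65176
1216 /usr/sbin/mysqld 1.9 1180 3692 16294 65176
1217 /usr/sbin/mysqld 1.9 1180 3692 16294 65176
1218 /usr/sbin/mysqld 1.9 1180 3692 16294 65176
16807 /usr/sbin/mysqld 1.9 1180 3692 16294 65176
18443 /usr/sbin/apache 0.5 324 218 20530 82120
18444 /usr/sbin/apache 8.6 5116 218 21157 84628
18445 /usr/sbin/apache 1.4 836 218 20564 82256
18447 /usr/sbin/apache 3.6 2164 218 20709 82836
18448 /usr/sbin/apache 9.2 5484 218 21276 85104
18449 /usr/sbin/apache 8.6 5128 218 21121 84484
19673 /usr/sbin/apache 5.2 3128 218 21137 84548
19678 /usr/sbin/apache 9.0 5372 218 21223 84892
19679 /usr/sbin/apache 3.0 1816 218 21311 85244
19680 /usr/sbin/apache 0.9 588 218 20572 82288
19967 /usr/sbin/apache 5.5 3312 218 21230 84920
28516 amavisd (child) 23.9 14216 659 7133 28532
"""

def summarize(ps_text):
    """Group ps rows by command: instance count, summed RSS, largest VSZ (kB)."""
    groups = defaultdict(lambda: {"count": 0, "rss_kb": 0, "max_vsz_kb": 0})
    for row in ps_text.splitlines():
        if not row.strip():
            continue
        # CMD may contain spaces, so peel the five numeric columns off the right.
        head, _mem, rss, _trs, _sz, vsz = row.rsplit(None, 5)
        _pid, cmd = head.split(None, 1)
        g = groups[cmd]
        g["count"] += 1
        g["rss_kb"] += int(rss)
        g["max_vsz_kb"] = max(g["max_vsz_kb"], int(vsz))
    return dict(groups)

summary = summarize(PS_ROWS)
for cmd, g in sorted(summary.items(), key=lambda kv: -kv[1]["rss_kb"]):
    print(cmd, g["count"], g["rss_kb"], g["max_vsz_kb"])
```

Peeling the numeric columns off the right-hand side avoids guessing where CMD ends, which matters for entries such as "amavisd (child)".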
Suggestions:
* Reconfigure Apache so it doesn't load any module that you aren't using - lots of distros configure it with a bunch of stuff, just in case.
* Reduce the value of MinSpareServers (default 5) to 3 and MaxSpareServers (default 10) to 5 in your Apache config.
* Tune MySQL to trade speed for less memory usage.
Also, increase the size of your swap partition (or add a second one). I know that the rule of thumb is swap = 2 * physical, but with what you're trying to run, 128MB of swap just isn't enough. I have 256MB of swap on my Linode 64 - this is what caker recommends.
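The swap advice can be checked against the /proc/meminfo figures posted earlier. This is a small Python sketch of the arithmetic; `MEMINFO` and `parse_meminfo` are illustrative names, and only the fields needed here are copied in. Committed_AS is the kernel's estimate of memory that has been promised to processes, so comparing it with RAM plus swap gives only a rough idea of how overcommitted the box is.

```python
# Selected fields copied from the /proc/meminfo output in this thread.
MEMINFO = """\
MemTotal: 59352 kB
MemFree: 760 kB
SwapTotal: 132088 kB
SwapFree: 10052 kB
Committed_AS: 483472 kB
"""

def parse_meminfo(text):
    """Turn 'Key: value kB' lines into a dict of integer kB values."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if rest.strip():
            values[key.strip()] = int(rest.split()[0])
    return values

info = parse_meminfo(MEMINFO)

# How much of the swap partition is already in use.
swap_used_kb = info["SwapTotal"] - info["SwapFree"]
swap_used_pct = 100 * swap_used_kb / info["SwapTotal"]

# Memory promised to processes versus what the machine can actually back.
ram_plus_swap_kb = info["MemTotal"] + info["SwapTotal"]
commit_ratio = info["Committed_AS"] / ram_plus_swap_kb

print(f"swap used: {swap_used_kb} kB ({swap_used_pct:.1f}%)")
print(f"committed / (RAM + swap): {commit_ratio:.2f}")
```

With swap almost full at the time of the snapshot, there is little headroom left for a burst of cron jobs, which fits the timing of the kills.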
> * Reconfigure Apache so it doesn't load any module that you aren't using - lots of distros configure it with a bunch of stuff, just in case.
> * Reduce the value of MinSpareServers (default 5) to 3 and MaxSpareServers (default 10) to 5 in your Apache config.
> * Tune MySQL to trade speed for less memory usage.
I see what you mean. I will make those changes.
> Also, increase the size of your swap partition (or add a second one). I know that the rule of thumb is swap = 2 * physical, but with what you're trying to run, 128MB of swap just isn't enough. I have 256MB of swap on my Linode 64 - this is what caker recommends.
Ah, right! I used to admin another Linode, and I think we used to have 256MB of swap, so that's why it never had problems. I'll reorganize the partitions today.
Thanks a lot for the help!
jp
I had to do this on a (real) FC2 box. mysql kept creating threads. Might be worth a shot..
-Chris
After adding 128MB more swap and changing the apache and mysql configs, my linode went through two nights without any problems. The logs show no process being killed.
BTW, it seems like switching to 2.6 was a good idea too (with 2.4, I don't remember having anything in the logs telling explicitly what happened).
Thanks!
jp