OOM assistance for a Linux beginner.
Hope you can help. My server technician is away until October, and I noticed today that our server had gone down. I raised a ticket after rebooting the VPS and got a detailed response with a Lish console snippet, but unfortunately it is alien to me.
Essentially, I was told that the server froze because it was OOMing, which I understand to mean that something had been eating the virtual memory. However, I have no idea how to go about diagnosing the cause, and would be much obliged if someone could give me some step-by-step help with working out what caused it and, hopefully, preventing it from happening again.
Thanks in advance.
David
18 Replies
It's a bit difficult to help diagnose if you don't supply any information, although it does not seem like you are able to. Assuming that you have no knowledge (and now seems like a bad time to learn), I'd suggest upgrading your Linode temporarily, or adding on more memory. The latter would be cheaper for you, plus it can be pro-rated (you can downgrade when your technician gets back and have the remainder of the month reversed). Diagnosing an OOM problem can be a lengthy and difficult procedure.
Here is the information provided to me by support:
Thank you for contacting Linode support. It appears that your Linode was OOMing, meaning something inside your node is consuming all of the available virtual memory. Typically, you can see this for yourself by logging into Lish and viewing the console:
Console Snippet
Out of memory: Kill process 3165 (clamd) score 155 or sacrifice child
Killed process 3165 (clamd) total-vm:220732kB, anon-rss:87628kB, file-rss:64kB
------------[ cut here ]------------
kernel BUG at mm/swapfile.c:2527!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/devices/vbd-51712/block/xvda/removable
Modules linked in:
Pid: 19561, comm: apache2 Not tainted 2.6.39.1-linode34 #1
EIP: 0061:[
EIP is at swap_count_continued+0x176/0x180
EAX: f57bad84 EBX: ed25b680 ECX: f57ba000 EDX: 00000000
ESI: ed3d8240 EDI: 00000080 EBP: 00000d84 ESP: c2619e3c
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0069
Process apache2 (pid: 19561, ti=c2618000 task=ebe9d400 task.ti=c2618000)
Stack:
eb9a3e40 0000cd84 00000040 00000000 c01a9601 0000cd84 ed332d60 00000000
00000000 c01a9868 c261cec0 c019b1aa 00000000 8000001b 0000001b ece2da2c
ed3d0760 b7666fc0 cb5e3c58 00000000 00000000 e5d14025 d2054f80 bb18b00c
Call Trace:
[
Please let me know what other information you need and I'll do my best to provide it.
Log in as root, and run apache2ctl -M | grep mpm - if the output contains "prefork" then open the file /etc/apache2/httpd.conf in an editor and find the section that looks like this:
<IfModule mpm_prefork_module>
    StartServers          5
    MinSpareServers       5
    MaxSpareServers      10
    MaxClients          256
    MaxRequestsPerChild   0
</IfModule>
Change the value of MaxClients from 256 to 15 or so and save the file. Then restart Apache (or simply reboot the server). See other forum threads relating to MaxClients.
Of course, if Apache isn't the problem this won't help.
With regard to the OS, I'm using Linux Ubuntu 10.04 on a Linode 1024. I don't believe anything other than LAMP has been installed, but if someone can give me an outline to getting a server log which is of use to diagnosing the problem, I'd be very grateful.
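For getting something useful out of the server logs, one possible starting point is to pull the OOM killer's "Killed process" lines out of the kernel log, since they name the process that was terminated and how much memory it held. The sketch below is only an illustration: the function name `oom_victims` is made up for this example, and the log location in the usage comment is an assumption (it is commonly `/var/log/kern.log` on Ubuntu, but may differ on your system).

```shell
# Sketch: list the processes the kernel's OOM killer terminated.
# Reads kernel log text on stdin and prints "<process> <anon-rss>" per kill.
# oom_victims is a hypothetical helper name, not a standard tool.
oom_victims() {
    # Matches lines such as:
    #   Killed process 3165 (clamd) total-vm:220732kB, anon-rss:87628kB, file-rss:64kB
    sed -n 's/.*Killed process [0-9]* (\([^)]*\)).*anon-rss:\([0-9]*kB\).*/\1 \2/p'
}

# Example usage (the log path is an assumption and varies by distribution):
#   oom_victims < /var/log/kern.log
#   dmesg | oom_victims
```

Bear in mind that the process that gets killed is not necessarily the one that used up the memory; it is just the one the kernel chose to sacrifice.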
@Vance:
Log in as root, and run apache2ctl -M | grep mpm - if the output contains "prefork" then open the file /etc/apache2/httpd.conf in an editor and find the section that looks like this:
<IfModule mpm_prefork_module>
    StartServers          5
    MinSpareServers       5
    MaxSpareServers      10
    MaxClients          256
    MaxRequestsPerChild   0
</IfModule>
I followed the above steps, ran the command in a terminal and received the output mpm_prefork_module (static). So I logged into the root directory of the server via FTP and located the httpd.conf file, but it is 0 bytes and contains no data.
Edit: I have an apache2.conf in the same directory which does have a similar code snippet as above, but the MaxClients is 150 - should I change that?
@Sienco:
Edit: I have an apache2.conf in the same directory which does have a similar code snippet as above, but the MaxClients is 150 - should I change that?
Yes, you should, and then restart Apache. Then yell at your server admin; they should have done that when they set it up.
I don't know if it's yet time to yell at the admin - we're not even sure if Apache was the problem. It's possible that 150 was chosen after careful consideration of the web application resource needs (though admittedly this seems unlikely).
What you can do to monitor the situation is to log in to the server, run vmstat 10 and monitor the output. Every 10 seconds it will print a line of statistics. You want to look at the "si" and "so" values in the swap column. If these are both zero the vast majority of the time, then you aren't experiencing a shortage of memory. You can end the vmstat program by pressing Ctrl-C.
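If watching the columns by eye is tedious, the same check can be sketched as a small filter. This is only an illustration: `swap_activity` is a made-up helper name, and it assumes the column layout shown by `vmstat` on this kind of system, where "si" and "so" are the 7th and 8th fields.

```shell
# Sketch: count how many vmstat samples show swap-in or swap-out activity.
# swap_activity is a hypothetical helper; it reads vmstat output on stdin.
swap_activity() {
    # Header lines start with text, data lines with a number, so only
    # numeric rows are counted. Fields 7 and 8 are si and so.
    awk '$1 ~ /^[0-9]+$/ { n++; if ($7 > 0 || $8 > 0) busy++ }
         END { printf "%d of %d samples showed swap activity\n", busy, n }'
}

# Example usage: take 30 samples, 10 seconds apart, then summarise.
#   vmstat 10 30 | swap_activity
```

Note that the first line vmstat prints is an average since boot rather than a live sample, so a non-zero value there on its own is not a cause for concern.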
@obs:
Just to note 150 is the default on Ubuntu/Debian so I'd say it's not been touched.
Exactly. 150 is the default and it's way too high for your average linode 1024 running php.
@glg:
@obs:Just to note 150 is the default on Ubuntu/Debian so I'd say it's not been touched.
Exactly. 150 is the default and it's way too high for your average linode 1024 running php.
What if you got Slashdotted or HN'ed? Just curious how many would be needed for a massive spike like that.
I'd recommend dropping the MaxClients down to something like 15. Then watch. Keep tabs on memory usage, and use some sort of monitoring to get an idea of request response times on your site. You'd then want to adjust as needed.
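As a rough way of turning "adjust as needed" into a number, you can divide the memory you are willing to give Apache by the average size of one Apache process. The sketch below is an illustration under assumptions: `estimate_max_clients` is a made-up helper name, the 600 MB budget in the usage comment is an arbitrary example figure (leave room for MySQL and everything else on the box), and the process name `apache2` is taken from the console output earlier in the thread.

```shell
# Sketch: estimate a MaxClients value from the average Apache process size.
# estimate_max_clients is a hypothetical helper name.
#   $1    : memory budget for Apache in MB (a figure you choose)
#   stdin : one resident set size in kB per line
estimate_max_clients() {
    awk -v budget_mb="$1" '
        NF { total += $1; n++ }     # skip blank lines, sum the rest
        END {
            if (n == 0) { print "no processes found"; exit 1 }
            avg_kb = total / n
            printf "%d\n", int(budget_mb * 1024 / avg_kb)
        }'
}

# Example usage (600 is an assumed budget in MB, not a recommendation):
#   ps -C apache2 -o rss= | estimate_max_clients 600
```

RSS counts memory shared between Apache processes once per process, so this tends to overestimate per-process usage and give a cautious figure. Treat the result as a starting point and keep watching memory usage afterwards.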
@jebblue:
@glg:
@obs:Just to note 150 is the default on Ubuntu/Debian so I'd say it's not been touched.
Exactly. 150 is the default and it's way too high for your average linode 1024 running php.
What if you got Slashdotted or HN'ed? Just curious how many would be needed for a massive spike like that.
Having it too high (and 150 is usually too high) will just cause an OOM when traffic spikes, not help.
@Vance:
What you can do to monitor the situation is to log in to the server, run vmstat 10 and monitor the output. Every 10 seconds it will print a line of statistics. You want to look at the "si" and "so" values in the swap column. If these are both zero the vast majority of the time, then you aren't experiencing a shortage of memory.
Thanks for this suggestion. I logged into the server via Terminal on my Mac and ran the vmstat 10 command as directed. The below is what I received:
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy  id wa
 0  0 117252 188720  32328 379172    0    0     1     2    2    5  0  0  99  0
 0  0 117252 188652  32336 379172    0    0     0     3  103   63  0  0 100  0
 0  0 117252 188652  32344 379172    0    0     0     2  104   64  0  0 100  0
 1  0 117252 163676  32440 375016    0    0     0  1656 1021  272  3  0  93  0
 1  0 117252 127824  32496 379176    0    0     0  3145 1152  185  4  1  91  0
 0  0 117252  65560  32512 379176    0    0     0    30 1265  165  3  2  93  0
 0  0 117252 163904  32548 379180    0    0     1    38  529  143  1  0  98  0
 1  0 117252 150200  32588 379172    0    0     1    45  417  131  1  0  98  0
 0  0 117252 153972  32596 379188    0    0     0     5  116   65  0  0 100  0
 0  0 117252 154012  32596 379188    0    0     0    25  114   66  0  0 100  0
 0  0 117252 154012  32604 379188    0    0     0    13  106   64  0  0 100  0
 0  0 117252 154012  32612 379188    0    0     0     3  112   66  0  0 100  0
 0  0 117252 154020  32620 379188    0    0     0     4  104   65  0  0 100  0
 0  0 117252 154136  32628 379188    0    0     0     2  122   63  0  0 100  0
 0  0 117252 154144  32636 379188    0    0     0     8  102   64  0  0 100  0
 0  0 117252 154268  32636 379188    0    0     0     2  111   66  0  0 100  0
Both si & so were 0 the whole time. So does this look like Apache was the cause and selecting fewer MaxClients has resolved it, or is it too difficult to be sure with such a small sample of stats?
You can try simulating how your web site will perform under load using a tool like ab (ApacheBench), while watching vmstat in another terminal.
If you see zero swap activity under load, then you're likely safe from Apache causing an out-of-memory problem.
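A load test along those lines might look like the sketch below. It is only an illustration: `load_test` is a made-up wrapper name, the URL and the request counts in the usage comment are placeholder assumptions, and the only `ab` options used are `-n` (total requests) and `-c` (concurrent requests).

```shell
# Sketch: run ApacheBench against a URL with a chosen level of concurrency.
# load_test is a hypothetical wrapper name.
#   $1 : URL to request (must include a path, e.g. a trailing slash)
#   $2 : total number of requests (default 500)
#   $3 : number of concurrent requests (default 15)
load_test() {
    ab -n "${2:-500}" -c "${3:-15}" "$1"
}

# Example usage (URL and numbers are placeholders):
#   load_test http://localhost/ 1000 15
```

Running it from the server itself also uses the server's own CPU and memory, so results from a separate machine will be more representative. Setting the concurrency at or slightly above your MaxClients value shows how the server behaves when every worker is busy.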
Then again, if your machine is OOMing then you don't need vmstat to tell you that you have a problem. At that point, ps auwx is probably more useful to show what processes are consuming memory.
For me, clamd is currently using ~300 MB of memory…
clamav 1932 0.1 14.6 362836 301104 ? Ssl Aug01 109:19 clamd
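To pick the biggest memory users out of the ps auwx output rather than scanning it by eye, something like the following sketch could be used. `top_by_rss` is a made-up helper name, and it assumes the usual `ps auwx` column order, where the 6th column is RSS in kB and the 11th is the start of the command.

```shell
# Sketch: show the processes with the largest resident memory (RSS).
# top_by_rss is a hypothetical helper name.
#   stdin : output of `ps auwx`
#   $1    : how many rows to show (default 5)
top_by_rss() {
    # Skip the header row, print "RSS-in-kB command", largest first.
    awk 'NR > 1 { printf "%d %s\n", $6, $11 }' | sort -rn | head -n "${1:-5}"
}

# Example usage:
#   ps auwx | top_by_rss 10
```

Only the first word of each command is printed, so many Apache workers will appear as separate lines with the same name; add their figures together to see what Apache uses as a whole.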