running out of memory
I encountered a strange event - this morning I found the server totally unresponsive to tickling on any port. From the web I managed to reboot it, and then looked in the /var/log/messages. Here is what I saw:
> Sep 2 03:27:46 li6-184 kernel: __alloc_pages: 0-order allocation failed (gfp=0xf0/0)
This message is repeated many times, and the "last message repeated" notice is itself repeated.
So it looks like something leaked badly. Is there any way to find out what caused it or which process leaked, or to get any insight into this at all? Any help is greatly appreciated.
11 Replies
Another interesting symptom is that a while back I was compiling stuff with gcc (just compiling my own little programs), and gcc ran out of memory. I issued the command again, and it was fine. All the while I was running TOP in another session to see what's going on - and no process seemed to be misbehaving.
I think there is some kind of interference from other users of the same host (host20, btw), or the host itself.
> I think there is some kind of interference from other users of the same host (host20, btw), or the host itself.
The only possible "interference" (in the absence of some as-yet-undiscovered bug in UML) is that the host VM system might swap out some part of your Linode's memory. This would only slow down your Linode, not break it, and this has not proved to be a problem thus far. The Linode hosts always have $linodeRam * $numLinodes of physical memory available (or more) and swapping out only occurs if the host VMM thinks that the physical memory would be better used as disk cache.
Your error message looks like the kernel is trying to allocate a memory page for its own use but the allocation fails, which means that either physical memory and swap are both completely full, or all non-kernel physical pages have been marked non-swappable. (It's a single page allocation, so page contiguity problems won't have any effect.)
What size Linode do you have, and how big is your swap space?
How frequently does the problem occur?
Please post the output of cat /proc/meminfo for us to look at.
My linode is Linode128 (li6-184), hosted on host20.
Now that you mention slowing down - I have been experiencing this ever since I signed up a few months ago. Sometimes the server is behaving ok (fast to respond), but sometimes it is verrry slow (i.e. shells are extremely slow, POP access is slow or times out, etc.)
I had the same problem again - this time my sendmail service has died (again, RedHat 9, large, vanilla stuff). I restarted the service this morning, and then looked in /var/log/messages - sure enough, I saw the same allocation problems listed.
The output of /proc/meminfo is this:
total: used: free: shared: buffers: cached:
Mem: 126611456 122982400 3629056 0 9756672 11370496
Swap: 269475840 269475840 0
MemTotal: 123644 kB
MemFree: 3544 kB
MemShared: 0 kB
Buffers: 9528 kB
Cached: 9892 kB
SwapCached: 1212 kB
Active: 20632 kB
Inactive: 93516 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 123644 kB
LowFree: 3544 kB
SwapTotal: 263160 kB
SwapFree: 0 kB
I will try to continue monitoring this and get the output when the system is dead. But I don't think it is going to help much - it looks like the condition "comes and goes", and I really have to do something tricky to actually catch it.
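One way to catch a condition that comes and goes is to log the numbers regularly and look at them afterwards. Here is a minimal sketch in Python that parses /proc/meminfo-style text and reports how much of the swap is in use. The function names are made up for illustration, and the sample text is just four lines copied from the output above. It assumes the "Name: value kB" layout shown there and skips the older "Mem:" and "Swap:" summary rows.

```python
import re


def parse_meminfo(text):
    """Return {field_name: kilobytes} for lines shaped like 'MemFree: 3544 kB'.

    Lines without a trailing 'kB' (such as the byte-count summary rows that
    older kernels print first) are skipped.
    """
    fields = {}
    for line in text.splitlines():
        match = re.match(r"^(\w+):\s+(\d+) kB\s*$", line)
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields


def swap_used_fraction(fields):
    """Fraction of swap in use, from 0.0 (none) to 1.0 (all of it)."""
    total = fields.get("SwapTotal", 0)
    if total == 0:
        return 0.0  # no swap configured
    return (total - fields["SwapFree"]) / total


# Sample lines taken from the meminfo output posted above.
SAMPLE = """\
MemTotal: 123644 kB
MemFree: 3544 kB
SwapTotal: 263160 kB
SwapFree: 0 kB
"""

info = parse_meminfo(SAMPLE)
print("MemFree kB:", info["MemFree"])
print("swap used fraction:", swap_used_fraction(info))
```

On a live box the same functions could be fed the contents of /proc/meminfo from a cron job, appending one line per run to a log file so the state just before a lock-up is still on disk after the reboot.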
Thanks a lot for your help - I really want my Linode account to work properly. I think that Linode hosting is designed in a very "right" way, and all these linux quirks just have to be rooted out once and for all.
ps -e -o pid,cmd,%mem,rss,trs,sz,vsz
to see if it shows up who ate all the swap.
ps behaviour depends on how your environment is set up, so those format descriptor codes might need changing (it's all on the man page but it really makes my brain hurt).
22:07:05 up 3 days, 13:23, 1 user, load average: 0.05, 0.04, 0.00
48 processes: 47 sleeping, 1 running, 0 zombie, 0 stopped
CPU states: 0.3% user 0.1% system 0.0% nice 0.0% iowait 99.4% idle
Mem: 123644k av, 120156k used, 3488k free, 0k shrd, 2888k buff
22576k active, 91812k inactive
Swap: 263160k av, 263156k used, 4k free 9868k cached
25026 root 16 0 1184 1184 884 R 0.3 0.9 0:00 0 top
25027 root 10 0 2900 2712 2160 S 0.1 2.1 0:00 0 sendmail
1 root 8 0 472 444 424 S 0.0 0.3 0:00 0 init
2 root 9 0 0 0 0 SW 0.0 0.0 0:05 0 keventd
3 root 19 19 0 0 0 SWN 0.0 0.0 0:00 0 ksoftirqd_CPU0
4 root 10 0 0 0 0 SW 0.0 0.0 7:27 0 kswapd
5 root 9 0 0 0 0 SW 0.0 0.0 0:00 0 bdflush
6 root 9 0 0 0 0 SW 0.0 0.0 0:00 0 kupdated
7 root 9 0 0 0 0 SW 0.0 0.0 0:00 0 jfsIO
8 root 9 0 0 0 0 SW 0.0 0.0 0:00 0 jfsCommit
9 root 9 0 0 0 0 SW 0.0 0.0 0:00 0 jfsSync
10 root 9 0 0 0 0 SW 0.0 0.0 0:00 0 xfsbufd
11 root 9 0 0 0 0 SW 0.0 0.0 0:00 0 xfslogd/0
12 root 9 0 0 0 0 SW 0.0 0.0 0:00 0 xfsdatad/0
13 root 18446744073709551615 -20 0 0 0 SW< 0.0 0.0 0:00 0 mdrecoveryd
14 root 9 0 0 0 0 SW 0.0 0.0 0:07 0 kjournald
812 root 8 0 904 728 628 S 0.0 0.5 0:00 0 dhclient
862 root 9 0 572 516 480 S 0.0 0.4 0:04 0 syslogd
866 root 9 0 448 436 396 S 0.0 0.3 0:00 0 klogd
911 root 9 0 788 648 556 S 0.0 0.5 0:02 0 sshd
921 root 8 0 696 640 572 S 0.0 0.5 0:00 0 xinetd
957 root 9 0 3276 796 772 S 0.0 0.6 5:07 0 httpd
966 root 9 0 532 508 468 S 0.0 0.4 0:00 0 crond
984 daemon 9 0 524 508 468 S 0.0 0.4 0:00 0 atd
990 root 9 0 392 348 344 S 0.0 0.2 0:00 0 mingetty
6774 root 10 0 2028 1536 1416 S 0.0 1.2 0:01 0 sendmail
6782 smmsp 9 0 1844 1408 1312 S 0.0 1.1 0:00 0 sendmail
24803 leo 9 0 1452 1452 1128 S 0.0 1.1 0:00 0 bash
24839 root 9 0 972 972 804 S 0.0 0.7 0:00 0 su
24840 root 10 0 1472 1472 1140 S 0.0 1.1 0:00 0 bash
24965 root 9 0 2908 2724 2160 S 0.0 2.2 0:00 0 sendmail
24985 root 9 0 2904 2716 2164 S 0.0 2.1 0:00 0 sendmail
25001 root 9 0 2896 2708 2156 S 0.0 2.1 0:00 0 sendmail
25003 root 9 0 2596 2284 1968 S 0.0 1.8 0:00 0 sendmail
25006 root 9 0 2900 2712 2160 S 0.0 2.1 0:00 0 sendmail
and here is the output of ps with the recommended switches:
[root@li6-184 root]# ps -e -o pid,cmd,%mem,rss,trs,sz,vsz
1 init [3] 0.3 444 23 347 1388
2 [keventd] 0.0 0 0 0 0
3 [ksoftirqd_CPU0] 0.0 0 0 0 0
4 [kswapd] 0.0 0 0 0 0
5 [bdflush] 0.0 0 0 0 0
6 [kupdated] 0.0 0 0 0 0
7 [jfsIO] 0.0 0 0 0 0
8 [jfsCommit] 0.0 0 0 0 0
9 [jfsSync] 0.0 0 0 0 0
10 [xfsbufd] 0.0 0 0 0 0
11 [xfslogd/0] 0.0 0 0 0 0
12 [xfsdatad/0] 0.0 0 0 0 0
13 [mdrecoveryd] 0.0 0 0 0 0
14 [kjournald] 0.0 0 0 0 0
812 /sbin/dhclient - 0.5 728 314 498 1992
862 syslogd -m 0 0.4 516 24 389 1556
866 klogd -x 0.3 436 18 347 1388
911 /usr/sbin/sshd 0.5 648 265 880 3520
921 xinetd -stayaliv 0.5 672 129 512 2048
957 /usr/sbin/httpd 0.6 796 289 4819 19276
966 crond 0.4 508 19 360 1440
984 /usr/sbin/atd 0.4 508 12 357 1428
990 /sbin/mingetty t 0.2 348 6 342 1368
27321 /usr/sbin/httpd 9.1 11272 289 28557 114228
27322 /usr/sbin/httpd 8.2 10140 289 20617 82468
27323 /usr/sbin/httpd 10.0 12424 289 26167 104668
27325 /usr/sbin/httpd 8.8 10908 289 28056 112224
27326 /usr/sbin/httpd 5.9 7388 289 19985 79940
27327 /usr/sbin/httpd 14.4 17848 289 21912 87648
32327 /usr/sbin/httpd 6.5 8144 289 19750 79000
12585 /usr/sbin/httpd 7.4 9172 289 10063 40252
2942 /usr/sbin/httpd 13.5 16792 289 20417 81668
6774 sendmail: accept 1.2 1536 635 1557 6228
6782 sendmail: Queue 1.1 1408 635 1506 6024
23133 /usr/sbin/httpd 1.8 2228 289 5129 20516
23165 /usr/sbin/httpd 1.6 2064 289 4863 19452
24799 /usr/sbin/sshd 1.3 1700 265 1694 6776
24802 /usr/sbin/sshd 1.5 1952 265 1702 6808
24803 -bash 1.1 1452 588 1092 4368
24839 su - 0.7 972 16 1027 4108
24840 -bash 1.1 1472 588 1094 4376
25001 sendmail: ./i83B 2.1 2708 635 1793 7172
25006 sendmail: ./i838 2.1 2716 635 1793 7172
25030 sendmail: ./i83J 2.0 2580 635 1719 6876
25034 sendmail: ./i85H 2.2 2724 635 1794 7176
25036 sendmail: ./i87G 2.1 2664 635 1793 7172
25038 ps -e -o pid,cmd 0.5 700 66 663 2652
One points to sendmail, the other - to httpd.
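To compare the two suspects by numbers rather than by eye, the ps listing can be totalled per command. This is a rough sketch in Python. The function name is hypothetical, and the sample rows are a handful of lines copied from the ps output above. It assumes the column order pid,cmd,%mem,rss,trs,sz,vsz used there. Summed RSS counts shared pages once per process and ignores anything already swapped out, so treat the totals as a rough ranking and not an exact accounting.

```python
from collections import defaultdict


def rss_by_command(ps_output):
    """Sum the RSS column (kB) per command name, largest total first.

    Expects lines from `ps -e -o pid,cmd,%mem,rss,trs,sz,vsz`. The cmd
    column may contain spaces, so the numeric columns are read from the
    right-hand end of each line.
    """
    totals = defaultdict(int)
    for line in ps_output.splitlines():
        parts = line.split()
        if len(parts) < 7 or not parts[0].isdigit():
            continue  # header line or something unexpected
        rss = parts[-4]  # columns from the right: vsz, sz, trs, rss
        if not rss.isdigit():
            continue
        command = parts[1].rstrip(":")  # "sendmail:" becomes "sendmail"
        totals[command] += int(rss)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# A few rows copied from the ps listing above.
SAMPLE = """\
2 [keventd] 0.0 0 0 0 0
27321 /usr/sbin/httpd 9.1 11272 289 28557 114228
27327 /usr/sbin/httpd 14.4 17848 289 21912 87648
6774 sendmail: accept 1.2 1536 635 1557 6228
25001 sendmail: ./i83B 2.1 2708 635 1793 7172
"""

for command, rss_kb in rss_by_command(SAMPLE):
    print(command, rss_kb)
```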
As far as sendmail goes, I run my own software for a mailing list (average of about 10 messages a day, roughly 60 subscribers, roughly 4-5 bounces per message). The software (pechkin_dispatcher) simply receives the message, tweaks some headers, pipes, forks, launches sendmail in the child process, and then feeds it the necessary message. Sendmail is launched in a queuing mode. I also receive a significant amount of spam through linode (old domain name), and my e-mail client bounces whatever SpamPal detects as spam (which accounts for some messages always present in the pending queue and not being flushed for a while due to bad address or whatever). However, all of this is pretty "normal" - I have been using this stuff on another provider with absolutely no sweat. I also had a bunch of sendmails running at a time trying to clear the queue, but overall my traffic is not really industrial.
The webpages served by apache are not very popular (except when robots start crawling around). Pretty much everything is a CGI script compiled from C/C++ that launches two or three processes piped (I need this to convert charsets and stuff). Again, this stuff is pretty much "standard" - I've been using this software for quite a while, written it myself. Now, if I ever had problems, it was with one of those processes launched (i.e. my programs), not really httpd being stuck or anything.
So… I'm at a loss so far. The only guess I can venture at this point that does not involve a linode conspiracy has to do with the log sizes for sendmail (/var/log/maillog in the order of 50-80 Mb) and apache (/var/log/httpd/access_log in the order of 1-2 Mb). I don't think those should be a problem, but what the hell…
I know that the rule of thumb is swap = 2 * physical, but I suggest that you increase your swap size. My argument in favour of this is that a 'real' machine doing this kind of work would have more physical memory (and more swap, probably).
If increasing the swap size fixes the problem, then fine - you were just trying to do too much for the available memory. If Apache eats all the expanded swap too, then there really is a problem, and we need to think again.
Or maybe there is a Linode conspiracy - remember: "Just 'cuz you're paranoid, it don't mean they ain't out ta getcha"
PID CMD %MEM RSS TRS SZ VSZ
27321 /usr/sbin/httpd 9.1 11272 289 28557 114228
27322 /usr/sbin/httpd 8.2 10140 289 20617 82468
27323 /usr/sbin/httpd 10.0 12424 289 26167 104668
27325 /usr/sbin/httpd 8.8 10908 289 28056 112224
27326 /usr/sbin/httpd 5.9 7388 289 19985 79940
27327 /usr/sbin/httpd 14.4 17848 289 21912 87648
32327 /usr/sbin/httpd 6.5 8144 289 19750 79000
12585 /usr/sbin/httpd 7.4 9172 289 10063 40252
2942 /usr/sbin/httpd 13.5 16792 289 20417 81668
23133 /usr/sbin/httpd 1.8 2228 289 5129 20516
23165 /usr/sbin/httpd 1.6 2064 289 4863 19452
Wow, those httpd processes are taking up a lot of memory. On my home Linux machine my processes are a fraction of that size.
I recommend you check your httpd.conf and remove any modules you have in there which you don't need. Maybe you have php4 or MySQL or other modules loaded into the Apache instance that you're not using. I dunno.
Also, if your site is low usage you may want to reduce the number of server processes (MinSpareServers, MaxSpareServers, StartServers, MaxClients).
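As a back-of-the-envelope way to pick a MaxClients value, divide the memory you are willing to give Apache by the size of one busy child. The sketch below is in Python. The function name and the 30000 kB reserve for the kernel and other daemons are assumptions for illustration, not figures from Apache's documentation. The 123644 kB total and the 17848 kB child are taken from the output earlier in the thread.

```python
def estimate_max_clients(total_ram_kb, reserved_kb, per_child_rss_kb):
    """Rough upper bound on Apache children that fit in RAM without swapping.

    total_ram_kb     -- MemTotal for the machine
    reserved_kb      -- memory kept back for the kernel, sendmail, sshd, etc.
                        (a guess; measure it on your own system)
    per_child_rss_kb -- RSS of a typical busy httpd child
    """
    if per_child_rss_kb <= 0:
        raise ValueError("per_child_rss_kb must be positive")
    usable_kb = total_ram_kb - reserved_kb
    # Always allow at least one child, even if the estimate comes out below 1.
    return max(1, usable_kb // per_child_rss_kb)


# MemTotal and the largest httpd RSS from the thread; the reserve is assumed.
print(estimate_max_clients(123644, 30000, 17848))
```

RSS includes pages shared between children, so the real limit is probably somewhat higher than this gives. Starting low and raising it while watching swap usage is the safer direction.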
The system seems to be faster now. The dumps show free swap (almost all of it) and some free memory. I think I am on the road to recovery. If I may, let me say just a couple more things:
1. Would this memory exhaustion be cause for the Linode system to be very slow at times, or should I still suspect some other mischief or interference? In other words, do Linodes have CPU spikes?
2. Are there other problems that may be explained by this? I am experiencing several nuisances (such as SMTP on my laptop not connecting to server while downloading incoming, say, 100 messages… maybe it's a descriptor count issue or smth…)
3. The memory usage of HTTPD processes does start off with 2.0 units, but it grows as HTTPD actually works. I guess it pools memory or smth. Should I worry about this also, and are there methods of fighting this? The output of that magic ps command is:
4124 /usr/sbin/httpd 1.3 1640 289 1576 6304
4127 /usr/sbin/httpd 4.8 6048 289 5854 23416
4128 /usr/sbin/httpd 2.2 2800 289 3972 15888
4129 /usr/sbin/httpd 1.9 2412 289 5854 23416
4130 /usr/sbin/httpd 3.4 4212 289 3462 13848
4131 /usr/sbin/httpd 5.5 6872 289 4267 17068
4132 /usr/sbin/httpd 5.8 7212 289 3961 15844
4133 /usr/sbin/httpd 4.6 5796 289 5048 20192
4134 /usr/sbin/httpd 2.0 2532 289 1648 6592
4153 /usr/sbin/httpd 6.2 7724 289 4171 16684
4154 /usr/sbin/httpd 7.2 8940 289 4760 19040
4155 /usr/sbin/httpd 2.0 2500 289 3432 13728
4233 /usr/sbin/httpd 24.4 30184 289 15015 60060
4. I found a nice link that seems to be related to the issues that I had been experiencing:
5. This page also suggests that Linux uses all remaining RAM for caching, which means that an estimate of "free" memory has to include the cache. My output of the free command is:
total used free shared buffers cached
Mem: 123644 120224 3420 0 33504 9512
-/+ buffers/cache: 77208 46436
Swap: 263160 35860 227300
Am I to understand that my "free" memory is about 46MB?
Thanks A MILLION to everybody who has taken the time to help me out with this. This is by far my best tech support experience.
> 1. Would this memory exhaustion be cause for the Linode system to be very slow at times
Yes, especially given how much you were swapping. For a program to wake up, the kernel potentially had to write another program's data out to swap, then load the waking program's pages back in before it could run. That can take 100 times longer than if you weren't swapping.
> 2. Are there other problems that may be explained by this? I am experiencing several nuisances (such as SMTP on my laptop not connecting to server while downloading incoming, say, 100 messages… maybe it's a descriptor count issue or smth…)
If you run out of virtual memory totally then new programs may fail to start. When you connect to the SMTP server, it typically copies itself (known as "forking"; essentially it creates a new process) and you talk to that copy, leaving the original free to accept other connections. If the fork fails then your connection will be aborted.
> 3. The memory usage of HTTPD processes does start off with 2.0 units, but it grows as HTTPD actually works. I guess it pools memory or smth. Should I worry about this also, and are there methods of fighting this? The output of that magic ps command is:
Programs do grow in size as they load data, or if you load a module (e.g. if you actually use mod_perl then this will take up more memory). Normally a program will grow to a maximum size and stay there, unless there is a "memory leak" bug. Part of OS tuning is working out the maximum working set size of your applications.
> 5. This page also suggests that Linux uses all remaining RAM for caching, which means that an estimate of "free" memory has to include the cache. My output of the free command is:
> total used free shared buffers cached
> Mem: 123644 120224 3420 0 33504 9512
> -/+ buffers/cache: 77208 46436
> Swap: 263160 35860 227300
> Am I to understand that my "free" memory is about 46MB?
Sort of. 35MB of swap space has been used, so at some point in time programs needed more memory than you have. Those programs are still swapped out, so I'm guessing they're sleeping processes not doing much and may just be woken up if something unusual happens. This is fine.
The rest of your programs have fit into 77MB of memory, allowing the OS to use the other 46MB for disk caching and buffers. If you started a small program (e.g. 5MB) then it would use that cache/buffer space. So, for practical purposes, yes, your "free" memory is sort of 46MB.
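The arithmetic behind the "-/+ buffers/cache" row can be checked by hand. Here is a small Python sketch (the function name is made up) that reproduces that row from the "Mem:" row, assuming the older `free` layout quoted above, where buffers and cached are reported as separate columns.

```python
def buffers_cache_row(used_kb, free_kb, buffers_kb, cached_kb):
    """Recompute the '-/+ buffers/cache' row of `free` from the 'Mem:' row.

    Returns (used_by_programs_kb, available_to_programs_kb): buffers and
    cache are subtracted from 'used' and added to 'free', because the
    kernel can hand that memory back to programs when they need it.
    """
    used_by_programs = used_kb - buffers_kb - cached_kb
    available_to_programs = free_kb + buffers_kb + cached_kb
    return used_by_programs, available_to_programs


# The 'Mem:' row quoted above: used, free, buffers, cached (all in kB).
print(buffers_cache_row(120224, 3420, 33504, 9512))  # (77208, 46436)
```

Both results match the second row of the quoted output, which is where the 77MB and 46MB figures come from.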
This is my linode:
total used free shared buffers cached
Mem: 60020 56716 3304 0 6640 11776
-/+ buffers/cache: 38300 21720
Swap: 263160 2248 260912
As you can see, my linode is doing similar; it needed some small amount of swap in the past, but everything is now fitting nicely in memory and my linode is healthy.
Your linode is a lot healthier than it was before you played with the httpd configuration.