running out of memory

Hello,

I encountered a strange event - this morning I found the server totally unresponsive to tickling on any port. From the web I managed to reboot it, and then looked in /var/log/messages. Here is what I saw:

> Sep 2 03:27:46 li6-184 kernel: _allocpages: 0-order allocation failed (gfp=0xf0/0)

This message is repeated a bunch of times, plus the "last message repeated N times" line about it is itself repeated.

So, it looks like something leaked in a bad way. Is there any way to find out what caused this or which process leaked, or to get any insight into this at all? Any help is greatly appreciated.

11 Replies

What are you running? Anything experimental/testing/compiled out of CVS?

I am running RedHat 9 (large) distro, with vanilla stuff - sendmail, apache, sshd… Nothing custom installed.

Another interesting symptom is that a while back I was compiling stuff with gcc (just compiling my own little programs), and gcc ran out of memory. I issued the command again, and it was fine. All the while I was running top in another session to see what was going on - and no process seemed to be misbehaving.

I think there is some kind of interference from other users of the same host (host20, btw), or the host itself.

@Orca:

I think there is some kind of interference from other users of the same host (host20, btw), or the host itself.
The only possible "interference" (in the absence of some as-yet-undiscovered bug in UML) is that the host VM system might swap out some part of your Linode's memory. This would only slow down your Linode, not break it, and this has not proved to be a problem thus far. The Linode hosts always have $linodeRam * $numLinodes of physical memory available (or more), and swapping out only occurs if the host VMM thinks that the physical memory would be better used as disk cache.

Your error message looks like the kernel is trying to allocate a memory page for its own use but the allocation fails, which means that either physical memory and swap are both completely full, or all non-kernel physical pages have been marked non-swappable. (It's a single-page allocation, so page-contiguity problems won't have any effect.)

What size Linode do you have, and how big is your swap space?

How frequently does the problem occur?

Please post the output of cat /proc/meminfo for us to look at.

Sorry for not responding for a few days - wanted to monitor the system.

My linode is Linode128 (li6-184), hosted on host20.

Now that you mention slowing down - I have been experiencing this ever since I signed up a few months ago. Sometimes the server is behaving ok (fast to respond), but sometimes it is verrry slow (i.e. shells are extremely slow, POP access is slow or times out, etc.)

I had the same problem again - this time my sendmail service has died (again, RedHat 9, large, vanilla stuff). I restarted the service this morning, and then looked in /var/log/messages - sure enough, I saw the same allocation problems listed.

The output of /proc/meminfo is this:

        total:    used:    free:  shared: buffers:  cached:
Mem:  126611456 122982400  3629056        0  9756672 11370496
Swap: 269475840 269475840        0
MemTotal:       123644 kB
MemFree:          3544 kB
MemShared:           0 kB
Buffers:          9528 kB
Cached:           9892 kB
SwapCached:       1212 kB
Active:          20632 kB
Inactive:        93516 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:       123644 kB
LowFree:          3544 kB
SwapTotal:      263160 kB
SwapFree:            0 kB

I will try to continue monitoring this and get the output when the system is dead. But I don't think it is going to help much - it looks like the condition "comes and goes", and I really have to do something tricky to actually catch it.

Thanks a lot for your help - I really want my Linode account to work properly. I think that Linode hosting is designed in a very "right" way, and all these linux quirks just have to be rooted out once and for all.

Your cat /proc/meminfo shows 3544 kB of physical memory free and absolutely no swap free at all. I have never seen my Linode 64 consume anything like all the available swap. Try ps -e -o pid,cmd,%mem,rss,trs,sz,vsz to see if it shows who ate all the swap.

ps behaviour depends on how your environment is set up, so those format descriptor codes might need changing (it's all on the man page but it really makes my brain hurt).

Linode Staff

Another low-tech method is to run "top", and then hit "M" (note: capital M) to sort by virtual memory used…

-Chris

Here is the output of top after hitting M:

 22:07:05  up 3 days, 13:23,  1 user,  load average: 0.05, 0.04, 0.00
48 processes: 47 sleeping, 1 running, 0 zombie, 0 stopped
CPU states:   0.3% user   0.1% system   0.0% nice   0.0% iowait  99.4% idle
Mem:   123644k av,  120156k used,    3488k free,       0k shrd,    2888k buff
        22576k active,              91812k inactive
Swap:  263160k av,  263156k used,       4k free                    9868k cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
25026 root      16   0  1184 1184   884 R     0.3  0.9   0:00   0 top
25027 root      10   0  2900 2712  2160 S     0.1  2.1   0:00   0 sendmail
    1 root       8   0   472  444   424 S     0.0  0.3   0:00   0 init
    2 root       9   0     0    0     0 SW    0.0  0.0   0:05   0 keventd
    3 root      19  19     0    0     0 SWN   0.0  0.0   0:00   0 ksoftirqd_CPU0
    4 root      10   0     0    0     0 SW    0.0  0.0   7:27   0 kswapd
    5 root       9   0     0    0     0 SW    0.0  0.0   0:00   0 bdflush
    6 root       9   0     0    0     0 SW    0.0  0.0   0:00   0 kupdated
    7 root       9   0     0    0     0 SW    0.0  0.0   0:00   0 jfsIO
    8 root       9   0     0    0     0 SW    0.0  0.0   0:00   0 jfsCommit
    9 root       9   0     0    0     0 SW    0.0  0.0   0:00   0 jfsSync
   10 root       9   0     0    0     0 SW    0.0  0.0   0:00   0 xfsbufd
   11 root       9   0     0    0     0 SW    0.0  0.0   0:00   0 xfslogd/0
   12 root       9   0     0    0     0 SW    0.0  0.0   0:00   0 xfsdatad/0
   13 root     18446744073709551615 -20     0    0     0 SW<   0.0  0.0   0:00   0 mdrecoveryd
   14 root       9   0     0    0     0 SW    0.0  0.0   0:07   0 kjournald
  812 root       8   0   904  728   628 S     0.0  0.5   0:00   0 dhclient
  862 root       9   0   572  516   480 S     0.0  0.4   0:04   0 syslogd
  866 root       9   0   448  436   396 S     0.0  0.3   0:00   0 klogd
  911 root       9   0   788  648   556 S     0.0  0.5   0:02   0 sshd
  921 root       8   0   696  640   572 S     0.0  0.5   0:00   0 xinetd
  957 root       9   0  3276  796   772 S     0.0  0.6   5:07   0 httpd
  966 root       9   0   532  508   468 S     0.0  0.4   0:00   0 crond
  984 daemon     9   0   524  508   468 S     0.0  0.4   0:00   0 atd
  990 root       9   0   392  348   344 S     0.0  0.2   0:00   0 mingetty
 6774 root      10   0  2028 1536  1416 S     0.0  1.2   0:01   0 sendmail
 6782 smmsp      9   0  1844 1408  1312 S     0.0  1.1   0:00   0 sendmail
24803 leo        9   0  1452 1452  1128 S     0.0  1.1   0:00   0 bash
24839 root       9   0   972  972   804 S     0.0  0.7   0:00   0 su
24840 root      10   0  1472 1472  1140 S     0.0  1.1   0:00   0 bash
24965 root       9   0  2908 2724  2160 S     0.0  2.2   0:00   0 sendmail
24985 root       9   0  2904 2716  2164 S     0.0  2.1   0:00   0 sendmail
25001 root       9   0  2896 2708  2156 S     0.0  2.1   0:00   0 sendmail
25003 root       9   0  2596 2284  1968 S     0.0  1.8   0:00   0 sendmail
25006 root       9   0  2900 2712  2160 S     0.0  2.1   0:00   0 sendmail

and here is the output of ps with the recommended switches:

[root@li6-184 root]# ps -e -o pid,cmd,%mem,rss,trs,sz,vsz
  PID CMD              %MEM  RSS  TRS    SZ   VSZ
    1 init [3]          0.3  444   23   347  1388
    2 [keventd]         0.0    0    0     0     0
    3 [ksoftirqd_CPU0]  0.0    0    0     0     0
    4 [kswapd]          0.0    0    0     0     0
    5 [bdflush]         0.0    0    0     0     0
    6 [kupdated]        0.0    0    0     0     0
    7 [jfsIO]           0.0    0    0     0     0
    8 [jfsCommit]       0.0    0    0     0     0
    9 [jfsSync]         0.0    0    0     0     0
   10 [xfsbufd]         0.0    0    0     0     0
   11 [xfslogd/0]       0.0    0    0     0     0
   12 [xfsdatad/0]      0.0    0    0     0     0
   13 [mdrecoveryd]     0.0    0    0     0     0
   14 [kjournald]       0.0    0    0     0     0
  812 /sbin/dhclient -  0.5  728  314   498  1992
  862 syslogd -m 0      0.4  516   24   389  1556
  866 klogd -x          0.3  436   18   347  1388
  911 /usr/sbin/sshd    0.5  648  265   880  3520
  921 xinetd -stayaliv  0.5  672  129   512  2048
  957 /usr/sbin/httpd   0.6  796  289  4819 19276
  966 crond             0.4  508   19   360  1440
  984 /usr/sbin/atd     0.4  508   12   357  1428
  990 /sbin/mingetty t  0.2  348    6   342  1368
27321 /usr/sbin/httpd   9.1 11272 289 28557 114228
27322 /usr/sbin/httpd   8.2 10140 289 20617 82468
27323 /usr/sbin/httpd  10.0 12424 289 26167 104668
27325 /usr/sbin/httpd   8.8 10908 289 28056 112224
27326 /usr/sbin/httpd   5.9 7388  289 19985 79940
27327 /usr/sbin/httpd  14.4 17848 289 21912 87648
32327 /usr/sbin/httpd   6.5 8144  289 19750 79000
12585 /usr/sbin/httpd   7.4 9172  289 10063 40252
 2942 /usr/sbin/httpd  13.5 16792 289 20417 81668
 6774 sendmail: accept  1.2 1536  635  1557  6228
 6782 sendmail: Queue   1.1 1408  635  1506  6024
23133 /usr/sbin/httpd   1.8 2228  289  5129 20516
23165 /usr/sbin/httpd   1.6 2064  289  4863 19452
24799 /usr/sbin/sshd    1.3 1700  265  1694  6776
24802 /usr/sbin/sshd    1.5 1952  265  1702  6808
24803 -bash             1.1 1452  588  1092  4368
24839 su -              0.7  972   16  1027  4108
24840 -bash             1.1 1472  588  1094  4376
25001 sendmail: ./i83B  2.1 2708  635  1793  7172
25006 sendmail: ./i838  2.1 2716  635  1793  7172
25030 sendmail: ./i83J  2.0 2580  635  1719  6876
25034 sendmail: ./i85H  2.2 2724  635  1794  7176
25036 sendmail: ./i87G  2.1 2664  635  1793  7172
25038 ps -e -o pid,cmd  0.5  700   66   663  2652

One points to sendmail, the other to httpd.
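To put a number on the httpd side, you can sum the RSS column of the httpd rows from the ps output above (a rough sketch; RSS counts shared pages once per process, so the true footprint is somewhat lower):

```shell
# Sum the RSS column (kB) of the httpd rows quoted in the ps output above.
# RSS double-counts shared pages, so this somewhat overstates the real total.
printf '%s\n' 11272 10140 12424 10908 7388 17848 8144 9172 16792 2228 2064 |
  awk '{ total += $1 } END { print total " kB resident in httpd" }'
# prints "108380 kB resident in httpd"
```

That is roughly 108 MB resident against 123 MB of total RAM, which by itself goes a long way toward explaining the pressure.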

As far as sendmail goes, I run my own software for a mailing list (average of about 10 messages a day, roughly 60 subscribers, roughly 4-5 bounces per message). The software (pechkin_dispatcher) simply receives the message, tweaks some headers, pipes, forks, launches sendmail in the child process, and then feeds it the necessary message. Sendmail is launched in a queuing mode. I also receive a significant amount of spam through linode (old domain name), and my e-mail client bounces whatever SpamPal detects as spam (which accounts for some messages always present in the pending queue and not being flushed for a while due to bad address or whatever). However, all of this is pretty "normal" - I have been using this stuff on another provider with absolutely no sweat. I also had a bunch of sendmails running at a time trying to clear the queue, but overall my traffic is not really industrial.

The webpages served by apache are not very popular (except when robots start crawling around). Pretty much everything is a CGI script compiled from C/C++ that launches two or three processes piped (I need this to convert charsets and stuff). Again, this stuff is pretty much "standard" - I've been using this software for quite a while, written it myself. Now, if I ever had problems, it was with one of those processes launched (i.e. my programs), not really httpd being stuck or anything.

So… I'm at a loss so far. The only guess I can venture at this point that does not involve a Linode conspiracy has to do with the log sizes for sendmail (/var/log/maillog on the order of 50-80 MB) and apache (/var/log/httpd/access_log on the order of 1-2 MB). I don't think those should be a problem, but what the hell…

Some of your Apache processes have pretty big VM footprints. I think that you really are running out of memory, and in the absence of an OOM killer, the next process that tries to allocate memory (and fails) just dies (sendmail, last time). When the kernel fails to allocate memory, the whole Linode stops (whereas with an OOM killer, kernel selected user space processes would have been killed first - and your problem would always have been 'random stuff keeps dying', rather than 'sometimes random stuff dies, sometimes the kernel seizes up').

I know that the rule of thumb is swap = 2 * physical, but I suggest that you increase your swap size. My argument in favour of this is that a 'real' machine doing this kind of work would have more physical memory (and more swap, probably).

If increasing the swap size fixes the problem, then fine - you were just trying to do too much for the available memory. If Apache eats all the expanded swap too, then there really is a problem, and we need to think again.
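A minimal sketch of expanding swap without repartitioning, assuming root access and a hypothetical path /swapfile2 (the kernel accepts swap files as well as swap partitions):

```shell
# Create a 256 MB swap file (path and size are illustrative).
dd if=/dev/zero of=/swapfile2 bs=1024 count=262144
chmod 600 /swapfile2          # swap files should not be world-readable
mkswap /swapfile2             # write the swap signature
swapon /swapfile2             # enable it alongside the existing swap partition
# To make it survive reboots, add a line to /etc/fstab:
#   /swapfile2  none  swap  sw  0  0
```

Afterwards, swapon -s (or cat /proc/swaps) should list both swap areas.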

Or maybe there is a Linode conspiracy - remember: "Just 'cuz you're paranoid, it don't mean they ain't out ta getcha" :D.

@Orca:

  PID CMD              %MEM  RSS  TRS    SZ   VSZ
27321 /usr/sbin/httpd   9.1 11272 289 28557 114228
27322 /usr/sbin/httpd   8.2 10140 289 20617 82468
27323 /usr/sbin/httpd  10.0 12424 289 26167 104668
27325 /usr/sbin/httpd   8.8 10908 289 28056 112224
27326 /usr/sbin/httpd   5.9 7388  289 19985 79940
27327 /usr/sbin/httpd  14.4 17848 289 21912 87648
32327 /usr/sbin/httpd   6.5 8144  289 19750 79000
12585 /usr/sbin/httpd   7.4 9172  289 10063 40252
 2942 /usr/sbin/httpd  13.5 16792 289 20417 81668
23133 /usr/sbin/httpd   1.8 2228  289  5129 20516
23165 /usr/sbin/httpd   1.6 2064  289  4863 19452


Wow, those httpd processes are taking up a lot of memory. On my home Linux machine my processes are a fraction of that size.

Recommend you check your httpd.conf and remove any modules you have in there which you don't need. Maybe you have PHP4 or MySQL or other modules loaded into the Apache instance that you're not using. I dunno.

Also, if your site is low usage you may want to reduce the number of server processes (MinSpareServers, MaxSpareServers, StartServers, MaxClients).
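As a sketch, those prefork-style directives could be scaled down in httpd.conf to something like the following for a low-traffic site (the values are illustrative, not Red Hat 9 defaults; MaxRequestsPerChild additionally recycles children before they grow too large):

```apache
# Illustrative prefork tuning for a low-traffic site (httpd.conf).
# Example values only - adjust to your own load.
StartServers          3
MinSpareServers       2
MaxSpareServers       5
MaxClients           15
MaxRequestsPerChild 500
```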

Gentlemen, there is progress and some light at the end of the tunnel. The analysis of httpd (Apache) seems to be right on the money. After checking its configuration files (including those carefully hidden under …/conf.d/) I realized that by default Red Hat 9's Apache does load a truckload of modules. I removed some, most notably PHP, Perl, and SSL. Now when the httpd service launches it creates 10 processes each taking 2.0 units (percent?) of memory, as opposed to 10 processes each taking 6.5 units.

The system seems to be faster now. The memory dumps now show free swap (almost all of it) and some free memory. I think I am on the road to recovery. If I may, let me say just a couple more things:

1. Would this memory exhaustion be the cause of the Linode system being very slow at times, or should I still suspect some other mischief or interference? In other words, do Linodes have CPU spikes?

2. Are there other problems that may be explained by this? I am experiencing several nuisances (such as SMTP on my laptop not connecting to the server while downloading, say, 100 incoming messages… maybe it's a descriptor count issue or smth…)

3. The memory usage of HTTPD processes does start off with 2.0 units, but it grows as HTTPD actually works. I guess it pools memory or smth. Should I worry about this also, and are there methods of fighting this? The output of that magic ps command is:

 4124 /usr/sbin/httpd   1.3 1640  289  1576  6304
 4127 /usr/sbin/httpd   4.8 6048  289  5854 23416
 4128 /usr/sbin/httpd   2.2 2800  289  3972 15888
 4129 /usr/sbin/httpd   1.9 2412  289  5854 23416
 4130 /usr/sbin/httpd   3.4 4212  289  3462 13848
 4131 /usr/sbin/httpd   5.5 6872  289  4267 17068
 4132 /usr/sbin/httpd   5.8 7212  289  3961 15844
 4133 /usr/sbin/httpd   4.6 5796  289  5048 20192
 4134 /usr/sbin/httpd   2.0 2532  289  1648  6592
 4153 /usr/sbin/httpd   6.2 7724  289  4171 16684
 4154 /usr/sbin/httpd   7.2 8940  289  4760 19040
 4155 /usr/sbin/httpd   2.0 2500  289  3432 13728
 4233 /usr/sbin/httpd  24.4 30184 289 15015 60060

4. I found a nice link that seems to be related to the issues that I had been experiencing:

http://www.webmasterworld.com/forum23/3159.htm

5. This page also suggests that Linux uses all remaining RAM for caching, which means that an estimate of "free" memory has to include the cache. My output of the free command is:

             total       used       free     shared    buffers     cached
Mem:        123644     120224       3420          0      33504       9512
-/+ buffers/cache:      77208      46436
Swap:       263160      35860     227300

Am I to understand that my "free" memory is about 46MB?

Thanks A MILLION to everybody who has taken the time to help me out with this. This is by far my best tech support experience.

@Orca:

> 1. Would this memory exhaustion be the cause of the Linode system being very slow at times

Yes, especially by the amount you were swapping. For a program to wake up, it potentially had to get another program to write its data out to swap, then load its own stuff back in, and then run. That can take 100 times longer than if you weren't swapping.

> 2. Are there other problems that may be explained by this? I am experiencing several nuisances (such as SMTP on my laptop not connecting to the server while downloading, say, 100 incoming messages… maybe it's a descriptor count issue or smth…)

If you run out of virtual memory totally then new programs may fail to start. When you connect to the SMTP server, it typically copies itself (known as "forking", essentially making a new process) and you talk to that copy, allowing other people to connect. If the fork fails then your connection will be aborted.

> 3. The memory usage of HTTPD processes does start off with 2.0 units, but it grows as HTTPD actually works. I guess it pools memory or smth. Should I worry about this also, and are there methods of fighting this? The output of that magic ps command is:

Programs do grow in size as they load data, or if you load a module (e.g. if you actually use mod_perl then this will take up more memory). Normally a program will grow to a maximum size and stay there, unless there is a "memory leak" bug. Part of OS tuning is working out the maximum working-set size of your applications.

> 5. This page also suggests that Linux uses all remaining RAM for caching, which means that an estimate of "free" memory has to include the cache. My output of the free command is:

             total       used       free     shared    buffers     cached
Mem:        123644     120224       3420          0      33504       9512
-/+ buffers/cache:      77208      46436
Swap:       263160      35860     227300

Am I to understand that my "free" memory is about 46MB?

Sort of. 35 MB of swap space has been used, so at some point in time programs needed more memory than you have. Those programs are still swapped out, so I'm guessing they're sleeping processes not doing much that will just be woken up if something unusual happens. This is fine.

The rest of your programs have fit into 77 MB of memory, allowing the OS to use the other 46 MB for disk caching and buffers. If you started a small program (e.g. 5 MB) then it would use that cache/buffer space. So, for practical purposes, yes, your "free" memory is sort of 46 MB.
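The "-/+ buffers/cache" line is just arithmetic on the first row, which you can check by hand with the kB values quoted above:

```shell
# "Free" memory from an application's point of view:
#   MemFree + Buffers + Cached  (kB values taken from the free output above)
free_kb=3420
buffers_kb=33504
cached_kb=9512
echo "$(( free_kb + buffers_kb + cached_kb )) kB effectively free"
# prints "46436 kB effectively free", matching the -/+ buffers/cache column
```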

This is my linode:

             total       used       free     shared    buffers     cached
Mem:         60020      56716       3304          0       6640      11776
-/+ buffers/cache:      38300      21720
Swap:       263160       2248     260912

As you can see, my linode is doing something similar; it needed a small amount of swap in the past, but everything is now fitting nicely in memory and my linode is healthy.

Your linode is a lot healthier than it was before you played with the httpd configuration :-)
