Server freezes at random

Hi!

Something strange is happening to me for the first time. I have been using Linode for a couple of years now and have not experienced this type of behaviour. The folks at Linode had no idea what to suggest, so I'm hoping a few of you here can suggest where to look.

The situation is like this: the server can run for days or a month without problems, and then suddenly it stops responding entirely. HTTP is down, and it's impossible to even connect through SSH. After a reboot everything is fine. Then it can work for weeks again, or freeze after a few hours. Even the Munin graphs have gaps for those periods of downtime, so I would say the whole system crashes.

What is running on the server: Apache2, Nginx, MySQL, PHP5. Everything on Ubuntu 10.04.

Any idea where to look? Are there any log files that could be useful?

23 Replies

Are you running mpm_prefork in Apache (most likely yes)? If so, what is your MaxClients set to? The default is way too high. What could be happening is that you're getting a short traffic spike that is causing you to run out of memory.
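A rough way to sanity-check MaxClients, as a sketch assuming the Apache worker processes are named "apache2" and that you leave headroom for MySQL and the OS:

# average resident memory per Apache worker, in MB
ps -o rss= -C apache2 | awk '{sum+=$1; n++} END {if (n) printf "avg per worker: %.0f MB\n", sum/n/1024}'

# rule of thumb: MaxClients ~= (RAM left over after MySQL/OS) / average worker size
# e.g. with ~600 MB to spare and ~30 MB workers, a MaxClients of ~20 is far safer than the default 150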

Running out of memory? You can leave a Lish session open running 'top' or something (sorted by memory, perhaps) to see if that's the case. Installing Munin can also provide some useful resource graphs that might give you more information.
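For example (assuming the stock procps tools on Ubuntu 10.04), you can sort by resident memory either interactively in top or with a one-off ps snapshot:

# inside top, press Shift+M to sort processes by memory usage

# or take a snapshot of the biggest memory consumers
ps aux --sort=-rss | head -n 15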

Out of memory is not it. I just opened the Munin graphs and application memory usage is 304 MB on average, max 361 MB. I have a 1 GB Linode.

In front of everything is nginx, which handles static content, and then Apache2 comes in to serve PHP (I use the ITK module).

Traffic should not be the problem. On some days I serve 500,000 hits and everything is smooth. Today's traffic was not as high as on the previous few days; when the server went down it was only at about 50% of its regular load.

What kernel are you running? Run uname -a.

The same apparent thing happened to me (I posted at http://forum.linode.com/viewtopic.php?t=7459&highlight= but got no answers). I logged in via Lish and the Linode was happily humming away; it had just lost networking. Mine has only done that once, though, and it could be a different issue.

It should not be a network loss, because Munin has gaps in its graphs for the time of the downtime. So there is nothing for CPU, memory or any other usage, as if someone had switched the machine off. If only the network were lost, Munin should still keep graphing locally…

Freezes where the whole system randomly locks up completely normally indicate a kernel issue. Check /var/log/messages or /var/log/syslog for any kernel-related messages.
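For example, a quick way to pull out anything kernel-related around the time of a crash (paths as on Ubuntu 10.04):

# look for kernel messages, oopses, panics or OOM-killer activity
grep -iE 'kernel|oops|panic|bug:|out of memory' /var/log/syslog /var/log/messages | less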

I checked the logs you suggested. In messages there is nothing interesting:
> Aug 4 06:25:07 yyyy rsyslogd: [origin software="rsyslogd" swVersion="4.2.0" x-pid="1954" x-info="http://www.rsyslog.com"] rsyslogd was HUPed, type 'lightweight'.

> Aug 4 16:22:11 yyyy kernel: imklog 4.2.0, log source = /proc/kmsg started.
The last time the server went down was around 15:55.

However, syslog is a bit more interesting:
> Aug 4 15:17:01 yyyy CRON[21025]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)

> ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@

> Aug 4 16:22:11 yyyy kernel: imklog 4.2.0, log source = /proc/kmsg started.
The interesting part is the '^@'. If I open the file with a regular editor (like Notepad2) I get 'null' there. If I open it in gVim I get a bunch of '^@' characters. So something bad happened there, but I guess there is no way to find out what exactly.
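As a quick check (a small sketch, assuming the file in question is /var/log/syslog), you can count those NUL bytes directly:

# count the NUL (\0) bytes that gVim displays as ^@
tr -cd '\0' < /var/log/syslog | wc -c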

Interesting. The yyyy should be the year as well, so something's pretty busted. What distro are you running? Can you remember roughly when you last rebooted before this started happening?

With 'yyyy' I just replaced the name of my Linode. :D So that part is fine. The strange part is only those 'null' (^@^@) characters.

I'm using 32-bit Ubuntu 10.04 (with the latest updates).

The first time I noticed this was on July 11. Since then it has happened a few times. In the first cases I noticed it within a few minutes and was able to reboot. Today I did not notice it for ~30 minutes, and the whole server was down for that time. So 30 minutes of downtime suggests that it does not bounce back on its own.

Ah right, OK. Try booting with the 3.0.0-linode35 kernel and see if the problem goes away (don't be scared of the big 3.0, it's not that different from 2.6).

OK. I'm now using "3.0.0-linode35". I will monitor the situation and post an update if it crashes. I hope there will be no updates from me. :D

If 3.0 is stable for you, keep an eye on http://www.linode.com/kernels/; when the Latest Paravirt kernel becomes 3.0, you should switch to that to make sure your kernel is kept up to date.

You can also post here http://forum.linode.com/viewtopic.php?t=7505 to let Linode know how the kernel's doing.

One interesting thing: with the previous kernel, swap usage was never higher than 80 MB, and it took longer to reach that. Now it is already at 101 MB while I have plenty of free RAM. Can that be related to the kernel, and should I worry that so much swap is used even when I have plenty of RAM?

             total       used       free     shared    buffers     cached
Mem:          1004        944         59          0         96        465
-/+ buffers/cache:        382        621
Swap:          255        101        154

The kernel does manage swap, so a change isn't surprising. I wouldn't worry, though; using swap can be good, as it puts memory that isn't used very often onto disk so that the RAM can be used for more useful things. If you start swapping in and out a lot, that's when bad things happen; your Munin graphs will show how much you swap in and out.
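If you want to watch this live rather than via Munin, the si/so columns of vmstat show pages swapped in and out per second (assuming the standard procps vmstat):

# print memory/swap statistics every 5 seconds; sustained non-zero si/so means active swapping
vmstat 5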

With the nulls in the log file: Usually, this is because the logfile was being written to while the system crashed. Not too unusual.

With regard to Munin: it only snapshots every five minutes. If things go awry in under five minutes, Munin will look completely normal.

On kernel output: The "logview" command in the lish shell will spit out the console output from the last run, and can be handy for troubleshooting kernel crashes and the like.

Worse comes to worst, SSHing in and leaving 'htop' or 'top' running can give you a bit of a snapshot of the moment before it keels over again!

It happened again. Four days of uptime and the system crashed. In Lish I was able to see this:

[<c011f3bf>] ? do_page_fault+0x24f/0x3a0                                                           
 [<c0105c27>] ? xen_force_evtchn_callback+0x17/0x30                                                 
 [<c0106404>] ? check_events+0x8/0xc                                                                
 [<c01063fb>] ? xen_restore_fl_direct_reloc+0x4/0x4                                                 
 [<c011f170>] ? mm_fault_error+0x130/0x130                                                          
 [<c06bfc66>] ? error_code+0x5a/0x60                                                                
 [<c012007b>] ? try_preserve_large_page+0x7b/0x340                                                  
 [<c011f170>] ? mm_fault_error+0x130/0x130                                                          
 [<c01ab8a8>] ? swap_count_continued+0x158/0x180                                                    
 [<c01abe22>] ? __swap_duplicate+0xc2/0x160                                                         
 [<c01abb04>] ? add_swap_count_continuation+0x54/0x130                                              
 [<c01abee4>] ? swap_duplicate+0x14/0x40                                                            
 [<c01a068b>] ? copy_pte_range+0x45b/0x500                                                          
 [<c0106404>] ? check_events+0x8/0xc                                                                
 [<c01a08c5>] ? copy_page_range+0x195/0x200                                                         
 [<c0132756>] ? dup_mmap+0x1c6/0x2c0                                                                
 [<c0132b88>] ? dup_mm+0xa8/0x130                                                                   
 [<c01335fa>] ? copy_process+0x98a/0xb30                                                            
 [<c01337ef>] ? do_fork+0x4f/0x280                                                                  
 [<c010f780>] ? sys_clone+0x30/0x40                                                                 
 [<c06c000d>] ? ptregs_clone+0x15/0x48                                                              
 [<c06bf6f1>] ? syscall_call+0x7/0xb                                                                
 [<c06b0000>] ? sctp_backlog_rcv+0xf0/0x100                                                         
INFO: rcu_sched_state detected stall on CPU 2 (t=60000 jiffies)                                     
INFO: rcu_sched_state detected stall on CPU 1 (t=60000 jiffies)                                     
INFO: rcu_sched_state detected stall on CPU 3 (t=240030 jiffies)                                    
INFO: rcu_sched_state detected stall on CPU 2 (t=240031 jiffies)                                    
INFO: rcu_sched_state detected stall on CPU 1 (t=240031 jiffies)                                    
INFO: rcu_sched_state detected stall on CPU 1 (t=420061 jiffies)                                    
INFO: rcu_sched_state detected stall on CPU 2 (t=420061 jiffies)                                    
INFO: rcu_sched_state detected stall on CPU 1 (t=600091 jiffies)

Lish was not responsive; I was not able to type anything there. And as usual: no SSH, no web, nothing.

Ideas?

I just rebooted and it happened again:

[<c06bf28d>] ? rwsem_down_failed_common+0x9d/0x110                                                 
 [<c06bf353>] ? call_rwsem_down_read_failed+0x7/0xc                                                 
 [<c06bea6a>] ? down_read+0xa/0x10                                                                  
 [<c01683f5>] ? acct_collect+0x35/0x160                                                             
 [<c0137fbd>] ? do_exit+0x27d/0x350                                                                 
 [<c011f170>] ? mm_fault_error+0x130/0x130                                                          
 [<c010b7e1>] ? oops_end+0x71/0xa0                                                                  
 [<c011ef8f>] ? bad_area_nosemaphore+0xf/0x20                                                       
 [<c011f3bf>] ? do_page_fault+0x24f/0x3a0                                                           
 [<c0105c27>] ? xen_force_evtchn_callback+0x17/0x30                                                 
 [<c0106404>] ? check_events+0x8/0xc                                                                
 [<c01063fb>] ? xen_restore_fl_direct_reloc+0x4/0x4                                                 
 [<c011f170>] ? mm_fault_error+0x130/0x130                                                          
 [<c06bfc66>] ? error_code+0x5a/0x60                                                                
 [<c012007b>] ? try_preserve_large_page+0x7b/0x340                                                  
 [<c011f170>] ? mm_fault_error+0x130/0x130                                                          
 [<c01ab8a8>] ? swap_count_continued+0x158/0x180                                                    
 [<c01abe22>] ? __swap_duplicate+0xc2/0x160                                                         
 [<c01abee4>] ? swap_duplicate+0x14/0x40                                                            
 [<c01a068b>] ? copy_pte_range+0x45b/0x500                                                          
 [<c01a08c5>] ? copy_page_range+0x195/0x200                                                         
 [<c0132756>] ? dup_mmap+0x1c6/0x2c0                                                                
 [<c0132b88>] ? dup_mm+0xa8/0x130                                                                   
 [<c01335fa>] ? copy_process+0x98a/0xb30                                                            
 [<c01337ef>] ? do_fork+0x4f/0x280                                                                  
 [<c06bf395>] ? _raw_spin_lock+0x5/0x10                                                             
 [<c01c2cf0>] ? set_close_on_exec+0x40/0x60                                                         
 [<c01c3804>] ? do_fcntl+0x2c4/0x3b0                                                                
 [<c010f780>] ? sys_clone+0x30/0x40                                                                 
 [<c06c000d>] ? ptregs_clone+0x15/0x48                                                              
 [<c06bf6f1>] ? syscall_call+0x7/0xb

I found out why the server kept crashing even after a reboot: multiple MySQL tables had crashed, so MySQL started to use 400% of the CPU, Apache requests started to queue up, and so on, and as a result all RAM and swap were used. With the tables repaired, everything is smooth again.

Most likely the tables crashed when the server froze. However, the question remains: why did it freeze in the first place?
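For reference, this is roughly how crashed tables can be found and repaired (a sketch, assuming MyISAM tables and the standard mysqlcheck client; adjust the credentials to your setup):

# check every table in every database, then repair only the ones reported as broken
mysqlcheck -u root -p --all-databases --check
mysqlcheck -u root -p --all-databases --auto-repair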

@zumzum:

> I found out why the server kept crashing even after a reboot: multiple MySQL tables had crashed, so MySQL started to use 400% of the CPU, Apache requests started to queue up, and so on, and as a result all RAM and swap were used. With the tables repaired, everything is smooth again.
>
> Most likely the tables crashed when the server froze. However, the question remains: why did it freeze in the first place?

This is similar to, if not the same as, my issue (http://forum.linode.com/viewtopic.php?t=7538). I wonder?

I'm having the exact same issue here. I tried both 3.0.0 and 2.6.39 with no success, and I'm getting these crashes something like once an hour (though without any regularity or pattern). Did you ever find a solution?

What I have found on other forums is that the kernel sometimes does such things when the system is under heavy load; not necessarily CPU load, but, for example, I/O load. A few possible reasons: an OOM problem because of too many Apache processes, or some crashed MySQL tables.

What I did: I found out that some of my tables grow insanely fast, and when they reach ~3 GB they start to crash; the server then very quickly starts to swap, and the whole thing goes down without any evidence of the problem. I have set up cron jobs to clean up the tables regularly (see the sketch below), downgraded to an older kernel, and have now gone 6 days without any problem. I don't know if that solved the problem permanently or I'm just having good luck, but it is working for now.
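For illustration, a cron entry along these lines could do that regular cleanup; the table name, retention period and credentials here are hypothetical, not taken from the thread:

# /etc/cron.d/table-cleanup (hypothetical): every night at 03:00, prune old rows and optimize the table
0 3 * * * root mysql -u root -pSECRET mydb -e "DELETE FROM log_table WHERE created_at < NOW() - INTERVAL 30 DAY; OPTIMIZE TABLE log_table;"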
