Server freezes at random
Such a strange thing is happening to me for the first time. I have been using Linode for a couple of years now and have never seen this kind of behaviour. The guys from Linode had no idea what to suggest, so I'm hoping a few of you here can suggest where to look.
The situation is like this: the server can run for days or even a month without problems, and then it suddenly stops responding altogether. HTTP is down and it is impossible to even connect through SSH. After a reboot everything is fine. Then it might work for weeks, or freeze again after a few hours. Even the munin graphs have gaps for those periods of downtime, so I would say the whole system crashes.
What is running on the server: Apache2, Nginx, MySQL, PHP5. Everything is on Ubuntu 10.04.
Any idea where to look? Are there any log files that could be useful?
23 Replies
Nginx sits in front of everything and handles static content; Apache2 then serves PHP (I use the ITK module).
Traffic should not be a problem. On some days I serve 500,000 hits and everything is smooth. Today traffic was not as high as on the previous few days. When the server went down it was only at about 50% of its regular load.
> Aug 4 06:25:07 yyyy rsyslogd: [origin software="rsyslogd" swVersion="4.2.0" x-pid="1954" x-info="
Aug 4 16:22:11 yyyy kernel: imklog 4.2.0, log source = /proc/kmsg started.
The last time the server went down was around 15:55.
However, syslog is a bit more interesting:
> Aug 4 15:17:01 yyyy CRON[21025]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
ug 4 16:22:11 yyyy kernel: imklog 4.2.0, log source = /proc/kmsg started.
The interesting part is the '^@'. If I open the file in a regular editor (like Notepad2) I get 'null' there; if I open it in gVim I get a bunch of '^@' characters. So something bad was there, but I guess there is no way to find out what exactly.
I'm using 32-bit Ubuntu 10.04 (with the latest updates).
The first time I noticed this was on July 11. Since then it has happened a few times. In the first cases I noticed within a few minutes and was able to reboot. Today I did not notice for ~30min and the whole server was down for that time. A 30-minute outage suggests that it does not recover on its own.
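For what it's worth, '^@' is how vim displays a NUL byte (0x00). A run of them in syslog usually just marks the spot where the machine went down before the log data reached disk, so the lines right before the run are the last thing that was logged. Here is a rough Python sketch for finding those spots without opening the file in an editor. The function name `find_nul_runs` and the `context` parameter are made up for illustration, and the log path is whatever file you want to inspect.

```python
def find_nul_runs(path, context=2):
    """Find lines containing NUL bytes in a log file.

    Returns a list of (line_number, nul_count, preceding_lines) tuples,
    where preceding_lines holds up to `context` lines logged just before.
    """
    hits = []
    recent = []  # rolling window of the last few decoded lines
    with open(path, "rb") as fh:  # binary mode so NUL bytes survive
        for lineno, raw in enumerate(fh, start=1):
            count = raw.count(b"\x00")
            if count:
                hits.append((lineno, count, list(recent)))
            # Strip the NULs before decoding so the text part stays readable.
            text = raw.replace(b"\x00", b"")
            text = text.decode("utf-8", errors="replace").rstrip("\n")
            recent.append(text)
            recent = recent[-context:]
    return hits
```

Calling `find_nul_runs("/var/log/syslog")` as root and printing the result should show the line number of each NUL run together with the last entries written before it. It will not recover what was lost, it only pins down the time of the crash.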
You can also post here
             total       used       free     shared    buffers     cached
Mem:          1004        944         59          0         96        465
-/+ buffers/cache:        382        621
Swap:          255        101        154
With regard to munin: it only takes a snapshot every five minutes. If things go awry in under five minutes, munin will look completely normal.
On kernel output: the "logview" command in the Lish shell will print the console output from the last run, which can be handy for troubleshooting kernel crashes and the like.
If worse comes to worst, SSHing in and leaving 'htop' or 'top' running can give you a bit of a snapshot of the moment before it keels over again!
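Since munin only samples every five minutes and a 'top' session dies with the SSH connection, another option is to log a small snapshot to disk every few seconds. Below is a minimal sketch in Python that reads /proc/meminfo and /proc/loadavg and forces each line to disk with fsync. The function names, the output path /var/log/freeze-watch.log and the 5-second interval are my own assumptions, not anything Linode provides, and if the kernel hangs hard enough the very last line can still be lost.

```python
import os
import time


def parse_meminfo(text):
    """Parse /proc/meminfo text into {field: integer value} (mostly kB)."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def format_snapshot(meminfo_text, loadavg_text, timestamp):
    """Build one log line: 1-minute load, free memory and swap in use."""
    mem = parse_meminfo(meminfo_text)
    swap_used = mem.get("SwapTotal", 0) - mem.get("SwapFree", 0)
    load1 = loadavg_text.split()[0]
    return "%s load1=%s memfree_kb=%d swap_used_kb=%d" % (
        timestamp, load1, mem.get("MemFree", 0), swap_used)


def log_forever(out_path="/var/log/freeze-watch.log", interval=5):
    """Append a snapshot every `interval` seconds (hypothetical defaults)."""
    with open(out_path, "a") as out:
        while True:
            with open("/proc/meminfo") as f:
                meminfo = f.read()
            with open("/proc/loadavg") as f:
                loadavg = f.read()
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            out.write(format_snapshot(meminfo, loadavg, stamp) + "\n")
            out.flush()
            os.fsync(out.fileno())  # push the line to disk before sleeping
            time.sleep(interval)
```

Run `log_forever()` as root inside screen or from an init script. After the next freeze, the tail of the log should show whether swap use and load were climbing in the last minute or whether the box died from a normal-looking state.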
[<c011f3bf>] ? do_page_fault+0x24f/0x3a0
[<c0105c27>] ? xen_force_evtchn_callback+0x17/0x30
[<c0106404>] ? check_events+0x8/0xc
[<c01063fb>] ? xen_restore_fl_direct_reloc+0x4/0x4
[<c011f170>] ? mm_fault_error+0x130/0x130
[<c06bfc66>] ? error_code+0x5a/0x60
[<c012007b>] ? try_preserve_large_page+0x7b/0x340
[<c011f170>] ? mm_fault_error+0x130/0x130
[<c01ab8a8>] ? swap_count_continued+0x158/0x180
[<c01abe22>] ? __swap_duplicate+0xc2/0x160
[<c01abb04>] ? add_swap_count_continuation+0x54/0x130
[<c01abee4>] ? swap_duplicate+0x14/0x40
[<c01a068b>] ? copy_pte_range+0x45b/0x500
[<c0106404>] ? check_events+0x8/0xc
[<c01a08c5>] ? copy_page_range+0x195/0x200
[<c0132756>] ? dup_mmap+0x1c6/0x2c0
[<c0132b88>] ? dup_mm+0xa8/0x130
[<c01335fa>] ? copy_process+0x98a/0xb30
[<c01337ef>] ? do_fork+0x4f/0x280
[<c010f780>] ? sys_clone+0x30/0x40
[<c06c000d>] ? ptregs_clone+0x15/0x48
[<c06bf6f1>] ? syscall_call+0x7/0xb
[<c06b0000>] ? sctp_backlog_rcv+0xf0/0x100
INFO: rcu_sched_state detected stall on CPU 2 (t=60000 jiffies)
INFO: rcu_sched_state detected stall on CPU 1 (t=60000 jiffies)
INFO: rcu_sched_state detected stall on CPU 3 (t=240030 jiffies)
INFO: rcu_sched_state detected stall on CPU 2 (t=240031 jiffies)
INFO: rcu_sched_state detected stall on CPU 1 (t=240031 jiffies)
INFO: rcu_sched_state detected stall on CPU 1 (t=420061 jiffies)
INFO: rcu_sched_state detected stall on CPU 2 (t=420061 jiffies)
INFO: rcu_sched_state detected stall on CPU 1 (t=600091 jiffies)
Lish was not responsive; I was not able to type anything there. And as usual: no SSH, no web, nothing.
Ideas?
[<c06bf28d>] ? rwsem_down_failed_common+0x9d/0x110
[<c06bf353>] ? call_rwsem_down_read_failed+0x7/0xc
[<c06bea6a>] ? down_read+0xa/0x10
[<c01683f5>] ? acct_collect+0x35/0x160
[<c0137fbd>] ? do_exit+0x27d/0x350
[<c011f170>] ? mm_fault_error+0x130/0x130
[<c010b7e1>] ? oops_end+0x71/0xa0
[<c011ef8f>] ? bad_area_nosemaphore+0xf/0x20
[<c011f3bf>] ? do_page_fault+0x24f/0x3a0
[<c0105c27>] ? xen_force_evtchn_callback+0x17/0x30
[<c0106404>] ? check_events+0x8/0xc
[<c01063fb>] ? xen_restore_fl_direct_reloc+0x4/0x4
[<c011f170>] ? mm_fault_error+0x130/0x130
[<c06bfc66>] ? error_code+0x5a/0x60
[<c012007b>] ? try_preserve_large_page+0x7b/0x340
[<c011f170>] ? mm_fault_error+0x130/0x130
[<c01ab8a8>] ? swap_count_continued+0x158/0x180
[<c01abe22>] ? __swap_duplicate+0xc2/0x160
[<c01abee4>] ? swap_duplicate+0x14/0x40
[<c01a068b>] ? copy_pte_range+0x45b/0x500
[<c01a08c5>] ? copy_page_range+0x195/0x200
[<c0132756>] ? dup_mmap+0x1c6/0x2c0
[<c0132b88>] ? dup_mm+0xa8/0x130
[<c01335fa>] ? copy_process+0x98a/0xb30
[<c01337ef>] ? do_fork+0x4f/0x280
[<c06bf395>] ? _raw_spin_lock+0x5/0x10
[<c01c2cf0>] ? set_close_on_exec+0x40/0x60
[<c01c3804>] ? do_fcntl+0x2c4/0x3b0
[<c010f780>] ? sys_clone+0x30/0x40
[<c06c000d>] ? ptregs_clone+0x15/0x48
[<c06bf6f1>] ? syscall_call+0x7/0xb
@zumzum:
I found out why the server started to crash even after a reboot: multiple tables in MySQL had crashed, so MySQL started to use 400% CPU, Apache requests started to queue up, and so on. As a result, all RAM and swap were used up. The tables are repaired and now everything is smooth again.
Most likely the tables crashed when the server froze. However, the question remains: why did it crash?
This is similar to, if not the same as, my issue (http://forum.linode.com/viewtopic.php?t=7538).
What I did: I found that some of my tables grow insanely fast, and when they reach ~3GB they start to crash. The server then very quickly starts to swap and the whole thing goes down without any evidence of the problem. I have set up cron jobs to clean up the tables regularly and downgraded to an older kernel, and it has now been 6 days without any problem. I don't know whether that solved the problem permanently or I'm just lucky, but it is working for now.
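To catch tables before they get near that size, something like the following could run from cron alongside the cleanup jobs. It is only a sketch under assumptions: it assumes the table data lives as per-table files (.MYD/.MYI for MyISAM, .ibd for per-table InnoDB) under the MySQL data directory, which on a stock Ubuntu install is normally /var/lib/mysql, and it needs read access to that directory. The function name `large_table_files` and the 2 GB limit in the usage note are made up for illustration.

```python
import os


def large_table_files(datadir, limit_bytes,
                      suffixes=(".MYD", ".MYI", ".ibd")):
    """Return [(size_in_bytes, path), ...] for table files at or above
    limit_bytes, largest first."""
    found = []
    for root, _dirs, files in os.walk(datadir):
        for name in files:
            if not name.endswith(suffixes):
                continue  # skip .frm, logs and other non-data files
            path = os.path.join(root, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # file vanished or is unreadable; ignore it
            if size >= limit_bytes:
                found.append((size, path))
    return sorted(found, reverse=True)
```

For example, `large_table_files("/var/lib/mysql", 2 * 1024**3)` would list every table file of 2 GB or more. Printing the result from a cron job gives you a mail when a table is getting close to the size where mine started to crash.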