Hard kernel crash, rcu_sched stall, kernel 3.7.5-linode48
The messages seem to vary, but they all seem to be related to memory page faults. Looks like a Xen bug. Anyone have recommendations on how I can alleviate this problem? Downgrade to a known-good kernel, etc? Many thanks.
INFO: rcu_sched self-detected stall on CPU
3: (779947 ticks this GP) idle=1b9/140000000000001/0
(t=780012 jiffies)
Pid: 24440, comm: gs Tainted: G B D 3.7.5-linode48 #1
Call Trace:
[<c0193e8f>] ? print_cpu_stall+0xdf/0x190
[<c078bae1>] ? _raw_spin_unlock_irqrestore+0x11/0x20
[<c016890f>] ? update_wall_time+0x18f/0x290
[<c019435a>] ? rcu_check_callbacks+0x12a/0x230
[<c013f965>] ? update_process_times+0x35/0x70
[<c016f2fd>] ? tick_sched_timer+0x6d/0xc0
[<c0151f65>] ? __remove_hrtimer+0x45/0xa0
[<c016f290>] ? tick_nohz_handler+0xe0/0xe0
[<c01520ed>] ? __run_hrtimer+0x4d/0xf0
[<c0152569>] ? hrtimer_interrupt+0x119/0x2f0
[<c01068f7>] ? xen_timer_interrupt+0x17/0x30
[<c018d9ff>] ? handle_irq_event_percpu+0x3f/0x150
[<c018fed5>] ? irq_get_irq_data+0x5/0x10
[<c04f97a5>] ? info_for_irq+0x5/0x20
[<c04f9e60>] ? evtchn_from_irq+0x10/0x40
[<c0190191>] ? handle_percpu_irq+0x31/0x50
[<c04f9664>] ? __xen_evtchn_do_upcall+0x164/0x210
[<c04fa868>] ? xen_evtchn_do_upcall+0x18/0x30
[<c078cb3b>] ? xen_do_upcall+0x7/0xc
[<c018007b>] ? update_if_frozen+0x6b/0xd0
[<c04f00d8>] ? irq_cpu_rmap_add+0x88/0x90
[<c01013a7>] ? xen_hypercall_sched_op+0x7/0x20
[<c04f9ed7>] ? xen_poll_irq_timeout+0x47/0x60
[<c0108295>] ? xen_spin_lock_slow+0x65/0xd0
[<c010835c>] ? xen_spin_lock_flags+0x5c/0x70
[<c078ba97>] ? _raw_spin_lock_irqsave+0x27/0x40
[<c01ab83d>] ? pagevec_lru_move_fn+0x5d/0xb0
[<c01ab170>] ? pagevec_lookup+0x20/0x20
[<c01c0bd7>] ? exit_mmap+0x37/0x110
[<c04fa86d>] ? xen_evtchn_do_upcall+0x1d/0x30
[<c078cb3b>] ? xen_do_upcall+0x7/0xc
[<c0101227>] ? xen_hypercall_xen_version+0x7/0x20
[<c0106297>] ? xen_force_evtchn_callback+0x17/0x30
[<c01308eb>] ? mmput+0x2b/0xa0
[<c0136113>] ? exit_mm+0xd3/0x100
[<c078bac0>] ? _raw_spin_lock_irq+0x10/0x20
[<c0137b9d>] ? do_exit+0x11d/0x3a0
[<c0131f17>] ? print_oops_end_marker+0x27/0x30
[<c010c272>] ? oops_end+0x72/0xa0
[<c012687e>] ? __bad_area_nosemaphore+0xae/0x140
[<c018da40>] ? handle_irq_event_percpu+0x80/0x150
[<c018fed5>] ? irq_get_irq_data+0x5/0x10
[<c012696b>] ? bad_area+0x3b/0x50
[<c0126f32>] ? __do_page_fault+0x402/0x410
[<c04f96ce>] ? __xen_evtchn_do_upcall+0x1ce/0x210
[<c01947b3>] ? rcu_irq_exit+0x53/0xb0
[<c04fa86d>] ? xen_evtchn_do_upcall+0x1d/0x30
[<c078cb3b>] ? xen_do_upcall+0x7/0xc
[<c0126f40>] ? __do_page_fault+0x410/0x410
[<c078c2fe>] ? error_code+0x5a/0x60
[<c0126f40>] ? __do_page_fault+0x410/0x410
[<c01a78f8>] ? get_page_from_freelist+0x118/0x3c0
[<c0103138>] ? load_TLS_descriptor+0x58/0xa0
[<c01a7e81>] ? __alloc_pages_nodemask+0x141/0x6d0
[<c01aa98d>] ? __do_page_cache_readahead+0xdd/0x1a0
[<c01aaa6e>] ? ra_submit+0x1e/0x30
[<c01a3099>] ? filemap_fault+0x309/0x3e0
[<c01ba5a5>] ? __do_fault+0x75/0x570
[<c01bdbf0>] ? handle_pte_fault+0xa0/0x2f0
[<c01bdf35>] ? handle_mm_fault+0xf5/0x1b0
[<c0126c6a>] ? __do_page_fault+0x13a/0x410
[<c01c36a5>] ? sys_mprotect+0x1b5/0x1f0
[<c0126f40>] ? __do_page_fault+0x410/0x410
[<c078c2fe>] ? error_code+0x5a/0x60
[<c0126f40>] ? __do_page_fault+0x410/0x410
16 Replies
INFO: rcu_sched self-detected stall on CPU
1: (239698 ticks this GP) idle=98d/140000000000001/0
(t=240004 jiffies)
Pid: 2486, comm: litespeed Not tainted 3.7.5-linode48 #1
Call Trace:
James
commit 09ea1383126d942a993b0895cec16e0961db5af9
Author: Eric Dumazet
Date: Thu Jan 10 07:06:10 2013 +0000
tcp: splice: fix an infinite loop in tcp_read_sock()
[ Upstream commit ff905b1e4aad8ccbbb0d42f7137f19482742ff07 ]
commit 02275a2ee7c0 (tcp: don't abort splice() after small transfers)
added a regression.
[ 83.843570] INFO: rcu_sched self-detected stall on CPU
[ 83.844575] INFO: rcu_sched detected stalls on CPUs/tasks: { 6} (detected by 0, t=21002 jiffies, g=4457, c=4456, q=13132)
[ 83.844582] Task dump for CPU 6:
[ 83.844584] netperf R running task 0 8966 8952 0x0000000c
[ 83.844587] 0000000000000000 0000000000000006 0000000000006c6c 0000000000000000
[ 83.844589] 000000000000006c 0000000000000096 ffffffff819ce2bc ffffffffffffff10
[ 83.844592] ffffffff81088679 0000000000000010 0000000000000246 ffff880c4b9ddcd8
[ 83.844594] Call Trace:
[ 83.844596] [
[ 83.844601] [
[ 83.844606] [
[ 83.844610] [
[ 83.844613] [
[ 83.844615] [
[ 83.844618] [
[ 83.844622] [
[ 83.844627] [
[ 83.844630] [
[ 83.844633] [
[ 83.844636] [
if recv_actor() returns 0, we should stop immediately,
because looping wont give a chance to drain the pipe.
Signed-off-by: Eric Dumazet
Cc: Willy Tarreau
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman
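For anyone wondering why a consumer that returns 0 can hang a CPU, here is a toy model of the idea in the commit message. It is not the kernel code: `read_sock`, `make_pipe` and the byte-string "segments" are made-up names for illustration, and the only thing taken from the commit is the rule "if the actor consumes nothing, stop immediately".

```python
def make_pipe(capacity):
    """Hypothetical consumer: accepts bytes until 'capacity' is used up, then returns 0."""
    room = [capacity]

    def actor(data):
        taken = min(len(data), room[0])
        room[0] -= taken
        return taken

    return actor


def read_sock(segments, recv_actor):
    """Toy receive loop; returns how many bytes the actor consumed."""
    copied = 0
    for segment in segments:
        offset = 0
        while offset < len(segment):
            used = recv_actor(segment[offset:])
            if used <= 0:
                # The consumer took nothing (e.g. the pipe is full).
                # Retrying here would spin on the same data forever,
                # so give up and let the caller drain the pipe first.
                return copied
            offset += used
            copied += used
    return copied


print(read_sock([b"abcdef", b"gh"], make_pipe(4)))
```

If the `used <= 0` check only caught negative values, the inner `while` would never advance once the pipe filled, which is the kind of loop the RCU stall detector ends up complaining about.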
reboot system boot 3.7.5-linode48 Mon Feb 18 05:53
reboot system boot 3.7.5-linode48 Mon Feb 4 12:58
reboot system boot 3.7.5-linode48 Mon Feb 4 07:22
reboot system boot 3.6.5-linode47 Mon Jan 28 21:39
reboot system boot 3.6.5-linode47 Sun Jan 27 13:51
reboot system boot 3.6.5-linode47 Mon Jan 14 07:50
reboot system boot 3.6.5-linode47 Sun Jan 6 15:42
reboot system boot 3.6.5-linode47 Sat Dec 22 13:09
It would be amazing/heroic if Linode could provide 1) some sort of very simple "external" monitoring service, i.e. does a particular URL respond to at least 1 of 5 retried requests over 60 seconds, 2) hook the automatic reboot capability into this, 3) notify me of the event. I realize this has its own dangers and complexities, but I'm fairly sure that kernel bugs like this will pop up from now to eternity, and the silent hard lockups are a real pain. (I'm using Server Density to do monitoring, which is how I discover these outages, but I'm grandfathered in under their old [sane!] pricing.) Charge me $5/month for this, or give it away free knowing that I will be more hesitant to leave Linode thanks to this extra automatic monitoring/reliability.
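The "at least 1 of 5 retried requests over 60 seconds" check described above could look roughly like this. It is a minimal sketch, not anything Linode offers: `http_probe` and `is_up` are hypothetical names, the 5 second per-request timeout is an assumption, and what happens on failure (reboot, notification) is left to the caller.

```python
import http.client
import time
import urllib.error
import urllib.request


def http_probe(url, timeout=5.0):
    """Return True if the URL answers with a 2xx/3xx status within 'timeout' seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        return False


def is_up(url, attempts=5, window=60.0, probe=http_probe, sleep=time.sleep):
    """Return True as soon as one of 'attempts' probes, spread over 'window' seconds, succeeds."""
    interval = window / attempts
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(interval)  # wait before the next retry
    return False
```

It has to run from a machine other than the one being watched, since a hard lockup takes any local monitor down with it. The `probe` and `sleep` parameters are only there so the retry logic can be exercised without a network or real delays.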
echo 1 > /proc/sys/kernel/panic # reboot (in our case, exit) 1 second later after a panic
echo 1 > /proc/sys/kernel/panic_on_oops # give up after OOPsing
-Chris
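To confirm those two settings actually took effect, something like the following can read them back. This is a sketch with made-up helper names (`panic_settings`, `reboots_after_oops`); the only things assumed about the system are the standard `/proc/sys/kernel/panic` and `/proc/sys/kernel/panic_on_oops` files.

```python
from pathlib import Path


def panic_settings(root="/proc/sys/kernel"):
    """Read the two panic-related sysctls as integers (None if unreadable)."""
    values = {}
    for name in ("panic", "panic_on_oops"):
        try:
            values[name] = int((Path(root) / name).read_text().strip())
        except (OSError, ValueError):
            values[name] = None
    return values


def reboots_after_oops(values):
    """True if an oops escalates to a panic and a panic leads to a reboot."""
    panic = values.get("panic")
    # kernel.panic == 0 means "hang forever"; any other value reboots.
    return bool(values.get("panic_on_oops")) and panic not in (None, 0)


if __name__ == "__main__":
    current = panic_settings()
    print(current, "reboots after oops:", reboots_after_oops(current))
```

Values written with `echo` are lost at the next boot; putting `kernel.panic = 1` and `kernel.panic_on_oops = 1` in `/etc/sysctl.conf` keeps them.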
@caker:
3.7.9 kernels are inbound!
Have you been able to estimate an arrival date?
James
Enjoy!
-Chris
@caker:
3.7.10-linode49 and 3.7.10-x86_64-linode30 were released today
Outstanding, just outstanding. Thank you.
James