RFT: Kernels 2.6.28-linode15 and 2.6.28-x86_64-linode4
Following the last round of 2.6.27.x testing, there is one new issue to report. Some of you noticed that the 2.6.28 kernels have been available for a few weeks and have been running them; we received two reports of this problem.
Thanks!
-Chris
Does this usually happen?
kernel BUG at drivers/block/xen-blkfront.c:243!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/class/net/ip6tnl0/type
Modules linked in:
Pid: 18889, comm: dump Not tainted (2.6.28-linode15 #2)
EIP: 0061:[<c03ee830>] EFLAGS: 00010046 CPU: 1
EIP is at do_blkif_request+0x2e0/0x360
EAX: 00000001 EBX: 00000000 ECX: c7055740 EDX: d054a650
ESI: d596c144 EDI: d596c184 EBP: 00000325 ESP: c147fc78
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0069
Process dump (pid: 18889, ti=c147e000 task=d5b01690 task.ti=c147e000)
Stack:
00000001 d596c144 00000144 d58fb9e0 d5926000 c9fdba84 00000000 0000000a
d5926000 00000004 00000003 d596c000 00000000 c7055740 d6132edc ffffffff
d58fb9e0 d58fb9e0 00000000 00000004 c03a7503 cc471d0c c03a7a42 00000000
Call Trace:
[<c03a7503>] __generic_unplug_device+0x23/0x30
[<c03a7a42>] __make_request+0xe2/0x3e0
[<c0561b15>] _spin_lock+0x5/0x10
[<c01827d6>] cache_alloc_refill+0x186/0x520
[<c03a6566>] generic_make_request+0xe6/0x230
[<c0105c53>] xen_restore_fl_direct_end+0x0/0x1
[<c01825f7>] kmem_cache_alloc+0x57/0xb0
[<c016141d>] mempool_alloc+0x2d/0xe0
[<c03a78d3>] submit_bio+0x63/0xf0
[<c01a72bd>] bvec_alloc_bs+0x8d/0x140
[<c01a758b>] bio_alloc_bioset+0x6b/0xe0
[<c01a389a>] submit_bh+0xba/0xf0
[<c01a6a05>] block_read_full_page+0x255/0x2d0
[<c01a9d10>] blkdev_get_block+0x0/0x60
[<c0166757>] __do_page_cache_readahead+0x1e7/0x220
[<c0160767>] generic_file_aio_read+0x227/0x670
[<c0185c65>] do_sync_read+0xd5/0x120
[<c0105c53>] xen_restore_fl_direct_end+0x0/0x1
[<c0561ca3>] _spin_unlock_irqrestore+0x13/0x20
[<c013c3e0>] autoremove_wake_function+0x0/0x40
[<c012d78b>] do_softirq+0x4b/0x60
[<c036c5ac>] security_file_permission+0xc/0x10
[<c0185d0e>] rw_verify_area+0x5e/0xd0
[<c0185b90>] do_sync_read+0x0/0x120
[<c018659d>] vfs_read+0x9d/0x160
[<c01869b1>] sys_read+0x41/0x70
[<c010840a>] syscall_call+0x7/0xb
Code: 2c 8d 54 03 40 8d 44 0e 54 b9 6c 00 00 00 e8 98 a5 fc ff 8b 44 24 3c e8 ff 92 fd ff 83 44 24 18 01 e9 40 fd ff ff 0f 0b eb fe 90 <0f> 0b eb fe 8b 44 24 20 ba 40 e5 3e c0 8b 4c 24 20 c7 04 24 0b
EIP: [<c03ee830>] do_blkif_request+0x2e0/0x360 SS:ESP 0069:c147fc78
---[ end trace 89f444916269aa8d ]---
ETA: It appears I was doing a "dump 5" of one of the filesystems at the time of the crash (at least the nightly backup didn't complete).
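For context, the nightly backup here is the classic dump(8) incremental; a level-5 run looks roughly like the following (the device and output path are made up for illustration, only the level and flags matter):
````
# level-5 incremental dump of a filesystem to a file; -u records the run in /etc/dumpdates
# (hypothetical device and destination, standing in for whatever the nightly script uses)
dump -5 -u -f /var/backups/home.5.dump /dev/xvdb
````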
So .27 hangs occasionally for no good reason; .28 crashes.
Just as a side comment: I've found my linode's reliability a lot worse since the move to Xen. Do you still offer UML-based linodes? I know they're not as fast, but they seem to be a lot more reliable!
But why does my Linode 360 only report 349M of RAM when running the .27 kernel? Gimme back my 11M!!!!
@hybinet - That's something we've noticed with the pv_ops kernels, but I haven't mentioned it to anybody yet. I'm going to make a xen-devel report soon…
-Chris
$ uname -a
Linux mydomain.com 2.6.28-x86_64-linode3 #2 SMP Wed Jan 14 10:19:10 EST 2009 x86_64 Intel(R) Xeon(R) CPU L5420 @ 2.50GHz GenuineIntel GNU/Linux
Since the modules.dep helper thinks the kernel is -linode4, modules.dep gets put in the wrong place.
EDIT: Linode is on atlanta46 if that is important.
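A quick way to see the mismatch is to compare what the running kernel reports against the directory the modules helper populated (a sketch; the version strings below are taken from the uname output above and will differ per host):
````
# what the kernel reports vs. where modules.dep actually ended up
uname -r                      # e.g. 2.6.28-x86_64-linode3
ls /lib/modules/              # e.g. 2.6.28-x86_64-linode4
# possible stop-gap (an assumption, not an official fix): point the expected
# name at the directory that exists, then rebuild the dependency list
ln -s /lib/modules/2.6.28-x86_64-linode4 /lib/modules/$(uname -r)
depmod -a
````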
$ uname -a
Linux linode 2.6.26-1-xen-686 #1 SMP Sat Jan 10 22:52:47 UTC 2009 i686 GNU/Linux
It has been working perfectly for a week now.
I've updated the wiki with the details required to run a stock Debian kernel:
-Chris
@caker:
I just pushed out 2.6.28.3-linode17 and 2.6.28.3-x86_64-linode5. This should fix the "kernel BUG at drivers/block/xen-blkfront.c:243!" – @sweh, can you confirm?
I'll try and give it a go tonight and see what happens. (Currently using 2.6.27.4-linode14)
I added ntpd again and there are no time steps; the clock just remains stable. The rest of the VM sees light load from throughput testing and a few web domains. No problems so far.
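For reference, a quick way to confirm ntpd is slewing rather than stepping (standard ntp tooling, nothing Linode-specific):
````
# peer offsets and jitter are in milliseconds; small, steady values mean no steps
ntpq -p
# any actual clock steps show up in syslog as "time reset" lines from ntpd
grep 'ntpd.*time reset' /var/log/syslog
````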
Will give 2.6.28.3-linode17 a go.
–deckert
@sweh:
@caker: I just pushed out 2.6.28.3-linode17 and 2.6.28.3-x86_64-linode5. This should fix the "kernel BUG at drivers/block/xen-blkfront.c:243!" – @sweh, can you confirm?
I'll try and give it a go tonight and see what happens. (Currently using 2.6.27.4-linode14)
So far, so good. The backup that consistently crashed the previous kernel completed with the linode17 kernel. Fingers crossed!
I did something like this:````
mv /usr/portage /usr/portage.tmp
losetup /dev/loop0 /usr/portage.img
mkreiserfs /dev/loop0
mount /dev/loop0 /usr/portage
cp -a /usr/portage.tmp/* /usr/portage
# everything worked fine up to this point
chown -R portage:portage /usr/portage
````
And the whole linode crashed. There was a kernel panic report on the console screen, but unfortunately it was incomplete and I could not find it in the logs.
Can somebody try a similar procedure, tell me what I am doing wrong, or tell me how to provide more information?
-Chris
Edit: Ah, it seems to be a similar (or the same) bug as sweh's. Maybe only the x86 branch has been fixed?
------------[ cut here ]------------
kernel BUG at drivers/block/xen-blkfront.c:243!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/kernel/uevent_seqnum
CPU 0
Modules linked in:
Pid: 0, comm: swapper Not tainted 2.6.28.3-x86_64-linode5 #1
RIP: e030:[<ffffffff8054e2e0>] [<ffffffff8054e2e0>] do_blkif_request+0x3e0/0x470
RSP: e02b:ffffffff80998da8 EFLAGS: 00010046
RAX: 0000000000000000 RBX: ffff8800112582c0 RCX: ffff8800112582c0
RDX: ffff88000c416520 RSI: 0000000000000000 RDI: 000000000000016e
RBP: 000000000000016e R08: 0000000000000b70 R09: 0000000100b1ea23
R10: 0000000100b1ea22 R11: ffffffff8027a0e0 R12: 0000000000000004
R13: ffff880005836f40 R14: ffff88001547c270 R15: 0000000000000000
FS: 00007f864c7146f0(0000) GS:ffffffff809a1000(0000) knlGS:0000000000000000
CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000001a9f048 CR3: 00000000155da000 CR4: 0000000000002620
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffffffff808e2000, task ffffffff80852340)
Stack:
ffff8800165e2ae8 0000000000000208 0000000000000050 0000000000000005
ffff880015419430 ffff880015a12000 ffff880015abbcd8 0000000000000000
0000000000000014 ffff880015a12000 0000000000000001 ffff88001547c000
Call Trace:
<irq> <0> [<ffffffff804fd285>] ? blk_invoke_request_fn+0xa5/0x110
[<ffffffff8054e388>] ? kick_pending_request_queues+0x18/0x30
[<ffffffff8054e585>] ? blkif_interrupt+0x1e5/0x230
[<ffffffff80271eb9>] ? handle_IRQ_event+0x39/0x80
[<ffffffff80274278>] ? handle_level_irq+0x98/0x120
[<ffffffff80214595>] ? do_IRQ+0xb5/0x110
[<ffffffff80524595>] ? xen_evtchn_do_upcall+0xf5/0x140
[<ffffffff806e460e>] ? xen_do_hypervisor_callback+0x1e/0x30
<eoi> <0> [<ffffffff802093aa>] ? _stext+0x3aa/0x1000
[<ffffffff802093aa>] ? _stext+0x3aa/0x1000
[<ffffffff8054df00>] ? do_blkif_request+0x0/0x470
[<ffffffff8020de4c>] ? xen_safe_halt+0xc/0x20
[<ffffffff8020c1e6>] ? xen_idle+0x26/0x50
[<ffffffff8020ff85>] ? cpu_idle+0x45/0x90
Code: 48 8b 42 60 48 89 41 68 48 8b 42 68 48 89 41 70 8b 7c 24 74 e8 a2 52 fd ff 83 44 24 3c 01 e9 4c fc ff ff 0f 0b eb fe 0f 1f 40 00 <0f> 0b eb fe 48 8b 7c 24 48 48 8b 54 24 48 b9 0b 00 00 00 48 c7
RIP [<ffffffff8054e2e0>] do_blkif_request+0x3e0/0x470
RSP <ffffffff80998da8>
Kernel panic - not syncing: Fatal exception in interrupt
------------[ cut here ]------------
WARNING: at kernel/smp.c:333 smp_call_function_mask+0x246/0x250()
Modules linked in:
Pid: 0, comm: swapper Tainted: G D 2.6.28.3-x86_64-linode5 #1
Call Trace:
<irq> [<ffffffff8023bc04>] warn_on_slowpath+0x64/0xa0
[<ffffffff80255946>] up+0x16/0x50
[<ffffffff806e3e04>] _spin_unlock_irqrestore+0x14/0x20
[<ffffffff8023c3ad>] release_console_sem+0x18d/0x1d0
[<ffffffff8023c7e2>] vprintk+0x192/0x3c0
[<ffffffff806e3dde>] _spin_lock_irqsave+0x2e/0x40
[<ffffffff8025fb36>] smp_call_function_mask+0x246/0x250
[<ffffffff8023ca5e>] printk+0x4e/0x60
[<ffffffff8023bcde>] panic+0x8e/0x150
[<ffffffff806e3e04>] _spin_unlock_irqrestore+0x14/0x20
[<ffffffff80246336>] __mod_timer+0xb6/0xd0
[<ffffffff8053ac39>] set_palette+0x9/0x50
[<ffffffff8053c718>] do_unblank_screen+0xb8/0x140
[<ffffffff8053a0fd>] set_cursor+0x3d/0x70
[<ffffffff80214710>] oops_end+0xb0/0xc0
[<ffffffff80213674>] do_invalid_op+0x84/0xa0
[<ffffffff8054e2e0>] do_blkif_request+0x3e0/0x470
[<ffffffff8050b31a>] __next_cpu+0x1a/0x30
[<ffffffff80231703>] find_busiest_group+0x193/0x810
[<ffffffff806e40ea>] error_exit+0x0/0x70
[<ffffffff8027a0e0>] mempool_free_slab+0x0/0x10
[<ffffffff8054e2e0>] do_blkif_request+0x3e0/0x470
[<ffffffff8054e111>] do_blkif_request+0x211/0x470
[<ffffffff804fd285>] blk_invoke_request_fn+0xa5/0x110
[<ffffffff8054e388>] kick_pending_request_queues+0x18/0x30
[<ffffffff8054e585>] blkif_interrupt+0x1e5/0x230
[<ffffffff80271eb9>] handle_IRQ_event+0x39/0x80
[<ffffffff80274278>] handle_level_irq+0x98/0x120
[<ffffffff80214595>] do_IRQ+0xb5/0x110
[<ffffffff80524595>] xen_evtchn_do_upcall+0xf5/0x140
[<ffffffff806e460e>] xen_do_hypervisor_callback+0x1e/0x30
<eoi> [<ffffffff802093aa>] _stext+0x3aa/0x1000
[<ffffffff802093aa>] _stext+0x3aa/0x1000
[<ffffffff8054df00>] do_blkif_request+0x0/0x470
[<ffffffff8020de4c>] xen_safe_halt+0xc/0x20
[<ffffffff8020c1e6>] xen_idle+0x26/0x50
[<ffffffff8020ff85>] cpu_idle+0x45/0x90
---[ end trace f9e449f10704c37e ]---
@drake127:
I managed to crash 2.6.28.3-x86_64-linode5 twice in a row during a recursive chown of the Gentoo portage tree on a loopback device formatted as reiserfs. Since it is a production machine, I'd rather not test it again without a better way to describe the issue.
I did something like this:````
mv /usr/portage /usr/portage.tmp
losetup /dev/loop0 /usr/portage.img
mkreiserfs /dev/loop0
mount /dev/loop0 /usr/portage
cp -a /usr/portage.tmp/* /usr/portage
# everything worked fine up to this point
chown -R portage:portage /usr/portage
````
And the whole linode crashed. There was a kernel panic report on the console screen, but unfortunately it was incomplete and I could not find it in the logs. Can somebody try a similar procedure, tell me what I am doing wrong, or tell me how to provide more information?
I tried this on my linode with the same kernel and could not recreate your problems. I didn't do it directly on /usr, but made a separate LV for testing. I tried reiser on reiser, reiser on ext3, and ext3 on ext3, and in all cases I could copy /usr/portage/* and chown -R the mounted loopback without errors.
````
lvcreate -L1G -ntest vg
mkreiserfs /dev/vg/test
mount /dev/vg/test /mnt/tmp
dd if=/dev/zero of=/mnt/tmp/portage.img bs=1K count=500K
losetup /dev/loop0 /mnt/tmp/portage.img
mkreiserfs /dev/loop0
mkdir /mnt/tmp/portage
mount /dev/loop0 /mnt/tmp/portage
cp -a /usr/portage/* /mnt/tmp/portage
chown -R portage:portage /mnt/tmp/portage
````
I'm not sure what is different, but it seems stable for me.
So I let it be and experimented with a custom kernel (pv_ops) instead, and hit this bug again while compiling it. I suspect the panic is triggered by intensive disk I/O combined with some as-yet-unidentified condition.
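In case anyone wants to poke at the intensive-I/O theory, a crude stress loop along these lines might serve as a reproducer (entirely a guess; paths and sizes are arbitrary, and it should be run against a scratch filesystem, not production):
````
#!/bin/sh
# mix large sequential writes with metadata-heavy copy/chown churn, repeatedly
mkdir -p /mnt/scratch/io-test
i=0
while [ $i -lt 50 ]; do
    dd if=/dev/zero of=/mnt/scratch/io-test/big.$i bs=1M count=256 &
    ( cp -a /usr/portage /mnt/scratch/io-test/tree.$i && \
      chown -R portage:portage /mnt/scratch/io-test/tree.$i ) &
    wait
    sync
    rm -rf /mnt/scratch/io-test/*
    i=$((i + 1))
done
````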
php[839]: segfault at 7f502fb03ac0 ip 00007f502fb03ac0 sp 00007fff3ff60838 error 14 in libtasn1.so.3.0.16[7f5032ca2000+10000]
php[2172]: segfault at 7f55984fdac0 ip 00007f55984fdac0 sp 00007fffa8959cf8 error 14 in libtasn1.so.3.0.16[7f559b69c000+10000]
php[2833]: segfault at 7fa6f6d29ac0 ip 00007fa6f6d29ac0 sp 00007fff071876a8 error 14 in libtasn1.so.3.0.16[7fa6f9ec8000+10000]
php[12205]: segfault at 7f21f71ccac0 ip 00007f21f71ccac0 sp 00007fff0762a008 error 14 in libtasn1.so.3.0.16[7f21fa36b000+10000]
which was then followed by:
general protection fault: 0000 [#1] SMP
last sysfs file:
CPU 0
Modules linked in:
Pid: 13223, comm: htop Not tainted 2.6.28.3-x86_64-linode5 #1
RIP: e030:[<ffffffff8024e5ee>] [<ffffffff8024e5ee>] pid_task+0xe/0x30
RSP: e02b:ffff880014dfdc48 EFLAGS: 00010286
RAX: c381ffffde87e83c RBX: ffff880011c93468 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: c381ffffde87e83c
RBP: ffff880011e09318 R08: 0000000000000000 R09: ffff880015756140
:
R13: ffff880014dfdd18 R14: ffff880014dfdd28 R15: 00000000ffffff9c
FS: 00007f87ead906e0(0000) GS:ffffffff809a1000(0000) knlGS:0000000000000000
CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f87ead99000 CR3: 00000000156c9000 CR4: 0000000000000660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process htop (pid: 13223, threadinfo ffff880014dfc000, task ffff880015a0d950)
Stack:
ffffffff8024e615 ffffffff802eeb34 ffff880011c93468 ffff88000e07c00d
ffff880014dfde68 ffffffff802aa124 ffff880014dfde08 ffff88001568be80
ffff880014dfdc98 ffff8800162dfa88 ffff88000e07c006 ffff880014dfdd18
Call Trace:
[<ffffffff8024e615>] ? get_pid_task+0x5/0x10
[<ffffffff802eeb34>] ? pid_revalidate+0x24/0x120
[<ffffffff802aa124>] ? do_lookup+0x64/0x250
[<ffffffff802ac73f>] ? __link_path_walk+0x74f/0xdd0
[<ffffffff806e3e04>] ? _spin_unlock_irqrestore+0x14/0x20
[<ffffffff802ace14>] ? path_walk+0x54/0xb0
[<ffffffff806e3e04>] ? _spin_unlock_irqrestore+0x14/0x20
[<ffffffff802acfb2>] ? do_path_lookup+0x82/0x1d0
[<ffffffff802ade68>] ? path_lookup_open+0x68/0xd0
[<ffffffff8020c356>] ? xen_mc_flush+0xc6/0x190
[<ffffffff802ae152>] ? do_filp_open+0xc2/0x910
[<ffffffff80295bec>] ? free_pages_and_swap_cache+0x8c/0xb0
[<ffffffff8029dbdc>] ? kmem_cache_free+0x5c/0xa0
[<ffffffff8028d7ba>] ? remove_vma+0x4a/0x60
[<ffffffff802b7319>] ? alloc_fd+0x109/0x130
[<ffffffff8029fc0c>] ? do_sys_open+0x5c/0xf0
[<ffffffff8021130a>] ? system_call_fastpath+0x16/0x1b
Code: 49 8b 80 d8 01 00 00 48 89 50 08 48 c7 87 e0 01 00 00 00 02 20 00 c3 66 0f 1f 44 00 00 48 85 ff 75 03 31 c0 c3 89 f2 48 8d 04 d7 <48> 8b 48 08 48 85 c9 74 ee 48 8d 04 52 48 8d 04 c5 d8 01 00 00
RIP [<ffffffff8024e5ee>] pid_task+0xe/0x30
RSP <ffff880014dfdc48>
---[ end trace ef3c7ac0c009758b ]---
which was then followed by a metric ton of:
php-cgi[13218] general protection ip:674970 sp:7fffedaf59b0 error:0 in php5-cgi[400000+506000]
php-cgi[13219] general protection ip:674970 sp:7fffedaf59b0 error:0 in php5-cgi[400000+506000]
php-cgi[13220] general protection ip:674970 sp:7fffedaf59b0 error:0 in php5-cgi[400000+506000]
php-cgi[13221] general protection ip:674970 sp:7fffedaf59b0 error:0 in php5-cgi[400000+506000]
php-cgi[13222] general protection ip:674970 sp:7fffedaf59b0 error:0 in php5-cgi[400000+506000]