NULL pointer dereference in rb_erase() when running mkfs.xfs

This document (000019644) is provided subject to the disclaimer at the end of this document.

Environment

Servers running SUSE Linux Enterprise Server 15 (SLES 15) with kernel version 4.12.14-150.41 or 4.12.14-150.47. Other OS versions and kernel versions may also be affected.

Situation

When running mkfs.xfs on a logical volume, a kernel oops can occur with a trace similar to the following:
[ 391.842983] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
[ 391.846699] IP: rb_erase+0x285/0x350
[ 391.846859] PGD 0 P4D 0
[ 391.846859] Oops: 0002 [#1] SMP PTI
[ 391.846859] CPU: 126 PID: 0 Comm: swapper/126 Tainted: G 4.12.14-150.47-default #1 SLE15
[ 391.846859] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.0 03/12/2019
[ 391.846859] task: ffff98beabbe05c0 task.stack: ffffb4a0f1618000
[ 391.846859] RIP: 0010:rb_erase+0x285/0x350
[ 391.846859] RSP: 0018:ffff996d37583d40 EFLAGS: 00010206
[ 391.846859] RAX: fffff0ee8d09e3c0 RBX: ffff996b8d761480 RCX: 0000000000000000
[ 391.846859] RDX: fffff0ee8d09e3c0 RSI: ffff996b7a7b7c90 RDI: ffff996b8d761508
[ 391.846859] RBP: ffff996b8d761508 R08: 0000000000000018 R09: 0000000001ffffff
[ 391.846859] R10: 0000000000000000 R11: ffff996ffffd6000 R12: 0000000000000000
[ 391.846859] R13: ffff980c6ae1bc00 R14: ffff98ba7f7b8140 R15: ffff996b7a7b7c90
[ 391.846859] FS: 0000000000000000(0000) GS:ffff996d37580000(0000) knlGS:0000000000000000
[ 391.846859] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 391.846859] CR2: 0000000000000018 CR3: 000002974800a001 CR4: 00000000003606e0
[ 391.846859] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 391.846859] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 391.846859] Call Trace:
[ 391.846859] <IRQ>
[ 391.846859] elv_rb_del+0x25/0x40
[ 391.846859] bfq_remove_request+0x7b/0x280
[ 391.846859] bfq_finish_request+0x50/0x390
[ 391.846859] blk_mq_free_request+0x55/0x160
[ 391.846859] scsi_end_request+0x89/0x210 [scsi_mod]
[ 391.846859] scsi_io_completion+0x213/0x630 [scsi_mod]
[ 391.846859] __blk_mq_complete_request+0xcb/0x140
[ 391.846859] storvsc_on_channel_callback+0x252/0x600 [hv_storvsc]
[ 391.846859] ? enqueue_hrtimer+0x37/0x80
[ 391.846859] vmbus_on_event+0x34/0x100 [hv_vmbus]
[ 391.846859] tasklet_action+0x5f/0x110
[ 391.846859] __do_softirq+0xde/0x2c6
[ 391.846859] irq_exit+0xed/0x100
[ 391.846859] hyperv_vector_handler+0x5b/0x70
[ 391.846859] hyperv_callback_vector+0x8f/0xa0
[ 391.846859] </IRQ>
[ 391.846859] RIP: 0010:native_safe_halt+0xe/0x10
[ 391.846859] RSP: 0018:ffffb4a0f161bed8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff0c
[ 391.846859] RAX: ffffffffb56e4ec0 RBX: 000000000000007e RCX: 0000000000000000
[ 391.846859] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 391.846859] RBP: 000000000000007e R08: 0000000000000003 R09: 0106c28af1a127bb
[ 391.846859] R10: ffffb4a0f161be08 R11: 00000000003d0900 R12: 0000000000000000
[ 391.846859] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 391.846859] ? __sched_text_end+0x5/0x5
[ 391.846859] default_idle+0x1a/0x100
[ 391.846859] do_idle+0x169/0x1e0
[ 391.846859] cpu_startup_entry+0x5d/0x60
[ 391.846859] start_secondary+0x1b3/0x200
[ 391.846859] secondary_startup_64+0xa5/0xb0
[ 391.846859] Code: 10 0f 84 dd 00 00 00 4c 89 49 08 c3 4c 89 0e 4d 85 d2 0f 84 22 fe ff ff 48 83 c8 01 48 89 0a 49 89 02 c3 4d 85 c0 4c 89 06 74 11 <49> 89 10 c3 48 89 0e c3 4d 89 48 10 eb d6 4c 89 0e f3 c3 48 89
[ 391.846859] Modules linked in: ip6table_filter ip6_tables iptable_filter nf_conntrack_ipv4 nf_defrag_ipv4 xt_owner xt_conntrack nf_conntrack iptable_security ip_tables x_tables rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc fscache scsi_transport_iscsi af_packet iscsi_ibft iscsi_boot_sysfs mlx5_ib ib_core nls_iso8859_1 nls_cp437 vfat fat mlx5_core mlxfw nfit devlink libnvdimm crc32_pclmul dm_mod pci_hyperv(X) ghash_clmulni_intel pcbc aesni_intel hv_utils(X) aes_x86_64 crypto_simd glue_helper ptp cryptd pcspkr hv_netvsc(X) hyperv_fb(X) pps_core hv_balloon(X) joydev xfs libcrc32c sd_mod serio_raw hv_storvsc(X) hid_generic scsi_transport_fc hyperv_keyboard(X) hid_hyperv(X) crc32c_intel hv_vmbus(X) sg scsi_mod efivarfs autofs4
[ 391.846859] Supported: Yes, External
[ 391.846859] CR2: 0000000000000018
[ 391.846859] ---[ end trace ccf8bbf09fab667a ]---
[ 392.024102] RIP: 0010:rb_erase+0x285/0x350
[ 392.024102] RSP: 0018:ffff996d37583d40 EFLAGS: 00010206
[ 392.024102] RAX: fffff0ee8d09e3c0 RBX: ffff996b8d761480 RCX: 0000000000000000
[ 392.032436] RDX: fffff0ee8d09e3c0 RSI: ffff996b7a7b7c90 RDI: ffff996b8d761508
[ 392.032436] RBP: ffff996b8d761508 R08: 0000000000000018 R09: 0000000001ffffff
[ 392.032436] R10: 0000000000000000 R11: ffff996ffffd6000 R12: 0000000000000000
[ 392.032436] R13: ffff980c6ae1bc00 R14: ffff98ba7f7b8140 R15: ffff996b7a7b7c90
[ 392.032436] FS: 0000000000000000(0000) GS:ffff996d37580000(0000) knlGS:0000000000000000
[ 392.032436] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 392.032436] CR2: 0000000000000018 CR3: 000002974800a001 CR4: 00000000003606e0
[ 392.032436] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 392.032436] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 392.032436] Kernel panic - not syncing: Fatal exception in interrupt
[ 392.032436] Kernel Offset: 0x34000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 392.032436] ---[ end Kernel panic - not syncing: Fatal exception in interrupt
[ 392.069668] sched: Unexpected reschedule of offline CPU#52!
[ 392.069668] ------------[ cut here ]------------

Resolution

A PTF (Program Temporary Fix) that resolves this issue has been released. To request it, open a ticket with SUSE Support. A future maintenance release kernel will also contain the fix.

A workaround is available until a PTF is obtained: switch the affected block devices from the bfq I/O scheduler to the mq-deadline I/O scheduler.

To switch the scheduler on a block device without rebooting, run the following command as root, replacing sdX with the affected block device. Repeat this for every affected block device:
echo "mq-deadline" > /sys/block/sdX/queue/scheduler
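
Where many block devices are affected, the per-device command above can be wrapped in a loop. The following is a minimal sketch, not an official SUSE script: the function name switch_bfq_to_mq_deadline and its optional directory argument are hypothetical and exist only for illustration. It relies on the sysfs convention that the active scheduler is shown in square brackets in the scheduler file. A setting written through sysfs is not expected to persist across a reboot.

```shell
#!/bin/sh
# Sketch: switch every block device that currently uses bfq to mq-deadline.
# The function name and its optional argument are hypothetical (illustration only).
# $1: directory to scan; defaults to /sys/block.
switch_bfq_to_mq_deadline() {
    sysfs_block="${1:-/sys/block}"
    for sched in "$sysfs_block"/*/queue/scheduler; do
        # Skip if the glob matched nothing or the device has no scheduler file.
        [ -e "$sched" ] || continue
        # The active scheduler appears in brackets, e.g. "mq-deadline kyber [bfq] none".
        if grep -q '\[bfq\]' "$sched"; then
            echo "mq-deadline" > "$sched"
            echo "switched: $sched"
        fi
    done
}
```

After defining the function in a root shell, calling switch_bfq_to_mq_deadline with no argument scans /sys/block. Devices that are not using bfq are left untouched.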

Cause

This issue appears to affect servers whose block devices use the bfq I/O scheduler. Investigation is ongoing.

Additional Information

Another kernel oops, which may be related, produces a trace similar to the following:

[   32.749952] Oops: 0000 [#1] SMP PTI
[   32.749952] CPU: 0 PID: 1286 Comm: ( ) Tainted: G                   4.12.14-150.41-default #1 SLE15
[   32.749952] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090007  06/02/2017
[   32.749952] task: ffff9352a100c300 task.stack: ffff9f2bc5d88000
[   32.749952] RIP: 0010:bfq_rq_pos_tree_lookup.isra.22+0x1d/0x80

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.

  • Document ID: 000019644
  • Creation Date: 22-Jun-2020
  • Modified Date: 22-Jun-2020
    • SUSE Linux Enterprise Server
    • SUSE Linux Enterprise Server for SAP Applications
