
Panic in spl_kmem_cache_free #16969

Open
philmb3487 opened this issue Jan 21, 2025 · 12 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)



philmb3487 commented Jan 21, 2025

I don't know if this is a known issue or has been reported before. I have been getting one or two kernel panics per day on average; here is an example. I don't always manage to capture the panic, because most of the time it takes the machine down before I can get to a console and run dmesg (which requires working I/O).

[Image: screenshot of the kernel panic]

Thanks!

System information

| Type                 | Version/Name                 |
|----------------------|------------------------------|
| Distribution Name    | Gentoo amd64                 |
| Distribution Version | profile desktop plasma 6.2.5 |
| Kernel Version       | 6.12.9                       |
| Architecture         | amd64                        |
| OpenZFS Version      | 2.3.0                        |

Describe the problem you're observing

Describe how to reproduce the problem

Include any warning/errors/backtraces from the system logs

@philmb3487 philmb3487 added the Type: Defect Incorrect behavior (e.g. crash, hang) label Jan 21, 2025
@terencejferraro

Maybe I should have commented on this existing issue rather than in #16966, but I'm also seeing an identical issue on 5 different machines I upgraded from 2.2.6 to 2.3.0. Average uptime went from being measured in months to 12-24 hours; my best one so far stayed up for 36 hours. I have to hard power cycle every time.

@philmb3487
Author

Nice machines you have there. One thing we all have in common is a lot of RAM on these; could it be related? Is it stable with 16 GiB?

@terencejferraro

I've just finished setting the ARC to 8 GB across all machines (arc_summary doesn't seem to reflect anything lower than that when I set zfs_arc_max lower), and disabled primary/secondary cache across all pools on all servers. I'm definitely expecting some performance hits, but it won't overload any of my systems. I don't think reducing the size of the ARC was actually necessary since I've disabled caching across all pools, but I figured it couldn't hurt to try to isolate things here.

I should know pretty quickly if it's related to the ARC by doing it across the board. I suspect it is related, as most of the stack messages I was seeing had some reference to cache or arc.
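For anyone following along, a rough sketch of the tuning described above, assuming an 8 GiB cap and a pool named "tank" (a placeholder):

# Cap the ARC at 8 GiB at runtime (value in bytes; takes effect immediately)
echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max

# Persist the cap across reboots as a module option
echo "options zfs zfs_arc_max=8589934592" > /etc/modprobe.d/zfs.conf

# Disable ARC/L2ARC caching of data and metadata for a pool
zfs set primarycache=none tank
zfs set secondarycache=none tank

Note that on some versions a zfs_arc_max value at or below the current arc_c_min is ignored at runtime, which might explain why arc_summary didn't reflect the lower settings.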

Member

amotin commented Jan 21, 2025

@terencejferraro I don't see how it is related to ARC size here, especially during boot. The panic indeed happened in a memory free, but it seems to be caused by some memory corruption, not a memory deficit. I am not sure zio_suspend() in the trace can be trusted, but if so, I wonder if there were some I/O errors, devices falling off the bus, or something else that triggered it.

@terencejferraro

Yeah, I wasn't so sure what I was seeing was 100% related to the instability during boot in #16966, but none of my machines use ZFS during boot; everything is mounted and ZFS is loaded after the machine is fully booted. I thought perhaps there might be some correlation based on the description, but I couldn't say for certain.

@philmb3487
Author

> but it seems to be caused by some memory corruption, not a memory deficit. I am not sure zio_suspend() in the trace can be trusted, but if so, I wonder if there were some I/O errors, devices falling off the bus, or something else that triggered it.

Absolutely, I have seen errors regarding I/O.
[Image: screenshot showing I/O errors]

I am not sure how to reproduce the issue; it happens after about 12 hours on average, and always in the same way: I/O locks up and the machine stops responding after a few seconds.

@terencejferraro What makes you think the issue you are running into is this one?

Author

philmb3487 commented Jan 22, 2025

There is this as well.
This happened as I was working.

[Image: screenshot of NVMe device errors]

I hope my drive is not the cause here. I should probably run a memtest overnight too.

Member

amotin commented Jan 22, 2025

The last one, about the NVMe device, does look like a possible trigger for the ZFS problem. Whether it is the device or something else is hard to say, but it looks like it fell off the PCIe bus completely.

@terencejferraro

Fair point, maybe it's not the same.

What originally led me to think it was the same was the timing (12-24 hours) and the result (kernel segfault) on systems that were previously stable.

Sometimes the machine itself was completely unresponsive. Other times only my ZFS drives were unresponsive and drives using xfs, ext4, etc. were unaffected, but it still required a hard power cycle to get things back.

Either way, I'm not seeing I/O drops like you are, so maybe they are indeed different issues.

@terencejferraro

While it may not have been the same issue, disabling the ARC appears to have fixed my issue (though at a not insignificant performance penalty, of course).

@philmb3487
Author

The kernel options "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off" seem to have resolved my issues. I have not had a single crash in 2 days. I can live without the power savings if it means no kernel panics.
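For anyone wanting to try the same workaround, a minimal sketch assuming a GRUB-based setup (paths and the regeneration command vary by distribution and bootloader):

# /etc/default/grub -- append the options to the existing kernel command line
# (the "..." stands for whatever options are already there)
GRUB_CMDLINE_LINUX_DEFAULT="... nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off"

# Regenerate the GRUB configuration, then reboot
grub-mkconfig -o /boot/grub/grub.cfg

These options disable NVMe power-state transitions (APST) and PCIe link/port power management, which is where the power savings are lost.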


lyc8503 commented Jan 25, 2025

I had a very similar kernel panic recently after upgrading both the kernel and my NVMe SSD. The server crashes within about 10 minutes under heavy ZFS load, but it works fine when there is no load on ZFS.

[ 1548.473480] general protection fault, probably for non-canonical address 0xfeff9b9a419e6760: 0000 [#9] PREEMPT SMP NOPTI
[ 1548.473523] ---[ end trace 0000000000000000 ]---
[ 1548.474155] CPU: 3 PID: 9899 Comm: z_wr_int_1 Tainted: P      D    O       6.8.12-7-pve #1
[ 1548.475702] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./B460M Pro4, BIOS P1.50 08/31/2020
[ 1548.476349] RIP: 0010:kmem_cache_free+0x2d9/0x470
[ 1548.476975] Code: ff 49 8b 04 24 a8 40 0f 84 36 fe ff ff 49 8b 44 24 48 4c 8d 68 ff a8 01 4d 0f 44 ec e9 22 fe ff ff 4c 8b 7d 08 e9 d8 fd ff ff <49> 8b 4e 60 48 8b 52 60 48 c7 c6 f0 56 24 b5 48 c7 c7 e0 d3 7b b5
[ 1548.477637] RSP: 0018:ffffa9888721bc70 EFLAGS: 00010206
[ 1548.483068] RAX: fffff72f9d9e3800 RBX: ffff9ba0a78e6400 RCX: 0017ffffc0000840
[ 1548.483730] RDX: ffff9b9a419e6700 RSI: ffff9ba0a78e6400 RDI: ffff9b9a419e6700
[ 1548.484394] RBP: ffffa9888721bcc0 R08: 0000000000000000 R09: 0000000000000000
[ 1548.485061] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9ba1278e6400
[ 1548.489617] R13: 0000000000000000 R14: feff9b9a419e6700 R15: ffff9ba0a78e6400
[ 1548.490281] FS:  0000000000000000(0000) GS:ffff9ba95fd80000(0000) knlGS:0000000000000000
[ 1548.490945] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1548.491615] CR2: 0000000028827000 CR3: 000000017715c005 CR4: 00000000003726f0
[ 1548.492283] Call Trace:
[ 1548.497011]  <TASK>
[ 1548.497677]  ? show_regs+0x6d/0x80
[ 1548.498342]  ? die_addr+0x37/0xa0
[ 1548.499004]  ? exc_general_protection+0x1db/0x480
[ 1548.503601]  ? asm_exc_general_protection+0x27/0x30
[ 1548.504265]  ? kmem_cache_free+0x2d9/0x470
[ 1548.504921]  spl_kmem_cache_free+0x137/0x1f0 [spl]
[ 1548.505590]  zio_destroy+0x9a/0xe0 [zfs]
[ 1548.510895]  zio_done+0x715/0x10b0 [zfs]
[ 1548.511689]  zio_execute+0x88/0x130 [zfs]
[ 1548.512481]  taskq_thread+0x27f/0x4c0 [spl]
[ 1548.513142]  ? __pfx_default_wake_function+0x10/0x10
[ 1548.517799]  ? __pfx_zio_execute+0x10/0x10 [zfs]
[ 1548.518597]  ? __pfx_taskq_thread+0x10/0x10 [spl]
[ 1548.519251]  kthread+0xef/0x120
[ 1548.519897]  ? __pfx_kthread+0x10/0x10
[ 1548.524858]  ret_from_fork+0x44/0x70
[ 1548.525501]  ? __pfx_kthread+0x10/0x10
[ 1548.526146]  ret_from_fork_asm+0x1b/0x30
[ 1548.526792]  </TASK>
[ 1548.531683] Modules linked in: veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter nf_tables bonding tls softdog sunrpc nfnetlink_log binfmt_misc nfnetlink snd_sof_pci_intel_cnl snd_sof_intel_hda_common soundwire_intel snd_hda_codec_hdmi snd_sof_intel_hda_mlink soundwire_cadence snd_sof_intel_hda intel_rapl_msr snd_sof_pci intel_rapl_common snd_hda_codec_realtek snd_sof_xtensa_dsp intel_uncore_frequency snd_hda_codec_generic intel_uncore_frequency_common snd_sof snd_sof_utils snd_soc_hdac_hda snd_soc_acpi_intel_match intel_tcc_cooling snd_soc_acpi x86_pkg_temp_thermal soundwire_generic_allocation intel_powerclamp soundwire_bus coretemp snd_soc_avs kvm_intel snd_soc_hda_codec snd_hda_ext_core kvm snd_soc_core i915 snd_compress ac97_bus irqbypass snd_pcm_dmaengine crct10dif_pclmul snd_hda_intel polyval_clmulni snd_intel_dspcfg polyval_generic snd_intel_sdw_acpi ghash_clmulni_intel snd_hda_codec sha256_ssse3 sha1_ssse3 snd_hda_core drm_buddy aesni_intel snd_hwdep ttm
[ 1548.531721]  crypto_simd snd_pcm drm_display_helper mei_hdcp mei_pxp cryptd snd_timer intel_pmc_core cec mei_me intel_vsec rapl snd rc_core intel_cstate pmt_telemetry mei soundcore pcspkr intel_wmi_thunderbolt wmi_bmof i2c_algo_bit ee1004 serial_multi_instantiate pmt_class acpi_pad acpi_tad joydev input_leds mac_hid vhost_net vhost vhost_iotlb tap efi_pstore dmi_sysfs ip_tables x_tables autofs4 zfs(PO) spl(O) btrfs blake2b_generic xor raid6_pq libcrc32c hid_generic usbkbd usbmouse usbhid uas hid usb_storage xhci_pci nvme xhci_pci_renesas ahci nvme_core i2c_i801 video crc32_pclmul e1000e xhci_hcd libahci nvme_auth i2c_smbus wmi
[ 1548.545775] general protection fault, probably for non-canonical address 0xfeff9b9a419e6760: 0000 [#10] PREEMPT SMP NOPTI
[ 1548.545893] ---[ end trace 0000000000000000 ]---
