
Panic in spl_kmem_cache_free #16969

Open
philmb3487 opened this issue Jan 21, 2025 · 12 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)



philmb3487 commented Jan 21, 2025

I don't know if this is a known issue or has been reported before. I have been getting one or two kernel panics per day on average; here is an example. I don't always manage to capture the panic, because most of the time it takes the machine down before I can get to a console and run dmesg (which requires working I/O).

[Image: screenshot of the kernel panic]

Thanks!

System information

| Type                 | Version/Name                 |
|----------------------|------------------------------|
| Distribution Name    | Gentoo amd64                 |
| Distribution Version | profile desktop plasma 6.2.5 |
| Kernel Version       | 6.12.9                       |
| Architecture         | amd64                        |
| OpenZFS Version      | 2.3.0                        |

Describe the problem you're observing

Describe how to reproduce the problem

Include any warning/errors/backtraces from the system logs

@philmb3487 philmb3487 added the Type: Defect Incorrect behavior (e.g. crash, hang) label Jan 21, 2025
@terencejferraro

Maybe I should have commented on this existing issue rather than in #16966, but I'm also seeing an identical issue on 5 different machines I upgraded from 2.2.6 to 2.3.0. Average uptime went from being measured in months to 12-24 hours; my best one so far stayed up for 36 hours. I have to hard power cycle every time.

@philmb3487
Author

Nice machines you have there. One thing we all have in common is a lot of RAM on these; could it be related? Is it stable with 16 GiB?

@terencejferraro

I've just finished setting the ARC to 8 GB across all machines (arc_summary doesn't seem to reflect anything lower than that when I set zfs_arc_max lower), and disabled primary/secondary cache across all pools on all servers. I'm definitely expecting some performance hits, but it won't overload any of my systems. I don't think reducing the size of the ARC was actually necessary since I've disabled caching across all pools, but I figured it couldn't hurt to try to isolate things here.

I should know pretty quickly if it's related to the ARC by doing it across the board. I suspect it is related, as most of the stack messages I was seeing had some reference to cache or arc.
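For anyone following along, a rough sketch of the tuning described above, assuming an 8 GiB cap and a pool named "tank" (a placeholder):

# Cap the ARC at 8 GiB at runtime (value in bytes; takes effect immediately)
echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max

# Persist the cap across reboots as a module option
echo "options zfs zfs_arc_max=8589934592" > /etc/modprobe.d/zfs.conf

# Disable ARC/L2ARC caching of data and metadata for a pool
zfs set primarycache=none tank
zfs set secondarycache=none tank

Note that on some versions a zfs_arc_max value at or below the current arc_c_min is ignored at runtime, which might explain why arc_summary didn't reflect the lower settings.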

Member

amotin commented Jan 21, 2025

@terencejferraro I don't see how it is related to ARC size here, especially during boot. The panic indeed happened in a memory free, but it seems to be caused by some memory corruption, not a memory deficit. I am not sure zio_suspend() in the trace can be trusted, but if so, I wonder if there were some I/O errors, devices falling off the bus, or something else that triggered it.

@terencejferraro

Yeah, I wasn't so sure what I was seeing was 100% related to the instability during boot in #16966, but none of my machines use ZFS during boot; everything is mounted and ZFS is loaded after the machine is fully booted. I thought perhaps there might be some correlation based on the description, but I couldn't say for certain.

@philmb3487
Author

> but it seems to be caused by some memory corruption, not a memory deficit. I am not sure zio_suspend() in the trace can be trusted, but if so, I wonder if there were some I/O errors, devices falling off the bus, or something else that triggered it.

Absolutely, I have seen errors regarding I/O.
[Image: screenshot showing I/O errors]

I am not sure how to reproduce the issue; it happens after about 12 hours on average, and always in the same way: I/O locks up and the machine stops responding after a few seconds.

@terencejferraro What makes you think the issue you are running into is this one?

Author

philmb3487 commented Jan 22, 2025

There is this as well.
This happened as I was working.

[Image: screenshot of NVMe device errors]

I hope my drive is not the cause here. I should probably run a memtest overnight too.

Member

amotin commented Jan 22, 2025

The last one, about the NVMe device, does look like a possible trigger for the ZFS problem. Whether it is the device or something else is hard to say, but it looks like it fell off the PCIe bus completely.

@terencejferraro

Fair point, maybe it's not the same.

What originally led me to think it was the same was the timing (12-24 hours) and the result (kernel segfault) on systems that were previously stable.

Sometimes the machine itself was completely unresponsive. Other times only my ZFS drives were unresponsive and drives using xfs, ext4, etc. were unaffected, but it still required a hard power cycle to get things back.

Either way, I'm not seeing I/O drops like you are, so maybe they are indeed different issues.

@terencejferraro

While it may not have been the same issue, disabling the ARC appears to have fixed my issue (though at a not insignificant performance penalty, of course).

@philmb3487
Author

The kernel options "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off" seem to have resolved my issues. I have not had a single crash in 2 days. I can live without the power savings if it means no kernel panics.
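For anyone wanting to try the same workaround, a minimal sketch assuming a GRUB-based setup (paths and the regeneration command vary by distribution and bootloader):

# /etc/default/grub -- append the options to the existing kernel command line
# (the "..." stands for whatever options are already there)
GRUB_CMDLINE_LINUX_DEFAULT="... nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off"

# Regenerate the GRUB configuration, then reboot
grub-mkconfig -o /boot/grub/grub.cfg

These options disable NVMe power-state transitions (APST) and PCIe link/port power management, which is where the power savings are lost.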


lyc8503 commented Jan 25, 2025

I had a very similar kernel panic recently after upgrading both the kernel and my NVMe SSD. The server crashes within about 10 minutes under heavy ZFS load, but it works fine when there is no load on ZFS.

[ 1548.473480] general protection fault, probably for non-canonical address 0xfeff9b9a419e6760: 0000 [#9] PREEMPT SMP NOPTI
[ 1548.473523] ---[ end trace 0000000000000000 ]---
[ 1548.474155] CPU: 3 PID: 9899 Comm: z_wr_int_1 Tainted: P      D    O       6.8.12-7-pve #1
[ 1548.475702] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./B460M Pro4, BIOS P1.50 08/31/2020
[ 1548.476349] RIP: 0010:kmem_cache_free+0x2d9/0x470
[ 1548.476975] Code: ff 49 8b 04 24 a8 40 0f 84 36 fe ff ff 49 8b 44 24 48 4c 8d 68 ff a8 01 4d 0f 44 ec e9 22 fe ff ff 4c 8b 7d 08 e9 d8 fd ff ff <49> 8b 4e 60 48 8b 52 60 48 c7 c6 f0 56 24 b5 48 c7 c7 e0 d3 7b b5
[ 1548.477637] RSP: 0018:ffffa9888721bc70 EFLAGS: 00010206
[ 1548.483068] RAX: fffff72f9d9e3800 RBX: ffff9ba0a78e6400 RCX: 0017ffffc0000840
[ 1548.483730] RDX: ffff9b9a419e6700 RSI: ffff9ba0a78e6400 RDI: ffff9b9a419e6700
[ 1548.484394] RBP: ffffa9888721bcc0 R08: 0000000000000000 R09: 0000000000000000
[ 1548.485061] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9ba1278e6400
[ 1548.489617] R13: 0000000000000000 R14: feff9b9a419e6700 R15: ffff9ba0a78e6400
[ 1548.490281] FS:  0000000000000000(0000) GS:ffff9ba95fd80000(0000) knlGS:0000000000000000
[ 1548.490945] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1548.491615] CR2: 0000000028827000 CR3: 000000017715c005 CR4: 00000000003726f0
[ 1548.492283] Call Trace:
[ 1548.497011]  <TASK>
[ 1548.497677]  ? show_regs+0x6d/0x80
[ 1548.498342]  ? die_addr+0x37/0xa0
[ 1548.499004]  ? exc_general_protection+0x1db/0x480
[ 1548.503601]  ? asm_exc_general_protection+0x27/0x30
[ 1548.504265]  ? kmem_cache_free+0x2d9/0x470
[ 1548.504921]  spl_kmem_cache_free+0x137/0x1f0 [spl]
[ 1548.505590]  zio_destroy+0x9a/0xe0 [zfs]
[ 1548.510895]  zio_done+0x715/0x10b0 [zfs]
[ 1548.511689]  zio_execute+0x88/0x130 [zfs]
[ 1548.512481]  taskq_thread+0x27f/0x4c0 [spl]
[ 1548.513142]  ? __pfx_default_wake_function+0x10/0x10
[ 1548.517799]  ? __pfx_zio_execute+0x10/0x10 [zfs]
[ 1548.518597]  ? __pfx_taskq_thread+0x10/0x10 [spl]
[ 1548.519251]  kthread+0xef/0x120
[ 1548.519897]  ? __pfx_kthread+0x10/0x10
[ 1548.524858]  ret_from_fork+0x44/0x70
[ 1548.525501]  ? __pfx_kthread+0x10/0x10
[ 1548.526146]  ret_from_fork_asm+0x1b/0x30
[ 1548.526792]  </TASK>
[ 1548.531683] Modules linked in: veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter nf_tables bonding tls softdog sunrpc nfnetlink_log binfmt_misc nfnetlink snd_sof_pci_intel_cnl snd_sof_intel_hda_common soundwire_intel snd_hda_codec_hdmi snd_sof_intel_hda_mlink soundwire_cadence snd_sof_intel_hda intel_rapl_msr snd_sof_pci intel_rapl_common snd_hda_codec_realtek snd_sof_xtensa_dsp intel_uncore_frequency snd_hda_codec_generic intel_uncore_frequency_common snd_sof snd_sof_utils snd_soc_hdac_hda snd_soc_acpi_intel_match intel_tcc_cooling snd_soc_acpi x86_pkg_temp_thermal soundwire_generic_allocation intel_powerclamp soundwire_bus coretemp snd_soc_avs kvm_intel snd_soc_hda_codec snd_hda_ext_core kvm snd_soc_core i915 snd_compress ac97_bus irqbypass snd_pcm_dmaengine crct10dif_pclmul snd_hda_intel polyval_clmulni snd_intel_dspcfg polyval_generic snd_intel_sdw_acpi ghash_clmulni_intel snd_hda_codec sha256_ssse3 sha1_ssse3 snd_hda_core drm_buddy aesni_intel snd_hwdep ttm
[ 1548.531721]  crypto_simd snd_pcm drm_display_helper mei_hdcp mei_pxp cryptd snd_timer intel_pmc_core cec mei_me intel_vsec rapl snd rc_core intel_cstate pmt_telemetry mei soundcore pcspkr intel_wmi_thunderbolt wmi_bmof i2c_algo_bit ee1004 serial_multi_instantiate pmt_class acpi_pad acpi_tad joydev input_leds mac_hid vhost_net vhost vhost_iotlb tap efi_pstore dmi_sysfs ip_tables x_tables autofs4 zfs(PO) spl(O) btrfs blake2b_generic xor raid6_pq libcrc32c hid_generic usbkbd usbmouse usbhid uas hid usb_storage xhci_pci nvme xhci_pci_renesas ahci nvme_core i2c_i801 video crc32_pclmul e1000e xhci_hcd libahci nvme_auth i2c_smbus wmi
[ 1548.545775] general protection fault, probably for non-canonical address 0xfeff9b9a419e6760: 0000 [#10] PREEMPT SMP NOPTI
[ 1548.545893] ---[ end trace 0000000000000000 ]---
