Panic in spl_kmem_cache_free #16969
Comments
Maybe I should have commented on this existing issue rather than the comment I made in #16966, but I'm also seeing an identical issue on 5 different machines I upgraded from 2.2.6 to 2.3.0. Average uptime went from being measured in months to 12-24 hours; my best one so far stayed up for 36 hours. I have to hard power cycle every time.
Nice machines you have there. I see we all have in common a lot of RAM on these, could it be related? Is it stable with 16 GiB?
I've just finished setting the ARC to 8GB across all machines (setting zfs_arc_max lower than that doesn't seem to be reflected in arc_summary), and disabled primary/secondary cache across all pools on all servers. I'm definitely expecting some performance hits, but it won't overload any of my systems. I don't think reducing the size of the ARC was actually necessary since I've disabled caching across all pools, but I figured it couldn't hurt to try to isolate here. By doing it across the board, I should know pretty quickly whether it's related to the ARC. I suspect it is, as most of the stack messages I was seeing had some reference to cache or arc.
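For reference, the tuning described above can be sketched roughly as follows. This is a hedged example, not the poster's exact commands; the pool name `tank` is a placeholder, and the paths assume a typical Linux ZFS-on-Linux install:

```shell
# Cap the ARC at 8 GiB at runtime (value in bytes); lasts until module reload
echo 8589934592 | sudo tee /sys/module/zfs/parameters/zfs_arc_max

# Make the limit persistent across reboots via a modprobe option
echo "options zfs zfs_arc_max=8589934592" | sudo tee /etc/modprobe.d/zfs.conf

# Disable ARC and L2ARC caching per pool ("tank" is a placeholder; repeat per pool)
sudo zfs set primarycache=none tank
sudo zfs set secondarycache=none tank

# Confirm the new limit took effect
arc_summary | grep -i "max size"
```

Note that `zfs_arc_max` written at runtime may not shrink an already-populated ARC immediately, which could explain why `arc_summary` did not appear to reflect lower values.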
@terencejferraro I don't see how it is related to ARC size here, especially during boot. The panic indeed happened in memory free, but it seems to be caused by some memory corruption, not a deficit. I am not sure
Yeah, I wasn't so sure what I was seeing was 100% related to the instability during boot in #16966, but none of my machines use zfs during boot; everything is mounted and zfs loaded after the machine is fully booted, so I thought perhaps there might be some correlation based on the description...but I couldn't say for certain.
Absolutely, I have seen errors regarding I/O. I am not sure how to reproduce the issue; it happens after 12 hours on average, and always in the same way: I/O locks up and the machine stops responding after a few seconds. @terencejferraro What makes you think the issue you are running into is this one?
The last message about the NVMe device does look like a possible trigger for the ZFS problem. Whether it is the device or something else is hard to say, but it looks like it fell off the PCIe bus completely.
Fair point, maybe it's not the same. What originally led me to think it was the same was the timing (12-24 hours) and the result (kernel segfault) on systems that were previously stable. Sometimes the machine itself was completely unresponsive. Other times, only my zfs drives were unresponsive while other drives using xfs, ext4, etc. were unaffected...but it still required a hard power cycle to get it back. Either way, I'm not seeing I/O drops like you are...so maybe they are indeed different.
While it may not have been the same issue, disabling the ARC appears to have fixed my issue (though at a not insignificant performance penalty, of course).
The kernel options "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off" seem to have resolved my issues. I have not had a single crash in 2 days. I can live without the power savings if it means no kernel panics.
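For anyone wanting to apply the same boot parameters, a sketch of one common way to do it on a GRUB-based distribution is below. This assumes a Debian/Ubuntu-style `/etc/default/grub`; adapt to your bootloader as needed:

```shell
# /etc/default/grub -- append the options to the default kernel command line
GRUB_CMDLINE_LINUX_DEFAULT="quiet nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off"
```

Then regenerate the GRUB config and reboot, and verify the options are active:

```shell
sudo update-grub          # or: sudo grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot
cat /proc/cmdline         # should show the three options above
```

These options disable NVMe power-state transitions and PCIe link/port power management, which is consistent with the theory that the device was dropping off the PCIe bus during a power-state change.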
I had a very similar kernel panic recently after upgrading both my kernel and my NVMe SSD. The server crashes in about 10 minutes if there's some high load on ZFS, but it works well if there's no load on ZFS.
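Since the crash above reportedly triggers under high ZFS load within minutes, a hedged way to attempt a reproduction is to generate sustained random writes on a dataset with a tool like `fio`. The mount point `/tank/test` is a placeholder for a directory on the affected pool:

```shell
# Sustained random-write load on a ZFS dataset for 10 minutes
# (placeholder path; point --directory at a dataset on the affected pool)
fio --name=zfs-stress \
    --directory=/tank/test \
    --rw=randwrite \
    --bs=128k \
    --size=4G \
    --numjobs=4 \
    --time_based \
    --runtime=600
```

If the panic is load-triggered as described, this kind of workload should surface it far faster than waiting for an organic crash, and gives the developers a concrete reproducer to work with.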
I don't know if this is a known issue or has been reported. I have been getting one or two kernel panics per day on average. Here is an example of a kernel panic. I don't always have the opportunity to take a screenshot of the kernel panic, because most of the time it takes down the machine before I can get to a console to run dmesg (which requires working I/O).
Thanks!
System information
Describe the problem you're observing
Describe how to reproduce the problem
Include any warning/errors/backtraces from the system logs