-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement statx(STATX_DIOALIGN)
so applications can discover correct O_DIRECT
alignment
#16972
base: master
Are you sure you want to change the base?
Conversation
This statx(2) mask returns the alignment restrictions for O_DIRECT access on the given file. We're expected to return both the user memory alignment and the IO DMA alignment for the underlying device. These concepts don't map perfectly to OpenZFS, so we choose values that are guaranteed to work. - for memory alignment, the object's block size, or if the object is still in the first block, the maximum size it could grow to (ie the dataset's recordsize) - for IO alignment, the pool minimum ashift. This value is largely irrelevant since its intended to indicate the DMA alignment, which doesn't make sense for OpenZFS. We have to return something though, and this value seems logically correct. Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <[email protected]>
A standalone helper program to call statx(2) with various masks, and display the returned value, or fail if the kernel does not support it. Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <[email protected]>
Uses the statx helper to test the results of the STATX_DIOALIGN request as we manipulate DIO enable, dataset recordsize and file size and structure. Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <[email protected]>
@gamanakis @hanneskasp FYI. Hannes, I'd love any feedback from your development team; it's hard to guess what applications are expecting here. Thank you! |
Hello, we use the following code for testing today
Server 1
Server 2
Does that help to answer the question (I'm not a developer and would bring one in if needed)? Best regards |
@hanneskasp OpenZFS 2.3.0 does not include the present pull-request, that's why memalign=0 and offsetalign=0 in your example. In order for your program to output some meaningful memalign and offsetalign you have to apply this pull-request locally and then compile OpenZFS. @robn is actually asking how you would use memalign and offsetalign, what values do you expect? |
quoting my colleague:
Does that help? The plan is to continue testing once the PR is in the 2.3 RPM (the main testing system is Rocky Linux 9 with ZFS 2.3) |
@robn on my veeam backup OpenZFS 2.3.0 filesystem with 128k recordsize, with this PR applied and using the script from @hanneskasp I get:
|
@robn May be I misremember something, but the patch looks weird to me. I'd expect you to statically set |
} else | ||
stat->dio_mem_align = stat->dio_offset_align = 0; | ||
stat->result_mask |= STATX_DIOALIGN; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if this is the right answer, but does it make more sense to just not set the STATX_DIOALGIN bit if DIO is off?
I'm concerned people will call statx(), check that STATX_DIOALIGN was set in the response, and then blindly posix_memalign()
a buffer using the dio_mem_align
value (which would be 0 if DIO is disabled). Considering some dataset could have DIO on, and others off, it makes it so the user has to check both STATX_DIOALIGN is set, and dio_mem_align != 0
if they want to proceed.
It's a little awkward with the statx iterface, because there's no easy way to say "I know you're querying STATX_DIOALIGN, and I do support DIO in general and know how to respond to that, but it's not supported on this particular dataset".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://manpages.debian.org/bookworm/manpages-dev/statx.2.en.html explicitly tells "or 0 if direct I/O is not supported on this file.". So the code looks right to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree dio_mem_align
should be 0 if DIO is disabled. I don't know if that necessarily means the STATX_DIOALGIN bit should be set though:
If a filesystem does not support a field or if it has an unrepresentable value
(for instance, a file with an exotic type), then the mask bit corresponding to that
field will be cleared in stx_mask even if the user asked for it and a dummy value
will be filled in for compatibility purposes if one is available (e.g., a dummy UID
and GID may be specified to mount under some circumstances).
It gets down to the definition what "does not support" means. Does ZFS "not support" the field if DIO is disabled? I would lean towards yes, since I'm guessing most users would look at the bit as a "can I do DIO or not" flag.
For what it's worth, the Linux Test Project is marking the combination of "STATX_DIOALIGN bit set + stx_dio_mem_align == 0" as a test failure:
https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/statx/statx10.c#L39
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"the Linux Test Project is marking" -- it feels wrong to me, but whatever, I am not going to fight it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @tonyhutter on this one. It is fine to set them to 0
, but the result_mask
should not have STATX_DIOALIGN
set if it is disabled
or zfs_dio_enabled=0
. I think that makes the most sense reading the documentation and looking al the LTP stuff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets not set STATX_DIOALGIN when dio_mem_align == 0.
What also makes me wonder is how this API supposed to be used for creating and writing new files? I wonder if applications in that case call it on directory, in which case we might not care about specific zp properties, but only the dataset in general? Or it specifies name of still absent file, in which case I wonder what call ZFS will receive? |
* really do that, but we have to return something | ||
* non-zero, and this seems reasonably correct. | ||
*/ | ||
stat->dio_offset_align = MAX(PAGE_SIZE, 1 << |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As @amotin stated, the offset alignment will always be PAGE_SIZE
. There was a point were @behlendorf and I considered just using ashift, but the plumbing got a bit messy from what I recall. So this should always just be set to PAGE_SIZE
for dio_offset_align
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would have expected this to need to be MAX(PAGE_SIZE, dn_datablksz/z_blksz). Why should it be just PAGE_SIZE?
if (!zfs_dio_enabled || zfsvfs->z_os->os_direct == ZFS_DIRECT_DISABLED) | ||
return (SET_ERROR(EOPNOTSUPP)); | ||
|
||
/* Read only needs to be page-aligned */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment is not correct. All read and writes must be PAGE_SIZE
aligned.
if (zp->z_size <= zp->z_blksz) | ||
*walignp = MAX(zp->z_blksz, zfsvfs->z_max_blksz); | ||
else | ||
*walignp = zp->z_blksz; | ||
*walignp = MAX(PAGE_SIZE, *walignp); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit tricky as recordsize aligned write buffers will only be issued as Direct I/O writes. The good news with the growing blocksize is the first block of a file is never written using O_DIRECT
as we do not allow it in that case anyways. See the o_direct_defer
variable in zfs_write()
.
In the write case for both direct=standard
and direct=always
it will just silently pass it off to the ARC if the buffer is not recordsize aligned. However, the only true restrictions on alignment is PAGE_SIZE
to not return EINVAL
with direct=standard
. The recordsize alignment is to avoid the RMW that would be required with checksums. For reads, the only alignment requirements are PAGE_SIZE
. So I guess the question here is what does statx
really say about the offset? If it is alignment, then PAGE_SIZE
should be used for both dio_mem_align
and dio_offset_align
. I could see how this would trip a user up though if they are issuing a write with a 64k page-aligned buffer to a 128k recordsize block and thinking it is going to do an Direct I/O write (it would fall into the ARC silently in this case); however, EINVAL
would still not be returned as an error. I just wonder if it would misleading to set dio_offset_align
to the blocksize? In the case of reads this is not at all true. If a partial block is read using O_DIRECT
(and the buffer is PAGE_SIZE
aligned) then the entire block will be read to validate the checksum. I guess my question is, are we saying dio_offset_align
is the optimal offset alignment to guarantee Direct I/O writes take place and no inflated blcocksize Direct I/O reads will occur? If so, then I guess we stick with that value even if it is misleading for a read operation.
In either case, we are only strict that the buffer memory and buffer size must be PAGE_SIZE
in order to not return EINVAL
with direct=standard
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are we saying dio_offset_align is the optimal offset alignment to guarantee Direct I/O writes take place
Yes, exactly this. What I think we want to do is advertise the optimal offset size to ensure Direct I/O can be performed without incurring any RMW penalty on write. Otherwise, you'd probably be better of going through the ARC anyway. There's no harm in accepting page-aligned offsets, bit it seems best not to recommend it via statx
.
This does potentially make reads a little misleading but it shouldn't significantly impact performance. Under the covers we're always going to be reading the full block from disk anyway to validate the checksum. Which I believe boils down to:
dio_mem_align = PAGE_SIZE;
dio_offset_align = MAX(zp->z_blksz, PAGE_SIZE);
Motivation and Context
statx(STATX_DIOALIGN)
is a Linux-specific syscall to let an application discover if direct IO (O_DIRECT
) to a file is possible and, if so, what the alignment requirements are. This is useful since Linux does not mandate what happens when an unaligned access is requested; the filesystem may returnEINVAL
(as OpenZFS does), or may fall back to a non-aligned write.This PR wires this call up to OpenZFS.
Further reading:
Closes: #16970
Description
zfs_get_direct_alignment()
, to return the direct read and write alignment for the given open znode;zpl_getattr_impl
to handleSTATX_DIOALIGN
by filling it out with information derived fromzfs_get_direct_alignment()
;statx(2)
requesting specific fields and display the results;direct=
,recordsize=
and object size and structure.Notes
STATX_DIOALIGN
is required to return two alignment values:dio_mem_align
: the required alignment for a userspace memory region used inO_DIRECT
calls;dio_offset_align
: the required alignment for DMA to the underlying storage device.Both set to 0 indicates that direct IO is not possible on the file. If direct IO is possible, both must be non-zero. It's not valid to set one to zero and the other non-zero.
Since its a
stat
-like query, it's not even necessary that the file be open, so our decision on whether or not direct is possible is simply checking that DIO would be possible on this dataset:zfs_dio_enabled
set, anddirect=always
orstandard
.OpenZFS has different alignment requirements for reads and writes, which can't be expressed through this interface, so I've taken the larger of the read and write alignment, so at least the caller gets something that is guaranteed to work for both cases.
Write alignment for direct IO is documented to be recordsize. This is implemented as
z_blksz
, which makes sense, but presents a conundrum. If an object is only one L0 block, thenz_blksz
may be smaller than the recordsize right now, because it will grow if more data is added. If we only returnedz_blksz
, then it would be possible for the apparent alignment requirement to change after a single write, which seems like it would make things difficult for an application that wanted to get the alignment once, when the file is opened, and then go to work. So, I've simply made it that if the object is still in the first block and thez_blksz
is less than the dataset'sz_max_blksz
(ie recordsize), we use the dataset max instead.I'm not sure if this really lines up with what really happens in
zfs_setup_direct()
, but it seems that it falling out as the recordsize most of the time is probably what's expected, and honestly, direct IO for files smaller than recordsize is kind of a misuse of the whole thing? If anyone knows a program that uses thisSTATX_DIOALIGN
, I'd like to find out what it actually expects.For the IO alignment, I've just gone with the pool ashift. Again, without knowing how its used I don't really know, but it seems like the closest to not outright lying, even if we don't ever pass user pages down to IO devices. Possibly we should return the same value for both, given that an arbitrary 4K alignment probably doesn't work for direct access to a 128K record, so we don't really want anything to choose it.
How Has This Been Tested?
New test included.
Test passes on Linux 6.6.x and 6.1.x. Test skipped on Linux 5.4.x and 4.19.x. Test also skipped on FreeBSD 14.2.
Types of changes
Checklist:
Signed-off-by
.