Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement statx(STATX_DIOALIGN) so applications can discover correct O_DIRECT alignment #16972

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

robn
Copy link
Member

@robn robn commented Jan 22, 2025

Motivation and Context

statx(STATX_DIOALIGN) is a Linux-specific syscall to let an application discover if direct IO (O_DIRECT) to a file is possible and, if so, what the alignment requirements are. This is useful since Linux does not mandate what happens when an unaligned access is requested; the filesystem may return EINVAL (as OpenZFS does), or may fall back to a non-aligned write.

This PR wires this call up to OpenZFS.

Further reading:

Closes: #16970

Description

  • Adds a ZPL function, zfs_get_direct_alignment(), to return the direct read and write alignment for the given open znode;
  • Extends zpl_getattr_impl to handle STATX_DIOALIGN by filling it out with information derived from zfs_get_direct_alignment();
  • Adds a ZTS helper program to call statx(2) requesting specific fields and display the results;
  • Adds a ZTS test to use the program to check the values are returned correctly for various combinations of direct=, recordsize= and object size and structure.

Notes

STATX_DIOALIGN is required to return two alignment values:

  • dio_mem_align: the required alignment for a userspace memory region used in O_DIRECT calls;
  • dio_offset_align: the required alignment for DMA to the underlying storage device.

Both set to 0 indicates that direct IO is not possible on the file. If direct IO is possible, both must be non-zero. It's not valid to set one to zero and the other non-zero.

Since its a stat-like query, it's not even necessary that the file be open, so our decision on whether or not direct is possible is simply checking that DIO would be possible on this dataset: zfs_dio_enabled set, and direct=always or standard.

OpenZFS has different alignment requirements for reads and writes, which can't be expressed through this interface, so I've taken the larger of the read and write alignment, so at least the caller gets something that is guaranteed to work for both cases.

Write alignment for direct IO is documented to be recordsize. This is implemented as z_blksz, which makes sense, but presents a conundrum. If an object is only one L0 block, then z_blksz may be smaller than the recordsize right now, because it will grow if more data is added. If we only returned z_blksz, then it would be possible for the apparent alignment requirement to change after a single write, which seems like it would make things difficult for an application that wanted to get the alignment once, when the file is opened, and then go to work. So, I've simply made it that if the object is still in the first block and the z_blksz is less than the dataset's z_max_blksz (ie recordsize), we use the dataset max instead.

I'm not sure if this really lines up with what really happens in zfs_setup_direct(), but it seems that it falling out as the recordsize most of the time is probably what's expected, and honestly, direct IO for files smaller than recordsize is kind of a misuse of the whole thing? If anyone knows a program that uses this STATX_DIOALIGN, I'd like to find out what it actually expects.

For the IO alignment, I've just gone with the pool ashift. Again, without knowing how its used I don't really know, but it seems like the closest to not outright lying, even if we don't ever pass user pages down to IO devices. Possibly we should return the same value for both, given that an arbitrary 4K alignment probably doesn't work for direct access to a 128K record, so we don't really want anything to choose it.

How Has This Been Tested?

New test included.

Test passes on Linux 6.6.x and 6.1.x. Test skipped on Linux 5.4.x and 4.19.x. Test also skipped on FreeBSD 14.2.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

robn added 3 commits January 22, 2025 13:01
This statx(2) mask returns the alignment restrictions for O_DIRECT
access on the given file.

We're expected to return both the user memory alignment and the IO DMA
alignment for the underlying device. These concepts don't map perfectly
to OpenZFS, so we choose values that are guaranteed to work.

- for memory alignment, the object's block size, or if the object is
  still in the first block, the maximum size it could grow to (ie the
  dataset's recordsize)

- for IO alignment, the pool minimum ashift. This value is largely
  irrelevant since its intended to indicate the DMA alignment, which
  doesn't make sense for OpenZFS. We have to return something though,
  and this value seems logically correct.

Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <[email protected]>
A standalone helper program to call statx(2) with various masks, and
display the returned value, or fail if the kernel does not support it.

Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <[email protected]>
Uses the statx helper to test the results of the STATX_DIOALIGN request
as we manipulate DIO enable, dataset recordsize and file size and
structure.

Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <[email protected]>
@robn robn requested review from amotin and bwatkinson January 22, 2025 06:11
@robn
Copy link
Member Author

robn commented Jan 22, 2025

@gamanakis @hanneskasp FYI. Hannes, I'd love any feedback from your development team; it's hard to guess what applications are expecting here. Thank you!

@hanneskasp
Copy link

Hello,
we try to solve the O_DIRECT misalignment problem mentioned here #16953 (there is an strace about what fails today in #16953 (comment))

we use the following code for testing today

#include <iostream>
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/stat.h>

int main(int argc, char* argv[])
{
        struct statx statxbuf{};
        if (statx(0, argv[1], 0, STATX_DIOALIGN, &statxbuf))
        {
                perror("Cannot execute statx");
                return -1;
        }
        std::cout << "mask" << statxbuf.stx_mask << " memalign: " << statxbuf.stx_dio_mem_align << " offsetalign: " << statxbuf.stx_dio_offset_align << std::endl;
        return 0;
}

Server 1

  • Rocky Linux 9.5 with 5.14.0-503.21.1.el9_5.x86_64 (we know that statx needs kernel 6.1, but I did not find a quick way to get the mainline kernel working with ZFS on Rocky 9.5 and it seems to work fine for XFS)
  • zfs-2.3.0-1 & zfs-kmod-2.3.0-1
    [root@rocky-zfs23 ~]# ./a.out /etc/passwd (this is XFS)
    mask14335 memalign: 512 offsetalign: 512
    [root@rocky-zfs23 ~]# ./a.out /datareduction/recordsize16k/VBR-FK/BJ-HK-zfs-fastdedup_DC01/HK-DC01_7B18C.vbm
    mask6143 memalign: 0 offsetalign: 0

Server 2

  • Ubuntu 24.04 with 6.8.0-51-generic kernel
  • zfs-2.2.2-0ubuntu9.1 and zfs-kmod-2.2.2-0ubuntu9.1 root@ubuntu:# ./a.out /etc/passwd (this is ext4)
    mask14335 memalign: 4 offsetalign: 512
    root@ubuntu:# ./a.out /alignment/test.txt
    mask6143 memalign: 0 offsetalign: 0

Does that help to answer the question (I'm not a developer and would bring one in if needed)?

Best regards
Hannes

@gamanakis
Copy link
Contributor

@hanneskasp OpenZFS 2.3.0 does not include the present pull-request, that's why memalign=0 and offsetalign=0 in your example. In order for your program to output some meaningful memalign and offsetalign you have to apply this pull-request locally and then compile OpenZFS.

@robn is actually asking how you would use memalign and offsetalign, what values do you expect?

@hanneskasp
Copy link

quoting my colleague:

We're expecting values that will tell us the needed alignment for correct execution of direct IO requests. We will align the memory address of the memory buffer we're passing to the request, offset of the data for IO and size of the data for IO according to these values.

Does that help?

The plan is to continue testing once the PR is in the 2.3 RPM (the main testing system is Rocky Linux 9 with ZFS 2.3)

@gamanakis
Copy link
Contributor

@robn on my veeam backup OpenZFS 2.3.0 filesystem with 128k recordsize, with this PR applied and using the script from @hanneskasp I get:

mask14303 memalign: 131072 offsetalign: 4096

@amotin
Copy link
Member

amotin commented Jan 22, 2025

@robn May be I misremember something, but the patch looks weird to me. I'd expect you to statically set dio_mem_align to PAGE_SIZE, since Direct I/O code strictly requires all buffers to be page-aligned/-multiple. dio_offset_align is indeed tricky since it is different for reads and writes. For reads it can be anything page-aligned/-multiple, but preferably the file block size to avoid partial reads. For writes should be a full block size to not fall back to ARC, but as you pointed the block size might change. I think we should return there the file's current block size if the file has more than one block, or either recordsize or the file's current block size (rounded to the next power of 2 if needed) if it is bigger, but in either case no less than PAGE_SIZE.

Comment on lines +513 to +515
} else
stat->dio_mem_align = stat->dio_offset_align = 0;
stat->result_mask |= STATX_DIOALIGN;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if this is the right answer, but does it make more sense to just not set the STATX_DIOALGIN bit if DIO is off?

I'm concerned people will call statx(), check that STATX_DIOALIGN was set in the response, and then blindly posix_memalign() a buffer using the dio_mem_align value (which would be 0 if DIO is disabled). Considering some dataset could have DIO on, and others off, it makes it so the user has to check both STATX_DIOALIGN is set, and dio_mem_align != 0 if they want to proceed.

It's a little awkward with the statx iterface, because there's no easy way to say "I know you're querying STATX_DIOALIGN, and I do support DIO in general and know how to respond to that, but it's not supported on this particular dataset".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://manpages.debian.org/bookworm/manpages-dev/statx.2.en.html explicitly tells "or 0 if direct I/O is not supported on this file.". So the code looks right to me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree dio_mem_align should be 0 if DIO is disabled. I don't know if that necessarily means the STATX_DIOALGIN bit should be set though:

If a filesystem does not support a field or if it has an unrepresentable value
(for instance, a file with an exotic type), then the mask bit corresponding to that
field will be cleared in stx_mask even if the user asked for it and a dummy value
will be filled in for compatibility purposes if one is available (e.g., a dummy UID
and GID may be specified to mount under some circumstances).

It gets down to the definition what "does not support" means. Does ZFS "not support" the field if DIO is disabled? I would lean towards yes, since I'm guessing most users would look at the bit as a "can I do DIO or not" flag.

For what it's worth, the Linux Test Project is marking the combination of "STATX_DIOALIGN bit set + stx_dio_mem_align == 0" as a test failure:
https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/statx/statx10.c#L39

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"the Linux Test Project is marking" -- it feels wrong to me, but whatever, I am not going to fight it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @tonyhutter on this one. It is fine to set them to 0, but the result_mask should not have STATX_DIOALIGN set if it is disabled or zfs_dio_enabled=0. I think that makes the most sense reading the documentation and looking al the LTP stuff.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets not set STATX_DIOALGIN when dio_mem_align == 0.

@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Jan 22, 2025
@amotin
Copy link
Member

amotin commented Jan 22, 2025

What also makes me wonder is how this API supposed to be used for creating and writing new files? I wonder if applications in that case call it on directory, in which case we might not care about specific zp properties, but only the dataset in general? Or it specifies name of still absent file, in which case I wonder what call ZFS will receive?

* really do that, but we have to return something
* non-zero, and this seems reasonably correct.
*/
stat->dio_offset_align = MAX(PAGE_SIZE, 1 <<
Copy link
Contributor

@bwatkinson bwatkinson Jan 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As @amotin stated, the offset alignment will always be PAGE_SIZE. There was a point were @behlendorf and I considered just using ashift, but the plumbing got a bit messy from what I recall. So this should always just be set to PAGE_SIZE for dio_offset_align.

Copy link
Contributor

@ryao ryao Jan 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have expected this to need to be MAX(PAGE_SIZE, dn_datablksz/z_blksz). Why should it be just PAGE_SIZE?

if (!zfs_dio_enabled || zfsvfs->z_os->os_direct == ZFS_DIRECT_DISABLED)
return (SET_ERROR(EOPNOTSUPP));

/* Read only needs to be page-aligned */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment is not correct. All read and writes must be PAGE_SIZE aligned.

Comment on lines +1112 to +1116
if (zp->z_size <= zp->z_blksz)
*walignp = MAX(zp->z_blksz, zfsvfs->z_max_blksz);
else
*walignp = zp->z_blksz;
*walignp = MAX(PAGE_SIZE, *walignp);
Copy link
Contributor

@bwatkinson bwatkinson Jan 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit tricky as recordsize aligned write buffers will only be issued as Direct I/O writes. The good news with the growing blocksize is the first block of a file is never written using O_DIRECT as we do not allow it in that case anyways. See the o_direct_defer variable in zfs_write().

In the write case for both direct=standard and direct=always it will just silently pass it off to the ARC if the buffer is not recordsize aligned. However, the only true restrictions on alignment is PAGE_SIZE to not return EINVAL with direct=standard. The recordsize alignment is to avoid the RMW that would be required with checksums. For reads, the only alignment requirements are PAGE_SIZE. So I guess the question here is what does statx really say about the offset? If it is alignment, then PAGE_SIZE should be used for both dio_mem_align and dio_offset_align. I could see how this would trip a user up though if they are issuing a write with a 64k page-aligned buffer to a 128k recordsize block and thinking it is going to do an Direct I/O write (it would fall into the ARC silently in this case); however, EINVAL would still not be returned as an error. I just wonder if it would misleading to set dio_offset_align to the blocksize? In the case of reads this is not at all true. If a partial block is read using O_DIRECT (and the buffer is PAGE_SIZE aligned) then the entire block will be read to validate the checksum. I guess my question is, are we saying dio_offset_align is the optimal offset alignment to guarantee Direct I/O writes take place and no inflated blcocksize Direct I/O reads will occur? If so, then I guess we stick with that value even if it is misleading for a read operation.

In either case, we are only strict that the buffer memory and buffer size must be PAGE_SIZE in order to not return EINVAL with direct=standard.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we saying dio_offset_align is the optimal offset alignment to guarantee Direct I/O writes take place

Yes, exactly this. What I think we want to do is advertise the optimal offset size to ensure Direct I/O can be performed without incurring any RMW penalty on write. Otherwise, you'd probably be better of going through the ARC anyway. There's no harm in accepting page-aligned offsets, bit it seems best not to recommend it via statx.

This does potentially make reads a little misleading but it shouldn't significantly impact performance. Under the covers we're always going to be reading the full block from disk anyway to validate the checksum. Which I believe boils down to:

dio_mem_align = PAGE_SIZE;
dio_offset_align = MAX(zp->z_blksz, PAGE_SIZE);

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Code Review Needed Ready for review and testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

O_DIRECT implementation misses statx(STATX_DIOALIGN) extension
8 participants