Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add os.readinto API for reading data into a caller provided buffer #129205

Open
cmaloney opened this issue Jan 22, 2025 · 5 comments
Open

Add os.readinto API for reading data into a caller provided buffer #129205

cmaloney opened this issue Jan 22, 2025 · 5 comments
Labels
extension-modules C modules in the Modules dir type-feature A feature request or enhancement

Comments

@cmaloney
Copy link
Contributor

cmaloney commented Jan 22, 2025

Feature or enhancement

Proposal:

Code reading data in pure python tends to make a buffer variable, call os.read() which returns a separate newly allocated buffer of data, then copy/append that data onto the pre-allocated buffer[0]. That creates unnecessary extra buffer objects, as well as unnecessary copies. Provide os.readinto for directly filling a Buffer Protocol object.

os.readinto should closely mirror _Py_read which underlies os.read in order to get the same behaviors around retries as well as well-tested cross-platform support.

Move simple cases that use os.read (ex. [0]) to use the new API when it makes code simpler and more efficient. Potentially adding readinto to more readable/writeable file-like proxy objects or objects which transform the data (ex. Lib/_compression) is out of scope for this issue.

[0]

cpython/Lib/subprocess.py

Lines 1914 to 1921 in 298dda5

# Wait for exec to fail or succeed; possibly raising an
# exception (limited in size)
errpipe_data = bytearray()
while True:
part = os.read(errpipe_read, 50000)
errpipe_data += part
if not part or len(errpipe_data) > 50000:
break

def read_signed(fd):
data = b''
length = SIGNED_STRUCT.size
while len(data) < length:
s = os.read(fd, length - len(data))
if not s:
raise EOFError('unexpected EOF')
data += s
return SIGNED_STRUCT.unpack(data)[0]

cpython/Lib/_pyio.py

Lines 1695 to 1701 in 298dda5

def readinto(self, b):
"""Same as RawIOBase.readinto()."""
m = memoryview(b).cast('B')
data = self.read(len(m))
n = len(data)
m[:n] = data
return n

cc: @vstinner

Has this already been discussed elsewhere?

No response given

Links to previous discussion of this feature:

#129005 (comment)

Linked PRs

@cmaloney cmaloney added the type-feature A feature request or enhancement label Jan 22, 2025
@vstinner
Copy link
Member

Do you want to work on a PR? If not, I can if you prefer.

@cmaloney
Copy link
Contributor Author

Almost done with one for adding os.readinto + tests :). Just trying to make sure I port / replicate the right os.read tests (but not all / _Py_read is used by both and handles most cases). Definitely be nice to both work + review os.read -> os.readinto as appropriate if you're up for that (there's quite a bit in multiprocessing, subprocess, ...).

cmaloney added a commit to cmaloney/cpython that referenced this issue Jan 23, 2025
Add a new OS api which will read data directly into a caller provided
writeable buffer protocol object.
@picnixz picnixz added the extension-modules C modules in the Modules dir label Jan 23, 2025
@bluetech
Copy link

Just curious, how would you rewrite your first example using os.readinto?

@cmaloney
Copy link
Contributor Author

@bluetech Ideally to me they'd move to file.read() / the io file object stack which would give the desired "read until EOF or max_bytes is reached efficiently" behavior. Already that is efficient in a lot of cases, but can't be used everywhere currently. Mixture of some overhead that would need to be reduced to match performance (locks, buffer allocation, etc) + relative age of different parts of the codebase (io came in Python 3.0, many other python modules are older). That case I would like to be a BufferedIO with a 0-byte buffer doing a .read() with a "max_size" parameter, which needs a little bit of new features on BufferedIO to handle the use case. Currently though it's open coded.

Migrating cases like that should happen, and I suspect whats the simplest/cleanest will evolve with code review. My prototype has looked something like:

errpipe_data = bytearray(50_000)
bytes_read = 0
while bytes_read < 50_000:
    count := os.readinto(errpipe_read, memoryview(errpipe_data)[bytes_read:]):
    if count == 0:
        break
    bytes_read += count
del errpipe_data[bytes_read:]  # Remove excess bytes

Are some behavior differences between that and the code as implemented today (Today after reading 49_999 bytes, could get 50_000 bytes resulting in 99_999 bytes in errpipe_data, but my prototype never goes past 50_000 bytes).

Not sure if it's good / needed to include the -1 case (and probably something I should add a test around.. Returning 0, number of bytes read, or throwing an exception would definitely simplify call site hadnling a bit...

I also like doing the same thing but with a match instead of the if.

@bluetech
Copy link

Are some behavior differences between that and the code as implemented today (Today after reading 49_999 bytes, could get 50_000 bytes resulting in 99_999 bytes in errpipe_data, but my prototype never goes past 50_000 bytes).

Side note, it does look like the original code is meant to be len(errpipe_data) >= 50000 instead of >, but it doesn't matter much I suppose.

vstinner added a commit that referenced this issue Jan 26, 2025
…ded buffer (#129211)

Add a new OS API which will read data directly into a caller provided
writeable buffer protocol object.

Co-authored-by: Bénédikt Tran <[email protected]>
Co-authored-by: Victor Stinner <[email protected]>
vstinner added a commit to vstinner/cpython that referenced this issue Jan 26, 2025
* Use f-string
* Fix grammar: replace 'datas' with 'data' (and replace 'data' with
  'item').
* Remove unused variables: 'pid' and 'old_mask'.
vstinner added a commit to vstinner/cpython that referenced this issue Jan 26, 2025
* Use f-string.
* Fix grammar: replace 'datas' with 'data' (and replace 'data' with
  'item').
* Remove unused variables: 'pid' and 'old_mask'.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
extension-modules C modules in the Modules dir type-feature A feature request or enhancement
Projects
None yet
Development

No branches or pull requests

4 participants