Files retrieved with urllib over https are truncated #129264

Open
electricworry opened this issue Jan 24, 2025 · 2 comments
Labels
stdlib (Python modules in the Lib dir), type-bug (An unexpected behavior, bug, or error)

electricworry commented Jan 24, 2025

Bug report

Bug description:

I am finding that some files downloaded with urllib are always truncated. I have a demonstration file which is 187527168 bytes of NULs.

If I download it with wget, it is always retrieved correctly:

root@697bf25b6113:~# wget https://electricworry-public.s3.eu-west-1.amazonaws.com/test -O test-wget
--2025-01-24 14:41:27--  https://electricworry-public.s3.eu-west-1.amazonaws.com/test
Resolving electricworry-public.s3.eu-west-1.amazonaws.com (electricworry-public.s3.eu-west-1.amazonaws.com)... 52.218.90.80, 52.218.108.120, 3.5.72.214, ...
Connecting to electricworry-public.s3.eu-west-1.amazonaws.com (electricworry-public.s3.eu-west-1.amazonaws.com)|52.218.90.80|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 187527168 (179M) [binary/octet-stream]
Saving to: 'test-wget'

test-wget                                                   100%[=========================================================================================================================================>] 178.84M  5.57MB/s    in 33s     

2025-01-24 14:42:01 (5.38 MB/s) - 'test-wget' saved [187527168/187527168]

root@697bf25b6113:~# ls -l
total 183132
-rw-r--r-- 1 root root 187527168 Jan 24 14:31 test-wget

If I run the following Python 3 code, I end up with a slightly truncated file:

import urllib.request
import shutil
request = urllib.request.Request("https://electricworry-public.s3.eu-west-1.amazonaws.com/test")
r = urllib.request.urlopen(request, None, 1000)
f = open("test-python", "wb")
shutil.copyfileobj(r, f)
f.close()

Here's what I end up with:

root@697bf25b6113:~# python3
Python 3.11.2 (main, Nov 30 2024, 21:22:50) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib.request
>>> import shutil
>>> request = urllib.request.Request("https://electricworry-public.s3.eu-west-1.amazonaws.com/test")
>>> r = urllib.request.urlopen(request, None, 1000)
>>> f = open("test-python", "wb")
>>> shutil.copyfileobj(r, f)
>>> f.close()
>>> 
root@697bf25b6113:~# ls -l
total 363136
-rw-r--r-- 1 root root 184313073 Jan 24 14:43 test-python
-rw-r--r-- 1 root root 187527168 Jan 24 14:31 test-wget
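
The shortfall in the listing above only shows up after the fact, because shutil.copyfileobj does not itself compare the number of bytes copied against the Content-Length header. A minimal sketch of a copy loop that does, assuming the server sends Content-Length (the wget output above suggests it does); the helper name copy_with_length_check is hypothetical and not part of urllib or shutil:

```python
def copy_with_length_check(response, dest, expected_length, chunk_size=64 * 1024):
    """Copy response to dest in chunks and return the number of bytes written.

    Raises OSError if the byte count differs from expected_length.
    Pass expected_length=None to skip the check.
    """
    written = 0
    while True:
        chunk = response.read(chunk_size)
        if not chunk:  # empty read: end of stream, whether complete or not
            break
        dest.write(chunk)
        written += len(chunk)
    if expected_length is not None and written != expected_length:
        raise OSError(f"expected {expected_length} bytes, got {written}")
    return written


# Possible usage with the reproduction script (not executed here):
#
#   import urllib.request
#   with urllib.request.urlopen(request, None, 1000) as r, open("test-python", "wb") as f:
#       header = r.headers.get("Content-Length")
#       expected = int(header) if header is not None else None
#       copy_with_length_check(r, f, expected)
```

With a check like this, a short download fails loudly instead of silently leaving a truncated file on disk.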

I've tried this on several computers:

  • Physical host Dell XPS 13 running Ubuntu 24.04
  • Physical own-build workstation running Linux Mint 22.1 Xia
  • Docker container running debian:bookworm

A Wireshark packet capture seems to indicate that the remote side finishes sending and closes the connection (FIN, PSH, ACK), which is expected because urllib sends "Connection: close" in the request headers by default.
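
The "Connection: close" behaviour can be observed without a packet capture. The following is a sketch only: it starts a throwaway local HTTP server (plain http, not https, so it does not reproduce the truncation itself) and records the Connection header that urllib sends. The function name and the local server are illustrative assumptions, not part of the original report.

```python
import http.server
import threading
import urllib.request


def connection_header_sent_by_urllib():
    """Return the Connection header a local test server receives from urllib."""
    seen = {}

    class Handler(http.server.BaseHTTPRequestHandler):
        def do_GET(self):
            # Record the header before answering the request.
            seen["connection"] = self.headers.get("Connection")
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            pass  # keep the example quiet

    # Port 0 asks the OS for any free port.
    server = http.server.HTTPServer(("127.0.0.1", 0), Handler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        # An empty ProxyHandler avoids any proxy configured in the environment.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        url = f"http://127.0.0.1:{server.server_address[1]}/"
        with opener.open(url, timeout=10) as response:
            response.read()
    finally:
        server.shutdown()
        server.server_close()
    return seen.get("connection")
```

Calling connection_header_sent_by_urllib() returns the header value as the server saw it, which makes it easy to confirm the default on a given Python version.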

Is this a known problem? The problem doesn't happen when I switch from https to http.

CPython versions tested on:

3.11, 3.12

Operating systems tested on:

Linux

@electricworry electricworry added the type-bug An unexpected behavior, bug, or error label Jan 24, 2025
@picnixz picnixz added the stdlib Python modules in the Lib dir label Jan 24, 2025
@picnixz
Member

picnixz commented Jan 24, 2025

The shutil.copyfileobj documentation says:

Note that if the current file position of the fsrc object is not 0, only the contents from the current file position to the end of the file will be copied

Is the file position of the response file-like object at 0 or not? In addition, urlopen returns in this case a modified HTTPResponse object, which is a buffered I/O object (io.BufferedIOBase) with an underlying fp attribute. Could you check that this is the case?

Does it also happen for files that contain non-NUL bytes, or only for files that consist entirely of NULs? It might be the OS that is truncating the file, so you might also want to check that the data retrieved has the expected size (the size received and the size on disk may differ because of an optimized copyfileobj path, but I don't know whether that is the case here).
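
The check suggested above (size received versus size on disk) could be sketched roughly as follows. The helper name save_and_compare_sizes is hypothetical; it counts the bytes as they arrive and then asks the filesystem for the size of the written file, so the two numbers can be compared directly.

```python
import os


def save_and_compare_sizes(src, path, chunk_size=64 * 1024):
    """Write src to path and return (bytes_received, size_on_disk)."""
    received = 0
    with open(path, "wb") as dest:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dest.write(chunk)
            received += len(chunk)
    # The file is closed (and flushed) here, so getsize sees the final size.
    return received, os.path.getsize(path)
```

If both numbers agree with each other but fall short of the expected length, the data was already missing when it was read, which points away from the OS or the file-writing step.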

@electricworry
Author

Thanks for the fast response.

Does it also happen for files that contain non-NUL bytes, or only for files that consist entirely of NULs? It might be the OS that is truncating the file, so you might also want to check that the data retrieved has the expected size (the size received and the size on disk may differ because of an optimized copyfileobj path, but I don't know whether that is the case here).

It does happen with non-NUL files. The problem first showed up when I was trying to install my own snap packages using Ansible. First I ruled out Ansible as the cause (with this script), and then I tested a file of the same length containing only NULs to eliminate the content as a factor.

The shutil.copyfileobj documentation says:

Note that if the current file position of the fsrc object is not 0, only the contents from the current file position to the end of the file will be copied

Is the file position of the response file-like object at 0 or not? In addition, urlopen returns in this case a modified HTTPResponse object, which is a buffered I/O object (io.BufferedIOBase) with an underlying fp attribute. Could you check that this is the case?

I don't seem to be able to query r.tell() or r.fp.tell() - I get an UnsupportedOperation error - but I can deduce that it starts at zero because the (non-NUL) file always starts correctly and then ends with some varying amount of data omitted. This was confirmed with Beyond Compare in Hex Comparison. The file is byte-for-byte identical until the one retrieved in Python stops early. There is no other corruption before the truncation.
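
The hex comparison described above can also be scripted. This is a minimal sketch, with the hypothetical helper name first_difference, that reports the offset at which two files first diverge; for a pure truncation that offset equals the length of the shorter file.

```python
def first_difference(path_a, path_b, chunk_size=1024 * 1024):
    """Return the offset of the first differing byte, or None if identical.

    If one file is a strict prefix of the other, the length of the
    shorter file is returned.
    """
    offset = 0
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            chunk_a = a.read(chunk_size)
            chunk_b = b.read(chunk_size)
            if chunk_a == chunk_b:
                if not chunk_a:
                    return None  # both files ended at the same point
                offset += len(chunk_a)
                continue
            # The chunks differ: locate the first mismatching byte.
            common = min(len(chunk_a), len(chunk_b))
            for i in range(common):
                if chunk_a[i] != chunk_b[i]:
                    return offset + i
            # No mismatch in the shared part, so one file simply ended early.
            return offset + common
```

Comparing the returned offset with os.path.getsize of the shorter file distinguishes a clean truncation from corruption earlier in the stream.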

I'm really bewildered. My next step will be to use a proxy so I can read inside the TLS connection, but I'll have to come back to that.
