root@697bf25b6113:~# python3
Python 3.11.2 (main, Nov 30 2024, 21:22:50) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib.request
>>> import shutil
>>> request = urllib.request.Request("https://electricworry-public.s3.eu-west-1.amazonaws.com/test")
>>> r = urllib.request.urlopen(request, None, 1000)
>>> f = open("test-python", "wb")
>>> shutil.copyfileobj(r, f)
>>> f.close()
>>>
root@697bf25b6113:~# ls -l
total 363136
-rw-r--r-- 1 root root 184313073 Jan 24 14:43 test-python
-rw-r--r-- 1 root root 187527168 Jan 24 14:31 test-wget
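For what it's worth, this kind of silent shortfall can be caught at download time by comparing the Content-Length header with the bytes actually copied. The sketch below is illustrative only (copy_and_verify is a hypothetical helper, not part of the report), but it works on any file-like response object:

```python
import io


def copy_and_verify(resp, dest, chunk_size=64 * 1024):
    """Copy a file-like HTTP response to dest and return bytes written.

    If resp exposes a Content-Length header (via a .headers mapping),
    compare it with the number of bytes actually copied so a silent
    truncation raises instead of producing a short file.
    """
    expected = None
    headers = getattr(resp, "headers", None)
    if headers is not None and headers.get("Content-Length"):
        expected = int(headers["Content-Length"])

    written = 0
    while True:
        chunk = resp.read(chunk_size)
        if not chunk:
            break
        dest.write(chunk)
        written += len(chunk)

    if expected is not None and written != expected:
        raise IOError(f"truncated: got {written} of {expected} bytes")
    return written
```

Used in place of shutil.copyfileobj(r, f), this would have turned the truncated test-python file into a hard error rather than a 184313073-byte surprise.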
I've tried this on several computers:
- Physical host Dell XPS 13 running Ubuntu 24.04
- Physical own-build workstation running Linux Mint 22.1 Xia
- Docker container running debian:bookworm
Does it also happen for files that contain non-NUL bytes, or only for files made up entirely of NULs? It might be the OS that is truncating the file itself, so you might also want to check that the retrieved buffer has the expected size (the result size and the actual size on disk may differ if copyfileobj takes some optimized path, though I don't know whether that is the case).
It does happen with non-NUL files. The problem first manifested when I was trying to install my own snap packages using Ansible. First I ruled out Ansible as the cause (with this script), and then I tested a file of the same length containing only NULs to eliminate the content as a factor.
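The on-disk size check suggested above can be sketched like this (sizes_match is a hypothetical helper; the buffer and path are illustrative, not from the report). Note that st_size reflects the logical length even for sparse or all-NUL data, so a mismatch here really would point at the filesystem:

```python
import os


def sizes_match(buf: bytes, path: str) -> bool:
    """Write a retrieved buffer to disk and confirm the on-disk size
    matches, to rule out filesystem-level truncation."""
    with open(path, "wb") as f:
        f.write(buf)
    return os.path.getsize(path) == len(buf)
```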
The shutil.copyfileobj documentation says:
Note that if the current file position of the fsrc object is not 0, only the contents from the current file position to the end of the file will be copied
Is the file position of the response file-like object at 0 or not? In addition, urlopen in this case returns a modified HTTPResponse object, which is a BufferedIO object with an underlying fp attribute. Could you perhaps check that this is the case?
I can't query r.tell() or r.fp.tell() (I get an UnsupportedOperation error), but I can deduce that it starts at zero because the (non-NUL) file always begins correctly and then ends with a varying amount of data missing. I confirmed this with Beyond Compare's hex comparison: the file is byte-for-byte identical up to the point where the Python-retrieved copy stops early. There is no other corruption before the truncation.
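The position probe can be written so the UnsupportedOperation case doesn't abort the session; probe_tell below is a hypothetical helper, not part of the report:

```python
import io


def probe_tell(fobj):
    """Return fobj.tell() if it is supported, else None.

    HTTPResponse objects can raise io.UnsupportedOperation here, as
    observed above, so the position has to be deduced another way.
    """
    try:
        return fobj.tell()
    except (io.UnsupportedOperation, OSError):
        return None
```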
I'm really bewildered. My next step will be to use a proxy so I can see inside the TLS connection, but I'll have to come back to that.
Bug report
Bug description:
I am finding that some files downloaded with urllib are always truncated. I have a demonstration file which is 187527168 bytes of NULs.
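An all-NUL demonstration file of that size can be reproduced locally without any download. A sketch (make_nul_file is my name for it; truncate() zero-fills the extended region on Linux):

```python
def make_nul_file(path: str, size: int) -> None:
    """Create a file of `size` NUL bytes by extending an empty file;
    on Linux the extension is zero-filled."""
    with open(path, "wb") as f:
        f.truncate(size)


# e.g. make_nul_file("nulls.bin", 187527168) to match the report's file size
```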
If I download it with wget, the file is always retrieved intact (test-wget in the listing above).
If I attempt the python3 code in the session shown above, I end up with a slightly truncated file; the ls -l listing above shows the result.
I've tried this on several computers (listed above).
A Wireshark packet capture seems to indicate that the remote side completes and closes the connection (FIN, PSH, ACK), which it should, since urllib sends "Connection: close" in the request headers by default.
Is this a known problem? The problem doesn't happen when I switch from https to http.
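Whatever the underlying transport issue turns out to be, the truncation is silent because chunked read(amt) calls simply return short. Reading the whole body in one read() call may instead surface http.client.IncompleteRead when the peer closes before Content-Length bytes arrive. A sketch under that assumption (fetch_body is a hypothetical helper, not from the report):

```python
import http.client


def fetch_body(resp):
    """Return (data, complete) for a file-like HTTP response.

    With a Content-Length present, HTTPResponse.read() with no argument
    can raise IncompleteRead on a short body; the partial data that did
    arrive is preserved either way.
    """
    try:
        return resp.read(), True
    except http.client.IncompleteRead as exc:
        return exc.partial, False
```

Usage would be data, ok = fetch_body(urllib.request.urlopen(url)), then checking ok before writing the file.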
CPython versions tested on:
3.11, 3.12
Operating systems tested on:
Linux