[path-] fix undercounted progress for multibyte chars #2323
base: develop
ecc8628 estimates the average byte length of a character and uses that to estimate progress. It fixes the progress for UTF-16/UTF-32 as discussed. It also fixes it for UTF-8. For example, with a sample UTF-8 dataset of mostly Thai characters, loading progress was undercounted by a factor of 2.5: it would max out at 40% instead of 100%.

The progress estimator samples characters every so often to estimate the average bytes per character. Right now it samples more characters early in the file, because early on it needs more samples to reach a decent estimate of byte length. It doesn't slow down the code much. For a UTF-8 TSV file with 10 million short lines (
A related bug in v3.0.2: progress on compressed text files was overcounted. It would pass 100% and climb to 200-600%.
Hi @midichef! Could you review the most recent commit? We removed the batching to simplify the code a bit.
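The sampling idea described in the ecc8628 comment above can be sketched roughly as follows. This is not the code from the PR: the class name `BytesPerCharEstimator`, its parameters (`sample_every`, `early_chars`), and the sampling schedule are all assumptions made for illustration. It only shows the general technique of sampling densely at the start, sparsely afterwards, and scaling the character count by the sampled average.

```python
class BytesPerCharEstimator:
    """Hypothetical helper: running estimate of encoded bytes per character."""

    def __init__(self, encoding, sample_every=100, early_chars=1000):
        # Use a BOM-less codec name (e.g. "utf-16-le", not "utf-16"),
        # otherwise every sampled character would also be charged for a BOM.
        self.encoding = encoding
        self.sample_every = sample_every  # sparse sampling interval later on
        self.early_chars = early_chars    # sample every character up to here
        self.seen = 0                     # characters read so far
        self.sampled_chars = 0
        self.sampled_bytes = 0

    def feed(self, text):
        """Record a chunk of decoded text, sampling some of its characters."""
        for ch in text:
            early = self.seen < self.early_chars
            if early or self.seen % self.sample_every == 0:
                self.sampled_chars += 1
                self.sampled_bytes += len(ch.encode(self.encoding))
            self.seen += 1

    def estimated_bytes(self):
        """Estimated number of bytes consumed for the characters seen so far."""
        if not self.sampled_chars:
            return 0.0
        return self.seen * self.sampled_bytes / self.sampled_chars


# Mostly-Thai sample: each Thai character takes 3 bytes in UTF-8.
text = "สวัสดี\n" * 5000
actual_bytes = len(text.encode("utf-8"))

est = BytesPerCharEstimator("utf-8")
est.feed(text)
relative_error = abs(est.estimated_bytes() - actual_bytes) / actual_bytes
```

Dividing `est.estimated_bytes()` by the file size in bytes would then give a progress fraction on the same scale as the goal. A per-character loop like this one is slow in pure Python; it is written for clarity, not speed.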
For text files encoded with more than one byte per character, `FileProgress` undercounts loading progress.

To demonstrate, you can use a UTF-32 file, where every character takes 4 bytes:
The progress only goes up to 25%, not 100%.
That's because `read()` progress is counting the characters, but the goal is measured in bytes. The file is around 7 million characters long, but when encoded in UTF-32, it is 28 million bytes, so even at the end, 7 million/28 million becomes 25%.

This PR changes `FileProgress` to track progress as bytes.
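The mismatch between characters and bytes can be reproduced in isolation with a small sketch. It does not use `FileProgress` or any VisiData code; the functions `char_progress` and `byte_progress` are hypothetical names for illustration, and the sample text is much smaller than the 7-million-character file described above.

```python
import io

# Small stand-in for the real file: 12 characters per line, 1000 lines.
text = "héllo wörld\n" * 1000
data = text.encode("utf-32-le")  # 4 bytes per character, no BOM


def char_progress(data, encoding, chunk_chars=256):
    """Final progress (%) when counting characters against a byte goal."""
    reader = io.TextIOWrapper(io.BytesIO(data), encoding=encoding, newline="")
    done = 0
    while chunk := reader.read(chunk_chars):
        done += len(chunk)  # characters, not bytes
    return 100 * done / len(data)


def byte_progress(data, encoding, chunk_chars=256):
    """Final progress (%) when each chunk is converted back to a byte count."""
    reader = io.TextIOWrapper(io.BytesIO(data), encoding=encoding, newline="")
    done = 0
    while chunk := reader.read(chunk_chars):
        done += len(chunk.encode(encoding))  # bytes this chunk occupied
    return 100 * done / len(data)


print(char_progress(data, "utf-32-le"))  # 25.0
print(byte_progress(data, "utf-32-le"))  # 100.0
```

Re-encoding every chunk is exact but does extra work on each read, which is one reason an implementation might sample a few characters and estimate the average instead.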