feat(RFC): Adds altair.datasets
#3631
base: main
- Allow quickly switching between version tags #3150 (comment)
To support [flights-200k.arrow](https://github.com/vega/vega-datasets/blob/f637f85f6a16f4b551b9e2eb669599cc21d77e69/data/flights-200k.arrow)
Not required for these requests, but may be helpful to avoid limits
As an example, for comparing against the most recent I've added the 5 most recent
- Basic mechanism for discovering new versions
- Tries to minimise both the number and the total size of requests
Experimenting with querying the url cache w/ expressions
- `metadata_full.parquet` stores **all known** file metadata
- `GitHub.refresh()` to maintain integrity in a safe manner
- Roughly 3000 rows
- Single release: **9kb** vs 46 releases: **21kb**
- Still undecided exactly how this functionality should work
- Need to resolve `npm` tags != `gh` tags issue as well
Most subsequent changes are operating on this `TypedDict` directly, as it provides richer info for error handling
- Adds `_exceptions.py` with some initial cases
- Renaming `result` -> `meta`
- Reduced the complexity of `_PyArrowReader`
- Generally, trying to avoid exceptions from 3rd parties, to allow suggesting an alternate path that may work
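One way to read the last point in that commit message is: catch the third-party failure at the boundary and re-raise a project-level error that names an alternative. The following is only a minimal sketch of that idea. `AltairDatasetsError` and `import_backend` are hypothetical names invented for illustration and are not taken from `_exceptions.py`.

```python
from __future__ import annotations

import importlib
from types import ModuleType


class AltairDatasetsError(Exception):
    """Hypothetical project-level error (not the class in `_exceptions.py`)."""


def import_backend(name: str, *, alternatives: tuple[str, ...] = ()) -> ModuleType:
    """Import a backend module, translating a third-party failure into a hint."""
    try:
        return importlib.import_module(name)
    except ImportError as err:
        # Suggest another path instead of surfacing the raw ImportError.
        hint = f" Try one of: {', '.join(alternatives)}." if alternatives else ""
        msg = f"Backend {name!r} is not available.{hint}"
        raise AltairDatasetsError(msg) from err
```

The original exception stays reachable via `__cause__`, so nothing is hidden from anyone debugging the underlying import failure.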
- Using `load` instead of `data`
- Don't mention multi-versions, as that was dropped
- `Application.generate_typing` now mostly populated by `DataPackage` methods
- Docs are defined alongside expressions
- Factored out repetitive code into `spell_literal_alias`
- `Metadata` examples table is now generated inside the doc
- Eliminated all flaky tests
- Mocking more of the internals, which is safer to run in parallel
- Split out non-threadsafe tests with `@no_xdist`
- Huge performance improvement for the slower tests
- Added some helper functions (`is_*`) where common patterns were identified
- **Removed skipping from native `pandas` backend**
  - Confirms that it's now safe without `pyarrow` installed
Using only `.parquet` was relevant in earlier versions that produced multiple `.parquet` files.
Now these methods safely handle all formats in use.
Without a default, I found that VSCode was always suggesting the **last** overload first (`"pyarrow"`).
This is a bad suggestion, as it provides the *worst native* experience.
The default now aligns with the backend providing the *best native* experience.
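The overload layout described in that commit message can be sketched roughly as follows. This is a toy stand-in, not the PR's `Loader`. It assumes `"polars"` is the backend given the default (the commit message does not name it) and that the three backends shown are the full set.

```python
from __future__ import annotations

from typing import Literal, overload


class Loader:
    """Toy stand-in; only the overload layout is the point here."""

    def __init__(self, backend: str) -> None:
        self.backend = backend

    # The overload carrying a default is the one matched by a bare call,
    # so it is listed first. Assumption: "polars" is that backend.
    @overload
    @classmethod
    def from_backend(cls, backend: Literal["polars"] = ...) -> Loader: ...
    @overload
    @classmethod
    def from_backend(cls, backend: Literal["pandas"]) -> Loader: ...
    @overload
    @classmethod
    def from_backend(cls, backend: Literal["pyarrow"]) -> Loader: ...
    @classmethod
    def from_backend(cls, backend: str = "polars") -> Loader:
        return cls(backend)
```

With this shape, `Loader.from_backend()` type-checks and resolves to the first overload, while explicit names still select their own overloads.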
- Multiple **brief** examples, for a taste of the public API
- See (#3763)
- Refs to everywhere a first-time user may need help from
- Also aligned the (`Loader`|`load`) docs with each other and the new phrasing here
- Reduce superfluous docs
- Format/reorganize remaining docs
- Follow up on some comments

Misc style changes
These changes are very minor in VSCode, but fix a lot of rendering issues on the website
load: _Load[Any, Any]
"""
Get a remote dataset and load as tabular data.

For full <kbd>Tab</kbd> completions, instead use::

    from altair.datasets import Loader
    load = Loader.from_backend("polars")
    cars = load("cars")
    movies = load("movies")

Alternatively, specify ``backend`` during a call::

    from altair.datasets import load
    cars = load("cars", backend="polars")
    movies = load("movies", backend="polars")
"""
def url(
    name: Dataset | LiteralString,
    suffix: Extension | None = None,
    /,
) -> str:
    """
    Return the address of a remote dataset.
    Returns
    -------
    ``str``

    .. _Path.stem:
        https://docs.python.org/3/library/pathlib.html#pathlib.PurePath.stem
    .. _Path.suffix:
        https://docs.python.org/3/library/pathlib.html#pathlib.PurePath.suffix
I'm not a fan of `Returns` sections, but sphinx seems to be unable to handle `.. _links` directly after `Parameters`.
Truly surprised this solved the issue
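For reference, the two link targets in the hunk above point at `pathlib` attributes. The snippet below only shows what those attributes return for a file name mentioned earlier in this thread. How `url` actually uses `name` and `suffix` is not visible in the hunk, so the mapping to `url`'s parameters is an assumption.

```python
from pathlib import PurePath

# File name taken from the flights-200k.arrow link earlier in the thread.
fp = PurePath("flights-200k.arrow")

stem = fp.stem      # final path component without its suffix
suffix = fp.suffix  # file extension, including the leading dot
```

Here `stem` is `"flights-200k"` and `suffix` is `".arrow"`.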
Related
Status
Waiting on the next `vega-datasets` release.
Once there is a stable `datapackage.json` available - there is quite a lot of `tools/datasets` that can be simplified/removed.
3.0.0 Release vega-datasets#654
Description
Providing a minimal, but up-to-date source for https://github.com/vega/vega-datasets.
This PR takes a different approach to that of https://github.com/altair-viz/vega_datasets, notably:
- `metadata.parquet`
- `npm` and `github`
- `pandas`
- `"polars"` backend, the slowest I've had on a cache-hit is 0.1s to load

Examples
These all come from the docstrings of:
- `Loader`
- `Loader.from_backend`
- `Loader.__call__`
Tasks
- `altair.datasets`
- `Loader.__call__`
- `_readers._Reader`
- `_typing.Metadata` (Align with revised descriptions from (c572180))
- `tools.datasets`
- `Application`
- `models.py`
- `github.py`
- `npm.py`
- `semver.py`
- `_PyArrowReader` JSON limitation (3fbc759), (4f5b4de)

Resolved
Investigate bundling metadata
`npm` does not have every version available
GitHub
`npm`
`github`, but during testing this was much slower
`npm`
Plan strategy for user-configurable dataset cache
`altair`, each release would simply ship with changes baked in
`altair` package size with datasets
`altair` versions
`ALTAIR_DATASETS_DIR`
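A user-configurable cache keyed on the `ALTAIR_DATASETS_DIR` environment variable could look roughly like the sketch below. `resolve_cache_dir` is a hypothetical helper. Treating an unset variable as "no cache" and creating the directory on demand are both assumptions, not behaviour confirmed by this PR.

```python
from __future__ import annotations

import os
from collections.abc import Mapping
from pathlib import Path

ENV_VAR = "ALTAIR_DATASETS_DIR"  # name taken from the PR description


def resolve_cache_dir(environ: Mapping[str, str] | None = None) -> Path | None:
    """Return the configured cache directory, or None when nothing is configured."""
    env = os.environ if environ is None else environ
    raw = env.get(ENV_VAR, "")
    if not raw:
        # Assumption: an unset or empty variable means caching is disabled.
        return None
    path = Path(raw).expanduser()
    path.mkdir(parents=True, exist_ok=True)
    return path
```

Passing a mapping explicitly keeps the helper testable without mutating the real process environment.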
Deferred
Reducing cache footprint
`.(csv|tsv|json)` files as `.parquet`
Investigate providing a decorator to add a backend
`_name: LiteralString`
`_read_fn: dict[Extension, Callable[..., IntoDataFrameT]]`
`_scan_fn: dict[_ExtensionScan, Callable[..., IntoFrameT]]`
Provide more meaningful info on the state of `ALTAIR_DATASETS_DIR`
`sha` cover?
`nw.Expr.(first|last)`
`nw.Expr.(head|tail)(1)` not equivalent in a `group_by().agg(...)` context
`pandas` -> scalar
`polars` -> list
`pl.Enum` translating to non-ordered `pd.Categorical`
`polars`-native solution