
feat(RFC): Adds altair.datasets #3631

Draft · wants to merge 208 commits into main
Conversation

@dangotbanned (Member) commented Oct 4, 2024

Related

Status

Waiting on the next vega-datasets release.
Once a stable datapackage.json is available, quite a lot of tools/datasets can be simplified or removed.

Description

Providing a minimal, but up-to-date source for https://github.com/vega/vega-datasets.

This PR takes a different approach to that of https://github.com/altair-viz/vega_datasets, notably:

Examples

These all come from the docstrings of:

  • Loader
  • Loader.from_backend
  • Loader.__call__
from altair.datasets import Loader

load = Loader.from_backend("polars")
>>> load
Loader[polars]

cars = load("cars")

>>> type(cars)
polars.dataframe.frame.DataFrame

load = Loader.from_backend("pandas")
cars = load("cars")

>>> type(cars)
pandas.core.frame.DataFrame

load = Loader.from_backend("pandas[pyarrow]")
cars = load("cars")

>>> type(cars)
pandas.core.frame.DataFrame

>>> cars.dtypes
Name                       string[pyarrow]
Miles_per_Gallon           double[pyarrow]
Cylinders                   int64[pyarrow]
Displacement               double[pyarrow]
Horsepower                  int64[pyarrow]
Weight_in_lbs               int64[pyarrow]
Acceleration               double[pyarrow]
Year                timestamp[ns][pyarrow]
Origin                     string[pyarrow]
dtype: object

load = Loader.from_backend("pandas")
source = load("stocks")

>>> source.columns
Index(['symbol', 'date', 'price'], dtype='object')

load = Loader.from_backend("pyarrow")
source = load("stocks")

>>> source.column_names
['symbol', 'date', 'price']
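
Not from the docstrings above, but as a quick illustration of where a loaded frame ends up: it can be passed straight to alt.Chart (the encoding below is arbitrary).

import altair as alt
from altair.datasets import Loader

load = Loader.from_backend("polars")
source = load("cars")

# NOTE: Illustrative spec only - the loaded frame is ordinary tabular data
alt.Chart(source).mark_point().encode(x="Horsepower:Q", y="Miles_per_Gallon:Q")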

Tasks

Resolved

Investigate bundling metadata

  • Investigating bundling metadata (22a5039), (1792340)
    • Depending on how well the compression scales, it might be reasonable to include this for some number of versions
    • Deliberately including redundant info early on - can always chip away at it later

npm does not have every version available (see GitHub)

Plan strategy for user-configurable dataset cache

  • Everything so far has been building the tools for a compact bundled index
    • 1, 2, 3, 4, 5
    • Refreshing the index would not be included in altair; each release would simply ship with the changes baked in
  • Trying to avoid bloating altair package size with datasets
  • User-facing
    • Goal of requesting each unique dataset version once
      • The user cache would not need to be updated between altair versions
    • Some kind of opt-in config to say "store the datasets in this directory, please"
      • A basic solution would be defining an env variable like ALTAIR_DATASETS_DIR (see the sketch after this list)
      • When not provided, always perform remote requests
        • The motivation for users would be that enabling caching is faster
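
A minimal sketch of what that env-variable opt-in could look like; the helper name and exact semantics here are assumptions, not part of this PR.

from __future__ import annotations

import os
from pathlib import Path


# NOTE: Hypothetical helper - name and behaviour are not final
def _resolve_cache_dir() -> Path | None:
    """Return the opt-in cache directory, or None to always perform remote requests."""
    if root := os.environ.get("ALTAIR_DATASETS_DIR"):
        path = Path(root)
        path.mkdir(parents=True, exist_ok=True)
        return path
    return None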

Deferred

Reducing cache footprint

  • e.g. storing the .(csv|tsv|json) files as .parquet (see the sketch after this list)
  • Need to do more testing on this though, to ensure that
    • the shape of each dataset is preserved
    • where relevant, intentional errors remain intact
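
A rough sketch of the kind of conversion being considered, using hypothetical cache paths; note it only checks shape, not the intentional errors mentioned above.

from pathlib import Path

import polars as pl

# NOTE: Hypothetical cache paths, for illustration only
csv_path = Path.home() / ".altair_cache" / "airports.csv"
parquet_path = csv_path.with_suffix(".parquet")

df = pl.read_csv(csv_path)
df.write_parquet(parquet_path)

# NOTE: Shape check only; intentional errors would need a value-level comparison
assert df.shape == pl.read_parquet(parquet_path).shape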

Investigate providing a decorator to add a backend

  • Will be trivial on the user side, since they don't need to be concerned about imports
  • Just need to provide these attributes (see the sketch after this list):
    • _name: LiteralString
    • _read_fn: dict[Extension, Callable[..., IntoDataFrameT]]
    • _scan_fn: dict[_ExtensionScan, Callable[..., IntoFrameT]]
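
A sketch of what a user-defined backend might supply, based on the attribute names above; the class itself and any registration mechanism (decorator or otherwise) are assumptions.

import pandas as pd

# NOTE: Hypothetical user-defined backend - only the three attributes above are sketched
class MyPandasBackend:
    _name = "my-pandas"
    _read_fn = {
        ".csv": pd.read_csv,
        ".json": pd.read_json,
        ".tsv": lambda fp, **kwds: pd.read_csv(fp, sep="\t", **kwds),
        ".parquet": pd.read_parquet,
    }
    # NOTE: pandas has no lazy scan, so the eager reader is reused here
    _scan_fn = {".parquet": pd.read_parquet}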

Provide more meaningful info on the state of ALTAIR_DATASETS_DIR

polars-native solution

from __future__ import annotations

from pathlib import Path

import polars as pl
from altair.datasets import Loader, _readers

data = Loader.from_backend("polars")

# NOTE: Enable caching, populate with some responses
data.cache_dir = Path.home() / ".altair_cache"
data("cars")
data("cars", tag="v1.5.0")
data("movies")
data("movies", tag="v1.24.0")
data("jobs")


if cache_dir := data.cache_dir:
    cached_stems: tuple[str, ...] = tuple(fp.stem for fp in cache_dir.iterdir())
else:
    msg = "Datasets cache unset"
    raise TypeError(msg)

# NOTE: Lots of redundancies, many urls point to the same data (sha)
>>> pl.read_parquet(_readers._METADATA).shape
# (2879, 9)

# NOTE: Version range per sha
tag_sort: pl.Expr = pl.col("tag").sort()
tag_range: pl.Expr = pl.concat_str(tag_sort.first(), tag_sort.last(), separator=" - ")

# NOTE: Producing a name only when the file is already in the cache
file_name: pl.Expr = pl.when(pl.col("sha").is_in(cached_stems)).then(
    pl.concat_str("sha", "suffix")
)

cache_summary: pl.DataFrame = (
    pl.scan_parquet(_readers._METADATA)
    .group_by("dataset_name", "suffix", "size", "sha")
    .agg(tag_range=tag_range)
    .select(pl.exclude("sha"), file_name=file_name)
    .sort("dataset_name", "size")
    .collect()
)

>>> cache_summary.shape
# (116, 5)

>>> cache_summary.head(10)
shape: (10, 5)
┌───────────────┬────────┬─────────┬───────────────────┬─────────────────────────────────┐
│ dataset_name  ┆ suffix ┆ size    ┆ tag_range         ┆ file_name                       │
│ ---           ┆ ---    ┆ ---     ┆ ---               ┆ ---                             │
│ str           ┆ str    ┆ i64     ┆ str               ┆ str                             │
╞═══════════════╪════════╪═════════╪═══════════════════╪═════════════════════════════════╡
│ 7zip          ┆ .png   ┆ 3969    ┆ v1.5.0 - v2.10.0  ┆ null                            │
│ airports      ┆ .csv   ┆ 210365  ┆ v1.5.0 - v2.10.0  ┆ 608ba6d51fa70584c3fa1d31eb9453… │
│ annual-precip ┆ .json  ┆ 266265  ┆ v1.29.0 - v2.10.0 ┆ null                            │
│ anscombe      ┆ .json  ┆ 1703    ┆ v1.5.0 - v2.10.0  ┆ null                            │
│ barley        ┆ .json  ┆ 8487    ┆ v1.5.0 - v2.10.0  ┆ 8dc50de2509b6e197ce95c24c98f90… │
│ birdstrikes   ┆ .csv   ┆ 1223329 ┆ v2.0.0 - v2.10.0  ┆ null                            │
│ birdstrikes   ┆ .json  ┆ 4183924 ┆ v1.5.0 - v1.31.1  ┆ null                            │
│ budget        ┆ .json  ┆ 374289  ┆ v1.5.0 - v2.8.1   ┆ null                            │
│ budget        ┆ .json  ┆ 391353  ┆ v2.9.0 - v2.10.0  ┆ null                            │
│ budgets       ┆ .json  ┆ 18079   ┆ v1.5.0 - v2.10.0  ┆ 8a909e24f698a3b0f6c637c30ec95e… │
└───────────────┴────────┴─────────┴───────────────────┴─────────────────────────────────┘

Not required for these requests, but may be helpful to avoid limits.
As an example, for comparing against the most recent, I've added the 5 most recent:
- Basic mechanism for discovering new versions
- Tries to minimise the number and total size of requests

Experimenting with querying the url cache w/ expressions (sketch below):
- `metadata_full.parquet` stores **all known** file metadata
- `GitHub.refresh()` to maintain integrity in a safe manner
- Roughly 3000 rows
- Single release: **9kb** vs 46 releases: **21kb**
- Still undecided exactly how this functionality should work
- Need to resolve the `npm` tags != `gh` tags issue as well
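
As a rough illustration of that kind of expression-based query, assuming `metadata_full.parquet` ships inside the package (the path and the grouping chosen here are assumptions):

import polars as pl

# NOTE: Hypothetical location for the "all known" file metadata
metadata_full = pl.scan_parquet("altair/datasets/_metadata/metadata_full.parquet")

# NOTE: e.g. count how many file entries each release tag contributes
per_tag = (
    metadata_full
    .group_by("tag")
    .agg(n_files=pl.len())
    .sort("tag", descending=True)
    .collect()
)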
dangotbanned added a commit that referenced this pull request Jan 16, 2025
Originally added in e1290d4

Try to reduce the size of #3631. This change is atomic and useful enough on its own
- Using `load` instead of `data`
- Don't mention multi-versions, as that was dropped
- `Application.generate_typing` now mostly populated by `DataPackage` methods
- Docs are defined alongside expressions
- Factored out repetitive code into `spell_literal_alias`
- `Metadata` examples table is now generated inside the doc
- Eliminated all flaky tests
- Mocking more of the internals, which is safer to run in parallel
- Split out non-threadsafe tests with `@no_xdist`
- Huge performance improvement for the slower tests
- Added some helper functions (`is_*`) where common patterns were identified
- **Removed skipping from native `pandas` backend**
  - Confirms that it's now safe without `pyarrow` installed
Using only `.parquet` was relevant in earlier versions that produced multiple `.parquet` files
Now these methods safely handle all formats in use
Without a default, I found that VSCode was always suggesting the **last** overload first (`"pyarrow"`)
This is a bad suggestion, as it provides the *worst native* experience.

The default now aligns with the backend providing the *best native* experience
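
For context, a simplified stand-in for the overload pattern being described; the real signatures live in the PR and this is illustrative only.

from typing import Literal, overload

from altair.datasets import Loader

# NOTE: Simplified stand-in - without a default, VSCode suggested the last
# overload ("pyarrow") first; giving the first overload a default keeps
# "polars" as the suggestion
@overload
def from_backend(backend: Literal["polars"] = ...) -> Loader: ...
@overload
def from_backend(backend: Literal["pandas", "pandas[pyarrow]", "pyarrow"]) -> Loader: ...
def from_backend(backend: str = "polars") -> Loader:
    return Loader.from_backend(backend)  # type: ignore[arg-type]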
- Multiple **brief** examples, for a taste of the public API
  - See (#3763)
- Refs to everywhere a first-time user may need help
- Also aligned the (`Loader`|`load`) docs w/ each other and the new phrasing here
- Reduce superfluous docs
- Format/reorganize remaining docs
- Follow up on some comments
Misc style changes
These changes are very minor in VSCode, but fix a lot of rendering issues on the website
Comment on lines +84 to +100
load: _Load[Any, Any]
"""
Get a remote dataset and load as tabular data.

For full <kbd>Tab</kbd> completions, instead use::

    from altair.datasets import Loader
    load = Loader.from_backend("polars")
    cars = load("cars")
    movies = load("movies")

Alternatively, specify ``backend`` during a call::

    from altair.datasets import load
    cars = load("cars", backend="polars")
    movies = load("movies", backend="polars")
"""
Comment on lines +103 to +109
def url(
    name: Dataset | LiteralString,
    suffix: Extension | None = None,
    /,
) -> str:
    """
    Return the address of a remote dataset.
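
A minimal sketch of how that could be used, assuming `url` is importable from `altair.datasets` as in this hunk; the returned address can be handed straight to alt.Chart.

import altair as alt
from altair.datasets import url

# NOTE: No local download - the remote address itself is used as the data source
source = url("cars")
alt.Chart(source).mark_point().encode(x="Horsepower:Q", y="Miles_per_Gallon:Q")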

Comment on lines +121 to +128
Returns
-------
``str``

.. _Path.stem:
    https://docs.python.org/3/library/pathlib.html#pathlib.PurePath.stem
.. _Path.suffix:
    https://docs.python.org/3/library/pathlib.html#pathlib.PurePath.suffix

I'm not a fan of Returns sections, but sphinx seems to be unable to handle ``.. _links`` directly after Parameters.

Truly surprised this solved the issue
