
feat(RFC): Adds altair.datasets #3631

Draft · wants to merge 208 commits into main
Conversation

@dangotbanned (Member) commented Oct 4, 2024

Related

Status

Waiting on the next vega-datasets release.
Once a stable datapackage.json is available, quite a lot of tools/datasets can be simplified or removed.

Description

Providing a minimal, but up-to-date source for https://github.com/vega/vega-datasets.

This PR takes a different approach to that of https://github.com/altair-viz/vega_datasets, notably:

Examples

These all come from the docstrings of:

  • Loader
  • Loader.from_backend
  • Loader.__call__
from altair.datasets import Loader

load = Loader.from_backend("polars")
>>> load
Loader[polars]

cars = load("cars")

>>> type(cars)
polars.dataframe.frame.DataFrame

load = Loader.from_backend("pandas")
cars = load("cars")

>>> type(cars)
pandas.core.frame.DataFrame

load = Loader.from_backend("pandas[pyarrow]")
cars = load("cars")

>>> type(cars)
pandas.core.frame.DataFrame

>>> cars.dtypes
Name                       string[pyarrow]
Miles_per_Gallon           double[pyarrow]
Cylinders                   int64[pyarrow]
Displacement               double[pyarrow]
Horsepower                  int64[pyarrow]
Weight_in_lbs               int64[pyarrow]
Acceleration               double[pyarrow]
Year                timestamp[ns][pyarrow]
Origin                     string[pyarrow]
dtype: object

load = Loader.from_backend("pandas")
source = load("stocks")

>>> source.columns
Index(['symbol', 'date', 'price'], dtype='object')

load = Loader.from_backend("pyarrow")
source = load("stocks")

>>> source.column_names
['symbol', 'date', 'price']
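
Not from the docstrings above, but as a quick illustration of where a loaded frame ends up: it can be passed straight to alt.Chart (the encoding below is arbitrary).

import altair as alt
from altair.datasets import Loader

load = Loader.from_backend("polars")
source = load("cars")

# NOTE: Illustrative spec only - the loaded frame is ordinary tabular data
alt.Chart(source).mark_point().encode(x="Horsepower:Q", y="Miles_per_Gallon:Q")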

Tasks

Resolved

Investigate bundling metadata

  • Investigating bundling metadata (22a5039), (1792340)
    • Depending on how well the compression scales, it might be reasonable to include this for some number of versions
    • Deliberately including redundant info early on - can always chip away at it later

npm does not have every version available (see GitHub)

Plan strategy for user-configurable dataset cache

  • Everything so far has been building the tools for a compact bundled index
    • 1, 2, 3, 4, 5
    • Refreshing the index would not be included in altair; each release would simply ship with the changes baked in
  • Trying to avoid bloating altair package size with datasets
  • User-facing
    • Goal of requesting each unique dataset version once
      • The user cache would not need to be updated between altair versions
    • Some kind of opt-in config to say "store the datasets in this directory, please"
      • A basic solution would be defining an env variable like ALTAIR_DATASETS_DIR (see the sketch after this list)
      • When not provided, always perform remote requests
        • The motivation for users would be that enabling caching is faster
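
A minimal sketch of what that env-variable opt-in could look like; the helper name and exact semantics here are assumptions, not part of this PR.

from __future__ import annotations

import os
from pathlib import Path


# NOTE: Hypothetical helper - name and behaviour are not final
def _resolve_cache_dir() -> Path | None:
    """Return the opt-in cache directory, or None to always perform remote requests."""
    if root := os.environ.get("ALTAIR_DATASETS_DIR"):
        path = Path(root)
        path.mkdir(parents=True, exist_ok=True)
        return path
    return None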

Deferred

Reducing cache footprint

  • e.g. storing the .(csv|tsv|json) files as .parquet (see the sketch after this list)
  • Need to do more testing on this though, to ensure that
    • the shape of each dataset is preserved
    • where relevant, intentional errors remain intact
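
A rough sketch of the kind of conversion being considered, using hypothetical cache paths; note it only checks shape, not the intentional errors mentioned above.

from pathlib import Path

import polars as pl

# NOTE: Hypothetical cache paths, for illustration only
csv_path = Path.home() / ".altair_cache" / "airports.csv"
parquet_path = csv_path.with_suffix(".parquet")

df = pl.read_csv(csv_path)
df.write_parquet(parquet_path)

# NOTE: Shape check only; intentional errors would need a value-level comparison
assert df.shape == pl.read_parquet(parquet_path).shape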

Investigate providing a decorator to add a backend

  • Will be trivial on the user side, since they don't need to be concerned about imports
  • Just need to provide these attributes (see the sketch after this list):
    • _name: LiteralString
    • _read_fn: dict[Extension, Callable[..., IntoDataFrameT]]
    • _scan_fn: dict[_ExtensionScan, Callable[..., IntoFrameT]]
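
A sketch of what a user-defined backend might supply, based on the attribute names above; the class itself and any registration mechanism (decorator or otherwise) are assumptions.

import pandas as pd

# NOTE: Hypothetical user-defined backend - only the three attributes above are sketched
class MyPandasBackend:
    _name = "my-pandas"
    _read_fn = {
        ".csv": pd.read_csv,
        ".json": pd.read_json,
        ".tsv": lambda fp, **kwds: pd.read_csv(fp, sep="\t", **kwds),
        ".parquet": pd.read_parquet,
    }
    # NOTE: pandas has no lazy scan, so the eager reader is reused here
    _scan_fn = {".parquet": pd.read_parquet}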

Provide more meaningful info on the state of ALTAIR_DATASETS_DIR

polars-native solution

from __future__ import annotations

from pathlib import Path

import polars as pl
from altair.datasets import Loader, _readers

data = Loader.from_backend("polars")

# NOTE: Enable caching, populate with some responses
data.cache_dir = Path.home() / ".altair_cache"
data("cars")
data("cars", tag="v1.5.0")
data("movies")
data("movies", tag="v1.24.0")
data("jobs")


if cache_dir := data.cache_dir:
    cached_stems: tuple[str, ...] = tuple(fp.stem for fp in cache_dir.iterdir())
else:
    msg = "Datasets cache unset"
    raise TypeError(msg)

# NOTE: Lots of redundancies, many urls point to the same data (sha)
>>> pl.read_parquet(_readers._METADATA).shape
# (2879, 9)

# NOTE: Version range per sha
tag_sort: pl.Expr = pl.col("tag").sort()
tag_range: pl.Expr = pl.concat_str(tag_sort.first(), tag_sort.last(), separator=" - ")

# NOTE: Producing a name only when the file is already in the cache
file_name: pl.Expr = pl.when(pl.col("sha").is_in(cached_stems)).then(
    pl.concat_str("sha", "suffix")
)

cache_summary: pl.DataFrame = (
    pl.scan_parquet(_readers._METADATA)
    .group_by("dataset_name", "suffix", "size", "sha")
    .agg(tag_range=tag_range)
    .select(pl.exclude("sha"), file_name=file_name)
    .sort("dataset_name", "size")
    .collect()
)

>>> cache_summary.shape
# (116, 5)

>>> cache_summary.head(10)
shape: (10, 5)
┌───────────────┬────────┬─────────┬───────────────────┬─────────────────────────────────┐
│ dataset_name  ┆ suffix ┆ size    ┆ tag_range         ┆ file_name                       │
│ ---           ┆ ---    ┆ ---     ┆ ---               ┆ ---                             │
│ str           ┆ str    ┆ i64     ┆ str               ┆ str                             │
╞═══════════════╪════════╪═════════╪═══════════════════╪═════════════════════════════════╡
│ 7zip          ┆ .png   ┆ 3969    ┆ v1.5.0 - v2.10.0  ┆ null                            │
│ airports      ┆ .csv   ┆ 210365  ┆ v1.5.0 - v2.10.0  ┆ 608ba6d51fa70584c3fa1d31eb9453… │
│ annual-precip ┆ .json  ┆ 266265  ┆ v1.29.0 - v2.10.0 ┆ null                            │
│ anscombe      ┆ .json  ┆ 1703    ┆ v1.5.0 - v2.10.0  ┆ null                            │
│ barley        ┆ .json  ┆ 8487    ┆ v1.5.0 - v2.10.0  ┆ 8dc50de2509b6e197ce95c24c98f90… │
│ birdstrikes   ┆ .csv   ┆ 1223329 ┆ v2.0.0 - v2.10.0  ┆ null                            │
│ birdstrikes   ┆ .json  ┆ 4183924 ┆ v1.5.0 - v1.31.1  ┆ null                            │
│ budget        ┆ .json  ┆ 374289  ┆ v1.5.0 - v2.8.1   ┆ null                            │
│ budget        ┆ .json  ┆ 391353  ┆ v2.9.0 - v2.10.0  ┆ null                            │
│ budgets       ┆ .json  ┆ 18079   ┆ v1.5.0 - v2.10.0  ┆ 8a909e24f698a3b0f6c637c30ec95e… │
└───────────────┴────────┴─────────┴───────────────────┴─────────────────────────────────┘

Not required for these requests, but may be helpful to avoid limits.
As an example, for comparing against the most recent, I've added the 5 most recent:
- Basic mechanism for discovering new versions
- Tries to minimise the number and total size of requests

Experimenting with querying the url cache w/ expressions (sketch below):
- `metadata_full.parquet` stores **all known** file metadata
- `GitHub.refresh()` to maintain integrity in a safe manner
- Roughly 3000 rows
- Single release: **9kb** vs 46 releases: **21kb**
- Still undecided exactly how this functionality should work
- Need to resolve the `npm` tags != `gh` tags issue as well
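
As a rough illustration of that kind of expression-based query, assuming `metadata_full.parquet` ships inside the package (the path and the grouping chosen here are assumptions):

import polars as pl

# NOTE: Hypothetical location for the "all known" file metadata
metadata_full = pl.scan_parquet("altair/datasets/_metadata/metadata_full.parquet")

# NOTE: e.g. count how many file entries each release tag contributes
per_tag = (
    metadata_full
    .group_by("tag")
    .agg(n_files=pl.len())
    .sort("tag", descending=True)
    .collect()
)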
dangotbanned added a commit that referenced this pull request Jan 16, 2025
Originally added in e1290d4

Try to reduce the size of #3631. This change is atomic and useful enough on its own
- Using `load` instead of `data`
- Don't mention multi-versions, as that was dropped
- `Application.generate_typing` now mostly populated by `DataPackage` methods
- Docs are defined alongside expressions
- Factored out repetitive code into `spell_literal_alias`
- `Metadata` examples table is now generated inside the doc
- Eliminated all flaky tests
- Mocking more of the internals, which is safer to run in parallel
- Split out non-threadsafe tests with `@no_xdist`
- Huge performance improvement for the slower tests
- Added some helper functions (`is_*`) where common patterns were identified
- **Removed skipping from native `pandas` backend**
  - Confirms that it's now safe without `pyarrow` installed
Using only `.parquet` was relevant in earlier versions that produced multiple `.parquet` files
Now these methods safely handle all formats in use
Without a default, I found that VSCode was always suggesting the **last** overload first (`"pyarrow"`)
This is a bad suggestion, as it provides the *worst native* experience.

The default now aligns with the backend providing the *best native* experience
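
For context, a simplified stand-in for the overload pattern being described; the real signatures live in the PR and this is illustrative only.

from typing import Literal, overload

from altair.datasets import Loader

# NOTE: Simplified stand-in - without a default, VSCode suggested the last
# overload ("pyarrow") first; giving the first overload a default keeps
# "polars" as the suggestion
@overload
def from_backend(backend: Literal["polars"] = ...) -> Loader: ...
@overload
def from_backend(backend: Literal["pandas", "pandas[pyarrow]", "pyarrow"]) -> Loader: ...
def from_backend(backend: str = "polars") -> Loader:
    return Loader.from_backend(backend)  # type: ignore[arg-type]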
- Multiple **brief** examples, for a taste of the public API
  - See (#3763)
- Refs to everywhere a first-time user may need help
- Also aligned the (`Loader`|`load`) docs w/ each other and the new phrasing here
- Reduce superfluous docs
- Format/reorganize remaining docs
- Follow up on some comments
Misc style changes
These changes are very minor in VSCode, but fix a lot of rendering issues on the website
Comment on lines +84 to +100
load: _Load[Any, Any]
"""
Get a remote dataset and load as tabular data.

For full <kbd>Tab</kbd> completions, instead use::

    from altair.datasets import Loader
    load = Loader.from_backend("polars")
    cars = load("cars")
    movies = load("movies")

Alternatively, specify ``backend`` during a call::

    from altair.datasets import load
    cars = load("cars", backend="polars")
    movies = load("movies", backend="polars")
"""
Comment on lines +103 to +109
def url(
    name: Dataset | LiteralString,
    suffix: Extension | None = None,
    /,
) -> str:
    """
    Return the address of a remote dataset.
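
A minimal sketch of how that could be used, assuming `url` is importable from `altair.datasets` as in this hunk; the returned address can be handed straight to alt.Chart.

import altair as alt
from altair.datasets import url

# NOTE: No local download - the remote address itself is used as the data source
source = url("cars")
alt.Chart(source).mark_point().encode(x="Horsepower:Q", y="Miles_per_Gallon:Q")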

Comment on lines +121 to +128
Returns
-------
``str``

.. _Path.stem:
    https://docs.python.org/3/library/pathlib.html#pathlib.PurePath.stem
.. _Path.suffix:
    https://docs.python.org/3/library/pathlib.html#pathlib.PurePath.suffix

I'm not a fan of Returns sections, but sphinx seems to be unable to handle ``.. _links`` directly after Parameters.

Truly surprised this solved the issue
