Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend and simplify API for calculation of range-based rolling window offsets #17807

Draft
wants to merge 7 commits into
base: branch-25.04
Choose a base branch
from

Conversation

wence-
Copy link
Contributor

@wence- wence- commented Jan 24, 2025

Description

In both cudf.pandas and cudf-polars we would like to be able to use the existing libcudf functionality for computing the preceding and following columns for range-based (grouped) rolling windows.

The current functionality is designed with spark in mind and supports calculations with slightly different requirements compared to pandas and polars.

In this PR, we unify the construction of these offset column calculations to satisfy all uses.

Specifically:

  1. We introduce a new window type: a BOUNDED_OPEN window. This is a range-based window where the endpoint of the range is not included.
  2. We add support for negative values of the window range offset parameter. The existing code only supports non-negative values.

The proposed interface for describing the requested window-type going forward is a strongly-typed one using std::variant. This removes the need for default-constructed scalars when specifying UNBOUNDED and CURRENT_ROW windows.

Performance improvements

Spark permits nulls in the orderby column. In the grouped-rolling case, these nulls must be sorted at one end of each group. The current implementation finds the partition point in each group using thrust::for_each over each group and thrust::partition_point to find the break. For low-cardinality grouped rolling aggregations this can be very slow, since we have only a single thread searching over each group. We replace this with a segmented sum with CUB to find the per-group null count (and hence the break).

The dispatched functor for computing the bound given a row value has constexpr dispatch for the common, and required for pandas and polars, case that the orderby column has no-nulls. This shaves some milliseconds.

In the case that the orderby column does have nulls, we extend the interface such that the caller (who must have arranged that it is sorted correctly) tells us whether nulls were sorted BEFORE or AFTER. This avoids multiple kernel launches to deduce the null order, saving some milliseconds, and (in the grouped case) memory footprint. We polyfill this deduction so that the existing interface still works until we can deprecate and remove it.

Guide for review

The core logic is implemented in range_utils.cuh, specifically the range_window_clamper, dispatched to by make_range_window. To keep compile times under control, the template instantiation for computing preceding and following windows is moved into separate translation units: locally compilation takes ~10mins/TU. Things could be split out further if deemed necessary.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

These demonstrate the O(m*n) behaviour of rolling aggregations for
columns of size n and windows of length m.
In constrast to fixed offset rolling windows these must also search an
orderby column to find the window bounds.
Copy link

copy-pr-bot bot commented Jan 24, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue labels Jan 24, 2025
@wence- wence- added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Jan 24, 2025
@wence-
Copy link
Contributor Author

wence- commented Jan 24, 2025

In draft because waiting for #17787

@wence- wence- changed the base branch from branch-25.02 to branch-25.04 January 24, 2025 17:54
@mythrocks
Copy link
Contributor

For low-cardinality grouped rolling aggregations this can be very slow, since we have only a single thread searching over each group. We replace this with a segmented sum with CUB to find the per-group null count (and hence the break).

This is brilliant.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CMake CMake build issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
2 participants