[DO NOT MERGE] cudf-polars chunked parquet reader #16789

Draft
wants to merge 46 commits into
base: branch-24.08


brandon-b-miller
Contributor

Test PR to generate some wheels exploring some I/O functionality on top of feature/cudf-polars.

Do not merge.

wence- and others added 30 commits July 29, 2024 10:48
…-config

Use new polars engine config object in cudf-polars callback
## Description

Adapts to IR changes in polars 1.4 and handles nrows/skiprows a little
more correctly.

## Checklist
- [ ] I am familiar with the [Contributing
Guidelines](https://github.com/rapidsai/cudf/blob/HEAD/CONTRIBUTING.md).
- [ ] New or existing tests cover these changes.
- [ ] The documentation is up to date with these changes.

---------

Co-authored-by: Lawrence Mitchell <[email protected]>
Add support for ``pl.col.str.replace`` and ``pl.col.str.replace_many``

Authors:
  - Thomas Li (https://github.com/lithomas1)

Approvers: None

URL: rapidsai#16039
…ai#16509)

contributes to rapidsai#16478

This implements "cum_min", "cum_max", "cum_prod", "cum_sum"

"cum_count" is not implemented for now, since there's no exact libcudf match (I imagine the non-grouped case is also not used that much but haven't checked).
I suppose we could implement it by creating a column of 1s and copying the null mask over, and doing a cum_sum on that.
Let me know if you want to try that.

Authors:
  - Thomas Li (https://github.com/lithomas1)

Approvers:
  - https://github.com/brandon-b-miller

URL: rapidsai#16509
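
The `cum_count` idea floated in this commit message (a column of 1s with the null mask copied over, then a cumulative sum) can be sketched in plain Python. This is only an illustration of the arithmetic, not libcudf or cudf-polars code: `cum_count` here is a hypothetical helper, nulls are modelled as `None`, and it assumes the intended result is the number of non-null entries seen so far.

```python
from itertools import accumulate


def cum_count(values):
    """Running count of non-null entries (hypothetical helper, not a cudf API)."""
    # "Column of 1s with the null mask copied over": 1 where the input is
    # valid, 0 where it is null, so nulls contribute nothing to the sum.
    ones_with_mask = [0 if v is None else 1 for v in values]
    # A cumulative sum over those flags gives the running non-null count.
    return list(accumulate(ones_with_mask))


counts = cum_count([10, None, 30, 40, None])
```

How a real implementation treats the output rows at null positions would depend on the null policy of the underlying scan, which this sketch does not model.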
…olumn

Use a key column rather than a placeholder for count agg
…16596)

polars.from_arrow renames empty column names (see
pola-rs/polars#11632). This causes problems
when round-tripping specially crafted dataframes. Avoid the problem by
constructing the table with fake names and then renaming.
Add support for additional unaryops through `cudf-polars`. 

Closes rapidsai#16566

---------

Co-authored-by: Lawrence Mitchell <[email protected]>
Add support for string `strip` in `pylibcudf` and `cudf-polars`.

---------

Co-authored-by: Lawrence Mitchell <[email protected]>
This PR adds datetime/timestamp parsing from string columns in pylibcudf
and cudf-polars.

Closes rapidsai#16174
Support `pl.Expr.quantile` in cudf-polars.

---------

Co-authored-by: Vyas Ramasubramani <[email protected]>
Since the full-frame `Agg` handler for first and last doesn't construct
a request (because we can do it without a `from_scalar` call), we didn't
handle these in a groupby context. Fortunately it is easy to add.
We were previously not calling the superclass __post_init__ in custom
validations of IR nodes. This meant that we would sometimes fail to
raise when the schema contained an EMPTY column.

Since we can't really compute with these types, we just fall back.
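
A minimal sketch of the failure mode described in this commit message, assuming IR nodes are dataclasses whose base class validates the schema in `__post_init__`. The class names, the string dtype labels, and the use of `NotImplementedError` as the fallback signal are illustrative assumptions, not the actual cudf-polars definitions.

```python
from dataclasses import dataclass


@dataclass
class IR:
    # Hypothetical schema representation: column name -> dtype label.
    schema: dict

    def __post_init__(self):
        # Base validation: refuse any schema that contains an EMPTY column.
        if any(dtype == "EMPTY" for dtype in self.schema.values()):
            raise NotImplementedError("EMPTY dtype in schema")


@dataclass
class Scan(IR):
    n_rows: int = -1

    def __post_init__(self):
        # Without this call, the subclass override silently replaces the
        # base-class check and an EMPTY column slips through.
        super().__post_init__()
        if self.n_rows < -1:
            raise NotImplementedError("unsupported n_rows")
```

With the `super().__post_init__()` call in place, `Scan({"a": "EMPTY"})` raises, while `Scan({"a": "INT64"})` constructs normally.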
wence- and others added 8 commits September 4, 2024 14:15
…sai#16476)

We previously didn't support this case correctly, but it's not too bad.

This would be much easier if we could do it in libcudf, hence:
rapidsai#16475
Correctly handle `pow` and `log` by translating to binary expressions
when we observe the node.

Upgrade our minimum supported polars version (so that we see all these
function names from the rust IR).

Also tighten check for which groupby-aggs are supported when the
expression contains a unary function.
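
The translation described in this commit message can be sketched with a toy expression tree. Everything here is hypothetical: the node classes, the `"POW"` and `"LOG_BASE"` operator labels, and the assumption that `pow` arrives with two children while `log` carries its base as an option are illustrative choices, not the real cudf-polars IR.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Col:
    name: str


@dataclass(frozen=True)
class Lit:
    value: float


@dataclass(frozen=True)
class Function:
    name: str
    options: tuple
    children: tuple


@dataclass(frozen=True)
class BinOp:
    op: str
    left: object
    right: object


def translate(node):
    """Rewrite pow/log function nodes as binary expressions; pass others through."""
    if isinstance(node, Function):
        if node.name == "pow":
            base, exponent = node.children
            return BinOp("POW", base, exponent)
        if node.name == "log":
            (base,) = node.options
            (child,) = node.children
            # The base becomes a literal right-hand operand.
            return BinOp("LOG_BASE", child, Lit(base))
    return node


def evaluate(node, row):
    """Evaluate a translated expression against one row (a dict of values)."""
    if isinstance(node, Col):
        return row[node.name]
    if isinstance(node, Lit):
        return node.value
    if isinstance(node, BinOp):
        left, right = evaluate(node.left, row), evaluate(node.right, row)
        if node.op == "POW":
            return left ** right
        if node.op == "LOG_BASE":
            return math.log(left, right)
    raise NotImplementedError(f"cannot evaluate {node!r}")
```

Doing the rewrite at the point where the node is observed means the evaluator only needs to understand binary operators, which is the appeal of the approach.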
Add support for unpivoting a DataFrame. We raise for cases where the
concatenation of the value columns produces a cast that is not supported
by standard fixed-width unary casting.
Reject two more edge cases that we do not support.

We could easily support the case where the parquet read just needs to
read the metadata, but it is low priority, so we have not done so here.
---------

Co-authored-by: brandon-b-miller <[email protected]>
Co-authored-by: Bradley Dice <[email protected]>
Co-authored-by: Lawrence Mitchell <[email protected]>
…ai#16755)

This field renaming was due to a recent refactor in (as-yet-unreleased)
polars 1.7.
## Description

We implement a small pytest plugin that defaults the polars engine to
GPU (by monkeypatching `LazyFrame.collect`, yet another reason to have a
global default somehow).

As well as this, we collate all the known failures and classify them.
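
As a rough sketch of the monkeypatching idea in this description, the helper below wraps a class's `collect` method so that calls without an explicit `engine` get a default. `default_engine` and `FakeLazyFrame` are hypothetical names; the stand-in class exists only so the example runs without polars or a GPU, and the real plugin may well be structured differently (for instance, applying the patch from a pytest hook).

```python
import functools


def default_engine(cls, engine):
    """Patch cls.collect so that omitted `engine` arguments default to `engine`."""
    original = cls.collect

    @functools.wraps(original)
    def collect(self, *args, **kwargs):
        # Only fill in the engine when the caller did not choose one.
        kwargs.setdefault("engine", engine)
        return original(self, *args, **kwargs)

    cls.collect = collect
    return original  # returned so the caller can restore it later


class FakeLazyFrame:
    """Stand-in for polars.LazyFrame (assumption: collect takes an engine keyword)."""

    def collect(self, *, engine="cpu"):
        return f"collected on {engine}"


original_collect = default_engine(FakeLazyFrame, "gpu")
result = FakeLazyFrame().collect()
```

Explicit choices still win: `FakeLazyFrame().collect(engine="cpu")` bypasses the patched default, and assigning `original_collect` back to `FakeLazyFrame.collect` undoes the patch.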


## Checklist
- [x] I am familiar with the [Contributing
Guidelines](https://github.com/rapidsai/cudf/blob/HEAD/CONTRIBUTING.md).
- [x] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes.
@brandon-b-miller added the "5 - DO NOT MERGE" label Sep 10, 2024

copy-pr-bot bot commented Sep 10, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions bot added the libcudf, Python, CMake, cudf.polars, and pylibcudf labels Sep 10, 2024
@brandon-b-miller added the feature request and non-breaking labels and removed the libcudf, Python, CMake, cudf.polars, and pylibcudf labels Sep 10, 2024
@brandon-b-miller
Contributor Author

/ok to test

@brandon-b-miller
Contributor Author

/ok to test

@github-actions bot added the libcudf, Python, CMake, cudf.polars, and pylibcudf labels Sep 16, 2024
Labels
- 5 - DO NOT MERGE: Hold off on merging; see PR for details
- CMake: CMake build issue
- cudf.polars: Issues specific to cudf.polars
- feature request: New feature or request
- libcudf: Affects libcudf (C++/CUDA) code.
- non-breaking: Non-breaking change
- pylibcudf: Issues specific to the pylibcudf package
- Python: Affects Python cuDF API.

4 participants