Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

handle nan and inf float values #249

Merged
merged 5 commits into from
Aug 9, 2024
Merged

handle nan and inf float values #249

merged 5 commits into from
Aug 9, 2024

Conversation

dberenbaum
Copy link
Collaborator

@dberenbaum dberenbaum commented Aug 7, 2024

Partial fix for https://github.com/iterative/dvcx/issues/1697#issuecomment-2269457257:

This PR treats all missing values in float columns as NaN for a couple reasons:

  1. Consistency with the pydata ecosystem. Numpy and pandas both treat all missing values in float columns as NaN.
  2. Sqlite does not differentiate between null and NaN (see sqlite discussions and source code), although CH does.

@dberenbaum dberenbaum marked this pull request as draft August 7, 2024 12:22
Copy link

cloudflare-workers-and-pages bot commented Aug 7, 2024

Deploying datachain-documentation with  Cloudflare Pages  Cloudflare Pages

Latest commit: f909bc7
Status: ✅  Deploy successful!
Preview URL: https://c4808f6d.datachain-documentation.pages.dev
Branch Preview URL: https://nan-inf.datachain-documentation.pages.dev

View logs

Comment on lines +1418 to +1420
convert_options = ConvertOptions(
strings_can_be_null=True, null_values=STR_NA_VALUES
)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This provides pandas-like null handling for csv files

@@ -45,12 +47,12 @@ class TeamMember(BaseModel):


def test_merge_objects(test_session):
skip_if_not_sqlite()
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -103,14 +105,13 @@ def test_merge_similar_objects(test_session):


def test_merge_values(test_session):
skip_if_not_sqlite()
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dberenbaum
Copy link
Collaborator Author

Remaining test failures are either unrelated or require https://github.com/iterative/studio/pull/10429

@dberenbaum dberenbaum marked this pull request as ready for review August 7, 2024 19:30
Copy link
Member

@dmpetrov dmpetrov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really cool 🪄
I didn't expect this can support this as naturally.



def test_from_values_nan_inf(tmp_dir, catalog):
vals = [float("nan"), float("inf"), float("-inf")]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's impressive!

@dberenbaum
Copy link
Collaborator Author

Going to merge but we should follow up on arrays and CH joins

@dberenbaum dberenbaum merged commit 8e0f970 into main Aug 9, 2024
31 of 36 checks passed
@dberenbaum dberenbaum deleted the nan-inf branch August 9, 2024 20:58
mattseddon added a commit that referenced this pull request Aug 13, 2024
@shcheklein
Copy link
Member

Going to merge but we should follow up on arrays and CH joins

@dberenbaum can we please review and create a ticket in the DataChain repo for this if needed?

@dberenbaum
Copy link
Collaborator Author

  • make sure Array(Float) support that - needs follow-up since orjson does not support nan/inf

Opened #386. CH was fixed by https://github.com/iterative/studio/pull/10471.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants