-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(parsing): normalize nested column names #542
Conversation
0e59d41
to
394c40e
Compare
Deploying datachain-documentation with Cloudflare Pages
|
394c40e
to
6ecc5b0
Compare
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #542 +/- ##
==========================================
+ Coverage 87.36% 87.37% +0.01%
==========================================
Files 97 97
Lines 10168 10178 +10
Branches 1390 1391 +1
==========================================
+ Hits 8883 8893 +10
Misses 922 922
Partials 363 363
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
24130a7
to
3b8c1d8
Compare
This comment was marked as off-topic.
This comment was marked as off-topic.
f59ceec
to
e79d6fa
Compare
any other practical improvements left @0x2b3bfa0 ? |
e79d6fa
to
ba5dd56
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made a couple PEP-20 suggestions, but looks good to me!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a good improvement to me 👍
I can’t think of any cases where this change would cause issues.
Co-authored-by: Helio Machado <[email protected]>
Co-authored-by: Helio Machado <[email protected]>
7219876
to
5c13afd
Compare
@shcheklein, are tests failing due to my last changes, or is it an unrelated failure? |
no, it's an unfortunate coincidence - pyarrow 18 got released 8 hours ago test is not reliable, I'll improve it a bit |
This reverts commit 7146527.
Part of the #481
Fixes a few issue with nested column names. We had before a function that was normalizing the top level schema names in a single place here.
It means that if you have a JSON file to parse with
parse_tabular
(yes, it supports JSONs as well), or any other nested structure where we create Pydantic models for nested fields withdict_to_data_model
we either had a bunch of non-normalized names except for the top level, or we could end up with a syntax error.E.g. I had a JSON like this:
Particularly, dealing with a field
WARC-Concurrent-To
is problematic.sqlalchmey
fails to quote bound params inINSERT ... VALUES ...
and-TO
triggers a syntax error, as well as-SELECT
for example.With this PR, we are applying the same rules to all column names - top level and nested. It requires making an alias on Pydantic model, but I hope that is fine.
TODO: