fix(parsing): normalize nested column names #542

Merged
merged 4 commits into from
Oct 28, 2024

Conversation

shcheklein
Member

@shcheklein shcheklein commented Oct 26, 2024

Part of #481

Fixes a few issues with nested column names. Previously, a single function normalized only the top-level schema names, in one place here.

This means that when you parse a JSON file with parse_tabular (yes, it supports JSON as well), or any other nested structure for which we create Pydantic models for nested fields with dict_to_data_model, we either ended up with non-normalized names everywhere except the top level, or we could hit a syntax error.

E.g. I had a JSON like this:

{
  "bff_contained_ngram_count_before_dedupe": 0,
  "language_id_whole_page_fasttext": {
    "en": 0.9512948989868164
  },
  "metadata": {
    "Content-Length": "112946",
    "Content-Type": "application/http; msgtype=response",
    "WARC-Block-Digest": "sha1:U462KE2IDQWYDJQDRXIP3UVH465G2I2G",
    "WARC-Concurrent-To": "<urn:uuid:eaf4d817-f140-4a5e-9fa9-64d0dcb32a08>",
    "WARC-Date": "2017-05-27T21:20:09Z",
    "WARC-IP-Address": "198.57.149.47",
    "WARC-Identified-Payload-Type": "text/html",
    "WARC-Payload-Digest": "sha1:KAFZ77S3HPJ7SOABQ4ZMMVT47OCXZR6R",
    "WARC-Record-ID": "<urn:uuid:f8557905-b0dd-47b9-bc60-3106b3c18b4c>",
    "WARC-Target-URI": "http://www.muslimlinkpaper.com/index.php/editors-desk/13-letter-to-the-editor/3054-a-letter-to-president-obama.html",
    "WARC-Truncated": "length",
    "WARC-Type": "response",
    "WARC-Warcinfo-ID": "<urn:uuid:398fb229-e8ed-4f8b-86af-00eb8d6b2594>"
  },
  "previous_word_count": 178,
  "text": "A Letter ...",
  "url": "http://www.muslimlinkpaper.com/index.php/editors-desk/13-letter-to-the-editor/3054-a-letter-to-president-obama.html",
  "warcinfo": "robots: classic\r\nhostname: ip-10-185-224-210.ec2.internal\r\nsoftware: Nutch 1.6 (CC)/CC WarcExport 1.0\r\nisPartOf: CC-MAIN-2017-22\r\noperator: Common Crawl Admin\r\ndescription: Wide crawl of the web for May 2017\r\npublisher: Common Crawl\r\nformat: WARC File Format 1.0\r\nconformsTo: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf",
  "fasttext_openhermes_reddit_eli5_vs_rw_v2_bigram_200k_train_prob": 0.967424213886261
}

In particular, a field like WARC-Concurrent-To is problematic. SQLAlchemy fails to quote bound params in INSERT ... VALUES ..., so -TO triggers a syntax error, as would -SELECT, for example.

With this PR, we apply the same rules to all column names, both top-level and nested. This requires adding an alias on the Pydantic model, but I hope that is fine.
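To illustrate the idea of applying one rule at every nesting level, here is a minimal stdlib-only sketch. It is not the code from this PR: `normalize_name`, `normalize_keys`, and the specific rule (lowercase, then collapse each run of non-alphanumeric characters into one underscore) are assumptions made for the example. The `aliases` map stands in for the alias that would be attached to the generated Pydantic field so the original key can still be matched.

```python
import re
from typing import Any


def normalize_name(name: str) -> str:
    # Hypothetical rule, not necessarily DataChain's: lowercase the name and
    # collapse each run of non-alphanumeric characters into one underscore.
    return re.sub(r"[^0-9a-z]+", "_", name.lower())


def normalize_keys(value: Any, aliases: dict[str, str], prefix: str = "") -> Any:
    # Recurse into dicts so nested keys get the same treatment as top-level ones.
    if not isinstance(value, dict):
        return value
    result = {}
    for key, item in value.items():
        new_key = normalize_name(key)
        path = f"{prefix}{new_key}"
        if new_key != key:
            # Remember the original name, the way a field alias would.
            aliases[path] = key
        result[new_key] = normalize_keys(item, aliases, prefix=f"{path}.")
    return result


record = {
    "previous_word_count": 178,
    "metadata": {
        "Content-Length": "112946",
        "WARC-Concurrent-To": "<urn:uuid:eaf4d817-f140-4a5e-9fa9-64d0dcb32a08>",
    },
}
aliases: dict[str, str] = {}
normalized = normalize_keys(record, aliases)
```

After this runs, `normalized["metadata"]` has the keys `content_length` and `warc_concurrent_to`, and `aliases` maps `metadata.warc_concurrent_to` back to `WARC-Concurrent-To`. The sketch does not handle two keys that normalize to the same name; a real implementation would need to detect that collision.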

TODO:

  • Add more unit tests for the new name normalization function
  • To @0x2b3bfa0's point: make it even stricter (make the names valid Python identifiers, e.g. they can't start with a number), at least when we generate a Pydantic model from them.
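The stricter rule from the second TODO item could look roughly like the sketch below. This is an assumption about one possible approach, not what the PR implements: `to_identifier` is a hypothetical helper, and the `c_` prefix and trailing underscore for keywords are arbitrary choices.

```python
import keyword
import re


def to_identifier(name: str) -> str:
    # Replace every run of characters outside [a-zA-Z0-9_] with an underscore.
    # re.ASCII keeps \W from accepting non-ASCII "word" characters.
    candidate = re.sub(r"\W+", "_", name.lower(), flags=re.ASCII)
    # Identifiers cannot be empty or start with a digit.
    if not candidate or candidate[0].isdigit():
        candidate = f"c_{candidate}"  # "c_" prefix is an arbitrary choice
    # Reserved words such as "class" are not usable as field names either.
    if keyword.iskeyword(candidate):
        candidate = f"{candidate}_"
    return candidate
```

For example, `to_identifier("200k_train")` gives `c_200k_train` and `to_identifier("class")` gives `class_`, while `WARC-Concurrent-To` still becomes `warc_concurrent_to`.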

cloudflare-workers-and-pages bot commented Oct 26, 2024

Deploying datachain-documentation with Cloudflare Pages

Latest commit: 49a5d00
Status: ✅  Deploy successful!
Preview URL: https://3f58b49c.datachain-documentation.pages.dev
Branch Preview URL: https://normalize-nested-columns.datachain-documentation.pages.dev

codecov bot commented Oct 26, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 87.37%. Comparing base (0eabe20) to head (49a5d00).
Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #542      +/-   ##
==========================================
+ Coverage   87.36%   87.37%   +0.01%     
==========================================
  Files          97       97              
  Lines       10168    10178      +10     
  Branches     1390     1391       +1     
==========================================
+ Hits         8883     8893      +10     
  Misses        922      922              
  Partials      363      363              
Flag Coverage Δ
datachain 87.31% <100.00%> (+0.01%) ⬆️

@shcheklein shcheklein force-pushed the normalize-nested-columns branch 2 times, most recently from 24130a7 to 3b8c1d8 Compare October 27, 2024 02:33
@0x2b3bfa0

This comment was marked as off-topic.

@shcheklein shcheklein force-pushed the normalize-nested-columns branch 2 times, most recently from f59ceec to e79d6fa Compare October 28, 2024 02:15
@shcheklein
Member Author

any other practical improvements left @0x2b3bfa0 ?

Member

@0x2b3bfa0 0x2b3bfa0 left a comment

Made a couple PEP-20 suggestions, but looks good to me!

Contributor

@dreadatour dreadatour left a comment

Looks like a good improvement to me 👍
I can’t think of any cases where this change would cause issues.

@0x2b3bfa0
Member

@shcheklein, are tests failing due to my last changes, or is it an unrelated failure?

@shcheklein
Member Author

> @shcheklein, are tests failing due to my last changes, or is it an unrelated failure?

no, it's an unfortunate coincidence - pyarrow 18 got released 8 hours ago

test is not reliable, I'll improve it a bit

@shcheklein shcheklein merged commit 7146527 into main Oct 28, 2024
38 checks passed
@shcheklein shcheklein deleted the normalize-nested-columns branch October 28, 2024 19:58
samran5 added a commit to samran5/datachain that referenced this pull request Oct 28, 2024
3 participants