When we have a single file (JSONL, or CSV/Parquet with a column of JSONs) we need a way to "explode" those JSONs/dicts into a Pythonic model and store them in DataChain not as a single column, but as multiple columns - one per path in that JSON/dict.
E.g. this is what a JSONL file looks like after a naive parse:
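(The original example output was not preserved, so here is a minimal sketch with made-up data:)

```python
import json

# Hypothetical JSONL rows; each line is one record with a nested "meta" field.
lines = [
    '{"id": 1, "meta": {"author": "alice", "score": 0.9}}',
    '{"id": 2, "meta": {"author": "bob", "score": 0.7}}',
]

rows = [json.loads(line) for line in lines]
print(rows[0])
# {'id': 1, 'meta': {'author': 'alice', 'score': 0.9}}
# "meta" stays one opaque dict instead of becoming
# meta.author and meta.score columns.
```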
Or from a CSV file (mind the meta column):
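(Again a made-up sketch; a naive reader leaves `meta` as a raw JSON string:)

```python
import csv
import io

# Hypothetical CSV where the "meta" column holds a serialized JSON string.
data = 'id,meta\n1,"{""author"": ""alice"", ""score"": 0.9}"\n'
row = next(csv.DictReader(io.StringIO(data)))
print(row["meta"])
# {"author": "alice", "score": 0.9}   <- just a string, not typed columns
```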
There is an obvious way to mitigate this - create a Model class and populate it in a UDF. But that seems annoying and redundant: the model description becomes 2-3x the code of the parser.
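A minimal sketch of that workaround, with a made-up `Meta` model, to show the redundancy:

```python
import json
from pydantic import BaseModel

# Hand-written model that merely restates the JSON layout
# (the redundant part):
class Meta(BaseModel):
    author: str
    score: float

# The actual parsing UDF is a one-liner by comparison:
def parse_meta(raw: str) -> Meta:
    return Meta(**json.loads(raw))
```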
Suggestions
1. `DataChain.explode(C("meta"))`. This one is more or less obvious, but requires creating an extra table (see the sketch after this list).
2. Make functions like `map` and `gen` dynamically figure out the schema and create a Pydantic model as they parse files. This requires a more complicated implementation, but can be faster since it can work in streaming mode. Imagine something like this:
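(The original snippet was not preserved; the following is a hedged sketch of what both proposed interfaces could look like. Every name here - the import path, `from_storage`, `explode`, and `map` inferring its output - is an assumption about the proposal, not an existing API:)

```python
import json
from datachain.lib.dc import C, DataChain  # import path is an assumption

chain = DataChain.from_storage("s3://bucket/data.jsonl")  # assumed helper

# Option 1 (proposed): explode the JSON column into one column per path;
# creates an extra table under the hood.
exploded = chain.explode(C("meta"))

# Option 2 (proposed): map/gen infer the schema and generate a Pydantic
# model while parsing, so they can run in streaming mode.
parsed = chain.map(meta=lambda raw: json.loads(raw))  # output model inferred
```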
This may already be possible with a combination of `read_meta` and `map`, depending on how we want to solve this. (`read_meta` requires a `storage_path` with example data to create a schema, or a Pydantic model passed in.)
How are we going to determine the schema? Is it based on a sample of rows (which is what `read_meta` does - it reads a single row) or on reading all the rows?
> How are we going to determine the schema? Is it based on a sample of rows (which is what `read_meta` does - it reads a single row) or on reading all the rows?

Yes, based on a sample (like we already do in `from_parquet` and friends).
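A minimal sketch of sample-based inference, assuming a single row is representative (the trade-off being accepted here):

```python
import json

def infer_types(sample_line: str) -> dict:
    """Infer a flat column -> type mapping from a single JSONL row."""
    return {key: type(value) for key, value in json.loads(sample_line).items()}

infer_types('{"author": "alice", "score": 0.9}')
# {'author': <class 'str'>, 'score': <class 'float'>}
```

Reading all rows instead would catch things like nullable or heterogeneous fields, at the cost of a full pass over the data.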
> This may already be possible with a combination of `read_meta` and `map`, depending on how we want to solve this.

Yes, the idea is the same, but we need to wrap it into a user-friendly function and maybe generalize it a bit.
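One hedged sketch of such a wrapper; the name `parse_json_column` and the `read_meta` signature are assumptions based only on the description above:

```python
import json

# Hypothetical convenience wrapper combining schema inference and mapping.
# read_meta is assumed to take a storage_path with example data and
# return a Pydantic model, per the description above (not verified).
def parse_json_column(chain, column: str, sample_path: str):
    Model = read_meta(storage_path=sample_path)
    return chain.map(**{column: lambda raw: Model(**json.loads(raw))})
```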
Follow-up: https://github.com/iterative/dvcx/pull/1368
Based also on this discussion / feedback by @tibor-mach https://iterativeai.slack.com/archives/C04A9RWEZBN/p1727194987119179
Based also on the DCLM iteration - https://github.com/iterative/studio/issues/10596