Fields related to specifics of storage layer #4

Open
joshua-d-campbell opened this issue Apr 26, 2022 · 0 comments

Fields such as data_type and representation may be better left to the underlying data structures or storage layer. Is it redundant to record these at the metadata level or is there utility to this?

Based on a previous discussion on Google Docs:

Isaac Virshup:
Should these be part of the semantic metadata? To me they make more sense as part of the data format.

Josh Campbell:
Hi Isaac, it is a really good question! I was originally thinking the same, that they wouldn't need to be recorded here. Someone asked for the representation. If they are included, they will probably be optional.

The place where "data_type" may be useful is for data.frames with mixed types, which happens often for annotation matrices (var, obs). If someone saves the annotation matrices in a flat text file (as was originally done for the HTAN schema), it can be difficult to know whether an underlying integer vector (e.g. 1, 2, 3) is supposed to be categorical or truly integer. This variable could be used to preserve that information (assuming it can be a vector of types the same length as the number of annotations, rather than a single value). Another example of a data matrix that could be of mixed types is a "morphology" matrix generated by spatial workflows. This of course needs a bit more expansion in the description. I will put this down as an issue in the new GitHub repo.
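To make the flat-file ambiguity concrete, here is a minimal sketch using only the Python standard library. It assumes a hypothetical tab-separated `obs` export and a hypothetical `data_type` vector with one entry per column; the column names, the type labels ("string", "categorical", "integer"), and the `read_typed` helper are illustrative and are not part of any schema discussed here.

```python
import csv
import io

# Hypothetical flat-file export of an `obs` annotation table.
# The `cluster` column holds 1, 2, 1: integers on disk, labels in intent.
OBS_TSV = (
    "cell_id\tcluster\tn_genes\n"
    "cell_1\t1\t1200\n"
    "cell_2\t2\t950\n"
    "cell_3\t1\t1800\n"
)

# Hypothetical per-column `data_type` vector, same length as the columns.
DATA_TYPE = ["string", "categorical", "integer"]


def read_typed(text, data_types):
    """Parse a TSV string and coerce each column using its declared type."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    if len(header) != len(data_types):
        raise ValueError("data_type needs one entry per column")

    # Collect raw string values column by column.
    columns = {name: [] for name in header}
    for row in reader:
        for name, value in zip(header, row):
            columns[name].append(value)

    typed = {}
    for name, dtype in zip(header, data_types):
        values = columns[name]
        if dtype == "integer":
            typed[name] = [int(v) for v in values]
        elif dtype == "categorical":
            # Keep the labels as strings and store integer codes into them.
            categories = sorted(set(values))
            typed[name] = {
                "categories": categories,
                "codes": [categories.index(v) for v in values],
            }
        elif dtype == "string":
            typed[name] = values
        else:
            raise ValueError(f"unknown data_type: {dtype!r}")
    return typed


obs = read_typed(OBS_TSV, DATA_TYPE)
```

Without the `data_type` vector, a reader would have to guess whether `cluster` and `n_genes` should be treated the same way, since both look like integers in the text file.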

Isaac Virshup:
Thanks for the quick response and looking forward to this being put up on github.

I'm coming at this from the perspective of having distinct storage and semantic layers. E.g. matrix-api would define the storage layer and FOM defines the semantic conventions and metadata.

To me, data_type and representation seem necessary at the storage level. The library that decodes a column into a factor, pd.Categorical, or Arrow dictionary-encoded value doesn't need to know anything about biology, so this is structural.
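As a rough illustration of why this decoding is structural, the sketch below implements a generic dictionary encoding in plain Python. The `DictionaryEncoded` class and the `encode`/`decode` helpers are hypothetical names chosen for this example; they are not the API of matrix-api, pandas, or Arrow, and only mimic the codes-plus-categories idea those formats share.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DictionaryEncoded:
    """A column stored as distinct categories plus integer codes into them."""
    categories: tuple
    codes: tuple


def encode(values):
    """Dictionary-encode a list, keeping categories in first-seen order."""
    categories = tuple(dict.fromkeys(values))
    index = {category: i for i, category in enumerate(categories)}
    return DictionaryEncoded(categories, tuple(index[v] for v in values))


def decode(column):
    """Rebuild the original values; no knowledge of their meaning is needed."""
    return [column.categories[code] for code in column.codes]


labels = ["T cell", "B cell", "T cell", "NK cell"]
column = encode(labels)
roundtrip = decode(column)
```

Nothing in `encode` or `decode` depends on the values being cell types; the same code would handle any repeated labels, which is the sense in which the type information belongs to the storage layer.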

Maybe this info could be specified, but the semantic layer could defer the specification of data types to the storage layer?
