Fields related to specifics of storage layer #4

joshua-d-campbell · 2022-04-26T14:15:21Z

Fields such as data_type and representation may be better left to the underlying data structures or storage layer. Is it redundant to record these at the metadata level, or is there utility in doing so?

Based on a previous discussion on Google Docs:

Isaac Virshup:
Should these be part of the semantic metadata? To me they make more sense as part of the data format.

Josh Campbell:
Hi Isaac, it is a really good question! I was originally thinking the same, that they wouldn't need to be recorded here. Someone asked for the representation. If they are included, they will probably be optional.

The place where "data_type" may be useful is for data.frames with mixed types, which happens often for annotation matrices (var, obs). If someone saves the annotation matrices in a flat/text file (as was originally done for the HTAN schema), it can be difficult to know whether an underlying integer vector (e.g. 1, 2, 3) is supposed to be categorical or truly integer. This variable could be used to maintain this type of info (assuming it can be a vector of types the same length as the number of annotations, rather than just a single value). An example of a data matrix that could be of mixed types is a "morphology" matrix generated by spatial workflows. This of course needs a bit more expansion in the description. I will put this down as an issue in the new GitHub repo.
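To make the ambiguity concrete, here is a minimal Python sketch using pandas. The column names, the values, and the helper `apply_data_types` are hypothetical and not part of any schema; the sketch only assumes that "data_type" is a vector with one entry per annotation column, as suggested above.

```python
import io

import pandas as pd

# Hypothetical obs annotation table: "cluster" is categorical,
# "n_genes" is a true integer count.
obs = pd.DataFrame({
    "cluster": pd.Categorical([1, 2, 3, 1]),
    "n_genes": [200, 350, 410, 275],
})

# Round-trip through a flat text file (an in-memory CSV here).
buffer = io.StringIO()
obs.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)
# After the round trip, "cluster" is read back as plain integers;
# nothing in the text file says it was categorical.

# Hypothetical "data_type" vector, one entry per annotation column.
data_type = ["categorical", "integer"]


def apply_data_types(df, types):
    """Re-apply the recorded types to a table read from flat text."""
    out = df.copy()
    for column, type_name in zip(df.columns, types):
        if type_name == "categorical":
            out[column] = out[column].astype("category")
    return out


typed = apply_data_types(restored, data_type)
```

With a storage format that records column types itself, the extra vector would be redundant, which is the question this issue is raising.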

Isaac Virshup:
Thanks for the quick response, and looking forward to this being put up on GitHub.

I'm coming at this from the perspective of having distinct storage and semantic layers. E.g. matrix-api would define the storage layer and FOM defines the semantic conventions and metadata.

To me, data_type and representation seem necessary at the storage level. The library that decodes the column into a factor, pd.Categorical, or Arrow dictionary-encoded value doesn't need to know anything about biology – and so this is structural.

Maybe this info could be specified, but the semantic layer could defer the specification of data types to the storage layer?
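One way to picture this deferral is the following minimal Python sketch. It is not a proposal for the actual FOM or matrix-api design: the `semantic` dictionary, its keys, and `resolve_data_type` are all hypothetical names, and a pandas DataFrame stands in for the storage layer.

```python
import pandas as pd

# Storage layer stand-in: the container already knows each column's type.
obs = pd.DataFrame({
    "cluster": pd.Categorical(["1", "2", "3"]),
    "n_genes": [200, 350, 410],
})

# Hypothetical semantic-layer records: descriptions only, with an
# optional data_type that is left unset (None) by default.
semantic = {
    "cluster": {"description": "Cluster label for each cell", "data_type": None},
    "n_genes": {"description": "Genes detected per cell", "data_type": None},
}


def resolve_data_type(column):
    """Use the semantic declaration if present, else defer to storage."""
    declared = semantic[column]["data_type"]
    if declared is not None:
        return declared
    dtype = obs[column].dtype
    if isinstance(dtype, pd.CategoricalDtype):
        return "categorical"
    if pd.api.types.is_integer_dtype(dtype):
        return "integer"
    return str(dtype)
```

In this sketch, `resolve_data_type("cluster")` falls through to the storage layer's dtype, so the semantic record never has to repeat structural information unless someone chooses to specify it.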
