Fields such as data_type and representation may be better left to the underlying data structures or storage layer. Is it redundant to record these at the metadata level or is there utility to this?
Based on a previous discussion on Google Docs:
Isaac Virshup:
Should these be part of the semantic metadata? To me they make more sense as part of the data format.
Josh Campbell:
Hi Isaac, it is a really good question! I was originally thinking the same, that they wouldn't need to be recorded here. Someone asked for the representation. If they are included, they will probably be optional.
The place where "data_type" may be useful is for data.frames with mixed types, which occur often in annotation matrices (var, obs). If someone saves the annotation matrices in a flat/text file (as was originally done for the HTAN schema), it can be difficult to know whether an underlying integer vector (e.g. 1, 2, 3) is supposed to be categorical or truly integer. This variable could be used to preserve that information (assuming it can be a vector of types with the same length as the number of annotations, rather than a single value). An example of a data matrix that could be of mixed types is a "morphology" matrix generated by spatial workflows. This of course needs a bit more expansion in the description. I will put this down as an issue in the new GitHub repo.
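To illustrate the flat-file ambiguity described above, here is a minimal sketch using pandas. The column names (cluster, n_genes) and the data_type mapping are hypothetical, chosen only for illustration; they are not part of the HTAN schema or any proposed specification.

```python
import io

import pandas as pd

# Hypothetical obs annotation table: "cluster" holds categorical labels
# that happen to be integers, "n_genes" is a true integer measurement.
obs = pd.DataFrame(
    {
        "cluster": pd.Categorical([1, 2, 3, 1]),
        "n_genes": [1200, 950, 1800, 1100],
    }
)

# Save to a flat text format (an in-memory buffer stands in for a file).
buffer = io.StringIO()
obs.to_csv(buffer, index=False)

# Reading back with no type information: "cluster" is inferred as integer.
buffer.seek(0)
naive = pd.read_csv(buffer)

# Hypothetical per-annotation data_type record, one entry per column.
data_type = {"cluster": "category", "n_genes": "int64"}

# Reading back with the recorded types restores the categorical column.
buffer.seek(0)
restored = pd.read_csv(buffer, dtype=data_type)

print(naive.dtypes)
print(restored.dtypes)
```

Without the recorded types, nothing in the text file distinguishes the two integer-valued columns; with them, the reader can recover the intended interpretation.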
Isaac Virshup:
Thanks for the quick response. Looking forward to this being put up on GitHub.
I'm coming at this from the perspective of having distinct storage and semantic layers. E.g. matrix-api would define the storage layer and FOM defines the semantic conventions and metadata.
To me, data_type and representation seem necessary at the storage level. The library which decodes the column into a factor, pd.Categorical, or arrow dictionary encoded value doesn't need to know anything about biology – and so this is structural.
Maybe this info could be specified, but the semantic layer could defer the specification of data types to the storage layer?
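The storage/semantic split described above could be sketched roughly as follows. The record layout (stored_column, semantic_metadata) and the decode_column helper are hypothetical illustrations; they do not reflect the actual matrix-api or FOM specifications.

```python
import pandas as pd

# Hypothetical storage-layer record for one column: the type and the
# dictionary encoding live here, next to the data.
stored_column = {
    "data_type": "categorical",
    "codes": [0, 1, 2, 0],
    "categories": ["B cell", "T cell", "NK cell"],
}

# Hypothetical semantic-layer record: describes meaning only and defers
# all type information to the storage layer.
semantic_metadata = {"cell_type": {"description": "annotated cell type"}}


def decode_column(stored):
    """Decode a stored column using structural information only."""
    if stored["data_type"] == "categorical":
        return pd.Categorical.from_codes(
            stored["codes"], categories=stored["categories"]
        )
    raise ValueError(f"unsupported data_type: {stored['data_type']}")


cell_type = decode_column(stored_column)
print(cell_type)
```

Note that decode_column never reads semantic_metadata: the decoder needs only the structural description of the column, which is the sense in which the type information is structural rather than semantic.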