Draft Census schema for support of Visium and Spatial data #1092

Open
7 tasks
pablo-gar opened this issue Apr 5, 2024 · 12 comments · May be fixed by #1245
Labels: census schema (Schema definition or specification), P0 (Priority 0 - Critical, fix ASAP!)

Comments

@pablo-gar
Contributor

pablo-gar commented Apr 5, 2024

LAST EDITED: Aug 29, 2024

See parent Epic for further information.
chanzuckerberg/single-cell#644

See current draft for spatial support in SOMA https://docs.google.com/document/d/1S48pD5XTzDcaLGlq6YVYCoUjptR93PHHHmG79TiJzsA/edit

TODOs

  • Change all version mentions of dataset schema to 5.1.0
  • Update ./census_accepted_assays.csv to include:
    • EFO:0010961 - Visium Spatial Gene Expression
    • EFO:0030062 - Slide-seqV2
    • EFO:0009920 - Slide-seq maybe?
  • Define a diameter for slide-seq data to be stored in spatial[scene_id].obsl["loc"]["soma_geometry"]
  • Define the units of the positions data frame for Slide-seq (Visium uses pixels)

Schema changes


Version: 2.2.0

Last edited: April, 2024.


Data included

All datasets included in the Census MUST be of CELLxGENE dataset schema version 5.1.0. The following data constraints are imposed on top of the CELLxGENE dataset schema.

Editor's note: do this change in all other places where the CELLxGENE dataset schema version is mentioned. For simplicity all other changes are omitted here.


Assays

[...]

The Census MUST include all cells from the list of accepted assays.

These assays were selected with the following criteria:

Only children of "EFO:0002772" or "EFO:0010183" are shown, as this is a constraint imposed by the CELLxGENE dataset schema >3.0.0.

  • Must measure gene expression via RNA sequencing.
  • Can be done at the single-cell level.
  • May include nascent or elongating RNA data.
  • May be targeted to specific genes in an assay-specific manner.
  • May include spatial data only from Visium or Slide-seq.
  • Doesn’t measure spatial data from other assays.
  • Doesn't measure other non-RNA molecules concurrently.
  • Doesn’t require author metadata for correct interpretability (e.g. perturbation-based technologies).
  • Doesn’t intend to primarily measure RNA structure, RNA fusions, RNA modifications, or RNA interactions.
  • Doesn’t intend to primarily measure non-mRNA (e.g. tRNA, rRNA, small RNAs).
  • Doesn’t intend to primarily measure viral RNA.
  • Doesn’t intend to primarily measure introns.
  • Doesn’t do ribosome profiling.

Spatial Assays

Spatial observations MUST be included in Census only if they come from Visium or Slide-seq assays, as indicated in the list of accepted assays. Per the CELLxGENE dataset schema, datasets with spatial observations can be identified by the presence of the slot uns["spatial"]. For these assays, observations MUST be included in Census only if they come from datasets that contain "one Space Ranger output for a single tissue section".

The full logic above can be asserted as follows:

  • If a dataset has the slot uns["spatial"] and uns["spatial"]["is_single"] is True, then all observations MUST be included.
  • If a dataset has the slot uns["spatial"] and uns["spatial"]["is_single"] is False, then all observations MUST be excluded.
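
The two rules above can be sketched as a small predicate. This is only an illustration, not the Census builder's code: include_spatial_observations is a hypothetical helper, and uns is assumed to be the plain dict-like uns slot of a source H5AD. The rules above say nothing about datasets without uns["spatial"], so the sketch raises for them.

```python
from typing import Any, Mapping


def include_spatial_observations(uns: Mapping[str, Any]) -> bool:
    """Return True if all observations of a spatial dataset go into Census.

    Hypothetical helper: `uns` stands for the `uns` slot of a source H5AD.
    """
    if "spatial" not in uns:
        # Not a spatial dataset; the two rules above do not cover it.
        raise ValueError('dataset has no uns["spatial"] slot')
    # bool() also accepts NumPy booleans, which H5AD readers may return.
    return bool(uns["spatial"]["is_single"])
```

With this sketch, a dataset whose uns["spatial"]["is_single"] is False contributes no observations at all.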

Census metadata – census_obj["census_info"]["summary"] – SOMADataFrame

[...]

  1. Total number of cells or spatial spots included in this Census build:
    1. label: "total_cell_count"
    2. value: Cell count
  2. Unique number of cells or spatial spots included in this Census build (is_primary_data == True)
    1. label: "unique_cell_count"
    2. value: Cell count
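
As a rough illustration of how the two summary values relate, the sketch below derives both from the is_primary_data flags of all included observations (cells or spatial spots). summary_counts is a hypothetical helper, and the input is assumed to be a plain iterable of booleans rather than a real obs data frame.

```python
from typing import Dict, Iterable


def summary_counts(is_primary_data: Iterable[bool]) -> Dict[str, int]:
    """Compute the two summary counts from per-observation flags.

    Hypothetical helper: one flag per cell or spatial spot in the build.
    """
    flags = [bool(flag) for flag in is_primary_data]
    return {
        # Every included observation counts toward the total.
        "total_cell_count": len(flags),
        # Only observations with is_primary_data == True are unique.
        "unique_cell_count": sum(flags),
    }
```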

Data encoding and organization

[...]

Census Non-Spatial Data – census_obj["census_data"][organism] – SOMAExperiment

Non-spatial data for Homo sapiens MUST be stored as a SOMAExperiment in census_obj["census_data"]["homo_sapiens"].

Non-spatial data for Mus musculus MUST be stored as a SOMAExperiment in census_obj["census_data"]["mus_musculus"].


Feature dataset presence matrix – census_obj["census_data"][organism].ms["RNA"]["feature_dataset_presence_matrix"] – SOMASparseNDArray

[...]

Census Spatial Sequencing Data – census_obj["census_spatial_sequencing"][organism] – SOMAExperiment

Only Visium and Slide-seq are supported for spatial data. See the "Assays" section above.

Spatial data for Homo sapiens MUST be stored as a SOMAExperiment in census_obj["census_spatial_sequencing"]["homo_sapiens"].

Spatial data for Mus musculus MUST be stored as a SOMAExperiment in census_obj["census_spatial_sequencing"]["mus_musculus"].

For each organism the SOMAExperiment MUST contain the following:

  • Cell metadata – census_obj["census_spatial_sequencing"][organism].obs – SOMADataFrame
  • Non-spatial data – census_obj["census_spatial_sequencing"][organism].ms – SOMACollection. This SOMACollection MUST only contain one SOMAMeasurement in census_obj["census_spatial_sequencing"][organism].ms["RNA"] with the following:
    • Matrix data – census_obj["census_spatial_sequencing"][organism].ms["RNA"].X – SOMACollection. It MUST contain exactly one layer:
      • Count matrix – census_obj["census_spatial_sequencing"][organism].ms["RNA"].X["raw"] – SOMASparseNDArray
    • Feature metadata – census_obj["census_spatial_sequencing"][organism].ms["RNA"].var – SOMADataFrame
    • Feature dataset presence matrix – census_obj["census_spatial_sequencing"][organism].ms["RNA"]["feature_dataset_presence_matrix"] – SOMASparseNDArray
  • Obs to spatial data mapping:
    • Obs to spatial data – census_obj["census_spatial_sequencing"][organism].obs_scene – SOMADataFrame. It indicates the link between an observation and a scene. It MUST have the columns obs_id (corresponding to soma_joinid of obs), scene_id (corresponding to the associated scene), and value, as described in the corresponding section below.
  • Spatial data – census_obj["census_spatial_sequencing"][organism].spatial – SOMACollection.
    • Spatial Scenes with spatial data – census_obj["census_spatial_sequencing"][organism].spatial[scene_soma_joinid] – SOMAScene. There will be as many Spatial Scenes as spatial datasets. The contents of each SOMAScene are as follows:
      • MUST contain a positions array – census_obj["census_spatial_sequencing"][organism].spatial[scene_soma_joinid].obsl["loc"] – SOMAGeometryNDArray. This will contain the spatial array positions for each observation, the geometry points associated with them, and additional metadata.
      • MAY contain a full resolution image – census_obj["census_spatial_sequencing"][organism].spatial[scene_soma_joinid].img[library_id]["fullres_image"] – SOMAImageNDArray.
      • MUST contain a high resolution image – census_obj["census_spatial_sequencing"][organism].spatial[scene_soma_joinid].img[library_id]["highres_image"] – SOMAImageNDArray.
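
To make the layout above easier to scan, the sketch below spells out the required paths for one organism and one Visium scene as plain strings. It does not open a Census or call TileDB-SOMA; expected_spatial_paths and its arguments are hypothetical names used only for illustration, and the optional full resolution image is left out on purpose.

```python
from typing import List


def expected_spatial_paths(organism: str, scene_id: str, library_id: str) -> List[str]:
    """List the required (MUST) paths for one organism and one Visium scene.

    Hypothetical helper: it only builds strings mirroring the draft layout.
    """
    root = f'census_obj["census_spatial_sequencing"]["{organism}"]'
    scene = f'{root}.spatial["{scene_id}"]'
    return [
        f"{root}.obs",
        f'{root}.ms["RNA"].X["raw"]',
        f'{root}.ms["RNA"].var',
        f'{root}.ms["RNA"]["feature_dataset_presence_matrix"]',
        f"{root}.obs_scene",
        f'{scene}.obsl["loc"]',
        # Required for Visium scenes only; fullres_image is optional.
        f'{scene}.img["{library_id}"]["highres_image"]',
    ]
```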

Matrix Data, count (raw) matrix – census_obj["census_spatial_sequencing"][organism].ms["RNA"].X["raw"] – SOMASparseNDArray

Same as non-spatial data. See the corresponding section here.

Feature metadata – census_obj["census_spatial_sequencing"][organism].ms["RNA"].var – SOMADataFrame

Same as non-spatial data. See the corresponding section here.

Feature dataset presence matrix – census_obj["census_spatial_sequencing"][organism].ms["RNA"]["feature_dataset_presence_matrix"] – SOMASparseNDArray

Same as non-spatial data. See the corresponding section here.

Cell metadata – census_obj["census_spatial_sequencing"][organism].obs – SOMADataFrame

Same as non-spatial data. See the corresponding section here.

Important note: In addition, the following spatial obs columns from the CELLxGENE dataset schema MUST be included in this SOMADataFrame:

Column | Encoding and description
array_col | As defined in the CELLxGENE dataset schema
array_row | As defined in the CELLxGENE dataset schema
in_tissue | As defined in the CELLxGENE dataset schema

Obs to spatial mapping – census_obj["census_spatial_sequencing"][organism].obs_scene – SOMADataFrame

It indicates the link between an observation and a scene. Each row corresponds to an observation with the following columns:

Column | Encoding | Description
obs_id | int | It MUST be a valid soma_joinid from census_obj["census_spatial_sequencing"][organism].obs.
scene_id | string | It MUST be a valid scene_id from census_obj["census_spatial_sequencing"][organism].spatial.
value | bool | It MUST be True if the scene contains spatial information about the observation, otherwise it MUST be False.
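
The column constraints above could be checked roughly as follows. This is a sketch under stated assumptions: validate_obs_scene is a hypothetical helper, rows are plain dicts rather than a SOMADataFrame, and the sets of valid obs soma_joinid values and scene IDs are supplied by the caller.

```python
from typing import Any, Iterable, List, Mapping, Set


def validate_obs_scene(
    rows: Iterable[Mapping[str, Any]],
    obs_joinids: Set[int],
    scene_ids: Set[str],
) -> List[str]:
    """Return a list of violations; an empty list means all rows pass.

    Hypothetical helper: rows are dicts with obs_id, scene_id and value keys.
    """
    errors: List[str] = []
    for i, row in enumerate(rows):
        if row["obs_id"] not in obs_joinids:
            errors.append(f"row {i}: obs_id is not a soma_joinid of obs")
        if row["scene_id"] not in scene_ids:
            errors.append(f"row {i}: scene_id is not a scene in spatial")
        if not isinstance(row["value"], bool):
            errors.append(f"row {i}: value is not a bool")
    return errors
```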

Positions array of a Scene – census_obj["census_spatial_sequencing"][organism].spatial[scene_id].obsl["loc"] – SOMAGeometryNDArray

scene_soma_joinid MUST correspond to the soma_joinid values in census_obj["census_spatial_sequencing"][organism].spatial.scenes.

For each observation in each Scene, spatial array positions, the geometry points associated with them, and additional positional metadata MUST be encoded as a SOMAGeometryNDArray.

If Visium ("EFO:0010961"), the units for the spatial array positions are pixels from the high-resolution image (spatial[scene_soma_joinid].img["highres_image"]). Otherwise TBD.

Each row corresponds to an observation with the following columns:

Column | Encoding | Description
X | float | It MUST be the corresponding value in the first column of obsm["spatial"], as defined in the CELLxGENE dataset schema.
Y | float | It MUST be the corresponding value in the second column of obsm["spatial"], as defined in the CELLxGENE dataset schema.
soma_geometry | float | Radius of points: diameter / 2. If Visium ("EFO:0010961"), diameter MUST be uns["spatial"][library_id]['spot_diameter_fullres'], as defined in the CELLxGENE dataset schema. Otherwise TBD-TODO (for Slide-seq it should be 0.003% of the radius occupied by the full cloud of points).
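
For the Visium case, the three columns above can be derived as in the sketch below. It is only an illustration: visium_loc_rows is a hypothetical helper, spatial_coords stands for the rows of obsm["spatial"] as (x, y) pairs, and the Slide-seq case is not covered because it is still TBD.

```python
from typing import Dict, Iterable, List, Tuple


def visium_loc_rows(
    spatial_coords: Iterable[Tuple[float, float]],
    spot_diameter_fullres: float,
) -> List[Dict[str, float]]:
    """Build one row per observation with X, Y and soma_geometry.

    Hypothetical helper: `spatial_coords` mirrors obsm["spatial"], and
    `spot_diameter_fullres` mirrors the value stored under uns["spatial"].
    """
    # soma_geometry holds the radius, i.e. half of the spot diameter.
    radius = spot_diameter_fullres / 2
    return [
        {"X": float(x), "Y": float(y), "soma_geometry": radius}
        for x, y in spatial_coords
    ]
```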

Images of a Scene – census_obj["census_spatial_sequencing"][organism].spatial[scene_soma_joinid].img[library_id] – SOMAMultiscaleImage.

Images of a Visium ("EFO:0010961") scene MUST adhere to the following specifications. Other assays MUST NOT have images, and MUST NOT include the img collection.

library_id MUST be the corresponding value in the source H5AD slot uns["spatial"][library_id], as defined in the CELLxGENE dataset schema.

Full resolution image of a Scene – census_obj["census_spatial_sequencing"][organism].spatial[scene_soma_joinid].img[library_id]["fullres_image"] – SOMAImageNDArray.

The full resolution image of a Visium ("EFO:0010961") scene MAY be included and MUST be encoded as a SOMAImageNDArray.

Value: the image from uns["spatial"][library_id]['images']['fullres'] as defined in the CELLxGENE dataset schema.

High resolution image of a Scene – census_obj["census_spatial_sequencing"][organism].spatial[scene_soma_joinid].img[library_id]["highres_image"] – SOMAImageNDArray.

The high resolution image of a Visium ("EFO:0010961") scene MUST be included and MUST be encoded as a SOMAImageNDArray.

Value: the image from uns["spatial"][library_id]['images']['hires'] as defined in the CELLxGENE dataset schema.
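
The image requirements in this section can be summarized as a small check. This is a sketch, not part of the schema: check_scene_images is a hypothetical helper, and image_names is assumed to be the set of image keys found under img[library_id] for one scene, or None when the scene has no img collection.

```python
from typing import List, Optional, Set

VISIUM_TERM_ID = "EFO:0010961"


def check_scene_images(assay_term_id: str, image_names: Optional[Set[str]]) -> List[str]:
    """Return violations of the image rules for one scene (empty if none).

    Hypothetical helper: `image_names` is None when there is no img collection.
    """
    errors: List[str] = []
    if assay_term_id != VISIUM_TERM_ID:
        # Other assays MUST NOT have images or an img collection.
        if image_names is not None:
            errors.append("non-Visium scene MUST NOT include the img collection")
        return errors
    # Visium: highres_image is required, fullres_image is optional.
    if image_names is None or "highres_image" not in image_names:
        errors.append("Visium scene MUST include highres_image")
    return errors
```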

pablo-gar added the labels documentation (Improvements or additions to documentation) and P0 (Priority 0 - Critical, fix ASAP!) on Apr 5, 2024
pablo-gar self-assigned this on Apr 5, 2024
pablo-gar added the label census schema (Schema definition or specification) and removed the label documentation (Improvements or additions to documentation) on Apr 5, 2024
@pablo-gar
Contributor Author

First iteration, very likely to change

https://drive.google.com/file/d/1_A8YlZsVZrDrt_hhjHIYQ_jVw0M5b_eP/view?usp=sharing

@pablo-gar
Contributor Author

Second iteration

census_schema_spatial_v2.pdf

@pablo-gar
Contributor Author

Third iteration (changes reflected in text as of today).

census_schema_spatial_v3.pdf

@pablo-gar
Contributor Author

@prathapsridharan
Contributor

@pablo-gar - Some questions/comments here about the differences in the diagram of census_scheme_spatial_v4.pdf and the descriptions of the data fields and types in the text above:

Does soma_joinid in scenes dataframe correspond to soma_joinid in experiment.obs dataframe? That is, the two references to soma_joinid are actually talking about a particular observation? If so, should scenes dataframe just contain a scene_id instead of soma_joinid? I say this because experiment.obs already has a scene_id that ties each observation to a Scene and the scenes dataframe seems to be about metadata about each scene and therefore I don't see why it should contain anything about a particular observation like obs_joinid. From what I understand, a scene corresponds to multiple observations so scene metadata dataframe probably should not contain anything about obs_joinid other than perhaps num_observations or something like that?

soma_dim_0 and soma_dim_1 are defined as categoricals that contain the name of the "X" and "Y" spatial coordinate names. If that is the case then soma_dim_0 and soma_dim_1 are weird names. Maybe something like spatial_X_coord_name and spatial_Y_coord_name or something like that?

Spatial Scenes with spatial data – census_obj["census_spatial_data"][organism].spatial[scene_soma_joinid] – SOMAScene. There will be as many as Spatial Scenes as spatial datasets

Should .spatial[scene_soma_joinid] be replaced with .spatial[scene_id] where scene_id is specified in the experiment.obs dataframe (and possibly in scenes dataframe)? Also the text above describing the columns of scenes dataframe doesn't quite match with the columns listed in the v4 diagram and one or the other needs updating

A data frame with raw spatial coordinates – census_obj["census_spatial_data"][organism].spatial[scene_soma_joinid]["obs_locations"] – SOMADataFrame

There is no obs_locations in the v4 diagram anymore. Should this be removed from the text description above?

MUST contain a positions array – census_obj["census_spatial_data"][organism].spatial[scene_soma_joinid].obsl["positions"] – SOMASparseSpatialArray. This will contain the spatial array positions for each observation, the geometry points associated to them, and additional metadata.

According to the v4 diagram, this is a SOMAGeometryNDArray. Should the text be modified? Also even in the v4 boxed diagram, positions is also specified as a SparseSpatialArray which is confusing. I also think the text description about the fields of positions should be updated since it doesn't match with v4 diagram description. For instance the text contains a column called in_tissue that is not in the v4 diagram

Full resolution image of a Scene and High resolution image of a Scene are specified as SOMAImageNDArray in the text above but the v4 diagram calls them as DenseSpatialArray. This needs updating

@brianraymor

@pablo-gar - I noticed a reference to fiducial_diameter_fullres in the PDF above. This is unsupported by the dataset schema, per earlier conversations. Please see #cell-sci-modalities.

@pablo-gar
Contributor Author

@prathapsridharan answering your questions

Does soma_joinid in scenes dataframe correspond to soma_joinid in experiment.obs dataframe? That is, the two references to soma_joinid are actually talking about a particular observation? If so, should scenes dataframe just contain a scene_id instead of soma_joinid? I say this because experiment.obs already has a scene_id that ties each observation to a Scene and the scenes dataframe seems to be about metadata about each scene and therefore I don't see why it should contain anything about a particular observation like obs_joinid. From what I understand, a scene corresponds to multiple observations so scene metadata dataframe probably should not contain anything about obs_joinid other than perhaps num_observations or something like that?

No, soma_joinid in the scenes dataframe DOES NOT correspond to soma_joinid in the experiment.obs dataframe. If that was understood from the schema text, I should improve it.

soma_dim_0 and soma_dim_1 are defined as categoricals that contain the name of the "X" and "Y" spatial coordinate names. If that is the case then soma_dim_0 and soma_dim_1 are weird names. Maybe something like spatial_X_coord_name and spatial_Y_coord_name or something like that

I'll bring this proposal to Julia and Aaron. I don't have a strong opinion on it.


Spatial Scenes with spatial data – census_obj["census_spatial_data"][organism].spatial[scene_soma_joinid] – SOMAScene. There will be as many as Spatial Scenes as spatial datasets

Should .spatial[scene_soma_joinid] be replaced with .spatial[scene_id] where scene_id is specified in the experiment.obs dataframe (and possibly in scenes dataframe)? Also the text above describing the columns of scenes dataframe doesn't quite match with the columns listed in the v4 diagram and one or the other needs updating

I'm proposing to unify everything via the soma_joinid of the .spatial["scenes"] DataFrame. This effectively acts as a scene ID, so adding yet another scene_id field seems redundant to me. Do you think I'm missing something here?


A data frame with raw spatial coordinates – census_obj["census_spatial_data"][organism].spatial[scene_soma_joinid]["obs_locations"] – SOMADataFrame

There is no obs_locations in the v4 diagram anymore. Should this be removed from the text description above?

Yes, thanks for the catch! I will remove it


MUST contain a positions array – census_obj["census_spatial_data"][organism].spatial[scene_soma_joinid].obsl["positions"] – SOMASparseSpatialArray. This will contain the spatial array positions for each observation, the geometry points associated to them, and additional metadata.

According to the v4 diagram, this is a SOMAGeometryNDArray. Should the text be modified? Also even in the v4 boxed diagram, positions is also specified as a SparseSpatialArray which is confusing. I also think the text description about the fields of positions should be updated since it doesn't match with v4 diagram description. For instance the text contains a column called in_tissue that is not in the v4 diagram

Full resolution image of a Scene and High resolution image of a Scene are specified as SOMAImageNDArray in the text above but the v4 diagram calls them as DenseSpatialArray. This needs updating

Yes, thanks for catching all of these!

@pablo-gar
Contributor Author

@brianraymor Thanks for the catch, I've fixed it.

@pablo-gar
Contributor Author

pablo-gar commented May 9, 2024

Fourth iteration with fixes from the comments above. Text has also been updated in the top-level comment.

census_schema_spatial_v5.pdf

@ivirshup ivirshup linked a pull request Jul 22, 2024 that will close this issue
19 tasks
@pablo-gar
Contributor Author

Sixth iteration:

  • array_col, array_row, in_tissue moved to obs
  • Updated structure to .spatial to adhere to latest changes in TileDB-SOMA
  • Added var_scene and obs_scene

census_schema_spatial_v6.pdf

@pablo-gar
Contributor Author

pablo-gar commented Aug 10, 2024

Seventh iteration:

  • Removed spot_diameter_fullres from census_obj["census_spatial_data"][organism].spatial[scene_id].obsl["loc"]
  • Removed summary data frame of Scenes – census_obj["census_spatial_data"][organism].spatial.scenes – SOMADataFrame.
  • Removed Var to spatial mapping – census_obj["census_spatial_data"][organism].var_scene – SOMADataFrame
  • Updated name of image collection
  • Made high resolution image required
  • Updated units for positions geometry dataframe

census_schema_spatial_v7.pdf

@pablo-gar
Contributor Author

pablo-gar commented Aug 29, 2024

Eighth iteration:

  • Updated details of img collection to match the latest changes in TileDB-SOMA for MultiscaleImage
  • Made editorial changes to match all image requirements.
  • Replaced "census_spatial_data" with "census_spatial_sequencing" in all occurrences; see this document for more details

census_schema_spatial_v8.pdf

@ivirshup ivirshup linked a pull request Oct 31, 2024 that will close this issue
19 tasks
3 participants