[python/r/c++] Revisit `shape` for component arrays #2407

johnkerl · 2024-04-08T18:18:08Z

PRs

PRs in process:

[python] Fix last 2.27+Python+dense failing test case #3269 kerl/python-227-dense-ned-read
[r] Fixes for dense+2.27 #3270 kerl/r-227-dense-fixes

PRs to be created:

Follow up with [python] Fix last 2.27+Python+dense failing test case #3269: handle the case of dense-array partial writes that do not start at 0. (That problem precedes this new-shape project, but this project will be fixing that problem.)
Check Python exception-types against the spec (below) (TileDBSOMAError vs ValueError for various error modes)
Example notebooks & other doc material -- be sure to link from the TileDB-SOMA 1.15 release notes

Merged PRs:

[c++] Methodize timestamped-schema-evolution factory #2909 kerl/schevo-timestamp-methodize
[c++] Trivial name-neaten #2913 kerl/name-neaten
[python] Trivial rename within a unit-test file #2908 kerl/ut-soma-exc-simplify
[c++] Trivial parameterizes in test/common.cc #2910 kerl/test-common-parameterize
[c++] Remove some dead unit-test code #2918 kerl/cpp-test-deadstrip
[c++] Minor function-extract in a unit-test helper #2919 kerl/minor-unit-test-helper-mod
[c++] Pre-neatens before making unit-test helpers for variant-indexed dataframes #2936 kerl/cpp-ut-helper-neaten
[c++] More use_current_domain unit-test parameterization #2938 kerl/more-cur-dom-parameterize
[c++] Remove some dead code in int64-only shape accessor #2915 kerl/cpp-strict-int64-shape
[c++] Arrow utils with current-domain option #2911 kerl/arrow-util-current-domain-optional
[c++] Parameterize schema-creation unit-test helpers #2939 kerl/step-two-temp
[c++] Unit-test resize for SparseNDArray and DenseNDArray #2947 kerl/cpp-ndarray-resize-testing
[c++] Make a Catch2 test fixture for dataframes #2945 kerl/dataframe-test-fixture
[c++] Unit-test variant-indexed dataframes #2944 kerl/cpp-variant-indexed-dataframes
[c++] Performant DataFrame.shape #2916 kerl/sdf-shape
[c++] Resize for variant-indexed DataFrame #2917 kerl/cpp-resizes
[c++] Implement upgrade_shape for SparseNDArray and DenseNDArray #2948 kerl/upgrade-shape-int64
[c++] Clarify dataframe-shaping test/access points #2951 kerl/sdf-test-accessors
[python/r] Expose shape-related accessors to Python/R bindings #2953 kerl/py-r-accessor-plumbing
[c++/python] Map core-to-soma domains correctly #2957 kerl/sdf-domain-accessors
Add tracking links for dense/new-shape support #2960 kerl/dense-link
[python] Fix nightly-build failure / pybind11 exception-mapping #2963 kerl/nightly-fix
[r] Fix DenseNDArray write after create #2970 kerl/dense-writeable-after-create
[python] Trivial dead-strip #2968 kerl/minor-trim
[python] Minor name-neaten in internals for domain/maxdomain #2969 kerl/more-py-domain-name-neaten
[c++] Support option to set log level from environment #2972 kerl/libtiledbsoma-env-logging-level
[python/r] Array-creation mods for new shape #2962 kerl/py-r-creation-paths
[python/r] Implement resize and tiledbsoma_upgrade_shape #2950 kerl/py-r-test-2
[c++] Centralize some nanoarrow helpers #2994 kerl/nanoarrow-helpers
[c++] Pre-neatens for polymorphic domainish accessors #3011 kerl/polydom3
[c++] Update dataframe unit-test writes in prep for polytype domainish accessors #3017 kerl/polydom5
[c++] Readback-testing pieces for polytype domainish accessors #3018 kerl/polydom6
[c++] Fix bug in nnz of variant-indexed dataframes #2990 kerl/variant-nnz-bug
[c++] Have a DataFrame test case with soma_joinid not first #3019 kerl/index-swap
[c++] Be a bit more careful testing dim-max vs shape #3020 kerl/ut-max-shape
[c++] Implementation and unit testing for domainish accessors #3012 kerl/polydom4
[c++] Fix bad merge from #3012 and #3020 #3025 kerl/fix-3020-merge
[python] One more rename in prep for domainish pushdown #3026 kerl/one-more-rename
[c++] Improve current-domain signaling for string dims #3028 kerl/ff-not
[c++] Fix a valgrind issue in unit-test code #3029 kerl/ut-vg
[c++] Fix memory-management issues in new domainish helpers #3030 kerl/table-utils-memory
[r] Improve test-case field names for DataFrame #3067 kerl/improve-sdf-test-field-names
[c++] Extend some unit-test cases for new shape #3068 kerl/ut-generate
[c++] Expose custom DataFrame domain for libtiledbsoma unit-test cases #3069 kerl/cpp-sdf-domain-at-create
[python/r] Use pushdown domainish accessors at the Python/R UX level #3027 kerl/hll-domainish
[python] Use same default max domain between Python and R #3088 kerl/max-domain-int64
[c++] Simplify internal API for dataframe-resizer #3090 kerl/maybe-resize-soma-joinid-cpp-tweak
[r] Implement missing domain argument to DataFrame create #3032 kerl/sdf-domain-at-create -- fixes [r] SOMADataFrame create needs to accept a domain argument #2967
[python/r] DataFrame resizer #3091 kerl/maybe-resize-soma-joinid-py-r
[c++] Add can-resize helpers in prep for experiment-level resize #3095 kerl/cpp-exp-resize-prep
[r] Make DataFrame objects shapeable at ingest #3089 kerl/r-dataframe-shapeable
Sync domain argument between Collection.add_new_dataframe and DataFrame.create SOMA#233
[c++] Trivial name-shortens in unit-test code #3125 kerl/cpp-ut-name-shortens
[c++/python/r] Rename a helper function #3127 kerl/helper-rename
[c++] Propagate Python/R function names to C++ for upgrade/resize methods #3130 kerl/cpp-can-resizers-names
[c++] Dataframe-sizing helpers #3132 kerl/cpp-dataframe-sizing-helpers
[c++] Unit-test the dataframe upgrader #3139 kerl/cpp-dataframe-upgrade-test
[python] Connect resizers to the Python API #3140 kerl/py-resizer-connects
[python] Complete #3140 #3151 kerl/py-can-upgrade-shape
[python] Let registrar provide new shapes for resize #3152 kerl/registration-shape-acceessors
[python] Canned tests for old/new arrays without/with new shapes #3156 kerl/py-exp-shaping
[python] Experiment-level upgrader/resizer #3157 kerl/py-exp-shaping2
[python] Append mode with resizing #3148 kerl/py-exp-resize
[python] Domain-at-create unit-test PR 1/5 #3191 kerl/py-domain-at-create-ut-1
[python] Domain-at-create unit-test PR 2/5 #3190 kerl/py-domain-at-create-ut-2
[python] Domain-at-create unit-test PR 3/5 #3192 kerl/py-domain-at-create-ut-3
[python] Domain-at-create unit-test PR 4/5 #3193 kerl/py-domain-at-create-ut-4
[python] Domain-at-create unit-test PR 5/5 #3194 kerl/py-domain-at-create-ut-5
[python] Min-sizing for dataframes/arrays with new shape feature #3203 kerl/min-size-2
[r] Min-sizing for dataframes/arrays with new shape feature #3208 kerl/r-min-sizing
[c++] can_upgrade_domain #3211 kerl/cpp-ugr-dom
[ci] Run R/Python interop tests with new-shape flag off/on #3232 kerl/ff-interop
[ci] Default the new-shape feature to enabled, still testing both #3230 kerl/ffon
[python] Strip needless wrapper-class docstrings #3234 kerl/docstring-prune
[python] Proper prefixing for shape-related methods #3236 kerl/prefixing
[python] Fix bad merge of #3236 #3241 kerl/fix-bad-merge
[python] Bindings for upgrade_domain #3235 kerl/py-r-ugr-dom
[r] Proper prefixing for shape-related methods #3237 kerl/py-r-ugr-dom-2
[r] Bindings for upgrade_domain #3238 kerl/py-r-ugr-dom-3
[python] Rename set_reader_coords to set_coords #3253 kerl/set-coords-rename
[python] Centralize sparse/dense pybind11 shape methods #3261 kerl/pybind11-nda-sizing
[python] Fix some dense+2.27 failing test cases #3265 kerl/dense-227-a
[c++] Apply subarrays for dense reads and writes #3263 kerl/dense-range-trim
[python] Fix 3D/4D cases with core 2.27 #3268 kerl/dim-explosion

Closed PRs:

[c++/python] Testing new sparse-shape feature from core 2.25 [no merge] #2785 -- This was only dogfooding for the core 2.25 release -- not to be merged
[python/r] Temporary feature-flags for stacked PRs [WIP] #2952 kerl/feature-flag-temp -- folded into 2962
[c++] Push polymorphic domain-ish accessors down to C++ #2995 kerl/polydom
[python] New-shape testing for tiledbsoma.io [WIP] #2964 kerl/tiledbsoma-io-test
[python] Min-sizing for dataframes/arrays [no merge] #3189 kerl/min-size
[c++] upgrade_domain for DataFrame #3220 kerl/cpp-ugr-dom-2
[c++/python] Fixes for dense arrays and core 2.27/dev #3244 kerl/dense-227-fixes

Issues which are related but non-blocking:

Use correct shape for obsm/varm layers in anndata export SOMA#216
Enable current domain on dense arrays. TileDB-Inc/TileDB#5303
[r] SparseNDArray/DenseNDArray create methods need to accept tile extent from PlatformConfig #2966
[python] Typeguard update for _cast_domainish #3081
Note: R append mode does not exist yet -- see [r] Port append-mode logic #1630 -- so an experiment-level resizer is not a priority in R

See also: [sc-51048].

Problem to be solved

Users want to know the shape of an array, in the SciPy sense:

Reads and writes are bounds-checked against the shape
This retains its value regardless of which values of a sparse array are or are not actually occupied
Users can resize.
- Some users need the ability to grow their datasets later, using either tiledbsoma.io's append mode, or subsequent writes using the tiledbsoma API.
- Note that the cellxgene census doesn't need this: each week's published census has fixed shape, and any updates will happen in new storage, on a new week.

Using TileDB-SOMA up until the present:

The TileDB domain is immutable after array creation
- This does bounds-checking for reads and writes, which is good
- To leverage this as a shape, users would need to set the domain at array-creation time. However, they would then lose the ability to grow their datasets later.
There is a non_empty_domain accessor
- This only indicates min/max coordinates at which data exists. Consider an X array for 100 cells and 200 genes. If non-zero expression counts exist only for cell join IDs 2-17, then the non_empty_domain will indicate (2,17) along soma_dim_0.
- Consider an obsm["X_pca"] within the same experiment. This may be 100 cells by 50 PCA components: we need a place to store the number 50.
- Therefore users cannot leverage this to function as a shape accessor.
We have offered a used_shape accessor since TileDB-SOMA 1.5.
- This functions as a shape accessor, in the SciPy sense, but it is not multi-writer safe.
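
To make the distinction concrete, here is a minimal sketch in plain Python of why a bounding box over occupied coordinates cannot stand in for a shape. It reuses the 100-cell by 200-gene example above. It makes no TileDB calls; the helper name `bounding_box` and the sample coordinates are hypothetical and only illustrate the idea.

```python
from typing import Iterable, Optional, Tuple

Coord = Tuple[int, int]


def bounding_box(coords: Iterable[Coord]) -> Optional[Tuple[Tuple[int, int], ...]]:
    """Return per-dimension (min, max) of occupied coordinates, or None if empty.

    This mimics what a non-empty-domain style accessor reports.
    It is an illustration, not the TileDB implementation.
    """
    coords = list(coords)
    if not coords:
        return None
    ndim = len(coords[0])
    return tuple(
        (min(c[d] for c in coords), max(c[d] for c in coords)) for d in range(ndim)
    )


# Declared shape of X: 100 cells by 200 genes.
declared_shape = (100, 200)

# Hypothetical occupied cells: non-zero counts only for cell join IDs 2 through 17.
occupied = [(2, 5), (9, 150), (17, 0)]

box = bounding_box(occupied)
print(box)             # ((2, 17), (0, 150))
print(declared_shape)  # (100, 200)
```

The bounding box changes as data are written and says nothing about the intended extent, whereas the declared shape stays at (100, 200) no matter which cells are occupied.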

New feature for TileDB-SOMA 1.15:

Arrays will have a shape
Reads and writes are bounds-checked against the shape
This retains its value regardless of which values of a sparse array are or are not actually occupied
Users can resize
The used_shape accessor will be deprecated in TileDB-SOMA 1.13, and slated for removal in TileDB-SOMA 1.14.

Compatibility:

This will now require users to do an explicit resize before appending/growing TileDB-SOMA Experiments. Guidance in the form of example notebooks will be provided.

Tracking

Scheduling

Support arrives in TileDB Core 2.25. Deprecations for TileDB-SOMA will be released with 1.13. Full support within TileDB-SOMA will be released in 1.14.

Details

SOMA API mods as we've discussed in a Google doc are as follows.

`SOMADataFrame`

create: Retain the domain argument
- Issue:
  - Core has a (lo, hi) tuple per dim, e.g. (0,99) or (10,19)
  - SOMA has count per dim, with 0 implicit: e.g. 100 or 20
  - For SparseNDArray and DenseNDArray core can have (lo, hi) and SOMA can have count
  - For DataFrame there can be multiple dims --- default is a single soma_joinid
  - That could be treated either in (lo, hi) fashion or count fashion
  - However additional dims (e.g. cell_type) can be on any type, including strings, floats, etc. where there is no implicit lo=0
  - Therefore we need to keep the current SOMA API wherein DataFrame takes a domain argument (in (lo, hi) fashion) and not a shape argument (in count fashion)
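
The two conventions in the list above can be sketched as a pair of small conversions. This is an illustration only: the function names `shape_to_domain` and `domain_to_count` are hypothetical and are not part of the SOMA or TileDB APIs, and rejecting counts below 1 is an assumption made to keep the sketch simple.

```python
from typing import Sequence, Tuple


def shape_to_domain(shape: Sequence[int]) -> Tuple[Tuple[int, int], ...]:
    """Count-style shape -> (lo, hi) pairs, with lo implicitly 0 and hi inclusive."""
    if any(n < 1 for n in shape):
        # Assumption for this sketch: every dimension holds at least one cell.
        raise ValueError("each count must be at least 1")
    return tuple((0, n - 1) for n in shape)


def domain_to_count(domain: Sequence[Tuple[int, int]]) -> Tuple[int, ...]:
    """(lo, hi) pairs -> count-style extents, taking hi + 1 per dimension."""
    return tuple(hi + 1 for _lo, hi in domain)


print(shape_to_domain((100, 20)))            # ((0, 99), (0, 19))
print(domain_to_count([(0, 99), (10, 19)]))  # (100, 20)
```

The `hi + 1` rule only makes sense for integer dimensions. A string-typed or float-typed dimension such as cell_type has no implicit lo=0 and no meaningful count, which is why DataFrame keeps a domain argument rather than a shape argument.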

`SparseNDArray` and `DenseNDArray`

create
- Have an optional shape argument which is of type Tuple[Int,...] where each element is the cell count of the corresponding dimension
  - If unsupplied, or if supplied but None in any slot: use the minimum 0 in each slot – nothing larger makes sense since we will not support downsize
- User guidance should make clear that it will not be possible to create an ‘old’ style array with the ‘new style’ API. (See also the upgrade logic below.)

All three of `SOMADataFrame`, `SparseNDArray`, `DenseNDArray`

write
- For new arrays, created with the new shape feature:
  - Core will bounds-check that coordinates provided at write time are within the current shape
  - Core will raise tiledb.cc.TileDBError to TileDB-SOMA, which will catch it and raise IndexError in Python, with R-standard behavior on the R side
- For old arrays created before this feature:
  - Core will not bounds-check that coordinates provided at write time are within the current shape
Existing used_shape accessor
- TileDB-SOMA will deprecate this over a release cycle.
- For new arrays: raise NotImplementedError
- For old arrays: return what’s currently returned, with a deprecation warning.
- Mechanism for determining old vs. new: array.schema.version (the core storage version).
Existing shape accessor
- For new arrays:
  - Have this return the new shape as proposed by core, no longer returning the TileDB domain.
- For old arrays created before this feature:
  - Return the TileDB domain as now.
Existing non_empty_domain accessor
- Same behavior for old and new arrays (unaffected by this proposal).
- Keep this accessor supported, but with user notes that it’s generally not useful
- This should return None (or R equivalent) when there is a schema but no data have been written.
New maxshape accessor
- Maps the core-level (lo, hi) accessor for domain to count-style accessor hi+1. E.g. if the core domain is either (0,99) or (50,99) then TileDB-SOMA maxshape will say 100.
- Same behavior for old and new arrays.
- Let users query for what the TileDB domain is, with user notes that it’s the maximum that users can reshape to.
- Issac suggests: maybe domain or maxshape (see h5py).
New resize mutator
- Note: reshape means something else in the community (numpy, zarr, h5py), e.g. a 5x20 (total 100 cells) being reinterpreted as 4x25 (still 100 cells). The standard name for changing cell-count is resize.
- For old arrays created before this feature: raise NotImplementedError.
- For new arrays:
  - Will raise ValueError if the new shape is smaller on any dim than currently in storage
  - Regardless of whether any data have been written whatsoever
  - Will raise ValueError if the new shape exceeds the TileDB domain from create time, which will serve TileDB-SOMA in a role of “max possible shape the user can reshape to”
  - Otherwise, any calls to write from this point will bounds-check writes within this new shape
  - We don’t expect resize to be multi-writer safe with regard to write; user notes must be clear on this point
New tiledbsoma_upgrade_shape method for SparseNDArray and DenseNDArray
- This will leverage array.schema.version to see if an upgrade is needed
- Leverage core support for storage-version updates
- This will take a shape argument as in create
- For arrays created with “just-right” size: this will succeed
- For arrays created with “room-for-growth” / “two billion-ish” size: this will succeed
- If the user passes a shape which exceeds the current TileDB domain: this will fail
New tiledbsoma_upgrade_domain method for DataFrame
- Same as for SparseNDArray/DenseNDArray except it will take a domain at the SOMA-API level just as DataFrame's create method
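
The intended interplay of shape, maxshape, write bounds-checking, and resize for new-style arrays can be sketched with a small in-memory model. This is a toy under stated assumptions, not the tiledbsoma implementation: the class `ToyArray` and its attributes are hypothetical, and only the exception types (IndexError for out-of-bounds writes, ValueError for invalid resizes) are taken from the proposal above.

```python
from typing import Dict, Sequence, Tuple


class ToyArray:
    """In-memory stand-in for a new-style array (hypothetical, for illustration)."""

    def __init__(self, shape: Sequence[int], maxshape: Sequence[int]) -> None:
        shape, maxshape = tuple(shape), tuple(maxshape)
        if len(shape) != len(maxshape) or any(s > m for s, m in zip(shape, maxshape)):
            raise ValueError("shape must fit within maxshape")
        self.shape = shape        # current, resizable bound
        self.maxshape = maxshape  # fixed at create time (plays the TileDB-domain role)
        self.cells: Dict[Tuple[int, ...], float] = {}

    def write(self, coord: Sequence[int], value: float) -> None:
        """Bounds-check the coordinate against the current shape, then store it."""
        coord = tuple(coord)
        in_bounds = len(coord) == len(self.shape) and all(
            0 <= c < n for c, n in zip(coord, self.shape)
        )
        if not in_bounds:
            raise IndexError(f"coordinate {coord} is outside shape {self.shape}")
        self.cells[coord] = value

    def resize(self, new_shape: Sequence[int]) -> None:
        """Grow the shape; never shrink it, never exceed maxshape."""
        new_shape = tuple(new_shape)
        if len(new_shape) != len(self.shape):
            raise ValueError("dimension count cannot change")
        if any(n < s for n, s in zip(new_shape, self.shape)):
            # Applies whether or not any data have been written.
            raise ValueError(f"cannot downsize from {self.shape} to {new_shape}")
        if any(n > m for n, m in zip(new_shape, self.maxshape)):
            raise ValueError(f"{new_shape} exceeds maxshape {self.maxshape}")
        self.shape = new_shape


arr = ToyArray(shape=(100, 200), maxshape=(1000, 200))
arr.write((99, 199), 1.0)          # last in-bounds cell
try:
    arr.write((100, 0), 2.0)       # one past the end along dimension 0
except IndexError as err:
    print("rejected:", err)
arr.resize((150, 200))             # explicit resize before growing
arr.write((100, 0), 2.0)           # now accepted
print(arr.shape)                   # (150, 200)
```

The model is single-writer by construction, so it does not capture the multi-writer caveat noted for resize.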

`tiledbsoma.io`

The user-facing API has no shape arguments and thus won’t need changing.
Internally to tiledbsoma.io, we’ll still ask the tiledbsoma API for the “big domain” (2 billionish)
Append mode:
- Will need a new resize method at the Experiment level
- Users will need to:
  - Register as now
  - Call the experiment-level resize
    - Could be exp.resize(...), or (better) this could be tiledbsoma.io.reshape_experiment
- In either case: this method will take the new obs and var counts as inputs:
  - exp.obs.reshape to new obs count
  - exp.ms[name].var.reshape to new var count
  - exp.ms[name].X[name].reshape to new obs count x var count
  - exp.ms[name].obsm[name].reshape to new obs count x same width
  - exp.ms[name].obsp[name].reshape to new obs count x obs count
  - exp.ms[name].varm[name].reshape to new var count x same width
  - exp.ms[name].varp[name].reshape to new var count x var count
- Do the individual append-mode writes as now
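
The per-component rules listed above can be sketched as a function that computes the target shape for each member of an experiment, given the new obs and var counts. This is a planning sketch only: the function name `plan_experiment_resize`, the path-style keys, and the single-measurement layout are all assumptions for illustration, and nothing here calls tiledbsoma.

```python
from typing import Dict, Tuple

Shape = Tuple[int, ...]


def plan_experiment_resize(
    current: Dict[str, Shape], n_obs: int, n_var: int
) -> Dict[str, Shape]:
    """Map each component path to its new shape for the given obs/var counts.

    Keys are hypothetical paths within one measurement, e.g. "X/data" or
    "obsm/X_pca". The width of obsm/varm arrays is kept as-is.
    """
    plan: Dict[str, Shape] = {}
    for path, shape in current.items():
        kind = path.split("/")[0]
        if kind == "obs":
            new: Shape = (n_obs,)
        elif kind == "var":
            new = (n_var,)
        elif kind == "X":
            new = (n_obs, n_var)
        elif kind == "obsm":
            new = (n_obs, shape[1])
        elif kind == "obsp":
            new = (n_obs, n_obs)
        elif kind == "varm":
            new = (n_var, shape[1])
        elif kind == "varp":
            new = (n_var, n_var)
        else:
            raise ValueError(f"unrecognized component: {path}")
        plan[path] = new
    return plan


current_shapes = {
    "obs": (100,),
    "var": (200,),
    "X/data": (100, 200),
    "obsm/X_pca": (100, 50),
    "obsp/distances": (100, 100),
    "varm/PCs": (200, 50),
    "varp/correlations": (200, 200),
}

for path, new_shape in plan_experiment_resize(current_shapes, 150, 200).items():
    print(path, current_shapes[path], "->", new_shape)
```

Computing the whole plan before touching any array makes it possible to check every target against that array's maximum shape first, so that a resize which cannot succeed for one component is caught before any other component has been changed.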


johnkerl · 2024-07-10T22:43:45Z

#2785 is a quick-and-dirty concept-prover -- its sole function is to flush out any API misunderstandings we might have, in prep for 2.25.0 core release.

johnkerl self-assigned this Apr 8, 2024

johnkerl added long-term-tracker blocked labels Apr 8, 2024

This was referenced Apr 8, 2024

SparseNDArray write bounding box not multi-writer safe #1969

Closed

[python] SparseNDArray bounding box inconsistent -- shape vs. used coordinates #1971

Open

johnkerl changed the title ~~[python/r/c++] Revisit shape for sparse arrays~~ [python/r/c++] Revisit shape for sparse arrays [long-term tracker] Apr 8, 2024

johnkerl mentioned this issue Apr 8, 2024

[r] Check for zero-copy semantics in bounding-box logic #1900

Open

johnkerl changed the title ~~[python/r/c++] Revisit shape for sparse arrays [long-term tracker]~~ [python/r/c++] Revisit shape for sparse arrays May 15, 2024

johnkerl mentioned this issue May 15, 2024

[Bug/Question] Experiment.obs.count has long wall clock time - is this normal? #2510

Closed

johnkerl assigned bkmartinjr and jp-dark May 16, 2024

johnkerl removed blocked long-term-tracker labels Jul 10, 2024

johnkerl unassigned bkmartinjr and jp-dark Jul 10, 2024

johnkerl mentioned this issue Jul 10, 2024

[c++/python] Testing new sparse-shape feature from core 2.25 [no merge] #2785

Closed

johnkerl mentioned this issue Jul 11, 2024

[c++] Improve exception-handling for query futures #2789

Merged

johnkerl added python-api r-api labels Jul 23, 2024

This was referenced Aug 2, 2024

[python] Remove reshape stub #2826

Merged

[r/python] Deprecation notices for used_shape #2834

Merged

[r] Use .Deprecated for used_shape in R #2835

Merged

johnkerl changed the title ~~[python/r/c++] Revisit shape for sparse arrays~~ [python/r/c++] Revisit `shape Oct 17, 2024

johnkerl changed the title ~~[python/r/c++] Revisit `shape~~ [python/r/c++] Revisit shape for component arrays Oct 17, 2024
