Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[python] Initial work toward PyTorch data loaders #2823

Closed
wants to merge 84 commits into from

Conversation

bkmartinjr
Copy link
Member

@bkmartinjr bkmartinjr commented Jul 31, 2024

This PR is superceded by single-cell-data/TileDB-SOMA-ML#1 and is being closed. No further work will be done on this branch.


Notes to reviewers:

  • APIs have changed from the CZI contribution. The demo notebook (tutorial_pytorch.ipynb) has been updated to run with all API modifications.
  • This PR is for an initial drop we plan to use for pre-release feedback. More work is required prior to release
  • This is set up as a separate Python package, with its own config/build/etc.

This PR contains a PyTorch iterable-style DataSet/DataPipe for use with SOMA Experiment. Initial code contributed by the Chan Zuckerberg single-cell team. Modifications to the original code include:

  • stand-alone Python package tiledbsoma_ml
  • significant refactoring to improve performance
  • CI and other config for stand-alone Python package
  • improved context/config handling - pass through all user-specified configuration allowing use with other storage solutions, etc.
  • fix lint and typing issues identified by the CI pipeline in this repo
  • update doctrings and copyright
  • some minor enhancements to unit tests
  • added torch.utils.data.IterableDataset alongside the existing torchdata.datapipes.iter.IterDataPipe, acknowledging that the torchdata.datapipes is deprecated and slated for removal
  • add Python 3.8 support
  • API refinement to better align with ExperimentAxisQuery
  • modifications to support DDP (mult-GPU) training.

API change summary:

  • constructor takes ExperimentAxisQuery, rather than than an Experiment
  • reworked I/O and shuffle chunk size param for better API UX

Benchmarking results vs the cellxgene-census loader will be forthcoming shortly.

Copy link

codecov bot commented Jul 31, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.79%. Comparing base (6f723d4) to head (124a510).

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #2823       +/-   ##
===========================================
+ Coverage   61.10%   89.79%   +28.69%     
===========================================
  Files          95       40       -55     
  Lines       12382     4096     -8286     
  Branches      781        0      -781     
===========================================
- Hits         7566     3678     -3888     
+ Misses       4625      418     -4207     
+ Partials      191        0      -191     
Flag Coverage Δ
libtiledbsoma ?
python 89.79% <ø> (+0.14%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
python_api 89.79% <ø> (+0.14%) ⬆️
libtiledbsoma ∅ <ø> (∅)

@johnkerl johnkerl changed the title initial contribution of ExperimentDataPipe/Loader from CZI [python] Initial contribution of ExperimentDataPipe/Loader from CZI Aug 6, 2024
@bkmartinjr bkmartinjr changed the title [python] Initial contribution of ExperimentDataPipe/Loader from CZI [python] Initial work toward PyTorch data loaders Aug 20, 2024
Copy link
Member

@aaronwolen aaronwolen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome. I'm excited to start testing it out. The docs are comprehensive and super clear. My suggested changes are largely minor typos and small improvements for clarity/consistency.

@bkmartinjr
Copy link
Member Author

PR superceded by single-cell-data/TileDB-SOMA-ML#1

@bkmartinjr bkmartinjr closed this Sep 19, 2024
@bkmartinjr bkmartinjr deleted the bkmartinjr/experimentdatapipe branch October 19, 2024 15:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants