Optimizing Large Dataset Loading and Differential Expression Analysis in local hosted CellxGene VM #2630

Open
chunhuicai opened this issue Sep 8, 2023 · 3 comments
Labels
question request for help/clarification

Comments

@chunhuicai

We are currently using CellxGene VM (https://github.com/Novartis/cellxgene-gateway) to host a large spatial transcriptomics dataset of roughly 16 million cells. However, we are running into several issues that hamper our analysis workflow:

Dataset Loading:

  • Incomplete Loading: Loading the dataset is often interrupted or stops before it completes. After several attempts we can load the full dataset, which takes around 3m30s, but the inconsistency remains a concern.
  • Conversion to CXG: After successfully converting our dataset to CXG format, we found that our self-hosted explorer does not recognize it.

Differential Expression Analysis:

  • Inconsistent Loading of Gene Details: When we use the differential expression function, it does not consistently finish loading the details for all genes.

By comparison, when working with large datasets (over 4 million cells) on the CZI-hosted instance, we observed fast data loading and differential expression analysis that completed smoothly in a few seconds. Are there any practices, setups, or approaches that would help us handle and analyze large datasets efficiently on the local CellxGene VM and achieve performance similar to CZI's?

@chunhuicai chunhuicai added the question request for help/clarification label Sep 8, 2023
@mohammed-hussain1259

Hi, I am having similar issues with loading times for larger datasets. I was wondering how you were able to convert your datasets into CXG format?

@MaximilianLombardo

Hey @mohammed-hussain1259

Thanks for the question. The original issue this user was experiencing was partially addressed over private communication, but sharing here for visibility and to continue the public discussion.

Regarding conversion to CXG, you can refer to this code in the single cell data portal repo, which is the entry point for the CXG conversion. For a bit more context, the CXG file format is an implementation of the TileDB format/data structure that adheres to the SOMA specification, with the goal of catering more specifically to the single-cell use case.

Happy to provide more information if needed, but as a disclaimer - because of the wide range of contexts/requirements of different self-hosting use cases, the CELLxGENE team does not explicitly offer/guarantee support for self-hosting CZ CELLxGENE Annotate.

@mohammed-hussain1259

Hi @MaximilianLombardo,

Thank you so much for your quick response. I am attempting to self-host CellxGene with a setup similar to the original poster's, using CellxGene Gateway (https://github.com/Novartis/cellxgene-gateway). I have been running into similar issues with loading times: even a dataset of 400k cells (4.5 GB) takes nearly 15 minutes to load. I noticed that https://cellxgene.cziscience.com/ loads similarly sized datasets very quickly, and I was wondering if you could provide some guidance on how I might achieve similar load times. I have used the --sparse and --backed flags, and while they do improve performance, the load times are still not comparable to what I see on the CellxGene site.

I completely understand that you are not able to explicitly offer/guarantee support for self-hosting; however, I would appreciate any guidance you can provide.

If you would prefer to correspond privately, please feel free to reach out to me by email at [email protected]

Thank you again for all your assistance.
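
As a rough illustration of why sparse storage can matter at the dataset sizes mentioned in this thread, here is a back-of-the-envelope memory estimate. It is only a sketch: the gene count (20,000), the non-zero fraction (5%), and the 4-byte value and index sizes are assumptions chosen for illustration, not measurements from any dataset discussed here, and the helper names `dense_bytes` and `csr_bytes` are hypothetical.

```python
def dense_bytes(n_cells: int, n_genes: int, itemsize: int = 4) -> int:
    """Bytes needed to hold the full expression matrix as a dense array."""
    return n_cells * n_genes * itemsize


def csr_bytes(n_cells: int, n_genes: int, density: float,
              itemsize: int = 4, index_size: int = 4) -> int:
    """Approximate bytes for the same matrix in CSR (compressed sparse row) form.

    CSR stores one value and one column index per non-zero entry,
    plus one row pointer per row (and one extra at the end).
    """
    nnz = round(n_cells * n_genes * density)
    return nnz * (itemsize + index_size) + (n_cells + 1) * index_size


if __name__ == "__main__":
    n_cells = 400_000   # cell count mentioned in this thread
    n_genes = 20_000    # ASSUMPTION: illustrative gene count
    density = 0.05      # ASSUMPTION: 5% of entries are non-zero

    print(f"dense: {dense_bytes(n_cells, n_genes) / 1e9:.1f} GB")
    print(f"CSR:   {csr_bytes(n_cells, n_genes, density) / 1e9:.1f} GB")
```

Under these assumptions the dense matrix is roughly ten times larger than the sparse one, so whether the server materializes a dense array may have a large effect on load time and memory pressure. For much larger matrices the index arrays may need 8-byte integers, which `index_size=8` approximates.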
