scDEC is a computational tool for single cell ATAC-seq data analysis with deep generative neural networks. scDEC enables simultaneously learning the deep embedding and clustering of the cells in an unsupervised manner. scDEC is also applicable to multi-modal single cell data. We tested it on the PBMC paired data (scRNA-seq and scATAC-seq) from 10x Genomics (see Tutorials).
A modified version of scDEC won first place in the two Joint Embedding tasks of the NeurIPS 2021 Multimodal Single-Cell Data Integration competition.
- TensorFlow==1.13.1
- Scikit-learn==0.19.0
- Python==2.7
Download scDEC by
git clone https://github.com/kimmo1019/scDEC
Installation has been tested on a Linux platform with Python 2.7. A GPU is recommended for accelerating the training process.
This section provides instructions on how to run scDEC with scATAC-seq datasets. One can also go to the Code Ocean platform and click Reproducible Run on the right. The embedding and clustering results of several datasets will then be shown in the right panel.
Several scATAC-seq datasets have been prepared as input for the scDEC model. These datasets can be downloaded from the Zenodo repository. Uncompress datasets.tar.gz in the datasets folder; each dataset will then have its own subfolder. Each dataset contains two major files: the raw read count matrix (sc_mat.txt) and the cell labels (label.txt). The first column of sc_mat.txt contains the peak information.
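As a quick sanity check before training, the count matrix can be read into memory and its shape inspected. The following is a minimal sketch, not the loader used by scDEC itself. It assumes a tab-delimited file with one header row and one peak per row; the function name load_count_matrix and its parameters are hypothetical.

```python
import numpy as np


def load_count_matrix(path, delimiter="\t", has_header=True):
    """Read a count matrix whose first column holds the peak information.

    Assumptions (not confirmed by the scDEC documentation): the file is
    tab-delimited, has one header row, and stores one peak per row.
    """
    peaks, rows = [], []
    with open(path) as handle:
        if has_header:
            next(handle)  # skip the assumed header row
        for line in handle:
            fields = line.rstrip("\n").split(delimiter)
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            peaks.append(fields[0])
            rows.append([float(value) for value in fields[1:]])
    return peaks, np.array(rows)
```

Calling load_count_matrix("datasets/Splenocyte/sc_mat.txt") would return the list of peaks and a numeric array with one row per peak; adjust delimiter and has_header if the file is laid out differently.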
scDEC is an unsupervised learning model for analyzing scATAC-seq data. One can run
python main_clustering.py --data [dataset] --K [nb_of_clusters] --dx [x_dim] --dy [y_dim] --train [is_train]
[dataset] - the name of the dataset (e.g., Splenocyte)
[nb_of_clusters] - the number of clusters (e.g., 6)
[x_dim] - the dimension of the Gaussian distribution
[y_dim] - the dimension of PCA (default: 20)
[is_train] - whether to train from scratch or use the pretrained model
For example, one can run CUDA_VISIBLE_DEVICES=0 python main_clustering.py --data Splenocyte --K 12 --dx 8 --dy 20 to cluster the scATAC-seq data with the pretrained model. Note that the dimension of the embedding should be K+x_dim (here 12+8=20).
Alternatively, one can run CUDA_VISIBLE_DEVICES=0 python main_clustering.py --data Splenocyte --K 12 --dx 8 --dy 20 --train True to train the model from scratch.
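To illustrate why the embedding dimension is K+x_dim, the sketch below builds a latent matrix from a continuous Gaussian part of width x_dim and a one-hot categorical part of width K. This is only an illustration of the dimensionality, not scDEC's actual sampling code; the function name sample_latent and the ordering of the two parts are assumptions.

```python
import numpy as np


def sample_latent(n_cells, n_clusters, x_dim, seed=0):
    """Concatenate a Gaussian part (x_dim) with a one-hot part (K).

    Assumption: the continuous part comes first; the real model may
    order or sample the two parts differently.
    """
    rng = np.random.RandomState(seed)
    continuous = rng.normal(size=(n_cells, x_dim))
    categories = rng.randint(0, n_clusters, size=n_cells)
    onehot = np.eye(n_clusters)[categories]  # one row per cell
    return np.concatenate([continuous, onehot], axis=1)


latent = sample_latent(n_cells=5, n_clusters=12, x_dim=8)
assert latent.shape == (5, 12 + 8)
```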
If the pretrained model was used, the clustering results from the last step are saved directly in results/[dataset]/data_pre.npz, where [dataset] is the name of the scATAC-seq dataset. Note that data_pre.npz (or data_at_xxx.npz) contains the predictions from the H network. The first part holds the embeddings and the second part holds the inferred one-hot labels; one can apply the np.argmax function to the latter to get the cluster labels.
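The np.argmax step described above can be sketched as follows. The key names inside the archive are not documented here, so this sketch assumes the arrays were saved positionally (NumPy then names them arr_0, arr_1, ...) with the embedding first and the one-hot labels second; the function name read_predictions is hypothetical.

```python
import numpy as np


def read_predictions(npz_path):
    """Return (embedding, cluster_labels) from a prediction archive.

    Assumption: the first stored array is the embedding and the second
    is the one-hot (or soft) cluster assignment, one row per cell.
    """
    with np.load(npz_path) as archive:
        keys = sorted(archive.files)
        embedding = archive[keys[0]]
        onehot = archive[keys[1]]
    # argmax over the cluster axis turns each one-hot row into an integer label
    labels = np.argmax(onehot, axis=1)
    return embedding, labels
```

If the archive uses different key names or ordering, inspect np.load(npz_path).files first and index the arrays explicitly.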
Then one can run python eval.py --data [dataset] to analyze the clustering results.
For example, one can run python eval.py --data Splenocyte
The t-SNE visualization plot of the latent features (scDEC_embedding.png), the latent feature matrix (scDEC_embedding.csv), and the inferred cluster labels (scDEC_cluster.txt) will be saved in the results/[dataset] folder.
If the scDEC model was trained from scratch, the results will be marked by a unique timestamp YYYYMMDD_HHMMSS, which records the exact time when the script was run. The outputs from training include:
- Log files and predicted assignments data_at_xxx.npz (xxx denotes the epoch) can be found in the folder results/[dataset]/YYYYMMDD_HHMMSS_x_dim=8_y_dim=20_alpha=10.0_beta=10.0_ratio=0.2.
- Model weights are saved in the folder checkpoint/YYYYMMDD_HHMMSS_x_dim=8_y_dim=20_alpha=10.0_beta=10.0_ratio=0.2.
- The training loss curves are recorded in the folder graph/YYYYMMDD_HHMMSS_x_dim=8_y_dim=20_alpha=10.0_beta=10.0_ratio=0.2, which can be visualized using TensorBoard.
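When scripting around several training runs, it can help to rebuild these folder names from a timestamp and the hyperparameters. The sketch below follows the naming pattern shown above. The helper names are hypothetical, the default values are copied from the example folder name, and whether scDEC formats the floating-point values in exactly this way is an assumption.

```python
import os
from datetime import datetime


def format_timestamp(moment):
    """Format a datetime as YYYYMMDD_HHMMSS."""
    return moment.strftime("%Y%m%d_%H%M%S")


def run_folder_name(timestamp, x_dim=8, y_dim=20, alpha=10.0, beta=10.0, ratio=0.2):
    """Build the run folder name following the pattern in the text above."""
    return "{}_x_dim={}_y_dim={}_alpha={}_beta={}_ratio={}".format(
        timestamp, x_dim, y_dim, alpha, beta, ratio)


def run_paths(dataset, timestamp, **hyperparameters):
    """Return the results, checkpoint, and graph folders for one run."""
    name = run_folder_name(timestamp, **hyperparameters)
    return {
        "results": os.path.join("results", dataset, name),
        "checkpoint": os.path.join("checkpoint", name),
        "graph": os.path.join("graph", name),
    }


paths = run_paths("Splenocyte", format_timestamp(datetime(2020, 9, 10, 14, 32, 8)))
```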
Next, one can run
python eval.py --data [dataset] --timestamp [timestamp] --epoch [epoch] --train [is_train]
[dataset] - the name of the dataset (e.g., Splenocyte)
[timestamp] - the timestamp of the experiment you ran
[epoch] - the epoch whose results should be used (optional)
[is_train] - indicates that the model was trained from scratch
E.g., python eval.py --data All_blood --timestamp 20200910_143208 --train True
The t-SNE visualization plot of the latent features (scDEC_embedding.png), the latent feature matrix (scDEC_embedding.csv), and the inferred cluster labels (scDEC_cluster.txt) will be saved in the same results folder as the log files and predicted assignments.
One can also use scDEC to analyze a custom scATAC-seq dataset, especially when the labels are unknown. First, prepare the raw read count matrix (sc_mat.txt) under the folder datasets/[NAME], where [NAME] denotes the dataset name.
Second, one can run the following command:
python main_clustering.py --data [dataset] --K [nb_of_clusters] --dx [x_dim] --dy [y_dim] --train [is_train] --no_label
[dataset] - the name of the dataset (e.g., Mydataset)
[nb_of_clusters] - the number of clusters (e.g., 6)
[x_dim] - the dimension of the latent space (continuous part)
[y_dim] - the dimension of PCA (default: 20)
[is_train] - indicates training from scratch
For example, one can run CUDA_VISIBLE_DEVICES=0 python main_clustering.py --data Mydataset --K 10 --dx 5 --dy 20 --train True --no_label to cluster a custom dataset.
Then one can run python eval.py --data Mydataset --timestamp YYYYMMDD_HHMMSS --epoch epoch --no_label. Note that the timestamp YYYYMMDD_HHMMSS (from training) and the epoch/batch index epoch (the last training epoch/batch index is recommended) should be provided. The clustering results (cluster assignments) will be saved in the results/Mydataset/YYYYMMDD_HHMMSS_xxx folder.
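Since the last training epoch/batch index is recommended, one way to find it is to scan the run folder for data_at_xxx.npz files and take the largest index. This is a minimal sketch under the assumption that xxx is a plain integer in the file name; the function name latest_prediction_file is hypothetical.

```python
import glob
import os
import re


def latest_prediction_file(run_dir):
    """Return (index, path) of the data_at_<index>.npz file with the largest index.

    Returns (None, None) if the folder contains no matching file.
    """
    pattern = re.compile(r"^data_at_(\d+)\.npz$")
    best_index, best_path = None, None
    for path in glob.glob(os.path.join(run_dir, "data_at_*.npz")):
        match = pattern.match(os.path.basename(path))
        if match is None:
            continue  # ignore files whose suffix is not a plain integer
        index = int(match.group(1))
        if best_index is None or index > best_index:
            best_index, best_path = index, path
    return best_index, best_path
```

The returned index can then be passed to eval.py through the --epoch argument.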
Tutorial Splenocyte Run scDEC on Splenocyte dataset (3166 cells)
Tutorial Full mouse atlas Run scDEC on full Mouse atlas dataset (81173 cells)
Tutorial PBMC10k paired data Run scDEC on PBMC data, which contains around 10k cells with both scRNA-seq and scATAC-seq (labels were manually annotated by the 10x Genomics R&D group)
Feel free to open an issue on GitHub or contact [email protected] if you have any problems running scDEC.
This project is licensed under the MIT License - see the LICENSE.md file for details.