This repository contains the implementation of the ShallowChrome
modeling pipeline presented in the paper:
Frasca F., Matteucci M., Leone M., Morelli M. J. and Masseroli M. "Accurate and highly interpretable prediction of gene expression from histone modifications", BMC Bioinformatics, 2022; 23: 151
available here.
ShallowChrome is a novel computational pipeline for accurate and fully interpretable modeling of epigenetic gene transcriptional regulation operated by Histone Mark (HM) modifications. ShallowChrome leverages the 'peak calling' procedure to retrieve gene-wise, significant and dynamically located HM features that can strongly predict the transcriptional state of genes. In our modeling pipeline we:
- Fit logistic regression models on these extracted features to solve the task of binary classification of gene transcriptional state over 56 cell-types from the REMC database;
- Analyse and rigorously interpret the obtained models by extracting insightful gene-specific regulative patterns;
- Compare the extracted patterns with the characteristic chromatin state emissions from ChromHMM (Ernst et al., 2012), showing that ShallowChrome is able to coherently rank groups of chromatin states w.r.t. their transcriptional activity.
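As a rough illustration of the first step above, the sketch below fits a logistic regression to binary "expressed / not expressed" labels and reads the sign of each coefficient as the direction in which a feature pushes the prediction. It is not the code used in the notebooks: the two feature names, the synthetic data, and the plain gradient-descent fit (standard library only, rather than scikit-learn) are all assumptions made to keep the example self-contained.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.1, epochs=500):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(X, w, b):
    """Return 1 (expressed) where the predicted probability is >= 0.5."""
    return [1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5 else 0
            for xi in X]


# Synthetic toy data: two hypothetical HM features per gene.
# "activating_mark" is higher in expressed genes, "repressive_mark" is lower.
FEATURES = ["activating_mark", "repressive_mark"]
rng = random.Random(0)
X, y = [], []
for _ in range(200):
    expressed = rng.random() < 0.5
    X.append([rng.gauss(2.0 if expressed else 0.0, 1.0),
              rng.gauss(0.0 if expressed else 2.0, 1.0)])
    y.append(1 if expressed else 0)

w, b = fit_logistic(X, y)
accuracy = sum(p == t for p, t in zip(predict(X, w, b), y)) / float(len(y))

# A positive coefficient pushes towards "expressed", a negative one away from it.
for name, coef in zip(FEATURES, w):
    print("%s: %+.3f" % (name, coef))
print("training accuracy: %.3f" % accuracy)
```

Because the model is linear in its inputs, each coefficient can be inspected directly, which is the property the interpretation step relies on.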
Instructions for replicating the paper results are given below.
ShallowChrome/
|-- README.md
|-- LICENSE
|-- .gitignore
|-- notebooks/
| |-- utils.py
| |-- model fitting.ipynb
| |-- model inspection.ipynb
| |-- model validation.ipynb
| |-- model fitting - valley thresholding.ipynb
| |-- data extraction.ipynb
|-- scores/
| |-- DeepChrome_scores.txt
|-- data/
| |-- - splits/
| | |-- iteration_0/
| | |-- iteration_1/
| | |-- iteration_2/
| | ...
| |-- - targets/
| | |-- E003/
| | |-- E004/
| | ...
| |-- cells.csv
| |-- gene_list.txt
| |-- GeneFile.txt
| |-- names.csv
- README.md: this file
- LICENSE: MIT license file
- .gitignore: standard .gitignore file for Python projects
- notebooks/: folder containing Python notebooks to run the modeling pipeline
- notebooks/utils.py: core Python routines called from within the notebooks to perform modeling and analyses
- notebooks/model fitting.ipynb: notebook where ShallowChrome models are fitted to solve binary classification of gene transcriptional state; reproduces Tables 2 and S1 and Figure 2 of the paper
- notebooks/model inspection.ipynb: notebook to inspect ShallowChrome models and to extract and plot gene-wise regulative patterns; reproduces Figure 3 of the paper
- notebooks/model validation.ipynb: notebook to compare ShallowChrome regulative patterns with ChromHMM chromatin state emissions; reproduces Figure 4 of the paper
- notebooks/model fitting - valley thresholding.ipynb: notebook where the classification task is solved with an alternative approach to define target classes; reproduces Table S4 and Figures S2 and S3 of the paper
- notebooks/data extraction.ipynb: notebook to perform data extraction with pygmql and pandas
- scores/: default folder where numerical results from the modeling pipeline are stored
- scores/DeepChrome_scores.txt: test scores from the DeepChrome model (Singh et al., 2016)
- data/: default folder where data and reference files are stored
- data/- splits/: folder containing random split indices for model fitting
- data/- targets/: folder containing RPKM target values for each epigenome
- data/cells.csv: csv file enumerating the 56 epigenomes that are the object of the present study
- data/gene_list.txt: txt file containing the ordered list of genes considered in the present study
- data/GeneFile.txt: txt file containing promoter window coordinates for each of the considered genes
- data/names.csv: csv file enumerating the Histone Mark modifications considered in the present study
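Before launching the notebooks it can help to confirm that the reference files described above are where the pipeline expects them. The following is a minimal sketch, not part of the repository: the helper name `missing_reference_files` is hypothetical, and it assumes the script is run from the repository root so that `data` resolves to the data folder.

```python
import os

# Reference files named in the listing above, relative to the data folder.
REFERENCE_FILES = ["cells.csv", "gene_list.txt", "GeneFile.txt", "names.csv"]


def missing_reference_files(data_dir, expected=REFERENCE_FILES):
    """Return the expected file names that are not present in data_dir."""
    return [name for name in expected
            if not os.path.isfile(os.path.join(data_dir, name))]


if __name__ == "__main__":
    missing = missing_reference_files("data")
    if missing:
        print("Missing from data/: " + ", ".join(missing))
    else:
        print("All reference files found.")
```

The split and target folders are left out of the check on purpose, since their contents depend on which data archive has been unpacked.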
In order to run the ShallowChrome model fitting and analyses (notebooks model fitting.ipynb, model inspection.ipynb, model validation.ipynb, model fitting - valley thresholding.ipynb), the following libraries are required:
- matplotlib
- numpy
- scikit-learn
- scipy
- jupyter
We suggest installing them within a Python virtual environment via pip. Paper results can be reproduced with the following versions on Python 2.7.15:
matplotlib==2.2.4
numpy==1.16.6
scikit-learn==0.20.4
scipy==1.2.2
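One way to confirm that an environment matches these pins is to compare each module's `__version__` attribute against the listed value. The sketch below is an illustration under assumptions rather than a script shipped with the repository: the helper names are hypothetical, and it relies on each library exposing `__version__`, which holds for the four packages above. Note that scikit-learn is imported as `sklearn`.

```python
import importlib

# Pins copied from the list above; keys are import names
# (scikit-learn is imported as "sklearn").
PINS = {
    "matplotlib": "2.2.4",
    "numpy": "1.16.6",
    "sklearn": "0.20.4",
    "scipy": "1.2.2",
}


def installed_version(module_name):
    """Return the module's __version__ string, or None if it is unavailable."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)


def find_mismatches(pins, lookup=installed_version):
    """Map each module whose version differs from its pin to (pinned, found)."""
    mismatches = {}
    for name, pinned in pins.items():
        found = lookup(name)
        if found != pinned:
            mismatches[name] = (pinned, found)
    return mismatches


if __name__ == "__main__":
    for name, (pinned, found) in sorted(find_mismatches(PINS).items()):
        print("%s: pinned %s, found %s" % (name, pinned, found))
```

The `lookup` argument exists only so the comparison can be exercised without the libraries installed; in normal use the default is enough.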
In order to run the de novo data extraction notebook (data extraction.ipynb) the following libraries are required:
- numpy
- gmql
- pandas
- jupyter
We employed the following versions on Python 3.6.12:
numpy==1.16.0
gmql==0.1.1
pandas==1.1.5
NB: gmql additionally requires Java. Please follow the installation procedure here.
1. Download the pre-processed data from here;
2. Uncompress the downloaded ".zip" file in folder "data";
3. Run the notebooks/model fitting.ipynb notebook to reproduce Tables 2 and S1 and Figure 2 of the paper;
4. Run the notebooks/model inspection.ipynb notebook to reproduce Figure 3 of the paper;
5. Run the notebooks/model inspection.ipynb notebook with variable target_only set to False: this will perform model selection and fitting for all epigenomes over the 'standard' DeepChrome split, dumping all fitted models to disk;
6. Run the notebooks/model validation.ipynb notebook to reproduce Figure 4 of the paper;
7. Run the notebooks/model fitting - valley thresholding.ipynb notebook to reproduce Table S4 and Figures S2 and S3 of the paper.

Alternatively:

1. Run the notebooks/data extraction.ipynb notebook to prepare all necessary data for model fitting and analyses;
2. Go to step 3 above.
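The notebooks are meant to be run interactively, but the notebook steps above could also be executed from a script with `jupyter nbconvert`. The following is a sketch under assumptions, not a workflow documented by the repository: the helper names are hypothetical, it assumes the notebooks run unattended from top to bottom, and it does not cover the step that requires setting target_only to False, which still has to be edited in the notebook by hand.

```python
import subprocess

NOTEBOOK_DIR = "notebooks"

# Order follows the notebook steps above.
NOTEBOOKS = [
    "model fitting.ipynb",
    "model inspection.ipynb",
    "model validation.ipynb",
    "model fitting - valley thresholding.ipynb",
]


def nbconvert_command(notebook_name):
    """Build the argument list that executes one notebook in place."""
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute",
        "--inplace",
        # Disable the per-cell timeout, since model fitting may run for a while.
        "--ExecutePreprocessor.timeout=-1",
        notebook_name,
    ]


def run_all(notebooks=NOTEBOOKS, notebook_dir=NOTEBOOK_DIR):
    """Execute each notebook in order, stopping at the first failure."""
    for name in notebooks:
        # Running from the notebooks folder keeps "import utils" resolvable.
        subprocess.check_call(nbconvert_command(name), cwd=notebook_dir)
```

Passing the command as a list rather than a single string matters here because the notebook file names contain spaces. Calling `run_all()` from the repository root would then execute the four notebooks in sequence.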