A statistical method to detect, analyze and visualize aberrations in chromatin accessibility (ATAC-seq, DNase-Seq) and quantify its effect on gene expression.
doi: https://doi.org/10.1101/2023.08.25.554881
Install epiout
and its companion packages with:
pip install epiout
hic-straw
is optional dependency to annotate chromatin interactions with EpiAnnot
:
conda install -c bioconda hic-straw
or
conda install -c conda-forge curl
pip install epiout[hic]
and another optional dependency is onnxruntime
to predict aberrant gene expression from aberrant chromatin accessibility:
pip install epiout[onnx]
Counting chromatin accessibility from ATAC-seq data with EpiCount
:
epicount --bed {bed} --alignments {alignments.tsv} --output_prefix {output_prefix} --cores {threads}
where bed
is a bed file of genomic regions to count accessibility, alignments.tsv
is a tab-delimited file of ATAC-seq alignments, output_prefix
is the prefix of output files, and threads
is the number of threads to use. See epicount --help
for more details.
alignments.tsv
lists bam files of ATAC-seq alignments, one file per line, with the following columns:
path/a.bam
path/b.bam
path/c.bam
File names are used as sample names in the output files. Alternatively, you can use a tab-delimited file with the following columns to specify sample names:
path/a.bam sample_a
path/b.bam sample_b
path/c.bam sample_c
EpiCount
will generate three files: prefix.counts.parquet
, prefix.raw_counts.parquet
, prefix.bed
. The parquet files containing the count matrix. The raw_counts
file is not filtered for replication and counts
file is filtered. bed
file containing replicated the genomic regions across samples. The parquet file can be loaded with pandas
and looking like:
df = pd.read_parquet('output_prefix.parquet')
df
To call outliers with EpiOut, run:
epiout --count_table {prefix.counts.parquet} --output_prefix {output_prefix} --cores {threads}
where count_table
is the output of EpiCount
, output_prefix
is the prefix of output files, and threads
is the number of threads to use. See epiout --help
for more details. You can pass ordinary csv file of count matrix to --count_table
argument where rows are genomic regions and columns are samples.
Output of EpiOut
is prefix.h5ad
file and prefix.results.csv
. h5ad file contains statistics about outliers
from epiout import EpiOutResult
result = EpiOutResult.load('result.h5ad')
# outliers as dataframe
result.outlier
# log adjusted p-values as dataframe
result.log_padj
# results as dataframe alternatively read results.csv file
df_results = result.results()
# Visualize outliers or accessibile regions
result.qq_plot('chr1:100-200')
result.plot_counts('chr1:100-200')
result.plot_volcona('chr1:100-200')
See the documentation of EpiOutResult
for more details.
EpiOut performs hyperparameter optimization to tune optimal bottleneck size of autoencoder. To specifiy the bottleneck size, use --bottleneck_size
arguments.
epiannot_create --tissue {tissue or cell line name} --output_prefix {output_prefix}
where tissue
is the name of tissue or cell line avaliable on ENCODE to fetch, output_prefix
is the prefix of output files where config.yaml
will be created and contains metadata and related files will be downloaded. See epiannot_create --help
for more details:
Also you can check avaliable tissues
or cell lines
:
epiannot_list
To annotate accesible regions and chromatin interactions with EpiAnnot
, run:
epiannot --bed {bed} --gtf {gtf} --counts {prefix.h5ad} --chrom_sizes {chrom_sizes} --output_prefix {output_prefix}
where bed
is a bed file of genomic regions to annotate, gtf
is a gtf file of gene annotations, counts
is the output of EpiOut
in h5ad file format or counts
obtained with EpiCount, chrom_sizes
is a file of chromosome sizes can be generated with pyfaidx from fasta file, and output_prefix
is the prefix of output files. See epiannot --help
for more details.
Output contains prefix.annotation.csv
annotation of genomic regions based on histone marks provided in the config file, prefix.gtf.csv
annotation on regions based on the proximity to genes, prefix.interaction.csv
annotation of chromatin interactions between regions and prefix.genes.csv
indicating that poteintial effected by the aberrant chromatin accessibility.
You can create annotation with your custom config file:
config.yaml
H3K27ac:
- ENCFF817IVB.bed.gz
- ENCFF916FML.bed.gz
H3K4me1:
- ENCFF456GWH.bed.gz
H3K4me3:
- ENCFF867WVM.bed.gz
your_custom_mark:
- a.bed
hic:
- ENCFF311CLH.hic
- ENCFF787ZVA.hic
They keys in the config file are the names of histone marks and the values are the list of bed files of histone marks. The config file can also contain a list of hic files to annotate chromatin interactions. Hic data is optional. The config file can be created with epiannot_create
command or you can use your config file. To call promoter, active and poised enhancers please make sure that you name your histone marks as H3K4me3
, H3K27ac
, and H3K4me1
respectively. An other histone mark or bed files can be used annotate regions. Output prefix.annotation.csv
will have a column for each key in the config file and will incidate if accesible region overlaps with the annotation source.