Starting from alignments (BAM files), pulseRTc can be used to infer genome-wide kinetics of RNA abundance from 4sU-tagging metabolic labeling methods, including biochemical enrichment after thiol-specific biotinylation, and recent approaches such as SLAM-seq, TimeLapse-seq or TUC-seq that rely on bioinformatic enrichment of newly transcribed RNAs.
We use pulseR, a kinetic and statistical modelling framework, to estimate RNA decay rates (half-lives).
Warning
Kinetic models for biochemical separation (with or without spike-ins) and nucleotide conversion protocols are pre-defined for a pulse labeling experiment. These models can be modified to reflect differences in experimental procedures ( e.g. chase experiment), but this requires a bit of R scripting, see models. Currently, the pre-processing workflow for nucleotide conversion protocols splits input alignment (BAM) files into "labelled" (reads with T -> C) and "unlabelled" (old/pre-existing reads without T -> C), according to the presence or not of characteristic mismatches ( after quality filtering, SNPs removal, etc. ).
Clone the git repository
git clone https://github.com/dieterich-lab/pulseRTc.git
cd pulseRTc
Create a conda environment and activate it
# or use mamba...
conda env create --name pulsertc --file environment.yml
conda activate pulsertc
Note
General dependencies include Samtools, BCFtools, and Picard Tools. Picard Tools is optional. For gene abundance quantification, Subread (featureCounts) is also installed. See below for transcript abundance quantification. Comment out corresponding lines in environment.yml, before creating the environment, to skip installation of selected dependencies.
Warning
For transcript abundance quantification, GffRead and Salmon are required, however Salmon is currently NOT installable viaconda
, see here. If you have existing installations, you can specify paths to binaries for both Salmon and GffRead in makefile.vars.
Install additional R packages
# install pulseR
Rscript scripts/R/install_extra_r_pkgs.R
Install pulseRTc. To install the local VCS project in development mode, use the --editable
or -e
option, otherwise
this flag can be ignored
pip --verbose install [-e] . 2>&1 | tee install.log
The workflow is made up of different parts which may or may not all be run sequentially. To get started, cd scripts
, and set all paths, variables, and program options in makefile.vars. None of these paths is required to be located in this repository, i.e. they can be anywhere on your system, as long as they are accessible from the calling context. Most of them are self-explanatory, e.g INDEXLOC
is the path to your FASTA
and GTF
files, etc. You also need to define which commands MKCMD
are used to call programs, i.e. whether you use Slurm, or simply bash, etc.
We also provide examples of SAMPLELIST.TXT (optional) and config.yaml (required) that are defined using variables SAMPLELIST
and YML
, respectively. DATALOC
is the location where these files reside.
Note
Make sure you have permission for all scripts! This can be done e.g. bychmod +x filename
If your BAM files are ready to be processed, the first step is to run BCFtools to get SNPs, see next step. In general, removing duplicate ( e.g due to PCR artefacts) might help to reduce the number of false positives for variant calling.
Warning
BAM files need to be sorted and indexed. They require the MD tag! If they do not, you can trysamtools calmd file.sorted.bam reference.fasta > file.md.sorted.bam
where the reference fasta must be the same as the one used for mapping.
If your BAM files are ready to be processed, then you want to get SNPs. We recommend to use BCFtools, as it is installed in the environment. You need to make sure BCFLOC
and OUTPUT_BCF
are set, and make run-bcftools
. The final location of your BAM files is defined in the config.yaml using the bamloc
key. This is automatically picked-up by makefile.vars.
Once this is done, you need to update the snpdata
key in the config.yaml with the output of BCFtools (or GRAND-SLAM), and set vcf: True
if using the BCFtools output. You need to set parent
and any program options, and make run-workflow
. After completion, you can get some summary statistics using make plot-mm
.
We are now ready to proceed to read abundance quantification, as pulseR needs read counts as input. If we are interested in gene read counts, then we can use featureCounts, and make count-fc
. For transcript quantification, see further below.
To prepare pulseR input from featureCounts tables, we make prep-fc
, then fit the models using make pulse-all-fc-counts
. Before fitting the models, selected time points have to be defined in time_pts.R.
Note
make pulse-all-fc-counts
will fit all time points defined in time_pts.R using allSets. After these results are available, you can fit subsets of time points using timeSets andmake pulse-set-fc-counts
.
Warning
The script prep_fc.R called usingmake prep-fc
is NOT all-purpose, and may need to be adjusted depending on your data. If you already have count tables and associated metadata (R objects), they can be put underpulsedir/featureCounts/data
, wherepulsedir
is given in the config.yaml. The count tables must be namedcounts.rds
(gene by samples), and the metadatasamples.rds
, with columnssample fraction time rep
, wheresample
must match the columns names ofcounts.rds
.
If you have matching results from GRAND-SLAM, you can combine all estimates from pulseR and GRAND-SLAM together by make match-gs
.
We use Salmon for transcript abundance quantification (and GffRead to extract transcript sequences). Variables INDEX_SALMON
, SALMON_PATH
, SALMON_OPTS
, and GFFREAD_PATH
need to be set in makefile.vars. We use Salmon in (quasi-mapping) mapping-based mode with a decoy-aware transcriptome (using the entire genome): the indexing step is independent of the reads, and only needs to be run once for a particular set of reference transcripts. This can be done with get-salmon-index
. This generally results in larger index files, see here for other options. If you already have an index, you can skip this part, but make sure to set INDEX_SALMON
to the parent directory of your index files, which must be located in a sub-directory called index.
Since we already have split BAM files, we only need to convert them back to FASTQ. Salmon does not currently have built-in support for interleaved FASTQ files, so READ1 and READ1 (paired-end sequencing reads) flags are directed to different files. Since our BAM files were coordinate sorted, we shuffle and groups reads beforehand. This is done with make run-bam2fastq
, followed by make run-salmon-qm
.
Some scripts are available e.g. to compare pulseR and GRAND-SLAM results, or perform DGE to check gene expression at varying labeling times vs. 0 (unlabelled), however many of these currently contain hard coded parameters/options.
We use the terminology labelled (new) and unlabelled (old), although this can be inadequate, e.g. a read can be new without being labelled (few Ts present, low incorporation, etc.).
For SNPs estimation, we recommend to use all samples, including 0h time points.
Mapped reads must have the MD attribute (e.g. for STAR --outSAMattributes MD
). This is necessary to find mismatches. Reads are processed using the entire query sequence, including soft-clipped bases. You can use --trim5p
and --trimp3p
to remove mismatches at the ends of the reads, but note that this is done based on the mapped sequence.
Reads are sorted into first and second in pair for quality control. In general, one read is sense, the other one is antisense. We define a read pair to be sense, if the first read is sense, and antisense if the first read is antisense. For stranded libraries, sense pairs map to the + strand, and vice versa. For reverse-stranded libraries, antisense pairs map to the + strand, and vice versa.
To sort reads, all mismatches are reported with respect to the + strand, e.g. a T->C on the + RNA template is reported as T->C, and an A->G on the - RNA template is also reported as T->C. However, for quality control, aligned pairs are reported consistently with the read orientation/protocol at sequencing.