HPV-EM is an HPV genotyping tool that utilizes an expectation maximization algorithm to identify the presence of different HPV genotypes in a sample from RNA-seq data.
First, ensure the prerequisites have been installed on your system and that STAR and SAMtools appear in your PATH. For STAR to align reads against the human genome, you will need the human reference genome in FASTA format and a corresponding annotation file in GTF format (both can be downloaded from Ensembl). You must then generate genome indexes for the STAR aligner using its genomeGenerate run mode (see the STAR documentation for details).
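As a rough sketch of the human index step, the snippet below assembles a STAR genomeGenerate command in Python so that it can be printed for inspection or handed to subprocess. The helper name and all file paths are hypothetical placeholders. The --sjdbGTFfile and --runThreadN options are standard STAR options that this README does not otherwise mention, so check the STAR documentation for the settings appropriate to your data.

```python
import shlex


def star_genome_generate_cmd(genome_dir, fasta, gtf, threads=1):
    """Return the argument list for a STAR genomeGenerate run.

    Hypothetical helper: it only builds the command, it does not run STAR.
    """
    return [
        "STAR",
        "--runMode", "genomeGenerate",
        "--genomeDir", genome_dir,       # output directory for the index
        "--genomeFastaFiles", fasta,     # human reference genome (FASTA)
        "--sjdbGTFfile", gtf,            # matching gene annotation (GTF)
        "--runThreadN", str(threads),
    ]


# Placeholder paths; substitute the files you downloaded.
cmd = star_genome_generate_cmd("human_STAR", "human_genome.fa", "human_annotation.gtf", threads=8)
print(shlex.join(cmd))
# To actually build the index: subprocess.run(cmd, check=True)
```

The resulting index directory is the one later passed to HPV-EM with -s/--stargenome.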
You can then download HPV-EM from https://github.com/jin-wash-u/HPV-EM. Click on "Clone or download" and then click on "Download ZIP". Unzip the contents in the desired location.
STAR must also generate indexes for the viral reference genome. The packaged set of HPV genomes is found at reference/combined_pave_hpv.fa. This default viral reference should be indexed using the command:
$ STAR --runMode genomeGenerate --genomeFastaFiles reference/combined_pave_hpv.fa --genomeSAindexNbases 9 --genomeDir reference/combined_pave_hpv_STAR
If you desire, you can include HPV-EM in your PATH by running:
$ export PATH=$INSTALLDIR:$PATH
where $INSTALLDIR is the folder into which you extracted the HPV-EM files.
Running HPV-EM with the -h option (or --help) will print a description of its optional and required input arguments. A description of each follows.
HPV-EM.py [-h] [-t THREADS] [-r REFERENCE] [-a ANNOTATION] [--starviral STARVIRAL] [-o OUTNAME]
[-d] [--tpm TPM] [-p] [-k] [-v] -s STARGENOME reads1 [reads2]
python HPV-EM.py --threads 40 --reference reference/combined_pave_hpv.fa --annotation reference/hpv_gene_annot.tsv --starviral reference/combined_pave_hpv_STAR/ --outname ./SAMPLE_NAME --stargenome reference/ data/READ1.fastq.gz data/READ2.fastq.gz
- -h, --help
Prints the help menu, which contains an example of the function usage and abbreviated explanations of each of the options.
- -t THREADS, --threads THREADS
Some portions of the HPV-EM pipeline support multithreading; if you wish to use this feature, set this to the number of cores available (default: 1)
- -r REFERENCE, --reference REFERENCE
HPV-EM includes a comprehensive set of HPV reference genomes assembled by PaVE, the use of which is recommended. However, if you wish to supply your own viral reference in FASTA format, use this option with the path to the file
- -a ANNOTATION, --annotation ANNOTATION
A file of viral gene annotations in TSV format to be used in place of the default HPV gene annotations [$INSTALLDIR/reference/hpv_gene_annot.tsv]
- --starviral STARVIRAL
Path to a directory containing STAR-generated viral genome indexes based on the reference FASTA
- -o OUTNAME, --outname OUTNAME
Prefix attached to output file names (default: ./hpvEM)
- -d, --disabledust
By default, HPV-EM filters low-complexity reads using the DUST algorithm. Specify this option to disable filtering
- --tpm TPM
TPM threshold for identifying a true positive HPV genotype (default: 1.48)
- -p, --printem
Causes HPV-EM to print its results to STDOUT
- -k, --keepint
By default, HPV-EM removes files generated by intermediate steps of the pipeline. Specify this option to keep these files
- -v, --version
Print program's version number and exit
- -s STARGENOME, --stargenome STARGENOME
Path to the directory containing STAR-generated human genome index files
- reads1 : A FASTQ file of single-end RNA-seq reads or the first of two paired-end FASTQs
- reads2 : (optional) Second paired-end FASTQ file
Reads files may be gzipped; if so, the filename must end in ".gz".
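The following sketch shows one way to drive HPV-EM from a Python script, for instance when looping over many samples. It uses only options documented above (--threads, --outname, --stargenome). The helper names, the script location, and the sample paths are assumptions made for illustration. The gzip check reflects the ".gz" naming rule by looking at the two-byte gzip signature at the start of a file.

```python
import shlex

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip file


def gz_suffix_ok(path):
    """Return False if the file is gzip-compressed but its name lacks '.gz'."""
    with open(path, "rb") as fh:
        is_gzipped = fh.read(2) == GZIP_MAGIC
    return (not is_gzipped) or str(path).endswith(".gz")


def hpvem_cmd(stargenome, reads1, reads2=None, outname="./hpvEM",
              threads=1, script="HPV-EM.py"):
    """Build the argument list for one HPV-EM run (hypothetical helper)."""
    cmd = ["python", script,
           "--threads", str(threads),
           "--outname", outname,
           "--stargenome", stargenome,
           reads1]
    if reads2 is not None:  # paired-end data: add the second FASTQ
        cmd.append(reads2)
    return cmd


# Placeholder paths; replace them with your own index directory and reads.
cmd = hpvem_cmd("human_STAR", "data/READ1.fastq.gz", "data/READ2.fastq.gz",
                outname="./SAMPLE_NAME", threads=8)
print(shlex.join(cmd))
# Before running, one might call gz_suffix_ok() on each reads file, then:
# subprocess.run(cmd, check=True)
```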
The output of the EM algorithm is written to OUTNAME.results.tsv (and, optionally, printed to stdout). The first two lines indicate how many steps the algorithm took to converge and the maximum likelihood estimate (MLE) of the sequencing error rate. The remaining lines have the format:
HPVtype MappedReads MappedProportion MLE_Reads MLE_Probability
This table lists each HPV type with at least one read mapped to it, giving the number of mapped reads, the proportion of all HPV-mapped reads that mapped to this type, the MLE of the expected number of reads for this HPV type, and the MLE probability of this HPV type.
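For downstream analysis, the results file can be read with a few lines of Python. The sketch below assumes the fields are tab-separated (as the .tsv extension suggests) and keeps the first two lines as raw text, since their exact wording is not specified here. It also skips any row whose numeric fields do not parse, which covers a header row if the file has one. The function name and the sample text are made up for illustration and are not real HPV-EM output.

```python
import csv

COLUMNS = ("HPVtype", "MappedReads", "MappedProportion",
           "MLE_Reads", "MLE_Probability")


def parse_results(text):
    """Split results text into (first two lines, list of per-type dicts)."""
    lines = text.splitlines()
    preamble, body = lines[:2], lines[2:]
    rows = []
    for fields in csv.reader(body, delimiter="\t"):
        if len(fields) != len(COLUMNS):
            continue  # blank or unexpected line
        try:
            numbers = [float(x) for x in fields[1:]]
        except ValueError:
            continue  # non-numeric row, e.g. a header line
        rows.append(dict(zip(COLUMNS, [fields[0]] + numbers)))
    return preamble, rows


# Made-up example text, only to show the expected shape of the input.
example = (
    "first summary line\n"
    "second summary line\n"
    "HPV16\t90\t0.9\t95.0\t0.95\n"
    "HPV18\t10\t0.1\t5.0\t0.05\n"
)
preamble, rows = parse_results(example)
for row in rows:
    print(row["HPVtype"], row["MLE_Probability"])
```

In practice the text would come from reading OUTNAME.results.tsv, for example with `open(path).read()`.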
In addition to the results table, HPV-EM also generates read coverage maps for each HPV type with a non-zero MLE probability (OUTNAME.*.cov.pdf) and a visualization of the difference between the mapped read proportions and MLE probabilities (OUTNAME.props.pdf). The reads aligned to the HPV references are also recorded in BAM format (OUTNAME.aligned.*.bam).
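When collecting results from many runs, the file name patterns above can be matched directly. This is a small sketch using Python's glob module. The function name is hypothetical, and the patterns are taken verbatim from the descriptions above with OUTNAME replaced by the prefix given to -o/--outname.

```python
import glob


def hpvem_outputs(outname):
    """Group the files HPV-EM wrote for one output prefix."""
    # glob.escape protects any wildcard characters in the prefix itself.
    prefix = glob.escape(outname)
    return {
        "results": glob.glob(prefix + ".results.tsv"),
        "coverage": sorted(glob.glob(prefix + ".*.cov.pdf")),
        "props": glob.glob(prefix + ".props.pdf"),
        "bam": sorted(glob.glob(prefix + ".aligned.*.bam")),
    }


# "./hpvEM" is the default prefix; lists are empty if nothing has been run yet.
for kind, paths in hpvem_outputs("./hpvEM").items():
    print(kind, paths)
```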