CZ ID Consensus Genome workflow

This workflow performs reference-based consensus genome mapping from metagenomic sequencing assays with spiked primer enrichment or from amplicon sample sequencing assays.

The workflow has several modes:

  • Oxford Nanopore SARS-CoV-2 samples
  • Illumina SARS-CoV-2 samples
  • Illumina samples of other viruses
  • Illumina reads isolated from metagenomic samples via the CZ ID mngs workflow

Based on original work at:

With enhancements and additional modules by the CZI Infectious Disease team.

Running Consensus-Genome locally

The Consensus Genome workflow can be run locally by following steps similar to those for short-read-mngs, which are described in the wiki page Running-WDL-workflows-locally.

We first build a local Docker container image with the consensus genome workflow:

./scripts/docker-build.sh workflows/consensus-genome -t czid-consensus-genome

TIPS: For more detailed setup information

Run

We then run the workflow with our local test configuration file, which points to the smaller reference files in CZ ID's public references:

miniwdl run --verbose czid-workflows/consensus-genome/run.wdl \
    docker_image_id=czid-consensus-genome \
    fastqs_0=czid-workflows/consensus-genome/test/sample_sars-cov-2_paired_r1.fastq.gz \
    fastqs_1=czid-workflows/consensus-genome/test/sample_sars-cov-2_paired_r2.fastq.gz \
    sample=sample_sars-cov-2_paired \
    technology=Illumina \
    ref_fasta=s3://czid-public-references/consensus-genome/MN908947.3.fa \
    -i czid-workflows/consensus-genome/test/local_test.yml

Where:

  • docker_image_id should be set to the Docker image tag you used when building the image (in our example, czid-consensus-genome).
  • czid-workflows/consensus-genome/run.wdl is the WDL file for the consensus genome workflow.
  • fastqs_0 and fastqs_1 are the pair of FASTQ files. The ones referred to here are small test files suitable for running locally.
  • sample is the name used to refer to the sample in the output files.
  • technology is the sequencing technology, either Illumina or ONT.
  • ref_fasta is the reference genome FASTA; the example uses the SARS-CoV-2 reference MN908947.3 (see Reference files).
  • local_test.yml supplies boilerplate workflow inputs, such as the S3 paths for the reference databases. For local runs, we use lighter references:
    • The human database for host removal contains only chromosome 1.
    • The Kraken2 database used locally contains only coronavirus sequences.
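
Because a miniwdl run can take a while to fail on a bad input, it can help to check the arguments first. The following is a minimal sketch of two pre-flight helpers. The function names validate_technology and require_files are hypothetical and not part of the czid-workflows repository, and the sketch assumes the technology value must match Illumina or ONT exactly as written above.

```shell
# Hypothetical pre-flight helpers; not part of the czid-workflows repository.

# Succeeds only for the two technology values documented above.
# Assumption: the workflow expects these exact, case-sensitive strings.
validate_technology() {
    case "$1" in
        Illumina|ONT)
            return 0
            ;;
        *)
            echo "unsupported technology: $1 (expected Illumina or ONT)" >&2
            return 1
            ;;
    esac
}

# Succeeds only if every path given exists as a non-empty file.
require_files() {
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty input file: $f" >&2
            return 1
        fi
    done
    return 0
}
```

A possible use is to call validate_technology Illumina and require_files on the two FASTQ paths, and to invoke miniwdl run only if both succeed.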

Reference files

| Filename | Provenance |
| --- | --- |
| s3://czid-public-references/consensus-genome/MN908947.3.fa | Downloaded from https://www.ncbi.nlm.nih.gov/nuccore/MN908947 in July 2020. |
| s3://czid-public-references/consensus-genome/ampliseq_primers.bed | The .bed file was obtained from the Illumina Ampliseq protocol documentation https://www.illumina.com/products/by-brand/ampliseq/community-panels/sars-cov-2.html on 2021-01-26. |
| s3://czid-public-references/consensus-genome/artic_v3_primers.bed | The .bed file was obtained from the CZ Biohub sc2 pipeline repository: https://raw.githubusercontent.com/czbiohub/sc2-illumina-pipeline/master/data/nCoV-2019.bed in July 2020. The master file can be downloaded from the ARTIC network here: https://github.com/artic-network/fieldbioinformatics/blob/master/test-data/primer-schemes/nCoV-2019/V3/nCoV-2019.bed |
| s3://czid-public-references/consensus-genome/artic_v3_short_275_primers.bed | The .bed file was received from scientists at UCSF on 2021-03-11 and links to this protocol: https://www.protocols.io/view/covid-19-artic-v3-illumina-library-construction-an-bh4zj8x6 |
| s3://czid-public-references/consensus-genome/combined_msspe_artic_primers.bed | The .bed file was obtained from scientists at the CZ Biohub on 2021-01-26. |
| s3://czid-public-references/consensus-genome/ercc_sequences.fasta | The ERCC sequence file was obtained from the CZ Biohub sc2 pipeline repository: https://github.com/czbiohub/sc2-illumina-pipeline/blob/cd37a25cdf3c0260d082bd0146daa5e192704893/data/ercc_sequences.fasta in July 2020. The initial sequences can be downloaded from ThermoFisher here: https://assets.thermofisher.com/TFS-Assets/LSG/manuals/ERCC92.zip |
| s3://czid-public-references/consensus-genome/hg38.fa.gz | The human genome was downloaded from http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz in July 2020. |
| s3://czid-public-references/consensus-genome/human_chr1.fa | The test file was obtained from the CZ Biohub sc2 pipeline repository: https://github.com/czbiohub/sc2-illumina-pipeline/blob/master/data/human_chr1.fa in July 2020. |
| s3://czid-public-references/consensus-genome/kraken2_h+v_20200319.tar.gz | Kraken2 database of SARS-CoV-2 + human, downloaded from https://genexa.ch/sars2-bioinformatics-resources/ in July 2020. This resource is no longer kept up to date. |
| s3://czid-public-references/consensus-genome/kraken_coronavirus_db_only.tar.gz | The smaller Kraken2 database was taken from the CZ Biohub sc2 pipeline configuration in July 2020. |
| s3://czid-public-references/consensus-genome/msspe_primers.bed | The primer .bed file was initially generated by scientists at CZ Biohub and was taken from the Biohub sc2 pipeline https://raw.githubusercontent.com/czbiohub/sc2-illumina-pipeline/master/data/SARS-COV-2_spikePrimers.bed in July 2020. |
| s3://czid-public-references/consensus-genome/msspe_primers-v2.bed | The primer .bed file was updated by scientists at CZ Biohub to flip the orientation of the primer sequences. |
| s3://czid-public-references/consensus-genome/snap_primers.bed | The .bed file was obtained from Swift Biosciences representatives in December 2020. It can also be found here: https://swiftbiosci.com/wp-content/uploads/2020/09/sarscov2_v1_masterfile.txt.zip |
| s3://czid-public-references/consensus-genome/covidseq_primers.bed | The .bed file matches the ARTIC V1 file, which can be found here: https://raw.githubusercontent.com/artic-network/artic-ncov2019/master/primer_schemes/nCoV-2019/V1/nCoV-2019.bed |
| s3://czid-public-references/consensus-genome/neb_vss1a.primer.bed | The .bed file was downloaded from https://github.com/nebiolabs/VarSkip |
| s3://czid-public-references/consensus-genome/easyseq.bed | Downloaded from https://www.nimagen.com/shop/products/rc-cov096/easyseq-sars-cov-2-novel-coronavirus-whole-genome-sequencing-kit |
| s3://czid-public-references/consensus-genome/vadr-models-corona-1.1.3-1.tar.gz | Downloaded from https://ftp.ncbi.nlm.nih.gov/pub/nawrocki/vadr-models/coronaviridae/1.2-1/vadr-models-corona-1.2-1.tar.gz on 2021-03-05. |
| s3://czid-public-references/consensus-genome/artic-primer-schemes.tar.gz | The primer_schemes directory of https://github.com/artic-network/artic-ncov2019/commit/7e359dae37d894b40ae7e35c3582f14244ef4d36 |
| test/MT007544.fastq.gz | Copied from https://github.com/artic-network/fieldbioinformatics/blob/master/test-data/MT007544/MT007544.fastq on 2021-03-06. |
| test/trim_primers_input/snap_aligned_reads.bam | Derived from test/snap_top10k_R* files. |
| test/trim_primers_input/tailedseq_aligned_reads.bam | Derived from test/tailedseq_top10k_R* files. |
| test/vadr_input/really-long-name-consensus.fa | Derived from test/sample_sars-cov-2_paired_r* files. |
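
To inspect one of the S3 reference files outside the workflow, one option is to translate its s3:// URI into an HTTPS URL. The sketch below uses the standard virtual-hosted-style S3 URL form. The helper name s3_to_https is hypothetical, and the sketch assumes the bucket allows anonymous reads, which this document does not confirm.

```shell
# Hypothetical helper: convert s3://bucket/key into a virtual-hosted-style
# HTTPS URL. Assumption: the bucket permits anonymous (unsigned) reads.
# Keys containing special characters such as "+" may need percent-encoding,
# which this sketch does not handle.
s3_to_https() {
    s3_uri=$1
    case "$s3_uri" in
        s3://*/*) ;;
        *)
            echo "not an s3://bucket/key URI: $s3_uri" >&2
            return 1
            ;;
    esac
    s3_rest=${s3_uri#s3://}      # strip the scheme
    s3_bucket=${s3_rest%%/*}     # text before the first slash
    s3_key=${s3_rest#*/}         # text after the first slash
    printf 'https://%s.s3.amazonaws.com/%s\n' "$s3_bucket" "$s3_key"
}
```

For example, curl -fO "$(s3_to_https s3://czid-public-references/consensus-genome/MN908947.3.fa)" would attempt the download over HTTPS. With the AWS CLI installed, aws s3 cp --no-sign-request followed by the s3:// URI and a destination path is an alternative under the same anonymous-access assumption.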

More Information

For more information, including a screencast of this example, see the learn-miniwdl open-source course.

Troubleshooting

  • When running on Docker Desktop for macOS, you must uncheck the "Use gRPC FUSE for file sharing" option.