MetONTIIME is a Meta-barcoding pipeline for analysing ONT data in QIIME2 framework. Starting from v2.0.0, the pipeline is based on Nextflow, to allow for easier installation and better execution monitoring.
Prerequisites
Installation
git clone https://github.com/MaestSi/MetONTIIME.git
cd MetONTIIME
chmod 755 *
The MetONTIIME pipeline requires you to open metontiime2.conf configuration file and set the desired options. Then, you can run the pipeline using either docker or singularity environments just specifying a value for the -profile variable.
Usage:
nextflow -c metontiime2.conf run metontiime2.nf --workDir="/path/to/workDir" --resultsDir="/path/to/resultsDir" -profile docker
Mandatory argument:
-profile Configuration profile to use. Available: docker, singularity
Other mandatory arguments which may be specified in the metontiime2.conf file
--workDir Path to working directory including fastq.gz files
--sampleMetadata Path to sample metadata tsv file; if it doesn't exist yet, it is created at runtime
--dbSequencesFasta Path to database file with sequences in fasta format
--dbTaxonomyTsv Path to database file with sequence id-to-taxonomy correspondence in tsv format
--dbSequencesQza Database file name with sequences as QIIME2 artifact (qza)
--dbTaxonomyQza Database file name with sequence id-to-taxonomy correspondence as QIIME2 artifact (qza)
--classifier Taxonomy classifier, available: VSEARCH, Blast
--maxNumReads Maximum number of reads per sample; if one sample has more than maxNumReads, random downsampling is performed
--minReadLength Minimum length (bp) for a read to be retained
--maxReadLength Maximum length (bp) for a read to be retained
--minQual Minimum average PHRED score for a read to be retained
--extraEndsTrim Number of bases to be trimmed at both ends
--clusteringIdentity Identity for de novo clustering [0-1]
--maxAccepts Maximum number of candidate hits for each read, to be used for consensus taxonomy assignment
--minConsensus Minimum fraction of assignments must match top hit to be accepted as consensus assignment [0.5-1]
--minQueryCoverage Minimum query coverage for an alignment to be considered a candidate hit [0-1]
--minIdentity Minimum alignment identity for an alignment to be considered a candidate hit [0-1]
--taxaLevelDiversity Taxonomy level at which you want to perform non phylogeny-based diversity analyses
--numReadsDiversity Max num. reads for diversity analyses
--taxaOfInterest Taxa of interest that you want to retain and to focus the analysis on
--minNumReadsTaxaOfInterest Minimum number of reads assigned to Taxa of interest to retain a sample
--resultsDir Path to directory containing results
MetONTIIME pipeline allows the users to choose the database according to the marker gene sequenced and to their preferences. Some marker gene reference database, as SILVA (16S/18S rRNA), GreenGenes (16SrRNA) and UNITE (fungal ITS) are already formatted for use with QIIME2, as they are available as a pair of sequences (fasta) and taxonomy (tsv) files, and they can therefore be easily imported as QIIME2 artifacts (qza). In case you downloaded a fasta file from NCBI and you want to obtain the corresponding taxonomy tsv file, you can use TaxonomyTsv_from_fastaNCBI.R script. This R script requires an R installation with taxize and Biostrings packages installed. For example, if you want to use the same database used by the EPI2ME 16S workflow for bacterial 16S gene, you can go to BioProject 33175, click send to, select Complete Record and File, set the Format to FASTA and then click Create File; the corresponding taxonomyTsv file can then be created with:
Rscript /path/to/TaxonomyTsv_from_fastaNCBI.R \
dbSequencesFasta="/path/to/input/dbSequences.fasta" \
dbTaxonomyTsv="./path/to/output//dbTaxonomy.tsv" \
ENTREZ_KEY="myentrezkey"
The optional ENTREZ_KEY argument allows speeding up data retrieval from NCBI. You can get your own ENTREZ_KEY following the instructions reported here.
The pipeline is composed of a set of processes. They can be optionally turned-off by setting them to "false" in the metontiime2.conf file.
- importDb: import a fasta file with sequences dbSequencesFasta and a tsv file with sequence ids and multi-level taxonomy (with each level separated by ';') dbTaxonomyTsv as a pair of QIIME2 artifacts dbSequencesQza and dbTaxonomyQza.
- concatenateFastq: in case workDir is the output directory generated by MinKNOW, this process concatenates all fastq files corresponding to each barcode (in workDir/barcode<num>) and compresses them to fastq.gz; if workDir already contains fastq.gz files for each barcode, set the process to "false".
- filterFastq: filter fastq.gz files based on length (minReadLength, maxReadLength) and quality (minQual). Moreover, trim extraEndsTrim bases from both sides.
- downsampleFastq: cap the amount of sequencing reads for each sample to maxNumReads.
- importFastq: import filtered fastq.gz files as QIIME2 artifacts.
- dataQC: evaluate sequencing reads quality/length statistics.
- derepSeq: perform clustering at clusteringIdentity identity (in case clusteringIdentity=1 , perform dereplication only), and obtain a set of representative sequences and their abundance.
- assignTaxonomy: assign taxonomy to representative sequences using classifier classifier, retrieve up to maxAccepts hits filtered by minIdentity and minQueryCoverage, and perform consensus taxonomy assignment.
- collapseTables: collapse feature tables at the available taxonomy levels.
- filterTaxa: retain only reads from taxaOfInterest and discard samples with less than minNumReadsTaxaOfInterest assigned to that taxa. In case you do not want to focus the analysis on a specific taxa, set the process to "false".
- taxonomyVisualization: produce barplots describing the relative abundance of all taxa, all taxa excluding "Unclassified" reads, and of taxaOfInterest (if any).
- diversityAnalyses: evaluate non-phylogenetic alpha- and beta-diversity indexes at taxaLevelDiversity level, either for all taxa or for taxaOfInterest, downsampling each sample at numReadsDiversity reads. Produce also alpha-rarefaction curves.
All .qzv and .qza artifacts can be visualized importing them to QIIME2 View.
In particular, you could visualize an interactive multi-sample taxonomy barplot, describing the composition of each sample at the desired taxonomic level, and a PCA plot of Beta-diversity among samples.
A demo dataset composed of 1000 reads named Zymo-GridION-EVEN-BB-SN_sup_pass_filtered_27F_1492Rw_1000_reads.fastq.gz is available. This dataset was obtained re-basecalling with Guppy v 6.2.1 "sup" the dataset generated by LomanLab sequencing Zymo Community Standards 2 (Even) Batch ZRC190633 mock community with R9.4.1 chemistry on a GridION device. The portion of reads corresponding to 16S gene was then extracted using in-silico PCR with 27F-1492Rw primers pair.
MetONTIIME is a Nextflow pipeline based on QIIME2.
If this tool is useful for your work, please consider citing our manuscript.
Matoute, A.; Maestri, S.; Saout, M.; Laghoe, L.; Simon, S.; Blanquart, H.; Hernandez Martinez, M.A.; Pierre Demar, M. Meat-Borne-Parasite: A Nanopore-Based Meta-Barcoding Work-Flow for Parasitic Microbiodiversity Assessment in the Wild Fauna of French Guiana. Curr. Issues Mol. Biol. 2024, 46, 3810-3821. https://doi.org/10.3390/cimb46050237
Please, refer to the following manuscripts for further information.
Bolyen E, Rideout JR, Dillon MR, Bokulich NA, Abnet CC, Al-Ghalith GA, Alexander H, Alm EJ, Arumugam M, Asnicar F, Bai Y, Bisanz JE, Bittinger K, Brejnrod A, Brislawn CJ, Brown CT, Callahan BJ, Caraballo-Rodríguez AM, Chase J, Cope EK, Da Silva R, Diener C, Dorrestein PC, Douglas GM, Durall DM, Duvallet C, Edwardson CF, Ernst M, Estaki M, Fouquier J, Gauglitz JM, Gibbons SM, Gibson DL, Gonzalez A, Gorlick K, Guo J, Hillmann B, Holmes S, Holste H, Huttenhower C, Huttley GA, Janssen S, Jarmusch AK, Jiang L, Kaehler BD, Kang KB, Keefe CR, Keim P, Kelley ST, Knights D, Koester I, Kosciolek T, Kreps J, Langille MGI, Lee J, Ley R, Liu YX, Loftfield E, Lozupone C, Maher M, Marotz C, Martin BD, McDonald D, McIver LJ, Melnik AV, Metcalf JL, Morgan SC, Morton JT, Naimey AT, Navas-Molina JA, Nothias LF, Orchanian SB, Pearson T, Peoples SL, Petras D, Preuss ML, Pruesse E, Rasmussen LB, Rivers A, Robeson MS, Rosenthal P, Segata N, Shaffer M, Shiffer A, Sinha R, Song SJ, Spear JR, Swafford AD, Thompson LR, Torres PJ, Trinh P, Tripathi A, Turnbaugh PJ, Ul-Hasan S, van der Hooft JJJ, Vargas F, Vázquez-Baeza Y, Vogtmann E, von Hippel M, Walters W, Wan Y, Wang M, Warren J, Weber KC, Williamson CHD, Willis AD, Xu ZZ, Zaneveld JR, Zhang Y, Zhu Q, Knight R, and Caporaso JG. 2019. Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nature Biotechnology 37: 852–857. https://doi.org/10.1038/s41587-019-0209-9
Di Tommaso, P., Chatzou, M., Floden, E. et al. Nextflow enables reproducible computational workflows. Nat Biotechnol 35, 316–319 (2017). https://doi.org/10.1038/nbt.3820
For further information and insights into pipeline development, please have a look at my doctoral thesis.
Maestri, S (2021). Development of novel bioinformatic pipelines for MinION-based DNA barcoding (Doctoral thesis, Università degli Studi di Verona, Verona, Italy). Retrieved from https://iris.univr.it/retrieve/handle/11562/1042782/205364/.