4.3.0.0
Download release: gatk-4.3.0.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/
Highlights of the 4.3.0.0 release:
- Support for the Ultima Genomics flow-based sequencing platform
- A next-generation suite of tools for variant filtration based on site-level annotation, intended to eventually supersede the older `VariantRecalibrator` workflow
- `CompareReferences` and `CheckReferenceCompatibility`: new tools for comparing and checking compatibility with genomic references
- Support in `HaplotypeCaller`/`Mutect2` for supplementing the variants discovered in local assembly with variants discovered via a pileup-based approach
Full list of changes:
- Support for the Ultima Genomics flow-based sequencing platform (#7876)
  - Added a new `--flow-mode` argument to `HaplotypeCaller` which better supports flow-based calling
    - Added a new Haplotype Filtering step after assembly which removes suspicious haplotypes from the genotyper
    - Added two new likelihood models, `FlowBasedHMM` and the `FlowBasedAlignmentLikelihoodEngine`
  - Added a new `--flow-mode` argument to `Mutect2` which better supports flow-based calling
  - Added support for uncertain read end-positions in `MarkDuplicatesSpark`
  - Added a new tool `FlowFeatureMapper` for quick heuristic calling of BAMs for diagnostics
  - Added a new tool `GroundTruthReadsBuilder` to generate ground truth files for basecalling
  - Added a new diagnostic tool `HaplotypeBasedVariantRecaller` for recalling VCF files using the `HaplotypeCallerEngine`
  - Added a new tool, `SplitCram`, for breaking up CRAM files by their blocks
  - Added a new read interface called `FlowBasedRead` that manages the new features for flow-based data
  - Added a number of flow-specific read filters
  - Added a number of flow-specific variant annotations
  - Added support for read annotation-clipping as part of `ClipReads` and `GATKRead`
  - Added a new `PartialReadsWalker` that supports terminating before traversal is finished
- Next-generation suite of tools for variant filtration based on site-level annotations (#7954) (#8049)
  - This tool suite is intended to eventually supersede the older `VariantRecalibrator` workflow
  - The new tools include:
    - `ExtractVariantAnnotations`: extracts site-level variant annotations, labels, and other metadata from a VCF file to HDF5 files
    - `TrainVariantAnnotationsModel`: trains a model for scoring variant calls based on site-level annotations
    - `ScoreVariantAnnotations`: scores variant calls in a VCF file based on site-level annotations using a previously trained model
- New Reference Comparison Tools
  - `CompareReferences`: a new tool for analyzing the differences between references at both the dictionary and the base level (#7930) (#7987) (#7973)
    - In its default mode, this tool uses the reference dictionaries to generate an MD5-keyed table comparing the specified references, and summarizes the differences between them.
    - Comparisons are made against a "primary" reference, specified with the `-R` argument. Subsequent references to be compared may be specified using the `--references-to-compare` argument.
    - A supplementary table keyed by sequence name can be displayed using the `--display-sequences-by-name` argument; to display only sequence names for which the references are not consistent, run with the `--display-only-differing-sequences` argument as well.
    - MD5s can be recalculated from the actual sequence when missing from the dictionary.
    - When run with `--base-comparison FULL_ALIGNMENT`, the tool performs full-sequence alignment on the differing reference sequences to produce a VCF with SNPs and indels. However, this mode ignores IUPAC / N bases.
    - Running with `--base-comparison FIND_SNPS_ONLY` finds single-base differences between differing reference sequences of the same length. This mode can handle IUPAC / N bases correctly, but not indels.
    - To perform the full-sequence alignment, GATK now packages a distribution of `MUMmer` for x86_64 Mac and Linux, which can be invoked from within GATK using the new `MummerExecutor` class.
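The idea behind the MD5-keyed table in the default `CompareReferences` mode can be sketched in a few lines of Python. This is only an illustration, not GATK's implementation: the record layout (`name`, `length`, `md5`, `bases`) and the helper names are hypothetical, and the MD5 is assumed to be computed over the upper-cased sequence, following the SAM `M5` convention.

```python
import hashlib
from collections import defaultdict


def sequence_md5(bases):
    """MD5 of the upper-cased bases (assumed here to follow the SAM M5 convention)."""
    return hashlib.md5(bases.upper().encode("ascii")).hexdigest()


def md5_keyed_table(references):
    """Build {md5: {reference name: sequence name}}.

    `references` maps a reference name to a list of sequence records.
    Each record is a dict with "name", "length", and either a precomputed
    "md5" or the raw "bases" from which the MD5 is recalculated.
    """
    table = defaultdict(dict)
    for ref_name, records in references.items():
        for rec in records:
            md5 = rec.get("md5") or sequence_md5(rec["bases"])
            table[md5][ref_name] = rec["name"]
    return dict(table)


def differing_md5s(table, reference_names):
    """MD5s that do not appear in every one of the compared references."""
    expected = set(reference_names)
    return sorted(md5 for md5, row in table.items() if set(row) != expected)


# Toy data: the first sequence is identical under two names, the second differs.
refs = {
    "primary": [{"name": "chr1", "length": 4, "bases": "ACGT"},
                {"name": "chrM", "length": 4, "bases": "GGCC"}],
    "other": [{"name": "1", "length": 4, "bases": "acgt"},
              {"name": "MT", "length": 4, "bases": "GGCA"}],
}
table = md5_keyed_table(refs)
```

Keying on the MD5 rather than on the sequence name lets identical sequences line up even when the references name them differently (`chr1` versus `1` in the toy data), while sequences whose content differs end up in separate rows.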
  - `CheckReferenceCompatibility`: a new tool to check a BAM/CRAM/VCF for compatibility against a set of references (#7959) (#7973)
    - This tool generates a table analyzing the compatibility of a BAM/CRAM/VCF input file against provided references.
    - The tool compares BAM/CRAMs (specified using the `-I` argument) as well as VCFs (specified using the `-V` argument) against the provided reference(s), specified using the `--references-to-compare` argument.
    - When MD5s are present, the tool decides compatibility based on all sequence information (MD5, name, length); when MD5s are missing, the tool makes compatibility calls based only on sequence name and length.
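The MD5 fallback rule described for `CheckReferenceCompatibility` can be illustrated with a small Python sketch. The `SeqRecord` type and both functions are hypothetical, and the definition of "compatible" used here (every input sequence must have a matching record in the reference) is an assumption made for the example, not a statement of how the tool classifies its results.

```python
from typing import NamedTuple, Optional


class SeqRecord(NamedTuple):
    """One sequence dictionary entry; md5 is None when the dictionary lacks it."""
    name: str
    length: int
    md5: Optional[str] = None


def records_match(a, b):
    """Compare name and length always; compare MD5 only when both sides have one."""
    if a.name != b.name or a.length != b.length:
        return False
    if a.md5 is not None and b.md5 is not None:
        return a.md5 == b.md5
    return True  # MD5 missing on at least one side: fall back to name and length


def is_compatible(input_dict, reference_dict):
    """Assumed rule: every input sequence must match a reference sequence."""
    by_name = {rec.name: rec for rec in reference_dict}
    return all(
        rec.name in by_name and records_match(rec, by_name[rec.name])
        for rec in input_dict
    )


reference = [SeqRecord("chr1", 100, "aaa"), SeqRecord("chr2", 50, "bbb")]
```

With this toy reference, an input record `SeqRecord("chr1", 100)` without an MD5 is accepted on name and length alone, whereas `SeqRecord("chr1", 100, "zzz")` is rejected because the MD5s disagree.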
- HaplotypeCaller/Mutect2
  - Added an optional "Pileup Detection" step to `Mutect2` and `HaplotypeCaller` before assembly that supplements the variants from local assembly with variants that show up in the pileups (#7432)
  - Fixed a `Mutect2` `IndexOutOfBoundsException` with germline resource (#7979)
  - `Mutect3` dataset enhancements: optional truth VCF for labels, seq error likelihood annotation (#7975)
  - Added `Mutect3` dataset generation to the `Mutect2` WDL (#7992)
  - `GetPileupSummaries` now streams its output rather than storing it in memory (#7664)
  - Fixed a rare edge case in the `AdaptiveChainPruner` where the ordering of the Java `PriorityQueue` is undefined for tied elements (#7851)
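The `AdaptiveChainPruner` item above concerns priority queues that give no guaranteed order for elements with equal priority. A common general remedy is to add an explicit secondary key so that ties resolve deterministically. The Python sketch below shows that technique with `heapq`; it is an analogy only and does not reproduce the actual change in #7851, and the function name and the use of insertion order as the tie-breaker are assumptions for illustration.

```python
import heapq
import itertools


def pop_order(scored_items):
    """Return items from lowest to highest score; equal scores keep insertion order.

    `scored_items` is an iterable of (score, item) pairs. A running counter is
    stored between the score and the item so that ties are resolved by the
    counter and the items themselves are never compared.
    """
    counter = itertools.count()
    heap = [(score, next(counter), item) for score, item in scored_items]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


chains = [(1.0, "b"), (0.5, "c"), (1.0, "a")]
ordered = pop_order(chains)
```

Here `"b"` and `"a"` share a score of 1.0, and the counter guarantees that `"b"`, which was inserted first, is always popped before `"a"`.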
- SV Calling
  - `CondenseDepthEvidence`: a new tool that combines adjacent intervals in DepthEvidence files (#7926)
  - `LocusDepthtoBAF`: a new tool that merges locus-sorted LocusDepth evidence files, calculates the bi-allelic frequency (baf) for each sample and site, and writes these values as a BafEvidence output file (#7776)
  - `PrintReadCounts`: a new tool that prints (and optionally subsets) a read depth (DepthEvidence) file or a counts file as one or more (for multi-sample DepthEvidence files) counts files for CNV determination (#8015)
  - `CollectSVEvidence`: fixed a bug where trailing SNP sites and depth intervals without read coverage were being omitted from the output (#8045)
  - `CollectSVEvidence`: added read depth generation and raw-counts output (#8015)
  - Improved `PrintSVEvidence` performance by tweaking the `MultiFeatureWalker` traversal (#7869)
  - Fixes related to `BafEvidence` (biallelic-frequency of a sample at some locus) (#7861)
  - Fixed a bug where the end coordinate was being incorrectly compared when sorting discordant read pair evidence (#7835)
  - Sort output from `SVClusterEngine` (#7779)
  - Remove abandoned SV filtering project and unneeded build dependency (#7950)
- CNV Calling
- GenomicsDB
  - `GenomicsDBImport`: added the ability to specify explicit index locations via the sample name map file (#7967)
    - Each line in the sample name map file may now optionally contain a third column with the path/URI to the index. This is useful when the index is not in the same location as the corresponding GVCF.
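To make the sample name map change concrete, here is a minimal Python sketch of a reader that accepts both the two-column and the new three-column form. It assumes the map is tab-delimited with the sample name first and the GVCF path second; the function name, the error message, and the example paths are hypothetical and are not taken from GATK.

```python
def parse_sample_map(text):
    """Parse sample map lines into (sample, gvcf, index) tuples.

    Each non-blank line has two or three tab-separated columns:
    sample name, GVCF path/URI, and an optional index path/URI.
    The index is None when the third column is absent.
    """
    entries = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) not in (2, 3):
            raise ValueError(
                f"line {line_number}: expected 2 or 3 tab-separated columns, "
                f"got {len(fields)}"
            )
        index = fields[2] if len(fields) == 3 else None
        entries.append((fields[0], fields[1], index))
    return entries


# Hypothetical paths: the second sample keeps its index in a separate location.
example = (
    "sample1\tgvcfs/sample1.g.vcf.gz\n"
    "sample2\tgvcfs/sample2.g.vcf.gz\tindexes/sample2.g.vcf.gz.tbi\n"
)
entries = parse_sample_map(example)
```

In this sketch a missing third column is represented as `None`, which a caller could treat as "look for the index next to the GVCF".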
- Bug Fixes
  - Fixed an issue where we weren't properly merging AD values when combining GVCFs and no PLs were present (#7836)
  - Fixed a bug in `ReblockGVCF` that could cause the first position on a contig to be dropped (#8028)
  - Fixed an allele-ordering issue in the allele-specific annotation code (#7585)
  - `VariantRecalibrator`: type change int -> long to prevent tranche novel variant count overflow (#7864)
  - Fixed an issue with tabix index generation (#7858)
  - Fixed a bug in `SiteDepthCodec` (#7910)
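The `VariantRecalibrator` fix in the list above widens a counter from `int` to `long`. The Python sketch below shows why that matters by emulating Java's fixed-width two's-complement wraparound. The helper names are hypothetical, and the count used is simply the smallest value that overflows a 32-bit signed integer, not a figure from any real callset.

```python
INT_MAX = 2**31 - 1  # largest value of a 32-bit signed integer (Java int)


def as_java_int(value):
    """Wrap an integer as 32-bit signed two's-complement arithmetic would."""
    return (value + 2**31) % 2**32 - 2**31


def as_java_long(value):
    """Wrap an integer as 64-bit signed two's-complement arithmetic would."""
    return (value + 2**63) % 2**64 - 2**63


# One more variant than a 32-bit counter can hold.
true_count = INT_MAX + 1
count_as_int = as_java_int(true_count)    # wraps around to a negative number
count_as_long = as_java_long(true_count)  # preserved exactly
```

Stored in 32 bits the count wraps to a large negative value, while the 64-bit representation keeps it intact.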
- Miscellaneous Changes
  - `VariantsToTable` now includes all fields when none are specified (#7911)
  - `SelectVariants` now warns the user about poor performance when the sample names in the VCF header are unsorted (#7887)
  - `VariantRecalibrator` now has a `--dont-run-rscript` argument to disable execution of its R script but still output the actual R script file (#7900)
  - Added some generic read tag/expression filters for use on numeric tags (#7746)
  - Replaced Travis CI with GitHub Actions for our continuous testing (#7754)
  - Switched over to GitHub Actions for building our nightly docker image (#7775)
  - Created a new `build_docker_remote.sh` script for building the docker image remotely with Google Cloud Build (#7951)
  - Added an argument mode manager for group arguments and a demonstration of how it might be used in `HaplotypeCaller` `--dragen-mode` (#7745)
  - Added unit tests for the `Utils.concat()` methods (#7918)
  - Added a test to validate WDLs in the scripts directory (#7826)
  - Added a `use_allele_specific_annotation` arg and fixed task with empty input in the `JointVcfFiltering` WDL (#8027)
  - Fixed an issue in the GATK stats script in which the first day's downloads on a new release were set to 0 (#7794)
  - Fixed a typo in the Dockerfile that broke git lfs pull (#7806)
  - Removed unused code in the `utils.solver` package (#7922)
  - Corrected the time for GATK nightly build cron jobs (#7784)
  - Disabled the red "X" from failing `CodeCov` builds and delayed the posting of coverage information until tests complete (#7817)
  - Some minor misc engine changes (#7744)
- Documentation
  - Marked `JointGermlineCNVSegmentation` as a DocumentedFeature (#7871)
  - Marked `SVAnnotate` as a DocumentedFeature (#7833)
  - Marked `CollectSVEvidence` as a DocumentedFeature (#8041)
  - Docs clarification in `GenotypeGVCFs` for some reblocking-related funkiness (#7846)
  - Updated the GATK Readme to reflect the switch from Travis CI to GitHub Actions (#7808)
- Dependencies