Releases: broadinstitute/gatk
4.0.2.1
This is a small bug fix release containing fixes for the following issues:
- HaplotypeCaller: fix the -contamination / -contamination-file arguments, which were not working properly, and add tests (#4455)
- Fixes/improvements to the GATK configuration file mechanism (#4445)
  - If a Java system property is specified explicitly on the user's command line, allow it to override the corresponding value in the GATK config file
  - Bundle an example GATK configuration file with the GATK binary distribution. This config file can be edited and passed to the GATK via the --gatk-config-file argument (see the example commands after this list).
  - There are still some configuration-related TODOs/known issues: in particular, the gatk front-end script currently bakes in some system properties internally, which will always override the corresponding values in the config file. We plan to patch the gatk script to no longer set these system properties internally, and delegate to the config file instead.
- Mutect2: minor bug fixes and improvements (#4466)
  - Fix "FilterMutectCalls trips on non-int value in MFRL tag" (#4363)
  - Fix ordering of allele trimming vs. variant annotation (#4402)
  - Fix "CalculateContamination gives >100% results" (#3889)
  - Disable the MateOnSameContigOrNoMappedMateReadFilter by default (#3514)
  - Make mapping quality threshold in GetPileupSummaries modifiable (#4011)
- SV Tools: Add a scan for intervals of high depth, and exclude reads from those regions from SV evidence (#4438)
- In the GATK docker image, run the GATK using the fully-packaged binary distribution jars, rather than the unpackaged jars (#4476). This fixes a number of minor issues reported by users of the docker image.
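As a quick illustration of the configuration mechanism above, an invocation might look like the following. This is a minimal sketch: the config file path, tool, and input/output names are placeholders, and samjdk.compression_level is used only as an example of a Java system property that could also appear in the config file.

```bash
# Sketch only: file names and the property shown are illustrative.
# Run a tool with an edited copy of the bundled example config file.
gatk HaplotypeCaller \
     --gatk-config-file my-GATKConfig.properties \
     -R ref.fasta -I sample.bam -O sample.vcf.gz

# A Java system property given explicitly on the command line now overrides
# the corresponding value in the config file (#4445).
gatk --java-options "-Dsamjdk.compression_level=5" \
     HaplotypeCaller \
     --gatk-config-file my-GATKConfig.properties \
     -R ref.fasta -I sample.bam -O sample.vcf.gz
```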
4.0.2.0
This is a small release that includes a new Beta tool, a port of VariantAnnotator from GATK3, as well as some bug fixes and other improvements. Mutect2 is no longer beta.
- Mutect2 and FilterMutectCalls are now no longer beta! (#4384)
- New tool VariantAnnotator (#3803):
  - ported tool from GATK3
  - first beta release
- Spark Improvements:
  - new CNV Tumor only WDL (#4414)
  - Viterbi segmentation and segment quality calculation for gcnvkernel (#4335)
- Other Bug Fixes and Improvements:
  - update to latest GKL, improves performance of GZIP level 2 compression (#4379)
  - CalculateGenotypePosteriors: fixed bug that caused duplicates in the output VCF as well as several other issues (#4352, #4431)
  - Display a more prominent warning message for Beta and Experimental tools (#4429)
  - non-zero Picard tool exit codes now cause a non-zero exit from gatk (#4437)
  - removed support for deprecated Google Reference API (#4266)
  - Improve evidence info dumps and SV pipeline management (#4385)
  - oncotator docker uses default docker if not specified (#4394)
  - Added check for non-finite copy ratios in ModelSegments pipeline (#4292)
  - make FASTQ reader remove phred bias from quals (#4415)
4.0.1.2
This is a small bug fix release that addresses issues in the WDLs for Mutect2 and the CNV tools. It also includes a newer version of the GKL (Genomics Kernel Library) with some compression-related performance improvements.
As usual, a docker image for this release can be downloaded from https://hub.docker.com/r/broadinstitute/gatk/
4.0.1.1
This is a small bug fix release that fixes the following:
- Fix sorting bug in GatherTranches. Gathered tranches should now be closer to target truth sensitivity in the lower range (~90%).
- Mutect2 WDL: fix memory requests to request MB instead of GB.
- CNV somatic pair workflow WDL: added missing Oncotator optional arguments
- Prevent printing a stack trace when the user specifies the name of a tool that doesn't exist. Instead, print suggestions for similar tool names.
4.0.1.0
Highlights of this release include a preview version of a future neural-network-based VQSR replacement, the ability to generate a VCF from the GermlineCNVCaller output, allele-specific annotation support in GenomicsDBImport, as well as a number of important post-4.0 bug fixes. See below for the full list of changes.
As usual, a docker image for this release can be downloaded from https://hub.docker.com/r/broadinstitute/gatk/
Changes in this release:
- New experimental tool NeuralNetInference (#4097)
  - An eventual VQSR replacement.
  - Performs variant score inference with a 1D Convolutional Neural Network using a pre-trained model. This is faster but not as high quality as the 2D model, which is coming, along with training and tranche-style filtering, in the next GATK release (#4245).
  - Tool name subject to change!
- GenomicsDBImport:
  - Add support for allele-specific annotations (#4261) (#3707)
  - Allow sample names with whitespace in the sample name map file (#3982)
  - Fix segfault crash on long path names (#4160)
  - Allow multiple import commands to be run in the same workspace directory (#4106)
  - Fix segfault crash during import when flag fields not declared in the VCF header (#3736)
  - Improve warning message when PLs are dropped for records with too many alleles (#3745)
- CNV tools:
- HaplotypeCaller:
  - Fix the --min-base-quality-score / -mbq argument, which previously had no effect (#4128). This fix also affects Mutect2.
  - Fix a "contig must be non-null and not equal to *, and start must be >= 1" error by patching an edge case in the ReadClipper code: when reverting soft-clipped bases of a read at the start of a contig, don't explode if you end up with an empty read (#4203)
- Mutect2:
  - Smarter contamination model (#4195)
  - Removed the --dbsnp and --comp arguments. The best practice now is to pass in gnomAD as the germline-resource (see the example command after this change list).
  - Removed a number of other arguments that were HaplotypeCaller-specific and not appropriate for Mutect2, such as --emit-ref-confidence.
  - Mutect2 WDL: CRAM support (#4297)
  - Mutect2 WDL: Compressed vcf output and Funcotator options (#4271)
  - Miscellaneous WDL cleanup
- HaplotypeCallerSpark:
  - Fixes to the tool that make its output much closer to that of the non-Spark HaplotypeCaller (#4278). Note that this tool (unlike the non-Spark HaplotypeCaller) is still in beta, and should not be used for any real work. There are still major performance issues with the tool that in practice prevent running on certain kinds of large data and in certain modes.
  - Disallow writing a .vcf.gz when in GVCF mode, as this combination currently doesn't work (#4277)
- BwaSpark:
  - set more reasonable default set of read filters (#4286)
- PathSeq:
  - Add WDL for running the PathSeq pipeline with a README and example JSON input. (#4143)
- Fix piping between Picard tools run via the GATK by changing logging output to stderr (#4167)
- Disallow unindexed block-compressed tribble files as input to walkers (#4240) (#4224). This works around a bug in HTSJDK that could cause such files to appear truncated. Until the HTSJDK bug is fixed, block-compressed .vcf.gz files (and similar files) will need to be accompanied by an index, which can be generated using the IndexFeatureFile tool (see the sketch after this change list).
- Restore .list as an allowed extension for files containing multiple values for command-line arguments (#4270). The previous extension .args is also still allowed. This feature allows users to provide a file ending in .list or .args containing all of the values for an argument that accepts multiple values (for example: a list of BAM files), instead of typing all the values individually on the command line.
- Fix conda environment creation to work better with the release distribution. (#4233)
- IndexFeatureFile: more informative error message when trying to index a malformed file (#4187)
- Suggest using BED files as a way to resolve ambiguous interval queries. (#4183)
- Set Spark parameter userClassPathFirst = false #3933 (#3946)
- Update to HTSJDK 2.14.1 (#4210)
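As a quick illustration of the Mutect2 change above (gnomAD passed as the germline resource instead of --dbsnp/--comp), a tumor-normal run might look roughly like the sketch below. All file and sample names are placeholders; the recommended resource files are described in the Mutect2 documentation rather than here.

```bash
# Sketch only: names are placeholders; consult the Mutect2 docs for recommended resources.
gatk Mutect2 \
     -R ref.fasta \
     -I tumor.bam  -tumor  TUMOR_SAMPLE_NAME \
     -I normal.bam -normal NORMAL_SAMPLE_NAME \
     --germline-resource gnomad.vcf.gz \
     -O somatic.unfiltered.vcf.gz
```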
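For the unindexed block-compressed input change above, the workaround (indexing with IndexFeatureFile) would look something like this minimal sketch, assuming the 4.0.x -F/--feature-file input argument and a placeholder file name; run the tool with --help to confirm the argument spelling for your version.

```bash
# Sketch: writes a .tbi index next to the block-compressed VCF so it can be
# used as walker input again.
gatk IndexFeatureFile -F calls.vcf.gz
```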
4.0.0.0
4.beta.6
This release brings a critical bug fix to the GenomicsDBImport tool related to sample ordering, plus a new tool FixCallSetSampleOrdering to repair VCFs generated using the pre-4.beta.6 version of the tool. See the description of the bug in #3682 to determine whether you are affected. Do not run FixCallSetSampleOrdering unless you are sure that you are affected by the bug in #3682.
Other highlights include upgrading to the latest version of the Picard tools, and adding engine support for reading Gencode GTF files.
A docker image for this release can be found in the broadinstitute/gatk repository on dockerhub. Within the image, cd into /gatk then run gatk-launch commands as usual.
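A minimal sketch of pulling and using the image, assuming the image tag matches the release name and using PrintReads purely as an example tool:

```bash
# Sketch: tag and tool are illustrative; adjust to your needs.
docker pull broadinstitute/gatk:4.beta.6
docker run -it broadinstitute/gatk:4.beta.6
# inside the container:
cd /gatk
./gatk-launch --list                                  # list available tools
./gatk-launch PrintReads -I input.bam -O output.bam   # run a tool as usual
```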
Note: Due to our current dependency on a snapshot of google-cloud-java, this release cannot be published to maven central.
Full list of changes for this release:
- Fixed sample name reordering bug in GenomicsDBImport (#3667)
- New tool FixCallSetSampleOrdering to repair vcfs affected by #3682 (#3675)
- Integrate latest Picard tools via Picard jar. (#3620)
- Adding in codec to read from Gencode GTF files. Fixes #3277 (#3410)
- Upgrade to HTSJDK version 2.12.0 (#3634)
- Upgrade to GKL version 0.7 (#3615)
- Upgrade to GenomicsDB version 0.7.0 (#3575)
- Upgrade Mockito from 1.10.19 -> 2.10.0. (#3581)
- Add GVCF support to VariantsSparkSink (#3450)
- Fix writing variants to GCS buckets (#3485)
- Support unmapped reads in Spark. (#3369)
- Correct gVCF header lines (#3472)
- Dump more evidence info for SV pipeline debugging (#3691)
- Add omitFromCommandLine=true for example tools (#3696)
- Change gatkDoc and gatkTabComplete build tasks to include Picard. (#3683)
- Adding data.table R package. (#3693)
- Added a missing newline in ParamUtils method. (#3685)
- Fix minor HTML issues in ReadFilter documentation (#3654)
- Add CRAM integration tests for HaplotypeCaller. (#3681)
- Fix SamAssertionUtils SortSam call. (#3665)
- Add ExtremeReadsTest (#3070)
- removing required FASTA reference input that was needed before (for its dict) for sorting variants in output VCF, now using header in input SAM/BAM (#3673)
- re-enable snappy use in htsjdk (#3635)
- fix 3612 (#3613)
- pass read metadata to all code that needs to translate contig ids using read metadata (#3671)
- quick fix for broken read (mapped to no ref bases) (#3662)
- Fix log4j logging by removing extra copy from the classpath. #2622 (#3652)
- add suggestion to regularly update gcloud to README (#3663)
- Automatically distribute the BWA-MEM index image file to executors for BwaSpark (#3643)
- Have PSFilter strip mate number from read names (#3640)
- Added the tool PreprocessIntervals that bins the intervals given by the user to be used for coverage collection. (#3597)
- Cpx SV PR series, part-4 (#3464)
- fixed bug in which F1R2 and F2R1 annotation kept discarded alleles (#3636)
- imprecise deletion calling (#3628)
- Significant improvements to CalculateContamination (#3638)
- Adds supplementary alignment info into fastq output, also additional… (#3630)
- Adding tool to annotate with pair orientation info (#3614)
- add elapsed time to assembly info in intervals file (#3629)
- Created a VariantAnnotationArgumentCollection to reduce code duplication and added a StandardM2Annotation group (#3621)
- Docs for turning assembled haplotypes into variant alleles (#3577)
- Simplify spark_eval scripts and improve documentation. (#3580)
- Renames StructuralVariantContext to SVContext. (#3617)
- Added KernelSegmenter. (#3590)
- Fix bug in allele order independent comparison (#3616)
- Docs for local assembly (#3363)
- Added a method to VariantContextUtils which supports alt allele order independent comparison of variant contexts. (#3598)
- Fixed incorrect logger in CollectAllelicCounts and RecalibrationReport. (#3606)
- updating to newer htsjdk snapshot (#3588)
- clear diffuse high frequency kmers (#3604)
- update SmithWatermanAligner in preparation for native optimized aligner (#3600)
- added spark tool for extracting original SAM records based on a file containing read names (#3589)
- update README with correct path to install_R_packages.R #3601 (#3602)
- HostAlignmentReadFilter and PSScorer use only identity scores and exp… (#3537)
- Fixed alt-allele count in AllelicCountCollector and changed unspecified alleles in AllelicCount to N. (#3550)
- Fix bad version check in manage_sv_pipeline.sh (#3595)
- Use a handmade TestReferenceMultiSource in tests instead of a mock. (#3586)
- Repackage ReadFilter plugin tests (#3525)
- BamOut in M2 WDL and unsupported version with NIO for SpecOps Team (#3582)
- Changed the path for posting the test reports
- updates sv manager and cluster creation scripts to utilize dataproc cluster timed self-termination feature (#3579)
- Implemented watershed algorithm for finding local minima in 1D data based on topological persistence. (#3515)
- Reduce number of output partitions in PathSeqPipelineSpark (#3545)
- add gathering of imprecise evidence links and extend evidence intervals to make links coherent in most cases (#3469)
- Refactor PrimaryAlignmentReadFilter to PrimaryLineReadFilter (#3195)
- Update ReadFilters documentation (#3128)
- Changes in BwaMemIntegrationTest to avoid a 3-4 minutes runtime. (#3563)
- Make error informative for non-diploid family likelihoods #3320 (#3329)
- TableFeature javadoc and more tests (#3175)
- Re-enable ancient BED test in IndexFeatureFile. (#3507)
- add external evidence stream for CNVs (#3542)
- clip M2 alleles before emitting in case some alleles were dropped (#3509)
- Docs for M2 filtering (#3560)
- Fix static test blocks and @BeforeSuite usages to prevent excessive code execution when tests aren't included in a suite. (#3551)
- hide prototyping tools in sv package from help message (but still runnable if knowing their existence) (#3556)
- Add support for running tools with omitFromCommandLine=true (#3486)
- Adds utility methods to ReadUtils and CigarUtils. (#3531)
- Cpx SV PR series, part-3 (#3457)
4.beta.5
Small release; highlights include an update to our BWA-MEM version, an experimental PythonScriptExecutor, and an important bugfix for ValidateVariants -gvcf mode.
Note: this still includes snapshot dependencies that prevent us from releasing to Maven central.
Complete change list:
- Make directory name unique for BucketUtilsTest#testDirSizeGCS to avoid unwanted test interaction. (#3547)
- Simple PythonScriptExecutor. #3501 (#3536)
- Fix BucketUtils#dirSize on GCS. #3437 (#3539)
- code duplication in read pos rank sum and its allele-specific version #1882 (#2657)
- validatevariants -gvcf fix (#3530)
- Added GetSampleName as stopgap until we have named parameters (#3538)
- Pair HMM docs (#3433)
- Fix MissingReferenceDictFile exception constructor. #3492 #2922 (#3524)
- Extend ReadsPipelineSpark to run HaplotypeCallerSpark (#3452)
- Updates bwamem-jni dependency to 1.0.2 and adds the possibility of aligning singletons to BwaEngine classes. (#3474)
- Structural Variant Context (#3476)
4.beta.4
Highlights of this release include fixes to the GATK4 HaplotypeCaller to bring it closer to the output of the GATK3 HaplotypeCaller (although many of these fixes still need to be applied to HaplotypeCallerSpark), fixes for longstanding indexing and CRAM-related bugs in htsjdk, bash tab completion support for GATK commands, and many improvements to Mutect2 and the SV tools.
A docker image for this release can be found in the broadinstitute/gatk repository on dockerhub. Within the image, cd into /gatk then run gatk-launch commands as usual.
Note: Due to our current dependency on a snapshot of google-cloud-java, this release cannot be published to maven central.
Changes in this release:
- HaplotypeCaller: a number of important updates and fixes to bring it closer to GATK 3.x's output (most of these fixes apply only to HaplotypeCaller, not HaplotypeCallerSpark) (#3519)
  - reduce memory usage of the AssemblyRegion traversal by an order of magnitude
  - create empty pileup objects for uncovered loci internally (fixes occasional gaps between GVCF blocks as well as some calling artifacts)
  - when determining active regions, only consider loci within the user's intervals
  - port some additional changes to the GATK 3.x HaplotypeCaller to GATK4
  - fix bug with handling of the MQ annotation
- Added bash tab completion support for GATK commands (#3424)
- Updated to Intel GKL 0.5.8, which fixes a bug in AVX detection, which was behaving incorrectly on some AMD systems (#3513)
- Upgrade htsjdk to 2.11.0-4-g958dc6e-SNAPSHOT to pick up an important VCF header performance fix. (#3504)
- Updated google-cloud-nio dependency to 0.20.4-alpha-20170727.190814-1:shaded (#3373)
- Fix tabix indexing bugs in htsjdk, and reenable the IndexFeatureFile tool (#3425)
- Fix longstanding issue with CRAM MD5 slice calculation in htsjdk (#3430)
- Started publishing nightly builds
- Performance improvements to allow MD+BQSR+HC Spark pipeline to scale to a full genome (#3106)
- Eliminate expensive toString() call in GenotypeGVCFs (#3478)
- ValidateVariants gvcf memory optimization (#3445)
- Simplified Mutect2 annotations (#3351)
- Fix MuTect2 INFO field types in the VCF header (#3422)
- SV tools: fixed possibility of a negative fragment length that shouldn't have happened (#3463)
- Added command line argument for IntervalMerging based on GATK3 (#3254)
- Added 'nio_max_retries' option as a command line accessible option for GATK tools (#3328)
- Fix aligned PathSeq input getting filtered by WellformedReadFilter (#3453)
- Patch the ReferenceBases annotation to handle the case where no reference is present (#3299)
- Honor index/MD5 creation for HaplotypeCaller/Mutect2 bamouts. (#3374)
- Fix SV pipeline default init script handling (#3467)
- SV tools: improve the test bam (#3455)
- SV tools: improved filtering for smallish indels (#3376)
- Extends BwaMemImageSingleton into a cache, BwaMemImageCache, that can… (#3359)
- Try installing R packages from multiple CRAN repos in case some are down (#3451)
- Run Oncotator (optional) in the CNV case WDL. (#3408)
- Add option to run Spark tests only (#3377)
- Added a .dockerignore file (#3418)
- Code cleanup in the sv discovery package (#3361) and fixes #3224
- Implement PathSeq taxon hit scoring in Spark (#3406)
- Add option to skip pre-Bwa repartitioning in PSFilter (#3405)
- Update the GQ after PLs get subset (#3409)
- Removed the explicit System.exit(0) from Main (#3400)
- build_docker.sh can run tests again #3191 #3160 (#3323)
- Minor doc fixes #3173 (#3332)
- Use ReadClipper in BaseQualityClipReadTransformer (#3388)
- PathSeq adapter trimming and simple repeat masking (#3354)
- Add scripts to manage SV spark jobs and copy result (#3370)
- Output empty VQSLOD tranches in scatterTranches mode if no variant has VQSLOD high enough for requested threshold (#3397)
- Option to filter short pathogen reference contigs (#3355)
- Rewrote hapmap autoval wdl (#3379)
- fixed contamination calculation, added error bars to output (#3385)
- wrote wdl for Mutect panel of normals (#3386)
- Turn off tranches plots if no output Rscript is specified (for annotation plots) (#3383)
- Mutect2 wdls output the contamination (#3375)
- Increased maximum copy-ratio variance slice-sampling bound. (#3378)
- Replace --allowMissingData with --errorIfMissingData (gives opposite default behavior as previously) and print NA for null object in VariantsToTable (#3190)
- docs for proposed tumor-in-normal tool (#3264)
- Fixed the git version for the output jar on docker automatic builds (#3496)
- Use correct logger class in MathUtils (#3479)
- Make ShardBoundaryShard implement Serializable (#3245)
4.beta.3
This release contains a number of bug fixes and improvements. Highlights include a fix for intermittent failures/timeouts when accessing data in Google Cloud Storage (GCS), new and improved active-region detection for Mutect2, and a new VariantRecalibrator argument to allow the tool to scale better. See the full list of changes below. Most of the major known issues listed in the release notes for 4.beta.1 still apply, with the exception of the "intermittent GCS failures/timeouts" issue, which is now resolved.
A docker image for this release can be found in the broadinstitute/gatk repository on dockerhub. Within the image, cd into /gatk then run gatk-launch commands as usual.
Note: Due to our current dependency on a snapshot of google-cloud-java, this release cannot be published to maven central.
Changes in this release:
- GATK engine: Move to google-cloud-java snapshot with more robust retries, and set the number of retries/reopens globally. This fixes the intermittent "all retries/reopens failed" error when accessing data on GCS (Google Cloud Storage). See issue #2749
- Mutect2: Implemented a new algorithm for active-region detection, reducing spurious active regions by almost 50%
- Mutect2: Filter artifacts that arise from apparent-duplicate reads
- Mutect2 WDL: Oncotator is now being told the case and control sample names explicitly in the WDL. The Oncotator code for inferring this could yield incorrect answers in some cases. See issue #3343
- FilterByOrientationBias: We discovered that it is impossible to guarantee an FDR threshold across all variants when one artifact mode had high oxoQ and the other had low. We have changed the tool to guarantee the FDR threshold within each artifact mode, rather than for all variants. For more details, see issue #3344
- FilterByOrientationBias: Summary table was not being populated properly. That has been fixed. See issue #3309
- VariantRecalibrator: Add argument to pre-sample data for VQSR model building (and also recalibration) to reduce memory usage for the production pipeline. See issue #3230
- Fix a stack overflow issue at high depths in the strand artifact annotation. See issue #3317
- GenomicsDBImport: add --readerThreads argument for multi-threaded vcf pre-loading. Improves performance of the tool by ~30% in our tests.
- ValidateVariants: port gvcf validation option from GATK3
- Polish up PathSeq and add pipeline tool
- Fix error message describing how to set the GATK_STACKTRACE_ON_USER_EXCEPTION property (see the sketch after this list)
- Mutect2FilteringEngine: correct MEDIAN_BASE_QUALITY_DIFFERENCE_FILTER and MEDIAN_MAPPING_QUALITY_DIFFERENCE_FILTER filter names
- Mutect2 WDL: gave ProcessOptionalArguments a leaner docker
- GATK4 Docker Image: changed the landing directory for the docker image to be /gatk instead of /root
- Travis CI: fixed test report not being uploaded to GCS
- Travis CI: removed non-docker unit and integration tests, which were redundant
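As a closing illustration for the GATK_STACKTRACE_ON_USER_EXCEPTION item above: the setting is a Java system property, so it is passed through the launcher's Java-options flag. This is a minimal sketch only; the flag was spelled --javaOptions in the beta-era gatk-launch script and --java-options in the later gatk script, and the tool and file names are placeholders.

```bash
# Sketch: prints the full stack trace for user errors instead of the short message.
# Flag spelling depends on the launcher version (--javaOptions vs. --java-options).
./gatk-launch --javaOptions "-DGATK_STACKTRACE_ON_USER_EXCEPTION=true" \
    HaplotypeCaller -R ref.fasta -I sample.bam -O sample.vcf.gz
```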