Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add SVStratify and GroupedSVCluster tools #8990

Open
wants to merge 3 commits into
base: master
Choose a base branch
from
Open

Conversation

mwalker174
Copy link
Contributor

Implements two new tools and updates some methods for a revamp of the CombineBatches cross-batch integration module in gatk-sv.

  • SVStratify - tool for splitting out a VCF by variant class. Users pass in a configuration table (see tool documentation for an example) specifying one or more stratification groups classified by SVTYPE, SVLEN range, and reference context(s). The latter are specified as a set of interval lists using --context-name and --context-intervals arguments. All variants are matched with their respective group which is annotated in the STRAT INFO field. Optionally, the output can be split into multiple VCFs by group, which is a very useful functionality that currently can't be done efficiently with common commands/toolkits.
  • GroupedSVCluster - a hybrid tool combining functionality from SVStratify with SVCluster to perform intra-stratum clustering. This tool is critical for fine-tuned clustering of specific variants types within certain reference contexts. For example, small variants in simple repeats tend to have lower breakpoint accuracy and are typically "reclustered" during call set refinement with looser clustering criteria.
  • SVStratificationEngine - new class for performing stratification.
  • Updates to breakpoint refinement in CanonicalSVCollapser that should improve breakpoint accuracy, particularly in larger call sets. Raw evidence support and variant quality are now considered when choosing a representative breakpoint for a group of clustered SVs.
  • Added FlagFieldLogic type for customizing how BOTHSIDE_PASS and HIGH_SR_BACKGROUND INFO flags are collapsed during clustering.
  • RD_CN is now used as a backup if CN is not available when determining carrier status for sample overlap.
  • Removed no-sort option in favor of spooled sorting.
  • Bug fix: support for empty EVIDENCE info fields
  • Bug fix: in one of the JointGermlineCnvDefragmenter tests

 This is a combination of 2 commits.

Implement SVStratify

Interchrom events

Start unit tests

Finish unit tests

Multiple contexts per stratum

Integration tests; fix empty context bug

Test duplicate context name

Handle CPX/CTX, add some tests

Compiler error

Prioritize SR and PE evidence types for representative breakpoint strategy

Add SVStratifiedCluster

Integration tests for SVStratifiedClustering

Unused line

Spooled sorting in cluster tools

Start fixing JointGCNVSegmentation

Fix JointGCNVSegmentation integration test

Rename to GroupedSVCluster

Fix sorting bug

Comment

Use RD_CN for CNV sample overlap; improve testing

Documentation

Add comment about requiresOverlapAndProximity

Allow empty EVIDENCE list

Add STRAT INFO field

Representative breakpoint by variant quality

Handle BOTHSIDES_SUPPORT and HIGH_SR_BACKGROUND in variant collapsing

Add expected exception
Copy link
Contributor

@ldgauthier ldgauthier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My only big ask is some clarification around "context" because it took me a long time to figure that out. Everything else you can take or leave.

@@ -77,17 +77,17 @@ protected static boolean hasSampleOverlap(final SVCallRecord a, final SVCallReco
final Set<String> samples = new HashSet<>(SVUtils.hashMapCapacity(genotypesA.size() + genotypesB.size()));
samples.addAll(genotypesA.getSampleNames());
samples.addAll(genotypesB.getSampleNames());
if (samples.isEmpty()) {
// Empty case considered perfect overlap
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my own edification, why is that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's a sites-only vcf then we don't need to enforce sample overlap.

if (matchedSampleGenotype != null) {
return VariantContextGetters.getAttributeAsInt(matchedSampleGenotype, GATKSVVCFConstants.EXPECTED_COPY_NUMBER_FORMAT, 0);
} else {
return 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand we have to choose something as the default, but I feel weird defaulting to homozygous deletion. Why not -1?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point. This code path shouldn't be possible so I changed it to throw an error. For the defaults, I made them -1 as a unique "null" value.

protected GATKPath outputFile;

/**
* Expected format is tab-delimited and contains a header with the first column SAMPLE and remaining columns
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a tool that outputs this file? This is not familiar to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No this is a gatk-sv file designed to replace the ped file for storing allosome ploidies. We might make a gatk tool for this in the future though.

* clustering, but any associated annotation fields will be set to null in the output.
*/
@Argument(
doc = "Fast mode. Drops hom-ref and no-call genotype fields and emits them as no-calls.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't call the attributes no-calls, I would call them missing

* Default genotypes are assigned when they cannot be inferred from the inputs, such as when VCFs with different
* variants and samples are provided.
*/
@Argument(fullName = DEFAULT_NO_CALL_LONG_NAME,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By default you get 0/0 with no attributes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Turns out this is sufficient for our batch clustering module, we just need the genotypes.

runCommandLine(args, SVStratify.class.getSimpleName());
}

@Test(expectedExceptions = {GATKException.class, IllegalArgumentException.class})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been burned this way before where another class throws and IllegalArgumentException. I also made a note where the Utils arg validation is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

public void testBwaMeltCohortRedundant() {
final File outputDir = createTempDir("stratify");
final String inputVcfPath = getToolTestDataDir() + "bwa_melt.chr22.vcf.gz";
final String configFile = getToolTestDataDir() + "test_config_redundant.tsv";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if the config conflicts with the cmdline args?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'll get caught in the Stratrum class and throw an error

* @param svType SV type, may be null
* @param minSize minimum size in bp (inclusive), may be null
* @param maxSize maximum size in bp (exclusive), may be null
* @param contextNames reference context names
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This confused me for a while because in the engine the ReferenceContext is like the actual FASTA sequence. You think this is obvious enough for users who haven't done GATK development?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, changed context to track

expectedOutputSuffixes.put("DUP_lt5kb_RM", 0);
expectedOutputSuffixes.put("INV_gt1kb", 26);
expectedOutputSuffixes.put("BND_SD", 77);
expectedOutputSuffixes.put(SVStratify.DEFAULT_STRATUM, 1196);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to trust that you verified this with an orthogonal python script or somesuch

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I've been testing against the gatk-sv pipeline results, although those evaluations are qualitative. I trust this will alert us if any unintentional changes happen though.

public GATKPath configFile;

@Argument(
doc = "Context intervals file. Can be specified multiple times.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would call this something more like "repeat track intervals file". I'm sure you could generalize beyond repeats, but I found this really vague.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now "Track intervals file" if that works?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants