Create tools for SV VCF cleaning #8996

Draft: kjaisingh wants to merge 32 commits into master

kjaisingh commented Oct 9, 2024

This PR introduces several new tools for the CleanVcf workflow in GATK-SV; their use is documented in broadinstitute/gatk-sv#733. The tools are intended to improve on the existing implementation in several ways, including but not limited to:

  • Introduce various unit and integration tests into the workflow.
  • Create more robust and generalizable tools that can be used independent of CleanVcf.
  • Improve runtime by leveraging Java.

mwalker174 (Contributor) left a comment

This looks great overall. I have some suggestions on style and some places where you can reuse other classes. I did not mark all places where you can add final to variable declarations, just a few cases. You may not need to create a separate "Engine" class for the internals here unless you think some of the components would be reusable in another step or if it makes testing easier.

Comment on lines 144 to 148
@Argument(
        fullName = OUTPUT_REVISED_EVENTS_LIST_LONG_NAME,
        doc="Output list of revised genotyped events"
)
private GATKPath outputRevisedEventsList;
Contributor

Will it be possible to build this info directly into the VCF rather than have this extra file?

Author

Updated accordingly - added this as a new flag in the INFO field.
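
For illustration, a minimal sketch of what carrying this information as an INFO flag could look like, using only the Java standard library on a raw VCF data line. This is not the code from the PR: the class name, the method, and the flag name REVISED_EVENT are hypothetical. The header line follows the VCF convention that flags are declared with Number=0 and Type=Flag.

```java
import java.util.Arrays;

final class InfoFlagSketch {
    // Hypothetical flag name, chosen for illustration only.
    static final String REVISED_EVENT_FLAG = "REVISED_EVENT";

    // Header line that would declare the flag (VCF flags use Number=0, Type=Flag).
    static final String HEADER_LINE =
            "##INFO=<ID=" + REVISED_EVENT_FLAG
                    + ",Number=0,Type=Flag,Description=\"Event was revised (illustrative)\">";

    /** Returns the VCF data line with the flag present in the INFO column (column 8). */
    static String addInfoFlag(final String vcfLine, final String flag) {
        final String[] fields = vcfLine.split("\t", -1);
        if (fields.length < 8) {
            throw new IllegalArgumentException(
                    "Expected at least 8 VCF columns, found " + fields.length);
        }
        final String info = fields[7];
        if (info.isEmpty() || info.equals(".")) {
            // "." denotes an empty INFO column, so the flag replaces it.
            fields[7] = flag;
        } else if (!Arrays.asList(info.split(";")).contains(flag)) {
            // Append only if the flag is not already set.
            fields[7] = info + ";" + flag;
        }
        return String.join("\t", fields);
    }
}
```

In a real GATK tool the flag would more likely be set through the variant builder and declared in the output header, so downstream steps can filter on it without a side file.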


private void processSVType(VariantContext variant, VariantContextBuilder builder) {
    final String svType = variant.getAttributeAsString(GATKSVVCFConstants.SVTYPE, null);
    if (svType != null && variant.getAlleles().stream().noneMatch(allele -> allele.getDisplayString().contains(GATKSVVCFConstants.ME))) {
Contributor

Better to parse the alleles with GATKSVVariantContextUtils.getSymbolicAlleleSymbols() and check for ME. Sometimes the alt is just <INS:ME>.
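
For illustration, a minimal sketch of the token-wise check, using only the Java standard library. It does not call GATKSVVariantContextUtils.getSymbolicAlleleSymbols(); the class SymbolicAlleleSketch and its methods are hypothetical stand-ins, and the sketch assumes the mobile-element symbol is the literal token "ME". The allele <INS:ME:ALU> is an illustrative example, not taken from the PR.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class SymbolicAlleleSketch {
    // Assumption: the mobile-element symbol is the literal token "ME".
    static final String ME = "ME";

    /**
     * Splits a symbolic allele such as "<INS:ME:ALU>" into its
     * colon-delimited symbols. Non-symbolic alleles yield an empty list.
     */
    static List<String> symbols(final String alleleString) {
        if (alleleString.length() < 3
                || !alleleString.startsWith("<")
                || !alleleString.endsWith(">")) {
            return Collections.emptyList();
        }
        final String inner = alleleString.substring(1, alleleString.length() - 1);
        return Arrays.asList(inner.split(":"));
    }

    /** True if any symbol of the allele is exactly the ME token. */
    static boolean isMobileElement(final String alleleString) {
        return symbols(alleleString).contains(ME);
    }
}
```

Comparing whole symbols treats <INS:ME> and <INS:ME:ALU> the same way, and it avoids matching "ME" as a substring of an unrelated symbol, which a plain contains() on the display string cannot rule out.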

Author

Updated accordingly - thanks for the pointer.

Comment on lines +166 to +167
failSet = readLastColumn(failList);
passSet = readLastColumn(passList);
Contributor

You could use TableReader here. It would require you to add header lines to the lists in the WDL, which would be okay too. See TableUtils.reader() and look at some implementations to see examples.
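
For illustration, a minimal sketch of what header-based reading buys over taking the last column by position, using only the Java standard library. It is not GATK's TableReader or TableUtils.reader(); the class HeaderedListSketch, the method readColumn, and the column name variant_id used in the usage note are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class HeaderedListSketch {
    /**
     * Reads a tab-delimited table whose first line is a header and
     * returns the distinct values of the named column, in file order.
     */
    static Set<String> readColumn(final Reader source, final String columnName) {
        try (BufferedReader reader = new BufferedReader(source)) {
            final String headerLine = reader.readLine();
            if (headerLine == null) {
                throw new IllegalArgumentException("Missing header line");
            }
            final List<String> header = Arrays.asList(headerLine.split("\t", -1));
            final int index = header.indexOf(columnName);
            if (index < 0) {
                throw new IllegalArgumentException("No column named " + columnName);
            }
            final Set<String> values = new LinkedHashSet<>();
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    continue; // tolerate blank lines
                }
                final String[] fields = line.split("\t", -1);
                if (fields.length != header.size()) {
                    throw new IllegalArgumentException("Malformed row: " + line);
                }
                values.add(fields[index]);
            }
            return values;
        } catch (final IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With a header in place, a call such as readColumn(reader, "variant_id") keeps working if columns are reordered or added in the WDL, and malformed rows fail loudly instead of silently yielding the wrong field.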

Author

Updated accordingly - thanks for the tip. For visibility, I have made corresponding changes to GATK-SV in broadinstitute/gatk-sv#733.

kjaisingh self-assigned this Oct 17, 2024