You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
For "multiallelic" sites containing a spanning deletion alleles (represented by *), the getType method in VariantContext behaves somewhat unexpectedly. There have been numerous old discussions surrounding spanning deletion handling in htsjdk, like here and here, but this particular issue is a little more focused. Regardless of what might be the "best" way to handle these special sites, the current situation is quite confusing.
If the * allele is next to a SNP or a deletion, the site is considered an INDEL. But if it's next to an insertion, it's not considered an INDEL. See below for concrete examples.
Your environment:
version of htsjdk: 3.0.1
version of java: Temurin-17.0.7+7
which OS: MacOS Ventura 13.6
Steps to reproduce
Set up the following files for a complete example.
example.fasta:
>chr1
AAAAAAAAAA
Run samtools faidx example.fasta and samtools dict example.fasta > example.dict. Then make the following vcf as example.vcf:
##fileformat=VCFv4.3
##contig=<ID=chr1,length=10>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample
chr1 3 variant1 A AAA,* 30 . . GT 1/2
chr1 4 variant2 A C,* 30 . . GT 1/2
chr1 5 variant3 AAA A,* 30 . . GT 1/2
Run bcftools view example.vcf -o example.vcf.gz to compress it and bcftools index -t example.vcf.gz. Now it's ready to run in GATK to see what happens. Run gatk SelectVariants -V example.vcf.gz -R example.fasta -O gatk-snp.vcf.gz -select 'vc.isSNP()'
and gatk SelectVariants -V example.vcf.gz -R example.fasta -O gatk-indel.vcf.gz -select 'vc.isIndel()'
As expected, gatk-snp.vcf.gz looks like (with bcftools view -H):
chr1 4 variant2 A C,* 30 . . GT 1/2
but gatk-indel.vcf.gz looks like:
chr1 5 variant3 AAA A,* 30 . . GT 1/2
In other words, the site with <INS>,* is excluded while <DEL>,* is included when looking for INDEL sites.
Expected behaviour
I don't know what's the best way once and for all to classify VariantContexts with spanning deletions, but to me this seems like a "bug" either way. Whatever is decided in classifying the <INS>,* type should equally apply to <DEL>,*. For what it's worth, my instinct is to expect these sites to both be classified as "Indel" in this way since the spanning deletion is a deletion anyway. This is consistent with the behavior that SelectVariants will also pull out a multiallelic DEL+INS site as "Indel."
Actual behaviour
See above.
The text was updated successfully, but these errors were encountered:
Description of the issue:
For "multiallelic" sites containing a spanning deletion alleles (represented by
*
), thegetType
method inVariantContext
behaves somewhat unexpectedly. There have been numerous old discussions surrounding spanning deletion handling in htsjdk, like here and here, but this particular issue is a little more focused. Regardless of what might be the "best" way to handle these special sites, the current situation is quite confusing.If the
*
allele is next to a SNP or a deletion, the site is considered an INDEL. But if it's next to an insertion, it's not considered an INDEL. See below for concrete examples.Your environment:
Steps to reproduce
Set up the following files for a complete example.
example.fasta:
Run
samtools faidx example.fasta
andsamtools dict example.fasta > example.dict
. Then make the following vcf as example.vcf:Run
bcftools view example.vcf -o example.vcf.gz
to compress it andbcftools index -t example.vcf.gz
. Now it's ready to run in GATK to see what happens. Rungatk SelectVariants -V example.vcf.gz -R example.fasta -O gatk-snp.vcf.gz -select 'vc.isSNP()'
and
gatk SelectVariants -V example.vcf.gz -R example.fasta -O gatk-indel.vcf.gz -select 'vc.isIndel()'
As expected,
gatk-snp.vcf.gz
looks like (withbcftools view -H
):but
gatk-indel.vcf.gz
looks like:In other words, the site with
<INS>,*
is excluded while<DEL>,*
is included when looking for INDEL sites.Expected behaviour
I don't know what's the best way once and for all to classify VariantContexts with spanning deletions, but to me this seems like a "bug" either way. Whatever is decided in classifying the
<INS>,*
type should equally apply to<DEL>,*
. For what it's worth, my instinct is to expect these sites to both be classified as "Indel" in this way since the spanning deletion is a deletion anyway. This is consistent with the behavior that SelectVariants will also pull out a multiallelic DEL+INS site as "Indel."Actual behaviour
See above.
The text was updated successfully, but these errors were encountered: