document convention for "QC squeezing" in population VCF #527

mlin · 2020-09-07T03:44:00Z

This PR documents a convention developed in spVCF to reduce the size of population-wide VCF files (which present the full locus x sample matrix) by selectively omitting FORMAT fields. As written, this is not a spec change; it merely suggests a useful invocation of an existing clause (referenced inline). We suggest it is worth documenting explicitly because we've encountered some downstream tools that get tripped up by it.

In our experiments, applying this convention to WGS/WES VCF files for cohorts like 1KGP and UKB (generated with different pipelines) delivers a 4-6x file size reduction on its own, with no other changes.
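To illustrate what a downstream tool needs in order to cope with this convention, here is a minimal Python sketch of a per-sample parser that tolerates dropped trailing FORMAT fields by padding them with the missing value `.`. The function name `parse_sample` and the example FORMAT string are hypothetical and chosen only for illustration; the sketch assumes the existing clause is the one allowing trailing genotype fields to be dropped.

```python
def parse_sample(format_keys, sample_field):
    """Map FORMAT keys to one sample's values.

    Trailing fields that were dropped from the sample entry are
    reported as the VCF missing value '.'.
    """
    values = sample_field.split(":")
    if len(values) > len(format_keys):
        raise ValueError("more sample values than FORMAT keys")
    # Pad any omitted trailing fields with the missing value.
    values += ["."] * (len(format_keys) - len(values))
    return dict(zip(format_keys, values))


# Hypothetical FORMAT column and two sample entries.
format_keys = "GT:DP:AD:GQ:PL".split(":")
full = parse_sample(format_keys, "0/1:30:14,16:99:255,0,255")
squeezed = parse_sample(format_keys, "0/0:16")
```

A parser that instead requires exactly as many values as FORMAT keys is the kind of tool that would get tripped up by a squeezed file.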

Related PRs:

add reference blocksize and checkpointing to VCF #435 suggests a way to encode the matrix sparsely, implemented in the Hail VCF Combiner. spVCF has a different sparse encoding, and both approaches have distinctive merits (#435 is more natural when GVCF files are the conceptual point of departure; spVCF's is more natural when starting from a population VCF file). Both require substantial work on existing VCF parsers to interpret correctly, especially for tabix-style random access.
The convention suggested here complements either approach by offering a cheap way to "densify" the sparsely encoded matrix so that existing tools can consume it much more easily (perhaps with minor fixes, if they don't already honor the existing spec clause). Furthermore, if we know this is the endpoint, we may be able to encode the matrix even more sparsely.
Define Local Alleles in VCF to allow for sparser format #434 and more recent discussion about the star allele (#437, #464) suggest ways of dealing with multiallelic loci, another principal source of excessive population VCF file size growth. The convention here delivers partial value by omitting AD and PL in most entries, which may reduce the urgency of those proposals without eliminating it.
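As a rough sketch of the omission of AD and PL described above, the following Python function truncates a sample entry to its leading fields when the entry shows no reads supporting any ALT allele. The criterion (all ALT counts in AD equal to zero), the kept fields `GT` and `DP`, and the name `squeeze_entry` are assumptions made for illustration, not the wording proposed in this PR.

```python
def squeeze_entry(format_keys, sample_field, keep=("GT", "DP")):
    """Return the sample entry with trailing fields dropped when allowed.

    Assumed criterion (hypothetical): AD reports zero reads for every
    ALT allele. Only trailing fields can be dropped, so the kept keys
    must be the leading FORMAT keys; otherwise the entry is unchanged.
    """
    if tuple(format_keys[:len(keep)]) != tuple(keep):
        return sample_field
    values = dict(zip(format_keys, sample_field.split(":")))
    ad = values.get("AD", ".")
    if ad == ".":
        return sample_field  # no allele depths, nothing to decide on
    alt_counts = ad.split(",")[1:]
    if alt_counts and all(count == "0" for count in alt_counts):
        return ":".join(values[key] for key in keep)
    return sample_field


# Hypothetical FORMAT column and sample entries.
format_keys = "GT:DP:AD:GQ:PL".split(":")
hom_ref = squeeze_entry(format_keys, "0/0:16:16,0:48:0,48,720")
het = squeeze_entry(format_keys, "0/1:30:14,16:99:255,0,255")
```

Entries with any ALT-supporting reads keep all of their fields, so the QC information is retained where it matters most.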

hts-specs-bot · 2020-09-07T03:46:11Z

Changed PDFs as of 508c8c6: VCFv4.4.draft (diff).

document convention for project VCF "QC squeezing"

508c8c6

tskir added the vcf label Nov 9, 2020

illusional mentioned this pull request Apr 9, 2021

Different VCF conventions #554

Open
