document convention for "QC squeezing" in population VCF #527
This PR documents a convention developed in spVCF to reduce the size of population-wide VCF files (those presenting the full locus x sample matrix) by selectively omitting FORMAT fields. As written, this is not a spec change; it merely suggests a useful application of an existing clause (referenced inline). We think it is worth documenting explicitly because we've encountered some downstream tools that get tripped up by it.
In our experiments, applying this convention to WGS/WES VCF files for cohorts like 1KGP and UKB (generated with different pipelines) yields a 4-6x file size reduction with no other changes.
Related PRs:
The convention suggested here complements either approach by providing a cheap way to "densify" the sparsely encoded matrix, making it much easier for existing tools to consume (perhaps with minor fixes, if they don't honor the existing spec clause). Furthermore, if we know this is the endpoint, we may be able to encode the matrix even more sparsely.
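
As a rough illustration of the kind of minor fix a downstream tool might need, here is a minimal Python sketch. It assumes the existing clause in question is the VCF rule that trailing FORMAT fields may be dropped from a sample entry (with GT always retained). The function name `parse_sample` and the example values are hypothetical and are not taken from spVCF or any particular tool.

```python
def parse_sample(format_keys, sample_entry):
    """Map FORMAT keys to values, treating dropped trailing fields as missing."""
    values = sample_entry.split(":")
    if len(values) > len(format_keys):
        raise ValueError("sample entry has more fields than FORMAT declares")
    # Assumption: omitted trailing fields are equivalent to the missing value "."
    values += ["."] * (len(format_keys) - len(values))
    return dict(zip(format_keys, values))


# Hypothetical FORMAT column and two sample entries: one full, one "squeezed".
format_keys = "GT:DP:AD:GQ:PL".split(":")
full = parse_sample(format_keys, "0/1:30:14,16:99:255,0,255")
squeezed = parse_sample(format_keys, "0/0:32")
```

A tool that instead indexes fields by position without checking the entry length would fail on the second entry, which is the sort of breakage the documentation is meant to head off.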