Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bcftools view outputs empty string for non-trailing missing genotype field represented in BCF as [EOV, EOV] #1622

Closed
andersleung opened this issue Nov 30, 2021 · 4 comments

Comments

@andersleung
Copy link

andersleung commented Nov 30, 2021

I believe this is related to samtools/hts-specs#593

I'm working on BCF 2.2 writing for htsjdk, and we are trying to verify that our codec is compatible with bcftools' codec.

Following how we interpreted the wording of the spec, we output a vector of EOV values when a certain genotype key is missing entirely for a sample, as opposed to present but containing missing values. When bcftools reads our BCF output and writes it as VCF, it writes the empty string for that particular sample, where we would expect either . or .,., and this does not seem to be valid VCF.

An example of what we're seeing is in the CNL attribute of the second sample where we have 1|0::8

GT:CNL:DP:GQ:HQ 0|0:10,20:1:48:25,30	1|0::8:48:49,51	./.:1:5:43:.,.

The original VCF from which the data was obtained had the following line. The order of the genotype keys is different because htsjdk sorts them. You can see CNL for the second sample is just a missing value ..

GT:GQ:DP:HQ:CNL 0|0:48:1:25,30:10,20    1|0:48:8:49,51:.        ./.:43:5:.,.:1

This original VCF was converted to BCF using htsjdk's (work in progress) BCF 2.2 writer, which writes [EOV, EOV] for the missing CNL attribute, and then fed to bcftools, which is producing the incorrect VCF.

@pd3
Copy link
Member

pd3 commented Dec 9, 2021

I hope I follow well what is the input, what bcftools outputs and what is the desired output.

Obviously this is not a valid GT field 1|0::8, it should be 1|0:.:8, as only trailing fields can be omitted. If I am reading it correctly, this field was printed by bcftools and this is because the input BCF contains the EOV value in the htsjdk flavor of the BCF, whereas the htslib version would put .,EOV, so this is the first time bcftools encountered a file like this.

How to proceed:

First a general note, the HTSlib implementation came first and the language of the specification was trying to describe what's been implemented (sometimes imperfectly). However, I am perfectly happy for the changes to go both ways in principle.

Now, if you want to follow what has been implemented in HTSlib, then such BCF should not be produced and the specification should be clarified to reflect the actual behavior.

If you want to allow this new BCF flavor and make changes in the HTSlib implementation, then it might be good to first discuss the benefits and make everyone agree.

@andersleung
Copy link
Author

andersleung commented Dec 9, 2021

Thank you for the response. You understood everything correctly. We are alright with matching the behavior of bcftools because in practice VCF cannot represent this distinction anyway. However, as the person who opened the ticket on the BCF spec mentioned, it would be good if this could be made explicit in the BCF spec. My understanding of how bcftools represents this case, and how I might try to reflect this behavior in the spec would be:

An "empty vector" is one for which all values are missing, i.e. there is no data is available for the given key. Empty vectors must be represented by at least one MISSING value followed by as many EOV values are required to pad the vector to the appropriate length. The exact number of MISSING values in the representation is not required to be consistent between BCF implementations, except for the requirement that at least one MISSING value be present. All representations of an empty vector are considered equal.

@pd3
Copy link
Member

pd3 commented Dec 10, 2021

I wouldn't go so far as to say that it does not matter how many missing values there are. I can't think of a specific example, but using the genotype (GT) as an imperfect example (imperfect because it is a different beast), there it matters if one writes . or ./. or ././. as it communicates ploidy.

Instead, I suggest the following wording:

An "empty vector" is one for which all values are missing, i.e. there is no data is available for the given key.
Empty vectors must be represented by one MISSING value followed by as many EOV values are required 
to pad the vector to the appropriate length.

@pd3
Copy link
Member

pd3 commented Dec 16, 2021

A pull request to hts-specs has been made samtools/hts-specs#617

@pd3 pd3 closed this as completed Feb 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants