-
Notifications
You must be signed in to change notification settings - Fork 240
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bcftools view outputs empty string for non-trailing missing genotype field represented in BCF as [EOV, EOV] #1622
Comments
I hope I follow well what is the input, what bcftools outputs and what is the desired output. Obviously this is not a valid GT field How to proceed: First a general note, the HTSlib implementation came first and the language of the specification was trying to describe what's been implemented (sometimes imperfectly). However, I am perfectly happy for the changes to go both ways in principle. Now, if you want to follow what has been implemented in HTSlib, then such BCF should not be produced and the specification should be clarified to reflect the actual behavior. If you want to allow this new BCF flavor and make changes in the HTSlib implementation, then it might be good to first discuss the benefits and make everyone agree. |
Thank you for the response. You understood everything correctly. We are alright with matching the behavior of bcftools because in practice VCF cannot represent this distinction anyway. However, as the person who opened the ticket on the BCF spec mentioned, it would be good if this could be made explicit in the BCF spec. My understanding of how bcftools represents this case, and how I might try to reflect this behavior in the spec would be:
|
I wouldn't go so far as to say that it does not matter how many missing values there are. I can't think of a specific example, but using the genotype (GT) as an imperfect example (imperfect because it is a different beast), there it matters if one writes Instead, I suggest the following wording:
|
A pull request to hts-specs has been made samtools/hts-specs#617 |
I believe this is related to samtools/hts-specs#593
I'm working on BCF 2.2 writing for htsjdk, and we are trying to verify that our codec is compatible with bcftools' codec.
Following how we interpreted the wording of the spec, we output a vector of
EOV
values when a certain genotype key is missing entirely for a sample, as opposed to present but containing missing values. When bcftools reads our BCF output and writes it as VCF, it writes the empty string for that particular sample, where we would expect either.
or.,.
, and this does not seem to be valid VCF.An example of what we're seeing is in the
CNL
attribute of the second sample where we have1|0::8
The original VCF from which the data was obtained had the following line. The order of the genotype keys is different because htsjdk sorts them. You can see
CNL
for the second sample is just a missing value.
.This original VCF was converted to BCF using htsjdk's (work in progress) BCF 2.2 writer, which writes
[EOV, EOV]
for the missingCNL
attribute, and then fed to bcftools, which is producing the incorrect VCF.The text was updated successfully, but these errors were encountered: