-
Notifications
You must be signed in to change notification settings - Fork 174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Phred scale clarifications #579
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a good idea. Some small comments below.
VCFv4.3.tex
Outdated
@@ -544,13 +544,14 @@ \subsection{VCF tag naming conventions} | |||
\begin{itemize} | |||
\item The `L' suffix means \emph{likelihood} as log-likelihood in the sampling distribution, $\log_{10} \Pr(\mathrm{Data}|\mathrm{Model})$. | |||
Likelihoods are represented as $\log_{10}$ scale, thus they are negative numbers (e.g.\ GL, CNL). | |||
The likelihood can be also represented in some cases as phred-scale in a separate tag (e.g.\ PL). | |||
The likelihood can be also represented in some cases as a phred-true scale ($-10 \log_{10}(probability\_of\_being\_correct)$) in a separate tag (e.g.\ PL). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really want to introduce the term "phred-true scale"? It seems like a bit of a strange one off. It's not clear that it's contrasting probability of correct with probability of incorrect until you read the 'Q Suffix' box below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really, but I'm unsure how best to describe it given the "P" prefix was almost certainly chosen for Phred.
Right now it states "Likelihoods are represented as
7cafb52
to
571f96c
Compare
Rebased off current master, with the line-wrapping adjusted for SAMv1.tex. The "phred-true-scale" sentence has been rewritten as left in as a separate commit so you can see the change, but should be squashed with the other VCF change when merging. Note this doesn't change vcf 4.4 as I don't know what our policies are on versions to update. Which versions need to be changed? Do we have one commit per spec version so we can search git logs easier, or do we combine them together in one commit? |
571f96c
to
d3da996
Compare
Applied the changes to VCF4.4 draft and, minimally to VCF4.2. I also squashed the previous commit and this change into the main VCF doc updates. For review purposes the changes applied to VCF 4.3 (bar a minor grammar fix) since previous review are still visible as 571f96c |
d3da996
to
e7eddf0
Compare
VCFv4.3.tex
Outdated
PL & G & Integer & Phred-scaled genotype likelihoods rounded to the closest integer \\ | ||
PP & G & Integer & Phred-scaled genotype posterior probabilities rounded to the closest integer \\ | ||
PQ & 1 & Integer & Phasing quality \\ | ||
PL & G & Integer & $-10 log_{10}$ scaled genotype likelihoods rounded to the closest integer \\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Formatting nits: use \log
to print a function in roman (and set off by thinspace from the -10) rather than badly spaced single-letter variables in italics; IMHO it looks better keeping the hyphen before -scaled
(and it's typeset very differently from a minus sign so shouldn't be confusing).
There's already a bunch of 10 log 10
in VCF*.tex that are missing the \log
, but this PR adds a bunch more instances so should probably use \log
in its additions.
PL & G & Integer & $-10 log_{10}$ scaled genotype likelihoods rounded to the closest integer \\ | |
PL & G & Integer & $-10\log_{10}$-scaled genotype likelihoods rounded to the closest integer \\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bah just noticed the "-scaled" bit too... I'll update again. :-)
I fixed the The first commit fixes the use of log in the changes I had made, but only those lines. Hence it could be squashed in with that. The second commit fixes use the of log in all other lines, for the sake of consistency. This is purely a formatting commit so it's possibly worth keeping it separate from the wording changes. |
7c1ada1
to
3d38893
Compare
I think two commits would be appropriate: (1) fix existing (I once had a draft of fixing all the existing |
Sometimes this refers to $10 log_{10}(p)$, sometimes to $10 log_{10}(1-p)$, and sometimes to something normalised so $p$ isn't really a probability at all. Note CNL, CNP and CNQ don't mention phred anywhere in their short description and only Phred in the long description for CNQ, so I applied the same logic to PL, PP (is this correct?) and PQ. Also clarified the "VCF tag naming conventions" part. I changed phred-scale in one part there to phred-true-scale. I'm not so happy with that, but as it's immediately followed by the formula I think it's clear.
3d38893
to
31feca3
Compare
At the next file format meeting we should decide on a policy on how far we backport clarifications and how we want to balance specifications stability, maintenance burden, and specifications clarity (same issue with #652). |
This fixes #534.
We abuse the term Phred-scale, particularly in VCF. This PR also has a minor wording change in SAM to be more explicit.
Sometimes VCF uses Phred as -10log_10(P(correct)) and sometimes -10log_10(P(incorrect)). Having two completely opposite direction scales with bigger-is-better vs smaller-is-better is really unhelpful. Plus sometimes it's not even probabilities at all, but likelihoods. Sometimes they're normalised, and sometimes not. This poor wording lead me up the garden path for a while and it wasn't until I checked source code that I understand what they meant, mainly due to preconceived (correct!) notions on what "Phred-scaled" means. Phred explicitly defined the scale as P(incorrect), so we now only use the term only when that is true.
I see this was already done correctly for the CNL, CNP and CNQ VCF tags, so I applied the same logic to the PL, PP and PQ tags.
Hopefully this also helps improve the understanding of GL (log10(p)) vs PL (-10log10(p)) by being more explicit instead of having formulae for defining one and a vague term for the other.