-
Notifications
You must be signed in to change notification settings - Fork 174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clarify SAM file encoding (mostly ASCII, some UTF-8 parts) #670
Conversation
8987c26
to
0dc7e3d
Compare
I'm happy with the new text, although I note it's draft still. PS. The rendering of the diff PDF is visible in Opera and MS Edge on Windows, but firefox shows the hex codes for unknown font symbol. I'm not sure what's going on there, but maintainers only need 1 working browser for validation and the final PDF is fine. :) |
I was less happy with the new text, I think because it said it was UTF-8 first and then adjusted it to mostly-ASCII second. I've turned that around again, and moved most of the text out of footnotes, so I think it's clearer and with a more accurate emphasis now. Unicode adds several more line termination characters, so I added a footnote pointing out that only ASCII line terminators are valid. And I've proposed a second commit using @jkbonfield's suggestion on #719 of using |
The page breaking is now horrible, splitting the table between VN and SO. I have a followup PR that rearranges the first section a bit to bring both figures in the SAM example back to the first page, which restores the previous better page breaks in the header section. |
I'd be wary of attempting to specify unicode with a hex range. We could, but I'm not 100% sure what "printable" even means in unicode. Is the block replacement character (FFFD) printable? It's there for when decoding fails I think. Also why to FFFE? There are more beyond that. 𐘬 :-) I just think it's a hiding to nothing and frankly |
Indeed, that pseudo-range was mostly a joke — like the ‘uh’ and ‘or something’, I wasn't suggesting that we'd add it to the spec verbatim. I didn't bother looking up what the “last” (printable) Unicode character is. In the proposed commit, I've used your |
I've split the second commit adjusting the header regex off as a separate PR, #726. This PR is now a fairly straightforward clarification to a single paragraph. I'd like to get it merged so that I can prepare the followup formatting improvement PR that affects the same sections. So let's start a one week timer for this one: if there are no objections before then, I will merge this on Friday May 19th. @jkbonfield: Feel free to review and merge this before then if you wish. |
Looks good to me, thanks. I'll merge it. |
…ls#670) Reword so that it is clear that the encodings specified apply to the *entirety* of SAM file contents. Mention allowable line terminators, and note that most UTF-8 line terminating characters are invalid (as this whitespace is in the ASCII-only parts outwith UTF-8 fields). Fixes samtools#664. Spell out that the locale considerations are about e.g. requiring that floating-point values use '.' rather than a localised decimal point.
Draft text clarifying the paragraph around file encoding, locales, and regular expressions.
Addresses #664.
I would like to spell it out in the first sentence a bit more but haven't come up with a great wording yet; something like