From 101135654384a2328f9a8ae99028e8fb074d1cd1 Mon Sep 17 00:00:00 2001 From: John Marshall Date: Tue, 13 Sep 2022 16:02:56 +0100 Subject: [PATCH] Clarify SAM file encoding (ASCII, UTF-8 in designated fields) Reword so that it is clear that the encodings specified apply to the *entirety* of SAM file contents. Mention allowable line terminators, and note that most UTF-8 line terminating characters are invalid (as this whitespace is in the ASCII-only parts outwith UTF-8 fields). Fixes #664. Spell out that the locale considerations are about e.g. requiring that floating-point values use '.' rather than a localised decimal point. --- SAMv1.tex | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/SAMv1.tex b/SAMv1.tex index 97b8e74c3..2a9490c48 100644 --- a/SAMv1.tex +++ b/SAMv1.tex @@ -67,8 +67,16 @@ \section{The SAM Format Specification} BAM file may optionally specify the version being used via the {\tt @HD VN} tag. For full version history see Appendix~\ref{sec:history}. -Unless explicitly specified elsewhere, all fields are encoded using 7-bit US-ASCII \footnote{Charset ANSI\_X3.4-1968 as defined in RFC1345.} in using the POSIX / C locale. -Regular expressions listed use the POSIX / IEEE Std 1003.1 extended syntax. +SAM file contents are 7-bit US-ASCII, except for certain field values as individually specified which may contain other Unicode characters encoded in UTF-8. +Alternatively and equivalently, SAM files are encoded in UTF-8 but non-ASCII characters are permitted only within certain field values as explicitly specified in the descriptions of those fields.% +\footnote{Hence in particular SAM files must not begin with a byte order mark~(BOM) and lines of text are delimited by ASCII line terminator characters only. +% Unicode identifies VT and FF as line break characters as well, but no one uses them in SAM. +In addition to the local platform's text file line termination conventions, implementations may wish to support \textsc{lf} and \textsc{cr\>lf} for interoperability with other platforms.} + +Where it makes a difference, SAM file contents should be read and written using the POSIX\,/\,C locale. +For example, floating-point values in SAM always use `{\tt .}' for the decimal-point character. + +The regular expressions in this specification are written using the POSIX\,/\,IEEE Std 1003.1 extended syntax. \subsection{An example}\label{sec:example} Suppose we have the following alignment with bases in lowercase