Skip to content

Commit

Permalink
Clarify SAM file encoding (ASCII, UTF-8 in designated fields)
Browse files Browse the repository at this point in the history
Reword so that it is clear that the encodings specified apply to the
*entirety* of SAM file contents. Mention allowable line terminators,
and note that most UTF-8 line terminating characters are invalid
(as this whitespace is in the ASCII-only parts outwith UTF-8 fields).
Fixes #664.

Spell out that the locale considerations are about e.g. requiring that
floating-point values use '.' rather than a localised decimal point.
  • Loading branch information
jmarshall committed May 4, 2023
1 parent 3c493e7 commit 1011356
Showing 1 changed file with 10 additions and 2 deletions.
12 changes: 10 additions & 2 deletions SAMv1.tex
Original file line number Diff line number Diff line change
Expand Up @@ -67,8 +67,16 @@ \section{The SAM Format Specification}
BAM file may optionally specify the version being used via the
{\tt @HD VN} tag. For full version history see Appendix~\ref{sec:history}.

Unless explicitly specified elsewhere, all fields are encoded using 7-bit US-ASCII \footnote{Charset ANSI\_X3.4-1968 as defined in RFC1345.} in using the POSIX / C locale.
Regular expressions listed use the POSIX / IEEE Std 1003.1 extended syntax.
SAM file contents are 7-bit US-ASCII, except for certain field values as individually specified which may contain other Unicode characters encoded in UTF-8.
Alternatively and equivalently, SAM files are encoded in UTF-8 but non-ASCII characters are permitted only within certain field values as explicitly specified in the descriptions of those fields.%
\footnote{Hence in particular SAM files must not begin with a byte order mark~(BOM) and lines of text are delimited by ASCII line terminator characters only.
% Unicode identifies VT and FF as line break characters as well, but no one uses them in SAM.
In addition to the local platform's text file line termination conventions, implementations may wish to support \textsc{lf} and \textsc{cr\>lf} for interoperability with other platforms.}

Where it makes a difference, SAM file contents should be read and written using the POSIX\,/\,C locale.
For example, floating-point values in SAM always use `{\tt .}' for the decimal-point character.

The regular expressions in this specification are written using the POSIX\,/\,IEEE Std 1003.1 extended syntax.

\subsection{An example}\label{sec:example}
Suppose we have the following alignment with bases in lowercase
Expand Down

0 comments on commit 1011356

Please sign in to comment.