-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Harmonise identifier (e.g. RNAME/Contig, Sample) rules across formats and protocols #5
Comments
For reference sequence names, SAM and VCF (and hence BAM, CRAM, and BCF) have a very specific regular expression that disallows whitespace, backslashes, commas, various quotation marks, and brackets, and also
In htsget and refget, these appear as double-quoted JSON strings. So the SAM regex values would fit in these formats without additional escaping or syntax. OTOH neither htsget nor refget suggests that these values are anything other than arbitrary strings. |
For sample identifiers, both SAM and VCF disallow tabs, and VCF currently de facto disallows commas. It is proposed (samtools/hts-specs#414) to be explicit about VCF's commas. Htsget does not currently tie this down in any way, though it would probably be impractical to use a In phenopackets, this appears to be described only as an “arbitrary identifier”. It might be good to try to restrict the character set further (beyond <tab> and <comma>) to free up more punctuation characters for convenient use as format delimiters and on tools' command lines, but this would require analysis of what constitutes a sample name in the wild at present. For reference sequence names we collected fairly extensive statistics (see samtools/hts-specs#333 (comment)), but I am unaware of that having been done for sample name identifiers. |
Probably not relevant, but here are some identifier rules in the http://bit.ly/ga4gh-passport-v1 spec:
|
I think GA4GH needs a common set of core data types, with identifier being one of those. The types should be based on existing standards (e.g., ISO) so we don't reinvent the wheel, and they should be as technology/language-agnostic as possible to support implementations in a variety of systems. Just my 2 cents. Looking forward to the discussion. |
@cdvoisin: Thanks, that's interesting. It's a bit different from the very particular individual classes of metadata item that this issue is trying to focus on, as they inherit the existing defined syntax of URLs. @rrfreimuth: Those are good general principles. But as was hopefully clear in the discussion, this issue is intended to be about specific items of metadata individually and e.g. is just referring to this group of particular items collectively as “identifiers”. |
Next steps on this IMHO is to start with (say) reference sequence names, and try to answer the questions posed in the presentation in the April meeting:
|
Re sample identifiers, I believe Phenopackets has a representation for such a field:
@frafrx or other Phenopackets experts: Biosample describes this |
@jmarshall AFAIK in Phenopackets & Beacon we follow the principles of
|
Late to the party, and too late to change for VCF too, but I dislike the word "contig" being used as just another form of sequence. So did Rodger Staden, the person who coined the word "contig": https://staden.sourceforge.net/contig.html Note for the curious this also uses "gel readings", which later just got shortened to the "reads" we know today. The original definitions have one thing very clear - it's contiguous, without gaps (if I recall Rodger later adimtted he probably meant continous as the reads don't have to abut, just overlap). Genome browsers started muddling things when they didn't understand the difference between a set of overlapping reads forming a contig, and their consensus sequence. They started just using contig instead of consensus sequence, and in doing so lost of the original meaning and caused more confusion. It then got corrupted even further when they stopped caring whether the sequence was even contiguous or not. Sadly that's where VCF ended up. (Although I note it sometimes uses "Chromosome" instead.) So my preference is definitely for "reference sequence" or similar, and this also fits far better with most of the other use cases here (SAM, BAM, CRAM, Refget, etc). If we're talking about assemblies, then sometimes "consensus sequence" is more appropriate, but the two have largely interchangeable use cases. |
There are a number of short identifier-sized pieces of metadata that are used across many GA4GH products. For example:
Reference sequence names
In SAM/BAM/CRAM, this is the
@SQ-SN
header field andRNAME
/RNEXT
/etc fields.In VCF/BCF, it's the
##contig
ID
.In htsget, it's
referenceName
.In refget, it may be returned as an
alias
.Sample identifiers
In SAM/BAM/CRAM, this is the
@RG-SM
header field.In VCF/BCF, it's the
##SAMPLE
ID
and it also appears on the#CHROM
header line.In htsget, it forms the bulk of the
path
part of request URLs, and there is a proposal to encode samples in thequery
part as well (htsget: addsamples
query parameter, principally to select subset of VCF columns samtools/hts-specs#430).In Phenopackets, it's a Biosample's
id
field.These items of metadata are embedded within the surrounding text using various delimiters in these various formats and protocols. So there are various restrictions on what characters may appear in them so as to avoid conflicting with the delimiter characters or otherwise requiring complicated escaping or encoding mechanisms. It would be good to harmonise these restrictions across GA4GH products, so that a value that was e.g. a valid Sample identifier in one product could be assumed to also be valid in other products.
The text was updated successfully, but these errors were encountered: