Harmonise identifier (e.g. RNAME/Contig, Sample) rules across formats and protocols #5

jmarshall · 2020-02-03T16:46:46Z

There are a number of short identifier-sized pieces of metadata that are used across many GA4GH products. For example:

Reference sequence names

- In SAM/BAM/CRAM, this is the @SQ-SN header field and RNAME/RNEXT/etc fields.
- In VCF/BCF, it's the ##contig ID.
- In htsget, it's referenceName.
- In refget, it may be returned as an alias.

Sample identifiers

- In SAM/BAM/CRAM, this is the @RG-SM header field.
- In VCF/BCF, it's the ##SAMPLE ID and it also appears on the #CHROM header line.
- In htsget, it forms the bulk of the path part of request URLs, and there is a proposal to encode samples in the query part as well (htsget: add samples query parameter, principally to select subset of VCF columns samtools/hts-specs#430).
- In Phenopackets, it's a Biosample's id field.
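
To make the first two reference sequence name locations concrete, here is a minimal sketch that pulls the names out of SAM and VCF header text so they can be compared. The function names are hypothetical, the header lines are made-up examples, and the VCF pattern assumes that ID is the first key inside the ##contig angle brackets.

```python
import re

def sam_reference_names(header_lines):
    """Collect the SN values from @SQ lines of a SAM text header."""
    names = []
    for line in header_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "@SQ":
            continue
        for field in fields[1:]:
            if field.startswith("SN:"):
                names.append(field[3:])
    return names

# Assumption: ID is the first key in the ##contig structured header line.
CONTIG_ID = re.compile(r"^##contig=<ID=([^,>]+)")

def vcf_contig_ids(header_lines):
    """Collect the ID values from ##contig lines of a VCF header."""
    ids = []
    for line in header_lines:
        match = CONTIG_ID.match(line)
        if match:
            ids.append(match.group(1))
    return ids

# Made-up example headers, for illustration only.
sam_header = ["@HD\tVN:1.6", "@SQ\tSN:chr1\tLN:248956422"]
vcf_header = ["##fileformat=VCFv4.3", "##contig=<ID=chr1,length=248956422>"]

# The same name is delimited by tabs and "SN:" in SAM, but by "ID=" and "," or ">" in VCF.
same_names = sam_reference_names(sam_header) == vcf_contig_ids(vcf_header)
```

The different delimiters around the same value are what lead to the differing character restrictions discussed in this issue.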

These items of metadata are embedded within the surrounding text using different delimiters in each of these formats and protocols. Each product therefore restricts which characters may appear in them, so as to avoid conflicting with its delimiter characters or otherwise requiring complicated escaping or encoding mechanisms. It would be good to harmonise these restrictions across GA4GH products, so that a value that was e.g. a valid Sample identifier in one product could be assumed to also be valid in other products.


jmarshall · 2020-02-03T17:01:10Z

For reference sequence names, SAM and VCF (and hence BAM, CRAM, and BCF) have a very specific regular expression that disallows whitespace, backslashes, commas, various quotation marks, and brackets, and also = or * as the first character (see samtools/hts-specs#333 and samtools/hts-specs#379):

[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*

In htsget and refget, these appear as double-quoted JSON strings. So the SAM regex values would fit in these formats without additional escaping or syntax. OTOH neither htsget nor refget suggests that these values are anything other than arbitrary strings.
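
As a minimal sketch of how the regular expression above could be applied, the following checks candidate names against it and confirms that a matching name needs no escaping when written as a JSON string. The pattern is copied from the comment above; the function name is_valid_rname and the example names are illustrative assumptions, not part of any specification.

```python
import json
import re

# Pattern copied verbatim from the comment above.
RNAME_PATTERN = re.compile(
    r"[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*"
)

def is_valid_rname(name: str) -> bool:
    """Return True if the whole string matches the SAM/VCF name pattern."""
    return RNAME_PATTERN.fullmatch(name) is not None

def needs_json_escaping(name: str) -> bool:
    """Return True if JSON encoding changes the string beyond adding quotes."""
    return json.dumps(name) != '"' + name + '"'

# Illustrative names only: "*" and "=" are rejected as first characters,
# and whitespace or commas are rejected anywhere.
candidates = ["chr1", "HLA-A*01:01", "*", "=chr1", "chr 1", "chr1,chr2"]
valid = [name for name in candidates if is_valid_rname(name)]
```

Using fullmatch rather than match matters here, since match would accept a string with a valid prefix followed by disallowed characters.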

jmarshall · 2020-02-03T17:19:51Z

For sample identifiers, both SAM and VCF disallow tabs, and VCF currently de facto disallows commas. It is proposed (samtools/hts-specs#414) to be explicit about VCF's commas.

Htsget does not currently tie this down in any way, though it would probably be impractical to use a ? character and whitespace would be semi-impractical. One proposal for the query part would also disallow the comma character.

In phenopackets, this appears to be described only as an “arbitrary identifier”.

It might be good to try to restrict the character set further (beyond <tab> and <comma>) to free up more punctuation characters for convenient use as format delimiters and on tools' command lines, but this would require analysis of what constitutes a sample name in the wild at present. For reference sequence names we collected fairly extensive statistics (see samtools/hts-specs#333 (comment)), but I am unaware of that having been done for sample name identifiers.
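
The per-product restrictions described in this comment can be sketched as a small lookup, which reports where a given sample identifier would run into trouble. The table below only paraphrases the comment above (including the "impractical" htsget cases); it is not normative, and the names PROBLEM_CHARS and problem_products are hypothetical.

```python
# Characters that each product is described above as disallowing, or as
# impractical, in sample identifiers. A paraphrase of the comment, not a spec.
PROBLEM_CHARS = {
    "SAM @RG-SM": {"\t"},
    "VCF sample": {"\t", ","},
    "htsget path": {"?", " ", "\t"},
}

def problem_products(sample_id):
    """Map each product to the problematic characters found in sample_id."""
    found = {}
    for product, chars in PROBLEM_CHARS.items():
        bad = chars & set(sample_id)
        if bad:
            found[product] = bad
    return found

# Illustrative identifiers only.
report = {sid: problem_products(sid) for sid in ["NA12878", "tumor,normal", "a b"]}
```

An identifier is portable across these products only if the returned mapping is empty, which is the property a harmonised rule would guarantee by construction.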

cdvoisin · 2020-03-09T16:59:06Z

Probably not relevant, but here are some identifier rules in the http://bit.ly/ga4gh-passport-v1 spec:

https://github.com/ga4gh-duri/ga4gh-duri.github.io/blob/master/researcher_ids/ga4gh_passport_v1.md#custom-passport-visa-types
https://github.com/ga4gh-duri/ga4gh-duri.github.io/blob/master/researcher_ids/ga4gh_passport_v1.md#url-fields
https://github.com/ga4gh-duri/ga4gh-duri.github.io/blob/master/researcher_ids/ga4gh_passport_v1.md#affiliationandrole -- specifically the character restrictions on custom roles.

rrfreimuth · 2020-04-16T17:55:15Z

I think GA4GH needs a common set of core data types, with identifier being one of those. The types should be based on existing standards (e.g., ISO) so we don't reinvent the wheel, and they should be as technology/language-agnostic as possible to support implementations in a variety of systems.

Just my 2 cents. Looking forward to the discussion.

jmarshall · 2020-05-11T14:31:03Z

@cdvoisin: Thanks, that's interesting. It's a bit different from the very particular individual classes of metadata item that this issue is trying to focus on, as the Passport fields inherit the existing defined syntax of URLs.

@rrfreimuth: Those are good general principles. But as was hopefully clear in the discussion, this issue is intended to be about specific items of metadata individually, and it uses “identifiers” only as a collective label for this group of particular items.

jmarshall · 2020-05-11T14:32:50Z

The next step on this IMHO is to start with (say) reference sequence names, and try to answer the questions posed in the presentation in the April meeting:

- Do other working groups (besides LSG with SAM & VCF) have any similar restrictions on their equivalents of reference sequence names? (We would want to harmonise to the lowest common denominator.)
- Should this definition of RNAME identifier be promoted as the GA4GH portable reference sequence name?

If so, how? Ways of doing this include:
- Incorporate this regexp or equivalent into other GA4GH standards explicitly
- Add references in other GA4GH standards to some pan-GA4GH location describing this pan-GA4GH standard RNAME building block
- Or simply note in other GA4GH standards that “SAM has rules restricting the characters used in RNAMEs — see the SAM specification for details”

jmarshall · 2020-06-08T13:39:06Z

Re sample identifiers, I believe Phenopackets has a representation for such a field:

A Biosample refers to a unit of biological material from which the substrate molecules (e.g. genomic DNA, RNA, proteins) for molecular analyses (e.g. sequencing, array hybridisation, mass-spectrometry) are extracted.
[…]

| Field | Type | Status | Description |
| --- | --- | --- | --- |
| id | string | required | arbitrary identifier |

[…]

Example
{
  "id": "sample1",
  "individualId": "patient1",
  "description": "",
  […]

@frafrx or other Phenopackets experts: Biosample describes this id field simply as an “arbitrary identifier”. Does Phenopackets have any other rules about how these ids may be formed? Would Phenopackets wish to align with VCF's rules disallowing tabs and commas (and possibly other punctuation characters to be determined)?

mbaudis · 2023-07-10T15:24:35Z

@jmarshall AFAIK in Phenopackets & Beacon we follow the principles of

- id is resource-local (but could use namespaced identifier)
- the id of a schema defined object (biosample, subject/individual, ...) can be referenced in other schemas with the schema's name (i.e. the id of a biosample can be referenced in a derived analysis as biosampleId or biosample_id)

jkbonfield · 2023-07-11T08:33:24Z

Late to the party, and too late to change for VCF too, but I dislike the word "contig" being used as just another form of sequence. So did Rodger Staden, the person who coined the word "contig": https://staden.sourceforge.net/contig.html Note for the curious this also uses "gel readings", which later just got shortened to the "reads" we know today.

The original definitions have one thing very clear - it's contiguous, without gaps (if I recall, Rodger later admitted he probably meant continuous as the reads don't have to abut, just overlap). Genome browsers started muddling things when they didn't understand the difference between a set of overlapping reads forming a contig, and their consensus sequence. They started just using contig instead of consensus sequence, and in doing so lost the original meaning and caused more confusion. It then got corrupted even further when they stopped caring whether the sequence was even contiguous or not. Sadly that's where VCF ended up. (Although I note it sometimes uses "Chromosome" instead.)

So my preference is definitely for "reference sequence" or similar, and this also fits far better with most of the other use cases here (SAM, BAM, CRAM, Refget, etc). If we're talking about assemblies, then sometimes "consensus sequence" is more appropriate, but the two have largely interchangeable use cases.

jmarshall changed the title ~~Harmonise identifier (e.g. RNAME/Contig; Sample) rules across formats and protocols~~ Harmonise identifier (e.g. RNAME/Contig, Sample) rules across formats and protocols Feb 3, 2020

mamanambiya pinned this issue Apr 21, 2020

jmarshall mentioned this issue Nov 13, 2020

What characters should be allowed in sequence names? ga4gh/refget#2

Closed

mamanambiya unpinned this issue Apr 12, 2021
