Identifier construction: To prefix or not to prefix #37
On 2022-10-19, there was some support for including the type prefix, but not the namespace prefix ( Level 0
Level 1
Level 2
At level 2, would we want to add in the
|
Level 2 is the bit that worries me since |
According to the CURIE Syntax document:
Not 100% sure whether a service-related schema such as ours would qualify as a "host language", but if so we seem to be free to define our own mechanism for defining a default prefix value. I googled my way to the specification of the UHF Hypermedia Format (UHF), which makes use of default CURIE prefixes and is also similar to our use case as it is basically a JSON schema or "format". I am really only arguing that we can omit the prefix and still state that the values are CURIEs. Any automated usage must still extract our default prefix in a custom way, as the CURIE syntax document does not seem to define a canonical method for providing the default prefix in an automated fashion. In the end, I suggest we contact identifiers.org or other relevant entities to get their view of the issue. @andrewyatz For clarity, does the refget standard specify that the endpoints require the prefix to be available or is it optional? |
GA4GH-compliant refget instances in v2 will accept GA4GH identifiers of the format |
A nice blog post about CURIEs and why we need them, as background: https://cthoyt.com/2021/09/14/curies.html |
Some summary from today's discussion: 2 questions posted by Tim:
It seems we were approaching consensus that we could offer API endpoints that behave both ways: either they give exactly the string that was digested, if requested, or they give a more information-rich version. In fact, if we include non-digested arrays, then by definition the server will be serving up data that is different from exactly what is digested. Maybe it would be nice to have a flag or endpoint or option to get the exact digested string, though. So, a thought experiment is:
So this leads to a few next questions:
|
Great writeup, @nsheff! I only want to add some comments regarding the Refget v2 digest. I think we also agreed that the Refget v2 digest isn’t actually a CURIE, even though it looks very much like one. This was surprising to me and I think it has also been a cause of misunderstandings lately. From the CURIE syntax document:
So for the refget v2 digest to be a CURIE, say Example: Say you host a refget v2 server with the main endpoint available at (sorry, I did not bother looking up the actual endpoint name requirements in refget v2):
Then if
However, this leaves out the namespace from the input to the endpoint, contrary to what Refget v2 requires, according to @andrewyatz (#37 (comment)). I think it is unfortunate that the Refget v2 digest quacks like a duck without being a duck (but perhaps a swan?… 😁). Even if the standard does not state that the digest is a CURIE, it looks very much like one. I understand the ship has sailed in Refget v2 on this, sadly. I think another thing we were nearing consensus on was that we would probably want to raise an issue to a higher power in GA4GH on what to use for the namespace of a seqcol CURIE identifier. I would argue for using just
Which is not uncommon for CURIEs, ex Including some variant of a
|
Just wanted to concretize some of my thoughts after today's meeting and the decision to not include any prefixes in the serializations (except the Refget one):
Digests vs identifiers
For me, the decision was made based on a clear separation of concerns between the:
Two different concepts should have different identifiers, even if the contents are the same. A way to clearly separate these concerns is to not include any prefixes at all in the digests. This is in essence what I believe we decided on today.
About identifiers
Regarding the identifiers, I think we should distinguish between locally and globally unique identifiers (Reference: "Unique, persistent identifiers" FAIR Cookbook). Identifiers should also be persistent and machine-resolvable. Identifiers could be full URIs, for instance using persistent URLs, or they could be represented as CURIEs (see the FAIR Cookbook recipe or the above-mentioned blog post).
Suggestion for top-level seqcol identifiers
Syntax
So I have the following simple suggestion for relating globally unique identifiers in the form of CURIEs with the top level digests:
e.g.
Globally vs locally unique
If we remove the prefix, we get a locally unique identifier, which in this case is equal to the digest. Following the conceptual framework from the CURIE syntax, this can be viewed as defining, in the context of a seqcol server, that the "default prefix value" is Similarly, when others are making use of the seqcol identifiers in other contexts, they could in the same way define In conclusion: In the specification, we can basically say that a seqcol identifier is a CURIE, constructed according to the above syntax, and that the default prefix for a seqcol server is Note: A consequence of defining
Resolving the CURIE identifiers to URIs
In a CURIE resolution service, such as identifiers.org or N2T, one could e.g. provide the following mappings:
Resolving the
Suggestion for second-level seqcol identifiers
So what about possible identifiers for concepts represented by arrays (second level)? I suggest the following syntax:
e.g.
CURIE resolution services would then resolve this identifier into e.g.:
Whether the endpoints would accept that identifier or not is up to the implementation.
Note on persistent URLs
One could also later provide a mapping to a persistent URL scheme if there is a need for that, e.g.:
(BTW: I found this ga4gh domain under the Internet Archive-governed PURL system. It seems to have been registered by the GA4GH-Pedigree-Standard, helpfully using the top-level domain directly...) |
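To make the default-prefix idea in the comment above a bit more concrete, here is a minimal Python sketch of how a client might treat a bare digest as a locally unique identifier, expand it to a CURIE with a default prefix, and then resolve it through a prefix map. The prefix `example`, the base URL, and the function names `to_curie` and `resolve` are all placeholders invented for illustration; they are not the values proposed in this thread and not part of any specification.

```python
# Hypothetical values, for illustration only: neither the prefix nor the URL
# comes from the seqcol or refget specifications.
DEFAULT_PREFIX = "example"
PREFIX_MAP = {"example": "https://seqcol.example.org/collection/"}


def to_curie(identifier: str, default_prefix: str = DEFAULT_PREFIX) -> str:
    """Return a CURIE, adding the default prefix when none is present.

    Assumes the bare reference never contains ':' (true for base64url digests).
    """
    if ":" in identifier:
        return identifier
    return f"{default_prefix}:{identifier}"


def resolve(identifier: str, prefix_map: dict = PREFIX_MAP) -> str:
    """Expand a CURIE (or a bare local identifier) into a full URI."""
    prefix, _, reference = to_curie(identifier).partition(":")
    if prefix not in prefix_map:
        raise ValueError(f"unknown prefix: {prefix!r}")
    return prefix_map[prefix] + reference


local_id = "abc123"                    # locally unique: just the digest
global_id = to_curie(local_id)         # globally unique: prefix + digest
print(global_id)
print(resolve(local_id) == resolve(global_id))
```

In this sketch the local and global forms resolve to the same URI, which is the property the default prefix is meant to provide. How a client discovers the default prefix is left open, as noted earlier in the thread.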
In discussions in November and December 2022, we divided this issue into 2 related issues:
For the first, we have an agreement: we do not include the ga4gh prefix, or type prefixes. This is codified in PR #42. The second is kind of a spinoff question, which I believe is still under debate. |
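As a rough illustration of why it matters whether prefixes are part of the digested content, the following Python sketch digests two arrays and then digests the resulting digests, once bare and once with a type prefix added. The truncated SHA-512 helper follows the GA4GH `sha512t24u` convention, but the JSON serialization used here and the `SQ.` type prefix are simplifying assumptions for the example, not the serialization defined by the specification.

```python
import base64
import hashlib
import json


def sha512t24u(data: bytes) -> str:
    """SHA-512, truncated to 24 bytes, base64url-encoded (32 characters)."""
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")


def digest_array(values: list) -> str:
    # Assumption: compact JSON stands in for the real canonical serialization.
    return sha512t24u(json.dumps(values, separators=(",", ":")).encode("utf-8"))


def digest_collection(array_digests: dict, type_prefix: str = "") -> str:
    """Digest a mapping of attribute name -> array digest.

    type_prefix is hypothetical; an empty string means bare digests are digested.
    """
    content = {name: type_prefix + d for name, d in array_digests.items()}
    serialized = json.dumps(content, separators=(",", ":"), sort_keys=True)
    return sha512t24u(serialized.encode("utf-8"))


level1 = {
    "names": digest_array(["chr1", "chr2"]),
    "lengths": digest_array([248956422, 242193529]),
}
bare = digest_collection(level1)
prefixed = digest_collection(level1, type_prefix="SQ.")
print(bare != prefixed)  # a prefix inside the digested string changes the result
```

Because the top-level digest depends on the exact string that is digested, any prefix included at that step becomes a permanent part of the identifier. Leaving prefixes out of the digested content keeps that decision separate from how identifiers are presented.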
Following a 1:1 discussion with Nathan (apologies for not being in yesterday's meeting from the start), we think there is a good course of action. We also believe that, because name-spaced identifiers have been misnamed as CURIEs, we have conflated retrieving an entity by its ID with the data required to resolve such an identifier.
|
We discussed this in the GKS leads call this week. A few takeaways from the discussion:
|
On 2022-09-21 we debated how to actually form the identifiers. Like, is there a <prefix>, and/or a <type_prefix>, and are these modifiers used just for returning identifiers, or are they actually digested, since our protocol involves digesting digests.
Here are some thoughts:
<prefix>:<type_prefix>.<digest>
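For readers who want to see the `<prefix>:<type_prefix>.<digest>` form in action, here is a small Python sketch that builds and parses such identifiers with both modifiers optional. The helper names `format_identifier` and `parse_identifier` and the example values are made up for illustration. The sketch assumes digests are base64url strings, so they never contain `:` or `.`.

```python
from typing import NamedTuple, Optional


class Identifier(NamedTuple):
    prefix: Optional[str]       # namespace prefix, if present
    type_prefix: Optional[str]  # type prefix, if present
    digest: str


def format_identifier(digest: str, prefix: Optional[str] = None,
                      type_prefix: Optional[str] = None) -> str:
    """Build '<prefix>:<type_prefix>.<digest>', omitting any absent part."""
    out = digest
    if type_prefix:
        out = f"{type_prefix}.{out}"
    if prefix:
        out = f"{prefix}:{out}"
    return out


def parse_identifier(text: str) -> Identifier:
    """Split an identifier back into its optional prefixes and the digest."""
    prefix, sep, rest = text.partition(":")
    if not sep:
        prefix, rest = None, text
    type_prefix, sep, digest = rest.partition(".")
    if not sep:
        type_prefix, digest = None, rest
    return Identifier(prefix, type_prefix, digest)


# Illustrative values only.
full = format_identifier("abc123", prefix="ga4gh", type_prefix="SQ")
print(full)
print(parse_identifier(full))
print(parse_identifier("abc123"))
```

Keeping the prefixes as a presentation layer around the digest, as in this sketch, means a server can return the bare digest or the decorated identifier from the same stored value.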