Approval for array-based data structure and multi-tiered digests for sequence collections #10

nsheff · 2021-04-21T15:11:32Z

I would like to solicit feedback from community members on the latest iteration of the digest algorithm for sequence collection identifiers. After lots of discussion (see #8, #1), here is the latest proposal.

It's an array of arrays, and here I'm showing just 3 arrays, but this approach works for any number of arrays, and is backwards compatible with sequence collections that lack certain array definitions.
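As a rough illustration of the multi-tiered idea (not the approved algorithm), the sketch below digests each array on its own and then digests the resulting set of array digests. The serialization (canonical JSON) and the hash (SHA-512 truncated to 24 bytes, base64url-encoded) are assumptions made for this example; the thread does not fix either. The function names and the sample values are hypothetical.

```python
import base64
import hashlib
import json


def digest(value) -> str:
    """Assumed digest: canonical JSON, SHA-512, first 24 bytes, base64url."""
    canonical = json.dumps(value, separators=(",", ":"), sort_keys=True)
    raw = hashlib.sha512(canonical.encode("utf-8")).digest()[:24]
    return base64.urlsafe_b64encode(raw).decode("ascii")


def array_digests(collection: dict) -> dict:
    """Tier 1: one digest per array, keyed by array name."""
    return {name: digest(values) for name, values in collection.items()}


def collection_digest(collection: dict) -> str:
    """Tier 0: a single digest computed over the per-array digests."""
    return digest(array_digests(collection))


# Placeholder values; the sequence entries stand in for real sequence digests.
seqcol = {
    "names": ["chr1", "chr2"],
    "lengths": [1000, 2000],
    "sequences": ["seq-digest-a", "seq-digest-b"],
}

print(array_digests(seqcol))
print(collection_digest(seqcol))

# A collection lacking an array is digested the same way.
print(collection_digest({"sequences": seqcol["sequences"]}))
```

Because every array is digested independently, adding or omitting an array changes only the top-level digest and leaves the digests of the other arrays untouched.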

The retrieval works like this

A simple server could allow only recursion=1, but we agreed that recursion=0 is very useful and should probably be a required part of the specification, while recursion=2 should probably be disabled. Given that the recursion=0 and recursion=1 layers are possible, this also enables retrieval of individual components, which is independently valuable.
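To make the recursion levels concrete, here is a minimal in-memory sketch. It assumes that recursion=0 returns the per-array digests, recursion=1 expands each digest into its array, and recursion=2 (which would expand sequence digests into full sequences) is rejected. That reading, the `SeqColStore` class, and the digest function are all assumptions for illustration, not part of the proposal text.

```python
import base64
import hashlib
import json


def digest(value) -> str:
    """Assumed digest: canonical JSON, SHA-512, first 24 bytes, base64url."""
    canonical = json.dumps(value, separators=(",", ":"), sort_keys=True)
    raw = hashlib.sha512(canonical.encode("utf-8")).digest()[:24]
    return base64.urlsafe_b64encode(raw).decode("ascii")


class SeqColStore:
    """Hypothetical store keyed by digest at both tiers."""

    def __init__(self):
        self.arrays = {}       # array digest -> array contents
        self.collections = {}  # collection digest -> {array name: array digest}

    def add(self, collection: dict) -> str:
        level1 = {}
        for name, values in collection.items():
            d = digest(values)
            self.arrays[d] = list(values)
            level1[name] = d
        top = digest(level1)
        self.collections[top] = level1
        return top

    def get_collection(self, top: str, recursion: int = 1) -> dict:
        level1 = self.collections[top]
        if recursion == 0:
            # Only the per-array digests.
            return dict(level1)
        if recursion == 1:
            # Each array digest expanded into its contents.
            return {name: self.arrays[d] for name, d in level1.items()}
        raise ValueError("only recursion=0 and recursion=1 are supported here")

    def get_component(self, array_digest: str) -> list:
        """Retrieve a single array directly by its own digest."""
        return self.arrays[array_digest]


store = SeqColStore()
top = store.add({
    "names": ["chr1", "chr2"],
    "lengths": [1000, 2000],
    "sequences": ["seq-digest-a", "seq-digest-b"],  # placeholder digests
})
shallow = store.get_collection(top, recursion=0)
print(shallow)
print(store.get_collection(top, recursion=1))
print(store.get_component(shallow["lengths"]))
```

Component retrieval falls out of the design: once the recursion=0 response exposes the array digests, any single array can be fetched without touching the rest of the collection.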

Regardless of what elements end up in a sequence collection, we're in a position to approve this as the digest algorithm. Feedback welcome!


nsheff · 2021-05-04T14:06:37Z

@jmarshall and @jb-adams would be interested to hear your thoughts.

jb-adams · 2021-05-04T14:36:29Z

I see this as a way to reconcile the two paradigms for sequence identifiers that we've been struggling to navigate:

- vendor-neutral: the checksum is derived from only the sequence itself
- vendor-specific: other fields (names, lengths) can be incorporated into the checksum

The result is a vendor-specific "meta-checksum" that contains within it the vendor-neutral checksum component. I think this is good, and a good way to get adopters to implement it, but it does raise a few questions:

- If the vendor-specific checksum is the one used primarily (passed to the service, referenced in papers, etc.), does this lead to downstream difficulty in identifying equivalent collections across services, research papers, etc.? For example, if collection abcdef and collection 012345 had the same underlying sequences but different names, this fact would not be apparent just from looking at the final checksums. This could be solved by the recursive lookup, but it's not obvious and requires API calls to investigate. This may not be a problem, but it seems to move away from the transparency of the original refget spec.
- Which checksum algorithms do we support, and how do we map across different identifiers for the same collection? If I provide an MD5 identifier, do I only get MD5 identifiers for the components? Do we plan to support a mechanism by which I can get the ga4gh or trunc512 id of the components in the response? Again, it would be difficult to see when 2 collections are equivalent if the ids in 2 published papers use different algorithms. There might also be bloat and performance issues if every component needs to support 3 ids. The simplest solution would be to have one algorithm for everything, but implementers may not like this.

My questions are mainly based on assessment of equivalency, and making sure we don't have a false-negative problem (2 collections are the same at the sequence level but appear different) when researchers share results. But if this is not a concern, then everything looks good to me.
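The false-negative case described above can be sketched as follows: two collections with different names get different top-level digests, yet comparing the per-array digests (what a recursion=0 lookup would expose) shows that the sequences are shared. The digest function, the helper name `shared_arrays`, and the sample data are assumptions for illustration only.

```python
import base64
import hashlib
import json


def digest(value) -> str:
    """Assumed digest: canonical JSON, SHA-512, first 24 bytes, base64url."""
    canonical = json.dumps(value, separators=(",", ":"), sort_keys=True)
    raw = hashlib.sha512(canonical.encode("utf-8")).digest()[:24]
    return base64.urlsafe_b64encode(raw).decode("ascii")


def array_digests(collection: dict) -> dict:
    """Per-array digests, keyed by array name."""
    return {name: digest(values) for name, values in collection.items()}


def shared_arrays(a: dict, b: dict) -> list:
    """Names of arrays whose digests are identical in both collections."""
    da, db = array_digests(a), array_digests(b)
    return sorted(name for name in da if name in db and da[name] == db[name])


# Same placeholder sequence digests and lengths, different naming conventions.
ucsc_style = {
    "names": ["chr1", "chr2"],
    "lengths": [1000, 2000],
    "sequences": ["seq-digest-a", "seq-digest-b"],
}
ensembl_style = {
    "names": ["1", "2"],
    "lengths": [1000, 2000],
    "sequences": ["seq-digest-a", "seq-digest-b"],
}

top_a = digest(array_digests(ucsc_style))
top_b = digest(array_digests(ensembl_style))
print(top_a == top_b)
print(shared_arrays(ucsc_style, ensembl_style))
```

The comparison still needs one lookup per collection, so the transparency concern stands; the sketch only shows that the information is recoverable from the intermediate tier.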

sveinugu · 2021-05-19T15:58:21Z

I guess the following comment is relevant here, depending on whether the question of ordering is included in the definition of the "digest algorithm": #7 (comment)

nsheff · 2023-01-11T20:21:45Z

This was approved in the ADR with PR #14

nsheff mentioned this issue Apr 21, 2021

What characters should we use for delimiters? #1

Closed

tcezard mentioned this issue May 5, 2021

How will the seqcol compatibility flags be encoded? #7

Closed

sveinugu mentioned this issue Aug 25, 2021

Sequence collection, ordered? or unordered? #5

Closed

nsheff mentioned this issue Feb 1, 2022

What information is included within the string-to-digest? #8

Closed

tcezard added this to the V1.0 milestone Sep 5, 2022

nsheff changed the title ~~Approval for digest algorithm for sequence collections~~ Approval for array-based data structure and multi-tiered digests for sequence collections Jan 11, 2023

nsheff closed this as completed Jan 11, 2023
