Approval for array-based data structure and multi-tiered digests for sequence collections #10

Closed
nsheff opened this issue Apr 21, 2021 · 4 comments

@nsheff
Member

nsheff commented Apr 21, 2021

I would like to solicit feedback from community members on the latest iteration of the digest algorithm for sequence collection identifiers. After lots of discussion (see #8, #1), here is the latest proposal.

image

It's an array of arrays. Here I'm showing just 3 arrays, but the approach works for any number of arrays and is backwards compatible with sequence collections that lack certain array definitions.
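A minimal sketch of how a multi-tiered digest over an array of arrays could be computed. Everything concrete here is an assumption for illustration, not something the proposal fixes: the array names (`names`, `lengths`, `sequences`), the placeholder sequence identifiers, the JSON serialization, and the hash (SHA-512 truncated to 24 bytes, base64url-encoded). The point is only that each array gets its own digest, and the collection digest is computed over those array digests.

```python
import base64
import hashlib
import json


def digest(obj) -> str:
    """Hypothetical digest: compact sorted JSON -> SHA-512 -> first 24 bytes -> base64url."""
    canonical = json.dumps(obj, separators=(",", ":"), sort_keys=True)
    raw = hashlib.sha512(canonical.encode("utf-8")).digest()[:24]
    return base64.urlsafe_b64encode(raw).decode("ascii")


def array_digests(collection: dict) -> dict:
    """Tier 1: digest every array independently of the others."""
    return {name: digest(values) for name, values in collection.items()}


def collection_digest(collection: dict) -> str:
    """Tier 0: digest the mapping of array names to array digests."""
    return digest(array_digests(collection))


# Made-up example data; the sequence identifiers are placeholders.
full = {
    "lengths": [8, 4],
    "names": ["chr1", "chr2"],
    "sequences": ["seq-digest-1", "seq-digest-2"],
}
# A collection that lacks one of the arrays can still be digested.
without_names = {k: v for k, v in full.items() if k != "names"}

print(collection_digest(full))
print(collection_digest(without_names))
```

Because each array is digested on its own, the `sequences` digest is the same in both collections even though the collection-level digests differ.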

The retrieval works like this:

image

A simple server could allow only recursion=1, but we agreed that recursion=0 is very useful and should probably be a required part of the specification, while recursion=2 should probably be disabled. Given that the recursion=0 and recursion=1 layers are possible, this also enables retrieval of individual components, which is independently valuable:

image
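The retrieval levels can be sketched with an in-memory store. This is an illustration under assumptions, since the diagrams are not reproduced here: it reads recursion=0 as "return the array digests" and recursion=1 as "return the array contents", and it simply rejects recursion=2. The `SeqColStore` class, its method names, the hash, and the example data are all hypothetical.

```python
import base64
import hashlib
import json


def digest(obj) -> str:
    """Hypothetical digest: compact sorted JSON -> SHA-512 -> first 24 bytes -> base64url."""
    canonical = json.dumps(obj, separators=(",", ":"), sort_keys=True)
    raw = hashlib.sha512(canonical.encode("utf-8")).digest()[:24]
    return base64.urlsafe_b64encode(raw).decode("ascii")


class SeqColStore:
    """Toy content-addressed store: every value is saved under its own digest."""

    def __init__(self):
        self.objects = {}

    def _put(self, value) -> str:
        key = digest(value)
        self.objects[key] = value
        return key

    def add(self, collection: dict) -> str:
        # Store each array under its digest, then store the digest mapping.
        level1 = {name: self._put(values) for name, values in collection.items()}
        return self._put(level1)

    def get(self, key: str, recursion: int = 1):
        if recursion not in (0, 1):
            raise ValueError("this sketch only enables recursion=0 and recursion=1")
        value = self.objects[key]
        if recursion == 0 or not isinstance(value, dict):
            # recursion=0, or the key addresses a single component array.
            return value
        # recursion=1: replace each array digest with the array it addresses.
        return {name: self.objects[sub] for name, sub in value.items()}


store = SeqColStore()
collection = {
    "lengths": [8, 4],
    "names": ["chr1", "chr2"],
    "sequences": ["seq-digest-1", "seq-digest-2"],  # placeholder identifiers
}
top = store.add(collection)

print(store.get(top, recursion=0))  # mapping of array names to array digests
print(store.get(top, recursion=1))  # the arrays themselves
names_digest = store.get(top, recursion=0)["names"]
print(store.get(names_digest))      # a single component, fetched by its own digest
```

Storing each array under its own digest is what makes component retrieval fall out for free: the same lookup serves both whole collections and individual arrays.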

Regardless of what elements end up in a sequence collection, we're in a position to approve this as the digest algorithm. Feedback welcome!

@nsheff
Member Author

nsheff commented May 4, 2021

@jmarshall and @jb-adams would be interested to hear your thoughts.

@jb-adams
Member

jb-adams commented May 4, 2021

I see this as a way to incorporate the two paradigms for sequence identifiers that we've been struggling to navigate:

  1. vendor-neutral, checksum derived from only the sequence itself
  2. vendor-specific, allows other fields (names, lengths) to be incorporated into the checksum

The result is a vendor-specific "meta-checksum" that contains within it the vendor-neutral checksum component. I think this is good, and a good way to get adopters to implement it, but it does raise a few questions:

  • If the vendor-specific checksum is used primarily (passed to the service, referenced in papers, etc.), then does this lead to downstream difficulty in identifying equivalent collections across services, research papers, etc.? For example, if collection abcdef and collection 012345 had the same underlying sequences but different names, this fact would not be immediately apparent just by looking at the final checksum. This could be solved by the recursive lookup, but it's not easily apparent and requires API calls to investigate. This may not be a problem, but seems to be moving away from the transparency of the original refget spec.
  • Thoughts around what checksum algorithms we support, and how we map across different identifiers for the same collection. If I provide an MD5 identifier, do I only get MD5 identifiers for the components? Do we plan to support a mechanism by which I can get the ga4gh or trunc512 id of the components in the response? Again, it would be difficult to see when 2 collections are equivalent if the ids in 2 published papers use different algorithms. There might also be bloat and performance issues if every component needs to support 3 ids. The simplest solution would be to have one algorithm for everything, but implementers may not like this.

My questions are mainly based on assessment of equivalency, and making sure we don't have a false-negative problem (2 collections are the same at the sequence level but appear different) when researchers share results. But if this is not a concern, then everything looks good to me.
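To make the false-negative concern concrete, here is a small sketch of two collections with the same sequences but different names. The hash, the serialization, the array names, and the `shared_arrays` helper are assumptions for illustration only. The top-level digests differ, and the equivalence only becomes visible once the array-level digests are compared.

```python
import base64
import hashlib
import json


def digest(obj) -> str:
    """Hypothetical digest: compact sorted JSON -> SHA-512 -> first 24 bytes -> base64url."""
    canonical = json.dumps(obj, separators=(",", ":"), sort_keys=True)
    raw = hashlib.sha512(canonical.encode("utf-8")).digest()[:24]
    return base64.urlsafe_b64encode(raw).decode("ascii")


def array_digests(collection: dict) -> dict:
    """One digest per array (what a recursion=0 lookup is assumed to return)."""
    return {name: digest(values) for name, values in collection.items()}


def shared_arrays(a: dict, b: dict) -> list:
    """Names of the arrays whose digests are identical in both collections."""
    da, db = array_digests(a), array_digests(b)
    return sorted(name for name in da.keys() & db.keys() if da[name] == db[name])


# Made-up collections: same sequences and lengths, different naming schemes.
col_a = {
    "lengths": [8, 4],
    "names": ["chr1", "chr2"],
    "sequences": ["seq-digest-1", "seq-digest-2"],
}
col_b = dict(col_a, names=["1", "2"])

top_a = digest(array_digests(col_a))
top_b = digest(array_digests(col_b))

print(top_a == top_b)               # the final checksums do not match
print(shared_arrays(col_a, col_b))  # the sequence-level match shows up one layer down
```

In a deployed service the second step would take one lookup per collection, which is the extra API cost raised above.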

@sveinugu
Collaborator

I guess the following comment is relevant here, depending on whether the question of ordering is included in the definition of the "digest algorithm": #7 (comment)

@tcezard tcezard added this to the V1.0 milestone Sep 5, 2022
@nsheff nsheff changed the title from "Approval for digest algorithm for sequence collections" to "Approval for array-based data structure and multi-tiered digests for sequence collections" Jan 11, 2023
@nsheff
Member Author

nsheff commented Jan 11, 2023

This was approved in the ADR with PR #14

@nsheff nsheff closed this as completed Jan 11, 2023

4 participants