-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ADR for serialisation of sequence collection #34
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest to instead of implementing our own digest solution, we adopt the solution from the GA4GH Variation Representation Specification, or a downstream improvement on this, if any such adoptions exist.
I approve the general idea being pushed here and that this would be the most natural way to achieve the encoding as we discussed within the call. @sveinugu comment about using JSON encoding to build the digests in implementations is low cost at the top level but requires an additional encoding cost i.e. going from in-memory data to a JSON string then to the digest. Both schemes achieve the same goal i.e. the JSON encoding is effectively delimiting fields through the use of strings & commas rather than commas alone. @sveinugu if the |
docs/decision_record.md
Outdated
|
||
#### For converting from level 2 to level 1 | ||
|
||
Each array is converted into a canonical string representation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We decided not do convert to strings, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some comments on the diff of the file. I also think there are a few minor grammatical mistakes, but I leave that to those with English as their native tongue. Apart from that, Looks good to me!
docs/decision_record.md
Outdated
The serialisation of a sequence collection will use the following steps | ||
|
||
1. Apply RFC-8785 on each array of level 2 | ||
2. Digest each the canonical representation of each array |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo? Each x2
docs/decision_record.md
Outdated
b'[248956422,242193529,198295559]' | ||
``` | ||
|
||
would be converted |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would be converted to?
docs/decision_record.md
Outdated
|
||
### Known limitations | ||
|
||
The JSON canonical serialisation defined in RFC-8785 has a limited set of reference implementation. It is possible that its implementation might make sequence collection implementation more difficult in other languages. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we mentioned that seqcol only use a subset of the standard, specifically no floating point and only ascii characters for array names? Almost all of the complexity of the implementations are related to these two things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We decided that it might be worth a sentence, but should probably go in some type of "implementation details" or examples or something when the final spec is written.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a sentence explaining that we don't need the whole of RFC-8785.
``` | ||
|
||
#### 3. Creation of an object composed of the array names and the digested arrays | ||
An object is created with the array name as properties and the digest as value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to include the decision to not add prefixes here, or is this a separate ADR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I decided to write another ADR for that -- what do you think? #42
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think splitting into a separate ADR is correct. That seemed to address the omission of ga4gh:
but not the omission of DT.
where DT
is a condensed data type string. The data type though would be constant across all array digests so this might not be an issue and can be added once if the ID exits the system
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good but raising a side issue about the prefix which might be better on ADR #42
|
||
### Decision | ||
|
||
The serialisation of a sequence collection will use the following steps |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be useful to add a quick definition of 'serialisation' here since I think this word can have multiple meanings in different contexts, and here I think we mean a specific definition that may not be the most common or known definition. Maybe worth a bit of background on what we're trying to do here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a line for the definition
This PR now describe the serialisation that sequence collection should go through before the digest algorithm should be applied based on @sveinugu suggestion.
This should address #1 and #33