Add ADR for serialisation of sequence collection #34

tcezard · 2022-06-02T08:08:43Z

This PR now describes the serialisation that a sequence collection should go through before the digest algorithm is applied, based on @sveinugu's suggestion.
This should address #1 and #33

sveinugu

I suggest that, instead of implementing our own digest solution, we adopt the solution from the GA4GH Variation Representation Specification, or a downstream improvement on it, if any such adoptions exist.

docs/decision_record.md

andrewyatz · 2022-06-06T08:40:31Z

I approve of the general idea proposed here, and agree that it is the most natural way to achieve the encoding, as we discussed on the call. @sveinugu's suggestion to use JSON encoding to build the digests is low cost at the top level, but it adds an encoding step in implementations: in-memory data must first be turned into a JSON string and only then digested.

Both schemes achieve the same goal, i.e. the JSON encoding effectively delimits fields through the use of quoted strings and commas rather than commas alone.

@sveinugu if the comma (,) becomes reserved, as this ADR would make it, could we consider an encoding requirement?
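To make the delimiting point concrete, here is a minimal sketch comparing the two schemes. The function names and the sample sequence names are hypothetical, and the comma-only scheme is an assumed reading of the earlier proposal, not code from this PR.

```python
import json


def comma_join(values):
    # Hypothetical comma-only scheme: fields are separated by commas alone,
    # so a comma inside a value cannot be told apart from a delimiter.
    return ",".join(values)


def json_encode(values):
    # Compact JSON: every string is quoted, so a comma inside a value
    # stays part of that value.
    return json.dumps(values, separators=(",", ":"))


two_names = ["chr1", "chr,2"]
three_names = ["chr1", "chr", "2"]

# The comma-only scheme collapses two different arrays into one string.
print(comma_join(two_names) == comma_join(three_names))
# The JSON encoding keeps them distinct.
print(json_encode(two_names) == json_encode(three_names))
```

With a comma-only scheme the comma has to be reserved or escaped; with JSON the quoting already handles it, at the price of the extra encoding step mentioned above.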

nsheff · 2022-09-21T13:45:18Z

docs/decision_record.md

+
+#### For converting from level 2 to level 1
+
+Each array is converted into a canonical string representation.


We decided not to convert to strings, right?

sveinugu

Added some comments on the diff of the file. I also think there are a few minor grammatical mistakes, but I leave those to people with English as their native tongue. Apart from that, looks good to me!

sveinugu · 2022-11-30T23:22:14Z

docs/decision_record.md

+The serialisation of a sequence collection will use the following steps
+
+ 1. Apply RFC-8785 on each array of level 2
+ 2. Digest each the canonical representation of each array


Typo? Each x2

sveinugu · 2022-11-30T23:24:41Z

docs/decision_record.md

+b'[248956422,242193529,198295559]'
+```
+
+would be converted


would be converted to?
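For reference, the canonical form quoted in the diff and the digest step that follows it can be sketched as below. This assumes the digest is the GA4GH `sha512t24u` scheme (SHA-512 truncated to 24 bytes, then base64url-encoded) from the Variation Representation Specification suggested earlier in this thread; the excerpt itself does not name the algorithm. The helper names are hypothetical.

```python
import base64
import hashlib
import json


def canonical_array(values):
    # For an array of integers, compact JSON (no whitespace) coincides with
    # the RFC-8785 canonical form. This is not a general RFC-8785 serialiser.
    return json.dumps(values, separators=(",", ":")).encode("utf-8")


def sha512t24u(data):
    # Assumed digest: SHA-512, truncated to 24 bytes, base64url-encoded.
    digest = hashlib.sha512(data).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


lengths = [248956422, 242193529, 198295559]
serialised = canonical_array(lengths)

print(serialised)              # the byte string quoted in the diff
print(sha512t24u(serialised))  # a 32-character base64url digest
```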

sveinugu · 2022-11-30T23:29:26Z

docs/decision_record.md

+
+### Known limitations
+
+The JSON canonical serialisation defined in RFC-8785 has a limited set of reference implementation. It is possible that its implementation might make sequence collection implementation more difficult in other languages.  


Should we mention that seqcol only uses a subset of the standard, specifically no floating point and only ASCII characters for array names? Almost all of the complexity of the implementations is related to these two things.

We decided that it might be worth a sentence, but should probably go in some type of "implementation details" or examples or something when the final spec is written.

I added a sentence explaining that we don't need the whole of RFC-8785.
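As an illustration of why the subset is easier to implement, the sketch below canonicalises only the restricted inputs discussed here: integers, strings, arrays, and objects with ASCII keys. It is an assumption-laden sketch, not a full RFC-8785 implementation; the function names are hypothetical, and the integer range check (values a double can represent exactly) is my own added safeguard rather than something stated in the thread.

```python
import json

MAX_SAFE_INT = 2**53 - 1  # largest integer a double represents exactly


def _check(value):
    # Reject everything outside the assumed seqcol subset.
    if isinstance(value, bool) or value is None or isinstance(value, float):
        raise TypeError(f"unsupported value in subset: {value!r}")
    if isinstance(value, int):
        if abs(value) > MAX_SAFE_INT:
            raise ValueError(f"integer out of safe range: {value}")
    elif isinstance(value, str):
        pass  # string values are emitted as UTF-8 by json.dumps below
    elif isinstance(value, list):
        for item in value:
            _check(item)
    elif isinstance(value, dict):
        for key, item in value.items():
            if not isinstance(key, str) or not key.isascii():
                raise ValueError(f"array names must be ASCII strings: {key!r}")
            _check(item)
    else:
        raise TypeError(f"unsupported type: {type(value).__name__}")


def canonicalise_subset(value):
    _check(value)
    # With ASCII keys, sorting by code point matches the RFC-8785 key order,
    # and without floats no special number formatting is needed.
    text = json.dumps(value, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False)
    return text.encode("utf-8")


print(canonicalise_subset({"names": ["chr1"], "lengths": [10]}))
```

Rejecting unsupported input loudly, rather than serialising it in a non-canonical way, keeps digests comparable across implementations.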

nsheff · 2023-01-11T21:06:45Z

docs/decision_record.md

+```
+
+#### 3. Creation of an object composed of the array names and the digested arrays
+An object is created with the array name as properties and the digest as value.


Do you want to include the decision to not add prefixes here, or is this a separate ADR?

I decided to write another ADR for that -- what do you think? #42

I think splitting into a separate ADR is correct. That seemed to address the omission of ga4gh: but not the omission of DT., where DT is a condensed data type string. The data type, though, would be constant across all array digests, so this might not be an issue, and it can be added once if the ID exits the system.
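The step quoted in this thread (an object with array names as properties and digests as values, with no prefix on the digests) might look like the sketch below. It assumes the `sha512t24u` digest and assumes that the resulting object is itself canonicalised and digested to obtain the top-level identifier; neither detail is spelled out in this excerpt. The array names `lengths` and `names`, the sample sequence names, and all function names are hypothetical.

```python
import base64
import hashlib
import json


def canonical(value):
    # Compact JSON with sorted keys; adequate for integers and ASCII
    # strings only, not a general RFC-8785 serialiser.
    return json.dumps(value, sort_keys=True,
                      separators=(",", ":")).encode("utf-8")


def sha512t24u(data):
    # Assumed digest: SHA-512, truncated to 24 bytes, base64url-encoded.
    return base64.urlsafe_b64encode(
        hashlib.sha512(data).digest()[:24]).decode("ascii")


def to_level1(level2):
    # One bare digest per array: no "ga4gh:" and no data type prefix.
    return {name: sha512t24u(canonical(array))
            for name, array in level2.items()}


def to_level0(level2):
    # Assumption: the level 1 object is canonicalised and digested again.
    return sha512t24u(canonical(to_level1(level2)))


collection = {
    "lengths": [248956422, 242193529, 198295559],
    "names": ["chr1", "chr2", "chr3"],
}

print(to_level1(collection))
print(to_level0(collection))
```

Because the keys are sorted during canonicalisation, the order in which the arrays are supplied does not change the top-level digest.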

andrewyatz

Looks good, but I am raising a side issue about the prefix, which might be better discussed on ADR #42.

nsheff · 2023-01-25T13:42:36Z

docs/decision_record.md

+
+### Decision
+
+The serialisation of a sequence collection will use the following steps


I think it would be useful to add a quick definition of 'serialisation' here since I think this word can have multiple meanings in different contexts, and here I think we mean a specific definition that may not be the most common or known definition. Maybe worth a bit of background on what we're trying to do here.

Added a line for the definition

tcezard requested review from andrewyatz, sveinugu and nsheff June 2, 2022 08:08

sveinugu requested changes Jun 2, 2022

tcezard added 2 commits September 5, 2022 17:45

Describe the serialisation method based on RFC-8785

f8b0f06

Fix date

578c9d4

tcezard force-pushed the delimiters branch from b734852 to 578c9d4 Compare September 5, 2022 16:46

tcezard requested a review from sveinugu September 5, 2022 16:46

tcezard changed the title ~~Add ADR for delimiter and digest structure~~ Add ADR for serialisation of sequence collection Sep 5, 2022

tcezard added 2 commits September 7, 2022 16:05

Remove the initial string conversion

0df0544

Remove the initial string conversion

7fcb59b

nsheff reviewed Sep 21, 2022

tcezard added 5 commits October 5, 2022 14:02

Detail each step of the serialisation

27bab77

Change quotes

ad04ae4

Change quotes

f06ccef

Add ga4gh prefixes for level 1 identifiers

fa8fea2

Revert to not using the prefix

22a90f0

sveinugu approved these changes Nov 30, 2022

nsheff mentioned this pull request Jan 11, 2023

Identifier construction: To prefix or not to prefix #37

Open

Fix typo raise in review

f046caf

nsheff reviewed Jan 11, 2023

andrewyatz reviewed Jan 12, 2023

nsheff reviewed Jan 25, 2023

nsheff and others added 2 commits January 25, 2023 09:15

Merge branch 'master' into delimiters

f2d9316

Define serialisation and add limitation on what RFC-8785 is used for

ca4f9e7

tcezard merged commit 52971bd into ga4gh:master Jan 25, 2023

nsheff mentioned this pull request Feb 28, 2023

RFC-8785 and refget compatibility #43

Closed
