Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ADR for serialisation of sequence collection #34

Merged
merged 12 commits into from
Jan 25, 2023

Conversation

tcezard
Copy link
Collaborator

@tcezard tcezard commented Jun 2, 2022

This PR now describe the serialisation that sequence collection should go through before the digest algorithm should be applied based on @sveinugu suggestion.
This should address #1 and #33

Copy link
Collaborator

@sveinugu sveinugu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to instead of implementing our own digest solution, we adopt the solution from the GA4GH Variation Representation Specification, or a downstream improvement on this, if any such adoptions exist.

docs/decision_record.md Outdated Show resolved Hide resolved
@andrewyatz
Copy link
Collaborator

I approve the general idea being pushed here and that this would be the most natural way to achieve the encoding as we discussed within the call. @sveinugu comment about using JSON encoding to build the digests in implementations is low cost at the top level but requires an additional encoding cost i.e. going from in-memory data to a JSON string then to the digest.

Both schemes achieve the same goal i.e. the JSON encoding is effectively delimiting fields through the use of strings & commas rather than commas alone.

@sveinugu if the , becomes reserved as this ADR would make it could we consider an encoding requirement?

@tcezard tcezard changed the title Add ADR for delimiter and digest structure Add ADR for serialisation of sequence collection Sep 5, 2022

#### For converting from level 2 to level 1

Each array is converted into a canonical string representation.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We decided not do convert to strings, right?

Copy link
Collaborator

@sveinugu sveinugu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some comments on the diff of the file. I also think there are a few minor grammatical mistakes, but I leave that to those with English as their native tongue. Apart from that, Looks good to me!

The serialisation of a sequence collection will use the following steps

1. Apply RFC-8785 on each array of level 2
2. Digest each the canonical representation of each array
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo? Each x2

b'[248956422,242193529,198295559]'
```

would be converted
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be converted to?


### Known limitations

The JSON canonical serialisation defined in RFC-8785 has a limited set of reference implementation. It is possible that its implementation might make sequence collection implementation more difficult in other languages.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we mentioned that seqcol only use a subset of the standard, specifically no floating point and only ascii characters for array names? Almost all of the complexity of the implementations are related to these two things.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We decided that it might be worth a sentence, but should probably go in some type of "implementation details" or examples or something when the final spec is written.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a sentence explaining that we don't need the whole of RFC-8785.

```

#### 3. Creation of an object composed of the array names and the digested arrays
An object is created with the array name as properties and the digest as value.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to include the decision to not add prefixes here, or is this a separate ADR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to write another ADR for that -- what do you think? #42

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think splitting into a separate ADR is correct. That seemed to address the omission of ga4gh: but not the omission of DT. where DT is a condensed data type string. The data type though would be constant across all array digests so this might not be an issue and can be added once if the ID exits the system

Copy link
Collaborator

@andrewyatz andrewyatz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good but raising a side issue about the prefix which might be better on ADR #42


### Decision

The serialisation of a sequence collection will use the following steps
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be useful to add a quick definition of 'serialisation' here since I think this word can have multiple meanings in different contexts, and here I think we mean a specific definition that may not be the most common or known definition. Maybe worth a bit of background on what we're trying to do here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a line for the definition

@tcezard tcezard merged commit 52971bd into ga4gh:master Jan 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants