Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discussion on undigested attributes and sorted-name-length-pairs #40

Open
nsheff opened this issue Jan 11, 2023 · 20 comments
Open

Discussion on undigested attributes and sorted-name-length-pairs #40

nsheff opened this issue Jan 11, 2023 · 20 comments

Comments

@nsheff
Copy link
Member

nsheff commented Jan 11, 2023

Related discussions:

History

For years we've debated the question of whether sequence collections would be ordered or unordered. In #17 we determined that the digests will reflect order. However, it is still valuable to come up with a way to identify if two sequences collections are the same in content but in different order. While the comparison function can do this, it is not as simple as comparing two identical identifiers. After lots of debate both in-person and on GitHub (#5), we never came up with a satisfying way to do this. Here is a proposal that can solve this, and other problems, which has arisen in discussion with @sveinugu. This idea has a few components: 1. Undigested arrays; 2. Relaxing the 1-to-1 constraint imposed on default arrays; 3. A specific recommended undigested array named e.g. "names-lengths". In detail:

1. Undigested attributes

Undigested attributes are attributes of a sequence collection that are not considered in computing the top-level digest. We suggest formalizing this as a part of the specification. In the schema that describes the data model for the sequence collections, the attributes will be annotated as either "digested" or "undigested". Only attributes marked as 'digested' will be used to compute the identifier. Other attributes can be stored/returned or otherwise used just like the digested attributes, the only difference is that they do not contribute to the construction of the identifier. This is easy to implement; for returning values, you return the complete level 1 structure, but before computing a digest, you first filter down to the digested schema attributes.

2. 1-to-1 constraint

All of our original arrays have the constraint that they are 1-to-1 with the other arrays, in order. In other words, the 3rd element of the names array MUST reference the 3rd element of the sequences array, etc.. They must have the same number of elements and match order.

I propose this constraint be relaxed for custom arrays/attributes. In fact, this, too, could be specified in the JSON-schema, as order-length-index: true (or something). In other words, arrays that do align 1-to-1 with the primary arrays would be flagged, but not all attributes would have to. This would be useful in case users want to store some information about a sequence collection that is not an ordered array with 1 element matching each sequence, or for situations that need to re-order.

3. The names-lengths array/attribute

Next, we'll use both these features for a new proposal. An example of a useful undigested attributed that lacks the 1-to-1 constraint is the names-lengths array. This is made like this:

  1. Lump together names and lengths from the primary arrays into an object, like {"length":123,"name":"chr1"}.
  2. Canonicalize according to the seqcol spec.
  3. Digest each name-length pair individually.
  4. Sort the digests lexographically.
  5. Add as an undigested, non-1-to-1 array to the sequence collection.

If this array is digested it provides an identifier for an order-invariant coordinate system for this sequence collection. It doesn't actually affect the identity of the collection, since it's computed from the collection, so there's no point considering it in the level 0 digest computation -- that's why it should be considered undigested. It also does not depend on the actual sequence content, so it could be consistent across two collections with different sequence content, and therefore different identifiers. It lives under the names-lengths (or something) attribute on the level 1 sequence collection object, but it's not digested.

The term "undigested"

To clarify the digesting: in this case, I am actually proposing to digest this array itself in the same way that all the arrays get digested; but this array would not get included in level 1 object that digests to form the level 0 digest. So maybe "undigested" is not the right term. Ideas?

Add to spec?

There are lots of other non-digested arrays that could useful. This particular one is pretty universal and useful, so it seems like it may be worth adding formally to specification as a RECOMMENDED, but not required, undigested array.

@sveinugu
Copy link
Collaborator

sveinugu commented Jan 25, 2023

Thanks, @nsheff for a clear writeup and concrete suggestions for how to incorporate our ideas into the specification in a simple and non-intrusive way! Having spent so much time debating this issue, I think it would be awesome if we were able to add the "names-lengths" recommendation to the v1.0 specification!

I believe I am all for the suggestion as it stands logically/algorithmically. I will think through the naming of the concepts/array and perhaps have some other suggestions for those, but that is in any case a minor issue.

Some comments on the usefulness of this (this is a sort of summary of pages with comments I made earlier, but hopefully this time my comments are briefer and easier to follow):

At the top of our rolling minutes document there is a motivational sentence that I assume has been there from the beginning: "Let’s make the world slightly more interoperable and end the chr1 vs 1 debacle of bioinformatics!". I do believe the above suggestion brings us much closer to actually following through on this intention. Here's why:

  • As far as I am aware, the "chr1 vs 1 debacle" started in relation to genome browsers, specifically by the UCSC Genome Browser choosing the "chr1" variant and making this a de facto standard for genome browsers, BED files, etc., while the INSCD-related sequence repos adopted the number-based standard. I presume due to the interconnectedness of the Ensembl Genome Browser with the EBI-hosted sequence archive (ENA), the Ensembl browser also adopted the number standard, making this into a bit of a USA/Europe schism. (I assume there is much more to this story, and I would be fascinated to learn about this, but I believe this summary is at least correct enough to function as a backdrop for my arguments.) It is still a major issue, as evident in The UCSC Genome Browser database: 2023 update, where one of the highlights is the newly developed chromAlias system which also aims to solve the problem (but not as thoroughly as we have the possibility to do!).

  • Since the world of genome browsers / genome annotations / tracks is central to the schism, a solution that also caters to some technicalities of this domain is, I believe, of importance. Let's focus on the role of a sequence collection identifier to annotate a file of genome coordinates, say a BED file. This is for instance done in Galaxy, where each dataset can be associated with a reference genome identifier. So the important part here is that such an identifier is traditionally used in two distinct contexts: 1) to preserve the information about which reference genome was used to generate the file, and 2) to filter files according to their compatibility, i.e. whether they can be visualised or analysed together, for instance in a genome browser.

  • Traditionally, imprecise identifiers like "hg19" have been used for both purposes simultaneously, and it is precisely the impreciseness of the identifiers that have made this possible. For instance, all patches of "hg19" share the same coordinate system. For the sake of reproducibility, one would want to retain the information about the exact patch of the reference genome used for mapping, say "hg19.p13". On the other hand, for visualisation in a genome browser or other analysis, one would ignore the patch part and lump all files from all "hg19" variants together. And even though mapping from "hg19.p13" to "hg19" is a triviality, people often in practice end up keeping only the "hg19" identifier together with the file and thereby lose the connection to the patch info.

  • With the core part of our sequence collection standard we will have a precise way of identifying the exact sequence collection used to generate e.g. a BED file, including the ordering of the sequences (which is crucial for some tools as differently ordered sequence collections might generate different results). Awesome as it is, it still only solves the first concern about preserving the history of the file. We do not provide a way yet to directly solve the second concern, which is to precisely identify the "common denominator" of the coordinate systems used for different data files so that they can be easily analysed or visualised together. Lastly, we need a way to map from the first type of identifier to the second type.

  • Having the option of excluding the sequences themselves from a sequence collection gives us the possibility to precisely identify a coordinate system. However such a coordinate system will still constitute an ordered collection. One advantage of an ordered coordinate system identifier is that such an identifier is mappable one-to-one to the ordered "chromlen" files needed by many tools, for instance BEDtools. Moreover, the output of a call to a seqcol-compliant server with an ordered coordinate system identifier will exactly produce a "chromlen" file in JSON format. So seqcol might potentially replace the use of such files altogether.

  • However, in a genome browser, as well as in many coordinate-system-based analysis tools, the order of the sequences does not matter. Even many tools that make use of "chromlen" files will internally ignore the order of the sequences, or perhaps only use it for presentation purposes. A precise identifier to describe a particular instance of a genome browser (for a particular sequence collection) should thus really be an "order-invariant" identifier/digest, mapping one-to-one with an unordered coordinate system. If one correctly annotates a bunch of BED files with such "order-invariant" coordinate system identifiers, one can use these identifiers directly to divide the files into different subsets, where the files of each subset are guaranteed to be visualisable together in a particular instance of a genome browser. The identifier of a group of BED files as well as the identifier of the genome browser instance will then be one and the same, solving the second main usage scenario for sequence collection identifiers in this domain (or third, if you also include the intermediate "chromlen" scenario).

  • The last issue to be solved is then the issue of mapping from the first identifier variant, representing ordered sequence collections that include the sequences themselves, to the latter variant, representing order-invariant coordinate systems that is used as configurations for grouping files prior to joint analysis/visualisation. It is relatively trivial to agree on an order-invariant identifier standard as an extension of the core seqcol specification, the only thing one needs to add is a canonical ordering scheme. However, one would need a process to automatically add such order-invariant coordinate systems every time a standard sequence collection is deposited, regardless of whether this is set up within the major repos or as a third party service. More importantly, there will be no direct mapping between the identifiers within the context of the standard itself.

This leads me to what I believe is the main advantage of the solution proposed above:

If implemented in practice by the major adopters of the seqcol standard, the mapping between the two identifier types will be directly available within all sequence collection records, in the relationship between the 0-level digest and the 1-level digest of the names-lengths array!"

The alternative will be for third parties to extract seqcol records, calculate the corresponding order-invariant coordinate system digest/identifiers and host the mapping independently, without the foundation of a standard to lean on. In practice, I don't believe this path will end up providing a realistic and universal solution to the issues.

So to conclude where I started:

How can this then help solve the chr1/1 schism?

A very nice property of the names-lengths digest in particular is that for all sequence collection variants with differing sequence contents (e.g. different patches) and different ordering of sequences, if only they share the same set of names and lengths, they will end up with the same order-invariant names-lenghts digest. There will in this case be only one single digest representing the order-invariant coordinate system of all the "chr1"-type variants, while another single digest will represent the order-invariant coordinate system of all the number-based variants. Consequently, the mapping between these two digest/identifier-types will be simple. So one can then easily map from any of the "chr1"-type sequence collections to any of the compatible number-based collections by merely extracting the level 1 contents, as well as consult a simple map that relates the relatively limited set of order-invariant coordinate system identifiers across the schism. These identifier-to-identifier mappings can then be easily represented by third parties in "semantic web"-style knowledge bases, which again is very suitable as input for AI-based methodologies. All this depends on the adoption by the major sequence collection deposition repositories.

(The only caveat with the above that I can see now is the issue of subsets of sequences. So coordinate systems are in practice often compatible if one is the subset of the other, or if the only differences appear in the "non-priority" sequences, such as the alternative sequences that are often added in reference genome patches. A variant of the names-lengths array that also filters the sequences according to a priority array could then possibly be the best solution in the context of genome browsers. That would represent an example of another useful "undigested" array type that will be possible as a consequence of the two conceptual additions suggested here. However, adding a priority-related array or other similar variants to the specification will very easily open up for the kind of drawn-out discussions that will further delay the delivery of v1.0 and which we therefore should avoid. Which is why (I assume) that @nsheff limited this suggestion to include only one extra recommended, undigested array: the names-lenghts one.)

My conclusion:

At the very least, the implementation of the names-lengths array is pretty straightforward and will, in the context of the specification text itself, function as an illustrative example of the usefulness of the suggested additional array properties, should we decide to add them. More importantly, I believe the names-lengths array in itself will provide major benefits to the community if adopted, with the main contribution being the possibilities of direct mapping between identifiers across the schism (as I have argued for above). Including that array as a recommendation will furthermore sketch out a direction for future development of the specification, while at the same time open up the possibility for third parties to straight away implement custom extensions in the same vein, within the bounds of the v1.0 specification. Lastly, we will also get some solid results out of the time we have spent discussing the order issue!

So that's my (more than) two cents!

(Note: I always end up writing too long texts, which is why I am very happy that @nsheff wrote the actual suggestion above! )

@nsheff
Copy link
Member Author

nsheff commented Feb 8, 2023

Here are some terminology ideas I had, for what terms the schema would use to describe the attributes:

  • required: indicates attributes that must be included
  • inherent: indicates attributes that do contribute to the identity of the collection (previously called digested)
  • collated: indicates array attributes that are 1-to-1 and in order with collection items.

e.g.

type: object
properties:
  lengths:
    type: array
    description: "Number of elements, such as nucleotides or amino acids, in each sequence."
    items:
      type: integer
  names:
    type: array
    description: "Human-readable identifiers of each sequence, commonly called chromosome names."
    items:
      type: string
...
required:
  - lengths
inherent:
  - lengths
  - names
  - sequences
collated:
  - lengths
  - names
  - sequences
  - topologies

@sveinugu
Copy link
Collaborator

sveinugu commented Feb 8, 2023

Here are some terminology ideas I had, for what terms the schema would use to describe the attributes:

I like this very much. Very precise terms and natural way to specify in the JSON schema.

For what it's worth as someone with English as the second language, I did not really before now know the precise meaning of the word "collate", I thought it had a more general meaning, like "collect", "organize" or similar. But that might just be a random hole in my knowledge. So if this is a word that is assumed to be understandable for the typical reader/user, I am all for it. Seems very precise.

@sveinugu
Copy link
Collaborator

sveinugu commented Feb 8, 2023

I think "Inherent" is spot on!

@nsheff
Copy link
Member Author

nsheff commented Feb 8, 2023

"collate" is typically used when you're printing multiple copies of a multi-page document; non-collated would be like, page 111222333, collated would be 123123123 -- So the collated approach is ready for binding. So it essentially means, "each unit in its proper order".

I spent some time searching for a word to mean the 1-to-1 idea and "collated" just seemed like a match that captures that whole concept in a single word.

@nsheff
Copy link
Member Author

nsheff commented Mar 2, 2023

I have a new implementation now that handles inherent attributes and also provides names-lengths arrays. I'm working on a few demos at: https://seqcolapi.databio.org/

@nsheff
Copy link
Member Author

nsheff commented Mar 7, 2023

Here's a function that computes the names-lengths array as described above: https://github.com/refgenie/seqcol/blob/46675b669ae07db9da4fc3d113fefa2c1667b1fb/seqcol/seqcol.py#L300-L310

@tcezard
Copy link
Collaborator

tcezard commented Mar 22, 2023

I was wondering about the relationship between the different type of properties:
I was assuming that if a properties is inherent then it has to be collated from the fact that all non-collated properties I can think of are metadata that do not define the collections.
That might not be something the specification wants to enforce though.

Also I was thinking that we should specify how the collated properties should be handle in the comparison function i.e. they should be part of the comparison process similar to what @nsheff has implemented here but what about the non-collated attribute

@sveinugu
Copy link
Collaborator

sveinugu commented Jun 15, 2023

On the question whether the non-collated array sorted_names_lengths (or what we end up calling it) should be required, I think we should consider issue #36, which I don't think we have discussed much. Basically, if one seqcol is a reordering of the other, the current comparison endpoint does not discern whether array values go out if sync (1-to-1) with respective values in other collated arrays (e.g. if a sequence ends up with the wrong name). Since the lengths are derived from the sequences, a sorted_names_lengths array will in most cases fix the issue for the core arrays, as the digests per names-lengths pair will change if the core arrays get out of sync (assuming the sequence collection is still technically valid by lengths and sequences being in sync). The only exception is in cases where different sequences in the same seqcol have the same length.

One possible full solution across all arrays would be to define a new collated and non-inherent array which consists of a digest of the full row-wise representation of each element across all other collated arrays (inherent or not). The contents of this array (pre-digest) would, I believe, look very similar to the row-wise seqcol idea we had in the very beginning. However, adding such an array is not on the table now, and even if it gets on the table later, that is something that I believe should happen based on other needs. (Note that such an array is different from an array that provides row-wise identifiers, which I have argued for elsewhere, as an identifier array would be limited to only include values from other arrays that are both collated and inherent!) I just wanted to mention this option to complete the picture.

Given that we want to fix issue #36 (and I do think we should), I think there are now two main options:

a) keep the comparison function as it is but make the sorted-names-lengths required, at least if the comparison function is implemented. This would cover most cases (but only for the core arrays) and it requires very little change. Or

b) add another output in the comparison function that provides a check on whether the array values are in sync of not. This separates fixing issue #36 from the issue of whether sorted_names_lengths should be required. Separating these two
issues is definitely cleaner, but will on the other hand require extending the comparison endpoint. Also not sure how easy it is to implement such an "in-sync" check in practice.

@sveinugu
Copy link
Collaborator

While writing the above comment (#40 (comment)), I think I have myself changed opinion from supporting (a) to (b), as issue #36 is really a technical validity issue for the comparison function and providing a partial fix by adding data content just makes everything more confusing while not fully solving the issue.

So if I leave this issue aside, I see that making the sorted_names_lenghts array required could be an nuisance for very simple seqcol implementations (e.g. based on S3-buckets). I do, however, think we should clearly recommend it, and one way of pushing implementers to use it would be to "nickname" the array as the "genome browser compatibility"-array or similar, so that it is more clear what this is useful for.

@nsheff
Copy link
Member Author

nsheff commented Jul 12, 2023

Should it be sorted_name_length_pairs or sorted-name-length-pairs?

In the past I've been using underscores; but in the compare endpoint we decided to use hyphens for the return values.

@nsheff nsheff changed the title Discussion on undigested attributes Discussion on undigested attributes and sorted-name-length-pairs Jul 26, 2023
@tcezard
Copy link
Collaborator

tcezard commented Jul 26, 2023

My vote is for underscores.
We should also decide on how the sorted_name_length_pairs attribute is computed
I've seen at least 2 different descriptions of the attribute:
1.

{
  "sorted_name_length_pairs": [
    "chr1_103483637", 
    "chr2_23873"
  ]
}

or

{
  "sorted_name_length_pairs": [
    {"names": "chr1", "lengths": 103483637}, 
    {"names": "chr2", "lengths": 23873} 
  ]
}

Was there another one ?
Which one should we go for ?

@nsheff
Copy link
Member Author

nsheff commented Jul 26, 2023

That's interesting -- I only knew of one suggestion and it was neither of those ;) it's:

{"length":123,"name":"chr1"}.

that's also how I implemented it... https://github.com/refgenie/seqcol/blob/46675b669ae07db9da4fc3d113fefa2c1667b1fb/seqcol/seqcol.py#L300-L310

So, that's my vote. I don't like the underscore method, I like re-using the json scheme we've already been using.

@sveinugu
Copy link
Collaborator

I second @nsheff's suggestion (and implementation) above. I like that the "name" and "length" keys are included (in singular) which makes the non-digested contents (more-or-less) self-explanatory, which might prove useful in certain scenarios. The redundancy of repeating "name" and "length" for each sequence should not matter for performance/efficiency, as (in most cases) the end product to be stored in any case are the digests, which have a fixed number of characters (32? Cannot remember right now).

@sveinugu
Copy link
Collaborator

All for underscores. Cannot remember if we discussed this for the compare endpoint, but perhaps we could still harmonize this post-ADR? There is also, I suppose, a possibility to harmonize the style across GA4GH services (isn't there a GA4GH style document? If not, shouldn't there be?)

@sveinugu
Copy link
Collaborator

An illustrative use case to be added to the specification could be something like this:

  1. A specific instance of a genome browser (for a particular reference genome coordinate system) can be fully defined logically by the level 1 digest of the sorted-name-length-pairs array of the corresponding sequence collection (as the order of the sequences does not matter).

  2. To check whether a data file is compatible with a genome browser instance one can then compare this digest with the similar level 1 digest of the sequence collection used to generate the file (assumed to be attached to the file as metadata somehow).

There are only two possibilities for compatibility:

2A) If the digests are equal, then the data file is directly compatible.

2B) If not, one can check the compare endpoint to see whether the sorted-name-length-pairs array of the sequence collection attached to the data file is a direct subset of the same array in the sequence collection attached to the genome browser instance. If so, the data file is also compatible.

  1. For efficiency, if 2B is true one can add the corresponding sorted-name-length-pairs level 1 digest (for the data file) to a list of cached digests that are known to be compatible for a particular genome browser instance. In practice, this list will be short as only a handful of variants will appear in a real-life scenario (e.g. including or excluding chrM and/or alternative sequences, reference genome patches (possibly removing some unplaced sequences) and similar).

Thus, in a production setting, the full compatibility check can be reduced to a lookup into a short, pre-generated list of sorted-name-length-pairs level-1 digests known to be compatible with a specific genome browser instance. One would not even need to contact a seqcol server at all if all data files to be considered for compatibility come pre-annotated with such level-1 digests.

@tcezard
Copy link
Collaborator

tcezard commented Aug 21, 2023

I want to confirm another point brought by @sveinugu earlier: We store the digested version of the object not the object itself.
The algorithm for generating the array is:

  1. For each element:
    a. Construct the object for each element with singular property names{"length":123,"name":"chr1"}
    b. Normalise with RFC-8785
    c. Digest with sha512t24
  2. Sort the array lexicographically
  3. Store the resulting array

Is that correct ?

Also I for the seqcol schema the current description I use is:

"Objects composed of length and name of a sequence digested and ordered lexicographically"

But there might be a better one.

@tcezard
Copy link
Collaborator

tcezard commented Aug 24, 2023

The newly published specs now answers most of my questions above.

I have an additional question about sorted-name-length-pairs:
I'm wondering if there are use cases for storing the whole array of digested objects over only storing the level 1 digest as a single value.
I haven't fully worked it out but the only use case I can see it to check for subsets of name-length pairs.

@sveinugu
Copy link
Collaborator

sveinugu commented Aug 24, 2023

I have an additional question about sorted-name-length-pairs:
I'm wondering if there are use cases for storing the whole array of digested objects over only storing the level 1 digest as a single value.
I haven't fully worked it out but the only use case I can see it to check for subsets of name-length pairs.

I think this is a very useful use case, though!

This also relates to issue #36 which we have yet to discuss and which is bugging me a bit. Storing the array of digests is a partial solution for that issue, but only for the names and lenghts arrays.

Another similar use case is just to test for the existence of a particular name-length-pair in a sequence collection, e.g. which chrM coordinate system is included in this hg19 seqcol variant? An alternative would be to find "chrM" in the names array, note the index, and find the corresponding length in the lengths array. Even though this is not very complex I believe the simplicity of just grepping for a digest will prove to be very handy for e.g. scripts or one-liners.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants