-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Discussion on undigested attributes and sorted-name-length-pairs #40
Comments
Thanks, @nsheff for a clear writeup and concrete suggestions for how to incorporate our ideas into the specification in a simple and non-intrusive way! Having spent so much time debating this issue, I think it would be awesome if we were able to add the "names-lengths" recommendation to the v1.0 specification! I believe I am all for the suggestion as it stands logically/algorithmically. I will think through the naming of the concepts/array and perhaps have some other suggestions for those, but that is in any case a minor issue. Some comments on the usefulness of this (this is a sort of summary of pages with comments I made earlier, but hopefully this time my comments are briefer and easier to follow): At the top of our rolling minutes document there is a motivational sentence that I assume has been there from the beginning: "Let’s make the world slightly more interoperable and end the chr1 vs 1 debacle of bioinformatics!". I do believe the above suggestion brings us much closer to actually following through on this intention. Here's why:
This leads me to what I believe is the main advantage of the solution proposed above: If implemented in practice by the major adopters of the seqcol standard, the mapping between the two identifier types will be directly available within all sequence collection records, in the relationship between the 0-level digest and the 1-level digest of the The alternative will be for third parties to extract seqcol records, calculate the corresponding order-invariant coordinate system digest/identifiers and host the mapping independently, without the foundation of a standard to lean on. In practice, I don't believe this path will end up providing a realistic and universal solution to the issues. So to conclude where I started: How can this then help solve the chr1/1 schism? A very nice property of the (The only caveat with the above that I can see now is the issue of subsets of sequences. So coordinate systems are in practice often compatible if one is the subset of the other, or if the only differences appear in the "non-priority" sequences, such as the alternative sequences that are often added in reference genome patches. A variant of the My conclusion: At the very least, the implementation of the So that's my (more than) two cents! (Note: I always end up writing too long texts, which is why I am very happy that @nsheff wrote the actual suggestion above! ) |
Here are some terminology ideas I had, for what terms the schema would use to describe the attributes:
e.g.
|
I like this very much. Very precise terms and natural way to specify in the JSON schema. For what it's worth as someone with English as the second language, I did not really before now know the precise meaning of the word "collate", I thought it had a more general meaning, like "collect", "organize" or similar. But that might just be a random hole in my knowledge. So if this is a word that is assumed to be understandable for the typical reader/user, I am all for it. Seems very precise. |
I think "Inherent" is spot on! |
"collate" is typically used when you're printing multiple copies of a multi-page document; non-collated would be like, page 111222333, collated would be 123123123 -- So the collated approach is ready for binding. So it essentially means, "each unit in its proper order". I spent some time searching for a word to mean the 1-to-1 idea and "collated" just seemed like a match that captures that whole concept in a single word. |
I have a new implementation now that handles inherent attributes and also provides names-lengths arrays. I'm working on a few demos at: https://seqcolapi.databio.org/ |
Here's a function that computes the names-lengths array as described above: https://github.com/refgenie/seqcol/blob/46675b669ae07db9da4fc3d113fefa2c1667b1fb/seqcol/seqcol.py#L300-L310 |
And here's my code for handling |
I was wondering about the relationship between the different type of properties: Also I was thinking that we should specify how the |
On the question whether the non- One possible full solution across all arrays would be to define a new Given that we want to fix issue #36 (and I do think we should), I think there are now two main options: a) keep the comparison function as it is but make the b) add another output in the comparison function that provides a check on whether the array values are in sync of not. This separates fixing issue #36 from the issue of whether |
While writing the above comment (#40 (comment)), I think I have myself changed opinion from supporting (a) to (b), as issue #36 is really a technical validity issue for the comparison function and providing a partial fix by adding data content just makes everything more confusing while not fully solving the issue. So if I leave this issue aside, I see that making the |
Should it be In the past I've been using underscores; but in the compare endpoint we decided to use hyphens for the return values. |
My vote is for underscores. {
"sorted_name_length_pairs": [
"chr1_103483637",
"chr2_23873"
]
} or {
"sorted_name_length_pairs": [
{"names": "chr1", "lengths": 103483637},
{"names": "chr2", "lengths": 23873}
]
} Was there another one ? |
That's interesting -- I only knew of one suggestion and it was neither of those ;) it's:
that's also how I implemented it... https://github.com/refgenie/seqcol/blob/46675b669ae07db9da4fc3d113fefa2c1667b1fb/seqcol/seqcol.py#L300-L310 So, that's my vote. I don't like the underscore method, I like re-using the json scheme we've already been using. |
I second @nsheff's suggestion (and implementation) above. I like that the "name" and "length" keys are included (in singular) which makes the non-digested contents (more-or-less) self-explanatory, which might prove useful in certain scenarios. The redundancy of repeating "name" and "length" for each sequence should not matter for performance/efficiency, as (in most cases) the end product to be stored in any case are the digests, which have a fixed number of characters (32? Cannot remember right now). |
All for underscores. Cannot remember if we discussed this for the compare endpoint, but perhaps we could still harmonize this post-ADR? There is also, I suppose, a possibility to harmonize the style across GA4GH services (isn't there a GA4GH style document? If not, shouldn't there be?) |
An illustrative use case to be added to the specification could be something like this:
There are only two possibilities for compatibility: 2A) If the digests are equal, then the data file is directly compatible. 2B) If not, one can check the
Thus, in a production setting, the full compatibility check can be reduced to a lookup into a short, pre-generated list of |
I want to confirm another point brought by @sveinugu earlier: We store the digested version of the object not the object itself.
Is that correct ? Also I for the seqcol schema the current description I use is:
But there might be a better one. |
The newly published specs now answers most of my questions above. I have an additional question about |
I think this is a very useful use case, though! This also relates to issue #36 which we have yet to discuss and which is bugging me a bit. Storing the array of digests is a partial solution for that issue, but only for the Another similar use case is just to test for the existence of a particular name-length-pair in a sequence collection, e.g. which chrM coordinate system is included in this hg19 seqcol variant? An alternative would be to find "chrM" in the |
Related discussions:
History
For years we've debated the question of whether sequence collections would be ordered or unordered. In #17 we determined that the digests will reflect order. However, it is still valuable to come up with a way to identify if two sequences collections are the same in content but in different order. While the comparison function can do this, it is not as simple as comparing two identical identifiers. After lots of debate both in-person and on GitHub (#5), we never came up with a satisfying way to do this. Here is a proposal that can solve this, and other problems, which has arisen in discussion with @sveinugu. This idea has a few components: 1. Undigested arrays; 2. Relaxing the 1-to-1 constraint imposed on default arrays; 3. A specific recommended undigested array named e.g. "names-lengths". In detail:
1. Undigested attributes
Undigested attributes are attributes of a sequence collection that are not considered in computing the top-level digest. We suggest formalizing this as a part of the specification. In the schema that describes the data model for the sequence collections, the attributes will be annotated as either "digested" or "undigested". Only attributes marked as 'digested' will be used to compute the identifier. Other attributes can be stored/returned or otherwise used just like the digested attributes, the only difference is that they do not contribute to the construction of the identifier. This is easy to implement; for returning values, you return the complete level 1 structure, but before computing a digest, you first filter down to the digested schema attributes.
2. 1-to-1 constraint
All of our original arrays have the constraint that they are 1-to-1 with the other arrays, in order. In other words, the 3rd element of the names array MUST reference the 3rd element of the sequences array, etc.. They must have the same number of elements and match order.
I propose this constraint be relaxed for custom arrays/attributes. In fact, this, too, could be specified in the JSON-schema, as
order-length-index: true
(or something). In other words, arrays that do align 1-to-1 with the primary arrays would be flagged, but not all attributes would have to. This would be useful in case users want to store some information about a sequence collection that is not an ordered array with 1 element matching each sequence, or for situations that need to re-order.3. The
names-lengths
array/attributeNext, we'll use both these features for a new proposal. An example of a useful undigested attributed that lacks the 1-to-1 constraint is the
names-lengths
array. This is made like this:{"length":123,"name":"chr1"}
.If this array is digested it provides an identifier for an order-invariant coordinate system for this sequence collection. It doesn't actually affect the identity of the collection, since it's computed from the collection, so there's no point considering it in the level 0 digest computation -- that's why it should be considered undigested. It also does not depend on the actual sequence content, so it could be consistent across two collections with different sequence content, and therefore different identifiers. It lives under the
names-lengths
(or something) attribute on the level 1 sequence collection object, but it's not digested.The term "undigested"
To clarify the digesting: in this case, I am actually proposing to digest this array itself in the same way that all the arrays get digested; but this array would not get included in level 1 object that digests to form the level 0 digest. So maybe "undigested" is not the right term. Ideas?
Add to spec?
There are lots of other non-digested arrays that could useful. This particular one is pretty universal and useful, so it seems like it may be worth adding formally to specification as a RECOMMENDED, but not required, undigested array.
The text was updated successfully, but these errors were encountered: