What information is included within the string-to-digest? #8
One comment that came up in our discussion (thanks @jmarshall) is that perhaps we should relax the constraint requiring name, length, and topology. I originally had everything required, and then dropped 'sequence digest' from required because I had a use case where that would be useful, so I guess the others are still listed as required simply because that's how I started. But now that I think about it more, there are potentially useful reasons to make a "sequence collection" that has nothing but sequence names, or a set of sequences without names, or something like that; then you'd be able to use just the subset of seqcol functionality that only requires some aspects of the annotation. So I think that going forward, there's no reason we can't just make everything optional. |
During the meeting I was questioning the reasoning behind including the topology in the digest. The basic argument is that regardless of the topology stated in the collection, the representation of the sequence is still linear, and the topology is an attribute of the biological sequence that could be held in the metadata rather than the digest. @rishidev pointed to minutes from last year's (2019) meeting (https://docs.google.com/document/d/1h1KauiREYLhJXW-J-vMeb0Fu2CKasrwCkb55XkB4J2U/edit#) which discuss this issue. One point seems to be that adding the topology to the digest would enable "recording" that a sequence was analysed as a circular sequence, so that this would be recorded in the associated sequence collection. To me this seems like analysis metadata, though. @yfarjoun @andrewyatz Can you recollect the rationale behind adding topology to the digest? |
A side question is: when a field is missing (or empty) does it have to be empty in all the elements of the collection? |
The issue was (and still is) that the same sequence could be included in a sequence dictionary in either linear or circular topology, but the choice can affect how downstream tools will understand the data. Originally we thought of putting it into the sequence (refget) itself, but that was deemed inappropriate since it was "metadata", not information about the sequence itself.

Thus, in order to be able to distinguish a dictionary that has MT/linear from MT/circular, we have to include the topology somewhere... it could be in the sequence collection, or it could be in an additional layer that embeds and annotates a sequence collection... we could include an annotation wrapper that embellishes the collection and its elements with arbitrary tags. That would be the most flexible solution and it would not make the topology a special case.

Y.
|
I believe that every sequence should have all four elements we've indicated here; or, to be more specific, it will have a name, length, and checksum, and topology is assumed to be linear unless specified otherwise. |
One possible use case for sequence-less seqcolls: say you only have the chromosome and length information for a bunch of reference genomes, which would be the case, for instance, in a default installation of Galaxy. I think it would then make sense to create a database of these, making use of the seqcol API mostly for the comparison functionality. Generally, I believe, in line with @nsheff, that making elements optional would open up a range of functionality that would be useful for various "real-life" challenges, especially for issues related to track analysis, which both @nsheff and I have experience with (e.g. https://doi.org/10.1093/nar/gky474). |
As to the issue of topology (and other aspects discussed here), I think we really need to provide a clear definition of what a sequence collection is. I see at least two directions here:
Also, the second way of thinking could extend towards graph genomes, at least to a set of sequences describing a sample set of personal genomes from which a graph genome can be generated. As mentioned elsewhere in this project (I cannot remember where), GRCh38 really does fall into this category due to the alternative sequences. Under this view, a seqcol might also cover a set of coordinate systems without sequences. The obvious difficulty then is to decide where to draw the line of what is in scope or not. |
I personally believe it moves towards the second point of view: that topology is important. Also, the role this will play with graph genomes is an interesting one. This for me depends on what the unit of work of a graph genome will be. Do we want to refer to common builds/paths through a graph of haplotypes, or something deeper? I think those are open questions for the moment, and this standard looks to resolve the existing issues with reference genomes, which will remain linear for some time to come and continue to play a part in many clinical diagnostic pipelines. |
We've just had a lot more discussion on this point in the refget call. We have gone back and forth; in the end I think the consensus at the moment is leaning toward option 1, that is, not including topology as fundamental to a sequence collection, and then building the topology metadata on top of that as an extra layer of information. But we are still having some debate, and we see compelling arguments in both directions. Arguments for not including topology in the digest are:
Arguments for including topology in the digest are:
|
**Other metadata like topology?**
What other information is there that is of the same class as topology? One thing that came up is a sequence mask. A mask is not inherent in a sequence, but annotates a sequence. The difference here is that a mask annotates individual elements in the sequence and can be represented by regions, for example; the topology is an annotation of the entire sequence. But another thing that annotates a primary sequence is "primary" vs "secondary". We previously discussed annotating sequence collections with a way to flag which elements are core elements, like the assembled chromosomes in the human genome, and which are secondary, like unassigned contigs or other random sequences. This feels like the same type of information as topology. |
Here's a proposed solution
|
Latest solution to this gets my vote. |
I guess this comes down to what the goal of seqcoll is: are we trying to just describe the sequences as they are currently stored, or do we try to improve upon the current state? As a latecomer, I am not exactly sure of what the goal is, but personally I would think improving on the current state is a worthy goal. So a scenario (which is not very far-fetched) is that some analysis system that handles circular and linear genomes differently (which all really should) would be able to, from a bunch of digests, grab the correct sequences and preprocess them according to topology, without knowing what they represent in terms of biology. So I would say topology should have been a part of the sequence info from the get-go (e.g. encoded in FASTA), and that we now have a possibility to remedy that lack.
Probably not, but I think a more relevant scenario is the one I mention, as the topology information is crucial for managing the sequence correctly, even though many tools don't care about this. Without this, one needs to add more command-line parameters to the tool, and I think we already have plenty of those. I do however think that we could try to make such a parameter more abstract, in order to support later extensions, such as "sequence_type" or similar.
I disagree. I don't think it is a function of the particular analysis. It is not a function of the sequence if by sequence you mean the contents of a FASTA file. It is, however, a function of the mathematical model of a DNA (or whatever) sequence, which is really what you need for analysis.
The centrality of SAM to all things in bioinformatics makes this a heavy argument in my mind. |
I'm a bit uncertain what a sequence mask would represent here. I assume it is not the same as the Ns in the sequence (hard mask), but perhaps soft masks, like what was done with repeating regions (https://www.biostars.org/p/98042/)? So the repeating regions are clearly metadata annotation (a track in itself), while the hard mask is part of the sequence information (I think we all agree on including Ns in the sequences). So I would say that topology here is at the level of the hard masks, which we do include.

A side-note of relevance: one of the main issues creating lots of false positives in the analysis of track files is that most common formats (except GFF in theory, but not in practice) do not include mask-type information about which regions of the DNA the track is defined in. Most tracks are e.g. not defined in N-rich regions (e.g. centromeres) and other regions with a high number of short repeats. It is very different whether the lack of a region is due to technical reasons (e.g. that the current analysis window is within the centromere) or whether it represents a real lack of some genomic feature. So this is essential information about the track file that should always have been included (in addition to having an identifier of the reference genome, of course). It can be estimated from the reference genome, but only the creator of the track would know if there are other constraints that would have made it impossible for a region to appear in certain areas. See e.g. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2438-1. So I would argue that this is a similar type of scenario where essential information is lacking from the sequence files due to some historical judgements by someone. Topology is one of these types. Primary/secondary is another, as you discuss below:
So what about sequences of other types than DNA? I assume we are supporting those also, but would it not be useful to know which kind of sequence we are talking about? Instead of having an ever-increasing list of terms for this, one could e.g. denote the possible characters used in the sequence, e.g. acgnt (sorted order) for basic DNA sequences, but there are plenty of other options out there: https://en.wikipedia.org/wiki/Nucleic_acid_notation. This string would be very different for protein sequences (if that is to be supported). It would be extremely useful for a parser to be able to know the character set in advance, to check whether it can handle the sequence. Also, in case a specific set of characters is in theory supported, the lack of use of those in the sequence also carries information (similar to the track example above). EDIT: GFF only supports leaving out bits at the beginning and end, not in the middle of sequences, but I have anyway never seen this used. |
**Array-based approach**

**Introduction**
In discussion at the refget meeting, we came up with a few new insights on this. For example, following the proposal above, if we include topology outside of the core sequence collections as a standalone array... this leads to the idea of encoding the names, sequences, and lengths as arrays as well, instead of as properties of a sequence, bundled together. This has a few nice advantages, and some disadvantages as well. I think a useful way to conceptualize the question is: what is the schema for the data included? We considered a few possible schemas. I'll try to summarize; it's easiest to show with a simple example of a sequence collection with 3 sequences. Here, we use an array of objects, each with 3 properties: name, length, sequence. This is the way I've been thinking about it so far.
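A sketch of that shape (the original example blocks were not preserved, so all values here are illustrative, and the sequence digests are hypothetical):

```python
# One object per sequence, bundling its name, length, and sequence digest.
seqcol = [
    {"name": "chr1", "length": 248956422, "sequence": "a004bc1b0bf05fc668cab6bbfd93d3eb"},
    {"name": "chr2", "length": 242193529, "sequence": "0ccf3a67666ac53f99fcad19768f2dde"},
    {"name": "chrX", "length": 156040895, "sequence": "9f7c1d4a77b6ab3e4f25be6430be444e"},
]
```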
A new proposal is, we can just think of it as 1 object, with 3 properties: an array of names, array of lengths, and array of sequences.
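The same illustrative collection in that shape:

```python
# One object for the whole collection, with three parallel arrays.
seqcol = {
    "names": ["chr1", "chr2", "chrX"],
    "lengths": [248956422, 242193529, 156040895],
    "sequences": ["a004bc1b0bf05fc668cab6bbfd93d3eb",
                  "0ccf3a67666ac53f99fcad19768f2dde",
                  "9f7c1d4a77b6ab3e4f25be6430be444e"],
}
```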
**Side note: digesting each array separately**
This leads to the possibility of digesting these arrays separately. If we digest each one separately, it looks like:
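Something like this (the digest values are placeholders, not real digests):

```python
# Each array is collapsed to a single digest of its contents.
seqcol = {
    "names": "ce04be1226e56f48da55b6c130d45b94",
    "lengths": "4925cdbd780a71e332d13145141863c1",
    "sequences": "3b379221b4d6ea26da26cec571e5911c",
}
```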
Now, this is kind of nice because we get a sequence-only digest, and a name-only digest, etc. Maybe we would only digest the sequences, and not the names/lengths? Or maybe none of them.

**Adding in topology**
But regardless of whether we digest each array separately, another advantage of using separate arrays is that it seems to be nice for fitting in more information, like topology. So, for example:
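For instance (illustrative values; topology is just one more named array alongside the others):

```python
# Topology slots in as another parallel array.
seqcol = {
    "names": ["chrMT", "chr1", "chr2"],
    "lengths": [16569, 248956422, 242193529],
    "sequences": ["6681ac2f62509cfc220d78751b8dc524",
                  "a004bc1b0bf05fc668cab6bbfd93d3eb",
                  "0ccf3a67666ac53f99fcad19768f2dde"],
    "topologies": ["circular", "linear", "linear"],
}
```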
This feels clean; the seqcol is just a bunch of named arrays, and you use whichever ones you need. Each array must be the same length. If we digest each array individually and then digest them together, then, for example, one seqcol in "flat form" could include a topologies digest, and another could omit it:
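A sketch of both (the original examples were not preserved; digest values are placeholders):

```python
# "Flat form": each named array collapsed to its own digest.
seqcol_a = {
    "lengths": "4925cdbd780a71e332d13145141863c1",
    "names": "ce04be1226e56f48da55b6c130d45b94",
    "sequences": "3b379221b4d6ea26da26cec571e5911c",
    "topologies": "2f6734e599f21aa1af4b40a345a0efef",
}
seqcol_b = {
    "lengths": "4925cdbd780a71e332d13145141863c1",
    "names": "ce04be1226e56f48da55b6c130d45b94",
    "sequences": "3b379221b4d6ea26da26cec571e5911c",
}
```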
We can immediately tell they have the same sequences/lengths/names, but one has topologies and the other doesn't. And of course, we can recurse to get whatever layer of information we need.

**Issues with the array-based approach**
This seems appealing to me, but there are a few issues:
**Solution to issue 1**
As a solution to problem 1, we can introduce a new structural layer, say, 'rawseqcol':
In just 1 layer it looks like:
So, this adds the complexity of an additional layer in the object, but gives the ability to use that 'rawseqcol' digest independently, if you don't want an 'annotated seqcol' with all the other stuff. So now, you'd end up with a checksum for the raw sequence collection (without topologies), and this is what I already implemented in a demo server.

**Solution to issue 2**
But this doesn't solve the second problem above. How do we define the metadata that sits alongside topology, like 'priorities' and 'masks' and things like this? Where does it stop? Is it fixed? An alternative could be to allow arbitrary "tags" with data, so it would look something like this. At the top level is a split into `{'rawseqcol': '5655f27ed3dd7b8bb7c108c4ddc562979f82bc294877988c', …}`. The annotation object groups all that "other stuff" (outside the "Big 3"), and looks like:
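Perhaps something like this (a hypothetical reconstruction based on the description below; the 'name'/'value' keys and all values are illustrative):

```python
# Each annotation names itself and supplies per-sequence values,
# instead of hard-coding 'topologies' etc. into the schema.
annotations = [
    {"name": "topologies", "value": ["circular", "linear", "linear"]},
    {"name": "priorities", "value": ["primary", "primary", "secondary"]},
]
```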
It's now flexible, so 'topologies' is not hard-coded into the schema... but you can provide it by specifying the 'name'. So you have to name your object as a variable ("topology", "priority", etc.), and then provide its value, instead of hard-coding those possibilities in the schema. So, this is kind of nice because it can be used for any sort of future annotation. But -- it also explodes the possibilities, so now you can make a checksum for whatever kind of notations about your sequence you want, and we've lost a chance to standardize, and nobody's digests will match. It also makes it harder to define the checksum algorithm, since somehow that identifier will need to be included in the digest, instead of just inferred. So, this is starting to really complicate things, and I'm not sure it's worth it.

**Discussion**
After thinking through these options for several hours, I come back to the core question: what elements are part of the identity of the sequence collection? I think all the things we've discussed here are valuable, and @sveinugu raised some really good points about how important topology and masks and priorities are. But we need to separate out the components that belong in the digest from those that just annotate that digest. This seems like the critical question: what elements are part of the identity of the sequence collection? Because the rest of it, while valuable, seems to me like it should be included somewhere, but does it really need to be digested? That's what we're trying to decide here; not general importance, just what belongs in the unique, global identifier. I think we don't need to make "Sequence Collections" solve every problem. It can be a very simple thing, that will be easy to implement, easy to use, and very universal... and easy to extend with whatever other information eventually makes sense to extend with. Then we or others can build things that extend it. So, I think it's more important that it be extensible, simple, and re-usable than comprehensive.

**My opinion on including each item in the digest**

**Name**
On one hand, …

**Lengths**
There's an argument that lengths shouldn't be included in the digest, either, since they're redundant with sequences. Providing them is one thing, but we could technically provide them in the API without including them in the digest. Given a set of sequences, the lengths are fixed, so why include them? Well, one reason is that we want sequence collections to be able to represent "chromsizes" files, or coordinate systems, which consist of just names and lengths (no sequences). So, in this case, you'd have to include the lengths in the digest, because there are no sequences. To make a single digest algorithm able to handle both chromsizes files and actual sequences, it seems to make sense to just require lengths to always be included. The alternative would be to come up with a separate digesting algorithm for coordinate systems... In that case, maybe a coordinate system is names+lengths, and we keep sequences separate:
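A sketch of that split (the key names and values here are hypothetical):

```python
# Coordinate system (names + lengths) as one digestable unit,
# with the sequences array kept separate.
seqcol = {
    "coordinate_system": {
        "names": ["chr1", "chr2", "chrX"],
        "lengths": [248956422, 242193529, 156040895],
    },
    "sequences": ["a004bc1b0bf05fc668cab6bbfd93d3eb",
                  "0ccf3a67666ac53f99fcad19768f2dde",
                  "9f7c1d4a77b6ab3e4f25be6430be444e"],
}
```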
Now, we can digest the sequences separately. I suppose this is a possibility. Then, maybe our primary digests could be built from a tuple of …
**Topologies**
I'm still torn on this one. I think sequence names are practically widespread and useful enough to belong in the digest, but on topologies/priorities, I'm not really sold that they are practically useful or widespread enough to justify including in the core definition. These can be handled outside sequence collections. The advantage I see to excluding them is that in our core seqcol identifier set, there will be fewer canonical sequence collections, and so they will be more reusable, because we won't have the problem of "the identifier with the topologies included vs. the one without". Because, as great as it would be if all seqcols included the proper topologies, many people will not bother, and the data doesn't always even exist, leading to multiple sets of near-equivalent identifiers in widespread use (which is basically where we are now). But -- including them means they can be used, at least, and maybe that opens some new use cases, without too much of a cost. So, I don't know.

**Priorities**
If we include topologies, I don't see why we can't also include priorities.

**Masks**
If we include priorities, I want to include masks. But masks feel like a much bigger issue, and I'm afraid this causes the problem to balloon out of control. Therefore, I want to set the cutoff much smaller, and then deal with these other issues in the next project.

**Final thoughts**
I like the simplicity of the independent arrays. Therefore, I would probably choose representing a sequence collection like this:
I would probably not re-digest everything, so we create the string-to-digest by concatenating array items with one delimiter, and then arrays with another. The order would be lengths;names;sequencedigests, so for example (here shown with arbitrary delimiters):
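A minimal sketch of that construction (the delimiters, the digest function, and the refget sequence digests are illustrative assumptions; the actual algorithm was still under discussion):

```python
import hashlib

def digest(s: str) -> str:
    # Placeholder digest; not the final seqcol algorithm.
    return hashlib.sha512(s.encode("utf-8")).hexdigest()[:24]

lengths = [16569, 248956422]
names = ["chrMT", "chr1"]
sequence_digests = [  # hypothetical refget sequence digests
    "6681ac2f62509cfc220d78751b8dc524",
    "a004bc1b0bf05fc668cab6bbfd93d3eb",
]

# Items joined with ',', arrays joined with ';',
# in the order lengths;names;sequencedigests.
string_to_digest = ";".join([
    ",".join(str(n) for n in lengths),
    ",".join(names),
    ",".join(sequence_digests),
])
# e.g. "16569,248956422;chrMT,chr1;6681ac2f…,a004bc1b…"
seqcol_digest = digest(string_to_digest)
```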
Side note: one nice thing about digesting each one independently is that we only need 1 delimiter. Then, we start another, separate project surrounding extending that unit with additional information (topologies/priorities if not included, and masks, or anything else). |
To add a few items from the refget meeting today:

**Splitting into separate arrays**
Broad agreement that this is the right way to approach the problem, and it offers the most flexibility to ourselves and implementors to bring in new data concerning sequences (considering the list has grown to include topologies and priorities and masks, it's likely this will continue to grow).

**Attribute importance**
A sliding scale of attribute importance was proposed, roughly running like
**Digest algorithms**
A few solutions were suggested, but concerns were raised over the absence/presence of key fields and the flexibility/inflexibility of the digest, i.e. how do we ensure digests do not need to be recalculated all the time. Suggestions to avoid this included
|
Here's an approach. Build a string-to-digest like this: the first number is the number of items in the arrays, and all arrays must have that number of items. Before each delimited set of items is the identifier for the category (lengths, names, sequences), and the categories must be in alphabetical order. To make an identifier without sequence digests, you just leave out the sequences category; to add in topology, you add a topologies category.
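For example (the delimiters and values here are illustrative assumptions, since the original examples were not preserved):

```
# full string-to-digest (count first, categories in alphabetical order):
3;lengths:16569,248956422,242193529;names:chrMT,chr1,chr2;sequences:6681ac2f…,a004bc1b…,0ccf3a67…

# without sequence digests:
3;lengths:16569,248956422,242193529;names:chrMT,chr1,chr2

# with topology added:
3;lengths:16569,248956422,242193529;names:chrMT,chr1,chr2;topologies:circular,linear,linear
```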
The advantage of this approach is that it's flexible with respect to what we can put in the string-to-digest, so it's backwards compatible if we add something in the future, since leaving one of the possible arrays out has no effect on the digest. |
With a second delimiter, we could avoid the need to put the number of items in the array. This is a more flexible approach as it would allow arrays to contain different numbers of items. Maybe not necessary for this particular use case, but it makes it more universal.
or
|
A couple of questions or comments:
|
Some comments after the last meeting (which seems to have turned into more of a novel!):
**A. sequence**
This is obviously essential. Being able to bucket data/metadata according to the sequence content, regardless of the sequence names, is obviously useful, especially in the context of the UCSC vs. rest-of-the-world divide on chromosome names. We can probably come up with many examples of this, and you probably already have (I did not double-check with the use case document).

**B. names + lengths**
I have already explained the usefulness of this for track file analysis.

**C. sequence + names (+ length)**
This would be the most common way of bucketing e.g. reference genomes. Adding lengths (or other directly sequence-derived arrays) would not change the bucketing ability of the digest, other than defining a second set of buckets with the same properties (the only difference being whether lengths were added to the digest calculation or not). In such a context, being able to easily see the difference between the two types directly from the digest might be a plus, but one could also simply store this info as metadata, filter out one of the digest types as part of data import, etc. The same logic applies to the sequence + lengths option.

**D. names (labels?) only**
I feel that there should be some use cases for this, but I have a hard time coming up with scenarios where adding either the sequences or the lengths (or both) would not be a better choice. (Btw: should we rather refer to names as "labels"?)

**E. lengths only**
Don't think so.

**A-E. core rules**
From the above considerations of the 3 most "core" arrays, we could define the following two core rules:

i) Either …
ii) …

Rules i) and ii) result in 3 possible combinations of core arrays. Rule ii) also has as a consequence that one would always be able to prepare for a full sequence download at the first recursion level, for instance by preparing some data structures at the appropriate lengths, or selecting an API for sequence retrieval based on the size (e.g. allowing small sequence downloads from the seqcol API, but switching to another API for larger sequences).

Assuming these two core rules, let's continue with the other possible arrays (which would then never be found without the core arrays, something which makes total sense to me):

**F. topologies**
So we have discussed this back and forth for a long time. There are scenarios where such information is crucial for an algorithm, but one could also in those cases just require topology to be added as additional metadata. As topology is unlikely to change in different versions of the same genome reference (other than such information being present or not), it might not seem very useful to add topology (in addition to the core arrays) when defining the buckets; it will only work as a flag on whether topology info is being taken into consideration or not. However: I believe adding the possibility will not hurt much in the current implementation idea, and it will have as a consequence that topology information becomes more easily available and would thus, in practice, also be used more by algorithms/services that can make use of it (assuming that adding topology info is often done through an optional tool parameter that most people just skip, which I believe is a reasonable assumption). Hence, just including topology as an optional array in the standard would most probably result in a slight increase in overall data/result quality in the life science field, which is not a bad argument, IMO.

**G. priorities (sequence groups?)**
I believe this would be the same as or similar to what the NCBI assembly database refers to as "Sequence role", in combination with "Assembly Unit"; see e.g. https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.26_GRCh38/GCF_000001405.26_GRCh38_assembly_report.txt (from https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.26). This basically represents a subdivision of assembly sequences into smaller groups: in e.g. the human genome, the …

I believe such subdivisions are important information to have available in a machine-readable way, and having this information in a seqcol array is an easy way to support this. However, for identifier-based bucketing, adding such a subgroup category array would only work as a flag on whether such information is available or not, across all sequences, which is of limited use. In contrast with the topologies array, adding the subgroup info to a sequence would probably not change how that sequence is handled by itself. The most natural usage scenario is to make use of such subgroups for filtering. However, in practice this is typically done prior to analysis, so that the sequence collection used in the analysis step would no longer match the full collection from which it came. Such filtering leads to genomes such as the b38 genome (containing no alternate sequences, I believe), used for mapping, variant calling, etc. However, one could quite easily support identifying many of these sub-genomes by carefully defining a subgroup array and then automatically calculating digests for the sequences that are included in each subgroup, or in combinations of them:
The problem is obviously the general applicability of such categorization across sequence collections. For instance, what about protein sequences? However, I believe the advantages, especially in light of the FAIR principles, are so great that I think we should explore this path.

**H. masks (sequence gaps?)**
One can define many possible types of masks, but we have not (in the meetings I have been present at, at least) delineated this much. There is at least one type of mask I would argue would be very useful as an array, namely a mask specifying the gaps in the assembly, perhaps in combination with contig placement (such as CONTIG in https://www.ncbi.nlm.nih.gov/nuccore/NC_000001.11). For DNA, one could also derive such gap info directly from the 'N' characters (which I believe would also include smaller gaps). There would be similar masks derivable from other alphabets, e.g. proteins, where gaps are encoded differently (I believe, by a period character).
**I. alphabet**
I would like to add 'alphabet' to the list of arrays. In my mind, this is not the same as the summary of the characters used in the sequence, but the possible characters that could have been used. Similar to the assembly gaps use case above, the fact that a character is not used in a sequence position is a separate piece of information, given that we know the character was allowed in the first place. Hence, the alphabet (defined in this way) adds information to what is inherent in the sequence itself, and one can easily imagine sequences that are the same but are defined in relation to different alphabets. Having the alphabet available at recursion level 1 (together with the lengths) would also help if one needs to pre-register some data structure before the sequences are downloaded. Also, even if this is not the main argument, having the alphabet available would be a simple way to differentiate between e.g. DNA and protein sequences, without having to add a sequence type array/metadata field (which is btw also a possibility, of course, but I think I will stop here...).

**F-I. rules for non-core arrays**
If all these non-core arrays (and more) are supported, there will be a large number of possible combinations of them. In this case, given 3 core configurations and 4 extra arrays, there are 3 × 2^4 = 48 total combinations. If one has all this information available, calculating the digest for all of the combinations seems a bit over-the-top, especially in combination with different subgroups. Hence, it is probably smart to place the extra arrays into groups where all arrays in a group are required to be present, at least if one is generating digests for various combinations of arrays. Here is a suggestion of such rules:

iii) …

**Conclusion**
In the case of 4 sequence group values, there are 3 (core) × 2^2 (extra) × 2^4 (seq groups) = 192 combinations to be calculated for each full sequence collection. This of course represents a bit of overhead, but we are still not talking about much data.
So what do you think? There is a huge number of details to discuss here, but there is also a discussion to be had on a more philosophical level about what kind of problems seqcol should solve. I am, as you can easily conclude from the above, in the "we have an opportunity to be heroes" camp! And sorry for the length of this post... |
Above is a lengthy discussion of what to include in the array-to-digest. This thread happened as we transitioned to the array-based digest method (also addressed in issue #10). Anyway, the new array-based approach changed how we conceptualized what is included in the sequence collection, and made it much more flexible. We sort of see it as being able to hold whatever arrays a user wants, with a few controlled vocabulary terms. It gives a lot of flexibility to the user, and also reshapes the question here a bit, to: which arrays should be mandatory? This is addressed in a pending ADR in PR #18. We haven't really finalized this, so I wanted to finish it here. It's clear from the discussion that sequences should be optional, and lengths should be mandatory. But what about names? Should names be mandatory? A corollary is whether and where we enforce uniqueness. It's clear to me that we don't enforce uniqueness for lengths. I guess we also don't enforce uniqueness for sequences. But what about for names? I think @daviesrob had some thoughts here as to the mandatoriness and uniqueness constraints of the names array. Can we come to a conclusion here, now that we have increased clarity on what a seqcol is? |
**Making names mandatory**
From memory, the argument here is that a collection could be composed only of names and lengths. We also said it was likely a collection would define either names or sequences to stand alongside the lengths. I also think we suggested someone might just duplicate the checksums into names if we made names mandatory. If the above is a good recollection, then I would support making names/sequences required, but only require one of those in a given collection. This also seems to align with a number of @sveinugu's comments that lengths alone would not be a valid collection.

**Uniqueness of names**
The comment about sequences being non-unique, from memory, is because two independent sequences could create the same checksum but have different associated metadata. Is there a reason why a sequence collection would be invalid if it had a total duplication of an entry, i.e. sequence, name & length all duplicated? I personally think if we can answer that question (perhaps we have done so last week), then the uniqueness of names would be an easy thing to answer. If you can't duplicate an entry in a collection, then names can be unique. |
I think you're right, Andy. I like the concept of "total duplication".
Maybe to just restate this question in another way: Which of the following data structures is a sequence collection?
Since our entries are split out into separate arrays, it will be easier to implement if we allow total duplication of an entry, because we must use lists under the hood to accommodate duplication within an array. However, if we don't need total duplication allowed for some reason, then implementing a check to prevent it may give us some nice computational properties down the line, such as being able to assume that the concatenation of … I suppose we could enforce disallowing total duplication of an entry by simply requiring and enforcing that names be unique. |
Unique names are required for some use-cases (e.g. SAM and VCF). Unless we can think of a really good case where having duplicated entries is necessary, I think enforcing name uniqueness would give us some very handy properties. |
Looking over my notes from our previous discussion on this (2021-08-25), we had, I think, approached consensus that names should be required, and should be unique. If you don't have a names array, you should just generate one with numbers 1...x as the sequence identifiers. This provides uniquely identifiable sequences. I can't really think of a situation where this would bite us. You could always stick your non-unique identifiers into some other custom array if you needed that. So:
|
I am a bit uncertain about requiring names, in that it removes one of the more straightforward applications of the sequence collections standard: to provide a single unique digest for a collection of sequences, given only the sequences themselves. Extracting the lengths from the sequences is simple and uniquely defined; there is no choice to be made. However, having to generate a names array with unique values (when this information is not present) can be done in a multitude of ways, with indexing from 0...n being one and duplicating the sequences array being another. Each of these will result in a different digest for the sequence collection. As a consequence, we lose one of the simplest use cases, where one can uniquely and consistently create one (and only one) digest from a bunch of sequences, given that there is no other metadata. This BTW is also an argument for providing an order-invariant digest algorithm, as having to enforce an order onto an unordered set of sequences is also a process where there are several canonical ordering schemes to choose from, each of which will result in a different digest for the same set of sequences. In fact, defining a canonical ordering scheme to be able to uniquely generate a single order-invariant digest, given that order is missing/not important (which I argue for in #5 (comment)), seems similar to defining a canonical naming scheme to be able to uniquely generate a single naming-invariant digest, given that names are missing/not important. Even more, it seems the choice of defining a sequence collection as an ordered list of sequences opened up the need for also defining a canonical ordering scheme in cases where order is not defined. Similarly, making the choice of requiring the names column opens up the need for defining a canonical naming scheme in cases where names are not defined. For me, if we are unable to provide a deterministic function:
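(a hypothetical sketch of the shape of such a function; the name and types are illustrative, as the original signature was not preserved:)

```python
def seqcol_digest(sequences: list[str]) -> str:
    """Map a bare collection of sequences, with no other metadata,
    to exactly one digest, deterministically."""
    ...
```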
then I suspect that we have over-engineered things. I have no objections against requiring unique names, but if the only argument is to disallow total duplication, I believe we might be adding unnecessary constraints. |
In conclusion, my suggestion is to require … |
Well, if we define "default names", then this application is not removed. Your argument is right, but only if we don't specify the approved way to construct default names.
Could be, but there are multiple arguments:
|
One way out of the problem where you have sequences but no names might be to derive the names from the sequence checksums. For example, this Python script will read in a list of checksums and make names by combining the first six characters with a counter to deal with clashes. The counter values are set based on an alphabetical sorting of the checksums, preserving the original order for duplicates.

```python
#!/usr/bin/python3
import fileinput
from collections import defaultdict

checksums = []
for line in fileinput.input():
    checksums.append(line.strip())

# First pass, in sorted order: record, for each distinct checksum,
# the per-prefix counter value at the point it is first seen.
seen = {}
count = defaultdict(int)
for cksum in sorted(checksums):
    prefix = cksum[0:6]
    count[prefix] += 1
    seen.setdefault(cksum, count[prefix])

# Second pass, in original order: emit "<prefix>-<counter> <checksum>",
# incrementing the counter so duplicates get consecutive names.
for cksum in checksums:
    prefix = cksum[0:6]
    print(f"{prefix}-{seen[cksum]:06d} {cksum}")
    seen[cksum] += 1
```

So, for example, this input (the checksums below are made up for illustration; the original example values were not preserved):
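```
ce04be1226e56f48da55b6c130d45b94
ce04be1226e56f48da55b6c130d45b94
0f10d5afacc3af222d86902f3a8145ae
```

gets these names:

```
ce04be-000001 ce04be1226e56f48da55b6c130d45b94
ce04be-000002 ce04be1226e56f48da55b6c130d45b94
0f10d5-000001 0f10d5afacc3af222d86902f3a8145ae
```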
This has the property that sorting by name gives the same ordering as if you sorted by sequence checksum, which might be useful, e.g. for generating a canonical order:
|
Yes, it does have some value, although I am uncertain of how important this is.
This is a very good argument. The difference between saying "seqcol is compatible with SAM and VCF" and "seqcol is compatible with SAM and VCF given that a …
It does in the cases where the sequence names are already given and/or the sequences already have a natural ordering. However, if one wants to create a name-invariant seqcol digest (if names are not available or are unimportant) and the order of the sequences is not given, one still needs to first order the sequences according to some deterministic algorithm. The fact that a …
I definitely agree that enforcing unique identifiers would be very useful downstream. However, I feel the above is really only an argument for requiring names to be unique IF they are present, not an argument for enforcing that the names column as a whole is mandatory. One can argue that it would be an advantage in many typical implementation scenarios to know that a unique 'names' column will always be present. However, if unique keys are needed, one can always just generate them by adding indexes, etc. In conclusion, though, I agree that there are strong arguments for making the …

Edit: I will try to write out my suggestion tomorrow. Too late for me today. |
So I have now tried to write out my idea in full (and then some). (Warning: novel-length comment follows! This can be a nice read over the weekend, under a blanket by the fireside with a cup of tea...)

Before going into the details, I would like to take a step back and provide some thoughts as to what we are actually trying to accomplish with the seqcol standard, the way I see it (with the reservation that I was not part of this project from the start and might have misunderstood some fundamentals):

1. Create a persistent, globally unique identifier for a collection of sequences as defined in a repo of some sort, with a particular focus on reference genome repos.
2. That the identifier should be a deterministic and unique digest of the sequence collection contents, where the contents are defined in some (more or less) standard data structure.
3. Provide a standard for an endpoint for fetching a sequence collection given the identifier.
4. Provide a comparison function that can return a useful overview of important similarities and differences between the contents of two sequence collections.
5. Allow for use of the above in scenarios where the exact sequences are not important, e.g. to allow the creation of identifiers for coordinate systems that contain names and lengths, but not the actual sequences. The most prominent use case of this is as metadata for genome position annotation files (e.g. BED/SAM/VCF files) [which is what I would call "genomic track" files, as argued for here and onwards, but that is a digression].

That is more or less my impression of our goals. The last goal seems to be the least understood of the ones in the list, where I assume I have contributed significantly to the confusion by arguing at length and in great detail in issue #5 why I believe that an identifier that refers to a coordinate system should be based on an unordered data structure. One lesson drawn from this discussion is that it seems we are actually talking about three different types of identifiers:
(Btw: this will be more than a rehash of the unordered/ordered discussion, just be patient with me...)

Identifier type C might seem the least important of the lot, and in one sense I agree with this. However, it is the only identifier type that can be used to fully represent the "common shared denominator" of several sequence collections if these are differently ordered. Identifiers of type B can easily be replaced by identifiers of type A in scenarios where you are only analyzing e.g. a single BED file (and where you in any case should preserve the A-type identifier to maintain provenance). However, in scenarios where you want to identify a shared coordinate system for several files (as in a genome browser), such an identifier should really be of type C, I believe. The alternative is to use an identifier of type B (or A) and say, in a sense, that the ordering of sequences in B (or A) is the canonical ordering, and that all other sequence orders must be converted to follow B, even if there is no good reason to pick any particular order over another. At this point you might be thinking that, yes, having to use type A or B instead of C might be a nuisance, but it should not keep you awake at night, so to speak; we need to focus on A and possibly B. And if this was all there was to this, I would probably agree. However, there is a fourth type of identifier that we have not discussed in a principled way in the seqcol meetings:
This is really what the … But we have had persistent unique sequence collection identifiers all along for genome assemblies, like the GenBank assembly accession: GCA_000001405.29 (which refers to hg38 patch 14, which was released a week ago!). I believe one main thought behind …

1. users provide whatever metadata is required by tools, with the UCSC Genome Browser being a prime example of this.
2. That …

Regarding point 2, I believe … My thought has been all along that if we want people to replace …

So this long rant is really questioning whether we are trying to solve the correct questions related to goal 5 (coordinate systems). So if the real issue is that the names used to identify sequences are imprecise and incompatible (i.e. …

6. Define content-derived, globally persistent and unique identifiers for single sequences within sequence collections (let's call these …

If each element within, for example, a BED file is annotated with a persistent identifier to the sequence it refers to, one can make sense of the data without requiring that the BED file itself is annotated with a precise enough reference genome identifier. I believe GFF, SAM and VCF have all gone this route, allowing the sequences to be precisely specified within the headers inside the file itself. For a BED file (maintaining compatibility by not adding any headers) this would mean exchanging the …

So could not the refget digest be used as a sequence id? Well, I don't think so, as the main advantage of …

So what if we could support the following workflow?:

I. Input file i: A legacy …
II. Input file ii: Properly annotated seqcol-compliant BED file based on the …
III. Tool X is a BED file conversion tool that unifies the input files to use sequence names that precisely identify the coordinate system in the particular context needed for the analysis. For file i, it translates …
IV. Tool Y is an analysis tool that might or might not take advantage of the features that the …
V. References to …

So how would this be implemented? Here's my suggestion:
So the …
In addition to the advantages of including globally unique and persistent per-sequence identifiers that can be customized to almost any need, there are also a bunch of advantageous technical properties of this solution:
EDIT: 13. I forgot to add that total duplications are by definition not possible with this idea. |
Thanks for writing this @sveinugu. I can't claim to follow your arguments completely, but I think I understand some of the main points. After reading through this and Rob's function suggestion above, I can add my thoughts. My main conclusion is this: I think the canonical way to assign sequence names should be to simply number them from 1 to X, and that's it. My position comes from these basic points:
I propose that names should be required to be unique within a collection, and should also be mandatory. In this case, the possible scenarios of what a user may have with names/sequences are:
So, the issue is: what do you do when a human hasn't provided names for sequences? Well:
I still think the best approach is to say "if you don't have names, you just number them" and that's it. In the event that a human has sequences but no names, then the names don't really matter. They are arbitrary and so we just number them to give them a unique identifier so the seqcol protocols can function correctly. @sveinugu argues against this above here:
But to me, this is missing the point. The sequences must be in some order. There's no such thing as a list of sequences that doesn't have an order; that's not the way computers or lists work. The seqcol is not handling order at all. |
So I was asked in the meeting to provide a use case, and the best use case I have is my own. Even though this is a very specific use case, my thinking is that if it is possible to generalize seqcol a bit, in a principled manner, to cater for my use case at no or only marginal cost to the existing functionality, we would probably also cater for a range of other use cases that we haven't thought of yet.

So my core use case is in the context of developing tools for statistical analysis of BED-like files (which I call genomic track files). One of the more typical questions I would like to answer is this: "Given two (or more) track files representing sets of intervals along a reference genome (say, two or more files of ChIP-seq peaks), are these sets of intervals co-localized (e.g. overlapping) more than expected by chance?" The following Bioinformatics paper describes the implementational and methodological considerations for such investigations: https://doi.org/10.1093/bioinformatics/bty835

The most important consideration regards the question of what one means by "expected by chance", i.e. what is the assumed null model for the investigation? Statistical significance measures whether the observations in the data are more extreme than what is expected under the null model, so the realism of the assumptions that make up the null model is of utmost importance. Too-simplistic null models will result in a large number of false positives, where the statistical test will reject the null model and thus indicate a biologically relevant relation when there is none. In brief: if the assumptions that make up the null model are false (unrealistic), the statistical test is not answering the question we want it to answer.

So one type of statistical test deployed to answer the question of co-localization is based upon randomization of the dataset(s) by shuffling the intervals to random positions along the coordinate system derived from the reference genome. If the null model assumes uniform shuffling across all parts of the coordinate system, some of these randomized intervals will probably be moved to the centromere or other areas of the genome where they would not appear experimentally. Thus, the statistical test will in this case assume that some intervals "by chance" will appear in centromeres, while in fact all of the ChIP-seq peaks in the track data files, by the very nature of the experimental and data processing methodology, will avoid the centromeres. In this case, the ChIP-seq files will in fact be co-localizing (outside of the centromeres) more than expected by chance (which here assumes that peaks might as well appear in the centromeres as elsewhere). The null model will be correctly rejected by the test; however, the results are of no use. The only thing we have tested is, simply put, whether the data files contain ChIP-seq peaks, which is something we knew in advance.

So this long story is just to provide the background for why the gaps in the assembly (which include the centromeres) are very important for this kind of analysis. Other restrictions on the randomization of the data should also be taken into account, but I will leave that out of this discussion and focus only on the assembly gaps. (Btw: @nsheff I believe one reason why we have differing views on the nature of coordinate systems is that your statistical methodologies, at least LOLA, are based on another way to restrict the null model, i.e.
by having the user define a set of regions that define the universe for the analysis.)

The conclusion is this: if I want to represent the coordinate system as a sequence collection (and I do), such sequence collections would include at least the … This represents an additional level of complexity compared to the way we have previously defined a coordinate system. For, given that we know the exact sequence collection used to generate each file of ChIP-seq peaks, if these are all different patches of the same version of the reference genome (say: hg38), and even if we ignore the actual sequence contents, the assembly gap information would still differ (the patches are indeed filling gaps) for the same sequence across the seqcols defined by the data files. However, if we were to also include the …

Furthermore, the coordinate system assumed by the null model should really be defined as a "largest common denominator" of the coordinate systems implied by each of the files. One might, for instance, base the null model assumptions on a coordinate system with assembly gaps that represent a mean of the assembly gap information derived across all the files. Thus, the resulting coordinate system might be a "synthetic" one defined as a middle ground for analysis. In the same vein, if the sequences differ in order across the per-file sequence collections, a "middle ground" coordinate system should have no order. Furthermore, if the names differ, the "middle ground" coordinate system should really not have any names, only name-invariant sequence identifiers.

Granted, there are ways of avoiding this complexity by e.g. specifying a canonical order or naming based on a specific sequence collection, but in that case the mathematically most correct solution will be unavailable due to limitations in the standard. Or, of course, one could decide not to follow the sequence collections standard at all. This will however limit adoption of the standard; in my case, by one less developer.

In addition, specifying a particular sequence collection that defines canonical ordering or naming might have non-optimal consequences. For instance, in this domain most BED-like files follow the non-standard UCSC-type names ('chr1', not '1'), even though the files themselves might have been generated using FASTA files based on standard naming. In such cases, one would probably, as a data processing step, carry out a string replacement operation to comply with the UCSC de facto standard. As a consequence, if a sequence collection identifier is attached to the file, it will end up out of sync with the actual names used in the file. Another alternative would be to define and use ad hoc "crossover" seqcols (where everything but the names stays the same), but these would probably not be registered in any seqcol repos. Lastly, if one wants to follow the sequence collections standard to the letter, one can opt not to carry out such naming conversions in the data processing steps. However, this will reintroduce the need for the analysis tools to convert the names to follow a more or less arbitrarily selected canonical naming scheme (e.g. UCSC style). Avoiding such complexities was, I believe, one of the reasons why the UCSC-style names instead became the de facto standard at the level of data generation.
As a conclusion, I believe it will greatly increase adoption of the standard in this domain if we provide a simple canonical way to automatically carry out such conversions (either between naming schemes or towards machine-operable identifiers, or both). Similar arguments can be made (and have been made by me previously) against canonical ordering; however, I think the issues related to canonical naming better illustrate my concerns.

To summarize: I believe what my example mainly illustrates is that there will be many possible definitions of what is meant by a sequence in a sequence collection that are not covered by using the refget identifier to refer to a sequence. The advantage of the array-based solution is that it opens up for specifying whichever arrays are relevant in each context. However, placing limitations on ordering or naming will limit the landscape of potential applications that is opened up by the array solution. In my mind (and I'm sure you agree), we should weigh the benefits (as we perceive them) against the cost in functionality, complexity and/or time needed to overcome such limits in the standard itself. If the costs are very low, I believe we should by default move towards the most general standard. Based on the feedback from the meeting, I believe the consensus is that the costs are still too high, and I agree, but they are in my mind only slightly too high. So, I have an idea for how to further lower the costs and provide even more benefits, but that will have to wait until my next post. |
So here’s the latest version of my idea: First, here is a table-based representation of our current solution. In this example, I have added an
So one thing that can easily be done is to convert the array digests into array identifiers, let’s call them
The … So, similarly, one can easily also define row-wise globally unique identifiers as a digest of all the values in a row, including array names. Let's call these identifiers …
Since these identifiers are globally unique, they can be deterministically sorted and thus give rise to order-invariant sequence collections (with unique identifiers). In order for the
So, in contrast to my last suggestion, where I suggested we require that there should be no duplications excluding the
Thus, adding a … However, formulating the constraints with these two rules in addition allows for name-invariant sequence collections by just removing the …
So the take-home message is this: This idea allows for globally and locally unique as well as fully configurable sequence identifiers (where the configuration is based on which arrays are included in the seqcol) without any practical negative consequences for the large majority of use cases. It also allows for both |
Sidenote 1: on total duplications for edge-case scenarios (added here for logical completeness; can be skipped): The way I see it, for a … If we, from the last table, want to also remove the … Hence, there exists a possible third requirement:
Here, dependency is meant in the statistical sense. If there are external variables that explain the values, which is the case for most sequence names (e.g. 'chr1' or '1' is the largest chromosome in the human genome), then the requirement still holds. The main consequence of this rule is that about the only thing one should not do is this: given a set of unnamed sequences with no apparent order, one should not use the arbitrary index of the sequences as the default name, as was suggested above. This will ruin the global uniqueness and hence also the possibility of order-invariant seqcol ids (sorting of …).

The natural solution, given one has a set of unordered sequences, is either: 1) remove any duplicates and create a seqcol that is invariant of both … (Edit: removed the possibility of canonical sorting, as this will not solve the problem. Total duplicates are per definition "the same concept of a sequence" and should always have the same identifier.)

However, as has been argued by others: this is a very edge-case scenario, and I don't believe it needs to enter into the standard itself. It is a problem for advanced users with use cases that are out of scope of the standard, and they will probably figure something out anyway. |
Sidenotes on alternative …

Sidenote 2a: The suggested redefinition of …

Sidenote 2b: Given that … |
I understand the use case above about statistics -- but, to me, this is solved by the …

I'm trying to condense all this down into what you're actually recommending, and I think it's this: you advocate for …
If we do it like that, this satisfies you, correct? Because it seems to me you can then build all the things you are talking about here on top of the seqcol implementation. So, the follow-up question is: are you sure there's no way for you to build what you describe under the constraint that names are mandatory? Where you specified a need for nameless collections was here: …
I believe in this case you could simply name the sequences according to the way you've defined the `seqid`s.

So, correct me if I'm wrong, but I believe I have understood your use case and your argument, and I still think it is possible to solve it even while maintaining the constraint that names be both unique and mandatory. Given the computational benefits and simplifications we gain downstream from assuming names to be unique and present, I think this constraint is worth enforcing. |
It strikes me that my proposal above is kind of a mix between what @daviesrob suggested and what @sveinugu suggested. Rob's suggestion was default names based on the refget digest, but this fails if you lack a sequences array, and it also doesn't satisfy the uniqueness constraints that Sveinung needs. Sveinung's proposal generates unique row identifiers that satisfy his needs, but puts them into a new concept of … So, just combine the two -- generate the default names from the row-wise digests (see the sketch below). |
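A sketch of what that combination might look like; the function name and the digest-as-default-name convention here are illustrative assumptions, not settled spec:

```python
import base64
import hashlib
import json

def sha512t24u(data: bytes) -> str:
    # GA4GH digest: base64url encoding of the first 24 bytes of SHA-512
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")

def fill_default_names(seqcol: dict) -> dict:
    """If 'names' is absent, derive one default name per row by digesting the
    remaining per-row values (lengths, sequences, topologies, ...)."""
    if "names" in seqcol:
        return seqcol
    arrays = sorted(seqcol)
    n = len(seqcol[arrays[0]])
    names = [sha512t24u(json.dumps({a: seqcol[a][i] for a in arrays},
                                   separators=(",", ":")).encode())
             for i in range(n)]
    # Note: totally duplicated rows would still collide -- the edge case
    # discussed earlier in the thread
    return {**seqcol, "names": names}

# Works even without a 'sequences' array, unlike defaults based on the
# refget digest alone
print(fill_default_names({"lengths": [100, 200], "topologies": ["linear", "linear"]}))
```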
First of all, my idea is no longer to put the …

Also, I might not have conveyed the full argument for my `seqid` proposal: …

So, following your arguments, you could say that one could in that case just define a new array (e.g. the `gaps` array in my example): …
I believe what is needed for the implementations to support this is simply to add an additional endpoint variant that provides the top-level view in a row-wise fashion, with one object per annotated sequence (a sketch follows below). |
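For illustration only, such a row-wise view might return something shaped like this (the endpoint path and field names are hypothetical):

```python
# Hypothetical response of GET /collection/{digest}?view=rows
row_view = [
    {
        "seqid": "sha512t24u-digest-of-this-row",  # row-wise digest
        "name": "chr1",
        "length": 248956422,
        "sequence": "SQ.aaa",  # refget digest
        "topology": "linear",
    },
    # ... one such object per sequence in the collection
]
```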
I also might be misunderstanding the proposal here, but I feel there is, to a point, a circular argument developing, since we come back to names and lengths being attributes of sequences which could be used as an identifier. In addition, we now have gaps. As a genome provider, I don't believe we actually know where those gaps are. This may be back in the original AGPs, but as we move more towards T2T-based assemblies, these gaps are not going to be a unique identifier for elements within the collection. Perhaps the bit I cannot quite fathom is: …
As a user of a genome/sequence collection, I can't control any of the above three sources of differences. Also, I don't see how any system eliminates this when it is partly based on those major sources of differences. I also admit this might just be me not quite getting the thrust of the case for supporting seqids, or how they're being generated. |
Right... correct me if I'm wrong, but the ASDs proposed here are essentially the row-wise digests we considered originally. Here's the ADR for where and why we decided not to do it that way: https://github.com/ga4gh/seqcol-spec/pull/14/files

I think @sveinugu is proposing to include both types of digests in the sequence collection, which is a bit different from the original discussion, where we were deciding on one or the other. It would basically combine the old and new ways of computing the digests. So, a level 1 seqcol in total would look something like the sketch below.
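Here is a rough sketch of that shape; the `annotated_sequence_digests` key and the placeholder values are illustrative only:

```python
# Level 1: each array reduced to its digest, plus (per this proposal) a
# parallel array of row-wise digests (ASDs), one per annotated sequence
level1_seqcol = {
    "names": "sha512t24u-digest-of-names-array",
    "lengths": "sha512t24u-digest-of-lengths-array",
    "sequences": "sha512t24u-digest-of-sequences-array",
    "topologies": "sha512t24u-digest-of-topologies-array",
    "annotated_sequence_digests": [
        "sha512t24u-digest-of-row-1",
        "sha512t24u-digest-of-row-2",
    ],
}
```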
As Sveinung proposed, these could be served via different endpoints. Some thoughts:
I think the big-picture point here is that we have to distinguish between these two things: 1) what is included in the digest calculation, and 2) what a server stores and returns (the endpoint outputs).
These are not the same thing. For example, we once discussed removing the length from the digest calculation, because it doesn't confer any additional identifiability, given the sequences. We decided it makes more sense to keep it in, for consistency, because the sequence array is not always present. Even if we had decided to remove it from the digest calculation, the server could still store and return those values.

In this case, we are facing a similar situation. From the perspective of the digest calculation, the ASDs add no uniqueness. The only value they could add is as a return value from the server: a reshaping of existing data. There are probably lots of other reshapings of existing data that could be useful for various specific downstream analyses. Should we build all of them into the spec? Or should we just make it possible to enact all of these things easily on top of the spec, for the use cases that require them?

What would we need to do to enable these ASDs within the current spec? Allow a server to return additional level 1 elements that are not included in the digest algorithm. Actually, you could already do this. The cost of making this a part of the specification would be: …
I do think this is useful. I do not know if it's worth making this a universal requirement. |
So I am not really competent when it comes … |
Yes, I believe you are right. I was not part of those discussions, but I came to the same idea through a bunch of trial and error, trying to provide solutions for some limitations of the array-based variant, namely the ability to provide identifiers for the individual annotated sequences.
So my proposal would be to add an additional, independent object, which is just the array of ASDs, and which will take no part in the existing structure.
I agree, but the way I think of it, this is really an advantage. Adding an array is really redefining the conceptual model of what you consider to be an Annotated Sequence, and thus its digest should change correspondingly. Thinking of the digest as a globally unique identifier, if the concept of a sequence that you refer to changes, the identifier should also change to reflect that. For instance, removing the …
Or not include them in this structure at all, but instead provide an alternative and independent object/endpoint.
I agree.
I agree. My idea is to add the ASDs as an endpoint output only (2), but a mandatory one. The reason for this requirement is that, if we assume most reference genome providers will adopt the sequence collection standard, then having the ASDs also available from all of them will open up use cases to researchers and developers that would not be available otherwise, even if one designs one's own functionality on top of the sequence collections standard (unless one manages to convince all of the providers to implement the same additional functionality). So for me, this is a pragmatic issue more than a theoretical one. One can easily implement search/indexing tools externally, but only if these ASD identifiers are available from the databases; otherwise, one would need to replicate the full database contents in order to calculate these digests.
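As a sketch of the kind of external tooling a mandatory ASD endpoint would enable (the data shapes here are hypothetical; in practice the inputs would be fetched from each provider):

```python
from collections import defaultdict

def build_asd_index(providers):
    """providers: provider name -> {collection digest: [ASD, ...]}."""
    index = defaultdict(set)
    for provider, collections in providers.items():
        for coll_digest, asds in collections.items():
            for asd in asds:
                index[asd].add((provider, coll_digest))
    return index

# "Which collections contain this exact annotated sequence?" then becomes a
# single lookup, without replicating the full database contents
index = build_asd_index({
    "providerA": {"collX": ["asd1", "asd2"]},
    "providerB": {"collY": ["asd2", "asd3"]},
})
print(index["asd2"])  # {('providerA', 'collX'), ('providerB', 'collY')}
```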
I believe the row-wise digests are in a special category and not comparable to “lots of other reshapings”: they are the main logical alternative to the array-based digests.
I agree.
I still don’t get this part.
Using numeric indexes as ‘names’ makes the ASD dependent on the order of the sequence within the collection, which breaks the possibility of using it to identify equality of Annotated Sequences across collections. Instead, I propose that …

EDIT: I misread you. I see now that you explicitly suggest ignoring the … |
Defined now in: #50 |
Now that we have the … |
We need to confirm the final contents of the string-to-digest. Right now, I think we had settled on these elements:

contents:

- names (e.g. chr1)
- lengths
- topologies
- sequence digests

constraints:

- …
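For concreteness, here is a sketch of how such a string-to-digest could be assembled from those elements; the delimiters and element order are placeholders, since settling the exact serialization is the point of this issue:

```python
import base64
import hashlib

def sha512t24u(data: bytes) -> str:
    # GA4GH digest: base64url encoding of the first 24 bytes of SHA-512
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")

# One entry per sequence: name, length, topology, and the (optional) refget
# sequence digest; the ';' and ',' delimiters are illustrative only
elements = [
    ("chr1", 248956422, "linear", "SQ.aaa"),
    ("chrM", 16569, "circular", "SQ.bbb"),
]
string_to_digest = ";".join(",".join(str(v) for v in e) for e in elements)
seqcol_digest = sha512t24u(string_to_digest.encode())
print(seqcol_digest)
```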
Related to this, I initially described this with a JSON Schema that I use in my implementation, but it is a bit more complicated because it encodes the recursion (the fact that the sequence digest element is itself a top-level refgettable element). But for reference, if anyone's interested, it's here:
https://github.com/refgenie/seqcol/blob/master/seqcol/schemas/AnnotatedSequenceList.yaml
Open for discussion.