What characters should we use for delimiters? #1
Comments
Pinging @jmarshall for input on this.
Also paging @yfarjoun and @lbergelson for input. We had previously discussed using the dict format for this. However, this might be limiting for the other kinds of use cases the underlying seqcol framework could support. Ultimately it's a framework for associating key metadata with a checksum and allowing comparison functions to run. There are lots of other ideas around this.
I would recommend …
My vote would also be for …
We had previously discounted whitespace characters for the reasons noted above, and for ease of exposition and clarity in a textual specification. Similar arguments apply to other control characters (US/RS/etc.).

If digesting and hashing were a one-way transformation, there would be no reason not to use ordinary punctuation characters as delimiters. However, some proposed use cases involve parsing the unhashed digest, which would place additional constraints on this. Hence this delimiter question is really subordinate to the question of whether to include use cases that involve reversing and parsing the digest.
Thanks @jmarshall
I think we do include use cases that involve reversing and parsing the digest. The way I envision it, the database lookup component that is fundamental to seqcol would require this. For example, I have a digest, say …

My perception from the discussion today was that attendees favored the …

@jmarshall, given that parsing the digest will be critical, what do you suggest we use for the delimiters?
I have been envisaging the fundamental lookup process as being: the server looks up the client's provided digest …

In previous discussions, we decided against using whitespace characters for reasons of ease of exposition and implementation: avoiding ambiguity in the textual spec about whether the blank space on the page represents a tab or a couple of spaces or some mixture, and avoiding any questions or implementation bugs about whether the newlines are `\n` or `\r\n`.

I believe somewhere someone mentioned a possible use case of having the unhashed digest string as the value of e.g. a SAM (header) tag or VCF header, i.e., embedding it as a value in some other textual format. IMHO this is an interesting possibility that we should enable. It would not be feasible if the string contained whitespace, and only barely feasible if the string contained other non-printable ASCII control characters.

If you don't have to parse the unhashed digest string (only hash it and do the resulting lookups), you can use pretty much any delimiters, as the string merely needs to be unambiguous and avoid collisions, but does not need to be parseable. Even if being able to parse it is also desirable to enable other use cases (okay, 👍 from me), most of the items are constrained enough that there is no problem using ordinary punctuation characters anyway.

Last I looked there was still a question about what metadata items would go into the digest. Notably whether …
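For concreteness, a string of roughly this shape might look like the following (an illustrative sketch using the digest values quoted elsewhere in this thread; the punctuation delimiters here are placeholders, not a settled choice):

```
chr1,248956422,2648ae1bacce4ec4b6cf337dcae37816;
chr2,242193529,4bb4f82880a14111eb7327169ffb729b
```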
(which I have displayed across several lines for ease of presentation — the real string value is all on one line and contains no whitespace)

This is clear and not error-prone, human-readable, embeddable in other tab-separated formats, and easy to describe and show in the spec text. The spec will need to be clear whether the …

(But really I think the question of delimiters can't be properly answered until we've converged on an answer to exactly which sequence metadata items contribute to the digest and exactly what the rules are about whether they may or may not be present.)
Hmm… where have I seen counted strings being used? e.g.
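Something along these lines, perhaps (an illustrative sketch of the counted-string idea, in which each field is prefixed with its length in characters; the exact syntax is a placeholder):

```
4:chr1,9:248956422,32:2648ae1bacce4ec4b6cf337dcae37816
```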
Now there are zero concerns about what characters can appear in sequence names…
I like this. Just a few initial thoughts:
Yes, I think this is possible. In my implementation, I store just the actual digested string, along with the digest. So, the data model is extremely simple and I can just use a key-value store. In addition to simplifying the database structure, an advantage of doing it this way is that I can use the same backend/store for refget and seqcol (or anything really). But I guess these are implementation details...
Ok, I think this is definitely worth discussing. I don't have enough experience with these header formats, but in principle I agree. So, I think your argument for using "normal punctuation" makes sense. It also makes it easier to actually construct the string to digest.
You're right. In fact, if we never needed to parse it, the delimiters would only have to keep the string unambiguous and free of collisions, and almost any characters would do. But since we do want it to be parseable, the choice of delimiters matters more.
Well, 'circular' would be a possible entry under 'topology', so I don't think that's a different entry. But yes, we should make sure this is finalized. I created a new issue for this discussion: #8.
That's an interesting approach. Those prefixed numbers indicate the number of characters in the sequence name (or topology field), right? I think that would indeed solve the problem... But do we need that? If we go with the SAM spec, which seems likely, then commas and backslashes are not allowed in sequence names. In that case, couldn't we just use commas and backslashes, since we can guarantee that none of the fields would use them?

But I like the counted-strings approach because it's even more universal, and could therefore be implemented in a back-end that used a similar approach for other data which may not have the same restriction... Would you also want to count the string lengths on the other, non-name components, just to make it universal?
Agreed, let's follow that up in #8, where I have just outlined what I have envisioned at the moment.
I also agree with supporting such a use case. A typical problem with track files is that information about the reference genome is in many cases not part of the file itself, or, if it is, is often not recorded in a unique way. (Track is a term I use broadly, btw, including SAM/BAM and mostly everything downstream, as long as it relates to the sequence file coordinates.)
This is also an argument for including the topology part, as a parser would like to know that information. Adding #8 here to create a link from that issue to this post.
Cool idea. I really like that and am wondering why this is not a common syntax everywhere...

I think this is a really good encoding scheme.
As I think about it more, there are a few downsides of the counted strings that I wanted to raise.

First, there is the minor issue that it increases the size of the string to digest, which, since we'll be storing that in a database, will increase data use. It is nice that you can accommodate any characters in the fields this way, but in this situation, since those fields are restricted, we don't actually need that, and can just use commas and backslashes. So, it's extra space used without an actual benefit. I suppose this isn't that big of a deal, though, as the extra space is small, and perhaps it's worth the benefit that the algorithm is more universal and could therefore be re-used.

More importantly, consider how to parse the string. With raw delimiters, I can just say "split the string on comma", for which typical languages have built-in functions. But in the counted-string case, the parser has to be more sophisticated, since those delimiters could end up embedded within a field. This is not as easy to parse... I think I'd have to write some kind of state machine that reads character-by-character, stops when reaching a colon, then reads X characters for that field, etc. Maybe not a dealbreaker, but it puts more burden on someone implementing the parsing, and likely also on the time to parse. Of course this is possible, but it's clearly more work than a string split, and I'm unaware of existing Python libraries that do this.

For these two reasons, even though I really like the idea of counted strings in general, I am now leaning back toward just this:
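That is, presumably something like this (an illustration based on the example strings earlier in the thread, with comma delimiting attributes and backslash delimiting items):

```
chr1,248956422,2648ae1bacce4ec4b6cf337dcae37816\chr2,242193529,4bb4f82880a14111eb7327169ffb729b
```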
Notice I also switched from forward slash to backslash, since forward slash is allowed in the name in the SAM spec, and backslash is not. Am I thinking about it correctly?
(1) Whether this will be stored in a database is an implementation detail; many implementations will not store it. (2) The increase is trivial.
It is true that code would have to be written to parse the string from left to right, rather than just using a single split call.
DO NOT make me regret suggesting this by advocating the Microsoftian horror of abusing backslash for use as a separator 🤮 If you want a record terminator that is disallowed as a SAM RNAME character, I would suggest …

But once again, thinking about this now is putting the cart before the horse. If in #2 we choose to follow the SAM RNAME rule, then we can just separate the fields with commas. If we choose to allow more arbitrary characters in sequence names, then a counted-string approach in some form would be advantageous.
So I tried to implement this in Python as a test, but it ended up being very cumbersome. There is probably some Python wizardry that can be used to do such parsing in an elegant way, but I agree, writing parsing code for such a digest does indeed feel very awkward.
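For what it's worth, here is a minimal sketch of what that left-to-right parsing could look like in Python (assuming the illustrative `4:chr1,...` counted-string form sketched above, with `/` as a hypothetical item delimiter):

```python
def parse_counted(s: str) -> list[list[str]]:
    """Parse a counted-string digest into items of fields.

    Assumes the illustrative format sketched above: each field is
    '<length>:<content>', fields joined with ',', items with '/'.
    """
    items, fields = [], []
    i = 0
    while i < len(s):
        colon = s.index(":", i)            # find the end of the count
        n = int(s[i:colon])                # decimal character count
        start = colon + 1
        fields.append(s[start:start + n])  # take exactly n characters
        i = start + n
        if i == len(s):
            break
        if s[i] == ",":                    # next field in the same item
            i += 1
        elif s[i] == "/":                  # next item in the collection
            items.append(fields)
            fields = []
            i += 1
        else:
            raise ValueError(f"unexpected character at position {i}")
    items.append(fields)
    return items


example = ("4:chr1,9:248956422,32:2648ae1bacce4ec4b6cf337dcae37816"
           "/4:chr2,9:242193529,32:4bb4f82880a14111eb7327169ffb729b")
print(parse_counted(example))
# [['chr1', '248956422', '2648ae1bacce4ec4b6cf337dcae37816'],
#  ['chr2', '242193529', '4bb4f82880a14111eb7327169ffb729b']]
```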
Yes, I was going to mention your last point. Using backslash as a delimiter would create havoc when, e.g., copy/pasting some digest into a manually typed Python string, as the backslash would then be interpreted as an escape character.
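A quick demonstration with the digest values used above (note how `\4` becomes an octal escape, silently corrupting the pasted string):

```python
# Pasting a backslash-delimited string into Python source: "\4" is read
# as an octal escape, so the literal silently loses the backslash and
# the leading character of the second digest.
s = "2648ae1bacce4ec4b6cf337dcae37816\4bb4f82880a14111eb7327169ffb729b"
print(repr(s))
# '2648ae1bacce4ec4b6cf337dcae37816\x04bb4f82880a14111eb7327169ffb729b'
```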
In the latest proposal for digest calculations, we switched to a simpler recursive array-of-arrays structure, which now requires only one delimiter (see #10). In that proposal, I'm using the comma as the one delimiter.
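As a rough illustration of why one delimiter can suffice in a recursive scheme (a sketch of the general idea only; the actual proposal is specified in #10):

```python
import hashlib

def digest_array(arr) -> str:
    """Digest a possibly nested array using a single comma delimiter.

    Inner arrays are first reduced to their own digests, so the joined
    string contains commas only as delimiters (assuming the leaf values
    themselves, e.g. SAM-restricted names, cannot contain commas).
    """
    parts = [digest_array(x) if isinstance(x, list) else str(x) for x in arr]
    return hashlib.md5(",".join(parts).encode("utf-8")).hexdigest()

seqcol = [
    ["chr1", 248956422, "2648ae1bacce4ec4b6cf337dcae37816"],
    ["chr2", 242193529, "4bb4f82880a14111eb7327169ffb729b"],
]
print(digest_array(seqcol))
```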
Ultimately, we went with a JSON canonicalization, which essentially delegates the problem to a third-party standard and makes the discussion of delimiters irrelevant. This decision is captured in the ADR of 2023-01-12.
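For illustration, canonical JSON serialization can be approximated in Python as follows (this sketch covers the common cases of RFC 8785-style canonicalization, i.e. sorted keys and no insignificant whitespace, but is not a complete JCS implementation):

```python
import json

def canonical_json(obj) -> bytes:
    # Sorted keys, no insignificant whitespace, UTF-8 output. A full
    # RFC 8785 implementation also pins down number formatting and
    # string-escaping details.
    return json.dumps(
        obj, separators=(",", ":"), sort_keys=True, ensure_ascii=False
    ).encode("utf-8")

print(canonical_json({"length": 248956422, "name": "chr1"}))
# b'{"length":248956422,"name":"chr1"}'
```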
We'd like to solicit community feedback on the delimiters to be codified in the official seqcol algorithm. You can read the draft specification here.
Brief explanation of what the delimiters will be used for
The seqcol spec requires 2 different delimiters:

`ATTRIBUTE_DELIMITER` is used to delimit the attributes within a single item in the string we digest. For example, if `|` were the delimiter, the string to digest could be `chr1|248956422|2648ae1bacce4ec4b6cf337dcae37816`, where `|` delimits the name, length, and refget digest fields.

`ITEM_DELIMITER` is used to delimit the individual items (sequences) in the collection. Since a collection contains more than one item, which we will concatenate, we need a delimiter. So, if `/` were the item delimiter, for a collection with 2 sequences, we'd have: `chr1|248956422|2648ae1bacce4ec4b6cf337dcae37816/chr2|242193529|4bb4f82880a14111eb7327169ffb729b`.
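To make the construction concrete, the string to digest could be assembled like this (a minimal sketch; the delimiter values are just the examples above, not a final choice):

```python
ATTRIBUTE_DELIMITER = "|"  # example value only
ITEM_DELIMITER = "/"       # example value only

sequences = [
    ("chr1", 248956422, "2648ae1bacce4ec4b6cf337dcae37816"),
    ("chr2", 242193529, "4bb4f82880a14111eb7327169ffb729b"),
]

string_to_digest = ITEM_DELIMITER.join(
    ATTRIBUTE_DELIMITER.join(str(attr) for attr in item)
    for item in sequences
)
print(string_to_digest)
# chr1|248956422|2648ae1bacce4ec4b6cf337dcae37816/chr2|242193529|4bb4f82880a14111eb7327169ffb729b
```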
Proposal 1: human-readable whitespace characters
One proposal argues that we should use human-readable characters: `\t` (tab) and `\n` (Unix newline). The nice thing about this proposal is that you could just print the string and it would be visually appealing. Downsides include that the newline character is not cross-platform, and that some text editors may replace tabs with spaces.
Proposal 2: use ASCII non-printing character (NPC) separator delimiters
The arguments for proposal 2 are nicely laid out on the Wikipedia page for delimiter collision. We can restrict the name field from officially allowing whitespace characters, but this won't necessarily prevent people from trying it, meaning we'll require more checks. The non-printing ASCII separator characters (such as US and RS) were created for exactly this purpose and avoid many of the potential problems of using visible characters. We could design simple printing functions that would replace them for visualization purposes.
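For example, a string built with the ASCII unit separator (US, 0x1F) and record separator (RS, 0x1E) could be made readable again with a trivial helper like this (a sketch; the character choices are illustrative):

```python
ATTRIBUTE_DELIMITER = "\x1f"  # ASCII unit separator (US), illustrative
ITEM_DELIMITER = "\x1e"       # ASCII record separator (RS), illustrative

s = ATTRIBUTE_DELIMITER.join(
    ["chr1", "248956422", "2648ae1bacce4ec4b6cf337dcae37816"]
)

def visualize(digest_string: str) -> str:
    """Replace the non-printing delimiters with printable stand-ins."""
    return (digest_string
            .replace(ATTRIBUTE_DELIMITER, "|")
            .replace(ITEM_DELIMITER, "\n"))

print(visualize(s))
# chr1|248956422|2648ae1bacce4ec4b6cf337dcae37816
```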
Other options
If you like the general idea behind proposal 1 or 2, but prefer different specific delimiters (like a space, or something else), then please chime in. Here's a list of non-printable ASCII characters.