
What characters should we use for delimiters? #1

Closed
nsheff opened this issue Nov 11, 2020 · 16 comments

@nsheff
Member

nsheff commented Nov 11, 2020

We'd like to solicit community feedback on the delimiters to be codified in the official seqcol algorithm. You can read the draft specification here.

Brief explanation of what the delimiters will be used for

The seqcol spec requires 2 different delimiters:

  • ATTRIBUTE_DELIMITER delimits the attributes within a single item in the string we digest. For example, if | were the delimiter, the string to digest could be chr1|248956422|2648ae1bacce4ec4b6cf337dcae37816, where | separates the name, length, and refget digest fields.

  • ITEM_DELIMITER delimits the individual items (sequences) in the collection. Since a collection contains more than one item, which we will concatenate, we need a delimiter. So, if / were the item delimiter, for a collection with 2 sequences we'd have: chr1|248956422|2648ae1bacce4ec4b6cf337dcae37816/chr2|242193529|4bb4f82880a14111eb7327169ffb729b.
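The way the two delimiters compose can be sketched in a few lines of Python (using | and / purely for illustration; choosing the real characters is what this issue is about, and build_digest_string is a hypothetical helper, not part of the spec):

```python
# Sketch: joining attributes within an item, then joining items.
# '|' and '/' are placeholders for whatever delimiters we choose.
ATTRIBUTE_DELIMITER = "|"
ITEM_DELIMITER = "/"

def build_digest_string(items):
    """Join each item's attributes, then join the items."""
    return ITEM_DELIMITER.join(
        ATTRIBUTE_DELIMITER.join(str(attr) for attr in item) for item in items
    )

collection = [
    ("chr1", 248956422, "2648ae1bacce4ec4b6cf337dcae37816"),
    ("chr2", 242193529, "4bb4f82880a14111eb7327169ffb729b"),
]
print(build_digest_string(collection))
# chr1|248956422|2648ae1bacce4ec4b6cf337dcae37816/chr2|242193529|4bb4f82880a14111eb7327169ffb729b
```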

Proposal 1: human-readable whitespace characters

One proposal argues that we should use human-readable characters:

  • Attribute delimiter: \t (tab)
  • Item delimiter: \n (unix newline).

The nice thing about this proposal is that you could just print this string and it would be visually appealing. Downsides include that the newline character is not cross-platform, and that some text editors may replace tabs with spaces.

Proposal 2: use ASCII non-printing character (NPC) separator delimiters

  • Attribute delimiter: UNIT SEPARATOR Position 31 (U+001F)
  • Item delimiter: RECORD SEPARATOR Position 30 (U+001E)

The arguments for proposal 2 are nicely laid out on the Wikipedia page for delimiter collision:

Because delimiter collision is a very common problem, various methods for avoiding it have been invented. Some authors may attempt to avoid the problem by choosing a delimiter character (or sequence of characters) that is not likely to appear in the data stream itself. This ad hoc approach may be suitable, but it necessarily depends on a correct guess of what will appear in the data stream, and offers no security against malicious collisions.

We can restrict the name field from officially allowing whitespace characters, but this won't necessarily prevent people from trying that, meaning we'll require more checks. The non-printing ASCII characters were explicitly created for this purpose and avoid a lot of the potential problems of using visible characters. We could design simple printing functions that would replace them for visualization purposes.
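A minimal sketch of how proposal 2 and such a printing function might look (visualize is a hypothetical helper for display only; the separators themselves never change):

```python
# Sketch of proposal 2: ASCII unit/record separators as delimiters,
# plus a simple printing helper that substitutes visible characters.
UNIT_SEP = "\x1f"    # US, U+001F: attribute delimiter
RECORD_SEP = "\x1e"  # RS, U+001E: item delimiter

digest_string = RECORD_SEP.join([
    UNIT_SEP.join(["chr1", "248956422", "2648ae1bacce4ec4b6cf337dcae37816"]),
    UNIT_SEP.join(["chr2", "242193529", "4bb4f82880a14111eb7327169ffb729b"]),
])

def visualize(s):
    """Swap non-printing separators for '|' and newline when displaying."""
    return s.replace(UNIT_SEP, "|").replace(RECORD_SEP, "\n")

print(visualize(digest_string))
```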

Other options

If you like the general idea behind proposal 1 or 2, but prefer different specific delimiters (like a space, or something else), then please chime in. Here's a list of non-printable ASCII characters.

@andrewyatz
Collaborator

Pinging @jmarshall for input on this.

@andrewyatz
Collaborator

Also paging @yfarjoun and @lbergelson for input. We had discussed previously using the dict format for this. However this might be limiting in the types of other uses-cases the underlying seqcol framework could be used for. Ultimately it's a framework for associating key metadata with a checksum and allowing comparison functions to run. There are lots of other ideas around this.

@yfarjoun

I would recommend \t (0x09) for the attribute delimiter and \n (0x0A) for the item delimiter. This works fine in the current implementations of sequence dictionaries (SAM and VCF) and thus should work here as well.

@daviesrob
Contributor

My vote would also be for \t and \n.

@jmarshall
Member

We had previously discounted whitespace characters for the reasons noted above, and for ease of exposition and clarity in a textual specification. Similar arguments apply to other control characters (US/RS/etc).

If the digest and hashing were a one-way transformation, there would be no reason not to use ordinary punctuation characters as delimiters. However, some proposed use cases involve parsing the unhashed digest, which would place additional constraints on this.

Hence really this delimiter question is subordinate to the question about whether to include use cases that involve reversing and parsing the digest.

@nsheff
Member Author

nsheff commented Nov 25, 2020

Thanks @jmarshall

some proposed use cases involve parsing the unhashed digest... Hence really this delimiter question is subordinate to the question about whether to include use cases that involve reversing and parsing the digest.

I think we do include use cases that involve reversing and parsing the digest. The way I envision it, the database lookup component that is fundamental to seqcol would require this. For example, I have a digest, say ze829vl for short. I hit the API with /seqcol/ze829vl to return the sequences. Internally, the software first looks up the value in a database, which would be the delimited string. It parses this string, to return the collection of sequences, perhaps in JSON or in FASTA format or whatever. However it returns it, it will ideally have parsed those delimiters to return some kind of useful structured response. We also need to parse to do things like compatibility functions.

My perception from the discussion today was that attendees favored the \t and \n delimiters. Personally, I have a slight preference for using RS and US, but it's not a strong preference, so I think we're closing in on a decision in favor of the whitespace characters.

@jmarshall, given that parsing the digest will be critical, what do you suggest we use for the delimiters?

@jmarshall
Member

I have been envisaging the fundamental lookup process as being: the server looks up the client's provided ze829vl digest value in its database or other data store, and finds the corresponding list of sequences in whatever format is convenient for its way of implementing its data store (e.g. it could be several rows of sequence rowids or some tuple of similar sequence ids of some kind). From that it directly accesses the requested data of the sequences. Or if it wanted to it could reconstruct the unhashed digest string from the metadata recorded with the sequences — but this would not involve already knowing the unhashed digest string or parsing it.

In previous discussions, we decided against using whitespace characters for reasons of ease of exposition and implementation: avoiding ambiguity in the textual spec about whether the blank space on the page represents a tab or a couple of spaces or some mixture, and avoiding any questions or implementation bugs about whether the newlines are \n or \r\n or whether there might or might not be a newline at the end of the last entry. (These are all issues that have come up in BED or in BAM headers. It's even worse here because you can't just parse generously and accept all variations: any change in the input will lead to a completely different hash value, and it seems like a bad design to have invisible characters going into a hash calculation; cf refget's existing sequence checksum calculation which explicitly canonicalises by first removing whitespace et al.)

I believe somewhere someone mentioned a possible use case of having the unhashed digest string as the value of e.g. a SAM (header) tag or VCF header, i.e., embedding it as a value in some other textual format. IMHO this is an interesting possibility that we should enable. It would not be feasible if the string contained whitespace, and only barely feasible if the string contained other non-printable ASCII control characters.

If you don't have to parse the unhashed digest string (only hash it and do the resulting lookups), you can use pretty much any delimiters as the string merely needs to be unambiguous and avoid collisions, but does not need to be parseable. Even if being able to parse it is also desirable to enable other use cases (okay, 👍 from me), most of the items are constrained enough that there is no problem using ordinary punctuation characters anyway.

Last I looked there was still a question about what metadata items would go into the digest. Notably whether circular would be a factor. But supposing the entries are sequence name, length, refget digest, and topology: the only one potentially containing any real punctuation characters is the name. Hence I would favour something like

{chr1},248956422,2648ae1bacce4ec4b6cf337dca
e37816,linear/{chr2},242193529,4bb4f82880a1
4111eb7327169ffb729b,linear/

(which I have displayed across several lines for ease of presentation — the real string value is all on one line and contains no whitespace)

This is clear and error-resistant, human-readable, embeddable in other tab-separated formats, and easy to describe and show in the spec text. The spec will need to be clear about whether the / character is a record separator or a record terminator; either way, when the character is visible, it's clear from the examples in the spec text which of those it is. IMHO a terminator is generally easier for implementations.

(But really I think the question of delimiters can't be properly answered until we've converged on an answer to exactly which sequence metadata items contribute to the digest and exactly what the rules are about whether they may or may not be present.)

@jmarshall
Member

Hmm… where have I seen counted strings being used? e.g.

4:chr1,248956422,2648ae1bacce4ec4b6cf337dca
e37816,6:linear/4:chr2,242193529,4bb4f82880
a14111eb7327169ffb729b,6:linear/

Now there are zero concerns about what characters can appear in sequence names…

@nsheff
Member Author

nsheff commented Dec 9, 2020

I like this. Just a few initial thoughts:

I have been envisaging the fundamental lookup process as being: the server looks up the client's provided ze829vl digest value in its database or other data store, and finds the corresponding list of sequences in whatever format is convenient for its way of implementing its data store

Yes, I think this is possible. In my implementation, I store just the actual digested string, along with the digest. So, the data model is extremely simple and I can just use a key-value store. In addition to simplifying the database structure, an advantage of doing it this way is that I can use the same backend/store for refget and seqcol (or anything really). But I guess these are implementation details...

I believe somewhere someone mentioned a possible use case of having the unhashed digest string as the value of e.g. a SAM (header) tag or VCF header, i.e., embedding it as a value in some other textual format. IMHO this is an interesting possibility that we should enable. It would not be feasible if the string contained whitespace, and only barely feasible if the string contained other non-printable ASCII control characters.

Ok, I think this is definitely worth discussing. I don't have enough experience with these header formats, but in principle I agree. So, I think your argument for using "normal punctuation" makes sense. It also makes it easier to actually construct the string to digest.

you can use pretty much any delimiters as the string merely needs to be unambiguous and avoid collisions, but does not need to be parseable.

You're right. In fact, if we never needed to parse it, you wouldn't even need delimiters. So I think you're correct that the delimiters are needed solely to make it parseable. But since we do want it to be parseable, we do need them.

Last I looked there was still a question about what metadata items would go into the digest. Notably whether circular would be a factor.

Well, the 'circular' would be a possible entry under 'topology'. So, I don't think that's a different entry. But yes, we should make sure this is finalized. I created a new issue for this discussion: #8.

where have I seen counted strings...

That's an interesting approach. Those prefixed numbers indicate the number of characters in the sequence name (or topology field), right? I think that would indeed solve the problem...

But do we need that? If we go with the SAM spec, which seems likely, then commas and backslashes are not allowed in sequence names. In that case, couldn't we just use commas and backslashes, since we can guarantee that none of the fields would use them? But I like the counted-strings approach because it's even more universal, and could therefore be implemented in a back-end that used a similar approach for other data which may not have the same restriction...

Would you want to also count the string lengths on the other, non-name components, just to make it universal?
So:

4:chr1,9:248956422,32:2648ae1bacce4ec4b6cf337dca
e37816,6:linear/4:chr2,9:242193529,32:4bb4f82880
a14111eb7327169ffb729b,6:linear/

(But really I think the question of delimiters can't be properly answered until we've converged on an answer to exactly which sequence metadata items contribute to the digest and exactly what the rules are about whether they may or may not be present.)

Agreed, let's follow that up in #8, where I have just outlined what I have envisioned at the moment.

@sveinugu
Collaborator

sveinugu commented Dec 9, 2020

I believe somewhere someone mentioned a possible use case of having the unhashed digest string as the value of e.g. a SAM (header) tag or VCF header, i.e., embedding it as a value in some other textual format. IMHO this is an interesting possibility that we should enable. It would not be feasible if the string contained whitespace, and only barely feasible if the string contained other non-printable ASCII control characters.

Ok, I think this is definitely worth discussing. I don't have enough experience with these header formats, but in principle I agree. So, I think your argument for using "normal punctuation" makes sense. It also makes it easier to actually construct the string to digest.

I also agree with supporting such a use case. A typical problem with track files is that the information about the reference genome is in many cases not part of the file itself, or if it is, often not in a unique way. (Btw, I use "track" broadly here, including SAM/BAM and nearly everything downstream, as long as it relates to the sequence coordinates.)
So the hashed digest version of seqcol could be used to remedy this, but users of the file who would need e.g. length info (and most parsers would need that) would then have to look up the digest in some repo (and might even have to try several places). So having a canonical way of specifying the sequence collection in unhashed form is extremely useful for automatic parsing and would in that way work as a replacement of the @SQ header in SAM, but in almost any kind of file format.

you can use pretty much any delimiters as the string merely needs to be unambiguous and avoid collisions, but does not need to be parseable.

You're right. In fact, if we never needed to parse it, you wouldn't even need delimiters. So, I think you're correct that the delimiters are solely needed to make it parseable. But since we do want it to be parsable, we do need them.

Last I looked there was still a question about what metadata items would go into the digest. Notably whether circular would be a factor.

Well, the 'circular' would be a possible entry under 'topology'. So, I don't think that's a different entry. But yes, we should make sure this is finalized. I created a new issue for this discussion: #8.

This is also an argument for including the topology part, as a parser would like to know that information. Adding #8 here to create a link from that issue to this post.

where have I seen counted strings...

That's an interesting approach. Those prefixed numbers indicate the number of characters in the sequence name (or topology field), right? I think that would indeed solve the problem...

But do we need that? If we go with the SAM spec, which seems likely, then commas and backslashes are not allowed in sequence names. In that case, couldn't we just use commas and backslashes, since we can guarantee that none of the fields would use them? But I like the counted-strings approach because it's even more universal, and could therefore be implemented in a back-end that used a similar approach for other data which may not have the same restriction...

Would you want to also count the string lengths on the other, non-name components, just to make it universal?
So:

4:chr1,9:248956422,32:2648ae1bacce4ec4b6cf337dca
e37816,6:linear/4:chr2,9:242193529,32:4bb4f82880
a14111eb7327169ffb729b,6:linear/

Cool idea. I really like that and am wondering why this is not a common syntax everywhere...

@andrewyatz
Collaborator

I think this is a really good encoding scheme.

@nsheff
Member Author

nsheff commented Dec 10, 2020

As I think about it more, I think there are a few downsides of the counted strings that I wanted to raise:

First, there is the minor issue that it increases the size of the string to digest, which, since we'll be storing that in a database, will increase storage use. It is nice that you can accommodate any characters in the fields this way, but in this situation, since those fields are restricted, we don't actually need that and can just use commas and backslashes. So it's extra space used without an actual benefit. I suppose this isn't a big deal, though, as the extra space is small, and perhaps it's worth the benefit that the algorithm is more universal and could therefore be re-used.

More importantly, consider how to parse the string. With raw delimiters, I can just say "split the string on comma", for which typical languages have built-in functions. But in the counted-string case, the parser has to be more sophisticated, since those delimiters could end up embedded within a field. This is not as easy to parse... I think I'd have to write some kind of state machine that reads character by character, stops when reaching a colon, then reads X characters for that field, etc. Maybe not a dealbreaker, but it puts more burden on anyone implementing the parsing, and likely also on parsing time. Of course this is possible, but it's clearly more work than a string split. I'm unaware of existing Python libraries to do this.

For these two reasons, even though I really like the idea of the counted strings in general, I am now leaning back toward just this:

chr1,248956422,2648ae1bacce4ec4b6cf337dca
e37816,linear\chr2,242193529,4bb4f82880
a14111eb7327169ffb729b,linear\

Notice I also switched from forward slash to backslash, since forward-slash is allowed in the name in the SAM spec, and backslash is not. Am I thinking about it correctly?

@jmarshall
Member

First, there is the minor issue that it increases size of the string to digest, which, since we'll be storing that in a database

(1) Whether this will be stored in a database is an implementation detail; many implementations will not store it. (2) The increase is trivial.

But with the counted string case, the parser is going to have to be more sophisticated, since those delimiters could end up embedded within a field.

It is true that code would have to be written to parse the string from left to right, rather than just using split twice. This wouldn't be particularly onerous, and it would have to be a pretty unfortunate implementation to be substantially slower.
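For what it's worth, the left-to-right parse fits in a couple dozen lines of plain Python (a sketch assuming the four-field record layout from the examples above; parse_counted is a hypothetical name, not part of any spec):

```python
# Sketch of a left-to-right parser for the counted-string encoding,
# assuming records of the form "N:name,length,digest,N:topology/".
def parse_counted(s):
    records, i = [], 0
    while i < len(s):
        colon = s.index(":", i)              # name is count-prefixed
        n = int(s[i:colon])
        name = s[colon + 1:colon + 1 + n]
        i = colon + 1 + n + 1                # skip name and its comma
        comma = s.index(",", i)              # length: plain integer field
        length = int(s[i:comma])
        i = comma + 1
        comma = s.index(",", i)              # refget digest: hex, no commas
        digest = s[i:comma]
        i = comma + 1
        colon = s.index(":", i)              # topology is count-prefixed
        n = int(s[i:colon])
        topology = s[colon + 1:colon + 1 + n]
        i = colon + 1 + n + 1                # skip topology and the '/'
        records.append((name, length, digest, topology))
    return records

s = ("4:chr1,248956422,2648ae1bacce4ec4b6cf337dcae37816,6:linear/"
     "4:chr2,242193529,4bb4f82880a14111eb7327169ffb729b,6:linear/")
print(parse_counted(s))
# [('chr1', 248956422, '2648ae1bacce4ec4b6cf337dcae37816', 'linear'),
#  ('chr2', 242193529, '4bb4f82880a14111eb7327169ffb729b', 'linear')]
```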

Notice I also switched from forward slash to backslash, since forward-slash is allowed in the name in the SAM spec, and backslash is not.

DO NOT make me regret suggesting this by advocating the Microsoftian horror of abusing backslash for use as a separator 🤮 If you want a record terminator that is disallowed as a SAM RNAME character, I would suggest < à la the machine readable part of passports. (There is also a non-taste-based argument against backslash: \ is likely to be used as an escape character by formats that we may be embedded within—if we facilitate that possibility—, so it would be unfortunate to assign it a meaning at this level too.)

But once again, thinking about this now is putting the cart before the horse. If in #2 we choose to follow the SAM RNAME rule, then we can just separate the fields with commas. If we choose to allow more arbitrary characters in sequence names, then a counted string approach in some form would be advantageous.

@sveinugu
Collaborator

First, there is the minor issue that it increases size of the string to digest, which, since we'll be storing that in a database

(1) Whether this will be stored in a database is an implementation detail; many implementations will not store it. (2) The increase is trivial.

But with the counted string case, the parser is going to have to be more sophisticated, since those delimiters could end up embedded within a field.

It is true that code would have to be written to parse the string from left to right, rather than just using split twice. This wouldn't be particularly onerous, and it would have to be a pretty unfortunate implementation to be substantially slower.

So I tried to implement this in Python as a test, but it ended up being very cumbersome. There is probably some Python wizardry that could do such parsing in an elegant way, but I agree that writing parsing code for such a digest does indeed feel very awkward.

Notice I also switched from forward slash to backslash, since forward-slash is allowed in the name in the SAM spec, and backslash is not.

DO NOT make me regret suggesting this by advocating the Microsoftian horror of abusing backslash for use as a separator 🤮 If you want a record terminator that is disallowed as a SAM RNAME character, I would suggest < à la the machine readable part of passports. (There is also a non-taste-based argument against backslash: \ is likely to be used as an escape character by formats that we may be embedded within—if we facilitate that possibility—, so it would be unfortunate to assign it a meaning at this level too.)

Yes, I was going to mention your last point. Using backslash as a delimiter would create havoc when e.g. copy/pasting a digest manually into a Python string, as the backslash would then be interpreted as an escape character.

But once again, thinking about this now is putting the cart before the horse. If in #2 we choose to follow the SAM RNAME rule, then we can just separate the fields with commas. If we choose to allow more arbitrary characters in sequence names, then a counted string approach in some form would be advantageous.

@nsheff
Member Author

nsheff commented Apr 21, 2021

In the latest proposal for digest calculations, we switched to a simpler recursive array-of-arrays structure, which now requires only one delimiter (See #10). In that proposal, I'm using the comma as the one delimiter.

@nsheff
Member Author

nsheff commented Feb 21, 2024

Ultimately, we went with a JSON canonicalization, which essentially provides a third-party standard and makes the discussion of delimiters irrelevant. This decision is captured in the ADR on 2023-01-12.
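For context, a minimal sketch of the canonical-JSON direction (illustrative only: sorted keys and no insignificant whitespace approximate RFC 8785-style canonicalization; the actual seqcol algorithm defines its own structure and digest function, so treat the sha512 call here as a stand-in):

```python
# Sketch: serialize the collection to canonical JSON and digest that,
# so no custom delimiters are needed at all.
import hashlib
import json

collection = {
    "names": ["chr1", "chr2"],
    "lengths": [248956422, 242193529],
    "sequences": ["2648ae1bacce4ec4b6cf337dcae37816",
                  "4bb4f82880a14111eb7327169ffb729b"],
}

# Canonical form: sorted keys, no insignificant whitespace.
canonical = json.dumps(collection, sort_keys=True, separators=(",", ":"))
digest = hashlib.sha512(canonical.encode("utf-8")).hexdigest()
print(canonical)
```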

@nsheff nsheff closed this as completed Feb 21, 2024