Perhaps always add a new line after record / group / file separators #3
Comments
Yes, you're correct, this is a pain point. I've wrestled with this exact area in many projects during the past few years of working with USV. What I've learned in my own use of USV is that developer ergonomics in typical editors really do need the newlines. In fact, I use two newlines before and after each separator. You can see a real-world USV file here that uses the extra newlines: https://github.com/SixArm/sixarm-data-ilo-isco/blob/main/ilo-isco-2008.usv I've experimented with approaches such as allowing surrounding newlines, which the parser must then skip or delete. This is friendlier for the person editing, yet much harder for tiny parsers, and Unix commands such as ... As a side note, I've also hit some issues with whitespace differing across platforms, such as CR-LF versus \n. So far, the solution that seems to work best in practice is a compromise: the person editing uses newlines as desired (which is still valid USV); the USV parser only uses the 4 separator characters (and thus preserves whitespace); the application programmer can then choose to write their own additional whitespace-stripping step. Thoughts about these areas?
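To make the compromise above concrete, here is a minimal sketch in Python. It rests on assumptions that are not stated in this thread: that the unit and record separators are the Unicode symbols U+241F and U+241E, and that each one terminates (rather than joins) its unit or record. The function names parse_records and strip_units are hypothetical, chosen only for illustration.

```python
US = "\u241F"  # assumed unit separator symbol
RS = "\u241E"  # assumed record separator symbol

def parse_records(text):
    """Split on the separators only; whitespace inside units is kept verbatim."""
    records = []
    # Assumption: every record ends with RS, so the piece after the
    # final RS (often just trailing newlines) is not a record.
    for chunk in text.split(RS)[:-1]:
        # Assumption: every unit ends with US, so the piece after the
        # final US in a record is not a unit.
        records.append(chunk.split(US)[:-1])
    return records

def strip_units(records):
    """Optional application-level step: trim whitespace around each unit."""
    return [[unit.strip() for unit in record] for record in records]

# Two records, with two newlines after each record separator.
text = "a" + US + "b" + US + RS + "\n\n" + "c" + US + "d" + US + RS + "\n\n"
raw = parse_records(text)      # newlines survive at the start of "c"
clean = strip_units(raw)       # newlines removed by the application
```

Here the parser stays tiny because it never looks at whitespace; whether to strip is left to the caller, which is exactly the trade-off being debated.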
I find this approach very problematic from the specification and selling point of view... I'll quote from the repository's readme:
So, the above quote promises that USV would solve the issue of wrestling with whitespace, escaping, quoting, etc. However, the approach you've suggested in the reply is contrary to this, namely "let the programmer handle additional whitespace stripping". Thus I think any additional whitespace needed to make USV text actually readable / writable with ordinary tools should be part of the specification, not something each user has to figure out on their own. (For example, I always use TSV as opposed to CSV, because there is no "CSV standard", and each tool does escaping and quoting in its own way.)
This was exactly my concern.
However, given that most (all?) tools do expect / write a
In terms of editors, in my experience:
However, given the above (perhaps except the last point), I think the safest choice is to just require a
Very good comments. You're providing excellent constructive feedback, which I greatly appreciate. I have time this weekend to dig into this and think about it more carefully.
Spec now allows liners (i.e. carriage returns and line feeds) around content.
The current USV specification (without touching the issues described in #2) is a very nice and simple one, at least from a technical point of view.
However, it has one major drawback: loading a large TSV or CSV file in a "dumb" text editor works fine, because in TSV and CSV records are separated by newlines; loading a large USV file in such a "dumb" text editor, however, would give the user a single never-ending line.
This causes at least the following major problems:
... less pager, when presented with a \0 terminated file, would just become unresponsive trying to handle what it perceives as a file with a single huge line;)
... \n after the last unit, because most of them are expected to properly handle line-ended files;
Thus I propose that each record / group / file should be terminated not only by the Unicode separator but also by a newline (i.e. \n).
This change does somewhat "break" the simplicity of the specification, but it does make it more practical.
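As a rough illustration of the proposal, the sketch below writes each record followed by its separator and then a \n, so that line-oriented tools see one record per line. It assumes the separators are the Unicode symbols U+241F (unit) and U+241E (record) and that each unit is terminated by its separator; dump_records is a hypothetical name, not part of any USV tooling.

```python
US = "\u241F"  # assumed unit separator symbol
RS = "\u241E"  # assumed record separator symbol

def dump_records(records):
    """Serialize records, ending each with the record separator plus a newline."""
    lines = []
    for record in records:
        # Terminate every unit with US, then close the record with RS and "\n".
        body = "".join(unit + US for unit in record)
        lines.append(body + RS + "\n")
    return "".join(lines)

output = dump_records([["a", "b"], ["c", "d"]])
# Each record now sits on its own line when the output is opened in a pager or editor.
line_count = output.count("\n")
```

The cost is that a unit whose content legitimately ends in a newline becomes ambiguous unless the specification says how the trailing \n is treated, which is the simplicity trade-off mentioned above.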