Perhaps always add a new line after record / group / file separators #3

Closed
cipriancraciun opened this issue May 5, 2022 · 5 comments

@cipriancraciun

The current USV specification (without touching the issues described in #2) is a very nice and simple one. At least from a technical point of view.

However, it has one major drawback. Loading a large TSV or CSV file in a "dumb" text editor works fine, because TSV and CSV records are separated by newlines; loading a large USV file in such an editor gives the user a single never-ending line.

This causes at least the following major problems:

  • most editors would struggle with such a long line (for example the well-known less pager, when presented with a \0-terminated file, becomes unresponsive trying to handle what it perceives as a file with a single huge line);
  • some (most?) editors would introduce a \n after the last unit on save, because most of them are expected to properly handle files whose lines end with a newline;
  • the user can't easily see where a record ends, because they have to visually search for the separator.

Thus I propose that each record / group / file should be terminated not only by the unicode separator but also by a new line (i.e. \n).

This change does "break" somewhat the simplicity of the specification, but it does make it more practical.
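
To make the proposal concrete, here is a minimal sketch of a writer that emits a \n after every record separator. It assumes the separators are the Unicode symbols U+241F (unit) and U+241E (record) and that each unit and record is terminated by its separator; the function name `dump_usv` and its flag are hypothetical and not part of any USV library.

```python
# Assumption: units end with U+241F and records end with U+241E.
US = "\u241F"  # symbol for unit separator
RS = "\u241E"  # symbol for record separator


def dump_usv(records, newline_after_record=True):
    """Serialize rows of strings; optionally add a newline after each record."""
    out = []
    for record in records:
        # Every unit is followed by a unit separator.
        out.append("".join(unit + US for unit in record))
        # Every record is followed by a record separator.
        out.append(RS)
        if newline_after_record:
            # The proposed addition: one line per record in a text editor.
            out.append("\n")
    return "".join(out)


rows = [["a", "b"], ["c", "d"]]
text = dump_usv(rows)
single_line = dump_usv(rows, newline_after_record=False)
```

With the flag on, a line-oriented tool sees one line per record; with it off, the whole file is a single line.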

@joelparkerhenderson
Member

Yes you're correct this is a pain point. I've wrestled with this exact area in many projects during the past few years of working with USV.

What I've learned in my own use of USV is that developer ergonomics in typical editors really do need the newlines. In fact I use two newlines before and after each separator. You can see a real-world USV file here that uses the extra newlines: https://github.com/SixArm/sixarm-data-ilo-isco/blob/main/ilo-isco-2008.usv

I've experimented with approaches such as allowing surrounding newlines, which the parser must then skip or delete. This is friendlier for the person editing, yet much harder for tiny parsers, for Unix commands such as tr, and for commands that treat lines as data separators, such as grep.

As a side note, I've also hit some issues with whitespace on different platforms being different, such as CR-LF versus \n.

So far, the solution that seems to work the best in practice is a compromise: the person editing uses the newlines as desired (which is still valid USV); the USV parser only uses the 4 characters (thus preserves whitespace); the application programmer can then choose to write their own additional step of whitespace stripping.
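
A rough sketch of that compromise, under stated assumptions: the separators are U+241F (unit) and U+241E (record), each unit and record is terminated by its separator, and anything left after the final separator is ignored for brevity. The names `parse_usv` and `strip_units` are hypothetical, chosen for illustration only.

```python
# Assumption: units end with U+241F and records end with U+241E.
US = "\u241F"  # symbol for unit separator
RS = "\u241E"  # symbol for record separator


def parse_usv(text):
    """Split on the separators only; whitespace inside units is preserved."""
    records = []
    # Simplification: the chunk after the last separator is dropped.
    for raw_record in text.split(RS)[:-1]:
        records.append(raw_record.split(US)[:-1])
    return records


def strip_units(records):
    """Optional application-level step: strip surrounding whitespace."""
    return [[unit.strip() for unit in record] for record in records]


# A file where the person editing added newlines around the separators.
text = "a\n\u241Fb\n\u241F\n\u241E\nc\n\u241Fd\n\u241F\n\u241E\n"
parsed = parse_usv(text)
cleaned = strip_units(parsed)
```

Here `parsed` still carries the editor's newlines inside the units, and only the separate `strip_units` step removes them.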

Thoughts about these areas?

@cipriancraciun
Author

> So far, the solution that seems to work the best in practice is a compromise: the person editing uses the newlines as desired (which is still valid USV); the USV parser only uses the 4 characters (thus preserves whitespace); the application programmer can then choose to write their own additional step of whitespace stripping.

I find this approach very problematic from the specification and selling point of view...

I'll quote from the repository's readme:

> USV can handle data that contains commas, tabs, newlines, and other special characters, all without escaping.

So, the above quote promises that USV would solve the issue of wrestling with whitespace, escaping, quoting, etc. However, the approach you've suggested in the reply is contrary to this, namely "let the programmer handle additional whitespace stripping".

Thus I think any additional whitespace needed to make USV text actually readable / writable with ordinary tools should be part of the specification, not something each user has to figure out on their own.

(For example I always use TSV as opposed to CSV, because there is no "CSV standard", and each tool does escaping and quoting in its own way.)

@cipriancraciun
Author

> CSV and TSV files often end with a newline, which makes these formats easier to edit with a typical line-oriented editor, and also easier to commit to repositories that require every text file to have a final newline, or that use line-oriented merge tools that flag a missing final newline.

This was exactly my concern.

> In practice, a USV format user will often encounter a final newline to deal with, or delete, because of line-oriented Unix tools.

However, given that most (all?) tools expect / write a \n at the end, making it mandatory would solve this issue without negative consequences.

> An open question is how much of a pain point it would be to enforce zero trailing newline in a typical developer's editor.

In terms of editors, in my experience:

  • many editors allow the user to choose between enforcing or not a last \n;
  • many editors default to writing it; (some OSX editors perhaps not?)
  • I think many editors preserve the existence / absence of a trailing \n;
  • there are however a few editors that don't give the user a choice and always enforce it; (my own sce editor is one of these;) :)

However, given the above (perhaps except the last point), I think the safest choice is to just require a \n: most write it, most perhaps preserve it, some even enforce it; however I don't think there are many editors that enforce the lack of \n.

@joelparkerhenderson
Member

Very good comments. You're providing excellent constructive feedback, which I greatly appreciate. I have time this weekend to dig into this and think about it more carefully.

@joelparkerhenderson
Member

Spec now allows liners (i.e. carriage returns and line feeds) around content.
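
One way to read "liners around content" is sketched below. This is an interpretation, not the spec's reference behavior: it assumes the separators are U+241F (unit) and U+241E (record), that units and records are terminated by their separators, and that only carriage returns and line feeds at the edges of a unit are dropped. The name `parse_usv_with_liners` is hypothetical.

```python
# Assumption: units end with U+241F and records end with U+241E.
US = "\u241F"  # symbol for unit separator
RS = "\u241E"  # symbol for record separator
LINERS = "\r\n"  # carriage return and line feed


def parse_usv_with_liners(text):
    """Parse USV, dropping CR/LF at the edges of each unit only."""
    records = []
    # Simplification: the chunk after the last separator is dropped.
    for raw_record in text.split(RS)[:-1]:
        units = [unit.strip(LINERS) for unit in raw_record.split(US)[:-1]]
        records.append(units)
    return records


# Records end with CR-LF for readability; one unit has an interior newline.
text = (
    "name\u241Fnote\u241F\u241E\r\n"
    " Ada \u241Fline1\nline2\u241F\u241E\r\n"
)
parsed = parse_usv_with_liners(text)
```

In this sketch, spaces and newlines inside a unit survive, so data containing newlines still needs no escaping; only the line breaks added around content for editing are removed.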


2 participants