-
Notifications
You must be signed in to change notification settings - Fork 244
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
handle utf-8 encoding #1569
base: master
Are you sure you want to change the base?
handle utf-8 encoding #1569
Conversation
@@ -28,6 +28,7 @@ | |||
import java.io.IOException; | |||
import java.io.OutputStream; | |||
import java.io.Writer; | |||
import java.nio.charset.*; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By convention we try to avoid import *, except in the case where not doing so would require a really large list of imports. In this case, it looks like you're only using StandardCharsets
, so you can just import that.
} | ||
public void write(final char[] chars, int off, int len) throws IOException { | ||
String str = new String(chars,off,len); | ||
byte[] b = str.getBytes(StandardCharsets.UTF_8); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is going to write all SAM content as UTF-8, but I think the spec only permits UTF-8 in a few places, and requires 7 bit ascii elsewhere. So we may need a different approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like we're actually just doing a costly reimplementation of a standard Java Writer here now. It seems like validation of the character ranges should be done in the SamRecord and not here. Maybe this class has outlived it's usefulness and we can just use a standard Writer in place of it.
} | ||
catch (final Exception ex) { | ||
Assert.assertTrue(ex.getMessage().contains(exceptionString)); | ||
throw ex; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way this test is constructed makes it hard to tell if its throwing on read, or on write. Lets replace it with separate tests that show that we fail in both cases. Since we will presumably fail when we read a misplaced multibyte character, I think the test that shows that we fail on write will have to create the output programmatically, rather then via roundtrip.
writer.addAlignment(rec); | ||
} | ||
} | ||
catch (final Exception ex) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The expectedExceptions clause indicates that you expect IllegalArgumentException
or SAMFormatException
. So it would be preferable to narrow the scope of the catch clause to specify the narrower types that you actually expect (you can use multi-catch clause to match more than one type). If some other exception got thrown, like an IO exception, this clause will catch it and the assert will fail, even though the real problem would be completely unrelated.
So looking at this more closely, it's very unclear to me why we have this class in the first place. It seems to be entirely some sort of performance optimization that's designed to avoid the cost of converting string formats. Is it possible that we could just remove this and replace it with a normal java Writer set to use UTF-8? |
return new Object[][]{ | ||
{"roundtrip_with_utf8_bad_1.sam", "Invalid character in read bases"}, | ||
{"roundtrip_with_utf8_bad_2.sam", "Non-numeric value in POS column"}, | ||
{"roundtrip_with_utf8_bad_2.sam", "Non-numeric value in POS column"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these tests may be failing somewhat accidentally, because they're testing multi byte chars in fields that are already constrained to specific sets of characters (numeric for POS, or valid bases for read bases). I'm not certain that its demonstrating that the chars are rejected because they're multi-byte; but rather that its just that they don't conform to the constraints. I suspect they would still fail if you used single byte ASCII characters that didn't conform.
It may be find to leave these, but I think we some need additional cases that use multi byte chars in places where there is no constraint other than that they be ASCII.
@lbergelson Just to clarify, Do you mean the class would be replaced with a normal java Writer set to use UTF-8 for certain fields and Ascii for the rest of the fields that do not permit UTF-8? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yash-puligundla I did a quick pass with some comments to start, let me know if you have any questions, or if anything doesn't make sense.
Description
If UTF-8 encoded characters are present in a SAM file, it is being corrupted while writing the file. This is because AsciiWriter downcasts the input
char
to abyte
.More details here
SAM specification specifies different field where UTF-8 encoding is allowed
Fix
cast from String to bytes using
str.getBytes(StandardCharsets.UTF_8)