Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

handle utf-8 encoding #1569

Draft
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

yash-puligundla
Copy link
Contributor

@yash-puligundla yash-puligundla commented Sep 27, 2021

Description

If UTF-8 encoded characters are present in a SAM file, it is being corrupted while writing the file. This is because AsciiWriter downcasts the input char to a byte.

More details here
SAM specification specifies different field where UTF-8 encoding is allowed

Fix

cast from String to bytes using str.getBytes(StandardCharsets.UTF_8)

@yash-puligundla yash-puligundla marked this pull request as draft October 4, 2021 20:00
yash-puligundla added 2 commits October 4, 2021 16:01
@yash-puligundla yash-puligundla marked this pull request as ready for review October 4, 2021 20:02
@yash-puligundla
Copy link
Contributor Author

pushed more commits to re-trigger Travis build. But, that didn't work
image

@@ -28,6 +28,7 @@
import java.io.IOException;
import java.io.OutputStream;
import java.io.Writer;
import java.nio.charset.*;
Copy link
Collaborator

@cmnbroad cmnbroad Oct 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By convention we try to avoid import *, except in the case where not doing so would require a really large list of imports. In this case, it looks like you're only using StandardCharsets, so you can just import that.

}
public void write(final char[] chars, int off, int len) throws IOException {
String str = new String(chars,off,len);
byte[] b = str.getBytes(StandardCharsets.UTF_8);
Copy link
Collaborator

@cmnbroad cmnbroad Oct 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to write all SAM content as UTF-8, but I think the spec only permits UTF-8 in a few places, and requires 7 bit ascii elsewhere. So we may need a different approach.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like we're actually just doing a costly reimplementation of a standard Java Writer here now. It seems like validation of the character ranges should be done in the SamRecord and not here. Maybe this class has outlived it's usefulness and we can just use a standard Writer in place of it.

}
catch (final Exception ex) {
Assert.assertTrue(ex.getMessage().contains(exceptionString));
throw ex;
Copy link
Collaborator

@cmnbroad cmnbroad Oct 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way this test is constructed makes it hard to tell if its throwing on read, or on write. Lets replace it with separate tests that show that we fail in both cases. Since we will presumably fail when we read a misplaced multibyte character, I think the test that shows that we fail on write will have to create the output programmatically, rather then via roundtrip.

writer.addAlignment(rec);
}
}
catch (final Exception ex) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The expectedExceptions clause indicates that you expect IllegalArgumentException or SAMFormatException. So it would be preferable to narrow the scope of the catch clause to specify the narrower types that you actually expect (you can use multi-catch clause to match more than one type). If some other exception got thrown, like an IO exception, this clause will catch it and the assert will fail, even though the real problem would be completely unrelated.

@lbergelson
Copy link
Member

So looking at this more closely, it's very unclear to me why we have this class in the first place. It seems to be entirely some sort of performance optimization that's designed to avoid the cost of converting string formats. Is it possible that we could just remove this and replace it with a normal java Writer set to use UTF-8?

return new Object[][]{
{"roundtrip_with_utf8_bad_1.sam", "Invalid character in read bases"},
{"roundtrip_with_utf8_bad_2.sam", "Non-numeric value in POS column"},
{"roundtrip_with_utf8_bad_2.sam", "Non-numeric value in POS column"}
Copy link
Collaborator

@cmnbroad cmnbroad Oct 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these tests may be failing somewhat accidentally, because they're testing multi byte chars in fields that are already constrained to specific sets of characters (numeric for POS, or valid bases for read bases). I'm not certain that its demonstrating that the chars are rejected because they're multi-byte; but rather that its just that they don't conform to the constraints. I suspect they would still fail if you used single byte ASCII characters that didn't conform.

It may be find to leave these, but I think we some need additional cases that use multi byte chars in places where there is no constraint other than that they be ASCII.

@yash-puligundla
Copy link
Contributor Author

yash-puligundla commented Oct 5, 2021

So looking at this more closely, it's very unclear to me why we have this class in the first place. It seems to be entirely some sort of performance optimization that's designed to avoid the cost of converting string formats. Is it possible that we could just remove this and replace it with a normal java Writer set to use UTF-8?

@lbergelson Just to clarify, Do you mean the class would be replaced with a normal java Writer set to use UTF-8 for certain fields and Ascii for the rest of the fields that do not permit UTF-8?

Copy link
Collaborator

@cmnbroad cmnbroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yash-puligundla I did a quick pass with some comments to start, let me know if you have any questions, or if anything doesn't make sense.

@yash-puligundla yash-puligundla marked this pull request as draft October 19, 2021 15:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants