handle utf-8 encoding #1569

yash-puligundla · 2021-09-27T20:01:17Z

Description

If UTF-8 encoded characters are present in a SAM file, it is being corrupted while writing the file. This is because AsciiWriter downcasts the input char to a byte.

More details here
SAM specification specifies different field where UTF-8 encoding is allowed

Fix

cast from String to bytes using str.getBytes(StandardCharsets.UTF_8)

yash-puligundla · 2021-10-04T20:04:55Z

pushed more commits to re-trigger Travis build. But, that didn't work

cmnbroad · 2021-10-05T16:10:39Z

src/main/java/htsjdk/samtools/util/AsciiWriter.java

@@ -28,6 +28,7 @@
 import java.io.IOException;
 import java.io.OutputStream;
 import java.io.Writer;
+import java.nio.charset.*;


By convention we try to avoid import *, except in the case where not doing so would require a really large list of imports. In this case, it looks like you're only using StandardCharsets, so you can just import that.

cmnbroad · 2021-10-05T16:14:43Z

src/main/java/htsjdk/samtools/util/AsciiWriter.java

-        }
+    public void write(final char[] chars, int off, int len) throws IOException {
+        String str = new String(chars,off,len);
+        byte[] b = str.getBytes(StandardCharsets.UTF_8);


This is going to write all SAM content as UTF-8, but I think the spec only permits UTF-8 in a few places, and requires 7 bit ascii elsewhere. So we may need a different approach.

It seems like we're actually just doing a costly reimplementation of a standard Java Writer here now. It seems like validation of the character ranges should be done in the SamRecord and not here. Maybe this class has outlived it's usefulness and we can just use a standard Writer in place of it.

cmnbroad · 2021-10-05T16:28:05Z

src/test/java/htsjdk/samtools/SAMFileRoundTripTest.java

+        }
+        catch (final Exception ex) {
+            Assert.assertTrue(ex.getMessage().contains(exceptionString));
+            throw ex;


The way this test is constructed makes it hard to tell if its throwing on read, or on write. Lets replace it with separate tests that show that we fail in both cases. Since we will presumably fail when we read a misplaced multibyte character, I think the test that shows that we fail on write will have to create the output programmatically, rather then via roundtrip.

cmnbroad · 2021-10-05T16:33:35Z

src/test/java/htsjdk/samtools/SAMFileRoundTripTest.java

+                writer.addAlignment(rec);
+            }
+        }
+        catch (final Exception ex) {


The expectedExceptions clause indicates that you expect IllegalArgumentException or SAMFormatException. So it would be preferable to narrow the scope of the catch clause to specify the narrower types that you actually expect (you can use multi-catch clause to match more than one type). If some other exception got thrown, like an IO exception, this clause will catch it and the assert will fail, even though the real problem would be completely unrelated.

lbergelson · 2021-10-05T16:33:59Z

So looking at this more closely, it's very unclear to me why we have this class in the first place. It seems to be entirely some sort of performance optimization that's designed to avoid the cost of converting string formats. Is it possible that we could just remove this and replace it with a normal java Writer set to use UTF-8?

cmnbroad · 2021-10-05T16:41:55Z

src/test/java/htsjdk/samtools/SAMFileRoundTripTest.java

+        return new Object[][]{
+                {"roundtrip_with_utf8_bad_1.sam", "Invalid character in read bases"},
+                {"roundtrip_with_utf8_bad_2.sam", "Non-numeric value in POS column"},
+                {"roundtrip_with_utf8_bad_2.sam", "Non-numeric value in POS column"}


I think these tests may be failing somewhat accidentally, because they're testing multi byte chars in fields that are already constrained to specific sets of characters (numeric for POS, or valid bases for read bases). I'm not certain that its demonstrating that the chars are rejected because they're multi-byte; but rather that its just that they don't conform to the constraints. I suspect they would still fail if you used single byte ASCII characters that didn't conform.

It may be find to leave these, but I think we some need additional cases that use multi byte chars in places where there is no constraint other than that they be ASCII.

yash-puligundla · 2021-10-05T16:42:22Z

So looking at this more closely, it's very unclear to me why we have this class in the first place. It seems to be entirely some sort of performance optimization that's designed to avoid the cost of converting string formats. Is it possible that we could just remove this and replace it with a normal java Writer set to use UTF-8?

@lbergelson Just to clarify, Do you mean the class would be replaced with a normal java Writer set to use UTF-8 for certain fields and Ascii for the rest of the fields that do not permit UTF-8?

cmnbroad

@yash-puligundla I did a quick pass with some comments to start, let me know if you have any questions, or if anything doesn't make sense.

handle utf-8 encoding

0681bcb

yash-puligundla marked this pull request as draft October 4, 2021 20:00

yash-puligundla added 2 commits October 4, 2021 16:01

test

747dbae

revert

c619ba6

yash-puligundla marked this pull request as ready for review October 4, 2021 20:02

cmnbroad reviewed Oct 5, 2021

View reviewed changes

cmnbroad requested changes Oct 5, 2021

View reviewed changes

yash-puligundla marked this pull request as draft October 19, 2021 15:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

handle utf-8 encoding #1569

handle utf-8 encoding #1569

yash-puligundla commented Sep 27, 2021 •

edited

Loading

yash-puligundla commented Oct 4, 2021

cmnbroad Oct 5, 2021 •

edited

Loading

cmnbroad Oct 5, 2021 •

edited

Loading

lbergelson Oct 5, 2021

cmnbroad Oct 5, 2021 •

edited

Loading

cmnbroad Oct 5, 2021

lbergelson commented Oct 5, 2021

cmnbroad Oct 5, 2021 •

edited

Loading

yash-puligundla commented Oct 5, 2021 •

edited

Loading

cmnbroad left a comment

handle utf-8 encoding #1569

Are you sure you want to change the base?

handle utf-8 encoding #1569

Conversation

yash-puligundla commented Sep 27, 2021 • edited Loading

Description

Fix

yash-puligundla commented Oct 4, 2021

cmnbroad Oct 5, 2021 • edited Loading

Choose a reason for hiding this comment

cmnbroad Oct 5, 2021 • edited Loading

Choose a reason for hiding this comment

lbergelson Oct 5, 2021

Choose a reason for hiding this comment

cmnbroad Oct 5, 2021 • edited Loading

Choose a reason for hiding this comment

cmnbroad Oct 5, 2021

Choose a reason for hiding this comment

lbergelson commented Oct 5, 2021

cmnbroad Oct 5, 2021 • edited Loading

Choose a reason for hiding this comment

yash-puligundla commented Oct 5, 2021 • edited Loading

cmnbroad left a comment

Choose a reason for hiding this comment

yash-puligundla commented Sep 27, 2021 •

edited

Loading

cmnbroad Oct 5, 2021 •

edited

Loading

cmnbroad Oct 5, 2021 •

edited

Loading

cmnbroad Oct 5, 2021 •

edited

Loading

cmnbroad Oct 5, 2021 •

edited

Loading

yash-puligundla commented Oct 5, 2021 •

edited

Loading