Add Name Tokenization Codec (Update CRAM Codecs to CRAM 3.1) #1663
Conversation
…dded in the streams).
…ift methods for RANS Nx16 Order 0 and Order 1, RANS Nx16 Order 0 and Order 1 with format flags = 1 works as expected when N=4
That's it for this round. When these comments are addressed, I'll do another round.
src/main/java/htsjdk/samtools/cram/compression/nametokenisation/NameTokenisationEncode.java
case TOKEN_DELTA0:
    tokenStream.get(TOKEN_DELTA0).getByteBuffer().put((byte)Integer.parseInt(encodeToken.getRelativeTokenValue()));
    break;
Same comment about having a default case as elsewhere: there should be a multi-value case statement here with all of the remaining type values, with a break statement and a comment saying they are deliberately dropped, followed by a default with a throw.
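The layout being suggested could look like this minimal sketch. The token-type names other than TOKEN_DELTA0, and their values, are hypothetical stand-ins, not the actual htsjdk constants:

```java
// Sketch of the suggested switch shape: handled types, deliberately dropped
// types grouped with a comment and a break, and a default that throws.
class TokenSwitchSketch {
    static final int TOKEN_DELTA0 = 1;
    static final int TOKEN_MATCH = 2;  // hypothetical type name
    static final int TOKEN_END = 3;    // hypothetical type name

    // Returns true if the token type was handled, false if deliberately dropped.
    static boolean handleToken(final int tokenType) {
        switch (tokenType) {
            case TOKEN_DELTA0:
                // ... write the token's payload to the corresponding stream ...
                return true;
            case TOKEN_MATCH:
            case TOKEN_END:
                // These types are deliberately dropped here (no payload to write).
                return false;
            default:
                throw new IllegalArgumentException("Unexpected token type: " + tokenType);
        }
    }
}
```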
inputByteBuffer.get(dataBytes, 0, clen); // offset in the dst byte array
final ByteBuffer uncompressedDataByteBuffer;
if (useArith != 0) {
    RangeDecode rangeDecode = new RangeDecode();
Definitely suggest caching just one of these (probably in TokenStreams) so we don't have to recreate these for each stream, and then just reset them for each use.
    uncompressedDataByteBuffer = rangeDecode.uncompress(ByteBuffer.wrap(dataBytes));
} else {
    RANSDecode ransdecode = new RANSNx16Decode();
Same thing here - this needs to be cached.
When we do the integration of these with the CRAM reader/writer, we may even want to cache/reuse these at an even higher level, so that we don't have to create these for each slice. But for now caching them in TokenStreams
will be a big win in terms of reducing allocation/GC.
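The caching suggestion could be sketched as below, assuming the decoders can be safely reused across streams. StubDecode is a stand-in for htsjdk's RangeDecode / RANSNx16Decode (its uncompress is a no-op copy here, and it counts calls only so reuse is observable):

```java
import java.nio.ByteBuffer;

// Sketch: hold one decoder of each kind per TokenStreams instance and reuse
// it for every stream, instead of allocating a new decoder in the per-stream
// loop.
class CachedDecoders {
    static class StubDecode {
        int calls = 0; // visible so reuse can be observed

        ByteBuffer uncompress(final ByteBuffer in) {
            calls++;
            return in.duplicate();
        }
    }

    // Cached once; a reset call would go here if the real decoders carried
    // state between uses.
    final StubDecode rangeDecode = new StubDecode();
    final StubDecode ransDecode = new StubDecode();

    ByteBuffer decode(final byte[] dataBytes, final int useArith) {
        final StubDecode decoder = (useArith != 0) ? rangeDecode : ransDecode;
        return decoder.uncompress(ByteBuffer.wrap(dataBytes));
    }
}
```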
final ByteBuffer inBuffer,
final String separator) {
    inBuffer.order(ByteOrder.LITTLE_ENDIAN);
    final int uncompressedLength = inBuffer.getInt() & 0xFFFFFFFF; // unused variable. Following the spec
On further review, we're really missing the boat here, since we're not taking advantage of this value. I suspect that the uncompressed length is serialized to the stream specifically to allow decode implementations to efficiently allocate memory for it up front, but we're not doing that because we're using a List<List<>> instead of a byte buffer for the output. As I've mentioned elsewhere, we may wind up wanting to retain the List approach, since it will be more efficient to integrate with the reader/writer that way, but I just wanted to note this as part of the review.
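If the decoder did move to a flat output buffer, using the serialized length for up-front allocation could look like this sketch (layout assumed from the quoted code: a little-endian int length prefix ahead of the payload):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch: read the little-endian uncompressed-length prefix and size the
// output once, instead of discarding the value and growing the output
// (currently a List<List<>>) dynamically.
class PreallocFromLength {
    static ByteBuffer allocateOutput(final ByteBuffer inBuffer) {
        inBuffer.order(ByteOrder.LITTLE_ENDIAN);
        final int uncompressedLength = inBuffer.getInt();
        return ByteBuffer.allocate(uncompressedLength);
    }
}
```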
    RANSDecode ransdecode = new RANSNx16Decode();
    uncompressedDataByteBuffer = ransdecode.uncompress(ByteBuffer.wrap(dataBytes));
}
this.getTokenStreamByType(tokenType).add(tokenPosition, new Token(uncompressedDataByteBuffer));
It seems strange to use a local method call here to get the stream (this.getTokenStreamByType(tokenType)), when everywhere else in this method you just do tokenStreams.get(tokenType) directly. I would either use it everywhere, or (preferably) eliminate the local wrapper since it doesn't add much.
TokenStreams tokenStreams = new TokenStreams(inBuffer, useArith, numNames);
List<List<String>> tokensList = new ArrayList<>(numNames);
for (int i = 0; i < numNames; i++) {
    tokensList.add(new ArrayList<>());
Can these ArrayLists be preallocated to the correct length, or some reasonable length, rather than relying on them being serially reallocated as they grow? I think even overallocating somewhat would be preferable.
Upon looking further, it looks like these arrays need to have a size equal to the number of tokens the names are broken into. While you may not be able to predict that up front, you could probably guess and use an estimate. The default for ArrayList is 10 anyway, but I would suggest maybe 15 as a reasonable default (add a constant with a comment saying it's an estimate for preallocation). 15 will probably be too big most of the time, but specifying a size value would reinforce what these are expected to be.
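The suggestion could be sketched like this; the constant name and the value 15 are the reviewer's estimate, not a spec value:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: preallocate each per-name token list to a named estimate instead of
// relying on ArrayList's default growth.
class TokenListPrealloc {
    // Estimate for preallocation; most names tokenize into fewer tokens.
    static final int ESTIMATED_TOKENS_PER_NAME = 15;

    static List<List<String>> makeTokensList(final int numNames) {
        final List<List<String>> tokensList = new ArrayList<>(numNames);
        for (int i = 0; i < numNames; i++) {
            tokensList.add(new ArrayList<>(ESTIMATED_TOKENS_PER_NAME));
        }
        return tokensList;
    }
}
```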
// contains a ByteBuffer of length = number of names
// This ByteBuffer helps determine the type of each token at the specified position
this();
Also, it seems like the tokenStreams array is always of size 13 (number of token types) x numNames. If that's true, then when you combine these constructors, can you preallocate the list of arrays to size numNames?
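Under the reviewer's assumption that the matrix is always 13 x numNames, the preallocation could look like this sketch (Token is stood in by String, and the constant name is illustrative):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: preallocate the 13 x numNames token-stream matrix in the combined
// constructor, so neither dimension has to grow later.
class TokenStreamsPrealloc {
    static final int TOKEN_TYPE_COUNT = 13; // number of token types

    static List<List<String>> makeTokenStreams(final int numNames) {
        final List<List<String>> tokenStreams = new ArrayList<>(TOKEN_TYPE_COUNT);
        for (int t = 0; t < TOKEN_TYPE_COUNT; t++) {
            tokenStreams.add(new ArrayList<>(numNames));
        }
        return tokenStreams;
    }
}
```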
byte currentByte = inputBuffer.get();
while (currentByte != 0) {
    resultStringBuilder.append((char) currentByte);
    currentByte = inputBuffer.get();
}
Not a big deal in this case, but in the spirit of keeping narrow variable scope, I would always prefer something like this:
// current:
byte currentByte = inputBuffer.get();
while (currentByte != 0) {
    resultStringBuilder.append((char) currentByte);
    currentByte = inputBuffer.get();
}

// suggested:
for (byte currentByte = inputBuffer.get(); currentByte != 0; currentByte = inputBuffer.get()) {
    resultStringBuilder.append((char) currentByte);
}
while (value.length() < len) {
    value = "0" + value;
}
This is also super inefficient. It would be much preferable to not reallocate this string (possibly) len times. Instead, either use a single StringBuilder, or add Apache StringUtils and use leftPad.
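The StringBuilder variant of the suggestion could look like this sketch (method name is illustrative):

```java
// Sketch: build the zero padding once with a StringBuilder instead of
// reallocating the String on every loop iteration.
class ZeroPad {
    static String leftPadZeros(final String value, final int len) {
        if (value.length() >= len) {
            return value;
        }
        final StringBuilder sb = new StringBuilder(len);
        for (int i = value.length(); i < len; i++) {
            sb.append('0'); // one '0' per missing character
        }
        return sb.append(value).toString();
    }
}
```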
Force-pushed … stripe, nosize, cat flags (7d2d0c5 to 67ad384)
NOTE: This PR is in draft as it is dependent on the RANS Nx16 PR and the Range codec PR.
Description
This PR is part of an effort to upgrade CRAM to v3.1. It adds the Name Tokenization Decoder implementation.
List of Changes:
Add Name Tokenization Decoder
Add NameTokenizationInteropTest to test the Name Tokenization Decoder using the test files from htscodecs. These interop tests use the files from a samtools installation (samtools-1.14/htslib-1.14/htscodecs/tests/names).