Fixes involving CRAM and header API for long references. #976

jkbonfield · 2019-11-15T12:30:43Z

This is obviously only partial because we need CRAM4 for long reads, but it's now better at reporting errors and also now copes correctly with the cases inbetween 28-bit and 32-bit sizes for long cigar ops.

The header API has also had some fixes, which were discovered in the process of updating CRAM. The main culprit was sam_hdr_dup which didn't honour h->sdict.

We can now have many cigar ops (eg ref-skips) that combine to more than 512Mb. This isn't a full fix for long references as CRAM sequence positions are still 32-bit. However this is a starting point down that road and there are still sizes between 28-bit and 32-bit where the CIGAR failed to operate correctly.

The h->sdict hash is used to track references that are > 4Gb in size. The dup code didn't copy this. This manifested itself as CRAM SQ headers being truncated (read SAM hdr, dup, write as CRAM hdr). To fix this a function was written that creates or updates the sdict from the hrecs parsed header structs. It's possible this should be called directly from the sam_hdr_create function (part of the SAM format parser) instead of manually keeping track of sdict itself, however doing so would require initialising the new header structs so I haven't done this. This is a general utility, so perhaps should be made a public part of the header API. However IMO the new header API should hide this nuance away and just return the correct data, also ensuring that header updates work correctly and honour the text form. Since c83c9e2 the header API also was using the 32-bit capped target_len in preference to the parsed text from SQ LN fields when they differed. I am assuming this was a decision in what takes priority in BAM where the sequence names and lengths exist in both text and binary form. This commit reverses this and makes the text form always take priority. As this is at least required in some scenarios (long references) it seems easier to simply make it apply in all scenarios.

valeriuo · 2019-11-18T10:52:47Z

sam.c

@@ -179,6 +226,14 @@ sam_hdr_t *sam_hdr_dup(const sam_hdr_t *h0)
        h->text[h->l_text] = '\0';
    }

+    if (h0->sdict) {
+        // We have a reason to use new API, so build it so we


@jkbonfield As discussed, this would be better addressed inside sam_hdr_update_target_arrays.

…h `target_name` and `target_len`. Rename `rebuild_target_arrays` to `sam_hdr_rebuild_target_arrays`. Remove unnecessary external condition on `hrecs->refs_changed`. Rearrange the position of `sam_hrecs_rebuild_text`, to reflect its internal usage.

jkbonfield · 2019-11-18T16:27:06Z

Moving the code to sam_hdr_update_target_arrays looks to have fixed the problem in a more elegant way. Thanks.

jkbonfield added 2 commits November 15, 2019 12:29

jkbonfield force-pushed the long_cigar2 branch from a914023 to 3c079af Compare November 15, 2019 12:33

valeriuo reviewed Nov 18, 2019

View reviewed changes

daviesrob mentioned this pull request Nov 20, 2019

Fix sam_hdr_dup to cope with long refs. #983

Merged

jkbonfield mentioned this pull request Nov 26, 2020

Adjust parser for undefined CIGARs #1182

Closed

jkbonfield mentioned this pull request May 24, 2021

Issues with long ref and ref-skips #971

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixes involving CRAM and header API for long references. #976

Fixes involving CRAM and header API for long references. #976

jkbonfield commented Nov 15, 2019

valeriuo Nov 18, 2019

jkbonfield commented Nov 18, 2019

Fixes involving CRAM and header API for long references. #976

Are you sure you want to change the base?

Fixes involving CRAM and header API for long references. #976

Conversation

jkbonfield commented Nov 15, 2019

valeriuo Nov 18, 2019

Choose a reason for hiding this comment

jkbonfield commented Nov 18, 2019