Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes involving CRAM and header API for long references. #976

Open
wants to merge 3 commits into
base: develop
Choose a base branch
from

Conversation

jkbonfield
Copy link
Contributor

This is obviously only partial because we need CRAM4 for long reads, but it's now better at reporting errors and also now copes correctly with the cases inbetween 28-bit and 32-bit sizes for long cigar ops.

The header API has also had some fixes, which were discovered in the process of updating CRAM. The main culprit was sam_hdr_dup which didn't honour h->sdict.

We can now have many cigar ops (eg ref-skips) that combine to more than 512Mb.

This isn't a full fix for long references as CRAM sequence positions
are still 32-bit.  However this is a starting point down that road and
there are still sizes between 28-bit and 32-bit where the CIGAR failed
to operate correctly.
The h->sdict hash is used to track references that are > 4Gb in size.
The dup code didn't copy this.  This manifested itself as CRAM SQ
headers being truncated (read SAM hdr, dup, write as CRAM hdr).

To fix this a function was written that creates or updates the sdict
from the hrecs parsed header structs.  It's possible this should be
called directly from the sam_hdr_create function (part of the SAM
format parser) instead of manually keeping track of sdict itself,
however doing so would require initialising the new header structs so
I haven't done this.

This is a general utility, so perhaps should be made a public part of
the header API.  However IMO the new header API should hide this
nuance away and just return the correct data, also ensuring that
header updates work correctly and honour the text form.

Since c83c9e2 the header API also was using the 32-bit capped
target_len in preference to the parsed text from SQ LN fields when
they differed.  I am assuming this was a decision in what takes
priority in BAM where the sequence names and lengths exist in both
text and binary form.  This commit reverses this and makes the text
form always take priority.  As this is at least required in some
scenarios (long references) it seems easier to simply make it apply in
all scenarios.
sam.c Outdated
@@ -179,6 +226,14 @@ sam_hdr_t *sam_hdr_dup(const sam_hdr_t *h0)
h->text[h->l_text] = '\0';
}

if (h0->sdict) {
// We have a reason to use new API, so build it so we
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jkbonfield As discussed, this would be better addressed inside sam_hdr_update_target_arrays.

…h `target_name` and `target_len`.

Rename `rebuild_target_arrays` to `sam_hdr_rebuild_target_arrays`.
Remove unnecessary external condition on `hrecs->refs_changed`.
Rearrange the position of `sam_hrecs_rebuild_text`, to reflect its internal usage.
@jkbonfield
Copy link
Contributor Author

Moving the code to sam_hdr_update_target_arrays looks to have fixed the problem in a more elegant way. Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants