[DRAFT] Cache control for re-using previously-downloaded headers #325

Draft · wants to merge 1 commit into base: master
Conversation

@jmarshall (Member) commented Jul 12, 2018

This proposal is a follow-up to #322. It will require rebasing etc as #322 develops, so I don't anticipate updating or polishing this until after the class proposal has landed in master.

However, if clients are to use that facility to re-use previously-downloaded headers, and this is to be done safely, then I think HTTP cache control is the natural way to make it safe, and extrapolating ETag etc. to the htsget ticket is a natural extension. So if enabling this safety is considered important, I think this follow-up will also need to be considered soon after class.

But this is somewhat moot in the absence of implementations, hence this separate PR.

@jdidion commented Jul 17, 2018

Thanks for the additions here and in #322. Practical question: I am trying to implement a POC htsget server that splits up BAM files into header and body pieces. I'm at a loss for how to actually create a header BAM file and a body BAM file, such that the header BAM is valid by itself but can be concatenated with the body BAM to create the final BAM. In pseudo-python, the way I want my client to interact with the server is:

urlobjs = get_urls_from_server()
headers = None

with open('out.bam', 'wb') as outbam:
    for urlobj in urlobjs:
        url_content = fetch(urlobj['url'])
        if urlobj['class'] == 'header':
            headers = decode_bam(url_content).headers
            # ...do something with the headers...
        outbam.write(url_content)

# Now open the BAM for reading
with open_bam('out.bam', 'r') as bam_reader:
    # The headers in the BAM file should be the same as what I read from the header BAM above
    assert bam_reader.headers == headers
    for record in bam_reader:
        # ...do something with the record...

I realize that I can use samtools to split my BAM file into header-only and body-only by first converting to SAM, splitting into header-only SAM and body-only SAM, and then converting both of those back into BAM. But concatenating those two files does not create a valid BAM. I guess I could just write out the body BAM and then use samtools reheader to add the header, but that's quite slow for large BAM files. Any other suggestions?

@jmarshall (Member, Author) commented Jul 17, 2018

You need to find the boundary file offset between the header and the body, which requires understanding the format in a way that a general-purpose read-the-records API won't provide. So for BAM, you need to

  • figure out the length in uncompressed bytes of the BAM header, essentially by adding up
    l_text + sum(l_name[1…n_ref]) plus the constant-size fields;

  • figure out how many BGZF blocks at the start of the file are used for those headers, by adding up blocks' isize fields until the total equals the uncompressed header size.

At that point, you'll have the header-body boundary (in “compressed space”) that you're looking for.

Note that this assumes that a new BGZF block is started for the first body data record (i.e. the header-body boundary is also at a BGZF block boundary) — this has never been stated in the SAM specification, but is something that the main implementations have done for BAM since 2010 (see #300). It seems to me that implementing htsget requires BAM files to have this property. (And similarly for BCF files, but AFAIK the main implementations don't do this for them!)
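The two steps above could be sketched roughly as follows. This is only a sketch under the assumptions just stated: each BGZF block is a gzip member whose fixed 18-byte header carries BSIZE in the BC extra subfield (the layout htslib writes), and the header is assumed to end exactly at a BGZF block boundary. The function names (`bgzf_blocks`, `bam_header_length`, `header_body_boundary`) are illustrative, not from any existing library.

```python
import struct
import zlib


def bgzf_blocks(path):
    """Yield (compressed_offset, compressed_size, uncompressed_data)
    for each BGZF block in the file, assuming the usual fixed layout:
    12-byte gzip header, 6-byte BC extra subfield holding BSIZE
    (= block size - 1), raw-deflate CDATA, then CRC32 + ISIZE."""
    with open(path, "rb") as fh:
        offset = 0
        while True:
            header = fh.read(18)
            if len(header) < 18:              # end of file
                return
            bsize = struct.unpack_from("<H", header, 16)[0] + 1
            rest = fh.read(bsize - 18)        # CDATA + CRC32 + ISIZE
            data = zlib.decompress(rest[:-8], wbits=-15)
            yield offset, bsize, data
            offset += bsize


def bam_header_length(data):
    """Uncompressed byte length of the BAM header: magic + l_text +
    text + n_ref + per-reference (l_name + name + l_ref)."""
    assert data[:4] == b"BAM\x01"
    l_text = struct.unpack_from("<i", data, 4)[0]
    pos = 8 + l_text
    n_ref = struct.unpack_from("<i", data, pos)[0]
    pos += 4
    for _ in range(n_ref):
        l_name = struct.unpack_from("<i", data, pos)[0]
        pos += 4 + l_name + 4                 # l_name field + name + l_ref
    return pos


def header_body_boundary(path):
    """Compressed-space file offset of the first body BGZF block."""
    uncompressed = b""
    for offset, bsize, data in bgzf_blocks(path):
        uncompressed += data
        if len(uncompressed) < 12:
            continue                          # not enough to parse lengths yet
        try:
            hlen = bam_header_length(uncompressed)
        except struct.error:
            continue                          # header continues in later blocks
        if len(uncompressed) >= hlen:
            # The header must end exactly at a block boundary (see above).
            assert len(uncompressed) == hlen
            return offset + bsize
    raise ValueError("no header/body boundary found")
```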

In practice, you'd more likely find this boundary by looking in a BAI/etc index for the virtual file offset of the first body data record — e.g. (presumably) by finding the smallest ioffset in any of the linear indices.
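The index-based shortcut might look like this sketch, assuming a BAI index read fully into memory; `smallest_ioffset` and `boundary_from_bai` are hypothetical names. A BAI virtual file offset packs the compressed block offset in the high 48 bits and the within-block offset in the low 16, so shifting right by 16 yields the boundary in compressed space.

```python
import struct


def smallest_ioffset(bai_bytes):
    """Smallest non-zero ioffset across all linear indices in a BAI,
    walking the binning index of each reference to skip past it."""
    assert bai_bytes[:4] == b"BAI\x01"
    n_ref = struct.unpack_from("<i", bai_bytes, 4)[0]
    pos = 8
    best = None
    for _ in range(n_ref):
        n_bin = struct.unpack_from("<i", bai_bytes, pos)[0]
        pos += 4
        for _ in range(n_bin):
            n_chunk = struct.unpack_from("<i", bai_bytes, pos + 4)[0]
            pos += 8 + 16 * n_chunk       # bin id + n_chunk + chunk pairs
        n_intv = struct.unpack_from("<i", bai_bytes, pos)[0]
        pos += 4
        for _ in range(n_intv):
            ioff = struct.unpack_from("<Q", bai_bytes, pos)[0]
            pos += 8
            if ioff and (best is None or ioff < best):
                best = ioff
    return best


def boundary_from_bai(bai_bytes):
    """Compressed-space offset of the first body record's BGZF block."""
    voff = smallest_ioffset(bai_bytes)
    return voff >> 16                     # drop the within-block offset
```

Note this only works if the first mapped record's block really is the first body block; unmapped-only prefixes would defeat it, which is one reason the spec-level guarantee discussed above matters.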

@jdidion commented Jul 17, 2018

@jmarshall that makes sense, thanks for the explanation. I have a Python library for parsing index files that I've been meaning to release for a while; it seems like it will be useful here. I'll work on putting together a library and command-line tool that can be used to split up BAM/BCF/etc files for serving via htsget.
