BAM IndexedReader from S3 url: I/O Error #216

Closed
brainstorm opened this issue Jun 22, 2020 · 5 comments

@brainstorm
Member

brainstorm commented Jun 22, 2020

This is a follow-up from #189 (comment) regarding feature = ["s3"]. Here's a minimal code example that reads the header of a BAM file hosted on S3:

use rust_htslib::bam::{IndexedReader, Read};
use url::Url;

// Read the reference (target) names from the header of a BAM file hosted on S3.
pub fn bam_header(bucket: String, key: String) -> Vec<String> {
    let s3_url = Url::parse(&format!("s3://{}/{}", bucket, key)).unwrap();
    let bam_reader = IndexedReader::from_url(&s3_url).unwrap();

    // Target names come back as raw bytes, so convert them lossily to UTF-8.
    bam_reader
        .header()
        .target_names()
        .into_iter()
        .map(|raw_name| String::from_utf8_lossy(raw_name).to_string())
        .collect()
}

When the s3 feature is not enabled, it (predictably) fails with "Protocol not supported", much like a recently fixed htslib+pysam bug:

[E::hts_open_format] Failed to open file "s3://umccr-research-dev/htsget/htsnexus_test_NA12878.bam" : Protocol not supported
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: BamOpen { source: Open { target: "s3://umccr-research-dev/htsget/htsnexus_test_NA12878.bam" } }', src/main.rs:17:12

(...)

END RequestId: a2efde5b-5e19-4905-90e1-0b91badcb163
REPORT RequestId: a2efde5b-5e19-4905-90e1-0b91badcb163	Duration: 543.71 ms	Billed Duration: 600 ms	Memory Size: 128 MB	Max Memory Used: 15 MB	
RequestId: a2efde5b-5e19-4905-90e1-0b91badcb163 Error: Runtime exited with error: exit status 101
Runtime.ExitError

Then, with S3 support enabled, the same simple code that retrieves target names from a BAM header fails with an I/O error:

START RequestId: 7197fc91-d554-4147-9e0d-a29fe3d6e0fb Version: $LATEST
[E::hts_open_format] Failed to open file "s3://umccr-research-dev/htsget/htsnexus_test_NA12878.bam" : I/O error
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: BamOpen { source: Open { target: "s3://umccr-research-dev/htsget/htsnexus_test_NA12878.bam" } }', src/main.rs:17:12
stack backtrace:
   0:           0x641674 - backtrace::backtrace::libunwind::trace::h234d741a55b60f88
                               at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.46/src/backtrace/libunwind.rs:86
   1:           0x641674 - backtrace::backtrace::trace_unsynchronized::h350b2c8c65b00d1d
                               at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.46/src/backtrace/mod.rs:66
   2:           0x641674 - std::sys_common::backtrace::_print_fmt::h4a536ea1c8e8e74a
                               at src/libstd/sys_common/backtrace.rs:78
   3:           0x641674 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::had63074188e24509
                               at src/libstd/sys_common/backtrace.rs:59
   4:           0x67b09c - core::fmt::write::h0f3ca38b916f7bdd
                               at src/libcore/fmt/mod.rs:1069
   5:           0x63f6d3 - std::io::Write::write_fmt::h904ea4dad7931404
                               at src/libstd/io/mod.rs:1504
   6:           0x643b95 - std::sys_common::backtrace::_print::h5b567d4903ca6eb3
                               at src/libstd/sys_common/backtrace.rs:62
   7:           0x643b95 - std::sys_common::backtrace::print::hf98b9b1b18a4dc81
                               at src/libstd/sys_common/backtrace.rs:49
   8:           0x643b95 - std::panicking::default_hook::{{closure}}::h5fbf8e21242992f2
                               at src/libstd/panicking.rs:198
   9:           0x6438d2 - std::panicking::default_hook::hb4d89e36502020cd
                               at src/libstd/panicking.rs:218
  10:           0x6441a2 - std::panicking::rust_panic_with_hook::hc36f90fb81cc1268
                               at src/libstd/panicking.rs:511
  11:           0x643d8b - rust_begin_unwind
                               at src/libstd/panicking.rs:419
  12:           0x67a481 - core::panicking::panic_fmt::h31cb4ec4ac5347b3
                               at src/libcore/panicking.rs:111
  13:           0x67a2a3 - core::option::expect_none_failed::h3e3ee4886fcb0833
                               at src/libcore/option.rs:1268
  14:           0x402134 - bootstrap::main::h4cfb5e1da07e4c36
  15:           0x401903 - std::rt::lang_start::{{closure}}::h71ce4b28a2a11ce2
  16:           0x6444d1 - std::rt::lang_start_internal::{{closure}}::ha24276d619b0834a
                               at src/libstd/rt.rs:52
  17:           0x6444d1 - std::panicking::try::do_call::ha58b8718efdbddf5
                               at src/libstd/panicking.rs:331
  18:           0x6444d1 - std::panicking::try::h2d6d423bf379e813
                               at src/libstd/panicking.rs:274
  19:           0x6444d1 - std::panic::catch_unwind::h45b4b6133cb33025
                               at src/libstd/panic.rs:394
  20:           0x6444d1 - std::rt::lang_start_internal::h47125699e3ec3d7e
                               at src/libstd/rt.rs:51
  21:           0x402222 - main
END RequestId: 7197fc91-d554-4147-9e0d-a29fe3d6e0fb
REPORT RequestId: 7197fc91-d554-4147-9e0d-a29fe3d6e0fb	Duration: 543.11 ms	Billed Duration: 600 ms	Memory Size: 128 MB	Max Memory Used: 15 MB	
RequestId: 7197fc91-d554-4147-9e0d-a29fe3d6e0fb Error: Runtime exited with error: exit status 101
Runtime.ExitError

I have created this repository as a test/reproducer:

https://github.com/brainstorm/s3-rust-htslib-bam

@pmarks, @dlaehnemann, would you mind taking a peek at my code and letting me know if I'm doing something obviously wrong in there? I would really like to document this so I can take a stab at #198 and/or write a blog post about rust-htslib 101 to attract more devs/users ;)

@pmarks
Contributor

pmarks commented Jun 22, 2020

Unfortunately I don't know much about https, so I'm a little out of my depth here. Using VSCode, I can step through the hfile / curl code and watch it prepare and execute a query to AWS. The request itself is sent successfully, but AWS rejects it.

The s3 url starts off as: s3://gatk-test-data/wgs_bam/NA12878_24RG_hg38/NA12878_24RG_small.hg38.bam

and the s3 module in htslib translates it to: https://gatk-test-data.s3.amazonaws.com/wgs_bam/NA12878_24RG_hg38/NA12878_24RG_small.hg38.bam
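To make the translation above concrete, here is a minimal sketch of the same rewrite in Rust. The function name `s3_to_https` is hypothetical, and the code only mirrors the before/after URLs quoted above; it is not htslib's actual implementation, which may handle regions, custom hosts and other cases differently.

```rust
/// Hypothetical helper: rewrite `s3://bucket/key` as a virtual-hosted-style
/// HTTPS URL, mirroring the translation observed in the debugger above.
/// This is an illustration only, not htslib's code.
fn s3_to_https(s3_url: &str) -> Option<String> {
    // Anything that does not start with the s3:// scheme is rejected.
    let rest = s3_url.strip_prefix("s3://")?;
    // The first path segment is the bucket, the remainder is the object key.
    let (bucket, key) = rest.split_once('/')?;
    if bucket.is_empty() || key.is_empty() {
        return None;
    }
    Some(format!("https://{}.s3.amazonaws.com/{}", bucket, key))
}
```

Calling `s3_to_https` on the s3:// URL above should yield the https:// URL shown; inputs with another scheme or without a key return `None`.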

It also adds a bunch of headers to the HTTP request that appear to be AWS-specific:
Authorization: AWS4-HMAC-SHA256 Credential=AKIAJSSJORL7JC7SHMQQ/20200622/us-east-1/s3/aws4_request,SignedHeaders=host;x-amz-content-sha256;x-amz-date,Signature=2e671bc80866165c53a8c030dafde5caee3c851f8886bb7f0f997958883c8b8b
x-amz-date: 20200622T160559Z
x-amz-content-sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855

Then curl makes the request to Amazon and gets a 403 response:
162\r\n<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<Error><Code>InvalidAccessKeyId</Code><Message>The AWS Access Key Id you provided does not exist in our records.</Message><AWSAccessKeyId>AKIAJSSJORL7JC7SHMQQ</AWSAccessKeyId><RequestId>388AACE1A4367822</RequestId><HostId>66DFEbxbR2w7Hy9Kk6gvIEb0IZps/WxMZ24tuRJevgU1AB2sKk2pFmxw5+yguNGK3OS/ON9MsjU=</HostId></Error>\r\n0\r\n\r\n

The same thing appears to happen if I access the s3:// URL with samtools view. So presumably this is fixable by setting some access key info as described here: http://www.htslib.org/doc/htslib-s3-plugin.html
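One way to catch this earlier is to check for credentials before opening the file. The sketch below assumes the credentials are supplied through the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables, which the plugin documentation linked above lists among its sources. The helper names `missing_credentials` and `missing_from_env` are hypothetical. Since the plugin can also read credentials from configuration files, an empty environment is a hint rather than proof of a misconfiguration.

```rust
use std::env;

/// Environment variables assumed here to carry the S3 credentials.
const REQUIRED_VARS: [&str; 2] = ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"];

/// Hypothetical helper: return the names of the required variables that
/// `lookup` cannot resolve to a non-empty value.
fn missing_credentials<F>(lookup: F) -> Vec<&'static str>
where
    F: Fn(&str) -> Option<String>,
{
    REQUIRED_VARS
        .iter()
        .copied()
        .filter(|&name| lookup(name).map_or(true, |value| value.is_empty()))
        .collect()
}

/// Same check against the real process environment.
#[allow(dead_code)]
fn missing_from_env() -> Vec<&'static str> {
    missing_credentials(|name| env::var(name).ok())
}
```

A caller could log the result of `missing_from_env()` before opening the BAM file, so that a missing key shows up as a readable message instead of a bare I/O error.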

@brainstorm
Member Author

brainstorm commented Jun 23, 2020

Gotcha and thanks for the debugging Patrick, I'll focus on extending a small C test today that uses hfile_s3.c and try to fix this upstream on htslib for good.

@brainstorm
Member Author

brainstorm commented Jun 23, 2020

@pmarks As I mentioned here: #189 (comment) ... samtools view -H works for me, are you using a recent version of samtools+htslib with S3 support (1.10.x series)?

@dlaehnemann
Member

@brainstorm Sorry for cross-posting in this issue, but since you mentioned here that you're hoping to tackle bits of the docs in the medium term, I figured I'd post about our rust-bio docathon planned for Tuesday next week:
rust-bio/rust-bio#276

Maybe you have some time to drop by, and in any case we might be able to repeat the same thing over here.

@brainstorm
Member Author

Hey, thanks for letting me know @dlaehnemann, I'll make sure to drop by ;)

On topic for this issue, I have a working proof of concept for S3+AWS Lambda+rust-htslib over here:

https://github.com/brainstorm/s3-rust-htslib-bam

The dawn of large scale lambda-backed bioinformatics on AWS is nigh! ;P

Time to close this issue I reckon, I'll document this on a PR during the docathon ;)
