wishlist s3 cache #1748

cariaso · 2024-02-13T17:07:15Z

htslib is awesome, I feel embarrassed and greedy to wish for more, but I have a workflow where I see a potential for a big win, and I'm sure I'm not the only one who would benefit.

Consider
https://requests-cache.readthedocs.io/en/stable/
a transparent drop-in replacement for requests, which does local caching.

and imagine how many of us are pulling the same 0.01% of the same S3 file over and over again during development, and never need the other 99.99% of the BAM/CRAM.

if there were an ENV var, or other config, to allow htslib-s3-plugin to write to a local cache and to pull from it when available, it seems like that would pay off quite well for a presumably common use case.

even if it isn't used in every code path, a partial win would still pay off handsomely.
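
To make the idea concrete, here is a rough Python sketch of a byte-range cache switched on by an environment variable. It is only an illustration, not htslib code: the variable name `HTS_S3_CACHE_DIR`, the function `cached_range`, and the injected `fetch` callable are all hypothetical, and the sketch only hits the cache when exactly the same range is requested again. It also ignores invalidation if the remote object changes.

```python
import hashlib
import os
from pathlib import Path
from typing import Callable, Optional

# Hypothetical variable name for illustration; htslib does not define it.
CACHE_ENV_VAR = "HTS_S3_CACHE_DIR"


def cached_range(
    url: str,
    start: int,
    end: int,
    fetch: Callable[[str, int, int], bytes],
    cache_dir: Optional[str] = None,
) -> bytes:
    """Return bytes start..end (inclusive) of url, using a local cache if enabled."""
    cache_dir = cache_dir or os.environ.get(CACHE_ENV_VAR)
    if not cache_dir:
        # Caching is off: go straight to the remote fetch.
        return fetch(url, start, end)

    # One cache file per (url, start, end) triple.
    key = hashlib.sha256(f"{url}|{start}|{end}".encode()).hexdigest()
    path = Path(cache_dir) / key
    if path.exists():
        return path.read_bytes()

    data = fetch(url, start, end)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a per-process temp file, then rename, so readers never
    # see a partially written cache entry.
    tmp = path.with_name(f"{path.name}.{os.getpid()}.tmp")
    tmp.write_bytes(data)
    os.replace(tmp, path)
    return data
```

The `fetch` argument stands in for whatever actually performs the ranged S3 request, so the caching logic can be tried without network access.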

it seems sufficiently adjacent to
#1670
that perhaps during that work, leaving some notes about the relevant paths would be possible

daviesrob · 2024-02-15T17:31:26Z

While a nice idea, I fear this could be quite complicated in practice. In particular, managing a cache between multiple processes that may even be running on different hosts is likely to lead to a number of low-level difficulties.

For the use case you mention, it may be better to set up a separate process to act as a caching proxy. Having only one process to manage the files makes things much easier. A quick web search brings up https://github.com/rhelmer/caching-s3-proxy although it seems not to have been touched for a while, and I don't know how well it would work on very large files.
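
As a minimal sketch of why a single owner simplifies things, the Python class below keeps the cache inside one process and serializes access with a lock, so concurrent clients asking for the same range trigger only one upstream fetch. The class name `RangeCacheOwner` and the injected `fetch` callable are hypothetical, and this is only the in-process core that a real proxy would wrap with an HTTP front end.

```python
import threading
from typing import Callable, Dict, Tuple

RangeKey = Tuple[str, int, int]


class RangeCacheOwner:
    """Single-process owner of a byte-range cache (illustrative only)."""

    def __init__(self, fetch: Callable[[str, int, int], bytes]) -> None:
        self._fetch = fetch
        self._lock = threading.Lock()
        self._store: Dict[RangeKey, bytes] = {}

    def get(self, url: str, start: int, end: int) -> bytes:
        key = (url, start, end)
        # Holding the lock across the fetch is the simplest correct choice:
        # it serializes all upstream requests, which is slow but avoids
        # duplicate fetches and any cross-process coordination.
        with self._lock:
            if key not in self._store:
                self._store[key] = self._fetch(url, start, end)
            return self._store[key]
```

A production proxy would want per-key locking, eviction, and on-disk storage for very large files; none of that is shown here.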

cariaso · 2024-02-15T18:49:04Z

I'll give some thought to the mitm style solution you've suggested.

alexpreynolds · 2024-02-16T15:53:26Z

You might also take a look at CloudWatch, which offers some caching of S3 requests.

cariaso · 2024-02-16T16:01:40Z

I have to imagine you didn't intend CloudWatch; that seems very unrelated.

However https://github.com/nginxinc/nginx-s3-gateway is another possibility.

alexpreynolds · 2024-02-16T17:29:29Z

CloudFront offers the ability to cache requests to S3-sourced files. I'm mentioning it here as it may be another option for those using S3 for sharing BAM files. Hope this might be useful. Apologies if you're talking about something else.

edit: I meant CloudFront, not CloudWatch. Sorry!

cariaso · 2024-02-16T17:59:59Z

Can you point to a URL that gives a little more documentation on what you’re thinking of? The CloudWatch that I know serves a very different purpose, as near as I understand.

daviesrob self-assigned this Feb 15, 2024

cariaso closed this as completed Feb 15, 2024
