wishlist s3 cache #1748

Closed
cariaso opened this issue Feb 13, 2024 · 6 comments
@cariaso

cariaso commented Feb 13, 2024

htslib is awesome. I feel embarrassed and greedy to wish for more, but I have a workflow where I see potential for a big win, and I'm sure I'm not the only one who would benefit.

Consider requests-cache (https://requests-cache.readthedocs.io/en/stable/), a transparent drop-in replacement for requests that does local caching.

Now imagine how many of us are pulling the same 0.01% of the same S3 file over and over again during development, and never need the other 99.99% of the BAM/CRAM.

If there were an environment variable, or other config, that let htslib-s3-plugin write to a local cache and read from it when available, it seems like that would pay off quite well for a presumably common use case.

Even if it isn't used in every code path, a partial win would still pay off handsomely.
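
To make the request concrete, here is a rough Python sketch of the idea rather than anything htslib provides. The environment variable name `HTS_S3_CACHE_DIR`, the `RangeCache` class, and the `fetch(url, start, end)` callable are all hypothetical names invented for illustration. The sketch only reuses a cached entry for an exact repeat of the same byte range and never checks whether the remote object has changed.

```python
import hashlib
import os
from pathlib import Path
from typing import Callable, Optional

# Hypothetical variable name for illustration; htslib does not define it.
CACHE_ENV_VAR = "HTS_S3_CACHE_DIR"

# Hypothetical fetcher signature: returns bytes [start, end) of the object at url.
Fetcher = Callable[[str, int, int], bytes]


class RangeCache:
    """Cache byte ranges of remote objects as files in a local directory."""

    def __init__(self, fetch: Fetcher, cache_dir: Optional[str] = None):
        self.fetch = fetch
        # Caching stays off unless a directory is passed in or the env var is set.
        cache_dir = cache_dir or os.environ.get(CACHE_ENV_VAR)
        self.cache_dir = Path(cache_dir) if cache_dir else None
        if self.cache_dir is not None:
            self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _path(self, url: str, start: int, end: int) -> Path:
        # One file per (url, start, end); only exact repeats are cache hits.
        key = hashlib.sha256(f"{url}|{start}|{end}".encode()).hexdigest()
        return self.cache_dir / key

    def read(self, url: str, start: int, end: int) -> bytes:
        """Return bytes [start, end) of url, from the local cache when possible."""
        if self.cache_dir is None:
            return self.fetch(url, start, end)
        path = self._path(url, start, end)
        if path.exists():
            return path.read_bytes()
        data = self.fetch(url, start, end)
        path.write_bytes(data)
        return data
```

During development, the repeated index-driven reads of the same small region would then hit the local directory instead of S3, while any read outside the cached ranges would still go to the network.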

It seems sufficiently adjacent to #1670 that perhaps, during that work, it would be possible to leave some notes about the relevant code paths.

@daviesrob daviesrob self-assigned this Feb 15, 2024
@daviesrob
Member

While a nice idea, I fear this could be quite complicated in practice. In particular, managing a cache between multiple processes that may even be running on different hosts is likely to lead to a number of low-level difficulties.
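
As one illustration of the kind of low-level difficulty meant here, a second process must never read a cache file that another process is still writing. A common way to handle that on a single host is to write to a temporary file and rename it into place. The Python sketch below shows that pattern only; `publish_atomically` is a hypothetical helper name, and the sketch does not address caches shared between hosts, eviction, or stale entries.

```python
import os
import tempfile


def publish_atomically(final_path: str, data: bytes) -> None:
    """Write data so that readers see either no file or the complete file."""
    directory = os.path.dirname(final_path) or "."
    # The temporary file lives in the same directory so the rename stays
    # on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        # os.replace swaps the finished file into place in a single step.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Clean up the partial temporary file if anything went wrong.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```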

For the use case you mention, it may be better to set up a separate process to act as a caching proxy. Having only one process to manage the files makes things much easier. A quick web search brings up https://github.com/rhelmer/caching-s3-proxy although it seems not to have been touched for a while, and I don't know how well it would work on very large files.
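
For a sense of what such a proxy involves, here is a minimal Python sketch using only the standard library. It is not how caching-s3-proxy works, and `make_proxy` is a hypothetical name. It assumes the upstream can be fetched with plain unsigned HTTP GET requests, caches each distinct (path, Range header) pair forever, and omits request signing, error handling, and eviction.

```python
import hashlib
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from pathlib import Path


def make_proxy(upstream: str, cache_dir: str, port: int = 0) -> ThreadingHTTPServer:
    """Build a server that forwards GETs to `upstream`, caching by (path, Range)."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    # Only this process touches the cache, so an in-process lock is enough.
    # It also serialises upstream fetches, which keeps the sketch simple.
    lock = threading.Lock()

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            byte_range = self.headers.get("Range", "")
            key = hashlib.sha256(f"{self.path}|{byte_range}".encode()).hexdigest()
            body_file = cache / f"{key}.body"
            meta_file = cache / f"{key}.json"
            with lock:
                if not meta_file.exists():
                    # Cache miss: fetch from upstream, passing the Range through.
                    req = urllib.request.Request(upstream + self.path)
                    if byte_range:
                        req.add_header("Range", byte_range)
                    with urllib.request.urlopen(req) as resp:
                        body_file.write_bytes(resp.read())
                        meta = {"status": resp.status,
                                "content_range": resp.headers.get("Content-Range")}
                    meta_file.write_text(json.dumps(meta))
                meta = json.loads(meta_file.read_text())
                body = body_file.read_bytes()
            # Replay the upstream status (200 or 206) and Content-Range.
            self.send_response(meta["status"])
            if meta["content_range"]:
                self.send_header("Content-Range", meta["content_range"])
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep the sketch quiet

    return ThreadingHTTPServer(("127.0.0.1", port), Handler)
```

Calling `make_proxy(upstream_url, "/tmp/s3cache", 8080).serve_forever()` (the directory and port are placeholders) would start it, and a client could then request `http://127.0.0.1:8080/<object path>` instead of the original URL. Because one process owns the cache directory, the multi-process coordination problem does not arise.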

@cariaso
Author

cariaso commented Feb 15, 2024

I'll give some thought to the MITM-style solution you've suggested.

@cariaso cariaso closed this as completed Feb 15, 2024
@alexpreynolds

You might also take a look at CloudWatch, which offers some caching of S3 requests.

@cariaso
Author

cariaso commented Feb 16, 2024

I have to imagine you didn't mean CloudWatch; that seems very unrelated.

However, https://github.com/nginxinc/nginx-s3-gateway is another possibility.

@alexpreynolds

alexpreynolds commented Feb 16, 2024

CloudFront offers the ability to cache requests to S3-sourced files. I'm mentioning it here as it may be another option for those using S3 for sharing BAM files. Hope this might be useful. Apologies if you're talking about something else.

edit: I meant CloudFront, not CloudWatch. Sorry!

@cariaso
Author

cariaso commented Feb 16, 2024 via email


3 participants