Summary: GCS Read/Write could use resumable uploads and downloads to recover better from network failures or other transient issues
I've noticed that htslib's Google Cloud Storage support makes a single request for each download and upload, without using resumable, multi-chunk, or multipart transfers. I believe that using resumable uploads, and Range headers on downloads, could significantly improve reliability when working with GCS, and would potentially fix most of the GCS read/write bugs reported here. I've personally had a pretty bad time trying to read and write large files in GCS: it works intermittently, but we hit failures every few hours, which makes dealing with large files infeasible.
My group is trying to work in Google Cloud via terra.bio, and we're hoping to stream input and output directly from Google Cloud Storage so that we can avoid copying around >1 TB vcf.gz and bcf files: samtools/bcftools#2235

Google's recommendations for streaming uploads and downloads are here:
https://cloud.google.com/storage/docs/streaming-uploads
https://cloud.google.com/storage/docs/streaming-downloads
I see two main ways to approach this. It would be possible to have hfile_gcs wrap hfile_libcurl just like it currently does, make a request to start a resumable upload before it starts sending data, and then handle creating a new hFILE for each large chunk. hfile_gcs could also handle retrying, although if we want robust retry logic we'd need to keep each chunk in memory until we know it's been successfully sent.
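To make the first option concrete, here is a minimal sketch of the HTTP flow involved, driven with libcurl directly rather than through hfile_libcurl: one POST to open a resumable session, then a PUT of a 256 KiB chunk against the session URI returned in the Location header. This is not htslib code; the bucket and object names are placeholders, it assumes an OAuth2 access token in a hypothetical GCS_TOKEN environment variable, and error handling and retries are left out.

```c
/* Hypothetical sketch -- not htslib code.  Start a GCS resumable upload
 * session and send one 256 KiB chunk with libcurl.  Assumes an OAuth2
 * access token in the GCS_TOKEN environment variable and placeholder
 * bucket/object names; error handling and retries are minimal.
 * Build: cc resumable_sketch.c -lcurl */
#include <curl/curl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>

static char session_uri[2048];

/* Header callback: capture the Location header from the initiation
 * response, which holds the resumable session URI. */
static size_t grab_location(char *buf, size_t size, size_t nitems, void *userp)
{
    size_t len = size * nitems;
    (void) userp;
    if (len > 9 && strncasecmp(buf, "location:", 9) == 0) {
        const char *p = buf + 9;
        size_t n = len - 9;
        while (n > 0 && (*p == ' ' || *p == '\t')) { p++; n--; }
        while (n > 0 && (p[n-1] == '\r' || p[n-1] == '\n')) n--;
        if (n < sizeof(session_uri)) { memcpy(session_uri, p, n); session_uri[n] = '\0'; }
    }
    return len;
}

int main(void)
{
    const char *token = getenv("GCS_TOKEN");  /* assumption: token supplied by caller */
    size_t chunk_len = 256 * 1024;            /* non-final chunks must be 256 KiB multiples */
    char *chunk = calloc(1, chunk_len);       /* stand-in for buffered output data */
    char auth[1024], range[128];
    long status = 0;

    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *curl = curl_easy_init();
    if (!curl || !token || !chunk) return 1;
    snprintf(auth, sizeof(auth), "Authorization: Bearer %s", token);

    /* Step 1: initiate the resumable session (JSON API, uploadType=resumable).
     * The session URI comes back in the Location header. */
    struct curl_slist *h1 = curl_slist_append(NULL, auth);
    h1 = curl_slist_append(h1, "X-Upload-Content-Type: application/octet-stream");
    curl_easy_setopt(curl, CURLOPT_URL,
                     "https://storage.googleapis.com/upload/storage/v1/b/"
                     "my-bucket/o?uploadType=resumable&name=the-output.bcf");
    curl_easy_setopt(curl, CURLOPT_POSTFIELDS, "");
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, h1);
    curl_easy_setopt(curl, CURLOPT_HEADERFUNCTION, grab_location);
    if (curl_easy_perform(curl) != CURLE_OK || session_uri[0] == '\0') return 1;

    /* Step 2: PUT the first chunk to the session URI.  The total size is
     * "*" because we're streaming and don't know it yet; GCS answers 308
     * for intermediate chunks and 200/201 once the final one arrives. */
    snprintf(range, sizeof(range), "Content-Range: bytes 0-%zu/*", chunk_len - 1);
    struct curl_slist *h2 = curl_slist_append(NULL, auth);
    h2 = curl_slist_append(h2, range);
    curl_easy_reset(curl);
    curl_easy_setopt(curl, CURLOPT_URL, session_uri);
    curl_easy_setopt(curl, CURLOPT_CUSTOMREQUEST, "PUT");
    curl_easy_setopt(curl, CURLOPT_POSTFIELDS, chunk);
    curl_easy_setopt(curl, CURLOPT_POSTFIELDSIZE, (long) chunk_len);
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, h2);
    curl_easy_perform(curl);
    curl_easy_getinfo(curl, CURLINFO_RESPONSE_CODE, &status);
    printf("chunk upload HTTP status: %ld (308 = server expects more)\n", status);

    curl_slist_free_all(h1);
    curl_slist_free_all(h2);
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    free(chunk);
    return 0;
}
```

Each intermediate chunk is acknowledged with HTTP 308, which is what the retry/buffering logic in hfile_gcs would key off before discarding a chunk it has kept in memory.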
The other approach would be to rework, extend, or wrap hfile_s3_write, because Google Cloud Storage also supports XML multipart uploads matching the S3 API: https://cloud.google.com/storage/docs/multipart-uploads (a sketch of the initiation request follows the CLI note below).

It's also possible to work around this in some situations by using the GCP CLI to do the read/write, though that won't work everywhere. For example, one can:

gcloud storage cat gs://my-bucket/my-file.bcf | bcftools view | gcloud storage cp - gs://my-bucket/the-output.bcf
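For the hfile_s3_write-style route, this is a similarly hedged sketch of just the initiation step of the XML multipart API (same hypothetical GCS_TOKEN token and placeholder names as above). The XML response carries an UploadId, which the later "PUT ?partNumber=N&uploadId=..." part uploads and the final "POST ?uploadId=..." completion request would need; parsing that response is omitted here.

```c
/* Hypothetical sketch -- not htslib code.  Initiate a GCS XML-API
 * multipart upload with libcurl.  Same assumptions as above: token in
 * GCS_TOKEN, placeholder bucket/object names, minimal error handling.
 * Build: cc multipart_init_sketch.c -lcurl */
#include <curl/curl.h>
#include <stdio.h>
#include <stdlib.h>

/* Print the XML response body; it contains the <UploadId> that the
 * later "upload part" and "complete" requests need. */
static size_t show_body(char *buf, size_t size, size_t nitems, void *userp)
{
    (void) userp;
    fwrite(buf, size, nitems, stdout);
    return size * nitems;
}

int main(void)
{
    const char *token = getenv("GCS_TOKEN");    /* assumption */
    char auth[1024];
    if (!token) return 1;
    snprintf(auth, sizeof(auth), "Authorization: Bearer %s", token);

    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *curl = curl_easy_init();
    if (!curl) return 1;

    struct curl_slist *hdrs = curl_slist_append(NULL, auth);

    /* "POST <object>?uploads" starts the multipart upload.  Parts would
     * then be sent with "PUT <object>?partNumber=N&uploadId=..." and the
     * object assembled with a final "POST <object>?uploadId=...". */
    curl_easy_setopt(curl, CURLOPT_URL,
                     "https://storage.googleapis.com/my-bucket/the-output.bcf?uploads");
    curl_easy_setopt(curl, CURLOPT_POSTFIELDS, "");
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, hdrs);
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, show_body);
    CURLcode rc = curl_easy_perform(curl);

    curl_slist_free_all(hdrs);
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return rc == CURLE_OK ? 0 : 1;
}
```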
It would also be really nice to use range requests for reading, as it'd be possible to request just one bgzf block at a time if you're doing random I/O.
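For the read side, a single ranged request against the JSON API's media endpoint looks like this (placeholder names, token from a hypothetical GCS_TOKEN as before). libcurl's CURLOPT_RANGE becomes a "Range: bytes=..." header, and the server responds with a 206 containing only the requested slice, which is the shape of request you'd want when fetching one bgzf block at a time.

```c
/* Hypothetical sketch -- not htslib code.  Fetch one byte range of an
 * object, the way a single bgzf block could be requested during random
 * access.  Same assumptions: token in GCS_TOKEN, placeholder names.
 * Build: cc range_get_sketch.c -lcurl */
#include <curl/curl.h>
#include <stdio.h>
#include <stdlib.h>

/* Count the bytes received instead of storing them. */
static size_t count_bytes(char *buf, size_t size, size_t nitems, void *userp)
{
    (void) buf;
    *(size_t *) userp += size * nitems;
    return size * nitems;
}

int main(void)
{
    const char *token = getenv("GCS_TOKEN");    /* assumption */
    char auth[1024];
    size_t got = 0;
    if (!token) return 1;
    snprintf(auth, sizeof(auth), "Authorization: Bearer %s", token);

    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *curl = curl_easy_init();
    if (!curl) return 1;

    struct curl_slist *hdrs = curl_slist_append(NULL, auth);

    /* Media download via the JSON API (alt=media).  CURLOPT_RANGE adds a
     * "Range: bytes=0-65535" header, so the server returns HTTP 206 with
     * only that slice of the object. */
    curl_easy_setopt(curl, CURLOPT_URL,
                     "https://storage.googleapis.com/storage/v1/b/my-bucket/o/"
                     "my-file.bcf?alt=media");
    curl_easy_setopt(curl, CURLOPT_RANGE, "0-65535");   /* example: first 64 KiB */
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, hdrs);
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, count_bytes);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &got);
    CURLcode rc = curl_easy_perform(curl);
    printf("received %zu bytes\n", got);

    curl_slist_free_all(hdrs);
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return rc == CURLE_OK ? 0 : 1;
}
```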
These are all good things that we should add. Our cloud storage code as a whole needs a review to see what we can do better. At the moment we are spread a bit thin due to other projects, but hopefully we can get to this in the not-too-distant future.