
Silent truncation of records in Tabix range retrieval after networking failure from S3 bucket #1851

Open
ChristopherWilks opened this issue Oct 15, 2024 · 1 comment

@ChristopherWilks

Hi,

First, thanks for the great tools. I use Tabix/Bgzip extensively in my work and am very grateful to you folks for continuing to make them better (especially the extension of S3/GCS support)!

I think this may be related to #1037; if it is, or if it duplicates another issue I missed in my brief search of the issues list, feel free to close this one or merge it in there. On that note, @daviesrob may be interested in this ticket.

I noticed recently that when running many concurrent tabix queries (using GNU parallel with -j80) against a small set of bgzipped/indexed files on an S3 bucket, from an EC2 instance in the same AWS region, a few of the queries returned empty results when there should have been actual records pulled down, yet no errors were reported (the return status was 0 for all queries).

I am using bash with set -exo pipefail, so I found this odd. [I'm fine with a minority of errors cropping up as long as they're reported---I'll just re-run those queries.]
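
To give a sense of the pattern (the bucket path, region list, and output directory below are placeholders rather than my actual setup), the driver script looks roughly like this:

#!/usr/bin/env bash
# Fan out many tabix range queries against a bgzipped/indexed file on S3.
# If any query fails, parallel exits non-zero and set -e aborts the script.
set -exo pipefail

mkdir -p results

# regions.txt: one "chrom:start-end" range per line (placeholder file name)
parallel -j80 \
    'tabix -D s3://S3_PATH_TO_BUCKET/allpairs.byfeature.gz {} > results/{#}.tsv' \
    :::: regions.txt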

My working hypothesis is that I'm overloading the networking stack (probably a receive buffer somewhere) on the system, and that libcurl is reporting errors for a few of the concurrent jobs which aren't being fully caught and reported by Tabix. That said, libcurl may be the culprit, but I'm assuming it's not in this case.

I'm using htslib version 1.20, but the section of the code where I think this issue lies (below) doesn't appear to differ between 1.20 and the current development branch.

I went back and added some manual debug fprintf's of my own to hfile_libcurl.c where I think the problem may be occurring, just before this line:

return got;

and compiled without optimizations to get full debugging info (not shown here, but I did run a bunch of straces as well). The debug statement is:

fprintf(stderr,"in libcurl_read: got,fp->finished, fp->final_result, to_skip, errno: %ld,%d,%d,%ld,%d\n",got,fp->finished,fp->final_result, to_skip, errno);
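
For reference, by "compiled without optimizations" I mean a rebuild along these lines (the exact flags are illustrative):

cd htslib-1.20
make clean
# Build with -g/-O0 instead of the default optimized flags so values aren't optimized away
make CFLAGS="-g -O0 -Wall"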

The one test instance where I saw something relevant was here:

in libcurl_read: got,fp->finished, fp->final_result, to_skip, errno: 18882,0,-1,-1,0
in libcurl_read: got,fp->finished, fp->final_result, to_skip, errno: 32193,0,-1,-1,0
in libcurl_read: got,fp->finished, fp->final_result, to_skip, errno: 25206,1,0,-1,0
in libcurl_read: got,fp->finished, fp->final_result, to_skip, errno: 25206,1,0,-1,0
in libcurl_read: got,fp->finished, fp->final_result, to_skip, errno: 0,1,0,-1,0
[W::bgzf_read_block] EOF marker is absent. The input may be truncated
    Command being timed: "htslib-1.20/tabix -D s3://S3_PATH_TO_BUCKET/allpairs.byfeature.gz chr12:11456460-11457010"
....
Exit status: 0

That range has records in the bgzipped file on S3, but the output was empty, and I noticed that got here was 0, which is not being caught by libcurl_read(...) in this case.

My quick and dirty solution was to simply add:

if(got == 0) { return -1; }

and that seemed to fix it (in the sense of reporting an error when this happens, which is all I want), though I haven't run extensive tests.

I'm not claiming this fixes all the issues, but it does seem to get at a potential gap in the error checking in that file.

Thanks,
Chris

@whitwham
Contributor

Using your fprintf statement I get this:
in libcurl_read: got,fp->finished, fp->final_result, to_skip, errno: 0,1,0,-1,0
at the end of every download from S3. It looks like a normal part of the process.

Can you check if it appears on your working tabixes?
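
Something like this is all I mean (placeholder bucket path; pick a region that you know returns records):

# Capture stderr from a query that works, then look for the debug line.
tabix -D s3://S3_PATH_TO_BUCKET/allpairs.byfeature.gz chrN:START-END \
    > /dev/null 2> tabix.stderr
grep 'in libcurl_read' tabix.stderr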
