Increase the input block size for bgzip. #1768
base: develop
Conversation
Force-pushed 2270f34 to aa6f354
Edited the buffer down to 256 kB instead of 1 MB, as that seems to be sufficient (tested on tmpfs, fast NFS and Lustre).
Hmm, it's still variable! The effect of a bigger block size is more memcpy, as we have fewer direct reads (as @daviesrob points out, a readv could partially solve that by reading directly into the caller's buffer with the remainder going to the look-ahead cache, but it doesn't fit the backend semantics). On the other hand, it reduces the impact of many small reads caused by the small pipe size. Hence on some machines a bigger block size is slower, and it can have weird interactions with CPU load too which I cannot explain: e.g. it's a win at -l0, but a penalty at -l1. It probably needs more head scratching.
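For illustration, a minimal sketch of the readv idea mentioned above, assuming a hypothetical reader with a separate look-ahead cache; the names are invented for the example and, as noted, this doesn't map onto the actual hFILE backend semantics:

```c
#include <sys/uio.h>   /* readv */
#include <unistd.h>    /* ssize_t, size_t */

/* Hypothetical sketch only: satisfy the caller's request and top up a
 * look-ahead cache in a single syscall, avoiding the extra memcpy of a
 * fill-cache-then-copy scheme.  "cache"/"cache_sz" are invented names,
 * not hFILE internals. */
static ssize_t read_direct_plus_cache(int fd, void *caller_buf, size_t want,
                                      char *cache, size_t cache_sz)
{
    struct iovec iov[2] = {
        { .iov_base = caller_buf, .iov_len = want     },  /* direct to caller */
        { .iov_base = cache,      .iov_len = cache_sz }   /* remainder cached */
    };
    /* Returns the total bytes read; the caller must work out how much of
     * it went past "want" and therefore now sits in the cache. */
    return readv(fd, iov, 2);
}
```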
I retested this with different block sizes on a few systems, this time measuring aggregate wall-clock time over 100 trials of the same file, so it was cached and fairly reproducible. This was bgzip -l5, but our systems all have CPU frequency scaling on and the filesystems are sometimes shared, so there are still sources of error and randomness. Even so, the charts are interesting: 128 kB is enough on the Intel machine and 256 kB is enough on the AMD one. I also tried an older Intel machine, but it just flatlined the CPU and buffer size made very little difference; trying -l1 instead changed it a little, but not significantly. I also tested reading an actual file from Lustre on the AMD system rather than a pipe, and performance was essentially identical regardless of buffer size, which backs up my findings in the linked issue. Looking at CPU time instead of elapsed time shows a small drop from around 128 kB onwards, but it's not huge and probably not worth taking into account. So I think the 256 kB here is fine, as would 128 kB be, and we should consider this PR again. We could also consider doing an fstat and only applying it on pipes.
Force-pushed aa6f354 to fdd6143
I think 128 kB blocks are better, but would it be a good idea to put this in hFILE so all pipe users could benefit? It should probably only apply to FIFOs though, to avoid potential issues with over-reading on index-based jobs. The current limit of 32k in …
Commit e495718 changed bgzip from raw POSIX read() calls to hread(). Unfortunately hread gets its buffer size from a stat of the input file descriptor, which can be 4 kB for a pipe. We're reading 0xff00 bytes, so this mostly ends up being split over two reads, with one or both involving additional memcpys. This makes the buffered I/O perform worse than unbuffered. In the most extreme cases (cat data | bgzip -l0 > /dev/null) it is a two-fold slowdown.

The easy solution is just to increase the buffer size to something sensible. Currently we play it cautiously and only do this on pipes and FIFOs.

Fixes samtools#1767
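To see the numbers the commit message refers to, here is a small standalone sketch (not htslib code) that prints what stat reports for stdin; on a typical Linux box a pipe shows st_blksize = 4096:

```c
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Print the block-size hint stat gives for stdin.  Piped input on Linux
 * typically reports 4096, well below bgzip's 0xff00-byte reads. */
int main(void) {
    struct stat st;
    if (fstat(STDIN_FILENO, &st) < 0) { perror("fstat"); return 1; }
    printf("st_blksize = %ld\n", (long) st.st_blksize);
    printf("pipe/FIFO:   %s\n", S_ISFIFO(st.st_mode) ? "yes" : "no");
    return 0;
}
```

Note that S_ISFIFO is true for both anonymous pipes and named FIFOs, so this check cannot tell the two apart (as mentioned below).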
Force-pushed fdd6143 to 186d21b
It's not possible to distinguish between anonymous pipes and named pipes (FIFOs), but I don't think that matters much and it still works OK on named pipes. The real problem is that the buffer size reported by stat is a pathetic 4096 bytes, which is actually smaller than the standard Linux pipe size. The pipe is still internally buffered at 64 kB, but we're reading it in small chunks. I suspect 4096 is reported because it's the size the OS supports for atomic writes, but we're optimising for speed here. I've moved the change to hfile instead.
Commit e495718 changed bgzip from raw POSIX read() calls to hread(). Unfortunately hread gets its buffer size from a stat of the input file descriptor, which can be 4 kB for a pipe. We're reading 0xff00 bytes, so this mostly ends up being split over two reads, with one or both involving additional memcpys. This makes the buffered I/O perform worse than unbuffered. In the most extreme cases (cat data | bgzip -l0 > /dev/null) this is a two-fold slowdown.

The easy solution is just to increase the buffer size to something sensible. It's a little messy as we have to use hfile_internal.h to get hfile_set_blksize, but it works. I'm not sure why we didn't elect to make that API more public; probably simply out of caution.
Fixes #1767
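As a rough sketch of what the hfile-side change amounts to, assuming hfile_set_blksize keeps its hfile_internal.h declaration; the 128 kB constant and the helper name here are illustrative, not the exact patch:

```c
#include <sys/stat.h>
#include "hfile_internal.h"   /* hfile_set_blksize() — internal API */

/* Illustrative helper: widen the hFILE read-ahead buffer when the
 * descriptor is a pipe/FIFO, overriding the 4 kB hint from stat.
 * The name and the 128 kB size are assumptions for this sketch. */
static void widen_buffer_for_pipes(hFILE *fp, int fd)
{
    struct stat st;
    if (fstat(fd, &st) == 0 && S_ISFIFO(st.st_mode))
        hfile_set_blksize(fp, 128 * 1024);
}
```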