
Update to the depth filtering for mpileup. #1084

Open: wants to merge 2 commits into base: develop
Conversation

jkbonfield (Contributor)

We can now count the depth anywhere along the read rather than only at the left end. It's not buffering reads up, so it doesn't know what's about to come; for any individual column it will therefore still favour reads that end near the column over reads that start near it, but by counting at the right-hand end we avoid the boom & bust behaviour from before, where the coverage could plummet to zero immediately after a large spike.

With iter->depth_pos = 1.0 the coverage ends up being a minimum of maxcnt (assuming the true coverage is higher), rather than a maximum of maxcnt. So we go over-depth, but avoid the drop outs.
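As a rough illustration of the idea (this is a minimal sketch, not the actual patch; the function and parameter names here are made up), the keep/drop test conceptually moves from sampling the depth at the read's start to sampling it at a fractional offset along the read:

    /* Minimal sketch, not the htslib implementation: decide whether to keep a
     * read when the depth is sampled at a fractional position along it.
     * depth_pos_fract = 0.0 samples the left end (old behaviour), 1.0 the
     * right end.  depth[] and maxcnt are assumed to come from the caller. */
    static int keep_read(const int *depth, int read_start, int read_end,
                         double depth_pos_fract, int maxcnt)
    {
        /* Reference position at which the current depth is sampled. */
        int check_pos = read_start
            + (int)(depth_pos_fract * (read_end - read_start));

        /* With depth_pos_fract = 1.0 the sampled column has only been touched
         * by already-accepted reads reaching that far right, so columns tend
         * to fill to at least maxcnt rather than being capped at maxcnt. */
        return depth[check_pos] < maxcnt;
    }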

The following plot shows an example of fake amplicon-based sequencing (350bp amplicons every 300bp) with full unlimited depth, with depth reduction on the current develop branch, with this PR, and with an alternative, more complex set of heuristics [1]. Plots are staggered marginally in X for clarity only.

[Plot "pe": depth comparison of the four approaches]

[1] Reading simulated data consisting of qname, pos and len (length on the reference, not the length of the sequence).

    /* Footnote [1] heuristic, wrapped so it compiles: reads simulated
     * "qname pos len" records from stdin.  The hash functions come from
     * htslib's khash.h; the declarations and constants added here are
     * illustrative, not part of the original snippet. */
    #include <stdio.h>
    #include <limits.h>
    #include <math.h>
    #include "htslib/khash.h"

    static int depth[10000000];          /* per-position accepted depth */

    int main(void) {
        char name[1024];
        int pos, len, max_depth = 1000;  /* target depth limit */
        int min = INT_MAX, max = 0;      /* observed reference extent */
        int max2 = 0;                    /* right-most position to reach max_depth */
        unsigned int seed = 0;

        while (scanf("%s %d %d\n", name, &pos, &len) == 3) {
            if (min > pos)
                min = pos;

            if (max < pos + len)
                max = pos + len;

            if (depth[pos] > max_depth * 0.5) {
                /* The name hash doubles as the random number, so both reads
                 * in a pair get the same value. */
                khint_t h = __ac_Wang_hash(__ac_X31_hash_string(name) ^ seed);
                if ((double)h / 0xffffffff >
                      pow((max-1.0-max2)/(len+.01), 2)
                    * pow((1-(depth[max-1]+.1)/max_depth), 2)
                    * pow((max_depth+.1)/depth[pos], 2) * 2) {
                    continue;
                }
            }

            int p;
            for (p = pos; p < pos+len; p++) {
                depth[p]++;
                if (max2 < p && depth[p] >= max_depth)
                    max2 = p;
            }
        }
        return 0;
    }

The name hash is used as the random number generator for read removal, and it gives a bias towards keeping pairs together. It's not perfect by any stretch, being stochastic, but at 1k depth out of a pool of ~300k depth about 1 in 4 retained reads also had their mate pair present, compared to just 6% when using drand48 as the random number generator instead of h.
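A toy demonstration of why the hash helps (illustrative only; the real behaviour also depends on the depth terms in the snippet above): both reads of a template share the qname, so a qname-derived value gives them the same keep/drop outcome, whereas drand48 decides each read independently.

    /* Compare how often both mates of a template survive a stochastic
     * keep/drop decision when driven by a qname hash (shared by both mates)
     * versus drand48 (independent per read).  The hash functions are the
     * ones from htslib's khash.h used in the snippet above. */
    #include <stdio.h>
    #include <stdlib.h>
    #include "htslib/khash.h"

    int main(void) {
        const double keep = 0.25;   /* chance an individual read survives */
        int n = 100000, hash_pairs = 0, rand_pairs = 0, i;
        char qname[32];

        for (i = 0; i < n; i++) {
            snprintf(qname, sizeof(qname), "template%07d", i);
            double h = (double)__ac_Wang_hash(__ac_X31_hash_string(qname))
                       / 0xffffffff;

            /* Both mates share qname, so both see the same value h. */
            if (h < keep && h < keep)
                hash_pairs++;

            /* drand48 rolls separately for each mate. */
            if (drand48() < keep && drand48() < keep)
                rand_pairs++;
        }

        /* Expect ~25% of templates intact via the hash, ~6% via drand48. */
        printf("pairs intact: hash %.1f%%, drand48 %.1f%%\n",
               100.0 * hash_pairs / n, 100.0 * rand_pairs / n);
        return 0;
    }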

@jkbonfield (Contributor, Author)

Note that using a depth measured part way into the future does mean a higher depth, as we don't know how many other reads will appear between our current processing point and the place where we measured the depth. The depth figure largely becomes a minimum value rather than a mean value.

This can be seen in this real-world data, at depth 10.7 (70%).

[Plot "E10"]

If we wanted the depth to be lower then we could take the read length into account and make a stab at what's to come, or just manually change it. E.g. see the same plot at 5.7 instead.

[Plot "E5"]

Some other example plots, from synthetic testing:

[Plots "pb", "pc", "pd"]

@jkbonfield (Contributor, Author)

In the synthetic testing the more complex heuristics definitely work better, but they're really hard to understand and get a grasp of. The only reason I think they may be beneficial is the enrichment of paired data. We may wish to consider something like this for a better samtools view -s $fract option.

coord.  It's not so vital with the revised formula, so within 10 works
well enough and prevents some of the more excessive depth increases.

Also (ifdef-ed out) implemented the "other" more complex method, for
comparison.  Cull whichever we wish before merging.
@jkbonfield (Contributor, Author)

I thought of one way the old depth method could collapse while in the shower this morning (yes, I do still have those during covid times; occasionally!).

The old method attempts to always ensure we have coverage at every base, even when moving on from a deep region, by not filtering data unless it starts at the same location as earlier deep data. I.e. the code only filters reads if read pos == current iterator pos, i.e. multiple reads starting at the same point.

It then checks the depth at pos. So if all reads have cigar 101M and we have a high depth of reads starting at pos 1000, then when iter->pos increments from 1000 to 1001 the first read is added. All subsequent reads at 1001, however, match iter->pos and are removed, as the number of reads in flight (overlapping 1001) is too high due to the high depth at pos 1000. That periodically gives a crash down to depth 1. The mistake is that it's checking the depth of reads overlapping the position, not the depth of reads starting at 1001. The intention was good, but it wasn't thought through.
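To make the collapse concrete, here's a toy simulation (not the htslib code) of that filtering rule, using the 350bp-amplicons-every-300bp layout from the synthetic plots above; the numbers are made up, but the accepted depth alternates between roughly the depth limit and 1:

    /* Hypothetical simulation of the old rule: only filter a read when it
     * starts at the same position as the previously accepted read AND the
     * depth of overlapping in-flight reads is already at the limit. */
    #include <stdio.h>

    #define REF_LEN   1200
    #define READ_LEN  350   /* amplicon read length */
    #define AMP_STEP  300   /* distance between amplicon start positions */
    #define PER_AMP   1000  /* reads sequenced per amplicon */
    #define MAX_DEPTH 100   /* target depth limit */

    int main(void) {
        static int depth[REF_LEN + READ_LEN];
        int last_pos = -1, start, i, p;

        for (start = 0; start < REF_LEN; start += AMP_STEP) {
            for (i = 0; i < PER_AMP; i++) {
                /* Old rule: same start as the last accepted read, and the
                 * overlapping depth (which includes the previous amplicon's
                 * reads) is already too high. */
                if (start == last_pos && depth[start] >= MAX_DEPTH)
                    continue;
                last_pos = start;
                for (p = start; p < start + READ_LEN; p++)
                    depth[p]++;
            }
        }

        /* Accepted depth at each amplicon midpoint: expect ~100, 1, ~100, 1,
         * as each deep segment suppresses everything starting under it. */
        for (p = 0; p < REF_LEN; p += AMP_STEP)
            printf("pos %4d: depth %d\n", p + AMP_STEP/2, depth[p + AMP_STEP/2]);
        return 0;
    }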

Secondly, all it takes is for that initial read to be soft-clipped and not actually extend beyond 1100, and we then hit a zero-depth output at 1101. I don't think this is the sole cause of our gaps, because in our amplicon sequencing we're not incrementing by 1 but by ~150ish, yet we're still seeing crashes in depth. So I think it's likely there are other ways that iter->pos can move on. However, it shows at least one way the existing method fails.

In both cases the fix is to measure the depth at the last base rather than the first, which is coincidentally precisely what this PR does. Further, by doing that we can relax the constraint that we only remove reads if the start pos matches the last one. That constraint is tragic on high-depth long-read data, as we pretty much never remove anything.

By being able to specify where the depth measurement is taken (from 0.0 being the first base to 1.0 being the last base), we gain some control over how sensitive we wish to be to potential zero-depth regions and how wide they may be. It's not as good as a future buffer of reads, as we're still deciding what to do without knowing what's about to come, but it's a simple fix.

A better fix, as discussed outside this PR, is a modified samtools view -s that can use knowledge of future reads to ascertain whether the current read is required, and that works in a template-pair fashion so depth reduction doesn't break pairing. That can always be piped into mpileup if required, so keeping mpileup as the simple method and providing a general-purpose subsetting mechanism as a separate tool seems reasonable.

@daviesrob (Member)

Unfortunately it looks like the simple solution here can overshoot the target depth quite badly if you set depth_pos_fract = 1.0, as in this example where the nominal depth limit was 400 (note that the "no limit" and "Count pos 100%" lines are on top of each other):
[Plot "overshoot_example"]

This happens because most of the reads are exactly the same length and the depth added at each position is below the limit. As the depth is being measured at the far end, it starts off at zero for each new reference position, so everything can get through. The overshoot is much less bad when counting closer to the start of the read (I tried 50% and 80%), but then the problem of badly undershooting reappears:
[Plot "undershoot_example"]
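A minimal sketch of that overshoot mechanism (the read length and per-position read count are made up; the 400 limit is taken from the example above): when every read has the same length and the depth is sampled at its far end, the sampled column only ever contains reads from the same start position, so the limit never triggers.

    /* Illustrative sketch: all reads are length L, at most K reads start per
     * position, and the keep/drop test samples depth at the far end of each
     * read (depth_pos_fract = 1.0).  Reads starting earlier end earlier, so
     * the sampled column only holds reads from the same start position. */
    #include <stdio.h>

    int main(void) {
        const int L = 100, K = 50, maxcnt = 400, n_pos = 1000;
        static int depth[1100];
        int start, i, p;

        for (start = 0; start < n_pos; start++) {
            for (i = 0; i < K; i++) {
                int check = start + L - 1;        /* far-end sampling point */
                if (depth[check] >= maxcnt)       /* never true here: <= K-1 */
                    continue;
                for (p = start; p < start + L; p++)
                    depth[p]++;
            }
        }

        /* Steady-state column depth is K*L = 5000, well above maxcnt = 400. */
        printf("depth at pos 500: %d (limit %d)\n", depth[500], maxcnt);
        return 0;
    }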

The example here was from this SAM file; I subtracted 104524607 from the positions when making the depth plots to keep the x-axis values small.

I'm not sure if this problem can be solved without keeping track of a bigger pool of reads...
