
Update to the depth filtering for mpileup. #1084

Open: wants to merge 2 commits into base: develop
Conversation

jkbonfield (Contributor)

We can now count the depth anywhere along the read rather than only at the left end. It's not buffering reads up, so it doesn't know what's about to come; for any individual column it will therefore still favour reads that end near the column over reads that start near it, but by counting at the right-hand end we avoid the boom & bust behaviour from before, where the coverage could plummet to zero immediately after a large spike.

With iter->depth_pos = 1.0 the coverage ends up being a minimum of maxcnt (assuming the true coverage is higher), rather than a maximum of maxcnt. So we go over-depth, but avoid the drop outs.
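As a rough illustration of the idea (this is a minimal sketch, not the actual patch; the function and parameter names here are made up), the keep/drop test conceptually moves from sampling the depth at the read's start to sampling it at a fractional offset along the read:

    /* Minimal sketch, not the htslib implementation: decide whether to keep a
     * read when the depth is sampled at a fractional position along it.
     * depth_pos_fract = 0.0 samples the left end (old behaviour), 1.0 the
     * right end.  depth[] and maxcnt are assumed to come from the caller. */
    static int keep_read(const int *depth, int read_start, int read_end,
                         double depth_pos_fract, int maxcnt)
    {
        /* Reference position at which the current depth is sampled. */
        int check_pos = read_start
            + (int)(depth_pos_fract * (read_end - read_start));

        /* With depth_pos_fract = 1.0 the sampled column has only been touched
         * by already-accepted reads reaching that far right, so columns tend
         * to fill to at least maxcnt rather than being capped at maxcnt. */
        return depth[check_pos] < maxcnt;
    }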

The following plot shows an example of fake amplicon-based sequencing (350bp amplicons every 300bp) with full unlimited depth, with depth reduction on the current develop branch, with this PR, and with an alternative, more complex set of heuristics [1]. Plots are staggered marginally in X for clarity only.

[Plot "pe": depth comparison of the four approaches]

[1] Reading simulated data consisting of qname, pos and len (length on the reference, not the length of the sequence).

    /* Footnote [1] heuristic, wrapped so it compiles: reads simulated
     * "qname pos len" records from stdin.  The hash functions come from
     * htslib's khash.h; the declarations and constants added here are
     * illustrative, not part of the original snippet. */
    #include <stdio.h>
    #include <limits.h>
    #include <math.h>
    #include "htslib/khash.h"

    static int depth[10000000];          /* per-position accepted depth */

    int main(void) {
        char name[1024];
        int pos, len, max_depth = 1000;  /* target depth limit */
        int min = INT_MAX, max = 0;      /* observed reference extent */
        int max2 = 0;                    /* right-most position to reach max_depth */
        unsigned int seed = 0;

        while (scanf("%s %d %d\n", name, &pos, &len) == 3) {
            if (min > pos)
                min = pos;

            if (max < pos + len)
                max = pos + len;

            if (depth[pos] > max_depth * 0.5) {
                /* The name hash doubles as the random number, so both reads
                 * in a pair get the same value. */
                khint_t h = __ac_Wang_hash(__ac_X31_hash_string(name) ^ seed);
                if ((double)h / 0xffffffff >
                      pow((max-1.0-max2)/(len+.01), 2)
                    * pow((1-(depth[max-1]+.1)/max_depth), 2)
                    * pow((max_depth+.1)/depth[pos], 2) * 2) {
                    continue;
                }
            }

            int p;
            for (p = pos; p < pos+len; p++) {
                depth[p]++;
                if (max2 < p && depth[p] >= max_depth)
                    max2 = p;
            }
        }
        return 0;
    }

The name hash is used as the random number generator for read removal, and it gives a bias towards keeping pairs together. It's not perfect by any stretch, being stochastic, but at 1k depth out of a pool of ~300k depth about 1 in 4 retained reads also had their mate pair present, compared to just 6% when using drand48 as the random number generator instead of h.
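A toy demonstration of why the hash helps (illustrative only; the real behaviour also depends on the depth terms in the snippet above): both reads of a template share the qname, so a qname-derived value gives them the same keep/drop outcome, whereas drand48 decides each read independently.

    /* Compare how often both mates of a template survive a stochastic
     * keep/drop decision when driven by a qname hash (shared by both mates)
     * versus drand48 (independent per read).  The hash functions are the
     * ones from htslib's khash.h used in the snippet above. */
    #include <stdio.h>
    #include <stdlib.h>
    #include "htslib/khash.h"

    int main(void) {
        const double keep = 0.25;   /* chance an individual read survives */
        int n = 100000, hash_pairs = 0, rand_pairs = 0, i;
        char qname[32];

        for (i = 0; i < n; i++) {
            snprintf(qname, sizeof(qname), "template%07d", i);
            double h = (double)__ac_Wang_hash(__ac_X31_hash_string(qname))
                       / 0xffffffff;

            /* Both mates share qname, so both see the same value h. */
            if (h < keep && h < keep)
                hash_pairs++;

            /* drand48 rolls separately for each mate. */
            if (drand48() < keep && drand48() < keep)
                rand_pairs++;
        }

        /* Expect ~25% of templates intact via the hash, ~6% via drand48. */
        printf("pairs intact: hash %.1f%%, drand48 %.1f%%\n",
               100.0 * hash_pairs / n, 100.0 * rand_pairs / n);
        return 0;
    }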

@jkbonfield (Contributor, Author)

Note that using a depth measured part way into the future does mean a higher depth, as we don't know how many other reads will appear between our current processing point and the place where we measured the depth. The depth figure largely becomes a minimum value rather than a mean value.

This can be seen in this real-world data, at depth 10.7 (70%).

[Plot "E10"]

If we wanted the depth to be lower then we could take the read length into account and make a stab at what's to come, or just manually change it. E.g. see the same plot at 5.7 instead.

[Plot "E5"]

Some other example plots, from synthetic testing:

[Plots "pb", "pc", "pd"]

@jkbonfield (Contributor, Author)

In the synthetic testing the more complex heuristics definitely work better, but they're really hard to understand and get a grasp of. The only reason I think they may be beneficial is the enrichment of paired data. We may wish to consider something like this for a better samtools view -s $fract option.

coord.  It's not so vital with the revised formula, so within 10 works
well enough and prevents some of the more excessive depth increases.

Also (ifdef-ed out) implemented the "other" more complex method, for
comparison.  Cull whichever we wish before merging.
@jkbonfield (Contributor, Author)

I thought of one way the old depth method could collapse while in the shower this morning (yes, I do still have those during covid times; occasionally!).

The old method attempts to always ensure we have coverage at every base, even when moving on from a deep region, by not filtering data unless it starts at the same location as earlier deep data. I.e. the code only filters reads if read pos == current iterator pos, i.e. multiple reads starting at the same point.

It then checks the depth at pos. So if all reads have cigar 101M and we have a high depth of reads starting at pos 1000, then when iter->pos increments from 1000 to 1001 the first read is added. All subsequent reads at 1001, however, match iter->pos and are removed, as the number of reads in flight (overlapping 1001) is too high due to the high depth at pos 1000. That periodically gives a crash down to depth 1. The mistake is that it's checking the depth of reads overlapping the position, not the depth of reads starting at 1001. The intention was good, but it wasn't thought through.
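To make the collapse concrete, here's a toy simulation (not the htslib code) of that filtering rule, using the 350bp-amplicons-every-300bp layout from the synthetic plots above; the numbers are made up, but the accepted depth alternates between roughly the depth limit and 1:

    /* Hypothetical simulation of the old rule: only filter a read when it
     * starts at the same position as the previously accepted read AND the
     * depth of overlapping in-flight reads is already at the limit. */
    #include <stdio.h>

    #define REF_LEN   1200
    #define READ_LEN  350   /* amplicon read length */
    #define AMP_STEP  300   /* distance between amplicon start positions */
    #define PER_AMP   1000  /* reads sequenced per amplicon */
    #define MAX_DEPTH 100   /* target depth limit */

    int main(void) {
        static int depth[REF_LEN + READ_LEN];
        int last_pos = -1, start, i, p;

        for (start = 0; start < REF_LEN; start += AMP_STEP) {
            for (i = 0; i < PER_AMP; i++) {
                /* Old rule: same start as the last accepted read, and the
                 * overlapping depth (which includes the previous amplicon's
                 * reads) is already too high. */
                if (start == last_pos && depth[start] >= MAX_DEPTH)
                    continue;
                last_pos = start;
                for (p = start; p < start + READ_LEN; p++)
                    depth[p]++;
            }
        }

        /* Accepted depth at each amplicon midpoint: expect ~100, 1, ~100, 1,
         * as each deep segment suppresses everything starting under it. */
        for (p = 0; p < REF_LEN; p += AMP_STEP)
            printf("pos %4d: depth %d\n", p + AMP_STEP/2, depth[p + AMP_STEP/2]);
        return 0;
    }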

Secondly, all it takes is for that initial read to be soft-clipped and not actually extend beyond 1100, and we then hit a zero-depth output at 1101. I don't think this is the sole cause of our gaps, because in our amplicon sequencing we're not incrementing by 1 but by ~150ish, yet we're still seeing crashes in depth. So I think it's likely there are other ways that iter->pos can move on. However, it shows at least one way the existing method fails.

In both cases the fix is to measure the depth at the last base rather than the first, which is coincidentally precisely what this PR does. Further, by doing that we can relax the constraint that we only remove reads if the start pos matches the last one. That constraint is tragic on high-depth long-read data, as we pretty much never remove anything.

By being able to specify where the depth measurement is taken (from 0.0 being the first base to 1.0 being the last base), we gain some control over how sensitive we wish to be to potential zero-depth regions and how wide they may be. It's not as good as a future buffer of reads, as we're still deciding what to do without knowing what's about to come, but it's a simple fix.

A better fix, as discussed outside this PR, is a modified samtools view -s that can use knowledge of future reads to ascertain whether the current read is required, and that works in a template-pair fashion so depth reduction doesn't break pairing. That can always be piped into mpileup if required, so keeping mpileup as the simple method and providing a general-purpose subsetting mechanism as a separate tool seems reasonable.

@daviesrob (Member)

Unfortunately it looks like the simple solution here can overshoot the target depth quite badly if you set depth_pos_fract = 1.0, as in this example where the nominal depth limit was 400 (note that the "no limit" and "Count pos 100%" lines are on top of each other):
[Plot "overshoot_example"]

This happens because most of the reads are exactly the same length and the depth added at each position is below the limit. As the depth is being measured at the far end, it starts off at zero for each new reference position, so everything can get through. The overshoot is much less bad when counting closer to the start of the read (I tried 50% and 80%), but then the problem of badly undershooting reappears:
[Plot "undershoot_example"]
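A minimal sketch of that overshoot mechanism (the read length and per-position read count are made up; the 400 limit is taken from the example above): when every read has the same length and the depth is sampled at its far end, the sampled column only ever contains reads from the same start position, so the limit never triggers.

    /* Illustrative sketch: all reads are length L, at most K reads start per
     * position, and the keep/drop test samples depth at the far end of each
     * read (depth_pos_fract = 1.0).  Reads starting earlier end earlier, so
     * the sampled column only holds reads from the same start position. */
    #include <stdio.h>

    int main(void) {
        const int L = 100, K = 50, maxcnt = 400, n_pos = 1000;
        static int depth[1100];
        int start, i, p;

        for (start = 0; start < n_pos; start++) {
            for (i = 0; i < K; i++) {
                int check = start + L - 1;        /* far-end sampling point */
                if (depth[check] >= maxcnt)       /* never true here: <= K-1 */
                    continue;
                for (p = start; p < start + L; p++)
                    depth[p]++;
            }
        }

        /* Steady-state column depth is K*L = 5000, well above maxcnt = 400. */
        printf("depth at pos 500: %d (limit %d)\n", depth[500], maxcnt);
        return 0;
    }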

The example here was from this SAM file; I subtracted 104524607 from the positions when making the depth plots to keep the x-axis values small.

I'm not sure if this problem can be solved without keeping track of a bigger pool of reads...
