Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to appropriately interpret & utilize strand information in call-1 #69

Open
Sfeng666 opened this issue Jan 16, 2024 · 1 comment
Open

Comments

@Sfeng666
Copy link

Hi Michael & fellow users,

I have a few questions about the strand information in JACUSA 2 output, and the proper usage of that information for RNA editing identification & frequency estimation:

In the variant detection scenario call-1 and given an RNA sample sequenced on the first strand, using -P RF-FIRSTSTRAND will allow some sites to have two lines of output, each for a strand, like this bed-like out output:

#contig start   end     name    score   strand  bases11 info    filter  ref
3       10084   10085   call-1  0.0     +       0,5,0,0 *       *       C
3       10084   10085   call-1  0.0     -       0,0,7,0 *       *       G
3       10085   10086   call-1  0.0     +       0,0,5,0 *       *       G
3       10085   10086   call-1  0.0     -       0,7,0,0 *       *       C
3       10086   10087   call-1  0.0     +       0,0,0,5 *       *       T
3       10086   10087   call-1  0.0     -       7,0,0,0 *       *       A
3       10087   10088   call-1  0.0     +       5,0,0,0 *       *       A
3       10087   10088   call-1  0.0     -       0,0,0,7 *       *       T

Above is a simple case of non-editing sites, where the ref allele is simply inverted, but the allele counts (here equal to depth) are not the same (but similar) for inverted bases. How to interpret this? My guess is that there are some reads mapped to both strands, contributing to allele counts on both strands, while reads that mapped specifically to either strands contribute to the difference in allele counts. Can you confirm on this?

#contig start   end     name    score   strand  bases11 info    filter  ref
3       10088   10089   call-1  1.556381955545767       +       2,0,3,0 *       *       G
3       10088   10089   call-1  1.4505124214123128      -       0,5,0,2 *       *       C

In a more complicated case, where RNA editing is detected on both strands, I feel uncertain about how to interpret this result. I wonder if it's ok to interpret this as true RNA editing events on both strands, or if we are more confident about editing called on one of the strands, based on certain statistics (e.g., Z score)?

If my guess about the first example is correct, it could be possible that the 3 reads covering G on the positive strand are contributed by the same reads that also mapped to C on the negative strand (base order : A, C, G, T). If unfortunately those reads were from negative strand-sourced transcripts, then RNA editing occurred at the genomic location on negative strand.

The problem is: we won't know what proportion of allele counts are actually contributed by the same reads, nor which strand of transcripts were those reads from.

I could think of the following options to determine which strand an editing is more likely to occurred at, when JACUSA2 detected candidate editing on both strands of a given site:

  1. Simply pick the strand that has a higher z score (i.e., the divergence between sample and simulated base vectors, if in call-1).
  2. Pick the strand that has a higher total coverage at that site; if coverage ties between strands, pick the strand with a higher z score.
  3. Discard these sites on both strands.

Since the proportion of overlapping genes on opposite strands is generally low (an average of 3.3% of overlapping exons on opposite strands in human chromosome), I think it is fine to either discard these sites, or pick one strand and give up potential information on the other. What do you think?

I know this is a long discussion, but I would appreciate a lot if you could clarify my questions, and/or offer your insights in how to solve the dual-strand problem.

@piechottam
Copy link
Contributor

call-1 works like call-2 when it comes to library type and strand information.
The only difference is that an in silico sample is simulated from the existing read information by substituting all non-reference base calls with the reference base.

  • Are you sure, that "RF-FIRSTSTRAND" is correct library type? You could check with how the strand information derived in JACUSA2 call-1 compares to existing GTF annotation.
  • Do you have an IGV snapshot of the problematic region?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants