Don't align transcripts with different numbers of exons #195

reece · 2015-09-28T19:53:49Z

Originally reported by Reece Hart (Bitbucket: reece, GitHub: reece) in biocommons/uta #195
Migrated by bitbucket-issue-migration on 2016-09-09 15:15:07

UTA historically has aligned transcript and genomic exons even when the number of exons in each exon set differs. This practice masks real issues in underlying data and should be discontinued.
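A minimal sketch of the proposed behavior, assuming exons are represented as (start, end) tuples in exon order. The names `pair_exons` and `ExonCountMismatch` are hypothetical and are not part of the UTA loader; the point is only that a count mismatch is raised as an error rather than aligned around.

```python
class ExonCountMismatch(ValueError):
    """Raised when two exon sets cannot be paired exon-for-exon."""


def pair_exons(tx_exons, alt_exons):
    """Pair transcript exons with genomic exons, refusing mismatched sets.

    Both arguments are lists of (start, end) tuples in exon order.
    (Hypothetical helper, not UTA code.)
    """
    if len(tx_exons) != len(alt_exons):
        # Surface the data problem instead of masking it.
        raise ExonCountMismatch(
            f"transcript has {len(tx_exons)} exons, "
            f"alignment has {len(alt_exons)}"
        )
    return list(zip(tx_exons, alt_exons))
```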

gostachowiak · 2020-09-07T11:00:37Z

I have discovered an issue with transcript NM_001278433.1 (gene PRKAR1A), which I believe is an example of this issue. If my understanding is incorrect, please let me know.

Exon sets for the transcript:

SET search_path=uta_20180821;
SELECT * FROM exon_set WHERE tx_ac='NM_001278433.1';

267741	NM_001278433.1	AC_000149.1	1	splign	2014-02-11 01:22:19.920492
332948	NM_001278433.1	NC_000017.10	1	blat	2014-02-11 02:40:24.121284
267727	NM_001278433.1	NC_000017.10	1	splign	2014-02-11 01:22:19.920492
763376	NM_001278433.1	NC_000017.11	1	splign	2016-08-27 17:40:37.616249
267735	NM_001278433.1	NC_018928.2	1	splign	2014-02-11 01:22:19.920492
738588	NM_001278433.1	NM_001278433.1	1	transcript	2016-08-27 10:28:27.974572
88837	NM_001278433.1	NM_001278433.1	1	transcript/8ecabff0	2014-02-11 00:00:18.455632
344311	NM_001278433.1	NM_001278433.1	1	transcript/92190059	2015-08-25 22:44:41.311184

The GRCh37 splign chromosomal alignment has 10 exons:

SET search_path=uta_20180821;
SELECT * FROM exon WHERE exon_set_id='267727';

The "self" alignment has 11 exons:

SET search_path=uta_20180821;
SELECT * FROM exon WHERE exon_set_id='738588';

Judging by exon lengths, the discrepancy is in exon 1. As a result, g-to-c calculations using hgvs give bad results for variants along the entire transcript.

My assumption was that "transcript" is the relevant self-alignment, and not "transcript/8ecabff0" or "transcript/92190059".

reece · 2020-09-09T04:38:32Z

First, I'm impressed that you dove this far into UTA internals!

I don't know the story for this transcript specifically, and these data are 4-6 years old, perhaps from the time before NCBI released gff files. So, this might be hard to reproduce now from sources.

When alt_aln_method contains /, it means that the UTA loader encountered a case where the definition provided by NCBI changed over time. When this happens, UTA deprecates the existing one by renaming the alignment method. (The hash after the / is a truncated md5 made by serializing the start,end coordinates and CDS start,end.)
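To illustrate the renaming scheme, here is a rough sketch of how such a suffix could be derived. The serialization format, the function names, and the 8-character truncation (inferred from suffixes like 8ecabff0) are all assumptions for illustration; the actual UTA loader may serialize the fields differently, so this will not necessarily reproduce UTA's hashes.

```python
import hashlib


def exon_set_digest(exons, cds_start, cds_end, length=8):
    """Return a truncated md5 over exon coordinates and CDS bounds.

    ASSUMPTION: the payload layout below is illustrative only and is
    not taken from the UTA source.
    """
    payload = ";".join(f"{start},{end}" for start, end in exons)
    payload += f"|{cds_start},{cds_end}"
    return hashlib.md5(payload.encode("ascii")).hexdigest()[:length]


def deprecated_method(alt_aln_method, digest):
    """Build the renamed method, e.g. 'transcript' -> 'transcript/<digest>'."""
    return f"{alt_aln_method}/{digest}"
```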

The presence of / nearly always means that the assembly and/or alignments are problematic. So, proceed with caution.

In uta_20190926, I see this:

anonymous@uta/uta=> set search_path  = uta_20190926 ;
anonymous@uta/uta=> select alt_ac, alt_aln_method, n_exons from tx_exon_set_summary_mv where tx_ac = 'NM_001278433.1' order by 2;
┌────────────────┬─────────────────────┬─────────┐
│     alt_ac     │   alt_aln_method    │ n_exons │
├────────────────┼─────────────────────┼─────────┤
│ NC_000017.10   │ blat                │      11 │
│ NC_018928.2    │ splign              │      10 │
│ AC_000149.1    │ splign              │      10 │
│ NC_000017.10   │ splign              │      11 │
│ NC_000017.11   │ splign              │      11 │
│ NC_000017.10   │ splign/04e3c837     │      10 │
│ NM_001278433.1 │ transcript          │      11 │
│ NM_001278433.1 │ transcript/8ecabff0 │      11 │
│ NM_001278433.1 │ transcript/92190059 │      10 │
└────────────────┴─────────────────────┴─────────┘

So, it looks to me as though you should upgrade to uta_20190926, in which NM_001278433.1 aligns to NC_000017.10 and NC_000017.11 without issues.
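The reading of the summary above can be sketched as a small check: drop deprecated exon sets (those whose alt_aln_method contains /), then compare each remaining alignment's n_exons with the current "transcript" exon set. The rows are copied from the uta_20190926 output above; the helper names are hypothetical.

```python
# (alt_ac, alt_aln_method, n_exons), copied from the query output above.
ROWS = [
    ("NC_000017.10", "blat", 11),
    ("NC_018928.2", "splign", 10),
    ("AC_000149.1", "splign", 10),
    ("NC_000017.10", "splign", 11),
    ("NC_000017.11", "splign", 11),
    ("NC_000017.10", "splign/04e3c837", 10),
    ("NM_001278433.1", "transcript", 11),
    ("NM_001278433.1", "transcript/8ecabff0", 11),
    ("NM_001278433.1", "transcript/92190059", 10),
]


def split_method(alt_aln_method):
    """Split 'splign/04e3c837' into ('splign', '04e3c837'); digest is None if absent."""
    base, _, digest = alt_aln_method.partition("/")
    return base, (digest or None)


def mismatched_current_sets(rows):
    """Return current (non-deprecated) alignments whose exon count differs
    from the current transcript exon set."""
    current = [(ac, m, n) for ac, m, n in rows if split_method(m)[1] is None]
    tx_n = next(n for _, m, n in current if m == "transcript")
    return [(ac, m, n) for ac, m, n in current if m != "transcript" and n != tx_n]


for row in mismatched_current_sets(ROWS):
    print(row)
```

With these rows, the NC_000017.10 and NC_000017.11 alignments pass, while the two remaining 10-exon splign sets (NC_018928.2 and AC_000149.1) are flagged.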

Please close if that answers your question.

gostachowiak · 2020-09-09T15:16:38Z

Reece:

Thank you very much for your time; that was helpful.

I don't see uta_20190926 as a tag on the dockerhub page, so I wasn't sure if it was advisable to use:
https://hub.docker.com/r/biocommons/uta/tags

Is this version an "official" release that was built/validated to the same standards as the uta_20180821 version?

Also, if we did update to the 2019 uta, which versions of hgvs and seqrepo would you recommend moving up to?

We currently use:

uta: uta_20180821
seqrepo: 2018-08-21
hgvs: 1.3.0

Thanks again.

Matt

reece · 2020-09-12T16:34:24Z

uta_20190926 currently has an issue (#228) that prevents us from building Docker images. A change was made to materialize a very large view, and the materialization had been running for more than 12 hours when I killed it. We'll need to unwind that before distributing Docker images.

You should be able to use any version of hgvs. The change log may help you figure out whether any of the changes since 1.3.0 are relevant to you.

Unfortunately, you'll have to wait on the uta fixes. No ETA yet.

reece added the major and enhancement labels Sep 9, 2016

reece added this to the 0.3.0 milestone Sep 9, 2016

github-actions bot removed the major label Nov 27, 2023