-
Notifications
You must be signed in to change notification settings - Fork 244
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Accept U as a valid base type and convert to T for BAM. #1728
Comments
Interesting. I figured everyone just represented it as a T from the start. It's not clear to me why it's important to be able to store it as U in an RNA FASTQ or SAM file. It's already an abstraction, it just seems simpler to call it T and save us all the headache. Is there any use case where you could reasonably encounter both and it wouldn't be clear from the context which it is already? I'm unconvinced about the 'hybrid adapter' read that was mentioned. The only scenarios I can think of where this might be important seem very exotic.
If we need to disambiguate at read time I like the idea you had of tracking it in the read group header. That makes a lot more sense to me than using a FLAG bit. Sure, we have a few left, but why waste them on something useless? It should be an easy change to allow reading in a FASTQ with U and converting it on the fly. I think we'll break lots of things if we want to natively represent U as an option in the SAMRecord by default. |
Also, update your java! You're way out of warranty and probably using broken crypto libraries or something. :) |
Lol you don't have to tell me! Our main farm machines are Ubuntu 22 LTS, which has OpenJDK 11 as the package. I've no idea why it's so ancient. The Ubuntu universe packages go up to OpenJDK 21 (or higher?) I think for Ubuntu 22 so we could (and should) be updating such things. Shrug. It's a pain to do all this yourself all the while, especially once you start getting dependencies etc. Tedium, but it's what a lot of groups do, or just run virtual machines / docker containers to avoid the outdated base OS. This all feels like needless effort. As for U vs T, I couldn't agree more really. It amazed me we didn't silently support it from day 1 (it's in IUPAC afterall). Having said that, it's also surprising anyone really cares. I thought most people just used T as synonymous and go on with life :). Why make problems for yourself? I think perhaps with ONT though the base caller does have discrete models for the chemical characteristics of U vs T being different, in order to improve accuracy, and consequentially they're probably outputting U vs T as a useful confirmation to the user that they chose the correct calibration files. I can see that does have a benefit, albeit small. There is discussion on mixed DNA and RNA on the hts-specs issue so you should probably read it there, but in summary it's possible in theory due to DNA adapters added on to RNA, but whether we actually want to honour that in SAM/BAM is another thing. Generally we move adapters or barcodes over to tags by the time we've got to SAM, but it's not always the case in FASTQ hence they're perhaps being very meticulous and precise in their labelling. PS. Yes I could totally imagine a basecaller with A, C, G, T, U bases in it and some autodetection based on whether most T/U end up T or U (possibly with a second pass to tighten up the calling with the correct base). It would be a neat application to distinguish the two on the fly. |
Before you submit
Make sure your issue is not already in the htsjdk issue tracker
It was discussed in #1478, now closed, but see below
Description of the issue:
The background to this is samtools/samtools#2131 where a user has a SAM file containing U. Certainly we know ONT write fastqs with U (if RNA) and when they write BAM they translate them to T, but I'm still not entirely sure where this SAM came from. It broke htslib, and would also break htsjdk.
Your environment:
I checked the source, so largely irrelevant, but yeah our system supported Java is too old for the latest picard. :/
Steps to reproduce
Create a SAM file with U in the sequence.
SamFormatConverter from SAM to BAM fails with:
Expected behaviour
I'd like it to encode U as T in BAM so the data at least can be round-tripped. The user has the option of doing a T to U substitution on decode if they wish later on. Ideally it'd be tracked in the meta-data somewhere too.
See samtools/hts-specs#800 and samtools/hts-specs#801 for context.
Given U is IUPAC, I feel it was an early accident to disallow it. I contast to #1478 I disagree that this is a base modification. There are still 4 base types, but DNA and RNA differ in the chemical structure for T vs U. We don't need to track which bases are T and which are U as it's simply a property of the material. Furthermore IUPAC doesn't permit this. It has no ambiguity codes for e.g. A/U. The only mention of U in the original IUPAC paper was for V listed as not-T or not-U. All other not-? codes are the original base type +1 char. It's clear they chose V over U due to the T/U issue, but it's also clear the authors basically treat T and U as interchangeable and we should do too.
FWIW, I've already made this change to htslib in a merged PR, but not yet in a release.
The text was updated successfully, but these errors were encountered: