Funcotator - removing the required transcript file #8863

Draft
wants to merge 11 commits into
base: master
from

Conversation

jonn-smith
Collaborator

This PR removes the requirement for Gencode datasources to have transcript files. This is the first step in fixing some long-standing funcotator issues involving variants that run over the edge of transcripts.

@gatk-bot

gatk-bot commented Jun 5, 2024

Github actions tests reported job failures from actions build 9386157234
Failures in the following jobs:

Test Type JDK Job ID Logs
unit 17.0.6+10 9386157234.12 logs
integration 17.0.6+10 9386157234.11 logs
unit 17.0.6+10 9386157234.1 logs
integration 17.0.6+10 9386157234.0 logs

@gatk-bot

gatk-bot commented Jun 11, 2024

Github actions tests reported job failures from actions build 9470383170
Failures in the following jobs:

Test Type JDK Job ID Logs
unit 17.0.6+10 9470383170.12 logs
integration 17.0.6+10 9470383170.11 logs
unit 17.0.6+10 9470383170.1 logs
integration 17.0.6+10 9470383170.0 logs

@gatk-bot

gatk-bot commented Jun 26, 2024

Github actions tests reported job failures from actions build 9685898135
Failures in the following jobs:

Test Type JDK Job ID Logs
unit 17.0.6+10 9685898135.12 logs
integration 17.0.6+10 9685898135.11 logs
unit 17.0.6+10 9685898135.1 logs
integration 17.0.6+10 9685898135.0 logs

@gatk-bot

gatk-bot commented Jun 26, 2024

Github actions tests reported job failures from actions build 9688050367
Failures in the following jobs:

Test Type JDK Job ID Logs
unit 17.0.6+10 9688050367.12 logs
integration 17.0.6+10 9688050367.11 logs
unit 17.0.6+10 9688050367.1 logs
integration 17.0.6+10 9688050367.0 logs

@gatk-bot

gatk-bot commented Jun 27, 2024

Github actions tests reported job failures from actions build 9688780153
Failures in the following jobs:

Test Type JDK Job ID Logs
unit 17.0.6+10 9688780153.12 logs
integration 17.0.6+10 9688780153.11 logs
unit 17.0.6+10 9688780153.1 logs
integration 17.0.6+10 9688780153.0 logs

@gatk-bot

gatk-bot commented Jun 27, 2024

Github actions tests reported job failures from actions build 9690704227
Failures in the following jobs:

Test Type JDK Job ID Logs
unit 17.0.6+10 9690704227.12 logs
integration 17.0.6+10 9690704227.11 logs
unit 17.0.6+10 9690704227.1 logs
integration 17.0.6+10 9690704227.0 logs

@gatk-bot

gatk-bot commented Jun 27, 2024

Github actions tests reported job failures from actions build 9690907019
Failures in the following jobs:

Test Type JDK Job ID Logs
integration 17.0.6+10 9690907019.11 logs
integration 17.0.6+10 9690907019.0 logs

@gatk-bot

gatk-bot commented Jun 27, 2024

Github actions tests reported job failures from actions build 9699861398
Failures in the following jobs:

Test Type JDK Job ID Logs
integration 17.0.6+10 9699861398.11 logs
integration 17.0.6+10 9699861398.0 logs

@jamesemery
Collaborator

@jonn-smith Looks like that did the trick; the tests are passing now. How do you want to handle this branch? Should I give you a review, or should I take it over and let you review the changes I make to it?

Collaborator

@jamesemery jamesemery left a comment

Some clean-up requests. A few questions about assumptions. One possible bug or oversight. Also a suggestion for a stronger test that I think would put to bed any lingering questions about this branch.


// If we're on the reverse strand, we need to reverse complement the sequence:
if ( transcript.getGenomicStrand() == Strand.NEGATIVE ) {
return new String(BaseUtils.simpleReverseComplement(TranscriptUtils.extractTrascriptFromReference(referenceContext, transcriptFeatureList, doExonContigConversionToB37ForTranscripts).getBytes())) + tailPaddingBases;
Collaborator

The tail padding bases are appended as-is after the rest of the expression is reverse-complemented. That seems incorrect and could lead to a nasty bug down the line.
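To make the concern concrete, here is a minimal, self-contained sketch of how the three possible ways of combining a reverse-complemented sequence with its padding differ. It does not use the GATK `BaseUtils` API; the class `ReverseComplementSketch`, its helper methods, and the example bases are all hypothetical, and the sketch assumes both strings are given in forward-strand orientation. Which result Funcotator actually needs depends on how the padding is defined, so this only shows that the variants are not interchangeable.

```java
final class ReverseComplementSketch {

    /** Complements one base; anything other than A/C/G/T (either case) becomes N. */
    static char complement(final char base) {
        switch (Character.toUpperCase(base)) {
            case 'A': return 'T';
            case 'C': return 'G';
            case 'G': return 'C';
            case 'T': return 'A';
            default:  return 'N';
        }
    }

    /** Reverses the sequence and complements every base. */
    static String reverseComplement(final String bases) {
        final StringBuilder sb = new StringBuilder(bases.length());
        for (int i = bases.length() - 1; i >= 0; i--) {
            sb.append(complement(bases.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(final String[] args) {
        // Hypothetical forward-strand bases, chosen only for illustration.
        final String sequence = "AACG";
        final String padding = "GT";

        // Reverse complement of the whole forward-strand string.
        System.out.println(reverseComplement(sequence + padding));
        // Padding reverse-complemented, but still appended after the sequence.
        System.out.println(reverseComplement(sequence) + reverseComplement(padding));
        // Padding appended without reverse complementing it at all.
        System.out.println(reverseComplement(sequence) + padding);
    }
}
```

The first line printed follows the identity RC(a + b) = RC(b) + RC(a): bases that trail the sequence on the forward strand end up in front of it once the combined string is reverse-complemented.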

// Finally, if we're on the reverse strand, we need to reverse complement the UTR bases.
// NOTE: the extra bases are not reverse complemented because they are not part of the UTR.
if (transcript.getGenomicStrand() == Strand.NEGATIVE) {
return new String(BaseUtils.simpleReverseComplement(utrBases.getBytes())) + new String(BaseUtils.simpleReverseComplement(extraBases.getBytes()));
Collaborator

Is it true that the extra bases go AFTER the sequence here? It looks like they do.

private static final ReferenceDataSource refDataSourceHg19Ch3;
private static final ReferenceDataSource refDataSourceB37;

private static final List<AutoCloseable> autoCloseableList = new ArrayList<>();
Collaborator

These tests are reasonable. However, they missed the off-by-one bug one layer up from this method, in the code that grabs the transcript bases, which is not exposed (and possibly not as easy to test, because it involves the transcript datasources directly).

private static final List<AutoCloseable> autoCloseableList = new ArrayList<>();
static {
refDataSourceHg19Ch3 = ReferenceDataSource.of( IOUtils.getPath(FuncotatorReferenceTestUtils.retrieveHg19Chr3Ref()) );
refDataSourceB37 = ReferenceDataSource.of( IOUtils.getPath(b37Reference) );
Collaborator

How difficult would it be to check in a chunk of a datasource that HAS these transcript files in it, call the get___fromReference() methods, and do a direct string comparison against the checked-in transcript file from gencode? That seems like the strongest possible test and not too terribly difficult to write.

Collaborator Author

Good idea. I think we should have this available for testing already. I can add another test to do this.
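As a rough illustration of the comparison proposed in this thread, the sketch below splices exon intervals out of a reference string and checks the result against the matching entry parsed from FASTA text. Everything in it is hypothetical: the class `TranscriptFastaComparisonSketch`, both helper methods, and the toy data are not Funcotator or GATK APIs. It assumes forward-strand transcripts, 1-based inclusive exon coordinates, and a pipe-delimited FASTA header whose first field is the transcript ID. A real test would go through the datasource and the get___fromReference() methods instead.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class TranscriptFastaComparisonSketch {

    /** Parses FASTA text into ID -> sequence; the ID is the header up to the first '|' or whitespace. */
    static Map<String, String> parseFasta(final String fastaText) {
        final Map<String, StringBuilder> builders = new LinkedHashMap<>();
        StringBuilder current = null;
        for (final String rawLine : fastaText.split("\\R")) {
            final String line = rawLine.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith(">")) {
                final String id = line.substring(1).split("[|\\s]")[0];
                current = new StringBuilder();
                builders.put(id, current);
            } else if (current != null) {
                current.append(line);
            }
        }
        final Map<String, String> sequences = new LinkedHashMap<>();
        builders.forEach((id, bases) -> sequences.put(id, bases.toString()));
        return sequences;
    }

    /** Concatenates 1-based, inclusive exon intervals {start, end} from a forward-strand reference. */
    static String spliceExons(final String reference, final int[][] exons) {
        final StringBuilder spliced = new StringBuilder();
        for (final int[] exon : exons) {
            // 1-based inclusive [start, end] maps to 0-based half-open [start - 1, end).
            spliced.append(reference, exon[0] - 1, exon[1]);
        }
        return spliced.toString();
    }

    public static void main(final String[] args) {
        // Toy data, invented for this sketch.
        final String reference = "ACGTACGTAC";
        final int[][] exons = { {2, 4}, {7, 9} };
        final String fasta = ">TX1.1|GENE1|\nCGT\nGTA\n";

        final String fromReference = spliceExons(reference, exons);
        final String fromFasta = parseFasta(fasta).get("TX1.1");
        System.out.println(fromReference.equals(fromFasta) ? "match" : "MISMATCH");
    }
}
```

A whole-string comparison like this is sensitive to a coordinate conversion that is off by one at either end of an exon, which is the kind of error the lower-level tests missed.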

3 participants