-
Notifications
You must be signed in to change notification settings - Fork 424
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ncbi egapx #6456
base: main
Are you sure you want to change the base?
Add ncbi egapx #6456
Conversation
I start with NCBI's docker image from dockerhub and create a new one which includes an installation of Nexflow. The augmented container is uploaded to quay.io. The strategy is to run this augmented docker container on a single large node. I'm working on a scheduled github action that will check for new versions of EGAPx on dockerhub, build new augmented containers, and upload them to quay.io. If someone already has something like this, I can just reuse it. If not, when it's ready I can submit the github action to the IUC as well. |
#else: | ||
#set yamlconfig = $yamlin | ||
#end if | ||
source /galaxy/env.bash && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add please a comment what this one is doing?
tools/ncbi_egapx/ncbi_egapx.xml
Outdated
</param> | ||
</when> | ||
<when value="history"> | ||
<param name="yamlin" type="data" format="yaml,txt" label="egapx configuration yaml file to pass to Nextflow"/> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
<param name="yamlin" type="data" format="yaml,txt" label="egapx configuration yaml file to pass to Nextflow"/> | |
<param name="yamlin" type="data" format="yaml" label="egapx configuration yaml file to pass to Nextflow"/> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like we should add here more red warnings ... if people specify here too many cores, most schedulers will just kill the job ... so people need to do what they are doing here.
tools/ncbi_egapx/ncbi_egapx.xml
Outdated
help="Select a built in, history or remote URI for the reference genome fasta"> | ||
<option value="history" selected="True">Use a genome fasta file from the current history</option> | ||
<option value="indexed">Use a Galaxy server built-in genome</option> | ||
<option value="uri">Provide a remote web link URI ("https://...") pointing at the required genome reference fasta file</option> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
<option value="uri">Provide a remote web link URI ("https://...") pointing at the required genome reference fasta file</option> | |
<option value="uri">Provide a remote web link URI ("https://...") pointing at the required genome reference FASTA file</option> |
add a validator?
maybe a proper URI validator in Galaxy would be useful?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. I'm assuming there's already python code somewhere that validates strings based on RFC 3986. Maybe there should also be a "check" or similar option to the URL validator that if the input string passes the RFC 3986 check, it will do an HTTP HEAD to see if the resource actually exists?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The URI validator should also include RFC 3987, Internationalized Resource Identifiers (IRIs).
tools/ncbi_egapx/ncbi_egapx.xml
Outdated
</collection> | ||
</outputs> | ||
<tests> | ||
<test expect_test_failure="true"> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No successful test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tests both succeed. I set expect_test_failure="true"
because I wasn't sure if the tests would succeed when being run by the repo's CI. I can give it a try on my personal repo and see how it goes.
|
||
.. class:: warningmark | ||
|
||
**Proof of concept: a hack to run a NF workflow inside a specialised Galaxy tool wrapper** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it really surprising that this works? We can execute anything that runs on a command line, or?
|
||
It is also very new and in rapid development. Investing developer effort and keeping updated as EGAPx changes rapidly may be *inefficient of developer resources*. | ||
|
||
This wrapper is designed to allow measuring how *inefficient* it is in terms of computing resource utilisation, in comparison to the developer effort |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This wrapper is designed to allow measuring
How will this be measured?
inefficient it is in terms of computing resource utilisation
This was my 1st thought, having one resource allocation for the whole pipeline will not work on many systems. On our system jobs that are not utilizing the allocated CPUs/memory efficiently are automatically killed.
We already have quite a few tools wrapping (non-NF) pipelines which all have this problem. On my system I can only assign 1 CPU to them, because they are killed otherwise.
|
||
**Proof of concept: a hack to run a NF workflow inside a specialised Galaxy tool wrapper** | ||
|
||
EGAPx is a big, complicated Nextflow workflow, challenging and costly to re-implement **properly**, requiring dozens of new tools and replicating a lot of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My second thought was that, while it is certainly challenging and a lot of work to integrate the tools it would pay of for the Galaxy project as a whole.
EGAPx is a big, complicated Nextflow workflow,
Is it systematically tested (not as far as I see, at least not on github)
tools/ncbi_egapx/ncbi_egapx.xml
Outdated
<param name="genome_type_select" value="uri"/> | ||
<param name="uri" value="https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/020/809/275/GCF_020809275.1_ASM2080927v1/GCF_020809275.1_ASM2080927v1_genomic.fna.gz"/> | ||
<param name="rna_type_select" value="list"/> | ||
<param name="rnaseq" value="https://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/EGAP/sample_data/Dermatophagoides_farinae_small/SRR8506572.1 https://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/EGAP/sample_data/Dermatophagoides_farinae_small/SRR8506572.2 https://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/EGAP/sample_data/Dermatophagoides_farinae_small/SRR9005248.1 https://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/EGAP/sample_data/Dermatophagoides_farinae_small/SRR9005248.2"/> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be location
instead of value
and the URLs comma separated?
</when> | ||
</conditional> | ||
<param name="proteins" type="data" format="fasta,tasta.gz" optional="true" label="Select a protein set"/> | ||
<param name="xtra" type="text" area="true" label="Additional yaml to append to the egapx.yaml configuration" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably need a validator for free text entry (and disabling the sanitizer as definitely a bad idea). Maybe better allow yaml upload?
tools/ncbi_egapx/ncbi_egapx.xml
Outdated
help="Select a built in, history or remote URI for the reference genome fasta"> | ||
<option value="history" selected="True">Use a genome fasta file from the current history</option> | ||
<option value="indexed">Use a Galaxy server built-in genome</option> | ||
<option value="uri">Provide a remote web link URI ("https://...") pointing at the required genome reference fasta file</option> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if we should implement the URI download in a tool. Users can upload URIs and defer the upload until job execution.
</param> | ||
</when> | ||
<when value="history"> | ||
<param name="rnaseq" type="data" format="fastqsanger,fastqsanger.gz" multiple="true" label="Select multiple RNA-seq fastqsanger inputs from the current history" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
paired end is irrelevant?
Adding NCBI Eukaryotic Genome Annotation Pipeline - External (EGAPx)
FOR CONTRIBUTOR: