Normalize paths for genome fetch and some of the genome indexer data managers, plus additional modernization #6489

Draft: natefoo wants to merge 7 commits into base: main

@natefoo natefoo commented Oct 25, 2024

Update the genome fetch and most commonly used indexer DMs to normalize the on-disk layout as proposed in galaxyproject/galaxy#19013.

In addition:

  • For those that had Python wrappers, I dropped the wrappers. In some cases this avoids building a special mulled container just for the DM
  • Updated some underlying tool versions
  • Added some tests of non-default options
  • For the STAR DM, automatically calculate the --genomeSAindexNbases and --genomeChrBinNbits options as recommended by the manual, to drastically reduce the index size for small genomes.
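
As a rough sketch of the STAR scaling mentioned above (not the code in this PR): the STAR manual suggests `min(14, log2(GenomeLength)/2 - 1)` for `--genomeSAindexNbases` and `min(18, log2(max(GenomeLength/NumberOfReferences, ReadLength)))` for `--genomeChrBinNbits`. The function name, the `read_length` default of 100, the truncation to an integer, and the lower bound of 1 are assumptions made for illustration.

```python
import math


def star_index_params(genome_length, n_references, read_length=100):
    """Suggest STAR index options from genome size (hypothetical helper)."""
    if genome_length < 1 or n_references < 1:
        raise ValueError("genome_length and n_references must be positive")
    # --genomeSAindexNbases: min(14, log2(GenomeLength)/2 - 1).
    # The lower bound of 1 is an assumption to keep tiny inputs valid.
    sa_index_nbases = max(1, min(14, int(math.log2(genome_length) / 2 - 1)))
    # --genomeChrBinNbits:
    # min(18, log2(max(GenomeLength/NumberOfReferences, ReadLength)))
    per_reference = max(genome_length / n_references, read_length)
    chr_bin_nbits = min(18, int(math.log2(per_reference)))
    return sa_index_nbases, chr_bin_nbits


print(star_index_params(100_000, 1))
```

For a single 100 kb sequence this yields 7 and 16, while a human-sized genome stays at the defaults of 14 and 18.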

FOR CONTRIBUTOR:

  • I have read the CONTRIBUTING.md document and this tool is appropriate for the tools-iuc repo.
  • License permits unrestricted use (educational + commercial)
  • This PR adds a new tool or tool collection
  • This PR updates an existing tool or tool collection
  • This PR does something else (explain below)

- Drop colorspace builder
- Drop python wrapper
- Update bowtie version
- Add tests
- Drop python wrapper
- Update bowtie2 version
- Add test of non-default options
- Drop python wrapper
- Update bwa version
- Add test of non-default options
- Drop python wrapper
- Add options to automatically calculate --genomeSAindexNbases and --genomeChrBinNbits
@natefoo natefoo changed the title from "Normalize paths for genome fetch and some" to "Normalize paths for genome fetch and some of the genome indexer data managers, plus additional modernization" Oct 25, 2024
@bgruening bgruening left a comment

Great work Nate!

Member

Does this file need to be here?

Member Author

Yes, this is just a rename from rnastar_index2_versioned.loc, which was inconsistent with its name in all other tables. An empty .loc as referenced in tool_data_table_conf.xml.test must exist in order to be written to by the test.

@@ -10,13 +9,12 @@
<column name="path" output_ref="out_file" >
<move type="directory" relativize_symlinks="True">
<!-- <source>${path}</source>--> <!-- out_file.extra_files_path is used as base by default --> <!-- if no source, eg for type=directory, then refers to base -->
-<target base="${GALAXY_DATA_MANAGER_DATA_PATH}">${dbkey}/bwa_mem_index/${value}</target>
+<target base="${GALAXY_DATA_MANAGER_DATA_PATH}">genomes/${dbkey}/bwa_mem_index/v1/${value}</target>
Member

Where does the "v1" come from?

Member Author

See the proposal, this is because it's version 1 of bwa-mem.

</move>
-<value_translation>${GALAXY_DATA_MANAGER_DATA_PATH}/${dbkey}/bwa_mem2_index/${value}/${path}</value_translation>
+<value_translation>${GALAXY_DATA_MANAGER_DATA_PATH}/genomes/${dbkey}/bwa_mem_index/v2/${value}/${path}</value_translation>
Member

Ah, now I see ... Not sure if the v1/v2 should be under bwa_mem_index. I would assume those are separate tools - separate indices.

Member

Or in other words, as an admin, I would search for bwa_mem2_index

Member Author

I did consider that. I ended up with this because, under the scheme in the proposal, every DM will now contain a version directory. Naming the directories after the tools would result in:

  • genomes/${dbkey}/bwa_mem_index/v1/${value}
  • genomes/${dbkey}/bwa_mem2_index/v2/${value}

Which is kind of redundant and negates the purpose of the version directory in this case. That said I can understand how you would expect to have the directory name match the indexer name, although we already violate that for the other DMs, since the table name (bowtie_indexes) is plural and the tool ID/directory (bowtie_index) is not (to say nothing of the bowtie DMs using both "indexes" in the table/directory and "indices" in the loc file name).
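
To make the two candidate layouts concrete, here is a small sketch that expands the path pattern with Python's `string.Template`, which uses the same `${...}` placeholder syntax. The helper name `index_path` and the sample `hg38` values are hypothetical and only serve the illustration.

```python
from string import Template

# Path pattern discussed above, relative to GALAXY_DATA_MANAGER_DATA_PATH.
LAYOUT = Template("genomes/${dbkey}/${index_dir}/${version}/${value}")


def index_path(dbkey, index_dir, version, value):
    """Expand the layout pattern for one index (hypothetical helper)."""
    return LAYOUT.substitute(
        dbkey=dbkey, index_dir=index_dir, version=version, value=value
    )


# Layout used in this PR: one directory per index family, version below it.
print(index_path("hg38", "bwa_mem_index", "v1", "hg38"))
print(index_path("hg38", "bwa_mem_index", "v2", "hg38"))

# Alternative raised in review: the tool name already encodes the version.
print(index_path("hg38", "bwa_mem2_index", "v2", "hg38"))
```

The last path shows the redundancy in question, since `bwa_mem2_index` and `v2` carry the same information.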

…_star_index_builder.xml

Co-authored-by: Björn Grüning <[email protected]>
@natefoo natefoo marked this pull request as draft October 28, 2024 19:31

natefoo commented Oct 28, 2024

Converted to draft:

  1. I need to add the sam_fasta_index DM as well.
  2. I propose we remove the symlink to the reference genome as proposed here. @mvdbeek already tested this for one tool (bowtie2?).

@bernt-matthias bernt-matthias left a comment

Really appreciate that we have more and more DMs that do not require the extra python stuff.

I'm a bit worried about the side effects for admins.

<inputs>
<param name="all_fasta_source" type="select" label="Source FASTA Sequence">
<options from_data_table="all_fasta"/>
</param>
<param name="sequence_name" type="text" value="" label="Name of sequence" />
<param name="sequence_id" type="text" value="" label="ID for sequence" />
-<param name="tophat2" type="boolean" truevalue="--data_table_name tophat2_indexes" falsevalue="" checked="True" label="Also make available for TopHat" help="Adds values to tophat2_indexes tool data table" />
+<param name="tophat2" type="boolean" checked="True" label="Also make available for TopHat" help="Adds values to tophat2_indexes tool data table" />
Contributor

Cool to cover this, but it seems SNAFU anyway, since the tophat2 data table refers to the bowtie2 loc file.

Maybe we just deprecate the tophat2 datatable (and update the tool to use the bowtie2 one)?

Member Author

Yes, this is probably just legacy. Updating the tophat2 tool is a good idea, although in the unlikely case that an admin has indexes in the tophat2_indexes table that are not in bowtie2_indexes, those indexes would disappear. And tophat is of course deprecated itself.

</move>
-<value_translation>${GALAXY_DATA_MANAGER_DATA_PATH}/${dbkey}/bowtie2_index/${value}/${path}</value_translation>
+<value_translation>${GALAXY_DATA_MANAGER_DATA_PATH}/genomes/${dbkey}/bowtie_index/v2/${value}/${path}</value_translation>
Contributor

I do not understand why this is done. Won't it require all admins to restructure their reference data?
It would certainly be nicer if we were starting from scratch, though.

Member Author

They don't have to. If you update to the new DMs and run them, they will just place data in a new folder, while the old loc with the old data at the old paths will continue to be loaded. That said, in the proposal I suggested that we recommend that admins specify a new tool_data_path just for organizational purposes. I also said I'd write a script to restructure the data for anyone who preferred to unify it.
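
The restructuring script mentioned above does not exist in this PR. Purely as an illustration of what such a migration would have to compute, the following dry-run sketch maps directories in the old `${dbkey}/<index_dir>/${value}` layout to the new `genomes/${dbkey}/<index_dir>/<version>/${value}` layout. The function name `plan_moves` and its arguments are hypothetical, nothing is moved, and the matching `.loc` entries would still need to be rewritten separately.

```python
from pathlib import Path


def plan_moves(data_root, index_dir, version):
    """Return (old, new) directory pairs without touching the disk.

    Hypothetical helper: assumes the old layout is
    <data_root>/<dbkey>/<index_dir>/<value>.
    """
    root = Path(data_root)
    moves = []
    for old in sorted(root.glob(f"*/{index_dir}/*")):
        if not old.is_dir():
            continue
        # Path relative to the root is (dbkey, index_dir, value).
        dbkey, _, value = old.relative_to(root).parts
        new = root / "genomes" / dbkey / index_dir / version / value
        moves.append((old, new))
    return moves
```

An admin could print the returned pairs for review, for example `plan_moves("/path/to/tool-data", "bwa_mem_index", "v1")` with a placeholder root, before deciding whether to move anything.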

@@ -1,34 +0,0 @@
-<tool id="bowtie_color_space_index_builder_data_manager" name="Bowtie Color index" tool_type="manage_data" version="1.2.1" profile="23.0">
Contributor

We should move these to the deprecated/data_managers/ folder of this repo.

3 participants