Workflow management systems (WMSs) such as Galaxy are uniquely positioned to enable researchers to perform more environmentally-sustainable computational data analysis as they have full control of the resources used for a given workflow. In this project we want to reduce Galaxy resource usage by focusing on: 1) job caching to enable the reuse of tool outputs, and 2) environmentally-friendly job scheduling.
Job caching uses the provenance information stored in Galaxy’s database for each tool execution to avoid unnecessary recalculations when the relevant parameters match. An initial implementation is already available and we will work on making the job cache apply in more scenarios. In particular, we will infer how strict dataset metadata needs to match for a job to be considered identical. We will also enable sharing the job cache for users that have opted-in to this feature, making it possible to run large-scale analyses during training sessions without consuming an unnecessary amount of computing resources.
Advanced job scheduling is made possible in Galaxy through the Total Perspective Vortex (TPV) plugin. TPV can route entities (tools, users) to selected destinations with appropriate resource allocations (cores, GPUs, memory). It additionally allows arbitrary Python-based rules for e.g. custom ranking functions for choosing between destinations. We will specifically rank destinations in an order that promotes sustainability. Expanding on our initial implementation, we will collect (job-related) statistics and information from the Galaxy database and (Pulsar) compute destinations in a central location and add additional algorithms for the ranking based on these statistics.
Nicola Soranzo, Paul De Geest