Advances in sequencing technologies and assembly algorithms have enabled an explosion in diverse reference genomes across the tree of life, together with a need to annotate functional and structural features. There is no current set of minimal metadata to support genome annotations as FAIR objects, limiting their reproducibility and reliability.
The FAIRification of Genomic Annotations Working Group (FGA-WG) in the Research Data Alliance (RDA) will develop a harmonised metadata model and recommended infrastructure to improve discovery and reuse of publicly available genomic annotations/tracks, supporting harmonised metadata for GFF3 files. Such metadata exists in e.g. project-specific databases or spreadsheets, workflow systems, repositories, exchange formats, and linked data.
Harmonising metadata according to a unified data model requires the extraction, transformation and integration of data sourced in different research contexts, including ""messy"" data, using schema mappings or ""crosswalks"". These operations are time-consuming and may introduce opaque errors. FAIR principles emphasise reproducibility and trust in data analyses with persisted and shared accessible, auditable and executable data transformation and validation methods.
Omnipy and Whyqd (/wɪkɪd/) are independently-developed Python libraries offering general functionality for auditable and executable metadata mappings. Each is pragmatically designed to ensure transformations are executable on real-world data, with validation and feedback. They differ in scope and users, and provide complementary functionalities.
In this project, we will integrate Omnipy and Whyqd to develop executable mappings that transform existing metadata from biodiversity projects, such as ERGA, to conform to the FGA-WG metadata model, kickstarting the process of FAIRifying genome annotation GFF3 files.
Sveinung Gundersen, Gavin Chait