This repository contains scripts for integrating species and trait data from TRY-db with taxonomic ids from GBIF, OTOL, NCBI and Wikidata. At the moment, data for only 25 traits have been downloaded from TRY-db. The trait metadata was then retrieved from the TRY-db website, along with a subset of enpkg. The retrieved csv files were converted to duckdb (advantage: SQL queries run against an on-disk database).
The TRY-db dataset with 25 traits has multiple columns ('data/trydbtemp_Ontop/trydbAll.csv'). These columns have a complex relationship as depicted in the diagram below.
NOTE: the trydbAll table contains only a subset of the full TRY-db data.
I. Prerequisites:
- To run the scripts (R, shell), install R (version 4.1.2) and the following R packages:
a) For accessing taxonomic ids from Wikidata, with mappings from GBIF and NCBI (taxizedb), and from the Open Tree of Life (rotl)
install.packages(c("taxizedb", "rotl"))
b) For data manipulation, install dplyr and dbplyr (backend wrapper to convert dplyr code into SQL)
install.packages(c("dplyr", "dbplyr"))
c) For the on-disk approach of accessing and querying databases, install duckdb's API client for R
install.packages("duckdb")
and the duckdb command-line client (used by the shell commands in section III)
d) For building a Virtual Knowledge Graph (VKG), download the Ontop-cli or Ontop-protege bundle (version 5.1.2)
- For converting ontology files between formats (e.g. owl to ttl), install robot.
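Before running the scripts, it can help to confirm that the command-line tools are reachable. A minimal sketch, assuming the executables are named Rscript, duckdb, robot and ontop on your PATH. The function name missing_tools is made up for this example, and the Ontop launcher may have a different name depending on how it was installed:

```shell
# Print the name of every tool that cannot be found on PATH.
# Tool names below are assumptions; adjust them to your installation.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

missing_tools Rscript duckdb robot ontop
```

An empty output means all four tools were found.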
II. Script to map the TRY plant species names to the GBIF, NCBI, Wikidata and OTOL ids
Rscript matchTaxonomy.R
To plot the distribution of the TRY-db species matched with ids from OTOL (ott), NCBI, GBIF and Wikidata, run
Rscript distTaxonomicIds.R
III. Script to build a duckdb database for Ontop
duckdb data/Ontop_input.db -c "IMPORT DATABASE 'data/trydbtemp_Ontop'"
or
sh run_duckdb.sh
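IMPORT DATABASE expects the directory to hold the schema.sql and load.sql files written by duckdb's EXPORT DATABASE. Below is a sketch of a wrapper that checks for them before importing. The function name check_export_dir is hypothetical, and the actual run_duckdb.sh may do something different:

```shell
# Hypothetical wrapper around the import step; not the repository's run_duckdb.sh.
import_dir="data/trydbtemp_Ontop"
db_file="data/Ontop_input.db"

# Succeed only if the directory contains the files IMPORT DATABASE reads.
check_export_dir() {
  for f in schema.sql load.sql; do
    if [ ! -f "$1/$f" ]; then
      echo "missing $1/$f" >&2
      return 1
    fi
  done
}

if check_export_dir "$import_dir" && command -v duckdb >/dev/null 2>&1; then
  duckdb "$db_file" -c "IMPORT DATABASE '$import_dir'"
  # List the imported tables as a quick sanity check.
  duckdb "$db_file" -c "SHOW TABLES"
fi
```

If data/Ontop_input.db already contains the tables, the import is likely to fail because they already exist. Removing the file first rebuilds the database from scratch.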
The relations between tables are depicted in this diagram.
IV. Script to build the knowledge graph in Ontop
# Set the path to the duckdb database in data/Ontop_config/duckdb.properties
sh run_ontop.sh
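Ontop reads the JDBC connection from the properties file. Here is a sketch that checks whether the configured database file exists, assuming the file contains a line of the form jdbc.url=jdbc:duckdb:/path/to/Ontop_input.db. The repository's actual file may be laid out differently, and jdbc_db_path is a hypothetical helper:

```shell
# Assumed location of the properties file, taken from the comment above.
props="data/Ontop_config/duckdb.properties"

# Print the file path that follows "jdbc:duckdb:" on the jdbc.url line.
jdbc_db_path() {
  sed -n 's/^jdbc\.url[[:space:]]*=[[:space:]]*jdbc:duckdb://p' "$1"
}

db_path=$(jdbc_db_path "$props" 2>/dev/null || true)
if [ -n "$db_path" ] && [ -f "$db_path" ]; then
  echo "database found: $db_path"
else
  echo "jdbc.url in $props does not point to an existing file" >&2
fi
```

Running this before sh run_ontop.sh catches the common case where the path still points to another machine's directory.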
V. Disclaimer
The mappings in the Ontop virtual knowledge graph are faulty at the moment, so SPARQL queries do not yet return correct results. This is work in progress.