-
Notifications
You must be signed in to change notification settings - Fork 0
/
background.tex
36 lines (31 loc) · 6.94 KB
/
background.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
The integration of heterogeneous, multi-scale, and multi-modal biomedical data has become a cornerstone of modern computational investigation of the mechanisms and aetiologies underlying complex diseases~\cite{VanDam2014,Iyappan2016,Wanichthanarak2015,Himmelstein2017,Fan2019}.
An overarching strategy was proposed by Davidson \textit{et al.} more than two decades ago that outlined the transformation of data into a common model, semantic alignment of related objects, integration of schemata, and federation of data~\cite{Davidson1995}.
However, integration remains a challenging task that requires the identification and deep understanding of biological data sources and their respective formats, conversion, harmonization, and unification.
Initial interest in the semantic web and linked open data along with the adoption of RDF (Resource Description Framework) in the biomedical community led to the Bio2RDF project, in which pipelines for converting and serializing several biological data sources to RDF were developed~\cite{Belleau2008}.
Several updates have been issued since its deployment such as the inclusion of chemical information systems~\cite{Chen2010}.
Further, it has also influenced in and has been adopted by subsequent projects such as Open PHACTS~\cite{Williams2012}.
While RDF is highly expressive and each of these projects have developed and enforced well-defined schemata, the format is often not well-suited for downstream analyses and must first be queried with languages like SPARQL (SPARQL Query Language for RDF) and subsequently be transformed into appropriate formats with general-purpose programming languages.
Alternatives to RDF/SPARQL such as property graphs (e.g., Neo4j, OrientDB) are comparable~\cite{Alocci2015} but also necessitate similar post-processing.
Conversely, there have been several biologically meaningful integration efforts (e.g., STRING~\cite{Warde-Farley2010}; GeneMANIA~\cite{Szklarczyk2015}; GeneCards~\cite{Stelzer2016}).
However, most suffer from a lack of defined schemata or standardized data format that impede biological database interoperability.
As interoperability itself is a multifaceted concept, we would like to highlight three of its facets: first, data sources should refer to named entities using high-quality, publicly accessible terminologies as prescribed by the Minimal Information Requested in the Annotation of Biochemical Models standard~\cite{Laibe2007}.
Second, data sources should additionally denote the ontological classes of named entities (e.g., gene, transcript, protein, pathway, disease) along with their reference using controlled vocabularies such as the Systems Biology Ontology~\cite{Courtot2011}.
Some identifiers, such as those for genes, are often used to refer not only to the physical region of DNA within the genome, but also the corresponding RNA transcript(s) or protein product(s).
Unfortunately, many biological databases do not explicitly distinguish between these entity classes.
For example, the STRING database lists gene-centric homology relationships, transcript-centric co-expression relationships, and protein-centric protein-protein interactions using gene-centric nomenclature.
While it may be possible to identify the classes of entities based on their incident relationships, doing so requires specific knowledge of the database including the semantics of its relationships.
Third, resources should, at a minimum, map their relationships to controlled vocabularies such as the Relation Ontology, or further use standardized data formats with defined semantics (e.g., PSI MI-TAB) to minimize both the interpretation and implementation effort when combining them with other resources.
OmniPath~\cite{Turei2016} began to address these facets when it combined several signaling pathway and transcriptional regulation databases.
It achieved interoperability between several databases by normalizing the identifiers and relationships between entities from several databases describing the same phenomena (e.g., microRNA-target interactions, protein-protein interactions, etc.) and creating a unified network.
However, because it did not use a standard format or schema as mentioned in the third facet for interoperability, OmniPath itself cannot readily be directly integrated with other biological data sources.
Pathway Commons~\cite{Cerami2011} addressed this concern when combining several molecular pathway and interaction databases by translating the source databases into the BioPAX standard~\cite{Demir2010} using automated pipelines.
However, it suffers from low granularity and low recovery of information from some of its primary biological data sources which may be due to prioritization of software development time, data usage restrictions, or shortcomings in the BioPAX standard.
While BioPAX is well-suited for representing biological reactions and transformations, it is limited in its ability to represent correlative and associative relationships across multi-scale biology (e.g., at the levels of processes, phenotypes, and clinical observations).
As an alternative, we propose the use of Biological Expression Language (BEL~\cite{Slater2014}) as an integration schema in order to overcome the limits faced by previous efforts and to simultaneously address all three facets of interoperability.
BEL has begun to prove itself as a robust format in the curation and integration of previously isolated biological data sources of high granular information on genetic variation~\cite{Naz2016}, epigenetics~\cite{Irin2015}, chemogenomics~\cite{Emon2017}, and clinical biomarkers~\cite{Iyappan2017}.
Its syntax and semantics are also appropriate for representing, for example, disease-disease similarities, disease-protein associations, chemical space networks, genome-wide association studies, and phenome-wide association studies.
With the same focus on reproducibility as Bio2RDF, OmniPath, and Pathway Commons as well as deference to software maintainability and the ease of development and inclusion of new biological data sources, we have developed a growing list of Bio2BEL packages, each capable of downloading, structuring, and serializing various biological data sources to BEL (Table~\ref{tab:summary}).
Each can be found in the Bio2BEL GitHub organization (\url{https://github.com/bio2bel}) as an independent open-source Python package that can readily be installed with pip.
We have also developed and freely provided a framework (\url{https://github.com/bio2bel/bio2bel}) in the Python programming language to enable code reuse and the fast generation of additional Bio2BEL packages.
Notably, the list of Bio2BEL packages includes one for OmniPath as a proof of concept that authors of other resources can implement their own Bio2BEL packages.
In this article, we present the philosophy and implementation of Bio2BEL packages, a summary of past and future Bio2BEL packages, and finally, several case studies including the utility of Bio2BEL packages during curation of pathway mappings, in the analysis of cancer genome data, and for machine learning applications.