Construction of an ortholog database using the semantic web technology for integrative analysis of genomic data

PLoS One. 2015 Apr 13;10(4):e0122802. doi: 10.1371/journal.pone.0122802. eCollection 2015.

Abstract

Recently, various types of biological data, including genomic sequences, have been rapidly accumulating. To discover biological knowledge from such growing heterogeneous data, a flexible framework for data integration is necessary. Ortholog information is a central resource for interlinking corresponding genes among different organisms, and the Semantic Web provides a key technology for the flexible integration of heterogeneous data. We have constructed an ortholog database using Semantic Web technology, aiming at the integration of large amounts of genomic data and various types of biological information. To formalize the structure of the ortholog information in the Semantic Web, we have constructed the Ortholog Ontology (OrthO). While the OrthO is a compact ontology for general use, it is designed to be extended to the description of database-specific concepts. On the basis of OrthO, we described the ortholog information from our Microbial Genome Database for Comparative Analysis (MBGD) in the form of Resource Description Framework (RDF) and made it available through a SPARQL endpoint, which accepts arbitrary queries specified by users. In this framework based on the OrthO, the biological data of different organisms can be integrated using the ortholog information as a hub. In addition, the ortholog information from different data sources can be compared with each other using the OrthO as a shared ontology. Here we show some examples demonstrating that the ortholog information described in RDF can be used to link various biological data such as taxonomy information and Gene Ontology. Thus, the ortholog database using Semantic Web technology can contribute to biological knowledge discovery through integrative data analysis.
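
As a rough illustration of how ortholog information in RDF might be queried through a SPARQL endpoint and then used as a hub across organisms, the following Python sketch builds a SPARQL 1.1 Protocol request and parses the standard SPARQL JSON results format. The endpoint URL, the `ex:` vocabulary (`ex:OrthologGroup`, `ex:member`, `ex:taxon`), and the sample response are all hypothetical placeholders, not the actual MBGD endpoint, OrthO terms, or data; the real class and property names would need to be taken from the ontology itself.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical endpoint, used only for illustration (not the MBGD endpoint).
ENDPOINT = "http://example.org/sparql"

# Hypothetical vocabulary: the ex: terms stand in for ontology terms and
# are NOT the actual OrthO class or property names.
QUERY = """
PREFIX ex: <http://example.org/ortho#>
SELECT ?group ?gene ?taxon
WHERE {
  ?group a ex:OrthologGroup ;
         ex:member ?gene .
  ?gene ex:taxon ?taxon .
}
LIMIT 10
"""


def build_request(endpoint, query):
    """Build a SPARQL 1.1 Protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def parse_bindings(results_json):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    data = json.loads(results_json)
    return [
        {var: cell["value"] for var, cell in row.items()}
        for row in data["results"]["bindings"]
    ]


def group_by_taxon(rows):
    """Use the ortholog group as a hub: group -> taxon -> list of genes."""
    hub = {}
    for row in rows:
        hub.setdefault(row["group"], {}).setdefault(row["taxon"], []).append(row["gene"])
    return hub


def run_query(endpoint, query):
    """Send the query over the network (only works against a real endpoint)."""
    with urlopen(build_request(endpoint, query)) as resp:
        return parse_bindings(resp.read().decode("utf-8"))


# Made-up response in the standard SPARQL JSON results layout, so the
# parsing step can be tried without network access.
SAMPLE_RESPONSE = json.dumps({
    "head": {"vars": ["group", "gene", "taxon"]},
    "results": {"bindings": [
        {"group": {"type": "uri", "value": "http://example.org/group/1"},
         "gene": {"type": "uri", "value": "http://example.org/gene/a"},
         "taxon": {"type": "uri", "value": "http://example.org/taxon/x"}},
        {"group": {"type": "uri", "value": "http://example.org/group/1"},
         "gene": {"type": "uri", "value": "http://example.org/gene/b"},
         "taxon": {"type": "uri", "value": "http://example.org/taxon/y"}},
    ]},
})

if __name__ == "__main__":
    rows = parse_bindings(SAMPLE_RESPONSE)
    print(group_by_taxon(rows))
```

Because each gene in a group carries its own URI, the resulting mapping could in principle be joined against other RDF resources keyed on the same URIs, such as taxonomy or Gene Ontology annotations, which is the kind of linking the abstract describes.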

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Animals
  • Bacteria / genetics
  • Computational Biology / methods*
  • Computational Biology / statistics & numerical data
  • Databases, Genetic*
  • Datasets as Topic
  • Fungi / genetics
  • Gene Ontology*
  • Genome*
  • Humans
  • Internet
  • Plants / genetics
  • Semantics

Grants and funding

This work was supported by the National Bioscience Database Center, Japan Science and Technology Agency. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.