Text as data: using text-based features for proteins representation and for computational prediction of their characteristics

Methods. 2015 Mar:74:54-64. doi: 10.1016/j.ymeth.2014.10.027. Epub 2014 Nov 15.

Abstract

The current era of large-scale biology is characterized by fast-paced growth in the number of sequenced genomes and, consequently, by a multitude of identified proteins whose function has yet to be determined. Simultaneously, any known or postulated information concerning genes and proteins is part of the ever-growing published scientific literature, which is expanding at a rate of over a million new publications per year. Computational tools that attempt to automatically predict and annotate protein characteristics, such as function and localization patterns, are being developed along with systems that aim to support the process via text mining. Most work on protein characterization focuses on features derived directly from protein sequence data. Protein-related work that does aim to utilize the literature typically concentrates on extracting specific facts (e.g., protein interactions) from text. In the past few years we have taken a different route, treating the literature as a source of text-based features, which can be employed just as sequence-based protein features were used in earlier work, for predicting protein subcellular location and possibly also function. Here we discuss the overall approach in detail, along with results from our work in this area that demonstrate the value of this method and its potential use.
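
To make the idea of text-based features concrete, here is a minimal sketch and not the authors' pipeline. It assumes each protein is linked to one or more abstracts, pools those abstracts into a bag-of-words count vector, and assigns a subcellular location by cosine similarity to the nearest labelled protein. The function names (`text_features`, `cosine`, `predict_location`), the choice of a 1-nearest-neighbour classifier, and the toy sentences are all illustrative assumptions; none of them come from the paper.

```python
import math
from collections import Counter


def text_features(abstracts):
    """Pool a protein's associated abstracts into a bag-of-words count vector."""
    counts = Counter()
    for text in abstracts:
        # Lowercase, split on whitespace, keep purely alphabetic tokens.
        counts.update(w for w in text.lower().split() if w.isalpha())
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def predict_location(query, labelled):
    """Return the label of the most similar labelled protein (1-nearest-neighbour)."""
    return max(labelled, key=lambda item: cosine(query, item[0]))[1]


# Toy, made-up sentences standing in for abstracts (illustration only).
train = [
    (text_features(["the protein binds chromatin and regulates transcription in the nucleus"]),
     "nucleus"),
    (text_features(["the enzyme participates in oxidative phosphorylation in the mitochondrial matrix"]),
     "mitochondrion"),
]
query = text_features(["a factor that regulates transcription and binds chromatin"])
predicted = predict_location(query, train)
print(predicted)
```

In this sketch the text vector plays the role that a sequence-derived feature vector would play in a sequence-based predictor, so any standard classifier could be substituted for the nearest-neighbour step.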

Keywords: Biomedical text mining; Machine learning; Protein annotation; Protein function prediction; Protein location prediction; Protein representation; Protein subcellular location; Text classification; Text mining.

Publication types

  • Research Support, Non-U.S. Gov't
  • Review
  • Research Support, N.I.H., Extramural

MeSH terms

  • Animals
  • Computational Biology / methods*
  • Computational Biology / trends
  • Data Mining / methods*
  • Data Mining / trends
  • Databases, Protein / trends
  • Humans
  • Proteins / genetics

Substances

  • Proteins