SherLoc: high-accuracy prediction of protein subcellular localization by integrating text and protein sequence data

Bioinformatics. 2007 Jun 1;23(11):1410-7. doi: 10.1093/bioinformatics/btm115. Epub 2007 Mar 28.

Abstract

Motivation: Knowing the localization of a protein within the cell helps elucidate its role in biological processes, its function and its potential as a drug target. Thus, subcellular localization prediction is an active research area. Numerous localization prediction systems are described in the literature; some focus on specific localizations or organisms, while others attempt to cover a wide range of localizations.

Results: We introduce SherLoc, a new comprehensive system for predicting the localization of eukaryotic proteins. It integrates several types of sequence and text-based features. Although SherLoc applies the widely used support vector machines (SVMs), its main novelty lies in the way it selects its text sources and features and integrates them with sequence-based features. We test SherLoc on previously used datasets, as well as on a new set devised specifically to test its predictive power, and show that SherLoc consistently improves on previously reported results. We also report the results of applying SherLoc to a large set of yet-unlocalized proteins.
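
The idea of integrating text-based and sequence-based features can be illustrated with a minimal sketch, in which both kinds of features are computed separately and concatenated into one vector that a classifier such as an SVM could consume. The feature choices (amino acid composition, term frequencies), the function names and the vocabulary below are assumptions made for illustration only; they are not SherLoc's actual features or code.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so vectors are comparable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sequence_features(sequence):
    """Return the fraction of each standard amino acid in the sequence."""
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def text_features(text, vocabulary):
    """Return the relative frequency of each vocabulary term in the text."""
    tokens = text.lower().split()
    if not tokens:
        return [0.0] * len(vocabulary)
    counts = Counter(tokens)
    return [counts[term] / len(tokens) for term in vocabulary]


def combined_features(sequence, text, vocabulary):
    """Concatenate sequence-based and text-based features into one vector."""
    return sequence_features(sequence) + text_features(text, vocabulary)


# Hypothetical vocabulary and inputs, chosen only to show the vector layout.
VOCABULARY = ["nucleus", "membrane", "mitochondrial"]
vector = combined_features("MKKLAV", "localized to the nucleus", VOCABULARY)
```

Here `vector` holds 20 sequence-derived values followed by one value per vocabulary term. Plain concatenation is only the simplest integration strategy; the abstract does not say how SherLoc combines its features.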

Availability: SherLoc, along with Supplementary Information, is available at: http://www-bs.informatik.uni-tuebingen.de/Services/SherLoc/

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Amino Acid Sequence
  • Databases, Protein*
  • Information Storage and Retrieval / methods*
  • Molecular Sequence Data
  • Natural Language Processing*
  • Proteins / chemistry
  • Proteins / classification
  • Proteins / metabolism*
  • Reproducibility of Results
  • Sensitivity and Specificity
  • Sequence Alignment / methods
  • Sequence Analysis, Protein / methods*
  • Software
  • Structure-Activity Relationship
  • Subcellular Fractions / metabolism*

Substances

  • Proteins