Using substitution probabilities to improve position-specific scoring matrices

Comput Appl Biosci. 1996 Apr;12(2):135-43. doi: 10.1093/bioinformatics/12.2.135.

Abstract

Each column of amino acids in a multiple alignment of protein sequences can be represented as a vector of 20 amino acid counts. For alignment and searching applications, the count vector is an imperfect representation of a position, because the observed sequences are an incomplete sample of the full set of related sequences. One general solution to this problem is to model unobserved sequences by adding artificial 'pseudo-counts' to the observed counts. We introduce a simple method for computing pseudo-counts that combines the diversity observed in each alignment position with amino acid substitution probabilities. In extensive empirical tests, this position-based method out-performed other pseudo-count methods and was a substantial improvement over the traditional average score method used for constructing profiles.
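The abstract does not give the pseudo-count formula, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes the total pseudo-count for a column grows with the number of distinct residue types observed there (its diversity), and that this total is distributed across residues using substitution probabilities weighted by the observed residue frequencies. The names `column_probabilities`, `TOY_SUBST` and `scale`, the three-letter alphabet, and all numeric values are hypothetical and chosen for illustration; they are not taken from the paper.

```python
from collections import Counter

# Toy 3-letter alphabet with a made-up table of conditional substitution
# probabilities P(b | a). Each row sums to 1. These are illustrative numbers,
# NOT values derived from any real substitution matrix.
ALPHABET = "AVD"
TOY_SUBST = {
    "A": {"A": 0.6, "V": 0.3, "D": 0.1},
    "V": {"A": 0.3, "V": 0.6, "D": 0.1},
    "D": {"A": 0.1, "V": 0.1, "D": 0.8},
}


def column_probabilities(column, subst=TOY_SUBST, scale=5.0):
    """Estimate residue probabilities for one alignment column.

    Assumption: the total pseudo-count equals `scale` times the number of
    distinct residue types in the column, so diverse columns receive more
    pseudo-counts than conserved ones.
    """
    if not column:
        raise ValueError("column must contain at least one residue")

    counts = Counter(column)          # observed count vector
    n = sum(counts.values())          # number of sequences in the column
    total_pseudo = scale * len(counts)

    # Spread the pseudo-count total over residues: each observed residue a
    # contributes in proportion to its frequency times P(b | a).
    pseudo = {
        b: total_pseudo * sum(counts[a] / n * subst[a][b] for a in counts)
        for b in ALPHABET
    }

    # Combine observed counts and pseudo-counts, then normalise.
    return {b: (counts[b] + pseudo[b]) / (n + total_pseudo) for b in ALPHABET}


conserved = column_probabilities("AAAA")
diverse = column_probabilities("AVDA")
```

In this sketch a fully conserved column still assigns a small non-zero probability to unobserved residues, while a diverse column is pulled more strongly toward the substitution probabilities. A real application would use the 20-letter amino acid alphabet and substitution probabilities estimated from data.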

Publication types

  • Research Support, U.S. Gov't, P.H.S.

MeSH terms

  • Amino Acid Sequence
  • Computers
  • Databases, Factual
  • Evaluation Studies as Topic
  • Odds Ratio
  • Probability
  • Proteins / chemistry
  • Proteins / genetics
  • Sequence Alignment / methods*
  • Sequence Alignment / statistics & numerical data

Substances

  • Proteins