Homology-based method for identification of protein repeats using statistical significance estimates

M A Andrade; C P Ponting; T J Gibson; P Bork

doi:10.1006/jmbi.2000.3684

J Mol Biol. 2000 May 5;298(3):521-37. doi: 10.1006/jmbi.2000.3684.

Authors

M A Andrade¹, C P Ponting, T J Gibson, P Bork

Affiliation

¹ European Molecular Biology Laboratory, Meyerhofstr. 1, Heidelberg, 69012, Germany.

PMID: 10772867
DOI: 10.1006/jmbi.2000.3684

Abstract

Short protein repeats, frequently with a length between 20 and 40 residues, represent a significant fraction of known proteins. Many repeats appear to possess high amino acid substitution rates and thus recognition of repeat homologues is highly problematic. Even if the presence of a certain repeat family is known, the exact locations and the number of repetitive units often cannot be determined using current methods. We have devised an iterative algorithm based on optimal and sub-optimal score distributions from profile analysis that estimates the significance of all repeats that are detected in a single sequence. This procedure allows the identification of homologues at alignment scores lower than the highest optimal alignment score for non-homologous sequences. The method has been used to investigate the occurrence of eleven families of repeats in Saccharomyces cerevisiae, Caenorhabditis elegans and Homo sapiens accounting for 1055, 2205 and 2320 repeats, respectively. For these examples, the method is both more sensitive and more selective than conventional homology search procedures. The method allowed the detection in the SwissProt database of more than 2000 previously unrecognised repeats belonging to the 11 families. In addition, the method was used to merge several repeat families that previously were supposed to be distinct, indicating common phylogenetic origins for these families.
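The significance idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation and it does not cover the iterative profile procedure. It assumes that the k-th best profile match in a sequence is compared with a background of k-th best scores from unrelated sequences, and that each background is modelled with a Gumbel (extreme value) distribution fitted by the method of moments. The function names, the `alpha` threshold and the rank-wise layout of `background_by_rank` are hypothetical choices made for illustration.

```python
import math
from statistics import mean, stdev

EULER_GAMMA = 0.5772156649015329


def fit_gumbel(scores):
    """Method-of-moments fit of a Gumbel distribution; returns (mu, beta)."""
    beta = stdev(scores) * math.sqrt(6) / math.pi
    mu = mean(scores) - EULER_GAMMA * beta
    return mu, beta


def gumbel_pvalue(score, mu, beta):
    """P(S >= score) under a Gumbel distribution with location mu, scale beta."""
    z = (score - mu) / beta
    if z < -700:  # avoid overflow in exp(); the tail probability is 1 here
        return 1.0
    return -math.expm1(-math.exp(-z))


def significant_repeats(hit_scores, background_by_rank, alpha=0.01):
    """Return the top-ranked hits of one sequence that pass the threshold.

    hit_scores: scores of non-overlapping profile matches in one sequence.
    background_by_rank[k]: (k+1)-th best scores seen in unrelated sequences
    (an assumed layout, not taken from the paper).
    """
    accepted = []
    for rank, score in enumerate(sorted(hit_scores, reverse=True)):
        if rank >= len(background_by_rank):
            break  # no background available for this rank
        mu, beta = fit_gumbel(background_by_rank[rank])
        if gumbel_pvalue(score, mu, beta) > alpha:
            break  # stop at the first hit that is not significant
        accepted.append(score)
    return accepted


# Made-up background scores, for illustration only.
background = [
    [10, 12, 11, 13, 9, 10, 12, 11],  # best score per unrelated sequence
    [6, 7, 5, 8, 6, 7, 5, 6],         # second-best score per unrelated sequence
]
print(significant_repeats([30, 12, 6], background))
print(significant_repeats([12], background))
```

With these made-up numbers, a second hit scoring 12 is accepted even though the highest optimal background score is 13, because it is judged against the second-best background. The same score of 12 as the only hit is rejected. This mirrors the abstract's statement that homologues can be identified at scores below the highest optimal score for non-homologous sequences.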

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
Alkyl and Aryl Transferases / chemistry
Amino Acid Motifs
Animals
Ankyrins / chemistry
Caenorhabditis elegans / chemistry
Caenorhabditis elegans Proteins
Carrier Proteins / chemistry
Cell Cycle Proteins*
Computational Biology / methods*
Computational Biology / statistics & numerical data*
DNA-Binding Proteins / chemistry
Databases, Factual
Farnesyltranstransferase
Genome
Guanine Nucleotide Exchange Factors*
HSP40 Heat-Shock Proteins
Heat-Shock Proteins
Helminth Proteins / chemistry
Humans
Leucine / chemistry
Models, Molecular
Molecular Chaperones
Nuclear Proteins*
Phylogeny
Protein Structure, Secondary
Proteins / chemistry*
Proteins / metabolism*
Repetitive Sequences, Amino Acid*
Saccharomyces cerevisiae / chemistry
Sensitivity and Specificity
Sequence Alignment
Sequence Homology, Amino Acid*

Substances

Ankyrins
Caenorhabditis elegans Proteins
Carrier Proteins
Cell Cycle Proteins
DNA-Binding Proteins
DNAJC7 protein, human
Guanine Nucleotide Exchange Factors
HSP40 Heat-Shock Proteins
Heat-Shock Proteins
Helminth Proteins
Molecular Chaperones
Nuclear Proteins
Proteins
RCC1 protein, human
che-2 protein, C elegans
ran-3 protein, C elegans
Alkyl and Aryl Transferases
Farnesyltranstransferase
Leucine