Global self-organization of all known protein sequences reveals inherent biological signatures

J Mol Biol. 1997 May 2;268(2):539-56. doi: 10.1006/jmbi.1997.0948.

Abstract

A global classification of all currently known protein sequences is performed. Every protein sequence is partitioned into segments of 50 amino acid residues and a dynamic programming distance is calculated between each pair of segments. This space of segments is initially embedded into Euclidean space. The algorithm that we apply embeds every finite metric space into Euclidean space so that (1) the dimension of the host space is small, (2) the metric distortion is small. A novel self-organized, cross-validated clustering algorithm is then applied to the embedded space with Euclidean distances. We monitor the validity of our clustering by randomly splitting the data into two parts and performing an hierarchical clustering algorithm independently on each part. At every level of the hierarchy we cross-validate the clusters in one part with the clusters in the other. The resulting hierarchical tree of clusters offers a new representation of protein sequences and families, which compares favorably with the most updated classifications based on functional and structural data about proteins. Some of the known families clustered into well distinct clusters. Motifs and domains such as the zinc finger, EF hand, homeobox, EGF-like and others are automatically correctly identified, and relations between protein families are revealed by examining the splits along the tree. This clustering leads to a novel representation of protein families, from which functional biological kinship of protein families can be deduced, as demonstrated for the transporter family. Finally, we introduce a new concise representation for complete proteins that is very useful in presenting multiple alignments, and in searching for close relatives in the database. The self-organization method presented is very general and applies to any data with a consistent and computable measure of similarity between data items.

MeSH terms

  • Amino Acid Sequence
  • Animals
  • Cluster Analysis*
  • DNA-Binding Proteins / classification
  • Hemeproteins / classification
  • Homeodomain Proteins / classification
  • Humans
  • Metalloproteins / classification
  • Molecular Sequence Data
  • Proteins / classification*
  • Sequence Analysis / methods*
  • Zinc Fingers

Substances

  • DNA-Binding Proteins
  • Hemeproteins
  • Homeodomain Proteins
  • Metalloproteins
  • Proteins