Evolutionary fingerprinting of genes

Mol Biol Evol. 2010 Mar;27(3):520-36. doi: 10.1093/molbev/msp260. Epub 2009 Oct 28.

Abstract

Over time, natural selection molds every gene into a unique mosaic of sites evolving rapidly or resisting change: an "evolutionary fingerprint" of the gene. Aspects of this evolutionary fingerprint, such as the site-specific ratio of nonsynonymous to synonymous substitution rates (dN/dS), are commonly used to identify genetic features of potential biological interest; however, no framework exists for comparing evolutionary fingerprints between genes. We hypothesize that protein-coding genes with similar protein structure and/or function tend to have similar evolutionary fingerprints and that comparing evolutionary fingerprints can be useful for discovering similarities between genes in a way that is analogous to, but independent of, discovery of similarity via sequence-based comparison tools such as BLAST. To test this hypothesis, we develop a novel model of coding sequence evolution that uses a general bivariate discrete parameterization of the evolutionary rates. We show that this approach provides a better fit to the data using a smaller number of parameters than existing models. Next, we use the model to represent evolutionary fingerprints as probability distributions and present a methodology for comparing these distributions in a way that is robust against variations in data set size and divergence. Finally, using sequences of three rapidly evolving RNA viruses (HIV-1, hepatitis C virus, and influenza A virus), we demonstrate that genes within the same functional group tend to have similar evolutionary fingerprints. Our framework provides a sound statistical foundation for efficient inference and comparison of evolutionary rate patterns in arbitrary collections of gene alignments, clustering homologous and nonhomologous genes, and investigation of biological and functional correlates of evolutionary rates.
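To make the idea of a fingerprint as a probability distribution concrete, the following minimal Python sketch treats a fingerprint as a small set of rate classes, each with a synonymous rate, a nonsynonymous rate, and a weight. The abstract does not specify how the distributions are compared, so the Gaussian smoothing and Hellinger distance used here are stand-in assumptions, not the authors' method. All function names, the grid, the bandwidth, and the numeric values are hypothetical and chosen only for illustration.

```python
import math

# ASSUMPTION: a fingerprint is a list of (alpha, beta, weight) rate classes,
# where alpha is the synonymous rate, beta the nonsynonymous rate, and the
# weights form a discrete probability distribution. The numbers used below
# are made up for illustration; they are not results from the paper.

def normalize(fingerprint):
    """Rescale the class weights so that they sum to 1."""
    total = sum(w for _, _, w in fingerprint)
    return [(a, b, w / total) for a, b, w in fingerprint]

def positive_selection_weight(fingerprint):
    """Total weight on classes with beta > alpha, i.e. dN/dS > 1."""
    return sum(w for a, b, w in normalize(fingerprint) if b > a)

def smooth_on_grid(fingerprint, grid, bandwidth=0.25):
    """Spread each class over a shared (alpha, beta) grid with a Gaussian
    kernel, so fingerprints with different support points are comparable.
    ASSUMPTION: this smoothing is a stand-in, not the published procedure."""
    classes = normalize(fingerprint)
    density = []
    for ga in grid:
        for gb in grid:
            value = 0.0
            for a, b, w in classes:
                d2 = (ga - a) ** 2 + (gb - b) ** 2
                value += w * math.exp(-d2 / (2 * bandwidth ** 2))
            density.append(value)
    total = sum(density)
    return [v / total for v in density]

def hellinger(p, q):
    """Hellinger distance between two distributions on the same grid."""
    bc = sum(math.sqrt(x * y) for x, y in zip(p, q))
    return math.sqrt(max(0.0, 1.0 - bc))

# Hypothetical fingerprints: gene_a and gene_b are similar, gene_c is not.
gene_a = [(0.5, 0.1, 0.60), (1.0, 0.5, 0.30), (1.0, 2.5, 0.10)]
gene_b = [(0.6, 0.1, 0.55), (1.0, 0.6, 0.35), (1.1, 2.4, 0.10)]
gene_c = [(1.0, 1.0, 0.50), (1.5, 2.5, 0.50)]

grid = [i * 0.25 for i in range(13)]  # rates from 0.0 to 3.0
pa, pb, pc = (smooth_on_grid(g, grid) for g in (gene_a, gene_b, gene_c))

print("distance a-b:", hellinger(pa, pb))
print("distance a-c:", hellinger(pa, pc))
```

Under these assumptions, a matrix of such pairwise distances could feed a standard clustering routine, which is one way to picture the grouping of genes by fingerprint that the abstract describes.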

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, Non-U.S. Gov't
  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Algorithms
  • Animals
  • Artificial Intelligence
  • Cluster Analysis
  • Codon
  • Computational Biology / methods*
  • Computer Simulation
  • DNA Fingerprinting / methods*
  • Databases, Genetic
  • Evolution, Molecular*
  • Genes, Viral*
  • HIV-1 / genetics
  • Hepacivirus / genetics
  • Influenza A virus / genetics
  • Models, Genetic*
  • Mutation
  • Nonlinear Dynamics
  • Principal Component Analysis
  • Reproducibility of Results
  • Sequence Alignment

Substances

  • Codon