Recent de novo origin of human protein-coding genes

Genome Res. 2009 Oct;19(10):1752-9. doi: 10.1101/gr.095026.109. Epub 2009 Sep 2.

Abstract

The origin of new genes is extremely important to evolutionary innovation. Most new genes arise from existing genes through duplication or recombination. The origin of new genes from noncoding DNA is extremely rare, and very few eukaryotic examples are known. We present evidence for the de novo origin of at least three human protein-coding genes since the divergence from chimp. Each of these genes has no protein-coding homologs in any other genome, but is supported by evidence from expression and, importantly, proteomics data. The absence of these genes in chimp and macaque cannot be explained by sequencing gaps or annotation error. High-quality sequence data indicate that these loci are noncoding DNA in other primates. Furthermore, chimp, gorilla, gibbon, and macaque share the same disabling sequence difference, supporting the inference that the ancestral sequence was noncoding over the alternative possibility of parallel gene inactivation in multiple primate lineages. The genes are not well characterized, but interestingly, one of them was first identified as an up-regulated gene in chronic lymphocytic leukemia. This is the first evidence for entirely novel human-specific protein-coding genes originating from ancestrally noncoding sequences. We estimate that 0.075% of human genes may have originated through this mechanism, leading to a total expectation of 18 such cases in a genome of 24,000 protein-coding genes.
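The closing estimate in the abstract is a simple proportion: 0.075% of 24,000 protein-coding genes. The arithmetic can be checked with a minimal sketch. The function name `expected_de_novo_genes` is hypothetical and used only for illustration. The abstract does not say how the 0.075% figure itself was derived, so it is taken here as a given input.

```python
from fractions import Fraction


def expected_de_novo_genes(percent: Fraction, total_genes: int) -> Fraction:
    """Return the expected gene count for a given percentage of a genome.

    Hypothetical helper for illustration only; it is not from the paper.
    Fractions are used so the result is exact rather than a rounded float.
    """
    return percent / 100 * total_genes


# Figures quoted in the abstract: 0.075% of 24,000 protein-coding genes.
estimate = expected_de_novo_genes(Fraction("0.075"), 24_000)
print(estimate)
```

Under these inputs the result agrees with the 18 cases stated in the abstract.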

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Animals
  • Base Sequence
  • DNA, Intergenic / analysis
  • DNA, Intergenic / genetics
  • Databases, Genetic
  • Evolution, Molecular*
  • Genes / physiology
  • Genome, Human
  • Humans
  • Models, Biological
  • Molecular Sequence Data
  • Mutation / physiology
  • Pan troglodytes / genetics
  • Phylogeny
  • Proteins / genetics*
  • Sequence Analysis, DNA
  • Sequence Homology, Nucleic Acid

Substances

  • DNA, Intergenic
  • Proteins

Associated data

  • GENBANK/FJ713693
  • GENBANK/FJ713696
  • GENBANK/FJ713697