Computational Prediction of De Novo Emerged Protein-Coding Genes

Methods Mol Biol. 2019:1851:63-81. doi: 10.1007/978-1-4939-8736-8_4.

Abstract

De novo genes, that is, protein-coding genes originating from previously noncoding sequence, have gone from being considered impossibly unlikely to being recognized as an important source of genetic novelty in eukaryotic genomes. It is clear that de novo gene evolution is a rare but consistent feature of eukaryotic genomes, being detected in every genome studied. However, different studies often use different computational methods, and the numbers and identities of the detected genes vary greatly. Here we present a coherent protocol for the computational identification of de novo genes by comparative genomics. The method described uses homology searches, identification of syntenic regions, and ancestral sequence reconstruction to produce high-confidence candidates with robust evidence of de novo emergence. It is designed to be easily applicable given basic knowledge of bioinformatic tools and scalable so that it can be applied to both large and small datasets.
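
The three lines of evidence named above (homology searches, syntenic regions, ancestral sequence reconstruction) can be combined into a single candidate filter. The following is a minimal Python sketch of that idea, not the chapter's actual protocol: the `Candidate` fields, the function names, and the thresholds are all hypothetical, and the inputs are assumed to have been produced beforehand by whatever homology-search, synteny, and reconstruction tools are used.

```python
from dataclasses import dataclass

STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_intact_orf(seq: str) -> bool:
    """Return True if seq reads as one complete ORF in frame 0:
    starts with ATG, ends with a stop codon, and has no premature stop.
    Alignment gaps ('-') are removed before checking."""
    seq = seq.upper().replace("-", "")
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    return (
        codons[0] == "ATG"
        and codons[-1] in STOP_CODONS
        and not any(c in STOP_CODONS for c in codons[1:-1])
    )


@dataclass
class Candidate:
    # All fields are hypothetical summaries of upstream analyses.
    gene_id: str
    outgroup_protein_hits: int  # homology-search hits outside the focal lineage
    syntenic_outgroups: int     # outgroups with an identified syntenic region
    ancestral_seq: str          # reconstructed ancestral sequence of the locus


def is_de_novo_candidate(c: Candidate) -> bool:
    """Keep a gene only if (1) it has no outgroup homologs, (2) the locus
    is found in synteny in at least one outgroup, and (3) the reconstructed
    ancestral sequence does not encode an intact ORF."""
    return (
        c.outgroup_protein_hits == 0
        and c.syntenic_outgroups >= 1
        and not has_intact_orf(c.ancestral_seq)
    )
```

Requiring all three conditions is a deliberately conservative choice in this sketch: lack of homologs alone cannot distinguish de novo emergence from rapid divergence or gene loss, so a candidate is kept only when the syntenic ancestral locus is also shown to be noncoding.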

Keywords: De novo genes; Gene birth; Genome evolution; Genome-wide detection; New gene evolution; Novel genes; ORF formation; Protein-coding genes.

MeSH terms

  • Amino Acid Sequence
  • Computational Biology / methods*
  • Evolution, Molecular
  • Genomics / methods*
  • Phylogeny
  • Proteins / classification
  • Proteins / genetics
  • Synteny

Substances

  • Proteins