A shotgun approach to discovering and reconstructing consensus retrotransposons ex novo from dense contigs of short sequences derived from Genbank Genome Survey Sequence database records

Gene. 2009 Dec 15;448(2):168-73. doi: 10.1016/j.gene.2009.06.011. Epub 2009 Jun 26.

Abstract

Retrotransposons constitute the majority of pseudogenic protein-coding regions of most eukaryotic genomes. Most genomes carry tens to thousands of retrotransposon copies derived from dozens of distinct families, but most, if not all, of these copies are non-functional and contain disabling mutations, including large numbers of indels. Until recently, most regions rich in these elements were virtually ignored in all but the most complete genome sequencing projects, and the full extent of their impact on the structure and function of the genomes of higher eukaryotes was under-appreciated. Even when new retrotransposons are encountered and annotated by automated gene-finding programs and similarity searches, their coding regions are treated as exons and, not surprisingly, are invariably mistranslated because of numerous frameshift mutations and large indels. Contrary to what such in silico annotations imply, very few functional retrotransposons contain introns. While many repetitive DNA consensus sequences have been assembled from collections of largely full-length copies using full-length templates, we have shown that repetitive DNA consensus sequence contigs representing long, moderately high copy-number elements can also be generated ex novo from very short overlapping sequences in the absence of templates. We have devised an in silico strategy to recover and reconstruct consensus sequences of elements up to 20,000 bp by building dense contigs of hundreds of overlapping 400- to 900-bp records found in the GenBank Genome Survey Sequence database. The results are hypothetical ancestral sequences encoding elements that appear to be fully functional, with intact open reading frames and other conserved features.
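The abstract does not spell out how the contigs are built or how the consensus is called, so the following is only a minimal sketch of the general idea: a column-wise majority vote over overlapping reads that have already been placed and gap-aligned in a contig. The function names (`column_consensus`, `has_intact_orf`), the `min_depth` parameter, and the toy reads are all hypothetical and are not taken from the paper. The toy data illustrate how a deletion or a substitution carried by a single copy can be outvoted when coverage is dense, leaving a consensus with an intact reading frame.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}


def column_consensus(placed_reads, min_depth=1):
    """Majority-vote consensus over pre-aligned reads.

    placed_reads: list of (offset, aligned_sequence) pairs, where the
    offset is the read's start column in the contig and '-' marks a gap.
    Columns with no coverage (or coverage below min_depth) become 'N';
    columns where a gap wins the vote are dropped from the output.
    """
    length = max(offset + len(seq) for offset, seq in placed_reads)
    columns = [Counter() for _ in range(length)]
    for offset, seq in placed_reads:
        for i, base in enumerate(seq.upper()):
            columns[offset + i][base] += 1

    consensus = []
    for column in columns:
        depth = sum(column.values())
        if depth == 0 or depth < min_depth:
            consensus.append("N")
            continue
        base, _count = column.most_common(1)[0]
        if base != "-":
            consensus.append(base)
    return "".join(consensus)


def has_intact_orf(seq):
    """True if seq is ATG ... stop, in frame, with no internal stop codon."""
    if len(seq) < 6 or len(seq) % 3 != 0 or not seq.startswith("ATG"):
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    return codons[-1] in STOP_CODONS and not any(
        codon in STOP_CODONS for codon in codons[:-1]
    )


# Toy reads (invented for illustration): the second carries a one-base
# deletion and the fourth a substitution relative to the other copies.
reads = [
    (0, "ATGGCCAAGT"),
    (0, "ATGGC-AAGTTTG"),
    (4, "CCAAGTTTGGGC"),
    (8, "GTTTGAGCCCTAA"),
    (10, "TTGGGCCCTAA"),
]

consensus = column_consensus(reads)
print(consensus, has_intact_orf(consensus))
```

A real pipeline would also need to find and align the overlapping records in the first place (for example by similarity search and assembly), which this sketch deliberately assumes has been done.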

Publication types

  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Base Sequence
  • Cloning, Molecular / methods*
  • Computational Biology / methods
  • Consensus Sequence / genetics*
  • Contig Mapping
  • Databases, Nucleic Acid*
  • Models, Biological
  • Molecular Sequence Data
  • Open Reading Frames / genetics
  • Retroelements / genetics*

Substances

  • Retroelements