Efficient maximal repeat finding using the burrows-wheeler transform and wavelet tree

IEEE/ACM Trans Comput Biol Bioinform. 2012;9(2):421-9. doi: 10.1109/TCBB.2011.127. Epub 2011 Sep 27.

Abstract

Finding repetitive structures in genomes and proteins is important to understand their biological functions. Many data compressors for modern genomic sequences rely heavily on finding repeats in the sequences. The notion of maximal repeats captures all the repeats in the data in a space-efficient way. Prior work on maximal repeat finding used either a suffix tree or a suffix array along with other auxiliary data structures. Their space usage is 19--50 times the text size with the best engineering efforts, prohibiting their usability on massive data. Our technique uses the Burrows-Wheeler Transform and wavelet trees. For data sets consisting of natural language texts, the space usage of our method is no more than three times the text size. For genomic sequences stored using one byte per base, the space usage is less than double the sequence size. Our method is also orders of magnitude faster than the prior methods for processing massive texts, since the prior methods must use external memory. For the first time, our method enables a desktop computer with 8GB internal memory to find all the maximal repeats in the whole human genome in less than 17 hours. We have implemented our method as general-purpose open-source software for public use.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Computational Biology / methods*
  • Databases, Protein
  • Genome
  • Natural Language Processing*
  • Proteins / chemistry
  • Proteins / genetics
  • Sequence Analysis, Protein / methods*

Substances

  • Proteins