Efficient maximal repeat finding using the burrows-wheeler transform and wavelet tree

M Oğuzhan Külekci; Jeffrey Scott Vitter; Bojian Xu

doi:10.1109/TCBB.2011.127

Efficient maximal repeat finding using the burrows-wheeler transform and wavelet tree

IEEE/ACM Trans Comput Biol Bioinform. 2012;9(2):421-9. doi: 10.1109/TCBB.2011.127. Epub 2011 Sep 27.

Authors

M Oğuzhan Külekci¹, Jeffrey Scott Vitter, Bojian Xu

Affiliation

¹ National Research Institute of Electronics and Cryptology, Gebze.

PMID: 21968959
DOI: 10.1109/TCBB.2011.127

Abstract

Finding repetitive structures in genomes and proteins is important to understand their biological functions. Many data compressors for modern genomic sequences rely heavily on finding repeats in the sequences. The notion of maximal repeats captures all the repeats in the data in a space-efficient way. Prior work on maximal repeat finding used either a suffix tree or a suffix array along with other auxiliary data structures. Their space usage is 19--50 times the text size with the best engineering efforts, prohibiting their usability on massive data. Our technique uses the Burrows-Wheeler Transform and wavelet trees. For data sets consisting of natural language texts, the space usage of our method is no more than three times the text size. For genomic sequences stored using one byte per base, the space usage is less than double the sequence size. Our method is also orders of magnitude faster than the prior methods for processing massive texts, since the prior methods must use external memory. For the first time, our method enables a desktop computer with 8GB internal memory to find all the maximal repeats in the whole human genome in less than 17 hours. We have implemented our method as general-purpose open-source software for public use.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms*
Computational Biology / methods*
Databases, Protein
Genome
Natural Language Processing*
Proteins / chemistry
Proteins / genetics
Sequence Analysis, Protein / methods*

Substances

Proteins