Efficient Compression and Indexing for Highly Repetitive DNA Sequence Collections

Hongwei Huo; Xiaoyang Chen; Xu Guo; Jeffrey Scott Vitter

doi:10.1109/TCBB.2020.2968323

IEEE/ACM Trans Comput Biol Bioinform. 2021 Nov-Dec;18(6):2394-2408. doi: 10.1109/TCBB.2020.2968323. Epub 2021 Dec 8.

PMID: 31985436
DOI: 10.1109/TCBB.2020.2968323

Abstract

In this paper, we focus on the important problem of indexing and searching highly repetitive DNA sequence collections. Given a collection G of t sequences S_i of length n each, we can represent G succinctly in 2nH_k(T) + O(n' log log n) + o(q n') + o(tn) bits using O(tn² + q n') time, where H_k(T) is the kth-order empirical entropy of the sequence T ∈ G that is used as the reference sequence, n' is the total number of variations between T and the sequences in G, and q is a small fixed constant. We can restore any substring S[sp, ..., sp + len − 1] of length len of S ∈ G in O(n'_s + len · (log n)² / log log n) time, and report all positions where a pattern P of length m occurs in G in O(m · t + occ · t · (log n)² / log log n) time. In addition, we propose a dynamic programming method to find the variations between T and the sequences in G in a space-efficient way, with which we can build succinct structures to enable efficient search. For highly repetitive sequences, experimental results on the tested data demonstrate that the proposed method has significant advantages in space usage and retrieval time over the current state-of-the-art methods. The source code is available online.
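
The idea of storing a collection as one reference sequence plus per-sequence variations can be illustrated with a small sketch. This is not the authors' algorithm: it uses the textbook quadratic-space edit-distance dynamic program rather than the space-efficient method described in the abstract, and it stores variations in a plain Python list rather than in succinct structures. The function names `find_variations` and `restore`, and the `("sub" | "del" | "ins", position, base)` tuple layout, are hypothetical choices made only for this illustration.

```python
def find_variations(ref, seq):
    """Return edit operations that turn ref into seq (textbook edit-distance DP).

    Assumption: this is a plain O(len(ref) * len(seq)) table, not the
    space-efficient method proposed in the paper.
    """
    n, m = len(ref), len(seq)
    # dist[i][j] = edit distance between ref[:i] and seq[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == seq[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match / substitution
                             dist[i - 1][j] + 1,         # deletion from ref
                             dist[i][j - 1] + 1)         # insertion into ref

    # Trace back from the bottom-right corner to collect the variations.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (ref[i - 1] != seq[j - 1]):
            if ref[i - 1] != seq[j - 1]:
                ops.append(("sub", i - 1, seq[j - 1]))  # ref[i-1] replaced
            i -= 1
            j -= 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("del", i - 1, ""))              # ref[i-1] removed
            i -= 1
        else:
            ops.append(("ins", i, seq[j - 1]))          # base inserted before ref[i]
            j -= 1
    ops.reverse()  # left-to-right order along the reference
    return ops


def restore(ref, ops):
    """Rebuild a sequence from the reference and its list of variations."""
    out = []
    k = 0
    for p in range(len(ref) + 1):
        # Emit every base inserted before reference position p.
        while k < len(ops) and ops[k][0] == "ins" and ops[k][1] == p:
            out.append(ops[k][2])
            k += 1
        if p == len(ref):
            break
        if k < len(ops) and ops[k][1] == p:  # substitution or deletion at p
            kind, _, base = ops[k]
            k += 1
            if kind == "sub":
                out.append(base)
        else:
            out.append(ref[p])               # unchanged reference base
    return "".join(out)


reference = "ACGTACGTACGT"
collection = ["ACGTACGTACGT", "ACGTTCGTACGT", "ACGACGTAACGT"]
variations = [find_variations(reference, s) for s in collection]
restored = [restore(reference, ops) for ops in variations]
```

In this sketch a substring is obtained by restoring the whole sequence and slicing it, for example `restore(reference, variations[1])[3:7]`. The structures described in the abstract are designed to avoid that full reconstruction, which is what the stated O(n'_s + len · (log n)² / log log n) bound reflects.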

Publication types

Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Algorithms*
Computational Biology / methods
Data Compression / methods*
Repetitive Sequences, Nucleic Acid / genetics*
Sequence Analysis, DNA / methods*