CIndex: compressed indexes for fast retrieval of FASTQ files

Hongwei Huo; Pengfei Liu; Chenhui Wang; Hongbo Jiang; Jeffrey Scott Vitter

doi:10.1093/bioinformatics/btab655

Bioinformatics. 2022 Jan 3;38(2):335-343. doi: 10.1093/bioinformatics/btab655.

Authors

Hongwei Huo¹, Pengfei Liu¹, Chenhui Wang¹, Hongbo Jiang¹, Jeffrey Scott Vitter²

Affiliations

¹ Department of Computer Science, Xidian University, Xi'an 710071, China.
² Department of Computer Science, Tulane University, New Orleans, LA 70118, USA.

PMID: 34524416
DOI: 10.1093/bioinformatics/btab655

Abstract

Motivation: Ultrahigh-throughput next-generation sequencing instruments continue to generate vast amounts of genomic data. These data are generally stored in FASTQ format. Two important simultaneous goals are space-efficient compressed storage of the genomic data and fast query performance. Toward that end, we introduce compressed indexing to store and retrieve FASTQ files.

Results: We propose a compressed index for FASTQ files called CIndex. CIndex uses the Burrows-Wheeler transform and the wavelet tree, combined with hybrid encoding, succinct data structures and tables REF and Rγ, to achieve minimal space usage and fast retrieval on the compressed FASTQ files. Experiments conducted over real publicly available datasets from various sequencing instruments demonstrate that our proposed index substantially outperforms existing state-of-the-art solutions. For count, locate and extract queries on reads, our method uses 2.7-41.66 percentage points less space and provides speedups of 70-167.16 times, 1.44-35.57 times and 1.3-55.4 times, respectively. For extracting records in FASTQ files, our method uses 2.86-14.88 percentage points less space and provides a speedup of 3.13-20.1 times. CIndex has an additional advantage in that it can be readily adapted to work as a general-purpose text index; experiments show that it performs very well in practice.
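To make the count and locate queries concrete, the following is a minimal textbook sketch of backward search over a Burrows-Wheeler transform, written in Python. It is not the CIndex implementation: the function names (`build_index`, `rank`, `backward_search`, `count`, `locate`) are hypothetical, the suffix array is built naively and kept uncompressed, and `rank` is a linear scan standing in for the wavelet tree that a real compressed index would use. Hybrid encoding and the tables REF and Rγ are not modeled.

```python
def build_index(text):
    """Build a naive suffix array, the BWT, and the C table for text + '$'.

    Illustrative only: sorting whole suffixes is quadratic in the worst case
    and the suffix array is stored uncompressed.
    """
    assert "$" not in text, "'$' is reserved as the end-of-text sentinel"
    t = text + "$"
    # Suffix array: start positions of all suffixes in lexicographic order.
    sa = sorted(range(len(t)), key=lambda i: t[i:])
    # BWT: the character preceding each sorted suffix (wraps to '$' for i == 0).
    bwt = "".join(t[i - 1] for i in sa)
    # C[ch]: number of characters in t that are strictly smaller than ch.
    freq = {}
    for ch in t:
        freq[ch] = freq.get(ch, 0) + 1
    C, total = {}, 0
    for ch in sorted(freq):
        C[ch] = total
        total += freq[ch]
    return sa, bwt, C


def rank(bwt, ch, i):
    """Number of occurrences of ch in bwt[:i].

    A linear scan here; a wavelet tree answers the same query much faster.
    """
    return bwt.count(ch, 0, i)


def backward_search(bwt, C, pattern):
    """Return the half-open suffix-array interval [lo, hi) matching pattern."""
    lo, hi = 0, len(bwt)
    for ch in reversed(pattern):
        if ch not in C:
            return 0, 0
        lo = C[ch] + rank(bwt, ch, lo)
        hi = C[ch] + rank(bwt, ch, hi)
        if lo >= hi:
            return 0, 0
    return lo, hi


def count(bwt, C, pattern):
    """Number of occurrences of pattern in the indexed text."""
    lo, hi = backward_search(bwt, C, pattern)
    return hi - lo


def locate(sa, bwt, C, pattern):
    """Sorted start positions of pattern in the indexed text."""
    lo, hi = backward_search(bwt, C, pattern)
    return sorted(sa[lo:hi])


if __name__ == "__main__":
    sa, bwt, C = build_index("ACGTACGTTACG")
    print(count(bwt, C, "ACG"))
    print(locate(sa, bwt, C, "ACG"))
```

In this sketch, count needs only the BWT and the C table, while locate also consults the suffix array. Compressed indexes typically keep only a sample of the suffix array and recover the remaining positions on demand, trading query time for space.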

Availability and implementation: The software is available on GitHub: https://github.com/Hongweihuo-Lab/CIndex.

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
Data Compression* / methods
Genome
Genomics / methods
High-Throughput Nucleotide Sequencing / methods
Sequence Analysis, DNA / methods
Software*

Grants and funding

61373044/National Natural Science Foundation of China