RefSelect: a reference sequence selection algorithm for planted (l, d) motif search

BMC Bioinformatics. 2016 Jul 19;17 Suppl 9(Suppl 9):266. doi: 10.1186/s12859-016-1130-6.

Abstract

Background: The planted (l, d) motif search (PMS) is an important yet challenging problem in computational biology. Pattern-driven PMS algorithms usually use k out of t input sequences as reference sequences to generate candidate motifs, and they can find all the (l, d) motifs in the input sequences. However, most of them simply take the first k sequences in the input as reference sequences without elaborate selection processes, and thus they may exhibit sharp fluctuations in running time, especially for large alphabets.

Results: In this paper, we build the reference sequence selection problem and propose a method named RefSelect to quickly solve it by evaluating the number of candidate motifs for the reference sequences. RefSelect can bring a practical time improvement of the state-of-the-art pattern-driven PMS algorithms. Experimental results show that RefSelect (1) makes the tested algorithms solve the PMS problem steadily in an efficient way, (2) particularly, makes them achieve a speedup of up to about 100× on the protein data, and (3) is also suitable for large data sets which contain hundreds or more sequences.

Conclusions: The proposed algorithm RefSelect can be used to solve the problem that many pattern-driven PMS algorithms present execution time instability. RefSelect requires a small amount of storage space and is capable of selecting reference sequences efficiently and effectively. Also, the parallel version of RefSelect is provided for handling large data sets.

Keywords: Pattern-driven; Planted (l, d) motif search; Reference sequences.

MeSH terms

  • Algorithms
  • Amino Acid Motifs
  • Computational Biology / methods*
  • Protein Domains
  • Proteins / chemistry*
  • Proteins / genetics
  • Sequence Analysis, Protein
  • Software

Substances

  • Proteins