MulTFBS: A Spatial-Temporal Network with Multichannels for Predicting Transcription Factor Binding Sites

Jujuan Zhuang; Xinru Huang; Shuhan Liu; Wanquan Gao; Rui Su; Kexin Feng

doi:10.1021/acs.jcim.3c02088

J Chem Inf Model. 2024 May 11. doi: 10.1021/acs.jcim.3c02088. Online ahead of print.

Authors

Jujuan Zhuang¹, Xinru Huang¹, Shuhan Liu¹, Wanquan Gao¹, Rui Su¹, Kexin Feng¹

Affiliation

¹ The School of Science, Dalian Maritime University, Dalian 116026, China.

PMID: 38733561

Abstract

Revealing the mechanisms that influence transcription factor binding specificity is key to understanding gene regulation. Previous studies have successfully used DNA double-helix structure and one-hot embedding to design computational methods for predicting transcription factor binding sites (TFBSs). However, although DNA sequence can be regarded as a kind of biological language, the word embedding representations used in natural language processing have not been properly considered in TFBS prediction models. In this work, we integrate different types of DNA sequence features into a multichannel deep learning framework, MulTFBS, in which three feature types are trained in separate channels: one-hot encoding; word embedding encoding, which can incorporate contextual information and extract global features of the sequences; and three-dimensional structural features of the double helix. To extract high-level sequence information effectively, the framework uses a spatial-temporal network that combines convolutional neural networks and bidirectional long short-term memory networks with an attention mechanism. Compared with six state-of-the-art methods on 66 universal protein-binding microarray data sets covering different transcription factors, MulTFBS performs best on all data sets in the regression tasks, with an average R² of 0.698 and an average Pearson correlation coefficient (PCC) of 0.833, which are 5.4% and 3.2% higher, respectively, than those of the second-best method, CRPTS. In addition, we evaluate the classification performance of MulTFBS in distinguishing bound from unbound regions on TF ChIP-seq data. The results show that our framework also performs well in TFBS classification tasks.
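To make two of the input representations and the two regression metrics named in the abstract concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation. The function names, the k-mer length of 3, the stride of 1, and the all-zero encoding of ambiguous bases are assumptions chosen for illustration; the abstract specifies none of them. R² is computed here as the coefficient of determination, which is also an assumption about the definition used in the paper.

```python
import math

BASES = "ACGT"  # column order of the one-hot channel (an assumed convention)


def one_hot(seq):
    """Encode a DNA sequence as one 4-element row per base (A, C, G, T).

    Bases outside ACGT (for example N) become all-zero rows; this is an
    assumption, not something stated in the abstract.
    """
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def kmer_tokens(seq, k=3):
    """Split a sequence into overlapping k-mers with stride 1.

    These k-mers play the role of 'words' that a word-embedding model
    would map to dense vectors. k=3 is a hypothetical choice.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def pearson(y_true, y_pred):
    """Pearson correlation coefficient between observed and predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    mean_pred = sum(y_pred) / n
    cov = sum((t - mean_true) * (p - mean_pred) for t, p in zip(y_true, y_pred))
    var_true = sum((t - mean_true) ** 2 for t in y_true)
    var_pred = sum((p - mean_pred) ** 2 for p in y_pred)
    return cov / math.sqrt(var_true * var_pred)


if __name__ == "__main__":
    probe = "ACGTA"  # toy sequence, not taken from any data set in the paper
    print(one_hot(probe))
    print(kmer_tokens(probe))
    print(r_squared([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))
    print(pearson([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))
```

In a multichannel design such as the one described, each representation would feed its own channel before the channels are combined; the structural-feature channel and the network itself are omitted here because the abstract does not give enough detail to sketch them faithfully.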