Low-Quality Structural and Interaction Data Improves Binding Affinity Prediction via Random Forest

Hongjian Li; Kwong-Sak Leung; Man-Hon Wong; Pedro J Ballester

doi:10.3390/molecules200610947

Molecules. 2015 Jun 12;20(6):10947-62. doi: 10.3390/molecules200610947.

Authors

Hongjian Li¹, Kwong-Sak Leung², Man-Hon Wong³, Pedro J Ballester⁴

Affiliations

¹ Department of Computer Science and Engineering, Chinese University of Hong Kong, Sha Tin, New Territories 999077, Hong Kong. jackyleehongjian@gmail.com.
² Department of Computer Science and Engineering, Chinese University of Hong Kong, Sha Tin, New Territories 999077, Hong Kong. ksleung@cse.cuhk.edu.hk.
³ Department of Computer Science and Engineering, Chinese University of Hong Kong, Sha Tin, New Territories 999077, Hong Kong. mhwong@cse.cuhk.edu.hk.
⁴ Cancer Research Center of Marseille, INSERM U1068, F-13009 Marseille, France. pedro.ballester@inserm.fr.

Abstract

Docking scoring functions can be used to predict the strength of protein-ligand binding. It is widely believed that training a scoring function with low-quality data is detrimental to its predictive performance. Nevertheless, there is a surprising lack of systematic validation experiments in support of this hypothesis. In this study, we investigated to what extent training a scoring function on data that include low-quality structural and binding data is detrimental to predictive performance. We found that low-quality data are not only non-detrimental but beneficial for the predictive performance of machine-learning scoring functions, though the improvement is smaller than that obtained from high-quality data. Furthermore, we observed that classical scoring functions are not able to effectively exploit data beyond an early threshold, regardless of data quality. This demonstrates that exploiting a larger data volume is more important for the performance of machine-learning scoring functions than restricting training to a smaller set of higher-quality data.
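
The comparison described above rests on training sets that grow as quality filters are relaxed. The following is a minimal sketch of how such nested sets might be assembled, not the authors' protocol: the `Complex` record, its fields, the resolution cut-offs, and the toy entries are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Complex:
    """One protein-ligand complex with hypothetical quality annotations."""
    name: str                # placeholder label, not a real PDB entry
    resolution: float        # structural resolution in angstroms (lower is better)
    reliable_affinity: bool  # assumed flag: True if the binding measurement is trusted


def select(complexes: List[Complex],
           max_resolution: float,
           allow_unreliable_affinity: bool) -> List[Complex]:
    """Keep complexes that pass the given structural and binding-data filters."""
    return [
        c for c in complexes
        if c.resolution <= max_resolution
        and (allow_unreliable_affinity or c.reliable_affinity)
    ]


# Made-up records used only to exercise the filter.
toy = [
    Complex("c001", 1.6, True),
    Complex("c002", 2.2, True),
    Complex("c003", 2.4, False),
    Complex("c004", 3.4, True),
    Complex("c005", 3.8, False),
]

# Filters ordered from strictest to loosest (assumed values).
ladder = [
    (2.0, False),
    (2.5, False),
    (2.5, True),
    (4.0, True),
]

training_sets = [select(toy, res, allow) for res, allow in ladder]
for (res, allow), members in zip(ladder, training_sets):
    print(f"resolution <= {res}, unreliable affinity allowed={allow}: "
          f"{len(members)} complexes")
```

Because each filter in the ladder is at least as permissive as the one before it, every training set contains the previous one. A scoring function could then be retrained on each set and evaluated on a fixed test set to see how its performance changes with the volume and quality of the training data.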

Keywords: binding affinity prediction; docking; machine-learning scoring functions.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Models, Theoretical*
Structure-Activity Relationship