Leveraging Language Model Multitasking To Predict C-H Borylation Selectivity

J Chem Inf Model. 2024 May 27;64(10):4286-4297. doi: 10.1021/acs.jcim.4c00137. Epub 2024 May 6.

Abstract

C-H borylation is a high-value transformation in the synthesis of lead candidates for the pharmaceutical industry because a wide array of downstream coupling reactions is available for the borylated products. However, predicting its regioselectivity is not trivial, especially in drug-like molecules that may contain multiple heterocycles. Using a data set of borylation reactions from Reaxys, we explored how a language model originally trained on USPTO_500_MT, a broad-scope set of patent data, can be used to predict the C-H borylation reaction product in two modes: product generation and site reactivity classification. Our fine-tuned T5Chem multitask language model generates the correct product in 79% of cases. It also classifies the reactive aromatic C-H bonds with 95% accuracy and 88% positive predictive value, exceeding purpose-developed graph-based neural networks.
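The two site-classification metrics quoted above can be illustrated with a minimal Python sketch. It assumes each aromatic C-H site carries a binary label (1 = borylated, 0 = unreactive). The function name and the example labels are hypothetical and are not taken from the paper or its data set. Positive predictive value is reported alongside accuracy because, when reactive sites are a minority, accuracy alone can look high even if many predicted sites are wrong.

```python
def accuracy_and_ppv(y_true, y_pred):
    """Return (accuracy, positive predictive value) for binary site labels.

    Hypothetical helper for illustration only; not the authors' code.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one site is required")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # reactive, predicted reactive
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # unreactive, predicted reactive
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # unreactive, predicted unreactive

    accuracy = (tp + tn) / len(pairs)
    # PPV is undefined when no site is predicted reactive.
    ppv = tp / (tp + fp) if (tp + fp) > 0 else float("nan")
    return accuracy, ppv


# Made-up labels for ten aromatic C-H sites (assumption, not real data).
y_true = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 1, 0, 1, 0, 0, 0]

acc, ppv = accuracy_and_ppv(y_true, y_pred)
print(f"accuracy = {acc:.2f}, PPV = {ppv:.2f}")
```

In this toy case one unreactive site is wrongly flagged as reactive, so accuracy stays at 9 of 10 sites while PPV drops to 2 of 3 predicted sites.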

MeSH terms

  • Hydrogen* / chemistry
  • Models, Chemical
  • Neural Networks, Computer

Substances

  • Hydrogen