Large language model for horizontal transfer of resistance gene: From resistance gene prevalence detection to plasmid conjugation rate evaluation

Sci Total Environ. 2024 Apr 16:931:172466. doi: 10.1016/j.scitotenv.2024.172466. Online ahead of print.

Abstract

The growing problem of plasmid-mediated dissemination of antibiotic resistance genes (ARGs) poses a significant threat to environmental integrity. However, the prediction of ARG prevalence has been overlooked, especially for emerging ARGs that are potentially evolving gene exchange hotspots. Here, we explored using DNABERT to classify plasmid and chromosome sequences and to detect resistance gene prevalence. First, DNABERT fine-tuned on plasmid and chromosome sequences, followed by a multilayer perceptron (MLP) classifier, achieved an AUC (area under the curve) of 0.764 on external datasets across 23 genera, outperforming a traditional statistics-based model by 0.02 AUC. Furthermore, single-genus models for Escherichia and Pseudomonas were also trained to explore their performance in detecting ARG prevalence. Integrating k-mer frequency attributes improved the prediction of ARG prevalence on an external dataset by 0.0281-0.0615 AUC for Escherichia and by 0.0196-0.0928 AUC for Pseudomonas. Finally, drawing on data from existing literature, we established a random forest model to forecast the relative conjugation transfer rate of plasmids, which achieved an AUC of 0.7956. The model identified the plasmid's repression status, cellular density, and temperature as the most important factors influencing transfer frequency. Combined, these two models provide a useful reference for quick, low-cost integrated evaluation of resistance gene transfer, accelerating computer-assisted quantitative risk assessment of ARG transfer in the environmental field.

Keywords: ARGs prevalence prediction; BERT; Deep learning; Large language model; Plasmid conjugation rate.
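
The k-mer frequency attributes mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `kmer_frequencies`, the choice of k = 3, and the normalization are assumptions made for illustration, and the abstract does not state how these features were combined with the DNABERT representation.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=3):
    """Return normalized frequencies of all 4**k DNA k-mers.

    The output is ordered lexicographically over the alphabet ACGT.
    Windows containing other characters (e.g. N) are ignored.
    The default k=3 is an illustrative assumption, not a value from the paper.
    """
    seq = seq.upper()
    # Enumerate every possible k-mer so the vector has a fixed length.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every overlapping window of length k in the sequence.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Only windows made purely of A, C, G, T contribute to the total.
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


if __name__ == "__main__":
    features = kmer_frequencies("ATGCGTACGTTAGC", k=3)
    print(len(features))  # 64 entries, one per possible 3-mer
```

A fixed-length vector of this kind could, for example, be concatenated with a sequence embedding before a downstream classifier; that design is likewise an assumption rather than a detail given in the abstract.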