A publication-wide association study (PWAS), historical language models to prioritise novel therapeutic drug targets

David Narganes-Carlón; Daniel J Crowther; Ewan R Pearson

doi:10.1038/s41598-023-35597-4

Sci Rep. 2023 May 24;13(1):8366. doi: 10.1038/s41598-023-35597-4.

Authors

David Narganes-Carlón¹ ², Daniel J Crowther³, Ewan R Pearson⁴

Affiliations

¹ Division of Population Health and Genomics, Ninewells Hospital, School of Medicine, University of Dundee, Dundee, DD1 9SY, UK. dnarganes@exscientia.co.uk.
² Exscientia Ltd, Dundee One, River Court, 5 West Victoria Dock Road, Dundee, DD1 3JT, UK. dnarganes@exscientia.co.uk.
³ Exscientia Ltd, Dundee One, River Court, 5 West Victoria Dock Road, Dundee, DD1 3JT, UK.
⁴ Division of Population Health and Genomics, Ninewells Hospital, School of Medicine, University of Dundee, Dundee, DD1 9SY, UK.

Abstract

Most biomedical knowledge is published as text, making it challenging to analyse using traditional statistical methods. In contrast, machine-interpretable data primarily comes from structured property databases, which represent only a fraction of the knowledge present in the biomedical literature. Crucial insights and inferences can be drawn from these publications by the scientific community. We trained language models on literature from different time periods to evaluate their ranking of prospective gene-disease associations and protein-protein interactions. Using 28 distinct historical text corpora of abstracts published between 1995 and 2022, we trained independent Word2Vec models to prioritise associations that were likely to be reported in future years. This study demonstrates that biomedical knowledge can be encoded as word embeddings without the need for human labelling or supervision. Language models effectively capture drug discovery concepts such as clinical tractability, disease associations, and biochemical pathways. Additionally, these models can prioritise hypotheses years before their initial reporting. Our findings underscore the potential for extracting yet-to-be-discovered relationships through data-driven approaches, leading to generalised biomedical literature mining for potential therapeutic drug targets. The Publication-Wide Association Study (PWAS) enables the prioritisation of under-explored targets and provides a scalable system for accelerating early-stage target ranking, irrespective of the specific disease of interest.
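
The ranking step described in the abstract can be illustrated with a minimal sketch. It assumes that candidate genes are scored by cosine similarity between their embedding and the embedding of a disease term; the abstract does not state the scoring function, so this is an assumption rather than the authors' method. The vectors, token names and the `rank_candidates` helper are hypothetical; in practice the embeddings would come from a Word2Vec model trained on abstracts up to a chosen cutoff year.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_candidates(embeddings, disease, candidates):
    """Rank candidate gene tokens by similarity to a disease token, best first.

    Hypothetical helper: candidates missing from the vocabulary are skipped.
    """
    target = embeddings[disease]
    scored = [
        (gene, cosine(embeddings[gene], target))
        for gene in candidates
        if gene in embeddings
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy, hand-written vectors standing in for the embeddings of one historical
# model. They are illustrative only and are not taken from the paper.
toy_embeddings = {
    "disease_x": [0.9, 0.1, 0.0],
    "gene_a": [0.8, 0.2, 0.1],
    "gene_b": [0.1, 0.9, 0.2],
    "gene_c": [0.5, 0.5, 0.0],
}

ranking = rank_candidates(
    toy_embeddings, "disease_x", ["gene_a", "gene_b", "gene_c"]
)
for gene, score in ranking:
    print(f"{gene}\t{score:.3f}")
```

Repeating this ranking with one model per historical corpus, and checking where associations first reported in later years fall in each earlier ranking, would give a retrospective evaluation of the kind the abstract describes.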

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Databases, Factual
Drug Delivery Systems*
Drug Discovery*
Humans
Language
Prospective Studies

Grants and funding

MRC_/Medical Research Council/United Kingdom