Building a specialized lexicon for breast cancer clinical trial subject eligibility analysis

Euisung Jung; Hemant Jain; Atish P Sinha; Carmelo Gaudioso

doi:10.1177/1460458221989392

Health Informatics J. 2021 Jan-Mar;27(1):1460458221989392. doi: 10.1177/1460458221989392.

Authors

Euisung Jung¹, Hemant Jain², Atish P Sinha³, Carmelo Gaudioso⁴

Affiliations

¹ Information Operations and Technology Management, John B. and Lillian E. Neff College of Business and Innovation, The University of Toledo, USA.
² Gary W. Rollins College of Business, The University of Tennessee at Chattanooga, USA.
³ Lubar School of Business, University of Wisconsin-Milwaukee, USA.
⁴ Roswell Park Cancer Institute, USA.

PMID: 33535885

Abstract

Natural language processing (NLP) applications require sophisticated lexical resources to support their processing goals. Different solutions, such as dictionary lookup and MetaMap, have been proposed in the healthcare informatics literature to identify disease terms with more than one word (multi-gram disease named entities). Although much work has been done on identifying protein and gene named entities in the biomedical field, little research has addressed the recognition and resolution of terminology in clinical trial subject eligibility analysis. In this study, we develop a specialized lexicon for improving NLP and text mining analysis in the breast cancer domain, and evaluate it by comparing it with the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT). We use a hybrid methodology that combines the knowledge of domain experts, terms from multiple online dictionaries, and the mining of text from sample clinical trials. Our methodology introduces 4243 unique lexicon items, which increase bigram entity match by 38.6% and trigram entity match by 41%. By adding a significant number of new terms, the lexicon supports automatic matching of patients to clinical trials based on eligibility criteria. Beyond clinical trial matching, the specialized lexicon developed in this study could serve as a foundation for future healthcare text mining applications.
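To make the idea of bigram and trigram entity matching concrete, the following is a minimal sketch of dictionary lookup over word n-grams. It is not the authors' implementation: the function names, the two toy lexicons, and the sample eligibility sentence are all hypothetical and chosen only for illustration, and the sketch does not reproduce the evaluation metric reported above.

```python
import re


def ngrams(tokens, n):
    """Return all contiguous word n-grams as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def match_ngrams(text, lexicon, n):
    """Return the n-grams of `text` that appear in `lexicon`, in text order.

    Tokenization is a simple lowercase alphanumeric split (an assumption);
    a real pipeline would need proper clinical-text tokenization.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [gram for gram in ngrams(tokens, n) if gram in lexicon]


# Hypothetical stand-in for a general-purpose terminology.
baseline = {"breast cancer", "radiation therapy"}

# Hypothetical stand-in for a lexicon extended with domain-specific terms.
specialized = baseline | {
    "ductal carcinoma",
    "her2 positive",
    "invasive ductal carcinoma",
}

# Hypothetical eligibility criterion.
criteria = (
    "Patients with invasive ductal carcinoma, HER2 positive breast cancer, "
    "no prior radiation therapy."
)

for name, lexicon in [("baseline", baseline), ("specialized", specialized)]:
    for n in (2, 3):
        print(name, n, match_ngrams(criteria, lexicon, n))
```

In this toy setting the extended lexicon recognizes multi-word terms that the smaller one misses, which is the kind of gain in bigram and trigram matches that the abstract quantifies against SNOMED CT.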

Keywords: breast cancer; clinical trial; natural language processing; specialized lexicon.

MeSH terms

Breast Neoplasms* / therapy
Data Mining
Female
Humans
Natural Language Processing