A Natural Language Processing Tool to Extract Quantitative Smoking Status from Clinical Narratives

IEEE Int Conf Healthc Inform. 2020 Nov-Dec:2020:10.1109/ICHI48887.2020.9374369. doi: 10.1109/ICHI48887.2020.9374369. Epub 2021 Mar 12.

Abstract

This study presents a natural language processing (NLP) tool that extracts quantitative smoking information (e.g., Pack-Year, Quit Year, Smoking Year, and Pack per Day) from clinical notes and standardizes it into Pack-Year units. We annotated a corpus of 200 clinical notes from patients who had undergone low-dose CT imaging for lung cancer screening and developed an NLP system with a two-layer rule-engine structure. We divided the 200 notes into a training set and a test set and developed the NLP system using only the training set. On the test set, our NLP system achieved best F1 scores of 0.963 and 0.946 under lenient and strict evaluation, respectively.
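
To make the standardization step concrete, the following is a minimal sketch of how a two-layer, rule-based approach might work. It is not the authors' rule set: the regular expressions, the function names (`extract_mentions`, `to_pack_years`), and the split into an extraction layer and a normalization layer are assumptions made for illustration. The only fact it relies on is the standard definition that pack-years equal packs per day multiplied by years smoked.

```python
import re
from typing import Dict, Optional

# Layer 1 (hypothetical): surface patterns for quantitative smoking mentions.
# These expressions are illustrative only and are not taken from the paper.
NUM = r"(\d+(?:\.\d+)?)"
PACK_YEAR_RE = re.compile(NUM + r"\s*-?\s*pack[\s-]*years?", re.I)
PACK_PER_DAY_RE = re.compile(NUM + r"\s*(?:ppd|packs?\s*(?:per|a|/)\s*day)", re.I)
SMOKING_YEARS_RE = re.compile(r"\b(?:for|x)\s*" + NUM + r"\s*(?:years?|yrs?)", re.I)


def extract_mentions(text: str) -> Dict[str, float]:
    """Return the first numeric value found for each smoking concept."""
    patterns = {
        "pack_year": PACK_YEAR_RE,
        "pack_per_day": PACK_PER_DAY_RE,
        "smoking_years": SMOKING_YEARS_RE,
    }
    found: Dict[str, float] = {}
    for name, pattern in patterns.items():
        match = pattern.search(text)
        if match:
            found[name] = float(match.group(1))
    return found


def to_pack_years(text: str) -> Optional[float]:
    """Layer 2 (hypothetical): normalize extracted mentions to pack-years."""
    mentions = extract_mentions(text)
    # Prefer an explicitly stated pack-year value when one is present.
    if "pack_year" in mentions:
        return mentions["pack_year"]
    # Otherwise derive it: pack-years = packs per day * years smoked.
    if "pack_per_day" in mentions and "smoking_years" in mentions:
        return mentions["pack_per_day"] * mentions["smoking_years"]
    # Not enough quantitative information to standardize.
    return None
```

Under these assumed patterns, a note stating "1.5 packs per day for 20 years" and a note stating "30 pack-year smoking history" both normalize to the same value, while a note with only a quit year yields no pack-year estimate. Real clinical text would need far broader patterns, including spelled-out numbers, ranges, and negation.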

Keywords: natural language processing; quantitative smoking information extraction; tobacco use.