Ensemble-based Methods to Improve De-identification of Electronic Health Record Narratives

Youngjun Kim; Paul Heider; Stéphane Meystre

AMIA Annu Symp Proc. 2018 Dec 5:2018:663-672. eCollection 2018.

Authors

Youngjun Kim¹, Paul Heider¹, Stéphane Meystre¹,²

Affiliations

¹ Medical University of South Carolina, Charleston, South Carolina, USA.
² Clinacuity, Inc., Charleston, South Carolina, USA.

PMID: 30815108
PMCID: PMC6371277

Abstract

Text de-identification is an application of clinical natural language processing that offers significant efficiency and scalability advantages. Hence, various learning algorithms have been applied to this task to yield better performance. Instead of choosing the best individual learning algorithm, we aim to improve de-identification by constructing ensembles that lead to more accurate classification. We present three different ensemble methods that combine multiple de-identification models trained from deep learning, shallow learning, and rule-based approaches. Each model is capable of automated de-identification without manual medical expertise. Our experimental results show that the stacked learning ensemble is more effective than other ensemble methods, producing the highest recall, the most important metric for de-identification. The stacked ensemble achieved state-of-the-art performance on the 2014 i2b2 dataset with 97.04% precision, 94.45% recall, and 95.73% F₁ score.
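As a quick arithmetic check on the reported figures, the F₁ score is the harmonic mean of precision and recall. The sketch below recomputes it from the precision and recall stated in the abstract. The function name `f1_score` is illustrative and not taken from the paper, and the paper's own evaluation script may differ in how it counts matches.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall.

    Inputs and output use the same scale (percentages here).
    Returns 0.0 when both inputs are zero to avoid dividing by zero.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract for the stacked ensemble.
reported_precision = 97.04
reported_recall = 94.45

f1 = f1_score(reported_precision, reported_recall)
print(f"F1 = {f1:.2f}%")
```

The recomputed value rounds to the 95.73% given in the abstract, so the three reported metrics are mutually consistent.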

Publication types

Research Support, N.I.H., Extramural

MeSH terms

Algorithms*
Data Anonymization*
Electronic Health Records*
Humans
Machine Learning*
Methods
Natural Language Processing*
