Ensemble-based Methods to Improve De-identification of Electronic Health Record Narratives

AMIA Annu Symp Proc. 2018 Dec 5;2018:663-672. eCollection 2018.

Abstract

Text de-identification is an application of clinical natural language processing that offers significant efficiency and scalability advantages. Hence, various learning algorithms have been applied to this task to yield better performance. Instead of choosing the best individual learning algorithm, we aim to improve de-identification by constructing ensembles that lead to more accurate classification. We present three different ensemble methods that combine multiple de-identification models trained from deep learning, shallow learning, and rule-based approaches. Each model is capable of automated de-identification without manual medical expertise. Our experimental results show that the stacked learning ensemble is more effective than other ensemble methods, producing the highest recall, the most important metric for de-identification. The stacked ensemble achieved state-of-the-art performance on the 2014 i2b2 dataset with 97.04% precision, 94.45% recall, and 95.73% F1 score.
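
As an illustration of what combining several de-identification models can look like, the sketch below merges per-token labels from three models by simple majority vote and checks the reported F1 score against the reported precision and recall. It is a minimal sketch, not the authors' implementation: the abstract does not say how its three ensemble methods work, so majority voting is a stand-in technique, and the function names, the label set (`NAME`, `DATE`, `O`), the tie-breaking policy, and the sample predictions are all hypothetical. Only the F1 formula (the harmonic mean of precision and recall) and the three reported percentages are taken as given.

```python
from collections import Counter


def majority_vote(predictions, default="O"):
    """Combine per-token labels from several models by majority vote.

    predictions: one label sequence per model, all covering the same tokens.
    default: label used when the two most frequent labels tie
             (a hypothetical policy chosen for this sketch).
    """
    if not predictions:
        return []
    length = len(predictions[0])
    if any(len(labels) != length for labels in predictions):
        raise ValueError("all models must label the same tokens")

    combined = []
    for token_labels in zip(*predictions):
        counts = Counter(token_labels).most_common()
        top_label, top_count = counts[0]
        # Resolve a tie between the two most frequent labels to the default.
        if len(counts) > 1 and counts[1][1] == top_count:
            top_label = default
        combined.append(top_label)
    return combined


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units in, same units out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical per-token outputs from three models over five tokens.
deep_model = ["O", "NAME", "NAME", "O", "DATE"]
shallow_model = ["O", "NAME", "O", "O", "DATE"]
rule_based = ["O", "O", "NAME", "O", "DATE"]

print(majority_vote([deep_model, shallow_model, rule_based]))
# ['O', 'NAME', 'NAME', 'O', 'DATE']

# Consistency check on the figures reported in the abstract.
print(round(f1_score(97.04, 94.45), 2))
# 95.73
```

Because the abstract treats recall as the most important metric, a real system might resolve ties in favor of a protected-information label rather than `O`; the `default` parameter is exposed so that choice can be changed.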

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Algorithms*
  • Data Anonymization*
  • Electronic Health Records*
  • Humans
  • Machine Learning*
  • Methods
  • Natural Language Processing*