De-identifying free text of Japanese electronic health records

Kohei Kajiyama; Hiromasa Horiguchi; Takashi Okumura; Mizuki Morita; Yoshinobu Kano

doi:10.1186/s13326-020-00227-9

J Biomed Semantics. 2020 Sep 21;11(1):11. doi: 10.1186/s13326-020-00227-9.

Authors

Kohei Kajiyama¹, Hiromasa Horiguchi², Takashi Okumura³, Mizuki Morita⁴, Yoshinobu Kano⁵

Affiliations

¹ Faculty of Informatics, Shizuoka University, Johoku 3-5-1, Naka-ku, Hamamatsu, Shizuoka, 432-8011, Japan.
² National Hospital Organization Headquarters, 2-5-21 Higashigaoka, Meguro-ku, Tokyo, 152-8621, Japan.
³ National University Corporation Kitami Institute of Technology, 165, Koencho, Kitami, Hokkaido, 090-8507, Japan.
⁴ Graduate School of Interdisciplinary Science and Engineering in Health Systems, Okayama University, 2-5-1, Kita-ku, Okayama, Okayama, 700-8558, Japan.
⁵ Faculty of Informatics, Shizuoka University, Johoku 3-5-1, Naka-ku, Hamamatsu, Shizuoka, 432-8011, Japan. kano@inf.shizuoka.ac.jp.

Abstract

Background: Recently, more electronic data sources are becoming available in the healthcare domain. Electronic health records (EHRs), with their vast amounts of potentially available data, can greatly improve healthcare. Although EHR de-identification is necessary to protect personal information, automatic de-identification of Japanese language EHRs has not been studied sufficiently. This study was conducted to raise de-identification performance for Japanese EHRs through classic machine learning, deep learning, and rule-based methods, depending on the dataset.

Results: Using three datasets, we implemented de-identification systems for Japanese EHRs and compared the de-identification performance of rule-based, Conditional Random Fields (CRF), and Long Short-Term Memory (LSTM)-based methods. Gold standard tags for de-identification were annotated manually for age, hospital, person, sex, and time. We used different combinations of our datasets to train and evaluate our three methods. Our best F1-scores were 84.23, 68.19, and 81.67 points, respectively, for evaluations of the MedNLP dataset, a dummy EHR dataset of fictitious records written by a medical doctor, and a Pathology Report dataset. Our LSTM-based method was the best performing, except for the MedNLP dataset, for which the rule-based method was best. Even there, the LSTM-based method achieved a good score of 83.07 points, 1.16 points below the best score obtained using the rule-based method. These results suggest that LSTM adapted well to the different characteristics of our datasets. On our Pathology Report dataset, our LSTM-based method outperformed our CRF-based method by 7.41 F1-score points. This report is the first study to apply this LSTM-based method to any de-identification task for Japanese EHRs.
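
For readers unfamiliar with the metric, the following is a minimal sketch of how an F1-score on a 0 to 100 scale could be computed for tagged entities. It assumes exact-match scoring of (start, end, tag) spans, which the abstract does not specify; the `Entity` type, the `f1_score` function, and the toy spans are hypothetical and are not taken from the study.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    """A tagged span; the offsets used below are illustrative only."""
    start: int
    end: int
    tag: str  # one of the five tag types: age, hospital, person, sex, time


def f1_score(gold: set, predicted: set) -> float:
    """Exact-match F1 in points (0-100). Assumption: a prediction counts
    only if its start, end, and tag all match a gold entity."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 100 * 2 * precision * recall / (precision + recall)


# Toy data (hypothetical): four gold entities, three predictions,
# two of which match exactly; the "person" span has a wrong end offset.
gold = {
    Entity(0, 3, "age"),
    Entity(10, 14, "hospital"),
    Entity(20, 22, "person"),
    Entity(30, 34, "time"),
}
predicted = {
    Entity(0, 3, "age"),
    Entity(10, 14, "hospital"),
    Entity(20, 23, "person"),
}

score = f1_score(gold, predicted)
print(f"F1 = {score:.2f} points")
```

Under this assumed scoring, a span with a wrong boundary counts as both a false positive and a false negative, so exact-match F1 penalizes boundary errors as heavily as missed entities.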

Conclusions: Our LSTM-based machine learning method was able to extract named entities to be de-identified with better performance, in general, than that of our rule-based methods. However, machine learning methods are inadequate for processing expressions with low occurrence. Our future work will specifically examine the combination of LSTM and rule-based methods to achieve better performance. Our currently achieved level of performance is sufficiently higher than that of publicly available Japanese de-identification tools. Therefore, our system will be applied to actual de-identification tasks in hospitals.

Keywords: De-identification; Electronic health records; Japanese language.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Deep Learning
Electronic Health Records*
Language*
Natural Language Processing*