Predictive structured-unstructured interactions in EHR models: A case study of suicide prediction

Ilkin Bayramli; Victor Castro; Yuval Barak-Corren; Emily M Madsen; Matthew K Nock; Jordan W Smoller; Ben Y Reis

doi:10.1038/s41746-022-00558-0

NPJ Digit Med. 2022 Jan 27;5(1):15. doi: 10.1038/s41746-022-00558-0.

Authors

Ilkin Bayramli^{1,2}, Victor Castro^{3,4}, Yuval Barak-Corren¹, Emily M Madsen^{5,6}, Matthew K Nock^{4,7,8}, Jordan W Smoller^#^{5,6,9}, Ben Y Reis^#^{10,11}

Affiliations

¹ Predictive Medicine Group, Computational Health Informatics Program, Boston Children's Hospital, Boston, MA, USA.
² Harvard University, Cambridge, MA, USA.
³ Mass General Brigham Research Information Science and Computing, Boston, MA, USA.
⁴ Department of Psychiatry, Massachusetts General Hospital, Boston, MA, USA.
⁵ Psychiatric and Neurodevelopmental Genetics Unit, Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA.
⁶ Center for Precision Psychiatry, Department of Psychiatry, Massachusetts General Hospital, Boston, MA, USA.
⁷ Department of Psychology, Harvard University, Cambridge, MA, USA.
⁸ Mental Health Research Program, Franciscan Children's, Brighton, MA, USA.
⁹ Harvard Medical School, Boston, MA, USA.
¹⁰ Predictive Medicine Group, Computational Health Informatics Program, Boston Children's Hospital, Boston, MA, USA. ben_reis@harvard.edu.
¹¹ Harvard Medical School, Boston, MA, USA. ben_reis@harvard.edu.

^# Contributed equally.

Abstract

Clinical risk prediction models powered by electronic health records (EHRs) are becoming increasingly widespread in clinical practice. With suicide-related mortality rates rising in recent years, it is becoming increasingly urgent to understand, predict, and prevent suicidal behavior. Here, we compare the predictive value of structured and unstructured EHR data for predicting suicide risk. We find that Naive Bayes Classifier (NBC) and Random Forest (RF) models trained on structured EHR data perform better than those based on unstructured EHR data. An NBC model trained on both structured and unstructured data yields similar performance (AUC = 0.743) to an NBC model trained on structured data alone (0.742, p = 0.668), while an RF model trained on both data types yields significantly better results (AUC = 0.903) than an RF model trained on structured data alone (0.887, p < 0.001), likely due to the RF model's ability to capture interactions between the two data types. To investigate these interactions, we propose and implement a general framework for identifying specific structured-unstructured feature pairs whose interactions differ between case and non-case cohorts, and thus have the potential to improve predictive performance and increase understanding of clinical risk. We find that such feature pairs tend to capture heterogeneous pairs of general concepts, rather than homogeneous pairs of specific concepts. These findings and this framework can be used to improve current and future EHR-based clinical modeling efforts.
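
The idea of searching for structured-unstructured feature pairs whose interactions differ between cohorts can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say which interaction statistic the framework uses, so the smoothed log odds ratio below is an assumed stand-in, and the feature names (`dx_code_A`, `note_term_X`, and so on) and the toy data are hypothetical. Each pair is scored by how much the association between a structured feature and an unstructured feature differs between cases and non-cases.

```python
import math
from itertools import product


def log_odds_ratio(a, b):
    """Smoothed log odds ratio between two binary feature columns.

    Assumption: the association measure is our choice for illustration;
    the abstract does not specify the statistic used in the paper.
    """
    # 2x2 contingency counts, each with a 0.5 (Haldane-Anscombe) correction
    # so that empty cells do not produce a division by zero.
    n11 = sum(1 for x, y in zip(a, b) if x and y) + 0.5
    n10 = sum(1 for x, y in zip(a, b) if x and not y) + 0.5
    n01 = sum(1 for x, y in zip(a, b) if not x and y) + 0.5
    n00 = sum(1 for x, y in zip(a, b) if not x and not y) + 0.5
    return math.log((n11 * n00) / (n10 * n01))


def rank_pairs(structured, unstructured, labels):
    """Rank structured-unstructured pairs by the case/non-case gap.

    structured, unstructured: dict mapping feature name -> list of 0/1,
    one entry per patient. labels: list of 0/1 (1 = case).
    Returns (score, structured_name, unstructured_name), largest gap first.
    """
    cases = [i for i, y in enumerate(labels) if y == 1]
    controls = [i for i, y in enumerate(labels) if y == 0]
    scores = []
    for (s_name, s_col), (u_name, u_col) in product(
        structured.items(), unstructured.items()
    ):
        lor_case = log_odds_ratio(
            [s_col[i] for i in cases], [u_col[i] for i in cases]
        )
        lor_ctrl = log_odds_ratio(
            [s_col[i] for i in controls], [u_col[i] for i in controls]
        )
        scores.append((abs(lor_case - lor_ctrl), s_name, u_name))
    return sorted(scores, reverse=True)


# Hypothetical toy cohort: the first 8 patients are cases, the last 8 are not.
labels = [1] * 8 + [0] * 8
structured = {
    "dx_code_A": [1, 1, 1, 1, 0, 0, 0, 0] + [1, 1, 0, 0, 1, 1, 0, 0],
    "dx_code_B": [1, 0, 1, 0, 1, 0, 1, 0] + [1, 1, 1, 1, 0, 0, 0, 0],
}
unstructured = {
    "note_term_X": [1, 1, 1, 1, 0, 0, 0, 0] + [1, 0, 1, 0, 1, 0, 1, 0],
    "note_term_Y": [1, 0, 1, 0, 1, 0, 1, 0] + [1, 1, 1, 1, 0, 0, 0, 0],
}

ranked = rank_pairs(structured, unstructured, labels)
for score, s_name, u_name in ranked:
    print(f"{s_name} x {u_name}: {score:.3f}")
```

In this toy data, `dx_code_A` and `note_term_X` always co-occur among cases but are unrelated among non-cases, so that pair ranks first. `dx_code_B` and `note_term_Y` are strongly associated in both cohorts, so their gap is zero: a pair is interesting here only when its interaction differs between cohorts, not merely when the two features are correlated.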
