Development and external validation of deep learning clinical prediction models using variable-length time series data

Fereshteh S Bashiri; Kyle A Carey; Jennie Martin; Jay L Koyner; Dana P Edelson; Emily R Gilbert; Anoop Mayampurath; Majid Afshar; Matthew M Churpek

doi:10.1093/jamia/ocae088

J Am Med Inform Assoc. 2024 Apr 29:ocae088. doi: 10.1093/jamia/ocae088. Online ahead of print.

Authors

Fereshteh S Bashiri¹, Kyle A Carey², Jennie Martin¹, Jay L Koyner², Dana P Edelson², Emily R Gilbert³, Anoop Mayampurath¹,⁴, Majid Afshar¹,⁴, Matthew M Churpek¹,⁴

Affiliations

¹ Department of Medicine, University of Wisconsin-Madison, Madison, WI 53792, United States.
² Department of Medicine, University of Chicago, Chicago, IL 60637, United States.
³ Department of Medicine, Loyola University, Chicago, IL 60153, United States.
⁴ Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI 53726, United States.

PMID: 38679906
DOI: 10.1093/jamia/ocae088

Abstract

Objectives: To compare and externally validate popular deep learning model architectures and data transformation methods for variable-length time series data in 3 clinical tasks (clinical deterioration, severe acute kidney injury [AKI], and suspected infection).

Materials and methods: This multicenter retrospective study included admissions at 2 medical centers that spanned 2007-2022. Distinct datasets were created for each clinical task, with 1 site used for training and the other for testing. Three feature engineering methods (normalization, standardization, and piece-wise linear encoding with decision trees [PLE-DTs]) and 3 architectures (long short-term memory/gated recurrent unit [LSTM/GRU], temporal convolutional network, and time-distributed wrapper with convolutional neural network [TDW-CNN]) were compared in each clinical task. Model discrimination was evaluated using the area under the precision-recall curve (AUPRC) and the area under the receiver operating characteristic curve (AUROC).
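As an illustration of the piece-wise linear encoding step, the following is a minimal sketch rather than the authors' implementation. It assumes the bin edges have already been taken from the split thresholds of a decision tree fitted on the training site; the function name `ple_encode`, the example heart-rate edges, and the choice to clip values outside the outermost edges to 0 or 1 are all assumptions made for this example.

```python
def ple_encode(x, edges):
    """Piece-wise linear encoding of a scalar x.

    `edges` is a sorted list of bin boundaries (assumed here to come from
    decision-tree split thresholds). Each consecutive pair of edges defines
    one bin, and the output has one component per bin:
      1.0 if x lies at or above the bin's upper edge,
      0.0 if x lies at or below the bin's lower edge,
      otherwise the fractional position of x inside the bin.
    """
    if len(edges) < 2:
        raise ValueError("need at least two bin edges")
    encoded = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        if x >= hi:
            encoded.append(1.0)
        elif x <= lo:
            encoded.append(0.0)
        else:
            encoded.append((x - lo) / (hi - lo))
    return encoded


# Hypothetical bin edges for heart rate (beats per minute).
hr_edges = [60, 80, 100, 140]
print(ple_encode(90, hr_edges))   # [1.0, 0.5, 0.0]
print(ple_encode(50, hr_edges))   # [0.0, 0.0, 0.0]
print(ple_encode(200, hr_edges))  # [1.0, 1.0, 1.0]
```

Unlike normalization or standardization, which map each measurement to a single rescaled number, this encoding maps it to a vector whose length equals the number of bins, so the model sees where a value falls relative to outcome-informed thresholds.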

Results: The study comprised 373 825 admissions for training and 256 128 admissions for testing. LSTM/GRU models tied with TDW-CNN models, both obtaining the highest mean AUPRC in 2 tasks, and LSTM/GRU had the highest mean AUROC across all tasks (deterioration: 0.81, AKI: 0.92, infection: 0.87). PLE-DT with LSTM/GRU achieved the highest AUPRC in all tasks.
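For readers less familiar with the two discrimination metrics reported here, the sketch below shows one common way to compute each from binary labels and predicted scores. The abstract does not state which estimator the authors used; average precision is assumed here as the AUPRC estimate, the pairwise AUROC is written for clarity rather than speed, and the function names and toy data are hypothetical.

```python
def auroc(y_true, y_score):
    """AUROC as the probability that a random positive case scores higher
    than a random negative case (ties count as one half)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(y_true, y_score):
    """Step-wise AUPRC estimate: the mean of the precision values observed
    at the rank of each positive case. Assumes no tied scores."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("need at least one positive case")
    ranked = sorted(zip(y_score, y_true), key=lambda t: t[0], reverse=True)
    true_pos = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_pos += 1
            total += true_pos / rank
    return total / n_pos


# Toy example: 2 events among 4 hypothetical admissions.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(labels, scores))                        # 0.75
print(round(average_precision(labels, scores), 4))  # 0.8333
```

Because precision depends on the event rate while AUROC does not, AUPRC tends to separate models more clearly when the outcome is rare.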

Discussion: When externally validated in 3 clinical tasks, the LSTM/GRU model architecture with PLE-DT transformed data demonstrated the highest AUPRC in all tasks. Multiple models achieved similar performance when evaluated using AUROC.

Conclusion: The LSTM architecture performs as well as or better than some newer architectures, and PLE-DT may enhance the AUPRC in variable-length time series data for predicting clinical outcomes during external validation.

Keywords: AI in medicine; deep learning; variable-length time series.
