External validation of prognostic models for critically ill patients required substantial sample sizes

N Peek; D G T Arts; R J Bosman; P H J van der Voort; N F de Keizer

doi:10.1016/j.jclinepi.2006.08.011

J Clin Epidemiol. 2007 May;60(5):491-501. doi: 10.1016/j.jclinepi.2006.08.011. Epub 2007 Feb 5.

Authors

N Peek¹, D G T Arts, R J Bosman, P H J van der Voort, N F de Keizer

Affiliation

¹ Department of Medical Informatics, Academic Medical Center, Universiteit van Amsterdam, Amsterdam, the Netherlands. n.b.peek@amc.uva.nl

PMID: 17419960
DOI: 10.1016/j.jclinepi.2006.08.011

Abstract

Objective: To investigate the behavior of predictive performance measures that are commonly used in external validation of prognostic models for outcome at intensive care units (ICUs).

Study design and setting: Four prognostic models (the Simplified Acute Physiology Score II, the Acute Physiology and Chronic Health Evaluation II, and the Mortality Probability Models II) were evaluated in the Dutch National Intensive Care Evaluation registry database. For each model, discrimination (area under the receiver operating characteristic curve, AUC), accuracy (Brier score), and two calibration measures were assessed on data from 41,239 ICU admissions. This validation procedure was repeated with smaller subsamples randomly drawn from the database, and the results were compared with those obtained on the entire data set.
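The subsampling procedure described above can be illustrated with a minimal sketch. It is not the authors' code: the data are synthetic, and the sample sizes, the number of repeats, and the helper names (`auc`, `brier_score`, `subsample_spread`) are assumptions chosen for illustration. It computes the AUC and the Brier score on repeated random subsamples and reports how much each measure varies at two subsample sizes.

```python
import random
import statistics


def auc(probs, outcomes):
    """AUC via the Mann-Whitney rank statistic; tied predictions get average ranks."""
    order = sorted(range(len(probs)), key=lambda i: probs[i])
    ranks = [0.0] * len(probs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and probs[order[j + 1]] == probs[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(outcomes)
    n_neg = len(outcomes) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, outcomes) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and observed outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def subsample_spread(probs, outcomes, size, repeats, rng):
    """Standard deviation of AUC and Brier score over random subsamples of a given size."""
    aucs, briers = [], []
    for _ in range(repeats):
        pick = rng.sample(range(len(probs)), size)
        p = [probs[i] for i in pick]
        y = [outcomes[i] for i in pick]
        if 0 < sum(y) < size:  # AUC is undefined without both outcome classes
            aucs.append(auc(p, y))
        briers.append(brier_score(p, y))
    return statistics.stdev(aucs), statistics.stdev(briers)


# Synthetic, perfectly calibrated predictions (assumption, not registry data).
rng = random.Random(0)
probs = [rng.betavariate(1, 4) for _ in range(5000)]
outcomes = [1 if rng.random() < p else 0 for p in probs]

spread_small = subsample_spread(probs, outcomes, 250, 40, rng)
spread_large = subsample_spread(probs, outcomes, 2500, 40, rng)
print("SD of (AUC, Brier) with n=250: ", spread_small)
print("SD of (AUC, Brier) with n=2500:", spread_large)
```

In this sketch both measures vary much more across the small subsamples than across the large ones, which is the kind of behavior the study set out to quantify on real registry data.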

Results: Differences in performance between the models were small. The AUC and Brier score showed large variation with small samples. Standard errors of AUC values were accurate, but the power to detect differences in performance was low. Calibration tests were extremely sensitive to sample size. Direct comparison of performance, without statistical analysis, was unreliable with either the AUC or the Brier score.
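The sensitivity of calibration tests to sample size can be illustrated with a short sketch. The abstract does not name the two calibration measures, so the Hosmer-Lemeshow statistic is used here purely as an assumed, commonly used example. The data are synthetic, with a small fixed miscalibration (a hypothetical shift of 0.2 on the logit scale), and the function name `hosmer_lemeshow` is illustrative.

```python
import math
import random


def hosmer_lemeshow(probs, outcomes, groups=10):
    """Hosmer-Lemeshow statistic over equal-size groups ordered by predicted risk."""
    pairs = sorted(zip(probs, outcomes))
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        size = len(chunk)
        expected = sum(p for p, _ in chunk)   # expected number of deaths
        observed = sum(y for _, y in chunk)   # observed number of deaths
        mean_p = expected / size
        stat += (observed - expected) ** 2 / (size * mean_p * (1 - mean_p))
    return stat


def shifted_risk(p, shift):
    """True risk when the model's logit is off by a constant shift (assumed form)."""
    return 1 / (1 + math.exp(-(math.log(p / (1 - p)) + shift)))


# Synthetic predictions with the same mild miscalibration at every sample size.
rng = random.Random(1)
probs = [rng.uniform(0.05, 0.6) for _ in range(20000)]
outcomes = [1 if rng.random() < shifted_risk(p, 0.2) else 0 for p in probs]

stat_small = hosmer_lemeshow(probs[:500], outcomes[:500])
stat_large = hosmer_lemeshow(probs, outcomes)
print("statistic with n=500:  ", round(stat_small, 1))
print("statistic with n=20000:", round(stat_large, 1))
```

The degree of miscalibration is identical in both runs, yet the statistic grows with the number of admissions, so a test based on it tends to reject at large sample sizes and to pass at small ones.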

Conclusion: Substantial sample sizes are required for performance assessment and model comparison in external validation. Calibration statistics and significance tests should not be used in these settings. Instead, a simple customization method is recommended to repair lack-of-fit problems.
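The abstract does not specify which customization method is meant. One simple and widely used option, shown here only as an assumed example, is logistic recalibration: refit an intercept and a slope on the logit of the original prediction, leaving the model itself unchanged. The sketch below uses synthetic data with a known miscalibration (intercept -0.5, slope 0.8, both hypothetical), and the function name `fit_recalibration` is illustrative.

```python
import math
import random


def logit(p):
    return math.log(p / (1 - p))


def expit(x):
    return 1 / (1 + math.exp(-x))


def fit_recalibration(probs, outcomes, iterations=25):
    """Fit outcome ~ a + b * logit(p) by Newton-Raphson; returns (a, b)."""
    xs = [logit(p) for p in probs]
    a, b = 0.0, 1.0  # start from "no recalibration needed"
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, outcomes):
            mu = expit(a + b * x)
            w = mu * (1 - mu)
            g0 += y - mu          # score for the intercept
            g1 += (y - mu) * x    # score for the slope
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det   # Newton step: inverse information times score
        b += (h00 * g1 - h01 * g0) / det
    return a, b


# Synthetic miscalibrated model: true logit = -0.5 + 0.8 * logit(prediction).
rng = random.Random(2)
probs = [rng.uniform(0.02, 0.9) for _ in range(5000)]
outcomes = [1 if rng.random() < expit(-0.5 + 0.8 * logit(p)) else 0 for p in probs]

a, b = fit_recalibration(probs, outcomes)
recalibrated = [expit(a + b * logit(p)) for p in probs]
print("fitted intercept and slope:", round(a, 2), round(b, 2))
print("mean predicted before:", round(sum(probs) / len(probs), 3))
print("mean predicted after: ", round(sum(recalibrated) / len(recalibrated), 3))
print("observed mortality:   ", round(sum(outcomes) / len(outcomes), 3))
```

Because only two parameters are estimated, this kind of adjustment needs far less local data than refitting a full model, and it leaves the ranking of patients, and therefore the AUC, unchanged whenever the fitted slope is positive.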

Publication types

Research Support, Non-U.S. Gov't
Validation Study

MeSH terms

Aged
Calibration
Critical Illness / mortality*
Epidemiologic Methods
Female
Humans
Intensive Care Units*
Male
Middle Aged
Netherlands / epidemiology
Outcome Assessment, Health Care / methods*
Prognosis