Power and reproducibility in the external validation of brain-phenotype predictions

Matthew Rosenblatt; Link Tejavibulya; Chris C Camp; Rongtao Jiang; Margaret L Westwater; Stephanie Noble; Dustin Scheinost

doi:10.1101/2023.10.25.563971

bioRxiv [Preprint]. 2023 Oct 30:2023.10.25.563971. doi: 10.1101/2023.10.25.563971.

Authors

Matthew Rosenblatt¹, Link Tejavibulya², Chris C Camp², Rongtao Jiang³, Margaret L Westwater³, Stephanie Noble³,⁴,⁵, Dustin Scheinost¹,²,³,⁶,⁷

Affiliations

¹ Department of Biomedical Engineering, Yale University, New Haven, CT.
² Interdepartmental Neuroscience Program, Yale University, New Haven, CT.
³ Department of Radiology & Biomedical Imaging, Yale School of Medicine, New Haven, CT.
⁴ Department of Bioengineering, Northeastern University, Boston, MA.
⁵ Department of Psychology, Northeastern University, Boston, MA.
⁶ Child Study Center, Yale School of Medicine, New Haven, CT.
⁷ Department of Statistics & Data Science, Yale University, New Haven, CT.

Abstract

Identifying reproducible and generalizable brain-phenotype associations is a central goal of neuroimaging. Consistent with this goal, prediction frameworks evaluate brain-phenotype models in unseen data. Most prediction studies train and evaluate a model in the same dataset. However, external validation, or the evaluation of a model in an external dataset, provides a better assessment of robustness and generalizability. Despite the promise of external validation and calls for its usage, the statistical power of such studies has yet to be investigated. In this work, we ran over 60 million simulations across several datasets, phenotypes, and sample sizes to better understand how the sizes of the training and external datasets affect statistical power. We found that prior external validation studies used sample sizes prone to low power, which may lead to false negatives and effect size inflation. Furthermore, increases in the external sample size led to increased simulated power directly following theoretical power curves, whereas changes in the training dataset size offset the simulated power curves. Finally, we compared the performance of a model within a dataset to the external performance. The within-dataset performance was typically within r=0.2 of the cross-dataset performance, which could help decide how to power future external validation studies. Overall, our results illustrate the importance of considering the sample sizes of both the training and external datasets when performing external validation.
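As a rough illustration of the theoretical power curves mentioned in the abstract, the sketch below approximates the power to detect a correlation between predicted and observed phenotype in an external sample. It is not the authors' simulation code. It assumes a one-sided test at alpha = 0.05 and the standard Fisher z approximation; the function names `correlation_power` and `required_n` are hypothetical and introduced only for this example.

```python
from math import atanh, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def correlation_power(r, n, alpha=0.05):
    """Approximate power to detect a true correlation r with n samples.

    Assumptions (not taken from the source): a one-sided test of
    H0: rho <= 0, and the Fisher z approximation, under which
    atanh(r_hat) is roughly normal with standard error 1 / sqrt(n - 3).
    """
    if n <= 3:
        raise ValueError("n must be greater than 3 for the Fisher z approximation")
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha)   # critical value under H0
    shift = atanh(r) * sqrt(n - 3)            # expected test statistic under H1
    return 1 - _STD_NORMAL.cdf(z_crit - shift)


def required_n(r, power=0.8, alpha=0.05):
    """Smallest external sample size reaching the target power (linear search)."""
    n = 4
    while correlation_power(r, n, alpha) < power:
        n += 1
    return n


if __name__ == "__main__":
    # Print a small power table for a few hypothetical true effect sizes.
    for r in (0.1, 0.2, 0.3, 0.4):
        print(f"r={r:.1f}  n for 80% power: {required_n(r)}")
```

Because the abstract reports that within-dataset performance was typically within r=0.2 of cross-dataset performance, one cautious way to use such a function is to evaluate `required_n` over a range of plausible external effect sizes around the within-dataset estimate rather than at a single value.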

Publication types

Preprint
