Assessing the effect of data integration on predictive ability of cancer survival models

Health Informatics J. 2020 Mar;26(1):8-20. doi: 10.1177/1460458218824692. Epub 2019 Jan 23.

Abstract

Cancer is the second leading cause of death in the United States. To improve cancer prognosis and survival rates, a better understanding of the multi-level contributory factors associated with cancer survival is needed. However, prior research on cancer survival has focused primarily on individual-level factors because integrated datasets have been scarce. In this study, we sought to examine how data integration affects the performance of cancer survival prediction models. We linked data from four different sources and evaluated the performance of Cox proportional hazards models for breast, lung, and colorectal cancers under three common data integration scenarios. We showed that adding contextual-level predictors to survival models by linking multiple datasets improved model fit and performance. We also showed that different representations of the same variable or concept have differential impacts on model performance. When building statistical models for cancer outcomes, it is important to consider cross-level predictor interactions.
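
The comparison described in the abstract (a survival model with individual-level predictors only versus one that also includes linked contextual-level predictors) can be illustrated with a small sketch. The abstract does not state which performance metric was used; the example below assumes Harrell's concordance index, a common measure of discrimination for Cox-type models. The function name, the toy follow-up times, the event indicators, and both sets of risk scores are hypothetical and are not data or results from the study.

```python
from itertools import combinations


def concordance_index(times, events, risks):
    """Harrell's C: share of comparable pairs that the risk score ranks correctly.

    A pair is comparable when the subject with the shorter follow-up time
    had an observed event. Ties in risk score count as half concordant.
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # tied follow-up times are skipped in this sketch
        first, second = (i, j) if times[i] < times[j] else (j, i)
        if not events[first]:
            continue  # earlier subject was censored, so the pair is not comparable
        comparable += 1
        if risks[first] > risks[second]:
            concordant += 1.0
        elif risks[first] == risks[second]:
            concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy cohort: follow-up time and event indicator (1 = death, 0 = censored).
times = [2, 4, 5, 7, 9, 12]
events = [1, 1, 0, 1, 1, 0]

# Hypothetical risk scores from two fitted models (higher = higher predicted hazard).
risk_individual = [0.9, 0.4, 0.6, 0.7, 0.2, 0.1]  # individual-level predictors only
risk_integrated = [0.9, 0.8, 0.6, 0.3, 0.5, 0.1]  # plus contextual-level predictors

c_individual = concordance_index(times, events, risk_individual)
c_integrated = concordance_index(times, events, risk_integrated)
print(f"C-index, individual-level only: {c_individual:.3f}")
print(f"C-index, with contextual data:  {c_integrated:.3f}")
```

In practice the risk scores would come from the linear predictors of fitted Cox models, and a difference in concordance would typically be reported alongside a measure of model fit such as a likelihood-based criterion.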

Keywords: cancer survival; data heterogeneities; data integration; interactions; model performance; multi-level data analysis.

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Breast Neoplasms*
  • Female
  • Humans
  • Male
  • Medicare / statistics & numerical data
  • Models, Statistical*
  • Neoplasms*
  • Prognosis
  • Survival Rate
  • United States