Gaussian-based routines to impute categorical variables in health surveys

Stat Med. 2011 Dec 20;30(29):3447-60. doi: 10.1002/sim.4355. Epub 2011 Oct 4.

Abstract

The multivariate normal (MVN) distribution is arguably the most popular parametric model used in imputation and is available in most software packages (e.g., SAS PROC MI, R package norm). When it is applied to categorical variables as an approximation, practitioners often apply simple rounding techniques to ordinal variables and, for nominal variables, create a distinct 'missing' category and/or omit the variable from the imputation phase. All of these practices can lead to biased and/or uninterpretable inferences. In this work, we develop a new rounding methodology, calibrated to preserve observed distributions, for multiply imputing missing categorical covariates. The main appeal of this method is its flexibility: it can be used with any 'working' imputation software, particularly software based on the MVN, allowing practitioners to obtain usable imputations with small biases. A simulation study demonstrates the clear advantage of the proposed method in rounding ordinal variables and, in some scenarios, its plausibility in imputing nominal variables. We illustrate our methods on the widely used National Survey of Children with Special Health Care Needs, where incomplete values on race posed a valid threat to inferences pertaining to disparities.
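The abstract does not spell out the calibration algorithm, so the following is only a minimal sketch of the general idea of rounding "calibrated to preserve observed distributions" for an ordinal variable. It assumes the continuous imputations have already been drawn by some MVN-based routine (simulated here with random normal draws), and it places the rounding cutoffs at the quantiles of those draws that match the observed category proportions. The function name `calibrated_round` and the quantile-matching rule are illustrative assumptions, not the authors' procedure.

```python
import numpy as np


def calibrated_round(imputed, observed):
    """Round continuous imputations to ordinal levels.

    Assumption (not from the paper): cutoffs are the quantiles of the
    continuous imputed values at the observed cumulative proportions,
    so the rounded values reproduce the observed category distribution.
    """
    imputed = np.asarray(imputed, dtype=float)
    observed = np.asarray(observed)

    # Observed ordinal levels (sorted) and their cumulative proportions.
    levels, counts = np.unique(observed, return_counts=True)
    cum_props = np.cumsum(counts) / counts.sum()

    # One cutoff between each pair of adjacent levels.
    cutoffs = np.quantile(imputed, cum_props[:-1])

    # Values at or below the first cutoff map to the lowest level, etc.
    idx = np.searchsorted(cutoffs, imputed, side="left")
    return levels[idx]


# Hypothetical data: observed ordinal responses and continuous draws
# standing in for the output of an MVN-based imputation routine.
rng = np.random.default_rng(0)
observed = np.array([1, 1, 2, 2, 2, 2, 2, 3, 3, 3])
imputed = rng.normal(loc=2.0, scale=1.0, size=10_000)

rounded = calibrated_round(imputed, observed)
```

Under this rule the rounded imputations carry roughly the same category shares as the observed data, whereas rounding to the nearest integer would make those shares depend on the shape of the fitted normal distribution.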

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Adolescent
  • Bias
  • Child
  • Child, Preschool
  • Computer Simulation
  • Female
  • Health Care Surveys / statistics & numerical data
  • Health Status Disparities
  • Health Surveys / statistics & numerical data*
  • Healthcare Disparities / statistics & numerical data
  • Humans
  • Infant
  • Male
  • Multivariate Analysis*
  • Needs Assessment / statistics & numerical data*
  • Normal Distribution*
  • Racial Groups / statistics & numerical data
  • Software / statistics & numerical data