Comparative methods for handling missing data in large databases

Authors: Henry Antonia J; Hevelone Nathanael D; Lipsitz Stuart; Nguyen Louis L*
Source: Journal of Vascular Surgery, 2013, 58(5): 1353-+.
DOI: 10.1016/j.jvs.2013.05.008

Abstract

Objective: Analysis of complex survey databases is an important tool for health services researchers. Missing data elements are challenging because the reasons for "missingness" are multifactorial, especially for categorical variables such as race. We simulated missing data for race and analyzed the bias from five methods used in predicting major amputation in patients with critical limb ischemia (CLI).

Methods: Patient discharges with fully observed data containing lower extremity revascularization or major amputation and CLI were selected from the 2003 to 2007 Nationwide Inpatient Sample, a complex survey database (weighted n = 684,057). Considering several random missing data schemes, we compared five missing data methods: complete case analysis, replacement with observed frequencies, missing indicator variable, multiple imputation, and reweighted estimating equations. We created 100 simulated data sets, with 5%, 15%, or 30% of subjects' race drawn to be missing from the full data set. Bias was estimated by comparing the estimated regression coefficients averaged over 100 simulated data sets (beta(miss)) from each method vs estimates from the fully observed data set (beta(full)), with relative bias calculated as ((beta(full) - beta(miss)) / beta(full)) x 100%.

Results: Our results demonstrate that reweighted estimating equations produce the least biased coefficients and the missing indicator variable produces the most biased. Complete case analysis, replacement with observed frequencies, and multiple imputation resulted in moderate bias. Sensitivity analysis demonstrated that the optimal method choice depends on the quantity and type of missing data encountered.

Conclusions: Missing data are an important analytic topic in research with large databases. The commonly used missing indicator variable method introduces severe bias and should be used with caution. We present empiric evidence to guide method selection for handling missing data.
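
The bias measure described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' code. The function names `relative_bias` and `mask_completely_at_random` are hypothetical, and the masking step assumes values go missing completely at random, which is only one of the "several random missing data schemes" the abstract mentions without detail. The relative bias is read here as the difference beta(full) minus the averaged beta(miss), divided by beta(full), times 100%.

```python
import random
from statistics import mean


def relative_bias(beta_full, beta_miss_estimates):
    """Relative bias (%) of a method's coefficient.

    beta_full: coefficient estimated from the fully observed data set.
    beta_miss_estimates: coefficients from each simulated data set;
    their average plays the role of beta(miss).
    """
    beta_miss = mean(beta_miss_estimates)
    return (beta_full - beta_miss) / beta_full * 100.0


def mask_completely_at_random(values, fraction, rng):
    """Return a copy of `values` with `fraction` of entries set to None.

    Assumption: missingness is completely at random (hypothetical scheme).
    """
    n_missing = round(len(values) * fraction)
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(values)]


if __name__ == "__main__":
    rng = random.Random(2013)
    # Made-up race categories, for illustration only.
    race = [rng.choice(["A", "B", "C"]) for _ in range(1000)]
    for fraction in (0.05, 0.15, 0.30):
        masked = mask_completely_at_random(race, fraction, rng)
        print(fraction, sum(v is None for v in masked))
    # Made-up coefficients, for illustration only.
    print(relative_bias(0.50, [0.40, 0.50, 0.45]))
```

In a full simulation, each masked data set would be passed to one of the five missing data methods, the model refitted, and the resulting coefficients collected into `beta_miss_estimates` before calling `relative_bias`.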

  • Publication date: 2013-11