Imputation of incomplete covariates in longitudinal data: Can Bayesian non-parametric methods prevent model misspecification?


Context: This work is motivated by a study in Type II diabetes patients and their progression to diabetic retinopathy. Specifically, physicians are interested in identifying risk factors, longitudinal and baseline, for progression. An important complication for the analysis is that several of the risk factors are not available for all patients or not at all time-points. Default approaches to handle this issue, like the various flavours of multiple imputation, assume linear and additive relationships between the risk factors and outcome, and between the covariates themselves. However, a preliminary analysis of our data using the complete cases showed that these assumptions of linearity and additivity are seriously violated.
Objective: In previous work, we have shown that misspecifying these relationships during imputation can have a large impact on the parameter estimates of interest in the analysis model. To overcome this issue, we propose here a unified non-parametric Bayesian framework. We investigate whether, and how, less restrictive imputation models can be used within this framework in an automated way to reduce bias due to misspecification of the association structure and/or the residual distribution.
Methods: To relax the assumption of linear, additive associations, we implement penalized B-splines (P-splines) and pairwise interactions on which shrinkage priors are imposed. To increase flexibility in the residual distribution of continuous covariates, we apply Bayesian non-parametric models that use a mixture of normal distributions with a truncated Dirichlet process prior.
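The two building blocks named in the Methods can be illustrated with a minimal sketch. It is not the implementation used in this work: the function names, the truncation level `K`, the concentration parameter `alpha`, the second-order difference penalty, and the distributions of the component means and standard deviations are all assumptions chosen for illustration.

```python
import numpy as np

def difference_penalty(n_basis, order=2):
    """Penalty matrix D'D built from finite differences of the spline coefficients.

    Assumption: a second-order difference penalty, a common P-spline choice.
    In a Bayesian setting it is often used as a prior precision matrix
    (up to a variance parameter) for the coefficients.
    """
    D = np.diff(np.eye(n_basis), n=order, axis=0)
    return D.T @ D

def stick_breaking_weights(alpha, K, rng):
    """Mixture weights from a Dirichlet process truncated at K components."""
    v = rng.beta(1.0, alpha, size=K)
    v[-1] = 1.0  # truncation: the last stick takes all remaining mass
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining

def sample_residuals(n, weights, means, sds, rng):
    """Draw n residuals from a finite mixture of normal distributions."""
    z = rng.choice(len(weights), size=n, p=weights)  # component labels
    return rng.normal(means[z], sds[z])

rng = np.random.default_rng(1)
K = 10                                   # hypothetical truncation level
P = difference_penalty(n_basis=8)        # hypothetical number of basis functions
w = stick_breaking_weights(alpha=1.0, K=K, rng=rng)
means = rng.normal(0.0, 2.0, size=K)     # hypothetical component means
sds = rng.uniform(0.5, 1.5, size=K)      # hypothetical component SDs
residuals = sample_residuals(500, w, means, sds, rng)
```

Setting the last stick-breaking fraction to 1 makes the truncated weights sum to one, and the difference penalty leaves linear trends in the coefficients unpenalized, so the spline can shrink towards a linear effect.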
Results: Simulations showed that P-splines can substantially reduce relative bias (from >30% to 8%) and improve coverage (from 0.33 to 0.87) while maintaining CI width. However, P-splines require a sufficient number of observed cases across the entire range of values of any two potentially non-linearly related variables. The non-parametric residual distribution reduced bias when the deviation from normality was severe (from 25% to 6%), but it produced wider CIs than the parametric model.
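For readers unfamiliar with the performance measures reported above, the sketch below shows how relative bias, coverage, and mean CI width are commonly computed across simulation replicates. The function names and the toy numbers are hypothetical and are not taken from the simulation study.

```python
import numpy as np

def relative_bias(estimates, truth):
    """Mean deviation of the estimates from the true value, relative to the truth."""
    return (np.mean(estimates) - truth) / truth

def coverage(lower, upper, truth):
    """Proportion of replicates whose interval contains the true value."""
    return np.mean((lower <= truth) & (truth <= upper))

def mean_ci_width(lower, upper):
    """Average interval width across replicates."""
    return np.mean(upper - lower)

# Toy example: four hypothetical replicates of a parameter whose true value is 1.0
truth = 1.0
estimates = np.array([1.1, 0.9, 1.2, 1.0])
lower = estimates - 0.15   # hypothetical interval limits
upper = estimates + 0.15

rb = relative_bias(estimates, truth)
cov = coverage(lower, upper, truth)
width = mean_ci_width(lower, upper)
```

A method is preferable when its relative bias is close to zero and its coverage is close to the nominal level without a large increase in interval width.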
Conclusion: The proposed flexible approaches outperformed the parametric models only in certain settings. Criteria that compare the distributions of observed and incomplete pairs of variables, and criteria that help detect misspecified models, e.g., posterior predictive checks, are needed to guide the choice of models in an automated manner and to alert the user to potential problems.

Leuven, Belgium