Center for Quantitative Methods
Erasmus Medical Center
31 October 2018

“Why can’t I just use MICE?”

Well, you can use MICE, …

… and in standard settings it usually works well, …


… but there are settings in which it doesn’t!

Then (naive) use of MICE leads to

  • violation of assumptions
  • invalid imputations
  • biased results

Example: Quadratic effect

Consider an analysis model \(\quad y = \beta_0 + \beta_1 x + \boldsymbol{\beta_2 x^2} + \ldots\)

By default, MICE assumes a linear relation between \(x\) and \(y\) when imputing \(x\): \(\quad x = \theta_{10} + \theta_{11} y + \ldots\)

Consequence: naive imputation of \(x\) leads to severely biased results.
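The bias can be illustrated with a small simulation. This is only a sketch under assumptions that are not on the slides: the coefficients, the sample size, the 50% missingness rate, and a single stochastic regression imputation that stands in for the linear imputation model (it does not call the mice package).

```python
import random

random.seed(2018)
n = 4000
b0, b1, b2 = 1.0, 1.0, 1.0  # assumed "true" coefficients, chosen for illustration

x = [random.gauss(0, 1) for _ in range(n)]
y = [b0 + b1 * xi + b2 * xi ** 2 + random.gauss(0, 1) for xi in x]
missing = [random.random() < 0.5 for _ in range(n)]  # x missing completely at random


def ols(rows, target):
    """Least squares via the normal equations, solved by Gauss-Jordan elimination."""
    p = len(rows[0])
    a = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * t for r, t in zip(rows, target))] for i in range(p)]
    for i in range(p):
        pivot = a[i][i]
        a[i] = [v / pivot for v in a[i]]
        for k in range(p):
            if k != i:
                f = a[k][i]
                a[k] = [vk - f * vi for vk, vi in zip(a[k], a[i])]
    return [a[i][p] for i in range(p)]


# Linear imputation model x = theta0 + theta1 * y + error, fitted on observed cases
obs_rows = [[1.0, yi] for yi, m in zip(y, missing) if not m]
obs_x = [xi for xi, m in zip(x, missing) if not m]
theta0, theta1 = ols(obs_rows, obs_x)
resid = [xi - (theta0 + theta1 * r[1]) for xi, r in zip(obs_x, obs_rows)]
sd = (sum(e * e for e in resid) / (len(resid) - 2)) ** 0.5

# Replace each missing x by a draw from the fitted linear model
x_imp = [theta0 + theta1 * yi + random.gauss(0, sd) if m else xi
         for xi, yi, m in zip(x, y, missing)]


def quad_fit(xs, ys):
    """Fit the analysis model y = beta0 + beta1 * x + beta2 * x^2."""
    return ols([[1.0, xi, xi * xi] for xi in xs], ys)


beta_full = quad_fit(x, y)     # complete data, before values were deleted
beta_imp = quad_fit(x_imp, y)  # after linear imputation of x

print("beta2, complete data:", round(beta_full[2], 2))
print("beta2, after linear imputation:", round(beta_imp[2], 2))
```

In this setup the estimate of \(\beta_2\) from the imputed data is pulled clearly towards zero, because the imputed values carry almost none of the curvature that the analysis model looks for.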

Example: Interaction Effects

Another example: a non-linear relationship due to an interaction term \[y = \beta_0 + \beta_x x + \beta_z z + \boldsymbol{\beta_{xz} xz} + \ldots\]

Again: MICE assumes a linear relation between \(x\) and \(y\) in the imputation model \[x = \theta_{10} + \theta_{11} y + \theta_{12} z + \ldots\]
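
A short derivation suggests why this linear imputation model is misspecified. It is a sketch under assumptions that are not on the slides: normal errors, so that \(y \mid x, z \sim N(\beta_0 + \beta_x x + \beta_z z + \beta_{xz} xz,\ \sigma^2)\), and \(x \mid z \sim N(\alpha_0 + \alpha_1 z,\ \tau^2)\), where \(\alpha_0, \alpha_1, \tau^2, \sigma^2\) are symbols introduced here. Then

\[p(x \mid y, z) \propto \exp\left\{-\frac{\left(y - \beta_0 - \beta_z z - (\beta_x + \beta_{xz} z)\,x\right)^2}{2\sigma^2} - \frac{(x - \alpha_0 - \alpha_1 z)^2}{2\tau^2}\right\},\]

which is a normal distribution in \(x\) with

\[\mathrm{Var}(x \mid y, z) = \left[\frac{(\beta_x + \beta_{xz} z)^2}{\sigma^2} + \frac{1}{\tau^2}\right]^{-1}, \qquad \mathrm{E}(x \mid y, z) = \mathrm{Var}(x \mid y, z)\left[\frac{(\beta_x + \beta_{xz} z)(y - \beta_0 - \beta_z z)}{\sigma^2} + \frac{\alpha_0 + \alpha_1 z}{\tau^2}\right].\]

Unless \(\beta_{xz} = 0\), the conditional mean is not linear in \(y\) and \(z\), and the conditional variance changes with \(z\), so a model of the form \(\theta_{10} + \theta_{11} y + \theta_{12} z\) with constant residual variance cannot reproduce it.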

Again, the consequence is severely biased estimates.

Example: Longitudinal data

ID y x1 x2 x3 x4 time
5 NA
5 NA
5 NA
5 NA
6 NA NA
6 NA NA
6 NA NA
8 NA
8 NA
8 NA
18 NA
18 NA
18 NA
18 NA

Here, \(x_1, \ldots, x_4\) are baseline covariates, i.e., not measured repeatedly.

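Imputation in long format

  • treats each row as independent,
  • may cause bias,
  • and may produce inconsistent imputations: a baseline covariate can receive different imputed values in different rows of the same subject.
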
ID y x1 x2 x3 x4 time
5 boy
5 girl
5 girl
5 girl
6 girl high
6 girl mid
6 girl high
8 37.22
8 37.71
8 41.37
18 boy
18 boy
18 boy
18 boy
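
The inconsistency shown in the table can be reproduced in a few lines. This is a hedged sketch with simulated data: the data-generating model, the variable names, and the row-wise stochastic regression imputation are assumptions made for illustration, and the code does not use the mice package.

```python
import random

random.seed(1)
n_subj, n_rep = 100, 4

# Long format: one row per subject and time point; x1 is a baseline covariate
rows = []
for i in range(n_subj):
    x1 = random.gauss(0, 1)              # one true value per subject
    u = random.gauss(0, 0.5)             # subject-specific random intercept
    x1_missing = random.random() < 0.3   # missing for the whole subject
    for t in range(n_rep):
        y = 1.0 + 0.5 * x1 + u + 0.2 * t + random.gauss(0, 0.5)
        rows.append({"id": i, "time": t, "y": y,
                     "x1": None if x1_missing else x1})

# Row-wise imputation model x1 ~ y, fitted on rows with observed x1
obs = [r for r in rows if r["x1"] is not None]
mean_y = sum(r["y"] for r in obs) / len(obs)
mean_x = sum(r["x1"] for r in obs) / len(obs)
slope = (sum((r["y"] - mean_y) * (r["x1"] - mean_x) for r in obs)
         / sum((r["y"] - mean_y) ** 2 for r in obs))
intercept = mean_x - slope * mean_y
sd = (sum((r["x1"] - intercept - slope * r["y"]) ** 2 for r in obs)
      / (len(obs) - 2)) ** 0.5

# Each row gets its own draw, ignoring that rows belong to the same subject
imputed = {}
for r in rows:
    if r["x1"] is None:
        draw = intercept + slope * r["y"] + random.gauss(0, sd)
        imputed.setdefault(r["id"], []).append(draw)

inconsistent = [i for i, vals in imputed.items() if len(set(vals)) > 1]
print(len(inconsistent), "of", len(imputed),
      "subjects with missing x1 received differing imputed values")
```

Because every row is imputed separately, a subject whose baseline value is missing ends up with several different "baseline" values, one per repeated measurement.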

Estimates can be severely biased.

## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl =
## control$checkConv, : Model failed to converge with max|grad| = 0.00609238
## (tol = 0.002, component 1)

Sometimes imputation in wide format may be possible.

id y.1 y.3 y.5 y.7 y.9 time.1 time.3 time.5 time.7 time.9
5 NA NA
6 NA NA NA NA
8 NA NA NA NA
18 NA NA
\(\ddots\)


In wide format:

  • missing values in outcome and measurement times need to be imputed
    (to be used as predictors in imputation of covariates)
  • very inefficient for unbalanced data!
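
The extra missing values created by reshaping can be counted directly. The sketch below uses hypothetical measurement times and outcome values; only the subject IDs, the visit labels 1, 3, 5, 7, 9, the column naming `y.k` / `time.k`, and the number of visits per subject mirror the table above. Which visits each subject missed is an assumption.

```python
# Long-format records: (id, visit, time, y). All numeric values are made up.
long_rows = [
    (5, 1, 0.9, 10.2), (5, 3, 3.1, 11.0), (5, 5, 5.2, 11.9), (5, 7, 6.8, 12.5),
    (6, 1, 1.0, 9.5), (6, 5, 4.9, 10.1), (6, 9, 9.3, 10.8),
    (8, 1, 1.2, 37.2), (8, 3, 2.8, 37.7), (8, 7, 7.1, 41.4),
    (18, 3, 3.0, 8.8), (18, 5, 5.1, 9.4), (18, 7, 7.2, 9.9), (18, 9, 8.9, 10.3),
]
visits = [1, 3, 5, 7, 9]

# Reshape to wide format: one row per subject, columns y.k and time.k per visit
wide = {}
for sid, visit, t, y in long_rows:
    row = wide.setdefault(
        sid, {f"{name}.{k}": None for name in ("y", "time") for k in visits})
    row[f"y.{visit}"] = y
    row[f"time.{visit}"] = t

# Every visit a subject did not attend becomes two missing cells (y.k and time.k)
n_missing = {sid: sum(v is None for v in row.values())
             for sid, row in wide.items()}
total_cells = len(wide) * 2 * len(visits)

print(n_missing)  # {5: 2, 6: 4, 8: 4, 18: 2}
print(sum(n_missing.values()), "of", total_cells, "cells need to be imputed")
```

Even in this tiny example, 12 of 40 cells would have to be imputed only because of the unbalanced design, before any missing covariate has been considered.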

Better, but with very wide confidence intervals.