In this practical, a number of R packages are used. The packages used (with versions that were used to generate the solutions) are:
mice (version: 3.6.0)
JointAI (version: 0.6.0)
ggplot2 (version: 3.2.1)
reshape2 (version: 1.4.3)
ggpubr (version: 0.2.2)
For this practical, we will use the NHANES3 data, another subset of the data we have already seen in the lecture slides and the previous practicals. It contains only those cases that have observed wgt, and some columns that are not needed were excluded.
To load this dataset, you can use the command load(file.choose()). Here, file.choose() opens the explorer and allows you to navigate to the location of the file NHANES3_for_practicals.RData on your computer, and load() then reads the selected file. If you know the path to the file, you can also use load("<path>/NHANES3_for_practicals.RData").
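The following sketch illustrates how load() behaves, using a small made-up data frame (toy) and a temporary file instead of the actual NHANES3 file; both are assumptions for illustration only. It shows that load() restores objects under the names they were saved with and returns those names.

```r
# Hypothetical toy data, standing in for the real dataset
toy <- data.frame(id = 1:3, wgt = c(70.5, 82.1, 64.3))

# Save it to a temporary .RData file and remove it from the workspace
path <- file.path(tempdir(), "toy_example.RData")
save(toy, file = path)
rm(toy)

# load() restores the object under its original name and
# (invisibly) returns the names of all restored objects
loaded <- load(path)
loaded
str(toy)

# Interactive alternative (opens a file dialog):
# load(file.choose())
```

Checking the returned names is a quick way to see which objects a .RData file has added to the workspace.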
The focus of this practical is the imputation of data that has features that require special attention.
In the interest of time, we will focus on these features and abbreviate steps that are the same as in any imputation setting (e.g., getting to know the data or checking that imputed values are realistic). Nevertheless, these steps are of course required when analysing data in practice.
Our aim is to fit the following linear regression model for weight:
We expect that the effects of cholesterol and HDL may differ with age, and, hence, include interaction terms between age and chol and between age and HDL.
Additionally, we want to include the other variables in the dataset as auxiliary variables.
Use of the Just Another Variable (JAV) approach, in which the interaction terms are treated as ordinary variables and imputed directly, can in some settings reduce bias. Alternatively, we can use passive imputation, i.e., calculate the interaction terms from the imputed values in each iteration of the MICE algorithm. Furthermore, predictive mean matching tends to lead to less bias than imputation models based on the normal distribution.
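As a rough illustration of passive imputation, the sketch below uses the small nhanes data that ships with mice rather than NHANES3. The variable name agechl, the seed, and the small values of m and maxit are assumptions chosen only to keep the example short. The interaction is specified via a formula method ("~I(...)"), so that it is re-computed from the current imputed values in every iteration.

```r
library(mice)

# Toy data from the mice package (columns: age, bmi, hyp, chl)
dat <- nhanes
dat$agechl <- dat$age * dat$chl   # NA wherever chl is missing

# Setup run to obtain default methods and predictor matrix
ini  <- mice(dat, maxit = 0)
meth <- ini$method
pred <- ini$predictorMatrix

# Passive imputation: derive agechl from the current values of age and chl
meth["agechl"] <- "~I(age * chl)"
# Do not use the interaction term to impute its own component
pred["chl", "agechl"] <- 0

imp_passive <- mice(dat, method = meth, predictorMatrix = pred,
                    m = 2, maxit = 2, seed = 2020, printFlag = FALSE)

# In the completed data the interaction is consistent with age and chl
comp <- complete(imp_passive, 1)
all.equal(comp$agechl, comp$age * comp$chl)
```

In contrast to JAV, passive imputation keeps the derived variable consistent with its components in each completed dataset.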
Calculate the interaction terms in the incomplete data and perform the setup run of mice() without any iterations.
# calculate the interaction terms
NHANES3$agechol <- NHANES3$age * NHANES3$chol
NHANES3$ageHDL <- NHANES3$age * NHANES3$HDL
# setup run
imp0 <- mice(NHANES3, maxit = 0,
             defaultMethod = c('norm', 'logreg', 'polyreg', 'polr'))
imp0
## Class: mids
## Number of multiple imputations: 5
## Imputation methods:
## wgt gender bili age chol HDL hgt educ race SBP hypten
## "" "" "norm" "" "norm" "norm" "norm" "polr" "" "norm" "logreg"
## WC agechol ageHDL
## "norm" "norm" "norm"
## PredictorMatrix:
## wgt gender bili age chol HDL hgt educ race SBP hypten WC agechol ageHDL
## wgt 0 1 1 1 1 1 1 1 1 1 1 1 1 1
## gender 1 0 1 1 1 1 1 1 1 1 1 1 1 1
## bili 1 1 0 1 1 1 1 1 1 1 1 1 1 1
## age 1 1 1 0 1 1 1 1 1 1 1 1 1 1
## chol 1 1 1 1 0 1 1 1 1 1 1 1 1 1
## HDL 1 1 1 1 1 0 1 1 1 1 1 1 1 1
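The four entries of defaultMethod are applied, in order, to numeric variables, factors with two levels, unordered factors with more than two levels, and ordered factors with more than two levels. The sketch below illustrates this mapping on a small made-up data frame; the data and variable names (toy, num, bin, nom, ord) are hypothetical and not part of NHANES3.

```r
library(mice)

set.seed(1)
# Hypothetical toy data with one variable of each type
toy <- data.frame(
  num = rnorm(20),
  bin = factor(rep(c("no", "yes"), length.out = 20)),
  nom = factor(rep(c("a", "b", "c"), length.out = 20)),
  ord = factor(rep(c("low", "low", "mid", "high", "mid"), length.out = 20),
               levels = c("low", "mid", "high"), ordered = TRUE)
)
# Introduce one missing value per variable so that each gets a method
toy$num[1] <- NA
toy$bin[2] <- NA
toy$nom[3] <- NA
toy$ord[4] <- NA

# Setup run only: no imputation models are fitted with maxit = 0
ini_toy <- mice(toy, maxit = 0,
                defaultMethod = c("norm", "logreg", "polyreg", "polr"))
ini_toy$method
```

Variables without missing values get the method "", as can be seen for wgt, gender, age and race in the output above.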
Since the interaction terms are calculated from the original variables, these interaction terms should not be used to impute the original variables.
meth <- imp0$method
pred <- imp0$predictorMatrix
# change imputation for "bili" to pmm (to prevent negative values)
meth["bili"] <- 'pmm'
# changes in predictor matrix to prevent original variables being imputed based
# on the interaction terms
pred["chol", "agechol"] <- 0
pred["HDL", "ageHDL"] <- 0
meth
## wgt gender bili age chol HDL hgt educ race SBP hypten
## "" "" "pmm" "" "norm" "norm" "norm" "polr" "" "norm" "logreg"
## WC agechol ageHDL
## "norm" "norm" "norm"
pred
## wgt gender bili age chol HDL hgt educ race SBP hypten WC agechol ageHDL
## wgt 0 1 1 1 1 1 1 1 1 1 1 1 1 1
## gender 1 0 1 1 1 1 1 1 1 1 1 1 1 1
## bili 1 1 0 1 1 1 1 1 1 1 1 1 1 1
## age 1 1 1 0 1 1 1 1 1 1 1 1 1 1
## chol 1 1 1 1 0 1 1 1 1 1 1 1 0 1
## HDL 1 1 1 1 1 0 1 1 1 1 1 1 1 0
## hgt 1 1 1 1 1 1 0 1 1 1 1 1 1 1
## educ 1 1 1 1 1 1 1 0 1 1 1 1 1 1
## race 1 1 1 1 1 1 1 1 0 1 1 1 1 1
## SBP 1 1 1 1 1 1 1 1 1 0 1 1 1 1
## hypten 1 1 1 1 1 1 1 1 1 1 0 1 1 1
## WC 1 1 1 1 1 1 1 1 1 1 1 0 1 1
## agechol 1 1 1 1 1 1 1 1 1 1 1 1 0 1
## ageHDL 1 1 1 1 1 1 1 1 1 1 1 1 1 0
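When editing the predictor matrix, it helps to remember that each row belongs to the variable being imputed and each column to a potential predictor. The sketch below shows one way to check such changes programmatically; it uses the nhanes toy data from mice, and the names dat_chk and agechl are assumptions for illustration.

```r
library(mice)

# Toy data from the mice package, with an added interaction term
dat_chk <- nhanes
dat_chk$agechl <- dat_chk$age * dat_chk$chl

pred_chk <- mice(dat_chk, maxit = 0)$predictorMatrix
pred_chk["chl", "agechl"] <- 0

# Row-wise: which variables are used to impute chl?
chl_predictors <- names(which(pred_chk["chl", ] == 1))
chl_predictors

# Column-wise: which variables use agechl as a predictor?
agechl_users <- names(which(pred_chk[, "agechl"] == 1))
agechl_users
```

The same two queries can be applied to pred to confirm that agechol no longer appears among the predictors of chol, and ageHDL no longer among those of HDL.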
Run the imputation using the JAV approach and check the traceplot.
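A minimal sketch of what such a JAV run and its traceplot could look like is given below. It uses the small nhanes data shipped with mice rather than NHANES3, so the variable names (chl, agechl), the seed, and the number of iterations are assumptions for illustration only.

```r
library(mice)

# Toy data from the mice package, with an added interaction term
dat_jav <- nhanes
dat_jav$agechl <- dat_jav$age * dat_jav$chl

ini_jav  <- mice(dat_jav, maxit = 0)
meth_jav <- ini_jav$method            # agechl is imputed like any other variable
pred_jav <- ini_jav$predictorMatrix
pred_jav["chl", "agechl"] <- 0        # do not impute chl from its own interaction

imp_jav <- mice(dat_jav, method = meth_jav, predictorMatrix = pred_jav,
                m = 5, maxit = 10, seed = 2020, printFlag = FALSE)

# Traceplot: mean and standard deviation of the imputed values per iteration,
# one row of panels per imputed variable
tp <- plot(imp_jav, layout = c(2, 4))
print(tp)
```

Because JAV imputes the interaction term directly, the imputed values of agechl will generally not equal the product of age and the imputed chl.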