Preface

R packages

In this practical, a number of R packages are used. The packages used (with versions that were used to generate the solutions) are:

  • R version 3.6.1 (2019-07-05)
  • mice (version: 3.6.0)
  • (not essential) JointAI (version: 0.6.0)

Dataset

For this practical, we will use the NHANES2 dataset, a subset of the data we have seen in the lecture slides.

To load this dataset, you can use the command file.choose() which opens the explorer and allows you to navigate to the location of the file NHANES2_for_practicals.RData on your computer. If you know the path to the file, you can also use load("<path>/NHANES2_for_practicals.RData"). RStudio users can also just click on the file in the “Files” pane/tab to load it.

Preparing for imputation

Set-up run

Imputation needs to be tailored to the dataset at hand and, hence, using the function mice() well requires several arguments to be specified. To make the specification easier it is useful to do a dry-run which will create the default versions of everything that needs to be specified.

These default settings can then be adapted to our data.

Task

Do the set-up run of mice() with the NHANES2 data without any iterations (maxit = 0).

Solution

# Note: This command will not produce any output.
library(mice)
imp0 <- mice(NHANES2, maxit = 0)
## Warning: Number of logged events: 1

Imputation method

There are many imputation methods available in mice. You can find the list in the help page of the mice() function. We will focus here on the following ones:

name variable type description
pmm any Predictive mean matching
norm numeric Bayesian linear regression
logreg binary Logistic regression
polr ordered Proportional odds model
polyreg unordered Polytomous logistic regression

The default imputation methods that mice() selects can be specified in the argument defaultMethod.

If unspecified, mice will use

  • pmm for numerical columns,
  • logreg for factor columns with two categories,
  • polyreg for columns with unordered and
  • polr for columns with ordered factors with more than two categories.

In the NHANES2 data we have the following variables:

par(mar = c(2,3,2,1), mgp = c(2, 0.6, 0))
JointAI::plot_all(NHANES2, nclass = 30)

Task 1

When a normal imputation model seems to be appropriate for most of the continuous covariates, you may want to specify norm as the default method in the setup run. Let’s do that:

The order of the types of variable is: continuous, binary, factor, ordered factor.

Solution 1

imp0 <- mice(NHANES2, maxit = 0, 
             defaultMethod = c("norm", 'logreg', 'polyreg', 'polr'))
## Warning: Number of logged events: 1

Task 2

In the plot of the NHANES2 data shown above we can see that the variable creat has a skewed distribution, hence, using a normal imputation model may not work well.

  • Extract the default settings of meth from imp0.
  • Change the imputation method for creat so that this variable will be imputed using predictive mean matching.
  • Check that all specified imputation methods are correct. When no imputation method is specified ("") the variable will not be imputed.

Solution 2

meth <- imp0$meth
meth["creat"] <- "pmm"
meth
##       HDL      race      bili     smoke        DM    gender        WC      chol  HyperMed       alc 
##    "norm"        ""    "norm"    "polr"        ""        ""    "norm"    "norm"    "polr"    "polr" 
##       SBP       wgt    hypten    cohort     occup       age      educ      albu     creat  uricacid 
##    "norm"    "norm"  "logreg"        "" "polyreg"        ""    "polr"    "norm"     "pmm"    "norm" 
##       BMI   hypchol       hgt 
##    "norm"  "logreg"    "norm"

Predictor matrix

The predictor matrix specifies which variables are used in the linear predictors of each of the imputation models.

A value of 1 specifies that the variable given in the column name is used in the model to impute the variable given in the row name (and 0 specifies that this variable is not used in that model).

Task 1

Get the predictorMatrix from imp0. Notice that mice has already set some of the values to 0. Do you understand why?

Solution 1

pred <- imp0$predictorMatrix
pred
##          HDL race bili smoke DM gender WC chol HyperMed alc SBP wgt hypten cohort occup age educ
## HDL        0    1    1     1  1      1  1    1        1   1   1   1      1      0     1   1    1
## race       1    0    1     1  1      1  1    1        1   1   1   1      1      0     1   1    1
## bili       1    1    0     1  1      1  1    1        1   1   1   1      1      0     1   1    1
## smoke      1    1    1     0  1      1  1    1        1   1   1   1      1      0     1   1    1
## DM         1    1    1     1  0      1  1    1        1   1   1   1      1      0     1   1    1
## gender     1    1    1     1  1      0  1    1        1   1   1   1      1      0     1   1    1
## WC         1    1    1     1  1      1  0    1        1   1   1   1      1      0     1   1    1
## chol       1    1    1     1  1      1  1    0        1   1   1   1      1      0     1   1    1
## HyperMed   1    1    1     1  1      1  1    1        0   1   1   1      1      0     1   1    1
## alc        1    1    1     1  1      1  1    1        1   0   1   1      1      0     1   1    1
## SBP        1    1    1     1  1      1  1    1        1   1   0   1      1      0     1   1    1
## wgt        1    1    1     1  1      1  1    1        1   1   1   0      1      0     1   1    1
## hypten     1    1    1     1  1      1  1    1        1   1   1   1      0      0     1   1    1
## cohort     1    1    1     1  1      1  1    1        1   1   1   1      1      0     1   1    1
## occup      1    1    1     1  1      1  1    1        1   1   1   1      1      0     0   1    1
## age        1    1    1     1  1      1  1    1        1   1   1   1      1      0     1   0    1
## educ       1    1    1     1  1      1  1    1        1   1   1   1      1      0     1   1    0
## albu       1    1    1     1  1      1  1    1        1   1   1   1      1      0     1   1    1
## creat      1    1    1     1  1      1  1    1        1   1   1   1      1      0     1   1    1
## uricacid   1    1    1     1  1      1  1    1        1   1   1   1      1      0     1   1    1
## BMI        1    1    1     1  1      1  1    1        1   1   1   1      1      0     1   1    1
## hypchol    1    1    1     1  1      1  1    1        1   1   1   1      1      0     1   1    1
## hgt        1    1    1     1  1      1  1    1        1   1   1   1      1      0     1   1    1
##          albu creat uricacid BMI hypchol hgt
## HDL         1     1        1   1       1   1
## race        1     1        1   1       1   1
## bili        1     1        1   1       1   1
## smoke       1     1        1   1       1   1
## DM          1     1        1   1       1   1
## gender      1     1        1   1       1   1
## WC          1     1        1   1       1   1
## chol        1     1        1   1       1   1
## HyperMed    1     1        1   1       1   1
## alc         1     1        1   1       1   1
## SBP         1     1        1   1       1   1
## wgt         1     1        1   1       1   1
## hypten      1     1        1   1       1   1
## cohort      1     1        1   1       1   1
## occup       1     1        1   1       1   1
## age         1     1        1   1       1   1
## educ        1     1        1   1       1   1
## albu        0     1        1   1       1   1
## creat       1     0        1   1       1   1
## uricacid    1     1        0   1       1   1
## BMI         1     1        1   0       1   1
## hypchol     1     1        1   1       0   1
## hgt         1     1        1   1       1   0

The column corresponding to the variable cohort is set to 0 which means that this variable is not used in any of the imputation models. cohort has the same value for all observations, so it would not be useful as a covariate.

Task 2

Because BMI is calculated from height (hgt) and weight (wgt), and there are cases where only one of these two variables is missing, we want to impute hgt and wgt separately. BMI should be imputed using passive imputation.

To avoid multicollinearity (which may lead to problems during imputation), imputation models should not include all three variables as predictor variables. In this example, we will use BMI to impute the other variables.

Moreover, we need to exclude waist circumference (WC) from the imputation model for wgt because the high correlation between WC, BMI and wgt would otherwise lead to problems during imputation. (In practie you would first try to leave it in and only exclude it if it causes problems.)

And since HyperMed does not give us a lot more information than hypten, but has a lot more missing values, we do not want to use it as a predictor variable.

Apply the necessary changes to pred and meth.

For passive imputation, you need to specify the formula used to calculate BMI in meth using "~I(...)".

Solution 2

# BMI will not be used as predictor of height and weight
pred[c("hgt", "wgt"), "BMI"] <- 0
# height and weight will not be used as predictor in any model
pred[, c("hgt", "wgt")] <- 0
# height and weight will be used as predictors for each other
pred["hgt", "wgt"] <- 1
pred["wgt", "hgt"] <- 1

# WC is not used as predictor for weight
pred["wgt", "WC"] <- 0

# HyperMed will not be used as predictor in any model
pred[, "HyperMed"] <- 0

# hypchol will not be used as predictor in the imputation model for chol
pred["chol", "hypchol"] <- 0

# BMI will be imputed passively
meth["BMI"] <- "~I(wgt/hgt^2)"
# HyperMed will not be imputed
meth["HyperMed"] <- ""

Visit sequence

The visit sequence specifies the order in which the variables are imputed.

Task

To be sure that the imputed values of BMI match the imputed values of hgt and wgt, BMI needs to be imputed after hgt and wgt.

  • Get the visitSequence from imp0, and
  • change it if necessary.

Solution

visSeq <- imp0$visitSequence
visSeq
##  [1] "HDL"      "race"     "bili"     "smoke"    "DM"       "gender"   "WC"       "chol"    
##  [9] "HyperMed" "alc"      "SBP"      "wgt"      "hypten"   "cohort"   "occup"    "age"     
## [17] "educ"     "albu"     "creat"    "uricacid" "BMI"      "hypchol"  "hgt"
# find "BMI" in the visit sequence
which_BMI <- match("BMI", visSeq)
# move "BMI" to the end of the sequence
visSeq <- c(visSeq[-which_BMI], visSeq[which_BMI])

Imputation

Running the imputation

Task

With the changes that we have made to the predictorMatrix, method and visitSequence, we can now perform the imputation. For now, use no more then m = 5 and maxit = 10 (to save some time).

Solution

imp <- mice(NHANES2, method = meth, predictorMatrix = pred, visitSequence = visSeq,
            maxit = 10, m = 5, seed = 2019)

mice() prints the name of the variable being imputed for each iteration and imputation. If you run mice() on your own computer the output will show up continuously. There, you may notice that imputation is slowest for categorical variables, especially when they have many categories.

You can hide the lengthy output by specifying printFlag = FALSE.

What does mice return?

Task

mice() does not return a data.frame. Find out the class of the object returned by mice() function using the function class() and names(), and take a look at the help file for this class.

Solution

class(imp)
## [1] "mids"
names(imp)
##  [1] "data"            "imp"             "m"               "where"           "blocks"         
##  [6] "call"            "nmis"            "method"          "predictorMatrix" "visitSequence"  
## [11] "formulas"        "post"            "blots"           "seed"            "iteration"      
## [16] "lastSeedValue"   "chainMean"       "chainVar"        "loggedEvents"    "version"        
## [21] "date"

We see that imp is an object of class mids.

The help file tells us that a mids object is a list with several elements:

data: Original (incomplete) data set.
imp: The imputed values: A list of ncol(data) components, each list component is a matrix with nmis[j] rows and m columns.
m: The number of imputations.
where: The missingness indicator matrix.
blocks The blocks argument of the mice() function.
call: The call that created the mids object.
nmis: The number of missing observations per variable.
method: The vector imputation methods.
predictorMatrix: The predictor matrix.
visitSequence: The sequence in which columns are visited during imputation.
formulas A named list of formulas corresponding the the imputed variables (blocks).
post: A vector of strings of length length(blocks) with commands for post-processing.
seed: The seed value of the solution.
iteration: The number of iterations.
lastSeedValue: The most recent seed value.
chainMean: The mean of imputed values per variable and iteration: a list of m components. Each component is a matrix with maxitcolumns and length(visitSequence) rows.
chainVar: The variances of imputed values per variable and iteration(same structure as chainMean).
loggedEvents: A data.frame with the record of automatic corrective actions and warnings; (NULL if no action was made).
version Version number of the mice package that created the object.
date Date at which the object was created.

Details of the loggedEvents:

mice() does some pre-processing of the data:

  • variables containing missing values, that are not imputed but used as predictor are removed
    • constant variables are removed
    • collinear variables are removed

Furthermore, during each iteration

  • variables (or dummy variables) that are linearly dependent are removed
    • polr imputation that does not converge is replaced by polyreg.
The data.frame in loggedEvents has the following columns:
it iteration number
im imputation number
dep name of the name of the variable being imputed
meth imputation method used
out character vector with names of altered/removed predictors
 

© Nicole Erler