Preface

R packages

In this practical, a number of R packages are used. The packages used (with versions that were used to generate the solutions) are:

  • R version 3.6.1 (2019-07-05)
  • mice (version: 3.6.0)
  • RColorBrewer (version: 1.1.2)
  • reshape2 (version: 1.4.3)
  • ggplot2 (version: 3.2.1)

Dataset

For this practical, we will again use the NHANES2 dataset that we have seen in the previous practical.

To load this dataset, you can use the command file.choose() which opens the explorer and allows you to navigate to the location of the file NHANES2_for_practicals.RData on your computer. If you know the path to the file, you can also use load("<path>/NHANES2_for_practicals.RData"). RStudio users can also just click on the file in the “Files” pane/tab to load it.

Imputed data

The imputed data are stored in a mids object called imp that we created in the previous practical.

You can load it into your workspace by clicking the object imps.RData if you are using RStudio. Alternatively, you can load this workspace using load("<path>/imps.RData"). You then need to run:

imp <- savedimps_imp

Evaluate the imputation

Checking the settings

It is good practice to make sure that mice() has not done any processing of the data that was not planned or that you are not aware of. This means checking that the correct method, predictorMatrix and visitSequence were used.

Task

Do these checks for imp.

Solution

identical(imp$method, meth)
## [1] TRUE
identical(imp$predictorMatrix, pred)
## [1] TRUE
identical(imp$visitSequence, visSeq)
## [1] TRUE
# or:
imp$method
##             HDL            race            bili           smoke              DM          gender 
##          "norm"              ""          "norm"          "polr"              ""              "" 
##              WC            chol        HyperMed             alc             SBP             wgt 
##          "norm"          "norm"              ""          "polr"          "norm"          "norm" 
##          hypten          cohort           occup             age            educ            albu 
##        "logreg"              ""       "polyreg"              ""          "polr"          "norm" 
##           creat        uricacid             BMI         hypchol             hgt 
##           "pmm"          "norm" "~I(wgt/hgt^2)"        "logreg"          "norm"
imp$predictorMatrix
##          HDL race bili smoke DM gender WC chol HyperMed alc SBP wgt hypten cohort occup age educ
## HDL        0    1    1     1  1      1  1    1        0   1   1   0      1      0     1   1    1
## race       1    0    1     1  1      1  1    1        0   1   1   0      1      0     1   1    1
## bili       1    1    0     1  1      1  1    1        0   1   1   0      1      0     1   1    1
## smoke      1    1    1     0  1      1  1    1        0   1   1   0      1      0     1   1    1
## DM         1    1    1     1  0      1  1    1        0   1   1   0      1      0     1   1    1
## gender     1    1    1     1  1      0  1    1        0   1   1   0      1      0     1   1    1
## WC         1    1    1     1  1      1  0    1        0   1   1   0      1      0     1   1    1
## chol       1    1    1     1  1      1  1    0        0   1   1   0      1      0     1   1    1
## HyperMed   1    1    1     1  1      1  1    1        0   1   1   0      1      0     1   1    1
## alc        1    1    1     1  1      1  1    1        0   0   1   0      1      0     1   1    1
## SBP        1    1    1     1  1      1  1    1        0   1   0   0      1      0     1   1    1
## wgt        1    1    1     1  1      1  0    1        0   1   1   0      1      0     1   1    1
## hypten     1    1    1     1  1      1  1    1        0   1   1   0      0      0     1   1    1
## cohort     1    1    1     1  1      1  1    1        0   1   1   0      1      0     1   1    1
## occup      1    1    1     1  1      1  1    1        0   1   1   0      1      0     0   1    1
## age        1    1    1     1  1      1  1    1        0   1   1   0      1      0     1   0    1
## educ       1    1    1     1  1      1  1    1        0   1   1   0      1      0     1   1    0
## albu       1    1    1     1  1      1  1    1        0   1   1   0      1      0     1   1    1
## creat      1    1    1     1  1      1  1    1        0   1   1   0      1      0     1   1    1
## uricacid   1    1    1     1  1      1  1    1        0   1   1   0      1      0     1   1    1
## BMI        1    1    1     1  1      1  1    1        0   1   1   0      1      0     1   1    1
## hypchol    1    1    1     1  1      1  1    1        0   1   1   0      1      0     1   1    1
## hgt        1    1    1     1  1      1  1    1        0   1   1   1      1      0     1   1    1
##          albu creat uricacid BMI hypchol hgt
## HDL         1     1        1   1       1   0
## race        1     1        1   1       1   0
## bili        1     1        1   1       1   0
## smoke       1     1        1   1       1   0
## DM          1     1        1   1       1   0
## gender      1     1        1   1       1   0
## WC          1     1        1   1       1   0
## chol        1     1        1   1       0   0
## HyperMed    1     1        1   1       1   0
## alc         1     1        1   1       1   0
## SBP         1     1        1   1       1   0
## wgt         1     1        1   0       1   1
## hypten      1     1        1   1       1   0
## cohort      1     1        1   1       1   0
## occup       1     1        1   1       1   0
## age         1     1        1   1       1   0
## educ        1     1        1   1       1   0
## albu        0     1        1   1       1   0
## creat       1     0        1   1       1   0
## uricacid    1     1        0   1       1   0
## BMI         1     1        1   0       1   0
## hypchol     1     1        1   1       0   0
## hgt         1     1        1   0       1   0
imp$visitSequence
##  [1] "HDL"      "race"     "bili"     "smoke"    "DM"       "gender"   "WC"       "chol"    
##  [9] "HyperMed" "alc"      "SBP"      "wgt"      "hypten"   "cohort"   "occup"    "age"     
## [17] "educ"     "albu"     "creat"    "uricacid" "hypchol"  "hgt"      "BMI"

Logged events

Checking the loggedEvent shows us if mice() detected any problems during the imputation.

Task 1

Check the loggedEvents for imp.

Solution 1

imp$loggedEvents
## NULL

There are no logged events, great!

Task 2

Let’s see what would have happened if we had not prepared the predictorMatrix, method and visitSequence before imputation.

Run the imputation without setting any additional arguments:
impnaive <- mice(NHANES2, m = 5, maxit = 10)

Take a look at the loggedEvents of impnaive.

Solution 2

impnaive$loggedEvents

The loggedEvents of the “naive” imputation show that the constant variable cohort was excluded before the imputation (as it should be). Furthermore, in the imputation model for HyperMed, the variable hyptenyes was excluded (hyptenyes is the dummy variable belonging to hypten).


Task 3

We did not change the visitSequence in impnaive. Find out how that affected the imputed values of BMI.

Solution 3

naiveDF1 <- complete(impnaive, 1)
naivecalcBMI <- with(naiveDF1, wgt/hgt^2)

impDF1 <- complete(imp, 1)
impcalcBMI <- with(impDF1, wgt/hgt^2)

cbind(naiveBMI = naiveDF1$BMI, naivecalcBMI,
      impBMI = impDF1$BMI, impcalcBMI)[which(is.na(NHANES2$BMI)), ]
##       naiveBMI naivecalcBMI   impBMI impcalcBMI
##  [1,] 27.72183     27.39723 24.83157   24.83157
##  [2,] 42.57342     41.14373 29.34160   29.34160
##  [3,] 34.96218     37.19945 39.39542   39.39542
##  [4,] 25.09025     23.43565 30.54933   30.54933
##  [5,] 31.11811     32.50800 26.77603   26.77603
##  [6,] 53.45653     44.30180 44.55843   44.55843
##  [7,] 33.45054     33.85769 33.95596   33.95596
##  [8,] 25.82705     24.18514 24.84901   24.84901
##  [9,] 20.25445     20.67391 35.92126   35.92126
## [10,] 40.35064     39.02754 38.00768   38.00768
## [11,] 27.56521     26.88638 35.19163   35.19163
## [12,] 26.29231     24.03070 28.91268   28.91268
## [13,] 24.96105     25.29304 36.65114   36.65114
## [14,] 31.41156     30.99954 38.40088   38.40088
## [15,] 22.70998     22.46495 21.42614   21.42614
## [16,] 30.91019     30.66649 32.16342   32.16342
## [17,] 24.79963     23.41127 22.42358   22.42358
## [18,] 21.78935     22.04695 19.05868   19.05868
## [19,] 24.69459     24.58942 26.68989   26.68989
## [20,] 31.50786     32.05961 21.82161   21.82161
## [21,] 24.40730     23.02497 21.44369   21.44369
## [22,] 31.79397     31.24754 23.63931   23.63931
## [23,] 27.60408     26.54276 35.61949   35.61949
## [24,] 23.91393     23.41127 23.33941   23.33941
## [25,] 30.91019     29.26408 27.79779   27.79779
## [26,] 45.61438     49.80076 40.85093   40.85093
## [27,] 38.62077     38.37203 32.16596   32.16596
## [28,] 25.69610     24.41214 24.00366   24.00366
## [29,] 23.01259     22.46495 33.98395   33.98395
## [30,] 26.52057     26.45250 28.13602   28.13602
## [31,] 23.61830     23.41127 31.71061   31.71061
## [32,] 21.69968     20.98261 24.12301   24.12301
## [33,] 34.96218     34.20240 34.34352   34.34352
## [34,] 29.44466     29.47564 44.54863   44.54863
## [35,] 22.03277     22.04695 24.68605   24.68605
## [36,] 43.04508     46.76502 44.28761   44.28761
## [37,] 33.20051     32.61309 28.55831   28.55831

When we compare the imputed and calculated values of BMI from impnaive we can see that the imputed hgt and wgt give a different BMI than is imputed. This is because BMI is imputed before wgt, which means that the most recent imputed value of wgt is from the previous iteration.

Changing the visitSequence in imp prevented this inconsistency.

Convergence

In order to obtain correct results, the MICE algorithm needs to have converged. This can be checked visually by plotting summaries of the imputed values accross the iterations.

The mean and variance of the imputed values per iteration and variable are stored in the elements chainMean and chainVar of the mids object.

Task

Plot them to see if our imputation imp has converged.

Solution

# implemented plotting function (use layout to change the number of rows and columns)
plot(imp, layout = c(6, 6))

The chains in imp seem to have converged, however it is difficult to judge this based on only 10 iterations. In practice, more iterations should be done.

To save you some time, I ran the imputation again with 30 iterations and the traceplots confirm convergence:

plot(imp30, layout = c(6, 6))

Continue

In comparison, impnaive (in my version with maxit = 30) had some convergence problems:

plot(impnaive, layout = c(6, 6))

hgt, and wgt show a clear trend and the chains do not mix well, i.e., there is more variation between the chains than within each chain. (the same is the case for BMI).

These are clear signs that there is correlation or identification problems between these variables and some other variables (which is why we made adjustments to the predictorMatrix for imp).

Imputed values

Now that we know that imp has converged, we can compare the distribution of the imputed values against the distribution of the observed values. When our imputation models fit the data well, they should have similar distributions (conditional on the covariates used in the imputation model).

Task 1

  • Plot the distributions of the imputed variables (continuous and categorical).
  • Make sure the imputed values are realistic (e.g., height of 2.50m or weight of 10kg for adults).

You can use densityplot() and propplot() to get plots for all continuous and categorical variables.

propplot() is not part of any package. Copy the following syntax that defines this function:

To check all imputed values you can either get a summary of the imp element of the mids object or create a complete dataset containing all imputations using the function complete() and get the summary of that.

Solution 1

# plot densities of continuous variables
densityplot(imp)

# plot for all categorical variables
propplot(imp)
## Warning: attributes are not identical across measure variables; they will be dropped

# get the summary of the imputed values
sapply(Filter(function(x) nrow(x) > 0, imp$imp),
       function(x) summary(unlist(x))
)
## $HDL
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.06009 1.19649 1.42542 1.42400 1.68001 2.59759 
## 
## $bili
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.1174  0.5719  0.7760  0.7856  0.9980  1.7547 
## 
## $smoke
##   never  former current 
##       5       3       2 
## 
## $WC
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   64.46   87.59   98.09   99.49  109.60  152.22 
## 
## $chol
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.844   4.163   5.057   4.961   5.763   8.431 
## 
## $HyperMed
##    Mode    NA's 
## logical    4065 
## 
## $alc
##   0 <=1 1-3 3-7  >7 
## 409 712 192 161 121 
## 
## $SBP
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   78.95  111.94  124.57  124.38  137.66  170.34 
## 
## $wgt
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   25.18   68.43   80.13   81.41   94.91  136.87 
## 
## $hypten
##  no yes 
## 153  57 
## 
## $occup
##          working looking for work      not working 
##               49                6               30 
## 
## $educ
##  Less than 9th grade         9-11th grade High school graduate         some college 
##                    0                    0                    2                    1 
##     College or above 
##                    2 
## 
## $albu
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.234   4.079   4.292   4.286   4.503   5.106 
## 
## $creat
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.450   0.710   0.850   0.909   0.980   7.460 
## 
## $uricacid
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.094   4.403   5.337   5.375   6.304  10.834 
## 
## $BMI
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.53   24.69   28.23   29.55   33.96   48.07 
## 
## $hypchol
##  no yes 
## 368  62 
## 
## $hgt
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.400   1.563   1.608   1.631   1.704   1.942
# or of the whole dataset
summary(complete(imp, action = 'long'))
##       .imp        .id              HDL                          race           bili        
##  Min.   :1   Min.   :   1.0   Min.   :0.06009   Mexican American  : 560   Min.   :-0.1174  
##  1st Qu.:2   1st Qu.: 250.8   1st Qu.:1.11000   Other Hispanic    : 550   1st Qu.: 0.6000  
##  Median :3   Median : 500.5   Median :1.32000   Non-Hispanic White:1750   Median : 0.7000  
##  Mean   :3   Mean   : 500.5   Mean   :1.39411   Non-Hispanic Black:1145   Mean   : 0.7558  
##  3rd Qu.:4   3rd Qu.: 750.2   3rd Qu.:1.60000   other             : 995   3rd Qu.: 0.9000  
##  Max.   :5   Max.   :1000.0   Max.   :4.03000                             Max.   : 2.9000  
##      smoke        DM          gender           WC              chol            HyperMed   
##  never  :3045   no :4265   male  :2465   Min.   : 61.90   Min.   : 1.844   no      : 125  
##  former : 958   yes: 735   female:2535   1st Qu.: 85.10   1st Qu.: 4.270   previous: 100  
##  current: 997                            Median : 95.40   Median : 4.910   yes     : 710  
##                                          Mean   : 96.51   Mean   : 4.995   NA's    :4065  
##                                          3rd Qu.:105.52   3rd Qu.: 5.640                  
##                                          Max.   :154.70   Max.   :10.680                  
##   alc            SBP              wgt         hypten        cohort                       occup     
##  0  : 974   Min.   : 78.95   Min.   : 25.18   no :3618   Length:5000        working         :2769  
##  <=1:2117   1st Qu.:109.33   1st Qu.: 63.50   yes:1382   Class :character   looking for work: 236  
##  1-3: 717   Median :118.22   Median : 77.11              Mode  :character   not working     :1995  
##  3-7: 566   Mean   :120.40   Mean   : 78.42                                                        
##  >7 : 626   3rd Qu.:129.33   3rd Qu.: 89.36                                                        
##             Max.   :202.00   Max.   :167.83                                                        
##       age                          educ           albu           creat           uricacid     
##  Min.   :20.00   Less than 9th grade : 415   Min.   :3.000   Min.   :0.4400   Min.   : 1.094  
##  1st Qu.:31.00   9-11th grade        : 665   1st Qu.:4.100   1st Qu.:0.6900   1st Qu.: 4.400  
##  Median :43.00   High school graduate:1142   Median :4.300   Median :0.8200   Median : 5.300  
##  Mean   :45.23   some college        :1416   Mean   :4.289   Mean   :0.8576   Mean   : 5.358  
##  3rd Qu.:59.00   College or above    :1362   3rd Qu.:4.500   3rd Qu.:0.9600   3rd Qu.: 6.200  
##  Max.   :79.00                               Max.   :5.400   Max.   :7.4600   Max.   :10.834  
##       BMI        hypchol         hgt       
##  Min.   :12.53   no :4433   Min.   :1.397  
##  1st Qu.:23.30   yes: 567   1st Qu.:1.616  
##  Median :26.61              Median :1.676  
##  Mean   :27.56              Mean   :1.684  
##  3rd Qu.:30.85              3rd Qu.:1.753  
##  Max.   :60.54              Max.   :1.981

Unfortunately, we have some negative imputed values for bili. Often, this would not result in bias in the analysis, but may be difficult to explain when providing a summary of the imputed data in a publication. In the present example we can see that the observed values have a slightly right-skewed distribution compared to the imputed values. Re-doing the imputation with pmm instead of norm for bili should fix this. (However, since the imputations seem fine overall, and there is little knowledge gain in re-doing the previous steps, we will skip this repetition in this practical.)

The distributions of the imputed values for hgt and SBP differ a bit from the distributions of the observed data.

We also imputed a larger proportion than might have been expected in the highest category of alc, and the distribution of values for smoke looks a bit weird (but smoke only has one missing value, which makes it difficult to judge the distribution of the imputed values).

Task 2

Investigate if differences in the distributions of observed and imputed values, can be explained by other variables. Check this for

  • SBP conditional on gender and hypten
  • hgt conditional on gender
  • alc conditional on gender or smoke

Solution 2

densityplot(imp, ~SBP|hypten + gender)

densityplot(imp, ~hgt|gender)

propplot(imp, alc ~ gender + smoke)

 

© Nicole Erler