# Chapter 6 JointAI: Joint Analysis and Imputation of Incomplete Data in R

This chapter is based on

Nicole S. Erler, Dimitris Rizopoulos and Emmanuel M. E. H. Lesaffre. JointAI: Joint Analysis and Imputation of Incomplete Data in R. (Manuscript in preparation)

### Abstract

Missing data occur in many types of studies and typically complicate the analysis. Multiple imputation, either using joint modelling or the more flexible full conditional specification approach, is popular and works well in standard settings. However, in settings involving non-linear associations or interactions, incompatibility of the imputation model with the analysis model is an issue often resulting in bias. Similarly, complex outcomes such as longitudinal or survival outcomes cannot be adequately handled by standard implementations.

In this chapter, we introduce the R package JointAI, which utilizes the Bayesian framework to perform simultaneous analysis and imputation in regression models with incomplete covariates. Using a fully Bayesian joint modelling approach, it overcomes the issue of uncongeniality while retaining the attractive flexibility of multiple imputation by fully conditional specification: the joint distribution of analysis and imputation models is specified as a sequence of univariate models that can be adapted to the type of each variable. JointAI provides functions for Bayesian inference with generalized linear and generalized linear mixed models as well as survival models that take arguments analogous to their corresponding and well-known complete-data versions from base R and other packages. Usage and features of JointAI are described and illustrated using various examples, and the theoretical background is outlined.

## 6.1 Introduction

Missing data are a challenge common to the analysis of data from virtually all kinds of studies. Especially when many variables are measured, as in big cohort studies, or when data are obtained retrospectively, e.g., from registries, proportions of missing values up to 50% are not uncommon in some variables.

Multiple imputation, which is often considered the gold standard to handle incomplete data, has its origin in the 1970s and was primarily developed for survey data (Rubin 1987, 2004). One of its first implementations in R (R Core Team 2018) is the package norm (Novo and Schafer 2010), which performs multiple imputation under the joint modelling framework using a multivariate normal distribution (Schafer 1997). Nowadays, multiple imputation using a fully conditional specification (FCS), also called multiple imputation using chained equations (MICE), and its seminal implementation in the R package mice (Van Buuren and Groothuis-Oudshoorn 2011; Van Buuren 2012), is more frequently used.

Datasets have become more complex than the survey data for which multiple imputation was developed. Therefore, more sophisticated methods are required that can adequately handle the features of modern data and comply with the assumptions made in their analysis. Modern studies do not only record univariate outcomes, measured in cross-sectional settings, but also outcomes that consist of two or more measurements, such as repeatedly measured or survival outcomes. Furthermore, non-linear effects, introduced by functions of covariates, such as transformations, polynomials or splines, or interactions between variables are considered in the analysis and hence need to be taken into account during imputation.

Standard multiple imputation, either using FCS or a joint modelling approach, e.g., under a multivariate normal distribution, assumes linear associations between all variables. Moreover, FCS requires the outcome to be explicitly specified in each of the linear predictors of the full conditional distributions. In settings where the outcome is more complex than just univariate, this is not straightforward and not generally possible without information loss, leading to misspecified imputation models.

Some extensions of standard multiple imputation have been developed and are implemented in R packages and other software, but most imputation software is restricted to standard settings such as cross-sectional survey data. R packages that offer extensions frequently focus on particular settings, so researchers need to be familiar with a number of different packages, which often require input and specifications in very different forms. Moreover, most R packages dealing with incomplete data implement multiple imputation, i.e., they create multiple imputed datasets, which are then analysed in a second step, followed by pooling of the results.

The R package JointAI (Erler 2019), which is presented in this chapter, follows a different, fully Bayesian approach. It provides a unified framework for both simple and more complex models, using a consistent specification most users will be familiar with from commonly used (base) R functions. By modelling the analysis model of interest jointly with the incomplete covariates, analysis and imputation can be performed simultaneously while assuring compatibility between all sub-models (Erler et al. 2016, 2019). In this joint modelling approach, the added uncertainty due to the missing values is automatically taken into account in the posterior distribution of the parameters of interest, and no pooling of results from repeated analyses is necessary. The joint distribution is specified in a convenient way, using a sequence of conditional distributions that can be specified flexibly according to each type of variable. Since the analysis model of interest defines the first distribution in the sequence, the outcome is included in the joint distribution without the need for it to enter the linear predictor of any of the other models. Moreover, non-linear associations that are part of the analysis model are automatically taken into account for the imputation of missing values. This directly enables our approach to handle complicated models, with complex outcomes and flexible linear predictors.

In the following, we introduce the R package JointAI, which performs joint analysis and imputation of regression models with incomplete covariates under the Missing At Random assumption (Rubin 1976), and explain how data with incomplete covariate information can be analysed and imputed with it. The package is available for download at the Comprehensive R Archive Network (CRAN) under https://CRAN.R-project.org/package=JointAI. Section 6.2 briefly describes the theoretical background. An outline of the general structure of JointAI is given in Section 6.3, followed by an introduction to example datasets that are used throughout this chapter, in Section 6.4. Details about model specification, settings controlling the Markov Chain Monte Carlo sampling, and summary, plotting and other functions that can be applied after fitting the model are given in Sections 6.5 through 6.7. We conclude the chapter with an outlook of planned extensions and discuss the limitations that are introduced by the assumptions made in the sequential imputation approach.

## 6.2 Theoretical Background

Consider the general setting of a regression model where interest lies in a set of parameters $$\boldsymbol\theta$$ that describe the association between a univariate outcome $$\mathbf y$$ and a set of covariates $$\mathbf X = (\mathbf x_1, \ldots, \mathbf x_p)$$. In the Bayesian framework, inference over $$\boldsymbol\theta$$ is obtained by estimation of the posterior distribution of $$\boldsymbol\theta$$, which is proportional to the product of the likelihood of the data $$(\mathbf y, \mathbf X)$$ and the prior distribution of $$\boldsymbol\theta$$, $p(\boldsymbol\theta\mid \mathbf y, \mathbf X) \propto p(\mathbf y, \mathbf X \mid \boldsymbol\theta)\,p(\boldsymbol\theta).$

When some of the covariates are incomplete, $$\mathbf X$$ consists of two parts, the completely observed variables $$\mathbf X_{obs}$$ and those variables that are incomplete, $$\mathbf X_{mis}$$. If $$\mathbf y$$ had missing values (and this missingness was ignorable), the only necessary change in the formulas below would be to replace $$\mathbf y$$ by $$(\mathbf y_{obs}, \mathbf y_{mis})$$. We will, therefore, without loss of generality, consider $$\mathbf y$$ to be completely observed.

The likelihood of the complete data, i.e., observed and unobserved, can be factorized in the following convenient way: $p(\mathbf y, \mathbf X_{obs}, \mathbf X_{mis} \mid \boldsymbol\theta) = p(\mathbf y \mid \mathbf X_{obs}, \mathbf X_{mis}, \boldsymbol\theta_{y\mid x})\, p(\mathbf X_{mis} \mid \mathbf X_{obs}, \boldsymbol\theta_x),$ where the first factor constitutes the analysis model of interest, described by a vector of parameters $$\boldsymbol\theta_{y\mid x}$$, and the second factor is the joint distribution of the incomplete variables, i.e., the imputation part of the model, described by parameters $$\boldsymbol\theta_x$$, and $$\boldsymbol\theta = (\boldsymbol\theta_{y\mid x}^\top, \boldsymbol\theta_x^\top)^\top$$.

Explicitly specifying the joint distribution of all data is one of the major advantages of the Bayesian approach, since this facilitates the use of all available information of the outcome in the imputation of the incomplete covariates (Erler et al. 2016), which becomes especially relevant for more complex outcomes such as repeatedly measured variables (see Section 6.2.2).

In complex models, the posterior distribution usually cannot be derived analytically; instead, Markov Chain Monte Carlo (MCMC) methods are used to obtain samples from it. The MCMC sampling in JointAI is done using Gibbs sampling, which iteratively samples from the full conditional distributions of the unknown parameters and missing values.
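The Gibbs idea can be illustrated with a minimal toy sketch (not JointAI's actual sampler, which is generated as JAGS code): for a simple linear regression with a partially missing standard-normal covariate, the sampler alternates between drawing the regression coefficient and the missing covariate values from their full conditional distributions. All names and values below are illustrative.

```r
## Toy Gibbs sampler: y_i = beta * x_i + e_i, e_i ~ N(0, 1), x_i ~ N(0, 1),
## with 30 of the x_i missing.
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 2 * x + rnorm(n)                 # true beta = 2
mis <- sample(n, 30)
x[mis] <- 0                           # starting values for the missing x_i
beta <- 0
draws <- numeric(500)
for (m in seq_len(500)) {
  ## full conditional of beta (flat prior): N(sum(x*y)/sum(x^2), 1/sum(x^2))
  v <- 1 / sum(x^2)
  beta <- rnorm(1, v * sum(x * y), sqrt(v))
  ## full conditional of a missing x_i combines its "imputation model"
  ## x_i ~ N(0, 1) with the analysis model y_i | x_i ~ N(beta * x_i, 1):
  ## x_i | y_i, beta ~ N(beta * y_i / (1 + beta^2), 1 / (1 + beta^2))
  vx <- 1 / (1 + beta^2)
  x[mis] <- rnorm(length(mis), vx * beta * y[mis], sqrt(vx))
  draws[m] <- beta
}
mean(draws[-(1:100)])                 # posterior mean; close to the true beta
```

Note how the analysis model enters the full conditional used to sample the missing covariate values; this is exactly the mechanism exploited in the imputation part of the model described below.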

In the following sections, we describe each of the three parts of the model, the analysis model, the imputation part and the prior distributions, in detail.

### 6.2.1 Analysis Model

The analysis model of interest is described by the probability density function $$p(\mathbf y \mid \mathbf X, \boldsymbol\theta_{y\mid x})$$. The R package JointAI can currently handle analysis models that are either generalized linear regression models (GLM), (generalized) linear mixed models (GLMM), cumulative logit (mixed) models, parametric (Weibull) survival models or Cox proportional hazards models.

For a GLM, the probability density function is chosen from the exponential family and the model has the linear predictor $g(E(y_i\mid \mathbf X, \boldsymbol\theta_{y\mid x})) = \mathbf x_i^\top\boldsymbol\beta,$ where $$g(\cdot)$$ is a link function, $$y_i$$ the value of the outcome variable for subject $$i$$, and $$\mathbf x_i$$ is the row of $$\mathbf X$$ that contains the covariate information for $$i$$.

For a GLMM the linear predictor is of the form $g(E(y_{ij}\mid \mathbf X, \mathbf b_i, \boldsymbol\theta_{y\mid x})) = \mathbf x_{ij}^\top\boldsymbol\beta + \mathbf z_{ij}^\top\mathbf b_i,$ where $$y_{ij}$$ is the $$j$$-th outcome of subject $$i$$, $$\mathbf x_{ij}$$ is the corresponding vector of covariate values, $$\mathbf b_i$$ a vector of random effects pertaining to subject $$i$$, and $$\mathbf z_{ij}$$ the row of the design matrix of the random effects, $$\mathbf Z$$, that corresponds to the $$j$$-th measurement of subject $$i$$. $$\mathbf Z$$ typically contains a subset of the variables in $$\mathbf X$$, and $$\mathbf b_i$$ follows a normal distribution with mean zero and covariance matrix $$\mathbf D$$.
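For instance, a linear mixed model for the simLong data introduced in Section 6.4.2 could hypothetically be specified as follows (a sketch, not part of the original text; the spline basis comes from ns() in package splines, and the concrete covariate choice is illustrative):

```r
library(splines)
library(JointAI)

## random intercept and slope for age; the ns() term models the
## non-linear trend of bmi over age (hypothetical specification)
mod_glmm <- lme_imp(bmi ~ GESTBIR + ETHN + SMOKE + ns(age, df = 3),
                    random = ~ age | ID,
                    data = simLong, n.iter = 500)
```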

In both cases the parameter vector $$\boldsymbol\theta_{y\mid x}$$ contains the regression coefficients $$\boldsymbol\beta$$, and potentially additional variance parameters (e.g., for linear (mixed) models), for which prior distributions will be specified in Section 6.2.3.

Cumulative logit mixed models are of the form $\begin{eqnarray*} y_{ij} &\sim& \text{Multinom}(\pi_{ij,1}, \ldots, \pi_{ij,K}),\\[2ex] \pi_{ij,1} &=& P(y_{ij} \leq 1),\\ \pi_{ij,k} &=& P(y_{ij} \leq k) - P(y_{ij} \leq k-1), \quad k \in 2, \ldots, K-1,\\ \pi_{ij,K} &=& 1 - \sum_{k = 1}^{K-1}\pi_{ij,k},\\[2ex] \text{logit}(P(y_{ij} \leq k)) &=& \gamma_k + \eta_{ij}, \quad k \in 1,\ldots,K,\\ \eta_{ij} &=& \mathbf x_{ij}^\top\boldsymbol\beta + \mathbf z_{ij}^\top\mathbf b_i,\\[2ex] \gamma_1,\delta_1,\ldots,\delta_{K-1} &\overset{iid}{\sim}& N(\mu_\gamma, \sigma_\gamma^2),\\ \gamma_k &=& \gamma_{k-1} + \exp(\delta_{k-1}),\quad k = 2,\ldots,K, \end{eqnarray*}$ where $$\pi_{ij,k} = P(y_{ij} = k)$$ and $$\text{logit}(x) = \log\left(\frac{x}{1-x}\right)$$. A cumulative logit regression model for a univariate outcome $$y_i$$ can be obtained by dropping the index $$j$$ and omitting $$\mathbf z_{ij}^\top\mathbf b_i$$. In cumulative logit (mixed) models, the design matrix $$\mathbf X$$ does not contain an intercept, since outcome category specific intercepts $$\gamma_1,\ldots, \gamma_K$$ are specified. Here, the parameter vector $$\boldsymbol \theta_{y\mid x}$$ includes the regression coefficients $$\boldsymbol\beta$$, the first intercept $$\gamma_1$$ and the increments $$\delta_1, \ldots, \delta_{K-1}$$.

Survival data are typically characterized by the observed event or censoring times, $$T_i$$, and the event indicator, $$D_i$$, which is equal to one if the event was observed and zero otherwise. JointAI provides two types of models to analyse right censored survival data, a parametric model assuming a Weibull distribution for the true (but partially unobserved) survival times $$T^*$$, and a semi-parametric Cox proportional hazards model.

The parametric survival model is implemented as $\begin{eqnarray*} T_i^* &\sim& \text{Weibull}(1, r_i, s),\\ D_i &=& \unicode{x1D7D9}(T_i^* \leq C_i),\\ \log(r_i) &=& - \mathbf x_i^\top\boldsymbol\beta,\\ s &\sim& \text{Exp}(0.01), \end{eqnarray*}$ where $$C_i$$ denotes the censoring time and $$\unicode{x1D7D9}\;$$ is the indicator function, which is one if $$T_i^* \leq C_i$$, i.e., if the event was observed, and zero otherwise.

For the Cox proportional hazards model, following Lunn et al. (2012), a counting process representation is implemented, where the baseline hazard is assumed to be piecewise constant and changes only at observed event times. Let $$\{N_i(t), t \geq 0\}$$ be an event counting process for individual $$i$$, where $$N_i(t) = 0$$ until the individual experiences an event and increases by one at the time of the event. $$dN_i(t)$$ then denotes the change in $$N_i(t)$$ in the interval $$[t, t+dt)$$, where $$dt$$ is the length of that interval, and can be modelled as a Poisson random variable with time-varying intensity $$\lambda_i(t)$$. This intensity depends on covariates $$\mathbf x_i$$, the baseline hazard $$\lambda_0(t)$$, and the risk set indicator $$R_i(t)$$, which is equal to one if, at time $$t$$, subject $$i$$ is at risk for an event, and zero otherwise. $\begin{eqnarray*} dN_i(t) &\sim& \text{Poisson}(\lambda_i(t)),\quad t \in 0,\ldots, T\\ \lambda_i(t) &=& \exp(\mathbf x_i^\top\boldsymbol\beta) \; \lambda_0(t) \; R_i(t)\\ \lambda_0(t) &\sim& \text{Gamma}(c\lambda_0(t)^*, c) \end{eqnarray*}$ where $$\lambda_0(t)^*$$ is a prior guess of the failure rate at time $$t$$, and $$c$$ represents the confidence about that prior guess.

### 6.2.2 Imputation Part

A convenient way to specify the joint distribution of the incomplete covariates $$\mathbf X_{mis} = (\mathbf x_{mis_1}, \ldots, \mathbf x_{mis_q})$$ is to use a sequence of conditional univariate distributions (Erler et al. 2016; Ibrahim, Chen, and Lipsitz 2002), $p(\mathbf x_{mis_1}, \ldots, \mathbf x_{mis_q} \mid \mathbf X_{obs}, \boldsymbol\theta_x) = \prod_{\ell = 1}^{q} p(\mathbf x_{mis_\ell} \mid \mathbf X_{obs}, \mathbf x_{mis_1}, \ldots, \mathbf x_{mis_{\ell-1}}, \boldsymbol\theta_{x_\ell}),\tag{6.1}$ with $$\boldsymbol\theta_{x} = (\boldsymbol\theta_{x_1}^\top, \ldots, \boldsymbol\theta_{x_q}^\top)^\top$$.

Each of the conditional distributions is a member of the exponential family, extended with distributions for ordinal categorical variables, and chosen according to the type of the respective variable. Its linear predictor is $g_\ell\left\{E\left(x_{i,mis_\ell} \mid \mathbf x_{i,obs}, \mathbf x_{i, mis_{<\ell}}, \boldsymbol\theta_{x_\ell}\right) \right\} = (\mathbf x_{i, obs}^\top, x_{i, mis_1}, \ldots, x_{i, mis_{\ell-1}}) \boldsymbol\alpha_{\ell},$ for $$\ell=1,\ldots,q$$, where $$\mathbf x_{i,mis_{<\ell}} = (x_{i,mis_1}, \ldots, x_{i,mis_{\ell-1}})^\top$$.

Factorization of the joint distribution of the covariates in such a sequence yields a straightforward specification of the joint distribution, even when the covariates are of mixed type.

Missing values in the covariates are sampled from their full conditional distributions that can be derived from the full joint distribution of outcome and covariates.

When, for instance, the analysis model is a GLM, the full conditional distribution of an incomplete covariate $$x_{i, mis_{\ell}}$$ can be written as $\begin{eqnarray} p(x_{i, mis_{\ell}} \mid \mathbf y_i, \mathbf x_{i,obs}, \mathbf x_{i,mis_{-\ell}}, \boldsymbol\theta) & \propto& p \left(y_i \mid \mathbf x_{i, obs}, \mathbf x_{i, mis}, \boldsymbol\theta_{y\mid x} \right)\, p(\mathbf x_{i, mis}\mid \mathbf x_{i, obs}, \boldsymbol\theta_{x})\\ & & p(\boldsymbol\theta_{y\mid x})\, p(\boldsymbol\theta_{x})\nonumber\\ &\propto& p\left(y_i \mid \mathbf x_{i, obs}, \mathbf x_{i, mis}, \boldsymbol\theta_{y\mid x} \right)\, p(x_{i, mis_\ell} \mid \mathbf x_{i, obs}, \mathbf x_{i, mis_{<\ell}}, \boldsymbol\theta_{x_\ell})\\ & &\left\{\prod_{k=\ell+1}^q p(x_{i,mis_k}\mid \mathbf x_{i, obs}, \mathbf x_{i, mis_{<k}}, \boldsymbol\theta_{x_k}) \right\}\, p(\boldsymbol\theta_{y\mid x})\, p(\boldsymbol\theta_{x_\ell})\\ & & \prod_{k=\ell+1}^q \, p(\boldsymbol\theta_{x_k}),\tag{6.2} \end{eqnarray}$ where $$\boldsymbol\theta_{x_{\ell}}$$ is the vector of parameters describing the model for the $$\ell$$-th covariate, and contains the vector of regression coefficients and potentially additional (variance) parameters. The product of distributions enclosed by curly brackets represents the distributions of those covariates that have $$x_{mis_\ell}$$ as a predictor variable in the specification of the sequence in (6.1).

Even though (6.2) describes the actual imputation model, i.e., the distribution the imputed values for $$x_{i, mis_{\ell}}$$ are sampled from, we will use the term “imputation model” for the conditional distribution of $$x_{i, mis_{\ell}}$$ from (6.1), since the latter is the distribution that is explicitly specified by the user and, hence, of more relevance when using JointAI.

#### Imputation with Longitudinal Outcomes

Factorizing the joint distribution into the analysis model and imputation part allows a straightforward extension to settings with more complex outcomes, such as repeatedly measured outcomes. In the case where the analysis model is a GLMM, the conditional distribution of the outcome in (6.2), $$p(y_i \mid \mathbf x_{i, obs}, \mathbf x_{i, mis}, \boldsymbol\theta_{y\mid x})$$, has to be replaced by the distribution of the vector of repeated measurements of subject $$i$$, $p(\mathbf y_i \mid \mathbf x_{i, obs}, \mathbf x_{i, mis}, \mathbf b_i, \boldsymbol\theta_{y\mid x}) = \prod_{j} p(y_{ij} \mid \mathbf x_{ij, obs}, \mathbf x_{ij, mis}, \mathbf b_i, \boldsymbol\theta_{y\mid x}).\tag{6.3}$ Since $$\mathbf y$$ does not appear in any of the other terms in (6.2), and (6.3) can be chosen to be a model that is appropriate for the outcome at hand, the thereby specified full conditional distribution of $$x_{i, mis_\ell}$$ allows us to draw valid imputations that use all available information on the outcome.

This is an important difference to standard FCS, where the full conditional distributions used to impute missing values are specified directly, usually as regression models, and require the outcome to be explicitly included in the linear predictor of the imputation model. In settings with complex outcomes, it is not clear how this should be done, and simplifications may lead to biased results (Erler et al. 2016). The joint model specification utilized in JointAI overcomes this difficulty.

When some of the covariates are time-varying, it is convenient to specify models for these variables at the beginning of the sequence of covariate models, so that models for longitudinal variables have other longitudinal and baseline covariates in their linear predictor, but longitudinal covariates do not enter the predictors of baseline covariates.

Note that whenever there are incomplete baseline covariates it is necessary to specify models for all longitudinal variables, even completely observed ones, while models for completely observed baseline covariates can be omitted. This becomes clear when we extend the factorized joint distribution from above with completely and incompletely observed time-varying (longitudinal) covariates $$\mathbf s_{obs}$$ and $$\mathbf s_{mis}$$: $\begin{multline*} p(y_{ij} \mid \mathbf s_{ij, obs}, \mathbf s_{ij, mis}, \mathbf x_{i, obs}, \mathbf x_{i, mis}, \boldsymbol\theta_{y\mid x})\, p(\mathbf s_{ij, mis}\mid \mathbf s_{ij, obs}, \mathbf x_{i, obs}, \mathbf x_{i, mis}, \boldsymbol\theta_{s_{mis}})\\ p(\mathbf s_{ij, obs}\mid \mathbf x_{i, obs}, \mathbf x_{i, mis}, \boldsymbol\theta_{s_{obs}})\, p(\mathbf x_{i, mis}\mid \mathbf x_{i, obs}, \boldsymbol\theta_{x_{mis}})\, p(\mathbf x_{i, obs} \mid \boldsymbol\theta_{x_{obs}})\\ p(\boldsymbol\theta_{y\mid x})\, p(\boldsymbol\theta_{s_{mis}}) \, p(\boldsymbol\theta_{s_{obs}})\, p(\boldsymbol\theta_{x_{mis}}) \, p(\boldsymbol\theta_{x_{obs}}) \end{multline*}$ Given that the parameter vectors $$\boldsymbol\theta_{x_{obs}}$$, $$\boldsymbol\theta_{x_{mis}}$$, $$\boldsymbol\theta_{s_{obs}}$$ and $$\boldsymbol\theta_{s_{mis}}$$ are a priori independent, and $$p(\mathbf x_{i, obs} \mid \boldsymbol\theta_{x_{obs}})$$ is independent of both $$\mathbf x_{i,mis}$$ and $$\mathbf s_{ij,mis}$$, it can be excluded from the model.

Since $$p(\mathbf s_{ij, obs}\mid \mathbf x_{i, obs}, \mathbf x_{i, mis}, \boldsymbol\theta_{s_{obs}})$$, however, has $$\mathbf x_{i, mis}$$ in its linear predictor and will, hence, be part of the full conditional distribution of $$\mathbf x_{i, mis}$$, it cannot be omitted from the model specification.

#### Non-linear Associations and Interactions

Other settings in which the fully Bayesian approach employed in JointAI has an advantage over standard FCS are settings with interaction terms that involve incomplete covariates, or when the association of the outcome with an incomplete covariate is non-linear. In standard FCS such settings lead to incompatible imputation models (White, Royston, and Wood 2011; Bartlett et al. 2015). This becomes clear when considering the following simple example where the analysis model of interest is the linear regression $$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i$$ and $$x_i$$ is imputed using $$x_i = \alpha_0 + \alpha_1 y_i + \tilde\varepsilon_i$$. While the analysis model assumes a quadratic relationship, the imputation model assumes a linear association between $$\mathbf x$$ and $$\mathbf y$$ and a joint distribution with these imputation and analysis models as its full conditional distributions does not exist. Because, in JointAI, the analysis model is a factor in the full conditional distribution that is used to impute $$x_i$$, the non-linear association is taken into account. Furthermore, since it is the joint distribution that is specified, and the full conditionals are derived from it, the joint distribution is guaranteed to exist.
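In JointAI, such a quadratic effect is simply written in the model formula, and the analysis model, including the quadratic term, is then used when imputing the covariate. A hypothetical sketch with the NHANES data from Section 6.4.1 (the covariate choice is illustrative):

```r
library(JointAI)

## quadratic effect of waist circumference (WC; incomplete):
## the term I(WC^2) is re-calculated from the imputed WC values
## in each iteration of the sampler
mod_quad <- lm_imp(SBP ~ gender + age + WC + I(WC^2),
                   data = NHANES, n.iter = 500)
```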

### 6.2.3 Prior Distributions

Prior distributions have to be specified for all (hyper)parameters. A common choice for the regression coefficients is a normal prior with mean zero and large variance. In JointAI, the variance parameters in models for normally distributed variables are assigned inverse-gamma prior distributions.

The covariance matrix of the random effects in a mixed model, $$\mathbf D$$, is assumed to follow an inverse-Wishart distribution where the degrees of freedom are chosen to be equal to the dimension of the random effects and the scale matrix is diagonal. Since the magnitude of the diagonal elements relates to the variance of the random effects, the choice of suitable values depends on the scale of the variable the random effect is associated with. Therefore, JointAI uses independent gamma hyperpriors for each of the diagonal elements. More details about the default hyperparameters and how to change them are given in Section 6.5.8 and Appendix 6.A.
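The defaults can be inspected, and modified copies passed back to the model fitting functions. A sketch, assuming a current version of the package (the exact structure of the returned object may differ between versions):

```r
library(JointAI)

hyp <- default_hyperpars()   # list of default hyperparameter values
str(hyp)                     # inspect the elements per model type

## after modifying elements of hyp, it can be supplied to a model
## via the hyperpars argument, e.g., lm_imp(..., hyperpars = hyp)
```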

## 6.3 Package Structure

The package JointAI has eight main functions, lm_imp(), glm_imp(), clm_imp(), lme_imp(), glme_imp(), clmm_imp(), survreg_imp() and coxph_imp(), that perform regression of continuous and categorical, univariate or multi-level data as well as right censored survival data. Model specification is similar to that of standard regression models in R and is described in detail in Section 6.5.

Based on the specified model formula and other arguments that are provided by the user, JointAI does some pre-processing of the data. It checks which variables are incomplete and/or time-varying, and identifies their measurement level in order to specify appropriate (imputation) models. Interactions and functional forms of variables are detected in the model formula and corresponding design matrices for the various parts of the model are created.

MCMC sampling is performed by the program JAGS (Plummer 2003). The JAGS model, a data list containing all necessary parts of the data, and user-specified settings for the MCMC sampling (further described in Section 6.6) are passed to JAGS via the R package rjags (Plummer 2018).

The above named main functions, from here on abbreviated as *_imp(), all return an object of class JointAI. Summary and plotting methods for JointAI objects, as well as functions to evaluate convergence and precision of the MCMC samples, to predict from JointAI objects and to export imputed values are discussed in Section 6.7.

Currently, the package works under the assumption of a Missing At Random (MAR) missingness process (Rubin 1976, 1987). When this assumption holds, it is valid to exclude cases with missing values in the outcome in Bayesian inference. Hence, our focus here is on missing covariate values. Nevertheless, JointAI can handle missing values in the outcome; they are implicitly imputed using the specified analysis model.

## 6.4 Example Data

To illustrate the functionality of JointAI we use three datasets that are part of this package or the package survival (Therneau 2015; Terry M. Therneau and Patricia M. Grambsch 2000). The first dataset, the NHANES data, contains data from a cross-sectional cohort study, whereas the second dataset (simLong) is a simulated dataset based on a longitudinal cohort study in toddlers. The third dataset (lung) contains data on survival of patients with advanced lung cancer.

### 6.4.1 The NHANES Data

The NHANES data is a subset of observations from the 2011 – 2012 wave of the National Health and Nutrition Examination Survey (National Center for Health Statistics (NCHS) 2011) and contains information on 186 men and women between 20 and 80 years of age. The variables contained in this dataset are

• SBP: systolic blood pressure in mmHg; complete
• gender: male vs female; complete
• age: in years; complete
• race: 5 unordered categories; complete
• WC: waist circumference in cm; 1.1% missing
• alc: alcohol consumption; <1 drink per week vs >= 1 drink per week; 18.3% missing
• educ: educational level; low vs high; complete
• creat: creatinine concentration in mg/dL; 4.5% missing
• albu: albumin concentration in g/dL; 4.3% missing
• uricacid: uric acid concentration in mg/dL; 4.3% missing
• bili: bilirubin concentration in mg/dL; 4.3% missing
• occup: occupational status; 3 unordered categories; 15.1% missing
• smoke: smoking status; 3 ordered categories; 3.8% missing

Figure 6.1 shows the distributions of all variables in the NHANES data, together with the proportion of missing values for incomplete variables, and can be obtained with the function plot_all(). Arguments fill and border allow the colours to be changed, the number of rows and columns can be adapted using nrow and/or ncol, and additional arguments can be passed to hist() and barplot() via "...".
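For example, a figure of this kind could be produced with a call like the following (the colour and layout choices are illustrative):

```r
library(JointAI)

plot_all(NHANES, nrow = 4,
         fill = grey(0.8),   # fill colour of the histogram/bar plots
         border = "black")   # border colour
```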

The pattern of missing values in the NHANES data is shown in Figure 6.2. Each row represents a pattern of missing values, where observed (missing) values are depicted with dark (light) colour. The frequency with which each of the patterns is observed is given on the right margin, the number of missing values in each variable is given underneath the plot. Rows and columns are ordered by the number of cases per pattern (decreasing) and the number of missing values (increasing). The first row, for instance, shows that there are 116 complete cases, the second row that there are 29 cases for which only alc is missing. Furthermore, it is apparent that creat, albu, uricacid and bili are always missing together. Since these variables are all measured in serum this is not surprising.

The plot of the missing data pattern can be obtained with md_pattern(). Again, arguments color and border allow us to change colours, and arguments such as legend.position, print_xaxis and print_yaxis permit further customization.

A matrix representation of the missing data pattern can be obtained by setting pattern = TRUE. There, missing and observed values are represented with "0" and "1", respectively.
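The two representations could hypothetically be obtained as:

```r
library(JointAI)

md_pattern(NHANES)                         # plot of the missing data pattern
mdp <- md_pattern(NHANES, pattern = TRUE)  # matrix: 1 = observed, 0 = missing
head(mdp)
```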

### 6.4.2 The simLong Data

The simLong data is a simulated dataset mimicking a longitudinal cohort study of 200 mother-child pairs. It contains the following baseline (i.e., not time-varying) covariates

• GESTBIR: gestational age at birth in weeks; complete
• ETHN: ethnicity; European vs other; 2.8% missing
• AGE_M: age of the mother at intake; complete
• HEIGHT_M: height of the mother in cm; 2.0% missing
• PARITY: number of times the mother has given birth; 0 vs >=1; 2.4% missing
• SMOKE: smoking status of the mother during pregnancy; 3 ordered categories; 12.2% missing
• EDUC: educational level of the mother; 3 ordered categories; 7.8% missing
• MARITAL: marital status; 3 unordered categories; 7.0% missing
• ID: subject identifier

and seven longitudinal variables:

• time: measurement occasion/visit (by design children should have been measured at 1, 2, 3, 4, 7, 11, 15, 20, 26, 32, 40 and 50 months of age); complete
• age: child’s age at measurement time in months
• hgt: child’s height in cm; 20.0% missing
• wgt: child’s weight in gram; 8.8% missing
• bmi: child’s BMI (body mass index) in kg/m2; 21.6% missing
• hc: child’s head circumference in cm; 23.6% missing
• sleep: child’s sleeping behaviour; 3 ordered categories; 24.7% missing

Figure 6.3 shows the longitudinal profiles of hgt, wgt, bmi and hc over age. All four variables have a non-linear pattern over time. Distributions of all variables in the simLong data are displayed in Figure 6.4. Here, arguments use_level and idvar of the function plot_all() are used to display baseline (level-2) covariates on the subject level instead of the observation level:

plot_all(simLong, use_level = TRUE, idvar = "ID", ncol = 4)

The missing data pattern of the simLong data is shown in Figure 6.5. For readability, it is plotted separately for baseline (left) and longitudinal (right) covariates.

### 6.4.3 The lung Data

For demonstration of the use of JointAI for the analysis of survival data, we use the dataset lung that is included in the R package survival. It contains data of 228 patients with advanced lung cancer and includes the following variables:

• inst: institution code; complete
• time: survival time in days; complete
• status: event indicator (1: censored, 2: dead); complete
• age: patient’s age in years; complete
• sex: male (1) vs female (2); complete
• ph.ecog: ECOG performance score (describes how the disease impacts the patient’s daily life); scale from 0 (no impact) to 5 (dead); 0.4% missing
• ph.karno: Karnofsky performance score as rated by physician (describes the degree of a patient’s impairment by the disease); scale from 0 (dead) to 100 (no impairment); 0.4% missing
• pat.karno: Karnofsky performance score as rated by patient; 1.3% missing
• meal.cal: kilocalories consumed at meals; 20.6% missing
• wt.loss: weight loss over the last six months in kg; 6.1% missing

The distribution of the observed values and the missing data pattern of the lung data are shown in Figures 6.6 and 6.7.

## 6.5 Model Specification

The main arguments of the functions

lm_imp(formula, data,
n.chains = 3, n.adapt = 100, n.iter = 0, thin = 1, ...)

glm_imp(formula, family, data,
n.chains = 3, n.adapt = 100, n.iter = 0, thin = 1, ...)

clm_imp(formula, data,
n.chains = 3, n.adapt = 100, n.iter = 0, thin = 1, ...)

lme_imp(fixed, data, random,
n.chains = 3, n.adapt = 100, n.iter = 0, thin = 1, ...)

glme_imp(fixed, data, random, family,
n.chains = 3, n.adapt = 100, n.iter = 0, thin = 1, ...)

clmm_imp(fixed, data, random,
n.chains = 3, n.adapt = 100, n.iter = 0, thin = 1, ...)

survreg_imp(formula, data,
n.chains = 3, n.adapt = 100, n.iter = 0, thin = 1, ...)

coxph_imp(formula, data,
n.chains = 3, n.adapt = 100, n.iter = 0, thin = 1, ...)

i.e., formula, data, family, fixed, and random, are used analogously to the specification in the standard complete data functions lm() and glm() from package stats, lme() from package nlme (Pinheiro et al. 2018), and survreg() and coxph() from package survival.

For the description of the remaining arguments see Section 6.6.

The arguments formula (in lm_imp(), glm_imp() and clm_imp()) and fixed (in lme_imp(), glme_imp() and clmm_imp()) take a standard two-sided formula object, where an intercept is added automatically. For the use of the argument random, see Section 6.5.3.

Survival models expect the left hand side of formula to be a survival object (created with the function Surv() from package survival; see Section 6.5.4).

The argument family enables the choice of a distribution and link function from a range of options when using glm_imp() or glme_imp(). The implemented options are given in Table 6.1.

Table 6.1: Possible choices for the model family and link functions in glm_imp() and glme_imp()}.
family
gaussian with links: identity, log
binomial with links: logit, probit, log, cloglog
Gamma with links: identity, log
poisson with links: log, identity

### 6.5.1 Interactions

In JointAI, interactions between any type of variables (observed, incomplete, time-varying) can be handled. When an incomplete variable is involved, the interaction term is re-calculated within each iteration of the MCMC sampling, using the imputed values from the current iteration. Interaction terms involving incomplete variables should hence not be pre-calculated as an additional variable since this would lead to inconsistent imputed values of main effect and interaction term.

Interactions between multiple variables can be specified using parentheses; for higher lever interactions the "^" operator can be used:

mod1a <- glm_imp(educ ~ gender * (age + smoke + creat),
data = NHANES, family = binomial())

mod1b <- glm_imp(educ ~ gender + (age + smoke + creat)^3,
data = NHANES, family = binomial())

### 6.5.2 Non-linear Functional Forms

In practice, associations between outcome and covariates do not always meet the standard assumption of linearity. Often, assuming a logarithmic, quadratic or other non-linear effect is more appropriate.

For completely observed covariates, JointAI can handle any common type of function implemented in R, including splines, e.g., using ns() or bs() from the package splines. Since functions involving variables that have missing values need to be re-calculated in each iteration of the MCMC sampling, currently, only functions that are available in JAGS can be used for incomplete variables. Those functions include:

• log(), exp()
• sqrt(), polynomials (using I())
• abs()
• sin(), cos()
• algebraic operations involving one or multiple (in)complete variables, as long as the formula can be interpreted by JAGS.

The list of functions implemented in JAGS can be found in the JAGS user manual available at https://sourceforge.net/projects/mcmc-jags/files/Manuals/.

Some examples (that do not necessarily have a meaningful interpretation or good model fit) are:

mod2a <- lm_imp(SBP ~ age + gender + abs(bili - creat), data = NHANES)
library("splines")
mod2b <- lm_imp(SBP ~ ns(age, df = 2) + gender + I(bili^2) + I(bili^3),
data = NHANES)
mod2c <- lm_imp(SBP ~ age + gender + I(creat/albu^2), data = NHANES,
trunc = list(albu = c(1e-5, 1e5)))
# (for explantion of the argument trunc see section below)
mod2d <- lm_imp(SBP ~ bili + sin(creat) + cos(albu), data = NHANES)

It is also possible to nest a function in another function.

mod2e <- lm_imp(SBP ~ age + gender + sqrt(exp(creat)/2), data = NHANES)

#### 6.5.2.1 Functions with Restricted Support

When a function of an incomplete variable has restricted support, e.g., $$\log(x)$$ is only defined for $$x > 0$$, or, as in mod2c from above, I(creat/albu^2) can not be calculated for albu = 0, the distribution used to impute that variable needs to comply with these restrictions. This can either be achieved by truncating the distribution, using the argument trunc, or by selecting a distribution that meets the restrictions.

Example: When using a $$\log$$ transformation for the covariate bili, we can use the default imputation method "norm" (a normal distribution) and truncate it by specifying trunc = list(bili = c(lower, upper)), where lower and upper are the smallest and largest values allowed:

mod3a <- lm_imp(SBP ~ age + gender + log(bili) + exp(creat),
trunc = list(bili = c(1e-5, 1e10)), data = NHANES)

Truncation always requires the specification of both limits. Since $$-\text{Inf}$$ and $$\text{Inf}$$ are not accepted, a value far enough outside the range of the variable (here: $$1e10$$) can be selected for the second boundary of the truncation interval.

Alternatively, we may choose a model for the incomplete variable (using the argument models; for more details see Section 6.5.5) that only imputes positive values such as a log-normal distribution ("lognorm") or a gamma distribution ("gamma"):

mod3b <- lm_imp(SBP ~ age + gender + log(bili) + exp(creat),
models = c(bili = 'lognorm'), data = NHANES)

mod3c <- lm_imp(SBP ~ age + gender + log(bili) + exp(creat),
models = c(bili = 'gamma'), data = NHANES)

#### Functions Not Available in R

It is possible to use functions that have different names in R and JAGS, or that do exist in JAGS, but not in R, by defining a new function in R that has the name of the function in JAGS.

Example:
In JAGS the inverse logit transformation is defined in the function ilogit. In base R, there is no function ilogit, but the inverse logit is available as the distribution function of the logistic distribution plogis(). Thus, we can define the function ilogit() as

ilogit <- plogis

and use it in the model formula

mod4a <- lm_imp(SBP ~ age + gender + ilogit(creat), data = NHANES)

#### A Note on What Happens Inside JointAI

When a function of a complete or incomplete variable is used in the model formula, the main effect of that variable is automatically added as an auxiliary variable (more on auxiliary variables in Section 6.5.6), and only the main effects are used as predictors in the imputation models.

In mod2b, for example, the spline of age is used as predictor for SBP, but in the imputation model for bili, age enters with a linear effect. This can be checked using the function list_models(), which prints a list of the covariate models used in a JointAI model. Here, we are only interested in the predictor variables, and, hence, suppress printing of information on prior distributions, regression coefficients and other parameters by setting priors, regcoef and otherpars to FALSE:

list_models(mod2b, priors = FALSE, regcoef = FALSE, otherpars = FALSE)
## Normal imputation model for 'bili'
## * Predictor variables:
##   (Intercept), genderfemale, age

When a function of a variable is specified as an auxiliary variable, this function is used in the imputation models. For example, in mod4b, waist circumference (WC) is not part of the model for SBP, and the quadratic term I(WC^2) is used in the linear predictor of the imputation model for bili:

mod4b <- lm_imp(SBP ~ age + gender + bili, auxvars = "I(WC^2)",
data = NHANES)

list_models(mod4b, priors = FALSE, regcoef = FALSE, otherpars = FALSE)
## Normal imputation model for 'WC'
## * Predictor variables:
##   (Intercept), age, genderfemale
##
## Normal imputation model for 'bili'
## * Predictor variables:
##   (Intercept), age, genderfemale, I(WC^2)

Incomplete variables are always imputed on their original scale, i.e., in mod2b the variable bili is imputed and the quadratic and cubic versions are calculated from the imputed values. Likewise, creat and albu in mod2c are imputed separately, and I(creat/albu^2) is calculated from the imputed (and observed) values. To ensure consistency between variables, functions involving incomplete variables should be specified as part of the model formula and not be pre-calculated as separate variables.

### 6.5.3 Multi-level Structure and Longitudinal Covariates

In multi-level models, additional to the fixed effects structure specified by the argument fixed, a random effects structure needs to be provided via the argument random.

Analogous to the specification of the argument random in lme(), it takes a one-sided formula starting with a "~", and the grouping variable is separated by "|". A random intercept is added automatically and only needs to be specified in a random intercept only model.

A few examples:

• random = ~1 | id: random intercept only, with id as grouping variable
• random = ~time | id: random intercept and slope for variable time
• random = ~time + I(time^2) | id: random intercept, slope and quadratic random effect for time
• random = ~time + x | id random intercept, random slope for time and random effect for variable x

It is possible to use splines in the random effects structure if there are no missing values in the variables involved, e.g.:

mod5 <- lme_imp(bmi ~ GESTBIR + ETHN + HEIGHT_M + ns(age, df = 2),
random = ~ ns(age, df = 2) | ID, data = simLong)

Longitudinal (“time-varying”; level-1) covariates can be used in the fixed or random effects and will be imputed if they contain any missing values. Currently only one level of nesting is possible.

### 6.5.4 Survival Models

JointAI provides two functions to analyse survival data with incomplete covariates: survreg_imp() and coxph_imp(). Analogous to the complete data versions of these functions from the package survival, the left-hand side of the model formula has to be a survival object specified using the function Surv().

Example:
To analyse the lung data, we can either use a parametric Weibull model or a Cox proportional hazards model. The survival package needs to be loaded for the function Surv().

library(survival)
mod6a <- survreg_imp(Surv(time, status) ~ age + sex + ph.karno +
meal.cal + wt.loss, data = lung, n.iter = 250)

mod6b <- coxph_imp(Surv(time, status) ~ age + sex + ph.karno + meal.cal +
wt.loss, data = lung, n.iter = 250)

Currently only right censored survival data and time-constant covariates can be handled and it is not yet possible to take into account strata, clustering or frailty terms.

### 6.5.5 Imputation / Covariate Model Types

JointAI automatically selects an (imputation) model for each of the incomplete baseline (level-2) or longitudinal (level-1) covariates, based on the class of the variable and the number of levels. The automatically selected types for baseline covariates are:

• norm: linear model (default for continuous variables)
• logit: binary logistic model (default for factors with two levels)
• multilogit: multinomial logit model (default for unordered factors with $$>2$$ levels)
• cumlogit: cumulative logit model (default for ordered factors with $$>2$$ levels)

The default methods for longitudinal covariates are:

• lmm: linear mixed model (default for continuous longitudinal variables)
• glmm_logit: logistic mixed model (default for longitudinal factors with two levels)
• clmm: cumulative logit mixed model (default for longitudinal ordered factors with >2 levels)

When a continuous variable has only two different values, it is automatically converted to a factor and imputed using a logistic model, unless an imputation model type is specified by the user. Variables of type logical are automatically converted to binary factors.

The imputation models that are chosen by default may not necessarily be appropriate for the data at hand, especially for continuous variables, which often do not comply with the assumptions of (conditional) normality.

Therefore, the following alternative imputation methods are available for continuous baseline covariates:

• lognorm: normal model for the log-transformed variable (right-skewed variables $$>0$$)
• gamma: gamma model with log-link (right-skewed variables $$>0$$)
• beta: beta model with logit-link (continuous variables in $$[0, 1]$$)

lognorm assumes a (conditional) normal distribution for the natural logarithm of the variable, but the variable enters the linear predictor of the analysis model (and possibly other imputation models) on its original scale.

For longitudinal covariates the following alternative model types are available:

• glmm_gamma: gamma mixed model with log-link (right-skewed variables $$>0$$)
• glmm_poisson: Poisson mixed model with log-link (count variables)

#### Specification of Imputation/Covariate Model Types

In models mod3b and mod3c in Section 6.5.2.1 we have already seen two examples in which the imputation model type was changed using the argument models, which takes a named vector. When the vector supplied to models only contains specifications for a subset of the incomplete and longitudinal covariates, default models are used for the remaining covariates. As explained in Section 6.2.2, models for completely observed longitudinal covariates only need to be specified when any baseline covariates have missing values.

mod7a <- lm_imp(SBP ~ age + gender + WC + alc + bili + occup + smoke,
models = c(WC = 'gamma', bili = 'lognorm'),
data = NHANES, n.iter = 100)

mod7a$models ## WC smoke bili occup alc ## "gamma" "cumlogit" "lognorm" "multilogit" "logit" The function get_models(), which finds and assigns the default imputation methods, can be called directly. get_models() has arguments • fixed: the fixed effects formula • random: the random effects formula (if necessary) • data: the dataset • auxvars: an optional vector of auxiliary variables • no_model: an optional vector of names of covariates for which no model will be specified mod7b_models <- get_models(bmi ~ GESTBIR + ETHN + HEIGHT_M + SMOKE + hc + MARITAL + ns(age, df = 2), random = ~ ns(age, df = 2) | ID, data = simLong) mod7b_models ##$models
##     HEIGHT_M         ETHN      MARITAL        SMOKE          age
##       "norm"      "logit" "multilogit"   "cumlogit"        "lmm"
##           hc
##        "lmm"
##
## $meth ## HEIGHT_M ETHN MARITAL SMOKE hc ## "norm" "logit" "multilogit" "cumlogit" "lmm" get_models() returns a list of two vectors: • models: containing all specified models • meth: containing the models for the incomplete variables only When there is a “time” variable in the model, such as age in our example (which is the age of the child at the time of the measurement), it may not be meaningful to specify a model for that variable. Especially when the “time” variable is pre-specified by the design of the study, it can usually be assumed to be independent of baseline covariates and a model for it has no useful interpretation. The argument no_model (in get_models() and in the main functions *_imp()) allows for the exclusion of models for such variables (as long as they are completely observed): mod7c_models <- get_models(bmi ~ GESTBIR + ETHN + HEIGHT_M + SMOKE + hc + MARITAL + ns(age, df = 2), random = ~ ns(age, df = 2) | ID, data = simLong, no_model = "age") # mod7c_models By excluding the model for age we implicitly assume that incomplete baseline variables are independent of age. #### Order of the Sequence of Imputation Models By default, models for level-1 covariates are specified after models for level-2 covariates, so that the level-2 covariates are used as predictor variables in the models for level-1 covariates but not vice versa. Within the two groups, models are ordered by the number of missing values (decreasing), so that the model for the variable with the most missing values has the most variables in its linear predictor. ### 6.5.6 Auxiliary Variables Auxiliary variables are variables that are not part of the analysis model but should be considered as predictor variables in the imputation models because they can inform the imputation of unobserved values. 
Good auxiliary variables are (Van Buuren 2012) • associated with an incomplete variable of interest, or are • associated with the missingness of that variable and • do not have too many missing values themselves. Importantly, they should be observed for a large proportion of the cases that have a missing value in the variable to be imputed. In *_imp(), auxiliary variables can be specified with the argument auxvars, which is a vector containing the names of the auxiliary variables. Example: We might consider the variables educ and smoke as predictors for the imputation of occup: mod8a <- lm_imp(SBP ~ gender + age + occup, auxvars = c('educ', 'smoke'), data = NHANES, n.iter = 100) The variables educ and smoke are not included in the analysis model (as can be seen when printing the posterior mean of the regression coefficients of the analysis model with coef()): coef(mod8a) ## (Intercept) genderfemale age ## 105.5937694 -5.8192149 0.3781785 ## occuplooking for work occupnot working ## 4.0294291 -0.7349338 They are, however, used as predictors in the imputation for occup and imputed themselves (if they have missing values): list_models(mod8a, priors = FALSE, regcoef = FALSE, otherpars = FALSE, refcat = FALSE) ## Cumulative logit imputation model for 'smoke' ## * Predictor variables: ## genderfemale, age, educhigh ## ## Multinomial logit imputation model for 'occup' ## * Predictor variables: ## (Intercept), genderfemale, age, educhigh, smokeformer, smokecurrent #### Functions of Variables as Auxiliary Variables As shown above in mod3e, it is possible to specify functions of auxiliary variables. In that case, the auxiliary variable is not considered as a linear effect but as specified by the function. Note that omitting auxiliary variables from the analysis model implies that the outcome is independent of these variables, conditional on the other variables in the model. 
If this is not true, the model is misspecified which may lead to biased results (similar to leaving a confounding variable out of a model). ### 6.5.7 Reference Values for Categorical Covariates In JointAI, dummy coding is used when categorical variables enter a linear predictor, i.e., a binary variable is created for each category, except the reference category. These binary variables have value one when that category was observed and zero otherwise. Contrary to the behaviour in base R, this coding is also used for ordered factors. By default, the first category of a categorical variable (ordered or unordered) is used as reference, however, this may not always allow the desired interpretation of the regression coefficients. Moreover, when categories are unbalanced, setting the largest group as reference may result in better mixing of the MCMC chains. Therefore, JointAI allows specification of the reference category separately for each variable, via the argument refcats. Changes in refcats will not impact the imputation of the respective variable, but change categories for which dummy variables are included in the linear predictor of the analysis model or other covariate models. #### Setting Reference Categories for All Variables To specify the choice of reference category globally for all variables in the model, refcats can be set as • refcats = "first" • refcats = "last" • refcats = "largest" For example: mod9a <- lm_imp(SBP ~ gender + age + race + educ + occup + smoke, refcats = "largest", data = NHANES) #### Setting Reference Categories for Individual Variables Alternatively, refcats takes a named vector, in which the reference category for each variable can be specified either by its number or its name, or one of the three global types: "first", "last" or "largest". For variables for which no reference category is specified in the list the default is used. 
mod9b <- lm_imp(SBP ~ gender + age + race + educ + occup + smoke, refcats = list(occup = "not working", race = 3, educ = "largest"), data = NHANES) To facilitate specification of the reference categories, the function set_refcat() can be used. It prints the names of the categorical variables that are selected by • a specified model formula and/or • a vector of auxiliary variables, or • a vector naming covariates, or all categorical variables in the data if only the argument data is provided, and asks a number of questions which the user needs to reply to via input of a number. refs_mod9 <- set_refcat(NHANES, formula = formula(mod9b)) ## ## How do you want to specify the reference categories? ## ## 1: Use the first category for each variable. ## 2: Use the last category for each variabe. ## 3: Use the largest category for each variable. ## 4: Specify the reference categories individually. When option 4 is chosen, a choice is given for each categorical variable, for example: #> The reference category for "race" should be #> #> 1: Mexican American #> 2: Other Hispanic #> 3: Non-Hispanic White #> 4: Non-Hispanic Black #> 5: other After specification of the reference categories, the determined specification for the argument refcats is printed: #> In the JointAI model specify: #> refcats = c(gender = 'female', race = 'Non-Hispanic White', #> educ = 'low', occup = 'not working', smoke = 'never') #> #> or use the output of this function. set_refcat() also returns a named vector that can be passed to the argument refcats: mod9c <- lm_imp(SBP ~ gender + age + race + educ + occup + smoke, refcats = refs_mod9, data = NHANES) ### 6.5.8 Hyperparameters In the Bayesian framework, parameters are random variables for which a distribution needs to be specified. These distributions depend on parameters themselves, i.e., on hyperparameters. The function default_hyperpars() returns a list containing the default hyperparameters used in a JointAI model (see Appendix 6.A). 
To change the hyperparameters in a JointAI model, the list obtained from default_hyperpars() can be edited and passed to the argument hyperpars in *_imp(). mu_reg_* and tau_reg_* refer to the mean and precision of the prior distribution for regression coefficients. shape_tau_* and rate_tau_* are the shape and rate parameters of a gamma distribution that is used as prior for precision parameters. RinvD is the scale matrix in the Wishart prior for the inverse of the random effects design matrix D, and KinvD is the number of degrees of freedom in that distribution. shape_diag_RinvD and rate_diag_RinvD are the shape and rate parameters of the gamma prior of the diagonal elements of RinvD. In random effects models with only one random effect, a gamma prior is used instead of the Wishart distribution for the inverse of D. The hyperparameters mu_reg_surv and tau_reg_surv are used in survreg_imp() as well as coxph_imp(). For the Cox proportional hazards model, the hyperparameters c, r and eps refer to the confidence in the prior guess for the hazard function, failure rate per unit time ($$\lambda_0(t)^*$$ in Section 6.2.1) and time increment (for numerical stability), respectively. ### 6.5.9 Scaling When variables are measured on very different scales this can result in slow convergence and bad mixing. Therefore, JointAI automatically scales continuous variables to have mean zero and standard deviation one. Results (parameters and imputed values) are transformed back to the original scale when the results are printed or imputed values are exported. In some settings, however, it is not possible to scale continuous variables. This is the case for incomplete variables that enter a linear predictor in a function and variables that are imputed with models that are defined on a subset of the real line (i.e., with a gamma or a beta distribution). Variables that are imputed with a log-normal distribution are scaled, but not centred. 
To prevent scaling, the argument scale_vars can be set to FALSE. When a vector of variable names is supplied to scale_vars, only those variables are scaled. By default, only the MCMC samples that are scaled back to the scale of the data are stored in a JointAI object. When the argument keep_scaled_mcmc = TRUE, the scaled sample is also kept. ### 6.5.10 Ridge Regression Using the argument ridge = TRUE it is possible to impose a ridge penalty on the parameters of the analysis model, shrinking these parameters towards zero. This is done by specification of a $$\text{Gamma}(0.01, 0.01)$$ prior for the precision of the regression coefficients $$\beta$$ instead of setting it to a fixed (small) value. ## 6.6 MCMC Settings The functions *_imp() have a number of arguments that specify settings for the MCMC sampling: • n.chains: number of MCMC chains • n.adapt: number of iterations in the adaptive phase • n.iter: number of iterations in the sampling phase • thin: thinning degree • monitor_params: parameters/nodes to be monitored • seed: optional seed value for reproducibility • inits: initial values • parallel: whether MCMC chains should be sampled in parallel • ncores: how many cores are available for parallel sampling The first four, as well as the following two parameters, are passed directly to functions from the R package rjags: • quiet: should printing of information be suppressed? • progress.bar: type of progress bar ("text", "gui" or "none") In the following sections, the arguments listed above are explained in more detail and examples are given. ### 6.6.1 Number of Chains, Iterations and Samples #### Number of Chains To evaluate convergence of MCMC chains it is helpful to create multiple chains that have different starting values. More information on how to evaluate convergence and the specification of initial values can be found in Sections 6.7.3.1 and 6.6.3 respectively. The argument n.chains selects the number of chains (by default, n.chains = 3). 
For calculating the model summary, multiple chains are merged. #### Adaptive Phase JAGS has an adaptive mode, in which samplers are optimized (for example the step size is adjusted). Samples obtained during the adaptive mode do not form a Markov chain and are discarded. The argument n.adapt controls the length of this adaptive phase. The default value for n.adapt is 100, which works well in many of the examples considered here. Complex models may require longer adaptive phases. If the adaptive phase is not sufficient for JAGS to optimize the samplers, a warning message will be printed (see example below). #### Sampling Iterations n.iter specifies the number of iterations in the sampling phase, i.e., the length of the MCMC chain. How many samples are required to reach convergence and to have sufficient precision (see also Section 6.7.3.2) depends on the complexity of data and model, and may range from as few as 100 to several million. #### Thinning In settings with high autocorrelation, it may take many iterations before a sample is created that sufficiently represents the whole range of the posterior distribution. Processing of such long chains can be slow and cause memory issues. The parameter thin allows the user to specify if and how much the MCMC chains should be thinned out before storing them. By default thin = 1 is used, which corresponds to keeping all values. A value thin = 10 would result in keeping every 10th value and discarding all other values. #### 6.6.1.1 Example: Default Settings Using the default settings n.adapt = 100 and thin = 1, and 100 sampling iterations, a simple model would be mod10a <- lm_imp(SBP ~ alc, data = NHANES, n.iter = 100) The relevant part of the model summary (obtained with summary()) shows that the first 100 iterations (adaptive phase) were discarded, the 100 iterations that follow form the posterior sample, thinning was set to “1” and there are three chains. ## [...] 
## MCMC settings: ## Iterations = 101:200 ## Sample size per chain = 100 ## Thinning interval = 1 ## Number of chains = 3 #### Example: Insufficient Adaptation Phase mod10b <- lm_imp(SBP ~ alc, data = NHANES, n.adapt = 10, n.iter = 100) ## Warning in rjags::jags.model(file = modelfile, data = data_list, inits ## = inits, : Adaptation incomplete ## NOTE: Stopping adaptation Specifying n.adapt = 10 results in a warning message. The relevant part of the model summary from the resulting model is: ## [...] ## Iterations = 11:110 ## Sample size per chain = 100 ## Thinning interval = 1 ## Number of chains = 3 #### Example: Thinning mod10c <- lm_imp(SBP ~ alc, data = NHANES, n.iter = 500, thin = 10) Here, iterations 110 until 600 are used in the output, but due to a thinning interval of ten, the resulting MCMC chains contain only 50 samples instead of 500, that is, the samples from iteration 110, 120, 130, ## [...] ## Iterations = 110:600 ## Sample size per chain = 50 ## Thinning interval = 10 ## Number of chains = 3 ### 6.6.2 Parameters to Follow Since JointAI uses JAGS (Plummer 2003) for performing the MCMC sampling, and JAGS only saves the values of MCMC chains for those nodes which the user has specified should be monitored, this is also the case in JointAI. For this purpose, the main functions *_imp() have an argument monitor_params, which takes a named list (or a named vector) with possible entries given in Table 6.2. This table contains a number of keywords that refer to (groups of) nodes. Each of the keywords works as a switch and can be specified as TRUE or FALSE (with the exception of other). Table 6.2: Keywords and names of (groups of) nodes that can be specified to be monitored using the argument monitor_params. name/keyword what is monitored analysis_main betas and sigma_y (and D) betas regression coefficients of the analysis model tau_y precision of the residuals from the analysis model sigma_y std. 
deviation of the residuals from the analysis model analysis_random ranef, D, invD, RinvD ranef random effects D covariance matrix of the random effects invD inverse of D RinvD scale matrix in Wishart prior for invD imp_pars alphas, tau_imp, gamma_imp, delta_imp alphas regression coefficients in the imputation models tau_imp precision of the residuals from imputation models gamma_imp intercepts in ordinal imputation models delta_imp increments of ordinal intercepts imps imputed values other additional parameters #### Parameters of the Analysis Model The default setting is monitor_params = c(analysis_main = TRUE), i.e., only the main parameters of the analysis model are monitored, and monitoring is switched off for all other parameters. Main parameters are the regression coefficients of the analysis model (beta), depending on the model type, the residual standard deviation (sigma_y), and, for mixed models, the random effects variance-covariance matrix D. The function parameters() returns the parameters specified to be followed (also for models where no MCMC sampling was performed, i.e., when n.iter = 0 and n.adapt = 0). We use it here to demonstrate the effect that different choices for monitor_params have. For example: mod11a <- lm_imp(SBP ~ gender + WC + alc + creat, data = NHANES, n.adapt = 0) parameters(mod11a) ## [1] "(Intercept)" "genderfemale" "WC" "alc>=1" ## [5] "creat" "sigma_SBP" #### Parameters of the Covariate Models and Imputed Values The parameters of the models for the incomplete variables can be selected with monitor_params = c(imp_pars = TRUE). This will set monitors for the regression coefficients (alpha) and other parameters, such as precision (tau_*) and intercepts and increments (gamma_* and delta_*) in cumulative logit models. 
mod11b <- lm_imp(SBP ~ age + WC + alc + smoke + occup, data = NHANES, n.adapt = 0, monitor_params = c(imp_pars = TRUE, analysis_main = FALSE)) parameters(mod11b) ## [1] "alpha" "tau_WC" "gamma_smoke" "delta_smoke" To generate (multiple) imputed datasets to be used for further analyses, the imputed values need to be monitored. This can be achieved by setting monitor_params = c(imps = TRUE). mod11c <- lm_imp(SBP ~ age + WC + alc + smoke + occup, data = NHANES, n.adapt = 0, monitor_params = c(imps = TRUE, analysis_main = FALSE)) Extraction of multiple imputed datasets from a JointAI model is described in Section 6.7.6. #### Random Effects For mixed models, analysis_main also includes the random effects variance-covariance matrix D. Setting analysis_random = TRUE will switch on monitoring for the random effects (ranef), random effects variance-covariance matrix (D), inverse of the random effects variance-covariance matrix (invD) and the diagonal of the scale matrix of the Wishart-prior of invD (RinvD). 
mod11d <- lme_imp(bmi ~ age + EDUC, random = ~age | ID, data = simLong,
                  n.adapt = 0,
                  monitor_params = c(analysis_random = TRUE))

parameters(mod11d)
##  [1] "(Intercept)" "EDUCmid"     "EDUClow"     "age"         "sigma_bmi"
##  [6] "b"           "invD[1,1]"   "invD[1,2]"   "invD[2,2]"   "D[1,1]"
## [11] "D[1,2]"      "D[2,2]"      "RinvD[1,1]"  "RinvD[2,2]"

It is possible to select only a subset of the random effects parameters by specifying them directly, e.g.,

mod11e <- lme_imp(bmi ~ age + EDUC, random = ~age | ID, data = simLong,
                  n.adapt = 0,
                  monitor_params = c(analysis_main = TRUE, RinvD = TRUE))

parameters(mod11e)
##  [1] "(Intercept)" "EDUCmid"     "EDUClow"     "age"         "sigma_bmi"
##  [6] "D[1,1]"      "D[1,2]"      "D[2,2]"      "RinvD[1,1]"  "RinvD[2,2]"

or by switching unwanted parts of analysis_random off, e.g.,

mod11f <- lme_imp(bmi ~ age + EDUC, random = ~age | ID, data = simLong,
                  n.adapt = 0,
                  monitor_params = c(analysis_main = TRUE,
                                     analysis_random = TRUE,
                                     RinvD = FALSE, ranef = FALSE))

parameters(mod11f)
##  [1] "(Intercept)" "EDUCmid"     "EDUClow"     "age"         "sigma_bmi"
##  [6] "invD[1,1]"   "invD[1,2]"   "invD[2,2]"   "D[1,1]"      "D[1,2]"
## [11] "D[2,2]"

#### Other Parameters

The element other in monitor_params allows for the specification of one or multiple additional parameters to be monitored. When other is used with more than one element, monitor_params has to be a list. Here, as an example, we monitor the probability of being in the alc>=1 group for subjects one through three and the expected value of the distribution of creat for the first subject.

mod11g <- lm_imp(SBP ~ gender + WC + alc + creat, data = NHANES,
                 n.adapt = 0,
                 monitor_params = list(analysis_main = FALSE,
                                       other = c('p_alc[1:3]', "mu_creat[1]")))

parameters(mod11g)
## [1] "p_alc[1:3]"  "mu_creat[1]"

Even though this example may not be particularly meaningful, in cases of convergence issues it can be helpful to be able to monitor any node of the model, not just the ones that are typically of interest.
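The different elements of monitor_params can also be combined in one call. As a sketch along the lines of the models above (the model name mod11h is hypothetical), the main analysis parameters, imputation model parameters and imputed values could be monitored together:

```r
# Sketch: combining several monitoring options (mod11h is a hypothetical name)
mod11h <- lm_imp(SBP ~ gender + WC + alc + creat, data = NHANES,
                 n.adapt = 0,
                 monitor_params = c(analysis_main = TRUE,
                                    imp_pars = TRUE,
                                    imps = TRUE))

parameters(mod11h)
```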
### 6.6.3 Initial Values

Initial values are the starting point for the MCMC sampler. Setting good initial values, i.e., values that are likely under the posterior distribution, can speed up convergence.

By default, the argument inits = NULL, which means that initial values are generated automatically by JAGS. It is also possible to supply initial values directly as a list or as a function. Initial values can be specified for every unobserved node, that is, parameters and missing values, but it is possible to only specify initial values for a subset of nodes.

When the initial values provided by the user do not have elements named ".RNG.name" or ".RNG.seed", JointAI will add those elements, which specify the name and seed value of the random number generator used for each chain. The argument seed allows the specification of a seed value with which the starting values of the random number generator, and, hence, the values of the MCMC sample, can be reproduced.

#### Initial Values in a List of Lists

A list containing initial values should have the same length as the number of chains, where each element is a named list of initial values. Moreover, initial values should differ between the chains. For example, to create initial values for the parameter vector beta and the precision parameter tau_SBP for two chains, the following syntax could be used:

init_list <- lapply(1:2, function(i) {
  list(beta = rnorm(4),
       tau_SBP = rgamma(1, 1, 1))
})

init_list
## [[1]]
## [[1]]$beta
## [1] 0.2995096 0.2123710 0.6478957 0.8952516
## 
## [[1]]$tau_SBP
## [1] 1.000624
## 
## 
## [[2]]
## [[2]]$beta
## [1]  2.2559981  0.9786635 -1.2725176 -0.7251253
## 
## [[2]]$tau_SBP
## [1] 0.05501739

mod12a <- lm_imp(SBP ~ gender + age + WC, data = NHANES,
                 n.chain = 2, inits = init_list)

The user-provided lists of initial values are stored in the JointAI object (together with starting values for the random number generator) and can be accessed via mod12a$mcmc_settings$inits.

#### Initial Values as a Function

Initial values can be specified as a function. The function should either take no arguments or a single argument called chain, and return a named list that supplies values for one chain. For example, to create initial values for the parameter vectors beta and alpha:

inits_fun <- function() {
  list(beta = rnorm(4),
       alpha = rnorm(3))
}

inits_fun()
## $beta
## [1] -1.6045542  0.1872611  1.0167161 -0.4272887
## 
## $alpha
## [1] 0.8542140 0.6391477 0.4720952

mod12b <- lm_imp(SBP ~ gender + age + WC, data = NHANES,
                 inits = inits_fun)

mod12b$mcmc_settings$inits
## [[1]]
## [[1]]$beta
## [1] -0.07058338  0.41772091 -1.66236440  1.24957652
## 
## [[1]]$alpha
## [1] -0.7204577  0.1424769 -1.0114044
## 
## [[1]]$.RNG.name
## [1] "base::Wichmann-Hill"
## 
## [[1]]$.RNG.seed
## [1] 77704
## 
## 
## [[2]]
## [[2]]$beta
## [1] -0.50236788 -0.01997157  1.40425944  1.18807193
## 
## [[2]]$alpha
## [1] -0.8065902 -0.9709539 -0.7020397
## 
## [[2]]$.RNG.name
## [1] "base::Mersenne-Twister"
## 
## [[2]]$.RNG.seed
## [1] 29379
## 
## 
## [[3]]
## [[3]]$beta
## [1] -0.3516978 -0.9144069 -1.7397631 -0.1083395
## 
## [[3]]$alpha
## [1]  0.9067457 -0.4471829  0.1170837
## 
## [[3]]$.RNG.name
## [1] "base::Mersenne-Twister"
## 
## [[3]]$.RNG.seed
## [1] 83619

When a function is supplied, the function is evaluated by JointAI and the resulting list is stored in the JointAI object.

#### 6.6.3.1 For which Nodes can Initial Values be Specified?

Initial values can be specified for all unobserved stochastic nodes, i.e., parameters or unobserved data for which a distribution is specified in the JAGS model. They have to be supplied in the format of the parameter or unobserved value in the JAGS model. To find out which nodes there are in a model and in which form they have to be specified, the function coef() from package rjags can be used to obtain a list with the current values of the MCMC chains (by default the first chain) from a JAGS model object. This object is contained in a JointAI object under the name model. Elements of the initial values should have the same structure as the elements in this list of current values.

##### Example:

We are using a longitudinal model and the simLong data in this example. Here we only show some relevant parts of the output.

mod12c <- lme_imp(bmi ~ time + HEIGHT_M + hc + SMOKE,
                  random = ~ time | ID,
                  no_model = 'time', data = simLong)

# coef(mod12c$model)

RinvD is the scale matrix in the Wishart prior for the inverse of the random effects variance-covariance matrix D. In the data that is passed to JAGS (which is stored in the element data_list in a JointAI object), this matrix is specified as a diagonal matrix, with unknown diagonal elements:

mod12c$data_list['RinvD']

## $RinvD
##      [,1] [,2]
## [1,]   NA    0
## [2,]    0   NA

These diagonal elements are estimated in the model and have a gamma prior. The corresponding part of the JAGS model is:

## [...]
##   # Priors for the covariance of the random effects
##   for (k in 1:2){
##     RinvD[k, k] ~ dgamma(shape_diag_RinvD, rate_diag_RinvD)
##   }
##   invD[1:2, 1:2] ~ dwish(RinvD[ , ], KinvD)
##   D[1:2, 1:2] <- inverse(invD[ , ])
## [...]

The element RinvD in the initial values has to be a $$2\times2$$ matrix, with positive values on the diagonal and NA as off-diagonal elements, since these are fixed in the data:

## $RinvD
##            [,1]       [,2]
## [1,] 0.03648925         NA
## [2,]         NA 0.07655993

Rows 82 through 85 of the design matrix of the fixed effects of baseline covariates, Xc, in the data are:

mod12c$data_list$Xc[82:85, ]
##       (Intercept)  HEIGHT_M SMOKEsmoked until[...] SMOKEcontin[...]
## 172.1           1 0.1148171                     NA               NA
## 173.1           1        NA                     NA               NA
## 174.1           1 0.5045126                     NA               NA
## 175.1           1 1.8822249                     NA               NA

The matrix Xc in the initial values has the same dimension as Xc in the data. It has values where there are missing values in Xc in the data, e.g., Xc[83, 2], and is NA elsewhere:

## $Xc
##        [,1]       [,2] [,3] [,4]
##   [1,]   NA         NA   NA   NA
## 
## [...]
## 
##  [82,]   NA         NA   NA   NA
##  [83,]   NA  0.8028938   NA   NA
##  [84,]   NA         NA   NA   NA
## 
## [...]

There are no initial values specified for the third and fourth column, since these columns contain the dummy variables corresponding to the categorical variable SMOKE and are calculated from the corresponding column in the matrix Xcat, i.e., these are deterministic nodes, not stochastic nodes.

The relevant part of the JAGS model is:

## [...]
##     # ordinal model for SMOKE
##     Xcat[i, 1] ~ dcat(p_SMOKE[i, 1:3])
## [...]
##     Xc[i, 3] <- ifelse(Xcat[i, 1] == 2, 1, 0)
##     Xc[i, 4] <- ifelse(Xcat[i, 1] == 3, 1, 0)
## [...]

Elements that are completely unobserved, like the parameter vectors alpha and beta, the random effects b, or the scalar parameters delta_SMOKE and gamma_SMOKE, are entirely specified in the initial values.
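Putting these pieces together, a user-written inits function for mod12c could look as follows. This is only a sketch: the dimensions of Xc and the lengths of alpha and beta, as well as the positions of the missing values, are illustrative assumptions and would in practice be read off coef(mod12c$model) and the data.

```r
# Sketch of an inits function for mod12c (all dimensions are illustrative):
inits_mod12c <- function() {
  # RinvD: positive initial values on the diagonal, NA where fixed in the data
  RinvD <- matrix(NA, nrow = 2, ncol = 2)
  diag(RinvD) <- rgamma(2, shape = 1, rate = 1)

  # Xc: same dimension as Xc in the data; values only where data are missing
  Xc <- matrix(NA, nrow = 200, ncol = 4)   # 200 rows assumed
  Xc[83, 2] <- rnorm(1)                    # e.g., the missing value in row 83

  # completely unobserved nodes are specified in full
  list(RinvD = RinvD, Xc = Xc,
       beta = rnorm(5), alpha = rnorm(6))  # vector lengths assumed
}
```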

### 6.6.4 Parallel Sampling

To reduce the computational time it is possible to perform sampling of multiple MCMC chains in parallel. This can be specified by setting the argument parallel = TRUE. The maximum number of cores to be used can be set with the argument ncores. By default this is two less than the number of cores available on the machine, but never more than the number of MCMC chains.

Parallel computation is done using the packages foreach (Microsoft and Weston 2017) and doParallel (Corporation and Weston 2018). Note that it is currently not possible to display a progress bar when using parallel computation.
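For example, three chains could be sampled in parallel with a call along these lines (a sketch; argument names as used elsewhere in this chapter, and the model name mod_par is hypothetical):

```r
# Sketch: sample 3 chains in parallel on up to 3 cores
mod_par <- lm_imp(SBP ~ gender + WC + alc + creat, data = NHANES,
                  n.chain = 3, parallel = TRUE, ncores = 3)
```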

## 6.7 After Fitting

Any of the main functions *_imp() will return an object of class JointAI. It contains the original data (data), information on the type of model (call, analysis_type, models, fixed, random, hyperpars, scale_pars) and MCMC sampling (mcmc_settings), the JAGS model (model) and MCMC sample (MCMC; if a sample was generated), the computational time (time) of the MCMC sampling, and some additional elements that are used by methods for objects of class JointAI but are typically not of interest for the user.

In the remaining part of this section, we describe how the results from a JointAI model can be visualized, summarized and evaluated. The functions described here use, by default, the full MCMC sample and show only the parameters of the analysis model. Arguments start, end and thin are available to select a subset of the MCMC samples that is used to calculate the summary. The argument subset allows the user to control for which nodes the summary or visualization is returned and follows the same logic as the argument monitor_params in *_imp(). The use of these arguments is further explained in Section 6.7.4.

### 6.7.1 Visualizing the Posterior Sample

The posterior sample can be visualized by two commonly used plots: a traceplot, showing samples across iterations, and a plot of the empirical density of the posterior sample.

#### Traceplot

A traceplot shows the sampled values per chain and node throughout iterations. It allows the visual evaluation of convergence and mixing of the chains and can be obtained with the function traceplot().

mod13a <- lm_imp(SBP ~ gender + WC + alc + creat,
                 data = NHANES, n.iter = 500)

traceplot(mod13a)

When the sampler has converged the chains show one horizontal band, as in Figure 6.8. Consequently, when traces show a trend, convergence has not been reached and more iterations are necessary (e.g., using add_samples()).
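For instance, if the traces of mod13a still showed a trend, additional iterations could be drawn without re-fitting, using the function add_samples() mentioned above (a sketch; the name mod13a_ext is hypothetical):

```r
# Sketch: continue sampling where mod13a stopped, then inspect the traces again
mod13a_ext <- add_samples(mod13a, n.iter = 500)
traceplot(mod13a_ext)
```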

Graphical aspects of the traceplot can be controlled by specifying standard graphical arguments via the dots argument "...", which are passed to matplot(). This allows the user to change colour, line type and width, limits, etc. Arguments nrow and/or ncol can be supplied to set specific numbers of rows and columns for the layout of the grid of plots.

With the argument use_ggplot it is possible to get a ggplot2 (Wickham 2016) version of the traceplot. It can be extended using standard ggplot2 syntax. The output of the following syntax is shown in Figure 6.9.

library(ggplot2)
traceplot(mod13a, ncol = 3, use_ggplot = TRUE) +
  theme(legend.position = 'bottom',
        panel.background = element_rect(fill = grey(0.95)),
        panel.border = element_rect(fill = NA, color = grey(0.85)),
        strip.background = element_rect(color = grey(0.85))) +
  scale_color_manual(values = c("#783D4F", "#60B5BC", "#6F5592"))

#### Densityplot

The posterior distributions can also be visualized using the function densplot(), which plots the empirical density per node per chain, or combining chains (when joined = TRUE).

The argument vlines takes a list of lists, containing specifications passed to the function abline() (from package graphics which is available with base R), and allows the addition of (vertical) lines to the plot, e.g., marking zero, or marking the posterior mean and 2.5% and 97.5% quantiles (Figure 6.10).

densplot(mod13a, ncol = 3,
         vlines = list(list(v = summary(mod13a)$stats[, "Mean"],
                            lty = 1, lwd = 2),
                       list(v = summary(mod13a)$stats[, "2.5%"],
                            lty = 2),
                       list(v = summary(mod13a)$stats[, "97.5%"],
                            lty = 2)
         )
)

As with traceplot() it is possible to use the ggplot2 version of densplot() when setting use_ggplot = TRUE. Here, vertical lines can be added as additional layers. Figure 6.11 shows, as an example, the posterior density from mod13a to which vertical lines, representing the 95% credible interval and a 95% confidence interval from a complete case analysis, are added. The corresponding syntax is given in Appendix 6.B.

### 6.7.2 Model Summary

A summary of the posterior distribution estimated in a JointAI model can be obtained using the function summary(). The posterior summary consists of the mean, standard deviation and quantiles (by default the 2.5% and 97.5% quantiles) of the MCMC samples from all chains combined, as well as the tail probability (see below) and the Gelman-Rubin criterion (see Section 6.7.3.1).

Additionally, some important characteristics of the MCMC samples on which the summary is based are given. This includes the range and number of iterations (Sample size per chain), the thinning interval and the number of chains. Furthermore, the number of observations (number of rows in the data) is printed.

summary(mod13a)
## 
## Linear model fitted with JointAI
## 
## Call:
## lm_imp(formula = SBP ~ gender + WC + alc + creat, data = NHANES, 
##     n.iter = 500)
## 
## Posterior summary:
##                Mean    SD   2.5%   97.5% tail-prob. GR-crit
## (Intercept)  81.368 9.844 61.972 100.732     0.0000    1.01
## genderfemale  0.353 2.541 -4.826   5.421     0.8907    1.01
## WC            0.302 0.073  0.159   0.441     0.0000    1.00
## alc>=1        6.219 2.456  1.321  10.984     0.0147    1.01
## creat         7.939 7.586 -7.465  22.848     0.2973    1.01
## 
## Posterior summary of residual std. deviation:
##           Mean    SD 2.5% 97.5% GR-crit
## sigma_SBP 14.4 0.758   13  15.9       1
## 
## 
## MCMC settings:
## Iterations = 101:600
## Sample size per chain = 500
## Thinning interval = 1
## Number of chains = 3
## 
## Number of observations: 186

For mixed models, summary() also returns the posterior summary of the random effects covariance matrix D and the number of groups:

library(splines)
mod13b <- lme_imp(bmi ~ GESTBIR + ETHN + HEIGHT_M + ns(age, df = 3),
                  random = ~ ns(age, df = 3) | ID,
                  data = subset(simLong, !is.na(bmi)),
                  n.iter = 500, no_model = 'age')

summary(mod13b, start = 300)
## Linear mixed model fitted with JointAI
## 
## Call:
## lme_imp(fixed = bmi ~ GESTBIR + ETHN + HEIGHT_M + ns(age, df = 3), 
##     data = subset(simLong, !is.na(bmi)), random = ~ns(age, df = 3) | 
##         ID, n.iter = 500, no_model = "age")
## 
## Posterior summary:
##                      Mean      SD    2.5%   97.5% tail-prob. GR-crit
## (Intercept)      16.31448 2.31751 11.9351 21.0625      0.000    1.06
## GESTBIR          -0.03379 0.04371 -0.1258  0.0466      0.463    1.03
## ETHNother        -0.00247 0.15991 -0.3029  0.3288      0.999    1.10
## HEIGHT_M          0.00545 0.00947 -0.0144  0.0247      0.560    1.04
## ns(age, df = 3)1 -0.30356 0.06287 -0.4186 -0.1834      0.000    3.12
## ns(age, df = 3)2  1.62073 0.19038  1.2633  1.9926      0.000    2.44
## ns(age, df = 3)3 -1.29077 0.06674 -1.4097 -1.1670      0.000    2.94
## 
## Posterior summary of random effects covariance matrix:
##          Mean     SD   2.5%  97.5% tail-prob. GR-crit
## D[1,1]  1.439 0.1694  1.132  1.788               1.04
## D[1,2] -0.774 0.1126 -1.010 -0.573          0    1.18
## D[2,2]  0.735 0.0971  0.569  0.946               1.09
## D[1,3] -2.541 0.3590 -3.300 -1.873          0    1.07
## D[2,3]  2.424 0.2785  1.942  3.027          0    1.01
## D[3,3]  8.222 0.9635  6.549 10.314               1.08
## D[1,4] -0.741 0.1003 -0.957 -0.550          0    1.08
## D[2,4]  0.626 0.0743  0.492  0.783          0    1.04
## D[3,4]  2.096 0.2492  1.626  2.621          0    1.08
## D[4,4]  0.561 0.0767  0.422  0.717               1.22
## 
## Posterior summary of residual std. deviation:
##            Mean      SD  2.5% 97.5% GR-crit
## sigma_bmi 0.457 0.00876 0.441 0.475       1
## 
## 
## MCMC settings:
## Iterations = 300:600
## Sample size per chain = 301
## Thinning interval = 1
## Number of chains = 3
## 
## Number of observations: 1881
## Number of groups: 200

The summary of parametric Weibull survival models also returns the summary of the posterior sample of the shape parameter of the Weibull distribution:

summary(mod6a)
## 
## Weibull survival model fitted with JointAI
## 
## Call:
## survreg_imp(formula = Surv(time, status) ~ age + sex + ph.karno + 
##     meal.cal + wt.loss, data = lung, n.iter = 250)
## 
## Posterior summary:
##                  Mean       SD      2.5%    97.5% tail-prob. GR-crit
## (Intercept)  5.58e+00 0.741546  4.19e+00 6.935742     0.0000    1.06
## age         -8.67e-03 0.007618 -2.38e-02 0.006025     0.2640    1.09
## sex2         4.22e-01 0.140230  1.39e-01 0.703982     0.0000    1.01
## ph.karno     9.28e-03 0.004567 -4.53e-05 0.018862     0.0533    1.05
## meal.cal     7.23e-05 0.000184 -2.59e-04 0.000426     0.7040    1.14
## wt.loss      1.69e-03 0.004738 -7.54e-03 0.011345     0.7413    1.00
## 
## Posterior summary of the shape of the Weibull distribution:
##       Mean    SD 2.5% 97.5% GR-crit
## shape  1.3 0.081 1.13  1.45    1.05
## 
## 
## MCMC settings:
## Iterations = 101:350
## Sample size per chain = 250
## Thinning interval = 1
## Number of chains = 3
## 
## Number of observations: 228

#### Tail Probability

The tail probability, calculated as $$2\times\min\left\{Pr(\theta > 0), Pr(\theta < 0)\right\},$$ where $$\theta$$ is the parameter of interest, is a measure of how likely the value 0 is under the estimated posterior distribution. Figure 6.12 visualizes three examples of posterior distributions and the corresponding minimum of $$Pr(\theta > 0)$$ and $$Pr(\theta < 0)$$ (shaded area).

### 6.7.3 Evaluation Criteria

Convergence of the MCMC chains and precision of the posterior sample can also be evaluated in a more formal manner.
Implemented in JointAI are the Gelman-Rubin criterion for convergence (Gelman and Rubin 1992; Brooks and Gelman 1998) and a comparison of the Monte Carlo error with the posterior standard deviation.

#### 6.7.3.1 Gelman-Rubin Criterion for Convergence

The Gelman-Rubin criterion (Gelman and Rubin 1992; Brooks and Gelman 1998), also referred to as "potential scale reduction factor", evaluates convergence by comparing within and between chain variability and, thus, requires at least two MCMC chains to be calculated. It is implemented for JointAI objects in the function GR_crit(), which is based on the function gelman.diag() from the package coda (Plummer et al. 2006). The upper limit of the confidence interval should not be much larger than 1.

GR_crit(mod13a)
## Potential scale reduction factors:
## 
##              Point est. Upper C.I.
## (Intercept)        1.00       1.01
## genderfemale       1.00       1.01
## WC                 1.00       1.01
## alc>=1             1.01       1.03
## creat              1.00       1.00
## sigma_SBP          1.00       1.00
## 
## Multivariate psrf
## 
## 1.01

Besides the arguments start, end, thin, and subset, which are explained in Section 6.7.4, GR_crit() also takes the arguments confidence, transform and autoburnin of gelman.diag().

#### 6.7.3.2 Monte Carlo Error

Precision of the MCMC sample can be checked with the function MC_error(). It uses the function mcmcse.mat() from the package mcmcse (Flegal et al. 2017) to calculate the Monte Carlo error (the error that is made since the sample is finite) and compares it to the standard deviation of the posterior sample. A rule of thumb is that the Monte Carlo error should not be more than 5% of the standard deviation (Lesaffre and Lawson 2012). Besides the arguments explained in Section 6.7.4, MC_error() takes the arguments of mcmcse.mat().

MC_error(mod13a)
##                est   MCSE    SD MCSE/SD
## (Intercept)  81.37 0.2896 9.844   0.029
## genderfemale  0.35 0.0750 2.541   0.029
## WC            0.30 0.0016 0.073   0.022
## alc>=1        6.22 0.0912 2.456   0.037
## creat         7.94 0.2387 7.586   0.031
## sigma_SBP    14.37 0.0222 0.758   0.029

MC_error() returns an object of class MCElist, which is a list containing matrices with the posterior mean, estimated Monte Carlo error, posterior standard deviation and the ratio of the Monte Carlo error and posterior standard deviation, for the scaled (if this MCMC sample was included in the JointAI object) and unscaled (transformed back to the scale of the data) posterior samples. The associated print method prints only the latter.

To facilitate quick evaluation of the Monte Carlo error to posterior standard deviation ratio, plotting of an object of class MCElist using plot() shows this ratio for each (selected) node and automatically adds a vertical line at the desired cut-off (by default 5%; see Figure 6.13).

par(mar = c(3, 5, 0.5, 0.5), mgp = c(2, 0.6, 0), mfrow = c(1, 2))
plot(MC_error(mod13a))            # left panel
plot(MC_error(mod13a, end = 250)) # right panel

### 6.7.4 Subset of Output

By default, the functions traceplot(), densplot(), summary(), predict(), GR_crit() and MC_error() use all iterations of the MCMC sample and consider only the parameters of the analysis model (if they were monitored). In this section we describe how the set of iterations and parameters/nodes to display can be changed using the arguments subset, start, end and thin.

#### Subset of Parameters

When the main parameters of the analysis model have been monitored in a JointAI object, i.e., when monitor_params was set to TRUE, only these parameters are returned in the model summary, plots and criteria shown above. If the main parameters of the analysis model were not monitored and the argument subset is not specified, all parameters that were monitored are displayed.

To display output for nodes other than the main parameters of the analysis model or for a subset of nodes, the argument subset needs to be specified.

##### Example:

To display only the parameters of the imputation models, we set subset = c(analysis_main = FALSE, imp_pars = TRUE) (after re-estimating the model with the monitoring for these parameters switched on):

mod13c <- update(mod13a, monitor_params = c(imp_pars = TRUE))

summary(mod13c, subset = c(analysis_main = FALSE, imp_pars = TRUE))
## Linear model fitted with JointAI
## 
## Call:
## lm_imp(formula = SBP ~ gender + WC + alc + creat, data = NHANES, 
##     n.iter = 500, monitor_params = c(imp_pars = TRUE))
## 
## Posterior summary:
##              Mean     SD    2.5%   97.5% tail-prob. GR-crit
## alpha[1]   0.1624 0.0990 -0.0327  0.3598     0.1000    1.00
## alpha[2]  -0.3506 0.1484 -0.6414 -0.0557     0.0173    1.00
## alpha[3]   0.4901 0.0902  0.3078  0.6633     0.0000    1.01
## alpha[4]  -1.0338 0.1331 -1.2930 -0.7647     0.0000    1.01
## alpha[5]   0.0775 0.0666 -0.0522  0.2084     0.2587    1.00
## alpha[6]  -0.1277 0.2449 -0.6010  0.3807     0.5880    1.01
## alpha[7]  -0.8928 0.4133 -1.6684 -0.0750     0.0320    1.01
## alpha[8]   0.0887 0.1769 -0.2568  0.4329     0.6373    1.01
## alpha[9]  -0.2627 0.2086 -0.6772  0.1529     0.2080    1.00
## tau_WC     1.0280 0.1083  0.8277  1.2551     0.0000    1.01
## tau_creat  1.3997 0.1525  1.1064  1.7261     0.0000    1.01
## 
## 
## MCMC settings:
## Iterations = 101:600
## Sample size per chain = 500
## Thinning interval = 1
## Number of chains = 3
## 
## Number of observations: 186

##### Example:

To select only some of the parameters, they can be specified directly by name via the other element of subset.

densplot(mod13a, subset = list(other = c('creat', 'alc>=1')))

##### Example:

This also works when a subset of the imputed values should be displayed:

# re-fit the model and monitor the imputed values
mod13d <- lm_imp(SBP ~ gender + age + albu + occup + alc, n.iter = 200,
                 data = NHANES, monitor_params = c(imps = TRUE))

# select all imputed values for 'albu' (4th column of Xc)
sub3 <- grep('Xc\\[[[:digit:]]+,4\\]', parameters(mod13d), value = TRUE)
sub3
## [1] "Xc[10,4]"  "Xc[14,4]"  "Xc[18,4]"  "Xc[25,4]"  "Xc[80,4]"  "Xc[118,4]"
## [7] "Xc[172,4]" "Xc[180,4]"

# pass "sub3" to "subset" via "other", for example in a traceplot:
# traceplot(mod13d, subset = list(other = sub3), ncol = 2)

##### Example:

When the number of imputed values is large or in order to check convergence of random effects, it may not be feasible to plot and inspect all traceplots. In that case, a random subset of, for instance, the random effects, can be selected (output not shown):

# re-fit the mixed model monitoring the random effects
mod13e <- update(mod13b, monitor_params = c(ranef = TRUE))

# extract random intercepts and random slopes
ri <- grep('^b\\[[[:digit:]]+,1\\]$', colnames(mod13e$MCMC[[1]]), value = TRUE)
rs <- grep('^b\\[[[:digit:]]+,2\\]$', colnames(mod13e$MCMC[[1]]), value = TRUE)

# to plot the chains of 12 randomly selected random intercepts & slopes:
traceplot(mod13e, subset = list(other = sample(ri, size = 12)), ncol = 4)
traceplot(mod13e, subset = list(other = sample(rs, size = 12)), ncol = 4)

#### 6.7.4.1 Subset of MCMC Samples

With the arguments start, end and thin it is possible to select which iterations from the MCMC sample are included in the summary. start and end specify the first and last iterations to be used, thin the thinning interval. Specification of start thus allows the user to discard a "burn-in", i.e., the iterations before the MCMC chain had converged.
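For example, a summary based on the second half of the chains only, keeping every second iteration, could be requested as follows (a sketch; the values are arbitrary):

```r
# Sketch: discard the first half of each chain as burn-in and thin by 2
summary(mod13a, start = 351, end = 600, thin = 2)
```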
### 6.7.5 Predicted Values

Often, the aim of an analysis is not only to estimate the association between outcome and covariates but to predict future outcomes or outcomes for new subjects. The function predict() allows us to obtain predicted values and corresponding credible intervals from JointAI objects. Note that for mixed models, currently only marginal prediction is implemented, not prediction conditional on the random effects.

A dataset containing the data for which the prediction should be performed is specified via the argument newdata. The argument quantiles allows specification of the quantiles of the posterior sample that are used to obtain the prediction interval (by default the 2.5% and 97.5% quantile). Arguments start, end and thin control the subset of MCMC samples used.

predict(mod13a, newdata = NHANES[27, ])
## $dat
##          SBP gender age             race   WC alc educ creat albu uricacid
## 392 126.6667   male  32 Mexican American 94.1  <1  low  0.83  4.2      8.7
##     bili occup  smoke      fit     2.5%    97.5%
## 392    1  <NA> former 116.3997 112.5729 120.2564
##
## $fit
## [1] 116.3997
## 
## $quantiles
##           [,1]
## 2.5%  112.5729
## 97.5% 120.2564

predict() returns a list with elements dat (a dataset consisting of newdata with the predicted values and quantiles appended), fit (the predicted values) and quantiles (the quantiles that form the prediction interval).

#### Prediction to Visualize Non-linear Effects

Another reason to obtain predicted values is the visualization of non-linear effects (see Figure 6.14). To facilitate the generation of a dataset for such a prediction, the function predDF() can be used. It generates a data.frame that contains a sequence of values through the range of observed values for a covariate specified by the argument var, and the median or reference value for all other continuous and categorical variables.

# create dataset for prediction
newDF <- predDF(mod13b, var = "age")

# obtain predicted values
pred <- predict(mod13b, newdata = newDF, start = 300)

# plot predicted values and prediction interval
matplot(pred$dat$age, pred$dat[, c('fit', '2.5%', '97.5%')],
        lty = c(1, 2, 2), type = 'l', col = 1,
        xlab = 'age in months', ylab = 'predicted value')

### 6.7.6 Export of Imputed Values

Imputed datasets can be extracted from a JointAI object (in which a monitor for the imputed values has been set, i.e., monitor_params = c(imps = TRUE)) with the function get_MIdat(). A completed dataset is created by taking the imputed values from a randomly chosen iteration of the MCMC sample, transforming them back to the original scale (if scaling was performed before the MCMC sampling) and filling them into the original incomplete data.

The argument m specifies the number of imputed datasets to be created, include controls whether the original data are included in the long format data.frame (default is include = TRUE), start specifies the first iteration that may be used, and minspace is the minimum number of iterations between iterations eligible for selection. To make the selection of iterations reproducible, a seed value can be specified via the argument seed.

When export_to_SPSS = TRUE the imputed data are exported to SPSS (IBM SPSS Statistics, IBM Corp.), i.e., a .txt file containing the data and a .sps file containing SPSS syntax to convert the data into an SPSS data file (with ending .sav) are written. Arguments filename and resdir allow specification of the names of the .txt and .sps files and the directory they are written to.

get_MIdat() returns a long-format data.frame containing the imputed datasets (and by default the original data) stacked on top of each other. The imputation number is given in the variable Imputation_, column .id contains a newly created id variable for each observation in cross-sectional data (multi-level data should already contain an id variable) and the column .rownr identifies rows of the original data (which is relevant in multi-level data).
impDF <- get_MIdat(mod13d, m = 5, seed = 2019)

The function plot_imp_distr() allows visual comparison of the distributions of the observed and imputed values (Figure 6.15).

plot_imp_distr(impDF, nrow = 1)

## 6.8 Assumptions and Extensions

Like any statistical model, the approach to imputation followed in JointAI relies on assumptions that need to be satisfied in order to obtain valid results.

A commonly made assumption that is also required for JointAI is that the missing data mechanism is ignorable, i.e., that data are Missing At Random (MAR) or Missing Completely At Random (MCAR) (Rubin 1976) and that the parameters of the missingness mechanism are independent of the parameters in the data model (Schafer 1997). It is the task of the researcher to critically evaluate whether this assumption is satisfied for a given dataset and model.

Furthermore, all models involved in the imputation and analysis need to be correctly specified. In current implementations of imputation procedures in software, imputation models are typically specified automatically, using standard assumptions like linear associations and default model types. In JointAI, the arguments models and auxvars permit tailoring of the automatically chosen models to some extent, by allowing the user to choose non-normal imputation models for continuous variables and to include variables, or functional forms of variables, in the linear predictor of the imputation models that are not used in the analysis model.

When using auxiliary variables in JointAI, it should be noted that, due to the ordering of the conditional distributions in the sequence of models, the auxiliary variable is implied to be independent of the outcome, since neither the model for the auxiliary variable (if the auxiliary variable has missing values) has the outcome in its linear predictor, nor vice versa.

Moreover, in order to make software usable, default values have to be chosen for various parameters.
These default values are chosen to work well in certain settings, but cannot be guaranteed to be appropriate in general, and it is the task of the user to make the appropriate changes. In JointAI this concerns, for example, the choice of hyperparameters and the automatically chosen types of imputation models.

To expand the range of settings in which JointAI provides a valid and user-friendly way to simultaneously analyse and impute data, several extensions are planned, for example:

• Implement the use of (penalized) splines for incompletely observed covariates, thereby improving model fit.
• Increase the flexibility of imputation models by optional inclusion of interaction terms and non-parametric Bayesian models that allow imputation under non-standard distributions.
• Evaluate model fit of the analysis and imputation models to help the user prevent bias due to misspecification.
• Implement subject-specific prediction from mixed models.
• Extend the analysis models to handle endogenous covariates by modelling random effects (and error terms) jointly.

## Appendix

### 6.A Default Hyperparameters

default_hyperpars()

## $norm
##    mu_reg_norm   tau_reg_norm shape_tau_norm  rate_tau_norm
##          0e+00          1e-04          1e-02          1e-02
##
## $gamma
##    mu_reg_gamma   tau_reg_gamma shape_tau_gamma  rate_tau_gamma
##           0e+00           1e-04           1e-02           1e-02
##
## $beta
##    mu_reg_beta   tau_reg_beta shape_tau_beta  rate_tau_beta
##          0e+00          1e-04          1e-02          1e-02
##
## $logit
##  mu_reg_logit tau_reg_logit
##         0e+00         1e-04
##
## $poisson
##  mu_reg_poisson tau_reg_poisson
##           0e+00           1e-04
##
## $probit
##  mu_reg_probit tau_reg_probit
##          0e+00          1e-04
##
## $multinomial
##  mu_reg_multinomial tau_reg_multinomial
##               0e+00               1e-04
##
## $ordinal
##    mu_reg_ordinal   tau_reg_ordinal  mu_delta_ordinal tau_delta_ordinal
##             0e+00             1e-04             0e+00             1e-04
##
## $Z
## function (nranef)
## {
##     if (nranef > 1) {
##         RinvD <- diag(as.numeric(rep(NA, nranef)))
##         KinvD <- nranef
##     }
##     else {
##         RinvD <- KinvD <- NULL
##     }
##     list(RinvD = RinvD, KinvD = KinvD, shape_diag_RinvD = 0.1,
##         rate_diag_RinvD = 0.01)
## }
##
## $surv
##  mu_reg_surv tau_reg_surv
##        0.000        0.001
##
## $coxph
##     c     r   eps
## 1e-03 1e-01 1e-10
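These defaults can be adapted by editing the list returned by default_hyperpars() and supplying it to a model fitting function. The following sketch assumes that the hyperpars argument accepts such a (modified) list; the model formula is only illustrative:

```r
library(JointAI)

# obtain the list of default hyperparameters
hyp <- default_hyperpars()

# use a slightly more informative precision for the regression
# coefficients in normal models (assumed element name as printed above)
hyp$norm["tau_reg_norm"] <- 1e-3

# pass the modified list via the hyperpars argument
mod_hyp <- lm_imp(SBP ~ gender + age + WC, data = NHANES,
                  hyperpars = hyp, n.iter = 500)
```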

### 6.B Density Plot using ggplot2

# fit the complete-case version of the model
mod13a_cc <- lm(formula(mod13a), data = NHANES)

# make a dataset containing the quantiles of the posterior sample and
# confidence intervals from the complete case analysis:
quantDF <- rbind(
  data.frame(variable = rownames(summary(mod13a)$stat),
             type = '2.5%',
             model = 'JointAI',
             value = summary(mod13a)$stat[, '2.5%']),
  data.frame(variable = rownames(summary(mod13a)$stat),
             type = '97.5%',
             model = 'JointAI',
             value = summary(mod13a)$stat[, '97.5%']),
  data.frame(variable = names(coef(mod13a_cc)),
             type = '2.5%',
             model = 'cc',
             value = confint(mod13a_cc)[, '2.5 %']),
  data.frame(variable = names(coef(mod13a_cc)),
             type = '97.5%',
             model = 'cc',
             value = confint(mod13a_cc)[, '97.5 %'])
)

# ggplot version:
p13a <- densplot(mod13a, ncol = 3, use_ggplot = TRUE, joined = TRUE) +
  theme(legend.position = 'bottom',
        panel.background = element_rect(fill = grey(0.95)),
        panel.border = element_rect(fill = NA, color = grey(0.85)),
        strip.background = element_rect(color = grey(0.85)))

# add vertical lines for the:
# - confidence intervals from the complete case analysis
# - quantiles of the posterior distribution
p13a +
  geom_vline(data = quantDF, aes(xintercept = value, color = model),
             lty = 2, lwd = 1) +
  scale_color_manual(name = 'CI from model: ',
                     limits = c('JointAI', 'cc'),
                     values = c("#783D4F", "#174F88"),
                     labels = c('JointAI', 'compl.case'))

### References

Bartlett, Jonathan W., Shaun R. Seaman, Ian R. White, and James R. Carpenter. 2015. “Multiple Imputation of Covariates by Fully Conditional Specification: Accommodating the Substantive Model.” Statistical Methods in Medical Research 24 (4): 462–87. https://doi.org/10.1177/0962280214521348.

Brooks, Stephen P., and Andrew Gelman. 1998. “General Methods for Monitoring Convergence of Iterative Simulations.” Journal of Computational and Graphical Statistics 7 (4): 434–55. https://doi.org/10.1080/10618600.1998.10474787.

Microsoft Corporation, and Steve Weston. 2018. doParallel: Foreach Parallel Adaptor for the 'parallel' Package. https://CRAN.R-project.org/package=doParallel.

Erler, Nicole S. 2019. JointAI: Joint Analysis and Imputation of Incomplete Data. https://github.com/nerler/JointAI.

Erler, Nicole S., Dimitris Rizopoulos, Vincent W. V. Jaddoe, Oscar H. Franco, and Emmanuel M. E. H. Lesaffre. 2019. “Bayesian Imputation of Time-Varying Covariates in Linear Mixed Models.” Statistical Methods in Medical Research 28 (2): 555–88. https://doi.org/10.1177/0962280217730851.

Erler, Nicole S., Dimitris Rizopoulos, Joost van Rosmalen, Vincent W. V. Jaddoe, Oscar H. Franco, and Emmanuel M. E. H. Lesaffre. 2016. “Dealing with Missing Covariates in Epidemiologic Studies: A Comparison Between Multiple Imputation and a Full Bayesian Approach.” Statistics in Medicine 35 (17): 2955–74. https://doi.org/10.1002/sim.6944.

Flegal, James M., John Hughes, Dootika Vats, and Ning Dai. 2017. mcmcse: Monte Carlo Standard Errors for MCMC. Riverside, CA; Denver, CO; Coventry, UK; and Minneapolis, MN.

Gelman, Andrew, and Donald B. Rubin. 1992. “Inference from Iterative Simulation Using Multiple Sequences.” Statistical Science 7 (4): 457–72. https://doi.org/10.1214/ss/1177011136.

Ibrahim, Joseph G., Ming-Hui Chen, and Stuart R. Lipsitz. 2002. “Bayesian Methods for Generalized Linear Models with Covariates Missing at Random.” Canadian Journal of Statistics 30 (1): 55–78. https://doi.org/10.2307/3315865.

Lesaffre, Emmanuel M. E. H., and Andrew B. Lawson. 2012. Bayesian Biostatistics. John Wiley & Sons. https://doi.org/10.1002/9781119942412.

Lunn, David, Chris Jackson, Nicky Best, David Spiegelhalter, and Andrew Thomas. 2012. The BUGS Book: A Practical Introduction to Bayesian Analysis. Chapman & Hall/CRC.

Microsoft, and Steve Weston. 2017. foreach: Provides Foreach Looping Construct for R. https://CRAN.R-project.org/package=foreach.

National Center for Health Statistics (NCHS). 2011. “National Health and Nutrition Examination Survey Data.” U.S. Department of Health and Human Services, Centers for Disease Control and Prevention. https://www.cdc.gov/nchs/nhanes/.

Novo, Alvaro A., and Joseph L. Schafer. 2010. norm: Analysis of Multivariate Normal Datasets with Missing Values. R package version 1.0-9.5. http://CRAN.R-project.org/package=norm.

Pinheiro, Jose, Douglas Bates, Saikat DebRoy, Deepayan Sarkar, and R Core Team. 2018. nlme: Linear and Nonlinear Mixed Effects Models. https://CRAN.R-project.org/package=nlme.

Plummer, Martyn. 2003. “JAGS: A Program for Analysis of Bayesian Graphical Models Using Gibbs Sampling.” In Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003), edited by Kurt Hornik, Friedrich Leisch, and Achim Zeileis. Vienna, Austria.

Plummer, Martyn. 2018. rjags: Bayesian Graphical Models Using MCMC. https://CRAN.R-project.org/package=rjags.

Plummer, Martyn, Nicky Best, Kate Cowles, and Karen Vines. 2006. “CODA: Convergence Diagnosis and Output Analysis for MCMC.” R News 6 (1): 7–11. https://journal.r-project.org/archive/.

R Core Team. 2018. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Rubin, Donald B. 1976. “Inference and Missing Data.” Biometrika 63 (3): 581–92. https://doi.org/10.2307/2335739.

Rubin, Donald B. 1987. Multiple Imputation for Nonresponse in Surveys. Wiley Series in Probability and Statistics. Wiley. https://doi.org/10.1002/9780470316696.

Rubin, Donald B. 2004. “The Design of a General and Flexible System for Handling Nonresponse in Sample Surveys.” The American Statistician 58 (4): 298–302. https://doi.org/10.1198/000313004X6355.

Schafer, Joseph L. 1997. Analysis of Incomplete Multivariate Data. Chapman & Hall/Crc Monographs on Statistics & Applied Probability. Taylor & Francis.

Therneau, Terry M., and Patricia M. Grambsch. 2000. Modeling Survival Data: Extending the Cox Model. New York: Springer.

Therneau, Terry M. 2015. A Package for Survival Analysis in S. https://CRAN.R-project.org/package=survival.

Van Buuren, Stef. 2012. Flexible Imputation of Missing Data. Chapman & Hall/CRC Interdisciplinary Statistics. Taylor & Francis.

Van Buuren, Stef, and Karin Groothuis-Oudshoorn. 2011. “mice: Multivariate Imputation by Chained Equations in R.” Journal of Statistical Software 45 (3): 1–67. https://doi.org/10.18637/jss.v045.i03.

White, Ian R., Patrick Royston, and Angela M. Wood. 2011. “Multiple imputation using chained equations: Issues and guidance for practice.” Statistics in Medicine 30 (4): 377–99. https://doi.org/10.1002/sim.4067.

Wickham, Hadley. 2016. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. http://ggplot2.org.