# Chapter 7 General Discussion

In this thesis, we have presented a fully Bayesian approach for simultaneous analysis and imputation of incomplete data. Here, we summarize some of its advantages, highlight important assumptions made by the approach and its current implementation in the R package JointAI, and discuss possible extensions and directions for future work.

## 7.1 Advantages

Modelling the joint distribution as a sequence of univariate conditional distributions has several advantages. In contrast to MICE, the joint distribution always exists and, since the predictive distributions used to draw imputations are derived from this joint distribution, compatibility among the imputation models and congeniality with the analysis model are assured. The factorization into univariate conditional distributions also facilitates specification in settings with variables of mixed type, where the joint multivariate distribution has no known form. This avoids approximating the joint distribution by a known multivariate distribution, as is done in joint model MI, and thereby improves model fit. Moreover, since the analysis model is part of the factorization, non-linear associations between outcome and covariates, or among covariates, can be handled adequately. Choosing the analysis model as the first factor in the sequence of conditional distributions, i.e., the model for the outcome conditional on all covariates, avoids having to include the outcome in the linear predictors of the other distributions, and makes analysis and imputation straightforward in settings with complex data structures.
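With outcome $$\mathbf y$$ and covariates $$\mathbf x_1, \ldots, \mathbf x_p$$, ordered such that the analysis model comes first, this factorization can be written explicitly (notation chosen here for illustration) as $p(\mathbf y, \mathbf x_1, \ldots, \mathbf x_p \mid \boldsymbol\theta) = p(\mathbf y \mid \mathbf x_1, \ldots, \mathbf x_p, \boldsymbol\theta_y)\, \prod_{k=1}^{p} p(\mathbf x_k \mid \mathbf x_{k+1}, \ldots, \mathbf x_p, \boldsymbol\theta_{x_k}),$ where the first factor is the analysis model and none of the remaining factors conditions on $$\mathbf y$$.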

In Chapters 3 and 4 we demonstrated that analysis with the sequential fully Bayesian approach also yields imputed values, which may be used for secondary analyses of the same data, so that the approach can be embedded in a multiple imputation framework. Data were imputed during the main analysis and used in subsequent analyses of outcomes derived from the longitudinal outcome, or in sensitivity analyses in subsets of the data. Naturally, congeniality of the imputation models with these secondary analysis models is then no longer assured, but the resulting bias is likely to be small, provided that the secondary analysis models are, in a sense, contained in the analysis model assumed during imputation. In Chapter 3 this is the case, since the outcome of the secondary models, gestational weight gain during different periods of pregnancy, is calculated from the longitudinal trajectories of weight, which were taken into account during imputation, and the same set of baseline covariates is used.

In settings where multiple related outcomes are of interest and their models share incomplete covariates, it may be desirable to model these outcomes jointly and to perform imputation within this joint model. This was the case in Chapter 4, where the association of maternal consumption of sugar-containing beverages during pregnancy with child body composition, comprising repeated measurements of BMI and measurements of the fat mass index (FMI) and fat-free mass index (FFMI) at age six, was modelled jointly. Correlation between the different outcomes can be taken into account by specifying a joint distribution for the error terms of univariate outcomes and the random effects in models for longitudinal outcomes, following the same principle used in (5.8) in the multivariate normal approach to multiple imputation applied in Chapter 5.
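One way to write such a joint specification, in notation assumed here purely for illustration, is to stack the random effects $$\mathbf b_i$$ of the longitudinal BMI model and the error terms of the models for FMI and FFMI into a single multivariate normal vector, $\begin{pmatrix}\mathbf b_i \\ \varepsilon_i^{FMI} \\ \varepsilon_i^{FFMI}\end{pmatrix} \sim N(\mathbf 0, \boldsymbol\Sigma),$ where the off-diagonal blocks of $$\boldsymbol\Sigma$$ capture the correlation between the outcomes.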

Another advantage of the Bayesian approach is its ability to take into account endogeneity of (usually longitudinal) covariates by modelling their random effects, and potentially also their error terms, jointly with the corresponding parts of the analysis model. Standard methods that are commonly applied after multiple imputation specify a model for the outcome only, which implicitly assumes that all covariates are exogenous and may lead to bias.

The implementation of the sequential Bayesian approach in the R package JointAI provides a convenient tool for researchers from various backgrounds who are familiar with commonly used base R functions but do not have specific experience in Bayesian methodology or in software for Bayesian inference, such as JAGS or WinBUGS.

## 7.2 Assumptions

Like all statistical methods, the sequential fully Bayesian approach is based on certain assumptions that must be met in order to obtain correct results. In the following, we highlight two assumptions that are crucial to the validity of the analysis: correctness of the model specification and the assumption of ignorable missingness.

### 7.2.1 Model Specification

An implicit assumption of the sequential Bayesian approach to handling missing data is that the model is correctly specified. This assumption is not particular to our approach but is made in all parametric models. Even though in practice it is usually not possible to specify a correct model, and the specified model can only approximate the “truth”, it is important that this approximation is close enough to avoid relevant bias.

In the model presented in the previous chapters, it is the joint distribution $$p(\mathbf y, \mathbf X_{obs}, \mathbf X_{mis} \mid \boldsymbol\theta)$$ that needs to be specified correctly, or at least needs to fit the data well. Since the joint distribution is specified as a sequence of conditional distributions, this requirement translates to each of the conditional distributions in the sequence.

When there are more than just a few incomplete variables, choosing conditional distributions that fit the data sufficiently well can become a tedious task. Variables in the linear predictor of the conditional models may have non-linear effects and may interact with other variables, and which variables are included in a particular linear predictor depends on the order of the sequence of conditional distributions. Moreover, all models except the one for the first incomplete covariate in the sequence have incomplete variables in their linear predictor, which complicates the evaluation of model fit. For continuous variables, different choices for the error distribution additionally need to be considered. It is unrealistic to expect such time-consuming model building to happen in practice: many researchers will rely on the default choices set in the software implementation and, at most, consider alternative distributions for continuous, non-normal variables.

A useful implementation in a software package should therefore make it easy to use models that are flexible enough to fit a wide range of data sufficiently well to provide valid results. When each of the conditional distributions in the sequential fully Bayesian approach is specified correctly, the order chosen for the sequence is irrelevant for the validity of the results. In its current version (0.5.1), the R package JointAI provides multiple parametric options for the imputation of continuous variables (using a normal, log-normal, gamma or beta distribution), but makes the default assumptions that all associations between covariates are linear and that covariates do not interact. Even though these assumptions are appropriate in many settings, relaxing them may be necessary for some analyses.

### 7.2.2 Ignorable Missingness

The second crucial assumption necessary to obtain valid results when using the approach presented in the preceding chapters is that the missing data process is ignorable.

Imputation under the MAR assumption is attractive since MAR implies that non-responders are comparable to responders in the sense that the conditional distribution of a variable $$\mathbf x$$, given the other observed data, is the same for cases where $$\mathbf x$$ is missing as for cases where $$\mathbf x$$ is observed. This facilitates a straightforward specification of the posterior predictive distribution used to impute missing values.
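Formally, writing $$\mathbf r$$ for the missingness indicator of $$\mathbf x$$ (with $$\mathbf r = 1$$ indicating that $$\mathbf x$$ is observed) and $$\mathbf X_{obs}$$ for the remaining observed data, this means $p(\mathbf x \mid \mathbf X_{obs}, \mathbf r = 0) = p(\mathbf x \mid \mathbf X_{obs}, \mathbf r = 1),$ so that the model fitted to the observed cases can be used to draw imputations for the missing cases.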

Throughout this thesis we have made the ignorability assumption. However, for some variables, like maternal smoking or alcohol use during pregnancy, the assumption of MAR, i.e., that the probability of the smoking or drinking status being missing is independent of the true status, is questionable. It is widely known that the use of alcohol or cigarettes during pregnancy is not advised, and mothers disregarding this advice may feel guilty or may not dare to respond to the question for fear of judgement.

When missing data are indeed MNAR conditional on the available information, ignoring the missingness process when analysing or imputing the data may lead to severe bias and faulty conclusions. This issue is not limited to missing values in covariates. In clinical trials in particular, missingness of outcome values is often associated with what is investigated in the study. For instance, critically ill patients may feel too sick to fill in questionnaires regarding their quality of life, or patients who believe that the treatment or intervention they have been assigned does not have the desired effect may leave the study. Especially in conjunction with the usually small number of covariates that is (available to be) included in the analysis, it is unlikely that the probability of a value being missing can be fully explained by the recorded information. In such settings, where MNAR is the more plausible assumption, omitting missing outcome values would lead to biased results.

## 7.3 Directions for Future Work

To reduce bias due to violation of the assumptions implied by the approach described in this thesis, future work should focus on extensions that help researchers to model their data appropriately. In this section, we briefly discuss some approaches that address the aforementioned assumptions of model fit and ignorability of the missingness mechanism.

### 7.3.1 Implementation of Flexible Models

A possible extension to improve the fit of the conditional distributions for incomplete covariates is to allow a more general specification of the mean structure, relaxing the assumption of linear associations. One way to do this in an automated manner that can be implemented in JAGS would be to provide the option to include pair-wise interactions between all covariates in the linear predictors of the models for incomplete covariates. Since this may lead to a large number of coefficients, and hence to overfitting and convergence issues, it may be necessary to shrink parameters that contribute little to the model fit. In the Bayesian framework, shrinkage can be applied through the specification of the prior distribution. A popular choice is Bayesian ridge regression, where, instead of using a vague normal prior for the regression coefficients $$\boldsymbol\alpha$$, a hyperprior $$\sigma_\alpha^2 \sim \text{Inv-}\chi^2(\nu, u^2)$$, where $$u^2$$ has a gamma prior (Mallick and Yi 2013), is specified for the variance of this prior distribution. Alternatively, Bayesian elastic net priors, like the one used in Chapter 5, may be chosen.
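Written out as a full hierarchy (with the gamma hyperparameters $$a$$ and $$b$$ assumed here for illustration), this ridge-type prior for each coefficient $$\alpha_j$$ is $\alpha_j \mid \sigma_\alpha^2 \sim N(0, \sigma_\alpha^2), \qquad \sigma_\alpha^2 \sim \text{Inv-}\chi^2(\nu, u^2), \qquad u^2 \sim \text{Gamma}(a, b),$ where small values of $$\sigma_\alpha^2$$ shrink coefficients that contribute little to the fit towards zero.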

To relax the assumption of a parametric error distribution for continuous variables, non-parametric Bayesian methods can be applied. In this framework, uncertainty about the probability density function $$G$$ of an incompletely observed covariate $$\mathbf x$$ can be reflected by not specifying $$G$$ directly, but assuming a prior probability model for $$G$$. A convenient and popular choice for such a prior probability model is the Dirichlet process prior, which allows the specification of an infinite mixture of simple parametric distributions, often normal distributions, where the values of the parameters of these parametric distributions are determined by the Dirichlet process (Escobar and West 1995; Müller et al. 2015). Due to the clustering property of the Dirichlet process, in practice, only a finite number of clusters are used, but since the number of clusters needed is determined by the data and does not need to be pre-specified, this approach facilitates automated and very flexible density estimation.
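In its stick-breaking representation (notation used here for illustration), such a Dirichlet process mixture of normal distributions can be written as $p(x) = \sum_{k=1}^{\infty} w_k\, N(x \mid \mu_k, \sigma_k^2), \qquad (\mu_k, \sigma_k^2) \sim G_0, \qquad w_k = v_k \prod_{\ell < k}(1 - v_\ell), \quad v_k \sim \text{Beta}(1, \alpha),$ where the base measure $$G_0$$ and the concentration parameter $$\alpha$$ determine how many mixture components effectively receive weight.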

### 7.3.2 Evaluation of Model Fit

Even when flexible models are used, model fit should be evaluated to ensure the assumption of a correctly specified model is not violated. In Chapters 2 and 5, the fit of the conditional distributions was evaluated using posterior predictive checks (Gelman, Meng, and Stern 1996).

A common approach to posterior predictive checks is to calculate a $$\chi^2$$-type discrepancy statistic $\sum_{i=1}^N \frac{\left\{x_i - \text{E}(x_i\mid \boldsymbol\theta)\right\}^2}{\text{Var}(x_i\mid \boldsymbol\theta)}.$ The statistic is calculated twice, once for the observed values and once for corresponding values sampled from the estimated posterior predictive distribution. If the specified model fits the observed data well, comparison of the two statistics should show that, on average over all iterations, neither is systematically larger than the other.
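This comparison is often summarized as a posterior predictive p-value: the proportion of iterations in which the statistic for the replicated data exceeds the statistic for the observed data, with values close to 0.5 indicating adequate fit and values near 0 or 1 signalling misfit. A minimal sketch of this check for a simple normal model, written in Python with NumPy purely for illustration (the data and the crude stand-in posterior draws are simulated here, not taken from a fitted JointAI model):

```python
import numpy as np

# Posterior predictive check for a simple normal model.
# NOTE: data and "posterior draws" are simulated for illustration only;
# in a real analysis they would come from the fitted Bayesian model.
rng = np.random.default_rng(42)

x_obs = rng.normal(loc=0.0, scale=1.0, size=200)  # "observed" data
n_iter = 1000

# Stand-ins for posterior draws of the mean; the sd is held fixed here
mu_draws = rng.normal(x_obs.mean(), x_obs.std() / np.sqrt(x_obs.size), size=n_iter)
sigma = x_obs.std()

def discrepancy(x, mu, sd):
    """Chi-squared-type statistic: sum of squared standardized residuals."""
    return np.sum((x - mu) ** 2 / sd ** 2)

exceed = 0
for mu in mu_draws:
    x_rep = rng.normal(mu, sigma, size=x_obs.size)  # posterior predictive draw
    if discrepancy(x_rep, mu, sigma) >= discrepancy(x_obs, mu, sigma):
        exceed += 1

ppp = exceed / n_iter  # posterior predictive p-value; near 0.5 = adequate fit
print(ppp)
```

For the conditional models of incomplete covariates, the same check can be applied within each iteration of the sampler, using the imputed values current at that iteration.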

To advocate and facilitate the use of posterior predictive checks for model fit when using the sequential Bayesian approach, functionality to automatically perform such model evaluation will be added to JointAI in the future.

### 7.3.3 Non-ignorable Missingness

Most popular software for handling incomplete data is limited to settings with MAR mechanisms, posing an often insurmountable hurdle for applied researchers to perform imputation or analysis under the MNAR assumption.

A completely automated procedure is not possible, since the handling of non-randomly missing data requires external information on the missingness mechanism. Under MNAR, the unobserved data are assumed to have a different distribution (conditional on covariates) than the observed data, but since this distribution cannot be identified from the observed data, assumptions about it need to be made by the researcher. Nevertheless, the use of MNAR analyses can be supported by providing software that allows researchers to introduce their assumptions into the analysis.

By factorizing the likelihood $$p(\mathbf X,\mathbf R\mid\boldsymbol\psi,\boldsymbol\theta) = p(\mathbf X\mid \mathbf R, \boldsymbol\theta)\; p(\mathbf R\mid \boldsymbol\psi)$$, i.e., using a pattern mixture model specification instead of the selection model factorization used in Chapter 1, MNAR can be modelled as a deviation from MAR: $\begin{eqnarray*} p(\mathbf X,\mathbf R\mid \boldsymbol\psi,\boldsymbol\theta) &=& p(\mathbf X_{obs}, \mathbf X_{mis}\mid \mathbf R, \boldsymbol\theta)\; p(\mathbf R\mid \boldsymbol\psi)\\ &=& p(\mathbf X_{mis} \mid \mathbf X_{obs}, \mathbf R, \boldsymbol \theta_{mis})\; p(\mathbf X_{obs} \mid \mathbf R, \boldsymbol\theta_{obs})\; p(\mathbf R\mid \boldsymbol\psi), \end{eqnarray*}$ where $$(\boldsymbol\theta_{mis}, \boldsymbol\theta_{obs}) = g(\boldsymbol\theta)$$, $$\boldsymbol\theta_{mis} = f(\boldsymbol\theta_{obs}, \boldsymbol\Delta)$$ and $$\boldsymbol\Delta$$ represents the deviation from MAR. Since no information about $$\boldsymbol\Delta$$ is available from the data, $$\boldsymbol\Delta$$ or a prior distribution $$p(\boldsymbol\Delta)$$ must be specified, reflecting the analyst’s hypothesis about the missing data mechanism.
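As a concrete example of such a specification (the imputation model is assumed here purely for illustration), for a normally distributed incomplete covariate the deviation could enter as a shift in the mean of the imputation model, $x_{mis} \mid \mathbf X_{obs} \sim N(\mathbf X_{obs}\boldsymbol\alpha + \Delta,\; \sigma_x^2),$ where $$\Delta = 0$$ recovers MAR, a fixed non-zero $$\Delta$$ encodes a specific departure from MAR, and a prior $$p(\Delta)$$ expresses uncertainty about that departure.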

The R package JointAI could be extended to enable users to specify for which incomplete variables they assume MNAR, and allow them to provide a value or distribution for $$\boldsymbol\Delta$$.

Offering researchers the opportunity to perform analysis and imputation under the assumption of MNAR with only slightly more effort than an analysis under the assumption of MAR requires would substantially improve the quality of research: researchers would be more likely to make appropriate assumptions and to perform the sensitivity analyses required for those (untestable) assumptions.

## 7.4 Conclusion

Missing data occur in a wide range of studies, but in practice their treatment is often not given adequate time and consideration, as they are regarded as a frustrating complication that needs to be resolved before the actual research question can be answered. Applied researchers therefore often prefer off-the-shelf solutions that require minimal knowledge of specific statistical approaches and little time to apply. This is in conflict with the careful consideration that is necessary to make appropriate assumptions about the missing data mechanism and the form of the models used to impute missing values.

To support the correct use of statistical methodology, software is necessary that provides an accessible interface to valid statistical approaches. Such software needs to be flexible enough to handle data with different characteristics, non-restrictive enough to make it more likely that assumptions are met, and should provide options allowing straightforward evaluation of potential violations. Since the analysis of incomplete data involves untestable assumptions, additional functionality allowing sensitivity analyses about these assumptions is needed.

The implementation of the fully Bayesian approach to the analysis of incomplete data in the R package JointAI is a first step towards providing such software. Extensions, as discussed above and in Chapter 6, are necessary to fully reach this goal.

### References

Escobar, Michael D., and Mike West. 1995. “Bayesian Density Estimation and Inference Using Mixtures.” Journal of the American Statistical Association 90 (430): 577–88. https://doi.org/10.2307/2291069.

Gelman, Andrew, Xiao-Li Meng, and Hal Stern. 1996. “Posterior Predictive Assessment of Model Fitness via Realized Discrepancies.” Statistica Sinica 6 (4): 733–60.

Mallick, Himel, and Nengjun Yi. 2013. “Bayesian Methods for High Dimensional Linear Models.” Journal of Biometrics & Biostatistics 4 (S3): S1–005. https://doi.org/10.4172/2155-6180.S1-005.

Müller, Peter, Fernando Andrés Quintana, Alejandro Jara, and Tim Hanson. 2015. Bayesian Nonparametric Data Analysis. Springer. https://doi.org/10.1007/978-3-319-18968-0.