
Semiparametric Inference of the Complier Average Causal Effect with Nonignorable Missing Outcomes

Hua Chen (Institute of Applied Physics and Computational Mathematics, Beijing, 100088, China), Peng Ding (Department of Statistics, Harvard University, One Oxford Street, Cambridge, MA 02138, USA; corresponding author's email: pengding@fas.harvard.edu), Zhi Geng (School of Mathematical Sciences, Peking University, Beijing, 100871, China), and Xiaohua Zhou (Department of Biostatistics, University of Washington, and Biostatistics Unit, HSR&D Center of Excellence, VA Puget Sound Health Care System, Seattle, Washington 98101, USA)
Abstract

Noncompliance and missing data often occur in randomized trials, complicating the inference of causal effects. When both are present, previous papers proposed moment and maximum likelihood estimators for binary and normally distributed continuous outcomes under the latent ignorable missing data mechanism. However, the latent ignorable missing data mechanism may be violated in practice, because the missingness may depend directly on the missing outcome itself. Under noncompliance and an outcome-dependent nonignorable missing data mechanism, previous studies showed the identifiability of the complier average causal effect for discrete outcomes. In this paper, we study the semiparametric identifiability and estimation of the complier average causal effect in randomized clinical trials with both all-or-none noncompliance and outcome-dependent nonignorable missing continuous outcomes, and propose a two-step maximum likelihood estimator that eliminates the infinite-dimensional nuisance parameter. Our method does not require a parametric form for the missing data mechanism. We evaluate the finite-sample properties of our method via extensive simulation studies and sensitivity analyses, and apply it to a double-blinded psychiatric clinical trial.


Key Words: Causal inference; Instrumental variable; Missing not at random; Noncompliance; Outcome-dependent missing; Principal stratification.

1 Introduction

Randomization is an effective way to study the average causal effects (ACEs) of new drugs or training programs. However, randomized trials are often plagued with noncompliance and missing data, which complicate statistical inference and may bias estimation. The noncompliance problem happens when some subjects fail to comply with their assigned treatments, and the missing data problem happens when investigators fail to collect information for some subjects. Ignoring noncompliance and missing data may lead to biased estimators of the ACEs.

The noncompliance problem has attracted a lot of attention in the literature. Efron and Feldman (1991) studied the noncompliance problem before the principal stratification framework (Frangakis and Rubin, 2002) was proposed. In the presence of noncompliance, Balke and Pearl (1997) proposed large sample bounds of the ACEs for binary outcomes using the linear programming method. Angrist et al. (1996) discussed the identifiability of the causal effect using the instrumental variable method. Imbens and Rubin (1997) proposed a Bayesian method to estimate the complier average causal effect (CACE). When some outcomes are missing, the identifiability and estimation of the CACE are more complicated, and different types of missing data mechanisms have sizable impacts on the identifiability and estimation of the CACE. Frangakis and Rubin (1999) established the identifiability and proposed a moment estimator of the CACE under the latent ignorable (LI) missing data mechanism. Under the LI missing data mechanism, Zhou and Li (2006) and O'Malley and Normand (2005) proposed Expectation-Maximization (EM) algorithms (Dempster et al., 1977) to find the maximum likelihood estimators (MLEs) of the CACE for binary and normally distributed outcomes, respectively. Barnard et al. (2003) proposed a Bayesian approach to estimate the CACE with bivariate outcomes and covariate adjustment. Taylor and Zhou (2011) proposed a multiple imputation method to estimate the CACE for clustered encouragement design studies.

However, the LI assumption may be implausible in some clinical studies where the missing data mechanism may depend on the missing outcome. Chen et al. (2009) and Imai (2009) discussed the identifiability of the CACE for discrete outcomes under the outcome-dependent nonignorable (ODN) missing data mechanism. To the best of our knowledge, no published papers have studied the identifiability of the CACE for continuous outcomes under the ODN assumption. In this paper, we show that the CACE is semiparametrically identifiable under some regularity conditions, and propose estimation methods for the CACE with continuous outcomes under the ODN assumption. For our semiparametric method, we need only assume that the distribution of the outcomes belongs to the exponential family, without specifying a parametric form for the missing data mechanism.

This paper proceeds as follows. In Section 2, we discuss the notation and assumptions used in this paper and define the parameter of interest. In Section 3, we show the semiparametric identifiability and propose a two-step maximum likelihood estimator (TSMLE). In Section 4, we use several simulation studies to illustrate the finite sample properties of our proposed estimators and consider sensitivity analysis to assess the robustness of our estimation strategy. In Section 5, we analyze a double-blinded randomized clinical trial using the methods proposed in this paper. We conclude with a discussion and provide all proofs in the Appendices.

2 Notation and Assumptions

We consider a randomized trial with a continuous outcome. For the $i$-th subject, let $Z_i$ denote the randomized treatment assignment (1 for treatment and 0 for control). Let $D_i$ denote the treatment received (1 for treatment and 0 for control). When $Z_i \neq D_i$, there exists noncompliance. Let $Y_i$ denote the outcome variable. Let $R_i$ denote the missing data indicator of $Y_i$, i.e., $R_i=1$ if $Y_i$ is observed and $R_i=0$ if $Y_i$ is missing. First, we need to make the following fundamental assumption.

Assumption 1 (Stable unit treatment value assumption, SUTVA):

There is no interference between units, which means that the potential outcomes of one individual do not depend on the treatment status of other individuals (Rubin, 1980), and there is only one version of potential outcome of a certain treatment (Rubin, 1986).

Except in the dependent case for infectious diseases (Hudgens and Halloran, 2008), the SUTVA assumption is reasonable in many cases. Under the SUTVA assumption, we define $D_i(z)$, $Y_i(z)$ and $R_i(z)$ as the potential treatment received, the potential outcome measured, and the potential missing data indicator for subject $i$ if he/she were assigned to treatment $z$. These variables are potential outcomes because only one of the pairs $\{D_i(1),Y_i(1),R_i(1)\}$ and $\{D_i(0),Y_i(0),R_i(0)\}$ can be observed. Since $Z_i$ is the observed treatment assignment for subject $i$, $D_i=D_i(Z_i)$, $Y_i=Y_i(Z_i)$, and $R_i=R_i(Z_i)$ are the observed treatment received, the observed outcome, and the observed missing data indicator.

Under the principal stratification framework (Angrist et al., 1996; Frangakis and Rubin, 2002), we let $U_i$ be the compliance status of subject $i$, defined as follows:

$$U_i=\begin{cases}a,&\text{if }D_i(1)=1\text{ and }D_i(0)=1;\\ c,&\text{if }D_i(1)=1\text{ and }D_i(0)=0;\\ d,&\text{if }D_i(1)=0\text{ and }D_i(0)=1;\\ n,&\text{if }D_i(1)=0\text{ and }D_i(0)=0;\end{cases} \quad (5)$$

where $a$, $c$, $d$ and $n$ represent ``always-taker'', ``complier'', ``defier'' and ``never-taker'', respectively. Here $U_i$ is an unobserved variable, because we can observe only $D_i(1)$ or $D_i(0)$ for subject $i$, but not both. The CACE of $Z$ on $Y$ is the parameter of interest, defined as

$$CACE(Z\rightarrow Y)=E\{Y(1)-Y(0)\mid U=c\}.$$

The CACE is a subgroup causal effect for the compliers, whose compliance status is incompletely observed. Next, we give some sufficient conditions about the latent variables to make $CACE(Z\rightarrow Y)$ identifiable in the presence of noncompliance and nonignorable missing outcomes.

Assumption 2 (Randomization):

The treatment assignment $Z$ is completely randomized.

Randomization means that $Z$ is independent of $\{D(1),D(0),Y(1),Y(0),R(1),R(0)\}$, and we define $\xi=P\{Z=1\mid D(1),D(0),Y(1),Y(0),R(1),R(0)\}=P(Z=1)$. Under the randomization assumption, $CACE(Z\rightarrow Y)$ can be expressed as

$$CACE(Z\rightarrow Y)=E(Y\mid Z=1,U=c)-E(Y\mid Z=0,U=c).$$
Assumption 3 (Monotonicity):

$D_i(1)\geq D_i(0)$ for each subject $i$.

The monotonicity of $D_i(z)$ implies that there are no defiers. Define $\omega_u=P(U=u)$ for $u=a,c,d,n$; the monotonicity assumption implies $\omega_d=0$. Assumption 3 is plausible when the treatment assignment has a nonnegative effect on the treatment received for each subject, and it holds directly when the treatment is not available to subjects in the control arm, meaning $D_i(0)=0$ for all subjects. The monotonicity assumption implies a positive ACE of $Z$ on $D$. However, under general circumstances, Assumption 3 is not fully testable, since only one of $D_i(1)$ and $D_i(0)$ can be observed.

Assumption 4:

$ACE(Z\rightarrow D)\neq 0$.

By randomization, we have $ACE(Z\rightarrow D)=P(D=1\mid Z=1)-P(D=1\mid Z=0)\neq 0$ under Assumption 4, and therefore $Z$ is correlated with $D$. Without Assumption 4, we have $P(D=1\mid Z=1)=P(D=1\mid Z=0)$, which implies $\omega_c=0$ under Assumption 3. Since we are interested in the identifiability of $CACE(Z\rightarrow Y)$, Assumption 4 is necessary.

Assumption 5 (Compound exclusion restrictions):

For never-takers and always-takers, we assume $P\{Y(1),R(1)\mid U=n\}=P\{Y(0),R(0)\mid U=n\}$ and $P\{Y(1),R(1)\mid U=a\}=P\{Y(0),R(0)\mid U=a\}$.

The traditional exclusion restriction assumes $P\{Y(1)\mid U=u\}=P\{Y(0)\mid U=u\}$ for $u=a$ and $n$. Frangakis and Rubin (1999) extended it to the compound exclusion restrictions, imposing a similar assumption on the joint vector of the outcome and the missing data indicator. Assumption 5 is reasonable in a double-blinded clinical trial, because the patients do not know the treatment assigned to them, and thus $Z$ has no ``direct effect'' on the outcome or the missing data indicator. However, when the missing data indicator depends on the treatment assigned, the compound exclusion restrictions may be violated. When $Z$ is randomized, Assumption 5 is equivalent to $P(Y,R\mid Z=1,U=n)=P(Y,R\mid Z=0,U=n)$ and $P(Y,R\mid Z=1,U=a)=P(Y,R\mid Z=0,U=a)$.

Assumption 6 (Outcome-dependent nonignorable missing data):

For all $y$; $z=0,1$; $d=0,1$; and $u\in\{a,c,n\}$, we assume

$$P\{R(z)=1\mid Y(z)=y,D(z)=d,U=u\}=P\{R(z)=1\mid Y(z)=y\}, \quad (6)$$
$$P\{R(1)=1\mid Y(1)=y\}=P\{R(0)=1\mid Y(0)=y\}. \quad (7)$$

When $Z$ is randomized, equation (6) becomes $P(R=1\mid Y=y,D=d,U=u,Z=z)=P(R=1\mid Y=y,Z=z)$, and (7) becomes $P(R=1\mid Y=y,Z=1)=P(R=1\mid Y=y,Z=0)$. Define $\rho(y)=P(R=1\mid Y=y)$. Therefore Assumptions 2 and 6 imply that $\rho(y)=P(R=1\mid Y=y,D=d,U=u,Z=z)$. Hence $R$ depends on $Y$, but is independent of $(Z,D,U)$ given $Y$.
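To illustrate what the ODN restriction implies, the following sketch generates missingness from $\rho(y)$ alone and checks that, within a narrow window of $Y$, the observed response rate is essentially the same whether or not the treatment was received. The logistic form of $\rho(y)$ and all numerical values here are hypothetical choices for illustration; the method itself places no parametric form on $\rho$.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000
D = rng.binomial(1, 0.5, N)          # treatment received (hypothetical design)
Y = rng.normal(3 + 2 * D, 1.0)       # outcome depends strongly on D
rho = 1 / (1 + np.exp(-(Y - 3)))     # ODN: response probability depends on Y only
R = rng.binomial(1, rho)

# Conditional on Y (here, a narrow window around y = 3), the response
# rate no longer varies with D, even though marginally it does.
win = (Y > 2.8) & (Y < 3.2)
rate_d0 = R[win & (D == 0)].mean()
rate_d1 = R[win & (D == 1)].mean()
print(round(rate_d0, 2), round(rate_d1, 2))
```

Marginally, the $D=1$ arm has a much higher response rate (its outcomes are larger), but conditional on $Y$ the rates agree, which is exactly the statement $R \perp (Z,D,U) \mid Y$.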

In previous papers (Frangakis and Rubin, 1999; O'Malley and Normand, 2005; Zhou and Li, 2006), the LI assumption is used for modeling missing data; it states that the potential outcomes and the associated potential nonresponse indicators are independent within each principal stratum, that is, $P\{R(z)\mid Y(z),D(z),U\}=P\{R(z)\mid U\}$. Under the ODN missing data mechanism, the missing data indicator depends on the possibly missing outcome $Y$, which may be more reasonable than the LI assumption in some applications. For example, some patients may have higher probabilities of leaving the trial if their health outcomes are not good, and they may be more likely to stay in the trial otherwise. We illustrate the LI and ODN missing data mechanisms using the graphical models in Figure 1. Note that the arrows from $Z$ to $R$ are absent because of the compound exclusion restriction assumption.

[Figure 1 shows two directed graphs over $(U,Z,D,Y,R)$: in panel (a), under LI, $R$ depends on $U$ but not on $Y$; in panel (b), under ODN, $R$ depends on $Y$ only.]
(a) LI under Assumptions 2 and 5 (b) ODN under Assumptions 2 and 5
Figure 1: Graphical models for different missing data mechanisms

3 Semiparametric Identifiability and Estimation

In this section, we first discuss the difficulty of nonparametric identifiability when neither the outcome distribution nor the missing data mechanism takes a parametric form. If both the distribution of the outcome $Y$ and the missing data mechanism $\rho(y)$ are unspecified, the model is essentially not identifiable without further assumptions. We then propose a semiparametric method, specifying only the distribution of $Y$ without assuming any parametric form for the missing data mechanism. We show the identifiability and propose a TSMLE of $CACE(Z\rightarrow Y)$ under the assumption that the distribution of the outcome variable $Y$ belongs to the exponential family.

3.1 Semiparametric Identifiability

Under the SUTVA, randomization and monotonicity assumptions, we have $\xi=P(Z=1)$, $\omega_a=P(U=a)=P(D=1\mid Z=0)$, $\omega_n=P(U=n)=P(D=0\mid Z=1)$, and $\omega_c=1-\omega_a-\omega_n$. These parameters can be identified directly from the observed data. Next we focus on the identification of the parameters of $Y$.

Assumption 7:

The conditional density of the outcome variable $Y$ belongs to the following exponential family:

$$f_{zu}(y)=f(y\mid Z=z,U=u)=c(\theta_{zu})h(y)\exp\left\{\sum_{k=1}^{K}p_k(\theta_{zu})T_k(y)\right\}, \quad (8)$$

where $c(\cdot)$, $h(\cdot)$, $p_k(\cdot)$, and $T_k(\cdot)$ are known functions, and $\theta=\{\theta_{zu}: z=0,1;\ u=c,a,n\}$ are unknown parameters. We denote $f(y\mid Z=z,U=u)$ simply as $f_{zu}$ hereinafter.

The parametric assumption on the outcome is untestable in general, since the missing data mechanism depends arbitrarily on the outcome. But for binary outcomes, Small and Cheng (2009) proposed a goodness-of-fit test for the model under the ODN missing data mechanism. When the randomization assumption holds, the CACE is the difference between the expectations of the conditional densities of $Y$, that is,

$$CACE=E(Y\mid Z=1,U=c)-E(Y\mid Z=0,U=c)=\int yf_{1c}(y)\,dy-\int yf_{0c}(y)\,dy.$$

Hence if the parameters of $f_{zu}(y)$ are identified, the CACE is also identified. The exponential family defined by Assumption 7 includes many common distributions, such as normal distributions $N(\mu_{zu},\sigma^2)$, exponential distributions $Exp(\lambda_{zu})$ with mean $1/\lambda_{zu}$, Gamma distributions $Gamma(\alpha_{zu},\lambda)$ with shape parameter $\alpha_{zu}$ and rate parameter $\lambda$, and log-normal distributions $Lognormal(\mu_{zu},\sigma^2)$, for which the CACEs are $CACE_{nor}=\mu_{1c}-\mu_{0c}$, $CACE_{exp}=1/\lambda_{1c}-1/\lambda_{0c}$, $CACE_{gam}=\alpha_{1c}/\lambda-\alpha_{0c}/\lambda$, and $CACE_{log}=\exp\{\mu_{1c}+\sigma^2/2\}-\exp\{\mu_{0c}+\sigma^2/2\}$, respectively.
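As a quick numerical check of these formulas, here is a minimal sketch that evaluates them at the true complier parameters from Table 1 of the simulation study in Section 4:

```python
import math

# CACE for the four exponential-family models, at the complier
# parameters of Table 1 (mu_1c = 5, mu_0c = 4, etc.).
cace_nor = 5 - 4                      # mu_1c - mu_0c
cace_exp = 5 - 4                      # 1/lambda_1c - 1/lambda_0c, lambdas 1/5 and 1/4
cace_gam = 5 / 1 - 4 / 1              # alpha_1c/lambda - alpha_0c/lambda
cace_log = math.exp(0 + 1 / 2) - math.exp(-1 + 1 / 2)  # exp(mu_zc + sigma^2/2) terms
print(cace_nor, cace_exp, cace_gam, round(cace_log, 4))
```

The lognormal value, approximately 1.0422, agrees with the true value $CACE_{log}=1.0422$ reported in Table 2.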

Next, Theorem 1 shows the identifiability of $\theta$. The proof of Theorem 1 is provided in Appendix A. Assumption 5 implies $\theta_{1n}=\theta_{0n}$ and $\theta_{1a}=\theta_{0a}$, which we simplify as $\theta_n$ and $\theta_a$, respectively.

Theorem 1.

Under Assumptions 1 to 7, the vector $\eta=(p_1(\theta_n)-p_1(\theta_a),\ p_1(\theta_{1c})-p_1(\theta_a),\ p_1(\theta_{0c})-p_1(\theta_n),\ \ldots,\ p_K(\theta_n)-p_K(\theta_a),\ p_K(\theta_{1c})-p_K(\theta_a),\ p_K(\theta_{0c})-p_K(\theta_n),\ \log\{c(\theta_n)\}-\log\{c(\theta_a)\},\ \log\{c(\theta_{1c})\}-\log\{c(\theta_a)\},\ \log\{c(\theta_{0c})\}-\log\{c(\theta_n)\})$ is identifiable. If there exists a one-to-one mapping from the parameter set $\theta$ to the vector $\eta$, then $\theta$ is identifiable and so is the CACE.

The one-to-one mapping condition seems complicated, but it is reasonable and holds for many widely-used distributions, such as homoskedastic normal distributions, exponential distributions, etc. We will verify the one-to-one mapping condition for normal and exponential distributions in Appendix C and Appendix D. Other distributions such as heteroskedastic normal distributions, Gamma distributions and lognormal distributions can be verified similarly. However, counterexamples do exist, and we provide one in Appendix A.

3.2 TSMLE of CACE

Because we do not specify a parametric form for the missing data mechanism $\rho(y)$, the joint distribution of $(Z,U,D,Y,R)$ is not specified completely. Thus the MLEs of the parameters are hard to obtain, since the likelihood depends on the infinite-dimensional parameter $\rho(y)$, as shown in Appendix B. In this subsection, we propose a two-step likelihood method to estimate the parameters, which can be viewed as an example of the two-step maximum likelihood method studied by Murphy and Topel (2002).

In the first step, a consistent estimator for $\alpha=(\xi,\omega_a,\omega_n)$ can be obtained by maximum likelihood using the data $\{(Z_i,D_i): i=1,\ldots,N\}$. Let $N$ denote the sample size, $N_1=\#\{i: Z_i=1\}$, $N_0=\#\{i: Z_i=0\}$ and $n_{zd}=\#\{i: Z_i=z, D_i=d\}$ for $z=0,1$ and $d=0,1$. Then the log-likelihood function for $\alpha$ is

$$l_1(\alpha)=N_1\log\xi+N_0\log(1-\xi)+n_{11}\log(1-\omega_n)+n_{10}\log(\omega_n)+n_{01}\log(\omega_a)+n_{00}\log(1-\omega_a). \quad (9)$$

The MLE for $\alpha$ is $\hat{\alpha}=(\hat{\xi},\hat{\omega}_n,\hat{\omega}_a)=(N_1/N,\ n_{10}/N_1,\ n_{01}/N_0)$, which coincides with the moment estimator.
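The first step is just counting. A minimal sketch on hypothetical data, with $\xi=0.5$ and equal stratum probabilities chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4000
Z = rng.binomial(1, 0.5, N)                          # randomized assignment, xi = 0.5
U = rng.choice(["c", "n", "a"], size=N)              # strata with omega = 1/3 each
D = np.where(U == "a", 1, np.where(U == "n", 0, Z))  # monotonicity: no defiers

# first-step MLE, which equals the moment estimator
xi_hat = Z.mean()                                           # N1 / N
omega_n_hat = ((Z == 1) & (D == 0)).sum() / (Z == 1).sum()  # n10 / N1
omega_a_hat = ((Z == 0) & (D == 1)).sum() / (Z == 0).sum()  # n01 / N0
omega_c_hat = 1 - omega_n_hat - omega_a_hat
```

Note that only $(Z_i, D_i)$ enter this step; the outcomes and the missing data indicators play no role yet.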

In the second step, we propose a conditional likelihood method to estimate the parameter set $\theta$, based on the conditional probability of $(Z,D)$ given $Y$ and $R=1$. The proposed conditional likelihood function does not depend on the nuisance parameter $\rho(y)$, because the following equations (10) to (12) do not involve $\rho(y)$:

$$\begin{aligned}
&\log\left\{\frac{P(Z=1,D=0\mid Y=y,R=1)(1-\xi)}{P(Z=0,D=1\mid Y=y,R=1)\xi}\right\}\\
&\quad=\log\left\{\frac{P(U=n)f(y\mid Z=1,U=n)}{P(U=a)f(y\mid Z=0,U=a)}\right\}\\
&\quad=\sum_{k=1}^{K}\left\{p_k(\theta_n)-p_k(\theta_a)\right\}T_k(y)+\log\left\{\frac{\omega_n c(\theta_n)}{\omega_a c(\theta_a)}\right\},
\end{aligned} \quad (10)$$

$$\begin{aligned}
&\log\left\{\frac{P(Z=0,D=0\mid Y=y,R=1)\xi}{P(Z=1,D=0\mid Y=y,R=1)(1-\xi)}-1\right\}\\
&\quad=\log\left\{\frac{P(U=n)f(y\mid Z=1,U=n)+P(U=c)f(y\mid Z=0,U=c)}{P(U=n)f(y\mid Z=1,U=n)}-1\right\}\\
&\quad=\log\left\{\frac{P(U=c)f(y\mid Z=0,U=c)}{P(U=n)f(y\mid Z=1,U=n)}\right\}\\
&\quad=\sum_{k=1}^{K}\left\{p_k(\theta_{0c})-p_k(\theta_n)\right\}T_k(y)+\log\left\{\frac{\omega_c c(\theta_{0c})}{\omega_n c(\theta_n)}\right\},
\end{aligned} \quad (11)$$

$$\begin{aligned}
&\log\left\{\frac{P(Z=1,D=1\mid Y=y,R=1)(1-\xi)}{P(Z=0,D=1\mid Y=y,R=1)\xi}-1\right\}\\
&\quad=\log\left\{\frac{P(U=a)f(y\mid Z=1,U=a)+P(U=c)f(y\mid Z=1,U=c)}{P(U=a)f(y\mid Z=0,U=a)}-1\right\}\\
&\quad=\log\left\{\frac{P(U=c)f(y\mid Z=1,U=c)}{P(U=a)f(y\mid Z=0,U=a)}\right\}\\
&\quad=\sum_{k=1}^{K}\left\{p_k(\theta_{1c})-p_k(\theta_a)\right\}T_k(y)+\log\left\{\frac{\omega_c c(\theta_{1c})}{\omega_a c(\theta_a)}\right\}.
\end{aligned} \quad (12)$$

It is obvious that

$$\sum_{z=0}^{1}\sum_{d=0}^{1}P(Z=z,D=d\mid Y=y,R=1)=1. \quad (13)$$

The left-hand sides of equations (10) to (12) consist of $P(Z=z,D=d\mid Y=y,R=1)$ and $\xi$, with the latter identified from the first step. The right-hand sides of equations (10) to (12) consist of the parameters of interest. Therefore we can estimate $\theta$ through a likelihood method. Note that the right-hand sides do not depend on $\rho(y)$, so we do not need to specify the form of $\rho(y)$. Let $p_{zd}(\theta,\alpha;y)$ denote $P(Z=z,D=d\mid Y=y,R=1)$. Since $(Z,D)$ given $(Y=y,R=1)$ follows a multinomial distribution with four categories, the conditional log-likelihood function of $(Z,D)$ can be written as

$$l_2(\theta,\alpha)=\sum_{i=1}^{N}\sum_{z=0}^{1}\sum_{d=0}^{1}I(Z_i=z,D_i=d,R_i=1)\log p_{zd}(\theta,\alpha;y_i). \quad (14)$$

From the proof of Theorem 1, the parameter $\theta$ can be identified from the second likelihood function (14) after identifying $\alpha$ from the first likelihood function (9). Therefore, by maximizing $l_2(\theta,\hat{\alpha})$ over $\theta$, we obtain the maximizer $\hat{\theta}$. In practice, we can use the bootstrap method to approximate the sampling variance of the estimator of CACE.
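For homoskedastic normal outcomes, the whole two-step procedure can be sketched as below, on data simulated under the ``homo_normal'' design of Table 1. The optimizer, starting values and sample size are our choices for illustration, not part of the method; note that $\rho(y)$ cancels from $P(Z=z,D=d\mid Y=y,R=1)$, so the code never touches it.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
N = 4000

# --- data from the homo_normal design of Table 1 ---
Z = rng.binomial(1, 0.5, N)
U = rng.choice(["c", "n", "a"], size=N)
D = np.where(U == "a", 1, np.where(U == "n", 0, Z))
means = {("c", 1): 5.0, ("c", 0): 4.0, ("a", 1): 6.0,
         ("a", 0): 6.0, ("n", 1): 3.0, ("n", 0): 3.0}
Y = np.array([rng.normal(means[(u, z)], 1.0) for u, z in zip(U, Z)])
rho = np.where(Y <= 2, 0.85, np.where(Y >= 7, 0.80, 0.90))  # ODN mechanism
R = rng.binomial(1, rho)

# --- step 1: alpha-hat from (Z, D) alone ---
xi = Z.mean()
w_n = ((Z == 1) & (D == 0)).sum() / (Z == 1).sum()
w_a = ((Z == 0) & (D == 1)).sum() / (Z == 0).sum()
w_c = 1 - w_n - w_a

# --- step 2: maximize the conditional log-likelihood (14) over theta ---
y, z, d = Y[R == 1], Z[R == 1], D[R == 1]

def neg_l2(theta):
    mu_n, mu_a, mu_1c, mu_0c, log_s = theta
    s = np.exp(log_s)                      # common standard deviation
    f = lambda m: norm.pdf(y, m, s)
    # unnormalized P(Z=z, D=d | Y=y, R=1); rho(y) cancels in the ratio
    p10 = xi * w_n * f(mu_n)               # (Z=1, D=0): never-takers
    p11 = xi * (w_a * f(mu_a) + w_c * f(mu_1c))
    p01 = (1 - xi) * w_a * f(mu_a)         # (Z=0, D=1): always-takers
    p00 = (1 - xi) * (w_n * f(mu_n) + w_c * f(mu_0c))
    tot = p10 + p11 + p01 + p00
    p = np.where(z == 1, np.where(d == 1, p11, p10),
                 np.where(d == 1, p01, p00))
    return -np.sum(np.log(p / tot))

fit = minimize(neg_l2, x0=[2.5, 6.5, 5.5, 3.5, 0.0],
               method="Nelder-Mead", options={"maxiter": 5000})
cace_hat = fit.x[2] - fit.x[3]             # mu_1c - mu_0c; true value is 1.0
```

The bootstrap variance mentioned above would simply repeat both steps on resampled data.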

4 Simulation Studies and Sensitivity Analysis

We report simulation studies and sensitivity analyses to evaluate the finite-sample properties of the estimation methods proposed in this paper. In Tables 2-4, the columns labeled ``bias'', ``std. dev.'', ``95% CP'' and ``95% CI'' report the average bias, standard deviation, 95% coverage proportion and average 95% confidence interval, respectively.

First, we generate the outcomes under the ODN missing data mechanism from homoskedastic normal distributions (denoted ``homo_normal''), exponential distributions, Gamma distributions and log-normal distributions, respectively. We set the number of simulations to 10000, and choose sample sizes of 500, 1000, 2000 and 4000. We show the joint distributions of $(Z,U,D,Y,R)$ in Table 1, and report the results in Table 2. The estimates have small biases and standard deviations, which decrease as the sample sizes become larger, and all the confidence intervals of the CACE have empirical coverage proportions very close to their nominal values.

Table 1: True parameters for simulation
            homo_normal     Exponential   Gamma         Lognormal
Y(1)|U=c    N(5,1)          Exp(1/5)      Gamma(5,1)    Lognormal(0,1)
Y(0)|U=c    N(4,1)          Exp(1/4)      Gamma(4,1)    Lognormal(-1,1)
Y(z)|U=a    N(6,1)          Exp(1/6)      Gamma(6,1)    Lognormal(-1.5,1)
Y(z)|U=n    N(3,1)          Exp(1/3)      Gamma(3,1)    Lognormal(-0.5,1)
Z           Bernoulli(0.5)
U           $\omega_c=\omega_n=\omega_a=1/3$
R|Y=y       $\rho_y=I(y\leq 2)\times 0.85+I(y\geq 7)\times 0.8+I(2<y<7)\times 0.9$
Table 2: Results of simulation studies
true value            N     bias     std. dev.   95% CP   95% CI
$CACE_{nor}=1.0$      500   -0.0194  0.3395      0.9489   [0.3152, 1.6461]
                      1000  -0.0073  0.2343      0.9476   [0.5335, 1.4518]
                      2000  0.0022   0.1629      0.9500   [0.6828, 1.3215]
                      4000  -0.0019  0.1145      0.9504   [0.7736, 1.2225]
$CACE_{exp}=1.0$      500   0.0872   1.5910      0.9455   [-2.0312, 4.2056]
                      1000  0.0312   1.0309      0.9441   [-0.9893, 3.0517]
                      2000  0.0106   0.7091      0.9479   [-0.3793, 2.4004]
                      4000  0.0072   0.4891      0.9506   [0.0486, 1.9657]
$CACE_{gam}=1.0$      500   0.0830   1.6872      0.9915   [-2.2237, 4.3901]
                      1000  0.0284   0.5978      0.9625   [-0.1432, 2.2000]
                      2000  0.0108   0.3636      0.9493   [0.2981, 1.7235]
                      4000  0.0032   0.2530      0.9505   [0.5073, 1.4992]
$CACE_{log}=1.0422$   500   0.1156   0.7849      0.9617   [-0.3806, 2.6962]
                      1000  0.0599   0.4571      0.9494   [0.2061, 1.9981]
                      2000  0.0218   0.3093      0.9496   [0.4578, 1.6702]
                      4000  0.0106   0.2130      0.9469   [0.6353, 1.4702]

Second, we compare our method with the MLE proposed by O'Malley and Normand (2005) (``LI'' in Table 3) under five cases that violate either the homoskedastic normal outcome assumption or the ODN assumption. We repeat the simulation 10000 times with a sample size of 4000 in each case. The five cases, named ``Heter'', ``Unif'', ``T'', ``DY'' and ``DYU'', are shown in Table 3. The first case, ``Heter'', violates the homoskedastic normal outcome assumption. The next two cases, ``Unif'' and ``T'', violate the exponential family assumption. The last two cases, ``DY'' and ``DYU'', violate the ODN assumption. In the ``Heter'' case, we generate data from heteroskedastic normal outcomes; the data generating process is the same as ``homo_normal'' except that $Y(1)|U=c\sim N(5,0.25)$ and $Y(z)|U=a\sim N(6,0.30)$ for $z=0,1$. In the ``Unif'' case, the data are generated as in ``homo_normal'' except that the outcomes follow uniform distributions with $Y(1)|U=c\sim U[2,8]$, $Y(0)|U=c\sim U[1,7]$, $Y(z)|U=a\sim U[3,9]$ and $Y(z)|U=n\sim U[1,5]$, respectively. The data generating process in the ``T'' case is the same as ``homo_normal'' except that the outcomes follow $t$ distributions with the same means and 4 degrees of freedom. In the ``DY'' case, we generate data under a missing data mechanism depending on both $D$ and $Y$, choosing $P(R=1\mid D,Y)=0.8-I(Y>5)\times 0.5+I(D=1)\times 0.1-I(Y>5)I(D=0)\times 0.1$, with the other conditional distributions the same as ``homo_normal''. In the ``DYU'' case, we generate data with a missing data mechanism depending on $D$, $Y$ and $U$, choosing $P(R=1\mid D,Y,U)=(1+\exp\{5+0.1D-Y-0.1U\})^{-1}$, with the other conditional distributions the same as ``homo_normal'' and $U=1,2,3$ corresponding to $U=c,n,a$. From Table 3, we can see that the point estimator of our method is generally robust to four of the five violations of the assumptions. However, the results are worse in the ``Unif'' case, which has a large bias and a low 95% coverage proportion, and whose average 95% confidence interval does not cover the true value.

Table 3: Comparison of the methods assuming ODN and LI under five cases that violate the homoskedastic normal outcome assumption or the ODN assumption ($CACE_{true}=1.0$)
Method   Case   bias   std. dev.   95% CP   95% CI
ODN Heter -0.0268 0.0772 0.9363 [0.8220, 1.1244]
Unif -0.3815 0.1865 0.4676 [0.2530, 0.9841]
T -0.0350 0.1730 0.9427 [0.6960, 1.3740]
DY 0.0201 0.1555 0.9465 [0.7154, 1.3249]
DYU -0.0852 0.1691 0.9242 [0.5834, 1.2462]
LI Heter -0.0277 0.0677 0.9300 [0.8395, 1.1051]
Unif -0.8521 0.2474 0.0695 [-0.3369, 0.6327]
T 0.2244 0.1225 0.5577 [0.9843, 1.4646]
DY -0.0288 0.0894 0.9370 [0.7959, 1.1465]
DYU -0.1267 0.1321 0.8426 [0.6144, 1.1322]

Finally, we compare our method with the MLE proposed by O'Malley and Normand (2005) under the LI missing data mechanism (``LI'' in Table 4). We repeat the simulation 10000 times with a sample size of 4000 in each case. The data generating processes are the same as ``homo_normal'', but the missing data mechanisms are LI. Denote $\gamma_{du}=P(R=1\mid D=d,U=u)$, and choose $(\gamma_{1c},\gamma_{0c},\gamma_{0n},\gamma_{1a})=(0.8,0.75,0.7,0.9)$, $(0.9,0.7,0.8,0.7)$, $(0.7,0.6,0.6,0.8)$ and $(0.6,0.7,0.9,0.7)$ for ``LI1'' to ``LI4'', respectively, as shown in rows 1-4 and 5-8 of Table 4. Since the missing data mechanisms are LI, the ``LI'' method exhibits very small biases. Although the assumptions required by the ``ODN'' method do not hold, the biases are not very large except for the missing data mechanism LI4. The last case, LI4, has the largest variability among the $\gamma_{du}$'s and thus the largest bias for estimating the CACE, since its missing data mechanism has the ``strongest'' dependence on $D$ and $U$ but not $Y$. Next we generate data under the ODN assumption, and compare the methods under both the ODN and LI assumptions. Let $\rho(y;\delta)=I(y\leq 2)\times(0.9-\delta)+I(y\geq 7)\times(0.9-2\delta)+I(2<y<7)\times 0.9$, where $0<\delta<0.9$. As $\delta$ increases, the relationship between $Y$ and $R$ becomes stronger. The data are generated from the same joint distribution as ``homo_normal'' except for the different $\rho(y;\delta)$. The results are shown in Figure 2. The method under the ODN missing data mechanism has small bias and promising coverage properties irrespective of $\delta$, but the method under the LI missing data mechanism has larger bias and poorer coverage as $\delta$ grows.
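The sensitivity family $\rho(y;\delta)$ is a one-parameter deformation of the step mechanism in Table 1: $\delta=0.05$ recovers Table 1's response probabilities $(0.85, 0.80, 0.90)$, and $\delta=0$ makes the response probability constant in $y$. A minimal helper:

```python
import numpy as np

def rho(y, delta):
    """ODN response probability rho(y; delta); larger delta makes
    R depend more strongly on Y."""
    y = np.asarray(y, dtype=float)
    return np.where(y <= 2, 0.9 - delta,
                    np.where(y >= 7, 0.9 - 2 * delta, 0.9))

print(rho([1.0, 8.0, 5.0], 0.05))
```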

Table 4: Comparison of the methods assuming ODN and LI when LI holds ($CACE_{true}=1.0$)
Method   Case   bias   std. dev.   95% CP   95% CI
ODN LI1 -0.0291 0.1236 0.9438 [0.7287, 1.2132]
LI2 0.0976 0.1452 0.8961 [0.8130, 1.3823]
LI3 -0.0569 0.1371 0.9332 [0.6745, 1.2118]
LI4 -0.1961 0.1152 0.5966 [0.5781, 1.0297]
LI LI1 -0.0013 0.1123 0.9491 [0.7785, 1.2189]
LI2 -0.0010 0.1099 0.9504 [0.7836, 1.2145]
LI3 -0.0008 0.1262 0.9502 [0.7519, 1.2465]
LI4 -0.0015 0.1290 0.9505 [0.7456, 1.2515]
Figure 2: Comparison of the methods assuming ODN and LI when ODN holds ($CACE_{true}=1.0$)
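The effect of the outcome-dependent response probability $\rho(y;\delta)$ on naive complete-case estimates can be illustrated with a small simulation. The sketch below is only a simplified stand-in for the paper's data generating process (it uses a single hypothetical normal outcome rather than a mixture over compliance strata): it draws outcomes, deletes them according to $\rho(y;\delta)$, and shows that the complete-case mean drifts once $\delta>0$.

```python
import numpy as np

rng = np.random.default_rng(0)

def rho(y, delta):
    # Step-function response probability depending only on the outcome y:
    # 0.9 - delta for y <= 2, 0.9 - 2*delta for y >= 7, and 0.9 otherwise.
    return np.where(y <= 2, 0.9 - delta,
                    np.where(y >= 7, 0.9 - 2 * delta, 0.9))

def complete_case_mean(mu, sigma, delta, n=100_000):
    """Mean of the observed outcomes when R | Y = y is Bernoulli(rho(y; delta))."""
    y = rng.normal(mu, sigma, size=n)
    observed = rng.random(n) < rho(y, delta)  # R = 1 with probability rho(y)
    return y[observed].mean()

# delta = 0 gives MCAR, so the complete-case mean is unbiased; as delta grows,
# large outcomes are deleted more often and the complete-case mean is pulled down.
for delta in (0.0, 0.2, 0.4):
    print(delta, round(complete_case_mean(mu=4.5, sigma=2.0, delta=delta), 2))
```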

5 Application

We use the new methods proposed in this paper to re-analyze a psychiatric clinical trial: a double-blinded randomized study comparing the relative effects of clozapine and haloperidol in adults with refractory schizophrenia at fifteen Veterans Affairs medical centers. Clozapine has been found to be more efficacious than standard drugs in patients with refractory schizophrenia, yet it is associated with potentially fatal agranulocytosis. One objective of the trial is to evaluate the clinical effects of the two antipsychotic medications. The dataset has been analysed by Rosenheck et al. (1997), Levy et al. (2004), and O'Malley and Normand (2005). Some summary statistics of the data are given in Table 5; more details about the trial can be found in Rosenheck et al. (1997) and O'Malley and Normand (2005). In the treatment arm, 203 patients are randomized to clozapine; in the control arm, 218 patients are randomized to haloperidol. The outcome of interest is the positive and negative syndrome score (PANSS), with higher values indicating more severe symptoms. The baseline PANSS is nearly balanced between the two groups. The missing outcome proportions differ noticeably between the clozapine group (about $40/203\approx 0.20$) and the haloperidol group (about $59/218\approx 0.27$), so it is possible that the outcomes are not missing at random. The primary reasons for dropout in the clozapine group are side effects or non-drug-related reasons, whereas the reasons for discontinuing haloperidol are lack of efficacy or worsening of symptoms. Therefore, the missing data mechanism may depend directly on the missing outcome, and we think that the ODN assumption is more reasonable in this case.

Table 5: Summary statistics of the data from the psychiatric clinical trial
Received treatment
Assigned treatment Clozapine ($D=1$) Haloperidol ($D=0$) Total
Clozapine ($Z=1$)
    Sample size 122 81 203
    Missing sample size 0 40 40
    Mean of the baseline PANSS 90.83 91.20 90.98
Haloperidol ($Z=0$)
    Sample size 57 161 218
    Missing sample size 12 47 59
    Mean of the baseline PANSS 96.30 90.69 92.16
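Under randomization and monotonicity (no defiers), the compliance-strata proportions $\omega_{n}$, $\omega_{a}$ and $\omega_{c}$ can be estimated by simple moments from the counts in Table 5: never-takers appear as $D=0$ in the $Z=1$ arm, and always-takers as $D=1$ in the $Z=0$ arm. A minimal sketch:

```python
# Moment estimates of compliance-strata proportions from Table 5,
# assuming randomization and monotonicity (no defiers).
n_z1, n_z1_d0 = 203, 81   # assigned clozapine; of those, received haloperidol
n_z0, n_z0_d1 = 218, 57   # assigned haloperidol; of those, received clozapine

omega_n = n_z1_d0 / n_z1          # P(never-taker):  D = 0 despite Z = 1
omega_a = n_z0_d1 / n_z0          # P(always-taker): D = 1 despite Z = 0
omega_c = 1 - omega_n - omega_a   # P(complier), by monotonicity

print(round(omega_n, 3), round(omega_a, 3), round(omega_c, 3))
```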

The estimates of $CACE$ by different methods are shown in Table 6. In Table 6, the ``homo'' and ``hetero'' in parentheses after ``ODN'' correspond to the homoskedastic and heteroskedastic model assumptions, respectively, and ``LI'' corresponds to the MLE proposed by O'Malley and Normand (2005). The columns of Table 6 give the methods, point estimates, standard errors, and 95% and 90% confidence intervals, respectively. The bootstrap is used to compute standard errors and confidence intervals for all methods. From Table 6, we see that subjects in the clozapine group had symptom levels 5.00 points lower than those in the haloperidol group under the homoskedastic assumption, and 5.54 points lower under the heteroskedastic assumption. Both methods lead to the similar conclusion that clozapine is somewhat more effective than haloperidol for patients with refractory schizophrenia. Neither of the semiparametric estimates is significant at the 5% level, since both 95% confidence intervals include zero. Our results are similar to those of O'Malley and Normand (2005), who also obtained an insignificant result. However, the 90% confidence intervals differ: the result under the ODN mechanism with the heteroskedastic assumption is significant, but the results from the other two models are not.

Assuming different missing data mechanisms, such as LI and ODN, and different assumptions on the outcome variable, we obtain different point estimates and confidence intervals for $CACE$ from the psychiatric clinical trial data. When we have prior knowledge that the missing data mechanism depends only on the treatment received and the compliance status, the method under the LI missing data mechanism provides a more credible conclusion. However, when we have prior knowledge that the missing data mechanism depends directly on the outcome, we recommend our methods under the ODN missing data mechanism. The newly proposed methods can also serve as alternatives to the predominant methods assuming the LI missing data mechanism in sensitivity analysis.

Table 6: Estimates of $CACE$ by different methods
Method Estimate S.E. 95% CI 90% CI
ODN(homo) -5.00 3.05 [-10.98, 0.98] [-10.02, 0.02]
ODN(hetero) -5.54 3.05 [-11.52, 0.44] [-10.56, -0.52]
LI -4.36 4.35 [-12.89, 4.17] [-11.52, 2.80]
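The bootstrap used for the standard errors and confidence intervals in Table 6 can be sketched generically: resample subjects with replacement, re-estimate on each resample, and take the empirical standard deviation and percentiles. The function below is an illustrative implementation, not the code used in the paper, and the toy estimator (a sample mean) is a placeholder standing in for a CACE estimator.

```python
import numpy as np

def bootstrap_ci(data, estimator, n_boot=2000, alpha=0.05, seed=0):
    """Nonparametric bootstrap SE and percentile CI for a generic estimator.

    `data` is an (n, k) array of subject-level records (e.g. Z, D, R, Y);
    `estimator` maps such an array to a scalar estimate.
    """
    rng = np.random.default_rng(seed)
    n = len(data)
    # Resample rows with replacement and re-apply the estimator.
    stats = np.array([estimator(data[rng.integers(0, n, size=n)])
                      for _ in range(n_boot)])
    se = stats.std(ddof=1)
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return se, (lo, hi)

# Toy check with the sample mean as the estimator:
toy = np.random.default_rng(1).normal(0.0, 1.0, size=(500, 1))
se, (lo, hi) = bootstrap_ci(toy, lambda d: d[:, 0].mean())
```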

6 Discussion

Randomization is a powerful tool for measuring the relative causal effect of a treatment versus a control. Some subjects in randomized trials, however, may fail to comply with the assigned treatments or may drop out before the final outcomes are measured. Noncompliance and missing data make causal inference difficult, because the causal effects are not identifiable without additional assumptions. Under different assumptions about the missing data mechanism of the outcomes, the identifiability results and estimation methods may be fundamentally different. Most previous studies (Frangakis and Rubin, 1999; Barnard et al., 2003; O'Malley and Normand, 2005; Zhou and Li, 2006) rely on the LI assumption to identify the $CACE$, but the LI assumption may not be reasonable when the missing data mechanism depends on the outcome. Under the ODN missing data mechanism, Chen et al. (2009) and Imai (2009) showed identifiability and proposed the moment estimator and the MLE of the $CACE$ for discrete outcomes, but there were no results for continuous outcomes under both noncompliance and the ODN missing data mechanism. As a generalization of Chen et al. (2009) and Imai (2009), we study semiparametric identifiability and propose estimation methods for the $CACE$ with continuous outcomes under the ODN missing data mechanism.

The ODN assumption allows the missing data mechanism to depend on the outcome. However, the missing data processes in practical problems may be more complicated and may depend on other variables such as $Z$, $U$ and $D$. For example, a missing data mechanism depending on both the compliance status and the outcome may be reasonable in some real studies. Small and Cheng (2009) proposed a saturated model for $P(R=1\mid Z,U,Y)$, and the models under LI and ODN are special cases of their model. However, their model is generally not identifiable without restrictions on the parameters. It is worthwhile to study the identifiability of the $CACE$ under all possible restrictions on $P(R=1\mid Z,U,Y)$ and to perform sensitivity analysis for models lacking identifiability. We consider only cross-sectional data in this paper; generalizing our methods to longitudinal data is a topic for future research.

Acknowledgments

We would like to thank the Editor, the Associate Editor and three reviewers for their very valuable comments and suggestions. Chen's research was supported in part by NSFC 11101045, CAEP 2012A0201011 and CAEP 2013A0101004. Geng's research was supported by NSFC 11021463, 10931002 and 11171365. Zhou's research was supported in part by Department of Veterans Affairs HSR&D RCS Award 05-196; it does not necessarily represent the views of the VA HSR&D Service.

Appendices

Appendix A

Proof of Theorem 3.1. From equations (3.5) to (3.7), we can identify $(p_{1}(\theta_{n})-p_{1}(\theta_{a}), p_{1}(\theta_{1c})-p_{1}(\theta_{a}), p_{1}(\theta_{0c})-p_{1}(\theta_{n}), \ldots, p_{K}(\theta_{n})-p_{K}(\theta_{a}), p_{K}(\theta_{1c})-p_{K}(\theta_{a}), p_{K}(\theta_{0c})-p_{K}(\theta_{n}), \log\left\{c(\theta_{n})\right\}-\log\left\{c(\theta_{a})\right\}, \log\left\{c(\theta_{1c})\right\}-\log\left\{c(\theta_{a})\right\}, \log\left\{c(\theta_{0c})\right\}-\log\left\{c(\theta_{n})\right\})$ using generalized linear models. Therefore, $\theta$ is identifiable because of the one-to-one mapping.

Counterexample for identifiability. Consider the following exponential family:

\[
f_{zu}(y)=f(y\mid Z=z,U=u)=c(\theta_{zu})h(y)\exp\left\{\sum_{k=1}^{K}\theta_{zu,k}y^{k}\right\}.
\]

The number of unknown parameters contained in $\theta$ is $4K$, and the number of identifiable parameters contained in $\eta$ is $3(K+1)$. A necessary condition for the existence of a one-to-one mapping from $\theta$ to $\eta$ is $4K\leq 3(K+1)$, or, equivalently, $K\leq 3$. Therefore, when $K>3$, a one-to-one mapping from $\theta$ to $\eta$ does not exist.
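The counting argument can be checked directly:

```python
# Necessary condition for a one-to-one mapping from theta (4K parameters)
# to eta (3(K+1) identifiable parameters): 4K <= 3(K+1), i.e. K <= 3.
feasible = {K: 4 * K <= 3 * (K + 1) for K in range(1, 7)}
# K = 1, 2, 3 satisfy the condition; K >= 4 do not.
```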

Appendix B: Full likelihood for $(\alpha,\theta,\rho(y))$

Define $M_{zd}=\#\{i:Z_{i}=z,D_{i}=d,R_{i}=0\}$ for $z=0,1$ and $d=0,1$. Under the compound exclusion restriction, we have $f_{1n}(y)=f_{0n}(y)=f_{n}(y)$ and $f_{1a}(y)=f_{0a}(y)=f_{a}(y)$. The full likelihood for $(\alpha,\theta,\rho(y))$ is

\begin{align*}
L(\alpha,\theta,\rho(y)) \propto{}& \xi^{N_{1}}(1-\xi)^{N_{0}}(1-\omega_{n})^{n_{11}}\omega_{n}^{n_{10}}\omega_{a}^{n_{01}}(1-\omega_{a})^{n_{00}}\\
&\cdot\left[\frac{\omega_{c}}{\omega_{c}+\omega_{a}}\int\{1-\rho(y)\}f_{1c}(y)dy+\frac{\omega_{a}}{\omega_{c}+\omega_{a}}\int\{1-\rho(y)\}f_{a}(y)dy\right]^{M_{11}}\left[\int\{1-\rho(y)\}f_{n}(y)dy\right]^{M_{10}}\\
&\cdot\left[\int\{1-\rho(y)\}f_{a}(y)dy\right]^{M_{01}}\left[\frac{\omega_{c}}{\omega_{c}+\omega_{n}}\int\{1-\rho(y)\}f_{0c}(y)dy+\frac{\omega_{n}}{\omega_{c}+\omega_{n}}\int\{1-\rho(y)\}f_{n}(y)dy\right]^{M_{00}}\\
&\cdot\prod_{i:R_{i}=1}\rho(Y_{i})\cdot\prod_{i:(Z_{i},D_{i},R_{i})=(1,1,1)}\left\{\frac{\omega_{c}}{\omega_{c}+\omega_{a}}f_{1c}(Y_{i})+\frac{\omega_{a}}{\omega_{c}+\omega_{a}}f_{a}(Y_{i})\right\}\prod_{i:(Z_{i},D_{i},R_{i})=(1,0,1)}f_{n}(Y_{i})\\
&\cdot\prod_{i:(Z_{i},D_{i},R_{i})=(0,1,1)}f_{a}(Y_{i})\cdot\prod_{i:(Z_{i},D_{i},R_{i})=(0,0,1)}\left\{\frac{\omega_{c}}{\omega_{c}+\omega_{n}}f_{0c}(Y_{i})+\frac{\omega_{n}}{\omega_{c}+\omega_{n}}f_{n}(Y_{i})\right\}.
\end{align*}

Appendix C: Verification of homoskedastic normal distribution in Subsection 3.2

For homoskedastic normal outcomes, equations (3.5) to (3.7) can be re-written as:

\begin{align*}
a_{1}y+b_{1} &= \log\left\{\frac{P(Z=1,D=0\mid Y=y,R=1)(1-\xi)}{P(Z=0,D=1\mid Y=y,R=1)\xi}\right\}\\
&= \frac{\mu_{n}-\mu_{a}}{\sigma^{2}}y+\frac{\mu_{a}^{2}-\mu_{n}^{2}}{2\sigma^{2}}+\log\left\{\frac{\omega_{n}}{\omega_{a}}\right\},\\
a_{2}y+b_{2} &= \log\left\{\frac{P(Z=0,D=0\mid Y=y,R=1)\xi}{P(Z=1,D=0\mid Y=y,R=1)(1-\xi)}-1\right\}\\
&= \frac{\mu_{0c}-\mu_{n}}{\sigma^{2}}y+\frac{\mu_{n}^{2}-\mu_{0c}^{2}}{2\sigma^{2}}+\log\left\{\frac{\omega_{c}}{\omega_{n}}\right\},\\
a_{3}y+b_{3} &= \log\left\{\frac{P(Z=1,D=1\mid Y=y,R=1)(1-\xi)}{P(Z=0,D=1\mid Y=y,R=1)\xi}-1\right\}\\
&= \frac{\mu_{1c}-\mu_{a}}{\sigma^{2}}y+\frac{\mu_{a}^{2}-\mu_{1c}^{2}}{2\sigma^{2}}+\log\left\{\frac{\omega_{c}}{\omega_{a}}\right\}.
\end{align*}

Since $a_{i}$ and $b_{i}$ can be identified from generalized linear models, we can identify all the parameters from the above equations and obtain the following results:

\begin{align*}
\sigma^{2} &= 2\left[\frac{b_{1}-\log\left\{\frac{\omega_{n}}{\omega_{a}}\right\}}{a_{1}}-\frac{b_{2}-\log\left\{\frac{\omega_{c}}{\omega_{n}}\right\}}{a_{2}}\right]\Big/(a_{1}+a_{2})\\
&= 2\left[\frac{b_{1}-\log\left\{\frac{\omega_{n}}{\omega_{a}}\right\}}{a_{1}}-\frac{b_{3}-\log\left\{\frac{\omega_{c}}{\omega_{a}}\right\}}{a_{3}}\right]\Big/(a_{3}-a_{1}),\\
\mu_{1c} &= \frac{1}{2}a_{3}\sigma^{2}-\frac{b_{3}-\log\left\{\frac{\omega_{c}}{\omega_{a}}\right\}}{a_{3}},\\
\mu_{0c} &= \frac{1}{2}a_{2}\sigma^{2}-\frac{b_{2}-\log\left\{\frac{\omega_{c}}{\omega_{n}}\right\}}{a_{2}}.
\end{align*}

Therefore, we can identify $CACE=\mu_{1c}-\mu_{0c}$.
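The identification formulas above can be checked numerically: starting from hypothetical true parameters (values chosen for illustration, not from the paper's data), compute the regression coefficients $a_{i}$, $b_{i}$ implied by the homoskedastic normal model, then recover $\sigma^{2}$, $\mu_{1c}$ and $\mu_{0c}$.

```python
import math

# Hypothetical true parameters, chosen only to check the formulas.
mu_n, mu_a, mu_1c, mu_0c = 3.0, 5.0, 6.0, 4.0
sigma2 = 2.0
w_c, w_n, w_a = 0.5, 0.3, 0.2

# Regression coefficients implied by the homoskedastic normal model:
a1 = (mu_n - mu_a) / sigma2
b1 = (mu_a**2 - mu_n**2) / (2 * sigma2) + math.log(w_n / w_a)
a2 = (mu_0c - mu_n) / sigma2
b2 = (mu_n**2 - mu_0c**2) / (2 * sigma2) + math.log(w_c / w_n)
a3 = (mu_1c - mu_a) / sigma2
b3 = (mu_a**2 - mu_1c**2) / (2 * sigma2) + math.log(w_c / w_a)

# Recover the parameters via the identification formulas:
sigma2_hat = 2 * ((b1 - math.log(w_n / w_a)) / a1
                  - (b2 - math.log(w_c / w_n)) / a2) / (a1 + a2)
mu_1c_hat = 0.5 * a3 * sigma2_hat - (b3 - math.log(w_c / w_a)) / a3
mu_0c_hat = 0.5 * a2 * sigma2_hat - (b2 - math.log(w_c / w_n)) / a2
cace_hat = mu_1c_hat - mu_0c_hat  # recovers mu_1c - mu_0c
```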

Appendix D: Verification of the exponential distribution in Subsection 3.2

For the exponentially distributed outcomes, equations (3.5) to (3.7) can be re-written as:

\begin{align*}
a_{1}y+b_{1} &= \log\left\{\frac{P(Z=1,D=0\mid Y=y,R=1)(1-\xi)}{P(Z=0,D=1\mid Y=y,R=1)\xi}\right\}\\
&= (\lambda_{a}-\lambda_{n})y+\log\left\{\frac{\omega_{n}\lambda_{n}}{\omega_{a}\lambda_{a}}\right\},\\
a_{2}y+b_{2} &= \log\left\{\frac{P(Z=0,D=0\mid Y=y,R=1)\xi}{P(Z=1,D=0\mid Y=y,R=1)(1-\xi)}-1\right\}\\
&= (\lambda_{n}-\lambda_{0c})y+\log\left\{\frac{\omega_{c}\lambda_{0c}}{\omega_{n}\lambda_{n}}\right\},\\
a_{3}y+b_{3} &= \log\left\{\frac{P(Z=1,D=1\mid Y=y,R=1)(1-\xi)}{P(Z=0,D=1\mid Y=y,R=1)\xi}-1\right\}\\
&= (\lambda_{a}-\lambda_{1c})y+\log\left\{\frac{\omega_{c}\lambda_{1c}}{\omega_{a}\lambda_{a}}\right\}.
\end{align*}

Since $a_{i}$ and $b_{i}$ can be identified from generalized linear models, we can identify all the parameters from the above equations and obtain the following results:

\begin{align*}
\lambda_{0c} &= \frac{a_{1}\exp\{b_{1}-\log\{\omega_{n}/\omega_{a}\}\}}{1-\exp\{b_{1}-\log\{\omega_{n}/\omega_{a}\}\}}-a_{2},\\
\lambda_{1c} &= \frac{a_{1}}{1-\exp\{b_{1}-\log\{\omega_{n}/\omega_{a}\}\}}-a_{3},\\
\text{and}\quad CACE &= \frac{1}{\lambda_{1c}}-\frac{1}{\lambda_{0c}}.
\end{align*}
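As in the normal case, these formulas can be checked numerically with hypothetical true rates (values chosen for illustration, not from the paper): note that $\exp\{b_{1}-\log(\omega_{n}/\omega_{a})\}=\lambda_{n}/\lambda_{a}$, which drives both recoveries.

```python
import math

# Hypothetical true rates, chosen only to check the formulas.
lam_n, lam_a, lam_1c, lam_0c = 0.5, 1.0, 0.8, 1.2
w_c, w_n, w_a = 0.5, 0.3, 0.2

# Regression coefficients implied by the exponential model:
a1 = lam_a - lam_n
b1 = math.log(w_n * lam_n / (w_a * lam_a))
a2 = lam_n - lam_0c
a3 = lam_a - lam_1c

r = math.exp(b1 - math.log(w_n / w_a))  # equals lam_n / lam_a
lam_0c_hat = a1 * r / (1 - r) - a2
lam_1c_hat = a1 / (1 - r) - a3
cace_hat = 1 / lam_1c_hat - 1 / lam_0c_hat  # recovers 1/lam_1c - 1/lam_0c
```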

References

  • Angrist et al. (1996) J. D. Angrist, G. W. Imbens, and D. B. Rubin. Identification of causal effects using instrumental variables (with discussion). Journal of the American Statistical Association, 91:444–455, 1996.
  • Balke and Pearl (1997) A. Balke and J. Pearl. Bounds on treatment effects from studies with imperfect compliance. Journal of the American Statistical Association, 92:1171–1176, 1997.
  • Barnard et al. (2003) J. Barnard, C. E. Frangakis, J. L. Hill, and D. B. Rubin. Principal stratification approach to broken randomized experiments: A case study of school choice vouchers in New York City (with discussion). Journal of the American Statistical Association, 98:299–314, 2003.
  • Chen et al. (2009) H. Chen, Z. Geng, and X. H. Zhou. Identifiability and estimation of causal effects in randomized trials with noncompliance and completely non-ignorable missing data (with discussion). Biometrics, 65:675–682, 2009.
  • Dempster et al. (1977) A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society: Series B (Statistical Methodology), 39:1–38, 1977.
  • Efron and Feldman (1991) B. Efron and D. Feldman. Compliance as an explanatory variable in clinical trials (with discussion). Journal of the American Statistical Association, 86:9–17, 1991.
  • Frangakis and Rubin (1999) C. E. Frangakis and D. B. Rubin. Addressing complications of intention-to-treat analysis in the combined presence of all-or-none treatment-noncompliance and subsequent missing outcomes. Biometrika, 86:365–379, 1999.
  • Frangakis and Rubin (2002) C. E. Frangakis and D. B. Rubin. Principal stratification in causal inference. Biometrics, 58:21–29, 2002.
  • Hudgens and Halloran (2008) M. G. Hudgens and M. E. Halloran. Toward causal inference with interference. Journal of the American Statistical Association, 103:832–842, 2008.
  • Imai (2009) K. Imai. Statistical analysis of randomized experiments with non-ignorable missing binary outcomes: an application to a voting experiment. Journal of the Royal Statistical Society: Series C (Applied Statistics), 58:83–104, 2009.
  • Imbens and Rubin (1997) G. W. Imbens and D. B. Rubin. Bayesian inference for causal effects in randomized experiments with noncompliance. The Annals of Statistics, 25:305–327, 1997.
  • Levy et al. (2004) D. E. Levy, A. J. O'Malley, and S. L. T. Normand. Covariate adjustment in clinical trials with non-ignorable missing data and non-compliance. Statistics in Medicine, 23:2319–2339, 2004.
  • Murphy and Topel (2002) K.M. Murphy and R.H. Topel. Estimation and inference in two-step econometric models. Journal of Business and Economic Statistics, 20:88–97, 2002.
  • O'Malley and Normand (2005) A. J. O'Malley and S. L. T. Normand. Likelihood methods for treatment noncompliance and subsequent nonresponse in randomized trials. Biometrics, 61:325–334, 2005.
  • Rosenheck et al. (1997) R. Rosenheck, J. Cramer, W. Xu, J. Thomas, W. Henderson, L. Frisman, C. Fye, and D. Charney. A comparison of clozapine and haloperidol in hospitalized patients with refractory schizophrenia. New England Journal of Medicine, 337:809–815, 1997.
  • Rubin (1980) D. B. Rubin. Comment on ``Randomization analysis of experimental data: the Fisher randomization test'' by D. Basu. Journal of the American Statistical Association, 75:591–593, 1980.
  • Rubin (1986) D. B. Rubin. Comments on ``Statistics and causal inference'' by Paul Holland: Which ifs have causal answers. Journal of the American Statistical Association, 81:961–962, 1986.
  • Small and Cheng (2009) D. S. Small and J. Cheng. Discussion of ``Identifiability and estimation of causal effects in randomized trials with noncompliance and completely non-ignorable missing data'' by Hua Chen, Zhi Geng and Xiaohua Zhou. Biometrics, 65:682–685, 2009.
  • Taylor and Zhou (2011) L. Taylor and X. H. Zhou. Methods for clustered encouragement design studies with noncompliance and missing data. Biostatistics, 12:313–326, 2011.
  • Zhou and Li (2006) X. H. Zhou and S. M. Li. ITT analysis of randomized encouragement design studies with missing data. Statistics in Medicine, 25:2737–2761, 2006.