
Strategy to select most efficient RCT samples based on observational data

Wenqi Shi Department of Industrial Engineering, Tsinghua University Xi Lin Department of Statistics, University of Oxford
(Draft: )
Abstract

Randomized experiments can provide unbiased estimates of sample average treatment effects. However, estimates of population treatment effects can be biased when the experimental sample and the target population differ. In this case, the population average treatment effect can be identified by combining experimental and observational data. A good experiment design trumps all the analyses that come after. While most of the existing literature centers around improving analyses after RCTs, we instead focus on the design stage, fundamentally improving the efficiency of the combined causal estimator through the selection of experimental samples. We explore how the covariate distribution of RCT samples influences estimation efficiency and derive the optimal covariate allocation, which leads to the lowest variance. Our results show that the optimal allocation does not necessarily follow the exact distribution of the target cohort, but is instead adjusted by the conditional variability of potential outcomes. We formulate a metric to compare and choose from candidate RCT sample compositions. We also develop variations of our main results to cater for practical scenarios with various cost constraints and precision requirements. The ultimate goal of this paper is to provide practitioners with a clear and actionable strategy to select RCT samples that lead to efficient causal inference.

1 Introduction

1.1 Motivation

There is growing interest in combining observational and experimental data to draw causal conclusions (Hartman et al., 2015; Athey et al., 2020; Yang and Ding, 2020; Chen et al., 2021; Oberst et al., 2022; Rosenman et al., 2020). Experimental data from randomized controlled trials (RCTs) are considered the gold standard for causal inference and can provide unbiased estimates of average treatment effects. However, the scale of the experimental data is usually limited and the trial participants might not represent those in the target cohort. For example, the recruitment criteria for an RCT may prescribe that participants must be less than 65 years old and satisfy certain health criteria, whereas the target population considered for treatment may cover all age groups. This problem is known as the lack of transportability (Pearl and Bareinboim, 2011; Rudolph and van der Laan, 2017), generalizability (Cole and Stuart, 2010; Hernán and VanderWeele, 2011; Dahabreh and Hernán, 2019), representativeness (Campbell, 1957) and external validity (Rothwell, 2005; Westreich et al., 2019). By contrast, observational data usually has both the scale and the scope desired, but one can never prove that there is no hidden confounding. Any unmeasured confounding in the observational data may lead to a biased estimate of the causal effect. When it comes to estimating the causal effect in the target population, combining observational and experimental data provides an avenue to exploit the benefits of both.

Existing literature has proposed several methods of integrating RCT and observational data to address the issue of the RCT population not being representative of the target cohort. Kallus et al. (2018) considered the case where the supports do not fully overlap, and proposed a linear correction term to approximate the difference, caused by hidden confounding, between the causal estimates from observational data and experimental data.

Sometimes even though the domain of observational data overlaps with the experimental data, sub-populations with certain traits may be over- or under-represented in the RCT compared to the target cohort. This difference can lead to a biased estimate of the average treatment effect, and as a result, the causal conclusion may not be generalizable to the target population. In this case, reweighting the RCT population to make it applicable to the target cohort is a common choice of remedy (Hartman et al., 2015; Andrews and Oster, 2017). In particular, Inverse Probability of Sampling Weighting (IPSW) has been a popular estimator for reweighting (Cole and Hernán, 2008; Cole and Stuart, 2010; Stuart et al., 2011). In this paper, we base our theoretical results on the IPSW estimator.

1.2 Design Trumps Analysis

Most of the existing literature, including the work discussed above, focuses on the analysis stage after RCTs are completed, and proposes methods to analyse the data as given. This means that the analysis methods, including reweighting through IPSW, passively deal with the RCT data as they are. However, the quality of the causal inference is largely predetermined by the data collected. ‘Design trumps analysis’ (Rubin, 2008); a carefully designed experiment benefits the causal inference far more than the analysis that follows. Instead of marginally improving through analyses, we focus on developing a strategy for the design phase, specifically the selection of RCT participants with different characteristics, to fundamentally improve the causal inference.

When designing an RCT sample to draw causal conclusions on the target cohort, a heuristic strategy that practitioners tend to opt for is to construct an RCT sample that looks exactly like a miniature version of the target cohort. For example, suppose that we want to examine the efficacy of a drug on a target population consisting of 30% women and 70% men. If the budget allows us to recruit 100 RCT participants in total, then the intuition is to recruit exactly 30 females and 70 males. This intuition certainly works, yet, is it efficient? By efficiency, we refer to the efficiency of the reweighted causal estimator for the average treatment effect in the target population, and specifically, its variance.¹ (¹We note that the efficiency of an unbiased estimator $T$ is formally defined as $e(T)=\frac{\mathcal{I}^{-1}(\theta)}{\text{var}(T)}$, that is, the ratio of its lowest possible variance over its actual variance. For our purpose, we do not discuss the behaviour of the Fisher information of the data but rather focus on reducing the variance of the estimators. With slight abuse of terminology, in this paper, when we say that one estimator is more efficient than another, we mean that the variance of the former is lower. Similarly, we say that an RCT sample is more efficient if it eventually leads to an estimator of lower variance.)

In fact, we find that RCTs following the exact covariate distribution of the target cohort do not necessarily lead to the most efficient estimates after reweighting. Instead, our result suggests that the optimal covariate allocation of experiment samples is the target cohort distribution adjusted by the conditional variability of potential outcomes. Intuitively, this means that an optimal strategy is to sample more from the segments where the causal effect is more volatile or uncertain, even if they do not make up a large proportion of the target cohort.

1.3 Contributions

In this work, we focus on the common practice of generalizing the causal conclusions from an RCT to a target cohort. We aim at fundamentally improving the estimation efficiency by improving the selection of individuals into the trial, that is, the allocation of a certain number of places in the RCT to individuals with certain characteristics. We derive the optimal covariate allocation that minimizes the variance of the causal estimate for the target cohort. Practitioners can use this optimal allocation as a guide when they decide ‘who’ to recruit for the trial. We also formulate a deviation metric that quantifies how far a given RCT allocation is from optimal, which practitioners can use to choose among several candidate RCT allocations.

We develop variations of the main results to cater for various practical scenarios such as where the total number of participants in the trial is fixed, or the total recruitment cost is fixed while unit costs can differ, or with different precision requirements: best overall precision, equal segment precision or somewhere in between. In this paper, we provide practitioners with a clear strategy and versatile tools to select the most efficient RCT samples.

1.4 Outline

The remainder of this paper is organized as follows. In Section 2, we introduce the problem setting and notation, state the main assumptions, and provide more details on the IPSW estimator that we consider. In Section 3, we derive the optimal covariate allocation for RCT samples to improve estimation efficiency, propose a deviation metric to assess candidate experimental designs, and illustrate how this metric influences estimation efficiency. Section 4 provides an estimate of the optimal covariate allocation and the corresponding assumptions to ensure consistency. Section 5 extends the main results and proposes design strategies under other practical scenarios, such as heterogeneous unit costs and equal precision requirements. In Section 6, we use two numerical studies, a synthetic simulation and a semi-synthetic simulation with real-world data, to corroborate our theoretical results.

2 Setup, Assumptions and Estimators

2.1 Problem Setup and Assumptions

In this paper, we base our notation and assumptions on the potential outcome framework (Rubin, 1974). We assume access to two datasets: an RCT dataset and an observational dataset. We also assume that the target cohort of interest is contained in the observational data.

Define $S\in\{0,1\}$ as the sample indicator, where $s=1$ indicates membership of the experimental data and $s=0$ the target cohort. Let $T\in\{0,1\}$ be the treatment indicator, where $t=1$ indicates treatment and $t=0$ indicates control. Let $Y_{is}^{(t)}$ denote the potential outcome for a unit $i$ assigned to dataset $s$ and treatment $t$. We define $X$ as a set of observable pre-treatment variables, which can consist of discrete and/or continuous variables. Let $n_0$, $n_1$, and $n=n_0+n_1$ denote the number of units in the target cohort, the RCT, and the combined dataset, respectively. We use $f_1(x)$ and $f_0(x)$ to denote the distribution of $X$ in the RCT population and the target cohort, respectively.

The causal quantity of interest here is the average treatment effect (ATE) on the target population, denoted by $\tau$.

Definition 2.1.

(ATE on target cohort)

$$\tau := \mathbb{E}\left[Y^{(1)} - Y^{(0)} \mid S=0\right].$$

We also define the conditional average treatment effect (CATE) on the trial population, denoted by $\tau(x)$.

Definition 2.2.

(CATE on trial population)

$$\tau(x) := \mathbb{E}\left[Y^{(1)} - Y^{(0)} \mid X=x, S=1\right].$$

To ensure an unbiased estimator of the ATE on the target population after reweighting the estimates from the RCT, we need to make several standard assumptions.

Assumption 1.

(Identifiability of CATE in the RCT data) For all the observations in the RCT data, we assume the following conditions hold.

  • (i) Consistency: $Y_i = Y_{i1}^{(t)}$ when $T=t$ and $S=1$;

  • (ii) Ignorability: $Y_i^{(t)} \perp\!\!\!\perp T \mid (X, S=1)$;

  • (iii) Positivity: $0 < \mathbb{P}(T=t \mid X, S=1) < 1$ for all $t \in \{0,1\}$.

The ignorability condition assumes that the experimental data are unconfounded, and the positivity condition is guaranteed to hold in conditionally randomized experiments. The ignorability and positivity assumptions combined are also referred to as strong ignorability. Under Assumption 1, the causal effect conditioned on $X=x$ in the experimental sample can be estimated without bias using:

$$\hat{\tau}(x) = \frac{\sum_{S_i=1,\,X_i=x}\left(\frac{T_i Y_i}{e(x)} - \frac{(1-T_i)\,Y_i}{1-e(x)}\right)}{\sum_{S_i=1,\,X_i=x} 1},$$

where $e(x) = \mathbb{P}(T=1 \mid X=x, S=1)$ is the probability of treatment assignment in the experimental sample. This estimator is also known as the Horvitz-Thompson estimator (Horvitz and Thompson, 1952), on which we provide more details later in this section.

To make sure that we can ‘transport’ the effect from the experimental data to the target cohort, we make the following transportability assumption.

Assumption 2.

(Transportability) $Y^{(t)} \perp\!\!\!\perp S \mid (X, T=t)$.

Assumption 2 can be interpreted from several perspectives, as elaborated in Hernán and Robins (2010). First, it assumes that all the effect modifiers are captured by the set of observable covariates $X$. Second, it also ensures that the treatment $T$ is the same across the two datasets. If the assigned treatment differs between the study population and the target population, then the magnitude of the causal effect of treatment will differ too. Lastly, the transportability assumption prescribes that there is no interference across the two populations. That is, treating one individual in one population does not interfere with the outcome of individuals in the other population.

Furthermore, we require that the trial population fully overlaps with the target cohort, so that we can reweight the CATE in the experimental sample to estimate the ATE in the target cohort. That is, for each individual in the target cohort, we want to make sure that we can find a comparable counterpart in the experimental sample with the same characteristics.

Assumption 3.

(Positivity of trial participation) $0 < \mathbb{P}(S=1 \mid T=t, X=x) < 1$ for all $x \in \text{supp}(X \mid S=0)$.

In Assumption 3, $\text{supp}(X \mid S=0)$ denotes the support of the target cohort, in other words, the set of values that $X$ can take for individuals in the target cohort. Mathematically, $x \in \text{supp}(X \mid S=0)$ is equivalent to $\mathbb{P}\left(\|X-x\| \leq \delta \mid S=0\right) > 0$ for all $\delta > 0$. Assumption 3 requires that the support of the experimental sample includes the target cohort of interest.

2.2 Estimators and related work

Inverse Propensity (IP) weighted estimators were proposed by Horvitz and Thompson (1952) for surveys in which subjects are sampled with unequal probabilities.

Definition 2.3.

(Horvitz-Thompson estimator)

$$\widehat{Y}^{(t)}_{\text{HT}} = \frac{1}{n_1}\sum_{i=1}^{n_1}\frac{I(T_i=t)\,Y_i}{\mathbb{P}\left(T_i=t \mid X=X_i\right)},$$
$$\hat{\tau}_{\text{HT}} = \widehat{Y}^{(1)}_{\text{HT}} - \widehat{Y}^{(0)}_{\text{HT}} = \frac{1}{n_1}\sum_{i=1}^{n_1}\left(\frac{T_i Y_i}{e(X_i)} - \frac{(1-T_i)\,Y_i}{1-e(X_i)}\right),$$

where the probability of treatment $e(X_i)$ is assumed to be known, as we focus on the design phase of experiments. In practice, we can extend the Horvitz-Thompson estimator by replacing $e(x)$ with an estimate $\hat{e}(x)$, as in, for example, the Hájek estimator (Hájek, 1971) and the difference-in-means estimator.

Definition 2.4.

(Augmented Inverse Propensity Weighted estimator)

$$\widehat{Y}^{(1)}_{\text{AIPW}} = \frac{1}{n_1}\sum_{i=1}^{n_1}\left[\hat{m}^{(1)}(X_i) + \frac{T_i}{\hat{e}(X_i)}\left(Y_i - \hat{m}^{(1)}(X_i)\right)\right],$$
$$\widehat{Y}^{(0)}_{\text{AIPW}} = \frac{1}{n_1}\sum_{i=1}^{n_1}\left[\hat{m}^{(0)}(X_i) + \frac{1-T_i}{1-\hat{e}(X_i)}\left(Y_i - \hat{m}^{(0)}(X_i)\right)\right],$$
$$\hat{\tau}_{\text{AIPW}} = \frac{1}{n_1}\sum_{i=1}^{n_1}\left[\frac{T_i\left(Y_i - \hat{m}^{(1)}(X_i)\right)}{\hat{e}(X_i)} - \frac{(1-T_i)\left(Y_i - \hat{m}^{(0)}(X_i)\right)}{1-\hat{e}(X_i)} + \hat{m}^{(1)}(X_i) - \hat{m}^{(0)}(X_i)\right],$$

where $m^{(t)}(x)$ denotes the average outcome of treatment $t$ given covariate $X=x$, that is, $m^{(t)}(x) = \mathbb{E}[Y \mid T=t, X=x, S=1]$, and $\hat{m}^{(t)}(x)$ is an estimate of $m^{(t)}(x)$ (Robins, 1994).

The estimator $\hat{\tau}_{\text{AIPW}}$ is doubly robust: $\hat{\tau}_{\text{AIPW}}$ is consistent if either (1) $\hat{e}(X_i)$ is consistent or (2) $\hat{m}^{(t)}(x)$ is consistent.

Definition 2.5.

(Inverse Propensity Sample Weighted (IPSW) estimator)

$$\hat{\tau}_{\text{IPSW}}^{*} = \frac{1}{n_1}\sum_{i \in \{i : S_i=1\}} w(X_i)\left(\frac{T_i Y_i}{e(X_i)} - \frac{(1-T_i)\,Y_i}{1-e(X_i)}\right),$$
$$w(x) = \frac{f_0(x)}{f_1(x)}.$$

We can see that the IPSW estimator extends the Horvitz-Thompson estimator by adding a weight $w(x)$, which is the ratio of the probability of observing an individual with characteristics $X=x$ in the target population to that in the trial population (Stuart et al., 2011).² (²The definition of the weight $w(x)$ differs slightly from that in Stuart et al. (2011), where $w(x)$ is defined as $\frac{P(S=1 \mid X=x)}{P(S=0 \mid X=x)}$, that is, the ratio of the probability of being selected into the trial over that of being selected into the target cohort. That definition is based on a problem setting in which there is a super population from which both the target cohort and the trial cohort are sampled. Our definition here agrees with that in Colnet et al. (2022).) We use an asterisk in the notation to denote that this is an oracle definition where we assume both $f_1(X)$ and $f_0(X)$ are known, which is usually unrealistic. The IPSW estimator of the average treatment effect on the target cohort is proven to be unbiased under Assumptions 1–3.

A concurrent study of high relevance to our work by Colnet et al. (2022) investigated the performance of IPSW estimators. In particular, they defined different versions of IPSW estimators, where $f_1(X)$ and $f_0(X)$ are either treated as known or estimated, and derived the expression of the asymptotic variance for each version. They concluded that the semi-oracle estimator, where $f_1(X)$ is estimated and $f_0(X)$ is treated as known, outperforms the other two versions, giving the lowest asymptotic variance.

Definition 2.6.

(Semi-oracle IPSW estimator, Colnet et al. (2022))

$$\hat{\tau}_{\text{IPSW}} = \frac{1}{n_1}\sum_{i \in \{i : S_i=1\}} \frac{f_0(X_i)}{\hat{f}_1(X_i)}\left(\frac{T_i Y_i}{e(X_i)} - \frac{(1-T_i)\,Y_i}{1-e(X_i)}\right), \text{ and}$$
$$\hat{f}_1(x) = \frac{1}{n_1}\sum_{S_i=1}\mathbb{1}\{X_i = x\}.$$

The re-weighted ATE estimator we use in this paper, $\hat{\tau}$, coincides with their semi-oracle IPSW estimator defined above, where $f_1(X)$ is estimated from the RCT data.
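To make the reweighting concrete, the following minimal sketch (hypothetical data-generating values; a single discrete covariate, a known propensity score, and $f_0$ assumed known) computes the semi-oracle IPSW estimate from a simulated trial.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical trial: one discrete covariate x in {0, 1, 2}, known propensity e(x) = 0.5.
n1 = 500
f1 = np.array([0.5, 0.3, 0.2])      # covariate distribution among trial participants
f0 = np.array([0.3, 0.2, 0.5])      # covariate distribution in the target cohort (assumed known)
e = 0.5                              # probability of treatment in the trial

x = rng.choice(3, size=n1, p=f1)     # trial covariates
t = rng.binomial(1, e, size=n1)      # randomized treatment
y = 2 * x + t * (1 - 3 * x) + rng.normal(size=n1)   # outcomes with a heterogeneous effect

# Per-unit Horvitz-Thompson terms, reweighted by f0(x) / f1_hat(x) as in Definition 2.6.
ht = t * y / e - (1 - t) * y / (1 - e)
f1_hat = np.bincount(x, minlength=3) / n1            # empirical trial covariate distribution
tau_ipsw = np.mean(f0[x] / f1_hat[x] * ht)
print(f"semi-oracle IPSW estimate of the target-cohort ATE: {tau_ipsw:.2f}")
```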

3 Main Results

In this section, we start with the case where the number of possible covariate values is finite and derive the optimal covariate allocation of RCT samples that minimizes the variance of the ATE estimate, $\hat{\tau}$. We then develop a deviation metric, $\mathcal{D}(f_1)$, that quantifies how much a candidate RCT sample composition with covariate distribution $f_1$ deviates from the optimal allocation. We prove that the variance of $\hat{\tau}$ is an increasing function of this deviation metric, so it can be used as a criterion for selection. Finally, we derive the above results in the presence of continuous covariates.

3.1 Variance-Minimizing RCT Covariate Allocation

We first consider the more straightforward case, where the number of possible covariate values is finite. Recall that $e(x)$ denotes the propensity score. We assume that the exact value of $e(x)$ is known for the RCT.

Suppose the covariate takes $M$ possible values $x_1, \ldots, x_M$. When units in the experimental dataset cover all the possible covariate values, recall, for $m=1,\ldots,M$, the Horvitz-Thompson inverse-propensity weighted estimator (Horvitz and Thompson, 1952) of the CATE:

$$\hat{\tau}(x_m) = \frac{\sum_{S_i=1,\,X_i=x_m}\left(\frac{T_i Y_i}{e(x_m)} - \frac{(1-T_i)\,Y_i}{1-e(x_m)}\right)}{\sum_{S_i=1,\,X_i=x_m} 1}.$$

Discrete covariates can be further divided into two types: ordinal, for example test grade, and categorical, such as blood type. For ordinal covariates, we can construct a smoother estimator by applying kernel-based local averaging:

$$\hat{\tau}_{\text{K}}(x_m) = \frac{\frac{1}{n_1 h^k}\sum_{S_i=1}\left(\frac{T_i Y_i}{e(X_i)} - \frac{(1-T_i)\,Y_i}{1-e(X_i)}\right)K\left(\frac{X_i - x_m}{h}\right)}{\frac{1}{n_1 h^k}\sum_{S_i=1}K\left(\frac{X_i - x_m}{h}\right)},$$

where $K(\cdot)$ is a kernel function and $h$ is the smoothing parameter. Conceptually, the kernel function measures how individuals with covariates in proximity to $x_m$ influence the estimation of $\hat{\tau}_{\text{K}}(x_m)$. This kernel-based estimator works even if the observational data does not fully overlap with the experimental data. The estimator $\hat{\tau}_{\text{K}}$ is inspired by Abrevaya et al. (2015), who used it to estimate the CATE. Specifically, if the covariate is ordinal and the sample size of a sub-population with a certain covariate value is small or even zero, we can consider $\hat{\tau}_{\text{K}}(x)$, as it applies local averaging so that each CATE estimate is informed by more data.
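As an illustration, here is a minimal sketch of the local-averaging estimator (hypothetical simulated data, a Gaussian kernel, a single ordinal covariate, and the propensity score treated as known):

```python
import numpy as np

def tau_kernel(x_eval, x, t, y, e, h):
    """Kernel-smoothed CATE estimate at x_eval (Gaussian kernel, bandwidth h)."""
    ht = t * y / e - (1 - t) * y / (1 - e)         # per-unit Horvitz-Thompson terms
    k = np.exp(-0.5 * ((x - x_eval) / h) ** 2)      # kernel weights K((X_i - x_m) / h)
    return np.sum(ht * k) / np.sum(k)

rng = np.random.default_rng(1)
n1, e = 400, 0.5
x = rng.integers(1, 6, size=n1).astype(float)       # ordinal covariate taking values 1,...,5
t = rng.binomial(1, e, size=n1)
y = x + t * (2 - x) + rng.normal(size=n1)           # true CATE at x is 2 - x

for xm in range(1, 6):
    print(f"tau_K({xm}) = {tau_kernel(xm, x, t, y, e, h=0.8):.2f}")
```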

To study the variance of CATE estimates τ^(xm)\hat{\tau}(x_{m}) and τ^K(xm)\hat{\tau}_{\text{K}}(x_{m}), we define the following terms:

$$\sigma_\psi^2(x) = \mathbb{E}\left[\left(\psi(X, Y, T) - \tau(x)\right)^2 \mid X=x, S=1\right],$$
$$\psi(x, y, t) = \frac{t\left(y - m^{(1)}(x)\right)}{e(x)} - \frac{(1-t)\left(y - m^{(0)}(x)\right)}{1-e(x)} + m^{(1)}(x) - m^{(0)}(x).$$

The random variable $\psi(X, Y, T)$ is the influence function of the AIPW estimator (Bang and Robins, 2005). The term $\sigma_\psi^2(x)$ measures the conditional variability of the difference in potential outcomes given covariate $X=x$, and $m^{(t)}(x)$ denotes the average outcome with treatment $t$ given covariate $X=x$.

Assumption 4.

As $n$ goes to infinity, $n_1/n$ has a limit in $(0, 1)$.

Assumption 4 states that when we consider the asymptotic behavior of our estimators, the sample sizes of both the experimental data and the observational data go to infinity, though usually there are more observational samples than experimental samples.

Theorem 1.

Under Assumption 4, for $m=1,\ldots,M$, we have

$$\sqrt{n_1}\left(\hat{\tau}(x_m) - \tau(x_m)\right) \stackrel{d}{\rightarrow} \mathcal{N}\left(0,\ \frac{\sigma_\psi^2(x_m)}{f_1(x_m)}\right),$$
$$\sqrt{n_1 h}\left(\hat{\tau}_{\text{K}}(x_m) - \tau(x_m)\right) \stackrel{d}{\rightarrow} \mathcal{N}\left(0,\ \frac{\|K\|_2^2\,\sigma_\psi^2(x_m)}{f_1(x_m)}\right),$$

where $\|K\|_2 = \left(\int K(u)^2\,du\right)^{1/2}$.

Theorem 1 gives the asymptotic distribution of the two CATE estimators for every possible covariate value. Complete randomization in experiments ensures that $\hat{\tau}(x)$ is unbiased. Based on the idea of the IPSW estimator, we then construct the following two reweighted estimators for the ATE:

$$\hat{\tau} = \sum_{m=1}^{M} f_0(x_m)\,\hat{\tau}(x_m), \qquad \hat{\tau}_{\text{K}} = \sum_{m=1}^{M} f_0(x_m)\,\hat{\tau}_{\text{K}}(x_m).$$

It is easy to see that the $\hat{\tau}$ above is the same as the semi-oracle IPSW estimator defined in Definition 2.6 once we substitute in the expression of $\hat{\tau}(x_m)$.

Theorem 2.

Under Assumptions 1–4, we have

$$n_1\,\text{var}(\hat{\tau}) = \sum_{m=1}^{M} f_0^2(x_m)\,\frac{\sigma_\psi^2(x_m)}{f_1(x_m)},$$
$$n_1 h\,\text{var}_{\text{a}}(\hat{\tau}_{\text{K}}) = \|K\|_2^2\sum_{m=1}^{M} f_0^2(x_m)\,\frac{\sigma_\psi^2(x_m)}{f_1(x_m)},$$

where $\text{var}_{\text{a}}(\cdot)$ denotes the asymptotic variance. For $m=1,\ldots,M$, the optimal RCT covariate distribution that minimizes both $\text{var}(\hat{\tau})$ and $\text{var}_{\text{a}}(\hat{\tau}_{\text{K}})$ is

$$f_1^{*}(x_m) = \frac{f_0(x_m)\,\sigma_\psi(x_m)}{\sum_{j=1}^{M} f_0(x_j)\,\sigma_\psi(x_j)}.$$

Theorem 2 indicates that even if the covariate distribution of the experimental data is exactly the same as that of the target cohort, it does not necessarily produce the most efficient estimator. The optimal RCT covariate distribution also depends on the conditional variability of potential outcomes. In fact, $f_1^{*}$ is essentially the target covariate distribution adjusted by the variability of conditional causal effects. This result suggests that we should sample relatively more individuals from sub-populations where the causal effect is more volatile, even if they do not make up a large proportion of the target cohort. Moreover, the two estimators, $\hat{\tau}$ and $\hat{\tau}_{\text{K}}$, share the same optimal covariate weights regardless of whether local averaging is applied.

In practice, if the total number of samples is fixed, experiment designers can select RCT samples with a covariate allocation identical to $f_1^{*}$ to improve the efficiency of the IPSW estimate, as sketched below.
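A small numerical sketch (hypothetical values of $f_0$ and $\sigma_\psi$) of computing the allocation from Theorem 2; note how a small but highly variable segment receives a disproportionately large share of the trial.

```python
import numpy as np

f0 = np.array([0.30, 0.20, 0.50])        # target cohort covariate distribution (hypothetical)
sigma_psi = np.array([1.0, 8.0, 2.0])    # conditional standard deviations sigma_psi(x_m) (hypothetical)
n_total = 200                             # fixed number of RCT participants

f1_star = f0 * sigma_psi / np.sum(f0 * sigma_psi)   # optimal allocation from Theorem 2
per_segment = np.round(n_total * f1_star).astype(int)

print("optimal allocation f1*:", np.round(f1_star, 3))
print("participants per segment:", per_segment)
# The second segment makes up only 20% of the target cohort but, because its causal
# effect is the most volatile, it receives the largest share of the RCT places.
```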

3.2 Deviation Metric

Corollary 1.

Under Assumptions 1–4, we have

$$n_1\,\text{var}(\hat{\tau}) = \left(\sum_{m=1}^{M} f_0(x_m)\,\sigma_\psi(x_m)\right)^2 \times \left\{\mathcal{D}(f_1) + 1\right\},$$
$$n_1 h\,\text{var}_{\text{a}}(\hat{\tau}_{\text{K}}) = \|K\|_2^2\left(\sum_{m=1}^{M} f_0(x_m)\,\sigma_\psi(x_m)\right)^2 \times \left\{\mathcal{D}(f_1) + 1\right\},$$

where $\text{var}_1(\cdot) = \text{var}_{X \mid S=1}(\cdot)$, and we define

$$\mathcal{D}(f_1) = \text{var}_1\left(\frac{f_1^{*}(X)}{f_1(X)}\right)$$

as the deviation metric of the experiment samples, as it measures the difference between the optimal covariate distribution $f_1^{*}$ and the actual covariate distribution $f_1$. We have $\mathcal{D}(f_1) \geq 0$, and $\mathcal{D}(f_1) = 0$ if and only if the actual covariate distribution of the experiment samples is identical to the optimal one, i.e., $f_1(x) = f_1^{*}(x)$ for all $x \in \{x_1, \ldots, x_M\}$.

According to Corollary 1, the variance of $\hat{\tau}$ depends on two parts: the first part, $\sum_{m=1}^{M} f_0(x_m)\,\sigma_\psi(x_m)$, depends on the true distribution of the target population, while the second part, $\mathcal{D}(f_1)$, measures the deviation of the RCT sample allocation $f_1$ from the optimal variability-adjusted allocation $f_1^{*}$, and can thus reflect the representativeness of our RCT samples. As Corollary 1 shows, the variance of the IPSW estimator for the population, $\hat{\tau}$, increases linearly with $\mathcal{D}(f_1)$.

The deviation metric equips us with a method to compare candidate experiment designs. To be specific, if experiment designers have several potential plans for RCT samples, they can choose one with the smallest deviation metric to maximize the estimation efficiency.
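A minimal sketch of using $\mathcal{D}(f_1)$ to rank candidate designs (hypothetical candidate allocations; $f_0$ and $\sigma_\psi$ assumed known here):

```python
import numpy as np

def deviation_metric(f1, f1_star):
    """D(f1) = var_{X|S=1}(f1*(X) / f1(X)); the mean of the ratio under f1 is always 1."""
    ratio = f1_star / f1
    return np.sum(f1 * ratio ** 2) - 1.0

f0 = np.array([0.30, 0.20, 0.50])         # target covariate distribution (hypothetical)
sigma_psi = np.array([1.0, 8.0, 2.0])     # conditional standard deviations (hypothetical)
f1_star = f0 * sigma_psi / np.sum(f0 * sigma_psi)

candidates = {
    "mimic target (naive)": f0,
    "uniform":              np.full(3, 1 / 3),
    "optimal":              f1_star,
}
for name, f1 in candidates.items():
    print(f"{name:22s} D(f1) = {deviation_metric(f1, f1_star):.3f}")
# The candidate with the smallest D(f1) yields the lowest variance of the IPSW estimate.
```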

3.3 Including Continuous Covariates

For continuous covariates, for instance body mass index (BMI), we apply stratification based on the propensity score. By considering an appropriate partition of the support, $\{A_1, \ldots, A_L\}$ with finite $L \in \mathbb{N}$, we can reduce the problem to the discrete case above.

Assumption 5.

For $l=1,\ldots,L$ and $x, x' \in A_l$, we have

  • (i) $\mathbb{P}(T=1 \mid X=x) = \mathbb{P}(T=1 \mid X=x')$;

  • (ii) $\mathbb{E}\left(Y^{(1)} - Y^{(0)} \mid X=x, S=1\right) = \mathbb{E}\left(Y^{(1)} - Y^{(0)} \mid X=x', S=1\right)$.

Assumption 5 states that units within each stratum share the same propensity score and CATE. This is a strong but reasonable condition if we make each stratum $A_l$ sufficiently small. Under Assumption 5, let $\hat{\tau}(A_l)$, $\hat{\tau}_{\text{K}}(A_l)$, $\sigma_\psi^2(A_l)$, and $e(A_l)$ denote the causal effect estimate, the causal effect estimate with kernel-based local averaging, the variance of the influence function, and the propensity score, respectively, conditioned on $X \in A_l$. Let $f_0(A_l) = \mathbb{P}(X \in A_l \mid S=0)$ and $f_1(A_l) = \mathbb{P}(X \in A_l \mid S=1)$. We can then construct two IPSW estimators:

$$\hat{\tau} = \sum_{l=1}^{L} f_0(A_l)\,\hat{\tau}(A_l), \qquad \hat{\tau}_{\text{K}} = \sum_{l=1}^{L} f_0(A_l)\,\hat{\tau}_{\text{K}}(A_l).$$

As shown in Corollary 2, we have results similar to Theorem 2, but instead of the optimal covariate distribution, we derive the optimal probability on each covariate set $A_l$.

Corollary 2.

Under Assumptions 1–5, we have

$$n_1\,\text{var}(\hat{\tau}) = \sum_{l=1}^{L} f_0^2(A_l)\,\frac{\sigma_\psi^2(A_l)}{f_1(A_l)},$$
$$n_1 h\,\text{var}_{\text{a}}(\hat{\tau}_{\text{K}}) = \|K\|_2^2\sum_{l=1}^{L} f_0^2(A_l)\,\frac{\sigma_\psi^2(A_l)}{f_1(A_l)}.$$

For $l=1,\ldots,L$, the optimal distribution on the covariate sets that minimizes both $\text{var}(\hat{\tau})$ and $\text{var}_{\text{a}}(\hat{\tau}_{\text{K}})$ is

$$f_1^{*}(A_l) = \frac{f_0(A_l)\,\sigma_\psi(A_l)}{\sum_{j=1}^{L} f_0(A_j)\,\sigma_\psi(A_j)}.$$

Moreover,

$$\sum_{l=1}^{L} f_0^2(A_l)\,\frac{\sigma_\psi^2(A_l)}{f_1(A_l)} = \left(\sum_{l=1}^{L} f_0(A_l)\,\sigma_\psi(A_l)\right)^2 \times \left\{\mathcal{D}(f_1) + 1\right\},$$

where $A(x) = \{A : x \in A;\ A \in \{A_1, \ldots, A_L\}\}$ denotes the stratum that contains $x$, so that the deviation metric is evaluated at the stratum level, $\mathcal{D}(f_1) = \text{var}_1\left(f_1^{*}(A(X))/f_1(A(X))\right)$.

In the sections that follow, for simplicity, we illustrate our method in the scenario where the covariates are all discrete with finitely many possible values. The results can easily be extended to include continuous covariates following the same logic as described in this section.
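For completeness, a brief sketch (hypothetical continuous covariate, strata, and stratum-level variabilities) of turning a continuous covariate into the discrete setting of Corollary 2:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical target cohort with a continuous covariate (e.g. BMI-like).
n0 = 10_000
x_target = rng.normal(25, 4, size=n0)

# Partition the support into L strata A_1, ..., A_L.
edges = np.array([-np.inf, 20, 25, 30, np.inf])
stratum = np.digitize(x_target, edges[1:-1])      # stratum index 0, ..., L-1 for each unit
L = len(edges) - 1

f0_A = np.bincount(stratum, minlength=L) / n0      # f0(A_l) = P(X in A_l | S = 0)
sigma_psi_A = np.array([3.0, 1.5, 1.5, 4.0])       # hypothetical sigma_psi(A_l) per stratum

# Optimal share of the RCT for each stratum (Corollary 2).
f1_star_A = f0_A * sigma_psi_A / np.sum(f0_A * sigma_psi_A)
print("f0(A_l): ", np.round(f0_A, 3))
print("f1*(A_l):", np.round(f1_star_A, 3))
```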

4 Estimating Conditional Variability

The optimal covariate allocation derived above can inform the planning of the composition of RCT samples. However, it is difficult, if not impossible, to know the conditional variability of potential outcomes before the RCT is carried out. In this section, we provide a practical strategy that uses information from the observational data to estimate the theoretical optimal covariate distribution, and derive conditions under which our strategy yields consistent results.

In completely randomized experiments, we can show that

$$\sigma_\psi^2(x) = \frac{1}{e(x)}\,\text{var}\left(Y^{(1)} \mid X=x\right) + \frac{1}{1-e(x)}\,\text{var}\left(Y^{(0)} \mid X=x\right).$$

To estimate $\sigma_\psi^2(x)$ from the observational data, for every $x$, let

$$\widehat{Y}^{(0)}(x) = \frac{\sum_{S_i=0,\,T_i=0,\,X_i=x} Y_i}{\sum_{S_i=0,\,T_i=0,\,X_i=x} 1}, \qquad \widehat{Y}^{(1)}(x) = \frac{\sum_{S_i=0,\,T_i=1,\,X_i=x} Y_i}{\sum_{S_i=0,\,T_i=1,\,X_i=x} 1},$$
$$\widehat{S}^{(0)}(x) = \frac{\sum_{S_i=0,\,T_i=0,\,X_i=x}\left(Y_i - \widehat{Y}^{(0)}(x)\right)^2}{\left(\sum_{S_i=0,\,T_i=0,\,X_i=x} 1\right) - 1}, \qquad \widehat{S}^{(1)}(x) = \frac{\sum_{S_i=0,\,T_i=1,\,X_i=x}\left(Y_i - \widehat{Y}^{(1)}(x)\right)^2}{\left(\sum_{S_i=0,\,T_i=1,\,X_i=x} 1\right) - 1},$$

where the sums are over units in the observational data with covariate value $x$.

We can then estimate the conditional variability of potential outcomes, the optimal covariate distribution and the deviation metric of RCT samples from observational data as follows

$$\hat{\sigma}_\psi^2(x) = \frac{1}{e(x)}\,\widehat{S}^{(1)}(x) + \frac{1}{1-e(x)}\,\widehat{S}^{(0)}(x),$$
$$\hat{f}_1^{*}(x_m) = \frac{f_0(x_m)\,\hat{\sigma}_\psi(x_m)}{\sum_{j=1}^{M} f_0(x_j)\,\hat{\sigma}_\psi(x_j)},$$
$$\widehat{\mathcal{D}}(f_1) = \text{var}_1\left(\frac{\hat{f}_1^{*}(X)}{f_1(X)}\right).$$

Assumption 6 below ensures the consistency of the estimated conditional variance of potential outcomes, $\hat{\sigma}_\psi^2(x)$. The main problem with estimating $\hat{\sigma}_\psi^2(x)$ from observational data is the possibility of unobserved confounding. Instead of assuming unconfoundedness, Assumption 6 only requires that the conditional variance estimable from the observational data be proportional to the target conditional variability, which is a weaker condition.

Assumption 6.

For every $x$, suppose

$$\frac{1}{e(x)}\,\text{var}\left(Y^{(1)} \mid X=x, T=1, S=0\right) + \frac{1}{1-e(x)}\,\text{var}\left(Y^{(0)} \mid X=x, T=0, S=0\right)$$
$$= c\left[\frac{1}{e(x)}\,\text{var}\left(Y^{(1)} \mid X=x, S=0\right) + \frac{1}{1-e(x)}\,\text{var}\left(Y^{(0)} \mid X=x, S=0\right)\right],$$

where $c > 0$ is an unknown constant.

The left-hand side of the equation above is the conditional variance of observed outcomes that can be estimated from the observational data, and the right-hand side is the theoretical conditional variance of potential outcomes that we want to approximate. Assumption 6 requires these two quantities to be proportional, rather than exactly equal. Intuitively, the assumption supposes that the covariate segments of the observational data that exhibit high volatility in observed outcomes also have high variance in their potential outcomes, although the absolute levels of variance need not be the same.

Theorem 3.

Under Assumptions 4 and 6, for every $x$ we have

$$\hat{f}_1^{*}(x) \rightarrow f_1^{*}(x), \qquad \widehat{\mathcal{D}}(f_1) \rightarrow \mathcal{D}(f_1).$$

Thus, $\hat{f}_1^{*}(x)$ and $\widehat{\mathcal{D}}(f_1)$ are consistent.

Based on Theorem 3, we propose a novel strategy to select efficient RCT samples: we select the candidate experimental design whose covariate allocation $f_1$ minimizes the estimated deviation metric $\widehat{\mathcal{D}}(f_1)$. By contrast, a naive strategy prefers the candidate experimental design with $f_1 = f_0$, which mimics exactly the covariate distribution of the target cohort. If the conditional variability of potential outcomes $\sigma_\psi^2(x)$ varies widely with $x$, our strategy can lead to a much more efficient treatment effect estimator than the naive strategy.
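The sketch below walks through this strategy end to end on simulated observational data (hypothetical outcome model; the planned trial propensity $e(x)$ is treated as known): estimate $\hat{\sigma}_\psi(x)$, form $\hat{f}_1^{*}$, and keep the candidate allocation with the smallest $\widehat{\mathcal{D}}(f_1)$.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated observational data (the target cohort): discrete covariate x in {1, 2, 3}.
n0 = 10_000
f0 = np.array([0.3, 0.2, 0.5])
x = rng.choice([1, 2, 3], size=n0, p=f0)
t = rng.binomial(1, 0.4, size=n0)                   # observed (possibly confounded) treatment
y = np.where(t == 1, 1 - x + rng.normal(size=n0), 2 * x + x ** 4 * rng.normal(size=n0))

e = 0.5                                              # planned propensity score in the RCT

# Estimate sigma_psi(x) from the observational data (valid under Assumption 6).
sigma_hat = np.empty(3)
for m, xm in enumerate([1, 2, 3]):
    s1 = np.var(y[(x == xm) & (t == 1)], ddof=1)     # conditional variance of treated outcomes
    s0 = np.var(y[(x == xm) & (t == 0)], ddof=1)     # conditional variance of control outcomes
    sigma_hat[m] = np.sqrt(s1 / e + s0 / (1 - e))

f1_star_hat = f0 * sigma_hat / np.sum(f0 * sigma_hat)   # estimated optimal allocation

def deviation_hat(f1):
    ratio = f1_star_hat / f1
    return np.sum(f1 * ratio ** 2) - 1.0

candidates = {"mimic target": f0, "uniform": np.full(3, 1 / 3), "estimated optimal": f1_star_hat}
for name, f1 in candidates.items():
    print(f"{name:18s} D_hat = {deviation_hat(f1):.3f}")
print("selected design:", min(candidates, key=lambda k: deviation_hat(candidates[k])))
```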

5 Practical Scenarios

5.1 Heterogeneous Unit Cost

We also consider experimental design with a cost constraint and heterogeneous costs for different sub-populations. The goal is to find the optimal sample allocation for the RCT that minimizes the variance of the proposed estimator subject to a cost constraint. For $m=1,\ldots,M$, let $C_m$ denote the cost of collecting a sample in the sub-population with $X=x_m$.

Theorem 4.

Under the cost constraint that

$$\sum_{m=1}^{M} C_m\left(\sum_{S_i=1,\,X_i=x_m} 1\right) = C,$$

the optimal sample allocation $f_1(X)$ that minimizes $\text{var}(\hat{\tau})$ is

$$f_1^{\text{c},*}(x_m) = \frac{f_0(x_m)\,\sigma_\psi(x_m)/\sqrt{C_m}}{\sum_{i=1}^{M} f_0(x_i)\,\sigma_\psi(x_i)/\sqrt{C_i}}$$

for $m=1,\ldots,M$. Here we use the superscript $\text{c}$ to denote the cost constraint.

Theorem 4 suggests that the optimal RCT covariate allocation under the given cost constraint is the covariate allocation of the target cohort adjusted by both the heterogeneous costs of the sub-populations and the conditional variability of potential outcomes. Intuitively, compared to the case without heterogeneous costs, we should include relatively more RCT samples from sub-populations with lower unit costs, as sketched below.
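A small numerical sketch of the cost-adjusted allocation from Theorem 4 (hypothetical unit costs $C_m$, budget $C$, and conditional variabilities), including the implied number of participants per segment:

```python
import numpy as np

f0 = np.array([0.30, 0.20, 0.50])        # target covariate distribution (hypothetical)
sigma_psi = np.array([1.0, 8.0, 2.0])    # conditional standard deviations (hypothetical)
C_m = np.array([20.0, 30.0, 40.0])       # unit recruitment cost per segment (hypothetical)
C_total = 30_000.0                        # total budget

weights = f0 * sigma_psi / np.sqrt(C_m)
f1_c_star = weights / weights.sum()       # cost-adjusted optimal allocation (Theorem 4)

# Total sample size and per-segment counts affordable under this allocation.
n1 = C_total / np.sum(f1_c_star * C_m)
print("f1^{c,*}:", np.round(f1_c_star, 3))
print("participants per segment:", np.round(f1_c_star * n1).astype(int))
```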

5.2 Different Precision Requirements

Theorem 2 gives the optimal sample allocation that maximizes the efficiency of the average treatment effect estimator for the target cohort. If we instead require the same precision for the CATE estimator in each segment, we need the following sample allocation:

$$f_1^{\text{s},*}(x_m) = \frac{\sigma_\psi^2(x_m)}{\sum_{j=1}^{M}\sigma_\psi^2(x_j)},$$

where we use the superscript $\text{s}$ to denote the requirement of the same precision for the CATE estimate in each segment.

To take both objectives into consideration, we propose a compromise allocation that falls between the two optimal allocations: for any $k \in [0,1]$,

$$f_1^{k,*}(x_m) = \frac{f_0^k(x_m)\,\sigma_\psi^{2-k}(x_m)}{\sum_{j=1}^{M} f_0^k(x_j)\,\sigma_\psi^{2-k}(x_j)}.$$
Corollary 3.

If, for $m=1,\ldots,M$,

$$f_0(x_m) = \frac{\sigma_\psi(x_m)}{\sum_{j=1}^{M}\sigma_\psi(x_j)},$$

we have, for all $k \in [0,1]$,

$$f_1^{*}(X) = f_1^{\text{s},*}(X) = f_1^{k,*}(X).$$

The deviation metrics for the sample allocations under the same-precision strategy and the compromise strategy are

$$\mathcal{D}(f_1^{\text{s},*}) = \text{var}_1\left(\frac{f_0(X)}{\sigma_\psi(X)}\right)\left(\frac{\sum_{m=1}^{M}\sigma_\psi^2(x_m)}{\sum_{m=1}^{M} f_0(x_m)\,\sigma_\psi(x_m)}\right)^2,$$
$$\mathcal{D}(f_1^{k,*}) = \text{var}_1\left(\frac{f_0(X)}{\sigma_\psi(X)}\right)^{1-k}\left(\frac{\sum_{m=1}^{M} f_0^k(x_m)\,\sigma_\psi^{2-k}(x_m)}{\sum_{m=1}^{M} f_0(x_m)\,\sigma_\psi(x_m)}\right)^2,$$

respectively.
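A brief sketch (same hypothetical $f_0$ and $\sigma_\psi$ as above) showing how the compromise allocation $f_1^{k,*}$ interpolates between the same-precision allocation ($k=0$) and the variance-minimizing allocation of Theorem 2 ($k=1$):

```python
import numpy as np

f0 = np.array([0.30, 0.20, 0.50])        # target covariate distribution (hypothetical)
sigma_psi = np.array([1.0, 8.0, 2.0])    # conditional standard deviations (hypothetical)

def compromise_allocation(k):
    """f1^{k,*}(x_m) proportional to f0(x_m)^k * sigma_psi(x_m)^(2-k)."""
    w = f0 ** k * sigma_psi ** (2 - k)
    return w / w.sum()

for k in [0.0, 0.5, 1.0]:                 # k = 0: equal segment precision; k = 1: best overall precision
    print(f"k = {k:.1f}:", np.round(compromise_allocation(k), 3))
```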

6 Numerical Study

6.1 Simulation

In this section, we conduct a simulation study to illustrate how the representativeness of experiment samples influences the estimation efficiency of the average treatment effect for a target cohort, and to demonstrate how the deviation metric $\mathcal{D}(f_1)$ can facilitate the selection among candidate RCT sample designs. We set the size of the observational dataset to $n_0 = 10000$ and the size of the experimental dataset to $n_1 = 200$. For the units in the observational data, we draw covariates $x$ from $\{1, 2, 3\}$ with probabilities 0.3, 0.2 and 0.5, respectively. We then set

$$Y^{(0)} = 2X + X^4\epsilon, \qquad Y^{(1)} = 1 - X + \epsilon, \quad\text{and hence}$$
$$Y^{(1)} - Y^{(0)} = 1 - 3X + (1 - X^4)\,\epsilon,$$

where $\epsilon \sim \mathcal{N}(0, 1)$. We can then compute the conditional variability of potential outcomes $\sigma_\psi^2(x)$, and thus the optimal covariate distribution $f_1^{*}$, from the true population. Our model produces very different conditional variability $\sigma_\psi^2(x)$ across values of $x$, making the optimal covariate distribution $f_1^{*}$ very different from the target covariate distribution $f_0$, as the sketch below illustrates.
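For reference, a sketch of this data-generating process and of the implied optimal allocation (assuming, as in our experiments, a propensity score of $e(x)=0.5$ in the trial):

```python
import numpy as np

rng = np.random.default_rng(4)

f0 = np.array([0.3, 0.2, 0.5])            # target covariate distribution over x in {1, 2, 3}
x_vals = np.array([1.0, 2.0, 3.0])
e = 0.5                                    # propensity score in the simulated RCT

def draw_outcomes(x):
    """Potential outcomes of the simulation model: Y0 = 2X + X^4*eps, Y1 = 1 - X + eps."""
    eps = rng.normal(size=x.shape)
    return 2 * x + x ** 4 * eps, 1 - x + eps

y0, y1 = draw_outcomes(rng.choice(x_vals, size=10_000, p=f0))
print("empirical ATE on the target cohort:", round(float(np.mean(y1 - y0)), 2))

# Under this model var(Y1 | x) = 1 and var(Y0 | x) = x^8, so
# sigma_psi^2(x) = 1/e + x^8/(1 - e), and Theorem 2 gives the optimal allocation.
sigma_psi = np.sqrt(1 / e + x_vals ** 8 / (1 - e))
f1_star = f0 * sigma_psi / np.sum(f0 * sigma_psi)
print("f0: ", f0)
print("f1*:", np.round(f1_star, 3))
```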

For experimental data, we simulate 100 different candidate experimental sample designs. In each design, we randomly draw experiment samples from the target cohort with probability

$$\mathbb{P}(S=1 \mid X=x) = \frac{e^{p_x}}{e^{p_1} + e^{p_2} + e^{p_3}},$$

where $p_1, p_2, p_3$ are i.i.d. samples drawn from the standard normal distribution. We can then compute the actual covariate distribution $f_1(x)$ and the deviation metric $\mathcal{D}(f_1)$. To estimate the efficiency of the average treatment effect estimator, we conduct 1000 experiments for each design. In each experiment, the treatment for each unit follows a Bernoulli distribution with probability 0.5. The simulation result is shown in Fig 1. The relationship between the variance and the deviation metric is well fit by a straight line, which is consistent with our result that $\text{var}(\hat{\tau})$ increases linearly with $\mathcal{D}(f_1)$. The red line shows the value of $\mathcal{D}(f_1)$ for the naive strategy of mimicking exactly the target cohort distribution; it is not zero, so the naive design is not the optimal RCT sample and does not produce the most efficient causal estimator.

Figure 1: How the deviation metric $\mathcal{D}$ of experiment samples correlates with the estimated variance of $\hat{\tau}$. The red line marks the deviation metric $\mathcal{D}$ of a trial sample selected following the naïve strategy.

For the case with heterogeneous unit costs, we set the costs for the sub-populations with covariate $X$ equal to 1, 2, 3 to be 20, 30, 40, respectively. The total budget available is 30000. Instead of randomly drawing experiment samples from the target cohort with a fixed total number of subjects, we randomly assign budgets to the different sub-populations subject to a fixed total cost. Given the budget assigned to each sub-population, we then draw subjects randomly from that sub-population, with the number of subjects determined by the assigned budget. The simulation result is illustrated in Figure 2. We can see that under the cost constraint, experiment samples that follow a distribution closer to $f_1^{\text{c},*}$ lead to a more efficient causal estimator.

Figure 2: How the deviation metric of experiment samples under the cost constraint influences the variance of $\hat{\tau}$.

6.2 Real Data Illustration

We use the well-cited Tennessee Student/Teacher Achievement Ratio (STAR) experiment to assess how the covariate distribution of experiment samples influences the estimation efficiency of the average treatment effect for a target cohort. STAR is a randomized experiment started in 1985 to measure the effect of class size on student outcomes, measured by standardized test scores. Similar to the exercise in Kallus et al. (2018), we focus on a binary treatment: $T=1$ for small classes (13-17 pupils) and $T=0$ for regular classes (22-25 pupils). Since many students only started the study at first grade, we took as treatment their class type at first grade. The outcome $Y$ is the sum of the listening, reading, and math standardized test scores at the end of first grade. We use the following covariates $X$ for each student: student race ($X_1 \in \{1, 2\}$) and school urbanicity ($X_2 \in \{1, 2, 3, 4\}$). We exclude units with missing covariates. The records of 4584 students remain, with 1733 assigned to treatment (small class, $T=1$) and 2413 to control (regular-size class, $T=0$). Before the analysis, we impute the missing outcomes by linear regression on the treatment and the two covariates, so that both potential outcomes $Y^{(0)}$ and $Y^{(1)}$ for each student are known.

We simulate 500 candidate experiment sample allocations. For each allocation, we select $n_1 = 500$ experiment units from the dataset with probability

$$\mathbb{P}(S=1 \mid X=x) = \frac{e^{p_{x_1 x_2}}}{e^{p_{11}} + e^{p_{12}} + e^{p_{13}} + e^{p_{14}} + e^{p_{21}} + e^{p_{22}} + e^{p_{23}} + e^{p_{24}}},$$

where $p_{11}, p_{12}, p_{13}, p_{14}, p_{21}, p_{22}, p_{23}, p_{24}$ are i.i.d. samples drawn from the standard normal distribution. We can then compute the actual covariate distribution $f_1(x)$ and the deviation metric of the experiment samples, $\mathcal{D}(f_1) = \text{var}_1\left(f_1^{*}(X)/f_1(X)\right)$. To estimate the efficiency of the average treatment effect estimator, we conduct 200 experiments for each design. In each experiment, the treatment follows a Bernoulli distribution with probability $1733/4584 = 0.378$. The simulation result is shown in Fig 3. The relationship between the variance and the deviation metric $\mathcal{D}(f_1)$ is well fit by a straight line, which is consistent with our result that $\text{var}(\hat{\tau})$ increases linearly with $\mathcal{D}(f_1)$.

Figure 3: How the deviation metric $\mathcal{D}$ of experiment samples correlates with the estimated variance of $\hat{\tau}$. The red line marks the deviation metric $\mathcal{D}$ of a trial selected following the naïve strategy.

7 Conclusion

In this paper, we examine the common procedure of generalizing causal inference from an RCT to a target cohort. We approach this as a problem where we can combine an RCT with an observational dataset. The observational data plays two roles in the combination: one is to provide the exact covariate distribution of the target cohort, and the other is to provide a means to estimate the conditional variability of the causal effect across covariate values.

We give the expression for the variance of the IPSW estimator as a function of the covariate distribution in the RCT. We subsequently derive the variance-minimizing optimal covariate allocation in the RCT under the constraint that the size of the trial population is fixed. Our result indicates that the optimal covariate distribution of the RCT does not necessarily follow the exact distribution of the target cohort, but is instead adjusted by the conditional variability of potential outcomes. Practitioners who are at the design phase of a trial can use the optimal allocation result to plan the group of participants to recruit into the trial.

We also formulate a deviation metric quantifying how far a given RCT allocation is from optimal. The advantage of this metric is that the variance of the final ATE estimate increases linearly with it, so that when presented with several candidate RCT cohorts, practitioners can compare them and choose the most efficient RCT according to this metric.

The above results depend on the conditional variability of the causal effect across covariate values, which is unknown. We propose to estimate it using the observational data and outline the mild assumptions that need to be met. In reality, practitioners usually have complex considerations when designing a trial, for instance cost constraints and precision requirements. We develop variants of our main results to apply in such practical scenarios. Finally, we use two numerical studies to corroborate our theoretical results.

References

  • Abrevaya et al. (2015) J. Abrevaya, Y.-C. Hsu, and R. P. Lieli. Estimating conditional average treatment effects. Journal of Business & Economic Statistics, 33(4):485–505, 2015.
  • Andrews and Oster (2017) I. Andrews and E. Oster. Weighting for external validity. National Bureau of Economic Research, 2017.
  • Athey et al. (2020) S. Athey, R. Chetty, and G. Imbens. Combining experimental and observational data to estimate treatment effects on long term outcomes. arXiv preprint arXiv:2006.09676, 2020.
  • Bang and Robins (2005) H. Bang and J. M. Robins. Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4):962–973, 2005.
  • Campbell (1957) D. T. Campbell. Factors relevant to the validity of experiments in social settings. Psychological bulletin, 54(4):297, 1957.
  • Chen et al. (2021) S. Chen, B. Zhang, and T. Ye. Minimax rates and adaptivity in combining experimental and observational data. 9 2021. URL http://arxiv.org/abs/2109.10522.
  • Cole and Hernán (2008) S. R. Cole and M. A. Hernán. Constructing inverse probability weights for marginal structural models. American journal of epidemiology, 168(6):656–664, 2008.
  • Cole and Stuart (2010) S. R. Cole and E. A. Stuart. Generalizing evidence from randomized clinical trials to target populations: the actg 320 trial. American journal of epidemiology, 172(1):107–115, 2010.
  • Colnet et al. (2022) B. Colnet, J. Josse, G. Varoquaux, and E. Scornet. Reweighting the rct for generalization: finite sample analysis and variable selection. arXiv preprint arXiv:2208.07614, 2022.
  • Dahabreh and Hernán (2019) I. J. Dahabreh and M. A. Hernán. Extending inferences from a randomized trial to a target population. European Journal of Epidemiology, 34(8):719–722, 2019.
  • Hájek (1971) J. Hájek. Comment on “an essay on the logical foundations of survey sampling, part one”. The foundations of survey sampling, 236, 1971.
  • Hartman et al. (2015) E. Hartman, R. Grieve, R. Ramsahai, and J. S. Sekhon. From sate to patt: combining experimental with observational studies to estimate population treatment effects. Journal of the Royal Statistical Society: Series A (Statistics in Society), 2015.
  • Hernán and Robins (2010) M. A. Hernán and J. M. Robins. Causal inference, 2010.
  • Hernán and VanderWeele (2011) M. A. Hernán and T. J. VanderWeele. Compound treatments and transportability of causal inference. Epidemiology (Cambridge, Mass.), 22(3):368, 2011.
  • Horvitz and Thompson (1952) D. G. Horvitz and D. J. Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American statistical Association, 47(260):663–685, 1952.
  • Kallus et al. (2018) N. Kallus, A. M. Puli, and U. Shalit. Removing hidden confounding by experimental grounding. Advances in neural information processing systems, 31, 2018.
  • Oberst et al. (2022) M. Oberst, A. D’Amour, M. Chen, Y. Wang, D. Sontag, and S. Yadlowsky. Bias-robust integration of observational and experimental estimators. 5 2022. URL http://arxiv.org/abs/2205.10467.
  • Pearl and Bareinboim (2011) J. Pearl and E. Bareinboim. Transportability of causal and statistical relations: A formal approach. In Twenty-fifth AAAI conference on artificial intelligence, 2011.
  • Robins (1994) J. M. Robins. Correcting for non-compliance in randomized trials using structural nested mean models. Communications in Statistics-Theory and methods, 23(8):2379–2412, 1994.
  • Rosenman et al. (2020) E. Rosenman, G. Basse, A. Owen, and M. Baiocchi. Combining observational and experimental datasets using shrinkage estimators. arXiv preprint arXiv:2002.06708, 2020.
  • Rothwell (2005) P. M. Rothwell. External validity of randomised controlled trials:“to whom do the results of this trial apply?”. The Lancet, 365(9453):82–93, 2005.
  • Rubin (1974) D. B. Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of educational Psychology, 66(5):688, 1974.
  • Rubin (2008) D. B. Rubin. For objective causal inference, design trumps analysis. The annals of applied statistics, 2(3):808–840, 2008.
  • Rudolph and van der Laan (2017) K. E. Rudolph and M. J. van der Laan. Robust estimation of encouragement design intervention effects transported across sites. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(5):1509–1525, 2017.
  • Stuart et al. (2011) E. A. Stuart, S. R. Cole, C. P. Bradshaw, and P. J. Leaf. The use of propensity scores to assess the generalizability of results from randomized trials. Journal of the Royal Statistical Society: Series A (Statistics in Society), 174(2):369–386, 2011.
  • Westreich et al. (2019) D. Westreich, J. K. Edwards, C. R. Lesko, S. R. Cole, and E. A. Stuart. Target validity and the hierarchy of study designs. American journal of epidemiology, 188(2):438–443, 2019.
  • Yang and Ding (2020) S. Yang and P. Ding. Combining multiple observational data sources to estimate causal effects. Journal of the American Statistical Association, 115:1540–1554, 7 2020. ISSN 1537274X. doi: 10.1080/01621459.2019.1609973.

Appendix A Proof

To be continued.

Lemma 1.

Let $f(x) = \sum_{i=1}^{n} a_i^2 / x_i$, where $x = (x_1, \ldots, x_n)$ and $a = (a_1, \ldots, a_n)$ are such that $x, a \geq 0$ and $\sum_{i=1}^{n} x_i = 1$. The function $f(x)$ reaches its minimum when $x_i = a_i / (\sum_{i=1}^{n} a_i)$ for $i=1,\ldots,n$.

Proof of Lemma 1.

Note that $x_n = 1 - \sum_{i=1}^{n-1} x_i$; for $i=1,\ldots,n-1$, we have

$$\frac{\partial f(x)}{\partial x_i} = -\frac{a_i^2}{x_i^2} + \frac{a_n^2}{\left(1 - \sum_{j=1}^{n-1} x_j\right)^2},$$
$$\frac{\partial^2 f(x)}{\partial x_i^2} = \frac{2 a_i^2}{x_i^3} + \frac{2 a_n^2}{\left(1 - \sum_{j=1}^{n-1} x_j\right)^3} \geq 0.$$

Given that $x$ is positive, by setting $\partial f(x)/\partial x_i = 0$ we get $x_i = a_i / (\sum_{i=1}^{n} a_i)$ for $i=1,\ldots,n$. This, together with $\partial^2 f(x)/\partial x_i^2 \geq 0$, ensures that $f(x)$ reaches its minimum when $x_i = a_i / (\sum_{i=1}^{n} a_i)$ for $i=1,\ldots,n$. ∎

Proof of Theorem 2.

Given Theorem 1 and the definitions of $\hat{\tau}$ and $\hat{\tau}_{\text{K}}$, a simple calculation gives $\text{var}(\hat{\tau})$ and $\text{var}_{\text{a}}(\hat{\tau}_{\text{K}})$. Noting that $\|K\|_2$ is constant for a given kernel $K$, Lemma 1 yields

$$f_1^{*}(x_m) = \frac{f_0(x_m)\,\sigma_\psi(x_m)}{\sum_{j=1}^{M} f_0(x_j)\,\sigma_\psi(x_j)}.$$

We then have

$$\begin{aligned}
\sum_{m=1}^{M} f_0^2(x_m)\,\frac{\sigma_\psi^2(x_m)}{f_1(x_m)} &= \sum_{m=1}^{M} f_1(x_m)\, f_0^2(x_m)\,\frac{\sigma_\psi^2(x_m)}{f_1^2(x_m)} \\
&= \mathbb{E}_1\left\{\left(\frac{f_0(X)\,\sigma_\psi(X)}{f_1(X)}\right)^2\right\} \\
&= \text{var}_1\left(\frac{f_0(X)\,\sigma_\psi(X)}{f_1(X)}\right) + \mathbb{E}_1^2\left\{\frac{f_0(X)\,\sigma_\psi(X)}{f_1(X)}\right\} \\
&= \text{var}_1\left(\frac{f_0(X)\,\sigma_\psi(X)}{f_1(X)}\right) + \left(\sum_{m=1}^{M} f_1(x_m)\, f_0(x_m)\,\frac{\sigma_\psi(x_m)}{f_1(x_m)}\right)^2 \\
&= \left(\sum_{m=1}^{M} f_0(x_m)\,\sigma_\psi(x_m)\right)^2 \times \left\{\mathcal{D}(f_1) + 1\right\},
\end{aligned}$$

where $\mathbb{E}_1(\cdot) = \mathbb{E}_{X \mid S=1}(\cdot)$. ∎

Proof of Theorem 3.

We first show that in experimental data, we have

$$\sigma_\psi^2(x) = \frac{1}{e(x)}\,\text{var}\left(Y^{(1)} \mid X=x\right) + \frac{1}{1-e(x)}\,\text{var}\left(Y^{(0)} \mid X=x\right).$$

Note that $\tau(x) = m^{(1)}(x) - m^{(0)}(x)$ and $T(1-T) = 0$; together with the definition of $\sigma_\psi^2(x)$, these ensure that

$$\begin{aligned}
\sigma_\psi^2(x) &= \mathbb{E}\left[\frac{T^2\left(Y - m^{(1)}(x)\right)^2}{e(x)^2} + \frac{(1-T)^2\left(Y - m^{(0)}(x)\right)^2}{(1-e(x))^2} - \frac{2\,T(1-T)\left(Y - m^{(1)}(x)\right)\left(Y - m^{(0)}(x)\right)}{e(x)(1-e(x))} \,\middle|\, X=x, S=1\right] \\
&= \mathbb{E}\left[\frac{T^2\left(Y - m^{(1)}(x)\right)^2}{e(x)^2} + \frac{(1-T)^2\left(Y - m^{(0)}(x)\right)^2}{(1-e(x))^2} \,\middle|\, X=x, S=1\right] \\
&= \mathbb{P}(T=1 \mid X=x, S=1)\,\mathbb{E}\left[\frac{\left(Y - m^{(1)}(x)\right)^2}{e(x)^2} \,\middle|\, T=1, X=x, S=1\right] \\
&\quad + \mathbb{P}(T=0 \mid X=x, S=1)\,\mathbb{E}\left[\frac{\left(Y - m^{(0)}(x)\right)^2}{(1-e(x))^2} \,\middle|\, T=0, X=x, S=1\right] \\
&= e(x)\,\frac{\mathbb{E}\left[\left(Y - m^{(1)}(x)\right)^2 \mid T=1, X=x, S=1\right]}{e(x)^2} + (1-e(x))\,\frac{\mathbb{E}\left[\left(Y - m^{(0)}(x)\right)^2 \mid T=0, X=x, S=1\right]}{(1-e(x))^2} \\
&= \frac{1}{e(x)}\,\text{var}\left(Y^{(1)} \mid X=x, S=1\right) + \frac{1}{1-e(x)}\,\text{var}\left(Y^{(0)} \mid X=x, S=1\right) \\
&= \frac{1}{e(x)}\,\text{var}\left(Y^{(1)} \mid X=x\right) + \frac{1}{1-e(x)}\,\text{var}\left(Y^{(0)} \mid X=x\right).
\end{aligned}$$

The second-to-last equality holds due to the consistency and ignorability conditions in Assumption 1, and the last equality holds due to the transportability condition in Assumption 2.

The law of large numbers and Assumption 6 then give

$$\hat{\sigma}_\psi^2(x) \;\rightarrow\; \frac{1}{e(x)}\,\text{var}\left(Y^{(1)} \mid X=x, T=1, S=0\right) + \frac{1}{1-e(x)}\,\text{var}\left(Y^{(0)} \mid X=x, T=0, S=0\right) \;=\; c\,\sigma_\psi^2(x).$$

This, together with Slutsky’s theorem, ensures

$$\hat{f}_1^{*}(x) \rightarrow f_1^{*}(x).$$

The consistency of $\widehat{\mathcal{D}}(f_1)$ then follows by the same argument combined with the continuous mapping theorem. ∎