This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Robust and Efficient Estimation of Potential Outcome Means Under Random Assignment

Akanksha Negi and Jeffrey M. Wooldridge
Abstract

We study efficiency improvements in randomized experiments for estimating a vector of potential outcome means using regression adjustment (RA) when there are more than two treatment levels. We show that linear RA which estimates separate slopes for each assignment level is never worse, asymptotically, than using the subsample averages. We also show that separate RA improves over pooled RA except in the obvious case where slope parameters in the linear projections are identical across the different assignment levels. We further characterize the class of nonlinear RA methods that preserve consistency of the potential outcome means despite arbitrary misspecification of the conditional mean functions. Finally, we apply these regression adjustment techniques to efficiently estimate the lower bound mean willingness to pay for an oil spill prevention program in California.

JEL Codes: C21, C25

Keywords: Multivalued Treatments, Experiment, Regression Adjustment, Heterogeneous Effects

Department of Econometrics and Business Statistics, Monash University. Email: akanksha.negi@monash.edu, Department of Economics, Michigan State University. Email: wooldri1@msu.edu. We would like to thank the Associate Editor and three anonymous referees for their helpful comments and suggestions.

1 Introduction

In the past several decades, the potential outcomes framework has become a staple of causal inference in statistics, econometrics, and related fields. Envisioning each unit in a population under different states of intervention or treatment allows one to define treatment or causal effects without reference to a model. One merely needs the potential outcome (PO) means for subpopulations.

When interventions are randomized – as in a clinical trial [Hirano and Imbens (2001)], assignment to participate in a job training program [Calónico and Smith (2017)], receiving a private school voucher [Angrist et al. (2006)], or contingent valuation studies where different bid values are randomized among individuals [Carson et al. (2004)] – one can simply use the subsample means for each treatment level in order to obtain unbiased and consistent estimators of the PO means. In some cases, the precision of the subsample means will be sufficient. Nevertheless, with the availability of good predictors of the outcome or response, it is natural to think that the precision can be improved, thereby shrinking confidence intervals and making conclusions about interventions more reliable.

In this paper we build on Negi and Wooldridge (2021), NW (2021) hereafter, who studied the problem of estimating the average treatment effect (ATE) under random assignment with one control and one treatment group. In the context of random sampling, NW (2021) showed that performing separate linear regressions for the two groups while estimating the ATE never does worse, asymptotically, than the simple difference in means estimator or a pooled regression adjustment estimator. These findings are complementary to Lin (2013), who used a finite population framework to study linear regression adjustment and arrived at essentially the same conclusions. NW (2021) also characterized the class of nonlinear regression adjustment methods that produce consistent estimators of the ATE without any additional assumptions (except regularity conditions).

In the current paper, we allow for G2G\geq 2 treatment levels and study the problem of joint estimation of the vector of PO means. We assume that the assignment to treatment is random – independent of both potential outcomes and observed predictors of the POs. In terms of the assignment mechanism, we assume a Bernoulli scheme whereby each unit has an equally likely probability of being assigned to a given treatment level, gg. Importantly, other than substantive restrictions such as random assignment, i.i.di.i.d sampling, and stable unit treatment value assumption, we impose only standard regularity conditions (such as finite second moments of the covariates). In other words, the regression adjustment (RA) estimators are consistent under essentially the same assumptions as the subsample means estimator with, generally, smaller asymptotic variances. Interestingly, even if the predictors are unhelpful, or the slopes in the linear projections are the same across all groups, no asymptotic efficiency is lost from using the most general RA method.

The extension from the binary treatment case to more than two treatment levels is nontrivial, both in terms of derivations and scope of applications. In the binary case, NW (2021) compared variances of estimators of the scalar ATE. In the current paper, we show that differences in asymptotic variance matrices of the PO estimators are positive semi-definite. Naturally, this result implies that any linear functions of the PO means are more efficiently estimated using separate RA. In addition, nonlinear functions of PO means –such as ratios – are also more efficiently estimated. Even in the binary treatment case, our results significantly improve over NW (2021).111It is not enough, even in the G=2G=2 case, to use NW (2021) for each of the pairwise difference in means. Knowing that each pairwise ATE is more efficiently estimated does not imply that all linear combinations are.

We also extend the nonlinear RA results in NW (2021) to this general PO framework with GG treatment levels. Gail et al. (1984), who study nonlinear regression for the binary treatment case, have cautioned against using certain nonlinear regression models as being biased for the ATE. Imbens and Rubin (2015) also leave the impression that nonlinear RA will be more efficient but at the cost of consistency, and therefore should be avoided. We show that for particular kinds of responses– the leading cases being binary, fractional, and nonnegative – it is possible to consistently estimate PO means using pooled and separate RA by combining results from Quasi Maximum Likelihood Estimation (QMLE) in the linear exponential family (LEF) with features of the canonical link function.222In the generalized linear model literature, a link function m1()m^{-1}(\cdot) relates the conditional mean of the outcome to a linear predictor such that 𝔼[Y(g)|𝐗]=m(𝐗𝜷)\mathbb{E}[Y(g)|\mathbf{X}]=m(\mathbf{X}\bm{\beta}). A canonical link is a special link function which guarantees that the model m(𝐗𝜷)m(\mathbf{X}\bm{\beta}) fits the overall mean of the outcome without necessarily being correctly specified. While Gail et al. (1984) discussed maximum likelihood estimation in the exponential family, they do not distinguish between canonical and non-canonical link functions. As shown in NW (2021), and extended to the multiple treatment case here, one needs to use the mean function associated with the canonical link to ensure consistency without imposing additional restrictions. Guo and Basse (2023) reach a similar conclusion in the binary treatment case studied in NW (2021), showing that generalized linear models with canonical links are examples of imputation models that are prediction unbiased for the sample average treatment effect under the design-based framework. Unlike the linear RA case, we do not have a general result that shows separate nonlinear RA is asymptotically more efficient than the subsample means estimator. However, if we add the assumption that the conditional mean functions are correctly specified, we show that doing separate RA is (weakly) more efficient than the subsample means estimator.333Cohen and Fogarty (2024) propose a two-step calibrated Generalized Oaxaca Blinder estimator that is efficient even under misspecification. We further establish that separate RA attains the semiparametric efficiency bound under correct conditional mean specification.

A Monte Carlo exercise substantiates our theoretical results.444We consider three different population models. The outcomes are generated to be either fractional, non-negative, or continuous with an unrestricted support. The results for fractional and non-negative outcomes can be found in section B of the online appendix. We find that RA estimators, both linear and nonlinear, improve over subsample means in terms of standard deviations and have small biases, even with fairly small sample sizes. Not surprisingly, the magnitude of the precision gains with RA methods generally depends on how well the covariates predict the potential outcomes. Finally, we conclude with an empirical application that uses the RA methods to estimate the lower bound mean willingness to pay for an oil spill prevention program along California’s coast.

In terms of contribution, this paper provides a unified framework for studying regression adjustment in experiments, allowing for multiple treatments under the infinite population (or superpopulation) setting. Besides studying linear and nonlinear RA methods and making efficiency comparisons, we also characterize the semiparametric efficiency bound for all consistent and asymptotically linear estimators of the PO means and show that separate RA attains this bound under correct conditional mean specification. Because our results compare asymptotic variance matrices for different PO mean estimators, the efficiency results derived here extend to linear and (smooth) nonlinear functions of the PO means. We also fill an important gap in the regression adjustment literature by studying pooled nonlinear methods which have not been discussed elsewhere. We show how nonlinear pooled RA is consistent whenever nonlinear separate RA is, making it an important alternative when the researcher lacks sufficient degrees of freedom for estimating separate slopes. On the practical side, the proposed RA estimators are easy to implement with common statistical software like Stata (see online appendix D for reference).

Related Literature: This paper contributes to a vast and rich literature on regression adjustment in experiments. In the design-based framework, Freedman (2008a, b) highlighted that adjusting for covariates additively in a regression is not guaranteed to produce efficiency gains in experiments compared to the simple difference in means estimator. For a binary treatment, this debate was followed up in Lin (2013) which showed that separate RA is weakly more efficient than both the pooled RA and simple difference in means estimators of the ATE.555More recently, Chang, Middleton, and Aronow (2024) propose exact bias-corrected estimators to alleviate the finite sample concerns of RA. This is different from the focus of our paper which is interested in improving efficiency with RA and assumes that sample sizes are large enough to ensure consistent estimation of the PO means. A similar argument for bias correction in high dimensional settings can be found in Lei and Ding (2021) and Chiang, Matsushita, and Otsu (2023).

Among those adopting a superpopulation approach with a binary treatment, Yang and Tsiatis (2001) and Leon, Tsiatis, and Davidian (2003) focus on a pre-test post-test trial with no baseline covariates. Much like the current paper, specialized to the binary treatment case with ATE as the parameter of interest, Yang and Tsiatis (2001) show that the interacted estimator is the most efficient among simple difference in means and pooled RA estimators. Leon et al. (2003) identify the most efficient ATE estimator using the semiparametric theory of Robins et al. (1994) but do not provide any efficiency comparisons between commonly employed estimators of the ATE. Tsiatis et al. (2008) characterize the class of all consistent and asymptotically normal (CAN) estimators of the ATE. They further show that such estimators have a common form which can be used to rank the different linear estimators in terms of efficiency. The current paper (with G=2G=2) and NW (2021) arrive at the same efficiency results with respect to linear RA as Tsiatis et al. (2008). In addition, the former explicitly discuss consistency with nonlinear RA (separate and pooled) involving quasi-log-likelihoods and canonical links from the generalized linear model literature. Furthermore, this paper establishes the asymptotic efficiency of separate RA relative to subsample averages when the conditional mean functions are correctly specified. Pitkin et al. (2013) compare the asymptotic variances of simple difference in means and linear RA (pooled and separate) under essentially a similar assumption as this paper wherein the conditional mean functions are not assumed to be linear. However, Pitkin et al. (2013) neither discuss nonlinear RA nor conditions under which it is efficient relative to simple difference in means. Similar to Leon et al. (2003), Zhang, Tsiatis, and Davidian (2008) also study regression adjustment from a semiparametric perspective. However, their focus is on CAN estimators of pairwise treatment comparisons in a multiarmed trial. As mentioned above, the efficiency results derived in this paper extend to linear and nonlinear functions of the PO means, including pairwise ATEs. The current paper also explicitly ranks the different linear and nonlinear RA estimators, which is not explored in Zhang et al. (2008). We further expand our set of results to include the semiparametric efficiency bound for all CAN estimators of the PO means and establish that SRA of the QML variety attains the efficiency bound when the conditional means are correctly specified. Another paper that studies nonlinear RA for estimating the ATE is Rosenblum and Van Der Laan (2010). The efficiency results discussed there are nested within this paper, which additionally considers pooled nonlinear methods that have not been discussed elsewhere. An added advantage of the nonlinear estimators proposed here is that they are computationally simple and easy to implement.

In the design-based setting, Guo and Basse (2023) and Cohen and Fogarty (2024) also discuss nonlinear RA in binary experiments. The former proposes ATE estimators which impute the missing potential outcomes using nonlinear models that are prediction unbiased. This is similar to the “mean-fitting” property discussed here and in NW (2021) which is crucial for consistent estimation of the PO means. This property is guaranteed by choosing appropriate combinations of canonical link and quasi-log-likelihood functions in the LEF estimated using QMLE. Cohen and Fogarty (2024) extend Guo and Basse (2023)’s efficiency result to the case of mean misspecification.666They argue that a second-step calibration of the Generalized Oaxaca Blinder estimator produces an estimator of the ATE that is efficient even under misspecification of the nonlinear model.

A recent strand also looks at multiarmed treatments in the context of factorial experiments under the design-based framework [Zhao and Ding (2022a, b), Zhao and Ding (2023), Pashley and Bind (2023)]. Such experiments are designed to accommodate multiple factors of interest in a single experiment and define treatment levels as all possible combinations of the factors involved. The typical concern in this literature is degree of freedom conservation while estimating all main and interaction effects of these factors simultaneously. An exception is Zhao and Ding (2023) which discusses linear RA for multiarmed experiments (factorial experiments being a special case). The efficiency results concerning linear RA in Zhao and Ding (2023) have a direct analogue in the current paper under the superpoulation framework. For more complex experimental designs, see Cytrynbaum (2023), Jiang et al. (2023), Roth and Sant’Anna (2023), and Chang (2023).

The rest of the paper is organized as follows. Section 2 describes the potential outcomes framework with GG treatment levels along with a discussion of the main assumptions. Section 3 presents the asymptotic variances of the linear RA estimators, whereas section 4 ranks them in terms of asymptotic efficiency. Section 5 considers a class of nonlinear RA estimators that ensure consistency for the PO means without imposing additional assumptions. Section 6 characterizes the semiparametric efficiency bound for all regular estimators of the PO means and establishes local efficiency of separate RA. Section 7 constructs a simulation exercise for studying the finite sample behavior of the RA estimators. Section 8 applies the RA methods to study willingness to pay for an oil spill prevention program, and section 9 concludes.

2 Potential Outcomes Framework and Assumptions

We use the standard potential outcomes framework, also known as the Neyman-Rubin causal model. The goal is to estimate the population means of GG potential (counterfactual) outcomes, Y(g)Y(g), g=1,,Gg=1,...,G. Define μg=𝔼[Y(g)]\mu_{g}=\mathbb{E}\left[Y(g)\right] for each g=1,,G.g=1,...,G.

The vector of assignment indicators is 𝐖=(W1,,WG)\mathbf{W}=(W_{1},...,W_{G}), where each WgW_{g} is binary and W1+W2++WG=1W_{1}+W_{2}+\cdots+W_{G}=1. In other words, the groups are exhaustive and mutually exclusive. The setup applies to many situations, including the standard treatment-control setup with G=2G=2, multiple treatment levels (with g=1g=1 as the control group), and in contingent valuation studies where subjects are presented with a set of GG prices or bid values.

Next, let 𝐗=(X1,X2,,XK)\mathbf{X}=(X_{1},X_{2},...,X_{K}) be a vector of observed covariates of dimension KK which is assumed to be fixed in our asymptotic analysis. With respect to the assignment process, we make the following assumption.

Assumption 1 (Random Assignment).

Assignment is independent of the potential outcomes and observed covariates: 𝐖[Y(1),Y(2),,Y(G),𝐗]\mathbf{W}\perp\left[Y(1),Y(2),...,Y(G),\mathbf{X}\right]. Further, ρg(Wg=1)>0\rho_{g}\equiv\mathbb{P}(W_{g}=1)>0 holds.

Assumption 1 puts us in the framework of experimental interventions. We also assume that each group, gg, has a positive probability of being assigned such that ρ1+ρ2++ρG=1\rho_{1}+\rho_{2}+\cdots+\rho_{G}=1.

Assumption 2 (Random Sampling).

For a nonrandom integer NN, {[𝐖i,Yi(1),Yi(2),,Yi(G),\big{\{}\big{[}\mathbf{W}_{i},Y_{i}(1),Y_{i}(2),...,Y_{i}(G), 𝐗i]\mathbf{X}_{i}\big{]} :i=1,2,,N}:i=1,2,...,N\big{\}} is independent and identically distributed.

The i.i.d assumption is not the only one we can make. For example, we could allow for a sampling-without-replacement scheme given a fixed sample size NN. This would complicate the analysis because it generates a slight correlation across draws from the population. As discussed in NW (2021), Assumption 2 is traditional in studying the asymptotic properties of estimators and is realistic as an approximation in cases where a sample is taken from a large population. It also forces us to account for the sampling error in 𝐗¯\mathbf{\bar{X}}, as an estimator of 𝝁𝐗=𝔼(𝐗)\bm{\mu}_{\mathbf{X}}=\mathbb{E}\left(\mathbf{X}\right).

A different approach is to assume that there is no sampling uncertainty and instead assume we observe the entire population. In the design–based approach, all uncertainty is in the assignment of the treatment indicators, 𝐖=(W1,,WG)\mathbf{W}=(W_{1},...,W_{G}). Among others, Freedman (2008a) and Lin (2013) take this approach in studying linear regression adjustment with a binary treatment. In many examples in economics (including the empirical example in our supplement), one does not observe the entire population and so it seems natural to study the random sampling framework. Our results for the random sampling case can be used to further study the efficiency issue in a framework that allows both sampling and assignment uncertainty, as in Abadie et al. (2020). Such a framework will affect inference only in cases where the sample is a substantial fraction of the population. If the sample size is small relative to the population size, the finite-population adjustments of the kind derived in Abadie et al. (2020) would have a trivial effect.

Given Assumption 2, for each draw ii from the population we only observe

Yi=Wi1Yi(1)+Wi2Yi(2)++WiGYi(G),Y_{i}=W_{i1}Y_{i}(1)+W_{i2}Y_{i}(2)+\cdots+W_{iG}Y_{i}(G), (1)

and so the data are {(𝐖i,Yi,𝐗i):i=1,2,,N}\left\{\left(\mathbf{W}_{i},Y_{i},\mathbf{X}_{i}\right):i=1,2,...,N\right\}. Definition of population quantities only requires us to use the random vector (𝐖,Y,𝐗)\left(\mathbf{W},Y,\mathbf{X}\right), which represents the population.

Along with assumptions 1 and 2, we also assume that the stable unit treatment value assumption (SUTVA) holds. This implies that there are no spillovers or hidden variations of the treatments. Random assignment, i.i.di.i.d sampling, and SUTVA are the only substantive restrictions used in this paper and will be maintained throughout. Subsequently, we assume that linear projections exist and that the central limit theorem holds for properly standardized sample averages of i.i.d random vectors. Therefore, we are implicitly imposing at least finite second moment assumptions on each Y(g)Y(g) and XjX_{j}. We do not make this explicit in what follows.

2.1 Completely Randomized Experiment

In this paper, we consider a randomized experiment where each unit has an equally likely probability, ρg\rho_{g}, of being assigned to treatment level gg. An assignment scheme that is common in practice is where a fixed number of units are assigned to each treatment level. This is known as a completely randomized experiment and creates correlation between draws. Common analyses of a completely randomized experiment occur in the design-based framework where the sample coincides with the population (which is, naturally, assumed to be finite). Our framework is based on the availability of i.i.d draws, which includes the infinite superpopulation setting and also the finite population setting where sampling is done with replacement. An interesting question is whether the results obtained here apply to the setting of a completely randomized experiment, or settings where the population is finite and the assignments are i.i.d draws from a multinomial distribution. Other settings of interest include when the assignment might be determined in two (or more) stages. For example, clusters of units are defined, and then clusters are randomly assigned to be treated or control clusters. For the treated clusters, units within the cluster are randomly assigned to control or treatment– as in Abadie et al. (2023) for the case of binary treatment without covariates. We leave these topics for future research.

3 Subsample Means and Linear Regression Adjustment

In this section we discuss the asymptotic variances of three estimators: the subsample means, separate regression adjustment, and pooled regression adjustment.777The asymptotic representation proofs of the three estimators can be found in section A of the online appendix.

3.1 Subsample Means (SM)

The simplest estimator of μg\mu_{g} is the sample average within treatment group gg: Y¯g=Ng1i=1NWigYi=Ng1i=1NWigYi(g),\bar{Y}_{g}=N_{g}^{-1}\sum_{i=1}^{N}W_{ig}Y_{i}=N_{g}^{-1}\sum_{i=1}^{N}W_{ig}Y_{i}(g), where Ng=i=1NWigN_{g}=\sum_{i=1}^{N}W_{ig} is a random variable in our setting. In expressing Y¯g\bar{Y}_{g} as a function of the Yi(g)Y_{i}(g) we use WihWig=0W_{ih}W_{ig}=0 for hgh\neq g. Under random assignment and random sampling,

𝔼(Y¯g|W1g,,WNg,Ng>0)\displaystyle\mathbb{E}\left(\bar{Y}_{g}|W_{1g},...,W_{Ng},N_{g}>0\right) =Ng1i=1NWig𝔼[Yi(g)|W1g,,WNg,Ng>0]\displaystyle=N_{g}^{-1}\sum_{i=1}^{N}W_{ig}\mathbb{E}\left[Y_{i}(g)|W_{1g},...,W_{Ng},N_{g}>0\right]
=Ng1i=1NWig𝔼[Yi(g)]=Ng1i=1NWigμg=μg,\displaystyle=N_{g}^{-1}\sum_{i=1}^{N}W_{ig}\mathbb{E}\left[Y_{i}(g)\right]=N_{g}^{-1}\sum_{i=1}^{N}W_{ig}\mu_{g}=\mu_{g},

and so Y¯g\bar{Y}_{g} is unbiased conditional on observing a positive number of units in group gg. By the law of large numbers, a consistent estimator of ρg\rho_{g} is ρ^g=Ng/N\hat{\rho}_{g}=N_{g}/N which is the sample share of units in group gg. Therefore, by the law of large numbers and Slutsky’s Theorem,

Y¯g\displaystyle\bar{Y}_{g} =(NNg)N1i=1NWigYi(g)𝑝ρg1𝔼[WgY(g)]=ρg1𝔼(Wg)𝔼[Y(g)]=μg,\displaystyle=\left(\frac{N}{N_{g}}\right)N^{-1}\sum_{i=1}^{N}W_{ig}Y_{i}(g)\overset{p}{\rightarrow}\rho_{g}^{-1}\mathbb{E}\left[W_{g}Y(g)\right]=\rho_{g}^{-1}\mathbb{E}\left(W_{g}\right)\mathbb{E}\left[Y(g)\right]=\mu_{g},

and so Y¯g\bar{Y}_{g} is consistent for μg\mu_{g}. By the central limit theorem, N(Y¯gμg)\sqrt{N}\left(\bar{Y}_{g}-\mu_{g}\right) is asymptotically normal. We need an asymptotic representation of N(Y¯gμg)\sqrt{N}\left(\bar{Y}_{g}-\mu_{g}\right) that allows us to compare its asymptotic variance with those from regression adjustment estimators. To this end, write

Y(g)=μg+V(g) and 𝐗˙=𝐗𝝁𝐗,Y(g)=\mu_{g}+V(g)\text{ and }\mathbf{\dot{X}}=\mathbf{X}-\bm{\mu}_{\mathbf{X}},

where 𝐗˙\mathbf{\dot{X}} is demeaned using the population mean, 𝝁𝐗\bm{\mu}_{\mathbf{X}}. The demeaning ensures that the intercept in the equation above can be directly interpreted as the PO mean. Now project each V(g)V(g) linearly onto 𝐗˙\mathbf{\dot{X}}: V(g)=𝐗˙𝜷𝒈+U(g)V(g)=\mathbf{\dot{X}}\bm{\beta_{g}}+U(g). By construction, the population projection errors U(g)U(g) have the properties: 𝔼[U(g)]=0 and 𝔼[𝐗˙U(g)]=𝟎\mathbb{E}\left[U(g)\right]=0\text{ and }\mathbb{E}[\mathbf{\dot{X}}^{\prime}U(g)]=\mathbf{0}\text{, } for each gg. Plugging in gives

Y(g)=μg+𝐗˙𝜷𝒈+U(g)g=1,,G.Y(g)=\mu_{g}+\mathbf{\dot{X}}\bm{\beta_{g}}+U(g)\text{, }g=1,...,G.

Importantly, by random assignment, 𝐖\mathbf{W} is independent of [U(1),,U(G),𝐗˙][U(1),...,U(G),\mathbf{\dot{X}}]. The observed outcome can be written as

Y=g=1GWg[μg+𝐗˙𝜷𝒈+U(g)].Y=\sum_{g=1}^{G}W_{g}\left[\mu_{g}+\mathbf{\dot{X}}\bm{\beta_{g}}+U(g)\right].

Our goal is to be able to make efficiency statements about both linear and nonlinear functions of the vector of means 𝝁=(μ1,μ2,,μG)\bm{\mu}=\left(\mu_{1},\mu_{2},...,\mu_{G}\right)^{\prime}, and so we stack the subsample means into the G×1G\times 1 vector 𝐘¯\mathbf{\bar{Y}}. For later comparison, it is helpful to remember that 𝐘¯\mathbf{\bar{Y}} is the vector of OLS coefficients in the regression (without an intercept): Yi on Wi1Wi2, …, WiGi=1,2,,N.Y_{i}\text{ on }W_{i1}\text{, }W_{i2}\text{, ..., }W_{iG}\text{, }i=1,2,...,N. The asymptotic representation for the subsample means estimator is given as

N(𝐘¯𝝁)=N1/2i=1N(𝐋i+𝐐i)+op(1)\sqrt{N}\left(\mathbf{\bar{Y}}-\bm{\mu}\right)=N^{-1/2}\sum_{i=1}^{N}\left(\mathbf{L}_{i}+\mathbf{Q}_{i}\right)+o_{p}(1)

where

𝐋i(Wi1𝐗˙i𝜷𝟏/ρ1Wi2𝐗˙i𝜷𝟐/ρ2WiG𝐗˙i𝜷𝑮/ρG) and 𝐐i(Wi1Ui(1)/ρ1Wi2Ui(2)/ρ2WiGUi(G)/ρG)\mathbf{L}_{i}\equiv\begin{pmatrix}W_{i1}\mathbf{\dot{X}}_{i}\bm{\beta_{1}}/\rho_{1}\\ W_{i2}\mathbf{\dot{X}}_{i}\bm{\beta_{2}}/\rho_{2}\\ \vdots\\ W_{iG}\mathbf{\dot{X}}_{i}\bm{\beta_{G}}/\rho_{G}\end{pmatrix}\text{ and }\mathbf{Q}_{i}\equiv\begin{pmatrix}W_{i1}U_{i}(1)/\rho_{1}\\ W_{i2}U_{i}(2)/\rho_{2}\\ \vdots\\ W_{iG}U_{i}(G)/\rho_{G}\end{pmatrix} (2)

For the vectors defined above, we know that by random assignment and the linear projection property, 𝔼(𝐋i)=𝔼(𝐐i)=𝟎\mathbb{E}\left(\mathbf{L}_{i}\right)=\mathbb{E}\left(\mathbf{Q}_{i}\right)=\mathbf{0}, and 𝔼(𝐋i𝐐i)=𝟎\mathbb{E}\left(\mathbf{L}_{i}\mathbf{Q}_{i}^{\prime}\right)=\mathbf{0}. Also, because WigWih=0W_{ig}W_{ih}=0, ghg\neq h, the elements of 𝐋i\mathbf{L}_{i} are pairwise uncorrelated; the same is true of the elements of 𝐐i\mathbf{Q}_{i}.

3.2 Separate Regression Adjustment (SRA)

To motivate SRA, write the linear projection for each gg as

Y(g)=αg+𝐗𝜷𝒈+U(g),𝔼[U(g)]=0 and 𝔼[𝐗U(g)]=𝟎.Y(g)=\alpha_{g}+\mathbf{X}\bm{\beta_{g}}+U(g),\ \mathbb{E}\left[U(g)\right]=0\text{ and }\mathbb{E}\left[\mathbf{X}^{\prime}U(g)\right]=\mathbf{0}. (3)

It follows immediately that μg=αg+𝝁𝐗𝜷𝒈.\mu_{g}=\alpha_{g}+\bm{\mu}_{\mathbf{X}}\bm{\beta_{g}}. Consistent estimators of αg\alpha_{g} and 𝜷𝒈\bm{\beta_{g}} are obtained from the regression: Yi on 1𝐗i, if Wig=1,Y_{i}\text{ on }1\text{, }\mathbf{X}_{i}\text{,\ if }W_{ig}=1, which produces intercept and slopes α^g\hat{\alpha}_{g} and 𝜷^𝒈\bm{\hat{\beta}_{g}}. Therefore, the SRA estimator of PO mean, μg\mu_{g} is given by μ^g=α^g+𝐗¯𝜷^g\hat{\mu}_{g}=\hat{\alpha}_{g}+\mathbf{\bar{X}}\bm{\hat{\beta}}_{g}. Stacking all the SRA estimators for PO means into the vector, 𝝁^SRA\bm{\hat{\mu}}_{SRA}, we obtain the following asymptotic representation

N(𝝁^SRA𝝁)=N1/2i=1N(𝐊i+𝐐i)+op(1)\sqrt{N}\left(\bm{\hat{\mu}}_{SRA}-\bm{\mu}\right)=N^{-1/2}\sum_{i=1}^{N}(\mathbf{K}_{i}+\mathbf{Q}_{i})+o_{p}(1)

where 𝐐i\mathbf{Q}_{i} is given in (2) and

𝐊i=(𝐗˙i𝜷𝟏𝐗˙i𝜷𝟐𝐗˙i𝜷𝑮).\mathbf{K}_{i}=\begin{pmatrix}\mathbf{\dot{X}}_{i}\bm{\beta_{1}}\\ \mathbf{\dot{X}}_{i}\bm{\beta_{2}}\\ \vdots\\ \mathbf{\dot{X}}_{i}\bm{\beta_{G}}\end{pmatrix}. (4)

Again, for the vectors defined above, both 𝐊i\mathbf{K}_{i} and 𝐐i\mathbf{Q}_{i} have zero means, the latter by random assignment. Further, 𝔼(𝐊i𝐐i)=𝟎\mathbb{E}\left(\mathbf{K}_{i}\mathbf{Q}_{i}^{\prime}\right)=\mathbf{0} because 𝔼[𝐗˙iWigUi(g)]=𝔼(Wig)𝔼[𝐗˙iUi(g)]=𝟎\mathbb{E}[\mathbf{\dot{X}}_{i}^{\prime}W_{ig}U_{i}(g)]=\mathbb{E}(W_{ig})\mathbb{E}[\mathbf{\dot{X}}_{i}^{\prime}U_{i}(g)]=\mathbf{0}. However, unlike the elements of 𝐋i\mathbf{L}_{i}, we must recognize that the elements of 𝐊i\mathbf{K}_{i} are correlated except in the trivial case that all but one of the 𝜷𝒈\bm{\beta_{g}} are zero. It is important to note here that if assignment is unconfounded given 𝐗\mathbf{X}, SRA will be consistent provided that the conditional means are linear.

3.3 Pooled Regression Adjustment (PRA)

Now consider the pooled estimator, 𝝁^PRA\bm{\hat{\mu}}_{PRA}, which can be obtained as the vector of coefficients on 𝐖i=(Wi1,Wi2,,WiG)\mathbf{W}_{i}=\left(W_{i1},W_{i2},...,W_{iG}\right) from the regression: Yi on 𝐖i𝐗¨i, i=1,2,,NY_{i}\text{ on }\mathbf{W}_{i}\text{, }\mathbf{\ddot{X}}_{i},\text{ }i=1,2,...,N where 𝐗¨i=𝐗i𝐗¯\mathbf{\ddot{X}}_{i}=\mathbf{X}_{i}-\mathbf{\bar{X}} is the sample demeaned covariate vector. We refer to this as a pooled method because the coefficients on 𝐗¨i\mathbf{\ \ddot{X}}_{i}, say, 𝜷ˇ\bm{\check{\beta}}, are assumed to be the same for all groups. Compared with subsample means, we add the controls 𝐗¨i\mathbf{\ \ddot{X}}_{i}, but unlike SRA, the pooled method imposes the same coefficients across all gg. The asymptotic representation is given as

N(𝝁^PRA𝝁)=N1/2i=1N(𝐅i+𝐊i+𝐐i)+op(1)\sqrt{N}\left(\bm{\hat{\mu}}_{PRA}-\bm{\mu}\right)=N^{-1/2}\sum_{i=1}^{N}\left(\mathbf{F}_{i}+\mathbf{K}_{i}+\mathbf{Q}_{i}\right)+o_{p}(1)

where 𝐊i\mathbf{K}_{i} and 𝐐i\mathbf{Q}_{i} are defined as before and

𝐅i(ρ11(Wi1ρ1)𝐗˙i𝜹𝟏ρ21(Wi2ρ2)𝐗˙i𝜹𝟐ρG1(WiGρG)𝐗˙i𝜹𝑮)\mathbf{F}_{i}\equiv\begin{pmatrix}\rho_{1}^{-1}\left(W_{i1}-\rho_{1}\right)\mathbf{\dot{X}}_{i}\bm{\delta_{1}}\\ \rho_{2}^{-1}\left(W_{i2}-\rho_{2}\right)\mathbf{\dot{X}}_{i}\bm{\delta_{2}}\\ \vdots\\ \rho_{G}^{-1}\left(W_{iG}-\rho_{G}\right)\mathbf{\dot{X}}_{i}\bm{\delta_{G}}\end{pmatrix} (5)

and 𝜹𝒈=𝜷𝒈𝜷\bm{\delta_{g}}=\bm{\beta_{g}}-\bm{\beta}, where 𝜷\bm{\beta} is the linear projection of  YY on 𝐗˙\mathbf{\dot{X}}. Again, by random assignment and the linear projection property, 𝔼(𝐅i𝐊i)=𝔼(𝐅i𝐐i)=𝟎\mathbb{E}\left(\mathbf{F}_{i}\mathbf{K}_{i}^{\prime}\right)=\mathbb{E}\left(\mathbf{F}_{i}\mathbf{Q}_{i}^{\prime}\right)=\mathbf{0}.

4 Comparing the Asymptotic Variances

We now take the representations given above and use them to compare the asymptotic variances of the three estimators. It is helpful to summarize the conclusions reached in Section 3:

N(𝝁^SM𝝁)\displaystyle\sqrt{N}\left(\bm{\hat{\mu}}_{SM}-\bm{\mu}\right) =N1/2i=1N(𝐋i+𝐐i)+op(1)\displaystyle=N^{-1/2}\sum_{i=1}^{N}\left(\mathbf{L}_{i}+\mathbf{Q}_{i}\right)+o_{p}(1) (6)
N(𝝁^SRA𝝁)\displaystyle\sqrt{N}\left(\bm{\hat{\mu}}_{SRA}-\bm{\mu}\right) =N1/2i=1N(𝐊i+𝐐i)+op(1)\displaystyle=N^{-1/2}\sum_{i=1}^{N}\left(\mathbf{K}_{i}+\mathbf{Q}_{i}\right)+o_{p}(1) (7)
N(𝝁^PRA𝝁)\displaystyle\sqrt{N}\left(\bm{\hat{\mu}}_{PRA}-\bm{\mu}\right) =N1/2i=1N(𝐅i+𝐊i+𝐐i)+op(1)\displaystyle=N^{-1/2}\sum_{i=1}^{N}\left(\mathbf{F}_{i}+\mathbf{K}_{i}+\mathbf{Q}_{i}\right)+o_{p}(1) (8)

where 𝐋i\mathbf{L}_{i}, 𝐐i\mathbf{Q}_{i}, 𝐊i\mathbf{K}_{i}, and 𝐅i\mathbf{F}_{i} are defined in (2), (4) and (5), respectively.

4.1 Comparing Separate RA to SM

We now show that, asymptotically, 𝝁^SRA\bm{\hat{\mu}}_{SRA} is no worse than 𝝁^SM\bm{\hat{\mu}}_{SM}.

Theorem 1 (Efficiency of SRA relative to SM).

Under Assumptions 1, 2, SUTVA, and finite second moments used to obtain the asymptotic representations in 6 and 7,

Avar[N(𝝁^SM𝝁)]Avar[N(𝝁^SRA𝝁)]=𝛀𝐋𝛀𝐊\mathrm{Avar}\left[\sqrt{N}\left(\bm{\hat{\mu}}_{SM}-\bm{\mu}\right)\right]-\mathrm{Avar}\left[\sqrt{N}\left(\bm{\hat{\mu}}_{SRA}-\bm{\mu}\right)\right]=\mathbf{\Omega_{L}}-\mathbf{\Omega_{K}}

is PSD where 𝛀𝐋𝔼(𝐋i𝐋i)\bm{\Omega}_{\mathbf{L}}\equiv\mathbb{E}(\mathbf{L}_{i}\mathbf{L}_{i}^{\prime}) and 𝛀𝐊𝔼(𝐊i𝐊i)\bm{\Omega}_{\mathbf{K}}\equiv\mathbb{E}(\mathbf{K}_{i}\mathbf{K}_{i}^{\prime}).

The proof can be found in Appendix A. To get an intuition for this efficiency gain, note that SRA uses covariate values across the entire sample for imputing the missing potential outcomes. This requires estimating the population mean of 𝐗\mathbf{X}. Therefore, the asymptotic variance of 𝝁^SRA\bm{\hat{\mu}}_{SRA} is influenced not only by the linear projection error but also 𝐗¯\mathbf{\bar{X}}. In contrast, SM only averages observed outcomes for each treatment level which means that its asymptotic variance ends up being a function of the variance of the subsample means, 𝐗¯g\mathbf{\bar{X}}_{g}, and the linear projection errors. So a comparison of the two estimators boils down to comparing 𝐋i\mathbf{L}_{i} and 𝐊i\mathbf{K}_{i} whose gg-th elements are given by Wig𝐗˙i𝜷𝒈/ρgW_{ig}\mathbf{\dot{X}}_{i}\bm{\beta_{g}}/\rho_{g} and 𝐗˙i𝜷𝒈\mathbf{\dot{X}}_{i}\bm{\beta_{g}}, respectively. To put it simply, SM is inefficient because it uses 𝐗¯g\mathbf{\bar{X}}_{g} which is a consistent (due to random assignment) but less efficient estimator of the population mean compared to 𝐗¯\mathbf{\bar{X}}, that is used by SRA. Adding orthogonal controls that reduce the error variance without inducing collinearity is relatively intuitive. SRA does that without using a “misspecified” model in the sense that it does not impose incorrect restrictions: the slopes are the same. Imposing zero slopes (SM) and common slopes (PRA) can both be viewed as forms of misspecification. They don’t cause inconsistency, but it is not generally possible to compare efficiencies when using two “misspecified” models.

The one case where there is no gain in asymptotic efficiency in using SRA is when 𝜷𝒈=𝟎\bm{\beta_{g}}=\mathbf{0}, g=1,,Gg=1,...,G, in which case the covariates do not help predict any of the potential outcomes. Importantly, there is no gain in asymptotic efficiency in imposing 𝜷𝒈=𝟎\bm{\beta_{g}}=\mathbf{0} when it is true. From an asymptotic perspective, it is harmless to separately estimate the 𝜷𝒈\bm{\beta_{g}} even when they are zero. When they are not all zero, estimating them leads to asymptotic efficiency gains.

Theorem 1 also implies that any smooth nonlinear function of 𝝁\bm{\mu} is estimated more efficiently using 𝝁^SRA\bm{\hat{\mu}}_{SRA} by the delta method. For example, in estimating a percentage difference in means, we would be interested in μ2/μ1\mu_{2}/\mu_{1}, and using SRA is asymptotically more efficient than using SM.

4.2 Separate RA versus Pooled RA

The comparison between SRA and PRA is simple given the expressions in (7) and (8) because, as stated earlier, 𝐅i\mathbf{F}_{i}, 𝐊i\mathbf{K}_{i}, and 𝐐i\mathbf{Q}_{i} are pairwise uncorrelated.

Theorem 2 (Efficiency of SRA relative to PRA).

Under Assumptions 1, 2, SUTVA, and finite second moments used to obtain the asymptotic representations in 7 and 8,

Avar[N(𝝁^PRA𝝁)]Avar[N(𝝁^SRA𝝁)]=𝛀𝐅\mathrm{Avar}\left[\sqrt{N}\left(\bm{\hat{\mu}}_{PRA}-\bm{\mu}\right)\right]-\mathrm{Avar}\left[\sqrt{N}\left(\bm{\hat{\mu}}_{SRA}-\bm{\mu}\right)\right]=\mathbf{\Omega_{F}}

where 𝛀𝐅𝔼(𝐅i𝐅i)\bm{\Omega}_{\mathbf{F}}\equiv\mathbb{E}(\mathbf{F}_{i}\mathbf{F}_{i}^{\prime}) is PSD.

The proof can be found in Appendix A. It follows immediately from Theorem 2 that 𝝁^SRA\bm{\hat{\mu}}_{SRA} is never less asymptotically efficient than 𝝁^PRA\bm{\hat{\mu}}_{PRA}. There are some special cases where the estimators achieve the same asymptotic variance, the most obvious being when the slopes in the linear projections are homogeneous: 𝜷𝟏=𝜷𝟐==𝜷𝒈\bm{\beta_{1}}=\bm{\beta_{2}}=\cdots=\bm{\beta_{g}}. As with comparing SRA with subsample means, there is no gain in efficiency from imposing this restriction when it is true. This is another fact that makes SRA attractive if the sample sizes within each treatment level are not small.

Other situations where there is no asymptotic efficiency gain in using SRA are more subtle. In general, suppose we are interested in linear combinations τ=𝐚𝝁\tau=\mathbf{a}^{\prime}\bm{\mu} for a given G×1G\times 1 vector 𝐚\mathbf{a}. If 𝐚𝛀𝐅𝐚=0\mathbf{a}^{\prime}\mathbf{\Omega}_{\mathbf{F}}\mathbf{a}=0, then 𝐚𝝁^PRA\mathbf{a}^{\prime}\bm{\hat{\mu}}_{PRA} is asymptotically as efficient as 𝐚𝝁^SRA\mathbf{a}^{\prime}\bm{\hat{\mu}}_{SRA} for estimating τ\tau. Generally, the diagonal elements of 𝛀F=𝔼(𝐅i𝐅i)\mathbf{\Omega}_{F}=\mathbb{E}\left(\mathbf{F}_{i}\mathbf{F}_{i}^{\prime}\right) are (1ρg)𝜹𝒈𝛀𝐗𝜹𝒈/ρg(1-\rho_{g})\bm{\delta_{g}}^{\prime}\mathbf{\Omega}_{\mathbf{X}}\bm{\delta_{g}}/\rho_{g} because 𝔼[(Wigρg)2]=ρg(1ρg)\mathbb{E}\left[\left(W_{ig}-\rho_{g}\right)^{2}\right]=\rho_{g}(1-\rho_{g}). The off diagonal terms of 𝛀𝐅\mathbf{\Omega}_{\mathbf{F}} are 𝜹𝒈𝛀𝐗𝜹𝒉-\bm{\delta_{g}}^{\prime}\mathbf{\Omega}_{\mathbf{X}}\bm{\delta_{h}} because 𝔼[(Wigρg)(Wihρh)]=ρgρh\mathbb{E}\left[\left(W_{ig}-\rho_{g}\right)\left(W_{ih}-\rho_{h}\right)\right]=-\rho_{g}\rho_{h}. Now consider the case covered in NW (2021), where G=2G=2 and 𝐚=(1,1)\mathbf{a}^{\prime}=\left(-1,1\right), so the parameter of interest is τ=μ2μ1\tau=\mu_{2}-\mu_{1} (the ATE). If ρ1=ρ2=1/2\rho_{1}=\rho_{2}=1/2 then

𝛀𝐅=(𝜹𝟏𝛀𝐗𝜹𝟏𝜹𝟏𝛀𝐗𝜹𝟐𝜹𝟐𝛀𝐗𝜹𝟏𝜹𝟐𝛀𝐗𝜹𝟐)\mathbf{\Omega}_{\mathbf{F}}=\begin{pmatrix}\bm{\delta_{1}}^{\prime}\mathbf{\Omega}_{\mathbf{X}}\bm{\delta_{1}}&-\bm{\delta_{1}}^{\prime}\mathbf{\Omega}_{\mathbf{X}}\bm{\delta_{2}}\\ -\bm{\delta_{2}}^{\prime}\mathbf{\Omega}_{\mathbf{X}}\bm{\delta_{1}}&\bm{\delta_{2}}^{\prime}\mathbf{\Omega}_{\mathbf{X}}\bm{\delta_{2}}\end{pmatrix}

Now 𝜹𝟐=𝜹𝟏\bm{\delta_{2}}=-\bm{\delta_{1}} because 𝜹𝟏=𝜷𝟏(𝜷𝟏+𝜷𝟐)/2=(𝜷𝟏𝜷𝟐)/2=𝜹2\bm{\delta_{1}}=\bm{\beta_{1}}-(\bm{\beta_{1}}+\bm{\beta_{2}})/2=(\bm{\beta_{1}}-\bm{\beta_{2}})/2=-\bm{\delta}_{2}. Therefore,

𝛀𝐅=(𝜹𝟏𝛀𝐗𝜹𝟏𝜹𝟏𝛀𝐗𝜹𝟏𝜹𝟏𝛀𝐗𝜹𝟏𝜹𝟏𝛀𝐗𝜹𝟏)and(11)(𝜹𝟏𝛀𝐗𝜹𝟏𝜹𝟏𝛀𝐗𝜹𝟏𝜹𝟏𝛀𝐗𝜹𝟏𝜹𝟏𝛀𝐗𝜹𝟏)(11)=0.\mathbf{\Omega}_{\mathbf{F}}=\begin{pmatrix}\bm{\delta_{1}}^{\prime}\mathbf{\Omega}_{\mathbf{X}}\bm{\delta_{1}}&\bm{\delta_{1}}^{\prime}\mathbf{\Omega}_{\mathbf{X}}\bm{\delta_{1}}\\ \bm{\delta_{1}}^{\prime}\mathbf{\Omega}_{\mathbf{X}}\bm{\delta_{1}}&\bm{\delta_{1}}^{\prime}\mathbf{\Omega}_{\mathbf{X}}\bm{\delta_{1}}\end{pmatrix}~{}and~{}\begin{pmatrix}-1&1\end{pmatrix}\begin{pmatrix}\bm{\delta_{1}}^{\prime}\mathbf{\Omega}_{\mathbf{X}}\bm{\delta_{1}}&\bm{\delta_{1}}^{\prime}\mathbf{\Omega}_{\mathbf{X}}\bm{\delta_{1}}\\ \bm{\delta_{1}}^{\prime}\mathbf{\Omega}_{\mathbf{X}}\bm{\delta_{1}}&\bm{\delta_{1}}^{\prime}\mathbf{\Omega}_{\mathbf{X}}\bm{\delta_{1}}\end{pmatrix}\begin{pmatrix}-1\\ 1\end{pmatrix}=0.

Interestingly, the asymptotic equivalence of the separate and pooled OLS methods does not extend to the case G3G\geq 3. For any G3G\geq 3 with ρg=1/G\rho_{g}=1/G for all gg, 1ρg=11G=(G1)G1-\rho_{g}=1-\frac{1}{G}=\frac{\left(G-1\right)}{G} and so 1ρgρg=G1\frac{1-\rho_{g}}{\rho_{g}}=G-1. Note that 𝜹𝒈=𝜷𝒈(𝜷𝟏+𝜷𝟐++𝜷𝒈)/G\bm{\delta_{g}}=\bm{\beta_{g}}-\left(\bm{\beta_{1}}+\bm{\beta_{2}}+\cdots+\bm{\beta_{g}}\right)/G and it is less clear when a degeneracy occurs in the difference in asymptotic variances. So, for example, in our application which estimates a lower bound mean willingness to pay, SRA is generally more efficient than PRA even though the assignment probabilities are identical.

5 Nonlinear Regression Adjustment

We now discuss a class of nonlinear regression adjustment methods that preserve consistency without adding additional assumptions other than the ones discussed in section 2 (and weak regularity conditions). In particular, we extend the setup in NW (2021) to more than two treatment levels.

Typically, when the outcome is discrete or has limited support, nonlinear functions of 𝔼[Y(g)|𝐗]\mathbb{E}[Y(g)|\mathbf{X}] may be more appropriate and may provide a better approximation to the true conditional mean. However, consistency with nonlinear models is closely tied to functional form assumptions. In this section, we discuss nonlinear methods that are consistent for 𝔼[Y(g)]\mathbb{E}[Y(g)] despite possible misspecification in 𝔼[Y(g)|𝐗]\mathbb{E}[Y(g)|\mathbf{X}]. Not surprisingly, using a canonical link function in the context of QMLE in the linear exponential family plays a key role. We show that both separate and pooled nonlinear methods are consistent provided we choose the mean functions and objective functions appropriately. Unlike in the linear case, we can only show that SRA improves over the subsample means estimator when the conditional means are correctly specified.

5.1 Separate Regression Adjustment

We model the conditional means, 𝔼[Y(g)|𝐗]\mathbb{E}\left[Y(g)|\mathbf{X}\right], for each g=1,2,Gg=1,2,...G to be m(αg+𝐗𝜷𝒈)m(\alpha_{g}+\mathbf{X}\bm{\beta_{g}}), where m()m\left(\cdot\right) is a smooth function defined on \mathbb{R}.888The range of m()m\left(\cdot\right) is chosen to reflect the nature of Y(g)Y(g). Given that the nature of Y(g)Y(g) does not change across treatment levels, we choose a common function m()m\left(\cdot\right) across all gg. Also, as usual, the vector 𝐗\mathbf{X} can include nonlinear functions (typically squares, interactions, and so on) of underlying covariates. In the generalized linear model literature, m1()m^{-1}(\cdot) is known as the link function and relates the conditional mean of the outcome to a linear predictor. A canonical link is tied to a specific quasi-log-likelihood (QLL) in the LEF such that if one chooses certain combinations of QLLs and canonical link functions, we obtain

𝔼[Y(g)]=𝔼[m(αg+𝐗𝜷𝒈)],\mathbb{E}\left[Y(g)\right]=\mathbb{E}\left[m(\alpha_{g}^{\ast}+\mathbf{X}\bm{\beta_{g}}^{\ast})\right], (9)

where αg\alpha_{g}^{\ast} and 𝜷𝒈\bm{\beta_{g}}^{\ast} are the probability limits of the QMLE whether or not the conditional mean function, chosen to be m()m(\cdot), is correctly specified. Table 1 gives the pairs of mean function (or canonical links) and QLLs that ensure consistent estimation of PO means. To ensure consistency, the mean should have the index form common in the generalized linear models literature.

Support Restrictions Mean Function QLL
None Linear Gaussian (Normal)
Y(g)[0,1]Y(g)\in\left[0,1\right] (binary, fractional) Logistic Bernoulli
Y(g)[0,B]Y(g)\in\left[0,B\right] (count, corners) Logistic Binomial
Y(g)0Y(g)\geq 0 (count, continuous, corner ) Exponential Poisson
  • This table enumerates combinations of quasi-log-likelihood and canonical link function specifications, both of which depend on the kind of restrictions placed on the support of the response variable. The selection of these QLL and conditional mean function specifications will ensure consistent estimation of the PO means despite misspecification in the mean function.

Table 1: Combinations of Means and QLLs to Ensure Consistency

The binomial QMLE is a good choice for counts with a known upper bound, even if it is individual-specific (BiB_{i} need not be a positive integer for each ii). It can also be applied to corner solution outcomes in the interval [0,Bi][0,B_{i}] where the outcome is continuous on (0,Bi)(0,B_{i}) but perhaps has mass at zero or BiB_{i}. The leading case is Bi=BB_{i}=B. Note that we do not recommend a Tobit model in such cases because the Tobit estimates of μg\mu_{g} are not robust to failure of the Tobit distributional assumptions or the Tobit form of the conditional mean.

This “mean fitting property” given in (9) can be established by studying the population first order conditions (FOCs) with the choice of canonical link function,

𝔼[Wg(Ym(αg+𝐗𝜷𝒈))]\displaystyle\mathbb{E}\left[W_{g}\cdot(Y-m(\alpha_{g}^{\ast}+\mathbf{X}\bm{\beta_{g}}^{\ast}))\right] =0\displaystyle=0 (10)
𝔼[Wg𝐗(Ym(αg+𝐗𝜷𝒈))]\displaystyle\mathbb{E}\left[W_{g}\cdot\mathbf{X}^{\prime}\cdot(Y-m(\alpha_{g}^{\ast}+\mathbf{X}\bm{\beta_{g}}^{\ast}))\right] =𝟎.\displaystyle=\mathbf{0}. (11)

It follows from equations (1) and (10) that

𝔼[WgY]=𝔼[WgY(g)]=𝔼[Wgm(αg+𝐗𝜷𝒈)].\mathbb{E}\left[W_{g}Y\right]=\mathbb{E}\left[W_{g}Y(g)\right]=\mathbb{E}\left[W_{g}m(\alpha_{g}^{\ast}+\mathbf{X}\bm{\beta_{g}}^{\ast})\right]. (12)

Also, we can rewrite the right hand side using the law of iterated expectations as

𝔼[Wgm(αg+𝐗𝜷𝒈)]=𝔼(𝔼[Wgm(αg+𝐗𝜷𝒈)|𝐗])=ρg𝔼[m(αg+𝐗𝜷𝒈)],\mathbb{E}\left[W_{g}m(\alpha_{g}^{\ast}+\mathbf{X}\bm{\beta_{g}}^{\ast})\right]=\mathbb{E}\left(\mathbb{E}\left[W_{g}m(\alpha_{g}^{\ast}+\mathbf{X}\bm{\beta_{g}}^{\ast})|\mathbf{X}\right]\right)=\rho_{g}\mathbb{E}\left[m(\alpha_{g}^{\ast}+\mathbf{X}\bm{\beta_{g}}^{\ast})\right], (13)

where the second equality holds by random assignment. Using (12) and (13) along with the fact that 𝔼[WgY(g)]=ρgμg\mathbb{E}\left[W_{g}Y(g)\right]=\rho_{g}\mu_{g}, we obtain the mean fitting result in equation (9). Applying nonlinear RA, therefore, with multiple treatment levels is straightforward. One can obtain α^g\hat{\alpha}_{g} and 𝜷^𝒈\bm{\hat{\beta}_{g}} by solving the sample analogues of conditions (10) and (11). For treatment level gg, after obtaining α^g\hat{\alpha}_{g}, 𝜷^𝒈\bm{\hat{\beta}_{g}} by QMLE, the mean, μg\mu_{g}, is estimated as μ^g=N1i=1Nm(α^g+𝐗i𝜷^𝒈)\hat{\mu}_{g}=N^{-1}\sum_{i=1}^{N}m(\hat{\alpha}_{g}+\mathbf{X}_{i}\bm{\hat{\beta}_{g}}), which includes linear RA as a special case. This estimator is consistent by a standard application of the uniform law of large numbers; see, for example, Wooldridge (2010) (Chapter 12, Lemma 12.1). As in the linear case, the FOCs of the QMLEs using a canonical link in the LEF allow us to write the subsample averages as

Y¯g=Ng1i=1NWigm(α^g+𝐗i𝜷^𝒈)g=1,,G.\bar{Y}_{g}=N_{g}^{-1}\sum_{i=1}^{N}W_{ig}m(\hat{\alpha}_{g}+\mathbf{X}_{i}\bm{\hat{\beta}_{g}})\text{, }g=1,...,G.

Given that μ^g\hat{\mu}_{g} averages across all of the observations rather than just the subset of units at treatment level gg, it seems that μ^g\hat{\mu}_{g} should be asymptotically more efficient than Y¯g\bar{Y}_{g}. Unfortunately, the proof used in the linear case does not generally go through in the nonlinear case. Nevertheless, when the conditional mean function is correctly specified, it is possible to show that nonlinear SRA improves over the subsample means estimator.

Theorem 3 (Efficiency of SRA relative to SM).

Assume 1, 2, SUTVA, and finite second moments used to obtain the asymptotic representation for the subsample means estimator and add that the SRA estimator uses canonical link function, m()m(\cdot), with a QMLE in the linear exponential family. If for some αg\alpha_{g}^{\ast}, 𝛃𝐠\bm{\beta_{g}}^{\ast}, 𝔼[Y(g)|𝐗]=m(αg+𝐗𝛃𝐠)g=1,,G\mathbb{E}\left[Y(g)|\mathbf{X}\right]=m(\alpha_{g}^{\ast}+\mathbf{X}\bm{\beta_{g}}^{\ast})\text{, }g=1,...,G with probability one, then

Avar[N(𝝁^SM𝝁)]Avar[N(𝝁^SRA𝝁)]is PSD.\mathrm{Avar}\left[\sqrt{N}(\bm{\hat{\mu}}_{SM}-\bm{\mu})\right]-\mathrm{Avar}\left[\sqrt{N}(\bm{\hat{\mu}}_{SRA}-\bm{\mu})\right]\text{is PSD}.

The proof can be found in Appendix A. Even when the means are not correctly specified, it seems that separate nonlinear estimation should improve over the subsample means estimator if the 𝐗\mathbf{X} are sufficiently predictive. We see that in the simulation results in Section 7. Finally, as with the linear case, if assignment is only unconfounded conditional on X, nonlinear SRA is consistent under correct specification of the conditional mean, whereas SM is not.

5.2 Pooled Regression Adjustment

In cases where NN is not especially large, one might, just as in the linear case, resort to a pooled method. Provided the mean/QLL combinations are chosen as in Table 1, PRA is still consistent under arbitrary misspecification of the mean function. To see why, write the mean function with common slopes 𝜷\bm{\beta}, and without an intercept in the index, as m(γ1W1+γ2W2++γGWG+𝐗𝜷)m(\gamma_{1}W_{1}+\gamma_{2}W_{2}+\cdots+\gamma_{G}W_{G}+\mathbf{X}\bm{\beta}). Using a similar argument as in the previous section, the first-order conditions of the pooled QMLE include the GG equations

N1i=1NWig[Yim(γˇ1Wi1+γˇ2Wi2++γˇGWiG+𝐗i𝜷ˇ)]=0g=1,,G.N^{-1}\sum_{i=1}^{N}W_{ig}\left[Y_{i}-m(\check{\gamma}_{1}W_{i1}+\check{\gamma}_{2}W_{i2}+\cdots+\check{\gamma}_{G}W_{iG}+\mathbf{X}_{i}\bm{\check{\beta}})\right]=0\text{, }g=1,...,G. (14)

Therefore, assuming no degeneracies, the probability limits of the estimators, indicated with a “*” superscript, solve the population analogs of eq (14):

𝔼(WgY)=𝔼[WgY(g)]=𝔼[Wgm(𝐖𝜸+𝐗𝜷)],\mathbb{E}\left(W_{g}Y\right)=\mathbb{E}\left[W_{g}Y(g)\right]=\mathbb{E}\left[W_{g}m(\mathbf{W}\bm{\gamma}^{\ast}+\mathbf{X}\bm{\beta}^{\ast})\right], (15)

where 𝐖=(W1,W2,,WG)\mathbf{W}=(W_{1},W_{2},...,W_{G}). By random assignment, 𝔼[WgY(g)]=ρgμg\mathbb{E}\left[W_{g}Y(g)\right]=\rho_{g}\mu_{g} and by iterated expectations

𝔼[Wgm(𝐖𝜸+𝐗𝜷)]=𝔼{𝔼[Wgm(𝐖𝜸+𝐗𝜷)|𝐗]}.\mathbb{E}\left[W_{g}m(\mathbf{W}\bm{\gamma}^{\ast}+\mathbf{X}\bm{\beta}^{\ast})\right]=\mathbb{E}\left\{\mathbb{E}\left[W_{g}m(\mathbf{W}\bm{\gamma}^{\ast}+\mathbf{X}\bm{\beta}^{\ast})|\mathbf{X}\right]\right\}. (16)

Using random assignment again,

𝔼[Wgm(𝐖𝜸+𝐗𝜷)|𝐗]=(Wg=1|𝐗)m(γg+𝐗𝜷)=ρgm(γg+𝐗𝜷).\mathbb{E}\left[W_{g}m(\mathbf{W}\bm{\gamma}^{\ast}+\mathbf{X}\bm{\beta}^{\ast})|\mathbf{X}\right]=\mathbb{P}(W_{g}=1|\mathbf{X})m(\gamma_{g}^{\ast}+\mathbf{X}\bm{\beta}^{\ast})=\rho_{g}m(\gamma_{g}^{\ast}+\mathbf{X}\bm{\beta}^{\ast}). (17)

Combining equations (16) and (17) gives 𝔼[Wgm(𝐖𝜸+𝐗𝜷)]=ρg𝔼[m(γg+𝐗𝜷)]\mathbb{E}\left[W_{g}m(\mathbf{W}\bm{\gamma}^{\ast}+\mathbf{X}\bm{\beta}^{\ast})\right]=\rho_{g}\mathbb{E}\left[m(\gamma_{g}^{\ast}+\mathbf{X}\bm{\beta}^{\ast})\right]. Now, using ρg>0\rho_{g}>0 and 𝔼[WgY(g)]=ρgμg\mathbb{E}[W_{g}Y(g)]=\rho_{g}\mu_{g}, we have shown that equation (15) gives us

μg=𝔼[m(γg+𝐗𝜷)],\mu_{g}=\mathbb{E}\left[m(\gamma_{g}^{\ast}+\mathbf{X}\bm{\beta}^{\ast})\right], (18)

which shows that the μg\mu_{g} are identified by the population mean of m(γg+𝐗𝜷)m(\gamma_{g}^{\ast}+\mathbf{X}\bm{\beta}^{\ast}) even when the conditional mean function is misspecified. Under weak regularity conditions, γˇg\check{\gamma}_{g} is consistent for γg\gamma_{g}^{\ast} and 𝜷ˇ\bm{\check{\beta}} is consistent for 𝜷\bm{\beta}^{\ast}. Therefore, after the pooled QMLE estimation, we obtain the estimated means as μˇg=N1i=1Nm(γˇg+𝐗i𝜷ˇ),\check{\mu}_{g}=N^{-1}\sum_{i=1}^{N}m(\check{\gamma}_{g}+\mathbf{X}_{i}\bm{\check{\beta}}), and these are consistent [Wooldridge (2010) (Chapter 12, Lemma 12.1)]. As in the case of comparing nonlinear SRA to the SM, we have no general asymptotic efficiency results comparing nonlinear SRA to nonlinear PRA.

6 Semiparametric Efficiency Bound

In this section, we generalize the efficiency result with regression adjustment in experiments for multiple treatments by providing an efficiency benchmark for estimating the vector of PO means. We characterize the efficient influence function and the semiparametric efficiency bound (SEB) for all regular and asymptotically linear estimators of 𝝁\bm{\mu}. The SEB is useful as it provides the semiparametric analogue of the Cramer-Rao lower bound for parametric models. It gives us a standard against which to compare the asymptotic variance of any regular estimator of the potential outcome means, including the RA estimators studied in the previous sections.

In our case, the most efficient estimator for 𝝁\bm{\mu} has the following asymptotically linear form

N(𝝁^𝝁)=N1/2i=1N𝝍i+op(1)\sqrt{N}\left(\bm{\hat{\mu}}-\bm{\mu}\right)=N^{-1/2}\sum_{i=1}^{N}\bm{\psi}_{i}+o_{p}(1)

such that 𝔼[𝝍i]=𝟎\mathbb{E}[\bm{\psi}_{i}]=\bm{0} and 𝔼[𝝍i𝝍i]\mathbb{E}[\bm{\psi}_{i}\bm{\psi}_{i}^{\prime}] is finite and non-singular. Then, the SEB is given by 𝐕=𝔼[𝝍i𝝍i]\mathbf{V_{\ast}}=\mathbb{E}[\bm{\psi}_{i}\bm{\psi}_{i}^{\prime}] where the influence function is

𝝍i(Wi1Yi(1)/ρ1ρ11(Wi1ρ1)𝔼[Yi(1)|𝐗i]μ1Wi2Yi(2)/ρ2ρ21(Wi2ρ2)𝔼[Yi(2)|𝐗i]μ2WiGYi(G)/ρGρG1(WiGρG)𝔼[Yi(G)|𝐗i]μG).\displaystyle\bm{\psi}_{i}\equiv\begin{pmatrix}W_{i1}Y_{i}(1)/\rho_{1}-\rho_{1}^{-1}\left(W_{i1}-\rho_{1}\right)\mathbb{E}[Y_{i}(1)|\mathbf{X}_{i}]-\mu_{1}\\ W_{i2}Y_{i}(2)/\rho_{2}-\rho_{2}^{-1}\left(W_{i2}-\rho_{2}\right)\mathbb{E}[Y_{i}(2)|\mathbf{X}_{i}]-\mu_{2}\\ \vdots\\ W_{iG}Y_{i}(G)/\rho_{G}-\rho_{G}^{-1}\left(W_{iG}-\rho_{G}\right)\mathbb{E}[Y_{i}(G)|\mathbf{X}_{i}]-\mu_{G}\end{pmatrix}. (19)

This result follows from Cattaneo (2010), who studies efficient semiparametric estimation of treatment effects under the assumption of unconfoundedness. Given that we have the stronger assumption of randomized treatment, this essentially leads us to the same moment conditions as Cattaneo (2010) with the exception that the propensity score here is a constant.

Theorem 4 (Semiparametric efficiency of SRA).

If the conditional means of the potential outcomes, 𝔼[Y(g)|𝐗]\mathbb{E}[Y(g)|\mathbf{X}] for each gg, are correctly specified then SRA is semiparametrically efficient.

The proof can be found in Appendix A. The above result establishes that the separate regression adjustment estimator achieves the semiparametric efficiency bound if the true conditional mean of the POs is correctly specified. This implies that running separate linear regressions for different treatment level, as advocated in section 4, is optimally efficient if the mean function, m()m(\cdot), is truly linear. However, if m()m(\cdot) is truly logistic or exponential, then nonlinear SRA discussed in section 5, is optimally efficient for estimating the PO means.

7 Monte Carlo Simulations

This section studies the finite sample properties of the different estimators of the PO means, namely, SM, PRA, SRA, and their nonlinear counterparts. For the simulations, we generate a population of one million observations and mimic the asymptotic setting of random sampling from an “infinite” population. The empirical distributions of the RA estimators are simulated for sample sizes N{500,1000,5000}N\in\{500,1000,5000\} by randomly drawing the data vector {(𝐖i,Yi,𝐗i):i=1,2,,N}\left\{\left(\mathbf{W}_{i},Y_{i},\mathbf{X}_{i}\right):i=1,2,...,N\right\}, without replacement, ten thousand times from the population. We consider three different population models, each of which uses a unique data generating process for the potential outcomes, but we report results only for the linear model. Results for fractional and non-negative outcomes can be found in online appendix B.999See Tables B.1-B.4 for results on fractional and non-negative outcomes.

To simulate multiple treatments, we consider potential outcomes, Y(g)Y(g), corresponding to three treatment states, g=1,2,3g=1,2,3. Hence, G=3G=3 for all population models. In each of the populations, the treatment vector 𝐖=(W1,W2,W3)\mathbf{W}=\left(W_{1},W_{2},W_{3}\right) is generated with probability mass function defined by ρg\rho_{g} such that there is an equally likely probability of being assigned to a particular treatment group.

7.1 Population Models

To compare the empirical distributions of the RA estimators, we consider the first population model which simulates Y(g)Y(g) for each gg to be continuous with an unrestricted range using a linear specification. We consider two covariates, 𝐗=(X1,X2)\mathbf{X}=\left(X_{1},X_{2}\right), where X1X_{1} is continuously distributed whereas X2X_{2} is binary. The covariates are generated as follows:

X1\displaystyle X_{1} =K1+V and X2={1, if K2+V2>00, otherwise\displaystyle=K_{1}+V\text{ and }X_{2}=\begin{cases}1,\text{ if }K_{2}+\displaystyle\frac{V}{2}>0\\ 0,\text{ otherwise}\end{cases}

where K1,K2N(0,2)K_{1},K_{2}\sim N(0,2), and VN(0,1)V\sim N(0,1).

Population 1:

For each gg, Y(g)=𝐗˘𝜸𝒈+R(g)Y(g)=\mathbf{\breve{X}}\bm{\gamma_{g}}+R(g) where 𝐗˘=(1,X1,X2,X1X2)\mathbf{\breve{X}}=\left(1,X_{1},X_{2},X_{1}\cdot X_{2}\right) and 𝜸g=(γ0g,γ1g,γ2g,γ3g)\bm{\gamma}_{g}=\left(\gamma_{0g},\gamma_{1g},\gamma_{2g},\gamma_{3g}\right)^{\prime}. Moreover, R(g)|X1,X2N(0,σg2)R(g)\lvert X_{1},X_{2}\sim N(0,\sigma_{g}^{2}) and R(1),R(2), and R(3)R(1),R(2),\text{ and }R(3) are correlated.101010R(1)N(0,1),R(2)=3/2[R(1)+V1],R(3)=1/2[R(1)+V2]R(1)\sim N(0,1),R(2)=3/\sqrt{2}\cdot[R(1)+V_{1}],R(3)=1/\sqrt{2}\cdot[R(1)+V_{2}] where V1V_{1} and V2V_{2} are distributed N(0,1)N(0,1). The parameter vector, 𝜸g\bm{\gamma}_{g}, is chosen to depict settings where covariates are either mildly predictive (R2(g)<0.5R^{2}(g)<0.5) or highly predictive (R2(g)>0.5R^{2}(g)>0.5) of the potential outcomes. For example, the parameter vectors, 𝜸gL\bm{\gamma}_{g}^{L} and 𝜸gH\bm{\gamma}_{g}^{H} are chosen such that they lead to low and high RR-squared settings, respectively.

𝜸1L=(0,0.2,0.1,0.1) , 𝜸2L=(1,1.4,1,1) , 𝜸3L=(2,0.01,0.1,0.5)𝜸1H=(0,1,0.2,0.3) , 𝜸2H=(1,2.5,0,0.3) , 𝜸3H=(2,1.1,1,0.1)\begin{split}\bm{\gamma}_{1}^{L}&=\left(0,0.2,0.1,0.1\right)^{\prime}\text{ , }\bm{\gamma}_{2}^{L}=\left(1,1.4,-1,-1\right)^{\prime}\text{ , }\bm{\gamma}_{3}^{L}=\left(2,-0.01,0.1,0.5\right)^{\prime}\\ \bm{\gamma}_{1}^{H}&=\left(0,1,0.2,-0.3\right)^{\prime}\text{ , }\bm{\gamma}_{2}^{H}=\left(1,2.5,0,-0.3\right)^{\prime}\text{ , }\bm{\gamma}_{3}^{H}=\left(2,1.1,-1,0.1\right)^{\prime}\end{split}

While estimating the PO means, we assume that the true population model is unknown. With linear RA methods, we simply regress the observed outcome on a constant and the covariates. For nonlinear RA, based on the nature of the outcomes, we use a particular combination of log-likelihood and canonical link function from the LEF.

7.2 Discussion

Tables 2 and 3 report the bias, standard deviation, standard error, and 95% coverage probabilities for SM, PRA, and SRA across both high and low R-squared settings for population 1. Each table considers three different sample sizes. Note that in most cases, the bias of RA methods is comparable to SM, which is unbiased for the PO means. However, one may be willing to forego the bias in RA estimates in favor of efficiency and so we turn our attention to efficiency comparisons.

Across both low and high R2R^{2} configurations, we see that regression adjustment typically improves over subsample means. The magnitude of such a precision gain with PRA depends on how predictive the covariates are of the potential outcomes. We see that this gain is generally smaller at low R2R^{2} than high R2R^{2} values. For instance, in Table 2, the standard deviations of the PRA estimates are between 0.2%-7.7% smaller, across the different sample sizes, compared to the SM estimator at low R2R^{2} settings, whereas these differences are of order 13.6%-31.8% for the high R2R^{2} setting (see Table 3). The difference in precision between PRA and SRA depends on how heterogeneous are the slope coefficients in the true linear projections. While the standard deviation of the SRA estimates are between 2.3%-6.9% smaller than those of PRA in the low R2R^{2} case, these differences are of order 5.7%-19.4% for the high R2R^{2} case. Note that comparisons made with standard error estimates give us similar ranges, since large sample approximations are quite good. For fractional and non-negative outcomes, nonlinear RA generally seems to improve over linear RA methods and SM (see Tables B.1-B.4 in the online appendix).

Low R𝟐\bm{R^{2}}
Estimator N 500 1000 5000
𝝁𝟏\bm{\mu_{1}} 𝝁𝟐\bm{\mu_{2}} 𝝁𝟑\bm{\mu_{3}} 𝝁𝟏\bm{\mu_{1}} 𝝁𝟐\bm{\mu_{2}} 𝝁𝟑\bm{\mu_{3}} 𝝁𝟏\bm{\mu_{1}} 𝝁𝟐\bm{\mu_{2}} 𝝁𝟑\bm{\mu_{3}}
SM Bias 0.0035 -0.0051 -0.0031 0.0026 -0.0030 -0.0024 0.0021 -0.0019 -0.0031
Sd 0.0899 0.2948 0.0982 0.0631 0.2075 0.0692 0.0279 0.0926 0.0309
Se 0.0894 0.2925 0.0978 0.0633 0.2072 0.0693 0.0283 0.0927 0.0310
Cr 0.9468 0.9470 0.9461 0.9512 0.9503 0.9496 0.9536 0.9491 0.9488
PRA Bias 0.0016 -0.0050 -0.0013 0.0011 -0.0036 -0.0003 0.0006 -0.0026 -0.0008
Sd 0.0884 0.2730 0.0978 0.0620 0.1914 0.0686 0.0275 0.0860 0.0309
Se 0.0880 0.2702 0.0975 0.0622 0.1916 0.0690 0.0278 0.0858 0.0309
Cr 0.9464 0.9480 0.9490 0.9505 0.9512 0.9490 0.9529 0.9477 0.9500
SRA Bias 0.0026 -0.0049 -0.0022 0.0018 -0.0043 -0.0013 0.0013 -0.0034 -0.0020
Sd 0.0823 0.2661 0.0923 0.0578 0.1857 0.0646 0.0256 0.0841 0.0290
Se 0.0816 0.2621 0.0912 0.0579 0.1860 0.0648 0.0259 0.0834 0.0290
Cr 0.9452 0.9450 0.9454 0.9495 0.9511 0.9490 0.9535 0.9467 0.9488
  • a

    Here SM refers to subsample means, PRA is pooled regression adjustment, and SRA is separate regression adjustment estimator.

  • b

    Empirical distributions are generated with 10,000 Monte Carlo repetitions.

  • c

    Row labeled ‘Se’ stands for average of the estimated standard error based on the asymptotic variance formulas and ‘Cr’ stands for 95% coverage probability/rate.

  • d

    The true population mean vector is: 𝝁=(0.0588,0.3999,2.0969)\bm{\mu}=(0.0588,0.3999,2.0969)

  • e

    Low R2R^{2} corresponds to (R12,R22,R32)=(0.2405,0.2865,0.1839)(R^{2}_{1},R^{2}_{2},R^{2}_{3})=\left(0.2405,0.2865,0.1839\right)

Table 2: Monte Carlo results for RA estimators under population 1 (linear) for a low R2R^{2}
High R𝟐\bm{R^{2}}
Estimator N 500 1,000 5,000
𝝁𝟏\bm{\mu_{1}} 𝝁𝟐\bm{\mu_{2}} 𝝁𝟑\bm{\mu_{3}} 𝝁𝟏\bm{\mu_{1}} 𝝁𝟐\bm{\mu_{2}} 𝝁𝟑\bm{\mu_{3}} 𝝁𝟏\bm{\mu_{1}} 𝝁𝟐\bm{\mu_{2}} 𝝁𝟑\bm{\mu_{3}}
SM Bias 0.0056 -0.0038 -0.0042 0.0044 0.0003 -0.0051 0.0041 0.0011 -0.0068
Sd 0.1713 0.4711 0.2151 0.1201 0.3317 0.1528 0.0531 0.1474 0.0684
Se 0.1692 0.4684 0.2143 0.1198 0.3314 0.1517 0.0536 0.1484 0.0678
Cr 0.9473 0.9485 0.9514 0.9483 0.9491 0.9484 0.9542 0.9529 0.9506
PRA Bias -0.0001 -0.0032 0.0011 -0.0003 -0.0015 0.0013 -0.0009 -0.0010 0.0002
Sd 0.1470 0.3574 0.1482 0.1031 0.2508 0.1046 0.0459 0.1119 0.0467
Se 0.1460 0.3542 0.1481 0.1032 0.2509 0.1047 0.0461 0.1123 0.0468
Cr 0.9466 0.9464 0.9510 0.9492 0.9494 0.9484 0.9502 0.9495 0.9514
SRA Bias 0.0025 -0.0029 -0.0001 0.0016 -0.0024 0.0001 0.0012 -0.0021 -0.0011
Sd 0.1187 0.3340 0.1388 0.0840 0.2333 0.0978 0.0370 0.1049 0.0440
Se 0.1180 0.3301 0.1387 0.0836 0.2340 0.0982 0.0374 0.1048 0.0440
Cr 0.9503 0.9465 0.9489 0.9486 0.9521 0.9490 0.9526 0.9484 0.9488
  • a

    SM refers to subsample means, PRA is pooled regression adjustment, and SRA is separate regression adjustment.

  • b

    Empirical distributions are generated with 10,000 Monte Carlo repetitions.

  • c

    Row labeled ‘Se’ stands for average of the estimated standard error based on the asymptotic variance formulas and ‘Cr’ stands for 95% coverage probability/rate.

  • d

    The true population mean vector is: 𝝁=(0.0698,0.9654,1.5069)\bm{\mu}=(0.0698,0.9654,1.5069).

  • e

    High R2R^{2} corresponds to an (R12,R22,R32)=(0.7671,0.7517,0.8680)(R^{2}_{1},R^{2}_{2},R^{2}_{3})=\left(0.7671,0.7517,0.8680\right).

Table 3: Monte Carlo results for RA estimators under for population 1 (linear) for high R2R^{2}

8 Application to California Oil Spill Study

This section applies regression adjustment to estimate the lower bound mean willingness-to-pay (WTP) using survey data from the California oil spill study that can be found in Carson et al. (2004).111111See online appendix C on how RA methods can be applied to estimate the lower bound on mean WTP. The study implements a contingent valuation survey to assess the value of damages to natural resources from future oil spills along California’s central coast. The survey provides respondents with the choice of voting for or against a governmental program that would prevent natural resource injuries to shorelines and wildlife along the coast over the next decade. In return, the public is asked to pay a one-time lump-sum income tax surcharge for setting up the program.

The main survey sample used to elicit the yes or no votes was conducted by Westat, Inc. The data are a random sample of 1,085 interviews conducted with English speaking households where the respondent is 18 years or older and lives in private residence that is either owned or rented. Each respondent is randomly assigned one of five tax amounts: $5, $25, $65, $120, or $220 and a choice of “yes” or “no” is recorded at the assigned amount.

The survey also collects data on five major characteristics of the respondents’ household which are economic, demographic, preferences and attitudes towards the environment, interest in and use of the affected natural resources, evaluations of the expected harm and prevention program, and interpretations of the payment mechanism. The economic and demographic variables include log-income of the household, whether household pays state taxes in 1994, and resides in the central coast primary sampling unit. Preference towards environment include indicator variables reflecting the importance of preventing oil spills in coastal areas, spending to protect wildlife, and whether respondents consider themselves to be environmentalists or environmental activists. The attitudinal characteristics relate to respondents’ attitudes towards government programs such as whether people consider spending to be important and whether respondents consider taxes to be the appropriate payment method for protecting the environment. Variables reflecting interest and use of the affected natural resources are: driving along the central coast on highway 1 and familiarity with at least one of the five species of birds most often harmed by past oil spills. Variables measuring respondents’ evaluations of the expected harm and prevention program include identifying people who think oil spills over the next decade would cause more harm than mentioned in survey, who think oil spills would cause less harm, those who believe in the program’s effectiveness in achieving the desired goal, and those who have concerns regarding program’s effectiveness. The final set of variables relate to respondents’ interpretation of the payment mechanism, such as whether respondents believe the tax to not be limited to one year and those who protested that either the oil companies should pay for the program or those who thought that the companies would pass the costs to consumers in the form of higher gas and oil prices.

Table 4 reports the proportion of respondents randomly assigned to the different bid amounts where we see approximately the same number of people at each bid value. Finally, Table 5 provides estimates of the PO means using linear and nonlinear RA estimators. These control for the respondent characteristics described above. We then use the PO mean estimates to obtain a lower bound on mean WTP.

In terms of the standard errors, we see a ranking among the linear RA estimators. Standard errors for the linear SRA estimator are between 0.3%-3% smaller than those for linear PRA estimates. Except in one case, nonlinear SRA standard errors are between 0.5%-1.5% smaller than for the linear SRA ones. Finally, nonlinear PRA standard errors are between 0%-5% smaller than those for the nonlinear SRA estimates. In this application, using pooled logistic regression produces the most efficiency gains over the usual SM estimator, also known as ABERS estimator introduced by Ayer et al. (1955) in contingent valuation studies–see the online appendix for reference.

Tax Total respondents %
$5 219 20.2
$25 216 19.9
$65 241 22.2
$120 181 16.7
$220 228 21.0
Total 1,085 100
Table 4: Proportion of respondents randomly assigned to the different bid amounts
PO means
Linear Nonlinear
Bids SM PRA SRA PRA SRA
$5 0.6894 0.6841 0.6829 0.6841 0.6844
(0.0313) (0.0289) (0.0284) (0.0288) (0.0288)
$25 0.5694 0.5874 0.5899 0.5893 0.5899
(0.0338) (0.0311) (0.0310) (0.0299) (0.0308)
$65 0.4855 0.484 0.4815 0.4862 0.4862
(0.0323) (0.0299) (0.0297) (0.0286) (0.0293)
$120 0.4033 0.3806 0.3828 0.3804 0.3813
(0.0365) (0.0338) (0.0329) (0.0328) (0.0334)
$220 0.2895 0.2972 0.2921 0.294 0.2925
(0.0301) (0.0290) (0.0290) (0.0285) (0.0286)
WTP 85.3852 85.1792 84.74 84.9763 84.8894
(3.9051) (3.8356) (3.8141) (3.6102) (3.7947)
Observations 1,085 1,085 1,085 1,085 1,085
  • a

    Robust standard errors are in parentheses. For the WTP estimate, these are estimated using the lincom command in Stata (see online appendix for details).

  • b

    SM refers to subsample means estimator. This simply averages the ‘yes’ responses to estimate the PO mean at a particular bid amount. PRA refers to the estimator that uses pooled regression adjustment for estimating mean bid amounts, and SRA refers to separate regression adjustment.

  • c

    The lower bound mean WTP estimate is calculated using g=1G(bgbg1)μ^g\sum_{g=1}^{G}(b_{g}-b_{g-1})\hat{\mu}_{g} where ABERS estimate uses μ^g=Y¯g\hat{\mu}_{g}=\bar{Y}_{g}, and the regression adjustment estimators use μ^g=μ^PRA\hat{\mu}_{g}=\hat{\mu}_{PRA}, μ^g=μ^SRA\hat{\mu}_{g}=\hat{\mu}_{SRA}, μ^g=μ^PRA-logit\hat{\mu}_{g}=\hat{\mu}_{\text{PRA-logit}}, and μ^g=μ^SRA-logit\hat{\mu}_{g}=\hat{\mu}_{\text{SRA-logit}}.

Table 5: Lower bound mean willingness to pay estimate using ABERS and regression adjustment estimators

9 Conclusion

Building on the binary treatment case in NW (2021), we study efficiency improvements with RA when there are more than two treatment levels. In particular, we consider the case of random assignment of GG treatment levels. We show that jointly estimating the vector of potential outcome means using linear SRA, which allows for separate slopes for the different assignment levels, is asymptotically never worse than just using subsample averages; this result improves on the earlier work even when G=2G=2. One case when there is no gain in asymptotic efficiency from using SRA is when the slopes are all zero. In other words, when the covariates are not predictive of the potential outcomes, then using separate slopes does not produce more precise estimates than subsample averages. We also show that SRA is generally more efficient compared to PRA, unless the slopes in true linear projections are homogeneous. In this case, using SRA to estimate the vector of PO means is harmless.

In addition, we also extend the nonlinear RA results in NW (2021) to multiple treatment levels. We show that nonlinear RA of the QML variety is consistent if one chooses the conditional mean and objective functions appropriately from the linear exponential family of quasi-likelihoods. Furthermore, we also characterize the semiparametric efficiency bound for estimating the vector of PO means and show that SRA attains this bound, when the conditional means are correctly specified.

In simulations, we find that RA can provide substantial improvements over subsample means. Naturally, the magnitude of this precision gain generally depends on how strongly covariates predict the potential outcomes. As an illustration, we apply the different RA estimators to estimate the lower bound mean willingness to pay for a contingent valuation study which randomized different bid amounts to measure the cost of oil spills along California’s coast. We find that the lower bound is estimated more efficiently when we use SRA rather than the commonly used ABERS estimator, which uses subsample averages for the PO means.

References

  • Abadie et al. (2020) Abadie, A., S. Athey, G. W. Imbens, and J. M. Wooldridge (2020): “Sampling-Based versus Design-Based Uncertainty in Regression Analysis,” Econometrica, 88, 265–296.
  • Abadie et al. (2023) ——— (2023): “When should you adjust standard errors for clustering?” The Quarterly Journal of Economics, 138, 1–35.
  • Angrist et al. (2006) Angrist, J., E. Bettinger, and M. Kremer (2006): “Long-term Educational Consequences of Secondary School Vouchers: Evidence from Administrative Records in Colombia,” American Economic Review, 96, 847–862.
  • Ayer et al. (1955) Ayer, M., H. D. Brunk, G. M. Ewing, W. T. Reid, and E. Silverman (1955): “An Empirical Distribution Function for Sampling with Incomplete Information,” The Annals of Mathematical Statistics, 641–647.
  • Bishop et al. (2017) Bishop, R. C., K. J. Boyle, R. T. Carson, D. Chapman, W. M. Hanemann, B. Kanninen, R. J. Kopp, J. A. Krosnick, J. List, N. Meade, et al. (2017): “Putting a value on injuries to natural assets: The BP oil spill,” Science, 356, 253–254.
  • Boschini et al. (2018) Boschini, A., A. Dreber, E. von Essen, A. Muren, and E. Ranehill (2018): “Gender and altruism in a random sample,” Journal of behavioral and experimental economics, 77, 72–77.
  • Calónico and Smith (2017) Calónico, S. and J. Smith (2017): “The Women of the National Supported Work Demonstration,” Journal of Labor Economics, 35, S65–S97.
  • Carson et al. (2004) Carson, R. T., M. B. Conaway, W. M. Hanemann, J. A. Krosnick, R. C. Mitchell, and S. Presser (2004): “Valuing Oil Spill Prevention,” .
  • Cattaneo (2010) Cattaneo, M. D. (2010): “Efficient semiparametric estimation of multi-valued treatment effects under ignorability,” Journal of Econometrics, 155, 138–154.
  • Chang (2023) Chang, H. (2023): “Design-based Estimation Theory for Complex Experiments,” arXiv preprint arXiv:2311.06891.
  • Chang et al. (2024) Chang, H., J. Middleton, and P. Aronow (2024): “Exact Bias Correction for Linear Adjustment of Randomized Controlled Trials,” Forthcoming, Econometrica.
  • Chiang et al. (2023) Chiang, H. D., Y. Matsushita, and T. Otsu (2023): “Regression adjustment in randomized controlled trials with many covariates,” arXiv preprint arXiv:2302.00469.
  • Cohen and Fogarty (2024) Cohen, P. L. and C. B. Fogarty (2024): “No-harm calibration for generalized Oaxaca–Blinder estimators,” Biometrika, 111, 331–338.
  • Cytrynbaum (2023) Cytrynbaum, M. (2023): “Covariate adjustment in stratified experiments,” .
  • Freedman (2008a) Freedman, D. A. (2008a): “On regression adjustments to experimental data,” Advances in Applied Mathematics, 40, 180–193.
  • Freedman (2008b) ——— (2008b): “On regression adjustments in experiments with several treatments,” The Annals of Applied Statistics, 176–196.
  • Gail et al. (1984) Gail, M. H., S. Wieand, and S. Piantadosi (1984): “Biased estimates of treatment effect in randomized experiments with nonlinear regressions and omitted covariates,” Biometrika, 71, 431–444.
  • Goldin et al. (2021) Goldin, J., I. Z. Lurie, and J. McCubbin (2021): “Health insurance and mortality: Experimental evidence from taxpayer outreach,” The Quarterly Journal of Economics, 136, 1–49.
  • Guo and Basse (2023) Guo, K. and G. Basse (2023): “The generalized oaxaca-blinder estimator,” Journal of the American Statistical Association, 118, 524–536.
  • Hirano and Imbens (2001) Hirano, K. and G. W. Imbens (2001): “Estimation of Causal Effects using Propensity Score Weighting: An Application to Data on Right Heart Catheterization,” Health Services and Outcomes Research Methodology, 2, 259–278.
  • Imbens and Rubin (2015) Imbens, G. W. and D. B. Rubin (2015): Causal inference in statistics, social, and biomedical sciences, Cambridge University Press.
  • Jiang et al. (2023) Jiang, L., P. C. Phillips, Y. Tao, and Y. Zhang (2023): “Regression-adjusted estimation of quantile treatment effects under covariate-adaptive randomizations,” Journal of Econometrics, 234, 758–776.
  • Lei and Ding (2021) Lei, L. and P. Ding (2021): “Regression adjustment in completely randomized experiments with a diverging number of covariates,” Biometrika, 108, 815–828.
  • Leon et al. (2003) Leon, S., A. A. Tsiatis, and M. Davidian (2003): “Semiparametric estimation of treatment effect in a pretest-posttest study,” Biometrics, 59, 1046–1055.
  • Lewbel (2000) Lewbel, A. (2000): “Semiparametric Qualitative Response Model Estimation with Unknown Heteroscedasticity or Instrumental Variables,” Journal of Econometrics, 97, 145–177.
  • Lin (2013) Lin, W. (2013): “Agnostic notes on regression adjustments to experimental data: Reexamining Freedman’s critique,” The Annals of Applied Statistics, 7, 295–318.
  • Negi and Wooldridge (2021) Negi, A. and J. M. Wooldridge (2021): “Revisiting regression adjustment in experiments with heterogeneous treatment effects,” Econometric Reviews, 40, 504–534.
  • Pashley and Bind (2023) Pashley, N. E. and M.-A. C. Bind (2023): “Causal inference for multiple treatments using fractional factorial designs,” Canadian Journal of Statistics, 51, 444–468.
  • Pitkin et al. (2013) Pitkin, E., R. Berk, L. Brown, A. Buja, E. George, K. Zhang, and L. Zhao (2013): “Improved precision in estimating average treatment effects,” arXiv preprint arXiv:1311.0291.
  • Robins et al. (1994) Robins, J. M., A. Rotnitzky, and L. P. Zhao (1994): “Estimation of regression coefficients when some regressors are not always observed,” Journal of the American statistical Association, 89, 846–866.
  • Rosenblum and Van Der Laan (2010) Rosenblum, M. and M. J. Van Der Laan (2010): “Simple, efficient estimators of treatment effects in randomized trials using generalized linear models to leverage baseline variables,” The international journal of biostatistics, 6.
  • Roth and Sant’Anna (2023) Roth, J. and P. H. Sant’Anna (2023): “Efficient estimation for staggered rollout designs,” Journal of Political Economy Microeconomics, 1, 669–709.
  • Tsiatis et al. (2008) Tsiatis, A. A., M. Davidian, M. Zhang, and X. Lu (2008): “Covariate adjustment for two-sample treatment comparisons in randomized clinical trials: a principled yet flexible approach,” Statistics in medicine, 27, 4658–4677.
  • Watanabe (2010) Watanabe, M. (2010): “Nonparametric Estimation of Mean Willingness to Pay from Discrete Response Valuation Data,” American Journal of Agricultural Economics, 92, 1114–1135.
  • Wooldridge (2010) Wooldridge, J. M. (2010): Econometric Analysis of Cross Section and Panel Data, MIT press.
  • Yang and Tsiatis (2001) Yang, L. and A. A. Tsiatis (2001): “Efficiency study of estimators for a treatment effect in a pretest–posttest trial,” The American Statistician, 55, 314–321.
  • Zhang et al. (2008) Zhang, M., A. A. Tsiatis, and M. Davidian (2008): “Improving efficiency of inferences in randomized clinical trials using auxiliary covariates,” Biometrics, 64, 707–715.
  • Zhao and Ding (2022a) Zhao, A. and P. Ding (2022a): “Reconciling design-based and model-based causal inferences for split-plot experiments,” The Annals of Statistics, 50, 1170–1192.
  • Zhao and Ding (2022b) ——— (2022b): “Regression-based causal inference with factorial experiments: estimands, model specifications and design-based properties,” Biometrika, 109, 799–815.
  • Zhao and Ding (2023) ——— (2023): “Covariate adjustment in multiarmed, possibly factorial experiments,” Journal of the Royal Statistical Society Series B: Statistical Methodology, 85, 1–23.

Appendix A Proofs

Theorem 1.

From (6), (7), 𝔼(𝐋i𝐐i)=𝟎\mathbb{E}\left(\mathbf{L}_{i}\mathbf{Q}_{i}^{\prime}\right)=\mathbf{0}, and 𝔼(𝐊i𝐐i)=𝟎\mathbb{E}\left(\mathbf{K}_{i}\mathbf{Q}_{i}^{\prime}\right)=\mathbf{0}, it follows that Avar[N(𝝁^SM𝝁)]=𝛀𝐋+𝛀𝐐\mathrm{Avar}\left[\sqrt{N}\left(\bm{\hat{\mu}}_{SM}-\bm{\mu}\right)\right]=\bm{\Omega}_{\mathbf{L}}+\bm{\Omega}_{\mathbf{Q}} and Avar[N(𝝁^SRA𝝁)]=𝛀𝐊+𝛀𝐐\mathrm{Avar}\left[\sqrt{N}\left(\bm{\hat{\mu}}_{SRA}-\bm{\mu}\right)\right]=\bm{\Omega}_{\mathbf{K}}+\bm{\Omega}_{\mathbf{Q}} where 𝛀𝐋=𝔼(𝐋i𝐋i)\bm{\Omega}_{\mathbf{L}}=\mathbb{E}\left(\mathbf{L}_{i}\mathbf{L}_{i}^{\prime}\right) and so on. Therefore, to show that Avar[N(𝝁^SRA𝝁)]\mathrm{Avar}\left[\sqrt{N}\left(\bm{\hat{\mu}}_{SRA}-\bm{\mu}\right)\right] is smaller (in the matrix sense), we must show 𝛀𝐋𝛀𝐊\mathbf{\Omega}_{\mathbf{L}}-\mathbf{\Omega}_{\mathbf{K}} is PSD. The elements of 𝐋i\mathbf{L}_{i} are uncorrelated because WigWih=0W_{ig}W_{ih}=0 for ghg\neq h. The variance of the gthg^{th} element is

𝔼[(Wig𝐗˙i𝜷𝒈/ρg)2]=𝔼(Wig)ρg2𝔼[(𝐗˙i𝜷𝒈)2]=ρg1𝔼[(𝐗˙i𝜷𝒈)2]=ρg1𝜷𝒈𝛀𝐗𝜷𝒈.\mathbb{E}\left[\left(W_{ig}\mathbf{\dot{X}}_{i}\bm{\beta_{g}}/\rho_{g}\right)^{2}\right]=\mathbb{E}\left(W_{ig}\right)\rho_{g}^{-2}\mathbb{E}\left[\left(\mathbf{\dot{X}}_{i}\bm{\beta_{g}}\right)^{2}\right]=\rho_{g}^{-1}\mathbb{E}\left[\left(\mathbf{\dot{X}}_{i}\bm{\beta_{g}}\right)^{2}\right]=\rho_{g}^{-1}\bm{\beta_{g}}^{\prime}\bm{\Omega}_{\mathbf{X}}\bm{\beta_{g}}.

Therefore,

𝔼(𝐋i𝐋i)=(𝜷𝟏𝛀𝐗𝜷𝟏ρ1000𝜷𝟐𝛀𝐗𝜷𝟐ρ2000𝜷𝑮𝛀𝐗𝜷𝑮ρG)𝐁=𝐁[𝐑𝛀𝐗]𝐁\displaystyle\mathbb{E}\left(\mathbf{L}_{i}\mathbf{L}_{i}^{\prime}\right)=\begin{pmatrix}\frac{\bm{\beta_{1}}^{\prime}\bm{\Omega}_{\mathbf{X}}\bm{\beta_{1}}}{\rho_{1}}&0&\cdots&0\\ 0&\frac{\bm{\beta_{2}}^{\prime}\bm{\Omega}_{\mathbf{X}}\bm{\beta_{2}}}{\rho_{2}}&\ddots&\vdots\\ \vdots&\ddots&\ddots&0\\ 0&\cdots&0&\frac{\bm{\beta_{G}}^{\prime}\bm{\Omega}_{\mathbf{X}}\bm{\beta_{G}}}{\rho_{G}}\end{pmatrix}\mathbf{B}=\mathbf{B}^{\prime}\left[\mathbf{R}\otimes\mathbf{\Omega}_{\mathbf{X}}\right]\mathbf{B}

where

𝐁=(𝜷𝟏000𝜷𝟐000𝜷𝑮) and 𝐑=(ρ11000ρ21000ρG1)\mathbf{B}=\begin{pmatrix}\bm{\beta_{1}}&0&\cdots&0\\ 0&\bm{\beta_{2}}&\ddots&\vdots\\ \vdots&\ddots&\ddots&0\\ 0&\cdots&0&\bm{\beta_{G}}\end{pmatrix}\text{ and }\mathbf{R}=\begin{pmatrix}\rho_{1}^{-1}&0&\cdots&0\\ 0&\rho_{2}^{-1}&\ddots&\vdots\\ \vdots&\ddots&\ddots&0\\ 0&\cdots&0&\rho_{G}^{-1}\end{pmatrix}

For the variance matrix of 𝐊i\mathbf{K}_{i}, 𝕍(𝐗˙i𝜷𝒈)=𝜷𝒈𝛀𝐗𝜷𝒈\mathbb{V}\left(\mathbf{\dot{X}}_{i}\bm{\beta_{g}}\right)=\bm{\beta_{g}}^{\prime}\bm{\Omega}_{\mathbf{X}}\bm{\beta_{g}} and (𝐗˙i𝜷g,𝐗˙i𝜷𝒉)=𝜷𝒈𝛀𝐗𝜷𝒉\mathbb{C}(\mathbf{\dot{X}}_{i}\bm{\beta}_{g},\mathbf{\dot{X}}_{i}\bm{\beta_{h}})=\bm{\beta_{g}}^{\prime}\bm{\Omega}_{\mathbf{X}}\bm{\beta_{h}}. Therefore, one may write

𝔼(𝐊i𝐊i)=𝐁[(𝐣G𝐣G)𝛀𝐗]𝐁\mathbb{E}\left(\mathbf{K}_{i}\mathbf{K}_{i}^{\prime}\right)=\mathbf{B}^{\prime}\left[\left(\mathbf{j}_{G}\mathbf{j}_{G}^{\prime}\right)\otimes\mathbf{\Omega}_{\mathbf{X}}\right]\mathbf{B}

where 𝐣G=(1,1,,1)\mathbf{j}_{G}^{\prime}=(1,1,...,1). Therefore, the comparison we need to make is 𝐑𝛀𝐗 versus (𝐣G𝐣G)𝛀𝐗\mathbf{R}\otimes\mathbf{\Omega}_{\mathbf{X}}\text{ versus }\left(\mathbf{j}_{G}\mathbf{j}_{G}^{\prime}\right)\otimes\mathbf{\Omega}_{\mathbf{X}}. That is, we need to show [𝐑(𝐣G𝐣G)]𝛀𝐗\left[\mathbf{R}-\left(\mathbf{j}_{G}\mathbf{j}_{G}^{\prime}\right)\right]\otimes\mathbf{\Omega}_{\mathbf{X}} is PSD. The Kronecker product of two PSD matrices is also PSD, so it suffices to show 𝐑(𝐣G𝐣G)\mathbf{R}-\left(\mathbf{j}_{G}\mathbf{j}_{G}^{\prime}\right) is PSD when the ρg\rho_{g} add to unity. Let 𝐚\mathbf{a} be any G×1G\times 1 vector. Then 𝐚𝐑𝐚=g=1Gag2/ρg𝐚(𝐣G𝐣G)𝐚=(𝐚𝐣G)2=(g=1Gag)2\mathbf{a}^{\prime}\mathbf{R}\mathbf{a}=\sum_{g=1}^{G}a_{g}^{2}/\rho_{g}\implies\mathbf{a}^{\prime}\left(\mathbf{j}_{G}\mathbf{j}_{G}^{\prime}\right)\mathbf{a}=\left(\mathbf{a}^{\prime}\mathbf{j}_{G}\right)^{2}=\left(\sum_{g=1}^{G}a_{g}\right)^{2}. So we have to show that g=1Gag2/ρg(g=1Gag)2\sum_{g=1}^{G}a_{g}^{2}/\rho_{g}\geq\left(\sum_{g=1}^{G}a_{g}\right)^{2}. Define vectors 𝐛=(a1/ρ1,a2/ρ2,,aG/ρG)\mathbf{b}=\left(a_{1}/\sqrt{\rho_{1}},a_{2}/\sqrt{\rho_{2}},...,a_{G}/\sqrt{\rho_{G}}\right)^{\prime} and 𝐜=(ρ1,ρ2,,ρG)\mathbf{c}=\left(\sqrt{\rho_{1}},\sqrt{\rho_{2}},...,\sqrt{\rho_{G}}\right)^{\prime} and apply the Cauchy-Schwarz inequality:

(g=1Gag)2=(𝐛𝐜)2(𝐛𝐛)(𝐜𝐜)=(g=1Gag2/ρg)(g=1Gρg)=(g=1Gag2/ρg)\left(\sum_{g=1}^{G}a_{g}\right)^{2}=\left(\mathbf{b}^{\prime}\mathbf{c}\right)^{2}\leq\left(\mathbf{b}^{\prime}\mathbf{b}\right)\left(\mathbf{c}^{\prime}\mathbf{c}\right)=\left(\sum_{g=1}^{G}a_{g}^{2}/\rho_{g}\right)\left(\sum_{g=1}^{G}\rho_{g}\right)=\left(\sum_{g=1}^{G}a_{g}^{2}/\rho_{g}\right)

because g=1Gρg=1\sum_{g=1}^{G}\rho_{g}=1. ∎

Theorem 2.

From (7), (8), 𝔼(𝐅i𝐊i)=𝔼(𝐊i𝐐i)=𝔼(𝐅i𝐐i)=𝟎\mathbb{E}(\mathbf{F}_{i}\mathbf{K}_{i}^{\prime})=\mathbb{E}(\mathbf{K}_{i}\mathbf{Q}_{i}^{\prime})=\mathbb{E}(\mathbf{F}_{i}\mathbf{Q}_{i}^{\prime})=\mathbf{0}, it follows that Avar[N(𝝁^SRA𝝁)]=𝛀𝐊+𝛀𝐐\mathrm{Avar}\left[\sqrt{N}\left(\bm{\hat{\mu}}_{SRA}-\bm{\mu}\right)\right]=\bm{\Omega}_{\mathbf{K}}+\bm{\Omega}_{\mathbf{Q}} and Avar[N(𝝁^PRA𝝁)]=𝛀𝐅+𝛀𝐊+𝛀𝐐\mathrm{Avar}\left[\sqrt{N}\left(\bm{\hat{\mu}}_{PRA}-\bm{\mu}\right)\right]=\bm{\Omega}_{\mathbf{F}}+\bm{\Omega}_{\mathbf{K}}+\bm{\Omega}_{\mathbf{Q}} where 𝛀𝐊=𝔼(𝐊i𝐊i)\bm{\Omega}_{\mathbf{K}}=\mathbb{E}\left(\mathbf{K}_{i}\mathbf{K}_{i}^{\prime}\right) and so on. Therefore, comparing the two variance expressions, we obtain the result. ∎

Theorem 3.

Let m(αg+𝐗𝜷𝒈)m(\alpha_{g}+\mathbf{X}\bm{\beta_{g}}) be the mean function associated with the canonical link function from a linear exponential family for g=1,2,,Gg=1,2,\ldots,G. Then the QMLE estimators, α^g\hat{\alpha}_{g} and 𝜷^𝒈\bm{\hat{\beta}_{g}} satisfy the first order conditions i=1NWig[Yim(α^g+𝐗i𝜷^𝒈)]=0\sum_{i=1}^{N}W_{ig}\cdot[Y_{i}-m(\hat{\alpha}_{g}+\mathbf{X}_{i}\bm{\hat{\beta}_{g}})]=0 and i=1NWig𝐗i[Yim(α^g+𝐗i𝜷^𝒈)]=𝟎\sum_{i=1}^{N}W_{ig}\cdot\mathbf{X}_{i}^{\prime}\cdot[Y_{i}-m(\hat{\alpha}_{g}+\mathbf{X}_{i}\bm{\hat{\beta}_{g}})]=\mathbf{0}. Then, the SRA estimators of μg\mu_{g} are given as μ^g=1Ni=1Nm(α^g+𝐗i𝜷^𝒈)\hat{\mu}_{g}=\frac{1}{N}\sum_{i=1}^{N}m(\hat{\alpha}_{g}+\mathbf{X}_{i}\bm{\hat{\beta}_{g}}). For notational simplicity, let 𝐗˘(1,𝐗)\mathbf{\breve{X}}\equiv\left(1,\mathbf{X}\right), and 𝜹g(αg,𝜷𝒈)\bm{\delta}_{g}\equiv\left(\alpha_{g},\bm{\beta_{g}}\right)^{\prime}. Now,

N(μ^gμg)=1Ni=1Nm˙ig+𝐌gN(𝜹^g𝜹g)+op(1)\begin{split}\sqrt{N}(\hat{\mu}_{g}-\mu_{g})&=\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\dot{m}_{ig}+\mathbf{M}^{\ast}_{g}\sqrt{N}(\bm{\hat{\delta}}_{g}-\bm{\delta}^{\ast}_{g})+o_{p}(1)\end{split}

where mig=m(𝐗˘i𝜹g)m_{ig}^{\ast}=m(\mathbf{\breve{X}}_{i}\bm{\delta}^{\ast}_{g}), m˙ig=migμg\dot{m}_{ig}=m_{ig}^{\ast}-\mu_{g}, and 𝐌g=𝔼[δgm(𝐗˘i𝜹g)]\mathbf{M}^{\ast}_{g}=\mathbb{E}\left[\nabla_{\delta_{g}}m(\mathbf{\breve{X}}_{i}\bm{\delta}^{\ast}_{g})\right]. Because of estimation via QMLE using linear exponential family with canonical mean m()m(\cdot), 𝜹^g\bm{\hat{\delta}}_{g} will satisfy FOC: 1Ni=1NWig𝐗˘i[Yi(g)m(𝐗˘i𝜹^g)]=𝟎\frac{1}{N}\sum_{i=1}^{N}W_{ig}\mathbf{\breve{X}}_{i}^{\prime}\left[Y_{i}(g)-m(\mathbf{\breve{X}}_{i}\bm{\hat{\delta}}_{g})\right]=\mathbf{0} which implies that

N(𝜹^g𝜹g)=(𝔼[Wig𝐗˘iδgm(𝐗˘i𝜹g)])1N1/2i=1NWig𝐗˘iUi(g)+op(1)\sqrt{N}\left(\bm{\hat{\delta}}_{g}-\bm{\delta}^{\ast}_{g}\right)=\Big{(}\mathbb{E}\left[W_{ig}\mathbf{\breve{X}}^{\prime}_{i}\nabla_{\delta_{g}}m(\mathbf{\breve{X}}_{i}\bm{\delta}^{\ast}_{g})\right]\Big{)}^{-1}N^{-1/2}\sum_{i=1}^{N}W_{ig}\mathbf{\breve{X}}^{\prime}_{i}U_{i}(g)+o_{p}(1)

where Ui(g)Yi(g)m(𝐗˘i𝜹g)U_{i}(g)\equiv Y_{i}(g)-m(\mathbf{\breve{X}}_{i}\bm{\delta}_{g}^{\ast}). Because of random assignment, we know that WigW_{ig} is independent of 𝐗˘i\mathbf{\breve{X}}_{i} and therefore we can write 𝔼[Wig𝐗˘iδgm(𝐗˘i𝜹g)]=ρg𝐂g\mathbb{E}\left[W_{ig}\mathbf{\breve{X}}^{\prime}_{i}\nabla_{\delta_{g}}m(\mathbf{\breve{X}}_{i}\bm{\delta}^{\ast}_{g})\right]=\rho_{g}\cdot\mathbf{C}_{g}^{\ast} where ρgP(Wig=1)\rho_{g}\equiv P(W_{ig}=1) and 𝐂g𝔼[𝐗˘iδgm(𝐗˘i𝜹g)]\mathbf{C}_{g}^{\ast}\equiv\mathbb{E}\left[\mathbf{\breve{X}}^{\prime}_{i}\nabla_{\delta_{g}}m(\mathbf{\breve{X}}_{i}\bm{\delta}^{\ast}_{g})\right]. Substituting the influence function (IF) for 𝜹^g\bm{\hat{\delta}}_{g} into the IF for μ^g\hat{\mu}_{g},

N(μ^gμg)=N1/2i=1Nm˙ig+ρg1𝐌g𝐂g1[N1/2i=1NWig𝐗˘iUi(g)]+op(1)\sqrt{N}(\hat{\mu}_{g}-\mu_{g})=N^{-1/2}\sum_{i=1}^{N}\dot{m}_{ig}+\rho_{g}^{-1}\mathbf{M}^{\ast}_{g}\mathbf{C}^{\ast-1}_{g}\Bigg{[}N^{-1/2}\sum_{i=1}^{N}W_{ig}\mathbf{\breve{X}}^{\prime}_{i}U_{i}(g)\Bigg{]}+o_{p}(1)

Hence, stacking all such IFs for g=1,2,,Gg=1,2,\ldots,G, we obtain

N(𝝁^SRA𝝁)=N1/2i=1N(𝐇i+𝐉i)+op(1)\sqrt{N}\left(\bm{\hat{\mu}}_{SRA}-\bm{\mu}\right)=N^{-1/2}\sum_{i=1}^{N}\left(\mathbf{H}_{i}+\mathbf{J}_{i}\right)+o_{p}(1) (A.1)

where

𝐇i=(m˙i1m˙i2m˙iG) and 𝐉i=(ρ11𝐌1𝐂11Wi1𝐗˘iUi(1)ρ21𝐌2𝐂21Wi2𝐗˘iUi(2)ρG1𝐌G𝐂G1WiG𝐗˘iUi(G))\mathbf{H}_{i}=\begin{pmatrix}\dot{m}_{i1}\\ \dot{m}_{i2}\\ \vdots\\ \dot{m}_{iG}\end{pmatrix}\text{ and }\mathbf{J}_{i}=\begin{pmatrix}\rho_{1}^{-1}\mathbf{M}^{\ast}_{1}\mathbf{C}^{\ast-1}_{1}W_{i1}\mathbf{\breve{X}}^{\prime}_{i}U_{i}(1)\\ \rho_{2}^{-1}\mathbf{M}^{\ast}_{2}\mathbf{C}^{\ast-1}_{2}W_{i2}\mathbf{\breve{X}}^{\prime}_{i}U_{i}(2)\\ \vdots\\ \rho_{G}^{-1}\mathbf{M}^{\ast}_{G}\mathbf{C}^{\ast-1}_{G}W_{iG}\mathbf{\breve{X}}^{\prime}_{i}U_{i}(G)\end{pmatrix}

Now, N(Y¯gμg)=N1/2i=1NWigρg m˙ig+ρg1𝐌g𝐂g1N1/2i=1NWig𝐗˘iUi(g)+op(1)\sqrt{N}\left(\bar{Y}_{g}-\mu_{g}\right)=N^{-1/2}\sum_{i=1}^{N}\frac{W_{ig}}{\rho_{g}}\text{ }\dot{m}_{ig}+\rho_{g}^{-1}\mathbf{M}^{\ast}_{g}\mathbf{C}^{\ast-1}_{g}N^{-1/2}\sum_{i=1}^{N}W_{ig}\mathbf{\breve{X}}_{i}^{\prime}U_{i}(g)+o_{p}(1). Similarly, stacking all such IFs for all g=1,2,,Gg=1,2,\ldots,G, we get

N(𝐘¯𝝁)=N1/2i=1N(𝐏i+𝐉i)+op(1)\sqrt{N}\left(\mathbf{\bar{Y}}-\bm{\mu}\right)=N^{-1/2}\sum_{i=1}^{N}\left(\mathbf{P}_{i}+\mathbf{J}_{i}\right)+o_{p}(1) (A.2)

where 𝐏i=(Wi1ρ1 m˙i1Wi2ρ2 m˙i2WiGρG m˙iG)\mathbf{P}_{i}=\begin{pmatrix}\frac{W_{i1}}{\rho_{1}}\text{ }\dot{m}_{i1}&\frac{W_{i2}}{\rho_{2}}\text{ }\dot{m}_{i2}&\ldots&\frac{W_{iG}}{\rho_{G}}\text{ }\dot{m}_{iG}\end{pmatrix}^{\prime}. Consider equations (A.1) and (A.2) in order to compare the two estimators. We see that both equations have the same vector component, 𝐉i\mathbf{J}_{i}. Note that if the conditional mean function is correctly specified, then, by random assignment, 𝔼[Ui(g)|𝐗˘i,𝐖i]=0\mathbb{E}\left[U_{i}(g)|\mathbf{\breve{X}}_{i},\mathbf{W}_{i}\right]=0. This implies, 𝔼(𝐉i𝐇i)=𝔼(𝐉i𝐏i)=𝟎\mathbb{E}\left(\mathbf{J}_{i}\mathbf{H}_{i}^{\prime}\right)=\mathbb{E}\left(\mathbf{J}_{i}\mathbf{P}_{i}^{\prime}\right)=\mathbf{0} where,

𝔼(𝐉i𝐏i)=(ρ11𝐌1𝐂11𝔼[𝐗˘iUi(1)m˙i1]000ρ21𝐌2𝐂21𝔼[𝐗˘iUi(2)m˙i2]000ρG1𝐌G𝐂G1𝔼[𝐗˘iUi(G)m˙iG])\begin{split}\mathbb{E}\left(\mathbf{J}_{i}\mathbf{P}_{i}^{\prime}\right)=\begin{pmatrix}\rho_{1}^{-1}\mathbf{M}_{1}^{\ast}\mathbf{C}_{1}^{\ast-1}\mathbb{E}\left[\mathbf{\breve{X}}_{i}^{\prime}U_{i}(1)\dot{m}_{i1}\right]&0&\ldots&0\\ 0&\rho_{2}^{-1}\mathbf{M}_{2}^{\ast}\mathbf{C}_{2}^{\ast-1}\mathbb{E}\left[\mathbf{\breve{X}}_{i}^{\prime}U_{i}(2)\dot{m}_{i2}\right]&\ldots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\ldots&\rho_{G}^{-1}\mathbf{M}_{G}^{\ast}\mathbf{C}_{G}^{\ast-1}\mathbb{E}\left[\mathbf{\breve{X}}_{i}^{\prime}U_{i}(G)\dot{m}_{iG}\right]\end{pmatrix}\end{split}

and,

𝔼(𝐉i𝐇i)=(𝐌1𝐂11𝔼[𝐗˘iUi(1)m˙i1]𝐌1𝐂11𝔼[𝐗˘iUi(1)m˙i2]𝐌1𝐂11𝔼[𝐗˘iUi(1)m˙iG]𝐌2𝐂21𝔼[𝐗˘iUi(2)m˙i1]𝐌2𝐂21𝔼[𝐗˘iUi(2)m˙i2]𝐌2𝐂21𝔼[𝐗˘iUi(2)m˙iG]𝐌G𝐂G1𝔼[𝐗˘iUi(G)m˙i1]𝐌G𝐂G1𝔼[𝐗˘iUi(G)m˙i2]𝐌G𝐂G1𝔼[𝐗˘iUi(G)m˙iG])\begin{split}\mathbb{E}\left(\mathbf{J}_{i}\mathbf{H}_{i}^{\prime}\right)=\begin{pmatrix}\mathbf{M}_{1}^{\ast}\mathbf{C}_{1}^{\ast-1}\mathbb{E}\left[\mathbf{\breve{X}}_{i}^{\prime}U_{i}(1)\dot{m}_{i1}\right]&\mathbf{M}_{1}^{\ast}\mathbf{C}_{1}^{\ast-1}\mathbb{E}\left[\mathbf{\breve{X}}_{i}^{\prime}U_{i}(1)\dot{m}_{i2}\right]&\ldots&\mathbf{M}_{1}^{\ast}\mathbf{C}_{1}^{\ast-1}\mathbb{E}\left[\mathbf{\breve{X}}_{i}^{\prime}U_{i}(1)\dot{m}_{iG}\right]\\ \mathbf{M}_{2}^{\ast}\mathbf{C}_{2}^{\ast-1}\mathbb{E}\left[\mathbf{\breve{X}}_{i}^{\prime}U_{i}(2)\dot{m}_{i1}\right]&\mathbf{M}_{2}^{\ast}\mathbf{C}_{2}^{\ast-1}\mathbb{E}\left[\mathbf{\breve{X}}_{i}^{\prime}U_{i}(2)\dot{m}_{i2}\right]&\ldots&\mathbf{M}_{2}^{\ast}\mathbf{C}_{2}^{\ast-1}\mathbb{E}\left[\mathbf{\breve{X}}_{i}^{\prime}U_{i}(2)\dot{m}_{iG}\right]\\ \vdots&\vdots&\ddots&\vdots\\ \mathbf{M}_{G}^{\ast}\mathbf{C}_{G}^{\ast-1}\mathbb{E}\left[\mathbf{\breve{X}}_{i}^{\prime}U_{i}(G)\dot{m}_{i1}\right]&\mathbf{M}_{G}^{\ast}\mathbf{C}_{G}^{\ast-1}\mathbb{E}\left[\mathbf{\breve{X}}_{i}^{\prime}U_{i}(G)\dot{m}_{i2}\right]&\ldots&\mathbf{M}_{G}^{\ast}\mathbf{C}_{G}^{\ast-1}\mathbb{E}\left[\mathbf{\breve{X}}_{i}^{\prime}U_{i}(G)\dot{m}_{iG}\right]\end{pmatrix}\end{split}

Therefore, Avar[N(𝐘¯𝝁)]Avar[N(𝝁^SRA𝝁)]=𝔼(𝐏i𝐏i)𝔼(𝐇i𝐇i)\text{Avar}\big{[}\sqrt{N}\left(\mathbf{\bar{Y}}-\bm{\mu}\right)\big{]}-\text{Avar}\big{[}\sqrt{N}\left(\bm{\hat{\mu}}_{SRA}-\bm{\mu}\right)\big{]}=\mathbb{E}\left(\mathbf{P}_{i}\mathbf{P}_{i}^{\prime}\right)-\mathbb{E}\left(\mathbf{H}_{i}\mathbf{H}_{i}^{\prime}\right). Now, we can express

𝔼(𝐏i𝐏i)=𝔼[(m˙i1000m˙i2000m˙iG)𝐑(m˙i1000m˙i2000m˙iG)]\begin{split}\mathbb{E}\left(\mathbf{P}_{i}\mathbf{P}_{i}^{\prime}\right)&=\mathbb{E}\left[\begin{pmatrix}\dot{m}_{i1}&0&\ldots&0\\ 0&\dot{m}_{i2}&\ldots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\ldots&\dot{m}_{iG}\end{pmatrix}\mathbf{R}\begin{pmatrix}\dot{m}_{i1}&0&\ldots&0\\ 0&\dot{m}_{i2}&\ldots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\ldots&\dot{m}_{iG}\end{pmatrix}\right]\\ \end{split}

The off-diagonal terms are all zero since, WigWih=0W_{ig}\cdot W_{ih}=0 for any ghg\neq h. Similarly,

𝔼(𝐇i𝐇i)=𝔼[(m˙i1000m˙i2000m˙iG)𝐣G𝐣G(m˙i1000m˙i2000m˙iG)]\begin{split}\mathbb{E}\left(\mathbf{H}_{i}\mathbf{H}_{i}^{\prime}\right)&=\mathbb{E}\left[\begin{pmatrix}\dot{m}_{i1}&0&\ldots&0\\ 0&\dot{m}_{i2}&\ldots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\ldots&\dot{m}_{iG}\end{pmatrix}\mathbf{j}_{G}\mathbf{j}_{G}^{\prime}\begin{pmatrix}\dot{m}_{i1}&0&\ldots&0\\ 0&\dot{m}_{i2}&\ldots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\ldots&\dot{m}_{iG}\end{pmatrix}\right]\end{split}

where 𝐣G=(1,1,,1)\mathbf{j}_{G}^{\prime}=(1,1,...,1). Therefore, Avar[N(𝐘¯𝝁)]Avar[N(𝝁^SRA𝝁)]=\text{Avar}\big{[}\sqrt{N}\left(\mathbf{\bar{Y}}-\bm{\mu}\right)\big{]}-\text{Avar}\big{[}\sqrt{N}\left(\bm{\hat{\mu}}_{SRA}-\bm{\mu}\right)\big{]}=

=𝔼[(m˙i1000m˙i2000m˙iG){𝐑(𝐣G𝐣G)}(A.3)(m˙i1000m˙i2000m˙iG)]\begin{split}&=\mathbb{E}\left[\begin{pmatrix}\dot{m}_{i1}&0&\ldots&0\\ 0&\dot{m}_{i2}&\ldots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\ldots&\dot{m}_{iG}\end{pmatrix}\underbrace{\left\{\mathbf{R}-\left(\mathbf{j}_{G}\mathbf{j}_{G}^{\prime}\right)\right\}}_{\text{(A.3)}}\begin{pmatrix}\dot{m}_{i1}&0&\ldots&0\\ 0&\dot{m}_{i2}&\ldots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\ldots&\dot{m}_{iG}\end{pmatrix}\right]\end{split}

Here we can use the result that if 𝐓\mathbf{T} is a real symmetric matrix that is PSD, then 𝐒𝐓𝐒\mathbf{S}^{\prime}\mathbf{T}\mathbf{S} is also real, symmetric and PSD for some real matrix 𝐒\mathbf{S}. Moreover, expectation of a random PSD matrix is also PSD. To utilize these results here, all we need to show is that (A.3) is PSD which we have already established in the proof of Theorem 1. Hence, we obtain our result. ∎

Theorem 4.

To show that linear SRA achieves the bound, let us assume that for each gg, Y(g)=μg+𝐗˙𝜷g+U(g),𝔼[U(g)|𝐗]=0.Y(g)=\mu_{g}+\mathbf{\dot{X}}\bm{\beta}_{g}+U(g),\ \mathbb{E}[U(g)|\mathbf{X}]=0. Note that one can rewrite

ψg=Wgρg(Y(g)𝔼[Y(g)|𝐗])+𝔼[Y(g)|𝐗]μg=WgρgU(g)+𝐗˙𝜷g\displaystyle\psi_{g}=\frac{W_{g}}{\rho_{g}}\cdot\left(Y(g)-\mathbb{E}[Y(g)|\mathbf{X}]\right)+\mathbb{E}[Y(g)|\mathbf{X}]-\mu_{g}=\frac{W_{g}}{\rho_{g}}\cdot U(g)+\mathbf{\dot{X}}\bm{\beta}_{g}

Then, looking at the asymptotic representation of the linear SRA estimator, we see that the influence function121212See Lemma 2 of the online appendix for the asymptotic representation of the SRA estimator. of 𝝁^SRA\bm{\hat{\mu}}_{SRA} is given by 𝝍i\bm{\psi}_{i} such that for each gg, ψig=WigUi(g)/ρg+𝐗˙i𝜷g\psi_{ig}=W_{ig}U_{i}(g)/\rho_{g}+\mathbf{\dot{X}}_{i}\bm{\beta}_{g} and Avar(𝝁^SRA)=𝐕/N\textup{Avar}(\bm{\hat{\mu}}_{SRA})=\mathbf{V}_{\ast}/N.

To show that the nonlinear SRA estimator achieves the SEB, let 𝔼[Y(g)|𝐗]=m(αg+𝐗𝜷g)=m(𝐗˘𝜹g)\mathbb{E}[Y(g)|\mathbf{X}]=m(\alpha_{g}^{\ast}+\mathbf{X}\bm{\beta}_{g}^{\ast})=m(\mathbf{\breve{X}}\bm{\delta}_{g}^{\ast}). Then, the asymptotic expansion of the nonlinear SRA estimator is N(𝝁^SRA𝝁)=N1/2i=1N(𝐇i+𝐉i)+op(1)\sqrt{N}(\bm{\hat{\mu}}_{SRA}-\bm{\mu})=N^{-1/2}\sum_{i=1}^{N}(\mathbf{H}_{i}+\mathbf{J}_{i})+o_{p}(1) where the gg-th element of the influence function, 𝐇i+𝐉i\mathbf{H}_{i}+\mathbf{J}_{i}, is given by m˙ig+ρg1𝐌g𝐂g1Wig𝐗˘iUi(g)\dot{m}_{ig}+\rho_{g}^{-1}\mathbf{M}_{g}^{\ast}\mathbf{C}_{g}^{\ast-1}W_{ig}\mathbf{\breve{X}}_{i}U_{i}(g) where m˙ig=migμg\dot{m}_{ig}=m_{ig}^{\ast}-\mu_{g}, mig=m(𝐗˘i𝜹g)m_{ig}^{\ast}=m(\mathbf{\breve{X}}_{i}\bm{\delta}^{\ast}_{g}), 𝐌g=𝔼[δgm(𝐗˘i𝜹g)]\mathbf{M}_{g}^{\ast}=\mathbb{E}\left[\nabla_{\delta_{g}}m(\mathbf{\breve{X}}_{i}\bm{\delta}^{\ast}_{g})\right], and 𝐂g𝔼[𝐗˘iδgm(𝐗˘i𝜹g)]\mathbf{C}_{g}^{\ast}\equiv\mathbb{E}\left[\mathbf{\breve{X}}^{\prime}_{i}\nabla_{\delta_{g}}m(\mathbf{\breve{X}}_{i}\bm{\delta}^{\ast}_{g})\right]. Now, let’s define, fig=m(𝐗˘i𝜹g)/(𝐗˘𝜹g)|𝜹g=𝜹gf_{ig}^{\ast}=\partial m(\mathbf{\breve{X}}_{i}\bm{\delta}_{g})/\partial(\mathbf{\breve{X}}\bm{\delta}_{g})|_{\bm{\delta}_{g}=\bm{\delta}_{g}^{\ast}}. Then,

𝐌g=(𝔼(fig)𝔼(𝐗ifig)) and 𝐂g=(𝔼(fig)𝔼(𝐗ifig)𝔼(𝐗ifig) 𝔼(𝐗i𝐗ifig))\displaystyle\mathbf{M}_{g}^{\ast}=\begin{pmatrix}\mathbb{E}(f_{ig}^{\ast})&\mathbb{E}(\mathbf{X}_{i}f_{ig}^{\ast})\end{pmatrix}\text{ and }\mathbf{C}_{g}^{\ast}=\begin{pmatrix}\mathbb{E}(f_{ig}^{\ast})&\mathbb{E}(\mathbf{X}_{i}f_{ig}^{\ast})\\ \mathbb{E}(\mathbf{X}_{i}^{\prime}f_{ig}^{\ast}) &\mathbb{E}(\mathbf{X}_{i}^{\prime}\mathbf{X}_{i}f_{ig}^{\ast})\end{pmatrix}

Using inverse rules for partitioned matrices, we know that 𝐂g1=(c11c12𝐜21𝐜22)\mathbf{C}_{g}^{\ast-1}=\begin{pmatrix}c_{11}&c_{12}\\ \mathbf{c}_{21}&\mathbf{c}_{22}\end{pmatrix} where

c11\displaystyle c_{11} =𝔼(fig)1{1+𝔼(𝐗ifig)𝐐1𝔼(𝐗ifig)𝔼(fig)1}\displaystyle=\mathbb{E}(f_{ig}^{\ast})^{-1}\bigg{\{}1+\mathbb{E}(\mathbf{X}_{i}f_{ig}^{\ast})\mathbf{Q}^{-1}\mathbb{E}(\mathbf{X}_{i}^{\prime}f_{ig}^{\ast})\mathbb{E}(f_{ig}^{\ast})^{-1}\bigg{\}} c12\displaystyle c_{12} =𝔼(fig)1𝔼(𝐗ifig)𝐐1\displaystyle=-\mathbb{E}(f_{ig}^{\ast})^{-1}\mathbb{E}(\mathbf{X}_{i}f_{ig}^{\ast})\mathbf{Q}^{-1}
𝐜21\displaystyle\mathbf{c}_{21} =𝐐1𝔼(𝐗ifig)𝔼(fig)1\displaystyle=-\mathbf{Q}^{\prime-1}\mathbb{E}(\mathbf{X}_{i}^{\prime}f_{ig}^{\ast})\mathbb{E}(f_{ig}^{\ast})^{-1} 𝐜22\displaystyle\mathbf{c}_{22} =𝐐1\displaystyle=\mathbf{Q}^{-1}

and 𝐐=𝔼(𝐗i𝐗ifig)𝔼(𝐗ifig)𝔼(fig)1𝔼(𝐗ifig)\mathbf{Q}=\mathbb{E}(\mathbf{X}_{i}^{\prime}\mathbf{X}_{i}f_{ig}^{\ast})-\mathbb{E}(\mathbf{X}_{i}^{\prime}f_{ig}^{\ast})\mathbb{E}(f_{ig}^{\ast})^{-1}\mathbb{E}(\mathbf{X}_{i}f_{ig}^{\ast}). Then,

𝐌g𝐂g1=(𝔼(fig)c11+𝔼(𝐗ifig)𝐜21𝔼(fig)c12+𝔼(𝐗ifig)𝐜22)=(1,𝟎1×K).\mathbf{M}_{g}^{\ast}\mathbf{C}_{g}^{\ast-1}=\begin{pmatrix}\mathbb{E}(f_{ig}^{\ast})c_{11}+\mathbb{E}(\mathbf{X}_{i}f_{ig}^{\ast})\mathbf{c}_{21}&\mathbb{E}(f_{ig}^{\ast})c_{12}+\mathbb{E}(\mathbf{X}_{i}f_{ig}^{\ast})\mathbf{c}_{22}\end{pmatrix}=\begin{pmatrix}1,\mathbf{0}_{1\times K}\end{pmatrix}.

This implies that 𝐌g𝐂g1Wig𝐗˘iUi(g)=WigUi(g)\mathbf{M}_{g}^{\ast}\mathbf{C}_{g}^{\ast-1}W_{ig}\mathbf{\breve{X}}_{i}U_{i}(g)=W_{ig}U_{i}(g) so m˙ig+ρg1𝐌g𝐂g1Wig𝐗˘iUi(g)=ψig\dot{m}_{ig}+\rho_{g}^{-1}\mathbf{M}_{g}^{\ast}\mathbf{C}_{g}^{\ast-1}W_{ig}\mathbf{\breve{X}}_{i}U_{i}(g)=\psi_{ig}. ∎