
Interacted two-stage least squares with treatment effect heterogeneity

Anqi Zhao  Peng Ding  Fan Li
Anqi Zhao: Fuqua School of Business, Duke University. Peng Ding: Department of Statistics, University of California, Berkeley. Fan Li: Department of Statistical Science, Duke University. Peng Ding was supported by the U.S. National Science Foundation (grant #1945136).
Abstract

Treatment effect heterogeneity with respect to covariates is common in instrumental variable (IV) analyses. An intuitive approach, which we term the interacted two-stage least squares (2sls), is to postulate a linear working model of the outcome on the treatment, covariates, and treatment-covariate interactions, and instrument it by the IV, covariates, and IV-covariate interactions. We clarify the causal interpretation of the interacted 2sls under the local average treatment effect (LATE) framework when the IV is valid conditional on the covariates. Our contributions are threefold. First, we show that the interacted 2sls with centered covariates is consistent for estimating the LATE if either of the following conditions holds: (i) the IV-covariate interactions are linear in the covariates; (ii) the linear outcome model underlying the interacted 2sls is correct. In particular, condition (i) is satisfied if either (ia) the covariates are categorical or (ib) the IV is randomly assigned. In contrast, existing 2sls procedures with an additive second stage only recover weighted averages of the conditional LATEs that generally differ from the LATE. Second, we show that the coefficients of the treatment-covariate interactions from the interacted 2sls are consistent for estimating treatment effect heterogeneity with regard to covariates among compliers if either condition (i) or condition (ii) holds. Moreover, we connect the 2sls estimator with the weighting perspective in Abadie (2003) and establish the necessity of condition (i) in the absence of additional assumptions on the outcome model. Third, leveraging the consistency guarantees of the interacted 2sls for categorical covariates, we propose a stratification strategy based on the IV propensity score to approximate the LATE and treatment effect heterogeneity with regard to the IV propensity score when neither condition (i) nor condition (ii) holds.


Keywords: Conditional average treatment effect, instrumental variable, local average treatment effect, potential outcome, treatment effect variation

1 Introduction

Two-stage least squares (2sls) is widely used for estimating treatment effects when instrumental variables (IVs) are available for endogenous treatments. Under the potential outcomes framework, Imbens and Angrist (1994) and Angrist et al. (1996) defined the local average treatment effect (LATE) as the average treatment effect over the subpopulation of compliers whose treatment status is affected by the IV, and formalized the assumptions on the IV that ensure the consistency of 2sls for estimating the LATE.

Often, IVs satisfy the IV assumptions only after conditioning on a set of covariates. We refer to such IVs as conditionally valid IVs henceforth. A common strategy is to include the corresponding covariates as additional regressors when fitting the 2sls. When treatment effect heterogeneity with regard to covariates is suspected, an intuitive generalization, which we term the interacted 2sls, is to add the IV-covariate and treatment-covariate interactions to the first and second stages, respectively (Angrist and Pischke, 2009, Section 4.5.2).

To our knowledge, theoretical properties of the interacted 2sls have not been discussed under the LATE framework except when the IV is randomly assigned and hence unconditionally valid (Ding et al., 2019). Our paper closes this gap and clarifies the properties of the interacted 2sls in the more prevalent setting where the IV is conditionally valid.

Our contributions are threefold. First, we show that the interacted 2sls with centered covariates is consistent for estimating the LATE if either of the following conditions holds:

(i) the IV-covariate interactions are linear in the covariates;

(ii) the linear outcome model underlying the interacted 2sls is correct.

In particular, condition (i) is satisfied if either (ia) the covariates are categorical or (ib) the IV is randomly assigned. In contrast, existing 2sls procedures with an additive second stage only recover weighted averages of the conditional LATEs that generally differ from the LATE (Angrist and Imbens, 1995; Sloczynski, 2022; Blandhol et al., 2022). This illustrates an advantage of including treatment-covariate interactions in the second stage.

Second, define systematic treatment effect variation as the variation in individual treatment effects that can be explained by covariates (Heckman et al., 1997; Djebbari and Smith, 2008). We show that the coefficients of the treatment-covariate interactions from the interacted 2sls are consistent for estimating the systematic treatment effect variation among compliers if either condition (i) or condition (ii) holds. Furthermore, we connect the 2sls estimator with the weighting perspective in Abadie (2003), and establish the necessity of condition (i) in the absence of additional assumptions on the outcome model. The results extend the literature on the connection between regression and weighting for estimating the average treatment effect and LATE (Frölich, 2007; Kline, 2011; Chattopadhyay and Zubizarreta, 2023; Chattopadhyay et al., 2024; Słoczyński et al., 2025).

Third, to remedy the possible biases when neither condition (i) nor condition (ii) holds, we propose stratifying based on the estimated propensity score of the IV and using the stratum dummies as the new covariates to fit the interacted 2sls. While propensity score stratification is standard for estimating the average treatment effect (Rosenbaum and Rubin, 1983) and LATE (Cheng and Lin, 2018; Pashley et al., 2024), we highlight its value in approximating treatment effect heterogeneity with respect to the IV propensity score.

Notation.

For a set of tuples $\{(u_i, v_{i1}, \ldots, v_{iL}) : i = 1, \ldots, N\}$ with $u_i \in \mathbb{R}^K$ and $v_{il} \in \mathbb{R}^{K_l}$ for $l = 1, \ldots, L$, denote by $\texttt{lm}(u_i \sim v_{i1} + \cdots + v_{iL})$ the least squares fit of the component-wise linear regression of $u_i$ on $(v_{i1}^\top, \ldots, v_{iL}^\top)^\top$. We allow each $v_{il}$ to be a scalar or a vector and use $+$ to denote concatenation of regressors. Throughout, we focus on the numeric outputs of least squares without invoking any assumption of the corresponding linear model. For two random vectors $Y \in \mathbb{R}^p$ and $X \in \mathbb{R}^q$, denote by $\textup{proj}(Y \mid X)$ the linear projection of $Y$ on $X$ in that $\textup{proj}(Y \mid X) = BX$, where $B = \operatorname{argmin}_{b \in \mathbb{R}^{p \times q}} \mathbb{E}\{(Y - bX)^2\} = \mathbb{E}(YX^\top)\{\mathbb{E}(XX^\top)\}^{-1}$. Let $1(\cdot)$ denote the indicator function. Let $\perp\!\!\!\perp$ denote independence.

2 Interacted 2SLS and identifying assumptions

2.1 Interacted 2SLS

Consider a study with two treatment levels, indexed by $d = 0, 1$, and a study population of $N$ units, indexed by $i = 1, \ldots, N$. For each unit, we observe a treatment status $D_i \in \{0,1\}$, an outcome of interest $Y_i \in \mathbb{R}$, a baseline covariate vector $X_i \in \mathbb{R}^K$, and a binary IV $Z_i \in \{0,1\}$ that is valid conditional on $X_i$. We assume $X_i = (1, X_{i1}, \ldots, X_{i,K-1})^\top$ includes the constant one as its first element unless specified otherwise, and state the IV assumptions in Section 2.2. Let $\mathbb{E}(Z_i \mid X_i) = \mathbb{P}(Z_i = 1 \mid X_i)$ denote the IV propensity score of unit $i$ (Rosenbaum and Rubin, 1983).

Definition 1 below reviews the standard 2sls procedure for estimating the causal effect of the treatment on the outcome. We call it the additive 2sls to signify the additive regression specifications in both stages.

Definition 1 (Additive 2sls).

(i) First stage: Fit $\texttt{lm}(D_i \sim Z_i + X_i)$ over $i = 1, \ldots, N$. Denote the fitted values by $\hat{D}_i$ for $i = 1, \ldots, N$.

(ii) Second stage: Fit $\texttt{lm}(Y_i \sim \hat{D}_i + X_i)$ over $i = 1, \ldots, N$. Estimate the treatment effect by the coefficient of $\hat{D}_i$.

The standard 2sls in Definition 1 is motivated by the additive working model $Y_i = D_i\beta_D + X_i^\top\beta_X + \eta_i$, where $\mathbb{E}(\eta_i \mid Z_i, X_i) = 0$. The coefficient of $D_i$, $\beta_D$, is the constant treatment effect of interest, and we use $Z_i$ to instrument the possibly endogenous $D_i$. We use the term working model to refer to models that are used to construct estimators but are not necessarily correct.
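To make the procedure concrete, below is a minimal base-R sketch of Definition 1, assuming a data frame dat with columns Y, D, Z and a single nonconstant covariate X1 (hypothetical names); the paper's $\texttt{lm}$ notation corresponds directly to R's lm().

# Additive 2sls (Definition 1), a minimal sketch; `dat` and its column
# names are hypothetical. Naive second-stage standard errors are invalid;
# use a proper 2sls variance estimator or the bootstrap in practice.
first <- lm(D ~ Z + X1, data = dat)          # first stage
dat$Dhat <- fitted(first)                    # fitted treatment values
second <- lm(Y ~ Dhat + X1, data = dat)      # second stage
tau_hat <- coef(second)["Dhat"]              # treatment effect estimate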

When treatment effect heterogeneity with regard to covariates is suspected, an intuitive generalization is to consider the interacted working model

$Y_i = D_iX_i^\top\beta_{DX} + X_i^\top\beta_X + \eta_i, \quad \text{where } \mathbb{E}(\eta_i \mid Z_i, X_i) = 0,$   (1)

with $X_i^\top\beta_{DX}$ representing the covariate-dependent treatment effect. We can then use $Z_iX_i$ to instrument the possibly endogenous $D_iX_i$; see, e.g., Angrist and Pischke (2009, Section 4.5.2). We term the resulting 2sls procedure the interacted 2sls, formalized in Definition 2 below.

Definition 2 (Interacted 2sls).

(i) First stage: Fit $\texttt{lm}(D_iX_i \sim Z_iX_i + X_i)$ over $i = 1, \ldots, N$, as the component-wise regression of $D_iX_i$ on $(Z_iX_i^\top, X_i^\top)^\top$. Denote the fitted values by $\widehat{DX}_i$ for $i = 1, \ldots, N$.

(ii) Second stage: Fit $\texttt{lm}(Y_i \sim \widehat{DX}_i + X_i)$ over $i = 1, \ldots, N$. Let $\hat{\beta}_{\textsc{2sls}}$ denote the coefficient vector of $\widehat{DX}_i$.

Of interest is the causal interpretation of the coefficient vector $\hat{\beta}_{\textsc{2sls}}$.
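As an illustration, here is a minimal base-R sketch of Definition 2 with $X_i = (1, X_{i1})^\top$, again assuming a data frame dat with hypothetical columns Y, D, Z, X1; the first stage is run component-wise for the two coordinates of $D_iX_i$.

# Interacted 2sls (Definition 2) with X_i = (1, X1_i); a sketch only.
dat$ZX1 <- dat$Z * dat$X1                      # IV-covariate interaction
dat$DX1 <- dat$D * dat$X1                      # treatment-covariate interaction
f1 <- lm(D   ~ Z + ZX1 + X1, data = dat)       # first stage for D_i
f2 <- lm(DX1 ~ Z + ZX1 + X1, data = dat)       # first stage for D_i * X1_i
dat$Dhat   <- fitted(f1)
dat$DX1hat <- fitted(f2)
second <- lm(Y ~ Dhat + DX1hat + X1, data = dat)
beta_2sls <- coef(second)[c("Dhat", "DX1hat")] # coefficient vector of DX-hat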

2.2 IV assumptions and causal estimands

We formalize below the IV assumptions and corresponding causal estimands using the potential outcomes framework (Imbens and Angrist, 1994; Angrist et al., 1996).

For $z, d \in \{0,1\}$, let $D_i(z)$ denote the potential treatment status of unit $i$ if $Z_i = z$, and let $Y_i(z,d)$ denote the potential outcome of unit $i$ if $Z_i = z$ and $D_i = d$. The observed treatment status and outcome satisfy $D_i = D_i(Z_i)$ and $Y_i = Y_i(Z_i, D_i) = Y_i(Z_i, D_i(Z_i))$.

Following the literature, we classify the units into four compliance types, denoted by $U_i$, based on the joint values of $D_i(1)$ and $D_i(0)$. We call unit $i$ an always-taker if $D_i(1) = D_i(0) = 1$, denoted by $U_i = \textup{a}$; a complier if $D_i(1) = 1$ and $D_i(0) = 0$, denoted by $U_i = \textup{c}$; a defier if $D_i(1) = 0$ and $D_i(0) = 1$, denoted by $U_i = \textup{d}$; and a never-taker if $D_i(1) = D_i(0) = 0$, denoted by $U_i = \textup{n}$.

Assumption 1 below reviews the assumptions for conditionally valid IVs that we assume throughout the paper.

Assumption 1.

$\{Y_i(z,d), D_i(z), X_i, Z_i : z = 0,1;\ d = 0,1\}$ are independent and identically distributed across $i = 1, \ldots, N$, and satisfy the following conditions:

(i) Selection on observables: $Z_i \perp\!\!\!\perp \{Y_i(z,d), D_i(z) : z = 0,1;\ d = 0,1\} \mid X_i$.

(ii) Overlap: $0 < \mathbb{E}(Z_i \mid X_i) < 1$.

(iii) Exclusion restriction: $Y_i(0,d) = Y_i(1,d) = Y_i(d)$ for $d = 0,1$, where $Y_i(d)$ denotes the common value.

(iv) Relevance: $\mathbb{E}\{D_i(1) - D_i(0) \mid X_i\} \neq 0$.

(v) Monotonicity: $D_i(1) \geq D_i(0)$.

Assumption 1(i) ensures $Z_i$ is as-if randomly assigned given $X_i$. Assumption 1(ii) ensures positive probabilities of $Z_i = 1$ and $Z_i = 0$ at all possible values of $X_i$. Assumption 1(iii) ensures $Z_i$ has no effect on the outcome once we condition on the treatment status. Assumption 1(iv) ensures $Z_i$ has a nonzero causal effect on the treatment status at all possible values of $X_i$. Assumption 1(v) precludes defiers. Assumptions 1(iv) and 1(v) together ensure a positive proportion of compliers at all possible values of $X_i$, i.e., $\mathbb{P}(U_i = \textup{c} \mid X_i) > 0$.

Recall $Y_i(d)$ as the common value of $Y_i(1,d) = Y_i(0,d)$ for $d = 0,1$ under Assumption 1(iii). Define $\tau_i = Y_i(1) - Y_i(0)$ as the individual treatment effect of unit $i$. Define

$\tau_{\textup{c}} = \mathbb{E}(\tau_i \mid U_i = \textup{c})$   (2)

as the local average treatment effect (LATE) on compliers, and define

$\tau_{\textup{c}}(X_i) = \mathbb{E}(\tau_i \mid X_i,\ U_i = \textup{c})$   (3)

as the conditional LATE given $X_i$. The law of iterated expectations ensures $\tau_{\textup{c}} = \mathbb{E}\{\tau_{\textup{c}}(X_i) \mid U_i = \textup{c}\}$.

To quantify treatment effect heterogeneity with regard to covariates, consider the linear projection of $\tau_i$ on $X_i$ among compliers, denoted by $\textup{proj}_{U_i=\textup{c}}(\tau_i \mid X_i) = X_i^\top\beta_{\textup{c}}$. The coefficient vector equals

$\beta_{\textup{c}} = \operatorname{argmin}_{b \in \mathbb{R}^K} \mathbb{E}\{(\tau_i - X_i^\top b)^2 \mid U_i = \textup{c}\} = \{\mathbb{E}(X_iX_i^\top \mid U_i = \textup{c})\}^{-1}\mathbb{E}(X_i\tau_i \mid U_i = \textup{c}).$   (4)

By Angrist and Pischke (2009, Theorem 3.1.6), $X_i^\top\beta_{\textup{c}}$ is also the linear projection of the conditional LATE $\tau_{\textup{c}}(X_i)$ on $X_i$ among compliers, with $\beta_{\textup{c}} = \operatorname{argmin}_{b \in \mathbb{R}^K} \mathbb{E}[\{\tau_{\textup{c}}(X_i) - X_i^\top b\}^2 \mid U_i = \textup{c}]$. This ensures $X_i^\top\beta_{\textup{c}}$ is the best linear approximation to both $\tau_i$ and $\tau_{\textup{c}}(X_i)$ based on $X_i$, generalizing the idea of systematic treatment effect variation in Djebbari and Smith (2008) to the IV setting; see also Heckman et al. (1997). We define $\beta_{\textup{c}}$ as the causal estimand for quantifying treatment effect heterogeneity among compliers that is explained by $X_i$.

Recall $\hat{\beta}_{\textsc{2sls}}$ as the coefficient vector from the interacted 2sls in Definition 2. Under the finite-population, design-based framework, Ding et al. (2019) showed that $\hat{\beta}_{\textsc{2sls}}$ is consistent for estimating $\beta_{\textup{c}}$ when $Z_i$ is randomly assigned and hence valid without conditioning on $X_i$. We establish in Section 3 the properties of $\hat{\beta}_{\textsc{2sls}}$ for estimating $\tau_{\textup{c}}$, $\tau_{\textup{c}}(X_i)$, and $\beta_{\textup{c}}$ in the more prevalent setting where $Z_i$ is only conditionally valid given $X_i$.

2.3 Identifying assumptions

We introduce below two assumptions central to our identification results. First, Assumption 2 below restricts the functional form of the IV propensity score, requiring the product of $\mathbb{E}(Z_i \mid X_i)$ and $X_i$ to be linear in $X_i$.

Assumption 2 (Linear IV-covariate interactions).

$\mathbb{E}(Z_iX_i \mid X_i)$ is linear in $X_i$.

In contrast, Assumption 3 below imposes linearity on the outcome model.

Assumption 3 (Correct interacted outcome model).

The interacted working model (1) is correct.

Proposition 1 below states three implications of Assumptions 2 and 3.

Proposition 1.

Let $K$ denote the dimension of the covariate vector $X_i$.

(i) Assumption 2 holds if either of the following conditions holds:

(a) Categorical covariates: $X_i$ is categorical with $K$ levels.

(b) Random assignment of IV: $Z_i \perp\!\!\!\perp X_i$.

(ii) If Assumption 2 holds, then $\mathbb{E}(Z_i \mid X_i)$ takes at most $K$ distinct values across all possible values of $X_i$. If Assumption 2 holds for all possible assignment mechanisms of $Z_i$, then $X_i$ is categorical with $K$ levels.

(iii) Assume Assumption 1. Assumption 3 holds if the potential outcomes satisfy

$\mathbb{E}\{Y_i(1) \mid X_i, U_i = \textup{a}\} = \mathbb{E}\{Y_i(1) \mid X_i, U_i = \textup{c}\} = X_i^\top\beta_1,$
$\mathbb{E}\{Y_i(0) \mid X_i, U_i = \textup{n}\} = \mathbb{E}\{Y_i(0) \mid X_i, U_i = \textup{c}\} = X_i^\top\beta_0$

for some fixed $\beta_0, \beta_1 \in \mathbb{R}^K$.

Proposition 1(i) provides two important special cases of Assumption 2. Together with Assumption 1, the condition $Z_i \perp\!\!\!\perp X_i$ in Proposition 1(ib) ensures the IV is valid without conditioning on $X_i$.

Proposition 1(ii) ensures that the IV propensity score takes at most $K$ distinct values under Assumption 2. This substantially restricts the possible IV assignment mechanisms beyond the two special cases in Proposition 1(i).

Proposition 1(iii) provides a special case of Assumption 3 where the potential outcomes are linear in the covariates.

3 Identification by interacted 2SLS

We establish in this section the conditions under which the interacted 2sls recovers the LATE and treatment effect heterogeneity with regard to covariates in the presence of heterogeneous treatment effects.

3.1 Identification of LATE

We first establish the properties of the interacted 2sls for estimating the LATE and unify the results with the literature. Assume throughout this subsection that $X_i = (1, X_{i1}, \ldots, X_{i,K-1})^\top$ has the constant one as its first element.

3.1.1 Interacted 2SLS with centered covariates

Let $\mu_k = \mathbb{E}(X_{ik} \mid U_i = \textup{c})$ denote the complier average of $X_{ik}$ for $k = 1, \ldots, K-1$. While $\mu_k$ is generally unknown, we can consistently estimate it by the method of moments proposed by Imbens and Rubin (1997) or the weighted estimator proposed by Abadie (2003). Definition 3 below formalizes our proposed estimator of the LATE.

Definition 3.

Given $X_i = (1, X_{i1}, \ldots, X_{i,K-1})^\top$, let $\hat{\mu}_k$ be a consistent estimate of $\mu_k = \mathbb{E}(X_{ik} \mid U_i = \textup{c})$ for $k = 1, \ldots, K-1$. Let $\hat{\tau}_{\times\times}$ be the first element of $\hat{\beta}_{\textsc{2sls}}$ from the interacted 2sls in Definition 2 that uses $X_i^0 = (1, X_{i1} - \hat{\mu}_1, \ldots, X_{i,K-1} - \hat{\mu}_{K-1})^\top$ as the covariate vector.

We use the subscript $\times\times$ to signify the interacted first and second stages in the interacted 2sls. Given $D_iX_i^0 = (D_i, D_i(X_{i1} - \hat{\mu}_1), \ldots, D_i(X_{i,K-1} - \hat{\mu}_{K-1}))^\top$, $\hat{\tau}_{\times\times}$ is the coefficient of the fitted value of $D_i$ from the interacted second stage. Let $\tau_{\times\times}$ denote the probability limit of $\hat{\tau}_{\times\times}$ as the sample size $N$ goes to infinity. Theorem 1 below establishes the conditions for $\tau_{\times\times}$ to identify the LATE.

Theorem 1.

Assume Assumption 1. Then $\tau_{\times\times} = \tau_{\textup{c}}$ if either Assumption 2 or Assumption 3 holds.

Theorem 1 ensures the consistency of $\hat{\tau}_{\times\times}$ for estimating the LATE when either Assumption 2 or Assumption 3 holds. From Proposition 1, this encompasses (i) categorical covariates, (ii) an unconditionally valid IV, and (iii) a correct interacted working model as three special cases. In particular, the regression specification of the interacted 2sls is saturated when $X_i$ is categorical, and is therefore a special case of the saturated 2sls discussed by Blandhol et al. (2022).

Recall that $\hat{\tau}_{\times\times}$ is the first element of $\hat{\beta}_{\textsc{2sls}}$ with centered covariates. We provide intuition for the use of centered covariates in Section 3.2 after presenting the more general theory for $\hat{\beta}_{\textsc{2sls}}$, and demonstrate the sufficiency of Assumptions 2 and 3 in Section 4. See Hirano and Imbens (2001) and Lin (2013) for similar uses of centered covariates when using interacted regression to estimate the average treatment effect.

Note that when computing $\hat{\tau}_{\times\times}$, we use the estimated complier averages to center the covariates, because the true values are unknown. To conduct inference based on $\hat{\tau}_{\times\times}$, we need to account for the uncertainty in estimating the complier means. This can be achieved by bootstrapping.
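As one concrete implementation, the sketch below estimates the complier average of a single covariate with the kappa weights of Abadie (2003), one of the two options named above, and then computes $\hat{\tau}_{\times\times}$. The logistic working model for the IV propensity score and the column names are illustrative assumptions, and for inference the whole pipeline, including the propensity score fit, would be wrapped in the bootstrap.

# tau_xx (Definition 3) with one covariate X1; a sketch under a working
# logistic model for the IV propensity score (an assumption for
# illustration; method-of-moments estimates of mu would also do).
ehat  <- fitted(glm(Z ~ X1, family = binomial, data = dat))
kappa <- 1 - dat$D * (1 - dat$Z) / (1 - ehat) -
         (1 - dat$D) * dat$Z / ehat            # Abadie's kappa weights
mu1 <- sum(kappa * dat$X1) / sum(kappa)        # complier average of X1
dat$X1c  <- dat$X1 - mu1                       # centered covariate
dat$ZX1c <- dat$Z * dat$X1c
dat$DX1c <- dat$D * dat$X1c
f1 <- lm(D    ~ Z + ZX1c + X1c, data = dat)    # first stage for D_i
f2 <- lm(DX1c ~ Z + ZX1c + X1c, data = dat)    # first stage for D_i * X1c_i
second <- lm(Y ~ fitted(f1) + fitted(f2) + X1c, data = dat)
tau_xx <- coef(second)[2]                      # coefficient of fitted D_i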

3.1.2 Unification with the literature

Theorem 1 contributes to the literature on using 2sls to estimate the LATE in the presence of treatment effect heterogeneity. Previously, Angrist and Imbens (1995) studied a variant of the additive 2sls in Definition 1 for categorical $X_i$, and showed that the coefficient of $\hat{D}_i$ recovers a weighted average of the conditional LATEs that generally differs from $\tau_{\textup{c}}$. Sloczynski (2022) studied the additive 2sls in Definition 1, and showed that when $\mathbb{E}(Z_i \mid X_i)$ is linear in $X_i$, the coefficient of $\hat{D}_i$ recovers a weighted average of the conditional LATEs that generally differs from $\tau_{\textup{c}}$. Blandhol et al. (2022) studied the additive 2sls and showed that the coefficient of $\hat{D}_i$ recovers a weighted average of the conditional LATEs when the additive first stage is correctly specified or saturated. In contrast, Theorem 1 ensures that the interacted 2sls directly recovers $\tau_{\textup{c}}$ if (i) the covariates are categorical, (ii) the IV is unconditionally valid, or (iii) the interacted working model (1) is correct. This illustrates an advantage of including the treatment-covariate interactions in the second stage.

We formalize the above overview in Definition 4 and Proposition 2 below. First, Definition 4 states the 2sls procedure considered by Angrist and Imbens (1995) and Sloczynski (2022).

Definition 4 (Interacted-additive 2sls).

(i) First stage: Fit $\texttt{lm}(D_i \sim Z_iX_i + X_i)$ over $i = 1, \ldots, N$. Denote the fitted values by $\hat{D}_i$ for $i = 1, \ldots, N$.

(ii) Second stage: Fit $\texttt{lm}(Y_i \sim \hat{D}_i + X_i)$ over $i = 1, \ldots, N$. Estimate the treatment effect by the coefficient of $\hat{D}_i$, denoted by $\hat{\tau}_{\times+}$.

We use the subscript $\times+$ to signify the interacted first stage and additive second stage. The additive, interacted, and interacted-additive 2sls in Definitions 1, 2, and 4 are three variants of 2sls discussed in the LATE literature, summarized in Table 1. The combination of an additive first stage and an interacted second stage leads to a degenerate second stage and is hence omitted.

Let $\hat{\tau}_{++}$ denote the coefficient of $\hat{D}_i$ from the additive 2sls in Definition 1. Let $\tau_{++}$ and $\tau_{\times+}$ denote the probability limits of $\hat{\tau}_{++}$ and $\hat{\tau}_{\times+}$, respectively. Proposition 2 below states the conditions for $\tau_{++}$ and $\tau_{\times+}$ to identify a weighted average of $\tau_{\textup{c}}(X_i)$, and contrasts the results with that for $\tau_{\times\times}$.

Let

$w(X_i) = \dfrac{\operatorname{var}\{\mathbb{E}(D_i \mid Z_i, X_i) \mid X_i\}}{\mathbb{E}[\operatorname{var}\{\mathbb{E}(D_i \mid Z_i, X_i) \mid X_i\}]}$

with $w(X_i) > 0$ and $\mathbb{E}\{w(X_i)\} = 1$. Let $\pi(X_i) = \mathbb{P}(U_i = \textup{c} \mid X_i)$ denote the proportion of compliers given $X_i$. Let $\tilde{\pi}(X_i)$ denote the linear projection of $\pi(X_i)$ on $X_i$ weighted by $\operatorname{var}(Z_i \mid X_i)$; that is, $\tilde{\pi}(X_i) = a^\top X_i$ with $a = \mathbb{E}\{\operatorname{var}(Z_i \mid X_i) \cdot X_iX_i^\top\}^{-1}\mathbb{E}\{\operatorname{var}(Z_i \mid X_i) \cdot X_i\,\pi(X_i)\}$.

Proposition 2.

Assume Assumption 1.

(i) If $\mathbb{E}(Z_i \mid X_i)$ is linear in $X_i$, then $\tau_{++} = \mathbb{E}\{w_+(X_i) \cdot \tau_{\textup{c}}(X_i)\}$, where

$w_+(X_i) = \dfrac{\operatorname{var}(Z_i \mid X_i) \cdot \pi(X_i)}{\mathbb{E}\{\operatorname{var}(Z_i \mid X_i) \cdot \pi(X_i)\}}$ with $w_+(X_i) > 0$ and $\mathbb{E}\{w_+(X_i)\} = 1$.

Further assume that $\mathbb{E}(D_i \mid Z_i, X_i)$ is linear in $(Z_i, X_i)$, i.e., the additive first stage in Definition 1 is correctly specified. Then $w_+(X_i) = w(X_i)$.

(ii) If Assumption 2 holds, then $\tau_{\times+} = \mathbb{E}\{w_\times(X_i) \cdot \tau_{\textup{c}}(X_i)\}$, where

$w_\times(X_i) = \dfrac{\operatorname{var}(Z_i \mid X_i) \cdot \tilde{\pi}(X_i) \cdot \pi(X_i)}{\mathbb{E}\{\operatorname{var}(Z_i \mid X_i) \cdot \tilde{\pi}(X_i)^2\}}$ with $\mathbb{E}\{w_\times(X_i)\} = 1$.

Further assume that $\mathbb{E}(D_i \mid Z_i, X_i)$ is linear in $(Z_iX_i, X_i)$, i.e., the interacted first stage in Definition 4 is correctly specified. Then $w_\times(X_i) = w(X_i)$.

(iii) If Assumption 2 or Assumption 3 holds, then $\tau_{\times\times} = \tau_{\textup{c}}$.

Proposition 2(i) reviews Sloczynski (2022, Corollary 3.4) and ensures that when the IV propensity score $\mathbb{E}(Z_i \mid X_i)$ is linear in $X_i$, $\hat{\tau}_{++}$ recovers a weighted average of $\tau_{\textup{c}}(X_i)$ that generally differs from $\tau_{\textup{c}}$. We further show that $w_+(X_i)$ simplifies to $w(X_i)$ when the additive first stage is correctly specified. See Sloczynski (2022) for a detailed discussion about the deviation of $\tau_{++}$ from $\tau_{\textup{c}}$.

Proposition 2(ii) extends Angrist and Imbens (1995) on categorical covariates to general covariates, and ensures that when $\mathbb{E}(Z_iX_i \mid X_i)$ is linear in $X_i$, $\hat{\tau}_{\times+}$ recovers a weighted average of $\tau_{\textup{c}}(X_i)$ that generally differs from $\tau_{\textup{c}}$. A technical subtlety is that $w_\times(X_i)$ may be negative due to $\tilde{\pi}(X_i)$, so that some $\tau_{\textup{c}}(X_i)$ may be negatively weighted. Note that categorical covariates satisfy both Assumption 2 and the condition that $\mathbb{E}(D_i \mid Z_i, X_i)$ is linear in $(Z_iX_i, X_i)$. Proposition 2(ii) then simplifies to the special case in Angrist and Imbens (1995).

Proposition 2(iii) follows from Theorem 1 and ensures that when $\mathbb{E}(Z_iX_i \mid X_i)$ is linear in $X_i$, $\hat{\tau}_{\times\times}$ from the interacted 2sls directly recovers $\tau_{\textup{c}}$.

Together, Proposition 2 illustrates the advantage of $\hat{\tau}_{\times\times}$ over $\hat{\tau}_{\times+}$ and $\hat{\tau}_{++}$ for estimating the LATE, especially when the covariates are categorical or the IV is unconditionally valid; c.f. Proposition 1. See Table 1 for a summary and Section 6 for a simulated example.

Table 1: 2sls for estimating the LATE.

First stage \ Second stage | Additive: $\texttt{lm}(Y_i \sim \hat{D}_i + X_i)$ | Interacted: $\texttt{lm}(Y_i \sim \widehat{DX}_i + X_i)$
Additive: $\texttt{lm}(D_i \sim Z_i + X_i)$ | $\hat{\tau}_{++} \overset{\mathbb{P}}{\to} \mathbb{E}\{w_+(X_i) \cdot \tau_{\textup{c}}(X_i)\}$ if $\mathbb{E}(Z_i \mid X_i)$ is linear in $X_i$ | n.a.
Interacted: $\texttt{lm}(D_i \sim Z_iX_i + X_i)$ or $\texttt{lm}(D_iX_i \sim Z_iX_i + X_i)$ | $\hat{\tau}_{\times+} \overset{\mathbb{P}}{\to} \mathbb{E}\{w_\times(X_i) \cdot \tau_{\textup{c}}(X_i)\}$ if Assumption 2 holds | $\hat{\tau}_{\times\times} \overset{\mathbb{P}}{\to} \tau_{\textup{c}}$ if Assumption 2 or Assumption 3 holds

Note. $\hat{\tau}_{++}$ denotes the coefficient of $\hat{D}_i$ from the additive 2sls in Definition 1. $\hat{\tau}_{\times+}$ denotes the coefficient of $\hat{D}_i$ from the interacted-additive 2sls in Definition 4. $\hat{\tau}_{\times\times}$ denotes the coefficient of $\hat{D}_i$ from the interacted 2sls with centered covariates in Definition 3. $w_+(X_i)$ and $w_\times(X_i)$ are defined in Proposition 2.

3.2 Identification of treatment effect heterogeneity

We establish in this subsection the properties of the interacted 2sls for estimating $\beta_{\textup{c}}$, which quantifies treatment effect heterogeneity with regard to covariates. Recall $\hat{\beta}_{\textsc{2sls}}$ as the coefficient vector of $D_iX_i$ from the interacted 2sls in Definition 2. Let $\beta_{\textsc{2sls}}$ denote the probability limit of $\hat{\beta}_{\textsc{2sls}}$.

3.2.1 General theory

Let $\textup{proj}(D_iX_i \mid Z_iX_i, X_i)$ denote the linear projection of $D_iX_i$ on $(Z_iX_i^\top, X_i^\top)^\top$. Let $\widetilde{DX}_i$ denote the residual from the linear projection of $\textup{proj}(D_iX_i \mid Z_iX_i, X_i)$ on $X_i$. Let $\epsilon_i = \tau_i - X_i^\top\beta_{\textup{c}}$ for compliers, as the residual from the linear projection of $\tau_i$ on $X_i$ among compliers. Let $\Delta_i = Y_i(0)$ for compliers and never-takers, and $\Delta_i = Y_i(1) - X_i^\top\beta_{\textup{c}}$ for always-takers. Let

$B_1 = C_1 \cdot \mathbb{E}[\{\mathbb{E}(Z_iX_i \mid X_i) - \textup{proj}(Z_iX_i \mid X_i)\} \cdot \{\mathbb{E}(\Delta_i \mid X_i) - \textup{proj}(\Delta_i \mid X_i)\}],$
$B_2 = C_1(I_K - C_2) \cdot \mathbb{E}\{\mathbb{E}(Z_iX_i \mid X_i) \cdot \epsilon_i \mid U_i = \textup{c}\} \cdot \mathbb{P}(U_i = \textup{c}),$   (8)

where $C_1$ is the coefficient matrix of $Z_iX_i$ in $\textup{proj}(D_iX_i \mid Z_iX_i, X_i)$ and $C_2$ is the coefficient matrix of $X_i$ in $\textup{proj}(Z_iX_i \mid X_i)$.

Theorem 2.

Assume Assumption 1. Then $\beta_{\textsc{2sls}} = \beta_{\textup{c}} + \{\mathbb{E}(\widetilde{DX}_i \cdot \widetilde{DX}_i^\top)\}^{-1}(B_1 + B_2)$, where $B_1 = B_2 = 0$ if either Assumption 2 or Assumption 3 holds.

Theorem 2 ensures that $\beta_{\textsc{2sls}}$ identifies $\beta_{\textup{c}}$ if and only if $B_1 + B_2 = 0$, and establishes Assumptions 2 and 3 as two sufficient conditions. We discuss the sufficiency and “almost necessity” of these two assumptions in Section 4.

Recall that Assumption 2 holds when the IV is randomly assigned. Therefore, Theorem 2 ensures $\beta_{\textsc{2sls}} = \beta_{\textup{c}}$ for randomly assigned IVs as a special case. See Ding et al. (2019, Theorem 7) for similar results in the finite-population, design-based framework.

In addition, recall that Assumption 2 holds when the covariates are categorical. Therefore, Theorem 2 ensures $\beta_{\textsc{2sls}} = \beta_{\textup{c}}$ for categorical covariates. As it turns out, the elements of $\beta_{\textup{c}}$ equal the category-specific LATEs if we define $X_i$ as the vector of all category dummies, without the constant one. This allows us to recover the conditional LATEs as coefficients of the interacted 2sls for categorical covariates. We formalize the intuition below.

3.2.2 The special case of categorical covariates

Assume throughout this subsection that the covariate vector $X_i$ encodes a $K$-level categorical variable, represented by $X_i^* \in \{1, \ldots, K\}$, using category dummies:

$X_i = (1(X_i^* = 1), \ldots, 1(X_i^* = K))^\top.$   (9)

Define

$\tau_{[k]\textup{c}} = \mathbb{E}(\tau_i \mid X_i^* = k,\ U_i = \textup{c})$

as the conditional LATE on compliers with $X_i^* = k$; c.f. Eq. (3). Lemma 1 below states the simplified forms of $\beta_{\textup{c}}$ and $\tau_{\textup{c}}(X_i)$ under (9).

Lemma 1.

Assume (9). Then $\beta_{\textup{c}} = (\tau_{[1]\textup{c}}, \ldots, \tau_{[K]\textup{c}})^\top$ and $\tau_{\textup{c}}(X_i) = X_i^\top\beta_{\textup{c}}$.

Lemma 1 has two implications. First, when $X_i$ is the dummy vector, the corresponding $\beta_{\textup{c}}$ has the subgroup LATEs $\tau_{[k]\textup{c}}$ as its elements, which adds interpretability. Second, with categorical $X_i$, the conditional LATE $\tau_{\textup{c}}(X_i)$ is linear in $X_i$ and equals the linear projection $X_i^\top\beta_{\textup{c}}$. This is not true for general $X_i$, for which the linear projection is only an approximation.

Corollary 1 below follows from Theorem 2 and Lemma 1, and ensures the coefficients of the interacted 2sls are consistent for estimating the conditional LATEs.

Corollary 1.

Assume Assumption 1 and (9). We have $\beta_{\textsc{2sls}} = (\tau_{[1]\textup{c}}, \ldots, \tau_{[K]\textup{c}})^\top$.

3.3 Reconciliation of Theorems 1 and 2

We now clarify the connection between Theorems 1 and 2, and provide intuition for the use of centered covariates in estimating the LATE.

Recall that $\hat{\tau}_{\times\times}$ is the first element of $\hat{\beta}_{\textsc{2sls}}$ with centered covariates. Theorem 2 ensures that $\hat{\beta}_{\textsc{2sls}}$ is consistent for estimating $\beta_{\textup{c}}$ when either Assumption 2 or Assumption 3 holds. Therefore, a logical basis for using $\hat{\tau}_{\times\times}$ to estimate $\tau_{\textup{c}}$ is that $\tau_{\textup{c}}$ equals the first element of $\beta_{\textup{c}}$. Lemma 2 below ensures that this condition is satisfied when the nonconstant covariates are centered by their respective complier averages. This establishes Theorem 1 as a direct consequence of Theorem 2, and justifies the use of centered covariates for estimating the LATE.

Lemma 2.

Assume $X_i = (1, X_{i1}, \ldots, X_{i,K-1})^\top$. If $\mathbb{E}(X_{ik} \mid U_i = \textup{c}) = 0$ for $k = 1, \ldots, K-1$, then the first element of $\beta_{\textup{c}}$ equals $\tau_{\textup{c}}$.

4 Discussion of Assumptions 2 and 3

We discuss in this section the sufficiency and “almost necessity” of Assumptions 2 and 3 to ensure $\beta_{\textsc{2sls}} = \beta_{\textup{c}}$ in Theorem 2.

4.1 Sufficiency and “almost necessity” of Assumption 2

We discuss below the sufficiency and “almost necessity” of Assumption 2. We first demonstrate the sufficiency of Assumption 2 to ensure $B_1 = B_2 = 0$, as a sufficient condition for $\beta_{\textsc{2sls}} = \beta_{\textup{c}}$ by Theorem 2. We then adopt the weighting perspective in Abadie (2003) and establish the necessity of Assumption 2 to ensure $\beta_{\textsc{2sls}} = \beta_{\textup{c}}$ in the absence of additional assumptions on the outcome model beyond Assumption 1.

4.1.1 Sufficiency

Recall the definitions of $B_1$ and $B_2$ from Eq. (8), where $C_2$ denotes the coefficient matrix of $X_i$ in $\textup{proj}(Z_iX_i \mid X_i)$. Assumption 2 ensures

$\mathbb{E}(Z_iX_i \mid X_i) = \textup{proj}(Z_iX_i \mid X_i) = C_2X_i$   (10)

so that $B_1 = 0$.

In addition, recall that $\epsilon_i$ is the residual from the linear projection of $\tau_i$ on $X_i$ among compliers. Properties of linear projection ensure $\mathbb{E}(X_i\epsilon_i \mid U_i = \textup{c}) = 0$, so that the expectation term in $B_2$ satisfies

$\mathbb{E}\{\mathbb{E}(Z_iX_i \mid X_i) \cdot \epsilon_i \mid U_i = \textup{c}\} \overset{(10)}{=} \mathbb{E}(C_2X_i\epsilon_i \mid U_i = \textup{c}) = 0.$

This ensures $B_2 = 0$ and therefore the sufficiency of Assumption 2 to ensure $B_1 = B_2 = 0$.

4.1.2 Necessity in the absence of additional assumptions on outcome model

Proposition 3 below expresses $\beta_{\textsc{2sls}}$ as a weighted estimator and establishes the necessity of Assumption 2 to ensure $\beta_{\textsc{2sls}} = \beta_{\textup{c}}$ in the absence of additional assumptions on the outcome model. Recall that $\widetilde{DX}_i$ denotes the residual from the linear projection of $\textup{proj}(D_iX_i \mid Z_iX_i, X_i)$ on $X_i$.

Proposition 3.

(i) $\beta_{\textsc{2sls}} = \mathbb{E}(w_iY_i)$, where $w_i = \{\mathbb{E}(\widetilde{DX}_i \cdot \widetilde{DX}_i^\top)\}^{-1}\widetilde{DX}_i$.

(ii) Under Assumption 1, we have $\beta_{\textup{c}} = \mathbb{E}(u_iY_i)$, where $u_i = \{\mathbb{E}(X_iX_i^\top \mid U_i = \textup{c})\}^{-1}\pi_{\textup{c}}^{-1}\Delta\kappa X_i$, with $\pi_{\textup{c}} = \mathbb{P}(U_i = \textup{c})$ and

$\Delta\kappa = \dfrac{Z_i - \mathbb{E}(Z_i \mid X_i)}{\mathbb{E}(Z_i \mid X_i)\{1 - \mathbb{E}(Z_i \mid X_i)\}}.$

(iii) If $w_i = u_i$, then Assumption 2 must hold.

Proposition 3(i) expresses $\beta_{\textsc{2sls}}$ as a weighted average of the outcome $Y_i$. The result is numeric and does not require Assumption 1.

Proposition 3(ii) follows from Abadie (2003) and expresses $\beta_{\textup{c}}$ as a weighted average of $Y_i$ when Assumption 1 holds.

Proposition 3(iii) connects Proposition 3(i)–(ii) and establishes the necessity of Assumption 2 to ensure $w_i = u_i$.

Together, Proposition 3(i)–(iii) establish the necessity of Assumption 2 to ensure $\beta_{\textsc{2sls}} = \beta_{\textup{c}}$ in the absence of additional assumptions on the outcome model.

4.2 Sufficiency and “almost necessity” of Assumption 3

We discuss below the sufficiency and “almost necessity” of Assumption 3 to ensure $B_1 = B_2 = 0$ in the absence of additional assumptions on $Z_i$ beyond Assumption 1.

Without additional assumptions on $Z_i$, the $\mathbb{E}(Z_iX_i \mid X_i)$ in the definitions of $B_1$ and $B_2$ is arbitrary. To ensure $B_1 = 0$, we generally need $\mathbb{E}(\Delta_i \mid X_i) = \textup{proj}(\Delta_i \mid X_i)$, which is equivalent to

$\mathbb{E}(\Delta_i \mid X_i)$ is linear in $X_i$.   (11)

In addition, Lemma S4 in the Supplementary Material ensures that $C_1(I_K - C_2)$ is invertible. From Eq. (8), we have $B_2 = 0$ if and only if

$\mathbb{E}\{\mathbb{E}(Z_iX_i \mid X_i) \cdot \epsilon_i \mid U_i = \textup{c}\} = 0.$   (12)

By the law of iterated expectations, the left-hand side of condition (12) equals

$\mathbb{E}\{\mathbb{E}(Z_iX_i \mid X_i) \cdot \epsilon_i \mid U_i = \textup{c}\} = \mathbb{E}[\mathbb{E}\{\mathbb{E}(Z_iX_i \mid X_i) \cdot \epsilon_i \mid X_i, U_i = \textup{c}\} \mid U_i = \textup{c}] = \mathbb{E}[\mathbb{E}(Z_iX_i \mid X_i) \cdot \mathbb{E}(\epsilon_i \mid X_i, U_i = \textup{c}) \mid U_i = \textup{c}].$

When $\mathbb{E}(Z_iX_i \mid X_i)$ is arbitrary, condition (12) generally requires $\mathbb{E}(\epsilon_i \mid X_i, U_i = \textup{c}) = 0$, which is equivalent to

$\tau_{\textup{c}}(X_i) = \mathbb{E}(\tau_i \mid X_i, U_i = \textup{c})$ is linear in $X_i$.   (13)

The above observations suggest the sufficiency and “almost necessity” of the combination of conditions (11) and (13) to ensure $B_1 = B_2 = 0$ in the absence of additional assumptions on $Z_i$. We further show in Lemma S5 in the Supplementary Material that under Assumption 1, Assumption 3 is equivalent to the combination of conditions (11) and (13). This demonstrates the sufficiency and “almost necessity” of Assumption 3 to ensure $B_1 = B_2 = 0$ in the absence of additional assumptions on $Z_i$.

Remark 1.

From Theorem 2 and Lemma 2, $\tau_{\times\times} = \tau_{\textup{c}}$ if and only if

the first element of $\{\mathbb{E}(\widetilde{DX}_i \cdot \widetilde{DX}_i^\top)\}^{-1}(B_1 + B_2)$ equals 0,   (14)

which is weaker than the condition $B_1 + B_2 = 0$ that ensures $\beta_{\textsc{2sls}} = \beta_{\textup{c}}$. Nonetheless, we did not find intuitive relaxations of Assumptions 2 and 3 that satisfy (14), and therefore invoke the same identifying conditions for $\tau_{\textup{c}}$ and $\beta_{\textup{c}}$ in Theorems 1 and 2.

5 Stratification based on IV propensity score

The results in Section 3 suggest that the interacted 2sls may be inconsistent for estimating the LATE and treatment effect heterogeneity when neither Assumption 2 nor Assumption 3 holds for the original covariates. When this is the case, we propose stratifying based on the IV propensity score and using the stratum dummies as the new covariates to fit the interacted 2sls. The consistency of the interacted 2sls for categorical covariates provides heuristic justification for using the resulting 2sls coefficients to approximate the LATE and treatment effect heterogeneity with regard to the IV propensity score. We formalize the intuition below. See Cheng and Lin (2018) and Pashley et al. (2024) for other applications of stratification based on the IV propensity score.

Assume that $X_i^*$ is a general covariate vector that ensures the conditional validity of $Z_i$. Let $e_i = \mathbb{P}(Z_i = 1 \mid X_i^*)$ denote the IV propensity score of unit $i$, and define

$\tau_{\textup{c}}(e_i) = \mathbb{E}(\tau_i \mid e_i,\ U_i = \textup{c})$

as the conditional LATE given $e_i$; c.f. Eq. (3). Properties of the propensity score (Rosenbaum and Rubin, 1983) ensure $Z_i \perp\!\!\!\perp \{Y_i(z,d), D_i(z) : z = 0,1;\ d = 0,1\} \mid e_i$, so that $Z_i$ is also conditionally valid given $e_i$. This, together with the consistency guarantees of the interacted 2sls for categorical covariates, motivates viewing $e_i$ as a new covariate that ensures the validity of $Z_i$, and fitting the interacted 2sls with a discretized version of $e_i$ to approximate $\tau_{\textup{c}}$ and $\tau_{\textup{c}}(e_i)$. In most applications, $e_i$ is unknown, so we need to first estimate it using $X_i^*$. We formalize the procedure in Algorithm 1 below.

Algorithm 1.

(i) Estimate $e_i = \mathbb{P}(Z_i = 1 \mid X_i^*)$ as $\hat{e}_i$.

(ii) Partition $[0,1]$ into $K$ strata, denoted by $\{\mathcal{S}_k : k = 1, \ldots, K\}$. Define $\hat{\mathcal{I}}_i = (1, \mathcal{I}_{i1}, \ldots, \mathcal{I}_{i,K-1})^\top$ and $\tilde{\mathcal{I}}_i = (\mathcal{I}_{i1}, \ldots, \mathcal{I}_{iK})^\top$, where $\mathcal{I}_{ik} = 1(\hat{e}_i \in \mathcal{S}_k)$, as two representations of the stratum of $\hat{e}_i$. In practice, we can choose $K$ and the cutoffs for $\mathcal{S}_k$ based on the realized values of the $\hat{e}_i$.

(iii) Compute $\hat{\tau}_{\times\times}^*$ by letting $X_i = \hat{\mathcal{I}}_i$ in Definition 3 to approximate the LATE. Compute $\hat{\beta}_{\textsc{2sls}}^*$ by letting $X_i = \tilde{\mathcal{I}}_i$ in Definition 2. For $k = 1, \ldots, K$, use the $k$th element of $\hat{\beta}_{\textsc{2sls}}^*$ to approximate $\tau_{\textup{c}}(e_i)$ for $e_i \in \mathcal{S}_k$.

Algorithm 1 transforms an arbitrary covariate vector $X_i^*$ into the categorical $\hat{\mathcal{I}}_i$ and $\tilde{\mathcal{I}}_i$ to fit the interacted 2sls. Note that $\hat{\mathcal{I}}_i$ and $\tilde{\mathcal{I}}_i$ are discretized approximations to the actual $e_i$. Heuristically, an IV that is conditionally valid given $e_i$ can be viewed as “approximately valid” given $\hat{\mathcal{I}}_i$ or $\tilde{\mathcal{I}}_i$. This, together with Theorem 1 and Corollary 1, provides the heuristic justification for using the resulting $\hat{\tau}_{\times\times}^*$ and $\hat{\beta}_{\textsc{2sls}}^*$ to approximate the LATE and the subgroup LATEs

$\tau_{[k]\textup{c}} = \mathbb{E}(\tau_i \mid e_i \in \mathcal{S}_k,\ U_i = \textup{c}), \quad k = 1, \ldots, K,$   (15)

respectively. The subgroup LATEs in (15) further provide a piecewise approximation to $\tau_{\textup{c}}(e_i)$ by assuming $\tau_{\textup{c}}(e_i) \approx \tau_{[k]\textup{c}}$ for $e_i \in \mathcal{S}_k$. This is analogous to the idea of the regressogram in nonparametric regression (Liero, 1989; Wasserman, 2006).
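A minimal base-R sketch of Algorithm 1 follows, assuming covariate columns X1 and X2 in dat and a working logistic model for $\hat{e}_i$ (both assumptions for illustration). By Corollary 1, the coefficients from the interacted 2sls with the full dummy vector $\tilde{\mathcal{I}}_i$ estimate the stratum-specific LATEs in (15).

# Algorithm 1, a sketch: stratify on the estimated IV propensity score
# and fit the interacted 2sls with stratum dummies as the covariates.
K    <- 5
ehat <- fitted(glm(Z ~ X1 + X2, family = binomial, data = dat))
S    <- cut(ehat, quantile(ehat, seq(0, 1, length.out = K + 1)),
            include.lowest = TRUE, labels = FALSE)  # equal-sized strata
Imat <- model.matrix(~ 0 + factor(S))               # N x K stratum dummies
DX   <- dat$D * Imat                                # D_i times dummy vector
ZX   <- dat$Z * Imat                                # Z_i times dummy vector
first  <- lm(DX ~ 0 + ZX + Imat)                    # component-wise first stage
second <- lm(dat$Y ~ 0 + fitted(first) + Imat)      # interacted second stage
beta_star <- coef(second)[1:K]   # estimates of tau_[k]c, k = 1, ..., K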

6 Simulation

6.1 Interacted 2SLS for estimating LATE

We now use a simulated example to illustrate the advantage of the interacted 2sls for estimating the LATE. Assume the following model for $(X_i, Z_i, D_i, Y_i)$:

(i) $X_i = (1, X_{i1})^\top$, where $X_{i1} \sim$ Bernoulli(0.5).

(ii) $Z_i \mid X_i \sim$ Bernoulli$(0.5 + 0.4X_{i1})$.

(iii) $D_i = 1(U_i = \textup{a}) + Z_i \cdot 1(U_i = \textup{c})$, where $\mathbb{P}(U_i = \textup{a} \mid X_i) = 0.1$ and $\mathbb{P}(U_i = \textup{c} \mid X_i) = 0.7 - 0.5X_{i1}$.

(iv) $Y_i = D_iY_i(1) + (1 - D_i)Y_i(0)$ with $Y_i(0) = 0$ and $Y_i(1) = -1 + 5X_{i1}$.

The data-generating process ensures $\tau_{[1]\textup{c}} = \mathbb{E}(\tau_i \mid X_{i1} = 1, U_i = \textup{c}) = 4$, $\tau_{[2]\textup{c}} = \mathbb{E}(\tau_i \mid X_{i1} = 0, U_i = \textup{c}) = -1$, and hence $\tau_{\textup{c}} = \{0.5 \times 0.2 \times 4 + 0.5 \times 0.7 \times (-1)\}/(0.5 \times 0.2 + 0.5 \times 0.7) = 1/9$ after weighting by the complier proportions. For each replication, we generate $N = 10000$ independent realizations of $(X_i, Z_i, D_i, Y_i)$ and compute $\hat{\tau}_{++}$, $\hat{\tau}_{\times\times}$, and $\hat{\tau}_{\times+}$ from the additive, interacted, and interacted-additive 2sls in Definitions 1, 2, and 4; c.f. Table 1. We use the method of moments to estimate the complier average of $X_{i1}$ in computing $\hat{\tau}_{\times\times}$.
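For reproducibility, one replication of this design can be generated as in the sketch below, with compliance types drawn from a latent uniform; the three estimators then follow Definitions 1, 2, and 4 as in the earlier code sketches.

# One replication of the Section 6.1 design; a sketch.
N  <- 10000
X1 <- rbinom(N, 1, 0.5)
Z  <- rbinom(N, 1, 0.5 + 0.4 * X1)
u  <- runif(N)                                   # latent compliance draw
at <- as.numeric(u < 0.1)                        # always-takers: P = 0.1
co <- as.numeric(u >= 0.1 & u < 0.8 - 0.5 * X1)  # compliers: P = 0.7 - 0.5*X1
D  <- at + Z * co                                # remaining units: never-takers
Y  <- D * (-1 + 5 * X1)                          # Y(0) = 0, Y(1) = -1 + 5*X1
dat <- data.frame(Y = Y, D = D, Z = Z, X1 = X1)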

Figure 1 shows the distributions of the three estimators over 1000 replications. Almost all realizations of $\hat{\tau}_{++}$ and $\hat{\tau}_{\times+}$ fall below $-0.4$ while the actual $\tau_{\textup{c}}$ is positive. In contrast, the distribution of $\hat{\tau}_{\times\times}$ is approximately centered at the actual $\tau_{\textup{c}}$, consistent with Theorem 1 for categorical covariates.

Figure 1: Distributions of $\hat{\tau}_{\times\times}$, $\hat{\tau}_{++}$, and $\hat{\tau}_{\times+}$ over 1000 replications.

6.2 Approximation of LATE by Algorithm 1

We now illustrate the improvement of Algorithm 1 for estimating the LATE. Assume the following model for generating $(X_i, Z_i, D_i, Y_i)$:

(i) $X_i = (1, X_{i1}, X_{i2})^\top$, where $X_{i1}$ and $X_{i2}$ are independent standard normals.

(ii) $\mathbb{P}(Z_i = 1 \mid X_i) = \{1 + \exp(X_{i1} + X_{i2})\}^{-1}$.

(iii) $D_i = Z_i \cdot 1(U_i = \textup{c}) + 1(U_i = \textup{a})$, where $\mathbb{P}(U_i = \textup{c}) = 0.7$ and $\mathbb{P}(U_i = \textup{a}) = 0.2$.

(iv) $Y_i = D_iY_i(1) + (1 - D_i)Y_i(0)$ with $Y_i(0) = 0$ and $Y_i(1) = X_{i1}^2 + X_{i2}^2$.

For each replication, we generate $N = 1000$ independent realizations of $(X_i, Z_i, D_i, Y_i)$ and compute $\hat{\tau}_{\times\times}^*$ by Algorithm 1. We consider $K = 5, 10, 15$ to evaluate the impact of the number of strata on estimation. For comparison, we also compute $\hat{\tau}_{++}$, $\hat{\tau}_{\times\times}$, and $\hat{\tau}_{\times+}$ from the additive, interacted, and interacted-additive 2sls in Definitions 1, 2, and 4.

Figure 2 shows violin plots of the deviations of the resulting estimators from the LATE over 1000 replications. The three estimators from Algorithm 1 substantially reduce the biases compared with $\hat{\tau}_{++}$, $\hat{\tau}_{\times+}$, and $\hat{\tau}_{\times\times}$. In addition, increasing the number of strata further reduces the bias but increases variability. Table 2 summarizes the means and standard deviations of the deviations.

Figure 2: Violin plots of the deviations of the 2sls estimators from the LATE over 1000 replications. The first three columns correspond to $\hat{\tau}_{++}$, $\hat{\tau}_{\times+}$, and $\hat{\tau}_{\times\times}$, respectively. The last three columns correspond to $\hat{\tau}_{\times\times}^*$ by Algorithm 1 with $K = 5, 10, 15$.

Table 2: Biases and standard deviations of the 2sls estimators over 1000 replications.

 | $\hat{\tau}_{++}$ | $\hat{\tau}_{\times+}$ | $\hat{\tau}_{\times\times}$ | $\hat{\tau}_{\times\times}^*$, $K=5$ | $\hat{\tau}_{\times\times}^*$, $K=10$ | $\hat{\tau}_{\times\times}^*$, $K=15$
Bias | $-0.559$ | $-0.556$ | $-0.557$ | $-0.106$ | $-0.054$ | $-0.043$
Standard dev. | 0.144 | 0.165 | 0.157 | 0.124 | 0.140 | 0.336

6.3 Possible inconsistency for estimating $\beta_{\textup{c}}$

We now illustrate the possible inconsistency of the interacted 2sls in Theorem 2. Assume the following model for $(X_i, Z_i, D_i, Y_i)$:

(i) $X_i = (1, X_{i1})^\top$, where $X_{i1} \sim$ Uniform(0,1).

(ii) $Z_i \sim$ Bernoulli$(e_i)$ with $e_i = X_{i1}$.

(iii) $D_i = Z_i \cdot 1(U_i = \textup{c}) + 1(U_i = \textup{a})$, where $\mathbb{P}(U_i = \textup{c}) = 0.7$ and $\mathbb{P}(U_i = \textup{a}) = 0.2$.

(iv) $Y_i = D_iY_i(1) + (1 - D_i)Y_i(0)$ with $Y_i(0) = 0$ and $Y_i(1) = X_{i1}^2$.

The data-generating process ensures $\tau_i = Y_i(1) - Y_i(0) = X_{i1}^2$ and $\beta_{\textup{c}} = (-1/6, 1)^\top$.
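For completeness, here is the short calculation behind $\beta_{\textup{c}} = (-1/6, 1)^\top$: since the compliance types are independent of $X_{i1}$ under this design and $X_{i1} \sim$ Uniform(0,1), projecting $\tau_i = X_{i1}^2$ on $(1, X_{i1})$ among compliers gives

$\beta_{\textup{c},2} = \dfrac{\operatorname{cov}(X_{i1}, X_{i1}^2)}{\operatorname{var}(X_{i1})} = \dfrac{1/4 - (1/2)(1/3)}{1/12} = 1, \qquad \beta_{\textup{c},1} = \mathbb{E}(X_{i1}^2) - \beta_{\textup{c},2}\,\mathbb{E}(X_{i1}) = \dfrac{1}{3} - \dfrac{1}{2} = -\dfrac{1}{6}.$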

For each replication, we generate $N = 1000$ independent realizations of $(X_i, Z_i, D_i, Y_i)$ and compute $\hat{\beta}_{\textsc{2sls}}$ from the interacted 2sls in Definition 2. Figure 3 shows the distribution of $\hat{\beta}_{\textsc{2sls}} - \beta_{\textup{c}}$ over 1000 replications, indicating clear empirical biases in both dimensions.

Figure 3: Distributions of $\hat{\beta}_{\textsc{2sls}} - \beta_{\textup{c}}$ over 1000 replications. The labels $\hat{\beta}_{\textsc{2sls},1} - \beta_{\textup{c},1}$ and $\hat{\beta}_{\textsc{2sls},2} - \beta_{\textup{c},2}$ denote the first and second elements of $\hat{\beta}_{\textsc{2sls}} - \beta_{\textup{c}}$, respectively.

6.4 Approximation of $\tau_{\textup{c}}(e_i)$ by Algorithm 1

We now illustrate the utility of Algorithm 1 for approximating the conditional LATEs given the IV propensity score. Assume the same model as in Section 6.3 for generating $(X_i, Z_i, D_i, Y_i)$. The data-generating process ensures $\tau_i = X_{i1}^2 = e_i^2$ with $\tau_{\textup{c}}(e_i) = \mathbb{E}(\tau_i \mid e_i, U_i = \textup{c}) = e_i^2$.

To use Algorithm 1 to approximate $\tau_{\textup{c}}(e_i)$, we estimate the IV propensity score by the logistic regression of $Z_i$ on $X_i$, and create the strata by dividing $(\hat{e}_i)_{i=1}^N$ into $K$ equal-sized bins. Figure 4 shows the results at $K = 5, 10, 15$. The first row is based on one replication to illustrate the evolution of the approximation as $K$ increases. The second row shows the 95% confidence bands based on 3000 replications. The coverage of the confidence bands improves as $K$ increases. Note that the logistic regression for estimating $e_i$ is misspecified in this example. The results suggest the robustness of the method to misspecification of the propensity score model with a moderate number of strata.

Figure 4: Piecewise approximations of $\tau_{\textup{c}}(e_i)$ by Algorithm 1. The first row is based on one replication, and the second row shows the 95% confidence bands based on 3000 replications.

7 Application

We now apply Algorithm 1 to the dataset that Abadie (2003) used to study the effects of 401(k) retirement programs on savings. The data contain cross-sectional observations of $N = 9275$ units on eligibility for and participation in 401(k) plans, along with income and demographic information. The outcome of interest is net financial assets. The treatment variable is an indicator of participation in 401(k) plans, and the IV is an indicator of 401(k) eligibility. Based on Abadie (2003), the IV is conditionally valid after controlling for income, age, and marital status.

Table 3 shows the point estimates of the LATE from the additive, interacted, and interacted-additive 2sls in Definitions 1, 2, and 4 and from Algorithm 1, along with their respective bootstrap standard deviations. Figure 5 shows the piecewise approximations of $\tau_{\textup{c}}(e_i)$ based on Algorithm 1 and the 95% confidence bands based on bootstrapping. The results suggest heterogeneity of the conditional LATEs across different strata of the propensity score.

Table 3: Point estimates of the LATE and bootstrap standard deviations, based on 1000 bootstrap replications, from the additive, interacted, and interacted-additive 2sls and from Algorithm 1.

 | $\hat{\tau}_{++}$ | $\hat{\tau}_{\times+}$ | $\hat{\tau}_{\times\times}$ | $\hat{\tau}_{\times\times}^*$, $K=5$ | $\hat{\tau}_{\times\times}^*$, $K=10$ | $\hat{\tau}_{\times\times}^*$, $K=15$ | $\hat{\tau}_{\times\times}^*$, $K=20$
Point estimate | 9.658 | 11.019 | 9.296 | 12.760 | 12.148 | 12.008 | 12.066
Standard dev. | 2.222 | 2.520 | 1.913 | 2.076 | 2.079 | 2.081 | 2.089

Figure 5: Piecewise approximations of $\tau_{\textup{c}}(e_i)$ by Algorithm 1 and the 95% confidence bands by bootstrapping.

8 Extensions

8.1 Extension to partially interacted 2SLS

Let $X_i$ denote the covariate vector that ensures the conditional validity of the IV. The interacted 2sls in Definition 2 includes interactions of $Z_i$ and $D_i$ with the whole set of $X_i$ in accommodating treatment effect heterogeneity. When treatment effect heterogeneity is of interest or suspected for only a subset of $X_i$, denoted by $V_i \subset X_i$, an intuitive modification is to interact $Z_i$ and $D_i$ with only $V_i$ in fitting the 2sls, formalized in Definition 5 below. See Resnjanskij et al. (2024) for a recent application. See also Hirano and Imbens (2001) and Hainmueller et al. (2019) for applications of the partially interacted regression in the ordinary least squares setting.

Definition 5 (Partially interacted 2sls).

(i) First stage: Fit $\texttt{lm}(D_iV_i \sim Z_iV_i + X_i)$ over $i = 1, \ldots, N$. Denote the fitted values by $\widehat{DV}_i$ for $i = 1, \ldots, N$.

(ii) Second stage: Fit $\texttt{lm}(Y_i \sim \widehat{DV}_i + X_i)$ over $i = 1, \ldots, N$. Let $\hat{\beta}_{\textsc{2sls}}'$ denote the coefficient vector of $\widehat{DV}_i$.

The partially interacted 2sls in Definition 5 generalizes the interacted 2sls, and reduces to the interacted 2sls when Vi=XiV_{i}=X_{i}. It is the 2sls procedure corresponding to the working model

Y_{i}=D_{i}V_{i}^{\top}\beta_{DV}+X_{i}^{\top}\beta_{X}+\eta_{i},\quad\text{where}\ \ \mathbb{E}(\eta_{i}\mid Z_{i},X_{i})=0, \qquad (16)

with V_{i}^{\top}\beta_{DV} representing the covariate-dependent treatment effect. We establish below the causal interpretation of \hat{\beta}_{\textsc{2sls}}^{\prime}.
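For concreteness, the working model (16) can be fit in a single call to ivreg from the R package AER. The sketch below assumes a data frame dat with outcome Y, treatment D, IV Z, covariates X1 and X2, and interacting subvector V_{i}=(1,X_{i1})^{\top}; all names are ours.

    # Partially interacted 2sls: second-stage regressors (D, D*X1, X1, X2),
    # instrumented by (Z, Z*X1, X1, X2). The coefficients of D and D:X1
    # estimate beta_DV in the working model (16).
    library(AER)
    fit <- ivreg(Y ~ D + D:X1 + X1 + X2 | Z + Z:X1 + X1 + X2, data = dat)
    summary(fit)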

Causal estimand.

Analogous to the definition of \beta_{\textup{c}} in Eq. (4), let \textup{proj}_{U_{i}=\textup{c}}(\tau_{i}\mid V_{i})=V_{i}^{\top}\beta_{\textup{c}}^{\prime} denote the linear projection of \tau_{i} on V_{i} among compliers, with

\beta_{\textup{c}}^{\prime}=\textup{argmin}_{b\in\mathbb{R}^{Q}}\mathbb{E}\big\{(\tau_{i}-V_{i}^{\top}b)^{2}\mid U_{i}=\textup{c}\big\}.

Parallel to the discussion after Eq. (4), V_{i}^{\top}\beta_{\textup{c}}^{\prime} is also the linear projection of \tau_{\textup{c}}(X_{i}) on V_{i} among compliers, with \beta_{\textup{c}}^{\prime}=\textup{argmin}_{b\in\mathbb{R}^{Q}}\mathbb{E}[\{\tau_{\textup{c}}(X_{i})-V_{i}^{\top}b\}^{2}\mid U_{i}=\textup{c}]. This ensures that V_{i}^{\top}\beta_{\textup{c}}^{\prime} is the best linear approximation to both \tau_{i} and \tau_{\textup{c}}(X_{i}) based on V_{i}. We define \beta_{\textup{c}}^{\prime} as the causal estimand for quantifying treatment effect heterogeneity with regard to V_{i} among compliers.

Properties of \hat{\beta}_{\textsc{2sls}}^{\prime} for estimating \beta_{\textup{c}}^{\prime}.

Let \beta_{\textsc{2sls}}^{\prime} denote the probability limit of \hat{\beta}_{\textsc{2sls}}^{\prime}. Proposition 4 below generalizes Theorem 2 and states three sufficient conditions for \beta_{\textsc{2sls}}^{\prime} to identify \beta_{\textup{c}}^{\prime}.

Proposition 4.

Assume X_{i}=(V_{i}^{\top},W_{i}^{\top})^{\top}\in\mathbb{R}^{K}, where V_{i}\in\mathbb{R}^{Q} and W_{i}\in\mathbb{R}^{K-Q} with Q\leq K. Let \widetilde{DV_{i}} denote the residual from the linear projection of \textup{proj}(D_{i}V_{i}\mid Z_{i}V_{i},X_{i}) on X_{i}. Let B^{\prime}=\mathbb{E}\{\widetilde{DV_{i}}\cdot(Y_{i}-D_{i}V_{i}^{\top}\beta_{\textup{c}}^{\prime})\}. Assume Assumption 1. Then

\beta_{\textsc{2sls}}^{\prime}=\beta_{\textup{c}}^{\prime}+\{\mathbb{E}(\widetilde{DV_{i}}\cdot\widetilde{DV_{i}}^{\top})\}^{-1}B^{\prime},

where B^{\prime}=0 if any of the following conditions holds:

  1. (i)

Categorical covariates: V_{i} is categorical with Q levels, and satisfies \mathbb{E}(Z_{i}\mid X_{i})=\mathbb{E}(Z_{i}\mid V_{i}) and \textup{proj}(Z_{i}V_{i}\mid X_{i})=\textup{proj}(Z_{i}V_{i}\mid V_{i}).

  2. (ii)

Random assignment of IV: Z_{i}\perp\!\!\!\perp X_{i}.

  3. (iii)

Correct interacted outcome model: the partially interacted working model (16) is correct.

Proposition 4(i) parallels the identification of \beta_{\textup{c}} with categorical covariates by Theorem 2, and states a sufficient condition for \beta_{\textsc{2sls}}^{\prime} to identify \beta_{\textup{c}}^{\prime} when V_{i} is categorical. A special case is when X_{i} encodes two categorical variables and V_{i} corresponds to one of them.

Proposition 4(ii) parallels the identification of \beta_{\textup{c}} with randomly assigned IV by Theorem 2, and ensures \beta_{\textsc{2sls}}^{\prime}=\beta_{\textup{c}}^{\prime} when Z_{i} is unconditionally valid.

Proposition 4(iii) parallels the identification of \beta_{\textup{c}} under the correct interacted working model (1) by Theorem 2, and ensures \beta_{\textsc{2sls}}^{\prime}=\beta_{\textup{c}}^{\prime} when the working model (16) underlying the partially interacted 2sls is correct.

We relegate possible relaxations of these three sufficient conditions to Lemma S9 in the Supplementary Material.

8.2 Interacted OLS as a special case

Ordinary least squares (ols) is a special case of 2sls with Z_{i}=D_{i}. Therefore, our theory extends to the ols regression of Y_{i} on (D_{i}X_{i},X_{i}) when D_{i} is conditionally randomized. Interestingly, the discussion of the interacted ols in observational studies is largely missing, with the exception of Rosenbaum and Rubin (1983). See Aronow and Samii (2016) and Chattopadhyay and Zubizarreta (2023) for the corresponding discussion of the additive ols.

Renew X_{i}^{0}=(1,X_{i1}-\bar{X}_{1},\ldots,X_{i,K-1}-\bar{X}_{K-1})^{\top}, where \bar{X}_{k}=N^{-1}\sum_{i=1}^{N}X_{ik}, as a variant of X_{i} centered by the covariate means over all units. Let \hat{\tau}_{\textsc{ols}} be the coefficient of D_{i} in \texttt{lm}(Y_{i}\sim D_{i}X_{i}^{0}+X_{i}^{0}), as the ols analog of \hat{\tau}_{\times\times} when Z_{i}=D_{i}. Let \hat{\beta}_{\textsc{ols}} be the coefficient vector of D_{i}X_{i} in \texttt{lm}(Y_{i}\sim D_{i}X_{i}+X_{i}), as the analog of \hat{\beta}_{\textsc{2sls}} when Z_{i}=D_{i}. We derive below the causal interpretation of \hat{\tau}_{\textsc{ols}} and \hat{\beta}_{\textsc{ols}}.
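As a sketch, with a single covariate the interacted ols with centered covariates reduces to one lm call in R; the data frame dat with outcome Y, treatment D, and covariate X is hypothetical.

    # Interacted ols with the covariate centered at its sample mean:
    # the coefficient of D is tau_hat_ols, the analog of tau_hat_{xx}.
    dat$Xc <- dat$X - mean(dat$X)
    fit <- lm(Y ~ D * Xc, data = dat)   # expands to 1 + D + Xc + D:Xc
    coef(fit)["D"]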

First, letting Z_{i}=D_{i} ensures that all units are compliers. The definitions of \tau_{\textup{c}} and \beta_{\textup{c}} in Eqs. (2) and (4) simplify to \tau=\mathbb{E}(\tau_{i}), the average treatment effect, and \beta=\textup{argmin}_{b\in\mathbb{R}^{K}}\mathbb{E}\{(\tau_{i}-X_{i}^{\top}b)^{2}\}, the coefficient vector in the linear projection of \tau_{i} on X_{i}, respectively. The IV assumptions in Assumption 1 simplify to the selection-on-observables and overlap assumptions on D_{i}, summarized in Assumption 4 below.

Assumption 4.
  1. (i)

Selection on observables: D_{i}\perp\!\!\!\perp\{Y_{i}(d):d=0,1\}\mid X_{i};

  2. (ii)

Overlap: 0<\mathbb{P}(D_{i}=1\mid X_{i})<1.

Proposition 5 below follows from applying Theorems 1–2 with Z_{i}=D_{i}, and establishes the properties of \hat{\tau}_{\textsc{ols}} and \hat{\beta}_{\textsc{ols}} for estimating \tau and \beta.

Proposition 5.

Let \tau_{\textsc{ols}} and \beta_{\textsc{ols}} denote the probability limits of \hat{\tau}_{\textsc{ols}} and \hat{\beta}_{\textsc{ols}}. Assume Assumption 4. Then \tau_{\textsc{ols}}=\tau and \beta_{\textsc{ols}}=\beta if either of the following conditions holds:

  1. (i)

Linear treatment-covariate interactions: \mathbb{E}(D_{i}X_{i}\mid X_{i}) is linear in X_{i}.

  2. (ii)

Linear outcome: Y_{i}=D_{i}X_{i}^{\top}\beta_{ZX}+X_{i}^{\top}\beta_{X}+\eta_{i}, where \mathbb{E}(\eta_{i}\mid D_{i},X_{i})=0, for some fixed \beta_{ZX},\beta_{X}\in\mathbb{R}^{K}.

Parallel to Proposition 1(i), Proposition 5(i) encompasses (a) categorical covariates and (b) randomly assigned treatment as special cases. Proposition 5(ii) states that the interacted regression is correctly specified.

Similar results can be obtained for the partially interacted ols regression of Y_{i} on (D_{i}V_{i},X_{i}) by letting Z_{i}=D_{i} in Proposition 4. We omit the details.

9 Discussion

We formalized the interacted 2sls as an intuitive approach for incorporating treatment effect heterogeneity in IV analyses, and established its properties for estimating the LATE and systematic treatment effect variation under the LATE framework. We showed that the coefficients from the interacted 2sls are consistent for estimating the LATE and treatment effect heterogeneity with regard to covariates among compliers if either (i) the IV-covariate interactions are linear in the covariates or (ii) the linear outcome model underlying the interacted 2sls is correct. Condition (i) includes categorical covariates and randomly assigned IV as special cases. When neither condition holds, we proposed a stratification strategy based on the IV propensity score to approximate the LATE and treatment effect heterogeneity with regard to the IV propensity score. We leave possible relaxations of the identifying conditions for the LATE to future work.

References

  • Abadie (2003) Abadie, A. (2003). Semiparametric instrumental variable estimation of treatment response models. Journal of Econometrics 113, 231–263.
  • Angrist (1998) Angrist, J. D. (1998). Estimating the labor market impact of voluntary military service using social security data on military applicants. Econometrica 66, 249–288.
  • Angrist and Imbens (1995) Angrist, J. D. and G. W. Imbens (1995). Two-stage least squares estimation of average causal effects in models with variable treatment intensity. Journal of the American Statistical Association 90, 431–442.
  • Angrist et al. (1996) Angrist, J. D., G. W. Imbens, and D. B. Rubin (1996). Identification of causal effects using instrumental variables (with discussion). Journal of the American Statistical Association 91, 444–455.
  • Angrist and Pischke (2009) Angrist, J. D. and J.-S. Pischke (2009). Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton: Princeton University Press.
  • Aronow and Samii (2016) Aronow, P. M. and C. Samii (2016). Does regression produce representative estimates of causal effects? American Journal of Political Science 60, 250–267.
  • Blandhol et al. (2022) Blandhol, C., J. Bonney, M. Mogstad, and A. Torgovitsky (2022). When is TSLS actually LATE? Technical report, National Bureau of Economic Research, Cambridge, MA.
  • Chattopadhyay et al. (2024) Chattopadhyay, A., N. Greifer, and J. R. Zubizarreta (2024). lmw: Linear model weights for causal inference. Observational Studies 10, 33–62.
  • Chattopadhyay and Zubizarreta (2023) Chattopadhyay, A. and J. R. Zubizarreta (2023). On the implied weights of linear regression for causal inference. Biometrika 110, 615–629.
  • Cheng and Lin (2018) Cheng, J. and W. Lin (2018). Understanding causal distributional and subgroup effects with the instrumental propensity score. American Journal of Epidemiology 187, 614–622.
  • Ding et al. (2019) Ding, P., A. Feller, and L. Miratrix (2019). Decomposing treatment effect variation. Journal of the American Statistical Association 114, 304–317.
  • Djebbari and Smith (2008) Djebbari, H. and J. Smith (2008). Heterogeneous impacts in PROGRESA. Journal of Econometrics 145, 64–80.
  • Hainmueller et al. (2019) Hainmueller, J., J. Mummolo, and Y. Xu (2019). How much should we trust estimates from multiplicative interaction models? Simple tools to improve empirical practice. Political Analysis 27, 163–192.
  • Heckman et al. (1997) Heckman, J. J., J. Smith, and N. Clements (1997). Making the most out of programme evaluations and social experiments: Accounting for heterogeneity in programme impacts. The Review of Economic Studies 64, 487–535.
  • Hirano and Imbens (2001) Hirano, K. and G. W. Imbens (2001). Estimation of causal effects using propensity score weighting: An application to data on right heart catheterization. Health Services and Outcomes Research Methodology 2, 259–278.
  • Imbens and Angrist (1994) Imbens, G. W. and J. D. Angrist (1994). Identification and estimation of local average treatment effects. Econometrica 62, 467–475.
  • Imbens and Rubin (1997) Imbens, G. W. and D. B. Rubin (1997). Estimating outcome distributions for compliers in instrumental variables models. The Review of Economic Studies 64, 555–574.
  • Kline (2011) Kline, P. (2011). Oaxaca-Blinder as a reweighting estimator. American Economic Review 101, 532–537.
  • Liero (1989) Liero, H. (1989). Strong uniform consistency of nonparametric regression function estimates. Probability Theory and Related Fields 82, 587–614.
  • Lin (2013) Lin, W. (2013). Agnostic notes on regression adjustments to experimental data: Reexamining Freedman’s critique. The Annals of Applied Statistics 7, 295–318.
  • Pashley et al. (2024) Pashley, N. E., L. Keele, and L. W. Miratrix (2024). Improving instrumental variable estimators with poststratification. Journal of the Royal Statistical Society Series A: Statistics in Society, qnae073.
  • Resnjanskij et al. (2024) Resnjanskij, S., J. Ruhose, S. Wiederhold, L. Woessmann, and K. Wedel (2024). Can mentoring alleviate family disadvantage in adolescence? A field experiment to improve labor market prospects. Journal of Political Economy 132, 000–000.
  • Rosenbaum and Rubin (1983) Rosenbaum, P. R. and D. B. Rubin (1983). The central role of the propensity score in observational studies for causal effects. Biometrika 70, 41–55.
  • Sloczynski (2022) Sloczynski, T. (2022). When should we (not) interpret linear IV estimands as LATE?
  • Słoczyński et al. (2025) Słoczyński, T., S. D. Uysal, and J. M. Wooldridge (2025). Abadie’s kappa and weighting estimators of the local average treatment effect. Journal of Business & Economic Statistics 43, 164–177.
  • Strang (2006) Strang, G. (2006). Linear Algebra and Its Applications (4 ed.). Cengage Learning.
  • Wasserman (2006) Wasserman, L. (2006). All of Nonparametric Statistics. New York: Springer.

Supplementary Material

Section S1 gives additional results that complement the main paper.

Section S2 gives the lemmas for the proofs. In particular, Section S2.4 gives two linear algebra results that underlie Proposition 1(ii).

Section S3 gives the proofs of the results in the main paper.

Section S4 gives the proofs of the results in Section S1.

Section S5 gives the proofs of the linear algebra results in Section S2.4.

Notation.

Recall that for two random vectors Y\in\mathbb{R}^{p} and X\in\mathbb{R}^{q}, we use \textup{proj}(Y\mid X) to denote the linear projection of Y on X. Further let \textup{res}(Y\mid X)=Y-\textup{proj}(Y\mid X) denote the corresponding residual.
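In finite samples, proj and res have direct least-squares analogs. The R helpers below are our own shorthand mirroring this notation, not part of the paper.

    # Sample analogs of proj(Y | X) and res(Y | X): fitted values and
    # residuals from regressing Y on X (include a column of ones in X
    # when an intercept is intended, matching the paper's convention).
    proj <- function(Y, X) fitted(lm(Y ~ X - 1))
    res  <- function(Y, X) Y - proj(Y, X)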

Write

\widehat{DX}_{i}=\textup{proj}(D_{i}X_{i}\mid Z_{i}X_{i},X_{i})=C_{1}Z_{i}X_{i}+C_{0}X_{i},\qquad\textup{proj}(Z_{i}X_{i}\mid X_{i})=C_{2}X_{i}, \qquad (S1)

where C_{1}, C_{0}, and C_{2} denote the coefficient matrices. A useful fact for the proofs is

\widetilde{DX}_{i}=\textup{res}(\widehat{DX}_{i}\mid X_{i})\overset{(S1)}{=}\textup{res}(C_{1}Z_{i}X_{i}+C_{0}X_{i}\mid X_{i})=C_{1}\cdot\textup{res}(Z_{i}X_{i}\mid X_{i})\overset{(S1)}{=}C_{1}(Z_{i}X_{i}-C_{2}X_{i}). \qquad (S2)

Recall the definitions of \epsilon_{i} and \Delta_{i} from Theorem 2. Extend the definition of \epsilon_{i} to all units as

\epsilon_{i}=\begin{cases}\tau_{i}-X_{i}^{\top}\beta_{\textup{c}}&\text{for compliers}\\ 0&\text{for always-takers and never-takers.}\end{cases}

A useful fact for the proofs is

Y_{i}-D_{i}X_{i}^{\top}\beta_{\textup{c}}=\begin{cases}Y_{i}(1)-X_{i}^{\top}\beta_{\textup{c}}&\text{for always-takers}\\ Y_{i}(0)+Z_{i}(\tau_{i}-X_{i}^{\top}\beta_{\textup{c}})&\text{for compliers}\\ Y_{i}(0)&\text{for never-takers}\end{cases}=\Delta_{i}+Z_{i}\epsilon_{i}. \qquad (S7)

S1 Additional results

Population 2SLS estimand.

Linear projection gives the population analog of least-squares regression. Proposition S1 below formalizes the population analog of the interacted 2sls in Definition 2.

Proposition S1.

Define the population interacted 2sls as the combination of the following two linear projections:

  1. (i)

First stage: the linear projection of D_{i}X_{i} on (Z_{i}X_{i}^{\top},X_{i}^{\top})^{\top}, denoted by \textup{proj}(D_{i}X_{i}\mid Z_{i}X_{i},X_{i}).

  2. (ii)

Second stage: the linear projection of Y_{i} on the concatenation of \textup{proj}(D_{i}X_{i}\mid Z_{i}X_{i},X_{i}) and X_{i}.

Then \beta_{\textsc{2sls}} equals the coefficient vector of \textup{proj}(D_{i}X_{i}\mid Z_{i}X_{i},X_{i}) from the second stage.

Invariance of the interacted 2SLS to nondegenerate linear transformations.

The values of \hat{\beta}_{\textsc{2sls}} from the interacted 2sls and \beta_{\textup{c}} from Eq. (4) both depend on the choice of the covariate vector X_{i}. Let \hat{\beta}_{\textsc{2sls}}(X_{i}), \beta_{\textsc{2sls}}(X_{i}), and \beta_{\textup{c}}(X_{i}) denote the values of \hat{\beta}_{\textsc{2sls}}, \beta_{\textsc{2sls}}, and \beta_{\textup{c}} corresponding to the covariate vector X_{i}. Proposition S2 below states their invariance to nondegenerate linear transformations of X_{i}.

Proposition S2.

For K\times 1 covariate vectors \{X_{i}\in\mathbb{R}^{K}:i=1,\ldots,N\} and an invertible K\times K matrix \Gamma, we have

\hat{\beta}_{\textsc{2sls}}(\Gamma X_{i})=(\Gamma^{\top})^{-1}\hat{\beta}_{\textsc{2sls}}(X_{i}),\quad\beta_{\textsc{2sls}}(\Gamma X_{i})=(\Gamma^{\top})^{-1}\beta_{\textsc{2sls}}(X_{i}),\quad\beta_{\textup{c}}(\Gamma X_{i})=(\Gamma^{\top})^{-1}\beta_{\textup{c}}(X_{i}).
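Proposition S2 is easy to check numerically. The self-contained R sketch below implements the interacted 2sls by hand on simulated data (all names and the data-generating process are ours) and verifies the transformation identity.

    set.seed(1)
    n <- 500
    X <- cbind(1, rnorm(n))                  # covariate vector with intercept
    Z <- rbinom(n, 1, plogis(X[, 2]))
    D <- rbinom(n, 1, plogis(Z + X[, 2]))
    Y <- D * (1 + X[, 2]) + X[, 2] + rnorm(n)
    tsls <- function(X) {                    # interacted 2sls, Definition 2
      W     <- cbind(Z * X, X)               # instruments (ZX, X)
      DXhat <- W %*% solve(crossprod(W), crossprod(W, D * X))  # first stage
      R     <- cbind(DXhat, X)               # second-stage regressors
      solve(crossprod(R), crossprod(R, Y))[1:2]  # coefficients of DXhat
    }
    G <- matrix(c(1, 0.5, 0.3, 2), 2, 2)     # an invertible Gamma
    all.equal(c(tsls(X %*% t(G))), c(solve(t(G)) %*% tsls(X)))  # TRUE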
Generalized additive 2SLS.

Definition S1 below generalizes the additive and interacted-additive 2sls in Definitions 1 and 4 to an arbitrary first stage.

Definition S1 (Generalized additive 2sls).

Let R_{i}=R(X_{i},Z_{i}) be an arbitrary vector function of (X_{i},Z_{i}).

  1. (i)

First stage: Fit \texttt{lm}(D_{i}\sim R_{i}) over i=1,\ldots,N. Denote the fitted values by \hat{D}_{i} for i=1,\ldots,N.

  2. (ii)

Second stage: Fit \texttt{lm}(Y_{i}\sim\hat{D}_{i}+X_{i}) over i=1,\ldots,N. Denote the coefficient of \hat{D}_{i} by \hat{\tau}_{*+}.

We use the subscript *+ to represent the arbitrary first stage and additive second stage. The additive 2sls in Definition 1 is a special case of the generalized additive 2sls with R_{i}=(Z_{i},X_{i}^{\top})^{\top}. The interacted-additive 2sls in Definition 4 is a special case with R_{i}=(Z_{i}X_{i}^{\top},X_{i}^{\top})^{\top}.
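As a sketch, the two stages of Definition S1 translate into two lm calls in R; the data frame dat with Y, D, Z, and a covariate X is hypothetical, and the choice of R_i below recovers the additive 2sls. Note that the manual two-step yields the 2sls point estimate but not valid standard errors.

    # Generalized additive 2sls with R_i = (Z, X): regress D on R_i,
    # then regress Y on the fitted values and X.
    first    <- lm(D ~ Z + X, data = dat)
    dat$Dhat <- fitted(first)
    second   <- lm(Y ~ Dhat + X, data = dat)
    coef(second)["Dhat"]                     # tau_hat_{*+}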

Let \tau_{*+} denote the probability limit of \hat{\tau}_{*+}. Proposition S3 below clarifies the causal interpretation of \tau_{*+}.

Let \check{D}_{i}=\textup{proj}(D_{i}\mid R_{i}) denote the population fitted value from the first stage of the generalized additive 2sls. Let \tilde{D}_{i}=\textup{res}(\check{D}_{i}\mid X_{i}). Let \Delta_{i}=Y_{i}(0) for compliers and never-takers and \Delta_{i}=Y_{i}(1)-\tau_{\textup{c}}(X_{i}) for always-takers. Recall

w(X_{i})=\dfrac{\operatorname{var}\{\mathbb{E}(D_{i}\mid Z_{i},X_{i})\mid X_{i}\}}{\mathbb{E}\big[\operatorname{var}\{\mathbb{E}(D_{i}\mid Z_{i},X_{i})\mid X_{i}\}\big]}.
Proposition S3.

Assume Assumption 1. Then \tau_{*+}=\mathbb{E}\{\alpha(X_{i})\cdot\tau_{\textup{c}}(X_{i})\}+\{\mathbb{E}(\tilde{D}_{i}^{2})\}^{-1}B, where

\alpha(X_{i})=\dfrac{\mathbb{E}(D_{i}\tilde{D}_{i}\mid X_{i})}{\mathbb{E}(\tilde{D}_{i}^{2})},\qquad B=\mathbb{E}\Big[\Big\{\mathbb{E}(\check{D}_{i}\mid X_{i})-\textup{proj}(\check{D}_{i}\mid X_{i})\Big\}\cdot\Big\{\mathbb{E}(\Delta_{i}\mid X_{i})-\textup{proj}(\Delta_{i}\mid X_{i})\Big\}\Big].

In particular,

  1. (i)

B=0 if either of the following conditions holds:

    1. (a)

\mathbb{E}(\check{D}_{i}\mid X_{i}) is linear in X_{i}.

    2. (b)

\mathbb{E}(\Delta_{i}\mid X_{i}) is linear in X_{i}.

  2. (ii)

\alpha(X_{i})=w(X_{i}) if both of the following conditions hold:

    1. (a)

\mathbb{E}(\check{D}_{i}\mid X_{i}) is linear in X_{i}.

    2. (b)

\check{D}_{i}=\mathbb{E}(D_{i}\mid Z_{i},X_{i}); i.e., the arbitrary first stage is correct.

S2 Lemmas

S2.1 Projection and conditional expectation

Lemmas S1–S3 below review some standard results regarding projection and conditional expectation. We omit the proofs.

Lemma S1.

For random vectors AA and BB,

  1. (i)

\mathbb{E}(A\mid B)=\textup{proj}(A\mid B) if and only if \mathbb{E}(A\mid B) is linear in B;

  2. (ii)

if A\perp\!\!\!\perp B, then \mathbb{E}(AB\mid B)=\textup{proj}(AB\mid B)=\mathbb{E}(A)B;

  3. (iii)

if A is binary and \mathbb{E}(A\mid B) is constant, then A\perp\!\!\!\perp B.

Lemma S2.

Let X be an m\times 1 random vector with \mathbb{E}(XX^{\top}) invertible. Then

  1. (i)

for any subvector of X, denoted by V, \mathbb{E}(VV^{\top}) is invertible.

  2. (ii)

X takes at least m distinct values.

  3. (iii)

If X takes exactly m distinct values, denoted by x_{1},\ldots,x_{m}\in\mathbb{R}^{m}, then

  4. -

\Gamma=(x_{1},\ldots,x_{m})\in\mathbb{R}^{m\times m} is invertible, and X=\Gamma(1(X=x_{1}),\ldots,1(X=x_{m}))^{\top}.

  5. -

XX^{\top} is component-wise linear in X.

  6. -

for any random vector A, \mathbb{E}(A\mid X) is linear in X.

Lemma S3 (Population Frisch–Waugh–Lovell (FWL)).

For a random variable Y, a p_{1}\times 1 random vector X_{1}, and a p_{2}\times 1 random vector X_{2}, let \tilde{Y}=\textup{res}(Y\mid X_{2}) and \tilde{X}_{1}=\textup{res}(X_{1}\mid X_{2}) denote the residuals from the linear projections of Y and X_{1} on X_{2}, respectively. Then the coefficient vector of X_{1} in the linear projection of Y on (X_{1}^{\top},X_{2}^{\top})^{\top} equals the coefficient vector of \tilde{X}_{1} in the linear projection of Y or \tilde{Y} on \tilde{X}_{1}.
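The sample version of Lemma S3 can be verified numerically; the R snippet below (simulated data, our own construction) confirms that the long-regression coefficient of X1 matches the coefficient from the residualized regression.

    set.seed(2)
    n  <- 1000
    X2 <- rnorm(n)
    X1 <- 0.5 * X2 + rnorm(n)
    Y  <- X1 - X2 + rnorm(n)
    long   <- coef(lm(Y ~ X1 + X2))["X1"]    # long-regression coefficient
    X1_res <- resid(lm(X1 ~ X2))             # residual of X1 on X2
    short  <- coef(lm(Y ~ X1_res))["X1_res"] # coefficient on the residual
    all.equal(unname(long), unname(short))   # TRUE up to numerical error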

S2.2 Lemmas for the interacted 2SLS

Assumption S1 below states the rank condition for the interacted 2sls, which we assume implicitly throughout the main paper. By the law of large numbers,

\beta_{\textsc{2sls}}=\text{the first }K\text{ dimensions of }\left[\mathbb{E}\left\{\begin{pmatrix}Z_{i}X_{i}\\ X_{i}\end{pmatrix}(D_{i}X_{i}^{\top},X_{i}^{\top})\right\}\right]^{-1}\mathbb{E}\left\{\begin{pmatrix}Z_{i}X_{i}\\ X_{i}\end{pmatrix}Y_{i}\right\}.
Assumption S1.

\mathbb{E}\left\{\begin{pmatrix}Z_{i}X_{i}\\ X_{i}\end{pmatrix}(D_{i}X_{i}^{\top},X_{i}^{\top})\right\} is invertible.

Lemma S4 below states some implications of Assumption S1 that are useful for the proofs.

Lemma S4.

Recall C_{0},C_{1},C_{2} from (S1). Assume Assumption S1. We have

  1. (i)

\mathbb{E}\left\{\begin{pmatrix}Z_{i}X_{i}\\ X_{i}\end{pmatrix}(Z_{i}X_{i}^{\top},X_{i}^{\top})\right\} and \mathbb{E}(X_{i}X_{i}^{\top}) are both invertible, with

    (C1,C0)\displaystyle(C_{1},C_{0}) =\displaystyle= 𝔼{DiXi(ZiXi,Xi)}[𝔼{(ZiXiXi)(ZiXi,Xi)}]1,\displaystyle\mathbb{E}\Big{\{}D_{i}X_{i}(Z_{i}X_{i}^{\top},X^{\top}_{i})\Big{\}}\left[\mathbb{E}\left\{\begin{pmatrix}Z_{i}X_{i}\\ X_{i}\end{pmatrix}(Z_{i}X_{i}^{\top},X^{\top}_{i})\right\}\right]^{-1},
    C2\displaystyle C_{2} =\displaystyle= 𝔼(ZiXiXi){𝔼(XiXi)}1.\displaystyle\mathbb{E}(Z_{i}X_{i}X^{\top}_{i})\big{\{}\mathbb{E}(X_{i}X^{\top}_{i})\big{\}}^{-1}.
  2. (ii)

X_{i} takes at least K distinct values.

  3. (iii)

C_{1}, C_{2}, and I_{K}-C_{2} are all invertible.

Proof of Lemma S4.

We verify below Lemma S4(i)–(iii), respectively. We omit the subscript i in the proof. Let

Γ1=𝔼{(DXX)(ZX,X)},Γ2=𝔼{(ZXX)(ZX,X)}\displaystyle\Gamma_{1}=\mathbb{E}\left\{\begin{pmatrix}DX\\ X\end{pmatrix}(ZX^{\top},X^{\top})\right\},\quad\Gamma_{2}=\mathbb{E}\left\{\begin{pmatrix}ZX\\ X\end{pmatrix}\Big{(}ZX^{\top},X^{\top}\Big{)}\right\}

be shorthand notation, with \Gamma_{1} being invertible under Assumption S1.

Proof of Lemma S4(i):

We prove the result by contradiction. Note that \Gamma_{2} is positive semidefinite. If \Gamma_{2} is degenerate, then there exists a nonzero vector a such that a^{\top}\Gamma_{2}a=0. This implies

0=a𝔼{(ZXX)(ZX,X)}a=𝔼{a(ZXX)(ZX,X)a}\displaystyle 0=a^{\top}\mathbb{E}\left\{\begin{pmatrix}ZX\\ X\end{pmatrix}(ZX^{\top},X^{\top})\right\}a=\mathbb{E}\left\{a^{\top}\begin{pmatrix}ZX\\ X\end{pmatrix}(ZX^{\top},X^{\top})a\right\}

such that (ZX^{\top},X^{\top})a=0 almost surely. As a result, we have

Γ1a=𝔼{(DXX)(ZX,X)}a=𝔼{(DXX)(ZX,X)a}=0\Gamma_{1}a=\mathbb{E}\left\{\begin{pmatrix}DX\\ X\end{pmatrix}(ZX^{\top},X^{\top})\right\}a=\mathbb{E}\left\{\begin{pmatrix}DX\\ X\end{pmatrix}(ZX^{\top},X^{\top})a\right\}=0

for a nonzero a. This contradicts the invertibility of \Gamma_{1}.

That \mathbb{E}(XX^{\top}) is invertible then follows from Lemma S2(i).

Proof of Lemma S4(ii).

Given that \mathbb{E}(XX^{\top}) is invertible, as just proved, the result follows from Lemma S2(ii).

Proof of Lemma S4(iii):

Properties of linear projection ensure

proj{(DXX)ZX,X}=(C1C00IK)(ZXX),\displaystyle\textup{proj}\left\{\begin{pmatrix}DX\\ X\end{pmatrix}\mid ZX,X\right\}=\begin{pmatrix}C_{1}&C_{0}\\ 0&I_{K}\end{pmatrix}\begin{pmatrix}ZX\\ X\end{pmatrix},

with

(C1C00IK)=𝔼{(DXX)(ZX,X)}[𝔼{(ZXX)(ZX,X)}]1=Γ1Γ21.\displaystyle\begin{pmatrix}C_{1}&C_{0}\\ 0&I_{K}\end{pmatrix}=\mathbb{E}\left\{\begin{pmatrix}DX\\ X\end{pmatrix}(ZX^{\top},X^{\top})\right\}\left[\mathbb{E}\left\{\begin{pmatrix}ZX\\ X\end{pmatrix}(ZX^{\top},X^{\top})\right\}\right]^{-1}=\Gamma_{1}\Gamma_{2}^{-1}. (S8)

This ensures \det(C_{1})=\det\begin{pmatrix}C_{1}&C_{0}\\ 0&I_{K}\end{pmatrix}=\det(\Gamma_{1}\Gamma_{2}^{-1})\neq 0.

In addition, note that

IKC2\displaystyle I_{K}-C_{2} =\displaystyle= 𝔼(XX){𝔼(XX)}1𝔼(ZXX){𝔼(XX)}1\displaystyle\mathbb{E}(XX^{\top})\cdot\left\{\mathbb{E}(XX^{\top})\right\}^{-1}-\mathbb{E}(ZXX^{\top})\cdot\left\{\mathbb{E}(XX^{\top})\right\}^{-1}
=\displaystyle= 𝔼{(1Z)XX}{𝔼(XX)}1.\displaystyle\mathbb{E}\left\{(1-Z)XX^{\top}\right\}\cdot\left\{\mathbb{E}(XX^{\top})\right\}^{-1}.

To show that C_{2} and I_{K}-C_{2} are invertible, it suffices to show that \mathbb{E}(ZXX^{\top}) and \mathbb{E}\{(1-Z)XX^{\top}\} are invertible. This is ensured by

0<det(Γ2)\displaystyle 0<\det(\Gamma_{2}) =\displaystyle= det{(𝔼(ZXX)𝔼(ZXX)𝔼(ZXX)𝔼(XX))}\displaystyle\det\left\{\begin{pmatrix}\mathbb{E}(ZXX^{\top})&\mathbb{E}(ZXX^{\top})\\ \mathbb{E}(ZXX^{\top})&\mathbb{E}(XX^{\top})\end{pmatrix}\right\}
=\displaystyle= det{(𝔼(ZXX)𝔼(ZXX)0𝔼{(1Z)XX})}\displaystyle\det\left\{\begin{pmatrix}\mathbb{E}(ZXX^{\top})&\mathbb{E}(ZXX^{\top})\\ 0&\mathbb{E}\left\{(1-Z)XX^{\top}\right\}\end{pmatrix}\right\}
=\displaystyle= det{𝔼(ZXX)}det[𝔼{(1Z)XX}].\displaystyle\det\Big{\{}\mathbb{E}(ZXX^{\top})\Big{\}}\cdot\det\Big{[}\mathbb{E}\left\{(1-Z)XX^{\top}\right\}\Big{]}.

Lemma S5 below formalizes the sufficiency and necessity of Assumption 3 for ensuring the combination of conditions (11) and (13); cf. Section 4.2.

Lemma S5.

Assume Assumption 1. Assumption 3 is both sufficient and necessary for

\mathbb{E}(\Delta_{i}\mid X_{i})=X_{i}^{\top}\beta_{\Delta}\ \text{for some}\ \beta_{\Delta}\in\mathbb{R}^{K},\qquad\tau_{\textup{c}}(X_{i})=\mathbb{E}(\tau_{i}\mid X_{i},U_{i}=\textup{c})=X_{i}^{\top}\beta_{\textup{c}}. \qquad (S9)
Proof of Lemma S5.

We verify below the sufficiency and necessity of Assumption 3, respectively.

Sufficiency.

Under Assumption 3, we have

\eta_{i}=Y_{i}-D_{i}X_{i}^{\top}\beta_{DX}-X_{i}^{\top}\beta_{X}=\begin{cases}Y_{i}(1)-X_{i}^{\top}(\beta_{DX}+\beta_{X})&\text{if }U_{i}=\textup{a}\\ Y_{i}(0)-X_{i}^{\top}\beta_{X}&\text{if }U_{i}=\textup{n}\\ Y_{i}(0)-X_{i}^{\top}\beta_{X}+Z_{i}(\tau_{i}-X_{i}^{\top}\beta_{DX})&\text{if }U_{i}=\textup{c}\end{cases}=\lambda_{i}+Z_{i}\zeta_{i}, \qquad (S16)

where

\lambda_{i}=\begin{cases}Y_{i}(1)-X_{i}^{\top}(\beta_{DX}+\beta_{X})&\text{if }U_{i}=\textup{a}\\ Y_{i}(0)-X_{i}^{\top}\beta_{X}&\text{if }U_{i}=\textup{n}\\ Y_{i}(0)-X_{i}^{\top}\beta_{X}&\text{if }U_{i}=\textup{c},\end{cases}\qquad\zeta_{i}=\begin{cases}\tau_{i}-X_{i}^{\top}\beta_{DX}&\text{if }U_{i}=\textup{c}\\ 0&\text{if }U_{i}\neq\textup{c}.\end{cases}

The definition of ζi\zeta_{i} ensures

𝔼(ζiXi)\displaystyle\mathbb{E}(\zeta_{i}\mid X_{i}) =\displaystyle= 𝔼(ζiXi,Ui=c)(Ui=cXi)\displaystyle\mathbb{E}(\zeta_{i}\mid X_{i},U_{i}=\textup{c})\cdot\mathbb{P}(U_{i}=\textup{c}\mid X_{i}) (S18)
=\displaystyle= {𝔼(τiXi,Ui=c)XiβDX}(Ui=cXi)\displaystyle\Big{\{}\mathbb{E}(\tau_{i}\mid X_{i},U_{i}=\textup{c})-X_{i}^{\top}\beta_{DX}\Big{\}}\cdot\mathbb{P}(U_{i}=\textup{c}\mid X_{i})
=\displaystyle= {τc(Xi)XiβDX}(Ui=cXi).\displaystyle\Big{\{}\tau_{\textup{c}}(X_{i})-X_{i}^{\top}\beta_{DX}\Big{\}}\cdot\mathbb{P}(U_{i}=\textup{c}\mid X_{i}).

This, together with (S16), ensures

𝔼(ηiXi,Zi)\displaystyle\mathbb{E}(\eta_{i}\mid X_{i},Z_{i}) =(S16)\displaystyle\overset{\eqref{eq:eta_2}}{=} 𝔼(λiXi,Zi)+Zi𝔼(ζiXi,Zi)\displaystyle\mathbb{E}(\lambda_{i}\mid X_{i},Z_{i})+Z_{i}\cdot\mathbb{E}(\zeta_{i}\mid X_{i},Z_{i}) (S19)
=Assm. 1\displaystyle\overset{\text{Assm. \ref{assm:iv}}}{=} 𝔼(λiXi)+Zi𝔼(ζiXi)\displaystyle\mathbb{E}(\lambda_{i}\mid X_{i})+Z_{i}\cdot\mathbb{E}(\zeta_{i}\mid X_{i})
=(S18)\displaystyle\overset{\eqref{eq:zeta}}{=} 𝔼(λiXi)+Zi{τc(Xi)XiβDX}(Ui=cXi).\displaystyle\mathbb{E}(\lambda_{i}\mid X_{i})+Z_{i}\cdot\Big{\{}\tau_{\textup{c}}(X_{i})-X_{i}^{\top}\beta_{DX}\Big{\}}\cdot\mathbb{P}(U_{i}=\textup{c}\mid X_{i}).

Under Assumption 3, we have 𝔼(ηiXi,Zi)=0\mathbb{E}(\eta_{i}\mid X_{i},Z_{i})=0 so that (S19) implies

𝔼(λiXi)\displaystyle\mathbb{E}(\lambda_{i}\mid X_{i}) =(S19)\displaystyle\overset{\eqref{eq:eta_3}}{=} 𝔼(ηiXi,Zi=0)=Assm. 30,\displaystyle\mathbb{E}(\eta_{i}\mid X_{i},Z_{i}=0)\overset{\text{Assm. \ref{assm:linear outcome}}}{=}0, (S20)
τc(Xi)XiβDX\displaystyle\tau_{\textup{c}}(X_{i})-X_{i}^{\top}\beta_{DX} =(S19)\displaystyle\overset{\eqref{eq:eta_3}}{=} 𝔼(ηiXi,Zi=1)𝔼(ηiXi,Zi=0)(Ui=cXi)=Assm. 30.\displaystyle\dfrac{\mathbb{E}(\eta_{i}\mid X_{i},Z_{i}=1)-\mathbb{E}(\eta_{i}\mid X_{i},Z_{i}=0)}{\mathbb{P}(U_{i}=\textup{c}\mid X_{i})}\overset{\text{Assm. \ref{assm:linear outcome}}}{=}0. (S21)

(S21), together with Lemma S1(i), ensures

τc(Xi)=XiβDXwithβc=βDX.\displaystyle\tau_{\textup{c}}(X_{i})=X_{i}^{\top}\beta_{DX}\ \ \text{with}\ \ \beta_{\textup{c}}=\beta_{DX}. (S22)

Plugging (S22) in the definition of λi\lambda_{i} ensures Δi=λi+XiβX\Delta_{i}=\lambda_{i}+X_{i}^{\top}\beta_{X} with

𝔼(ΔiXi)=𝔼(λiXi)+XiβX=(S20)XiβX.\displaystyle\mathbb{E}(\Delta_{i}\mid X_{i})=\mathbb{E}(\lambda_{i}\mid X_{i})+X_{i}^{\top}\beta_{X}\overset{\eqref{eq:lmd_2}}{=}X_{i}^{\top}\beta_{X}.

This, together with (S21), verifies the sufficiency of Assumption 3 to ensure (S9).

Necessity.

We show below that (S9) implies Assumption 3. Recall from (S7) that Y_{i}-D_{i}X_{i}^{\top}\beta_{\textup{c}}=\Delta_{i}+Z_{i}\epsilon_{i}. When (S9) holds, we have \mathbb{E}(\epsilon_{i}\mid X_{i},U_{i}=\textup{c})=0 so that

𝔼(ϵiXi)=𝔼(ϵiXi,Ui=c)(Ui=cXi)=0.\displaystyle\mathbb{E}(\epsilon_{i}\mid X_{i})=\mathbb{E}(\epsilon_{i}\mid X_{i},U_{i}=\textup{c})\cdot\mathbb{P}(U_{i}=\textup{c}\mid X_{i})=0. (S23)

This ensures

𝔼(YiDiXiβcXi,Zi)\displaystyle\mathbb{E}(Y_{i}-D_{i}X_{i}^{\top}\beta_{\textup{c}}\mid X_{i},Z_{i}) =(S7)\displaystyle\overset{\eqref{eq:Delta_ep}}{=} 𝔼(ΔiXi,Zi)+Zi𝔼(ϵiXi,Zi)\displaystyle\mathbb{E}(\Delta_{i}\mid X_{i},Z_{i})+Z_{i}\cdot\mathbb{E}(\epsilon_{i}\mid X_{i},Z_{i}) (S24)
=Assm. 1\displaystyle\overset{\text{Assm. \ref{assm:iv}}}{=} 𝔼(ΔiXi)+Zi𝔼(ϵiXi)\displaystyle\mathbb{E}(\Delta_{i}\mid X_{i})+Z_{i}\cdot\mathbb{E}(\epsilon_{i}\mid X_{i})
=(S9)+(S23)\displaystyle\overset{\eqref{eq:cond_linear Delta_tcx}+\eqref{eq:thm2_cond3_necessary_1}}{=} XiβΔ.\displaystyle X_{i}^{\top}\beta_{\Delta}.

Let ηi=YiDiXiβcXiβΔ\eta^{\prime}_{i}=Y_{i}-D_{i}X_{i}^{\top}\beta_{\textup{c}}-X_{i}^{\top}\beta_{\Delta}. We have

Yi\displaystyle Y_{i} =\displaystyle= DiXiβc+XiβΔ+ηi,\displaystyle D_{i}X_{i}^{\top}\beta_{\textup{c}}+X_{i}^{\top}\beta_{\Delta}+\eta^{\prime}_{i},
𝔼(ηiXi,Zi)\displaystyle\mathbb{E}(\eta^{\prime}_{i}\mid X_{i},Z_{i}) =\displaystyle= 𝔼(YiDiXiβcXi,Zi)XiβΔ=(S24)0\displaystyle\mathbb{E}(Y_{i}-D_{i}X_{i}^{\top}\beta_{\textup{c}}\mid X_{i},Z_{i})-X_{i}^{\top}\beta_{\Delta}\overset{\eqref{eq:thm2_cond3_necessary}}{=}0

so that Assumption 3 holds. ∎

Lemma S6 below states a numeric result that we use to prove Corollary 1.

Lemma S6.

Let \widehat{DX}_{i} and \hat{D}_{i} denote the fitted values from \texttt{lm}(D_{i}X_{i}\sim Z_{i}X_{i}+X_{i}) and

\texttt{lm}(D_{i}\sim Z_{i}X_{i}+X_{i}), \qquad (S25)

respectively. Assume that X_{i} is categorical with K levels. Then \widehat{DX}_{i}=\hat{D}_{i}X_{i}.

Proof of Lemma S6.

We verify below the result for X_{i}=(1(X_{i}^{*}=1),\ldots,1(X_{i}^{*}=K))^{\top}, where X_{i}^{*}\in\{1,\ldots,K\} denotes a K-level categorical variable. The result for general categorical X_{i} then follows from the invariance of least squares to nondegenerate linear transformations of the regressor vector.

Given X_{i}=(1(X_{i}^{*}=1),\ldots,1(X_{i}^{*}=K))^{\top}, for k=1,\ldots,K, the kth element of \hat{D}_{i}X_{i} equals \hat{D}_{i}\cdot 1(X_{i}^{*}=k), and the kth element of \widehat{DX}_{i} equals the fitted value from

\texttt{lm}\big\{D_{i}\cdot 1(X_{i}^{*}=k)\sim Z_{i}X_{i}+X_{i}\big\}, \qquad (S26)

denoted by \widehat{DX}_{i}[k]. By symmetry, to show that \widehat{DX}_{i}=\hat{D}_{i}X_{i}, it suffices to show that the first element of \hat{D}_{i}X_{i} equals the first element of \widehat{DX}_{i}, i.e.,

\hat{D}_{i}\cdot 1(X_{i}^{*}=1)=\widehat{DX}_{i}[1]. \qquad (S27)

We verify (S27) below.

Left-hand side of (S27).

Standard results ensure that (S25) is equivalent to KK stratum-wise regressions

\texttt{lm}(D_{i}\sim Z_{i}+1)\quad\text{over units with }X_{i}^{*}=k

for k=1,\ldots,K, such that \hat{D}_{i} equals the sample mean of \{D_{i}:X_{i}^{*}=k,Z_{i}=z\} for units with \{X_{i}^{*}=k,Z_{i}=z\}. This ensures

\hat{D}_{i}\cdot 1(X_{i}^{*}=1)=\begin{cases}0&\text{for }\{i:X_{i}^{*}\neq 1\};\\ \text{sample mean of }\{D_{i}:X_{i}^{*}=1,Z_{i}=z\}&\text{for }\{i:X_{i}^{*}=1;\,Z_{i}=z\}.\end{cases} \qquad (S30)
Right-hand side of (S27).

From (S26), \widehat{DX}_{i}[1] equals the fitted value from

\texttt{lm}\big\{D_{i}\cdot 1(X_{i}^{*}=1)\sim Z_{i}X_{i}+X_{i}\big\},

which is equivalent to KK stratum-wise regressions

\texttt{lm}\big\{D_{i}\cdot 1(X_{i}^{*}=1)\sim Z_{i}+1\big\}\quad\text{over units with }X_{i}^{*}=k

for k=1,\ldots,K. This ensures that \widehat{DX}_{i}[1] equals the sample mean of \{D_{i}\cdot 1(X_{i}^{*}=1):X_{i}^{*}=k,Z_{i}=z\} for units with \{X_{i}^{*}=k,Z_{i}=z\}, which simplifies to the following:

\widehat{DX}_{i}[1]=\begin{cases}0&\text{for }\{i:X_{i}^{*}\neq 1\};\\ \text{sample mean of }\{D_{i}:X_{i}^{*}=1,Z_{i}=z\}&\text{for }\{i:X_{i}^{*}=1;\,Z_{i}=z\}.\end{cases}

This is identical to the expression of \hat{D}_{i}\cdot 1(X_{i}^{*}=1) in (S30), and ensures (S27). ∎
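Lemma S6 is a finite-sample identity and can be checked directly. The R snippet below (simulated data, our own construction) verifies the first coordinate of \widehat{DX}_{i}=\hat{D}_{i}X_{i} for a three-level categorical covariate.

    set.seed(3)
    n  <- 300
    Xs <- factor(sample(1:3, n, replace = TRUE))  # categorical X* with K = 3
    Z  <- rbinom(n, 1, 0.5)
    D  <- rbinom(n, 1, plogis(Z - 1 + as.numeric(Xs) / 2))
    X  <- model.matrix(~ Xs - 1)                  # dummy columns 1(X* = k)
    ZX <- Z * X                                   # columns Z * 1(X* = k)
    Dhat <- fitted(lm(D ~ ZX + X - 1))            # fitted values from (S25)
    DX1  <- fitted(lm(D * X[, 1] ~ ZX + X - 1))   # first column of DX-hat
    all.equal(unname(DX1), unname(Dhat * X[, 1])) # TRUE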

S2.3 Lemmas for the partially interacted 2SLS

Assumption S2 below states the rank condition for the partially interacted 2sls.

Assumption S2.

𝔼{(ZiViXi)(DiVi,Xi)}\mathbb{E}\left\{\begin{pmatrix}Z_{i}V_{i}\\ X_{i}\end{pmatrix}\Big{(}D_{i}V_{i}^{\top},X_{i}^{\top}\Big{)}\right\} is invertible.

Lemma S7 below generalizes Lemma S4 and states some implications of Assumption S2 that are useful for the proofs. The proof is similar to that of Lemma S4 and is hence omitted.

Lemma S7.

Assume Assumption S2. We have

  1. (i)

    𝔼{(ZiViXi)(ZiVi,Xi)}\mathbb{E}\left\{\begin{pmatrix}Z_{i}V_{i}\\ X_{i}\end{pmatrix}\Big{(}Z_{i}V_{i}^{\top},X_{i}^{\top}\Big{)}\right\}, 𝔼(XiXi)\mathbb{E}(X_{i}X_{i}^{\top}), and 𝔼(ViVi)\mathbb{E}(V_{i}V_{i}^{\top}) are all invertible with

    proj(DiViZiVi,Xi)=C1ZiVi+C0Xi,proj(ZiViVi)=C2Vi\displaystyle\textup{proj}(D_{i}V_{i}\mid Z_{i}V_{i},X_{i})=C_{1}Z_{i}V_{i}+C_{0}X_{i},\qquad\textup{proj}(Z_{i}V_{i}\mid V_{i})=C_{2}V_{i}

    with

    (C1,C0)\displaystyle(C_{1},C_{0}) =\displaystyle= 𝔼{DiVi(ZiVi,Xi)}[𝔼{(ZiViXi)(ZiVi,Xi)}]1,\displaystyle\mathbb{E}\Big{\{}D_{i}V_{i}(Z_{i}V_{i}^{\top},X_{i}^{\top})\Big{\}}\left[\mathbb{E}\left\{\begin{pmatrix}Z_{i}V_{i}\\ X_{i}\end{pmatrix}\Big{(}Z_{i}V_{i}^{\top},X_{i}^{\top}\Big{)}\right\}\right]^{-1},
    C2\displaystyle C_{2} =\displaystyle= 𝔼(ZiViVi){𝔼(ViVi)}1.\displaystyle\mathbb{E}(Z_{i}V_{i}V_{i}^{\top})\big{\{}\mathbb{E}(V_{i}V_{i}^{\top})\big{\}}^{-1}.
  2. (ii)

    XiX_{i} takes at least KK distinct values. ViV_{i} takes at least QQ distinct values.

  3. (iii)

    C1C_{1}, C2C_{2}, and IQC2I_{Q}-C_{2} are all invertible.

Lemma S8.

Assume X_{i}=(V_{i}^{\top},W_{i}^{\top})^{\top}\in\mathbb{R}^{K}, where V_{i}\in\mathbb{R}^{Q} with Q\leq K, and \mathbb{E}(X_{i}X_{i}^{\top}) is invertible. Then \mathbb{E}(Z_{i}V_{i}\mid X_{i}) is linear in X_{i} and \mathbb{E}(Z_{i}\mid X_{i})\cdot\{V_{i}-\textup{proj}(Z_{i}V_{i}\mid X_{i})\} is linear in V_{i} if either of the following conditions holds:

  1. (i)

V_{i} is categorical with Q levels; \textup{proj}(Z_{i}V_{i}\mid X_{i})=\textup{proj}(Z_{i}V_{i}\mid V_{i}) and \mathbb{E}(Z_{i}\mid X_{i})=\mathbb{E}(Z_{i}\mid V_{i}).

  2. (ii)

Z_{i}\perp\!\!\!\perp X_{i}.

Proof of Lemma S8.

To ensure that \mathbb{E}(Z_{i}V_{i}\mid X_{i}) is linear in X_{i},

  • the sufficiency of Lemma S8(i) follows from \mathbb{E}(Z_{i}V_{i}\mid X_{i})=\mathbb{E}(Z_{i}V_{i}\mid V_{i}), where \mathbb{E}(Z_{i}V_{i}\mid V_{i}) is linear in V_{i} by Lemma S2(iii).

  • the sufficiency of Lemma S8(ii) follows from \mathbb{E}(Z_{i}V_{i}\mid X_{i})=\mathbb{E}(Z_{i}\mid X_{i})\cdot V_{i}=\mathbb{E}(Z_{i})\cdot V_{i}.

We verify below the sufficiency of the conditions for ensuring that 𝔼(ZiXi){Viproj(ZiViXi)}\mathbb{E}(Z_{i}\mid X_{i})\cdot\{V_{i}-\textup{proj}(Z_{i}V_{i}\mid X_{i})\} is linear in ViV_{i}. We omit the subscript ii in the proof.

Proof of Condition (i).

Given that \mathbb{E}(XX^{\top}) is invertible, Lemma S2(i) ensures that \mathbb{E}(VV^{\top}) is invertible. When V is categorical with Q levels, Lemma S2(iii) ensures that VV^{\top} is component-wise linear in V and that \mathbb{E}(Z\mid V) is linear in V. We can write

Vproj(ZVV)=CV,𝔼(ZV)=Vc,\displaystyle V-\textup{proj}(ZV\mid V)=CV,\quad\mathbb{E}(Z\mid V)=V^{\top}c, (S32)

for C=I_{Q}-\mathbb{E}(ZVV^{\top})\{\mathbb{E}(VV^{\top})\}^{-1} and some constant c\in\mathbb{R}^{Q}. This, together with \textup{proj}(ZV\mid X)=\textup{proj}(ZV\mid V) and \mathbb{E}(Z\mid X)=\mathbb{E}(Z\mid V), ensures

{Vproj(ZVX)}𝔼(ZX)={Vproj(ZVV)}𝔼(ZV)=(S32)CVVc,\displaystyle\Big{\{}V-\textup{proj}(ZV\mid X)\Big{\}}\cdot\mathbb{E}(Z\mid X)=\Big{\{}V-\textup{proj}(ZV\mid V)\Big{\}}\cdot\mathbb{E}(Z\mid V)\overset{\eqref{eq:v_iii_ss}}{=}CV\cdot V^{\top}c,

which is linear in V since VV^{\top} is component-wise linear in V.

Proof of Condition (ii).

When Z\perp\!\!\!\perp X, we have

proj(ZXX)=𝔼(Z)X\displaystyle\textup{proj}(ZX\mid X)=\mathbb{E}(Z)X (S33)

by Lemma S1(ii). Given V=(I_{Q},0)X, we have

proj(ZVX)\displaystyle\textup{proj}(ZV\mid X) =\displaystyle= proj{Z(IQ,0)XX}\displaystyle\textup{proj}\left\{Z\cdot(I_{Q},0)X\mid X\right\}
=\displaystyle= (IQ, 0)proj(ZXX)=(S33)(IQ, 0)𝔼(Z)X=𝔼(Z)V\displaystyle(I_{Q},\ 0)\cdot\textup{proj}(ZX\mid X)\overset{\eqref{eq:cond2_ss}}{=}(I_{Q},\ 0)\cdot\mathbb{E}(Z)X=\mathbb{E}(Z)V

so that \mathbb{E}(Z\mid X)\cdot\{V-\textup{proj}(ZV\mid X)\}=\mathbb{E}(Z)\cdot\{1-\mathbb{E}(Z)\}\cdot V. ∎

Lemma S9 below underlies the sufficient conditions in Proposition 4. Renew

B1=C1𝔼[{𝔼(ZiViXi)proj(ZiViXi)}{𝔼(ΔiXi)proj(ΔiXi)}],B2=C1𝔼[𝔼(ZiXi){Viproj(ZiViXi)}ϵiUi=c](Ui=c),\displaystyle\begin{array}[]{lll}B_{1}&=&C_{1}\cdot\mathbb{E}\left[\Big{\{}\mathbb{E}(Z_{i}V_{i}\mid X_{i})-\textup{proj}(Z_{i}V_{i}\mid X_{i})\Big{\}}\cdot\Big{\{}\mathbb{E}(\Delta_{i}\mid X_{i})-\textup{proj}(\Delta_{i}\mid X_{i})\Big{\}}\right],\\ B_{2}&=&C_{1}\cdot\mathbb{E}\Big{[}\mathbb{E}(Z_{i}\mid X_{i})\cdot\{V_{i}-\textup{proj}(Z_{i}V_{i}\mid X_{i})\}\cdot\epsilon_{i}\mid U_{i}=\textup{c}\Big{]}\cdot\mathbb{P}(U_{i}=\textup{c}),\end{array} (S36)

where C1C_{1} is the coefficient matrix of ZiViZ_{i}V_{i} in proj(DiViZiVi,Xi)\textup{proj}(D_{i}V_{i}\mid Z_{i}V_{i},X_{i}). Renew

ϵi\displaystyle\epsilon_{i} =\displaystyle= {τiViβcfor compliers;0for always-takers and never-takers,\displaystyle\left\{\begin{array}[]{cl}\tau_{i}-V_{i}^{\top}\beta_{\textup{c}}^{\prime}&\text{for compliers;}\\ 0&\text{for always-takers and never-takers,}\end{array}\right.
Δi\displaystyle\Delta_{i} =\displaystyle= {Yi(0)for compliers and never-takers;Yi(1)Viβcfor always-takers.\displaystyle\left\{\begin{array}[]{cl}Y_{i}(0)&\text{for compliers and never-takers;}\\ Y_{i}(1)-V_{i}^{\top}\beta_{\textup{c}}^{\prime}&\text{for always-takers.}\end{array}\right.

The renewed \epsilon_{i}, \Delta_{i}, and C_{1} reduce to their original definitions in Theorem 2 when V_{i}=X_{i}.

Lemma S9.

Recall B^{\prime}=\mathbb{E}\{\widetilde{DV_{i}}\cdot(Y_{i}-D_{i}V_{i}^{\top}\beta_{\textup{c}}^{\prime})\} with \widetilde{DV_{i}}=\textup{res}\{\textup{proj}(D_{i}V_{i}\mid Z_{i}V_{i},X_{i})\mid X_{i}\} in Proposition 4. Under Assumptions 1 and S2, we have B^{\prime}=B_{1}+B_{2}, with B_{1} and B_{2} satisfying the following:

  1. (i)

    B1=B2=0B_{1}=B_{2}=0 if either of Lemma S8(i)–(ii) holds.

  2. (ii)

    B1=0B_{1}=0 if 𝔼(ΔiXi)\mathbb{E}(\Delta_{i}\mid X_{i}) is linear in XiX_{i}.

    B2=0B_{2}=0 if τc(Xi)=𝔼(τiXi,Ui=c)\tau_{\textup{c}}(X_{i})=\mathbb{E}(\tau_{i}\mid X_{i},U_{i}=\textup{c}) is linear in ViV_{i}.

Proof of Lemma S9.

Observe that

YiDiViβc={Yi(1)Viβcfor always-takers Yi(0)+DiτiDiViβcfor compliers Yi(0)for never-takers =Δi+Ziϵi.\displaystyle Y_{i}-D_{i}V_{i}^{\top}\beta_{\textup{c}}^{\prime}=\left\{\begin{array}[]{ll}Y_{i}(1)-V_{i}^{\top}\beta_{\textup{c}}^{\prime}&\text{for always-takers }\\ Y_{i}(0)+D_{i}\tau_{i}-D_{i}V_{i}^{\top}\beta_{\textup{c}}^{\prime}&\text{for compliers }\\ Y_{i}(0)&\text{for never-takers }\end{array}\right.=\Delta_{i}+Z_{i}\epsilon_{i}. (S42)

This ensures

B\displaystyle B^{\prime} =\displaystyle= 𝔼{DVi~(YiDiViβc)}\displaystyle\mathbb{E}\left\{\widetilde{DV_{i}}\cdot(Y_{i}-D_{i}V_{i}^{\top}\beta_{\textup{c}}^{\prime})\right\}
=(S42)\displaystyle\overset{\eqref{eq:s6_1}}{=} 𝔼{DVi~(Δi+Ziϵi)}=𝔼(DVi~Δi)+𝔼(DVi~Ziϵi).\displaystyle\mathbb{E}\left\{\widetilde{DV_{i}}\cdot(\Delta_{i}+Z_{i}\epsilon_{i})\right\}=\mathbb{E}(\widetilde{DV_{i}}\cdot\Delta_{i})+\mathbb{E}(\widetilde{DV_{i}}\cdot Z_{i}\epsilon_{i}).

Therefore, it suffices to show that

B1=𝔼(DVi~Δi),B2=𝔼(DVi~Ziϵi)\displaystyle B_{1}=\mathbb{E}(\widetilde{DV_{i}}\cdot\Delta_{i}),\quad B_{2}=\mathbb{E}(\widetilde{DV_{i}}\cdot Z_{i}\epsilon_{i}) (S43)

for B1B_{1} and B2B_{2} in Eq. (S36). To simplify the notation, write

proj(DiViZiVi,Xi)=C1ZiVi+C0Xi,\displaystyle\textup{proj}(D_{i}V_{i}\mid Z_{i}V_{i},X_{i})=C_{1}Z_{i}V_{i}+C_{0}X_{i},

where C1C_{1} and C0C_{0} are well defined by Lemma S7 with C1C_{1} being invertible under Assumption S2. We have

DVi~\displaystyle\widetilde{DV_{i}} =\displaystyle= res{proj(DiViZiXi,Xi)Xi}\displaystyle\textup{res}\Big{\{}\textup{proj}(D_{i}V_{i}\mid Z_{i}X_{i},X_{i})\mid X_{i}\Big{\}}
=\displaystyle= res(C1ZiVi+C0XiXi)\displaystyle\textup{res}(C_{1}Z_{i}V_{i}+C_{0}X_{i}\mid X_{i})
=\displaystyle= C1res(ZiViXi)\displaystyle C_{1}\cdot\textup{res}(Z_{i}V_{i}\mid X_{i})
=\displaystyle= C1{ZiViproj(ZiViXi)},\displaystyle C_{1}\Big{\{}Z_{i}V_{i}-\textup{proj}(Z_{i}V_{i}\mid X_{i})\Big{\}},
DVi~Zi\displaystyle\widetilde{DV_{i}}\cdot Z_{i} =\displaystyle= C1{ZiViZiproj(ZiViXi)}=C1Zi{Viproj(ZiViXi)}.\displaystyle C_{1}\Big{\{}Z_{i}V_{i}-Z_{i}\cdot\textup{proj}(Z_{i}V_{i}\mid X_{i})\Big{\}}=C_{1}\cdot Z_{i}\cdot\Big{\{}V_{i}-\textup{proj}(Z_{i}V_{i}\mid X_{i})\Big{\}}.

This ensures

𝔼(DVi~Xi)=C1{𝔼(ZiViXi)proj(ZiViXi)},𝔼(DVi~ZiXi)=C1𝔼(ZiXi){Viproj(ZiViXi)}.\displaystyle\begin{array}[]{rll}\mathbb{E}(\widetilde{DV_{i}}\mid X_{i})&=&C_{1}\Big{\{}\mathbb{E}(Z_{i}V_{i}\mid X_{i})-\textup{proj}(Z_{i}V_{i}\mid X_{i})\Big{\}},\\ \mathbb{E}(\widetilde{DV_{i}}\cdot Z_{i}\mid X_{i})&=&C_{1}\cdot\mathbb{E}(Z_{i}\mid X_{i})\cdot\Big{\{}V_{i}-\textup{proj}(Z_{i}V_{i}\mid X_{i})\Big{\}}.\end{array} (S46)

Let ξi=Δiproj(ΔiXi)\xi_{i}=\Delta_{i}-\textup{proj}(\Delta_{i}\mid X_{i}) with 𝔼(ξiXi)=𝔼(ΔiXi)proj(ΔiXi)\mathbb{E}(\xi_{i}\mid X_{i})=\mathbb{E}(\Delta_{i}\mid X_{i})-\textup{proj}(\Delta_{i}\mid X_{i}). Properties of linear projection ensure 𝔼{DVi~proj(ΔiXi)}=0\mathbb{E}\{\widetilde{DV_{i}}\cdot\textup{proj}(\Delta_{i}\mid X_{i})\}=0 such that

𝔼(DVi~Δi)=𝔼{DVi~proj(ΔiXi)}+𝔼(DVi~ξi)=𝔼(DVi~ξi).\displaystyle\mathbb{E}(\widetilde{DV_{i}}\cdot\Delta_{i})=\mathbb{E}\left\{\widetilde{DV_{i}}\cdot\textup{proj}(\Delta_{i}\mid X_{i})\right\}+\mathbb{E}(\widetilde{DV_{i}}\cdot\xi_{i})=\mathbb{E}(\widetilde{DV_{i}}\cdot\xi_{i}). (S47)

In addition, DVi~\widetilde{DV_{i}} is a function of (Xi,Zi)(X_{i},Z_{i}), whereas ξi\xi_{i} and ϵi\epsilon_{i} are both functions of {Yi(z,d),Di(z),Xi:z=0,1;d=0,1}\{Y_{i}(z,d),D_{i}(z),X_{i}:z=0,1;\ d=0,1\}. This ensures

DVi~ξiXi,DVi~ZiϵiXi\displaystyle\widetilde{DV_{i}}\perp\!\!\!\perp\xi_{i}\mid X_{i},\qquad\widetilde{DV_{i}}\cdot Z_{i}\perp\!\!\!\perp\epsilon_{i}\mid X_{i} (S48)

under Assumption 1.

Proof of B1=𝔼(DVi~Δi)B_{1}=\mathbb{E}(\widetilde{DV_{i}}\cdot\Delta_{i}) in (S43) and sufficient conditions of B1=0B_{1}=0.

(S46)–(S48) ensure

𝔼(DVi~Δi)\displaystyle\mathbb{E}(\widetilde{DV_{i}}\cdot\Delta_{i}) =(S47)\displaystyle\overset{\eqref{eq:Delta_sub}}{=} 𝔼(DVi~ξi)\displaystyle\mathbb{E}(\widetilde{DV_{i}}\cdot\xi_{i})
=\displaystyle= 𝔼{𝔼(DVi~ξiXi)}\displaystyle\mathbb{E}\left\{\mathbb{E}(\widetilde{DV_{i}}\cdot\xi_{i}\mid X_{i})\right\}
=(S48)\displaystyle\overset{\eqref{eq:indep_sub}}{=} 𝔼[𝔼(DVi~Xi)𝔼(ξiXi)]\displaystyle\mathbb{E}\left[\mathbb{E}(\widetilde{DV_{i}}\mid X_{i})\cdot\mathbb{E}(\xi_{i}\mid X_{i})\right]
=(S46)\displaystyle\overset{\eqref{eq:tdvi0}}{=} C1𝔼[{𝔼(ZiViXi)proj(ZiViXi)}{𝔼(ΔiXi)proj(ΔiXi)}]\displaystyle C_{1}\cdot\mathbb{E}\left[\Big{\{}\mathbb{E}(Z_{i}V_{i}\mid X_{i})-\textup{proj}(Z_{i}V_{i}\mid X_{i})\Big{\}}\cdot\Big{\{}\mathbb{E}(\Delta_{i}\mid X_{i})-\textup{proj}(\Delta_{i}\mid X_{i})\Big{\}}\right]
=\displaystyle= B1.\displaystyle B_{1}.

By Lemma S8, Lemma S9(i) ensures 𝔼(ZiViXi)\mathbb{E}(Z_{i}V_{i}\mid X_{i}) is linear in XiX_{i} so that B1=0B_{1}=0. By Lemma S1(i), Lemma S9(ii) ensures 𝔼(ΔiXi)=proj(ΔiXi)\mathbb{E}(\Delta_{i}\mid X_{i})=\textup{proj}(\Delta_{i}\mid X_{i}) so that B1=0B_{1}=0.

Proof of B2=𝔼(DVi~Ziϵi)B_{2}=\mathbb{E}(\widetilde{DV_{i}}\cdot Z_{i}\epsilon_{i}) in (S43) and sufficient conditions of B2=0B_{2}=0.

Let g(Xi)=𝔼(ZiXi){Viproj(ZiViXi)}g(X_{i})=\mathbb{E}(Z_{i}\mid X_{i})\cdot\{V_{i}-\textup{proj}(Z_{i}V_{i}\mid X_{i})\} be a shorthand notation. (S46) and (S48) ensure

𝔼(DVi~Ziϵi)\displaystyle\mathbb{E}(\widetilde{DV_{i}}\cdot Z_{i}\epsilon_{i}) =(S48)\displaystyle\overset{\eqref{eq:indep_sub}}{=} 𝔼{𝔼(DVi~ZiXi)𝔼(ϵiXi)}\displaystyle\mathbb{E}\left\{\mathbb{E}(\widetilde{DV_{i}}\cdot Z_{i}\mid X_{i})\cdot\mathbb{E}(\epsilon_{i}\mid X_{i})\right\}
=(S46)\displaystyle\overset{\eqref{eq:tdvi0}}{=} C1𝔼{g(Xi)𝔼(ϵiXi)}\displaystyle C_{1}\cdot\mathbb{E}\Big{\{}g(X_{i})\cdot\mathbb{E}(\epsilon_{i}\mid X_{i})\Big{\}}
=\displaystyle= C1𝔼[𝔼{g(Xi)ϵiXi}]\displaystyle C_{1}\cdot\mathbb{E}\Big{[}\mathbb{E}\big{\{}g(X_{i})\cdot\epsilon_{i}\mid X_{i}\big{\}}\Big{]}
=\displaystyle= C1𝔼{g(Xi)ϵi}\displaystyle C_{1}\cdot\mathbb{E}\big{\{}g(X_{i})\cdot\epsilon_{i}\big{\}}
=\displaystyle= C1𝔼{g(Xi)ϵiUi=c}(Ui=c)=B2.\displaystyle C_{1}\cdot\mathbb{E}\big{\{}g(X_{i})\cdot\epsilon_{i}\mid U_{i}=\textup{c}\big{\}}\cdot\mathbb{P}(U_{i}=\textup{c})=B_{2}.

Given C1C_{1} is invertible and (Ui=c)>0\mathbb{P}(U_{i}=\textup{c})>0, we have B2=0B_{2}=0 if and only if

𝔼{g(Xi)ϵiUi=c}=0.\displaystyle\mathbb{E}\big{\{}g(X_{i})\cdot\epsilon_{i}\mid U_{i}=\textup{c}\big{\}}=0. (S49)

The definition of ϵi\epsilon_{i} ensures that (S49) holds when g(Xi)g(X_{i}) is linear in ViV_{i}. By Lemma S8, Lemma S9(i) ensures that g(Xi)g(X_{i}) is linear in ViV_{i}.

In addition, we have

𝔼{g(Xi)ϵiUi=c}\displaystyle\mathbb{E}\big{\{}g(X_{i})\cdot\epsilon_{i}\mid U_{i}=\textup{c}\big{\}} =\displaystyle= 𝔼[𝔼{g(Xi)ϵiXi,Ui=c}Ui=c]\displaystyle\mathbb{E}\Big{[}\mathbb{E}\Big{\{}g(X_{i})\cdot\epsilon_{i}\mid X_{i},\,U_{i}=\textup{c}\Big{\}}\mid U_{i}=\textup{c}\Big{]}
=\displaystyle= 𝔼{g(Xi)𝔼(ϵiXi,Ui=c)Ui=c}\displaystyle\mathbb{E}\Big{\{}g(X_{i})\cdot\mathbb{E}\left(\epsilon_{i}\mid X_{i},\,U_{i}=\textup{c}\right)\mid U_{i}=\textup{c}\Big{\}}

so that (S49) holds if 𝔼(ϵiXi,Ui=c)=0\mathbb{E}(\epsilon_{i}\mid X_{i},U_{i}=\textup{c})=0. This is equivalent to 𝔼(τiXi,Ui=c)=Viβc\mathbb{E}(\tau_{i}\mid X_{i},U_{i}=\textup{c})=V_{i}^{\top}\beta_{\textup{c}}^{\prime}, which is ensured by Lemma S9(ii). ∎

S2.4 Two linear algebra propositions

Propositions S4–S5 below state two linear algebra results that underlie Proposition 1(ii). We did not find formal statements or proofs of these results in the literature, and therefore provide our own proofs in Section S5.

Proposition S4.

Assume X=(X_{0},X_{1},\ldots,X_{m})^{\top} is an (m+1)\times 1 vector such that XX_{0} is linear in (1,X). That is, there exist constants \{c_{i},b_{i},a_{ij}\in\mathbb{R}:i=0,\ldots,m;\,j=1,\ldots,m\} such that

XX0=(X0X1Xm)X0=(c0c1cm)+(b0b1bm)X0+(a01a0ma11a1mam1amm)(X1Xm).\displaystyle XX_{0}=\begin{pmatrix}X_{0}\\ X_{1}\\ \vdots\\ X_{m}\end{pmatrix}X_{0}=\begin{pmatrix}c_{0}\\ c_{1}\\ \vdots\\ c_{m}\end{pmatrix}+\begin{pmatrix}b_{0}\\ b_{1}\\ \vdots\\ b_{m}\end{pmatrix}X_{0}+\begin{pmatrix}a_{01}&\ldots&a_{0m}\\ a_{11}&\ldots&a_{1m}\\ &\ddots&\\ a_{m1}&\ldots&a_{mm}\end{pmatrix}\begin{pmatrix}X_{1}\\ \vdots\\ X_{m}\end{pmatrix}. (S50)

Then X0X_{0} takes at most m+2m+2 distinct values across all solutions of XX to (S50).

Proposition S5 below extends Proposition S4 and ensures that if XX_{0},XX_{1},\ldots,XX_{m} are all linear in (1,X), then the whole vector X=(X_{0},X_{1},\ldots,X_{m})^{\top} takes at most m+2 distinct values.

Proposition S5.

Assume X=(X_{1},\ldots,X_{m})^{\top} is an m\times 1 vector such that XX^{\top} is component-wise linear in (1,X). That is, there exist constants \{c_{ij},a_{ij[k]}\in\mathbb{R}:i,j,k=1,\ldots,m\} such that

X_{i}X_{j}=c_{ij}+\sum_{k=1}^{m}a_{ij[k]}X_{k}\quad\text{for }i,j=1,\ldots,m.

Then X takes at most m+1 distinct values.
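For intuition, consider the scalar case m=1 (our own illustration): the condition reads X_{1}^{2}=c_{11}+a_{11[1]}X_{1}, a quadratic equation in X_{1} with at most two roots, so X_{1} takes at most m+1=2 distinct values.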

Corollary S1 is a direct implication of Propositions S4–S5.

Corollary S1.

Let X=(X1,,Xm)X=(X_{1},\ldots,X_{m})^{\top} be an m×1m\times 1 vector.

  1. (i)

If there exists an m\times 1 constant vector c=(c_{1},\ldots,c_{m})^{\top} such that XX^{\top}c is linear in (1,X), then X^{\top}c takes at most m+1 distinct values.

  2. (ii)

If XX^{\top}c is linear in (1,X) for all m\times 1 constant vectors c\in\mathbb{R}^{m}, then X takes at most m+1 distinct values.

S3 Proofs of the results in the main paper

S3.1 Proofs of the results in Section 2

Proof of Proposition 1.

We verify below Proposition 1(i)–(iii), respectively.

Proof of Proposition 1(i).

The sufficiency of categorical covariates follows from Lemma S2(iii). The sufficiency of a randomly assigned IV follows from \mathbb{E}(Z_{i}X_{i}\mid X_{i})=\mathbb{E}(Z_{i}\mid X_{i})X_{i}=\mathbb{E}(Z_{i})X_{i}.

Proof of Proposition 1(ii).

Let X_{i[-1]}=(X_{i1},\ldots,X_{i,K-1})^{\top} with X_{i}=(1,X_{i[-1]}^{\top})^{\top}. Assumption 2 requires that

\mathbb{E}(Z_{i}\mid X_{i}) and \mathbb{E}(Z_{i}X_{i[-1]}\mid X_{i}) are both linear in X_{i}. \qquad (S51)

From (S51), there exist constants a_{0}\in\mathbb{R} and a_{X}\in\mathbb{R}^{K-1} such that

\mathbb{E}(Z_{i}\mid X_{i})=a_{0}+X_{i[-1]}^{\top}a_{X}. \qquad (S52)

This ensures

\mathbb{E}(Z_{i}X_{i[-1]}\mid X_{i})=X_{i[-1]}\cdot\mathbb{E}(Z_{i}\mid X_{i})\overset{(S52)}{=}X_{i[-1]}a_{0}+X_{i[-1]}X_{i[-1]}^{\top}a_{X}. \qquad (S53)

From (S53), for \mathbb{E}(Z_{i}X_{i[-1]}\mid X_{i}) to be linear in (1,X_{i[-1]}) as required by (S51), we need X_{i[-1]}X_{i[-1]}^{\top}a_{X} to be linear in (1,X_{i[-1]}). By Corollary S1(i), this implies that X_{i[-1]}^{\top}a_{X} takes at most K distinct values, which in turn ensures that \mathbb{E}(Z_{i}\mid X_{i}) takes at most K distinct values given (S52).

In addition, for Assumption 2 to hold under an arbitrary IV assignment mechanism, X_{i[-1]}X_{i[-1]}^{\top}a_{X} must be linear in (1,X_{i[-1]}) for arbitrary a_{X}. By Corollary S1(ii), this implies that X_{i} is categorical.

Proof of Proposition 1(iii).

Assume that the potential outcomes satisfy

\mathbb{E}\{Y_{i}(1)\mid X_{i},U_{i}=\textup{a}\}=\mathbb{E}\{Y_{i}(1)\mid X_{i},U_{i}=\textup{c}\}=X_{i}^{\top}\beta_{1},\qquad\mathbb{E}\{Y_{i}(0)\mid X_{i},U_{i}=\textup{n}\}=\mathbb{E}\{Y_{i}(0)\mid X_{i},U_{i}=\textup{c}\}=X_{i}^{\top}\beta_{0} \qquad (S56)

for some fixed \beta_{0},\beta_{1}\in\mathbb{R}^{K}. Define

\eta_{i}=Y_{i}-D_{i}X_{i}^{\top}(\beta_{1}-\beta_{0})-X_{i}^{\top}\beta_{0} \qquad (S57)

with

\eta_{i}=\begin{cases}Y_{i}(1)-X_{i}^{\top}\beta_{1}&\text{if }U_{i}=\textup{a};\\ Y_{i}(0)-X_{i}^{\top}\beta_{0}&\text{if }U_{i}=\textup{n};\\ Y_{i}(1)-X_{i}^{\top}\beta_{1}&\text{if }U_{i}=\textup{c}\text{ and }Z_{i}=1;\\ Y_{i}(0)-X_{i}^{\top}\beta_{0}&\text{if }U_{i}=\textup{c}\text{ and }Z_{i}=0.\end{cases} \qquad (S62)

Under Assumption 1, we have

𝔼(ηiZi,Xi,Ui=a)=(S62)𝔼{Yi(1)Xiβ1Zi,Xi,Ui=a}=Assm. 1𝔼{Yi(1)Xiβ1Xi,Ui=a}=(S56)0,𝔼(ηiZi,Xi,Ui=n)=(S62)𝔼{Yi(0)Xiβ0Zi,Xi,Ui=n}=Assm. 1𝔼{Yi(0)Xiβ0Xi,Ui=n}=(S56)0,\displaystyle\begin{array}[]{lclll}\mathbb{E}(\eta_{i}\mid Z_{i},X_{i},U_{i}=\textup{a})&\overset{\eqref{eq:eta}}{=}&\mathbb{E}\left\{Y_{i}(1)-X_{i}^{\top}\beta_{1}\mid Z_{i},X_{i},U_{i}=\textup{a}\right\}\\ &\overset{\text{Assm. \ref{assm:iv}}}{=}&\mathbb{E}\left\{Y_{i}(1)-X_{i}^{\top}\beta_{1}\mid X_{i},U_{i}=\textup{a}\right\}\\ &\overset{\eqref{eq:linear po}}{=}&0,\\ \mathbb{E}(\eta_{i}\mid Z_{i},X_{i},U_{i}=\textup{n})&\overset{\eqref{eq:eta}}{=}&\mathbb{E}\left\{Y_{i}(0)-X_{i}^{\top}\beta_{0}\mid Z_{i},X_{i},U_{i}=\textup{n}\right\}\\ &\overset{\text{Assm. \ref{assm:iv}}}{=}&\mathbb{E}\left\{Y_{i}(0)-X_{i}^{\top}\beta_{0}\mid X_{i},U_{i}=\textup{n}\right\}\\ &\overset{\eqref{eq:linear po}}{=}&0,\end{array}

and

𝔼(ηiZi=1,Xi,Ui=c)\displaystyle\mathbb{E}(\eta_{i}\mid Z_{i}=1,X_{i},U_{i}=\textup{c}) =(S62)\displaystyle\overset{\eqref{eq:eta}}{=} 𝔼{Yi(1)Xiβ1Zi=1,Xi,Ui=c}\displaystyle\mathbb{E}\left\{Y_{i}(1)-X_{i}^{\top}\beta_{1}\mid Z_{i}=1,X_{i},U_{i}=\textup{c}\right\}
=Assm. 1\displaystyle\overset{\text{Assm. \ref{assm:iv}}}{=} 𝔼{Yi(1)Xiβ1Xi,Ui=c}\displaystyle\mathbb{E}\left\{Y_{i}(1)-X_{i}^{\top}\beta_{1}\mid X_{i},U_{i}=\textup{c}\right\}
=(S56)\displaystyle\overset{\eqref{eq:linear po}}{=} 0,\displaystyle 0,
𝔼(ηiZi=0,Xi,Ui=c)\displaystyle\mathbb{E}(\eta_{i}\mid Z_{i}=0,X_{i},U_{i}=\textup{c}) =(S62)\displaystyle\overset{\eqref{eq:eta}}{=} 𝔼{Yi(0)Xiβ0Zi=0,Xi,Ui=c}\displaystyle\mathbb{E}\left\{Y_{i}(0)-X_{i}^{\top}\beta_{0}\mid Z_{i}=0,X_{i},U_{i}=\textup{c}\right\}
=Assm. 1\displaystyle\overset{\text{Assm. \ref{assm:iv}}}{=} 𝔼{Yi(0)Xiβ0Xi,Ui=c}\displaystyle\mathbb{E}\left\{Y_{i}(0)-X_{i}^{\top}\beta_{0}\mid X_{i},U_{i}=\textup{c}\right\}
=(S56)\displaystyle\overset{\eqref{eq:linear po}}{=} 0.\displaystyle 0.

This ensures 𝔼(ηiZi,Xi,Ui)=0\mathbb{E}(\eta_{i}\mid Z_{i},X_{i},U_{i})=0 so that

𝔼(ηiZi,Xi)=𝔼{𝔼(ηiZi,Xi,Ui)Zi,Xi}=0.\displaystyle\mathbb{E}(\eta_{i}\mid Z_{i},X_{i})=\mathbb{E}\{\mathbb{E}(\eta_{i}\mid Z_{i},X_{i},U_{i})\mid Z_{i},X_{i}\}=0. (S64)

(S57) and (S64) ensure that Assumption 3 holds. ∎
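The construction above admits a quick numerical sanity check. The following R sketch (the data-generating process and all parameter values are ours, purely for illustration) simulates a LATE model whose potential outcome means are linear as in (S56), builds $\eta_{i}$ as in (S57), and confirms that regressing $\eta_{i}$ on $(1,Z_{i},X_{i1},Z_{i}X_{i1})$ returns coefficients near zero, in line with (S64):

# Illustrative simulation only: checks E(eta | Z, X) = 0 under (S56)-(S57).
set.seed(1)
N  <- 2e5
X1 <- rnorm(N)                                   # covariate; X_i = (1, X1)
Z  <- rbinom(N, 1, plogis(0.5 * X1))             # IV valid conditional on X
U  <- sample(c("a", "n", "c"), N, replace = TRUE, prob = c(0.2, 0.3, 0.5))
D  <- ifelse(U == "a", 1, ifelse(U == "n", 0, Z))    # treatment by compliance type
b0 <- c(1, 2); b1 <- c(3, -1)                    # beta_0, beta_1 in (S56)
Xm <- cbind(1, X1)
Y  <- ifelse(D == 1, Xm %*% b1 + rnorm(N), Xm %*% b0 + rnorm(N))
eta <- Y - D * (Xm %*% (b1 - b0)) - Xm %*% b0    # eta_i as in (S57)
round(coef(lm(eta ~ Z * X1)), 3)                 # all coefficients approx 0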

S3.2 Proofs of the results in Section 3.2

Proof of Theorem 2.

With a slight abuse of notation, renew $\widehat{DX}_{i}=\textup{proj}(D_{i}X_{i}\mid Z_{i}X_{i},X_{i})$ as the population fitted value from the first stage of the interacted 2sls throughout this proof. From Proposition S1, $\beta_{\textsc{2sls}}$ equals the coefficient vector of $\widehat{DX}_{i}$ in $\textup{proj}(Y_{i}\mid\widehat{DX}_{i},X_{i})$. Recall $\widetilde{DX}_{i}=\textup{res}(\widehat{DX}_{i}\mid X_{i})$. The population FWL in Lemma S3 ensures

\beta_{\textsc{2sls}}=\{\mathbb{E}(\widetilde{DX}_{i}\cdot\widetilde{DX}_{i}^{\top})\}^{-1}\mathbb{E}(\widetilde{DX}_{i}\cdot Y_{i}). (S65)

Write $Y_{i}=(D_{i}X_{i})^{\top}\beta_{\textup{c}}+(Y_{i}-D_{i}X_{i}^{\top}\beta_{\textup{c}})$ with

\widetilde{DX}_{i}\cdot Y_{i}=\widetilde{DX}_{i}\cdot(D_{i}X_{i})^{\top}\beta_{\textup{c}}+\widetilde{DX}_{i}\cdot(Y_{i}-D_{i}X_{i}^{\top}\beta_{\textup{c}}).

This ensures that the $\mathbb{E}(\widetilde{DX}_{i}\cdot Y_{i})$ on the right-hand side of (S65) equals

\mathbb{E}(\widetilde{DX}_{i}\cdot Y_{i})=\mathbb{E}\{\widetilde{DX}_{i}\cdot(D_{i}X_{i})^{\top}\}\cdot\beta_{\textup{c}}+\mathbb{E}\{\widetilde{DX}_{i}\cdot(Y_{i}-D_{i}X_{i}^{\top}\beta_{\textup{c}})\}=\mathbb{E}\{\widetilde{DX}_{i}\cdot(D_{i}X_{i})^{\top}\}\cdot\beta_{\textup{c}}+B, (S66)

where

B=\mathbb{E}\{\widetilde{DX}_{i}\cdot(Y_{i}-D_{i}X_{i}^{\top}\beta_{\textup{c}})\}.

Plugging (S66) into (S65) ensures

\beta_{\textsc{2sls}}=\{\mathbb{E}(\widetilde{DX}_{i}\cdot\widetilde{DX}_{i}^{\top})\}^{-1}\mathbb{E}\{\widetilde{DX}_{i}\cdot(D_{i}X_{i})^{\top}\}\cdot\beta_{\textup{c}}+\{\mathbb{E}(\widetilde{DX}_{i}\cdot\widetilde{DX}_{i}^{\top})\}^{-1}B.

Therefore, it suffices to show that

\mathbb{E}\{\widetilde{DX}_{i}\cdot(D_{i}X_{i})^{\top}\}=\mathbb{E}(\widetilde{DX}_{i}\cdot\widetilde{DX}_{i}^{\top}),\quad B=B_{1}+B_{2}. (S67)

The sufficiency of Assumptions 2 and 3 for ensuring $B_{1}=B_{2}=0$ then follows from Section 4 and Lemma S5.

Proof of $\mathbb{E}\{\widetilde{DX}_{i}\cdot(D_{i}X_{i})^{\top}\}=\mathbb{E}(\widetilde{DX}_{i}\cdot\widetilde{DX}_{i}^{\top})$ in (S67).

Write

D_{i}X_{i}-\widetilde{DX}_{i}=(D_{i}X_{i}-\widehat{DX}_{i})+(\widehat{DX}_{i}-\widetilde{DX}_{i})=\textup{res}(D_{i}X_{i}\mid Z_{i}X_{i},X_{i})+\textup{proj}(\widehat{DX}_{i}\mid X_{i}). (S68)

By definition, $\widetilde{DX}_{i}=\widehat{DX}_{i}-\textup{proj}(\widehat{DX}_{i}\mid X_{i})$ is a linear combination of $(Z_{i}X_{i},X_{i})$. Properties of linear projection ensure

\mathbb{E}\{\widetilde{DX}_{i}\cdot\textup{res}(D_{i}X_{i}\mid Z_{i}X_{i},X_{i})^{\top}\}=0,\quad \mathbb{E}\{\widetilde{DX}_{i}\cdot\textup{proj}(\widehat{DX}_{i}\mid X_{i})^{\top}\}=0. (S69)

(S68)–(S69) together ensure

\mathbb{E}\{\widetilde{DX}_{i}\cdot(D_{i}X_{i})^{\top}\}-\mathbb{E}(\widetilde{DX}_{i}\cdot\widetilde{DX}_{i}^{\top})=\mathbb{E}\{\widetilde{DX}_{i}\cdot(D_{i}X_{i}-\widetilde{DX}_{i})^{\top}\}\overset{\text{(S68)}}{=}\mathbb{E}\{\widetilde{DX}_{i}\cdot\textup{res}(D_{i}X_{i}\mid Z_{i}X_{i},X_{i})^{\top}\}+\mathbb{E}\{\widetilde{DX}_{i}\cdot\textup{proj}(\widehat{DX}_{i}\mid X_{i})^{\top}\}\overset{\text{(S69)}}{=}0.
Proof of $B=B_{1}+B_{2}$ in (S67).

Recall from (S7) that $Y_{i}-D_{i}X_{i}^{\top}\beta_{\textup{c}}=\Delta_{i}+Z_{i}\epsilon_{i}$. We have

B=\mathbb{E}\{\widetilde{DX}_{i}\cdot(Y_{i}-D_{i}X_{i}^{\top}\beta_{\textup{c}})\}\overset{\text{(S7)}}{=}\mathbb{E}\{\widetilde{DX}_{i}\cdot(\Delta_{i}+Z_{i}\epsilon_{i})\}=\mathbb{E}(\widetilde{DX}_{i}\cdot\Delta_{i})+\mathbb{E}(\widetilde{DX}_{i}\cdot Z_{i}\epsilon_{i}).

We show below

B_{1}=\mathbb{E}(\widetilde{DX}_{i}\cdot\Delta_{i}),\quad B_{2}=\mathbb{E}(\widetilde{DX}_{i}\cdot Z_{i}\epsilon_{i}) (S70)

to complete the proof. Recall from (S2) that

\widetilde{DX}_{i}=C_{1}\{Z_{i}X_{i}-\textup{proj}(Z_{i}X_{i}\mid X_{i})\} (S71)
\overset{\text{(S2)}}{=}C_{1}(Z_{i}X_{i}-C_{2}X_{i}) (S72)

so that

\widetilde{DX}_{i}\cdot Z_{i}\overset{\text{(S72)}}{=}C_{1}(Z_{i}X_{i}-C_{2}Z_{i}X_{i})=C_{1}(I_{K}-C_{2})\cdot Z_{i}X_{i}. (S73)
Proof of $B_{1}=\mathbb{E}(\widetilde{DX}_{i}\cdot\Delta_{i})$ in (S70).

Let

\xi_{i}=\Delta_{i}-\textup{proj}(\Delta_{i}\mid X_{i}) (S74)

denote the residual from the linear projection of $\Delta_{i}$ on $X_{i}$. Properties of linear projection ensure $\mathbb{E}\{\widetilde{DX}_{i}\cdot\textup{proj}(\Delta_{i}\mid X_{i})\}=0$ so that

\mathbb{E}(\widetilde{DX}_{i}\cdot\Delta_{i})\overset{\text{(S74)}}{=}\mathbb{E}(\widetilde{DX}_{i}\cdot\xi_{i})+\mathbb{E}\{\widetilde{DX}_{i}\cdot\textup{proj}(\Delta_{i}\mid X_{i})\}=\mathbb{E}(\widetilde{DX}_{i}\cdot\xi_{i}). (S75)

In addition, $\xi_{i}$ is a function of $\{Y_{i}(z,d),D_{i}(z),X_{i}:z=0,1;\,d=0,1\}$, whereas $\widetilde{DX}_{i}$ is a function of $(Z_{i},X_{i})$. This ensures

\xi_{i}\perp\!\!\!\perp\widetilde{DX}_{i}\mid X_{i} (S76)

under Assumption 1(i). Accordingly, we have

\mathbb{E}(\widetilde{DX}_{i}\cdot\Delta_{i})\overset{\text{(S75)}}{=}\mathbb{E}(\widetilde{DX}_{i}\cdot\xi_{i})=\mathbb{E}\{\mathbb{E}(\widetilde{DX}_{i}\cdot\xi_{i}\mid X_{i})\}\overset{\text{(S76)}}{=}\mathbb{E}\{\mathbb{E}(\widetilde{DX}_{i}\mid X_{i})\cdot\mathbb{E}(\xi_{i}\mid X_{i})\}\overset{\text{(S71)}+\text{(S74)}}{=}C_{1}\cdot\mathbb{E}[\{\mathbb{E}(Z_{i}X_{i}\mid X_{i})-\textup{proj}(Z_{i}X_{i}\mid X_{i})\}\cdot\{\mathbb{E}(\Delta_{i}\mid X_{i})-\textup{proj}(\Delta_{i}\mid X_{i})\}]=B_{1},

where the second to last equality follows from

\mathbb{E}(\widetilde{DX}_{i}\mid X_{i})=C_{1}\cdot\{\mathbb{E}(Z_{i}X_{i}\mid X_{i})-\textup{proj}(Z_{i}X_{i}\mid X_{i})\} \ \text{by (S71)},
\mathbb{E}(\xi_{i}\mid X_{i})=\mathbb{E}(\Delta_{i}\mid X_{i})-\textup{proj}(\Delta_{i}\mid X_{i}) \ \text{by (S74)}.
Proof of $B_{2}=\mathbb{E}(\widetilde{DX}_{i}\cdot Z_{i}\epsilon_{i})$ in (S70).

Note that $\epsilon_{i}$ is a function of $\{Y_{i}(z,d),D_{i}(z),X_{i}:z=0,1;\,d=0,1\}$. Assumption 1(i) ensures

Z_{i}X_{i}\perp\!\!\!\perp\epsilon_{i}\mid X_{i}. (S77)

Accordingly, we have

\mathbb{E}(Z_{i}X_{i}\cdot\epsilon_{i})=\mathbb{E}\{\mathbb{E}(Z_{i}X_{i}\cdot\epsilon_{i}\mid X_{i})\}\overset{\text{(S77)}}{=}\mathbb{E}\{\mathbb{E}(Z_{i}X_{i}\mid X_{i})\cdot\mathbb{E}(\epsilon_{i}\mid X_{i})\}=\mathbb{E}[\mathbb{E}\{\mathbb{E}(Z_{i}X_{i}\mid X_{i})\cdot\epsilon_{i}\mid X_{i}\}]=\mathbb{E}\{\mathbb{E}(Z_{i}X_{i}\mid X_{i})\cdot\epsilon_{i}\}=\mathbb{E}\{\mathbb{E}(Z_{i}X_{i}\mid X_{i})\cdot\epsilon_{i}\mid U_{i}=\textup{c}\}\cdot\mathbb{P}(U_{i}=\textup{c}). (S78)

This, combined with (S73), ensures

\mathbb{E}(\widetilde{DX}_{i}\cdot Z_{i}\epsilon_{i})\overset{\text{(S73)}}{=}C_{1}(I_{K}-C_{2})\cdot\mathbb{E}(Z_{i}X_{i}\cdot\epsilon_{i})\overset{\text{(S78)}}{=}C_{1}(I_{K}-C_{2})\cdot\mathbb{E}\{\mathbb{E}(Z_{i}X_{i}\mid X_{i})\cdot\epsilon_{i}\mid U_{i}=\textup{c}\}\cdot\mathbb{P}(U_{i}=\textup{c})=B_{2}.

Proof of Lemma 1.

The expression of $\tau_{\textup{c}}(X_{i})$ follows from Lemma S2(iii) and Lemma S1(i). We verify below the expression of $\beta_{\textup{c}}$.

Recall from Eq. (4) that $\beta_{\textup{c}}=\{\mathbb{E}(X_{i}X_{i}^{\top}\mid U_{i}=\textup{c})\}^{-1}\cdot\mathbb{E}(X_{i}\tau_{i}\mid U_{i}=\textup{c})$. When $X_{i}=(1(X_{i}^{*}=1),\ldots,1(X_{i}^{*}=K))^{\top}$, we have

(i) $X_{i}X_{i}^{\top}=\operatorname{diag}\{1(X_{i}^{*}=k)\}_{k=1}^{K}$ so that

\mathbb{E}(X_{i}X_{i}^{\top}\mid U_{i}=\textup{c})=\operatorname{diag}(p_{k})_{k=1}^{K}, (S79)

where $p_{k}=\mathbb{E}\{1(X_{i}^{*}=k)\mid U_{i}=\textup{c}\}=\mathbb{P}(X_{i}^{*}=k\mid U_{i}=\textup{c})$;

(ii)

\mathbb{E}(X_{i}\tau_{i}\mid U_{i}=\textup{c})=(a_{1},\ldots,a_{K})^{\top}, (S80)

where

a_{k}=\mathbb{E}\{1(X_{i}^{*}=k)\cdot\tau_{i}\mid U_{i}=\textup{c}\}=\mathbb{E}(\tau_{i}\mid X_{i}^{*}=k,U_{i}=\textup{c})\cdot\mathbb{P}(X_{i}^{*}=k\mid U_{i}=\textup{c})=\tau_{[k]\textup{c}}\cdot p_{k}.

Eqs. (4) and (S79)–(S80) together ensure that the $k$th element of $\beta_{\textup{c}}$ equals $p_{k}^{-1}\cdot a_{k}=\tau_{[k]\textup{c}}$ for $k=1,\ldots,K$. ∎
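As a direct specialization, take $K=2$ with $p_{1}=p$ and $p_{2}=1-p$: Eqs. (S79)–(S80) give $\beta_{\textup{c}}=\operatorname{diag}(p,1-p)^{-1}\cdot(\tau_{[1]\textup{c}}\,p,\ \tau_{[2]\textup{c}}(1-p))^{\top}=(\tau_{[1]\textup{c}},\tau_{[2]\textup{c}})^{\top}$, i.e., the vector of stratum-wise LATEs.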

Proof of Corollary 1.

Let $\hat{\tau}_{\textup{wald},[k]}$ denote the Wald estimator based on units with $X_{i}^{*}=k$. We show in the following that

\hat{\beta}_{\textsc{2sls}}=(\hat{\tau}_{\textup{wald},[1]},\ldots,\hat{\tau}_{\textup{wald},[K]})^{\top}, (S81)

which ensures $\beta_{\textsc{2sls}}=(\tau_{\textup{wald},[1]},\ldots,\tau_{\textup{wald},[K]})^{\top}=\tau_{\textup{c}}$.

Recall the definition of forbidden regression from above Lemma S6. In addition, consider the following stratum-wise 2sls based on units with $X_{i}^{*}=k$:

(i) First stage: fit $\texttt{lm}(D_{i}\sim 1+Z_{i})$ over $\{i:X_{i}^{*}=k\}$. Denote the fitted value by $\hat{D}_{i}$.

(ii) Second stage: fit $\texttt{lm}(Y_{i}\sim 1+\hat{D}_{i})$ over $\{i:X_{i}^{*}=k\}$. Denote the coefficient of $\hat{D}_{i}$ by $\tilde{\beta}_{\textsc{2sls},[k]}$.

When $X_{i}=(1(X_{i}^{*}=1),\ldots,1(X_{i}^{*}=K))^{\top}$,

(a) properties of least squares ensure that the coefficients of $\hat{D}_{i}X_{i}$ from the forbidden regression equal $(\tilde{\beta}_{\textsc{2sls},[1]},\ldots,\tilde{\beta}_{\textsc{2sls},[K]})^{\top}$;

(b) Lemma S6 ensures that the coefficients of $\hat{D}_{i}X_{i}$ from the forbidden regression equal $\hat{\beta}_{\textsc{2sls}}$.

These two observations ensure $\hat{\beta}_{\textsc{2sls}}=(\tilde{\beta}_{\textsc{2sls},[1]},\ldots,\tilde{\beta}_{\textsc{2sls},[K]})^{\top}$. This, together with $\tilde{\beta}_{\textsc{2sls},[k]}=\hat{\tau}_{\textup{wald},[k]}$ by standard theory, ensures (S81). ∎
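Since (S81) is an exact finite-sample identity, it can be verified numerically. The R sketch below (the data-generating process and variable names are ours, purely for illustration) computes the stratum-wise Wald estimators and the interacted 2sls by hand with stratum dummies; the two coincide up to floating-point error:

# Illustrative check of Corollary 1: interacted 2sls = stratum-wise Wald.
set.seed(1)
N  <- 1e5; K <- 3
Xs <- sample(1:K, N, replace = TRUE)             # categorical covariate X*
Z  <- rbinom(N, 1, c(0.3, 0.5, 0.7)[Xs])         # IV propensity varies by stratum
U  <- sample(c("a", "n", "c"), N, replace = TRUE, prob = c(0.2, 0.3, 0.5))
D  <- ifelse(U == "a", 1, ifelse(U == "n", 0, Z))    # treatment by compliance type
Y  <- Xs + (U == "a") + c(1, 2, 3)[Xs] * D + rnorm(N)    # stratum-specific effects
X  <- model.matrix(~ 0 + factor(Xs))             # X_i = stratum indicators

wald <- sapply(1:K, function(k) {                # stratum-wise Wald estimators
  s <- Xs == k
  (mean(Y[s & Z == 1]) - mean(Y[s & Z == 0])) /
    (mean(D[s & Z == 1]) - mean(D[s & Z == 0]))
})

DX <- D * X; ZX <- Z * X                         # D_iX_i and Z_iX_i
DXhat <- fitted(lm(DX ~ 0 + ZX + X))             # interacted first stage
beta  <- coef(lm(Y ~ 0 + DXhat + X))[1:K]        # second-stage coefs of DXhat
max(abs(beta - wald))                            # ~1e-12: (S81) holds exactly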

S3.3 Proofs of the results in Section 3.3

Proof of Lemma 2.

Recall from Eq. (4) that

\beta_{\textup{c}}=\{\mathbb{E}(X_{i}X_{i}^{\top}\mid U_{i}=\textup{c})\}^{-1}\mathbb{E}(X_{i}\tau_{i}\mid U_{i}=\textup{c}).

Let $X_{i[-1]}=(X_{i1},\ldots,X_{i,K-1})^{\top}$ to write $X_{i}=(1,X_{i[-1]}^{\top})^{\top}$. We have

X_{i}X_{i}^{\top}=\begin{pmatrix}1\\ X_{i[-1]}\end{pmatrix}(1,X_{i[-1]}^{\top})=\begin{pmatrix}1&X_{i[-1]}^{\top}\\ X_{i[-1]}&X_{i[-1]}X_{i[-1]}^{\top}\end{pmatrix},\quad X_{i}\tau_{i}=\begin{pmatrix}1\\ X_{i[-1]}\end{pmatrix}\tau_{i}=\begin{pmatrix}\tau_{i}\\ X_{i[-1]}\tau_{i}\end{pmatrix}

with

\mathbb{E}(X_{i}X_{i}^{\top}\mid U_{i}=\textup{c})=\begin{pmatrix}1&\Phi\\ \Phi^{\top}&D\end{pmatrix},\quad \mathbb{E}(X_{i}\tau_{i}\mid U_{i}=\textup{c})=\begin{pmatrix}\tau_{\textup{c}}\\ \mathbb{E}(X_{i[-1]}\tau_{i}\mid U_{i}=\textup{c})\end{pmatrix},

where $\Phi=\mathbb{E}(X_{i[-1]}^{\top}\mid U_{i}=\textup{c})$ and $D=\mathbb{E}(X_{i[-1]}X_{i[-1]}^{\top}\mid U_{i}=\textup{c})$. When $\Phi=0$, the matrix $\mathbb{E}(X_{i}X_{i}^{\top}\mid U_{i}=\textup{c})$ is block diagonal, and so is its inverse; the first element of $\beta_{\textup{c}}$ then equals $1^{-1}\cdot\tau_{\textup{c}}=\tau_{\textup{c}}$. ∎

S3.4 Proofs of the results in Section 3.1

Proof of Theorem 1.

Let $\check{X}_{i}^{0}=(1,X_{i1}-\mathbb{E}(X_{i1}\mid U_{i}=\textup{c}),\ldots,X_{i,K-1}-\mathbb{E}(X_{i,K-1}\mid U_{i}=\textup{c}))^{\top}$ denote the population analog of $X_{i}^{0}$. Let $\hat{\tau}_{\times\times}(X_{i}^{0})=\hat{\tau}_{\times\times}^{*}$ and $\hat{\tau}_{\times\times}(\check{X}_{i}^{0})$ denote the first elements of $\hat{\beta}_{\textsc{2sls}}(X_{i}^{0})$ and $\hat{\beta}_{\textsc{2sls}}(\check{X}_{i}^{0})$, respectively. Then $\hat{\tau}_{\times\times}(X_{i}^{0})$ and $\hat{\tau}_{\times\times}(\check{X}_{i}^{0})$ have the same probability limit.

In addition, Lemma 2 ensures that $\tau_{\textup{c}}$ equals the first element of $\beta_{\textup{c}}(\check{X}_{i}^{0})$. This ensures

\textup{plim}_{N\to\infty}\,\hat{\tau}_{\times\times}(X_{i}^{0})=\tau_{\textup{c}} (S82)
\Longleftrightarrow \textup{plim}_{N\to\infty}\,\hat{\tau}_{\times\times}(\check{X}_{i}^{0})=\tau_{\textup{c}}
\overset{\text{Lemma 2}}{\Longleftrightarrow} \text{the first element of $\beta_{\textsc{2sls}}(\check{X}_{i}^{0})$ equals the first element of $\beta_{\textup{c}}(\check{X}_{i}^{0})$}
\overset{\text{Proposition S2}}{\Longleftrightarrow} \text{the first element of $\beta_{\textsc{2sls}}(X_{i})$ equals the first element of $\beta_{\textup{c}}(X_{i})$},

where the last equivalence follows from $\check{X}_{i}^{0}$ being a nondegenerate linear transformation of $X_{i}$ and the invariance of 2sls to nondegenerate linear transformations in Proposition S2. By Theorem 2, the equivalent result in (S82) is ensured by either Assumption 2 or Assumption 3. ∎

Theorem 1 ensures Proposition 2(iii). We verify below Proposition 2(i) and (ii).

Proof of Proposition 2(i).

Assume throughout this proof that $\check{D}_{i}=\textup{proj}(D_{i}\mid Z_{i},X_{i})$ denotes the population fitted value from the additive first stage $\texttt{lm}(D_{i}\sim Z_{i}+X_{i})$, and let $\tilde{D}_{i}=\check{D}_{i}-\textup{proj}(\check{D}_{i}\mid X_{i})$ denote the residual from the linear projection of $\check{D}_{i}$ on $X_{i}$. Given Proposition S3, it suffices to verify that when $\mathbb{E}(Z_{i}\mid X_{i})$ is linear in $X_{i}$, we have

$\mathbb{E}(\check{D}_{i}\mid X_{i})$ is linear in $X_{i}$, \quad \alpha(X_{i})=\frac{\mathbb{E}(D_{i}\tilde{D}_{i}\mid X_{i})}{\mathbb{E}(\tilde{D}_{i}^{2})}=w_{+}(X_{i}). (S83)

Write

\check{D}_{i}=a_{+}Z_{i}+b^{\top}X_{i},\quad\text{where}\quad a_{+}=\frac{\mathbb{E}\{\operatorname{var}(Z_{i}\mid X_{i})\cdot\pi(X_{i})\}}{\mathbb{E}\{\operatorname{var}(Z_{i}\mid X_{i})\}} (S84)

by Angrist (1998); see also Sloczynski (2022, Lemma A.1).

Proof of the linearity of $\mathbb{E}(\check{D}_{i}\mid X_{i})$ in (S83).

From (S84), we have $\mathbb{E}(\check{D}_{i}\mid X_{i})=a_{+}\cdot\mathbb{E}(Z_{i}\mid X_{i})+b^{\top}X_{i}$, which is linear in $X_{i}$ when $\mathbb{E}(Z_{i}\mid X_{i})$ is linear in $X_{i}$.

Proof of $\alpha(X_{i})=w_{+}(X_{i})$ in (S83).

When $\mathbb{E}(Z_{i}\mid X_{i})$ is linear in $X_{i}$, we just showed that $\mathbb{E}(\check{D}_{i}\mid X_{i})$ is linear in $X_{i}$. This ensures $\textup{proj}(\check{D}_{i}\mid X_{i})=\mathbb{E}(\check{D}_{i}\mid X_{i})$ so that

\tilde{D}_{i}=\check{D}_{i}-\textup{proj}(\check{D}_{i}\mid X_{i})=\check{D}_{i}-\mathbb{E}(\check{D}_{i}\mid X_{i})\overset{\text{(S84)}}{=}a_{+}\{Z_{i}-\mathbb{E}(Z_{i}\mid X_{i})\}. (S85)

This ensures

\mathbb{E}(\tilde{D}_{i}^{2}\mid X_{i})\overset{\text{(S85)}}{=}a_{+}^{2}\cdot\mathbb{E}[\{Z_{i}-\mathbb{E}(Z_{i}\mid X_{i})\}^{2}\mid X_{i}]=a_{+}^{2}\operatorname{var}(Z_{i}\mid X_{i}),

and therefore

\mathbb{E}(\tilde{D}_{i}^{2})=a_{+}^{2}\cdot\mathbb{E}\{\operatorname{var}(Z_{i}\mid X_{i})\}. (S86)

In addition, (S85) ensures

\mathbb{E}(D_{i}\tilde{D}_{i}\mid Z_{i},X_{i})=\mathbb{E}(D_{i}\mid Z_{i},X_{i})\cdot\tilde{D}_{i}=a_{+}\cdot\mathbb{E}(D_{i}\mid Z_{i},X_{i})\cdot\{Z_{i}-\mathbb{E}(Z_{i}\mid X_{i})\},

which implies

\mathbb{E}(D_{i}\tilde{D}_{i}\mid Z_{i}=1,X_{i})=a_{+}\cdot\mathbb{E}(D_{i}\mid Z_{i}=1,X_{i})\cdot\{1-\mathbb{E}(Z_{i}\mid X_{i})\},
\mathbb{E}(D_{i}\tilde{D}_{i}\mid Z_{i}=0,X_{i})=-a_{+}\cdot\mathbb{E}(D_{i}\mid Z_{i}=0,X_{i})\cdot\mathbb{E}(Z_{i}\mid X_{i}). (S89)

Therefore, we have

\mathbb{E}(D_{i}\tilde{D}_{i}\mid X_{i})=\mathbb{E}(D_{i}\tilde{D}_{i}\mid Z_{i}=1,X_{i})\cdot\mathbb{P}(Z_{i}=1\mid X_{i})+\mathbb{E}(D_{i}\tilde{D}_{i}\mid Z_{i}=0,X_{i})\cdot\mathbb{P}(Z_{i}=0\mid X_{i})
\overset{\text{(S89)}}{=}a_{+}\cdot\{\mathbb{E}(D_{i}\mid Z_{i}=1,X_{i})-\mathbb{E}(D_{i}\mid Z_{i}=0,X_{i})\}\cdot\mathbb{E}(Z_{i}\mid X_{i})\cdot\{1-\mathbb{E}(Z_{i}\mid X_{i})\}
=a_{+}\cdot\pi(X_{i})\cdot\operatorname{var}(Z_{i}\mid X_{i}). (S90)

Eqs. (S84), (S86), and (S90) together ensure

\alpha(X_{i})=\frac{\mathbb{E}(D_{i}\tilde{D}_{i}\mid X_{i})}{\mathbb{E}(\tilde{D}_{i}^{2})}\overset{\text{(S86)}+\text{(S90)}}{=}\frac{a_{+}\cdot\pi(X_{i})\cdot\operatorname{var}(Z_{i}\mid X_{i})}{a_{+}^{2}\cdot\mathbb{E}\{\operatorname{var}(Z_{i}\mid X_{i})\}}\overset{\text{(S84)}}{=}\frac{\pi(X_{i})\cdot\operatorname{var}(Z_{i}\mid X_{i})}{\mathbb{E}\{\pi(X_{i})\cdot\operatorname{var}(Z_{i}\mid X_{i})\}}=w_{+}(X_{i}).
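The weights $w_{+}(\cdot)$ are easy to see numerically. In the R sketch below (an illustrative data-generating process of our own, with $\pi(x)=0.5$ in every stratum so that only $\operatorname{var}(Z_{i}\mid X_{i})$ drives the weights), the additive 2sls estimate tracks the $w_{+}$-weighted average of the stratum-wise LATEs rather than the LATE itself:

# Illustrative check of Proposition 2(i): additive 2sls = w_+ weighted average.
set.seed(1)
N  <- 5e5
Xs <- sample(1:3, N, replace = TRUE)             # three strata, equally likely
Z  <- rbinom(N, 1, c(0.1, 0.5, 0.9)[Xs])         # E(Z | X) varies by stratum
U  <- sample(c("a", "n", "c"), N, replace = TRUE, prob = c(0.2, 0.3, 0.5))
D  <- ifelse(U == "a", 1, ifelse(U == "n", 0, Z))
Y  <- Xs + (U == "a") + c(1, 2, 6)[Xs] * D + rnorm(N)    # tau_c(x) = 1, 2, 6

Dhat    <- fitted(lm(D ~ Z + factor(Xs)))        # additive first stage
tau_add <- coef(lm(Y ~ Dhat + factor(Xs)))["Dhat"]

vz   <- c(0.1, 0.5, 0.9) * c(0.9, 0.5, 0.1)      # var(Z | x); constant pi(x) cancels
wavg <- sum(vz * c(1, 2, 6)) / sum(vz)           # E{w_+(X) tau_c(X)} = 2.63
c(tau_add, wavg, mean(c(1, 2, 6)))               # 2sls ~ wavg, not the LATE (= 3)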

Proof of Proposition 2(ii).

Assume throughout this proof that $\check{D}_{i}=\textup{proj}(D_{i}\mid Z_{i}X_{i},X_{i})$ denotes the population fitted value from the interacted first stage $\texttt{lm}(D_{i}\sim Z_{i}X_{i}+X_{i})$, and let $\tilde{D}_{i}=\check{D}_{i}-\textup{proj}(\check{D}_{i}\mid X_{i})$ denote the residual from the linear projection of $\check{D}_{i}$ on $X_{i}$. Given Proposition S3, it suffices to verify that when $\mathbb{E}(Z_{i}X_{i}\mid X_{i})$ is linear in $X_{i}$, we have

$\mathbb{E}(\check{D}_{i}\mid X_{i})$ is linear in $X_{i}$, \quad \alpha(X_{i})=\frac{\mathbb{E}(D_{i}\tilde{D}_{i}\mid X_{i})}{\mathbb{E}(\tilde{D}_{i}^{2})}=w_{\times}(X_{i}). (S91)

Write

\check{D}_{i}=a^{\top}Z_{i}X_{i}+b^{\top}X_{i}. (S92)

Proof of the linearity of $\mathbb{E}(\check{D}_{i}\mid X_{i})$ in (S91).

From (S92), we have $\mathbb{E}(\check{D}_{i}\mid X_{i})=a^{\top}\mathbb{E}(Z_{i}X_{i}\mid X_{i})+b^{\top}X_{i}$, which is linear in $X_{i}$ when $\mathbb{E}(Z_{i}X_{i}\mid X_{i})$ is linear in $X_{i}$.

Proof of $\alpha(X_{i})=w_{\times}(X_{i})$ in (S91).

When $\mathbb{E}(Z_{i}X_{i}\mid X_{i})$ is linear in $X_{i}$, we just showed that $\mathbb{E}(\check{D}_{i}\mid X_{i})$ is linear in $X_{i}$. This ensures $\textup{proj}(\check{D}_{i}\mid X_{i})=\mathbb{E}(\check{D}_{i}\mid X_{i})$ so that

\tilde{D}_{i}=\check{D}_{i}-\textup{proj}(\check{D}_{i}\mid X_{i})=\check{D}_{i}-\mathbb{E}(\check{D}_{i}\mid X_{i})\overset{\text{(S92)}}{=}a^{\top}Z_{i}X_{i}-\mathbb{E}(a^{\top}Z_{i}X_{i}\mid X_{i})=(a^{\top}X_{i})\{Z_{i}-\mathbb{E}(Z_{i}\mid X_{i})\}. (S93)

Note that the last expression in (S93) parallels (S85) after replacing $a_{+}$ with $a^{\top}X_{i}$. Replacing $a_{+}$ by $a^{\top}X_{i}$ in the proof of Proposition 2(i) after (S85) ensures that the renewed $\alpha(X_{i})$ in (S91) satisfies

\alpha(X_{i})=\frac{\mathbb{E}(D_{i}\tilde{D}_{i}\mid X_{i})}{\mathbb{E}(\tilde{D}_{i}^{2})}=\frac{(a^{\top}X_{i})\cdot\pi(X_{i})\cdot\operatorname{var}(Z_{i}\mid X_{i})}{\mathbb{E}\{(a^{\top}X_{i})^{2}\cdot\operatorname{var}(Z_{i}\mid X_{i})\}}. (S94)

We show below that $a=\mathbb{E}\{\operatorname{var}(Z_{i}\mid X_{i})\cdot X_{i}X_{i}^{\top}\}^{-1}\mathbb{E}\{\operatorname{var}(Z_{i}\mid X_{i})\cdot X_{i}\,\pi(X_{i})\}$, which completes the proof.

By the population FWL, we have

a=\mathbb{E}\{\textup{res}(Z_{i}X_{i}\mid X_{i})\cdot Z_{i}X_{i}^{\top}\}^{-1}\mathbb{E}\{\textup{res}(Z_{i}X_{i}\mid X_{i})\cdot D_{i}\}.

Therefore, it suffices to show that

\mathbb{E}\{\textup{res}(Z_{i}X_{i}\mid X_{i})\cdot Z_{i}X_{i}^{\top}\}=\mathbb{E}\{\operatorname{var}(Z_{i}\mid X_{i})\cdot X_{i}X_{i}^{\top}\}, (S95)
\mathbb{E}\{\textup{res}(Z_{i}X_{i}\mid X_{i})\cdot D_{i}\}=\mathbb{E}\{\operatorname{var}(Z_{i}\mid X_{i})\cdot X_{i}\,\pi(X_{i})\}. (S96)

A useful fact is that when $\mathbb{E}(Z_{i}X_{i}\mid X_{i})$ is linear in $X_{i}$, we have

\textup{res}(Z_{i}X_{i}\mid X_{i})=Z_{i}X_{i}-\textup{proj}(Z_{i}X_{i}\mid X_{i})=Z_{i}X_{i}-\mathbb{E}(Z_{i}X_{i}\mid X_{i})=\{Z_{i}-\mathbb{E}(Z_{i}\mid X_{i})\}\cdot X_{i}

with

\mathbb{E}\{\textup{res}(Z_{i}X_{i}\mid X_{i})\mid Z_{i}=1,X_{i}\}=\{1-\mathbb{E}(Z_{i}\mid X_{i})\}\cdot X_{i}, (S97)
\mathbb{E}\{\textup{res}(Z_{i}X_{i}\mid X_{i})\cdot h(X_{i})\}=0\quad\text{for arbitrary $h(\cdot)$}, (S98)
\textup{res}(Z_{i}X_{i}\mid X_{i})\cdot Z_{i}=\{1-\mathbb{E}(Z_{i}\mid X_{i})\}\cdot Z_{i}X_{i}. (S99)
Proof of (S95).

From (S97), we have

\mathbb{E}\{\textup{res}(Z_{i}X_{i}\mid X_{i})\cdot Z_{i}X_{i}^{\top}\mid X_{i}\}=\mathbb{E}\{\textup{res}(Z_{i}X_{i}\mid X_{i})\cdot Z_{i}\mid X_{i}\}\cdot X_{i}^{\top}=\mathbb{P}(Z_{i}=1\mid X_{i})\cdot\mathbb{E}\{\textup{res}(Z_{i}X_{i}\mid X_{i})\mid Z_{i}=1,X_{i}\}\cdot X_{i}^{\top}\overset{\text{(S97)}}{=}\mathbb{E}(Z_{i}\mid X_{i})\cdot\{1-\mathbb{E}(Z_{i}\mid X_{i})\}\cdot X_{i}X_{i}^{\top}=\operatorname{var}(Z_{i}\mid X_{i})\cdot X_{i}X_{i}^{\top}.

This ensures

\mathbb{E}\{\textup{res}(Z_{i}X_{i}\mid X_{i})\cdot Z_{i}X_{i}^{\top}\}=\mathbb{E}[\mathbb{E}\{\textup{res}(Z_{i}X_{i}\mid X_{i})\cdot Z_{i}X_{i}^{\top}\mid X_{i}\}]=\mathbb{E}\{\operatorname{var}(Z_{i}\mid X_{i})\cdot X_{i}X_{i}^{\top}\}. (S100)
Proof of (S96).

Let $\delta_{i}=D_{i}(1)-D_{i}(0)$ with $D_{i}=D_{i}(0)+Z_{i}\delta_{i}$ and $\mathbb{E}(\delta_{i}\mid Z_{i},X_{i})=\mathbb{E}(\delta_{i}\mid X_{i})=\pi(X_{i})$. This ensures

\mathbb{E}(D_{i}\mid Z_{i},X_{i})=\mathbb{E}\{D_{i}(0)\mid Z_{i},X_{i}\}+Z_{i}\cdot\mathbb{E}(\delta_{i}\mid Z_{i},X_{i})=\mathbb{E}\{D_{i}(0)\mid X_{i}\}+Z_{i}\cdot\pi(X_{i}) (S101)

with

\mathbb{E}\{\textup{res}(Z_{i}X_{i}\mid X_{i})\cdot D_{i}\mid Z_{i},X_{i}\}=\textup{res}(Z_{i}X_{i}\mid X_{i})\cdot\mathbb{E}(D_{i}\mid Z_{i},X_{i})\overset{\text{(S101)}}{=}\textup{res}(Z_{i}X_{i}\mid X_{i})\cdot\mathbb{E}\{D_{i}(0)\mid X_{i}\}+\textup{res}(Z_{i}X_{i}\mid X_{i})\cdot Z_{i}\cdot\pi(X_{i}). (S102)

Note that the two terms in (S102) satisfy

\mathbb{E}[\textup{res}(Z_{i}X_{i}\mid X_{i})\cdot\mathbb{E}\{D_{i}(0)\mid X_{i}\}]\overset{\text{(S98)}}{=}0,
\mathbb{E}\{\textup{res}(Z_{i}X_{i}\mid X_{i})\cdot Z_{i}\cdot\pi(X_{i})\}\overset{\text{(S99)}}{=}\mathbb{E}[\{1-\mathbb{E}(Z_{i}\mid X_{i})\}\cdot Z_{i}X_{i}\cdot\pi(X_{i})]=\mathbb{E}[\{1-\mathbb{E}(Z_{i}\mid X_{i})\}\cdot\mathbb{E}(Z_{i}\mid X_{i})\cdot X_{i}\cdot\pi(X_{i})]=\mathbb{E}\{\operatorname{var}(Z_{i}\mid X_{i})\cdot X_{i}\cdot\pi(X_{i})\}. (S107)

This ensures

\mathbb{E}\{\textup{res}(Z_{i}X_{i}\mid X_{i})\cdot D_{i}\}=\mathbb{E}[\mathbb{E}\{\textup{res}(Z_{i}X_{i}\mid X_{i})\cdot D_{i}\mid Z_{i},X_{i}\}]\overset{\text{(S102)}}{=}\mathbb{E}[\textup{res}(Z_{i}X_{i}\mid X_{i})\cdot\mathbb{E}\{D_{i}(0)\mid X_{i}\}]+\mathbb{E}\{\textup{res}(Z_{i}X_{i}\mid X_{i})\cdot Z_{i}\cdot\pi(X_{i})\}\overset{\text{(S107)}}{=}\mathbb{E}\{\operatorname{var}(Z_{i}\mid X_{i})\cdot X_{i}\cdot\pi(X_{i})\}.

S3.5 Proofs of the results in Section 4

Proof of Proposition 3.

Proposition 3(i) follows from the population FWL in Lemma S3. Proposition 3(ii) follows from Abadie (2003). We verify below Proposition 3(iii).

First, it follows from (S2) and Lemma S4 that

\mathbb{E}(\widetilde{DX}_{i}\mid X_{i})=C_{1}\{\mathbb{E}(Z_{i}X_{i}\mid X_{i})-C_{2}X_{i}\},\quad\text{where $C_{1}$ is invertible}. (S108)

Note that a necessary condition for $w_{i}=u_{i}$ is

\mathbb{E}(w_{i}\mid X_{i})=\mathbb{E}(u_{i}\mid X_{i}), (S109)

where it follows from $\mathbb{E}(\Delta\kappa\mid X_{i})=0$ that

\mathbb{E}(u_{i}\mid X_{i})=\{\mathbb{E}(X_{i}X_{i}^{\top}\mid U_{i}=\textup{c})\}^{-1}\pi_{\textup{c}}^{-1}\cdot\mathbb{E}(\Delta\kappa\mid X_{i})\cdot X_{i}=0.

Therefore, the necessary condition in (S109) has the following equivalent forms:

(S109) \Longleftrightarrow \mathbb{E}(w_{i}\mid X_{i})=0 \Longleftrightarrow \mathbb{E}(\widetilde{DX}_{i}\mid X_{i})=0 \overset{\text{(S108)}}{\Longleftrightarrow} \mathbb{E}(Z_{i}X_{i}\mid X_{i})=C_{2}X_{i}.

This ensures the necessity of Assumption 2 for $w_{i}=u_{i}$. ∎

S3.6 Proofs of the results in Section 8

Proof of Proposition 4.

Recall $\widetilde{DV_{i}}$ as the residual from the linear projection of $\textup{proj}(D_{i}V_{i}\mid Z_{i}V_{i},X_{i})$ on $X_{i}$. The population FWL in Lemma S3 ensures

\beta_{\textsc{2sls}}^{\prime}=\{\mathbb{E}(\widetilde{DV_{i}}\cdot\widetilde{DV_{i}}^{\top})\}^{-1}\mathbb{E}(\widetilde{DV_{i}}\cdot Y_{i}). (S110)

Write $Y_{i}=(D_{i}V_{i})^{\top}\beta_{\textup{c}}^{\prime}+(Y_{i}-D_{i}V_{i}^{\top}\beta_{\textup{c}}^{\prime})$ with

\widetilde{DV_{i}}\cdot Y_{i}=\widetilde{DV_{i}}\cdot(D_{i}V_{i})^{\top}\beta_{\textup{c}}^{\prime}+\widetilde{DV_{i}}\cdot(Y_{i}-D_{i}V_{i}^{\top}\beta_{\textup{c}}^{\prime})

and

\mathbb{E}(\widetilde{DV_{i}}\cdot Y_{i})=\mathbb{E}\{\widetilde{DV_{i}}\cdot(D_{i}V_{i})^{\top}\}\beta_{\textup{c}}^{\prime}+\mathbb{E}\{\widetilde{DV_{i}}\cdot(Y_{i}-D_{i}V_{i}^{\top}\beta_{\textup{c}}^{\prime})\}=\mathbb{E}\{\widetilde{DV_{i}}\cdot(D_{i}V_{i})^{\top}\}\beta_{\textup{c}}^{\prime}+B^{\prime}. (S111)

Plugging (S111) into (S110) ensures

\beta_{\textsc{2sls}}^{\prime}=\{\mathbb{E}(\widetilde{DV_{i}}\cdot\widetilde{DV_{i}}^{\top})\}^{-1}\mathbb{E}\{\widetilde{DV_{i}}\cdot(D_{i}V_{i})^{\top}\}\beta_{\textup{c}}^{\prime}+\{\mathbb{E}(\widetilde{DV_{i}}\cdot\widetilde{DV_{i}}^{\top})\}^{-1}B^{\prime}.

Therefore, it suffices to show that

\mathbb{E}\{\widetilde{DV_{i}}\cdot(D_{i}V_{i})^{\top}\}=\mathbb{E}(\widetilde{DV_{i}}\cdot\widetilde{DV_{i}}^{\top}). (S112)

The sufficient conditions then follow from Lemma S9. We verify (S112) below.

Proof of (S112).

With a slight abuse of notation, renew $\widehat{DV_{i}}=\textup{proj}(D_{i}V_{i}\mid Z_{i}V_{i},X_{i})$ and write

D_{i}V_{i}-\widetilde{DV_{i}}=(D_{i}V_{i}-\widehat{DV_{i}})+(\widehat{DV_{i}}-\widetilde{DV_{i}})=\textup{res}(D_{i}V_{i}\mid Z_{i}V_{i},X_{i})+\textup{proj}(\widehat{DV_{i}}\mid X_{i}), (S113)

where the last step uses $\widetilde{DV_{i}}=\widehat{DV_{i}}-\textup{proj}(\widehat{DV_{i}}\mid X_{i})$. Note that $\widetilde{DV_{i}}$ is by definition a linear combination of $(Z_{i}V_{i},X_{i})$. Properties of linear projection ensure

\mathbb{E}\{\widetilde{DV_{i}}\cdot\textup{res}(D_{i}V_{i}\mid Z_{i}V_{i},X_{i})^{\top}\}=0,\quad \mathbb{E}\{\widetilde{DV_{i}}\cdot\textup{proj}(\widehat{DV_{i}}\mid X_{i})^{\top}\}=0. (S114)

(S113)–(S114) together ensure (S112) as follows:

\mathbb{E}\{\widetilde{DV_{i}}\cdot(D_{i}V_{i})^{\top}\}-\mathbb{E}(\widetilde{DV_{i}}\cdot\widetilde{DV_{i}}^{\top})=\mathbb{E}\{\widetilde{DV_{i}}\cdot(D_{i}V_{i}-\widetilde{DV_{i}})^{\top}\}\overset{\text{(S113)}}{=}\mathbb{E}\{\widetilde{DV_{i}}\cdot\textup{res}(D_{i}V_{i}\mid Z_{i}V_{i},X_{i})^{\top}\}+\mathbb{E}\{\widetilde{DV_{i}}\cdot\textup{proj}(\widehat{DV_{i}}\mid X_{i})^{\top}\}\overset{\text{(S114)}}{=}0.

Proof of Proposition 5.

The result follows from letting $Z_{i}=D_{i}$ in Theorems 1–2. ∎

S4 Proofs of the results in Section S1

Proof of Proposition S3.

Recall $R_{i}=R(X_{i},Z_{i})$ as the regressor vector in the first stage of the generalized additive 2sls in Definition S1. Recall $\check{D}_{i}=\textup{proj}(D_{i}\mid R_{i})$ and $\tilde{D}_{i}=\check{D}_{i}-\textup{proj}(\check{D}_{i}\mid X_{i})$. The correspondence between least squares and linear projection ensures that $\tau_{*+}$ is the coefficient of $\check{D}_{i}$ in $\textup{proj}(Y_{i}\mid\check{D}_{i},X_{i})$. The population FWL in Lemma S3 ensures

\tau_{*+}=\frac{\mathbb{E}(Y_{i}\tilde{D}_{i})}{\mathbb{E}(\tilde{D}_{i}^{2})}. (S115)

We simplify below the numerator $\mathbb{E}(Y_{i}\tilde{D}_{i})$ in (S115).

Recall $\tau_{\textup{c}}(X_{i})=\mathbb{E}(\tau_{i}\mid U_{i}=\textup{c},X_{i})$. Let $\delta_{i}=D_{i}(1)-D_{i}(0)$ to write

D_{i}=D_{i}(0)+\delta_{i}Z_{i}. (S116)

Write

\Delta_{i}=\left\{\begin{array}{cl}Y_{i}(0)&\text{for compliers and never-takers}\\ Y_{i}(1)-\tau_{\textup{c}}(X_{i})&\text{for always-takers}\end{array}\right.=Y_{i}(0)+\{\tau_{i}-\tau_{\textup{c}}(X_{i})\}\cdot D_{i}(0).

We have

Y_{i}=Y_{i}(0)+\tau_{i}\cdot D_{i}=Y_{i}(0)+\{\tau_{i}-\tau_{\textup{c}}(X_{i})\}\cdot D_{i}+\tau_{\textup{c}}(X_{i})\cdot D_{i}\overset{\text{(S116)}}{=}Y_{i}(0)+\{\tau_{i}-\tau_{\textup{c}}(X_{i})\}\cdot D_{i}(0)+\{\tau_{i}-\tau_{\textup{c}}(X_{i})\}\cdot\delta_{i}Z_{i}+\tau_{\textup{c}}(X_{i})\cdot D_{i}=\Delta_{i}+\{\tau_{i}-\tau_{\textup{c}}(X_{i})\}\cdot\delta_{i}Z_{i}+\tau_{\textup{c}}(X_{i})\cdot D_{i}. (S123)

(S123) ensures

\mathbb{E}(Y_{i}\tilde{D}_{i})=\mathbb{E}(\Delta_{i}\cdot\tilde{D}_{i})+\mathbb{E}[\{\tau_{i}-\tau_{\textup{c}}(X_{i})\}\delta_{i}\cdot Z_{i}\tilde{D}_{i}]+\mathbb{E}\{\tau_{\textup{c}}(X_{i})\cdot D_{i}\tilde{D}_{i}\}. (S124)

We compute below the three expectations on the right-hand side of (S124) in turn.

Compute $\mathbb{E}(\Delta_{i}\cdot\tilde{D}_{i})$ in (S124).

Let $\xi_{i}=\Delta_{i}-\textup{proj}(\Delta_{i}\mid X_{i})$ with

\mathbb{E}(\xi_{i}\mid X_{i})=\mathbb{E}(\Delta_{i}\mid X_{i})-\textup{proj}(\Delta_{i}\mid X_{i}). (S125)

Properties of linear projection ensure $\mathbb{E}\{\textup{proj}(\Delta_{i}\mid X_{i})\cdot\tilde{D}_{i}\}=0$ so that

\mathbb{E}(\Delta_{i}\cdot\tilde{D}_{i})=\mathbb{E}\{\textup{proj}(\Delta_{i}\mid X_{i})\cdot\tilde{D}_{i}\}+\mathbb{E}(\xi_{i}\cdot\tilde{D}_{i})=\mathbb{E}(\xi_{i}\cdot\tilde{D}_{i}). (S126)

In addition, $\xi_{i}$ is a function of $\{Y_{i}(z,d),D_{i}(z),X_{i}:z=0,1;\,d=0,1\}$, whereas $\tilde{D}_{i}$ is a function of $(X_{i},Z_{i})$. This ensures

\xi_{i}\perp\!\!\!\perp\tilde{D}_{i}\mid X_{i} (S127)

under Assumption 1(i). (S125)–(S127) together ensure

\mathbb{E}(\Delta_{i}\cdot\tilde{D}_{i})\overset{\text{(S126)}}{=}\mathbb{E}(\xi_{i}\cdot\tilde{D}_{i})=\mathbb{E}\{\mathbb{E}(\xi_{i}\cdot\tilde{D}_{i}\mid X_{i})\}\overset{\text{(S127)}}{=}\mathbb{E}\{\mathbb{E}(\xi_{i}\mid X_{i})\cdot\mathbb{E}(\tilde{D}_{i}\mid X_{i})\}\overset{\text{(S125)}}{=}\mathbb{E}[\{\mathbb{E}(\Delta_{i}\mid X_{i})-\textup{proj}(\Delta_{i}\mid X_{i})\}\cdot\mathbb{E}(\tilde{D}_{i}\mid X_{i})]=B, (S133)

where the last equality follows from $\mathbb{E}(\tilde{D}_{i}\mid X_{i})=\mathbb{E}(\check{D}_{i}\mid X_{i})-\textup{proj}(\check{D}_{i}\mid X_{i})$ by definition.

Compute $\mathbb{E}[\{\tau_{i}-\tau_{\textup{c}}(X_{i})\}\delta_{i}\cdot Z_{i}\tilde{D}_{i}]$ in (S124).

Note that $\{\tau_{i}-\tau_{\textup{c}}(X_{i})\}\delta_{i}$ is a function of $\{Y_{i}(z,d),D_{i}(z),X_{i}:z=0,1;\,d=0,1\}$, whereas $Z_{i}\tilde{D}_{i}$ is a function of $(X_{i},Z_{i})$. This ensures

\{\tau_{i}-\tau_{\textup{c}}(X_{i})\}\delta_{i}\perp\!\!\!\perp Z_{i}\tilde{D}_{i}\mid X_{i} (S134)

under Assumption 1(i). In addition,

\mathbb{E}[\{\tau_{i}-\tau_{\textup{c}}(X_{i})\}\delta_{i}\mid X_{i}]=\mathbb{E}\{\tau_{i}-\tau_{\textup{c}}(X_{i})\mid U_{i}=\textup{c},X_{i}\}\cdot\mathbb{P}(U_{i}=\textup{c}\mid X_{i})=0. (S135)

(S134)–(S135) together ensure

\mathbb{E}[\{\tau_{i}-\tau_{\textup{c}}(X_{i})\}\delta_{i}\cdot Z_{i}\tilde{D}_{i}\mid X_{i}]\overset{\text{(S134)}}{=}\mathbb{E}[\{\tau_{i}-\tau_{\textup{c}}(X_{i})\}\delta_{i}\mid X_{i}]\cdot\mathbb{E}(Z_{i}\tilde{D}_{i}\mid X_{i})\overset{\text{(S135)}}{=}0,

and therefore

\mathbb{E}[\{\tau_{i}-\tau_{\textup{c}}(X_{i})\}\delta_{i}\cdot Z_{i}\tilde{D}_{i}]=\mathbb{E}(\mathbb{E}[\{\tau_{i}-\tau_{\textup{c}}(X_{i})\}\delta_{i}\cdot Z_{i}\tilde{D}_{i}\mid X_{i}])=0. (S139)
Compute $\mathbb{E}\{\tau_{\textup{c}}(X_{i})\cdot D_{i}\tilde{D}_{i}\}$ in (S124).

We have

\mathbb{E}\{\tau_{\textup{c}}(X_{i})\cdot D_{i}\tilde{D}_{i}\mid X_{i}\}=\tau_{\textup{c}}(X_{i})\cdot\mathbb{E}(D_{i}\tilde{D}_{i}\mid X_{i})=\tau_{\textup{c}}(X_{i})\cdot\alpha(X_{i})\cdot\mathbb{E}(\tilde{D}_{i}^{2}),

and therefore

\mathbb{E}\{\tau_{\textup{c}}(X_{i})\cdot D_{i}\tilde{D}_{i}\}=\mathbb{E}[\mathbb{E}\{\tau_{\textup{c}}(X_{i})\cdot D_{i}\tilde{D}_{i}\mid X_{i}\}]=\mathbb{E}\{\tau_{\textup{c}}(X_{i})\cdot\alpha(X_{i})\}\cdot\mathbb{E}(\tilde{D}_{i}^{2}). (S143)

Plugging (S133), (S139), and (S143) into (S124) ensures

\mathbb{E}(Y_{i}\tilde{D}_{i})=B+\mathbb{E}\{\tau_{\textup{c}}(X_{i})\cdot\alpha(X_{i})\}\cdot\mathbb{E}(\tilde{D}_{i}^{2}),

which verifies the expression $\tau_{*+}=\{\mathbb{E}(\tilde{D}_{i}^{2})\}^{-1}B+\mathbb{E}\{\alpha(X_{i})\cdot\tau_{\textup{c}}(X_{i})\}$ by (S115).

The sufficient conditions for $B=0$ are straightforward. We verify below the sufficient conditions for $\alpha(X_{i})=w(X_{i})$.

Conditions for $\alpha(X_{i})=w(X_{i})$.

Assume that $\mathbb{E}(\check{D}_{i}\mid X_{i})$ is linear in $X_{i}$. We have $\mathbb{E}(\check{D}_{i}\mid X_{i})=\textup{proj}(\check{D}_{i}\mid X_{i})$ with

\tilde{D}_{i}=\check{D}_{i}-\mathbb{E}(\check{D}_{i}\mid X_{i}). (S144)

This, together with $\tilde{D}_{i}=\check{D}_{i}-\textup{proj}(\check{D}_{i}\mid X_{i})$ being a function of $(X_{i},Z_{i})$, ensures

\mathbb{E}(\tilde{D}_{i}^{2}\mid X_{i})\overset{\text{(S144)}}{=}\mathbb{E}[\{\check{D}_{i}-\mathbb{E}(\check{D}_{i}\mid X_{i})\}^{2}\mid X_{i}]=\operatorname{var}(\check{D}_{i}\mid X_{i}), (S145)
\mathbb{E}(D_{i}\tilde{D}_{i}\mid Z_{i},X_{i})=\mathbb{E}(D_{i}\mid Z_{i},X_{i})\cdot\tilde{D}_{i}\overset{\text{(S144)}}{=}\mathbb{E}(D_{i}\mid Z_{i},X_{i})\cdot\check{D}_{i}-\mathbb{E}(D_{i}\mid Z_{i},X_{i})\cdot\mathbb{E}(\check{D}_{i}\mid X_{i}). (S146)

From (S146), we have

\mathbb{E}(D_{i}\tilde{D}_{i}\mid X_{i})=\mathbb{E}\{\mathbb{E}(D_{i}\tilde{D}_{i}\mid Z_{i},X_{i})\mid X_{i}\}=\mathbb{E}\{\mathbb{E}(D_{i}\mid Z_{i},X_{i})\cdot\check{D}_{i}\mid X_{i}\}-\mathbb{E}\{\mathbb{E}(D_{i}\mid Z_{i},X_{i})\mid X_{i}\}\cdot\mathbb{E}(\check{D}_{i}\mid X_{i})=\operatorname{cov}\{\mathbb{E}(D_{i}\mid Z_{i},X_{i}),\check{D}_{i}\mid X_{i}\}. (S147)

(S145) and (S147) together ensure

\alpha(X_{i})=\frac{\mathbb{E}(D_{i}\tilde{D}_{i}\mid X_{i})}{\mathbb{E}(\tilde{D}_{i}^{2})}\overset{\text{(S145)}+\text{(S147)}}{=}\frac{\operatorname{cov}\{\mathbb{E}(D_{i}\mid Z_{i},X_{i}),\check{D}_{i}\mid X_{i}\}}{\mathbb{E}\{\operatorname{var}(\check{D}_{i}\mid X_{i})\}}. (S148)

When $\check{D}_{i}=\mathbb{E}(D_{i}\mid Z_{i},X_{i})$, (S148) simplifies to $\alpha(X_{i})=\operatorname{var}\{\mathbb{E}(D_{i}\mid Z_{i},X_{i})\mid X_{i}\}\,/\,\mathbb{E}[\operatorname{var}\{\mathbb{E}(D_{i}\mid Z_{i},X_{i})\mid X_{i}\}]=w(X_{i}).

S5 Proofs of the results in Section S2.4

S5.1 Lemmas

Lemma S10.

Let $G$ be an $m\times m$ matrix. Let $r=\operatorname{rank}(G)$ denote the rank of $G$ with $0\leq r\leq m$. Then there exists an invertible $m\times m$ matrix

\Gamma=\begin{pmatrix}\Gamma_{11}&\Gamma_{12}\\ I_{m-r}&\Gamma_{22}\end{pmatrix}_{m\times m}

with $\Gamma_{11}\in\mathbb{R}^{r\times(m-r)}$, $\Gamma_{12}\in\mathbb{R}^{r\times r}$, and $\Gamma_{22}\in\mathbb{R}^{(m-r)\times r}$ such that

\Gamma G=\begin{pmatrix}I_{r}&C\\ 0&0\end{pmatrix}_{m\times m}\quad\text{for some $C\in\mathbb{R}^{r\times(m-r)}$.}
Proof of Lemma S10.

The results for $r=0$ and $r=m$ are straightforward by letting $\Gamma=I_{m}$ and $\Gamma=G^{-1}$, respectively. We verify below the result for $0<r<m$.

Given $\operatorname{rank}(G)=r$, the theory of the reduced row echelon form (see, e.g., Strang (2006)) ensures that there exists an invertible $m\times m$ matrix $\Gamma^{\prime}$ such that

\Gamma^{\prime}G=\begin{pmatrix}I_{r}&C\\ 0&0\end{pmatrix}_{m\times m}\quad\text{for some $C\in\mathbb{R}^{r\times(m-r)}$.} (S149)

Write

\Gamma^{\prime}=\begin{pmatrix}\Gamma_{1}\\ \Gamma_{2}^{\prime}\end{pmatrix}\quad\text{with $\Gamma_{1}\in\mathbb{R}^{r\times m}$ and $\Gamma_{2}^{\prime}\in\mathbb{R}^{(m-r)\times m}$.} (S150)

That $\Gamma^{\prime}$ is invertible ensures that $\Gamma_{2}^{\prime}$ has full row rank with $\operatorname{rank}(\Gamma_{2}^{\prime})=m-r$. The theory of the reduced row echelon form ensures that there exists an invertible $(m-r)\times(m-r)$ matrix $\Gamma_{3}$ such that

\Gamma_{3}\Gamma_{2}^{\prime}=(I_{m-r},\ \Gamma_{22})\quad\text{for some $(m-r)\times r$ matrix $\Gamma_{22}$}. (S151)

Define

\Gamma=\begin{pmatrix}I_{r}&0\\ 0&\Gamma_{3}\end{pmatrix}\Gamma^{\prime}. (S152)

Let $(\Gamma_{11},\Gamma_{12})$ denote a partition of $\Gamma_{1}$ with $\Gamma_{11}\in\mathbb{R}^{r\times(m-r)}$ and $\Gamma_{12}\in\mathbb{R}^{r\times r}$. It follows from (S150)–(S151) that

\Gamma\overset{\text{(S150)}}{=}\begin{pmatrix}I_{r}&0\\ 0&\Gamma_{3}\end{pmatrix}\begin{pmatrix}\Gamma_{1}\\ \Gamma_{2}^{\prime}\end{pmatrix}=\begin{pmatrix}\Gamma_{1}\\ \Gamma_{3}\Gamma_{2}^{\prime}\end{pmatrix}\overset{\text{(S151)}}{=}\begin{pmatrix}\Gamma_{11}&\Gamma_{12}\\ I_{m-r}&\Gamma_{22}\end{pmatrix} (S153)

with

\Gamma G\overset{\text{(S152)}}{=}\begin{pmatrix}I_{r}&0\\ 0&\Gamma_{3}\end{pmatrix}\Gamma^{\prime}G\overset{\text{(S149)}}{=}\begin{pmatrix}I_{r}&0\\ 0&\Gamma_{3}\end{pmatrix}\begin{pmatrix}I_{r}&C\\ 0&0\end{pmatrix}=\begin{pmatrix}I_{r}&C\\ 0&0\end{pmatrix}. ∎
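As a concrete illustration of the construction, take $m=2$ and $G=\begin{pmatrix}1&1\\ 1&1\end{pmatrix}$ with $r=1$. Row reduction gives $\Gamma^{\prime}=\begin{pmatrix}1&0\\ -1&1\end{pmatrix}$ with $\Gamma^{\prime}G=\begin{pmatrix}1&1\\ 0&0\end{pmatrix}$, so $\Gamma_{2}^{\prime}=(-1,1)$ and $\Gamma_{3}=-1$ yields $\Gamma_{3}\Gamma_{2}^{\prime}=(1,-1)=(I_{1},\Gamma_{22})$ with $\Gamma_{22}=-1$. Then (S152) gives the invertible matrix $\Gamma=\begin{pmatrix}1&0\\ 1&-1\end{pmatrix}$, which has the required form with $\Gamma_{11}=1$ and $\Gamma_{12}=0$, and satisfies $\Gamma G=\begin{pmatrix}1&1\\ 0&0\end{pmatrix}$, i.e., $C=1$.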

Lemma S11.

Assume the setting of Proposition S4. Let

\mathcal{S}=\{X=(X_{0},X_{1},\ldots,X_{m})^{\top}\in\mathbb{R}^{m+1}:\ \text{$X$ satisfies (S50)}\}

denote the solution set of (S50). Let

\mathcal{S}_{0}=\{X_{0}:\ \text{there exists $X=(X_{0},X_{1},\ldots,X_{m})^{\top}\in\mathcal{S}$}\}

denote the set of values of $X_{0}$ across all solutions to (S50). Let $A_{0}=(a_{01},\ldots,a_{0m})\in\mathbb{R}^{1\times m}$ denote the coefficient vector of $(X_{1},\ldots,X_{m})^{\top}$ in the first equation in (S50). Let $A=(a_{ij})_{i,j=1,\ldots,m}\in\mathbb{R}^{m\times m}$ denote the coefficient matrix of $(X_{1},\ldots,X_{m})^{\top}$ in the last $m$ equations in (S50). Let $\sigma(A)$ denote the set of eigenvalues of $A$. Let

\mathcal{S}_{\overline{\sigma(A)}}=\{X=(X_{0},X_{1},\ldots,X_{m})^{\top}\in\mathcal{S}:\ X_{0}\not\in\sigma(A)\}

denote the set of solutions to (S50) whose $X_{0}$ is not an eigenvalue of $A$. Then

(i) $|\mathcal{S}_{0}|\leq 2$ if $A_{0}=0_{1\times m}$.

(ii) Each $X_{0}\in\mathcal{S}_{0}\backslash\sigma(A)$ corresponds to a unique solution to (S50) with

|\mathcal{S}_{\overline{\sigma(A)}}|=|\mathcal{S}_{0}\backslash\sigma(A)|\leq\left\{\begin{array}{cl}2&\text{if $A_{0}=0_{1\times m}$;}\\ m+2&\text{if $A_{0}\neq 0_{1\times m}$.}\end{array}\right.

(iii) Further assume that $\mathcal{S}_{0}\cap\sigma(A)\neq\emptyset$. Let $\lambda_{0}\in\mathcal{S}_{0}\cap\sigma(A)$ denote an eigenvalue of $A$ that is also in $\mathcal{S}_{0}$. Let $r=\operatorname{rank}(\lambda_{0}I-A)$ denote the rank of $\lambda_{0}I-A$. Let

\mathcal{S}_{\lambda_{0}}=\{X=(X_{0},X_{1},\ldots,X_{m})^{\top}\in\mathcal{S}:\ X_{0}=\lambda_{0}\},\quad \mathcal{S}_{\overline{\lambda_{0}}}=\{X=(X_{0},X_{1},\ldots,X_{m})^{\top}\in\mathcal{S}:\ X_{0}\neq\lambda_{0}\}

denote a partition of $\mathcal{S}$, containing the solutions to (S50) with $X_{0}=\lambda_{0}$ and with $X_{0}\neq\lambda_{0}$, respectively.

(a) If $A\neq\lambda_{0}I$, then $0<r<m$ and there exist constants $g\in\mathbb{R}^{r}$, $G\in\mathbb{R}^{r\times(m-r)}$, $h\in\mathbb{R}^{m-r}$, and $H\in\mathbb{R}^{(m-r)\times r}$ such that

– for all $X=(X_{0},X_{1},\ldots,X_{m})^{\top}\in\mathcal{S}_{\lambda_{0}}$ with $X_{0}=\lambda_{0}$, the corresponding $X_{1:r}=(X_{1},\ldots,X_{r})^{\top}$ and $X_{(r+1):m}=(X_{r+1},\ldots,X_{m})^{\top}$ satisfy $X_{1:r}=g+GX_{(r+1):m}$;

– for all $X=(X_{0},X_{1},\ldots,X_{m})^{\top}\in\mathcal{S}_{\overline{\lambda_{0}}}$ with $X_{0}\neq\lambda_{0}$, the corresponding $X_{1:(m-r)}=(X_{1},\ldots,X_{m-r})^{\top}$ and $X_{(m-r+1):m}=(X_{m-r+1},\ldots,X_{m})^{\top}$ satisfy $X_{1:(m-r)}=h+HX_{(m-r+1):m}$.

(b) If $A=\lambda_{0}I$, then

– for all $X=(X_{0},X_{1},\ldots,X_{m})^{\top}\in\mathcal{S}_{\overline{\lambda_{0}}}$ with $X_{0}\neq\lambda_{0}$, we have $(X_{1},\ldots,X_{m})=(b_{1},\ldots,b_{m})$;

– $|\mathcal{S}_{\overline{\lambda_{0}}}|=|\mathcal{S}_{0}\backslash\{\lambda_{0}\}|\leq\left\{\begin{array}{cl}1&\text{if $A_{0}=0_{1\times m}$;}\\ 2&\text{if $A_{0}\neq 0_{1\times m}$;}\end{array}\right.$

– $|\mathcal{S}_{0}|\leq\left\{\begin{array}{cl}2&\text{if $A_{0}=0_{1\times m}$;}\\ 3&\text{if $A_{0}\neq 0_{1\times m}$.}\end{array}\right.$

Proof of Lemma S11.

Let $U=(X_{1},\ldots,X_{m})^{\top}$, $c=(c_{1},\ldots,c_{m})^{\top}$, $b=(b_{1},\ldots,b_{m})^{\top}$, and

l(t)=c+bt=\begin{pmatrix}c_{1}+b_{1}t\\ \vdots\\ c_{m}+b_{m}t\end{pmatrix}\in\mathbb{R}^{m} (S155)

to write the last $m$ equations in (S50) as

UX_{0}=l(X_{0})+AU. (S156)

Write the first equation in (S50) as

X_{0}^{2}-b_{0}X_{0}-c_{0}=A_{0}U. (S157)

We verify below Lemma S11(i)–(iii) one by one.

Proof of Lemma S11(i):

When $A_{0}=0_{1\times m}$, (S157) reduces to $X_{0}^{2}-b_{0}X_{0}-c_{0}=0$. This ensures that $X_{0}$ takes at most 2 distinct values across all solutions to (S50).

Proof of Lemma S11(ii):

Write (S156) as

(X_{0}I-A)U=l(X_{0}). (S158)

For $X_{0}\in\mathcal{S}_{0}\backslash\sigma(A)$, i.e., an $X_{0}$ value that is not an eigenvalue of $A$, the matrix $X_{0}I-A$ is invertible with

(X_{0}I-A)^{-1}=\frac{1}{\det(X_{0}I-A)}\cdot\text{Adj}(X_{0}I-A), (S159)

where $\det(\cdot)$ and $\text{Adj}(\cdot)$ denote the determinant and adjoint matrix, respectively. (S158)–(S159) together ensure that

U=(X_{0}I-A)^{-1}\,l(X_{0})=\frac{1}{\det(X_{0}I-A)}\cdot\text{Adj}(X_{0}I-A)\cdot l(X_{0}). (S163)

(S163) ensures that each $X_{0}\in\mathcal{S}_{0}\backslash\sigma(A)$ corresponds to a unique solution to (S50) with $|\mathcal{S}_{\overline{\sigma(A)}}|=|\mathcal{S}_{0}\backslash\sigma(A)|$.

It remains to show that

|\mathcal{S}_{0}\backslash\sigma(A)|\leq\left\{\begin{array}{cl}2&\text{if $A_{0}=0_{1\times m}$;}\\ m+2&\text{if $A_{0}\neq 0_{1\times m}$.}\end{array}\right. (S166)

When $A_{0}=0_{1\times m}$, Lemma S11(i) ensures $|\mathcal{S}_{0}\backslash\sigma(A)|\leq|\mathcal{S}_{0}|\leq 2$. When $A_{0}\neq 0_{1\times m}$, plugging (S163) into (S157) ensures

X_{0}^{2}-b_{0}X_{0}-c_{0}=\frac{1}{\det(X_{0}I-A)}\cdot A_{0}\cdot\text{Adj}(X_{0}I-A)\cdot l(X_{0}),

and therefore

\det(X_{0}I-A)\cdot(X_{0}^{2}-b_{0}X_{0}-c_{0})=A_{0}\cdot\text{Adj}(X_{0}I-A)\cdot l(X_{0}). (S167)

Note that $\det(X_{0}I-A)$ is a polynomial of degree $m$ in $X_{0}$ and the elements of $\text{Adj}(X_{0}I-A)$ are polynomials of degree at most $m-1$ in $X_{0}$. This, together with $l(X_{0})=c+bX_{0}$, ensures that (S167) is a polynomial equation of degree $m+2$ in $X_{0}$. By the fundamental theorem of algebra, $X_{0}$ takes at most $m+2$ distinct values, and in particular at most $m+2$ distinct values outside $\sigma(A)$. This verifies (S166).
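To illustrate the degree count, take $m=1$ with $A=(a_{11})$ and $A_{0}=(a_{01})\neq 0$, so that $\text{Adj}(X_{0}I-A)=1$: (S167) reads $(X_{0}-a_{11})(X_{0}^{2}-b_{0}X_{0}-c_{0})=a_{01}(c_{1}+b_{1}X_{0})$, a polynomial equation of degree $3=m+2$ in $X_{0}$ with leading coefficient one, and hence at most $3$ roots.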

Proof of Lemma S11(iii):

From (S155)–(S156), direct algebra ensures that for all $X\in\mathcal{S}$, we have

(U-b)(X_{0}-\lambda_{0})=UX_{0}-\lambda_{0}U-bX_{0}+b\lambda_{0}\overset{\text{(S155)}+\text{(S156)}}{=}(c+bX_{0}+AU)-\lambda_{0}U-bX_{0}+b\lambda_{0}=(c+b\lambda_{0})+AU-\lambda_{0}U\overset{\text{(S155)}}{=}l(\lambda_{0})-(\lambda_{0}I-A)U. (S168)

We verify below the result for $A\neq\lambda_{0}I$ and $A=\lambda_{0}I$, respectively.

Proof of Lemma S11(iiia) for $A\neq\lambda_{0}I$:

Recall $r=\operatorname{rank}(\lambda_{0}I-A)$. When $A\neq\lambda_{0}I$, we have $0<r<m$. From Lemma S10, there exists an invertible $m\times m$ matrix

$\Gamma=\begin{pmatrix}\Gamma_{11}&\Gamma_{12}\\ I_{m-r}&\Gamma_{22}\end{pmatrix}=\begin{pmatrix}\Gamma_{1}\\ \Gamma_{2}\end{pmatrix}$ (S169)

with $\Gamma_{11}\in\mathbb{R}^{r\times(m-r)}$, $\Gamma_{12}\in\mathbb{R}^{r\times r}$, $\Gamma_{22}\in\mathbb{R}^{(m-r)\times r}$, $\Gamma_{1}=(\Gamma_{11},\Gamma_{12})$, and $\Gamma_{2}=(I_{m-r},\Gamma_{22})$ such that

$\Gamma(\lambda_{0}I-A)=\begin{pmatrix}I_{r}&C\\ 0&0\end{pmatrix}$ for some $C\in\mathbb{R}^{r\times(m-r)}$. (S170)

Multiplying both sides of (S168) by $\Gamma$ ensures that

for all $X\in\mathcal{S}$, we have $\Gamma(U-b)(X_{0}-\lambda_{0})=\Gamma\cdot l(\lambda_{0})-\Gamma(\lambda_{0}I-A)U.$ (S173)

We simplify below the left- and right-hand sides of (S173), respectively.

Write

$U=(X_{1},\ldots,X_{m})^{\top}=(X_{1:r}^{\top},X_{(r+1):m}^{\top})^{\top}=(X_{1:(m-r)}^{\top},X_{(m-r+1):m}^{\top})^{\top}$ (S174)

with

$X_{1:r}=(X_{1},\ldots,X_{r})^{\top}$, $X_{(r+1):m}=(X_{r+1},\ldots,X_{m})^{\top}$;
$X_{1:(m-r)}=(X_{1},\ldots,X_{m-r})^{\top}$, $X_{(m-r+1):m}=(X_{m-r+1},\ldots,X_{m})^{\top}$.

This, together with (S169)–(S170), ensures

$\Gamma(\lambda_{0}I-A)U\overset{\textup{(S170)}+\textup{(S174)}}{=}\begin{pmatrix}I_{r}&C\\ 0&0\end{pmatrix}\begin{pmatrix}X_{1:r}\\ X_{(r+1):m}\end{pmatrix}=\begin{pmatrix}X_{1:r}+CX_{(r+1):m}\\ 0\end{pmatrix},$ (S175)

and

\begin{align*}
\Gamma_{2}(U-b)&=\Gamma_{2}U-\Gamma_{2}b\\
&\overset{\textup{(S169)}+\textup{(S174)}}{=}(I_{m-r},\Gamma_{22})\begin{pmatrix}X_{1:(m-r)}\\ X_{(m-r+1):m}\end{pmatrix}-\Gamma_{2}b\\
&=X_{1:(m-r)}+\Gamma_{22}X_{(m-r+1):m}-\Gamma_{2}b.
\end{align*} (S176)

Given (S169)–(S176), the left-hand side of (S173) equals

\begin{align*}
\Gamma(U-b)(X_{0}-\lambda_{0})&\overset{\textup{(S169)}}{=}\begin{pmatrix}\Gamma_{1}\\ \Gamma_{2}\end{pmatrix}(U-b)(X_{0}-\lambda_{0})\\
&=\begin{pmatrix}\Gamma_{1}(U-b)(X_{0}-\lambda_{0})\\ \Gamma_{2}(U-b)(X_{0}-\lambda_{0})\end{pmatrix}\\
&\overset{\textup{(S176)}}{=}\begin{pmatrix}\Gamma_{1}(U-b)(X_{0}-\lambda_{0})\\ \bigl(X_{1:(m-r)}+\Gamma_{22}X_{(m-r+1):m}-\Gamma_{2}b\bigr)(X_{0}-\lambda_{0})\end{pmatrix},
\end{align*} (S177)

and the right-hand side of (S173) equals

\begin{align*}
\Gamma\cdot l(\lambda_{0})-\Gamma(\lambda_{0}I-A)U&\overset{\textup{(S169)}+\textup{(S175)}}{=}\begin{pmatrix}\Gamma_{1}\\ \Gamma_{2}\end{pmatrix}l(\lambda_{0})-\begin{pmatrix}X_{1:r}+CX_{(r+1):m}\\ 0\end{pmatrix}\\
&=\begin{pmatrix}\Gamma_{1}\cdot l(\lambda_{0})-X_{1:r}-CX_{(r+1):m}\\ \Gamma_{2}\cdot l(\lambda_{0})\end{pmatrix}.
\end{align*} (S178)

Given (S173), comparing the first and second rows of (S177) and (S178) ensures that

for all $X\in\mathcal{S}$, we have
$\Gamma_{1}(U-b)(X_{0}-\lambda_{0})=\Gamma_{1}\cdot l(\lambda_{0})-X_{1:r}-CX_{(r+1):m},$ (S179)
$\bigl(X_{1:(m-r)}+\Gamma_{22}X_{(m-r+1):m}-\Gamma_{2}b\bigr)(X_{0}-\lambda_{0})=\Gamma_{2}\cdot l(\lambda_{0}).$ (S180)

The derivation so far, hence (S179)–(S180), depends only on $0<r<m$ and holds regardless of whether $\lambda_{0}\in\mathcal{S}_{0}$. Now that $\lambda_{0}$ is indeed within $\mathcal{S}_{0}$, (S179)–(S180) together lead to the following:

  • Letting $X_{0}=\lambda_{0}$ in (S179) ensures

    for all $X=(X_{0},X_{1},\ldots,X_{m})^{\top}\in\mathcal{S}_{\lambda_{0}}$ with $X_{0}=\lambda_{0}$, we have $X_{1:r}=\Gamma_{1}\cdot l(\lambda_{0})-CX_{(r+1):m}.$ (S183)

    This verifies the first part of Lemma S11(iiia) with $g=\Gamma_{1}\cdot l(\lambda_{0})$ and $G=-C$.

  • Letting $X_{0}=\lambda_{0}$ in (S180) ensures

    $\Gamma_{2}\cdot l(\lambda_{0})=0,$ (S184)

    as a constraint on the coefficients $\{c_{i},b_{i}:i=1,\ldots,m\}$; cf. $l(\lambda_{0})=c+b\lambda_{0}$ by (S155). Plugging (S184) back in (S180) ensures that

    for all $X\in\mathcal{S}$, we have $\bigl(X_{1:(m-r)}+\Gamma_{22}X_{(m-r+1):m}-\Gamma_{2}b\bigr)(X_{0}-\lambda_{0})=0.$ (S187)

    From (S187), we have

    $X_{1:(m-r)}=\Gamma_{2}b-\Gamma_{22}X_{(m-r+1):m}$ for all $X\in\mathcal{S}_{\overline{\lambda_{0}}}$ with $X_{0}\neq\lambda_{0}$. (S188)

    This verifies the second part of Lemma S11(iiia) with $h=\Gamma_{2}b$ and $H=-\Gamma_{22}$.

Proof of Lemma S11(iiib) for $A=\lambda_{0}I$:

When $A=\lambda_{0}I$, (S168) simplifies to

for all $X\in\mathcal{S}$, we have $(U-b)(X_{0}-\lambda_{0})=l(\lambda_{0}).$ (S189)

Given $\lambda_{0}\in\mathcal{S}_{0}$ by assumption, letting $X_{0}=\lambda_{0}$ in (S189) ensures

$l(\lambda_{0})=0,$ (S190)

as a constraint on $\{c_{i},b_{i}:i=1,\ldots,m\}$ analogous to (S184). Plugging (S190) back in (S189) ensures

for all $X\in\mathcal{S}$, we have $(U-b)(X_{0}-\lambda_{0})=0,$ (S191)

analogous to (S187). This ensures

for all $X=(X_{0},X_{1},\ldots,X_{m})^{\top}\in\mathcal{S}_{\overline{\lambda_{0}}}$ with $X_{0}\neq\lambda_{0}$, we have $U=b$, (S192)

analogous to (S188).

In addition, $A=\lambda_{0}I$ implies $\sigma(A)=\{\lambda_{0}\}$ with $\mathcal{S}_{\overline{\lambda_{0}}}=\mathcal{S}_{\overline{\sigma(A)}}$. This, together with Lemma S11(ii), ensures

$|\mathcal{S}_{\overline{\lambda_{0}}}|=|\mathcal{S}_{\overline{\sigma(A)}}|=|\mathcal{S}_{0}\backslash\sigma(A)|=|\mathcal{S}_{0}\backslash\{\lambda_{0}\}|.$

When $A_{0}=0_{1\times m}$, Lemma S11(i) ensures $|\mathcal{S}_{0}|\leq 2$ such that $|\mathcal{S}_{0}\backslash\{\lambda_{0}\}|=|\mathcal{S}_{0}|-1\leq 1$. When $A_{0}\neq 0_{1\times m}$, plugging (S192) in (S157) ensures that

for all $X=(X_{0},X_{1},\ldots,X_{m})^{\top}\in\mathcal{S}_{\overline{\lambda_{0}}}$ with $X_{0}\neq\lambda_{0}$, we have $X_{0}^{2}-b_{0}X_{0}-c_{0}=A_{0}b.$

The fundamental theorem of algebra ensures that $X_{0}$ takes at most 2 distinct values beyond $\lambda_{0}$, i.e., $|\mathcal{S}_{0}\backslash\{\lambda_{0}\}|\leq 2$. The bound on $|\mathcal{S}_{0}|$ then follows from $|\mathcal{S}_{0}|=|\mathcal{S}_{0}\backslash\{\lambda_{0}\}|+1$.
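For a concrete instance of this case (coefficients chosen here purely as an example): take $m=1$, $\lambda_{0}=0$, $A=(a_{11})=(0)=\lambda_{0}I$, $b_{0}=c_{0}=c_{1}=0$, $b_{1}=1$, and $a_{01}=1$. Then (S50) reads $X_{0}^{2}=X_{1}$ and $X_{1}X_{0}=X_{0}$, with solution set $\{(0,0),(1,1),(-1,1)\}$: we have $U=X_{1}=b_{1}$ whenever $X_{0}\neq\lambda_{0}$, as in (S192), and $|\mathcal{S}_{0}|=3$ attains the bound.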

Remark S1.

From Lemma S10, the definition of $\Gamma$ in (S169) reduces to $\Gamma=I_{m}=\Gamma_{2}$ when $r=0$. This underlies the correspondence between (S190)–(S192) and (S184)–(S188), and ensures that the proof of Lemma S11(iiia) implies that of Lemma S11(iiib) as a special case. We nonetheless provide a separate proof for clarity.

S5.2 Proof of Proposition S4

We verify below Proposition S4 by induction.

Proof of the result for $m=1$.

When $m=1$, (S50) reduces to

\begin{align*}
X_{0}^{2}&=c_{0}+b_{0}X_{0}+a_{01}X_{1},\qquad\textup{(S193)}\\
X_{1}X_{0}&=c_{1}+b_{1}X_{0}+a_{11}X_{1}\qquad\textup{(S194)}
\end{align*}

for fixed constants $\{c_{i},b_{i},a_{i1}:i=0,1\}$. If $a_{01}=0$, then (S193) reduces to $X_{0}^{2}=c_{0}+b_{0}X_{0}$ so that $X_{0}$ takes at most 2 distinct values. If $a_{01}\neq 0$, then (S193) implies $X_{1}=a_{01}^{-1}(X_{0}^{2}-b_{0}X_{0}-c_{0})$. Plugging this in (S194) yields

$a_{01}^{-1}(X_{0}^{2}-b_{0}X_{0}-c_{0})X_{0}=c_{1}+b_{1}X_{0}+a_{11}a_{01}^{-1}(X_{0}^{2}-b_{0}X_{0}-c_{0}),$

a cubic equation in $X_{0}$. The fundamental theorem of algebra ensures that $X_{0}$ takes at most 3 distinct values. This verifies the result for $m=1$.
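To see that this bound is attainable (coefficients chosen here purely as an example): take $c_{0}=b_{0}=0$, $a_{01}=1$, $c_{1}=0$, $b_{1}=3$, and $a_{11}=2$. Then $X_{1}=X_{0}^{2}$ and the cubic becomes $X_{0}^{3}=3X_{0}+2X_{0}^{2}$, i.e., $X_{0}(X_{0}+1)(X_{0}-3)=0$, so $X_{0}\in\{-1,0,3\}$ takes exactly $3=m+2$ distinct values.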

Proof of the result for $m\geq 2$.

Assume that the result holds for $m=1,\ldots,Q-1$. We verify below the result for $m=Q$, where (S50) becomes

$\begin{pmatrix}X_{0}\\ X_{1}\\ \vdots\\ X_{Q}\end{pmatrix}X_{0}=\begin{pmatrix}c_{0}\\ c_{1}\\ \vdots\\ c_{Q}\end{pmatrix}+\begin{pmatrix}b_{0}\\ b_{1}\\ \vdots\\ b_{Q}\end{pmatrix}X_{0}+\begin{pmatrix}a_{01}&\ldots&a_{0Q}\\ a_{11}&\ldots&a_{1Q}\\ &\ddots&\\ a_{Q1}&\ldots&a_{QQ}\end{pmatrix}\begin{pmatrix}X_{1}\\ \vdots\\ X_{Q}\end{pmatrix}.$ (S195)

Following Lemma S11, redefine

\begin{align*}
\mathcal{S}&=\{X=(X_{0},X_{1},\ldots,X_{Q})^{\top}\in\mathbb{R}^{Q+1}:X\text{ satisfies (S195)}\},\\
\mathcal{S}_{0}&=\{X_{0}:\text{there exists }X=(X_{0},X_{1},\ldots,X_{Q})^{\top}\in\mathcal{S}\}.
\end{align*}

The goal is to show that $|\mathcal{S}_{0}|\leq Q+2$.

First, let $U=(X_{1},\ldots,X_{Q})^{\top}$, $c=(c_{1},\ldots,c_{Q})^{\top}$, $b=(b_{1},\ldots,b_{Q})^{\top}$, $A_{0}=(a_{01},\ldots,a_{0Q})\in\mathbb{R}^{1\times Q}$, and $A=(a_{qq^{\prime}})_{q,q^{\prime}=1,\ldots,Q}\in\mathbb{R}^{Q\times Q}$ to write (S195) as

$\begin{pmatrix}X_{0}\\ U\end{pmatrix}X_{0}=\begin{pmatrix}c_{0}\\ c\end{pmatrix}+\begin{pmatrix}b_{0}\\ b\end{pmatrix}X_{0}+\begin{pmatrix}A_{0}\\ A\end{pmatrix}U.$ (S196)

Let $\sigma(A)$ denote the set of eigenvalues of $A$. We have the following observations from Lemma S11:

  • If $\mathcal{S}_{0}\cap\sigma(A)=\emptyset$ so that $X_{0}$ does not equal any eigenvalue of $A$ across all solutions to (S195), then $\mathcal{S}_{0}=\mathcal{S}_{0}\backslash\sigma(A)$, and it follows from Lemma S11(ii) that $|\mathcal{S}_{0}|=|\mathcal{S}_{0}\backslash\sigma(A)|\leq Q+2$.

  • If $\mathcal{S}_{0}\cap\sigma(A)\neq\emptyset$ and there exists $\lambda_{0}\in\mathcal{S}_{0}\cap\sigma(A)$ such that $A=\lambda_{0}I$, then it follows from Lemma S11(iiib) that $|\mathcal{S}_{0}|\leq 3\leq Q+2$.

Therefore, it remains to show that $|\mathcal{S}_{0}|\leq Q+2$ when

$\mathcal{S}_{0}\cap\sigma(A)\neq\emptyset\quad\text{and}\quad A\neq\lambda I\ \text{ for all }\lambda\in\mathcal{S}_{0}\cap\sigma(A).$ (S197)

Proof of $|\mathcal{S}_{0}|\leq Q+2$ under (S197).

Let $\lambda_{0}$ denote an element in $\mathcal{S}_{0}\cap\sigma(A)$ with $A\neq\lambda_{0}I$. Let $r=\operatorname{rank}(\lambda_{0}I-A)$ denote the rank of $\lambda_{0}I-A$, with $0<r<Q$. Let

$X_{1:(Q-r)}=(X_{1},\ldots,X_{Q-r})^{\top}$, $X_{(Q-r+1):Q}=(X_{Q-r+1},\ldots,X_{Q})^{\top}$

denote a partition of $U=(X_{1},\ldots,X_{Q})^{\top}$. The last $r$ equations in (S196) are

\begin{align*}
X_{(Q-r+1):Q}X_{0}&=\begin{pmatrix}X_{Q-r+1}\\ \vdots\\ X_{Q}\end{pmatrix}X_{0}\\
&=\begin{pmatrix}c_{Q-r+1}\\ \vdots\\ c_{Q}\end{pmatrix}+\begin{pmatrix}b_{Q-r+1}\\ \vdots\\ b_{Q}\end{pmatrix}X_{0}+\begin{pmatrix}a_{Q-r+1,1}&\ldots&a_{Q-r+1,Q}\\ &\ddots&\\ a_{Q1}&\ldots&a_{QQ}\end{pmatrix}U\\
&=c_{(Q-r+1):Q}+b_{(Q-r+1):Q}X_{0}+A_{(Q-r+1):Q}U,
\end{align*} (S198)

where

$c_{(Q-r+1):Q}=\begin{pmatrix}c_{Q-r+1}\\ \vdots\\ c_{Q}\end{pmatrix},\quad b_{(Q-r+1):Q}=\begin{pmatrix}b_{Q-r+1}\\ \vdots\\ b_{Q}\end{pmatrix},\quad A_{(Q-r+1):Q}=\begin{pmatrix}a_{Q-r+1,1}&\ldots&a_{Q-r+1,Q}\\ &\ddots&\\ a_{Q1}&\ldots&a_{QQ}\end{pmatrix}.$

In addition, let

$\mathcal{S}_{\overline{\lambda_{0}}}=\{X=(X_{0},X_{1},\ldots,X_{Q})^{\top}\in\mathcal{S}:X_{0}\neq\lambda_{0}\}.$

Lemma S11(iii) ensures that there exist constants $h\in\mathbb{R}^{Q-r}$ and $H\in\mathbb{R}^{(Q-r)\times r}$ such that

for all $X=(X_{0},X_{1},\ldots,X_{Q})^{\top}\in\mathcal{S}_{\overline{\lambda_{0}}}$ with $X_{0}\neq\lambda_{0}$, we have $X_{1:(Q-r)}=h+HX_{(Q-r+1):Q}.$ (S201)

Consequently, we have

for all $X\in\mathcal{S}_{\overline{\lambda_{0}}}$ with $X_{0}\neq\lambda_{0}$: $U=\begin{pmatrix}X_{1:(Q-r)}\\ X_{(Q-r+1):Q}\end{pmatrix}=\begin{pmatrix}h\\ 0\end{pmatrix}+\begin{pmatrix}H\\ I_{r}\end{pmatrix}X_{(Q-r+1):Q}.$ (S204)

(S196)–(S204) together ensure that

for all $X=(X_{0},X_{1},\ldots,X_{Q})^{\top}\in\mathcal{S}_{\overline{\lambda_{0}}}$ with $X_{0}\neq\lambda_{0}$, we have
\begin{align*}
\begin{pmatrix}X_{0}\\ X_{(Q-r+1):Q}\end{pmatrix}X_{0}&\overset{\textup{(S196)}+\textup{(S198)}}{=}\begin{pmatrix}c_{0}\\ c_{(Q-r+1):Q}\end{pmatrix}+\begin{pmatrix}b_{0}\\ b_{(Q-r+1):Q}\end{pmatrix}X_{0}+\begin{pmatrix}A_{0}\\ A_{(Q-r+1):Q}\end{pmatrix}U\\
&\overset{\textup{(S204)}}{=}\begin{pmatrix}c_{0}\\ c_{(Q-r+1):Q}\end{pmatrix}+\begin{pmatrix}b_{0}\\ b_{(Q-r+1):Q}\end{pmatrix}X_{0}+\begin{pmatrix}A_{0}\\ A_{(Q-r+1):Q}\end{pmatrix}\left\{\begin{pmatrix}h\\ 0\end{pmatrix}+\begin{pmatrix}H\\ I_{r}\end{pmatrix}X_{(Q-r+1):Q}\right\}.
\end{align*} (S208)

In words, (S208) suggests that, across all $X\in\mathcal{S}_{\overline{\lambda_{0}}}$ with $X_{0}\neq\lambda_{0}$, the subvector $X^{\prime}=(X_{0},X_{(Q-r+1):Q}^{\top})^{\top}$ satisfies a system of the form (S50) with $m=r$.

By induction, applying Proposition S4 to $X^{\prime}=(X_{0},X_{(Q-r+1):Q}^{\top})^{\top}$ at $m=r\,(\leq Q-1)$ ensures that $X_{0}$ takes at most $r+2$ distinct values across all $X\in\mathcal{S}_{\overline{\lambda_{0}}}$.

This ensures $|\mathcal{S}_{0}\backslash\{\lambda_{0}\}|\leq r+2\leq Q+1$ so that $|\mathcal{S}_{0}|=|\mathcal{S}_{0}\backslash\{\lambda_{0}\}|+1\leq Q+2$.

S5.3 Proof of Proposition S5

We verify below Proposition S5 by induction.

Proof of the result for $m=1$.

Assume $X=X_{1}\in\mathbb{R}$ is a scalar such that $XX^{\top}=X_{1}^{2}$ is linear in $(1,X)=(1,X_{1})$. Then there exist constants $a_{0},a_{1}\in\mathbb{R}$ such that $X_{1}^{2}=a_{0}+a_{1}X_{1}$. By the fundamental theorem of algebra, $X_{1}$ takes at most 2 distinct values.
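For example (constants chosen here for illustration), $a_{0}=1$ and $a_{1}=0$ give $X_{1}^{2}=1$, so $X_{1}\in\{-1,1\}$ and the bound of $2=m+1$ distinct values is attained.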

Proof of the result for $m\geq 2$.

Assume that the result holds for $m=1,\ldots,Q$. We verify below the result for $m=Q+1$. Assume $X=(X_{1},\ldots,X_{Q},X_{Q+1})^{\top}$ is a $(Q+1)\times 1$ vector such that all elements of

$XX^{\top}=\left(\begin{array}{ccc|c}X_{1}X_{1}&\ldots&X_{1}X_{Q}&X_{1}X_{Q+1}\\ \vdots&&\vdots&\vdots\\ X_{Q}X_{1}&\ldots&X_{Q}X_{Q}&X_{Q}X_{Q+1}\\ \hline X_{Q+1}X_{1}&\ldots&X_{Q+1}X_{Q}&X_{Q+1}X_{Q+1}\end{array}\right)$ (S215)

are linear in $(1,X)$ with

$X_{q}X_{q^{\prime}}=c_{qq^{\prime}}+\sum_{k=1}^{Q+1}a_{qq^{\prime}[k]}X_{k}$ for $q,q^{\prime}=1,\ldots,Q+1$ (S216)

for constants $\{c_{qq^{\prime}},a_{qq^{\prime}[k]}\in\mathbb{R}:q,q^{\prime},k=1,\ldots,Q+1\}$. Without loss of generality, assume the coefficients for $X_{q}X_{q^{\prime}}$ and $X_{q^{\prime}}X_{q}$ are identical with

$c_{qq^{\prime}}=c_{q^{\prime}q},\quad a_{qq^{\prime}[k]}=a_{q^{\prime}q[k]}\quad\text{for }q,q^{\prime},k=1,\ldots,Q+1.$ (S217)

Let

$\mathcal{S}=\{X=(X_{1},\ldots,X_{Q+1})^{\top}\in\mathbb{R}^{Q+1}:X\text{ satisfies (S216)}\}$

denote the solution set of (S216). The goal is to show that $|\mathcal{S}|\leq Q+2$.
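For a concrete illustration at $Q=1$ (coefficients chosen here purely as an example): the system $X_{1}^{2}=1$, $X_{2}^{2}=1$, $X_{1}X_{2}=X_{1}+X_{2}-1$ makes every element of $XX^{\top}$ linear in $(1,X_{1},X_{2})$, and its solution set is $\{(1,1)^{\top},(1,-1)^{\top},(-1,1)^{\top}\}$, so $|\mathcal{S}|=3=Q+2$ and the bound is attained.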

To this end, consider the last column of $XX^{\top}$ in (S215), which consists of $X_{1}X_{Q+1},\ldots,X_{Q}X_{Q+1},X_{Q+1}^{2}$. The corresponding part of (S216) is

$X_{q}X_{Q+1}=c_{q,Q+1}+\sum_{k=1}^{Q+1}a_{q,Q+1[k]}X_{k}$ for $q=1,\ldots,Q+1$,

and can be summarized in matrix form as

\begin{align*}
\begin{pmatrix}X_{Q+1}\\ X_{1}\\ \vdots\\ X_{Q}\end{pmatrix}X_{Q+1}&=\begin{pmatrix}c_{Q+1,Q+1}\\ c_{1,Q+1}\\ \vdots\\ c_{Q,Q+1}\end{pmatrix}+\begin{pmatrix}a_{Q+1,Q+1[Q+1]}\\ a_{1,Q+1[Q+1]}\\ \vdots\\ a_{Q,Q+1[Q+1]}\end{pmatrix}X_{Q+1}\\
&\quad+\begin{pmatrix}a_{Q+1,Q+1[1]}&\ldots&a_{Q+1,Q+1[Q]}\\ a_{1,Q+1[1]}&\ldots&a_{1,Q+1[Q]}\\ &\ddots&\\ a_{Q,Q+1[1]}&\ldots&a_{Q,Q+1[Q]}\end{pmatrix}\begin{pmatrix}X_{1}\\ \vdots\\ X_{Q}\end{pmatrix}.
\end{align*} (S218)

Let

$A=\begin{pmatrix}a_{1,Q+1[1]}&\ldots&a_{1,Q+1[Q]}\\ &\ddots&\\ a_{Q,Q+1[1]}&\ldots&a_{Q,Q+1[Q]}\end{pmatrix}\in\mathbb{R}^{Q\times Q}$

denote the $Q\times Q$ coefficient matrix of $(X_{1},\ldots,X_{Q})$ in the last $Q$ rows of (S218). Let $\sigma(A)$ denote the set of eigenvalues of $A$. Let

$\mathcal{S}_{Q+1}=\{X_{Q+1}:\text{there exists }X=(X_{1},\ldots,X_{Q+1})^{\top}\in\mathcal{S}\}$

denote the set of values of $X_{Q+1}$ across all solutions to (S216). Note that all solutions to (S216) must satisfy (S218). Applying Proposition S4 and Lemma S11(ii) to (S218), with $X_{Q+1}$ treated as $X_{0}$, ensures the following:

  (i) $|\mathcal{S}_{Q+1}|\leq Q+2$ by Proposition S4.

  (ii) If $\mathcal{S}_{Q+1}\cap\sigma(A)=\emptyset$ such that $X_{Q+1}$ does not equal any eigenvalue of $A$ across all solutions to (S216), then $\mathcal{S}_{Q+1}=\mathcal{S}_{Q+1}\backslash\sigma(A)$, and it follows from Lemma S11(ii) that $|\mathcal{S}|=|\mathcal{S}_{Q+1}\backslash\sigma(A)|=|\mathcal{S}_{Q+1}|$.

These two observations together ensure

$|\mathcal{S}|=|\mathcal{S}_{Q+1}|\leq Q+2$ when $\mathcal{S}_{Q+1}\cap\sigma(A)=\emptyset$.

We show below $|\mathcal{S}|\leq Q+2$ when $\mathcal{S}_{Q+1}\cap\sigma(A)\neq\emptyset$.

Let $\lambda_{0}$ denote an element in $\mathcal{S}_{Q+1}\cap\sigma(A)$, i.e., an eigenvalue of $A$ that $X_{Q+1}$ takes in at least one solution to (S216). We verify below $|\mathcal{S}|\leq Q+2$ for the cases of $A=\lambda_{0}I$ and $A\neq\lambda_{0}I$, respectively. Throughout, let

\begin{align*}
\mathcal{S}_{\lambda_{0}}&=\{X=(X_{1},\ldots,X_{Q},X_{Q+1})^{\top}\in\mathcal{S}:X_{Q+1}=\lambda_{0}\},\\
\mathcal{S}_{\overline{\lambda_{0}}}&=\{X=(X_{1},\ldots,X_{Q},X_{Q+1})^{\top}\in\mathcal{S}:X_{Q+1}\neq\lambda_{0}\}
\end{align*}

denote a partition of $\mathcal{S}$, containing the solutions to (S216) that have $X_{Q+1}=\lambda_{0}$ and $X_{Q+1}\neq\lambda_{0}$, respectively.

S5.3.1 $|\mathcal{S}|\leq Q+2$ when $A=\lambda_{0}I$.

Let $q=q^{\prime}=Q+1$ in (S216) to see

$X_{Q+1}^{2}=c_{Q+1,Q+1}+a_{Q+1,Q+1[Q+1]}X_{Q+1}+\sum_{k=1}^{Q}a_{Q+1,Q+1[k]}X_{k}.$ (S219)

Condition S1 below states the condition that the coefficients of $X_{1},\ldots,X_{Q}$ in (S219) all equal zero. We verify below that

\begin{align*}
|\mathcal{S}_{\overline{\lambda_{0}}}|&\leq\begin{cases}1&\text{if Condition S1 holds;}\\ 2&\text{otherwise,}\end{cases}\qquad\textup{(S222)}\\
|\mathcal{S}_{\lambda_{0}}|&\leq\begin{cases}Q+1&\text{if Condition S1 holds;}\\ Q&\text{otherwise.}\end{cases}\qquad\textup{(S225)}
\end{align*}

These two parts together ensure $|\mathcal{S}|=|\mathcal{S}_{\lambda_{0}}|+|\mathcal{S}_{\overline{\lambda_{0}}}|\leq Q+2$.

Condition S1.

$a_{Q+1,Q+1[k]}=0$ for $k=1,\ldots,Q$.

Part I: Proof of (S222).

Condition S1 ensures that $A_{Q+1}=(a_{Q+1,Q+1[1]},\ldots,a_{Q+1,Q+1[Q]})$ equals zero, which corresponds to the condition $A_{0}=0_{1\times m}$ in Lemma S11(iiib). Inequality (S222) follows from Lemma S11(iiib) with $X_{Q+1}$ treated as $X_{0}$.

Part II: Proof of (S225).

First, (S216) ensures

$X_{q}X_{q^{\prime}}=c_{qq^{\prime}}+a_{qq^{\prime}[Q+1]}X_{Q+1}+\sum_{k=1}^{Q}a_{qq^{\prime}[k]}X_{k}$ for $q,q^{\prime}=1,\ldots,Q$. (S226)

Letting $X_{Q+1}=\lambda_{0}$ in (S226) ensures that

for all $X=(X_{1},\ldots,X_{Q+1})^{\top}\in\mathcal{S}_{\lambda_{0}}$ with $X_{Q+1}=\lambda_{0}$, we have $X_{q}X_{q^{\prime}}=(c_{qq^{\prime}}+a_{qq^{\prime}[Q+1]}\lambda_{0})+\sum_{k=1}^{Q}a_{qq^{\prime}[k]}X_{k}$ for $q,q^{\prime}=1,\ldots,Q$, (S231)

so that the products $\{X_{q}X_{q^{\prime}}:q,q^{\prime}=1,\ldots,Q\}$ are linear in $(1,X_{1},\ldots,X_{Q})$ across $\mathcal{S}_{\lambda_{0}}$. By induction, applying Proposition S5 to $X^{\prime}=(X_{1},\ldots,X_{Q})^{\top}$ at $m=Q$ ensures that $(X_{1},\ldots,X_{Q})^{\top}$ takes at most $Q+1$ distinct values across all $X\in\mathcal{S}_{\lambda_{0}}$.

This, with $X_{Q+1}=\lambda_{0}$ fixed over $\mathcal{S}_{\lambda_{0}}$, ensures $|\mathcal{S}_{\lambda_{0}}|\leq Q+1$, and verifies the first line of (S225) when Condition S1 holds.

When Condition S1 does not hold, there exists $k\in\{1,\ldots,Q\}$ such that $a_{Q+1,Q+1[k]}\neq 0$. Without loss of generality, assume that $a_{Q+1,Q+1[Q]}\neq 0$. Letting $X_{Q+1}=\lambda_{0}$ in (S219) implies

$X_{Q}=a_{Q+1,Q+1[Q]}^{-1}\Bigl(\lambda_{0}^{2}-c_{Q+1,Q+1}-a_{Q+1,Q+1[Q+1]}\lambda_{0}-\sum_{k=1}^{Q-1}a_{Q+1,Q+1[k]}X_{k}\Bigr)$ for all $X\in\mathcal{S}_{\lambda_{0}}$. (S238)

Plugging the expression of $X_{Q}$ in (S238) back in (S231) ensures that the products $\{X_{q}X_{q^{\prime}}:q,q^{\prime}=1,\ldots,Q-1\}$ are linear in $(1,X_{1},\ldots,X_{Q-1})$ across all $X\in\mathcal{S}_{\lambda_{0}}$.

By induction, applying Proposition S5 to $X^{\prime}=(X_{1},\ldots,X_{Q-1})^{\top}$ at $m=Q-1$ ensures that

across all $X\in\mathcal{S}_{\lambda_{0}}$, the subvector $(X_{1},\ldots,X_{Q-1})^{\top}$ takes at most $Q$ distinct values. (S242)

The expression of $X_{Q}$ in (S238) further ensures that

for all $X\in\mathcal{S}_{\lambda_{0}}$, the value of $X_{Q}$ is uniquely determined by $(X_{1},\ldots,X_{Q-1})^{\top}$. (S245)

Observations (S242)–(S245) together ensure $|\mathcal{S}_{\lambda_{0}}|\leq Q$ in the second line of (S225) when Condition S1 does not hold.


S5.3.2 $|\mathcal{S}|\leq Q+2$ when $A\neq\lambda_{0}I$.

We now show that $|\mathcal{S}|\leq Q+2$ when $A\neq\lambda_{0}I$. The proof parallels that for the case of $A=\lambda_{0}I$ but is more technical. We first reparameterize $X$ to state a condition analogous to Condition S1, and then bound $|\mathcal{S}_{\lambda_{0}}|$ and $|\mathcal{S}_{\overline{\lambda_{0}}}|$ when the new condition holds and when it does not, analogous to (S222)–(S225).

Let $r=\operatorname{rank}(\lambda_{0}I-A)$, with $1\leq r\leq Q-1$ when $A\neq\lambda_{0}I$. Let

$X_{1:r}=(X_{1},\ldots,X_{r})^{\top}$, $X_{(r+1):Q}=(X_{r+1},\ldots,X_{Q})^{\top}$;
$X_{1:(Q-r)}=(X_{1},\ldots,X_{Q-r})^{\top}$, $X_{(Q-r+1):Q}=(X_{Q-r+1},\ldots,X_{Q})^{\top}$

define two partitions of $(X_{1},\ldots,X_{Q})^{\top}$. Applying Lemma S11(iiia) to (S218), with $X_{Q+1}$ treated as $X_{0}$, ensures that there exist constants $g\in\mathbb{R}^{r}$, $G\in\mathbb{R}^{r\times(Q-r)}$, $h\in\mathbb{R}^{Q-r}$, and $H\in\mathbb{R}^{(Q-r)\times r}$ such that

\begin{align*}
X_{1:r}&=g+GX_{(r+1):Q}&&\text{for all }X=(X_{1},\ldots,X_{Q+1})^{\top}\in\mathcal{S}_{\lambda_{0}}\text{ with }X_{Q+1}=\lambda_{0},\\
X_{1:(Q-r)}&=h+HX_{(Q-r+1):Q}&&\text{for all }X=(X_{1},\ldots,X_{Q+1})^{\top}\in\mathcal{S}_{\overline{\lambda_{0}}}\text{ with }X_{Q+1}\neq\lambda_{0}.
\end{align*} (S248)

A reparametrization.

Let $G_{q}\in\mathbb{R}^{1\times(Q-r)}$ denote the $q$th row of $G$ for $q=1,\ldots,r$, with $G=(G_{1},\ldots,G_{r})^{\top}$. Motivated by the first row of (S248), define $X^{*}=(X_{1}^{*},\ldots,X_{Q+1}^{*})^{\top}=(X_{1:r}^{*\top},X_{(r+1):Q}^{*\top},X_{Q+1}^{*})^{\top}$ as a reparametrization of $X=(X_{1},\ldots,X_{Q+1})^{\top}$ with

$X_{1:r}^{*}=(X^{*}_{1},\ldots,X^{*}_{r})^{\top}=X_{1:r}-GX_{(r+1):Q},\qquad X_{(r+1):Q}^{*}=(X^{*}_{r+1},\ldots,X^{*}_{Q})^{\top}=X_{(r+1):Q},\qquad X_{Q+1}^{*}=X_{Q+1},$ (S251)

summarized as

$X^{*}=\begin{pmatrix}X_{1:r}^{*}\\ X_{(r+1):Q}^{*}\\ X_{Q+1}^{*}\end{pmatrix}=\begin{pmatrix}I_{r}&-G&0\\ 0&I_{Q-r}&0\\ 0&0&1\end{pmatrix}\begin{pmatrix}X_{1:r}\\ X_{(r+1):Q}\\ X_{Q+1}\end{pmatrix}=\Gamma_{0}X,\quad\text{where}\quad\Gamma_{0}=\begin{pmatrix}I_{r}&-G&0\\ 0&I_{Q-r}&0\\ 0&0&1\end{pmatrix}.$

Since $\Gamma_{0}$ is block upper triangular with identity diagonal blocks, it is invertible, with $\Gamma_{0}^{-1}$ obtained by replacing $-G$ with $G$; the map $X\mapsto X^{*}$ is thus a bijection.

Denote by

$X_{q}^{*}X_{q^{\prime}}^{*}=c_{qq^{\prime}}+\sum_{k=1}^{Q+1}a_{qq^{\prime}[k]}^{*}X_{k}^{*}$ for $q,q^{\prime}=1,\ldots,Q+1$, (S252)

the reparameterization of (S216) in terms of $X^{*}=(X_{1}^{*},\ldots,X_{Q+1}^{*})^{\top}$. The coefficients $\{a_{qq^{\prime}[k]}^{*}:q,q^{\prime},k=1,\ldots,Q+1\}$ in (S252) are constants fully determined by the $a_{qq^{\prime}[k]}$'s and $G$, and satisfy

$a_{qq^{\prime}[k]}^{*}=a^{*}_{q^{\prime}q[k]}$ for $q,q^{\prime},k=1,\ldots,Q+1$ (S253)

under (S217). We omit their explicit forms because they are irrelevant to the proof. Let

$\mathcal{S}^{*}=\{X^{*}=(X^{*}_{1},\ldots,X^{*}_{Q+1})^{\top}\in\mathbb{R}^{Q+1}:X^{*}\text{ satisfies (S252)}\}$

denote the solution set of (S252). Let

\begin{align*}
\mathcal{S}^{*}_{\lambda_{0}}&=\{X^{*}=(X^{*}_{1},\ldots,X^{*}_{Q+1})^{\top}\in\mathcal{S}^{*}:X_{Q+1}^{*}=\lambda_{0}\},\\
\mathcal{S}_{\overline{\lambda_{0}}}^{*}&=\{X^{*}=(X^{*}_{1},\ldots,X^{*}_{Q+1})^{\top}\in\mathcal{S}^{*}:X_{Q+1}^{*}\neq\lambda_{0}\}
\end{align*}

denote a partition of $\mathcal{S}^{*}$, containing the solutions to (S252) that have $X_{Q+1}^{*}=\lambda_{0}$ and $X_{Q+1}^{*}\neq\lambda_{0}$, respectively. The definition of (S252) ensures

$X\in\mathcal{S}_{\lambda_{0}}\Longleftrightarrow X^{*}=\Gamma_{0}X\in\mathcal{S}^{*}_{\lambda_{0}}$ and $X\in\mathcal{S}_{\overline{\lambda_{0}}}\Longleftrightarrow X^{*}=\Gamma_{0}X\in\mathcal{S}_{\overline{\lambda_{0}}}^{*}$, with $|\mathcal{S}_{\lambda_{0}}|=|\mathcal{S}^{*}_{\lambda_{0}}|$ and $|\mathcal{S}_{\overline{\lambda_{0}}}|=|\mathcal{S}_{\overline{\lambda_{0}}}^{*}|$. (S256)

Eq. (S256) ensures that every $X^{*}\in\mathcal{S}^{*}_{\lambda_{0}}$ equals $\Gamma_{0}X$ for some $X\in\mathcal{S}_{\lambda_{0}}$ with $X_{Q+1}=\lambda_{0}$. This, together with (S248) and $X_{1:r}^{*}=X_{1:r}-GX_{(r+1):Q}$ from (S251), ensures that

for all $X^{*}=(X^{*}_{1},\ldots,X^{*}_{Q+1})^{\top}\in\mathcal{S}^{*}_{\lambda_{0}}$ with $X_{Q+1}^{*}=\lambda_{0}$, we have $X_{1:r}^{*}=g$, i.e., $X^{*}_{q}=g_{q}$ for $q=1,\ldots,r$, (S260)

where $g_{q}$ denotes the $q$th element of $g$ for $q=1,\ldots,r$.

A condition analogous to Condition S1.

Write (S252) as

$X_{q}^{*}X_{q^{\prime}}^{*}=c_{qq^{\prime}}+\sum_{k=1}^{r}a_{qq^{\prime}[k]}^{*}X_{k}^{*}+\sum_{k=r+1}^{Q}a_{qq^{\prime}[k]}^{*}X_{k}^{*}+a^{*}_{qq^{\prime}[Q+1]}X_{Q+1}^{*}$ for $q,q^{\prime}=1,\ldots,Q+1$. (S261)

Plugging (S260) into the right-hand side of (S261) ensures that

for all $X^{*}\in\mathcal{S}^{*}_{\lambda_{0}}$ with $X_{Q+1}^{*}=\lambda_{0}$, we have $X_{q}^{*}X_{q^{\prime}}^{*}=\gamma_{qq^{\prime}}+\sum_{k=r+1}^{Q}a_{qq^{\prime}[k]}^{*}X_{k}^{*}$ for $q,q^{\prime}=1,\ldots,Q+1$, where $\gamma_{qq^{\prime}}=c_{qq^{\prime}}+\sum_{k=1}^{r}a_{qq^{\prime}[k]}^{*}g_{k}+a^{*}_{qq^{\prime}[Q+1]}\lambda_{0}$. (S268)

Divide the combinations of $(q,q^{\prime})$ for $q,q^{\prime}=1,\ldots,Q+1$ into nine blocks according to whether $q$ and $q^{\prime}$ each lie in $\{1,\ldots,r\}$, $\{r+1,\ldots,Q\}$, or $\{Q+1\}$; by the symmetry (S253), it suffices to consider the blocks (i) $q,q^{\prime}\in\{1,\ldots,r\}$; (ii) $q\in\{1,\ldots,r\}$, $q^{\prime}\in\{r+1,\ldots,Q\}$; (iii) $q\in\{1,\ldots,r\}$, $q^{\prime}=Q+1$; (iv) $q\in\{r+1,\ldots,Q\}$, $q^{\prime}=Q+1$; (v) $q=q^{\prime}=Q+1$; and (vi) $q,q^{\prime}\in\{r+1,\ldots,Q\}$. (S273)

Given (S260), writing out $X^{*}_{q}=g_{q}$ for $q=1,\ldots,r$ on the left-hand side of (S268) implies the following for combinations of $(q,q^{\prime})$ in blocks (i)–(v) of (S273):

for all $X^{*}=(X^{*}_{1},\ldots,X^{*}_{Q+1})^{\top}\in\mathcal{S}^{*}_{\lambda_{0}}$ with $X_{Q+1}^{*}=\lambda_{0}$, we have
\begin{align*}
\text{(i)}\quad&g_{q}g_{q^{\prime}}=\gamma_{qq^{\prime}}+\textstyle\sum_{k=r+1}^{Q}a_{qq^{\prime}[k]}^{*}X_{k}^{*}&&\text{for }q,q^{\prime}\in\{1,\ldots,r\};\\
\text{(ii)}\quad&g_{q}X_{q^{\prime}}^{*}=\gamma_{qq^{\prime}}+\textstyle\sum_{k=r+1}^{Q}a_{qq^{\prime}[k]}^{*}X_{k}^{*}&&\text{for }q\in\{1,\ldots,r\}\text{ and }q^{\prime}\in\{r+1,\ldots,Q\};\\
\text{(iii)}\quad&g_{q}\lambda_{0}=\gamma_{q,Q+1}+\textstyle\sum_{k=r+1}^{Q}a^{*}_{q,Q+1[k]}X_{k}^{*}&&\text{for }q\in\{1,\ldots,r\}\text{ and }q^{\prime}=Q+1;\\
\text{(iv)}\quad&X_{q}^{*}\lambda_{0}=\gamma_{q,Q+1}+\textstyle\sum_{k=r+1}^{Q}a^{*}_{q,Q+1[k]}X_{k}^{*}&&\text{for }q\in\{r+1,\ldots,Q\}\text{ and }q^{\prime}=Q+1;\\
\text{(v)}\quad&\lambda_{0}^{2}=\gamma_{Q+1,Q+1}+\textstyle\sum_{k=r+1}^{Q}a^{*}_{Q+1,Q+1[k]}X_{k}^{*}&&\text{for }q=q^{\prime}=Q+1.
\end{align*} (S274)

Condition S2 below parallels Condition S1, and states the condition that all coefficients of $(X^{*}_{r+1},\ldots,X^{*}_{Q})$ in (S274) are zero. We show below

\begin{align*}
|\mathcal{S}_{\overline{\lambda_{0}}}|&\leq\begin{cases}r+1&\text{if Condition S2 holds;}\\ r+2&\text{otherwise,}\end{cases}\qquad\textup{(S283)}\\
|\mathcal{S}_{\lambda_{0}}|&\leq\begin{cases}Q-r+1&\text{if Condition S2 holds;}\\ Q-r&\text{otherwise.}\end{cases}\qquad\textup{(S286)}
\end{align*}

These two parts together ensure $|\mathcal{S}|=|\mathcal{S}_{\lambda_{0}}|+|\mathcal{S}_{\overline{\lambda_{0}}}|\leq Q+2$.

Condition S2.

All coefficients of $(X^{*}_{r+1},\ldots,X^{*}_{Q})$ in (S274) are zero. That is, after collecting terms, the coefficient of $X^{*}_{k}$ equals zero in each equation of (S274) for $k=r+1,\ldots,Q$: $a_{qq^{\prime}[k]}^{*}=0$ in blocks (i), (iii), and (v), $a_{qq^{\prime}[k]}^{*}=g_{q}\cdot 1\{k=q^{\prime}\}$ in block (ii), and $a^{*}_{q,Q+1[k]}=\lambda_{0}\cdot 1\{k=q\}$ in block (iv).
Part I: Proof of (S283).

$|\mathcal{S}_{\overline{\lambda_{0}}}|\leq r+1$ when Condition S2 holds.

Recall that $a_{qq^{\prime}[k]}^{*}=a^{*}_{q^{\prime}q[k]}$ for $q,q^{\prime},k=1,\ldots,Q+1$ from (S253). Condition S2(i), (iii), and (v) together ensure

$a_{qq^{\prime}[k]}^{*}=0$ for $k=r+1,\ldots,Q$ and $q,q^{\prime}\in\{1,\ldots,r;\,Q+1\}$.

Therefore, when Condition S2 holds, the coefficients of $X^{*}_{r+1},\ldots,X^{*}_{Q}$ are all zero in the linear expansions of $\{X^{*}_{q}X^{*}_{q^{\prime}}:q,q^{\prime}=1,\ldots,r;\,Q+1\}$ in (S252). The corresponding part of (S252) reduces to

$X_{q}^{*}X_{q^{\prime}}^{*}=c_{qq^{\prime}}+\sum_{k=1}^{r}a_{qq^{\prime}[k]}^{*}X_{k}^{*}+a^{*}_{qq^{\prime}[Q+1]}X_{Q+1}^{*}$ for $q,q^{\prime}=1,\ldots,r;\,Q+1$. (S291)

In words, (S291) ensures that the products $\{X^{*}_{q}X^{*}_{q^{\prime}}:q,q^{\prime}=1,\ldots,r;\,Q+1\}$ are linear in $(1,X^{*}_{1},\ldots,X^{*}_{r},X^{*}_{Q+1})$.

By induction, applying Proposition S5 to $X^{\prime}=(X^{*}_{1},\ldots,X^{*}_{r},X_{Q+1}^{*})^{\top}$ at $m=r+1\,(\leq Q)$ ensures that $(X^{*}_{1},\ldots,X^{*}_{r},X_{Q+1}^{*})^{\top}$ takes at most $r+2$ distinct values across all solutions to (S252).

Given that $X_{Q+1}^{*}=\lambda_{0}$ in at least one solution of $X^{*}$ to (S252), we have

across all $X^{*}\in\mathcal{S}^{*}$, the subvector $(X^{*}_{1},\ldots,X^{*}_{r},X_{Q+1}^{*})^{\top}$ takes at most $r+1$ distinct values with $X_{Q+1}^{*}\neq\lambda_{0}$. (S296)

In addition, Condition S2(iv) ensures that

$a^{*}_{q,Q+1[k]}=\lambda_{0}\cdot 1\{k=q\}$ for $q,k=r+1,\ldots,Q$.

The linear expansions of $\{X_{q}^{*}X_{Q+1}^{*}:q=r+1,\ldots,Q\}$ in (S252) reduce to

$X_{q}^{*}X_{Q+1}^{*}=c_{q,Q+1}+\sum_{k=1}^{r}a^{*}_{q,Q+1[k]}X_{k}^{*}+\lambda_{0}X_{q}^{*}+a^{*}_{q,Q+1[Q+1]}X_{Q+1}^{*}$ for $q=r+1,\ldots,Q$,

implying

$X_{q}^{*}(X_{Q+1}^{*}-\lambda_{0})=c_{q,Q+1}+\sum_{k=1}^{r}a^{*}_{q,Q+1[k]}X_{k}^{*}+a^{*}_{q,Q+1[Q+1]}X_{Q+1}^{*}$ for $q=r+1,\ldots,Q$. (S299)

In words, (S299) ensures that, for any solution with $X_{Q+1}^{*}\neq\lambda_{0}$, the subvector $(X^{*}_{r+1},\ldots,X^{*}_{Q})$ is uniquely determined by $(X^{*}_{1},\ldots,X^{*}_{r},X_{Q+1}^{*})$. This, together with (S296), ensures that

there are at most $r+1$ distinct solutions of $X^{*}=(X^{*}_{1},\ldots,X^{*}_{Q+1})^{\top}$ to (S252) that have $X_{Q+1}^{*}\neq\lambda_{0}$, i.e., $|\mathcal{S}_{\overline{\lambda_{0}}}^{*}|\leq r+1$.

This ensures $|\mathcal{S}_{\overline{\lambda_{0}}}|\leq r+1$ from (S256).


$|\mathcal{S}_{\overline{\lambda_{0}}}|\leq r+2$ when Condition S2 does not hold.

Recall from (S248) that

$X_{1:(Q-r)}=h+HX_{(Q-r+1):Q}$ for all $X=(X_{1},\ldots,X_{Q+1})^{\top}\in\mathcal{S}_{\overline{\lambda_{0}}}$ with $X_{Q+1}\neq\lambda_{0}$.

This ensures that

for all $X=(X_{1},\ldots,X_{Q+1})^{\top}\in\mathcal{S}_{\overline{\lambda_{0}}}$ with $X_{Q+1}\neq\lambda_{0}$, we have
\begin{align*}
\sum_{k=1}^{Q-r}a_{qq^{\prime}[k]}X_{k}&=A_{qq^{\prime}[1:(Q-r)]}X_{1:(Q-r)},\quad\text{where }A_{qq^{\prime}[1:(Q-r)]}=(a_{qq^{\prime}[1]},\ldots,a_{qq^{\prime}[Q-r]})\in\mathbb{R}^{1\times(Q-r)}\\
&\overset{\textup{(S248)}}{=}A_{qq^{\prime}[1:(Q-r)]}\bigl(h+HX_{(Q-r+1):Q}\bigr)\\
&=A_{qq^{\prime}[1:(Q-r)]}h+A_{qq^{\prime}[1:(Q-r)]}H\,X_{(Q-r+1):Q}\\
&=\tilde{c}_{qq^{\prime}}+\sum_{k=Q-r+1}^{Q}\tilde{a}_{qq^{\prime}[k]}X_{k},
\end{align*} (S306)

where $\tilde{c}_{qq^{\prime}}=A_{qq^{\prime}[1:(Q-r)]}h$ and $\{\tilde{a}_{qq^{\prime}[k]}:k=Q-r+1,\ldots,Q\}$ are the elements of $A_{qq^{\prime}[1:(Q-r)]}H$. Plugging (S306) in (S216) ensures that

for all $X=(X_{1},\ldots,X_{Q+1})^{\top}\in\mathcal{S}_{\overline{\lambda_{0}}}$ with $X_{Q+1}\neq\lambda_{0}$, we have
\begin{align*}
X_{q}X_{q^{\prime}}&=c_{qq^{\prime}}+\sum_{k=1}^{Q+1}a_{qq^{\prime}[k]}X_{k}\\
&=c_{qq^{\prime}}+\sum_{k=1}^{Q-r}a_{qq^{\prime}[k]}X_{k}+\sum_{k=Q-r+1}^{Q}a_{qq^{\prime}[k]}X_{k}+a_{qq^{\prime}[Q+1]}X_{Q+1}\\
&\overset{\textup{(S306)}}{=}c_{qq^{\prime}}+\Bigl(\tilde{c}_{qq^{\prime}}+\sum_{k=Q-r+1}^{Q}\tilde{a}_{qq^{\prime}[k]}X_{k}\Bigr)+\sum_{k=Q-r+1}^{Q}a_{qq^{\prime}[k]}X_{k}+a_{qq^{\prime}[Q+1]}X_{Q+1}\\
&=(c_{qq^{\prime}}+\tilde{c}_{qq^{\prime}})+\sum_{k=Q-r+1}^{Q}(\tilde{a}_{qq^{\prime}[k]}+a_{qq^{\prime}[k]})X_{k}+a_{qq^{\prime}[Q+1]}X_{Q+1}
\end{align*}
for $q,q^{\prime}=Q-r+1,\ldots,Q+1$.

In words, this ensures that, across all $X\in\mathcal{S}_{\overline{\lambda_{0}}}$ with $X_{Q+1}\neq\lambda_{0}$, the products $\{X_{q}X_{q^{\prime}}:q,q^{\prime}=Q-r+1,\ldots,Q+1\}$ are linear in $(1,X_{Q-r+1},\ldots,X_{Q},X_{Q+1})$.

By induction, applying Proposition S5 to $X^{\prime}=(X_{Q-r+1},\ldots,X_{Q},X_{Q+1})^{\top}$ at $m=r+1\ (\leq Q)$ ensures

across all $X\in\mathcal{S}_{\overline{\lambda_{0}}}$, the subvector $(X_{Q-r+1},\ldots,X_{Q},X_{Q+1})^{\top}$ takes at most $r+2$ distinct values. (S311)

In addition, the second row in (S248) further ensures that, for all $X\in\mathcal{S}_{\overline{\lambda_{0}}}$, the subvector $X_{1:(Q-r)}$ is uniquely determined by $X_{(Q-r+1):Q}$ via $X_{1:(Q-r)}=h+HX_{(Q-r+1):Q}$.

This, together with (S311), ensures $|\mathcal{S}_{\overline{\lambda_{0}}}|\leq r+2$.


Part II: Proof of (S286).

Let $\mathcal{S}_{(r+1):Q}$ denote the set of values of $X_{(r+1):Q}=(X_{r+1},\ldots,X_{Q})^{\top}$ across all solutions $X=(X_{1},\ldots,X_{Q+1})^{\top}$ to (S216) that have $X_{Q+1}=\lambda_{0}$. The first line in (S248) ensures that, for all $X\in\mathcal{S}_{\lambda_{0}}$, the subvector $X_{1:r}=g+GX_{(r+1):Q}$ is uniquely determined by $X_{(r+1):Q}$, with $X_{Q+1}=\lambda_{0}$ fixed.

This ensures $|\mathcal{S}_{\lambda_{0}}|=|\mathcal{S}_{(r+1):Q}|$ such that it suffices to show that

$|\mathcal{S}_{(r+1):Q}|\leq\begin{cases}Q-r+1&\text{if Condition S2 holds;}\\ Q-r&\text{otherwise.}\end{cases}$

A useful fact is that (S268) ensures

for all $X^{*}=(X^{*}_{1},\ldots,X^{*}_{Q+1})^{\top}\in\mathcal{S}^{*}_{\lambda_{0}}$ with $X_{Q+1}^{*}=\lambda_{0}$, we have $X_{q}^{*}X_{q^{\prime}}^{*}=\gamma_{qq^{\prime}}+\sum_{k=r+1}^{Q}a_{qq^{\prime}[k]}^{*}X^{*}_{k}$ for $q,q^{\prime}=r+1,\ldots,Q$. (S315)

$|\mathcal{S}_{(r+1):Q}|\leq Q-r+1$ when Condition S2 holds.

In words, (S315) implies that, across all $X^{*}\in\mathcal{S}^{*}_{\lambda_{0}}$, the products $\{X^{*}_{q}X^{*}_{q^{\prime}}:q,q^{\prime}=r+1,\ldots,Q\}$ are linear in $(1,X^{*}_{r+1},\ldots,X^{*}_{Q})$.

By induction, applying Proposition S5 to $X_{(r+1):Q}^{*}$ at $m=Q-r\ (\leq Q)$ ensures that $X_{(r+1):Q}^{*}$ takes at most $Q-r+1$ distinct values across all $X^{*}\in\mathcal{S}^{*}_{\lambda_{0}}$.

This, together with the correspondence of $\mathcal{S}^{*}_{\lambda_{0}}$ and $\mathcal{S}_{\lambda_{0}}$ from (S256) and $X_{(r+1):Q}^{*}=X_{(r+1):Q}$ from (S251), ensures that $|\mathcal{S}_{(r+1):Q}|\leq Q-r+1$.

We next improve the bound by one when Condition S2 does not hold.


$|\mathcal{S}_{(r+1):Q}|\leq Q-r$ when Condition S2 does not hold.

When Condition S2 does not hold, at least one coefficient of $(X^{*}_{r+1},\ldots,X^{*}_{Q})$ in (S274) is nonzero. This ensures that there exist constants $\{\beta_{0},\beta_{r+1},\ldots,\beta_{Q}\in\mathbb{R}:\beta_{r+1},\ldots,\beta_{Q}\text{ are not all zero}\}$ such that

$\beta_{0}+\sum_{k=r+1}^{Q}\beta_{k}X^{*}_{k}=0$ for all $X^{*}\in\mathcal{S}^{*}_{\lambda_{0}}$ with $X_{Q+1}^{*}=\lambda_{0}$.

Without loss of generality, assume that $\beta_{Q}\neq 0$ such that

$X^{*}_{Q}=-\beta_{Q}^{-1}\Bigl(\beta_{0}+\sum_{k=r+1}^{Q-1}\beta_{k}X^{*}_{k}\Bigr)$ for all $X^{*}\in\mathcal{S}^{*}_{\lambda_{0}}$ with $X_{Q+1}^{*}=\lambda_{0}$. (S321)

Plugging (S321) in (S315) ensures that

for all $X^{*}=(X^{*}_{1},\ldots,X^{*}_{Q+1})^{\top}\in\mathcal{S}^{*}_{\lambda_{0}}$ with $X_{Q+1}^{*}=\lambda_{0}$, we have
\begin{align*}
X_{q}^{*}X_{q^{\prime}}^{*}&=\gamma_{qq^{\prime}}+\sum_{k=r+1}^{Q}a_{qq^{\prime}[k]}^{*}X^{*}_{k}\\
&=\gamma_{qq^{\prime}}+\sum_{k=r+1}^{Q-1}a_{qq^{\prime}[k]}^{*}X^{*}_{k}+a_{qq^{\prime}[Q]}^{*}X^{*}_{Q}\\
&\overset{\textup{(S321)}}{=}\gamma_{qq^{\prime}}+\sum_{k=r+1}^{Q-1}a_{qq^{\prime}[k]}^{*}X^{*}_{k}-a_{qq^{\prime}[Q]}^{*}\beta_{Q}^{-1}\Bigl(\beta_{0}+\sum_{k=r+1}^{Q-1}\beta_{k}X^{*}_{k}\Bigr)
\end{align*}
for $q,q^{\prime}=r+1,\ldots,Q-1$.

In words, this ensures that, across all $X^{*}\in\mathcal{S}^{*}_{\lambda_{0}}$, the products $\{X^{*}_{q}X^{*}_{q^{\prime}}:q,q^{\prime}=r+1,\ldots,Q-1\}$ are linear in $(1,X^{*}_{r+1},\ldots,X^{*}_{Q-1})$.

By induction, applying Proposition S5 to $X^{\prime}=(X^{*}_{r+1},\ldots,X^{*}_{Q-1})^{\top}$ at $m=Q-r-1\,(<Q)$ ensures that

across all $X^{*}=(X^{*}_{1},\ldots,X^{*}_{Q+1})^{\top}\in\mathcal{S}^{*}_{\lambda_{0}}$ with $X_{Q+1}^{*}=\lambda_{0}$, the corresponding $X^{\prime}=(X^{*}_{r+1},\ldots,X^{*}_{Q-1})^{\top}$ takes at most $Q-r$ distinct values. (S329)

This, together with (S321), ensures that $X^{*}_{(r+1):Q}$ takes at most $Q-r$ distinct values across all $X^{*}\in\mathcal{S}^{*}_{\lambda_{0}}$, since $X^{*}_{Q}$ is uniquely determined by $(X^{*}_{r+1},\ldots,X^{*}_{Q-1})^{\top}$.

It then follows from (S256) that $|\mathcal{S}_{(r+1):Q}|\leq Q-r$.

S5.4 Proof of Corollary S1

We verify below Corollary S1(i)–(ii) one by one.

Proof of Corollary S1(i).

The result is trivial when $c=0_{m}$. When $c\neq 0_{m}$, define $X_{0}=X^{\top}c=\sum_{j=1}^{m}c_{j}X_{j}$. Without loss of generality, assume $c_{m}\neq 0$ so that $X_{m}=c_{m}^{-1}(X_{0}-\sum_{j=1}^{m-1}c_{j}X_{j})$. This ensures that $X=(X_{1},\ldots,X_{m-1},X_{m})^{\top}$ and $X^{\prime}=(X_{0},X_{1},\ldots,X_{m-1})^{\top}$ are linearly equivalent. Therefore,

\begin{align*}
\text{$XX^{\top}c$ is linear in $(1,X^{\top})^{\top}$}&\Longleftrightarrow\text{$XX_{0}$ is linear in $(1,X^{\top})^{\top}$}\\
&\Longleftrightarrow\text{$XX_{0}$ is linear in $(1,X^{\prime\top})^{\top}$}\\
&\Longleftrightarrow\text{$X^{\prime}X_{0}$ is linear in $(1,X^{\prime\top})^{\top}$.}
\end{align*}

The result follows from applying Proposition S4 to $X^{\prime}=(X_{0},X_{1},\ldots,X_{m-1})^{\top}$. ∎

Proof of Corollary S1(ii).

Let $e_{j}$ denote the $j$th column of the $m\times m$ identity matrix for $j=1,\ldots,m$. Then $XX^{\top}=(XX^{\top}e_{1},XX^{\top}e_{2},\ldots,XX^{\top}e_{m})$ is component-wise linear in $(1,X)$. The result follows from Proposition S5. ∎