
Robustness Against Weak or Invalid Instruments: Exploring Nonlinear Treatment Models with Machine Learning

Zijian Guo and Mengchu Zheng (Department of Statistics, Rutgers University, USA); Peter Bühlmann (Seminar for Statistics, ETH Zürich, Switzerland)
Abstract

We discuss causal inference for observational studies with possibly invalid instrumental variables. We propose a novel methodology called two-stage curvature identification (TSCI), which explores the nonlinear treatment model with machine learning. The first-stage machine learning improves the instrumental variable's strength and adjusts for different forms of violation of the instrumental variable assumptions. The success of TSCI requires the instrumental variable's effect on the treatment to differ from its violation form. A novel bias correction step is implemented to remove the bias resulting from the potentially high complexity of the machine learning fit. Our proposed TSCI estimator is shown to be asymptotically unbiased and Gaussian even if the machine learning algorithm does not consistently estimate the treatment model. Furthermore, we design a data-dependent method to choose the best among several candidate violation forms. We apply TSCI to study the effect of education on earnings.

Keywords: Generalized IV strength, Confidence interval, Random forests, Boosting, Neural network.

1 Introduction

Observational studies are major sources for inferring causal effects when randomized experiments are not feasible. However, such causal inference from observational studies requires strong assumptions and may be invalid in the presence of unmeasured confounders. Instrumental variable (IV) regression is a practical and highly popular causal inference approach in the presence of unmeasured confounders. The IVs are required to satisfy three assumptions: conditional on the baseline covariates, (A1) the IVs are associated with the treatment; (A2) the IVs are not associated with the unmeasured confounders; (A3) the IVs do not directly affect the outcome.

Despite the popularity of the IV method, there is significant concern about whether the IVs used in practice satisfy (A1)-(A3). Assumption (A1) requires the IV to be strongly associated with the treatment variable, which can be checked with an F-test in a first-stage linear regression model. Inference when assumption (A1) is violated has been actively investigated under the name of weak IVs (e.g., Staiger and Stock, 1997; Stock et al., 2002). Assumptions (A2) and (A3) ensure that the IV only affects the outcome through the treatment. If an IV violates either (A2) or (A3), we call it an invalid IV and refer to its functional form of violating (A2) and (A3) as the violation form, e.g., a linear violation. In the just-identification regime, most empirical analyses rely on external knowledge to argue for the validity of (A2) and (A3). However, there is a pressing need for a causal inference method that is robust against the proposed IVs violating the classical assumptions.

1.1 Our results and contribution

We aim to devise a robust IV framework that leads to causal identification even if the IVs proposed by domain experts violate assumptions (A1) to (A3). Our framework provides a robustness guarantee even when all proposed IVs are invalid, including the most common regime with a single, possibly invalid IV. It is well known that the treatment effect is not identifiable when there is no constraint on how the IVs violate assumptions (A2) and (A3). Our key identification assumption is that the violations of (A2) and (A3) arise from simpler forms than the association between the treatment and the IVs; that is, we exclude "special coincidences" such as the IVs violating (A2) and (A3) in a linear form while the conditional mean of the treatment given the IVs is also linear. This identification condition can be evaluated with the generalized IV strength measure introduced in Section 3.3 for a user-specified violation form.

We propose a novel two-stage curvature identification (TSCI) method to infer the treatment effect. An important operational step is to fit the treatment model with a machine learning (ML) algorithm, e.g., random forests, boosting, or deep neural networks. This first-stage ML model enhances the IV’s strength by learning a general conditional mean model of treatment given the IVs, instead of restricting to a linear association model. In the second stage, we leverage the ML prediction and adjust for different IV violation forms. A main novelty of our proposal is to estimate the bias resulting from high complexity of the first-stage ML and implement a corresponding bias correction step; see more details in Section 3.2.

We show that the TSCI estimator is asymptotically Gaussian when the generalized IV strength measure is sufficiently large. We further devise a data-dependent way of choosing the most suitable violation form among a collection of IV violation forms. By including the valid IV setting into the selection, our proposed general TSCI methodology does not exclude the possibility of the IVs being valid but provides a robust IV method against different invalidity forms.

To sum up, the contribution of the current paper is two-fold:

  1. TSCI explores a nonlinear treatment model with ML and leads to more reliable causal conclusions than existing methods assuming valid IVs.

  2. TSCI provides an efficient way of integrating the first-stage ML. A methodological novelty of TSCI is the bias correction in its second stage, which addresses the issue of high complexity and potential overfitting of the ML fit.

1.2 Comparison to existing literature

There is relevant literature on causal inference when assumptions (A2) and (A3) are violated. When all IVs are invalid, Lewbel (2012) and Tchetgen et al. (2021) proposed an identification strategy by assuming that the treatment model has heteroscedastic errors while the covariance between the treatment and outcome model errors is homoscedastic. Our proposed TSCI is based on a different idea, exploring the nonlinear structure between the treatment and the IVs, without requiring any homo- or heteroscedasticity conditions. Bowden et al. (2015) and Kolesár et al. (2015) assumed that the direct effect of the IVs on the outcome is nearly orthogonal to the effect of the IVs on the treatment. Han (2008); Bowden et al. (2016); Kang et al. (2016); Windmeijer et al. (2019); Guo et al. (2018); Windmeijer et al. (2021) considered the setting where a proportion of the IVs are valid and conducted causal inference by selecting valid IVs in a data-dependent way. Along this line, Guo (2023) proposed a uniform inference procedure that is robust to the IV selection error. More recently, Sun et al. (2023) leveraged the number of valid IVs and used the interaction of genetic markers to identify the causal effect. In contrast to assuming effect orthogonality or the existence of valid IVs, our proposed TSCI is effective even if all IVs are invalid and the effect orthogonality condition does not hold. Such a setting is especially useful in handling econometric applications with a single IV.

The nonlinear structure of the treatment model has been explored in the econometric and the more recent ML literature. In the IV literature, Kelejian (1971); Amemiya (1974); Newey (1990); Hausman et al. (2012) considered constructing a non-parametric treatment model and Belloni et al. (2012) proposed constructing the optimal IVs with a Lasso-type first-stage estimator. More recently, Chen et al. (2023); Liu et al. (2020) proposed constructing the IV as the first-stage prediction given by the ML algorithm. All of the above works are focused on the valid IV setting. In contrast, our current paper applies to the more robust regime where the IVs are allowed to violate assumptions (A2) or (A3) in a range of forms. Even in the context of valid IVs, our proposed TSCI may lead to a more accurate estimator than directly using the predicted value for the treatment model as the new IV (Chen et al., 2023; Liu et al., 2020); see Section 3.5 for details.

ML algorithms have been integrated into IV analysis to better accommodate complicated outcome models. In Sections 3.5 and 5.1, we provide detailed comparisons to Double Machine Learning (DML) proposed in Chernozhukov et al. (2018). Athey et al. (2019) proposed generalized random forests to infer heterogeneous treatment effects. Both works are based on valid IVs, while our current paper assumes a homogeneous treatment effect and focuses on the different, robust IV framework with possibly invalid IVs.

Finally, our proposed TSCI methodology can be used to check the IV validity for the just-identification case. This is distinct from the Sargan test, J test, or specification test (Hansen, 1982; Sargan, 1958; Woutersen and Hausman, 2019) which are used to test IV validity in the over-identification case.

Notation. For a matrix X\in\mathbb{R}^{n\times p}, a vector x\in\mathbb{R}^{n}, and a set \mathcal{A}\subset\{1,\cdots,n\}, we use X_{\mathcal{A}} to denote the submatrix of X whose row indices belong to \mathcal{A}, and x_{\mathcal{A}} to denote the sub-vector of x with indices in \mathcal{A}. For a set \mathcal{A}, |\mathcal{A}| denotes its cardinality. For a vector x\in\mathbb{R}^{p}, the \ell_{q} norm of x is defined as \|x\|_{q}=(\sum_{l=1}^{p}|x_{l}|^{q})^{1/q} for q\geq 0, with \|x\|_{0}=|\{1\leq l\leq p:x_{l}\neq 0\}| and \|x\|_{\infty}=\max_{1\leq l\leq p}|x_{l}|. We use c and C to denote generic positive constants that may vary from place to place. For a sequence of random variables X_{n} indexed by n, we write X_{n}\overset{p}{\to}X and X_{n}\overset{d}{\to}X when X_{n} converges to X in probability and in distribution, respectively. For two positive sequences a_{n} and b_{n}, a_{n}\lesssim b_{n} means that there exists C>0 such that a_{n}\leq Cb_{n} for all n; a_{n}\asymp b_{n} if a_{n}\lesssim b_{n} and b_{n}\lesssim a_{n}; and a_{n}\ll b_{n} if \limsup_{n\rightarrow\infty}a_{n}/b_{n}=0. For a matrix M, we use {\rm Tr}[M] to denote its trace, {\rm rank}(M) to denote its rank, and \|M\|_{F}, \|M\|_{2}, and \|M\|_{\infty} to denote its Frobenius norm, spectral norm, and element-wise maximum norm, respectively. For a square matrix M, M^{2} denotes the matrix multiplication of M by itself. For a symmetric matrix M, \lambda_{\max}(M) and \lambda_{\min}(M) denote its maximum and minimum eigenvalues, respectively.

2 Invalid instruments: modeling and identification

We consider i.i.d. data \{Y_{i},D_{i},Z_{i},X_{i}\}_{1\leq i\leq n}, where for the i-th observation, Y_{i}\in\mathbb{R} and D_{i}\in\mathbb{R} respectively denote the outcome and the treatment, and Z_{i}\in\mathbb{R}^{p_{z}} and X_{i}\in\mathbb{R}^{p_{x}} respectively denote the IVs and measured covariates.

2.1 Examples for effect identification with nonlinear treatment models

To explain the main idea, we start with the special case with no covariates,

Y_{i}=D_{i}\beta+Z_{i}^{\intercal}\pi+\epsilon_{i},\quad\text{and}\quad D_{i}=f(Z_{i})+\delta_{i},\quad\text{for}\quad 1\leq i\leq n, (1)

where the errors \epsilon_{i} and \delta_{i} satisfy {\mathbf{E}}(\epsilon_{i}\mid Z_{i})=0 and {\mathbf{E}}(\delta_{i}\mid Z_{i})=0. These errors \epsilon_{i} and \delta_{i} are typically correlated due to unobserved confounding. The outcome model in (1) is commonly used in the invalid IV literature (Small, 2007; Kang et al., 2016; Guo et al., 2018; Windmeijer et al., 2019), with \pi\neq 0 standing for the violation of IV assumptions (A2) and (A3). The treatment model in (1) is not a causal generating model between D_{i} and Z_{i} but only stands for a probability model between D_{i} and Z_{i}. Specifically, the treatment assignment might be directly influenced by unmeasured confounders, but the treatment model in (1) does not explicitly state this generating process. Instead, for the joint distribution of D_{i} and Z_{i}, we define the conditional expectation f(Z_{i})={\mathbf{E}}(D_{i}\mid Z_{i}), which leads to the treatment model in (1). We discuss the structural equation model interpretation of the model (1) in Section 2.3, which explicitly characterizes how unmeasured confounders may affect the treatment assignment.

We introduce some projection notation to facilitate the discussion. For a random variable U_{i}, define its best linear approximation by Z_{i} via \gamma^{*}=\operatorname*{arg\,min}_{\gamma\in\mathbb{R}^{p_{z}+1}}{\mathbf{E}}(U_{i}-\widetilde{Z}_{i}^{\intercal}\gamma)^{2} with \widetilde{Z}_{i}=(1,Z_{i}^{\intercal})^{\intercal}. For 1\leq i\leq n, write \mathcal{P}_{Z_{i}}U_{i}=\widetilde{Z}_{i}^{\intercal}\gamma^{*} and \mathcal{P}^{\perp}_{Z_{i}}U_{i}=U_{i}-\widetilde{Z}_{i}^{\intercal}\gamma^{*}. We apply \mathcal{P}^{\perp}_{Z_{i}} to the model (1) and obtain \mathcal{P}^{\perp}_{Z_{i}}Y_{i}=\mathcal{P}^{\perp}_{Z_{i}}D_{i}\beta+\mathcal{P}^{\perp}_{Z_{i}}\epsilon_{i}, where the linear violation form Z_{i}^{\intercal}\pi in (1) is removed after applying the transformation \mathcal{P}^{\perp}_{Z_{i}}. If f is nonlinear, then {\rm Var}\left(\mathcal{P}^{\perp}_{Z_{i}}f(Z_{i})\right)>0 and the adjusted variable \mathcal{P}^{\perp}_{Z_{i}}f(Z_{i}) can be used as a valid IV to identify the effect \beta via the following estimating equation

{\mathbf{E}}\left[\mathcal{P}^{\perp}_{Z_{i}}f(Z_{i})(Y_{i}-D_{i}\beta)\right]={\mathbf{E}}\left[\mathcal{P}^{\perp}_{Z_{i}}f(Z_{i})(Z_{i}^{\intercal}\pi+\epsilon_{i})\right]=0. (2)

Note that the last equality follows from the population orthogonality between \mathcal{P}^{\perp}_{Z_{i}}f(Z_{i}) and Z_{i}, together with {\mathbf{E}}(\epsilon_{i}\mid Z_{i})=0.

The estimating equation in (2) reveals that the treatment effect \beta can be identified by exploring the nonlinear structure of f under the model (1). We propose to learn f using an ML prediction model and to measure the variability of \mathcal{P}^{\perp}_{Z_{i}}f(Z_{i}) by a notion of generalized IV strength introduced in Section 3.3. We emphasize that using ML algorithms to learn the possibly nonlinear function f is critical in the current framework allowing for invalid IVs. The ML algorithms help capture complicated structures in f and typically retain enough variability in the adjusted variable \mathcal{P}^{\perp}_{Z_{i}}f(Z_{i}). In Section A.1 in the supplement, we present another example of causal identification in the context of randomized experiments with non-compliance and invalid IVs.
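To make the estimating equation (2) concrete, the following minimal simulation sketch (illustrative only, not the authors' code; for simplicity the true f is plugged in where TSCI would use a first-stage ML fit) shows that the adjusted variable \mathcal{P}^{\perp}_{Z}f(Z) recovers \beta even though Z itself is an invalid IV with a linear violation.

```python
# Illustrative sketch of the identification idea in (2); assumption: the true f is
# known here, whereas TSCI would replace it by a first-stage ML estimate.
import numpy as np

rng = np.random.default_rng(0)
n, beta, pi = 10_000, 0.5, 1.0

Z = rng.normal(size=n)
U = rng.normal(size=n)                                     # unmeasured confounder
D = np.sin(2 * Z) + Z**2 + 0.8 * U + rng.normal(size=n)    # nonlinear f(Z) plus delta
Y = D * beta + Z * pi + 0.8 * U + rng.normal(size=n)       # linear violation Z * pi

f_Z = np.sin(2 * Z) + Z**2                  # conditional mean f(Z) = E(D | Z)
Ztil = np.column_stack([np.ones(n), Z])     # (1, Z) for the linear projection
coef = np.linalg.lstsq(Ztil, f_Z, rcond=None)[0]
resid_f = f_Z - Ztil @ coef                 # P^perp_Z f(Z): the adjusted variable

beta_adj = (resid_f @ Y) / (resid_f @ D)                        # sample analogue of (2)
beta_naive = ((Z - Z.mean()) @ Y) / ((Z - Z.mean()) @ D)        # treats Z as a valid IV
print(f"adjusted-IV estimate: {beta_adj:.3f}, naive IV estimate: {beta_naive:.3f}")
```

The adjusted estimate is close to \beta=0.5, while the naive estimate treating Z as a valid IV is biased by the linear violation.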

Remark 1

Nonlinear functional form identification has been proposed in various strands of the literature; see Lewbel (2019, Sec. 3.7) for a review. More concretely, Angrist and Pischke (2009, Sec. 4.6.1) discussed identification when the IV has a linear direct effect on the outcome but the treatment model is a nonlinear parametric model of the IV. The Heckman selection model (Heckman, 1976) offered a practical and simple solution for estimating effects when the samples are not randomly selected. When no variables satisfy the exclusion restriction, one possible identification relies on the nonlinearity of the inverse Mills ratio. As pointed out by Puhani (2000), such identification is less reliable when the nonlinearity is relatively weak in applications. Additionally, Shardell and Ferrucci (2016) and Ten Have et al. (2007) proposed causal identification using the interaction of the (binary) IV and baseline covariates as the new IV. Lewbel (2019) commented that the proposed identification fails when the nonlinear functional form does not exist or is relatively weak. As a main difference, our proposal does not use hand-picked instruments or assume that the data-generating distribution has some pre-specified nonlinear functional form. Operationally, we apply the first-stage ML algorithm to learn the complex nonlinear treatment model, which is more powerful in detecting nonlinear dependence by exploring the data. However, including the first-stage ML requires a more delicate follow-up statistical procedure, such as the bias correction step in (19). More importantly, instead of betting on the existence of nonlinearity, our proposed generalized IV strength measure in (22) indicates whether there is enough nonlinear structure after adjusting for the invalid forms.

2.2 Model with a general class of invalid IVs

We now introduce the general outcome model, allowing for more complicated violation forms than the linear violation in (1),

Y_{i}=D_{i}\beta+g(Z_{i},X_{i})+\epsilon_{i},\quad\text{with}\quad{\mathbf{E}}(\epsilon_{i}\mid X_{i},Z_{i})=0,\quad\text{for}\quad 1\leq i\leq n (3)

where \beta\in\mathbb{R} is the homogeneous treatment effect and the function g:\mathbb{R}^{p_{x}+p_{z}}\rightarrow\mathbb{R} encodes how the IVs violate the assumptions (A2) and (A3). The classical valid IV setting requires that the function g(z,x) does not change with different assignments of z. In the invalid IV literature, the commonly used outcome model with a linear violation form can be viewed as a special case of (3) obtained by taking g(Z_{i},X_{i})=Z_{i}^{\intercal}\pi_{z}+X_{i}^{\intercal}\pi_{x}.

For the treatment, we define its conditional mean function f(Z_{i},X_{i})\coloneqq{\mathbf{E}}(D_{i}\mid Z_{i},X_{i}) for 1\leq i\leq n, leading to the following treatment model:

D_{i}=f(Z_{i},X_{i})+\delta_{i}\quad\text{with}\quad{\mathbf{E}}(\delta_{i}\mid Z_{i},X_{i})=0. (4)

The model (4) is flexible as f might be any unknown function of Z_{i} and X_{i}, and the treatment variable can be continuous, binary, or a count variable. Similarly to (1), the model (4) is only a probability relation instead of a causal generating model, and the treatment assignment is allowed to be influenced by unmeasured confounders.

It is well known that, for the outcome model (3), the identification of \beta is generally impossible without additional identifiability conditions on the function g. In this paper, we allow the existence of invalid IVs but constrain the functional form of violating (A2) and (A3) to a pre-specified class. An operational step is to specify a set of basis functions for g(\cdot) in the outcome model (3).

Define \phi(X_{i})={\mathbf{E}}[g(Z_{i},X_{i})\mid X_{i}] and decompose g as g(Z_{i},X_{i})=h(Z_{i},X_{i})+\phi(X_{i}) with h(Z_{i},X_{i})=g(Z_{i},X_{i})-\phi(X_{i}). When the IVs Z_{i} are valid, g(Z_{i},X_{i}) does not directly depend on the assignment of Z_{i}, which implies \phi(X_{i})=g(Z_{i},X_{i}) and h(\cdot)=0. In this case, we just need to approximate g(z,x)=\phi(x) by a set of basis functions \overrightarrow{\bf w}(x)\in\mathbb{R}^{p_{w}}. With \overrightarrow{\bf w}(x), we approximate g(Z_{i},X_{i})=\phi(X_{i}) by a linear combination of W_{i}=\overrightarrow{\bf w}(X_{i})\in\mathbb{R}^{p_{w}} for 1\leq i\leq n and \{\phi(X_{i})\}_{1\leq i\leq n} by the matrix W=(W^{\intercal}_{1},\ldots,W^{\intercal}_{n})^{\intercal}. For example, if \phi(x) is linear, we set \overrightarrow{\bf w}(x)=\{1,x_{1},x_{2},\cdots,x_{p_{x}}\}. Furthermore, we consider the additive model \phi(x)=\sum_{j=1}^{p_{x}}\phi_{j}(x_{j}) with smooth functions \{\phi_{j}(\cdot)\}_{1\leq j\leq p_{x}}. For 1\leq j\leq p_{x}, we construct a set of basis functions \overrightarrow{{\mathbf{b}}_{j}}=\{b_{j,l}(\cdot)\}_{1\leq l\leq M_{j}} for \phi_{j}(x_{j}), with M_{j}\geq 1 denoting the number of basis functions. Examples of the basis functions include the polynomial or B-spline basis. We then approximate \phi(x) with \overrightarrow{\bf w}(x)=\{1,\overrightarrow{{\mathbf{b}}_{1}}(x_{1}),\cdots,\overrightarrow{{\mathbf{b}}_{p_{x}}}(x_{p_{x}})\}.
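As an illustration of this basis construction, the following sketch (not the authors' code; it assumes scikit-learn's SplineTransformer, available from scikit-learn 1.0, as one possible B-spline basis, and the helper name build_W is ours) assembles W_{i}=\overrightarrow{\bf w}(X_{i}) with an intercept and one spline block per covariate.

```python
# Sketch of the covariate basis matrix W approximating phi(x) in the additive model;
# scikit-learn's B-spline basis is used as one possible choice of b_j(.).
import numpy as np
from sklearn.preprocessing import SplineTransformer

def build_W(X, n_knots=5, degree=3):
    """X: (n, p_x) covariates. Returns W = [1, b_1(x_1), ..., b_{p_x}(x_{p_x})]."""
    n, p_x = X.shape
    blocks = [np.ones((n, 1))]                                  # intercept W_{i,1} = 1
    for j in range(p_x):
        basis_j = SplineTransformer(n_knots=n_knots, degree=degree, include_bias=False)
        blocks.append(basis_j.fit_transform(X[:, [j]]))         # M_j spline columns for x_j
    return np.hstack(blocks)

X = np.random.default_rng(1).normal(size=(500, 3))
W = build_W(X)          # shape (500, 1 + sum_j M_j)
```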

When the IV is invalid with h\not\equiv 0, in addition to approximating \phi(x) by \overrightarrow{\bf w}(x), we further approximate the function h(z,x) by a set of pre-specified basis functions v_{1}(z,x),\cdots,v_{L}(z,x) and then approximate g(z,x) by

\mathcal{V}\coloneqq\{v_{1}(z,x),\cdots,v_{L}(z,x),\overrightarrow{\bf w}(x)\}\quad\text{with}\quad v_{l}:\mathbb{R}^{p_{x}+p_{z}}\rightarrow\mathbb{R}\quad\text{for}\quad 1\leq l\leq L. (5)

We now present different choices of g(\cdot) in (3), or equivalently, of the basis set \mathcal{V} in (5). With them, the model (3) can accommodate a wide range of valid or invalid IV forms.

  (1) Linear violation. We take g(z,x)=\sum_{l=1}^{p_{z}}\pi_{l}z_{l}+\phi(x) and \mathcal{V}=\{z_{1},\cdots,z_{p_{z}},\overrightarrow{\bf w}(x)\}.

  (2) Polynomial violation (for a single IV). We consider g(z,x)=h(z)+\phi(x), where the IV and baseline covariates affect the outcome in an additive manner. We set \mathcal{V}=\{v_{1}(z),\cdots,v_{L}(z),\overrightarrow{\bf w}(x)\} and may take v_{1}(z),\cdots,v_{L}(z) as (piecewise) polynomials of various orders.

  (3) Interaction violation (for a single IV). We consider g(z,x)=\sum_{l=1}^{p_{x}}\alpha_{l}\cdot z\cdot x_{l}+\phi(x), allowing the IV to affect the outcome interactively with the baseline covariates. For such a violation form, we set \mathcal{V}=\{z\cdot x_{1},\cdots,z\cdot x_{p_{x}},\overrightarrow{\bf w}(x)\}.

We focus on the single IV setting for the polynomial and interaction violations to simplify the notation; the construction can be extended to the setting with multiple IVs.

Two important remarks are in order for the choice of g and \mathcal{V}. Firstly, the choice of g and the corresponding basis set \mathcal{V} is part of the outcome model specification (3), reflecting the users' belief about how the IVs can potentially violate the assumptions (A2) and (A3). In applications, empirical researchers might apply domain knowledge to construct the pre-specified set of basis functions in (5). For example, Angrist and Lavy (1999) applied Maimonides' rule to construct the IV as a transformation of the variable "enrollment" and then adjusted for the possible violation form generated by a linear, quadratic, or piecewise linear transformation of "enrollment". Importantly, we do not have to assume knowledge of \mathcal{V} and will present in Section 3.4 a data-dependent way to choose the best \mathcal{V} from a collection of candidate basis sets. Secondly, the linear violation form with g(z,x)=\sum_{l=1}^{p_{z}}\pi_{l}z_{l}+\phi(x) is the most commonly used violation form in the multiple IV framework (Small, 2007; Kang et al., 2016; Guo et al., 2018; Windmeijer et al., 2019). Most existing methods require at least a proportion of the IVs Z_{i} to be valid, i.e., the corresponding \pi_{l} being zero. In contrast, our framework allows all IVs to be invalid (i.e., all \pi_{l} to be nonzero) by taking \mathcal{V}=\{z_{1},\cdots,z_{p_{z}},\overrightarrow{\bf w}(x)\}.

With the set \mathcal{V} of basis functions in (5), we define the matrix V which evaluates the functions in \mathcal{V} at the observed variables:

V=\begin{pmatrix}V_{1}&\cdots&V_{n}\end{pmatrix}^{\intercal}\in\mathbb{R}^{n\times(L+p_{w})}\quad\text{with}\quad V_{i}=\left(v_{1}(Z_{i},X_{i}),\cdots,v_{L}(Z_{i},X_{i}),W_{i}^{\intercal}\right)^{\intercal}, (6)

for 1\leq i\leq n. Note that we always use an intercept W_{i,1}=1. Instead of assuming that V_{i} perfectly generates g(Z_{i},X_{i}), we adopt a broader framework by allowing an approximation error of g(X_{i},Z_{i}) by the best linear combination of V_{i}, defined as

R_{i}(V)\coloneqq g(Z_{i},X_{i})-V^{\intercal}_{i}\pi\quad\text{with}\quad\pi\coloneqq\operatorname*{arg\,min}_{b}{\mathbf{E}}\left[g(Z_{i},X_{i})-V_{i}^{\intercal}b\right]^{2}. (7)

Define R(V)=(R_{1}(V),\cdots,R_{n}(V))\in\mathbb{R}^{n}. We shall omit the dependence on V when there is no confusion. With (7), we express the outcome model (3) as

Y_{i}=D_{i}\beta+V_{i}^{\intercal}\pi+R_{i}(V)+\epsilon_{i},\quad\text{for}\quad 1\leq i\leq n. (8)

If g(Z_{i},X_{i}) is well approximated by a linear combination of V_{i}, then \|R(V)\|_{2}/\sqrt{n} is close to zero or even \|R(V)\|_{2}=0.
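To fix ideas, a small sketch (with hypothetical helper names; not part of the formal methodology) of assembling the matrix V in (6) for the three violation forms above is given here.

```python
# Sketch of the basis matrix V in (6) for the violation forms of Section 2.2.
# Assumptions: Z is (n, p_z), X is (n, p_x), W is the covariate basis from w(x)
# (with an intercept), and the helper name build_V is ours.
import numpy as np

def build_V(Z, X, W, form="linear", q=3):
    if form == "linear":                    # V_i = (z_1, ..., z_{p_z}, w(x))
        V_iv = Z
    elif form == "polynomial":              # single IV: (z, z^2, ..., z^q, w(x))
        z = Z[:, 0]
        V_iv = np.column_stack([z**k for k in range(1, q + 1)])
    elif form == "interaction":             # single IV: (z*x_1, ..., z*x_{p_x}, w(x))
        z = Z[:, 0]
        V_iv = z[:, None] * X
    else:
        raise ValueError(f"unknown violation form: {form}")
    return np.hstack([V_iv, W])
```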

2.3 Causal interpretation: structural equation and potential outcome models

We first provide the structural equation model (SEM) interpretation of the models (3) and (4). We consider the following SEM for Y_{i}, D_{i}, and X_{i},

Y_{i}\leftarrow a_{0}+D_{i}\beta+g_{1}(Z_{i},X_{i})+\nu_{1}(H_{i})+\epsilon^{0}_{i},\quad\text{and}\quad D_{i}\leftarrow f_{1}(Z_{i},X_{i})+\nu_{2}(H_{i})+\delta_{i}^{0}, (9)

where H_{i} denotes some unmeasured confounders, \epsilon^{0}_{i} and \delta_{i}^{0} are random errors independent of D_{i},Z_{i},X_{i},H_{i}, and a_{0} is the intercept such that {\mathbf{E}}g_{1}(Z_{i},X_{i})={\mathbf{E}}\nu_{1}(H_{i})=0. Define g_{2}(Z_{i},X_{i})={\mathbf{E}}\left(\nu_{1}(H_{i})\mid Z_{i},X_{i}\right) and f_{2}(Z_{i},X_{i})={\mathbf{E}}\left(\nu_{2}(H_{i})\mid Z_{i},X_{i}\right). In (9), the unmeasured confounders H_{i} might affect both the outcome and the treatment, g_{1} encodes a direct effect of the IVs on the outcome, and g_{2} captures the association between the IVs and the unmeasured confounders. The SEM (9), together with the definitions of f_{2} and g_{2}, implies the outcome model (3) with \epsilon_{i}=\epsilon^{0}_{i}+\nu_{1}(H_{i})-g_{2}(Z_{i},X_{i}). The treatment model D_{i}=f(Z_{i},X_{i})+\delta_{i} arises with f(Z_{i},X_{i})=f_{1}(Z_{i},X_{i})+f_{2}(Z_{i},X_{i}) and \delta_{i}=\delta^{0}_{i}+\nu_{2}(H_{i})-f_{2}(Z_{i},X_{i}).

We now present how to interpret the invalid IV model from the potential outcome perspective. For the i-th subject with baseline covariates X_{i}, we use Y^{(z,d)}(X_{i}) to denote the potential outcome with the IVs and the treatment assigned to z\in\mathbb{R}^{p_{z}} and d\in\mathbb{R}, respectively. For 1\leq i\leq n, we consider the potential outcome model (Splawa-Neyman et al., 1990; Rubin, 1974),

Y_{i}^{(z,d)}(X_{i})=Y_{i}^{(0,0)}(X_{i})+d\beta+g_{1}(z,X_{i})\quad\text{and}\quad{\mathbf{E}}(Y_{i}^{(0,0)}(X_{i})\mid Z_{i},X_{i})=g_{2}(Z_{i},X_{i}), (10)

where \beta\in\mathbb{R} denotes the treatment effect, g_{1}:\mathbb{R}^{p_{\rm z}+p_{\rm x}}\rightarrow\mathbb{R}, and g_{2}:\mathbb{R}^{p_{\rm z}+p_{\rm x}}\rightarrow\mathbb{R}. If g_{1}(z,x) changes with z, the IVs directly affect the outcome, violating assumption (A3). If g_{2}(z,x) changes with z, the IVs are associated with unmeasured confounders, violating assumption (A2). The model (10) extends the Additive LInear, Constant Effects (ALICE) model of Holland (1988) by allowing for a general class of invalid IVs. By the consistency assumption Y_{i}=Y^{(Z_{i},D_{i})}_{i}(X_{i}), (10) implies (3) with g(\cdot)=g_{1}(\cdot)+g_{2}(\cdot) and \epsilon_{i}=Y_{i}^{(0,0)}(X_{i})-g_{2}(Z_{i},X_{i}). We can easily generalize (10) by considering Y_{i}^{(z,d)}(X_{i})=Y_{i}^{(0,0)}(X_{i})+d\beta_{i}+g_{1}(z,X_{i}), where \beta_{i} denotes the individual effect for the i-th individual. If we consider the random effect \beta_{i} with {\mathbf{E}}\beta_{i}=\beta and \beta_{i}-\beta being independent of (Z_{i},X_{i},D_{i}), then we obtain the model (3) with \epsilon_{i}=Y_{i}^{(0,0)}(X_{i})-g_{2}(Z_{i},X_{i})+(\beta_{i}-\beta)D_{i}, and our proposed method can be applied to make inference for \beta={\mathbf{E}}\beta_{i}.

2.4 Overview of TSCI

We propose in Section 3 a novel methodology called two-stage curvature identification (TSCI) to identify \beta under the models (3) and (4). We first assume knowledge of the basis set \mathcal{V} in (5) that spans the function g(\cdot). The proposal consists of two stages: in the first stage, we employ ML algorithms to learn the conditional mean f(\cdot) in the treatment model (4); in the second stage, we adjust for the IV invalidity form encoded by \mathcal{V} and construct the confidence interval for \beta by leveraging the first-stage ML fits.

The success of TSCI relies on the following identification condition.

Condition 1

Define the best linear approximation of f(Z_{i},X_{i}) by V_{i} in population as \gamma^{*}=\operatorname*{arg\,min}_{\gamma}{\mathbf{E}}(f(Z_{i},X_{i})-V_{i}^{\intercal}\gamma)^{2}. We assume {\mathbf{E}}(f(Z_{i},X_{i})-V_{i}^{\intercal}\gamma^{*})^{2}>0.

The condition states that the basis set \mathcal{V}, used for generating g(\cdot), does not fully span the conditional mean function f(\cdot) of the treatment model. For any user-specified \mathcal{V} in (5), Condition 1 can be partially examined by evaluating the generalized IV strength introduced in Section 3.3. When the generalized IV strength is sufficiently large, our proposed TSCI methodology leads to an estimator \widehat{\beta}(\mathcal{V}) in (19) such that

(\widehat{\beta}(\mathcal{V})-\beta)/\widehat{\rm SE}(\mathcal{V})\overset{d}{\to}N(0,1), (11)

with the estimated standard error \widehat{\rm SE}(\mathcal{V}) depending on the generalized IV strength. Importantly, the basis set \mathcal{V} used for generating g(\cdot) does not need to be fixed in advance: we discuss in Section 3.4 a data-driven strategy for choosing a "good" basis set \mathcal{V}_{\widehat{q}} among a collection of candidate sets \mathcal{V}_{0}\subset\mathcal{V}_{1}\subset\cdots\subset\mathcal{V}_{Q} for a positive integer Q, where \mathcal{V}_{0} corresponds to the valid IV setting. Our proposed TSCI compares estimators assuming valid IVs and adjusting for violation forms generated from the family \{\mathcal{V}_{q}\}_{1\leq q\leq Q}. Thus, we encompass the setting of valid IVs, and through this comparison, the TSCI methodology provides more robust causal inference tools than directly assuming the IVs' validity.

Intuitively, Condition 1 requires f(\cdot) to be more nonlinear than g(\cdot), which cannot be fully tested since g(\cdot) is unknown. We provide two important remarks on the plausibility of Condition 1. Firstly, when the proposed IVs are valid with g(z,x) being independent of z, Condition 1 is satisfied as long as f(z,x) depends on z. More interestingly, when g(z,x) has a relatively weak dependence on z (i.e., the IVs violate (A2) and (A3) locally), Condition 1 holds if the machine learning algorithm captures a strong dependence of f(z,x) on z. In practice, domain scientists identify IVs based on their knowledge, and these IVs, even when violating assumptions (A2) and (A3), may have a relatively weak direct effect on the outcome or be only weakly associated with the unmeasured confounders. In such a situation, TSCI provides robust causal inference by allowing the proposed IV to be invalid. However, we are not suggesting that users take an arbitrary covariate as an invalid IV to implement our proposal. Secondly, even if Condition 1 does not hold, our proposed TSCI estimator reduces to an estimator assuming valid IVs, which then has a similar performance to TSLS and DML. In Section 5.3, we explore the performance of TSCI when Condition 1 does not hold.

3 TSCI with Machine Learning

We discuss the first and second stages of our TSCI methodology in Sections 3.1 and 3.2, respectively. The main novelty of our proposal is to estimate a bias caused by the first-stage ML and implement a follow-up bias correction.

3.1 First stage: machine learning models for the treatment model

We estimate the conditional mean f(\cdot) in the treatment model (4) by a general ML fit. We randomly split the data into two disjoint subsets \mathcal{A}_{1} and \mathcal{A}_{2}. Throughout the paper, we use \mathcal{A}_{1}=\{1,2,\cdots,n_{1}\} with n_{1}=\lfloor 2n/3\rfloor and set \mathcal{A}_{2}=\{n_{1}+1,\cdots,n\}. Our results can be extended to any other splitting with |\mathcal{A}_{1}|\asymp|\mathcal{A}_{2}|. Sample splitting is essential for removing the endogeneity of the ML predicted values. Without sample splitting, the ML predicted value for the treatment can be close to the treatment itself (due to overfitting) and hence highly correlated with the unmeasured confounders, leading to a TSCI estimator suffering from a significant bias.

The main step is to construct the ML prediction model with the data belonging to \mathcal{A}_{2} and express the ML estimator of f_{\mathcal{A}_{1}}=(f(Z_{1},X_{1}),\cdots,f(Z_{n_{1}},X_{n_{1}}))^{\intercal} as

\widehat{f}_{\mathcal{A}_{1}}=\Omega D_{\mathcal{A}_{1}}\quad\text{for some matrix}\quad\Omega\in\mathbb{R}^{n_{1}\times n_{1}}, (12)

where the data belonging to \mathcal{A}_{2} is used to train the ML algorithm and construct the transformation matrix \Omega. Our proposed TSCI method mainly relies on expressing the first-stage prediction as the linear estimator in (12). A wide range of first-stage ML algorithms can be shown to have the expression in (12). In the following, we follow the results in Lin and Jeon (2006); Meinshausen (2006); Wager and Athey (2018) and express the sample-split random forests in the form (12). In Sections A.6, A.7, and A.8 in the supplement, we present the explicit definitions of \Omega for boosting, deep neural networks (DNN), and approximation with B-splines, respectively.

Importantly, the transformation matrix \Omega\in\mathbb{R}^{n_{1}\times n_{1}} plays a role analogous to the hat matrix in linear regression procedures; however, \Omega is not a projection matrix for nonlinear ML algorithms, which creates additional challenges in adopting the first-stage ML fit; see Section A.2.

Split random forests in (12). To avoid overfitting of random forests, we adopt the construction of honest trees and forests as in Wager and Athey (2018) and slightly modify it by excluding self-prediction for our purpose. We construct the random forests (RF) with the data \{D_{i},X_{i},Z_{i}\}_{i\in\mathcal{A}_{2}} and estimate f(Z_{i},X_{i}) for i\in\mathcal{A}_{1} by the constructed RF together with the data \{X_{i},Z_{i},D_{i}\}_{i\in\mathcal{A}_{1}}. RF aggregates S\geq 1 decision trees, with each decision tree being viewed as a partition of the whole covariate space \mathbb{R}^{p_{x}+p_{z}} into disjoint subspaces \{\mathcal{R}_{l}\}_{1\leq l\leq J}. Let \theta denote the random parameter that determines how a tree is grown. For any given (z^{\intercal},x^{\intercal})^{\intercal}\in\mathbb{R}^{p_{x}+p_{z}} and a given tree with parameter \theta, there exists a unique leaf l(z,x,\theta) with 1\leq l(z,x,\theta)\leq J such that the subspace \mathcal{R}_{l(z,x,\theta)} contains (z^{\intercal},x^{\intercal})^{\intercal}. With the observations inside \mathcal{R}_{l(z,x,\theta)}, the decision tree estimates f(z,x)={\mathbf{E}}(D\mid Z=z,X=x) by

\widehat{f}_{\theta}(z,x)=\sum_{j\in\mathcal{A}_{1}}\omega_{j}(z,x,\theta)D_{j}\quad\text{with}\quad\omega_{j}(z,x,\theta)=\frac{{\bf{1}}\left[(Z_{j}^{\intercal},X_{j}^{\intercal})^{\intercal}\in\mathcal{R}^{0}_{l(z,x,\theta)}\right]}{\sum_{k\in\mathcal{A}_{1}}{\bf{1}}\left[(Z_{k}^{\intercal},X_{k}^{\intercal})^{\intercal}\in\mathcal{R}^{0}_{l(z,x,\theta)}\right]}, (13)

where \mathcal{R}^{0}_{l(z,x,\theta)}\coloneqq\mathcal{R}_{l(z,x,\theta)}\setminus\{(z,x)\} is defined as the region excluding the point (z,x). We use \{\theta_{1},\cdots,\theta_{S}\} to denote the parameters corresponding to the S trees. The estimator \widehat{f}(z,x)=\frac{1}{S}\sum_{s=1}^{S}\widehat{f}_{\theta_{s}}(z,x) can be expressed as

\widehat{f}(z,x)=\sum_{j\in\mathcal{A}_{1}}\omega_{j}(z,x)D_{j}\quad\text{where}\quad\omega_{j}(z,x)=\frac{1}{S}\sum_{s=1}^{S}\omega_{j}(z,x,\theta_{s}), (14)

with \omega_{j}(z,x,\theta_{s}) defined in (13). That is, the split RF estimator of f_{\mathcal{A}_{1}} attains the form (12) with \Omega_{ij}=\omega_{j}(Z_{i},X_{i}) for i,j\in\mathcal{A}_{1}.

The construction of honest trees and the removal of self-prediction help remove the endogeneity of the ML predicted values \{\widehat{f}(Z_{i},X_{i})\}_{i\in\mathcal{A}_{1}}. Firstly, due to the sample splitting, the construction of \Omega does not directly depend on \{D_{i}\}_{i\in\mathcal{A}_{1}}, which removes the endogeneity contained in the ML predicted values. Secondly, consider an extreme setting with Z_{i} and X_{i} not being predictive of D_{i}. In such a scenario, without removing the self-prediction, the estimator \widehat{f}(Z_{i},X_{i}) would be dominated by its own treatment observation D_{i}, and \widehat{f}(Z_{i},X_{i}) would still be highly correlated with the unmeasured confounders, leading to a biased TSCI estimator. The removal of self-prediction can be viewed as an ML version of the "leave-one-out" estimator for the treatment model, which was proposed in Angrist et al. (1999) to reduce the bias of TSLS with many IVs. In the numerical studies, we have observed that the exclusion of self-prediction leads to a more accurate estimator than the corresponding TSCI estimator with self-prediction, regardless of the IV strength; see Table S1 for the detailed comparison.
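The following sketch (a simplified illustration, not the authors' implementation) constructs the matrix \Omega of (12)-(14) with scikit-learn random forests: the trees are grown on the \mathcal{A}_{2} split, and the prediction for each i\in\mathcal{A}_{1} averages D_{j} over the \mathcal{A}_{1} observations j\neq i falling in the same leaf, so that self-prediction is excluded.

```python
# Sketch of the first-stage transformation matrix Omega in (12)-(14); scikit-learn's
# bootstrap forest is used here as a stand-in for the honest-tree construction.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_omega(ZX_A1, ZX_A2, D_A2, n_trees=200, seed=0):
    rf = RandomForestRegressor(n_estimators=n_trees, min_samples_leaf=5,
                               random_state=seed).fit(ZX_A2, D_A2)
    n1 = ZX_A1.shape[0]
    leaves = rf.apply(ZX_A1)                      # (n1, n_trees) leaf index of each A1 point
    Omega = np.zeros((n1, n1))
    for s in range(n_trees):
        same_leaf = (leaves[:, s][:, None] == leaves[:, s][None, :]).astype(float)
        np.fill_diagonal(same_leaf, 0.0)          # remove self-prediction as in (13)
        row_sums = same_leaf.sum(axis=1, keepdims=True)
        row_sums[row_sums == 0] = 1.0             # guard against isolated leaves
        Omega += same_leaf / row_sums             # per-tree weights omega_j(z, x, theta_s)
    return Omega / n_trees                        # f_hat_A1 = Omega @ D_A1, cf. (14)
```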

3.2 Second stage: adjusting for violation forms and correcting bias

In the second stage, we leverage the first-stage ML fits in the form of (12) and adjust for the possible IV invalidity forms. In the remainder of the construction, we consider the outcome model in the form of (8) and assume that V_{i}^{\intercal}\pi provides an accurate approximation of g(Z_{i},X_{i}), resulting in zero or sufficiently small R_{i}(V). Hence R_{i}(V) does not enter the following construction of the estimator for \beta. We quantify the effect of the error term R_{i}(V) in the theoretical analysis in Section 4 and propose in Section 3.4 a data-dependent way of choosing V such that R_{i}(V) is sufficiently small.

Applying the transformation \Omega in (12) to the model (8) with data in \mathcal{A}_{1}, we obtain

\widehat{Y}_{\mathcal{A}_{1}}=\widehat{f}_{\mathcal{A}_{1}}\beta+\widehat{V}_{\mathcal{A}_{1}}\pi+\widehat{R}_{\mathcal{A}_{1}}+\widehat{\epsilon}_{\mathcal{A}_{1}}, (15)

where \widehat{Y}_{\mathcal{A}_{1}}=\Omega Y_{\mathcal{A}_{1}}, \widehat{f}_{\mathcal{A}_{1}}=\Omega D_{\mathcal{A}_{1}}, \widehat{V}_{\mathcal{A}_{1}}=\Omega V_{\mathcal{A}_{1}}, \widehat{R}_{\mathcal{A}_{1}}=\Omega R_{\mathcal{A}_{1}}, and \widehat{\epsilon}_{\mathcal{A}_{1}}=\Omega\epsilon_{\mathcal{A}_{1}}. For a matrix V, we use P_{V} and P_{V}^{\perp} to denote the projection to the column spaces of V and its orthogonal complement, respectively. Based on (15), we project out the columns of \widehat{V}_{\mathcal{A}_{1}} and estimate the effect \beta by

\widehat{\beta}_{\rm init}(V)\coloneqq\frac{\widehat{Y}_{\mathcal{A}_{1}}^{\intercal}P^{\perp}_{\widehat{V}_{\mathcal{A}_{1}}}\widehat{f}_{\mathcal{A}_{1}}}{\widehat{f}_{\mathcal{A}_{1}}^{\intercal}P^{\perp}_{\widehat{V}_{\mathcal{A}_{1}}}\widehat{f}_{\mathcal{A}_{1}}}=\frac{{Y}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){D}_{\mathcal{A}_{1}}}{{D}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){D}_{\mathcal{A}_{1}}}\quad\text{with}\quad\mathbf{M}(V)=\Omega^{\intercal}P^{\perp}_{\widehat{V}_{\mathcal{A}_{1}}}\Omega. (16)

When R_{i}(V)=0, we decompose the error of the above estimator as

\widehat{\beta}_{\rm init}(V)-\beta=\frac{{\epsilon}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){\delta}_{\mathcal{A}_{1}}}{{D}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){D}_{\mathcal{A}_{1}}}+\frac{{\epsilon}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){f}_{\mathcal{A}_{1}}}{{D}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){D}_{\mathcal{A}_{1}}}. (17)

In the above decomposition, the first term on the right-hand side is a bias component arising from the use of the first-stage ML. To explain this, we consider the homoscedastic correlation {\rm Cov}(\epsilon_{i},\delta_{i}\mid Z_{i},X_{i})={\rm Cov}(\epsilon_{i},\delta_{i}) and obtain the explicit bias expression,

\frac{{\epsilon}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){\delta}_{\mathcal{A}_{1}}}{{D}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){D}_{\mathcal{A}_{1}}}\approx\frac{{\rm Cov}(\delta_{i},\epsilon_{i})\cdot{\rm Tr}[\mathbf{M}(V)]}{{D}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){D}_{\mathcal{A}_{1}}}, (18)

where {\rm Tr}[\mathbf{M}(V)] denotes the trace of \mathbf{M}(V) defined in (16). Note that {\rm Tr}[\mathbf{M}(V)] can be viewed as a degrees-of-freedom or complexity measure for the ML algorithm. It may become particularly large in the overfitting regime. The bias in (18) also becomes large when the denominator {D}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){D}_{\mathcal{A}_{1}} is small, which corresponds to a weak generalized IV strength in the following (22). Thus, both the numerator and the denominator in (18) can lead to a large bias of \widehat{\beta}_{\rm init}(V). In Figure 1, we illustrate the bias of \widehat{\beta}_{\rm init}(V) in settings with different values of the IV strength scaled as {D}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){D}_{\mathcal{A}_{1}}.

Figure 1: Density plots of the TSCI and TSCI-Init estimators (over 500 simulations), which are after and before bias correction, respectively. The three panels from left to right correspond to settings with increasing IV strength; see Section D.1 in the supplement for details. The black dashed line represents the true value \beta=0.5. The green and brown solid lines indicate the means of the TSCI and TSCI-Init estimators.

To address this finite-sample bias of \widehat{\beta}_{\rm init}(V), we propose the following bias-corrected estimator,

\widehat{\beta}(V)=\frac{{Y}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){D}_{\mathcal{A}_{1}}}{{D}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){D}_{\mathcal{A}_{1}}}-\frac{\sum_{i=1}^{n_{1}}[\mathbf{M}(V)]_{ii}\widehat{\delta}_{i}[\widehat{\epsilon}(V)]_{i}}{{D}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){D}_{\mathcal{A}_{1}}}, (19)

with \widehat{\delta}_{i}=D_{i}-\widehat{f}_{i} for 1\leq i\leq n_{1} and \widehat{\epsilon}(V)=P^{\perp}_{V}[Y-D\widehat{\beta}_{\rm init}(V)]. We show in Figure 1 that our proposal effectively corrects the bias of \widehat{\beta}_{\rm init}(V). Importantly, our proposed bias correction is effective for both homoscedastic and heteroscedastic correlations. We present a simplified bias correction assuming homoscedastic correlation in Section A.10 in the supplement.

Centering at \widehat{\beta}(V) defined in (19), we construct the confidence interval

{\rm CI}(V)=\left(\widehat{\beta}(V)-z_{\alpha/2}\widehat{\rm SE}(V),\widehat{\beta}(V)+z_{\alpha/2}\widehat{\rm SE}(V)\right), (20)

with z_{\alpha/2} denoting the upper \alpha/2 quantile of the standard normal distribution and

\widehat{\rm SE}(V)=\frac{\sqrt{\sum_{i=1}^{n_{1}}[\widehat{\epsilon}(V)]^{2}_{i}[\mathbf{M}(V){D}_{\mathcal{A}_{1}}]_{i}^{2}}}{{D}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){D}_{\mathcal{A}_{1}}}\quad\text{with}\quad\widehat{\epsilon}(V)=P^{\perp}_{V}\left[Y-D\widehat{\beta}_{\rm init}(V)\right]. (21)
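A condensed sketch of the second stage (illustrative only; the variable and function names are ours) computing \mathbf{M}(V) in (16), the initial estimator, the bias-corrected estimator (19), and the standard error (21) with the confidence interval (20) on the \mathcal{A}_{1} split is given below.

```python
# Sketch of the second-stage computations (16), (19), (20), (21); not the authors' package.
import numpy as np
from scipy import stats

def proj_orth(V):
    """Orthogonal projection P_V^perp = I - V (V'V)^+ V'."""
    return np.eye(V.shape[0]) - V @ np.linalg.pinv(V.T @ V) @ V.T

def tsci_second_stage(Y1, D1, V1, Omega, alpha=0.05):
    V_hat = Omega @ V1
    M = Omega.T @ proj_orth(V_hat) @ Omega                  # M(V) in (16)
    denom = D1 @ M @ D1
    beta_init = (Y1 @ M @ D1) / denom                       # initial estimator (16)
    delta_hat = D1 - Omega @ D1                             # delta_hat_i = D_i - f_hat_i
    eps_hat = proj_orth(V1) @ (Y1 - D1 * beta_init)         # eps_hat(V) in (21)
    bias = np.sum(np.diag(M) * delta_hat * eps_hat) / denom
    beta_hat = beta_init - bias                             # bias-corrected estimator (19)
    se = np.sqrt(np.sum(eps_hat**2 * (M @ D1)**2)) / denom  # SE(V) in (21)
    z = stats.norm.ppf(1 - alpha / 2)
    return beta_hat, se, (beta_hat - z * se, beta_hat + z * se)   # CI(V) in (20)
```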
Remark 2

The bias component in (18) results from the correlation between \epsilon_{i} and \delta_{i} for i\in\mathcal{A}_{1}. In the regime of many IVs, Angrist et al. (1999) proposed the "leave-one-out" jackknife-type IV estimator: instead of fitting the first-stage model with all the data, the treatment model for the i-th observation is fitted without using that observation. Such an estimator effectively removes the bias of TSLS due to the correlation between \epsilon_{i} and \delta_{i}. Our proposed removal of self-prediction is of a similar spirit to the jackknife. However, even after removing the self-prediction, the diagonal of \mathbf{M}(V) is not zero, and hence the correlation between \epsilon_{i} and \delta_{i} still leads to a bias component, which requires the bias correction as in (19).

3.3 Generalized IV strength: detection of under-fitted machine learning

The IV strength is particularly important for identifying the treatment effect stably, and weak IVs are a major concern in practical applications of IV-based methods (Stock et al., 2002; Hansen et al., 2008). With a larger basis set \mathcal{V}, the IV strength will generally decrease as the information contained in \mathcal{V} is projected out from the first-stage ML fit. It is crucial to introduce a generalized IV strength measure that accounts for the invalid IV forms and the ML algorithm. Similarly to the classical setup, good performance of our proposed TSCI estimator also requires a relatively large generalized IV strength. We introduce the generalized IV strength measure as

\mu(V)\coloneqq{{f}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){f}_{\mathcal{A}_{1}}}/{\left[\sum_{i\in\mathcal{A}_{1}}{\rm Var}(\delta_{i}\mid X_{i},Z_{i})/n_{1}\right]}. (22)

If {\rm Var}(\delta_{i}\mid X_{i},Z_{i})=\sigma_{\delta}^{2}, then \mu(V) is reduced to {f}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){f}_{\mathcal{A}_{1}}/\sigma_{\delta}^{2}. A sufficiently large strength \mu(V) will guarantee stable point and interval estimators defined in (19) and (20). Hence, we need to check whether \mu(V) is sufficiently large. Since f is unknown, we estimate {f}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){f}_{\mathcal{A}_{1}} by its sample version {D}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){D}_{\mathcal{A}_{1}} and estimate \mu(V) by \widehat{\mu(V)}\coloneqq{D}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){D}_{\mathcal{A}_{1}}/\left[\|D_{\mathcal{A}_{1}}-\widehat{f}_{\mathcal{A}_{1}}\|_{2}^{2}/n_{1}\right].

In Section 4, we show that our point estimator in (19) is consistent when the IV strength \mu(V) is much larger than {\rm Tr}[\mathbf{M}(V)]; see Condition (R2). We now develop a bootstrap test to provide an empirical assessment of this IV strength requirement. Since Rothenberg (1984) and Stock et al. (2002) suggested a concentration parameter larger than 10 as being "adequate", we develop a bootstrap test for \mu(V)\geq\max\{2{\rm Tr}[\mathbf{M}(V)],10\}. We apply the wild bootstrap method and construct a probabilistic upper bound for the estimation error {D}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){D}_{\mathcal{A}_{1}}-{f}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){f}_{\mathcal{A}_{1}}=2{f}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){\delta}_{\mathcal{A}_{1}}+{\delta}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){\delta}_{\mathcal{A}_{1}}. For 1\leq i\leq n_{1}, we define \widehat{\delta}_{i}=D_{i}-\widehat{f}_{i} and compute \widetilde{\delta}_{i}=\widehat{\delta}_{i}-\bar{\mu}_{\delta} with \bar{\mu}_{\delta}=\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}\widehat{\delta}_{i}. For 1\leq l\leq L, we generate \delta^{[l]}_{i}=U^{[l]}_{i}\cdot\widetilde{\delta}_{i} for 1\leq i\leq n_{1}, with \{U^{[l]}_{i}\}_{1\leq i\leq n_{1}} generated as i.i.d. standard normal random variables, and compute S^{[l]}=\left[2{f}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){\delta}^{[l]}+({\delta}^{[l]})^{\intercal}\mathbf{M}(V){\delta}^{[l]}\right]/\left[\|D_{\mathcal{A}_{1}}-\widehat{f}_{\mathcal{A}_{1}}\|_{2}^{2}/n_{1}\right]. We use \mathcal{S}_{\alpha_{0}}(V) to denote the upper \alpha_{0} empirical quantile of \{|S^{[l]}|\}_{1\leq l\leq L}. We conduct the generalized IV strength test \widehat{\mu(V)}\geq\max\{2{\rm Tr}[\mathbf{M}(V)],10\}+\mathcal{S}_{\alpha_{0}}(V), with \mathcal{S}_{\alpha_{0}}(V) being a high-probability upper bound for |\widehat{\mu(V)}-\mu(V)|. We use \alpha_{0}=0.025 throughout this paper. If the above generalized IV strength test is passed, the IV is claimed to be strong after adjusting for the matrix V defined in (6); otherwise, the IV is claimed to be weak after adjusting for the matrix V. Empirically, we observe reliable inference properties when the estimated generalized IV strength \widehat{\mu(V)} is above 40; see Figure S1 for details.
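The strength estimate \widehat{\mu(V)} and the wild bootstrap threshold can be computed as in the sketch below (illustrative only; following the plug-in spirit, the unknown f_{\mathcal{A}_{1}} in S^{[l]} is replaced here by the first-stage fit \widehat{f}_{\mathcal{A}_{1}}=\Omega D_{\mathcal{A}_{1}}, which is an assumption of this illustration).

```python
# Sketch of the generalized IV strength test of Section 3.3; f_A1 is replaced by its
# first-stage estimate (an assumption of this sketch).
import numpy as np

def iv_strength_test(D1, Omega, M, L=500, alpha0=0.025, seed=0):
    rng = np.random.default_rng(seed)
    n1 = len(D1)
    f_hat = Omega @ D1
    delta_hat = D1 - f_hat
    sigma2_hat = np.sum(delta_hat**2) / n1
    mu_hat = (D1 @ M @ D1) / sigma2_hat                 # estimate of mu(V) in (22)
    delta_tilde = delta_hat - delta_hat.mean()
    S = np.empty(L)
    for l in range(L):
        d_l = rng.standard_normal(n1) * delta_tilde     # wild bootstrap perturbation
        S[l] = (2 * f_hat @ M @ d_l + d_l @ M @ d_l) / sigma2_hat
    S_alpha = np.quantile(np.abs(S), 1 - alpha0)        # upper alpha_0 quantile
    passed = mu_hat >= max(2 * np.trace(M), 10) + S_alpha
    return mu_hat, passed
```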

3.4 Data-dependent selection of \mathcal{V}

Our proposed TSCI estimator in (19) requires prior knowledge of the basis set \mathcal{V}, which generates the function g(\cdot). In the following, we consider nested sets of basis functions \mathcal{V}_{0}\subset\mathcal{V}_{1}\subset\cdots\subset\mathcal{V}_{Q}, where Q is a positive integer. We devise a data-dependent way to choose the best one among \{\mathcal{V}_{q}\}_{0\leq q\leq Q}.

We define \mathcal{V}_{0}\coloneqq\{\overrightarrow{\bf w}(x)\} as the set of basis functions for the valid IV setting. For q\geq 1, define \mathcal{V}_{q}\coloneqq\{v_{1}(\cdot),\cdots,v_{L_{q}}(\cdot),\overrightarrow{\bf w}(x)\} as the basis set for different invalid IV forms, where L_{q}\geq 1 is the number of basis functions. We present two examples of \{\mathcal{V}_{q}\}_{1\leq q\leq Q} for the single IV setting.

  (1) Polynomial violation: \mathcal{V}_{q}=\{z,z^{2},\cdots,z^{q},\overrightarrow{\bf w}(x)\}, for 1\leq q\leq Q.

  (2) Interaction violation: \mathcal{V}_{1}=\left\{z,z\cdot x_{1},z\cdot x_{2},\cdots,z\cdot x_{p_{x}},\overrightarrow{\bf w}(x)\right\}.

For 0\leq q\leq Q, we define the matrix V_{q}\in\mathbb{R}^{n\times(L_{q}+p_{w})} with its i-th row defined as (V_{q})_{i}=\left(v_{1}(Z_{i},X_{i}),\cdots,v_{L_{q}}(Z_{i},X_{i}),W_{i}^{\intercal}\right)^{\intercal} with W_{i}=\overrightarrow{\bf w}(X_{i}) for 1\leq i\leq n.

For estimating q\in\{0,1,\ldots\} from data, we proceed as follows. We implement the generalized IV strength test in Section 3.3 and define Q_{\max} as

Q_{\max}\coloneqq\max_{q\geq 0}\left\{q:\widehat{{\mu}(V_{q})}\geq\max\{2{\rm Tr}\left[\mathbf{M}(V_{q})\right],10\}+{\mathcal{S}_{\alpha_{0}}(V_{q})}\right\}, (23)

where \alpha_{0} is set at 0.025 by default. For a larger q, we tend to adjust out more information and have relatively weaker IVs. Intuitively, Q_{\max} denotes the largest index such that the IVs still have enough strength after adjusting for V_{q}. With Q_{\max}, we shall choose among \{V_{q}\}_{0\leq q\leq Q_{\max}}. As a remark, when there is no q\geq 0 satisfying (23), this corresponds to the weak IV regime. In such a scenario, we shall implement the valid IV estimator and output a warning of weak IVs.
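A short sketch of computing Q_{\max} in (23), reusing the iv_strength_test helper from the Section 3.3 sketch (an illustrative assumption, not the authors' code), is:

```python
# Sketch of Q_max in (23): run the strength test over the nested candidates V_0, ..., V_Q
# and keep the largest index that passes; if none passes, fall back to the valid-IV
# estimator with a weak-IV warning. M_of(Vq) is assumed to return M(V_q) as in (16).
def compute_Q_max(D1, V_list, Omega, M_of):
    passing = [q for q, Vq in enumerate(V_list)
               if iv_strength_test(D1, Omega, M_of(Vq))[1]]
    if not passing:
        raise RuntimeError("weak IV: no candidate V_q passes the strength test")
    return max(passing)
```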

For any given 0\leq q\leq Q_{\max}, we apply the generalized estimator in (19) and construct

\widehat{\beta}(V_{q})=\frac{{Y}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q}){D}_{\mathcal{A}_{1}}}{{D}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q}){D}_{\mathcal{A}_{1}}}-\frac{\sum_{i=1}^{n_{1}}[\mathbf{M}(V_{q})]_{ii}\widehat{\delta}_{i}[\widehat{\epsilon}(V_{Q_{\max}})]_{i}}{{D}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q}){D}_{\mathcal{A}_{1}}}, (24)

with \widehat{\delta}_{\mathcal{A}_{1}}=D_{\mathcal{A}_{1}}-\widehat{f}_{\mathcal{A}_{1}}, \mathbf{M}(\cdot) defined in (16), and \widehat{\epsilon}(\cdot) defined in (21).

The selection of the best V_{q} among \{V_{q}\}_{0\leq q\leq Q_{\max}} relies on comparing the estimators \{\widehat{\beta}(V_{q})\}_{0\leq q\leq Q_{\max}}. We start with comparing the difference between the estimators \widehat{\beta}(V_{q}) and \widehat{\beta}(V_{q^{\prime}}) with 0\leq q<q^{\prime}\leq Q_{\max}. When \{g(X_{i},Z_{i})\}_{1\leq i\leq n} are well approximated by both V_{q} and V_{q^{\prime}}, the approximation errors R(V_{q}) and R(V_{q^{\prime}}) defined in (7) are small, and the difference \widehat{\beta}(V_{q^{\prime}})-\widehat{\beta}(V_{q}) is mainly due to the randomness of the errors \epsilon_{\mathcal{A}_{1}}. In such cases, \widehat{\beta}(V_{q^{\prime}})-\widehat{\beta}(V_{q}) is approximately centered at zero with the following conditional variance,

\widehat{H}(V_{q},V_{q^{\prime}})=\frac{\sum_{i=1}^{n_{1}}[\widehat{\epsilon}(V_{Q_{\max}})]^{2}_{i}[\mathbf{M}(V_{q^{\prime}}){D}_{\mathcal{A}_{1}}]_{i}^{2}}{[{D}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q^{\prime}})D_{\mathcal{A}_{1}}]^{2}}+\frac{\sum_{i=1}^{n_{1}}[\widehat{\epsilon}(V_{Q_{\max}})]^{2}_{i}[\mathbf{M}(V_{q}){D}_{\mathcal{A}_{1}}]_{i}^{2}}{[{D}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q})D_{\mathcal{A}_{1}}]^{2}}-2\frac{\sum_{i=1}^{n_{1}}[\widehat{\epsilon}(V_{Q_{\max}})]^{2}_{i}[\mathbf{M}(V_{q^{\prime}}){D}_{\mathcal{A}_{1}}]_{i}[\mathbf{M}(V_{q}){D}_{\mathcal{A}_{1}}]_{i}}{[{D}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q^{\prime}})D_{\mathcal{A}_{1}}]\cdot[{D}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q})D_{\mathcal{A}_{1}}]}. (25)

Based on the above approximation, we further conduct the following test for whether \widehat{\beta}(V_{q}) is significantly different from \widehat{\beta}(V_{q^{\prime}}),

\mathcal{C}(V_{q},V_{q^{\prime}})={\bf 1}\left({\left|\widehat{\beta}(V_{q})-\widehat{\beta}(V_{q^{\prime}})\right|}/{\sqrt{\widehat{H}(V_{q},V_{q^{\prime}})}}\geq z_{\alpha_{0}}\right), (26)

where \widehat{\beta}(V_{q}) and \widehat{\beta}(V_{q^{\prime}}) are defined in (24). On the other hand, if the smaller matrix V_{q} does not provide a good approximation of \{g(Z_{i},X_{i})\}_{1\leq i\leq n}, the difference \left|\widehat{\beta}(V_{q})-\widehat{\beta}(V_{q^{\prime}})\right| tends to be much larger than the threshold in (26) and hence \mathcal{C}(V_{q},V_{q^{\prime}})=1, indicating that V_{q} does not fully generate \{g(Z_{i},X_{i})\}_{1\leq i\leq n}.

For Qmax2,Q_{\max}\geq 2, we generalize the pairwise comparison to multiple comparisons. For 0qQmax1,0\leq q\leq Q_{\max}-1, we compare β^(Vq)\widehat{\beta}(V_{q}) to any β^(Vq)\widehat{\beta}(V_{q^{\prime}}) with q+1qQmax{q+1\leq q^{\prime}\leq Q_{\max}}. Particularly, for 0qQmax1,0\leq q\leq Q_{\max}-1, we define the test

𝒞(Vq)=𝟏(maxq+1qQmax[|β^(Vq)β^(Vq)|/H^(Vq,Vq)]ρ^),\mathcal{C}(V_{q})={\bf 1}\left(\max_{q+1\leq q^{\prime}\leq Q_{\max}}\left[{\left|\widehat{\beta}(V_{q})-\widehat{\beta}(V_{q^{\prime}})\right|}/{\sqrt{\widehat{H}(V_{q},V_{q^{\prime}})}}\right]\geq\widehat{\rho}\right), (27)

where the threshold ρ^\widehat{\rho} is chosen by the wild bootstrap; see Section A.3 in the supplement for more details. We define 𝒞(VQmax)=0\mathcal{C}(V_{Q_{\max}})=0 since there is no index larger than Qmax{Q_{\max}} to compare with. We interpret 𝒞(Vq)=0\mathcal{C}(V_{q})=0 as follows: none of the differences {|β^(Vq)β^(Vq)|}q+1qQmax\{|\widehat{\beta}(V_{q})-\widehat{\beta}(V_{q^{\prime}})|\}_{q+1\leq q^{\prime}\leq Q_{\max}} is large, indicating that VqV_{q} (approximately) generates the function g()g(\cdot) and β^(Vq)\widehat{\beta}(V_{q}) is a reliable estimator.

We choose the index q^c[0,Qmax]\widehat{q}_{c}\in[0,Q_{\max}] as q^c=min0qQmax{q:𝒞(Vq)=0},\widehat{q}_{c}=\min_{0\leq q\leq Q_{\max}}\left\{q:\mathcal{C}(V_{q})=0\right\}, which is the smallest qq such that 𝒞(Vq)\mathcal{C}(V_{q}) is zero. The index q^c\widehat{q}_{c} is interpreted as follows: any matrix containing Vq^c{V}_{\widehat{q}_{c}} as a submatrix does not lead to a substantially different estimator from that based on Vq^c.{V}_{\widehat{q}_{c}}. This provides evidence that Vq^c{V}_{\widehat{q}_{c}} forms a good approximation of {g(Zi,Xi)}1in.\{g(Z_{i},X_{i})\}_{1\leq i\leq n}. Here, the sub-index “c” stands for comparison, as we compare different {Vq}0qQmax\{V_{q}\}_{0\leq q\leq Q_{\max}} to choose the best one. Note that q^c\widehat{q}_{c} always exists since 𝒞(VQmax)=0\mathcal{C}(V_{Q_{\max}})=0 by definition. In finite samples, certain violations may go undetected, especially when the TSCI estimators given by Vq^cV_{\widehat{q}_{c}} and Vq^c+1V_{\widehat{q}_{c}+1} are not significantly different. We therefore propose a more robust choice of the index as q^r=min{q^c+1,Qmax},\widehat{q}_{r}=\min\{\widehat{q}_{c}+1,Q_{\max}\}, where the sub-index “r” denotes the robust selection; see the discussion after Theorem 3. We summarize our proposed TSCI estimator in Algorithm 1.

Algorithm 1 TSCI with machine learning

Input: Data Xn×px,Z,D,YnX\in\mathbb{R}^{n\times p_{x}},Z,D,Y\in\mathbb{R}^{n}; sets of basis functions {𝒱q}q0\{\mathcal{V}_{q}\}_{q\geq 0} for approximating g()g(\cdot);

Output: q^c\widehat{q}_{c} and q^r\widehat{q}_{r}; β^(Vq^c)\widehat{\beta}(V_{\widehat{q}_{c}}) and β^(Vq^r)\widehat{\beta}(V_{\widehat{q}_{r}}); CI(Vq^c){\rm CI}(V_{\widehat{q}_{c}}) and CI(Vq^r){\rm CI}(V_{\widehat{q}_{r}}).

1:Generate matrices VqV_{q} based on 𝒱q\mathcal{V}_{q} for q0q\geq 0 as in (6);
2:Compute QmaxQ_{\max} as in (23) and ϵ^(VQmax)\widehat{\epsilon}(V_{Q_{\max}}) as in (21);
3:Compute {β^(Vq)}0qQmax\{\widehat{\beta}(V_{q})\}_{0\leq q\leq Q_{\max}} as in (24) and {H^(Vq,Vq)}0q<qQmax\{\widehat{H}(V_{q},V_{q^{\prime}})\}_{0\leq q<q^{\prime}\leq Q_{\max}} as in (25);
4:if Qmax1Q_{\max}\geq 1 then
5:     Compute {𝒞(Vq)}0qQmax1\{\mathcal{C}(V_{q})\}_{0\leq q\leq Q_{\max}-1} as in (27);
6:end if
7:Set 𝒞(VQmax)=0\mathcal{C}(V_{Q_{\max}})=0;
8:Compute q^c=min{0qQmax:𝒞(Vq)=0}\widehat{q}_{c}=\min\left\{0\leq q\leq Q_{\max}:\mathcal{C}(V_{q})=0\right\}; \triangleright Comparison selection
9:Compute q^r=min{q^c+1,Qmax}\widehat{q}_{r}=\min\{\widehat{q}_{c}+1,Q_{\max}\}; \triangleright Robust selection
10:Compute β^RF(Vq^c)\widehat{\beta}_{\rm RF}(V_{\widehat{q}_{c}}) and β^RF(Vq^r)\widehat{\beta}_{\rm RF}(V_{\widehat{q}_{r}}) as in (19) with V=Vq^c,Vq^r,V=V_{\widehat{q}_{c}},V_{\widehat{q}_{r}}, respectively;
11:Compute CIRF(Vq^c){\rm CI}_{\rm RF}(V_{\widehat{q}_{c}}) and CIRF(Vq^r){\rm CI}_{\rm RF}(V_{\widehat{q}_{r}}) as in (20) with V=Vq^c,Vq^r,V=V_{\widehat{q}_{c}},V_{\widehat{q}_{r}}, respectively.
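A compact R sketch of the selection steps (lines 4-9 of Algorithm 1) is given below, assuming the vector beta of estimators {β̂(V_q)} for q = 0,…,Q_max, the matrix H of variance estimates Ĥ(V_q,V_q'), and the bootstrap threshold rho_hat are already available; R indices are shifted by one relative to q.

```r
# Sketch of the comparison and robust selections in Algorithm 1.
select_q <- function(beta, H, rho_hat) {
  Qmax <- length(beta) - 1
  C <- rep(0, Qmax + 1)                       # C(V_Qmax) = 0 by definition
  if (Qmax >= 1) {
    for (q in 0:(Qmax - 1)) {
      stats <- sapply((q + 1):Qmax, function(qp)
        abs(beta[q + 1] - beta[qp + 1]) / sqrt(H[q + 1, qp + 1]))
      C[q + 1] <- as.numeric(max(stats) >= rho_hat)
    }
  }
  q_c <- min(which(C == 0)) - 1               # comparison selection
  q_r <- min(q_c + 1, Qmax)                   # robust selection
  list(q_c = q_c, q_r = q_r)
}
```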

As a remark, 𝒞(Vq,Vq)\mathcal{C}(V_{q},V_{q^{\prime}}) in (26) can be used to test IV validity. For any q1q^{\prime}\geq 1, 𝒞(V0,Vq)=1\mathcal{C}(V_{0},V_{q^{\prime}})=1 indicates that the estimator assuming a valid IV differs significantly from the estimator allowing for the violation form generated by VqV_{q^{\prime}}, suggesting that the valid IV assumption is violated.

3.5 Comparison to Double Machine Learning and Machine Learning IV

We compare our proposed TSCI with the double machine learning (DML) estimator (Chernozhukov et al., 2018) and the machine learning IV (MLIV) estimator (Chen et al., 2023; Liu et al., 2020). The most significant difference is that TSCI provides valid inference robust to a certain class of invalid IVs, while DML and MLIV are designed only for valid IV regimes. In the following, we focus on the valid IV setting and compare our proposal to DML and MLIV. In the DML framework, Chernozhukov et al. (2018) considered the outcome model Yi=Diβ+g(Xi)+ϵiY_{i}=D_{i}\beta+g(X_{i})+\epsilon_{i}, which is a special case of our outcome model (3) under valid IVs. The population parameter β\beta in DML can be identified through the following expression (Chernozhukov et al., 2018; Emmenegger and Bühlmann, 2021)

β=𝐄[Yi𝐄(YiXi)][Zi𝐄(ZiXi)]𝐄[Di𝐄(DiXi)][Zi𝐄(ZiXi)],\beta=\frac{{\mathbf{E}}\left[Y_{i}-{\mathbf{E}}(Y_{i}\mid X_{i})\right][Z_{i}-{\mathbf{E}}(Z_{i}\mid X_{i})]}{{\mathbf{E}}\left[D_{i}-{\mathbf{E}}(D_{i}\mid X_{i})\right][Z_{i}-{\mathbf{E}}(Z_{i}\mid X_{i})]}, (28)

where 𝐄(YiXi){\mathbf{E}}(Y_{i}\mid X_{i}), 𝐄(ZiXi),{\mathbf{E}}(Z_{i}\mid X_{i}), and 𝐄(DiXi){\mathbf{E}}(D_{i}\mid X_{i}) are fitted by machine learning algorithms.

As a fundamental difference, we apply machine learning algorithms to capture the nonlinear relation between DiD_{i} and Zi,XiZ_{i},X_{i} while (28) identifies β\beta based on the linear association 𝐄[Di𝐄(DiXi)][Zi𝐄(ZiXi)]{\mathbf{E}}\left[D_{i}-{\mathbf{E}}(D_{i}\mid X_{i})\right][Z_{i}-{\mathbf{E}}(Z_{i}\mid X_{i})]. Due to the first-stage ML, our proposed TSCI estimator is generally more efficient than the DML estimator. On the other hand, we approximate g(z,x)g(z,x) by a set of basis functions, which might provide an inaccurate estimator when the basis functions are misspecified. In contrast, DML learns the conditional mean model by general machine learning algorithms and does not particularly require such a specification of basis functions. We compare the performance of DML and TSCI with valid IVs in Section 5.1 and with invalid IVs in Section 5.3.
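For comparison, a schematic sample analogue of the DML identification formula (28) is shown below; Y_res, D_res, and Z_res denote the residuals Y−Ê(Y∣X), D−Ê(D∣X), and Z−Ê(Z∣X) from cross-fitted ML regressions and are hypothetical inputs, so this is only a sketch of the moment estimator rather than the full DML procedure.

```r
# Sample analogue of (28): a ratio of empirical covariances between residuals.
dml_beta <- function(Y_res, D_res, Z_res) {
  sum(Y_res * Z_res) / sum(D_res * Z_res)
}
```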

Chen et al. (2023) and Liu et al. (2020) proposed to use the ML prediction values f^𝒜1\widehat{f}_{\mathcal{A}_{1}} in (12) as the IV, referred to as the MLIV in the current paper. With this MLIV, the standard TSLS estimator can be implemented: in the first stage, run an OLS regression of D𝒜1D_{\mathcal{A}_{1}} on the MLIV f^𝒜1\widehat{f}_{\mathcal{A}_{1}} and the baseline covariates V𝒜1V_{\mathcal{A}_{1}} and construct the predicted value D^𝒜1=cff^𝒜1+V𝒜1cv\widehat{D}_{\mathcal{A}_{1}}=c_{f}\widehat{f}_{\mathcal{A}_{1}}+{V}_{\mathcal{A}_{1}}c_{v} with cfc_{f}\in\mathbb{R} and the vector cvc_{v} denoting the OLS regression coefficients; in the second stage, run the outcome regression with D𝒜1{D}_{\mathcal{A}_{1}} replaced by D^𝒜1\widehat{D}_{\mathcal{A}_{1}}. The TSLS estimator with MLIV can then be expressed as

β^MLIVY𝒜1PV𝒜1D^𝒜1/D^𝒜1PV𝒜1D^𝒜1=β+ϵ𝒜1PV𝒜1f^𝒜1/[cff^𝒜1PV𝒜1f^𝒜1].\widehat{\beta}_{\rm\texttt{MLIV}}\coloneqq{Y_{\mathcal{A}_{1}}^{\intercal}P_{V_{\mathcal{A}_{1}}}^{\perp}\widehat{D}_{\mathcal{A}_{1}}}/{\widehat{D}_{\mathcal{A}_{1}}^{\intercal}P_{V_{\mathcal{A}_{1}}}^{\perp}\widehat{D}_{\mathcal{A}_{1}}}=\beta+{\epsilon_{\mathcal{A}_{1}}^{\intercal}P_{V_{\mathcal{A}_{1}}}^{\perp}\widehat{f}_{\mathcal{A}_{1}}}/[{c_{f}\cdot\widehat{f}_{\mathcal{A}_{1}}^{\intercal}P_{V_{\mathcal{A}_{1}}}^{\perp}\widehat{f}_{\mathcal{A}_{1}}}]. (29)

The last equality of (29) holds by plugging in the outcome model Yi=Diβ+Viπ+ϵiY_{i}=D_{i}\beta+V_{i}^{\intercal}\pi+\epsilon_{i} and D^𝒜1=cff^𝒜1+V𝒜1cv\widehat{D}_{\mathcal{A}_{1}}=c_{f}\widehat{f}_{\mathcal{A}_{1}}+{V}_{\mathcal{A}_{1}}c_{v} and applying the orthogonality between D𝒜1D^𝒜1D_{\mathcal{A}_{1}}-\widehat{D}_{\mathcal{A}_{1}} and {f^𝒜1,V𝒜1}.\{\widehat{f}_{\mathcal{A}_{1}},{V}_{\mathcal{A}_{1}}\}. Note that a dominating term of β^init(V)β\widehat{\beta}_{\rm init}(V)-\beta in (17) is ϵ^𝒜1PV^𝒜1f^𝒜1/f^𝒜1PV^𝒜1f^𝒜1.{\widehat{\epsilon}_{\mathcal{A}_{1}}^{\intercal}P^{\perp}_{\widehat{V}_{\mathcal{A}_{1}}}\widehat{f}_{\mathcal{A}_{1}}}/{\widehat{f}_{\mathcal{A}_{1}}^{\intercal}P^{\perp}_{\widehat{V}_{\mathcal{A}_{1}}}\widehat{f}_{\mathcal{A}_{1}}}. In comparison, the last term in (29) is roughly inflated by a factor of 1/cf1/c_{f}, which appears due to the additional first-stage regression using the MLIV. When the IV is relatively strong, this factor cfc_{f} is close to 1, and MLIV and β^init(V)\widehat{\beta}_{\rm init}(V) perform similarly in valid IV settings. However, in the presence of relatively weak IVs, the coefficient cfc_{f} in front of f^𝒜1\widehat{f}_{\mathcal{A}_{1}} fluctuates considerably due to the weak association between D𝒜1D_{\mathcal{A}_{1}} and f^𝒜1.\widehat{f}_{\mathcal{A}_{1}}. The weak IV problem is thus exacerbated by the extra first-stage regression using the MLIV. This explains why the estimator β^MLIV\widehat{\beta}_{\rm\texttt{MLIV}} in (29) has a much larger bias and standard error than our proposed TSCI estimator in (19) when the IVs are relatively weak; see Section D.2 in the supplement for a detailed comparison.
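The two-stage construction behind (29) can be sketched in R as follows, assuming f_hat1 (the ML prediction of the treatment), V1 (the basis matrix, assumed to include an intercept column), and D1, Y1 on the subsample 𝒜1; this is an illustrative implementation of the MLIV idea, not the authors' code.

```r
# Sketch of the TSLS estimator with the MLIV f_hat as the instrument, as in (29).
mliv_beta <- function(Y1, D1, f_hat1, V1) {
  D_hat  <- fitted(lm(D1 ~ f_hat1 + V1 - 1))  # D_hat = c_f * f_hat + V * c_v
  P_perp <- diag(length(Y1)) - V1 %*% solve(t(V1) %*% V1) %*% t(V1)  # project out V
  as.numeric(t(Y1) %*% P_perp %*% D_hat) / as.numeric(t(D_hat) %*% P_perp %*% D_hat)
}
```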

4 Theoretical justification

We first establish the asymptotic normality of β^(V)\widehat{\beta}(V) for a given VV. To begin with, we present the required conditions. The first assumption is imposed on the regression errors in models (3) and (4) and the data {Vi,fi}1in\{V_{i},f_{i}\}_{1\leq i\leq n} with fi=f(Zi,Xi)f_{i}=f(Z_{i},X_{i}).

  1. (R1)

    Conditioning on Zi,XiZ_{i},X_{i}, ϵi\epsilon_{i} and δi\delta_{i} are sub-gaussian random variables, that is, there exists a positive constant K>0K>0 such that

    supZi,Ximax{(|ϵi|>tZi,Xi),(|δi|>tZi,Xi)}exp(K2t2/2),\sup_{Z_{i},X_{i}}\max\left\{\mathbb{P}(|\epsilon_{i}|>t\mid Z_{i},X_{i}),\mathbb{P}(|\delta_{i}|>t\mid Z_{i},X_{i})\right\}\leq\exp(-{K^{2}t^{2}}/{2}),

    with supZi,Xi\sup_{Z_{i},X_{i}} denoting the supremum over the support of the density of Zi,XiZ_{i},X_{i}. The random variables {Vi,fi}1in1\{V_{i},f_{i}\}_{1\leq i\leq n_{1}} satisfy λmin(i=1n1ViVi/n1)c\lambda_{\min}\left(\sum_{i=1}^{n_{1}}V_{i}V_{i}^{\intercal}/n_{1}\right)\geq c, i=1n1Vifi/n12C,\left\|\sum_{i=1}^{n_{1}}V_{i}f_{i}/n_{1}\right\|_{2}\leq C, max1in1{|fi|,Vi2}Clogn1,\max_{1\leq i\leq n_{1}}\{|f_{i}|,\|V_{i}\|_{2}\}\leq C\sqrt{\log n_{1}}, and i=1n1Vi[R(V)]i/n12CR(V)\left\|\sum_{i=1}^{n_{1}}V_{i}[R(V)]_{i}/n_{1}\right\|_{2}\leq C\|R(V)\|_{\infty}, where C>0C>0 and c>0c>0 are constants independent of nn and p.p. The matrix Ω\Omega defined in (12) satisfies λmax(Ω)C\lambda_{\max}(\Omega)\leq C for some positive constant C>0.C>0.

The conditional sub-gaussian assumption is required to establish some concentration results. For the special case where ϵi\epsilon_{i} and δi\delta_{i} are independent of Zi,Xi,Z_{i},X_{i}, it is sufficient to assume sub-gaussian errors ϵi\epsilon_{i} and δi.\delta_{i}. The sub-gaussian conditions on the regression errors may be relaxed to moment conditions. The conditions on ViV_{i} and fif_{i} will be automatically satisfied with high probability if 𝐄ViVi{\mathbf{E}}V_{i}V_{i}^{\intercal} is positive definite and ViV_{i} and fif_{i} are sub-gaussian random variables, where the sub-gaussianity conditions on {Vi,fi}1in\{V_{i},f_{i}\}_{1\leq i\leq n} may be relaxed to moment conditions. The proof of Lemma 1 in the supplement shows that λmax(Ω)1\lambda_{\max}(\Omega)\leq 1 for random forests and deep neural networks.

The second assumption is imposed on the generalized IV strength μ(V)\mu(V) defined in (22). Throughout the paper, the asymptotics is taken as n.n\to\infty.

  1. (R2)

    f𝒜1𝐌(V)f𝒜1{f}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){f}_{\mathcal{A}_{1}} satisfies f𝒜1𝐌(V)f𝒜1{f}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){f}_{\mathcal{A}_{1}}\rightarrow\infty and f𝒜1𝐌(V)f𝒜1Tr[𝐌(V)]{f}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){f}_{\mathcal{A}_{1}}\gg{\rm Tr}[\mathbf{M}({V})], with 𝐌(V)\mathbf{M}({V}) defined in (16).

The above condition is closely related to Condition 1, where f𝒜1𝐌(V)f𝒜1{f}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){f}_{\mathcal{A}_{1}} measures the variability of the estimated f^\widehat{f} after adjusting for VV used to approximate {g(Zi,Xi)}1in\{g(Z_{i},X_{i})\}_{1\leq i\leq n}. Intuitively, a larger value of f𝒜1𝐌(V)f𝒜1{f}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){f}_{\mathcal{A}_{1}} indicates that 𝐄(f(Zi,Xi)Viγ)2>0{\mathbf{E}}(f(Z_{i},X_{i})-V_{i}^{\intercal}\gamma^{*})^{2}>0 in Condition 1 holds more plausibly. The generalized IV strength μ(V)\mu(V) introduced in (22) is proportional to f𝒜1𝐌(V)f𝒜1{f}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){f}_{\mathcal{A}_{1}} when Var(δiXi,Zi){\rm Var}(\delta_{i}\mid X_{i},Z_{i}) is of a constant scale. Our proposed test in Section 3.3 is designed to test whether μ(V)\mu(V) is sufficiently large for TSCI. For the setting with 𝐄(f(Zi,Xi)Viγ)2{\mathbf{E}}(f(Z_{i},X_{i})-V_{i}^{\intercal}\gamma^{*})^{2} in Condition 1 being a positive constant, f𝒜1𝐌(V)f𝒜1{f}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){f}_{\mathcal{A}_{1}} can be of the order nn, making Condition (R2) a mild assumption.

The following proposition establishes the consistency of β^init(V)\widehat{\beta}_{\rm init}(V) if Conditions (R1) and (R2) hold and the approximation errors {Ri(V)=g(Zi,Xi)Viπ}1in\{R_{i}(V)=g(Z_{i},X_{i})-V^{\intercal}_{i}\pi\}_{1\leq i\leq n} are small.

Proposition 1

Consider the models (3) and (4). If Conditions (R1) and (R2) hold and R𝒜1(V)22f𝒜1𝐌(V)f𝒜1\|R_{\mathcal{A}_{1}}(V)\|_{2}^{2}\ll{f}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){f}_{\mathcal{A}_{1}}, then β^init(V)\widehat{\beta}_{\rm init}(V) defined in (16) satisfies β^init(V)𝑝β.\widehat{\beta}_{\rm init}(V)\overset{p}{\to}\beta.

4.1 Improvement with bias correction

In the following, we present the distributional properties of the bias-corrected TSCI estimator β^(V)\widehat{\beta}(V) defined in (19) and demonstrate the advantage of this extra bias correction step. For establishing the asymptotic normality, we further impose a stronger generalized IV strength condition than (R2).

  1. (R2-I)

    f𝒜1[𝐌(V)]2f𝒜1,{f}_{\mathcal{A}_{1}}^{\intercal}[\mathbf{M}(V)]^{2}{f}_{\mathcal{A}_{1}}\rightarrow\infty, f𝒜1[𝐌(V)]2f𝒜1R(V)22,{f}_{\mathcal{A}_{1}}^{\intercal}[\mathbf{M}(V)]^{2}{f}_{\mathcal{A}_{1}}\gg\|R(V)\|_{2}^{2}, and

    f𝒜1[𝐌(V)]2f𝒜1max{(Tr[𝐌(V)])c,lognηn(V)2(Tr[𝐌(V)])2},{f}_{\mathcal{A}_{1}}^{\intercal}[\mathbf{M}(V)]^{2}{f}_{\mathcal{A}_{1}}\gg{\max\left\{({\rm Tr}[\mathbf{M}(V)])^{c},\log n\cdot\eta_{n}(V)^{2}\cdot({\rm Tr}[\mathbf{M}(V)])^{2}\right\}},

    where c>1c>1 is some positive constant and ηn()\eta_{n}(\cdot) is defined as

    ηn(V)=f𝒜1f^𝒜1+(|ββ^init(V)|+R(V)+lognn)(logn+f𝒜1f^𝒜1).\eta_{n}(V)=\|f_{\mathcal{A}_{1}}-\widehat{f}_{\mathcal{A}_{1}}\|_{\infty}+\left(|\beta-\widehat{\beta}_{\rm init}(V)|+\|R(V)\|_{\infty}+{{\frac{\log n}{\sqrt{n}}}}\right)(\sqrt{\log n}+\|f_{\mathcal{A}_{1}}-\widehat{f}_{\mathcal{A}_{1}}\|_{\infty}). (30)

The above condition requires the IV to be sufficiently strong, where f𝒜1[𝐌(V)]2f𝒜1{f}_{\mathcal{A}_{1}}^{\intercal}[\mathbf{M}(V)]^{2}{f}_{\mathcal{A}_{1}} can be viewed as another measure of IV strength. If the treatment model is fitted with a neural network, we have f𝒜1[𝐌(V)]2f𝒜1=f𝒜1𝐌(V)f𝒜1{f}_{\mathcal{A}_{1}}^{\intercal}[\mathbf{M}(V)]^{2}{f}_{\mathcal{A}_{1}}={f}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){f}_{\mathcal{A}_{1}}. For random forests, we only have f𝒜1[𝐌(V)]2f𝒜1f𝒜1𝐌(V)f𝒜1{f}_{\mathcal{A}_{1}}^{\intercal}[\mathbf{M}(V)]^{2}{f}_{\mathcal{A}_{1}}\leq{f}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){f}_{\mathcal{A}_{1}}. The quantity ηn(V)\eta_{n}(V) defined in (30) depends on the accuracy of the ML prediction model f^.\widehat{f}. We have ηn(V)0\eta_{n}(V)\rightarrow 0 if f𝒜1f^𝒜10\|f_{\mathcal{A}_{1}}-\widehat{f}_{\mathcal{A}_{1}}\|_{\infty}\rightarrow 0, β^init(V)𝑝β\widehat{\beta}_{\rm init}(V)\overset{p}{\to}\beta, and R(V)0\|R(V)\|_{\infty}\rightarrow 0. However, even for an inconsistent f^\widehat{f}, ηn(V)\eta_{n}(V) is of smaller order than logn\sqrt{\log n} as long as f𝒜1f^𝒜1\|f_{\mathcal{A}_{1}}-\widehat{f}_{\mathcal{A}_{1}}\|_{\infty} is of a constant scale.

The following theorem establishes the asymptotic normality of the TSCI estimator.

Theorem 1

Consider the models (3) and (4). Suppose that Condition (R1), (R2-I) hold and

max1in1σi2[𝐌(V)f𝒜1]i2i=1n1σi2[𝐌(V)f𝒜1]i20withσi2=𝐄(ϵi2Zi,Xi).\frac{\max_{1\leq i\leq n_{1}}\sigma_{i}^{2}\cdot\left[\mathbf{M}(V){f}_{\mathcal{A}_{1}}\right]_{i}^{2}}{\sum_{i=1}^{n_{1}}\sigma_{i}^{2}\cdot\left[\mathbf{M}(V){f}_{\mathcal{A}_{1}}\right]_{i}^{2}}\rightarrow 0\quad\text{with}\quad\sigma_{i}^{2}={\mathbf{E}}(\epsilon_{i}^{2}\mid Z_{i},X_{i}). (31)

Then β^(V)\widehat{\beta}(V) defined in (19) satisfies

1SE(V)(β^(V)β)𝑑N(0,1),withSE(V)=i=1n1σi2[𝐌(V)f𝒜1]i2f𝒜1𝐌(V)f𝒜1.\frac{1}{{\rm SE}(V)}\left(\widehat{\beta}(V)-\beta\right)\overset{d}{\to}N(0,1),\quad\text{with}\quad{\rm SE}(V)=\frac{\sqrt{\sum_{i=1}^{n_{1}}\sigma^{2}_{i}[\mathbf{M}(V){f}_{\mathcal{A}_{1}}]_{i}^{2}}}{{f}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){f}_{\mathcal{A}_{1}}}.

If SE^(V)\widehat{\rm SE}(V) used in (20) satisfies SE^(V)/SE(V)𝑝1,\widehat{\rm SE}(V)/{\rm SE}(V)\overset{p}{\to}1, then the confidence interval CI(V){\rm CI}(V) in (20) satisfies lim infn(βCI(V))=1α.\liminf_{n\rightarrow\infty}\mathbb{P}(\beta\in{\rm CI}(V))=1-\alpha.

We emphasize that the validity of our proposed confidence interval does not require the ML prediction model f^\widehat{f} to be a consistent estimator of ff. Particularly, Conditions (R2) and (R2-I) can be plausibly satisfied as long as the ML algorithms capture enough association between the treatment and the IVs. Condition (31) is imposed so that no single entry of the vector 𝐌(V)f𝒜1\mathbf{M}(V){f}_{\mathcal{A}_{1}} dominates all other entries, which is needed to verify the Lindeberg condition. The standard error SE(V){\rm SE}(V) relies on the generalized IV strength for a given matrix VV. If f𝒜1𝐌(V)f𝒜1/n1{f}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){f}_{\mathcal{A}_{1}}/n_{1} is of constant order, then SE(V)1/n.{\rm SE}(V)\lesssim{1}/{\sqrt{n}}. A larger matrix VV will generally lead to a larger SE(V){\rm SE}(V) because f𝒜1𝐌(V)f𝒜1{f}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){f}_{\mathcal{A}_{1}} decreases after adjusting for more information contained in VV. The consistency of SE^(V)\widehat{\rm SE}(V) is presented in Lemma 2 in the supplement.
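In practice, SE(V) is estimated by plugging the fitted f̂ on 𝒜1 in place of the unknown f and the residuals ε̂ in place of σ_i; a minimal R sketch of the resulting Wald-type interval, which we assume matches the construction in (20), is shown below.

```r
# Plug-in standard error of Theorem 1 and the corresponding (1 - alpha) CI.
tsci_ci <- function(beta_hat, M_V, f_hat1, eps_hat, alpha = 0.05) {
  Mf <- as.vector(M_V %*% f_hat1)
  se <- sqrt(sum(eps_hat^2 * Mf^2)) / as.numeric(t(f_hat1) %*% M_V %*% f_hat1)
  beta_hat + c(-1, 1) * qnorm(1 - alpha / 2) * se
}
```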

We now explain the effectiveness of the bias correction for the homoscedastic setting with Cov(ϵi,δiZi,Xi)=Cov(ϵi,δi){\rm Cov}(\epsilon_{i},\delta_{i}\mid Z_{i},X_{i})={\rm Cov}(\epsilon_{i},\delta_{i}). For the initial estimator β^init(V),\widehat{\beta}_{\rm init}(V), we can establish that 1SE(V)(β^init(V)β)=𝒢(V)+~(V),\frac{1}{{\rm SE}(V)}\left(\widehat{\beta}_{\rm init}(V)-\beta\right)={\mathcal{G}}(V)+\widetilde{\mathcal{E}}(V), where 𝒢(V)𝑑N(0,1){\mathcal{G}}(V)\overset{d}{\to}N(0,1) and

|~(V)|Cov(ϵi,δi)Tr[𝐌(V)]+R(V)2f𝒜1[𝐌(V)]2f𝒜1+Tr([𝐌(V)]2)(f𝒜1[𝐌(V)]2f𝒜1)c0,\left|\widetilde{\mathcal{E}}(V)\right|\leq\frac{{\rm Cov}(\epsilon_{i},\delta_{i})\cdot{\rm Tr}[\mathbf{M}(V)]+\|R(V)\|_{2}}{\sqrt{{f}_{\mathcal{A}_{1}}^{\intercal}[\mathbf{M}(V)]^{2}{f}_{\mathcal{A}_{1}}}}+\frac{\sqrt{{\rm Tr}([\mathbf{M}({V})]^{2})}}{({f}_{\mathcal{A}_{1}}^{\intercal}[\mathbf{M}(V)]^{2}{f}_{\mathcal{A}_{1}})^{c_{0}}}, (32)

for some positive constant c0>0.c_{0}>0. Our proposed bias-corrected estimator β^(V)\widehat{\beta}(V) is effective in reducing the bias component ~(V).\widetilde{\mathcal{E}}(V). Particularly, Theorem 4 in Section A.10 in the supplement establishes that the term Cov(ϵi,δi)Tr[𝐌(V)]/f𝒜1[𝐌(V)]2f𝒜1{\rm Cov}(\epsilon_{i},\delta_{i})\cdot{\rm Tr}[\mathbf{M}(V)]/\sqrt{{f}_{\mathcal{A}_{1}}^{\intercal}[\mathbf{M}(V)]^{2}{f}_{\mathcal{A}_{1}}} in (32) is reduced to ηn(V)Tr[𝐌(V)]/f𝒜1[𝐌(V)]2f𝒜1\eta_{n}(V)\cdot{\rm Tr}[\mathbf{M}(V)]/\sqrt{{f}_{\mathcal{A}_{1}}^{\intercal}[\mathbf{M}(V)]^{2}{f}_{\mathcal{A}_{1}}}. If ηn(V)0\eta_{n}(V)\rightarrow 0, which can be achieved for a consistent ML prediction f^\widehat{f}, the bias-corrected TSCI estimator effectively reduces the finite-sample or higher-order bias. However, even if f^𝒜1\widehat{f}_{\mathcal{A}_{1}} is inconsistent with f𝒜1f^𝒜12n\|f_{\mathcal{A}_{1}}-\widehat{f}_{\mathcal{A}_{1}}\|_{2}\lesssim\sqrt{n}, the bias correction will not lead to a worse estimator.

4.2 Guarantee for Algorithm 1

We now justify the validity of the confidence interval from the TSCI Algorithm 1, which is based on the asymptotic normality of β^(Vq^)\widehat{\beta}(V_{\widehat{q}}) with a data-dependent index q^\widehat{q}. The property of q^\widehat{q} relies on a careful analysis of the difference β^(Vq)β^(Vq)\widehat{\beta}(V_{q})-\widehat{\beta}(V_{q^{\prime}}) for 0q<qQmax0\leq q<q^{\prime}\leq Q_{\max}.

We start with the setting where both VqV_{q} and VqV_{q^{\prime}} provide good approximations to {g(Zi,Xi)}1in,\{g(Z_{i},X_{i})\}_{1\leq i\leq n}, leading to sufficiently small R(Vq)2\|R(V_{q})\|_{2} and R(Vq)2\|R(V_{q^{\prime}})\|_{2}. In this case, the dominating component of β^(Vq)β^(Vq)\widehat{\beta}(V_{q})-\widehat{\beta}(V_{q^{\prime}}) is Sϵ𝒜1S^{\intercal}\epsilon_{\mathcal{A}_{1}} with

S=(1f𝒜1𝐌(Vq)f𝒜1𝐌(Vq)1f𝒜1𝐌(Vq)f𝒜1𝐌(Vq))f𝒜1n1.S=\left(\frac{1}{f_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q^{\prime}})f_{\mathcal{A}_{1}}}\mathbf{M}(V_{q^{\prime}})-\frac{1}{f_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q})f_{\mathcal{A}_{1}}}\mathbf{M}(V_{q})\right)f_{\mathcal{A}_{1}}\in\mathbb{R}^{n_{1}}. (33)

Conditioning on the data in 𝒜2\mathcal{A}_{2} and {Xi,Zi}i𝒜1,\{X_{i},Z_{i}\}_{i\in\mathcal{A}_{1}}, Sϵ𝒜1S^{\intercal}\epsilon_{\mathcal{A}_{1}} is of zero mean and variance

H(Vq,Vq)=i=1n1σi2[𝐌(Vq)f𝒜1]i2[f𝒜1𝐌(Vq)f𝒜1]2+i=1n1σi2[𝐌(Vq)f𝒜1]i2[f𝒜1𝐌(Vq)f𝒜1]22i=1n1σi2[𝐌(Vq)f𝒜1]i[𝐌(Vq)f𝒜1]i[f𝒜1𝐌(Vq)f𝒜1][f𝒜1𝐌(Vq)f𝒜1],\displaystyle{H}(V_{q},V_{q^{\prime}})=\frac{\sum_{i=1}^{n_{1}}\sigma_{i}^{2}[\mathbf{M}(V_{q^{\prime}}){f}_{\mathcal{A}_{1}}]_{i}^{2}}{[{f}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q^{\prime}})f_{\mathcal{A}_{1}}]^{2}}+\frac{\sum_{i=1}^{n_{1}}\sigma_{i}^{2}[\mathbf{M}(V_{q}){f}_{\mathcal{A}_{1}}]_{i}^{2}}{[{f}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q})f_{\mathcal{A}_{1}}]^{2}}-2\frac{\sum_{i=1}^{n_{1}}\sigma_{i}^{2}[\mathbf{M}(V_{q^{\prime}}){f}_{\mathcal{A}_{1}}]_{i}[\mathbf{M}(V_{q}){f}_{\mathcal{A}_{1}}]_{i}}{[{f}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q^{\prime}})f_{\mathcal{A}_{1}}]\cdot[{f}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q})f_{\mathcal{A}_{1}}]}, (34)

with σi2=𝐄(ϵi2Zi,Xi).\sigma_{i}^{2}={\mathbf{E}}(\epsilon_{i}^{2}\mid Z_{i},X_{i}). The following condition requires that the random component Sϵ𝒜1S^{\intercal}\epsilon_{\mathcal{A}_{1}}, whose variance is H(Vq,Vq)H(V_{q},V_{q^{\prime}}), dominates the other components of β^(Vq)β^(Vq),\widehat{\beta}(V_{q})-\widehat{\beta}(V_{q^{\prime}}), which are mainly finite-sample approximation errors.

  1. (R3)

    The variance H(Vq,Vq)H(V_{q},V_{q^{\prime}}) in (34) satisfies

    H(Vq,Vq)maxV{Vq,Vq}{1μ(V)[1+(1+lognηn(VQmax))Tr[𝐌(V)]]},\sqrt{H(V_{q},V_{q^{\prime}})}\gg\max_{V\in\{V_{q},V_{q^{\prime}}\}}\left\{\frac{1}{\mu(V)}\left[1+(1+\sqrt{\log n}\cdot\eta_{n}(V_{Q_{\max}}))\cdot{\rm Tr}[\mathbf{M}(V)]\right]\right\},

    with ηn()\eta_{n}(\cdot) defined in (30). There exists c>0c>0 such that Var(δiZi,Xi)c.{\rm Var}(\delta_{i}\mid Z_{i},X_{i})\geq c.

When the first stage is fitted with the basis method or a neural network and Var(ϵiZi,Xi)=σϵ2,Var(δiZi,Xi)=σδ2,{\rm Var}(\epsilon_{i}\mid Z_{i},X_{i})=\sigma_{\epsilon}^{2},{\rm Var}(\delta_{i}\mid Z_{i},X_{i})=\sigma_{\delta}^{2}, we have H(Vq,Vq)=σϵ2(1/f𝐌(Vq)f1/f𝐌(Vq)f)H(V_{q},V_{q^{\prime}})=\sigma_{\epsilon}^{2}\left({1}/{f^{\intercal}\mathbf{M}({V}_{q^{\prime}})f}-{1}/{f^{\intercal}\mathbf{M}({V}_{q})f}\right) and μ(Vq)=f𝐌(Vq)f/σδ2{\mu(V_{q})}={f^{\intercal}\mathbf{M}({V}_{q})f}/\sigma_{\delta}^{2} for qq.q\leq q^{\prime}. If we assume that f𝐌(Vq)f=cf𝐌(Vq)ff^{\intercal}\mathbf{M}({V}_{q^{\prime}})f=c_{*}f^{\intercal}\mathbf{M}({V}_{q})f for some 0<c<1,0<c_{*}<1, we have H(V_{q},V_{q^{\prime}})=\frac{1-c_{*}}{c_{*}}\cdot\frac{\sigma_{\epsilon}^{2}}{\sigma_{\delta}^{2}}\cdot\frac{1}{\mu(V_{q})}. In this case, Condition (R3) is satisfied if μ(Vq)(Tr[𝐌(Vq)])2\mu(V_{q})\gg({\rm Tr}[\mathbf{M}(V_{q})])^{2} and μ(Vq)(Tr[𝐌(Vq)])2\mu(V_{q^{\prime}})\gg({\rm Tr}[\mathbf{M}(V_{q^{\prime}})])^{2} up to some polynomial order of logn\log n, in which case Condition (R3) is slightly stronger than Condition (R2-I).

The following theorem establishes the asymptotic normality of β^(Vq)β^(Vq)\widehat{\beta}(V_{q})-\widehat{\beta}(V_{q^{\prime}}) under the null setting where both R(Vq)R(V_{q}) and R(Vq)R(V_{q^{\prime}}) are small.

Theorem 2

Consider the models (3) and (4). Suppose that (R1) and (R3) hold, (R2) holds for V{Vq,Vq}V\in\{V_{q},V_{q^{\prime}}\}, and SS defined in (33) satisfies maxi𝒜1Si2/(i𝒜1Si2)0\max_{i\in\mathcal{A}_{1}}{S_{i}^{2}}/(\sum_{i\in\mathcal{A}_{1}}S_{i}^{2})\rightarrow 0. If H(Vq,Vq)maxV{Vq,Vq}R(V)2/μ(V),\sqrt{H(V_{q},V_{q^{\prime}})}\gg\max_{V\in\{V_{q},V_{q^{\prime}}\}}{\|R(V)\|_{2}}/{\sqrt{\mu(V)}}, then (β^(Vq)β^(Vq))/H(Vq,Vq)𝑑N(0,1),\left(\widehat{\beta}(V_{q})-\widehat{\beta}(V_{q^{\prime}})\right)/{\sqrt{H(V_{q},V_{q^{\prime}})}}\overset{d}{\to}N(0,1), with β^()\widehat{\beta}(\cdot) and H(Vq,Vq)H(V_{q},V_{q^{\prime}}) defined in (24) and (34), respectively.

For small approximation errors R(Vq)2\|R(V_{q})\|_{2} and R(Vq)2\|R(V_{q^{\prime}})\|_{2}, the condition H(Vq,Vq)maxV{Vq,Vq}R(V)2/μ(V)\sqrt{H(V_{q},V_{q^{\prime}})}\gg\max_{V\in\{V_{q},V_{q^{\prime}}\}}{\|R(V)\|_{2}}/{\sqrt{\mu(V)}} holds. In this case, Theorem 2 establishes that the difference β^(Vq)β^(Vq)\widehat{\beta}(V_{q})-\widehat{\beta}(V_{q^{\prime}}) is centered at zero and has an asymptotic normal distribution.

We now move on to the case where at least one of VqV_{q} and VqV_{q^{\prime}} does not approximate {g(Zi,Xi)}1in\{g(Z_{i},X_{i})\}_{1\leq i\leq n} well. In this case, Theorem 2 does not apply, and we establish in Theorem 5 in the supplement that (β^(Vq)β^(Vq))/H(Vq,Vq)(\widehat{\beta}(V_{q})-\widehat{\beta}(V_{q^{\prime}}))/{\sqrt{H(V_{q},V_{q^{\prime}})}} is centered at

n(Vq,Vq)=1H(Vq,Vq)(D𝒜1𝐌(Vq)[R(Vq)]𝒜1D𝒜1𝐌(Vq)D𝒜1D𝒜1𝐌(Vq)[R(Vq)]𝒜1D𝒜1𝐌(Vq)D𝒜1).\mathcal{L}_{n}(V_{q},V_{q^{\prime}})=\frac{1}{\sqrt{H(V_{q},V_{q^{\prime}})}}\left(\frac{D_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q})[R(V_{q})]_{\mathcal{A}_{1}}}{{D}^{\intercal}_{\mathcal{A}_{1}}\mathbf{M}(V_{q})D_{\mathcal{A}_{1}}}-\frac{D_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q^{\prime}})[R(V_{q^{\prime}})]_{\mathcal{A}_{1}}}{{D}^{\intercal}_{\mathcal{A}_{1}}\mathbf{M}(V_{q^{\prime}})D_{\mathcal{A}_{1}}}\right). (35)

To simultaneously quantify the errors of {β^(Vq)β^(Vq)}0q<qQmax,\{\widehat{\beta}(V_{q})-\widehat{\beta}(V_{q^{\prime}})\}_{0\leq q<q^{\prime}\leq Q_{\max}}, we define the upper α0\alpha_{0} quantile ρ(α0)\rho(\alpha_{0}) of the maximum of the multiple random error components built from (33),

(max0q<qQmax1H(Vq,Vq)|f𝒜1𝐌(Vq)ϵ𝒜1f𝒜1𝐌(Vq)f𝒜1f𝒜1𝐌(Vq)ϵ𝒜1f𝒜1𝐌(Vq)f𝒜1|ρ(α0))=α0.\mathbb{P}\left(\max_{0\leq q<q^{\prime}\leq Q_{\max}}\frac{1}{\sqrt{{H}(V_{q},V_{q^{\prime}})}}\left|\frac{f_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q^{\prime}})\epsilon_{\mathcal{A}_{1}}}{{f}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q^{\prime}})f_{\mathcal{A}_{1}}}-\frac{f_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q})\epsilon_{\mathcal{A}_{1}}}{f_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q})f_{\mathcal{A}_{1}}}\right|\geq\rho(\alpha_{0})\right)=\alpha_{0}. (36)

We introduce the following condition for accurate selection among {𝒱q}0qQ.\{\mathcal{V}_{q}\}_{0\leq q\leq Q}.

  1. (R4)

    For 𝒱0𝒱1𝒱Q\mathcal{V}_{0}\subset\mathcal{V}_{1}\subset\cdots\subset\mathcal{V}_{Q} and corresponding matrices {Vq}0qQ\{{V}_{q}\}_{0\leq q\leq Q}, there exists q{0,1,2,,Q}q^{*}\in\{0,1,2,\cdots,Q\} such that qQmaxq^{*}\leq Q_{\max} and R(Vq)=0R(V_{q^{*}})=0 with QmaxQ_{\max} defined in (23). For any integer q[0,q1],q\in[0,q^{*}-1], there exists an integer q[q+1,q]q^{\prime}\in[q+1,q^{*}] such that

    n(Vq,Vq)Aρ(α0)withA>2,\mathcal{L}_{n}(V_{q},V_{q^{\prime}})\geq A\rho({\alpha_{0}})\quad\text{with}\quad A>2, (37)

    where n(Vq,Vq)\mathcal{L}_{n}(V_{q},V_{q^{\prime}}) is defined in (35) and ρ(α0)\rho({\alpha_{0}}) is defined in (36).

The above condition ensures that there exists 𝒱q\mathcal{V}_{q^{*}} such that the function gg is well approximated by the column space of VqV_{q^{*}}. The well-separation condition (37) is interpreted as follows: if gg is not well approximated by the column space of VqV_{q}, then the estimation bias of β^(Vq)\widehat{\beta}(V_{q}) is larger than the uncertainty ρ(α0)\rho({\alpha_{0}}) due to the multiple random error components. Note that n(Vq,Vq)\mathcal{L}_{n}(V_{q},V_{q^{*}}) defined in (35) is a measure of the bias of β^(Vq).\widehat{\beta}(V_{q}).

The following theorem guarantees the coverage property for the CIs corresponding to Vq^cV_{\widehat{q}_{c}} and Vq^rV_{\widehat{q}_{r}} in Algorithm 1.

Theorem 3

Consider the models (3) and (4). Suppose that Conditions (R1) and (R4) hold, Condition (R2-I) holds for V{Vq}0qQmaxV\in\{V_{q}\}_{0\leq q\leq Q_{\max}}, Condition (R3) holds for any 0q<qQmax0\leq q<q^{\prime}\leq Q_{\max}, H^(Vq,Vq)/H(Vq,Vq)𝑝1\widehat{H}(V_{q},V_{q^{\prime}})/H(V_{q},V_{q^{\prime}})\overset{p}{\to}1, and ρ^/ρ(α0)𝑝1\widehat{\rho}/\rho(\alpha_{0})\overset{p}{\to}1 with ρ^\widehat{\rho} used in (27) and ρ(α0)\rho(\alpha_{0}) defined in (36), respectively. Our proposed CI in Algorithm 1 satisfies lim infn[βCI(Vq^)]1α2α0\liminf_{n\rightarrow\infty}\mathbb{P}\left[\beta\in{\rm CI}(V_{\widehat{q}})\right]\geq 1-\alpha-2\alpha_{0} with q^=q^c\widehat{q}=\widehat{q}_{c} or q^=q^r\widehat{q}=\widehat{q}_{r}, where α0\alpha_{0} is used in (36).

The condition (37) of (R4) is critical to guarantee the selection consistency q^c=q\widehat{q}_{c}=q^{*}, ensuring that CI(Vq^c){\rm CI}(V_{\widehat{q}_{c}}) achieves the desired coverage. However, in finite samples, the selection among {Vq}0qQmax\{V_{q}\}_{0\leq q\leq Q_{\max}} may be erroneous. The selection method q^r\widehat{q}_{r} is more robust in the sense that statistical inference based on Vq^rV_{\widehat{q}_{r}} remains valid even if we cannot separate Vq1V_{q^{*}-1} and Vq.V_{q^{*}}. To achieve uniform inference without requiring the well-separation condition (37), we may simply apply TSCI with 𝒱Qmax\mathcal{V}_{Q_{\max}}; however, the resulting confidence interval might be conservative since it adjusts for the large VQmax.V_{Q_{\max}}.

5 Simulation studies

In Section 5.1, we consider the valid IV settings and compare our proposal to the DML estimator proposed in Chernozhukov et al. (2018). In Sections 5.2 and 5.3, we demonstrate our proposal for general invalid IV settings and also compare it with the DML estimator. In Section D.3 in the supplement, we further consider multiple IVs settings and compare TSCI with existing methods based on the majority rule. We compute all measures using 500 simulations throughout the numerical studies. The code for replicating the numerical results is available at https://github.com/zijguo/TSCI-Replication.

5.1 Comparison with DML in valid IV settings

We focus on the valid IV settings and illustrate the advantages and limitations of both TSCI and DML. We adopt the data-generating setting used in the R package dmlalg (Emmenegger, 2021) and implement DML with the R package DoubleML (Bach et al., 2021). We set the sample size as n=3000n=3000 and generate the baseline covariate XiX_{i}\in\mathbb{R} following the uniform distribution on [π,π][-\pi,\pi]. Conditioning on XiX_{i}, we generate the IV ZiZ_{i} and the hidden confounder HiH_{i} following ZiXiN(3tanh(2Xi1),1)Z_{i}\mid X_{i}\sim N(3\cdot\tanh(2X_{i}-1),1) and HiXiN(2sin(Xi),1)H_{i}\mid X_{i}\sim N(2\cdot\sin(X_{i}),1), respectively. We generate the outcome YiY_{i} and treatment DiD_{i} using the SEM in (9). Specifically, we generate the outcome YiY_{i} following Yi=Di+Xi2/23cos(πHi/4)+ei,2Y_{i}=D_{i}+X_{i}^{2}/2-3\cdot\cos(\pi H_{i}/4)+e_{i,2} and consider the following two treatment models,

  • Setting S1: Di=a|Zi|2tanh(Xi)Hi+ei,1,D_{i}=-a\cdot|Z_{i}|-2\cdot\tanh(X_{i})-H_{i}+e_{i,1},

  • Setting S2: Di=aZi2/22tanh(Xi)Hi+ei,1,D_{i}=a\cdot Z_{i}^{2}/2-2\cdot\tanh(X_{i})-H_{i}+e_{i,1},

where aa controls the nonlinearity of the treatment model and the strength of the IV, and ei,1N(0,1)e_{i,1}\sim N(0,1) and ei,2N(0,1)e_{i,2}\sim N(0,1) are independent noises. Note that a larger value of aa in the treatment model leads to a more nonlinear dependence between DiD_{i} and ZiZ_{i}. The treatment model and the outcome model are consistent with (4) and (3), with δi\delta_{i} and ϵi\epsilon_{i} being composed of a hidden confounder term and an independent random noise.
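For concreteness, a minimal R sketch of the data-generating process for Setting S1 is given below; it only illustrates the simulation design and uses a=1 as an example value.

```r
# Data generation for Setting S1 (valid IV): nonlinear treatment model in Z.
set.seed(1)
n <- 3000; a <- 1
X <- runif(n, -pi, pi)                             # baseline covariate
Z <- rnorm(n, mean = 3 * tanh(2 * X - 1))          # IV given X
H <- rnorm(n, mean = 2 * sin(X))                   # hidden confounder given X
D <- -a * abs(Z) - 2 * tanh(X) - H + rnorm(n)      # treatment model of Setting S1
Y <- D + X^2 / 2 - 3 * cos(pi * H / 4) + rnorm(n)  # outcome model
```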

In addition to the above two settings, we consider another setting where g(Xi)g(X_{i}) in the outcome model is more complicated. We set px=5p_{x}=5 and for 1in1\leq i\leq n, we generate {Xi,j}1jpx\{X_{i,j}\}_{1\leq j\leq p_{x}} which are independently distributed, and each of them follows a uniform distribution on [1,1][-1,1]. We generate ZiZ_{i} and HiH_{i} following ZiXiN(3tanh(2Xi,11),1)Z_{i}\mid X_{i}\sim N(3\cdot\tanh(2X_{i,1}-1),1) and HiXiN(2sin(Xi,1),1)H_{i}\mid X_{i}\sim N(2\cdot\sin(X_{i,1}),1). We generate {Di,Yi}1in\{D_{i},Y_{i}\}_{1\leq i\leq n} following

  • Setting S3: Di=bZi+sin(2πZi)+3/2cos(2πZi)1/5j=1pxXi,j1/2Hi+ei,1D_{i}=b\cdot Z_{i}+\sin(2\pi Z_{i})+3/2\cdot\cos(2\pi Z_{i})-1/5\cdot\sum_{j=1}^{p_{x}}X_{i,j}-1/2\cdot H_{i}+e_{i,1} and Yi=Di/2+g(Xi)cos(πHi/4)+ei,2Y_{i}=D_{i}/2+g(X_{i})-\cos(\pi H_{i}/4)+e_{i,2} with g(Xi)=1jkpxXi,jXi,kg(X_{i})=\sum_{1\leq j\neq k\leq p_{x}}X_{i,j}X_{i,k},

where the value of bb controls the linear association strength in Setting S3. We implement Algorithm 1 with random forests by specifying a collection of basis functions for g(x)g(x). Since the IV is valid here, we use the set of basis functions 𝒱0={1,𝐛1(x1),,𝐛px(xpx)}\mathcal{V}_{0}=\{1,\overrightarrow{\bf b}_{1}(x_{1}),\cdots,\overrightarrow{\bf b}_{p_{x}}(x_{p_{x}})\} with 𝐛j(xj)\overrightarrow{\bf b}_{j}(x_{j}) denoting the B-spline basis with 5 knots of xjx_{j} for 1jpx1\leq j\leq p_{x}.

Figure 2: Comparison of DML and TSCI in terms of RMSE, CI coverage, and length under Settings S1, S2, and S3. A larger value of the constant aa in Settings S1 and S2 corresponds to a higher nonlinearity level in the treatment model, and a larger value of bb in Setting S3 to a higher linearity level. In addition, larger values of aa and bb indicate a larger generalized IV strength.

In Figure 2, we compare DML and TSCI under Settings S1, S2, and S3. In the top two panels, TSCI achieves the desired coverage level in Settings S1 and S2. Even though DML achieves the desired coverage in Setting S1, its RMSE and CI length are uniformly larger than those of our proposed TSCI. As discussed in Section 3.5, this happens because DML only uses the linear association in the treatment model, while TSCI can increase the IV strength by applying ML algorithms to fit nonlinear treatment models. We designed Setting S2 as a less favorable setting for DML, where the linear association between the treatment and the IV is nearly zero but the nonlinear association is strong. Consequently, the CI length and RMSE of the DML procedure are much larger than those of our proposed TSCI. In Setting S3, TSCI does not show an advantage over DML and suffers from under-coverage. This is caused by the estimation bias due to the misspecification of g(x)g(x) in the outcome model. DML is not exposed to such bias because it uses ML algorithms to fit E(YiXi)E(Y_{i}\mid X_{i}). We also implement TSLS in Settings S1, S2, and S3, but we do not include it in the figures because it has low coverage due to the misspecification of the nonlinear g(x)g(x).

5.2 TSCI with invalid IVs

We now demonstrate our TSCI method under the general setting with possibly invalid IVs. In the following, we focus on the continuous treatment and will investigate the performance of TSCI for binary treatment in Section D.5 in the supplement. We generate Xipx+1X^{*}_{i}\in\mathbb{R}^{p_{x}+1} following a multivariate normal distribution with zero mean and covariance matrix Σ\Sigma where Σij=0.5|ij|\Sigma_{ij}=0.5^{|i-j|} for 1i,jpx+11\leq i,j\leq p_{x}+1. With Φ\Phi denoting the standard normal cumulative distribution function, we define Xij=Φ(Xij)X_{ij}=\Phi(X^{*}_{ij}) for 1jpx.1\leq j\leq p_{x}. We generate a continuous IV as Zi=4(Φ(Xi,px+1)0.5)(2,2)Z_{i}=4(\Phi(X^{*}_{i,p_{x}+1})-0.5)\in(-2,2).

We consider the following two conditional mean models for the treatment,

  • Setting B1: f(Zi,Xi)=2512+Zi+Zi3/3+Zi(aj=15Xij)310j=1pXij,f(Z_{i},X_{i})=-\frac{25}{12}+Z_{i}+Z_{i}^{3}/3+Z_{i}\cdot(a\cdot\sum_{j=1}^{5}X_{ij})-\frac{3}{10}\cdot\sum_{j=1}^{p}X_{ij},

  • Setting B2: f(Zi,Xi)=sin(2πZi)+32cos(2πZi)+Zi(aj=15Xij)310j=1pXij.f(Z_{i},X_{i})=\sin(2\pi Z_{i})+\frac{3}{2}\cos(2\pi Z_{i})+Z_{i}\cdot(a\cdot\sum_{j=1}^{5}X_{ij})-\frac{3}{10}\cdot\sum_{j=1}^{p}X_{ij}.

The value of the constant aa controls the interaction strength between ZiZ_{i} and the first five variables of XiX_{i}, and when a=0a=0, the interaction term disappears. We vary aa across {0,0.5,1}\{0,0.5,1\} and nn across {1000,3000,5000}\{1000,3000,5000\} to observe the performance.

We consider the outcome model (3) with two forms of g(Zi,Xi)g(Z_{i},X_{i}): (a)Vio=1: g(Zi,Xi)=Zi+1/5j=1pxXijg(Z_{i},X_{i})=Z_{i}+1/5\cdot\sum_{j=1}^{p_{x}}X_{ij}; (b)Vio=2: g(Zi,Xi)=Zi+Zi21+1/5j=1pxXijg(Z_{i},X_{i})=Z_{i}+Z_{i}^{2}-1+1/5\cdot\sum_{j=1}^{p_{x}}X_{ij}. We set the errors {(δi,ϵi)}1in\{(\delta_{i},\epsilon_{i})^{\intercal}\}_{1\leq i\leq n} as heteroscedastic following Bekker and Crudu (2015): for 1in,1\leq i\leq n, generate δiN(0,Zi2+0.25)\delta_{i}\sim N(0,Z_{i}^{2}+0.25) and ϵi=0.6δi+[10.62]/[0.864+1.380722](1.38072τ1,i+0.862τ2,i),\epsilon_{i}=0.6\delta_{i}+\sqrt{{[1-0.6^{2}]}/[0.86^{4}+1.38072^{2}]}(1.38072\cdot\tau_{1,i}+0.86^{2}\cdot\tau_{2,i}), where conditioning on ZiZ_{i}, τ1,i\tau_{1,i} and τ2,i\tau_{2,i} are generated to be independent of δi\delta_{i}, with τ1,iN(0,Zi2+0.25)\tau_{1,i}\sim N(0,Z_{i}^{2}+0.25) and τ2,iN(0,1).\tau_{2,i}\sim N(0,1).

We shall implement TSCI with random forests as detailed in Algorithm 1. Since the IV is possibly invalid, we consider four possible basis sets {𝒱q}0q3\{\mathcal{V}_{q}\}_{0\leq q\leq 3} to approximate g(z,x)g(z,x), where 𝒱0={𝐰(x)}\mathcal{V}_{0}=\{\overrightarrow{\bf w}(x)\} and 𝒱q={z,,zq,𝐰(x)}\mathcal{V}_{q}=\{z,\cdots,z^{q},\overrightarrow{\bf w}(x)\} for 1q31\leq q\leq 3 with 𝐰(x)={1,x1,,xpx}\overrightarrow{\bf w}(x)=\{1,x_{1},\cdots,x_{p_{x}}\}. Our proposed TSCI is designed to choose the best 𝒱q\mathcal{V}_{q}.
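To illustrate how the candidate violation bases are formed, the following R sketch builds the matrices corresponding to {𝒱_q}_{0≤q≤3} from the IV Z and the covariate matrix X; the function name build_V is purely illustrative.

```r
# Build the basis matrix for V_q: w(x) = {1, x_1, ..., x_px}, plus z, ..., z^q.
build_V <- function(Z, X, q) {
  W <- cbind(1, X)
  if (q == 0) return(W)
  cbind(sapply(1:q, function(k) Z^k), W)
}
# Example: V_2 allows for a quadratic violation form in z.
# V2 <- build_V(Z, X, q = 2)
```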

As “naive” benchmarks, we compare TSCI with three different random forests based methods in the oracle setting, where the best 𝒱q\mathcal{V}_{q} is known a priori. In particular, RF-Init denotes the TSCI estimator without bias correction; RF-Full implements TSCI without data splitting; and RF-Plug directly plugs the ML-fitted treatment into the outcome model, as a naive generalization of TSLS. We give the detailed expressions of these three estimators and their CIs in Section A.2 in the supplement.

| vio | a | n | TSCI-RF Oracle | TSCI-RF Comp | TSCI-RF Robust | Invalidity | q^c=0 | q^c=1 | q^c=2 | q^c=3 | TSLS | RF-Init | RF-Plug | RF-Full |
| 1 | 0.0 | 1000 | 0.91 | 0.01 | 0.01 | 0.01 | 0.99 | 0.01 | 0.00 | 0.00 | 0.00 | 0.80 | 0.38 | 0.01 |
| 1 | 0.0 | 3000 | 0.92 | 0.92 | 0.92 | 1.00 | 0.00 | 0.84 | 0.16 | 0.00 | 0.00 | 0.91 | 0.64 | 0.00 |
| 1 | 0.0 | 5000 | 0.91 | 0.92 | 0.93 | 1.00 | 0.00 | 0.85 | 0.15 | 0.00 | 0.00 | 0.89 | 0.74 | 0.00 |
| 1 | 0.5 | 1000 | 0.91 | 0.23 | 0.25 | 0.24 | 0.76 | 0.24 | 0.01 | 0.00 | 0.00 | 0.84 | 0.56 | 0.02 |
| 1 | 0.5 | 3000 | 0.95 | 0.94 | 0.94 | 1.00 | 0.00 | 0.97 | 0.02 | 0.01 | 0.00 | 0.91 | 0.43 | 0.00 |
| 1 | 0.5 | 5000 | 0.92 | 0.92 | 0.91 | 1.00 | 0.00 | 0.98 | 0.01 | 0.01 | 0.00 | 0.88 | 0.09 | 0.00 |
| 1 | 1.0 | 1000 | 0.96 | 0.92 | 0.92 | 0.95 | 0.05 | 0.93 | 0.01 | 0.00 | 0.00 | 0.91 | 0.52 | 0.08 |
| 1 | 1.0 | 3000 | 0.94 | 0.94 | 0.95 | 1.00 | 0.00 | 0.99 | 0.01 | 0.00 | 0.00 | 0.92 | 0.00 | 0.00 |
| 1 | 1.0 | 5000 | 0.94 | 0.94 | 0.94 | 1.00 | 0.00 | 0.98 | 0.02 | 0.01 | 0.00 | 0.92 | 0.00 | 0.00 |
Table 1: Coverage for Setting B1 with Vio=1. The columns indexed with “TSCI-RF” correspond to our proposed TSCI with random forests, where “Oracle”, “Comp”, and “Robust” correspond to the estimators with 𝒱q\mathcal{V}_{q} selected by the oracle knowledge, the comparison method, and the robust method, respectively. The column “Invalidity” reports the proportion of detecting the proposed IV as invalid. The columns q^c=0 to q^c=3 report the proportions of the basis sets 𝒱q\mathcal{V}_{q} for 0q30\leq q\leq 3 selected by TSCI-RF. The column “TSLS” corresponds to the TSLS estimator. The columns “RF-Init”, “RF-Plug”, and “RF-Full” correspond to the RF estimator without bias correction, the plug-in RF estimator, and the RF estimator without data splitting, respectively, all with the oracle knowledge of the best 𝒱q\mathcal{V}_{q}.

In Table 1, we compare the coverage properties of our proposed TSCI, TSLS, and the above three random forests based methods with Vio=1 in the outcome model. We observe that TSLS fails to attain the desired 95% coverage probability due to the invalid IV; furthermore, RF-Full and RF-Plug have coverage far below 95%, and RF-Init has slightly lower coverage due to the finite-sample bias. In contrast, our proposed TSCI achieves the desired coverage with a relatively strong interaction or a large sample size, mainly due to its correct selection of 𝒱q\mathcal{V}_{q} in those settings. In some settings with a small interaction and a small sample size (such as a=0a=0, n=1000n=1000), TSCI fails to select the correct basis set, and the coverage falls below 95%. For the outcome model with Vio=2, TSCI achieves the desired coverage as well with a relatively strong interaction or a large sample size; see Table S2 in the supplement.

For Setting B1, we further report the absolute bias and CI length in Tables S3 and S4 in the supplement, respectively. In Section D.4, we consider a binary IV setting, denoted as B3, and observe that Settings B2 and B3 exhibit patterns similar to that of B1. Setting B2 is easier than B1 in the sense that the generalized IV strength remains relatively large even after adjusting for quadratic violation forms in the basis set 𝒱2\mathcal{V}_{2}.

5.3 TSCI with various nonlinearity levels

In the following, we explore the performance of our proposed TSCI when the identification condition (Condition 1) fails to hold, that is, when the conditional mean model f(z,x)f(z,x) is not more complex than g(z,x).g(z,x). To approximate such a regime, we consider the outcome model Yi=Di/2+Zi+j=1pxXi,j2cos(πHi/4)+ei,2Y_{i}=D_{i}/2+Z_{i}+\sum_{j=1}^{p_{x}}X_{i,j}^{2}-\cos(\pi H_{i}/4)+e_{i,2} and the treatment model Di=f(Zi,Xi)Hi/2+ei,1D_{i}=f(Z_{i},X_{i})-H_{i}/2+e_{i,1} with different specifications of ff detailed in Table 2, where ei,1e_{i,1} and ei,2e_{i,2} are independent random noises following the standard normal distribution. In such a generating model, when aa gets close to zero, f(z,x)f(z,x) becomes a linear function in zz, and Condition 1 fails since g(z,x)g(z,x) is also linear in z.z.

| Settings | Distribution of ZiZ_{i} and XiX_{i} | Treatment model |
| C1 | Same as Setting B1 in Section 5.2 | f(Z_{i},X_{i})=Z_{i}+1/2\cdot Z_{i}\cdot(a\cdot\sum_{j=1}^{p_{x}}X_{i,j})-25/12-3/10\cdot\sum_{j=1}^{p_{x}}X_{i,j} |
| C2 | Same as Setting S3 in Section 5.1 | |
| C3 | Same as Setting B1 in Section 5.2 | f(Z_{i},X_{i})=Z_{i}+a\cdot(\sin(2\pi Z_{i})+3/2\cdot\cos(2\pi Z_{i}))-3/10\cdot\sum_{j=1}^{p_{x}}X_{i,j} |
Table 2: Different simulation settings, where f(z,x)f(z,x) becomes a linear function of zz with a0.a\rightarrow 0.

We fix n=3000n=3000 and px=5p_{x}=5, generate ZiZ_{i} and XiX_{i} as detailed in Table 2, and generate HiH_{i} as HiXiN(2sin(Xi,1),1).H_{i}\mid X_{i}\sim N(2\cdot\sin(X_{i,1}),1). We implement TSCI detailed in Algorithm 1 by specifying {𝒱q}0q3\{\mathcal{V}_{q}\}_{0\leq q\leq 3} as 𝒱q={z,,zq,𝐛(x)}\mathcal{V}_{q}=\{z,\dots,z^{q},\overrightarrow{\bf b}(x)\} with 𝐛(x)={1,𝐛1(x1),,𝐛px(xpx)}\overrightarrow{\bf b}(x)=\{1,\overrightarrow{\bf b}_{1}(x_{1}),\cdots,\overrightarrow{\bf b}_{p_{x}}(x_{p_{x}})\}, where 𝐛j(xj)\overrightarrow{\bf b}_{j}(x_{j}) denotes the B-spline basis functions defined in Section 5.1.

In Figure 3, we compare TSCI to DML and TSLS when the treatment conditional mean model changes from linear (a=0a=0) to nonlinear (a>0a>0); note that a larger value of aa also increases the generalized IV strength. We vary the value of aa in {0,0.1,0.2,,1.2}\{0,0.1,0.2,\cdots,1.2\}, where larger values of aa represent a more nonlinear dependence between DiD_{i} and ZiZ_{i}. When a=0a=0, Condition 1 fails to hold, and hence TSCI cannot identify the treatment effect when there are invalid IVs. In such settings, TSCI has similar performance to DML and TSLS. However, when aa gets slightly larger than 0, TSCI has a better RMSE than DML and TSLS, especially for Settings C2 and C3. When aa is sufficiently large (e.g., a0.1a\geq 0.1 in Setting C2), TSCI detects the invalid IV and achieves the desired coverage. However, DML and TSLS fail to achieve coverage since they are designed for valid IV settings.

Figure 3: Comparison of TSCI, DML and TSLS in terms of RMSE, CI coverage, and length in Settings C1, C2 and C3, where aa controls the nonlinearity level in the treatment model. The stacked bar charts show the basis selection of TSCI, where q=1q=1 corresponds to the correct selection.

We shall point out that, in Setting C2, the confidence intervals by TSCI are shorter than those of DML, even after adjusting for the linear invalid IV forms. This happens since the remaining nonlinear IV strength after adjusting for the invalidity form is even larger than the IV strength due to the linear association.

6 Real data analysis

We revisit the question of the effect of education on income (Card, 1993). We follow Card (1993) and analyze the same data set from the National Longitudinal Survey of Young Men. The outcome is the log wage (lwage), and the treatment is the years of schooling (educ). As argued in Card (1993), there are various reasons why the treatment is endogenous. For example, the unobserved confounder “ability bias” may affect both the schooling years and wages, leading to the OLS estimator having a positive bias. Following Card (1993), we use the indicator for a nearby 4-year college in 1966 (nearc4) as the IV and include the following covariates: a quadratic function of potential experience (exper and expersq), a race indicator (black), dummy variables for residence in a standard metropolitan statistical area in 1976 (smsa) and in 1966 (smsa66), a dummy variable for residence in the south in 1976 (south), and a full set of 8 regional dummy variables (reg1 to reg8). The data set consists of n=3010n=3010 observations and is made available by the R package ivmodel (Kang et al., 2021).

We implement random forests for the treatment model and report the variable importance in Figure S4 in the supplement, where the IV nearc4 ranks seventh in terms of variable importance, after the covariates exper, expersq, black, south, smsa, and smsa66.

We allow for different IV violation forms by approximating g(z,x)g(z,x) with various basis functions {𝒱q}q=0,1,2\{\mathcal{V}_{q}\}_{q=0,1,2} detailed in Table 3. Particularly, since 𝒱0\mathcal{V}_{0} does not involve the IV nearc4, 𝒱0\mathcal{V}_{0} corresponds to the valid IV setting as assumed in the analysis of Card (1993); moreover, 𝒱1\mathcal{V}_{1} and 𝒱2\mathcal{V}_{2} correspond to invalid IV settings by allowing the IV nearc4 to affect the outcome directly or interactively with the baseline covariates. The main difference is that 𝒱1\mathcal{V}_{1} includes the interactions with the six most important covariates while 𝒱2\mathcal{V}_{2} includes the interactions with all fourteen covariates.

𝒱0\mathcal{V}_{0} {exper,expersq,black,south,smsa,smsa66,reg1,reg2,reg3,reg4,reg5,reg6,reg7,reg8}\{\texttt{exper,expersq,black,south,smsa,smsa66,}\texttt{reg1,reg2,reg3,reg4,reg5,reg6,reg7,reg8}\}
𝒱1\mathcal{V}_{1} 𝒱0nearc4{1,exper,expersq,black,south,smsa,smsa66}\mathcal{V}_{0}\cup\texttt{nearc4}\cdot\{1,\texttt{exper},\texttt{expersq},\texttt{black},\texttt{south},\texttt{smsa},\texttt{smsa66}\}
𝒱2\mathcal{V}_{2} 𝒱0nearc4{1,exper,expersq,black,south,smsa,smsa66,reg1,reg2,reg3,reg4,reg5,reg6,reg7,reg8}\mathcal{V}_{0}\cup\texttt{nearc4}\cdot\{1,\texttt{exper},\texttt{expersq},\texttt{black},\texttt{south},\texttt{smsa},\texttt{smsa66},\texttt{reg1},\texttt{reg2},\texttt{reg3},\texttt{reg4},\texttt{reg5},\texttt{reg6},\texttt{reg7},\texttt{reg8}\}
Table 3: Definitions of 𝒱0,𝒱1\mathcal{V}_{0},\mathcal{V}_{1} and 𝒱2\mathcal{V}_{2}, which are used to approximate g()g(\cdot).
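A sketch of constructing V_0 and V_1 from Table 3 in R is shown below; it assumes the Card data are available as card.data in the ivmodel package and that the column names follow the paper's notation (the regional dummies in particular may be named differently in the package).

```r
# Construct the violation bases V_0 and V_1 of Table 3 (column names assumed).
library(ivmodel)
data(card.data)
covs <- c("exper", "expersq", "black", "south", "smsa", "smsa66",
          paste0("reg", 1:8))                 # names follow the paper; may differ
V0 <- as.matrix(card.data[, covs])
top6 <- c("exper", "expersq", "black", "south", "smsa", "smsa66")
V1 <- cbind(V0, card.data$nearc4 * cbind(1, as.matrix(card.data[, top6])))
```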

We implement Algorithm 1 to choose the best among {𝒱0,𝒱1,𝒱2},\{\mathcal{V}_{0},\mathcal{V}_{1},\mathcal{V}_{2}\}, and construct the TSCI estimator. Since the TSCI estimates depend on the specific sample splitting, we report 500 TSCI estimates obtained from 500 different splittings. In the leftmost panel of Figure 4, we compare estimates of OLS, TSLS, and TSCI, the last of which uses 𝒱q^c\mathcal{V}_{\widehat{q}_{c}} selected by Algorithm 1. The median of these 500 TSCI estimates is 0.06040.0604, 87.2% of these 500 estimates are smaller than the OLS estimate (0.0747), and 100% of the TSCI estimates are smaller than the TSLS estimate (0.1315). In contrast to the TSLS estimator, the TSCI estimators tend to be smaller than the OLS estimator, which helps correct the positive “ability bias”.

Figure 4: The leftmost panel reports the histogram of the TSCI (random forests) estimates with the comparison method, where the estimates differ due to the randomness of the 500 realized sample splittings; the solid red line corresponds to the median of the TSCI estimates, while the solid and dashed black lines correspond to the TSLS and OLS estimates, respectively. The middle panel displays the histogram of the generalized IV strength (after adjusting for 𝒱q^c\mathcal{V}_{\widehat{q}_{c}}) over the 500 realized sample splittings; the solid red line denotes the median of the IV strength for TSCI while the solid black line denotes the IV strength of TSLS. The rightmost panel compares the confidence intervals (CIs) produced by OLS, TSLS, DML, our proposed TSCI with 𝒱q^c\mathcal{V}_{\widehat{q}_{c}}, and TSCI assuming a valid IV. The CIs of TSCI are adjusted by the multi-splitting method to account for the finite-sample variability; see Section A.4 in the supplement.

In the rightmost panel of Figure 4, we compare the different confidence intervals (CIs). The TSLS CI is (0.0238,0.2393)(0.0238,0.2393), and the DML CI is (0.0061,0.1907)(0.0061,0.1907), both of which are much wider than the CIs by TSCI. These wide intervals result from the relatively weak IV, because TSLS and DML only exploit the linear association between the IV and the treatment. The CI by TSCI with random forests varies with the specific sample splitting. We follow Meinshausen et al. (2009) and implement the multi-split method to adjust for the finite-sample variation due to sample splitting; see Section A.4 in the supplement. The multi-split CI is (0.0294, 0.0914). The CIs based on the TSCI estimator with random forests are pushed to the lower part of the wide CI by TSLS. The CIs by our proposed TSCI in Algorithm 1 tend to shift down in comparison to the TSCI assuming a valid IV.

We shall point out that the IV strengths for the TSCI estimators, even after adjusting for the selected basis functions, are typically much larger than that of TSLS (whose concentration parameter is 13.3313.33), as illustrated in the middle panel of Figure 4. This happens since the first-stage ML captures much more of the association than the linear model in TSLS, even after adjusting for the possible IV violations captured by 𝒱q^c\mathcal{V}_{\widehat{q}_{c}}.

Among the 500 sample splittings, the proportions of choosing 𝒱0,𝒱1,\mathcal{V}_{0},\mathcal{V}_{1}, and 𝒱2\mathcal{V}_{2} are 59.2%59.2\%, 38.2%38.2\%, and 2.6%2.6\%, respectively; around 41%41\% of the splittings report nearc4 as an invalid IV. Importantly, 93.6%93.6\% of the 500 splittings report Qmax>q^cQ_{\text{max}}>\widehat{q}_{c}, which indicates that β^𝒱q^c\widehat{\beta}_{\mathcal{V}_{\widehat{q}_{c}}} is not statistically different from β^𝒱Qmax\widehat{\beta}_{\mathcal{V}_{Q_{\text{max}}}} produced by the largest basis set 𝒱Qmax\mathcal{V}_{Q_{\text{max}}}. This suggests that 𝒱q^c\mathcal{V}_{\widehat{q}_{c}} already provides a reasonably good approximation of g(Zi,Xi)g(Z_{i},X_{i}) in the outcome model. In Section E.1 of the supplement, we consider alternative choices of 𝒱2\mathcal{V}_{2} and observe that the results are consistent with those reported above. In Section E.2, we further propose a falsification argument regarding Condition 1 for the data analysis.

7 Conclusion and discussion

We integrate modern ML algorithms into the framework of instrumental variable analysis. We devise a novel TSCI methodology that provides reliable causal conclusions under a wide range of violation forms. Our proposed generalized IV strength measure helps to understand when our proposed method is reliable and supports the selection of the basis functions used to approximate the violation form of possibly invalid IVs. The current methodology focuses on inference for a linear and constant treatment effect. An interesting future research direction is inference for heterogeneous treatment effects (Athey and Imbens, 2016; Wager and Athey, 2018) in the presence of potentially invalid instruments.

Acknowledgement

We acknowledge valuable suggestions from Xu Cheng, Tirthankar Dasgupta, Qingliang Fan, Michal Kolesár, Greg Lewis, Yuan Liao, Molei Liu, Zhonghua Liu, Zuofeng Shang, and Frank Windmeijer. Z. Guo gratefully acknowledges financial support from NSF DMS 1811857, 2015373, NIH R01GM140463, and financial support for visiting the Institute of Mathematical Research (FIM) at ETH Zurich. P. Bühlmann received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (grant agreement No. 786461)

References

  • Amemiya (1974) Amemiya, T. (1974). The nonlinear two-stage least-squares estimator. J. Econom. 2(2), 105–110.
  • Angrist et al. (1999) Angrist, J. D., G. W. Imbens, and A. B. Krueger (1999). Jackknife instrumental variables estimation. J. Appl. Econ. 14(1), 57–67.
  • Angrist and Lavy (1999) Angrist, J. D. and V. Lavy (1999). Using Maimonides’ rule to estimate the effect of class size on scholastic achievement. Q. J. Econ. 114(2), 533–575.
  • Angrist and Pischke (2009) Angrist, J. D. and J.-S. Pischke (2009). Mostly harmless econometrics: An empiricist’s companion. Princeton university press.
  • Athey and Imbens (2016) Athey, S. and G. Imbens (2016). Recursive partitioning for heterogeneous causal effects. Proc. Natl. Acad. Sci. 113(27), 7353–7360.
  • Athey et al. (2019) Athey, S., J. Tibshirani, and S. Wager (2019). Generalized random forests. Ann. Stat. 47(2), 1148–1178.
  • Bach et al. (2021) Bach, P., V. Chernozhukov, M. S. Kurz, and M. Spindler (2021). DoubleML – An object-oriented implementation of double machine learning in R. arXiv:2103.09603 [stat.ML].
  • Bekker and Crudu (2015) Bekker, P. A. and F. Crudu (2015). Jackknife instrumental variable estimation with heteroskedasticity. J. Econom. 185(2), 332–342.
  • Belloni et al. (2012) Belloni, A., D. Chen, V. Chernozhukov, and C. Hansen (2012). Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica 80(6), 2369–2429.
  • Bowden et al. (2015) Bowden, J., G. Davey Smith, and S. Burgess (2015). Mendelian randomization with invalid instruments: effect estimation and bias detection through Egger regression. Int. J. Epidemiol. 44(2), 512–525.
  • Bowden et al. (2016) Bowden, J., G. Davey Smith, P. C. Haycock, and S. Burgess (2016). Consistent estimation in Mendelian randomization with some invalid instruments using a weighted median estimator. Genet. Epidemiol. 40(4), 304–314.
  • Card (1993) Card, D. (1993). Using geographic variation in college proximity to estimate the return to schooling. Natl. Bur. Econ. Res. Camb, Mass., USA.
  • Chen et al. (2023) Chen, J., D. L. Chen, and G. Lewis (2023). Mostly harmless machine learning: learning optimal instruments in linear IV models. J. Mach. Learn. Res., forthcoming.
  • Chernozhukov et al. (2018) Chernozhukov, V., D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and J. Robins (2018). Double/debiased machine learning for treatment and structural parameters. J. Econom. 21(1).
  • Emmenegger (2021) Emmenegger, C. (2021). dmlalg: Double machine learning algorithms. R-package available on CRAN.
  • Emmenegger and Bühlmann (2021) Emmenegger, C. and P. Bühlmann (2021). Regularizing double machine learning in partially linear endogenous models. Electron. J. Stat. 15(2), 6461–6543.
  • Guo (2023) Guo, Z. (2023). Causal inference with invalid instruments: post-selection problems and a solution using searching and sampling. J. R. Stat. Soc. Ser. B. 85(3), 959–985.
  • Guo et al. (2018) Guo, Z., H. Kang, T. Tony Cai, and D. S. Small (2018). Confidence intervals for causal effects with invalid instruments by using two-stage hard thresholding with voting. J. R. Statist. Soc. B 80(4), 793–815.
  • Han (2008) Han, C. (2008). Detecting invalid instruments using L1-GMM. Econ. Lett. 101(3), 285–287.
  • Hansen et al. (2008) Hansen, C., J. Hausman, and W. Newey (2008). Estimation with many instrumental variables. J. Bus. Econ. Stat. 26(4), 398–422.
  • Hansen (1982) Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators. Econometrica, 1029–1054.
  • Hausman et al. (2012) Hausman, J. A., W. K. Newey, T. Woutersen, J. C. Chao, and N. R. Swanson (2012). Instrumental variable estimation with heteroskedasticity and many instruments. Quant. Econom. 3(2), 211–255.
  • Heckman (1976) Heckman, J. J. (1976). The common structure of statistical models of truncation, sample selection and limited dependent variables and a simple estimator for such models. In Annals of economic and social measurement, volume 5, number 4, pp.  475–492. NBER.
  • Holland (1988) Holland, P. W. (1988). Causal inference, path analysis and recursive structural equations models. ETS Res. Rep. Ser. 1988(1), i–50.
  • Kang et al. (2021) Kang, H., Y. Jiang, Q. Zhao, and D. S. Small (2021). ivmodel: An R package for inference and sensitivity analysis of instrumental variables models with one endogenous variable. Obs. Stud. 7(2), 1–24.
  • Kang et al. (2016) Kang, H., A. Zhang, T. T. Cai, and D. S. Small (2016). Instrumental variables estimation with some invalid instruments and its application to Mendelian randomization. J. Am. Stat. Assoc. 111(513), 132–144.
  • Kelejian (1971) Kelejian, H. H. (1971). Two-stage least squares and econometric systems linear in parameters but nonlinear in the endogenous variables. J. Am. Stat. Assoc. 66(334), 373–374.
  • Kolesár et al. (2015) Kolesár, M., R. Chetty, J. Friedman, E. Glaeser, and G. W. Imbens (2015). Identification and inference with many invalid instruments. J. Bus. Econ. Stat. 33(4), 474–484.
  • Lewbel (2012) Lewbel, A. (2012). Using heteroscedasticity to identify and estimate mismeasured and endogenous regressor models. J. Bus. Econ. Stat. 30(1), 67–80.
  • Lewbel (2019) Lewbel, A. (2019). The identification zoo: Meanings of identification in econometrics. J. Econ. Lit. 57(4), 835–903.
  • Lin and Jeon (2006) Lin, Y. and Y. Jeon (2006). Random forests and adaptive nearest neighbors. J. Am. Stat. Assoc. 101(474), 578–590.
  • Liu et al. (2020) Liu, R., Z. Shang, and G. Cheng (2020). On deep instrumental variables estimate. arXiv preprint arXiv:2004.14954.
  • Meinshausen (2006) Meinshausen, N. (2006). Quantile regression forests. J. Mach. Learn. Res. 7(Jun), 983–999.
  • Meinshausen et al. (2009) Meinshausen, N., L. Meier, and P. Bühlmann (2009). P-values for high-dimensional regression. J. Am. Stat. Assoc. 104(488), 1671–1681.
  • Newey (1990) Newey, W. K. (1990). Efficient instrumental variables estimation of nonlinear models. Econometrica, 809–837.
  • Puhani (2000) Puhani, P. (2000). The Heckman correction for sample selection and its critique. J. Econ. Surv. 14(1), 53–68.
  • Rothenberg (1984) Rothenberg, T. J. (1984). Approximating the distributions of econometric estimators and test statistics. Handb. Econom. 2, 881–935.
  • Rubin (1974) Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. J. Educ. Psychol. 66(5), 688.
  • Sargan (1958) Sargan, J. D. (1958). The estimation of economic relationships using instrumental variables. Econometrica, 393–415.
  • Shardell and Ferrucci (2016) Shardell, M. and L. Ferrucci (2016). Instrumental variable analysis of multiplicative models with potentially invalid instruments. Stat. Med. 35(29), 5430–5447.
  • Small (2007) Small, D. S. (2007). Sensitivity analysis for instrumental variables regression with overidentifying restrictions. J. Am. Stat. Assoc. 102(479), 1049–1058.
  • Splawa-Neyman et al. (1990) Splawa-Neyman, J., D. M. Dabrowska, and T. Speed (1990). On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Stat. Sci., 465–472.
  • Staiger and Stock (1997) Staiger, D. and J. H. Stock (1997). Instrumental variables regression with weak instruments. Econometrica 65(3), 557–586.
  • Stock et al. (2002) Stock, J. H., J. H. Wright, and M. Yogo (2002). A survey of weak instruments and weak identification in generalized method of moments. J. Bus. Econ. Stat. 20(4), 518–529.
  • Sun et al. (2023) Sun, B., Z. Liu, and E. J. Tchetgen Tchetgen (2023, 02). Semiparametric efficient G-estimation with invalid instrumental variables. Biometrika 110(4), 953–971.
  • Tchetgen et al. (2021) Tchetgen, E. T., B. Sun, and S. Walter (2021). The genius approach to robust mendelian randomization inference. Stat. Sci. 36(3), 443–464.
  • Ten Have et al. (2007) Ten Have, T. R., M. M. Joffe, K. G. Lynch, G. K. Brown, S. A. Maisto, and A. T. Beck (2007). Causal mediation analyses with rank preserving models. Biometrics 63(3), 926–934.
  • Wager and Athey (2018) Wager, S. and S. Athey (2018). Estimation and inference of heterogeneous treatment effects using random forests. J. Am. Stat. Assoc. 113(523), 1228–1242.
  • Windmeijer et al. (2019) Windmeijer, F., H. Farbmacher, N. Davies, and G. Davey Smith (2019). On the use of the lasso for instrumental variables estimation with some invalid instruments. J. Am. Stat. Assoc. 114(527), 1339–1350.
  • Windmeijer et al. (2021) Windmeijer, F., X. Liang, F. P. Hartwig, and J. Bowden (2021). The confidence interval method for selecting valid instrumental variables. J. R. Statist. Soc. B 83(4), 752–776.
  • Woutersen and Hausman (2019) Woutersen, T. and J. A. Hausman (2019). Increasing the power of specification tests. J. Econom. 211(1), 166–175.

Appendix A Additional Methods and Theories

A.1 Identification in random experiments with non-compliance and invalid IVs

In the following, we explain how a nonlinear treatment model helps identify the causal effect in randomized experiments with non-compliance and possibly invalid IVs. For the i-th subject, we use Z_{i}\in\{0,1\} to denote the random treatment assignment (which serves as an instrument) and D_{i}\in\{0,1\} to denote whether the subject actually takes the treatment or not. We use X_{i}\in\{0,1\} to denote a binary covariate and Y^{(z,d)}_{i}(x) to denote the potential outcome for a subject with the assignment z, the treatment value d, and the covariate x.

A primary concern is that Z_{i} may be an invalid IV even if it is randomly assigned. To facilitate the discussion, we follow \citetsuppimbens2014instrumental and adopt the example from \citetsuppmcdonald1992effects studying the effect of influenza vaccination on being hospitalized for flu-related reasons. The researchers randomly sent letters to physicians encouraging them to vaccinate their patients. In this example, Z_{i} denotes the indicator for the physician receiving the encouragement letter, while D_{i} stands for the patient being vaccinated or not. The outcome Y_{i} denotes whether the patient is hospitalized for a flu-related reason. As pointed out in \citetsuppimbens2014instrumental, the physician receiving the letter may “take actions that affect the likelihood of the patient getting infected with the flu other than simply administering the flu vaccine”, indicating that Z_{i} may have a direct effect on the outcome and violate the classical IV assumptions.

For a covariate x, we define the intention-to-treat (ITT) effect on the outcome as {\rm ITT}_{Y}(x)={\mathbf{E}}(Y_{i}\mid Z_{i}=1,X_{i}=x)-{\mathbf{E}}(Y_{i}\mid Z_{i}=0,X_{i}=x) and the ITT effect on the treatment as {\rm ITT}_{D}(x)={\mathbf{E}}(D_{i}\mid Z_{i}=1,X_{i}=x)-{\mathbf{E}}(D_{i}\mid Z_{i}=0,X_{i}=x). In the following proposition, we assume a constant effect and demonstrate identification of the treatment effect even if the IV Z_{i} directly affects the outcome (violating assumption (A3)).

Proposition 2

Suppose that (I1) {\rm ITT}_{D}(x=1)-{\rm ITT}_{D}(x=0)\neq 0, (I2) Z_{i} is randomly assigned, (I3) Y^{(z^{\prime},d)}_{i}(x)-Y^{(z,d)}_{i}(x)=(z^{\prime}-z)\pi for x\in\{0,1\}, and (I4) Y^{(z,d^{\prime})}_{i}(x)-Y^{(z,d)}_{i}(x)=(d^{\prime}-d)\beta. Then we identify \beta as \frac{{\rm ITT}_{Y}(x=1)-{\rm ITT}_{Y}(x=0)}{{\rm ITT}_{D}(x=1)-{\rm ITT}_{D}(x=0)}.

The proof of the above proposition is presented in Section B.1. Condition (I4) assumes a constant effect while (I1)-(I3) are relaxations of the classical IV assumptions (A1)-(A3), respectively. In particular, the random assignment required by (I2) implies (A2), and (I2) is satisfied in randomized experiments with non-compliance. Condition (I3) allows the IV to affect the outcome directly but assumes that the IV’s direct effect does not interact with the baseline covariate X_{i}. When there is a direct effect, the identification of \beta relies on Condition (I1). Condition (I1) essentially requires that the treatment assignment is interactively determined by Z_{i} and X_{i}: a special example is given by D_{i}={\bf 1}\left(\gamma_{0}+\gamma_{z}Z_{i}+\gamma_{x}X_{i}+\gamma Z_{i}\cdot X_{i}+\delta_{i}>0\right), where \delta_{i} denotes a random variable independent of Z_{i} and X_{i}. Such identification conditions have been proposed in Shardell and Ferrucci (2016); Ten Have et al. (2007). In particular, Shardell and Ferrucci (2016) pointed out that {\rm ITT}_{D}(x=1)-{\rm ITT}_{D}(x=0) is referred to as the compliance score, and Condition (I1) requires that the compliance score changes with the covariate value x.

We now discuss the plausibility of Condition (I1). We again use the vaccination example from \citetsuppmcdonald1992effects and imagine that we have access to the covariate X_{i}, an indicator for whether the patient took regular flu shots in the past few years. Patients who regularly take flu shots (with X_{i}=1) are more likely to follow the physician’s encouragement than those who do not (with X_{i}=0), indicating {\rm ITT}_{D}(x=1)-{\rm ITT}_{D}(x=0)\neq 0. However, regarding the concern that the physician may “take actions that affect the likelihood of the patient getting infected with the flu other than simply administering the flu vaccine” \citepsuppimbens2014instrumental, such a direct effect of Z_{i} might not interact with X_{i}.
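To illustrate Proposition 2 numerically, the following minimal simulation sketch (our own illustration; the data-generating process and all parameter values are hypothetical and not taken from the paper) generates a randomly assigned but invalid IV with a direct effect \pi on the outcome and checks that the ratio of ITT differences recovers \beta, while the usual Wald ratio does not.

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta, pi = 1_000_000, 0.5, 0.3        # hypothetical sample size, treatment effect, direct effect

X = rng.binomial(1, 0.5, n)              # binary baseline covariate
Z = rng.binomial(1, 0.5, n)              # randomly assigned, but invalid, instrument
U = rng.normal(size=n)                   # unmeasured confounder
# Compliance depends on Z*X, so ITT_D(x=1) differs from ITT_D(x=0) (Condition (I1))
D = (0.2 + 1.5 * Z * X + 0.5 * X + U + rng.normal(size=n) > 0).astype(float)
Y = beta * D + pi * Z + U + rng.normal(size=n)   # direct effect pi * Z makes Z invalid

def itt(v, x):
    """Sample analogue of E(v | Z=1, X=x) - E(v | Z=0, X=x)."""
    return v[(Z == 1) & (X == x)].mean() - v[(Z == 0) & (X == x)].mean()

beta_prop2 = (itt(Y, 1) - itt(Y, 0)) / (itt(D, 1) - itt(D, 0))   # Proposition 2 ratio
beta_wald = itt(Y, 1) / itt(D, 1)                                # usual Wald ratio, ignores pi
print(beta_prop2, beta_wald)   # roughly 0.5 versus a value inflated by pi / ITT_D(x=1)
```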

A.2 Pitfalls with the naive machine learning

As benchmarks in Section 5.2, we also include the point estimators \widehat{\beta}_{\text{init}}, \widehat{\beta}_{\text{plug}}, and \widehat{\beta}_{\text{full}} together with their corresponding standard errors. We construct the 95% confidence interval as (\widehat{\beta}-1.96\cdot\widehat{\rm SE},\widehat{\beta}+1.96\cdot\widehat{\rm SE}).

RF-Init. We compute \widehat{\beta} as \widehat{\beta}_{\rm init} in (16) and \widehat{\sigma}_{\epsilon}(V)=\sqrt{\|P^{\perp}_{V}[Y-D\widehat{\beta}_{\rm init}(V)]\|_{2}^{2}/(n-r)}. We calculate \widehat{\rm SE}={\widehat{\sigma}_{\epsilon}\sqrt{{D}_{\mathcal{A}_{1}}^{\intercal}[\mathbf{M}(V)]^{2}{D}_{\mathcal{A}_{1}}}}/{{D}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){D}_{\mathcal{A}_{1}}}.

RF-Plug and its pitfall. A simple idea of combining TSLS with ML is to directly use \widehat{f}_{\mathcal{A}_{1}} in the second stage. We may calculate such a two-stage estimator as \widehat{\beta}_{\rm plug}(V)={{Y}_{\mathcal{A}_{1}}^{\intercal}P^{\perp}_{{V}_{\mathcal{A}_{1}}}\widehat{f}_{\mathcal{A}_{1}}}/{\widehat{f}_{\mathcal{A}_{1}}^{\intercal}P^{\perp}_{{V}_{\mathcal{A}_{1}}}\widehat{f}_{\mathcal{A}_{1}}}, where the second-stage regression is implemented by regressing Y_{\mathcal{A}_{1}} on \widehat{f}_{\mathcal{A}_{1}}=\Omega D_{\mathcal{A}_{1}} and V_{\mathcal{A}_{1}}. We compute the standard error \widehat{\rm SE}=\sqrt{\frac{\|P^{\perp}_{V}(Y-\widehat{\beta}_{\rm plug}(V)D)\|_{2}^{2}}{|\mathcal{A}_{1}|\cdot\widehat{D}_{\mathcal{A}_{1}}^{\intercal}P^{\perp}_{{V}_{\mathcal{A}_{1}}}\widehat{D}_{\mathcal{A}_{1}}}}. \citetsuppangrist2009mostly[Sec.4.6.1] criticized such use of the nonlinear first-stage model under the name of “forbidden regression” and \citetsuppchen2020mostly pointed out that \widehat{\beta}_{\rm plug}(V) is biased since \Omega in (12) is not a projection matrix for random forests.

RF-Full and its pitfall. Sample splitting is essential for removing the endogeneity of the ML-predicted values. Without sample splitting, the ML-predicted value of the treatment can be close to the treatment itself (due to overfitting) and hence highly correlated with the unmeasured confounders, leading to a TSCI estimator that suffers from a significant bias. In the following, we take the non-split random forests as an example. Similarly to \Omega defined in (12), we define the transformation matrix \Omega^{\rm full} for random forests constructed with the full data. As a modification of (16), we consider the corresponding TSCI estimator,

\widehat{\beta}_{\rm full}(V)={Y^{\intercal}\mathbf{M}^{\rm full}(V){D}}/{{D}^{\intercal}\mathbf{M}^{\rm full}(V){D}}\quad\text{with}\quad\mathbf{M}^{\rm full}(V)=(\Omega^{\rm full})^{\intercal}P^{\perp}_{\widetilde{V}}\Omega^{\rm full}, (38)

where \widetilde{V}=\Omega^{\rm full}V. We calculate its standard error as

\widehat{\rm SE}=\frac{\widehat{\sigma}_{\rm full}(V)\sqrt{{{D}^{\intercal}[\mathbf{M}^{\rm full}(V)]^{2}{D}}}}{{D}^{\intercal}\mathbf{M}^{\rm full}(V){D}}\quad\text{with}\quad\widehat{\sigma}_{\rm full}(V)=\sqrt{\|P^{\perp}_{V}(Y-\widehat{\beta}_{\rm full}(V)D)\|_{2}^{2}/{n}}.

The simulation results in Section 5 show that the estimator β^full(V)\widehat{\beta}_{\rm full}(V) in (38) suffers from a large bias and the resulting confidence interval does not achieve the desired coverage.
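The following stylized sketch (our own illustration with a hypothetical data-generating process; it uses a simplified ratio estimator rather than the exact formulas (16) and (38)) shows the mechanism behind this pitfall: fitting the first-stage random forest on the full sample makes the fitted values correlate with the unmeasured confounder, while fitting on \mathcal{A}_{2} and evaluating on \mathcal{A}_{1} removes this endogeneity.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n, beta = 3000, 0.5
Z = rng.normal(size=(n, 1))                                   # a single valid instrument
U = rng.normal(size=n)                                        # unmeasured confounder
D = np.sin(2 * Z[:, 0]) + Z[:, 0] ** 2 + U + rng.normal(size=n)
Y = beta * D + U + rng.normal(size=n)

def iv_ratio(f_hat, D, Y):
    """Ratio estimator that uses (centered) first-stage fitted values as the instrument."""
    f_hat = f_hat - f_hat.mean()
    return (f_hat @ Y) / (f_hat @ D)

# No splitting: in-sample fitted values overfit D and hence correlate with U.
rf_full = RandomForestRegressor(n_estimators=200, random_state=0).fit(Z, D)
print("full data       :", iv_ratio(rf_full.predict(Z), D, Y))               # noticeably biased

# With splitting: fit on A2, evaluate the fitted values on A1 only.
A1, A2 = np.arange(n) < 2 * n // 3, np.arange(n) >= 2 * n // 3
rf_split = RandomForestRegressor(n_estimators=200, random_state=0).fit(Z[A2], D[A2])
print("sample splitting:", iv_ratio(rf_split.predict(Z[A1]), D[A1], Y[A1]))  # close to 0.5
```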

A.3 Choice of ρ^\widehat{\rho} in (27)

In the following, we present the choice of \widehat{\rho}=\widehat{\rho}(\alpha_{0})>0 used in (27), obtained by the wild bootstrap. For 1\leq i\leq n_{1}, we compute \widetilde{\epsilon}_{i}=[\widehat{\epsilon}(V_{Q_{\max}})]_{i}-\bar{\mu}_{\epsilon}, where \widehat{\epsilon}(V_{Q_{\max}}) is defined in (21) with V=V_{Q_{\max}} and \bar{\mu}_{\epsilon}=\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}[\widehat{\epsilon}(V_{Q_{\max}})]_{i}. We generate \epsilon^{[l]}_{i}=U^{[l]}_{i}\cdot\widetilde{\epsilon}_{i} for 1\leq i\leq n_{1} and 1\leq l\leq L, where \{U^{[l]}_{i}\}_{1\leq i\leq n_{1}} are i.i.d. standard normal random variables. We define

T^{[l]}=\max_{0\leq q<q^{\prime}\leq Q_{\max}}\frac{1}{{\sqrt{\widehat{H}(V_{q},V_{q^{\prime}})}}}\left[\frac{f_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q^{\prime}})\epsilon^{[l]}}{{f}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q^{\prime}})f_{\mathcal{A}_{1}}}-\frac{f_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q})\epsilon^{[l]}}{f_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q})f_{\mathcal{A}_{1}}}\right]\;\text{for}\;1\leq l\leq L.

We set \widehat{\rho}=\widehat{\rho}(\alpha_{0}) to be the upper \alpha_{0} empirical quantile of \{|T^{[l]}|\}_{1\leq l\leq L}, that is,

\widehat{\rho}(\alpha_{0})=\min\left\{\rho\in\mathbb{R}:\frac{1}{L}\sum_{l=1}^{L}{\bf 1}\left(|{T^{[l]}}|\leq\rho\right)\geq 1-\alpha_{0}\right\}, (39)

where \alpha_{0} is set as 0.025 by default.
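A minimal sketch of this wild-bootstrap computation is given below (our own code; the vectors a_q=\mathbf{M}(V_{q})f_{\mathcal{A}_{1}}/(f_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q})f_{\mathcal{A}_{1}}) and the normalizations \widehat{H}(V_{q},V_{q^{\prime}}) are assumed to be precomputed, and in practice f_{\mathcal{A}_{1}} would be replaced by its estimate).

```python
import numpy as np

def rho_hat(eps_hat, a, H_hat, alpha0=0.025, L=1000, seed=0):
    """eps_hat: residual vector under V_{Qmax} (length n1);
       a: list of vectors a_q, one per violation space q = 0, ..., Qmax;
       H_hat: function (q, qp) -> estimated variance of the estimator difference."""
    rng = np.random.default_rng(seed)
    eps_tilde = eps_hat - eps_hat.mean()                  # centered residuals
    Qmax, n1 = len(a) - 1, len(eps_hat)
    T = np.empty(L)
    for l in range(L):
        eps_l = rng.standard_normal(n1) * eps_tilde       # wild-bootstrap perturbation
        T[l] = max((a[qp] - a[q]) @ eps_l / np.sqrt(H_hat(q, qp))
                   for q in range(Qmax + 1) for qp in range(q + 1, Qmax + 1))
    return np.quantile(np.abs(T), 1 - alpha0)             # upper alpha0 quantile of |T^[l]|, as in (39)
```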

A.4 Finite-sample adjustment of uncertainty from data splitting

Even though our asymptotic theory in Section 4 is valid for any random sample splitting, the constructed point estimators and confidence intervals do vary across different sample splits in finite samples. This randomness due to sample splitting has also been observed in double machine learning \citepsuppchernozhukov2018double and multi-splitting \citepsuppmeinshausen2009p. Following \citetsuppchernozhukov2018double and \citetsuppmeinshausen2009p, we introduce a confidence interval which aggregates multiple confidence intervals obtained from different splits. Consider S random splits and, for the s-th split, use \widehat{\beta}^{s} and \widehat{\rm SE}^{s} to denote the corresponding TSCI point estimator and standard error estimator, respectively. Following Section 3.4 of \citetsuppchernozhukov2018double, we introduce the median estimator \widehat{\beta}^{\rm med}={\rm median}\{\widehat{\beta}^{s}\}_{1\leq s\leq S} together with \widehat{\rm SE}^{\rm med}={\rm median}\left\{\sqrt{[\widehat{\rm SE}^{s}]^{2}+(\widehat{\beta}^{s}-\widehat{\beta}^{\rm med})^{2}}\right\}_{1\leq s\leq S}, and construct the median confidence interval as (\widehat{\beta}^{\rm med}-z_{\alpha/2}\widehat{\rm SE}^{\rm med},\widehat{\beta}^{\rm med}+z_{\alpha/2}\widehat{\rm SE}^{\rm med}). Alternatively, following \citetsuppmeinshausen2009p, we construct the p-values p^{s}(\beta_{0})=2(1-\Phi(|\widehat{\beta}^{s}-\beta_{0}|/\widehat{\rm SE}^{s})) for \beta_{0}\in\mathbb{R} and 1\leq s\leq S, where \Phi is the CDF of the standard normal distribution. We define the multi-splitting confidence interval as \{\beta_{0}\in\mathbb{R}:2\cdot{\rm median}\{p^{s}(\beta_{0})\}_{s=1}^{S}\geq\alpha\}. See equation (2.2) of \citetsuppmeinshausen2009p.
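A minimal sketch of the median aggregation step (our own helper function, not part of the paper's software) is as follows.

```python
import numpy as np
from scipy.stats import norm

def median_aggregate(beta_s, se_s, alpha=0.05):
    """beta_s, se_s: per-split TSCI point estimates and standard errors over S splits."""
    beta_s, se_s = np.asarray(beta_s), np.asarray(se_s)
    beta_med = np.median(beta_s)
    # each SE is inflated by the spread of the split-specific estimate around the median
    se_med = np.median(np.sqrt(se_s ** 2 + (beta_s - beta_med) ** 2))
    z = norm.ppf(1 - alpha / 2)
    return beta_med, (beta_med - z * se_med, beta_med + z * se_med)

# Example with hypothetical per-split results:
# beta_med, ci = median_aggregate([0.48, 0.52, 0.50, 0.55], [0.06, 0.05, 0.07, 0.06])
```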

A.5 Choices of Violation Basis Functions for Multiple IVs

When there are multiple IVs, there are more choices for specifying the violation form, and \{\mathcal{V}_{q}\}_{1\leq q\leq Q} are not necessarily nested. For example, when we have two IVs z_{1} and z_{2}, we may set \mathcal{V}_{q_{1},q_{2}}=\{1,z_{1},\cdots,z_{1}^{q_{1}},z_{2},\cdots,z_{2}^{q_{2}}\}, and then \{\mathcal{V}_{q_{1},q_{2}}\}_{0\leq q_{1},q_{2}\leq Q} are not nested. However, even in such a case, our proposed selection method is still applicable. Specifically, for any 0\leq q_{1},q_{2}\leq Q, we compare the estimator generated by \mathcal{V}_{q_{1},q_{2}} to that generated by \mathcal{V}_{q^{\prime}_{1},q^{\prime}_{2}} with q^{\prime}_{1}\geq q_{1} and q^{\prime}_{2}\geq q_{2}.

A.6 TSCI with boosting

In the following, we demonstrate how to express the L_{2} boosting estimator \citepsuppbuhlmann2007boosting,buhlmann2006sparse in the form of (12). Boosting methods aggregate a sequence of base procedures \{\widehat{g}^{[m]}(\cdot)\}_{m\geq 1}. For m\geq 1, we construct the base procedure \widehat{g}^{[m]} using the data in \mathcal{A}_{2} and compute the estimated values given by the m-th base procedure, \widehat{g}^{[m]}_{\mathcal{A}_{1}}=\left(\widehat{g}^{[m]}(X_{1},Z_{1}),\cdots,\widehat{g}^{[m]}(X_{n_{1}},Z_{n_{1}})\right)^{\intercal}. With \widehat{f}^{[0]}_{\mathcal{A}_{1}}=0 and 0<\nu\leq 1 as the step-length factor (the default being \nu=0.1), we conduct the sequential updates,

\widehat{f}^{[m]}_{\mathcal{A}_{1}}=\widehat{f}^{[m-1]}_{\mathcal{A}_{1}}+\nu\widehat{g}^{[m]}_{\mathcal{A}_{1}}\quad\text{with}\quad\widehat{g}^{[m]}_{\mathcal{A}_{1}}=\mathcal{H}^{[m]}(D_{\mathcal{A}_{1}}-\widehat{f}^{[m-1]}_{\mathcal{A}_{1}})\quad\text{for}\quad m\geq 1,

where \mathcal{H}^{[m]}\in\mathbb{R}^{n_{1}\times n_{1}} is a hat matrix determined by the base procedures.

With the stopping time m_{\rm stop}, the L_{2} boosting estimator is \widehat{f}^{\rm boo}=\widehat{f}^{[m_{\rm stop}]}. We now compute the transformation matrix \Omega. Set \Omega^{[0]}=0 and, for m\geq 1, define

\widehat{f}^{[m]}_{\mathcal{A}_{1}}=\Omega^{[m]}D_{\mathcal{A}_{1}}\quad\text{with}\quad\Omega^{[m]}=\nu\mathcal{H}^{[m]}+({\rm I}-\nu\mathcal{H}^{[m]})\Omega^{[m-1]}. (40)

Define \Omega^{\rm boo}=\Omega^{[m_{\rm stop}]} and write \widehat{f}^{\rm boo}=\Omega^{\rm boo}D_{\mathcal{A}_{1}}, which is the desired form in (12).
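The recursion (40) is straightforward to implement; a minimal sketch (our own code) is given below, taking the hat matrices \mathcal{H}^{[1]},\ldots,\mathcal{H}^{[m_{\rm stop}]} as inputs.

```python
import numpy as np

def boosting_omega(H_list, nu=0.1):
    """H_list: hat matrices H^[1], ..., H^[m_stop], each of size n1 x n1."""
    n1 = H_list[0].shape[0]
    Omega = np.zeros((n1, n1))
    for H in H_list:                                   # recursion (40)
        Omega = nu * H + (np.eye(n1) - nu * H) @ Omega
    return Omega                                       # Omega^boo; f_hat^boo = Omega @ D_A1
```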

In the following, we give two examples of the construction of the base procedures and provide the detailed expression for \mathcal{H}^{[m]} used in (40). We write C_{i}=(X_{i}^{\intercal},Z_{i}^{\intercal})^{\intercal}\in\mathbb{R}^{p} and define the matrix C\in\mathbb{R}^{n\times p} with its i-th row given by C_{i}^{\intercal}. An important step in building the base procedures \{\widehat{g}^{[m]}(Z_{i},X_{i})\}_{m\geq 1} is variable selection; that is, each base procedure is constructed based on only a subset of the covariates C_{i}\in\mathbb{R}^{p}.

Pairwise regression.

In Algorithm 2, we describe the construction of \Omega^{\rm boo} with the pairwise regression and the pairwise thin plate as base procedures. For both base procedures, we need to specify how to construct the basis functions in steps 3 and 8. For the pairwise regression, we set the first element of C_{i} as 1 and define S^{j,l}_{i}=C_{ij}C_{il} for 1\leq i\leq n. Then for step 3, we define \mathcal{P}^{j,l} as the projection matrix onto the vector S^{j,l}_{\mathcal{A}_{2}}; for step 8, we define \mathcal{H}^{[m]}=S^{\widehat{j}_{m},\widehat{l}_{m}}_{\mathcal{A}_{1}}(S^{\widehat{j}_{m},\widehat{l}_{m}}_{\mathcal{A}_{1}})^{\intercal}/{\|S^{\widehat{j}_{m},\widehat{l}_{m}}_{\mathcal{A}_{1}}\|_{2}^{2}}.

For the pairwise thin plate, we follow Chapter 7 of \citetsuppgreen1993nonparametric to construct the projection matrix. For 1\leq j<l\leq p, we define the matrix T^{j,l}=\begin{pmatrix}1&1&\cdots&1\\ X_{1,j}&X_{2,j}&\cdots&X_{n,j}\\ X_{1,l}&X_{2,l}&\cdots&X_{n,l}\end{pmatrix}. For 1\leq s\leq n, define E^{j,l}_{s,s}=0; for 1\leq s\neq t\leq n, define

E^{j,l}_{s,t}=\frac{1}{16\pi}\left\|X_{s,(j,l)}-X_{t,(j,l)}\right\|_{2}^{2}\log\left(\left\|X_{s,(j,l)}-X_{t,(j,l)}\right\|_{2}^{2}\right).

Then we define \bar{\mathcal{P}}^{j,l}=\begin{pmatrix}E^{j,l}_{\mathcal{A}_{2},\mathcal{A}_{2}}&(T^{j,l}_{\cdot,\mathcal{A}_{2}})^{\intercal}\end{pmatrix}\begin{pmatrix}E^{j,l}_{\mathcal{A}_{2},\mathcal{A}_{2}}&(T^{j,l}_{\cdot,\mathcal{A}_{2}})^{\intercal}\\ T^{j,l}_{\cdot,\mathcal{A}_{2}}&0\end{pmatrix}^{-1}\in\mathbb{R}^{n_{2}\times(n_{2}+3)}. In step 3, we compute \mathcal{P}^{j,l}\in\mathbb{R}^{n_{2}\times n_{2}} as the first n_{2} columns of \bar{\mathcal{P}}^{j,l}. For step 8, we compute

\bar{\mathcal{H}}^{j,l}=\begin{pmatrix}E^{j,l}_{\mathcal{A}_{1},\mathcal{A}_{1}}&(T^{j,l}_{\cdot,\mathcal{A}_{1}})^{\intercal}\end{pmatrix}\begin{pmatrix}E^{j,l}_{\mathcal{A}_{1},\mathcal{A}_{1}}&(T^{j,l}_{\cdot,\mathcal{A}_{1}})^{\intercal}\\ T^{j,l}_{\cdot,\mathcal{A}_{1}}&0\end{pmatrix}^{-1}\in\mathbb{R}^{n_{1}\times(n_{1}+3)},

and set \mathcal{H}^{[m]}\in\mathbb{R}^{n_{1}\times n_{1}} as the first n_{1} columns of \bar{\mathcal{H}}^{\widehat{j}_{m},\widehat{l}_{m}}.

Algorithm 2 Construction of \Omega in Boosting with non-parametric pairwise regression

Input: Data C\in\mathbb{R}^{n\times p}, D\in\mathbb{R}^{n}; the step-length factor 0<\nu\leq 1; the stopping time m_{\rm stop}.

Output: \Omega^{\rm boo}

1:Randomly split the data into disjoint subsets \mathcal{A}_{1},\mathcal{A}_{2} with n_{1}=\lfloor\frac{2n}{3}\rfloor and |\mathcal{A}_{2}|=n-n_{1};
2:Set m=0, \widehat{f}^{[0]}_{\mathcal{A}_{2}}=0, and \Omega^{[0]}=0.
3:For 1\leq j,l\leq p, compute \mathcal{P}^{j,l} as the projection matrix onto a set of basis functions generated by C_{\mathcal{A}_{2},j} and C_{\mathcal{A}_{2},l}.
4:for 1\leq m\leq m_{\rm stop} do
5:     Compute the adjusted outcome U^{[m]}_{\mathcal{A}_{2}}=D_{\mathcal{A}_{2}}-\widehat{f}^{[m-1]}_{\mathcal{A}_{2}}
6:     Implement the following base procedure on \{C_{i},U^{[m]}_{i}\}_{i\in\mathcal{A}_{2}}
(\widehat{j}_{m},\widehat{l}_{m})=\operatorname*{arg\,min}_{1\leq j,l\leq p}\left\|U^{[m]}_{\mathcal{A}_{2}}-\mathcal{P}^{j,l}U^{[m]}_{\mathcal{A}_{2}}\right\|^{2}.
7:     Update \widehat{f}^{[m]}_{\mathcal{A}_{2}}=\widehat{f}^{[m-1]}_{\mathcal{A}_{2}}+\nu\mathcal{P}^{\widehat{j}_{m},\widehat{l}_{m}}\left(D_{\mathcal{A}_{2}}-\widehat{f}^{[m-1]}_{\mathcal{A}_{2}}\right)
8:     Construct \mathcal{H}^{[m]} in the same way as \mathcal{P}^{\widehat{j}_{m},\widehat{l}_{m}} but with the data in \mathcal{A}_{1}.
9:     Compute \Omega^{[m]}=\nu\mathcal{H}^{[m]}+({\rm I}-\nu\mathcal{H}^{[m]})\Omega^{[m-1]}
10:end for
11:Return \Omega^{\rm boo}

Decision tree.

In Algorithm 3, we describe the construction of \Omega^{\rm boo} with the decision tree as the base procedure.

Algorithm 3 Construction of \Omega in Boosting with the decision tree

Input: Data C\in\mathbb{R}^{n\times p}, D\in\mathbb{R}^{n}; the step-length factor 0<\nu\leq 1; the stopping time m_{\rm stop}.

Output: \Omega^{\rm boo}

1:Randomly split the data into disjoint subsets \mathcal{A}_{1},\mathcal{A}_{2} with n_{1}=\lfloor\frac{2n}{3}\rfloor and |\mathcal{A}_{2}|=n-n_{1};
2:Set m=0, \widehat{f}^{[0]}_{\mathcal{A}_{2}}=0, and \Omega^{[0]}=0.
3:for 1\leq m\leq m_{\rm stop} do
4:     Compute the adjusted outcome U^{[m]}_{\mathcal{A}_{2}}=D_{\mathcal{A}_{2}}-\widehat{f}^{[m-1]}_{\mathcal{A}_{2}}
5:     Run the decision tree on \{C_{i},U^{[m]}_{i}\}_{i\in\mathcal{A}_{2}} and partition \mathbb{R}^{p} into leaves \{\mathcal{R}^{[m]}_{1},\cdots,\mathcal{R}^{[m]}_{L}\};
6:     For any j\in\mathcal{A}_{2}, we identify the leaf \mathcal{R}^{[m]}_{l(C_{j})} containing C_{j} and compute
\mathcal{P}^{[m]}_{j,t}=\frac{{\bf{1}}\left[C_{t}\in\mathcal{R}^{[m]}_{l(C_{j})}\right]}{\sum_{k\in\mathcal{A}_{2}}{\bf{1}}\left[C_{k}\in\mathcal{R}^{[m]}_{l(C_{j})}\right]}\quad\text{for}\quad t\in\mathcal{A}_{2}
7:     Compute the matrix (\mathcal{P}^{[m]}_{j,t})_{j,t\in\mathcal{A}_{2}} and update
\widehat{f}^{[m]}_{\mathcal{A}_{2}}=\widehat{f}^{[m-1]}_{\mathcal{A}_{2}}+\nu\mathcal{P}^{[m]}\left(D_{\mathcal{A}_{2}}-\widehat{f}^{[m-1]}_{\mathcal{A}_{2}}\right)
8:     Construct the matrix (\mathcal{H}^{[m]}_{j,t})_{j,t\in{\mathcal{A}_{1}}} as \mathcal{H}^{[m]}_{j,t}=\frac{{\bf{1}}\left[C_{t}\in\mathcal{R}^{[m]}_{l(C_{j})}\right]}{{\sum_{k\in\mathcal{A}_{1}}}{\bf{1}}\left[C_{k}\in\mathcal{R}^{[m]}_{l(C_{j})}\right]}.
9:     Compute \Omega^{[m]}=\nu\mathcal{H}^{[m]}+({\rm I}-\nu\mathcal{H}^{[m]})\Omega^{[m-1]}
10:end for
11:Return \Omega^{\rm boo}
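As an illustration of step 8 of Algorithm 3, the following minimal sketch (our own code) builds the hat matrix \mathcal{H}^{[m]} from the leaf memberships of the \mathcal{A}_{1} observations under the m-th tree.

```python
import numpy as np

def tree_hat_matrix(leaf_ids):
    """leaf_ids: leaf membership of each A1 observation under the m-th tree,
       e.g. from sklearn's DecisionTreeRegressor(...).fit(C_A2, U_A2).apply(C_A1)."""
    leaf_ids = np.asarray(leaf_ids)
    same_leaf = (leaf_ids[:, None] == leaf_ids[None, :]).astype(float)   # 1{C_t in leaf of C_j}
    return same_leaf / same_leaf.sum(axis=1, keepdims=True)              # row-normalized averaging weights
```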

A.7 TSCI with deep neural network

In the following, we demonstrate how to calculate \Omega for a deep neural network \citepsuppjames2013introduction. We define the first hidden layer as H^{(1)}_{i,k}=\sigma\left(\omega^{(1)}_{k,0}+\sum_{l=1}^{p_{\rm z}}\omega^{(1)}_{k,l}Z_{il}+\sum_{l=1}^{p_{\rm x}}\omega^{(1)}_{k,l+p_{\rm z}}X_{i,l}\right) for 1\leq k\leq K_{1}, where \sigma(\cdot) is the activation function and \{\omega^{(1)}_{k,l}\}_{1\leq k\leq K_{1},1\leq l\leq p_{x}+p_{z}} are unknown parameters. For m\geq 2, we define the m-th hidden layer as H^{(m)}_{i,k}=\sigma\left(\omega^{(m)}_{k,0}+\sum_{l=1}^{K_{m-1}}\omega^{(m)}_{k,l}H^{(m-1)}_{i,l}\right) for 1\leq k\leq K_{m}, where \{\omega^{(m)}_{k,l}\}_{1\leq k\leq K_{m},1\leq l\leq K_{m-1}} are unknown parameters. For a given M\geq 1, we estimate the unknown parameters based on the data \{X_{i},Z_{i},D_{i}\}_{i\in\mathcal{A}_{2}},

\left\{\widehat{\beta},\{\widehat{\omega}^{(m)}\}_{1\leq m\leq M}\right\}\coloneqq\operatorname*{arg\,min}_{\{\beta_{k}\}_{0\leq k\leq K_{M}},\{\omega^{(m)}\}_{1\leq m\leq M}}\sum_{i\in\mathcal{A}_{2}}\left(D_{i}-\beta_{0}-\sum_{k=1}^{K_{M}}\beta_{k}H^{(M)}_{i,k}\right)^{2}.

With \{\widehat{\omega}^{(m)}\}_{1\leq m\leq M}, for 1\leq m\leq M and 1\leq i\leq n_{1}, we define

\widehat{H}^{(m)}_{i,k}=\sigma\left(\widehat{\omega}^{(m)}_{k0}+\sum_{l=1}^{K_{m-1}}\widehat{\omega}^{(m)}_{kl}\widehat{H}^{(m-1)}_{i,l}\right)\quad\text{for}\quad 1\leq k\leq K_{m}

with \widehat{H}^{(1)}_{i,k}=\sigma\left(\widehat{\omega}^{(1)}_{k0}+\sum_{l=1}^{p_{\rm z}}\widehat{\omega}^{(1)}_{kl}Z_{il}+\sum_{l=1}^{p_{\rm x}}\widehat{\omega}^{(1)}_{k,l+p_{\rm z}}X_{il}\right) for 1\leq k\leq K_{1}. We use \Omega^{\rm DNN}=\widehat{H}^{(M)}\left([\widehat{H}^{(M)}]^{\intercal}\widehat{H}^{(M)}\right)^{-1}[\widehat{H}^{(M)}]^{\intercal} to denote the projection onto the column space of the matrix \widehat{H}^{(M)} and express the deep neural network estimator as \Omega^{\rm DNN}D_{\mathcal{A}_{1}}. With \widehat{V}_{\mathcal{A}_{1}}=\Omega^{\rm DNN}V_{\mathcal{A}_{1}}, we define \mathbf{M}_{\rm DNN}(V)=\left[\Omega^{\rm DNN}\right]^{\intercal}P^{\perp}_{\widehat{V}_{\mathcal{A}_{1}}}\Omega^{\rm DNN}, which is shown to be an orthogonal projection matrix in Lemma 1 in the supplement.
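A minimal sketch of \Omega^{\rm DNN} (our own code, agnostic to the deep learning framework used to fit the network) is given below; it takes the fitted last-hidden-layer activations \widehat{H}^{(M)} evaluated on \mathcal{A}_{1} and forms the projection onto their column space.

```python
import numpy as np

def omega_dnn(H_last):
    """H_last: (n1 x K_M) matrix of fitted last-hidden-layer activations on A1."""
    # projection onto the column space of H_last; pinv is used for numerical stability
    return H_last @ np.linalg.pinv(H_last.T @ H_last) @ H_last.T

# f_hat_A1 = omega_dnn(H_last) @ D_A1 gives the first-stage fitted values on A1.
```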

A.8 TSCI with B-spline

As a simplification, we consider the additive model {\mathbf{E}}(D_{i}\mid Z_{i},X_{i})=\gamma_{1}(Z_{i})+\gamma_{2}(X_{i}) and assume that \gamma_{1}(\cdot) can be well approximated by a set of B-spline basis functions \{b_{1}(\cdot),\cdots,b_{M}(\cdot)\}. In practice, the number M can be chosen by cross-validation. We define the matrix B\in\mathbb{R}^{n\times M} with its i-th row B_{i}=\left(b_{1}(Z_{i}),\cdots,b_{M}(Z_{i})\right)^{\intercal}. Without loss of generality, we approximate \gamma_{2}(X_{i}) by W_{i}\in\mathbb{R}^{p_{w}}, which is generated by the same set of basis functions as \phi(X_{i}). Define \Omega^{\rm ba}=P_{BW}\in\mathbb{R}^{n\times n} as the projection matrix onto the space spanned by the columns of B\in\mathbb{R}^{n\times M} and W\in\mathbb{R}^{n\times p_{w}}. We write the first-stage estimator as \Omega^{\rm ba}D and compute \mathbf{M}_{\rm ba}(V)=\left[\Omega^{\rm ba}\right]^{\intercal}P^{\perp}_{\widehat{V},{W}}\Omega^{\rm ba} with \widehat{V}=\Omega^{\rm ba}V. The transformation matrix \mathbf{M}_{\rm ba}(V) is a projection matrix with M-{\rm rank}(V)\leq{\rm rank}\left[\mathbf{M}_{\rm ba}(V)\right]\leq M. When the number of basis functions M is small and the degree of freedom M+p_{w} is much smaller than n, sample splitting is not even needed if the B-spline is used for fitting the treatment model, which is different from the general machine learning algorithms.
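A minimal sketch of \Omega^{\rm ba} (our own code) is given below; the spline settings are illustrative choices rather than values prescribed by the paper.

```python
import numpy as np
from sklearn.preprocessing import SplineTransformer

def omega_bspline(Z, W, n_knots=6, degree=3):
    """Z: (n,) univariate instrument; W: (n, p_w) covariate basis; returns the n x n projection P_{BW}."""
    B = SplineTransformer(n_knots=n_knots, degree=degree).fit_transform(Z.reshape(-1, 1))
    BW = np.column_stack([B, W])                       # columns of B and W, as in A.8
    return BW @ np.linalg.pinv(BW.T @ BW) @ BW.T       # Omega^ba = projection onto col([B, W])
```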

A.9 Properties of 𝐌(V)\mathbf{M}(V)

The following lemma is about the properties of the transformation matrix \mathbf{M}(V), whose proof can be found in Section C.4. Recall that

\mathbf{M}_{\rm RF}({V})=\Omega^{\intercal}P^{\perp}_{\widehat{V}_{\mathcal{A}_{1}}}\Omega\quad\text{with}\quad\Omega_{ij}=\omega_{j}(Z_{i},X_{i}),

where \omega_{j}(z,x) is defined in (14).

Lemma 1

The transformation matrix \mathbf{M}_{\rm RF}({V}) satisfies

\lambda_{\max}(\mathbf{M}_{\rm RF}({V}))\leq 1\quad\text{and}\quad b^{\intercal}[\mathbf{M}_{\rm RF}({V})]^{2}b\leq b^{\intercal}\mathbf{M}_{\rm RF}({V})b\quad\text{for any}\quad b\in\mathbb{R}^{n_{1}}. (41)

As a consequence, we establish {\rm Tr}\left([\mathbf{M}_{\rm RF}({V})]^{2}\right)\leq{\rm Tr}[\mathbf{M}(V)]. The transformation matrices \mathbf{M}_{\rm ba}(V) and \mathbf{M}_{\rm DNN}({V}) are orthogonal projection matrices with

[\mathbf{M}_{\rm ba}(V)]^{2}=\mathbf{M}_{\rm ba}(V)\quad\text{and}\quad[\mathbf{M}_{\rm DNN}({V})]^{2}=\mathbf{M}_{\rm DNN}({V}). (42)

The transformation matrix \mathbf{M}_{\rm boo}({V}) satisfies

\lambda_{\max}(\mathbf{M}_{\rm boo}({V}))\leq\|\Omega^{\rm boo}\|_{2}^{2},\quad\text{and}\quad b^{\intercal}[\mathbf{M}_{\rm boo}({V})]^{2}b\leq\|\Omega^{\rm boo}\|_{2}^{2}\cdot b^{\intercal}\mathbf{M}_{\rm boo}({V})b, (43)

for any b\in\mathbb{R}^{n_{1}}. As a consequence, we establish {\rm Tr}\left([\mathbf{M}_{\rm boo}({V})]^{2}\right)\leq\|\Omega^{\rm boo}\|_{2}^{2}\cdot{\rm Tr}[\mathbf{M}(V)].

A.10 Homoscedastic correlation

In the following, we discuss the simplified method and theory in the homoscedastic correlation setting, that is, {\rm Cov}(\epsilon_{i},\delta_{i}\mid Z_{i},X_{i})={\rm Cov}(\epsilon_{i},\delta_{i}). With this extra assumption, we present the following bias-corrected estimator as an alternative to the estimator defined in (19),

\widetilde{\beta}_{\rm RF}(V)=\widehat{\beta}_{\rm init}(V)-\frac{\widehat{\rm Cov}(\delta_{i},\epsilon_{i})\cdot{\rm Tr}[\mathbf{M}(V)]}{{D}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){D}_{\mathcal{A}_{1}}},

where \widehat{\beta}_{\rm init}(V) is defined in (16) and the estimator of {\rm Cov}(\delta_{i},\epsilon_{i}) is defined as

\widehat{\rm Cov}(\delta_{i},\epsilon_{i})=\frac{1}{n_{1}-r}(D_{\mathcal{A}_{1}}-\widehat{f}_{\mathcal{A}_{1}})^{\intercal}P^{\perp}_{V_{\mathcal{A}_{1}}}[Y_{\mathcal{A}_{1}}-D_{\mathcal{A}_{1}}\widehat{\beta}_{\rm init}(V)],

with r denoting the rank of the matrix (V,W). The correction in constructing \widetilde{\beta}_{\rm RF}(V) implicitly requires {\rm Cov}(\epsilon_{i},\delta_{i}\mid Z_{i},X_{i})={\rm Cov}(\epsilon_{i},\delta_{i}), which might limit practical applications.
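A minimal sketch of this bias correction (our own code; the transformation matrix \mathbf{M}(V), the residual projector P^{\perp}_{V_{\mathcal{A}_{1}}}, and the first-stage fit \widehat{f}_{\mathcal{A}_{1}} are assumed to be precomputed) is given below.

```python
import numpy as np

def beta_tilde_rf(beta_init, D_A1, Y_A1, f_hat_A1, M_V, P_perp_V, r):
    """Bias-corrected TSCI estimator under homoscedastic correlation."""
    n1 = len(D_A1)
    # Cov_hat(delta, eps) = (D - f_hat)' P_perp_V (Y - D * beta_init) / (n1 - r)
    cov_hat = (D_A1 - f_hat_A1) @ P_perp_V @ (Y_A1 - D_A1 * beta_init) / (n1 - r)
    return beta_init - cov_hat * np.trace(M_V) / (D_A1 @ M_V @ D_A1)
```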

We present a simplified version of Theorem 6 by assuming homoscedastic correlation. We shall only present the results for TSCI with random forests, but the extension to general machine learning methods is straightforward.

Theorem 4

Consider the model (3) and (4) with Cov(ϵi,δiZi,Xi)=Cov(ϵi,δi){\rm Cov}(\epsilon_{i},\delta_{i}\mid Z_{i},X_{i})={\rm Cov}(\epsilon_{i},\delta_{i}). Suppose that Conditions (R1) and (R2) hold, then

|Cov^(δi,ϵi)Cov(δi,ϵi)|ηn0(V)ηn(V),\left|\widehat{\rm Cov}(\delta_{i},\epsilon_{i})-{\rm Cov}(\delta_{i},\epsilon_{i})\right|\leq\eta^{0}_{n}(V)\leq\eta_{n}(V), (44)

where ηn(V)\eta_{n}(V) is defined in (30) and

ηn0(V)=f𝒜1f^𝒜12n+lognn+(|ββ^init(V)|+R(V)2n)(1+f𝒜1f^𝒜12n).\eta^{0}_{n}(V)=\frac{\|f_{\mathcal{A}_{1}}-\widehat{f}_{\mathcal{A}_{1}}\|_{2}}{\sqrt{n}}+\sqrt{\frac{\log n}{n}}+\left(|\beta-\widehat{\beta}_{\rm init}(V)|+\frac{\|R(V)\|_{2}}{\sqrt{n}}\right)\left(1+\frac{\|f_{\mathcal{A}_{1}}-\widehat{f}_{\mathcal{A}_{1}}\|_{2}}{\sqrt{n}}\right). (45)

Furthermore, if we assume (31) holds, then β~RF(V)\widetilde{\beta}_{\rm RF}(V) satisfies

1SE(V)(β~RF(V)β)=𝒢(V)+(V),\frac{1}{{\rm SE}(V)}\left(\widetilde{\beta}_{\rm RF}(V)-\beta\right)={\mathcal{G}}(V)+\mathcal{E}(V), (46)

where SE(V){\rm SE}(V) is defined in (59), 𝒢(V)𝑑N(0,1){\mathcal{G}}(V)\overset{d}{\to}N(0,1) and there exist positive constants C>0C>0 and t0>0t_{0}>0 such that

lim infn(|(V)|Cηn0(V)Tr[𝐌(V)]+R(V)2+t0Tr([𝐌RF(V)]2)f𝒜1[𝐌(V)]2f𝒜1)1exp(t02),\liminf_{n\rightarrow\infty}\mathbb{P}\left(\left|\mathcal{E}(V)\right|\leq C\frac{\eta^{0}_{n}(V)\cdot{\rm Tr}[\mathbf{M}(V)]+\|R(V)\|_{2}+t_{0}\sqrt{{\rm Tr}([\mathbf{M}_{\rm RF}({V})]^{2})}}{\sqrt{{f}_{\mathcal{A}_{1}}[\mathbf{M}(V)]^{2}{f}_{\mathcal{A}_{1}}}}\right)\geq 1-\exp(-t_{0}^{2}), (47)

with ηn0(V)\eta^{0}_{n}(V) defined in (45).

As a remark, if R(V)2/n0\|R(V)\|_{2}/\sqrt{n}\rightarrow 0, β^init(V)𝑝β\widehat{\beta}_{\rm init}(V)\overset{p}{\to}\beta, and f𝒜1f^𝒜12/n0,{\|f_{\mathcal{A}_{1}}-\widehat{f}_{\mathcal{A}_{1}}\|_{2}}/{\sqrt{n}}\rightarrow 0, then we have ηn0(V)0.\eta^{0}_{n}(V)\rightarrow 0.

A.11 Consistency of variance estimators

The following lemma establishes the consistency of the variance estimator; its proof can be found in Section C.5.

Lemma 2

Suppose that Conditions (R1) and (R2) hold and κn(V)2+lognκn(V)𝑝0,\kappa_{n}(V)^{2}+\sqrt{\log n}\kappa_{n}(V)\overset{p}{\to}0, with κn(V)=logn(R(V)+|ββ^init(V)|+logn/n).\kappa_{n}(V)=\sqrt{\log n}\left(\|R(V)\|_{\infty}+|\beta-\widehat{\beta}_{\rm init}(V)|+{\log n}/{\sqrt{n}}\right). If

i=1n1σi2[𝐌(V)D𝒜1]i2i=1n1σi2[𝐌(V)f𝒜1]i2𝑝1,andmax1in1[𝐌(V)D𝒜1]i2i=1n1[𝐌(V)D𝒜1]i2𝑝0,\frac{{\sum_{i=1}^{n_{1}}\sigma^{2}_{i}[\mathbf{M}(V){D}_{\mathcal{A}_{1}}]_{i}^{2}}}{{\sum_{i=1}^{n_{1}}\sigma^{2}_{i}[\mathbf{M}(V){f}_{\mathcal{A}_{1}}]_{i}^{2}}}\overset{p}{\to}1,\quad\text{and}\quad\frac{\max_{1\leq i\leq n_{1}}[\mathbf{M}(V){D}_{\mathcal{A}_{1}}]_{i}^{2}}{{{\sum_{i=1}^{n_{1}}[\mathbf{M}(V){D}_{\mathcal{A}_{1}}]_{i}^{2}}}}\overset{p}{\to}0,

then we have SE^(V)/SE(V)𝑝1.{\widehat{\rm SE}(V)}/{{\rm SE}(V)}\overset{p}{\to}1.

A.12 Size and Power of \mathcal{C}(V_{q},V_{q^{\prime}})

The limiting distribution in Theorem 2 implies the size and power of the comparison test \mathcal{C}(V_{q},V_{q^{\prime}}) in (26), stated in the following theorem.

Theorem 5

Suppose that the conditions of Theorem 2 hold, \widehat{H}(V_{q},V_{q^{\prime}})/H(V_{q},V_{q^{\prime}})\overset{p}{\to}1, and \mathcal{L}_{n}(V_{q},V_{q^{\prime}}) defined in (35) satisfies \mathcal{L}_{n}(V_{q},V_{q^{\prime}})\overset{p}{\to}\mathcal{L}^{*}(V_{q},V_{q^{\prime}}) for some \mathcal{L}^{*}(V_{q},V_{q^{\prime}})\in\mathbb{R}\cup\{-\infty,\infty\}. Then the test \mathcal{C}(V_{q},V_{q^{\prime}}) in (26) with the corresponding \alpha_{0} satisfies

\lim_{n\rightarrow\infty}\mathbf{P}\left(\mathcal{C}(V_{q},V_{q^{\prime}})=1\right)=1-\Phi\left(z_{\alpha_{0}}-\mathcal{L}^{*}(V_{q},V_{q^{\prime}})\right)+\Phi\left(-z_{\alpha_{0}}-\mathcal{L}^{*}(V_{q},V_{q^{\prime}})\right). (48)

If \left|\mathcal{L}^{*}(V_{q},V_{q^{\prime}})\right|=0, then \lim_{n\rightarrow\infty}\mathbf{P}\left(\mathcal{C}(V_{q},V_{q^{\prime}})=1\right)=2\alpha_{0}. If \left|\mathcal{L}^{*}(V_{q},V_{q^{\prime}})\right|=\infty, then \lim_{n\rightarrow\infty}\mathbf{P}\left(\mathcal{C}(V_{q},V_{q^{\prime}})=1\right)=1.

Appendix B Proofs

We establish Proposition 2 in Section B.1. In Section B.2, we establish Proposition 1. In Section B.3, we establish Theorems 6 and 1. We prove Theorems 2 and 5 in Section B.4. We establish Theorem 3 in Section B.5 and prove Theorem 4 in Section B.6.

We define the conditional covariance matrices Λ,Σδ,Σϵn1×n1\Lambda,\Sigma^{\delta},\Sigma^{\epsilon}\in\mathbb{R}^{n_{1}\times n_{1}} as Λ=𝐄(δ𝒜1ϵ𝒜1X𝒜1,Z𝒜1),\Lambda={\mathbf{E}}({\delta}_{\mathcal{A}_{1}}{\epsilon}_{\mathcal{A}_{1}}^{\intercal}\mid X_{\mathcal{A}_{1}},Z_{\mathcal{A}_{1}}), Σδ=𝐄(δ𝒜1δ𝒜1X𝒜1,Z𝒜1),\Sigma^{\delta}={\mathbf{E}}({\delta}_{\mathcal{A}_{1}}{\delta}_{\mathcal{A}_{1}}^{\intercal}\mid X_{\mathcal{A}_{1}},Z_{\mathcal{A}_{1}}), and Σϵ=𝐄(ϵ𝒜1ϵ𝒜1X𝒜1,Z𝒜1)\Sigma^{\epsilon}={\mathbf{E}}({\epsilon}_{\mathcal{A}_{1}}{\epsilon}_{\mathcal{A}_{1}}^{\intercal}\mid X_{\mathcal{A}_{1}},Z_{\mathcal{A}_{1}}). For any i,j𝒜1i,j\in\mathcal{A}_{1} and ij,i\neq j, we have Λi,j=𝐄[δj𝐄(ϵiX𝒜1,Z𝒜1,δj)X𝒜1,Z𝒜1].\Lambda_{i,j}={\mathbf{E}}\left[\delta_{j}{\mathbf{E}}(\epsilon_{i}\mid X_{\mathcal{A}_{1}},Z_{\mathcal{A}_{1}},\delta_{j})\mid X_{\mathcal{A}_{1}},Z_{\mathcal{A}_{1}}\right]. Since 𝐄(ϵiX𝒜1,Z𝒜1,δj)=𝐄(ϵiXi,Zi)=0{\mathbf{E}}(\epsilon_{i}\mid X_{\mathcal{A}_{1}},Z_{\mathcal{A}_{1}},\delta_{j})={\mathbf{E}}(\epsilon_{i}\mid X_{i},Z_{i})=0 for iji\neq j, we have Λi,j=0\Lambda_{i,j}=0 and Λ\Lambda is a diagonal matrix. Similarly, we can show Σδ\Sigma^{\delta} and Σϵ\Sigma^{\epsilon} are diagonal matrices. The conditional sub-gaussian condition in (R1) implies that max1jn1max{|Λj,j|,|Σj,jδ|,|Σj,jϵ|}K2.\max_{1\leq j\leq n_{1}}\max\left\{|\Lambda_{j,j}|,|\Sigma^{\delta}_{j,j}|,|\Sigma^{\epsilon}_{j,j}|\right\}\leq K^{2}. We shall use 𝒪\mathcal{O} to denote the set of random variables {Zi,Xi}1in\{Z_{i},X_{i}\}_{1\leq i\leq n} and {Di}i𝒜2\{D_{i}\}_{i\in\mathcal{A}_{2}}.

We introduce the following lemma about the concentration of quadratic forms, which is Theorem 1.1 in \citetsupprudelson2013hanson.

Lemma 3

(Hanson-Wright inequality) Let ϵn\epsilon\in\mathbb{R}^{n} be a random vector with independent sub-gaussian components ϵi\epsilon_{i} with zero mean and sub-gaussian norm KK. Let AA be an n×nn\times n matrix. For every t0t\geq 0, 𝐏(|ϵAϵ𝐄ϵAϵ|>t)2exp[cmin(t2K4AF2,tK2A2)].\mathbf{P}\left(|\epsilon^{\intercal}A\epsilon-{\mathbf{E}}\epsilon^{\intercal}A\epsilon|>t\right)\leq 2\exp\left[-c\min\left(\frac{t^{2}}{K^{4}\|A\|_{F}^{2}},\frac{t}{K^{2}\|A\|_{2}}\right)\right].

Note that 2ϵAδ=(ϵ+δ)A(ϵ+δ)ϵAϵδAδ.2\epsilon^{\intercal}A\delta=(\epsilon+\delta)^{\intercal}A(\epsilon+\delta)-\epsilon^{\intercal}A\epsilon-\delta^{\intercal}A\delta. If both ϵi\epsilon_{i} and δi\delta_{i} are sub-gaussian, we apply the union bound, Lemma 3 for both ϵ\epsilon and δ\delta and then establish,

𝐏(|ϵAδ𝐄ϵAδ|>t)6exp[cmin(t2K4AF2,tK2A2)].\mathbf{P}\left(|\epsilon^{\intercal}A\delta-{\mathbf{E}}\epsilon^{\intercal}A\delta|>t\right)\leq 6\exp\left[-c\min\left(\frac{t^{2}}{K^{4}\|A\|_{F}^{2}},\frac{t}{K^{2}\|A\|_{2}}\right)\right]. (49)

B.1 Proof of Proposition 2

Note that

𝐄(YiZi=z,Xi=1)\displaystyle{\mathbf{E}}(Y_{i}\mid Z_{i}=z,X_{i}=1) (50)
=𝐄[Di(z)(x=1)Yi(1,z)(x=1)+(1Di(z)(x=1))Yi(0,z)(x=1)Zi=z,Xi=1]\displaystyle={\mathbf{E}}\left[D^{(z)}_{i}(x=1)Y^{(1,z)}_{i}(x=1)+(1-D^{(z)}_{i}(x=1))Y^{(0,z)}_{i}(x=1)\mid Z_{i}=z,X_{i}=1\right]
=𝐄[Di(z)(x=1)Yi(1,z)(x=1)+(1Di(z)(x=1))Yi(0,z)(x=1)Xi=1]\displaystyle={\mathbf{E}}\left[D^{(z)}_{i}(x=1)Y^{(1,z)}_{i}(x=1)+(1-D^{(z)}_{i}(x=1))Y^{(0,z)}_{i}(x=1)\mid X_{i}=1\right]

We assume that Yi(1,z)(x=1)Yi(1,w)(x=1)=Yi(1,z)(x=0)Yi(1,w)(x=0)=(zw)πY^{(1,z)}_{i}(x=1)-Y^{(1,w)}_{i}(x=1)=Y^{(1,z)}_{i}(x=0)-Y^{(1,w)}_{i}(x=0)=(z-w)\pi and obtain

ITTY(x=1)𝐄(YiZi=z,Xi=1)𝐄(YiZi=w,Xi=1)\displaystyle{\rm ITT}_{Y}(x=1)\coloneqq{\mathbf{E}}(Y_{i}\mid Z_{i}=z,X_{i}=1)-{\mathbf{E}}(Y_{i}\mid Z_{i}=w,X_{i}=1) (51)
=𝐄[(Di(z)(x=1)Di(w)(x=1))(Yi(1,w)(x=1)Yi(0,w)(x=1))Xi=1]+(zw)π\displaystyle={\mathbf{E}}\left[(D^{(z)}_{i}(x=1)-D^{(w)}_{i}(x=1))(Y^{(1,w)}_{i}(x=1)-Y^{(0,w)}_{i}(x=1))\mid X_{i}=1\right]+(z-w)\pi
=β𝐄[Di(z)(x=1)Di(w)(x=1)Xi=1]+(zw)π\displaystyle=\beta{\mathbf{E}}\left[D^{(z)}_{i}(x=1)-D^{(w)}_{i}(x=1)\mid X_{i}=1\right]+(z-w)\pi
=β(𝐄(DiZi=z,Xi=1)𝐄(DiZi=w,Xi=1))+(zw)π\displaystyle=\beta\left({\mathbf{E}}(D_{i}\mid Z_{i}=z,X_{i}=1)-{\mathbf{E}}(D_{i}\mid Z_{i}=w,X_{i}=1)\right)+(z-w)\pi
=βITTD(x=1)+(zw)π.\displaystyle=\beta\cdot{\rm ITT}_{D}(x=1)+(z-w)\pi.

Similarly, we have

ITTY(x=0)=βITTD(x=0)+(zw)π.{\rm ITT}_{Y}(x=0)=\beta\cdot{\rm ITT}_{D}(x=0)+(z-w)\pi. (52)

We combine (51) and (52) to establish the results.

B.2 Proof of Proposition 1

Recall that we are analyzing \widehat{\beta}_{\rm init}(V) defined in (16) with a general transformation matrix \mathbf{M}(V). We have decomposed the estimation error \widehat{\beta}_{\rm init}(V)-\beta as

β^init(V)β=ϵ𝒜1𝐌(V)δ𝒜1+ϵ𝒜1𝐌(V)f𝒜1+R𝒜1(V)𝐌(V)D𝒜1D𝒜1𝐌(V)D𝒜1.\widehat{\beta}_{\rm init}(V)-\beta=\frac{{{\epsilon}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){\delta}_{\mathcal{A}_{1}}+{\epsilon}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){f}_{\mathcal{A}_{1}}+R_{\mathcal{A}_{1}}(V)^{\intercal}\mathbf{M}(V){D}_{\mathcal{A}_{1}}}}{{D}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){D}_{\mathcal{A}_{1}}}. (53)

The following Lemma controls the key components of the above decomposition, whose proof can be found in Section C.1 in the supplement.

Lemma 4

Under Condition (R1), with probability larger than 1exp(cmin{t02,t0})1-\exp\left(-c\min\left\{t_{0}^{2},t_{0}\right\}\right) for some positive constants c>0c>0 and t0>0,t_{0}>0,

|ϵ𝒜1𝐌(V)δ𝒜1Tr[𝐌(V)Λ]|t0K2Tr([𝐌(V)]2),|ϵ𝒜1𝐌(V)f𝒜1|t0Kf𝒜1[𝐌(V)]2f𝒜1\left|{{\epsilon}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){\delta}_{\mathcal{A}_{1}}}-{\rm Tr}[\mathbf{M}({V})\Lambda]\right|\leq t_{0}K^{2}\sqrt{{\rm Tr}([\mathbf{M}({V})]^{2})},\;\left|{\epsilon}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){f}_{\mathcal{A}_{1}}\right|\leq t_{0}K\sqrt{{f}_{\mathcal{A}_{1}}^{\intercal}[\mathbf{M}(V)]^{2}{f}_{\mathcal{A}_{1}}} (54)
|D𝒜1𝐌(V)D𝒜1f𝒜1𝐌(V)f𝒜11|K2Tr[𝐌(V)]+t0K2Tr([𝐌(V)]2)+t0Kf𝒜1[𝐌(V)]2f𝒜1f𝒜1𝐌(V)f𝒜1.\displaystyle\left|\frac{D^{\intercal}_{\mathcal{A}_{1}}\mathbf{M}(V)D_{\mathcal{A}_{1}}}{f^{\intercal}_{\mathcal{A}_{1}}\mathbf{M}(V)f_{\mathcal{A}_{1}}}-1\right|\lesssim\frac{K^{2}{\rm Tr}[\mathbf{M}({V})]+t_{0}K^{2}\sqrt{{\rm Tr}([\mathbf{M}({V})]^{2})}+t_{0}K\sqrt{f_{\mathcal{A}_{1}}^{\intercal}[\mathbf{M}(V)]^{2}f_{\mathcal{A}_{1}}}}{{f^{\intercal}_{\mathcal{A}_{1}}\mathbf{M}(V)f_{\mathcal{A}_{1}}}}. (55)

where Λ=𝐄(δ𝒜1ϵ𝒜1X𝒜1,Z𝒜1).\Lambda={\mathbf{E}}({\delta}_{\mathcal{A}_{1}}{\epsilon}_{\mathcal{A}_{1}}^{\intercal}\mid X_{\mathcal{A}_{1}},Z_{\mathcal{A}_{1}}).

Define τn=f𝒜1𝐌(V)f𝒜1.\tau_{n}={f}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){f}_{\mathcal{A}_{1}}. Together with (41) and the generalized IV strength condition (R2), we apply (55) with t0=τn1/2c0t_{0}=\tau_{n}^{1/2-c_{0}} for some 0<c0<1/20<c_{0}<1/2. Then there exists positive constants c>0c>0 and C>0C>0 such that with probability larger than 1exp(cτn1/2c0),1-\exp\left(-c\tau_{n}^{1/2-c_{0}}\right),

|D𝒜1𝐌(V)D𝒜1f𝒜1𝐌(V)f𝒜11|CK2Tr[𝐌(V)]f𝒜1𝐌(V)f𝒜1+CK+K2(f𝒜1𝐌(V)f𝒜1)c00.1.\left|\frac{D^{\intercal}_{\mathcal{A}_{1}}\mathbf{M}(V)D_{\mathcal{A}_{1}}}{f^{\intercal}_{\mathcal{A}_{1}}\mathbf{M}(V)f_{\mathcal{A}_{1}}}-1\right|\leq C\frac{K^{2}{\rm Tr}[\mathbf{M}({V})]}{{{{f}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){f}_{\mathcal{A}_{1}}}}}+C\frac{K+K^{2}}{(f_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V)f_{\mathcal{A}_{1}})^{c_{0}}}\leq 0.1. (56)

Together with (53), we establish

|β^init(V)β||ϵ𝒜1𝐌(V)f𝒜1|f𝒜1𝐌(V)f𝒜1+|ϵ𝒜1𝐌(V)δ𝒜1|f𝒜1𝐌(V)f𝒜1+R𝒜1(V)𝐌(V)R𝒜1(V)f𝒜1𝐌(V)f𝒜1.\left|\widehat{\beta}_{\rm init}(V)-\beta\right|\lesssim\frac{\left|{\epsilon}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){f}_{\mathcal{A}_{1}}\right|}{{{{f}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){f}_{\mathcal{A}_{1}}}}}+\frac{\left|{\epsilon}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){\delta}_{\mathcal{A}_{1}}\right|}{{{{f}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){f}_{\mathcal{A}_{1}}}}}+\sqrt{\frac{{R_{\mathcal{A}_{1}}(V)^{\intercal}\mathbf{M}(V)R_{\mathcal{A}_{1}}(V)}}{{{f}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){f}_{\mathcal{A}_{1}}}}}. (57)

By applying the decomposition (57) and the upper bounds in (54) with t_{0}=(f_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V)f_{\mathcal{A}_{1}})^{1/2-c_{0}}, we establish that, conditioning on \mathcal{O}, with probability larger than 1-\exp\left(-c\tau_{n}^{1/2-c_{0}}\right) for some positive constants c>0 and c_{0}\in(0,1/2),

|β^init(V)β|K2Tr[𝐌(V)]f𝒜1𝐌(V)f𝒜1+K+K2(f𝒜1𝐌(V)f𝒜1)c0+R𝒜1(V)2f𝒜1𝐌(V)f𝒜1.|\widehat{\beta}_{\rm init}(V)-\beta|\lesssim\frac{K^{2}{\rm Tr}[\mathbf{M}({V})]}{{{f}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){f}_{\mathcal{A}_{1}}}}+\frac{K+K^{2}}{(f_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V)f_{\mathcal{A}_{1}})^{c_{0}}}+\frac{\|R_{\mathcal{A}_{1}}(V)\|_{2}}{\sqrt{{{f}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){f}_{\mathcal{A}_{1}}}}}. (58)

The above concentration bound, the assumption (R2), and the assumption f𝒜1𝐌(V)f𝒜1R𝒜1(V)22{f}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){f}_{\mathcal{A}_{1}}\gg\|R_{\mathcal{A}_{1}}(V)\|_{2}^{2} imply (|β^init(V)β|c𝒪)0\mathbb{P}(|\widehat{\beta}_{\rm init}(V)-\beta|\geq c\mid\mathcal{O})\rightarrow 0. Then for any constant c>0c>0, we have (|β^init(V)β|c)=𝐄((|β^init(V)β|c𝒪)).\mathbb{P}(|\widehat{\beta}_{\rm init}(V)-\beta|\geq c)={\mathbf{E}}\left(\mathbb{P}(|\widehat{\beta}_{\rm init}(V)-\beta|\geq c\mid\mathcal{O})\right). We apply the bounded convergence theorem and establish (|β^init(V)β|c)0\mathbb{P}(|\widehat{\beta}_{\rm init}(V)-\beta|\geq c)\rightarrow 0 and hence β^init(V)𝑝β.\widehat{\beta}_{\rm init}(V)\overset{p}{\to}\beta.

B.3 Proof of Theorem 1

In the following, we shall prove Theorem 6, which implies Theorem 1 together with Condition (R2-I).

Theorem 6

Suppose that the same conditions of Proposition 1 and (31) hold. Then β^(V)\widehat{\beta}(V) defined in (19) satisfies

1SE(V)(β^(V)β)=𝒢(V)+(V)withSE(V)=i=1n1σi2[𝐌(V)f𝒜1]i2f𝒜1𝐌(V)f𝒜1,\frac{1}{{\rm SE}(V)}\left(\widehat{\beta}(V)-\beta\right)={\mathcal{G}}(V)+\mathcal{E}(V)\quad\text{with}\quad{\rm SE}(V)=\frac{\sqrt{\sum_{i=1}^{n_{1}}\sigma^{2}_{i}[\mathbf{M}(V){f}_{\mathcal{A}_{1}}]_{i}^{2}}}{{f}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){f}_{\mathcal{A}_{1}}}, (59)

where 𝒢(V)𝑑N(0,1){\mathcal{G}}(V)\overset{d}{\to}N(0,1) and there exist positive constants t0>0t_{0}>0 and C>0C>0 such that

lim infn(|(V)|lognηn(V)Tr[𝐌(V)]+t0Tr([𝐌(V)]2)+R(V)2f𝒜1[𝐌(V)]2f𝒜1)1exp(t02),\liminf_{n\rightarrow\infty}\mathbb{P}\left(\left|\mathcal{E}(V)\right|\leq\frac{\sqrt{\log n}\cdot\eta_{n}(V)\cdot{\rm Tr}[\mathbf{M}(V)]+t_{0}\sqrt{{\rm Tr}([\mathbf{M}({V})]^{2})}+\|R(V)\|_{2}}{\sqrt{{f}_{\mathcal{A}_{1}}[\mathbf{M}(V)]^{2}{f}_{\mathcal{A}_{1}}}}\right)\geq 1-\exp(-t_{0}^{2}), (60)

with ηn(V)\eta_{n}(V) defined in (30).

We start with the following error decomposition,

β^RF(V)β=ϵ𝒜1𝐌(V)f𝒜1+R𝒜1(V)𝐌(V)D𝒜1D𝒜1𝐌(V)D𝒜1+Err1+Err2D𝒜1𝐌(V)D𝒜1,\displaystyle\widehat{\beta}_{\rm RF}(V)-\beta=\frac{{\epsilon}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){f}_{\mathcal{A}_{1}}+R_{\mathcal{A}_{1}}(V)^{\intercal}\mathbf{M}(V){D}_{\mathcal{A}_{1}}}{{D}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){D}_{\mathcal{A}_{1}}}+\frac{{\rm Err}_{1}+{\rm Err}_{2}}{{D}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){D}_{\mathcal{A}_{1}}}, (61)

with Err1=1ijn1n1[𝐌(V)]ijδiϵj{\rm Err}_{1}=\sum_{1\leq i\neq j\leq n_{1}}^{n_{1}}[\mathbf{M}(V)]_{ij}\delta_{i}\epsilon_{j} and Err2=i=1n1[𝐌(V)]ii(fif^i)(ϵi+[ϵ^(V)]iϵi)+i=1n1[𝐌(V)]iiδi([ϵ^(V)]iϵi).{\rm Err}_{2}=\sum_{i=1}^{n_{1}}[\mathbf{M}(V)]_{ii}(f_{i}-\widehat{f}_{i})\left(\epsilon_{i}+[\widehat{\epsilon}(V)]_{i}-\epsilon_{i}\right)+\sum_{i=1}^{n_{1}}[\mathbf{M}(V)]_{ii}\delta_{i}\left([\widehat{\epsilon}(V)]_{i}-\epsilon_{i}\right). Define

𝒢(V)=1SE(V)ϵ𝒜1𝐌(V)f𝒜1D𝒜1𝐌(V)D𝒜1,(V)=1SE(V)R𝒜1(V)𝐌(V)D𝒜1Err1Err2D𝒜1𝐌(V)D𝒜1.{\mathcal{G}}(V)=\frac{1}{{\rm SE}(V)}\frac{{\epsilon}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){f}_{\mathcal{A}_{1}}}{{D}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){D}_{\mathcal{A}_{1}}},\quad\mathcal{E}(V)=\frac{1}{{\rm SE}(V)}\frac{R_{\mathcal{A}_{1}}(V)^{\intercal}\mathbf{M}(V){D}_{\mathcal{A}_{1}}-{\rm Err}_{1}-{\rm Err}_{2}}{{D}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){D}_{\mathcal{A}_{1}}}.

Then the decomposition (61) implies (59). We apply (56) together with the generalized IV strength condition (R2) and establish \frac{D^{\intercal}_{\mathcal{A}_{1}}\mathbf{M}(V)D_{\mathcal{A}_{1}}}{f^{\intercal}_{\mathcal{A}_{1}}\mathbf{M}(V)f_{\mathcal{A}_{1}}}\overset{p}{\to}1. Condition (31) implies the Lindeberg condition. Hence, we establish \mathcal{G}(V)\mid\mathcal{O}\overset{d}{\to}N(0,1) and \mathcal{G}(V)\overset{d}{\to}N(0,1).

Since 𝐄(Err1𝒪)=0{\mathbf{E}}({\rm Err}_{1}\mid\mathcal{O})=0, we apply the similar argument as in (54) and obtain that, conditioning on 𝒪\mathcal{O}, with probability larger than 1exp(ct02)1-\exp\left(-ct_{0}^{2}\right) for some constants c>0c>0 and t0>0,t_{0}>0, |Err1|t0K21ijn1[𝐌(V)]ij2t0K2Tr([𝐌(V)]2).\left|{\rm Err}_{1}\right|\leq t_{0}K^{2}\sqrt{\sum_{1\leq i\neq j\leq n_{1}}[\mathbf{M}(V)]_{ij}^{2}}\leq t_{0}K^{2}\sqrt{{\rm Tr}([\mathbf{M}({V})]^{2})}. Then we establish

(|Err1|t0K2Tr([𝐌(V)]2))\displaystyle\mathbb{P}\left(\left|{\rm Err}_{1}\right|\geq t_{0}K^{2}\sqrt{{\rm Tr}([\mathbf{M}({V})]^{2})}\right) =𝐄[(|Err1|t0K2Tr([𝐌(V)]2)𝒪)]exp(ct02).\displaystyle={\mathbf{E}}\left[\mathbb{P}\left(\left|{\rm Err}_{1}\right|\geq t_{0}K^{2}\sqrt{{\rm Tr}([\mathbf{M}({V})]^{2})}\mid\mathcal{O}\right)\right]\leq\exp\left(-ct_{0}^{2}\right). (62)

We introduce the following Lemma to control Err2,{\rm Err}_{2}, whose proof is presented in Section C.2 in the supplement.

Lemma 5

If Condition (R2) holds, then with probability larger than 1n1c,1-n_{1}^{-c},

max1in1|[ϵ^(V)]iϵi|Clogn(R(V)+|ββ^init(V)|+lognn).\displaystyle\max_{1\leq i\leq n_{1}}\left|[\widehat{\epsilon}(V)]_{i}-\epsilon_{i}\right|\leq C\sqrt{\log n}\left(\|R(V)\|_{\infty}+|\beta-\widehat{\beta}_{\rm init}(V)|+\frac{\log n}{\sqrt{n}}\right).

The sub-gaussianity of ϵi\epsilon_{i} implies that, with probability larger than 1n1c1-n_{1}^{-c} for some positive constant c>0,c>0, max1in1|ϵi|+max1in1|δi|Clogn\max_{1\leq i\leq n_{1}}|\epsilon_{i}|+\max_{1\leq i\leq n_{1}}|\delta_{i}|\leq C\sqrt{\log n} for some positive constant C>0.C>0. Since 𝐌(V)\mathbf{M}(V) is positive definite, we have [𝐌(V)]ii0[\mathbf{M}(V)]_{ii}\geq 0 for any 1in1.1\leq i\leq n_{1}. We have

|Err2|i=1n1[𝐌(V)]ii[|fif^i|(logn+|[ϵ^(V)]iϵi|)+logn|[ϵ^(V)]iϵi|].\left|{\rm Err}_{2}\right|\lesssim\sum_{i=1}^{n_{1}}[\mathbf{M}(V)]_{ii}\left[|f_{i}-\widehat{f}_{i}|\left(\sqrt{\log n}+\left|[\widehat{\epsilon}(V)]_{i}-\epsilon_{i}\right|\right)+\sqrt{\log n}\left|[\widehat{\epsilon}(V)]_{i}-\epsilon_{i}\right|\right]. (63)

By Lemma 5, we have |Err2|lognηn(V)Tr[𝐌(V)]|{\rm Err}_{2}|\lesssim\sqrt{\log n}\cdot\eta_{n}(V)\cdot{\rm Tr}[\mathbf{M}(V)] with ηn(V)\eta_{n}(V) defined in (30). Together with (62), we establish lim infn(|(V)|CUn(V))1exp(t02).\liminf_{n\rightarrow\infty}\mathbb{P}\left(\left|\mathcal{E}(V)\right|\leq CU_{n}(V)\right)\geq 1-\exp(-t_{0}^{2}).

B.4 Proofs of Theorems 2 and 5

By conditional sub-gaussianity and Var(δiZi,Xi)c,{\rm Var}(\delta_{i}\mid Z_{i},X_{i})\geq c, we have cVar(δiZi,Xi)C.c\leq{\rm Var}(\delta_{i}\mid Z_{i},X_{i})\leq C. Hence, we have μ(V)f𝒜1𝐌(V)f𝒜1.\mu(V)\asymp{{f}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){f}_{\mathcal{A}_{1}}}. With H(Vq,Vq)H(V_{q},V_{q^{\prime}}) defined in (34), we have

β^(Vq)β^(Vq)H(Vq,Vq)=𝒢n(Vq,Vq)+n(Vq,Vq)+1H(Vq,Vq)(~(Vq)~(Vq)),\frac{\widehat{\beta}(V_{q})-\widehat{\beta}(V_{q^{\prime}})}{\sqrt{H(V_{q},V_{q^{\prime}})}}=\mathcal{G}_{n}(V_{q},V_{q^{\prime}})+\mathcal{L}_{n}(V_{q},V_{q^{\prime}})+\frac{1}{\sqrt{H(V_{q},V_{q^{\prime}})}}\left(\widetilde{\mathcal{E}}(V_{q})-\widetilde{\mathcal{E}}(V_{q^{\prime}})\right), (64)

where n(Vq,Vq)\mathcal{L}_{n}(V_{q},V_{q^{\prime}}) is defined in (35),

𝒢n(Vq,Vq)=1H(Vq,Vq)(f𝒜1𝐌(Vq)ϵ𝒜1D𝒜1𝐌(Vq)D𝒜1f𝒜1𝐌(Vq)ϵ𝒜1D𝒜1𝐌(Vq)D𝒜1),\mathcal{G}_{n}(V_{q},V_{q^{\prime}})=\frac{1}{\sqrt{H(V_{q},V_{q^{\prime}})}}\left(\frac{{f}^{\intercal}_{\mathcal{A}_{1}}\mathbf{M}(V_{q})\epsilon_{\mathcal{A}_{1}}}{{D}^{\intercal}_{\mathcal{A}_{1}}\mathbf{M}(V_{q})D_{\mathcal{A}_{1}}}-\frac{{f}^{\intercal}_{\mathcal{A}_{1}}\mathbf{M}(V_{q^{\prime}})\epsilon_{\mathcal{A}_{1}}}{{D}^{\intercal}_{\mathcal{A}_{1}}\mathbf{M}(V_{q^{\prime}})D_{\mathcal{A}_{1}}}\right), (65)
~(Vq)=i=1n1[𝐌(Vq)]iiδ^i[ϵ^(VQmax)]iδ𝒜1𝐌(Vq)ϵ𝒜1D𝒜1𝐌(Vq)D𝒜1.\widetilde{\mathcal{E}}(V_{q})=\frac{\sum_{i=1}^{n_{1}}[\mathbf{M}(V_{q})]_{ii}\widehat{\delta}_{i}[\widehat{\epsilon}(V_{Q_{\max}})]_{i}-\delta_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q})\epsilon_{\mathcal{A}_{1}}}{{D}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q}){D}_{\mathcal{A}_{1}}}.

The remaining proof relies on the following lemma, whose proof can be found at Section C.3 in the supplement.

Lemma 6

Suppose that Conditions (R1), (R2), and (R3) hold, then

1H(Vq,Vq)|~(Vq)~(Vq)|𝑝0,and𝒢n(Vq,Vq)𝑑N(0,1).\frac{1}{\sqrt{H(V_{q},V_{q^{\prime}})}}\left|\widetilde{\mathcal{E}}(V_{q})-\widetilde{\mathcal{E}}(V_{q^{\prime}})\right|\overset{p}{\to}0,\quad\text{and}\quad\mathcal{G}_{n}(V_{q},V_{q^{\prime}})\overset{d}{\to}N(0,1).

By applying (56), we establish that, conditioning on 𝒪\mathcal{O}, with probability larger than 1exp(cτn1/2c0),1-\exp\left(-c\tau_{n}^{1/2-c_{0}}\right), |n(Vq,Vq)|1H(Vq,Vq)([R(Vq)]𝒜12μ(Vq)+[R(Vq)]𝒜12μ(Vq)).\left|\mathcal{L}_{n}(V_{q},V_{q^{\prime}})\right|\lesssim\frac{1}{\sqrt{H(V_{q},V_{q^{\prime}})}}\left(\frac{\|[R(V_{q})]_{\mathcal{A}_{1}}\|_{2}}{\sqrt{\mu(V_{q})}}+\frac{\|[R(V_{q^{\prime}})]_{\mathcal{A}_{1}}\|_{2}}{\sqrt{\mu(V_{q^{\prime}})}}\right). Then we apply the above inequality and the condition H(Vq,Vq)maxV{Vq,Vq}R(V)2/μ(V)\sqrt{H(V_{q},V_{q^{\prime}})}\gg\max_{V\in\{V_{q},V_{q^{\prime}}\}}{\|R(V)\|_{2}}/{\sqrt{\mu(V)}} and establish that |n(Vq,Vq)|𝑝0.\left|\mathcal{L}_{n}(V_{q},V_{q^{\prime}})\right|\overset{p}{\to}0. Together with the decomposition (64), and Lemma 6, we establish the asymptotic limiting distribution of β^(Vq)β^(Vq)\widehat{\beta}(V_{q})-\widehat{\beta}(V_{q^{\prime}}) in Theorem 2.

With the decomposition (64), we establish (48) in Theorem 5 by applying Lemma 6 together with the conditions \mathcal{L}_{n}(V_{q},V_{q^{\prime}})\overset{p}{\to}\mathcal{L}^{*}(V_{q},V_{q^{\prime}}) and \widehat{H}(V_{q},V_{q^{\prime}})\overset{p}{\to}{H}(V_{q},V_{q^{\prime}}).

B.5 Proof of Theorem 3

In the following, we shall prove

lim infn[βCI(Vq^)]1α2α0withq^=q^corq^r,\liminf_{n\rightarrow\infty}\mathbb{P}\left[\beta\in{\rm CI}(V_{\widehat{q}})\right]\geq 1-\alpha-2\alpha_{0}\quad\text{with}\quad\widehat{q}=\widehat{q}_{c}\;\;\text{or}\;\;\widehat{q}_{r}, (66)

Recall that the test 𝒞(Vq)\mathcal{C}(V_{q}) is defined in (27) and note that

{q^c=q}={Qmaxq}(q=0q1{𝒞(Vq)=1}){𝒞(Vq)=0}.\left\{\widehat{q}_{c}=q^{*}\right\}=\left\{Q_{\max}\geq q^{*}\right\}\cap\left(\cap_{q=0}^{q^{*}-1}\left\{\mathcal{C}(V_{q})=1\right\}\right)\cap\left\{\mathcal{C}(V_{q^{*}})=0\right\}. (67)

Define the events

1={max0q<qQmax|𝒢n(Vq,Vq)|ρ(α0)}and2={|ρ^/ρ(α0)1|τ0},\mathcal{B}_{1}=\left\{\max_{0\leq q<q^{\prime}\leq Q_{\max}}\left|\mathcal{G}_{n}(V_{q},V_{q^{\prime}})\right|\leq\rho(\alpha_{0})\right\}\quad\text{and}\quad\mathcal{B}_{2}=\left\{\left|\widehat{\rho}/\rho(\alpha_{0})-1\right|\leq\tau_{0}\right\},
3={max0q<qQmax1H(Vq,Vq)|~(Vq)~(Vq)|τ1ρ(α0)}.\mathcal{B}_{3}=\left\{\max_{0\leq q<q^{\prime}\leq Q_{\max}}\frac{1}{\sqrt{H(V_{q},V_{q^{\prime}})}}\left|\widetilde{\mathcal{E}}(V_{q})-\widetilde{\mathcal{E}}(V_{q^{\prime}})\right|\leq\tau_{1}\rho(\alpha_{0})\right\}.

where \rho(\alpha_{0}) is defined in (36), \widehat{\rho} is defined in (39) with \mathbf{M}_{\rm RF} replaced by \mathbf{M}, and \mathcal{G}_{n}(V_{q},V_{q^{\prime}}) is defined in (65). By the definition of \rho(\alpha_{0}) in (36) and the following (86), we have \lim_{n\rightarrow\infty}\mathbf{P}\left(\mathcal{B}_{1}\right)=1-2\alpha_{0}. By Lemma 6 and \widehat{\rho}/\rho(\alpha_{0})\overset{p}{\to}1, we establish that, for any positive constants \tau_{0}>0 and \tau_{1}>0, \liminf_{n\rightarrow\infty}\mathbf{P}\left(\mathcal{B}_{2}\cap\mathcal{B}_{3}\right)=1. Combining the above two equalities, we have

lim infn𝐏(123)12α0.\liminf_{n\rightarrow\infty}\mathbf{P}\left(\mathcal{B}_{1}\cap\mathcal{B}_{2}\cap\mathcal{B}_{3}\right)\geq 1-2\alpha_{0}. (68)

For any 0qq1,0\leq q\leq q^{*}-1, the condition (R4) implies that there exists some q+1qqq+1\leq q^{\prime}\leq q^{*} such that |n(Vq,Vq)|Aρ(α0).\left|\mathcal{L}_{n}(V_{q},V_{q^{\prime}})\right|\geq A\rho(\alpha_{0}). On the event 123\mathcal{B}_{1}\cap\mathcal{B}_{2}\cap\mathcal{B}_{3}, we apply the expression (64) and obtain

|[β^(Vq)β^(Vq)]/H(Vq,Vq)|Aρ(α0)ρ(α0)τ1ρ(α0)(A1τ1)(1τ0)ρ^.\left|[\widehat{\beta}(V_{q})-\widehat{\beta}(V_{q^{\prime}})]/{\sqrt{H(V_{q},V_{q^{\prime}})}}\right|\geq A\rho(\alpha_{0})-\rho(\alpha_{0})-\tau_{1}\rho(\alpha_{0})\geq(A-1-\tau_{1})(1-\tau_{0})\widehat{\rho}. (69)

For any qqQmax,q^{*}\leq q^{\prime}\leq Q_{\max}, we have R(Vq)=0.R(V_{q^{\prime}})=0. Then on the event 123\mathcal{B}_{1}\cap\mathcal{B}_{2}\cap\mathcal{B}_{3}, we apply the expression (64) and obtain that,

\left|[\widehat{\beta}(V_{q^{*}})-\widehat{\beta}(V_{q^{\prime}})]/{\sqrt{H(V_{q^{*}},V_{q^{\prime}})}}\right|\leq\rho(\alpha_{0})+\tau_{1}\rho(\alpha_{0})\leq(1+\tau_{1})(1+\tau_{0})\widehat{\rho}\leq 1.01\widehat{\rho}.

Together with (69), the event \mathcal{B}_{1}\cap\mathcal{B}_{2}\cap\mathcal{B}_{3} implies \cap_{q=0}^{q^{*}-1}\left\{\mathcal{C}(V_{q})=1\right\}\cap\left\{\mathcal{C}(V_{q^{*}})=0\right\}. By (68), we establish \liminf_{n\rightarrow\infty}\mathbf{P}\left[\left(\cap_{q=0}^{q^{*}-1}\left\{\mathcal{C}(V_{q})=1\right\}\right)\cap\left\{\mathcal{C}(V_{q^{*}})=0\right\}\right]\geq 1-2\alpha_{0}. Together with (67), we have \limsup_{n\rightarrow\infty}\mathbf{P}\left(\widehat{q}_{c}\neq q^{*}\right)\leq 2\alpha_{0}. To control the coverage probability, we decompose \mathbf{P}\left(\frac{1}{{\rm SE}(V_{\widehat{q}_{c}})}\left|\widehat{\beta}(V_{\widehat{q}_{c}})-\beta\right|\geq z_{\alpha/2}\right) as

𝐏({1SE(Vq^c)|β^(Vq^c)β|zα/2}{q^c=q})\displaystyle\mathbf{P}\left(\left\{\frac{1}{{\rm SE}(V_{\widehat{q}_{c}})}\left|\widehat{\beta}(V_{\widehat{q}_{c}})-\beta\right|\geq z_{\alpha/2}\right\}\cap\left\{\widehat{q}_{c}=q^{*}\right\}\right) (70)
+𝐏({1SE(Vq^c)|β^(Vq^c)β|zα/2}{q^cq}).\displaystyle+\mathbf{P}\left(\left\{\frac{1}{{\rm SE}(V_{\widehat{q}_{c}})}\left|\widehat{\beta}(V_{\widehat{q}_{c}})-\beta\right|\geq z_{\alpha/2}\right\}\cap\left\{\widehat{q}_{c}\neq q^{*}\right\}\right).

Note that

𝐏({1SE(Vq^c)|β^(Vq^c)β|zα/2}{q^c=q})𝐏({1SE(Vq)|β^(Vq)β|zα/2})α,\displaystyle\mathbf{P}\left(\left\{\frac{1}{{\rm SE}(V_{\widehat{q}_{c}})}\left|\widehat{\beta}(V_{\widehat{q}_{c}})-\beta\right|\geq z_{\alpha/2}\right\}\cap\left\{\widehat{q}_{c}=q^{*}\right\}\right)\leq\mathbf{P}\left(\left\{\frac{1}{{\rm SE}(V_{q^{*}})}\left|\widehat{\beta}(V_{q^{*}})-\beta\right|\geq z_{\alpha/2}\right\}\right)\leq\alpha,

and 𝐏({1SE(Vq^c)|β^(Vq^c)β|zα/2}{q^cq})𝐏(q^cq)2α0.\mathbf{P}\left(\left\{\frac{1}{{\rm SE}(V_{\widehat{q}_{c}})}\left|\widehat{\beta}(V_{\widehat{q}_{c}})-\beta\right|\geq z_{\alpha/2}\right\}\cap\left\{\widehat{q}_{c}\neq q^{*}\right\}\right)\leq\mathbf{P}\left(\widehat{q}_{c}\neq q^{*}\right)\leq 2\alpha_{0}. By the decomposition (70), we combine the above two inequalities and establish

𝐏(1SE(Vq^c)|β^(Vq^c)β|zα/2)α+2α0,\mathbf{P}\left(\frac{1}{{\rm SE}(V_{\widehat{q}_{c}})}\left|\widehat{\beta}(V_{\widehat{q}_{c}})-\beta\right|\geq z_{\alpha/2}\right)\leq\alpha+2\alpha_{0},

which implies (66) with q^=q^c.\widehat{q}=\widehat{q}_{c}. By the definition q^r=min{q^c+1,Qmax},\widehat{q}_{r}=\min\{\widehat{q}_{c}+1,Q_{\max}\}, we apply a similar argument and establish (66) with q^=q^r.\widehat{q}=\widehat{q}_{r}.

B.6 Proof of Theorem 4

If Cov(ϵi,δiZi,Xi)=Cov(ϵi,δi){\rm Cov}(\epsilon_{i},\delta_{i}\mid Z_{i},X_{i})={\rm Cov}(\epsilon_{i},\delta_{i}), then (54) further implies

|ϵ𝒜1𝐌(V)δ𝒜1Cov(ϵi,δi)Tr[𝐌RF(V)]|t0K2Tr([𝐌RF(V)]2).\left|{{\epsilon}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){\delta}_{\mathcal{A}_{1}}}-{\rm Cov}(\epsilon_{i},\delta_{i})\cdot{\rm Tr}[\mathbf{M}_{\rm RF}({V})]\right|\leq t_{0}K^{2}\sqrt{{\rm Tr}([\mathbf{M}_{\rm RF}({V})]^{2})}. (71)

Proof of (44).

We decompose the error Cov^(δi,ϵi)Cov(δi,ϵi)\widehat{\rm Cov}(\delta_{i},\epsilon_{i})-{\rm Cov}(\delta_{i},\epsilon_{i}) as,

1n1r(f𝒜1f^𝒜1+δ𝒜1)PV𝒜1[ϵ𝒜1+D𝒜1(ββ^init(V))+R𝒜1(V)]Cov(δi,ϵi)\displaystyle\frac{1}{n_{1}-r}(f_{\mathcal{A}_{1}}-\widehat{f}_{\mathcal{A}_{1}}+\delta_{\mathcal{A}_{1}})^{\intercal}P^{\perp}_{V_{\mathcal{A}_{1}}}[\epsilon_{\mathcal{A}_{1}}+D_{\mathcal{A}_{1}}(\beta-\widehat{\beta}_{\rm init}(V))+R_{\mathcal{A}_{1}}(V)]-{\rm Cov}(\delta_{i},\epsilon_{i}) (72)
=T1+T2+T3,\displaystyle=T_{1}+T_{2}+T_{3},

where T1=1n1r[(δ𝒜1)PV𝒜1ϵ𝒜1(n1r)Cov(δi,ϵi)],T_{1}=\frac{1}{n_{1}-r}\left[(\delta_{\mathcal{A}_{1}})^{\intercal}P^{\perp}_{V_{\mathcal{A}_{1}}^{\intercal}}\epsilon_{\mathcal{A}_{1}}-(n_{1}-r){\rm Cov}(\delta_{i},\epsilon_{i})\right], and

T2\displaystyle T_{2} =1n1r(δ𝒜1)PV𝒜1[D𝒜1(ββ^init(V))+R𝒜1(V)],\displaystyle=\frac{1}{n_{1}-r}(\delta_{\mathcal{A}_{1}})^{\intercal}P^{\perp}_{V_{\mathcal{A}_{1}}^{\intercal}}\left[D_{\mathcal{A}_{1}}(\beta-\widehat{\beta}_{\rm init}(V))+R_{\mathcal{A}_{1}}(V)\right],
T3\displaystyle T_{3} =1n1r(f𝒜1f^𝒜1)PV𝒜1[ϵ𝒜1+D𝒜1(ββ^init(V))+R𝒜1(V)].\displaystyle=\frac{1}{n_{1}-r}(f_{\mathcal{A}_{1}}-\widehat{f}_{\mathcal{A}_{1}})^{\intercal}P^{\perp}_{V_{\mathcal{A}_{1}}^{\intercal}}[\epsilon_{\mathcal{A}_{1}}+D_{\mathcal{A}_{1}}(\beta-\widehat{\beta}_{\rm init}(V))+R_{\mathcal{A}_{1}}(V)].

We apply (49) with A=PV𝒜1A=P^{\perp}_{V_{\mathcal{A}_{1}}^{\intercal}} and t=t0K2n1rt=t_{0}K^{2}\sqrt{n_{1}-r} for some 0<t0n1r,0<t_{0}\leq\sqrt{n_{1}-r}, and establish

(|ϵ𝒜1PV𝒜1δ𝒜1Cov(ϵi,δi)Tr(PV𝒜1)|t0K2n1r𝒪)6exp(ct02).\displaystyle\mathbb{P}\left(\left|\epsilon_{\mathcal{A}_{1}}^{\intercal}P^{\perp}_{V_{\mathcal{A}_{1}}^{\intercal}}\delta_{\mathcal{A}_{1}}-{\rm Cov}(\epsilon_{i},\delta_{i})\cdot{\rm Tr}\left(P^{\perp}_{V_{\mathcal{A}_{1}}^{\intercal}}\right)\right|\geq t_{0}K^{2}\sqrt{n_{1}-r}\mid\mathcal{O}\right)\leq 6\exp\left(-ct_{0}^{2}\right).

By choosing t_{0}=\log(n_{1}-r), we establish that, with probability larger than 1-(n_{1}-r)^{-c} for some positive constant c>0,

|T1|log(n1r)/(n1r).\left|T_{1}\right|\lesssim\sqrt{{\log(n_{1}-r)}/{(n_{1}-r)}}. (73)

We apply (49) with A=P^{\perp}_{V_{\mathcal{A}_{1}}^{\intercal}} and t=t_{0}K^{2}\sqrt{n_{1}-r} for some 0<t_{0}\leq\sqrt{n_{1}-r}, and establish

(|ϵ𝒜1PV𝒜1D𝒜1Cov(ϵi,Di)Tr(PV𝒜1)|t0K2n1r𝒪)6exp(ct02).\displaystyle\mathbb{P}\left(\left|\epsilon_{\mathcal{A}_{1}}^{\intercal}P^{\perp}_{V_{\mathcal{A}_{1}}^{\intercal}}D_{\mathcal{A}_{1}}-{\rm Cov}(\epsilon_{i},D_{i})\cdot{\rm Tr}\left(P^{\perp}_{V_{\mathcal{A}_{1}}^{\intercal}}\right)\right|\geq t_{0}K^{2}\sqrt{n_{1}-r}\mid\mathcal{O}\right)\leq 6\exp\left(-ct_{0}^{2}\right).

The above concentration bound implies

(1n1r|ϵ𝒜1PV𝒜1D𝒜1|C(1+log(n1r)n1r)𝒪)6(n1r)c.\mathbb{P}\left(\frac{1}{n_{1}-r}\left|\epsilon_{\mathcal{A}_{1}}^{\intercal}P^{\perp}_{V_{\mathcal{A}_{1}}^{\intercal}}D_{\mathcal{A}_{1}}\right|\geq C\left(1+\sqrt{\frac{\log(n_{1}-r)}{n_{1}-r}}\right)\mid\mathcal{O}\right)\leq 6(n_{1}-r)^{-c}. (74)

Hence, we establish that, with probability larger than 1(n1r)c,1-(n_{1}-r)^{-c},

|T2||ββ^init(V)|+R(V)2/n1r,\left|T_{2}\right|\lesssim\left|\beta-\widehat{\beta}_{\rm init}(V)\right|+{\|R(V)\|_{2}}/{\sqrt{n_{1}-r}}, (75)

where the last inequality follows from (58). Regarding T3,T_{3}, we have

|T3|=|1n1r(f𝒜1f^𝒜1)PV𝒜1[ϵ𝒜1+D𝒜1(ββ^init(V))+R𝒜1(V)]|\displaystyle|T_{3}|=\left|\frac{1}{n_{1}-r}(f_{\mathcal{A}_{1}}-\widehat{f}_{\mathcal{A}_{1}})^{\intercal}P^{\perp}_{V_{\mathcal{A}_{1}}^{\intercal}}[\epsilon_{\mathcal{A}_{1}}+D_{\mathcal{A}_{1}}(\beta-\widehat{\beta}_{\rm init}(V))+R_{\mathcal{A}_{1}}(V)]\right| (76)
PV𝒜1(f𝒜1f^𝒜1)2n1rPV𝒜1ϵ𝒜12+PV𝒜1D𝒜12|ββ^init(V)|+R𝒜1(V)2n1r.\displaystyle\lesssim\frac{\|P^{\perp}_{V_{\mathcal{A}_{1}}^{\intercal}}(f_{\mathcal{A}_{1}}-\widehat{f}_{\mathcal{A}_{1}})\|_{2}}{\sqrt{n_{1}-r}}\frac{\|P^{\perp}_{V_{\mathcal{A}_{1}}^{\intercal}}\epsilon_{\mathcal{A}_{1}}\|_{2}+\|P^{\perp}_{V_{\mathcal{A}_{1}}^{\intercal}}D_{\mathcal{A}_{1}}\|_{2}\cdot|\beta-\widehat{\beta}_{\rm init}(V)|+\|R_{\mathcal{A}_{1}}(V)\|_{2}}{\sqrt{n_{1}-r}}.

By a similar argument as in (74), we establish that, with probability larger than 1(n1r)c,1-(n_{1}-r)^{-c}, 1n1PV𝒜1ϵ𝒜12+1n1PV𝒜1D𝒜12C,\frac{1}{\sqrt{n_{1}}}\|P^{\perp}_{V_{\mathcal{A}_{1}}^{\intercal}}\epsilon_{\mathcal{A}_{1}}\|_{2}+\frac{1}{\sqrt{n_{1}}}\|P^{\perp}_{V_{\mathcal{A}_{1}}^{\intercal}}D_{\mathcal{A}_{1}}\|_{2}\leq C, for some positive constant C>0.C>0. Together with (76), we establish that, with probability larger than 1(n1r)c,1-(n_{1}-r)^{-c},

|T3|PV𝒜1(f𝒜1f^𝒜1)2n1r(1+|ββ^init(V)|+R(V)2n1r).\left|T_{3}\right|\lesssim\frac{\|P^{\perp}_{V_{\mathcal{A}_{1}}^{\intercal}}(f_{\mathcal{A}_{1}}-\widehat{f}_{\mathcal{A}_{1}})\|_{2}}{\sqrt{n_{1}-r}}\cdot\left(1+|\beta-\widehat{\beta}_{\rm init}(V)|+\frac{\|R(V)\|_{2}}{\sqrt{n_{1}-r}}\right).

Together with (72) and the upper bounds (73) and (75), we establish (44).

Proof of (47).

By (53), we obtain the following decomposition for β~RF(V)\widetilde{\beta}_{\rm RF}(V) in (19),

β^RF(V)β=ϵ𝒜1𝐌(V)f𝒜1[Cov^(δi,ϵi)Cov(δi,ϵi)]Tr[𝐌(V)]D𝒜1𝐌(V)D𝒜1\displaystyle\widehat{\beta}_{\rm RF}(V)-\beta=\frac{{\epsilon}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){f}_{\mathcal{A}_{1}}-[\widehat{\rm Cov}(\delta_{i},\epsilon_{i})-{\rm Cov}(\delta_{i},\epsilon_{i})]{\rm Tr}[\mathbf{M}(V)]}{{D}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){D}_{\mathcal{A}_{1}}} (77)
+ϵ𝒜1𝐌(V)δ𝒜1Cov(δi,ϵi)Tr[𝐌(V)]D𝒜1𝐌(V)D𝒜1+R𝒜1(V)𝐌(V)D𝒜1D𝒜1𝐌(V)D𝒜1.\displaystyle+\frac{{\epsilon}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){\delta}_{\mathcal{A}_{1}}-{\rm Cov}(\delta_{i},\epsilon_{i}){\rm Tr}[\mathbf{M}(V)]}{{D}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){D}_{\mathcal{A}_{1}}}+\frac{R_{\mathcal{A}_{1}}(V)^{\intercal}\mathbf{M}(V){D}_{\mathcal{A}_{1}}}{{D}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){D}_{\mathcal{A}_{1}}}.

The above decomposition implies (59) with 𝒢(V)=1SE(V)ϵ𝒜1𝐌(V)f𝒜1D𝒜1𝐌(V)D𝒜1,\mathcal{G}(V)=\frac{1}{{\rm SE}(V)}\frac{{\epsilon}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){f}_{\mathcal{A}_{1}}}{{{D}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){D}_{\mathcal{A}_{1}}}}, and

SE(V)(V)=R𝒜1(V)𝐌(V)D𝒜1D𝒜1𝐌(V)D𝒜1\displaystyle{\rm SE}(V)\cdot\mathcal{E}(V)=\frac{R_{\mathcal{A}_{1}}(V)^{\intercal}\mathbf{M}(V){D}_{\mathcal{A}_{1}}}{{D}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){D}_{\mathcal{A}_{1}}}
+[Cov(δi,ϵi)Cov^(δi,ϵi)]Tr[𝐌(V)]+ϵ𝒜1𝐌(V)δ𝒜1Cov(δi,ϵi)Tr[𝐌(V)]D𝒜1𝐌(V)D𝒜1.\displaystyle+\frac{[{\rm Cov}(\delta_{i},\epsilon_{i})-\widehat{\rm Cov}(\delta_{i},\epsilon_{i})]{\rm Tr}[\mathbf{M}(V)]+{\epsilon}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){\delta}_{\mathcal{A}_{1}}-{\rm Cov}(\delta_{i},\epsilon_{i}){\rm Tr}[\mathbf{M}(V)]}{{D}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){D}_{\mathcal{A}_{1}}}.

We establish 𝒢(V)𝑑N(0,1)\mathcal{G}(V)\overset{d}{\to}N(0,1) by applying the same arguments as in Section B.3 in the main paper. We establish (47) by combining (44), (56), and (71).

Appendix C Proof of Extra Lemmas

C.1 Proof of Lemma 4

Note that 𝐄[ϵ𝒜1𝐌(V)δ𝒜1𝒪]=Tr(𝐌(V)Λ).{\mathbf{E}}\left[{\epsilon}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){\delta}_{\mathcal{A}_{1}}\mid\mathcal{O}\right]={\rm Tr}\left(\mathbf{M}(V)\Lambda\right). We apply (49) with A=𝐌(V)A=\mathbf{M}(V) and t=t0K2𝐌(V)Ft=t_{0}K^{2}\|\mathbf{M}(V)\|_{F} for some t0>0t_{0}>0 and establish

(|ϵ𝒜1𝐌(V)δ𝒜1Tr(𝐌(V)Λ)|t0K2𝐌(V)F𝒪)\displaystyle\mathbb{P}\left(\left|\epsilon_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V)\delta_{\mathcal{A}_{1}}-{\rm Tr}\left(\mathbf{M}(V)\Lambda\right)\right|\geq t_{0}K^{2}\|\mathbf{M}(V)\|_{F}\mid\mathcal{O}\right) (78)
6exp(cmin{t02,t0𝐌(V)F𝐌(V)2})6exp(cmin{t02,t0}),\displaystyle\leq 6\exp\left(-c\min\left\{t_{0}^{2},t_{0}\frac{\|\mathbf{M}(V)\|_{F}}{\|\mathbf{M}(V)\|_{2}}\right\}\right)\leq 6\exp\left(-c\min\left\{t_{0}^{2},t_{0}\right\}\right),

where the last inequality follows from 𝐌(V)F𝐌(V)2\|\mathbf{M}(V)\|_{F}\geq\|\mathbf{M}(V)\|_{2}. The above concentration bound implies (54) by taking an expectation with respect to 𝒪.\mathcal{O}.

Since 𝐄[δ𝒜1𝐌(V)δ𝒜1𝒪]=Tr(𝐌(V)Σδ),{\mathbf{E}}\left[{\delta}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){\delta}_{\mathcal{A}_{1}}\mid\mathcal{O}\right]={\rm Tr}\left(\mathbf{M}(V)\Sigma^{\delta}\right), we apply a similar argument to (78) and establish

(|δ𝒜1𝐌(V)δ𝒜1Tr(𝐌(V)Σδ)|t0K2𝐌(V)F𝒪)2exp(cmin{t02,t0}).\displaystyle\mathbb{P}\left(\left|\delta_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V)\delta_{\mathcal{A}_{1}}-{\rm Tr}\left(\mathbf{M}(V)\Sigma^{\delta}\right)\right|\geq t_{0}K^{2}\|\mathbf{M}(V)\|_{F}\mid\mathcal{O}\right)\leq 2\exp\left(-c\min\left\{t_{0}^{2},t_{0}\right\}\right). (79)

Note that {\mathbf{E}}\left[{\delta}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){f}_{\mathcal{A}_{1}}\mid\mathcal{O}\right]=0, and conditioning on \mathcal{O}, \{{\delta}_{i}\}_{i\in\mathcal{A}_{1}} are independent sub-gaussian random variables. We apply Proposition 5.16 of \citetsupp{vershynin2010introduction} and establish that, for some positive constants C>0 and c>0,

(δ𝒜1𝐌(V)f𝒜1Ct0Kf𝒜1[𝐌(V)]2f𝒜1𝒪)exp(ct02).\mathbb{P}\left({\delta}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){f}_{\mathcal{A}_{1}}\geq Ct_{0}K\sqrt{{f}_{\mathcal{A}_{1}}^{\intercal}[\mathbf{M}(V)]^{2}{f}_{\mathcal{A}_{1}}}\mid\mathcal{O}\right)\leq\exp(-ct_{0}^{2}). (80)

The above concentration bound implies (54) by taking an expectation with respect to 𝒪.\mathcal{O}.

By the decomposition {D}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){D}_{\mathcal{A}_{1}}-{f}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){f}_{\mathcal{A}_{1}}=2{\delta}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){f}_{\mathcal{A}_{1}}+{\delta}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){\delta}_{\mathcal{A}_{1}}, which uses D_{\mathcal{A}_{1}}=f_{\mathcal{A}_{1}}+\delta_{\mathcal{A}_{1}} and the symmetry of \mathbf{M}(V), we establish (55) by applying the concentration bounds (79) and (80).

C.2 Proof of Lemma 5

Under the model Yi=Diβ+Viπ+Ri+ϵiY_{i}=D_{i}\beta+V_{i}^{\intercal}\pi+R_{i}+\epsilon_{i}, we have

Y𝒜1β^init(V)D𝒜1=V𝒜1π+R𝒜1+D𝒜1(ββ^init(V))+ϵ𝒜1.Y_{\mathcal{A}_{1}}-\widehat{\beta}_{\rm init}(V)D_{\mathcal{A}_{1}}=V_{\mathcal{A}_{1}}\pi+R_{{\mathcal{A}_{1}}}+D_{\mathcal{A}_{1}}(\beta-\widehat{\beta}_{\rm init}(V))+\epsilon_{\mathcal{A}_{1}}. (81)

The least square estimator of π\pi is expressed as, π^=(VV)1V(Y𝒜1β^init(V)D𝒜1).\widehat{\pi}=(V^{\intercal}V)^{-1}V^{\intercal}\left(Y_{\mathcal{A}_{1}}-\widehat{\beta}_{\rm init}(V)D_{\mathcal{A}_{1}}\right). Note that

(VV)1V(Y𝒜1β^init(V)D𝒜1)π\displaystyle(V^{\intercal}V)^{-1}V^{\intercal}\left(Y_{\mathcal{A}_{1}}-\widehat{\beta}_{\rm init}(V)D_{\mathcal{A}_{1}}\right)-\pi
=(VV)1V([R(V)]𝒜1+D𝒜1(ββ^init(V))+ϵ𝒜1)\displaystyle=(V^{\intercal}V)^{-1}V^{\intercal}\left([R(V)]_{{\mathcal{A}_{1}}}+D_{\mathcal{A}_{1}}(\beta-\widehat{\beta}_{\rm init}(V))+\epsilon_{\mathcal{A}_{1}}\right)
=(i=1n1ViVi)1i=1n1Vi([R(V)]i+Di(ββ^init(V))+ϵi).\displaystyle=(\sum_{i=1}^{n_{1}}V_{i}V_{i}^{\intercal})^{-1}\sum_{i=1}^{n_{1}}V_{i}\left([R(V)]_{i}+D_{i}(\beta-\widehat{\beta}_{\rm init}(V))+\epsilon_{i}\right).

Note that 𝐄Viϵi=0{\mathbf{E}}V_{i}\epsilon_{i}=0 and conditioning on 𝒪\mathcal{O}, {ϵi}i𝒜1\{{\epsilon}_{i}\}_{i\in\mathcal{A}_{1}} are independent sub-gaussian random variables. By Proposition 5.16 of \citetsuppvershynin2010introduction, there exist positive constants C>0C>0 and c>0c>0 such that (1n1|i=1n1Vijϵi|Ct0i=1n1Vij2/n1𝒪)exp(ct02).\mathbb{P}\left(\frac{1}{n_{1}}\left|\sum_{i=1}^{n_{1}}V_{ij}\epsilon_{i}\right|\geq Ct_{0}{\sqrt{\sum_{i=1}^{n_{1}}V_{ij}^{2}}}/{n_{1}}\mid\mathcal{O}\right)\leq\exp(-ct_{0}^{2}). By Condition (R1), we have maxi,jVij2Clogn\max_{i,j}V_{ij}^{2}\leq C\log n and apply the union bound and establish

(i=1n1Viϵi/n1Clogn1/n1)n1c.\mathbb{P}\left(\left\|\sum_{i=1}^{n_{1}}V_{i}\epsilon_{i}/n_{1}\right\|_{\infty}\geq C{\log n_{1}}/{\sqrt{n_{1}}}\right)\leq n_{1}^{-c}. (82)

Similarly to (82), we establish (1n1i=1n1ViδiClogn1n1)n1c.\mathbb{P}\left(\left\|\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}V_{i}\delta_{i}\right\|_{\infty}\geq C\frac{\log n_{1}}{\sqrt{n_{1}}}\right)\leq n_{1}^{-c}. Together with the expression 1n1i=1n1ViDi=1n1i=1n1Vifi+1n1i=1n1Viδi,\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}V_{i}D_{i}=\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}V_{i}f_{i}+\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}V_{i}\delta_{i}, we apply Condition (R1) and establish (|1n1i=1n1ViDi|C)n1c.\mathbb{P}\left(\left|\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}V_{i}D_{i}\right|\geq C\right)\leq n_{1}^{-c}. Together with (82), and Condition (R1), we establish that, with probability larger than 1n1c,1-n_{1}^{-c},

π^π2R(V)+|ββ^init(V)|+logn/n.\|\widehat{\pi}-\pi\|_{2}\lesssim\|R(V)\|_{\infty}+\left|\beta-\widehat{\beta}_{\rm init}(V)\right|+{\log n}/{\sqrt{n}}. (83)

Our proposed estimator ϵ^(V)\widehat{\epsilon}(V) defined in (21) has the following equivalent expression,

ϵ^(V)\displaystyle\widehat{\epsilon}(V) =PV𝒜1[Y𝒜1D𝒜1β^init(V)]\displaystyle=P^{\perp}_{V_{\mathcal{A}_{1}}^{\intercal}}[Y_{\mathcal{A}_{1}}-D_{\mathcal{A}_{1}}\widehat{\beta}_{\rm init}(V)]
=Y𝒜1D𝒜1β^init(V)V(VV)1V(Y𝒜1β^init(V)D𝒜1)\displaystyle=Y_{\mathcal{A}_{1}}-D_{\mathcal{A}_{1}}\widehat{\beta}_{\rm init}(V)-V(V^{\intercal}V)^{-1}V^{\intercal}\left(Y_{\mathcal{A}_{1}}-\widehat{\beta}_{\rm init}(V)D_{\mathcal{A}_{1}}\right)
=Y𝒜1D𝒜1β^init(V)V𝒜1π^.\displaystyle=Y_{\mathcal{A}_{1}}-D_{\mathcal{A}_{1}}\widehat{\beta}_{\rm init}(V)-V_{\mathcal{A}_{1}}\widehat{\pi}.

Then we apply (81) and obtain [ϵ^(V)]iϵi=Di(ββ^init(V))+Vi(ππ^)+Ri(V).[\widehat{\epsilon}(V)]_{i}-\epsilon_{i}=D_{i}(\beta-\widehat{\beta}_{\rm init}(V))+V_{i}^{\intercal}(\pi-\widehat{\pi})+R_{i}(V). Together with Condition (R1) and (83), we establish Lemma 5.

C.3 Proof of Lemma 6

Similarly to (61), we decompose \widetilde{\mathcal{E}}(V_{q}) as \widetilde{\mathcal{E}}(V_{q})=\frac{{\rm Err}_{1}+{\rm Err}_{2}}{{D}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q}){D}_{\mathcal{A}_{1}}}, where {\rm Err}_{1}=\sum_{1\leq i\neq j\leq n_{1}}[\mathbf{M}(V_{q})]_{ij}\delta_{i}\epsilon_{j}, and

Err2=i=1n1[𝐌(Vq)]ii(fif^i)(ϵi+[ϵ^(VQmax)]iϵi)+i=1n1[𝐌(Vq)]iiδi([ϵ^(VQmax)]iϵi).{\rm Err}_{2}=\sum_{i=1}^{n_{1}}[\mathbf{M}(V_{q})]_{ii}(f_{i}-\widehat{f}_{i})\left(\epsilon_{i}+[\widehat{\epsilon}(V_{Q_{\max}})]_{i}-\epsilon_{i}\right)+\sum_{i=1}^{n_{1}}[\mathbf{M}(V_{q})]_{ii}\delta_{i}\left([\widehat{\epsilon}(V_{Q_{\max}})]_{i}-\epsilon_{i}\right).

We apply the same analysis as that of (63) and establish

|Err2|\displaystyle\left|{\rm Err}_{2}\right| i=1n1[𝐌(V)]ii[|fif^i|(logn+|[ϵ^(VQmax)]iϵi|)+logn|[ϵ^(VQmax)]iϵi|].\displaystyle\lesssim\sum_{i=1}^{n_{1}}[\mathbf{M}(V)]_{ii}\left[|f_{i}-\widehat{f}_{i}|\left(\sqrt{\log n}+\left|[\widehat{\epsilon}(V_{Q_{\max}})]_{i}-\epsilon_{i}\right|\right)+\sqrt{\log n}\left|[\widehat{\epsilon}(V_{Q_{\max}})]_{i}-\epsilon_{i}\right|\right].

By Lemma 5 with V=V_{Q_{\max}}, we apply a similar argument to that of (60) and establish that

\mathbb{P}\left(\left|\widetilde{\mathcal{E}}(V_{q})\right|\leq C\frac{\sqrt{\log n}\cdot\eta_{n}(V_{Q_{\max}})\cdot{\rm Tr}[\mathbf{M}(V_{q})]+t_{0}\sqrt{{\rm Tr}([\mathbf{M}({V_{q}})]^{2})}}{{{f}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q}){f}_{\mathcal{A}_{1}}}}\right)\geq 1-\exp(-t_{0}^{2}),

where ηn(V)\eta_{n}(V) is defined in (30). We establish |~(Vq)|/H(Vq,Vq)𝑝0\left|\widetilde{\mathcal{E}}(V_{q})\right|/{\sqrt{H(V_{q},V_{q^{\prime}})}}\overset{p}{\to}0 under Condition (R3). Similarly, we establish |~(Vq)|/H(Vq,Vq)𝑝0\left|\widetilde{\mathcal{E}}(V_{q^{\prime}})\right|/{\sqrt{H(V_{q},V_{q^{\prime}})}}\overset{p}{\to}0 under Condition (R3). That is, we establish 1H(Vq,Vq)|~(Vq)~(Vq)|𝑝0.\frac{1}{\sqrt{H(V_{q},V_{q^{\prime}})}}\left|\widetilde{\mathcal{E}}(V_{q})-\widetilde{\mathcal{E}}(V_{q^{\prime}})\right|\overset{p}{\to}0.

Now we prove 𝒢n(Vq,Vq)𝑑N(0,1)\mathcal{G}_{n}(V_{q},V_{q^{\prime}})\overset{d}{\to}N(0,1) and start with the decomposition,

f𝒜1𝐌(Vq)ϵ𝒜1D𝒜1𝐌(Vq)D𝒜1f𝒜1𝐌(Vq)ϵ𝒜1D𝒜1𝐌(Vq)D𝒜1=f𝒜1𝐌(Vq)ϵ𝒜1f𝒜1𝐌(Vq)f𝒜1f𝒜1𝐌(Vq)ϵ𝒜1f𝒜1𝐌(Vq)f𝒜1\displaystyle\frac{f_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q^{\prime}})\epsilon_{\mathcal{A}_{1}}}{D_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q^{\prime}})D_{\mathcal{A}_{1}}}-\frac{{f}^{\intercal}_{\mathcal{A}_{1}}\mathbf{M}(V_{q})\epsilon_{\mathcal{A}_{1}}}{{D}^{\intercal}_{\mathcal{A}_{1}}\mathbf{M}(V_{q})D_{\mathcal{A}_{1}}}=\frac{f_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q^{\prime}})\epsilon_{\mathcal{A}_{1}}}{f_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q^{\prime}})f_{\mathcal{A}_{1}}}-\frac{{f}^{\intercal}_{\mathcal{A}_{1}}\mathbf{M}(V_{q})\epsilon_{\mathcal{A}_{1}}}{{f}^{\intercal}_{\mathcal{A}_{1}}\mathbf{M}(V_{q})f_{\mathcal{A}_{1}}} (84)
+f𝒜1𝐌(Vq)ϵ𝒜1f𝒜1𝐌(Vq)f𝒜1(f𝒜1𝐌(Vq)f𝒜1D𝒜1𝐌(Vq)D𝒜11)f𝒜1𝐌(Vq)ϵ𝒜1f𝒜1𝐌(Vq)f𝒜1(f𝒜1𝐌(Vq)f𝒜1D𝒜1𝐌(Vq)D𝒜11).\displaystyle+\frac{f_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q^{\prime}})\epsilon_{\mathcal{A}_{1}}}{f_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q^{\prime}})f_{\mathcal{A}_{1}}}\left(\frac{{f_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q^{\prime}})f_{\mathcal{A}_{1}}}}{{D_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q^{\prime}})D_{\mathcal{A}_{1}}}}-1\right)-\frac{{f}^{\intercal}_{\mathcal{A}_{1}}\mathbf{M}(V_{q})\epsilon_{\mathcal{A}_{1}}}{{f}^{\intercal}_{\mathcal{A}_{1}}\mathbf{M}(V_{q})f_{\mathcal{A}_{1}}}\left(\frac{{f_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q})f_{\mathcal{A}_{1}}}}{{D_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q})D_{\mathcal{A}_{1}}}}-1\right).

Since the vector S defined in (33) satisfies \max_{i\in\mathcal{A}_{1}}{S_{i}^{2}}/{\sum_{i\in\mathcal{A}_{1}}S_{i}^{2}}\rightarrow 0, we verify the Lindeberg condition and establish

1H(Vq,Vq)(f𝒜1𝐌(Vq)ϵ𝒜1f𝒜1𝐌(Vq)f𝒜1f𝒜1𝐌(Vq)ϵ𝒜1f𝒜1𝐌(Vq)f𝒜1)𝑑N(0,1).\frac{1}{\sqrt{H(V_{q},V_{q^{\prime}})}}\left(\frac{f_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q^{\prime}})\epsilon_{\mathcal{A}_{1}}}{f_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q^{\prime}})f_{\mathcal{A}_{1}}}-\frac{{f}^{\intercal}_{\mathcal{A}_{1}}\mathbf{M}(V_{q})\epsilon_{\mathcal{A}_{1}}}{{f}^{\intercal}_{\mathcal{A}_{1}}\mathbf{M}(V_{q})f_{\mathcal{A}_{1}}}\right)\overset{d}{\to}N(0,1). (85)

We apply (54) and (55) and establish that with probability larger than 1exp(cmin{t02,t0})1-\exp\left(-c\min\left\{t_{0}^{2},t_{0}\right\}\right) for some positive constants c>0c>0 and t0>0,t_{0}>0,

|f𝒜1𝐌(Vq)ϵ𝒜1f𝒜1𝐌(Vq)f𝒜1|t0μ(Vq),and|f𝒜1𝐌(Vq)f𝒜1D𝒜1𝐌(Vq)D𝒜11|Tr[𝐌(Vq)]μ(Vq)+t0μ(Vq).\left|\frac{f_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q})\epsilon_{\mathcal{A}_{1}}}{f_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q})f_{\mathcal{A}_{1}}}\right|\lesssim\frac{t_{0}}{\sqrt{\mu(V_{q})}},\quad\text{and}\quad\left|\frac{{f_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q})f_{\mathcal{A}_{1}}}}{{D_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q})D_{\mathcal{A}_{1}}}}-1\right|\lesssim\frac{{\rm Tr}[\mathbf{M}({V_{q}})]}{\mu(V_{q})}+\frac{t_{0}}{\sqrt{\mu(V_{q})}}.

We apply the above inequalities and establish that with probability larger than 1exp(cmin{t02,t0})1-\exp\left(-c\min\left\{t_{0}^{2},t_{0}\right\}\right) for some positive constants c>0c>0 and t0>0,t_{0}>0,

1H(Vq,Vq)|f𝒜1𝐌(Vq)ϵ𝒜1f𝒜1𝐌(Vq)f𝒜1(f𝒜1𝐌(Vq)f𝒜1D𝒜1𝐌(Vq)D𝒜11)|\displaystyle\frac{1}{\sqrt{H(V_{q},V_{q^{\prime}})}}\left|\frac{f_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q^{\prime}})\epsilon_{\mathcal{A}_{1}}}{f_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q^{\prime}})f_{\mathcal{A}_{1}}}\left(\frac{{f_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q^{\prime}})f_{\mathcal{A}_{1}}}}{{D_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q^{\prime}})D_{\mathcal{A}_{1}}}}-1\right)\right|
1H(Vq,Vq)t0μ(Vq)(Tr[𝐌(Vq)]μ(Vq)+t0μ(Vq)).\displaystyle\lesssim\frac{1}{\sqrt{H(V_{q},V_{q^{\prime}})}}\frac{t_{0}}{\sqrt{\mu(V_{q})}}\left(\frac{{\rm Tr}[\mathbf{M}({V_{q}})]}{\mu(V_{q})}+\frac{t_{0}}{\sqrt{\mu(V_{q})}}\right).

Under the condition (R3), we establish

1H(Vq,Vq)|f𝒜1𝐌(Vq)ϵ𝒜1f𝒜1𝐌(Vq)f𝒜1(f𝒜1𝐌(Vq)f𝒜1D𝒜1𝐌(Vq)D𝒜11)|𝑝0.\frac{1}{\sqrt{H(V_{q},V_{q^{\prime}})}}\left|\frac{f_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q})\epsilon_{\mathcal{A}_{1}}}{f_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q})f_{\mathcal{A}_{1}}}\left(\frac{{f_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q})f_{\mathcal{A}_{1}}}}{{D_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q})D_{\mathcal{A}_{1}}}}-1\right)\right|\overset{p}{\to}0.

Similarly, we have 1H(Vq,Vq)|f𝒜1𝐌(Vq)ϵ𝒜1f𝒜1𝐌(Vq)f𝒜1(f𝒜1𝐌(Vq)f𝒜1D𝒜1𝐌(Vq)D𝒜11)|𝑝0.\frac{1}{\sqrt{H(V_{q},V_{q^{\prime}})}}\left|\frac{f_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q^{\prime}})\epsilon_{\mathcal{A}_{1}}}{f_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q^{\prime}})f_{\mathcal{A}_{1}}}\left(\frac{{f_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q^{\prime}})f_{\mathcal{A}_{1}}}}{{D_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q^{\prime}})D_{\mathcal{A}_{1}}}}-1\right)\right|\overset{p}{\to}0. The above inequalities and the decomposition (84) imply

|𝒢n(Vq,Vq)1H(Vq,Vq)(f𝒜1𝐌(Vq)ϵ𝒜1f𝒜1𝐌(Vq)f𝒜1f𝒜1𝐌(Vq)ϵ𝒜1f𝒜1𝐌(Vq)f𝒜1)|𝑝0.\left|\mathcal{G}_{n}(V_{q},V_{q^{\prime}})-\frac{1}{\sqrt{H(V_{q},V_{q^{\prime}})}}\left(\frac{f_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q^{\prime}})\epsilon_{\mathcal{A}_{1}}}{f_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V_{q^{\prime}})f_{\mathcal{A}_{1}}}-\frac{{f}^{\intercal}_{\mathcal{A}_{1}}\mathbf{M}(V_{q})\epsilon_{\mathcal{A}_{1}}}{{f}^{\intercal}_{\mathcal{A}_{1}}\mathbf{M}(V_{q})f_{\mathcal{A}_{1}}}\right)\right|\overset{p}{\to}0. (86)

Together with (85), we establish 𝒢n(Vq,Vq)𝑑N(0,1)\mathcal{G}_{n}(V_{q},V_{q^{\prime}})\overset{d}{\to}N(0,1).

C.4 Proof of Lemma 1

Note that [𝐌RF(V)]2=(Ω)PV^𝒜1Ω(Ω)PV^𝒜1Ω,[\mathbf{M}_{\rm RF}({V})]^{2}=(\Omega)^{\intercal}P^{\perp}_{\widehat{V}_{\mathcal{A}_{1}}}\Omega(\Omega)^{\intercal}P^{\perp}_{\widehat{V}_{\mathcal{A}_{1}}}\Omega, and 𝐌(V)\mathbf{M}(V) and [𝐌RF(V)]2[\mathbf{M}_{\rm RF}({V})]^{2} are positive definite. For a vector bnb\in\mathbb{R}^{n}, we have

b[𝐌RF(V)]2b=(PV^𝒜1Ωb)Ω(Ω)PV^𝒜1ΩbPV^𝒜1Ωb22Ω22.\displaystyle b^{\intercal}[\mathbf{M}_{\rm RF}({V})]^{2}b=(P^{\perp}_{\widehat{V}_{\mathcal{A}_{1}}}\Omega b)^{\intercal}\Omega(\Omega)^{\intercal}P^{\perp}_{\widehat{V}_{\mathcal{A}_{1}}}\Omega b\leq\|P^{\perp}_{\widehat{V}_{\mathcal{A}_{1}}}\Omega b\|_{2}^{2}\|\Omega\|_{2}^{2}. (87)

Let \|\Omega\|_{1} and \|\Omega\|_{\infty} denote the matrix 1 and \infty norms, respectively. Since \|\Omega\|_{1}=1 and \|\Omega\|_{\infty}=1 in the random forests setting, we have the upper bound for the spectral norm \|\Omega\|_{2}^{2}\leq\|\Omega\|_{1}\cdot\|\Omega\|_{\infty}\leq 1. Together with (87), we establish (41). Since b^{\intercal}\mathbf{M}_{\rm RF}({V})b=b^{\intercal}(\Omega)^{\intercal}P^{\perp}_{\widehat{V}_{\mathcal{A}_{1}}}\Omega b\leq\|\Omega\|_{2}^{2}\|b\|_{2}^{2} for any b, we establish that \lambda_{\max}(\mathbf{M}_{\rm RF}({V}))\leq 1.
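For completeness, one standard way to verify the bound \|\Omega\|_{2}^{2}\leq\|\Omega\|_{1}\cdot\|\Omega\|_{\infty} for a general matrix \Omega is

\|\Omega\|_{2}^{2}=\lambda_{\max}(\Omega^{\intercal}\Omega)\leq\|\Omega^{\intercal}\Omega\|_{\infty}\leq\|\Omega^{\intercal}\|_{\infty}\cdot\|\Omega\|_{\infty}=\|\Omega\|_{1}\cdot\|\Omega\|_{\infty},

since the spectral radius of the symmetric positive semi-definite matrix \Omega^{\intercal}\Omega is bounded by any induced operator norm and \|\Omega^{\intercal}\|_{\infty}=\|\Omega\|_{1}.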

We apply the minimax expression of eigenvalues and obtain that

λk([𝐌RF(V)]2)\displaystyle\lambda_{k}([\mathbf{M}_{\rm RF}({V})]^{2}) =maxU:dim(U)=kminuUu[𝐌RF(V)]2uu22\displaystyle=\max_{U:{\rm dim}(U)=k}\min_{u\in U}\frac{u^{\intercal}[\mathbf{M}_{\rm RF}({V})]^{2}u}{\|u\|_{2}^{2}}
maxU:dim(U)=kminuUu𝐌RF(V)uu22=λk(𝐌RF(V)).\displaystyle\leq\max_{U:{\rm dim}(U)=k}\min_{u\in U}\frac{u^{\intercal}\mathbf{M}_{\rm RF}({V})u}{\|u\|_{2}^{2}}=\lambda_{k}(\mathbf{M}_{\rm RF}({V})).

where the inequality follows from (41). The above inequality leads to Tr([𝐌RF(V)]2)Tr[𝐌(V)].{\rm Tr}\left([\mathbf{M}_{\rm RF}({V})]^{2}\right)\leq{\rm Tr}[\mathbf{M}(V)]. Since PBWPVBW,W=PBWP_{BW}^{\perp}P_{V_{BW},W}^{\perp}=P_{BW}^{\perp}, we have

[\mathbf{M}_{\rm ba}({V})]^{2}=P_{BW}P_{V_{BW},W}^{\perp}P_{BW}P_{V_{BW},W}^{\perp}P_{BW}
=P_{BW}P_{V_{BW},W}^{\perp}\left({\rm I}-P_{BW}^{\perp}\right)P_{V_{BW},W}^{\perp}P_{BW}=P_{BW}P_{V_{BW},W}^{\perp}P_{BW}=\mathbf{M}_{\rm ba}({V}).

Similar to the above proof, we establish [𝐌DNN(V)]2=𝐌DNN(V).[\mathbf{M}_{\rm DNN}({V})]^{2}=\mathbf{M}_{\rm DNN}({V}). The proof of (43) is the same as that of (41) by replacing (87) with b[𝐌RF(V)]2bb𝐌RF(V)bΩ22.b^{\intercal}[\mathbf{M}_{\rm RF}({V})]^{2}b\leq b^{\intercal}\mathbf{M}_{\rm RF}({V})b\cdot\|\Omega\|_{2}^{2}.

C.5 Proof of Lemma 2

By rewriting Lemma 5, we have that, with probability larger than 1nc,1-n^{-c},

max1in1|[ϵ^(V)]iϵi|Cκn(V),κn(V)=logn(R(V)+|ββ^init(V)|+lognn).\max_{1\leq i\leq n_{1}}\left|[\widehat{\epsilon}(V)]_{i}-\epsilon_{i}\right|\leq C\kappa_{n}(V),\quad\kappa_{n}(V)=\sqrt{\log n}\left(\|R(V)\|_{\infty}+|\beta-\widehat{\beta}_{\rm init}(V)|+\frac{\log n}{\sqrt{n}}\right). (88)

It is sufficient to show

(f𝒜1𝐌(V)f𝒜1)2i=1n1σi2[𝐌(V)f𝒜1]i2i=1n1[ϵ^(V)]i2[𝐌(V)D𝒜1]i2(D𝒜1𝐌(V)D𝒜1)2𝑝1.\frac{({f}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){f}_{\mathcal{A}_{1}})^{2}}{{\sum_{i=1}^{n_{1}}\sigma^{2}_{i}[\mathbf{M}(V){f}_{\mathcal{A}_{1}}]_{i}^{2}}}\frac{{\sum_{i=1}^{n_{1}}[\widehat{\epsilon}(V)]^{2}_{i}[\mathbf{M}(V){D}_{\mathcal{A}_{1}}]_{i}^{2}}}{({D}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){D}_{\mathcal{A}_{1}})^{2}}\overset{p}{\to}1. (89)

Note that

i=1n1[ϵ^(V)]i2[𝐌(V)D𝒜1]i2i=1n1σi2[𝐌(V)f𝒜1]i2=i=1n1σi2[𝐌(V)D𝒜1]i2i=1n1σi2[𝐌(V)f𝒜1]i2i=1n1[ϵ^(V)]i2[𝐌(V)D𝒜1]i2i=1n1σi2[𝐌(V)D𝒜1]i2.\frac{{\sum_{i=1}^{n_{1}}[\widehat{\epsilon}(V)]^{2}_{i}[\mathbf{M}(V){D}_{\mathcal{A}_{1}}]_{i}^{2}}}{{\sum_{i=1}^{n_{1}}\sigma^{2}_{i}[\mathbf{M}(V){f}_{\mathcal{A}_{1}}]_{i}^{2}}}=\frac{{\sum_{i=1}^{n_{1}}\sigma^{2}_{i}[\mathbf{M}(V){D}_{\mathcal{A}_{1}}]_{i}^{2}}}{{\sum_{i=1}^{n_{1}}\sigma^{2}_{i}[\mathbf{M}(V){f}_{\mathcal{A}_{1}}]_{i}^{2}}}\cdot\frac{{\sum_{i=1}^{n_{1}}[\widehat{\epsilon}(V)]^{2}_{i}[\mathbf{M}(V){D}_{\mathcal{A}_{1}}]_{i}^{2}}}{{\sum_{i=1}^{n_{1}}\sigma^{2}_{i}[\mathbf{M}(V){D}_{\mathcal{A}_{1}}]_{i}^{2}}}. (90)

We further decompose

\frac{{\sum_{i=1}^{n_{1}}[\widehat{\epsilon}(V)]^{2}_{i}[\mathbf{M}(V){D}_{\mathcal{A}_{1}}]_{i}^{2}}}{{\sum_{i=1}^{n_{1}}\sigma^{2}_{i}[\mathbf{M}(V){D}_{\mathcal{A}_{1}}]_{i}^{2}}}-1=\frac{{\sum_{i=1}^{n_{1}}[[\widehat{\epsilon}(V)]_{i}-\epsilon_{i}+\epsilon_{i}]^{2}[\mathbf{M}(V){D}_{\mathcal{A}_{1}}]_{i}^{2}}}{{\sum_{i=1}^{n_{1}}\sigma^{2}_{i}[\mathbf{M}(V){D}_{\mathcal{A}_{1}}]_{i}^{2}}}-1 (91)
=\frac{{\sum_{i=1}^{n_{1}}[[\widehat{\epsilon}(V)]_{i}-\epsilon_{i}]^{2}[\mathbf{M}(V){D}_{\mathcal{A}_{1}}]_{i}^{2}}}{{\sum_{i=1}^{n_{1}}\sigma^{2}_{i}[\mathbf{M}(V){D}_{\mathcal{A}_{1}}]_{i}^{2}}}+\frac{{2\sum_{i=1}^{n_{1}}\epsilon_{i}[[\widehat{\epsilon}(V)]_{i}-\epsilon_{i}][\mathbf{M}(V){D}_{\mathcal{A}_{1}}]_{i}^{2}}}{{\sum_{i=1}^{n_{1}}\sigma^{2}_{i}[\mathbf{M}(V){D}_{\mathcal{A}_{1}}]_{i}^{2}}}+\frac{{\sum_{i=1}^{n_{1}}(\epsilon_{i}^{2}-\sigma^{2}_{i})[\mathbf{M}(V){D}_{\mathcal{A}_{1}}]_{i}^{2}}}{{\sum_{i=1}^{n_{1}}\sigma^{2}_{i}[\mathbf{M}(V){D}_{\mathcal{A}_{1}}]_{i}^{2}}}

Note that |i=1n1(ϵi2σi2)[𝐌(V)D𝒜1]i2i=1n1σi2[𝐌(V)D𝒜1]i2||i=1n1(ϵi2σi2)[𝐌(V)D𝒜1]i2i=1n1[𝐌(V)D𝒜1]i2|.\left|\frac{{\sum_{i=1}^{n_{1}}(\epsilon_{i}^{2}-\sigma^{2}_{i})[\mathbf{M}(V){D}_{\mathcal{A}_{1}}]_{i}^{2}}}{{\sum_{i=1}^{n_{1}}\sigma^{2}_{i}[\mathbf{M}(V){D}_{\mathcal{A}_{1}}]_{i}^{2}}}\right|\lesssim\left|\frac{{\sum_{i=1}^{n_{1}}(\epsilon_{i}^{2}-\sigma^{2}_{i})[\mathbf{M}(V){D}_{\mathcal{A}_{1}}]_{i}^{2}}}{{\sum_{i=1}^{n_{1}}[\mathbf{M}(V){D}_{\mathcal{A}_{1}}]_{i}^{2}}}\right|. Define the vector an1a\in\mathbb{R}^{n_{1}} with ai=[𝐌(V)D𝒜1]i2i=1n1[𝐌(V)D𝒜1]i2a_{i}=\frac{[\mathbf{M}(V){D}_{\mathcal{A}_{1}}]_{i}^{2}}{{{\sum_{i=1}^{n_{1}}[\mathbf{M}(V){D}_{\mathcal{A}_{1}}]_{i}^{2}}}} and we have a1=1\|a\|_{1}=1 and a22a.\|a\|_{2}^{2}\leq\|a\|_{\infty}. By applying Proposition 5.16 of \citetsuppvershynin2010introduction, we establish that,

(|i=1n1ai(ϵi2σi2)|t0Ka𝒪)exp(cmin{t02,t0}).\mathbb{P}\left(\left|\sum_{i=1}^{n_{1}}a_{i}(\epsilon_{i}^{2}-\sigma^{2}_{i})\right|\geq t_{0}K\|a\|_{\infty}\mid\mathcal{O}\right)\leq\exp(-c\min\{t_{0}^{2},t_{0}\}). (92)

By the condition a𝑝0\|a\|_{\infty}\overset{p}{\to}0 and κn(V)2+lognκn(V)𝑝0,\kappa_{n}(V)^{2}+\sqrt{\log n}\kappa_{n}(V)\overset{p}{\to}0, we establish

i=1n1[ϵ^(V)]i2[𝐌(V)D𝒜1]i2/i=1n1σi2[𝐌(V)D𝒜1]i2𝑝1.{{\sum_{i=1}^{n_{1}}[\widehat{\epsilon}(V)]^{2}_{i}[\mathbf{M}(V){D}_{\mathcal{A}_{1}}]_{i}^{2}}}/{{\sum_{i=1}^{n_{1}}\sigma^{2}_{i}[\mathbf{M}(V){D}_{\mathcal{A}_{1}}]_{i}^{2}}}\overset{p}{\to}1.

By (90) and i=1n1σi2[𝐌(V)D𝒜1]i2/i=1n1σi2[𝐌(V)f𝒜1]i2𝑝1{{\sum_{i=1}^{n_{1}}\sigma^{2}_{i}[\mathbf{M}(V){D}_{\mathcal{A}_{1}}]_{i}^{2}}}/{{\sum_{i=1}^{n_{1}}\sigma^{2}_{i}[\mathbf{M}(V){f}_{\mathcal{A}_{1}}]_{i}^{2}}}\overset{p}{\to}1, and (f𝒜1𝐌(V)f𝒜1)2(D𝒜1𝐌(V)D𝒜1)2𝑝1,\frac{({f}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){f}_{\mathcal{A}_{1}})^{2}}{({D}_{\mathcal{A}_{1}}^{\intercal}\mathbf{M}(V){D}_{\mathcal{A}_{1}})^{2}}\overset{p}{\to}1, we establish (89).

Appendix D Additional Simulation Results

D.1 The appropriate threshold of IV strength for reliable inference

In this section, we show the performance of TSCI in terms of RMSE, bias, and CI coverage under different IV strengths. We generate Z_{i}, X_{i}, and the errors \{(\delta_{i},\epsilon_{i})^{\intercal}\}_{1\leq i\leq n} following the same procedure as in Settings B1 and B2 in Section 5.2 but with Z_{i}=\Psi(X^{*}_{i,p_{x}+1})\in(0,1). We generate the treatment and outcome \{D_{i},Y_{i}\}_{1\leq i\leq n} following D_{i}=1/2\cdot Z_{i}+a\cdot\left(\sin(2\pi Z_{i})+3/2\cdot\cos(2\pi Z_{i})\right)-3/10\cdot\sum_{j=1}^{10}X_{i,j}+\delta_{i} and Y_{i}=1/2\cdot D_{i}+1/5\cdot\sum_{j=1}^{10}X_{i,j}+\epsilon_{i}. Fixing the sample size n=3000, we control the IV strength by varying a across \{0.15,0.17,\cdots,0.35\}. Since we consider the valid IV setting, we implement TSCI with random forests over 500 simulation rounds and specify \mathcal{V}_{0}=\{\overrightarrow{\bf w}(x)\} with \overrightarrow{\bf w}(x)=\{1,x_{1},\cdots,x_{p_{x}}\}. Figure S1 shows that when the IV strength is above 40, TSCI has a small RMSE and bias, and its confidence interval achieves the desired coverage.
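For concreteness, the following is a minimal Python sketch of this data-generating process; taking \Psi as the standard normal CDF, p_{x}=20, and error correlation 0.5 are assumptions carried over from the related settings in Sections 5.2 and D.3, and the function name is ours.

```python
import numpy as np
from scipy.stats import norm

def simulate_setting_d1(n=3000, a=0.25, p_x=20, rho=0.5, seed=0):
    """Sketch of the valid-IV design of Section D.1 (under the stated assumptions)."""
    rng = np.random.default_rng(seed)
    p = p_x + 1
    # AR(1)-type covariance Sigma_{jk} = 0.5^{|j-k|} for the latent X*
    Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    X_star = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    X = X_star[:, :p_x]                       # observed covariates
    Z = norm.cdf(X_star[:, p_x])              # Z_i = Psi(X*_{i, p_x+1}) in (0, 1)
    # errors (delta_i, eps_i) with unit variances and correlation rho
    errors = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)
    delta, eps = errors[:, 0], errors[:, 1]
    D = (0.5 * Z + a * (np.sin(2 * np.pi * Z) + 1.5 * np.cos(2 * np.pi * Z))
         - 0.3 * X[:, :10].sum(axis=1) + delta)
    Y = 0.5 * D + 0.2 * X[:, :10].sum(axis=1) + eps
    return Z, X, D, Y
```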

Figure S1: RMSE, bias, and CI coverage of TSCI with different IV strengths. The red dashed line indicates an IV strength of 40. The black dashed line indicates the 95% coverage level.

We also use the setting with a{0.25,0.3,0.33}a\in\{0.25,0.3,0.33\} to illustrate the bias of β^init(V)\widehat{\beta}_{\rm init}(V) and the effect of the proposed bias correction in Figure 1 in the main paper.

D.2 Comparison with machine-learning IV

In the following, we compare our TSCI estimator with the MLIV estimator described in (29), while setting the full set of observed covariates as adjusted forms in our method. We use exactly the same setting as in Section D.1. In Table S1, we compare TSCI with MLIV for different n\in\{1000,3000,5000\} and different a\in\{0.2,0.25,0.3,0.35\}, with a larger value of a corresponding to a stronger IV. We implement 500 rounds of simulations and report the average measures.

Table S1: Comparison of TSCI and MLIV with the treatment model fitted by RF. The columns indexed with “without self-prediction” correspond to the split RF implemented without self-prediction, while those indexed with “with self-prediction” correspond to the split RF implemented with self-prediction. The columns indexed by “TSCI”, “Init”, and “MLIV” correspond to our proposed TSCI estimator in (19), the initial estimator in (16), and the MLIV estimator in (29). The column indexed with “RMSE Ratio” represents the MSE ratio of the MLIV estimator to the TSCI estimator; the column indexed with “IV Str” stands for our proposed IV strength in (22).
without self-prediction with self-prediction
Bias RMSE Bias RMSE
a n TSCI Init MLIV Ratio IV Str TSCI Init MLIV Ratio IV Str
0.20 1000 0.17 0.30 -4.31 374.06 2.18 0.28 0.38 0.48 1.51 11.90
3000 0.03 0.14 -0.15 2.94 12.11 0.17 0.29 0.45 2.17 42.06
5000 0.01 0.10 -0.07 1.67 27.97 0.12 0.23 0.42 2.69 76.41
0.25 1000 0.06 0.18 -0.38 23.14 3.84 0.17 0.28 0.41 1.88 15.79
3000 0.00 0.06 -0.05 1.37 30.35 0.08 0.17 0.35 2.97 74.95
5000 0.00 0.04 -0.02 1.11 77.97 0.04 0.12 0.32 3.84 156.57
0.30 1000 0.02 0.09 -0.11 2.01 7.49 0.09 0.18 0.32 2.22 23.92
3000 -0.00 0.02 -0.02 1.09 74.82 0.03 0.09 0.25 3.74 151.09
5000 0.00 0.02 -0.01 1.06 194.74 0.01 0.06 0.21 4.51 316.40
0.35 1000 0.00 0.05 -0.05 1.29 14.32 0.04 0.11 0.24 2.57 41.62
3000 -0.01 0.01 -0.01 0.99 144.72 0.01 0.04 0.17 3.55 255.65
5000 -0.00 0.01 -0.01 1.00 337.34 0.01 0.04 0.16 4.44 526.98

In Table S1, we observe that MLIV has a larger bias and standard error than TSCI, leading to an inflated MSE for MLIV. We have explained in Section 3.5 that this happens when the IV strength captured by RF is relatively weak. For a stronger IV (with a=0.35), when the sample size is 3000 or 5000 and the self-prediction is excluded, MLIV and TSCI have comparable MSE, but our proposed TSCI estimator generally exhibits a smaller bias. Moreover, the comparison between the TSCI and initial estimators shows that our proposed bias correction step effectively removes the bias due to the high complexity of the RF algorithm. Another interesting observation is that removing the self-prediction helps reduce the bias of both the TSCI and MLIV estimators, since it reduces the correlation between the ML-predicted values and the unmeasured confounders.

We then show how the coefficient c_{f} in (29) in the main paper inflates the estimation error of MLIV when the IV is weak. In Figure S2, we display the distributions of the TSCI and MLIV estimators as box plots and the frequency of c_{f} as a histogram in the setting with n=1000 and a=0.2. A certain proportion of the fitted coefficients c_{f} are close to 0, which inflates the MLIV estimator to extremely large values and thus increases its variance when the IV is weak.
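As a small illustration of where c_{f} comes from, the sketch below only extracts the least-squares coefficient of \widehat{f} in the first-stage regression D\sim\widehat{f}+X; it does not reproduce the full MLIV estimator of (29), and the function name is ours.

```python
import numpy as np

def first_stage_cf(D, f_hat, X):
    """Coefficient c_f of the ML prediction f_hat in the first-stage
    regression D ~ f_hat + X (with intercept), fitted by least squares."""
    design = np.column_stack([np.ones(len(D)), f_hat, X])
    coef, *_ = np.linalg.lstsq(design, D, rcond=None)
    return coef[1]   # position 1 corresponds to f_hat
```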

Figure S2: The left panel shows the boxplot of TSCI and MLIV estimators and the right panel shows the histogram of the coefficient cfc_{f} of f^\widehat{f} in the first stage regression Df^+XD\sim\widehat{f}+X in the MLIV method.

D.3 Multiple IV (with non-linearity)

In this section, we consider multiple IVs and compare TSCI with existing methods dealing with invalid IVs, including TSHT \citepsuppguo2018confidence and CIIV \citepsuppwindmeijer2019confidence. We consider the setting with 10 IVs. With fixing the sample size n=3000n=3000, we generate Xipx+pzX^{*}_{i}\in\mathbb{R}^{p_{x}+p_{z}} following the multivariate normal distribution with zero mean and covariance Σ\Sigma where Σi,j=0.5|ij|\Sigma_{i,j}=0.5^{|i-j|} for 1i,jpx+pz1\leq i,j\leq p_{x}+p_{z}. We define the first pzp_{z} columns of XX^{*} as IVs and denote it by Zi=Xi,1:pzZ_{i}=X^{*}_{i,1:p_{z}}. The remaining columns are defined as observed covariates, that is Xi=Xi,(pz+1):(pz+px)X_{i}=X^{*}_{i,(p_{z}+1):(p_{z}+p_{x})}. We generate errors {(δi,ϵi)}1in\{(\delta_{i},\epsilon_{i})^{\intercal}\}_{1\leq i\leq n} following the bivariate normal distribution with zero mean, unit variance and covariance as 0.5. We generate {Di,Yi}1in\{D_{i},Y_{i}\}_{1\leq i\leq n} following Di=j=1pz|Zi,j|+j=1pztanh(Zi,j)+1/2j=1pxXi,j+δiD_{i}=\sum_{j=1}^{p_{z}}|Z_{i,j}|+\sum_{j=1}^{p_{z}}\tanh(Z_{i,j})+1/2\cdot\sum_{j=1}^{p_{x}}X_{i,j}+\delta_{i} and Yi=Di+Ziπ+1/2j=1pxXi,j+ϵiY_{i}=D_{i}+Z_{i}\pi+1/2\cdot\sum_{j=1}^{p_{x}}X_{i,j}+\epsilon_{i} where the vector πpz\pi\in\mathbb{R}^{p_{z}} indicates the invalidity level of each IV. The jj-th IV is valid if πj=0\pi_{j}=0; otherwise, it is invalid. A larger |πj||\pi_{j}| value indicates that the jj-th IV is more severely invalid. We consider the following two settings,

  • Setting D1: π=a(0,0,0,0,0.5,0.5,0.5,0.5,0.5,0.5)\pi=a\cdot(0,0,0,0,{0.5},0.5,0.5,0.5,0.5,0.5)^{\intercal},

  • Setting D2: π=a(0,0,0,0,0.5,0.5,0.5,1,1,1)\pi=a\cdot(0,0,0,0,0.5,0.5,0.5,1,1,1)^{\intercal}.

For Setting D1, neither the majority nor the plurality rule is satisfied. For Setting D2, the plurality rule is satisfied. We vary the value of a{0.1,0.2,,1}a\in\{0.1,0.2,\cdots,1\} to simulate the different levels of invalidity. We specify the sets of basis as 𝒱0={𝐰(x)}\mathcal{V}_{0}=\{\overrightarrow{\bf w}(x)\} with 𝐰(x)={1,x1,,xpx}\overrightarrow{\bf w}(x)=\{1,x_{1},\cdots,x_{p_{x}}\}, 𝒱1={𝐳,𝐰(x)}\mathcal{V}_{1}=\{\mathbf{z},\overrightarrow{\bf w}(x)\} and 𝒱2={𝐳,𝐳2,𝐰(x)}\mathcal{V}_{2}=\{\mathbf{z},\mathbf{z}^{2},\overrightarrow{\bf w}(x)\} with 𝐳2={z12,,z102}\mathbf{z}^{2}=\{z_{1}^{2},\cdots,z_{10}^{2}\}. We run 500 rounds of simulation and report the results in Figure S3.
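For concreteness, a minimal Python sketch of this multiple-IV data-generating process is given below; the choice p_{x}=20 and the function name are illustrative assumptions.

```python
import numpy as np

def simulate_setting_d(n=3000, p_z=10, p_x=20, a=0.5, setting="D1", seed=0):
    """Sketch of the multiple-IV designs D1/D2 of Section D.3."""
    rng = np.random.default_rng(seed)
    p = p_z + p_x
    Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    X_star = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    Z, X = X_star[:, :p_z], X_star[:, p_z:]          # first p_z columns are the IVs
    errors = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], size=n)
    delta, eps = errors[:, 0], errors[:, 1]
    # invalidity vector pi: Setting D1 violates the plurality rule, D2 satisfies it
    if setting == "D1":
        pi = a * np.array([0, 0, 0, 0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5])
    else:
        pi = a * np.array([0, 0, 0, 0, 0.5, 0.5, 0.5, 1.0, 1.0, 1.0])
    D = np.abs(Z).sum(axis=1) + np.tanh(Z).sum(axis=1) + 0.5 * X.sum(axis=1) + delta
    Y = D + Z @ pi + 0.5 * X.sum(axis=1) + eps
    return Z, X, D, Y
```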

Figure S3: Comparison of TSCI with TSHT and CIIV in terms of RMSE, CI coverage, and CI length under Settings D1 and D2. Setting D1 corresponds to a setting where the plurality rule fails, while Setting D2 corresponds to a setting where the plurality rule holds. The stacked bar charts in the last column show the basis selection of TSCI, where \mathcal{V}_{q=1} is the oracle basis set. “TSCI” is our proposed method detailed in Algorithm 1. “TSHT” and “CIIV” are the methods proposed in \citetsupp{guo2018confidence} and \citetsupp{windmeijer2019confidence}.

In Setting D1, neither the majority rule nor the plurality rule is satisfied, so TSHT and CIIV cannot select valid IVs and do not achieve the desired coverage level of 95%. In contrast, TSCI is able to detect the existing invalidity and obtain valid inference. When the invalidity level is low (with a=0.1), TSCI may not be able to identify the invalidity, which leads to coverage below 95%.

In Setting D2, TSCI achieves the desired coverage level across invalidity levels, while TSHT and CIIV achieve the desired coverage only for relatively large invalidity levels. When the invalidity level is low, say a=0.1 and a=0.2, TSHT and CIIV are significantly impacted by locally invalid IVs and yield poor coverage. In comparison to TSHT and CIIV, TSCI is much more robust to low invalidity levels because it evaluates the total invalidity accumulated over all the invalid IVs.

D.4 Additional Results in Section 5.2

In this section, we report additional results for Setting B1 and results for Setting B2, and we introduce a setting with a binary IV, denoted as Setting B3.

For Setting B1, we report the coverage for Vio=2 in Table S2, and we further report the mean absolute bias and the confidence interval length for both violation forms in Tables S3 and S4, respectively. In Table S3, TSLS has a large bias due to the existence of invalid IVs; even in the oracle setting with the prior knowledge of the best \mathcal{V}_{q} approximating g, RF-Full and RF-Plug have a large bias. Compared with RF-Init, TSCI corrects the bias effectively and thus attains the desired 95% coverage. In Table S4, the CI of TSCI is shorter than that of the oracle method in settings with a small sample size or a weak interaction strength, because it fails to select the correct violation form in these settings.

TSCI-RF Proportions of selection TSLS Other RF(oracle)
vio a n Oracle Comp Robust Invalidity q^c=0\widehat{q}_{c}=0 q^c=1\widehat{q}_{c}=1 q^c=2\widehat{q}_{c}=2 q^c=3\widehat{q}_{c}=3 Init Plug Full
2 0.0 1000 0.92 0.01 0.01 0.00 1.00 0.00 0.00 0.00 0.00 0.82 0.30 0.01
3000 0.94 0.94 0.94 1.00 0.00 0.00 1.00 0.00 0.00 0.91 0.63 0.00
5000 0.94 0.94 0.94 1.00 0.00 0.00 1.00 0.00 0.00 0.88 0.77 0.00
0.5 1000 0.92 0.11 0.12 0.21 0.79 0.08 0.13 0.00 0.00 0.86 0.49 0.01
3000 0.94 0.93 0.92 1.00 0.00 0.00 0.97 0.03 0.00 0.90 0.37 0.00
5000 0.94 0.94 0.94 1.00 0.00 0.00 0.99 0.01 0.00 0.88 0.03 0.00
1.0 1000 0.95 0.89 0.89 0.96 0.04 0.02 0.93 0.01 0.00 0.89 0.40 0.01
3000 0.93 0.93 0.93 1.00 0.00 0.00 0.99 0.01 0.00 0.92 0.00 0.00
5000 0.93 0.93 0.93 1.00 0.00 0.00 1.00 0.00 0.00 0.90 0.00 0.00
Table S2: Coverage (at nominal level 0.95) for Setting B1 with Vio=2. The columns indexed with “TSCI-RF” correspond to our proposed TSCI with random forests, where the columns indexed with “Oracle”, “Comp” and “Robust” correspond to the estimators with \mathcal{V}_{q} selected by the oracle knowledge, the comparison method, and the robust method. The column indexed with “Invalidity” reports the proportion of detecting the proposed IV as invalid. The columns indexed with “Proportions of selection” report the proportions of basis sets \mathcal{V}_{q} for 0\leq q\leq 3 selected by TSCI-RF. The column indexed with “TSLS” corresponds to the TSLS estimator. The columns indexed with “Init”, “Plug”, “Full” correspond to the RF estimators without bias correction, the plug-in RF estimator and the no data-splitting RF estimator, with the oracle knowledge of the best \mathcal{V}_{q}.
TSCI-RF Proportions of selection TSLS Other RF(oracle)
vio a n Oracle Comp Robust Invalidity q^c=0\widehat{q}_{c}=0 q^c=1\widehat{q}_{c}=1 q^c=2\widehat{q}_{c}=2 q^c=3\widehat{q}_{c}=3 Init Plug Full
1 0.0 1000 0.02 0.53 0.53 0.01 0.99 0.01 0.00 0.00 0.56 0.13 0.48 0.30
3000 0.01 0.01 0.01 1.00 0.00 0.84 0.16 0.00 0.56 0.04 0.14 0.25
5000 0.00 0.00 0.00 1.00 0.00 0.85 0.15 0.00 0.56 0.03 0.05 0.23
0.5 1000 0.01 0.26 0.25 0.24 0.76 0.24 0.01 0.00 0.33 0.08 0.24 0.22
3000 0.00 0.00 0.00 1.00 0.00 0.97 0.02 0.01 0.33 0.03 0.16 0.19
5000 0.00 0.00 0.00 1.00 0.00 0.98 0.01 0.01 0.33 0.02 0.22 0.18
1.0 1000 0.00 0.01 0.01 0.95 0.05 0.93 0.01 0.00 0.23 0.04 0.15 0.13
3000 0.00 0.00 0.00 1.00 0.00 0.99 0.01 0.00 0.23 0.01 0.38 0.11
5000 0.00 0.00 0.00 1.00 0.00 0.98 0.02 0.01 0.23 0.01 0.37 0.10
2 0.0 1000 0.00 0.54 0.54 0.00 1.00 0.00 0.00 0.00 0.56 0.11 0.53 0.29
3000 0.00 0.00 0.00 1.00 0.00 0.00 1.00 0.00 0.56 0.05 0.13 0.25
5000 0.00 0.00 0.00 1.00 0.00 0.00 1.00 0.00 0.56 0.03 0.05 0.23
0.5 1000 0.01 0.38 0.38 0.21 0.79 0.08 0.13 0.00 0.33 0.08 0.34 0.23
3000 0.00 0.00 0.01 1.00 0.00 0.00 0.97 0.03 0.33 0.03 0.19 0.20
5000 0.00 0.00 0.01 1.00 0.00 0.00 0.99 0.01 0.33 0.03 0.29 0.19
1.0 1000 0.01 0.04 0.04 0.96 0.04 0.02 0.93 0.01 0.23 0.04 0.25 0.15
3000 0.00 0.00 0.00 1.00 0.00 0.00 0.99 0.01 0.23 0.02 0.52 0.12
5000 0.00 0.00 0.00 1.00 0.00 0.00 1.00 0.00 0.23 0.01 0.48 0.11
Table S3: Absolute Bias for Setting B1. The columns indexed with “TSCI-RF” correspond to our proposed TSCI with random forests, where the columns indexed with “Oracle”, “Comp” and “Robust” correspond to estimators with \mathcal{V}_{q} selected by the oracle knowledge, the comparison method, and the robust method. The column indexed with “Invalidity” reports the proportion of detecting the proposed IV as invalid. The columns indexed with “Proportions of selection” report the proportions of basis sets \mathcal{V}_{q} selected by TSCI-RF using the comparison method. The column indexed with “TSLS” corresponds to the TSLS estimator. The columns indexed with “Init”, “Plug”, “Full” correspond to the RF estimators without bias correction, the plug-in RF estimator and the no data-splitting RF estimator, with the oracle knowledge of the best \mathcal{V}_{q}.
TSCI-RF Proportions of selection TSLS Other RF(oracle)
vio a n Oracle Comp Robust Invalidity q^c=0\widehat{q}_{c}=0 q^c=1\widehat{q}_{c}=1 q^c=2\widehat{q}_{c}=2 q^c=3\widehat{q}_{c}=3 Init Plug Full
1 0.0 1000 0.49 0.11 0.11 0.01 0.99 0.01 0.00 0.00 0.08 0.49 0.82 0.22
3000 0.32 0.32 0.32 1.00 0.00 0.84 0.16 0.00 0.05 0.32 0.38 0.14
5000 0.23 0.23 0.23 1.00 0.00 0.85 0.15 0.00 0.04 0.23 0.27 0.11
0.5 1000 0.38 0.13 0.14 0.24 0.76 0.24 0.01 0.00 0.05 0.38 0.60 0.19
3000 0.22 0.22 0.23 1.00 0.00 0.97 0.02 0.01 0.03 0.22 0.26 0.11
5000 0.17 0.17 0.17 1.00 0.00 0.98 0.01 0.01 0.02 0.17 0.19 0.09
1.0 1000 0.25 0.24 0.24 0.95 0.05 0.93 0.01 0.00 0.04 0.25 0.33 0.14
3000 0.13 0.13 0.13 1.00 0.00 0.99 0.01 0.00 0.02 0.13 0.18 0.08
5000 0.10 0.10 0.10 1.00 0.00 0.98 0.02 0.01 0.02 0.10 0.13 0.06
2 0.0 1000 0.49 0.17 0.17 0.00 1.00 0.00 0.00 0.00 0.12 0.49 0.85 0.21
3000 0.31 0.31 0.31 1.00 0.00 0.00 1.00 0.00 0.07 0.31 0.38 0.14
5000 0.23 0.23 0.23 1.00 0.00 0.00 1.00 0.00 0.06 0.23 0.27 0.11
0.5 1000 0.38 0.16 0.16 0.21 0.79 0.08 0.13 0.00 0.07 0.38 0.70 0.18
3000 0.23 0.23 0.26 1.00 0.00 0.00 0.97 0.03 0.04 0.23 0.28 0.11
5000 0.17 0.17 0.21 1.00 0.00 0.00 0.99 0.01 0.03 0.17 0.20 0.09
1.0 1000 0.24 0.24 0.24 0.96 0.04 0.02 0.93 0.01 0.05 0.24 0.40 0.14
3000 0.13 0.13 0.13 1.00 0.00 0.00 0.99 0.01 0.03 0.13 0.22 0.08
5000 0.10 0.10 0.11 1.00 0.00 0.00 1.00 0.00 0.02 0.10 0.15 0.06
Table S4: Average CI length for Setting B1. The columns indexed with “TSCI-RF” correspond to TSCI with random forests, where the columns indexed with “Oracle”, “Comp” and “Robust” correspond to estimators with \mathcal{V}_{q} selected by the oracle knowledge, the comparison method, and the robust method. The column indexed with “Invalidity” reports the proportion of detecting the proposed IV as invalid. The columns indexed with “Proportions of selection” report the proportions of basis sets \mathcal{V}_{q} selected by TSCI-RF using the comparison method. The column indexed with “TSLS” corresponds to the TSLS estimator. The columns indexed with “Init”, “Plug”, “Full” correspond to the RF estimators without bias correction, the plug-in RF estimator and the no data-splitting RF estimator, with the oracle knowledge of the best \mathcal{V}_{q}.
TSCI-RF Proportions of selection TSLS Other RF(oracle)
vio a n Oracle Comp Robust Invalidity q^c=0\widehat{q}_{c}=0 q^c=1\widehat{q}_{c}=1 q^c=2\widehat{q}_{c}=2 q^c=3\widehat{q}_{c}=3 Init Plug Full
1 0 1000 0.84 0.84 0.84 1.00 0.00 1.00 0.00 0.00 0.58 0.81 0.16 0.00
3000 0.93 0.93 0.94 1.00 0.00 0.96 0.04 0.00 0.11 0.91 0.00 0.00
5000 0.93 0.94 0.94 1.00 0.00 0.95 0.05 0.00 0.01 0.93 0.00 0.00
0.5 1000 0.94 0.94 0.95 1.00 0.00 0.95 0.04 0.01 0.00 0.88 0.00 0.01
3000 0.93 0.93 0.92 1.00 0.00 0.95 0.05 0.00 0.00 0.92 0.00 0.01
5000 0.93 0.93 0.93 1.00 0.00 0.99 0.01 0.00 0.00 0.92 0.00 0.00
1 1000 0.94 0.94 0.95 1.00 0.00 0.98 0.01 0.00 0.00 0.92 0.00 0.01
3000 0.95 0.96 0.96 1.00 0.00 0.98 0.02 0.00 0.00 0.93 0.00 0.00
5000 0.93 0.93 0.93 1.00 0.00 0.99 0.01 0.00 0.00 0.91 0.00 0.00
2 0 1000 0.84 0.17 0.17 0.99 0.01 0.97 0.02 0.00 0.55 0.16 0.13 0.00
3000 0.96 0.96 0.94 1.00 0.00 0.00 0.99 0.01 0.14 0.93 0.00 0.00
5000 0.95 0.95 0.94 1.00 0.00 0.00 0.99 0.01 0.02 0.92 0.00 0.01
0.5 1000 0.95 0.73 0.72 0.92 0.08 0.17 0.70 0.05 0.00 0.68 0.00 0.03
3000 0.94 0.94 0.94 1.00 0.00 0.00 0.95 0.05 0.00 0.93 0.00 0.00
5000 0.95 0.95 0.95 1.00 0.00 0.00 0.98 0.02 0.00 0.92 0.00 0.00
1 1000 0.96 0.95 0.96 1.00 0.00 0.01 0.93 0.06 0.00 0.94 0.00 0.01
3000 0.96 0.95 0.94 1.00 0.00 0.00 0.95 0.05 0.00 0.93 0.00 0.00
5000 0.96 0.96 0.95 1.00 0.00 0.00 0.97 0.03 0.00 0.95 0.00 0.00
Table S5: CI coverage for Setting B2. The columns indexed with “TSCI-RF” correspond to TSCI with random forests, where the columns indexed with “Oracle”, “Comp” and “Robust” correspond to estimators with \mathcal{V}_{q} selected by the oracle knowledge, the comparison method, and the robust method. The column indexed with “Invalidity” reports the proportion of detecting the proposed IV as invalid. The columns indexed with “Proportions of selection” report the proportions of basis sets \mathcal{V}_{q} selected by TSCI-RF using the comparison method. The column indexed with “TSLS” corresponds to the TSLS estimator. The columns indexed with “Init”, “Plug”, “Full” correspond to the RF estimators without bias correction, the plug-in RF estimator and the no data-splitting RF estimator, with the oracle knowledge of the best \mathcal{V}_{q}.

For Setting B2, we report the empirical coverage in Table S5; the results are generally similar to those for Setting B1.

To approximate the real data analysis in Section 6, we further generate a binary IV as Zi=𝟏(Φ(Xi,6)>0.6)Z_{i}={\bf 1}(\Phi(X^{*}_{i,6})>0.6) and the covariates Xi,j=Xi,jX_{i,j}=X^{*}_{i,j} for 1in1\leq i\leq n and 1j5.1\leq j\leq 5. We consider the following models for f(Zi,Xi)f(Z_{i},X_{i}) and g(Zi,Xi)g(Z_{i},X_{i}),

  • Setting B3 (binary IV): f(Z_{i},X_{i})=Z_{i}\cdot(1+a\sum_{j=1}^{4}X_{ij}(1+X_{i,j+1}))-3/10\cdot\sum_{j=1}^{5}X_{ij} and g(Z_{i},X_{i})=Z_{i}+1/2\cdot Z_{i}\cdot(\sum_{j=1}^{3}X_{ij}).

Compared to Settings B1 and B2, the outcome model in Setting B3 involves the interaction between ZiZ_{i} and XiX_{i} while the treatment model involves a more complicated interaction term, whose strength is controlled by aa. We specify 𝒱0={𝐰(x)}\mathcal{V}_{0}=\{\overrightarrow{\bf w}(x)\} with 𝐰(x)={1,x1,,x5}\overrightarrow{\bf w}(x)=\{1,x_{1},\cdots,x_{5}\} and 𝒱1=𝒱0{z,zx1,,zx5}\mathcal{V}_{1}=\mathcal{V}_{0}\cup\{z,z\cdot x_{1},\cdots,z\cdot x_{5}\}. With the specified basis sets, we implement TSCI with random forests as detailed in Algorithm 1 in the main paper. In Table S6, we demonstrate our proposed TSCI method for Setting B3. The observations are coherent with those for Settings B1 and B2. The main difference between the binary IV (Setting B3) and the continuous IV (Settings B1 and B2) is that the treatment effect is not identifiable for a=0a=0, which happens only for the binary IV setting. However, with a non-zero interaction and a relatively large sample size, our proposed TSCI methods achieve the desired coverage. We also observe that the bias correction is effective and improves the coverage when the interaction aa is relatively small.
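To make the specified basis sets concrete, the following minimal sketch builds the corresponding design matrices; only the column definitions follow the text, while the function name and the array representation are illustrative.

```python
import numpy as np

def basis_b3(Z, X):
    """Design matrices for the candidate basis sets in Setting B3:
    V_0 = {1, x_1, ..., x_5} and V_1 = V_0 plus {z, z*x_1, ..., z*x_5}."""
    Z = np.asarray(Z, dtype=float).reshape(-1, 1)
    X5 = np.asarray(X, dtype=float)[:, :5]
    V0 = np.column_stack([np.ones(len(Z)), X5])   # intercept and covariates
    V1 = np.column_stack([V0, Z, Z * X5])         # add z and the z-by-x interactions
    return V0, V1
```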

TSCI-RF RF-Init
Bias Length Coverage Invalidity Bias Coverage
a n Orac Comp Robust Orac Comp Robust Orac Comp Robust Orac Orac
0.25 1000 0.01 0.56 0.56 0.42 0.23 0.23 0.90 0.28 0.28 0.31 0.10 0.82
3000 0.00 0.01 0.01 0.23 0.23 0.23 0.95 0.95 0.95 0.99 0.05 0.84
5000 0.00 0.00 0.00 0.18 0.18 0.18 0.94 0.94 0.94 1.00 0.03 0.86
0.50 1000 0.00 0.02 0.02 0.22 0.22 0.22 0.93 0.90 0.90 0.97 0.04 0.90
3000 -0.00 -0.00 -0.00 0.12 0.12 0.12 0.93 0.93 0.93 1.00 0.02 0.89
5000 0.00 0.00 0.00 0.09 0.09 0.09 0.95 0.95 0.95 1.00 0.02 0.89
0.75 1000 0.00 0.00 0.00 0.15 0.15 0.15 0.92 0.90 0.90 0.99 0.02 0.91
3000 0.00 0.00 0.00 0.08 0.08 0.08 0.94 0.94 0.94 1.00 0.01 0.89
5000 -0.00 -0.00 -0.00 0.06 0.06 0.06 0.93 0.93 0.93 1.00 0.01 0.91
Table S6: Bias, length, and coverage (at nominal level 0.95) for Setting B3. The columns indexed with “TSCI-RF” correspond to our proposed TSCI with random forests, where the columns indexed with “Bias”, “Length”, and “Coverage” report the bias of the point estimator and the length and empirical coverage of the constructed CI, respectively. The columns indexed with “Oracle”, “Comp”, and “Robust” correspond to TSCI estimators with 𝒱q\mathcal{V}_{q} selected by the oracle knowledge, the comparison method, and the robust method, respectively. The column indexed with “Invalidity” reports the proportion of detecting the proposed IV as invalid. The columns indexed with “RF-Init” correspond to the RF estimator without bias correction but with the oracle knowledge of the best 𝒱q\mathcal{V}_{q}.

D.5 Binary Treatment

We now consider the binary treatment setting and explore the finite-sample performance of our proposed TSCI method. We consider the outcome model Yi=Diβ+Zi+0.2j=1pxXij+ϵiY_{i}=D_{i}\beta+Z_{i}+0.2\cdot\sum_{j=1}^{p_{x}}{X_{ij}}+\epsilon_{i} with β=1\beta=1 and px=20.p_{x}=20. We generate ϵi\epsilon_{i} and δi\delta_{i} following a bivariate normal distribution with zero means, unit variances, and covariance 0.5. We generate the binary treatment DiD_{i} with the conditional mean

𝐄(DiZi,Xi)=exp(f(Zi,Xi)+δi)1+exp(f(Zi,Xi)+δi),{\mathbf{E}}(D_{i}\mid Z_{i},X_{i})=\frac{\exp(f(Z_{i},X_{i})+\delta_{i})}{1+\exp(f(Z_{i},X_{i})+\delta_{i})},

where f(Zi,Xi)f(Z_{i},X_{i}) is specified in the following two ways.

  • Setting 1 (continuous IV): generate Zi,XiZ_{i},X_{i} following Section 5.2. f(Zi,Xi)=25/12+Zi+Zi2+1/8Zi4+Zi(aj=15Xi,j)3/10j=1pxXi,jf(Z_{i},X_{i})=-25/12+Z_{i}+Z_{i}^{2}+1/8\cdot Z_{i}^{4}+Z_{i}\cdot(a\cdot\sum_{j=1}^{5}X_{i,j})-3/10\cdot\sum_{j=1}^{p_{x}}X_{i,j}.

  • Setting 2 (binary IV): f(Zi,Xi)=Zi(1+aj=15Xij)0.3j=1pxXijf(Z_{i},X_{i})=Z_{i}\cdot(1+a\cdot\sum_{j=1}^{5}X_{ij})-0.3\cdot\sum_{j=1}^{p_{x}}X_{ij}. A simulation sketch of this binary-treatment design is given below.
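As referenced above, the sketch below generates data from the binary-treatment design of Setting 2. The marginal distributions of ZiZ_{i} and XiX_{i} are not restated here, so a Bernoulli(0.5) IV and standard normal covariates are used as assumptions; the outcome model, the logistic conditional mean, and f(Zi,Xi)f(Z_{i},X_{i}) follow the displays above.

```python
import numpy as np

def generate_binary_treatment(n, a, beta=1.0, p_x=20, rho=0.5, seed=0):
    """Simulate the binary-treatment design of Setting 2 (binary IV).

    Assumed: Z ~ Bernoulli(0.5) and X ~ N(0, I_{p_x}); (eps, delta) are
    bivariate normal with zero means, unit variances, and covariance rho.
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p_x))
    Z = rng.binomial(1, 0.5, size=n).astype(float)
    err = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)
    eps, delta = err[:, 0], err[:, 1]
    f = Z * (1 + a * X[:, :5].sum(axis=1)) - 0.3 * X.sum(axis=1)
    prob = 1.0 / (1.0 + np.exp(-(f + delta)))       # logistic conditional mean of D
    D = rng.binomial(1, prob).astype(float)          # binary treatment
    Y = D * beta + Z + 0.2 * X.sum(axis=1) + eps     # Z enters directly: the IV is invalid
    return Y, D, Z, X
```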

TSCI-RF RF-Init TSCI-RF
Bias Length Coverage Invalidity Bias Coverage Weak IV
Setting a n Orac Comp Robust Orac Comp Robust Orac Comp Robust Orac Orac
1000 0.00 0.00 0.00 1.08 1.08 1.08 0.92 0.92 0.92 1.00 0.05 0.94 0.99
3000 0.01 0.01 0.00 0.59 0.59 0.61 0.94 0.93 0.93 1.00 0.00 0.95 0.00
0.0 5000 0.00 0.00 0.01 0.44 0.45 0.89 0.94 0.93 0.95 1.00 0.01 0.94 0.00
1000 0.03 0.03 0.03 1.12 1.12 1.12 0.90 0.90 0.90 1.00 0.07 0.94 0.99
3000 0.00 0.00 0.00 0.62 0.62 0.62 0.93 0.93 0.93 1.00 0.01 0.94 0.00
0.5 5000 0.00 0.00 0.01 0.45 0.45 0.68 0.94 0.93 0.93 1.00 0.01 0.94 0.00
1000 0.01 0.01 0.01 1.22 1.22 1.22 0.92 0.92 0.92 1.00 0.04 0.95 0.99
3000 0.01 0.01 0.00 0.66 0.66 0.67 0.95 0.95 0.95 1.00 0.01 0.96 0.00
1.0 5000 0.00 0.00 0.00 0.49 0.49 0.62 0.93 0.93 0.93 1.00 0.01 0.94 0.00
1000 0.01 0.01 0.01 1.30 1.30 1.30 0.89 0.89 0.89 1.00 0.06 0.94 1.00
3000 0.00 0.00 0.00 0.73 0.73 0.74 0.95 0.95 0.95 1.00 0.01 0.96 0.01
1 1.5 5000 0.00 0.00 0.01 0.53 0.54 0.82 0.93 0.92 0.93 1.00 0.01 0.94 0.00
1000 0.37 0.37 0.37 1.62 1.62 1.62 0.66 0.66 0.66 0.93 0.41 0.78 1.00
3000 0.41 0.41 0.41 1.25 1.25 1.25 0.60 0.60 0.60 1.00 0.42 0.72 1.00
0.0 5000 0.35 0.35 0.35 1.05 1.05 1.05 0.61 0.61 0.61 1.00 0.40 0.65 1.00
1000 0.30 0.30 0.30 1.21 1.21 1.21 0.65 0.65 0.65 1.00 0.36 0.77 1.00
3000 0.21 0.21 0.21 1.18 1.18 1.18 0.73 0.73 0.73 1.00 0.29 0.83 1.00
0.5 5000 0.10 0.10 0.10 1.09 1.09 1.09 0.82 0.82 0.82 1.00 0.22 0.89 1.00
1000 0.16 0.20 0.16 1.35 1.30 1.35 0.77 0.77 0.77 0.93 0.26 0.87 1.00
3000 0.03 0.03 0.03 1.11 1.10 1.11 0.88 0.88 0.88 0.99 0.11 0.94 1.00
1.0 5000 0.00 0.00 0.00 0.92 0.92 0.92 0.89 0.89 0.89 1.00 0.07 0.93 0.57
1000 0.06 0.22 0.06 1.59 1.31 1.59 0.83 0.67 0.83 0.75 0.18 0.93 1.00
3000 0.01 0.02 0.01 1.14 1.13 1.14 0.90 0.90 0.90 0.98 0.08 0.93 0.87
2 1.5 5000 0.00 0.00 0.00 0.91 0.91 0.91 0.93 0.93 0.93 1.00 0.04 0.95 0.02
Table S7: Bias, length, and coverage (at nominal level 0.95) for the binary treatment model under Setting 1 (continuous IV) and Setting 2 (binary IV). The columns indexed with “TSCI-RF” correspond to our proposed TSCI with random forests, where the columns indexed with “Bias”, “Length”, and “Coverage” report the bias of the point estimator and the length and empirical coverage of the constructed confidence interval, respectively. The columns indexed with “Oracle”, “Comp”, and “Robust” correspond to the TSCI estimators with 𝒱q\mathcal{V}_{q} selected by the oracle knowledge, the comparison method, and the robust method, respectively. The column indexed with “Invalidity” reports the proportion of detecting the proposed IV as invalid. The columns indexed with “RF-Init” correspond to the RF estimator without bias correction but with the oracle knowledge of the best 𝒱q\mathcal{V}_{q}. The column indexed with “Weak IV” reports the proportion out of 500 simulations reporting Qmax<1.Q_{\max}<1.

The binary treatment results are summarized in Table S7. The main observations are similar to those for the continuous treatment reported in the main paper; we point out the major differences in the following. The settings with the binary treatment are in general more challenging since the IV strength is relatively weak. To quantify this, the column indexed with “Weak IV” reports the proportion of simulations with Qmax<1.Q_{\max}<1. For settings where our proposed generalized IV strength is large enough that Qmax1,Q_{\max}\geq 1, our proposed TSCI method achieves the desired coverage level. Even when the generalized IV strength leads to Qmax<1,Q_{\max}<1, our proposed (oracle) TSCI may still achieve the desired coverage level for Setting 1.

Appendix E Additional Results for Real Data Analysis

This section contains additional results for the real data analysis. Figure S4 shows the importance scores of all variables in the random forests fit, based on which the six most important covariates are selected to construct the basis set 𝒱1\mathcal{V}_{1} in Section 6.

Figure S4: The importance score of each variable in the random forests fit.
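For completeness, a minimal sketch of this screening step is given below. It assumes arrays D, Z, X and a list of covariate names x_names are already loaded; fitting a first-stage random forests of the treatment on the IV and covariates and ranking covariates by impurity-based importance is one natural implementation, not necessarily the exact one used to produce Figure S4.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def top_covariates(D, Z, X, x_names, k=6):
    """Rank covariates by random-forest importance in the first-stage fit of D on (Z, X)."""
    rf = RandomForestRegressor(n_estimators=500, random_state=0)
    rf.fit(np.column_stack([Z, X]), D)
    imp = rf.feature_importances_[1:]          # drop the IV's own importance score
    order = np.argsort(imp)[::-1][:k]
    return [x_names[j] for j in order]
```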

E.1 Other specifications of 𝒱2\mathcal{V}_{2}

In the following, in addition to the basis set 𝒱2\mathcal{V}_{2} considered in Section 6, we consider three more specifications of 𝒱2\mathcal{V}_{2}, as detailed in Table S8, and test the robustness of TSCI’s selection process.

𝒱2\mathcal{V}_{2} 𝒱1\mathcal{V}_{1}\cup\{\texttt{nearc4}\cdot\texttt{reg1},\texttt{nearc4}\cdot\texttt{reg2},\texttt{nearc4}\cdot\texttt{reg3},\texttt{nearc4}\cdot\texttt{reg4},\texttt{nearc4}\cdot\texttt{reg5},\texttt{nearc4}\cdot\texttt{reg6},\texttt{nearc4}\cdot\texttt{reg7},\texttt{nearc4}\cdot\texttt{reg8}\}
𝒱2(1)\mathcal{V}_{2}^{(1)} 𝒱1{nearc4exper3}\mathcal{V}_{1}\cup\{\texttt{nearc4}\cdot\texttt{exper}^{3}\}
𝒱2(2)\mathcal{V}_{2}^{(2)} 𝒱1{nearc4experblack,nearc4expersouth,nearc4expersmsa,nearc4expersmsa66\mathcal{V}_{1}\cup\{\texttt{nearc4}\cdot\texttt{exper}\cdot\texttt{black},\texttt{nearc4}\cdot\texttt{exper}\cdot\texttt{south},\texttt{nearc4}\cdot\texttt{exper}\cdot\texttt{smsa},\texttt{nearc4}\cdot\texttt{exper}\cdot\texttt{smsa66}, nearc4expersqblack,nearc4expersqsouth,nearc4expersqsmsa,nearc4expersqsmsa66\texttt{nearc4}\cdot\texttt{expersq}\cdot\texttt{black},\texttt{nearc4}\cdot\texttt{expersq}\cdot\texttt{south},\texttt{nearc4}\cdot\texttt{expersq}\cdot\texttt{smsa},\texttt{nearc4}\cdot\texttt{expersq}\cdot\texttt{smsa66}, nearc4blacksouth,nearc4blacksmsa,nearc4blacksmsa66\texttt{nearc4}\cdot\texttt{black}\cdot\texttt{south},\texttt{nearc4}\cdot\texttt{black}\cdot\texttt{smsa},\texttt{nearc4}\cdot\texttt{black}\cdot\texttt{smsa66}, nearc4southsmsa,nearc4southsmsa66,nearc4smsasmsa66}\texttt{nearc4}\cdot\texttt{south}\cdot\texttt{smsa},\texttt{nearc4}\cdot\texttt{south}\cdot\texttt{smsa66},\texttt{nearc4}\cdot\texttt{smsa}\cdot\texttt{smsa66}\}
𝒱2(3)\mathcal{V}_{2}^{(3)} 𝒱1{nearc4exper3,nearc4experblack,nearc4expersouth,nearc4expersmsa,nearc4expersmsa66\mathcal{V}_{1}\cup\{\texttt{nearc4}\cdot\texttt{exper}^{3},\texttt{nearc4}\cdot\texttt{exper}\cdot\texttt{black},\texttt{nearc4}\cdot\texttt{exper}\cdot\texttt{south},\texttt{nearc4}\cdot\texttt{exper}\cdot\texttt{smsa},\texttt{nearc4}\cdot\texttt{exper}\cdot\texttt{smsa66}, nearc4expersqblack,nearc4expersqsouth,nearc4expersqsmsa,nearc4expersqsmsa66\texttt{nearc4}\cdot\texttt{expersq}\cdot\texttt{black},\texttt{nearc4}\cdot\texttt{expersq}\cdot\texttt{south},\texttt{nearc4}\cdot\texttt{expersq}\cdot\texttt{smsa},\texttt{nearc4}\cdot\texttt{expersq}\cdot\texttt{smsa66}, nearc4blacksouth,nearc4blacksmsa,nearc4blacksmsa66\texttt{nearc4}\cdot\texttt{black}\cdot\texttt{south},\texttt{nearc4}\cdot\texttt{black}\cdot\texttt{smsa},\texttt{nearc4}\cdot\texttt{black}\cdot\texttt{smsa66}, nearc4southsmsa,nearc4southsmsa66,nearc4smsasmsa66}\texttt{nearc4}\cdot\texttt{south}\cdot\texttt{smsa},\texttt{nearc4}\cdot\texttt{south}\cdot\texttt{smsa66},\texttt{nearc4}\cdot\texttt{smsa}\cdot\texttt{smsa66}\}
Table S8: Different specifications of the second basis set 𝒱2\mathcal{V}_{2}. 𝒱2\mathcal{V}_{2} is the specification used in Section 6, augmenting 𝒱1\mathcal{V}_{1} with the interactions of the IV with the region indicators; 𝒱2(1)\mathcal{V}_{2}^{(1)} augments 𝒱1\mathcal{V}_{1} with the interaction of the IV with exper3\texttt{exper}^{3}; 𝒱2(2)\mathcal{V}_{2}^{(2)} augments 𝒱1\mathcal{V}_{1} with the interactions of the IV with all pairwise products of the six most important covariates, excluding the product of exper and expersq; 𝒱2(3)\mathcal{V}_{2}^{(3)} combines the additional basis functions of 𝒱2(1)\mathcal{V}_{2}^{(1)} and 𝒱2(2)\mathcal{V}_{2}^{(2)}.
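To make these specifications explicit, the sketch below builds the corresponding design matrices from a pandas data frame df holding the Card data with the column names used in Section 6; V1_cols, the list of columns already spanning 𝒱1\mathcal{V}_{1}, is a hypothetical input. This illustrates the basis construction only and is not the packaged implementation.

```python
import pandas as pd
from itertools import combinations

def build_V2_variants(df, V1_cols):
    """Construct the four specifications of V2 in Table S8 from the data frame df."""
    iv = df["nearc4"]
    V1 = df[V1_cols].copy()

    # V2: IV interacted with the region indicators reg1, ..., reg8
    V2 = V1.copy()
    for r in [f"reg{k}" for k in range(1, 9)]:
        V2[f"nearc4:{r}"] = iv * df[r]

    # V2^(1): IV interacted with exper^3
    V2_1 = V1.copy()
    V2_1["nearc4:exper3"] = iv * df["exper"] ** 3

    # V2^(2): IV interacted with all pairwise products of the six most
    # important covariates, excluding the pair (exper, expersq)
    top6 = ["exper", "expersq", "black", "south", "smsa", "smsa66"]
    V2_2 = V1.copy()
    for u, v in combinations(top6, 2):
        if {u, v} == {"exper", "expersq"}:
            continue
        V2_2[f"nearc4:{u}:{v}"] = iv * df[u] * df[v]

    # V2^(3): V2^(2) augmented with the IV interacted with exper^3
    V2_3 = V2_2.copy()
    V2_3["nearc4:exper3"] = iv * df["exper"] ** 3
    return V2, V2_1, V2_2, V2_3
```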

We implement TSCI with random forests as detailed in Algorithm 1, specifying 𝒱0,𝒱1\mathcal{V}_{0},\mathcal{V}_{1} as in Section 6 and 𝒱2\mathcal{V}_{2} as in Table S8. As reported in Table S9, the point estimators and confidence intervals are relatively stable across the different specifications of 𝒱2.\mathcal{V}_{2}.

Specification TSCI Proportions of selection Prob(Qmax>q^cQ_{\text{max}}>\widehat{q}_{c}) IV strength
Estimate CI q^c=0\widehat{q}_{c}=0 q^c=1\widehat{q}_{c}=1 q^c=2\widehat{q}_{c}=2
TSCI-𝒱2\mathcal{V}_{2} 0.0604 (0.0294, 0.0914) 59.2% 38.2% 2.6% 93.6% 112.7950
TSCI-𝒱2(1)\mathcal{V}_{2}^{(1)} 0.0575 (0.0263, 0.0886) 40.0% 58.6% 1.4% 2.3% 111.8835
TSCI-𝒱2(2)\mathcal{V}_{2}^{(2)} 0.0614 (0.0303, 0.0916) 55.0% 40.0% 5.0% 72.9% 111.6233
TSCI-𝒱2(3)\mathcal{V}_{2}^{(3)} 0.0575 (0.0268, 0.0891) 39.2% 60.6% 0.2% 0% 113.0134
Table S9: Inference and basis selection of TSCI with different specifications of 𝒱2\mathcal{V}_{2}. The columns indexed with “TSCI” correspond to the point estimator and the CI reported by TSCI in Algorithm 1 with 𝒱q^c\mathcal{V}_{\widehat{q}_{c}}. The columns indexed with “Proportions of selection” report the proportions of the different selections of q^c\widehat{q}_{c} over 500 repetitions. The column indexed with “Prob(Qmax>q^cQ_{\text{max}}>\widehat{q}_{c})” reports the proportion of repetitions with Qmax>q^c;Q_{\max}>\widehat{q}_{c}; in this case, TSCI is able to select q^c\widehat{q}_{c} without any weak IV constraint. The last column shows the average IV strength for the TSCI estimators.

E.2 Falsification Argument of Condition 1

We propose in the following a falsification argument regarding Condition 1 for the data analysis. In particular, we demonstrate that the regression model using the basis functions in 𝒱1\mathcal{V}_{1} provides a good approximation of g(z,x)g(z,x) but not of f(z,x)f(z,x). With the TSCI estimator β^𝒱q^c\widehat{\beta}_{\mathcal{V}_{\widehat{q}_{c}}}, we construct the pseudo outcome Y~=YDβ^𝒱q^c;\widetilde{Y}=Y-D\widehat{\beta}_{\mathcal{V}_{\widehat{q}_{c}}}; if β^𝒱q^c\widehat{\beta}_{\mathcal{V}_{\widehat{q}_{c}}} is a reasonably good estimator of β\beta, then the pseudo outcomes {Y~i}1in\{\widetilde{Y}_{i}\}_{1\leq i\leq n} can be used as proxies of {g(Zi,Xi)}1in.\{g(Z_{i},X_{i})\}_{1\leq i\leq n}. We then implement two OLS regressions of the pseudo outcome on the basis functions in 𝒱1\mathcal{V}_{1} and in 𝒱2\mathcal{V}_{2} as defined in Table 3. In addition, we build a random forests prediction model for the pseudo outcome with the predictors ZiZ_{i} and XiX_{i}. To evaluate the performance, we split the data into a training set 𝒜2\mathcal{A}_{2} and a test set 𝒜1\mathcal{A}_{1} as in Section 3.1, where the test set 𝒜1\mathcal{A}_{1} is used to estimate the out-of-sample MSE of the models constructed on the training set 𝒜2\mathcal{A}_{2}. We randomly split the data 500 times and report the violin plots of the 500 MSE values in the left panel of Figure S5. Since OLS with our specified basis set 𝒱1\mathcal{V}_{1} achieves nearly the same prediction performance as the random forests model, 𝒱1\mathcal{V}_{1} provides a good approximation of the function g()g(\cdot), provided that the TSCI estimator is accurate. In comparison, we replace the pseudo outcome with the treatment variable and compare the MSE of the same three prediction models. In the right panel of Figure S5, the random forests model performs much better than OLS with the basis functions in 𝒱1\mathcal{V}_{1} or 𝒱2\mathcal{V}_{2}, indicating that 𝒱1\mathcal{V}_{1} does not provide a good approximation of f()f(\cdot) in the treatment model.
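A minimal sketch of this falsification check is given below. It assumes the pseudo outcome (or the treatment) is available as a numeric array, together with design matrices V1 and V2 for the two basis sets and a matrix ZX stacking ZiZ_{i} and XiX_{i}; the 50/50 split proportion and the random-forests settings are assumptions, since the exact split from Section 3.1 is not restated here.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

def falsification_mse(target, V1, V2, ZX, n_splits=500, seed=0):
    """Out-of-sample MSE of OLS on V1, OLS on V2, and random forests on (Z, X)."""
    mse = {"V1": [], "V2": [], "RF": []}
    for s in range(n_splits):
        idx_train, idx_test = train_test_split(
            np.arange(len(target)), test_size=0.5, random_state=seed + s)
        for name, design, model in [
                ("V1", V1, LinearRegression()),
                ("V2", V2, LinearRegression()),
                ("RF", ZX, RandomForestRegressor(n_estimators=200, random_state=seed + s))]:
            model.fit(design[idx_train], target[idx_train])
            pred = model.predict(design[idx_test])
            mse[name].append(np.mean((target[idx_test] - pred) ** 2))
    return mse

# Left panel of Figure S5: target = Y - D * beta_hat; right panel: target = D.
```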

Figure S5: Violin plots of the MSE for different prediction models, where the left and right panels correspond to using Y~i=YiDiβ^𝒱q^c\widetilde{Y}_{i}=Y_{i}-D_{i}\widehat{\beta}_{\mathcal{V}_{\widehat{q}_{c}}} and DiD_{i} as the regression outcome variable, respectively. In both panels, “V1”, “V2”, and “RF” stand for OLS regression with 𝒱1\mathcal{V}_{1}, OLS regression with 𝒱2\mathcal{V}_{2}, and random forests with ZiZ_{i} and XiX_{i}, respectively, where 𝒱1\mathcal{V}_{1} and 𝒱2\mathcal{V}_{2} are specified in Table 3.