Robustness Against Weak or Invalid Instruments: Exploring Nonlinear Treatment Models with Machine Learning
Abstract
We discuss causal inference for observational studies with possibly invalid instrumental variables. We propose a novel methodology called two-stage curvature identification (TSCI) by exploring the nonlinear treatment model with machine learning. The first-stage machine learning enables improving the instrumental variable’s strength and adjusting for different forms of violating the instrumental variable assumptions. The success of TSCI requires the instrumental variable’s effect on treatment to differ from its violation form. A novel bias correction step is implemented to remove bias resulting from the potentially high complexity of machine learning. Our proposed TSCI estimator is shown to be asymptotically unbiased and Gaussian even if the machine learning algorithm does not consistently estimate the treatment model. Furthermore, we design a data-dependent method to choose the best among several candidate violation forms. We apply TSCI to study the effect of education on earnings.
Keywords: Generalized IV strength, Confidence interval, Random forests, Boosting, Neural network.
1 Introduction
Observational studies are major sources for inferring causal effects when randomized experiments are not feasible. But such causal inference from observational studies requires strong assumptions and may be invalid due to the presence of unmeasured confounders. The instrumental variable (IV) regression is a practical and highly popular causal inference approach in the presence of unmeasured confounders. The IVs are required to satisfy three assumptions: conditioning on the baseline covariates, (A1) the IVs are associated with the treatment; (A2) the IVs are not associated with the unmeasured confounders; (A3) the IVs do not directly affect the outcome.
Despite the popularity of the IV method, there is a significant concern about whether the used IVs satisfy (A1)-(A3) in practice. Assumption (A1) requires the IV to be strongly associated with the treatment variable, which can be checked with an F-test in a first-stage linear regression model. Inference with the assumption (A1) being violated has been actively investigated under the name of weak IV (Stock et al., 2002; Staiger and Stock, 1997, e.g.). Assumptions (A2) and (A3) ensure that the IV only affects the outcome through the treatment. If an IV violates either (A2) or (A3), we call it an invalid IV and define its functional form of violating (A2) and (A3) as the violation form, e.g., a linear violation. In the just-identification regime, most empirical analyses rely on external knowledge to argue about the validity of (A2) and (A3). However, there is a pressing need to develop a robust causal inference method against the proposed IVs violating the classical assumptions.
1.1 Our results and contribution
We aim to devise a robust IV framework that leads to causal identification even if the IVs proposed by domain experts may violate assumptions (A1) to (A3). Our framework provides a robustness guarantee even when all proposed IVs are invalid, including the most common regime with a single IV that is possibly invalid. It is well known that the treatment effect is not identifiable when there is no constraint on how the IVs violate assumptions (A2) and (A3). Our key identification assumption is that the violations of (A2) and (A3) arise from simpler forms than the association between the treatment and the IVs; that is, we exclude “special coincidences” such as the case where the IVs violate (A2) and (A3) through linear forms while the conditional mean model of the treatment given the IVs is also linear. This identification condition can be evaluated with the generalized IV strength measure introduced in Section 3.3 for a user-specified violation form.
We propose a novel two-stage curvature identification (TSCI) method to infer the treatment effect. An important operational step is to fit the treatment model with a machine learning (ML) algorithm, e.g., random forests, boosting, or deep neural networks. This first-stage ML model enhances the IV’s strength by learning a general conditional mean model of treatment given the IVs, instead of restricting to a linear association model. In the second stage, we leverage the ML prediction and adjust for different IV violation forms. A main novelty of our proposal is to estimate the bias resulting from high complexity of the first-stage ML and implement a corresponding bias correction step; see more details in Section 3.2.
We show that the TSCI estimator is asymptotically Gaussian when the generalized IV strength measure is sufficiently large. We further devise a data-dependent way of choosing the most suitable violation form among a collection of IV violation forms. By including the valid IV setting into the selection, our proposed general TSCI methodology does not exclude the possibility of the IVs being valid but provides a robust IV method against different invalidity forms.
To sum up, the contribution of the current paper is two-fold:
1. TSCI explores a nonlinear treatment model with ML and leads to more reliable causal conclusions than existing methods assuming valid IVs.
2. TSCI provides an efficient way of integrating the first-stage ML. A methodological novelty of TSCI is the bias correction in its second stage, which addresses the issue of high complexity and potential overfitting of ML.
1.2 Comparison to existing literature
There is relevant literature on causal inference when assumptions (A2) and (A3) are violated. When all IVs are invalid, Lewbel (2012); Tchetgen et al. (2021) proposed an identification strategy by assuming that the treatment model has heteroscedastic error but the covariance between the treatment and outcome model errors is homoscedastic. Our proposed TSCI is based on a different idea by exploring the nonlinear structure between the treatment and the IVs, not requiring anything on homo- and heteroscedasticity. Bowden et al. (2015) and Kolesár et al. (2015) assumed that the direct effect of the IVs on the outcome is nearly orthogonal to the effect of the IVs on the treatment. Han (2008); Bowden et al. (2016); Kang et al. (2016); Windmeijer et al. (2019); Guo et al. (2018); Windmeijer et al. (2021) considered the setting that a proportion of IVs are valid and conducted causal inference by selecting valid IVs in a data-dependent way. Along this line, Guo (2023) proposed a uniform inference procedure that is robust to the IV selection error. More recently, Sun et al. (2023) leveraged the number of valid IVs and used the interaction of genetic markers to identify the causal effect. In contrast to assuming the orthogonality and existence of valid IVs, our proposed TSCI is effective even if all IVs are invalid and the effect orthogonality condition does not hold. Such a setting is especially useful in handling econometric applications with a single IV.
The nonlinear structure of the treatment model has been explored in the econometric and the more recent ML literature. In the IV literature, Kelejian (1971); Amemiya (1974); Newey (1990); Hausman et al. (2012) considered constructing a non-parametric treatment model and Belloni et al. (2012) proposed constructing the optimal IVs with a Lasso-type first-stage estimator. More recently, Chen et al. (2023); Liu et al. (2020) proposed constructing the IV as the first-stage prediction given by the ML algorithm. All of the above works are focused on the valid IV setting. In contrast, our current paper applies to the more robust regime where the IVs are allowed to violate assumptions (A2) or (A3) in a range of forms. Even in the context of valid IVs, our proposed TSCI may lead to a more accurate estimator than directly using the predicted value for the treatment model as the new IV (Chen et al., 2023; Liu et al., 2020); see Section 3.5 for details.
ML algorithms have also been integrated into IV analysis to better accommodate complicated outcome models. In Sections 3.5 and 5.1, we provide detailed comparisons to Double Machine Learning (DML) proposed in Chernozhukov et al. (2018). Athey et al. (2019) proposed the generalized random forests to infer heterogeneous treatment effects. Both works are based on valid IVs, while our current paper assumes a homogeneous treatment effect and focuses on the different robust IV framework with possibly invalid IVs.
Finally, our proposed TSCI methodology can be used to check the IV validity for the just-identification case. This is distinct from the Sargan test, J test, or specification test (Hansen, 1982; Sargan, 1958; Woutersen and Hausman, 2019) which are used to test IV validity in the over-identification case.
Notation. For a matrix $A$, a vector $a$, and a set of indices $\mathcal{S}$, we use $A_{\mathcal{S}\cdot}$ to denote the submatrix of $A$ whose row indices belong to $\mathcal{S}$, and $a_{\mathcal{S}}$ to denote the sub-vector of $a$ with indices in $\mathcal{S}$. For a set $\mathcal{S}$, $|\mathcal{S}|$ denotes its cardinality. For a vector $a \in \mathbb{R}^p$, the $\ell_q$ norm of $a$ is defined as $\|a\|_q = (\sum_{j=1}^p |a_j|^q)^{1/q}$ for $q \ge 1$, with $\|a\|_{\infty} = \max_{1\le j\le p}|a_j|$ and $\|a\|_0 = |\{1 \le j \le p : a_j \neq 0\}|$. We use $c$ and $C$ to denote generic positive constants that may vary from place to place. For a sequence of random variables $X_n$ indexed by $n$, we use $X_n \stackrel{p}{\to} X$ and $X_n \stackrel{d}{\to} X$ to represent that $X_n$ converges to $X$ in probability and in distribution, respectively. For two positive sequences $a_n$ and $b_n$, $a_n \lesssim b_n$ means that there exists a constant $C > 0$ such that $a_n \le C\, b_n$ for all $n$; $a_n \asymp b_n$ if $a_n \lesssim b_n$ and $b_n \lesssim a_n$, and $a_n \ll b_n$ if $\limsup_{n\to\infty} a_n / b_n = 0$. For a matrix $A$, we use $\operatorname{Tr}(A)$ to denote its trace, $\operatorname{rank}(A)$ to denote its rank, and $\|A\|_F$, $\|A\|_2$, and $\|A\|_{\infty}$ to denote its Frobenius norm, spectral norm, and element-wise maximum norm, respectively. For a square matrix $A$, $A^2$ denotes the matrix multiplication of $A$ by itself. For a symmetric matrix $B$, we use $\lambda_{\max}(B)$ and $\lambda_{\min}(B)$ to denote its maximum and minimum eigenvalues, respectively.
2 Invalid instruments: modeling and identification
We consider i.i.d. data $\{(Y_i, D_i, Z_i, X_i)\}_{1 \le i \le n}$ where, for the $i$-th observation, $Y_i \in \mathbb{R}$ and $D_i \in \mathbb{R}$ respectively denote the outcome and the treatment, and $Z_i \in \mathbb{R}^{p_z}$ and $X_i \in \mathbb{R}^{p_x}$ respectively denote the IVs and measured covariates.
2.1 Examples for effect identification with nonlinear treatment models
To explain the main idea, we start with the special case with no covariates,
$$Y_i = D_i\,\beta + Z_i^{\top}\pi + \epsilon_i, \qquad D_i = f(Z_i) + \delta_i, \qquad (1)$$
where the errors $\epsilon_i$ and $\delta_i$ satisfy $\mathbb{E}[\epsilon_i \mid Z_i] = 0$ and $\mathbb{E}[\delta_i \mid Z_i] = 0.$ These errors $\epsilon_i$ and $\delta_i$ are typically correlated due to unobserved confounding. The outcome model in (1) is commonly used in the invalid IV literature (Small, 2007; Kang et al., 2016; Guo et al., 2018; Windmeijer et al., 2019), with $Z_i^{\top}\pi$ standing for the violation of IV assumptions (A2) and (A3). The treatment model in (1) is not a causal generating model between $D_i$ and $Z_i$ but only stands for a probability model between $D_i$ and $Z_i$. Specifically, the treatment assignment might be directly influenced by unmeasured confounders, but the treatment model in (1) does not explicitly state this generating process. Instead, for the joint distribution of $D_i$ and $Z_i$, we define the conditional expectation $f(Z_i) = \mathbb{E}[D_i \mid Z_i]$, which leads to the treatment model in (1). We discuss the structural equation model interpretation of the model (1) in Section 2.3, which explicitly characterizes how unmeasured confounders may affect the treatment assignment.
We introduce some projection notation to facilitate the discussion. For a random variable $W_i$, define its best linear approximation by $Z_i$ as $\mathcal{E}^{L}[W_i \mid Z_i] = \gamma_0^{*} + Z_i^{\top}\gamma_1^{*}$ with $(\gamma_0^{*}, \gamma_1^{*}) = \arg\min_{\gamma_0, \gamma_1} \mathbb{E}\,(W_i - \gamma_0 - Z_i^{\top}\gamma_1)^2$. For $W_i$, write $\mathcal{R}(W_i) = W_i - \mathcal{E}^{L}[W_i \mid Z_i]$ and note that $\mathbb{E}[\mathcal{R}(W_i)\, Z_i] = 0.$ We apply $\mathcal{R}$ to the model (1) and obtain $\mathcal{R}(Y_i) = \beta\,\mathcal{R}(D_i) + \mathcal{R}(\epsilon_i)$, where the linear violation form $Z_i^{\top}\pi$ in (1) is removed after applying the transformation $\mathcal{R}$. If $f$ is nonlinear, then $\mathcal{R}(f(Z_i)) \neq 0$, and the adjusted variable $\mathcal{R}(f(Z_i))$ can be used as a valid IV to identify the effect $\beta$ via the following estimating equation
$$\beta = \frac{\mathbb{E}\big[\mathcal{R}(f(Z_i))\, Y_i\big]}{\mathbb{E}\big[\mathcal{R}(f(Z_i))\, D_i\big]} = \frac{\mathbb{E}\big[\mathcal{R}(f(Z_i))\, Y_i\big]}{\mathbb{E}\big[\{\mathcal{R}(f(Z_i))\}^2\big]}. \qquad (2)$$
Note that the last equality follows from the population orthogonality between $\mathcal{R}(f(Z_i))$ and $Z_i$, together with $\mathbb{E}[\delta_i \mid Z_i] = 0.$
The estimating equation in (2) reveals that the treatment effect $\beta$ can be identified by exploring the nonlinear structure of $f(\cdot)$ under the model (1). We propose to learn $f(\cdot)$ using an ML prediction model and measure the variability of the adjusted variable $\mathcal{R}(f(Z_i))$ by defining a notion of generalized IV strength in Section 3.3. We shall emphasize that using ML algorithms to learn the possibly nonlinear function $f(\cdot)$ is critical in the current framework, allowing for invalid IVs. The ML algorithms help capture complicated structures in $f(\cdot)$ and typically retain enough variability in the adjusted variable $\mathcal{R}(f(Z_i))$. In Section A.1 in the supplement, we present another example of causal identification in the context of randomized experiments with non-compliance and invalid IVs.
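To make the estimating equation (2) concrete, the following minimal simulation sketch (in Python) uses illustrative functional forms that are not from the paper — $f(z) = z + z^2$, a linear violation $\pi z$ with $\pi = 0.5$, and $\beta = 1$ — and shows that the residualized nonlinear component of $f(Z)$ recovers $\beta$ while TSLS with the invalid IV does not.

```python
import numpy as np

# Minimal sketch of the estimating equation (2) under model (1), with
# illustrative (not the paper's) choices: f(z) = z + z^2, beta = 1, pi = 0.5.
rng = np.random.default_rng(0)
n = 100_000
beta, pi = 1.0, 0.5

Z = rng.normal(size=n)
U = rng.normal(size=n)                      # unmeasured confounder
delta = 0.8 * U + rng.normal(size=n)        # treatment error, correlated with outcome error
eps = 0.8 * U + rng.normal(size=n)          # outcome error
D = Z + Z**2 + delta                        # nonlinear treatment model f(Z) = Z + Z^2
Y = beta * D + pi * Z + eps                 # linear violation pi * Z

# Best linear approximation of f(Z) by (1, Z), then the residual R(f(Z)).
fZ = Z + Z**2
design = np.column_stack([np.ones(n), Z])
coef, *_ = np.linalg.lstsq(design, fZ, rcond=None)
R_fZ = fZ - design @ coef

beta_hat = np.mean(R_fZ * Y) / np.mean(R_fZ * D)
print(f"estimating equation (2): {beta_hat:.3f} vs truth {beta}")   # close to 1.0

# For contrast, TSLS with Z as the (invalid) IV is biased towards beta + pi:
tsls = np.mean((Z - Z.mean()) * Y) / np.mean((Z - Z.mean()) * D)
print(f"TSLS with invalid IV: {tsls:.3f}")                           # close to 1.5
```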
Remark 1
Nonlinear functional form identification has been proposed in various existing works; see Lewbel (2019)[Sec. 3.7] for a review. More concretely, Angrist and Pischke (2009)[Sec. 4.6.1] discussed the identification of the treatment effect when the IV has a linear direct effect on the outcome, but the treatment model is a nonlinear parametric model of the IV. The Heckman selection model (Heckman, 1976) offers a practical and simple solution for estimating effects when the samples are not randomly selected. When no variables satisfy the exclusion restriction, one possible identification relies on the nonlinearity of the inverse Mills ratio. As pointed out by Puhani (2000), such an identification is less reliable when the nonlinear functional form is relatively weak in applications. Additionally, Shardell and Ferrucci (2016); Ten Have et al. (2007) proposed causal identification using the interaction of the (binary) IV and baseline covariates as the new IV. Lewbel (2019) commented that the proposed identification fails when the nonlinear functional form does not exist or is relatively weak. As a main difference, our proposal does not use hand-picked instruments or assume that the data-generating distribution has some pre-specified nonlinear functional forms. Operationally, we apply the first-stage ML algorithm to learn the complex nonlinear treatment model, which is more powerful in detecting nonlinear dependence by exploring the data. However, including the first-stage ML requires a more delicate follow-up statistical procedure, such as the bias correction step in (19). More importantly, instead of betting on the existence of nonlinearity, our proposed generalized IV strength measure in (22) guides whether there is enough nonlinear structure after adjusting for invalid forms.
2.2 Model with a general class of invalid IVs
We now introduce the general outcome model, allowing for more complicated violation forms than the linear violation in (1),
$$Y_i = D_i\,\beta + g(Z_i, X_i) + \epsilon_i, \quad \text{with } \mathbb{E}[\epsilon_i \mid Z_i, X_i] = 0, \qquad (3)$$
where $\beta$ is the homogeneous treatment effect and the function $g(\cdot,\cdot)$ encodes how the IVs violate the assumptions (A2) and (A3). The classical valid IV setting requires that the function $g(z, x)$ does not change with different assignments of $z$. In the invalid IV literature, the commonly used outcome model with a linear violation form can be viewed as a special case of (3) when taking $g(Z_i, X_i) = Z_i^{\top}\pi + X_i^{\top}\phi$.
For the treatment, we define its conditional mean function $f(z, x) = \mathbb{E}[D_i \mid Z_i = z, X_i = x]$, leading to the following treatment model:
$$D_i = f(Z_i, X_i) + \delta_i, \quad \text{with } \mathbb{E}[\delta_i \mid Z_i, X_i] = 0. \qquad (4)$$
The model (4) is flexible as $f(\cdot,\cdot)$ might be any unknown function of $Z_i$ and $X_i$, and the treatment variable $D_i$ can be continuous, binary, or a count variable. Similarly to (1), the model (4) is only a probability relation instead of a causal generating model, and the treatment assignment is allowed to be influenced by unmeasured confounders.
It is well known that, for the outcome model (3), the identification of $\beta$ is generally impossible without additional identifiability conditions on the function $g(\cdot,\cdot)$. In this paper, we allow the existence of invalid IVs but constrain the functional form of violating (A2) and (A3) to a pre-specified class. An operational step is to specify a set of basis functions for $g(\cdot,\cdot)$ in the outcome model (3).
Define $\phi(x) = g(0, x)$ and decompose $g$ as $g(z, x) = h(z, x) + \phi(x)$ with $h(z, x) = g(z, x) - g(0, x).$ When the IVs are valid, $g(z, x)$ does not directly depend on the assignment of $z$, which implies $h \equiv 0$ and $g(Z_i, X_i) = \phi(X_i).$ In this case, we just need to approximate $\phi(X_i)$ by a set of basis functions of $X_i$. With such a basis, we approximate $\phi(X_i)$ by a linear combination of basis functions $\{w_1(X_i), \ldots, w_L(X_i)\}$. For example, if $\phi$ is linear, we set $w_l(X_i) = X_{i,l}$. Furthermore, we consider the additive model $\phi(x) = \sum_{j=1}^{p_x} \phi_j(x_j)$ with smooth functions $\phi_j$. For each $1 \le j \le p_x$, we construct a set of basis functions for $\phi_j$, with $L$ denoting the total number of basis functions. Examples of the basis functions include the polynomial or B-spline basis. We then approximate $\phi(X_i)$ with a linear combination of these basis functions.
When the IV is invalid with $h \not\equiv 0$, in addition to approximating $\phi(X_i)$ by $\{w_l(X_i)\}_{1 \le l \le L}$, we further approximate the function $h(Z_i, X_i)$ by a set of pre-specified basis functions $\{v_1(Z_i, X_i), \ldots, v_m(Z_i, X_i)\}$ and then approximate $g(Z_i, X_i)$ by
$$g(Z_i, X_i) \approx \sum_{l=1}^{m} \theta_l\, v_l(Z_i, X_i) + \sum_{l'=1}^{L} \psi_{l'}\, w_{l'}(X_i). \qquad (5)$$
We now present different choices of $g(\cdot,\cdot)$ in (3), or equivalently, of the violation basis $\{v_l\}_{1 \le l \le m}$ in (5). With them, the model (3) can accommodate a wide range of valid or invalid IV forms.

(1) Linear violation. We take $g(z, x) = z^{\top}\pi + \phi(x)$ and $v_l(z, x) = z_l$ for $1 \le l \le p_z$.

(2) Polynomial violation (for a single IV). We consider $g(z, x) = h(z) + \phi(x)$, where the IV and baseline covariates affect the outcome in an additive manner. We set the violation basis as functions of $z$ only and may take $\{v_l(z)\}$ as (piecewise) polynomials of various orders.

(3) Interaction violation (for a single IV). We consider $g(z, x) = z \cdot x^{\top}\psi + \phi(x)$, allowing the IV to affect the outcome interactively with the baseline covariates. For such a violation form, we set $v_l(z, x) = z \cdot x_l$ for $1 \le l \le p_x$.

We focus on the single IV setting for the polynomial and interaction violations to simplify the notation; the constructions can be extended to the setting with multiple IVs.
Two important remarks are in order for the choice of $g$ and the violation basis $\{v_l\}$. Firstly, the choice of $g$ and the corresponding basis set is part of the outcome model specification (3), reflecting the users’ belief about how the IVs can potentially violate the assumptions (A2) and (A3). In applications, empirical researchers might apply domain knowledge and construct the pre-specified set of basis functions in (5). For example, Angrist and Lavy (1999) applied Maimonides’ rule to construct the IV as a transformation of the variable “enrollment” and then adjusted for the possible violation form generated by a linear, quadratic, or piecewise linear transformation of “enrollment”. Importantly, we do not have to assume knowledge of the violation form and will present in Section 3.4 a data-dependent way to choose the best basis set from a collection of candidates. Secondly, the linear violation form is the most commonly used violation form in the context of the multiple IV framework (Small, 2007; Kang et al., 2016; Guo et al., 2018; Windmeijer et al., 2019). Most existing methods require at least a proportion of the IVs to be valid, i.e., the corresponding entries of $\pi$ being zero. In contrast, our framework allows all IVs to be invalid (i.e., all entries of $\pi$ to be nonzero) by taking $v_l(z, x) = z_l$ for $1 \le l \le p_z$.
With the set of basis functions in (5), we define the matrix $V \in \mathbb{R}^{n \times (m + L + 1)}$ which evaluates the functions in (5) at the observed variables:
$$V_{i,\cdot} = \big(1,\ v_1(Z_i, X_i), \ldots, v_m(Z_i, X_i),\ w_1(X_i), \ldots, w_L(X_i)\big), \qquad (6)$$
for $1 \le i \le n$. Note that we always include an intercept in $V$. Instead of assuming that $V$ perfectly generates $g(Z_i, X_i)$, we work within a broader framework by allowing an approximation error of $g(Z_i, X_i)$ by the best linear combination of $V_{i,\cdot}$, defined as
$$R_i(V) = g(Z_i, X_i) - V_{i,\cdot}\,\theta^{*}, \quad \text{with } \theta^{*} = \arg\min_{\theta}\ \mathbb{E}\big(g(Z_i, X_i) - V_{i,\cdot}\,\theta\big)^2. \qquad (7)$$
Define $R(V) = \big(R_1(V), \ldots, R_n(V)\big)^{\top}$. We shall omit the dependence on $V$ when there is no confusion. With (7), we express the outcome model (3) as
$$Y_i = D_i\,\beta + V_{i,\cdot}\,\theta^{*} + R_i + \epsilon_i. \qquad (8)$$
If $g(Z_i, X_i)$ is well approximated by a linear combination of $V_{i,\cdot}$, the approximation error $R_i$ is close to zero or even exactly zero.
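As a concrete illustration of the basis matrix in (5)-(6), the following sketch assembles an intercept, a violation basis for the linear, polynomial, or interaction forms above, and a simple covariate basis. The helper name and its arguments are illustrative; in particular, the covariate basis here is just the raw covariates rather than a spline expansion.

```python
import numpy as np

def violation_basis_matrix(Z, X, form="linear", degree=2):
    """Sketch of assembling the matrix V in (6): an intercept, a violation
    basis for g, and a covariate basis (here simply the raw columns of X).
    The helper and its arguments are illustrative, not from the paper."""
    n = len(Z)
    Z = np.asarray(Z).reshape(n, -1)
    X = np.asarray(X).reshape(n, -1)
    intercept = np.ones((n, 1))
    if form == "valid":          # valid IV: no violation basis, covariates only
        violation = np.empty((n, 0))
    elif form == "linear":       # linear violation: the IV columns themselves
        violation = Z
    elif form == "polynomial":   # polynomial violation (single IV): z, z^2, ..., z^degree
        violation = np.column_stack([Z[:, 0] ** k for k in range(1, degree + 1)])
    elif form == "interaction":  # interaction violation (single IV): z * x_j for each covariate
        violation = Z[:, [0]] * X
    else:
        raise ValueError(form)
    return np.column_stack([intercept, violation, X])

# toy usage
rng = np.random.default_rng(1)
Z, X = rng.normal(size=50), rng.normal(size=(50, 3))
print(violation_basis_matrix(Z, X, form="interaction").shape)  # (50, 1 + 3 + 3)
```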
2.3 Causal interpretation: structural equation and potential outcome models
We first provide the structural equation model (SEM) interpretation of the models (3) and (4). We consider the following SEM for $Y_i$, $D_i$, $Z_i$, $X_i$, and unmeasured confounders $U_i$,
$$Y_i = D_i\,\beta + \kappa(Z_i, X_i) + \nu(U_i) + e_i, \qquad D_i = q(Z_i, X_i, U_i) + \eta_i, \qquad (9)$$
where $U_i$ denotes some unmeasured confounders, $e_i$ and $\eta_i$ are random errors independent of $(Z_i, X_i, U_i)$, and the intercept of $\kappa$ is chosen such that $\mathbb{E}[\nu(U_i)] = 0$. Define $g(z, x) = \kappa(z, x) + \mathbb{E}[\nu(U_i) \mid Z_i = z, X_i = x]$ and $f(z, x) = \mathbb{E}[D_i \mid Z_i = z, X_i = x].$ In (9), the unmeasured confounders $U_i$ might affect both the outcome and treatment, $\kappa(Z_i, X_i)$ encodes a direct effect of the IVs on the outcome, and $\mathbb{E}[\nu(U_i) \mid Z_i, X_i]$ captures the association between the IVs and the unmeasured confounders. The SEM (9) together with the definitions of $g$ and $f$ imply the outcome model (3) with $\epsilon_i = \nu(U_i) - \mathbb{E}[\nu(U_i) \mid Z_i, X_i] + e_i.$ The treatment model (4) arises with $f(z, x) = \mathbb{E}[q(z, x, U_i) \mid Z_i = z, X_i = x]$ and $\delta_i = D_i - f(Z_i, X_i).$
We now present how to interpret the invalid IV model from the potential outcome perspective. For the $i$-th subject with the baseline covariates $X_i$, we use $Y_i^{(d, z)}$ to denote the potential outcome with the treatment and the IVs assigned to $d$ and $z$, respectively. For any $(d, z)$, we consider the potential outcome model (Splawa-Neyman et al., 1990; Rubin, 1974),
$$Y_i^{(d, z)} = d\,\beta + \varphi_1(z, X_i) + u_i, \quad \text{with } \mathbb{E}\big[u_i \mid Z_i = z, X_i\big] = \varphi_2(z, X_i), \qquad (10)$$
where $\beta$ denotes the treatment effect and $\varphi_1$, $\varphi_2$ are unknown functions. If $\varphi_1$ changes with $z$, the IVs directly affect the outcome, violating the assumption (A3). If $\varphi_2$ changes with $z$, the IVs are associated with unmeasured confounders, violating the assumption (A2). The model (10) extends the Additive LInear, Constant Effects (ALICE) model of Holland (1988) by allowing for a general class of invalid IVs. By the consistency assumption $Y_i = Y_i^{(D_i, Z_i)}$, (10) implies (3) with $g(z, x) = \varphi_1(z, x) + \varphi_2(z, x)$ and $\epsilon_i = u_i - \varphi_2(Z_i, X_i).$ We can easily generalize (10) by considering an individual effect $\beta_i$ for the $i$-th individual. If we consider the random effect $\beta_i = \beta + b_i$ with $\mathbb{E}[b_i] = 0$ and $b_i$ being independent of the other random variables, then we obtain the model (3) with the same $\beta$, and our proposed method can be applied to make inference for the average effect $\beta$.
2.4 Overview of TSCI
We propose in Section 3 a novel methodology called two-stage curvature identification (TSCI) to identify $\beta$ under models (3) and (4). We first assume knowledge of the basis set in (5) that spans the function $g$. The proposal consists of two stages: in the first stage, we employ ML algorithms to learn the conditional mean $f$ in the treatment model (4); in the second stage, we adjust for the IV invalidity form encoded by $g$ and construct the confidence interval for $\beta$ by leveraging the first-stage ML fits.
The success of TSCI relies on the following identification condition.
Condition 1
Define the best linear approximation of $f(Z_i, X_i)$ by $V_{i,\cdot}$ in population as $f^{L}(Z_i, X_i) = V_{i,\cdot}\,\gamma^{*}$ with $\gamma^{*} = \arg\min_{\gamma}\ \mathbb{E}\big(f(Z_i, X_i) - V_{i,\cdot}\,\gamma\big)^2.$ We assume
$$\mathbb{E}\big(f(Z_i, X_i) - f^{L}(Z_i, X_i)\big)^2 > 0.$$
The condition states that the basis set $V$, used for generating $g$, does not fully span the conditional mean function $f$ of the treatment model. For any user-specified basis set in (5), Condition 1 can be partially examined by evaluating the generalized IV strength introduced in Section 3.3. When the generalized IV strength is sufficiently large, our proposed TSCI methodology leads to an estimator $\hat\beta$ in (19) such that
$$\widehat{\rm SE}^{-1}\big(\hat\beta - \beta\big) \stackrel{d}{\to} N(0, 1), \qquad (11)$$
with the estimated standard error $\widehat{\rm SE}$ depending on the generalized IV strength. Importantly, the basis set used for generating $g$ does not need to be fixed in advance: we discuss in Section 3.4 a data-driven strategy for choosing a “good” basis set among a collection of candidate sets $\{V_q\}_{0 \le q \le Q}$ for a positive integer $Q$, where $V_0$ corresponds to the valid IV setting. Our proposed TSCI compares estimators assuming valid IVs and adjusting for violation forms generated from the family $\{V_q\}_{1 \le q \le Q}$. Thus, we encompass the setting of valid IVs, and through this comparison, the TSCI methodology provides more robust causal inference tools than directly assuming the IVs’ validity.
Intuitively, Condition 1 requires $f$ to be more non-linear than $g$, which cannot be fully tested since $g$ is unknown. We shall provide two important remarks on the plausibility of Condition 1. Firstly, when the proposed IVs are valid with $g$ not depending on the IVs, Condition 1 is satisfied as long as $f$ depends on the IVs. More interestingly, when $g$ has a relatively weak dependence on the IVs (i.e., the IVs violate (A2) and (A3) locally), Condition 1 holds if the machine learning algorithm captures a strong dependence of $f$ on the IVs. In practice, domain scientists identify IVs based on their knowledge, and these IVs, even when violating assumptions (A2) and (A3), may have a relatively weak direct effect on the outcome or are weakly associated with the unmeasured confounders. In such a situation, TSCI provides robust causal inference by allowing the proposed IV to be invalid. However, we are not suggesting that users take any covariate as an invalid IV to implement our proposal. Secondly, even if Condition 1 does not hold, our proposed TSCI estimator reduces to an estimator assuming valid IVs, which then has a similar performance to TSLS and DML. In Section 5.3, we explore the performance of TSCI when Condition 1 does not hold.
3 TSCI with Machine Learning
We discuss the first and second stages of our TSCI methodology in Sections 3.1 and 3.2, respectively. The main novelty of our proposal is to estimate a bias caused by the first-stage ML and implement a follow-up bias correction.
3.1 First stage: machine learning models for the treatment model
We estimate the conditional mean $f$ in the treatment model (4) by a general ML fit. We randomly split the data $\{1, \ldots, n\}$ into two disjoint subsets $A_1$ and $A_2$. Throughout the paper, we use a split with $|A_1| \asymp n$ and set $A_2 = \{1, \ldots, n\} \setminus A_1$. Our results can be extended to any other splitting with $|A_1| \asymp |A_2| \asymp n$. Sample splitting is essential for removing the endogeneity of the ML predicted values. Without sample splitting, the ML predicted value for the treatment can be close to the treatment (due to overfitting) and hence highly correlated with unmeasured confounders, leading to a TSCI estimator suffering from a significant bias.
The main step is to construct the ML prediction model $\hat f(\cdot, \cdot)$ with data belonging to $A_2$ and express the ML estimator of $f_{A_1} = \big(f(Z_i, X_i)\big)_{i \in A_1}$ as
$$\hat f_{A_1} = \big(\hat f(Z_i, X_i)\big)_{i \in A_1} = \Omega\, D_{A_1}, \qquad (12)$$
where the data belonging to $A_2$ is used to train the ML algorithm and construct the transformation matrix $\Omega \in \mathbb{R}^{|A_1| \times |A_1|}$. Our proposed TSCI method mainly relies on expressing the first-stage prediction as the linear estimator in (12). A wide range of first-stage ML algorithms are shown to have the expression in (12). In the following, we follow the results in Lin and Jeon (2006); Meinshausen (2006); Wager and Athey (2018) and express the sample-split random forests in the form (12). In Sections A.6, A.7, and A.8 in the supplement, we present the explicit definitions of $\Omega$ for boosting, deep neural networks (DNN), and approximation with B-splines, respectively.
Importantly, the transformation matrix $\Omega$ plays a role similar to the hat matrix in linear regression procedures; however, $\Omega$ is not a projection matrix for nonlinear ML algorithms, which creates additional challenges in adopting the first-stage ML fit; see Section A.2.
Split random forests in (12). To avoid the overfitting of random forests, we adopt the construction of honest trees and forests as in Wager and Athey (2018) and slightly modify it by excluding self-prediction for our purpose. We construct the random forests (RF) with the data in $A_2$ and estimate $f(Z_i, X_i)$ for $i \in A_1$ by the constructed RF together with the data in $A_1$. RF aggregates decision trees, with each decision tree being viewed as a partition of the whole covariate space into disjoint subspaces. Let $\xi$ denote the random parameter that determines how a tree is grown. For any given $(z, x)$ and a given tree with the parameter $\xi$, there exists a unique leaf whose subspace $\mathcal{L}(z, x; \xi)$ contains $(z, x).$ With the observations inside $A_1$, the decision tree estimates $f(Z_i, X_i)$ by
$$\hat f(Z_i, X_i; \xi) = \frac{\sum_{j \in A_1,\, j \neq i} D_j\, \mathbf{1}\{(Z_j, X_j) \in \mathcal{L}(Z_i, X_i; \xi)\}}{\sum_{j \in A_1,\, j \neq i} \mathbf{1}\{(Z_j, X_j) \in \mathcal{L}(Z_i, X_i; \xi)\}}, \qquad (13)$$
where the summation excludes the point $(Z_i, X_i)$ itself. We use $\xi_1, \ldots, \xi_S$ to denote the parameters corresponding to the $S$ trees. The estimator $\hat f(Z_i, X_i)$ can be expressed as
$$\hat f(Z_i, X_i) = \frac{1}{S}\sum_{s=1}^{S} \hat f(Z_i, X_i; \xi_s) = \sum_{j \in A_1} \Omega_{ij}\, D_j, \qquad (14)$$
with $\hat f(Z_i, X_i; \xi_s)$ defined in (13). That is, the split RF estimator of $f_{A_1}$ attains the form (12) with $\Omega_{ij} = \frac{1}{S}\sum_{s=1}^{S} \frac{\mathbf{1}\{j \neq i\}\,\mathbf{1}\{(Z_j, X_j) \in \mathcal{L}(Z_i, X_i; \xi_s)\}}{\sum_{k \in A_1,\, k \neq i} \mathbf{1}\{(Z_k, X_k) \in \mathcal{L}(Z_i, X_i; \xi_s)\}}$ for $i, j \in A_1$.
The construction of honest trees and the removal of self-prediction help remove the endogeneity of the ML predicted values $\hat f(Z_i, X_i)$. Firstly, due to the sample splitting, the construction of the trees does not directly depend on the data in $A_1$, which removes the endogeneity contained in the ML predicted values. Secondly, consider an extreme setting where the IVs and covariates are not predictive for the treatment. In such a scenario, without removing the self-prediction, the estimator $\hat f(Z_i, X_i)$ will be dominated by its own treatment observation $D_i$ and is still highly correlated with the unmeasured confounders, leading to a biased TSCI estimator. The removal of self-prediction can be viewed as an ML version of the “leave-one-out” estimator for the treatment model, which was proposed in Angrist et al. (1999) to reduce the bias of TSLS with many IVs. In the numerical studies, we have observed that the exclusion of self-prediction leads to a more accurate estimator than the corresponding TSCI estimator with self-prediction, regardless of the IV strength; see Table S1 for the detailed comparison.
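The following sketch illustrates how a transformation matrix $\Omega$ of the form (12) and (14) can be computed from the leaf memberships of a random forest fitted on $A_2$, with the self-prediction term removed. It uses scikit-learn's `RandomForestRegressor.apply` to obtain leaf indices; the honesty construction here is the simple "fit on $A_2$, weight on $A_1$" variant, and the function name is illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_transformation_matrix(rf, W_A1):
    """Sketch of the weight matrix Omega in (12)/(14) for a random forest
    fitted on the A2 subsample: Omega[i, j] averages, over trees, the
    leaf-co-membership weight of observation j for predicting observation i,
    with the self-prediction term (j == i) removed."""
    leaves = rf.apply(W_A1)               # (n1, n_trees) leaf indices
    n1, n_trees = leaves.shape
    Omega = np.zeros((n1, n1))
    for t in range(n_trees):
        same_leaf = leaves[:, [t]] == leaves[:, t][None, :]   # (n1, n1) co-membership
        np.fill_diagonal(same_leaf, False)                    # exclude self-prediction
        counts = same_leaf.sum(axis=1, keepdims=True)
        counts[counts == 0] = 1                               # guard empty rows
        Omega += same_leaf / counts
    return Omega / n_trees

# toy usage: fit on A2, build Omega on A1
rng = np.random.default_rng(2)
n = 300
W = rng.normal(size=(n, 4)); D = W[:, 0] ** 2 + rng.normal(size=n)
A2, A1 = np.arange(0, 100), np.arange(100, n)
rf = RandomForestRegressor(n_estimators=50, min_samples_leaf=10, random_state=0)
rf.fit(W[A2], D[A2])
Omega = rf_transformation_matrix(rf, W[A1])
fhat_A1 = Omega @ D[A1]                   # first-stage fit in the form (12)
```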
3.2 Second stage: adjusting for violation forms and correcting bias
In the second stage, we leverage the first-stage ML fits in the form of (12) and adjust for the possible IV invalidity forms. In the remainder of the construction, we consider the outcome model in the form of (8) and assume that $V$ provides an accurate approximation of $g(Z_i, X_i)$, resulting in zero or sufficiently small approximation errors $R_i$. Hence $R_i$ does not enter the following construction of the estimator for $\beta$. We quantify the effect of the error term $R_i$ in the theoretical analysis in Section 4 and propose in Section 3.4 a data-dependent way of choosing $V$ such that $R_i$ is sufficiently small.
Applying the transformation $\Omega$ in (12) to the model (8) with data in $A_1$, we obtain
$$\Omega Y_{A_1} = \beta\,\Omega D_{A_1} + \Omega V_{A_1}\theta^{*} + \Omega\big(R_{A_1} + \epsilon_{A_1}\big), \qquad (15)$$
where $Y_{A_1}$, $D_{A_1}$, $V_{A_1}$, $R_{A_1}$, and $\epsilon_{A_1}$ denote the corresponding sub-vectors and sub-matrix with indices in $A_1$. For a matrix $M$, we use $P_{M}$ and $P_{M}^{\perp}$ to denote the projection onto the column space of $M$ and onto its orthogonal complement, respectively. Based on (15), we project out the columns of $\Omega V_{A_1}$ and estimate the effect $\beta$ by
(16)
When the denominator in (16) is nonzero, we decompose the error of the above initial estimator as
(17)
In the above decomposition, the first term on the right-hand side is a bias component appearing due to the use of the first-stage ML. To explain this, we consider the homoscedastic correlation $\mathbb{E}[\epsilon_i \delta_i \mid Z_i, X_i] = \mathbb{E}[\epsilon_i \delta_i]$ and obtain the explicit bias expression,
(18)
where the numerator involves the trace of the transformation matrix defined in (16). Note that this trace can be viewed as a degrees-of-freedom or complexity measure for the ML algorithm. It may become particularly large in the overfitting regime. The bias in (18) also becomes large when the denominator is small, which corresponds to a weak generalized IV strength in the sense of the following (22). Thus, both the numerator and the denominator in (18) can lead to a large bias of the initial estimator. In Figure 1, we illustrate the bias of the initial estimator in settings with different values of the (scaled) generalized IV strength.
[Figure 1: bias of the initial TSCI estimator and its bias-corrected version across different levels of the (scaled) generalized IV strength.]
To address this finite-sample bias of the initial estimator, we propose the following bias-corrected estimator,
(19)
with $\hat\delta_i = D_i - \hat f(Z_i, X_i)$ for $i \in A_1$ and $\hat\epsilon_i$ denoting the corresponding estimated outcome-model errors. We show in Figure 1 that our proposal effectively corrects the bias of the initial estimator. Importantly, our proposed bias correction is effective for both homoscedastic and heteroscedastic correlations. We present a simplified bias correction assuming homoscedastic correlation in Section A.10 in the supplement.
Centering at $\hat\beta(V)$ defined in (19), we construct the confidence interval
$${\rm CI}(V) = \Big(\hat\beta(V) - z_{\alpha/2}\,\widehat{\rm SE},\ \ \hat\beta(V) + z_{\alpha/2}\,\widehat{\rm SE}\Big), \qquad (20)$$
with $z_{\alpha/2}$ denoting the upper $\alpha/2$ quantile of the standard normal distribution and the estimated standard error $\widehat{\rm SE}$ defined as
(21)
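The sketch below assembles a TSCI-style second stage consistent with the description in (15)-(21): it smooths the data by $\Omega$, projects out the columns of $\Omega V$, forms the ratio estimator, applies a trace-based plug-in bias correction under homoscedastic error correlation, and reports a plug-in normal confidence interval. The precise formulas in (16)-(21) are not reproduced here, so the expressions below should be read as one plausible implementation under stated assumptions rather than the paper's exact estimator.

```python
import numpy as np
from scipy import stats

def tsci_second_stage(Omega, D, Y, V, alpha=0.05):
    """Sketch of a TSCI-style second stage (15)-(21), assuming: the smoothed
    quantities Omega @ D and Omega @ Y are used, the violation basis Omega @ V
    is projected out, the bias correction plugs in the trace of
    Omega' P_perp Omega times the estimated error covariance, and the standard
    error is a simple plug-in. Formula details are illustrative."""
    D, Y = np.asarray(D, float), np.asarray(Y, float)
    Dhat = Omega @ D                                  # first-stage fit, as in (12)
    Q, _ = np.linalg.qr(Omega @ V)
    P_perp = np.eye(len(D)) - Q @ Q.T                 # project out columns of Omega @ V

    denom = Dhat @ P_perp @ Dhat                      # unscaled generalized IV strength
    beta_init = (Dhat @ P_perp @ (Omega @ Y)) / denom

    # plug-in residuals and trace-based bias correction (homoscedastic correlation)
    delta_hat = D - Dhat
    resid = Y - D * beta_init
    eps_hat = resid - V @ np.linalg.lstsq(V, resid, rcond=None)[0]
    trace_term = np.trace(Omega.T @ P_perp @ Omega)
    beta_tsci = beta_init - trace_term * np.mean(delta_hat * eps_hat) / denom

    # plug-in standard error and normal confidence interval, as in (20)
    w = Omega.T @ P_perp @ Dhat
    se = np.sqrt(np.mean(eps_hat ** 2) * (w @ w)) / abs(denom)
    z = stats.norm.ppf(1 - alpha / 2)
    return beta_tsci, (beta_tsci - z * se, beta_tsci + z * se)
```

Combined with the weight matrix from the previous sketch, a call like `tsci_second_stage(Omega, D[A1], Y[A1], V[A1])` returns the bias-corrected point estimate and its confidence interval.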
Remark 2
The bias component in (18) results from the correlation between the first-stage errors $\delta_i$ and the outcome errors $\epsilon_i$ for $i \in A_1$, carried into the prediction $\hat f(Z_i, X_i) = \sum_{j \in A_1}\Omega_{ij}D_j$. In the regime of many IVs, Angrist et al. (1999) proposed the “leave-one-out” jackknife-type IV estimator; that is, instead of fitting the first-stage model with all the data, the treatment model for the $i$-th observation is fitted without using the $i$-th observation itself. Such an estimator effectively removes the bias for TSLS due to the correlation between $\delta_i$ and $\epsilon_i$. Our proposed removal of self-prediction is of a similar spirit to the jackknife. However, even after removing the self-prediction, the diagonal of the effective transformation matrix in (18) is not zero, and hence the correlation between $\delta_i$ and $\epsilon_i$ still leads to a bias component, which requires the bias correction as in (19).
3.3 Generalized IV strength: detection of under-fitted machine learning
The IV strength is particularly important for identifying the treatment effect stably, and weak IVs are a major concern in practical applications of IV-based methods (Stock et al., 2002; Hansen et al., 2008). With a larger basis set $V$, the IV strength will generally decrease, as the information contained in $V$ is projected out from the first-stage ML fit. It is crucial to introduce a generalized IV strength measure in consideration of invalid IV forms and the ML algorithm. Similarly to the classical setup, good performance of our proposed TSCI estimator also requires a relatively large generalized IV strength. We introduce the generalized IV strength measure as,
(22)
When specialized to the classical linear IV setting, this measure reduces to the usual concentration parameter. A sufficiently large strength will guarantee stable point and interval estimators defined in (19) and (20). Hence, we need to check whether the generalized IV strength is sufficiently large. Since $f$ is unknown, we estimate it by its first-stage ML fit and estimate the strength by the corresponding sample version.
In Section 4, we show that our point estimator in (19) is consistent when the generalized IV strength is much larger than the complexity of the first-stage ML fit; see Condition (R2). We now develop a bootstrap test to provide an empirical assessment of this IV strength requirement. Since Rothenberg (1984) and Stock et al. (2002) suggested a concentration parameter larger than 10 as being “adequate”, we develop a bootstrap test for the generalized IV strength exceeding an analogous threshold. We apply the wild bootstrap method and construct a probabilistic upper bound for the estimation error of the sample strength measure: we compute the first-stage residuals, multiply them by i.i.d. standard normal random variables to generate bootstrap samples, recompute the strength estimate for each bootstrap sample, and take an upper empirical quantile of the resulting estimation errors as a high-probability upper bound. We conduct the generalized IV strength test by comparing the strength estimate, adjusted by this upper bound, with the threshold. If the above generalized IV strength test is passed, the IV is claimed to be strong after adjusting for the matrix defined in (6); otherwise, the IV is claimed to be weak after adjusting for this matrix. Empirically, we observe reliable inference properties when the estimated generalized IV strength is above 40; see Figure S1 for details.
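The following sketch mirrors the strength check described above: it computes a sample analogue of the generalized IV strength after projecting out $\Omega V$ and uses a residual-based wild bootstrap to bound its estimation error. The particular strength statistic and the way the bound is used are illustrative and may differ from the exact definitions in (22) and in the paper's test.

```python
import numpy as np

def generalized_iv_strength(Omega, D, V, n_boot=300, quantile=0.95, seed=0):
    """Sketch of a generalized-IV-strength check (Section 3.3), assuming the
    strength statistic is the sum of squares of the first-stage fit Omega @ D
    after projecting out Omega @ V, scaled by the first-stage noise level.
    A wild bootstrap perturbs the first-stage residuals to bound the
    estimation error of this statistic from above. Details are illustrative."""
    rng = np.random.default_rng(seed)
    D = np.asarray(D, float)
    Dhat = Omega @ D
    Q, _ = np.linalg.qr(Omega @ V)
    P_perp = np.eye(len(D)) - Q @ Q.T
    delta_hat = D - Dhat
    sigma2_delta = np.mean(delta_hat ** 2)

    strength = (Dhat @ P_perp @ Dhat) / sigma2_delta
    errors = []
    for _ in range(n_boot):
        D_b = Dhat + delta_hat * rng.standard_normal(len(D))   # wild bootstrap sample
        Dhat_b = Omega @ D_b
        errors.append((Dhat_b @ P_perp @ Dhat_b) / sigma2_delta - strength)
    upper = np.quantile(errors, quantile)
    return strength, strength - upper   # estimate and a high-probability lower bound

# usage: claim "strong after adjusting for V" if the lower bound clears a
# threshold such as 40, in line with the empirical observation above.
```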
3.4 Data-dependent selection of
Our proposed TSCI estimator in (19) requires prior knowledge of the basis set $V$, which generates the function $g$. In the following, we consider nested sets of basis functions $\{V_q\}_{0 \le q \le Q}$, where $Q$ is a positive integer. We devise a data-dependent way to choose the best one among $\{V_q\}_{0 \le q \le Q}$.
We define $V_0$ as the set of basis functions for the valid IV setting. For $1 \le q \le Q$, define $V_q$ as the basis set for different invalid IV forms, where the number of basis functions increases with $q$. We present two examples of $\{V_q\}_{q \ge 1}$ for the single IV setting.
(1) Polynomial violation: $V_q$ augments $V_0$ with the polynomials of the IV up to order $q$, for $q \ge 1$.
(2) Interaction violation: $V_1$ augments $V_0$ with the interactions of the IV and the baseline covariates.
For each $0 \le q \le Q$, we define the matrix $V_q$ with its $i$-th row evaluating the corresponding basis functions at the observed $(Z_i, X_i)$, analogously to (6).
For estimating the best violation form from data, we proceed as follows. We implement the generalized IV strength test in Section 3.3 and define $Q_{\max}$ as,
$$Q_{\max} = \max\big\{q:\ \text{the generalized IV strength test is passed after adjusting for } V_q\big\}, \qquad (23)$$
where the strength threshold is set at 40 by default, in line with the empirical observation in Section 3.3. For a larger $q$, we tend to adjust out more information and have relatively weaker IVs. Intuitively, $Q_{\max}$ denotes the largest index such that the IVs still have enough strength after adjusting for $V_{Q_{\max}}$. With $Q_{\max}$, we shall choose among $\{V_q\}_{0 \le q \le Q_{\max}}$. As a remark, when there is no $q$ satisfying (23), this corresponds to the weak IV regime. In such a scenario, we shall implement the valid IV estimator and output a warning of weak IV.
For any given $0 \le q \le Q_{\max}$, we apply the generalized estimator in (19) with the basis matrix $V_q$ and construct the corresponding point estimator and standard error,
(24)
The selection of the best basis among $\{V_q\}_{0 \le q \le Q_{\max}}$ relies on comparing the estimators $\{\hat\beta(V_q)\}_{0 \le q \le Q_{\max}}$. We start with comparing the difference between the estimators $\hat\beta(V_q)$ and $\hat\beta(V_{q+1})$ with $0 \le q \le Q_{\max}-1$. When $g(Z_i, X_i)$ is well approximated by both $V_q$ and $V_{q+1}$, the approximation errors defined in (7) are small, and the main difference $\hat\beta(V_q) - \hat\beta(V_{q+1})$ is approximately due to the randomness of the errors $\epsilon_i$. In such cases, we then have that $\hat\beta(V_q) - \hat\beta(V_{q+1})$ is approximately centered at zero with the following conditional variance,
(25)
Based on the above approximation, we further conduct the following test for whether $\hat\beta(V_q)$ is significantly different from $\hat\beta(V_{q+1})$,
(26)
where the estimators and their estimated standard errors are defined in (24). On the other hand, if the smaller matrix $V_q$ does not provide a good approximation of $g$, the difference $\hat\beta(V_q) - \hat\beta(V_{q+1})$ tends to be much larger than the threshold in (26), and the test in (26) rejects, indicating that $V_q$ does not fully generate $g(Z_i, X_i)$.
We next generalize the pairwise comparison to multiple comparisons. For $0 \le q \le Q_{\max}-1$, we compare $\hat\beta(V_q)$ to any $\hat\beta(V_{q'})$ with $q < q' \le Q_{\max}$. Particularly, for $0 \le q \le Q_{\max}-1$, we define the test indicator $C_q \in \{0, 1\}$, which equals one if at least one of these pairwise differences exceeds its threshold,
(27)
where the thresholds are chosen by the wild bootstrap; see more details in Section A.3 in the supplement. We define $C_{Q_{\max}} = 0$, as there is no index larger than $Q_{\max}$ that we might compare to. We interpret $C_q = 0$ as follows: none of the differences $\hat\beta(V_q) - \hat\beta(V_{q'})$ for $q' > q$ is large, indicating that $V_q$ (approximately) generates the function $g$ and $\hat\beta(V_q)$ is a reliable estimator.
We choose the index as $q_c = \min\{q : C_q = 0\}$, which is the smallest $q$ such that $C_q$ is zero. The index $q_c$ is interpreted as follows: any matrix containing $V_{q_c}$ as a submatrix does not lead to a substantially different estimator from that by $V_{q_c}$. This provides evidence for $V_{q_c}$ forming a good approximation of $g(Z_i, X_i).$ Here, the sub-index “c” represents the comparison, as we compare different candidates to choose the best one. Note that $q_c$ must exist since $C_{Q_{\max}} = 0$ by definition. In finite samples, there are chances that certain violations cannot be detected, especially if the TSCI estimators given by $V_q$ and $V_{q+1}$ are not significantly different. We propose a more robust choice of the index as $q_r = \min\{q_c + 1, Q_{\max}\}$, where the sub-index “r” denotes the robust selection; see the discussion after Theorem 3. We summarize our proposed TSCI estimator in Algorithm 1.
Input: Data $\{(Y_i, D_i, Z_i, X_i)\}_{1 \le i \le n}$; sets of basis functions $\{V_q\}_{0 \le q \le Q}$ for approximating $g$;
Output: $q_c$ and $q_r$; $\hat\beta(V_{q_c})$ and $\hat\beta(V_{q_r})$; and the corresponding confidence intervals.
As a remark, the test in (26) with $q = 0$ can be used to test the IV validity. For any $q' \ge 1$, a rejection represents that the estimator assuming valid IVs is sufficiently different from an estimator allowing for the violation form generated by $V_{q'}$, indicating that the valid IV assumption is violated.
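A compact sketch of the layer-wise selection behind Algorithm 1 is given below, reusing `tsci_second_stage` and `generalized_iv_strength` from the earlier sketches. A plain normal-quantile threshold replaces the wild-bootstrap threshold in (27), so this is an illustration of the selection logic rather than the exact rule.

```python
import numpy as np

def select_violation_form(Omega, D, Y, V_list, strength_threshold=40.0, z_thresh=1.96):
    """Sketch of the data-dependent selection among nested violation bases
    {V_0, ..., V_Q} (Algorithm 1). It (i) keeps only bases passing the
    generalized IV strength check, (ii) estimates beta for each retained basis,
    and (iii) picks the smallest q whose estimate is not significantly different
    from the estimates of all larger q. A plain z-threshold replaces the paper's
    wild-bootstrap threshold, so this is an illustration, not the exact rule."""
    # (i) largest q with adequate strength
    strengths = [generalized_iv_strength(Omega, D, V)[1] for V in V_list]
    passing = [q for q, s in enumerate(strengths) if s >= strength_threshold]
    if not passing:
        raise RuntimeError("weak IV after adjusting for all candidate bases")
    Q_max = max(passing)

    # (ii) TSCI estimate and a rough SE for each retained basis
    fits = [tsci_second_stage(Omega, D, Y, V_list[q]) for q in range(Q_max + 1)]
    betas = [b for b, _ in fits]
    ses = [(ci[1] - ci[0]) / (2 * 1.96) for _, ci in fits]

    # (iii) comparison indicators C_q: 1 if some larger-q estimate differs significantly
    C = []
    for q in range(Q_max + 1):
        diffs = [abs(betas[q] - betas[qp]) / np.hypot(ses[q], ses[qp])
                 for qp in range(q + 1, Q_max + 1)]
        C.append(int(any(d > z_thresh for d in diffs)))
    q_c = next(q for q in range(Q_max + 1) if C[q] == 0)   # exists since C[Q_max] = 0
    q_r = min(q_c + 1, Q_max)                               # more robust choice
    return q_c, q_r, betas, fits
```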
3.5 Comparison to Double Machine Learning and Machine Learning IV
We compare our proposed TSCI with the double machine learning (DML) estimator (Chernozhukov et al., 2018) and the machine learning IV (MLIV) estimator (Chen et al., 2023; Liu et al., 2020). As the most significant difference, TSCI provides a valid inference that is robust to a certain class of invalid IVs, while DML and MLIV are designed for valid IV regimes only. We focus on the valid IV setting in the following and compare our proposal to DML and MLIV. In the DML framework, Chernozhukov et al. (2018) considered the outcome model $Y_i = D_i\beta + \phi(X_i) + \epsilon_i$, which is a special case of our outcome model (3) by assuming valid IVs. The population parameter $\beta$ in DML can be identified through the following expression (Chernozhukov et al., 2018; Emmenegger and Bühlmann, 2021)
$$\beta = \frac{\mathbb{E}\big[\big(Y_i - \mathbb{E}[Y_i \mid X_i]\big)\big(Z_i - \mathbb{E}[Z_i \mid X_i]\big)\big]}{\mathbb{E}\big[\big(D_i - \mathbb{E}[D_i \mid X_i]\big)\big(Z_i - \mathbb{E}[Z_i \mid X_i]\big)\big]}, \qquad (28)$$
where $\mathbb{E}[Y_i \mid X_i]$, $\mathbb{E}[D_i \mid X_i]$, and $\mathbb{E}[Z_i \mid X_i]$ are fitted by machine learning algorithms.
As a fundamental difference, we apply machine learning algorithms to capture the nonlinear relation between the treatment and $(Z_i, X_i)$, while (28) identifies $\beta$ based on the linear association between $D_i - \mathbb{E}[D_i \mid X_i]$ and $Z_i - \mathbb{E}[Z_i \mid X_i]$. Due to the first-stage ML, our proposed TSCI estimator is generally more efficient than the DML estimator. On the other hand, we approximate $g(Z_i, X_i)$ by a set of basis functions, which might provide an inaccurate estimator when the basis functions are misspecified. In contrast, DML learns the conditional mean models given the covariates by general machine learning algorithms and does not particularly require such a specification of basis functions. We compare the performance of DML and TSCI with valid IVs in Section 5.1 and with invalid IVs in Section 5.3.
Chen et al. (2023); Liu et al. (2020) proposed to use the ML prediction values in (12) as the IV, referred to as the MLIV in the current paper. With this MLIV, the standard TSLS estimator can be implemented: in the first stage, run an OLS regression of the treatment on the MLIV and the baseline covariates and construct the predicted treatment value from the OLS regression coefficients; in the second stage, run the outcome regression with the treatment replaced by this predicted value. The TSLS estimator with MLIV can then be expressed as
(29)
The last equality of (29) holds by plugging in the outcome and treatment models and applying the orthogonality between the regression residuals and the regressors. Note that a dominating error term of the decomposition in (17) reappears in (29). In comparison, the corresponding last term in (29) is roughly inflated by a factor appearing due to the additional first-stage regression using the MLIV. When the IV is relatively strong, this factor is around 1, and MLIV and TSCI are of similar performance for valid IV settings. However, in the presence of relatively weak IVs, the coefficient in front of this term has a large fluctuation due to the weak association between the MLIV and the treatment. The weak IV problem deteriorates with the extra first-stage regression using the MLIV. This explains why the estimator in (29) has a much larger bias and standard error than our proposed TSCI estimator in (19) when the IVs are relatively weak; see Section D.2 in the supplement for the detailed comparison.
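For comparison, the following sketch spells out the MLIV-based TSLS described above: the ML prediction $\Omega D$ is used as the instrument in an otherwise standard two-stage least squares with covariates. Unlike the TSCI sketch, it cannot adjust for a violation basis, and the extra first-stage OLS is the source of the weak-IV inflation discussed around (29).

```python
import numpy as np

def mliv_tsls(Omega, D, Y, X):
    """Sketch of the MLIV approach (Section 3.5): use the first-stage ML fit
    Omega @ D as an instrument inside a standard TSLS with covariates X.
    The plain OLS/TSLS algebra here is illustrative."""
    n = len(D)
    mliv = Omega @ np.asarray(D, float)           # ML prediction used as the IV
    W = np.column_stack([np.ones(n), X])          # intercept + covariates
    # first stage: OLS of D on (MLIV, W); second stage: replace D by its fit
    F = np.column_stack([mliv, W])
    D_tilde = F @ np.linalg.lstsq(F, D, rcond=None)[0]
    G = np.column_stack([D_tilde, W])
    coef = np.linalg.lstsq(G, Y, rcond=None)[0]
    return coef[0]                                # coefficient on the treatment
```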
4 Theoretical justification
We first establish the asymptotic normality of $\hat\beta(V)$ for a given basis matrix $V$. To begin with, we present the required conditions. The first assumption is imposed on the regression errors in models (3) and (4) and the data in $A_1$.
- (R1) Conditioning on $(Z_i, X_i)$, the errors $\epsilon_i$ and $\delta_i$ are sub-gaussian random variables; that is, there exists a positive constant $C$ such that their conditional sub-gaussian norms are bounded by $C$, uniformly over the support of $(Z_i, X_i)$. The conditional variances of the errors are bounded away from zero and infinity by constants independent of $n$, and the transformation matrix $\Omega$ defined in (12) satisfies an operator-norm bound for some positive constant.
The conditional sub-gaussian assumption is required to establish some concentration results. For the special case where $\epsilon_i$ and $\delta_i$ are independent of $(Z_i, X_i)$, it suffices to assume sub-gaussian errors. The sub-gaussian conditions on the regression errors may be relaxed to moment conditions. The conditions on the error variances will be automatically satisfied with high probability if the error covariance matrix is positive definite and the errors are sub-gaussian random variables, where the sub-gaussianity conditions may again be relaxed to moment conditions. The proof of Lemma 1 in the supplement shows that the condition on $\Omega$ holds for random forests and deep neural networks.
The second assumption is imposed on the generalized IV strength defined in (22). Throughout the paper, the asymptotics is taken as the sample size $n \to \infty$.
- (R2) The generalized IV strength defined in (22) diverges with the sample size and dominates the complexity measure of the first-stage transformation appearing in (16).
The above condition is closely related to Condition 1, where the generalized IV strength measures the variability of the estimated $f$ after adjusting for the basis matrix $V$ used to approximate $g$. Intuitively, a larger value of the strength indicates that the inequality in Condition 1 holds more plausibly. The generalized IV strength introduced in (22) is proportional to $|A_1|$ when the adjusted variability of $f$ is of a constant scale. Our proposed test in Section 3.3 is designed to test whether the strength is sufficiently large for TSCI. For the setting with the quantity in Condition 1 being a positive constant, the strength can be of the order $n$, making Condition (R2) a mild assumption.
The following proposition establishes the consistency of $\hat\beta(V)$ if Conditions (R1) and (R2) hold and the approximation errors are small.
Proposition 1
4.1 Improvement with bias correction
In the following, we present the distributional properties of the bias-corrected TSCI estimator defined in (19) and demonstrate the advantage of this extra bias correction step. For establishing the asymptotic normality, we further impose a stronger generalized IV strength condition than (R2).
- (R2-I) The generalized IV strength dominates, up to a positive constant, both the complexity measure in (R2) and the additional error quantity defined as
(30)
The above condition can be viewed as requiring the IV to be sufficiently strong, where the quantity appearing in the condition can be viewed as another measure of IV strength. If the treatment model is fitted with a neural network, a sharper bound on the complexity measure is available; for random forests, only a weaker bound holds. The quantity defined in (30) depends on the accuracy of the ML prediction model $\hat f$ and vanishes when $\hat f$ is consistent for $f$. However, even for an inconsistent $\hat f$, this quantity is of a smaller order of magnitude than the generalized IV strength as long as the adjusted variability of $f$ is of a constant scale.
The following theorem establishes the asymptotic normality of the TSCI estimator.
Theorem 1
We emphasize that the validity of our proposed confidence interval does not require the ML prediction model to be a consistent estimator of $f$. Particularly, Conditions (R2) and (R2-I) can be plausibly satisfied as long as the ML algorithms capture enough association between the treatment and the IVs. The condition (31) is imposed such that no single entry of the relevant weight vector dominates all other entries, which is needed for verifying the Lindeberg condition. The standard error relies on the generalized IV strength for a given matrix $V$. If the error variance is of constant order, then the standard error scales inversely with the square root of the generalized IV strength. A larger matrix $V$ will generally lead to a larger standard error because the generalized IV strength decreases after adjusting for more information contained in $V$. The consistency of the standard error estimator is presented in Lemma 2 in the supplement.
We now explain the effectiveness of the bias correction for the homoscedastic setting with $\mathbb{E}[\epsilon_i\delta_i \mid Z_i, X_i] = \mathbb{E}[\epsilon_i\delta_i]$. For the initial estimator, we can establish an error expansion whose leading bias term is characterized in
(32)
for some positive constant. Our proposed bias-corrected estimator is effective in reducing this bias component. Particularly, Theorem 4 in Section A.10 in the supplement establishes that the corresponding term in (32) is reduced by a factor that vanishes when the ML prediction is accurate. If this factor is small, which can be achieved for a consistent ML prediction $\hat f$, the bias-corrected TSCI estimator effectively reduces the finite-sample or higher-order bias. However, even if $\hat f$ is inconsistent with $f$, the bias correction will not lead to a worse estimator.
4.2 Guarantee for Algorithm 1
We now justify the validity of the confidence interval from the TSCI Algorithm 1, which is based on the asymptotic normality of the estimator with a data-dependent index $\hat q \in \{q_c, q_r\}$. The property of $\hat q$ relies on a careful analysis of the difference $\hat\beta(V_q) - \hat\beta(V_{q'})$ for $0 \le q < q' \le Q_{\max}$.
We start with the setting where both $V_q$ and $V_{q'}$ provide good approximations to $g$, leading to sufficiently small approximation errors $R(V_q)$ and $R(V_{q'})$. In this case, the dominating component of $\hat\beta(V_q) - \hat\beta(V_{q'})$ is the random error term defined as
(33)
Conditioning on the data in $A_2$ and on the IVs and covariates in $A_1$, this dominating component has zero mean and variance
(34)
The following condition requires that the variance of this dominating component dominates the other components of $\hat\beta(V_q) - \hat\beta(V_{q'})$, which are mainly finite-sample approximation errors.
- (R3)
When the first stage is fitted with the basis method or a neural network, sharper bounds on the relevant error terms are available. If we additionally assume a polynomial lower bound on the generalized IV strength, then Condition (R3) is satisfied up to some polynomial order of $n$. In this case, Condition (R3) is slightly stronger than Condition (R2-I).
The following theorem establishes the asymptotic normality of the difference $\hat\beta(V_q) - \hat\beta(V_{q'})$ under the null setting where both approximation errors are small.
Theorem 2
For small approximation errors $R(V_q)$ and $R(V_{q'})$, the condition of Theorem 2 holds. In this case, Theorem 2 establishes that the difference $\hat\beta(V_q) - \hat\beta(V_{q'})$ is centered at zero and has an asymptotic normal distribution.
We now move on to the case that at least one of $V_q$ and $V_{q'}$ does not approximate $g$ well. In this case, Theorem 2 does not hold, and we establish in Theorem 5 in the supplement that the difference $\hat\beta(V_q) - \hat\beta(V_{q'})$ is centered at
(35)
To simultaneously quantify the errors of all pairwise comparisons, we define the quantile for the maximum of the multiple random error components in (33),
(36)
We introduce the following condition for accurate selection among $\{V_q\}_{0 \le q \le Q_{\max}}$.
- (R4)
The above condition ensures that there exists $q^{*}$ such that the function $g$ is well approximated by the column space of $V_{q^{*}}$. The well-separation condition (37) is interpreted as follows: if $g$ is not well approximated by the column space of $V_q$, then the estimation bias of $\hat\beta(V_q)$ is larger than the uncertainty due to the multiple random error components. Note that the quantity defined in (35) is a measure of the bias of $\hat\beta(V_q)$.
The following theorem guarantees the coverage property for the CIs corresponding to $q_c$ and $q_r$ in Algorithm 1.
Theorem 3
The condition (37) of (R4) is critical to guarantee the selection consistency $q_c = q^{*}$, ensuring that the CI based on $q_c$ achieves the desired coverage. However, in finite samples, we may make mistakes in conducting the selection among $\{V_q\}$. The selection $q_r$ is more robust in the sense that statistical inference based on $\hat\beta(V_{q_r})$ is still valid even if we cannot separate $V_{q^{*}}$ and $V_{q^{*}+1}$. To achieve uniform inference without requiring the well-separation condition (37), we may simply apply TSCI with $V_{Q_{\max}}$. However, such a confidence interval might be conservative by adjusting for a large basis set.
5 Simulation studies
In Section 5.1, we consider the valid IV settings and compare our proposal to the DML estimator proposed in Chernozhukov et al. (2018). In Sections 5.2 and 5.3, we demonstrate our proposal for general invalid IV settings and also compare it with the DML estimator. In Section D.3 in the supplement, we further consider multiple IVs settings and compare TSCI with existing methods based on the majority rule. We compute all measures using 500 simulations throughout the numerical studies. The code for replicating the numerical results is available at https://github.com/zijguo/TSCI-Replication.
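All reported measures are Monte Carlo summaries over 500 replications; the sketch below shows the generic evaluation loop, with `simulate_dataset` and `method_fit` as hypothetical stand-ins for the data-generating settings and estimators (TSCI, DML, TSLS) compared in this section.

```python
import numpy as np

def evaluate(method_fit, simulate_dataset, beta_true, n_rep=500, alpha=0.05):
    """Sketch of the evaluation loop behind the reported measures: empirical
    coverage of the (1 - alpha) CI, RMSE of the point estimate, and average CI
    length over n_rep Monte Carlo replications. `method_fit` and
    `simulate_dataset` are hypothetical callables standing in for the estimators
    and data-generating settings compared in this section."""
    estimates, covered, lengths = [], [], []
    for rep in range(n_rep):
        data = simulate_dataset(seed=rep)
        beta_hat, (lo, hi) = method_fit(data)
        estimates.append(beta_hat)
        covered.append(lo <= beta_true <= hi)
        lengths.append(hi - lo)
    rmse = float(np.sqrt(np.mean((np.array(estimates) - beta_true) ** 2)))
    return {"coverage": float(np.mean(covered)),
            "rmse": rmse,
            "ci_length": float(np.mean(lengths))}
```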
5.1 Comparison with DML in valid IV settings
We focus on the valid IV settings and illustrate the advantages and limitations of both TSCI and DML. We adopt the data-generating setting used in the R package dmlalg (Emmenegger, 2021) and implement DML by the R package DoubleML (Bach et al., 2021). We generate the baseline covariate following a uniform distribution and, conditioning on the covariate, generate the IV and the hidden confounder. We generate the outcome and treatment using the SEM in (9). Particularly, we generate the outcome from a model that is linear in the treatment and consider the following two treatment models,
• Setting S1:
• Setting S2:
where a scalar parameter controls the nonlinearity of the treatment model and the strength of the IV, and the remaining error terms are independent noises. Note that a larger value of this parameter in the treatment model leads to a more nonlinear dependence between the treatment and the IV. The treatment model and the outcome model are consistent with (4) and (3), with the error terms being composed of a hidden confounder term and an independent random noise.
In addition to the above two settings, we consider another setting where the covariate effect in the outcome model is more complicated. We generate the baseline covariates independently, each following a uniform distribution. Conditioning on the covariates, we generate the IV and the hidden confounder, and we generate the treatment following
• Setting S3:
where a scalar parameter controls the linear association strength in Setting S3. We implement Algorithm 1 with random forests by specifying a collection of basis functions for $g$. Since the IV is valid here, we use the set of basis functions consisting of a B-spline basis with 5 knots for each baseline covariate.
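One convenient way to build such a covariate B-spline basis in Python is scikit-learn's `SplineTransformer`; the sketch below uses 5 knots per covariate plus an intercept, while the paper's exact spline specification (degree, knot placement) may differ.

```python
import numpy as np
from sklearn.preprocessing import SplineTransformer

def covariate_spline_basis(X, n_knots=5, degree=3):
    """Sketch of the covariate basis used when the IV is treated as valid:
    a B-spline basis with 5 knots for each baseline covariate, plus an
    intercept. SplineTransformer is one convenient way to build it; the
    paper's exact spline specification may differ."""
    X = np.asarray(X).reshape(len(X), -1)
    spline = SplineTransformer(n_knots=n_knots, degree=degree, include_bias=False)
    basis = spline.fit_transform(X)           # column-wise B-spline expansion
    return np.column_stack([np.ones(len(X)), basis])

# toy usage
X = np.random.default_rng(3).uniform(size=(200, 2))
print(covariate_spline_basis(X).shape)        # (200, 1 + 2 * (n_knots + degree - 1))
```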
[Figure 2: coverage, RMSE, and confidence interval length of TSCI and DML under Settings S1, S2, and S3.]
In Figure 2, we compare DML and TSCI under Settings S1, S2, and S3. In the top two panels, TSCI achieves the desired coverage level in Settings S1 and S2. Even though DML achieves the desired coverage in Setting S1, the RMSE and CI length of DML are uniformly larger than those of our proposed TSCI. As discussed in Section 3.5, this happens since DML only uses the linear association in the treatment model, while TSCI can increase the IV strength by applying ML algorithms to fit nonlinear treatment models. We have designed Setting S2 as a less favorable setting for DML, where the linear association between the treatment and IV is nearly zero but the nonlinear association is strong. Consequently, the CI length and RMSE of the DML procedure are much larger than those of our proposed TSCI. In Setting S3, TSCI does not show an advantage over DML and has an under-coverage problem. This is caused by the estimation bias due to the misspecification of the basis functions for the covariate effect in the outcome model. DML is not exposed to such bias because it uses ML algorithms to fit the conditional means given the covariates. We also implement TSLS in Settings S1, S2, and S3, but we do not include it in the figures because it has low coverage due to the misspecification of the nonlinear outcome components.
5.2 TSCI with invalid IVs
We now demonstrate our TSCI method under the general setting with possibly invalid IVs. In the following, we focus on the continuous treatment and will investigate the performance of TSCI for binary treatment in Section D.5 in the supplement. We generate the underlying variables following a multivariate normal distribution with zero mean and a structured covariance matrix. With $\Phi$ denoting the standard normal cumulative distribution function, we transform these variables by $\Phi$ to obtain the baseline covariates, and we generate a continuous IV accordingly.
We consider the following two conditional mean models for the treatment,
• Setting B1:
• Setting B2:
The value of the constant $a$ controls the interaction strength between the IV and the first five variables of the baseline covariates, and when $a = 0$, the interaction term disappears. We vary $a$ across $\{0, 0.5, 1\}$ and the sample size $n$ across $\{1000, 3000, 5000\}$ to observe the performance.
We consider the outcome model (3) with two forms of $g$: (a) Vio=1 and (b) Vio=2, corresponding to two increasingly complex violation forms. We set the errors as heteroscedastic following Bekker and Crudu (2015), with the conditional error variances depending on the observed IV and covariates.
We shall implement TSCI with random forests as detailed in Algorithm 1. Since the IV is possibly invalid, we consider four possible basis sets $\{V_q\}_{0 \le q \le 3}$ to approximate $g$, where $V_0$ corresponds to the valid IV setting and $V_q$ with $q \ge 1$ additionally adjusts for polynomials of the IV up to order $q$. Our proposed TSCI is designed to choose the best among them; a sketch of constructing such nested candidate sets is given below.
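The sketch below builds nested candidate sets $\{V_0, \ldots, V_Q\}$ following the polynomial-violation example of Section 3.4; the exact candidate sets used in the simulations may differ.

```python
import numpy as np

def nested_violation_bases(Z, X, Q=3):
    """Sketch of the nested candidate sets {V_0, ..., V_Q} used for the
    data-dependent selection: V_0 contains the intercept and covariates only
    (valid IV), and V_q additionally contains polynomials of the single IV up
    to order q (the polynomial-violation example of Section 3.4)."""
    n = len(Z)
    Z = np.asarray(Z).reshape(n)
    X = np.asarray(X).reshape(n, -1)
    base = np.column_stack([np.ones(n), X])
    V_list = [base]
    for q in range(1, Q + 1):
        poly = np.column_stack([Z ** k for k in range(1, q + 1)])
        V_list.append(np.column_stack([base, poly]))
    return V_list  # V_list[q] is the matrix for V_q
```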
As “naive” benchmarks, we compare TSCI with three different random-forests-based methods in the oracle setting, where the best violation form is known a priori. In particular, RF-Init denotes the TSCI estimator without bias correction; RF-Full implements TSCI without data-splitting; RF-Plug is the estimator obtained by directly plugging the ML-fitted treatment into the outcome model, as a naive generalization of TSLS. We give the detailed expressions of these three estimators and their CIs in Section A.2 in the supplement.
Table 1: Setting B1 with Vio=1. Empirical coverage of 95% confidence intervals for TSCI-RF (with the oracle, comparison, and robust basis choices), the proportion of splittings detecting IV invalidity, the proportions of selecting each candidate basis, and the coverage of TSLS and of the three oracle RF benchmarks.

| vio | a | n | TSCI-RF Oracle | TSCI-RF Comp | TSCI-RF Robust | Invalidity | q̂=0 | q̂=1 | q̂=2 | q̂=3 | TSLS | RF-Init | RF-Plug | RF-Full |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0 | 1000 | 0.91 | 0.01 | 0.01 | 0.01 | 0.99 | 0.01 | 0.00 | 0.00 | 0.00 | 0.80 | 0.38 | 0.01 |
| 1 | 0.0 | 3000 | 0.92 | 0.92 | 0.92 | 1.00 | 0.00 | 0.84 | 0.16 | 0.00 | 0.00 | 0.91 | 0.64 | 0.00 |
| 1 | 0.0 | 5000 | 0.91 | 0.92 | 0.93 | 1.00 | 0.00 | 0.85 | 0.15 | 0.00 | 0.00 | 0.89 | 0.74 | 0.00 |
| 1 | 0.5 | 1000 | 0.91 | 0.23 | 0.25 | 0.24 | 0.76 | 0.24 | 0.01 | 0.00 | 0.00 | 0.84 | 0.56 | 0.02 |
| 1 | 0.5 | 3000 | 0.95 | 0.94 | 0.94 | 1.00 | 0.00 | 0.97 | 0.02 | 0.01 | 0.00 | 0.91 | 0.43 | 0.00 |
| 1 | 0.5 | 5000 | 0.92 | 0.92 | 0.91 | 1.00 | 0.00 | 0.98 | 0.01 | 0.01 | 0.00 | 0.88 | 0.09 | 0.00 |
| 1 | 1.0 | 1000 | 0.96 | 0.92 | 0.92 | 0.95 | 0.05 | 0.93 | 0.01 | 0.00 | 0.00 | 0.91 | 0.52 | 0.08 |
| 1 | 1.0 | 3000 | 0.94 | 0.94 | 0.95 | 1.00 | 0.00 | 0.99 | 0.01 | 0.00 | 0.00 | 0.92 | 0.00 | 0.00 |
| 1 | 1.0 | 5000 | 0.94 | 0.94 | 0.94 | 1.00 | 0.00 | 0.98 | 0.02 | 0.01 | 0.00 | 0.92 | 0.00 | 0.00 |
In Table 1, we compare the coverage properties of our proposed TSCI, TSLS, and the above three random-forests-based methods with Vio=1 in the outcome model. We observe that TSLS fails to attain the desired coverage probability of 95% due to the presence of the invalid IV; furthermore, RF-Full and RF-Plug have coverages far below 95%, and RF-Init has slightly lower coverage due to the finite-sample bias. In contrast, our proposed TSCI achieves the desired coverage with a relatively strong interaction or a large sample size. This is mainly due to its correct selection of the violation form in those settings. In some settings with a small interaction and a small sample size (like $a \le 0.5$ and $n = 1000$), TSCI fails to select the correct basis set, and the coverage is below 95%. For the outcome model with Vio=2, TSCI can attain the desired coverage as well with a relatively strong interaction or a large sample size; see Table S2 in the supplement.
For Setting B1, we further report the absolute bias and CI length in Tables S3 and S4 in the supplement, respectively. In Section D.4, we consider a binary IV setting, denoted as B3, and observe that Settings B2 and B3 exhibit a similar pattern to those of B1. Setting B2 is easier than B1 in the sense that the generalized IV strength remains relatively large even after adjusting for quadratic violation forms in the basis set .
5.3 TSCI with various nonlinearity levels
In the following, we explore the performance of our proposed TSCI when the identification condition (Condition 1) fails to hold, that is, the conditional mean model $f$ is not more complex than the violation function $g$. To approximate such a regime, we consider the outcome model and the treatment model with different specifications of $f$ detailed in Table 2, where the error terms are independent random noises following the standard normal distribution. In such a generating model, when the nonlinearity parameter gets close to zero, $f$ becomes a linear function of the IV, and Condition 1 fails since $g$ is also linear in the IV.
| Settings | Distribution of the IV and covariates | Treatment model |
|---|---|---|
| C1 | Same as Setting B1 in Section 5.2 | |
| C2 | Same as Setting S3 in Section 5.1 | |
| C3 | Same as Setting B1 in Section 5.2 | |
We fix the sample size and the treatment effect, generate the IV and covariates as detailed in Table 2, and generate the outcome accordingly. We implement TSCI as detailed in Algorithm 1 by specifying the candidate basis sets using the B-spline basis functions defined in Section 5.1.
In Figure 3, we compare TSCI to DML and TSLS when the treatment conditional mean models change from linear to nonlinear; note that, in addition, a larger value of the nonlinearity parameter increases the generalized IV strength. We vary this parameter over a grid, where larger values represent a more nonlinear dependence between the treatment and the IV. When the parameter is 0, Condition 1 fails to hold, and hence TSCI cannot identify the treatment effect when there are invalid IVs. In such settings, TSCI has similar performance to DML and TSLS. However, when the parameter gets slightly larger than zero, TSCI has a better RMSE than DML and TSLS, especially for Settings C2 and C3. When the parameter is sufficiently large (e.g., in Setting C2), TSCI detects the invalid IV and achieves the desired coverage. However, DML and TSLS do not have correct coverage since they are designed for valid IV settings.
[Figure 3: performance (coverage, RMSE, and CI length) of TSCI, DML, and TSLS across nonlinearity levels for Settings C1, C2, and C3.]
We shall point out that, in Setting C2, the confidence intervals by TSCI are shorter than those of DML, even after adjusting for the linear invalid IV forms. This happens since the remaining nonlinear IV strength after adjusting for the invalidity form is even larger than the IV strength due to the linear association.
6 Real data analysis
We revisit the question of the effect of education on income (Card, 1993). We follow Card (1993) and analyze the same data set from the National Longitudinal Survey of Young Men. The outcome is the log wage (lwage), and the treatment is the years of schooling (educ). As argued in Card (1993), there are various reasons why the treatment is endogenous. For example, the unobserved confounder “ability bias” may affect both the schooling years and wages, leading to the OLS estimator having a positive bias. Following Card (1993), we use the indicator of a nearby 4-year college in 1966 (nearc4) as the IV and include the following covariates: a quadratic function of potential experience (exper and expersq), a race indicator (black), dummy variables for residence in a standard metropolitan statistical area in 1976 (smsa) and in 1966 (smsa66), the dummy variable for residence in the south in 1976 (south), and a full set of 8 regional dummy variables (reg1 to reg8). The data set consists of 3,010 observations and is made available by the R package ivmodel (Kang et al., 2021).
We implement random forests for the treatment model and report the variable importance in Figure S4 in the supplement, where the IV nearc4 ranks seventh in terms of variable importance, after the covariates exper, expersq, black, south, smsa, and smsa66.
We allow for different IV violation forms by approximating $g$ with various basis functions detailed in Table 3. Particularly, since $V_0$ does not involve the IV nearc4, it corresponds to the valid IV setting as assumed in the analysis of Card (1993); moreover, $V_1$ and $V_2$ correspond to invalid IV settings by allowing the IV nearc4 to affect the outcome directly or interactively with the baseline covariates. The main difference is that $V_1$ includes the interaction with the six most important covariates, while $V_2$ includes all fourteen covariates; a sketch of assembling these basis sets is given below.
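To make these basis sets concrete, the sketch below assembles $V_0$, $V_1$, and $V_2$ from the named variables, assuming the data have been exported to a hypothetical file `card.csv`; the construction in the paper's replication code may differ.

```python
import numpy as np
import pandas as pd

# Sketch of the candidate bases for the schooling analysis, assuming the data
# set has been exported to a CSV file named "card.csv" (hypothetical path);
# the variable names follow the text (nearc4, exper, expersq, black, smsa,
# smsa66, south, reg1-reg8).
df = pd.read_csv("card.csv")
covariates = ["exper", "expersq", "black", "smsa", "smsa66", "south"] + \
             [f"reg{k}" for k in range(1, 9)]
top_six = ["exper", "expersq", "black", "south", "smsa", "smsa66"]

X = df[covariates].to_numpy()
z = df["nearc4"].to_numpy()
intercept = np.ones((len(df), 1))

V0 = np.column_stack([intercept, X])                                   # valid IV
V1 = np.column_stack([intercept, X, z, z[:, None] * df[top_six].to_numpy()])
V2 = np.column_stack([intercept, X, z, z[:, None] * X])                # all 14 covariates
```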
We implement Algorithm 1 to choose the best from and construct the TSCI estimator. Since the TSCI estimates depend on the specific sample splitting, we report 500 TSCI estimates corresponding to 500 different splittings. In the leftmost panel of Figure 4, we compare the estimates of OLS, TSLS, and TSCI, where TSCI uses reported by Algorithm 1. The median of these 500 TSCI estimates is ; 87.2% of these 500 estimates are smaller than the OLS estimate (0.0747), and 100% of the TSCI estimates are smaller than the TSLS estimate (0.1315). In contrast to the TSLS estimator, the TSCI estimators tend to be smaller than the OLS estimator, which helps correct the positive “ability bias”.

In the rightmost panel of Figure 4, we compare different confidence intervals (CIs). The TSLS CI is and the DML CI is , which are both much wider than the CIs by TSCI. These wide intervals result from the relatively weak IV, because TSLS and DML only exploit the linear association between the IV and the treatment. The CI by TSCI with random forests varies with the specific sample splitting. We follow Meinshausen et al. (2009) and implement the multi-split method to adjust for the finite-sample variation due to sample splitting; see Section A.4 in the supplement. The multi-split CI is (0.0294, 0.0914). The CIs based on the TSCI estimator with random forests are pushed to the lower part of the wide CI by TSLS. The CIs by our proposed TSCI in Algorithm 1 tend to shift down in comparison to those of TSCI assuming valid IVs.
We shall point out that the IV strengths for the TSCI estimators, even after adjusting for the selected basis functions, are typically much larger than that of TSLS (whose concentration parameter is ), as illustrated in the middle panel of Figure 4. This happens because the first-stage ML captures a much stronger association than the linear model in TSLS, even after adjusting for the possible IV violations captured by .
Among the 500 sample splittings, the proportions of choosing and are , , and , respectively, so around of the splittings report nearc4 as an invalid IV. Importantly, of the 500 splittings report , indicating that is not statistically different from that produced by the largest . This suggests that already provides a reasonably good approximation of in the outcome model. In Section E.1 of the supplement, we consider alternative choices of and observe that the results are consistent with those reported above. In Section E.2, we further propose a falsification argument regarding Condition 1 for the data analysis.
7 Conclusion and discussion
We integrate modern ML algorithms into the framework of instrumental variable analysis. We devise a novel TSCI methodology which provides reliable causal conclusions under a wide range of violation forms. Our proposed generalized IV strength measure helps to understand when our proposed method is reliable and supports the selection of basis functions for approximating the violation form of possibly invalid IVs. The current methodology focuses on inference for a linear and constant treatment effect. An interesting future research direction is inference for heterogeneous treatment effects (Athey and Imbens, 2016; Wager and Athey, 2018) in the presence of potentially invalid instruments.
Acknowledgement
We acknowledge valuable suggestions from Xu Cheng, Tirthankar Dasgupta, Qingliang Fan, Michal Kolesár, Greg Lewis, Yuan Liao, Molei Liu, Zhonghua Liu, Zuofeng Shang, and Frank Windmeijer. Z. Guo gratefully acknowledges financial support from NSF DMS 1811857, 2015373, NIH R01GM140463, and financial support for visiting the Institute for Mathematical Research (FIM) at ETH Zurich. P. Bühlmann received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (grant agreement No. 786461).
References
- Amemiya (1974) Amemiya, T. (1974). The nonlinear two-stage least-squares estimator. J. Econom. 2(2), 105–110.
- Angrist et al. (1999) Angrist, J. D., G. W. Imbens, and A. B. Krueger (1999). Jackknife instrumental variables estimation. J. Appl. Econ. 14(1), 57–67.
- Angrist and Lavy (1999) Angrist, J. D. and V. Lavy (1999). Using Maimonides’ rule to estimate the effect of class size on scholastic achievement. Q. J. Econ. 114(2), 533–575.
- Angrist and Pischke (2009) Angrist, J. D. and J.-S. Pischke (2009). Mostly harmless econometrics: An empiricist’s companion. Princeton university press.
- Athey and Imbens (2016) Athey, S. and G. Imbens (2016). Recursive partitioning for heterogeneous causal effects. Proc. Natl. Acad. Sci. 113(27), 7353–7360.
- Athey et al. (2019) Athey, S., J. Tibshirani, and S. Wager (2019). Generalized random forests. Ann. Stat. 47(2), 1148–1178.
- Bach et al. (2021) Bach, P., V. Chernozhukov, M. S. Kurz, and M. Spindler (2021). DoubleML – An object-oriented implementation of double machine learning in R. arXiv:2103.09603 [stat.ML].
- Bekker and Crudu (2015) Bekker, P. A. and F. Crudu (2015). Jackknife instrumental variable estimation with heteroskedasticity. J. Econom. 185(2), 332–342.
- Belloni et al. (2012) Belloni, A., D. Chen, V. Chernozhukov, and C. Hansen (2012). Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica 80(6), 2369–2429.
- Bowden et al. (2015) Bowden, J., G. Davey Smith, and S. Burgess (2015). Mendelian randomization with invalid instruments: effect estimation and bias detection through Egger regression. Int. J. Epidemiol. 44(2), 512–525.
- Bowden et al. (2016) Bowden, J., G. Davey Smith, P. C. Haycock, and S. Burgess (2016). Consistent estimation in Mendelian randomization with some invalid instruments using a weighted median estimator. Genet. Epidemiol. 40(4), 304–314.
- Card (1993) Card, D. (1993). Using geographic variation in college proximity to estimate the return to schooling. Natl. Bur. Econ. Res. Camb, Mass., USA.
- Chen et al. (2023) Chen, J., D. L. Chen, and G. Lewis (2023). Mostly harmless machine learning: learning optimal instruments in linear IV models. J. Mach. Learn. Res., forthcoming.
- Chernozhukov et al. (2018) Chernozhukov, V., D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and J. Robins (2018). Double/debiased machine learning for treatment and structural parameters. J. Econom. 21(1).
- Emmenegger (2021) Emmenegger, C. (2021). dmlalg: Double machine learning algorithms. R-package available on CRAN.
- Emmenegger and Bühlmann (2021) Emmenegger, C. and P. Bühlmann (2021). Regularizing double machine learning in partially linear endogenous models. Electron. J. Stat. 15(2), 6461–6543.
- Guo (2023) Guo, Z. (2023). Causal inference with invalid instruments: post-selection problems and a solution using searching and sampling. J. R. Stat. Soc. Ser. B. 85(3), 959–985.
- Guo et al. (2018) Guo, Z., H. Kang, T. Tony Cai, and D. S. Small (2018). Confidence intervals for causal effects with invalid instruments by using two-stage hard thresholding with voting. J. R. Statist. Soc. B 80(4), 793–815.
- Han (2008) Han, C. (2008). Detecting invalid instruments using L1-GMM. Econ. Lett. 101(3), 285–287.
- Hansen et al. (2008) Hansen, C., J. Hausman, and W. Newey (2008). Estimation with many instrumental variables. J. Bus. Econ. Stat. 26(4), 398–422.
- Hansen (1982) Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators. Econometrica, 1029–1054.
- Hausman et al. (2012) Hausman, J. A., W. K. Newey, T. Woutersen, J. C. Chao, and N. R. Swanson (2012). Instrumental variable estimation with heteroskedasticity and many instruments. Quant. Econom. 3(2), 211–255.
- Heckman (1976) Heckman, J. J. (1976). The common structure of statistical models of truncation, sample selection and limited dependent variables and a simple estimator for such models. In Annals of economic and social measurement, volume 5, number 4, pp. 475–492. NBER.
- Holland (1988) Holland, P. W. (1988). Causal inference, path analysis and recursive structural equations models. ETS Res. Rep. Ser. 1988(1), i–50.
- Kang et al. (2021) Kang, H., Y. Jiang, Q. Zhao, and D. S. Small (2021). ivmodel: an R package for inference and sensitivity analysis of instrumental variables models with one endogenous variable. Obs. Stud. 7(2), 1–24.
- Kang et al. (2016) Kang, H., A. Zhang, T. T. Cai, and D. S. Small (2016). Instrumental variables estimation with some invalid instruments and its application to Mendelian randomization. J. Am. Stat. Assoc. 111(513), 132–144.
- Kelejian (1971) Kelejian, H. H. (1971). Two-stage least squares and econometric systems linear in parameters but nonlinear in the endogenous variables. J. Am. Stat. Assoc. 66(334), 373–374.
- Kolesár et al. (2015) Kolesár, M., R. Chetty, J. Friedman, E. Glaeser, and G. W. Imbens (2015). Identification and inference with many invalid instruments. J. Bus. Econ. Stat. 33(4), 474–484.
- Lewbel (2012) Lewbel, A. (2012). Using heteroscedasticity to identify and estimate mismeasured and endogenous regressor models. J. Bus. Econ. Stat. 30(1), 67–80.
- Lewbel (2019) Lewbel, A. (2019). The identification zoo: Meanings of identification in econometrics. J. Econ. Lit. 57(4), 835–903.
- Lin and Jeon (2006) Lin, Y. and Y. Jeon (2006). Random forests and adaptive nearest neighbors. J. Am. Stat. Assoc. 101(474), 578–590.
- Liu et al. (2020) Liu, R., Z. Shang, and G. Cheng (2020). On deep instrumental variables estimate. arXiv preprint arXiv:2004.14954.
- Meinshausen (2006) Meinshausen, N. (2006). Quantile regression forests. J. Mach. Learn. Res. 7(Jun), 983–999.
- Meinshausen et al. (2009) Meinshausen, N., L. Meier, and P. Bühlmann (2009). P-values for high-dimensional regression. J. Am. Stat. Assoc. 104(488), 1671–1681.
- Newey (1990) Newey, W. K. (1990). Efficient instrumental variables estimation of nonlinear models. Econometrica, 809–837.
- Puhani (2000) Puhani, P. (2000). The heckman correction for sample selection and its critique. J. Econ. Surv. 14(1), 53–68.
- Rothenberg (1984) Rothenberg, T. J. (1984). Approximating the distributions of econometric estimators and test statistics. Handb. Econom. 2, 881–935.
- Rubin (1974) Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. J. Educ. Psychol. 66(5), 688.
- Sargan (1958) Sargan, J. D. (1958). The estimation of economic relationships using instrumental variables. Econometrica, 393–415.
- Shardell and Ferrucci (2016) Shardell, M. and L. Ferrucci (2016). Instrumental variable analysis of multiplicative models with potentially invalid instruments. Stat. Med. 35(29), 5430–5447.
- Small (2007) Small, D. S. (2007). Sensitivity analysis for instrumental variables regression with overidentifying restrictions. J. Am. Stat. Assoc. 102(479), 1049–1058.
- Splawa-Neyman et al. (1990) Splawa-Neyman, J., D. M. Dabrowska, and T. Speed (1990). On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Stat. Sci., 465–472.
- Staiger and Stock (1997) Staiger, D. and J. H. Stock (1997). Instrumental variables regression with weak instruments. Econometrica 65(3), 557–586.
- Stock et al. (2002) Stock, J. H., J. H. Wright, and M. Yogo (2002). A survey of weak instruments and weak identification in generalized method of moments. J. Bus. Econ. Stat. 20(4), 518–529.
- Sun et al. (2023) Sun, B., Z. Liu, and E. J. Tchetgen Tchetgen (2023, 02). Semiparametric efficient G-estimation with invalid instrumental variables. Biometrika 110(4), 953–971.
- Tchetgen et al. (2021) Tchetgen, E. T., B. Sun, and S. Walter (2021). The GENIUS approach to robust Mendelian randomization inference. Stat. Sci. 36(3), 443–464.
- Ten Have et al. (2007) Ten Have, T. R., M. M. Joffe, K. G. Lynch, G. K. Brown, S. A. Maisto, and A. T. Beck (2007). Causal mediation analyses with rank preserving models. Biometrics 63(3), 926–934.
- Wager and Athey (2018) Wager, S. and S. Athey (2018). Estimation and inference of heterogeneous treatment effects using random forests. J. Am. Stat. Assoc. 113(523), 1228–1242.
- Windmeijer et al. (2019) Windmeijer, F., H. Farbmacher, N. Davies, and G. Davey Smith (2019). On the use of the lasso for instrumental variables estimation with some invalid instruments. J. Am. Stat. Assoc. 114(527), 1339–1350.
- Windmeijer et al. (2021) Windmeijer, F., X. Liang, F. P. Hartwig, and J. Bowden (2021). The confidence interval method for selecting valid instrumental variables. J. R. Statist. Soc. B 83(4), 752–776.
- Woutersen and Hausman (2019) Woutersen, T. and J. A. Hausman (2019). Increasing the power of specification tests. J. Econom. 211(1), 166–175.
Appendix A Additional Methods and Theories
A.1 Identification in random experiments with non-compliance and invalid IVs
In the following, we explain how a nonlinear treatment model helps identify the causal effect in randomized experiments with non-compliance and possibly invalid IVs. For the -th subject, we use to denote the random treatment assignment (which serves as an instrument) and to denote whether the subject actually takes the treatment or not. We use to denote a binary covariate and to denote the potential outcome for a subject with the assignment , the treatment value , and the covariate .
A primary concern is that may be an invalid IV even if it is randomly assigned. To facilitate the discussion, we follow \citetsupp{imbens2014instrumental} and adopt the example from \citetsupp{mcdonald1992effects}, which studies the effect of influenza vaccination on being hospitalized for flu-related reasons. The researchers randomly sent letters to physicians encouraging them to vaccinate their patients. In this example, denotes the indicator for the physician receiving the encouragement letter, while stands for the patient being vaccinated or not. The outcome denotes whether the patient is hospitalized for a flu-related reason. As pointed out in \citetsupp{imbens2014instrumental}, the physician receiving the letter may “take actions that affect the likelihood of the patient getting infected with the flu other than simply administering the flu vaccine”, indicating that may have a direct effect on the outcome and violate the classical IV assumptions.
For a covariate , we define the intention-to-treat (ITT) effect on the outcome as and the ITT effect on the treatment as . In the following proposition, we assume a constant effect and demonstrate treatment effect identification even if the IV directly affects the outcome (violating assumption (A3)).
Proposition 2
Suppose that (I1) , (I2) is randomly assigned, (I3) for , and (I4) . Then we identify as
The proof of the above proposition is presented in Section B.1. Condition (I4) assumes a constant effect, while (I1)-(I3) are relaxations of the classical IV assumptions (A1)-(A3), respectively. Particularly, the random assignment required by (I2) implies (A2), and (I2) is satisfied in randomized experiments with non-compliance. Condition (I3) allows the IV to affect the outcome directly but assumes that the IV’s direct effect does not interact with the baseline covariate . When there exists a direct effect, the identification of relies on Condition (I1). Condition (I1) essentially requires that the treatment is interactively determined by and : a special example is given by , where denotes a random variable independent of and . Such identification conditions have been proposed in Shardell and Ferrucci (2016) and Ten Have et al. (2007). Particularly, Shardell and Ferrucci (2016) pointed out that is referred to as the compliance score, and Condition (I1) requires that the compliance score changes with the covariate value .
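To make the algebra behind Proposition 2 transparent, the following short derivation sketch uses generic notation of our own (write $\beta$ for the constant effect, $h$ for the IV’s direct effect on the outcome, and $\mathrm{ITT}_Y(x)$, $\mathrm{ITT}_D(x)$ for the two ITT effects at covariate value $x\in\{0,1\}$); the exact statement is given in Proposition 2 in the paper’s own notation.
\begin{align*}
\mathrm{ITT}_Y(x) &= \beta\,\mathrm{ITT}_D(x) + h
   && \text{constant effect plus a direct effect not interacting with } x,\\
\mathrm{ITT}_Y(1)-\mathrm{ITT}_Y(0) &= \beta\,\bigl[\mathrm{ITT}_D(1)-\mathrm{ITT}_D(0)\bigr]
   && \text{the direct effect } h \text{ cancels},\\
\beta &= \frac{\mathrm{ITT}_Y(1)-\mathrm{ITT}_Y(0)}{\mathrm{ITT}_D(1)-\mathrm{ITT}_D(0)}
   && \text{well defined since the denominator is nonzero by (I1).}
\end{align*}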
We now provide some discussion on the plausibility of Condition (I1). We again use the vaccination example from \citetsupp{mcdonald1992effects} and imagine that we have access to the covariate , an indicator of whether the patient took regular flu shots in the past few years. Patients who regularly take flu shots (with ) are more likely to follow the physician’s encouragement than those who do not (with ), indicating . However, regarding the concern that the physician may “take actions that affect the likelihood of the patient getting infected with the flu other than simply administering the flu vaccine” \citepsupp{imbens2014instrumental}, such a direct effect of might not interact with .
A.2 Pitfalls with the naive machine learning
As the benchmarks in Section 5.2, we also include the point estimators , and together with their corresponding standard errors. We construct the 95% confidence interval as
RF-Init. We compute as in (16) and calculate
RF-Plug and its pitfall. A simple idea for combining TSLS with ML is to directly use in the second stage. We may calculate such a two-stage estimator as , where the second-stage regression is implemented by regressing on and . We compute the standard error as . \citetsupp{angrist2009mostly}[Sec. 4.6.1] criticized such use of the nonlinear first-stage model under the name of “forbidden regression”, and \citetsupp{chen2020mostly} pointed out that is biased since in (12) is not a projection matrix for random forests.
RF-Full and its pitfall. Sample splitting is essential for removing the endogeneity of the ML-predicted values. Without sample splitting, the ML-predicted value of the treatment can be close to the treatment itself (due to overfitting) and hence highly correlated with the unmeasured confounders, leading to a TSCI estimator that suffers from a significant bias. In the following, we take the non-split random forests as an example. Similarly to defined in (12), we define the transformation matrix for random forests constructed with the full data. As a modification of (16), we consider the corresponding TSCI estimator,
(38) |
where . We calculate its standard error as
The simulation results in Section 5 show that the estimator in (38) suffers from a large bias and the resulting confidence interval does not achieve the desired coverage.
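The endogeneity caused by overfitting can be seen in a small simulation; the sketch below is our own illustration with a hypothetical treatment model, not the paper’s exact design. It fits a random forest to the treatment on the full sample and on a held-out split, and compares how strongly the fitted values correlate with the unobserved error that stands in for the unmeasured confounder.
\begin{verbatim}
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n = 2000
Z = rng.normal(size=(n, 1))
delta = rng.normal(size=n)                   # treatment error, standing in for the confounder
D = np.sin(2 * Z[:, 0]) + Z[:, 0] + delta    # hypothetical treatment model

# Full-sample fit: the prediction partly memorizes delta (overfitting).
rf_full = RandomForestRegressor(n_estimators=200, random_state=0).fit(Z, D)
fit_full = rf_full.predict(Z)

# Split fit: train on the second half, predict the first half out of sample.
half = n // 2
rf_split = RandomForestRegressor(n_estimators=200, random_state=0).fit(Z[half:], D[half:])
fit_split = rf_split.predict(Z[:half])

print("corr(full-sample fit, error):  ", round(np.corrcoef(fit_full, delta)[0, 1], 3))
print("corr(out-of-sample fit, error):", round(np.corrcoef(fit_split, delta[:half])[0, 1], 3))
# The first correlation is far from zero; this is precisely the bias source
# that sample splitting removes.
\end{verbatim}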
A.3 Choice of in (27)
A.4 Finite-sample adjustment of uncertainty from data splitting
Even though our asymptotic theory in Section 4 is valid for any random sample splitting, the constructed point estimators and confidence intervals do vary across different sample splittings in finite samples. This randomness due to sample splitting has also been observed in double machine learning \citepsupp{chernozhukov2018double} and multi-splitting \citepsupp{meinshausen2009p}. Following \citetsupp{chernozhukov2018double} and \citetsupp{meinshausen2009p}, we introduce a confidence interval which aggregates multiple confidence intervals obtained from different splittings. Consider random splittings and, for the -th splitting, use and to denote the corresponding TSCI point and standard error estimators, respectively. Following Section 3.4 of \citetsupp{chernozhukov2018double}, we introduce the median estimator together with and construct the median confidence interval as . Alternatively, following \citetsupp{meinshausen2009p}, we construct the p-values and , where is the CDF of the standard normal distribution. We define the multi-splitting confidence interval as ; see equation (2.2) of \citetsupp{meinshausen2009p}.
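As a concrete illustration of this aggregation step, the sketch below combines per-split point estimates and standard errors in two hedged ways: a median-style confidence interval in the spirit of Section 3.4 of the DML paper, and a simplified multi-split interval that inverts an aggregated p-value using the “twice the median p-value” rule, a conservative special case of the Meinshausen et al. (2009) aggregation. The exact formulas used in the paper may differ; the function names and example numbers are ours.
\begin{verbatim}
import numpy as np
from scipy.stats import norm

def median_ci(est, se, alpha=0.05):
    """Median aggregation across splits (a sketch in the spirit of DML, Sec. 3.4)."""
    est, se = np.asarray(est), np.asarray(se)
    point = np.median(est)
    # Inflate each split's variance by its deviation from the median estimate.
    agg_se = np.sqrt(np.median(se**2 + (est - point)**2))
    z = norm.ppf(1 - alpha / 2)
    return point, (point - z * agg_se, point + z * agg_se)

def multisplit_ci(est, se, grid, alpha=0.05):
    """Invert aggregated p-values over a grid of candidate effects (simplified rule)."""
    est, se = np.asarray(est), np.asarray(se)
    kept = []
    for b0 in grid:
        p = 2 * (1 - norm.cdf(np.abs(est - b0) / se))   # per-split p-values
        if min(1.0, 2 * np.median(p)) >= alpha:         # "twice the median" rule
            kept.append(b0)
    return (min(kept), max(kept)) if kept else None

# Example with made-up split results:
est = [0.058, 0.062, 0.055, 0.065, 0.060]
se = [0.016, 0.015, 0.017, 0.016, 0.015]
print(median_ci(est, se))
print(multisplit_ci(est, se, grid=np.linspace(0.0, 0.12, 241)))
\end{verbatim}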
A.5 Choices of Violation Basis Functions for Multiple IVs
When there are multiple IVs, there are more choices for specifying the violation form, and the candidate basis sets are not necessarily nested. For example, when we have two IVs and , we may set , and then the basis sets are not nested. However, even in such a case, our proposed selection method is still applicable. Specifically, for any , we compare the estimator generated by to that generated by with and
A.6 TSCI with boosting
In the following, we demonstrate how to express the boosting estimator \citepsupp{buhlmann2007boosting, buhlmann2006sparse} in the form of (12). The boosting methods aggregate a sequence of base procedures . For , we construct the base procedure using the data in and compute the estimated values given by the -th base procedure . With and the step-length factor (the default being ), we conduct the sequential updates,
where is a hat matrix determined by the base procedures.
With the stopping time , the boosting estimator is . We now compute the transformation matrix . Set , and for we define
(40) |
Define and write which is the desired form in (12).
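For concreteness, here is a small numerical sketch of how the sequential residual-fitting updates yield a single transformation matrix of this form: after M steps the fitted values equal [I - prod_m (I - nu*H_m)] y, where H_m is the hat matrix of the m-th base procedure. The base procedure used below (componentwise linear least squares) and all names are our own illustrative choices, not the paper’s Algorithm 2.
\begin{verbatim}
import numpy as np

def boosting_transformation(W, y, nu=0.1, M=100):
    """L2-boosting with componentwise linear base learners (a sketch).

    Returns the matrix Omega such that the boosted fit equals Omega @ y.
    """
    n, p = W.shape
    resid_op = np.eye(n)                       # maps y to the current residual
    for _ in range(M):
        r = resid_op @ y
        # Select the single column whose hat matrix best fits the current residual.
        best_H, best_rss = None, np.inf
        for j in range(p):
            w = W[:, [j]]
            H = w @ np.linalg.pinv(w)          # hat matrix of base procedure j
            rss = np.sum((r - H @ r) ** 2)
            if rss < best_rss:
                best_H, best_rss = H, rss
        # Residual update: r_m = (I - nu * H_m) r_{m-1}.
        resid_op = (np.eye(n) - nu * best_H) @ resid_op
    return np.eye(n) - resid_op                # fitted values = Omega @ y

# Tiny usage example with synthetic data:
rng = np.random.default_rng(2)
W = rng.normal(size=(100, 3))
y = W[:, 0] - 0.5 * W[:, 2] + 0.3 * rng.normal(size=100)
Omega = boosting_transformation(W, y)
fit = Omega @ y
print("effective degrees of freedom (trace):", round(np.trace(Omega), 2))
print("in-sample R^2:", round(1 - np.sum((y - fit)**2) / np.sum((y - y.mean())**2), 3))
\end{verbatim}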
In the following, we give two examples of the construction of the base procedures and provide the detailed expression for used in (40). We write and define the matrix with its -th row as . An important step in building the base procedures is variable selection; that is, each base procedure is constructed based only on a subset of the covariates
Pairwise regression.
In Algorithm 2, we describe the construction of with the pairwise regression and the pairwise thin plate as the base procedures. For both base procedures, we need to specify how to construct the basis functions in steps 3 and 8. For the pairwise regression, we set the first element of as and define for . Then for step 3, we define as the projection matrix onto the vector ; for step 8, we define
For the pairwise thin plate, we follow Chapter 7 of \citetsupp{green1993nonparametric} to construct the projection matrix. For , we define the matrix . For , define ; for , define
Then we define . In step 3, compute as the first columns of . For step 8, we compute
and set as the first columns of
Input: Data ; the step-length factor ; the stopping time
Output:
Decision tree.
In Algorithm 3, we describe the construction of with the decision tree as the base procedure.
Input: Data ; the step-length factor ; the stopping time
Output:
A.7 TSCI with deep neural network
In the following, we demonstrate how to calculate for a deep neural network \citepsupp{james2013introduction}. We define the first hidden layer as for , where is the activation function and are parameters. For , we define the -th hidden layer as for , where are unknown parameters. For given , we estimate the unknown parameters based on the data ,
With for and , we define
with for . We use to denote the projection onto the column space of the matrix and express the deep neural network estimator as . With , we define , which is shown to be an orthogonal projection matrix in Lemma 1 in the supplement.
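The sketch below illustrates this construction with scikit-learn’s MLPRegressor (the choice of library, network size, and data are assumptions; the paper states the construction abstractly): it recovers the last hidden-layer features from the fitted network and forms the orthogonal projection onto their column space plus an intercept, which plays the role of the transformation matrix.
\begin{verbatim}
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(3)
n = 400
W = rng.normal(size=(n, 2))
D = np.tanh(W[:, 0]) + 0.5 * W[:, 1] + 0.3 * rng.normal(size=n)

nn = MLPRegressor(hidden_layer_sizes=(16, 8), activation="relu",
                  max_iter=5000, random_state=0).fit(W, D)

def last_hidden_features(model, X):
    """Propagate X through all hidden layers of a fitted MLPRegressor (ReLU)."""
    H = X
    for coef, bias in zip(model.coefs_[:-1], model.intercepts_[:-1]):
        H = np.maximum(H @ coef + bias, 0.0)   # hidden-layer activations
    return H

M = np.column_stack([np.ones(n), last_hidden_features(nn, W)])
Omega = M @ np.linalg.pinv(M)                  # projection onto the feature space
print("Omega is (numerically) a projection:", np.allclose(Omega @ Omega, Omega, atol=1e-6))
print("trace(Omega), i.e. the effective dimension:", round(np.trace(Omega), 2))
\end{verbatim}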
A.8 TSCI with B-spline
As a simplification, we consider the additive model and assume that can be well approximated by a set of B-spline basis functions . In practice, the number can be chosen by cross-validation. We define the matrix with its -th row . Without loss of generality, we approximate by , which is generated by the same set of basis functions for . Define as the projection matrix onto the space spanned by the columns of and . We write the first-stage estimator as and compute with . The transformation matrix is a projection matrix with . When the number of basis functions is small and the degree of freedom is much smaller than , sample splitting is not even needed when the B-spline is used for fitting the treatment model, which is different from the general machine learning algorithms.
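A minimal sketch of this construction using scikit-learn’s SplineTransformer is given below (the library, the knot and degree settings, and the simulated data are assumptions): build a B-spline basis for the instrument, append the covariates, and take the associated projection matrix as the transformation matrix.
\begin{verbatim}
import numpy as np
from sklearn.preprocessing import SplineTransformer

rng = np.random.default_rng(4)
n = 500
Z = rng.normal(size=(n, 1))
X = rng.normal(size=(n, 2))
D = np.sin(Z[:, 0]) + X @ np.array([0.5, -0.3]) + 0.3 * rng.normal(size=n)

# B-spline basis for the instrument; covariates enter linearly here (a simplification).
spline = SplineTransformer(degree=3, n_knots=7, include_bias=False)
B = np.column_stack([np.ones(n), spline.fit_transform(Z), X])

Omega = B @ np.linalg.pinv(B)   # projection matrix; first-stage fit is Omega @ D
D_hat = Omega @ D
print("degrees of freedom (trace of Omega):", round(np.trace(Omega), 1))
print("first-stage R^2:", round(1 - np.sum((D - D_hat)**2) / np.sum((D - D.mean())**2), 3))
\end{verbatim}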
A.9 Properties of
The following lemma is about the property of the transformation matrix , whose proof can be found in Section C.4. Recall that
where is defined in (14).
Lemma 1
The transformation matrix satisfies
(41) |
As a consequence, we establish . The transformation matrices and are orthogonal projection matrices with
(42) |
The transformation matrix satisfies
(43) |
for any As a consequence, we establish
A.10 Homoscedastic correlation
In the following, we discuss the simplified method and theory in the homoscedastic correlation setting, that is, . With this extra assumption, we present the following bias-corrected estimator as an alternative to the estimator defined in (19),
where is defined in (16) and the estimator of is defined as,
with denoting the rank of the matrix . The correction in constructing implicitly requires which might limit practical applications.
We present a simplified version of Theorem 6 by assuming homoscedastic correlation. We shall only present the results for TSCI with random forests, but the extension to general machine learning methods is straightforward.
Theorem 4
As a remark, if , , and then we have
A.11 Consistency of variance estimators
The following lemma controls the variance consistency, whose proof can be found in Section C.5.
Lemma 2
Suppose that Conditions (R1) and (R2) hold and with If
then we have
A.12 Size and Power of
Appendix B Proofs
We establish Proposition 2 in Section B.1. In Section B.2, we establish Proposition 1. In Section B.3, we establish Theorems 6 and 1. We prove Theorems 2 and 5 in Section B.4. We establish Theorem 3 in Section B.5 and prove Theorem 4 in Section B.6.
We define the conditional covariance matrices as and . For any and , we have . Since for , we have and is a diagonal matrix. Similarly, we can show that and are diagonal matrices. The conditional sub-gaussian condition in (R1) implies that . We shall use to denote the set of random variables and .
We introduce the following lemma about the concentration of quadratic forms, which is Theorem 1.1 in \citetsupp{rudelson2013hanson}.
Lemma 3
(Hanson-Wright inequality) Let be a random vector with independent sub-gaussian components with zero mean and sub-gaussian norm . Let be an matrix. For every ,
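In generic notation (ours, which may differ from the paper’s), the standard statement of this inequality reads:
\[
\mathbb{P}\Bigl(\bigl|X^\top A X - \mathbb{E}\,X^\top A X\bigr| > t\Bigr)
\;\le\; 2\exp\!\left[-c\,\min\!\left(\frac{t^2}{K^4\|A\|_F^2},\;\frac{t}{K^2\|A\|}\right)\right]
\quad\text{for all } t\ge 0,
\]
where $X=(X_1,\ldots,X_n)$ has independent mean-zero sub-gaussian components with sub-gaussian norms bounded by $K$, $A$ is an $n\times n$ matrix, $\|A\|_F$ and $\|A\|$ denote the Frobenius and operator norms, and $c>0$ is an absolute constant.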
Note that . If both and are sub-gaussian, we apply the union bound and Lemma 3 to both and , and then establish
(49) |
B.1 Proof of Proposition 2
B.2 Proof of Proposition 1
Recall that we are analyzing defined in (16) with replaced by . We have decomposed the estimation error as
(53) |
The following Lemma controls the key components of the above decomposition, whose proof can be found in Section C.1 in the supplement.
Lemma 4
Under Condition (R1), with probability larger than for some positive constants and
(54) |
(55) |
where
Define . Together with (41) and the generalized IV strength condition (R2), we apply (55) with for some . Then there exist positive constants and such that, with probability larger than ,
(56) |
Together with (53), we establish
(57) |
By applying the decomposition (57) and the upper bounds (54), and (54) with , we establish that, conditioning on , with probability larger than for some positive constants and ,
(58) |
The above concentration bound, the assumption (R2), and the assumption imply . Then for any constant , we have . We apply the bounded convergence theorem and establish , and hence
B.3 Proof of Theorem 1
Theorem 6
We start with the following error decomposition,
(61) |
with and Define
Then the decomposition (61) implies (59). We apply (56) together with the generalized IV strength condition (R2) and establish . Condition (31) implies the Lindeberg condition. Hence, we establish and
Since , we apply a similar argument as in (54) and obtain that, conditioning on , with probability larger than for some constants and , . Then we establish
(62) |
We introduce the following Lemma to control whose proof is presented in Section C.2 in the supplement.
Lemma 5
If Condition (R2) holds, then with probability larger than
B.4 Proofs of Theorems 2 and 5
By conditional sub-gaussianity and , we have . Hence, we have . With defined in (34), we have
(64) |
where is defined in (35),
(65) |
The remaining proof relies on the following lemma, whose proof can be found at Section C.3 in the supplement.
Lemma 6
Suppose that Conditions (R1), (R2), and (R3) hold, then
B.5 Proof of Theorem 3
In the following, we shall prove
(66) |
Recall that the test is defined in (27) and note that
(67) |
Define the events
where is defined in (36), is defined in (39) with replaced by , and is defined in (65). By the definition of in (36) and the following (86), we control the probability of the event By the following (6) and , we establish that, for any positive constants and Combing the above two equalities, we have
(68) |
For any the condition (R4) implies that there exists some such that On the event , we apply the expression (64) and obtain
(69) |
For any we have Then on the event , we apply the expression (64) and obtain that,
Together with (69), the event implies . By (68), we establish . Together with (67), we have . To control the coverage probability, we decompose as
(70) | ||||
Note that
and By the decomposition (70), we combine the above two inequalities and establish
which implies (66) with By the definition we apply a similar argument and establish (66) with
B.6 Proof of Theorem 4
If , then (54) further implies
(71) |
Proof of (44).
We decompose the error as,
(72) | ||||
where and
We apply (49) with and for some and establish
By choosing we establish that, with probability larger than for some positive constant then
(73) |
We apply (49) with and for some
The above concentration bound implies
(74) |
Hence, we establish that, with probability larger than
(75) |
where the last inequality follows from (58). Regarding we have
(76) | ||||
By a similar argument as in (74), we establish that, with probability larger than for some positive constant Together with (76), we establish that, with probability larger than
Together with (72) and the upper bounds (73) and (75), we establish (44).
Proof of (47).
Appendix C Proof of Extra Lemmas
C.1 Proof of Lemma 4
Note that We apply (49) with and for some and establish
(78) | ||||
where the last inequality follows from . The above concentration bound implies (54) by taking an expectation with respect to
Since we apply a similar argument to (78) and establish
(79) |
Note that , and conditioning on , are independent sub-gaussian random variables. We apply Proposition 5.16 of \citetsupp{vershynin2010introduction} and establish that, with probability larger than for some positive constant ,
(80) |
The above concentration bound implies (54) by taking an expectation with respect to
C.2 Proof of Lemma 5
Under the model , we have
(81) |
The least square estimator of is expressed as, Note that
Note that , and conditioning on , are independent sub-gaussian random variables. By Proposition 5.16 of \citetsupp{vershynin2010introduction}, there exist positive constants and such that . By Condition (R1), we have ; we apply the union bound and establish
(82) |
Similarly to (82), we establish Together with the expression we apply Condition (R1) and establish Together with (82), and Condition (R1), we establish that, with probability larger than
(83) |
Our proposed estimator defined in (21) has the following equivalent expression,
Then we apply (81) and obtain Together with Condition (R1) and (83), we establish Lemma 5.
C.3 Proof of Lemma 6
Similarly to (61), we decompose as where , and
We apply the same analysis as that of (63) and establish
By Lemma 5 with , we apply a similar argument as that of (60) and establish that
where is defined in (30). We establish under Condition (R3). Similarly, we establish under Condition (R3). That is, we establish
Now we prove and start with the decomposition,
(84) | ||||
Since the vector defined in (33) satisfies , we verify the Lindeberg condition and establish
(85) |
We apply (54) and (55) and establish that with probability larger than for some positive constants and
We apply the above inequalities and establish that with probability larger than for some positive constants and
Under the condition (R3), we establish
Similarly, we have The above inequalities and the decomposition (84) imply
(86) |
Together with (85), we establish .
C.4 Proof of Lemma 1
C.5 Proof of Lemma 2
By rewriting Lemma 5, we have that, with probability larger than
(88) |
It is sufficient to show
(89) |
Note that
(90) |
We further decompose
(91) | ||||
Note that . Define the vector with ; then we have and . By applying Proposition 5.16 of \citetsupp{vershynin2010introduction}, we establish that,
(92) |
By the condition and we establish
Appendix D Additional Simulation Results
D.1 The appropriate threshold of IV strength for reliable inference
In this section, we show the performance of TSCI in terms of RMSE, bias, and CI coverage for different IV strengths. We generate , , and the errors following the same procedure as in Settings B1 and B2 in Section 5.2 but with . We generate the treatment and outcome following and . Fixing the sample size , we control the IV strength by varying across . Since we consider the valid IV setting, we implement TSCI with random forests 500 times and specify with . In Figure S1, we show that when the IV strength is above 40, TSCI has a smaller RMSE and bias, and its confidence interval achieves the desired coverage.

We also use the setting with to illustrate the bias of and the effect of the proposed bias correction in Figure 1 in the main paper.
D.2 Comparison with machine-learning IV
In the following, we compare our TSCI estimator with the MLIV estimator described in (29), setting the full set of observed covariates as the adjusted forms in our method. We use exactly the same setting as in Section D.1. In Table S1, we compare TSCI with MLIV for different and different , with a larger value of corresponding to a stronger IV. We implement 500 rounds of simulations and report the averaged measures.
without self-prediction | with self-prediction | ||||||||||
Bias | RMSE | Bias | RMSE | ||||||||
a | n | TSCI | Init | MLIV | Ratio | IV Str | TSCI | Init | MLIV | Ratio | IV Str |
0.20 | 1000 | 0.17 | 0.30 | -4.31 | 374.06 | 2.18 | 0.28 | 0.38 | 0.48 | 1.51 | 11.90 |
3000 | 0.03 | 0.14 | -0.15 | 2.94 | 12.11 | 0.17 | 0.29 | 0.45 | 2.17 | 42.06 | |
5000 | 0.01 | 0.10 | -0.07 | 1.67 | 27.97 | 0.12 | 0.23 | 0.42 | 2.69 | 76.41 | |
0.25 | 1000 | 0.06 | 0.18 | -0.38 | 23.14 | 3.84 | 0.17 | 0.28 | 0.41 | 1.88 | 15.79 |
3000 | 0.00 | 0.06 | -0.05 | 1.37 | 30.35 | 0.08 | 0.17 | 0.35 | 2.97 | 74.95 | |
5000 | 0.00 | 0.04 | -0.02 | 1.11 | 77.97 | 0.04 | 0.12 | 0.32 | 3.84 | 156.57 | |
0.30 | 1000 | 0.02 | 0.09 | -0.11 | 2.01 | 7.49 | 0.09 | 0.18 | 0.32 | 2.22 | 23.92 |
3000 | -0.00 | 0.02 | -0.02 | 1.09 | 74.82 | 0.03 | 0.09 | 0.25 | 3.74 | 151.09 | |
5000 | 0.00 | 0.02 | -0.01 | 1.06 | 194.74 | 0.01 | 0.06 | 0.21 | 4.51 | 316.40 | |
0.35 | 1000 | 0.00 | 0.05 | -0.05 | 1.29 | 14.32 | 0.04 | 0.11 | 0.24 | 2.57 | 41.62 |
3000 | -0.01 | 0.01 | -0.01 | 0.99 | 144.72 | 0.01 | 0.04 | 0.17 | 3.55 | 255.65 | |
5000 | -0.00 | 0.01 | -0.01 | 1.00 | 337.34 | 0.01 | 0.04 | 0.16 | 4.44 | 526.98 |
In Table S1, we observe that MLIV has a larger bias and standard error than TSCI, leading to an inflated MSE for MLIV. As explained in Section 3.5, this happens when the IV strength captured by RF is relatively weak. For a stronger IV (with ), when the sample size is 3000 or 5000 and the self-prediction is excluded, MLIV and TSCI have comparable MSE, but our proposed TSCI estimator generally exhibits a smaller bias. Moreover, by comparing the TSCI and initial estimators, we see that our proposed bias correction step effectively removes the bias due to the high complexity of the RF algorithm. Another interesting observation is that the removal of self-prediction helps reduce the bias of both the TSCI and MLIV estimators, by reducing the correlation between the ML-predicted values and the unmeasured confounders.
We then show how the coefficient in (29) in the main paper inflates the estimation error of MLIV when the IV is weak. In Figure S2, we display the distribution of the TSCI and MLIV estimators in box plots and the frequency of in a histogram for the setting with and . We can see that a certain proportion of the coefficients are close to 0, which inflates the estimator to extremely large values and thus increases the variance of the MLIV estimator when the IV is weak.

D.3 Multiple IV (with non-linearity)
In this section, we consider multiple IVs and compare TSCI with existing methods dealing with invalid IVs, including TSHT \citepsupp{guo2018confidence} and CIIV \citepsupp{windmeijer2019confidence}. We consider the setting with 10 IVs. Fixing the sample size , we generate following the multivariate normal distribution with zero mean and covariance , where for . We define the first columns of as IVs and denote them by ; the remaining columns are defined as observed covariates, that is, . We generate the errors following the bivariate normal distribution with zero mean, unit variance, and covariance 0.5. We generate following and , where the vector indicates the invalidity level of each IV. The -th IV is valid if ; otherwise, it is invalid, and a larger value indicates that the -th IV is more severely invalid. We consider the following two settings,
• Setting D1: ,
• Setting D2: .
For Setting D1, neither the majority nor the plurality rule is satisfied; for Setting D2, the plurality rule is satisfied. We vary the value of to simulate different levels of invalidity. We specify the sets of basis functions as with , and with . We run 500 rounds of simulation and report the results in Figure S3.

In Setting D1, neither the majority rule nor the plurality rule is satisfied, so TSHT and CIIV cannot select the valid IVs and do not achieve the desired coverage level of 95%. In contrast, TSCI is able to detect the existing invalidity and obtain valid inference. When the invalidity level is low (with ), TSCI may not be able to identify the invalidity, which leads to coverage below 95%.
In Setting D2, TSCI achieves the desired coverage level, while TSHT and CIIV achieve the desired coverage only for relatively large invalidity levels. When the invalidity level is low, say and , TSHT and CIIV are significantly affected by locally invalid IVs and exhibit poor coverage. In comparison to TSHT and CIIV, TSCI is much more robust to low invalidity levels because it evaluates the total invalidity level accumulated over all ten invalid IVs.
D.4 Additional Results in Section 5.2
In this section, we report additional results for Setting B1 and results for Setting B2, and introduce a setting with a binary IV, denoted as Setting B3.
For Setting B1, we report the coverage for Vio=2 in Table S2, and we further report the mean absolute bias and the confidence interval length for both violation forms in Tables S3 and S4, respectively. In Table S3, TSLS has a large bias due to the existence of invalid IVs; even in the oracle setting with prior knowledge of the best approximating , RF-Full and RF-Plug have a large bias. Compared with RF-Init, TSCI corrects the bias effectively and thus achieves the desired 95% coverage. In Table S4, the CI of TSCI is shorter than that of the oracle method in settings with a small sample size or a weak interaction strength, because it fails to select the correct violation forms.
TSCI-RF | Proportions of selection | TSLS | Other RF(oracle) | |||||||||||
vio | a | n | Oracle | Comp | Robust | Invalidity | Init | Plug | Full | |||||
2 | 0.0 | 1000 | 0.92 | 0.01 | 0.01 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.82 | 0.30 | 0.01 |
3000 | 0.94 | 0.94 | 0.94 | 1.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.91 | 0.63 | 0.00 | ||
5000 | 0.94 | 0.94 | 0.94 | 1.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.88 | 0.77 | 0.00 | ||
0.5 | 1000 | 0.92 | 0.11 | 0.12 | 0.21 | 0.79 | 0.08 | 0.13 | 0.00 | 0.00 | 0.86 | 0.49 | 0.01 | |
3000 | 0.94 | 0.93 | 0.92 | 1.00 | 0.00 | 0.00 | 0.97 | 0.03 | 0.00 | 0.90 | 0.37 | 0.00 | ||
5000 | 0.94 | 0.94 | 0.94 | 1.00 | 0.00 | 0.00 | 0.99 | 0.01 | 0.00 | 0.88 | 0.03 | 0.00 | ||
1.0 | 1000 | 0.95 | 0.89 | 0.89 | 0.96 | 0.04 | 0.02 | 0.93 | 0.01 | 0.00 | 0.89 | 0.40 | 0.01 | |
3000 | 0.93 | 0.93 | 0.93 | 1.00 | 0.00 | 0.00 | 0.99 | 0.01 | 0.00 | 0.92 | 0.00 | 0.00 | ||
5000 | 0.93 | 0.93 | 0.93 | 1.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.90 | 0.00 | 0.00 |
TSCI-RF | Proportions of selection | TSLS | Other RF(oracle) | |||||||||||
vio | a | n | Oracle | Comp | Robust | Invalidity | Init | Plug | Full | |||||
1 | 0.0 | 1000 | 0.02 | 0.53 | 0.53 | 0.01 | 0.99 | 0.01 | 0.00 | 0.00 | 0.56 | 0.13 | 0.48 | 0.30 |
3000 | 0.01 | 0.01 | 0.01 | 1.00 | 0.00 | 0.84 | 0.16 | 0.00 | 0.56 | 0.04 | 0.14 | 0.25 | ||
5000 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.85 | 0.15 | 0.00 | 0.56 | 0.03 | 0.05 | 0.23 | ||
0.5 | 1000 | 0.01 | 0.26 | 0.25 | 0.24 | 0.76 | 0.24 | 0.01 | 0.00 | 0.33 | 0.08 | 0.24 | 0.22 | |
3000 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.97 | 0.02 | 0.01 | 0.33 | 0.03 | 0.16 | 0.19 | ||
5000 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.98 | 0.01 | 0.01 | 0.33 | 0.02 | 0.22 | 0.18 | ||
1.0 | 1000 | 0.00 | 0.01 | 0.01 | 0.95 | 0.05 | 0.93 | 0.01 | 0.00 | 0.23 | 0.04 | 0.15 | 0.13 | |
3000 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.99 | 0.01 | 0.00 | 0.23 | 0.01 | 0.38 | 0.11 | ||
5000 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.98 | 0.02 | 0.01 | 0.23 | 0.01 | 0.37 | 0.10 | ||
2 | 0.0 | 1000 | 0.00 | 0.54 | 0.54 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.56 | 0.11 | 0.53 | 0.29 |
3000 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.56 | 0.05 | 0.13 | 0.25 | ||
5000 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.56 | 0.03 | 0.05 | 0.23 | ||
0.5 | 1000 | 0.01 | 0.38 | 0.38 | 0.21 | 0.79 | 0.08 | 0.13 | 0.00 | 0.33 | 0.08 | 0.34 | 0.23 | |
3000 | 0.00 | 0.00 | 0.01 | 1.00 | 0.00 | 0.00 | 0.97 | 0.03 | 0.33 | 0.03 | 0.19 | 0.20 | ||
5000 | 0.00 | 0.00 | 0.01 | 1.00 | 0.00 | 0.00 | 0.99 | 0.01 | 0.33 | 0.03 | 0.29 | 0.19 | ||
1.0 | 1000 | 0.01 | 0.04 | 0.04 | 0.96 | 0.04 | 0.02 | 0.93 | 0.01 | 0.23 | 0.04 | 0.25 | 0.15 | |
3000 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.99 | 0.01 | 0.23 | 0.02 | 0.52 | 0.12 | ||
5000 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.23 | 0.01 | 0.48 | 0.11 |
TSCI-RF | Proportions of selection | TSLS | Other RF(oracle) | |||||||||||
vio | a | n | Oracle | Comp | Robust | Invalidity | Init | Plug | Full | |||||
1 | 0.0 | 1000 | 0.49 | 0.11 | 0.11 | 0.01 | 0.99 | 0.01 | 0.00 | 0.00 | 0.08 | 0.49 | 0.82 | 0.22 |
3000 | 0.32 | 0.32 | 0.32 | 1.00 | 0.00 | 0.84 | 0.16 | 0.00 | 0.05 | 0.32 | 0.38 | 0.14 | ||
5000 | 0.23 | 0.23 | 0.23 | 1.00 | 0.00 | 0.85 | 0.15 | 0.00 | 0.04 | 0.23 | 0.27 | 0.11 | ||
0.5 | 1000 | 0.38 | 0.13 | 0.14 | 0.24 | 0.76 | 0.24 | 0.01 | 0.00 | 0.05 | 0.38 | 0.60 | 0.19 | |
3000 | 0.22 | 0.22 | 0.23 | 1.00 | 0.00 | 0.97 | 0.02 | 0.01 | 0.03 | 0.22 | 0.26 | 0.11 | ||
5000 | 0.17 | 0.17 | 0.17 | 1.00 | 0.00 | 0.98 | 0.01 | 0.01 | 0.02 | 0.17 | 0.19 | 0.09 | ||
1.0 | 1000 | 0.25 | 0.24 | 0.24 | 0.95 | 0.05 | 0.93 | 0.01 | 0.00 | 0.04 | 0.25 | 0.33 | 0.14 | |
3000 | 0.13 | 0.13 | 0.13 | 1.00 | 0.00 | 0.99 | 0.01 | 0.00 | 0.02 | 0.13 | 0.18 | 0.08 | ||
5000 | 0.10 | 0.10 | 0.10 | 1.00 | 0.00 | 0.98 | 0.02 | 0.01 | 0.02 | 0.10 | 0.13 | 0.06 | ||
2 | 0.0 | 1000 | 0.49 | 0.17 | 0.17 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.12 | 0.49 | 0.85 | 0.21 |
3000 | 0.31 | 0.31 | 0.31 | 1.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.07 | 0.31 | 0.38 | 0.14 | ||
5000 | 0.23 | 0.23 | 0.23 | 1.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.06 | 0.23 | 0.27 | 0.11 | ||
0.5 | 1000 | 0.38 | 0.16 | 0.16 | 0.21 | 0.79 | 0.08 | 0.13 | 0.00 | 0.07 | 0.38 | 0.70 | 0.18 | |
3000 | 0.23 | 0.23 | 0.26 | 1.00 | 0.00 | 0.00 | 0.97 | 0.03 | 0.04 | 0.23 | 0.28 | 0.11 | ||
5000 | 0.17 | 0.17 | 0.21 | 1.00 | 0.00 | 0.00 | 0.99 | 0.01 | 0.03 | 0.17 | 0.20 | 0.09 | ||
1.0 | 1000 | 0.24 | 0.24 | 0.24 | 0.96 | 0.04 | 0.02 | 0.93 | 0.01 | 0.05 | 0.24 | 0.40 | 0.14 | |
3000 | 0.13 | 0.13 | 0.13 | 1.00 | 0.00 | 0.00 | 0.99 | 0.01 | 0.03 | 0.13 | 0.22 | 0.08 | ||
5000 | 0.10 | 0.10 | 0.11 | 1.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.02 | 0.10 | 0.15 | 0.06 |
TSCI-RF | Proportions of selection | TSLS | Other RF(oracle) | |||||||||||
vio | a | n | Oracle | Comp | Robust | Invalidity | Init | Plug | Full | |||||
1 | 0 | 1000 | 0.84 | 0.84 | 0.84 | 1.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.58 | 0.81 | 0.16 | 0.00 |
3000 | 0.93 | 0.93 | 0.94 | 1.00 | 0.00 | 0.96 | 0.04 | 0.00 | 0.11 | 0.91 | 0.00 | 0.00 | ||
5000 | 0.93 | 0.94 | 0.94 | 1.00 | 0.00 | 0.95 | 0.05 | 0.00 | 0.01 | 0.93 | 0.00 | 0.00 | ||
0.5 | 1000 | 0.94 | 0.94 | 0.95 | 1.00 | 0.00 | 0.95 | 0.04 | 0.01 | 0.00 | 0.88 | 0.00 | 0.01 | |
3000 | 0.93 | 0.93 | 0.92 | 1.00 | 0.00 | 0.95 | 0.05 | 0.00 | 0.00 | 0.92 | 0.00 | 0.01 | ||
5000 | 0.93 | 0.93 | 0.93 | 1.00 | 0.00 | 0.99 | 0.01 | 0.00 | 0.00 | 0.92 | 0.00 | 0.00 | ||
1 | 1000 | 0.94 | 0.94 | 0.95 | 1.00 | 0.00 | 0.98 | 0.01 | 0.00 | 0.00 | 0.92 | 0.00 | 0.01 | |
3000 | 0.95 | 0.96 | 0.96 | 1.00 | 0.00 | 0.98 | 0.02 | 0.00 | 0.00 | 0.93 | 0.00 | 0.00 | ||
5000 | 0.93 | 0.93 | 0.93 | 1.00 | 0.00 | 0.99 | 0.01 | 0.00 | 0.00 | 0.91 | 0.00 | 0.00 | ||
2 | 0 | 1000 | 0.84 | 0.17 | 0.17 | 0.99 | 0.01 | 0.97 | 0.02 | 0.00 | 0.55 | 0.16 | 0.13 | 0.00 |
3000 | 0.96 | 0.96 | 0.94 | 1.00 | 0.00 | 0.00 | 0.99 | 0.01 | 0.14 | 0.93 | 0.00 | 0.00 | ||
5000 | 0.95 | 0.95 | 0.94 | 1.00 | 0.00 | 0.00 | 0.99 | 0.01 | 0.02 | 0.92 | 0.00 | 0.01 | ||
0.5 | 1000 | 0.95 | 0.73 | 0.72 | 0.92 | 0.08 | 0.17 | 0.70 | 0.05 | 0.00 | 0.68 | 0.00 | 0.03 | |
3000 | 0.94 | 0.94 | 0.94 | 1.00 | 0.00 | 0.00 | 0.95 | 0.05 | 0.00 | 0.93 | 0.00 | 0.00 | ||
5000 | 0.95 | 0.95 | 0.95 | 1.00 | 0.00 | 0.00 | 0.98 | 0.02 | 0.00 | 0.92 | 0.00 | 0.00 | ||
1 | 1000 | 0.96 | 0.95 | 0.96 | 1.00 | 0.00 | 0.01 | 0.93 | 0.06 | 0.00 | 0.94 | 0.00 | 0.01 | |
3000 | 0.96 | 0.95 | 0.94 | 1.00 | 0.00 | 0.00 | 0.95 | 0.05 | 0.00 | 0.93 | 0.00 | 0.00 | ||
5000 | 0.96 | 0.96 | 0.95 | 1.00 | 0.00 | 0.00 | 0.97 | 0.03 | 0.00 | 0.95 | 0.00 | 0.00 |
For Setting B2, we report the empirical coverage in Table S5; the results are generally similar to those for Setting B1.
To approximate the real data analysis in Section 6, we further generate a binary IV as and the covariates for and . We consider the following models for and :
• Setting B3 (binary IV): and
Compared to Settings B1 and B2, the outcome model in Setting B3 involves the interaction between and , while the treatment model involves a more complicated interaction term whose strength is controlled by . We specify with and . With the specified basis sets, we implement TSCI with random forests as detailed in Algorithm 1 in the main paper. In Table S6, we demonstrate our proposed TSCI method for Setting B3. The observations are consistent with those for Settings B1 and B2. The main difference between the binary IV (Setting B3) and the continuous IV (Settings B1 and B2) is that the treatment effect is not identifiable for , which happens only in the binary IV setting. However, with a non-zero interaction and a relatively large sample size, our proposed TSCI method achieves the desired coverage. We also observe that the bias correction is effective and improves the coverage when the interaction is relatively small.
TSCI-RF | RF-Init | ||||||||||||
Bias | Length | Coverage | Invalidity | Bias | Coverage | ||||||||
a | n | Orac | Comp | Robust | Orac | Comp | Robust | Orac | Comp | Robust | Orac | Orac | |
0.25 | 1000 | 0.01 | 0.56 | 0.56 | 0.42 | 0.23 | 0.23 | 0.90 | 0.28 | 0.28 | 0.31 | 0.10 | 0.82 |
3000 | 0.00 | 0.01 | 0.01 | 0.23 | 0.23 | 0.23 | 0.95 | 0.95 | 0.95 | 0.99 | 0.05 | 0.84 | |
5000 | 0.00 | 0.00 | 0.00 | 0.18 | 0.18 | 0.18 | 0.94 | 0.94 | 0.94 | 1.00 | 0.03 | 0.86 | |
0.50 | 1000 | 0.00 | 0.02 | 0.02 | 0.22 | 0.22 | 0.22 | 0.93 | 0.90 | 0.90 | 0.97 | 0.04 | 0.90 |
3000 | -0.00 | -0.00 | -0.00 | 0.12 | 0.12 | 0.12 | 0.93 | 0.93 | 0.93 | 1.00 | 0.02 | 0.89 | |
5000 | 0.00 | 0.00 | 0.00 | 0.09 | 0.09 | 0.09 | 0.95 | 0.95 | 0.95 | 1.00 | 0.02 | 0.89 | |
0.75 | 1000 | 0.00 | 0.00 | 0.00 | 0.15 | 0.15 | 0.15 | 0.92 | 0.90 | 0.90 | 0.99 | 0.02 | 0.91 |
3000 | 0.00 | 0.00 | 0.00 | 0.08 | 0.08 | 0.08 | 0.94 | 0.94 | 0.94 | 1.00 | 0.01 | 0.89 | |
5000 | -0.00 | -0.00 | -0.00 | 0.06 | 0.06 | 0.06 | 0.93 | 0.93 | 0.93 | 1.00 | 0.01 | 0.91 |
D.5 Binary Treatment
We consider the binary treatment setting and explore the finite-sample performance of our proposed TSCI method. We consider the outcome model with and . We generate and following a bivariate normal distribution with zero mean, unit variance, and covariance 0.5. We generate the binary treatment with the conditional mean
where is specified in the following two ways.
• Setting 1 (continuous IV): generate following Section 5.2.
• Setting 2 (binary IV): generate .
TSCI-RF | RF-Init | TSCI-RF | |||||||||||||
Bias | Length | Coverage | Invalidity | Bias | Coverage | Weak IV | |||||||||
Setting | a | n | Orac | Comp | Robust | Orac | Comp | Robust | Orac | Comp | Robust | Orac | Orac | ||
1000 | 0.00 | 0.00 | 0.00 | 1.08 | 1.08 | 1.08 | 0.92 | 0.92 | 0.92 | 1.00 | 0.05 | 0.94 | 0.99 | ||
3000 | 0.01 | 0.01 | 0.00 | 0.59 | 0.59 | 0.61 | 0.94 | 0.93 | 0.93 | 1.00 | 0.00 | 0.95 | 0.00 | ||
0.0 | 5000 | 0.00 | 0.00 | 0.01 | 0.44 | 0.45 | 0.89 | 0.94 | 0.93 | 0.95 | 1.00 | 0.01 | 0.94 | 0.00 | |
1000 | 0.03 | 0.03 | 0.03 | 1.12 | 1.12 | 1.12 | 0.90 | 0.90 | 0.90 | 1.00 | 0.07 | 0.94 | 0.99 | ||
3000 | 0.00 | 0.00 | 0.00 | 0.62 | 0.62 | 0.62 | 0.93 | 0.93 | 0.93 | 1.00 | 0.01 | 0.94 | 0.00 | ||
0.5 | 5000 | 0.00 | 0.00 | 0.01 | 0.45 | 0.45 | 0.68 | 0.94 | 0.93 | 0.93 | 1.00 | 0.01 | 0.94 | 0.00 | |
1000 | 0.01 | 0.01 | 0.01 | 1.22 | 1.22 | 1.22 | 0.92 | 0.92 | 0.92 | 1.00 | 0.04 | 0.95 | 0.99 | ||
3000 | 0.01 | 0.01 | 0.00 | 0.66 | 0.66 | 0.67 | 0.95 | 0.95 | 0.95 | 1.00 | 0.01 | 0.96 | 0.00 | ||
1.0 | 5000 | 0.00 | 0.00 | 0.00 | 0.49 | 0.49 | 0.62 | 0.93 | 0.93 | 0.93 | 1.00 | 0.01 | 0.94 | 0.00 | |
1000 | 0.01 | 0.01 | 0.01 | 1.30 | 1.30 | 1.30 | 0.89 | 0.89 | 0.89 | 1.00 | 0.06 | 0.94 | 1.00 | ||
3000 | 0.00 | 0.00 | 0.00 | 0.73 | 0.73 | 0.74 | 0.95 | 0.95 | 0.95 | 1.00 | 0.01 | 0.96 | 0.01 | ||
1 | 1.5 | 5000 | 0.00 | 0.00 | 0.01 | 0.53 | 0.54 | 0.82 | 0.93 | 0.92 | 0.93 | 1.00 | 0.01 | 0.94 | 0.00 |
1000 | 0.37 | 0.37 | 0.37 | 1.62 | 1.62 | 1.62 | 0.66 | 0.66 | 0.66 | 0.93 | 0.41 | 0.78 | 1.00 | ||
3000 | 0.41 | 0.41 | 0.41 | 1.25 | 1.25 | 1.25 | 0.60 | 0.60 | 0.60 | 1.00 | 0.42 | 0.72 | 1.00 | ||
0.0 | 5000 | 0.35 | 0.35 | 0.35 | 1.05 | 1.05 | 1.05 | 0.61 | 0.61 | 0.61 | 1.00 | 0.40 | 0.65 | 1.00 | |
1000 | 0.30 | 0.30 | 0.30 | 1.21 | 1.21 | 1.21 | 0.65 | 0.65 | 0.65 | 1.00 | 0.36 | 0.77 | 1.00 | ||
3000 | 0.21 | 0.21 | 0.21 | 1.18 | 1.18 | 1.18 | 0.73 | 0.73 | 0.73 | 1.00 | 0.29 | 0.83 | 1.00 | ||
0.5 | 5000 | 0.10 | 0.10 | 0.10 | 1.09 | 1.09 | 1.09 | 0.82 | 0.82 | 0.82 | 1.00 | 0.22 | 0.89 | 1.00 | |
1000 | 0.16 | 0.20 | 0.16 | 1.35 | 1.30 | 1.35 | 0.77 | 0.77 | 0.77 | 0.93 | 0.26 | 0.87 | 1.00 | ||
3000 | 0.03 | 0.03 | 0.03 | 1.11 | 1.10 | 1.11 | 0.88 | 0.88 | 0.88 | 0.99 | 0.11 | 0.94 | 1.00 | ||
1.0 | 5000 | 0.00 | 0.00 | 0.00 | 0.92 | 0.92 | 0.92 | 0.89 | 0.89 | 0.89 | 1.00 | 0.07 | 0.93 | 0.57 | |
1000 | 0.06 | 0.22 | 0.06 | 1.59 | 1.31 | 1.59 | 0.83 | 0.67 | 0.83 | 0.75 | 0.18 | 0.93 | 1.00 | ||
3000 | 0.01 | 0.02 | 0.01 | 1.14 | 1.13 | 1.14 | 0.90 | 0.90 | 0.90 | 0.98 | 0.08 | 0.93 | 0.87 | ||
2 | 1.5 | 5000 | 0.00 | 0.00 | 0.00 | 0.91 | 0.91 | 0.91 | 0.93 | 0.93 | 0.93 | 1.00 | 0.04 | 0.95 | 0.02 |
The binary treatment results are summarized in Table S7. The main observations are similar to those for the continuous treatment reported in the main paper. We shall point out the major differences in the following. The settings with a binary treatment are in general more challenging since the IV strength is relatively weak. To measure this, we report the column indexed “weak IV”, which stands for the proportion of simulations with . For settings where our proposed generalized IV strength is strong, such that , our proposed TSCI method achieves the desired coverage level. Even when the generalized IV strength leads to , our proposed (oracle) TSCI may still achieve the desired coverage level for Setting 1.
Appendix E Additional Results for Real Data Analysis
This section contains additional results for the real data analysis. In Figure S4, we first show the importance scores of all variables in the random forests, based on which the six most important covariates are selected to construct the basis set in Section 6.

E.1 Other specifications of
In the following, in addition to the basis set considered in Section 6, we consider three more specifications of , as detailed in Table S8, and test the robustness of TSCI’s selection process.
{nearc4reg1, nearc4reg2, nearc4reg3, nearc4reg4,nearc4reg5, nearc4reg6, nearc4reg7, nearc4reg8} | |
, , , | |
, , , |
We implement TSCI with random forests as detailed in Algorithm 1, specifying as in Section 6 and different as in Table S8. We observe that the point estimators and confidence intervals are relatively stable even with different specifications of .
Identifications | TSCI | Proportions of selection | Prob() | IV str. | |||
Estimate | CI | ||||||
TSCI- | 0.0604 | (0.0294, 0.0914) | 59.2% | 38.2% | 2.6% | 93.6% | 112.7950 |
TSCI- | 0.0575 | (0.0263, 0.0886) | 40.0% | 58.6% | 1.4% | 2.3% | 111.8835 |
TSCI- | 0.0614 | (0.0303, 0.0916) | 55.0% | 40.0% | 5.0% | 72.9% | 111.6233 |
TSCI- | 0.0575 | (0.0268, 0.0891) | 39.2% | 60.6% | 0.2% | 0% | 113.0134 |
E.2 Falsification Argument of Condition 1
We propose in the following a falsification argument regarding Condition 1 for the data analysis. Particularly, we demonstrate that the regression model using covariates belonging to provides a good approximation of but not of . With the TSCI estimator , we construct the pseudo outcome ; if is a reasonably good estimator of , then the pseudo outcome can be used as a proxy of . We then implement two OLS regressions of the pseudo outcome on the covariates in and as defined in Table 3. In addition, we build a random forests prediction model for the pseudo outcome with the predictors and . To evaluate the performance, we split the data into a training set and a test set as in Section 3.1, where the test set is used to estimate the out-of-sample MSE of the model constructed with the training set . We randomly split the data 500 times and report the violin plots of the 500 MSEs on the left panel of Figure S5. Since our specified basis set leads to nearly the same prediction performance as the random forests prediction model, this suggests that provides a good approximation of the function , provided the TSCI estimator is accurate. In comparison, we use the treatment variable to replace the pseudo outcome and compare the MSEs of the three prediction models. On the right panel of Figure S5, the random forests prediction model performs much better than the OLS with covariates in and , indicating that does not provide a good approximation of in the treatment model.
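A schematic version of this falsification check is sketched below; the function and variable names are ours, and the pseudo outcome is formed with a generic TSCI estimate beta_hat. It compares the out-of-sample MSE of an OLS fit on a candidate violation/covariate basis with that of a random forest, both predicting the pseudo outcome; running the same comparison with the treatment in place of the pseudo outcome mimics the right panel of Figure S5.
\begin{verbatim}
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def falsification_mse(Y, D, Z, X, basis, beta_hat, n_splits=100, seed=0):
    """Compare OLS-on-basis vs. random forest for predicting the pseudo outcome.

    `basis(Z, X)` should return the candidate basis matrix (e.g., the g1 basis).
    Returns arrays of out-of-sample MSEs over random sample splits.
    """
    rng = np.random.default_rng(seed)
    pseudo = Y - beta_hat * D            # proxy for the violation function g
    n = len(Y)
    B = basis(Z, X)
    ZX = np.column_stack([Z, X])
    mse_ols, mse_rf = [], []
    for _ in range(n_splits):
        idx = rng.permutation(n)
        tr, te = idx[: n // 2], idx[n // 2 :]
        coef, *_ = np.linalg.lstsq(B[tr], pseudo[tr], rcond=None)
        mse_ols.append(np.mean((pseudo[te] - B[te] @ coef) ** 2))
        rf = RandomForestRegressor(n_estimators=200, random_state=0)
        rf.fit(ZX[tr], pseudo[tr])
        mse_rf.append(np.mean((pseudo[te] - rf.predict(ZX[te])) ** 2))
    return np.array(mse_ols), np.array(mse_rf)

# If the two MSE distributions are comparable, the basis approximates g well;
# with D in place of the pseudo outcome, the random forest should win clearly,
# since the treatment model f is more complex than the basis.
\end{verbatim}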

\bibliographystylesupp{chicago}
\bibliographysupp{IVRef}