
\footnotetext[0]{This paper is a complete and more advanced version of a recent technical note by the author; see https://arxiv.org/abs/2008.12467.}
\footnotetext[1]{Department of Biostatistics, Harvard Chan School of Public Health.}
\footnotetext[2]{Department of Statistics, Harvard University.}
\footnotetext[3]{Department of Statistics, University of California, Davis.}

Double/Debiased Machine Learning for Logistic Partially
Linear Model

Molei Liu1, Yi Zhang2, Doudou Zhou3
Abstract

We propose double/debiased machine learning approaches to infer, at the parametric rate $n^{-1/2}$, the parametric component of a logistic partially linear model, in which the binary response follows a conditional logistic model with a low dimensional linear parametric function of some key (exposure) covariates and a nonparametric function adjusting for the confounding effect of the other covariates. We consider a Neyman orthogonal (doubly robust) score equation consisting of two nuisance functions: the nonparametric component of the logistic model and the conditional mean of the exposure given the other covariates with the response fixed at zero. To estimate the nuisance models, we separately consider the use of high dimensional (HD) sparse parametric models and of more general (typically nonparametric) machine learning (ML) methods. In the HD case, we derive certain moment equations to calibrate the first order bias of the nuisance models, which grants our method a model double robustness property in the sense that our estimator achieves the desirable $O(n^{-1/2})$ rate when at least one of the nuisance models is correctly specified and both of them are ultra-sparse. In the ML case, the non-linearity of the logit link makes it substantially harder than in the partially linear setting to use an arbitrary conditional mean learning algorithm to estimate the nuisance component of the logistic model. We handle this obstacle through a novel "full model refitting" procedure that is easy to implement and facilitates the use of nonparametric ML algorithms in our framework. Our ML estimator is rate doubly robust in the same sense as Chernozhukov et al. (2018a). We evaluate our methods through simulation studies and apply them to assess the effect of the emergency contraceptive (EC) pill on early gestation foetal deaths, exploiting a policy reform in Chile in 2008 (Bentancor and Clarke, 2017).

1 Introduction

Consider a logistic partially linear model. Let $\{(Y_{i},A_{i},\boldsymbol{X}_{i})\colon i=1,2,\ldots,n\}$ be independent and identically distributed samples of $Y\in\{0,1\}$, $A\in\mathbb{R}$ and $\boldsymbol{X}\in\mathbb{R}^{p}$. Assume that

\mathbb{P}(Y=1\mid A,\boldsymbol{X})={\rm expit}\{\beta_{0}A+r_{0}(\boldsymbol{X})\}, \qquad (1)

where ${\rm expit}(\cdot)={\rm logit}^{-1}(\cdot)$, ${\rm logit}(a)=\log\{a/(1-a)\}$ and $r_{0}(\cdot)$ is an unknown nuisance function of $\boldsymbol{X}$. In an experimental or observational study with $A$ taken as the exposure variable, $Y$ the binary response of interest and $\boldsymbol{X}$ the observed confounding variables, the parameter $\beta_{0}$ is of particular interest as it measures the conditional effect of $A$ on $Y$ on the log odds ratio scale. As the most common and natural way to characterize the conditional model of a binary outcome against an exposure, model (1) has been used extensively in economics and policy science studies.

Our goal is to estimate and infer $\beta_{0}$, asymptotically normally at the rate $n^{-1/2}$. When $\boldsymbol{X}$ is a scalar and $r_{0}(\cdot)$ is smooth, classic semiparametric kernel or sieve regression works well for this purpose (Severini and Staniswalis, 1994; Lin and Carroll, 2006). When $\boldsymbol{X}$ is of high dimensionality, these approaches can perform poorly due to the curse of dimensionality, and one would rather estimate $r_{0}(\cdot)$ with modern HD (parametric) or ML (nonparametric)\footnote{Our HD setting refers to HD parametric (linear or generalized linear) models, and the ML setting refers to blackbox, usually nonparametric, ML models for conditional mean estimation (prediction/classification).} methods that are much more resistant to the growing dimensionality and complexity of $\boldsymbol{X}$. However, unlike the partially linear model scenario (Chernozhukov et al., 2018c; Dukes and Vansteelandt, 2020), robust and efficient inference of $\beta_{0}$ in (1) with HD or ML nuisance models is yet to be well studied.

Recently, Tan (2019a) proposed a simple and flexible doubly robust estimator to enhance robustness against potential misspecification of $r(\boldsymbol{x})$ when it is specified as a fixed-dimensional parametric function: $r(\boldsymbol{x})=\boldsymbol{x}^{\intercal}\boldsymbol{\gamma}$. They introduced a parametric model $m(\boldsymbol{x})=g(\boldsymbol{x}^{\intercal}\boldsymbol{\alpha})$ with a known link function $g(\cdot)$ for the conditional mean model $m_{0}(\boldsymbol{x})=\mathbb{E}(A\mid Y=0,\boldsymbol{X}=\boldsymbol{x})$ and proposed the doubly robust estimating equation:

\frac{1}{n}\sum_{i=1}^{n}\widehat{\phi}(\boldsymbol{X}_{i})\left\{Y_{i}e^{-\beta A_{i}-\boldsymbol{X}_{i}^{\intercal}\widehat{\boldsymbol{\gamma}}}-(1-Y_{i})\right\}\left\{A_{i}-g(\boldsymbol{X}_{i}^{\intercal}\widehat{\boldsymbol{\alpha}})\right\}=0, \qquad (2)

where $\widehat{\phi}(\boldsymbol{x})$ estimates a scalar nuisance function $\phi(\boldsymbol{x})$ affecting the asymptotic efficiency of the estimator, and $\widehat{\boldsymbol{\alpha}}$ and $\widehat{\boldsymbol{\gamma}}$ are two fixed-dimensional nuisance model estimators. The estimator $\widehat{\beta}$ solved from (2) is doubly robust in that it is valid when either $r(\boldsymbol{x})=\boldsymbol{x}^{\intercal}\boldsymbol{\gamma}$ is correctly specified for the nonparametric component of the logistic model, or $m(\boldsymbol{x})=g(\boldsymbol{x}^{\intercal}\boldsymbol{\alpha})$ is correctly specified for the conditional mean model $m_{0}(\boldsymbol{x})=\mathbb{E}(A\mid Y=0,\boldsymbol{X}=\boldsymbol{x})$. Prior to this, doubly robust semiparametric estimation of the odds ratio was built upon $p(A\mid\boldsymbol{X},Y=0)$, the conditional density of $A$ given $\boldsymbol{X}$ and $Y=0$ (e.g. Chen, 2007; Tchetgen Tchetgen et al., 2010), requiring a stronger model assumption\footnote{Chen (2007) and Tchetgen Tchetgen et al. (2010) estimate the conditional distribution of $A$ given $\boldsymbol{X}$, while Tan (2019a) only needs to specify a conditional mean model of $A$.} than Tan (2019a) for continuous $A$.
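To make the use of (2) concrete, here is a minimal sketch of solving it for $\beta$ by one-dimensional root finding, assuming the nuisance estimates $\widehat{\boldsymbol{\gamma}}$, $\widehat{\boldsymbol{\alpha}}$ and a weight function $\widehat{\phi}(\cdot)$ have already been computed; the function names, the identity link and the bracketing interval are illustrative assumptions rather than part of Tan (2019a)'s procedure.

\begin{verbatim}
import numpy as np
from scipy.optimize import brentq

def solve_dr_equation(Y, A, X, gamma_hat, alpha_hat, phi_hat,
                      g=lambda u: u, bracket=(-5.0, 5.0)):
    """Solve the doubly robust estimating equation (2) for beta.

    Y, A: length-n arrays; X: (n, p) array.
    gamma_hat, alpha_hat: fitted nuisance coefficients.
    phi_hat: callable returning the weight phi-hat(X_i) for each row of X.
    g: link of the conditional mean model (identity here for illustration).
    """
    w = phi_hat(X)                       # phi-hat(X_i)
    resid_A = A - g(X @ alpha_hat)       # A_i - g(X_i' alpha-hat)

    def ee(beta):
        term = Y * np.exp(-beta * A - X @ gamma_hat) - (1 - Y)
        return np.mean(w * term * resid_A)

    # brentq needs a sign change over the bracket; the equation can be
    # non-monotone in beta (cf. the footnote in Remark 5), so widen the
    # bracket if no root is found.
    return brentq(ee, *bracket)
\end{verbatim}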

Nevertheless, Tan (2019a) focuses on fixed-dimensional parametric nuisance models that are still prone to model misspecification, and the proposed approach is not readily applicable to the recently developed HD ($p\gg n$ with the two nuisance components specified as parametric models with sparse coefficients) or ML (the nuisance functions estimated by arbitrary blackbox learning algorithms of conditional means) approaches. For the HD realization, this is because simply plugging regularized nuisance estimators into (2) would typically incur excessive bias and would not guarantee the parametric rate of convergence. We achieve bias reduction with respect to the regularization errors by constructing certain Dantzig moment equations to estimate the nuisance parameters. With ultra-sparse nuisance parameters, i.e. sparsity level $o(n^{1/2}/\log p)$, our proposed estimator preserves the model double robustness property in that it approaches $\beta_{0}$ at the rate $O_{p}(n^{-1/2})$ when either $r(\cdot)$ or $m(\cdot)$ is correctly specified. Under the ML framework, the non-linearity of the logit link, which prevents extracting $r_{0}(\cdot)$ as a conditional mean, makes it impossible to estimate $r_{0}(\cdot)$ directly with a conditional mean learning algorithm\footnote{Without purposed modification, the natural form of most contemporary supervised learning methods, e.g. random forest, support vector machine and neural network, can be conceptualized as a blackbox algorithm for conditional mean estimation, since their task is prediction for continuous responses and classification for categorical responses.}, as is trivially done under the partially linear model (Chernozhukov et al., 2018a). We handle this challenge through an easy-to-implement "full model refitting" (FMR) procedure that facilitates flexible implementation of arbitrary conditional mean learning algorithms in our framework. Our DML estimator for $\beta_{0}$ is rate doubly robust in the same sense as Chernozhukov et al. (2018a), i.e. asymptotically normal at rate $n^{-1/2}$ when the two nuisance ML estimators are consistent for the true models and their mean squared errors are controlled by $o_{p}(n^{-1/2})$.

In recent years, a large body of literature has developed around semiparametric inference (of a low-dimensional parameter) with HD and ML nuisance models, which has attracted growing attention and application in economics and policy sciences (Athey and Imbens, 2017; Knaus, 2018, 2020; Yang et al., 2020). Mirroring the structure of our paper, this literature comprises two different frameworks, the HD setting and the ML setting, though both are sometimes referred to as "machine learning" approaches. The main difference between them is that the HD setting imposes parametric assumptions on the nuisance models and may allow for their potential misspecification, while the ML setting uses nonparametric ML estimators supposed to approach the true nuisance models at certain geometric rates.

To estimate low-dimensional parameters such as the average treatment effect (ATE) and the conditional treatment effect in linear or log-linear models under the HD setting with potentially misspecified nuisance models, recent work including Smucler et al. (2019); Tan (2020a,b); Ning et al. (2020); Dukes and Vansteelandt (2020) constructed $\ell_{1}$-regularized estimating equations with certain $\ell_{\infty}$-constraints to simultaneously estimate the nuisance parameters and calibrate their first order bias. In comparison, Bradic et al. (2019) proposed a more sparsity-rate robust ATE estimator that requires substantially weaker sparsity assumptions but needs both HD parametric nuisance models to be correctly specified. Within this literature, our work is the first to consider the logistic partially linear model under a similar regime. In parallel to these but closely relevant to our target, existing approaches including the debiased (desparsified) LASSO (Van de Geer et al., 2014; Janková and Van De Geer, 2016) and the regularized Riesz representer (Chernozhukov et al., 2018c; Belloni et al., 2018) used the empirical inverse of the information matrix obtained with $\ell_{1}$-regularized regression to correct for the bias of the logistic LASSO estimator. They imposed a technical ultra-sparsity condition on the inverse of the information matrix, which has been criticized as unreasonable and unverifiable (Xia et al., 2020). Compared with them, our sparsity assumption is model-specific and thus more reasonable and interpretable, as will be shown later.

We note that, near the completion of our paper, a parallel paper, Ghosh and Tan (2020), was published on arXiv\footnote{As a preliminary version of our paper, our previous technical note https://arxiv.org/abs/2008.12467 appears slightly earlier than Ghosh and Tan (2020).}. Compared with their work, our HD part studies the same problem but uses a different doubly robust estimating equation and different calibration procedures for the nuisance models, while the two proposals have similar theoretical properties and numerical implementation strategies (see our Sections 3.1, 4.1 and Appendix D), and should have similar numerical performance. Our method under the ML setting introduced below has no overlap with their work.

For the nonparametric ML realization, Chernozhukov et al. (2018a) established a double machine learning (DML) framework utilizing Neyman orthogonal scores and cross-fitting to construct parametrically efficient ML-based causal estimators. Their framework has played a central role in semiparametric inference with ML. As complements or extensions of it, recent work including Chernozhukov et al. (2018b); Zimmert and Lechner (2019); Colangelo and Lee (2020) localized the orthogonal score function to estimate the conditional average treatment effect; Semenova and Chernozhukov (2020) constructed the best linear approximation of a structural function with ML; Farrell et al. (2018) used deep neural networks for DML estimation; Wager and Athey (2018); Oprescu et al. (2019) proposed and studied tree-based ML approaches for causal inference; and Cui and Tchetgen (2019) proposed a minimax data-driven model selection approach to choose the ML nuisance models with the lowest bias for the DML estimator. The above mentioned work elaborated on specific inference problems including the partially linear model, the ATE, and heterogeneous treatment effects, with their nuisance models directly estimable with arbitrary (supervised) ML algorithms. As summarized above, however, $r_{0}(\cdot)$ cannot be estimated likewise with general ML algorithms, due to the non-linear structure of (1). To the best of our knowledge, this paper is the first to solve this non-trivial technical problem, through the proposed FMR procedure.

The rest of this paper is organized as follows. In Section 2, we define the Neyman orthogonal (doubly robust) score equation for the logistic partially linear model. In Section 3, we introduce the realization of debiased and DML inference for $\beta_{0}$ under the HD and ML settings separately. In Section 4, we present and justify the asymptotic properties of our HD and ML estimators separately. In Sections 5 and 6, we conduct simulation studies on the empirical performance of our method and apply it to assess the effect of the EC pill on early gestation foetal deaths.

2 Neyman orthogonal score

Before coming to the specific approaches in Section 3, we introduce a Neyman orthogonal (doubly robust) score function for the logistic partially linear model and derive its first order bias, which plays a central role in motivating and guiding our method construction and theoretical analysis. Let $\boldsymbol{D}_{i}=\{Y_{i},A_{i},\boldsymbol{X}_{i}\}$ denote the observation for $i=1,\ldots,n$ and $\boldsymbol{D}=\{Y,A,\boldsymbol{X}\}$ be a generic copy of $\boldsymbol{D}_{i}$. Motivated by equation (2) proposed and studied in Tan (2019a), we define the Neyman orthogonal score as

h(\boldsymbol{D};\beta,\eta)=\psi(\boldsymbol{X})\{Ye^{-\beta A}-(1-Y)e^{r(\boldsymbol{X})}\}\{A-m(\boldsymbol{X})\},

where $\eta=\{r(\cdot),m(\cdot),\psi(\cdot)\}$ represents the whole set of nuisance functions. In analogy to (2), $r(\cdot)$ and $m(\cdot)$ correspond to the nonparametric component $r_{0}(\cdot)$ defined in (1) and the nuisance model $m_{0}(\boldsymbol{x})=\mathbb{E}(A\mid Y=0,\boldsymbol{X}=\boldsymbol{x})$, respectively. The function $\psi(\boldsymbol{x})$ is a nuisance function affecting the asymptotic variance of the estimator; it may depend on $r(\boldsymbol{x})$ and $m(\boldsymbol{x})$ and corresponds to $\phi(\boldsymbol{x})e^{-r(\boldsymbol{x})}$ with $\phi(\boldsymbol{x})$ defined in (2)\footnote{We rewrite (2) with $\psi(\boldsymbol{x})=\phi(\boldsymbol{x})e^{-r(\boldsymbol{x})}$ to induce a form of the score function whose partial derivatives with respect to both $r$ and $\psi$ are Neyman orthogonal.}.

Remark 1.

According to Tan (2019a), the score function $h(\boldsymbol{D};\beta,\eta)$ is doubly robust in the sense that when $r(\cdot)=r_{0}(\cdot)$ or $m(\cdot)=m_{0}(\cdot)$, $\beta_{0}$ solves $\mathbb{E}h(\boldsymbol{D};\beta,\eta)=0$. We briefly demonstrate this as follows. When either $r(\cdot)=r_{0}(\cdot)$ or $m(\cdot)=m_{0}(\cdot)$ holds, we have

\begin{aligned}
&\mathbb{E}\,\psi(\boldsymbol{X})(1-Y)\{e^{r(\boldsymbol{X})}-e^{r_{0}(\boldsymbol{X})}\}\{A-m(\boldsymbol{X})\}\\
&\qquad=\mathbb{E}\left[\mathbb{P}(Y=0\mid\boldsymbol{X})\,\psi(\boldsymbol{X})\{e^{r(\boldsymbol{X})}-e^{r_{0}(\boldsymbol{X})}\}\,\mathbb{E}\{A-m(\boldsymbol{X})\mid Y=0,\boldsymbol{X}\}\right]=0,
\end{aligned}

which, combined with (1), leads to

\begin{aligned}
\mathbb{E}h(\boldsymbol{D};\beta_{0},\eta)&=\mathbb{E}\,\psi(\boldsymbol{X})\{Ye^{-\beta_{0}A}-(1-Y)e^{r(\boldsymbol{X})}\}\{A-m(\boldsymbol{X})\}\\
&=\mathbb{E}\,\psi(\boldsymbol{X})e^{r_{0}(\boldsymbol{X})}\{Ye^{-\beta_{0}A-r_{0}(\boldsymbol{X})}-(1-Y)\}\{A-m(\boldsymbol{X})\}\\
&=\mathbb{E}\,\psi(\boldsymbol{X})e^{r_{0}(\boldsymbol{X})}\{A-m(\boldsymbol{X})\}\frac{Y-\mathbb{P}(Y=1\mid A,\boldsymbol{X})}{\mathbb{P}(Y=1\mid A,\boldsymbol{X})}=0.
\end{aligned}
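As a quick numerical illustration of this double robustness (not part of the formal development), the following sketch checks by Monte Carlo that the score has mean approximately zero at $\beta_{0}$ when $r(\cdot)=r_{0}(\cdot)$ while $m(\cdot)$ is deliberately misspecified; the simple one-dimensional design below is purely illustrative.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n, beta0 = 2_000_000, 0.5
expit = lambda u: 1.0 / (1.0 + np.exp(-u))

# Simulate from model (1) with a scalar confounder X.
X = rng.normal(size=n)
A = 0.3 * X + rng.normal(size=n)
r0 = -0.5 * X                                  # true nonparametric component
Y = rng.binomial(1, expit(beta0 * A + r0))

psi = expit(-r0)                               # any bounded weight psi(X)
m_wrong = np.zeros(n)                          # deliberately misspecified m(X)
score = psi * (Y * np.exp(-beta0 * A) - (1 - Y) * np.exp(r0)) * (A - m_wrong)
print(score.mean())                            # approx 0, up to Monte Carlo error
\end{verbatim}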

Suppose the nuisance models $r_{0}(\boldsymbol{x})$ and $m_{0}(\boldsymbol{x})$ are estimated by $\widehat{r}(\boldsymbol{x})$ and $\widehat{m}(\boldsymbol{x})$, converging to limits $\bar{r}(\boldsymbol{x})$ and $\bar{m}(\boldsymbol{x})$, and let $\widehat{\psi}(\boldsymbol{x})$ be the estimator of $\psi(\boldsymbol{x})$ approaching $\bar{\psi}(\boldsymbol{x})$. Denote $\bar{\eta}=\{\bar{r}(\cdot),\bar{m}(\cdot),\bar{\psi}(\cdot)\}$ and $\widehat{\eta}=\{\widehat{r}(\cdot),\widehat{m}(\cdot),\widehat{\psi}(\cdot)\}$. We then write the Gateaux (partial) derivative of the score function $h(\boldsymbol{D};\beta_{0},\bar{\eta})$ as

\begin{split}
&\partial_{\eta}h(\boldsymbol{D};\beta_{0},\bar{\eta})[\eta-\bar{\eta}]\\
=&\ \partial_{\psi}h(\boldsymbol{D};\beta_{0},\bar{\eta})[\psi-\bar{\psi}]+\partial_{r}h(\boldsymbol{D};\beta_{0},\bar{\eta})[r-\bar{r}]+\partial_{m}h(\boldsymbol{D};\beta_{0},\bar{\eta})[m-\bar{m}]\\
=:&\ \{Ye^{-\beta_{0}A}-(1-Y)e^{\bar{r}(\boldsymbol{X})}\}\{A-\bar{m}(\boldsymbol{X})\}\{\psi(\boldsymbol{X})-\bar{\psi}(\boldsymbol{X})\}\\
&-(1-Y)\bar{\psi}(\boldsymbol{X})e^{\bar{r}(\boldsymbol{X})}\{A-\bar{m}(\boldsymbol{X})\}\{r(\boldsymbol{X})-\bar{r}(\boldsymbol{X})\}\\
&-\bar{\psi}(\boldsymbol{X})\{Ye^{-\beta_{0}A}-(1-Y)e^{\bar{r}(\boldsymbol{X})}\}\{m(\boldsymbol{X})-\bar{m}(\boldsymbol{X})\}.
\end{split} \qquad (3)
Remark 2.

We evaluate the Neyman orthogonal score at some limiting parameters $\bar{r}(\cdot)$ and $\bar{m}(\cdot)$ instead of at $r_{0}(\cdot)$ and $m_{0}(\cdot)$ as in Chernozhukov et al. (2018a). This is because, different from their ML framework (and ours) where both nuisance estimators are assumed to converge to the true models, i.e. $\bar{r}(\cdot)=r_{0}(\cdot)$ and $\bar{m}(\cdot)=m_{0}(\cdot)$, our HD realization allows (at most) one nuisance model to be wrongly specified, and thus the score function to be analyzed may not be evaluated at the true models.

By the deduction in Remark 1, $\mathbb{E}\partial_{\psi}h(\boldsymbol{D};\beta_{0},\bar{\eta})[\psi-\bar{\psi}]=0$ for any $\psi$ whenever $\bar{r}(\cdot)=r_{0}(\cdot)$ or $\bar{m}(\cdot)=m_{0}(\cdot)$. Also, $\mathbb{E}\partial_{r}h(\boldsymbol{D};\beta_{0},\bar{\eta})[r-\bar{r}]=0$ when $\bar{m}(\cdot)=m_{0}(\cdot)$, and $\mathbb{E}\partial_{m}h(\boldsymbol{D};\beta_{0},\bar{\eta})[m-\bar{m}]=0$ when $\bar{r}(\cdot)=r_{0}(\cdot)$. Thus, under the ML setting, $h(\boldsymbol{D};\beta_{0},\bar{\eta})$ satisfies the Neyman orthogonality defined in Chernozhukov et al. (2018a), and the first order (over-fitting) bias of the estimating equation $n^{-1}\sum_{i=1}^{n}h(\boldsymbol{D}_{i};\beta,\widehat{\eta})=0$ can be removed through cross-fitting (introduced in Section 3.2) and concentration. Under the HD parametric setting with $\bar{r}(\cdot)\neq r_{0}(\cdot)$ or $\bar{m}(\cdot)\neq m_{0}(\cdot)$, one needs to carefully construct the moment equations for $\bar{r}(\cdot)$ and $\bar{m}(\cdot)$ to ensure the orthogonality conditions. Similar to existing literature (Chernozhukov et al., 2018a; Tan, 2020a), removal of the second order (and higher) bias relies on the assumption that the nuisance estimators $\widehat{r}(\cdot)$ and $\widehat{m}(\cdot)$ are of sufficiently high quality (see Section 4).

3 Method

In this section, we separately present our specific construction procedures for the HD parametric and ML nonparametric realizations of the debiased/DML estimator for $\beta_{0}$, based on the Neyman orthogonal score derived in Section 2.

3.1 High dimensional parametric model realization

Consider the setting where $p\gg n$, each $\boldsymbol{X}_{i}$ has its first element equal to $1$, $r(\boldsymbol{x})=\boldsymbol{x}^{\intercal}\boldsymbol{\gamma}$ and $m(\boldsymbol{x})=g(\boldsymbol{x}^{\intercal}\boldsymbol{\alpha})$, where $g(\cdot)$ is a monotone and smooth link function with derivative $g^{\prime}(\cdot)$. Inspired by existing work including Smucler et al. (2019); Tan (2020a); Dukes and Vansteelandt (2020), we construct Dantzig moment equations to empirically ensure the Neyman orthogonality conditions $\partial_{r}h(\boldsymbol{D};\beta_{0},\bar{\eta})[r-\bar{r}]=0$ and $\partial_{m}h(\boldsymbol{D};\beta_{0},\bar{\eta})[m-\bar{m}]=0$, under potential misspecification of (at most) one of the nuisance models.

First, we obtain $\widetilde{\boldsymbol{\gamma}}$ as an initial estimator for $\boldsymbol{\gamma}$ that converges to a limiting parameter $\boldsymbol{\gamma}^{*}$ equal to the true model parameter $\boldsymbol{\gamma}_{0}$ when $r(\boldsymbol{x})$ is correctly specified, i.e. $r_{0}(\boldsymbol{x})=\boldsymbol{x}^{\intercal}\boldsymbol{\gamma}_{0}$. Let $\widehat{\psi}(\boldsymbol{x})$ be an estimator of the nuisance function $\psi(\boldsymbol{x})$ depending on $\widetilde{\boldsymbol{\gamma}}$, with limiting function $\bar{\psi}(\boldsymbol{x})$; its choice is discussed in Section 3.3. According to (3), we obtain $\widehat{\boldsymbol{\alpha}}$ through the Dantzig moment equation:

{\rm min}_{\boldsymbol{\alpha}\in\mathbb{R}^{p}}\|\boldsymbol{\alpha}\|_{1}\quad{\rm s.t.}\quad\left\|n^{-1}\sum_{i=1}^{n}(1-Y_{i})\widehat{\psi}(\boldsymbol{X}_{i})e^{\boldsymbol{X}_{i}^{\intercal}\widetilde{\boldsymbol{\gamma}}}\{A_{i}-g(\boldsymbol{X}_{i}^{\intercal}\boldsymbol{\alpha})\}\boldsymbol{X}_{i}\right\|_{\infty}\leq\lambda_{\alpha}, \qquad (4)

where $\lambda_{\alpha}$ is a tuning parameter controlling the regularization bias. Finally, we obtain the nuisance estimator $\widehat{\boldsymbol{\gamma}}$ and the targeted HD estimator $\widehat{\beta}_{\sf HD}$ simultaneously from:

\begin{split}
{\rm min}_{\beta\in\mathbb{R},\boldsymbol{\gamma}\in\mathbb{R}^{p}}\|\boldsymbol{\gamma}\|_{1}\quad{\rm s.t.}\quad\left\|n^{-1}\sum_{i=1}^{n}\widehat{\psi}(\boldsymbol{X}_{i})\{Y_{i}e^{-\beta A_{i}}-(1-Y_{i})e^{\boldsymbol{X}_{i}^{\intercal}\boldsymbol{\gamma}}\}g^{\prime}(\boldsymbol{X}_{i}^{\intercal}\widehat{\boldsymbol{\alpha}})\boldsymbol{X}_{i}\right\|_{\infty}&\leq\lambda_{\gamma};\\
n^{-1}\sum_{i=1}^{n}\widehat{\psi}(\boldsymbol{X}_{i})\{Y_{i}e^{-\beta A_{i}}-(1-Y_{i})e^{\boldsymbol{X}_{i}^{\intercal}\boldsymbol{\gamma}}\}\{A_{i}-g(\boldsymbol{X}_{i}^{\intercal}\widehat{\boldsymbol{\alpha}})\}&=0.
\end{split} \qquad (5)

Let the limits of $\{\widehat{\boldsymbol{\alpha}},\widehat{\boldsymbol{\gamma}}\}$ be $\{\bar{\boldsymbol{\alpha}},\bar{\boldsymbol{\gamma}}\}$, and $\bar{\eta}=\{\bar{r}(\cdot),\bar{m}(\cdot),\bar{\psi}(\cdot)\}$ where $\bar{r}(\boldsymbol{x})=\boldsymbol{x}^{\intercal}\bar{\boldsymbol{\gamma}}$, $\bar{m}(\boldsymbol{x})=g(\boldsymbol{x}^{\intercal}\bar{\boldsymbol{\alpha}})$ and $\bar{\psi}(\boldsymbol{x})$ is as given in Section 3.3. We comment on the orthogonality (moment) conditions of our proposal in Remark 3, compare our method with Dukes and Vansteelandt (2020) in Remark 4, and discuss its numerical implementation with a weighted LASSO formulation in Remark 5.
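To illustrate the construction of (4), here is a minimal sketch assuming the identity link $g(u)=u$, so that the constraint is linear in $\boldsymbol{\alpha}$ and the problem becomes a weighted Dantzig-selector linear program; with a nonlinear $g$ one would instead use the weighted LASSO counterpart discussed in Remark 5 and Appendix D. The function name and the LP solver choice are illustrative assumptions.

\begin{verbatim}
import numpy as np
from scipy.optimize import linprog

def dantzig_alpha(Y, A, X, gamma_tilde, psi_hat, lam):
    """Solve (4) for alpha under the identity link g(u) = u."""
    n, p = X.shape
    w = (1 - Y) * psi_hat(X) * np.exp(X @ gamma_tilde)  # per-sample weights
    c = (w * A) @ X / n                                  # n^{-1} sum_i w_i A_i X_i
    M = (X * w[:, None]).T @ X / n                       # n^{-1} sum_i w_i X_i X_i^T
    # Write alpha = u - v with u, v >= 0 and minimize sum(u) + sum(v)
    # subject to -lam <= c - M(u - v) <= lam, elementwise.
    obj = np.ones(2 * p)
    A_ub = np.vstack([np.hstack([-M, M]),     #  c - M(u - v) <= lam
                      np.hstack([M, -M])])    #  M(u - v) - c <= lam
    b_ub = np.concatenate([lam - c, lam + c])
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, None)] * (2 * p), method="highs")
    u, v = res.x[:p], res.x[p:]
    return u - v
\end{verbatim}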

Remark 3.

Neglect the second (and higher) order error terms for now. When $r(\boldsymbol{x})$ is correct (see Assumption HD1), i.e. $r_{0}(\boldsymbol{x})=\boldsymbol{x}^{\intercal}\boldsymbol{\gamma}_{0}$, it naturally holds that $\mathbb{E}\partial_{m}h(\boldsymbol{D};\beta_{0},\bar{\eta})[\widehat{m}-m_{0}]=0$ and $\boldsymbol{\gamma}^{*}=\bar{\boldsymbol{\gamma}}=\boldsymbol{\gamma}_{0}$. Then the $\ell_{\infty}$-constraint in (4) imposes that

\mathbb{E}\partial_{r}h(\boldsymbol{D};\beta_{0},\bar{\eta})[\widehat{r}-r_{0}]\approx\mathbb{E}(1-Y)\bar{\psi}(\boldsymbol{X})e^{\boldsymbol{X}^{\intercal}\boldsymbol{\gamma}_{0}}\{A-g(\boldsymbol{X}^{\intercal}\bar{\boldsymbol{\alpha}})\}\boldsymbol{X}^{\intercal}(\widehat{\boldsymbol{\gamma}}-\boldsymbol{\gamma}_{0})=\mathbf{0}^{\intercal}(\widehat{\boldsymbol{\gamma}}-\boldsymbol{\gamma}_{0}).

When $\bar{m}(\boldsymbol{x})=m_{0}(\boldsymbol{x})=g(\boldsymbol{x}^{\intercal}\boldsymbol{\alpha}_{0})$, we have $\mathbb{E}\partial_{r}h(\boldsymbol{D};\beta_{0},\bar{\eta})[\widehat{r}-r_{0}]=0$ and $\bar{\boldsymbol{\alpha}}=\boldsymbol{\alpha}_{0}$ in turn. And the $\ell_{\infty}$-constraint of (5) corresponds to

\mathbb{E}\partial_{m}h(\boldsymbol{D};\beta_{0},\bar{\eta})[\widehat{m}-m_{0}]\approx\mathbb{E}\bar{\psi}(\boldsymbol{X})\{Ye^{-\beta_{0}A}-(1-Y)e^{\boldsymbol{X}^{\intercal}\bar{\boldsymbol{\gamma}}}\}g^{\prime}(\boldsymbol{X}^{\intercal}\boldsymbol{\alpha}_{0})\boldsymbol{X}^{\intercal}(\widehat{\boldsymbol{\alpha}}-\boldsymbol{\alpha}_{0})=\mathbf{0}^{\intercal}(\widehat{\boldsymbol{\alpha}}-\boldsymbol{\alpha}_{0}).

Thus, the Neyman orthogonality condition $\mathbb{E}\partial_{\eta}h(\boldsymbol{D};\beta_{0},\bar{\eta})[\eta-\bar{\eta}]=0$ as introduced in Section 2 is satisfied under our construction when either $r(\cdot)$ or $m(\cdot)$ is correctly specified.

Remark 4.

Similar to the HD partially linear (or log-linear) setting studied in Dukes and Vansteelandt (2020), the estimating equation for the nuisance parameter $\boldsymbol{\gamma}$ in our framework involves the unknown $\beta$. Unlike their construction, which plugs in each candidate $\beta\in\mathbb{R}$ in place of $\beta_{0}$ to estimate $\boldsymbol{\gamma}$ and inverts the resulting score-test $p$-values for interval estimation of $\beta_{0}$, we solve for $\widehat{\beta}$ and $\widehat{\boldsymbol{\gamma}}$ jointly from (5), with the moment equation for $\widehat{\beta}$ being doubly robust, as demonstrated in Remark 1. Compared with their method, ours is friendlier in computation and implementation, additionally provides an $n^{-1/2}$-consistent point estimator of $\beta_{0}$, and preserves similar theoretical guarantees (see Section 4.1).

Remark 5.

As detailed in Appendix D, one could also construct LASSO problems with the same Karush–Kuhn–Tucker (KKT) conditions as the $\ell_{\infty}$-norm constraints in (4) and (5) to obtain the estimators $\widehat{\boldsymbol{\alpha}}$ and $\widehat{\boldsymbol{\gamma}}$, which have theoretical properties equivalent to the Dantzig equations\footnote{We present the Dantzig equation version here because it is more intuitive and directly connected with the Neyman orthogonality conditions.}. Due to the non-convexity of the equation in the second row of (5)\footnote{Its partial derivative in $\beta$, $n^{-1}\sum_{i=1}^{n}-\widehat{\psi}(\boldsymbol{X}_{i})Y_{i}e^{-\beta A_{i}}A_{i}\{A_{i}-g(\boldsymbol{X}_{i}^{\intercal}\widehat{\boldsymbol{\alpha}})\}$, is not always of one sign, though empirically and theoretically the partial derivative should be negative with very high probability.}, a numerical solution of the LASSO counterpart of (5) cannot be obtained with existing software such as the R packages "glmnet" (Friedman et al., 2010) and "RCAL" (Tan, 2019b). A direct solution is to program an optimization procedure such as the Fisher scoring descent algorithm used by Tan (2020b). We instead find a convenient way to moderately modify the construction procedure so that the regularized estimating equations become solvable with the R package "RCAL", and we use this modification for the numerical implementation of our method. In Appendix D, we outline this modification and demonstrate its theoretical guarantee.

3.2 Machine learning realization

We now turn to a (nonparametric) ML setting under which any conditional mean learning algorithm could potentially be applied to estimate the nuisance functions. Similar to Chernozhukov et al. (2018a), we randomly split the $n$ samples into $K$ folds $\mathcal{I}_{1},\mathcal{I}_{2},\ldots,\mathcal{I}_{K}$ of equal size, to assist in removing the first order (over-fitting) bias through concentration. The cross-fitted estimating equation for $\beta$ is then constructed as

n^{-1}\sum_{k=1}^{K}\sum_{i\in\mathcal{I}_{k}}h(\boldsymbol{D}_{i};\beta,\widehat{\eta}^{[\text{-}k]})=0, \qquad (6)

where $\widehat{\eta}^{[\text{-}k]}=\{\widehat{r}^{[\text{-}k]}(\cdot),\widehat{m}^{[\text{-}k]}(\cdot),\widehat{\psi}^{[\text{-}k]}(\cdot)\}$ represents ML estimators converging to $\bar{r}(\cdot)=r_{0}(\cdot)$, $\bar{m}(\cdot)=m_{0}(\cdot)$ and $\bar{\psi}(\cdot)$, obtained with the samples in $\mathcal{I}_{\text{-}k}=\{1,\ldots,n\}\setminus\mathcal{I}_{k}$ and hence independent of the samples in $\mathcal{I}_{k}$. We now present the specific construction procedures for $\widehat{r}^{[\text{-}k]}(\cdot)$ and $\widehat{m}^{[\text{-}k]}(\cdot)$, with the choice of $\widehat{\psi}^{[\text{-}k]}(\cdot)$ again discussed in Section 3.3.

Suppose there is a blackbox learning algorithm $\mathscr{L}(R_{i},\boldsymbol{C}_{i};\mathcal{I})$ that inputs samples $\mathcal{I}\subseteq\{1,2,\ldots,n\}$ with some response $R_{i}$ and covariates $\boldsymbol{C}_{i}$, and outputs an estimate of $\mathbb{E}[R_{i}\mid\boldsymbol{C}_{i},i\in\mathcal{I}]$. We outline below our approach utilizing $\mathscr{L}$ to estimate the nuisance functions. Corresponding to the definition of $m_{0}(\cdot)$, it can be estimated by $\widehat{m}^{[\text{-}k]}(\cdot)=\mathscr{L}(A_{i},\boldsymbol{X}_{i};\mathcal{I}_{\text{-}k}\cap\{i\colon Y_{i}=0\})$. Compared to the partially linear setting in Chernozhukov et al. (2018a), estimation of $r_{0}(\cdot)$ with $\mathscr{L}$ is more involved, since $r_{0}(\cdot)$ is defined through a non-linear form: $\mathbb{P}(Y=1\mid A,\boldsymbol{X})={\rm expit}\{\beta_{0}A+r_{0}(\boldsymbol{X})\}$. One could modify some ML approaches, e.g. a neural network\footnote{By setting the last layer of the neural network to be the combination of a complex network of $\boldsymbol{X}$ and a linear function of $A$, and linking it with the outcome through an expit link.}, to accommodate this form. However, such a modification is not readily available in general, and typically requires additional human effort to implement and validate when it exists. So, alternatively, we propose a "full model refitting" (FMR) procedure using an arbitrary $\mathscr{L}$ to estimate $r_{0}(\cdot)$. Our method is motivated by a simple proposition:

Proposition 1.

Let $M_{0}(A,\boldsymbol{X})=\mathbb{P}(Y=1\mid A,\boldsymbol{X})={\rm expit}\{\beta_{0}A+r_{0}(\boldsymbol{X})\}$. We have:

\beta_{0}={\rm argmin}_{\beta\in\mathbb{R}}\mathbb{E}\left[{\rm logit}\{M_{0}(A,\boldsymbol{X})\}-\beta(A-\mathbb{E}[A\mid\boldsymbol{X}])\right]^{2}.
Proof.

For any $\beta\in\mathbb{R}$, we have

\begin{aligned}
&\mathbb{E}\left[{\rm logit}\{M_{0}(A,\boldsymbol{X})\}-\beta(A-\mathbb{E}[A\mid\boldsymbol{X}])\right]^{2}=\mathbb{E}\left\{\beta_{0}A+r_{0}(\boldsymbol{X})-\beta(A-\mathbb{E}[A\mid\boldsymbol{X}])\right\}^{2}\\
&\quad=\mathbb{E}\left\{(\beta_{0}-\beta)(A-\mathbb{E}[A\mid\boldsymbol{X}])+\eta(\boldsymbol{X})\right\}^{2}=(\beta_{0}-\beta)^{2}\mathbb{E}(A-\mathbb{E}[A\mid\boldsymbol{X}])^{2}+\mathbb{E}\{\eta(\boldsymbol{X})\}^{2},
\end{aligned}

where $\eta(\boldsymbol{X})=r_{0}(\boldsymbol{X})+\beta_{0}\mathbb{E}[A\mid\boldsymbol{X}]$. Thus, $\beta_{0}$ minimizes $\mathbb{E}\left[{\rm logit}\{M_{0}(A,\boldsymbol{X})\}-\beta(A-\mathbb{E}[A\mid\boldsymbol{X}])\right]^{2}$. ∎

Further randomly split each $\mathcal{I}_{\text{-}k}$ into $K$ folds $\mathcal{I}_{\text{-}k,1},\ldots,\mathcal{I}_{\text{-}k,K}$ of equal size and denote $\mathcal{I}_{\text{-}k,\text{-}j}=\mathcal{I}_{\text{-}k}\setminus\mathcal{I}_{\text{-}k,j}$. Motivated by Proposition 1, we first estimate the "full" model $M_{0}(A,\boldsymbol{X})$ with $\mathcal{I}_{\text{-}k,\text{-}j}$ as:

\widehat{M}^{[\text{-}k,\text{-}j]}(\cdot)=\mathscr{L}\big(Y_{i},(A_{i},\boldsymbol{X}_{i}^{\intercal})^{\intercal};\mathcal{I}_{\text{-}k,\text{-}j}\big),

and learn $a_{0}(\boldsymbol{x})=\mathbb{E}[A\mid\boldsymbol{X}=\boldsymbol{x}]$ as $\widehat{a}^{[\text{-}k,\text{-}j]}(\cdot)=\mathscr{L}(A_{i},\boldsymbol{X}_{i};\mathcal{I}_{\text{-}k,\text{-}j})$. Then we fit the (cross-fitted) least squares regression to obtain:

\breve{\beta}^{[\text{-}k]}={\rm argmin}_{\beta\in\mathbb{R}}\frac{1}{|\mathcal{I}_{\text{-}k}|}\sum_{j=1}^{K}\sum_{i\in\mathcal{I}_{\text{-}k,j}}\left[{\rm logit}\{\widehat{M}^{[\text{-}k,\text{-}j]}(A_{i},\boldsymbol{X}_{i})\}-\beta\{A_{i}-\widehat{a}^{[\text{-}k,\text{-}j]}(\boldsymbol{X}_{i})\}\right]^{2}, \qquad (7)

as an estimator approaching $\beta_{0}$ at a certain rate, typically slower than $n^{-1/2}$. Then $r_{0}(\cdot)$ could be identified through $r_{0}(\boldsymbol{X}_{i})={\rm logit}\{M_{0}(A_{i},\boldsymbol{X}_{i})\}-\beta_{0}A_{i}$. Note that the empirically estimated version of ${\rm logit}\{M_{0}(A_{i},\boldsymbol{X}_{i})\}-\beta_{0}A_{i}$ typically still involves $A_{i}$, due to the discrepancy between $\beta_{0}$, $M_{0}(\cdot)$ and their empirical estimates. This can essentially impede removal of the over-fitting bias, since $\partial_{r}h(\boldsymbol{D};\beta_{0},\bar{\eta})$ is not orthogonal to error functions depending on $A$. So we further estimate the conditional mean of ${\rm logit}\{M_{0}(A,\boldsymbol{X})\}-\beta_{0}A$ given $\boldsymbol{X}$ to estimate $r_{0}(\cdot)$. Denote $W_{i}={\rm logit}\{\widehat{M}^{[\text{-}k,\text{-}j]}(A_{i},\boldsymbol{X}_{i})\}$ for each $i\in\mathcal{I}_{\text{-}k,j}$ and obtain $\widehat{t}^{[\text{-}k]}(\cdot)=\mathscr{L}(W_{i},\boldsymbol{X}_{i};\mathcal{I}_{\text{-}k})$ to estimate $t_{0}(\boldsymbol{x}):=\mathbb{E}[{\rm logit}\{M_{0}(A,\boldsymbol{X})\}\mid\boldsymbol{X}=\boldsymbol{x}]$. Then the estimator of $r_{0}(\cdot)$ is given by:

\widehat{r}^{[\text{-}k]}(\boldsymbol{x})=\widehat{t}^{[\text{-}k]}(\boldsymbol{x})-\breve{\beta}^{[\text{-}k]}\widehat{a}^{[\text{-}k]}(\boldsymbol{x}),\quad\mbox{where}\quad\widehat{a}^{[\text{-}k]}(\boldsymbol{x})=\frac{1}{K}\sum_{j=1}^{K}\widehat{a}^{[\text{-}k,\text{-}j]}(\boldsymbol{x}). \qquad (8)

Alternatively, one can estimate $r_{0}(\cdot)$ through

\widehat{r}^{[\text{-}k]}(\cdot)=\log\left(\frac{\mathscr{L}(e^{-\breve{\beta}^{[\text{-}k]}A_{i}},\boldsymbol{X}_{i};\mathcal{I}_{\text{-}k}\cap\{i\colon Y_{i}=1\})}{\mathscr{L}(1-Y_{i},\boldsymbol{X}_{i};\mathcal{I}_{\text{-}k})}\right), \qquad (9)

inspired by the moment condition that is sufficient to identify $r_{0}(\cdot)$:

\mathbb{E}\left[Ye^{-\beta_{0}A}-(1-Y)e^{r_{0}(\boldsymbol{X})}\,\Big|\,\boldsymbol{X}\right]=\mathbb{E}\left[e^{-\beta_{0}A}\,\Big|\,\boldsymbol{X},Y=1\right]-e^{r_{0}(\boldsymbol{X})}\mathbb{E}\left[(1-Y)\,\big|\,\boldsymbol{X}\right]=0.

We refer to the estimation steps for $\breve{\beta}^{[\text{-}k]}$ and $\widehat{r}^{[\text{-}k]}(\cdot)$ introduced above as "refitting", and to the whole procedure as FMR, since we "refit" the least squares problem (7) and the ML models $\mathscr{L}$ to estimate $r_{0}(\cdot)$ with the initial estimate of the full model ${\rm logit}\{M_{0}(A_{i},\boldsymbol{X}_{i})\}$ as a pseudo-outcome. Finally, we solve (6) based on $\widehat{\eta}^{[\text{-}k]}$ to obtain the DML estimator $\widehat{\beta}_{\sf ML}$.
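To fix ideas, the following is a minimal sketch of the FMR procedure and of solving the cross-fitted equation (6), using random forests from scikit-learn as the generic blackbox learner $\mathscr{L}$ and the simplified weight $\psi_{\rm simp}$ from Section 3.3. The function names, forest sizes, clipping constant and root-finding bracket are illustrative assumptions, not prescriptions of the paper.

\begin{verbatim}
import numpy as np
from scipy.optimize import brentq
from scipy.special import expit, logit
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import KFold

def fmr_nuisances(Y, A, X, K=5, eps=1e-3):
    """Full model refitting on one training split: returns r_hat(.) and m_hat(.)."""
    n = len(Y)
    W = np.empty(n)          # pseudo-outcome logit{M_hat(A_i, X_i)}, cross-fitted
    a_oof = np.empty(n)      # out-of-fold predictions of a0(x) = E[A | X]
    a_models = []
    for tr, te in KFold(K, shuffle=True, random_state=0).split(X):
        AX_tr = np.column_stack([A[tr], X[tr]])
        M_hat = RandomForestClassifier(n_estimators=500).fit(AX_tr, Y[tr])
        a_hat = RandomForestRegressor(n_estimators=200).fit(X[tr], A[tr])
        a_models.append(a_hat)
        p = M_hat.predict_proba(np.column_stack([A[te], X[te]]))[:, 1]
        W[te] = logit(np.clip(p, eps, 1 - eps))
        a_oof[te] = a_hat.predict(X[te])
    # Refitting step (7): least squares of W on the residual A - a_hat(X).
    resid = A - a_oof
    beta_breve = np.sum(resid * W) / np.sum(resid ** 2)
    # t0(x) = E[logit{M0(A,X)} | X = x]; then r_hat from (8).
    t_hat = RandomForestRegressor(n_estimators=200).fit(X, W)
    a_bar = lambda Xn: np.mean([m.predict(Xn) for m in a_models], axis=0)
    r_hat = lambda Xn: t_hat.predict(Xn) - beta_breve * a_bar(Xn)
    # m0(x) = E[A | Y=0, X=x], learned on the controls only.
    m_fit = RandomForestRegressor(n_estimators=200).fit(X[Y == 0], A[Y == 0])
    return r_hat, m_fit.predict

def dml_beta(Y, A, X, K=5, bracket=(-5.0, 5.0)):
    """Cross-fitted DML estimator solving equation (6) with psi = expit(-r)."""
    pieces = []
    for tr, te in KFold(K, shuffle=True, random_state=1).split(X):
        r_hat, m_hat = fmr_nuisances(Y[tr], A[tr], X[tr], K)
        pieces.append((te, r_hat(X[te]), m_hat(X[te])))
    def ee(beta):
        s = 0.0
        for te, r_te, m_te in pieces:
            psi = expit(-r_te)
            s += np.sum(psi * (Y[te] * np.exp(-beta * A[te])
                               - (1 - Y[te]) * np.exp(r_te)) * (A[te] - m_te))
        return s / len(Y)
    return brentq(ee, *bracket)   # widen the bracket if no sign change
\end{verbatim}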

Remark 6.

We further use cross-fitting in FMR to avoid over-fitting of the models $\widehat{M}^{[\text{-}k,\text{-}j]}(\cdot)$ and $\widehat{a}^{[\text{-}k,\text{-}j]}(\cdot)$ when they are used to obtain the estimators $\breve{\beta}^{[\text{-}k]}$, $\widehat{t}^{[\text{-}k]}(\boldsymbol{x})$ and $\widehat{r}^{[\text{-}k]}(\boldsymbol{x})$.

Remark 7.

FMR implicitly assumes that $\mathscr{L}$ performs similarly well on different learning tasks with the covariates set as either $\boldsymbol{X}$ or $(A,\boldsymbol{X}^{\intercal})^{\intercal}$. Classic nonparametric approaches such as kernel smoothing or sieves may not satisfy this, since including one more covariate $A$ in addition to a very low dimensional $\boldsymbol{X}$ can have a substantial impact on estimation performance. Thus, we recommend using more dimensionality-robust modern ML approaches, such as random forests and neural networks, in our ML framework. Meanwhile, the classic "plug-in" sieve or kernel methods have been well studied in the existing literature (Severini and Staniswalis, 1994; Lin and Carroll, 2006).

3.3 Efficiency consideration

The nuisance function $\psi(\cdot)$ in our framework is included and chosen in consideration of estimation efficiency. Tan (2019a) proposed and studied two options for the function $\phi(\cdot)$ used in (2), with the corresponding $\psi(\cdot)$ taken as:

\psi_{\rm opt}(\boldsymbol{x})=\frac{e^{-r(\boldsymbol{x})}\mathbb{E}[\{A-m(\boldsymbol{X})\}^{2}\mid\boldsymbol{X}=\boldsymbol{x},Y=0]}{\mathbb{E}[\{A-m(\boldsymbol{X})\}^{2}/{\rm expit}\{\beta_{0}A+r(\boldsymbol{X})\}\mid\boldsymbol{X}=\boldsymbol{x},Y=0]};\qquad\psi_{\rm simp}(\boldsymbol{x})={\rm expit}\{-r(\boldsymbol{x})\}.

It was shown that when both nuisance models are correctly specified, the estimator solved with the weight $\psi_{\rm opt}(\cdot)$ achieves the minimum asymptotic variance. However, computation of $\psi_{\rm opt}(\cdot)$ involves numerical integration with respect to $A$ given $\boldsymbol{X}$ and $Y=0$, making it sometimes inconvenient to implement. So Tan (2019a) proposed the simplified but reasonable choice $\psi_{\rm simp}(\boldsymbol{x})$ defined above, obtained by evaluating $\psi_{\rm opt}(\boldsymbol{x})$ at $\beta_{0}=0$. In the following theoretical and numerical studies, we stick to $\psi(\boldsymbol{x})=\psi_{\rm simp}(\boldsymbol{x})$, with $\widehat{\psi}(\boldsymbol{x})={\rm expit}(-\boldsymbol{x}^{\intercal}\widetilde{\boldsymbol{\gamma}})$ and correspondingly $\bar{\psi}(\boldsymbol{x})={\rm expit}(-\boldsymbol{x}^{\intercal}\boldsymbol{\gamma}^{*})$ under the HD setting, and $\widehat{\psi}^{[\text{-}k]}(\boldsymbol{x})={\rm expit}\{-\widehat{r}^{[\text{-}k]}(\boldsymbol{x})\}$ and $\bar{\psi}(\boldsymbol{x})={\rm expit}\{-r_{0}(\boldsymbol{x})\}$ under the ML setting.

4 Asymptotic analysis

Let $o(\alpha_{n})$, $O(\alpha_{n})$, $\omega(\alpha_{n})$, $\Omega(\alpha_{n})$ and $\Theta(\alpha_{n})$ represent sequences growing at a smaller, equal/smaller, larger, equal/larger and equal rate relative to $\alpha_{n}$, respectively, and let $o_{\mathbb{P}}$, $O_{\mathbb{P}}$, $\omega_{\mathbb{P}}$, $\Omega_{\mathbb{P}}$ and $\Theta_{\mathbb{P}}$ be the corresponding rates holding with probability approaching $1$ as $n\rightarrow\infty$. Let $\mathcal{X}\subseteq\mathbb{R}^{p}$ be the domain of $\boldsymbol{X}$. First, we introduce the regularity condition for $\beta$ and its estimating equation, used under both the HD and ML settings, as Assumption REG; it is standard and commonly found in the literature on the asymptotic analysis of $M$-estimators (Van der Vaart, 2000, Chapter 5). We then study the asymptotic properties of $\widehat{\beta}_{\sf HD}$ and $\widehat{\beta}_{\sf ML}$ in Sections 4.1 and 4.2, respectively.

Assumption REG (Regularity of the estimating equation for $\beta$).

Parameter $\beta$ belongs to a compact set $\mathcal{B}\subseteq\mathbb{R}$ and there exists $\delta_{n}=\Omega(n^{-1/2}\log n)$ such that $(\beta_{0}-\delta_{n},\beta_{0}+\delta_{n})\subseteq\mathcal{B}$. Exposure $A$ belongs to a compact set $\mathcal{A}$ and $\sup_{\boldsymbol{x}\in\mathcal{X}}|\mathbb{E}[A\mid\boldsymbol{X}=\boldsymbol{x},Y=y]|=O(1)$ for $y=0,1$. Also, it is satisfied that

\mathbb{E}\bar{\psi}(\boldsymbol{X})Ye^{-\beta_{0}A}A\{A-\bar{m}(\boldsymbol{X})\}=\Theta(1)\quad\mbox{and}\quad\mathbb{E}h^{2}(\boldsymbol{D};\beta_{0},\bar{\eta})=\Theta(1).

4.1 High dimensional (parametric) setting

Let ${\rm expit}^{\prime}(\cdot)$ be the derivative of ${\rm expit}(\cdot)$, let $\|\cdot\|_{0}$ denote the number of non-zero elements in a vector, and let $s=\max\{\|\boldsymbol{\gamma}^{*}\|_{0},\|\bar{\boldsymbol{\gamma}}\|_{0},\|\bar{\boldsymbol{\alpha}}\|_{0}\}$. We introduce the following assumptions to regularize the covariates and nuisance estimators.

Assumption HD1 (Model double robustness).

At least one of the following conditions holds: (a) there exists $\boldsymbol{\gamma}_{0}\in\mathbb{R}^{p}$ such that $r_{0}(\boldsymbol{x})=\boldsymbol{x}^{\intercal}\boldsymbol{\gamma}_{0}$ and $\boldsymbol{\gamma}^{*}=\bar{\boldsymbol{\gamma}}=\boldsymbol{\gamma}_{0}$; (b) there exists $\boldsymbol{\alpha}_{0}\in\mathbb{R}^{p}$ such that $m_{0}(\boldsymbol{x})=g(\boldsymbol{x}^{\intercal}\boldsymbol{\alpha}_{0})$ and $\bar{\boldsymbol{\alpha}}=\boldsymbol{\alpha}_{0}$.

Assumption HD2 (Concentration rate).

It holds that

\begin{aligned}
\left\|n^{-1}\sum_{i=1}^{n}(1-Y_{i}){\rm expit}(-\boldsymbol{X}_{i}^{\intercal}\boldsymbol{\gamma}^{*})e^{\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\gamma}}}\{A_{i}-g(\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\alpha}})\}\boldsymbol{X}_{i}\right\|_{\infty}&=O_{\mathbb{P}}\{(\log p/n)^{1/2}\};\\
\left\|n^{-1}\sum_{i=1}^{n}{\rm expit}(-\boldsymbol{X}_{i}^{\intercal}\boldsymbol{\gamma}^{*})\{Y_{i}e^{-\beta_{0}A_{i}}-(1-Y_{i})e^{\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\gamma}}}\}g^{\prime}(\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\alpha}})\boldsymbol{X}_{i}\right\|_{\infty}&=O_{\mathbb{P}}\{(\log p/n)^{1/2}\};\\
\left\|n^{-1}\sum_{i=1}^{n}{\rm expit}^{\prime}(-\boldsymbol{X}_{i}^{\intercal}\boldsymbol{\gamma}^{*})\{Y_{i}e^{-\beta_{0}A_{i}}-(1-Y_{i})e^{\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\gamma}}}\}\{A_{i}-g(\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\alpha}})\}\boldsymbol{X}_{i}\right\|_{\infty}&=O_{\mathbb{P}}\{(\log p/n)^{1/2}\}.
\end{aligned}
Assumption HD3 (Smooth link function).

There exists $L=\Theta(1)$ such that for any $u,v\in\mathbb{R}$,

|g^{\prime}(u)-g^{\prime}(v)|\leq L|u-v|.
Assumption HD4 (Risk of the $\ell_{1}$-regularized estimators).

There exist tuning parameters $\lambda_{\alpha},\lambda_{\gamma}=\Theta\{(\log p/n)^{1/2}\}$ such that (4) and (5) have feasible solutions with probability approaching $1$ and

\begin{aligned}
&\sup_{i\in\{1,\ldots,n\}}|g(\boldsymbol{X}_{i}^{\intercal}\widehat{\boldsymbol{\alpha}})|=O_{\mathbb{P}}(1);\quad\|\widetilde{\boldsymbol{\gamma}}-\boldsymbol{\gamma}^{*}\|_{1}+\|\widehat{\boldsymbol{\gamma}}-\bar{\boldsymbol{\gamma}}\|_{1}+\|\widehat{\boldsymbol{\alpha}}-\bar{\boldsymbol{\alpha}}\|_{1}=O_{\mathbb{P}}\{s(\log p/n)^{1/2}\};\\
&n^{-1}\sum_{i=1}^{n}\{1+e^{\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\gamma}}}\}\left[\{\boldsymbol{X}_{i}^{\intercal}(\widehat{\boldsymbol{\gamma}}-\bar{\boldsymbol{\gamma}})\}^{2}+\{\boldsymbol{X}_{i}^{\intercal}(\widetilde{\boldsymbol{\gamma}}-\boldsymbol{\gamma}^{*})\}^{2}\right]+(\widehat{\beta}_{\sf HD}-\beta_{0})^{2}=O_{\mathbb{P}}(s\log p/n);\\
&n^{-1}\sum_{i=1}^{n}\{1+e^{\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\gamma}}}\}\left[\{\boldsymbol{X}_{i}^{\intercal}(\widehat{\boldsymbol{\alpha}}-\bar{\boldsymbol{\alpha}})\}^{2}+\{g(\boldsymbol{X}_{i}^{\intercal}\widehat{\boldsymbol{\alpha}})-g(\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\alpha}})\}^{2}\right]=O_{\mathbb{P}}(s\log p/n).
\end{aligned}
Assumption HD5 (Ultra-sparsity).

It holds that $s=o(n^{1/2}/\log p)$.

Remark 8.

Under Assumption HD1 and our constructions (4) and (5) (or the one introduced in Appendix D), the expectations of the terms to be concentrated in Assumption HD2 are $\mathbf{0}$ by Remark 3. Their maximum norms can then be controlled at $O_{\mathbb{P}}\{(\log p/n)^{1/2}\}$ as assumed in HD2 when the covariates $\boldsymbol{X}_{i}$ are bounded, sub-Gaussian or beyond (Kuchibhotla and Chakrabortty, 2018), using concentration results derived in the existing literature (Giné and Nickl, 2016).

Remark 9.

Rates of the prediction and estimation risk of the nuisance estimators in Assumption HD4 can be derived following the general theoretical framework for $\ell_{1}$-regularized estimation introduced in Candes et al. (2007); Bickel et al. (2009); Bühlmann and Van De Geer (2011); Negahban et al. (2012). The same rate properties have been used for analyzing doubly robust estimators with HD nuisance models in the existing literature (Smucler et al., 2019; Tan, 2020a; Dukes and Vansteelandt, 2020). Note that (4) and (5) involve the estimators $\widetilde{\boldsymbol{\gamma}}$ or $\widehat{\boldsymbol{\alpha}}$ obtained beforehand. This requires some additional effort in removing the "plug-in" errors of $\widetilde{\boldsymbol{\gamma}}$ or $\widehat{\boldsymbol{\alpha}}$ when deriving the risk rates for $\widehat{\boldsymbol{\alpha}}$ and $\widehat{\boldsymbol{\gamma}}$, compared to the standard analysis procedures. See Tan (2020a) for a similar technical issue and the relevant technical details used to handle it.

In addition, $\sup_{i\in\{1,\ldots,n\}}|g(\boldsymbol{X}_{i}^{\intercal}\widehat{\boldsymbol{\alpha}})|=O_{\mathbb{P}}(1)$ imposed in HD4 is not a standard assumption but is mild and very likely to hold: $\sup_{\boldsymbol{x}\in\mathcal{X}}|g(\boldsymbol{x}^{\intercal}\bar{\boldsymbol{\alpha}})|=O(1)$ according to Assumption REG, so we only need $g(\boldsymbol{x}^{\intercal}\widehat{\boldsymbol{\alpha}})-g(\boldsymbol{x}^{\intercal}\bar{\boldsymbol{\alpha}})$ to be $O_{\mathbb{P}}(1)$ uniformly.

Remark 10.

The ultra-sparsity assumption HD5 was also imposed in existing literature including Tan (2020a) and Dukes and Vansteelandt (2020), to control the rate of the bias incurred by the HD estimators, $s\log p/n$, below the parametric rate. For linear nuisance models, existing work such as Zhu et al. (2018) and Dukes and Vansteelandt (2020) suggested adding additional moment (KKT) constraints to relax the ultra-sparsity assumption. However, their approach has not yet been shown to be feasible for non-linear models, so although promising, it remains unclear whether it applies to our framework.

We present the asymptotic property of $\widehat{\beta}_{\sf HD}$ in Theorem 1, with its proof in Appendix A.

Theorem 1.

Denote $\bar{I}=\mathbb{E}\bar{\psi}(\boldsymbol{X})Ye^{-\beta_{0}A}A\{A-\bar{m}(\boldsymbol{X})\}$ and $\bar{\sigma}^{2}=\bar{I}^{-2}\mathbb{E}h^{2}(\boldsymbol{D};\beta_{0},\bar{\eta})$. Under Assumptions REG and HD1–HD5, we have

\sqrt{n}\bar{\sigma}^{-1}(\widehat{\beta}_{\sf HD}-\beta_{0})=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}(\bar{\sigma}\bar{I})^{-1}h(\boldsymbol{D}_{i};\beta_{0},\bar{\eta})+o_{\mathbb{P}}(1),

which converges weakly to ${\rm N}(0,1)$.
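In practice, the representation above suggests a simple plug-in standard error and Wald-type interval, sketched below with estimated nuisances substituted for their limits; this plug-in construction is our own illustration and differs from the Gaussian multiplier bootstrap used for the confidence intervals in Section 5.

\begin{verbatim}
import numpy as np
from scipy.stats import norm

def wald_ci(beta_hat, h_vals, dh_vals, level=0.95):
    """Plug-in Wald CI based on the influence-function expansion in Theorem 1.

    h_vals  : h(D_i; beta_hat, eta_hat) for each sample i.
    dh_vals : psi_hat(X_i) * Y_i * exp(-beta_hat * A_i) * A_i * (A_i - m_hat(X_i)),
              the empirical counterpart of I-bar (its sign is immaterial here).
    """
    n = len(h_vals)
    I_hat = np.mean(dh_vals)
    se = np.sqrt(np.mean(h_vals ** 2) / I_hat ** 2 / n)
    z = norm.ppf(0.5 + level / 2)
    return beta_hat - z * se, beta_hat + z * se
\end{verbatim}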

Recently, the logistic debiased LASSO (Van de Geer et al., 2014; Janková and Van De Geer, 2016) has been criticized because its sparse inverse information matrix condition is neither interpretable nor verifiable, leading to subpar performance both theoretically and numerically (Xia et al., 2020). Interestingly, we find that the model sparsity assumption of our method is more reasonable than that of the debiased LASSO, and we present a simple comparison of the two approaches in Remark 11.

Remark 11.

Assume the logistic model $\mathbb{P}(Y=1\mid A,\boldsymbol{X})={\rm expit}\{\beta_{0}A+\boldsymbol{X}^{\intercal}\boldsymbol{\gamma}_{0}\}$ is correctly specified. As argued by Xia et al. (2020), assuming that the information matrix of the logistic model has an ultra-sparse\footnote{Or approximately sparse; see recent work such as Belloni et al. (2018); Ma et al. (2020); Liu et al. (2020).} inverse is crucial to ensure the desirable parametric rate of the debiased LASSO estimator for $\beta_{0}$. However, this assumption is not interpretable or convincing for the common Gaussian design with a sparse precision matrix, due to the presence of the logistic canonical link. In comparison, we require that $\mathbb{E}(A\mid Y=0,\boldsymbol{X}=\boldsymbol{x})=g(\boldsymbol{x}^{\intercal}\boldsymbol{\alpha}_{0})$ with $\|\boldsymbol{\alpha}_{0}\|_{0}=o(n^{1/2}/\log p)$. This assumption has two advantages. First, it accommodates a nonlinear link function $g(\cdot)$ and can be more reasonable for a categorical $A$. Second, it is imposed on a conditional model directly and is thus more interpretable. For example, consider a conditional Gaussian model: $(A,\boldsymbol{X}^{\intercal})^{\intercal}\mid\{Y=j\}\sim\mathcal{N}(\boldsymbol{\mu}_{j},\boldsymbol{\Sigma})$ for $j=0,1$. Then $r_{0}(\boldsymbol{x})=\boldsymbol{x}^{\intercal}\boldsymbol{\gamma}_{0}$, where $(\beta_{0},\boldsymbol{\gamma}_{0}^{\intercal})^{\intercal}=\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{0})$, and $A\mid\boldsymbol{X},Y=0$ follows a Gaussian linear model with the coefficient $\boldsymbol{\alpha}_{0}$ determined by $\boldsymbol{\Sigma}^{-1}$. Therefore, our sparsity assumptions on $\boldsymbol{\alpha}_{0}$ and $\boldsymbol{\gamma}_{0}$ effectively assume the data generation parameters $\boldsymbol{\Sigma}^{-1}$ and $\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{0}$ to be sparse, which is more interpretable and verifiable in practice.

4.2 Machine learning (nonparametric) setting

Define $\|f(\cdot)\|_{Q,q}:=\|f(\boldsymbol{U})\|_{Q,q}:=\{\int|f(u)|^{q}dQ(u)\}^{1/q}$ for any real number $q>0$, function $f(\cdot)$, random variable $\boldsymbol{U}$ and probability measure $Q$. Let $P$ denote the probability measure of the observed $\boldsymbol{D}$. We assume that $K=\Theta(1)$ and introduce the following assumption.

Assumption ML1 (Quality of the ML nuisance estimators).

For each $k\in\{1,2,\ldots,K\}$,

\begin{aligned}
\sup_{\boldsymbol{x}\in\mathcal{X}}|\widehat{r}^{[\text{-}k]}(\boldsymbol{x})-r_{0}(\boldsymbol{x})|+|\widehat{m}^{[\text{-}k]}(\boldsymbol{x})-m_{0}(\boldsymbol{x})|&=o_{\mathbb{P}}(1);\\
\|\widehat{r}^{[\text{-}k]}(\cdot)-r_{0}(\cdot)\|_{P,2}+\|\widehat{m}^{[\text{-}k]}(\cdot)-m_{0}(\cdot)\|_{P,2}&=o_{\mathbb{P}}(n^{-1/4}).
\end{aligned}
Remark 12.

Similar to Assumptions 3.2 and 3.4 of Chernozhukov et al. (2018a), our Assumption ML1 requires that the ML estimators for $r_{0}(\cdot)$ and $m_{0}(\cdot)$ are uniformly consistent and that their mean squared errors (MSE) achieve the rate $o_{\mathbb{P}}(n^{-1/4})$. This assumption is also referred to as the rate double robustness property (Smucler et al., 2019), in that it requires the product of the MSEs of $\widehat{r}^{[\text{-}k]}(\cdot)$ and $\widehat{m}^{[\text{-}k]}(\cdot)$ to be $o_{\mathbb{P}}(n^{-1/2})$. In Appendix C, we provide justification for our proposed FMR procedure, showing that the resulting $\widehat{r}^{[\text{-}k]}(\cdot)$ satisfies Assumption ML1 as long as the learning algorithm $\mathscr{L}$ satisfies the same strong convergence properties as in Assumption ML1 on all the learning tasks in FMR. Thus, the use of the FMR procedure does not actually restrict the ML algorithms that can be used in our framework.

We present the asymptotic property of $\widehat{\beta}_{\sf ML}$ in Theorem 2, with its proof found in Appendix B.

Theorem 2.

Denote $I_{0}=\mathbb{E}\bar{\psi}(\boldsymbol{X})Ye^{-\beta_{0}A}A\{A-m_{0}(\boldsymbol{X})\}$ and $\sigma_{0}^{2}=I_{0}^{-2}\mathbb{E}h^{2}(\boldsymbol{D};\beta_{0},\eta_{0})$. Under Assumptions REG and ML1, we have

\sqrt{n}\sigma_{0}^{-1}(\widehat{\beta}_{\sf ML}-\beta_{0})=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}(\sigma_{0}I_{0})^{-1}h(\boldsymbol{D}_{i};\beta_{0},\eta_{0})+o_{\mathbb{P}}(1),

which converges weakly to ${\rm N}(0,1)$.

Remark 13.

Since $\bar{\psi}(\boldsymbol{x})={\rm expit}\{-r_{0}(\boldsymbol{x})\}$ and $\widehat{\psi}^{[\text{-}k]}(\boldsymbol{x})={\rm expit}\{-\widehat{r}^{[\text{-}k]}(\boldsymbol{x})\}$ in our ML case, one could show that $\widehat{\psi}^{[\text{-}k]}(\boldsymbol{x})$ achieves the same strong convergence and rate properties as $\widehat{r}^{[\text{-}k]}(\cdot)$ under Assumption ML1. More generally, uniform consistency of $\widehat{\psi}^{[\text{-}k]}(\cdot)$ is sufficient for the desired conclusion in Theorem 2, so our framework accommodates more flexible choices of $\psi(\boldsymbol{x})$, for example $\psi_{\rm opt}(\boldsymbol{x})$ as introduced in Section 3.3. We demonstrate this point in the proof in Appendix B.

5 Simulation study

We conduct simulation studies for the HD and ML settings separately in Sections 5.1 and 5.2, to study the point and interval estimation performance of our method.

5.1 High dimensional (parametric) setting

For the HD parametric setting, we design three data generation configurations, introduced as follows, to simulate different scenarios of model specification:

  1. (i)

    First, generate YY following P(Y=1)=1/2P(Y=1)=1/2. Then generate (A,𝑿){Y=j}𝒩(𝝁j,𝚺)(A,\boldsymbol{X}^{\intercal})^{\intercal}\mid\{Y=j\}\sim\mathcal{N}(\boldsymbol{\mu}_{j},\boldsymbol{\Sigma}) for j=0,1j=0,1. Specification of 𝝁j\boldsymbol{\mu}_{j} and 𝚺\boldsymbol{\Sigma} are presented in Appendix E such that β0=0.5\beta_{0}=0.5, r0(𝑿)=0.22(X1+X2)+0.08(X3+X4)r_{0}(\boldsymbol{X})=-0.22(X_{1}+X_{2})+0.08(X_{3}+X_{4}), and m0(𝑿)=0.13˙(X1+X2+X3+X4)m_{0}(\boldsymbol{X})=-0.1\dot{3}(X_{1}+X_{2}+X_{3}+X_{4}).

  2. (ii)

    Generate Y following P(Y=1)=1/2 and (A,\boldsymbol{X}^{\intercal})^{\intercal}\mid\{Y=j\}\sim\mathcal{N}(\boldsymbol{\mu}_{j},\boldsymbol{\Sigma}_{j}) for j=0,1. Specifications of \boldsymbol{\mu}_{j} and \boldsymbol{\Sigma}_{j} are presented in Appendix E such that \beta_{0}=0.5, r_{0}(\boldsymbol{X})=-0.22(X_{1}+X_{2})+0.08(X_{3}+X_{4})-0.15(X_{1}X_{2}+X_{1}X_{3}+X_{2}X_{3}), and m_{0}(\boldsymbol{X})=-0.1\dot{3}(X_{1}+X_{2}+X_{3}+X_{4}).

  3. (iii)

    First generate \boldsymbol{X}\sim\mathcal{N}(\mathbf{0},\boldsymbol{\Sigma}) with \boldsymbol{\Sigma}\in\mathbb{R}^{p\times p} given in Appendix E. Then we generate A given \boldsymbol{X} from a Gaussian linear model with unit variance and conditional mean

    E(A𝑿)=0.15X1+0.15X2+0.15X3+0.15X4+0.075X1X2+0.075X1X3+0.075X2X3.E(A\mid\boldsymbol{X})=0.15X_{1}+0.15X_{2}+0.15X_{3}+0.15X_{4}+0.075X_{1}X_{2}+0.075X_{1}X_{3}+0.075X_{2}X_{3}.

    Finally, generate YY by (Y=1A,𝑿)=expit(0.5A+0.25X1+0.25X2+0.1X3+0.1X4)\mathbb{P}(Y=1\mid A,\boldsymbol{X})={\rm expit}(0.5A+0.25X_{1}+0.25X_{2}+0.1X_{3}+0.1X_{4}).

We realize configurations (i)–(iii) with the sample size n=1000, 1500 or 2000 separately and the dimension of \boldsymbol{X} fixed as p=200. Under all these settings, we specify the nuisance models as r(\boldsymbol{X})=\boldsymbol{X}^{\intercal}\boldsymbol{\gamma} and m(\boldsymbol{X})=\boldsymbol{X}^{\intercal}\boldsymbol{\alpha}. Then both nuisance models are correctly specified under (i), only m(\boldsymbol{X}) is correctly specified under (ii), and only r(\boldsymbol{X}) is correct under (iii). Note that we could not extract the explicit form of m_{0}(\boldsymbol{X}) under (iii), since A is generated conditionally on \boldsymbol{X} rather than conditionally on \{Y=0\}. Still, we expect the linear model m(\boldsymbol{X})=\boldsymbol{X}^{\intercal}\boldsymbol{\alpha} to be misspecified under (iii), as E(A\mid\boldsymbol{X}) includes non-linear terms. Implementation details of our HD approach are presented in Appendix D. In particular, all the tuning parameters of the \ell_{1}-regularized regressions are selected by cross-validation over [0.2(\log p/n)^{1/2},2(\log p/n)^{1/2}]. We run each setting with 300 repeated simulations.
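To make configuration (iii) concrete, the following R sketch generates one such data set and builds the cross-validation grid for the \ell_{1} penalties. The equicorrelated \boldsymbol{\Sigma} (correlation 0.2) and the 20-point discretization of the grid are only assumed placeholders here, since the actual \boldsymbol{\Sigma} and tuning details are given in Appendices D and E.

set.seed(1)
n <- 1000; p <- 200
Sigma <- matrix(0.2, p, p); diag(Sigma) <- 1            # assumed placeholder; the actual Sigma is in Appendix E
X <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)         # rows of X ~ N(0, Sigma)
EA <- 0.15 * rowSums(X[, 1:4]) +
      0.075 * (X[, 1] * X[, 2] + X[, 1] * X[, 3] + X[, 2] * X[, 3])
A <- EA + rnorm(n)                                      # Gaussian linear model with unit variance
Y <- rbinom(n, 1, plogis(0.5 * A + 0.25 * X[, 1] + 0.25 * X[, 2] + 0.1 * X[, 3] + 0.1 * X[, 4]))
lambda_grid <- seq(0.2, 2, length.out = 20) * sqrt(log(p) / n)   # grid for the l1 penalty, assumed 20 points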

Table 1 evaluates the performance of our estimator \widehat{\beta}_{\sf\scriptscriptstyle HD} under all settings in terms of its mean square error (MSE), absolute bias, and coverage probability (CP) of the 95% confidence interval (CI) estimated using the Gaussian multiplier bootstrap. Under all settings, our method has low root-MSE and bias, respectively at most 18% and 7% of the magnitude of the true \beta_{0} (=0.5) when n=1000, and at most 12% and 4% of \beta_{0} when n=2000. As the sample size n grows, the MSE and bias of our estimator decay. In addition, under all settings our interval estimation has proper CP, lying within \pm 0.03 of the nominal level 0.95. Thus, our HD estimator performs steadily well under the different model specification scenarios as long as at least one of the nuisance models is correctly specified.

Table 1: Average mean square error (MSE), average absolute bias (Bias), and average coverage probability (CP) of the 95% CI of our HD estimator, with the sample size set as 1000, 1500 and 2000, under configurations (i)–(iii) described in Section 5.1. The number of repetitions for each setting is 300.
Configuration (i) Configuration (ii) Configuration (iii)
n 1000 1500 2000 1000 1500 2000 1000 1500 2000
MSE 0.007 0.006 0.004 0.007 0.006 0.004 0.008 0.004 0.003
Bias 0.024 0.016 0.017 0.032 0.014 0.020 0.034 0.016 0.012
CP 0.95 0.94 0.93 0.96 0.92 0.93 0.93 0.95 0.95

5.2 Machine learning (nonparametric) setting

To study our proposed method under the ML setting, we let \boldsymbol{\Sigma}\in\mathbb{R}^{p\times p} with \Sigma_{ii}=1 and \Sigma_{ij}=0.2 for i\neq j, generate \mathcal{N}(\boldsymbol{0},\boldsymbol{\Sigma}) random vectors, and truncate their entries to (-2,2) to obtain \boldsymbol{X}. We then generate A from the Gaussian model given \boldsymbol{X} with unit variance and conditional mean a_{0}(\boldsymbol{X})=\boldsymbol{\zeta}_{a}^{\intercal}f_{a}(\boldsymbol{X}), where f_{a}(\boldsymbol{x}) is a non-linear basis function of \boldsymbol{x} including various types of effects (interactions, indicators, trigonometric functions, etc.) defined in Appendix E, and \boldsymbol{\zeta}_{a} is its loading vector, also given in Appendix E. Finally, we set \beta_{0}=1, r_{0}(\boldsymbol{X})=\boldsymbol{\zeta}_{r}^{\intercal}f_{r}(\boldsymbol{X}) (see Appendix E), and generate Y with \mathbb{P}(Y=1\mid A,\boldsymbol{X})={\rm expit}\{\beta_{0}A+r_{0}(\boldsymbol{X})\}. We fix p=20 and consider n=1000 and n=2000 as two settings.
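As an illustration of this data generation in R (interpreting the truncation as clipping each entry to (-2,2), which is our assumption; f_{a}, \boldsymbol{\zeta}_{a}, f_{r}, \boldsymbol{\zeta}_{r} are specified in Appendix E and appear below only as placeholders):

p <- 20; n <- 1000
Sigma <- matrix(0.2, p, p); diag(Sigma) <- 1
Z <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)          # rows distributed as N(0, Sigma)
X <- pmin(pmax(Z, -2), 2)                                # truncate the entries to (-2, 2), our interpretation
A <- drop(f_a(X) %*% zeta_a) + rnorm(n)                  # placeholders f_a, zeta_a from Appendix E
Y <- rbinom(n, 1, plogis(A + drop(f_r(X) %*% zeta_r)))   # beta_0 = 1; placeholders f_r, zeta_r from Appendix E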

To estimate the nuisance function r_{0}(\boldsymbol{x}), we use the FMR procedure with (8) as its last step, and the number of folds for cross-fitting is set as K=5. For the choice of the learning algorithm \mathscr{L}, we consider four ML methods and a hybrid of them, introduced as follows.

  1. (a)

    Gradient boosted machines (GBM): an ensemble of classification and regression trees (CART) built by gradient boosting. Implemented by the R-package “gbm” (Greenwell et al.,, 2020).

  2. (b)

    Random forest (RF): an ensemble of CART with bagging. Implemented with the R package “randomForest” (Liaw and Wiener,, 2002).

  3. (c)

    Support vector machine (SVM): specified with linear kernel and implemented using R package “e1071” (Dimitriadou et al.,, 2004).

  4. (d)

    Neural network (NN): single hidden layer neural network implemented with R package “nnet” (Ripley et al.,, 2016).

  5. (e)

    Best nuisance models (Best): similar to Chernozhukov et al., 2018a , we use a simple hybrid method choosing the ML estimator for each nuisance component with the best prediction performance evaluated by cross-validated sum-squared loss.

All the above-mentioned ML algorithms are popular in the field of ML and have been considered in the DML literature (Chernozhukov et al., 2018a, ; Cui and Tchetgen,, 2019). Tuning parameters of the ML models, including the number of trees for GBM and RF, the margin parameter of SVM, and the number of units and the weight decay of NN, are selected using the resampling approach of the R package “caret” (Kuhn et al.,, 2020). We again run each setting of n with 300 repeated simulations.
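To fix ideas, the following R sketch outlines the cross-fitted estimation step used in this simulation, assuming a data set (Y, A, X) is already in memory. It uses a caret-tuned GBM for m_{0}(\boldsymbol{x})=E(A\mid\boldsymbol{X}=\boldsymbol{x},Y=0), while fmr_fit is a hypothetical placeholder for the FMR procedure of Section 4 producing the out-of-fold estimates of r_{0}(\boldsymbol{x}), and the root-search bracket is arbitrary. This is a schematic sketch of the pipeline, not the exact implementation used in the paper.

library(caret)
K <- 5
colnames(X) <- paste0("X", 1:ncol(X))                     # name the covariates for caret
folds <- sample(rep(1:K, length.out = n))                 # cross-fitting folds
rhat <- mhat <- numeric(n)
ctrl <- trainControl(method = "cv", number = 5)
for (k in 1:K) {
  te <- which(folds == k); tr <- which(folds != k)
  # m0(x) = E(A | X = x, Y = 0): fit the learner on the Y = 0 part of the training folds
  fit_m <- train(x = X[tr, ][Y[tr] == 0, ], y = A[tr][Y[tr] == 0],
                 method = "gbm", trControl = ctrl, verbose = FALSE)
  mhat[te] <- predict(fit_m, newdata = X[te, ])
  # r0(x): hypothetical placeholder for the FMR procedure (Section 4, ending with step (8))
  rhat[te] <- fmr_fit(X[tr, ], A[tr], Y[tr], newX = X[te, ])
}
psihat <- plogis(-rhat)                                   # psi(x) = expit{-r(x)}
score <- function(beta)
  mean(psihat * (Y * exp(-beta * A) - (1 - Y) * exp(rhat)) * (A - mhat))
beta_ml <- uniroot(score, interval = c(-5, 5))$root       # solve the orthogonal score equation in beta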

Table 2 presents the resulting average MSE, bias and CP of the 95% CI of the estimator \widehat{\beta}_{\sf\scriptscriptstyle ML} obtained with the five ML modelling strategies, for n=1000 and n=2000 separately. The five approaches have fairly consistent performance in terms of MSE, bias and CP under both settings, with the variation of their MSEs smaller than 0.002, demonstrating that the performance of our framework is robust to the choice of ML algorithm. To some degree, NN has the best performance (lowest bias and MSE) when n=1000, while GBM and Best perform best when n=2000. The interval estimators of all approaches achieve proper coverage rates.

Table 2: Average mean square error (MSE), average absolute bias (Bias), and average coverage probability (CP) of the 95% CI of our ML estimator, with sample sizes set as 1000 and 2000, and the nuisance models estimated by the four ML algorithms and the “Best” approach described in Section 5.2. The number of repetitions for each setting is 300.
n=1000 n=2000
GBM RF SVM NN Best GBM RF SVM NN Best
MSE 0.013 0.013 0.012 0.011 0.013 0.006 0.007 0.007 0.007 0.006
Bias 0.036 0.048 0.045 0.015 0.037 0.035 0.049 0.046 0.041 0.035
CP 0.93 0.94 0.92 0.95 0.93 0.94 0.94 0.92 0.94 0.94

6 A real example: effect of the EC pill on early gestation foetal death

In this section, we implement our proposed HD and ML methods to study the effect of the emergency contraceptive (EC) pill on the rates of new birth and early gestation foetal death (abortion), by revisiting the data of a quasi-experimental study based on the policy reform of the EC pill’s legislation in Chile (Bentancor and Clarke,, 2017). In their original study, the authors collected all records of birth and foetal death in Chile, as well as a number of municipality-level features (education, salary, healthcare, etc.) of women of reproductive age (15–34), in the years around 2008, during which the country was undergoing a reform of the legislation on EC pills. As a consequence of this reform, about half of the municipalities in Chile started to provide the EC pill freely in 2009, while in the remaining half the EC pill remained unavailable or restricted in use during that period. This policy was mostly dependent on political, economic and public health factors characterized by a total of 16 features (denoted as \boldsymbol{Z}) such as education spending, public health spending, condom use, and political conservativeness. Thus, the treatment of EC pill (A=1 for EC pill accessible; A=0 for EC pill not accessible) can be regarded as exogenous for the individuals.

Let Y^{\scriptscriptstyle(1)} denote the indicator of early gestation foetal death in each individual record, and let Y^{\scriptscriptstyle(2)} indicate a new birth (pregnant and did not incur foetal death). Assume that

(Y(1)=1A,𝒁)=expit{β0(1)A+r0(1)(𝒁)};\displaystyle\mathbb{P}(Y^{\scriptscriptstyle(1)}=1\mid A,\boldsymbol{Z})={\rm expit}\{\beta_{0}^{\scriptscriptstyle(1)}A+r_{0}^{\scriptscriptstyle(1)}(\boldsymbol{Z})\};
(Y(2)=1A,𝒁)=expit{β0(2)A+r0(2)(𝒁)},\displaystyle\mathbb{P}(Y^{\scriptscriptstyle(2)}=1\mid A,\boldsymbol{Z})={\rm expit}\{\beta_{0}^{\scriptscriptstyle(2)}A+r_{0}^{\scriptscriptstyle(2)}(\boldsymbol{Z})\},

where r_{0}^{\scriptscriptstyle(1)}(\cdot) and r_{0}^{\scriptscriptstyle(2)}(\cdot) are two unknown functions. We are interested in inferring the two parameters \beta_{0}^{\scriptscriptstyle(1)} and \beta_{0}^{\scriptscriptstyle(2)}, which characterize the log odds ratios (log-OR) of abortion (among the pregnant individuals) and of birth (among all individuals) with respect to the treatment of EC pill, respectively. To investigate \beta_{0}^{\scriptscriptstyle(1)}, we follow a similar strategy as Bentancor and Clarke, (2017) and focus on the individual records at ages between 15 and 25, for which early gestation foetal death can be viewed as a reasonable proxy for illegal abortion. Note that the prevalences of Y^{\scriptscriptstyle(1)} and Y^{\scriptscriptstyle(2)} in their corresponding populations are both less than 5\%, which could make the logistic models unstable to fit. So we randomly downsample the 0’s in both analyses to make the prevalences of Y^{\scriptscriptstyle(1)} and Y^{\scriptscriptstyle(2)} equal to 1/4. This procedure only changes the intercepts of the logistic models and does not affect the target parameters. The resulting data set for analyzing \beta_{0}^{\scriptscriptstyle(1)} (abortion) has n^{\scriptscriptstyle(1)}=5,824 samples, while we only take a subset with n^{\scriptscriptstyle(2)}=10,000 samples for \beta_{0}^{\scriptscriptstyle(2)} so that our algorithms do not require excessive computation time.
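To see why the downsampling only shifts the intercepts, let S=1 indicate that a record is retained, where records with Y^{\scriptscriptstyle(\ell)}=1 are always kept and records with Y^{\scriptscriptstyle(\ell)}=0 are kept independently of (A,\boldsymbol{Z}) with some probability \pi_{\ell}. A short calculation gives

\frac{\mathbb{P}(Y^{\scriptscriptstyle(\ell)}=1\mid A,\boldsymbol{Z},S=1)}{\mathbb{P}(Y^{\scriptscriptstyle(\ell)}=0\mid A,\boldsymbol{Z},S=1)}=\frac{\mathbb{P}(Y^{\scriptscriptstyle(\ell)}=1\mid A,\boldsymbol{Z})}{\pi_{\ell}\,\mathbb{P}(Y^{\scriptscriptstyle(\ell)}=0\mid A,\boldsymbol{Z})}=\frac{1}{\pi_{\ell}}\exp\{\beta_{0}^{\scriptscriptstyle(\ell)}A+r_{0}^{\scriptscriptstyle(\ell)}(\boldsymbol{Z})\},

so the retained records follow the same logistic partially linear model with r_{0}^{\scriptscriptstyle(\ell)}(\boldsymbol{Z}) shifted by \log(1/\pi_{\ell}) and \beta_{0}^{\scriptscriptstyle(\ell)} unchanged.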

For our HD approach, we let \boldsymbol{X} be the p=175 dimensional basis joining \boldsymbol{Z}, all the interaction terms of \boldsymbol{Z}, and the three-dimensional natural splines of all the continuous variables in \boldsymbol{Z}. We specify the nuisance functions as m(\boldsymbol{x})={\rm expit}(\boldsymbol{x}^{\intercal}\boldsymbol{\alpha}^{\scriptscriptstyle(\ell)}) and r(\boldsymbol{x})=\boldsymbol{x}^{\intercal}\boldsymbol{\gamma}^{\scriptscriptstyle(\ell)} for \ell=1,2. We take \boldsymbol{X}=\boldsymbol{Z} as the input covariates of our ML method. The choice and implementation of the ML algorithms are basically the same as in Section 5.2, except that we additionally introduce dropout111111A common and flexible technique in ML research used for regularization and avoiding over-fitting. Here we randomly and independently reset each entry of the training covariate matrix to a {\rm N}(0,1) variable with probability 0.4. for the two tree-based ML algorithms, GBM and RF, to avoid their over-fitting, since most covariates are at the municipality level while the records are at the individual level.
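A hedged R sketch of one way to assemble such a basis is given below (assuming \boldsymbol{Z} is stored as a data frame Z; the exact construction yielding p=175, including which columns are treated as continuous, follows the paper’s implementation rather than this sketch).

library(splines)
Zdf <- as.data.frame(Z)
inter <- model.matrix(~ .^2 - 1, data = Zdf)                    # main effects and all pairwise interactions
cont <- Zdf[, sapply(Zdf, function(z) is.numeric(z) && length(unique(z)) > 2), drop = FALSE]
spl <- do.call(cbind, lapply(cont, function(z) ns(z, df = 3)))  # 3-df natural splines of the continuous features
Xhd <- cbind(inter, spl)                                        # covariate basis used as X by the HD approach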

Tables 3 and 4 present the point estimates, 95% CI estimates and p-values of our HD and ML approaches for \beta^{\scriptscriptstyle(1)}_{0} and \beta^{\scriptscriptstyle(2)}_{0}, respectively. For \beta^{\scriptscriptstyle(1)}_{0}, the log-OR of early gestation foetal death with respect to the EC pill, the point estimates of all methods are negative and around -0.175\pm 0.03. Their interval estimates are also internally consistent, except that SVM outputs a relatively narrow CI and the CI of NN covers 0 near its upper bound; the latter occurs because NN outputs a slightly worse prediction model for A\mid\boldsymbol{X},Y=0. Accordingly, all methods except NN reject the null hypothesis \beta^{\scriptscriptstyle(1)}_{0}=0 at level 0.05. Note that the result of our hybrid method “Best” is consistent with HD, indicating that our methods under the two settings lead to basically the same conclusion. A similar situation occurs for the estimation of \beta^{\scriptscriptstyle(2)}_{0}: all methods reject the null at level 0.05 and their point and CI estimates are internally consistent (SVM shows a moderate deviation from the other methods).

Our analysis results reveal that distribution of the EC pill could significantly reduce the rates of illegal abortion (in the age group 15–25) and of new birth. This is consistent with the results of Bentancor and Clarke, (2017) obtained through their municipality-level analysis. Although the estimated effect sizes are on different scales121212Their effect is defined in a partially linear model of the abortion/birth rate against the treatment and control variables, while we measure the effect of the EC pill in a logistic model at the individual level. and thus not comparable across the two studies, our p-values appear to show more significance, in that nearly all of them are below 0.05 while their estimated p-values are between 0.05 and 0.1. This is because we use more complex and robust nuisance models to adjust for the confounding effects of \boldsymbol{Z} and perform our analysis at the individual level, which grants us larger sample sizes.

Table 3: Point estimates, estimated 95% CI lower/upper bounds (LB/UB) and (two-sided) p-values for \beta^{\scriptscriptstyle(1)}_{0} (log odds ratio of early gestation foetal death to the treatment of EC pill) of our HD and ML (with the five different realizations described in Section 5.2) approaches.
Method HD GBM SVM RF NN Best
β0^\hat{\beta_{0}} -0.175 -0.207 -0.144 -0.181 -0.154 -0.181
CI LB -0.343 -0.348 -0.267 -0.342 -0.356 -0.342
CI UB -0.004 -0.067 -0.020 -0.020 0.047 -0.020
pp-value 0.046 0.004 0.023 0.028 0.133 0.028
Table 4: Point estimates, estimated 95% CI lower/upper bounds (LB/UB) and (two-sided) p-values for \beta^{\scriptscriptstyle(2)}_{0} (log odds ratio of new birth to the treatment of EC pill) of our HD and ML (with the five different realizations described in Section 5.2) approaches.
Method HD GBM SVM RF NN Best
β0^\hat{\beta_{0}} -0.181 -0.132 -0.099 -0.150 -0.131 -0.150
CI LB -0.272 -0.235 -0.194 -0.279 -0.255 -0.279
CI UB -0.091 -0.030 -0.005 -0.021 -0.008 -0.021
pp-value 0.000 0.012 0.039 0.023 0.037 0.023

Acknowledgements

The authors thank their advisor, Tianxi Cai, for helpful discussions and comments on this paper.

References

  • Athey and Imbens, (2017) Athey, S. and Imbens, G. W. (2017). The state of applied econometrics: Causality and policy evaluation. Journal of Economic Perspectives, 31(2):3–32.
  • Belloni et al., (2018) Belloni, A., Chernozhukov, V., Chetverikov, D., Hansen, C., and Kato, K. (2018). High-dimensional econometrics and regularized GMM. arXiv preprint arXiv:1806.01888.
  • Bentancor and Clarke, (2017) Bentancor, A. and Clarke, D. (2017). Assessing plan B: The effect of the morning after pill on children and women. The Economic Journal, 127(607):2525–2552.
  • Bickel et al., (2009) Bickel, P. J., Ritov, Y., Tsybakov, A. B., et al. (2009). Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732.
  • Bradic et al., (2019) Bradic, J., Wager, S., and Zhu, Y. (2019). Sparsity double robust inference of average treatment effects. arXiv preprint arXiv:1905.00744.
  • Bühlmann and Van De Geer, (2011) Bühlmann, P. and Van De Geer, S. (2011). Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media.
  • Candes et al., (2007) Candes, E., Tao, T., et al. (2007). The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, 35(6):2313–2351.
  • Chen, (2007) Chen, H. Y. (2007). A semiparametric odds ratio model for measuring association. Biometrics, 63(2):413–421.
  • (9) Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018a). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68.
  • (10) Chernozhukov, V., Newey, W., Robins, J., and Singh, R. (2018b). Double/debiased machine learning of global and local parameters using regularized riesz representers. arXiv preprint arXiv:1802.08667.
  • (11) Chernozhukov, V., Newey, W. K., and Robins, J. (2018c). Double/debiased machine learning using regularized riesz representers. Technical report, cemmap working paper.
  • Colangelo and Lee, (2020) Colangelo, K. and Lee, Y.-Y. (2020). Double debiased machine learning nonparametric inference with continuous treatments. arXiv preprint arXiv:2004.03036.
  • Cui and Tchetgen, (2019) Cui, Y. and Tchetgen, E. T. (2019). Bias-aware model selection for machine learning of doubly robust functionals. arXiv preprint arXiv:1911.02029.
  • Dimitriadou et al., (2004) Dimitriadou, E., Hornik, K., Leisch, F., Meyer, D., and Weingessel, A. (2004). R package: e1071: Misc functions of the department of statistics.
  • Dukes and Vansteelandt, (2020) Dukes, O. and Vansteelandt, S. (2020). Inference on treatment effect parameters in potentially misspecified high-dimensional models. Biometrika.
  • Farrell et al., (2018) Farrell, M. H., Liang, T., and Misra, S. (2018). Deep neural networks for estimation and inference: Application to causal effects and other semiparametric estimands. arXiv preprint arXiv:1809.09953.
  • Friedman et al., (2010) Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22.
  • Ghosh and Tan, (2020) Ghosh, S. and Tan, Z. (2020). Doubly robust semiparametric inference using regularized calibrated estimation with high-dimensional data. arXiv preprint arXiv:2009.12033.
  • Giné and Nickl, (2016) Giné, E. and Nickl, R. (2016). Mathematical foundations of infinite-dimensional statistical models, volume 40. Cambridge University Press.
  • Greenwell et al., (2020) Greenwell, B., Boehmke, B., Cunningham, J., Developers, G., and Greenwell, M. B. (2020). Package ‘gbm’.
  • Janková and Van De Geer, (2016) Janková, J. and Van De Geer, S. (2016). Confidence regions for high-dimensional generalized linear models under sparsity. arXiv preprint arXiv:1610.01353.
  • Knaus, (2018) Knaus, M. C. (2018). A double machine learning approach to estimate the effects of musical practice on student’s skills. arXiv preprint arXiv:1805.10300.
  • Knaus, (2020) Knaus, M. C. (2020). Double machine learning based program evaluation under unconfoundedness. arXiv preprint arXiv:2003.03191.
  • Kuchibhotla and Chakrabortty, (2018) Kuchibhotla, A. K. and Chakrabortty, A. (2018). Moving beyond sub-gaussianity in high-dimensional statistics: Applications in covariance estimation and linear regression. arXiv preprint arXiv:1804.02605.
  • Kuhn et al., (2020) Kuhn, M., Wing, J., Weston, S., Williams, A., Keefer, C., Engelhardt, A., Cooper, T., Mayer, Z., Kenkel, B., Team, R. C., et al. (2020). Package ‘caret’. The R Journal.
  • Liaw and Wiener, (2002) Liaw, A. and Wiener, M. (2002). Classification and regression by randomforest. R News, 2(3):18–22.
  • Lin and Carroll, (2006) Lin, X. and Carroll, R. J. (2006). Semiparametric estimation in general repeated measures problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):69–88.
  • Liu et al., (2020) Liu, M., Xia, Y., Cho, K., and Cai, T. (2020). Integrative high dimensional multiple testing with heterogeneity under data sharing constraints. arXiv preprint arXiv:2004.00816.
  • Ma et al., (2020) Ma, R., Tony Cai, T., and Li, H. (2020). Global and simultaneous hypothesis testing for high-dimensional logistic regression models. Journal of the American Statistical Association, pages 1–15.
  • Negahban et al., (2012) Negahban, S. N., Ravikumar, P., Wainwright, M. J., Yu, B., et al. (2012). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4):538–557.
  • Ning et al., (2020) Ning, Y., Sida, P., and Imai, K. (2020). Robust estimation of causal effects via a high-dimensional covariate balancing propensity score. Biometrika, 107(3):533–554.
  • Oprescu et al., (2019) Oprescu, M., Syrgkanis, V., and Wu, Z. S. (2019). Orthogonal random forest for causal inference. In International Conference on Machine Learning, pages 4932–4941. PMLR.
  • Ripley et al., (2016) Ripley, B., Venables, W., and Ripley, M. B. (2016). Package ‘nnet’. R package version, 7:3–12.
  • Semenova and Chernozhukov, (2020) Semenova, V. and Chernozhukov, V. (2020). Debiased machine learning of conditional average treatment effects and other causal functions. The Econometrics Journal.
  • Severini and Staniswalis, (1994) Severini, T. A. and Staniswalis, J. G. (1994). Quasi-likelihood estimation in semiparametric models. Journal of the American Statistical Association, 89(426):501–511.
  • Smucler et al., (2019) Smucler, E., Rotnitzky, A., and Robins, J. M. (2019). A unifying approach for doubly-robust 1\ell_{1}-regularized estimation of causal contrasts. arXiv preprint arXiv:1904.03737.
  • (37) Tan, Z. (2019a). On doubly robust estimation for logistic partially linear models. Statistics &\& Probability Letters, 155:108577.
  • (38) Tan, Z. (2019b). RCAL: Regularized calibrated estimation. R package version 1.0.
  • (39) Tan, Z. (2020a). Model-assisted inference for treatment effects using regularized calibrated estimation with high-dimensional data. Annals of Statistics, 48(2):811–837.
  • (40) Tan, Z. (2020b). Regularized calibrated estimation of propensity scores with model misspecification and high-dimensional data. Biometrika, 107(1):137–158.
  • Tchetgen Tchetgen et al., (2010) Tchetgen Tchetgen, E. J., Robins, J. M., and Rotnitzky, A. (2010). On doubly robust estimation in a semiparametric odds ratio model. Biometrika, 97(1):171–180.
  • Van de Geer et al., (2014) Van de Geer, S., Bühlmann, P., Ritov, Y., Dezeure, R., et al. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3):1166–1202.
  • Van der Vaart, (2000) Van der Vaart, A. W. (2000). Asymptotic statistics, volume 3. Cambridge University Press.
  • Wager and Athey, (2018) Wager, S. and Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523):1228–1242.
  • Xia et al., (2020) Xia, L., Nan, B., and Li, Y. (2020). A revisit to debiased lasso for generalized linear models. arXiv preprint arXiv:2006.12778.
  • Yang et al., (2020) Yang, J.-C., Chuang, H.-C., and Kuan, C.-M. (2020). Double machine learning with gradient boosting and its application to the big nn audit quality effect. Journal of Econometrics.
  • Zhu et al., (2018) Zhu, Y., Bradic, J., et al. (2018). Significance testing in non-sparse high-dimensional linear models. Electronic Journal of Statistics, 12(2):3312–3364.
  • Zimmert and Lechner, (2019) Zimmert, M. and Lechner, M. (2019). Nonparametric estimation of causal heterogeneity under high-dimensional confounding. arXiv preprint arXiv:1908.08779.

Appendix

Appendix A Proof of Theorem 1

Proof.

By (5), we have

n1i=1nh(𝑫i;β^𝖧𝖣,η^)=n1i=1nψ^(𝑿i){Yieβ^𝖧𝖣Ai(1Yi)e𝑿i𝜸^}{Aig(𝑿i𝜶^)}=0.n^{-1}\sum_{i=1}^{n}h(\boldsymbol{D}_{i};\widehat{\beta}_{\sf\scriptscriptstyle HD},\widehat{\eta})=n^{-1}\sum_{i=1}^{n}\widehat{\psi}(\boldsymbol{X}_{i})\{Y_{i}e^{-\widehat{\beta}_{\sf\scriptscriptstyle HD}A_{i}}-(1-Y_{i})e^{\boldsymbol{X}_{i}^{\intercal}\widehat{\boldsymbol{\gamma}}}\}\{A_{i}-g(\boldsymbol{X}_{i}^{\intercal}\widehat{\boldsymbol{\alpha}})\}=0.

Our main task is to show that the approximation error n^{-1}\sum_{i=1}^{n}h(\boldsymbol{D}_{i};\widehat{\beta}_{\sf\scriptscriptstyle HD},\widehat{\eta})-h(\boldsymbol{D}_{i};\widehat{\beta}_{\sf\scriptscriptstyle HD},\bar{\eta}) is asymptotically negligible. Note that

n1i=1nh(𝑫i;β^𝖧𝖣,η^)h(𝑫i;β^𝖧𝖣,η¯)\displaystyle n^{-1}\sum_{i=1}^{n}h(\boldsymbol{D}_{i};\widehat{\beta}_{\sf\scriptscriptstyle HD},\widehat{\eta})-h(\boldsymbol{D}_{i};\widehat{\beta}_{\sf\scriptscriptstyle HD},\bar{\eta})
=\displaystyle= n1i=1nψ^(𝑿i)(1Yi){e𝑿i𝜸¯e𝑿i𝜸^}{Aig(𝑿i𝜶^)}\displaystyle n^{-1}\sum_{i=1}^{n}\widehat{\psi}(\boldsymbol{X}_{i})(1-Y_{i})\{e^{\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\gamma}}}-e^{\boldsymbol{X}_{i}^{\intercal}\widehat{\boldsymbol{\gamma}}}\}\{A_{i}-g(\boldsymbol{X}_{i}^{\intercal}\widehat{\boldsymbol{\alpha}})\}
+n1i=1nψ^(𝑿i){Yieβ^𝖧𝖣Ai(1Yi)e𝑿i𝜸¯}{g(𝑿i𝜶¯)g(𝑿i𝜶^)}\displaystyle+n^{-1}\sum_{i=1}^{n}\widehat{\psi}(\boldsymbol{X}_{i})\{Y_{i}e^{-\widehat{\beta}_{\sf\scriptscriptstyle HD}A_{i}}-(1-Y_{i})e^{\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\gamma}}}\}\{g(\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\alpha}})-g(\boldsymbol{X}_{i}^{\intercal}\widehat{\boldsymbol{\alpha}})\}
+n1i=1n{expit(𝑿i𝜸~)expit(𝑿i𝜸)}{Yieβ^𝖧𝖣Ai(1Yi)e𝑿i𝜸¯}{Aig(𝑿i𝜶¯)}\displaystyle+n^{-1}\sum_{i=1}^{n}\{{\rm expit}(-\boldsymbol{X}_{i}^{\intercal}\widetilde{\boldsymbol{\gamma}})-{\rm expit}(-\boldsymbol{X}_{i}^{\intercal}\boldsymbol{\gamma}^{*})\}\{Y_{i}e^{-\widehat{\beta}_{\sf\scriptscriptstyle HD}A_{i}}-(1-Y_{i})e^{\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\gamma}}}\}\{A_{i}-g(\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\alpha}})\}
=:\displaystyle=\mathrel{\mathop{\mathchar 58\relax}} Δ1+Δ2+Δ3.\displaystyle\Delta_{1}+\Delta_{2}+\Delta_{3}.

We handle the terms Δ1\Delta_{1}, Δ2\Delta_{2} and Δ3\Delta_{3} separately as follows. First, we have

Δ1=\displaystyle\Delta_{1}= n1i=1n(1Yi){ψ^(𝑿i)ψ¯(𝑿i)}e𝑿i𝜸¯{1e𝑿i(𝜸^𝜸¯)}{Aig(𝑿i𝜶¯)}\displaystyle n^{-1}\sum_{i=1}^{n}(1-Y_{i})\{\widehat{\psi}(\boldsymbol{X}_{i})-\bar{\psi}(\boldsymbol{X}_{i})\}e^{\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\gamma}}}\{1-e^{\boldsymbol{X}_{i}^{\intercal}(\widehat{\boldsymbol{\gamma}}-\bar{\boldsymbol{\gamma}})}\}\{A_{i}-g(\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\alpha}})\}
+n1i=1n(1Yi)ψ^(𝑿i)e𝑿i𝜸¯{1e𝑿i(𝜸^𝜸¯)}{g(𝑿i𝜶¯)g(𝑿i𝜶^)}\displaystyle+n^{-1}\sum_{i=1}^{n}(1-Y_{i})\widehat{\psi}(\boldsymbol{X}_{i})e^{\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\gamma}}}\{1-e^{\boldsymbol{X}_{i}^{\intercal}(\widehat{\boldsymbol{\gamma}}-\bar{\boldsymbol{\gamma}})}\}\{g(\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\alpha}})-g(\boldsymbol{X}_{i}^{\intercal}\widehat{\boldsymbol{\alpha}})\}
+n1i=1n(1Yi)ψ¯(𝑿i)e𝑿i𝜸¯{1e𝑿i(𝜸^𝜸¯)𝑿i(𝜸^𝜸¯)}{Aig(𝑿i𝜶^)}\displaystyle+n^{-1}\sum_{i=1}^{n}(1-Y_{i})\bar{\psi}(\boldsymbol{X}_{i})e^{\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\gamma}}}\{1-e^{\boldsymbol{X}_{i}^{\intercal}(\widehat{\boldsymbol{\gamma}}-\bar{\boldsymbol{\gamma}})}-\boldsymbol{X}_{i}^{\intercal}(\widehat{\boldsymbol{\gamma}}-\bar{\boldsymbol{\gamma}})\}\{A_{i}-g(\boldsymbol{X}_{i}^{\intercal}\widehat{\boldsymbol{\alpha}})\}
+n1i=1n(1Yi)ψ¯(𝑿i)e𝑿i𝜸¯{g(𝑿i𝜶¯)g(𝑿i𝜶^)}𝑿i(𝜸^𝜸¯)\displaystyle+n^{-1}\sum_{i=1}^{n}(1-Y_{i})\bar{\psi}(\boldsymbol{X}_{i})e^{\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\gamma}}}\{g(\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\alpha}})-g(\boldsymbol{X}_{i}^{\intercal}\widehat{\boldsymbol{\alpha}})\}\boldsymbol{X}_{i}^{\intercal}(\widehat{\boldsymbol{\gamma}}-\bar{\boldsymbol{\gamma}})
+n1i=1n(1Yi)ψ¯(𝑿i)e𝑿i𝜸¯{Aig(𝑿i𝜶¯)}𝑿i(𝜸^𝜸¯)\displaystyle+n^{-1}\sum_{i=1}^{n}(1-Y_{i})\bar{\psi}(\boldsymbol{X}_{i})e^{\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\gamma}}}\{A_{i}-g(\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\alpha}})\}\boldsymbol{X}_{i}^{\intercal}(\widehat{\boldsymbol{\gamma}}-\bar{\boldsymbol{\gamma}})
=:\displaystyle=\mathrel{\mathop{\mathchar 58\relax}} Δ11+Δ12+Δ13+Δ14+Δ15.\displaystyle\Delta_{11}+\Delta_{12}+\Delta_{13}+\Delta_{14}+\Delta_{15}.

As supi{1,,n}|𝑿i(𝜸^𝜸¯)|=O(1)\sup_{i\in\{1,\ldots,n\}}|\boldsymbol{X}_{i}^{\intercal}(\widehat{\boldsymbol{\gamma}}-\bar{\boldsymbol{\gamma}})|=O_{\mathbb{P}}(1) by Assumption HD4, there exists M1=O(1)M_{1}=O(1) such that with probability approaching 11,

|1e𝑿i(𝜸^𝜸¯)|M1|𝑿i(𝜸^𝜸¯)|,|1e𝑿i(𝜸^𝜸¯)𝑿i(𝜸^𝜸¯)|M1{𝑿i(𝜸^𝜸¯)}2;|1-e^{\boldsymbol{X}_{i}^{\intercal}(\widehat{\boldsymbol{\gamma}}-\bar{\boldsymbol{\gamma}})}|\leq M_{1}|\boldsymbol{X}_{i}^{\intercal}(\widehat{\boldsymbol{\gamma}}-\bar{\boldsymbol{\gamma}})|,\quad|1-e^{\boldsymbol{X}_{i}^{\intercal}(\widehat{\boldsymbol{\gamma}}-\bar{\boldsymbol{\gamma}})}-\boldsymbol{X}_{i}^{\intercal}(\widehat{\boldsymbol{\gamma}}-\bar{\boldsymbol{\gamma}})|\leq M_{1}\{\boldsymbol{X}_{i}^{\intercal}(\widehat{\boldsymbol{\gamma}}-\bar{\boldsymbol{\gamma}})\}^{2}; (A1)
|\widehat{\psi}(\boldsymbol{X}_{i})-\bar{\psi}(\boldsymbol{X}_{i})|=\frac{e^{\boldsymbol{X}_{i}^{\intercal}\boldsymbol{\gamma}^{*}}|1-e^{\boldsymbol{X}_{i}^{\intercal}(\widetilde{\boldsymbol{\gamma}}-\boldsymbol{\gamma}^{*})}|}{(1+e^{\boldsymbol{X}_{i}^{\intercal}\boldsymbol{\gamma}^{*}})(1+e^{\boldsymbol{X}_{i}^{\intercal}\widetilde{\boldsymbol{\gamma}}})}\leq M_{1}|\boldsymbol{X}_{i}^{\intercal}(\widetilde{\boldsymbol{\gamma}}-\boldsymbol{\gamma}^{*})|. (A2)

And by Assumptions REG and HD4, there exists M_{2}=\Theta(1) such that \sup_{i\in\{1,\ldots,n\}}|A_{i}-g(\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\alpha}})|+|A_{i}-g(\boldsymbol{X}_{i}^{\intercal}\widehat{\boldsymbol{\alpha}})|\leq M_{2}. Consequently, by Assumption HD4 and the boundedness of \psi(\cdot), we have

|Δ11|\displaystyle|\Delta_{11}|\leq n1i=1nM12M2e𝑿i𝜸¯|𝑿i(𝜸~𝜸)||𝑿i(𝜸^𝜸¯)|\displaystyle n^{-1}\sum_{i=1}^{n}M_{1}^{2}M_{2}e^{\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\gamma}}}|\boldsymbol{X}_{i}^{\intercal}(\widetilde{\boldsymbol{\gamma}}-\boldsymbol{\gamma}^{*})||\boldsymbol{X}_{i}^{\intercal}(\widehat{\boldsymbol{\gamma}}-\bar{\boldsymbol{\gamma}})|
\displaystyle\leq M12M2[n1i=1ne𝑿i𝜸¯{𝑿i(𝜸~𝜸)}2n1i=1ne𝑿i𝜸¯{𝑿i(𝜸^𝜸¯)}2]1/2=O(slogpn);\displaystyle M_{1}^{2}M_{2}\left[n^{-1}\sum_{i=1}^{n}e^{\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\gamma}}}\{\boldsymbol{X}_{i}^{\intercal}(\widetilde{\boldsymbol{\gamma}}-\boldsymbol{\gamma}^{*})\}^{2}\cdot n^{-1}\sum_{i=1}^{n}e^{\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\gamma}}}\{\boldsymbol{X}_{i}^{\intercal}(\widehat{\boldsymbol{\gamma}}-\bar{\boldsymbol{\gamma}})\}^{2}\right]^{1/2}=O_{\mathbb{P}}\left(\frac{s\log p}{n}\right);
|Δ12|\displaystyle|\Delta_{12}|\leq n1i=1nM1e𝑿i𝜸¯|𝑿i(𝜸^𝜸¯)||g(𝑿i𝜶¯)g(𝑿i𝜶^)|\displaystyle n^{-1}\sum_{i=1}^{n}M_{1}e^{\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\gamma}}}|\boldsymbol{X}_{i}^{\intercal}(\widehat{\boldsymbol{\gamma}}-\bar{\boldsymbol{\gamma}})||g(\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\alpha}})-g(\boldsymbol{X}_{i}^{\intercal}\widehat{\boldsymbol{\alpha}})|
\displaystyle\leq M1[n1i=1ne𝑿i𝜸¯{𝑿i(𝜸^𝜸¯)}2n1i=1ne𝑿i𝜸¯{g(𝑿i𝜶¯)g(𝑿i𝜶^)}2]1/2=O(slogpn);\displaystyle M_{1}\left[n^{-1}\sum_{i=1}^{n}e^{\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\gamma}}}\{\boldsymbol{X}_{i}^{\intercal}(\widehat{\boldsymbol{\gamma}}-\bar{\boldsymbol{\gamma}})\}^{2}\cdot n^{-1}\sum_{i=1}^{n}e^{\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\gamma}}}\{g(\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\alpha}})-g(\boldsymbol{X}_{i}^{\intercal}\widehat{\boldsymbol{\alpha}})\}^{2}\right]^{1/2}=O_{\mathbb{P}}\left(\frac{s\log p}{n}\right);
|Δ13|\displaystyle|\Delta_{13}|\leq n1i=1nM1M2e𝑿i𝜸¯{𝑿i(𝜸^𝜸¯)}2=O(slogpn);\displaystyle n^{-1}\sum_{i=1}^{n}M_{1}M_{2}e^{\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\gamma}}}\{\boldsymbol{X}_{i}^{\intercal}(\widehat{\boldsymbol{\gamma}}-\bar{\boldsymbol{\gamma}})\}^{2}=O_{\mathbb{P}}\left(\frac{s\log p}{n}\right);
|Δ14|\displaystyle|\Delta_{14}|\leq n1i=1ne𝑿i𝜸¯|𝑿i(𝜸^𝜸¯)||g(𝑿i𝜶¯)g(𝑿i𝜶^)|=O(slogpn),similar to |Δ12|.\displaystyle n^{-1}\sum_{i=1}^{n}e^{\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\gamma}}}|\boldsymbol{X}_{i}^{\intercal}(\widehat{\boldsymbol{\gamma}}-\bar{\boldsymbol{\gamma}})||g(\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\alpha}})-g(\boldsymbol{X}_{i}^{\intercal}\widehat{\boldsymbol{\alpha}})|=O_{\mathbb{P}}\left(\frac{s\log p}{n}\right),~{}\mbox{similar to }|\Delta_{12}|.

By Assumptions HD2 and HD4,

|Δ15|n1i=1n(1Yi)ψ¯(𝑿i)e𝑿i𝜸¯{Aig(𝑿i𝜶¯)}𝑿i𝜸^𝜸¯1=O(slogpn).|\Delta_{15}|\leq\left\|n^{-1}\sum_{i=1}^{n}(1-Y_{i})\bar{\psi}(\boldsymbol{X}_{i})e^{\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\gamma}}}\{A_{i}-g(\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\alpha}})\}\boldsymbol{X}_{i}\right\|_{\infty}\cdot\|\widehat{\boldsymbol{\gamma}}-\bar{\boldsymbol{\gamma}}\|_{1}=O_{\mathbb{P}}\left(\frac{s\log p}{n}\right).

Thus, we have |Δ1|=O(slogp/n)|\Delta_{1}|=O_{\mathbb{P}}(s\log p/n). For Δ2\Delta_{2}, we have

Δ2=\displaystyle\Delta_{2}= n1i=1nψ^(𝑿i)Yi(eβ^𝖧𝖣Aieβ0Ai){g(𝑿i𝜶¯)g(𝑿i𝜶^)}\displaystyle n^{-1}\sum_{i=1}^{n}\widehat{\psi}(\boldsymbol{X}_{i})Y_{i}(e^{-\widehat{\beta}_{\sf\scriptscriptstyle HD}A_{i}}-e^{-\beta_{0}A_{i}})\{g(\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\alpha}})-g(\boldsymbol{X}_{i}^{\intercal}\widehat{\boldsymbol{\alpha}})\}
+n1i=1n{ψ^(𝑿i)ψ¯(𝑿i)}{Yieβ0Ai(1Yi)e𝑿i𝜸¯}{g(𝑿i𝜶¯)g(𝑿i𝜶^)}\displaystyle+n^{-1}\sum_{i=1}^{n}\{\widehat{\psi}(\boldsymbol{X}_{i})-\bar{\psi}(\boldsymbol{X}_{i})\}\{Y_{i}e^{-\beta_{0}A_{i}}-(1-Y_{i})e^{\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\gamma}}}\}\{g(\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\alpha}})-g(\boldsymbol{X}_{i}^{\intercal}\widehat{\boldsymbol{\alpha}})\}
+n1i=1nψ¯(𝑿i){Yieβ0Ai(1Yi)e𝑿i𝜸¯}{g(𝑿i𝜶¯)g(𝑿i𝜶^)g(𝑿i𝜶¯)𝑿i(𝜶^𝜶¯)}\displaystyle+n^{-1}\sum_{i=1}^{n}\bar{\psi}(\boldsymbol{X}_{i})\{Y_{i}e^{-\beta_{0}A_{i}}-(1-Y_{i})e^{\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\gamma}}}\}\{g(\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\alpha}})-g(\boldsymbol{X}_{i}^{\intercal}\widehat{\boldsymbol{\alpha}})-g^{\prime}(\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\alpha}})\boldsymbol{X}_{i}^{\intercal}(\widehat{\boldsymbol{\alpha}}-\bar{\boldsymbol{\alpha}})\}
+n1i=1nψ¯(𝑿i){Yieβ0Ai(1Yi)e𝑿i𝜸¯}g(𝑿i𝜶¯)𝑿i(𝜶^𝜶¯)\displaystyle+n^{-1}\sum_{i=1}^{n}\bar{\psi}(\boldsymbol{X}_{i})\{Y_{i}e^{-\beta_{0}A_{i}}-(1-Y_{i})e^{\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\gamma}}}\}g^{\prime}(\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\alpha}})\boldsymbol{X}_{i}^{\intercal}(\widehat{\boldsymbol{\alpha}}-\bar{\boldsymbol{\alpha}})
=:\displaystyle=\mathrel{\mathop{\mathchar 58\relax}} Δ21+Δ22+Δ23+Δ24.\displaystyle\Delta_{21}+\Delta_{22}+\Delta_{23}+\Delta_{24}.

Again using Assumptions REG and HD4, there exists M3=Θ(1)M_{3}=\Theta(1) such that

|eβ^𝖧𝖣Aieβ0Ai|M3|β^𝖧𝖣β0|.|e^{-\widehat{\beta}_{\sf\scriptscriptstyle HD}A_{i}}-e^{-\beta_{0}A_{i}}|\leq M_{3}|\widehat{\beta}_{\sf\scriptscriptstyle HD}-\beta_{0}|. (A3)

And by Assumption HD3 and the mean value theorem, for each i, there exists t_{i} lying between \boldsymbol{X}_{i}^{\intercal}\widehat{\boldsymbol{\alpha}} and \boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\alpha}} such that

|g(𝑿i𝜶¯)g(𝑿i𝜶^)g(𝑿i𝜶¯)𝑿i(𝜶^𝜶¯)||g(𝑿i𝜶¯)g(ti)||𝑿i(𝜶^𝜶¯)|L{𝑿i(𝜶^𝜶¯)}2.|g(\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\alpha}})-g(\boldsymbol{X}_{i}^{\intercal}\widehat{\boldsymbol{\alpha}})-g^{\prime}(\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\alpha}})\boldsymbol{X}_{i}^{\intercal}(\widehat{\boldsymbol{\alpha}}-\bar{\boldsymbol{\alpha}})|\leq|g^{\prime}(\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\alpha}})-g^{\prime}(t_{i})||\boldsymbol{X}_{i}^{\intercal}(\widehat{\boldsymbol{\alpha}}-\bar{\boldsymbol{\alpha}})|\leq L\{\boldsymbol{X}_{i}^{\intercal}(\widehat{\boldsymbol{\alpha}}-\bar{\boldsymbol{\alpha}})\}^{2}. (A4)

These, combined with (A1), (A2) and Assumptions REG and HD4, lead to

|Δ21|\displaystyle|\Delta_{21}| =O(|β^𝖧𝖣β0|[n1i=1n{g(𝑿i𝜶¯)g(𝑿i𝜶^)}2]1/2)=O(slogpn);\displaystyle=O\left(|\widehat{\beta}_{\sf\scriptscriptstyle HD}-\beta_{0}|\left[n^{-1}\sum_{i=1}^{n}\{g(\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\alpha}})-g(\boldsymbol{X}_{i}^{\intercal}\widehat{\boldsymbol{\alpha}})\}^{2}\right]^{1/2}\right)=O_{\mathbb{P}}\left(\frac{s\log p}{n}\right);
|Δ22|\displaystyle|\Delta_{22}| =O([n1i=1n{1+e𝑿i𝜸¯}{ψ^(𝑿i)ψ¯(𝑿i)}2n1i=1n{1+e𝑿i𝜸¯}{g(𝑿i𝜶¯)g(𝑿i𝜶^)}2]1/2)\displaystyle=O\left(\left[n^{-1}\sum_{i=1}^{n}\{1+e^{\boldsymbol{X}_{i}\bar{\boldsymbol{\gamma}}}\}\{\widehat{\psi}(\boldsymbol{X}_{i})-\bar{\psi}(\boldsymbol{X}_{i})\}^{2}\cdot n^{-1}\sum_{i=1}^{n}\{1+e^{\boldsymbol{X}_{i}\bar{\boldsymbol{\gamma}}}\}\{g(\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\alpha}})-g(\boldsymbol{X}_{i}^{\intercal}\widehat{\boldsymbol{\alpha}})\}^{2}\right]^{1/2}\right)
=O(slogpn);\displaystyle=O_{\mathbb{P}}\left(\frac{s\log p}{n}\right);
|Δ23|\displaystyle|\Delta_{23}| =O(n1i=1n{1+e𝑿i𝜸¯}{𝑿i(𝜶^𝜶¯)}2)=O(slogpn).\displaystyle=O\left(n^{-1}\sum_{i=1}^{n}\{1+e^{\boldsymbol{X}_{i}\bar{\boldsymbol{\gamma}}}\}\{\boldsymbol{X}_{i}^{\intercal}(\widehat{\boldsymbol{\alpha}}-\bar{\boldsymbol{\alpha}})\}^{2}\right)=O_{\mathbb{P}}\left(\frac{s\log p}{n}\right).

And by Assumptions HD2 and HD4,

|Δ24|n1i=1nψ¯(𝑿i){Yieβ0Ai(1Yi)e𝑿i𝜸¯}g(𝑿i𝜶¯)𝑿i𝜶^𝜶¯1=O(slogpn).|\Delta_{24}|\leq\left\|n^{-1}\sum_{i=1}^{n}\bar{\psi}(\boldsymbol{X}_{i})\{Y_{i}e^{-\beta_{0}A_{i}}-(1-Y_{i})e^{\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\gamma}}}\}g^{\prime}(\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\alpha}})\boldsymbol{X}_{i}\right\|_{\infty}\cdot\|\widehat{\boldsymbol{\alpha}}-\bar{\boldsymbol{\alpha}}\|_{1}=O_{\mathbb{P}}\left(\frac{s\log p}{n}\right).

So we also have |Δ2|=O(slogp/n)|\Delta_{2}|=O_{\mathbb{P}}(s\log p/n). For Δ3\Delta_{3}, we have

Δ3=\displaystyle\Delta_{3}= n1i=1n{expit(𝑿i𝜸~)expit(𝑿i𝜸)}Yi(eβ^𝖧𝖣Aieβ0Ai){Aig(𝑿i𝜶¯)}\displaystyle n^{-1}\sum_{i=1}^{n}\{{\rm expit}(-\boldsymbol{X}_{i}^{\intercal}\widetilde{\boldsymbol{\gamma}})-{\rm expit}(-\boldsymbol{X}_{i}^{\intercal}\boldsymbol{\gamma}^{*})\}Y_{i}(e^{-\widehat{\beta}_{\sf\scriptscriptstyle HD}A_{i}}-e^{-\beta_{0}A_{i}})\{A_{i}-g(\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\alpha}})\}
+n1i=1n[{expit(𝑿i𝜸~)expit(𝑿i𝜸)expit(𝑿i𝜸)𝑿i(𝜸𝜸~)}\displaystyle+n^{-1}\sum_{i=1}^{n}\Big{[}\{{\rm expit}(-\boldsymbol{X}_{i}^{\intercal}\widetilde{\boldsymbol{\gamma}})-{\rm expit}(-\boldsymbol{X}_{i}^{\intercal}\boldsymbol{\gamma}^{*})-{\rm expit}^{\prime}(-\boldsymbol{X}_{i}^{\intercal}\boldsymbol{\gamma}^{*})\boldsymbol{X}_{i}^{\intercal}(\boldsymbol{\gamma}^{*}-\widetilde{\boldsymbol{\gamma}})\}
{Yieβ0Ai(1Yi)e𝑿i𝜸¯}{Aig(𝑿i𝜶¯)}]\displaystyle\quad\quad\quad\quad~{}\cdot\{Y_{i}e^{-\beta_{0}A_{i}}-(1-Y_{i})e^{\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\gamma}}}\}\{A_{i}-g(\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\alpha}})\}\Big{]}
+n1i=1nexpit(𝑿i𝜸)𝑿i(𝜸𝜸~){Yieβ0Ai(1Yi)e𝑿i𝜸¯}{Aig(𝑿i𝜶¯)}\displaystyle+n^{-1}\sum_{i=1}^{n}{\rm expit}^{\prime}(-\boldsymbol{X}_{i}^{\intercal}\boldsymbol{\gamma}^{*})\boldsymbol{X}_{i}^{\intercal}(\boldsymbol{\gamma}^{*}-\widetilde{\boldsymbol{\gamma}})\{Y_{i}e^{-\beta_{0}A_{i}}-(1-Y_{i})e^{\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\gamma}}}\}\{A_{i}-g(\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\alpha}})\}
=:\displaystyle=\mathrel{\mathop{\mathchar 58\relax}} Δ31+Δ32+Δ33.\displaystyle\Delta_{31}+\Delta_{32}+\Delta_{33}.

Again using the mean value theorem and the fact that |expit(u)expit(v)||uv||{\rm expit}^{\prime}(u)-{\rm expit}^{\prime}(v)|\leq|u-v| for any u,vu,v\in\mathbb{R} (by |expit′′()|1|{\rm expit}^{\prime\prime}(\cdot)|\leq 1), we have

|expit(𝑿i𝜸~)expit(𝑿i𝜸)expit(𝑿i𝜸)𝑿i(𝜸𝜸~)|{𝑿i(𝜸𝜸~)}2.|{\rm expit}(-\boldsymbol{X}_{i}^{\intercal}\widetilde{\boldsymbol{\gamma}})-{\rm expit}(-\boldsymbol{X}_{i}^{\intercal}\boldsymbol{\gamma}^{*})-{\rm expit}^{\prime}(-\boldsymbol{X}_{i}^{\intercal}\boldsymbol{\gamma}^{*})\boldsymbol{X}_{i}^{\intercal}(\boldsymbol{\gamma}^{*}-\widetilde{\boldsymbol{\gamma}})|\leq\{\boldsymbol{X}_{i}^{\intercal}(\boldsymbol{\gamma}^{*}-\widetilde{\boldsymbol{\gamma}})\}^{2}.

Then by (A2), (A3), and Assumptions REG and HD4, we have

|Δ31|=\displaystyle|\Delta_{31}|= O(|β^𝖧𝖣β0|[n1i=1n{𝑿i(𝜸~𝜸)}2]1/2)=O(slogpn);\displaystyle O\left(|\widehat{\beta}_{\sf\scriptscriptstyle HD}-\beta_{0}|\left[n^{-1}\sum_{i=1}^{n}\{\boldsymbol{X}_{i}^{\intercal}(\widetilde{\boldsymbol{\gamma}}-\boldsymbol{\gamma}^{*})\}^{2}\right]^{1/2}\right)=O_{\mathbb{P}}\left(\frac{s\log p}{n}\right);
|Δ32|=\displaystyle|\Delta_{32}|= O(n1i=1n{1+e𝑿i𝜸¯}{𝑿i(𝜸~𝜸)}2)=O(slogpn).\displaystyle O\left(n^{-1}\sum_{i=1}^{n}\{1+e^{\boldsymbol{X}_{i}\bar{\boldsymbol{\gamma}}}\}\{\boldsymbol{X}_{i}^{\intercal}(\widetilde{\boldsymbol{\gamma}}-\boldsymbol{\gamma}^{*})\}^{2}\right)=O_{\mathbb{P}}\left(\frac{s\log p}{n}\right).

And by Assumptions HD2 and HD4,

|Δ33|n1i=1nexpit(𝑿i𝜸){Yieβ0Ai(1Yi)e𝑿i𝜸¯}{Aig(𝑿i𝜶¯)}𝑿i𝜸𝜸~1=O(slogpn).|\Delta_{33}|\leq\left\|n^{-1}\sum_{i=1}^{n}{\rm expit}^{\prime}(-\boldsymbol{X}_{i}^{\intercal}\boldsymbol{\gamma}^{*})\{Y_{i}e^{-\beta_{0}A_{i}}-(1-Y_{i})e^{\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\gamma}}}\}\{A_{i}-g(\boldsymbol{X}_{i}^{\intercal}\bar{\boldsymbol{\alpha}})\}\boldsymbol{X}_{i}\right\|_{\infty}\|\boldsymbol{\gamma}^{*}-\widetilde{\boldsymbol{\gamma}}\|_{1}=O_{\mathbb{P}}\left(\frac{s\log p}{n}\right).

Thus, we have |Δ3|=O(slogp/n)|\Delta_{3}|=O_{\mathbb{P}}(s\log p/n), and by Assumption HD5,

n1i=1nh(𝑫i;β^𝖧𝖣,η^)h(𝑫i;β^𝖧𝖣,η¯)=O(slogpn)=o(1n),n^{-1}\sum_{i=1}^{n}h(\boldsymbol{D}_{i};\widehat{\beta}_{\sf\scriptscriptstyle HD},\widehat{\eta})-h(\boldsymbol{D}_{i};\widehat{\beta}_{\sf\scriptscriptstyle HD},\bar{\eta})=O_{\mathbb{P}}\left(\frac{s\log p}{n}\right)=o_{\mathbb{P}}\left(\frac{1}{\sqrt{n}}\right),

which leads to

n1i=1nh(𝑫i;β^𝖧𝖣,η¯)+o(1n)=0.n^{-1}\sum_{i=1}^{n}h(\boldsymbol{D}_{i};\widehat{\beta}_{\sf\scriptscriptstyle HD},\bar{\eta})+o_{\mathbb{P}}\left(\frac{1}{\sqrt{n}}\right)=0.

This, combined with the fact from Remark 1 that \mathbb{E}h(\boldsymbol{D};\beta_{0},\bar{\eta})=0 under Assumption HD1, the regularity Assumption REG, and Theorem 5.21 of Van der Vaart, (2000), completes the proof of Theorem 1.

Appendix B Proof of Theorem 2

Following the general results of the DML estimator with non-linear Neyman orthogonal score presented in Section 3.3 and Theorem 3.3 of Chernozhukov et al., 2018a , we only need to verify their Assumptions 3.3 and 3.4 on our orthogonal score function h(𝑫;β,η)h(\boldsymbol{D};\beta,\eta), specifically, Conditions A1 and A2 presented as follows.

Condition A1 (Moment condition with Neyman orthogonality).

It holds that: (a) \mathbb{E}h(\boldsymbol{D};\beta_{0},\eta_{0})=0 and \mathcal{B} contains an interval of length \Theta(n^{-1/2}\log n) centred at \beta_{0}; (b) the map (\beta,\eta)\mapsto\mathbb{E}h(\boldsymbol{D};\beta,\eta) is twice continuously Gateaux-differentiable; (c) |\mathbb{E}h(\boldsymbol{D};\beta,\eta_{0})|\geq\min\{|J_{0}(\beta-\beta_{0})|,c_{0}\}, where \eta_{0}=\{r_{0}(\cdot),m_{0}(\cdot),\bar{\psi}\}, c_{0}=\Theta(1) and J_{0}=\partial_{\beta}\mathbb{E}h(\boldsymbol{D};\beta,\eta_{0})|_{\beta=\beta_{0}}=\Theta(1); (d) h(\boldsymbol{D};\beta_{0},\eta_{0}) obeys Neyman orthogonality, i.e. \partial_{\eta}\mathbb{E}h(\boldsymbol{D};\beta_{0},\eta_{0})[\eta-\eta_{0}]=0 for all \eta\in\mathcal{E}, where the parameter space of \eta satisfies \mathcal{E}\subseteq\{\eta:\mathbb{E}|h(\boldsymbol{D};\beta_{0},\eta_{0})[\eta-\eta_{0}]|<\infty\}.

Condition A2 (Regularity of the score and quality of the nuisance estimators).

It holds that: (a) \widehat{\eta}^{[\text{-}k]} belongs to the realization set \mathcal{T}_{n} for each k\in\{1,2,\ldots,K\} with probability approaching 1, where \mathcal{T}_{n} satisfies \eta_{0}\in\mathcal{T}_{n} and the conditions given as follows; (b) the parameter space \mathcal{B} of \beta is bounded and, for each \eta\in\mathcal{T}_{n}, the functional space \mathcal{F}_{\eta}=\{h(\cdot;\beta,\eta):\beta\in\mathcal{B}\} is measurable and its uniform covering number satisfies: there exist positive constants R=\Theta(1) and \nu=\Theta(1) such that

supQlog𝒩(ϵFη()Q,2,η,Q,2)νlog(R/ϵ),ϵ(0,1],\sup_{Q}\log\mathcal{N}(\epsilon\|F_{\eta}(\cdot)\|_{Q,2},\mathcal{F}_{\eta},\|\cdot\|_{Q,2})\leq\nu\log(R/\epsilon),\quad\forall\epsilon\in(0,1],

where Fη()F_{\eta}(\cdot) is a measurable envelope function for η\mathcal{F}_{\eta}: Fη(𝐃)|h(𝐃;β,η)|F_{\eta}(\boldsymbol{D})\geq|h(\boldsymbol{D};\beta,\eta)| for all 𝐃\boldsymbol{D} and β\beta\in\mathcal{B}, and there exists q>2q>2 such that Fη()P,q=O(1)\|F_{\eta}(\cdot)\|_{P,q}=O(1); (c) there exists sequence τn\tau_{n}:

supη={r,m,ψ}𝒯n,β|𝔼h(𝑫;β,η)𝔼h(𝑫;β,{r0,m0,ψ})|=o(τn),\displaystyle\sup_{\eta=\{r,m,\psi\}\in\mathcal{T}_{n},\beta\in\mathcal{B}}|\mathbb{E}h(\boldsymbol{D};\beta,\eta)-\mathbb{E}h(\boldsymbol{D};\beta,\{r_{0},m_{0},\psi\})|=o(\tau_{n}),
supη𝒯n,|ββ0|τn𝔼[h(𝑫;β,η)h(𝑫;β0,{r0,m0,ψ})]2+𝔼[h(𝑫;β0,{r0,m0,ψ})h(𝑫;β0,η0)]2=o(1),\displaystyle\sup_{\eta\in\mathcal{T}_{n},|\beta-\beta_{0}|\leq\tau_{n}}\mathbb{E}[h(\boldsymbol{D};\beta,\eta)-h(\boldsymbol{D};\beta_{0},\{r_{0},m_{0},\psi\})]^{2}+\mathbb{E}[h(\boldsymbol{D};\beta_{0},\{r_{0},m_{0},\psi\})-h(\boldsymbol{D};\beta_{0},\eta_{0})]^{2}=o(1),
supr(0,1),η𝒯n,|ββ0|τn|r2𝔼h{𝑫;β0+r(ββ0),η0+r(η{r0(),m0(),ψ})}|=o(n1/2);\displaystyle\sup_{r\in(0,1),\eta\in\mathcal{T}_{n},|\beta-\beta_{0}|\leq\tau_{n}}|\partial_{r}^{2}\mathbb{E}h\{\boldsymbol{D};\beta_{0}+r(\beta-\beta_{0}),\eta_{0}+r(\eta-\{r_{0}(\cdot),m_{0}(\cdot),\psi\})\}|=o(n^{-1/2});

and (d) 𝔼h2(𝐃;β0,η0)=Θ(1)\mathbb{E}h^{2}(\boldsymbol{D};\beta_{0},\eta_{0})=\Theta(1).

We make some simplifications and adaptations of the original assumptions in Chernozhukov et al., 2018a to form Conditions A1 and A2, according to our own setting. The only non-trivial change made here is that in Condition A2 (c), we require

supη={r,m,ψ}𝒯n,β|𝔼h(𝑫;β,η)𝔼h(𝑫;β,{r0,m0,ψ})|=o(τn),supr(0,1),η𝒯n,|ββ0|τn|r2𝔼h{𝑫;β0+r(ββ0),η0+r(η{r0(),m0(),ψ})}|=o(n1/2);\begin{split}&\sup_{\eta=\{r,m,\psi\}\in\mathcal{T}_{n},\beta\in\mathcal{B}}|\mathbb{E}h(\boldsymbol{D};\beta,\eta)-\mathbb{E}h(\boldsymbol{D};\beta,\{r_{0},m_{0},\psi\})|=o(\tau_{n}),\\ &\sup_{r\in(0,1),\eta\in\mathcal{T}_{n},|\beta-\beta_{0}|\leq\tau_{n}}|\partial_{r}^{2}\mathbb{E}h\{\boldsymbol{D};\beta_{0}+r(\beta-\beta_{0}),\eta_{0}+r(\eta-\{r_{0}(\cdot),m_{0}(\cdot),\psi\})\}|=o(n^{-1/2});\end{split} (A5)

instead of:

supη𝒯n,β|𝔼h(𝑫;β,η)𝔼h(𝑫;β,η0)|=o(τn),\displaystyle\sup_{\eta\in\mathcal{T}_{n},\beta\in\mathcal{B}}|\mathbb{E}h(\boldsymbol{D};\beta,\eta)-\mathbb{E}h(\boldsymbol{D};\beta,\eta_{0})|=o(\tau_{n}),
supr(0,1),η𝒯n,|ββ0|τn|r2𝔼h{𝑫;β0+r(ββ0),η0+r(ηη0)}|=o(n1/2),\displaystyle\sup_{r\in(0,1),\eta\in\mathcal{T}_{n},|\beta-\beta_{0}|\leq\tau_{n}}|\partial_{r}^{2}\mathbb{E}h\{\boldsymbol{D};\beta_{0}+r(\beta-\beta_{0}),\eta_{0}+r(\eta-\eta_{0})\}|=o(n^{-1/2}),

as used in Assumption 3.4 (c) of Chernozhukov et al., 2018a . The first inequality of (A5) is used by Chernozhukov et al., 2018a to derive a preliminary rate for the DML estimator: |\widehat{\beta}_{\sf\scriptscriptstyle ML}-\beta_{0}|=o_{\mathbb{P}}(\tau_{n}) (see Step 1 of their proof of Lemma 6.3), and the second inequality of (A5) is used in Step 3 of their proof of Lemma 6.3 to control the second order error of

n1k=1Kikh(𝑫i;β0,η0)h(𝑫i;β,η^[-k]),n^{-1}\sum_{k=1}^{K}\sum_{i\in\mathcal{I}_{k}}h(\boldsymbol{D}_{i};\beta_{0},\eta_{0})-h(\boldsymbol{D}_{i};\beta,\widehat{\eta}^{[\text{-}k]}),

uniformly for β\beta satisfying |ββ0|τn|\beta-\beta_{0}|\leq\tau_{n}.

Note that our two modified assumptions are still sufficient for deriving these results. Since \mathbb{E}h(\boldsymbol{D};\beta_{0},\{r_{0},m_{0},\psi\})=0 holds for all \psi (see our Remark 1), there is actually no need to consider \mathbb{E}h(\boldsymbol{D};\beta,\{r_{0},m_{0},\psi\})-\mathbb{E}h(\boldsymbol{D};\beta,\{r_{0},m_{0},\psi_{0}\}) when deriving |\widehat{\beta}_{\sf\scriptscriptstyle ML}-\beta_{0}|=o_{\mathbb{P}}(\tau_{n}). For Step 3 of Chernozhukov et al., 2018a , one can instead handle the second order error of

n1k=1Kikh(𝑫i;β0,{r0,m0,ψ^[-k]})h(𝑫i;β,η^[-k]),n^{-1}\sum_{k=1}^{K}\sum_{i\in\mathcal{I}_{k}}h(\boldsymbol{D}_{i};\beta_{0},\{r_{0},m_{0},\widehat{\psi}^{[\text{-}k]}\})-h(\boldsymbol{D}_{i};\beta,\widehat{\eta}^{[\text{-}k]}),

again using that 𝔼h(𝑫;β0,{r0,m0,ψ})=0\mathbb{E}h(\boldsymbol{D};\beta_{0},\{r_{0},m_{0},\psi\})=0 holds for all ψ\psi, and then remove the remaining error:

n1k=1Kikh(𝑫i;β0,{r0,m0,ψ0})h(𝑫i;β0,{r0,m0,ψ^[-k]}),n^{-1}\sum_{k=1}^{K}\sum_{i\in\mathcal{I}_{k}}h(\boldsymbol{D}_{i};\beta_{0},\{r_{0},m_{0},\psi_{0}\})-h(\boldsymbol{D}_{i};\beta_{0},\{r_{0},m_{0},\widehat{\psi}^{[\text{-}k]}\}),

through concentration, based on the fact that \partial_{\psi}\mathbb{E}h(\boldsymbol{D};\beta_{0},\eta_{0})[\psi-\psi_{0}]=0 and the second inequality of Condition A2 (c): \sup_{\eta\in\mathcal{T}_{n},|\beta-\beta_{0}|\leq\tau_{n}}\mathbb{E}[h(\boldsymbol{D};\beta_{0},\{r_{0},m_{0},\psi\})-h(\boldsymbol{D};\beta_{0},\eta_{0})]^{2}=o(1).

On the other hand, this modification essentially reduces our requirement on the quality of \widehat{\psi}^{[\text{-}k]}(\cdot), as mentioned in Remark 13. As will be seen from the proof below, to fulfill the modified Condition A2 (c), we do not require \|\widehat{\psi}^{[\text{-}k]}(\cdot)-\bar{\psi}(\cdot)\|_{P,2}=o_{\mathbb{P}}(\tau_{n}), as one would when following the original version of Chernozhukov et al., 2018a , but only need \widehat{\psi}^{[\text{-}k]}(\cdot) to be uniformly consistent (though the former may still be justifiable in our case since we take \bar{\psi}(\boldsymbol{x})={\rm expit}\{-\bar{r}(\boldsymbol{x})\}). We now verify Conditions A1 and A2 based on our Assumptions REG and ML1.

Proof.

Condition A1 (a) is directly given by our logistic partially linear model assumption (1). Condition A1 (b) is naturally satisfied since h(\boldsymbol{D};\beta,\eta_{0}+r(\eta-\eta_{0})) is twice continuously differentiable in (\beta,r). Condition A1 (c) is directly given by Assumption REG. And Condition A1 (d) holds by equation (3), combined with the model assumption (1) and \mathbb{E}[A\mid\boldsymbol{X}=\boldsymbol{x},Y=0]=m_{0}(\boldsymbol{x}).

By Assumption ML1 and \bar{\psi}(\boldsymbol{x})={\rm expit}(-\bar{r}(\boldsymbol{x})), there exist \zeta_{1,n}=o(1) and \zeta_{2,n}=o(n^{-1/4}) such that \widehat{\eta}^{[\text{-}k]} belongs to

𝒯n=:{η=(r,m,ψ):\displaystyle\mathcal{T}_{n}=\mathrel{\mathop{\mathchar 58\relax}}\Big{\{}\eta=(r,m,\psi)\mathrel{\mathop{\mathchar 58\relax}} sup𝒙𝒳|ψ(𝒙)ψ¯(𝒙)|+|r(𝒙)r0(𝒙)|+|m(𝒙)m0(𝒙)|ζ1,n,\displaystyle\sup_{\boldsymbol{x}\in\mathcal{X}}|\psi(\boldsymbol{x})-\bar{\psi}(\boldsymbol{x})|+|r(\boldsymbol{x})-r_{0}(\boldsymbol{x})|+|m(\boldsymbol{x})-m_{0}(\boldsymbol{x})|\leq\zeta_{1,n},
andr()r0()P,2+m()m0()P,2ζ2,n},\displaystyle\mbox{and}\quad\|r(\cdot)-r_{0}(\cdot)\|_{P,2}+\|m(\cdot)-m_{0}(\cdot)\|_{P,2}\leq\zeta_{2,n}\Big{\}},

with probability approaching 1, for each k\in\{1,\ldots,K\}. We define \mathcal{T}_{n} in Condition A2 in this way so that A2 (a) is satisfied. Next we validate Condition A2 (b). Since A and \beta belong to compact sets by Assumption REG, and |m(\boldsymbol{x})|\leq|m_{0}(\boldsymbol{x})|+o(1) is uniformly bounded for \eta=\{r,m,\psi\}\in\mathcal{T}_{n}, there exists a positive C_{1}=\Theta(1) such that for \eta=\{r,m,\psi\}\in\mathcal{T}_{n},

h(𝑫;β,η)=\displaystyle h(\boldsymbol{D};\beta,\eta)= ψ(𝑿){YeβA(1Y)er(𝑿)}{Am(𝑿)}\displaystyle\psi(\boldsymbol{X})\{Ye^{-\beta A}-(1-Y)e^{r(\boldsymbol{X})}\}\{A-m(\boldsymbol{X})\}
\displaystyle\leq |ψ(𝑿)YeβA{Am(𝑿)}|+|ψ(𝑿)er(𝑿)(1Y){Am(𝑿)}|\displaystyle|\psi(\boldsymbol{X})Ye^{-\beta A}\{A-m(\boldsymbol{X})\}|+|\psi(\boldsymbol{X})e^{r(\boldsymbol{X})}(1-Y)\{A-m(\boldsymbol{X})\}|
    \displaystyle\leq C_{1}\{\psi(\boldsymbol{X})+\psi(\boldsymbol{X})e^{r(\boldsymbol{X})}\}=C_{1}\{{\rm expit}(-r(\boldsymbol{X}))+{\rm expit}(r(\boldsymbol{X}))\}\leq C_{1}+1;
βh(𝑫;β,η)P,22=\displaystyle\|\partial_{\beta}h(\boldsymbol{D};\beta,\eta)\|_{P,2}^{2}= 𝔼[ψ(𝑿)YeβAA{Am(𝑿)}]2C1.\displaystyle\mathbb{E}[\psi(\boldsymbol{X})Ye^{-\beta A}A\{A-m(\boldsymbol{X})\}]^{2}\leq C_{1}.

Then by Example 19.7 of Van der Vaart, (2000), Condition A2 (b) holds with \nu=1 and R being the diameter of \mathcal{B}. Note that Condition A2 (d) is again directly given by Assumption REG. It remains to verify Condition A2 (c). For each \eta=\{r,m,\psi\}\in\mathcal{T}_{n} and \beta\in\mathcal{B}, using the boundedness of \beta, A, m_{0}(\boldsymbol{x}), \psi(\boldsymbol{x}) and \psi(\boldsymbol{x})e^{r(\boldsymbol{x})}, there exists C_{2}=\Theta(1) such that

|𝔼h(𝑫;β,η)𝔼h(𝑫;β,{r0,m0,ψ})|\displaystyle|\mathbb{E}h(\boldsymbol{D};\beta,\eta)-\mathbb{E}h(\boldsymbol{D};\beta,\{r_{0},m_{0},\psi\})|
\displaystyle\leq |𝔼ψ(𝑿)YeβA{m0(𝑿)m(𝑿)}|+|𝔼ψ(𝑿)(1Y)er(𝑿){m0(𝑿)m(𝑿)}|\displaystyle|\mathbb{E}\psi(\boldsymbol{X})Ye^{-\beta A}\{m_{0}(\boldsymbol{X})-m(\boldsymbol{X})\}|+|\mathbb{E}\psi(\boldsymbol{X})(1-Y)e^{r(\boldsymbol{X})}\{m_{0}(\boldsymbol{X})-m(\boldsymbol{X})\}|
    \displaystyle+|\mathbb{E}\psi(\boldsymbol{X})e^{r(\boldsymbol{X})}(1-Y)\{1-e^{r_{0}(\boldsymbol{X})-r(\boldsymbol{X})}\}\{A-m_{0}(\boldsymbol{X})\}|
\displaystyle\leq C2(m0(𝑿)m(𝑿)P,2+r0(𝑿)r(𝑿)P,2+r0(𝑿)r(𝑿)P,22)3C2ζ2,n.\displaystyle C_{2}(\|m_{0}(\boldsymbol{X})-m(\boldsymbol{X})\|_{P,2}+\|r_{0}(\boldsymbol{X})-r(\boldsymbol{X})\|_{P,2}+\|r_{0}(\boldsymbol{X})-r(\boldsymbol{X})\|_{P,2}^{2})\leq 3C_{2}\zeta_{2,n}.

So we take \tau_{n}=n^{-1/4}; since \zeta_{2,n}=o(n^{-1/4}), the first inequality of Condition A2 (c) is satisfied. Again by the boundedness of \beta, A, m_{0}(\boldsymbol{x}), \psi(\boldsymbol{x}) and \psi(\boldsymbol{x})e^{r(\boldsymbol{x})}, and \zeta_{1,n},\tau_{n}=o(1), there exists C_{3}=\Theta(1) such that

𝔼[h(𝑫;β,η)h(𝑫;β0,{r0,m0,ψ})]2+𝔼[h(𝑫;β0,{r0,m0,ψ})h(𝑫;β0,η0)]2\displaystyle\mathbb{E}[h(\boldsymbol{D};\beta,\eta)-h(\boldsymbol{D};\beta_{0},\{r_{0},m_{0},\psi\})]^{2}+\mathbb{E}[h(\boldsymbol{D};\beta_{0},\{r_{0},m_{0},\psi\})-h(\boldsymbol{D};\beta_{0},\eta_{0})]^{2}
\displaystyle\leq 𝔼[h(𝑫;β,η)h(𝑫;β0,η)]2+𝔼[h(𝑫;β0,η)h(𝑫;β0,{r0,m0,ψ})]2\displaystyle\mathbb{E}[h(\boldsymbol{D};\beta,\eta)-h(\boldsymbol{D};\beta_{0},\eta)]^{2}+\mathbb{E}[h(\boldsymbol{D};\beta_{0},\eta)-h(\boldsymbol{D};\beta_{0},\{r_{0},m_{0},\psi\})]^{2}
+𝔼[h(𝑫;β0,{r0,m0,ψ})h(𝑫;β0,η0)]2\displaystyle+\mathbb{E}[h(\boldsymbol{D};\beta_{0},\{r_{0},m_{0},\psi\})-h(\boldsymbol{D};\beta_{0},\eta_{0})]^{2}
\displaystyle\leq 𝔼[ψ(𝑿)eβ0A{e(β0β)A1}{Am(𝑿)}]2\displaystyle\mathbb{E}[\psi(\boldsymbol{X})e^{-\beta_{0}A}\{e^{(\beta_{0}-\beta)A}-1\}\{A-m(\boldsymbol{X})\}]^{2}
+𝔼[ψ(𝑿)eβ0A{m0(𝑿)m(𝑿)}]2+𝔼[ψ(𝑿)er(𝑿){m0(𝑿)m(𝑿)}]2\displaystyle+\mathbb{E}[\psi(\boldsymbol{X})e^{-\beta_{0}A}\{m_{0}(\boldsymbol{X})-m(\boldsymbol{X})\}]^{2}+\mathbb{E}[\psi(\boldsymbol{X})e^{r(\boldsymbol{X})}\{m_{0}(\boldsymbol{X})-m(\boldsymbol{X})\}]^{2}
+𝔼[ψ(𝑿)er(𝑿)(1Y){1er0(𝑿)r(𝑿)}{Am0(𝑿)}]2\displaystyle+\mathbb{E}[\psi(\boldsymbol{X})e^{r(\boldsymbol{X})}(1-Y)\{1-e^{r_{0}(\boldsymbol{X})-r(\boldsymbol{X})}\}\{A-m_{0}(\boldsymbol{X})\}]^{2}
+𝔼[|ψ¯(𝑿)ψ(𝑿)|{eβ0A+er0(𝑿)}|Am0(𝑿)|]2\displaystyle+\mathbb{E}[|\bar{\psi}(\boldsymbol{X})-\psi(\boldsymbol{X})|\{e^{-\beta_{0}A}+e^{r_{0}(\boldsymbol{X})}\}|A-m_{0}(\boldsymbol{X})|]^{2}
\displaystyle\leq C3supa𝒜|e(β0β)a1|+C3supx𝒳{|(1+er(𝒙))[ψ¯(𝒙)ψ(𝒙)]|+|m(𝒙)m0(𝒙)|+|er(𝒙)r0(𝒙)1|}\displaystyle C_{3}\sup_{a\in\mathcal{A}}|e^{(\beta_{0}-\beta)a}-1|+C_{3}\sup_{x\in\mathcal{X}}\{|(1+e^{r(\boldsymbol{x})})[\bar{\psi}(\boldsymbol{x})-\psi(\boldsymbol{x})]|+|m(\boldsymbol{x})-m_{0}(\boldsymbol{x})|+|e^{r(\boldsymbol{x})-r_{0}(\boldsymbol{x})}-1|\}
\displaystyle\leq C3supa𝒜|e(β0β)a1|+C3supx𝒳{|ψ¯(𝒙)ψ(𝒙)|+|m(𝒙)m0(𝒙)|+{expit(r0(𝒙))+1}|er(𝒙)r0(𝒙)1|}\displaystyle C_{3}\sup_{a\in\mathcal{A}}|e^{(\beta_{0}-\beta)a}-1|+C_{3}\sup_{x\in\mathcal{X}}\{|\bar{\psi}(\boldsymbol{x})-\psi(\boldsymbol{x})|+|m(\boldsymbol{x})-m_{0}(\boldsymbol{x})|+\{{\rm expit}(r_{0}(\boldsymbol{x}))+1\}|e^{r(\boldsymbol{x})-r_{0}(\boldsymbol{x})}-1|\}
\displaystyle\leq C3{(eτnC31)+2ζ1,n+2|eζ1,n1|}=o(1),\displaystyle C_{3}\{(e^{\tau_{n}C_{3}}-1)+2\zeta_{1,n}+2|e^{\zeta_{1,n}}-1|\}=o(1),

which validates the second inequality of Condition A2 (c). Finally, for each $r\in(0,1)$, denote $\beta^{*}=\beta_{0}+r(\beta-\beta_{0})$ and $\eta^{*}=\{r^{*},m^{*},\psi\}=\{r_{0}(\cdot),m_{0}(\cdot),\psi\}+r(\eta-\{r_{0}(\cdot),m_{0}(\cdot),\psi\})$. By a derivation similar to the above, there exists a constant $C_{4}=\Theta(1)$ such that

\displaystyle\partial_{r}^{2}\mathbb{E}h\{\boldsymbol{D};\beta_{0}+r(\beta-\beta_{0}),\{r_{0}(\cdot),m_{0}(\cdot),\psi\}+r(\eta-\{r_{0}(\cdot),m_{0}(\cdot),\psi\})\}
=\displaystyle= 𝔼ψ(𝑿)YeβAA(ββ0){m0(𝑿)m(𝑿)}\displaystyle\mathbb{E}\psi(\boldsymbol{X})Ye^{-\beta^{*}A}A(\beta-\beta_{0})\{m_{0}(\boldsymbol{X})-m(\boldsymbol{X})\}
+𝔼ψ(𝑿)(1Y)er(𝑿){r0(𝑿)r(𝑿)}{m0(𝑿)m(𝑿)}\displaystyle+\mathbb{E}\psi(\boldsymbol{X})(1-Y)e^{r^{*}(\boldsymbol{X})}\{r_{0}(\boldsymbol{X})-r(\boldsymbol{X})\}\{m_{0}(\boldsymbol{X})-m(\boldsymbol{X})\}
\displaystyle\leq C_{4}|\beta-\beta_{0}|\cdot\mathbb{E}|m_{0}(\boldsymbol{X})-m(\boldsymbol{X})|+C_{4}\mathbb{E}\big[|r_{0}(\boldsymbol{X})-r(\boldsymbol{X})|\cdot|m_{0}(\boldsymbol{X})-m(\boldsymbol{X})|\big]
=\displaystyle= O(m()m0()P,22)+O{(ββ0)2}+O(r()r0()P,22)=O(ζ2,n2)+o(τn2)=o(n1/4).\displaystyle O(\|m(\cdot)-m_{0}(\cdot)\|_{P,2}^{2})+O\{(\beta-\beta_{0})^{2}\}+O(\|r(\cdot)-r_{0}(\cdot)\|_{P,2}^{2})=O(\zeta_{2,n}^{2})+o(\tau_{n}^{2})=o(n^{-1/4}).

Given the verified Conditions A1 and A2, one can follow essentially the same arguments as in the proofs of Theorem 3.3 and Lemma 6.3 of Chernozhukov et al., 2018a to prove our Theorem 2. The only minor difference, concerning the handling of $\psi^{[\text{-}k]}$, was discussed above: one first works with $n^{-1}\sum_{k=1}^{K}\sum_{i\in\mathcal{I}_{k}}h(\boldsymbol{D}_{i};\beta_{0},\{r_{0},m_{0},\widehat{\psi}^{[\text{-}k]}\})$ when deriving the preliminary rate and asymptotic expansion of $\widehat{\beta}_{\sf\scriptscriptstyle ML}$, since $\mathbb{E}[h(\boldsymbol{D}_{i};\beta_{0},\{r_{0},m_{0},\widehat{\psi}^{[\text{-}k]}\})\,|\,\widehat{\psi}^{[\text{-}k]}]=0$, and then bounds $n^{-1}\sum_{k=1}^{K}\sum_{i\in\mathcal{I}_{k}}\{h(\boldsymbol{D}_{i};\beta_{0},\{r_{0},m_{0},\psi_{0}\})-h(\boldsymbol{D}_{i};\beta_{0},\{r_{0},m_{0},\widehat{\psi}^{[\text{-}k]}\})\}$ using that $\partial_{\psi}\mathbb{E}h(\boldsymbol{D};\beta_{0},\eta_{0})[\psi-\psi_{0}]=0$ and that $\widehat{\psi}^{[\text{-}k]}(\cdot)$ is uniformly consistent.

Appendix C Justification of the FMR procedure

In this section, we derive error rates for the ML estimator $\widehat{r}^{[\text{-}k]}(\cdot)$ resulting from the FMR procedure introduced in Section 3.2. Assume that the learning algorithm $\mathscr{L}$ attains the same convergence and rate properties as those imposed on $\widehat{m}^{[\text{-}k]}(\cdot)$ in Assumption ML1, i.e., for each $j\in\{1,2,\ldots,K\}$ and $k\in\{1,2,\ldots,K\}$:

sup𝒙𝒳,a𝒜|M^[-k,-j](a,𝒙)M0(a,𝒙)|=\displaystyle\sup_{\boldsymbol{x}\in\mathcal{X},a\in\mathcal{A}}|\widehat{M}^{[\text{-}k,\text{-}j]}(a,\boldsymbol{x})-M_{0}(a,\boldsymbol{x})|= o(1);M^[-k,-j]()M0()P,2=o(n1/4);\displaystyle o_{\mathbb{P}}(1);\quad\|\widehat{M}^{[\text{-}k,\text{-}j]}(\cdot)-M_{0}(\cdot)\|_{P,2}=o_{\mathbb{P}}(n^{-1/4});
sup𝒙𝒳|a^[-k,-j](𝒙)a0(𝒙)|=\displaystyle\sup_{\boldsymbol{x}\in\mathcal{X}}|\widehat{a}^{[\text{-}k,\text{-}j]}(\boldsymbol{x})-a_{0}(\boldsymbol{x})|= o(1);a^[-k,-j]()a0()P,2=o(n1/4);\displaystyle o_{\mathbb{P}}(1);\quad\|\widehat{a}^{[\text{-}k,\text{-}j]}(\cdot)-a_{0}(\cdot)\|_{P,2}=o_{\mathbb{P}}(n^{-1/4});
sup𝒙𝒳|t^[-k](𝒙)t[-k,-j](𝒙)|=\displaystyle\sup_{\boldsymbol{x}\in\mathcal{X}}|\widehat{t}^{[\text{-}k]}(\boldsymbol{x})-t^{[\text{-}k,\text{-}j]}_{{\dagger}}(\boldsymbol{x})|= o(1);t^[-k]()t[-k,-j]()P,2=o(n1/4),\displaystyle o_{\mathbb{P}}(1);\quad\|\widehat{t}^{[\text{-}k]}(\cdot)-t^{[\text{-}k,\text{-}j]}_{{\dagger}}(\cdot)\|_{P,2}=o_{\mathbb{P}}(n^{-1/4}),

where $t^{[\text{-}k,\text{-}j]}_{{\dagger}}(\boldsymbol{x})\mathrel{\mathop{\mathchar 58\relax}}=\mathbb{E}\left[{\rm logit}\{\widehat{M}^{[\text{-}k,\text{-}j]}(A,\boldsymbol{X})\}\,\Big{|}\,\boldsymbol{X}=\boldsymbol{x},\widehat{M}^{[\text{-}k,\text{-}j]}(\cdot)\right]$. We show below that

sup𝒙𝒳|r^[-k](𝒙)r0(𝒙)|=o(1);r^[-k]()r0()P,2=o(n1/4).\sup_{\boldsymbol{x}\in\mathcal{X}}|\widehat{r}^{[\text{-}k]}(\boldsymbol{x})-r_{0}(\boldsymbol{x})|=o_{\mathbb{P}}(1);\quad\|\widehat{r}^{[\text{-}k]}(\cdot)-r_{0}(\cdot)\|_{P,2}=o_{\mathbb{P}}(n^{-1/4}).

First, since ${\rm logit}(\cdot)$ is smooth, and in particular Lipschitz on any range bounded away from $0$ and $1$, it is not hard to show that

sup𝒙𝒳,a𝒜|logit{M^[-k,-j](a,𝒙)}logit{M0(a,𝒙)}|=o(1);logit{M^[-k,-j]()}logit{M0()}P,2=o(n1/4),\sup_{\boldsymbol{x}\in\mathcal{X},a\in\mathcal{A}}|{\rm logit}\{\widehat{M}^{[\text{-}k,\text{-}j]}(a,\boldsymbol{x})\}-{\rm logit}\{M_{0}(a,\boldsymbol{x})\}|=o_{\mathbb{P}}(1);\quad\|{\rm logit}\{\widehat{M}^{[\text{-}k,\text{-}j]}(\cdot)\}-{\rm logit}\{M_{0}(\cdot)\}\|_{P,2}=o_{\mathbb{P}}(n^{-1/4}),

under some mild regularity conditions. We then derive the error rate of $\breve{\beta}^{[\text{-}k]}$ as follows.

|-k|1j=1Ki-k,jlogit{M^[-k,-j](Ai,𝑿i)}{Aia^[-k,-j](𝑿i)}\displaystyle|\mathcal{I}_{\text{-}k}|^{-1}\sum_{j=1}^{K}\sum_{i\in\mathcal{I}_{\text{-}k,j}}{\rm logit}\{\widehat{M}^{[\text{-}k,\text{-}j]}(A_{i},\boldsymbol{X}_{i})\}\{A_{i}-\widehat{a}^{[\text{-}k,\text{-}j]}(\boldsymbol{X}_{i})\}
=\displaystyle= |-k|1j=1Ki-k,jlogit{M0(Ai,𝑿i)}{Aia0(𝑿i)}\displaystyle|\mathcal{I}_{\text{-}k}|^{-1}\sum_{j=1}^{K}\sum_{i\in\mathcal{I}_{\text{-}k,j}}{\rm logit}\{M_{0}(A_{i},\boldsymbol{X}_{i})\}\{A_{i}-a_{0}(\boldsymbol{X}_{i})\}
+|-k|1j=1Ki-k,j[logit{M^[-k,-j](Ai,𝑿i)}logit{M0(Ai,𝑿i)}]{Aia0(𝑿i)}\displaystyle+|\mathcal{I}_{\text{-}k}|^{-1}\sum_{j=1}^{K}\sum_{i\in\mathcal{I}_{\text{-}k,j}}[{\rm logit}\{\widehat{M}^{[\text{-}k,\text{-}j]}(A_{i},\boldsymbol{X}_{i})\}-{\rm logit}\{M_{0}(A_{i},\boldsymbol{X}_{i})\}]\{A_{i}-a_{0}(\boldsymbol{X}_{i})\}
+|-k|1j=1Ki-k,jlogit{M0(Ai,𝑿i)}{a0(𝑿i)a^[-k,-j](𝑿i)}\displaystyle+|\mathcal{I}_{\text{-}k}|^{-1}\sum_{j=1}^{K}\sum_{i\in\mathcal{I}_{\text{-}k,j}}{\rm logit}\{M_{0}(A_{i},\boldsymbol{X}_{i})\}\{a_{0}(\boldsymbol{X}_{i})-\widehat{a}^{[\text{-}k,\text{-}j]}(\boldsymbol{X}_{i})\}
+|-k|1j=1Ki-k,j[logit{M^[-k,-j](Ai,𝑿i)}logit{M0(Ai,𝑿i)}]{a0(𝑿i)a^[-k,-j](𝑿i)}\displaystyle+|\mathcal{I}_{\text{-}k}|^{-1}\sum_{j=1}^{K}\sum_{i\in\mathcal{I}_{\text{-}k,j}}[{\rm logit}\{\widehat{M}^{[\text{-}k,\text{-}j]}(A_{i},\boldsymbol{X}_{i})\}-{\rm logit}\{M_{0}(A_{i},\boldsymbol{X}_{i})\}]\{a_{0}(\boldsymbol{X}_{i})-\widehat{a}^{[\text{-}k,\text{-}j]}(\boldsymbol{X}_{i})\}
=\displaystyle= |-k|1j=1Ki-k,jlogit{M0(Ai,𝑿i)}{Aia0(𝑿i)}\displaystyle|\mathcal{I}_{\text{-}k}|^{-1}\sum_{j=1}^{K}\sum_{i\in\mathcal{I}_{\text{-}k,j}}{\rm logit}\{M_{0}(A_{i},\boldsymbol{X}_{i})\}\{A_{i}-a_{0}(\boldsymbol{X}_{i})\}
\displaystyle+O_{\mathbb{P}}\Big(\max_{j\in\{1,\ldots,K\}}\|\widehat{a}^{[\text{-}k,\text{-}j]}(\cdot)-a_{0}(\cdot)\|_{P,2}+\max_{j\in\{1,\ldots,K\}}\|{\rm logit}\{\widehat{M}^{[\text{-}k,\text{-}j]}(\cdot)\}-{\rm logit}\{M_{0}(\cdot)\}\|_{P,2}\Big)+O_{\mathbb{P}}(n^{-1/2})
=\displaystyle= |-k|1j=1Ki-k,jlogit{M0(Ai,𝑿i)}{Aia0(𝑿i)}+o(n1/4)+O(n1/2),\displaystyle|\mathcal{I}_{\text{-}k}|^{-1}\sum_{j=1}^{K}\sum_{i\in\mathcal{I}_{\text{-}k,j}}{\rm logit}\{M_{0}(A_{i},\boldsymbol{X}_{i})\}\{A_{i}-a_{0}(\boldsymbol{X}_{i})\}+o_{\mathbb{P}}(n^{-1/4})+O_{\mathbb{P}}(n^{-1/2}),

under some mild regularity conditions. Similarly, we have

|-k|1j=1Ki-k,j{Aia^[-k,-j](𝑿i)}2=|-k|1j=1Ki-k,j{Aia0(𝑿i)}2+o(n1/4)+O(n1/2).\displaystyle|\mathcal{I}_{\text{-}k}|^{-1}\sum_{j=1}^{K}\sum_{i\in\mathcal{I}_{\text{-}k,j}}\{A_{i}-\widehat{a}^{[\text{-}k,\text{-}j]}(\boldsymbol{X}_{i})\}^{2}=|\mathcal{I}_{\text{-}k}|^{-1}\sum_{j=1}^{K}\sum_{i\in\mathcal{I}_{\text{-}k,j}}\{A_{i}-a_{0}(\boldsymbol{X}_{i})\}^{2}+o_{\mathbb{P}}(n^{-1/4})+O_{\mathbb{P}}(n^{-1/2}).

Consequently, combining the two expansions above with Proposition 1, we have

β˘[-k]=i-klogit{M0(Ai,𝑿i)}{Aia0(𝑿i)}i-k{Aia0(𝑿i)}2+o(n1/4)=β0+o(n1/4).\breve{\beta}^{[\text{-}k]}=\frac{\sum_{i\in\mathcal{I}_{\text{-}k}}{\rm logit}\{M_{0}(A_{i},\boldsymbol{X}_{i})\}\{A_{i}-a_{0}(\boldsymbol{X}_{i})\}}{\sum_{i\in\mathcal{I}_{\text{-}k}}\{A_{i}-a_{0}(\boldsymbol{X}_{i})\}^{2}}+o_{\mathbb{P}}(n^{-1/4})=\beta_{0}+o_{\mathbb{P}}(n^{-1/4}).

Then, since $\beta_{0}$ and $|a_{0}(\boldsymbol{x})|$ are bounded by Assumption REG, and recalling that $\widehat{a}^{[\text{-}k]}(\boldsymbol{x})=K^{-1}\sum_{j=1}^{K}\widehat{a}^{[\text{-}k,\text{-}j]}(\boldsymbol{x})$, the estimator $\widehat{r}^{[\text{-}k]}(\cdot)$ given by equation (8) satisfies:

sup𝒙𝒳|r^[-k](𝒙)r0(𝒙)|\displaystyle\sup_{\boldsymbol{x}\in\mathcal{X}}|\widehat{r}^{[\text{-}k]}(\boldsymbol{x})-r_{0}(\boldsymbol{x})|
\displaystyle\leq sup𝒙𝒳|t^[-k](𝒙)t0(𝒙)|+|β˘[-k]β0||a0(𝒙)|+|β˘[-k]||a^[-k](𝒙)a0(𝒙)|\displaystyle\sup_{\boldsymbol{x}\in\mathcal{X}}|\widehat{t}^{[\text{-}k]}(\boldsymbol{x})-t_{0}(\boldsymbol{x})|+|\breve{\beta}^{[\text{-}k]}-\beta_{0}||a_{0}(\boldsymbol{x})|+|\breve{\beta}^{[\text{-}k]}||\widehat{a}^{[\text{-}k]}(\boldsymbol{x})-a_{0}(\boldsymbol{x})|
\displaystyle\leq sup𝒙𝒳,j|t^[-k](𝒙)t[-k,-j](𝒙)|+|t[-k,-j](𝒙)t0(𝒙)|+o(1)=o(1),\displaystyle\sup_{\boldsymbol{x}\in\mathcal{X},j}|\widehat{t}^{[\text{-}k]}(\boldsymbol{x})-t^{[\text{-}k,\text{-}j]}_{{\dagger}}(\boldsymbol{x})|+|t^{[\text{-}k,\text{-}j]}_{{\dagger}}(\boldsymbol{x})-t_{0}(\boldsymbol{x})|+o_{\mathbb{P}}(1)=o_{\mathbb{P}}(1),

where $\sup_{\boldsymbol{x}\in\mathcal{X},j}|t^{[\text{-}k,\text{-}j]}_{{\dagger}}(\boldsymbol{x})-t_{0}(\boldsymbol{x})|=o_{\mathbb{P}}(1)$ is a consequence of $\sup_{\boldsymbol{x}\in\mathcal{X},a\in\mathcal{A}}|{\rm logit}\{\widehat{M}^{[\text{-}k,\text{-}j]}(a,\boldsymbol{x})\}-{\rm logit}\{M_{0}(a,\boldsymbol{x})\}|=o_{\mathbb{P}}(1)$. Similarly, for the $L_{2}$ norm,

r^[-k]()r0()P,2\displaystyle\|\widehat{r}^{[\text{-}k]}(\cdot)-r_{0}(\cdot)\|_{P,2}
\displaystyle\leq t^[-k]()t0()P,2+|β˘[-k]β0|a0(𝒙)P,2+|β˘[-k]|a^[-k](𝒙)a0(𝒙)P,2\displaystyle\|\widehat{t}^{[\text{-}k]}(\cdot)-t_{0}(\cdot)\|_{P,2}+|\breve{\beta}^{[\text{-}k]}-\beta_{0}|\|a_{0}(\boldsymbol{x})\|_{P,2}+|\breve{\beta}^{[\text{-}k]}|\|\widehat{a}^{[\text{-}k]}(\boldsymbol{x})-a_{0}(\boldsymbol{x})\|_{P,2}
=\displaystyle= maxj{1,,K}t[-k,-j]()t0()P,2+o(n1/4)\displaystyle\max_{j\in\{1,\ldots,K\}}\|t^{[\text{-}k,\text{-}j]}_{{\dagger}}(\cdot)-t_{0}(\cdot)\|_{P,2}+o_{\mathbb{P}}(n^{-1/4})
=\displaystyle= maxj{1,,K}𝔼[logit{M^[-k,-j](A,𝑿)}|𝑿=𝒙]𝔼[logit{M0(A,𝑿)}|𝑿=𝒙]P,2+o(n1/4)\displaystyle\max_{j\in\{1,\ldots,K\}}\|\mathbb{E}[{\rm logit}\{\widehat{M}^{[\text{-}k,\text{-}j]}(A,\boldsymbol{X})\}|\boldsymbol{X}=\boldsymbol{x}]-\mathbb{E}[{\rm logit}\{M_{0}(A,\boldsymbol{X})\}|\boldsymbol{X}=\boldsymbol{x}]\|_{P,2}+o_{\mathbb{P}}(n^{-1/4})
\displaystyle\leq maxj{1,,K}logit{M^[-k,-j]()}logit{M0()}P,2+o(n1/4)=o(n1/4).\displaystyle\max_{j\in\{1,\ldots,K\}}\|{\rm logit}\{\widehat{M}^{[\text{-}k,\text{-}j]}(\cdot)\}-{\rm logit}\{M_{0}(\cdot)\}\|_{P,2}+o_{\mathbb{P}}(n^{-1/4})=o_{\mathbb{P}}(n^{-1/4}).

Thus, $\widehat{r}^{[\text{-}k]}(\cdot)$ satisfies Assumption ML1 under our assumptions on the learning algorithm $\mathscr{L}$.
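
For concreteness, the following R sketch illustrates the FMR refitting computation analyzed above on a single training fold $\mathcal{I}_{\text{-}k}$. It is a minimal illustration under simplified choices (one level of sub-splitting into J parts and a crude truncation of the fitted probabilities), not the exact implementation used in the paper; the arguments fit_prob, fit_reg and predict_fn are hypothetical wrappers around the user-chosen learning algorithm $\mathscr{L}$ (a probability learner, a regression learner, and the corresponding prediction routine).

# Sketch of the FMR ("full model refitting") step on one training fold.
# fit_prob(x, y): learns P(Y = 1 | x); fit_reg(x, y): learns E(y | x);
# predict_fn(fit, newx): returns numeric predictions for the rows of newx.
# All three are hypothetical placeholders for the learning algorithm L.
fmr_rhat <- function(A, X, Y, fit_prob, fit_reg, predict_fn, J = 2) {
  n <- length(Y)
  folds <- sample(rep(1:J, length.out = n))      # sub-split of the training fold
  logitM <- rep(NA_real_, n)
  a_fits <- vector("list", J)
  for (j in 1:J) {
    tr <- folds != j
    te <- folds == j
    M_fit <- fit_prob(cbind(A, X)[tr, , drop = FALSE], Y[tr])  # M-hat(a, x)
    a_fits[[j]] <- fit_reg(X[tr, , drop = FALSE], A[tr])       # a-hat(x) = E(A | X = x)
    p <- predict_fn(M_fit, cbind(A, X)[te, , drop = FALSE])
    p <- pmin(pmax(p, 1e-3), 1 - 1e-3)           # crude truncation to keep the logit finite
    logitM[te] <- qlogis(p)                      # cross-fitted logit{M-hat(A_i, X_i)}
  }
  ahat_mat <- sapply(a_fits, function(f) predict_fn(f, X))     # n x J predictions
  ahat_cf  <- ahat_mat[cbind(1:n, folds)]        # cross-fitted a-hat(X_i) for observation i
  res <- A - ahat_cf
  beta_breve <- sum(logitM * res) / sum(res^2)   # preliminary beta on the training fold
  t_fit <- fit_reg(X, logitM)                    # t-hat(x): regress logit{M-hat} on X
  # r-hat(x) = t-hat(x) - beta_breve * a-hat(x), with a-hat averaged over the J fits
  function(Xnew) {
    abar <- rowMeans(matrix(sapply(a_fits, function(f) predict_fn(f, Xnew)), ncol = J))
    predict_fn(t_fit, Xnew) - beta_breve * abar
  }
}

The returned function plays the role of $\widehat{r}^{[\text{-}k]}(\cdot)$ and can then be plugged into the cross-fitted orthogonal score as in Section 3.2.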

Appendix D Numerical implementation of the HD approach

We now present the implementation of our HD approach mentioned in Remark 5, which uses LASSO-type regularization instead of the Dantzig-type moment equations and modifies the construction so that it can be solved using the R package RCAL. Let $G(u)=\int g(u)du$, and recall that $\widetilde{\boldsymbol{\gamma}}$ is an initial estimator obtained through $\ell_{1}$-regularized logistic regression of $Y$ on $\{A,\boldsymbol{X}\}$ and $\widehat{\psi}(\boldsymbol{x})={\rm expit}(-\boldsymbol{x}^{\intercal}\widetilde{\boldsymbol{\gamma}})$. Then we fit

min𝜶pn1i=1n(1Yi)ψ^(𝑿i)e𝑿i𝜸~{Ai𝑿i𝜶+G(𝑿i𝜶)}+λα𝜶1,\min_{\boldsymbol{\alpha}\in\mathbb{R}^{p}}n^{-1}\sum_{i=1}^{n}(1-Y_{i})\widehat{\psi}(\boldsymbol{X}_{i})e^{\boldsymbol{X}_{i}^{\intercal}\widetilde{\boldsymbol{\gamma}}}\left\{-A_{i}\boldsymbol{X}_{i}^{\intercal}\boldsymbol{\alpha}+G(\boldsymbol{X}_{i}^{\intercal}\boldsymbol{\alpha})\right\}+\lambda_{\alpha}\|\boldsymbol{\alpha}\|_{1}, (A6)

to obtain $\widehat{\boldsymbol{\alpha}}$. It is not hard to see that the KKT (subgradient) condition of (A6) is equivalent to the $\ell_{\infty}$-constraint in (4). When $g(\cdot)$ is the identity link (linear model) or ${\rm expit}(\cdot)$ (logistic model), (A6) can be solved using the R package glmnet with a proper specification of the sample weights, namely $(1-Y_{i})\widehat{\psi}(\boldsymbol{X}_{i})e^{\boldsymbol{X}_{i}^{\intercal}\widetilde{\boldsymbol{\gamma}}}$; a code sketch of the full pipeline is given at the end of this appendix. Then we solve

n1i=1nψ^(𝑿i){YieβAi(1Yi)e𝑿i𝜸~}{Aig(𝑿i𝜶^)}=0,n^{-1}\sum_{i=1}^{n}\widehat{\psi}(\boldsymbol{X}_{i})\{Y_{i}e^{-\beta A_{i}}-(1-Y_{i})e^{\boldsymbol{X}_{i}^{\intercal}\widetilde{\boldsymbol{\gamma}}}\}\{A_{i}-g(\boldsymbol{X}_{i}^{\intercal}\widehat{\boldsymbol{\alpha}})\}=0,

to obtain a preliminary estimator $\widetilde{\beta}$. It can be shown that when either $r(\boldsymbol{x})$ or $m(\boldsymbol{x})$ is correctly specified, $\widetilde{\beta}$ approaches $\beta_{0}$ at the rate $O_{\mathbb{P}}\{(s\log p/n)^{1/2}\}$, i.e., the rate of the $\ell_{2}$ errors of $\widetilde{\boldsymbol{\gamma}}$ and $\widehat{\boldsymbol{\alpha}}$. So $\widetilde{\beta}$ provides a good enough approximation of $\beta_{0}$ to be used in the $\ell_{1}$-regularized (weighted) calibration regression (Tan, 2020b) for estimating $\boldsymbol{\gamma}$:

{\rm min}_{\boldsymbol{\gamma}\in\mathbb{R}^{p}}n^{-1}\sum_{i=1}^{n}\widehat{\psi}(\boldsymbol{X}_{i})e^{\boldsymbol{X}_{i}^{\intercal}\widetilde{\boldsymbol{\gamma}}}g^{\prime}(\boldsymbol{X}_{i}^{\intercal}\widehat{\boldsymbol{\alpha}})\{Y_{i}e^{-\widetilde{\beta}A_{i}-\boldsymbol{X}_{i}^{\intercal}\boldsymbol{\gamma}}+(1-Y_{i})(\widetilde{\beta}A_{i}+\boldsymbol{X}_{i}^{\intercal}\boldsymbol{\gamma})\}+\lambda_{\gamma}\|\boldsymbol{\gamma}\|_{1}. (A7)

Again, the KKT condition of (A7) corresponds to the $\ell_{\infty}$-constraints in (5), though they do not always impose the same moment conditions: when the nuisance model $r(\boldsymbol{x})$ is misspecified, $\widetilde{\boldsymbol{\gamma}}$ and $\widehat{\boldsymbol{\gamma}}$ typically have different limits. We use the R package RCAL to solve (A7), taking $Y_{i}$ as the response, $\boldsymbol{X}_{i}$ as the regressors, $\widehat{\psi}(\boldsymbol{X}_{i})e^{\boldsymbol{X}_{i}^{\intercal}\widetilde{\boldsymbol{\gamma}}}g^{\prime}(\boldsymbol{X}_{i}^{\intercal}\widehat{\boldsymbol{\alpha}})$ as the sample weight, and $\widetilde{\beta}A_{i}$ as the offset for each observation $i$. Denoting the solution of (A7) by $\widehat{\boldsymbol{\gamma}}$, we finally obtain the estimator $\widehat{\beta}_{\sf\scriptscriptstyle HD}$ by solving

n1i=1nψ^(𝑿i)e𝑿i(𝜸~𝜸^){YieβAi(1Yi)e𝑿i𝜸^}{Aig(𝑿i𝜶^)}=0.n^{-1}\sum_{i=1}^{n}\widehat{\psi}(\boldsymbol{X}_{i})e^{\boldsymbol{X}_{i}^{\intercal}(\widetilde{\boldsymbol{\gamma}}-\widehat{\boldsymbol{\gamma}})}\{Y_{i}e^{-\beta A_{i}}-(1-Y_{i})e^{\boldsymbol{X}_{i}^{\intercal}\widehat{\boldsymbol{\gamma}}}\}\{A_{i}-g(\boldsymbol{X}_{i}^{\intercal}\widehat{\boldsymbol{\alpha}})\}=0. (A8)

Here the final estimating equation is asymptotically equivalent to the second row of (5) only when $r(\boldsymbol{x})$ is correctly specified, in which case $\widetilde{\boldsymbol{\gamma}}$ and $\widehat{\boldsymbol{\gamma}}$ have the same limiting values. When $r(\boldsymbol{x})$ is misspecified, the orthogonal score used in (A8), denoted by $h^{\prime}(\boldsymbol{D},\beta_{0},\eta)$, differs from the $h(\boldsymbol{D},\beta_{0},\eta)$ used in the main text. We point out that this does not hurt the Neyman orthogonality of $h^{\prime}(\boldsymbol{D},\beta_{0},\eta)$. This is because when $r(\boldsymbol{x})$ is misspecified but $m(\boldsymbol{x})$ is correct (by our model assumption, at least one nuisance model needs to be correct), $\partial_{r}\mathbb{E}h^{\prime}(\boldsymbol{D};\beta_{0},\bar{\eta})[r-\bar{r}]=0$ holds automatically due to the correctness of $m(\boldsymbol{x})$, and $\partial_{m}\mathbb{E}h^{\prime}(\boldsymbol{D};\beta_{0},\bar{\eta})[m-\bar{m}]=0$ holds by the KKT (moment) condition of (A7). When $m(\boldsymbol{x})$ is misspecified but $r(\boldsymbol{x})$ is correct, $\partial_{m}\mathbb{E}h^{\prime}(\boldsymbol{D};\beta_{0},\bar{\eta})[m-\bar{m}]=0$ holds automatically, and (A8) is asymptotically equivalent to

n1i=1nψ^(𝑿i){YieβAi(1Yi)e𝑿i𝜸^}{Aig(𝑿i𝜶^)}=0,n^{-1}\sum_{i=1}^{n}\widehat{\psi}(\boldsymbol{X}_{i})\{Y_{i}e^{-\beta A_{i}}-(1-Y_{i})e^{\boldsymbol{X}_{i}^{\intercal}\widehat{\boldsymbol{\gamma}}}\}\{A_{i}-g(\boldsymbol{X}_{i}^{\intercal}\widehat{\boldsymbol{\alpha}})\}=0,

as $\widetilde{\boldsymbol{\gamma}}$ and $\widehat{\boldsymbol{\gamma}}$ approach the true $\boldsymbol{\gamma}_{0}$ and the second-order error is asymptotically negligible. So $\partial_{r}\mathbb{E}h^{\prime}(\boldsymbol{D};\beta_{0},\bar{\eta})[r-\bar{r}]=0$ is again satisfied through (A7). Thus, our modified construction procedure does not break the theoretical guarantee of $\widehat{\beta}_{\sf\scriptscriptstyle HD}$.
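
To make the above construction concrete, here is a minimal R sketch of the pipeline (A6)-(A8) in the special case where $g(\cdot)$ is the identity link, so that (A6) reduces to a weighted least-squares LASSO. It uses cv.glmnet with default tuning, omits cross-fitting, and, because we do not reproduce the RCAL call for the calibration step (A7), simply reuses $\widetilde{\boldsymbol{\gamma}}$ as a stand-in for $\widehat{\boldsymbol{\gamma}}$. It is an illustrative sketch under these simplifications, not the implementation used for the reported results.

# Illustrative sketch of the modified HD construction (A6)-(A8), identity link g.
# Y: binary response (0/1), A: scalar exposure, X: n x p numeric covariate matrix.
library(glmnet)

expit <- function(u) 1 / (1 + exp(-u))

hd_beta_sketch <- function(Y, A, X) {
  # Step 0: initial l1-regularized logistic regression of Y on (A, X);
  # gamma_tilde keeps the intercept and X-coefficients, psi_hat = expit(-X'gamma_tilde).
  init <- cv.glmnet(cbind(A, X), Y, family = "binomial")
  cf <- as.numeric(coef(init, s = "lambda.min"))
  gamma_tilde <- c(cf[1], cf[-(1:2)])            # drop the coefficient of A
  Xb <- cbind(1, X)                              # intercept column added
  lin_r <- as.numeric(Xb %*% gamma_tilde)        # X'gamma_tilde
  psi_hat <- expit(-lin_r)

  # Step 1 (A6): weighted LASSO for alpha. With the identity link this is a
  # weighted least-squares LASSO of A on X with weights (1 - Y) psi_hat exp(X'gamma_tilde),
  # equivalently restricting to the observations with Y = 0.
  idx0 <- Y == 0
  w1 <- (psi_hat * exp(lin_r))[idx0]
  fit_m <- cv.glmnet(X[idx0, , drop = FALSE], A[idx0], weights = w1, family = "gaussian")
  m_hat <- as.numeric(predict(fit_m, newx = X, s = "lambda.min"))  # g(X'alpha_hat)

  # Orthogonal score in beta for a given linear predictor lin_g = X'gamma:
  # n^{-1} sum psi_hat exp{X'(gamma_tilde - gamma)} {Y e^{-beta A} - (1 - Y) e^{X'gamma}} (A - m_hat).
  score <- function(beta, lin_g) {
    mean(psi_hat * exp(lin_r - lin_g) *
           (Y * exp(-beta * A) - (1 - Y) * exp(lin_g)) * (A - m_hat))
  }

  # Step 2: preliminary beta_tilde, solving the score equation with gamma_tilde
  # (assuming a sign change of the score on the search interval).
  beta_tilde <- uniroot(function(b) score(b, lin_r), interval = c(-10, 10))$root

  # Step 3 (A7): gamma_hat from the weighted calibration LASSO, solved in the paper
  # with the RCAL package (weights psi_hat * exp(X'gamma_tilde) * g'(X'alpha_hat),
  # offset beta_tilde * A). Not reproduced here; gamma_tilde is reused as a stand-in.
  gamma_hat <- gamma_tilde
  lin_rh <- as.numeric(Xb %*% gamma_hat)

  # Step 4 (A8): final estimator beta_hat_HD.
  beta_hat <- uniroot(function(b) score(b, lin_rh), interval = c(-10, 10))$root
  list(beta_tilde = beta_tilde, beta_hat_HD = beta_hat)
}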

Appendix E Additional details of numerical experiments

First, we present the mean vectors and covariance structures used to generate $A$ and $\boldsymbol{X}$ in Section 5.1:

(i) Take $\boldsymbol{\mu}_{1}=(0.4,-0.25,-0.25,0,\ldots,0)$, $\boldsymbol{\mu}_{0}=\boldsymbol{0}$, and

({\boldsymbol{\Sigma}}^{-1})_{ij}=\begin{cases}1.5&i=j=1\\ 1.2&i=j\geq 2\\ 0.2&i=1,\,2\leq j\leq 5\ \text{or}\ 2\leq i\leq 5,\,j=1\\ 0&\text{else.}\end{cases}

(ii) Take $\boldsymbol{\mu}_{1}=(0.4,-0.25,-0.25,0,\ldots,0)$, $\boldsymbol{\mu}_{0}=\boldsymbol{0}$,

({\boldsymbol{\Sigma}_{1}}^{-1})_{ij}=\begin{cases}1.5&i=j=1\\ 1.2&i=j\geq 2\\ 0.2&i=1,\,2\leq j\leq 5\ \text{or}\ 2\leq i\leq 5,\,j=1\\ 0&\text{else}\end{cases}

and

({\boldsymbol{\Sigma}_{0}}^{-1})_{ij}=\begin{cases}1.5&i=j=1\\ 1.2&i=j\geq 2\\ 0.2&i=1,\,2\leq j\leq 5\ \text{or}\ 2\leq i\leq 5,\,j=1\\ 0.075&3\leq i\leq 4,\,j=2\ \text{or}\ i=2,\,3\leq j\leq 4\ \text{or}\ i=3,\,j=4\ \text{or}\ i=4,\,j=3\\ 0&\text{else.}\end{cases}

(iii) Take the covariance of $\boldsymbol{X}$ as

{\boldsymbol{\Sigma}}_{ij}=\begin{cases}0.5&i=j\\ 0.15&i\leq 4,\,j\leq 4,\,i\neq j\\ 0&\text{else.}\end{cases}
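
For clarity, the following R snippet constructs the precision matrix of setting (i) and inverts it to obtain the implied covariance matrix; the dimension p = 200 here is only an illustrative choice (the dimensions actually used are specified in Section 5.1). Settings (ii) and (iii) can be built analogously from their piecewise definitions.

# Precision matrix of setting (i) and the implied covariance; p = 200 is illustrative only.
p <- 200
Omega <- diag(1.2, p)                    # 1.2 on the diagonal for i = j >= 2
Omega[1, 1] <- 1.5                       # 1.5 for i = j = 1
Omega[1, 2:5] <- 0.2                     # 0.2 for i = 1, 2 <= j <= 5 ...
Omega[2:5, 1] <- 0.2                     # ... and the symmetric entries
Sigma <- solve(Omega)                    # covariance matrix Sigma = (Sigma^{-1})^{-1}

mu1 <- c(0.4, -0.25, -0.25, rep(0, p - 3))   # mean vector mu_1
mu0 <- rep(0, p)                             # mean vector mu_0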

Next, we present the specific choices of basis functions used for data generation in Section 5.2. Specifically, we take

fa(𝒙)\displaystyle f_{a}(\boldsymbol{x}) ={11+ex1,11+ex2,sin(x3),cos(x4),I(x5>0),I(x6>0),x7x8,x9x10};\displaystyle=\left\{\frac{1}{1+e^{x_{1}}},\frac{1}{1+e^{x_{2}}},{\rm sin}(x_{3}),{\rm cos}(x_{4}),I(x_{5}>0),I(x_{6}>0),x_{7}x_{8},x_{9}x_{10}\right\}^{\intercal};
𝜻a\displaystyle\boldsymbol{\zeta}_{a} =(2,2,1,1,0.5,0.5,0.2,0.2),\displaystyle=(2,-2,1,1,0.5,-0.5,0.2,0.2)^{\intercal},

for $a_{0}(\boldsymbol{x})$, the conditional mean of $A$ given $\boldsymbol{X}=\boldsymbol{x}$. We take

fr(𝒙)\displaystyle f_{r}(\boldsymbol{x}) ={x1x2x3,x4x5,x63,sin2(x7),cos(x8),11+x92,11+ex10,I(x11>0),I(x12>0)};\displaystyle=\left\{x_{1}x_{2}x_{3},x_{4}x_{5},x_{6}^{3},{\rm sin}^{2}(x_{7}),{\rm cos}(x_{8}),\frac{1}{1+x_{9}^{2}},\frac{1}{1+e^{x_{10}}},I(x_{11}>0),I(x_{12}>0)\right\}^{\intercal};
𝜻r\displaystyle\boldsymbol{\zeta}_{r} =(0.1,0.1,0.1,0.5,0.5,1,1,0.25,0.25),\displaystyle=(0.1,0.1,0.1,-0.5,0.5,1,-1,0.25,-0.25)^{\intercal},

to specify r0(𝒙)r_{0}(\boldsymbol{x}).
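
As a small illustration, the following R functions evaluate the basis vectors above at a single covariate vector x (with at least 12 entries) and form the linear combinations zeta_a' f_a(x) and zeta_r' f_r(x). Whether $a_{0}(\cdot)$ and $r_{0}(\cdot)$ equal these combinations exactly or include further centering or scaling is specified in Section 5.2, so this is only a sketch of the functional forms.

# Basis functions f_a and f_r of Section 5.2, evaluated at one x (length >= 12).
f_a <- function(x) c(1 / (1 + exp(x[1])), 1 / (1 + exp(x[2])), sin(x[3]), cos(x[4]),
                     as.numeric(x[5] > 0), as.numeric(x[6] > 0), x[7] * x[8], x[9] * x[10])
f_r <- function(x) c(x[1] * x[2] * x[3], x[4] * x[5], x[6]^3, sin(x[7])^2, cos(x[8]),
                     1 / (1 + x[9]^2), 1 / (1 + exp(x[10])), as.numeric(x[11] > 0),
                     as.numeric(x[12] > 0))
zeta_a <- c(2, -2, 1, 1, 0.5, -0.5, 0.2, 0.2)
zeta_r <- c(0.1, 0.1, 0.1, -0.5, 0.5, 1, -1, 0.25, -0.25)

a0_lin <- function(x) sum(zeta_a * f_a(x))   # zeta_a' f_a(x)
r0_lin <- function(x) sum(zeta_r * f_r(x))   # zeta_r' f_r(x)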