
Nonparametric estimation of a covariate-adjusted counterfactual treatment regimen response curve

Ashkan Ertefaie (ashkan_ertefaie@urmc.rochester.edu), Luke Duttweiler (luke_duttweiler@urmc.rochester.edu) and Brent A. Johnson (brent_johnson@urmc.rochester.edu), Department of Biostatistics, University of Rochester; Mark J. van der Laan (laan@berkeley.edu), Department of Biostatistics, University of California at Berkeley.
Abstract

Flexible estimation of the mean outcome under a treatment regimen (i.e., the value function) is a key step toward personalized medicine. We define our target parameter as a conditional value function given a set of baseline covariates, which we refer to as a stratum-based value function. We focus on a semiparametric class of decision rules and propose a sieve-based nonparametric covariate-adjusted regimen-response curve estimator within that class. Our work contributes in several ways. First, we propose an inverse probability weighted nonparametrically efficient estimator of the smoothed regimen-response curve function. We show that asymptotic linearity is achieved when the nuisance functions are sufficiently undersmoothed. Asymptotic and finite sample criteria for undersmoothing are proposed. Second, using Gaussian process theory, we propose simultaneous confidence intervals for the smoothed regimen-response curve function. Third, we establish consistency and a convergence rate for the optimizer of the regimen-response curve estimator; this enables us to estimate an optimal semiparametric rule. The latter is important as the optimizer corresponds with the optimal dynamic treatment regimen. Some finite-sample properties are explored with simulations.

Keywords: cadlag functions, dynamic marginal structural models, highly adaptive lasso, kernel smoothing, semiparametric decision rules, undersmoothing.

1 Introduction

Marginal structural models have been widely used in causal inference as a tool for estimating the mean outcome under a given decision rule (Robins, 1998) and for constructing optimal treatment regimens. This approach requires modeling the mean of the potential outcome given treatment and possibly some subset of baseline information on a subject. Existing methods typically impose a parametric causal model on the mean of the potential outcome and draw inference under the assumption that the model is correctly specified (Robins, Hernan and Brumback, 2000; Murphy et al., 2001; Joffe et al., 2004; Orellana, Rotnitzky and Robins, 2010). The latter is particularly restrictive when the conditioning set of treatment and baseline covariates includes continuous variables. Although model selection approaches have been proposed (Van der Laan, Laan and Robins, 2003), a finite sample size will often prevent recovery of the true model and lead to poor out-of-sample performance (Shinozaki and Suzuki, 2020). Neugebauer and van der Laan (2007) proposed a nonparametric marginal structural model framework to improve the causal interpretability of the estimated parameters of a potentially misspecified parametric working model. However, when the working model is misspecified, the causally interpretable parameters might not be the parameters of interest and the resulting treatment regimen might be suboptimal.

The parameters of the proposed marginal structural model can be estimated using an inverse probability weighted loss function where the weights are the conditional probability of following a particular decision rule. The consistency and causal interpretability of the corresponding estimators rely on a correctly specified weight function. To mitigate the chance of model misspecification, one may decide to model the weight function nonparametrically using data-adaptive techniques. However, this choice may result in nonregular estimators of the primary parameters in the marginal structural model. This lack of regularity arises because the rate of convergence of inverse probability weighted estimators relies on the rate of convergence of the weight function estimator, which will be slower than the desired root-$n$ rate when data-adaptive techniques are used. To overcome this concern, authors have used so-called doubly robust estimators for the parameters in the working marginal structural model; that is, estimators that are consistent as long as either the treatment model or the outcome model is correctly specified (Robins, 2004; Wahed and Tsiatis, 2004; Orellana, Rotnitzky and Robins, 2010; Rosenblum, 2011; Petersen et al., 2014; Ertefaie et al., 2015). However, doubly robust nonparametric estimators may suffer from two important issues: (1) the performance of the estimators depends on the modeling choice of the nuisance parameters, and they can be irregular with large bias and a slow rate of convergence (van der Laan and Luedtke, 2014; Benkeser et al., 2017); and (2) the quality of the constructed decision rule still relies on a correctly specified marginal structural model.

As an alternative to the marginal structural model framework, several authors have proposed methods to estimate nonparametric decision rules that circumvent the need to specify a parametric outcome model. Zhao et al. (2012) proposed an inverse probability weighting approach to estimate a nonparametric decision rule (Dudík, Langford and Li, 2011; Zhao et al., 2015). Doubly robust augmented inverse probability weighting procedures have also been proposed (Rubin and van der Laan, 2012; Zhang et al., 2012; Liu et al., 2018; Athey and Wager, 2021). Due to the nonparametric nature of the estimated decision rule, inference for the resulting value function (i.e., the mean outcome under the estimated rule) is challenging. Zhao et al. (2012) established Fisher consistency, an excess risk bound, and the convergence rate of the value function to provide a theoretical guarantee for their proposed method. Athey and Wager (2021) derived bounds for the regret function under the estimated nonparametric treatment strategy and showed that the bounds decay as $n^{-1/2}$ as long as the Vapnik–Chervonenkis (VC) dimension of the class of decision rules does not grow too fast. Other recent germane contributions on estimating optimal treatment regimes and conditional average treatment effects include Swaminathan and Joachims (2015), Kallus (2018), Nie and Wager (2021), Zhao et al. (2022), and references therein.

We define a stratum-based value function as the conditional expectation of the potential outcome under a treatment regimen given a prespecified set of baseline covariates $\bm{V}$. The value is a function of regimens, and we are interested in understanding the functional association between the mean outcome and the regimens within a class, which we refer to as the regimen-response curve. We consider a current status data setting with two levels of coarsening at random (i.e., treatment assignment and censoring). We overcome the two important shortcomings of the existing methods by: (1) allowing the regimen-response curve to be estimated nonparametrically; and (2) estimating the weight functions of the inverse probability weighted loss function using a data-adaptive technique. Recently, Ertefaie, Hejazi and van der Laan (2022) proposed a nonparametrically efficient inverse probability weighted estimator for the average causal effect where the weights are estimated using an undersmoothed highly adaptive lasso. Below, we generalize the same technique to estimate conditional average causal effects via nonparametric marginal structural models. An important byproduct of our nonparametric regimen-response curve estimator is a method to construct semiparametric decision rules where the parameters of the rule are allowed to be a nonparametric function of the set of baseline covariates $\bm{V}$ (i.e., a $V$-specific optimal decision rule). The parameters are defined as the optimizer of the regimen-response curve, which allows us to provide a rate of convergence for the corresponding functional parameter estimators using empirical process theory. Our theoretical contributions can be summarized as follows:

  • Considering a kernel smoothed regimen-response curve function as the target parameter, we show that the proposed inverse probability weighted estimator (1) is nonparametrically efficient when the weight functions are properly undersmoothed; (2) converges to a Gaussian process at the root-$n$ rate; and (3) is uniformly consistent at a derived rate.

  • Considering the regimen-response curve function as the target parameter, we derive the asymptotic normality of our estimator and show that the bias caused by the kernel smoothing of the estimator vanishes under some mild assumptions and properly undersmoothed kernel and weight functions.

  • Using a novel empirical process algorithm, we show that the minimizer of the estimated regimen-response curve is a consistent estimator of the minimizer of the true regimen-response curve and derive the convergence rate. This result is of independent interest as it derives a rate of convergence for an optimizer of a nonparametrically estimated function.

In contrast to the existing literature, which targets the value function under the estimated rule, our estimand is the stratum-based value function. Deriving the asymptotic properties of our estimator is more challenging because the estimand is a non-pathwise differentiable functional when $\bm{V}$ includes at least one continuous variable (Bickel et al., 1998). Moreover, existing methods estimate either parametric or nonparametric rules, where in each case flexibility or interpretability is sacrificed. Our proposed semiparametric rule lands in between and provides a reasonable trade-off between flexibility and interpretability. For example, clinicians might have a strong opinion about an important tailoring variable and want to ensure that the correct functional form of that variable is captured in the decision rule. In such settings, we can include that variable in the set $\bm{V}$ and provide a stratum-based decision rule that is a nonparametric function of $\bm{V}$.

2 Preliminaries

2.1 Notation and Formulation

Let $T$ be a univariate failure time, $\bm{W}\in\mathcal{W}\subset\mathbb{R}^{p}$ be a vector of all available baseline covariates measured before treatment $A\in\{0,1\}$, and $C$ be a censoring variable. Let $\bm{S}\in\mathcal{S}\subset\mathbb{R}^{q}$ be a subvector of $\bm{W}$ and $\bm{V}=\bm{W}\setminus\bm{S}$ where $\bm{V}\in\mathcal{V}\subset\mathbb{R}^{p-q}$. For simplicity of notation, we denote $p-q$ as $r$. A decision rule $d^{\theta}\in\mathcal{D}$ is defined as a function that maps values of $\bm{S}$ to a treatment option in $\{0,1\}$. In our formulation, $\mathcal{D}$ is a class of deterministic rules indexed by a vector of coefficients. Specifically, we consider $d^{\theta}=I\{\bm{S}^{\top}\theta>0\}$ with $\theta\in\Theta\subset\mathbb{R}^{q}$, where $I(.)$ is an indicator function. Let $T^{\theta}$ be the potential outcome that would have been observed under the decision rule $d^{\theta}$ for a given value of $\theta\in\Theta$. Suppose that we have the full data $X=\{(T^{\theta}:d^{\theta}\in\mathcal{D}),\bm{W}\}\sim P_{X}\in\mathcal{M}^{F}$, where $\mathcal{M}^{F}$ is a nonparametric full data model. Define $E_{P_{X}}\{I(T^{\theta}>t)\mid\bm{V}\}=\Psi_{P_{X}}(\theta,\bm{V})\in\mathcal{F}_{\lambda}$, where $\mathcal{F}_{\lambda}$ is a class of cadlag functions with sectional variation norm bounded by $\lambda$ and $t$ is a predetermined time point. The full-data risk of $\Psi$ is defined as $E_{P_{X}}\int_{\theta}\ell_{\theta}(\Psi,x)dF(\theta)$, where $\ell$ is the log-likelihood loss function, $\ell_{\theta}(\Psi)=I(T^{\theta}>t)\log\left(\frac{\Psi}{1-\Psi}\right)+\log(1-\Psi)$, and $dF(\theta)$ is a weight function/measure on the Euclidean set of all $\theta$-values (e.g., multivariate normal with possibly large variance). Define a $V$-specific optimal decision rule that maximizes the $t$-year survival probability as $\theta_{0}(\bm{V})=\operatorname*{argmax}_{\theta\in\Theta}\Psi_{P_{X}}(\theta,\bm{V})$. Hence, we allow $\theta$ to be a functional parameter mapping $\mathcal{V}$ onto $\mathbb{R}^{q}$.

The observed data $\mathcal{O}=\{Y=\min(T,C),\Delta=I(T<C),A,\bm{W}\}$ follow some probability distribution $P_{0}$ that lies in a nonparametric model $\mathcal{M}$. Suppose we observe $n$ independent, identically distributed trajectories of $\mathcal{O}$. Let $\Delta^{c}=\Delta+(1-\Delta)I\{Y>t\}$ and $G:\mathcal{M}\rightarrow\mathcal{G}$ be a functional nuisance parameter where $\mathcal{G}=\{G(P):P\in\mathcal{M}\}$. In our setting, $G(P)\{(A,\Delta^{c})\mid\bm{W}\}$ denotes the joint treatment and censoring mechanism under a distribution $P$. We refer to $G(P_{0})$ as $G_{0}$ and $G(P)$ as $G$. Under the coarsening at random assumption, $G(P)\{(A,\Delta^{c})\mid X\}=G(P)\{(A,\Delta^{c})\mid\bm{W}\}$. Denote $G^{a}\equiv G^{a}(P)(A\mid\bm{W})$ and $G^{c}\equiv G^{c}(P)(\Delta^{c}\mid\bm{W},A)$. We denote the latter two functions under $P_{0}$ as $G^{a}_{0}$ and $G^{c}_{0}$. Also define $Q:\mathcal{M}\rightarrow\mathcal{Q}$ and $\mathcal{Q}=\{Q(P):P\in\mathcal{M}\}$ where $Q_{0}(\theta,\bm{W})=\mathbb{E}_{P_{0}}\{I(Y^{\theta}>t)\mid\bm{W}\}$. Let $\prod$ be a projection operator in the Hilbert space $L_{0}^{2}(P)$ with inner product $\langle h_{1},h_{2}\rangle=E_{P}(h_{1}h_{2})$.

The loss function $\int_{\theta}\ell_{\theta}(\Psi)dF(\theta)$ is based on the full data, which are not observable, and thus it cannot be used directly to estimate our full-data functional parameter $\Psi_{P_{X}}$. However, under the consistency ($T_{i}=T^{A_{i}}$) and no unmeasured confounding ($A\perp T^{a}\mid\bm{W}$ and $\Delta^{c}\perp T\mid A,\bm{W}$) assumptions, it can be identified using the observed data distribution $P_{0}$. Specifically, we define an inverse probability weighting mapping of the full-data loss as

L_{G}(\Psi)=\int_{\theta}\left[\frac{\Delta^{c}I(A=d^{\theta})\xi(\theta,\bm{V})}{G\{(A,\Delta^{c})\mid\bm{W}\}}\left\{I(Y>t)\log\left(\frac{\Psi}{1-\Psi}\right)+\log(1-\Psi)\right\}\right]dF(\theta). (1)

Accordingly, define the regimen-response curve as $\Psi_{P_{0}}=\operatorname*{argmin}_{\Psi\in\mathcal{F}_{\lambda}}E_{P_{0}}L_{G_{0}}(\Psi)$. Following our convention, we denote $\Psi_{P_{0}}$ as $\Psi_{0}$. We define a highly adaptive lasso estimator of $\Psi_{P_{0}}$ as $\Psi_{n}=\operatorname*{argmin}_{\Psi\in\mathcal{F}_{\lambda_{n}}}P_{n}L_{G_{n}}(\Psi)$, where $\lambda_{n}$ is a data-adaptively selected $\lambda$ and $G_{n}$ is an estimator of $G_{0}$. The choice of the function $\xi$ is inconsequential to the derivation of the efficient influence function of our target parameter, but it may impact the level of undersmoothing needed in finite samples. One may consider $\xi(\theta,\bm{V})=1$ or $\xi(\theta,\bm{V})=G\{(A,\Delta^{c})\mid\bm{V}\}$. The latter choice corresponds to the stabilized weight.
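To make the loss concrete, the following minimal Python sketch evaluates a Monte Carlo approximation of (1) for a candidate $\Psi$, taking $\xi(\theta,\bm{V})=1$ and averaging over draws of $\theta$ from $F$; the array names and the candidate-function interface are hypothetical and not part of the formal development.

```python
import numpy as np

def ipw_loss(psi, theta_draws, y, a, delta_c, s, g_hat, t):
    """Monte Carlo approximation of the IPW loss (1) with xi(theta, V) = 1.

    psi(theta): array of Psi(theta, V_i) for each unit (a candidate curve).
    g_hat: estimates of G{(A, Delta^c) | W_i}.
    theta_draws: draws of theta from the weight measure F.
    """
    surv = (y > t).astype(float)                       # I(Y > t)
    total = 0.0
    for theta in theta_draws:
        follow = (a == (s @ theta > 0)).astype(float)  # I(A = d^theta)
        p = np.clip(psi(theta), 1e-6, 1 - 1e-6)
        loglik = surv * np.log(p / (1 - p)) + np.log(1 - p)
        total += np.mean(delta_c * follow / g_hat * loglik)
    return total / len(theta_draws)
```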

2.2 Highly adaptive lasso

The highly adaptive lasso is a nonparametric regression approach that can be used to estimate infinite dimensional functional parameters (Benkeser and Van Der Laan, 2016; van der Laan, 2017; van der Laan and Bibaut, 2017). It forms a linear combination of indicator basis functions to minimize the expected value of a loss function under the constraint that the $L_{1}$-norm of the coefficient vector is bounded by a finite constant.

Let $\mathcal{D}[0,\tau]$ be the Banach space of $d$-variate real-valued cadlag functions (right-continuous with left-hand limits) on a cube $[0,\tau]\subset\mathbb{R}^{d}$. For each function $f\in\mathcal{D}[0,\tau]$, define the supremum norm as $\|f\|_{\infty}=\sup_{w\in[0,\tau]}|f(w)|$. For any subset $s$ of $\{0,1,\ldots,d\}$, we partition $[0,\tau]$ into $\{0\}\cup\{\cup_{s}(0_{s},\tau_{s}]\}$ and define the sectional variation norm of such an $f$ as

\|f\|^{*}_{\zeta}=|f(0)|+\sum_{s\subset\{1,\ldots,d\}}\int_{0_{s}}^{\tau_{s}}|df_{s}(u_{s})|,

where the sum is over all subsets of $\{0,1,\ldots,d\}$. For a given subset $s\subset\{0,1,\ldots,d\}$, define $u_{s}=(u_{j}:j\in s)$ and $u_{-s}$ as the complement of $u_{s}$. Then, $f_{s}:[0_{s},\tau_{s}]\rightarrow\mathbb{R}$ is defined as $f_{s}(u_{s})=f(u_{s},0_{-s})$. Thus, $f_{s}(u_{s})$ is a section of $f$ that sets the components in the complement of $s$ equal to zero and varies only along the variables in $u_{s}$.

Let $G^{a}\equiv G(P)(A\mid W)$ and $G^{c}\equiv G(P)(\Delta^{c}\mid W,A)$ denote the treatment and censoring mechanisms under an arbitrary distribution $P\in\mathcal{M}$. Assuming that our nuisance functional parameters $G^{a},G^{c}\in\mathcal{D}[0,\tau]$ have finite sectional variation norm, we can represent $\text{logit }G^{a}$ as (Gill, van der Laan and Wellner, 1993)

\text{logit }G^{a}(w) =\text{logit }G^{a}(0)+\sum_{s\subset\{1,\ldots,d\}}\int_{0_{s}}^{w_{s}}dG^{a}_{s}(u_{s})
 =\text{logit }G^{a}(0)+\sum_{s\subset\{1,\ldots,d\}}\int_{0_{s}}^{\tau_{s}}I(u_{s}\leq w_{s})dG^{a}_{s}(u_{s}). (2)

The representation (2) can be approximated using a discrete measure that puts mass on each observed $W_{s,i}$, denoted by $\alpha_{s,i}$. Let $\phi_{s,i}(c_{s})=I(w_{i,s}\leq c_{s})$, where the $w_{i,s}$ are support points of $dG^{a}_{s}$. We then have

\text{logit }G^{a}_{\alpha}=\alpha_{0}+\sum_{s\subset\{1,\ldots,d\}}\sum_{i=1}^{n}\alpha_{s,i}\phi_{s,i},

where $|\alpha_{0}|+\sum_{s\subset\{1,\ldots,d\}}\sum_{i=1}^{n}|\alpha_{s,i}|$ is an approximation of the sectional variation norm of $f$. The loss-based highly adaptive lasso estimator $\alpha_{n}(\lambda)$ is defined as

\alpha_{n}(\lambda)=\arg\min_{\alpha:|\alpha_{0}|+\sum_{s\subset\{1,\ldots,d\}}\sum_{i=1}^{n}|\alpha_{s,i}|<\lambda}P_{n}L(\text{logit }G^{a}_{\alpha}),

where $L(.)$ is a given loss function. Denote $G^{a}_{n,\lambda}\equiv G^{a}_{\alpha_{n}(\lambda)}$ as the highly adaptive lasso estimate of $G^{a}$. Similarly, we can define $G^{c}_{n,\lambda}$ as the highly adaptive lasso estimate of $G^{c}$. When the nuisance parameter corresponds to the conditional probability of a binary variable (e.g., a propensity score with a binary treatment or a binary censoring indicator), the log-likelihood loss can be used. In practice, $\lambda$ is unknown and must be obtained using data. We refer to $\lambda_{n}$ as a value of $\lambda$ that is determined using data.
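As a concrete illustration, the following Python sketch builds the zero-order indicator basis in (2) and fits an $L_{1}$-penalized logistic regression for a binary nuisance such as $G^{a}$; it is a simplified stand-in for a full highly adaptive lasso implementation (knot screening and the exact mapping between the bound $\lambda$ and the penalty are omitted), and all function names are hypothetical.

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression

def hal_basis(W, knots=None):
    """Zero-order HAL basis: phi_{s,i}(w) = I(w_{i,s} <= w_s) for every nonempty
    subset s of covariates and every observed knot point."""
    knots = W if knots is None else knots
    n, d = W.shape
    cols = []
    for k in range(1, d + 1):
        for s in combinations(range(d), k):
            Ws, Ks = W[:, list(s)], knots[:, list(s)]
            cols.append(np.all(Ws[:, None, :] >= Ks[None, :, :], axis=2))
    return np.concatenate(cols, axis=1).astype(float)

def fit_propensity_hal(W, A, C=1.0):
    """L1-penalized logistic fit on the HAL basis; a smaller C means a stronger
    penalty, i.e. a smaller bound on the sectional variation norm."""
    X = hal_basis(W)
    fit = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(X, A)
    return fit.predict_proba(X)[:, 1]   # estimated G^a(W_i) on the training data
```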

3 Dynamic marginal structural models with highly adaptive lasso

3.1 The challenge

The target functional parameter $\Psi_{0}$ is not a pathwise differentiable function, which makes inference challenging. The existing literature approximates this functional by a pathwise differentiable function $\Psi_{\beta}$, where $\beta$ is a vector of unknown parameters (van der Laan, 2006; van der Laan and Petersen, 2007; Murphy et al., 2001; Robins, Orellana and Rotnitzky, 2008; Orellana, Rotnitzky and Robins, 2010; Neugebauer and van der Laan, 2007). The working model provides a user-specified summary of the unknown true function relating the expectation of the potential outcome to the strategy. Hence, the quality of the constructed decision rule depends on how well the working model approximates the true function. To mitigate this concern, we propose a kernel smoothed highly adaptive lasso estimator of $\Psi_{0}(\bm{v}_{0},\theta)$. Specifically, we define a regimen-response curve estimator as

\Psi_{nh}(\bm{v}_{0},\theta)=P_{n}U(P_{n},\bm{v}_{0},\Psi_{n}) (3)

where $U(P_{n},\bm{v}_{0},\Psi_{n})=(P_{n}K_{h,\bm{v}_{0}})^{-1}\{K_{h,\bm{v}_{0}}\Psi_{n}(\theta)\}$ and $\Psi_{n}$ is a highly adaptive lasso estimate of $\Psi_{0}$ obtained based on the loss function (1). In this formulation, $K_{h,\bm{v}_{0}}(v)=h^{-r}K\{(v-\bm{v}_{0})h^{-1}\}$ with $K(v)=K(v_{1},\cdots,v_{r})$, where $K(v)$ is a kernel satisfying $\int K(v)dv=1$. Moreover, we assume that $K(v)=\prod_{j=1}^{r}K_{u}(v_{j})$ is a product kernel defined by a univariate kernel $K_{u}(.)$ and that $K(v)$ is a $J$-orthogonal kernel such that $\int K_{u}(v)dv=1$ and $\int K_{u}(v)v^{j}dv=0$ for $j=1,\cdots,J$.
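For illustration, the sketch below computes the kernel-weighted average in (3) with a product Epanechnikov kernel (a second-order kernel, so $J=1$); higher-order $J$-orthogonal kernels would replace the univariate kernel. Inputs such as psi_n_at_theta (the highly adaptive lasso fit evaluated at the working $\theta$) are hypothetical.

```python
import numpy as np

def product_kernel_weights(V, v0, h):
    """K_{h,v0}(V_i) = h^{-r} prod_j K_u((V_ij - v0_j)/h) with an Epanechnikov K_u."""
    U = (V - v0) / h
    Ku = 0.75 * (1.0 - U**2) * (np.abs(U) <= 1)
    return np.prod(Ku, axis=1) / h**V.shape[1]

def smoothed_curve(psi_n_at_theta, V, v0, h):
    """Psi_nh(v0, theta) = P_n{K_{h,v0} Psi_n(theta)} / P_n{K_{h,v0}}, as in (3)."""
    Kw = product_kernel_weights(V, v0, h)
    return np.sum(Kw * psi_n_at_theta) / np.sum(Kw)
```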

We study the asymptotic behaviour of our estimator relative to $\Psi_{0}(\bm{v_{0}},\theta)$ and to a kernel smoothed parameter

\Psi_{0h}(\bm{v}_{0},\theta)=P_{0}U(P_{0},\bm{v}_{0},\Psi_{0}),

where $h$ is a fixed bandwidth. In general, when the nuisance parameter $G_{0}$ is estimated using data-adaptive regression techniques, the resulting regimen-response curve estimator $\Psi_{nh}(\bm{v}_{0},\theta)$ will not be asymptotically linear, which makes inference challenging. For example, considering the target parameter $\Psi_{0h}(\bm{v}_{0},\theta)$, we have

\Psi_{nh}(\bm{v}_{0},\theta)-\Psi_{0h}(\bm{v}_{0},\theta)= (P_{n}-P_{0})U(P_{0},\bm{v}_{0},\Psi_{0})+
 P_{0}\{U(P_{n},\bm{v}_{0},\Psi_{n})-U(P_{0},\bm{v}_{0},\Psi_{0})\}+
 (P_{n}-P_{0})\{U(P_{n},\bm{v}_{0},\Psi_{n})-U(P_{0},\bm{v}_{0},\Psi_{0})\}. (4)

The first term on the right-hand side of (4) is exactly linear and the third term is negligible under a Donsker condition on $\Psi_{n}$. The problem arises because data-adaptive regression techniques have a rate of convergence slower than the desired root-$n$ rate, and thus the term $P_{0}\{U(P_{n},\bm{v}_{0},\Psi_{n})-U(P_{0},\bm{v}_{0},\Psi_{0})\}$ induces a bias that does not vanish at the root-$n$ rate.

3.2 The intuition

We show that when the nuisance parameters are properly undersmoothed, the asymptotic linearity of our estimator $\Psi_{nh}(\bm{v}_{0},\theta)$ can be retrieved. Specifically, in the proof of Theorem 4.2, we show that

P_{0}\{U(P_{n},\bm{v}_{0},\Psi_{n})-U(P_{0},\bm{v}_{0},\Psi_{0})\}= (P_{n}-P_{0})\tilde{D}^{*}_{h,\bm{v_{0}}}(P_{0})+o_{p}(n^{-1/2})
 +P_{n}\frac{K_{h,\bm{v}_{0}}\Delta^{c}I(A=d^{\theta})}{G_{n}P_{n}K_{h,\bm{v}_{0}}}\left\{I(Y\geq t)-\Psi_{n}\right\} (5)
 +P_{n}\frac{K_{h,\bm{v}_{0}}I(A=d^{\theta})}{G_{n}P_{n}K_{h,\bm{v}_{0}}}(Q_{n}-\Psi_{n})\left(\Delta^{c}-G_{n}^{c}\right) (6)
 +P_{n}\frac{K_{h,\bm{v}_{0}}(Q_{n}-\Psi_{n})}{G_{n}^{a}P_{n}K_{h,\bm{v}_{0}}}\left\{I(A=d^{\theta})-G_{n}^{a}\right\}, (7)

where $\tilde{D}^{*}_{h,\bm{v_{0}}}(P_{0})$ includes certain components of the canonical gradient of our statistical parameter $\Psi_{0h}(\bm{v_{0}},\theta)$ and thus contributes to the efficient influence function of our estimator. The last three terms do not converge to zero at the appropriate rate and thus induce non-negligible bias. However, the terms (5), (6) and (7) resemble the form of score equations for $I(Y\geq t)$, $\Delta^{c}$ and $I(A=d^{\theta})$.

We first discuss the score equations for $G^{.}$ where $.\in\{a,c\}$. We can generate a score $S_{g}(G_{n}^{.})$ for a nonparametric estimator $G_{n}^{.}$ by paths $\{G_{n,\epsilon}^{.g}(w)\}$, such that

\text{logit }G_{n,\epsilon}^{.g}(w) =\text{logit }G_{n}^{.}(0)\{1+\epsilon g(0)\}+\sum_{s\subset\{1,\ldots,d\}}\int_{0_{s}}^{w_{s}}\phi_{s,u_{s}}(w)\{1+\epsilon g(s,u_{s})\}d\,\text{logit }G_{n,s}^{.}(u),

where $g$ is a uniformly bounded function on $[0,\tau]^{p}$. Accordingly, when a function is estimated using the highly adaptive lasso, the score functions can be generated by the path $\{1+\epsilon g(s,j)\}\beta_{n,s,j}$ for a bounded vector $g$ as

S_{g}(G_{n,\lambda_{n}}^{.})=\frac{d}{d\,\text{logit }G^{.}_{n,\lambda_{n}}}L(\text{logit }G^{.}_{n,\lambda_{n}})\left\{\sum_{(s,j)}g(s,j)\beta_{n,s,j}\phi_{s,j}\right\}, (8)

where $L(\cdot)$ is the log-likelihood loss function. For example, for the treatment indicator, $L(G^{a})=A\log\left(\frac{G^{a}}{1-G^{a}}\right)+\log(1-G^{a})$. The penalty in the highly adaptive lasso imposes a restriction on the function $g$. Specifically, the path $\{1+\epsilon g(s,j)\}\beta_{n,s,j}$ can be generated using those $g$ functions that do not change the $L_{1}$-norm of the parameters. Let $\beta^{g}_{n,s,j}=\{1+\epsilon g(s,j)\}\beta_{n,s,j}$ denote the perturbed parameters. Then, for small enough $\epsilon$ such that $\{1+\epsilon g(s,j)\}>0$, $\sum_{(s,j)}|\beta^{g}_{n,s,j}|=\sum_{(s,j)}\{1+\epsilon g(s,j)\}|\beta_{n,s,j}|=\sum_{(s,j)}|\beta_{n,s,j}|+\epsilon\sum_{(s,j)}g(s,j)|\beta_{n,s,j}|$. Hence, the restriction is satisfied when the inner product of $g$ and the vector $|\beta|$ is zero (i.e., $\langle g,|\beta|\rangle=0$).

We now provide a thought experiment on how undersmoothing the nuisance function estimates obtained using the highly adaptive lasso can eliminate the terms (5), (6) and (7). We first ignore the $L_{1}$-norm restriction on the choice of the function $g$, which is not strictly valid but conveys the main idea of the approach. Without the restriction, one can choose $g(s_{0},j_{0})=I(s=s_{0},j=j_{0})$. The latter choice perturbs one coefficient at a time and corresponds to maximum likelihood estimation. Then, for any $s_{0}$ and $j_{0}$, the score function (8) becomes

S_{I}(G_{n,\lambda_{n}}^{.})=\frac{d}{d\,\text{logit }G^{.}_{n,\lambda_{n}}}L(\text{logit }G^{.}_{n,\lambda_{n}})\left(\beta_{n,s_{0},j_{0}}\phi_{s_{0},j_{0}}\right).

Because $\beta_{n,s_{0},j_{0}}$ is a finite constant for all $s_{0}$ and $j_{0}$, solving the above score equation is equivalent to solving

\frac{d}{d\,\text{logit }G^{.}_{n,\lambda_{n}}}L(\text{logit }G^{.}_{n,\lambda_{n}})\phi_{s_{0},j_{0}}.

For example, for the treatment indicator, the score equation is given by $(A-G_{n,\lambda_{n}}^{a})\phi_{s_{0},j_{0}}$. As we undersmooth the fit, we solve more and more score equations (i.e., the number of score equations increases with the number of features included in the model), and any linear combination of those score equations will also be solved. Now consider the term (7) and let $f=\frac{K_{h,\bm{v}_{0}}(Q_{n}-\Psi_{n})}{G_{n}^{a}P_{n}K_{h,\bm{v}_{0}}}$. Assuming that $Q_{0}$, $\Psi_{0}$ and $G_{0}=G_{0}^{a}G_{0}^{c}$ are càdlàg with finite sectional variation norm (Assumption 1 in Section 4), Gill, van der Laan and Wellner (1993) showed that $f$ can be approximated as a linear combination of indicator basis functions. Therefore, if we sufficiently undersmooth $G_{n,\lambda}^{a}$, we will solve (7) up to a root-$n$ factor. That is,

P_{n}\frac{K_{h,\bm{v}_{0}}(Q_{n}-\Psi_{n})}{G_{n}^{a}P_{n}K_{h,\bm{v}_{0}}}\left\{I(A=d^{\theta})-G_{n}^{a}\right\}=o_{p}(n^{-1/2}).

The same argument can be made to show that the other terms (5) and (6) can be made negligible when Ψn\Psi_{n} and GncG_{n}^{c} are properly undersmoothed. Hence, the challenging term will be asymptotically linear with

P_{0}\{U(P_{n},\bm{v}_{0},\Psi_{n})-U(P_{0},\bm{v}_{0},\Psi_{0})\}=(P_{n}-P_{0})\tilde{D}^{*}_{h,\bm{v_{0}}}(P_{0})+o_{p}(n^{-1/2}).

Now, we study the actual case where the choice of the function $g$ is restricted to those that do not change the $L_{1}$-norm of the coefficients. Lemma A.2 shows that when (23), (24) and (25) are satisfied, our proposed estimator achieves asymptotic linearity. As we undersmooth the fit, we start adding features with small coefficients into the model. Conditions (23), (24) and (25) imply that the $L_{1}$-norm must be increased until one of the corresponding score equations is solved to a precision of $o_{p}(n^{-1/2})$.

Here we provide details for the score equations corresponding to the treatment indicator, where $S_{g}(G_{n,\lambda_{n}}^{a})=(A-G_{n,\lambda_{n}}^{a})\left\{\sum_{(s,j)}g(s,j)\beta_{n,s,j}\phi_{s,j}\right\}$. Let $r(g,G_{n,\lambda_{n}}^{a})=\sum_{(s,j)}g(s,j)\lvert\beta_{n,s,j}\rvert$. For small enough $\epsilon$,

\sum_{(s,j)}\lvert\{1+\epsilon g(s,j)\}\beta_{n,s,j}\rvert =\sum_{(s,j)}\{1+\epsilon g(s,j)\}\lvert\beta_{n,s,j}\rvert
 =\sum_{(s,j)}\lvert\beta_{n,s,j}\rvert+\epsilon r(g,G_{n,\lambda_{n}}^{a}).

Hence, for any $g$ satisfying $r(g,G_{n,\lambda_{n}}^{a})=0$ (i.e., $g$ does not change the $L_{1}$-norm of the coefficients), we have $P_{n}S_{g}(G_{n,\lambda_{n}}^{a})=0$.

Let $D\{f,G_{n,\lambda_{n}}^{a}\}=f\cdot(A-G_{n,\lambda_{n}}^{a})$, where $f$ is defined above, and let $\tilde{f}$ be an approximation of $f$ using the basis functions that satisfy condition (24). Then, $D\{\tilde{f},G_{n,\lambda_{n}}^{a}\}\in\{S_{g}(G_{n,\lambda_{n}}^{a}):\lVert g\rVert_{\infty}<\infty\}$. Thus, there exists a $g^{\star}$ such that $D(\tilde{f},G_{n,\lambda_{n}}^{a})=S_{g^{\star}}(G_{n,\lambda_{n}}^{a})$; however, for this particular choice of $g^{\star}$, $r(g^{\star},G_{n,\lambda_{n}}^{a})$ may not be zero (i.e., the restriction on $g$ might be violated). Now, define $g$ such that $g(s,j)=g^{\star}(s,j)$ for $(s,j)\neq(s^{\star},j^{\star})$; $g(s^{\star},j^{\star})$ is defined such that

\sum_{(s,j)\neq(s^{\star},j^{\star})}g^{\star}(s,j)\lvert\beta_{n,s,j}\rvert+g(s^{\star},j^{\star})\lvert\beta_{n,s^{\star},j^{\star}}\rvert=0. (9)

That is, $g$ matches $g^{\star}$ everywhere except at the single point $(s^{\star},j^{\star})$, where it is forced to take a value such that $r(g,G_{n,\lambda_{n}}^{a})=0$. As a result, for such a choice of $g$, $P_{n}S_{g}(G_{n,\lambda_{n}}^{a})=0$ by definition. Below, we show that $P_{n}S_{g}(G^{a}_{n,\lambda_{n}})-P_{n}D(\tilde{f},G_{n,\lambda_{n}}^{a})=o_{p}(n^{-1/2})$, which then implies that $P_{n}D(\tilde{f},G_{n,\lambda_{n}}^{a})=o_{p}(n^{-1/2})$. We note that the choice of $(s^{\star},j^{\star})$ is inconsequential.

P_{n}S_{g}(G_{n,\lambda_{n}}^{a})-P_{n}D\{\tilde{f},G_{n,\lambda_{n}}^{a}\} =P_{n}S_{g}(G_{n,\lambda_{n}}^{a})-P_{n}S_{g^{\star}}(G_{n,\lambda_{n}}^{a})
 =P_{n}\left\{\frac{d}{d\,\text{logit }G^{a}_{n,\lambda_{n}}}L(\text{logit }G^{a}_{n,\lambda_{n}})\left[\sum_{(s,j)}\left\{g(s,j)-g^{\star}(s,j)\right\}\beta_{n,s,j}\phi_{s,j}\right]\right\}
 =P_{n}\left[\frac{d}{d\,\text{logit }G^{a}_{n,\lambda_{n}}}L(\text{logit }G^{a}_{n,\lambda_{n}})\left\{g(s^{\star},j^{\star})-g^{\star}(s^{\star},j^{\star})\right\}\beta_{n,s^{\star},j^{\star}}\phi_{s^{\star},j^{\star}}\right]
 =O_{p}\left(P_{n}\left[\frac{d}{d\,\text{logit }G^{a}_{n,\lambda_{n}}}L(\text{logit }G^{a}_{n,\lambda_{n}})\phi_{s^{\star},j^{\star}}\right]\right)
 =o_{p}(n^{-1/2}).

The details of the fourth equality are given in the proof of Lemma A.2, and the last equality follows from the assumption that $\min_{(s,j)\in\mathcal{J}_{n}}\lVert P_{n}\frac{d}{d\,\text{logit }G^{a}_{n,\lambda_{n}}}L(\text{logit }G^{a}_{n,\lambda_{n}})(\phi_{s,j})\rVert=o_{p}(n^{-1/2})$ for $L(\cdot)$ being the log-likelihood loss (i.e., condition (24)). As $P_{n}S_{g}(G^{a}_{n,\lambda_{n}})=0$, it follows that $P_{n}D(\tilde{f},G^{a}_{n,\lambda_{n}})=o_{p}(n^{-1/2})$. Using this result, under mild assumptions, we show that $P_{n}D(f,G^{a}_{n,\lambda_{n}})=o_{p}(n^{-1/2})$, indicating that the term (7) is asymptotically negligible (see the proof of Lemma A.2 for details).
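In practice, condition (24) can be monitored directly from the fitted model: for the treatment indicator the relevant empirical score equations are $P_{n}\{(A-G^{a}_{n,\lambda_{n}})\phi_{s,j}\}$ over the basis functions with nonzero coefficients. The small sketch below computes the quantity whose minimum must shrink toward $o_{p}(n^{-1/2})$ as the fit is undersmoothed; the inputs are hypothetical.

```python
import numpy as np

def min_abs_score(A, G_hat, Phi_active):
    """Minimum over active (s, j) of |P_n{(A - G^a_n) phi_{s,j}}|.

    Phi_active: n x m matrix of the HAL basis columns with nonzero coefficients.
    Undersmoothing (increasing the L1 bound) should drive this toward zero."""
    scores = Phi_active.T @ (A - G_hat) / len(A)
    return np.min(np.abs(scores))
```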

3.3 The estimator

To improve the finite sample performance of our estimator, we propose to estimate the nuisance parameters using cross-fitting (Klaassen, 1987; Zheng and van der Laan, 2011; Chernozhukov et al., 2017). We split the data at random into $B$ mutually exclusive and exhaustive sets of size approximately $nB^{-1}$. Let $P_{n,b}^{0}$ and $P_{n,b}^{1}$ denote the empirical distributions of a training and a validation sample, respectively. For a given $\lambda$ and $h$, exclude a single (validation) fold of data and fit the highly adaptive lasso estimator using data from the remaining $(B-1)$ folds; use this model to estimate the nuisance parameters for samples in the holdout (validation) fold. By repeating the process $B$ times, we obtain estimates of the nuisance parameters for all sample units. Accordingly, we define the cross-fitted IPW estimator

\Psi_{nh}^{\textsuperscript{CF}}(\bm{v}_{0},\theta)=B^{-1}\sum_{b=1}^{B}P_{n,b}^{1}U(P_{n,b}^{1},\bm{v}_{0},\Psi_{n,\lambda,b}) (10)

where $U(P_{n,b}^{1},\bm{v}_{0},\Psi_{n,\lambda,b})=(G_{n,\lambda,b}P_{n,b}^{1}K_{h,\bm{v}_{0}})^{-1}\{K_{h,\bm{v}_{0}}\Delta^{c}I(A=d^{\theta})\Psi_{n,\lambda,b}\}$, and $\Psi_{n,\lambda,b}$ and $G_{n,\lambda,b}$ are highly adaptive lasso estimates of $\Psi_{0}$ and $G_{0}$ fitted on the training sample for the $b$th sample split for a given $\lambda$. The estimator uses a $(J-1)$-orthogonal kernel with bandwidth $h$ centered at $\bm{v_{0}}$.
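A minimal sketch of the cross-fitted estimator (10) is given below; fit_G and fit_psi stand for user-supplied (undersmoothed) highly adaptive lasso fitters that return prediction functions, and are hypothetical interfaces rather than part of the paper's notation.

```python
import numpy as np
from sklearn.model_selection import KFold

def product_kernel_weights(V, v0, h):
    U = (V - v0) / h
    return np.prod(0.75 * (1 - U**2) * (np.abs(U) <= 1), axis=1) / h**V.shape[1]

def cross_fitted_curve(Y, Dc, A, W, V, S, theta, v0, h, t, fit_G, fit_psi, B=5):
    """Cross-fitted IPW estimator (10): average over folds of the kernel-weighted
    validation-sample means, with nuisance fits from the complementary folds."""
    d_theta = (S @ theta > 0).astype(int)
    fold_estimates = []
    for train, valid in KFold(n_splits=B, shuffle=True, random_state=1).split(W):
        G_hat = fit_G(W[train], A[train], Dc[train])(W[valid], A[valid])
        psi_hat = fit_psi(Y[train] > t, W[train], theta)(V[valid])
        Kw = product_kernel_weights(V[valid], v0, h)
        num = Kw * Dc[valid] * (A[valid] == d_theta[valid]) * psi_hat / G_hat
        fold_estimates.append(np.sum(num) / np.sum(Kw))
    return np.mean(fold_estimates)
```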

3.4 Data-adaptive bandwidth selector

The proposed estimator $\Psi_{nh}(\bm{v}_{0},\theta)$ approaches $\Psi_{0}(\bm{v}_{0},\theta)$ as $h_{n}\rightarrow 0$ at the cost of increasing the variance of the estimator. Mean squared error (MSE) provides an ideal criterion for the bias-variance trade-off. However, because the bias is unknown, the MSE cannot be used directly. Here we propose an approach that circumvents the need to know the bias while targeting the optimal bias-variance trade-off. Let $\sigma_{nh}$ be a consistent estimator of $\sigma_{0}=[E\{D^{*2}_{\bm{v}_{0}}(P_{0})\}]^{1/2}$, where $D^{*}_{\bm{v}_{0}}(P_{0})$ is the efficient influence function of $\Psi_{nh}(\bm{v}_{0},\theta)$ (see Theorem 4.2). The goal is to find an $h$ that minimizes the MSE or, equivalently, sets the derivative of the MSE to zero. That is,

\{\Psi_{nh}(\bm{v}_{0},\theta)-\Psi_{0}(\bm{v}_{0},\theta)\}\frac{\partial}{\partial h}\Psi_{nh}(\bm{v}_{0},\theta)+\frac{\sigma_{nh}}{(nh^{r})^{1/2}}\frac{\partial}{\partial h}\sigma_{nh}/(nh^{r})^{1/2}=0.

We know that the optimal (up to a constant) bias-variance trade-off implies $\{\Psi_{nh}(\bm{v}_{0},\theta)-\Psi_{0}(\bm{v}_{0},\theta)\}\approx\sigma_{nh}/(nh^{r})^{1/2}$ (van der Laan, Bibaut and Luedtke, 2018). Hence, the optimal $h$ will also solve one of the following equations

\frac{\partial}{\partial h}\Psi_{nh}(\bm{v}_{0},\theta)\pm\kappa\left(\frac{\partial}{\partial h}\sigma_{nh}/(nh^{r})^{1/2}\right)=0, (11)

where $\kappa$ is a positive constant. Consider a scenario where $\Psi_{nh}(\bm{v}_{0},\theta)$ shows an increasing trend as $h_{n}$ approaches zero. This implies that $\Psi_{0}(\bm{v}_{0},\theta)>\Psi_{nh}(\bm{v}_{0},\theta)-\kappa\sigma_{nh}/(nh^{r})^{1/2}$. We then define our optimal finite sample bandwidth selector as

h_{n}=\operatorname*{argmax}_{h}\left\{\Psi_{nh}(\bm{v}_{0},\theta)-\kappa\sigma_{nh}/(nh^{r})^{1/2}\right\}.

Similarly, when $\Psi_{nh}(\bm{v}_{0},\theta)$ shows a decreasing trend as $h_{n}$ approaches zero, we define $h_{n}=\operatorname*{argmin}_{h}\{\Psi_{nh}(\bm{v}_{0},\theta)+\kappa\sigma_{nh}/(nh^{r})^{1/2}\}$. A natural choice for the constant $\kappa$ is the $(1-\alpha)$-quantile of the standard normal distribution (i.e., $\zeta_{1-\alpha}$). Hence, $h_{n}$ is the bandwidth value that minimizes the difference between the true value $\Psi_{0}(\bm{v}_{0},\theta)$ and the corresponding lower (upper) confidence bound.
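The selector can be implemented by scanning a grid of bandwidths, as in the sketch below; psi_nh and sigma_nh are arrays of the point estimate and the influence-function standard deviation evaluated on the grid (hypothetical inputs).

```python
import numpy as np
from scipy.stats import norm

def select_bandwidth(h_grid, psi_nh, sigma_nh, n, r, alpha=0.05, increasing=True):
    """Finite-sample bandwidth selector of Section 3.4: with kappa = z_{1-alpha},
    maximize the lower confidence bound (increasing trend) or minimize the upper
    bound (decreasing trend) over the bandwidth grid."""
    kappa = norm.ppf(1 - alpha)
    se = sigma_nh / np.sqrt(n * h_grid**r)
    if increasing:
        return h_grid[int(np.argmax(psi_nh - kappa * se))]
    return h_grid[int(np.argmin(psi_nh + kappa * se))]
```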

3.5 Undersmoothing criteria

The asymptotic linearity results of Theorems 4.2 and 4.3 rely on estimating the corresponding nuisance parameters using an undersmoothed highly adaptive lasso. Specifically, Theorem 4.2 requires that

  • (a)

    $\left|P_{n}D_{CAR}^{a}(P_{n},G_{n}^{a},Q_{n},\Psi_{n},\theta)\right|=o_{p}\left((nh^{r})^{-1/2}\right)$,

  • (b)

    $\left|P_{n}D_{CAR}^{c}(P_{n},G_{n},Q_{n},\Psi_{n},\theta)\right|=o_{p}\left((nh^{r})^{-1/2}\right)$,

  • (c)

    $\left|P_{n}D_{CAR}^{\Psi}(P_{n},G_{n},\Psi_{n},\theta)\right|=o_{p}\left((nh^{r})^{-1/2}\right)$,

where $D_{CAR}^{a}(P_{n},G_{n}^{a},Q_{n},\Psi_{n},\theta)=\frac{K_{h,\bm{v}_{0}}}{P_{n}K_{h,\bm{v}_{0}}}(Q_{n}-\Psi_{n})\left(\frac{G_{n}^{a}-I(A=d^{\theta})}{G_{n}}\right)$, $D_{CAR}^{c}(P_{n},G_{n},Q_{n},\Psi_{n},\theta)=\frac{K_{h,\bm{v}_{0}}I(A=d^{\theta})}{P_{n}K_{h,\bm{v}_{0}}}(Q_{n}-\Psi_{n})\left(\frac{G_{n}^{c}-\Delta^{c}}{G_{n}}\right)$ and $D_{CAR}^{\Psi}(P_{n},G_{n},\Psi_{n},\theta)=\frac{K_{h,\bm{v}_{0}}\Delta^{c}I(A=d^{\theta})}{G_{n}P_{n}K_{h,\bm{v}_{0}}}\{\Psi_{n}-I(Y>t)\}$. Motivated by our theoretical results, we propose the following practical $L_{1}$-norm bound selection criteria for $G_{n}^{a}$, $G_{n}^{c}$ and $\Psi_{n}$:

\lambda_{n,G^{a}} =\operatorname*{argmin}_{\lambda}\left|B^{-1}\sum_{b=1}^{B}P_{n,b}^{1}D_{CAR}^{a}(P_{n,b}^{1},G_{n,\lambda,b}^{a},G_{n,b}^{c},Q_{n,b},\Psi_{n,b},\theta)\right|, (12)
\lambda_{n,G^{c}} =\operatorname*{argmin}_{\lambda}\left|B^{-1}\sum_{b=1}^{B}P_{n,b}^{1}D_{CAR}^{c}(P_{n,b}^{1},G_{n,b}^{a},G_{n,\lambda,b}^{c},Q_{n,b},\Psi_{n,b},\theta)\right|, (13)
\lambda_{n,\Psi} =\min\left\{\lambda:\left|B^{-1}\sum_{b=1}^{B}P_{n,b}^{1}D_{CAR}^{\Psi}(P_{n,b}^{1},G_{n,b}^{a},G_{n,b}^{c},\Psi_{n,\lambda,b},\theta)\right|\leq\frac{\sigma_{nh}}{(n\log n)^{1/2}}\right\}, (14)

where $G_{n,b}^{a}$, $G_{n,b}^{c}$, $\Psi_{n,b}$ and $Q_{n,b}$ are cross-validated highly adaptive lasso estimates of the corresponding nuisance parameters with the $L_{1}$-norm bound based on the global cross-validation selector.
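A sketch of the selector (12) is shown below for the treatment mechanism; it scans a grid of $L_{1}$-norm bounds and minimizes the absolute empirical mean of $D_{CAR}^{a}$. Cross-fitting is omitted for brevity, and fit_Ga is a hypothetical callable returning the highly adaptive lasso propensity fit for a given bound.

```python
import numpy as np

def d_car_a(Kw, Q_n, psi_n, G_a, G_c, A, d_theta):
    """Unit-level D_CAR^a: (K / P_n K) (Q_n - Psi_n) (G^a_n - I(A = d^theta)) / G_n."""
    return Kw / np.mean(Kw) * (Q_n - psi_n) * (G_a - (A == d_theta)) / (G_a * G_c)

def select_lambda_a(lambda_grid, fit_Ga, Kw, Q_n, psi_n, G_c, A, d_theta):
    """Criterion (12): pick the bound minimizing |P_n D_CAR^a| over the grid."""
    crit = [abs(np.mean(d_car_a(Kw, Q_n, psi_n, fit_Ga(lam), G_c, A, d_theta)))
            for lam in lambda_grid]
    return lambda_grid[int(np.argmin(crit))]
```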

To achieve the Gaussian process convergence results in Theorem 4.3 (with a fixed $h$), we would require stronger conditions. Specifically, the conditions listed above must hold uniformly for any $\bm{v}\in\mathcal{V}$, that is, $\sup_{v\in\mathcal{V}}\left|P_{n}D_{CAR}^{a}(P_{n},G_{n}^{a},Q_{n},\Psi_{n},\theta)\right|=o_{p}(n^{-1/2})$, $\sup_{v\in\mathcal{V}}\left|P_{n}D_{CAR}^{c}(P_{n},G_{n},Q_{n},\Psi_{n},\theta)\right|=o_{p}(n^{-1/2})$ and $\sup_{v\in\mathcal{V}}\left|P_{n}D_{CAR}^{\Psi}(P_{n},G_{n},\Psi_{n},\theta)\right|=o_{p}(n^{-1/2})$. Accordingly, the practical criteria are given by

\lambda_{n,G^{a}}^{u} =\operatorname*{argmin}_{\lambda}\sup_{v\in\tilde{\mathcal{V}}}\left|B^{-1}\sum_{b=1}^{B}P_{n,b}^{1}D_{CAR}^{a}(P_{n,b}^{1},G_{n,\lambda,b}^{a},G_{n,b}^{c},Q_{n,b},\Psi_{n,b},\theta)\right|, (15)
\lambda_{n,G^{c}}^{u} =\operatorname*{argmin}_{\lambda}\sup_{v\in\tilde{\mathcal{V}}}\left|B^{-1}\sum_{b=1}^{B}P_{n,b}^{1}D_{CAR}^{c}(P_{n,b}^{1},G_{n,b}^{a},G_{n,\lambda,b}^{c},Q_{n,b},\Psi_{n,b},\theta)\right|, (16)
\lambda_{n,\Psi}^{u} =\min\left\{\lambda:\sup_{v\in\tilde{\mathcal{V}}}\left|B^{-1}\sum_{b=1}^{B}P_{n,b}^{1}D_{CAR}^{\Psi}(P_{n,b}^{1},G_{n,b}^{a},G_{n,b}^{c},\Psi_{n,\lambda,b},\theta)\right|\leq\frac{\sigma_{nh}}{(n\log n)^{1/2}}\right\}, (17)

where $\tilde{\mathcal{V}}=(\tilde{\bm{v}}_{01},\cdots,\tilde{\bm{v}}_{0\alpha})$ includes $\alpha$ sample points of $\mathcal{V}$. The asymptotic linearity result of Theorem 4.3 with a fixed bandwidth $h$ relies on undersmoothing criteria similar to those listed in (a)-(c) but with $h$ replaced by one, and thus the same practical undersmoothing criteria as (12)-(14) can be considered.

In Theorems 4.2 and 4.3, we show that the proposed estimators are asymptotically efficient when they solve the efficient influence function, that is, $P_{n}D_{CAR}^{\diamond}=o_{p}\left((nh^{r})^{-1/2}\right)$ and $P_{n}D_{CAR}^{\diamond}=o_{p}(n^{-1/2})$, respectively, for $\diamond\in\{a,c,\Psi\}$. The $\operatorname*{argmin}$ criteria proposed in this section correspond to the most efficient estimators for a given data set. Increasing $\lambda$ decreases the empirical mean of the pseudo score functions $\frac{K_{h,\bm{v}_{0}}}{P_{n}K_{h,\bm{v}_{0}}}(Q_{n}-\Psi_{n})\left(G_{n,\lambda}^{a}-I(A=d^{\theta})\right)$ and $\frac{K_{h,\bm{v}_{0}}I(A=d^{\theta})}{P_{n}K_{h,\bm{v}_{0}}}(Q_{n}-\Psi_{n})\left(G_{n,\lambda}^{c}-\Delta^{c}\right)$ while increasing the variance of $(G_{n,\lambda}^{a})^{-1}$ and $(G_{n,\lambda}^{c})^{-1}$. At a certain point in the grid of $\lambda$, further decreases in the empirical mean of the pseudo score functions are insufficient to satisfy the $\operatorname*{argmin}$ conditions, at which point $\left|P_{n}D_{CAR}^{\diamond}\right|$ starts increasing on account of $(G_{n,\lambda}^{a})^{-1}$ and $(G_{n,\lambda}^{c})^{-1}$. Unlike $\left|P_{n}D_{CAR}^{a}\right|$ and $\left|P_{n}D_{CAR}^{c}\right|$, $\left|P_{n}D_{CAR}^{\Psi}\right|$ decreases as $\lambda$ increases, which motivates the undersmoothing selection criteria (14) and (17). Specifically, the sectional variation norms $\lambda_{n,\Psi}$ and $\lambda_{n,\Psi}^{u}$ are defined as the smallest values (larger than that chosen by the cross-validation selector) for which the left-hand side of the condition is less than $\sigma_{nh}/(n\log n)^{1/2}$, where $\sigma_{nh}$ is a consistent estimator of $\sigma_{0}=[E\{D^{*2}_{\bm{v}_{0}}(P_{0})\}]^{1/2}$ with $D^{*}_{\bm{v}_{0}}(P_{0})$ being the efficient influence function of $\Psi_{nh}(\bm{v}_{0},\theta)$. This cutoff leads to a bias of smaller magnitude than the standard error, and thus will not affect the coverage.

Remark 1.

The variance estimator $\sigma_{nh}^{2}$ is calculated using the efficient influence function. One can, instead, use a conservative variance estimator obtained from an influence function in which the nuisance parameters are assumed to be known. That is,

\sigma_{nh}^{2} =P_{n}\left\{\frac{K_{h,\bm{v}_{0}}}{P_{n}K_{h,\bm{v}_{0}}}\Psi_{n}(\theta)-\Psi_{0}(\bm{v}_{0},\theta)\right\}^{2}, (18)

where $\Psi_{0}(\bm{v}_{0},\theta)$ can be replaced by its corresponding estimator. The variance estimator (18) can also be used in Section 3.4, where we proposed an adaptive bandwidth selector.
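The conservative estimator (18) is straightforward to compute once the kernel weights and the fitted curve are available, as in the short sketch below; psi0_estimate stands in for the unknown $\Psi_{0}(\bm{v}_{0},\theta)$ and is a hypothetical input.

```python
import numpy as np

def sigma2_nh_conservative(Kw, psi_n_at_theta, psi0_estimate):
    """Conservative variance estimator (18), treating the nuisance parameters
    as known and plugging an estimate in for Psi_0(v0, theta)."""
    infl = Kw / np.mean(Kw) * psi_n_at_theta - psi0_estimate
    return np.mean(infl**2)
```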

4 Theoretical results

The asymptotic properties of our estimator rely on the following assumptions.

Assumption 1 (Complexity of nuisance functions).

The functions $Q_{0}$, $\Psi_{0}$, $G_{0}^{a}$ and $G_{0}^{c}$ are càdlàg with finite sectional variation norm.

Assumption 1 characterizes a global smoothness condition on the true functions which is much weaker than local smoothness assumptions such as those formulated through Hölder balls (e.g., Robins et al., 2008, 2017; Mukherjee, Newey and Robins, 2017). The assumption facilitates the fast convergence rate of $n^{-1/3}\log(n)^{p/2}$ obtainable by the highly adaptive lasso (regardless of the dimensionality $p$) (van der Laan, 2017; van der Laan and Bibaut, 2017; Bibaut and van der Laan, 2019). The latter rate is generally faster than the typical minimax rates that appear under differing, but similarly positioned, assumptions regarding function classes in nonparametric estimation. In fact, we expect that almost all true conditional mean models encountered in real data will satisfy Assumption 1. The sectional variation norm is defined in more detail in Appendix B.

Assumption 2.

(Causal inference and censoring assumptions.) For $i=1,\dotsc,n$,

  • (a)

    Consistency of the observed outcome: $T_{i}=T_{i}^{A_{i}}$;

  • (b)

    Unconfoundedness of the treatment and censoring mechanisms: $A_{i}\perp T_{i}^{a}\mid\bm{W}_{i}$ for all $a\in\{0,1\}$, and $\Delta_{i}^{c}\perp T_{i}\mid A_{i},\bm{W}_{i}$;

  • (c)

    Positivity of the treatment and censoring mechanisms: $\min_{a,\delta}G(A=a,\Delta^{c}=\delta\mid\bm{W})>\gamma$ for all $\bm{W}\in\mathcal{W}$, where $\gamma$ is a small positive constant.

Assumption 2 lists standard causal identifiability assumptions. The consistency assumption ensures that a subject's potential outcome is not affected by other subjects' treatment levels and that there are no different versions of treatment. Assumption 2b is a version of the coarsening at random assumption that imposes conditional independence between (1) treatment and potential outcomes given the measured covariates; and (2) the censoring indicator and the observed outcome given the treatment and measured covariates. Positivity states that every subject has some chance of receiving treatment level $a$ with censoring status $\delta$ regardless of covariates. Using Assumption 2, the full data risk of $\Psi$ can be identified using the observed data as

E_{P_{X}}\int_{\theta}\ell_{\theta}(\Psi,x)dF(\theta)=E_{P_{0}}L_{G_{0}}(\Psi),

where $L_{G_{0}}(\Psi)$ and $\ell_{\theta}(\Psi,x)$ were defined in Section 2.1. Even if Assumptions 2a-b do not hold, it may still be of interest to estimate $E_{P_{X}}\int_{\theta}\ell_{\theta}(\Psi,x)dF(\theta)$ using $P_{n}L_{G_{0}}(\Psi)$ as an adjusted measure of association.

Remark 2.

The censoring at random assumption implies that $C\perp T\mid A,\bm{W}$ on $C<T$, which is weaker than $C\perp T\mid A,\bm{W}$ (Van der Laan, Laan and Robins, 2003, Chapter 1). However, because the additional restriction cannot be identified using the observed data, it does not impose any restriction on the tangent space and thus the statistical model remains nonparametric.

Assumption 3 (Undersmoothing).

The level of undersmoothing satisfies the following conditions:

  • (a)

    Fixed $h$: Let ${f}_{\phi}^{\Psi}$, ${f}_{\phi}^{a}$ and ${f}_{\phi}^{c}$ be the projections of $f^{\Psi}=\Delta^{c}I(A=d^{\theta})/G_{0}$, $f^{a}=(Q_{0}-\Psi_{0})/G_{0}$ and $f^{c}=I(A=d^{\theta})(Q_{0}-\Psi_{0})/G_{0}$ onto a linear span of basis functions $\phi_{s,j}$ in $L^{2}(P)$, for $\phi_{s,j}$ satisfying conditions (23), (24) and (25) of Lemma A.2. Then, $\lVert f^{\Psi}-{f}_{\phi}^{\Psi}\rVert_{2,\mu}=O_{p}(n^{-1/4})$, $\lVert f^{a}-{f}_{\phi}^{a}\rVert_{2,\mu}=O_{p}(n^{-1/4})$ and $\lVert f^{c}-{f}_{\phi}^{c}\rVert_{2,\mu}=O_{p}(n^{-1/4})$.

  • (b)

    Converging $h$: The functions $f_{h}^{\Psi}=K_{h,\bm{v}_{0}}f^{\Psi}$, $f_{h}^{a}=K_{h,\bm{v}_{0}}f^{a}$ and $f_{h}^{c}=K_{h,\bm{v}_{0}}f^{c}$ are such that $\tilde{f}^{\Psi}=h^{r}f_{h}^{\Psi}$, $\tilde{f}^{a}=h^{r}f_{h}^{a}$ and $\tilde{f}^{c}=h^{r}f_{h}^{c}$ are càdlàg with finite sectional variation norm. Let $\tilde{f}_{\phi}^{\Psi}$, $\tilde{f}_{\phi}^{a}$ and $\tilde{f}_{\phi}^{c}$ be projections of $\tilde{f}^{\Psi}$, $\tilde{f}^{a}$ and $\tilde{f}^{c}$ onto the linear span of the basis functions $\phi_{s,j}$ in $L^{2}(P)$, where $\phi_{s,j}$ satisfies conditions (20), (21) and (22) of Lemma A.1. Then, $\lVert\tilde{f}^{\Psi}-\tilde{f}_{\phi}^{\Psi}\rVert_{2,\mu}=O_{p}(n^{-1/3})$, $\lVert\tilde{f}^{a}-\tilde{f}_{\phi}^{a}\rVert_{2,\mu}=O_{p}(n^{-1/3})$ and $\lVert\tilde{f}^{c}-\tilde{f}_{\phi}^{c}\rVert_{2,\mu}=O_{p}(n^{-1/3})$,

where $\mu$ is an appropriate measure (i.e., the Lebesgue or the counting measure).

Assumption 3 is arguably the most important of the conditions for retaining the asymptotic linearity of $\Psi_{nh}(\bm{v}_{0},\theta)$. Assumption 3(a) states that when the bandwidth $h$ is fixed and the estimated coarsening mechanisms (i.e., $G^{a}$ and $G^{c}$) and $\Psi_{n}$ are sufficiently undersmoothed, the generated features can approximate the functions $f^{a}$, $f^{c}$ and $f^{\Psi}$ at an $n^{-1/3}$ rate. Lemma A.2 in the supplementary material provides the theoretical undersmoothing conditions required. Assumption 3(b) corresponds to the case where the bandwidth is allowed to converge to zero and requires conditions similar to those in part (a). Under Assumption 1, ${f}_{\phi}^{\Psi}$, ${f}_{\phi}^{a}$ and ${f}_{\phi}^{c}$ (and $\tilde{f}^{\Psi}$, $\tilde{f}^{a}$ and $\tilde{f}^{c}$) fall into the class of càdlàg functions with finite sectional variation norm. Hence, these functions can be approximated using the highly adaptive lasso generated basis functions at an $n^{-1/3}$ rate (up to a $\log n$ factor). Importantly, the required rate in Assumption 3(a) is $O_{p}(n^{-1/4})$, which is slower than the $O_{p}(n^{-1/3})$ rate obtained by the highly adaptive lasso estimator. In Remark 4, we discuss that, under a certain smoothness assumption, the required rate in Assumption 3(b) can also be relaxed to $O_{p}(n^{-1/4})$. Consequently, this key assumption is not particularly strong.

Assumption 4.

There exist constants $\kappa\geq 0$ and $C>0$ such that, for all $l>0$, $pr(0<|\bm{S}^{\top}\theta_{0}|<l)\leq Cl^{\kappa}$.

Assumption 4 is the margin assumption of Audibert and Tsybakov (2007), which can also be found in Tsybakov (2004). The assumption is needed to derive the rate of convergence of $\theta_{nh}(\bm{v})=\operatorname*{argmin}_{\theta\in\Theta}\Psi_{nh}(\bm{v},\theta)$ (Theorem 4.4). The extreme case of $\kappa=\infty$ corresponds to a case where there is a margin around zero, that is, $pr(0<|\bm{S}^{\top}\theta_{0}|<l)=0$. As shown in the proof of Theorem 4.4, the latter leads to the fastest rate of convergence for $\theta_{nh}(\bm{v})$.

Theorem 4.1.

Let $\Psi$ be a functional parameter that is identified as $\Psi_{0}=\operatorname*{argmin}_{\Psi\in\mathcal{F}_{\lambda}}P_{0}L_{G_{0}}(\Psi)$, where the loss function $L$ is defined in (1). Assume the functional parameter space $\mathcal{F}_{\lambda}$ is contained in the class of all $p$-variate cadlag functions with variation norm bounded by $\lambda$. Let $\Psi_{n}=\operatorname*{argmin}_{\Psi\in\mathcal{F}_{\lambda_{n}}}P_{n}L_{G_{n}}(\Psi)$, where $\lambda_{n}$ is the cross-validated selector of $\lambda$, and let $d_{0}(\Psi_{n},\Psi_{0})=P_{0}L_{G_{0}}(\Psi_{n})-P_{0}L_{G_{0}}(\Psi_{0})$. Also assume $\tilde{\mathcal{F}}_{\tilde{\lambda}}=\{L_{G_{0}}(\Psi)-L_{G_{0}}(\Psi_{0}):\Psi\in\mathcal{F}_{\lambda}\}$ falls into the class of cadlag functions with variation norm bounded by $\tilde{\lambda}$. Then, under Assumptions 1 and 2, $d_{0}(\Psi_{n},\Psi_{0})=O_{p}(n^{-2/3}(\log n)^{4(p-1)/3})$.

Theorem 4.1 generalizes the results of van der Laan (2017) to settings where the estimation of $\Psi_{0}$ involves estimation of the nuisance parameter $G_{0}$. The proof requires techniques to handle the additional complexities due to the presence of such nuisance parameters. The theorem relies on Assumptions 1 and 2 without the need for undersmoothing the nuisance function estimators.

Theorem 4.2.

Let $\Psi_{0}(\bm{v}_{0},\theta)=E\{I(Y^{\theta}>t)\mid\bm{V}=\bm{v}_{0}\}$ be $J$-times continuously differentiable at $\bm{v}_{0}$. Suppose the support of $\bm{W}$ is uniformly bounded, i.e., $\mathcal{W}\subseteq[0,\tau]^{p}$ for some finite constant $\tau$. Let $\Psi_{n,\lambda_{n}^{\Psi}}$, $G^{a}_{n,\lambda_{n}^{a}}$ and $G^{c}_{n,\lambda_{n}^{c}}$ be highly adaptive lasso estimators of $\Psi_{0}$, $G_{0}^{a}$ and $G_{0}^{c}$ with $L_{1}$-norm bounds equal to $\lambda_{n}^{\Psi}$, $\lambda_{n}^{a}$ and $\lambda_{n}^{c}$, respectively. Assuming that

  • (i)

    the bandwidth $h$ satisfies $h^{r}\rightarrow 0$ and $nh^{3r/2}\rightarrow\infty$,

  • (ii)

    Assumptions 1, 2 and 3 hold,

we have

\Psi_{nh}(\bm{v}_{0},\theta)-\Psi_{0}(\bm{v}_{0},\theta)=(P_{n}-P_{0})D_{\bm{v}_{0}}^{*}(P_{0})+h^{J}B_{0}(J,\bm{v}_{0})+o_{p}\left((nh^{r})^{-1/2}\right),

where $\Psi_{nh}(\bm{v}_{0},\theta)$ is obtained using a $(J-1)$-orthogonal kernel, $G_{n,\lambda_{n}^{a},\lambda_{n}^{c}}=G^{a}_{n,\lambda_{n}^{a}}G^{c}_{n,\lambda_{n}^{c}}$, and $D_{\bm{v}_{0}}^{*}(P_{0})$ is the corresponding efficient influence function defined in the proof of the theorem. Moreover, assuming that $nh^{r+2J}\rightarrow 0$, $(nh^{r})^{1/2}\{\Psi_{nh}(\bm{v}_{0},\theta)-\Psi_{0}(\bm{v}_{0},\theta)\}$ converges to a mean zero normal random variable.

Theorem 4.2 gives conditions under which the proposed kernel estimator $\Psi_{nh}(\bm{v}_{0},\theta)$ is consistent for $\Psi_{0}(\bm{v}_{0},\theta)$, and it also gives the corresponding rate of convergence. In general, this result follows if the bandwidth decreases with sample size sufficiently slowly and if the nuisance function $G$ is estimated sufficiently well. Standard local linear kernel smoothing requires that $nh^{3r}\rightarrow\infty$ to control the variance (Wasserman, 2006, Chapter 5). Using the entropy number of the class of càdlàg functions with finite sectional variation norm, we show that the bandwidth can converge to zero at a much faster rate (i.e., $nh^{3r/2}\rightarrow\infty$) than the standard results allow. Condition (ii) collects the standard causal assumptions together with the undersmoothing conditions on the nuisance parameter estimators (Assumption 3) required to achieve asymptotic linearity of our estimator. Section S6 in the supplementary material provides uniform convergence results with a rate for our regimen-response curve estimator $\Psi_{nh}(\bm{v}_{0},\theta)$ for all $\bm{v}_{0}\in\mathcal{V}$.

Remark 3 (The optimal bandwidth rate).

Theorem 4.2 shows that in order to eliminate the bias term, we must undersmooth the kernel such that nhr+2J0nh^{r+2J}\rightarrow 0 (i.e., h<n1r+2Jh<n^{-\frac{1}{r+2J}}). In combination with rate condition (i) in the theorem, we have n23r<h<n1r+2Jn^{-\frac{2}{3r}}<h<n^{-\frac{1}{r+2J}}. Note that for the latter constraint to hold we must have J>r/4J>r/4. Also, the theoretical undersmoothing conditions in Lemma A.1 require J>rJ>r. Hence, the constraint implies that the optimal bandwidth rate is hn=n1r+2Jh_{n}=n^{-\frac{1}{r+2J}} with J>rJ>r.

Remark 4 (Assumption 3 (b)).

We can allow slower rate of convergence of Op(n1/4)O_{p}(n^{-1/4}) in Assumption 3 (b) (i.e., f~Ψf~ϕΨ2,μ=Op(n1/4)\lVert\tilde{f}^{\Psi}-\tilde{f}_{\phi}^{\Psi}\rVert_{2,\mu}=O_{p}(n^{-1/4}), f~af~ϕa2,μ=Op(n1/4)\lVert\tilde{f}^{a}-\tilde{f}_{\phi}^{a}\rVert_{2,\mu}=O_{p}(n^{-1/4}) and f~cf~ϕc2,μ=Op(n1/4)\lVert\tilde{f}^{c}-\tilde{f}_{\phi}^{c}\rVert_{2,\mu}=O_{p}(n^{-1/4})) at the cost of requiring higher level of smoothness for Ψ0(𝐯0,θ)=E{I(Yθ>t)𝐕=𝐯0}\Psi_{0}(\bm{v}_{0},\theta)=E\{I(Y^{\theta}>t)\mid\bm{V}=\bm{v}_{0}\}. Specifically, the optimal bandwidth rate would be hn=n1r+2Jh_{n}=n^{-\frac{1}{r+2J}} with J>52rJ>\frac{5}{2}r.

Theorem 4.3.

Let Ψ0h(𝐯0,θ)=(P0Kh,𝐯0)1{Kh,𝐯0Ψ0(θ)}\Psi_{0h}(\bm{v}_{0},\theta)=(P_{0}K_{h,\bm{v}_{0}})^{-1}\{K_{h,\bm{v}_{0}}\Psi_{0}(\theta)\}. Suppose the support of 𝐖\bm{W} is uniformly bounded, i.e., 𝒲[0,τ]p\mathcal{W}\subseteq[0,\tau]^{p} for some finite constant τ\tau. Let Ψn,λnΨ\Psi_{n,\lambda_{n}^{\Psi}}, Gn,λnaaG^{a}_{n,\lambda_{n}^{a}} and Gn,λnccG^{c}_{n,\lambda_{n}^{c}} be highly adaptive lasso estimators of Ψ0\Psi_{0}, G0aG_{0}^{a} and G0cG_{0}^{c} with L1L_{1}-norm bounds equal to λnΨ\lambda_{n}^{\Psi}, λna\lambda_{n}^{a} and λnc\lambda_{n}^{c}, respectively. Under Assumptions  1, 2 and 3, the estimator Ψnh(𝐯0,θ)\Psi_{nh}(\bm{v}_{0},\theta) will be asymptotically linear. That is

Ψnh(𝒗0,θ)Ψ0h(𝒗0,θ)=(PnP0)Dh,𝒗0(P0)+op(n1/2),\Psi_{nh}(\bm{v}_{0},\theta)-\Psi_{0h}(\bm{v}_{0},\theta)=(P_{n}-P_{0})D_{h,\bm{v}_{0}}^{*}(P_{0})+o_{p}(n^{-1/2}),

where Gn,λna,λnc=Gn,λnaaGn,λnccG_{n,\lambda_{n}^{a},\lambda_{n}^{c}}=G^{a}_{n,\lambda_{n}^{a}}G^{c}_{n,\lambda_{n}^{c}} and Dh,𝐯0(P0)D_{h,\bm{v}_{0}}^{*}(P_{0}) is the corresponding efficient influence function defined in the proof of the theorem. Moreover, n{Ψnh(θ)Ψ0h(θ)}\sqrt{n}\{\Psi_{nh}(\theta)-\Psi_{0h}(\theta)\} converges weakly as a random element of the cadlag function space endowed with the supremum norm to a Gaussian process with covariance structure implied by the covariance function ρ(𝐯,𝐯)=P0Dh,𝐯(P0)Dh,𝐯(P0)\rho(\bm{v},\bm{v}^{\prime})=P_{0}D^{*}_{h,\bm{v}}(P_{0})D^{*}_{h,\bm{v}^{\prime}}(P_{0}).

Theorem 4.3 strengthens the results of Theorem 4.2 when the bandwidth is fixed. Specifically, it shows that for a fixed bandwidth h>0h>0, n{Ψnh(θ)Ψ0h(θ)}\sqrt{n}\{\Psi_{nh}(\theta)-\Psi_{0h}(\theta)\} convergences to a Gaussian process. This is an important results because in enables us to construct simultaneous confidence intervals for the entire curve Ψ0h(𝒗,θ),\Psi_{0h}(\bm{v},\theta), 𝒗𝒱\bm{v}\in\mathcal{V} and θΘ\theta\in\Theta.

Theorem 4.4.

Let θ0(𝐯)=argminθΘΨ0(𝐯,θ)\theta_{0}(\bm{v})=\operatorname*{argmin}_{\theta\in\Theta}\Psi_{0}(\bm{v},\theta) and θnh(𝐯)=argminθΘΨnh(𝐯,θ)\theta_{nh}(\bm{v})=\operatorname*{argmin}_{\theta\in\Theta}\Psi_{nh}(\bm{v},\theta) where both Ψ0\Psi_{0} and Ψnh\Psi_{nh} are defined in Theorem 4.2. Then, under Assumptions 1-4, θnh(𝐯0)θ0(𝐯0)2=Op(n(r4J)(2κ+4)(8J+4r)(3κ+8))\|\theta_{nh}(\bm{v}_{0})-\theta_{0}(\bm{v}_{0})\|_{2}=O_{p}\left(n^{\frac{(r-4J)(2\kappa+4)}{(8J+4r)(3\kappa+8)}}\right).

Characterizing the minimizer (or in general optimizer) of the estimated regimen-response curve is of interest as it forms an optimal individualized decision rule. Under the margin assumption (i.e., Assumption 4) and using a novel iterative empirical process theory, in Theorem 4.4, we derive the convergence rate of a minimizer of a nonparametrically estimated function Ψnh(𝒗,θ)\Psi_{nh}(\bm{v},\theta).

5 Simulation Studies

We examine the performance of our estimator through two simulation studies. Within these studies, we compare the results of our estimator against theoretical values and random forest based estimation, demonstrating the practical benefits of using undersmoothed highly adaptive lasso based estimation when the data generating mechanism is unknown.

In both scenarios WUnif(1,1)W\sim\text{Unif}(-1,1) and VUnif(1.3,1.3)V\sim\text{Unif}(-1.3,1.3). Additionally Δ|W,VBernoulli{expit(0.5W0.5V)}\Delta|W,V\sim\text{Bernoulli}\{\text{expit}(0.5W-0.5V)\} and Y|A,W,VBernoulli{expit(0.5W+V2+2VW2AVW)}.Y|A,W,V\sim\text{Bernoulli}\{\text{expit}(0.5W+V^{2}+2VW-2AVW)\}. For each scenario we sample n{240,480,960,2400}n\in\{240,480,960,2400\} observations, applying our estimator for each sample. The results displayed below give average effects 360 replicates of each scenario and sample size pair where we set ξ(θ,𝑽)=1\xi(\theta,\bm{V})=1. The latter implies that Ψn\Psi_{n} is obtained by solving a set of simpler score equations (compared with ξ(θ,𝑽)=G{(A,Δc)𝑽}\xi(\theta,\bm{V})=G\{(A,\Delta^{c})\mid\bm{V}\}) which may call for more undersmoothing in order to solve DCARΨ(Pn,Gn,Ψn)D_{CAR}^{\Psi}(P_{n},G_{n},\Psi_{n}).

The fundamental difference between the scenarios comes in the treatment assignment mechanism. In Scenario 1, we have ABernoulli(0.5)A\sim\text{Bernoulli}(0.5), giving us a randomized trial scenario, whereas in Scenario 2, we have ABernoulli{expit(0.7W+0.5V+0.5VW)}.A\sim\text{Bernoulli}\{\text{expit}(0.7W+0.5V+0.5VW)\}. The highly adaptive lasso estimate of Ga(AW,V)G^{a}(A\mid W,V) requires solving fewer score functions in Scenario 1 compared with Scenario 2. In fact the only score equation that to be solved in the former is Pn{AGna(AW,V)}=0P_{n}\{A-G^{a}_{n}(A\mid W,V)\}=0. Hence, higher level of undersmooting is required to solve DCARa(Pn,Gna,Qn,Ψn,θ)D_{CAR}^{a}(P_{n},G_{n}^{a},Q_{n},\Psi_{n},\theta). In general, the simpler the G{(A,Δc)W,V}G\{(A,\Delta^{c})\mid W,V\}, the more undersmooting is needs to solve DCARa(Pn,Gna,Qn,Ψn,θ)D_{CAR}^{a}(P_{n},G_{n}^{a},Q_{n},\Psi_{n},\theta) and DCARc(Pn,Gna,Qn,Ψn,θ)D_{CAR}^{c}(P_{n},G_{n}^{a},Q_{n},\Psi_{n},\theta).

In all simulations we are targeting the estimands Ψ0h(v0,θ)\Psi_{0h}({v}_{0},\theta) and Ψ0(v0,θ)\Psi_{0}({v}_{0},\theta) using the estimator Ψnh(v0,θ)\Psi_{nh}({v}_{0},\theta) where v0=0.5{v}_{0}=0.5 and we consider the bandwidths hn1/3×{0.5,1,1.5,2,2.5,3,4}h\in n^{-1/3}\times\{0.5,1,1.5,2,2.5,3,4\}. Recall that the optimal bandwidth rate is n12J+rn^{\frac{-1}{2J+r}} with J>rJ>r. In our simulation studies r=1r=1 and we are setting J=rJ=r leading to a rate n1/3n^{-1/3} which is faster than n1/5n^{-1/5} obtained by setting J=2>rJ=2>r. The choice of rate matters in cases where the target parameter is Ψ0\Psi_{0} and the goal is to show that the scaled biases remain constant and converges are nominal even under the faster rate of n1/3n^{-1/3} (Figures 2 and 3). Coverage results are also given for Ψ0(v0,θ)\Psi_{0}({v}_{0},\theta) where for each sample hnh_{n} is chosen from the possible bandwidths as described in Section 3.4. In all cases highly adaptive lasso was implemented using the R package hal9001 considering quadratic basis functions for up to 2-way interactions between all predictor variables.

Refer to caption
Figure 1: Simulation studies Scenario 1: The target parameter of interest is Ψ0h\Psi_{0h}. The plots show the scaled bias and coverage rate when the weight functions are estimated using an undersmoothed highly adaptive lasso (HAL) and a random forest (RF).

We estimate the weight functions (i.e., propensity score and censoring indicator models) using an undersmoothed highly adaptive lasso (HAL) and a random forest (RF). In both cases, Ψn\Psi_{n} is a highly adaptive lasso estimate of Ψ0\Psi_{0} obtained based on the loss function (1). Figure 1 shows the scaled bias and the coverage rate of the estimators when the target parameter is Ψ0h\Psi_{0h}. The results confirm our theoretical result presented in Theorem 4.3. The scaled bias if the estimator with undersmoothed HAL is considerably smaller than the one obtained using RF. Importantly, while the coverage rate of undersmoothed HAL estimator remains close to the nominal rate of 95%, the coverage rate of RF based estimator sharply declines as sample size increases. To assess our theoretical result of Theorem 4.2, we plotted the the scaled bias and the coverage rate of the estimators when the target parameter is Ψ0\Psi_{0}. Figure 2 shows that the undersmoothed HAL outperforms RF based estimators. Figures S5-S6 in the Supplementary Material show that similar results hold in Scenario 2. Finally, Figure 3 shows that our proposed data-adaptive bandwidth selector performs very well when combined with undersmoothed HAL estimators of weight functions. Specifically, even with smaller sample size of n=240n=240, our approach results in nearly 95% coverage rate and as the sample size increases it quickly recovers the nominal rate. Figure 4 displays Ψnh(v0=0.5,θ)\Psi_{nh}(v_{0}=0.5,\theta) against θ\theta values. As the sample size increases the bias reduces across all the bandwidth values. Figure S7 in the Supplementary Material shows the scaled bias of θnh\theta_{nh} when the weight functions are estimated using an undersmoothed highly adaptive lasso. The plot shows that the scaled bias is relatively constant across different sample sizes thereby confirming our result in Theorem 4.4 (by setting J=r=1J=r=1 and κ\kappa\rightarrow\infty).

Refer to caption
Figure 2: Simulation studies Scenario 1: The target parameter of interest is Ψ0\Psi_{0}. The plots show the scaled bias and coverage rate when the weight functions are estimated using an undersmoothed highly adaptive lasso (HAL) and a random forest (RF).
Refer to caption
Figure 3: Simulation studies: The target parameter of interest is Ψ0\Psi_{0}. The dotted line shows the nominal rate. The optimal bandwidth hh is selected using the proposed data adaptive approach. The plots show the coverage rates in Scenarios 1 and 2 where the weight functions are estimated using an undersmoothed highly adaptive lasso (HAL) and a random forest (RF).
Refer to caption
Figure 4: Simulation studies Scenario 1: The target parameter of interest is Ψ0\Psi_{0} (black solid line). The plots show Ψnh\Psi_{nh} for different bandwidths when the weight functions are estimated using an undersmoothed highly adaptive lasso and v0=0.5v_{0}=0.5.

6 Concluding Remarks

6.1 Multi-stage decision making

A natural and important extension of our work is to learn dynamic treatment regimens in multi-stage problems. In settings with many decision points (e.g., mobile health), it would be beneficial to consider a stochastic class of decision rules rather than a deterministic one (Kennedy, 2019; Luckett et al., 2019).

6.2 Score based undersmoothing criteria

The practical undersmoothing criteria proposed in Section 3.5 requires calculating the efficient influence function and deriving the relevant DCARD_{CAR} terms. The latter can be challenging and tedious, for example, in multi-stage settings (i.e., time-varying treatment) with many decision points. This motivates the derivation of score based criteria which requires only the score of the corresponding nuisance function. This will, in fact, resemble closely the theoretical undersmoothing conditions listed in Lemmas A.1, A.1 and A.3 in Appendix and Supplementary materials. For example, for the treatment assignment model, we can consider

λn,Ga=argminλB1b=1B[(s,j)𝒥n1βn,λ,bL1|Pn,b1S~s,j(ϕ,Gn,λ,b)|],\lambda_{n,G^{a}}=\operatorname*{argmin}_{\lambda}B^{-1}\sum_{b=1}^{B}\left[\sum_{(s,j)\in\mathcal{J}_{n}}\frac{1}{\lVert\beta_{n,\lambda,b}\rVert_{L_{1}}}\bigg{\lvert}P_{n,b}^{1}\tilde{S}_{s,j}(\phi,G_{n,\lambda,b})\bigg{\rvert}\right], (19)

in which βn,λL1=|βn,λ,0|+s{1,,d}j=1n|βn,λ,s,j|\lVert\beta_{n,\lambda}\rVert_{L_{1}}=\lvert\beta_{n,\lambda,0}\rvert+\sum_{s\subset\{1,\ldots,d\}}\sum_{j=1}^{n}\lvert\beta_{n,\lambda,s,j}\rvert is the L1L_{1}-norm of the coefficients βn,λ,s,j\beta_{n,\lambda,s,j} in the highly adaptive lasso estimator Gn,λG_{n,\lambda} for a given λ\lambda, and S~s,j(ϕ,Gn,λ,b)=ϕs,j(W){AGn,λ,b(1W)}{Gn,λ,b(1W)}1\tilde{S}_{s,j}(\phi,G_{n,\lambda,b})=\phi_{s,j}(W)\{A-G_{n,\lambda,b}(1\mid W)\}\{G_{n,\lambda,b}(1\mid W)\}^{-1}. Note that Ss,j(ϕ,Gn,λ,b)S_{s,j}(\phi,G_{n,\lambda,b}) involves the score equation for the propensity score model (i.e., ϕs,j(W){AGn,λ,b(1W)}\phi_{s,j}(W)\{A-G_{n,\lambda,b}(1\mid W)\}) and the inverse of the estimate propensity score. A more detailed discussion of the score based criteria is provided in Section 4.2 of Ertefaie, Hejazi and van der Laan (2022). However, the theoretical properties of these criteria and their performance in regimen-curve estimation problems are unknown and merit further research.

6.3 Near positively violation

In general, near positivity violation can negatively impact the performance of causal estimators. In particular, it can lead to an increased bias and variance in inverse probability weighted estimators and undersmoothing may exacerbate the issue. One possible remedy is to consider truncation where individuals whose estimated weight functions are smaller than a given cutpoint are removed from the analyses (Crump et al., 2009). There are also other methods to trim the population in an interpretable way (Traskin and Small, 2011; Fogarty et al., 2016). Another possible way to improve the performance of our approach under near positivity violation is to fit a highly adaptive lasso by enforcing the positivity min(Gna,1Gna)>γ\min(G^{a}_{n},1-G^{a}_{n})>\gamma and min(Gnc,1Gnc)>γ\min(G^{c}_{n},1-G^{c}_{n})>\gamma, for a positive constant γ\gamma. One can even make γ\gamma data dependent (i.e., consider γn\gamma_{n}). This is an interesting direction from both methodological and practical point of view.

6.4 Variable selection

In our proposed method, we assume that the set of variables included in regimen response-curve function and the decision rule (i.e., 𝑽\bm{V} and 𝑺\bm{S}, respectively) are a priori specified. Particularly, in health research, variable selection in decision making is important as it may reduce the treatment costs and burden on patients. Moreover, the quality of the constructed rules can be severely hampered by inclusion of many extraneous variables (Jones, Ertefaie and Strawderman, 2022). Therefore, it will be interesting to systematically investigate how to perform variable selection in regimen response-curve function and the decision rule and to provide theoretical guarantees.

Appendix A Theoretical undersmoothing conditions

Lemma A.1 (Theoretical undersmoothing conditions when hn0h_{n}\rightarrow 0).

Let ΨΨ(θ,𝐕)\Psi\equiv\Psi(\theta,\bm{V}), GaG(P)(A𝐖)G^{a}\equiv G(P)(A\mid\bm{W}) and GcG(P)(Δc𝐖,A)G^{c}\equiv G(P)(\Delta^{c}\mid\bm{W},A) denote E(YdθV)E(Y^{d^{\theta}}\mid V), the treatment and censoring mechanism under an arbitrary distribution PP\in\mathcal{M}. Let Ψn,λnΨ\Psi_{n,\lambda_{n}^{\Psi}}, Gn,λnaaG^{a}_{n,\lambda_{n}^{a}} and Gn,λnccG^{c}_{n,\lambda_{n}^{c}} be the highly adaptive lasso estimators of Ψ\Psi, GaG^{a} and GcG^{c} using L1L_{1}-norm bound λnΨ\lambda_{n}^{\Psi}, λna\lambda_{n}^{a} and λnc\lambda_{n}^{c}, respectively. Choosing λnΨ\lambda_{n}^{\Psi}, λna\lambda_{n}^{a} and λnc\lambda_{n}^{c} such that

min(s,j)𝒥nΨPnddΨn,λnΨL(Ψn,λnΨ)(ϕs,j)\displaystyle\min_{(s,j)\in\mathcal{J}_{n}^{\Psi}}{\bigg{\|}}P_{n}\frac{d}{d\Psi_{n,\lambda_{n}^{\Psi}}}L(\Psi_{n,\lambda_{n}^{\Psi}})(\phi_{s,j}){\bigg{\|}} =op(n1/2hr/2),\displaystyle=o_{p}\left(n^{-1/2}h^{r/2}\right), (20)
min(s,j)𝒥naPnddlogit Gn,λnaaL(logit Gn,λnaa)(ϕs,j)\displaystyle\min_{(s,j)\in\mathcal{J}_{n}^{a}}{\bigg{\|}}P_{n}\frac{d}{d\text{logit }G_{n,\lambda_{n}^{a}}^{a}}L(\text{logit }G_{n,\lambda_{n}^{a}}^{a})(\phi_{s,j}){\bigg{\|}} =op(n1/2hr/2),\displaystyle=o_{p}\left(n^{-1/2}h^{r/2}\right), (21)
min(s,j)𝒥ncPnddlogit Gn,λnccL(logit Gn,λncc)(ϕs,j)\displaystyle\min_{(s,j)\in\mathcal{J}_{n}^{c}}{\bigg{\|}}P_{n}\frac{d}{d\text{logit }G_{n,\lambda_{n}^{c}}^{c}}L(\text{logit }G_{n,\lambda_{n}^{c}}^{c})(\phi_{s,j}){\bigg{\|}} =op(n1/2hr/2),\displaystyle=o_{p}\left(n^{-1/2}h^{r/2}\right), (22)

where, in (20), L()L(\cdot) is the loss function (1), and, in (21), L()L(\cdot) is log-likelihood loss. Also, 𝒥n.\mathcal{J}_{n}^{.} is a set of indices for the basis functions with nonzero coefficients in the corresponding model. Let DΨ(fhΨ,Ψn)=fhΨ{I(Y>t)Ψn}D^{\Psi}(f_{h}^{\Psi},\Psi_{n})=f_{h}^{\Psi}\cdot\{I(Y>t)-\Psi_{n}\}, Da(fha,Gna)=fha(AGna)D^{a}(f_{h}^{a},G_{n}^{a})=f_{h}^{a}\cdot(A-G_{n}^{a}), and Dc(fc,Gnc)=fhc(ΔcGnc)D^{c}(f^{c},G_{n}^{c})=f_{h}^{c}\cdot(\Delta^{c}-G_{n}^{c}). The functions fhΨf_{h}^{\Psi}, fhaf_{h}^{a} and fhcf_{h}^{c} are such that f~Ψ=hrfhΨ\tilde{f}^{\Psi}=h^{r}f_{h}^{\Psi}, f~a=hrfha\tilde{f}^{a}=h^{r}f_{h}^{a} and f~c=hrfhc\tilde{f}^{c}=h^{r}f_{h}^{c} are càdlàg with finite sectional variation norm. Let f~ϕΨ\tilde{f}_{\phi}^{\Psi}, f~ϕa\tilde{f}_{\phi}^{a} and f~ϕc\tilde{f}_{\phi}^{c} be projections of f~Ψ\tilde{f}^{\Psi}, f~a\tilde{f}^{a} and f~c\tilde{f}^{c} onto the linear span of the basis functions ϕs,j\phi_{s,j} in L2(P)L^{2}(P), where ϕs,j\phi_{s,j} satisfies condition (20) and (21), respectively. Then, under the optimal bandwidth rate h=n1r+2Jh=n^{\frac{-1}{r+2J}} and J>rJ>r, we have PnDΨ(fhΨ,Ψn)=op((nhr)1/2)P_{n}D^{\Psi}({f}_{h}^{\Psi},\Psi_{n})=o_{p}\left((nh^{r})^{-1/2}\right), PnDa(fha,Gna)=op((nhr)1/2)P_{n}D^{a}({f}_{h}^{a},G_{n}^{a})=o_{p}\left((nh^{r})^{-1/2}\right) and PnDc(fhc,Gnc)=op((nhr)1/2)P_{n}D^{c}({f}_{h}^{c},G_{n}^{c})=o_{p}\left((nh^{r})^{-1/2}\right).

Lemma A.2 (Theoretical undersmoothing conditions for a fixed bandwidth hh).

Let ΨΨ(θ,𝐕)\Psi\equiv\Psi(\theta,\bm{V}), GaG(P)(A𝐖)G^{a}\equiv G(P)(A\mid\bm{W}) and GcG(P)(Δc𝐖,A)G^{c}\equiv G(P)(\Delta^{c}\mid\bm{W},A) denote E(YdθV)E(Y^{d^{\theta}}\mid V), the treatment and censoring mechanism under an arbitrary distribution PP\in\mathcal{M}. Let Ψn,λnΨ\Psi_{n,\lambda_{n}^{\Psi}}, Gn,λnaaG^{a}_{n,\lambda_{n}^{a}} and Gn,λnccG^{c}_{n,\lambda_{n}^{c}} be the highly adaptive lasso estimators of Ψ\Psi, GaG^{a} and GcG^{c} using L1L_{1}-norm bound λnΨ\lambda_{n}^{\Psi}, λna\lambda_{n}^{a} and λnc\lambda_{n}^{c}, respectively. Choosing λnΨ\lambda_{n}^{\Psi}, λna\lambda_{n}^{a} and λnc\lambda_{n}^{c} such that

min(s,j)𝒥nΨPnddΨn,λnΨL(Ψn,λnΨ)(ϕs,j)\displaystyle\min_{(s,j)\in\mathcal{J}_{n}^{\Psi}}{\bigg{\|}}P_{n}\frac{d}{d\Psi_{n,\lambda_{n}^{\Psi}}}L(\Psi_{n,\lambda_{n}^{\Psi}})(\phi_{s,j}){\bigg{\|}} =op(n1/2),\displaystyle=o_{p}\left(n^{-1/2}\right), (23)
min(s,j)𝒥naPnddlogit Gn,λnaaL(logit Gn,λnaa)(ϕs,j)\displaystyle\min_{(s,j)\in\mathcal{J}_{n}^{a}}{\bigg{\|}}P_{n}\frac{d}{d\text{logit }G_{n,\lambda_{n}^{a}}^{a}}L(\text{logit }G_{n,\lambda_{n}^{a}}^{a})(\phi_{s,j}){\bigg{\|}} =op(n1/2),\displaystyle=o_{p}\left(n^{-1/2}\right), (24)
min(s,j)𝒥ncPnddlogit Gn,λnccL(logit Gn,λncc)(ϕs,j)\displaystyle\min_{(s,j)\in\mathcal{J}_{n}^{c}}{\bigg{\|}}P_{n}\frac{d}{d\text{logit }G_{n,\lambda_{n}^{c}}^{c}}L(\text{logit }G_{n,\lambda_{n}^{c}}^{c})(\phi_{s,j}){\bigg{\|}} =op(n1/2),\displaystyle=o_{p}\left(n^{-1/2}\right), (25)

where, in (23), L()L(\cdot) is the loss function (1), and, in (24) and (25), L()L(\cdot) is log-likelihood loss. Also, 𝒥n.\mathcal{J}_{n}^{.} is a set of indices for the basis functions with nonzero coefficients in the corresponding model. Let DΨ(fΨ,Ψn)=fΨ{I(Y>t)Ψn}D^{\Psi}(f^{\Psi},\Psi_{n})=f^{\Psi}\cdot\{I(Y>t)-\Psi_{n}\}, Da(fa,Gna)=fa(AGna)D^{a}(f^{a},G_{n}^{a})=f^{a}\cdot(A-G_{n}^{a}), and Dc(fc,Gnc)=fc(ΔcGnc)D^{c}(f^{c},G_{n}^{c})=f^{c}\cdot(\Delta^{c}-G_{n}^{c}). The functions fΨf^{\Psi}, faf^{a} and fcf^{c} are càdlàg with finite sectional variation norm. Let f~ϕΨ\tilde{f}_{\phi}^{\Psi}, f~ϕa\tilde{f}_{\phi}^{a} and f~ϕc\tilde{f}_{\phi}^{c} be projections of f~Ψ\tilde{f}^{\Psi}, f~a\tilde{f}^{a} and f~c\tilde{f}^{c} onto the linear span of the basis functions ϕs,j\phi_{s,j} in L2(P)L^{2}(P), where ϕs,j\phi_{s,j} satisfies condition (23) and (24), respectively. Then, we have PnDΨ(fΨ,Ψn)=op(n1/2)P_{n}D^{\Psi}({f}^{\Psi},\Psi_{n})=o_{p}\left(n^{-1/2}\right), PnDa(fa,Gna)=op(n1/2)P_{n}D^{a}({f}^{a},G_{n}^{a})=o_{p}\left(n^{-1/2}\right) and PnDc(fc,Gnc)=op(n1/2)P_{n}D^{c}({f}^{c},G_{n}^{c})=o_{p}\left(n^{-1/2}\right).

Lemma A.3 (Theoretical undersmoothing conditions for uniform convergence with fixed bandwidth hh).

Let ΨΨ(θ,𝐕)\Psi\equiv\Psi(\theta,\bm{V}), GaG(P)(A𝐖)G^{a}\equiv G(P)(A\mid\bm{W}) and GcG(P)(Δc𝐖,A)G^{c}\equiv G(P)(\Delta^{c}\mid\bm{W},A) denote E(YdθV)E(Y^{d^{\theta}}\mid V), the treatment and censoring mechanism under an arbitrary distribution PP\in\mathcal{M}. Let Ψn,λnΨ\Psi_{n,\lambda_{n}^{\Psi}}, Gn,λnaaG^{a}_{n,\lambda_{n}^{a}} and Gn,λnccG^{c}_{n,\lambda_{n}^{c}} be the highly adaptive lasso estimators of Ψ\Psi, GaG^{a} and GcG^{c} using L1L_{1}-norm bound λnm\lambda_{n}^{m}, λna\lambda_{n}^{a} and λnc\lambda_{n}^{c}, respectively. Choosing λnΨ\lambda_{n}^{\Psi}, λna\lambda_{n}^{a} and λnc\lambda_{n}^{c} such that

min(s,j)𝒥nΨsupv𝒱PnddΨn,λnΨL(Ψn,λnΨ)(ϕs,j)\displaystyle\min_{(s,j)\in\mathcal{J}_{n}^{\Psi}}\sup_{v\in{\mathcal{V}}}{\bigg{\|}}P_{n}\frac{d}{d\Psi_{n,\lambda_{n}^{\Psi}}}L(\Psi_{n,\lambda_{n}^{\Psi}})(\phi_{s,j}){\bigg{\|}} =op(n1/2),\displaystyle=o_{p}(n^{-1/2}), (26)
min(s,j)𝒥nasupv𝒱Pnddlogit Gn,λnaaL(logit Gn,λnaa)(ϕs,j)\displaystyle\min_{(s,j)\in\mathcal{J}_{n}^{a}}\sup_{v\in{\mathcal{V}}}{\bigg{\|}}P_{n}\frac{d}{d\text{logit }G_{n,\lambda_{n}^{a}}^{a}}L(\text{logit }G_{n,\lambda_{n}^{a}}^{a})(\phi_{s,j}){\bigg{\|}} =op(n1/2),\displaystyle=o_{p}(n^{-1/2}), (27)
min(s,j)𝒥ncsupv𝒱Pnddlogit Gn,λnccL(logit Gn,λncc)(ϕs,j)\displaystyle\min_{(s,j)\in\mathcal{J}_{n}^{c}}\sup_{v\in{\mathcal{V}}}{\bigg{\|}}P_{n}\frac{d}{d\text{logit }G_{n,\lambda_{n}^{c}}^{c}}L(\text{logit }G_{n,\lambda_{n}^{c}}^{c})(\phi_{s,j}){\bigg{\|}} =op(n1/2),\displaystyle=o_{p}(n^{-1/2}), (28)

where, in (26), L()L(\cdot) is the loss function (1), and, in (27) and (28), L()L(\cdot) is log-likelihood loss. Also, 𝒥n.\mathcal{J}_{n}^{.} is a set of indices for the basis functions with nonzero coefficients in the corresponding model. Let DΨ(fΨ,Ψn)=fΨ{I(Y>t)Ψn}D^{\Psi}(f^{\Psi},\Psi_{n})=f^{\Psi}\cdot\{I(Y>t)-\Psi_{n}\}, Da(fa,Gna)=fa(AGna)D^{a}(f^{a},G_{n}^{a})=f^{a}\cdot(A-G_{n}^{a}), and Dc(fc,Gnc)=fc(ΔcGnc)D^{c}(f^{c},G_{n}^{c})=f^{c}\cdot(\Delta^{c}-G_{n}^{c}). Here, fΨf^{\Psi}, faf^{a} and fcf^{c} are càdlàg with finite sectional variation norm, and we let f~Ψ\tilde{f}^{\Psi}, f~a\tilde{f}^{a} and f~c\tilde{f}^{c} be projections of fΨf^{\Psi}, faf^{a} and fcf^{c} onto the linear span of the basis functions ϕs,j\phi_{s,j} in L2(P)L^{2}(P), where ϕs,j\phi_{s,j} satisfies condition (26) and (27), respectively. Assuming fΨf~Ψ2,P0=Op(n1/4)\lVert f^{\Psi}-\tilde{f}^{\Psi}\rVert_{2,P_{0}}=O_{p}(n^{-1/4}), faf~a2,P0=Op(n1/4)\lVert f^{a}-\tilde{f}^{a}\rVert_{2,P_{0}}=O_{p}(n^{-1/4}) and fcf~c2,P0=Op(n1/4)\lVert f^{c}-\tilde{f}^{c}\rVert_{2,P_{0}}=O_{p}(n^{-1/4}), it follows that PnDΨ(f~Ψ,Ψn)=op(n1/2)P_{n}D^{\Psi}(\tilde{f}^{\Psi},\Psi_{n})=o_{p}(n^{-1/2}), PnDa(f~a,Gna)=op(n1/2)P_{n}D^{a}(\tilde{f}^{a},G_{n}^{a})=o_{p}(n^{-1/2}), PnDc(f~c,Gnc)=op(n1/2)P_{n}D^{c}(\tilde{f}^{c},G_{n}^{c})=o_{p}(n^{-1/2}) PnDΨ(fΨ,Ψn)=op(n1/2)P_{n}D^{\Psi}({f}^{\Psi},\Psi_{n})=o_{p}(n^{-1/2}), PnDa(fa,Gna)=op(n1/2)P_{n}D^{a}({f}^{a},G_{n}^{a})=o_{p}(n^{-1/2}) and PnDc(fc,Gnc)=op(n1/2)P_{n}D^{c}({f}^{c},G_{n}^{c})=o_{p}(n^{-1/2}).

Appendix B Sectional variation norm

Suppose that the function of interest falls into a càdlàg class of functions on a p-dimensional cube [0,τ]p[0,\tau]\subset\mathcal{R}^{p} with finite sectional variation norm where τ\tau is the upper bound of all supports which is assumed to be finite. The p-dimensional cube [0,τ][0,\tau] can be represented as a union of lower dimensional cubes (i.e., ll-dimensional with lpl\leq p) plus the origin. That is [0,τ]={s(0,τs]}{0}[0,\tau]=\{\cup_{s}(0,\tau_{s}]\}\cup\{0\} where s\cup_{s} is over all subsets ss of {1,2,,p}\{1,2,\cdots,p\}.

Denote 𝒟[0,τ]\mathcal{D}[0,\tau] as the Banach space of pp-variate real-valued càdlàg functions on a cube [0,τ]p[0,\tau]\in\mathbb{R}^{p}. For each function f𝒟[0,τ]f\in\mathcal{D}[0,\tau], we define the ss-th section ff as fs(u)=f{u1I(1s),,udI(ps)}f_{s}(u)=f\{u_{1}I(1\in s),\cdots,u_{d}\in I(p\in s)\}. By definition, fs(u)f_{s}(u) varies along the variables in usu_{s} according to ff while setting other variables to zero. We define the sectional variation norm of a given ff as

fζ=|f(0)|+s{1,,p}0sτs|dfs(u)|,\lVert f\rVert^{*}_{\zeta}=\lvert f(0)\rvert+\sum_{s\subset\{1,\ldots,p\}}\int_{0_{s}}^{\tau_{s}}\lvert df_{s}(u)\rvert,

where the sum is over all subsets of the coordinates {0,1,,p}\{0,1,\ldots,p\}. The term 0sτs|dfs(u)|\int_{0_{s}}^{\tau_{s}}\lvert df_{s}(u)\rvert denotes the ss-specific variation norm. To clarify the definition, let’s consider a simple case of p=1p=1. In this case, the variation norm of function ff over [0,τ][0,\tau] is defined as

fζ=supQ𝒬i=1nq1|f(ci+1)f(ci)|\lVert f\rVert^{*}_{\zeta}=\sup_{Q\in\mathcal{Q}}\sum_{i=1}^{n_{q}-1}\lvert f(c_{i+1})-f(c_{i})\rvert

where c0=0c_{0}=0, QQ is a partition of [0,τ][0,\tau] such as (0,c1],(c1,c2],,(cn1,cnq=τ](0,c_{1}],(c_{1},c_{2}],\cdots,(c_{n-1},c_{n_{q}}=\tau] and 𝒬\mathcal{Q} is the class of all possible partitions. For differentiable functions ff, the sectional variation norm is fζ=0τ|f(u)|𝑑u\lVert f\rVert^{*}_{\zeta}=\int_{0}^{\tau}\lvert f^{\prime}(u)\rvert du.

For p=2p=2, the variation norm of function ff over [0,τ]2[0,\tau]\subset\mathcal{R}^{2} is defined as

fζ=supQ𝒬i=0nq11j=0nq21|f(ui+1,vj+1)f(ui,vj+1)f(ui+1,vj)+f(ui,vj)|\lVert f\rVert^{*}_{\zeta}=\sup_{Q\in\mathcal{Q}}\sum_{i=0}^{n_{q_{1}}-1}\sum_{j=0}^{n_{q_{2}}-1}\lvert f(u_{i+1},v_{j+1})-f(u_{i},v_{j+1})-f(u_{i+1},v_{j})+f(u_{i},v_{j})\rvert

where u0=v0=0u_{0}=v_{0}=0, QQ is a partition of [0,τ][0,\tau] such as (0,u1],(u1,u2],,(unq21,unq1=τ](0,u_{1}],(u_{1},u_{2}],\cdots,(u_{n_{q_{2}}-1},u_{n_{q_{1}}}=\tau], (0,v1],(v1,v2],,(vnq21,vnq2=τ](0,v_{1}],(v_{1},v_{2}],\cdots,(v_{n_{q_{2}}-1},v_{n_{q_{2}}}=\tau] and 𝒬\mathcal{Q} is the class of all possible partitions.

In practice, we choose the support points us,iu_{s,i} to be the observed values of 𝑾\bm{W} denoted as w~s,1,,w~s,n\tilde{w}_{s,1},\cdots,\tilde{w}_{s,n}. Hence the indicator basis functions will be ϕs,i(ws)=I(wsw~s,i)\phi_{s,i}(w_{s})=I(w_{s}\geq\tilde{w}_{s,i}). For simplicity lets consider p=1p=1, then, in a hypothetical example, the design matrix based on indicator basis functions would be

WW I(W1.45)I(W\geq 1.45) I(W0.84)I(W\geq 0.84) I(W1.0)I(W\geq 1.0) I(W0.16)I(W\geq 0.16) \cdots
1.45 1 1 0 1 \cdots
0.84 0 1 0 1 \cdots
1.0 1 1 1 1 \cdots
0.16 0 0 0 1 \cdots
\cdots \cdots \cdots \cdots \cdots
{funding}

The first author was supported by National Institute on Drug Abuse under award number R01DA048764, and National Institute of Neurological Disorders and Stroke under award numbers R33NS120240 and R61NS12024. The third author was supported in part by NIH Grant R01AI074345.

{supplement}\stitle

Proofs of the theoretical results S1-S7 \sdescriptionSupplements S1-S7 contain proofs of our theoretical results. {supplement} \stitleAdditional figures \sdescription Supplement S8 includes additional figures for the simulation studies.

References

  • Athey and Wager (2021) {barticle}[author] \bauthor\bsnmAthey, \bfnmSusan\binitsS. and \bauthor\bsnmWager, \bfnmStefan\binitsS. (\byear2021). \btitlePolicy learning with observational data. \bjournalEconometrica \bvolume89 \bpages133–161. \endbibitem
  • Audibert and Tsybakov (2007) {barticle}[author] \bauthor\bsnmAudibert, \bfnmJean-Yves\binitsJ.-Y. and \bauthor\bsnmTsybakov, \bfnmAlexandre B\binitsA. B. (\byear2007). \btitleFast learning rates for plug-in classifiers. \bjournalThe Annals of statistics \bvolume35 \bpages608–633. \endbibitem
  • Benkeser and Van Der Laan (2016) {binproceedings}[author] \bauthor\bsnmBenkeser, \bfnmDavid\binitsD. and \bauthor\bsnmVan Der Laan, \bfnmMark\binitsM. (\byear2016). \btitleThe highly adaptive lasso estimator. In \bbooktitleProceedings of the… International Conference on Data Science and Advanced Analytics. IEEE International Conference on Data Science and Advanced Analytics \bvolume2016 \bpages689. \bpublisherNIH Public Access. \endbibitem
  • Benkeser et al. (2017) {barticle}[author] \bauthor\bsnmBenkeser, \bfnmDavid\binitsD., \bauthor\bsnmCarone, \bfnmMarco\binitsM., \bauthor\bsnmLaan, \bfnmMJ Van Der\binitsM. V. D. and \bauthor\bsnmGilbert, \bfnmPB\binitsP. (\byear2017). \btitleDoubly robust nonparametric inference on the average treatment effect. \bjournalBiometrika \bvolume104 \bpages863–880. \endbibitem
  • Bibaut and van der Laan (2019) {barticle}[author] \bauthor\bsnmBibaut, \bfnmAurélien F\binitsA. F. and \bauthor\bparticlevan der \bsnmLaan, \bfnmMark J\binitsM. J. (\byear2019). \btitleFast rates for empirical risk minimization over c\\backslashadl\\backslashag functions with bounded sectional variation norm. \bjournalarXiv preprint arXiv:1907.09244. \endbibitem
  • Bickel et al. (1998) {bbook}[author] \bauthor\bsnmBickel, \bfnmPeter J\binitsP. J., \bauthor\bsnmKlaassen, \bfnmChris A\binitsC. A., \bauthor\bsnmBickel, \bfnmPeter J\binitsP. J., \bauthor\bsnmRitov, \bfnmY\binitsY., \bauthor\bsnmKlaassen, \bfnmJ\binitsJ., \bauthor\bsnmWellner, \bfnmJon A\binitsJ. A. and \bauthor\bsnmRitov, \bfnmYA’Acov\binitsY. (\byear1998). \btitleEfficient and adaptive estimation for semiparametric models \bvolume2. \bpublisherSpringer New York. \endbibitem
  • Chernozhukov et al. (2017) {barticle}[author] \bauthor\bsnmChernozhukov, \bfnmVictor\binitsV., \bauthor\bsnmChetverikov, \bfnmDenis\binitsD., \bauthor\bsnmDemirer, \bfnmMert\binitsM., \bauthor\bsnmDuflo, \bfnmEsther\binitsE., \bauthor\bsnmHansen, \bfnmChristian\binitsC. and \bauthor\bsnmNewey, \bfnmWhitney\binitsW. (\byear2017). \btitleDouble/debiased/neyman machine learning of treatment effects. \bjournalAmerican Economic Review \bvolume107 \bpages261–65. \endbibitem
  • Crump et al. (2009) {barticle}[author] \bauthor\bsnmCrump, \bfnmRichard K\binitsR. K., \bauthor\bsnmHotz, \bfnmV Joseph\binitsV. J., \bauthor\bsnmImbens, \bfnmGuido W\binitsG. W. and \bauthor\bsnmMitnik, \bfnmOscar A\binitsO. A. (\byear2009). \btitleDealing with limited overlap in estimation of average treatment effects. \bjournalBiometrika \bvolume96 \bpages187–199. \endbibitem
  • Dudík, Langford and Li (2011) {barticle}[author] \bauthor\bsnmDudík, \bfnmMiroslav\binitsM., \bauthor\bsnmLangford, \bfnmJohn\binitsJ. and \bauthor\bsnmLi, \bfnmLihong\binitsL. (\byear2011). \btitleDoubly robust policy evaluation and learning. \bjournalarXiv preprint arXiv:1103.4601. \endbibitem
  • Ertefaie, Hejazi and van der Laan (2022) {barticle}[author] \bauthor\bsnmErtefaie, \bfnmAshkan\binitsA., \bauthor\bsnmHejazi, \bfnmNima S\binitsN. S. and \bauthor\bparticlevan der \bsnmLaan, \bfnmMark J\binitsM. J. (\byear2022). \btitleNonparametric inverse probability weighted estimators based on the highly adaptive lasso. \bjournalBiometrics. \endbibitem
  • Ertefaie et al. (2015) {barticle}[author] \bauthor\bsnmErtefaie, \bfnmAshkan\binitsA., \bauthor\bsnmWu, \bfnmTianshuang\binitsT., \bauthor\bsnmLynch, \bfnmKevin G\binitsK. G. and \bauthor\bsnmNahum-Shani, \bfnmInbal\binitsI. (\byear2015). \btitleIdentifying a set that contains the best dynamic treatment regimes. \bjournalBiostatistics \bvolume17 \bpages135–148. \endbibitem
  • Fogarty et al. (2016) {barticle}[author] \bauthor\bsnmFogarty, \bfnmColin B\binitsC. B., \bauthor\bsnmMikkelsen, \bfnmMark E\binitsM. E., \bauthor\bsnmGaieski, \bfnmDavid F\binitsD. F. and \bauthor\bsnmSmall, \bfnmDylan S\binitsD. S. (\byear2016). \btitleDiscrete optimization for interpretable study populations and randomization inference in an observational study of severe sepsis mortality. \bjournalJournal of the American Statistical Association \bvolume111 \bpages447–458. \endbibitem
  • Gill, van der Laan and Wellner (1993) {bbook}[author] \bauthor\bsnmGill, \bfnmRichard D\binitsR. D., \bauthor\bparticlevan der \bsnmLaan, \bfnmMark J\binitsM. J. and \bauthor\bsnmWellner, \bfnmJon A\binitsJ. A. (\byear1993). \btitleInefficient estimators of the bivariate survival function for three models. \bpublisherRijksuniversiteit Utrecht. Mathematisch Instituut. \endbibitem
  • Joffe et al. (2004) {barticle}[author] \bauthor\bsnmJoffe, \bfnmMarshall M\binitsM. M., \bauthor\bsnmTen Have, \bfnmThomas R\binitsT. R., \bauthor\bsnmFeldman, \bfnmHarold I\binitsH. I. and \bauthor\bsnmKimmel, \bfnmStephen E\binitsS. E. (\byear2004). \btitleModel selection, confounder control, and marginal structural models: review and new applications. \bjournalThe American Statistician \bvolume58 \bpages272–279. \endbibitem
  • Jones, Ertefaie and Strawderman (2022) {barticle}[author] \bauthor\bsnmJones, \bfnmJeremiah\binitsJ., \bauthor\bsnmErtefaie, \bfnmAshkan\binitsA. and \bauthor\bsnmStrawderman, \bfnmRobert L\binitsR. L. (\byear2022). \btitleValid post-selection inference in Robust Q-learning. \bjournalarXiv preprint arXiv:2208.03233. \endbibitem
  • Kallus (2018) {barticle}[author] \bauthor\bsnmKallus, \bfnmNathan\binitsN. (\byear2018). \btitleBalanced policy evaluation and learning. \bjournalAdvances in neural information processing systems \bvolume31. \endbibitem
  • Kennedy (2019) {barticle}[author] \bauthor\bsnmKennedy, \bfnmEdward H\binitsE. H. (\byear2019). \btitleNonparametric causal effects based on incremental propensity score interventions. \bjournalJournal of the American Statistical Association \bvolume114 \bpages645–656. \endbibitem
  • Klaassen (1987) {barticle}[author] \bauthor\bsnmKlaassen, \bfnmChris AJ\binitsC. A. (\byear1987). \btitleConsistent estimation of the influence function of locally asymptotically linear estimators. \bjournalThe Annals of Statistics \bvolume15 \bpages1548–1562. \endbibitem
  • Liu et al. (2018) {barticle}[author] \bauthor\bsnmLiu, \bfnmYing\binitsY., \bauthor\bsnmWang, \bfnmYuanjia\binitsY., \bauthor\bsnmKosorok, \bfnmMichael R\binitsM. R., \bauthor\bsnmZhao, \bfnmYingqi\binitsY. and \bauthor\bsnmZeng, \bfnmDonglin\binitsD. (\byear2018). \btitleAugmented outcome-weighted learning for estimating optimal dynamic treatment regimens. \bjournalStatistics in medicine \bvolume37 \bpages3776–3788. \endbibitem
  • Luckett et al. (2019) {barticle}[author] \bauthor\bsnmLuckett, \bfnmDaniel J\binitsD. J., \bauthor\bsnmLaber, \bfnmEric B\binitsE. B., \bauthor\bsnmKahkoska, \bfnmAnna R\binitsA. R., \bauthor\bsnmMaahs, \bfnmDavid M\binitsD. M., \bauthor\bsnmMayer-Davis, \bfnmElizabeth\binitsE. and \bauthor\bsnmKosorok, \bfnmMichael R\binitsM. R. (\byear2019). \btitleEstimating dynamic treatment regimes in mobile health using v-learning. \bjournalJournal of the American Statistical Association. \endbibitem
  • Mukherjee, Newey and Robins (2017) {barticle}[author] \bauthor\bsnmMukherjee, \bfnmRajarshi\binitsR., \bauthor\bsnmNewey, \bfnmWhitney K\binitsW. K. and \bauthor\bsnmRobins, \bfnmJames M\binitsJ. M. (\byear2017). \btitleSemiparametric efficient empirical higher order influence function estimators. \bjournalarXiv preprint arXiv:1705.07577. \endbibitem
  • Murphy et al. (2001) {barticle}[author] \bauthor\bsnmMurphy, \bfnmSusan A\binitsS. A., \bauthor\bparticlevan der \bsnmLaan, \bfnmMark J\binitsM. J., \bauthor\bsnmRobins, \bfnmJames M\binitsJ. M. and \bauthor\bsnmGroup, \bfnmConduct Problems Prevention Research\binitsC. P. P. R. (\byear2001). \btitleMarginal mean models for dynamic regimes. \bjournalJournal of the American Statistical Association \bvolume96 \bpages1410–1423. \endbibitem
  • Neugebauer and van der Laan (2007) {barticle}[author] \bauthor\bsnmNeugebauer, \bfnmRomain\binitsR. and \bauthor\bparticlevan der \bsnmLaan, \bfnmMark\binitsM. (\byear2007). \btitleNonparametric causal effects based on marginal structural models. \bjournalJournal of Statistical Planning and Inference \bvolume137 \bpages419–434. \endbibitem
  • Nie and Wager (2021) {barticle}[author] \bauthor\bsnmNie, \bfnmXinkun\binitsX. and \bauthor\bsnmWager, \bfnmStefan\binitsS. (\byear2021). \btitleQuasi-oracle estimation of heterogeneous treatment effects. \bjournalBiometrika \bvolume108 \bpages299–319. \endbibitem
  • Orellana, Rotnitzky and Robins (2010) {barticle}[author] \bauthor\bsnmOrellana, \bfnmLiliana\binitsL., \bauthor\bsnmRotnitzky, \bfnmAndrea\binitsA. and \bauthor\bsnmRobins, \bfnmJames M\binitsJ. M. (\byear2010). \btitleDynamic regime marginal structural mean models for estimation of optimal dynamic treatment regimes, part I: main content. \bjournalThe international journal of biostatistics \bvolume6. \endbibitem
  • Petersen et al. (2014) {barticle}[author] \bauthor\bsnmPetersen, \bfnmMaya\binitsM., \bauthor\bsnmSchwab, \bfnmJoshua\binitsJ., \bauthor\bsnmGruber, \bfnmSusan\binitsS., \bauthor\bsnmBlaser, \bfnmNello\binitsN., \bauthor\bsnmSchomaker, \bfnmMichael\binitsM. and \bauthor\bparticlevan der \bsnmLaan, \bfnmMark\binitsM. (\byear2014). \btitleTargeted maximum likelihood estimation for dynamic and static longitudinal marginal structural working models. \bjournalJournal of causal inference \bvolume2 \bpages147–185. \endbibitem
  • Robins (1998) {barticle}[author] \bauthor\bsnmRobins, \bfnmJM\binitsJ. (\byear1998). \btitleMarginal structural models. 1997 proceedings of the American Statistical Association, section on Bayesian statistical science (pp. 1–10). \bjournalRetrieved from. \endbibitem
  • Robins (2004) {binproceedings}[author] \bauthor\bsnmRobins, \bfnmJames M\binitsJ. M. (\byear2004). \btitleOptimal structural nested models for optimal sequential decisions. In \bbooktitleProceedings of the second seattle Symposium in Biostatistics \bpages189–326. \bpublisherSpringer. \endbibitem
  • Robins, Hernan and Brumback (2000) {bmisc}[author] \bauthor\bsnmRobins, \bfnmJames M\binitsJ. M., \bauthor\bsnmHernan, \bfnmMiguel Angel\binitsM. A. and \bauthor\bsnmBrumback, \bfnmBabette\binitsB. (\byear2000). \btitleMarginal structural models and causal inference in epidemiology. \endbibitem
  • Robins, Orellana and Rotnitzky (2008) {barticle}[author] \bauthor\bsnmRobins, \bfnmJames M.\binitsJ. M., \bauthor\bsnmOrellana, \bfnmL.\binitsL. and \bauthor\bsnmRotnitzky, \bfnmA.\binitsA. (\byear2008). \btitleEstimation and extrapolation of optimal treatment and testing strategies. \bjournalStatistics in Medicine \bvolume27 \bpages4678–4721. \endbibitem
  • Robins et al. (2008) {barticle}[author] \bauthor\bsnmRobins, \bfnmJames\binitsJ., \bauthor\bsnmLi, \bfnmLingling\binitsL., \bauthor\bsnmTchetgen, \bfnmEric\binitsE., \bauthor\bparticlevan der \bsnmVaart, \bfnmAad\binitsA. \betalet al. (\byear2008). \btitleHigher order influence functions and minimax estimation of nonlinear functionals. \bjournalProbability and statistics: essays in honor of David A. Freedman \bvolume2 \bpages335–421. \endbibitem
  • Robins et al. (2017) {barticle}[author] \bauthor\bsnmRobins, \bfnmJames M\binitsJ. M., \bauthor\bsnmLi, \bfnmLingling\binitsL., \bauthor\bsnmMukherjee, \bfnmRajarshi\binitsR., \bauthor\bsnmTchetgen, \bfnmEric Tchetgen\binitsE. T. and \bauthor\bparticlevan der \bsnmVaart, \bfnmAad\binitsA. (\byear2017). \btitleMinimax estimation of a functional on a structured high-dimensional model. \bjournalThe Annals of Statistics \bvolume45 \bpages1951–1987. \endbibitem
  • Rosenblum (2011) {bincollection}[author] \bauthor\bsnmRosenblum, \bfnmMichael\binitsM. (\byear2011). \btitleMarginal structural models. In \bbooktitleTargeted Learning \bpages145–160. \bpublisherSpringer. \endbibitem
  • Rubin and van der Laan (2012) {barticle}[author] \bauthor\bsnmRubin, \bfnmDaniel B\binitsD. B. and \bauthor\bparticlevan der \bsnmLaan, \bfnmMark J\binitsM. J. (\byear2012). \btitleStatistical issues and limitations in personalized medicine research with clinical trials. \bjournalThe international journal of biostatistics \bvolume8 \bpages18. \endbibitem
  • Shinozaki and Suzuki (2020) {barticle}[author] \bauthor\bsnmShinozaki, \bfnmTomohiro\binitsT. and \bauthor\bsnmSuzuki, \bfnmEtsuji\binitsE. (\byear2020). \btitleUnderstanding marginal structural models for time-varying exposures: pitfalls and tips. \bjournalJournal of epidemiology \bpagesJE20200226. \endbibitem
  • Swaminathan and Joachims (2015) {barticle}[author] \bauthor\bsnmSwaminathan, \bfnmAdith\binitsA. and \bauthor\bsnmJoachims, \bfnmThorsten\binitsT. (\byear2015). \btitleBatch learning from logged bandit feedback through counterfactual risk minimization. \bjournalThe Journal of Machine Learning Research \bvolume16 \bpages1731–1755. \endbibitem
  • Traskin and Small (2011) {barticle}[author] \bauthor\bsnmTraskin, \bfnmMikhail\binitsM. and \bauthor\bsnmSmall, \bfnmDylan S\binitsD. S. (\byear2011). \btitleDefining the study population for an observational study to ensure sufficient overlap: a tree approach. \bjournalStatistics in Biosciences \bvolume3 \bpages94–118. \endbibitem
  • Tsybakov (2004) {barticle}[author] \bauthor\bsnmTsybakov, \bfnmAlexander B\binitsA. B. (\byear2004). \btitleOptimal aggregation of classifiers in statistical learning. \bjournalThe Annals of Statistics \bvolume32 \bpages135–166. \endbibitem
  • Vaart and Wellner (1996) {bbook}[author] \bauthor\bsnmVaart, \bfnmAad W\binitsA. W. and \bauthor\bsnmWellner, \bfnmJon A\binitsJ. A. (\byear1996). \btitleWeak convergence and empirical processes: with applications to statistics. \bpublisherSpringer. \endbibitem
  • van der Laan (2006) {barticle}[author] \bauthor\bparticlevan der \bsnmLaan, \bfnmMark J\binitsM. J. (\byear2006). \btitleCausal effect models for intention to treat and realistic individualized treatment rules. \endbibitem
  • van der Laan (2017) {barticle}[author] \bauthor\bparticlevan der \bsnmLaan, \bfnmMark\binitsM. (\byear2017). \btitleA generally efficient targeted minimum loss based estimator based on the highly adaptive lasso. \bjournalThe international journal of biostatistics \bvolume13. \endbibitem
  • van der Laan and Bibaut (2017) {barticle}[author] \bauthor\bparticlevan der \bsnmLaan, \bfnmMark J\binitsM. J. and \bauthor\bsnmBibaut, \bfnmAurélien F\binitsA. F. (\byear2017). \btitleUniform consistency of the highly adaptive lasso estimator of infinite dimensional parameters. \bjournalarXiv preprint arXiv:1709.06256. \endbibitem
  • van der Laan, Bibaut and Luedtke (2018) {bincollection}[author] \bauthor\bparticlevan der \bsnmLaan, \bfnmMark J\binitsM. J., \bauthor\bsnmBibaut, \bfnmAurélien\binitsA. and \bauthor\bsnmLuedtke, \bfnmAlexander R\binitsA. R. (\byear2018). \btitleCV-TMLE for nonpathwise differentiable target parameters. In \bbooktitleTargeted Learning in Data Science \bpages455–481. \bpublisherSpringer. \endbibitem
  • Van der Laan, Laan and Robins (2003) {bbook}[author] \bauthor\bparticleVan der \bsnmLaan, \bfnmMark J\binitsM. J., \bauthor\bsnmLaan, \bfnmMJ\binitsM. and \bauthor\bsnmRobins, \bfnmJames M\binitsJ. M. (\byear2003). \btitleUnified methods for censored longitudinal data and causality. \bpublisherSpringer Science & Business Media. \endbibitem
  • van der Laan and Luedtke (2014) {barticle}[author] \bauthor\bparticlevan der \bsnmLaan, \bfnmMark J\binitsM. J. and \bauthor\bsnmLuedtke, \bfnmAlexander R\binitsA. R. (\byear2014). \btitleTargeted learning of an optimal dynamic treatment, and statistical inference for its mean outcome. \bjournalU.C. Berkeley Division of Biostatistics Working Paper Series. \endbibitem
  • van der Laan and Petersen (2007) {barticle}[author] \bauthor\bparticlevan der \bsnmLaan, \bfnmMark J\binitsM. J. and \bauthor\bsnmPetersen, \bfnmMaya L\binitsM. L. (\byear2007). \btitleStatistical learning of origin-specific statically optimal individualized treatment rules. \bjournalThe international journal of biostatistics \bvolume3. \endbibitem
  • van Der Laan and Rose (2018) {bbook}[author] \bauthor\bparticlevan \bsnmDer Laan, \bfnmMark J\binitsM. J. and \bauthor\bsnmRose, \bfnmSherri\binitsS. (\byear2018). \btitleTargeted Learning in Data Science: Causal Inference for Complex Longitudinal Studies. \bpublisherSpringer. \endbibitem
  • Wahed and Tsiatis (2004) {barticle}[author] \bauthor\bsnmWahed, \bfnmAbdus S\binitsA. S. and \bauthor\bsnmTsiatis, \bfnmAnastasios A\binitsA. A. (\byear2004). \btitleOptimal estimator for the survival distribution and related quantities for treatment policies in two-stage randomization designs in clinical trials. \bjournalBiometrics \bvolume60 \bpages124–133. \endbibitem
  • Wasserman (2006) {bbook}[author] \bauthor\bsnmWasserman, \bfnmLarry\binitsL. (\byear2006). \btitleAll of nonparametric statistics. \bpublisherSpringer Science & Business Media. \endbibitem
  • Zhang et al. (2012) {barticle}[author] \bauthor\bsnmZhang, \bfnmBaqun\binitsB., \bauthor\bsnmTsiatis, \bfnmAnastasios A\binitsA. A., \bauthor\bsnmLaber, \bfnmEric B\binitsE. B. and \bauthor\bsnmDavidian, \bfnmMarie\binitsM. (\byear2012). \btitleA robust method for estimating optimal treatment regimes. \bjournalBiometrics \bvolume68 \bpages1010–1018. \endbibitem
  • Zhao et al. (2012) {barticle}[author] \bauthor\bsnmZhao, \bfnmYingqi\binitsY., \bauthor\bsnmZeng, \bfnmDonglin\binitsD., \bauthor\bsnmRush, \bfnmA John\binitsA. J. and \bauthor\bsnmKosorok, \bfnmMichael R\binitsM. R. (\byear2012). \btitleEstimating individualized treatment rules using outcome weighted learning. \bjournalJournal of the American Statistical Association \bvolume107 \bpages1106–1118. \endbibitem
  • Zhao et al. (2015) {barticle}[author] \bauthor\bsnmZhao, \bfnmYing-Qi\binitsY.-Q., \bauthor\bsnmZeng, \bfnmDonglin\binitsD., \bauthor\bsnmLaber, \bfnmEric B\binitsE. B. and \bauthor\bsnmKosorok, \bfnmMichael R\binitsM. R. (\byear2015). \btitleNew statistical learning methods for estimating optimal dynamic treatment regimes. \bjournalJournal of the American Statistical Association \bvolume110 \bpages583–598. \endbibitem
  • Zhao et al. (2022) {barticle}[author] \bauthor\bsnmZhao, \bfnmQingyuan\binitsQ., \bauthor\bsnmSmall, \bfnmDylan S\binitsD. S., \bauthor\bsnmErtefaie, \bfnmAshkan\binitsA. \betalet al. (\byear2022). \btitleSelective inference for effect modification via the lasso. \bjournalJournal of the Royal Statistical Society Series B \bvolume84 \bpages382–413. \endbibitem
  • Zheng and van der Laan (2011) {bincollection}[author] \bauthor\bsnmZheng, \bfnmWenjing\binitsW. and \bauthor\bparticlevan der \bsnmLaan, \bfnmMark J\binitsM. J. (\byear2011). \btitleCross-validated targeted minimum-loss-based estimation. In \bbooktitleTargeted Learning \bpages459–474. \bpublisherSpringer. \endbibitem
\doublespacing

Supplement to “Nonparametric estimation of a covariate-adjusted counterfactual treatment regimen response curve”

Appendix S1 Proof of Lemmas A.1, A.2 and A.3

Proof of Lemma A.1.

We first proof the result for the treatment indicator and GaG^{a}. Let f~a=hrfha\tilde{f}^{a}=h^{r}f_{h}^{a} and Da{f~a,Gn,λna}=f~a(AGn,λna)D^{a}\{\tilde{f}^{a},G^{a}_{n,\lambda_{n}}\}=\tilde{f}^{a}\cdot(A-G^{a}_{n,\lambda_{n}}), where f~a\tilde{f}^{a} is a càdlàg function with a finite sectional variation norm, and let f~ϕa\tilde{f}_{\phi}^{a} be an approximation of f~a\tilde{f}^{a} using the basis functions ϕ\phi that satisfy condition (21).

PnDa(fha,Gn,λna)\displaystyle P_{n}D^{a}(f_{h}^{a},G^{a}_{n,\lambda_{n}}) =hrPnDa(f~a,Gn,λna)\displaystyle=h^{-r}P_{n}D^{a}(\tilde{f}^{a},G^{a}_{n,\lambda_{n}})
=hrPnDa(f~af~ϕa,Gn,λna)+hrPnDa(f~ϕa,Gn,λna)\displaystyle=h^{-r}P_{n}D^{a}(\tilde{f}^{a}-\tilde{f}_{\phi}^{a},G^{a}_{n,\lambda_{n}})+h^{-r}P_{n}D^{a}(\tilde{f}_{\phi}^{a},G^{a}_{n,\lambda_{n}})
=hr(PnP0)Da(f~af~ϕa,Gn,λna)+hrP0Da(f~af~ϕa,Gn,λna)+hrPnDa(f~ϕa,Gn,λna)\displaystyle=h^{-r}(P_{n}-P_{0})D^{a}(\tilde{f}^{a}-\tilde{f}_{\phi}^{a},G^{a}_{n,\lambda_{n}})+h^{-r}P_{0}D^{a}(\tilde{f}^{a}-\tilde{f}_{\phi}^{a},G^{a}_{n,\lambda_{n}})+h^{-r}P_{n}D^{a}(\tilde{f}_{\phi}^{a},G^{a}_{n,\lambda_{n}})
=hrOp(n1/2n1/6)+hrOp(n1/3n1/3)+hrPnDa(f~ϕa,Gn,λna).\displaystyle=h^{-r}O_{p}(n^{-1/2}n^{-1/6})+h^{-r}O_{p}(n^{-1/3}n^{-1/3})+h^{-r}P_{n}D^{a}(\tilde{f}_{\phi}^{a},G^{a}_{n,\lambda_{n}}).

The last equality follows from f~af~ϕa2=Op(n1/3)\|\tilde{f}^{a}-\tilde{f}_{\phi}^{a}\|_{2}=O_{p}(n^{-1/3}) and Lemma S1 as

(PnP0)Da(f~af~ϕa,Gn,λna)\displaystyle(P_{n}-P_{0})D^{a}(\tilde{f}^{a}-\tilde{f}_{\phi}^{a},G^{a}_{n,\lambda_{n}}) supf~af~ϕa2n1/3|(PnP0)Da(f~af~ϕa,Gn,λna)|\displaystyle\leq\sup_{\|\tilde{f}^{a}-\tilde{f}_{\phi}^{a}\|_{2}\leq n^{-1/3}}|(P_{n}-P_{0})D^{a}(\tilde{f}^{a}-\tilde{f}_{\phi}^{a},G^{a}_{n,\lambda_{n}})|
=Op{n1/2(n1/3,L2(P))}Op(n1/2n1/6).\displaystyle=O_{p}\{n^{-1/2}\mathcal{E}(n^{-1/3},L^{2}(P))\}\leq O_{p}(n^{-1/2}n^{-1/6}).

Moreover, by the convergence rate of the highly adaptive lasso estimate (i.e., G0aGn,λna2,P0=Op(n1/3)\lVert G^{a}_{0}-G^{a}_{n,\lambda_{n}}\rVert_{2,P_{0}}=O_{p}(n^{-1/3})),

P0Da(f~af~ϕa,Gn,λna)=P0(f~af~ϕa)(G0aGn,λna)=Op(n1/3)Op(n1/3).P_{0}D^{a}(\tilde{f}^{a}-\tilde{f}_{\phi}^{a},G^{a}_{n,\lambda_{n}})=P_{0}(\tilde{f}^{a}-\tilde{f}_{\phi}^{a})(G_{0}^{a}-G^{a}_{n,\lambda_{n}})=O_{p}(n^{-1/3})O_{p}(n^{-1/3}).

Therefore, the proof is complete if we show

hrOp(n1/2n1/6)+hrOp(n1/3n1/3)+hrPnDa(f~ϕa,Gn,λna)=op((nhr)1/2).h^{-r}O_{p}(n^{-1/2}n^{-1/6})+h^{-r}O_{p}(n^{-1/3}n^{-1/3})+h^{-r}P_{n}D^{a}(\tilde{f}_{\phi}^{a},G^{a}_{n,\lambda_{n}})=o_{p}\left((nh^{r})^{-1/2}\right).

Considering the optimal bandwidth rate of hr=nrr+2Jh^{r}=n^{\frac{-r}{r+2J}}, the first term satisfies the desired rate when hrn1/2n1/6<n1/2hr/2h^{-r}n^{-1/2}n^{-1/6}<n^{-1/2}h^{-r/2} which implies 16<r2r+4J\frac{-1}{6}<\frac{-r}{2r+4J}. The inequality is satisfied for J>rJ>r. We can similarly show that hrOp(n2/3)=op((nhr)1/2)h^{-r}O_{p}(n^{-2/3})=o_{p}\left((nh^{r})^{-1/2}\right) for J>rJ>r. Hence, for J>rJ>r,

PnDa(fha,Gn,λna)=op((nhr)1/2)+hrPnDa(f~ϕa,Gn,λna).P_{n}D^{a}(f_{h}^{a},G^{a}_{n,\lambda_{n}})=o_{p}\left((nh^{r})^{-1/2}\right)+h^{-r}P_{n}D^{a}(\tilde{f}_{\phi}^{a},G^{a}_{n,\lambda_{n}}).

In the following we show that hrPnDa(f~ϕa,Gn,λna)h^{-r}P_{n}D^{a}(\tilde{f}_{\phi}^{a},G^{a}_{n,\lambda_{n}}) is also of the order op((nhr)1/2)o_{p}\left((nh^{r})^{-1/2}\right).

For simplicity of notation, let Ga=logit GaG^{a\dagger}=\text{logit }G^{a}. Define a set of score functions generated by a path {1+ϵg(s,j)}βn,s,j\{1+\epsilon g(s,j)\}\beta_{n,s,j} for a uniformly bounded vector gg as

Sg(Gn,λna)=ddGn,λnaL(Gn,λna){(s,j)g(s,j)βn,s,jϕs,j}.S_{g}(G^{a}_{n,\lambda_{n}})=\frac{d}{dG^{a\dagger}_{n,\lambda_{n}}}L(G^{a\dagger}_{n,\lambda_{n}})\left\{\sum_{(s,j)}g(s,j)\beta_{n,s,j}\phi_{s,j}\right\}.

When L()L(\cdot) is the log-likelihood loss function, L(G)=Alog(G1G)+log(1G)L(G)=A\log\left(\frac{G}{1-G}\right)+\log(1-G). Thus, L(Ga)=AlogGa+log(Ga1)L(G^{a\dagger})=A\log G^{a\dagger}+\log(G^{a\dagger}-1) and Sg(Ga)=(AGn,λna){(s,j)g(s,j)βn,s,jϕs,j}S_{g}(G^{a})=(A-G^{a}_{n,\lambda_{n}})\left\{\sum_{(s,j)}g(s,j)\beta_{n,s,j}\phi_{s,j}\right\}. Let r(g,Gn,λna)=(s,j)g(s,j)|βn,s,j|r(g,G^{a}_{n,\lambda_{n}})=\sum_{(s,j)}g(s,j)\lvert\beta_{n,s,j}\rvert. For small enough ϵ\epsilon,

(s,j)|{1+ϵg(s,j)}βn,s,j|\displaystyle\sum_{(s,j)}\lvert\{1+\epsilon g(s,j)\}\beta_{n,s,j}\rvert =(s,j){1+ϵg(s,j)}|βn,s,j|\displaystyle=\sum_{(s,j)}\{1+\epsilon g(s,j)\}\lvert\beta_{n,s,j}\rvert
=(s,j)|βn,s,j|+ϵr(g,Gn,λn).\displaystyle=\sum_{(s,j)}\lvert\beta_{n,s,j}\rvert+\epsilon r(g,G_{n,\lambda_{n}}).

Hence, for any gg satisfying r(g,Gn,λn)=0r(g,G_{n,\lambda_{n}})=0, we have PnSg(Gn,λna)=0P_{n}S_{g}(G^{a}_{n,\lambda_{n}})=0.

Because Da{f~ϕa,Gn,λna}{Sg(Gn,λna):g<}D^{a}\{\tilde{f}_{\phi}^{a},G^{a}_{n,\lambda_{n}}\}\in\{S_{g}(G^{a}_{n,\lambda_{n}}):\lVert g\rVert_{\infty}<\infty\}, there exists gg^{\star} such that Da(f~ϕa,Gn,λna)=Sg(Gn,λna)D^{a}(\tilde{f}_{\phi}^{a},G^{a}_{n,\lambda_{n}})=S_{g^{\star}}(G^{a}_{n,\lambda_{n}}). However, for this particular choice of gg^{\star}, r(g,Gn,λna)r(g,G^{a}_{n,\lambda_{n}}) may not be zero. Now, define gg such that g(s,j)=g(s,j)g(s,j)=g^{\star}(s,j) for (s,j)(s,j)(s,j)\neq(s^{\star},j^{\star}); g~(s,j)\tilde{g}(s^{\star},j^{\star}) is defined such that

(s,j)(s,j)g(s,j)|βn,s,j|+g(s,j)|βn,s,j|=0.\displaystyle\sum_{(s,j)\neq(s^{\star},j^{\star})}g^{\star}(s,j)\lvert\beta_{n,s,j}\rvert+g(s^{\star},j^{\star})\lvert\beta_{n,s^{\star},j^{\star}}\rvert=0. (S29)

That is, gg matches gg^{\star} everywhere but for a single point (s,j)(s^{\star},j^{\star}), where it is forced to take a value such that r(g,Gn,λn)=0r(g,G_{n,\lambda_{n}})=0. As a result, for such a choice of gg, PnSg(Gn,λna)=0P_{n}S_{g}(G^{a}_{n,\lambda_{n}})=0 by definition. Below, we show that PnSg(Gn,λna)PnDa{f~ϕa,Gn,λna}=op(n1/2hr/2)P_{n}S_{g}(G^{a}_{n,\lambda_{n}})-P_{n}D^{a}\{\tilde{f}_{\phi}^{a},G^{a}_{n,\lambda_{n}}\}=o_{p}\left(n^{-1/2}h^{r/2}\right) which then implies that hrPnDa{f~ϕa,Gn,λna}=op((nhr)1/2)h^{-r}P_{n}D^{a}\{\tilde{f}_{\phi}^{a},G^{a}_{n,\lambda_{n}}\}=o_{p}\left((nh^{r})^{-1/2}\right). We note that the choice of (s,j)(s^{\star},j^{\star}) is inconsequential.

PnSg(Gn,λna)PnDa{f~ϕa,Gn,λna}\displaystyle P_{n}S_{g}(G^{a}_{n,\lambda_{n}})-P_{n}D^{a}\{\tilde{f}_{\phi}^{a},G^{a}_{n,\lambda_{n}}\} =PnSg(Gn,λna)PnSg(Gn,λna)\displaystyle=P_{n}S_{g}(G^{a}_{n,\lambda_{n}})-P_{n}S_{g^{*}}(G^{a}_{n,\lambda_{n}})
=Pn{ddGn,λnaL(Gn,λna)[(s,j){g(s,j)g(s,j)}βn,s,jϕs,j]}\displaystyle=P_{n}\left\{\frac{d}{dG^{a\dagger}_{n,\lambda_{n}}}L(G^{a\dagger}_{n,\lambda_{n}})\left[\sum_{(s,j)}\left\{g(s,j)-g^{\star}(s,j)\right\}\beta_{n,s,j}\phi_{s,j}\right]\right\}
=Pn[ddGn,λnaL(Gn,λna){g(s,j)g(s,j)}βn,s,jϕs,j]\displaystyle=P_{n}\left[\frac{d}{dG^{a\dagger}_{n,\lambda_{n}}}L(G^{a\dagger}_{n,\lambda_{n}})\left\{g(s^{\star},j^{\star})-g^{\star}(s^{\star},j^{\star})\right\}\beta_{n,s^{\star},j^{\star}}\phi_{s^{\star},j^{\star}}\right]
=Pn[ddGn,λnaL(Gn,λna)κ(s,j)ϕs,j]\displaystyle=P_{n}\left[\frac{d}{dG^{a\dagger}_{n,\lambda_{n}}}L(G^{a\dagger}_{n,\lambda_{n}})\kappa(s^{\star},j^{\star})\phi_{s^{\star},j^{\star}}\right]
=κ(s,j)Pn[ddGn,λnaL(Gn,λna)ϕs,j],\displaystyle=\kappa(s^{\star},j^{\star})P_{n}\left[\frac{d}{dG^{a\dagger}_{n,\lambda_{n}}}L(G^{a\dagger}_{n,\lambda_{n}})\phi_{s^{\star},j^{\star}}\right],

where the third equality follows from equation (S30) above with

κ(s,j)=(s,j)(s,j)g(s,j)|βn,s,j||βn,s,j|βn,s,jg(s,j)βn,s,j.\kappa(s^{\star},j^{\star})=-\frac{\sum_{(s,j)\neq(s^{\star},j^{\star})}g^{\star}(s,j)\lvert\beta_{n,s,j}\rvert}{\lvert\beta_{n,s^{\star},j^{\star}}\rvert}\beta_{n,s^{\star},j^{\star}}-g^{\star}(s^{\star},j^{\star})\beta_{n,s^{\star},j^{\star}}.

Moreover,

|κ(s,j)|(s,j)|g(s,j)βn,s,j|.\left\lvert\kappa(s^{\star},j^{\star})\right\rvert\leq\sum_{(s,j)}\lvert g^{\star}(s,j)\beta_{n,s,j}\rvert.

Assuming f~ϕa\tilde{f}_{\phi}^{a} has finite sectional variation norm, the L1L_{1} norm of the coefficients approximating f~ϕa\tilde{f}_{\phi}^{a} will be finite which implies that (s,j)|g(s,j)βn,s,j|\sum_{(s,j)}\lvert g^{\star}(s,j)\beta_{n,s,j}\rvert is finite, and thus |κ(s,j)|=Op(1)\lvert\kappa(s^{\star},j^{\star})\rvert=O_{p}(1). Then,

PnSg(Gn,λna)PnDa(f~ϕa,Gn,λna)\displaystyle P_{n}S_{g}(G^{a}_{n,\lambda_{n}})-P_{n}D^{a}(\tilde{f}_{\phi}^{a},G^{a}_{n,\lambda_{n}}) =Op(Pn[ddGn,λnaL(Gn,λna)ϕs,j])\displaystyle=O_{p}\left(P_{n}\left[\frac{d}{dG^{a\dagger}_{n,\lambda_{n}}}L(G^{a\dagger}_{n,\lambda_{n}})\phi_{s^{\star},j^{\star}}\right]\right)
=op(n1/2hr/2),\displaystyle=o_{p}\left(n^{-1/2}h^{r/2}\right),

where the last equality follows from the assumption that min(s,j)𝒥nPnddGn,λnaL(Gn,λna)(ϕs,j)=op(n1/2hr/2)\min_{(s,j)\in\mathcal{J}_{n}}\lVert P_{n}\frac{d}{dG^{a\dagger}_{n,\lambda_{n}}}L(G^{a\dagger}_{n,\lambda_{n}})(\phi_{s,j})\rVert=o_{p}\left(n^{-1/2}h^{r/2}\right) for L()L(\cdot) being log-likelihood loss. As PnSg(Gn,λna)=0P_{n}S_{{g}}(G^{a}_{n,\lambda_{n}})=0, it follows that PnDa(f~ϕa,Gn,λna)=op(n1/2hr/2)P_{n}D^{a}(\tilde{f}_{\phi}^{a},G^{a}_{n,\lambda_{n}})=o_{p}\left(n^{-1/2}h^{r/2}\right) and thus hrPnDa(f~ϕa,Gn,λna)=op((nhr)1/2)h^{-r}P_{n}D^{a}(\tilde{f}_{\phi}^{a},G^{a}_{n,\lambda_{n}})=o_{p}\left((nh^{r})^{-1/2}\right).
We just showed that PnDa(fha,Gn,λna)=op((nhr)1/2)P_{n}D^{a}(f_{h}^{a},G^{a}_{n,\lambda_{n}})=o_{p}\left((nh^{r})^{-1/2}\right). Similarly we can show that PnDc(fhc,Gn,λnc)=op((nhr)1/2)P_{n}D^{c}(f_{h}^{c},G^{c}_{n,\lambda_{n}})=o_{p}\left((nh^{r})^{-1/2}\right) and PnDΨ(fhΨ,Ψn)=op((nhr)1/2)P_{n}D^{\Psi}(f_{h}^{\Psi},\Psi_{n})=o_{p}\left((nh^{r})^{-1/2}\right) which completes the proof. ∎

Proof of Lemma A.2.

The prove follows from the proof of Lemma A.1 with the bandwidth hh replaced by a constant value (e.g., one). ∎

Proof of Lemma A.3.

We first proof the result for the treatment indicator and GaG^{a}.

supv𝒱|PnDa(fa,Gn,λna)|\displaystyle\sup_{v\in\mathcal{V}}\left|P_{n}D^{a}(f^{a},G^{a}_{n,\lambda_{n}})\right| supv𝒱|PnDa(faf~ϕa,Gn,λna)|+supv𝒱|PnDa(f~ϕa,Gn,λna)|\displaystyle\leq\sup_{v\in\mathcal{V}}\left|P_{n}D^{a}(f^{a}-\tilde{f}_{\phi}^{a},G^{a}_{n,\lambda_{n}})\right|+\sup_{v\in\mathcal{V}}\left|P_{n}D^{a}(\tilde{f}_{\phi}^{a},G^{a}_{n,\lambda_{n}})\right|
supv𝒱|(PnP0)Da(faf~ϕa,Gn,λna)|+supv𝒱|P0Da(faf~ϕa,Gn,λna)|\displaystyle\leq\sup_{v\in\mathcal{V}}\left|(P_{n}-P_{0})D^{a}(f^{a}-\tilde{f}_{\phi}^{a},G^{a}_{n,\lambda_{n}})\right|+\sup_{v\in\mathcal{V}}\left|P_{0}D^{a}(f^{a}-\tilde{f}_{\phi}^{a},G^{a}_{n,\lambda_{n}})\right|
+supv𝒱|PnDa(f~ϕa,Gn,λna)|\displaystyle+\sup_{v\in\mathcal{V}}\left|P_{n}D^{a}(\tilde{f}_{\phi}^{a},G^{a}_{n,\lambda_{n}})\right|
=Op(n1/2n1/6)+Op(n1/3n1/3)+supv𝒱|PnDa(f~ϕa,Gn,λna)|.\displaystyle=O_{p}(n^{-1/2}n^{-1/6})+O_{p}(n^{-1/3}n^{-1/3})+\sup_{v\in\mathcal{V}}\left|P_{n}D^{a}(\tilde{f}_{\phi}^{a},G^{a}_{n,\lambda_{n}})\right|.

The last equality follows from supv𝒱f~af~ϕa2=Op(n1/3)\sup_{v\in\mathcal{V}}\|\tilde{f}^{a}-\tilde{f}_{\phi}^{a}\|_{2}=O_{p}(n^{-1/3}) and Lemma S1 as

supv𝒱(PnP0)Da(f~af~ϕa,Gn,λna)\displaystyle\sup_{v\in\mathcal{V}}(P_{n}-P_{0})D^{a}(\tilde{f}^{a}-\tilde{f}_{\phi}^{a},G^{a}_{n,\lambda_{n}}) supv𝒱f~af~ϕa2n1/3|(PnP0)Da(f~af~ϕa,Gn,λna)|\displaystyle\leq\sup_{\begin{subarray}{c}v\in\mathcal{V}\\ \|\tilde{f}^{a}-\tilde{f}_{\phi}^{a}\|_{2}\leq n^{-1/3}\end{subarray}}|(P_{n}-P_{0})D^{a}(\tilde{f}^{a}-\tilde{f}_{\phi}^{a},G^{a}_{n,\lambda_{n}})|
=Op{n1/2(n1/3,L2(P))}Op(n1/2n1/6).\displaystyle=O_{p}\{n^{-1/2}\mathcal{E}(n^{-1/3},L^{2}(P))\}\leq O_{p}(n^{-1/2}n^{-1/6}).

Moreover, by the convergence rate of the highly adaptive lasso estimate (i.e., supv𝒱G0aGn,λna2,P0=Op(n1/3)\sup_{v\in\mathcal{V}}\lVert G^{a}_{0}-G^{a}_{n,\lambda_{n}}\rVert_{2,P_{0}}=O_{p}(n^{-1/3})),

supv𝒱|P0Da(f~af~ϕa,Gn,λna)|=supv𝒱|P0(f~af~ϕa)(G0aGn,λna)|=Op(n1/3)Op(n1/3).\sup_{v\in\mathcal{V}}\left|P_{0}D^{a}(\tilde{f}^{a}-\tilde{f}_{\phi}^{a},G^{a}_{n,\lambda_{n}})\right|=\sup_{v\in\mathcal{V}}\left|P_{0}(\tilde{f}^{a}-\tilde{f}_{\phi}^{a})(G_{0}^{a}-G^{a}_{n,\lambda_{n}})\right|=O_{p}(n^{-1/3})O_{p}(n^{-1/3}).

Therefore, since the first two terms in the last display are already of order $o_{p}(n^{-1/2})$,

\sup_{v\in\mathcal{V}}\left|P_{n}D^{a}(f^{a},G^{a}_{n,\lambda_{n}})\right|=o_{p}(n^{-1/2})+\sup_{v\in\mathcal{V}}\left|P_{n}D^{a}(\tilde{f}_{\phi}^{a},G^{a}_{n,\lambda_{n}})\right|.

In the following we show that supv𝒱|PnDa(f~ϕa,Gn,λna)|\sup_{v\in\mathcal{V}}\left|P_{n}D^{a}(\tilde{f}_{\phi}^{a},G^{a}_{n,\lambda_{n}})\right| is also of the order op(n1/2)o_{p}(n^{-1/2}).

For simplicity of notation, let Ga=logit GaG^{a\dagger}=\text{logit }G^{a}. Define a set of score functions generated by a path {1+ϵg(s,j)}βn,s,j\{1+\epsilon g(s,j)\}\beta_{n,s,j} for a uniformly bounded vector gg as

Sg(Gn,λna)=ddGn,λnaL(Gn,λna){(s,j)g(s,j)βn,s,jϕs,j}.S_{g}(G^{a}_{n,\lambda_{n}})=\frac{d}{dG^{a\dagger}_{n,\lambda_{n}}}L(G^{a\dagger}_{n,\lambda_{n}})\left\{\sum_{(s,j)}g(s,j)\beta_{n,s,j}\phi_{s,j}\right\}.

When $L(\cdot)$ is the log-likelihood loss function, $L(G)=A\log\left(\frac{G}{1-G}\right)+\log(1-G)$. In terms of $G^{a\dagger}$, this is $L(G^{a\dagger})=AG^{a\dagger}-\log\{1+\exp(G^{a\dagger})\}$, so that $\frac{d}{dG^{a\dagger}}L(G^{a\dagger})=A-G^{a}$ and $S_{g}(G^{a})=(A-G^{a}_{n,\lambda_{n}})\left\{\sum_{(s,j)}g(s,j)\beta_{n,s,j}\phi_{s,j}\right\}$. Let $r(g,G^{a}_{n,\lambda_{n}})=\sum_{(s,j)}g(s,j)\lvert\beta_{n,s,j}\rvert$. For small enough $\epsilon$,

(s,j)|{1+ϵg(s,j)}βn,s,j|\displaystyle\sum_{(s,j)}\lvert\{1+\epsilon g(s,j)\}\beta_{n,s,j}\rvert =(s,j){1+ϵg(s,j)}|βn,s,j|\displaystyle=\sum_{(s,j)}\{1+\epsilon g(s,j)\}\lvert\beta_{n,s,j}\rvert
=(s,j)|βn,s,j|+ϵr(g,Gn,λn).\displaystyle=\sum_{(s,j)}\lvert\beta_{n,s,j}\rvert+\epsilon r(g,G_{n,\lambda_{n}}).

Hence, for any gg satisfying r(g,Gn,λn)=0r(g,G_{n,\lambda_{n}})=0, we have PnSg(Gn,λna)=0P_{n}S_{g}(G^{a}_{n,\lambda_{n}})=0.

Because $D^{a}\{\tilde{f}_{\phi}^{a},G^{a}_{n,\lambda_{n}}\}\in\{S_{g}(G^{a}_{n,\lambda_{n}}):\lVert g\rVert_{\infty}<\infty\}$, there exists $g^{\star}$ such that $D^{a}(\tilde{f}_{\phi}^{a},G^{a}_{n,\lambda_{n}})=S_{g^{\star}}(G^{a}_{n,\lambda_{n}})$. However, for this particular choice $g^{\star}$, $r(g^{\star},G^{a}_{n,\lambda_{n}})$ may not be zero. Now, define $g$ such that $g(s,j)=g^{\star}(s,j)$ for $(s,j)\neq(s^{\star},j^{\star})$, and let $g(s^{\star},j^{\star})$ be defined such that

(s,j)(s,j)g(s,j)|βn,s,j|+g(s,j)|βn,s,j|=0.\displaystyle\sum_{(s,j)\neq(s^{\star},j^{\star})}g^{\star}(s,j)\lvert\beta_{n,s,j}\rvert+g(s^{\star},j^{\star})\lvert\beta_{n,s^{\star},j^{\star}}\rvert=0. (S30)

That is, $g$ matches $g^{\star}$ everywhere except at the single point $(s^{\star},j^{\star})$, where it is forced to take the value that makes $r(g,G^{a}_{n,\lambda_{n}})=0$ (made explicit below). As a result, for such a choice of $g$, $P_{n}S_{g}(G^{a}_{n,\lambda_{n}})=0$ by definition. Below, we show that $P_{n}S_{g}(G^{a}_{n,\lambda_{n}})-P_{n}D^{a}\{\tilde{f}_{\phi}^{a},G^{a}_{n,\lambda_{n}}\}=o_{p}(n^{-1/2})$, which then implies that $P_{n}D^{a}\{\tilde{f}_{\phi}^{a},G^{a}_{n,\lambda_{n}}\}=o_{p}(n^{-1/2})$. We note that the choice of $(s^{\star},j^{\star})$ is inconsequential.
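Solving equation (S30) for the modified coordinate makes this construction concrete; the display below simply restates (S30) and shows how the constant $\kappa(s^{\star},j^{\star})$ appearing further down arises (no new assumptions are involved):

g(s^{\star},j^{\star})=-\frac{\sum_{(s,j)\neq(s^{\star},j^{\star})}g^{\star}(s,j)\lvert\beta_{n,s,j}\rvert}{\lvert\beta_{n,s^{\star},j^{\star}}\rvert},\qquad\left\{g(s^{\star},j^{\star})-g^{\star}(s^{\star},j^{\star})\right\}\beta_{n,s^{\star},j^{\star}}=\kappa(s^{\star},j^{\star}).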

PnSg(Gn,λna)PnDa{f~ϕa,Gn,λna}\displaystyle P_{n}S_{g}(G^{a}_{n,\lambda_{n}})-P_{n}D^{a}\{\tilde{f}_{\phi}^{a},G^{a}_{n,\lambda_{n}}\} =PnSg(Gn,λna)PnSg(Gn,λna)\displaystyle=P_{n}S_{g}(G^{a}_{n,\lambda_{n}})-P_{n}S_{g^{*}}(G^{a}_{n,\lambda_{n}})
=Pn{ddGn,λnaL(Gn,λna)[(s,j){g(s,j)g(s,j)}βn,s,jϕs,j]}\displaystyle=P_{n}\left\{\frac{d}{dG^{a\dagger}_{n,\lambda_{n}}}L(G^{a\dagger}_{n,\lambda_{n}})\left[\sum_{(s,j)}\left\{g(s,j)-g^{\star}(s,j)\right\}\beta_{n,s,j}\phi_{s,j}\right]\right\}
=Pn[ddGn,λnaL(Gn,λna){g(s,j)g(s,j)}βn,s,jϕs,j]\displaystyle=P_{n}\left[\frac{d}{dG^{a\dagger}_{n,\lambda_{n}}}L(G^{a\dagger}_{n,\lambda_{n}})\left\{g(s^{\star},j^{\star})-g^{\star}(s^{\star},j^{\star})\right\}\beta_{n,s^{\star},j^{\star}}\phi_{s^{\star},j^{\star}}\right]
=Pn[ddGn,λnaL(Gn,λna)κ(s,j)ϕs,j]\displaystyle=P_{n}\left[\frac{d}{dG^{a\dagger}_{n,\lambda_{n}}}L(G^{a\dagger}_{n,\lambda_{n}})\kappa(s^{\star},j^{\star})\phi_{s^{\star},j^{\star}}\right]
=κ(s,j)Pn[ddGn,λnaL(Gn,λna)ϕs,j],\displaystyle=\kappa(s^{\star},j^{\star})P_{n}\left[\frac{d}{dG^{a\dagger}_{n,\lambda_{n}}}L(G^{a\dagger}_{n,\lambda_{n}})\phi_{s^{\star},j^{\star}}\right],

where the third equality follows from equation (S30) above with

κ(s,j)=(s,j)(s,j)g(s,j)|βn,s,j||βn,s,j|βn,s,jg(s,j)βn,s,j.\kappa(s^{\star},j^{\star})=-\frac{\sum_{(s,j)\neq(s^{\star},j^{\star})}g^{\star}(s,j)\lvert\beta_{n,s,j}\rvert}{\lvert\beta_{n,s^{\star},j^{\star}}\rvert}\beta_{n,s^{\star},j^{\star}}-g^{\star}(s^{\star},j^{\star})\beta_{n,s^{\star},j^{\star}}.

Moreover,

|κ(s,j)|(s,j)|g(s,j)βn,s,j|.\left\lvert\kappa(s^{\star},j^{\star})\right\rvert\leq\sum_{(s,j)}\lvert g^{\star}(s,j)\beta_{n,s,j}\rvert.

Assuming f~ϕa\tilde{f}_{\phi}^{a} has finite sectional variation norm, the L1L_{1} norm of the coefficients approximating f~ϕa\tilde{f}_{\phi}^{a} will be finite which implies that (s,j)|g(s,j)βn,s,j|\sum_{(s,j)}\lvert g^{\star}(s,j)\beta_{n,s,j}\rvert is finite, and thus |κ(s,j)|=Op(1)\lvert\kappa(s^{\star},j^{\star})\rvert=O_{p}(1). Hence,

supv𝒱|PnSg(Gn,λna)PnDa(f~ϕa,Gn,λna)|\displaystyle\sup_{v\in\mathcal{V}}\left|P_{n}S_{g}(G^{a}_{n,\lambda_{n}})-P_{n}D^{a}(\tilde{f}_{\phi}^{a},G^{a}_{n,\lambda_{n}})\right| =Op(supv𝒱|Pn[ddGn,λnaL(Gn,λna)ϕs,j]|)\displaystyle=O_{p}\left(\sup_{v\in\mathcal{V}}\left|P_{n}\left[\frac{d}{dG^{a\dagger}_{n,\lambda_{n}}}L(G^{a\dagger}_{n,\lambda_{n}})\phi_{s^{\star},j^{\star}}\right]\right|\right)
=op(n1/2),\displaystyle=o_{p}(n^{-1/2}),

where the last equality follows from the assumption that min(s,j)𝒥nsupv𝒱PnddGn,λnaL(Gn,λna)(ϕs,j)=op(n1/2)\min_{(s,j)\in\mathcal{J}_{n}}\sup_{v\in\mathcal{V}}\lVert P_{n}\frac{d}{dG^{a\dagger}_{n,\lambda_{n}}}L(G^{a\dagger}_{n,\lambda_{n}})(\phi_{s,j})\rVert=o_{p}(n^{-1/2}) for L()L(\cdot) being log-likelihood loss. As PnSg(Gn,λna)=0P_{n}S_{{g}}(G^{a}_{n,\lambda_{n}})=0, it follows that PnDa(f~ϕa,Gn,λna)=op(n1/2)P_{n}D^{a}(\tilde{f}_{\phi}^{a},G^{a}_{n,\lambda_{n}})=o_{p}(n^{-1/2}).
We just showed that PnDa(fa,Gn,λna)=op(n1/2)P_{n}D^{a}(f^{a},G^{a}_{n,\lambda_{n}})=o_{p}(n^{-1/2}). Similarly we can show that PnDc(fc,Gn,λnc)=op(n1/2)P_{n}D^{c}(f^{c},G^{c}_{n,\lambda_{n}})=o_{p}(n^{-1/2}) and PnDΨ(fΨ,Ψn)=op(n1/2)P_{n}D^{\Psi}(f^{\Psi},\Psi_{n})=o_{p}(n^{-1/2}) which completes the proof. ∎

Appendix S2 Other Lemmas

Lemma S1 (Lemma 3.4.2 in Vaart and Wellner (1996)).

For a class of functions $\mathcal{F}$ with envelope $\textbf{F}(x)=\sup_{f\in\mathcal{F}}|f(x)|$ bounded from above by $M<\infty$, we have

EPsupf,fPδn1/2(PnP0)(f)(δ,,L2(P)){1+(δ,,L2(P))δ2n1/2M}.E_{P}\sup_{f\in\mathcal{F},\|f\|_{P}\leq\delta}n^{1/2}(P_{n}-P_{0})(f)\leq\mathcal{E}(\delta,\mathcal{F},L^{2}(P))\left\{1+\frac{\mathcal{E}(\delta,\mathcal{F},L^{2}(P))}{\delta^{2}n^{1/2}}M\right\}.

This implies that supf,fPδn1/2(PnP0)(f)\sup_{f\in\mathcal{F},\|f\|_{P}\leq\delta}n^{1/2}(P_{n}-P_{0})(f) is bounded in probability by the right-hand side.

Lemma S2 (Proposition 2 in Bibaut and van der Laan (2019)).

Let $p\geq 2$ and $M>0$. Denote by $\mathcal{F}_{p,M}$ the class of cadlag functions on $[0,1]^{p}$ with sectional variation norm smaller than $M$. Suppose that $P_{0}$ is such that, for all $1\leq b\leq\infty$ and all real-valued functions $f$ on $[0,1]^{p}$, $\|f\|_{P_{0},b}=c(b)\|f\|_{\mu,b}$ for some $c(b)>0$, where $\mu$ is the Lebesgue measure. Then, for any $0<\delta<1$, the bracketing entropy integral of $\mathcal{F}_{p,M}$ with respect to the $\|\cdot\|_{P_{0},b}$ norm is bounded by

[](δ,p,M,.P0,b){C(b,p)Mδ}1/2log(δ/M)p1,\mathcal{E}_{[]}(\delta,\mathcal{F}_{p,M},\|.\|_{P_{0},b})\leq\{C(b,p)M\delta\}^{1/2}\mid\log(\delta/M)\mid^{p-1},

where C(b,p)C(b,p) is a constant that depends only on bb and pp.

Lemma S3.

Consider a sequence of processes {n1/2(PnP0)Hn,h(f)}\{n^{1/2}(P_{n}-P_{0})H_{n,h}(f)\} where

Hn,h(f)=hr/2Kh,𝒗0(fnf0).H_{n,h}(f)=h^{r/2}K_{h,\bm{v}_{0}}(f_{n}-f_{0}).

Assuming the following:

  • (i)

    The function f0f_{0} belongs to a cadlag class of functions with finite sectional variation norm.

  • (ii)

    The function fnf_{n} is the highly adaptive lasso estimate of f0f_{0}.

  • (iii)

    The bandwidth $h_{n}$ satisfies $h^{r}\rightarrow 0$ and $nh^{3r/2}\rightarrow\infty$.

Then n1/2(PnP0)Hn,h(fn)=op(1)n^{1/2}(P_{n}-P_{0})H_{n,h}(f_{n})=o_{p}(1).

Proof.

Let $\tilde{K}_{h,\bm{v}_{0}}(v)=h^{r}K_{h,\bm{v}_{0}}(v)$ and define $\tilde{H}_{n,h}(f)$ through $h^{-r/2}\tilde{H}_{n,h}(f)=H_{n,h}(f)$. The function $\tilde{H}_{n,h}(f)$ belongs to the class of functions $\mathcal{F}_{n}=\{\tilde{H}_{n,h}(f):h\}$. We have

H~n(fn)P0\displaystyle\|\tilde{H}_{n}(f_{n})\|_{P_{0}} K~h,𝒗0(v)P0(fnf0)P0\displaystyle\leq\|\tilde{K}_{h,\bm{v}_{0}}(v)\|_{P_{0}}\|(f_{n}-f_{0})\|_{P_{0}}
=Op(hr/2)Op(n1/3)\displaystyle=O_{p}(h^{r/2})O_{p}(n^{-1/3})

where the last equality follows from the kernel and the highly adaptive lasso rates of convergence. Using the results of Lemmas S1 and S2, $n^{1/2}(P_{n}-P_{0})\tilde{H}_{n}(f_{n})=O_{p}(h^{r/4}n^{-1/6})$. Therefore,

n1/2(PnP0)Hn(fn)=Op(hr/4n1/6).\displaystyle n^{1/2}(P_{n}-P_{0})H_{n}(f_{n})=O_{p}(h^{-r/4}n^{-1/6}).

Thus, to obtain the desired rate we must have $h^{-r/4}n^{-1/6}=o(1)$, which requires that $h^{r}$ converge to zero more slowly than $n^{-2/3}$ (i.e., $h^{r}>n^{-2/3}$). Note that under continuity and no additional smoothness assumptions, the bandwidth rate is $h^{r}=n^{-1/3}$, which converges to zero much more slowly than $n^{-2/3}$. Thus, under continuity, for every level of smoothness the optimal choice of $h$ already satisfies $h^{r}>n^{-2/3}$ (i.e., no additional condition is imposed). ∎

Lemma S4.

Let the function $f_{n}$ be the highly adaptive lasso estimate of $f_{0}$, where $f_{0}$ belongs to a cadlag class of functions with finite sectional variation norm. Then

PnKh,𝒗0f0P0Kh,𝒗0(f0fn)=PnKh,𝒗0fnPnKh,𝒗0(f0fn)+op((nhr)1/2).P_{n}\frac{K_{h,\bm{v}_{0}}f_{0}}{P_{0}K_{h,\bm{v}_{0}}}\left(f_{0}-f_{n}\right)=P_{n}\frac{K_{h,\bm{v}_{0}}f_{n}}{P_{n}K_{h,\bm{v}_{0}}}\left(f_{0}-f_{n}\right)+o_{p}\left((nh^{r})^{-1/2}\right).
Proof.

We have

PnKh,𝒗0f0P0Kh,𝒗0(f0fn)\displaystyle P_{n}\frac{K_{h,\bm{v}_{0}}f_{0}}{P_{0}K_{h,\bm{v}_{0}}}\left(f_{0}-f_{n}\right) =PnKh,𝒗0fnPnKh,𝒗0(f0fn)+PnKh,𝒗0(f0fn)(f0P0Kh,𝒗0fnPnKh,𝒗0)\displaystyle=P_{n}\frac{K_{h,\bm{v}_{0}}f_{n}}{P_{n}K_{h,\bm{v}_{0}}}\left(f_{0}-f_{n}\right)+P_{n}K_{h,\bm{v}_{0}}(f_{0}-f_{n})\left(\frac{f_{0}}{P_{0}K_{h,\bm{v}_{0}}}-\frac{f_{n}}{P_{n}K_{h,\bm{v}_{0}}}\right)
=PnKh,𝒗0fnPnKh,𝒗0(f0fn)+(PnP0)Kh,𝒗0(f0fn)(f0P0Kh,𝒗0fnPnKh,𝒗0)\displaystyle=P_{n}\frac{K_{h,\bm{v}_{0}}f_{n}}{P_{n}K_{h,\bm{v}_{0}}}\left(f_{0}-f_{n}\right)+(P_{n}-P_{0})K_{h,\bm{v}_{0}}(f_{0}-f_{n})\left(\frac{f_{0}}{P_{0}K_{h,\bm{v}_{0}}}-\frac{f_{n}}{P_{n}K_{h,\bm{v}_{0}}}\right)
+P0Kh,𝒗0(f0fn)(f0P0Kh,𝒗0fnPnKh,𝒗0)\displaystyle\hskip 144.54pt+P_{0}K_{h,\bm{v}_{0}}(f_{0}-f_{n})\left(\frac{f_{0}}{P_{0}K_{h,\bm{v}_{0}}}-\frac{f_{n}}{P_{n}K_{h,\bm{v}_{0}}}\right)
=PnKh,𝒗0fnPnKh,𝒗0(f0fn)+op((nhr)1/2).\displaystyle=P_{n}\frac{K_{h,\bm{v}_{0}}f_{n}}{P_{n}K_{h,\bm{v}_{0}}}\left(f_{0}-f_{n}\right)+o_{p}\left((nh^{r})^{-1/2}\right).

The last equality follows because, by an argument analogous to Lemma S3, the empirical process term $(P_{n}-P_{0})K_{h,\bm{v}_{0}}(f_{0}-f_{n})\left(\frac{f_{0}}{P_{0}K_{h,\bm{v}_{0}}}-\frac{f_{n}}{P_{n}K_{h,\bm{v}_{0}}}\right)$ is $o_{p}\left((nh^{r})^{-1/2}\right)$, and because $P_{0}K_{h,\bm{v}_{0}}(f_{0}-f_{n})\left(\frac{f_{0}}{P_{0}K_{h,\bm{v}_{0}}}-\frac{f_{n}}{P_{n}K_{h,\bm{v}_{0}}}\right)=o_{p}\left((nh^{r})^{-1/2}\right) by the highly adaptive lasso rate of convergence. ∎

Lemma S5.

Let $P_{0}L_{G_{0}}(\Psi_{n},\Psi_{0})\equiv P_{0}L_{G_{0}}(\Psi_{n})-P_{0}L_{G_{0}}(\Psi_{0})$, where $L_{G}$ is the weighted squared-error loss function defined in (1), and let $\mathcal{M}_{\lambda}=\{m\in\mathcal{M}:\|m\|_{\nu}<\lambda\}$ be the set of cadlag functions with variation norm smaller than $\lambda$. Then, uniformly over $\lambda$, we have

|P0{LG0(Ψn,Ψ0)LGn(Ψn,Ψ0)}|C{d01(Gn,G0)}1/2{P0LG0(Ψn,Ψ0)}1/2,\left|P_{0}\{L_{G_{0}}(\Psi_{n},\Psi_{0})-L_{G_{n}}(\Psi_{n},\Psi_{0})\}\right|\leq C\left\{d_{01}(G_{n},G_{0})\right\}^{1/2}\left\{P_{0}L_{G_{0}}(\Psi_{n},\Psi_{0})\right\}^{1/2},

where CC is a positive finite constant.

Lemma S6.

Suppose θθ02=Op(δn)\|\theta-\theta_{0}\|_{2}=O_{p}(\delta_{n}) and f(θ)=hrKh,𝐯0I(A=dθ)G0(θ)Q0(θ)f(\theta)=\frac{h^{r}K_{h,\bm{v}_{0}}I(A=d^{\theta})}{G_{0}(\theta)}Q_{0}(\theta). Under Assumption 4, f(θ)f(θ0)2=Op(hr/2δn)\|f(\theta)-f(\theta_{0})\|_{2}=O_{p}(h^{r/2}\delta_{n}).

Proof.

We assume that $\bm{S}$ and $Q_{0}(\theta)$ are bounded uniformly by some constants $C_{S}$ and $C_{Q}$. Moreover, by the strong positivity assumption, $\min\{G_{0}(\theta,\bm{W}),1-G_{0}(\theta,\bm{W})\}>\gamma$ for all $\bm{W}\in\mathcal{W}$ and $\theta\in\Theta$. By definition, the norm can be written as

f(θ)f(θ0)2\displaystyle\|f(\theta)-f(\theta_{0})\|_{2} =hrKh,𝒗0{I(A=dθ)G0(θ)Q0(θ)I(A=dθ0)G0(θ0)Q0(θ0)}2\displaystyle=\left\|h^{r}K_{h,\bm{v}_{0}}\left\{\frac{I(A=d^{\theta})}{G_{0}(\theta)}Q_{0}(\theta)-\frac{I(A=d^{\theta_{0}})}{G_{0}(\theta_{0})}Q_{0}(\theta_{0})\right\}\right\|_{2}
hrKh,𝒗02(γ)1max{|Q0(θ0)|,|Q0(θ)|}I(dθdθ0)2\displaystyle\leq\|h^{r}K_{h,\bm{v}_{0}}\|_{2}\left\|(\gamma)^{-1}\max\{|Q_{0}(\theta_{0})|,|Q_{0}(\theta)|\}I(d^{\theta}\neq d^{\theta_{0}})\right\|_{2}
=hrKh,𝒗02(γ)1max{|Q0(θ0)|,|Q0(θ)|}I(dθdθ0){I(|𝑺θ0|>t)+I(|𝑺θ0|<t)}2\displaystyle=\|h^{r}K_{h,\bm{v}_{0}}\|_{2}\left\|(\gamma)^{-1}\max\{|Q_{0}(\theta_{0})|,|Q_{0}(\theta)|\}I(d^{\theta}\neq d^{\theta_{0}})\{I(|\bm{S}^{\top}\theta_{0}|>t)+I(|\bm{S}^{\top}\theta_{0}|<t)\}\right\|_{2}
γ1CQhrKh,𝒗02{Ctκ/2+I(|𝑺θ0𝑺θ|>t)2}\displaystyle\leq\gamma^{-1}C_{Q}\|h^{r}K_{h,\bm{v}_{0}}\|_{2}\left\{Ct^{\kappa/2}+\left\|I(|\bm{S}^{\top}\theta_{0}-\bm{S}^{\top}\theta|>t)\right\|_{2}\right\}
γ1CQhrKh,𝒗02{Ctκ/2+𝑺θ0𝑺θ2t}.\displaystyle\leq\gamma^{-1}C_{Q}\|h^{r}K_{h,\bm{v}_{0}}\|_{2}\left\{Ct^{\kappa/2}+\frac{\|\bm{S}^{\top}\theta_{0}-\bm{S}^{\top}\theta\|_{2}}{t}\right\}.

The fourth inequality follows from Assumption 4, the Cauchy–Schwarz inequality and $|\bm{S}^{\top}\theta_{0}-\bm{S}^{\top}\theta|>|\bm{S}^{\top}\theta_{0}|$. The latter holds because $I(d^{\theta}\neq d^{\theta_{0}})=1$ when (a) $\bm{S}^{\top}\theta>0$ and $\bm{S}^{\top}\theta_{0}<0$; or (b) $\bm{S}^{\top}\theta<0$ and $\bm{S}^{\top}\theta_{0}>0$. Cases (a) and (b) imply that $\bm{S}^{\top}\theta_{0}-\bm{S}^{\top}\theta<\bm{S}^{\top}\theta_{0}<0$ and $0<\bm{S}^{\top}\theta_{0}<\bm{S}^{\top}\theta_{0}-\bm{S}^{\top}\theta$, respectively, and thus $|\bm{S}^{\top}\theta_{0}-\bm{S}^{\top}\theta|>|\bm{S}^{\top}\theta_{0}|$. The fifth inequality holds by Markov's inequality. The upper bound is minimized at $t=C_{t}\|\bm{S}^{\top}\theta_{0}-\bm{S}^{\top}\theta\|_{2}^{\frac{1}{\kappa/2+1}}$ for a constant $C_{t}$ that depends on $\kappa$ and $C$. Hence,

f(θ)f(θ0)2\displaystyle\|f(\theta)-f(\theta_{0})\|_{2} ChrKh,𝒗02{C𝑺θ0𝑺θ2κκ+2+𝑺θ0𝑺θ2κκ+2}\displaystyle\leq C^{\dagger}\|h^{r}K_{h,\bm{v}_{0}}\|_{2}\left\{C\|\bm{S}^{\top}\theta_{0}-\bm{S}^{\top}\theta\|_{2}^{\frac{\kappa}{\kappa+2}}+\|\bm{S}^{\top}\theta_{0}-\bm{S}^{\top}\theta\|_{2}^{\frac{\kappa}{\kappa+2}}\right\}
=Op(hr/2δκκ+2).\displaystyle=O_{p}(h^{r/2}\delta^{\frac{\kappa}{\kappa+2}}). (S31)

Hence, when there is a margin around zero, that is, $pr(0<|\bm{S}^{\top}\theta_{0}|<l)=0$, the right-hand side of (S31) reduces to $O_{p}(h^{r/2}\delta)$. ∎
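For completeness, the minimization over $t$ invoked in the proof can be made explicit. Writing $a=\|\bm{S}^{\top}\theta_{0}-\bm{S}^{\top}\theta\|_{2}$, this is a one-line calculus step under the notation above (a verification, not an additional assumption):

\frac{d}{dt}\left\{Ct^{\kappa/2}+\frac{a}{t}\right\}=\frac{C\kappa}{2}t^{\kappa/2-1}-\frac{a}{t^{2}}=0\quad\Longrightarrow\quad t=\left(\frac{2a}{C\kappa}\right)^{\frac{2}{\kappa+2}}=C_{t}\,a^{\frac{1}{\kappa/2+1}},

and substituting this value back shows that both terms in the bound are of order $a^{\kappa/(\kappa+2)}$, which is the exponent appearing in (S31).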

Appendix S3 Proof of Theorem 4.1: Rate of convergence of HAL with unknown nuisance parameters

Define P0LG0(Ψn,Ψ0)P0LG0(Ψn)P0LG0(Ψ0)P_{0}L_{G_{0}}(\Psi_{n},\Psi_{0})\equiv P_{0}L_{G_{0}}(\Psi_{n})-P_{0}L_{G_{0}}(\Psi_{0}). Then using the definition of the loss-based dissimilarity, we have

0d0(Ψn,Ψ0)\displaystyle 0\leq d_{0}(\Psi_{n},\Psi_{0}) =P0LG0(Ψn,Ψ0)\displaystyle=P_{0}L_{G_{0}}(\Psi_{n},\Psi_{0})
=(P0Pn)LG0(Ψn,Ψ0)+PnLG0(Ψn,Ψ0)\displaystyle=(P_{0}-P_{n})L_{G_{0}}(\Psi_{n},\Psi_{0})+P_{n}L_{G_{0}}(\Psi_{n},\Psi_{0})
=Op(n2/3(logn)4(p1)/3)+PnLGn(Ψn,Ψ0)+Pn{LG0(Ψn,Ψ0)LGn(Ψn,Ψ0)}\displaystyle=O_{p}(n^{-2/3}(\log n)^{4(p-1)/3})+P_{n}L_{G_{n}}(\Psi_{n},\Psi_{0})+P_{n}\{L_{G_{0}}(\Psi_{n},\Psi_{0})-L_{G_{n}}(\Psi_{n},\Psi_{0})\}
Op(n2/3(logn)4(p1)/3)+(PnP0){LG0(Ψn,Ψ0)LGn(Ψn,Ψ0)}\displaystyle\leq O_{p}(n^{-2/3}(\log n)^{4(p-1)/3})+(P_{n}-P_{0})\{L_{G_{0}}(\Psi_{n},\Psi_{0})-L_{G_{n}}(\Psi_{n},\Psi_{0})\}
+P0{LG0(Ψn,Ψ0)LGn(Ψn,Ψ0)}\displaystyle\hskip 216.81pt+P_{0}\{L_{G_{0}}(\Psi_{n},\Psi_{0})-L_{G_{n}}(\Psi_{n},\Psi_{0})\}
=P0{LG0(Ψn,Ψ0)LGn(Ψn,Ψ0)}+Op(n2/3(logn)4(p1)/3)\displaystyle=P_{0}\{L_{G_{0}}(\Psi_{n},\Psi_{0})-L_{G_{n}}(\Psi_{n},\Psi_{0})\}+O_{p}(n^{-2/3}(\log n)^{4(p-1)/3})

Moreover, using Lemma S5 we have

d0(Ψn,Ψ0)\displaystyle d_{0}(\Psi_{n},\Psi_{0}) C{d01(Gn,G0)}1/2{d0(Ψn,Ψ0)}1/2+Op(n2/3(logn)4(p1)/3)\displaystyle\leq C\left\{d_{01}(G_{n},G_{0})\right\}^{1/2}\left\{d_{0}(\Psi_{n},\Psi_{0})\right\}^{1/2}+O_{p}(n^{-2/3}(\log n)^{4(p-1)/3})
=Op(n2/3(logn)4(p1)/3)+Op(n1/3(logn)2(p1)/3){d0(Ψn,Ψ0)}1/2\displaystyle=O_{p}(n^{-2/3}(\log n)^{4(p-1)/3})+O_{p}(n^{-1/3}(\log n)^{2(p-1)/3})\left\{d_{0}(\Psi_{n},\Psi_{0})\right\}^{1/2}

where $C$ is a positive finite constant. The last equality follows from the convergence rate of the HAL minimum loss-based estimator of $g_{0}$. Letting $r_{1}(n)=O_{p}(n^{-2/3}(\log n)^{4(p-1)/3})$ and $r_{2}(n)=O_{p}(n^{-1/3}(\log n)^{2(p-1)/3})$, the inequality above is equivalent to

{d0(Ψn,Ψ0)}1/2{4r1(n)+r22(n)}1/2+r2(n)2\left\{d_{0}(\Psi_{n},\Psi_{0})\right\}^{1/2}\leq\frac{\{4r_{1}(n)+r_{2}^{2}(n)\}^{1/2}+r_{2}(n)}{2}

which implies that d0(Ψn,Ψ0)=Op(n2/3(logn)4(p1)/3)d_{0}(\Psi_{n},\Psi_{0})=O_{p}(n^{-2/3}(\log n)^{4(p-1)/3}).

Appendix S4 Proof of Theorem 4.2: Asymptotic linearity of the regimen-response curve estimator

We represent the difference between Ψnh(𝒗0,θ)\Psi_{nh}(\bm{v}_{0},\theta) and Ψ0(𝒗0,θ)\Psi_{0}(\bm{v}_{0},\theta) as

Ψnh(𝒗0,θ)Ψ0(𝒗0,θ)={Ψ0h(𝒗0,θ)Ψ0(𝒗0,θ)}+{Ψnh(𝒗0,θ)Ψ0h(𝒗0,θ)}.\Psi_{nh}(\bm{v}_{0},\theta)-\Psi_{0}(\bm{v}_{0},\theta)=\{\Psi_{0h}(\bm{v}_{0},\theta)-\Psi_{0}(\bm{v}_{0},\theta)\}+\{\Psi_{nh}(\bm{v}_{0},\theta)-\Psi_{0h}(\bm{v}_{0},\theta)\}.

Moreover, we can write

Ψnh(𝒗0,θ)Ψ0h(𝒗0,θ)=\displaystyle\Psi_{nh}(\bm{v}_{0},\theta)-\Psi_{0h}(\bm{v}_{0},\theta)= PnKh,𝒗0Ψn(1PnKh,𝒗01P0Kh,𝒗0)\displaystyle P_{n}K_{h,\bm{v}_{0}}\Psi_{n}\left(\frac{1}{P_{n}K_{h,\bm{v}_{0}}}-\frac{1}{P_{0}K_{h,\bm{v}_{0}}}\right) (S32)
+(PnP0)Kh,𝒗0P0Kh,𝒗0(ΨnΨ0)+(PnP0)Kh,𝒗0P0Kh,𝒗0(Ψ0)\displaystyle+(P_{n}-P_{0})\frac{K_{h,\bm{v}_{0}}}{P_{0}K_{h,\bm{v}_{0}}}(\Psi_{n}-\Psi_{0})+(P_{n}-P_{0})\frac{K_{h,\bm{v}_{0}}}{P_{0}K_{h,\bm{v}_{0}}}(\Psi_{0})
+P0Kh,𝒗0P0Kh,𝒗0(ΨnΨ0).\displaystyle+P_{0}\frac{K_{h,\bm{v}_{0}}}{P_{0}K_{h,\bm{v}_{0}}}(\Psi_{n}-\Psi_{0}).

S4.1 Convergence rate of Ψ0h(v0,θ)Ψ0(v0,θ)\Psi_{0h}(v_{0},\theta)-\Psi_{0}(v_{0},\theta)

We consider a $(J-1)$-orthogonal kernel with bandwidth $h$ centered at $\bm{v}_{0}$. Then, following Lemma 25.1 in van der Laan and Rose (2018), as $h\rightarrow 0$,

Ψ0h(𝒗0,θ)Ψ0(𝒗0,θ)=hJB0(J,𝒗0),\Psi_{0h}(\bm{v}_{0},\theta)-\Psi_{0}(\bm{v}_{0},\theta)=h^{J}B_{0}(J,\bm{v}_{0}),

where

B0(J,𝒗0)={η{0,,J}d:lηl=J}Ψ0(𝒗0)ηsk(s)lslηllηl!𝑑s.B_{0}(J,\bm{v}_{0})=\sum_{\{\eta\in\{0,\cdots,J\}^{d}:\sum_{l}\eta_{l}=J\}}\Psi_{0}(\bm{v}_{0})^{\eta}\int_{s}k(s)\frac{\prod_{l}s_{l}^{\eta_{l}}}{\prod_{l}\eta_{l}!}ds.
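As a concrete illustration of the $(J-1)$-orthogonal kernel assumed in this subsection, the snippet below numerically checks the moment conditions of a fourth-order ($J=4$) Epanechnikov-type kernel. The particular polynomial form is a standard construction chosen here for illustration only, and in the multivariate case one would use a product of such univariate kernels; neither choice is prescribed by this paper.

import numpy as np

def epanechnikov_order4(u):
    """Fourth-order Epanechnikov-type kernel on [-1, 1]: it integrates to one and
    its first three moments vanish, so the smoothing bias above is O(h^J) with
    J = 4 (illustrative choice)."""
    k = (15.0 / 32.0) * (3.0 - 10.0 * u**2 + 7.0 * u**4)
    return np.where(np.abs(u) <= 1.0, k, 0.0)

# numerical check of the moment conditions on a fine grid over [-1, 1]
u = np.linspace(-1.0, 1.0, 200001)
du = u[1] - u[0]
w = epanechnikov_order4(u)
for j in range(5):
    print(j, round(float(np.sum(u**j * w) * du), 4))   # approx: 1, 0, 0, 0, -0.0476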

S4.2 Asymptotic behaviour of the first term on the RHS of (S32)

Because, (PnP0)Kh,𝒗0=Op(n1/2hr/2)(P_{n}-P_{0})K_{h,\bm{v}_{0}}=O_{p}(n^{-1/2}h^{-r/2}),

PnKh,𝒗0Ψn(1PnKh,𝒗01P0Kh,𝒗0)=\displaystyle P_{n}K_{h,\bm{v}_{0}}\Psi_{n}\left(\frac{1}{P_{n}K_{h,\bm{v}_{0}}}-\frac{1}{P_{0}K_{h,\bm{v}_{0}}}\right)= (PnP0)Kh,𝒗0Ψn(1PnKh,𝒗01P0Kh,𝒗0)\displaystyle(P_{n}-P_{0})K_{h,\bm{v}_{0}}\Psi_{n}\left(\frac{1}{P_{n}K_{h,\bm{v}_{0}}}-\frac{1}{P_{0}K_{h,\bm{v}_{0}}}\right)
+P0Kh,𝒗0Ψn(1PnKh,𝒗01P0Kh,𝒗0)\displaystyle+P_{0}K_{h,\bm{v}_{0}}\Psi_{n}\left(\frac{1}{P_{n}K_{h,\bm{v}_{0}}}-\frac{1}{P_{0}K_{h,\bm{v}_{0}}}\right)
=\displaystyle= P0Kh,𝒗0Ψn((PnP0)Kh,𝒗0PnKh,𝒗0P0Kh,𝒗0)+op(n1/2hr/2)\displaystyle P_{0}K_{h,\bm{v}_{0}}\Psi_{n}\left(\frac{-(P_{n}-P_{0})K_{h,\bm{v}_{0}}}{P_{n}K_{h,\bm{v}_{0}}P_{0}K_{h,\bm{v}_{0}}}\right)+o_{p}(n^{-1/2}h^{-r/2})
=\displaystyle= P0Kh,𝒗0Ψn((PnP0)Kh,𝒗0P02Kh,𝒗0)\displaystyle P_{0}K_{h,\bm{v}_{0}}\Psi_{n}\left(\frac{-(P_{n}-P_{0})K_{h,\bm{v}_{0}}}{P_{0}^{2}K_{h,\bm{v}_{0}}}\right)
+P0Kh,𝒗0Ψn({(PnP0)Kh,𝒗0}2PnKh,𝒗0P0Kh,𝒗0)+op(n1/2hr/2)\displaystyle+P_{0}K_{h,\bm{v}_{0}}\Psi_{n}\left(\frac{-\{(P_{n}-P_{0})K_{h,\bm{v}_{0}}\}^{2}}{P_{n}K_{h,\bm{v}_{0}}P_{0}K_{h,\bm{v}_{0}}}\right)+o_{p}(n^{-1/2}h^{-r/2})
=(PnP0)Kh,𝒗0P0Kh,𝒗0P0Kh,𝒗0P0Kh,𝒗0(ΨnΨ0)+(PnP0)Kh,𝒗0P0Kh,𝒗0Ψ0h\displaystyle=\frac{-(P_{n}-P_{0})K_{h,\bm{v}_{0}}}{P_{0}K_{h,\bm{v}_{0}}}P_{0}\frac{K_{h,\bm{v}_{0}}}{P_{0}K_{h,\bm{v}_{0}}}(\Psi_{n}-\Psi_{0})+\frac{-(P_{n}-P_{0})K_{h,\bm{v}_{0}}}{P_{0}K_{h,\bm{v}_{0}}}\Psi_{0h}
+op(n1/2hr/2)\displaystyle+o_{p}(n^{-1/2}h^{-r/2})
=(PnP0)Kh,𝒗0P0Kh,𝒗0Ψ0h+op(n1/2hr/2)\displaystyle=\frac{-(P_{n}-P_{0})K_{h,\bm{v}_{0}}}{P_{0}K_{h,\bm{v}_{0}}}\Psi_{0h}+o_{p}(n^{-1/2}h^{-r/2})

The second and third equalities follow because $\{(P_{n}-P_{0})K_{h,\bm{v}_{0}}\}^{2}=O_{p}(n^{-1}h^{-r})$, and thus these terms are of order $o_{p}(n^{-1/2}h^{-r/2})$. In the last equality we have also used the consistency of $\Psi_{n}$.

S4.3 Asymptotic negligibility of the second term on the RHS of (S32)

We will show that the second order term (PnP0)Kh,𝒗0(ΨnΨ0)=op((nhr)1/2)(P_{n}-P_{0})K_{h,\bm{v}_{0}}(\Psi_{n}-\Psi_{0})=o_{p}\left((nh^{r})^{-1/2}\right). Let K~h,𝒗0(v)=hrKh,𝒗0(v)\tilde{K}_{h,\bm{v}_{0}}(v)=h^{r}K_{h,\bm{v}_{0}}(v). Define Hn,h(m)=Kh,𝒗0(mΨ0)H_{n,h}(m)=K_{h,\bm{v}_{0}}(m-\Psi_{0}) and hrH~n,h(m)=Hn,h(m)h^{-r}\tilde{H}_{n,h}(m)=H_{n,h}(m). The function H~n(m)\tilde{H}_{n}(m) belongs to a class of function n={H~n,h(m):h}\mathcal{F}_{n}=\{\tilde{H}_{n,h}(m):h\}. We have

H~n(Ψn)P0\displaystyle\|\tilde{H}_{n}(\Psi_{n})\|_{P_{0}} K~h,𝒗0(v)P0(ΨnΨ0)P0\displaystyle\leq\|\tilde{K}_{h,\bm{v}_{0}}(v)\|_{P_{0}}\|(\Psi_{n}-\Psi_{0})\|_{P_{0}}
=Op(hr/2)Op(n1/3).\displaystyle=O_{p}(h^{r/2})O_{p}(n^{-1/3}).

The rest follows from Lemma S3. Therefore, $(P_{n}-P_{0})\frac{K_{h,\bm{v}_{0}}}{P_{0}K_{h,\bm{v}_{0}}}(\Psi_{n}-\Psi_{0})=o_{p}\left((nh^{r})^{-1/2}\right)$ when the bandwidth $h_{n}$ satisfies $h^{r}\rightarrow 0$ and $nh^{3r/2}\rightarrow\infty$.

S4.4 Asymptotic behaviour of the fourth term on the RHS of (S32)

The fourth term can be written as

P0Kh,𝒗0P0Kh,𝒗0(ΨnΨ0)=\displaystyle P_{0}\frac{K_{h,\bm{v}_{0}}}{P_{0}K_{h,\bm{v}_{0}}}(\Psi_{n}-\Psi_{0})= P0Kh,𝒗0ΔcI(A=dθ)G0P0Kh,𝒗0(ΨnΨ0)\displaystyle P_{0}\frac{K_{h,\bm{v}_{0}}\Delta^{c}I(A=d^{\theta})}{G_{0}P_{0}K_{h,\bm{v}_{0}}}(\Psi_{n}-\Psi_{0})
=\displaystyle= P0Kh,𝒗0ΔcI(A=dθ)G0P0Kh,𝒗0{ΨnI(Y>t)}\displaystyle P_{0}\frac{K_{h,\bm{v}_{0}}\Delta^{c}I(A=d^{\theta})}{G_{0}P_{0}K_{h,\bm{v}_{0}}}\{\Psi_{n}-I(Y>t)\}
=\displaystyle= (PnP0)Kh,𝒗0ΔcI(A=dθ)G0P0Kh,𝒗0{I(Y>t)Ψ0}\displaystyle(P_{n}-P_{0})\frac{K_{h,\bm{v}_{0}}\Delta^{c}I(A=d^{\theta})}{G_{0}P_{0}K_{h,\bm{v}_{0}}}\{I(Y>t)-\Psi_{0}\}
PnKh,𝒗0ΔcI(A=dθ)G0P0Kh,𝒗0{I(Yt)Ψn}+op((nhr)1/2).\displaystyle-P_{n}\frac{K_{h,\bm{v}_{0}}\Delta^{c}I(A=d^{\theta})}{G_{0}P_{0}K_{h,\bm{v}_{0}}}\{I(Y\geq t)-\Psi_{n}\}+o_{p}\left((nh^{r})^{-1/2}\right).

The third equality follows from Section S4.3.

The last term on the RHS can be represented as

PnKh,𝒗0ΔcI(A=dθ)G0P0Kh,𝒗0{I(Yt)Ψn}\displaystyle P_{n}\frac{K_{h,\bm{v}_{0}}\Delta^{c}I(A=d^{\theta})}{G_{0}P_{0}K_{h,\bm{v}_{0}}}\{I(Y\geq t)-\Psi_{n}\} =PnKh,𝒗0ΔcI(A=dθ)GnP0Kh,𝒗0{I(Yt)Ψn}\displaystyle=P_{n}\frac{K_{h,\bm{v}_{0}}\Delta^{c}I(A=d^{\theta})}{G_{n}P_{0}K_{h,\bm{v}_{0}}}\{I(Y\geq t)-\Psi_{n}\}
+PnKh,𝒗0ΔcI(A=dθ)P0Kh,𝒗0{I(Yt)Ψn}(GnG0G0Gn)\displaystyle+P_{n}\frac{K_{h,\bm{v}_{0}}\Delta^{c}I(A=d^{\theta})}{P_{0}K_{h,\bm{v}_{0}}}\{I(Y\geq t)-\Psi_{n}\}\left(\frac{G_{n}-G_{0}}{G_{0}G_{n}}\right)
=PnKh,𝒗0ΔcI(A=dθ)GnP0Kh,𝒗0{I(Yt)Ψn}+\displaystyle=P_{n}\frac{K_{h,\bm{v}_{0}}\Delta^{c}I(A=d^{\theta})}{G_{n}P_{0}K_{h,\bm{v}_{0}}}\{I(Y\geq t)-\Psi_{n}\}+
(PnP0)Kh,𝒗0ΔcI(A=dθ)P0Kh,𝒗0{I(Yt)Ψn}(GnG0G0Gn)\displaystyle(P_{n}-P_{0})\frac{K_{h,\bm{v}_{0}}\Delta^{c}I(A=d^{\theta})}{P_{0}K_{h,\bm{v}_{0}}}\{I(Y\geq t)-\Psi_{n}\}\left(\frac{G_{n}-G_{0}}{G_{0}G_{n}}\right)
+P0Kh,𝒗0ΔcI(A=dθ)P0Kh,𝒗0{I(Yt)Ψn}(GnG0G0Gn).\displaystyle+P_{0}\frac{K_{h,\bm{v}_{0}}\Delta^{c}I(A=d^{\theta})}{P_{0}K_{h,\bm{v}_{0}}}\{I(Y\geq t)-\Psi_{n}\}\left(\frac{G_{n}-G_{0}}{G_{0}G_{n}}\right).

Using Lemma S3 and similar techniques used in Section S4.3, we have (PnP0)Kh,𝒗0ΔcI(A=dθ)P0Kh,𝒗0{I(Yt)Ψn}(GnG0G0Gn)=op((nhr)1/2)(P_{n}-P_{0})\frac{K_{h,\bm{v}_{0}}\Delta^{c}I(A=d^{\theta})}{P_{0}K_{h,\bm{v}_{0}}}\{I(Y\geq t)-\Psi_{n}\}\left(\frac{G_{n}-G_{0}}{G_{0}G_{n}}\right)=o_{p}\left((nh^{r})^{-1/2}\right).

Let Q0=E{I(Yt)|A=dθ,𝑾}Q_{0}=E\{I(Y\geq t)|A=d^{\theta},\bm{W}\}. Then

P0Kh,𝒗0ΔcI(A=dθ)P0Kh,𝒗0{I(Yt)Ψn}(GnG0G0Gn)=\displaystyle P_{0}\frac{K_{h,\bm{v}_{0}}\Delta^{c}I(A=d^{\theta})}{P_{0}K_{h,\bm{v}_{0}}}\{I(Y\geq t)-\Psi_{n}\}\left(\frac{G_{n}-G_{0}}{G_{0}G_{n}}\right)= P0Kh,𝒗0ΔcI(A=dθ)P0Kh,𝒗0(Q0Ψn)(GnG0G0Gn)\displaystyle P_{0}\frac{K_{h,\bm{v}_{0}}\Delta^{c}I(A=d^{\theta})}{P_{0}K_{h,\bm{v}_{0}}}(Q_{0}-\Psi_{n})\left(\frac{G_{n}-G_{0}}{G_{0}G_{n}}\right)
=\displaystyle= P0Kh,𝒗0G0P0Kh,𝒗0(Q0Ψn)(G0GnG02)\displaystyle-P_{0}\frac{K_{h,\bm{v}_{0}}G_{0}}{P_{0}K_{h,\bm{v}_{0}}}(Q_{0}-\Psi_{n})\left(\frac{G_{0}-G_{n}}{G_{0}^{2}}\right)
+P0Kh,𝒗0G0P0Kh,𝒗0(Q0Ψn)((G0Gn)2GnG02).\displaystyle+P_{0}\frac{K_{h,\bm{v}_{0}}G_{0}}{P_{0}K_{h,\bm{v}_{0}}}(Q_{0}-\Psi_{n})\left(\frac{(G_{0}-G_{n})^{2}}{G_{n}G_{0}^{2}}\right).

Using the rate of convergence of the highly adaptive lasso estimator, $P_{0}\frac{K_{h,\bm{v}_{0}}G_{0}}{P_{0}K_{h,\bm{v}_{0}}}(Q_{0}-\Psi_{n})\left(\frac{(G_{0}-G_{n})^{2}}{G_{n}G_{0}^{2}}\right)=o_{p}\left((nh^{r})^{-1/2}\right)$. Moreover,

P0Kh,𝒗0(Q0Ψn)P0Kh,𝒗0(G0GnG0)\displaystyle P_{0}\frac{K_{h,\bm{v}_{0}}(Q_{0}-\Psi_{n})}{P_{0}K_{h,\bm{v}_{0}}}\left(\frac{G_{0}-G_{n}}{G_{0}}\right) =P0Kh,𝒗0(Q0Ψn)P0Kh,𝒗0(G0cG0aGncGnaG0cG0a)\displaystyle=P_{0}\frac{K_{h,\bm{v}_{0}}(Q_{0}-\Psi_{n})}{P_{0}K_{h,\bm{v}_{0}}}\left(\frac{G_{0}^{c}G_{0}^{a}-G_{n}^{c}G_{n}^{a}}{G_{0}^{c}G_{0}^{a}}\right)
=P0Kh,𝒗0(Q0Ψn)P0Kh,𝒗0G0a(G0cGncG0cG0a)+P0Kh,𝒗0(Q0Ψn)P0Kh,𝒗0Gnc(G0aGnaG0cG0a)\displaystyle=P_{0}\frac{K_{h,\bm{v}_{0}}(Q_{0}-\Psi_{n})}{P_{0}K_{h,\bm{v}_{0}}}G_{0}^{a}\left(\frac{G_{0}^{c}-G_{n}^{c}}{G_{0}^{c}G_{0}^{a}}\right)+P_{0}\frac{K_{h,\bm{v}_{0}}(Q_{0}-\Psi_{n})}{P_{0}K_{h,\bm{v}_{0}}}G_{n}^{c}\left(\frac{G_{0}^{a}-G_{n}^{a}}{G_{0}^{c}G_{0}^{a}}\right)
=(PnP0)Kh,𝒗0I(A=dθ)G0aP0Kh,𝒗0(Q0Ψ0)(ΔcG0cG0c)\displaystyle=-(P_{n}-P_{0})\frac{K_{h,\bm{v}_{0}}I(A=d^{\theta})}{G_{0}^{a}P_{0}K_{h,\bm{v}_{0}}}(Q_{0}-\Psi_{0})\left(\frac{\Delta^{c}-G_{0}^{c}}{G_{0}^{c}}\right)
(PnP0)Kh,𝒗0(Q0Ψ0)P0Kh,𝒗0(I(A=dθ)G0aG0a)\displaystyle\hskip 14.45377pt-(P_{n}-P_{0})\frac{K_{h,\bm{v}_{0}}(Q_{0}-\Psi_{0})}{P_{0}K_{h,\bm{v}_{0}}}\left(\frac{I(A=d^{\theta})-G_{0}^{a}}{G_{0}^{a}}\right)
+PnKh,𝒗0I(A=dθ)G0aP0Kh,𝒗0(Q0Ψn)(ΔcGncG0c)\displaystyle\hskip 14.45377pt+P_{n}\frac{K_{h,\bm{v}_{0}}I(A=d^{\theta})}{G_{0}^{a}P_{0}K_{h,\bm{v}_{0}}}(Q_{0}-\Psi_{n})\left(\frac{\Delta^{c}-G_{n}^{c}}{G_{0}^{c}}\right)
+PnKh,𝒗0(Q0Ψn)P0Kh,𝒗0Gnc(I(A=dθ)GnaG0cG0a)+op((nhr)1/2).\displaystyle\hskip 14.45377pt+P_{n}\frac{K_{h,\bm{v}_{0}}(Q_{0}-\Psi_{n})}{P_{0}K_{h,\bm{v}_{0}}}G_{n}^{c}\left(\frac{I(A=d^{\theta})-G_{n}^{a}}{G_{0}^{c}G_{0}^{a}}\right)+o_{p}\left((nh^{r})^{-1/2}\right).

The last equality follows because

(PnP0)Kh,𝒗0I(A=dθ)G0aP0Kh,𝒗0(Q0Ψn)(ΔcGncG0c)\displaystyle(P_{n}-P_{0})\frac{K_{h,\bm{v}_{0}}I(A=d^{\theta})}{G_{0}^{a}P_{0}K_{h,\bm{v}_{0}}}(Q_{0}-\Psi_{n})\left(\frac{\Delta^{c}-G_{n}^{c}}{G_{0}^{c}}\right) =(PnP0)Kh,𝒗0I(A=dθ)G0aP0Kh,𝒗0(Q0Ψ0)(ΔcG0cG0c)\displaystyle=(P_{n}-P_{0})\frac{K_{h,\bm{v}_{0}}I(A=d^{\theta})}{G_{0}^{a}P_{0}K_{h,\bm{v}_{0}}}(Q_{0}-\Psi_{0})\left(\frac{\Delta^{c}-G_{0}^{c}}{G_{0}^{c}}\right)
+op((nhr)1/2),\displaystyle\hskip 180.67499pt+o_{p}\left((nh^{r})^{-1/2}\right),
(PnP0)Kh,𝒗0(Q0Ψn)P0Kh,𝒗0Gnc(I(A=dθ)GnaG0aG0c)\displaystyle(P_{n}-P_{0})\frac{K_{h,\bm{v}_{0}}(Q_{0}-\Psi_{n})}{P_{0}K_{h,\bm{v}_{0}}}G_{n}^{c}\left(\frac{I(A=d^{\theta})-G_{n}^{a}}{G_{0}^{a}G_{0}^{c}}\right) =(PnP0)Kh,𝒗0(Q0Ψ0)P0Kh,𝒗0(I(A=dθ)G0aG0a)\displaystyle=(P_{n}-P_{0})\frac{K_{h,\bm{v}_{0}}(Q_{0}-\Psi_{0})}{P_{0}K_{h,\bm{v}_{0}}}\left(\frac{I(A=d^{\theta})-G_{0}^{a}}{G_{0}^{a}}\right)
+op((nhr)1/2).\displaystyle\hskip 180.67499pt+o_{p}\left((nh^{r})^{-1/2}\right).

Moreover, by Lemma S4, we have

PnKh,𝒗0I(A=dθ)G0aP0Kh,𝒗0(Q0Ψn)(ΔcGncG0c)\displaystyle P_{n}\frac{K_{h,\bm{v}_{0}}I(A=d^{\theta})}{G_{0}^{a}P_{0}K_{h,\bm{v}_{0}}}(Q_{0}-\Psi_{n})\left(\frac{\Delta^{c}-G_{n}^{c}}{G_{0}^{c}}\right) =PnKh,𝒗0I(A=dθ)GnaP0Kh,𝒗0(QnΨn)(ΔcGncGnc)+op((nhr)1/2)\displaystyle=P_{n}\frac{K_{h,\bm{v}_{0}}I(A=d^{\theta})}{G_{n}^{a}P_{0}K_{h,\bm{v}_{0}}}(Q_{n}-\Psi_{n})\left(\frac{\Delta^{c}-G_{n}^{c}}{G_{n}^{c}}\right)+o_{p}\left((nh^{r})^{-1/2}\right)
PnKh,𝒗0(Q0Ψn)P0Kh,𝒗0Gnc(I(A=dθ)GnaG0cG0a)\displaystyle P_{n}\frac{K_{h,\bm{v}_{0}}(Q_{0}-\Psi_{n})}{P_{0}K_{h,\bm{v}_{0}}}G_{n}^{c}\left(\frac{I(A=d^{\theta})-G_{n}^{a}}{G_{0}^{c}G_{0}^{a}}\right) =PnKh,𝒗0(QnΨn)PnKh,𝒗0(I(A=dθ)GnaGna)+op((nhr)1/2)\displaystyle=P_{n}\frac{K_{h,\bm{v}_{0}}(Q_{n}-\Psi_{n})}{P_{n}K_{h,\bm{v}_{0}}}\left(\frac{I(A=d^{\theta})-G_{n}^{a}}{G_{n}^{a}}\right)+o_{p}\left((nh^{r})^{-1/2}\right)

Hence, if we undersmooth GnaG_{n}^{a} and GncG_{n}^{c} such that PnKh,𝒗0I(A=dθ)GnaP0Kh,𝒗0(QnΨn)(ΔcGncGnc)=op((nhr)1/2)P_{n}\frac{K_{h,\bm{v}_{0}}I(A=d^{\theta})}{G_{n}^{a}P_{0}K_{h,\bm{v}_{0}}}(Q_{n}-\Psi_{n})\left(\frac{\Delta^{c}-G_{n}^{c}}{G_{n}^{c}}\right)=o_{p}\left((nh^{r})^{-1/2}\right) and PnKh,𝒗0(QnΨn)PnKh,𝒗0(I(A=dθ)GnaGna)=op((nhr)1/2)P_{n}\frac{K_{h,\bm{v}_{0}}(Q_{n}-\Psi_{n})}{P_{n}K_{h,\bm{v}_{0}}}\left(\frac{I(A=d^{\theta})-G_{n}^{a}}{G_{n}^{a}}\right)=o_{p}\left((nh^{r})^{-1/2}\right), then

P0Kh,𝒗0(Q0Ψn)P0Kh,𝒗0(G0GnG0)\displaystyle P_{0}\frac{K_{h,\bm{v}_{0}}(Q_{0}-\Psi_{n})}{P_{0}K_{h,\bm{v}_{0}}}\left(\frac{G_{0}-G_{n}}{G_{0}}\right) =(PnP0)Kh,𝒗0I(A=dθ)G0aP0Kh,𝒗0(Q0Ψ0)(ΔcG0cG0c)\displaystyle=-(P_{n}-P_{0})\frac{K_{h,\bm{v}_{0}}I(A=d^{\theta})}{G_{0}^{a}P_{0}K_{h,\bm{v}_{0}}}(Q_{0}-\Psi_{0})\left(\frac{\Delta^{c}-G_{0}^{c}}{G_{0}^{c}}\right)
(PnP0)Kh,𝒗0(Q0Ψ0)P0Kh,𝒗0(I(A=dθ)G0aG0a)+op((nhr)1/2).\displaystyle\hskip 14.45377pt-(P_{n}-P_{0})\frac{K_{h,\bm{v}_{0}}(Q_{0}-\Psi_{0})}{P_{0}K_{h,\bm{v}_{0}}}\left(\frac{I(A=d^{\theta})-G_{0}^{a}}{G_{0}^{a}}\right)+o_{p}\left((nh^{r})^{-1/2}\right).

S4.5 The influence function.

Gathering all the terms, we have,

Ψnh(𝒗0,θ)Ψ0(𝒗0,θ)=\displaystyle\Psi_{nh}(\bm{v}_{0},\theta)-\Psi_{0}(\bm{v}_{0},\theta)= (PnP0)Kh,𝒗0ΔcI(A=dθ)G0P0Kh,𝒗0I(Yt)\displaystyle(P_{n}-P_{0})\frac{K_{h,\bm{v}_{0}}\Delta^{c}I(A=d^{\theta})}{G_{0}P_{0}K_{h,\bm{v}_{0}}}I(Y\geq t)
(PnP0)Kh,𝒗0I(A=dθ)G0aP0Kh,𝒗0Q0(ΔcG0cG0c)\displaystyle-(P_{n}-P_{0})\frac{K_{h,\bm{v}_{0}}I(A=d^{\theta})}{G_{0}^{a}P_{0}K_{h,\bm{v}_{0}}}Q_{0}\left(\frac{\Delta^{c}-G_{0}^{c}}{G_{0}^{c}}\right)
(PnP0)Kh,𝒗0Q0P0Kh,𝒗0(I(A=dθ)G0aG0a)\displaystyle-(P_{n}-P_{0})\frac{K_{h,\bm{v}_{0}}Q_{0}}{P_{0}K_{h,\bm{v}_{0}}}\left(\frac{I(A=d^{\theta})-G_{0}^{a}}{G_{0}^{a}}\right)
(PnP0)Kh,𝒗0P0Kh,𝒗0Ψ0h(𝒗0,θ)+hJB0(J)+op((nhr)1/2).\displaystyle-(P_{n}-P_{0})\frac{K_{h,\bm{v}_{0}}}{P_{0}K_{h,\bm{v}_{0}}}\Psi_{0h}(\bm{v}_{0},\theta)+h^{J}B_{0}(J)+o_{p}\left((nh^{r})^{-1/2}\right).

Thus, asymptotic normality follows from a choice of bandwidth such that $(nh^{r})^{1/2}h^{J}=O(1)$, which implies $h_{n}=n^{-1/(2J+r)}$. In particular, $(nh^{r})^{1/2}\{\Psi_{nh}(\bm{v}_{0},\theta)-\Psi_{0}(\bm{v}_{0},\theta)\}$ converges to a mean zero normal random variable when $\frac{h_{n}}{n^{-1/(2J+r)}}\rightarrow 0$ (i.e., $h_{n}$ converges to zero faster than $n^{-1/(2J+r)}$).

S4.6 The optimal bandwidth rate.

The rate constraints obtained in Lemma S3 (i.e., h>n23rh>n^{-\frac{2}{3r}}) and Section S4.5 (i.e., h<n1r+2Jh<n^{-\frac{1}{r+2J}}) imply that the optimal bandwidth rate is hn=n1r+2Jh_{n}=n^{-\frac{1}{r+2J}}. Note that for the constraint n23r<h<n1r+2Jn^{-\frac{2}{3r}}<h<n^{-\frac{1}{r+2J}} to hold we must have J>r/4J>r/4 which is a very mild condition. For example, for r<4r<4, we can still pick J=1J=1.
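As a numerical illustration of this window (an example, not an additional requirement), take $r=2$ effect modifiers and a kernel of order $J=2$. Then

n^{-\frac{2}{3r}}=n^{-1/3}<h<n^{-\frac{1}{r+2J}}=n^{-1/6},\qquad h_{n}=n^{-1/6},\qquad J=2>r/4=1/2,

so, for instance, with $n=10{,}000$ the admissible bandwidths range from roughly $10^{-4/3}\approx 0.046$ to $10^{-2/3}\approx 0.22$.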

Appendix S5 Proof of Theorem 4.3: Asymptotic linearity of the regimen-response curve estimator for a fixed bandwidth hh

Ψnh(𝒗0,θ)Ψ0h(𝒗0,θ)=\displaystyle\Psi_{nh}(\bm{v}_{0},\theta)-\Psi_{0h}(\bm{v}_{0},\theta)= PnKh,𝒗0Ψn(1PnKh,𝒗01P0Kh,𝒗0)\displaystyle P_{n}K_{h,\bm{v}_{0}}\Psi_{n}\left(\frac{1}{P_{n}K_{h,\bm{v}_{0}}}-\frac{1}{P_{0}K_{h,\bm{v}_{0}}}\right) (S33)
+(PnP0)Kh,𝒗0P0Kh,𝒗0(ΨnΨ0)+(PnP0)Kh,𝒗0P0Kh,𝒗0(Ψ0)\displaystyle+(P_{n}-P_{0})\frac{K_{h,\bm{v}_{0}}}{P_{0}K_{h,\bm{v}_{0}}}(\Psi_{n}-\Psi_{0})+(P_{n}-P_{0})\frac{K_{h,\bm{v}_{0}}}{P_{0}K_{h,\bm{v}_{0}}}(\Psi_{0})
+P0Kh,𝒗0P0Kh,𝒗0(ΨnΨ0).\displaystyle+P_{0}\frac{K_{h,\bm{v}_{0}}}{P_{0}K_{h,\bm{v}_{0}}}(\Psi_{n}-\Psi_{0}).

S5.1 Asymptotic behaviour of the first term on the RHS of (S33)

Following the same steps used in the proof of Theorem 4.2, by consistency of $\Psi_{n}$ and because, for a fixed $h$, $(P_{n}-P_{0})K_{h,\bm{v}_{0}}=O_{p}(n^{-1/2})$, we have

PnKh,𝒗0Ψn(1PnKh,𝒗01P0Kh,𝒗0)=(PnP0)Kh,𝒗0P0Kh,𝒗0Ψ0h(𝒗0,θ)+op(n1/2).\displaystyle P_{n}K_{h,\bm{v}_{0}}\Psi_{n}\left(\frac{1}{P_{n}K_{h,\bm{v}_{0}}}-\frac{1}{P_{0}K_{h,\bm{v}_{0}}}\right)=\frac{-(P_{n}-P_{0})K_{h,\bm{v}_{0}}}{P_{0}K_{h,\bm{v}_{0}}}\Psi_{0h}(\bm{v}_{0},\theta)+o_{p}(n^{-1/2}).

Under assumptions 1 and 2, the equality above holds uniformly on the space 𝒱\mathcal{V}.

S5.2 Asymptotic negligibility of the second term on the RHS of (S33)

We show that supv𝒱|(PnP0)Kh,vP0Kh,v(ΨnΨ0)|=op(n1/2)\sup_{v\in\mathcal{V}}\left|(P_{n}-P_{0})\frac{K_{h,v}}{P_{0}K_{h,v}}\left(\Psi_{n}-\Psi_{0}\right)\right|=o_{p}(n^{-1/2}). Let fv(m)=mΨ0f_{v}(m)=m-\Psi_{0} and n={fv(m):supv𝒱fv(m)<δn,m𝒟[0,τ],mν<,v𝒱}\mathcal{F}_{n}=\{f_{v}(m):\sup_{v\in\mathcal{V}}\|f_{v}(m)\|<\delta_{n},m\in\mathcal{D}[0,\tau],\|m\|^{*}_{\nu}<\infty,v\in\mathcal{V}\}. Then we have

supv𝒱|(PnP0)Kh,vP0Kh,v(ΨnΨ0)|supv𝒱fv(m)n|(PnP0)Kh,vP0Kh,vfv(m)|\displaystyle\sup_{v\in\mathcal{V}}\left|(P_{n}-P_{0})\frac{K_{h,v}}{P_{0}K_{h,v}}\left(\Psi_{n}-\Psi_{0}\right)\right|\leq\sup_{\begin{subarray}{c}v\in\mathcal{V}\\ f_{v}(m)\in\mathcal{F}_{n}\end{subarray}}\left|(P_{n}-P_{0})\frac{K_{h,v}}{P_{0}K_{h,v}}f_{v}(m)\right|

By consistency of $\Psi_{n}$, we can take $\delta_{n}\rightarrow 0$, which, together with Lemmas S1 and S2, implies that the right-hand side of the inequality above is of order $o_{p}(n^{-1/2})$.

S5.3 Asymptotic behaviour of the fourth term on the RHS of (S33)

Using the same techniques as in the proof of Theorem 4.2, we can show

P0Kh,𝒗0P0Kh,𝒗0(ΨnΨ0)=\displaystyle P_{0}\frac{K_{h,\bm{v}_{0}}}{P_{0}K_{h,\bm{v}_{0}}}(\Psi_{n}-\Psi_{0})= (PnP0)Kh,𝒗0ΔcI(A=dθ)G0P0Kh,𝒗0{I(Y>t)Ψ0}\displaystyle(P_{n}-P_{0})\frac{K_{h,\bm{v}_{0}}\Delta^{c}I(A=d^{\theta})}{G_{0}P_{0}K_{h,\bm{v}_{0}}}\{I(Y>t)-\Psi_{0}\}
PnKh,𝒗0ΔcI(A=dθ)G0P0Kh,𝒗0{I(Yt)Ψn}+op(n1/2).\displaystyle-P_{n}\frac{K_{h,\bm{v}_{0}}\Delta^{c}I(A=d^{\theta})}{G_{0}P_{0}K_{h,\bm{v}_{0}}}\{I(Y\geq t)-\Psi_{n}\}+o_{p}(n^{-1/2}).

This equality holds uniformly over v𝒱v\in\mathcal{V}, because supv𝒱|(PnP0)Kh,vP0Kh,v(ΨnΨ0)|=op(n1/2)\sup_{v\in\mathcal{V}}\left|(P_{n}-P_{0})\frac{K_{h,v}}{P_{0}K_{h,v}}\left(\Psi_{n}-\Psi_{0}\right)\right|=o_{p}(n^{-1/2}) as shown in Section S5.2. Moreover, following the result in Section S4.4,

PnKh,𝒗0ΔcI(A=dθ)G0P0Kh,𝒗0{I(Yt)Ψn}=\displaystyle P_{n}\frac{K_{h,\bm{v}_{0}}\Delta^{c}I(A=d^{\theta})}{G_{0}P_{0}K_{h,\bm{v}_{0}}}\{I(Y\geq t)-\Psi_{n}\}= PnKh,𝒗0ΔcI(A=dθ)GnP0Kh,𝒗0{I(Yt)Ψn}\displaystyle P_{n}\frac{K_{h,\bm{v}_{0}}\Delta^{c}I(A=d^{\theta})}{G_{n}P_{0}K_{h,\bm{v}_{0}}}\{I(Y\geq t)-\Psi_{n}\}
(PnP0)Kh,𝒗0I(A=dθ)G0aP0Kh,𝒗0(Q0Ψ0)(ΔcG0cG0c)\displaystyle-(P_{n}-P_{0})\frac{K_{h,\bm{v}_{0}}I(A=d^{\theta})}{G_{0}^{a}P_{0}K_{h,\bm{v}_{0}}}(Q_{0}-\Psi_{0})\left(\frac{\Delta^{c}-G_{0}^{c}}{G_{0}^{c}}\right)
(PnP0)Kh,𝒗0(Q0Ψ0)P0Kh,𝒗0(I(A=dθ)G0aG0a)\displaystyle-(P_{n}-P_{0})\frac{K_{h,\bm{v}_{0}}(Q_{0}-\Psi_{0})}{P_{0}K_{h,\bm{v}_{0}}}\left(\frac{I(A=d^{\theta})-G_{0}^{a}}{G_{0}^{a}}\right)
+PnKh,𝒗0I(A=dθ)G0aP0Kh,𝒗0(Q0Ψn)(ΔcGncG0c)\displaystyle+P_{n}\frac{K_{h,\bm{v}_{0}}I(A=d^{\theta})}{G_{0}^{a}P_{0}K_{h,\bm{v}_{0}}}(Q_{0}-\Psi_{n})\left(\frac{\Delta^{c}-G_{n}^{c}}{G_{0}^{c}}\right)
+PnKh,𝒗0(Q0Ψn)P0Kh,𝒗0Gnc(I(A=dθ)GnaG0cG0a)+op(n1/2).\displaystyle+P_{n}\frac{K_{h,\bm{v}_{0}}(Q_{0}-\Psi_{n})}{P_{0}K_{h,\bm{v}_{0}}}G_{n}^{c}\left(\frac{I(A=d^{\theta})-G_{n}^{a}}{G_{0}^{c}G_{0}^{a}}\right)+o_{p}(n^{-1/2}).

The equality above also holds uniformly over v𝒱v\in\mathcal{V} because

  • supv𝒱|(PnP0)Kh,𝒗0ΔcI(A=dθ)P0Kh,𝒗0{I(Yt)Ψn}(GnG0G0Gn)|=op(n1/2)\sup_{v\in\mathcal{V}}\left|(P_{n}-P_{0})\frac{K_{h,\bm{v}_{0}}\Delta^{c}I(A=d^{\theta})}{P_{0}K_{h,\bm{v}_{0}}}\{I(Y\geq t)-\Psi_{n}\}\left(\frac{G_{n}-G_{0}}{G_{0}G_{n}}\right)\right|=o_{p}(n^{-1/2}),

  • supv𝒱|P0Kh,𝒗0G0P0Kh,𝒗0(Q0Ψn)((G0Gn)2GnG02)|=op(n1/2)\sup_{v\in\mathcal{V}}\left|P_{0}\frac{K_{h,\bm{v}_{0}}G_{0}}{P_{0}K_{h,\bm{v}_{0}}}(Q_{0}-\Psi_{n})\left(\frac{(G_{0}-G_{n})^{2}}{G_{n}G_{0}^{2}}\right)\right|=o_{p}(n^{-1/2}),

  • supv𝒱|P0Kh,𝒗0{GnG0}(Q0ΨnG0P0Kh,𝒗0QnΨnGnPnKh,𝒗0)|=op(n1/2)\sup_{v\in\mathcal{V}}\left|P_{0}K_{h,\bm{v}_{0}}\{G_{n}-G_{0}\}\left(\frac{Q_{0}-\Psi_{n}}{G_{0}P_{0}K_{h,\bm{v}_{0}}}-\frac{Q_{n}-\Psi_{n}}{G_{n}P_{n}K_{h,\bm{v}_{0}}}\right)\right|=o_{p}(n^{-1/2}).

S5.4 Convergence to a Gaussian process

If we undersmooth GnaG_{n}^{a}, GncG_{n}^{c} and Ψn\Psi_{n} such that supv𝒱|PnKh,𝒗0(Q0Ψn)P0Kh,𝒗0Gnc(I(A=dθ)GnaG0cG0a)|=op(n1/2)\sup_{v\in\mathcal{V}}\left|P_{n}\frac{K_{h,\bm{v}_{0}}(Q_{0}-\Psi_{n})}{P_{0}K_{h,\bm{v}_{0}}}G_{n}^{c}\left(\frac{I(A=d^{\theta})-G_{n}^{a}}{G_{0}^{c}G_{0}^{a}}\right)\right|=o_{p}(n^{-1/2}), supv𝒱|PnKh,𝒗0I(A=dθ)G0aP0Kh,𝒗0(Q0Ψn)(ΔcGncG0c)|=op(n1/2)\sup_{v\in\mathcal{V}}\left|P_{n}\frac{K_{h,\bm{v}_{0}}I(A=d^{\theta})}{G_{0}^{a}P_{0}K_{h,\bm{v}_{0}}}(Q_{0}-\Psi_{n})\left(\frac{\Delta^{c}-G_{n}^{c}}{G_{0}^{c}}\right)\right|=o_{p}(n^{-1/2}), and supv𝒱|PnKh,vΔcI(A=dθ)GnPnKh,v{ΨnI(Y>t)}|=op(n1/2)\sup_{v\in\mathcal{V}}\left|P_{n}\frac{K_{h,v}\Delta^{c}I(A=d^{\theta})}{G_{n}P_{n}K_{h,v}}\{\Psi_{n}-I(Y>t)\}\right|=o_{p}(n^{-1/2}), then

n{Ψnh(𝒗0,θ)Ψ0h(𝒗0,θ)}=n(PnP0)Dv(P0)+op(1)dGP.\displaystyle\sqrt{n}\{\Psi_{nh}(\bm{v}_{0},\theta)-\Psi_{0h}(\bm{v}_{0},\theta)\}=\sqrt{n}(P_{n}-P_{0})D_{v}^{*}(P_{0})+o_{p}(1)\Rightarrow_{d}GP.

where

D𝒗(P0)=Kh,𝒗0ΔcI(A=dθ)G0P0Kh,𝒗0\displaystyle D_{\bm{v}}^{*}(P_{0})=\frac{K_{h,\bm{v}_{0}}\Delta^{c}I(A=d^{\theta})}{G_{0}P_{0}K_{h,\bm{v}_{0}}} I(Yt)Kh,𝒗0I(A=dθ)G0aP0Kh,𝒗0Q0(ΔcG0cG0c)\displaystyle I(Y\geq t)-\frac{K_{h,\bm{v}_{0}}I(A=d^{\theta})}{G_{0}^{a}P_{0}K_{h,\bm{v}_{0}}}Q_{0}\left(\frac{\Delta^{c}-G_{0}^{c}}{G_{0}^{c}}\right)
Kh,𝒗0Q0P0Kh,𝒗0(I(A=dθ)G0aG0a)+Kh,𝒗0P0Kh,𝒗0Ψ0h(𝒗0,θ).\displaystyle-\frac{K_{h,\bm{v}_{0}}Q_{0}}{P_{0}K_{h,\bm{v}_{0}}}\left(\frac{I(A=d^{\theta})-G_{0}^{a}}{G_{0}^{a}}\right)+\frac{K_{h,\bm{v}_{0}}}{P_{0}K_{h,\bm{v}_{0}}}\Psi_{0h}(\bm{v}_{0},\theta).

Hence, for all 𝒗𝒱\bm{v}\in\mathcal{V}, n{Ψnh(𝒗,θ)Ψ0h(𝒗,θ)}\sqrt{n}\{\Psi_{nh}(\bm{v},\theta)-\Psi_{0h}(\bm{v},\theta)\} converges weakly as a random element of the cadlag function space endowed with the supremum norm to a Gaussian process GPGP with covariance structure implied by the covariance function ρ(𝒗,𝒗)=P0D𝒗(P0)D𝒗(P0)\rho(\bm{v},\bm{v}^{\prime})=P_{0}D^{*}_{\bm{v}}(P_{0})D^{*}_{\bm{v}^{\prime}}(P_{0}).
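In practice, the covariance function $\rho$ can be estimated on a grid of points in $\mathcal{V}$ and the limiting Gaussian process can then be simulated to obtain a critical value for a simultaneous confidence band over the grid. The sketch below assumes such a covariance estimate is available; the function name, the simulation strategy, and the grid objects psi_hat, se_hat, Sigma_hat are illustrative placeholders rather than the specific construction used in this paper.

import numpy as np

def simultaneous_band_quantile(Sigma, alpha=0.05, n_sim=10000, seed=0):
    """Critical value for a (1 - alpha) simultaneous band over a grid v_1,...,v_m,
    given an estimate Sigma of the covariance function rho(v, v') on that grid:
    draw mean-zero Gaussian vectors with covariance Sigma, standardize each
    coordinate by its standard deviation, and return the (1 - alpha) quantile of
    the maximal absolute standardized coordinate."""
    rng = np.random.default_rng(seed)
    sd = np.sqrt(np.diag(Sigma))
    draws = rng.multivariate_normal(np.zeros(Sigma.shape[0]), Sigma, size=n_sim)
    max_abs = np.abs(draws / sd).max(axis=1)
    return float(np.quantile(max_abs, 1.0 - alpha))

# usage sketch (psi_hat, se_hat, Sigma_hat are user-supplied grid quantities):
# c_alpha = simultaneous_band_quantile(Sigma_hat)
# band = (psi_hat - c_alpha * se_hat, psi_hat + c_alpha * se_hat)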

Appendix S6 Uniform consistency of the regimen-response curve estimator

In Theorem 4.2 we showed that our estimator is pointwise consistent. In this section we show that our estimator is uniformly consistent as well in the sense that supv𝒱|Ψnh(𝒗0,θ)Ψ0(𝒗0,θ)|=op(1)\sup_{v\in\mathcal{V}}|\Psi_{nh}(\bm{v}_{0},\theta)-\Psi_{0}(\bm{v}_{0},\theta)|=o_{p}(1) and derive a rate of convergence. Recall that

Ψnh(𝒗0,θ)Ψ0(𝒗0,θ)={Ψ0h(𝒗0,θ)Ψ0(𝒗0,θ)}+{Ψnh(𝒗0,θ)Ψ0h(𝒗0,θ)},\Psi_{nh}(\bm{v}_{0},\theta)-\Psi_{0}(\bm{v}_{0},\theta)=\{\Psi_{0h}(\bm{v}_{0},\theta)-\Psi_{0}(\bm{v}_{0},\theta)\}+\{\Psi_{nh}(\bm{v}_{0},\theta)-\Psi_{0h}(\bm{v}_{0},\theta)\},

where

Ψnh(𝒗0,θ)Ψ0h(𝒗0,θ)=\displaystyle\Psi_{nh}(\bm{v}_{0},\theta)-\Psi_{0h}(\bm{v}_{0},\theta)= PnKh,𝒗0Ψn(1PnKh,𝒗01P0Kh,𝒗0)\displaystyle P_{n}K_{h,\bm{v}_{0}}\Psi_{n}\left(\frac{1}{P_{n}K_{h,\bm{v}_{0}}}-\frac{1}{P_{0}K_{h,\bm{v}_{0}}}\right) (S34)
+(PnP0)Kh,𝒗0P0Kh,𝒗0(ΨnΨ0)+(PnP0)Kh,𝒗0P0Kh,𝒗0Ψ0\displaystyle+(P_{n}-P_{0})\frac{K_{h,\bm{v}_{0}}}{P_{0}K_{h,\bm{v}_{0}}}(\Psi_{n}-\Psi_{0})+(P_{n}-P_{0})\frac{K_{h,\bm{v}_{0}}}{P_{0}K_{h,\bm{v}_{0}}}\Psi_{0}
+P0Kh,𝒗0P0Kh,𝒗0(ΨnΨ0).\displaystyle+P_{0}\frac{K_{h,\bm{v}_{0}}}{P_{0}K_{h,\bm{v}_{0}}}(\Psi_{n}-\Psi_{0}).

We know that supv𝒱|Ψ0h(𝒗0,θ)Ψ0(𝒗0,θ)|=Op(hr)\sup_{v\in\mathcal{V}}|\Psi_{0h}(\bm{v}_{0},\theta)-\Psi_{0}(\bm{v}_{0},\theta)|=O_{p}(h^{r}), supv𝒱(PnP0)Kh,𝒗=Op(n1/2hr/2)\sup_{v\in\mathcal{V}}(P_{n}-P_{0})K_{h,\bm{v}}=O_{p}(n^{-1/2}h^{-r/2}), and supv𝒱|(PnP0)Kh,vP0Kh,v(ΨnΨ0)|=op(n1/2hr/2)\sup_{v\in\mathcal{V}}\left|(P_{n}-P_{0})\frac{K_{h,v}}{P_{0}K_{h,v}}\left(\Psi_{n}-\Psi_{0}\right)\right|=o_{p}(n^{-1/2}h^{-r/2}). Moreover, supv𝒱|P0Kh,𝒗0P0Kh,𝒗0(ΨnΨ0)|=Op(supv𝒱ΨnΨ0)\sup_{v\in\mathcal{V}}\left|P_{0}\frac{K_{h,\bm{v}_{0}}}{P_{0}K_{h,\bm{v}_{0}}}\left(\Psi_{n}-\Psi_{0}\right)\right|=O_{p}(\sup_{v\in\mathcal{V}}\|\Psi_{n}-\Psi_{0}\|). Thus, assuming that supv𝒱ΨnΨ0=Op(bn)\sup_{v\in\mathcal{V}}\|\Psi_{n}-\Psi_{0}\|=O_{p}(b_{n}), we have

supv𝒱|Ψnh(𝒗0,θ)Ψ0(𝒗0,θ)|=Op(n1/2hr/2+hr+bn).\sup_{v\in\mathcal{V}}|\Psi_{nh}(\bm{v}_{0},\theta)-\Psi_{0}(\bm{v}_{0},\theta)|=O_{p}\left(n^{-1/2}h^{-r/2}+h^{r}+b_{n}\right).

Appendix S7 Proof of Theorem 4.4: Consistency of θnh\theta_{nh}

The upper bound of $E_{P}\sup_{f\in\mathcal{F},\|f\|_{P}\leq\delta}n^{1/2}(P_{n}-P_{0})(f)$ given in Lemma S1 is dominated either by $\mathcal{E}(\delta,\mathcal{F},L^{2}(P))$ or by $\mathcal{E}^{2}(\delta,\mathcal{F},L^{2}(P))/(\delta^{2}n^{1/2})$. The switch occurs when $\mathcal{E}(\delta,\mathcal{F},L^{2}(P))=\delta^{2}n^{1/2}$. By Lemma S2, for the cadlag class of functions $\mathcal{F}=\mathcal{F}_{p,M}$, we have $\mathcal{E}(\delta,\mathcal{F},L^{2}(P))=\delta^{1/2}|\log\delta|^{p-1}$. Therefore, the switch occurs when $\delta^{2}n^{1/2}=\delta^{1/2}|\log\delta|^{p-1}$, which implies $\delta=n^{-1/3}|\log n|^{2(p-1)/3}$ (this is because $\log\delta=-\tfrac{1}{3}\log n+\log|\log\delta|^{p-1}$, where $\log n$ is the leading term). So the result of Lemma S1 can be written as

EPsupf,fPδn1/2(PnP0)(f)\displaystyle E_{P}\sup_{f\in\mathcal{F},\|f\|_{P}\leq\delta}n^{1/2}(P_{n}-P_{0})(f)\leq I(δn1/3|logn|2(p1)/3)(δ,,L2(P))\displaystyle I(\delta\geq n^{-1/3}|\log n|^{2(p-1)/3})\mathcal{E}(\delta,\mathcal{F},L^{2}(P))
+I(δn1/3|logn|2(p1)/3){(δ,,L2(P))δ2n1/2M}.\displaystyle+I(\delta\leq n^{-1/3}|\log n|^{2(p-1)/3})\left\{\frac{\mathcal{E}(\delta,\mathcal{F},L^{2}(P))}{\delta^{2}n^{1/2}}M\right\}.

In the following, for ease of notation, we write $\theta_{nh}(v_{0})\equiv\theta_{nh}$ and $\theta_{0}(v_{0})\equiv\theta_{0}$. Define the loss-based dissimilarity $d(\theta_{nh},\theta_{0})=\Psi_{0}(\theta_{nh})-\Psi_{0}(\theta_{0})$. Then, by definition,

0d(θnh,θ0)\displaystyle 0\leq d(\theta_{nh},\theta_{0}) =(Ψ0Ψnh)(θnh)(Ψ0Ψnh)(θ0)+Ψnh(θnh)Ψnh(θ0)\displaystyle=(\Psi_{0}-\Psi_{nh})(\theta_{nh})-(\Psi_{0}-\Psi_{nh})(\theta_{0})+\Psi_{nh}(\theta_{nh})-\Psi_{nh}(\theta_{0})
{(ΨnhΨ0)(θnh)(ΨnhΨ0)(θ0)}\displaystyle\leq-\{(\Psi_{nh}-\Psi_{0})(\theta_{nh})-(\Psi_{nh}-\Psi_{0})(\theta_{0})\}
=(PnP0){D0(θnh)D0(θ0)}+hJB0(θnhθ0)+R(θnh)R(θ0).\displaystyle=(P_{n}-P_{0})\{D^{*}_{0}(\theta_{nh})-D^{*}_{0}(\theta_{0})\}+h^{J}B_{0}(\theta_{nh}-\theta_{0})+R(\theta_{nh})-R(\theta_{0}).

The last equality follows from Theorem 4.2. Assuming hnn1/(2J+r)0\frac{h_{n}}{n^{-1/(2J+r)}}\rightarrow 0 (i.e., hnh_{n} converges to zero faster than n1/(2J+r)n^{-1/(2J+r)}), we have

0d(θnh,θ0)(PnP0){D0(θnh)D0(θ0)}+{R(θnh)R(θ0)}+o(nJ/(2J+r)).\displaystyle 0\leq d(\theta_{nh},\theta_{0})\leq(P_{n}-P_{0})\{D^{*}_{0}(\theta_{nh})-D^{*}_{0}(\theta_{0})\}+\{R(\theta_{nh})-R(\theta_{0})\}+o(n^{-J/(2J+r)}).

Let $\tilde{d}(\theta_{nh},\theta_{0h})=(P_{n}-P_{0})\{D^{*}_{0}(\theta_{nh})-D^{*}_{0}(\theta_{0h})\}$. Using the rate obtained for the remainder terms in Theorem 4.2, we have $d(\theta_{nh},\theta_{0})=\tilde{d}(\theta_{nh},\theta_{0h})+\{R(\theta_{nh})-R(\theta_{0})\}=\tilde{d}(\theta_{nh},\theta_{0h})+o_{p}(n^{-1/2}h^{-r/2})$. We proceed by assuming that the convergence rate of $\tilde{d}(\theta_{nh},\theta_{0h})$ is the dominating rate; we show later in the proof that this assumption holds.

Recall,

D0(θ)=Kh,𝒗0ΔcI(A=dθ)G0P0Kh,𝒗0I(Yt)Kh,𝒗0I(A=dθ)G0aP0Kh,𝒗0Q0(ΔcG0cG0c)\displaystyle D^{*}_{0}(\theta)=\frac{K_{h,\bm{v}_{0}}\Delta^{c}I(A=d^{\theta})}{G_{0}P_{0}K_{h,\bm{v}_{0}}}I(Y\geq t)-\frac{K_{h,\bm{v}_{0}}I(A=d^{\theta})}{G_{0}^{a}P_{0}K_{h,\bm{v}_{0}}}Q_{0}\left(\frac{\Delta^{c}-G_{0}^{c}}{G_{0}^{c}}\right)- Kh,𝒗0Q0P0Kh,𝒗0(I(A=dθ)G0aG0a)\displaystyle\frac{K_{h,\bm{v}_{0}}Q_{0}}{P_{0}K_{h,\bm{v}_{0}}}\left(\frac{I(A=d^{\theta})-G_{0}^{a}}{G_{0}^{a}}\right)
Kh,𝒗0P0Kh,𝒗0Ψ0h(𝒗0,θ).\displaystyle-\frac{K_{h,\bm{v}_{0}}}{P_{0}K_{h,\bm{v}_{0}}}\Psi_{0h}(\bm{v}_{0},\theta).

Let f=D0(θ)D0(θ0)f=D^{*}_{0}(\theta)-D^{*}_{0}(\theta_{0}) and hrf~=fh^{-r}\tilde{f}=f. Then,

(PnP0)f~\displaystyle(P_{n}-P_{0})\tilde{f} supf~Phr/2|(PnP0)f~|\displaystyle\leq\sup_{\|\tilde{f}\|_{P}\leq h^{r/2}}|(P_{n}-P_{0})\tilde{f}|
=Op{n1/2(hr/2,L2(P))}\displaystyle=O_{p}\{n^{-1/2}\mathcal{E}(h^{r/2},L^{2}(P))\} (S35)

Equality (S35) follows from Markov's inequality and the result of Lemma S6. Hence, $(P_{n}-P_{0})f=O_{p}(n^{-1/2}h^{r/4}h^{-r})=O_{p}(n^{-1/2}h^{-3r/4})$. Using a Taylor expansion, we have $d(\theta_{nh},\theta_{0})=\Psi_{0}(\theta_{nh})-\Psi_{0}(\theta_{0})=(\theta_{nh}-\theta_{0})^{\top}\frac{\partial^{2}\Psi_{0}}{\partial\theta^{2}}(\theta_{nh}-\theta_{0})+o(\|\theta_{nh}-\theta_{0}\|^{2})$, which implies $\|\theta_{nh}-\theta_{0}\|_{2,\mu}=O_{p}(n^{-1/4}h^{-3r/8})$. Let $\delta=n^{-1/4}h^{-3r/8}$. Using Lemma S6,

(PnP0)f~\displaystyle(P_{n}-P_{0})\tilde{f} supf~Phr/2δ|(PnP0)f~|\displaystyle\leq\sup_{\|\tilde{f}\|_{P}\leq h^{r/2}\delta}|(P_{n}-P_{0})\tilde{f}|
=Op{n1/2(hr/2δκκ+2,L2(P))},\displaystyle=O_{p}\{n^{-1/2}\mathcal{E}(h^{r/2}\delta^{\frac{\kappa}{\kappa+2}},L^{2}(P))\}, (S36)

Then $d(\theta_{nh},\theta_{0})=O_{p}(n^{-1/2}h^{-3r/4}\delta^{\frac{\kappa}{2\kappa+4}})$. The iteration continues until there is no improvement in the rate, that is, until $\delta^{2}=n^{-1/2}h^{-3r/4}\delta^{\frac{\kappa}{2\kappa+4}}$; equivalently, $\delta^{\frac{3\kappa+8}{2\kappa+4}}=n^{-1/2}h^{-3r/4}$, which implies

δ=(n1/2h3r/4)2κ+43κ+8.\delta=\left(n^{-1/2}h^{-3r/4}\right)^{\frac{2\kappa+4}{3\kappa+8}}.

Hence, θnh(v0)θ0(v0)2=Op((n1/2h3r/4)2κ+43κ+8)\|\theta_{nh}(v_{0})-\theta_{0}(v_{0})\|_{2}=O_{p}\left(\left(n^{-1/2}h^{-3r/4}\right)^{\frac{2\kappa+4}{3\kappa+8}}\right). When there is a margin around zero, that is, pr(0<|𝑺θ0|<l)=0pr(0<|\bm{S}^{\top}{{\theta}}_{0}|<l)=0, we will have θnh(v0)θ0(v0)2,μ=Op(n1/3hr/2)\|\theta_{nh}(v_{0})-\theta_{0}(v_{0})\|_{2,\mu}=O_{p}\left(n^{-1/3}h^{-r/2}\right).

At each iteration the convergence rate of $d(\theta_{nh},\theta_{0})$ improves, and at the fixed point it achieves the rate $d(\theta_{nh},\theta_{0})=O_{p}\left(\left(n^{-1/2}h^{-3r/4}\right)^{\frac{4\kappa+8}{3\kappa+8}}\right)$. Hence, to show that the latter rate dominates the remainder term rate, it suffices to show that

(nhr)1/2(n1/2h3r/4)4κ+83κ+8.\frac{(nh^{r})^{1/2}}{\left(n^{1/2}h^{3r/4}\right)^{\frac{4\kappa+8}{3\kappa+8}}}\rightarrow\infty.

The condition is satisfied when $h<n^{-\frac{\kappa}{r(3\kappa+4)}}$. The right-hand side is a decreasing function of $\kappa$ and, as $\kappa\rightarrow\infty$, $n^{-\frac{\kappa}{r(3\kappa+4)}}\rightarrow n^{-\frac{1}{3r}}$. This rate condition is satisfied when $n^{-2/(3r)}<h<n^{-1/(2J+r)}$ with $J>r/4$, and in particular for the optimal rate $h_{n}=n^{-1/(2J+r)}$. So, no further condition is imposed on the rate of the bandwidth and the optimal rate remains $h_{n}=n^{-1/(2J+r)}$. Consequently,

θnh(v0)θ0(v0)2=Op(n(r4J)(2κ+4)(8J+4r)(3κ+8)).\|\theta_{nh}(v_{0})-\theta_{0}(v_{0})\|_{2}=O_{p}\left(n^{\frac{(r-4J)(2\kappa+4)}{(8J+4r)(3\kappa+8)}}\right).
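The exponent in the last display follows by substituting the optimal bandwidth into the fixed-point rate; the short verification below uses only the notation above. With $h_{n}=n^{-1/(2J+r)}$,

n^{-1/2}h_{n}^{-3r/4}=n^{-\frac{1}{2}+\frac{3r}{4(2J+r)}}=n^{\frac{r-4J}{8J+4r}},

and raising this to the power $\frac{2\kappa+4}{3\kappa+8}$ gives the stated rate $O_{p}\left(n^{\frac{(r-4J)(2\kappa+4)}{(8J+4r)(3\kappa+8)}}\right)$.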

Appendix S8 Additional figures

Figure S5: Simulation studies Scenario 2: The target parameter of interest is Ψ0h\Psi_{0h}. The plots show the scaled bias and coverage rate when the weight functions are estimated using an undersmoothed highly adaptive lasso (HAL) and a random forest (RF).
Figure S6: Simulation studies Scenario 2: The target parameter of interest is Ψ0\Psi_{0}. The plots show the scaled bias and coverage rate when the weight functions are estimated using an undersmoothed highly adaptive lasso (HAL) and a random forest (RF).
Figure S7: Simulation studies Scenarios 1 and 2: The target parameter of interest is the true minimizer of Ψ0\Psi_{0} (i.e., θ0\theta_{0}). The plots show the scaled bias of θnh\theta_{nh} when the weight functions are estimated using an undersmoothed highly adaptive lasso.