
A Unified Framework for Semiparametrically Efficient Semi-Supervised Learning

Zichun Xu (Department of Biostatistics, University of Washington), Daniela Witten (Department of Statistics, University of Washington), Ali Shojaie (University of Washington)
Abstract

We consider statistical inference under a semi-supervised setting where we have access to both a labeled dataset consisting of pairs {Xi,Yi}i=1n\{X_{i},Y_{i}\}_{i=1}^{n} and an unlabeled dataset {Xi}i=n+1n+N\{X_{i}\}_{i=n+1}^{n+N}. We ask the question: under what circumstances, and by how much, can incorporating the unlabeled dataset improve upon inference using the labeled data? To answer this question, we investigate semi-supervised learning through the lens of semiparametric efficiency theory. We characterize the efficiency lower bound under the semi-supervised setting for an arbitrary inferential problem, and show that incorporating unlabeled data can potentially improve efficiency if the parameter is not well-specified. We then propose two types of semi-supervised estimators: a safe estimator that imposes minimal assumptions, is simple to compute, and is guaranteed to be at least as efficient as the initial supervised estimator; and an efficient estimator, which — under stronger assumptions — achieves the semiparametric efficiency bound. Our findings unify existing semiparametric efficiency results for particular special cases, and extend these results to a much more general class of problems. Moreover, we show that our estimators can flexibly incorporate predicted outcomes arising from “black-box” machine learning models, and thereby achieve the same goal as prediction-powered inference (PPI), but with superior theoretical guarantees. We also provide a complete understanding of the theoretical basis for the existing set of PPI methods. Finally, we apply the theoretical framework developed to derive and analyze efficient semi-supervised estimators in a number of settings, including M-estimation, U-statistics, and average treatment effect estimation, and demonstrate the performance of the proposed estimators via simulations.

1 Introduction

Supervised learning involves modeling the association between a set of responses Y𝒴Y\in{\mathcal{Y}} and a set of covariates X𝒳X\in{\mathcal{X}} on the basis of nn labeled realizations n={Zi}i=1n{{\mathcal{L}}_{n}}=\{Z_{i}\}_{i=1}^{n}, where Z=(X,Y)𝒵Z=(X,Y)\in{\mathcal{Z}}. In the semi-supervised setting, we also have access to NN additional unlabeled realizations of XX, 𝒰N={Xi}i=n+1n+N{{\mathcal{U}}_{N}}=\{X_{i}\}_{i=n+1}^{n+N}, without associated realizations of YY. This situation could arise, for instance, when observing YY is more expensive or time-consuming than observing XX.

Suppose that XXX\sim{{\mathbb{P}}^{*}_{X}}, and YXYXY\mid X\sim{{\mathbb{P}}^{*}_{Y\mid X}}. The joint distribution of ZZ can be expressed as Z=XYXZ\sim{{\mathbb{P}}^{*}}={{\mathbb{P}}^{*}_{X}}{{\mathbb{P}}^{*}_{Y\mid X}}. Suppose that n{{\mathcal{L}}_{n}} consists of nn independent and identically distributed (i.i.d.) realizations of {{\mathbb{P}}^{*}}, and 𝒰N{{\mathcal{U}}_{N}} consists of NN i.i.d. realizations of X{{\mathbb{P}}^{*}_{X}} that are also independent of n{{\mathcal{L}}_{n}}.

We further assume that the marginal distribution and the conditional distribution are separately modeled. That is, X{{\mathbb{P}}^{*}_{X}} belongs to a marginal model 𝒫X{{\mathcal{P}}_{X}}, and YX{{\mathbb{P}}^{*}_{Y\mid X}} belongs to a conditional model 𝒫YX{{\mathcal{P}}_{Y\mid X}}. The model of the joint distribution can thus be expressed as

𝒫={=XYX:X𝒫X,YX𝒫YX}.{{\mathcal{P}}}=\left\{{{\mathbb{P}}={\mathbb{P}}_{X}{\mathbb{P}}_{Y\mid X}:{\mathbb{P}}_{X}\in{{\mathcal{P}}_{X}},{\mathbb{P}}_{Y\mid X}\in{{\mathcal{P}}_{Y\mid X}}}\right\}. (1)

Consider a general inferential problem where the parameter of interest is a Euclidean-valued functional θ():𝒫Θp\theta(\cdot):{{\mathcal{P}}}\to\Theta\subseteq{\mathbb{R}}^{p}. The ground truth is θ()\theta({{\mathbb{P}}^{*}}), which we denote as θ\theta^{*} for simplicity. This paper centers around the following question: how, and by how much, can inference using (n,𝒰N)({{\mathcal{L}}_{n}},{{\mathcal{U}}_{N}}) improve upon inference on θ\theta^{*} using only n{{\mathcal{L}}_{n}}?

1.1 The three settings

We now introduce three settings that will be used throughout this paper.

  1.

    In the supervised setting, only labeled data n{{\mathcal{L}}_{n}} are available. Supervised estimators can be written as θ^n(n)\hat{\theta}_{n}({{\mathcal{L}}_{n}}).

  2.

    In the ordinary semi-supervised (OSS) setting, both labeled data n{{\mathcal{L}}_{n}} and unlabeled data 𝒰N{{\mathcal{U}}_{N}} are available. OSS estimators can be written as θ^n,N(n,𝒰N)\hat{\theta}_{n,N}({{\mathcal{L}}_{n}},{{\mathcal{U}}_{N}}).

  3.

    In the ideal semi-supervised (ISS) setting [Zhang et al., 2019, Cheng et al., 2021], we have access to labeled data n{{\mathcal{L}}_{n}}, and furthermore the marginal distribution X{{\mathbb{P}}^{*}_{X}} is known. ISS estimators can be written as θ^n(n,X)\hat{\theta}_{n}({{\mathcal{L}}_{n}},{{\mathbb{P}}^{*}_{X}}).

The ISS setting is primarily of theoretical interest: in reality, we rarely know X{{\mathbb{P}}^{*}_{X}}. Analyzing the ISS setting will facilitate analysis of the OSS setting.

1.2 Main results

Our main contributions are as follows:

  1.

    For an arbitrary inferential problem, we derive the semiparametric efficiency lower bound under the ISS setting. We show that knowledge of X{{\mathbb{P}}^{*}_{X}} can be used to potentially improve upon a supervised estimator when the parameter of interest is not well-specified, in the sense of Definition 3.1. This generalizes existing results that semi-supervised learning cannot improve a correctly-specified linear model [Kawakita and Kanamori, 2013, Buja et al., 2019a, Song et al., 2023, Gan and Liang, 2023].

  2.

    In the OSS setting, the data are non-i.i.d., and consequently classical semiparametric efficiency theory is not applicable. We establish the semiparametric efficiency lower bound in this setting: to our knowledge, this is the first such result in the literature. As in the ISS setting, an efficiency gain over the supervised setting is possible when the parameter of interest is not well-specified.

  3.

    To achieve efficiency gains over supervised estimation when the parameter is not well-specified, we propose two types of semi-supervised estimators — the safe estimator and the efficient estimator — both of which build upon an arbitrary initial supervised estimator. The safe estimator requires minimal assumptions, and is always at least as efficient as the initial supervised estimator. The efficient estimator imposes stronger assumptions, and can achieve the semi-parametric efficiency bound. Compared to existing semi-supervised estimators, the proposed estimators are more general and simpler to compute, and enjoy optimality properties.

  4.

    Suppose we have access to a “black box” machine learning model, f():𝒳𝒴f(\cdot):\mathcal{X}\rightarrow\mathcal{Y}, that can be used to obtain predictions of YY on unlabeled data. We show that our safe estimator can be adapted to make use of these predictions, thereby connecting the semi-supervised framework of this paper to prediction-powered inference (PPI) [Angelopoulos et al., 2023a, b], a setting of extensive recent interest. This contextualizes existing PPI proposals through the lens of semi-supervised learning, and directly leads to new PPI estimators with better theoretical guarantees and empirical performance.

1.3 Related literature

While semi-supervised learning is not a new research area [Bennett and Demiriz, 1998, Chapelle et al., 2006, Bair, 2013, Van Engelen and Hoos, 2020], it has been the topic of extensive recent theoretical interest.

Recently, Zhang et al. [2019] studied semi-supervised mean estimation and proposed a semi-supervised regression-based estimator with reduced asymptotic variance; Zhang and Bradic [2022] extended this result to the high-dimensional setting and developed an approach for bias-corrected inference. Both Chakrabortty and Cai [2018] and Azriel et al. [2022] considered semi-supervised linear regression, and proposed asymptotically normal estimators with improved efficiency over the supervised ordinary least squares estimator. Deng et al. [2023] further investigated semi-supervised linear regression under a high-dimensional setting and proposed a minimax optimal estimator. Wang et al. [2023] and Quan et al. [2024] extended these findings to the setting of semi-supervised generalized linear models. Some authors have considered semi-supervised learning for more general inferential tasks, such as M-estimation [Chakrabortty, 2016, Song et al., 2023, Yuval and Rosset, 2022] and U-statistics [Kim et al., 2024], and in different sub-fields of statistics, such as causal inference [Cheng et al., 2021, Chakrabortty and Dai, 2022] and extreme-valued statistics [Ahmed et al., 2024]. However, there exists neither a unified theoretical framework for the semi-supervised setting, nor a unified approach to construct efficient estimators in this setting.

Prediction-powered inference (PPI) refers to a setting in which the data analyst has access to not only n{{\mathcal{L}}_{n}} and 𝒰N{{\mathcal{U}}_{N}}, but also a “black box” machine learning model f():𝒳𝒴f(\cdot):\mathcal{X}\rightarrow\mathcal{Y} such that Yf(X)Y\approx f(X). The goal of PPI is to conduct inference on the association between YY and XX while making use of n{{\mathcal{L}}_{n}}, 𝒰N{{\mathcal{U}}_{N}}, and the black box predictions given by f()f(\cdot). To the best of our knowledge, despite extensive recent interest in the PPI setting [Angelopoulos et al., 2023a, b, Miao et al., 2023, Gan and Liang, 2023, Miao and Lu, 2024, Zrnic and Candès, 2024a, b, Gu and Xia, 2024], no prior work formally connects prediction-powered inference with the semi-supervised paradigm.

1.4 Notation

For any natural number KK, [K]:={1,,K}[K]:=\left\{{1,\ldots,K}\right\}. Let 𝒪1×𝒪2{\mathcal{O}}_{1}\times{\mathcal{O}}_{2} denote the Cartesian product of two sets 𝒪1{\mathcal{O}}_{1} and 𝒪2{\mathcal{O}}_{2}. Let λmin(𝐀)\lambda_{\min}(\mathbf{A}) denote the smallest eigenvalue of a matrix 𝐀\mathbf{A}. For two symmetric matrices 𝐀,𝐁p×p\mathbf{A},\mathbf{B}\in{\mathbb{R}}^{p\times p}, we write that 𝐀𝐁\mathbf{A}\succeq\mathbf{B} if and only if 𝐀𝐁\mathbf{A}-\mathbf{B} is positive semi-definite. We use uppercase letters X,Y,ZX,Y,Z to denote random variables and lowercase letters x,y,zx,y,z to denote their realizations. For a vector xpx\in{\mathbb{R}}^{p}, let x\left\lVert x\right\rVert be its Euclidean norm. The random variable XX takes values in the probability space (𝒳,X,)({\mathcal{X}},{\mathcal{F}}_{X},{\mathbb{P}}). For a measurable function f:𝒳pf:{\mathcal{X}}\to{\mathbb{R}}^{p}, let 𝔼[f(X)]=f𝑑p{\mathbb{E}}[f(X)]=\int fd{\mathbb{P}}\in{\mathbb{R}}^{p} denote its expectation, and Var[f(X)]=𝔼[{f(X)𝔼[f(X)]}{f(X)𝔼[f(X)]}]p×p\mathrm{Var}[f(X)]={\mathbb{E}}[\left\{{f(X)-{\mathbb{E}}[f(X)]}\right\}\left\{{f(X)-{\mathbb{E}}[f(X)]}\right\}^{\top}]\in{\mathbb{R}}^{p\times p} its variance-covariance matrix. For another measurable function g:𝒳qg:{\mathcal{X}}\to{\mathbb{R}}^{q}, let Cov[f(X),g(X)]=𝔼[{f(X)𝔼[f(X)]}{g(X)𝔼[g(X)]}]p×q\mathrm{Cov}[f(X),g(X)]={\mathbb{E}}[\left\{{f(X)-{\mathbb{E}}[f(X)]}\right\}\left\{{g(X)-{\mathbb{E}}[g(X)]}\right\}^{\top}]\in{\mathbb{R}}^{p\times q} denote the covariance matrix between f(X)f(X) and g(X)g(X). Let p2(){\mathcal{L}}_{p}^{2}({\mathbb{P}}) be the space of all vector-valued measurable functions f:𝒳pf:{\mathcal{X}}\to{\mathbb{R}}^{p} such that 𝔼[f(X)2]<{\mathbb{E}}[\left\lVert f(X)\right\rVert^{2}]<\infty, and let p,02()p2(){\mathcal{L}}^{2}_{p,0}({\mathbb{P}})\subset{\mathcal{L}}_{p}^{2}({\mathbb{P}}) denote the sub-space of functions with expectation 0, i.e. any fp,02()f\in{\mathcal{L}}^{2}_{p,0}({\mathbb{P}}) satisfies 𝔼[f(Z)]=0{\mathbb{E}}[f(Z)]=0.

1.5 Organization of the paper

The rest of our paper is organized as follows. Section 2 contains a very brief review of some concepts in semiparametric efficiency theory. In Sections 3 and 4, we derive semi-parametric efficiency bounds and construct efficient estimators under the ISS and OSS settings. In Sections 5 and  6, we connect the proposed framework to prediction-powered inference and missing data, respectively. In Section 7, we instantiate our theoretical and methodological findings in the context of specific examples: M-estimation, U-statistics, and the estimation of average treatment effect. Numerical experiments and concluding remarks are in Sections 8 and 9, respectively. Proofs of theoretical results can be found in the Supplement.

2 Overview of semiparametric efficiency theory

We provide a brief introduction to concepts and results from semiparametric efficiency theory that are essential for reading later sections of the paper. See also Supplement E.2. A more comprehensive introduction can be found in Chapter 25 of Van der Vaart [2000] or in Tsiatis [2006].

Consider i.i.d. data {Zi}i=1n\left\{{Z_{i}}\right\}_{i=1}^{n} sampled from 𝒫{{\mathbb{P}}^{*}}\in{{\mathcal{P}}}, and suppose there exists a dominating measure μ\mu such that each element of 𝒫{{\mathcal{P}}} can be represented by its corresponding density with respect to μ\mu. Interest lies in a Euclidean functional θ():𝒫Θp\theta(\cdot):{{\mathcal{P}}}\to\Theta\subseteq{\mathbb{R}}^{p}, and θ:=θ()\theta^{*}:=\theta({{\mathbb{P}}^{*}}). Consider a one-dimensional regular parametric sub-model 𝒫T={t:tT}𝒫{\mathcal{P}}_{T}=\left\{{{\mathbb{P}}_{t}:t\in T\subset{\mathbb{R}}}\right\}\subset{{\mathcal{P}}} such that t={\mathbb{P}}_{t^{*}}={{\mathbb{P}}^{*}} for some tTt^{*}\in T. 𝒫T{\mathcal{P}}_{T} defines a score function at {{\mathbb{P}}^{*}}. The set of score functions at {{\mathbb{P}}^{*}} from all such one-dimensional regular parametric sub-models of 𝒫{{\mathcal{P}}} forms the tangent set at {{\mathbb{P}}^{*}} relative to 𝒫{{\mathcal{P}}}, and the closed linear span of the tangent set is the tangent space at {{\mathbb{P}}^{*}} relative to 𝒫{{\mathcal{P}}}, which we denote as 𝒯𝒫(){\mathcal{T}}_{{\mathcal{P}}}({{\mathbb{P}}^{*}}). By definition, the tangent space 𝒯𝒫()1,02(){\mathcal{T}}_{{\mathcal{P}}}({{\mathbb{P}}^{*}})\subseteq{{\mathcal{L}}^{2}_{1,0}({{\mathbb{P}}^{*}})}.

An estimator θ^n=θ^n(Z1,,Zn)\hat{\theta}_{n}=\hat{\theta}_{n}(Z_{1},\ldots,Z_{n}) of θ\theta^{*} is regular relative to 𝒫{{\mathcal{P}}} if, for any regular parametric sub-model 𝒫T={t:tT}𝒫{\mathcal{P}}_{T}=\left\{{{\mathbb{P}}_{t}:t\in T\subset{\mathbb{R}}}\right\}\subset{{\mathcal{P}}} such that =t{{\mathbb{P}}^{*}}={\mathbb{P}}_{t^{*}} for some tTt^{*}\in T,

n[θ^nθ(t+hn)]V\sqrt{n}\left[\hat{\theta}_{n}-\theta\left({\mathbb{P}}_{t^{*}+\frac{h}{\sqrt{n}}}\right)\right]\rightsquigarrow V_{{\mathbb{P}}^{*}}

for all hh\in{\mathbb{R}} under t+hn{\mathbb{P}}_{t^{*}+\frac{h}{\sqrt{n}}}, where VV_{{\mathbb{P}}^{*}} is a random variable whose distribution depends on {{\mathbb{P}}^{*}} but not on hh. θ^n\hat{\theta}_{n} is asymptotically linear with influence function φη(z)p,02(){{\varphi}_{\eta^{*}}}(z)\in{{\mathcal{L}}^{2}_{p,0}({{\mathbb{P}}^{*}})} if

n(θ^nθ)=1ni=1nφη(Zi)+op(1).\sqrt{n}\left(\hat{\theta}_{n}-\theta^{*}\right)=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}{{\varphi}_{\eta^{*}}}(Z_{i})+o_{p}(1).

The influence function φη(z){{\varphi}_{\eta^{*}}}(z) depends on {{\mathbb{P}}^{*}} through some functional η:𝒫Ω\eta:{{\mathcal{P}}}\to\Omega, where Ω\Omega is a general metric space with metric ρ\rho, and η=η()\eta^{*}=\eta({{\mathbb{P}}^{*}}). The functional η()\eta(\cdot) need not be finite-dimensional. The functional of interest θ(){\theta(\cdot)} is typically involved in η()\eta(\cdot), but they are usually not the same. A regular and asymptotically linear estimator is efficient for θ(){\theta(\cdot)} at {{\mathbb{P}}^{*}} relative to 𝒫{{\mathcal{P}}} if its influence function φη(z){{\varphi}^{*}_{\eta^{*}}}(z) satisfies aφη(z)𝒯𝒫()a^{\top}{{\varphi}^{*}_{\eta^{*}}}(z)\in{\mathcal{T}}_{{\mathcal{P}}}({{\mathbb{P}}^{*}}) for any apa\in{\mathbb{R}}^{p}. If this holds, then φη(z){{\varphi}^{*}_{\eta^{*}}}(z) is the efficient influence function of θ(){\theta(\cdot)} at {{\mathbb{P}}^{*}} relative to 𝒫{{\mathcal{P}}}.

Semiparametric efficiency theory aims to provide a lower bound for the asymptotic variances of regular and asymptotically linear estimators.

Lemma 2.1.

Suppose there exists a regular and asymptotically linear estimator of θ\theta^{*}. Let φη(z){{\varphi}^{*}_{\eta^{*}}}(z) denote the efficient influence function of θ(){\theta(\cdot)} at {{\mathbb{P}}^{*}} relative to 𝒫{{\mathcal{P}}}. Then, it follows that for any regular and asymptotically linear estimator θ^n\hat{\theta}_{n} of θ\theta^{*} such that n(θ^nθ)N(0,𝚺)\sqrt{n}(\hat{\theta}_{n}-\theta^{*})\rightsquigarrow N(0,\mathbf{\Sigma}),

𝚺Var[φη(Z)].\mathbf{\Sigma}\succeq\mathrm{Var}[{{\varphi}^{*}_{\eta^{*}}}(Z)].

Lemma 2.1 states that any regular and asymptotically linear estimator of θ\theta^{*} relative to 𝒫{{\mathcal{P}}} has asymptotic variance no smaller than the variance of the efficient influence function, Var[φη(Z)]\mathrm{Var}[{{\varphi}^{*}_{\eta^{*}}}(Z)]. This is referred to as the semiparametric efficiency lower bound. For a proof of Lemma 2.1, see Theorem 3.2 of Bickel et al. [1993].

3 The ideal semi-supervised setting

3.1 Semiparametric efficiency lower bound

Under the ISS setting, we have access to labeled data n{{\mathcal{L}}_{n}}, and the marginal distribution X{{\mathbb{P}}^{*}_{X}} is known. Thus, the model 𝒫{{\mathcal{P}}} reduces from (1) to

{X}𝒫YX={=XYX,YX𝒫YX}.{\{{\mathbb{P}}_{X}^{*}\}\otimes{\mathcal{P}}_{Y\mid X}}=\left\{{{\mathbb{P}}={{\mathbb{P}}^{*}_{X}}{\mathbb{P}}_{Y\mid X},{\mathbb{P}}_{Y\mid X}\in{{\mathcal{P}}_{Y\mid X}}}\right\}. (2)

A smaller model space leads to an easier inferential problem, and thus a lower efficiency bound. We let

ϕη(x):=𝔼[φη(Z)X=x]{\phi^{*}_{\eta^{*}}}(x):={\mathbb{E}}\left[{{\varphi}^{*}_{\eta^{*}}}(Z)\mid X=x\right] (3)

denote the conditional efficient influence function. Our first result establishes the semiparametric efficiency bound under the ISS setting.

Theorem 3.1.

Suppose the efficient influence function of θ(){\theta(\cdot)} at {{\mathbb{P}}^{*}} relative to 𝒫{{\mathcal{P}}} defined in (1) is φη(z){{\varphi}^{*}_{\eta^{*}}}(z), where η:𝒫Ω\eta:{{\mathcal{P}}}\to\Omega is a functional, Ω\Omega is a metric space equipped with metric ρ(,)\rho(\cdot,\cdot), and η=η()\eta^{*}=\eta({{\mathbb{P}}^{*}}). Then, the efficient influence function of θ(){\theta(\cdot)} at {{\mathbb{P}}^{*}} relative to {X}𝒫YX{\{{\mathbb{P}}_{X}^{*}\}\otimes{\mathcal{P}}_{Y\mid X}} defined as (2) is

φη(z)ϕη(x).{{\varphi}^{*}_{\eta^{*}}}(z)-{\phi^{*}_{\eta^{*}}}(x). (4)

Furthermore, the semiparametric efficiency lower bound satisfies

Var[φη(Z)ϕη(X)]=Var[φη(Z)]Var[ϕη(X)]Var[φη(Z)].\mathrm{Var}[{{\varphi}^{*}_{\eta^{*}}}(Z)-{\phi^{*}_{\eta^{*}}}(X)]=\mathrm{Var}[{{\varphi}^{*}_{\eta^{*}}}(Z)]-\mathrm{Var}[{\phi^{*}_{\eta^{*}}}(X)]\preceq\mathrm{Var}[{{\varphi}^{*}_{\eta^{*}}}(Z)]. (5)

Thus, knowledge of X{{\mathbb{P}}^{*}_{X}} can improve efficiency at θ\theta^{*} if and only if Var[ϕη(X)]0\mathrm{Var}[{\phi^{*}_{\eta^{*}}}(X)]\succ 0, i.e. if and only if ϕη(X){\phi^{*}_{\eta^{*}}}(X) is not a constant almost surely.
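
As a concrete illustration of Theorem 3.1 (a standard example that is not part of the theorem statement; cf. the semi-supervised mean estimation problem studied by Zhang et al. [2019]), take \theta({\mathbb{P}})={\mathbb{E}}_{{\mathbb{P}}}[Y] under a nonparametric model. Then

{{\varphi}^{*}_{\eta^{*}}}(z)=y-\theta^{*},\qquad{\phi^{*}_{\eta^{*}}}(x)={\mathbb{E}}[Y\mid X=x]-\theta^{*},

so the lower bound in (5) equals \mathrm{Var}\left[Y-{\mathbb{E}}[Y\mid X]\right]={\mathbb{E}}\left[\mathrm{Var}(Y\mid X)\right], and knowledge of {{\mathbb{P}}^{*}_{X}} improves the bound precisely when the regression function {\mathbb{E}}[Y\mid X] is non-constant.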

3.2 Inference with a well-specified parameter

It is well-known that a correctly-specified regression model cannot be improved with additional unlabeled data [Kawakita and Kanamori, 2013, Buja et al., 2019a, Song et al., 2023, Gan and Liang, 2023]; Chakrabortty [2016] further extended this result to the setting of M-estimation. In this section, we formally characterize this phenomenon, and generalize it to an arbitrary inferential problem, in the ISS setting.

The next definition is motivated by Buja et al. [2019b].

Definition 3.1 (Well-specified parameter).

Let =XYX{{\mathbb{P}}^{*}}={{\mathbb{P}}^{*}_{X}}{{\mathbb{P}}^{*}_{Y\mid X}} be the data-generating distribution, and 𝒫X{{\mathcal{P}}_{X}} a model of the marginal distribution such that X𝒫X{\mathbb{P}}_{X}^{*}\in{\mathcal{P}}_{X}. A functional θ():𝒫XYXp{\theta(\cdot)}:{{\mathcal{P}}_{X}}\otimes{{\mathbb{P}}^{*}_{Y\mid X}}\to{\mathbb{R}}^{p} is well-specified at YX{{\mathbb{P}}^{*}_{Y\mid X}} relative to 𝒫X{{\mathcal{P}}_{X}} if any X𝒫X{\mathbb{P}}_{X}\in{{\mathcal{P}}_{X}} satisfies

θ(XYX)=θ(XYX).\theta({{\mathbb{P}}^{*}_{X}}{{\mathbb{P}}^{*}_{Y\mid X}})=\theta({\mathbb{P}}_{X}{{\mathbb{P}}^{*}_{Y\mid X}}).

We emphasize that well-specification is a joint property of the functional of interest θ(){\theta(\cdot)}, the conditional distribution YX{{\mathbb{P}}^{*}_{Y\mid X}}, and the marginal model 𝒫X{{\mathcal{P}}_{X}}. Definition  3.1 states that if the parameter is well-specified, then a change to the marginal model does not change θ()\theta(\cdot). Intuitively, if this is the case, then knowledge of X{{\mathbb{P}}^{*}_{X}} in the ISS setting will not affect inference on θ\theta^{*}.
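
For instance (a standard example, included here only to illustrate Definition 3.1), suppose that {{\mathcal{P}}_{X}} is nonparametric and that \theta(\cdot) is the least-squares coefficient,

\theta({\mathbb{P}}_{X}{{\mathbb{P}}^{*}_{Y\mid X}})=\arg\min_{\beta}{\mathbb{E}}_{{\mathbb{P}}_{X}{{\mathbb{P}}^{*}_{Y\mid X}}}\left[(Y-X^{\top}\beta)^{2}\right]={\mathbb{E}}_{{\mathbb{P}}_{X}}\left[XX^{\top}\right]^{-1}{\mathbb{E}}_{{\mathbb{P}}_{X}}\left[X\,{\mathbb{E}}[Y\mid X]\right],

where {\mathbb{E}}[Y\mid X] is determined by {{\mathbb{P}}^{*}_{Y\mid X}}. If the linear model is correctly specified, i.e. {\mathbb{E}}[Y\mid X]=X^{\top}\beta^{*} for some \beta^{*}, then \theta({\mathbb{P}}_{X}{{\mathbb{P}}^{*}_{Y\mid X}})=\beta^{*} for every {\mathbb{P}}_{X} with a non-singular second-moment matrix, so \theta(\cdot) is well-specified; under misspecification, the best linear approximation generally varies with {\mathbb{P}}_{X}, and \theta(\cdot) is not well-specified.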

Theorem 3.2.

Under the conditions of Theorem 3.1, let φη(z){{\varphi}^{*}_{\eta^{*}}}(z) be the efficient influence function of θ(){\theta(\cdot)} at {{\mathbb{P}}^{*}} relative to 𝒫{{\mathcal{P}}}. If θ(){\theta(\cdot)} is well-specified at YX{{\mathbb{P}}^{*}_{Y\mid X}} relative to 𝒫X{{\mathcal{P}}_{X}}, then

ϕη(X)=0,X-a.s.,{\phi^{*}_{\eta^{*}}}(X)=0,\quad{{\mathbb{P}}^{*}_{X}}\text{-a.s.}, (6)

where ϕη(x){\phi^{*}_{\eta^{*}}}(x) is the conditional efficient influence function as in (3). Moreover, if 𝒯𝒫X(X)=1,02(X){{\mathcal{T}}_{{\mathcal{P}}_{X}}({{\mathbb{P}}^{*}_{X}})}={{\mathcal{L}}^{2}_{1,0}({{\mathbb{P}}^{*}_{X}})}, then any influence function φη{{\varphi}_{\eta^{*}}} of a regular and asymptotically linear estimator of θ\theta^{*} satisfies

ϕη(x)=0,X-a.s.,{\phi_{\eta^{*}}}(x)=0,\quad{{\mathbb{P}}^{*}_{X}}\text{-a.s.}, (7)

where the notation ϕη(x){\phi_{\eta^{*}}}(x) denotes the conditional influence function,

ϕη(x):=𝔼[φη(Z)X=x].{\phi_{\eta^{*}}}(x):={\mathbb{E}}[{{\varphi}_{\eta^{*}}}(Z)\mid X=x]. (8)

As a direct corollary of Theorem 3.2, if θ(){\theta(\cdot)} is well-specified, then knowledge of X{{\mathbb{P}}^{*}_{X}} does not improve the semiparametric efficiency lower bound.

Corollary 3.3.

Under the conditions of Theorem 3.1, suppose that θ(){\theta(\cdot)} is well-specified at YX{{\mathbb{P}}^{*}_{Y\mid X}} relative to 𝒫X{{\mathcal{P}}_{X}}. Then, the semiparametric efficiency lower bound of {{\mathbb{P}}^{*}} relative to 𝒫{{\mathcal{P}}} is the same as the semiparametric efficiency lower bound of {{\mathbb{P}}^{*}} relative to {X}𝒫YX{\{{\mathbb{P}}_{X}^{*}\}\otimes{\mathcal{P}}_{Y\mid X}}.

3.3 Safe and efficient estimators

Together, Theorems 3.1 and 3.2 imply that when the parameter is not well-specified, we can potentially use knowledge of X{{\mathbb{P}}^{*}_{X}} for more efficient inference. We will now present two such approaches, both of which build upon an initial supervised estimator. Under minimal assumptions, the safe estimator is always at least as efficient as the initial supervised estimator. By contrast, under a stronger set of assumptions, the efficient estimator achieves the efficiency lower bound (5) under the ISS setting.

We first provide some intuition behind the two estimators. Suppose that η:𝒫Ω\eta:{{\mathcal{P}}}\to\Omega is a functional that takes values in a metric space Ω\Omega with metric ρ(,)\rho(\cdot,\cdot) and η=η()\eta^{*}=\eta({{\mathbb{P}}^{*}}). Let θ^n=θ^(n)\hat{\theta}_{n}=\hat{\theta}({{\mathcal{L}}_{n}}) denote an initial supervised estimator of θ\theta^{*} that is regular and asymptotically linear with influence function φη(z){{\varphi}_{\eta^{*}}}(z). Motivated by the form of the efficient influence function (4) under the ISS setting, we aim to use knowledge of X{{\mathbb{P}}^{*}_{X}} to obtain an estimator ϕ^n(x){\hat{\phi}_{n}}(x) of the conditional influence function ϕη(x)=𝔼[φη(Z)X=x]{\phi_{\eta^{*}}}(x)={\mathbb{E}}[{{\varphi}_{\eta^{*}}}(Z)\mid X=x]. This will lead to a new estimator of the form

θ^n,X=θ^n1ni=1nϕ^n(Xi).\hat{\theta}_{n,{{\mathbb{P}}^{*}_{X}}}=\hat{\theta}_{n}-\frac{1}{n}\sum_{i=1}^{n}{\hat{\phi}_{n}}(X_{i}). (9)

In what follows, we let η~n\tilde{\eta}_{n} denote an estimator of η\eta^{*}.

3.3.1 The safe estimator

Let g:𝒳dg:{\mathcal{X}}\to{\mathbb{R}}^{d} denote an arbitrary measurable function of xx, and define

g0(x):=g(x)𝔼[g(X)].{g^{0}}(x):=g(x)-{\mathbb{E}}[g(X)]. (10)

(Since X{{\mathbb{P}}^{*}_{X}} is known in the ISS setting, we can compute g0(){g^{0}}(\cdot) given g()g(\cdot).)

To construct the safe estimator, we will estimate ϕη(x){\phi_{\eta^{*}}}(x) by regressing {φη~n(Zi)}i=1n\left\{{{{\varphi}_{\tilde{\eta}_{n}}}(Z_{i})}\right\}_{i=1}^{n} onto {g0(Xi)}i=1n\left\{{{g^{0}}(X_{i})}\right\}_{i=1}^{n}, leading to regression coefficients

𝐁^ng=[1ni=1nφη~n(Zi)g0(Xi)]𝔼[g0(X)g0(X)]1,\hat{\mathbf{B}}_{n}^{g}=\left[\frac{1}{n}\sum_{i=1}^{n}{{\varphi}_{\tilde{\eta}_{n}}}(Z_{i}){g^{0}}(X_{i})^{\top}\right]{\mathbb{E}}\left[{g^{0}}(X){g^{0}}(X)^{\top}\right]^{-1}, (11)

where the expectation in (11) can be computed exactly since X{{\mathbb{P}}^{*}_{X}} is known. Then the safe ISS estimator takes the form

θ^n,Xsafe=θ^n1ni=1n𝐁^ngg0(Xi).\hat{\theta}_{n,{{\mathbb{P}}^{*}_{X}}}^{\text{safe}}=\hat{\theta}_{n}-\frac{1}{n}\sum_{i=1}^{n}\hat{\mathbf{B}}_{n}^{g}{g^{0}}(X_{i}). (12)
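
For concreteness, (11) and (12) amount to a single multivariate least-squares projection followed by a plug-in correction. The following is a minimal numpy sketch, assuming that the estimated influence-function values, the centered basis values, and the population second-moment matrix of {g^{0}}(X) have already been computed; all array and function names are illustrative rather than part of the proposed methodology.

    import numpy as np

    def safe_iss_estimator(theta_hat, phi, g0, Sigma_g):
        # theta_hat : (p,)    initial supervised estimate
        # phi       : (n, p)  rows are estimated influence values at the labeled points
        # g0        : (n, d)  rows are centered basis values g^0(X_i)
        # Sigma_g   : (d, d)  E[g^0(X) g^0(X)^T], computable because P*_X is known
        n = phi.shape[0]
        B_hat = (phi.T @ g0 / n) @ np.linalg.inv(Sigma_g)  # regression coefficients, eq. (11)
        return theta_hat - B_hat @ g0.mean(axis=0)         # safe ISS estimator, eq. (12)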

To establish the convergence of θ^n,Xsafe\hat{\theta}_{n,{{\mathbb{P}}^{*}_{X}}}^{\text{safe}}, we make the following assumption.

Assumption 3.1.

(a) The initial supervised estimator θ^n=θ^(n)\hat{\theta}_{n}=\hat{\theta}({{\mathcal{L}}_{n}}) is a regular and asymptotically linear estimator of θ\theta^{*} with influence function φη(z){{\varphi}_{\eta^{*}}}(z). (b) There exists a set 𝒪Ω{\mathcal{O}}\subset\Omega such that η𝒪\eta^{*}\in{\mathcal{O}}, the class of functions {φη(z):η𝒪}\left\{{{\varphi}_{\eta}(z):\eta\in{\mathcal{O}}}\right\} is {{\mathbb{P}}^{*}}-Donsker, and for all {η1,η2}𝒪\left\{{\eta_{1},\eta_{2}}\right\}\subset{\mathcal{O}}, φη1(z)φη2(z)L(z)ρ(η1,η2)\left\lVert{\varphi}_{\eta_{1}}(z)-{\varphi}_{\eta_{2}}(z)\right\rVert\leqslant L(z)\rho(\eta_{1},\eta_{2}), where L:𝒵+L:{\mathcal{Z}}\to{\mathbb{R}}^{+} is a square integrable function. (c) There exists an estimator η~n\tilde{\eta}_{n} of η\eta^{*} such that {η~n𝒪}1{{\mathbb{P}}^{*}}\left\{{\tilde{\eta}_{n}\in{\mathcal{O}}}\right\}\to 1 and ρ(η~n,η)=op(1)\rho(\tilde{\eta}_{n},\eta^{*})=o_{p}(1).

Assumptions 3.1 (b) and (c) state that we can find a consistent estimator η~n\tilde{\eta}_{n} of the functional η\eta^{*} on which the influence function of θ^n\hat{\theta}_{n} depends, such that η~n\tilde{\eta}_{n} asymptotically belongs to a realization set 𝒪{\mathcal{O}} on which the class of functions {φη(z):η𝒪}\left\{{{\varphi}_{\eta}(z):\eta\in{\mathcal{O}}}\right\} is a Donsker class. The Donsker condition is standard in semiparametric statistics, and leads to n\sqrt{n}-convergence while allowing for rich classes of functions. We validate Assumption 3.1 for specific inferential problems in Section 7. In the special case when η()\eta(\cdot) is finite-dimensional, the next proposition provides sufficient conditions for Assumption 3.1.

Proposition 3.4.

Suppose that (i) η()\eta(\cdot) is a finite-dimensional functional; (ii) there exists a bounded open set 𝒪{\mathcal{O}} such that η𝒪\eta^{*}\in{\mathcal{O}}, and for all {η1,η2}𝒪\left\{{\eta_{1},\eta_{2}}\right\}\subset{\mathcal{O}}, φη1(z)φη2(z)L(z)η1η2\left\lVert{\varphi}_{\eta_{1}}(z)-{\varphi}_{\eta_{2}}(z)\right\rVert\leqslant L(z)\left\lVert\eta_{1}-\eta_{2}\right\rVert, where L:𝒵+L:{\mathcal{Z}}\to{\mathbb{R}}^{+} is a square integrable function; and (iii) there exists an estimator η~n\tilde{\eta}_{n} of η\eta^{*} such that η~nη=op(1)\left\lVert\tilde{\eta}_{n}-\eta^{*}\right\rVert=o_{p}(1). Then, Assumptions 3.1 (b) and (c) hold.

Define the population regression coefficient as

𝐁g:=𝔼[φη(Z)g0(X)]𝔼[g0(X)g0(X)]1.\mathbf{B}^{g}:={\mathbb{E}}\left[{{\varphi}_{\eta^{*}}}(Z){g^{0}}(X)^{\top}\right]{\mathbb{E}}\left[{g^{0}}(X){g^{0}}(X)^{\top}\right]^{-1}. (13)

The next theorem establishes the asymptotic behavior of θ^n,Xsafe\hat{\theta}_{n,{{\mathbb{P}}^{*}_{X}}}^{\text{safe}} in (12).

Theorem 3.5.

Suppose that θ^n=θ^(n)\hat{\theta}_{n}=\hat{\theta}({{\mathcal{L}}_{n}}) is a supervised estimator that satisfies Assumption 3.1, and η~n\tilde{\eta}_{n} is an estimator of η\eta^{*} as in Assumption 3.1 (c). Let g:𝒳dg:{\mathcal{X}}\to{\mathbb{R}}^{d} be a square-integrable function such that 𝔼[g(X)2]<{\mathbb{E}}\left[\left\lVert g(X)\right\rVert^{2}\right]<\infty and Var[g(X)]\mathrm{Var}\left[g(X)\right] is non-singular, and let g0(x){g^{0}}(x) be its centered version (10). Then, for 𝐁g\mathbf{B}^{g} in (13), θ^n,Xsafe\hat{\theta}_{n,{{\mathbb{P}}^{*}_{X}}}^{\text{safe}} defined in (12) is a regular and asymptotically linear estimator of θ\theta^{*} with influence function

φη(z)𝐁gg0(x),{{\varphi}_{\eta^{*}}}(z)-\mathbf{B}^{g}{g^{0}}(x),

and asymptotic variance Var[φη(Z)𝐁gg0(X)]\mathrm{Var}\left[{{\varphi}_{\eta^{*}}}(Z)-\mathbf{B}^{g}{g^{0}}(X)\right]. Moreover,

Var[φη(Z)𝐁gg0(X)]\displaystyle\mathrm{Var}\left[{{\varphi}_{\eta^{*}}}(Z)-\mathbf{B}^{g}{g^{0}}(X)\right] =Var[φη(Z)ϕη(X)]+Var[ϕη(X)𝐁gg0(X)]\displaystyle=\mathrm{Var}\left[{{\varphi}_{\eta^{*}}}(Z)-{\phi_{\eta^{*}}}(X)\right]+\mathrm{Var}\left[{\phi_{\eta^{*}}}(X)-\mathbf{B}^{g}{g^{0}}(X)\right]
Var[φη(Z)].\displaystyle\preceq\mathrm{Var}\left[{{\varphi}_{\eta^{*}}}(Z)\right]. (14)

Theorem 3.5 establishes that θ^n,Xsafe\hat{\theta}_{n,{{\mathbb{P}}^{*}_{X}}}^{\text{safe}} is always at least as efficient as the initial supervised estimator θ^n\hat{\theta}_{n} under the ISS setting.

Remark 3.1 (Choice of regression basis function g(x)g(x)).

Theorem 3.5 reveals that the asymptotic variance of θ^n,Xsafe\hat{\theta}_{n,{{\mathbb{P}}^{*}_{X}}}^{\text{safe}} is the sum of two terms. The first term, Var[φη(Z)ϕη(X)]\mathrm{Var}\left[{{\varphi}_{\eta^{*}}}(Z)-{\phi_{\eta^{*}}}(X)\right], does not depend on g(x)g(x). The second term, Var[ϕη(X)𝐁gg0(X)]\mathrm{Var}\left[{\phi_{\eta^{*}}}(X)-\mathbf{B}^{g}{g^{0}}(X)\right], can be interpreted as the approximation error that arises from approximating ϕη(x){\phi_{\eta^{*}}}(x) with 𝐁gg0(x)\mathbf{B}^{g}{g^{0}}(x). Thus, θ^n,Xsafe\hat{\theta}_{n,{{\mathbb{P}}^{*}_{X}}}^{\text{safe}} is more efficient when 𝐁gg0(x)\mathbf{B}^{g}{g^{0}}(x) is a better approximation of ϕη(x){\phi_{\eta^{*}}}(x).

3.3.2 The efficient estimator

To construct the efficient estimator, we will approximate ϕη(x){\phi_{\eta^{*}}}(x) by regressing {φη~n(Zi)}i=1n\left\{{{{\varphi}_{\tilde{\eta}_{n}}}(Z_{i})}\right\}_{i=1}^{n} onto a growing number of basis functions. Let 𝒞Mα(𝒳){\mathcal{C}}^{\alpha}_{M}({\mathcal{X}}) denote the Hölder class of functions on 𝒳{\mathcal{X}},

𝒞Mα(𝒳):={f:𝒳, sup𝒳|Dkf(x)|+supx1,x2𝒳|Dkf(x1)Dkf(x2)|x1x2M, k[α]}.{\mathcal{C}}^{\alpha}_{M}({\mathcal{X}}):=\left\{{f:{\mathcal{X}}\to{\mathbb{R}},\text{ }\sup_{\mathcal{X}}\lvert D^{k}f(x)\rvert+\sup_{x_{1},x_{2}\in{\mathcal{X}}}\frac{\lvert D^{k}f(x_{1})-D^{k}f(x_{2})\rvert}{\left\lVert x_{1}-x_{2}\right\rVert}\leqslant M,\text{ }\forall k\in[\alpha]}\right\}. (15)
Assumption 3.2.

Under Assumption 3.1, the set 𝒪{\mathcal{O}} additionally satisfies that: {ϕη(x):η𝒪}𝒮Mα\left\{{\phi_{\eta}(x):\eta\in{\mathcal{O}}}\right\}\subset{\mathcal{S}}^{\alpha}_{M}, where ϕη(x)\phi_{\eta}(x) is the conditional influence function ϕη(x)=𝔼[φη(Z)X=x]\phi_{\eta}(x)={\mathbb{E}}[{\varphi}_{\eta}(Z)\mid X=x] and 𝒮Mα{\mathcal{S}}^{\alpha}_{M} is the class of functions defined as,

𝒮Mα:={f:𝒳p,[f]j𝒞Mjαj(𝒳),j[p]}.{\mathcal{S}}^{\alpha}_{M}:=\left\{{f:{\mathcal{X}}\to{\mathbb{R}}^{p},\quad[f]_{j}\in{\mathcal{C}}^{\alpha_{j}}_{M_{j}}({\mathcal{X}}),\quad\forall j\in[p]}\right\}.

In the definition of 𝒮Mα{\mathcal{S}}^{\alpha}_{M}, [f]j[f]_{j} is the jj-th coordinate of a vector-valued function ff, α=minj[p]{αj}\alpha=\min_{j\in[p]}\left\{{\alpha_{j}}\right\}, and M=maxj[p]{Mj}M=\max_{j\in[p]}\left\{{M_{j}}\right\}.

Under Assumption 3.2, there exists a set of basis functions {gk(x)}k=1\left\{{g_{k}(x)}\right\}_{k=1}^{\infty} of 𝒮Mα{\mathcal{S}}_{M}^{\alpha}, such as a spline basis or a polynomial basis, such that any f𝒮Mαf\in{\mathcal{S}}_{M}^{\alpha} can be represented as an infinite linear combination f(x)=k=1𝐀kgk(x)f(x)=\sum_{k=1}^{\infty}\mathbf{A}_{k}g_{k}(x), for coefficients 𝐀kp×p\mathbf{A}_{k}\in{\mathbb{R}}^{p\times p}. Define GKn(x)=[g1(x),,gKn(x)]{G_{K_{n}}}(x)=\left[g_{1}(x)^{\top},\ldots,g_{K_{n}}(x)^{\top}\right]^{\top} as the concatenation of the first KnK_{n} basis functions. The optimal 2(X){\mathcal{L}}_{2}({{\mathbb{P}}^{*}_{X}})-approximation of ff by the first KnK_{n} basis functions is then f^Kn=𝐁KnGKn(x)\hat{f}_{K_{n}}=\mathbf{B}_{K_{n}}{G_{K_{n}}}(x), where 𝐁Kn\mathbf{B}_{K_{n}} is the population regression coefficient,

𝐁Kn=𝔼[f(X)GKn(X)]𝔼[GKn(X)GKn(X)]1.\mathbf{B}_{K_{n}}={\mathbb{E}}\left[f(X){G_{K_{n}}}(X)^{\top}\right]{\mathbb{E}}\left[{G_{K_{n}}}(X){G_{K_{n}}}(X)^{\top}\right]^{-1}.

It can be shown that the approximation error arising from the first KnK_{n} basis functions satisfies

ff^Knp,02(X)2=O(Kn2αdim(𝒳)),\left\lVert f-\hat{f}_{K_{n}}\right\rVert^{2}_{{\mathcal{L}}^{2}_{p,0}({{\mathbb{P}}^{*}_{X}})}=O\left(K_{n}^{-\frac{2\alpha}{{\text{dim}}({\mathcal{X}})}}\right), (16)

where dim(𝒳){\text{dim}}({\mathcal{X}}) is the dimension of 𝒳{\mathcal{X}} [Newey, 1997].

To estimate the conditional influence function ϕη(x)=𝔼[φη(Z)X=x]{\phi_{\eta^{*}}}(x)={\mathbb{E}}[{{\varphi}_{\eta^{*}}}(Z)\mid X=x], we define the centered basis functions

GKn0(x):=GKn(x)𝔼[GKn(X)],{G_{K_{n}}^{0}}(x):={G_{K_{n}}}(x)-{\mathbb{E}}[{G_{K_{n}}}(X)], (17)

and regress {φη~n(Zi)}i=1n\left\{{{{\varphi}_{\tilde{\eta}_{n}}}(Z_{i})}\right\}_{i=1}^{n} onto {GKn0(Xi)}i=1n\left\{{{G_{K_{n}}^{0}}(X_{i})}\right\}_{i=1}^{n}. (Note that (17) can be computed because X{{\mathbb{P}}^{*}_{X}} is known.) The resulting regression coefficients are

𝐁^Kn=[1ni=1nφη~n(Zi)GKn0(Xi)]𝔼[GKn0(X)GKn0(X)]1.\hat{\mathbf{B}}_{K_{n}}=\left[\frac{1}{n}\sum_{i=1}^{n}{{\varphi}_{\tilde{\eta}_{n}}}(Z_{i}){G_{K_{n}}^{0}}(X_{i})^{\top}\right]{\mathbb{E}}\left[{G_{K_{n}}^{0}}(X){G_{K_{n}}^{0}}(X)^{\top}\right]^{-1}. (18)

The nonparametric least squares estimator of ϕη(x){\phi_{\eta^{*}}}(x) is ϕ^n(x)=𝐁^KnGKn0(x){\hat{\phi}_{n}}(x)=\hat{\mathbf{B}}_{K_{n}}{G_{K_{n}}^{0}}(x), and the efficient ISS estimator is

θ^n,Xeff.=θ^n1ni=1n𝐁^KnGKn0(Xi).\hat{\theta}_{n,{{\mathbb{P}}^{*}_{X}}}^{\text{eff.}}=\hat{\theta}_{n}-\frac{1}{n}\sum_{i=1}^{n}\hat{\mathbf{B}}_{K_{n}}{G_{K_{n}}^{0}}(X_{i}). (19)
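
Computationally, the efficient estimator is the safe estimator of Section 3.3.1 applied to a growing centered basis. Below is a minimal sketch, assuming a scalar covariate, a polynomial basis, and that the moments of {G_{K_{n}}}(X) under the known {{\mathbb{P}}^{*}_{X}} are approximated by a large Monte Carlo sample; safe_iss_estimator refers to the illustrative sketch in Section 3.3.1, and all other names are likewise illustrative.

    import numpy as np

    def poly_basis(x, K):
        # G_K(x) for a scalar covariate: the monomials x, x^2, ..., x^K
        return np.column_stack([x ** k for k in range(1, K + 1)])

    def efficient_iss_estimator(theta_hat, phi, x_labeled, x_from_PX, K):
        # x_labeled : (n,) labeled covariates;  x_from_PX : (m,) draws from the known P*_X
        G_pop = poly_basis(x_from_PX, K)
        mean_G = G_pop.mean(axis=0)                       # approximates E[G_K(X)]
        G0_pop = G_pop - mean_G
        Sigma_G = G0_pop.T @ G0_pop / G0_pop.shape[0]     # approximates Var[G_K(X)]
        G0_labeled = poly_basis(x_labeled, K) - mean_G    # centered basis, eq. (17)
        return safe_iss_estimator(theta_hat, phi, G0_labeled, Sigma_G)  # eqs. (18)-(19)

In line with the conditions of the theorem below, the number of basis functions K would be chosen to grow slowly with n.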

Define ζn2:=supx𝒳{GKn0(x)Var[GKn0(X)]1GKn0(x)}\zeta_{n}^{2}:=\sup_{x\in{\mathcal{X}}}\left\{{G_{K_{n}}^{0}}(x)^{\top}\mathrm{Var}[{G_{K_{n}}^{0}}(X)]^{-1}{G_{K_{n}}^{0}}(x)\right\}. The next theorem establishes the theoretical properties of θ^n,Xeff.\hat{\theta}_{n,{{\mathbb{P}}^{*}_{X}}}^{\text{eff.}}.

Theorem 3.6.

Suppose that θ^n=θ^(n)\hat{\theta}_{n}=\hat{\theta}({{\mathcal{L}}_{n}}) is a supervised estimator that satisfies Assumptions 3.1 and 3.2 and η~n\tilde{\eta}_{n} is an estimator of η\eta^{*} as in Assumption 3.1 (c). Further suppose that {gk(x)}k=1\left\{{g_{k}(x)}\right\}_{k=1}^{\infty} is a set of basis functions of 𝒮Mα{\mathcal{S}}_{M}^{\alpha} for which GKn(x)=[g1(x),,gKn(x)]{G_{K_{n}}}(x)=\left[g_{1}(x)^{\top},\ldots,g_{K_{n}}(x)^{\top}\right]^{\top} satisfies (16) and

infKn{λmin(Var[GKn(X)])}>0.\inf_{K_{n}}\left\{{\lambda_{\min}\left(\mathrm{Var}\left[{G_{K_{n}}}(X)\right]\right)}\right\}>0.

Let GKn0(x){G_{K_{n}}^{0}}(x) be the centered version of GKn(x){G_{K_{n}}}(x) as in (17). If α>dim(𝒳)\alpha>{\text{dim}}({\mathcal{X}}), KnK_{n}\to\infty, Knρ(η~n,η)0K_{n}\rho(\tilde{\eta}_{n},\eta^{*})\to 0, and ζn2n0\frac{\zeta_{n}^{2}}{n}\to 0, then θ^n,Xeff.\hat{\theta}_{n,{{\mathbb{P}}^{*}_{X}}}^{\text{eff.}} in (19) is a regular and asymptotically linear estimator of θ\theta^{*} with influence function

φη(z)ϕη(x){{\varphi}_{\eta^{*}}}(z)-{\phi_{\eta^{*}}}(x)

under the ISS setting. Moreover, its asymptotic variance is Var[φη(Z)ϕη(X)]\mathrm{Var}\left[{{\varphi}_{\eta^{*}}}(Z)-{\phi_{\eta^{*}}}(X)\right], which satisfies

Var[φη(Z)ϕη(X)]Var[φη(Z)𝐁gg0(X)]Var[φη(Z)],\mathrm{Var}\left[{{\varphi}_{\eta^{*}}}(Z)-{\phi_{\eta^{*}}}(X)\right]\preceq\mathrm{Var}\left[{{\varphi}_{\eta^{*}}}(Z)-\mathbf{B}^{g}{g^{0}}(X)\right]\preceq\mathrm{Var}\left[{{\varphi}_{\eta^{*}}}(Z)\right], (20)

for arbitrary g(x)g(x), g0(x){g^{0}}(x), and 𝐁g\mathbf{B}^{g} as in Theorem 3.5.

Theorem 3.6 shows that the efficient estimator θ^n,Xeff.\hat{\theta}_{n,{{\mathbb{P}}^{*}_{X}}}^{\text{eff.}} is always at least as efficient as both the safe estimator θ^n,Xsafe\hat{\theta}_{n,{{\mathbb{P}}^{*}_{X}}}^{\text{safe}} and the initial supervised estimator θ^n\hat{\theta}_{n}. Further, (20) and Theorem 3.1 suggest that if the initial supervised estimator is efficient under the supervised setting, then the efficient estimator is efficient under the ISS setting.

Remark 3.2 (Role of the marginal distribution X{{\mathbb{P}}^{*}_{X}}).

The safe estimator θ^n,Xsafe\hat{\theta}_{n,{{\mathbb{P}}^{*}_{X}}}^{\text{safe}} involves 𝔼[g(X)]{\mathbb{E}}[g(X)] and 𝔼[g0(X)g0(X)]{\mathbb{E}}\left[{g^{0}}(X){g^{0}}(X)^{\top}\right], and the efficient estimator θ^n,Xeff.\hat{\theta}_{n,{{\mathbb{P}}^{*}_{X}}}^{\text{eff.}} involves 𝔼[GKn(X)]{\mathbb{E}}[{G_{K_{n}}}(X)] and 𝔼[GKn0(X)GKn0(X)]{\mathbb{E}}\left[{G_{K_{n}}^{0}}(X){G_{K_{n}}^{0}}(X)^{\top}\right]. All four of these quantities require knowledge of X{{\mathbb{P}}^{*}_{X}}. However, Theorems 3.5 and 3.6 would hold even if 𝔼[g0(X)g0(X)]{\mathbb{E}}\left[{g^{0}}(X){g^{0}}(X)^{\top}\right] and 𝔼[GKn0(X)GKn0(X)]{\mathbb{E}}\left[{G_{K_{n}}^{0}}(X){G_{K_{n}}^{0}}(X)^{\top}\right] were replaced by their empirical counterparts, 1ni=1ng0(Xi)g0(Xi)\frac{1}{n}\sum_{i=1}^{n}{g^{0}}(X_{i}){g^{0}}(X_{i})^{\top} and 1ni=1nGKn0(Xi)GKn0(Xi)\frac{1}{n}\sum_{i=1}^{n}{G_{K_{n}}^{0}}(X_{i}){G_{K_{n}}^{0}}(X_{i})^{\top}.

4 The ordinary semi-supervised setting

4.1 Semiparametric efficiency lower bound

Because the OSS setting offers more information about the unknown parameter than the supervised setting but less information than the ISS setting, it is natural to think that the semiparametric efficiency bound in the OSS setting should be lower than in the supervised setting but higher than in the ISS setting. In this section, we will formalize this intuition.

In the OSS setting, the full data n𝒰N{{\mathcal{L}}_{n}}\cup{{\mathcal{U}}_{N}} are not jointly i.i.d.. Therefore, the classic theory of semiparametric efficiency for i.i.d. data, as described in Section 2, is not applicable. Bickel and Kwon [2001] provides a framework for efficiency theory for non-i.i.d. data. Inspired by their results, our next theorem establishes the semiparametric efficiency lower bound under the OSS setting. To state Theorem 4.1, we must first adapt some definitions from Section 2 to the OSS setting. Additional definitions necessary for its proof are deferred to Appendix E.3.

Definition 4.1 (Regularity in the OSS setting).

An estimator θ^n,N=θ^n,N(n𝒰N)\hat{\theta}_{n,N}=\hat{\theta}_{n,N}({{\mathcal{L}}_{n}}\cup{{\mathcal{U}}_{N}}) of θ\theta^{*} is regular relative to a model 𝒫{{\mathcal{P}}} if, for every regular parametric sub-model {t:tT}𝒫\left\{{{\mathbb{P}}_{t}:t\in T\subset{\mathbb{R}}}\right\}\subset{{\mathcal{P}}} such that t={\mathbb{P}}_{t^{*}}={{\mathbb{P}}^{*}} for some tTt^{*}\in T,

n[θ^n,Nθ(t+hn)]V,\sqrt{n}\left[\hat{\theta}_{n,N}-\theta\left({\mathbb{P}}_{t^{*}+\frac{h}{\sqrt{n}}}\right)\right]\rightsquigarrow V_{{\mathbb{P}}^{*}}, (21)

for all hh\in{\mathbb{R}} under t+hn{\mathbb{P}}_{t^{*}+\frac{h}{\sqrt{n}}} and Nn+Nγ(0,1)\frac{N}{n+N}\to\gamma\in(0,1), where VV_{{\mathbb{P}}^{*}} is some random variable whose distribution depends on {{\mathbb{P}}^{*}} but not on hh.

Definition 4.2 (Asymptotic linearity in the OSS setting).

An estimator θ^n,N\hat{\theta}_{n,N} of θ\theta^{*} is asymptotically linear with influence function φη:𝒵×𝒳p{{\varphi}_{\eta^{*}}}:{\mathcal{Z}}\times{\mathcal{X}}\rightarrow{\mathbb{R}}^{p}, where

φη(z1,x2)=φη(1)(z1)+φη(2)(x2){{\varphi}_{\eta^{*}}}(z_{1},x_{2})={{\varphi}_{\eta^{*}}^{(1)}}(z_{1})+{{\varphi}_{\eta^{*}}^{(2)}}(x_{2}) (22)

for some φη(1)(z)p,02(){{\varphi}_{\eta^{*}}^{(1)}}(z)\in{{\mathcal{L}}^{2}_{p,0}({{\mathbb{P}}^{*}})} and φη(2)(x)p,02(X){{\varphi}_{\eta^{*}}^{(2)}}(x)\in{{\mathcal{L}}^{2}_{p,0}({{\mathbb{P}}^{*}_{X}})}, if

n(θ^n,Nθ)=1ni=1nφη(1)(Zi)+1Ni=n+1n+Nφη(2)(Xi)+op(1),\sqrt{n}\left(\hat{\theta}_{n,N}-\theta^{*}\right)=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}{{\varphi}_{\eta^{*}}^{(1)}}(Z_{i})+\frac{1}{\sqrt{N}}\sum_{i=n+1}^{n+N}{{\varphi}_{\eta^{*}}^{(2)}}(X_{i})+o_{p}(1), (23)

for Nn+Nγ(0,1]\frac{N}{n+N}\to\gamma\in(0,1].

Remark 4.1.

In the OSS setting, the arguments of the influence function are in the product space 𝒵×𝒳{\mathcal{Z}}\times{\mathcal{X}}. Furthermore, since 𝒵=𝒳×𝒴{\mathcal{Z}}={\mathcal{X}}\times{\mathcal{Y}}, this product space can also be written as (𝒳×𝒴)×𝒳({\mathcal{X}}\times{\mathcal{Y}})\times{\mathcal{X}}. Thus, to avoid notational confusion, when referring to a function on this space, we use arguments (z1,x2)(z_{1},x_{2}): that is, these subscripts are intended to disambiguate the role of the two arguments in the product space.

The form of the influence function (22) arises from the fact that the data consist of two i.i.d. parts, n{{\mathcal{L}}_{n}} and 𝒰N{{\mathcal{U}}_{N}}, which are independent of each other. By Definition 4.2, if θ^n,N\hat{\theta}_{n,N} is asymptotically linear, then

n(θ^n,Nθ)N(0,Var[φη(1)(Z)]+Var[φη(2)(X)]).\sqrt{n}\left(\hat{\theta}_{n,N}-\theta^{*}\right)\rightsquigarrow N\left(0,\mathrm{Var}\left[{{\varphi}_{\eta^{*}}^{(1)}}(Z)\right]+\mathrm{Var}\left[{{\varphi}_{\eta^{*}}^{(2)}}(X)\right]\right).

That is, the asymptotic variance of an asymptotically linear estimator is the sum of the variances of the two components of its influence function.

Armed with these two definitions, Lemma 2.1 can be generalized to the OSS setting. Informally, there exists a unique efficient influence function of the form (22), and any regular and asymptotically linear estimator whose influence function equals the efficient influence function is an efficient estimator. Additional details are provided in Appendix E.3.

The next theorem identifies the efficient influence function and the semiparametric efficiency lower bound under the OSS.

Theorem 4.1.

Let 𝒫{{\mathcal{P}}} be defined as in (1), and let Nn+Nγ(0,1)\frac{N}{n+N}\to\gamma\in(0,1). If the efficient influence function of θ(){\theta(\cdot)} at {{\mathbb{P}}^{*}} relative to 𝒫{{\mathcal{P}}} is φη{{\varphi}^{*}_{\eta^{*}}}, then the efficient influence function of θ(){\theta(\cdot)} at {{\mathbb{P}}^{*}} relative to 𝒫{{\mathcal{P}}} under the OSS setting is

[φη(z1)γϕη(x1)]+γ(1γ)ϕη(x2),\left[{{\varphi}^{*}_{\eta^{*}}}(z_{1})-\gamma{\phi^{*}_{\eta^{*}}}(x_{1})\right]+\sqrt{\gamma(1-\gamma)}{\phi^{*}_{\eta^{*}}}(x_{2}), (24)

where ϕη(x){\phi^{*}_{\eta^{*}}}(x) is the conditional efficient influence function defined in (3). Moreover, the semiparametric efficiency bound can be expressed as

Var[φη(Z)γϕη(X)]+Var[γ(1γ)ϕη(X)]\displaystyle\mathrm{Var}\left[{{\varphi}^{*}_{\eta^{*}}}(Z)-\gamma{\phi^{*}_{\eta^{*}}}(X)\right]+\mathrm{Var}\left[\sqrt{\gamma(1-\gamma)}{\phi^{*}_{\eta^{*}}}(X)\right] (25)
=Var[φη(Z)]γVar[ϕη(X)]\displaystyle=\mathrm{Var}\left[{{\varphi}^{*}_{\eta^{*}}}(Z)\right]-\gamma\mathrm{Var}\left[{\phi^{*}_{\eta^{*}}}(X)\right]
=Var[φη(Z)ϕη(X)]+(1γ)Var[ϕη(X)].\displaystyle=\mathrm{Var}\left[{{\varphi}^{*}_{\eta^{*}}}(Z)-{\phi^{*}_{\eta^{*}}}(X)\right]+(1-\gamma)\mathrm{Var}\left[{\phi^{*}_{\eta^{*}}}(X)\right].

Theorem 4.1 confirms our intuition for the efficiency bound under the OSS setting. By definition, the efficiency bound under the supervised setting is Var[φη(Z)]\mathrm{Var}\left[{{\varphi}^{*}_{\eta^{*}}}(Z)\right]. The second line of (25) reveals that the efficiency bound under the OSS setting is smaller by the amount γVar[ϕη(X)]\gamma\mathrm{Var}\left[{\phi^{*}_{\eta^{*}}}(X)\right]. Moreover, in Theorem 3.1 we showed that the efficiency bound under the ISS setting is Var[φη(Z)ϕη(X)]\mathrm{Var}\left[{{\varphi}^{*}_{\eta^{*}}}(Z)-{\phi^{*}_{\eta^{*}}}(X)\right]. The third line of (25) shows that the efficiency bound under the OSS setting exceeds this by the amount (1γ)Var[ϕη(X)](1-\gamma)\mathrm{Var}\left[{\phi^{*}_{\eta^{*}}}(X)\right].
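
For completeness, the first equality in (25) follows from a short calculation that is not displayed in the theorem: since {\phi^{*}_{\eta^{*}}}(X)={\mathbb{E}}[{{\varphi}^{*}_{\eta^{*}}}(Z)\mid X], the tower property gives \mathrm{Cov}[{{\varphi}^{*}_{\eta^{*}}}(Z),{\phi^{*}_{\eta^{*}}}(X)]=\mathrm{Var}[{\phi^{*}_{\eta^{*}}}(X)], so that

\mathrm{Var}\left[{{\varphi}^{*}_{\eta^{*}}}(Z)-\gamma{\phi^{*}_{\eta^{*}}}(X)\right]+\gamma(1-\gamma)\mathrm{Var}\left[{\phi^{*}_{\eta^{*}}}(X)\right]=\mathrm{Var}\left[{{\varphi}^{*}_{\eta^{*}}}(Z)\right]+(-2\gamma+\gamma^{2}+\gamma-\gamma^{2})\mathrm{Var}\left[{\phi^{*}_{\eta^{*}}}(X)\right]=\mathrm{Var}\left[{{\varphi}^{*}_{\eta^{*}}}(Z)\right]-\gamma\mathrm{Var}\left[{\phi^{*}_{\eta^{*}}}(X)\right].

The same calculation with \gamma=1 gives the first equality in (5).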

The efficiency bound (25) depends on the limiting proportion of unlabeled data, γ\gamma. Intuitively, when γ\gamma is large, we have more unlabeled data, and the efficiency bound in (25) improves. The special cases where γ=0\gamma=0 and γ=1\gamma=1 are particularly instructive. When γ=0\gamma=0, the amount of unlabeled data is negligible, and hence we should expect no improvement in efficiency over the supervised setting: this intuition is confirmed by setting γ0\gamma\to 0 in (25). When γ=1\gamma=1, there are many more unlabeled than labeled observations; thus, it is as if we know the marginal distribution X{{\mathbb{P}}^{*}_{X}}. Letting γ1\gamma\rightarrow 1, the efficiency lower bound agrees with that under the ISS (5).

We saw in Corollary 3.3 that when the functional of interest is well-specified, the efficiency bound under the ISS setting is the same as in the supervised setting. As an immediate corollary of Theorem 4.1 and Theorem 3.2, we now show that the same result holds under the OSS setting.

Corollary 4.2.

Under the conditions of Theorem 4.1, let φη(z){{\varphi}^{*}_{\eta^{*}}}(z) be the efficient influence function of θ(){\theta(\cdot)} at {{\mathbb{P}}^{*}} relative to 𝒫{{\mathcal{P}}}. If θ(){\theta(\cdot)} is well-specified at YX{{\mathbb{P}}^{*}_{Y\mid X}} relative to 𝒫X{{\mathcal{P}}_{X}} in the sense of Definition 3.1, then the efficient influence function of θ(){\theta(\cdot)} at {{\mathbb{P}}^{*}} relative to 𝒫{{\mathcal{P}}} under the OSS is φη(z1){{\varphi}^{*}_{\eta^{*}}}(z_{1}).

Thus, an efficient supervised estimator of a well-specified parameter can never be improved via the use of unlabeled data.

4.2 Safe and efficient estimators

Similar to Section 3.3, in this section we provide two types of OSS estimators, a safe estimator and an efficient estimator, both of which build upon and improve an initial supervised estimator.

Recall from Remark 3.2 that the safe and efficient estimators proposed in the ISS setting require knowledge of X{{\mathbb{P}}^{*}_{X}}. Of course, in the OSS setting, X{{\mathbb{P}}^{*}_{X}} is unavailable. Thus, we will simply replace X{{\mathbb{P}}^{*}_{X}} in (12) and (19) with n+N,X{\mathbb{P}}_{n+N,X}, the empirical marginal distribution of the labeled and unlabeled covariates, {Xi}i=1n+N\left\{{X_{i}}\right\}_{i=1}^{n+N}.

4.2.1 The safe estimator

For an arbitrary measurable function of xx, g:𝒳dg:{\mathcal{X}}\to{\mathbb{R}}^{d}, recall that in the ISS setting, the safe estimator (12) made use of its centered version g0(x):=g(x)𝔼[g(X)]{g^{0}}(x):=g(x)-{\mathbb{E}}[g(X)]. Since 𝔼[g(X)]{\mathbb{E}}[g(X)] is unknown under the OSS setting, we define the empirically centered version of gg as

g^0(x)=g(x)1n+Ni=1n+Ng(Xi).{\hat{g}^{0}}(x)=g(x)-\frac{1}{n+N}\sum_{i=1}^{n+N}g(X_{i}). (26)

Now, regressing {φη~n(Zi)}i=1n\left\{{{{\varphi}_{\tilde{\eta}_{n}}}(Z_{i})}\right\}_{i=1}^{n} onto {g^0(Xi)}i=1n\left\{{{\hat{g}^{0}}(X_{i})}\right\}_{i=1}^{n} yields the regression coefficients

𝐁^n,Ng=[1ni=1nφη~n(Zi)g^0(Xi)][1n+Ni=1n+Ng^0(Xi)g^0(Xi)]1.\hat{\mathbf{B}}_{n,N}^{g}=\left[\frac{1}{n}\sum_{i=1}^{n}{\varphi}_{\tilde{\eta}_{n}}(Z_{i}){\hat{g}^{0}}(X_{i})^{\top}\right]\left[\frac{1}{n+N}\sum_{i=1}^{n+N}{\hat{g}^{0}}(X_{i}){\hat{g}^{0}}(X_{i})^{\top}\right]^{-1}. (27)

The regression estimator of the conditional influence function is thus ϕ^n(x)=𝐁^n,Ngg^0(x){\hat{\phi}_{n}}(x)=\hat{\mathbf{B}}_{n,N}^{g}{\hat{g}^{0}}(x), and the corresponding safe OSS estimator is defined as

θ^n,Nsafe=θ^n1ni=1n𝐁^n,Ngg^0(Xi).\hat{\theta}_{n,N}^{\text{safe}}=\hat{\theta}_{n}-\frac{1}{n}\sum_{i=1}^{n}\hat{\mathbf{B}}_{n,N}^{g}{\hat{g}^{0}}(X_{i}). (28)
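
A minimal numpy sketch of (26)-(28), assuming that the estimated influence values on the labeled data and the basis evaluations g(X_i) on all n+N covariates are available (all names are illustrative):

    import numpy as np

    def safe_oss_estimator(theta_hat, phi, g_labeled, g_all):
        # theta_hat : (p,)      initial supervised estimate
        # phi       : (n, p)    estimated influence values on the labeled data
        # g_labeled : (n, d)    g(X_i) on the labeled covariates
        # g_all     : (n+N, d)  g(X_i) on the pooled labeled and unlabeled covariates
        n = phi.shape[0]
        g_bar = g_all.mean(axis=0)                      # pooled empirical centering, eq. (26)
        g0_labeled, g0_all = g_labeled - g_bar, g_all - g_bar
        Gram = g0_all.T @ g0_all / g_all.shape[0]       # empirical d x d second-moment matrix
        B_hat = (phi.T @ g0_labeled / n) @ np.linalg.inv(Gram)  # eq. (27)
        return theta_hat - B_hat @ g0_labeled.mean(axis=0)      # eq. (28)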

Under the same conditions as in Theorem 3.5, we establish the asymptotic behavior of θ^n,Nsafe\hat{\theta}_{n,N}^{\text{safe}}.

Theorem 4.3.

Suppose that θ^n=θ^(n)\hat{\theta}_{n}=\hat{\theta}({{\mathcal{L}}_{n}}) is a supervised estimator that satisfies Assumption 3.1, and η~n\tilde{\eta}_{n} is an estimator of η\eta^{*} as in Assumption 3.1 (c). Let g:𝒳dg:{\mathcal{X}}\to{\mathbb{R}}^{d} be a square-integrable function such that 𝔼[g(X)2]<{\mathbb{E}}\left[\left\lVert g(X)\right\rVert^{2}\right]<\infty and Var[g(X)]\mathrm{Var}\left[g(X)\right] is non-singular, and let g^0(x){\hat{g}^{0}}(x) be its empirically centered version (26). Suppose that Nn+Nγ(0,1]\frac{N}{n+N}\to\gamma\in(0,1]. Then, the estimator θ^n,Nsafe\hat{\theta}_{n,N}^{\text{safe}} defined in (28) is a regular and asymptotically linear estimator of θ\theta^{*} in the sense of Definitions 4.1 and 4.2, with influence function

[φη(z1)γ𝐁gg0(x1)]+γ(1γ)𝐁gg0(x2)\left[{{\varphi}_{\eta^{*}}}(z_{1})-\gamma\mathbf{B}^{g}{g^{0}}(x_{1})\right]+\sqrt{\gamma(1-\gamma)}\mathbf{B}^{g}{g^{0}}(x_{2}) (29)

under the OSS setting, where 𝐁g\mathbf{B}^{g} is defined in (13). Furthermore, the asymptotic variance of θ^n,Nsafe\hat{\theta}_{n,N}^{\text{safe}} takes the form

𝚺safe(γ)=Var[φη(Z)γ𝐁gg0(X)]+Var[γ(1γ)𝐁gg0(X)],{\mathbf{\Sigma}^{\text{safe}}}(\gamma)=\mathrm{Var}\left[{{\varphi}_{\eta^{*}}}(Z)-\gamma\mathbf{B}^{g}{g^{0}}(X)\right]+\mathrm{Var}\left[\sqrt{\gamma(1-\gamma)}\mathbf{B}^{g}{g^{0}}(X)\right], (30)

and satisfies

Var[φη(Z)𝐁gg0(X)]𝚺safe(γ)Var[φη(Z)],γ[0,1].\displaystyle\mathrm{Var}\left[{{\varphi}_{\eta^{*}}}(Z)-\mathbf{B}^{g}{g^{0}}(X)\right]\preceq{\mathbf{\Sigma}^{\text{safe}}}(\gamma)\preceq\mathrm{Var}\left[{{\varphi}_{\eta^{*}}}(Z)\right],\quad\forall\gamma\in[0,1]. (31)

Theorem 4.3 shows that θ^n,Nsafe\hat{\theta}_{n,N}^{\text{safe}} is always at least as efficient as the initial supervised estimator θ^n\hat{\theta}_{n} under the OSS setting. Thus, it is a safe alternative to θ^n\hat{\theta}_{n} when additional unlabeled data are available.

4.2.2 The efficient estimator

As in Section 3.3, we require that the initial supervised estimator θ^n\hat{\theta}_{n} satisfy Assumption 3.2. Then, for a suitable set of basis functions {gk(x)}k=1\left\{{g_{k}(x)}\right\}_{k=1}^{\infty} of 𝒮Mα{\mathcal{S}}_{M}^{\alpha} (defined in Assumption 3.2), we define GKn(x)=[g1(x),,gKn(x)]{G_{K_{n}}}(x)=\left[g_{1}(x)^{\top},\ldots,g_{K_{n}}(x)^{\top}\right]^{\top}. We then center GKn{G_{K_{n}}} with its empirical mean, 1n+Ni=1n+NGKn(Xi)\frac{1}{n+N}\sum_{i=1}^{n+N}{G_{K_{n}}}(X_{i}), leading to

G^Kn0(x)=GKn(x)1n+Ni=1n+NGKn(Xi).{\hat{G}_{K_{n}}^{0}}(x)={G_{K_{n}}}(x)-\frac{1}{n+N}\sum_{i=1}^{n+N}{G_{K_{n}}}(X_{i}). (32)

The nonparametric least squares estimator of the conditional influence function is ϕ^n(x)=𝐁^Kn,NG^Kn0(x){\hat{\phi}_{n}}(x)=\hat{\mathbf{B}}_{K_{n},N}{\hat{G}_{K_{n}}^{0}}(x), where

𝐁^Kn,N=[1ni=1nφη~n(Zi)G^Kn0(Xi)][1n+Ni=1n+NG^Kn0(Xi)G^Kn0(Xi)]1p×(pKn)\hat{\mathbf{B}}_{K_{n},N}=\left[\frac{1}{n}\sum_{i=1}^{n}{{\varphi}_{\tilde{\eta}_{n}}}(Z_{i}){\hat{G}_{K_{n}}^{0}}(X_{i})^{\top}\right]\left[\frac{1}{n+N}\sum_{i=1}^{n+N}{\hat{G}_{K_{n}}^{0}}(X_{i}){\hat{G}_{K_{n}}^{0}}(X_{i})^{\top}\right]^{-1}\in{\mathbb{R}}^{p\times(pK_{n})} (33)

are the coefficients obtained from regressing {φη~n(Zi)}i=1n\left\{{{{\varphi}_{\tilde{\eta}_{n}}}(Z_{i})}\right\}_{i=1}^{n} onto {G^Kn0(Xi)}i=1n\left\{{{\hat{G}_{K_{n}}^{0}}(X_{i})}\right\}_{i=1}^{n}. The efficient OSS estimator is then

θ^n,Neff.=θ^n1ni=1n𝐁^Kn,NG^Kn0(Xi).\hat{\theta}_{n,N}^{\text{eff.}}=\hat{\theta}_{n}-\frac{1}{n}\sum_{i=1}^{n}\hat{\mathbf{B}}_{K_{n},N}{\hat{G}_{K_{n}}^{0}}(X_{i}). (34)

The next theorem establishes the asymptotic properties of θ^n,Neff.\hat{\theta}_{n,N}^{\text{eff.}}.

Theorem 4.4.

Suppose that the supervised estimator θ^n=θ^(n)\hat{\theta}_{n}=\hat{\theta}({{\mathcal{L}}_{n}}) satisfies Assumptions 3.1 and 3.2, and η~n\tilde{\eta}_{n} is an estimator of η\eta^{*} as in Assumption 3.1 (c). Further, suppose that {gk(x)}k=1\left\{{g_{k}(x)}\right\}_{k=1}^{\infty} is a set of basis functions of 𝒮Mα{\mathcal{S}}_{M}^{\alpha} such that GKn(x)=[g1(x),,gKn(x)]{G_{K_{n}}}(x)=\left[g_{1}(x)^{\top},\ldots,g_{K_{n}}(x)^{\top}\right]^{\top} satisfies (16) for any f𝒮Mαf\in{\mathcal{S}}_{M}^{\alpha}, and infKn{λmin(Var[GKn(X)])}>0\inf_{K_{n}}\left\{{\lambda_{\min}\left(\mathrm{Var}\left[{G_{K_{n}}}(X)\right]\right)}\right\}>0. Suppose that Nn+Nγ(0,1]\frac{N}{n+N}\to\gamma\in(0,1], α>dim(𝒳)\alpha>{\text{dim}}({\mathcal{X}}), KnK_{n}\to\infty and Knρ(η~n,η)0K_{n}\rho(\tilde{\eta}_{n},\eta^{*})\to 0, and ζn2n0\frac{\zeta_{n}^{2}}{n}\to 0. Then, θ^n,Neff.\hat{\theta}_{n,N}^{\text{eff.}} defined in (34) is a regular and asymptotically linear estimator of θ\theta^{*} in the sense of Definitions 4.1 and 4.2, and has influence function

[φη(z1)γϕη(x1)]+γ(1γ)ϕη(x2).[{{\varphi}_{\eta^{*}}}(z_{1})-\gamma{\phi_{\eta^{*}}}(x_{1})]+\sqrt{\gamma(1-\gamma)}{\phi_{\eta^{*}}}(x_{2}). (35)

Furthermore, the asymptotic variance of θ^n,Neff.\hat{\theta}_{n,N}^{\text{eff.}} takes the form

𝚺eff(γ)=Var[φη(Z)γϕη(X)]+Var[γ(1γ)ϕη(X)],{\mathbf{\Sigma}^{\text{eff}}}(\gamma)=\mathrm{Var}\left[{{\varphi}_{\eta^{*}}}(Z)-\gamma{\phi_{\eta^{*}}}(X)\right]+\mathrm{Var}\left[\sqrt{\gamma(1-\gamma)}{\phi_{\eta^{*}}}(X)\right], (36)

and satisfies

Var[φη(Z)ϕη(X)]𝚺eff(γ)𝚺safe(γ)Var[φη(Z)],γ[0,1],\displaystyle\mathrm{Var}\left[{{\varphi}_{\eta^{*}}}(Z)-{\phi_{\eta^{*}}}(X)\right]\preceq{\mathbf{\Sigma}^{\text{eff}}}(\gamma)\preceq{\mathbf{\Sigma}^{\text{safe}}}(\gamma)\preceq\mathrm{Var}\left[{{\varphi}_{\eta^{*}}}(Z)\right],\quad\forall\gamma\in[0,1], (37)

where 𝚺safe(γ){\mathbf{\Sigma}^{\text{safe}}}(\gamma) is defined in (30).

As in Section 3.3, the efficient OSS estimator, θ^n,Neff.\hat{\theta}_{n,N}^{\text{eff.}}, is always at least as efficient as both the safe OSS estimator, θ^n,Nsafe\hat{\theta}_{n,N}^{\text{safe}}, and the initial supervised estimator θ^n\hat{\theta}_{n}. Furthermore, (37) together with Theorem 4.1 show that if the initial supervised estimator is efficient under the supervised setting, then the efficient OSS estimator is efficient under the OSS setting.

Remark 4.2.

We noted in Section 4.1 that the efficiency bound in the OSS falls between the efficiency bounds in the ISS and supervised settings. We can see from Theorems 4.3 and 4.4 that a similar property holds for the safe and efficient estimators. Specifically, (31) and (37) show that the safe and efficient OSS estimators are more efficient than the initial supervised estimator θ^n\hat{\theta}_{n}, but are less efficient than the safe and efficient ISS estimators, respectively. This is again due to the fact that under the OSS, unlabeled observations provide more information than is available under the supervised setting, but less than is available under the ISS. Similarly, as γ0\gamma\to 0, i.e. as the proportion of unlabeled data becomes negligible, the OSS estimators are asymptotically equivalent to the initial supervised estimator. On the other hand, when γ=1\gamma=1, the OSS estimators are asymptotically equivalent to the corresponding ISS estimators.

5 Connection to prediction-powered inference

Suppose now that in addition to n{{\mathcal{L}}_{n}} and 𝒰N{{\mathcal{U}}_{N}}, the data analyst also has access to K1K\geq 1 machine learning prediction models fk:𝒳𝒴f_{k}:{\mathcal{X}}\to{\mathcal{Y}}, k[K]k\in[K], which are independent of n{{\mathcal{L}}_{n}} and 𝒰N{{\mathcal{U}}_{N}} (e.g., they were trained on independent data). For instance, f1,,fKf_{1},\ldots,f_{K} may arise from black-box machine learning models such as neural networks or large language models. It is clear that this is a special case of semi-supervised learning, as {fk}k=1K\left\{{f_{k}}\right\}_{k=1}^{K} can be treated as fixed functions conditional on the data upon which they were trained.

Recently, Angelopoulos et al. [2023a] proposed prediction-powered inference (PPI), which provides a principled approach for making use of {fk}k=1K\left\{{f_{k}}\right\}_{k=1}^{K}. Subsequently, a number of PPI variants have been proposed to further improve statistical efficiency or extend PPI to other settings [Angelopoulos et al., 2023b, Miao et al., 2023, Gan and Liang, 2023, Miao and Lu, 2024, Gu and Xia, 2024]. In this section, we re-examine the PPI problem through the lens of our results in previous sections, and apply these insights to improve upon existing PPI estimators.

Since f1,,fKf_{1},\ldots,f_{K} are independent of n𝒰N{{\mathcal{L}}_{n}}\cup{{\mathcal{U}}_{N}}, existing PPI estimators fall into the category of OSS estimators, and can be shown to be regular and asymptotically linear in the sense of Definitions 4.1 and 4.2. Therefore, Theorem 4.1 suggests that their asymptotic variances are lower bounded by the efficiency bound (25). We show in Supplement C that existing PPI estimators cannot achieve the efficiency bound (25) in the OSS setting, unless strong assumptions are made on the machine learning prediction models. Furthermore, if the parameter of interest is well-specified in the sense of Definition 3.1, then by Corollary 4.2 these estimators cannot be more efficient than the efficient supervised estimator. In other words, independently trained machine learning models, however sophisticated and accurate, cannot improve inference when the parameter is well-specified.

Remark 5.1.

Our insight that independently trained machine learning models cannot improve inference in a well-specified model stands in apparent contradiction to the simulation results of Angelopoulos et al. [2023b], who find that PPI does lead to improvement over supervised estimation in generalized linear models. This is because they have simulated data such that f(X)=Y+ϵf(X)=Y+\epsilon, i.e. the machine learning model is not independent of n{{\mathcal{L}}_{n}} and 𝒰N{{\mathcal{U}}_{N}}. A modification to their simulation study to achieve independence (in keeping with the setting of their paper) corroborates our insight, i.e., PPI does not outperform the supervised estimator.

Next, we take advantage of the insights developed in previous sections to propose a class of OSS estimators that incorporates the machine learning models {fk}k=1K\left\{{f_{k}}\right\}_{k=1}^{K} and improves upon the existing PPI estimators. We begin with an initial supervised estimator θ^n\hat{\theta}_{n} that is regular and asymptotically linear with influence function φη(z)=φη(x,y){{\varphi}_{\eta^{*}}}(z)={{\varphi}_{\eta^{*}}}(x,y), and we suppose that η~n\tilde{\eta}_{n} is an estimator of η\eta^{*}. As in Section 4.2, we estimate the conditional influence function ϕη(x){\phi^{*}_{\eta^{*}}}(x) defined in (3) with regression. Specifically, consider

gη~n(x)=[φη~n(x,f1(x)),,φη~n(x,fK(x))],{g_{\tilde{\eta}_{n}}}(x)=\left[{{\varphi}_{\tilde{\eta}_{n}}}(x,f_{1}(x)),\ldots,{{\varphi}_{\tilde{\eta}_{n}}}(x,f_{K}(x))\right]^{\top}, (38)

which arises from replacing the true response in φη~n(x,y){{\varphi}_{\tilde{\eta}_{n}}}(x,y) with the predicted response fk(x)f_{k}(x) from the kk-th machine learning model, for k[K]k\in[K]. Its empirically centered version is

g^η~n0(x)=gη~n(x)1n+Ni=1n+Ngη~n(Xi).{\hat{g}^{0}_{\tilde{\eta}_{n}}}(x)={g_{\tilde{\eta}_{n}}}(x)-\frac{1}{n+N}\sum_{i=1}^{n+N}{g_{\tilde{\eta}_{n}}}(X_{i}). (39)

Then the regression estimator of the conditional influence function is ϕ^n(x)=𝐁^n,Ngη~ng^η~n0(x){\hat{\phi}_{n}}(x)=\hat{\mathbf{B}}_{n,N}^{g_{\tilde{\eta}_{n}}}{\hat{g}^{0}_{\tilde{\eta}_{n}}}(x), where

𝐁^n,Ngη~n=[1ni=1nφη~n(Zi)g^η~n0(Xi)][1n+Ni=1n+Ng^η~n0(Xi)g^η~n0(Xi)]1\hat{\mathbf{B}}_{n,N}^{g_{\tilde{\eta}_{n}}}=\left[\frac{1}{n}\sum_{i=1}^{n}{{\varphi}_{\tilde{\eta}_{n}}}(Z_{i}){\hat{g}^{0}_{\tilde{\eta}_{n}}}(X_{i})^{\top}\right]\left[\frac{1}{n+N}\sum_{i=1}^{n+N}{\hat{g}^{0}_{\tilde{\eta}_{n}}}(X_{i}){\hat{g}^{0}_{\tilde{\eta}_{n}}}(X_{i})^{\top}\right]^{-1} (40)

are the coefficients obtained from regressing {φη~n(Zi)}i=1n\left\{{{{\varphi}_{\tilde{\eta}_{n}}}(Z_{i})}\right\}_{i=1}^{n} onto {g^η~n0(Xi)}i=1n\left\{{{\hat{g}^{0}_{\tilde{\eta}_{n}}}(X_{i})}\right\}_{i=1}^{n}. Motivated by the safe OSS estimator, θ^n,Nsafe\hat{\theta}_{n,N}^{\text{safe}} in (28), the safe PPI estimator is defined as

θ^n,NPPI=θ^n1ni=1n𝐁^n,Ngη~ng^η~n0(Xi).\hat{\theta}_{n,N}^{\text{PPI}}=\hat{\theta}_{n}-\frac{1}{n}\sum_{i=1}^{n}\hat{\mathbf{B}}_{n,N}^{g_{\tilde{\eta}_{n}}}{\hat{g}^{0}_{\tilde{\eta}_{n}}}(X_{i}). (41)
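
To make the construction in (38)–(41) concrete, the following sketch computes θ^n,NPPI\hat{\theta}_{n,N}^{\text{PPI}} from an initial estimate, an influence-function evaluator, and KK prediction models. It is a minimal illustration rather than the implementation used in our experiments; the names influence_fn, predictors, and theta_hat are placeholders for user-supplied objects.

```python
import numpy as np

def safe_ppi_estimate(theta_hat, influence_fn, predictors, X_lab, Y_lab, X_unlab):
    """Minimal sketch of the safe PPI estimator (41).

    theta_hat    : initial supervised estimate, shape (p,)
    influence_fn : callable (X, Y) -> array of shape (m, p); evaluates the influence
                   function with the nuisance eta fixed at its estimate eta_tilde
    predictors   : list of K fitted prediction models, each exposing .predict(X)
    """
    X_all = np.vstack([X_lab, X_unlab])
    n, n_all = X_lab.shape[0], X_all.shape[0]

    # (38): stack the influence function evaluated at each model's predictions
    def g(X):
        return np.hstack([influence_fn(X, f.predict(X)) for f in predictors])

    g_all = g(X_all)
    g0_all = g_all - g_all.mean(axis=0)      # (39): center at the pooled sample mean
    g0_lab = g0_all[:n]

    phi_lab = influence_fn(X_lab, Y_lab)     # influence function on the labeled data

    # (40): cross moment over the labeled data times the inverse pooled Gram matrix
    A = phi_lab.T @ g0_lab / n
    G = g0_all.T @ g0_all / n_all
    B_hat = A @ np.linalg.pinv(G)            # pseudo-inverse guards against collinearity

    # (41): subtract the estimated projection averaged over the labeled covariates
    return theta_hat - B_hat @ g0_lab.mean(axis=0)
```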

We now investigate the asymptotic behavior of the estimator (41). Note that Theorem 4.3 is not applicable, as the regression basis g^η~n0{\hat{g}^{0}_{\tilde{\eta}_{n}}} in (39) is random due to the involvement of η~n\tilde{\eta}_{n}. Consider an arbitrary class of measurable functions indexed by η\eta,

𝒢={gη(x):𝒳d,ηΩ}.{\mathcal{G}}=\left\{{g_{\eta}(x):{\mathcal{X}}\to{\mathbb{R}}^{d},\eta\in\Omega}\right\}. (42)

We make the following assumptions on 𝒢{\mathcal{G}}.

Assumption 5.1.

(a) gη(x)d2(X){g_{\eta^{*}}}(x)\in{\mathcal{L}}_{d}^{2}({{\mathbb{P}}^{*}_{X}}) and Var[gη(X)]\mathrm{Var}\left[{g_{\eta^{*}}}(X)\right] is non-singular; (b) Under Assumption 3.1, {gη(x):η𝒪}\left\{{g_{\eta}(x):\eta\in{\mathcal{O}}}\right\} is {{\mathbb{P}}^{*}}-Donsker, and for all {η1,η2}𝒪\left\{{\eta_{1},\eta_{2}}\right\}\subset{\mathcal{O}}, gη1(x)gη2(x)G(x)ρ(η1,η2)\left\lVert g_{\eta_{1}}(x)-g_{\eta_{2}}(x)\right\rVert\leqslant G(x)\rho(\eta_{1},\eta_{2}), where G:𝒳+G:{\mathcal{X}}\to{\mathbb{R}}^{+} is a square-integrable function.

Similar to Assumption 3.1, Assumption 5.1 requires that the class of functions {gη(x):η𝒪}\left\{{g_{\eta}(x):\eta\in{\mathcal{O}}}\right\} is a Donsker class; when η()\eta(\cdot) is finite-dimensional, the next proposition provides sufficient conditions for Assumption 5.1.

Proposition 5.1.

When η()\eta(\cdot) is a finite-dimensional functional, the following condition implies Assumption 5.1 (b): under the conditions of Proposition 3.4, for all {η1,η2}𝒪\left\{{\eta_{1},\eta_{2}}\right\}\subset{\mathcal{O}}, gη1(x)gη2(x)G(x)η1η2\left\lVert g_{\eta_{1}}(x)-g_{\eta_{2}}(x)\right\rVert\leqslant G(x)\left\lVert\eta_{1}-\eta_{2}\right\rVert, where G:𝒳+G:{\mathcal{X}}\to{\mathbb{R}}^{+} is a square integrable function.

The proof of Proposition 5.1 is similar to that of Proposition 3.4 and is hence omitted.

Define gη0(x)=gη(x)𝔼[gη(X)]{g^{0}_{\eta^{*}}}(x)={g_{\eta^{*}}}(x)-{\mathbb{E}}[{g_{\eta^{*}}}(X)] as the centered version of gη(x){g_{\eta^{*}}}(x), and

𝐁gη=𝔼[φη(Z)gη0(X)]𝔼[gη0(X)gη0(X)]1\mathbf{B}^{g_{\eta^{*}}}={\mathbb{E}}\left[{{\varphi}_{\eta^{*}}}(Z){g^{0}_{\eta^{*}}}(X)^{\top}\right]{\mathbb{E}}\left[{g^{0}_{\eta^{*}}}(X){g^{0}_{\eta^{*}}}(X)^{\top}\right]^{-1} (43)

as the population coefficients for the regression of φη(Z){{\varphi}_{\eta^{*}}}(Z) onto gη0(X){g^{0}_{\eta^{*}}}(X). The next proposition establishes the asymptotic behavior of θ^n,NPPI\hat{\theta}_{n,N}^{\text{PPI}}.

Proposition 5.2.

Suppose that θ^n=θ^(n)\hat{\theta}_{n}=\hat{\theta}({{\mathcal{L}}_{n}}) is a supervised estimator that satisfies Assumption 3.1, 𝒢{\mathcal{G}} defined as (42) is a class of measurable functions that satisfies Assumption 5.1, and η~n\tilde{\eta}_{n} is an estimator of η\eta^{*} as in Assumption 3.1 (c). Suppose further that Nn+Nγ(0,1]\frac{N}{n+N}\to\gamma\in(0,1]. Then the estimator θ^n,NPPI\hat{\theta}_{n,N}^{\text{PPI}} defined in (41) is a regular and asymptotically linear estimator of θ\theta^{*} in the sense of Definitions 4.1 and 4.2, and has influence function

[φη(z1)γ𝐁gηgη0(x1)]+γ(1γ)𝐁gηgη0(x2)\left[{{\varphi}_{\eta^{*}}}(z_{1})-\gamma\mathbf{B}^{g_{\eta^{*}}}{g^{0}_{\eta^{*}}}(x_{1})\right]+\sqrt{\gamma(1-\gamma)}\mathbf{B}^{g_{\eta^{*}}}{g^{0}_{\eta^{*}}}(x_{2}) (44)

in the OSS setting, where 𝐁gη\mathbf{B}^{g_{\eta^{*}}} is defined as in (43). Furthermore, the asymptotic variance of θ^n,NPPI\hat{\theta}_{n,N}^{\text{PPI}} takes the form

𝚺(γ)=Var[φη(Z)γ𝐁gηgη0(X)]+Var[γ(1γ)𝐁gηgη0(X)],\mathbf{\Sigma}(\gamma)=\mathrm{Var}\left[{{\varphi}_{\eta^{*}}}(Z)-\gamma\mathbf{B}^{g_{\eta^{*}}}{g^{0}_{\eta^{*}}}(X)\right]+\mathrm{Var}\left[\sqrt{\gamma(1-\gamma)}\mathbf{B}^{g_{\eta^{*}}}{g^{0}_{\eta^{*}}}(X)\right],

and satisfies

Var[φη(Z)𝐁gηgη0(X)]𝚺(γ)Var[φη(Z)],γ[0,1].\displaystyle\mathrm{Var}\left[{{\varphi}_{\eta^{*}}}(Z)-\mathbf{B}^{g_{\eta^{*}}}{g^{0}_{\eta^{*}}}(X)\right]\preceq\mathbf{\Sigma}(\gamma)\preceq\mathrm{Var}\left[{{\varphi}_{\eta^{*}}}(Z)\right],\quad\forall\gamma\in[0,1]. (45)

Proposition 5.2 can be viewed as an extension of Theorem 4.3, where in Theorem 4.3 the function class 𝒢{\mathcal{G}} is a singleton class {g(x)}\left\{{g(x)}\right\} that does not depend on η\eta^{*}. Our proposed PPI estimator (41) flexibly incorporates multiple black-box machine learning models, and enjoys several advantages over existing PPI estimators:

  1. We show in Supplement C that (41) is optimal within a class of PPI estimators including Angelopoulos et al. [2023a, b], Miao et al. [2023], Gan and Liang [2023].

  2. Unlike the proposals of Angelopoulos et al. [2023a, b], Miao et al. [2023], our estimator is safe: that is, it is at least as efficient as the initial supervised estimator, regardless of the quality of the machine learning models and the proportion of unlabeled data.

  3. While existing PPI estimators are only applicable to M- and Z-estimators, our proposal is much more general: it is applicable to arbitrary inferential problems and requires only a regular and asymptotically linear initial supervised estimator.

We provide a detailed discussion of the efficiency of existing PPI estimators in Supplement C.

6 Connection with missing data

The missing data framework provides an alternative approach for modeling the semi-supervised setting [Robins and Rotnitzky, 1995, Chen et al., 2008, Zhou et al., 2008, Li and Luedtke, 2023, Graham et al., 2024]. Here we relate the proposed framework to the classical theory of missing data.

To establish a formal relationship between the two paradigms, we consider a missing completely at random (MCAR) model under which the semiparametric efficiency bound coincides with that derived in Theorem 4.1 in the OSS setting. Let {(Zi,Wi)}i=1n+N\left\{{(Z_{i},W_{i})}\right\}_{i=1}^{n+N} be i.i.d. data, where ZZ\sim{{\mathbb{P}}^{*}}, WWW\sim{{\mathbb{P}}_{W}^{*}} is a binary missingness indicator such that the response YiY_{i} is observed if and only if Wi=1W_{i}=1, and W{{\mathbb{P}}_{W}^{*}} is a Bernoulli distribution with known probability 1γ1-\gamma where γ=limnNn+N(0,1)\gamma=\lim_{n\to\infty}\frac{N}{n+N}\in(0,1). Assume that 𝒫{{\mathbb{P}}^{*}}\in{{\mathcal{P}}}, where 𝒫{{\mathcal{P}}} is defined in (1) as in previous sections. The underlying model of (Z,W)(Z,W) is thus

𝒬={×W:𝒫}.{\mathcal{Q}}=\left\{{{\mathbb{P}}\times{{\mathbb{P}}_{W}^{*}}:{\mathbb{P}}\in{{\mathcal{P}}}}\right\}. (46)

The next proposition derives the efficiency bound relative to (46).

Proposition 6.1.

Let 𝒫{{\mathcal{P}}} be defined as in (1) and let Nn+Nγ(0,1)\frac{N}{n+N}\to\gamma\in(0,1). Suppose that the efficient influence function of θ(){\theta(\cdot)} at {{\mathbb{P}}^{*}} relative to 𝒫{{\mathcal{P}}} is φη{{\varphi}^{*}_{\eta^{*}}}, and recall that ϕη(x)=𝔼[φη(Z)X=x]{\phi^{*}_{\eta^{*}}}(x)={\mathbb{E}}\left[{{\varphi}^{*}_{\eta^{*}}}(Z)\mid X=x\right] is the conditional efficient influence function. Consider i.i.d. data {(Zi,Wi)}i=1n+N\left\{{(Z_{i},W_{i})}\right\}_{i=1}^{n+N} generated from ×W{{\mathbb{P}}^{*}}\times{{\mathbb{P}}_{W}^{*}} with model 𝒬{\mathcal{Q}} (46). Then the efficient influence function of θ(){\theta(\cdot)} at ×W{{\mathbb{P}}^{*}}\times{{\mathbb{P}}_{W}^{*}} relative to 𝒬{\mathcal{Q}} is

w1γ[φη(z)ϕη(x)]+1γϕη(x),\frac{w}{\sqrt{1-\gamma}}[{{\varphi}^{*}_{\eta^{*}}}(z)-{\phi^{*}_{\eta^{*}}}(x)]+\sqrt{1-\gamma}{\phi^{*}_{\eta^{*}}}(x),

and the corresponding semiparametric efficiency lower bound is

Var{W1γ[φη(Z)ϕη(X)]+1γϕη(X)}\displaystyle\mathrm{Var}\left\{{\frac{W}{\sqrt{1-\gamma}}[{{\varphi}^{*}_{\eta^{*}}}(Z)-{\phi^{*}_{\eta^{*}}}(X)]+\sqrt{1-\gamma}{\phi^{*}_{\eta^{*}}}(X)}\right\}
=Var[φη(Z)ϕη(X)]+(1γ)Var[ϕη(X)].\displaystyle=\mathrm{Var}\left[{{\varphi}^{*}_{\eta^{*}}}(Z)-{\phi^{*}_{\eta^{*}}}(X)\right]+(1-\gamma)\mathrm{Var}[{\phi^{*}_{\eta^{*}}}(X)].
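
A short verification of the last equality: since WW is independent of ZZ with 𝔼[W]=𝔼[W2]=1γ{\mathbb{E}}[W]={\mathbb{E}}[W^{2}]=1-\gamma, and since 𝔼{[φη(Z)ϕη(X)]ϕη(X)}=0{\mathbb{E}}\left\{{[{{\varphi}^{*}_{\eta^{*}}}(Z)-{\phi^{*}_{\eta^{*}}}(X)]{\phi^{*}_{\eta^{*}}}(X)^{\top}}\right\}=0 because ϕη(X){\phi^{*}_{\eta^{*}}}(X) is the conditional mean of φη(Z){{\varphi}^{*}_{\eta^{*}}}(Z) given XX, the cross-covariance between the two terms vanishes. The first term therefore contributes

𝔼[W2]1γVar[φη(Z)ϕη(X)]=Var[φη(Z)ϕη(X)],\frac{{\mathbb{E}}[W^{2}]}{1-\gamma}\mathrm{Var}\left[{{\varphi}^{*}_{\eta^{*}}}(Z)-{\phi^{*}_{\eta^{*}}}(X)\right]=\mathrm{Var}\left[{{\varphi}^{*}_{\eta^{*}}}(Z)-{\phi^{*}_{\eta^{*}}}(X)\right],

and the second term contributes (1γ)Var[ϕη(X)](1-\gamma)\mathrm{Var}[{\phi^{*}_{\eta^{*}}}(X)].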

A comparison of the efficiency bound in Proposition 6.1 with (25) reveals that, for any fixed γ(0,1)\gamma\in(0,1), the efficiency bound under the MCAR model (46) is the same as under the OSS setting. Thus, the amount of information useful for inference under the two paradigms is the same. However, in the MCAR model the data {(Zi,Wi)}i=1n+N\left\{{(Z_{i},W_{i})}\right\}_{i=1}^{n+N} are fully i.i.d.; by contrast, in the OSS setting the data are not fully i.i.d., and instead consist of two independent i.i.d. parts from {{\mathbb{P}}^{*}} and X{{\mathbb{P}}^{*}_{X}}, respectively. In fact, the OSS setting corresponds to the MCAR model conditional on the event {i=1n+NWi=n}\left\{{\sum_{i=1}^{n+N}W_{i}=n}\right\}. As the distribution of WW is completely known under the MCAR model, conditioning does not alter the information available, and so it is not surprising that the efficiency bounds are the same.

The semi-supervised setting allows for γ=1\gamma=1, i.e., for the possibility that the sample size of the unlabeled data far exceeds that of the labeled data, NnN\gg n. As discussed in Section 4.1, this case corresponds to the ISS setting, with efficiency lower bound given by Theorem 3.1. Our proposed safe and efficient OSS estimators also allow for γ=1\gamma=1, as shown in Theorems 4.3 and 4.4. By contrast, it is difficult to theoretically analyze the efficiency lower bound when γ=1\gamma=1 using the missing data framework, and estimators developed for missing data require the probability of missingness to be strictly smaller than 1.

7 Applications

In this section, we apply the proposed framework to a variety of inferential problems, including M-estimation, U-statistics, and average treatment effect estimation. Constructing the safe OSS estimator θ^n,Nsafe\hat{\theta}_{n,N}^{\text{safe}} (28) or the PPI estimator θ^n,NPPI\hat{\theta}_{n,N}^{\text{PPI}} (41) requires finding an initial supervised estimator θ^n\hat{\theta}_{n} and an estimator η~n\tilde{\eta}_{n} that satisfy Assumption 3.1. To construct the efficient OSS estimator (34), the initial supervised estimator also needs to satisfy Assumption 3.2, which cannot be verified in practice.

7.1 M-estimation

First, we apply the proposed framework to M-estimation, a setting also considered in Chapter 2 of Chakrabortty [2016] and Song et al. [2023]. For a function mθ(z):Θ×𝒵m_{\theta}(z):\Theta\times{\mathcal{Z}}\to{\mathbb{R}}, we define the target parameter as the maximizer of

θ=argmaxθ𝔼[mθ(Z)].\theta^{*}=\arg\max_{\theta}{\mathbb{E}}[m_{\theta}(Z)].

Given i.i.d. labeled data n{{\mathcal{L}}_{n}}, we use the M-estimator

θ^n=argmaxθ{1ni=1nmθ(Zi)}\hat{\theta}_{n}=\arg\max_{\theta}\left\{{\frac{1}{n}\sum_{i=1}^{n}m_{\theta}(Z_{i})}\right\} (47)

as the initial supervised estimator of θ\theta^{*}. Under regularity conditions such as those stated in Theorems 5.7 and 5.21 of Van der Vaart [2000], the M-estimator (47) is a regular and asymptotically linear estimator of θ\theta^{*} with influence function

φη(z)=𝐕θ1mθ(z),{{\varphi}_{\eta^{*}}}(z)=-\mathbf{V}_{\theta^{*}}^{-1}\nabla m_{\theta^{*}}(z), (48)

where 𝐕θ()=2𝔼[mθ(Z)]θθ\mathbf{V}_{\theta}({\mathbb{P}})=\frac{\partial^{2}{\mathbb{E}}_{\mathbb{P}}[m_{\theta}(Z)]}{\partial\theta\partial\theta^{\top}} and 𝐕θ=𝐕θ()(){\mathbf{V}_{\theta^{*}}}=\mathbf{V}_{\theta({{\mathbb{P}}^{*}})}({{\mathbb{P}}^{*}}). The functional η()\eta({\mathbb{P}}) that appears in (48) is

η()=(𝐕θ()1(),θ()),\eta({\mathbb{P}})=\left(\mathbf{V}^{-1}_{\theta({\mathbb{P}})}({\mathbb{P}}),\theta({\mathbb{P}})\right),

which is finite-dimensional. Therefore, Assumption 3.1 can be verified via the sufficient conditions stated in Proposition 3.4, which we now do for two canonical examples of M-estimation problems.

Example 7.1 (Mean).

Define mθ(z)=12[h(y)θ]2m_{\theta}(z)=-\frac{1}{2}[h(y)-\theta]^{2} for some function h(y)p2(Y)h(y)\in{\mathcal{L}}^{2}_{p}({\mathbb{P}}^{*}_{Y}), so that the maximizer of 𝔼[mθ(Z)]{\mathbb{E}}[m_{\theta}(Z)] is the expectation θ=𝔼[h(Y)]\theta^{*}={\mathbb{E}}[h(Y)]. The M-estimator (47) in this case is the sample mean

θ^n=1ni=1nh(Yi),\hat{\theta}_{n}=\frac{1}{n}\sum_{i=1}^{n}h(Y_{i}),

which is regular and asymptotically linear with influence function φη(z)=h(y)θ{{\varphi}_{\eta^{*}}}(z)=h(y)-\theta^{*}. The functional η()\eta({\mathbb{P}}) in this example is η()=θ()\eta({\mathbb{P}})=\theta({\mathbb{P}}), as 𝐕θ()𝐈p\mathbf{V}_{\theta}({\mathbb{P}})\equiv-\mathbf{I}_{p} and does not depend on the underlying distribution. A consistent estimator η~n\tilde{\eta}_{n} of η=θ\eta^{*}=\theta^{*} is the sample mean η~n=θ^n\tilde{\eta}_{n}=\hat{\theta}_{n}. Furthermore, φη(z){\varphi}_{\eta}(z) is 11-Lipschitz in η\eta, and the conditions of Proposition 3.4 are satisfied.

Example 7.2 (Generalized linear models).

Define mθ(z)=yθxb(xθ)m_{\theta}(z)=y\theta^{\top}x-b\left(x^{\top}\theta\right), where b:b:{\mathbb{R}}\to{\mathbb{R}} is a convex and infinitely-differentiable function. Let b(1)()b^{(1)}(\cdot) and b(2)()b^{(2)}(\cdot) denote its first and second-order derivatives, respectively. Here mθ(z)m_{\theta}(z) corresponds to the log-likelihood of a canonical exponential family distribution with natural parameter θx\theta^{\top}x and log partition function b()b(\cdot), i.e., a generalized linear model (GLM). However, we do not assume that the underlying distribution belongs to this model. The target parameter θ\theta^{*} maximizes 𝔼[mθ(Z)]{\mathbb{E}}[m_{\theta}(Z)], and can be viewed as the best Kullback–Leibler approximation of the underlying distribution by the GLM model. The M-estimator (47) in this case is the GLM estimator

θ^n=argmax𝜃{1ni=1n[YiθXib(Xiθ)]},\hat{\theta}_{n}=\underset{\theta}{\arg\max}\left\{{\frac{1}{n}\sum_{i=1}^{n}\left[Y_{i}\theta^{\top}X_{i}-b\left(X_{i}^{\top}\theta\right)\right]}\right\},

which is regular and asymptotically linear with influence function

φη(z)=𝔼[b(2)(Xθ)XX]1[yxb(1)(xθ)x].{{\varphi}_{\eta^{*}}}(z)={\mathbb{E}}\left[b^{(2)}(X^{\top}\theta^{*})XX^{\top}\right]^{-1}\left[yx-b^{(1)}(x^{\top}\theta^{*})x\right].

A consistent estimator of the functional η()=(𝔼[b(2)(Xθ())XX]1,θ())\eta({\mathbb{P}})=\left({\mathbb{E}}_{\mathbb{P}}\left[b^{(2)}(X^{\top}\theta({\mathbb{P}}))XX^{\top}\right]^{-1},\theta({\mathbb{P}})\right) is

η~n=([1ni=1nb(2)(Xiθ^n)XiXi]1,θ^n).\tilde{\eta}_{n}=\left(\left[\frac{1}{n}\sum_{i=1}^{n}b^{(2)}(X_{i}^{\top}\hat{\theta}_{n})X_{i}X_{i}^{\top}\right]^{-1},\hat{\theta}_{n}\right).

Under mild conditions, it can be shown that θ^n\hat{\theta}_{n} satisfies the conditions of Proposition 3.4. Claims made in this example are proved in Supplement D.
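
As a concrete illustration of the quantities in this example, the sketch below evaluates η~n\tilde{\eta}_{n} and the estimated influence function φη~n(z){{\varphi}_{\tilde{\eta}_{n}}}(z), assuming that a fitted coefficient vector theta_hat and the derivatives b1 and b2 of the log partition function are supplied by the user (for the Poisson GLM of Section 8.2, b1 = b2 = exp). It is an illustrative sketch, not a complete implementation.

```python
import numpy as np

def glm_influence(theta_hat, X_lab, b1, b2):
    """Sketch of the estimated nuisance and influence function for the GLM example.

    theta_hat : fitted GLM coefficients, shape (p,)
    b1, b2    : callables giving the first and second derivatives of the
                log-partition function b (e.g., b1 = b2 = np.exp for Poisson)
    Returns the nuisance estimate (V_inv, theta_hat) and a callable phi(X, Y)
    that evaluates the influence function at eta_tilde.
    """
    lin_pred = X_lab @ theta_hat
    # Hessian estimate: (1/n) sum b''(x_i' theta) x_i x_i'
    V_hat = (X_lab * b2(lin_pred)[:, None]).T @ X_lab / X_lab.shape[0]
    V_inv = np.linalg.inv(V_hat)

    def phi(X, Y):
        # residual y - b'(x' theta), multiplied by x and premultiplied by V_inv
        resid = np.asarray(Y) - b1(X @ theta_hat)
        return (X * resid[:, None]) @ V_inv.T

    return (V_inv, theta_hat), phi
```

The returned callable phi can then be passed directly to a projection-based correction of the form (41).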

Application to Z-estimation or estimating equations is similar and thus omitted.

7.2 U-statistics

Next, we apply the proposed framework to U-statistics. Let h(y1,,yR):𝒴Rh(y_{1},\ldots,y_{R}):{\mathcal{Y}}^{R}\to{\mathbb{R}} be a symmetric kernel function. The target parameter is defined as

θ=𝔼[h(Y1,,YR)],\theta^{*}={\mathbb{E}}[h(Y_{1},\ldots,Y_{R})],

where the expectation is taken over RR i.i.d. random variables YiiidYY_{i}\stackrel{{\scriptstyle{iid}}}{{\sim}}{\mathbb{P}}^{*}_{Y}. To estimate θ\theta^{*}, the supervised estimator is

θ^n=1(nR){i1<<iR}[n]h(Yi1,,YiR).\hat{\theta}_{n}=\frac{1}{\binom{n}{R}}\sum_{\left\{{i_{1}<\ldots<i_{R}}\right\}\subset[n]}h(Y_{i_{1}},\ldots,Y_{i_{R}}). (49)

We assume that hh is non-degenerate:

0<Var[h1(Y;Y)]<,0<\mathrm{Var}[h_{1}(Y;{\mathbb{P}}_{Y}^{*})]<\infty, (50)

where h1(y;Y)h_{1}(y;{\mathbb{P}}_{Y}) is the conditional expectation 𝔼[h(Y1,,YR)Y1=y]{\mathbb{E}}_{\mathbb{P}}[h(Y_{1},\ldots,Y_{R})\mid Y_{1}=y] with the first argument fixed at yy, i.e.,

h1(y;Y)=h(y,y2,,yR)r=2RdY(yr).h_{1}(y;{\mathbb{P}}_{Y})=\int h(y,y_{2},\ldots,y_{R})\prod_{r=2}^{R}d{\mathbb{P}}_{Y}(y_{r}). (51)

Under this condition, θ^n\hat{\theta}_{n} is a regular and asymptotically linear estimator with influence function

φη(z)=R[h1(y;Y)θ],{{\varphi}_{\eta^{*}}}(z)=R[h_{1}(y;{\mathbb{P}}^{*}_{Y})-\theta^{*}], (52)

where RR is again the order of the kernel [for a proof, see Theorem 12.3 of Van der Vaart, 2000]. The functional η()\eta({\mathbb{P}}) in (52) is η()=(h1(y;Y),θ(Y))\eta({\mathbb{P}})=\left(h_{1}(y;{\mathbb{P}}_{Y}),\theta({\mathbb{P}}_{Y})\right). This is an infinite-dimensional functional that takes values in 𝒴×{\mathcal{Y}}_{\infty}\times{\mathbb{R}}, where 𝒴{\mathcal{Y}}_{\infty} is the space of uniformly bounded functions h:𝒴h:{\mathcal{Y}}\to{\mathbb{R}} equipped with the uniform metric h1h2=sup𝒴|h1(y)h2(y)|\left\lVert h_{1}-h_{2}\right\rVert_{\infty}=\sup_{{\mathcal{Y}}}|h_{1}(y)-h_{2}(y)|. Denoting η1=(h1,θ1)\eta_{1}=(h_{1},\theta_{1}) and η2=(h2,θ2)\eta_{2}=(h_{2},\theta_{2}), it follows that (52) is RR-Lipschitz with respect to the metric ρ(η1,η2)=h1h2+θ1θ2\rho(\eta_{1},\eta_{2})=\left\lVert h_{1}-h_{2}\right\rVert_{\infty}+\left\lVert\theta_{1}-\theta_{2}\right\rVert.

We have already established that U-statistics (49) are regular and asymptotically linear with RR-Lipschitz influence functions. Therefore, it remains to validate the remaining parts of (b) and (c) of Assumption 3.1, which we now do for two canonical examples of U-statistics.

Example 7.3 (Variance).

Define h(y1,y2)=12(y1y2)2h(y_{1},y_{2})=\frac{1}{2}(y_{1}-y_{2})^{2}. The target of inference is then the variance θ=Var(Y)\theta^{*}=\mathrm{Var}(Y). The U-statistic in this case is the sample variance,

θ^n=1n(n1)i<j(YiYj)2,\hat{\theta}_{n}=\frac{1}{n(n-1)}\sum_{i<j}(Y_{i}-Y_{j})^{2},

which is regular and asymptotically linear with influence function φη(z)=(y𝔼[Y])2θ{{\varphi}_{\eta^{*}}}(z)=(y-{\mathbb{E}}[Y])^{2}-\theta^{*} when h(y1,y2)h(y_{1},y_{2}) is non-degenerate. In this simple example, the functional η()=(𝔼[Y],θ(Y))\eta({\mathbb{P}})=\left({\mathbb{E}}[Y],\theta({\mathbb{P}}_{Y})\right) is finite-dimensional, and a consistent estimator of η\eta^{*} is η~n=(1ni=1nYi,θ^n)\tilde{\eta}_{n}=\left(\frac{1}{n}\sum_{i=1}^{n}Y_{i},\hat{\theta}_{n}\right). Therefore, invoking Proposition 3.4, Assumption 3.1 is satisfied.

Example 7.4 (Kendall’s τ\tau).

Consider Y=(U,V)2Y=(U,V)\in{\mathbb{R}}^{2} and define h(y1,y2)=I{(u1u2)(v1v2)>0}h(y_{1},y_{2})=I\left\{{(u_{1}-u_{2})(v_{1}-v_{2})>0}\right\}. The target of inference is

θ={(U1U2)(V1V2)>0},\theta^{*}={\mathbb{P}}\left\{{(U_{1}-U_{2})(V_{1}-V_{2})>0}\right\},

which measures the dependence between UU and VV. The U-statistic is Kendall’s τ\tau,

θ^n=2n(n1)i<jI{(UiUj)(ViVj)>0},\hat{\theta}_{n}=\frac{2}{n(n-1)}\sum_{i<j}I\left\{{(U_{i}-U_{j})(V_{i}-V_{j})>0}\right\},

which is the proportion of pairs {(Ui,Vi),(Uj,Vj)}\left\{{(U_{i},V_{i}),(U_{j},V_{j})}\right\} that are concordant. When h(y1,y2)h(y_{1},y_{2}) is non-degenerate, θ^n\hat{\theta}_{n} is regular and asymptotically linear with influence function

φη(z)=2Y{(Uu)(Vv)>0}2θ,{{\varphi}_{\eta^{*}}}(z)=2{\mathbb{P}}_{Y}^{*}\left\{{(U-u)(V-v)>0}\right\}-2\theta^{*},

where η()=(Y{(Uu)(Vv)>0},θ(Y))\eta({\mathbb{P}})=\left({\mathbb{P}}_{Y}\left\{{(U-u)(V-v)>0}\right\},\theta({\mathbb{P}}_{Y})\right). A natural estimator of η\eta^{*} in this case is

η~n=(1ni=1nI{(Uiu)(Viv)>0},θ^n).\tilde{\eta}_{n}=\left(\frac{1}{n}\sum_{i=1}^{n}I\left\{{(U_{i}-u)(V_{i}-v)>0}\right\},\hat{\theta}_{n}\right).

In Supplement D, we validate the conditions in Assumption 3.1 under additional assumptions on Y{(Uu)(Vv)>0}{\mathbb{P}}_{Y}^{*}\left\{{(U-u)(V-v)>0}\right\}.
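
For illustration, the following sketch computes the plug-in nuisance estimate η~n\tilde{\eta}_{n} displayed above, the U-statistic θ^n\hat{\theta}_{n}, and the corresponding plug-in estimate of the influence function. It is a minimal sketch with placeholder names, not the code used for the experiments in the appendix.

```python
import numpy as np

def kendall_nuisance(U, V):
    """Sketch of the plug-in nuisance estimate for Example 7.4 (Kendall's tau).

    U, V : 1-d arrays with the two coordinates of the labeled responses Y_i = (U_i, V_i).
    Returns h1_hat (the empirical concordance function), theta_hat (the U-statistic),
    and influence (the plug-in estimate of the influence function).
    """
    n = len(U)

    def h1_hat(u, v):
        # empirical analogue of P{(U - u)(V - v) > 0}
        return np.mean((U - u) * (V - v) > 0)

    # theta_hat: proportion of concordant pairs among all (n choose 2) pairs
    du = U[:, None] - U[None, :]
    dv = V[:, None] - V[None, :]
    concordant = (du * dv > 0)
    theta_hat = concordant[np.triu_indices(n, k=1)].mean()

    def influence(u, v):
        # plug-in version of 2 * P{(U - u)(V - v) > 0} - 2 * theta
        return 2.0 * h1_hat(u, v) - 2.0 * theta_hat

    return h1_hat, theta_hat, influence
```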

7.3 Average treatment effect

We consider application of the proposed framework to the estimation of the average treatment effect (ATE). Suppose we have Z=(U,A,Y)Z=(U,A,Y) for confounders U𝒰U\in{\mathcal{U}}, binary treatment A{0,1}A\in\left\{{0,1}\right\}, and outcome Y𝒴Y\in{\mathcal{Y}}. Let Y(0)Y^{(0)} and Y(1)Y^{(1)} represent the counterfactual outcomes under control and treatment, respectively. Under appropriate assumptions, the ATE, defined as 𝔼[Y(1)]𝔼[Y(0)]{\mathbb{E}}\left[Y^{(1)}\right]-{\mathbb{E}}\left[Y^{(0)}\right], can be expressed as 𝔼{μ(1)(U;)μ(0)(U;)}{\mathbb{E}}\left\{{\mu^{(1)}(U;{{\mathbb{P}}^{*}})-\mu^{(0)}(U;{{\mathbb{P}}^{*}})}\right\}, where μ(j)(u;)=𝔼[YU=u,A=j]\mu^{(j)}(u;{\mathbb{P}})={\mathbb{E}}_{\mathbb{P}}[Y\mid U=u,A=j] for j{0,1}j\in\left\{{0,1}\right\}. For simplicity, we consider the target of inference

θ=𝔼[Y(1)]=𝔼{μ(1)(U;)}.\theta^{*}={\mathbb{E}}\left[Y^{(1)}\right]={\mathbb{E}}\left\{{\mu^{(1)}(U;{{\mathbb{P}}^{*}})}\right\}. (53)

Inference for the ATE, 𝔼[Y(1)]𝔼[Y(0)]{\mathbb{E}}\left[Y^{(1)}\right]-{\mathbb{E}}\left[Y^{(0)}\right], follows similarly.

We define μ(u;)=μ(1)(u;)\mu(u;{\mathbb{P}})=\mu^{(1)}(u;{\mathbb{P}}) and π(u;)={A=1U=u}\pi(u;{\mathbb{P}})={\mathbb{P}}\left\{{A=1\mid U=u}\right\}, and consider the model

𝒫={pU(u)π(u;)a[1π(u;)]1apYA,U(yu,a):u,pU(u)>0;ϵ(0,0.5),π(u;)[ϵ,1ϵ]}.{\mathcal{P}}=\left\{{p_{U}(u)\pi(u;{\mathbb{P}})^{a}[1-\pi(u;{\mathbb{P}})]^{1-a}p_{Y\mid A,U}(y\mid u,a):\forall u,p_{U}(u)>0;\exists\epsilon\in(0,0.5),\pi(u;{\mathbb{P}})\in[\epsilon,1-\epsilon]}\right\}.

It can be shown [Robins et al., 1994, Hahn, 1998] that the efficient influence function relative to 𝒫{\mathcal{P}} is

φη(z)=aπ(u;)[yμ(u;)]+μ(u;)θ,\displaystyle{{\varphi}^{*}_{\eta^{*}}}(z)=\frac{a}{\pi(u;{{\mathbb{P}}^{*}})}\left[y-\mu(u;{{\mathbb{P}}^{*}})\right]+\mu(u;{{\mathbb{P}}^{*}})-\theta^{*}, (54)

where η()={π(u;),μ(u;)}\eta({\mathbb{P}})=\left\{{\pi(u;{\mathbb{P}}),\mu(u;{\mathbb{P}})}\right\}. In this case, η()\eta({\mathbb{P}}) is infinite-dimensional and takes values in the product space 𝒰×𝒰{\mathcal{U}}_{\infty}\times{\mathcal{U}}_{\infty}, where 𝒰{\mathcal{U}}_{\infty} is the space of uniformly bounded functions h:𝒰h:{\mathcal{U}}\to{\mathbb{R}} equipped with the uniform metric h1h2=sup𝒰|h1(u)h2(u)|\left\lVert h_{1}-h_{2}\right\rVert_{\infty}=\sup_{{\mathcal{U}}}|h_{1}(u)-h_{2}(u)|.

We consider three examples.

Example 7.5 (Additional data on confounders).

Suppose that additional observations of the confounders are available, i.e. 𝒰N={Ui}i=n+1n+N{{\mathcal{U}}_{N}}=\left\{{U_{i}}\right\}_{i=n+1}^{n+N}. The conditional efficient influence function ϕη(u)=𝔼[φη(Z)U=u]{\phi^{*}_{\eta^{*}}}(u)={\mathbb{E}}[{{\varphi}^{*}_{\eta^{*}}}(Z)\mid U=u] in this case is

ϕη(u)=μ(u;)θ.{\phi^{*}_{\eta^{*}}}(u)=\mu(u;{{\mathbb{P}}^{*}})-\theta^{*}.

By Theorem 4.1, the efficient influence function under the OSS is

a1π(u1;)[y1μ(u1;)]+(1γ)ϕη(u1)+γ(1γ)ϕη(u2),\displaystyle\frac{a_{1}}{\pi(u_{1};{{\mathbb{P}}^{*}})}\left[y_{1}-\mu(u_{1};{{\mathbb{P}}^{*}})\right]+(1-\gamma){\phi^{*}_{\eta^{*}}}(u_{1})+\sqrt{\gamma(1-\gamma)}{\phi^{*}_{\eta^{*}}}(u_{2}), (55)

where — in keeping with Remark 4.1 — we have used (z1,u2)=((u1,a1,y1),u2)(z_{1},u_{2})=((u_{1},a_{1},y_{1}),u_{2}) as arguments to the influence function. We note that if UU has no confounding effect, i.e. μ(u;)=θ\mu(u;{{\mathbb{P}}^{*}})=\theta^{*}, then ϕη(u)=0{\phi^{*}_{\eta^{*}}}(u)=0 and the efficiency bound implied by (55) is the same as in the supervised setting.
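
The example above only states the efficiency bound, but it also suggests a simple estimator: the usual AIPW estimator with the outcome-regression term averaged over all n+Nn+N confounder observations, whose influence function heuristically matches (55) when the nuisance estimates π^\hat{\pi} and μ^\hat{\mu} are consistent (for instance, cross-fitted). The sketch below is a hypothetical illustration of this construction; pi_hat and mu_hat are assumed to be user-supplied fitted functions.

```python
import numpy as np

def ss_aipw_treated_mean(U_lab, A_lab, Y_lab, U_unlab, pi_hat, mu_hat):
    """Sketch of a semi-supervised AIPW estimator of theta* = E[Y^(1)].

    U_lab, U_unlab : confounder arrays of shape (n, d) and (N, d)
    pi_hat, mu_hat : callables returning estimates of P(A=1|U=u) and E[Y|U=u, A=1],
                     assumed fitted on independent data or via cross-fitting
    """
    U_all = np.vstack([U_lab, U_unlab])
    # augmentation term uses only the labeled sample
    aug = A_lab / pi_hat(U_lab) * (Y_lab - mu_hat(U_lab))
    # outcome-regression term averages over all n + N confounder observations
    return aug.mean() + mu_hat(U_all).mean()
```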

Example 7.6 (Additional data on confounders and treatment).

Suppose that additional observations of both the confounders and the treatment indicators are available, i.e. 𝒰N={(Ui,Ai)}i=n+1n+N{{\mathcal{U}}_{N}}=\left\{{(U_{i},A_{i})}\right\}_{i=n+1}^{n+N}. The conditional efficient influence function ϕη(u,a)=𝔼[φη(Z)U=u,A=a]{\phi^{*}_{\eta^{*}}}(u,a)={\mathbb{E}}[{{\varphi}^{*}_{\eta^{*}}}(Z)\mid U=u,A=a] in this case is

ϕη(u,a)=aπ(u;){𝔼[YU=u,A=a]μ(u;)}+μ(u;)θ.{\phi^{*}_{\eta^{*}}}(u,a)=\frac{a}{\pi(u;{{\mathbb{P}}^{*}})}\left\{{{\mathbb{E}}[Y\mid U=u,A=a]-\mu(u;{{\mathbb{P}}^{*}})}\right\}+\mu(u;{{\mathbb{P}}^{*}})-\theta^{*}.

By Theorem 4.1, the efficient influence function under the OSS is

a1π(u1){y1𝔼[YU=u1,A=a1]}+(1γ)ϕη(u1,a1)\displaystyle\frac{a_{1}}{\pi(u_{1})}\left\{{y_{1}-{\mathbb{E}}[Y\mid U=u_{1},A=a_{1}]}\right\}+(1-\gamma){\phi^{*}_{\eta^{*}}}(u_{1},a_{1}) (56)
+γ(1γ)ϕη(u2,a2),\displaystyle\quad+\sqrt{\gamma(1-\gamma)}{\phi^{*}_{\eta^{*}}}(u_{2},a_{2}),

where again the subscripts in (56) are in keeping with Remark 4.1. If UU has no confounding effect and AA has no treatment effect, i.e. if 𝔼[YU=u,A=a]=μ(u;)=θ{\mathbb{E}}[Y\mid U=u,A=a]=\mu(u;{{\mathbb{P}}^{*}})=\theta^{*}, then ϕη(u,a)=0{\phi^{*}_{\eta^{*}}}(u,a)=0 and the efficiency bound implied by (56) is the same as in the supervised setting.

Example 7.7 (Additional data on confounders and treatment, and availability of surrogates).

When measuring the primary outcome YY is time-consuming or expensive, we may use a surrogate marker SS\in{\mathbb{R}} as a replacement for YY, to facilitate more timely decisions on the treatment effect [Wittes et al., 1989]. Suppose that n={(Ui,Ai,Si,Yi)}i=1n{{\mathcal{L}}_{n}}=\left\{{(U_{i},A_{i},S_{i},Y_{i})}\right\}_{i=1}^{n}, and that additional observations of the confounders, the treatment indicators, and the surrogate markers are available: that is, 𝒰N={(Ui,Ai,Si)}i=n+1n+N{{\mathcal{U}}_{N}}=\left\{{(U_{i},A_{i},S_{i})}\right\}_{i=n+1}^{n+N}. The conditional efficient influence function ϕη(u,a,s)=𝔼[φη(Z)U=u,A=a,S=s]{\phi^{*}_{\eta^{*}}}(u,a,s)={\mathbb{E}}[{{\varphi}^{*}_{\eta^{*}}}(Z)\mid U=u,A=a,S=s] in this case is

ϕη(u,a,s)=aπ(u;){𝔼[YU=u,A=a,S=s]μ(u;)}+μ(u;)θ.\displaystyle{\phi^{*}_{\eta^{*}}}(u,a,s)=\frac{a}{\pi(u;{{\mathbb{P}}^{*}})}\left\{{{\mathbb{E}}[Y\mid U=u,A=a,S=s]-\mu(u;{{\mathbb{P}}^{*}})}\right\}+\mu(u;{{\mathbb{P}}^{*}})-\theta^{*}.

By Theorem 4.1, the efficient influence function under the OSS is

a1π(u1){y1𝔼[YU=u1,A=a1,S=s1]}+(1γ)ϕη(u1,a1,s1)\displaystyle\frac{a_{1}}{\pi(u_{1})}\left\{{y_{1}-{\mathbb{E}}[Y\mid U=u_{1},A=a_{1},S=s_{1}]}\right\}+(1-\gamma){\phi^{*}_{\eta^{*}}}(u_{1},a_{1},s_{1}) (57)
+γ(1γ)ϕη(u2,a2,s2).\displaystyle+\sqrt{\gamma(1-\gamma)}{\phi^{*}_{\eta^{*}}}(u_{2},a_{2},s_{2}).

Here Z=(U,A,S,Y)Z=(U,A,S,Y), and the subscripts in (57) are again in keeping with Remark 4.1. If 𝔼[YU=u,A=a,S=s]=μ(u;)=θ{\mathbb{E}}[Y\mid U=u,A=a,S=s]=\mu(u;{{\mathbb{P}}^{*}})=\theta^{*}, then ϕη(u,a,s)=0{\phi^{*}_{\eta^{*}}}(u,a,s)=0 and the efficiency bound implied by (57) is the same as in the supervised setting.

Remark 7.1.

We show in Supplement D that the semiparametric efficiency bound (55) is greater than (56), which in turn is greater than (57). In other words, efficiency improves when the unlabeled data 𝒰N{{\mathcal{U}}_{N}} contains more information.

8 Numerical experiments

In this section, we illustrate the proposed framework numerically in the context of mean estimation and generalized linear models. Numerical experiments for variance estimation and Kendall’s τ\tau can be found in Sections F.2 and F.3 of the Appendix.

In each example, the covariates X=[X1,X2]X=[X_{1},X_{2}] are two-dimensional and generated as i.i.d. Unif(0,1)\text{Unif}(0,1). We compute the proposed estimators θ^n,Nsafe\hat{\theta}_{n,N}^{\text{safe}} (28), θ^n,Neff.\hat{\theta}_{n,N}^{\text{eff.}} (34), and θ^n,NPPI\hat{\theta}_{n,N}^{\text{PPI}} (41) as follows:

  • For the estimator θ^n,Nsafe\hat{\theta}_{n,N}^{\text{safe}}, we use g(x)=xg(x)=x.

  • For the estimator θ^n,Neff.\hat{\theta}_{n,N}^{\text{eff.}}, we use a basis of tensor product natural cubic splines, with Kn{4,9,16}K_{n}\in\left\{{4,9,16}\right\} basis functions.

  • For the estimator θ^n,NPPI\hat{\theta}_{n,N}^{\text{PPI}}, we use gη~n(x)=φη~n(x,f(x)){g_{\tilde{\eta}_{n}}}(x)={{\varphi}_{\tilde{\eta}_{n}}}(x,f(x)) where f:𝒳𝒴f:{\mathcal{X}}\to{\mathcal{Y}} is a prediction model. We consider two prediction models: (i) a random forest model trained on independent data, which represents an informative prediction model; and (ii) randomly-generated Gaussian noise, which represents a non-informative prediction model.

Each of these estimators is constructed by modifying an efficient supervised estimator, whose performance we also consider. Additionally, we include the PPI++ estimator proposed by Angelopoulos et al. [2023b], using the same two prediction models as for θ^n,NPPI\hat{\theta}_{n,N}^{\text{PPI}}.

In each simulation setting, we also estimate the semiparametric efficiency lower bounds in the ISS setting (5) and in the OSS setting (25).

For each method, we report the coverage of the 95% confidence interval, as well as the standard error; all results are averaged over 1,000 simulated datasets.

8.1 Mean estimation

We consider Example 7.1 with h(y)=yh(y)=y. In this example, the supervised estimator is the sample mean, which has influence function φη(z)=yθ{{\varphi}_{\eta^{*}}}(z)=y-\theta^{*}. The conditional influence function is

ϕη(x)=𝔼[φη(Z)X=x]=𝔼[YX=x]θ,{\phi_{\eta^{*}}}(x)={\mathbb{E}}[{{\varphi}_{\eta^{*}}}(Z)\mid X=x]={\mathbb{E}}[Y\mid X=x]-\theta^{*},

which depends on xx through the conditional expectation. We generate the response YY as Y=𝔼[YX]+ϵY={\mathbb{E}}[Y\mid X]+\epsilon, where ϵN(0,1)\epsilon\sim N(0,1). We consider three settings for 𝔼[YX]{\mathbb{E}}[Y\mid X]:

  1. Setting 1 (linear model):

    𝔼[YX=x]=1.05+4.76x16.2x2.{\mathbb{E}}[Y\mid X=x]=1.05+4.76x_{1}-6.2x_{2}.

  2. Setting 2 (non-linear model):

    𝔼[YX=x]=1.70x1x26.94x1x221.35x12x2+2.28x12x22.{\mathbb{E}}[Y\mid X=x]=-1.70x_{1}x_{2}-6.94x_{1}x_{2}^{2}-1.35x_{1}^{2}x_{2}+2.28x_{1}^{2}x_{2}^{2}.

  3. Setting 3 (well-specified model, in the sense of Definition 3.1):

    𝔼[YX=x]=0.{\mathbb{E}}[Y\mid X=x]=0.

For each model, we set n=1,000n=1,000, and vary the proportion of unlabeled observations, γ=Nn+N{0.1,0.3,0.5,0.7,0.9}\gamma=\frac{N}{n+N}\in\left\{{0.1,0.3,0.5,0.7,0.9}\right\}. The semiparametric efficiency lower bounds in the ISS setting (5) and in the OSS setting (25) are estimated separately on a sample of 100,000100,000 observations.
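
For concreteness, the following sketch carries out one replication of Setting 1 with γ=0.9\gamma=0.9, using the projection-based correction with the fixed basis g(x)=xg(x)=x (the same form as (41) with a single, non-random basis). It is a simplified illustration, not the exact script used to produce Figure 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 1000, 9000                                  # gamma = N / (n + N) = 0.9

# covariates and responses for Setting 1 (linear conditional mean)
X_all = rng.uniform(size=(n + N, 2))
cond_mean = 1.05 + 4.76 * X_all[:, 0] - 6.2 * X_all[:, 1]
Y_lab = cond_mean[:n] + rng.normal(size=n)         # only the first n responses are observed

theta_sup = Y_lab.mean()                           # supervised estimator: the sample mean

# safe correction with basis g(x) = x: regress the influence function y - theta
# on the covariates centered at their pooled mean, then subtract the projection
g0_all = X_all - X_all.mean(axis=0)
g0_lab = g0_all[:n]
phi_lab = Y_lab - theta_sup                        # estimated influence-function values
A = phi_lab @ g0_lab / n                           # cross moment on the labeled data
G = g0_all.T @ g0_all / (n + N)                    # Gram matrix on the pooled data
B_hat = np.linalg.solve(G, A)                      # regression coefficients
theta_safe = theta_sup - g0_lab.mean(axis=0) @ B_hat
print(theta_sup, theta_safe)
```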

Table 1 of Appendix F.1 reports the coverage of 95% confidence intervals for each method. All methods achieve the nominal coverage.

Figure 1 displays the standard error of each method, averaged over 1,000 simulated datasets.

Figure 1: Standard errors of estimators of the mean, as described in Section 8.1, averaged over 1,000 simulations with n=1,000n=1,000. Left: When the conditional influence function is linear (Setting 1), θ^n,Nsafe\hat{\theta}_{n,N}^{\text{safe}} with g(x)=xg(x)=x achieves the efficiency lower bound in the OSS setting. Center: When the conditional influence function is non-linear (Setting 2), θ^n,Nsafe\hat{\theta}_{n,N}^{\text{safe}} with g(x)=xg(x)=x is no longer efficient, whereas θ^n,Neff.\hat{\theta}_{n,N}^{\text{eff.}} is efficient with a sufficient number of basis functions. Right: For a well-specified estimation problem in the sense of Definition 3.1 (Setting 3), no semi-supervised method can improve upon the supervised estimator, as shown in Corollary 4.2.

We begin with a discussion of Setting 1 (linear model). The estimator θ^n,Nsafe\hat{\theta}_{n,N}^{\text{safe}} achieves the efficiency lower bound in the OSS setting, since θ^n,Nsafe\hat{\theta}_{n,N}^{\text{safe}} with g(x)=xg(x)=x suffices to accurately approximate the true conditional influence function. In fact, θ^n,Neff.\hat{\theta}_{n,N}^{\text{eff.}}, which uses a greater number of basis functions, performs worse: those additional basis functions contribute to increased variance without improving bias. Both PPI++ and θ^n,NPPI\hat{\theta}_{n,N}^{\text{PPI}} using a random forest prediction model perform well, whereas the version of those methods that uses a pure-noise prediction model performs comparably to the supervised estimator, i.e., incorporating a useless prediction model does not lead to deterioration of performance. Since there is only one parameter of interest in the context of mean estimation, i.e. p=1p=1, the PPI++ estimator is asymptotically equivalent to θ^n,NPPI\hat{\theta}_{n,N}^{\text{PPI}} with gη~n(x)=φη~n(x,f(x)){g_{\tilde{\eta}_{n}}}(x)={{\varphi}_{\tilde{\eta}_{n}}}(x,f(x)) based on the same prediction model f()f(\cdot): thus, there is no difference in performance between PPI++ and θ^n,NPPI\hat{\theta}_{n,N}^{\text{PPI}}. However, as we will show in Section 8.2, when there are multiple parameters, i.e. p>1p>1, θ^n,NPPI\hat{\theta}_{n,N}^{\text{PPI}} can outperform the PPI++ estimator.

We now consider Setting 2 (non-linear model). Because the conditional influence function is non-linear, the best performance for θ^n,Neff.\hat{\theta}_{n,N}^{\text{eff.}} is achieved when the number of basis functions KnK_{n} is sufficiently large. Furthermore, this performance is substantially better than that of θ^n,Nsafe\hat{\theta}_{n,N}^{\text{safe}} with g(x)=xg(x)=x. Other than this, the results are quite similar to Setting 1.

Finally, we consider Setting 3 (well-specified model), in which the model is well-specified in the sense of Definition 3.1. In keeping with Corollary 4.2, no semi-supervised method improves upon the supervised estimator.

8.2 Generalized linear model

We consider Example  7.2 with a Poisson GLM. In this example, the supervised estimator is the Poisson GLM estimator, which has influence function

φη(z)=𝔼[eXθXX]1x(yexθ).{{\varphi}_{\eta^{*}}}(z)={\mathbb{E}}\left[e^{X^{\top}\theta^{*}}XX^{\top}\right]^{-1}x\left(y-e^{x^{\top}\theta^{*}}\right).

The conditional influence function is then

ϕη(x)=𝔼[eXθXX]1x(𝔼[YX=x]exθ).{\phi_{\eta^{*}}}(x)={\mathbb{E}}\left[e^{X^{\top}\theta^{*}}XX^{\top}\right]^{-1}x\left({\mathbb{E}}[Y\mid X=x]-e^{x^{\top}\theta^{*}}\right).

We generate the response YY as YXPoisson(𝔼[YX])Y\mid X\sim\text{Poisson}\left({\mathbb{E}}[Y\mid X]\right). However, unlike in Section 8.1, the conditional influence function is not a linear function of xx regardless of the form of 𝔼[YX=x]{\mathbb{E}}[Y\mid X=x]. Therefore, we only consider two settings for 𝔼[YX]{\mathbb{E}}[Y\mid X]:

  1. Setting 1 (non-linear model):

    𝔼[YX=x]=exp{1.70x1x26.94x1x221.35x12x2+2.28x12x22}.{\mathbb{E}}[Y\mid X=x]=\exp\left\{{-1.70x_{1}x_{2}-6.94x_{1}x_{2}^{2}-1.35x_{1}^{2}x_{2}+2.28x_{1}^{2}x_{2}^{2}}\right\}.

  2. Setting 2 (well-specified model, in the sense of Definition 3.1):

    𝔼[YX=x]=exp{1.05+4.76x16.2x2}.{\mathbb{E}}[Y\mid X=x]=\exp\left\{{1.05+4.76x_{1}-6.2x_{2}}\right\}.

For each model, we set n=1,000n=1,000, and vary the proportion of unlabeled observations, γ=Nn+N{0.1,0.3,0.5,0.7,0.9}\gamma=\frac{N}{n+N}\in\left\{{0.1,0.3,0.5,0.7,0.9}\right\}. The semiparametric efficiency lower bounds in the ISS setting (5) and in the OSS setting (25) are estimated separately on a sample of 100,000100,000 observations.

Table 2 of Appendix F.1 reports the coverage of 95% confidence intervals for each method. All methods achieve the nominal coverage.

Figure 2: Standard errors of estimators of the first parameter of the Poisson GLM, as described in Section 8.2, averaged over 1,000 simulations, with n=1,000n=1,000. Left: When the model is non-linear (Setting 1), and with a sufficient number of basis functions, θ^n,Neff.\hat{\theta}_{n,N}^{\text{eff.}} nearly achieves the OSS semiparametric efficiency lower bound. Moreover, θ^n,NPPI\hat{\theta}_{n,N}^{\text{PPI}} with gη~n(x)=φη~n(x,f(x)){g_{\tilde{\eta}_{n}}}(x)={{\varphi}_{\tilde{\eta}_{n}}}(x,f(x)) outperforms the PPI++ estimators with the noisy prediction model. Right: For a well-specified estimation problem in the sense of Definition 3.1 (Setting 2), no semi-supervised method can improve upon the supervised estimator. This agrees with Corollary 4.2.

Figure 2 displays the standard error of the first parameter for each method, averaged over 1,000 simulated datasets. Results for the standard error of the second parameter are similar, and are displayed in Figure 3 in Appendix F.1.

We first consider Setting 1 (non-linear model). Consistent with the results for Setting 2 of Section 8.1, the estimator θ^n,Neff.\hat{\theta}_{n,N}^{\text{eff.}} approximates the efficiency lower bound in the OSS setting when the number of basis functions is sufficiently large. Meanwhile, the estimator θ^n,Nsafe\hat{\theta}_{n,N}^{\text{safe}} with g(x)=xg(x)=x improves upon the supervised estimator, but is not efficient as it cannot accurately approximate the true conditional influence function ϕη(x){\phi_{\eta^{*}}}(x). Both PPI++ and θ^n,NPPI\hat{\theta}_{n,N}^{\text{PPI}} perform well with a random forest prediction model. In addition, we make the following observations that differ from the results shown in Section 8.1:

  1. θ^n,NPPI\hat{\theta}_{n,N}^{\text{PPI}} significantly improves upon the supervised estimator θ^n\hat{\theta}_{n} even with a pure-noise prediction model. To see this, recall that θ^n,NPPI\hat{\theta}_{n,N}^{\text{PPI}} estimates the conditional influence function ϕη(x){\phi_{\eta^{*}}}(x) by regressing φη(x,y){{\varphi}_{\eta^{*}}}(x,y) onto φη(x,f(x)){{\varphi}_{\eta^{*}}}(x,f(x)) where f()f(\cdot) is a prediction model. (In practice, the unknown function η\eta^{*} is replaced with a consistent estimator η~n\tilde{\eta}_{n}.) Define 𝐕θ=𝔼[XXeXθ]{\mathbf{V}_{\theta^{*}}}={\mathbb{E}}\left[XX^{\top}e^{X^{\top}\theta^{*}}\right]. In the case of a Poisson GLM, we have that

    φη(z)=𝐕θ1x(yexθ){{\varphi}_{\eta^{*}}}(z)=\mathbf{V}_{\theta^{*}}^{-1}x\left(y-e^{x^{\top}\theta^{*}}\right)

    and

    φη(x,f(x))=𝐕θ1x(f(x)exθ).{{\varphi}_{\eta^{*}}}(x,f(x))=\mathbf{V}_{\theta^{*}}^{-1}x\left(f(x)-e^{x^{\top}\theta^{*}}\right).

    Thus, in the case of a Poisson GLM, the projection of φη(z){{\varphi}_{\eta^{*}}}(z) onto φη(x,f(x)){{\varphi}_{\eta^{*}}}(x,f(x)) in p2(){\mathcal{L}}^{2}_{p}({{\mathbb{P}}^{*}}) is non-zero, even if — in the extreme case — f(X)=0f(X)=0.

    By contrast, in the case of mean estimation in Section 8.1, the projection of φη(z)=yθ{{\varphi}_{\eta^{*}}}(z)=y-\theta^{*} onto φη(x,f(x))=f(x)θ{{\varphi}_{\eta^{*}}}(x,f(x))=f(x)-\theta^{*} in p2(){\mathcal{L}}^{2}_{p}({{\mathbb{P}}^{*}}) equals zero when f(X)f(X) is independent of YY. Thus, for mean estimation, a pure-noise prediction model f()f(\cdot) is completely useless.

  2. θ^n,NPPI\hat{\theta}_{n,N}^{\text{PPI}} is more efficient than the PPI++ estimator of Angelopoulos et al. [2023b] when both use the noisy prediction model. This is because PPI++ uses a scalar weight to minimize the trace of the p×pp\times p asymptotic covariance matrix, which is sub-optimal in efficiency when p>1p>1. On the other hand, θ^n,NPPI\hat{\theta}_{n,N}^{\text{PPI}} uses a regression approach to find the best linear approximation of the conditional influence function ϕη(x){\phi_{\eta^{*}}}(x), as in (41).

Finally, we consider Setting 2 (well-specified model). No semi-supervised method improves upon the supervised estimator, in keeping with Corollary 4.2.

9 Discussion

We have proposed a general framework to study statistical inference in the semi-supervised setting. We established the semiparametric efficiency lower bound for an arbitrary inferential problem under the semi-supervised setting, and showed that no improvement can be made when the model is well-specified. Furthermore, we proposed a class of easy-to-compute estimators that build upon existing supervised estimators and that can achieve the efficiency lower bound for an arbitrary inferential problem.

This paper leaves open several directions for future work. First, our results require a Donsker condition on the influence function of the supervised estimator. This may not hold, for example, when the functional η()\eta(\cdot) is high-dimensional or infinite-dimensional. Recent advances in double/debiased machine learning [Chernozhukov et al., 2018, Foster and Syrgkanis, 2023] may provide an avenue for obtaining efficient semi-supervised estimators in the presence of high- or infinite-dimensional nuisance parameters. Second, a general theoretical framework for efficient semi-supervised estimation in the presence of covariate shift also remains a relatively open problem, despite some promising preliminary work [Ryan and Culp, 2015, Aminian et al., 2022].

Acknowledgments

We thank Abhishek Chakrabortty for pointing out that an earlier version of this paper overlooked connections with Chapter 2 of his dissertation [Chakrabortty, 2016] in the special case of M-estimation.

References

  • Zhang et al. [2019] Anru Zhang, Lawrence D Brown, and T Tony Cai. Semi-supervised inference: General theory and estimation of means. The Annals of Statistics, 47(5):2538–2566, 2019.
  • Cheng et al. [2021] David Cheng, Ashwin N Ananthakrishnan, and Tianxi Cai. Robust and efficient semi-supervised estimation of average treatment effects with application to electronic health records data. Biometrics, 77(2):413–423, 2021.
  • Kawakita and Kanamori [2013] Masanori Kawakita and Takafumi Kanamori. Semi-supervised learning with density-ratio estimation. Machine Learning, 91:189–209, 2013.
  • Buja et al. [2019a] Andreas Buja, Lawrence Brown, Richard Berk, Edward George, Emil Pitkin, Mikhail Traskin, Kai Zhang, and Linda Zhao. Models as approximations i. Statistical Science, 34(4):523–544, 2019a.
  • Song et al. [2023] Shanshan Song, Yuanyuan Lin, and Yong Zhou. A general m-estimation theory in semi-supervised framework. Journal of the American Statistical Association, pages 1–11, 2023.
  • Gan and Liang [2023] Feng Gan and Wanfeng Liang. Prediction de-correlated inference. arXiv preprint arXiv:2312.06478, 2023.
  • Angelopoulos et al. [2023a] Anastasios N Angelopoulos, Stephen Bates, Clara Fannjiang, Michael I Jordan, and Tijana Zrnic. Prediction-powered inference. Science, 382(6671):669–674, 2023a.
  • Angelopoulos et al. [2023b] Anastasios N Angelopoulos, John C Duchi, and Tijana Zrnic. Ppi++: Efficient prediction-powered inference. arXiv preprint arXiv:2311.01453, 2023b.
  • Bennett and Demiriz [1998] Kristin Bennett and Ayhan Demiriz. Semi-supervised support vector machines. Advances in Neural Information Processing Systems, 11, 1998.
  • Chapelle et al. [2006] Olivier Chapelle, Mingmin Chi, and Alexander Zien. A continuation method for semi-supervised svms. In Proceedings of the 23rd International Conference on Machine Learning, pages 185–192, 2006.
  • Bair [2013] Eric Bair. Semi-supervised clustering methods. Wiley Interdisciplinary Reviews: Computational Statistics, 5(5):349–361, 2013.
  • Van Engelen and Hoos [2020] Jesper E Van Engelen and Holger H Hoos. A survey on semi-supervised learning. Machine Learning, 109(2):373–440, 2020.
  • Zhang and Bradic [2022] Yuqian Zhang and Jelena Bradic. High-dimensional semi-supervised learning: in search of optimal inference of the mean. Biometrika, 109(2):387–403, 2022.
  • Chakrabortty and Cai [2018] Abhishek Chakrabortty and Tianxi Cai. Efficient and adaptive linear regression in semi-supervised settings. The Annals of Statistics, 46(4):1541–1572, 2018.
  • Azriel et al. [2022] David Azriel, Lawrence D Brown, Michael Sklar, Richard Berk, Andreas Buja, and Linda Zhao. Semi-supervised linear regression. Journal of the American Statistical Association, 117(540):2238–2251, 2022.
  • Deng et al. [2023] Siyi Deng, Yang Ning, Jiwei Zhao, and Heping Zhang. Optimal and safe estimation for high-dimensional semi-supervised learning. Journal of the American Statistical Association, pages 1–12, 2023.
  • Wang et al. [2023] Tong Wang, Wenlu Tang, Yuanyuan Lin, and Wen Su. Semi-supervised inference for nonparametric logistic regression. Statistics in Medicine, 42(15):2573–2589, 2023.
  • Quan et al. [2024] Zhuojun Quan, Yuanyuan Lin, Kani Chen, and Wen Yu. Efficient semi-supervised inference for logistic regression under case-control studies. arXiv preprint arXiv:2402.15365, 2024.
  • Chakrabortty [2016] Abhishek Chakrabortty. Robust semi-parametric inference in semi-supervised settings. PhD thesis, 2016.
  • Yuval and Rosset [2022] Oren Yuval and Saharon Rosset. Semi-supervised empirical risk minimization: Using unlabeled data to improve prediction. Electronic Journal of Statistics, 16(1):1434–1460, 2022.
  • Kim et al. [2024] Ilmun Kim, Larry Wasserman, Sivaraman Balakrishnan, and Matey Neykov. Semi-supervised u-statistics. arXiv preprint arXiv:2402.18921, 2024.
  • Chakrabortty and Dai [2022] Abhishek Chakrabortty and Guorong Dai. A general framework for treatment effect estimation in semi-supervised and high dimensional settings. arXiv preprint arXiv:2201.00468, 2022.
  • Ahmed et al. [2024] Hanan Ahmed, John HJ Einmahl, and Chen Zhou. Extreme value statistics in semi-supervised models. Journal of the American Statistical Association, pages 1–14, 2024.
  • Miao et al. [2023] Jiacheng Miao, Xinran Miao, Yixuan Wu, Jiwei Zhao, and Qiongshi Lu. Assumption-lean and data-adaptive post-prediction inference. arXiv preprint arXiv:2311.14220, 2023.
  • Miao and Lu [2024] Jiacheng Miao and Qiongshi Lu. Task-agnostic machine learning-assisted inference. arXiv preprint arXiv:2405.20039, 2024.
  • Zrnic and Candès [2024a] Tijana Zrnic and Emmanuel J Candès. Active statistical inference. arXiv preprint arXiv:2403.03208, 2024a.
  • Zrnic and Candès [2024b] Tijana Zrnic and Emmanuel J Candès. Cross-prediction-powered inference. Proceedings of the National Academy of Sciences, 121(15):e2322083121, 2024b.
  • Gu and Xia [2024] Yanwu Gu and Dong Xia. Local prediction-powered inference. arXiv preprint arXiv:2409.18321, 2024.
  • Van der Vaart [2000] Aad W Van der Vaart. Asymptotic statistics, volume 3. Cambridge University Press, 2000.
  • Tsiatis [2006] Anastasios A Tsiatis. Semiparametric theory and missing data, volume 4. Springer, 2006.
  • Bickel et al. [1993] Peter J Bickel, Chris AJ Klaassen, Ya’acov Ritov, and Jon A Wellner. Efficient and Adaptive Estimation for Semiparametric Models, volume 4. Springer, 1993.
  • Buja et al. [2019b] Andreas Buja, Lawrence Brown, Arun Kumar Kuchibhotla, Richard Berk, Edward George, and Linda Zhao. Models as approximations ii. Statistical Science, 34(4):545–565, 2019b.
  • Newey [1997] Whitney K Newey. Convergence rates and asymptotic normality for series estimators. Journal of Econometrics, 79(1):147–168, 1997.
  • Bickel and Kwon [2001] Peter J Bickel and Jaimyoung Kwon. Inference for semiparametric models: some questions and an answer. Statistica Sinica, pages 863–886, 2001.
  • Robins and Rotnitzky [1995] James M Robins and Andrea Rotnitzky. Semiparametric efficiency in multivariate regression models with missing data. Journal of the American Statistical Association, 90(429):122–129, 1995.
  • Chen et al. [2008] Xiaohong Chen, Han Hong, and Alessandro Tarozzi. Semiparametric efficiency in gmm models with auxiliary data. The Annals of Statistics, 36(2):808–843, 2008.
  • Zhou et al. [2008] Yong Zhou, Alan T K Wan, and Xiaojing Wang. Estimating equations inference with missing data. Journal of the American Statistical Association, 103(483):1187–1199, 2008.
  • Li and Luedtke [2023] Sijia Li and Alex Luedtke. Efficient estimation under data fusion. Biometrika, 110(4):1041–1054, 2023.
  • Graham et al. [2024] Ellen Graham, Marco Carone, and Andrea Rotnitzky. Towards a unified theory for semiparametric data fusion with individual-level data. arXiv preprint arXiv:2409.09973, 2024.
  • Robins et al. [1994] James M Robins, Andrea Rotnitzky, and Lue Ping Zhao. Estimation of regression coefficients when some regressors are not always observed. Journal of the American statistical Association, 89(427):846–866, 1994.
  • Hahn [1998] Jinyong Hahn. On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica, pages 315–331, 1998.
  • Wittes et al. [1989] Janet Wittes, Edward Lakatos, and Jeffrey Probstfield. Surrogate endpoints in clinical trials: cardiovascular diseases. Statistics in Medicine, 8(4):415–425, 1989.
  • Chernozhukov et al. [2018] Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. Double/debiased machine learning for treatment and structural parameters, 2018.
  • Foster and Syrgkanis [2023] Dylan J Foster and Vasilis Syrgkanis. Orthogonal statistical learning. The Annals of Statistics, 51(3):879–908, 2023.
  • Ryan and Culp [2015] Kenneth Joseph Ryan and Mark Vere Culp. On semi-supervised linear regression in covariate shift problems. The Journal of Machine Learning Research, 16(1):3183–3217, 2015.
  • Aminian et al. [2022] Gholamali Aminian, Mahed Abroshan, Mohammad Mahdi Khalili, Laura Toni, and Miguel Rodrigues. An information-theoretical approach to semi-supervised learning under covariate-shift. In International Conference on Artificial Intelligence and Statistics, pages 7433–7449, 2022.
  • van der Vaart and Wellner [2013] Aad van der Vaart and Jon Wellner. Weak convergence and empirical processes: with applications to statistics. Springer Science & Business Media, 2013.
  • Hansen [2022] Bruce Hansen. Econometrics. Princeton University Press, 2022.
  • van der Laan [1995] Mark J van der Laan. Efficient and inefficient estimation in semiparametric models. 1995.
  • Gill et al. [1995] Richard D Gill, Mark J van der Laan, and Jon A Wellner. Inefficient estimators of the bivariate survival function for three models. In Annales de l’IHP Probabilités et statistiques, volume 31, pages 545–597, 1995.
  • Pfanzagl [1990] Johann Pfanzagl. Estimation in semiparametric models. Springer, 1990.
  • Van Der Vaart [1991] Aad Van Der Vaart. On differentiable functionals. The Annals of Statistics, pages 178–204, 1991.
  • van der Laan and Robins [2003] Mark J van der Laan and James Robins. Unified approach for causal inference and censored data. Unified Methods for Censored Longitudinal Data and Causality, pages 311–370, 2003.

Appendix A Additional notation

We introduce additional notation that is used in the appendix. For a matrix 𝐀p×q\mathbf{A}\in{\mathbb{R}}^{p\times q}, let 𝐀2\left\lVert\mathbf{A}\right\rVert_{2} denote its operator norm and let 𝐀F\left\lVert\mathbf{A}\right\rVert_{F} denote its Frobenius norm. For a set 𝒮{\mathcal{S}}, let int(𝒮)\text{int}({\mathcal{S}}) denote its interior. Consider the probability space (𝒵,,)({\mathcal{Z}},{\mathcal{F}},{{\mathbb{P}}^{*}}). Letting f(z)f(z) be a measurable function f:𝒵pf:{\mathcal{Z}}\to{\mathbb{R}}^{p} over (𝒵,,)({\mathcal{Z}},{\mathcal{F}},{{\mathbb{P}}^{*}}), we adopt the following notation from the empirical process literature: (f)=𝔼[f(Z)]{\mathbb{P}}(f)={\mathbb{E}}_{\mathbb{P}}[f(Z)], n(f)=𝔼n[f(Z)]=1ni=1nf(Zi){{\mathbb{P}}_{n}}(f)={\mathbb{E}}_{{{\mathbb{P}}_{n}}}[f(Z)]=\frac{1}{n}\sum_{i=1}^{n}f(Z_{i}), and 𝔾n(f)=n(n)(f){\mathbb{G}}_{n}(f)=\sqrt{n}({{\mathbb{P}}_{n}}-{{\mathbb{P}}^{*}})(f). Similarly, for a measurable function g:𝒳pg:{\mathcal{X}}\to{\mathbb{R}}^{p}, n+N(g)=𝔼n+N[g(X)]=1n+Ni=1n+Ng(Xi){{\mathbb{P}}_{n+N}}(g)={\mathbb{E}}_{{{\mathbb{P}}_{n+N}}}[g(X)]=\frac{1}{n+N}\sum_{i=1}^{n+N}g(X_{i}), N(g)=𝔼N[g(X)]=1Ni=n+1n+Ng(Xi){{\mathbb{P}}_{N}}(g)={\mathbb{E}}_{{{\mathbb{P}}_{N}}}[g(X)]=\frac{1}{N}\sum_{i=n+1}^{n+N}g(X_{i}), and 𝔾n+N(g)=n+N(n+N)(g){\mathbb{G}}_{n+N}(g)=\sqrt{n+N}({{\mathbb{P}}_{n+N}}-{{\mathbb{P}}^{*}})(g). For two subspaces 𝒯1p,02(){\mathcal{T}}_{1}\in{{\mathcal{L}}^{2}_{p,0}({{\mathbb{P}}^{*}})} and 𝒯2p,02(){\mathcal{T}}_{2}\in{{\mathcal{L}}^{2}_{p,0}({{\mathbb{P}}^{*}})} such that 𝒯1𝒯2{\mathcal{T}}_{1}\perp{\mathcal{T}}_{2}, let 𝒯1𝒯2{\mathcal{T}}_{1}\oplus{\mathcal{T}}_{2} represent their direct sum in p,02(){{\mathcal{L}}^{2}_{p,0}({{\mathbb{P}}^{*}})}.

Appendix B Proof of main results

Proof of Theorem 3.1.

In the first step, we characterize the tangent space of the reduced model {X}𝒫YX{\{{\mathbb{P}}_{X}^{*}\}\otimes{\mathcal{P}}_{Y\mid X}}, which is defined in (2). Note that the reduced model {X}𝒫YX{\{{\mathbb{P}}_{X}^{*}\}\otimes{\mathcal{P}}_{Y\mid X}} satisfies the conditions of Lemma E.11, i.e., the marginal distribution and the conditional distribution are separately modeled. Therefore, by Lemma E.11, the tangent space of the reduced model {X}𝒫YX{\{{\mathbb{P}}_{X}^{*}\}\otimes{\mathcal{P}}_{Y\mid X}} can be expressed as 𝒯{X}(X)𝒯𝒫YX(YX){\mathcal{T}}_{\left\{{{{\mathbb{P}}^{*}_{X}}}\right\}}({{\mathbb{P}}^{*}_{X}})\oplus{{\mathcal{T}}_{{\mathcal{P}}_{Y\mid X}}({{\mathbb{P}}^{*}_{Y\mid X}})}, where 𝒫YX{{\mathcal{P}}_{Y\mid X}} is the conditional model. Since {X}\left\{{{{\mathbb{P}}^{*}_{X}}}\right\} is a singleton set, 𝒯{X}(X)={0}{\mathcal{T}}_{\left\{{{{\mathbb{P}}^{*}_{X}}}\right\}}({{\mathbb{P}}^{*}_{X}})=\left\{{0}\right\}, and hence the tangent space of {X}𝒫YX{\{{\mathbb{P}}_{X}^{*}\}\otimes{\mathcal{P}}_{Y\mid X}} becomes 𝒯𝒫YX(YX){{\mathcal{T}}_{{\mathcal{P}}_{Y\mid X}}({{\mathbb{P}}^{*}_{Y\mid X}})}.

Recall that ϕη(x){\phi^{*}_{\eta^{*}}}(x) is the conditional efficient influence function 𝔼[φη(Z)X=x]{\mathbb{E}}[{{\varphi}^{*}_{\eta^{*}}}(Z)\mid X=x]. In the second step, we show that for all apa\in{\mathbb{R}}^{p},

a[φη(z)ϕη(x)]𝒯𝒫YX(YX).a^{\top}[{{\varphi}^{*}_{\eta^{*}}}(z)-{\phi^{*}_{\eta^{*}}}(x)]\in{{\mathcal{T}}_{{\mathcal{P}}_{Y\mid X}}({{\mathbb{P}}^{*}_{Y\mid X}})}. (58)

We prove this by contradiction. Suppose there exists apa\in{\mathbb{R}}^{p} such that a[φη(z)ϕη(x)]𝒯𝒫YX(YX)a^{\top}[{{\varphi}^{*}_{\eta^{*}}}(z)-{\phi^{*}_{\eta^{*}}}(x)]\notin{{\mathcal{T}}_{{\mathcal{P}}_{Y\mid X}}({{\mathbb{P}}^{*}_{Y\mid X}})}. Consider the decomposition

aφη(z)=aϕη(x)fa(x)+a[φη(z)ϕη(x)]ga(z).a^{\top}{{\varphi}^{*}_{\eta^{*}}}(z)=\underbrace{a^{\top}{\phi^{*}_{\eta^{*}}}(x)}_{f_{a}(x)}+\underbrace{a^{\top}[{{\varphi}^{*}_{\eta^{*}}}(z)-{\phi^{*}_{\eta^{*}}}(x)]}_{g_{a}(z)}.

Since aφη(z)1,02()=1,02(X)1,02(YX)a^{\top}{{\varphi}^{*}_{\eta^{*}}}(z)\in{{\mathcal{L}}^{2}_{1,0}({{\mathbb{P}}^{*}})}={{\mathcal{L}}^{2}_{1,0}({{\mathbb{P}}^{*}_{X}})}\oplus{{\mathcal{L}}^{2}_{1,0}({{\mathbb{P}}^{*}_{Y\mid X}})}, with fa(x)1,02(X)f_{a}(x)\in{{\mathcal{L}}^{2}_{1,0}({{\mathbb{P}}^{*}_{X}})} and ga(z)1,02(YX)g_{a}(z)\in{{\mathcal{L}}^{2}_{1,0}({{\mathbb{P}}^{*}_{Y\mid X}})}, the property of the direct sum implies that aφη(z)=fa(x)+ga(z)a^{\top}{{\varphi}^{*}_{\eta^{*}}}(z)=f_{a}(x)+g_{a}(z) is the unique decomposition whose first component lies in 1,02(X){{\mathcal{L}}^{2}_{1,0}({{\mathbb{P}}^{*}_{X}})} and whose second component lies in 1,02(YX){{\mathcal{L}}^{2}_{1,0}({{\mathbb{P}}^{*}_{Y\mid X}})}. Further, by definition of the efficient influence function, aφη(z)𝒯𝒫()a^{\top}{{\varphi}^{*}_{\eta^{*}}}(z)\in{{\mathcal{T}}_{{\mathcal{P}}}({{\mathbb{P}}^{*}})}. Recall the model 𝒫{{\mathcal{P}}} defined in (1). By Lemma E.11, the tangent space relative to 𝒫{{\mathcal{P}}} is

𝒯𝒫()=𝒯𝒫X(X)𝒯𝒫YX(YX).{{\mathcal{T}}_{{\mathcal{P}}}({{\mathbb{P}}^{*}})}={{\mathcal{T}}_{{\mathcal{P}}_{X}}({{\mathbb{P}}^{*}_{X}})}\oplus{{\mathcal{T}}_{{\mathcal{P}}_{Y\mid X}}({{\mathbb{P}}^{*}_{Y\mid X}})}.

Similarly, by the property of the direct sum, there exists a unique decomposition a^{\top}{{\varphi}^{*}_{\eta^{*}}}(z)=f^{\prime}_{a}(x)+g^{\prime}_{a}(z) such that f^{\prime}_{a}(x)\in{{\mathcal{T}}_{{\mathcal{P}}_{X}}({{\mathbb{P}}^{*}_{X}})} and g^{\prime}_{a}(z)\in{{\mathcal{T}}_{{\mathcal{P}}_{Y\mid X}}({{\mathbb{P}}^{*}_{Y\mid X}})}. However, g_{a}(z)\notin{{\mathcal{T}}_{{\mathcal{P}}_{Y\mid X}}({{\mathbb{P}}^{*}_{Y\mid X}})}, and hence g_{a}(z)\neq g_{a}^{\prime}(z). By definition, {{\mathcal{T}}_{{\mathcal{P}}_{X}}({{\mathbb{P}}^{*}_{X}})}\subseteq{{\mathcal{L}}^{2}_{1,0}({{\mathbb{P}}^{*}_{X}})} and {{\mathcal{T}}_{{\mathcal{P}}_{Y\mid X}}({{\mathbb{P}}^{*}_{Y\mid X}})}\subseteq{{\mathcal{L}}^{2}_{1,0}({{\mathbb{P}}^{*}_{Y\mid X}})}. Therefore, a^{\top}{{\varphi}^{*}_{\eta^{*}}}(z)=f^{\prime}_{a}(x)+g^{\prime}_{a}(z) is another decomposition of a^{\top}{{\varphi}^{*}_{\eta^{*}}}(z) with f^{\prime}_{a}(x)\in{{\mathcal{L}}^{2}_{1,0}({{\mathbb{P}}^{*}_{X}})} and g^{\prime}_{a}(z)\in{{\mathcal{L}}^{2}_{1,0}({{\mathbb{P}}^{*}_{Y\mid X}})}, which contradicts the uniqueness of the direct-sum decomposition. Therefore, we have established (58) for all a\in{\mathbb{R}}^{p}.

In the third step, we show that φη(z)ϕη(x){{\varphi}^{*}_{\eta^{*}}}(z)-{\phi^{*}_{\eta^{*}}}(x) is a gradient relative to the model {X}𝒫YX{\{{\mathbb{P}}_{X}^{*}\}\otimes{\mathcal{P}}_{Y\mid X}}. Consider any one-dimensional regular parametric sub-model 𝒫T{\mathcal{P}}_{T} of {X}𝒫YX{\{{\mathbb{P}}_{X}^{*}\}\otimes{\mathcal{P}}_{Y\mid X}}. Since the marginal model is a singleton set {X}\left\{{{{\mathbb{P}}^{*}_{X}}}\right\}, this maps one-to-one to a one-dimensional regular parametric sub-model 𝒫T,YX{\mathcal{P}}_{T,Y\mid X} of 𝒫YX{{\mathcal{P}}_{Y\mid X}} such that 𝒫T={X}𝒫T,YX{\mathcal{P}}_{T}=\left\{{{{\mathbb{P}}^{*}_{X}}}\right\}\otimes{\mathcal{P}}_{T,Y\mid X}. Suppose that 𝒫T,YX={pt,YX(z),tT}𝒫YX{\mathcal{P}}_{T,Y\mid X}=\left\{{p_{t,Y\mid X}(z),t\in T}\right\}\subset{{\mathcal{P}}_{Y\mid X}} such that pt,YX(z)p_{t^{*},Y\mid X}(z) is the conditional density of YX{{\mathbb{P}}^{*}_{Y\mid X}} for some tTt^{*}\in T. Denote st(yx)s_{t^{*}}(y\mid x) as the score function relative to 𝒫T,YX{\mathcal{P}}_{T,Y\mid X} at tt^{*}, which then satisfies st(yx)1,02(YX)s_{t^{*}}(y\mid x)\in{{\mathcal{L}}^{2}_{1,0}({{\mathbb{P}}^{*}_{Y\mid X}})}. Note that the score function relative to 𝒫T{\mathcal{P}}_{T} at tt^{*} remains st(yx)s_{t^{*}}(y\mid x), as the marginal distribution is known. Consider the function θ:𝒫TΘ\theta:{\mathcal{P}}_{T}\to\Theta as a function of tt, θ(t)\theta(t). By definition, the efficient influence function φη(z){{\varphi}^{*}_{\eta^{*}}}(z) is a gradient relative to 𝒫{{\mathcal{P}}}, hence

θ(t)t\displaystyle\frac{\partial\theta(t)}{\partial t} =φη(z),st(yx)p,02()\displaystyle=\langle{{\varphi}^{*}_{\eta^{*}}}(z),s_{t^{*}}(y\mid x)\rangle_{{\mathcal{L}}^{2}_{p,0}({{\mathbb{P}}^{*}})}
=φη(z)ϕη(x)+ϕη(x),st(yx)p,02()\displaystyle=\langle{{\varphi}^{*}_{\eta^{*}}}(z)-{\phi^{*}_{\eta^{*}}}(x)+{\phi^{*}_{\eta^{*}}}(x),s_{t^{*}}(y\mid x)\rangle_{{\mathcal{L}}^{2}_{p,0}({{\mathbb{P}}^{*}})}
\displaystyle=\langle{{\varphi}^{*}_{\eta^{*}}}(z)-{\phi^{*}_{\eta^{*}}}(x),s_{t^{*}}(y\mid x)\rangle_{{\mathcal{L}}^{2}_{p,0}({{\mathbb{P}}^{*}})}+\langle{\phi^{*}_{\eta^{*}}}(x),s_{t^{*}}(y\mid x)\rangle_{{\mathcal{L}}^{2}_{p,0}({{\mathbb{P}}^{*}})}
=φη(z)ϕη(x),st(yx)p,02(),\displaystyle=\langle{{\varphi}^{*}_{\eta^{*}}}(z)-{\phi^{*}_{\eta^{*}}}(x),s_{t^{*}}(y\mid x)\rangle_{{\mathcal{L}}^{2}_{p,0}({{\mathbb{P}}^{*}})},

where the last equality used the fact that st(yx)1,02(YX)s_{t^{*}}(y\mid x)\in{{\mathcal{L}}^{2}_{1,0}({{\mathbb{P}}^{*}_{Y\mid X}})} and

𝔼[ϕη(X)st(YX)]=𝔼{ϕη(X)𝔼[st(YX)X]}=0.{\mathbb{E}}[{\phi^{*}_{\eta^{*}}}(X)s_{t^{*}}(Y\mid X)]={\mathbb{E}}\left\{{{\phi^{*}_{\eta^{*}}}(X){\mathbb{E}}[s_{t^{*}}(Y\mid X)\mid X]}\right\}=0.

As the above holds for an arbitrary one-dimensional regular parametric sub-model of {X}𝒫YX{\{{\mathbb{P}}_{X}^{*}\}\otimes{\mathcal{P}}_{Y\mid X}}, φη(z)ϕη(x){{\varphi}^{*}_{\eta^{*}}}(z)-{\phi^{*}_{\eta^{*}}}(x) is a gradient relative to {X}𝒫YX{\{{\mathbb{P}}_{X}^{*}\}\otimes{\mathcal{P}}_{Y\mid X}} at {{\mathbb{P}}^{*}}.

Finally, combining steps 1 through 3 above, {{\varphi}^{*}_{\eta^{*}}}(z)-{\phi^{*}_{\eta^{*}}}(x) is a gradient relative to {\{{\mathbb{P}}_{X}^{*}\}\otimes{\mathcal{P}}_{Y\mid X}} at {{\mathbb{P}}^{*}} and, by (58), satisfies a^{\top}[{{\varphi}^{*}_{\eta^{*}}}(z)-{\phi^{*}_{\eta^{*}}}(x)]\in{{\mathcal{T}}_{{\mathcal{P}}_{Y\mid X}}({{\mathbb{P}}^{*}_{Y\mid X}})}, the tangent space of the reduced model, for all a\in{\mathbb{R}}^{p}. Therefore, by definition, {{\varphi}^{*}_{\eta^{*}}}(z)-{\phi^{*}_{\eta^{*}}}(x) is the efficient influence function of {\theta(\cdot)} at \theta^{*} relative to {\{{\mathbb{P}}_{X}^{*}\}\otimes{\mathcal{P}}_{Y\mid X}}. By Lemma E.1, the variance \mathrm{Var}[{{\varphi}^{*}_{\eta^{*}}}(Z)] can be decomposed as

Var[φη(Z)]\displaystyle\mathrm{Var}[{{\varphi}^{*}_{\eta^{*}}}(Z)] =Var[φη(Z)ϕη(X)]+Var[ϕη(X)],\displaystyle=\mathrm{Var}[{{\varphi}^{*}_{\eta^{*}}}(Z)-{\phi^{*}_{\eta^{*}}}(X)]+\mathrm{Var}[{\phi^{*}_{\eta^{*}}}(X)],

which proves the final claim of the theorem. ∎
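The displayed identity is the law of total variance applied coordinate-wise to the efficient influence function. As a numerical sanity check only (under a hypothetical data-generating process, and for the simple case of mean estimation, in which one may take {{\varphi}^{*}_{\eta^{*}}}(Z)=Y-\theta^{*} and {\phi^{*}_{\eta^{*}}}(X)={\mathbb{E}}[Y\mid X]-\theta^{*}), the following Python sketch confirms the decomposition up to Monte Carlo error.

import numpy as np

rng = np.random.default_rng(1)
m = 200_000
X = rng.normal(size=m)
Y = np.sin(X) + 0.5 * rng.normal(size=m)   # hypothetical model with E[Y | X] = sin(X)

theta_star = Y.mean()                       # target theta* = E[Y] (Monte Carlo value)
varphi = Y - theta_star                     # influence function for the mean
phi_x = np.sin(X) - theta_star              # conditional EIF: phi(X) = E[Y | X] - theta*

lhs = np.var(varphi)
rhs = np.var(varphi - phi_x) + np.var(phi_x)
print(lhs, rhs)                             # agree up to Monte Carlo error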

Proof of Theorem 3.2.

To prove the first part of the theorem, consider any element s(x)𝒯𝒫X(X)s(x)\in{{\mathcal{T}}_{{\mathcal{P}}_{X}}({{\mathbb{P}}^{*}_{X}})}. Without loss of generality, suppose 𝒫T,X={pt,X(x):tT}{\mathcal{P}}_{T,X}=\left\{{p_{t,X}(x):t\in T}\right\} is a regular parametric sub-model of 𝒫X{{\mathcal{P}}_{X}} with score function s(x)s(x) at X{{\mathbb{P}}^{*}_{X}}. (Otherwise, because the tangent space is a closed linear space by definition, we can always find a sequence of functions {sr(x)}r=1\left\{{s_{r}(x)}\right\}_{r=1}^{\infty} that are score functions of regular parametric sub-models of 𝒫{{\mathcal{P}}}, and the following arguments hold by the continuity of the inner product.) Then, the model 𝒫T=𝒫T,X{YX}{\mathcal{P}}_{T}={\mathcal{P}}_{T,X}\otimes\left\{{{{\mathbb{P}}^{*}_{Y\mid X}}}\right\} is a regular parametric sub-model of 𝒫{{\mathcal{P}}} with score function s(x)s(x) at {{\mathbb{P}}^{*}}. Since 𝒫T{\mathcal{P}}_{T} is parameterized by tTt\in T, we can write θ:𝒫TΘ\theta:{\mathcal{P}}_{T}\to\Theta as a function of tt, θ(t)\theta(t). Because θ(){\theta(\cdot)} is well-specified at YX{{\mathbb{P}}^{*}_{Y\mid X}} relative to 𝒫T,X𝒫X{\mathcal{P}}_{T,X}\subset{{\mathcal{P}}_{X}}, θ(t)\theta(t) is a constant function, θ(t)θ\theta(t)\equiv\theta^{*}, and hence θ(t)t=0\frac{\partial\theta(t)}{\partial t}=0. By pathwise-differentiability,

\langle{D_{{\mathbb{P}}^{*}}}(z),s(x)\rangle_{{{\mathcal{L}}^{2}_{p,0}({{\mathbb{P}}^{*}})}}=\frac{\partial\theta(t)}{\partial t}=0

for any gradient D(z){D_{{\mathbb{P}}^{*}}}(z) of θ(){\theta(\cdot)} at {{\mathbb{P}}^{*}} relative to 𝒫{{\mathcal{P}}}. Since this holds true for any s(x)𝒯𝒫X(X)s(x)\in{{\mathcal{T}}_{{\mathcal{P}}_{X}}({{\mathbb{P}}^{*}_{X}})}, we see that any gradient D(z){D_{{\mathbb{P}}^{*}}}(z) satisfies

D(z)𝒯𝒫X(X).{D_{{\mathbb{P}}^{*}}}(z)\perp{{\mathcal{T}}_{{\mathcal{P}}_{X}}({{\mathbb{P}}^{*}_{X}})}.

Consider the efficient influence function {{\varphi}^{*}_{\eta^{*}}}(z) of {\theta(\cdot)} relative to {{\mathcal{P}}} at {{\mathbb{P}}^{*}}, which is a gradient of {\theta(\cdot)} relative to {{\mathcal{P}}} at {{\mathbb{P}}^{*}} by definition. Further, by the definition of the efficient influence function, for all a\in{\mathbb{R}}^{p} it holds that

aφη(z)𝒯𝒫()=𝒯𝒫X(X)𝒯𝒫YX(YX).a^{\top}{{\varphi}^{*}_{\eta^{*}}}(z)\in{{\mathcal{T}}_{{\mathcal{P}}}({{\mathbb{P}}^{*}})}={{\mathcal{T}}_{{\mathcal{P}}_{X}}({{\mathbb{P}}^{*}_{X}})}\oplus{{\mathcal{T}}_{{\mathcal{P}}_{Y\mid X}}({{\mathbb{P}}^{*}_{Y\mid X}})}.

As φη(z)𝒯𝒫X(X){{\varphi}^{*}_{\eta^{*}}}(z)\perp{{\mathcal{T}}_{{\mathcal{P}}_{X}}({{\mathbb{P}}^{*}_{X}})}, we see that aφη(z)𝒯𝒫X(X)a^{\top}{{\varphi}^{*}_{\eta^{*}}}(z)\perp{{\mathcal{T}}_{{\mathcal{P}}_{X}}({{\mathbb{P}}^{*}_{X}})}, and hence

aφη(z)𝒯𝒫YX(YX)1,02(YX),a^{\top}{{\varphi}^{*}_{\eta^{*}}}(z)\in{{\mathcal{T}}_{{\mathcal{P}}_{Y\mid X}}({{\mathbb{P}}^{*}_{Y\mid X}})}\subset{{\mathcal{L}}^{2}_{1,0}({{\mathbb{P}}^{*}_{Y\mid X}})},

for all apa\in{\mathbb{R}}^{p}. Therefore

ϕη(x)=𝔼[φη(Z)X=x]=0{\phi^{*}_{\eta^{*}}}(x)={\mathbb{E}}[{{\varphi}^{*}_{\eta^{*}}}(Z)\mid X=x]=0

X{{\mathbb{P}}^{*}_{X}}-almost surely.

To prove the second part of the theorem, recall that φη(z){{\varphi}^{*}_{\eta^{*}}}(z) is a gradient by definition. Therefore, by Lemma E.10, the set of gradients relative to 𝒫{{\mathcal{P}}} can be expressed as

{φη(z)+h(z):ah(z)[𝒯𝒫YX(YX)𝒯𝒫X(X)],ap},\left\{{{{\varphi}^{*}_{\eta^{*}}}(z)+h(z):a^{\top}h(z)\in\left[{{\mathcal{T}}_{{\mathcal{P}}_{Y\mid X}}({{\mathbb{P}}^{*}_{Y\mid X}})}\oplus{{\mathcal{T}}_{{\mathcal{P}}_{X}}({{\mathbb{P}}^{*}_{X}})}\right]^{\perp},\forall a\in{\mathbb{R}}^{p}}\right\}, (59)

where \perp represents the orthogonal complement of a subspace. If 𝒯𝒫X(X)=1,02(X){{\mathcal{T}}_{{\mathcal{P}}_{X}}({{\mathbb{P}}^{*}_{X}})}={{\mathcal{L}}^{2}_{1,0}({{\mathbb{P}}^{*}_{X}})}, then

[𝒯𝒫YX(YX)𝒯𝒫X(X)]1,02(X)=1,02(YX).\left[{{\mathcal{T}}_{{\mathcal{P}}_{Y\mid X}}({{\mathbb{P}}^{*}_{Y\mid X}})}\oplus{{\mathcal{T}}_{{\mathcal{P}}_{X}}({{\mathbb{P}}^{*}_{X}})}\right]^{\perp}\subseteq{{\mathcal{L}}^{2}_{1,0}({{\mathbb{P}}^{*}_{X}})}^{\perp}={{\mathcal{L}}^{2}_{1,0}({{\mathbb{P}}^{*}_{Y\mid X}})}.

In proving the first part of the theorem, we have shown that φη(z)p,02(YX){{\varphi}^{*}_{\eta^{*}}}(z)\in{{\mathcal{L}}^{2}_{p,0}({{\mathbb{P}}^{*}_{Y\mid X}})} under well-specification. Therefore, for any gradient D(z){D_{{\mathbb{P}}^{*}}}(z) relative to 𝒫{{\mathcal{P}}}, by (59), we have

aD(z)=aφη(z)+ah(z)1,02(YX),a^{\top}{D_{{\mathbb{P}}^{*}}}(z)=a^{\top}{{\varphi}^{*}_{\eta^{*}}}(z)+a^{\top}h(z)\in{{\mathcal{L}}^{2}_{1,0}({{\mathbb{P}}^{*}_{Y\mid X}})},

for all apa\in{\mathbb{R}}^{p}, which implies that D(z)p,02(YX){D_{{\mathbb{P}}^{*}}}(z)\in{{\mathcal{L}}^{2}_{p,0}({{\mathbb{P}}^{*}_{Y\mid X}})}. For a regular and asymptotically linear estimator θ^n\hat{\theta}_{n} with influence function φη(z){{\varphi}_{\eta^{*}}}(z), Lemma E.9 implies that φη(z){{\varphi}_{\eta^{*}}}(z) is a gradient of θ(){\theta(\cdot)} at {{\mathbb{P}}^{*}} relative to 𝒫{{\mathcal{P}}}. Therefore φη(z)p,02(YX){{\varphi}_{\eta^{*}}}(z)\in{{\mathcal{L}}^{2}_{p,0}({{\mathbb{P}}^{*}_{Y\mid X}})}, which implies that 𝔼[φη(Z)X=x]=0{\mathbb{E}}[{{\varphi}_{\eta^{*}}}(Z)\mid X=x]=0 X{{\mathbb{P}}^{*}_{X}}-almost surely. ∎
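To make the conclusion of Theorem 3.2 concrete, consider ordinary least squares, whose standard influence function is {\mathbb{E}}[XX^{\top}]^{-1}X(Y-X^{\top}\beta^{*}); its conditional mean given X is {\mathbb{E}}[XX^{\top}]^{-1}X({\mathbb{E}}[Y\mid X]-X^{\top}\beta^{*}), which vanishes exactly when the linear model is correctly specified. The Python sketch below (a hypothetical one-dimensional design, intended only to illustrate the theorem and not as part of its proof) contrasts a well-specified and a misspecified conditional mean.

import numpy as np

rng = np.random.default_rng(2)
m = 200_000
X = rng.normal(size=m)

def avg_abs_cond_mean(mu):
    """Monte Carlo approximation of E| E[varphi(Z) | X] | for the no-intercept OLS slope."""
    Y = mu(X) + rng.normal(size=m)
    beta_star = np.sum(X * Y) / np.sum(X ** 2)            # least-squares slope
    cond = X * (mu(X) - beta_star * X) / np.mean(X ** 2)  # E[varphi(Z) | X], evaluated pointwise
    return np.mean(np.abs(cond))

print(avg_abs_cond_mean(lambda x: 2.0 * x))               # well-specified: close to 0
print(avg_abs_cond_mean(lambda x: 2.0 * x + x ** 2))      # misspecified: bounded away from 0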

Proof of Corollary 3.3.

Suppose the efficient influence function for {\theta(\cdot)} at {{\mathbb{P}}^{*}} relative to {{\mathcal{P}}} defined in (1) is {{\varphi}^{*}_{\eta^{*}}}(z). Since {\theta(\cdot)} is well-specified, Theorem 3.2 implies that the conditional efficient influence function satisfies {\phi^{*}_{\eta^{*}}}(X)=0, {{\mathbb{P}}^{*}_{X}}-almost surely. In the proof of Theorem 3.1, we showed that the efficient influence function relative to the reduced model {\{{\mathbb{P}}_{X}^{*}\}\otimes{\mathcal{P}}_{Y\mid X}} is {{\varphi}^{*}_{\eta^{*}}}(z)-{\phi^{*}_{\eta^{*}}}(x), which therefore equals {{\varphi}^{*}_{\eta^{*}}}(z). Hence the efficient influence functions, and thus the efficiency bounds, relative to {{\mathcal{P}}} and to the reduced model coincide. ∎

Proof of Proposition 3.4.

We first show that (i) and (ii) of Proposition 3.4 imply Assumption 3.1 (b). Since {\mathcal{O}} is a bounded Euclidean subset and {\varphi}_{\eta}(z) is L(z)-Lipschitz in \eta over {\mathcal{O}}, by Example 19.6 of Van der Vaart [2000], the class \left\{{{\varphi}_{\eta}(z):\eta\in{\mathcal{O}}}\right\} is {{\mathbb{P}}^{*}}-Donsker. To see that Assumption 3.1 (c) holds, note that, by the consistency of \tilde{\eta}_{n}, {{\mathbb{P}}^{*}}\left\{{\tilde{\eta}_{n}\in{\mathcal{O}}}\right\}\to 1 for any open set {\mathcal{O}} with \eta^{*}\in{\mathcal{O}}. ∎

Proof of Theorem 3.5.

Denote 𝚺=Cov[g(X)]\mathbf{\Sigma}=\mathrm{Cov}[g(X)]. First, we will show that

𝐁^ng𝐁g2=op(1),\left\lVert\hat{\mathbf{B}}_{n}^{g}-\mathbf{B}^{g}\right\rVert_{2}=o_{p}(1), (60)

where 𝐁^ng\hat{\mathbf{B}}_{n}^{g} is defined as in (11), and 𝐁g\mathbf{B}^{g} is defined as in (13). To this end, we write

𝐁^ng𝐁g2\displaystyle\left\lVert\hat{\mathbf{B}}_{n}^{g}-\mathbf{B}^{g}\right\rVert_{2} n[φη~n(g0)][φη(g0)]2𝚺12\displaystyle\leqslant\left\lVert{{\mathbb{P}}_{n}}\left[{{\varphi}_{\tilde{\eta}_{n}}}({g^{0}})^{\top}\right]-{{\mathbb{P}}^{*}}\left[{{\varphi}_{\eta^{*}}}({g^{0}})^{\top}\right]\right\rVert_{2}\left\lVert\mathbf{\Sigma}^{-1}\right\rVert_{2}
n[φη~n(g0)][φη~n(g0)]2𝚺12+\displaystyle\leqslant\left\lVert{{\mathbb{P}}_{n}}\left[{{\varphi}_{\tilde{\eta}_{n}}}({g^{0}})^{\top}\right]-{{\mathbb{P}}^{*}}\left[{{\varphi}_{\tilde{\eta}_{n}}}({g^{0}})^{\top}\right]\right\rVert_{2}\left\lVert\mathbf{\Sigma}^{-1}\right\rVert_{2}+
[φη~n(g0)][φη(g0)]2𝚺12.\displaystyle\quad\left\lVert{{\mathbb{P}}^{*}}\left[{{\varphi}_{\tilde{\eta}_{n}}}({g^{0}})^{\top}\right]-{{\mathbb{P}}^{*}}\left[{{\varphi}_{\eta^{*}}}({g^{0}})^{\top}\right]\right\rVert_{2}\left\lVert\mathbf{\Sigma}^{-1}\right\rVert_{2}.

By Assumption 3.1 and Lemma E.5, we have,

n[φη~n(g0)][φη~n(g0)]2=op(1).\left\lVert{{\mathbb{P}}_{n}}\left[{{\varphi}_{\tilde{\eta}_{n}}}({g^{0}})^{\top}\right]-{{\mathbb{P}}^{*}}\left[{{\varphi}_{\tilde{\eta}_{n}}}({g^{0}})^{\top}\right]\right\rVert_{2}=o_{p}(1). (61)

Next,

[φη~n(g0)][φη(g0)]2\displaystyle\left\lVert{{\mathbb{P}}^{*}}\left[{{\varphi}_{\tilde{\eta}_{n}}}({g^{0}})^{\top}\right]-{{\mathbb{P}}^{*}}\left[{{\varphi}_{\eta^{*}}}({g^{0}})^{\top}\right]\right\rVert_{2} =[(φη~nφη)(g0)]2\displaystyle=\left\lVert{{\mathbb{P}}^{*}}\left[({{\varphi}_{\tilde{\eta}_{n}}}-{{\varphi}_{\eta^{*}}})({g^{0}})^{\top}\right]\right\rVert_{2} (62)
(φη~nφη)(g0)2\displaystyle\leqslant{{\mathbb{P}}^{*}}\left\lVert({{\varphi}_{\tilde{\eta}_{n}}}-{{\varphi}_{\eta^{*}}})({g^{0}})^{\top}\right\rVert_{2}
φη~nφη2()g02().\displaystyle\leqslant\left\lVert{{\varphi}_{\tilde{\eta}_{n}}}-{{\varphi}_{\eta^{*}}}\right\rVert_{{\mathcal{L}}^{2}({{\mathbb{P}}^{*}})}\left\lVert{g^{0}}\right\rVert_{{\mathcal{L}}^{2}({{\mathbb{P}}^{*}})}.

By Assumption 3.1, when η~n𝒪\tilde{\eta}_{n}\in{\mathcal{O}}, we have φη~nφη2()L(z)2()ρ(η~n,η)\left\lVert{{\varphi}_{\tilde{\eta}_{n}}}-{{\varphi}_{\eta^{*}}}\right\rVert_{{\mathcal{L}}^{2}({{\mathbb{P}}^{*}})}\leqslant\left\lVert L(z)\right\rVert_{{\mathcal{L}}^{2}({{\mathbb{P}}^{*}})}\rho(\tilde{\eta}_{n},\eta^{*}), and it follows that

[φη~n(g0)][φη(g0)]2L(z)2()ρ(η~n,η)g02().\left\lVert{{\mathbb{P}}^{*}}\left[{{\varphi}_{\tilde{\eta}_{n}}}({g^{0}})^{\top}\right]-{{\mathbb{P}}^{*}}\left[{{\varphi}_{\eta^{*}}}({g^{0}})^{\top}\right]\right\rVert_{2}\leqslant\left\lVert L(z)\right\rVert_{{\mathcal{L}}^{2}({{\mathbb{P}}^{*}})}\rho(\tilde{\eta}_{n},\eta^{*})\left\lVert{g^{0}}\right\rVert_{{\mathcal{L}}^{2}({{\mathbb{P}}^{*}})}.

Since L(z)2()<\left\lVert L(z)\right\rVert_{{\mathcal{L}}^{2}({{\mathbb{P}}^{*}})}<\infty and ρ(η~n,η)=op(1)\rho(\tilde{\eta}_{n},\eta^{*})=o_{p}(1) by Assumption 3.1, and g02()<\left\lVert{g^{0}}\right\rVert_{{\mathcal{L}}^{2}({{\mathbb{P}}^{*}})}<\infty, we have that

L(z)2()ρ(η~n,η)g02()=op(1).\left\lVert L(z)\right\rVert_{{\mathcal{L}}^{2}({{\mathbb{P}}^{*}})}\rho(\tilde{\eta}_{n},\eta^{*})\left\lVert{g^{0}}\right\rVert_{{\mathcal{L}}^{2}({{\mathbb{P}}^{*}})}=o_{p}(1).

Further, by Assumption 3.1, {η~n𝒪}1{{\mathbb{P}}^{*}}\left\{{\tilde{\eta}_{n}\in{\mathcal{O}}}\right\}\to 1, and hence

[φη~n(g0)][φη(g0)]2=op(1).\left\lVert{{\mathbb{P}}^{*}}\left[{{\varphi}_{\tilde{\eta}_{n}}}({g^{0}})^{\top}\right]-{{\mathbb{P}}^{*}}\left[{{\varphi}_{\eta^{*}}}({g^{0}})^{\top}\right]\right\rVert_{2}=o_{p}(1). (63)

Combining (61) and (63), we have \left\lVert\hat{\mathbf{B}}_{n}^{g}-\mathbf{B}^{g}\right\rVert_{2}=o_{p}(1), which establishes (60).

Since nn(g0)=Op(1)\sqrt{n}{{\mathbb{P}}_{n}}\left({g^{0}}\right)=O_{p}(1), the fact that 𝐁^ng𝐁g2=op(1)\left\lVert\hat{\mathbf{B}}_{n}^{g}-\mathbf{B}^{g}\right\rVert_{2}=o_{p}(1) implies,

n(θ^n,Xsafeθ)\displaystyle\sqrt{n}\left(\hat{\theta}_{n,{{\mathbb{P}}^{*}_{X}}}^{\text{safe}}-\theta^{*}\right) =n(θ^nθ)𝐁^ngnn(g0)\displaystyle=\sqrt{n}\left(\hat{\theta}_{n}-\theta^{*}\right)-\hat{\mathbf{B}}_{n}^{g}\sqrt{n}{{\mathbb{P}}_{n}}\left({g^{0}}\right)
=nn(φη)𝐁^ngnn(g0)+op(1)\displaystyle=\sqrt{n}{{\mathbb{P}}_{n}}({{\varphi}_{\eta^{*}}})-\hat{\mathbf{B}}_{n}^{g}\sqrt{n}{{\mathbb{P}}_{n}}\left({g^{0}}\right)+o_{p}(1)
\displaystyle=\sqrt{n}{{\mathbb{P}}_{n}}({{\varphi}_{\eta^{*}}})-\mathbf{B}^{g}\sqrt{n}{{\mathbb{P}}_{n}}\left({g^{0}}\right)+\left(\mathbf{B}^{g}-\hat{\mathbf{B}}_{n}^{g}\right)\sqrt{n}{{\mathbb{P}}_{n}}\left({g^{0}}\right)+o_{p}(1)
=nn(φη)𝐁gnn(g0)+op(1)\displaystyle=\sqrt{n}{{\mathbb{P}}_{n}}({{\varphi}_{\eta^{*}}})-\mathbf{B}^{g}\sqrt{n}{{\mathbb{P}}_{n}}\left({g^{0}}\right)+o_{p}(1)
=nn(φη𝐁gg0)+op(1).\displaystyle=\sqrt{n}{{\mathbb{P}}_{n}}\left({{\varphi}_{\eta^{*}}}-\mathbf{B}^{g}{g^{0}}\right)+o_{p}(1).

Since 𝐁gg0(x)\mathbf{B}^{g}{g^{0}}(x) is the projection of ϕη{\phi_{\eta^{*}}} onto the linear subspace spanned by g0(x){g^{0}}(x), by Lemma E.1 and E.2 we have the decomposition

Var[φη(Z)𝐁gg0(X)]\displaystyle\mathrm{Var}\left[{{\varphi}_{\eta^{*}}}(Z)-\mathbf{B}^{g}{g^{0}}(X)\right]
=Var{φη(Z)ϕη(X)+ϕη(X)𝐁gg0(X)}\displaystyle=\mathrm{Var}\left\{{{{\varphi}_{\eta^{*}}}(Z)-{\phi_{\eta^{*}}}(X)+{\phi_{\eta^{*}}}(X)-\mathbf{B}^{g}{g^{0}}(X)}\right\}
=Var[φη(Z)ϕη(X)]+Var{ϕη(X)𝐁gg0(X)}\displaystyle=\mathrm{Var}[{{\varphi}_{\eta^{*}}}(Z)-{\phi_{\eta^{*}}}(X)]+\mathrm{Var}\left\{{{\phi_{\eta^{*}}}(X)-\mathbf{B}^{g}{g^{0}}(X)}\right\}
=Var[φη(Z)]Var{ϕη(X)}+Var{ϕη(X)}Var[𝐁gg0(X)]\displaystyle=\mathrm{Var}[{{\varphi}_{\eta^{*}}}(Z)]-\mathrm{Var}\left\{{{\phi_{\eta^{*}}}(X)}\right\}+\mathrm{Var}\left\{{{\phi_{\eta^{*}}}(X)}\right\}-\mathrm{Var}[\mathbf{B}^{g}{g^{0}}(X)]
\displaystyle=\mathrm{Var}[{{\varphi}_{\eta^{*}}}(Z)]-\mathrm{Var}[\mathbf{B}^{g}{g^{0}}(X)],

which establishes the claimed asymptotic variance and completes the proof of Theorem 3.5. ∎
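For intuition about the result just proved, the following Python sketch (a hedged illustration under a hypothetical Gaussian design, with mean estimation, g(X)=X, and the marginal of X treated as known) implements the correction \hat{\theta}_{n}-\hat{\mathbf{B}}_{n}^{g}{{\mathbb{P}}_{n}}({g^{0}}) and compares its Monte Carlo variance with that of the initial supervised estimator.

import numpy as np

rng = np.random.default_rng(3)

def safe_estimator(X, Y, g_mean=0.0, g_var=1.0):
    """Safe estimator of E[Y] with g(X) = X and a known marginal for X (so E[g], Var[g] are known)."""
    theta_hat = Y.mean()                        # initial supervised estimator
    varphi_hat = Y - theta_hat                  # estimated influence function
    g0 = X - g_mean                             # g^0 = g - P*(g), centered with the known marginal
    B_hat = np.mean(varphi_hat * g0) / g_var    # sample analogue of B^g
    return theta_hat - B_hat * g0.mean()        # theta_hat - B_hat * P_n(g^0)

reps, n = 2000, 400
sup, safe = [], []
for _ in range(reps):
    X = rng.normal(size=n)
    Y = 1.0 + 2.0 * X + rng.normal(size=n)      # hypothetical outcome model
    sup.append(Y.mean())
    safe.append(safe_estimator(X, Y))

print(np.var(sup), np.var(safe))                # the safe estimator has smaller Monte Carlo variance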

Proof of Theorem 3.6.

It suffices to prove the case of p=1, i.e., the case of a single parameter. For p>1, the analysis follows by separately analyzing each of the p components. This does not affect the convergence rate, since we treat p as fixed (rather than increasing with the sample size n).

When p=1p=1, by Assumption 3.2, ϕη𝒞M1α1(𝒳){\phi_{\eta^{*}}}\in{\mathcal{C}}^{\alpha_{1}}_{M_{1}}({\mathcal{X}}), where 𝒞M1α1(𝒳){\mathcal{C}}^{\alpha_{1}}_{M_{1}}({\mathcal{X}}) is the Hölder class (15) with parameter α1\alpha_{1} and M1M_{1}. By Theorem 2.7.1 of van der Vaart and Wellner [2013], 𝒞M1α1(𝒳){\mathcal{C}}^{\alpha_{1}}_{M_{1}}({\mathcal{X}}) is X{{\mathbb{P}}^{*}_{X}}-Donsker when α1>dim(𝒳)\alpha_{1}>{\text{dim}}({\mathcal{X}}). Let {gk(x)}k=1\left\{{g_{k}(x)}\right\}_{k=1}^{\infty} be a basis of 𝒞M1α1(𝒳){\mathcal{C}}^{\alpha_{1}}_{M_{1}}({\mathcal{X}}) that satisfies (16) for all f𝒞M1α1(𝒳)f\in{\mathcal{C}}^{\alpha_{1}}_{M_{1}}({\mathcal{X}}). Since GKn(x)𝒞M1α1(𝒳){G_{K_{n}}}(x)\in{\mathcal{C}}^{\alpha_{1}}_{M_{1}}({\mathcal{X}}), clearly we also have ϕ^n=𝐁^KnGKn0(x)𝒞M1α1(𝒳){\hat{\phi}_{n}}=\hat{\mathbf{B}}_{K_{n}}{G_{K_{n}}^{0}}(x)\in{\mathcal{C}}^{\alpha_{1}}_{M_{1}}({\mathcal{X}}) as this is only a linear transformation of GKn(x){G_{K_{n}}}(x). If we can show that

ϕηϕ^n1,02(X)=op(1),\left\lVert{\phi_{\eta^{*}}}-{\hat{\phi}_{n}}\right\rVert_{{\mathcal{L}}^{2}_{1,0}({{\mathbb{P}}^{*}_{X}})}=o_{p}(1),

then by Lemma E.4 and the fact that (ϕη)=(ϕ^n)=0{{\mathbb{P}}^{*}}({\phi_{\eta^{*}}})={{\mathbb{P}}^{*}}({\hat{\phi}_{n}})=0,

𝔾n(ϕηϕ^n)=nn(ϕηϕ^n)=op(1).{\mathbb{G}}_{n}({\phi_{\eta^{*}}}-{\hat{\phi}_{n}})=\sqrt{n}{{\mathbb{P}}_{n}}({\phi_{\eta^{*}}}-{\hat{\phi}_{n}})=o_{p}(1).

Then

n(θ^n,Xeff.θ)\displaystyle\sqrt{n}\left(\hat{\theta}_{n,{{\mathbb{P}}^{*}_{X}}}^{\text{eff.}}-\theta^{*}\right) =nn(φηϕ^n)+op(1)\displaystyle=\sqrt{n}{{\mathbb{P}}_{n}}({{\varphi}_{\eta^{*}}}-{\hat{\phi}_{n}})+o_{p}(1)
=nn(φηϕη)+op(1),\displaystyle=\sqrt{n}{{\mathbb{P}}_{n}}({{\varphi}_{\eta^{*}}}-{\phi_{\eta^{*}}})+o_{p}(1),

and the results of Theorem 3.6 follow.

It remains to show that

ϕηϕ^n1,02(X)=op(1).\left\lVert{\phi_{\eta^{*}}}-{\hat{\phi}_{n}}\right\rVert_{{\mathcal{L}}^{2}_{1,0}({{\mathbb{P}}^{*}_{X}})}=o_{p}(1).

Denoting 𝚺=Var[GKn0(X)]0\mathbf{\Sigma}=\mathrm{Var}\left[{G_{K_{n}}^{0}}(X)\right]\succ 0 and 𝐁Kn=[φη(GKn0)]𝚺1\mathbf{B}_{K_{n}}={{\mathbb{P}}^{*}}\left[{{\varphi}_{\eta^{*}}}({G_{K_{n}}^{0}})^{\top}\right]\mathbf{\Sigma}^{-1}, we have:

ϕηϕ^n1,02(X)\displaystyle\left\lVert{\phi_{\eta^{*}}}-{\hat{\phi}_{n}}\right\rVert_{{\mathcal{L}}^{2}_{1,0}({{\mathbb{P}}^{*}_{X}})} =ϕη𝐁^KnGKn01,02(X)\displaystyle=\left\lVert{\phi_{\eta^{*}}}-\hat{\mathbf{B}}_{K_{n}}{G_{K_{n}}^{0}}\right\rVert_{{\mathcal{L}}^{2}_{1,0}({{\mathbb{P}}^{*}_{X}})}
=ϕη𝐁KnGKn0+𝐁KnGKn0𝐁^KnGKn01,02(X)\displaystyle=\left\lVert{\phi_{\eta^{*}}}-\mathbf{B}_{K_{n}}{G_{K_{n}}^{0}}+\mathbf{B}_{K_{n}}{G_{K_{n}}^{0}}-\hat{\mathbf{B}}_{K_{n}}{G_{K_{n}}^{0}}\right\rVert_{{\mathcal{L}}^{2}_{1,0}({{\mathbb{P}}^{*}_{X}})}
𝐁^KnGKn0𝐁KnGKn01,02(X)I+ϕη𝐁KnGKn01,02(X)II\displaystyle\leqslant\underbrace{\left\lVert\hat{\mathbf{B}}_{K_{n}}{G_{K_{n}}^{0}}-\mathbf{B}_{K_{n}}{G_{K_{n}}^{0}}\right\rVert_{{\mathcal{L}}^{2}_{1,0}({{\mathbb{P}}^{*}_{X}})}}_{\text{I}}+\underbrace{\left\lVert{\phi_{\eta^{*}}}-\mathbf{B}_{K_{n}}{G_{K_{n}}^{0}}\right\rVert_{{\mathcal{L}}^{2}_{1,0}({{\mathbb{P}}^{*}_{X}})}}_{\text{II}}

where \hat{\mathbf{B}}_{K_{n}} is defined in (18). We first consider the term I. Using the fact that

𝔼[𝚺12GKn0(X)GKn0(X)𝚺12]=𝐈Kn,{\mathbb{E}}\left[\mathbf{\Sigma}^{-\frac{1}{2}}{G_{K_{n}}^{0}}(X){G_{K_{n}}^{0}}(X)^{\top}\mathbf{\Sigma}^{-\frac{1}{2}}\right]=\mathbf{I}_{K_{n}}, (64)

we have:

I2\displaystyle\text{I}^{2} =(𝐁^Kn𝐁Kn)GKn0(x)1,02(X)2\displaystyle=\left\lVert\left(\hat{\mathbf{B}}_{K_{n}}-\mathbf{B}_{K_{n}}\right){G_{K_{n}}^{0}}(x)\right\rVert^{2}_{{\mathcal{L}}^{2}_{1,0}({{\mathbb{P}}^{*}_{X}})}
=𝔼[(GKn0(X))(𝐁^Kn𝐁Kn)(𝐁^Kn𝐁Kn)GKn0(X)]\displaystyle={\mathbb{E}}\left[({G_{K_{n}}^{0}}(X))^{\top}\left(\hat{\mathbf{B}}_{K_{n}}-\mathbf{B}_{K_{n}}\right)^{\top}\left(\hat{\mathbf{B}}_{K_{n}}-\mathbf{B}_{K_{n}}\right){G_{K_{n}}^{0}}(X)\right]
\displaystyle={\mathbb{E}}\left\{\left[\mathbf{\Sigma}^{-\frac{1}{2}}{G_{K_{n}}^{0}}(X)\right]^{\top}{{\mathbb{P}}_{n}}\left[({{\varphi}_{\eta^{*}}}-{{\varphi}_{\tilde{\eta}_{n}}})\mathbf{\Sigma}^{-\frac{1}{2}}{G_{K_{n}}^{0}}\right]{{\mathbb{P}}_{n}}\left[({{\varphi}_{\eta^{*}}}-{{\varphi}_{\tilde{\eta}_{n}}})\mathbf{\Sigma}^{-\frac{1}{2}}{G_{K_{n}}^{0}}\right]^{\top}\mathbf{\Sigma}^{-\frac{1}{2}}{G_{K_{n}}^{0}}(X)\right\}
\displaystyle={{\mathbb{P}}_{n}}\left[({{\varphi}_{\eta^{*}}}-{{\varphi}_{\tilde{\eta}_{n}}})\mathbf{\Sigma}^{-\frac{1}{2}}{G_{K_{n}}^{0}}\right]^{\top}{\mathbb{E}}\left[\mathbf{\Sigma}^{-\frac{1}{2}}{G_{K_{n}}^{0}}(X){G_{K_{n}}^{0}}(X)^{\top}\mathbf{\Sigma}^{-\frac{1}{2}}\right]{{\mathbb{P}}_{n}}\left[({{\varphi}_{\eta^{*}}}-{{\varphi}_{\tilde{\eta}_{n}}})\mathbf{\Sigma}^{-\frac{1}{2}}{G_{K_{n}}^{0}}\right]
=n[𝚺12GKn0(φηφη~n)]2.\displaystyle=\left\lVert{{\mathbb{P}}_{n}}\left[\mathbf{\Sigma}^{-\frac{1}{2}}{G_{K_{n}}^{0}}({{\varphi}_{\eta^{*}}}-{{\varphi}_{\tilde{\eta}_{n}}})\right]\right\rVert^{2}.

In light of (64), we may assume without loss of generality that {G_{K_{n}}^{0}} has identity covariance, i.e., \mathbf{\Sigma}=\mathbf{I}_{K_{n}} (otherwise, replace {G_{K_{n}}^{0}} by \mathbf{\Sigma}^{-\frac{1}{2}}{G_{K_{n}}^{0}}). For I, when \tilde{\eta}_{n}\in{\mathcal{O}}, it follows that

I2\displaystyle\text{I}^{2} =n[GKn0(φηφη~n)]2\displaystyle=\left\lVert{{\mathbb{P}}_{n}}\left[{G_{K_{n}}^{0}}({{\varphi}_{\eta^{*}}}-{{\varphi}_{\tilde{\eta}_{n}}})\right]\right\rVert^{2}
n[LGKn0]2ρ2(η~n,η)\displaystyle\leqslant\left\lVert{{\mathbb{P}}_{n}}\left[L{G_{K_{n}}^{0}}\right]\right\rVert^{2}\rho^{2}(\tilde{\eta}_{n},\eta^{*})
n(L2)n[(GKn0)GKn0]ρ2(η~n,η)\displaystyle\leqslant{{\mathbb{P}}_{n}}(L^{2}){{\mathbb{P}}_{n}}\left[({G_{K_{n}}^{0}})^{\top}{G_{K_{n}}^{0}}\right]\rho^{2}(\tilde{\eta}_{n},\eta^{*})

Since {η~n𝒪}1{{\mathbb{P}}^{*}}\left\{{\tilde{\eta}_{n}\in{\mathcal{O}}}\right\}\to 1 by Assumption 3.1, we have

I2n(L2)n[(GKn0)GKn0]ρ2(η~n,η)+op(1).\text{I}^{2}\leqslant{{\mathbb{P}}_{n}}(L^{2}){{\mathbb{P}}_{n}}\left[({G_{K_{n}}^{0}})^{\top}{G_{K_{n}}^{0}}\right]\rho^{2}(\tilde{\eta}_{n},\eta^{*})+o_{p}(1).

Now, we consider the term n[(GKn0)GKn0]{{\mathbb{P}}_{n}}\left[({G_{K_{n}}^{0}})^{\top}{G_{K_{n}}^{0}}\right]. By Theorem 12.16.1 of Hansen [2022],

n[GKn0(GKn0)]𝐈KnF=op(1).\left\lVert{{\mathbb{P}}_{n}}\left[{G_{K_{n}}^{0}}({G_{K_{n}}^{0}})^{\top}\right]-\mathbf{I}_{K_{n}}\right\rVert_{F}=o_{p}(1).

Therefore,

n[(GKn0)GKn0]\displaystyle{{\mathbb{P}}_{n}}\left[({G_{K_{n}}^{0}})^{\top}{G_{K_{n}}^{0}}\right] =Kn+tr(n[GKn0(GKn0)]𝐈Kn)\displaystyle=K_{n}+{\text{tr}}\left({{\mathbb{P}}_{n}}\left[{G_{K_{n}}^{0}}({G_{K_{n}}^{0}})^{\top}\right]-\mathbf{I}_{K_{n}}\right)
=Kn+op(1)\displaystyle=K_{n}+o_{p}(1)
=Op(Kn),\displaystyle=O_{p}(K_{n}),

where we used continuous mapping for the tr(){\text{tr}}(\cdot) function. Since L(z)L(z) is square integrable, we then have

n(L2)n[(GKn0)GKn0]ρ2(η~n,η)=Op(ρ2(η~n,η)Kn),{{\mathbb{P}}_{n}}(L^{2}){{\mathbb{P}}_{n}}\left[({G_{K_{n}}^{0}})^{\top}{G_{K_{n}}^{0}}\right]\rho^{2}(\tilde{\eta}_{n},\eta^{*})=O_{p}(\rho^{2}(\tilde{\eta}_{n},\eta^{*})K_{n}),

which implies that

I=Op(Knρ(η~n,η))=op(1).\text{I}=O_{p}\left(\sqrt{K_{n}}\rho(\tilde{\eta}_{n},\eta^{*})\right)=o_{p}(1).

For II, by the bias-variance decomposition,

ϕη𝐁KnGKn(x)1,02(X)2=𝔼[𝐁KnGKn(x)]2+ϕη𝐁KnGKn0(x)1,02(X)2,\left\lVert{\phi_{\eta^{*}}}-\mathbf{B}_{K_{n}}{G_{K_{n}}}(x)\right\rVert^{2}_{{\mathcal{L}}^{2}_{1,0}({{\mathbb{P}}^{*}_{X}})}={\mathbb{E}}[\mathbf{B}_{K_{n}}{G_{K_{n}}}(x)]^{2}+\left\lVert{\phi_{\eta^{*}}}-\mathbf{B}_{K_{n}}{G_{K_{n}}^{0}}(x)\right\rVert^{2}_{{\mathcal{L}}^{2}_{1,0}({{\mathbb{P}}^{*}_{X}})},

where we used the fact that 𝔼[ϕη(X)]=𝔼[GKn0(X)]=0{\mathbb{E}}[{\phi_{\eta^{*}}}(X)]={\mathbb{E}}[{G_{K_{n}}^{0}}(X)]=0. Therefore,

II2\displaystyle\text{II}^{2} ϕη𝐁KnGKn(x)1,02(X)2\displaystyle\leqslant\left\lVert{\phi_{\eta^{*}}}-\mathbf{B}_{K_{n}}{G_{K_{n}}}(x)\right\rVert^{2}_{{\mathcal{L}}^{2}_{1,0}({{\mathbb{P}}^{*}_{X}})} (65)
=O(Kn2α1dim(𝒳)),\displaystyle=O\left(K_{n}^{-\frac{2\alpha_{1}}{{\text{dim}}({\mathcal{X}})}}\right),

by (16). This shows

II=O(Knα1dim(𝒳))=op(1).\text{II}=O\left(K_{n}^{-\frac{\alpha_{1}}{{\text{dim}}({\mathcal{X}})}}\right)=o_{p}(1).

Combining the results of I and II, we have shown that ϕηϕ^n1,02(X)=op(1)\left\lVert{\phi_{\eta^{*}}}-{\hat{\phi}_{n}}\right\rVert_{{\mathcal{L}}^{2}_{1,0}({{\mathbb{P}}^{*}_{X}})}=o_{p}(1), which finishes the proof. ∎
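The key step in the argument above is that the conditional EIF can be approximated by a linear combination of the first K_{n} centered basis functions, with coefficients estimated from the labeled data. A minimal Python sketch of this series step (with a hypothetical polynomial basis playing the role of {G_{K_{n}}}, mean estimation as the target, and a least-squares surrogate for the coefficient estimator) is given below; over repeated samples, the corrected estimator exhibits smaller variance than the supervised one, in line with Theorem 3.6.

import numpy as np

rng = np.random.default_rng(4)
n, K_n = 2000, 6
X = rng.uniform(-1.0, 1.0, size=n)
Y = np.sin(np.pi * X) + 0.3 * rng.normal(size=n)        # hypothetical model; theta* = E[Y] = 0

theta_hat = Y.mean()
varphi_hat = Y - theta_hat                              # estimated influence function
G = np.vander(X, K_n + 1, increasing=True)[:, 1:]       # basis g_k(x) = x^k, k = 1, ..., K_n
EG = np.array([0.0 if k % 2 else 1.0 / (k + 1) for k in range(1, K_n + 1)])  # known E[g_k(X)]
G0 = G - EG                                             # centered basis, the analogue of G^0_{K_n}

B_hat, *_ = np.linalg.lstsq(G0, varphi_hat, rcond=None) # least-squares coefficients on G^0
phi_hat = G0 @ B_hat                                    # series estimate of phi_{eta*}(X_i)

theta_eff = theta_hat - phi_hat.mean()                  # one-step correction by P_n(phi_hat)
print(theta_hat, theta_eff)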

Proof of Theorem 4.1.

The proof of this theorem utilizes notation and results from Section E.3, as well as Lemma E.13.

By Lemma E.14, pathwise differentiability at {{\mathbb{P}}^{*}} implies pathwise differentiability at {{\mathbb{Q}}({{\mathbb{P}}^{*}})}. Since the efficient influence function at {{\mathbb{P}}^{*}} relative to {{\mathcal{P}}} is {{\varphi}^{*}_{\eta^{*}}}(z), {{\varphi}^{*}_{\eta^{*}}}(z) is a gradient at {{\mathbb{P}}^{*}} relative to {{\mathcal{P}}}, and by Lemma E.14, the function

φη(z1)ϕη(x1)+ϕη(x2){{\varphi}^{*}_{\eta^{*}}}(z_{1})-{\phi^{*}_{\eta^{*}}}(x_{1})+{\phi^{*}_{\eta^{*}}}(x_{2})

is a gradient at (){{\mathbb{Q}}({{\mathbb{P}}^{*}})} relative to 𝒫{{\mathcal{P}}}.

We first consider the case p=1 and derive the \left\lVert\cdot\right\rVert_{\mathcal{H}}-projection of {{\varphi}^{*}_{\eta^{*}}}(z_{1})-{\phi^{*}_{\eta^{*}}}(x_{1})+{\phi^{*}_{\eta^{*}}}(x_{2}) onto {{\mathcal{T}}_{{\mathcal{P}}}({{\mathbb{Q}}({{\mathbb{P}}^{*}})})}, where the Hilbert space {\mathcal{H}} and its norm \left\lVert\cdot\right\rVert_{\mathcal{H}} are defined in Section E.3, and the form of {{\mathcal{T}}_{{\mathcal{P}}}({{\mathbb{Q}}({{\mathbb{P}}^{*}})})} is characterized by Lemma E.13. Recall that {\phi^{*}_{\eta^{*}}}(x) is the conditional expectation of {{\varphi}^{*}_{\eta^{*}}}(z) given X=x. The projection problem can be expressed as

minf𝒯𝒫X(X),g𝒯𝒫YX(YX)φη(z1)ϕη(x1)f(x1)g(z1)+ϕη(x2)γ1γf(x2)2\displaystyle\min_{\begin{subarray}{c}f\in{{\mathcal{T}}_{{\mathcal{P}}_{X}}({{\mathbb{P}}^{*}_{X}})},\\ g\in{{\mathcal{T}}_{{\mathcal{P}}_{Y\mid X}}({{\mathbb{P}}^{*}_{Y\mid X}})}\end{subarray}}\left\lVert{{\varphi}^{*}_{\eta^{*}}}(z_{1})-{\phi^{*}_{\eta^{*}}}(x_{1})-f(x_{1})-g(z_{1})+{\phi^{*}_{\eta^{*}}}(x_{2})-\frac{\gamma}{1-\gamma}f(x_{2})\right\rVert^{2}_{\mathcal{H}} (66)
=minf𝒯𝒫X(X),g𝒯𝒫YX(YX)(φη(z)ϕη(x)f(x)g(z)1,02()2+γ1γϕη(x)γ1γf(x)1,02(X)2)\displaystyle=\min_{\begin{subarray}{c}f\in{{\mathcal{T}}_{{\mathcal{P}}_{X}}({{\mathbb{P}}^{*}_{X}})},\\ g\in{{\mathcal{T}}_{{\mathcal{P}}_{Y\mid X}}({{\mathbb{P}}^{*}_{Y\mid X}})}\end{subarray}}\left(\left\lVert{{\varphi}^{*}_{\eta^{*}}}(z)-{\phi^{*}_{\eta^{*}}}(x)-f(x)-g(z)\right\rVert^{2}_{{\mathcal{L}}^{2}_{1,0}({{\mathbb{P}}^{*}})}+\frac{\gamma}{1-\gamma}\left\lVert{\phi^{*}_{\eta^{*}}}(x)-\frac{\gamma}{1-\gamma}f(x)\right\rVert^{2}_{{\mathcal{L}}^{2}_{1,0}({{\mathbb{P}}^{*}_{X}})}\right)
=ming𝒯𝒫YX(YX)φη(z)ϕη(x)g(z)1,02(YX)2\displaystyle=\min_{g\in{{\mathcal{T}}_{{\mathcal{P}}_{Y\mid X}}({{\mathbb{P}}^{*}_{Y\mid X}})}}\left\lVert{{\varphi}^{*}_{\eta^{*}}}(z)-{\phi^{*}_{\eta^{*}}}(x)-g(z)\right\rVert^{2}_{{{\mathcal{L}}^{2}_{1,0}({{\mathbb{P}}^{*}_{Y\mid X}})}}
+minf𝒯𝒫X(X){f(x)1,02(X)2+γ1γf(x)1γγϕη(x)1,02(X)2},\displaystyle\quad+\min_{f\in{{\mathcal{T}}_{{\mathcal{P}}_{X}}({{\mathbb{P}}^{*}_{X}})}}\left\{{\left\lVert f(x)\right\rVert^{2}_{{{\mathcal{L}}^{2}_{1,0}({{\mathbb{P}}^{*}_{X}})}}+\frac{\gamma}{1-\gamma}\left\lVert f(x)-\frac{1-\gamma}{\gamma}{\phi^{*}_{\eta^{*}}}(x)\right\rVert^{2}_{{\mathcal{L}}^{2}_{1,0}({{\mathbb{P}}^{*}_{X}})}}\right\},

where we used the definition of \left\lVert\cdot\right\rVert_{\mathcal{H}} and the fact that {{\mathcal{L}}^{2}_{1,0}({{\mathbb{P}}^{*}})}={{\mathcal{L}}^{2}_{1,0}({{\mathbb{P}}^{*}_{X}})}\oplus{{\mathcal{L}}^{2}_{1,0}({{\mathbb{P}}^{*}_{Y\mid X}})}. Since {{\varphi}^{*}_{\eta^{*}}}(z) is the efficient influence function at {{\mathbb{P}}^{*}} and p=1, we have {{\varphi}^{*}_{\eta^{*}}}(z)-{\phi^{*}_{\eta^{*}}}(x)\in{{\mathcal{T}}_{{\mathcal{P}}_{Y\mid X}}({{\mathbb{P}}^{*}_{Y\mid X}})}, as shown in the proof of Theorem 3.1. Therefore, the first term in (66) is minimized by g(z)={{\varphi}^{*}_{\eta^{*}}}(z)-{\phi^{*}_{\eta^{*}}}(x). Similarly, we have {\phi^{*}_{\eta^{*}}}(x)\in{{\mathcal{T}}_{{\mathcal{P}}_{X}}({{\mathbb{P}}^{*}_{X}})}. It can then be verified directly that f(x)=(1-\gamma){\phi^{*}_{\eta^{*}}}(x) minimizes the second term in (66). Therefore, the \left\lVert\cdot\right\rVert_{\mathcal{H}}-projection of {{\varphi}^{*}_{\eta^{*}}}(z_{1})-{\phi^{*}_{\eta^{*}}}(x_{1})+{\phi^{*}_{\eta^{*}}}(x_{2}) onto {{\mathcal{T}}_{{\mathcal{P}}}({{\mathbb{Q}}({{\mathbb{P}}^{*}})})} is

φη(z1,x2)\displaystyle{{\varphi}^{*}_{\eta^{*}}}(z_{1},x_{2}) =φη(z1)ϕη(x1)+(1γ)ϕη(x1)+γϕη(x2)\displaystyle={{\varphi}^{*}_{\eta^{*}}}(z_{1})-{\phi^{*}_{\eta^{*}}}(x_{1})+(1-\gamma){\phi^{*}_{\eta^{*}}}(x_{1})+\gamma{\phi^{*}_{\eta^{*}}}(x_{2})
=φη(z1)γϕη(x1)+γϕη(x2).\displaystyle={{\varphi}^{*}_{\eta^{*}}}(z_{1})-\gamma{\phi^{*}_{\eta^{*}}}(x_{1})+\gamma{\phi^{*}_{\eta^{*}}}(x_{2}).

We now consider the general case of pp and validate that φη(z1,x2){{\varphi}^{*}_{\eta^{*}}}(z_{1},x_{2}) is indeed the efficient influence function by definition. Clearly, since 𝒯𝒫(()){{\mathcal{T}}_{{\mathcal{P}}}({{\mathbb{Q}}({{\mathbb{P}}^{*}})})} is a linear space, aφη(z1,x2)𝒯𝒫(())a^{\top}{{\varphi}^{*}_{\eta^{*}}}(z_{1},x_{2})\in{{\mathcal{T}}_{{\mathcal{P}}}({{\mathbb{Q}}({{\mathbb{P}}^{*}})})} for all apa\in{\mathbb{R}}^{p}. Denote D()(z1,x2)=φη(z1)ϕη(x1)+ϕη(x2)D_{{\mathbb{Q}}({{\mathbb{P}}^{*}})}(z_{1},x_{2})={{\varphi}^{*}_{\eta^{*}}}(z_{1})-{\phi^{*}_{\eta^{*}}}(x_{1})+{\phi^{*}_{\eta^{*}}}(x_{2}). For any element f(x1)+g(z1)+γ1γf(x2)𝒯𝒫(())f(x_{1})+g(z_{1})+\frac{\gamma}{1-\gamma}f(x_{2})\in{{\mathcal{T}}_{{\mathcal{P}}}({{\mathbb{Q}}({{\mathbb{P}}^{*}})})}, it can be straightforwardly validated by the definition of ,\left\langle\cdot,\cdot\right\rangle_{\mathcal{H}} that

φη(z1,x2)D()(z1,x2),f(x1)+g(z1)+γ1γf(x2)=0.\displaystyle\langle{{\varphi}^{*}_{\eta^{*}}}(z_{1},x_{2})-D_{{\mathbb{Q}}({{\mathbb{P}}^{*}})}(z_{1},x_{2}),f(x_{1})+g(z_{1})+\frac{\gamma}{1-\gamma}f(x_{2})\rangle_{\mathcal{H}}=0.

Consider any one-dimensional regular parametric sub-model {\mathcal{P}}_{T}=\left\{{{\mathbb{P}}_{t}:t\in T}\right\} of {{\mathcal{P}}} such that {\mathbb{P}}_{t^{*}}={{\mathbb{P}}^{*}}, and let l_{t^{*}}(z_{1},x_{2}) denote its score function at {{\mathbb{Q}}({{\mathbb{P}}^{*}})}. Then, l_{t^{*}}(z_{1},x_{2})\in{{\mathcal{T}}_{{\mathcal{P}}}({{\mathbb{Q}}({{\mathbb{P}}^{*}})})}. Write \theta:{\mathcal{P}}_{T}\to\Theta as a function of t\in T. By pathwise differentiability and the fact that D_{{\mathbb{Q}}({{\mathbb{P}}^{*}})}(z_{1},x_{2}) is a gradient at {{\mathbb{Q}}({{\mathbb{P}}^{*}})} relative to {{\mathcal{P}}},

dθ(t)dt\displaystyle\frac{d\theta(t^{*})}{dt} =lt(z1,x2),D()(z1,x2)\displaystyle=\langle l_{t^{*}}(z_{1},x_{2}),D_{{\mathbb{Q}}({{\mathbb{P}}^{*}})}(z_{1},x_{2})\rangle_{\mathcal{H}}
=lt(z1,x2),D()(z1,x2)φη(z1,x2)+φη(z1,x2)\displaystyle=\langle l_{t^{*}}(z_{1},x_{2}),D_{{\mathbb{Q}}({{\mathbb{P}}^{*}})}(z_{1},x_{2})-{{\varphi}^{*}_{\eta^{*}}}(z_{1},x_{2})+{{\varphi}^{*}_{\eta^{*}}}(z_{1},x_{2})\rangle_{\mathcal{H}}
=lt(z1,x2),φη(z1,x2)+lt(z1,x2),D()(z1,x2)φη(z1,x2)\displaystyle=\langle l_{t^{*}}(z_{1},x_{2}),{{\varphi}^{*}_{\eta^{*}}}(z_{1},x_{2})\rangle_{\mathcal{H}}+\langle l_{t^{*}}(z_{1},x_{2}),D_{{\mathbb{Q}}({{\mathbb{P}}^{*}})}(z_{1},x_{2})-{{\varphi}^{*}_{\eta^{*}}}(z_{1},x_{2})\rangle_{\mathcal{H}}
=lt(z1,x2),φη(z1,x2),\displaystyle=\langle l_{t^{*}}(z_{1},x_{2}),{{\varphi}^{*}_{\eta^{*}}}(z_{1},x_{2})\rangle_{\mathcal{H}},

which shows that φη(z1,x2){{\varphi}^{*}_{\eta^{*}}}(z_{1},x_{2}) is a gradient. Since further aφη(z1,x2)𝒯𝒫(())a^{\top}{{\varphi}^{*}_{\eta^{*}}}(z_{1},x_{2})\in{{\mathcal{T}}_{{\mathcal{P}}}({{\mathbb{Q}}({{\mathbb{P}}^{*}})})} for all apa\in{\mathbb{R}}^{p}, by definition φη(z1,x2){{\varphi}^{*}_{\eta^{*}}}(z_{1},x_{2}) is the efficient influence function at (){{\mathbb{Q}}({{\mathbb{P}}^{*}})} relative to 𝒫{{\mathcal{P}}}.

By Lemma E.16, the semiparametric efficiency lower bound is the outer product of the efficient influence function with itself in ,\langle\cdot,\cdot\rangle_{\mathcal{H}}. By the definition of ,\langle\cdot,\cdot\rangle_{\mathcal{H}}, this can be written as

φη(z)γϕη(x),φη(z)γϕη(x)p,02()\displaystyle\langle{{\varphi}^{*}_{\eta^{*}}}(z)-\gamma{\phi^{*}_{\eta^{*}}}(x),{{\varphi}^{*}_{\eta^{*}}}(z)^{\top}-\gamma{\phi^{*}_{\eta^{*}}}(x)^{\top}\rangle_{{\mathcal{L}}^{2}_{p,0}({{\mathbb{P}}^{*}})}
+γ1γ(1γ)ϕη(x),(1γ)ϕη(x)p,02(X)\displaystyle\quad+\frac{\gamma}{1-\gamma}\langle(1-\gamma){\phi^{*}_{\eta^{*}}}(x),(1-\gamma){\phi^{*}_{\eta^{*}}}(x)^{\top}\rangle_{{\mathcal{L}}^{2}_{p,0}({{\mathbb{P}}^{*}_{X}})}
=Var[φη(Z)ϕη(X)]+(1γ)2Var[ϕη(X)]+γ(1γ)Var[ϕη(X)]\displaystyle=\mathrm{Var}\left[{{\varphi}^{*}_{\eta^{*}}}(Z)-{\phi^{*}_{\eta^{*}}}(X)\right]+(1-\gamma)^{2}\mathrm{Var}\left[{\phi^{*}_{\eta^{*}}}(X)\right]+\gamma(1-\gamma)\mathrm{Var}\left[{\phi^{*}_{\eta^{*}}}(X)\right]
=Var[φη(Z)ϕη(X)]+(1γ)Var[ϕη(X)]\displaystyle=\mathrm{Var}\left[{{\varphi}^{*}_{\eta^{*}}}(Z)-{\phi^{*}_{\eta^{*}}}(X)\right]+(1-\gamma)\mathrm{Var}\left[{\phi^{*}_{\eta^{*}}}(X)\right]
=Var[φη(Z)]γVar[ϕη(X)],\displaystyle=\mathrm{Var}\left[{{\varphi}^{*}_{\eta^{*}}}(Z)\right]-\gamma\mathrm{Var}\left[{\phi^{*}_{\eta^{*}}}(X)\right],

where we used Lemma E.1 in the last equality. ∎
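As a quick consistency check on the bound just derived (an observation, not part of the proof): when \gamma\to 0, so that the unlabeled sample is asymptotically negligible, the bound \mathrm{Var}[{{\varphi}^{*}_{\eta^{*}}}(Z)]-\gamma\mathrm{Var}[{\phi^{*}_{\eta^{*}}}(X)] reduces to the supervised bound \mathrm{Var}[{{\varphi}^{*}_{\eta^{*}}}(Z)]; when \gamma\to 1, it reduces to \mathrm{Var}[{{\varphi}^{*}_{\eta^{*}}}(Z)]-\mathrm{Var}[{\phi^{*}_{\eta^{*}}}(X)]=\mathrm{Var}[{{\varphi}^{*}_{\eta^{*}}}(Z)-{\phi^{*}_{\eta^{*}}}(X)], the bound obtained in Theorem 3.1 for the reduced model {\{{\mathbb{P}}_{X}^{*}\}\otimes{\mathcal{P}}_{Y\mid X}} with known marginal.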

Proof of Theorem 4.3.

Recall that

g^0(Xi)\displaystyle{\hat{g}^{0}}(X_{i}) =g(Xi)n+N(g)\displaystyle=g(X_{i})-{{\mathbb{P}}_{n+N}}(g)
=g(Xi)(g)n+N[g(g)]\displaystyle=g(X_{i})-{{\mathbb{P}}^{*}}(g)-{{\mathbb{P}}_{n+N}}\left[g-{{\mathbb{P}}^{*}}(g)\right]
=g0(Xi)n+N(g0).\displaystyle={g^{0}}(X_{i})-{{\mathbb{P}}_{n+N}}({g^{0}}).

Then, for the estimator θ^n,Nsafe\hat{\theta}_{n,N}^{\text{safe}} defined in (28),

n(θ^n,Nsafeθ)\displaystyle\sqrt{n}(\hat{\theta}_{n,N}^{\text{safe}}-\theta^{*}) =n(θ^nθ)𝐁^n,Ngnn(g^0)\displaystyle=\sqrt{n}(\hat{\theta}_{n}-\theta^{*})-\hat{\mathbf{B}}_{n,N}^{g}\sqrt{n}{{\mathbb{P}}_{n}}({\hat{g}^{0}}) (67)
=nn(φη)𝐁^n,Ngnn(g0)+𝐁^n,Ngnn+N(g0)+op(1)\displaystyle=\sqrt{n}{{\mathbb{P}}_{n}}({{\varphi}_{\eta^{*}}})-\hat{\mathbf{B}}_{n,N}^{g}\sqrt{n}{{\mathbb{P}}_{n}}({g^{0}})+\hat{\mathbf{B}}_{n,N}^{g}\sqrt{n}{{\mathbb{P}}_{n+N}}({g^{0}})+o_{p}(1)
=nn(φη)𝐁^n,Ngnn(g0)+𝐁^n,Ngnnn+Nn(g0)+𝐁^n,NgnNn+NN(g0)+op(1)\displaystyle=\sqrt{n}{{\mathbb{P}}_{n}}({{\varphi}_{\eta^{*}}})-\hat{\mathbf{B}}_{n,N}^{g}\sqrt{n}{{\mathbb{P}}_{n}}({g^{0}})+\hat{\mathbf{B}}_{n,N}^{g}\frac{\sqrt{n}n}{n+N}{{\mathbb{P}}_{n}}({g^{0}})+\hat{\mathbf{B}}_{n,N}^{g}\frac{\sqrt{n}N}{n+N}{{\mathbb{P}}_{N}}({g^{0}})+o_{p}(1)
=nn(φη)γ𝐁^n,Ngnn(g0)+γ(1γ)𝐁^n,NgNN(g0)+op(1),\displaystyle=\sqrt{n}{{\mathbb{P}}_{n}}({{\varphi}_{\eta^{*}}})-\gamma\hat{\mathbf{B}}_{n,N}^{g}\sqrt{n}{{\mathbb{P}}_{n}}({g^{0}})+\sqrt{\gamma(1-\gamma)}\hat{\mathbf{B}}_{n,N}^{g}\sqrt{N}{{\mathbb{P}}_{N}}({g^{0}})+o_{p}(1),

as Nn+Nγ(0,1]\frac{N}{n+N}\to\gamma\in(0,1]. By CLT, we have nn(g0)=Op(1)\sqrt{n}{{\mathbb{P}}_{n}}\left({g^{0}}\right)=O_{p}(1), and NN(g0)=Op(1)\sqrt{N}{{\mathbb{P}}_{N}}\left({g^{0}}\right)=O_{p}(1). Further, by the fact that n𝒰N{{\mathcal{L}}_{n}}\perp\!\!\!\perp{{\mathcal{U}}_{N}}, for any function ff, it holds that n(f)N(f){{\mathbb{P}}_{n}}(f)\perp\!\!\!\perp{{\mathbb{P}}_{N}}(f).

Next, we show that

𝐁^ng𝐁^n,Ng2=op(1),\left\lVert\hat{\mathbf{B}}_{n}^{g}-\hat{\mathbf{B}}_{n,N}^{g}\right\rVert_{2}=o_{p}(1),

where 𝐁^ng\hat{\mathbf{B}}_{n}^{g} is defined in (11) and 𝐁^n,Ng\hat{\mathbf{B}}_{n,N}^{g} is defined in (27). Recall that we define 𝚺=Cov[g(X)]\mathbf{\Sigma}=\mathrm{Cov}[g(X)]. Then,

𝐁^ng𝐁^n,Ng2\displaystyle\left\lVert\hat{\mathbf{B}}_{n}^{g}-\hat{\mathbf{B}}_{n,N}^{g}\right\rVert_{2} =n[φη~n(g0)]𝚺1n[φη~n(g^0)]n+N[g^0(g^0)]12\displaystyle=\left\lVert{{\mathbb{P}}_{n}}\left[{{\varphi}_{\tilde{\eta}_{n}}}({g^{0}})^{\top}\right]\mathbf{\Sigma}^{-1}-{{\mathbb{P}}_{n}}\left[{{\varphi}_{\tilde{\eta}_{n}}}({\hat{g}^{0}})^{\top}\right]{{\mathbb{P}}_{n+N}}\left[{\hat{g}^{0}}({\hat{g}^{0}})^{\top}\right]^{-1}\right\rVert_{2}
n[φη~n(g0)]n[φη~n(g^0)]2𝚺12I+\displaystyle\leqslant\underbrace{\left\lVert{{\mathbb{P}}_{n}}\left[{{\varphi}_{\tilde{\eta}_{n}}}({g^{0}})^{\top}\right]-{{\mathbb{P}}_{n}}\left[{{\varphi}_{\tilde{\eta}_{n}}}({\hat{g}^{0}})^{\top}\right]\right\rVert_{2}\left\lVert\mathbf{\Sigma}^{-1}\right\rVert_{2}}_{\text{I}}+
n[φη~n(g^0)]2𝚺1{n+N[g^0(g^0)]}12II.\displaystyle\quad\underbrace{\left\lVert{{\mathbb{P}}_{n}}\left[{{\varphi}_{\tilde{\eta}_{n}}}({\hat{g}^{0}})^{\top}\right]\right\rVert_{2}\left\lVert\mathbf{\Sigma}^{-1}-\left\{{{{\mathbb{P}}_{n+N}}\left[{\hat{g}^{0}}({\hat{g}^{0}})^{\top}\right]}\right\}^{-1}\right\rVert_{2}}_{\text{II}}.

First we show that \text{I}=o_{p}(1). By Assumption 3.1 and Lemma E.5, \left\lVert{{\mathbb{P}}_{n}}({{\varphi}_{\tilde{\eta}_{n}}})\right\rVert=O_{p}(1), and since {{\mathbb{P}}^{*}}\left({g^{0}}\right)=0, {{\mathbb{P}}_{n+N}}\left({g^{0}}\right)=o_{p}(1). We therefore have:

n[φη~n(g0)]n[φη~n(g^0)]2\displaystyle\left\lVert{{\mathbb{P}}_{n}}\left[{{\varphi}_{\tilde{\eta}_{n}}}({g^{0}})^{\top}\right]-{{\mathbb{P}}_{n}}\left[{{\varphi}_{\tilde{\eta}_{n}}}({\hat{g}^{0}})^{\top}\right]\right\rVert_{2}
=n(φη~n)[n+N(g0)]2\displaystyle=\left\lVert{{\mathbb{P}}_{n}}({{\varphi}_{\tilde{\eta}_{n}}})\left[{{\mathbb{P}}_{n+N}}({g^{0}})\right]^{\top}\right\rVert_{2}
n(φη~n)2n+N(g0)2\displaystyle\leqslant\left\lVert{{\mathbb{P}}_{n}}({{\varphi}_{\tilde{\eta}_{n}}})\right\rVert_{2}\left\lVert{{\mathbb{P}}_{n+N}}({g^{0}})\right\rVert_{2}
=op(1).\displaystyle=o_{p}(1).

Since 𝚺0\mathbf{\Sigma}\succ 0, it follows that

I=n[φη~n(g0)]n[φη~n(g^0)]2𝚺12=op(1).\text{I}=\left\lVert{{\mathbb{P}}_{n}}\left[{{\varphi}_{\tilde{\eta}_{n}}}({g^{0}})^{\top}\right]-{{\mathbb{P}}_{n}}\left[{{\varphi}_{\tilde{\eta}_{n}}}({\hat{g}^{0}})^{\top}\right]\right\rVert_{2}\left\lVert\mathbf{\Sigma}^{-1}\right\rVert_{2}=o_{p}(1).

Next we show that II=op(1)o_{p}(1). By (61),

n[φη~n(g0)][φη~n(g0)]2=op(1),\left\lVert{{\mathbb{P}}_{n}}\left[{{\varphi}_{\tilde{\eta}_{n}}}({g^{0}})^{\top}\right]-{{\mathbb{P}}^{*}}\left[{{\varphi}_{\tilde{\eta}_{n}}}({g^{0}})^{\top}\right]\right\rVert_{2}=o_{p}(1),

and by (63),

[φη~n(g0)][φη(g0)]2=op(1).\left\lVert{{\mathbb{P}}^{*}}\left[{{\varphi}_{\tilde{\eta}_{n}}}({g^{0}})^{\top}\right]-{{\mathbb{P}}^{*}}\left[{{\varphi}_{\eta^{*}}}({g^{0}})^{\top}\right]\right\rVert_{2}=o_{p}(1).

Therefore,

n[φη~n(g^0)]2\displaystyle\left\lVert{{\mathbb{P}}_{n}}\left[{{\varphi}_{\tilde{\eta}_{n}}}({\hat{g}^{0}})^{\top}\right]\right\rVert_{2}
n[φη~n(g0)]2+n(φη~n)[n+N(g0)]2\displaystyle\leqslant\left\lVert{{\mathbb{P}}_{n}}\left[{{\varphi}_{\tilde{\eta}_{n}}}({g^{0}})^{\top}\right]\right\rVert_{2}+\left\lVert{{\mathbb{P}}_{n}}({{\varphi}_{\tilde{\eta}_{n}}})\left[{{\mathbb{P}}_{n+N}}({g^{0}})\right]^{\top}\right\rVert_{2}
=n[φη~n(g0)]2+op(1)\displaystyle=\left\lVert{{\mathbb{P}}_{n}}\left[{{\varphi}_{\tilde{\eta}_{n}}}({g^{0}})^{\top}\right]\right\rVert_{2}+o_{p}(1)
n[φη~n(g0)][φη~n(g0)]2+[φη~n(g0)][φη(g0)]2\displaystyle\leqslant\left\lVert{{\mathbb{P}}_{n}}\left[{{\varphi}_{\tilde{\eta}_{n}}}({g^{0}})^{\top}\right]-{{\mathbb{P}}^{*}}\left[{{\varphi}_{\tilde{\eta}_{n}}}({g^{0}})^{\top}\right]\right\rVert_{2}+\left\lVert{{\mathbb{P}}^{*}}\left[{{\varphi}_{\tilde{\eta}_{n}}}({g^{0}})^{\top}\right]-{{\mathbb{P}}^{*}}\left[{{\varphi}_{\eta^{*}}}({g^{0}})^{\top}\right]\right\rVert_{2}
+[φη(g0)]2+op(1)\displaystyle\quad+\left\lVert{{\mathbb{P}}^{*}}\left[{{\varphi}_{\eta^{*}}}({g^{0}})^{\top}\right]\right\rVert_{2}+o_{p}(1)
n[φη~n(g0)][φη~n(g0)]2op(1)+[φη~n(g0)][φη(g0)]2op(1)\displaystyle\leqslant\underbrace{\left\lVert{{\mathbb{P}}_{n}}\left[{{\varphi}_{\tilde{\eta}_{n}}}({g^{0}})^{\top}\right]-{{\mathbb{P}}^{*}}\left[{{\varphi}_{\tilde{\eta}_{n}}}({g^{0}})^{\top}\right]\right\rVert_{2}}_{o_{p}(1)}+\underbrace{\left\lVert{{\mathbb{P}}^{*}}\left[{{\varphi}_{\tilde{\eta}_{n}}}({g^{0}})^{\top}\right]-{{\mathbb{P}}^{*}}\left[{{\varphi}_{\eta^{*}}}({g^{0}})^{\top}\right]\right\rVert_{2}}_{o_{p}(1)}
+[φη(g0)]2Op(1)+op(1)\displaystyle\quad+\underbrace{\left\lVert{{\mathbb{P}}^{*}}\left[{{\varphi}_{\eta^{*}}}({g^{0}})^{\top}\right]\right\rVert_{2}}_{O_{p}(1)}+o_{p}(1)
=Op(1).\displaystyle=O_{p}(1).

Further, by the law of large numbers, 𝚺n+N[g0(g0)]2=op(1)\left\lVert\mathbf{\Sigma}-{{\mathbb{P}}_{n+N}}\left[{g^{0}}\left({g^{0}}\right)^{\top}\right]\right\rVert_{2}=o_{p}(1) and n+N(g0)=op(1){{\mathbb{P}}_{n+N}}({g^{0}})=o_{p}(1). Then,

𝚺n+N[g^0(g^0)]2\displaystyle\left\lVert\mathbf{\Sigma}-{{\mathbb{P}}_{n+N}}\left[{\hat{g}^{0}}({\hat{g}^{0}})^{\top}\right]\right\rVert_{2}
𝚺n+N[g0(g0)]2+n+N(g0)n+N[(g0)]2\displaystyle\leqslant\left\lVert\mathbf{\Sigma}-{{\mathbb{P}}_{n+N}}\left[{g^{0}}({g^{0}})^{\top}\right]\right\rVert_{2}+\left\lVert{{\mathbb{P}}_{n+N}}({g^{0}}){{\mathbb{P}}_{n+N}}\left[({g^{0}})^{\top}\right]\right\rVert_{2}
=op(1).\displaystyle=o_{p}(1).

Hence, 𝚺1{n+N[g^0(g^0)]}12=op(1)\left\lVert\mathbf{\Sigma}^{-1}-\left\{{{{\mathbb{P}}_{n+N}}\left[{\hat{g}^{0}}({\hat{g}^{0}})^{\top}\right]}\right\}^{-1}\right\rVert_{2}=o_{p}(1) by continuous mapping. As a result, the second term also satisfies

II=n[φη~n(g^0)]2𝚺1{n+N[g^0(g^0)]}12=op(1).\text{II}=\left\lVert{{\mathbb{P}}_{n}}\left[{{\varphi}_{\tilde{\eta}_{n}}}({\hat{g}^{0}})^{\top}\right]\right\rVert_{2}\left\lVert\mathbf{\Sigma}^{-1}-\left\{{{{\mathbb{P}}_{n+N}}\left[{\hat{g}^{0}}({\hat{g}^{0}})^{\top}\right]}\right\}^{-1}\right\rVert_{2}=o_{p}(1).

Combining I and II, we have shown that 𝐁^ng𝐁^n,Ng2=op(1)\left\lVert\hat{\mathbf{B}}_{n}^{g}-\hat{\mathbf{B}}_{n,N}^{g}\right\rVert_{2}=o_{p}(1), and consequently 𝐁g𝐁^n,Ng2=op(1)\left\lVert\mathbf{B}^{g}-\hat{\mathbf{B}}_{n,N}^{g}\right\rVert_{2}=o_{p}(1) given (60).

Now, by (67),

n(θ^n,Nsafeθ)=nn(φηγ𝐁gg0)+NN(γ(1γ)𝐁gg0)+op(1).\sqrt{n}\left(\hat{\theta}_{n,N}^{\text{safe}}-\theta^{*}\right)=\sqrt{n}{{\mathbb{P}}_{n}}({{\varphi}_{\eta^{*}}}-\gamma\mathbf{B}^{g}{g^{0}})+\sqrt{N}{{\mathbb{P}}_{N}}(\sqrt{\gamma(1-\gamma)}\mathbf{B}^{g}{g^{0}})+o_{p}(1).

Therefore, θ^n,Nsafe\hat{\theta}_{n,N}^{\text{safe}} is a regular and asymptotically linear estimator with influence function

φη(z1)γ𝐁gg0(x1)+γ(1γ)𝐁gg0(x2).{{\varphi}_{\eta^{*}}}(z_{1})-\gamma\mathbf{B}^{g}{g^{0}}(x_{1})+\sqrt{\gamma(1-\gamma)}\mathbf{B}^{g}{g^{0}}(x_{2}).

By Lemmas E.1 and E.2, the asymptotic variance of θ^n,Nsafe\hat{\theta}_{n,N}^{\text{safe}} can be represented as

Var[φη(Z)γ𝐁gg0(X)]+Var[γ(1γ)𝐁gg0(X)]\displaystyle\mathrm{Var}\left[{{\varphi}_{\eta^{*}}}(Z)-\gamma\mathbf{B}^{g}{g^{0}}(X)\right]+\mathrm{Var}\left[\sqrt{\gamma(1-\gamma)}\mathbf{B}^{g}{g^{0}}(X)\right]
=Var[φη(Z)γ𝐁gg0(X)]+γ(1γ)Var[𝐁gg0(X)]\displaystyle=\mathrm{Var}\left[{{\varphi}_{\eta^{*}}}(Z)-\gamma\mathbf{B}^{g}{g^{0}}(X)\right]+\gamma(1-\gamma)\mathrm{Var}\left[\mathbf{B}^{g}{g^{0}}(X)\right]
=Var[φη(Z)ϕη(X)]+Var[ϕη(X)𝐁gg0(X)+(1γ)𝐁gg0(X)]\displaystyle=\mathrm{Var}\left[{{\varphi}_{\eta^{*}}}(Z)-{\phi_{\eta^{*}}}(X)\right]+\mathrm{Var}\left[{\phi_{\eta^{*}}}(X)-\mathbf{B}^{g}{g^{0}}(X)+(1-\gamma)\mathbf{B}^{g}{g^{0}}(X)\right]
+γ(1γ)Var[𝐁gg0(X)]\displaystyle\quad+\gamma(1-\gamma)\mathrm{Var}\left[\mathbf{B}^{g}{g^{0}}(X)\right]
=Var[φη(Z)ϕη(X)]+Var[ϕη(X)𝐁gg0(X)]+(1γ)2Var[𝐁gg0(X)]\displaystyle=\mathrm{Var}\left[{{\varphi}_{\eta^{*}}}(Z)-{\phi_{\eta^{*}}}(X)\right]+\mathrm{Var}\left[{\phi_{\eta^{*}}}(X)-\mathbf{B}^{g}{g^{0}}(X)\right]+(1-\gamma)^{2}\mathrm{Var}\left[\mathbf{B}^{g}{g^{0}}(X)\right]
+γ(1γ)Var[𝐁gg0(X)]\displaystyle\quad+\gamma(1-\gamma)\mathrm{Var}\left[\mathbf{B}^{g}{g^{0}}(X)\right]
=Var[φη(Z)ϕη(X)]+Var[ϕη(X)𝐁gg0(X)]+(1γ)Var[𝐁gg0(X)]\displaystyle=\mathrm{Var}\left[{{\varphi}_{\eta^{*}}}(Z)-{\phi_{\eta^{*}}}(X)\right]+\mathrm{Var}\left[{\phi_{\eta^{*}}}(X)-\mathbf{B}^{g}{g^{0}}(X)\right]+(1-\gamma)\mathrm{Var}\left[\mathbf{B}^{g}{g^{0}}(X)\right]
=Var[φη(Z)]Var[ϕη(X)]+Var[ϕη(X)𝐁gg0(X)]+(1γ)Var[𝐁gg0(X)]\displaystyle=\mathrm{Var}\left[{{\varphi}_{\eta^{*}}}(Z)\right]-\mathrm{Var}\left[{\phi_{\eta^{*}}}(X)\right]+\mathrm{Var}\left[{\phi_{\eta^{*}}}(X)-\mathbf{B}^{g}{g^{0}}(X)\right]+(1-\gamma)\mathrm{Var}\left[\mathbf{B}^{g}{g^{0}}(X)\right]
=Var[φη(Z)]Var[𝐁gg0(X)]+(1γ)Var[𝐁gg0(X)]\displaystyle=\mathrm{Var}\left[{{\varphi}_{\eta^{*}}}(Z)\right]-\mathrm{Var}\left[\mathbf{B}^{g}{g^{0}}(X)\right]+(1-\gamma)\mathrm{Var}\left[\mathbf{B}^{g}{g^{0}}(X)\right]
=Var[φη(Z)]γVar[𝐁gg0(X)],\displaystyle=\mathrm{Var}\left[{{\varphi}_{\eta^{*}}}(Z)\right]-\gamma\mathrm{Var}\left[\mathbf{B}^{g}{g^{0}}(X)\right],

which proves the claim of the theorem. ∎
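For concreteness, the following Python sketch (a hedged illustration only: mean estimation with g(X)=X, a hypothetical Gaussian design, and empirical analogues of \hat{\mathbf{B}}_{n,N}^{g} and {\hat{g}^{0}} formed as in the proof, i.e., centering g by the pooled mean and using the pooled second moment) compares the supervised estimator with the semi-supervised safe estimator over repeated samples. The resulting Monte Carlo variance is close to \left(\mathrm{Var}[{{\varphi}_{\eta^{*}}}(Z)]-\gamma\mathrm{Var}[\mathbf{B}^{g}{g^{0}}(X)]\right)/n, as predicted by the theorem.

import numpy as np

rng = np.random.default_rng(5)

def ss_safe_estimator(X_lab, Y, X_unlab):
    """Semi-supervised safe estimator of E[Y] with g(X) = X, centering g by the pooled mean."""
    X_all = np.concatenate([X_lab, X_unlab])
    theta_hat = Y.mean()                                  # initial supervised estimator
    varphi_hat = Y - theta_hat                            # estimated influence function
    g0_lab = X_lab - X_all.mean()                         # hat{g}^0 evaluated on the labeled sample
    B_hat = np.mean(varphi_hat * g0_lab) / np.var(X_all)  # P_n[varphi * g0] / P_{n+N}[(g0)^2]
    return theta_hat - B_hat * g0_lab.mean()              # theta_hat - B_hat * P_n(hat{g}^0)

reps, n, N = 2000, 300, 1200                              # gamma = N / (n + N) = 0.8
sup, ss = [], []
for _ in range(reps):
    X = rng.normal(size=n + N)
    Y = 1.0 + 2.0 * X[:n] + rng.normal(size=n)            # hypothetical outcome model (labeled part)
    sup.append(Y.mean())
    ss.append(ss_safe_estimator(X[:n], Y, X[n:]))

print(np.var(sup), np.var(ss))   # variance drops by roughly gamma * Var[B g^0] / n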

Proof of Theorem 4.4.

As G^Kn0(x)=GKn0n+N(GKn0){\hat{G}_{K_{n}}^{0}}(x)={G_{K_{n}}^{0}}-{{\mathbb{P}}_{n+N}}\left({G_{K_{n}}^{0}}\right), we have

n(θ^n,Neff.θ)\displaystyle\sqrt{n}\left(\hat{\theta}_{n,N}^{\text{eff.}}-\theta^{*}\right) =nn(φη)nn(𝐁^Kn,NGKn0)+nn+N(𝐁^Kn,NGKn0)\displaystyle=\sqrt{n}{{\mathbb{P}}_{n}}({{\varphi}_{\eta^{*}}})-\sqrt{n}{{\mathbb{P}}_{n}}\left(\hat{\mathbf{B}}_{K_{n},N}{G_{K_{n}}^{0}}\right)+\sqrt{n}{{\mathbb{P}}_{n+N}}\left(\hat{\mathbf{B}}_{K_{n},N}{G_{K_{n}}^{0}}\right)
=nn(φη)nn(Nn+N𝐁^Kn,NGKn0)+NN(nNn+N𝐁^Kn,NGKn0).\displaystyle=\sqrt{n}{{\mathbb{P}}_{n}}({{\varphi}_{\eta^{*}}})-\sqrt{n}{{\mathbb{P}}_{n}}\left(\frac{N}{n+N}\hat{\mathbf{B}}_{K_{n},N}{G_{K_{n}}^{0}}\right)+\sqrt{N}{{\mathbb{P}}_{N}}\left(\frac{\sqrt{nN}}{n+N}\hat{\mathbf{B}}_{K_{n},N}{G_{K_{n}}^{0}}\right).

Similar to the proof of Theorem 3.6, we only prove the case of p=1p=1. The general case proceeds by analyzing each coordinate individually. By Assumption 3.2, ϕη𝒞M1α1(𝒳){\phi_{\eta^{*}}}\in{\mathcal{C}}^{\alpha_{1}}_{M_{1}}({\mathcal{X}}). As {gk(x)}k=1\left\{{g_{k}(x)}\right\}_{k=1}^{\infty} is a basis of 𝒞M1α1(𝒳){\mathcal{C}}^{\alpha_{1}}_{M_{1}}({\mathcal{X}}), we have that 𝐁^Kn,NGKn0(x)𝒞M1α1\hat{\mathbf{B}}_{K_{n},N}{G_{K_{n}}^{0}}(x)\in{\mathcal{C}}^{\alpha_{1}}_{M_{1}}. Therefore, if we can show that

ϕη𝐁^Kn,NGKn0(x)1,02(X)=op(1),\left\lVert{\phi_{\eta^{*}}}-\hat{\mathbf{B}}_{K_{n},N}{G_{K_{n}}^{0}}(x)\right\rVert_{{{\mathcal{L}}^{2}_{1,0}({{\mathbb{P}}^{*}_{X}})}}=o_{p}(1), (68)

then by Lemma E.4, nn(ϕη𝐁^Kn,NGKn0(x))=op(1)\sqrt{n}{{\mathbb{P}}_{n}}\left({\phi_{\eta^{*}}}-\hat{\mathbf{B}}_{K_{n},N}{G_{K_{n}}^{0}}(x)\right)=o_{p}(1). By definition, 𝔼[GKn0]=0{\mathbb{E}}[{G_{K_{n}}^{0}}]=0. Therefore, using the fact that Nn+Nγ(0,1]\frac{N}{n+N}\to\gamma\in(0,1],

n(θ^n,Neff.θ)\displaystyle\sqrt{n}\left(\hat{\theta}_{n,N}^{\text{eff.}}-\theta^{*}\right)
=nn(φη)nn(Nn+N𝐁^Kn,NGKn0)+NN(nNn+N𝐁^Kn,NGKn0)\displaystyle=\sqrt{n}{{\mathbb{P}}_{n}}({{\varphi}_{\eta^{*}}})-\sqrt{n}{{\mathbb{P}}_{n}}\left(\frac{N}{n+N}\hat{\mathbf{B}}_{K_{n},N}{G_{K_{n}}^{0}}\right)+\sqrt{N}{{\mathbb{P}}_{N}}\left(\frac{\sqrt{nN}}{n+N}\hat{\mathbf{B}}_{K_{n},N}{G_{K_{n}}^{0}}\right)
=nn(φηγϕη)+NN(γ(1γ)ϕη)+op(1).\displaystyle=\sqrt{n}{{\mathbb{P}}_{n}}({{\varphi}_{\eta^{*}}}-\gamma{\phi_{\eta^{*}}})+\sqrt{N}{{\mathbb{P}}_{N}}\left(\sqrt{\gamma(1-\gamma)}{\phi_{\eta^{*}}}\right)+o_{p}(1).

It then follows that θ^n,Neff.\hat{\theta}_{n,N}^{\text{eff.}} is a regular and asymptotically linear estimator with influence function

φη(z1)γϕη(x1)+γ(1γ)ϕη(x2).{{\varphi}_{\eta^{*}}}(z_{1})-\gamma{\phi_{\eta^{*}}}(x_{1})+\sqrt{\gamma(1-\gamma)}{\phi_{\eta^{*}}}(x_{2}).

By Lemma E.1, the asymptotic variance of θ^n,Neff.\hat{\theta}_{n,N}^{\text{eff.}} can be represented as

Var[φη(Z)γϕη(X)]+Var[γ(1γ)ϕη(X)]\displaystyle\mathrm{Var}\left[{{\varphi}_{\eta^{*}}}(Z)-\gamma{\phi_{\eta^{*}}}(X)\right]+\mathrm{Var}\left[\sqrt{\gamma(1-\gamma)}{\phi_{\eta^{*}}}(X)\right]
=Var[φη(Z)ϕη(X)+(1γ)ϕη(x)]+γ(1γ)Var[ϕη(X)]\displaystyle=\mathrm{Var}\left[{{\varphi}_{\eta^{*}}}(Z)-{\phi_{\eta^{*}}}(X)+(1-\gamma){\phi_{\eta^{*}}}(x)\right]+\gamma(1-\gamma)\mathrm{Var}\left[{\phi_{\eta^{*}}}(X)\right]
=Var[φη(Z)ϕη(X)]+(1γ)2Var[ϕη(x)]+γ(1γ)Var[ϕη(X)]\displaystyle=\mathrm{Var}\left[{{\varphi}_{\eta^{*}}}(Z)-{\phi_{\eta^{*}}}(X)\right]+(1-\gamma)^{2}\mathrm{Var}\left[{\phi_{\eta^{*}}}(x)\right]+\gamma(1-\gamma)\mathrm{Var}\left[{\phi_{\eta^{*}}}(X)\right]
=Var[φη(Z)ϕη(X)]+(1γ)Var[ϕη(X)]\displaystyle=\mathrm{Var}\left[{{\varphi}_{\eta^{*}}}(Z)-{\phi_{\eta^{*}}}(X)\right]+(1-\gamma)\mathrm{Var}\left[{\phi_{\eta^{*}}}(X)\right]
=Var[φη(Z)]Var[ϕη(X)]+(1γ)Var[ϕη(X)]\displaystyle=\mathrm{Var}\left[{{\varphi}_{\eta^{*}}}(Z)\right]-\mathrm{Var}\left[{\phi_{\eta^{*}}}(X)\right]+(1-\gamma)\mathrm{Var}\left[{\phi_{\eta^{*}}}(X)\right]
=Var[φη(Z)]γVar[ϕη(X)],\displaystyle=\mathrm{Var}\left[{{\varphi}_{\eta^{*}}}(Z)\right]-\gamma\mathrm{Var}\left[{\phi_{\eta^{*}}}(X)\right],

which would prove Theorem 4.4.

To complete the argument, we next prove (68). Denoting \mathbf{\Sigma}={\mathbb{E}}\left[{G_{K_{n}}^{0}}({G_{K_{n}}^{0}})^{\top}\right] and \mathbf{B}_{K_{n}}={{\mathbb{P}}^{*}}\left[{{\varphi}_{\eta^{*}}}({G_{K_{n}}^{0}})^{\top}\right]\mathbf{\Sigma}^{-1}\in{\mathbb{R}}^{1\times K_{n}}, we have:

ϕη𝐁^Kn,NGKn01,02(X)\displaystyle\left\lVert{\phi_{\eta^{*}}}-\hat{\mathbf{B}}_{K_{n},N}{G_{K_{n}}^{0}}\right\rVert_{{\mathcal{L}}^{2}_{1,0}({{\mathbb{P}}^{*}_{X}})} =ϕη𝐁KnGKn0+𝐁KnGKn0𝐁^Kn,NGKn01,02(X)\displaystyle=\left\lVert{\phi_{\eta^{*}}}-\mathbf{B}_{K_{n}}{G_{K_{n}}^{0}}+\mathbf{B}_{K_{n}}{G_{K_{n}}^{0}}-\hat{\mathbf{B}}_{K_{n},N}{G_{K_{n}}^{0}}\right\rVert_{{\mathcal{L}}^{2}_{1,0}({{\mathbb{P}}^{*}_{X}})}
𝐁^Kn,NGKn0(x)𝐁KnGKn0(x)1,02(X)I\displaystyle\leqslant\underbrace{\left\lVert\hat{\mathbf{B}}_{K_{n},N}{G_{K_{n}}^{0}}(x)-\mathbf{B}_{K_{n}}{G_{K_{n}}^{0}}(x)\right\rVert_{{\mathcal{L}}^{2}_{1,0}({{\mathbb{P}}^{*}_{X}})}}_{\text{I}}
+ϕη𝐁KnGKn0(x)1,02(X)II.\displaystyle\quad+\underbrace{\left\lVert{\phi_{\eta^{*}}}-\mathbf{B}_{K_{n}}{G_{K_{n}}^{0}}(x)\right\rVert_{{\mathcal{L}}^{2}_{1,0}({{\mathbb{P}}^{*}_{X}})}}_{\text{II}}.

From (65) in the proof of Theorem 3.6, we already have

\text{II}=O\left(K_{n}^{-\frac{\alpha_{1}}{{\text{dim}}({\mathcal{X}})}}\right)=o_{p}(1).

Therefore, it remains to prove that I=op(1)\text{I}=o_{p}(1). Notice that

\displaystyle\text{I}^{2}=\left\lVert\hat{\mathbf{B}}_{K_{n},N}{G_{K_{n}}^{0}}-\mathbf{B}_{K_{n}}{G_{K_{n}}^{0}}\right\rVert^{2}_{{{\mathcal{L}}^{2}_{1,0}({{\mathbb{P}}^{*}_{X}})}}
=𝔼{GKn0(𝐁^Kn,N𝐁Kn)(𝐁^Kn,N𝐁Kn)GKn0}\displaystyle={\mathbb{E}}\left\{{{G_{K_{n}}^{0}}^{\top}\left(\hat{\mathbf{B}}_{K_{n},N}-\mathbf{B}_{K_{n}}\right)^{\top}\left(\hat{\mathbf{B}}_{K_{n},N}-\mathbf{B}_{K_{n}}\right){G_{K_{n}}^{0}}}\right\}
=tr{(𝐁^Kn,N𝐁Kn)(𝐁^Kn,N𝐁Kn)𝚺}.\displaystyle={\text{tr}}\left\{{\left(\hat{\mathbf{B}}_{K_{n},N}-\mathbf{B}_{K_{n}}\right)^{\top}\left(\hat{\mathbf{B}}_{K_{n},N}-\mathbf{B}_{K_{n}}\right)\mathbf{\Sigma}}\right\}.

Moreover,

𝐁Kn=[φη(𝚺12GKn0)]𝚺12,\displaystyle\mathbf{B}_{K_{n}}={{\mathbb{P}}^{*}}\left[{{\varphi}_{\eta^{*}}}\left(\mathbf{\Sigma}^{-\frac{1}{2}}{G_{K_{n}}^{0}}\right)^{\top}\right]\mathbf{\Sigma}^{-\frac{1}{2}},
𝐁^Kn,N=n[φη~n(𝚺12G^Kn0)]n[(𝚺12G^Kn0)(𝚺12G^Kn0)]1𝚺12,\displaystyle\hat{\mathbf{B}}_{K_{n},N}={{\mathbb{P}}_{n}}\left[{{\varphi}_{\tilde{\eta}_{n}}}\left(\mathbf{\Sigma}^{-\frac{1}{2}}{\hat{G}_{K_{n}}^{0}}\right)^{\top}\right]{{\mathbb{P}}_{n}}\left[\left(\mathbf{\Sigma}^{-\frac{1}{2}}{\hat{G}_{K_{n}}^{0}}\right)\left(\mathbf{\Sigma}^{-\frac{1}{2}}{\hat{G}_{K_{n}}^{0}}\right)^{\top}\right]^{-1}\mathbf{\Sigma}^{-\frac{1}{2}},
𝚺12G^Kn0=𝚺12GKn0n+N(𝚺12GKn0).\displaystyle\mathbf{\Sigma}^{-\frac{1}{2}}{\hat{G}_{K_{n}}^{0}}=\mathbf{\Sigma}^{-\frac{1}{2}}{G_{K_{n}}^{0}}-{{\mathbb{P}}_{n+N}}\left(\mathbf{\Sigma}^{-\frac{1}{2}}{G_{K_{n}}^{0}}\right).

Therefore, without loss of generality, we can consider normalizing GKn0{G_{K_{n}}^{0}} with 𝚺12GKn0\mathbf{\Sigma}^{-\frac{1}{2}}{G_{K_{n}}^{0}} and let 𝚺=𝐈Kn\mathbf{\Sigma}=\mathbf{I}_{K_{n}}. Then, we can write

\text{I}^{2}={\text{tr}}\left\{{\left(\hat{\mathbf{B}}_{K_{n},N}-\mathbf{B}_{K_{n}}\right)^{\top}\left(\hat{\mathbf{B}}_{K_{n},N}-\mathbf{B}_{K_{n}}\right)}\right\}=\left\lVert\hat{\mathbf{B}}_{K_{n},N}-\mathbf{B}_{K_{n}}\right\rVert^{2}_{F}.

Denoting rKn(x)=ϕη(x)𝐁KnGKn0(x)r_{K_{n}}(x)={\phi_{\eta^{*}}}(x)-\mathbf{B}_{K_{n}}{G_{K_{n}}^{0}}(x) as the approximation error, and ϵ(z)=φη(z)ϕη(x)\epsilon(z)={{\varphi}_{\eta^{*}}}(z)-{\phi_{\eta^{*}}}(x), φη~n(z){{\varphi}_{\tilde{\eta}_{n}}}(z) can be decomposed as:

φη~n(z)\displaystyle{{\varphi}_{\tilde{\eta}_{n}}}(z) =ϕη(x)+ϵ(z)+[φη~n(z)φη(z)],\displaystyle={\phi_{\eta^{*}}}(x)+\epsilon(z)+[{{\varphi}_{\tilde{\eta}_{n}}}(z)-{{\varphi}_{\eta^{*}}}(z)], (69)
=𝐁KnGKn0(x)+rKn(x)+ϵ(z)+[φη~n(z)φη(z)],\displaystyle=\mathbf{B}_{K_{n}}{G_{K_{n}}^{0}}(x)+r_{K_{n}}(x)+\epsilon(z)+[{{\varphi}_{\tilde{\eta}_{n}}}(z)-{{\varphi}_{\eta^{*}}}(z)],
=𝐁KnG^Kn0(x)+𝐁Knn+N(GKn0)+rKn(x)+ϵ(z)+[φη~n(z)φη(z)].\displaystyle=\mathbf{B}_{K_{n}}{\hat{G}_{K_{n}}^{0}}(x)+\mathbf{B}_{K_{n}}{{\mathbb{P}}_{n+N}}({G_{K_{n}}^{0}})+r_{K_{n}}(x)+\epsilon(z)+[{{\varphi}_{\tilde{\eta}_{n}}}(z)-{{\varphi}_{\eta^{*}}}(z)].

Therefore, denoting 𝚺^=n[G^Kn0(G^Kn0)]\hat{\mathbf{\Sigma}}={{\mathbb{P}}_{n}}\left[{\hat{G}_{K_{n}}^{0}}({\hat{G}_{K_{n}}^{0}})^{\top}\right], it follows that

𝐁^Kn,N\displaystyle\hat{\mathbf{B}}_{K_{n},N} =𝐁Kn+𝐁Knn+N(GKn0)n(G^Kn0)𝚺^1III+n[rKn(G^Kn0)]𝚺^1IV\displaystyle=\mathbf{B}_{K_{n}}+\underbrace{\mathbf{B}_{K_{n}}{{\mathbb{P}}_{n+N}}({G_{K_{n}}^{0}}){{\mathbb{P}}_{n}}({\hat{G}_{K_{n}}^{0}})^{\top}\hat{\mathbf{\Sigma}}^{-1}}_{\text{III}}+\underbrace{{{\mathbb{P}}_{n}}\left[r_{K_{n}}({\hat{G}_{K_{n}}^{0}})^{\top}\right]\hat{\mathbf{\Sigma}}^{-1}}_{\text{IV}}
+n[ϵ(G^Kn0)]𝚺^1V+n[(φη~n(z)φη(z))(G^Kn0)]𝚺^1VI.\displaystyle\quad+\underbrace{{{\mathbb{P}}_{n}}\left[\epsilon({\hat{G}_{K_{n}}^{0}})^{\top}\right]\hat{\mathbf{\Sigma}}^{-1}}_{\text{V}}+\underbrace{{{\mathbb{P}}_{n}}\left[({{\varphi}_{\tilde{\eta}_{n}}}(z)-{{\varphi}_{\eta^{*}}}(z))({\hat{G}_{K_{n}}^{0}})^{\top}\right]\hat{\mathbf{\Sigma}}^{-1}}_{\text{VI}}.

We first consider 𝚺^\hat{\mathbf{\Sigma}}:

n[G^Kn0(G^Kn0)]\displaystyle{{\mathbb{P}}_{n}}\left[{\hat{G}_{K_{n}}^{0}}\left({\hat{G}_{K_{n}}^{0}}\right)^{\top}\right]
\displaystyle={{\mathbb{P}}_{n}}\left[{G_{K_{n}}^{0}}\left({G_{K_{n}}^{0}}\right)^{\top}\right]-{{\mathbb{P}}_{n}}\left({G_{K_{n}}^{0}}\right){{\mathbb{P}}_{n+N}}\left[\left({G_{K_{n}}^{0}}\right)^{\top}\right]-{{\mathbb{P}}_{n+N}}\left({G_{K_{n}}^{0}}\right){{\mathbb{P}}_{n}}\left[\left({G_{K_{n}}^{0}}\right)^{\top}\right]
+n+N(GKn0)n+N[(GKn0)],\displaystyle\quad+{{\mathbb{P}}_{n+N}}\left({G_{K_{n}}^{0}}\right){{\mathbb{P}}_{n+N}}\left[\left({G_{K_{n}}^{0}}\right)^{\top}\right],

where, by Lemma E.8, we have that

n(GKn0)n+N[(GKn0)]F=Op(Knn)=op(1)\displaystyle\left\lVert{{\mathbb{P}}_{n}}\left({G_{K_{n}}^{0}}\right){{\mathbb{P}}_{n+N}}\left[\left({G_{K_{n}}^{0}}\right)^{\top}\right]\right\rVert_{F}=O_{p}\left(\frac{K_{n}}{n}\right)=o_{p}(1)
n+N(GKn0)n[(GKn0)]F=Op(Knn)=op(1),\displaystyle\left\lVert{{\mathbb{P}}_{n+N}}\left({G_{K_{n}}^{0}}\right){{\mathbb{P}}_{n}}\left[\left({G_{K_{n}}^{0}}\right)^{\top}\right]\right\rVert_{F}=O_{p}\left(\frac{K_{n}}{n}\right)=o_{p}(1),
n+N(GKn0)n+N[(GKn0)]F=Op(Knn)=op(1).\displaystyle\left\lVert{{\mathbb{P}}_{n+N}}\left({G_{K_{n}}^{0}}\right){{\mathbb{P}}_{n+N}}\left[\left({G_{K_{n}}^{0}}\right)^{\top}\right]\right\rVert_{F}=O_{p}\left(\frac{K_{n}}{n}\right)=o_{p}(1).

Further, by Theorem 12.16.1 of Hansen [2022],

\left\lVert{{\mathbb{P}}_{n}}\left[{G_{K_{n}}^{0}}\left({G_{K_{n}}^{0}}\right)^{\top}\right]-\mathbf{I}_{K_{n}}\right\rVert_{F}=o_{p}(1),\text{ and }\lambda_{\min}\left({{\mathbb{P}}_{n}}\left[{G_{K_{n}}^{0}}\left({G_{K_{n}}^{0}}\right)^{\top}\right]\right)^{-1}=O_{p}(1).

Therefore it follows that

\left\lVert{{\mathbb{P}}_{n}}\left[{\hat{G}_{K_{n}}^{0}}\left({\hat{G}_{K_{n}}^{0}}\right)^{\top}\right]-\mathbf{I}_{K_{n}}\right\rVert_{F}=o_{p}(1),\text{ and }\lambda_{\min}\left({{\mathbb{P}}_{n}}\left[{\hat{G}_{K_{n}}^{0}}\left({\hat{G}_{K_{n}}^{0}}\right)^{\top}\right]\right)^{-1}=O_{p}(1),

which implies 𝚺^12=Op(1)\left\lVert\hat{\mathbf{\Sigma}}^{-1}\right\rVert_{2}=O_{p}(1). Now, for the term III,

\displaystyle\left\lVert\text{III}\right\rVert\leqslant\left\lVert\mathbf{B}_{K_{n}}{{\mathbb{P}}_{n+N}}\left({G_{K_{n}}^{0}}\right)\right\rVert\left\lVert{{\mathbb{P}}_{n}}\left({\hat{G}_{K_{n}}^{0}}\right)^{\top}\right\rVert\left\lVert\hat{\mathbf{\Sigma}}^{-1}\right\rVert_{2}
=Op(Knn)×Op(Knn)×Op(1)\displaystyle=O_{p}\left(\sqrt{\frac{K_{n}}{n}}\right)\times O_{p}\left(\sqrt{\frac{K_{n}}{n}}\right)\times O_{p}(1)
=Op(Knn)\displaystyle=O_{p}\left(\frac{K_{n}}{n}\right)
=op(1),\displaystyle=o_{p}(1),

where we used Lemma E.8 to show that $\left\lVert\mathbf{B}_{K_{n}}{{\mathbb{P}}_{n+N}}\left({G_{K_{n}}^{0}}\right)\right\rVert$ and $\left\lVert{{\mathbb{P}}_{n}}\left({\hat{G}_{K_{n}}^{0}}\right)^{\top}\right\rVert$ are $O_{p}\left(\sqrt{\frac{K_{n}}{n}}\right)$. For the term IV, first note that by the property of projection and (16), we have

(rKn)=0,(rKnGKn0)=0, and (rKn2)=O(Kn2αdim(𝒳)).{{\mathbb{P}}^{*}}(r_{K_{n}})=0,\quad{{\mathbb{P}}^{*}}(r_{K_{n}}{G_{K_{n}}^{0}})=0,\text{ and }{{\mathbb{P}}^{*}}\left(r_{K_{n}}^{2}\right)=O\left(K_{n}^{-\frac{2\alpha}{{\text{dim}}({\mathcal{X}})}}\right).

Therefore,

\displaystyle\left\lVert\text{IV}\right\rVert\leqslant\left\lVert{{\mathbb{P}}_{n}}\left[r_{K_{n}}\left({\hat{G}_{K_{n}}^{0}}\right)^{\top}\right]\right\rVert\left\lVert\hat{\mathbf{\Sigma}}^{-1}\right\rVert_{2}
\displaystyle\leqslant\left\lVert{{\mathbb{P}}_{n}}\left[r_{K_{n}}\left({G_{K_{n}}^{0}}\right)^{\top}\right]\right\rVert\left\lVert\hat{\mathbf{\Sigma}}^{-1}\right\rVert_{2}+\lvert{{\mathbb{P}}_{n}}(r_{K_{n}})\rvert\left\lVert{{\mathbb{P}}_{n+N}}\left({G_{K_{n}}^{0}}\right)^{\top}\right\rVert\left\lVert\hat{\mathbf{\Sigma}}^{-1}\right\rVert_{2}
\displaystyle\leqslant\left\lVert{{\mathbb{P}}_{n}}\left[r_{K_{n}}\left({G_{K_{n}}^{0}}\right)^{\top}\right]\right\rVert\times O_{p}(1)+O_{p}\left(\frac{1}{\sqrt{n}}\right)\times O_{p}\left(\sqrt{\frac{K_{n}}{n}}\right)\times O_{p}(1)
=Op(n[rKn(GKn0)])+op(1),\displaystyle=O_{p}\left(\left\lVert{{\mathbb{P}}_{n}}\left[r_{K_{n}}\left({G_{K_{n}}^{0}}\right)^{\top}\right]\right\rVert\right)+o_{p}(1),

where we again used Lemma E.8. Further,

𝔼{n[rKn(GKn0)]n[rKn(GKn0)]}\displaystyle{\mathbb{E}}\left\{{{{\mathbb{P}}_{n}}\left[r_{K_{n}}\left({G_{K_{n}}^{0}}\right)^{\top}\right]{{\mathbb{P}}_{n}}\left[r_{K_{n}}\left({G_{K_{n}}^{0}}\right)\right]}\right\} =1n𝔼[rKn2(GKn0)GKn0]\displaystyle=\frac{1}{n}{\mathbb{E}}\left[r_{K_{n}}^{2}({G_{K_{n}}^{0}})^{\top}{G_{K_{n}}^{0}}\right]
sup𝒳{GKn0(x)GKn0(x)}n𝔼[rKn2]\displaystyle\leqslant\frac{\sup_{{\mathcal{X}}}\left\{{{G_{K_{n}}^{0}}(x)^{\top}{G_{K_{n}}^{0}}(x)}\right\}}{n}{\mathbb{E}}[r^{2}_{K_{n}}]
\displaystyle=O\left(\frac{\zeta_{n}^{2}K_{n}^{-\frac{2\alpha}{{\text{dim}}({\mathcal{X}})}}}{n}\right)
=o(1).\displaystyle=o(1).

Therefore, by Markov’s inequality, $\left\lVert{{\mathbb{P}}_{n}}\left[r_{K_{n}}({G_{K_{n}}^{0}})^{\top}\right]\right\rVert=o_{p}(1)$, and hence $\text{IV}=o_{p}(1)$. Next, consider the term V. Recall that $\epsilon(z)={{\varphi}_{\eta^{*}}}(z)-{\phi_{\eta^{*}}}(x)$, and hence

(ϵ)=0, and (ϵg)=0,(ϵ2)<,{{\mathbb{P}}^{*}}(\epsilon)=0,\text{ and }{{\mathbb{P}}^{*}}(\epsilon g)=0,\quad{{\mathbb{P}}^{*}}\left(\epsilon^{2}\right)<\infty,

for any measurable function g(x)g(x) of xx. Thus,

\displaystyle\left\lVert\text{V}\right\rVert\leqslant\left\lVert{{\mathbb{P}}_{n}}\left[\epsilon\left({\hat{G}_{K_{n}}^{0}}\right)^{\top}\right]\right\rVert\left\lVert\hat{\mathbf{\Sigma}}^{-1}\right\rVert_{2}
\displaystyle\leqslant\left\lVert{{\mathbb{P}}_{n}}\left[\epsilon\left({G_{K_{n}}^{0}}\right)^{\top}\right]\right\rVert\left\lVert\hat{\mathbf{\Sigma}}^{-1}\right\rVert_{2}+\lvert{{\mathbb{P}}_{n}}(\epsilon)\rvert\left\lVert{{\mathbb{P}}_{n+N}}\left({G_{K_{n}}^{0}}\right)^{\top}\right\rVert\left\lVert\hat{\mathbf{\Sigma}}^{-1}\right\rVert_{2}
\displaystyle\leqslant\left\lVert{{\mathbb{P}}_{n}}\left[\epsilon\left({G_{K_{n}}^{0}}\right)^{\top}\right]\right\rVert\times O_{p}(1)+O_{p}\left(\frac{1}{\sqrt{n}}\right)\times O_{p}\left(\sqrt{\frac{K_{n}}{n}}\right)\times O_{p}(1)
=Op(n[ϵ(GKn0)])+op(1).\displaystyle=O_{p}\left(\left\lVert{{\mathbb{P}}_{n}}\left[\epsilon({G_{K_{n}}^{0}})^{\top}\right]\right\rVert\right)+o_{p}(1).

Moreover,

𝔼{n[ϵ(GKn0)]n[ϵ(GKn0)]}\displaystyle{\mathbb{E}}\left\{{{{\mathbb{P}}_{n}}\left[\epsilon\left({G_{K_{n}}^{0}}\right)^{\top}\right]{{\mathbb{P}}_{n}}\left[\epsilon\left({G_{K_{n}}^{0}}\right)\right]}\right\} =1n𝔼[ϵ2(GKn0)GKn0]\displaystyle=\frac{1}{n}{\mathbb{E}}\left[\epsilon^{2}\left({G_{K_{n}}^{0}}\right)^{\top}{G_{K_{n}}^{0}}\right]
sup𝒳{GKn0(x)GKn0(x)}n𝔼[ϵ2]\displaystyle\leqslant\frac{\sup_{{\mathcal{X}}}\left\{{{G_{K_{n}}^{0}}(x)^{\top}{G_{K_{n}}^{0}}(x)}\right\}}{n}{\mathbb{E}}[\epsilon^{2}]
=O(ζn2n)\displaystyle=O\left(\frac{\zeta_{n}^{2}}{n}\right)
=o(1).\displaystyle=o(1).

Therefore, by Markov’s inequality, $\left\lVert{{\mathbb{P}}_{n}}\left[\epsilon({G_{K_{n}}^{0}})^{\top}\right]\right\rVert=o_{p}(1)$, and hence $\text{V}=o_{p}(1)$. Finally, for VI, by the fact that $\left\lVert{{\mathbb{P}}_{n}}\left[{G_{K_{n}}^{0}}({G_{K_{n}}^{0}})^{\top}\right]-\mathbf{I}_{K_{n}}\right\rVert_{F}=o_{p}(1)$, the bound $\lvert{\text{tr}}(\mathbf{A})\rvert\leqslant\sqrt{K_{n}}\left\lVert\mathbf{A}\right\rVert_{F}$, and the continuous mapping theorem, we have:

n[(GKn0)GKn0]\displaystyle{{\mathbb{P}}_{n}}\left[({G_{K_{n}}^{0}})^{\top}{G_{K_{n}}^{0}}\right] =Kn+tr(n[GKn0(GKn0)]𝐈Kn)\displaystyle=K_{n}+{\text{tr}}\left({{\mathbb{P}}_{n}}\left[{G_{K_{n}}^{0}}({G_{K_{n}}^{0}})^{\top}\right]-\mathbf{I}_{K_{n}}\right)
\displaystyle=K_{n}+o_{p}\left(\sqrt{K_{n}}\right)
=Op(Kn).\displaystyle=O_{p}(K_{n}).

Then,

\displaystyle\left\lVert\text{VI}\right\rVert\leqslant\left\lVert{{\mathbb{P}}_{n}}\left[({{\varphi}_{\tilde{\eta}_{n}}}-{{\varphi}_{\eta^{*}}})({G_{K_{n}}^{0}})^{\top}\right]\right\rVert\left\lVert\hat{\mathbf{\Sigma}}^{-1}\right\rVert_{2}+\left\lVert{{\mathbb{P}}_{n}}({{\varphi}_{\tilde{\eta}_{n}}}-{{\varphi}_{\eta^{*}}})\right\rVert\left\lVert{{\mathbb{P}}_{n+N}}({G_{K_{n}}^{0}})^{\top}\right\rVert\left\lVert\hat{\mathbf{\Sigma}}^{-1}\right\rVert_{2}
\displaystyle\leqslant\left\lVert{{\mathbb{P}}_{n}}\left[({{\varphi}_{\tilde{\eta}_{n}}}-{{\varphi}_{\eta^{*}}})({G_{K_{n}}^{0}})^{\top}\right]\right\rVert\times O_{p}(1)+O_{p}(\rho(\tilde{\eta}_{n},\eta^{*}))\times O_{p}\left(\sqrt{\frac{K_{n}}{n}}\right)\times O_{p}(1)
=n[(φη~nφη)(GKn0)]×Op(1)+op(1)\displaystyle=\left\lVert{{\mathbb{P}}_{n}}\left[({{\varphi}_{\tilde{\eta}_{n}}}-{{\varphi}_{\eta^{*}}})({G_{K_{n}}^{0}})^{\top}\right]\right\rVert\times O_{p}(1)+o_{p}(1)
n[L(GKn0)]×Op(ρ(η~n,η))+op(1)\displaystyle\leqslant\left\lVert{{\mathbb{P}}_{n}}\left[L({G_{K_{n}}^{0}})^{\top}\right]\right\rVert\times O_{p}(\rho(\tilde{\eta}_{n},\eta^{*}))+o_{p}(1)
n(L2)n[(GKn0)GKn0]×Op(ρ(η~n,η))+op(1)\displaystyle\leqslant\sqrt{{{\mathbb{P}}_{n}}(L^{2}){{\mathbb{P}}_{n}}\left[({G_{K_{n}}^{0}})^{\top}{G_{K_{n}}^{0}}\right]}\times O_{p}(\rho(\tilde{\eta}_{n},\eta^{*}))+o_{p}(1)
=Op(ρ(η~n,η)Kn)+op(1)\displaystyle=O_{p}\left(\rho(\tilde{\eta}_{n},\eta^{*})\sqrt{K_{n}}\right)+o_{p}(1)
=op(1).\displaystyle=o_{p}(1).

Putting these results together, we see that $\text{I}=\left\lVert\hat{\mathbf{B}}_{K_{n},N}-\mathbf{B}_{K_{n}}\right\rVert^{2}_{F}=o_{p}(1)$, which finishes the proof. ∎
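
For concreteness, the quantity $\hat{\mathbf{B}}_{K_{n},N}$ analyzed above is simply the least-squares coefficient matrix obtained by regressing the estimated influence function on the pooled-sample-centered basis functions over the labeled data, under the normalization $\mathbf{\Sigma}=\mathbf{I}_{K_{n}}$ used at the start of the proof. The following Python sketch, which is not part of the formal argument, illustrates this computation; the one-dimensional covariate, the polynomial sieve, and the synthetic stand-ins for $\varphi_{\tilde{\eta}_{n}}(Z_{i})$ are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N, K = 200, 800, 5            # labeled size, unlabeled size, number of basis functions

# Placeholder data: covariates on the labeled and unlabeled samples, and synthetic
# stand-ins for the estimated influence-function values varphi_{tilde eta_n}(Z_i).
X_lab, X_unlab = rng.uniform(-1, 1, n), rng.uniform(-1, 1, N)
phi = np.sin(np.pi * X_lab) + rng.normal(scale=0.5, size=n)   # toy stand-in (p = 1)

def basis(x, K):
    """Polynomial sieve G_K(x) = (x, x^2, ..., x^K); any sieve basis could be used here."""
    return np.column_stack([x ** k for k in range(1, K + 1)])

G_lab = basis(X_lab, K)
G_all = basis(np.concatenate([X_lab, X_unlab]), K)

# Center the basis with the pooled sample: hat{G}^0_K = G_K - P_{n+N}(G_K).
G0_lab = G_lab - G_all.mean(axis=0)

# hat{B}_{K_n,N} = P_n[phi (hat{G}^0)^T] { P_n[hat{G}^0 (hat{G}^0)^T] }^{-1},
# i.e. a least-squares regression of phi on the centered basis over the labeled data.
Sigma_hat = G0_lab.T @ G0_lab / n
B_hat = (phi @ G0_lab / n) @ np.linalg.inv(Sigma_hat)
print("fitted coefficients:", np.round(B_hat, 3))
```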

Proof of Proposition 5.2.

By the definition of g^η~n0(x){\hat{g}^{0}_{\tilde{\eta}_{n}}}(x), we have:

g^η~n0(Xi)\displaystyle{\hat{g}^{0}_{\tilde{\eta}_{n}}}(X_{i}) =gη~n(Xi)n+N(gη~n)\displaystyle={g_{\tilde{\eta}_{n}}}(X_{i})-{{\mathbb{P}}_{n+N}}({g_{\tilde{\eta}_{n}}})
=gη~n(Xi)(gη~n)n+N[gη~n(gη~n)]\displaystyle={g_{\tilde{\eta}_{n}}}(X_{i})-{{\mathbb{P}}^{*}}({g_{\tilde{\eta}_{n}}})-{{\mathbb{P}}_{n+N}}\left[{g_{\tilde{\eta}_{n}}}-{{\mathbb{P}}^{*}}({g_{\tilde{\eta}_{n}}})\right]
=gη~n0(Xi)n+N(gη~n0).\displaystyle={g^{0}_{\tilde{\eta}_{n}}}(X_{i})-{{\mathbb{P}}_{n+N}}({g^{0}_{\tilde{\eta}_{n}}}).

For the estimator θ^n,Nsafe\hat{\theta}_{n,N}^{\text{safe}} defined in (41) and 𝐁^n,Ng\hat{\mathbf{B}}_{n,N}^{g} defined in (40),

n(θ^n,Nsafeθ)\displaystyle\sqrt{n}\left(\hat{\theta}_{n,N}^{\text{safe}}-\theta^{*}\right) =n(θ^nθ)𝐁^n,Ngnn(g^η~n0)\displaystyle=\sqrt{n}\left(\hat{\theta}_{n}-\theta^{*}\right)-\hat{\mathbf{B}}_{n,N}^{g}\sqrt{n}{{\mathbb{P}}_{n}}\left({\hat{g}^{0}_{\tilde{\eta}_{n}}}\right) (70)
=nn(φη)𝐁^n,Ngnn(gη~n0)+𝐁^n,Ngnn+N(gη~n0)+op(1)\displaystyle=\sqrt{n}{{\mathbb{P}}_{n}}({{\varphi}_{\eta^{*}}})-\hat{\mathbf{B}}_{n,N}^{g}\sqrt{n}{{\mathbb{P}}_{n}}\left({g^{0}_{\tilde{\eta}_{n}}}\right)+\hat{\mathbf{B}}_{n,N}^{g}\sqrt{n}{{\mathbb{P}}_{n+N}}\left({g^{0}_{\tilde{\eta}_{n}}}\right)+o_{p}(1)
=nn(φη)γ𝐁^n,Ngnn(gη~n0)+γ(1γ)𝐁^n,NgNN(gη~n0)+op(1),\displaystyle=\sqrt{n}{{\mathbb{P}}_{n}}({{\varphi}_{\eta^{*}}})-\gamma\hat{\mathbf{B}}_{n,N}^{g}\sqrt{n}{{\mathbb{P}}_{n}}\left({g^{0}_{\tilde{\eta}_{n}}}\right)+\sqrt{\gamma(1-\gamma)}\hat{\mathbf{B}}_{n,N}^{g}\sqrt{N}{{\mathbb{P}}_{N}}\left({g^{0}_{\tilde{\eta}_{n}}}\right)+o_{p}(1),

as Nn+Nγ(0,1]\frac{N}{n+N}\to\gamma\in(0,1].

By Assumption 5.1, the restricted class of functions {gη:η𝒪}\left\{{g_{\eta}:\eta\in{\mathcal{O}}}\right\} is {{\mathbb{P}}^{*}}-Donsker, and it satisfies gη~n(x)gη(x)2(X)=op(1)\left\lVert{g_{\tilde{\eta}_{n}}}(x)-{g_{\eta^{*}}}(x)\right\rVert_{{\mathcal{L}}^{2}({{\mathbb{P}}^{*}_{X}})}=o_{p}(1) by Lemma E.5. Therefore, by Lemma E.7, the centered class {gη0:η𝒪}\left\{{g^{0}_{\eta}:\eta\in{\mathcal{O}}}\right\} satisfies the conditions of Lemma E.5, and it then follows that gη~n0(x)gη0(x)2(X)=op(1)\left\lVert{g^{0}_{\tilde{\eta}_{n}}}(x)-{g^{0}_{\eta^{*}}}(x)\right\rVert_{{\mathcal{L}}^{2}({{\mathbb{P}}^{*}_{X}})}=o_{p}(1), and

𝔾n(gη~n0gη0)=op(1).{\mathbb{G}}_{n}({g^{0}_{\tilde{\eta}_{n}}}-{g^{0}_{\eta^{*}}})=o_{p}(1). (71)

As a result,

nn(gη~n0)\displaystyle\sqrt{n}{{\mathbb{P}}_{n}}\left({g^{0}_{\tilde{\eta}_{n}}}\right) =𝔾n(gη~n0gη0)+n(gη~n0)+nn(gη0)\displaystyle={\mathbb{G}}_{n}\left({g^{0}_{\tilde{\eta}_{n}}}-{g^{0}_{\eta^{*}}}\right)+\sqrt{n}{{\mathbb{P}}^{*}}\left({g^{0}_{\tilde{\eta}_{n}}}\right)+\sqrt{n}{{\mathbb{P}}_{n}}\left({g^{0}_{\eta^{*}}}\right)
=𝔾n(gη~n0gη0)+nn(gη0)\displaystyle={\mathbb{G}}_{n}\left({g^{0}_{\tilde{\eta}_{n}}}-{g^{0}_{\eta^{*}}}\right)+\sqrt{n}{{\mathbb{P}}_{n}}\left({g^{0}_{\eta^{*}}}\right)
=op(1)+Op(1)\displaystyle=o_{p}(1)+O_{p}(1)
=Op(1),\displaystyle=O_{p}(1),

where we used the fact that ${{\mathbb{P}}^{*}}\left({g^{0}_{\tilde{\eta}_{n}}}\right)=0$ by centering, and that $\sqrt{n}{{\mathbb{P}}_{n}}\left({g^{0}_{\eta^{*}}}\right)=O_{p}(1)$ by centering and the central limit theorem. Following a similar argument, $\sqrt{N}{{\mathbb{P}}_{N}}\left({g^{0}_{\tilde{\eta}_{n}}}\right)=O_{p}(1)$. Further, by (71),

nn(gη~n0)nn(gη0)\displaystyle\sqrt{n}{{\mathbb{P}}_{n}}({g^{0}_{\tilde{\eta}_{n}}})-\sqrt{n}{{\mathbb{P}}_{n}}({g^{0}_{\eta^{*}}})
=𝔾n(gη~n0gη0)n(gη~n0)+n(gη0)\displaystyle={\mathbb{G}}_{n}({g^{0}_{\tilde{\eta}_{n}}}-{g^{0}_{\eta^{*}}})-\sqrt{n}{{\mathbb{P}}^{*}}({g^{0}_{\tilde{\eta}_{n}}})+\sqrt{n}{{\mathbb{P}}^{*}}({g^{0}_{\eta^{*}}})
=𝔾n(gη~n0gη0)=op(1),\displaystyle={\mathbb{G}}_{n}({g^{0}_{\tilde{\eta}_{n}}}-{g^{0}_{\eta^{*}}})=o_{p}(1),

and similarly

\sqrt{N}{{\mathbb{P}}_{N}}({g^{0}_{\tilde{\eta}_{n}}})-\sqrt{N}{{\mathbb{P}}_{N}}({g^{0}_{\eta^{*}}})=o_{p}(1).

Therefore, (70) can be further modified as:

n(θ^n,Nsafeθ)\displaystyle\sqrt{n}(\hat{\theta}_{n,N}^{\text{safe}}-\theta^{*}) =nn(φη)γ𝐁^n,Ngnn(gη0)+γ(1γ)𝐁^n,NgNN(gη0)+op(1).\displaystyle=\sqrt{n}{{\mathbb{P}}_{n}}({{\varphi}_{\eta^{*}}})-\gamma\hat{\mathbf{B}}_{n,N}^{g}\sqrt{n}{{\mathbb{P}}_{n}}({g^{0}_{\eta^{*}}})+\sqrt{\gamma(1-\gamma)}\hat{\mathbf{B}}_{n,N}^{g}\sqrt{N}{{\mathbb{P}}_{N}}({g^{0}_{\eta^{*}}})+o_{p}(1). (72)

For convenience, let

𝐁^1=𝐁^n,Ng=n[φη~n(g^η~n0)]{n+N[g^η~n0(g^η~n0)]}1,\hat{\mathbf{B}}_{1}=\hat{\mathbf{B}}_{n,N}^{g}={{\mathbb{P}}_{n}}\left[{{\varphi}_{\tilde{\eta}_{n}}}\left({\hat{g}^{0}_{\tilde{\eta}_{n}}}\right)^{\top}\right]\left\{{{{\mathbb{P}}_{n+N}}\left[{\hat{g}^{0}_{\tilde{\eta}_{n}}}\left({\hat{g}^{0}_{\tilde{\eta}_{n}}}\right)^{\top}\right]}\right\}^{-1},
𝐁^2=n[φη~n(g^η0)]{n+N[g^η0(g^η0)]}1.\hat{\mathbf{B}}_{2}={{\mathbb{P}}_{n}}\left[{{\varphi}_{\tilde{\eta}_{n}}}\left({\hat{g}^{0}_{\eta^{*}}}\right)^{\top}\right]\left\{{{{\mathbb{P}}_{n+N}}\left[{\hat{g}^{0}_{\eta^{*}}}\left({\hat{g}^{0}_{\eta^{*}}}\right)^{\top}\right]}\right\}^{-1}.

We now show that 𝐁^1𝐁^22=op(1)\left\lVert\hat{\mathbf{B}}_{1}-\hat{\mathbf{B}}_{2}\right\rVert_{2}=o_{p}(1). First,

𝐁^1𝐁^22\displaystyle\left\lVert\hat{\mathbf{B}}_{1}-\hat{\mathbf{B}}_{2}\right\rVert_{2} n[φη~n(g^η~n0g^η0)]2I{n+N[g^η~n0(g^η~n0)]}12II\displaystyle\leqslant\underbrace{\left\lVert{{\mathbb{P}}_{n}}\left[{{\varphi}_{\tilde{\eta}_{n}}}({\hat{g}^{0}_{\tilde{\eta}_{n}}}-{\hat{g}^{0}_{\eta^{*}}})^{\top}\right]\right\rVert_{2}}_{\text{I}}\underbrace{\left\lVert\left\{{{{\mathbb{P}}_{n+N}}\left[{\hat{g}^{0}_{\tilde{\eta}_{n}}}({\hat{g}^{0}_{\tilde{\eta}_{n}}})^{\top}\right]}\right\}^{-1}\right\rVert_{2}}_{\text{II}}
+n[φη~n(g^η0)]2III{n+N[g^η~n0(g^η~n0)]}1{n+N[g^η0(g^η0)]}12IV.\displaystyle\quad+\underbrace{\left\lVert{{\mathbb{P}}_{n}}\left[{{\varphi}_{\tilde{\eta}_{n}}}({\hat{g}^{0}_{\eta^{*}}})^{\top}\right]\right\rVert_{2}}_{\text{III}}\underbrace{\left\lVert\left\{{{{\mathbb{P}}_{n+N}}\left[{\hat{g}^{0}_{\tilde{\eta}_{n}}}({\hat{g}^{0}_{\tilde{\eta}_{n}}})^{\top}\right]}\right\}^{-1}-\left\{{{{\mathbb{P}}_{n+N}}\left[{\hat{g}^{0}_{\eta^{*}}}({\hat{g}^{0}_{\eta^{*}}})^{\top}\right]}\right\}^{-1}\right\rVert_{2}}_{\text{IV}}.

For I, note that by Lemma E.6, we have n[φη~n(gη~n0gη0)]2=op(1)\left\lVert{{\mathbb{P}}_{n}}\left[{{\varphi}_{\tilde{\eta}_{n}}}({g^{0}_{\tilde{\eta}_{n}}}-{g^{0}_{\eta^{*}}})^{\top}\right]\right\rVert_{2}=o_{p}(1). Therefore,

In[φη~n(gη~n0gη0)]2+n(φη~n)[n+N(gη0)]2+n(φη~n)[n+N(gη~n0)]2\displaystyle\text{I}\leqslant\left\lVert{{\mathbb{P}}_{n}}\left[{{\varphi}_{\tilde{\eta}_{n}}}\left({g^{0}_{\tilde{\eta}_{n}}}-{g^{0}_{\eta^{*}}}\right)^{\top}\right]\right\rVert_{2}+\left\lVert{{\mathbb{P}}_{n}}\left({{\varphi}_{\tilde{\eta}_{n}}}\right)\left[{{\mathbb{P}}_{n+N}}\left({g^{0}_{\eta^{*}}}\right)\right]^{\top}\right\rVert_{2}+\left\lVert{{\mathbb{P}}_{n}}\left({{\varphi}_{\tilde{\eta}_{n}}}\right)\left[{{\mathbb{P}}_{n+N}}\left({g^{0}_{\tilde{\eta}_{n}}}\right)\right]^{\top}\right\rVert_{2}
=n(φη~n)[n+N(gη0)]2+n(φη~n)[n+N(gη~n0)]2+op(1)\displaystyle=\left\lVert{{\mathbb{P}}_{n}}\left({{\varphi}_{\tilde{\eta}_{n}}}\right)\left[{{\mathbb{P}}_{n+N}}\left({g^{0}_{\eta^{*}}}\right)\right]^{\top}\right\rVert_{2}+\left\lVert{{\mathbb{P}}_{n}}({{\varphi}_{\tilde{\eta}_{n}}})\left[{{\mathbb{P}}_{n+N}}\left({g^{0}_{\tilde{\eta}_{n}}}\right)\right]^{\top}\right\rVert_{2}+o_{p}(1)
n(φη~n)2n+N(gη0)2+n(φη~n)2n+N(gη~n0)2+op(1)\displaystyle\leqslant\left\lVert{{\mathbb{P}}_{n}}\left({{\varphi}_{\tilde{\eta}_{n}}}\right)\right\rVert_{2}\left\lVert{{\mathbb{P}}_{n+N}}\left({g^{0}_{\eta^{*}}}\right)\right\rVert_{2}+\left\lVert{{\mathbb{P}}_{n}}\left({{\varphi}_{\tilde{\eta}_{n}}}\right)\right\rVert_{2}\left\lVert{{\mathbb{P}}_{n+N}}\left({g^{0}_{\tilde{\eta}_{n}}}\right)\right\rVert_{2}+o_{p}(1)
=op(1),\displaystyle=o_{p}(1),

where we used the fact that n(φη~n)2=op(1)\left\lVert{{\mathbb{P}}_{n}}({{\varphi}_{\tilde{\eta}_{n}}})\right\rVert_{2}=o_{p}(1) and n+N(gη~n0)2=op(1)\left\lVert{{\mathbb{P}}_{n+N}}\left({g^{0}_{\tilde{\eta}_{n}}}\right)\right\rVert_{2}=o_{p}(1) by Lemma E.5.

For II, note that by Lemma E.5 we have n+N(gη~n0)=op(1)\left\lVert{{\mathbb{P}}_{n+N}}({g^{0}_{\tilde{\eta}_{n}}})\right\rVert=o_{p}(1). Thus,

n+N[g^η~n0(g^η~n0)gη~n0(gη~n0)]2\displaystyle\left\lVert{{\mathbb{P}}_{n+N}}\left[{\hat{g}^{0}_{\tilde{\eta}_{n}}}\left({\hat{g}^{0}_{\tilde{\eta}_{n}}}\right)^{\top}-{g^{0}_{\tilde{\eta}_{n}}}\left({g^{0}_{\tilde{\eta}_{n}}}\right)^{\top}\right]\right\rVert_{2} =n+N(gη~n0)n+N(gη~n0)2\displaystyle=\left\lVert{{\mathbb{P}}_{n+N}}\left({g^{0}_{\tilde{\eta}_{n}}}\right){{\mathbb{P}}_{n+N}}\left({g^{0}_{\tilde{\eta}_{n}}}\right)^{\top}\right\rVert_{2} (73)
n+N(gη~n0)22\displaystyle\leqslant\left\lVert{{\mathbb{P}}_{n+N}}\left({g^{0}_{\tilde{\eta}_{n}}}\right)\right\rVert^{2}_{2}
=op(1).\displaystyle=o_{p}(1).

Next, consider the term n+N[gη~n0(gη~n0)gη0(gη0)]2\left\lVert{{\mathbb{P}}_{n+N}}\left[{g^{0}_{\tilde{\eta}_{n}}}\left({g^{0}_{\tilde{\eta}_{n}}}\right)^{\top}-{g^{0}_{\eta^{*}}}\left({g^{0}_{\eta^{*}}}\right)^{\top}\right]\right\rVert_{2}. By Lemma E.7, the centered class {gη0:ηΩ}\left\{{g^{0}_{\eta}:\eta\in\Omega}\right\} satisfies the conditions of Lemma E.6. Therefore applying Lemma E.6, the two terms n+N[gη~n0(gη~n0gη0)]2\left\lVert{{\mathbb{P}}_{n+N}}\left[{g^{0}_{\tilde{\eta}_{n}}}\left({g^{0}_{\tilde{\eta}_{n}}}-{g^{0}_{\eta^{*}}}\right)^{\top}\right]\right\rVert_{2} and n+N[gη0(gη~n0gη0)]2\left\lVert{{\mathbb{P}}_{n+N}}\left[{g^{0}_{\eta^{*}}}\left({g^{0}_{\tilde{\eta}_{n}}}-{g^{0}_{\eta^{*}}}\right)^{\top}\right]\right\rVert_{2} are both op(1)o_{p}(1). Then,

n+N[gη~n0(gη~n0)gη0(gη0)]2\displaystyle\left\lVert{{\mathbb{P}}_{n+N}}\left[{g^{0}_{\tilde{\eta}_{n}}}\left({g^{0}_{\tilde{\eta}_{n}}}\right)^{\top}-{g^{0}_{\eta^{*}}}\left({g^{0}_{\eta^{*}}}\right)^{\top}\right]\right\rVert_{2}
n+N[gη~n0(gη~n0gη0)]2+n+N[gη0(gη~n0gη0)]2\displaystyle\leqslant\left\lVert{{\mathbb{P}}_{n+N}}\left[{g^{0}_{\tilde{\eta}_{n}}}\left({g^{0}_{\tilde{\eta}_{n}}}-{g^{0}_{\eta^{*}}}\right)^{\top}\right]\right\rVert_{2}+\left\lVert{{\mathbb{P}}_{n+N}}\left[{g^{0}_{\eta^{*}}}\left({g^{0}_{\tilde{\eta}_{n}}}-{g^{0}_{\eta^{*}}}\right)^{\top}\right]\right\rVert_{2}
=op(1).\displaystyle=o_{p}(1).

Combining these two results, we have:

n+N[g^η~n0(g^η~n0)gη0(gη0)]2\displaystyle\left\lVert{{\mathbb{P}}_{n+N}}\left[{\hat{g}^{0}_{\tilde{\eta}_{n}}}({\hat{g}^{0}_{\tilde{\eta}_{n}}})^{\top}-{g^{0}_{\eta^{*}}}({g^{0}_{\eta^{*}}})^{\top}\right]\right\rVert_{2} (74)
n+N[g^η~n0(g^η~n0)gη~n0(gη~n0)]2+n+N[gη~n0(gη~n0)gη0(gη0)]2\displaystyle\leqslant\left\lVert{{\mathbb{P}}_{n+N}}\left[{\hat{g}^{0}_{\tilde{\eta}_{n}}}({\hat{g}^{0}_{\tilde{\eta}_{n}}})^{\top}-{g^{0}_{\tilde{\eta}_{n}}}({g^{0}_{\tilde{\eta}_{n}}})^{\top}\right]\right\rVert_{2}+\left\lVert{{\mathbb{P}}_{n+N}}\left[{g^{0}_{\tilde{\eta}_{n}}}({g^{0}_{\tilde{\eta}_{n}}})^{\top}-{g^{0}_{\eta^{*}}}({g^{0}_{\eta^{*}}})^{\top}\right]\right\rVert_{2}
=op(1).\displaystyle=o_{p}(1).

Finally, since n+N[gη0(gη0)]𝑃[gη0(gη0)]0{{\mathbb{P}}_{n+N}}\left[{g^{0}_{\eta^{*}}}({g^{0}_{\eta^{*}}})^{\top}\right]\xrightarrow[]{P}{{\mathbb{P}}^{*}}\left[{g^{0}_{\eta^{*}}}({g^{0}_{\eta^{*}}})^{\top}\right]\succ 0, by continuous mapping,

II={n+N[g^η~n0(g^η~n0)]}12={[gη0(gη0)]}12+op(1)=Op(1).\text{II}=\left\lVert\left\{{{{\mathbb{P}}_{n+N}}\left[{\hat{g}^{0}_{\tilde{\eta}_{n}}}({\hat{g}^{0}_{\tilde{\eta}_{n}}})^{\top}\right]}\right\}^{-1}\right\rVert_{2}=\left\lVert\left\{{{{\mathbb{P}}^{*}}\left[{g^{0}_{\eta^{*}}}({g^{0}_{\eta^{*}}})^{\top}\right]}\right\}^{-1}\right\rVert_{2}+o_{p}(1)=O_{p}(1).

We showed that $\text{III}=O_{p}(1)$ in the proof of Theorem 4.3.

For IV, by Lemma E.5, both n+N(gη~n0)\left\lVert{{\mathbb{P}}_{n+N}}({g^{0}_{\tilde{\eta}_{n}}})\right\rVert and n+N(gη0)\left\lVert{{\mathbb{P}}_{n+N}}({g^{0}_{\eta^{*}}})\right\rVert are op(1)o_{p}(1). Consequently,

n+N[g^η~n0(g^η~n0)g^η0(g^η0)]2\displaystyle\left\lVert{{\mathbb{P}}_{n+N}}\left[{\hat{g}^{0}_{\tilde{\eta}_{n}}}({\hat{g}^{0}_{\tilde{\eta}_{n}}})^{\top}-{\hat{g}^{0}_{\eta^{*}}}({\hat{g}^{0}_{\eta^{*}}})^{\top}\right]\right\rVert_{2}
=n+N[gη~n0(gη~n0)gη0(gη0)]n+N(gη~n0)n+N(gη~n0)+n+N(gη0)n+N(gη0)2\displaystyle=\left\lVert{{\mathbb{P}}_{n+N}}\left[{g^{0}_{\tilde{\eta}_{n}}}({g^{0}_{\tilde{\eta}_{n}}})^{\top}-{g^{0}_{\eta^{*}}}({g^{0}_{\eta^{*}}})^{\top}\right]-{{\mathbb{P}}_{n+N}}({g^{0}_{\tilde{\eta}_{n}}}){{\mathbb{P}}_{n+N}}({g^{0}_{\tilde{\eta}_{n}}})^{\top}+{{\mathbb{P}}_{n+N}}({g^{0}_{\eta^{*}}}){{\mathbb{P}}_{n+N}}({g^{0}_{\eta^{*}}})^{\top}\right\rVert_{2}
n+N[gη~n0(gη~n0)gη0(gη0)]2+n+N(gη~n0)n+N(gη~n0)2+n+N(gη0)n+N(gη0)2\displaystyle\leqslant\left\lVert{{\mathbb{P}}_{n+N}}\left[{g^{0}_{\tilde{\eta}_{n}}}({g^{0}_{\tilde{\eta}_{n}}})^{\top}-{g^{0}_{\eta^{*}}}({g^{0}_{\eta^{*}}})^{\top}\right]\right\rVert_{2}+\left\lVert{{\mathbb{P}}_{n+N}}({g^{0}_{\tilde{\eta}_{n}}}){{\mathbb{P}}_{n+N}}({g^{0}_{\tilde{\eta}_{n}}})^{\top}\right\rVert_{2}+\left\lVert{{\mathbb{P}}_{n+N}}({g^{0}_{\eta^{*}}}){{\mathbb{P}}_{n+N}}({g^{0}_{\eta^{*}}})^{\top}\right\rVert_{2}
n+N[gη~n0(gη~n0)gη0(gη0)]2+n+N(gη~n0)22+n+N(gη0)22\displaystyle\leqslant\left\lVert{{\mathbb{P}}_{n+N}}\left[{g^{0}_{\tilde{\eta}_{n}}}({g^{0}_{\tilde{\eta}_{n}}})^{\top}-{g^{0}_{\eta^{*}}}({g^{0}_{\eta^{*}}})^{\top}\right]\right\rVert_{2}+\left\lVert{{\mathbb{P}}_{n+N}}({g^{0}_{\tilde{\eta}_{n}}})\right\rVert^{2}_{2}+\left\lVert{{\mathbb{P}}_{n+N}}({g^{0}_{\eta^{*}}})\right\rVert^{2}_{2}
=op(1),\displaystyle=o_{p}(1),

where $\left\lVert{{\mathbb{P}}_{n+N}}\left[{g^{0}_{\tilde{\eta}_{n}}}({g^{0}_{\tilde{\eta}_{n}}})^{\top}-{g^{0}_{\eta^{*}}}({g^{0}_{\eta^{*}}})^{\top}\right]\right\rVert_{2}=o_{p}(1)$ was established above. Further, we have shown that $\text{II}=\left\lVert\left\{{{{\mathbb{P}}_{n+N}}\left[{\hat{g}^{0}_{\tilde{\eta}_{n}}}({\hat{g}^{0}_{\tilde{\eta}_{n}}})^{\top}\right]}\right\}^{-1}\right\rVert_{2}=O_{p}(1)$. Similarly, it can be shown that $\left\lVert\left\{{{{\mathbb{P}}_{n+N}}\left[{\hat{g}^{0}_{\eta^{*}}}({\hat{g}^{0}_{\eta^{*}}})^{\top}\right]}\right\}^{-1}\right\rVert_{2}=O_{p}(1)$. Moreover, we have

n+N[g^η0(g^η0)gη0(gη0)]2\displaystyle\left\lVert{{\mathbb{P}}_{n+N}}\left[{\hat{g}^{0}_{\eta^{*}}}({\hat{g}^{0}_{\eta^{*}}})^{\top}-{g^{0}_{\eta^{*}}}({g^{0}_{\eta^{*}}})^{\top}\right]\right\rVert_{2} =n+N(gη0)n+N(gη0)2\displaystyle=\left\lVert{{\mathbb{P}}_{n+N}}\left({g^{0}_{\eta^{*}}}\right){{\mathbb{P}}_{n+N}}\left({g^{0}_{\eta^{*}}}\right)^{\top}\right\rVert_{2} (75)
n+N(gη0)22\displaystyle\leqslant\left\lVert{{\mathbb{P}}_{n+N}}\left({g^{0}_{\eta^{*}}}\right)\right\rVert^{2}_{2}
=op(1),\displaystyle=o_{p}(1),

which, combined with (74), gives

n+N[g^η~n0(g^η~n0)g^η0(g^η0)]2=op(1).\left\lVert{{\mathbb{P}}_{n+N}}\left[{\hat{g}^{0}_{\tilde{\eta}_{n}}}({\hat{g}^{0}_{\tilde{\eta}_{n}}})^{\top}-{\hat{g}^{0}_{\eta^{*}}}({\hat{g}^{0}_{\eta^{*}}})^{\top}\right]\right\rVert_{2}=o_{p}(1).

Finally, from the fact that 𝐀1𝐁1=𝐁1(𝐁𝐀)𝐀1\mathbf{A}^{-1}-\mathbf{B}^{-1}=\mathbf{B}^{-1}(\mathbf{B}-\mathbf{A})\mathbf{A}^{-1}, it follows that:

IV={n+N[g^η~n0(g^η~n0)]}1{n+N[g^η0(g^η0)]}12\displaystyle\text{IV}=\left\lVert\left\{{{{\mathbb{P}}_{n+N}}\left[{\hat{g}^{0}_{\tilde{\eta}_{n}}}({\hat{g}^{0}_{\tilde{\eta}_{n}}})^{\top}\right]}\right\}^{-1}-\left\{{{{\mathbb{P}}_{n+N}}\left[{\hat{g}^{0}_{\eta^{*}}}({\hat{g}^{0}_{\eta^{*}}})^{\top}\right]}\right\}^{-1}\right\rVert_{2}
n+N[g^η~n0(g^η~n0)g^η0(g^η0)]2{n+N[g^η~n0(g^η~n0)]}12{n+N[g^η0(g^η0)]}12\displaystyle\leqslant\left\lVert{{\mathbb{P}}_{n+N}}\left[{\hat{g}^{0}_{\tilde{\eta}_{n}}}({\hat{g}^{0}_{\tilde{\eta}_{n}}})^{\top}-{\hat{g}^{0}_{\eta^{*}}}({\hat{g}^{0}_{\eta^{*}}})^{\top}\right]\right\rVert_{2}\left\lVert\left\{{{{\mathbb{P}}_{n+N}}\left[{\hat{g}^{0}_{\tilde{\eta}_{n}}}({\hat{g}^{0}_{\tilde{\eta}_{n}}})^{\top}\right]}\right\}^{-1}\right\rVert_{2}\left\lVert\left\{{{{\mathbb{P}}_{n+N}}\left[{\hat{g}^{0}_{\eta^{*}}}({\hat{g}^{0}_{\eta^{*}}})^{\top}\right]}\right\}^{-1}\right\rVert_{2}
=op(1).\displaystyle=o_{p}(1).

Combining the above results, we have 𝐁^1𝐁^22=op(1)\left\lVert\hat{\mathbf{B}}_{1}-\hat{\mathbf{B}}_{2}\right\rVert_{2}=o_{p}(1). As we have shown that nn(gη~n0)=Op(1)\sqrt{n}{{\mathbb{P}}_{n}}({g^{0}_{\tilde{\eta}_{n}}})=O_{p}(1) and NN(gη~n0)=Op(1)\sqrt{N}{{\mathbb{P}}_{N}}({g^{0}_{\tilde{\eta}_{n}}})=O_{p}(1), (72) can be expressed as

n(θ^n,Nsafeθ)\displaystyle\sqrt{n}\left(\hat{\theta}_{n,N}^{\text{safe}}-\theta^{*}\right) =nn(φη)γ𝐁^2nn(gη0)\displaystyle=\sqrt{n}{{\mathbb{P}}_{n}}({{\varphi}_{\eta^{*}}})-\gamma\hat{\mathbf{B}}_{2}\sqrt{n}{{\mathbb{P}}_{n}}({g^{0}_{\eta^{*}}})
+γ(1γ)𝐁^2NN(gη0)+op(1).\displaystyle\quad+\sqrt{\gamma(1-\gamma)}\hat{\mathbf{B}}_{2}\sqrt{N}{{\mathbb{P}}_{N}}({g^{0}_{\eta^{*}}})+o_{p}(1).

The rest of the proof follows that of Theorem 4.3. ∎
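
For readers who find the one-step construction easier to follow in code, the sketch below implements (41) in the simplest possible setting — estimating a mean with the working function $g(x)=x$ — using synthetic data. These choices are illustrative assumptions rather than the paper's prescription; the point is only to show how $\hat{\mathbf{B}}^{g}_{n,N}$ and the pooled-sample centering enter the estimator.

```python
import numpy as np

rng = np.random.default_rng(1)
n, N = 300, 1200

# Toy data-generating process (an illustrative assumption).
X_lab = rng.normal(size=n)
Y_lab = 2.0 * X_lab + rng.normal(size=n)
X_unlab = rng.normal(size=N)

# Supervised estimator and its estimated influence function for theta* = E[Y]:
# theta_hat = mean(Y), varphi(z) = y - theta_hat.
theta_hat = Y_lab.mean()
phi = Y_lab - theta_hat

# Working function g(x) = x, centered with the pooled sample: hat{g}^0 = g - P_{n+N}(g).
g_all = np.concatenate([X_lab, X_unlab])
g0_lab = X_lab - g_all.mean()
g0_pooled = g_all - g_all.mean()

# hat{B}^g_{n,N} as in (40): P_n[varphi hat{g}^0] { P_{n+N}[(hat{g}^0)^2] }^{-1} (scalar case).
B_hat = np.mean(phi * g0_lab) / np.mean(g0_pooled ** 2)

# Safe semi-supervised estimator (41): remove the projected labeled-sample fluctuation.
theta_safe = theta_hat - B_hat * g0_lab.mean()
print("supervised:", round(theta_hat, 4), " safe semi-supervised:", round(theta_safe, 4))
```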

Proof of Proposition 6.1.

First, we show that the tangent space relative to 𝒬{\mathcal{Q}} at ×W{{\mathbb{P}}^{*}}\times{{\mathbb{P}}_{W}^{*}}, denoted as 𝒯𝒬(×W){{\mathcal{T}}_{\mathcal{Q}}({{\mathbb{P}}^{*}}\times{{\mathbb{P}}_{W}^{*}})}, is

={sX(x)+wsYX(z):sX(x)𝒯𝒫X(X),sYX(z)𝒯𝒫YX(YX)},{\mathcal{M}}=\left\{{s_{X}(x)+ws_{Y\mid X}(z):s_{X}(x)\in{{\mathcal{T}}_{{\mathcal{P}}_{X}}({{\mathbb{P}}^{*}_{X}})},s_{Y\mid X}(z)\in{{\mathcal{T}}_{{\mathcal{P}}_{Y\mid X}}({{\mathbb{P}}^{*}_{Y\mid X}})}}\right\},

where 𝒯𝒫X(X){{\mathcal{T}}_{{\mathcal{P}}_{X}}({{\mathbb{P}}^{*}_{X}})} is the tangent space at X{{\mathbb{P}}^{*}_{X}} relative to 𝒫X{{\mathcal{P}}_{X}} and 𝒯𝒫YX(YX){{\mathcal{T}}_{{\mathcal{P}}_{Y\mid X}}({{\mathbb{P}}^{*}_{Y\mid X}})} is the tangent space at YX{{\mathbb{P}}^{*}_{Y\mid X}} relative to 𝒫YX{{\mathcal{P}}_{Y\mid X}}.

We first show that 𝒯𝒬(×W){{\mathcal{T}}_{\mathcal{Q}}({{\mathbb{P}}^{*}}\times{{\mathbb{P}}_{W}^{*}})}\subseteq{\mathcal{M}}. Consider any one-dimensional regular parametric sub-model of 𝒬{\mathcal{Q}}, which can be represented as

{×W:𝒫T},\left\{{{\mathbb{P}}\times{{\mathbb{P}}_{W}^{*}}:{\mathbb{P}}\in{\mathcal{P}}_{T}}\right\},

where 𝒫T{\mathcal{P}}_{T} is a one-dimensional regular parametric sub-model of 𝒫{{\mathcal{P}}} such that

𝒫T={pt(z)=pt,X(x)pt,YX(z):tT},{\mathcal{P}}_{T}=\left\{{p_{t}(z)=p_{t,X}(x)p_{t,Y\mid X}(z):t\in T}\right\},

where $p_{t^{*}}(z)$ corresponds to the density of ${{\mathbb{P}}^{*}}$ for some $t^{*}\in T$. Denote the score function of $p_{t}(z)$ at $t^{*}$ as $s(z)=s_{X}(x)+s_{Y\mid X}(z)$, where $s_{X}(x)$ is the score function of $p_{t,X}(x)$ at $t^{*}$, and $s_{Y\mid X}(z)$ is the score function of $p_{t,Y\mid X}(y\mid x)$ at $t^{*}$. The density of $(Z,W)$ is thus

pt,X(x)[(1γ)pt,YX(z)]wγ1w,p_{t,X}(x)\left[(1-\gamma)p_{t,Y\mid X}(z)\right]^{w}\gamma^{1-w},

and its score function at $t^{*}$ is $s_{X}(x)+ws_{Y\mid X}(z)$. Because ${\mathcal{P}}_{T}$ is a one-dimensional regular parametric sub-model of ${\mathcal{P}}$, which has the form (1), it must be true that $\left\{p_{t,X}(x):t\in T\right\}$ is a one-dimensional regular parametric sub-model of ${\mathcal{P}}_{X}$, and $\left\{p_{t,Y\mid X}(z):t\in T\right\}$ is a one-dimensional regular parametric sub-model of ${\mathcal{P}}_{Y\mid X}$. Therefore, the corresponding score functions satisfy $s_{X}(x)\in{{\mathcal{T}}_{{\mathcal{P}}_{X}}({{\mathbb{P}}^{*}_{X}})}$ and $s_{Y\mid X}(z)\in{{\mathcal{T}}_{{\mathcal{P}}_{Y\mid X}}({{\mathbb{P}}^{*}_{Y\mid X}})}$, and hence $s_{X}(x)+ws_{Y\mid X}(z)\in{\mathcal{M}}$. We then have ${{\mathcal{T}}_{\mathcal{Q}}({{\mathbb{P}}^{*}}\times{{\mathbb{P}}_{W}^{*}})}\subseteq{\mathcal{M}}$ by the fact that ${\mathcal{M}}$ is a closed linear space.

Now we show the other direction, i.e., ${\mathcal{M}}\subseteq{{\mathcal{T}}_{\mathcal{Q}}({{\mathbb{P}}^{*}}\times{{\mathbb{P}}_{W}^{*}})}$. Consider an arbitrary element $s_{X}(x)+ws_{Y\mid X}(z)$ of ${\mathcal{M}}$, where $s_{X}(x)\in{{\mathcal{T}}_{{\mathcal{P}}_{X}}({{\mathbb{P}}^{*}_{X}})}$ and $s_{Y\mid X}(z)\in{{\mathcal{T}}_{{\mathcal{P}}_{Y\mid X}}({{\mathbb{P}}^{*}_{Y\mid X}})}$. Suppose $s_{X}(x)$ is the score function of some parametric sub-model $\left\{p_{t,X}:t\in T\subset{\mathbb{R}}\right\}$ of ${\mathcal{P}}_{X}$ at ${{\mathbb{P}}^{*}_{X}}$ and $s_{Y\mid X}(z)$ is the score function of some parametric sub-model $\left\{p_{t,Y\mid X}:t\in T\subset{\mathbb{R}}\right\}$ of ${\mathcal{P}}_{Y\mid X}$ at ${{\mathbb{P}}^{*}_{Y\mid X}}$. Then $s_{X}(x)+ws_{Y\mid X}(z)$ is the score function at ${{\mathbb{P}}^{*}}\times{{\mathbb{P}}_{W}^{*}}$ of

{×W:𝒫T},\left\{{{\mathbb{P}}\times{{\mathbb{P}}_{W}^{*}}:{\mathbb{P}}\in{\mathcal{P}}_{T}}\right\},

where ${\mathcal{P}}_{T}=\left\{p_{t,X}(x)p_{t,Y\mid X}(z):t\in T\subset{\mathbb{R}}\right\}$, which proves $s_{X}(x)+ws_{Y\mid X}(z)\in{{\mathcal{T}}_{\mathcal{Q}}({{\mathbb{P}}^{*}}\times{{\mathbb{P}}_{W}^{*}})}$ and hence ${\mathcal{M}}\subseteq{{\mathcal{T}}_{\mathcal{Q}}({{\mathbb{P}}^{*}}\times{{\mathbb{P}}_{W}^{*}})}$.

Pathwise differentiability of ${\theta(\cdot)}$ at ${{\mathbb{P}}^{*}}$ relative to ${\mathcal{P}}$ implies, for any one-dimensional parametric sub-model ${\mathcal{P}}_{T}\subset{\mathcal{P}}$ such that $p_{t^{*}}(z)$ corresponds to the density of ${{\mathbb{P}}^{*}}$,

dθ(pt)dt=φη(z),sX(x)+sYX(z)p,02(),\frac{d\theta(p_{t^{*}})}{dt}=\langle{{\varphi}^{*}_{\eta^{*}}}(z),s_{X}(x)+s_{Y\mid X}(z)\rangle_{{\mathcal{L}}^{2}_{p,0}({{\mathbb{P}}^{*}})},

where s(z)=sX(x)+sYX(z)s(z)=s_{X}(x)+s_{Y\mid X}(z) is the score function of 𝒫T{\mathcal{P}}_{T} at {{\mathbb{P}}^{*}}, sX(x)𝒯𝒫X(X)s_{X}(x)\in{{\mathcal{T}}_{{\mathcal{P}}_{X}}({{\mathbb{P}}^{*}_{X}})} and sYX(z)𝒯𝒫YX(YX)s_{Y\mid X}(z)\in{{\mathcal{T}}_{{\mathcal{P}}_{Y\mid X}}({{\mathbb{P}}^{*}_{Y\mid X}})}. This maps one-to-one to a parametric sub-model {×W:𝒫T}\left\{{{\mathbb{P}}\times{{\mathbb{P}}_{W}^{*}}:{\mathbb{P}}\in{\mathcal{P}}_{T}}\right\} of 𝒬{\mathcal{Q}} with score function sX(x)+wsYX(z)𝒯𝒬(×W)s_{X}(x)+ws_{Y\mid X}(z)\in{{\mathcal{T}}_{\mathcal{Q}}({{\mathbb{P}}^{*}}\times{{\mathbb{P}}_{W}^{*}})}. We thus have:

dθ(pt)dt\displaystyle\frac{d\theta(p_{t^{*}})}{dt} =φη(z),sX(x)+sYX(z)p,02()\displaystyle=\langle{{\varphi}^{*}_{\eta^{*}}}(z),s_{X}(x)+s_{Y\mid X}(z)\rangle_{{\mathcal{L}}^{2}_{p,0}({{\mathbb{P}}^{*}})}
\displaystyle=\langle{{\varphi}^{*}_{\eta^{*}}}(z)-{\phi^{*}_{\eta^{*}}}(x),s_{Y\mid X}(z)\rangle_{{\mathcal{L}}^{2}_{p,0}({{\mathbb{P}}^{*}_{Y\mid X}})}+\langle{\phi^{*}_{\eta^{*}}}(x),s_{X}(x)\rangle_{{\mathcal{L}}^{2}_{p,0}({{\mathbb{P}}^{*}_{X}})}
\displaystyle=\left\langle\frac{w}{1-\gamma}[{{\varphi}^{*}_{\eta^{*}}}(z)-{\phi^{*}_{\eta^{*}}}(x)],ws_{Y\mid X}(z)\right\rangle_{{\mathcal{L}}^{2}_{p,0}({{\mathbb{P}}^{*}_{Y\mid X}})}+\langle{\phi^{*}_{\eta^{*}}}(x),s_{X}(x)\rangle_{{\mathcal{L}}^{2}_{p,0}({{\mathbb{P}}^{*}_{X}})}
\displaystyle=\left\langle\frac{w}{1-\gamma}[{{\varphi}^{*}_{\eta^{*}}}(z)-{\phi^{*}_{\eta^{*}}}(x)]+{\phi^{*}_{\eta^{*}}}(x),s_{X}(x)+ws_{Y\mid X}(z)\right\rangle_{{\mathcal{L}}^{2}_{p,0}({{\mathbb{P}}^{*}})}.

This implies that the function

w1γ[φη(z)ϕη(x)]+ϕη(x)\frac{w}{1-\gamma}[{{\varphi}^{*}_{\eta^{*}}}(z)-{\phi^{*}_{\eta^{*}}}(x)]+{\phi^{*}_{\eta^{*}}}(x)

is a gradient of ${\theta(\cdot)}$ at ${{\mathbb{P}}^{*}}\times{{\mathbb{P}}_{W}^{*}}$ relative to ${\mathcal{Q}}$. As shown in the proof of Theorem 3.1, $a^{\top}[{{\varphi}^{*}_{\eta^{*}}}(z)-{\phi^{*}_{\eta^{*}}}(x)]\in{{\mathcal{T}}_{{\mathcal{P}}_{Y\mid X}}({{\mathbb{P}}^{*}_{Y\mid X}})}$ and $a^{\top}{\phi^{*}_{\eta^{*}}}(x)\in{{\mathcal{T}}_{{\mathcal{P}}_{X}}({{\mathbb{P}}^{*}_{X}})}$ for $a\in{\mathbb{R}}^{p}$. Therefore, we see that

a{w1γ[φη(z)ϕη(x)]+ϕη(x)}𝒯𝒬(×W),a^{\top}\left\{{\frac{w}{1-\gamma}[{{\varphi}^{*}_{\eta^{*}}}(z)-{\phi^{*}_{\eta^{*}}}(x)]+{\phi^{*}_{\eta^{*}}}(x)}\right\}\in{{\mathcal{T}}_{\mathcal{Q}}({{\mathbb{P}}^{*}}\times{{\mathbb{P}}_{W}^{*}})},

for all $a\in{\mathbb{R}}^{p}$. By definition, $\frac{w}{1-\gamma}[{{\varphi}^{*}_{\eta^{*}}}(z)-{\phi^{*}_{\eta^{*}}}(x)]+{\phi^{*}_{\eta^{*}}}(x)$ is the efficient influence function of ${\theta(\cdot)}$ at ${{\mathbb{P}}^{*}}\times{{\mathbb{P}}_{W}^{*}}$ relative to ${\mathcal{Q}}$. Note that the above derivation corresponds to the sample size $n+N$. Therefore, we multiply by a factor of $\lim_{n\to\infty}\sqrt{\frac{n}{n+N}}=\sqrt{1-\gamma}$ to obtain the efficient influence function corresponding to the sample size $n$. By the independence of $W$ and $Z$, the semiparametric efficiency lower bound can be expressed as:

Var{δ1γ[φη(Z)ϕη(X)]+1γϕη(X)}\displaystyle\mathrm{Var}\left\{{\frac{\delta}{\sqrt{1-\gamma}}[{{\varphi}^{*}_{\eta^{*}}}(Z)-{\phi^{*}_{\eta^{*}}}(X)]+\sqrt{1-\gamma}{\phi^{*}_{\eta^{*}}}(X)}\right\}
=Var{δ1γ[φη(Z)ϕη(X)]}+Var[1γϕη(X)]\displaystyle=\mathrm{Var}\left\{{\frac{\delta}{\sqrt{1-\gamma}}[{{\varphi}^{*}_{\eta^{*}}}(Z)-{\phi^{*}_{\eta^{*}}}(X)]}\right\}+\mathrm{Var}[\sqrt{1-\gamma}{\phi^{*}_{\eta^{*}}}(X)]
+Cov{δ1γ[φη(Z)ϕη(X)],1γϕη(X)}\displaystyle\quad+\mathrm{Cov}\left\{{\frac{\delta}{\sqrt{1-\gamma}}[{{\varphi}^{*}_{\eta^{*}}}(Z)-{\phi^{*}_{\eta^{*}}}(X)],\sqrt{1-\gamma}{\phi^{*}_{\eta^{*}}}(X)}\right\}
+Cov{1γϕη(X),δ1γ[φη(Z)ϕη(X)]}\displaystyle\quad+\mathrm{Cov}\left\{{\sqrt{1-\gamma}{\phi^{*}_{\eta^{*}}}(X),\frac{\delta}{\sqrt{1-\gamma}}[{{\varphi}^{*}_{\eta^{*}}}(Z)-{\phi^{*}_{\eta^{*}}}(X)]}\right\}
=𝔼[δ2]1γVar{[φη(Z)ϕη(X)]}+(1γ)Var[ϕη(X)]\displaystyle=\frac{{\mathbb{E}}[\delta^{2}]}{1-\gamma}\mathrm{Var}\left\{{[{{\varphi}^{*}_{\eta^{*}}}(Z)-{\phi^{*}_{\eta^{*}}}(X)]}\right\}+(1-\gamma)\mathrm{Var}[{\phi^{*}_{\eta^{*}}}(X)]
+𝔼[δ]Cov{φη(Z)ϕη(X),ϕη(X)}+𝔼[δ]Cov{ϕη(X),φη(Z)ϕη(X)}\displaystyle\quad+{\mathbb{E}}[\delta]\mathrm{Cov}\left\{{{{\varphi}^{*}_{\eta^{*}}}(Z)-{\phi^{*}_{\eta^{*}}}(X),{\phi^{*}_{\eta^{*}}}(X)}\right\}+{\mathbb{E}}[\delta]\mathrm{Cov}\left\{{{\phi^{*}_{\eta^{*}}}(X),{{\varphi}^{*}_{\eta^{*}}}(Z)-{\phi^{*}_{\eta^{*}}}(X)}\right\}
=Var[φη(Z)ϕη(X)]+(1γ)Var[ϕη(X)],\displaystyle=\mathrm{Var}\left[{{\varphi}^{*}_{\eta^{*}}}(Z)-{\phi^{*}_{\eta^{*}}}(X)\right]+(1-\gamma)\mathrm{Var}[{\phi^{*}_{\eta^{*}}}(X)],

where in the last step we use Lemma E.1 to show that Cov{ϕη(X),φη(Z)ϕη(X)}=0\mathrm{Cov}\left\{{{\phi^{*}_{\eta^{*}}}(X),{{\varphi}^{*}_{\eta^{*}}}(Z)-{\phi^{*}_{\eta^{*}}}(X)}\right\}=0. ∎
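
The closing decomposition, $\mathrm{Var}[{{\varphi}^{*}_{\eta^{*}}}(Z)-{\phi^{*}_{\eta^{*}}}(X)]+(1-\gamma)\mathrm{Var}[{\phi^{*}_{\eta^{*}}}(X)]$, can be checked numerically in simple cases. The sketch below is a rough Monte Carlo sanity check for mean estimation under an assumed model with known regression function (all modeling choices are illustrative assumptions): an oracle estimator built from the efficient influence function has scaled variance close to the bound.

```python
import numpy as np

rng = np.random.default_rng(2)
n, N, R = 200, 800, 2000
gamma = N / (n + N)

# Assumed toy model: X ~ N(0,1), Y | X ~ N(m(X), 1) with m(x) = x + x**2, so that
# phi(z) = y - theta*, phi_tilde(x) = m(x) - theta*, and theta* = E[Y].
m = lambda x: x + x ** 2
var_resid = 1.0                                          # Var[Y - E[Y | X]]
var_cond_mean = np.var(m(rng.normal(size=10 ** 6)))      # Var[E[Y | X]], by simulation

bound = var_resid + (1 - gamma) * var_cond_mean          # efficiency lower bound

est = np.empty(R)
for r in range(R):
    X_lab, X_unlab = rng.normal(size=n), rng.normal(size=N)
    Y_lab = m(X_lab) + rng.normal(size=n)
    # Oracle estimator: labeled-sample mean of the residual y - m(x) plus the
    # pooled-sample mean of the (known) regression function m(x).
    est[r] = np.mean(Y_lab - m(X_lab)) + np.mean(m(np.concatenate([X_lab, X_unlab])))

print("n * Var(oracle estimator):", round(n * est.var(), 3))
print("semiparametric bound     :", round(bound, 3))
```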

Appendix C Connection to prediction-powered inference

In this section, we analyze existing PPI estimators through the lens of the proposed framework. We will show that (i) many existing PPI estimators, as well as our proposed safe PPI estimator (41), can be analyzed in a unified manner; (ii) the proposed safe PPI estimator is optimally efficient among these PPI estimators; and (iii) none of these PPI estimators achieves the efficiency bound of Theorem 4.1 without strong assumptions on the machine learning prediction model.

We consider the setting of M-estimation as described in Section 7.1 of the main text. (Our results can be extended to Z-estimation.) Let $f:{\mathcal{X}}\to{\mathcal{Y}}$ denote a machine learning prediction model trained on independent data. Define ${\nabla m^{f}_{\theta^{*}}}(x)={\nabla m_{\theta^{*}}}(x,f(x))$ and ${\nabla\tilde{m}_{\theta^{*}}}(x)={\mathbb{E}}[{\nabla m_{\theta^{*}}}(Z)\mid X=x]$. Further, let $\hat{\theta}_{n}$ denote the supervised M-estimator (47), and suppose that suitable conditions hold so that $\hat{\theta}_{n}$ is regular and asymptotically linear with influence function ${{\varphi}_{\eta^{*}}}(z)$ in (48). Finally, suppose that the conditions of Proposition 5.2 hold.

First, we present a unified approach to analyze existing PPI estimators, as well as the proposed safe PPI estimator (41). Consider semi-supervised estimators of θ\theta^{*} that are regular and asymptotically linear in the sense of Definitions 4.1 and E.2, with influence function

φ𝐀f(z1,x2)=𝐕θ1[mθ(z1)𝐀mθf(x1)+𝐀mθf(x2)],{\varphi}^{f}_{\mathbf{A}}(z_{1},x_{2})=-{\mathbf{V}_{\theta^{*}}^{-1}}\left[{\nabla m_{\theta^{*}}}(z_{1})-\mathbf{A}{\nabla m^{f}_{\theta^{*}}}(x_{1})+\mathbf{A}{\nabla m^{f}_{\theta^{*}}}(x_{2})\right], (76)

where $\mathbf{A}\in{\mathbb{R}}^{p\times p}$. This class includes a number of existing PPI estimators, such as the proposals of Angelopoulos et al. [2023a, b], Miao et al. [2023], and Gan and Liang [2023], as well as the proposed safe PPI estimator (41); different estimators correspond to different choices of $\mathbf{A}$. Specifically, the original PPI estimator [Angelopoulos et al., 2023a] takes $\mathbf{A}=\mathbf{I}_{p}$. Angelopoulos et al. [2023b] improve the original PPI estimator by introducing a tuning weight $\omega\in{\mathbb{R}}$ and considering $\mathbf{A}=\omega\mathbf{I}_{p}$. They show that the value of $\omega$ that minimizes the trace of the asymptotic variance of their estimator is

ω=γtr(𝐕θ1{Cov[mθf(X),mθ(Z)]+Cov[mθ(Z),mθf(X)]}𝐕θ1)2tr{𝐕θ1Var[mθf(X)]𝐕θ1}.\omega=\frac{\gamma{\text{tr}}\left({\mathbf{V}_{\theta^{*}}^{-1}}\left\{{\mathrm{Cov}[{\nabla m^{f}_{\theta^{*}}}(X),{\nabla m_{\theta^{*}}}(Z)]+\mathrm{Cov}[{\nabla m_{\theta^{*}}}(Z),{\nabla m^{f}_{\theta^{*}}}(X)]}\right\}{\mathbf{V}_{\theta^{*}}^{-1}}\right)}{2{\text{tr}}\left\{{{\mathbf{V}_{\theta^{*}}^{-1}}\mathrm{Var}[{\nabla m^{f}_{\theta^{*}}}(X)]{\mathbf{V}_{\theta^{*}}^{-1}}}\right\}}.

Miao et al. [2023] instead let $\mathbf{A}=\text{diag}(\boldsymbol{\omega})$, a diagonal matrix with tuning weights $\boldsymbol{\omega}\in{\mathbb{R}}^{p}$, and set the weights to minimize the element-wise asymptotic variances:

ωj=[𝐕θ1Cov[mθ(Z),mθf(X)]𝐕θ1]jj[𝐕θ1Var[mθf(X)]𝐕θ1]jj,\omega_{j}=\frac{\left[{\mathbf{V}_{\theta^{*}}^{-1}}\mathrm{Cov}[{\nabla m_{\theta^{*}}}(Z),{\nabla m^{f}_{\theta^{*}}}(X)]{\mathbf{V}_{\theta^{*}}^{-1}}\right]_{jj}}{\left[{\mathbf{V}_{\theta^{*}}^{-1}}\mathrm{Var}[{\nabla m^{f}_{\theta^{*}}}(X)]{\mathbf{V}_{\theta^{*}}^{-1}}\right]_{jj}},

for each $j\in[p]$, where $\omega_{j}$ is the $j$-th element of $\boldsymbol{\omega}$. Finally, Gan and Liang [2023] consider

𝐀=ωCov[mθ(Z),mθf(X)]Var[mθf(X)]1,\mathbf{A}=\omega\mathrm{Cov}[{\nabla m_{\theta^{*}}}(Z),{\nabla m^{f}_{\theta^{*}}}(X)]\mathrm{Var}[{\nabla m^{f}_{\theta^{*}}}(X)]^{-1},

for a tuning weight ω\omega\in{\mathbb{R}} and show that the optimal weight is ω=γ\omega=\gamma.
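
To make these choices of $\mathbf{A}$ concrete, the sketch below computes sample analogues of the three weights above in a toy two-dimensional mean-estimation problem, for which $\nabla m_{\theta}(z)=\theta-y$, $\mathbf{V}_{\theta^{*}}=\mathbf{I}_{p}$, and ${\nabla m^{f}_{\theta^{*}}}(x)=\theta-f(x)$. The simulated data and the simulated "black-box" predictions $f$ are purely illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 500, 2
gamma = 0.8                      # limit of N / (n + N), assumed known here

# Toy setup: theta* = E[Y] with Y in R^2; predictions f(X) are simulated as a
# noisy, miscalibrated version of E[Y | X] to mimic a pre-trained model.
X = rng.normal(size=(n, 3))
EY = np.column_stack([X[:, 0] + X[:, 1] ** 2, np.tanh(X[:, 2])])
Y = EY + rng.normal(scale=0.7, size=(n, p))
f = 0.8 * EY + 0.3 + rng.normal(scale=0.2, size=(n, p))

theta_hat = Y.mean(axis=0)
grad = theta_hat - Y             # nabla m_theta(z) for the squared-error loss
grad_f = theta_hat - f           # nabla m^f_theta(x)
V_inv = np.eye(p)                # V_theta* = I_p for this loss

Sig_ff = np.cov(grad_f, rowvar=False)                  # Var[nabla m^f]
Sig_gf = np.cov(grad, grad_f, rowvar=False)[:p, p:]    # Cov[nabla m, nabla m^f]

# Power-tuned PPI: A = omega * I_p with omega minimizing the trace criterion.
omega_pt = gamma * np.trace(V_inv @ (Sig_gf + Sig_gf.T) @ V_inv) / (
    2.0 * np.trace(V_inv @ Sig_ff @ V_inv))

# Element-wise weights: A = diag(omega_1, ..., omega_p).
omega_ew = np.diag(V_inv @ Sig_gf @ V_inv) / np.diag(V_inv @ Sig_ff @ V_inv)

# A* = gamma * Cov[nabla m, nabla m^f] Var[nabla m^f]^{-1}, which is also the
# choice implied by the safe PPI estimator (see below).
A_star = gamma * Sig_gf @ np.linalg.inv(Sig_ff)

print("power-tuned omega :", round(omega_pt, 3))
print("element-wise omega:", np.round(omega_ew, 3))
print("A*:\n", np.round(A_star, 3))
```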

Now, consider the proposed safe PPI estimator (41) with gη(x)=φη(x,f(x)){g_{\eta^{*}}}(x)={\varphi}_{\eta^{*}}(x,f(x)) and with 𝒢={φη(x,f(x)):ηΩ}{\mathcal{G}}=\left\{{{\varphi}_{\eta}(x,f(x)):\eta\in\Omega}\right\}, where η()=(𝐕θ()1(),θ())\eta({\mathbb{P}})=\left(\mathbf{V}^{-1}_{\theta({\mathbb{P}})}({\mathbb{P}}),\theta({\mathbb{P}})\right). Then, (43) can be written as

𝐁gη\displaystyle\mathbf{B}^{g_{\eta^{*}}} =𝔼[φη(Z)gη0(X)]𝔼[gη0(X)gη0(X)]1\displaystyle={\mathbb{E}}\left[{{\varphi}_{\eta^{*}}}(Z){g^{0}_{\eta^{*}}}(X)^{\top}\right]{\mathbb{E}}\left[{g^{0}_{\eta^{*}}}(X){g^{0}_{\eta^{*}}}(X)^{\top}\right]^{-1}
=𝐕θ1Cov[mθ(Z),mθf(X)]Var[mθf(X)]1𝐕θ.\displaystyle={\mathbf{V}_{\theta^{*}}^{-1}}\mathrm{Cov}\left[{\nabla m_{\theta^{*}}}(Z),{\nabla m^{f}_{\theta^{*}}}(X)\right]\mathrm{Var}[{\nabla m^{f}_{\theta^{*}}}(X)]^{-1}{\mathbf{V}_{\theta^{*}}}.

By Proposition 5.2, (41) is regular and asymptotically linear with influence function

\displaystyle{{\varphi}_{\eta^{*}}}(z_{1})-\gamma\mathbf{B}^{g_{\eta^{*}}}{g^{0}_{\eta^{*}}}(x_{1})+\gamma\mathbf{B}^{g_{\eta^{*}}}{g^{0}_{\eta^{*}}}(x_{2}) (77)
=𝐕θ1[mθ(z1)𝐀mθf(x1)+𝐀mθf(x2)]\displaystyle=-{\mathbf{V}_{\theta^{*}}^{-1}}\left[{\nabla m_{\theta^{*}}}(z_{1})-\mathbf{A}^{*}{\nabla m^{f}_{\theta^{*}}}(x_{1})+\mathbf{A}^{*}{\nabla m^{f}_{\theta^{*}}}(x_{2})\right]
=φ𝐀f(z1,x2),\displaystyle={\varphi}^{f}_{\mathbf{A}^{*}}(z_{1},x_{2}),

where

𝐀=γCov[mθ(Z),mθf(X)]Var[mθf(X)]1.\mathbf{A}^{*}=\gamma\mathrm{Cov}\left[{\nabla m_{\theta^{*}}}(Z),{\nabla m^{f}_{\theta^{*}}}(X)\right]\mathrm{Var}[{\nabla m^{f}_{\theta^{*}}}(X)]^{-1}.

Therefore, (41) has influence function in the form of (76) with $\mathbf{A}=\mathbf{A}^{*}$. In fact, this is the same influence function as that of Gan and Liang [2023] with the optimal weight $\omega=\gamma$.

Next, we show that (77) has the smallest asymptotic variance among all PPI estimators with influence function of the form (76). That is, the proposed safe estimator (41) is at least as efficient as existing PPI estimators. Recall that under the OSS setting, the asymptotic variance of an estimator with influence function φη(z1,x2){{\varphi}_{\eta^{*}}}(z_{1},x_{2}) is φη(z1,x2),φη(z1,x2)\langle{{\varphi}_{\eta^{*}}}(z_{1},x_{2}),{{\varphi}_{\eta^{*}}}(z_{1},x_{2})^{\top}\rangle_{\mathcal{H}}.

Proposition C.1.

For all 𝐀p×p\mathbf{A}\in{\mathbb{R}}^{p\times p}, the following holds:

φ𝐀f(z1,x2),φ𝐀f(z1,x2)φ𝐀f(z1,x2),φ𝐀f(z1,x2).\left\langle{\varphi}^{f}_{\mathbf{A}}(z_{1},x_{2}),{\varphi}^{f}_{\mathbf{A}}(z_{1},x_{2})^{\top}\right\rangle_{\mathcal{H}}\succeq\left\langle{\varphi}^{f}_{\mathbf{A}^{*}}(z_{1},x_{2}),{\varphi}^{f}_{\mathbf{A}^{*}}(z_{1},x_{2})^{\top}\right\rangle_{\mathcal{H}}.
Proof.

By the linearity of inner product,

\left\langle h_{1},h_{1}^{\top}\right\rangle_{\mathcal{H}}=\left\langle h_{1}-h_{2},(h_{1}-h_{2})^{\top}\right\rangle_{\mathcal{H}}+\left\langle h_{2},h_{2}^{\top}\right\rangle_{\mathcal{H}}+\left\langle h_{1}-h_{2},h_{2}^{\top}\right\rangle_{\mathcal{H}}+\left\langle h_{2},(h_{1}-h_{2})^{\top}\right\rangle_{\mathcal{H}},

for arbitrary h1h_{1} and h2h_{2}. Therefore, it suffices to prove that the cross term satisfies

φ𝐀f(z1,x2)φ𝐀f(z1,x2),φ𝐀f(z1,x2)=0\left\langle{\varphi}^{f}_{\mathbf{A}}(z_{1},x_{2})-{\varphi}^{f}_{\mathbf{A}^{*}}(z_{1},x_{2}),{\varphi}^{f}_{\mathbf{A}^{*}}(z_{1},x_{2})^{\top}\right\rangle_{\mathcal{H}}=0

for all 𝐀p×p\mathbf{A}\in{\mathbb{R}}^{p\times p}, which we now prove.

By definition, we have

φ𝐀f(z1,x2)φ𝐀f(z1,x2),φ𝐀f(z1,x2)\displaystyle\left\langle{\varphi}^{f}_{\mathbf{A}}(z_{1},x_{2})-{\varphi}^{f}_{\mathbf{A}^{*}}(z_{1},x_{2}),{\varphi}^{f}_{\mathbf{A}^{*}}(z_{1},x_{2})^{\top}\right\rangle_{\mathcal{H}} (78)
\displaystyle={\mathbf{V}_{\theta^{*}}^{-1}}\left\langle-(\mathbf{A}-\mathbf{A}^{*}){\nabla m^{f}_{\theta^{*}}}(x_{1})+(\mathbf{A}-\mathbf{A}^{*}){\nabla m^{f}_{\theta^{*}}}(x_{2}),\left[{\nabla m_{\theta^{*}}}(z_{1})-\mathbf{A}^{*}{\nabla m^{f}_{\theta^{*}}}(x_{1})+\mathbf{A}^{*}{\nabla m^{f}_{\theta^{*}}}(x_{2})\right]^{\top}\right\rangle_{\mathcal{H}}{\mathbf{V}_{\theta^{*}}^{-1}}
\displaystyle=-{\mathbf{V}_{\theta^{*}}^{-1}}(\mathbf{A}-\mathbf{A}^{*})\mathrm{Cov}\left[{\nabla m^{f}_{\theta^{*}}}(X),{\nabla m_{\theta^{*}}}(Z)-\mathbf{A}^{*}{\nabla m^{f}_{\theta^{*}}}(X)\right]{\mathbf{V}_{\theta^{*}}^{-1}}
+1γγ𝐕θ1(𝐀𝐀)Var[mθf(X)](𝐀)𝐕θ1.\displaystyle\quad+\frac{1-\gamma}{\gamma}{\mathbf{V}_{\theta^{*}}^{-1}}(\mathbf{A}-\mathbf{A}^{*})\mathrm{Var}[{\nabla m^{f}_{\theta^{*}}}(X)](\mathbf{A}^{*})^{\top}{\mathbf{V}_{\theta^{*}}^{-1}}.

Recall that 𝐀=γCov[mθ(Z),mθf(X)]Var[mθf(X)]1\mathbf{A}^{*}=\gamma\mathrm{Cov}\left[{\nabla m_{\theta^{*}}}(Z),{\nabla m^{f}_{\theta^{*}}}(X)\right]\mathrm{Var}[{\nabla m^{f}_{\theta^{*}}}(X)]^{-1}. Therefore, we have

\mathrm{Cov}\left[{\nabla m^{f}_{\theta^{*}}}(X),{\nabla m_{\theta^{*}}}(Z)-\mathbf{A}^{*}{\nabla m^{f}_{\theta^{*}}}(X)\right]=\frac{1-\gamma}{\gamma}\mathrm{Var}[{\nabla m^{f}_{\theta^{*}}}(X)](\mathbf{A}^{*})^{\top}.

Plugging this into (78), it then follows that

φ𝐀f(z1,x2)φ𝐀f(z1,x2),φ𝐀f(z1,x2)=0,\left\langle{\varphi}^{f}_{\mathbf{A}}(z_{1},x_{2})-{\varphi}^{f}_{\mathbf{A}^{*}}(z_{1},x_{2}),{\varphi}^{f}_{\mathbf{A}^{*}}(z_{1},x_{2})^{\top}\right\rangle_{\mathcal{H}}=0,

which then finishes the proof. ∎

Proposition C.1 shows that (41) attains the smallest asymptotic variance among estimators with influence function of the form (76). We note that, for a fair comparison, we consider the case in which a single machine learning model $f(\cdot)$ is used by all estimators. However, the proposed safe PPI estimator can incorporate multiple machine learning models in a principled way, whereas it is unclear how to do this for most existing PPI estimators. Moreover, our estimator can handle general inferential problems beyond M-estimation, such as U-statistics.

Our next result shows that estimators with influence function of the form (76) cannot achieve the semiparametric efficiency bound in the OSS setting unless strong assumptions are imposed on the machine learning prediction model ff. To discuss efficiency in M-estimation (or Z-estimation), we informally consider a nonparametric model in the form of (1) such that the supervised M-estimator (47) is efficient (in the supervised setting). Proposition C.2 provides a necessary condition for estimators with influence function (76) to be efficient in the OSS setting.

Proposition C.2.

Let 𝒮f={𝐀mθf(x)𝐀𝔼[mθf(X)]:𝐀p×p}{\mathcal{S}}_{f}=\left\{{\mathbf{A}{\nabla m^{f}_{\theta^{*}}}(x)-\mathbf{A}{\mathbb{E}}[{\nabla m^{f}_{\theta^{*}}}(X)]:\mathbf{A}\in{\mathbb{R}}^{p\times p}}\right\} denote the linear space generated by mθf(x)𝔼[mθf(X)]{\nabla m^{f}_{\theta^{*}}}(x)-{\mathbb{E}}[{\nabla m^{f}_{\theta^{*}}}(X)] in p,02(X){{\mathcal{L}}^{2}_{p,0}({{\mathbb{P}}^{*}_{X}})}. If

m~θ(x)𝒮f,{\nabla\tilde{m}_{\theta^{*}}}(x)\notin{\mathcal{S}}_{f}, (79)

then (76) cannot be the efficient influence function of θ(){\theta(\cdot)} at (){{\mathbb{Q}}({{\mathbb{P}}^{*}})} relative to model (1).

Proof.

Specializing to the case of M-estimation, by Theorem 4.1, the efficient influence function is

φη(z1,x2)=𝐕θ1[mθ(z1)γm~θ(x1)+γm~θ(x2)].{{\varphi}_{\eta^{*}}}(z_{1},x_{2})=-{\mathbf{V}_{\theta^{*}}^{-1}}[{\nabla m_{\theta^{*}}}(z_{1})-\gamma{\nabla\tilde{m}_{\theta^{*}}}(x_{1})+\gamma{\nabla\tilde{m}_{\theta^{*}}}(x_{2})].

Therefore (76) is the efficient influence function if and only if

𝐀mθf(X)𝐀𝔼[mθf(X)]=γm~θ(X)\mathbf{A}{\nabla m^{f}_{\theta^{*}}}(X)-\mathbf{A}{\mathbb{E}}[{\nabla m^{f}_{\theta^{*}}}(X)]=\gamma{\nabla\tilde{m}_{\theta^{*}}}(X)

X{{\mathbb{P}}^{*}_{X}}-almost surely. This cannot happen if m~θ(x)𝒮f{\nabla\tilde{m}_{\theta^{*}}}(x)\notin{\mathcal{S}}_{f}. ∎

To better interpret condition (79), we consider a simple case where the target of inference is the expectation of $Y$: $\theta^{*}={\mathbb{E}}[Y]$, with loss function $m_{\theta}(z)=\frac{1}{2}(y-\theta)^{2}$. It follows that $\nabla m_{\theta}(z)=\theta-y$, ${\mathbf{V}_{\theta^{*}}}=\mathbf{I}_{p}$, ${\nabla\tilde{m}_{\theta^{*}}}(x)=\theta^{*}-{\mathbb{E}}[Y\mid X=x]$, and ${\nabla m^{f}_{\theta^{*}}}(x)=\theta^{*}-f(x)$. We see that, in this case, condition (79) fails if and only if

f(x)=𝐀𝔼[YX=x]+b,f(x)=\mathbf{A}{\mathbb{E}}[Y\mid X=x]+b,

for some 𝐀p×p\mathbf{A}\in{\mathbb{R}}^{p\times p} and bpb\in{\mathbb{R}}^{p}. In other words, a necessary condition for these estimators to be efficient is that the machine learning prediction model is a linear transformation of the true regression function. Of course, this is unlikely to hold for black-box machine learning predictions in practice.
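
This point can also be illustrated numerically. In the sketch below (all modeling choices are illustrative assumptions), the smallest asymptotic variance attainable within the class (76) in the scalar mean case — which, following the proof of Proposition C.1, works out to $\mathrm{Var}[Y]-\gamma\,\mathrm{Cov}[Y,f(X)]^{2}/\mathrm{Var}[f(X)]$ — matches the OSS efficiency bound $\mathrm{Var}[Y-{\mathbb{E}}[Y\mid X]]+(1-\gamma)\mathrm{Var}[{\mathbb{E}}[Y\mid X]]$ when $f$ is an affine transformation of ${\mathbb{E}}[Y\mid X]$, and remains strictly above it otherwise.

```python
import numpy as np

rng = np.random.default_rng(5)
n, gamma = 10 ** 6, 0.8

# Toy model (illustrative assumption): E[Y | X] = m(X) = X + X**2, Var[Y | X] = 1.
X = rng.normal(size=n)
m = X + X ** 2
Y = m + rng.normal(size=n)

# OSS efficiency bound for theta* = E[Y].
bound = np.var(Y - m) + (1 - gamma) * np.var(m)

def best_in_class(f):
    """Smallest variance over the PPI class (76): Var[Y] - gamma * Cov(Y, f)^2 / Var(f)."""
    c = np.cov(Y, f)[0, 1]
    return np.var(Y) - gamma * c ** 2 / np.var(f)

print("efficiency bound            :", round(bound, 3))
print("f affine in E[Y|X]          :", round(best_in_class(2.0 * m + 1.0), 3))
print("f nonlinear in E[Y|X] (tanh):", round(best_in_class(np.tanh(m)), 3))
```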

We end this section by noting that although the safe PPI estimator (41) may not be efficient, the proposed efficient estimator (34) can always achieve the semiparametric efficiency bound under regularity conditions. Further, as the numerical experiments in Section 8 show, the safe PPI estimator performs well when the prediction model is accurate.

Appendix D Proof of claims in Section 7

D.1 Proof of claim in Example 7.2

Recall the GLM estimator

θ^n=argmax𝜃{1ni=1n[YiθXib(Xiθ)]}\hat{\theta}_{n}=\underset{\theta}{\arg\max}\left\{{\frac{1}{n}\sum_{i=1}^{n}\left[Y_{i}\theta^{\top}X_{i}-b\left(X_{i}^{\top}\theta\right)\right]}\right\} (80)

which is regular and asymptotically linear with influence function

φη(z)=𝔼[b(2)(Xθ)XX]1[yxb(1)(xθ)x]{{\varphi}_{\eta^{*}}}(z)={\mathbb{E}}\left[b^{(2)}(X^{\top}\theta^{*})XX^{\top}\right]^{-1}\left[yx-b^{(1)}(x^{\top}\theta^{*})x\right]

under regularity conditions, where η=(𝔼[b(2)(Xθ)XX]1,θ)\eta^{*}=\left({\mathbb{E}}\left[b^{(2)}(X^{\top}\theta^{*})XX^{\top}\right]^{-1},\theta^{*}\right).

Proposition D.1.

Suppose that both ${\mathcal{X}}$ and $\Theta$ are compact. Then, the GLM estimator $\hat{\theta}_{n}$ satisfies the conditions of Proposition 3.4 with estimator

η~n=([1ni=1nb(2)(Xiθ^n)XiXi]1,θ^n).\tilde{\eta}_{n}=\left(\left[\frac{1}{n}\sum_{i=1}^{n}b^{(2)}(X_{i}^{\top}\hat{\theta}_{n})X_{i}X_{i}^{\top}\right]^{-1},\hat{\theta}_{n}\right).
Proof.

We first prove that $\hat{\theta}_{n}$ satisfies the local Lipschitz condition of Proposition 3.4. Denote $\eta^{(1)}({\mathbb{P}})={\mathbb{E}}_{\mathbb{P}}\left[b^{(2)}(X^{\top}\theta({\mathbb{P}}))XX^{\top}\right]^{-1}\in{\mathbb{R}}^{p\times p}$ and $\eta^{(2)}({\mathbb{P}})=\theta({\mathbb{P}})$. Consider the metric $\rho(\eta_{1},\eta_{2})=\left\lVert\eta_{1}^{(1)}-\eta_{2}^{(1)}\right\rVert_{2}+\left\lVert\eta_{1}^{(2)}-\eta_{2}^{(2)}\right\rVert$. Since both ${\mathcal{X}}$ and $\Theta$ are compact, we have:

φη1(z)φη2(z)\displaystyle\left\lVert{\varphi}_{\eta_{1}}(z)-{\varphi}_{\eta_{2}}(z)\right\rVert
=(η1(1)η2(1))yxη1(1)b(1)(xη1(2))+η2(1)b(1)(xη2(2))\displaystyle=\left\lVert\left(\eta_{1}^{(1)}-\eta_{2}^{(1)}\right)yx-\eta_{1}^{(1)}b^{(1)}\left(x^{\top}\eta_{1}^{(2)}\right)+\eta_{2}^{(1)}b^{(1)}\left(x^{\top}\eta_{2}^{(2)}\right)\right\rVert
=(η1(1)η2(1))yxη1(1)[b(1)(xη1(2))b(1)(xη2(2))]+(η2(1)η1(1))b(1)(xη2(2))\displaystyle=\left\lVert\left(\eta_{1}^{(1)}-\eta_{2}^{(1)}\right)yx-\eta_{1}^{(1)}\left[b^{(1)}\left(x^{\top}\eta_{1}^{(2)}\right)-b^{(1)}\left(x^{\top}\eta_{2}^{(2)}\right)\right]+\left(\eta_{2}^{(1)}-\eta_{1}^{(1)}\right)b^{(1)}\left(x^{\top}\eta_{2}^{(2)}\right)\right\rVert
η1(1)η2(1)2yx+η1(1)2|b(1)(xη1(2))b(1)(xη2(2))|+η2(1)η1(1)2|b(1)(xη2(2))|.\displaystyle\leqslant\left\lVert\eta_{1}^{(1)}-\eta_{2}^{(1)}\right\rVert_{2}\left\lVert yx\right\rVert+\left\lVert\eta_{1}^{(1)}\right\rVert_{2}\left\lvert b^{(1)}\left(x^{\top}\eta_{1}^{(2)}\right)-b^{(1)}\left(x^{\top}\eta_{2}^{(2)}\right)\right\rvert+\left\lVert\eta_{2}^{(1)}-\eta_{1}^{(1)}\right\rVert_{2}\left\lvert b^{(1)}\left(x^{\top}\eta_{2}^{(2)}\right)\right\rvert.

Since ${\mathbb{E}}\left[b^{(2)}\left(X^{\top}\theta^{*}\right)XX^{\top}\right]\succ 0$, we can find a sufficiently small neighborhood ${\mathcal{O}}$ containing $\eta^{*}$ such that $\sup_{\eta\in{\mathcal{O}}}\left\lVert\eta^{(1)}\right\rVert_{2}<\infty$. Since $b^{(1)}\left(x^{\top}\theta\right)$ is differentiable,

|b(1)(xη1(2))b(1)(xη2(2))|supx𝒳,η𝒪|b(2)(xη(2))|xη1(2)η2(2),\displaystyle\left\lvert b^{(1)}\left(x^{\top}\eta_{1}^{(2)}\right)-b^{(1)}\left(x^{\top}\eta_{2}^{(2)}\right)\right\rvert\leqslant\sup_{x\in{\mathcal{X}},\eta\in{\mathcal{O}}}\left\lvert b^{(2)}(x^{\top}\eta^{(2)})\right\rvert\left\lVert x\right\rVert\left\lVert\eta_{1}^{(2)}-\eta_{2}^{(2)}\right\rVert,

where supx𝒳,η𝒪|b(2)(xη(2))|<\sup_{x\in{\mathcal{X}},\eta\in{\mathcal{O}}}\left\lvert b^{(2)}\left(x^{\top}\eta^{(2)}\right)\right\rvert<\infty as b(2)()b^{(2)}(\cdot) is continuous, 𝒳{\mathcal{X}} is compact, and 𝒪{\mathcal{O}} is bounded. Additionally, we have

|b(1)(xη2(2))|supη𝒪|b(1)(xη(2))|.\left\lvert b^{(1)}\left(x^{\top}\eta_{2}^{(2)}\right)\right\rvert\leqslant\sup_{\eta\in{\mathcal{O}}}\left\lvert b^{(1)}\left(x^{\top}\eta^{(2)}\right)\right\rvert.

Similarly, we can always find a sufficiently small neighborhood 𝒪{\mathcal{O}} containing η\eta^{*} such that

supη𝒪|b(1)(xη(2))|12().\sup_{\eta\in{\mathcal{O}}}\left\lvert b^{(1)}\left(x^{\top}\eta^{(2)}\right)\right\rvert\in{\mathcal{L}}^{2}_{1}({{\mathbb{P}}^{*}}).

Putting these results together, we obtain

φη1(z)φη2(z)\displaystyle\left\lVert{\varphi}_{\eta_{1}}(z)-{\varphi}_{\eta_{2}}(z)\right\rVert
η1(1)η2(1)2yx+supη𝒪η(1)2supx𝒳,η𝒪|b(2)(xη(2))|xη1(2)η2(2)\displaystyle\leqslant\left\lVert\eta_{1}^{(1)}-\eta_{2}^{(1)}\right\rVert_{2}\left\lVert yx\right\rVert+\sup_{\eta\in{\mathcal{O}}}\left\lVert\eta^{(1)}\right\rVert_{2}\sup_{x\in{\mathcal{X}},\eta\in{\mathcal{O}}}\left\lvert b^{(2)}\left(x^{\top}\eta^{(2)}\right)\right\rvert\left\lVert x\right\rVert\left\lVert\eta_{1}^{(2)}-\eta_{2}^{(2)}\right\rVert
+supη𝒪|b(1)(xη(2))|η2(1)η1(1)2\displaystyle\quad+\sup_{\eta\in{\mathcal{O}}}\left\lvert b^{(1)}\left(x^{\top}\eta^{(2)}\right)\right\rvert\left\lVert\eta_{2}^{(1)}-\eta_{1}^{(1)}\right\rVert_{2}
L(z)ρ(η1,η2),\displaystyle\leqslant L(z)\rho(\eta_{1},\eta_{2}),

where

L(z)=max{yx+supη𝒪|b(1)(xη(2))|,(supη𝒪η(1)2supx𝒳,η𝒪|b(2)(xη(2))|)x}12().L(z)=\max\left\{{\left\lVert yx\right\rVert+\sup_{\eta\in{\mathcal{O}}}\left\lvert b^{(1)}\left(x^{\top}\eta^{(2)}\right)\right\rvert,\quad\left(\sup_{\eta\in{\mathcal{O}}}\left\lVert\eta^{(1)}\right\rVert_{2}\sup_{x\in{\mathcal{X}},\eta\in{\mathcal{O}}}\left\lvert b^{(2)}\left(x^{\top}\eta^{(2)}\right)\right\rvert\right)\left\lVert x\right\rVert}\right\}\in{\mathcal{L}}_{1}^{2}({{\mathbb{P}}^{*}}).

Next, we show that a consistent estimator of $\eta^{*}$ is $\tilde{\eta}_{n}=\left(\left[\frac{1}{n}\sum_{i=1}^{n}b^{(2)}\left(X_{i}^{\top}\hat{\theta}_{n}\right)X_{i}X_{i}^{\top}\right]^{-1},\hat{\theta}_{n}\right)$. Since $b^{(2)}(\cdot)$ is continuous, $b^{(2)}\left(x^{\top}\theta\right)xx^{\top}$ is a continuous (matrix-valued) function of $\theta$ for every $x$. Then, $\left\{b^{(2)}\left(x^{\top}\theta\right)xx^{\top}:\theta\in\Theta\right\}$ is a Glivenko–Cantelli class (component-wise) by Example 19.9 of Van der Vaart [2000] and the compactness of $\Theta$. Therefore,

1ni=1nb(2)(Xiθ^n)XiXi𝔼[b(2)(Xθ^n)XX]2=op(1).\left\lVert\frac{1}{n}\sum_{i=1}^{n}b^{(2)}\left(X_{i}^{\top}\hat{\theta}_{n}\right)X_{i}X_{i}^{\top}-{\mathbb{E}}\left[b^{(2)}\left(X^{\top}\hat{\theta}_{n}\right)XX^{\top}\right]\right\rVert_{2}=o_{p}(1).

Similarly,

|b(2)(xθ1)b(2)(xθ2)|supx𝒳,θΘ|b(3)(xθ)|xθ1θ2,\displaystyle\left\lvert b^{(2)}\left(x^{\top}\theta_{1}\right)-b^{(2)}\left(x^{\top}\theta_{2}\right)\right\rvert\leqslant\sup_{x\in{\mathcal{X}},\theta\in\Theta}\left\lvert b^{(3)}\left(x^{\top}\theta\right)\right\rvert\left\lVert x\right\rVert\left\lVert\theta_{1}-\theta_{2}\right\rVert,

where supx𝒳,θΘ|b(3)(xθ)|<\sup_{x\in{\mathcal{X}},\theta\in\Theta}\left\lvert b^{(3)}\left(x^{\top}\theta\right)\right\rvert<\infty as b(3)()b^{(3)}(\cdot) is continuous and both 𝒳{\mathcal{X}} and Θ\Theta are compact. Therefore, it follows that

\left\lVert{\mathbb{E}}\left\{{\left[b^{(2)}\left(X^{\top}\hat{\theta}_{n}\right)-b^{(2)}\left(X^{\top}\theta^{*}\right)\right]XX^{\top}}\right\}\right\rVert_{2}\leqslant{\mathbb{E}}\left[\left\lVert X\right\rVert^{3}\right]\sup_{x\in{\mathcal{X}},\theta\in\Theta}\left\lvert b^{(3)}\left(x^{\top}\theta\right)\right\rvert\left\lVert\hat{\theta}_{n}-\theta^{*}\right\rVert=o_{p}(1)

as θ^n\hat{\theta}_{n} is consistent. Putting these results together, we have

1ni=1nb(2)(Xiθ^n)XiXi𝔼[b(2)(Xθ)XX]2=op(1),\left\lVert\frac{1}{n}\sum_{i=1}^{n}b^{(2)}\left(X_{i}^{\top}\hat{\theta}_{n}\right)X_{i}X_{i}^{\top}-{\mathbb{E}}\left[b^{(2)}\left(X^{\top}\theta^{*}\right)XX^{\top}\right]\right\rVert_{2}=o_{p}(1),

and by continuous mapping,

[1ni=1nb(2)(Xiθ^n)XiXi]1{𝔼[b(2)(Xθ)XX]}12=op(1).\left\lVert\left[\frac{1}{n}\sum_{i=1}^{n}b^{(2)}\left(X_{i}^{\top}\hat{\theta}_{n}\right)X_{i}X_{i}^{\top}\right]^{-1}-\left\{{{\mathbb{E}}\left[b^{(2)}\left(X^{\top}\theta^{*}\right)XX^{\top}\right]}\right\}^{-1}\right\rVert_{2}=o_{p}(1).
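For concreteness, the plug-in quantity \tilde{\eta}_{n} above is straightforward to compute from the labeled sample. The following is a minimal numerical sketch, assuming a Poisson GLM with canonical link (so that b^{(2)}(u)=\exp(u)); the variable names and the simulated data are illustrative only, and in practice \hat{\theta}_{n} would be the supervised GLM estimate rather than the stand-in used below.

```python
import numpy as np

def eta_tilde(X, theta_hat, b2=np.exp):
    """Plug-in estimate ([n^{-1} sum_i b''(X_i' theta_hat) X_i X_i']^{-1}, theta_hat).

    X: (n, d) covariate matrix; theta_hat: (d,) supervised estimate;
    b2: second derivative of the cumulant function (exp for the Poisson GLM).
    """
    w = b2(X @ theta_hat)                    # b''(X_i' theta_hat) for each i
    H = (X * w[:, None]).T @ X / X.shape[0]  # n^{-1} sum_i b''(.) X_i X_i'
    return np.linalg.inv(H), theta_hat

# illustrative usage with simulated Poisson data (theta_star used in place of the MLE)
rng = np.random.default_rng(0)
n, d = 500, 3
X = rng.normal(size=(n, d))
theta_star = np.array([0.3, -0.2, 0.1])
Y = rng.poisson(np.exp(X @ theta_star))
H_inv, _ = eta_tilde(X, theta_star)
```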

D.2 Proof of claim in Example 7.4

Here, we validate the remaining parts of Assumption 3.1 for Example 7.4 under additional assumptions. Following the notation of Example 7.4, recall that the supervised estimator is Kendall’s τ\tau:

θ^n=2n(n1)i<jI{(UiUj)(ViVj)>0}\hat{\theta}_{n}=\frac{2}{n(n-1)}\sum_{i<j}I\left\{{(U_{i}-U_{j})(V_{i}-V_{j})>0}\right\} (81)

which is regular and asymptotically linear with influence function

φη(z)=2Y{(Uu)(Vv)>0}2θ,{\varphi}_{\eta^{*}}(z)=2{\mathbb{P}}_{Y}^{*}\left\{{(U-u)(V-v)>0}\right\}-2\theta^{*},

where η=(Y{(Uu)(Vv)>0},θ)\eta^{*}=\left({\mathbb{P}}_{Y}^{*}\left\{{(U-u)(V-v)>0}\right\},\theta^{*}\right). For simplicity, denote h(y)=Y{(Uu)(Vv)>0}h^{*}(y)={\mathbb{P}}_{Y}^{*}\left\{{(U-u)(V-v)>0}\right\}. By imposing additional assumptions on h(y)h^{*}(y), the next proposition guarantees that (81) satisfies Assumption 3.1.

Proposition D.2.

Suppose that \left\lVert h^{*}(y)\right\rVert_{V}^{*}\leqslant M for some M>1, where \left\lVert\cdot\right\rVert_{V}^{*} represents the uniform sectional variation norm of a function [van der Laan, 1995, Gill et al., 1995]. Further suppose that \Theta is a bounded subset of {\mathbb{R}} such that \theta^{*}\in\text{int}(\Theta). Then, Kendall's \tau (81) satisfies Assumption 3.1 with

𝒪={h(y)θ:2,hVM,θΘ},{\mathcal{O}}=\left\{{h(y)-\theta:{\mathbb{R}}^{2}\to{\mathbb{R}},\left\lVert h\right\rVert_{V}^{*}\leqslant M,\theta\in\Theta}\right\},

and the estimator

η~n=(1ni=1nI{(Uiu)(Viv)>0},θ^n).\tilde{\eta}_{n}=\left(\frac{1}{n}\sum_{i=1}^{n}I\left\{{(U_{i}-u)(V_{i}-v)>0}\right\},\hat{\theta}_{n}\right).
Proof.

We first validate part (b) of Assumption 3.1. Clearly, h^{*}(y)-\theta^{*}\in{\mathcal{O}}, and we have shown in Section 7.2 that the influence function of a U-statistic with kernel R is Lipschitz. Therefore, it remains to prove that \left\{{{\varphi}_{\eta}:\eta\in{\mathcal{O}}}\right\}={\mathcal{O}} is {{\mathbb{P}}^{*}}-Donsker. Rewrite {\mathcal{O}} as

𝒪={h(y)h(y)+h(y)θ:2,hVM,θΘ},{\mathcal{O}}=\left\{{h(y)-h^{*}(y)+h^{*}(y)-\theta:{\mathbb{R}}^{2}\to{\mathbb{R}},\left\lVert h\right\rVert_{V}^{*}\leqslant M,\theta\in\Theta}\right\},

which is the pairwise sum of two sets: {\mathcal{O}}={\mathcal{O}}_{1}+{\mathcal{O}}_{2}, where {\mathcal{O}}_{1}=\left\{{h(y)-h^{*}(y):\left\lVert h\right\rVert_{V}^{*}\leqslant M}\right\} and {\mathcal{O}}_{2}=\left\{{h^{*}(y)-\theta:\theta\in\Theta}\right\}. By Example 2.10.9 of van der Vaart and Wellner [2013], if both {\mathcal{O}}_{1} and {\mathcal{O}}_{2} are {{\mathbb{P}}^{*}}-Donsker, then their pairwise sum {\mathcal{O}} is also {{\mathbb{P}}^{*}}-Donsker. For {\mathcal{O}}_{1}, the triangle inequality gives \left\lVert h-h^{*}\right\rVert_{V}^{*}\leqslant\left\lVert h\right\rVert_{V}^{*}+\left\lVert h^{*}\right\rVert_{V}^{*}\leqslant 2M, so {\mathcal{O}}_{1} is a subset of the class of functions with uniform sectional variation norm bounded by 2M. By Example 1.2 of van der Laan [1995], this implies that {\mathcal{O}}_{1} is {{\mathbb{P}}^{*}}-Donsker. The set {\mathcal{O}}_{2} is indexed by the finite-dimensional parameter \theta\in\Theta, where \Theta is bounded and h^{*}(y)-\theta is 1-Lipschitz in \theta. Hence, by Example 19.6 of Van der Vaart [2000], {\mathcal{O}}_{2} is {{\mathbb{P}}^{*}}-Donsker.

We now validate part (c) of Assumption 3.1 with η~n\tilde{\eta}_{n}. Since any indicator function on 2{\mathbb{R}}^{2} has uniform sectional variation norm bounded by 1, we have

1ni=1nI{(Uiu)(Viv)>0}V1ni=1nI{(Uiu)(Viv)>0}V1.\left\lVert\frac{1}{n}\sum_{i=1}^{n}I\left\{{(U_{i}-u)(V_{i}-v)>0}\right\}\right\rVert_{V}^{*}\leqslant\frac{1}{n}\sum_{i=1}^{n}\left\lVert I\left\{{(U_{i}-u)(V_{i}-v)>0}\right\}\right\rVert_{V}^{*}\leqslant 1.

Further, by asymptotic linearity of the U-statistics θ^n\hat{\theta}_{n}, we have θ^nθ2=op(1)\left\lVert\hat{\theta}_{n}-\theta^{*}\right\rVert_{2}=o_{p}(1). Therefore,

{η~n𝒪}={θ^nΘ}1{{\mathbb{P}}^{*}}\left\{{\tilde{\eta}_{n}\in{\mathcal{O}}}\right\}={{\mathbb{P}}^{*}}\left\{{\hat{\theta}_{n}\in\Theta}\right\}\to 1

by the definition of {\mathcal{O}}. Further, since the class of indicator functions \left\{I\left\{(U-u)(V-v)>0\right\}:(u,v)\in{\mathbb{R}}^{2}\right\} is {{\mathbb{P}}^{*}}-Glivenko-Cantelli, it follows that

1ni=1nI{(Uiu)(Viv)>0}h(y)=op(1).\left\lVert\frac{1}{n}\sum_{i=1}^{n}I\left\{{(U_{i}-u)(V_{i}-v)>0}\right\}-h^{*}(y)\right\rVert_{\infty}=o_{p}(1).

Combined with the consistency of \hat{\theta}_{n}, we have \rho(\tilde{\eta}_{n},\eta^{*})=o_{p}(1), which completes the proof. ∎
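As an illustration of Proposition D.2, both components of \tilde{\eta}_{n} can be computed directly from the labeled sample. Below is a minimal sketch (variable names and the simulated data are for illustration only) that evaluates the U-statistic (81) and the plug-in estimate of h^{*}(y) at a query point y=(u,v).

```python
import numpy as np

def kendall_tau_hat(U, V):
    """U-statistic (81): proportion of concordant pairs among all pairs i < j."""
    dU = U[:, None] - U[None, :]
    dV = V[:, None] - V[None, :]
    concordant = (dU * dV > 0)
    n = len(U)
    return concordant[np.triu_indices(n, k=1)].mean()

def h_tilde(U, V, u, v):
    """First component of eta_tilde: n^{-1} sum_i I{(U_i - u)(V_i - v) > 0}."""
    return np.mean((U - u) * (V - v) > 0)

rng = np.random.default_rng(1)
U = rng.normal(size=200)
V = 0.5 * U + rng.normal(size=200)
theta_hat = kendall_tau_hat(U, V)
h_at_origin = h_tilde(U, V, 0.0, 0.0)
```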

D.3 Proof of claim in Remark 7.1

We prove a general result.

Proposition D.3.

Let f(Z)p,02()f(Z)\in{{\mathcal{L}}^{2}_{p,0}({{\mathbb{P}}^{*}})} and let V=g(X)V=g(X) be a measurable function of XX. Denote fX(x)=𝔼[f(Z)X=x]f_{X}(x)={\mathbb{E}}[f(Z)\mid X=x] and fV(v)=𝔼[f(Z)V=v]f_{V}(v)={\mathbb{E}}[f(Z)\mid V=v]. Then

f(z1)γfV(v1)+γfV(v2),[f(z1)γfV(v1)+γfV(v2)]\displaystyle\left\langle f(z_{1})-\gamma f_{V}(v_{1})+\gamma f_{V}(v_{2}),\left[f(z_{1})-\gamma f_{V}(v_{1})+\gamma f_{V}(v_{2})\right]^{\top}\right\rangle_{\mathcal{H}} (82)
f(z1)γfX(x1)+γfX(x2),[f(z1)γfX(x1)+γfX(x2)],\displaystyle\succeq\left\langle f(z_{1})-\gamma f_{X}(x_{1})+\gamma f_{X}(x_{2}),[f(z_{1})-\gamma f_{X}(x_{1})+\gamma f_{X}(x_{2})]^{\top}\right\rangle_{\mathcal{H}},

where ,\langle\cdot,\cdot\rangle_{\mathcal{H}} is defined in (86).

Proof.

By the linearity of inner product,

\left\langle h_{1},h_{1}^{\top}\right\rangle_{\mathcal{H}}=\left\langle h_{1}-h_{2},(h_{1}-h_{2})^{\top}\right\rangle_{\mathcal{H}}+\left\langle h_{2},h_{2}^{\top}\right\rangle_{\mathcal{H}}+\left\langle h_{1}-h_{2},h_{2}^{\top}\right\rangle_{\mathcal{H}}+\left\langle h_{2},(h_{1}-h_{2})^{\top}\right\rangle_{\mathcal{H}},

for arbitrary h1h_{1} and h2h_{2}. Therefore, it suffices to prove that the cross term satisfies

γfX(x1)γfV(v1)γfX(x2)+γfV(v2),[f(z1)γfX(x1)+γfX(x2)]=0.\left\langle\gamma f_{X}(x_{1})-\gamma f_{V}(v_{1})-\gamma f_{X}(x_{2})+\gamma f_{V}(v_{2}),[f(z_{1})-\gamma f_{X}(x_{1})+\gamma f_{X}(x_{2})]^{\top}\right\rangle_{\mathcal{H}}=0.

By the definition of ,\langle\cdot,\cdot\rangle_{\mathcal{H}},

γfX(x1)γfV(v1)γfX(x2)+γfV(v2),[f(z1)γfX(x1)+γfX(x2)]\displaystyle\left\langle\gamma f_{X}(x_{1})-\gamma f_{V}(v_{1})-\gamma f_{X}(x_{2})+\gamma f_{V}(v_{2}),[f(z_{1})-\gamma f_{X}(x_{1})+\gamma f_{X}(x_{2})]^{\top}\right\rangle_{\mathcal{H}}
=γfX(x1)fV(v1)fX(x2)+fV(v2),[f(z1)fX(x1)+(1γ)fX(x1)+γfX(x2)]\displaystyle=\gamma\left\langle f_{X}(x_{1})-f_{V}(v_{1})-f_{X}(x_{2})+f_{V}(v_{2}),[f(z_{1})-f_{X}(x_{1})+(1-\gamma)f_{X}(x_{1})+\gamma f_{X}(x_{2})]^{\top}\right\rangle_{\mathcal{H}}
=γ{fXfV,ffX+(1γ)fXp,02()+1γγfVfX,γfXp,02(X)}\displaystyle=\gamma\left\{{\left\langle f_{X}-f_{V},f-f_{X}+(1-\gamma)f_{X}\right\rangle_{{\mathcal{L}}^{2}_{p,0}({{\mathbb{P}}^{*}})}+\frac{1-\gamma}{\gamma}\left\langle f_{V}-f_{X},\gamma f_{X}\right\rangle_{{\mathcal{L}}^{2}_{p,0}({{\mathbb{P}}^{*}_{X}})}}\right\}
=γ{fXfV,(1γ)fXp,02(X)+1γγfVfX,γfXp,02(X)}\displaystyle=\gamma\left\{{\left\langle f_{X}-f_{V},(1-\gamma)f_{X}\right\rangle_{{\mathcal{L}}^{2}_{p,0}({{\mathbb{P}}^{*}_{X}})}+\frac{1-\gamma}{\gamma}\left\langle f_{V}-f_{X},\gamma f_{X}\right\rangle_{{\mathcal{L}}^{2}_{p,0}({{\mathbb{P}}^{*}_{X}})}}\right\}
=0,\displaystyle=0,

where in the second-to-last equality we used the fact that

fXfV,ffXp,02()=0\left\langle f_{X}-f_{V},f-f_{X}\right\rangle_{{\mathcal{L}}^{2}_{p,0}({{\mathbb{P}}^{*}})}=0

by taking the conditional expectation on XX and noting that V=g(X)V=g(X). ∎

Appendix E Additional results

E.1 Technical lemmas

Lemma E.1 (Decomposition of conditional variances).

Let Z=(X,Y)\sim{{\mathbb{P}}^{*}} be a random variable with marginal distribution X\sim{{\mathbb{P}}^{*}_{X}}, and let f and g be any functions such that f(z)\in{\mathcal{L}}^{2}_{p}({{\mathbb{P}}^{*}}) and g(x)\in{\mathcal{L}}^{2}_{p}({{\mathbb{P}}^{*}_{X}}). Then,

Var[f(Z)g(X)]=Var{f(Z)𝔼[f(Z)X]}+Var{𝔼[f(Z)X]g(X)}.\mathrm{Var}[f(Z)-g(X)]=\mathrm{Var}\left\{{f(Z)-{\mathbb{E}}[f(Z)\mid X]}\right\}+\mathrm{Var}\left\{{{\mathbb{E}}[f(Z)\mid X]-g(X)}\right\}.
Proof.

By the law of total variance, and since {\mathbb{E}}\left\{{f(Z)-{\mathbb{E}}[f(Z)\mid X]\mid X}\right\}=0,

\mathrm{Var}\left\{{f(Z)-{\mathbb{E}}[f(Z)\mid X]}\right\}={\mathbb{E}}\left\{{\mathrm{Var}\left[f(Z)-{\mathbb{E}}[f(Z)\mid X]\mid X\right]}\right\}={\mathbb{E}}\left\{{\mathrm{Var}\left[f(Z)\mid X\right]}\right\},

and, since {\mathbb{E}}[f(Z)\mid X]-g(X) is a function of X,

\mathrm{Var}\left\{{{\mathbb{E}}[f(Z)\mid X]-g(X)}\right\}=\mathrm{Var}\left\{{{\mathbb{E}}\left[f(Z)-g(X)\mid X\right]}\right\}.

Applying the law of total variance to f(Z)-g(X), it follows that

\mathrm{Var}[f(Z)-g(X)] ={\mathbb{E}}\left\{{\mathrm{Var}\left[f(Z)-g(X)\mid X\right]}\right\}+\mathrm{Var}\left\{{{\mathbb{E}}\left[f(Z)-g(X)\mid X\right]}\right\}
={\mathbb{E}}\left\{{\mathrm{Var}\left[f(Z)\mid X\right]}\right\}+\mathrm{Var}\left\{{{\mathbb{E}}[f(Z)\mid X]-g(X)}\right\}
=\mathrm{Var}\left\{{f(Z)-{\mathbb{E}}[f(Z)\mid X]}\right\}+\mathrm{Var}\left\{{{\mathbb{E}}[f(Z)\mid X]-g(X)}\right\}. ∎
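A quick Monte Carlo check of this decomposition is given below; it is only a sketch, with an arbitrary choice of f, g, and data-generating distribution for which {\mathbb{E}}[f(Z)\mid X] is available in closed form, and the two sides agree up to simulation error.

```python
import numpy as np

rng = np.random.default_rng(2)
m = 200_000
X = rng.normal(size=m)
Y = X**2 + rng.normal(size=m)

f = Y**3                       # an arbitrary f(Z) in L^2
g = np.sin(X)                  # an arbitrary g(X) in L^2
cond = X**6 + 3 * X**2         # E[f(Z) | X] in closed form for this choice of f

lhs = np.var(f - g)
rhs = np.var(f - cond) + np.var(cond - g)
print(lhs, rhs)                # equal up to Monte Carlo error
```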

Lemma E.2 (Multivariate Pythagorean theorem).

For any function g(x)d2(X)g(x)\in{\mathcal{L}}^{2}_{d}({{\mathbb{P}}^{*}_{X}}), let

𝒜={𝐁g(x):𝐁p×d}{\mathcal{A}}=\left\{{\mathbf{B}g(x):\mathbf{B}\in{\mathbb{R}}^{p\times d}}\right\}

denote its linear span in p2(X){\mathcal{L}}^{2}_{p}({{\mathbb{P}}^{*}_{X}}). For any f(x)p2(X)f(x)\in{\mathcal{L}}^{2}_{p}({{\mathbb{P}}^{*}_{X}}),

Var[f(X)]=Var[f(X)Π(f(X)𝒜)]+Var[Π(f(X)𝒜)],\mathrm{Var}\left[f(X)\right]=\mathrm{Var}\left[f(X)-\Pi(f(X)\mid{\mathcal{A}})\right]+\mathrm{Var}\left[\Pi(f(X)\mid{\mathcal{A}})\right],

where \Pi(a\mid{\mathcal{A}}) denotes the projection operator that projects a onto the linear space {\mathcal{A}}.

Proof.

See Theorem 3.3 of Tsiatis [2006]. ∎

Lemma E.3.

Let Z\sim{{\mathbb{P}}^{*}} be a random variable, and let f and g be any functions such that f\in{\mathcal{L}}^{2}_{p}({{\mathbb{P}}^{*}}) and g\in{\mathcal{L}}^{2}_{q}({{\mathbb{P}}^{*}}). Then,

(fg)2f2()g2().\left\lVert{{\mathbb{P}}^{*}}\left(fg^{\top}\right)\right\rVert_{2}\leqslant\left\lVert f\right\rVert_{{\mathcal{L}}^{2}({{\mathbb{P}}^{*}})}\left\lVert g\right\rVert_{{\mathcal{L}}^{2}({{\mathbb{P}}^{*}})}.
Proof.

Consider any non-random vector aqa\in{\mathbb{R}}^{q} such that a2=1\left\lVert a\right\rVert_{2}=1. By the definition of operator norm,

(fg)a2\displaystyle\left\lVert{{\mathbb{P}}^{*}}\left(fg^{\top}\right)a\right\rVert_{2} =(fga)2\displaystyle=\left\lVert{{\mathbb{P}}^{*}}\left(fg^{\top}a\right)\right\rVert_{2}
fga2\displaystyle\leqslant{{\mathbb{P}}^{*}}\left\lVert fg^{\top}a\right\rVert_{2}
fg\displaystyle\leqslant{{\mathbb{P}}^{*}}\left\lVert f\right\rVert\left\lVert g\right\rVert
f2()g2(),\displaystyle\leqslant\left\lVert f\right\rVert_{{\mathcal{L}}^{2}({{\mathbb{P}}^{*}})}\left\lVert g\right\rVert_{{\mathcal{L}}^{2}({{\mathbb{P}}^{*}})},

by the Cauchy-Schwarz inequality. ∎

Lemma E.4 (Lemma 19.24 of Van der Vaart [2000].).

Let 𝒮{\mathcal{S}} denote a {{\mathbb{P}}^{*}}-Donsker class of measurable functions over (𝒵,,)({\mathcal{Z}},{\mathcal{F}},{{\mathbb{P}}^{*}}), and let s^n\hat{s}_{n} denote a sequence of random functions that takes value in 𝒮{\mathcal{S}} and satisfies

s^nsp,02()=op(1),\left\lVert\hat{s}_{n}-s\right\rVert_{{{\mathcal{L}}^{2}_{p,0}({{\mathbb{P}}^{*}})}}=o_{p}(1),

for some element s𝒮s\in{\mathcal{S}}. Then,

𝔾n(s^ns)=op(1).{\mathbb{G}}_{n}(\hat{s}_{n}-s)=o_{p}(1).
Lemma E.5.

Let 𝒮={sη(z):𝒵p,ηΩ}{\mathcal{S}}=\left\{{s_{\eta}(z):{\mathcal{Z}}\to{\mathbb{R}}^{p},\eta\in\Omega}\right\} be a class of measurable functions on (𝒵,,)({\mathcal{Z}},{\mathcal{F}},{{\mathbb{P}}^{*}}) indexed by η\eta, where Ω\Omega is a metric space with metric ρ\rho. Let ηint(Ω)\eta^{*}\in\text{int}(\Omega) such that sη(z)p2()s_{\eta^{*}}(z)\in{\mathcal{L}}^{2}_{p}({{\mathbb{P}}^{*}}). Suppose that there exists a set 𝒪Ω{\mathcal{O}}\subset\Omega such that η𝒪\eta^{*}\in{\mathcal{O}}, the class of functions {sη(z):η𝒪}\left\{{s_{\eta}(z):\eta\in{\mathcal{O}}}\right\} is {{\mathbb{P}}^{*}}-Donsker, and for all {η1,η2}𝒪\left\{{\eta_{1},\eta_{2}}\right\}\subset{\mathcal{O}} there exists a square-integrable function S:𝒵+S:{\mathcal{Z}}\to{\mathbb{R}}^{+} such that sη1(Z)sη2(Z)S(Z)ρ(η1,η2)\left\lVert s_{\eta_{1}}(Z)-s_{\eta_{2}}(Z)\right\rVert\leqslant S(Z)\rho(\eta_{1},\eta_{2}) almost surely. In addition, suppose there exists an estimator η~n\tilde{\eta}_{n} of η\eta^{*} such that {η~n𝒪}1{{\mathbb{P}}^{*}}\left\{{\tilde{\eta}_{n}\in{\mathcal{O}}}\right\}\to 1 and ρ(η~n,η)=op(1)\rho(\tilde{\eta}_{n},\eta^{*})=o_{p}(1). Then, the following hold:

  • supη𝒪(n)sη=op(1)\sup_{\eta\in{\mathcal{O}}}\left\lVert\left({{\mathbb{P}}_{n}}-{{\mathbb{P}}^{*}}\right)s_{\eta}\right\rVert=o_{p}(1),

  • sη~nsη2()2=op(1)\left\lVert{s_{\tilde{\eta}_{n}}}-{s_{\eta^{*}}}\right\rVert^{2}_{{\mathcal{L}}^{2}({{\mathbb{P}}^{*}})}=o_{p}(1),

  • 𝔾n(sη~nsη)=op(1)\left\lVert{\mathbb{G}}_{n}({s_{\tilde{\eta}_{n}}}-{s_{\eta^{*}}})\right\rVert=o_{p}(1),

  • n(sη~n)=Op(1)\left\lVert{{\mathbb{P}}_{n}}({s_{\tilde{\eta}_{n}}})\right\rVert=O_{p}(1); and if additionally (sη)=0{{\mathbb{P}}^{*}}({s_{\eta^{*}}})=0, then

    \left\lVert{{\mathbb{P}}_{n}}({s_{\tilde{\eta}_{n}}})\right\rVert=O_{p}\left(\rho(\tilde{\eta}_{n},\eta^{*})\vee\frac{1}{\sqrt{n}}\right)=o_{p}(1).
Proof.

Since the class {\mathcal{S}}=\left\{{s_{\eta}(z):\eta\in{\mathcal{O}}}\right\} is {{\mathbb{P}}^{*}}-Donsker, it is also {{\mathbb{P}}^{*}}-Glivenko-Cantelli, hence

supη𝒪(n)sη=op(1).\sup_{\eta\in{\mathcal{O}}}\left\lVert\left({{\mathbb{P}}_{n}}-{{\mathbb{P}}^{*}}\right)s_{\eta}\right\rVert=o_{p}(1).

By assumption sη1(Z)sη2(Z)S(Z)ρ(η1,η2)\left\lVert s_{\eta_{1}}(Z)-s_{\eta_{2}}(Z)\right\rVert\leqslant S(Z)\rho(\eta_{1},\eta_{2}) for all {η1,η2}𝒪\left\{{\eta_{1},\eta_{2}}\right\}\subset{\mathcal{O}}, therefore when η~n𝒪\tilde{\eta}_{n}\in{\mathcal{O}}, we have

sη~nsη2()2ρ2(η~n,η)S(z)2()2=Op(ρ2(η~n,η))=op(1).\left\lVert{s_{\tilde{\eta}_{n}}}-{s_{\eta^{*}}}\right\rVert^{2}_{{\mathcal{L}}^{2}({{\mathbb{P}}^{*}})}\leqslant\rho^{2}(\tilde{\eta}_{n},\eta^{*})\left\lVert S(z)\right\rVert_{{\mathcal{L}}^{2}({{\mathbb{P}}^{*}})}^{2}=O_{p}(\rho^{2}(\tilde{\eta}_{n},\eta^{*}))=o_{p}(1).

Further, by assumption {η~n𝒪}0{{\mathbb{P}}^{*}}\left\{{\tilde{\eta}_{n}\notin{\mathcal{O}}}\right\}\to 0, it then follows that sη~nsη2()2=op(1)\left\lVert{s_{\tilde{\eta}_{n}}}-{s_{\eta^{*}}}\right\rVert^{2}_{{\mathcal{L}}^{2}({{\mathbb{P}}^{*}})}=o_{p}(1). Therefore, by Lemma E.4, 𝔾n(sη~nsη)=op(1)\left\lVert{\mathbb{G}}_{n}({s_{\tilde{\eta}_{n}}}-{s_{\eta^{*}}})\right\rVert=o_{p}(1). Then,

n(sη~n)\displaystyle\left\lVert{{\mathbb{P}}_{n}}({s_{\tilde{\eta}_{n}}})\right\rVert 1n𝔾n(sη~nsη)+(sη~nsη)+n(sη)\displaystyle\leqslant\frac{1}{\sqrt{n}}\left\lVert{\mathbb{G}}_{n}({s_{\tilde{\eta}_{n}}}-{s_{\eta^{*}}})\right\rVert+\left\lVert{{\mathbb{P}}^{*}}({s_{\tilde{\eta}_{n}}}-{s_{\eta^{*}}})\right\rVert+\left\lVert{{\mathbb{P}}_{n}}({s_{\eta^{*}}})\right\rVert
=op(1n)+(sη~nsη)+n(sη)\displaystyle=o_{p}\left(\frac{1}{\sqrt{n}}\right)+\left\lVert{{\mathbb{P}}^{*}}({s_{\tilde{\eta}_{n}}}-{s_{\eta^{*}}})\right\rVert+\left\lVert{{\mathbb{P}}_{n}}({s_{\eta^{*}}})\right\rVert
Op(1n)+sη~nsη2()+(sη)\displaystyle\leqslant O_{p}\left(\frac{1}{\sqrt{n}}\right)+\left\lVert{s_{\tilde{\eta}_{n}}}-{s_{\eta^{*}}}\right\rVert_{{\mathcal{L}}^{2}({{\mathbb{P}}^{*}})}+\left\lVert{{\mathbb{P}}^{*}}({s_{\eta^{*}}})\right\rVert
=Op(1n)+Op(ρ(η~n,η))+(sη)\displaystyle=O_{p}\left(\frac{1}{\sqrt{n}}\right)+O_{p}(\rho(\tilde{\eta}_{n},\eta^{*}))+\left\lVert{{\mathbb{P}}^{*}}({s_{\eta^{*}}})\right\rVert
=O_{p}\left(\rho(\tilde{\eta}_{n},\eta^{*})\vee\frac{1}{\sqrt{n}}\right)+\left\lVert{{\mathbb{P}}^{*}}({s_{\eta^{*}}})\right\rVert.

Therefore, \left\lVert{{\mathbb{P}}_{n}}({s_{\tilde{\eta}_{n}}})\right\rVert=O_{p}(1). Further, if {{\mathbb{P}}^{*}}({s_{\eta^{*}}})=0, then \left\lVert{{\mathbb{P}}_{n}}({s_{\tilde{\eta}_{n}}})\right\rVert=O_{p}\left(\rho(\tilde{\eta}_{n},\eta^{*})\vee\frac{1}{\sqrt{n}}\right)=o_{p}(1). ∎

Lemma E.6.

Assume the conditions of Lemma E.5 hold, and let {\mathcal{G}}=\left\{{g_{\eta}(z):{\mathcal{Z}}\to{\mathbb{R}}^{p},\eta\in\Omega}\right\} denote another class of measurable functions on ({\mathcal{Z}},{\mathcal{F}},{{\mathbb{P}}^{*}}) such that (i) g_{\eta^{*}}(z)\in{\mathcal{L}}^{2}({{\mathbb{P}}^{*}}), and (ii) for all \left\{{\eta_{1},\eta_{2}}\right\}\subset{\mathcal{O}}, where {\mathcal{O}} is as in Lemma E.5, there exists a square-integrable function G:{\mathcal{Z}}\to{\mathbb{R}}^{+} such that \left\lVert g_{\eta_{1}}(Z)-g_{\eta_{2}}(Z)\right\rVert\leqslant G(Z)\rho(\eta_{1},\eta_{2}) almost surely. Then

n[gη~n(sη~nsη)]2=Op(ρ(η~n,η))=op(1),\left\lVert{{\mathbb{P}}_{n}}\left[g_{\tilde{\eta}_{n}}({s_{\tilde{\eta}_{n}}}-{s_{\eta^{*}}})^{\top}\right]\right\rVert_{2}=O_{p}(\rho(\tilde{\eta}_{n},\eta^{*}))=o_{p}(1),

where η~n\tilde{\eta}_{n} is defined in Lemma E.5.

Proof.

When η~n𝒪\tilde{\eta}_{n}\in{\mathcal{O}},

n[gη~n(sη~nsη)]2\displaystyle\left\lVert{{\mathbb{P}}_{n}}\left[g_{\tilde{\eta}_{n}}({s_{\tilde{\eta}_{n}}}-{s_{\eta^{*}}})^{\top}\right]\right\rVert_{2}
n{gη~n(sη~nsη)2}\displaystyle\leqslant{{\mathbb{P}}_{n}}\left\{{\left\lVert g_{\tilde{\eta}_{n}}({s_{\tilde{\eta}_{n}}}-{s_{\eta^{*}}})^{\top}\right\rVert_{2}}\right\}
n{gη~nsη~nsη}\displaystyle\leqslant{{\mathbb{P}}_{n}}\left\{{\left\lVert g_{\tilde{\eta}_{n}}\right\rVert\left\lVert{s_{\tilde{\eta}_{n}}}-{s_{\eta^{*}}}\right\rVert}\right\}
n{(gη~ngη+gη)sη~nsη}\displaystyle\leqslant{{\mathbb{P}}_{n}}\left\{{(\left\lVert g_{\tilde{\eta}_{n}}-g_{\eta^{*}}\right\rVert+\left\lVert g_{\eta^{*}}\right\rVert)\left\lVert{s_{\tilde{\eta}_{n}}}-{s_{\eta^{*}}}\right\rVert}\right\}
n{gη~ngηsη~nsη}+n{gηsη~nsη}\displaystyle\leqslant{{\mathbb{P}}_{n}}\left\{{\left\lVert g_{\tilde{\eta}_{n}}-g_{\eta^{*}}\right\rVert\left\lVert{s_{\tilde{\eta}_{n}}}-{s_{\eta^{*}}}\right\rVert}\right\}+{{\mathbb{P}}_{n}}\left\{{\left\lVert g_{\eta^{*}}\right\rVert\left\lVert{s_{\tilde{\eta}_{n}}}-{s_{\eta^{*}}}\right\rVert}\right\}
a.s.n{SG}ρ(η~n,η)2+n{Sgη}ρ(η~n,η)\displaystyle\overset{\text{a.s.}}{\leqslant}{{\mathbb{P}}_{n}}\left\{{SG}\right\}\rho(\tilde{\eta}_{n},\eta^{*})^{2}+{{\mathbb{P}}_{n}}\left\{{S\left\lVert g_{\eta^{*}}\right\rVert}\right\}\rho(\tilde{\eta}_{n},\eta^{*})
={SG}ρ(η~n,η)2+{Sgη}ρ(η~n,η)+Op(1nρ(η~n,η))\displaystyle={{\mathbb{P}}^{*}}\left\{{SG}\right\}\rho(\tilde{\eta}_{n},\eta^{*})^{2}+{{\mathbb{P}}^{*}}\left\{{S\left\lVert g_{\eta^{*}}\right\rVert}\right\}\rho(\tilde{\eta}_{n},\eta^{*})+O_{p}\left(\frac{1}{\sqrt{n}}\rho(\tilde{\eta}_{n},\eta^{*})\right)
S2()G2()ρ(η~n,η)2+S2()gη2()ρ(η~n,η)+Op(1nρ(η~n,η))\displaystyle\leqslant\left\lVert S\right\rVert_{{\mathcal{L}}^{2}({{\mathbb{P}}^{*}})}\left\lVert G\right\rVert_{{\mathcal{L}}^{2}({{\mathbb{P}}^{*}})}\rho(\tilde{\eta}_{n},\eta^{*})^{2}+\left\lVert S\right\rVert_{{\mathcal{L}}^{2}({{\mathbb{P}}^{*}})}\left\lVert g_{\eta^{*}}\right\rVert_{{\mathcal{L}}^{2}({{\mathbb{P}}^{*}})}\rho(\tilde{\eta}_{n},\eta^{*})+O_{p}\left(\frac{1}{\sqrt{n}}\rho(\tilde{\eta}_{n},\eta^{*})\right)
=Op(ρ(η~n,η)2)+Op(ρ(η~n,η))+Op(1nρ(η~n,η))\displaystyle=O_{p}\left(\rho(\tilde{\eta}_{n},\eta^{*})^{2}\right)+O_{p}\left(\rho(\tilde{\eta}_{n},\eta^{*})\right)+O_{p}\left(\frac{1}{\sqrt{n}}\rho(\tilde{\eta}_{n},\eta^{*})\right)
=Op(ρ(η~n,η))\displaystyle=O_{p}(\rho(\tilde{\eta}_{n},\eta^{*}))
=op(1).\displaystyle=o_{p}(1).

Therefore the statement follows from the fact {η~n𝒪}0{{\mathbb{P}}^{*}}\left\{{\tilde{\eta}_{n}\notin{\mathcal{O}}}\right\}\to 0. ∎

Lemma E.7.

Under the conditions of Lemma E.5, consider the centered class

𝒮={sη0(z)=sη(z)(sη):ηΩ}.{\mathcal{S}}^{*}=\left\{{s_{\eta}^{0}(z)=s_{\eta}(z)-{{\mathbb{P}}^{*}}(s_{\eta}):\eta\in\Omega}\right\}.

Then, for all {η1,η2}𝒪\left\{{\eta_{1},\eta_{2}}\right\}\subset{\mathcal{O}} where 𝒪{\mathcal{O}} follows from Lemma E.5, it holds that

sη10(Z)sη20(Z)[S(Z)+S2()]ρ(η1,η2),\left\lVert s^{0}_{\eta_{1}}(Z)-s^{0}_{\eta_{2}}(Z)\right\rVert\leqslant\left[S(Z)+\left\lVert S\right\rVert_{{\mathcal{L}}^{2}({{\mathbb{P}}^{*}})}\right]\rho(\eta_{1},\eta_{2}),

almost surely. Further, the class 𝒮={sη0(z)=sη(z)(sη):η𝒪}{\mathcal{S}}^{*}=\left\{{s_{\eta}^{0}(z)=s_{\eta}(z)-{{\mathbb{P}}^{*}}(s_{\eta}):\eta\in{\mathcal{O}}}\right\} is {{\mathbb{P}}^{*}}-Donsker.

Proof.
sη10(Z)sη20(Z)\displaystyle\left\lVert s^{0}_{\eta_{1}}(Z)-s^{0}_{\eta_{2}}(Z)\right\rVert sη1(Z)sη2(Z)+(sη1sη2)\displaystyle\leqslant\left\lVert s_{\eta_{1}}(Z)-s_{\eta_{2}}(Z)\right\rVert+\left\lVert{{\mathbb{P}}^{*}}(s_{\eta_{1}}-s_{\eta_{2}})\right\rVert
sη1(Z)sη2(Z)+sη1sη2\displaystyle\leqslant\left\lVert s_{\eta_{1}}(Z)-s_{\eta_{2}}(Z)\right\rVert+{{\mathbb{P}}^{*}}\left\lVert s_{\eta_{1}}-s_{\eta_{2}}\right\rVert
sη1(Z)sη2(Z)+sη1sη22()\displaystyle\leqslant\left\lVert s_{\eta_{1}}(Z)-s_{\eta_{2}}(Z)\right\rVert+\left\lVert s_{\eta_{1}}-s_{\eta_{2}}\right\rVert_{{\mathcal{L}}^{2}({{\mathbb{P}}^{*}})}
[S(Z)+S2()]ρ(η1,η2).\displaystyle\leqslant\left[S(Z)+\left\lVert S\right\rVert_{{\mathcal{L}}^{2}({{\mathbb{P}}^{*}})}\right]\rho(\eta_{1},\eta_{2}).

Since {\mathbb{G}}_{n}(s_{\eta})={\mathbb{G}}_{n}(s^{0}_{\eta}) for all \eta\in\Omega, the {{\mathbb{P}}^{*}}-Donsker property follows immediately. ∎

Lemma E.8.

Let \left\{{X_{i}}\right\}_{i=1}^{n} be i.i.d. random vectors with mean zero and covariance matrix \mathbf{I}_{K_{n}}, where K_{n} satisfies K_{n}\to\infty and \frac{K_{n}}{n}\to 0. Let a\in{\mathbb{R}}^{K_{n}} be any fixed vector such that \left\lVert a\right\rVert=O(\sqrt{K_{n}}). Then

  • 1ni=1naXi=Op(Knn)\frac{1}{n}\sum_{i=1}^{n}a^{\top}X_{i}=O_{p}\left(\sqrt{\frac{K_{n}}{n}}\right)

  • 1ni=1nXi=Op(Knn)\left\lVert\frac{1}{n}\sum_{i=1}^{n}X_{i}\right\rVert=O_{p}\left(\sqrt{\frac{K_{n}}{n}}\right)

Proof.

We have:

𝔼[(1ni=1naXi)2]=a2n=O(Knn).{\mathbb{E}}\left[\left(\frac{1}{n}\sum_{i=1}^{n}a^{\top}X_{i}\right)^{2}\right]=\frac{\left\lVert a\right\rVert^{2}}{n}=O\left(\frac{K_{n}}{n}\right).
𝔼[(1ni=1nXi)(1ni=1nXi)]=Knn.{\mathbb{E}}\left[\left(\frac{1}{n}\sum_{i=1}^{n}X_{i}\right)^{\top}\left(\frac{1}{n}\sum_{i=1}^{n}X_{i}\right)\right]=\frac{K_{n}}{n}.

Therefore, by Markov’s inequality, for any ϵ>0\epsilon>0,

{|1ni=1naXi|>ϵ}\displaystyle{\mathbb{P}}\left\{{\left\lvert\frac{1}{n}\sum_{i=1}^{n}a^{\top}X_{i}\right\rvert>\epsilon}\right\} ={(1ni=1naXi)2>ϵ2}\displaystyle={\mathbb{P}}\left\{{\left(\frac{1}{n}\sum_{i=1}^{n}a^{\top}X_{i}\right)^{2}>\epsilon^{2}}\right\}
𝔼[(1ni=1naXi)2]ϵ2\displaystyle\leqslant\frac{{\mathbb{E}}\left[\left(\frac{1}{n}\sum_{i=1}^{n}a^{\top}X_{i}\right)^{2}\right]}{\epsilon^{2}}
=a2nϵ2,\displaystyle=\frac{\left\lVert a\right\rVert^{2}}{n\epsilon^{2}},

so taking \epsilon=C\sqrt{\frac{K_{n}}{n}} with C sufficiently large makes the right-hand side arbitrarily small, which yields the first claim. The second claim can be proved in a similar manner using the second display. ∎
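The \sqrt{K_{n}/n} rate in Lemma E.8 is easy to illustrate numerically. The short sketch below (the sample sizes and the choice K_{n}=\lfloor\sqrt{n}\rfloor are arbitrary) compares \lVert n^{-1}\sum_{i}X_{i}\rVert with \sqrt{K_{n}/n} as n grows.

```python
import numpy as np

rng = np.random.default_rng(3)
for n in [100, 1_000, 10_000]:
    K_n = int(np.sqrt(n))              # any K_n -> infinity with K_n / n -> 0
    X = rng.normal(size=(n, K_n))      # i.i.d., mean zero, covariance I_{K_n}
    norm_of_mean = np.linalg.norm(X.mean(axis=0))
    print(n, K_n, round(norm_of_mean, 4), round(np.sqrt(K_n / n), 4))
```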

E.2 Additional definitions and results in semiparametric efficiency theory

Following the notation in Section 2, we provide additional definitions and results from classical semiparametric efficiency theory. Recall that for a model 𝒫{{\mathcal{P}}} that contains {{\mathbb{P}}^{*}}, the tangent set at {{\mathbb{P}}^{*}} relative to 𝒫{{\mathcal{P}}} is the set of score functions at {{\mathbb{P}}^{*}} from all one-dimensional regular parametric sub-models of 𝒫{{\mathcal{P}}}, and the corresponding tangent space is the closed linear span of the tangent set. A Euclidean functional θ:𝒫p\theta:{{\mathcal{P}}}\to{\mathbb{R}}^{p} is pathwise differentiable at {{\mathbb{P}}^{*}} relative to 𝒫{{\mathcal{P}}} if, for an arbitrary smooth one-dimensional parametric sub-model 𝒫T={t:tT}{\mathcal{P}}_{T}=\left\{{{\mathbb{P}}_{t}:t\in T\subset{\mathbb{R}}}\right\} of 𝒫{{\mathcal{P}}} such that =t{{\mathbb{P}}^{*}}={\mathbb{P}}_{t^{*}} for some tTt^{*}\in T, it holds that dθ(t)dt=𝔼[D(Z)st(Z)]\frac{d\theta({\mathbb{P}}_{t^{*}})}{dt}={\mathbb{E}}[{D_{{\mathbb{P}}^{*}}}(Z)s_{t^{*}}(Z)], where D(z)p,02(){D_{{\mathbb{P}}^{*}}}(z)\in{{\mathcal{L}}^{2}_{p,0}({{\mathbb{P}}^{*}})} is a gradient of θ(){\theta(\cdot)} at {{\mathbb{P}}^{*}} relative to 𝒫{{\mathcal{P}}}, and sts_{t^{*}} is the score function of 𝒫T{\mathcal{P}}_{T} at tt^{*}. The next lemma relates the set of gradients to the set of influence functions of regular asymptotically linear estimators.

Lemma E.9.

Let θ^n\hat{\theta}_{n} be an asymptotically linear estimator of θ\theta^{*} with influence function φη(z){{\varphi}_{\eta^{*}}}(z), then θ(){\theta(\cdot)} is pathwise differentiable at {{\mathbb{P}}^{*}} relative to 𝒫{{\mathcal{P}}} with a gradient φη(z){{\varphi}_{\eta^{*}}}(z) if and only if θ^n\hat{\theta}_{n} is a regular estimator of θ\theta^{*} over 𝒫{{\mathcal{P}}}.

For a proof, see [Pfanzagl, 1990, Van Der Vaart, 1991]. Lemma E.9 implies the equivalence between the set of gradients and the set of influence functions of regular asymptotically linear estimators. If a gradient D{D^{*}_{{\mathbb{P}}^{*}}} satisfies aD𝒯𝒫()a^{\top}{D^{*}_{{\mathbb{P}}^{*}}}\in{{\mathcal{T}}_{{\mathcal{P}}}({{\mathbb{P}}^{*}})} for all apa\in{\mathbb{R}}^{p}, then it is the canonical gradient or efficient influence function of θ(){\theta(\cdot)} at {{\mathbb{P}}^{*}} relative to 𝒫{{\mathcal{P}}}. The next lemma characterizes the set of gradients for a pathwise differentiable parameter.

Lemma E.10 (Characterization of the set of gradients).

Let θ():𝒫p{\theta(\cdot)}:{{\mathcal{P}}}\to{\mathbb{R}}^{p} be a functional that is pathwise differentiable at 𝒫{{\mathbb{P}}^{*}}\in{{\mathcal{P}}} relative to 𝒫{{\mathcal{P}}}. Then the set of gradients of θ(){\theta(\cdot)} relative to 𝒫{{\mathcal{P}}} can be represented as {D(z)+g(z):ag(z)𝒯𝒫(),ap}\left\{{{D_{{\mathbb{P}}^{*}}}(z)+g(z):a^{\top}g(z)\perp{{\mathcal{T}}_{{\mathcal{P}}}({{\mathbb{P}}^{*}})},\forall a\in{\mathbb{R}}^{p}}\right\}, where D{D_{{\mathbb{P}}^{*}}} is any gradient of θ(){\theta(\cdot)}.

Proof.

See Theorem 3.4 of Tsiatis [2006]. ∎

Recall the model (1):

𝒫={=XYX:X𝒫X,YX𝒫YX},{{\mathcal{P}}}=\left\{{{\mathbb{P}}={\mathbb{P}}_{X}{\mathbb{P}}_{Y\mid X}:{\mathbb{P}}_{X}\in{{\mathcal{P}}_{X}},{\mathbb{P}}_{Y\mid X}\in{{\mathcal{P}}_{Y\mid X}}}\right\},

which models the marginal distribution and the conditional distribution separately. We next characterize the tangent space relative to (1).

Lemma E.11.

Let 𝒫{{\mathcal{P}}} be defined as in (1). Then, the tangent space 𝒯𝒫(){{\mathcal{T}}_{{\mathcal{P}}}({{\mathbb{P}}^{*}})} can be written as

𝒯𝒫()=𝒯𝒫X(X)𝒯𝒫YX(YX),{{\mathcal{T}}_{{\mathcal{P}}}({{\mathbb{P}}^{*}})}={{\mathcal{T}}_{{\mathcal{P}}_{X}}({{\mathbb{P}}^{*}_{X}})}\oplus{{\mathcal{T}}_{{\mathcal{P}}_{Y\mid X}}({{\mathbb{P}}^{*}_{Y\mid X}})},

where 𝒯𝒫X(X){{\mathcal{T}}_{{\mathcal{P}}_{X}}({{\mathbb{P}}^{*}_{X}})} is the tangent space at X{{\mathbb{P}}^{*}_{X}} relative to 𝒫X{{\mathcal{P}}_{X}}, and 𝒯𝒫YX(YX){{\mathcal{T}}_{{\mathcal{P}}_{Y\mid X}}({{\mathbb{P}}^{*}_{Y\mid X}})} is the tangent space at YX{{\mathbb{P}}^{*}_{Y\mid X}} relative to 𝒫YX{{\mathcal{P}}_{Y\mid X}}.

Proof.

See Lemma 1.6 of van der Laan and Robins [2003]. ∎

E.3 Semiparametric efficiency theory under the OSS setting

In this section, we extend the classical semiparametric efficiency theory to the OSS setting, building on ideas from Bickel and Kwon [2001]. We first review some notation. Let Z=(X,Y)Z=(X,Y) be a random variable that takes value in the probability space (𝒵,,)({\mathcal{Z}},{\mathcal{F}},{{\mathbb{P}}^{*}}), where {\mathcal{F}} is a sigma-algebra on 𝒵{\mathcal{Z}}. Let (𝒳,X,X)({\mathcal{X}},{\mathcal{F}}_{X},{{\mathbb{P}}^{*}_{X}}) be the corresponding probability space of (𝒵,,)({\mathcal{Z}},{\mathcal{F}},{{\mathbb{P}}^{*}}) over 𝒳{\mathcal{X}}. Consider the product probability space

(𝒵×𝒳,X,()),({\mathcal{Z}}\times{\mathcal{X}},{\mathcal{F}}\otimes{\mathcal{F}}_{X},{{\mathbb{Q}}({{\mathbb{P}}^{*}})}),

where {\mathcal{F}}\otimes{\mathcal{F}}_{X} represents the product sigma-algebra of {\mathcal{F}} and {\mathcal{F}}_{X}, and {{\mathbb{Q}}({{\mathbb{P}}^{*}})}={{\mathbb{P}}^{*}}\times{{\mathbb{P}}^{*}_{X}} represents the product measure of {{\mathbb{P}}^{*}} and {{\mathbb{P}}^{*}_{X}}. Note that the probability measure here depends only on {{\mathbb{P}}^{*}}, as {{\mathbb{P}}^{*}_{X}} is the corresponding marginal distribution of {{\mathbb{P}}^{*}} over {\mathcal{X}}. Suppose that {{\mathbb{P}}^{*}} belongs to a model {{\mathcal{P}}}. As usual, we assume that there is some common dominating measure \mu for {{\mathcal{P}}} such that {{\mathcal{P}}} can be equivalently represented by its density functions with respect to \mu. Our goal here is to estimate a Euclidean functional \theta(\cdot):{{\mathcal{P}}}\to\Theta\subseteq{\mathbb{R}}^{p} from the data {{\mathcal{L}}_{n}}\cup{{\mathcal{U}}_{N}}, where {{\mathcal{L}}_{n}}\stackrel{iid}{\sim}{{\mathbb{P}}^{*}} and {{\mathcal{U}}_{N}}\stackrel{iid}{\sim}{{\mathbb{P}}^{*}_{X}}. The full data are thus generated by

n,N=(i=1n)×(i=n+1n+NX).{\mathbb{Q}}^{n,N}=\left(\prod_{i=1}^{n}{{\mathbb{P}}^{*}}\right)\times\left(\prod_{i=n+1}^{n+N}{{\mathbb{P}}^{*}_{X}}\right).

Let nn\to\infty and Nn+Nγ(0,1)\frac{N}{n+N}\to\gamma\in(0,1). The empirical distribution of n𝒰N{{\mathcal{L}}_{n}}\cup{{\mathcal{U}}_{N}} is n,N=n×N,X{{\mathbb{Q}}_{n,N}}={{\mathbb{P}}_{n}}\times{{\mathbb{P}}_{N,X}}, where n=1ni=1nδZi{{\mathbb{P}}_{n}}=\frac{1}{n}\sum_{i=1}^{n}\delta_{Z_{i}} is the empirical distribution of n{{\mathcal{L}}_{n}} and N,X=1Ni=n+1n+NδXi{{\mathbb{P}}_{N,X}}=\frac{1}{N}\sum_{i=n+1}^{n+N}\delta_{X_{i}} is the empirical distribution of 𝒰N{{\mathcal{U}}_{N}}.

We first establish the local asymptotic normality property of n,N{{\mathbb{Q}}_{n,N}}.

Lemma E.12.

For any regular parametric sub-model 𝒫T={t:tTd}𝒫{\mathcal{P}}_{T}=\left\{{{\mathbb{P}}_{t}:t\in T\subset{\mathbb{R}}^{d}}\right\}\subset{{\mathcal{P}}} such that t={\mathbb{P}}_{t^{*}}={{\mathbb{P}}^{*}} for some tTt^{*}\in T, let t,X{\mathbb{P}}_{t,X} be the marginal distribution of t{\mathbb{P}}_{t} over 𝒳{\mathcal{X}}. Denote tn,N=(i=1nt)×(i=n+1n+Nt,X).{\mathbb{Q}}_{t}^{n,N}=\left(\prod_{i=1}^{n}{\mathbb{P}}_{t}\right)\times\left(\prod_{i=n+1}^{n+N}{\mathbb{P}}_{t,X}\right). Then, for any hdh\in{\mathbb{R}}^{d},

log(dt+hnn,Ndtn,N)\displaystyle\log\left(\frac{d{\mathbb{Q}}_{t^{*}+\frac{h}{\sqrt{n}}}^{n,N}}{d{\mathbb{Q}}_{t^{*}}^{n,N}}\right) =hn[n,N()](st+γ1γst,X)12h𝐈th+op(1)\displaystyle=h^{\top}\sqrt{n}\left[{{\mathbb{Q}}_{n,N}}-{{\mathbb{Q}}({{\mathbb{P}}^{*}})}\right]\left(s_{t^{*}}+\frac{\gamma}{1-\gamma}s_{t^{*},X}\right)-\frac{1}{2}h^{\top}\mathbf{I}_{t^{*}}h+o_{p}(1) (83)
N(12h𝐈th,h𝐈th),\displaystyle\rightsquigarrow N\left(-\frac{1}{2}h^{\top}\mathbf{I}_{t^{*}}h,h^{\top}\mathbf{I}_{t^{*}}h\right),

where st(z)s_{t^{*}}(z) is the score function of t{\mathbb{P}}_{t} at tt^{*}, st,X(x)s_{t^{*},X}(x) is the score function of t,X{\mathbb{P}}_{t,X} at tt^{*}, and 𝐈t=Var[st(Z)]+γ1γVar[st,X(X)]\mathbf{I}_{t^{*}}=\mathrm{Var}\left[s_{t^{*}}(Z)\right]+\frac{\gamma}{1-\gamma}\mathrm{Var}\left[s_{t^{*},X}(X)\right]. Moreover, t+hnn,N{\mathbb{Q}}_{t^{*}+\frac{h}{\sqrt{n}}}^{n,N} and tn,N{\mathbb{Q}}_{t^{*}}^{n,N} are mutually contiguous.

Here and in the following text, a regular parametric model is interpreted in the sense of quadratic-mean differentiability of the density function at tt^{*}. See Chapter 7 of Van der Vaart [2000] for details.

Proof.

By Theorem 7.2 of Van der Vaart [2000], we have

log(i=1ndt+hndt)=hn(n)(st)12hVar[st(Z)]h+op(1),\displaystyle\log\left(\prod_{i=1}^{n}\frac{d{\mathbb{P}}_{t^{*}+\frac{h}{\sqrt{n}}}}{d{\mathbb{P}}_{t^{*}}}\right)=h^{\top}\sqrt{n}({{\mathbb{P}}_{n}}-{{\mathbb{P}}^{*}})(s_{t^{*}})-\frac{1}{2}h^{\top}\mathrm{Var}[s_{t^{*}}(Z)]h+o_{p}(1),

and

log(i=n+1n+Ndt+hn,Xdt,X)=log(i=n+1n+Ndt+NnhN,Xdt,X)\displaystyle\log\left(\prod_{i=n+1}^{n+N}\frac{d{\mathbb{P}}_{t^{*}+\frac{h}{\sqrt{n}},X}}{d{\mathbb{P}}_{t^{*},X}}\right)=\log\left(\prod_{i=n+1}^{n+N}\frac{d{\mathbb{P}}_{t^{*}+\sqrt{\frac{N}{n}}\frac{h}{\sqrt{N}},X}}{d{\mathbb{P}}_{t^{*},X}}\right)
=hNnN(N,XX)(st,X)N2nhVar[st,X(X)]h+op(1)\displaystyle=h^{\top}\sqrt{\frac{N}{n}}\sqrt{N}({{\mathbb{P}}_{N,X}}-{{\mathbb{P}}^{*}_{X}})(s_{t^{*},X})-\frac{N}{2n}h^{\top}\mathrm{Var}[s_{t^{*},X}(X)]h+o_{p}(1)
=hγ1γN(N,XX)(st,X)γ2(1γ)hVar[st,X(X)]h+op(1).\displaystyle=h^{\top}\sqrt{\frac{\gamma}{1-\gamma}}\sqrt{N}({{\mathbb{P}}_{N,X}}-{{\mathbb{P}}^{*}_{X}})(s_{t^{*},X})-\frac{\gamma}{2(1-\gamma)}h^{\top}\mathrm{Var}[s_{t^{*},X}(X)]h+o_{p}(1).

Then, it follows that

log(dt+hnn,Ndtn,N)\displaystyle\log\left(\frac{d{\mathbb{Q}}_{t^{*}+\frac{h}{\sqrt{n}}}^{n,N}}{d{\mathbb{Q}}_{t^{*}}^{n,N}}\right) =log(i=1ndt+hndt)+log(i=n+1n+Ndt+hn,Xdt,X)\displaystyle=\log\left(\prod_{i=1}^{n}\frac{d{\mathbb{P}}_{t^{*}+\frac{h}{\sqrt{n}}}}{d{\mathbb{P}}_{t^{*}}}\right)+\log\left(\prod_{i=n+1}^{n+N}\frac{d{\mathbb{P}}_{t^{*}+\frac{h}{\sqrt{n}},X}}{d{\mathbb{P}}_{t^{*},X}}\right)
=hn(n)(st)+hγ1γN(N,XX)(st,X)\displaystyle=h^{\top}\sqrt{n}({{\mathbb{P}}_{n}}-{{\mathbb{P}}^{*}})(s_{t^{*}})+h^{\top}\sqrt{\frac{\gamma}{1-\gamma}}\sqrt{N}({{\mathbb{P}}_{N,X}}-{{\mathbb{P}}^{*}_{X}})\left(s_{t^{*},X}\right)
12h[Var(st)+γ1γVar(st,X)]h+op(1).\displaystyle\quad-\frac{1}{2}h^{\top}\left[\mathrm{Var}(s_{t^{*}})+\frac{\gamma}{1-\gamma}\mathrm{Var}(s_{t^{*},X})\right]h+o_{p}(1).

The convergence in distribution then follows from CLT and independence, and contiguity follows from Le Cam’s first lemma. ∎

Define lt(z1,x2):𝒵×𝒳pl_{t^{*}}(z_{1},x_{2}):{\mathcal{Z}}\times{\mathcal{X}}\to{\mathbb{R}}^{p} as

lt(z1,x2)=st(z1)+γ1γst,X(x2).l_{t^{*}}(z_{1},x_{2})=s_{t^{*}}(z_{1})+\frac{\gamma}{1-\gamma}s_{t^{*},X}(x_{2}). (84)

From Lemma E.12, lt(z1,x2)l_{t^{*}}(z_{1},x_{2}) can be viewed as the score function of the parametric model 𝒫T{\mathcal{P}}_{T} at (){{\mathbb{Q}}({{\mathbb{P}}^{*}})} with information matrix 𝐈t\mathbf{I}_{t^{*}}. It is clear that lt(z1,x2)l_{t^{*}}(z_{1},x_{2}) is X{\mathcal{F}}\otimes{\mathcal{F}}_{X} measurable and is an element of the space

{\mathcal{H}}=\left\{{f(z_{1})+\frac{\gamma}{1-\gamma}g(x_{2}):f(z)\in{{\mathcal{L}}^{2}_{p,0}({{\mathbb{P}}^{*}})},\,g(x)\in{{\mathcal{L}}^{2}_{p,0}({{\mathbb{P}}^{*}_{X}})}}\right\}. (85)

Consider any vector-valued functions f_{1}(z),f_{2}(z)\in{{\mathcal{L}}^{2}_{p,0}({{\mathbb{P}}^{*}})} and g_{1}(x),g_{2}(x)\in{{\mathcal{L}}^{2}_{p,0}({{\mathbb{P}}^{*}_{X}})}. Then f_{1}(z_{1})+\frac{\gamma}{1-\gamma}g_{1}(x_{2})\in{\mathcal{H}} and f_{2}(z_{1})+\frac{\gamma}{1-\gamma}g_{2}(x_{2})\in{\mathcal{H}}. Define

f1(z1)+γ1γg1(x2),f2(z1)+γ1γg2(x2)\displaystyle\left\langle f_{1}(z_{1})+\frac{\gamma}{1-\gamma}g_{1}(x_{2}),f_{2}(z_{1})^{\top}+\frac{\gamma}{1-\gamma}g_{2}(x_{2})^{\top}\right\rangle_{\mathcal{H}} (86)
:=f1(z),f2(z)p,02()+γ1γg1(x),g2(x)p,02(X)\displaystyle:=\left\langle f_{1}(z),f_{2}(z)^{\top}\right\rangle_{{{\mathcal{L}}^{2}_{p,0}({{\mathbb{P}}^{*}})}}+\frac{\gamma}{1-\gamma}\left\langle g_{1}(x),g_{2}(x)^{\top}\right\rangle_{{{\mathcal{L}}^{2}_{p,0}({{\mathbb{P}}^{*}_{X}})}}
=𝔼[f1(Z)f2(Z)]+γ1γ𝔼[g1(X)g2(X)].\displaystyle={\mathbb{E}}\left[f_{1}(Z)f_{2}(Z)^{\top}\right]+\frac{\gamma}{1-\gamma}{\mathbb{E}}\left[g_{1}(X)g_{2}(X)^{\top}\right].

It is straightforward to verify that this is indeed an inner product, and that {\mathcal{H}} is a Hilbert space. The information matrix \mathbf{I}_{t^{*}} can then be represented as

𝐈t=lt(z1,x2),lt(z1,x2).\mathbf{I}_{t^{*}}=\left\langle l_{t^{*}}(z_{1},x_{2}),l_{t^{*}}(z_{1},x_{2})^{\top}\right\rangle_{\mathcal{H}}.
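In practice, inner products of the form (86), and hence asymptotic variances such as \langle\varphi,\varphi^{\top}\rangle_{\mathcal{H}}, can be approximated by replacing the two expectations with empirical means over the labeled and unlabeled samples. A minimal sketch is given below; the function and argument names are illustrative, the inputs are assumed to be evaluations of mean-zero functions (elements of {{\mathcal{L}}^{2}_{p,0}}), and \gamma=N/(n+N).

```python
import numpy as np

def inner_product_H(f1_vals, f2_vals, g1_vals, g2_vals, gamma):
    """Empirical analogue of (86): E[f1 f2'] + gamma / (1 - gamma) * E[g1 g2'].

    f1_vals, f2_vals: (n, p) evaluations on the labeled sample;
    g1_vals, g2_vals: (N, p) evaluations on the unlabeled sample.
    """
    labeled_term = f1_vals.T @ f2_vals / f1_vals.shape[0]
    unlabeled_term = g1_vals.T @ g2_vals / g1_vals.shape[0]
    return labeled_term + gamma / (1.0 - gamma) * unlabeled_term
```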

Consider an arbitrary model {{\mathcal{P}}} of {{\mathbb{P}}^{*}}. We now extend the notion of a tangent space to the OSS setting. Similar to the i.i.d. setting, the tangent set relative to {{\mathcal{P}}} at {{\mathbb{Q}}({{\mathbb{P}}^{*}})} is defined as the set of all score functions at {{\mathbb{Q}}({{\mathbb{P}}^{*}})} of one-dimensional regular parametric sub-models of {{\mathcal{P}}}. The tangent space at {{\mathbb{Q}}({{\mathbb{P}}^{*}})} relative to {{\mathcal{P}}}, denoted as {{\mathcal{T}}_{{\mathcal{P}}}({{\mathbb{Q}}({{\mathbb{P}}^{*}})})}, is then the closed linear span of the tangent set in {\mathcal{H}}. Note that {{\mathcal{T}}_{{\mathcal{P}}}({{\mathbb{Q}}({{\mathbb{P}}^{*}})})}\subseteq{\mathcal{H}}. Next, we characterize the tangent space at {{\mathbb{Q}}({{\mathbb{P}}^{*}})} relative to model (1).

Lemma E.13.

The tangent space at {{\mathbb{Q}}({{\mathbb{P}}^{*}})} relative to {{\mathcal{P}}}, denoted as {{\mathcal{T}}_{{\mathcal{P}}}({{\mathbb{Q}}({{\mathbb{P}}^{*}})})}, can be expressed as

{\mathcal{M}}=\left\{f(x_{1})+g(z_{1})+\frac{\gamma}{1-\gamma}f(x_{2}):f(x)\in{{\mathcal{T}}_{{\mathcal{P}}_{X}}({{\mathbb{P}}^{*}_{X}})},\,g(z)\in{{\mathcal{T}}_{{\mathcal{P}}_{Y\mid X}}({{\mathbb{P}}^{*}_{Y\mid X}})}\right\}.
Proof.

It is straightforward to see that {\mathcal{M}} is a closed linear space since both 𝒯𝒫YX(YX){{\mathcal{T}}_{{\mathcal{P}}_{Y\mid X}}({{\mathbb{P}}^{*}_{Y\mid X}})} and 𝒯𝒫X(X){{\mathcal{T}}_{{\mathcal{P}}_{X}}({{\mathbb{P}}^{*}_{X}})} are closed linear spaces by the definition of tangent space.

First, we show that {{\mathcal{T}}_{{\mathcal{P}}}({{\mathbb{Q}}({{\mathbb{P}}^{*}})})}\subseteq{\mathcal{M}}. Consider any one-dimensional regular parametric sub-model {\mathcal{P}}_{T}=\left\{{p_{t}(z)=p_{t,X}(x)p_{t,Y\mid X}(z):t\in T}\right\} of {{\mathcal{P}}} such that p_{t^{*}}(z) is the density of {{\mathbb{P}}^{*}} for some t^{*}\in T. Let s_{t^{*}}(z), s_{t^{*},X}(x), and s_{t^{*},Y\mid X}(z) denote the score functions of p_{t}(z), p_{t,X}(x), and p_{t,Y\mid X}(z) at t^{*}, respectively. Then we have s_{t^{*}}(z)=s_{t^{*},X}(x)+s_{t^{*},Y\mid X}(z). Since {\mathcal{P}}_{T}\subset{{\mathcal{P}}}, and by the definition of {{\mathcal{P}}} (1), it holds that \left\{{p_{t,X}(x):t\in T}\right\} is a one-dimensional regular parametric sub-model of {{\mathcal{P}}_{X}}, and \left\{{p_{t,Y\mid X}(z):t\in T}\right\} is a one-dimensional regular parametric sub-model of {{\mathcal{P}}_{Y\mid X}}. Therefore, s_{t^{*},X}(x)\in{{\mathcal{T}}_{{\mathcal{P}}_{X}}({{\mathbb{P}}^{*}_{X}})} and s_{t^{*},Y\mid X}(z)\in{{\mathcal{T}}_{{\mathcal{P}}_{Y\mid X}}({{\mathbb{P}}^{*}_{Y\mid X}})}. By Lemma E.12, the score function of {\mathcal{P}}_{T} at {{\mathbb{Q}}({{\mathbb{P}}^{*}})} can be written as

lt(z1,x2)\displaystyle l_{t^{*}}(z_{1},x_{2}) =st(z1)+γ1γst,X(x2)\displaystyle=s_{t^{*}}(z_{1})+\frac{\gamma}{1-\gamma}s_{t^{*},X}(x_{2})
=st,X(x1)+st,YX(z1)+γ1γst,X(x2).\displaystyle=s_{t^{*},X}(x_{1})+s_{t^{*},Y\mid X}(z_{1})+\frac{\gamma}{1-\gamma}s_{t^{*},X}(x_{2}).

By Lemma E.11, 𝒯𝒫()=𝒯𝒫X(X)𝒯𝒫YX(YX){{\mathcal{T}}_{{\mathcal{P}}}({{\mathbb{P}}^{*}})}={{\mathcal{T}}_{{\mathcal{P}}_{X}}({{\mathbb{P}}^{*}_{X}})}\oplus{{\mathcal{T}}_{{\mathcal{P}}_{Y\mid X}}({{\mathbb{P}}^{*}_{Y\mid X}})}. Therefore, lt(z1,x2)l_{t^{*}}(z_{1},x_{2})\in{\mathcal{M}}. Then, 𝒯𝒫(()){{\mathcal{T}}_{{\mathcal{P}}}({{\mathbb{Q}}({{\mathbb{P}}^{*}})})}\subseteq{\mathcal{M}} follows from the fact that {\mathcal{M}} is a closed linear space.

For the other direction, consider arbitrary elements g(z)\in{{\mathcal{T}}_{{\mathcal{P}}_{Y\mid X}}({{\mathbb{P}}^{*}_{Y\mid X}})} and f(x)\in{{\mathcal{T}}_{{\mathcal{P}}_{X}}({{\mathbb{P}}^{*}_{X}})}. Then, without loss of generality, there must exist a one-dimensional regular parametric sub-model \left\{{p_{t,X}(x):t\in T}\right\} of {{\mathcal{P}}_{X}} such that p_{t^{*},X}(x) is the density of {{\mathbb{P}}^{*}_{X}} and f(x) is the score function of p_{t,X}(x) at t^{*}. Similarly, there must exist a one-dimensional regular parametric sub-model \left\{{p_{t,Y\mid X}(z):t\in T}\right\} of {{\mathcal{P}}_{Y\mid X}} such that p_{t^{*},Y\mid X}(z) is the density of {{\mathbb{P}}^{*}_{Y\mid X}} and g(z) is the score function of p_{t,Y\mid X}(z) at t^{*}. Consider the model {\mathcal{P}}_{T}=\left\{{p_{t}(z)=p_{t,X}(x)p_{t,Y\mid X}(z):t\in T}\right\}, which is a one-dimensional regular parametric sub-model of {{\mathcal{P}}} by its definition, with score function f(x)+g(z) at t^{*}. By Lemma E.11, f(x)+g(z)\in{{\mathcal{T}}_{{\mathcal{P}}}({{\mathbb{P}}^{*}})}. Therefore, by Lemma E.12, the score at {{\mathbb{Q}}({{\mathbb{P}}^{*}})} relative to this parametric sub-model is f(x_{1})+g(z_{1})+\frac{\gamma}{1-\gamma}f(x_{2})\in{{\mathcal{T}}_{{\mathcal{P}}}({{\mathbb{Q}}({{\mathbb{P}}^{*}})})}, and hence {\mathcal{M}}\subseteq{{\mathcal{T}}_{{\mathcal{P}}}({{\mathbb{Q}}({{\mathbb{P}}^{*}})})}. ∎

Next, we consider the extension of pathwise differentiability to the OSS setting.

Definition E.1 (Pathwise differentiability).

A Euclidean functional \theta:{{\mathcal{P}}}\to\Theta\subseteq{\mathbb{R}}^{p} is pathwise differentiable relative to {{\mathcal{P}}} at {{\mathbb{Q}}({{\mathbb{P}}^{*}})} if, for any one-dimensional regular parametric sub-model {\mathcal{P}}_{T}=\left\{{{\mathbb{P}}_{t}:t\in T\subset{\mathbb{R}}}\right\}\subset{{\mathcal{P}}} such that {\mathbb{P}}_{t^{*}}={{\mathbb{P}}^{*}} for some t^{*}\in T, the map \theta:{\mathcal{P}}_{T}\to\Theta, viewed as a function \theta(t) of t, satisfies

dθ(t)dt=lt(z1,x2),D()(z1,x2),\frac{d\theta(t^{*})}{dt}=\langle l_{t^{*}}(z_{1},x_{2}),{D_{{{\mathbb{Q}}({{\mathbb{P}}^{*}})}}}(z_{1},x_{2})\rangle_{\mathcal{H}},

for some D()(z1,x2){D_{{{\mathbb{Q}}({{\mathbb{P}}^{*}})}}}(z_{1},x_{2})\in{\mathcal{H}}, where ltl_{t^{*}} is the score function of 𝒫T{\mathcal{P}}_{T} at (){{\mathbb{Q}}({{\mathbb{P}}^{*}})}. Then, D()(z1,x2){D_{{{\mathbb{Q}}({{\mathbb{P}}^{*}})}}}(z_{1},x_{2}) is a gradient of θ(){\theta(\cdot)} relative to 𝒫{{\mathcal{P}}} at (){{\mathbb{Q}}({{\mathbb{P}}^{*}})}.

If a gradient D()(z1,x2){D^{*}_{{{\mathbb{Q}}({{\mathbb{P}}^{*}})}}}(z_{1},x_{2}) satisfies aD()(z1,x2)𝒯𝒫(())a^{\top}{D^{*}_{{{\mathbb{Q}}({{\mathbb{P}}^{*}})}}}(z_{1},x_{2})\in{{\mathcal{T}}_{{\mathcal{P}}}({{\mathbb{Q}}({{\mathbb{P}}^{*}})})} for all apa\in{\mathbb{R}}^{p}, then it is a canonical gradient or the efficient influence function of θ(){\theta(\cdot)} relative to 𝒫{{\mathcal{P}}} at (){{\mathbb{Q}}({{\mathbb{P}}^{*}})}. The next lemma establishes the connection between pathwise differentiability under the OSS setting and pathwise differentiability in the usual sense.

Lemma E.14.

If \theta:{{\mathcal{P}}}\to\Theta\subseteq{\mathbb{R}}^{p} is pathwise differentiable relative to {{\mathcal{P}}} at {{\mathbb{P}}^{*}}, then it is pathwise differentiable relative to {{\mathcal{P}}} at {{\mathbb{Q}}({{\mathbb{P}}^{*}})}. Moreover, if {D_{{\mathbb{P}}^{*}}} is a gradient of \theta(\cdot) relative to {{\mathcal{P}}} at {{\mathbb{P}}^{*}}, then

D(z1)D~𝒫()(x1)+D~𝒫()(x2){D_{{\mathbb{P}}^{*}}}(z_{1})-\tilde{D}_{{\mathcal{P}}}({{\mathbb{P}}^{*}})(x_{1})+\tilde{D}_{{\mathcal{P}}}({{\mathbb{P}}^{*}})(x_{2})

is a gradient of θ(){\theta(\cdot)} relative to 𝒫{{\mathcal{P}}} at (){{\mathbb{Q}}({{\mathbb{P}}^{*}})}, where

D~𝒫()(x)=𝔼[D𝒫()(Z)X=x]\tilde{D}_{{\mathcal{P}}}({{\mathbb{P}}^{*}})(x)={\mathbb{E}}\left[D_{{\mathcal{P}}}({{\mathbb{P}}^{*}})(Z)\mid X=x\right]

is the conditional gradient.

Proof.

Let D(z){D_{{\mathbb{P}}^{*}}}(z) be a gradient of θ(){\theta(\cdot)} relative to 𝒫{{\mathcal{P}}} at {{\mathbb{P}}^{*}}. For simplicity, denote D(z)=D(z)D(z)={D_{{\mathbb{P}}^{*}}}(z), and D~(x)=𝔼[D(Z)X=x]\tilde{D}(x)={\mathbb{E}}[D(Z)\mid X=x]. For any one-dimensional regular parametric sub-model 𝒫T{\mathcal{P}}_{T} of 𝒫{{\mathcal{P}}}, pathwise differentiability at {{\mathbb{P}}^{*}} implies

\theta(t) =\theta(t^{*})+t\langle s_{t^{*}},D\rangle_{{{\mathcal{L}}^{2}_{1,0}({{\mathbb{P}}^{*}})}}+o(t)
=θ(t)+tst,DD~1,02()+tγ1γst,X,1γγD~1,02(X)+o(t)\displaystyle=\theta(t^{*})+t\langle s_{t^{*}},D-\tilde{D}\rangle_{{{\mathcal{L}}^{2}_{1,0}({{\mathbb{P}}^{*}})}}+t\frac{\gamma}{1-\gamma}\left\langle s_{t^{*},X},\frac{1-\gamma}{\gamma}\tilde{D}\right\rangle_{{{\mathcal{L}}^{2}_{1,0}({{\mathbb{P}}^{*}_{X}})}}+o(t)
=θ(t)+tst(z1)+γ1γst,X(x2),D(z1)D~(x1)+D~(x2)+o(t),\displaystyle=\theta(t^{*})+t\left\langle s_{t^{*}}(z_{1})+\frac{\gamma}{1-\gamma}s_{t^{*},X}(x_{2}),D(z_{1})-\tilde{D}(x_{1})+\tilde{D}(x_{2})\right\rangle_{\mathcal{H}}+o(t),

where D(z_{1})-\tilde{D}(x_{1})+\tilde{D}(x_{2})\in{\mathcal{H}} and s_{t^{*}}(z_{1})+\frac{\gamma}{1-\gamma}s_{t^{*},X}(x_{2}) is the score function of {\mathcal{P}}_{T} at {{\mathbb{Q}}({{\mathbb{P}}^{*}})}. ∎
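The conditional gradient \tilde{D}(x)={\mathbb{E}}[D(Z)\mid X=x] appearing in Lemma E.14 is typically unknown, but it can be approximated by regressing evaluations of a gradient on the covariates. The sketch below uses a simple polynomial least-squares fit purely for illustration; any sufficiently flexible regression method could be substituted, and the function names are not part of the paper's notation.

```python
import numpy as np

def conditional_gradient(D_vals, X, X_new, degree=2):
    """Approximate D_tilde(x) = E[D(Z) | X = x] by polynomial least squares.

    D_vals: (n, p) gradient evaluations on the labeled sample;
    X: (n, d) covariate matrix; X_new: (m, d) evaluation points.
    """
    def feats(A):
        # intercept plus element-wise powers of each covariate (no interactions)
        return np.hstack([np.ones((A.shape[0], 1))] +
                         [A**k for k in range(1, degree + 1)])
    coef, *_ = np.linalg.lstsq(feats(X), D_vals, rcond=None)
    return feats(X_new) @ coef
```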

We now extend the definitions of regular and asymptotically linear estimators to the OSS setting as in Section 4.1. The definition of regularity remains unchanged as Definition 4.1 in the main text. We present an equivalent definition of asymptotic linearity here for convenience.

Definition E.2 (Asymptotic linearity).

An estimator θ^n,N\hat{\theta}_{n,N} of θ\theta^{*} is asymptotically linear with influence function φη(z1,x2){{\varphi}_{\eta^{*}}}(z_{1},x_{2})\in{\mathcal{H}} if

n(θ^n,Nθ)=n(n,N())[φη(z1,x2)]+op(1).\displaystyle\sqrt{n}\left(\hat{\theta}_{n,N}-\theta^{*}\right)=\sqrt{n}\left({{\mathbb{Q}}_{n,N}}-{{\mathbb{Q}}({{\mathbb{P}}^{*}})}\right)\left[{{\varphi}_{\eta^{*}}}(z_{1},x_{2})\right]+o_{p}(1). (87)

It is easy to verify, by expanding {{\mathbb{Q}}_{n,N}}-{{\mathbb{Q}}({{\mathbb{P}}^{*}})} and using the fact that \gamma\in(0,1), that Definition E.2 is equivalent to Definition 4.2. Clearly, if \hat{\theta}_{n,N} is an asymptotically linear estimator of \theta^{*} with influence function {{\varphi}_{\eta^{*}}}(z_{1},x_{2}), its asymptotic variance is \langle{{\varphi}_{\eta^{*}}}(z_{1},x_{2}),{{\varphi}_{\eta^{*}}}(z_{1},x_{2})^{\top}\rangle_{\mathcal{H}}.

Similar to Lemma E.9, we first establish the equivalence between the set of gradients and the set of influence functions under the OSS setting.

Lemma E.15.

Let \hat{\theta}_{n,N} be an asymptotically linear estimator of \theta^{*} with influence function {{\varphi}_{\eta^{*}}}(z_{1},x_{2})={{\varphi}_{\eta^{*}}^{(1)}}(z_{1})+\frac{\gamma}{1-\gamma}{{\varphi}_{\eta^{*}}^{(2)}}(x_{2}) in the sense of Definition E.2. Then, \theta(\cdot) is pathwise differentiable at {{\mathbb{Q}}({{\mathbb{P}}^{*}})} relative to {{\mathcal{P}}} with gradient {{\varphi}_{\eta^{*}}}(z_{1},x_{2}) if and only if \hat{\theta}_{n,N} is a regular estimator of \theta^{*} over {{\mathcal{P}}} in the sense of Definition 4.1.

Proof.

Following the notation of Lemma E.12, consider a one-dimensional regular parametric sub-model 𝒫T{\mathcal{P}}_{T}. By asymptotic linearity and Lemma E.12, we have

(n(θ^n,Nθ)log(dt+hnn,Ndtn,N))N((012h2𝐈t),(φη,φηhlt,φηhφη,lt𝐈t)).\begin{pmatrix}\sqrt{n}(\hat{\theta}_{n,N}-\theta^{*})\\ \log\left(\frac{d{\mathbb{Q}}_{t^{*}+\frac{h}{\sqrt{n}}}^{n,N}}{d{\mathbb{Q}}_{t^{*}}^{n,N}}\right)\end{pmatrix}\rightsquigarrow N\left(\begin{pmatrix}0\\ -\frac{1}{2}h^{2}\mathbf{I}_{t^{*}}\end{pmatrix},\begin{pmatrix}\langle{{\varphi}_{\eta^{*}}},{{\varphi}_{\eta^{*}}}^{\top}\rangle_{\mathcal{H}}&h\langle l_{t^{*}},{{\varphi}_{\eta^{*}}}^{\top}\rangle_{\mathcal{H}}\\ h\langle{{\varphi}_{\eta^{*}}},l_{t^{*}}\rangle_{\mathcal{H}}&\mathbf{I}_{t^{*}}\end{pmatrix}\right).

By contiguity and Le Cam’s third lemma, it follows that

n(θ^n,Nθ)N(hφη,lt,φη,φη)\sqrt{n}\left(\hat{\theta}_{n,N}-\theta^{*}\right)\rightsquigarrow N\left(h\langle{{\varphi}_{\eta^{*}}},l_{t^{*}}\rangle_{\mathcal{H}},\langle{{\varphi}_{\eta^{*}}},{{\varphi}_{\eta^{*}}}^{\top}\rangle_{\mathcal{H}}\right)

under t+hnn,N{\mathbb{Q}}_{t^{*}+\frac{h}{\sqrt{n}}}^{n,N}. By regularity of θ^n,N\hat{\theta}_{n,N}, it follows that

\sqrt{n}\left[\hat{\theta}_{n,N}-\theta\left({\mathbb{P}}_{t^{*}+\frac{h}{\sqrt{n}}}\right)\right]\rightsquigarrow N\left(0,\langle{{\varphi}_{\eta^{*}}},{{\varphi}_{\eta^{*}}}^{\top}\rangle_{\mathcal{H}}\right)

under {\mathbb{Q}}_{t^{*}+\frac{h}{\sqrt{n}}}^{n,N}. Comparing the two limiting distributions, this implies that

n[θ(t+hn)θ]hφη,lt.\sqrt{n}\left[\theta\left({\mathbb{P}}_{t^{*}+\frac{h}{\sqrt{n}}}\right)-\theta^{*}\right]\to h\langle{{\varphi}_{\eta^{*}}},l_{t^{*}}\rangle_{\mathcal{H}}.

Pathwise differentiability then follows by the definition of a derivative. ∎

Finally, we extend Lemma 2.1 to the OSS setting.

Lemma E.16.

Suppose θ(){\theta(\cdot)} is pathwise differentiable at (){{\mathbb{Q}}({{\mathbb{P}}^{*}})} relative to 𝒫{{\mathcal{P}}} with efficient influence function φη(z1,x2){{\varphi}^{*}_{\eta^{*}}}(z_{1},x_{2}). Let θ^n,N\hat{\theta}_{n,N} be a regular and asymptotically linear estimator of θ\theta^{*} in the sense of Definitions 4.1 and  E.2 with influence function D(z1,x2){D_{{\mathbb{P}}^{*}}}(z_{1},x_{2}). Then,

D(z1,x2),D(z1,x2)φη(z1,x2),φη(z1,x2).\left\langle{D_{{\mathbb{P}}^{*}}}(z_{1},x_{2}),{D_{{\mathbb{P}}^{*}}}(z_{1},x_{2})^{\top}\right\rangle_{\mathcal{H}}\succeq\left\langle{{\varphi}^{*}_{\eta^{*}}}(z_{1},x_{2}),{{\varphi}^{*}_{\eta^{*}}}(z_{1},x_{2})^{\top}\right\rangle_{\mathcal{H}}.

Hence, the asymptotic variance of any regular and asymptotically linear estimator of θ\theta^{*} is lower bounded by φη(z1,x2),φη(z1,x2)\left\langle{{\varphi}^{*}_{\eta^{*}}}(z_{1},x_{2}),{{\varphi}^{*}_{\eta^{*}}}(z_{1},x_{2})^{\top}\right\rangle_{\mathcal{H}}.

Proof.

By Lemma E.15, {D_{{\mathbb{P}}^{*}}}(z_{1},x_{2}) is a gradient of \theta(\cdot) at {{\mathbb{Q}}({{\mathbb{P}}^{*}})} relative to {{\mathcal{P}}}. For any s(z_{1},x_{2})\in{{\mathcal{T}}_{{\mathcal{P}}}({{\mathbb{Q}}({{\mathbb{P}}^{*}})})}, without loss of generality, let {\mathcal{P}}_{T} be any one-dimensional regular parametric sub-model of {{\mathcal{P}}} with {\mathbb{P}}_{t^{*}}={{\mathbb{P}}^{*}} whose score function at {{\mathbb{Q}}({{\mathbb{P}}^{*}})} is s(z_{1},x_{2}). (Otherwise, s(z_{1},x_{2}) is the limit of a sequence of score functions of parametric sub-models, and the following holds by the continuity of the inner product.) Write \theta:{\mathcal{P}}_{T}\to\Theta as a function \theta(t) of t. By pathwise differentiability,

dθ(t)dt=s(z1,x2),D(z1,x2)=s(z1,x2),φη(z1,x2),\frac{d\theta(t^{*})}{dt}=\langle s(z_{1},x_{2}),{D_{{\mathbb{P}}^{*}}}(z_{1},x_{2})\rangle_{\mathcal{H}}=\langle s(z_{1},x_{2}),{{\varphi}^{*}_{\eta^{*}}}(z_{1},x_{2})\rangle_{\mathcal{H}},

which shows that s(z1,x2),φη(z1,x2)D(z1,x2)=0\langle s(z_{1},x_{2}),{{\varphi}^{*}_{\eta^{*}}}(z_{1},x_{2})-{D_{{\mathbb{P}}^{*}}}(z_{1},x_{2})\rangle_{\mathcal{H}}=0. Since this holds for arbitrary s(z1,x2)𝒯𝒫(())s(z_{1},x_{2})\in{{\mathcal{T}}_{{\mathcal{P}}}({{\mathbb{Q}}({{\mathbb{P}}^{*}})})}, we have φη(z1,x2)D(z1,x2)𝒯𝒫(()){{\varphi}^{*}_{\eta^{*}}}(z_{1},x_{2})-{D_{{\mathbb{P}}^{*}}}(z_{1},x_{2})\perp{{\mathcal{T}}_{{\mathcal{P}}}({{\mathbb{Q}}({{\mathbb{P}}^{*}})})} in {\mathcal{H}}. By definition, eiφη(z1,x2)𝒯𝒫(())e_{i}^{\top}{{\varphi}^{*}_{\eta^{*}}}(z_{1},x_{2})\in{{\mathcal{T}}_{{\mathcal{P}}}({{\mathbb{Q}}({{\mathbb{P}}^{*}})})} for all i[p]i\in[p], where eipe_{i}\in{\mathbb{R}}^{p} is the vector whose ii-th coordinate is 11 and the rest are zeros. Therefore, φη(z1,x2),φη(z1,x2)D(z1,x2)=0\langle{{\varphi}^{*}_{\eta^{*}}}(z_{1},x_{2}),{{\varphi}^{*}_{\eta^{*}}}(z_{1},x_{2})-{D_{{\mathbb{P}}^{*}}}(z_{1},x_{2})\rangle_{\mathcal{H}}=0. It follows that

D,D\displaystyle\langle{D_{{\mathbb{P}}^{*}}},{D_{{\mathbb{P}}^{*}}}^{\top}\rangle_{\mathcal{H}} =Dφη+φη,Dφη+φη\displaystyle=\langle{D_{{\mathbb{P}}^{*}}}-{{\varphi}^{*}_{\eta^{*}}}+{{\varphi}^{*}_{\eta^{*}}},{D_{{\mathbb{P}}^{*}}}^{\top}-{{\varphi}^{*}_{\eta^{*}}}^{\top}+{{\varphi}^{*}_{\eta^{*}}}^{\top}\rangle_{\mathcal{H}}
=Dφη,Dφη+φη,φη\displaystyle=\langle{D_{{\mathbb{P}}^{*}}}-{{\varphi}^{*}_{\eta^{*}}},{D_{{\mathbb{P}}^{*}}}^{\top}-{{\varphi}^{*}_{\eta^{*}}}^{\top}\rangle_{\mathcal{H}}+\langle{{\varphi}^{*}_{\eta^{*}}},{{\varphi}^{*}_{\eta^{*}}}^{\top}\rangle_{\mathcal{H}}
φη,φη.\displaystyle\succeq\langle{{\varphi}^{*}_{\eta^{*}}},{{\varphi}^{*}_{\eta^{*}}}^{\top}\rangle_{\mathcal{H}}.

Appendix F Additional numerical results

F.1 Extended results for mean estimation and generalized linear model

Tables 1,  2, and 3 report the coverage of 95% confidence intervals for methods of mean estimation and generalized linear models presented in Sections 8.1 and 8.2. All methods achieve the nominal coverage.

Setting Method γ=0.1\gamma=0.1 γ=0.3\gamma=0.3 γ=0.5\gamma=0.5 γ=0.7\gamma=0.7 γ=0.9\gamma=0.9
θ^n\hat{\theta}_{n} 0.967 0.944 0.946 0.938 0.954
θ^n,Nsafe\hat{\theta}_{n,N}^{\text{safe}} 0.949 0.933 0.947 0.942 0.951
θ^n,Neff.,Kn=4\hat{\theta}_{n,N}^{\text{eff.}},K_{n}=4 0.944 0.944 0.944 0.953 0.942
θ^n,Neff.,Kn=9\hat{\theta}_{n,N}^{\text{eff.}},K_{n}=9 0.960 0.957 0.948 0.949 0.945
1 θ^n,Neff.,Kn=16\hat{\theta}_{n,N}^{\text{eff.}},K_{n}=16 0.950 0.939 0.941 0.951 0.948
θ^n,NPPI\hat{\theta}_{n,N}^{\text{PPI}} (Random Forest) 0.948 0.953 0.936 0.956 0.947
θ^n,NPPI\hat{\theta}_{n,N}^{\text{PPI}} (Noisy Predictor) 0.946 0.952 0.950 0.951 0.948
PPI++ (Random Forest) 0.962 0.951 0.954 0.940 0.938
PPI++ (Noisy Predictor) 0.950 0.949 0.946 0.942 0.946
θ^n\hat{\theta}_{n} 0.954 0.966 0.941 0.954 0.960
θ^n,Nsafe\hat{\theta}_{n,N}^{\text{safe}} 0.948 0.949 0.964 0.942 0.948
θ^n,Neff.,Kn=4\hat{\theta}_{n,N}^{\text{eff.}},K_{n}=4 0.952 0.950 0.961 0.941 0.947
θ^n,Neff.,Kn=9\hat{\theta}_{n,N}^{\text{eff.}},K_{n}=9 0.946 0.955 0.955 0.938 0.942
2 θ^n,Neff.,Kn=16\hat{\theta}_{n,N}^{\text{eff.}},K_{n}=16 0.944 0.945 0.957 0.941 0.937
θ^n,NPPI\hat{\theta}_{n,N}^{\text{PPI}} (Random Forest) 0.956 0.968 0.953 0.945 0.944
θ^n,NPPI\hat{\theta}_{n,N}^{\text{PPI}} (Noisy Predictor) 0.939 0.961 0.951 0.956 0.948
PPI++ (Random Forest) 0.953 0.950 0.971 0.948 0.940
PPI++ (Noisy Predictor) 0.947 0.953 0.956 0.944 0.938
θ^n\hat{\theta}_{n} 0.946 0.956 0.950 0.954 0.941
θ^n,Nsafe\hat{\theta}_{n,N}^{\text{safe}} 0.948 0.952 0.958 0.932 0.949
θ^n,Neff.,Kn=4\hat{\theta}_{n,N}^{\text{eff.}},K_{n}=4 0.948 0.958 0.963 0.943 0.950
θ^n,Neff.,Kn=9\hat{\theta}_{n,N}^{\text{eff.}},K_{n}=9 0.936 0.967 0.941 0.939 0.946
3 θ^n,Neff.,Kn=16\hat{\theta}_{n,N}^{\text{eff.}},K_{n}=16 0.951 0.954 0.961 0.942 0.944
θ^n,NPPI\hat{\theta}_{n,N}^{\text{PPI}} (Random Forest) 0.957 0.963 0.962 0.946 0.941
θ^n,NPPI\hat{\theta}_{n,N}^{\text{PPI}} (Noisy Predictor) 0.957 0.952 0.956 0.937 0.949
PPI++ (Random Forest) 0.939 0.959 0.960 0.939 0.946
PPI++ (Noisy Predictor) 0.940 0.948 0.948 0.960 0.934
Table 1: 95% coverage of confidence intervals for methods of mean estimation in three settings, averaged over 1,000 simulated datasets. For details of numerical experiments, see Section 8.1.
Method  γ = 0.1  γ = 0.3  γ = 0.5  γ = 0.7  γ = 0.9

Setting 1
\hat{\theta}_{n}  0.933  0.950  0.944  0.950  0.938
\hat{\theta}_{n,N}^{\text{safe}}  0.950  0.950  0.949  0.943  0.940
\hat{\theta}_{n,N}^{\text{eff.}}, K_{n} = 4  0.950  0.945  0.945  0.938  0.934
\hat{\theta}_{n,N}^{\text{eff.}}, K_{n} = 9  0.949  0.940  0.944  0.935  0.935
\hat{\theta}_{n,N}^{\text{eff.}}, K_{n} = 16  0.940  0.936  0.946  0.931  0.931
\hat{\theta}_{n,N}^{\text{PPI}} (Random Forest)  0.948  0.942  0.938  0.939  0.939
\hat{\theta}_{n,N}^{\text{PPI}} (Noisy Predictor)  0.942  0.941  0.949  0.934  0.939
PPI++ (Random Forest)  0.952  0.952  0.951  0.947  0.941
PPI++ (Noisy Predictor)  0.952  0.943  0.945  0.946  0.934

Setting 2
\hat{\theta}_{n}  0.948  0.943  0.949  0.948  0.940
\hat{\theta}_{n,N}^{\text{safe}}  0.944  0.951  0.944  0.953  0.945
\hat{\theta}_{n,N}^{\text{eff.}}, K_{n} = 4  0.947  0.940  0.936  0.942  0.950
\hat{\theta}_{n,N}^{\text{eff.}}, K_{n} = 9  0.947  0.945  0.930  0.942  0.944
\hat{\theta}_{n,N}^{\text{eff.}}, K_{n} = 16  0.952  0.932  0.940  0.946  0.944
\hat{\theta}_{n,N}^{\text{PPI}} (Random Forest)  0.952  0.943  0.955  0.947  0.956
\hat{\theta}_{n,N}^{\text{PPI}} (Noisy Predictor)  0.954  0.952  0.923  0.937  0.953
PPI++ (Random Forest)  0.941  0.949  0.943  0.955  0.945
PPI++ (Noisy Predictor)  0.942  0.951  0.944  0.953  0.947
Table 2: Coverage of 95% confidence intervals for \theta_{1}^{*} in the Poisson GLM in two settings, averaged over 1,000 simulated datasets. For details of the numerical experiments, see Section 8.2.
Method  γ = 0.05  γ = 0.25  γ = 0.5  γ = 0.75  γ = 0.95

Setting 1
\hat{\theta}_{n}  0.947  0.950  0.946  0.937  0.952
\hat{\theta}_{n,N}^{\text{safe}}  0.943  0.949  0.949  0.941  0.945
\hat{\theta}_{n,N}^{\text{eff.}}, K_{n} = 4  0.936  0.942  0.945  0.943  0.950
\hat{\theta}_{n,N}^{\text{eff.}}, K_{n} = 9  0.944  0.944  0.956  0.936  0.953
\hat{\theta}_{n,N}^{\text{eff.}}, K_{n} = 16  0.938  0.937  0.950  0.937  0.939
\hat{\theta}_{n,N}^{\text{PPI}} (Random Forest)  0.942  0.954  0.948  0.943  0.953
\hat{\theta}_{n,N}^{\text{PPI}} (Noisy Predictor)  0.947  0.958  0.947  0.944  0.947
PPI++ (Random Forest)  0.939  0.948  0.947  0.956  0.953
PPI++ (Noisy Predictor)  0.940  0.946  0.950  0.946  0.940

Setting 2
\hat{\theta}_{n}  0.960  0.942  0.949  0.939  0.950
\hat{\theta}_{n,N}^{\text{safe}}  0.960  0.953  0.949  0.952  0.943
\hat{\theta}_{n,N}^{\text{eff.}}, K_{n} = 4  0.952  0.944  0.947  0.939  0.950
\hat{\theta}_{n,N}^{\text{eff.}}, K_{n} = 9  0.955  0.945  0.942  0.946  0.943
\hat{\theta}_{n,N}^{\text{eff.}}, K_{n} = 16  0.950  0.945  0.943  0.935  0.941
\hat{\theta}_{n,N}^{\text{PPI}} (Random Forest)  0.948  0.949  0.950  0.937  0.954
\hat{\theta}_{n,N}^{\text{PPI}} (Noisy Predictor)  0.958  0.959  0.943  0.944  0.955
PPI++ (Random Forest)  0.960  0.954  0.949  0.952  0.943
PPI++ (Noisy Predictor)  0.960  0.952  0.947  0.951  0.943
Table 3: Coverage of 95% confidence intervals for \theta_{2}^{*} in the Poisson GLM in two settings, averaged over 1,000 simulated datasets. For details of the numerical experiments, see Section 8.2.

In the Poisson GLM setting of Section 8.2, Figure 3 displays the standard error of estimators of the second parameter, \theta_{2}^{*}. The results are similar to those described in Section 8.2.

Figure 3: Estimated standard error of estimators of \theta_{2}^{*} in the Poisson GLM setting, detailed in Section 8.2. The results are similar to those in Figure 2.

F.2 Variance estimation

We consider Example 7.3. The supervised estimator is the sample variance, with influence function \varphi_{\eta^{*}}(z) = (y - \mathbb{E}[Y])^{2} - \theta^{*}. The conditional influence function is

\phi_{\eta^{*}}(x) = \mathbb{E}\left[(Y - \mathbb{E}[Y])^{2} \mid X = x\right] - \theta^{*}.

We generate the response Y as Y = \mathbb{E}[Y \mid X] + \epsilon, where \epsilon \sim N(0, 1). We consider two settings for \mathbb{E}[Y \mid X]:

  1. Setting 1 (non-linear model): \mathbb{E}[Y \mid X = x] = -1.70 x_{1} x_{2} - 6.94 x_{1} x_{2}^{2} - 1.35 x_{1}^{2} x_{2} + 2.28 x_{1}^{2} x_{2}^{2}.

  2. Setting 2 (well-specified model, in the sense of Definition 3.1): \mathbb{E}[Y \mid X = x] = 0.

For each model, we set n = 1,000 and vary the proportion of unlabeled observations, γ = N/(n+N) ∈ {0.1, 0.3, 0.5, 0.7, 0.9}. The semiparametric efficiency lower bounds in the ISS setting (5) and in the OSS setting (25) are estimated separately on a sample of 100,000 observations.
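For concreteness, the following sketch (ours, not the authors' code) generates data under the two settings and computes the supervised sample variance together with a simple semi-supervised correction obtained by projecting the estimated influence-function values onto a polynomial basis of X. The marginal distribution of X, the choice of basis, and helper names such as conditional_mean, poly_basis, and variance_estimators are illustrative assumptions; the paper's efficient estimator additionally uses K_{n} basis functions with cross-fitting, which is omitted here.

import numpy as np

rng = np.random.default_rng(0)

def conditional_mean(x, setting):
    # E[Y | X = x] in the two simulation settings
    x1, x2 = x[:, 0], x[:, 1]
    if setting == 1:  # Setting 1: non-linear model
        return -1.70 * x1 * x2 - 6.94 * x1 * x2**2 - 1.35 * x1**2 * x2 + 2.28 * x1**2 * x2**2
    return np.zeros(len(x))  # Setting 2: well-specified model

def simulate(n, N, setting):
    # Labeled sample of size n and unlabeled covariates of size N.
    # The marginal distribution of X (standard bivariate normal) is an
    # assumption made here for illustration only.
    X_lab = rng.normal(size=(n, 2))
    X_unlab = rng.normal(size=(N, 2))
    Y = conditional_mean(X_lab, setting) + rng.normal(size=n)
    return X_lab, Y, X_unlab

def poly_basis(x, degree=2):
    # Polynomial basis of X used to approximate the conditional influence function
    x1, x2 = x[:, 0], x[:, 1]
    cols = [np.ones(len(x))]
    for p in range(1, degree + 1):
        for q in range(p + 1):
            cols.append(x1 ** (p - q) * x2 ** q)
    return np.column_stack(cols)

def variance_estimators(X_lab, Y, X_unlab):
    n, N = len(Y), len(X_unlab)
    theta_sup = Y.var(ddof=1)                   # supervised estimator (sample variance)
    phi_hat = (Y - Y.mean()) ** 2 - theta_sup   # estimated influence-function values

    # Project the estimated influence function onto a polynomial basis of X;
    # this stands in for the paper's cross-fitted series estimator of phi(x).
    B_lab, B_unlab = poly_basis(X_lab), poly_basis(X_unlab)
    beta = np.linalg.lstsq(B_lab, phi_hat, rcond=None)[0]

    # Semi-supervised correction: remove the X-predictable part of the influence
    # function, re-centered with the unlabeled sample, weighted by gamma = N/(n+N).
    gamma = N / (n + N)
    theta_ss = theta_sup - gamma * ((B_lab @ beta).mean() - (B_unlab @ beta).mean())

    se_sup = phi_hat.std(ddof=1) / np.sqrt(n)
    return theta_sup, theta_ss, se_sup

X_lab, Y, X_unlab = simulate(n=1000, N=9000, setting=1)   # corresponds to γ = 0.9
print(variance_estimators(X_lab, Y, X_unlab))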

Table 4 reports the coverage of 95% confidence intervals for each method. All methods achieve the nominal coverage.

Method  γ = 0.05  γ = 0.25  γ = 0.5  γ = 0.75  γ = 0.95

Setting 1
\hat{\theta}_{n}  0.943  0.943  0.947  0.946  0.955
\hat{\theta}_{n,N}^{\text{safe}}  0.950  0.949  0.957  0.949  0.952
\hat{\theta}_{n,N}^{\text{eff.}}, K_{n} = 4  0.945  0.937  0.956  0.956  0.951
\hat{\theta}_{n,N}^{\text{eff.}}, K_{n} = 9  0.946  0.943  0.953  0.949  0.941
\hat{\theta}_{n,N}^{\text{eff.}}, K_{n} = 16  0.945  0.941  0.944  0.959  0.948
\hat{\theta}_{n,N}^{\text{PPI}} (Random Forest)  0.948  0.948  0.951  0.951  0.943
\hat{\theta}_{n,N}^{\text{PPI}} (Noisy Predictor)  0.948  0.961  0.942  0.957  0.956

Setting 2
\hat{\theta}_{n}  0.945  0.937  0.951  0.937  0.944
\hat{\theta}_{n,N}^{\text{safe}}  0.945  0.954  0.949  0.954  0.955
\hat{\theta}_{n,N}^{\text{eff.}}, K_{n} = 4  0.940  0.947  0.958  0.944  0.942
\hat{\theta}_{n,N}^{\text{eff.}}, K_{n} = 9  0.945  0.964  0.940  0.948  0.938
\hat{\theta}_{n,N}^{\text{eff.}}, K_{n} = 16  0.943  0.939  0.945  0.946  0.942
\hat{\theta}_{n,N}^{\text{PPI}} (Random Forest)  0.949  0.957  0.948  0.948  0.946
\hat{\theta}_{n,N}^{\text{PPI}} (Noisy Predictor)  0.950  0.946  0.940  0.950  0.954
Table 4: Coverage of 95% confidence intervals for variance estimation methods in two settings, averaged over 1,000 simulated datasets. For details of the numerical experiments, see Section F.2.
Figure 4: Comparison of standard error for methods of variance estimation, averaged over 1,000 simulated datasets. Left: When the conditional influence function is non-linear, \hat{\theta}_{n,N}^{\text{eff.}} is efficient with a sufficient number of basis functions. Right: For a well-specified estimation problem in the sense of Definition 3.1, no semi-supervised method improves upon the supervised estimator, which aligns with Corollary 4.2.

Figure 4 displays the standard error of each method, averaged over 1,000 simulated datasets.

The results are similar to those of Section 8. In Setting 1 (non-linear model), \hat{\theta}_{n,N}^{\text{eff.}} achieves the efficiency lower bound in the OSS setting once a sufficient number of basis functions is used. By contrast, \hat{\theta}_{n,N}^{\text{safe}} with g(x) = x improves upon the supervised estimator but is not efficient, as it cannot accurately approximate the non-linear conditional influence function. \hat{\theta}_{n,N}^{\text{PPI}} with responses predicted by a random forest also performs well, while its counterpart with a pure-noise prediction model is no worse than the supervised estimator. In Setting 2 (well-specified model), no semi-supervised method improves upon the supervised estimator, which already attains the OSS lower bound, as indicated by Corollary 4.2.

F.3 Kendall’s τ

We consider Example 7.4. The supervised estimator is

\hat{\theta}_{n} = \frac{2}{n(n-1)} \sum_{i<j} I\left\{(U_{i} - U_{j})(V_{i} - V_{j}) > 0\right\}

with influence function

\varphi_{\eta^{*}}(z) = 2\,\mathbb{P}^{*}_{Y}\left\{(U - u)(V - v) > 0\right\} - 2\theta^{*}.

The conditional influence function is

\phi_{\eta^{*}}(x) = 2\,\mathbb{P}^{*}\left\{(U^{\prime} - U)(V^{\prime} - V) > 0 \mid X = x\right\} - 2\theta^{*},

where Y^{\prime} = (U^{\prime}, V^{\prime}) is an independent copy of Y. We generate the response Y = (U, V) as U = \mathbb{E}[U \mid X] + \epsilon and V = \mathbb{E}[V \mid X] + \epsilon^{\prime}, where (\epsilon, \epsilon^{\prime})^{\top} \sim N(0, \mathbf{I}_{2}). We set \mathbb{E}[U \mid X = x] = \mathbb{E}[V \mid X = x] := h(x), and consider two settings for h(x):

  1. Setting 1 (non-linear model): h(x) = -1.70 x_{1} x_{2} - 6.94 x_{1} x_{2}^{2} - 1.35 x_{1}^{2} x_{2} + 2.28 x_{1}^{2} x_{2}^{2}.

  2. Setting 2 (well-specified model, in the sense of Definition 3.1): h(x) = 0.

For each model, we set n = 1,000 and vary the proportion of unlabeled observations, γ = N/(n+N) ∈ {0.1, 0.3, 0.5, 0.7, 0.9}. Due to the complexity of the data-generating model, we did not estimate the ISS and OSS lower bounds.
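As an illustration of this setup, the sketch below (ours, not the authors' code) generates (U, V) under the two settings and computes the supervised U-statistic \hat{\theta}_{n} along with a plug-in estimate of its influence-function values, from which a standard error follows. The marginal distribution of X and helper names such as h, simulate, and kendall_tau_estimator are illustrative assumptions; the semi-supervised corrections would then be applied to these influence-function estimates as in the main text.

import numpy as np

rng = np.random.default_rng(0)

def h(x, setting):
    # Common conditional mean E[U | X = x] = E[V | X = x] in the two settings
    x1, x2 = x[:, 0], x[:, 1]
    if setting == 1:  # Setting 1: non-linear model
        return -1.70 * x1 * x2 - 6.94 * x1 * x2**2 - 1.35 * x1**2 * x2 + 2.28 * x1**2 * x2**2
    return np.zeros(len(x))  # Setting 2: well-specified model

def simulate(n, setting):
    # The marginal distribution of X (standard bivariate normal) is an
    # illustrative assumption.
    X = rng.normal(size=(n, 2))
    eps = rng.normal(size=(n, 2))        # (epsilon, epsilon') ~ N(0, I_2)
    U = h(X, setting) + eps[:, 0]
    V = h(X, setting) + eps[:, 1]
    return X, U, V

def kendall_tau_estimator(U, V):
    # Supervised U-statistic and plug-in influence-function values
    n = len(U)
    # conc[i, j] = I{(U_i - U_j)(V_i - V_j) > 0} for ordered pairs i != j
    conc = ((U[:, None] - U[None, :]) * (V[:, None] - V[None, :]) > 0).astype(float)
    np.fill_diagonal(conc, 0.0)
    theta_hat = conc.sum() / (n * (n - 1))   # equals 2/(n(n-1)) * sum_{i<j} I{...}
    # Hajek-projection estimate of the influence function at each observation:
    # phi_i = 2 * average_{j != i} I{(U_j - U_i)(V_j - V_i) > 0} - 2 * theta_hat
    phi_hat = 2.0 * conc.sum(axis=1) / (n - 1) - 2.0 * theta_hat
    se = phi_hat.std(ddof=1) / np.sqrt(n)
    return theta_hat, phi_hat, se

X, U, V = simulate(n=1000, setting=1)
theta_hat, phi_hat, se = kendall_tau_estimator(U, V)
print(theta_hat, se)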

Table 5 reports the coverage of 95% confidence intervals for each method. All methods achieve the nominal coverage.

Method  γ = 0.05  γ = 0.25  γ = 0.5  γ = 0.75  γ = 0.95

Setting 1
\hat{\theta}_{n}  0.943  0.950  0.928  0.942  0.939
\hat{\theta}_{n,N}^{\text{safe}}  0.952  0.938  0.942  0.944  0.947
\hat{\theta}_{n,N}^{\text{eff.}}, K_{n} = 4  0.955  0.948  0.947  0.944  0.942
\hat{\theta}_{n,N}^{\text{eff.}}, K_{n} = 9  0.937  0.951  0.937  0.941  0.929
\hat{\theta}_{n,N}^{\text{eff.}}, K_{n} = 16  0.954  0.955  0.933  0.947  0.942
\hat{\theta}_{n,N}^{\text{PPI}} (Random Forest)  0.946  0.946  0.942  0.930  0.942
\hat{\theta}_{n,N}^{\text{PPI}} (Noisy Predictor)  0.937  0.948  0.945  0.937  0.936

Setting 2
\hat{\theta}_{n}  0.956  0.966  0.933  0.951  0.942
\hat{\theta}_{n,N}^{\text{safe}}  0.946  0.934  0.944  0.949  0.936
\hat{\theta}_{n,N}^{\text{eff.}}, K_{n} = 4  0.948  0.954  0.944  0.940  0.941
\hat{\theta}_{n,N}^{\text{eff.}}, K_{n} = 9  0.947  0.950  0.933  0.938  0.938
\hat{\theta}_{n,N}^{\text{eff.}}, K_{n} = 16  0.935  0.953  0.937  0.942  0.947
\hat{\theta}_{n,N}^{\text{PPI}} (Random Forest)  0.949  0.950  0.944  0.940  0.944
\hat{\theta}_{n,N}^{\text{PPI}} (Noisy Predictor)  0.949  0.951  0.942  0.943  0.949
Table 5: Coverage of 95% confidence intervals for Kendall’s τ estimation methods in two settings, averaged over 1,000 simulated datasets. For details of the numerical experiments, see Section F.3.
Figure 5: Comparison of standard error for methods that estimate Kendall’s τ, averaged over 1,000 simulations. Left: When the conditional influence function is non-linear, \hat{\theta}_{n,N}^{\text{eff.}} is efficient with a sufficient number of basis functions. Right: For a well-specified estimation problem in the sense of Definition 3.1, no semi-supervised method improves upon the supervised estimator, as shown in Corollary 4.2.

The results are similar to those of Section 8. In Setting 1 (non-linear model), \hat{\theta}_{n,N}^{\text{eff.}} with K_{n} = 16 has the lowest standard error. \hat{\theta}_{n,N}^{\text{safe}} with g(x) = x improves upon the supervised estimator, but is not as efficient as \hat{\theta}_{n,N}^{\text{eff.}}. \hat{\theta}_{n,N}^{\text{PPI}} with responses predicted by a random forest also performs well, while its counterpart with a pure-noise prediction model has a standard error comparable to that of the supervised estimator. In Setting 2 (well-specified model), no semi-supervised method improves upon the supervised estimator, as indicated by Corollary 4.2. We note that the apparent slight improvement in standard error by \hat{\theta}_{n,N}^{\text{eff.}} in the right panel is an artifact of the relatively small sample size.