A Unified Framework for Semiparametrically Efficient Semi-Supervised Learning
Abstract
We consider statistical inference under a semi-supervised setting where we have access to both a labeled dataset consisting of pairs and an unlabeled dataset . We ask the question: under what circumstances, and by how much, can incorporating the unlabeled dataset improve upon inference using the labeled data? To answer this question, we investigate semi-supervised learning through the lens of semiparametric efficiency theory. We characterize the efficiency lower bound under the semi-supervised setting for an arbitrary inferential problem, and show that incorporating unlabeled data can potentially improve efficiency if the parameter is not well-specified. We then propose two types of semi-supervised estimators: a safe estimator that imposes minimal assumptions, is simple to compute, and is guaranteed to be at least as efficient as the initial supervised estimator; and an efficient estimator, which — under stronger assumptions — achieves the semiparametric efficiency bound. Our findings unify existing semiparametric efficiency results for particular special cases, and extend these results to a much more general class of problems. Moreover, we show that our estimators can flexibly incorporate predicted outcomes arising from “black-box” machine learning models, and thereby achieve the same goal as prediction-powered inference (PPI), but with superior theoretical guarantees. We also provide a complete understanding of the theoretical basis for the existing set of PPI methods. Finally, we apply the theoretical framework developed to derive and analyze efficient semi-supervised estimators in a number of settings, including M-estimation, U-statistics, and average treatment effect estimation, and demonstrate the performance of the proposed estimators via simulations.
1 Introduction
Supervised learning involves modeling the association between a set of responses and a set of covariates on the basis of labeled realizations , where . In the semi-supervised setting, we also have access to additional unlabeled realizations of , , without associated realizations of . This situation could arise, for instance, when observing is more expensive or time-consuming than observing .
Suppose that , and . The joint distribution of can be expressed as . Suppose that consists of independent and identically distributed (i.i.d.) realizations of , and consists of i.i.d. realizations of that are also independent of .
We further assume that the marginal distribution and the conditional distribution are separately modeled. That is, belongs to a marginal model , and belongs to a conditional model . The model of the joint distribution can thus be expressed as
(1) |
Consider a general inferential problem where the parameter of interest is a Euclidean-valued functional . The ground truth is , which we denote as for simplicity. This paper centers around the following question: how, and by how much, can inference using improve upon inference on using only ?
1.1 The three settings
We now introduce three settings that will be used throughout this paper.
1. In the supervised setting, only labeled data are available. Supervised estimators can be written as .
2. In the ordinary semi-supervised (OSS) setting, both labeled data and unlabeled data are available. OSS estimators can be written as .
3. In the ideal semi-supervised (ISS) setting, the labeled data are available and, in addition, the marginal distribution of the covariates is known exactly. The ISS setting is primarily of theoretical interest: in reality, we rarely know the marginal distribution. Analyzing the ISS setting will facilitate analysis of the OSS setting.
1.2 Main results
Our main contributions are as follows:
1. For an arbitrary inferential problem, we derive the semiparametric efficiency lower bound under the ISS setting. We show that knowledge of can be used to potentially improve upon a supervised estimator when the parameter of interest is not well-specified, in the sense of Definition 3.1. This generalizes existing results that semi-supervised learning cannot improve a correctly-specified linear model [Kawakita and Kanamori, 2013, Buja et al., 2019a, Song et al., 2023, Gan and Liang, 2023].
2. In the OSS setting, the data are non-i.i.d., and consequently classical semiparametric efficiency theory is not applicable. We establish the semiparametric efficiency lower bound in this setting: to our knowledge, this is the first such result in the literature. As in the ISS setting, an efficiency gain over the supervised setting is possible when the parameter of interest is not well-specified.
3. To achieve efficiency gains over supervised estimation when the parameter is not well-specified, we propose two types of semi-supervised estimators — the safe estimator and the efficient estimator — both of which build upon an arbitrary initial supervised estimator. The safe estimator requires minimal assumptions, and is always at least as efficient as the initial supervised estimator. The efficient estimator imposes stronger assumptions, and can achieve the semi-parametric efficiency bound. Compared to existing semi-supervised estimators, the proposed estimators are more general and simpler to compute, and enjoy optimality properties.
4. Suppose we have access to a “black box” machine learning model, , that can be used to obtain predictions of on unlabeled data. We show that our safe estimator can be adapted to make use of these predictions, thereby connecting the semi-supervised framework of this paper to prediction-powered inference (PPI) [Angelopoulos et al., 2023a, b], a setting of extensive recent interest. This contextualizes existing PPI proposals through the lens of semi-supervised learning, and directly leads to new PPI estimators with better theoretical guarantees and empirical performance.
1.3 Related literature
While semi-supervised learning is not a new research area [Bennett and Demiriz, 1998, Chapelle et al., 2006, Bair, 2013, Van Engelen and Hoos, 2020], it has been the topic of extensive recent theoretical interest.
Recently, Zhang et al. [2019] studied semi-supervised mean estimation and proposed a semi-supervised regression-based estimator with reduced asymptotic variance; Zhang and Bradic [2022] extended this result to the high-dimensional setting and developed an approach for bias-corrected inference. Both Chakrabortty and Cai [2018] and Azriel et al. [2022] considered semi-supervised linear regression, and proposed asymptotically normal estimators with improved efficiency over the supervised ordinary least squares estimator. Deng et al. [2023] further investigated semi-supervised linear regression under a high-dimensional setting and proposed a minimax optimal estimator. Wang et al. [2023] and Quan et al. [2024] extended these findings to the setting of semi-supervised generalized linear models. Some authors have considered semi-supervised learning for more general inferential tasks, such as M-estimation [Chakrabortty, 2016, Song et al., 2023, Yuval and Rosset, 2022] and U-statistics [Kim et al., 2024], and in different sub-fields of statistics, such as causal inference [Cheng et al., 2021, Chakrabortty and Dai, 2022] and extreme-valued statistics [Ahmed et al., 2024]. However, there exists neither a unified theoretical framework for the semi-supervised setting, nor a unified approach to construct efficient estimators in this setting.
Prediction-powered inference (PPI) refers to a setting in which the data analyst has access to not only and , but also a “black box” machine learning model such that . The goal of PPI is to conduct inference on the association between and while making use of , , and the black box predictions given by . To the best of our knowledge, despite extensive recent interest in the PPI setting [Angelopoulos et al., 2023a, b, Miao et al., 2023, Gan and Liang, 2023, Miao and Lu, 2024, Zrnic and Candès, 2024a, b, Gu and Xia, 2024], no prior work formally connects prediction-powered inference with the semi-supervised paradigm.
1.4 Notation
For any natural number , . Let denote the Cartesian product of two sets and . Let denote the smallest eigenvalue of a matrix . For two symmetric matrices , we write that if and only if is positive semi-definite. We use uppercase letters to denote random variables and lowercase letters to denote their realizations. For a vector , let be its Euclidean norm. The random variable takes values in the probability space . For a measurable function , let denote its expectation, and its variance-covariance matrix. For another measurable function , let denote the covariance matrix between and . Let be the space of all vector-valued measurable functions such that , and let denote the sub-space of functions with expectation 0, i.e. any satisfies .
1.5 Organization of the paper
The rest of our paper is organized as follows. Section 2 contains a very brief review of some concepts in semiparametric efficiency theory. In Sections 3 and 4, we derive semi-parametric efficiency bounds and construct efficient estimators under the ISS and OSS settings. In Sections 5 and 6, we connect the proposed framework to prediction-powered inference and missing data, respectively. In Section 7, we instantiate our theoretical and methodological findings in the context of specific examples: M-estimation, U-statistics, and the estimation of average treatment effect. Numerical experiments and concluding remarks are in Sections 8 and 9, respectively. Proofs of theoretical results can be found in the Supplement.
2 Overview of semiparametric efficiency theory
We provide a brief introduction to concepts and results from semiparametric efficiency theory that are essential for reading later sections of the paper. See also Supplement E.2. A more comprehensive introduction can be found in Chapter 25 of Van der Vaart [2000] or in Tsiatis [2006].
Consider i.i.d. data sampled from , and suppose there exists a dominating measure such that each element of can be represented by its corresponding density with respect to . Interest lies in a Euclidean functional , and . Consider a one-dimensional regular parametric sub-model such that for some . defines a score function at . The set of score functions at from all such one-dimensional regular parametric sub-models of forms the tangent set at relative to , and the closed linear span of the tangent set is the tangent space at relative to , which we denote as . By definition, the tangent space .
An estimator of is regular relative to if, for any regular parametric sub-model such that for some ,
for all under , where is a random variable whose distribution depends on but not on . is asymptotically linear with influence function if
The influence function depends on through some functional , where is a general metric space with metric , and . The functional need not be finite-dimensional. The functional of interest is typically involved in , but they are usually not the same. A regular and asymptotically linear estimator is efficient for at relative to if its influence function satisfies for any . If this holds, then is the efficient influence function of at relative to .
Semiparametric efficiency theory aims to provide a lower bound for the asymptotic variances of regular and asymptotically linear estimators.
Lemma 2.1.
Suppose there exists a regular and asymptotically linear estimator of . Let denote the efficient influence function of at relative to . Then, it follows that for any regular and asymptotically linear estimator of such that ,
Lemma 2.1 states that any regular and asymptotically linear estimator of relative to has asymptotic variance no smaller than the variance of the efficient influence function, . This is referred to as the semiparametric efficiency lower bound. For a proof of Lemma 2.1, see Theorem 3.2 of Bickel et al. [1993].
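For concreteness, here is the textbook special case of the mean (our own illustration, not a restatement of a display from this paper; the symbols $\theta$ and $\varphi_{\mathrm{eff}}$ are generic notation). For the mean functional in a fully nonparametric model, the efficient influence function and the resulting bound are
\[
\theta(P) = \mathbb{E}_P[Y], \qquad \varphi_{\mathrm{eff}}(y) = y - \theta(P), \qquad \operatorname{Var}\{\varphi_{\mathrm{eff}}(Y)\} = \operatorname{Var}(Y),
\]
so the semiparametric efficiency lower bound equals the variance of $Y$ and is attained by the sample mean.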
3 The ideal semi-supervised setting
3.1 Semiparametric efficiency lower bound
Under the ISS setting, we have access to labeled data , and the marginal distribution is known. Thus, the model reduces from (1) to
(2) |
A smaller model space leads to an easier inferential problem, and thus a lower efficiency bound. We let
(3) |
denote the conditional efficient influence function. Our first result establishes the semiparametric efficiency bound under the ISS setting.
Theorem 3.1.
Thus, knowledge of can improve efficiency at if and only if , i.e. if and only if is not a constant almost surely.
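As a concrete illustration of this message in the classical case of mean estimation (our own worked example, consistent with the general form (4)-(5); writing $P_X$ for the covariate marginal): the supervised efficient influence function is $\varphi_{\mathrm{eff}}(x,y) = y - \theta$, so the conditional efficient influence function is
\[
\mathbb{E}\{\varphi_{\mathrm{eff}}(X,Y) \mid X = x\} = \mathbb{E}[Y \mid X = x] - \theta,
\]
and knowledge of $P_X$ reduces the bound from $\operatorname{Var}(Y)$ to
\[
\operatorname{Var}(Y) - \operatorname{Var}\{\mathbb{E}[Y \mid X]\} = \mathbb{E}\{\operatorname{Var}(Y \mid X)\}.
\]
In particular, there is no efficiency gain precisely when $\mathbb{E}[Y \mid X]$ is constant almost surely.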
3.2 Inference with a well-specified parameter
It is well-known that a correctly-specified regression model cannot be improved with additional unlabeled data [Kawakita and Kanamori, 2013, Buja et al., 2019a, Song et al., 2023, Gan and Liang, 2023]; Chakrabortty [2016] further extended this result to the setting of M-estimation. In this section, we formally characterize this phenomenon, and generalize it to an arbitrary inferential problem, in the ISS setting.
The next definition is motivated by Buja et al. [2019b].
Definition 3.1 (Well-specified parameter).
Let be the data-generating distribution, and a model of the marginal distribution such that . A functional is well-specified at relative to if any satisfies
We emphasize that well-specification is a joint property of the functional of interest , the conditional distribution , and the marginal model . Definition 3.1 states that if the parameter is well-specified, then a change to the marginal model does not change . Intuitively, if this is the case, then knowledge of in the ISS setting will not affect inference on .
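As an illustration of Definition 3.1 (a standard example, consistent with the results cited in Section 3.2), take the least-squares functional
\[
\theta(P) = \{\mathbb{E}_P[XX^{\top}]\}^{-1}\,\mathbb{E}_P[XY].
\]
If the linear model is correctly specified, i.e. $\mathbb{E}[Y \mid X] = X^{\top}\theta_0$, then $\mathbb{E}_P[XY] = \mathbb{E}_P[XX^{\top}]\theta_0$, so $\theta(P) = \theta_0$ for every covariate marginal with an invertible second-moment matrix: the parameter is well-specified. If instead $\mathbb{E}[Y \mid X]$ is non-linear, then $\theta(P)$ generally changes as the covariate marginal changes, the parameter is not well-specified, and unlabeled data can potentially help.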
Theorem 3.2.
Under the conditions of Theorem 3.1, let be the efficient influence function of at relative to . If is well-specified at relative to , then
(6) |
where is the conditional efficient influence function as in (3). Moreover, if , then any influence function of a regular and asymptotically linear estimator of satisfies
(7) |
where the notation denotes the conditional influence function,
(8) |
As a direct corollary of Theorem 3.2, if is well-specified, then knowledge of does not improve the semiparametric efficiency lower bound.
Corollary 3.3.
Under the conditions of Theorem 3.1, suppose that is well-specified at relative to . Then, the semiparametric efficiency lower bound of relative to is the same as the semiparametric efficiency lower bound of relative to .
3.3 Safe and efficient estimators
Together, Theorems 3.1 and 3.2 imply that when the parameter is not well-specified, we can potentially use knowledge of for more efficient inference. We will now present two such approaches, both of which build upon an initial supervised estimator. Under minimal assumptions, the safe estimator is always at least as efficient as the initial supervised estimator. By contrast, under a stronger set of assumptions, the efficient estimator achieves the efficiency lower bound (5) under the ISS setting.
We first provide some intuition behind the two estimators. Suppose that is a functional that takes values in a metric space with metric and . Let denote an initial supervised estimator of that is regular and asymptotically linear with influence function . Motivated by the form of the efficient influence function (4) under the ISS setting, we aim to use knowledge of to obtain an estimator of the conditional influence function . This will lead to a new estimator of the form
(9) |
In what follows, we let denote an estimator of .
3.3.1 The safe estimator
Let denote an arbitrary measurable function of , and define
(10) |
(Since is known in the ISS setting, we can compute given .)
To construct the safe estimator, we will estimate by regressing onto , leading to regression coefficients
(11) |
where the expectation in (11) is possible since is known. Then the safe ISS estimator takes the form
(12) |
To establish the convergence of , we make the following assumptions.
Assumption 3.1.
(a) The initial supervised estimator is a regular and asymptotically linear estimator of with influence function . (b) There exists a set such that , the class of functions is -Donsker, and for all , , where is a square integrable function. (c) There exists an estimator of such that and .
Assumptions 3.1 (b) and (c) state that we can find a consistent estimator of the functional on which the influence function of depends, such that asymptotically belongs to a realization set on which the class of functions is a Donsker class. The Donsker condition is standard in semiparametric statistics, and leads to -convergence while allowing for rich classes of functions. We validate Assumption 3.1 for specific inferential problems in Section 7. In the special case when is finite-dimensional, the next proposition provides sufficient conditions for Assumption 3.1.
Proposition 3.4.
Suppose that (i) is a finite-dimensional functional; (ii) there exists a bounded open set such that , and for all , , where is a square integrable function; and (iii) there exists an estimator of such that . Then, Assumptions 3.1 (b) and (c) hold.
Define the population regression coefficient as
(13) |
The next theorem establishes the asymptotic behavior of in (12).
Theorem 3.5.
Suppose that is a supervised estimator that satisfies Assumption 3.1, and is an estimator of as in Assumption 3.1 (c). Let be a square-integrable function such that and is non-singular, and let be its centered version (10). Then, for in (13), defined in (12) is a regular and asymptotically linear estimator of with influence function
and asymptotic variance . Moreover,
(14) |
Theorem 3.5 establishes that is always at least as efficient as the initial supervised estimator under the ISS setting.
Remark 3.1 (Choice of regression basis function ).
Theorem 3.5 reveals that the asymptotic variance of is the sum of two terms. The first term, , does not depend on . The second term, , can be interpreted as the approximation error that arises from approximating with . Thus, is more efficient when is a better approximation of .
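For concreteness, the following minimal sketch (our own illustration, not code from the paper) implements this construction for mean estimation on simulated data, taking f(x) = x and a known Gaussian covariate law; the sample sizes, the coefficients, and the use of sample averages in place of the known-marginal expectations appearing in (11) are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated labeled data; in the ISS setting the marginal law of X is known,
# here N(0, I_2), so E[X] = 0 exactly.
n, d = 500, 2
X = rng.normal(size=(n, d))
Y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

# Initial supervised estimator of theta = E[Y] and its estimated influence values.
theta_hat = Y.mean()
psi_hat = Y - theta_hat              # influence function of the sample mean

# Centered basis (10): f(x) = x minus its (known) expectation under the marginal.
f_bar = X - np.zeros(d)

# Regression coefficients (11): regress psi_hat onto the centered basis.
beta_hat, *_ = np.linalg.lstsq(f_bar, psi_hat, rcond=None)

# Safe ISS estimator (12): subtract the sample mean of the fitted projection.
theta_safe = theta_hat - (f_bar @ beta_hat).mean()
```

The variance reduction relative to the sample mean is governed by how well the linear combination of f approximates the conditional influence function, as described in Remark 3.1.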
3.3.2 The efficient estimator
To construct the efficient estimator, we will approximate by regressing onto a growing number of basis functions. Let denote the Hölder class of functions on ,
(15) |
Assumption 3.2.
Under Assumption 3.1, the set additionally satisfies that: , where is the conditional influence function and is the class of functions defined as,
In the definition of , is the -th coordinate of a vector-valued function , , and .
Under Assumption 3.2, there exists a set of basis functions of , such as a spline basis or a polynomial basis, such that any can be represented as an infinite linear combination , for coefficients . Define as the concatenation of the first basis functions. The optimal -approximation of by the first basis functions is then , where is the population regression coefficient,
It can be shown that the approximation error arising from the first basis functions satisfies
(16) |
where is the dimension of [Newey, 1997].
To estimate the conditional influence function , we define the centered basis functions
(17) |
and regress onto . (Note that (17) can be computed because is known.) The resulting regression coefficients are
(18) |
The nonparametric least squares estimator of is , and the efficient ISS estimator is
(19) |
Define . The next theorem establishes the theoretical properties of .
Theorem 3.6.
Suppose that is a supervised estimator that satisfies Assumptions 3.1 and 3.2 and is an estimator of as in Assumption 3.1 (c). Further suppose that is a set of basis functions of for which satisfies (16) and
Let be the centered version of as in (17). If , , , and , then in (19) is a regular and asymptotically linear estimator of with influence function
under the ISS setting. Moreover, its asymptotic variance is , which satisfies
(20) |
for arbitrary , , and as in Theorem 3.5.
Theorem 3.6 shows that the efficient estimator is always at least as efficient as both the safe estimator and the initial supervised estimator . Further, (20) and Theorem 3.1 suggest that if the initial supervised estimator is efficient under the supervised setting, then the efficient estimator is efficient under the ISS setting.
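A minimal sketch of the series-regression construction behind the efficient estimator, assuming mean estimation, a polynomial basis, and a Gaussian covariate law (so that the centering expectations in (17) can be approximated by Monte Carlo draws from the known marginal); the fixed degree used here stands in for the growing number of basis functions required by the theory.

```python
import numpy as np
from itertools import combinations_with_replacement

rng = np.random.default_rng(1)
n, d = 500, 2
X = rng.normal(size=(n, d))
Y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(size=n)

theta_hat = Y.mean()
psi_hat = Y - theta_hat                      # estimated influence values

def poly_basis(Z, degree):
    """Polynomial features of total degree 1..degree (no intercept)."""
    cols = []
    for deg in range(1, degree + 1):
        for idx in combinations_with_replacement(range(Z.shape[1]), deg):
            cols.append(np.prod(Z[:, list(idx)], axis=1))
    return np.column_stack(cols)

degree = 3
B = poly_basis(X, degree)

# Under the ISS setting the marginal of X is known, so E[b_K(X)] can be
# computed; here it is approximated by Monte Carlo from that known marginal.
X_mc = rng.normal(size=(200_000, d))
b_mean = poly_basis(X_mc, degree).mean(axis=0)

B_bar = B - b_mean                                           # centered basis (17)
gamma_hat, *_ = np.linalg.lstsq(B_bar, psi_hat, rcond=None)  # coefficients (18)

# Efficient ISS estimator (19); in the theory the basis grows with n.
theta_eff = theta_hat - (B_bar @ gamma_hat).mean()
```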
4 The ordinary semi-supervised setting
4.1 Semiparametric efficiency lower bound
Because the OSS setting offers more information about the unknown parameter than the supervised setting but less information than the ISS setting, it is natural to think that the semiparametric efficiency bound in the OSS setting should be lower than in the supervised setting but higher than in the ISS setting. In this section, we will formalize this intuition.
In the OSS setting, the full data are not jointly i.i.d.. Therefore, the classic theory of semiparametric efficiency for i.i.d. data, as described in Section 2, is not applicable. Bickel and Kwon [2001] provides a framework for efficiency theory for non-i.i.d. data. Inspired by their results, our next theorem establishes the semiparametric efficiency lower bound under the OSS setting. To state Theorem 4.1, we must first adapt some definitions from Section 2 to the OSS setting. Additional definitions necessary for its proof are deferred to Appendix E.3.
Definition 4.1 (Regularity in the OSS setting).
An estimator of is regular relative to a model if, for every regular parametric sub-model such that for some ,
(21) |
for all under and , where is some random variable whose distribution depends on but not on .
Definition 4.2 (Asymptotic linearity in the OSS setting).
An estimator of is asymptotically linear with influence function , where
(22) |
for some and , if
(23) |
for .
Remark 4.1.
In the OSS setting, the arguments of the influence function are in the product space . Furthermore, since , this product space can also be written as . Thus, to avoid notational confusion, when referring to a function on this space, we use arguments : that is, these subscripts are intended to disambiguate the role of the two arguments in the product space.
The form of the influence function (22) arises from the fact that the data consist of two i.i.d. parts, and , which are independent of each other. By Definition 4.2, if is asymptotically linear, then
That is, the asymptotic variance of an asymptotically linear estimator is the sum of the variances of the two components of its influence function.
Armed with these two definitions, Lemma 2.1 can be generalized to the OSS setting. Informally, there exists a unique efficient influence function of the form (22), and any regular and asymptotically linear estimator whose influence function equals the efficient influence function is an efficient estimator. Additional details are provided in Appendix E.3.
The next theorem identifies the efficient influence function and the semiparametric efficiency lower bound under the OSS.
Theorem 4.1.
Let be defined as in (1), and let . If the efficient influence function of at relative to is , then the efficient influence function of at relative to under the OSS setting is
(24) |
where is the conditional efficient influence function defined in (3). Moreover, the semiparametric efficiency bound can be expressed as
(25) |
Theorem 4.1 confirms our intuition for the efficiency bound under the OSS setting. By definition, the efficiency bound under the supervised setting is . The second line of (25) reveals that the efficiency bound under the OSS setting is smaller by the amount . Moreover, in Theorem 3.1 we showed that the efficiency bound under the ISS setting is . The third line of (25) shows that the efficiency bound under the OSS setting exceeds this by the amount .
The efficiency bound (25) depends on the limiting proportion of unlabeled data, . Intuitively, when is large, we have more unlabeled data, and the efficiency bound in (25) improves. The special cases where and are particularly instructive. When , the amount of unlabeled data is negligible, and hence we should expect no improvement in efficiency over the supervised setting: this intuition is confirmed by setting in (25). When , there are many more unlabeled than labeled observations; thus, it is as if we know the marginal distribution . Letting , the efficiency lower bound agrees with that under the ISS (5).
We saw in Corollary 3.3 that when the functional of interest is well-specified, the efficiency bound under the ISS setting is the same as in the supervised setting. As an immediate corollary of Theorem 4.1 and Theorem 3.2, we now show that the same result holds under the OSS setting.
Corollary 4.2.
Thus, an efficient supervised estimator of a well-specified parameter can never be improved via the use of unlabeled data.
4.2 Safe and efficient estimators
Similar to Section 3.3, in this section we provide two types of OSS estimators, a safe estimator and an efficient estimator, both of which build upon and improve an initial supervised estimator.
Recall from Remark 3.2 that the safe and efficient estimators proposed in the ISS setting require knowledge of . Of course, in the OSS setting, is unavailable. Thus, we will simply replace in (12) and (19) with , the empirical marginal distribution of the labeled and unlabeled covariates, .
4.2.1 The safe estimator
For an arbitrary measurable function of , , recall that in the ISS setting, the safe estimator (12) made use of its centered version . Since is unknown under the OSS setting, we define the empirically centered version of as
(26) |
Now, regressing onto yields the regression coefficients
(27) |
The regression estimator of the conditional influence function is thus , and the corresponding safe OSS estimator is defined as
(28) |
Under the same conditions as in Theorem 3.5, we establish the asymptotic behavior of .
Theorem 4.3.
Suppose that is a supervised estimator that satisfies Assumption 3.1, and is an estimator of as in Assumption 3.1 (c). Let be a square-integrable function such that and is non-singular, and let be its empirically centered version (26). Suppose that . Then, the estimator defined in (28) is a regular and asymptotically linear estimator of in the sense of Definitions 4.1 and 4.2, with influence function
(29) |
under the OSS setting, where is defined in (13). Furthermore, the asymptotic variance of takes the form
(30) |
and satisfies
(31) |
Theorem 4.3 shows that is always at least as efficient as the initial supervised estimator under the OSS setting. Thus, it is a safe alternative to when additional unlabeled data are available.
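The following sketch (our own illustration, with simulated data, f(x) = x, and the control-variate form of the correction as assumptions) shows one natural implementation of the construction in (26)-(28) for mean estimation, in which the centering uses the pooled labeled and unlabeled covariates.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, d = 500, 2000, 2

# Labeled data and additional unlabeled covariates (OSS setting).
X_lab = rng.normal(size=(n, d))
Y_lab = X_lab @ np.array([1.0, 2.0]) + rng.normal(size=n)
X_unl = rng.normal(size=(m, d))

# Initial supervised estimator of theta = E[Y] and its influence values.
theta_hat = Y_lab.mean()
psi_hat = Y_lab - theta_hat

# Empirically centered basis (26): center f(x) = x by the pooled covariate mean.
f = lambda Z: Z
f_all_mean = f(np.vstack([X_lab, X_unl])).mean(axis=0)
f_tilde_lab = f(X_lab) - f_all_mean

# Regression coefficients (27) from regressing psi_hat onto the centered basis.
beta_hat, *_ = np.linalg.lstsq(f_tilde_lab, psi_hat, rcond=None)

# Safe OSS estimator (28): control-variate correction using the unlabeled data.
theta_safe_oss = theta_hat - (f_tilde_lab @ beta_hat).mean()
```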
4.2.2 The efficient estimator
As in Section 3.3, we require that the initial supervised estimator satisfy Assumption 3.2. Then, for a suitable set of basis functions of (defined in Assumption 3.2), we define . We then center with its empirical mean, , leading to
(32) |
The nonparametric least squares estimator of the conditional influence function is , where
(33) |
are the coefficients obtained from regressing onto . The efficient OSS estimator is then
(34) |
The next theorem establishes the asymptotic properties of .
Theorem 4.4.
Suppose that the supervised estimator satisfies Assumptions 3.1 and 3.2, and is an estimator of as in Assumption 3.1 (c). Further, suppose that is a set of basis functions of such that satisfies (16) for any , and . Suppose that , , and , and . Then, defined in (34) is a regular and asymptotically linear estimator of in the sense of Definitions 4.1 and 4.2, and has influence function
(35) |
Furthermore, the asymptotic variance of takes the form
(36) |
and satisfies
(37) |
where is defined as (30).
As in Section 3.3, the efficient OSS estimator, , is always at least as efficient as both the safe OSS estimator, , and the initial supervised estimator . Furthermore, (37) together with Theorem 4.1 show that if the initial supervised estimator is efficient under the supervised setting, then the efficient OSS estimator is efficient under the OSS setting.
Remark 4.2.
We noted in Section 4.1 that the efficiency bound in the OSS falls between the efficiency bounds in the ISS and supervised settings. We can see from Theorems 4.3 and 4.4 that a similar property holds for the safe and efficient estimators. Specifically, (31) and (37) show that the safe and efficient OSS estimators are more efficient than the initial supervised estimator , but are less efficient than the safe and efficient ISS estimators, respectively. This is again due to the fact that under the OSS, unlabeled observations provide more information than is available under the supervised setting, but less than is available under the ISS. Similarly, as , i.e. as the proportion of unlabeled data becomes negligible, the OSS estimators are asymptotically equivalent to the initial supervised estimator. On the other hand, when , the OSS estimators are asymptotically equivalent to the corresponding ISS estimators.
5 Connection to prediction-powered inference
Suppose now that in addition to and , the data analyst also has access to machine learning prediction models , , which are independent of and (e.g., they were trained on independent data). For instance, may arise from black-box machine learning models such as neural networks or large language models. It is clear that this is a special case of semi-supervised learning, as can be treated as fixed functions conditional on the data upon which they were trained.
Recently, Angelopoulos et al. [2023a] proposed prediction-powered inference (PPI), which provides a principled approach for making use of . Subsequently, a number of PPI variants have been proposed to further improve statistical efficiency or extend PPI to other settings [Angelopoulos et al., 2023b, Miao et al., 2023, Gan and Liang, 2023, Miao and Lu, 2024, Gu and Xia, 2024]. In this section, we re-examine the PPI problem through the lens of our results in previous sections, and apply these insights to improve upon existing PPI estimators.
Since are independent of , existing PPI estimators fall into the category of OSS estimators, and can be shown to be regular and asymptotically linear in the sense of Definitions 4.1 and 4.2. Therefore, Theorem 4.1 suggests that their asymptotic variances are lower bounded by the efficiency bound (25). We show in Supplement C that existing PPI estimators cannot achieve the efficiency bound (25) in the OSS setting, unless strong assumptions are made on the machine learning prediction models. Furthermore, if the parameter of interest is well-specified in the sense of Definition 3.1, then by Corollary 4.2 these estimators cannot be more efficient than the efficient supervised estimator. In other words, independently trained machine learning models, however sophisticated and accurate, cannot improve inference when the parameter is well-specified.
Remark 5.1.
Our insight that independently trained machine learning models cannot improve inference in a well-specified model stands in apparent contradiction to the simulation results of Angelopoulos et al. [2023b], who find that PPI does lead to improvement over supervised estimation in generalized linear models. This is because they have simulated data such that , i.e. the machine learning model is not independent of and . A modification to their simulation study to achieve independence (in keeping with the setting of their paper) corroborates our insight, i.e., PPI does not outperform the supervised estimator.
Next, we take advantage of the insights developed in previous sections to propose a class of OSS estimators that incorporates the machine learning models and improves upon the existing PPI estimators. We begin with an initial supervised estimator that is regular and asymptotically linear with influence function , and we suppose that is an estimator of . As in Section 4.2, we estimate the conditional influence function defined in (3) with regression. Specifically, consider
(38) |
which arises from replacing the true response in with the machine learning model , for . Its empirically centered version is
(39) |
Then the regression estimator of the conditional influence function is , where
(40) |
are the coefficients obtained from regressing onto . Motivated by the safe OSS estimator, in (28), the safe PPI estimator is defined as
(41) |
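A minimal sketch of this safe PPI construction for mean estimation, assuming a single prediction model h (a hypothetical stand-in for a black-box model trained on independent data) and the influence-function-with-predictions basis described in (38)-(39); the simulated data and the choice of h are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, d = 500, 2000, 2
X_lab = rng.normal(size=(n, d))
Y_lab = np.sin(X_lab[:, 0]) + X_lab[:, 1] + rng.normal(size=n)
X_unl = rng.normal(size=(m, d))

# A black-box prediction model trained on *independent* data (here, a simple
# stand-in; in practice this could be a random forest or a neural network).
h = lambda Z: np.sin(Z[:, 0]) + Z[:, 1]

theta_hat = Y_lab.mean()
psi_hat = Y_lab - theta_hat                  # influence values on labeled data

# Basis (38): the influence function with Y replaced by the prediction h(X),
# empirically centered over the pooled covariates as in (39).
g = lambda Z: h(Z) - theta_hat
g_all_mean = np.mean(np.concatenate([g(X_lab), g(X_unl)]))
g_tilde_lab = (g(X_lab) - g_all_mean).reshape(-1, 1)

# Coefficients (40) and the safe PPI estimator (41).
beta_hat, *_ = np.linalg.lstsq(g_tilde_lab, psi_hat, rcond=None)
theta_ppi = theta_hat - (g_tilde_lab @ beta_hat).mean()
```

If the prediction model is pure noise, the fitted coefficient converges to zero and the estimator reverts to the supervised sample mean, which is the "safe" behavior described in Proposition 5.2.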
We now investigate the asymptotic behavior of the estimator (41). Note that Theorem 4.3 is not applicable, as the regression basis in (39) is random due to the involvement of . Consider an arbitrary class of measurable functions indexed by ,
(42) |
We make the following assumptions on .
Assumption 5.1.
(a) and is non-singular; (b) Under Assumption 3.1, is -Donsker, and for all , , where is a square-integrable function.
Similar to Assumption 3.1, Assumption 5.1 requires that the class of functions is a Donsker class; when is finite-dimensional, the next proposition provides sufficient conditions for Assumption 5.1.
Proposition 5.1.
Define as the centered version of , and
(43) |
as the population coefficients for the regression of onto . The next proposition establishes the asymptotic behavior of .
Proposition 5.2.
Suppose that is a supervised estimator that satisfies Assumption 3.1, defined as (42) is a class of measurable functions that satisfies Assumption 5.1, and is an estimator of as in Assumption 3.1 (c). Suppose further that . Then the estimator defined in (41) is a regular and asymptotically linear estimator of in the sense of Definitions 4.1 and 4.2, and has influence function
(44) |
in the OSS setting, where is defined as in (43). Furthermore, the asymptotic variance of takes the form
and satisfies
(45) |
Proposition 5.2 can be viewed as an extension of Theorem 4.3, where in Theorem 4.3 the function class is a singleton class that does not depend on . Our proposed PPI estimator (41) flexibly incorporates multiple black-box machine learning models, and enjoys several advantages over existing PPI estimators:
- 1.
- 2.
- 3. While existing PPI estimators are only applicable to M- and Z-estimators, our proposal is much more general: it is applicable to arbitrary inferential problems and requires only a regular and asymptotically linear initial supervised estimator.
We provide a detailed discussion of the efficiency of existing PPI estimators in Appendix C.
6 Connection with missing data
The missing data framework provides an alternative approach for modeling the semi-supervised setting [Robins and Rotnitzky, 1995, Chen et al., 2008, Zhou et al., 2008, Li and Luedtke, 2023, Graham et al., 2024]. Here we relate the proposed framework to the classical theory of missing data.
To establish a formal relationship between the two paradigms, we consider a missing completely at random (MCAR) model under which the semiparametric efficiency bound coincides with that derived in Theorem 4.1 in the OSS setting. Let be i.i.d. data, where , is a binary missingness indicator such that the response is observed if and only if , and is a Bernoulli distribution with known probability where . Assume that , where is defined in (1) as in previous sections. The underlying model of is thus
(46) |
The next proposition derives the efficiency bound relative to (46).
Proposition 6.1.
Let be defined as in (1) and let . Suppose that the efficient influence function of at relative to is , and recall that is the conditional efficient influence function. Consider i.i.d. data generated from with model (46). Then the efficient influence function of at relative to is
and the corresponding semiparametric efficiency lower bound is
A comparison of (6.1) and (25) reveals that, for any fixed , the efficiency bound under the MCAR model (46) is the same as under the OSS setting. Thus, the amount of information useful for inference under the two paradigms is the same. However, in the MCAR model the data are fully i.i.d.; by contrast, in the OSS setting the data are not fully i.i.d., and instead consist of two independent i.i.d. parts from and , respectively. In fact, the OSS setting corresponds to the MCAR model conditional on the event . As the distribution of is completely known under the MCAR model, conditioning does not alter the information available, and so it is not surprising that the efficiency bounds are the same.
The semi-supervised setting allows for , i.e., for the possibility that the sample size of the unlabeled data far exceeds that of the labeled data, . As discussed in Section 4.1, this case corresponds to the ISS setting, with efficiency lower bound given by Theorem 3.1. Our proposed safe and efficient OSS estimators also allow for , as shown in Theorems 4.3 and 4.4. By contrast, it is difficult to theoretically analyze the efficiency lower bound when using the missing data framework, and estimators developed for missing data require the probability of missingness to be strictly smaller than 1.
7 Applications
In this section, we apply the proposed framework to a variety of inferential problems, including M-estimation, U-statistics, and average treatment effect estimation. Constructing the safe estimator (12) or (28), or the PPI estimator (41), requires finding an initial supervised estimator and an estimator that satisfies Assumption 3.1. To construct the efficient OSS estimator (34), the initial supervised estimator also needs to satisfy Assumption 3.2, which cannot be verified in practice.
7.1 M-estimation
First, we apply the proposed framework to M-estimation, a setting also considered in Chapter 2 of Chakrabortty [2016] and Song et al. [2023]. For a function , we define the target parameter as the maximizer of
Given i.i.d. labeled data , we use the M-estimator
(47) |
as the initial supervised estimator of . Under regularity conditions such as those stated in Theorems 5.7 and 5.21 of Van der Vaart [2000], the M-estimator (47) is a regular and asymptotically linear estimator of with influence function
(48) |
where and . The functional that appears in (48) is
which is finite-dimensional. Therefore, validating Assumption 3.1 is equivalent to validating the conditions stated in Proposition 3.4, which we now do for two canonical examples of M-estimation problems.
Example 7.1 (Mean).
Define for some function . The target parameter is the expectation . The M-estimator (47) in this case is the sample mean
which is regular and asymptotically linear with influence function . The functional in this example is , as and does not depend on the underlying distribution. A consistent estimator of is the sample mean . Furthermore, is -Lipschitz in , and the conditions of Proposition 3.4 are satisfied.
Example 7.2 (Generalized linear models).
Define , where is a convex and infinitely-differentiable function. Let and denote its first and second-order derivatives, respectively. Here corresponds to the log-likelihood of a canonical exponential family distribution with natural parameter and log partition function , i.e., a generalized linear model (GLM). However, we do not assume that the underlying distribution belongs to this model. The target parameter maximizes , and can be viewed as the best Kullback–Leibler approximation of the underlying distribution by the GLM model. The M-estimator (47) in this case is the GLM estimator
which is regular and asymptotically linear with influence function
A consistent estimator of the functional is
Under mild conditions, it can be shown that satisfies the conditions of Proposition 3.4. Claims made in this example are proved in Supplement D.
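A minimal sketch for the Poisson case of this example, assuming simulated data, a hand-rolled Newton solver in place of standard GLM software, and the safe OSS correction of Section 4.2.1 with basis f(x) = x; all of these choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, d = 1000, 4000, 2
X_lab = rng.normal(size=(n, d))
Y_lab = rng.poisson(np.exp(0.3 * X_lab[:, 0] - 0.2 * X_lab[:, 1]))
X_unl = rng.normal(size=(m, d))

# Supervised Poisson GLM estimator via Newton-Raphson (canonical link).
theta = np.zeros(d)
for _ in range(25):
    mu = np.exp(X_lab @ theta)                  # b'(x^T theta) for the Poisson family
    score = X_lab.T @ (Y_lab - mu) / n
    V = (X_lab * mu[:, None]).T @ X_lab / n     # average of b''(x^T theta) x x^T
    theta = theta + np.linalg.solve(V, score)

# Estimated influence values, as in (48): V^{-1} (y - b'(x^T theta)) x.
resid = Y_lab - np.exp(X_lab @ theta)
psi_hat = np.linalg.solve(V, (X_lab * resid[:, None]).T).T      # n x d

# Safe OSS correction with basis f(x) = x, centered by the pooled covariate mean.
f_all_mean = np.vstack([X_lab, X_unl]).mean(axis=0)
f_tilde = X_lab - f_all_mean
beta_hat, *_ = np.linalg.lstsq(f_tilde, psi_hat, rcond=None)    # d x d coefficients
theta_safe = theta - f_tilde.mean(axis=0) @ beta_hat
```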
Application to Z-estimation or estimating equations is similar and thus omitted.
7.2 U-statistics
Next, we apply the proposed framework to U-statistics. Let be a symmetric kernel function. The target parameter is defined as
where the expectation is taken over i.i.d. random variables . To estimate , the supervised estimator is
(49) |
We assume that is non-degenerate:
(50) |
where is the conditional expectation with the first argument fixed at , i.e.,
(51) |
Under this condition, is a regular and asymptotically linear estimator with influence function
(52) |
where is again the order of the kernel [for a proof, see Theorem 12.3 of Van der Vaart, 2000]. The functional in (52) is . This is an infinite-dimensional functional that takes values in , where is the space of uniformly bounded functions equipped with the uniform metric . Denoting and , it follows that (52) is -Lipschitz with respect to the metric .
We have already established that U-statistics (49) are regular and asymptotically linear with -Lipschitz influence functions. Therefore, it remains to validate the remaining parts of (b) and (c) of Assumption 3.1, which we now do for two canonical examples of U-statistics.
Example 7.3 (Variance).
Define . The target of inference is then the variance . The U-statistic in this case is the sample variance,
which is regular and asymptotically linear with influence function when is non-degenerate. In this simple example, the functional is finite-dimensional, and a consistent estimator of is . Therefore, invoking Proposition 3.4, Assumption 3.1 is satisfied.
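A minimal sketch of the safe OSS construction for this example, assuming simulated data whose conditional variance depends on the covariates (so that the conditional influence function is non-constant) and a simple quadratic basis in the covariates.

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 500, 2000
X_lab = rng.normal(size=(n, 2))
Y_lab = X_lab[:, 0] + 0.5 * rng.normal(size=n) * (1 + np.abs(X_lab[:, 1]))
X_unl = rng.normal(size=(m, 2))

# Supervised U-statistic: the sample variance, with plug-in influence values
# psi_hat_i = (Y_i - Ybar)^2 - sigma2_hat, as described in Example 7.3.
y_bar = Y_lab.mean()
sigma2_hat = Y_lab.var(ddof=1)
psi_hat = (Y_lab - y_bar) ** 2 - sigma2_hat

# Safe OSS correction: regress psi_hat onto an empirically centered basis of X.
f = lambda Z: np.column_stack([Z, Z ** 2])          # simple covariate basis
f_all_mean = f(np.vstack([X_lab, X_unl])).mean(axis=0)
f_tilde = f(X_lab) - f_all_mean
beta_hat, *_ = np.linalg.lstsq(f_tilde, psi_hat, rcond=None)
sigma2_safe = sigma2_hat - (f_tilde @ beta_hat).mean()
```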
Example 7.4 (Kendall’s ).
Consider and define . The target of inference is
which measures the dependence between and . The U-statistic is Kendall’s ,
which is the average number of pairs with concordant sign. When is non-degenerate, is regular and asymptotically linear with influence function
where . A natural estimator of in this case is
In Supplement D, we validate the conditions in Assumption 3.1 under additional assumptions on .
7.3 Average treatment effect
We consider application of the proposed framework to the estimation of the average treatment effect (ATE). Suppose we have for confounders , binary treatment , and outcome . Let and represent the counterfactual outcomes under control and treatment, respectively. Under appropriate assumptions, the ATE, defined as , can be expressed as , where for . For simplicity, we consider the target of inference
(53) |
Inference for the ATE, , follows similarly.
We define and , and consider the model
It can be shown [Robins et al., 1994, Hahn, 1998] that the efficient influence function relative to is
(54) |
where . In this case, is infinite-dimensional and takes values in the product space , where is the space of uniformly bounded functions equipped with the uniform metric .
We consider three examples.
Example 7.5 (Additional data on confounders).
Suppose that additional observations of the confounders are available, i.e. . The conditional efficient influence function in this case is
By Theorem 4.1, the efficient influence function under the OSS is
(55) |
where — in keeping with Remark 4.1 — we have used as arguments to the influence function. We note that if has no confounding effect, i.e. , then and the efficiency bound (55) is the same as in the supervised setting.
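A minimal sketch for this example, targeting the mean counterfactual outcome under treatment, assuming a randomized treatment with known propensity 0.5, a linear working model for the outcome regression, and the fitted outcome regression itself as the basis for the safe OSS correction; all of these modeling choices are illustrative assumptions rather than requirements of the framework.

```python
import numpy as np

rng = np.random.default_rng(6)
n, m, d = 1000, 4000, 2

# Labeled data (confounders, treatment, outcome) and unlabeled confounders.
U_lab = rng.normal(size=(n, d))
A_lab = rng.binomial(1, 0.5, size=n)                 # known propensity pi = 0.5
Y_lab = U_lab @ np.array([1.0, -1.0]) + A_lab * (1 + U_lab[:, 0]) + rng.normal(size=n)
U_unl = rng.normal(size=(m, d))

# Outcome regression mu1(u) = E[Y | A = 1, U = u], fit by least squares on the
# treated units (a working model; any consistent estimator could be used).
D1 = np.column_stack([np.ones(A_lab.sum()), U_lab[A_lab == 1]])
coef1, *_ = np.linalg.lstsq(D1, Y_lab[A_lab == 1], rcond=None)
mu1 = lambda U: np.column_stack([np.ones(len(U)), U]) @ coef1

# Supervised AIPW estimator and its estimated influence values, as in (54).
pi = 0.5
aipw = A_lab * (Y_lab - mu1(U_lab)) / pi + mu1(U_lab)
theta_hat = aipw.mean()
psi_hat = aipw - theta_hat

# Safe OSS correction: the conditional influence function given the confounders
# is mu1(U) minus the target, so mu1(U) is a natural regression basis.
f = lambda U: mu1(U).reshape(-1, 1)
f_all_mean = f(np.vstack([U_lab, U_unl])).mean(axis=0)
f_tilde = f(U_lab) - f_all_mean
beta_hat, *_ = np.linalg.lstsq(f_tilde, psi_hat, rcond=None)
theta_safe = theta_hat - (f_tilde @ beta_hat).mean()
```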
Example 7.6 (Additional data on confounders and treatment).
Suppose that additional observations of both the confounders and the treatment indicators are available, i.e. . The conditional efficient influence function in this case is
By Theorem 4.1, the efficient influence function under the OSS is
(56) |
where again the subscripts in (56) are in keeping with Remark 4.1. If has no confounding effect and has no treatment effect, i.e. if , then and the efficiency bound (56) is the same as in the supervised setting.
Example 7.7 (Additional data on confounders and treatment, and availability of surrogates).
When measuring the primary outcome is time-consuming or expensive, we may use a surrogate marker as a replacement for , to facilitate more timely decisions on the treatment effect [Wittes et al., 1989]. Suppose that , and that additional observations of the confounders, the treatment indicators, and the surrogate markers are available: that is, . The conditional efficient influence function in this case is
By Theorem 4.1, the efficient influence function under the OSS is
(57) |
Here , and the subscripts in (57) are again in keeping with Remark 4.1. If , then and the efficiency bound (57) is the same as in the supervised setting.
8 Numerical experiments
In this section, we illustrate the proposed framework numerically in the context of mean estimation and generalized linear models. Numerical experiments for variance estimation and Kendall’s can be found in Sections F.2 and F.3 of the Appendix.
In each example, the covariates are two-dimensional and generated as i.i.d. . We compute the proposed estimators (28), (34), and (41) as follows:
• For the estimator , we use .
• For the estimator , we use a basis of tensor product natural cubic splines, with basis functions.
• For the estimator , we use where is a prediction model. We consider two prediction models: (i) a random forest model trained on independent data, which represents an informative prediction model; and (ii) randomly-generated Gaussian noise, which represents a non-informative prediction model.
Each of these estimators is constructed by modifying an efficient supervised estimator, whose performance we also consider. Additionally, we include the PPI++ estimator proposed by Angelopoulos et al. [2023b], using the same two prediction models as for .
In each simulation setting, we also estimate the semiparametric efficiency lower bounds in the ISS setting (5) and in the OSS setting (25).
For each method, we report the coverage of the 95% confidence interval, as well as the standard error; all results are averaged over 1,000 simulated datasets.
8.1 Mean estimation
We consider Example 7.1 with . In this example, the supervised estimator is the sample mean, which has influence function . The conditional influence function is
which depends on through the conditional expectation. We generate the response as , where . We consider three settings for :
1. Setting 1 (linear model):
2. Setting 2 (non-linear model):
3. Setting 3 (well-specified model, in the sense of Definition 3.1):
For each model, we set , and vary the proportion of unlabeled observations, . The semiparametric efficiency lower bounds in the ISS setting (5) and in the OSS setting (25) are estimated separately on a sample of observations.
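The following sketch (our own illustration; the coefficients, sample sizes, and unlabeled-to-labeled ratio are assumptions rather than the exact values used in our experiments) reproduces the qualitative comparison between the supervised sample mean and the safe OSS estimator in a linear setting of this kind.

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 500, 2
m = 4 * n                            # number of unlabeled observations (assumed)
beta_lin = np.array([1.0, 1.0])      # assumed linear coefficients

def one_replicate():
    X_lab = rng.normal(size=(n, d))
    Y_lab = X_lab @ beta_lin + rng.normal(size=n)
    X_unl = rng.normal(size=(m, d))
    theta_sup = Y_lab.mean()
    # Safe OSS estimator with f(x) = x (see Section 4.2.1).
    f_all_mean = np.vstack([X_lab, X_unl]).mean(axis=0)
    f_tilde = X_lab - f_all_mean
    beta_hat, *_ = np.linalg.lstsq(f_tilde, Y_lab - theta_sup, rcond=None)
    theta_safe = theta_sup - (f_tilde @ beta_hat).mean()
    return theta_sup, theta_safe

reps = np.array([one_replicate() for _ in range(1000)])
print("empirical SE (supervised, safe OSS):", reps.std(axis=0))
```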
Table 1 of Appendix F.1 reports the coverage of 95% confidence intervals for each method. All methods achieve the nominal coverage.
Figure 1 displays the standard error of each method, averaged over 1,000 simulated datasets.
We begin with a discussion of Setting 1 (linear model). The estimator achieves the efficiency lower bound in the OSS setting, since with suffices to accurately approximate the true conditional influence function. In fact, , which uses a greater number of basis functions, performs worse: those additional basis functions contribute to increased variance without improving bias. Both PPI++ and using a random forest prediction model perform well, whereas the version of those methods that uses a pure-noise prediction model performs comparably to the supervised estimator, i.e., incorporating a useless prediction model does not lead to deterioration of performance. Since there is only one parameter of interest in the context of mean estimation, i.e. , the PPI++ estimator is asymptotically equivalent to with with the same prediction model : thus, there is no difference in performance between PPI++ and . However, as we will show in Section 8.2, when there are multiple parameters, i.e. , can outperform the PPI++ estimator.
We now consider Setting 2 (non-linear model). Because the conditional influence function is non-linear, the best performance for is achieved when the number of basis functions is sufficiently large. Furthermore, this performance is substantially better than that of with . Other than this, the results are quite similar to Setting 1.
8.2 Generalized linear model
We consider Example 7.2 with a Poisson GLM. In this example, the supervised estimator is the Poisson GLM estimator, which has influence function
The conditional influence function is then
We generate the response as . However, unlike in Section 8.1, the conditional influence function is not a linear function of regardless of the form of . Therefore, we only consider two settings for :
1. Setting 1 (non-linear model):
2. Setting 2 (well-specified model, in the sense of Definition 3.1):
For each model, we set , and vary the proportion of unlabeled observations, . The semiparametric efficiency lower bounds in the ISS setting (5) and in the OSS setting (25) are estimated separately on a sample of observations.
Table 2 of Appendix F.1 reports the coverage of 95% confidence intervals for each method. All methods achieve the nominal coverage.
Figure 2 displays the standard error of the first parameter for each method, averaged over 1,000 simulated datasets. Results for the standard error of the second parameter are similar, and are displayed in Figure 3 in Appendix F.1.
We first consider Setting 1 (non-linear model). Consistent with the results for Setting 2 of Section 8.1, the estimator approximates the efficiency lower bound in the OSS setting when the number of basis functions is sufficiently large. Meanwhile, the estimator with improves upon the supervised estimator, but is not efficient as it cannot accurately approximate the true conditional influence function . Both PPI++ and perform well with a random forest prediction model. In addition, we make the following observations that differ from the results shown in Section 8.1:
1. significantly improves upon the supervised estimator even with a pure-noise prediction model. To see this, recall that estimates the conditional influence function by regressing onto where is a prediction model. (In practice, the unknown function is replaced with a consistent estimator .) Define . In the case of a Poisson GLM, we have that
and
Thus, in the case of a Poisson GLM, the projection of onto in is non-zero, even if — in the extreme case — .
By contrast, in the case of mean estimation in Section 8.1, the projection of onto in equals zero when is independent of . Thus, for mean estimation, a pure-noise prediction model is completely useless.
2. is more efficient than the PPI++ estimator of Angelopoulos et al. [2023b] when both use the noisy prediction model. This is because PPI++ uses a scalar weight to minimize the trace of the asymptotic covariance matrix, which is sub-optimal in efficiency when . On the other hand, considers a regression approach to find the best linear approximation of the conditional influence function , as in (41).
Finally, we consider Setting 2 (well-specified model). In this setting, no method improves upon the efficient supervised estimator, in keeping with Corollary 4.2.
9 Discussion
We have proposed a general framework to study statistical inference in the semi-supervised setting. We established the semiparametric efficiency lower bound for an arbitrary inferential problem under the semi-supervised setting, and showed that no improvement can be made when the model is well-specified. Furthermore, we proposed a class of easy-to-compute estimators that build upon existing supervised estimators and that can achieve the efficiency lower bound for an arbitrary inferential problem.
This paper leaves open several directions for future work. First, our results require a Donsker condition on the influence function of the supervised estimator. This may not hold, for example, when the functional is high-dimensional or infinite-dimensional. Recent advances in double/debiased machine learning [Chernozhukov et al., 2018, Foster and Syrgkanis, 2023] may provide an avenue for obtaining efficient semi-supervised estimators in the presence of high- or infinite-dimensional nuisance parameters. Second, a general theoretical framework for efficient semi-supervised estimation in the presence of covariate shift also remains a relatively open problem, despite some promising preliminary work [Ryan and Culp, 2015, Aminian et al., 2022].
Acknowledgments
We thank Abhishek Chakrabortty for pointing out that an earlier version of this paper overlooked connections with Chapter 2 of his dissertation [Chakrabortty, 2016] in the special case of M-estimation.
References
- Zhang et al. [2019] Anru Zhang, Lawrence D Brown, and T Tony Cai. Semi-supervised inference: General theory and estimation of means. The Annals of Statistics, 47(5):2538–2566, 2019.
- Cheng et al. [2021] David Cheng, Ashwin N Ananthakrishnan, and Tianxi Cai. Robust and efficient semi-supervised estimation of average treatment effects with application to electronic health records data. Biometrics, 77(2):413–423, 2021.
- Kawakita and Kanamori [2013] Masanori Kawakita and Takafumi Kanamori. Semi-supervised learning with density-ratio estimation. Machine Learning, 91:189–209, 2013.
- Buja et al. [2019a] Andreas Buja, Lawrence Brown, Richard Berk, Edward George, Emil Pitkin, Mikhail Traskin, Kai Zhang, and Linda Zhao. Models as approximations i. Statistical Science, 34(4):523–544, 2019a.
- Song et al. [2023] Shanshan Song, Yuanyuan Lin, and Yong Zhou. A general m-estimation theory in semi-supervised framework. Journal of the American Statistical Association, pages 1–11, 2023.
- Gan and Liang [2023] Feng Gan and Wanfeng Liang. Prediction de-correlated inference. arXiv preprint arXiv:2312.06478, 2023.
- Angelopoulos et al. [2023a] Anastasios N Angelopoulos, Stephen Bates, Clara Fannjiang, Michael I Jordan, and Tijana Zrnic. Prediction-powered inference. Science, 382(6671):669–674, 2023a.
- Angelopoulos et al. [2023b] Anastasios N Angelopoulos, John C Duchi, and Tijana Zrnic. Ppi++: Efficient prediction-powered inference. arXiv preprint arXiv:2311.01453, 2023b.
- Bennett and Demiriz [1998] Kristin Bennett and Ayhan Demiriz. Semi-supervised support vector machines. Advances in Neural Information Processing Systems, 11, 1998.
- Chapelle et al. [2006] Olivier Chapelle, Mingmin Chi, and Alexander Zien. A continuation method for semi-supervised svms. In Proceedings of the 23rd International Conference on Machine Learning, pages 185–192, 2006.
- Bair [2013] Eric Bair. Semi-supervised clustering methods. Wiley Interdisciplinary Reviews: Computational Statistics, 5(5):349–361, 2013.
- Van Engelen and Hoos [2020] Jesper E Van Engelen and Holger H Hoos. A survey on semi-supervised learning. Machine Learning, 109(2):373–440, 2020.
- Zhang and Bradic [2022] Yuqian Zhang and Jelena Bradic. High-dimensional semi-supervised learning: in search of optimal inference of the mean. Biometrika, 109(2):387–403, 2022.
- Chakrabortty and Cai [2018] Abhishek Chakrabortty and Tianxi Cai. Efficient and adaptive linear regression in semi-supervised settings. The Annals of Statistics, 46(4):1541–1572, 2018.
- Azriel et al. [2022] David Azriel, Lawrence D Brown, Michael Sklar, Richard Berk, Andreas Buja, and Linda Zhao. Semi-supervised linear regression. Journal of the American Statistical Association, 117(540):2238–2251, 2022.
- Deng et al. [2023] Siyi Deng, Yang Ning, Jiwei Zhao, and Heping Zhang. Optimal and safe estimation for high-dimensional semi-supervised learning. Journal of the American Statistical Association, pages 1–12, 2023.
- Wang et al. [2023] Tong Wang, Wenlu Tang, Yuanyuan Lin, and Wen Su. Semi-supervised inference for nonparametric logistic regression. Statistics in Medicine, 42(15):2573–2589, 2023.
- Quan et al. [2024] Zhuojun Quan, Yuanyuan Lin, Kani Chen, and Wen Yu. Efficient semi-supervised inference for logistic regression under case-control studies. arXiv preprint arXiv:2402.15365, 2024.
- Chakrabortty [2016] Abhishek Chakrabortty. Robust semi-parametric inference in semi-supervised settings. PhD thesis, 2016.
- Yuval and Rosset [2022] Oren Yuval and Saharon Rosset. Semi-supervised empirical risk minimization: Using unlabeled data to improve prediction. Electronic Journal of Statistics, 16(1):1434–1460, 2022.
- Kim et al. [2024] Ilmun Kim, Larry Wasserman, Sivaraman Balakrishnan, and Matey Neykov. Semi-supervised u-statistics. arXiv preprint arXiv:2402.18921, 2024.
- Chakrabortty and Dai [2022] Abhishek Chakrabortty and Guorong Dai. A general framework for treatment effect estimation in semi-supervised and high dimensional settings. arXiv preprint arXiv:2201.00468, 2022.
- Ahmed et al. [2024] Hanan Ahmed, John HJ Einmahl, and Chen Zhou. Extreme value statistics in semi-supervised models. Journal of the American Statistical Association, pages 1–14, 2024.
- Miao et al. [2023] Jiacheng Miao, Xinran Miao, Yixuan Wu, Jiwei Zhao, and Qiongshi Lu. Assumption-lean and data-adaptive post-prediction inference. arXiv preprint arXiv:2311.14220, 2023.
- Miao and Lu [2024] Jiacheng Miao and Qiongshi Lu. Task-agnostic machine learning-assisted inference. arXiv preprint arXiv:2405.20039, 2024.
- Zrnic and Candès [2024a] Tijana Zrnic and Emmanuel J Candès. Active statistical inference. arXiv preprint arXiv:2403.03208, 2024a.
- Zrnic and Candès [2024b] Tijana Zrnic and Emmanuel J Candès. Cross-prediction-powered inference. Proceedings of the National Academy of Sciences, 121(15):e2322083121, 2024b.
- Gu and Xia [2024] Yanwu Gu and Dong Xia. Local prediction-powered inference. arXiv preprint arXiv:2409.18321, 2024.
- Van der Vaart [2000] Aad W Van der Vaart. Asymptotic statistics, volume 3. Cambridge University Press, 2000.
- Tsiatis [2006] Anastasios A Tsiatis. Semiparametric theory and missing data, volume 4. Springer, 2006.
- Bickel et al. [1993] Peter J Bickel, Chris AJ Klaassen, Ya’acov Ritov, and Jon A Wellner. Efficient and Adaptive Estimation for Semiparametric Models, volume 4. Springer, 1993.
- Buja et al. [2019b] Andreas Buja, Lawrence Brown, Arun Kumar Kuchibhotla, Richard Berk, Edward George, and Linda Zhao. Models as approximations II. Statistical Science, 34(4):545–565, 2019b.
- Newey [1997] Whitney K Newey. Convergence rates and asymptotic normality for series estimators. Journal of Econometrics, 79(1):147–168, 1997.
- Bickel and Kwon [2001] Peter J Bickel and Jaimyoung Kwon. Inference for semiparametric models: some questions and an answer. Statistica Sinica, pages 863–886, 2001.
- Robins and Rotnitzky [1995] James M Robins and Andrea Rotnitzky. Semiparametric efficiency in multivariate regression models with missing data. Journal of the American Statistical Association, 90(429):122–129, 1995.
- Chen et al. [2008] Xiaohong Chen, Han Hong, and Alessandro Tarozzi. Semiparametric efficiency in GMM models with auxiliary data. The Annals of Statistics, 36(2):808–843, 2008.
- Zhou et al. [2008] Yong Zhou, Alan T K Wan, and Xiaojing Wang. Estimating equations inference with missing data. Journal of the American Statistical Association, 103(483):1187–1199, 2008.
- Li and Luedtke [2023] Sijia Li and Alex Luedtke. Efficient estimation under data fusion. Biometrika, 110(4):1041–1054, 2023.
- Graham et al. [2024] Ellen Graham, Marco Carone, and Andrea Rotnitzky. Towards a unified theory for semiparametric data fusion with individual-level data. arXiv preprint arXiv:2409.09973, 2024.
- Robins et al. [1994] James M Robins, Andrea Rotnitzky, and Lue Ping Zhao. Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89(427):846–866, 1994.
- Hahn [1998] Jinyong Hahn. On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica, pages 315–331, 1998.
- Wittes et al. [1989] Janet Wittes, Edward Lakatos, and Jeffrey Probstfield. Surrogate endpoints in clinical trials: cardiovascular diseases. Statistics in Medicine, 8(4):415–425, 1989.
- Chernozhukov et al. [2018] Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68, 2018.
- Foster and Syrgkanis [2023] Dylan J Foster and Vasilis Syrgkanis. Orthogonal statistical learning. The Annals of Statistics, 51(3):879–908, 2023.
- Ryan and Culp [2015] Kenneth Joseph Ryan and Mark Vere Culp. On semi-supervised linear regression in covariate shift problems. The Journal of Machine Learning Research, 16(1):3183–3217, 2015.
- Aminian et al. [2022] Gholamali Aminian, Mahed Abroshan, Mohammad Mahdi Khalili, Laura Toni, and Miguel Rodrigues. An information-theoretical approach to semi-supervised learning under covariate-shift. International Conference on Artificial Intelligence and Statistics, pages pp. 7433–7449, 2022.
- van der Vaart and Wellner [2013] Aad van der Vaart and Jon Wellner. Weak convergence and empirical processes: with applications to statistics. Springer Science & Business Media, 2013.
- Hansen [2022] Bruce Hansen. Econometrics. Princeton University Press, 2022.
- van der Laan [1995] Mark J van der Laan. Efficient and inefficient estimation in semiparametric models. 1995.
- Gill et al. [1995] Richard D Gill, Mark J van der Laan, and Jon A Wellner. Inefficient estimators of the bivariate survival function for three models. In Annales de l’IHP Probabilités et statistiques, volume 31, pages 545–597, 1995.
- Pfanzagl [1990] Johann Pfanzagl. Estimation in semiparametric models. Springer, 1990.
- Van Der Vaart [1991] Aad Van Der Vaart. On differentiable functionals. The Annals of Statistics, pages 178–204, 1991.
- van der Laan and Robins [2003] Mark J van der Laan and James Robins. Unified approach for causal inference and censored data. Unified Methods for Censored Longitudinal Data and Causality, pages 311–370, 2003.
Appendix A Additional notation
We introduce additional notation that is used in the appendix. For a matrix , let denote its operator norm and let denote its Frobenius norm. For a set , let denote its interior. Consider the probability space . Letting be a measurable function over , we adopt the following notation from the empirical process literature: , , and . Similarly, for a measurable function , , , and . For two subspaces and such that , let represent their direct sum in .
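For concreteness, the empirical-process shorthand referred to above is the standard one; in our own (hypothetical) notation, with \(Z_1,\dots,Z_n\) the labeled observations and \(P\) their common distribution, it reads
\[
\mathbb{P}_n f = \frac{1}{n}\sum_{i=1}^n f(Z_i), \qquad P f = \int f \, dP, \qquad \mathbb{G}_n f = \sqrt{n}\,(\mathbb{P}_n - P) f,
\]
with the analogous definitions for the unlabeled sample. The symbols \(\mathbb{P}_n\), \(P\), and \(\mathbb{G}_n\) here are ours and may differ from the authors’ typesetting.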
Appendix B Proof of main results
Proof of Theorem 3.1.
In the first step, we characterize the tangent space of the reduced model , which is defined in (2). Note that the reduced model satisfies the conditions of Lemma E.11, i.e., the marginal distribution and the conditional distribution are separately modeled. Therefore, by Lemma E.11, the tangent space of the reduced model can be expressed as , where is the conditional model. Since is a singleton set, , and hence the tangent space of becomes .
Recall that is the conditional efficient influence function . In the second step, we show that for all ,
(58) |
We prove this by contradiction. Suppose there exists such that . Consider the decomposition
As , , and , therefore by the property of direct sum, and is the unique decomposition of such that , and . Further, by definition of the efficient influence function, . Recall the model defined in (1). By Lemma E.11, the tangent space relative to is
Similarly, by the property of direct sum, there exists a unique decomposition such that and . However, , and hence . By definition, , and . Therefore, is another decomposition of such that , and . This contradicts the uniqueness of direct sum. Therefore, we have established (58) for all .
In the third step, we show that is a gradient relative to the model . Consider any one-dimensional regular parametric sub-model of . Since the marginal model is a singleton set , this maps one-to-one to a one-dimensional regular parametric sub-model of such that . Suppose that such that is the conditional density of for some . Denote as the score function relative to at , which then satisfies . Note that the score function relative to at remains , as the marginal distribution is known. Consider the function as a function of , . By definition, the efficient influence function is a gradient relative to , hence
where the last equality used the fact that and
As the above holds for an arbitrary one-dimensional regular parametric sub-model of , is a gradient relative to at .
Finally, combining step 1 to step 3 above, is a gradient relative to at and satisfies for all . Therefore, by definition, is the efficient influence function of at relative to . By Lemma E.1, the variance, , can be represented as
which proves the final claim of the theorem. ∎
Proof of Theorem 3.2.
To prove the first part of the theorem, consider any element . Without loss of generality, suppose is a regular parametric sub-model of with score function at . (Otherwise, because the tangent space is a closed linear space by definition, we can always find a sequence of functions that are score functions of regular parametric sub-models of , and the following arguments hold by the continuity of the inner product.) Then, the model is a regular parametric sub-model of with score function at . Since is parameterized by , we can write as a function of , . Because is well-specified at relative to , is a constant function, , and hence . By pathwise-differentiability,
for any gradient of at relative to . Since this holds true for any , we see that any gradient satisfies
Consider the efficient influence function of relative to at , which is a gradient of relative to at by definition. Further, by the definition of the efficient influence function, for all it holds that
As , we see that , and hence
for all . Therefore
-almost surely.
To prove the second part of the theorem, recall that is a gradient by definition. Therefore, by Lemma E.10, the set of gradients relative to can be expressed as
(59) |
where represents the orthogonal complement of a subspace. If , then
In proving the first part of the theorem, we have shown that under well-specification. Therefore, for any gradient relative to , by (59), we have
for all , which implies that . For a regular and asymptotically linear estimator with influence function , Lemma E.9 implies that is a gradient of at relative to . Therefore , which implies that -almost surely. ∎
Proof of Corollary 3.3.
Suppose the efficient influence function for at relative to is . Since is well-specified, Theorem 3.2 implies that the conditional efficient influence function . Therefore, by Theorem 3.1, the efficient influence function for at relative to as (1) remains . We have proved that the efficient influence function relative to the reduced model is in the proof of Theorem 3.1. ∎
Proof of Proposition 3.4.
We first show that (i) and (ii) of Proposition 3.4 imply Assumption 3.1 (b). Since is a bounded Euclidean subset and is -Lipschitz in over , by Example 19.6 of Van der Vaart [2000], the class is -Donsker. To see that Assumption 3.1 (c) holds, note that by consistency of , for any open set such that .
∎
Proof of Theorem 3.5.
Denote . First, we will show that
(60) |
where is defined as in (11), and is defined as in (13). To this end, we write
By Assumption 3.1 and Lemma E.5, we have,
(61) |
Next,
(62) | ||||
By Assumption 3.1, when , we have , and it follows that
Since and by Assumption 3.1, and , we have that
Further, by Assumption 3.1, , and hence
(63) |
Proof of Theorem 3.6.
It suffices to prove the case of , i.e., we only have one parameter. For , the analysis follows by separately analyzing each of the components. This does not affect the convergence rate since we treat as fixed (rather than increasing with the sample size ).
When , by Assumption 3.2, , where is the Hölder class (15) with parameter and . By Theorem 2.7.1 of van der Vaart and Wellner [2013], is -Donsker when . Let be a basis of that satisfies (16) for all . Since , clearly we also have as this is only a linear transformation of . If we can show that
then by Lemma E.4 and the fact that ,
Then
and the results of Theorem 3.6 follow.
It remains to show that
Denoting and , we have:
where is defined as (18). We first look at I. By the fact that
(64) |
we have:
By (64), without loss of generality, we can suppose that has identity covariance, i.e., . For I, when , it follows that
Since by Assumption 3.1, we have
Now, we consider the term . By Theorem 12.16.1 of Hansen [2022],
Therefore,
where we used continuous mapping for the function. Since is square integrable, we then have
which implies that
For II, by the bias-variance decomposition,
where we used the fact that . Therefore,
(65) | ||||
by (16). This shows
Combining the results of I and II, we have shown that , which finishes the proof. ∎
Proof of Theorem 4.1.
By Lemma E.14, pathwise differentiability at implies pathwise differentiability at . Since the efficient influence function at relative to is , is a gradient at relative to , and by Lemma E.14, the function
is a gradient at relative to .
First, we consider the case of . We derive the -projection of onto , where the Hilbert space and its norm are defined in Section E.3. The form of is characterized by Lemma E.13. Recall that is the conditional expectation of on . The projection problem can be expressed as
(66) | ||||
where we used the definition of and the fact that . Since is the efficient influence function at and , we have , which we proved in the proof of Theorem 3.1. Therefore, for the first term in (66), the minimizer is . Similarly, we have . Then, it can be straightforwardly verified that minimizes the second term in (66). Therefore, the -projection of onto is
We now consider the general case of and validate that is indeed the efficient influence function by definition. Clearly, since is a linear space, for all . Denote . For any element , it can be straightforwardly validated by the definition of that
Consider any one-dimensional regular parametric sub-model of such that and its score function at is . Then, . Write as a function of . By pathwise differentiability and the fact that is a gradient at relative to ,
which shows that is a gradient. Since, in addition, for all , by definition is the efficient influence function at relative to . ∎
Proof of Theorem 4.3.
Recall that
Then, for the estimator defined in (28),
(67) | ||||
as . By CLT, we have , and . Further, by the fact that , for any function , it holds that .
First, we show that I = . By Assumption 3.1 and Lemma E.5, , and since , , we have:
Since , it follows that
Proof of Theorem 4.4.
As , we have
Similar to the proof of Theorem 3.6, we only prove the case of . The general case proceeds by analyzing each coordinate individually. By Assumption 3.2, . As is a basis of , we have that . Therefore, if we can show that
(68) |
then by Lemma E.4, . By definition, . Therefore, using the fact that ,
It then follows that is a regular and asymptotically linear estimator with influence function
By Lemma E.1, the asymptotic variance of can be represented as
which would prove Theorem 4.4.
To complete the argument, we next prove (68). Denoting , and , we have:
From (65) in the proof of Theorem 3.6, we already have
Therefore, it remains to prove that . Notice that
I | |||
Moreover,
Therefore, without loss of generality, we can normalize with and let . Then, we can write
Denoting as the approximation error, and , can be decomposed as:
(69) | ||||
Therefore, denoting , it follows that
We first consider :
where, by Lemma E.8, we have that
Further, by Theorem 12.16.1 of Hansen [2022],
Therefore it follows that
which implies . Now, for the term III,
where we used Lemma E.8 to show that and are . For the term IV, first note that by the property of projection and (16), we have
Therefore,
where we again used Lemma E.8. Further,
Therefore, by Markov’s inequality and thus so is IV. Next, consider the term V. Recall that , and hence
for any measurable function of . Thus,
Moreover,
Therefore, by Markov’s inequality and so is V. Finally, for VI, by the fact that and continuous mapping, we have:
Then,
Putting these results together, we see that , which then finishes the proof. ∎
Proof of Proposition 5.2.
By Assumption 5.1, the restricted class of functions is -Donsker, and it satisfies by Lemma E.5. Therefore, by Lemma E.7, the centered class satisfies the conditions of Lemma E.5, and it then follows that , and
(71) |
As a result,
where we used the fact that by centering and by centering and CLT. Following a similar argument, . Further, by (71),
and similarly
Therefore, (70) can be further modified as:
(72) |
For convenience, let
We now show that . First,
For II, note that by Lemma E.5 we have . Thus,
(73) | ||||
Next, consider the term . By Lemma E.7, the centered class satisfies the conditions of Lemma E.6. Therefore applying Lemma E.6, the two terms and are both . Then,
Combining these two results, we have:
(74) | ||||
Finally, since , by continuous mapping,
We showed that III in the proof of Theorem 4.3.
Proof of Proposition 6.1.
First, we show that the tangent space relative to at , denoted as , is
where is the tangent space at relative to and is the tangent space at relative to .
We first show that . Consider any one-dimensional regular parametric sub-model of , which can be represented as
where is a one-dimensional regular parametric sub-model of such that
where corresponds to the density of for some . Denote the score function of at as , where is the score function of at , and is the score function of at . The density of is thus
and its score function at is . Because is a one-dimensional regular parametric sub-model of , which has the form (1), it must be true that is a one-dimensional regular parametric sub-model of , and is a one-dimensional regular parametric sub-model of . Therefore, the corresponding score function satisfies that and , and hence . We have by the fact that is a closed linear space.
Now we show the other direction, i.e., . Consider an arbitrary element of where and . Suppose is the score function of some parametric sub-model of at and is the score function of some parametric sub-model of at . Then is the score function at of
where , which proves and hence .
Pathwise differentiability of at relative to implies, for any one-dimensional parametric sub-model such that corresponds to the density of ,
where is the score function of at , and . This maps one-to-one to a parametric sub-model of with score function . We thus have:
This implies that the function
is a gradient of at relative to . As shown in the proof of Theorem 3.1, and for . Therefore, we see that
for all . By definition, is the efficient influence function of at relative to . Note that the above derivations correspond to the sample size . Therefore, we need to multiply by an additional factor of to obtain the efficient influence function corresponding to the sample size . By independence of and , the semiparametric efficiency lower bound can be expressed as:
where in the last step we use Lemma E.1 to show that . ∎
Appendix C Connection to prediction-powered inference
In this section, we analyze existing PPI estimators through the lens of the proposed framework. We will show that (i) many existing PPI estimators can be analyzed in a unified manner, including our proposed safe PPI estimator (41); (ii) the proposed safe PPI estimator is optimally efficient among the PPI estimators; (iii) none of the PPI estimators achieves the efficiency bound of Theorem 4.1, without strong assumptions on the machine learning prediction model.
We consider the setting of M-estimation as described in Section 7.1 of the main text. (Our results can be extended to Z-estimation.) Let denote a machine learning prediction model trained on independent data. Let denote , and let . Further, denote as the supervised M-estimator (47), and suppose that suitable conditions hold such that is regular and asymptotically linear with influence function in (48). Finally, suppose that the conditions of Proposition 5.2 hold.
First, we present a unified approach to analyze existing PPI estimators, as well as the proposed safe PPI estimator (41). Consider semi-supervised estimators of that are regular and asymptotically linear in the sense of Definitions 4.1 and E.2, with influence function
(76) |
where . The above class includes a number of existing PPI estimators, such as the proposals of Angelopoulos et al. [2023a, b], Miao et al. [2023], and Gan and Liang [2023], as well as the proposed safe PPI estimator (41), with different estimators corresponding to different choices of . Specifically, the original PPI estimator [Angelopoulos et al., 2023a] considers . Angelopoulos et al. [2023b] improve the original PPI estimator by introducing a tuning weight and considering . They show that the value of that minimizes the trace of the asymptotic variance of their estimator is
Miao et al. [2023] instead let , which is a diagonal matrix with tuning weights . They set the weights to minimize the element-wise asymptotic variance:
for each , where is the -th element of . Finally, Gan and Liang [2023] consider
for a tuning weight and show that the optimal weight is .
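To make the role of the tuning weight concrete, the following sketch implements the mean-estimation special case of this family of PPI estimators: the labeled average of \(Y\) plus a weighted correction built from predictions on the labeled and unlabeled covariates. The function names, the simulated data, and the prediction model are ours; the weight formula is the usual variance-minimizing choice for this special case and is only meant to illustrate the tuning weights discussed above.

```python
import numpy as np

def ppi_mean(y_lab, fhat_lab, fhat_unlab, lam=None):
    """PPI-type mean estimate with a scalar tuning weight `lam`.

    A minimal sketch for the mean-estimation special case: the estimator is
    mean(y_lab) + lam * (mean(fhat_unlab) - mean(fhat_lab)).
    lam = 1 recovers the original PPI estimator; lam = None plugs in an
    empirical estimate of the variance-minimizing weight.
    """
    n, N = len(y_lab), len(fhat_unlab)
    if lam is None:
        # Empirical analogue of the optimal weight: Cov(Y, f(X)) / Var(f(X)),
        # shrunk by the unlabeled fraction N / (n + N).
        cov = np.cov(y_lab, fhat_lab, ddof=1)
        lam = (N / (n + N)) * cov[0, 1] / cov[1, 1]
    return np.mean(y_lab) + lam * (np.mean(fhat_unlab) - np.mean(fhat_lab))

rng = np.random.default_rng(0)
n, N = 300, 3000
x_lab, x_unlab = rng.normal(size=n), rng.normal(size=N)
y_lab = x_lab + rng.normal(scale=0.5, size=n)
fhat = lambda x: 0.9 * x          # hypothetical black-box predictions
print(ppi_mean(y_lab, fhat(x_lab), fhat(x_unlab), lam=1.0))   # original PPI
print(ppi_mean(y_lab, fhat(x_lab), fhat(x_unlab)))            # tuned weight
```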
Now, consider the proposed safe PPI estimator (41) with and with , where . Then, (43) can be written as
By Proposition 5.2, (41) is regular and asymptotically linear with influence function
(77) | ||||
where
Therefore, (43) has influence function in the form of (76) with . In fact, this is the same influence function as Gan and Liang [2023] with the optimal weight .
Next, we show that (77) has the smallest asymptotic variance among all PPI estimators with influence function of the form (76). That is, the proposed safe estimator (41) is at least as efficient as existing PPI estimators. Recall that under the OSS setting, the asymptotic variance of an estimator with influence function is .
Proposition C.1.
For all , the following holds:
Proof.
By the linearity of inner product,
for arbitrary and . Therefore, it suffices to prove that the cross term satisfies
for all , which we now prove.
By definition, we have
(78) | ||||
Recall that . Therefore, we have
Plugging this into (78), it then follows that
which then finishes the proof. ∎
Proposition C.1 shows that (41) provides optimal efficiency among estimators with influence function of the form (76). We note that, for fair comparison, we consider the case where there is only one machine learning model for all estimators. However, the proposed safe PPI estimator can incorporate multiple machine learning models in a principled way, whereas it is unclear how to do this for most existing PPI estimators. Moreover, our estimator can deal with general inferential problems beyond M-estimation, such as U-statistics.
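As an illustration of how several prediction models might enter, the sketch below (our own generic construction, not necessarily the exact form of the safe PPI estimator (41)) treats the predictions as multiple control variates in the mean-estimation special case; the correction weights are the empirical analogue of \((N/(n+N))\,\mathrm{Var}(f(X))^{-1}\mathrm{Cov}(f(X),Y)\).

```python
import numpy as np

def multi_model_mean(y_lab, F_lab, F_unlab):
    """Mean estimation with several prediction models used as control variates.

    F_lab (n x K) and F_unlab (N x K) hold the K models' predictions on the
    labeled and unlabeled covariates. The weight vector is the empirical
    analogue of (N / (n + N)) * Var(f(X))^{-1} Cov(f(X), Y). This is our own
    generic sketch, not necessarily the paper's estimator (41).
    """
    n, N = len(y_lab), len(F_unlab)
    S_ff = np.atleast_2d(np.cov(F_lab, rowvar=False))
    S_fy = np.cov(np.column_stack([F_lab, y_lab]), rowvar=False)[:-1, -1]
    w = (N / (n + N)) * np.linalg.solve(S_ff, S_fy)
    return y_lab.mean() + (F_unlab.mean(axis=0) - F_lab.mean(axis=0)) @ w
```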
Our next result shows that estimators with influence function of the form (76) cannot achieve the semiparametric efficiency bound in the OSS setting unless strong assumptions are imposed on the machine learning prediction model . To discuss efficiency in M-estimation (or Z-estimation), we informally consider a nonparametric model in the form of (1) such that the supervised M-estimator (47) is efficient (in the supervised setting). Proposition C.2 provides a necessary condition for estimators with influence function (76) to be efficient in the OSS setting.
Proposition C.2.
Proof.
To better interpret condition (79), we consider a simple case where our target of inference is the expectation of : , and the loss function is . Therefore, it follows that , , , and . We see that, in this case, condition (79) is equivalent to
for some and . In other words, a necessary condition for these estimators to be efficient is that the machine learning prediction model is a linear transformation of the true regression function. Of course, this is unlikely to hold for black-box machine learning predictions in practice.
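Written out symbolically, with hypothetical constants \(a\) and \(b\) standing in for the unspecified scalars above, the necessary condition reads
\[
\hat f(x) = a\,\mathbb{E}[Y \mid X = x] + b \quad \text{for almost every } x ,
\]
where \(\hat f\) denotes the machine learning prediction model.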
We end this section by noting that although the safe PPI estimator (41) may not be efficient, the proposed efficient estimator (34) can always achieve the semiparametric efficiency bound under regularity conditions. Further, as we show in the numerical experiments of Section 8, the safe PPI estimator performs well when the prediction model is accurate.
Appendix D Proof of claims in Section 7
D.1 Proof of claim in Example 7.2
Recall the GLM estimator
(80) |
which is regular and asymptotically linear with influence function
under regularity conditions, where .
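As a concrete (and purely illustrative) instance, the sketch below computes plug-in influence-function values for a logistic GLM fit by maximum likelihood, in which case the influence function takes the familiar form \(A^{-1}(y - g(x^\top\beta))x\) with \(A = \mathbb{E}[g'(X^\top\beta)XX^\top]\) and \(g\) the inverse (canonical) link; the function names and this particular link choice are our assumptions and need not match the display in (80).

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logistic_influence(X, y, beta_hat):
    """Plug-in influence-function estimates for a logistic GLM (canonical link).

    Sketch under our assumptions: phi_i = A_hat^{-1} (y_i - mu_i) x_i, where
    A_hat = (1/n) sum_i mu_i (1 - mu_i) x_i x_i^T is the empirical information.
    """
    mu = sigmoid(X @ beta_hat)                      # fitted means
    W = mu * (1.0 - mu)                             # variance weights g'(x'beta)
    A_hat = (X * W[:, None]).T @ X / len(y)         # empirical information matrix
    scores = (y - mu)[:, None] * X                  # score contributions
    return scores @ np.linalg.inv(A_hat).T          # n x p matrix of phi_i's

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
beta_true = np.array([-0.5, 1.0])
y = rng.binomial(1, sigmoid(X @ beta_true))
phi = logistic_influence(X, y, beta_true)   # in practice, use the fitted beta_hat
print(phi.mean(axis=0))                     # approximately zero at the truth
```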
Proposition D.1.
Suppose that both and are compact subsets. Then, the GLM estimator satisfies the conditions of Proposition 3.4 with estimator
Proof.
We first prove that satisfies the local Lipschitz condition of Proposition 3.4. Denote and . Consider the metric . Assuming that both and are compact, which is a mild assumption, we have:
Since , we can find a sufficiently small neighborhood containing such that . Since is differentiable,
where as is continuous, is compact, and is bounded. Additionally, we have
Similarly, we can always find a sufficiently small neighborhood containing such that
Putting these results together, we obtain
where
Next, we show that a consistent estimator of is . Since is continuous, is a continuous (matrix-valued) function of for every . Then, is a Glivenko-Cantelli class (which means that it is Glivenko-Cantelli component-wise) by Example 19.9 of Van der Vaart [2000] and compactness of . Therefore,
Similarly,
where as is continuous and both and are compact. Therefore, it follows that
as is consistent. Putting these results together, we have
and by continuous mapping,
∎
D.2 Proof of claim in Example 7.4
Here, we validate the remaining parts of Assumption 3.1 for Example 7.4 under additional assumptions. Following the notation of Example 7.4, recall that the supervised estimator is Kendall’s :
(81) |
which is regular and asymptotically linear with influence function
where . For simplicity, denote . By imposing additional assumptions on , the next proposition guarantees that (81) satisfies Assumption 3.1.
Proposition D.2.
Proof.
We first validate part (b) of Assumption 3.1. Clearly, , and we have shown that the influence function of U-statistics of kernel is -Lipschitz in Section 7.2. Therefore, it remains to prove that is -Donsker. Rewrite as
which is the pairwise sum of two sets: where and . By Example 2.10.9 of van der Vaart and Wellner [2013], if both and are -Donsker, then their pairwise sum is also -Donsker. For , by the triangle inequality of the norm , is a subset of functions with uniform sectional variation norm bounded by . By Example 1.2 of van der Laan [1995], this implies that is -Donsker. For , it is a set indexed by a finite-dimensional parameter where is bounded and is -Lipschitz in . Hence, by Example 19.6 of Van der Vaart [2000], is -Donsker.
We now validate part (c) of Assumption 3.1 with . Since any indicator function on has uniform sectional variation norm bounded by 1, we have
Further, by asymptotic linearity of the U-statistics , we have . Therefore,
by the definition of . Further, since the class of indicator functions is -Glivenko-Cantelli, it follows that
Combined with the consistency of , we have , which completes the proof. ∎
D.3 Proof of claim in Remark 7.1
We prove a general result.
Proposition D.3.
Proof.
By the linearity of inner product,
for arbitrary and . Therefore, it suffices to prove that the cross term satisfies
By the definition of ,
where in the second-to-last equality we used the fact that
by taking the conditional expectation on and noting that . ∎
Appendix E Additional results
E.1 Technical lemmas
Lemma E.1 (Decomposition of conditional variances).
For random variables and , let be any functions such that and . Then,
Proof.
By the conditional variance formula,
and
It follows that
∎
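For reference, the conditional variance formula invoked in the proof is the law of total variance: for any square-integrable random variable \(W\) and conditioning variable \(X\),
\[
\operatorname{Var}(W) = \mathbb{E}\bigl[\operatorname{Var}(W \mid X)\bigr] + \operatorname{Var}\bigl(\mathbb{E}[W \mid X]\bigr).
\]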
Lemma E.2 (Multivariate Pythagorean theorem).
For any function , let
denote its linear span in . For any ,
where denotes the projection operator that projects onto a linear space .
Proof.
See Theorem 3.3 of Tsiatis [2006]. ∎
Lemma E.3.
For a random variable , let be any functions such that and . Then,
Proof.
Consider any non-random vector such that . By the definition of operator norm,
by the Cauchy-Schwarz inequality. ∎
Lemma E.4 (Lemma 19.24 of Van der Vaart [2000].).
Let denote a -Donsker class of measurable functions over , and let denote a sequence of random functions that takes value in and satisfies
for some element . Then,
Lemma E.5.
Let be a class of measurable functions on indexed by , where is a metric space with metric . Let such that . Suppose that there exists a set such that , the class of functions is -Donsker, and for all there exists a square-integrable function such that almost surely. In addition, suppose there exists an estimator of such that and . Then, the following hold:
- ,
- ,
- ,
- ; and if additionally , then
Proof.
Since the class is -Donsker, it is also -Glivenko-Cantelli, hence
By assumption for all , therefore when , we have
Further, by assumption , it then follows that . Therefore, by Lemma E.4, . Then,
Therefore, . Further, if , . ∎
Lemma E.6.
Proof.
When ,
Therefore the statement follows from the fact . ∎
Lemma E.7.
Proof.
Since for all , -Donsker follows immediately. ∎
Lemma E.8.
Let be i.i.d. random vectors with mean zero and variance , where satisfies and . Let be any fixed vector such that . Then
-
•
-
•
Proof.
We have:
Therefore, by Markov’s inequality, for any ,
from which we see that letting leads to the result. The second equation can be proved in a similar manner. ∎
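The Markov step used in the proof of Lemma E.8 is the standard bound: for a nonnegative random variable \(W\) and any \(t > 0\),
\[
\Pr(W \ge t) \le \frac{\mathbb{E}[W]}{t},
\]
so a second-moment bound of order \(o(a_n^2)\) for a term immediately yields that the term is \(o_P(a_n)\).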
E.2 Additional definitions and results in semiparametric efficiency theory
Following the notation in Section 2, we provide additional definitions and results from classical semiparametric efficiency theory. Recall that for a model that contains , the tangent set at relative to is the set of score functions at from all one-dimensional regular parametric sub-models of , and the corresponding tangent space is the closed linear span of the tangent set. A Euclidean functional is pathwise differentiable at relative to if, for an arbitrary smooth one-dimensional parametric sub-model of such that for some , it holds that , where is a gradient of at relative to , and is the score function of at . The next lemma relates the set of gradients to the set of influence functions of regular asymptotically linear estimators.
Lemma E.9.
Let be an asymptotically linear estimator of with influence function . Then, is pathwise differentiable at relative to with a gradient if and only if is a regular estimator of over .
For a proof, see [Pfanzagl, 1990, Van Der Vaart, 1991]. Lemma E.9 implies the equivalence between the set of gradients and the set of influence functions of regular asymptotically linear estimators. If a gradient satisfies for all , then it is the canonical gradient or efficient influence function of at relative to . The next lemma characterizes the set of gradients for a pathwise differentiable parameter.
Lemma E.10 (Characterization of the set of gradients).
Let be a functional that is pathwise differentiable at relative to . Then the set of gradients of relative to can be represented as , where is any gradient of .
Proof.
See Theorem 3.4 of Tsiatis [2006]. ∎
Recall the model (1):
which models the marginal distribution and the conditional distribution separately. We next characterize the tangent space relative to (1).
Lemma E.11.
Let be defined as in (1). Then, the tangent space can be written as
where is the tangent space at relative to , and is the tangent space at relative to .
Proof.
See Lemma 1.6 of van der Laan and Robins [2003]. ∎
E.3 Semiparametric efficiency theory under the OSS setting
In this section, we extend the classical semiparametric efficiency theory to the OSS setting, building on ideas from Bickel and Kwon [2001]. We first review some notation. Let be a random variable that takes value in the probability space , where is a sigma-algebra on . Let be the corresponding probability space of over . Consider the product probability space
where represents the product sigma-algebra of and , and represents the product measure of and . Note that the probability measure here depends only on as is the corresponding marginal distribution of over . Suppose belong to a model . As usual, we assume that there is some common dominating measure for such that can be equivalently represented by its density functions with respect to . Our goal here is to estimate a Euclidean functional from the data , where and . The full data is thus generated by
Let and . The empirical distribution of is , where is the empirical distribution of and is the empirical distribution of .
We first establish the local asymptotic normality property of .
Lemma E.12.
For any regular parametric sub-model such that for some , let be the marginal distribution of over . Denote . Then, for any ,
(83) | ||||
where is the score function of at , is the score function of at , and . Moreover, and are mutually contiguous.
Here and in the following text, a regular parametric model is interpreted in the sense of quadratic-mean differentiability of the density function at . See Chapter 7 of Van der Vaart [2000] for details.
Proof.
By Theorem 7.2 of Van der Vaart [2000], we have
and
Then, it follows that
The convergence in distribution then follows from CLT and independence, and contiguity follows from Le Cam’s first lemma. ∎
Define as
(84) |
From Lemma E.12, can be viewed as the score function of the parametric model at with information matrix . It is clear that is measurable and is an element of the space
(85) |
Consider any vector-valued functions and . Therefore, and . Define
(86) | ||||
It is straightforward to verify that this is indeed an inner product, and is a Hilbert space. The information matrix can then be represented as
Consider an arbitrary model of . We now extend the notion of a tangent space to the OSS setting. Similar to the i.i.d. setting, the tangent set relative to at is defined as the set of all score functions at of one-dimensional regular parametric sub-models of . The tangent space at relative to , denoted as , is then the closed linear span of the tangent set in . Note that . Next, we characterize the tangent space at relative to model (1).
Lemma E.13.
The tangent space relative to , denoted as , can be expressed as
Proof.
It is straightforward to see that is a closed linear space since both and are closed linear spaces by the definition of tangent space.
First, we show that . Consider any one-dimensional regular parametric sub-model of such that is the density of for some . Let , and denote the score function of , , and at , respectively. Then we have . Since , and by the definition of (1), it holds that is a one-dimensional regular parametric sub-model of , and is a one-dimensional regular parametric sub-model of . Therefore, and . By Lemma E.12, the score function of at can be written as
By Lemma E.11, . Therefore, . Then, follows from the fact that is a closed linear space.
For the other direction, consider arbitrary elements and . Then, without loss of generality, there must exist a one-dimensional regular parametric sub-model of such that is the density of , and is the score function of at . Similarly, there must exist a one-dimensional regular parametric sub-model of such that is the density of , and is the score function of at . Consider the model , which is a one-dimensional regular parametric sub-model of by its definition, with score function at . By Lemma E.11, . Therefore, by Lemma E.12, we see that the score at relative to this parametric sub-model is , and hence . ∎
Next, we consider the extension of pathwise differentiability to the OSS setting.
Definition E.1 (Pathwise differentiability).
A Euclidean functional is pathwise differentiable relative to at if, for any one-dimensional regular parametric sub-model where such that for some , as a function of satisfies
for some , where is the score function of at . Then, is a gradient of relative to at .
If a gradient satisfies for all , then it is a canonical gradient or the efficient influence function of relative to at . The next lemma establishes the connection between pathwise differentiability under the OSS setting and pathwise differentiability in the usual sense.
Lemma E.14.
If is pathwise differentiable relative to at , then it is pathwise differentiable relative to at . Moreover, if is a gradient of relative to at , then
is a gradient of relative to at , where
is the conditional gradient.
Proof.
Let be a gradient of relative to at . For simplicity, denote , and . For any one-dimensional regular parametric sub-model of , pathwise differentiability at implies
where and is the score function of at . ∎
We now extend the definitions of regular and asymptotically linear estimators to the OSS setting as in Section 4.1. The definition of regularity remains the same as Definition 4.1 in the main text. We present an equivalent definition of asymptotic linearity here for convenience.
Definition E.2 (Asymptotic linearity).
An estimator of is asymptotically linear with influence function if
(87) |
It is easy to verify, by expanding and using the fact that , that Definition E.2 is equivalent to Definition 4.2. Clearly, if is an asymptotically linear estimator of with influence function , its asymptotic variance is .
Similar to Lemma E.9, we first establish the equivalence between the set of gradients and the set of influence functions under the OSS setting.
Lemma E.15.
Proof.
Following the notation of Lemma E.12, consider a one-dimensional regular parametric sub-model . By asymptotic linearity and Lemma E.12, we have
By contiguity and Le Cam’s third lemma, it follows that
under . By regularity of , it follows that
under . This implies that
Pathwise differentiability then follows by the definition of a derivative. ∎
Finally, we extend Lemma 2.1 to the OSS setting.
Lemma E.16.
Hence, the asymptotic variance of any regular and asymptotically linear estimator of is lower bounded by .
Proof.
By Lemma E.15, is a gradient of relative to . For any , without loss of generality, let be any one-dimensional regular parametric sub-model of with score function at . (Otherwise, is the limit of a sequence of score functions of parametric sub-models, and the following holds by the continuity of the inner product.) Write as a function of . By pathwise differentiability,
which shows that . Since this holds for arbitrary , we have in . By definition, for all , where is the vector whose -th coordinate is and the rest are zeros. Therefore, . It follows that
∎
Appendix F Additional numerical results
F.1 Extended results for mean estimation and generalized linear model
Tables 1, 2, and 3 report the coverage of 95% confidence intervals for the mean estimation and generalized linear model methods presented in Sections 8.1 and 8.2. All methods achieve the nominal coverage.
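For reference, coverage here means the fraction of simulation replicates whose confidence interval contains the true parameter; a minimal sketch of this computation, assuming Wald-type intervals of the form \(\hat\theta \pm z_{0.975}\,\widehat{\mathrm{se}}\), is given below (the function and its arguments are our own notation).

```python
import numpy as np

def coverage(theta_hats, ses, theta_true, level_z=1.959963984540054):
    """Fraction of replicates whose Wald interval theta_hat +/- z * se
    covers the true parameter value."""
    lo = theta_hats - level_z * ses
    hi = theta_hats + level_z * ses
    return np.mean((lo <= theta_true) & (theta_true <= hi))
```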
Setting | Method | |||||
---|---|---|---|---|---|---|
0.967 | 0.944 | 0.946 | 0.938 | 0.954 | ||
0.949 | 0.933 | 0.947 | 0.942 | 0.951 | ||
0.944 | 0.944 | 0.944 | 0.953 | 0.942 | ||
0.960 | 0.957 | 0.948 | 0.949 | 0.945 | ||
1 | 0.950 | 0.939 | 0.941 | 0.951 | 0.948 | |
(Random Forest) | 0.948 | 0.953 | 0.936 | 0.956 | 0.947 | |
(Noisy Predictor) | 0.946 | 0.952 | 0.950 | 0.951 | 0.948 | |
PPI++ (Random Forest) | 0.962 | 0.951 | 0.954 | 0.940 | 0.938 | |
PPI++ (Noisy Predictor) | 0.950 | 0.949 | 0.946 | 0.942 | 0.946 | |
0.954 | 0.966 | 0.941 | 0.954 | 0.960 | ||
0.948 | 0.949 | 0.964 | 0.942 | 0.948 | ||
0.952 | 0.950 | 0.961 | 0.941 | 0.947 | ||
0.946 | 0.955 | 0.955 | 0.938 | 0.942 | ||
2 | 0.944 | 0.945 | 0.957 | 0.941 | 0.937 | |
(Random Forest) | 0.956 | 0.968 | 0.953 | 0.945 | 0.944 | |
(Noisy Predictor) | 0.939 | 0.961 | 0.951 | 0.956 | 0.948 | |
PPI++ (Random Forest) | 0.953 | 0.950 | 0.971 | 0.948 | 0.940 | |
PPI++ (Noisy Predictor) | 0.947 | 0.953 | 0.956 | 0.944 | 0.938 | |
0.946 | 0.956 | 0.950 | 0.954 | 0.941 | ||
0.948 | 0.952 | 0.958 | 0.932 | 0.949 | ||
0.948 | 0.958 | 0.963 | 0.943 | 0.950 | ||
0.936 | 0.967 | 0.941 | 0.939 | 0.946 | ||
3 | 0.951 | 0.954 | 0.961 | 0.942 | 0.944 | |
(Random Forest) | 0.957 | 0.963 | 0.962 | 0.946 | 0.941 | |
(Noisy Predictor) | 0.957 | 0.952 | 0.956 | 0.937 | 0.949 | |
PPI++ (Random Forest) | 0.939 | 0.959 | 0.960 | 0.939 | 0.946 | |
PPI++ (Noisy Predictor) | 0.940 | 0.948 | 0.948 | 0.960 | 0.934 |
Setting | Method | |||||
---|---|---|---|---|---|---|
0.933 | 0.950 | 0.944 | 0.950 | 0.938 | ||
0.950 | 0.950 | 0.949 | 0.943 | 0.940 | ||
0.950 | 0.945 | 0.945 | 0.938 | 0.934 | ||
0.949 | 0.940 | 0.944 | 0.935 | 0.935 | ||
1 | 0.940 | 0.936 | 0.946 | 0.931 | 0.931 | |
(Random Forest) | 0.948 | 0.942 | 0.938 | 0.939 | 0.939 | |
(Noisy Predictor) | 0.942 | 0.941 | 0.949 | 0.934 | 0.939 | |
PPI++ (Random Forest) | 0.952 | 0.952 | 0.951 | 0.947 | 0.941 | |
PPI++ (Noisy Predictor) | 0.952 | 0.943 | 0.945 | 0.946 | 0.934 | |
0.948 | 0.943 | 0.949 | 0.948 | 0.940 | ||
0.944 | 0.951 | 0.944 | 0.953 | 0.945 | ||
0.947 | 0.940 | 0.936 | 0.942 | 0.950 | ||
0.947 | 0.945 | 0.930 | 0.942 | 0.944 | ||
2 | 0.952 | 0.932 | 0.940 | 0.946 | 0.944 | |
(Random Forest) | 0.952 | 0.943 | 0.955 | 0.947 | 0.956 | |
(Noisy Predictor) | 0.954 | 0.952 | 0.923 | 0.937 | 0.953 | |
PPI++ (Random Forest) | 0.941 | 0.949 | 0.943 | 0.955 | 0.945 | |
PPI++ (Noisy Predictor) | 0.942 | 0.951 | 0.944 | 0.953 | 0.947 |
Setting | Method | |||||
---|---|---|---|---|---|---|
0.947 | 0.950 | 0.946 | 0.937 | 0.952 | ||
0.943 | 0.949 | 0.949 | 0.941 | 0.945 | ||
0.936 | 0.942 | 0.945 | 0.943 | 0.950 | ||
0.944 | 0.944 | 0.956 | 0.936 | 0.953 | ||
1 | 0.938 | 0.937 | 0.950 | 0.937 | 0.939 | |
(Random Forest) | 0.942 | 0.954 | 0.948 | 0.943 | 0.953 | |
(Noisy Predictor) | 0.947 | 0.958 | 0.947 | 0.944 | 0.947 | |
PPI++ (Random Forest) | 0.939 | 0.948 | 0.947 | 0.956 | 0.953 | |
PPI++ (Noisy Predictor) | 0.940 | 0.946 | 0.950 | 0.946 | 0.940 | |
0.960 | 0.942 | 0.949 | 0.939 | 0.950 | ||
0.960 | 0.953 | 0.949 | 0.952 | 0.943 | ||
0.952 | 0.944 | 0.947 | 0.939 | 0.950 | ||
0.955 | 0.945 | 0.942 | 0.946 | 0.943 | ||
2 | 0.950 | 0.945 | 0.943 | 0.935 | 0.941 | |
(Random Forest) | 0.948 | 0.949 | 0.950 | 0.937 | 0.954 | |
(Noisy Predictor) | 0.958 | 0.959 | 0.943 | 0.944 | 0.955 | |
PPI++ (Random Forest) | 0.960 | 0.954 | 0.949 | 0.952 | 0.943 | |
PPI++ (Noisy Predictor) | 0.960 | 0.952 | 0.947 | 0.951 | 0.943 |
In the GLM setting of Section 8.2, Figure 3 displays the standard error of estimators of the second parameter. The results are similar to those described in Section 8.2.

F.2 Variance estimation
We consider Example 7.3. The supervised estimator is the sample variance with influence function . The conditional influence function is
We generate the response as , where . We consider two settings for :
1. Setting 1 (non-linear model): .
2. Setting 2 (well-specified model, in the sense of Definition 3.1): .
For each model, we set , and vary the proportion of unlabeled observations, . The semiparametric efficiency lower bound in the ISS setting (5) and in the OSS setting (25) are estimated separately on a sample of observations.
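A minimal sketch of this simulation design and of the supervised sample-variance estimator with an influence-function-based standard error is given below. The specific mean functions of Settings 1 and 2 are not reproduced here, so a hypothetical non-linear \(\mu\) is used purely for illustration; the sample sizes are also placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n, n_unlab, mu, sigma=1.0):
    """Simulate one semi-supervised dataset: labeled (X, Y) and unlabeled X.

    `mu` is the conditional mean function; the specific choices used in
    Settings 1 and 2 are not reproduced here, so a hypothetical one is
    passed in for illustration.
    """
    x_lab = rng.normal(size=n)
    y_lab = mu(x_lab) + rng.normal(scale=sigma, size=n)
    x_unlab = rng.normal(size=n_unlab)
    return x_lab, y_lab, x_unlab

def supervised_variance(y):
    """Supervised estimator (sample variance) with an influence-function-based
    standard error, using phi(y) = (y - E[Y])^2 - Var(Y)."""
    theta_hat = np.var(y, ddof=1)
    phi = (y - y.mean()) ** 2 - theta_hat
    return theta_hat, np.sqrt(np.mean(phi ** 2) / len(y))

# Hypothetical non-linear mean function standing in for Setting 1.
x_lab, y_lab, x_unlab = make_data(500, 5000, mu=lambda x: np.sin(2 * x))
print(supervised_variance(y_lab))
```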
Table 4 reports the coverage of 95% confidence intervals for each method. All methods achieve the nominal coverage.
Setting | Method | |||||
---|---|---|---|---|---|---|
0.943 | 0.943 | 0.947 | 0.946 | 0.955 | ||
0.950 | 0.949 | 0.957 | 0.949 | 0.952 | ||
0.945 | 0.937 | 0.956 | 0.956 | 0.951 | ||
1 | 0.946 | 0.943 | 0.953 | 0.949 | 0.941 | |
0.945 | 0.941 | 0.944 | 0.959 | 0.948 | ||
(Random Forest) | 0.948 | 0.948 | 0.951 | 0.951 | 0.943 | |
(Noisy Predictor) | 0.948 | 0.961 | 0.942 | 0.957 | 0.956 | |
0.945 | 0.937 | 0.951 | 0.937 | 0.944 | ||
0.945 | 0.954 | 0.949 | 0.954 | 0.955 | ||
0.940 | 0.947 | 0.958 | 0.944 | 0.942 | ||
2 | 0.945 | 0.964 | 0.940 | 0.948 | 0.938 | |
0.943 | 0.939 | 0.945 | 0.946 | 0.942 | ||
(Random Forest) | 0.949 | 0.957 | 0.948 | 0.948 | 0.946 | |
(Noisy Predictor) | 0.950 | 0.946 | 0.940 | 0.950 | 0.954 |

Figure 4 displays the standard error of each method, averaged over 1,000 simulated datasets.
The results are similar to those of Section 8. In Setting 1 (non-linear model), achieves the efficiency lower bound in the OSS setting with a sufficient number of basis functions. On the other hand, with improves upon the supervised estimator, but is not efficient as it cannot accurately approximate the non-linear conditional influence function. with random forest predicted responses also performs well, while the version of those methods with a pure-noise prediction model is no worse than the supervised estimator. For Setting 2 (well-specified model), no method improves upon the OSS lower bound, as indicated by Corollary 4.2.
F.3 Kendall’s
We consider Example 7.4. The supervised estimator is
with influence function
The conditional influence function is
where is an independent copy of . We generate the response as and , where . We set , and consider two settings for :
1. Setting 1 (non-linear model): .
2. Setting 2 (well-specified model, in the sense of Definition 3.1): .
For each model, we set , and vary the proportion of unlabeled observations, . Due to the complexity of the data-generating model, we did not estimate the ISS and OSS lower bounds.
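For reference, the sketch below computes a Kendall-type U-statistic with the concordance-indicator kernel, together with an influence-function-based standard error using the standard order-two U-statistic expansion \(\varphi(z) = 2\{h_1(z) - \theta\}\). The kernel parameterization and the simulated data are our own choices and may differ from the paper's exact definition of Kendall's \(\tau\) by an affine rescaling.

```python
import numpy as np

def kendall_u_stat(x, y):
    """Supervised Kendall-type U-statistic with kernel
    h(z_i, z_j) = 1{(x_i - x_j)(y_i - y_j) > 0}, plus an influence-function-
    based standard error. Illustrative sketch only; the paper's
    parameterization may differ by an affine rescaling.
    """
    n = len(x)
    conc = ((x[:, None] - x[None, :]) * (y[:, None] - y[None, :]) > 0).astype(float)
    np.fill_diagonal(conc, 0.0)
    theta_hat = conc.sum() / (n * (n - 1))   # U-statistic over ordered pairs
    h1 = conc.sum(axis=1) / (n - 1)          # conditional kernel means
    phi = 2.0 * (h1 - theta_hat)             # influence-function estimates
    se = np.sqrt(np.mean(phi ** 2) / n)
    return theta_hat, se

rng = np.random.default_rng(2)
x = rng.normal(size=400)
y = x + rng.normal(size=400)
print(kendall_u_stat(x, y))
```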
Table 5 reports the coverage of 95% confidence intervals for each method. All methods achieve the nominal coverage.
Setting | Method | |||||
---|---|---|---|---|---|---|
0.943 | 0.950 | 0.928 | 0.942 | 0.939 | ||
0.952 | 0.938 | 0.942 | 0.944 | 0.947 | ||
0.955 | 0.948 | 0.947 | 0.944 | 0.942 | ||
1 | 0.937 | 0.951 | 0.937 | 0.941 | 0.929 | |
0.954 | 0.955 | 0.933 | 0.947 | 0.942 | ||
(Random Forest) | 0.946 | 0.946 | 0.942 | 0.930 | 0.942 | |
(Noisy Predictor) | 0.937 | 0.948 | 0.945 | 0.937 | 0.936 | |
0.956 | 0.966 | 0.933 | 0.951 | 0.942 | ||
0.946 | 0.934 | 0.944 | 0.949 | 0.936 | ||
0.948 | 0.954 | 0.944 | 0.940 | 0.941 | ||
2 | 0.947 | 0.950 | 0.933 | 0.938 | 0.938 | |
0.935 | 0.953 | 0.937 | 0.942 | 0.947 | ||
(Random Forest) | 0.949 | 0.950 | 0.944 | 0.940 | 0.944 | |
(Noisy Predictor) | 0.949 | 0.951 | 0.942 | 0.943 | 0.949 |

The results are similar to those of Section 8. In Setting 1 (non-linear model), with has the lowest standard error. with improves upon the supervised estimator, but is not as efficient as . with random forest predicted responses also performs well, while its counterpart with pure-noise prediction models has a standard error comparable to that of the supervised estimator. For Setting 2 (well-specified model), no method improves upon the OSS lower bound, as indicated by Corollary 4.2. We note that the apparent slight improvement in standard error in the second panel by is the result of a relatively small sample size.