
A Sieve Quasi-likelihood Ratio Test for Neural Networks with Applications to Genetic Association Studies

Xiaoxi Shen (Department of Mathematics, Texas State University, San Marcos, TX, USA), Chang Jiang (Department of Biostatistics, University of Florida, Gainesville, FL, USA), Lyudmila Sakhanenko (Department of Statistics and Probability, Michigan State University, East Lansing, MI, USA), Qing Lu (Department of Biostatistics, University of Florida, Gainesville, FL, USA)
Abstract

Neural networks (NN) play a central role in modern artificial intelligence (AI) technology and have been successfully used in areas such as natural language processing and image recognition. While the majority of NN applications focus on prediction and classification, there is increasing interest in the statistical inference of neural networks. The study of NN statistical inference can enhance our understanding of NN statistical properties. Moreover, it can facilitate NN-based hypothesis testing for hypothesis-driven clinical and biomedical research. In this paper, we propose a sieve quasi-likelihood ratio test based on NN with one hidden layer for testing complex associations. The test statistic has an asymptotic chi-squared distribution, so the test is computationally efficient and easy to implement in real data analysis. The validity of the asymptotic distribution is investigated via simulations. Finally, we demonstrate the use of the proposed test by performing a genetic association analysis of the sequencing data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI).

Keywords: Sieve quasi-likelihood ratio test; nonparametric least squares; influence functions.

1 Introduction

With the advance of science and technology, we are now in the era of the fourth industrial revolution. One of the key drivers of the fourth industrial revolution is artificial intelligence (AI). Deep neural networks play a critical role in AI and have achieved great success in many fields, such as natural language processing and image recognition. While great attention has been given to applications of neural networks (NN), few studies have focused on their theoretical properties and statistical inference, which hinders their application to hypothesis-driven clinical and biomedical research. The study of NN statistical inference can improve our understanding of NN properties and facilitate hypothesis testing using NN. Nevertheless, studying NN statistical inference is challenging. For instance, it has been pointed out in Fukumizu (1996) and Fukumizu et al. (2003) that the parameters in a neural network model are unidentifiable, so classical tests (e.g., the Wald test and the likelihood ratio test) cannot be used, because unidentifiability of parameters leads to inconsistency of the nonlinear least squares estimators (Wu, 1981).

Much of the existing literature on NN, such as Shen et al. (2021), Shen et al. (2019), Horel and Giesecke (2020), Schmidt-Hieber et al. (2020), and Chen and White (1999), is based on the framework of nonparametric regression. It has been shown in Chen and White (1999) that the rate of convergence of neural network estimators is $\left\|\hat{f}_{n}-f_{0}\right\|=\mathcal{O}_{p}\left((n/\log n)^{-\frac{1+1/d}{4(1+1/(2d))}}\right)$ for sufficiently smooth $f_{0}$, so one advantage of neural networks over methods commonly used in nonparametric regression (e.g., the Nadaraya-Watson estimator and spline regression) is that neural network estimators can avoid the curse of dimensionality in terms of the rate of convergence.

There is increasing interest in hypothesis testing based on neural networks. Recently, Shen et al. (2019) established asymptotic theories for neural networks, which can be used to perform nonparametric hypothesis tests on the true function. Horel and Giesecke (2020) used a Lindeberg-Feller type central limit theorem for random processes and a second-order functional delta method to construct a test statistic for significance tests on input features. However, the asymptotic distribution of their test statistic is complex, making it difficult to obtain critical values. Shen et al. (2021) proposed a goodness-of-fit type test based on neural networks, whose test statistic compares the mean squared errors of two neural networks built under the null and the alternative hypotheses. That test statistic has an asymptotic normal distribution and hence can be easily used in practice; however, constructing it requires a random split of the data, which can lead to a potential power loss. In this paper, we propose a sieve quasi-likelihood ratio (SQLR) test based on neural networks. The test statistic has an asymptotic chi-squared distribution, which facilitates its use in practice. Compared with the goodness-of-fit test in Shen et al. (2021), the SQLR test does not require data splitting, but it requires continuous random input features.

The rest of the paper is organized as follows. Section 2 provides the general results of the sieve quasi-likelihood ratio test under the setup of nonparametric regression. In Section 3, we apply the general theory to neural networks so that significance tests based on neural networks can be performed. We investigate the validity of the theory via simple simulations in Section 4, followed by a real data application to genetic association analysis of the sequencing data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) in Section 5. The proofs of the main results are given in the supplementary materials.

2 Sieve Quasi-Likelihood Ratio Test

Consider the classical setting of a nonparametric regression model under the random design,

$$Y_{i}=f_{0}(\boldsymbol{X}_{i})+\epsilon_{i},\quad i=1,\ldots,n,$$

where the covariates $\boldsymbol{X}_{1},\ldots,\boldsymbol{X}_{n}\in\mathbb{R}^{d}$ are assumed to be i.i.d. from a distribution $P$, and $\epsilon_{1},\ldots,\epsilon_{n}$ are i.i.d. random errors with $\mathbb{E}[\epsilon]=0$. $Y_{1},\ldots,Y_{n}$ are the responses, which are continuous random variables. The true function $f_{0}$ is assumed to lie in $\mathcal{F}\subset C(\mathcal{X})$, where $\mathcal{X}\subset\mathbb{R}^{d}$ is a compact subset. For simplicity, we take $\mathcal{X}=[-1,1]^{d}$. The norm considered on $\mathcal{F}$ is the $L_{2}$-norm $\|f\|=\left(\int_{\mathcal{X}}|f|^{2}\,\textrm{d}P\right)^{1/2}$. We further assume that $\|\epsilon\|_{p,1}=\int_{0}^{\infty}(\mathbb{P}(|\epsilon|>t))^{1/p}\,\textrm{d}t<\infty$ for some $p\geq 2$. Such an assumption is also considered in Han and Wellner (2019) and is necessary to obtain the desired convergence rate.

The approximate sieve extremum estimator $\hat{f}_{n}$ based on $\mathcal{F}_{n}$ is defined as

$$\mathbb{Q}_{n}(\hat{f}_{n})\leq\inf_{f\in\mathcal{F}_{n}}\mathbb{Q}_{n}(f)+\mathcal{O}_{p}(\eta_{n}),$$

where $\mathbb{Q}_{n}(f)$ is the classical sample squared-error loss function

$$\mathbb{Q}_{n}(f)=\frac{1}{n}\sum_{i=1}^{n}(Y_{i}-f(\boldsymbol{X}_{i}))^{2}.$$

We assume that $\bigcup_{n=1}^{\infty}\mathcal{F}_{n}$ is uniformly dense in $\mathcal{F}$; that is, for each $f\in\mathcal{F}$, there exists $\pi_{n}f\in\mathcal{F}_{n}$ such that $\sup_{\boldsymbol{x}\in\mathcal{X}}|\pi_{n}f(\boldsymbol{x})-f(\boldsymbol{x})|\to 0$ as $n\to\infty$. For simplicity, we assume that the sieve space $\mathcal{F}_{n}$ is countable to avoid additional technical issues on measurability.

The null hypothesis of the sieve quasi-likelihood ratio test is $H_{0}:\phi(f_{0})=0$, which is the same as the one proposed in Shen and Shi (2005). We define the sieve quasi-likelihood ratio statistic as

$$LR_{n}=n\left(\inf_{f\in\mathcal{F}_{n}^{0}}\mathbb{Q}_{n}(f)-\inf_{f\in\mathcal{F}_{n}}\mathbb{Q}_{n}(f)\right),$$

where $\mathcal{F}_{n}^{0}$ is the null sieve space given by

$$\mathcal{F}_{n}^{0}=\left\{f\in\mathcal{F}_{n}:\phi(f)=0\right\}.$$

Similar to the definition of $\hat{f}_{n}$, we denote the approximate sieve extremum estimator under $H_{0}$ by $\hat{f}_{n}^{0}$, which satisfies

$$\mathbb{Q}_{n}(\hat{f}_{n}^{0})\leq\inf_{f\in\mathcal{F}_{n}^{0}}\mathbb{Q}_{n}(f)+\mathcal{O}_{p}(\eta_{n}).$$

According to Shen (1997) and Shen and Shi (2005), we assume that the functional $\phi:\mathcal{F}\to\mathbb{R}$ has the following smoothness property: for any $f\in\mathcal{F}_{n}$,

$$|\phi(f)-\phi(f_{0})-\phi_{f_{0}}^{\prime}[f-f_{0}]|\leq u_{n}\|f-f_{0}\|^{\omega}\quad\textrm{as}\quad\|f-f_{0}\|\to 0, \quad (1)$$

where $\phi_{f_{0}}^{\prime}[f-f_{0}]$ is defined as $\lim_{t\to 0}[\phi(f(f_{0},t))-\phi(f_{0})]/t$, with $f(f_{0},t)$ being a path in $t$ connecting $f_{0}$ and $f$ such that $f(f_{0},0)=f_{0}$ and $f(f_{0},1)=f$. Here $\omega>0$ is the degree of smoothness of $\phi$ at $f_{0}$, $\phi_{f_{0}}^{\prime}[f-f_{0}]$ is linear in $f-f_{0}$, and

$$\|\phi_{f_{0}}^{\prime}\|=\sup_{f\in\mathcal{F},\,\|f-f_{0}\|>0}\frac{|\phi_{f_{0}}^{\prime}[f-f_{0}]|}{\|f-f_{0}\|}<\infty.$$

Then $\phi_{f_{0}}^{\prime}$ is a bounded linear functional on $\bar{V}_{f_{0}}$, the completion of $\textrm{span}\{f-f_{0}:f\in\mathcal{F}\}\subset L_{2}(\mathcal{X},\mathcal{A},P)$. By the Riesz representation theorem, there exists $v^{*}\in\bar{V}_{f_{0}}$ such that

$$\phi_{f_{0}}^{\prime}[f-f_{0}]=\langle f-f_{0},v^{*}\rangle=\int(f-f_{0})v^{*}\,\textrm{d}P.$$

Let $\rho_{n}$ be the rate of convergence of $\hat{f}_{n}$, that is, $\|\hat{f}_{n}-f_{0}\|=\mathcal{O}_{p}(\rho_{n})$. Let $\delta_{n}$ be a sequence converging to 0 with $\delta_{n}=\mathcal{O}_{p}(n^{-1/2})$. For $f\in\{f\in\mathcal{F}_{n}:\|f-f_{0}\|\leq\rho_{n}\}$, we define

$$\tilde{f}_{n}(f)=f+\delta_{n}u^{*},$$

where $u^{*}=\pm v^{*}/\|v^{*}\|^{2}$. The main result relies on the following conditions:

  • (C1)

    (Sieve Space) Suppose that $\mathcal{F}_{n}$ is uniformly bounded. Moreover, assume that there exists a non-increasing continuous function $H(u)$ of $u>0$ such that

    $$\log N(u,\mathcal{F}_{n},L_{2}(\mathbb{P}_{n}))\leq H(u)\textrm{ for all }u>0,$$

    and

    $$\int_{0}^{1}H^{1/2}(u)\,\textrm{d}u<\infty.$$
  • (C2)

    (Rate of Convergence) The rate of convergence $\rho_{n}$ satisfies $\rho_{n}=o(n^{-1/4})$, $\rho_{n}\gtrsim n^{-\frac{1}{2}+\frac{1}{2p}}$, and

    $$n\rho_{n}^{2}\geq 2H(\rho_{n})\textrm{ and }H(\rho_{n})\to\infty\textrm{ as }n\to\infty.$$
  • (C3)

    (Approximation Error)

    $$\sup_{Q}\sup_{f\in\mathcal{F}_{n},\,\|f-f_{0}\|\leq\rho_{n}}\left\|\pi_{n}\tilde{f}_{n}(f)-\tilde{f}_{n}(f)\right\|_{L_{2}(Q)}=o_{p}(\rho_{n}^{-1}\delta_{n}^{2}),$$

    where the supremum is taken over all probability measures $Q$ on $(\mathcal{X},\mathcal{A})$, and

    $$\sup_{f\in\mathcal{F}_{n},\,\|f-f_{0}\|\leq\rho_{n}}n^{-1}\sum_{i=1}^{n}\epsilon_{i}\left(\pi_{n}\tilde{f}_{n}(f)(\boldsymbol{X}_{i})-\tilde{f}_{n}(f)(\boldsymbol{X}_{i})\right)=o_{p}(\delta_{n}^{2}).$$
Remark 1.
  1. (i)

    Condition (C1) requires that the sieve space $\mathcal{F}_{n}$ is not too complex. This complexity measure in terms of the entropy number is common in the theory of nonparametric regression (see van de Geer (1987), van de Geer (1988), and van de Geer (1990)).

  2. (ii)

    Condition (C2) concerns the rate of convergence of sieve estimators. To obtain the desired result, the convergence rate cannot be too slow. Together with (C1), it allows us to derive a uniform law of large numbers for the empirical $L_{2}$ norm, as given in van de Geer (2000).

  3. (iii)

    The conditions on the approximation errors in the setting of nonparametric regression are given in condition (C3). These two requirements are special cases of the ones given in Shen (1997).

Theorem 1.

Suppose that $\eta_{n}=o(\delta_{n}^{2})$. Then, under (C1)-(C3),

$$\left|\left\langle\hat{f}_{n}-f_{0},v^{*}\right\rangle_{n}-n^{-1}\sum_{i=1}^{n}\epsilon_{i}v^{*}(\boldsymbol{X}_{i})\right|=o_{p}(n^{-1/2}).$$
Remark 2.

In view of Lemma 22 given in the supplementary materials, the empirical inner product $\left\langle\hat{f}_{n}-f_{0},v^{*}\right\rangle_{n}$ can be replaced by its population version $\left\langle\hat{f}_{n}-f_{0},v^{*}\right\rangle$.

We now state the main theorem for the sieve quasi-likelihood ratio statistic. The proof of the theorem follows the same steps as those in Shen and Shi (2005) and is given in the supplementary materials.

Theorem 2.

Under $H_{0}$ and (C1)-(C3), suppose that $\eta_{n}=o(\delta_{n}^{2})$, $u_{n}\rho_{n}^{\omega}=o(n^{-1/2})$, and $\sup_{\boldsymbol{x}\in\mathcal{X}}|v^{*}(\boldsymbol{x})|<\infty$. Then we have

$$\frac{n}{\sigma^{2}}\left[\mathbb{Q}_{n}(\hat{f}_{n}^{0})-\mathbb{Q}_{n}(\hat{f}_{n})\right]\xrightarrow{d}\chi_{1}^{2},$$

where $\sigma^{2}=\mathbb{E}[\epsilon^{2}]$.

In practice, $\sigma^{2}$ is rarely known a priori. A simple application of Slutsky’s theorem yields the following corollary, which shows that we can replace $\sigma^{2}$ with any consistent estimator $\hat{\sigma}_{n}^{2}$. A straightforward consistent estimator of $\sigma^{2}$ is given by $\hat{\sigma}_{n}^{2}=n^{-1}\sum_{i=1}^{n}\left(Y_{i}-\hat{f}_{n}^{0}(\boldsymbol{X}_{i})\right)^{2}$.

Corollary 3.

Under the conditions of Theorem 2,

$$\frac{n}{\hat{\sigma}_{n}^{2}}\left[\mathbb{Q}_{n}(\hat{f}_{n}^{0})-\mathbb{Q}_{n}(\hat{f}_{n})\right]\xrightarrow{d}\chi_{1}^{2},$$

where $\hat{\sigma}_{n}^{2}$ is any consistent estimator of $\sigma^{2}$.
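To make the procedure concrete, the following sketch (ours, not the authors' code) computes the SQLR statistic of Corollary 3 and its $P$-value; `fit_null` and `fit_alt` are placeholders for sieve least-squares routines over $\mathcal{F}_{n}^{0}$ and $\mathcal{F}_{n}$, respectively:

```python
import numpy as np
from scipy.stats import chi2

def sqlr_test(X, Y, fit_null, fit_alt):
    """SQLR test of Corollary 3. `fit_null` / `fit_alt` are assumed to
    return fitted functions mapping an (n, d) array to (n,) predictions."""
    n = len(Y)
    f0_hat = fit_null(X, Y)                 # restricted estimator under H_0
    f_hat = fit_alt(X, Y)                   # unrestricted sieve estimator
    Q_null = np.mean((Y - f0_hat(X)) ** 2)  # Q_n evaluated at f_hat_n^0
    Q_alt = np.mean((Y - f_hat(X)) ** 2)    # Q_n evaluated at f_hat_n
    sigma2_hat = Q_null                     # consistent estimator of sigma^2
    stat = n * (Q_null - Q_alt) / sigma2_hat
    return stat, chi2.sf(stat, df=1)        # asymptotic chi^2_1 p-value
```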

3 An Application to Neural Networks

We first introduce the notation used in this section. Let $\boldsymbol{e}_{i}=(0,\ldots,0,1,0,\ldots,0)$, where the 1 appears at the $i$th position. We use $\mathbb{Z}_{+}$ to denote the set of non-negative integers and use $\boldsymbol{\beta}=(\beta_{1},\ldots,\beta_{d})\in\mathbb{Z}_{+}^{d}$ to denote a multi-index. Moreover, we set $|\boldsymbol{\beta}|=\sum_{i=1}^{d}|\beta_{i}|$ and $\boldsymbol{x}^{\boldsymbol{\beta}}=x_{1}^{\beta_{1}}\cdots x_{d}^{\beta_{d}}$ for any $\boldsymbol{x}=(x_{1},\ldots,x_{d})^{T}\in\mathbb{R}^{d}$. For a differentiable function $u$ on $\mathcal{X}$, we set

$$D^{\boldsymbol{\beta}}u=\frac{\partial^{|\boldsymbol{\beta}|}u}{\partial\boldsymbol{x}^{\boldsymbol{\beta}}}=\frac{\partial^{|\boldsymbol{\beta}|}u}{\partial x_{1}^{\beta_{1}}\cdots\partial x_{d}^{\beta_{d}}}.$$

One of our goals in this paper is to establish a sieve quasi-likelihood ratio test for neural network estimators. Specifically, for a given $k\leq d$, let $\boldsymbol{X}=(X^{(1)},\ldots,X^{(k)},X^{(k+1)},\ldots,X^{(d)})^{T}\in\mathbb{R}^{d}$ and let the null hypothesis of interest be

$$H_{0}:X^{(1)},\ldots,X^{(k)}\textrm{ are not significantly associated with }Y.$$

Different from linear regression, in which this hypothesis can easily be translated into testing whether the corresponding regression coefficients are zero, testing the significance of an association in nonparametric regression is more complicated. From Chen and White (1999) and Horel and Giesecke (2020), testing $H_{0}$ in the nonparametric setting is equivalent to testing whether the corresponding partial derivatives are zero, or equivalently, to testing

$$H_{0}:\sum_{i=1}^{k}\int\left(D^{\boldsymbol{e}_{i}}f_{0}(\boldsymbol{x})\right)^{2}\,\textrm{d}P(\boldsymbol{x})=0.$$
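For intuition, the functional inside this hypothesis can be estimated by a plug-in rule: replace $P$ by the empirical distribution of the covariates and $f_{0}$ by a fitted network. A minimal sketch using PyTorch autograd (where `f_hat` and the index set `idx` are our assumptions, not part of the theory) is:

```python
import torch

def phi_hat(f_hat, X, idx):
    """Plug-in estimate of phi(f) = sum_{i in idx} E[(df/dx_i)^2], replacing
    P by the empirical distribution of the rows of X and f_0 by f_hat."""
    X = X.clone().requires_grad_(True)
    grads = torch.autograd.grad(f_hat(X).sum(), X)[0]  # (n, d) row gradients
    return (grads[:, idx] ** 2).sum(dim=1).mean()
```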

Hence, we assume that the true function $f_{0}$ is smooth. Specifically, we consider the Barron class $\mathscr{B}^{s}:=\{f:\mathcal{X}\to\mathbb{R}\,|\,\|f\|_{\mathscr{B}^{s}}\leq B\}$ for some integer $s\geq 1$ and some fixed constant $B$, as considered in Siegel and Xu (2020). Here

$$\left\|f\right\|_{\mathscr{B}^{s}}=\int_{\mathbb{R}^{d}}(1+|\omega|)^{s}|\hat{f}(\omega)|\,\textrm{d}\omega,$$

and $\hat{f}(\omega)$ is the Fourier transform of $f$. As shown in Siegel and Xu (2020), $\mathscr{B}^{s}\subset H^{s}(\mathcal{X})=\{f:\mathcal{X}\to\mathbb{R}\,|\,\|D^{\boldsymbol{\alpha}}f\|<\infty\textrm{ for all }0\leq|\boldsymbol{\alpha}|\leq s\}$ and $H^{\lfloor d/2\rfloor+2}(\mathcal{X})\subset\mathscr{B}^{1}$. In what follows, we take $\mathcal{F}=C^{m_{0}}(\mathcal{X})$ with $m_{0}=\lfloor d/2\rfloor+2$.

The functional $\phi$ from the general result in Section 2 is given by

$$\phi:\mathcal{F}\to\mathbb{R},\qquad f\mapsto\phi(f)=\sum_{i=1}^{k}\int\left(D^{\boldsymbol{e}_{i}}f(\boldsymbol{x})\right)^{2}\,\textrm{d}P(\boldsymbol{x}). \quad (2)$$

The directional derivative $\phi_{f_{0}}^{\prime}$ evaluated at a “direction” $h$ can be calculated straightforwardly. For the sieve space, we use the class of neural networks with one hidden layer and the sigmoid activation function $\sigma(x)=(1+e^{-x})^{-1}$:

$$\mathcal{F}_{r_{n}}=\left\{\alpha_{0}+\sum_{j=1}^{r_{n}}\alpha_{j}\sigma\left(\boldsymbol{\gamma}_{j}^{T}\boldsymbol{x}+\gamma_{0,j}\right):\boldsymbol{\gamma}_{j}\in\mathbb{R}^{d},\ \alpha_{j},\gamma_{0,j}\in\mathbb{R},\ \sum_{j=0}^{r_{n}}|\alpha_{j}|\leq V\textrm{ for some }V>4,\ \max_{1\leq j\leq r_{n}}\sum_{i=0}^{d}|\gamma_{i,j}|\leq M\textrm{ for some }M>0\right\}, \quad (3)$$

where $r_{n}\uparrow\infty$ as $n\to\infty$. In view of Barron (1993), $\mathcal{F}_{r_{n}}$ is $L_{2}$-dense in $\mathcal{F}$.
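For concreteness, a member of $\mathcal{F}_{r_{n}}$ and its two $\ell_{1}$ constraints can be written out directly; the sketch below (NumPy; the class name and parameter layout are illustrative) evaluates such a network and checks the constraints in (3):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class SieveNet:
    """A member of the sieve space F_{r_n} in (3): one hidden layer,
    sigmoid activation, and l1 constraints on both weight layers."""

    def __init__(self, alpha, Gamma, gamma0, V, M):
        # alpha: (r_n + 1,) output weights with alpha_0 first;
        # Gamma: (r_n, d) input weights; gamma0: (r_n,) hidden biases.
        assert np.abs(alpha).sum() <= V, "sum_j |alpha_j| <= V"
        assert (np.abs(Gamma).sum(axis=1) + np.abs(gamma0)).max() <= M, \
            "max_j sum_i |gamma_{i,j}| <= M"
        self.alpha, self.Gamma, self.gamma0 = alpha, Gamma, gamma0

    def __call__(self, x):
        # x: (n, d) inputs; returns the (n,) network outputs.
        hidden = sigmoid(x @ self.Gamma.T + self.gamma0)
        return self.alpha[0] + hidden @ self.alpha[1:]
```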

Based on the general results in the previous section, the functional $\phi$ needs to be smooth enough for the sieve quasi-likelihood ratio statistic to have an asymptotic chi-squared distribution. The following propositions guarantee that the conditions on $\phi$ in the general theory are satisfied.

Proposition 4.

Let $\phi$ be the functional given in (2). Then, for any $h\in\bar{V}_{f_{0}}$,

$$\phi_{f_{0}}^{\prime}[h]=2\sum_{i=1}^{k}\int D^{\boldsymbol{e}_{i}}f_{0}(\boldsymbol{x})D^{\boldsymbol{e}_{i}}h(\boldsymbol{x})\,\textrm{d}P(\boldsymbol{x}).$$

Moreover, $\phi_{f_{0}}^{\prime}$ is a bounded linear functional on $\bar{V}_{f_{0}}$.

Proof.

By definition,

$$\begin{aligned}\phi_{f_{0}}^{\prime}[h]&=\lim_{t\to 0}\frac{\phi(f_{0}+th)-\phi(f_{0})}{t}\\&=\sum_{i=1}^{k}\lim_{t\to 0}\int\frac{\left(D^{\boldsymbol{e}_{i}}(f_{0}+th)(\boldsymbol{x})\right)^{2}-\left(D^{\boldsymbol{e}_{i}}f_{0}(\boldsymbol{x})\right)^{2}}{t}\,\textrm{d}P(\boldsymbol{x})\\&=2\sum_{i=1}^{k}\int D^{\boldsymbol{e}_{i}}f_{0}(\boldsymbol{x})D^{\boldsymbol{e}_{i}}h(\boldsymbol{x})\,\textrm{d}P(\boldsymbol{x})+\sum_{i=1}^{k}\lim_{t\to 0}t\int\left(D^{\boldsymbol{e}_{i}}h(\boldsymbol{x})\right)^{2}\,\textrm{d}P(\boldsymbol{x})\\&=2\sum_{i=1}^{k}\int D^{\boldsymbol{e}_{i}}f_{0}(\boldsymbol{x})D^{\boldsymbol{e}_{i}}h(\boldsymbol{x})\,\textrm{d}P(\boldsymbol{x}).\end{aligned}$$

For the second claim, linearity follows directly from the definition of $\phi_{f_{0}}^{\prime}$. Boundedness follows from Hölder’s inequality by noting that

$$\begin{aligned}\sup_{h\in\mathcal{F},\,\|h\|=1}|\phi_{f_{0}}^{\prime}[h]|&\leq 2\sum_{i=1}^{k}\sup_{h\in\mathcal{F},\,\|h\|=1}\left|\int D^{\boldsymbol{e}_{i}}f_{0}(\boldsymbol{x})D^{\boldsymbol{e}_{i}}h(\boldsymbol{x})\,\textrm{d}P(\boldsymbol{x})\right|\\&\leq 2\sum_{i=1}^{k}\left\|D^{\boldsymbol{e}_{i}}f_{0}\right\|\sup_{h\in\mathcal{F},\,\|h\|=1}\left\|D^{\boldsymbol{e}_{i}}h\right\|\\&\lesssim_{s,\mathcal{X}}2kB^{2}<\infty.\end{aligned}$$ ∎

We now impose the following condition on the distribution $P$.

  • (C4)

    Suppose that $P\ll\lambda$, where $\lambda$ is the Lebesgue measure on $\mathbb{R}^{d}$. Let

    $$\varphi(\boldsymbol{x})=\frac{\textrm{d}P}{\textrm{d}\lambda}(\boldsymbol{x})\geq 0.$$

    Moreover, we assume that $\varphi=0$ on $\partial\mathcal{X}$, $\varphi\in L^{\infty}(\mathcal{X})$, and $\log\varphi\in C^{1}(\mathcal{X})$.

Proposition 5.

Under (C4), the Riesz representer $v^{*}$ of the bounded linear functional $\phi_{f_{0}}^{\prime}$ is given by

$$v^{*}=-2\sum_{i=1}^{k}\left(D^{2\boldsymbol{e}_{i}}f_{0}+D^{\boldsymbol{e}_{i}}f_{0}\,D^{\boldsymbol{e}_{i}}\log\varphi\right).$$
Proof.

Define $g:\mathcal{X}\to\mathbb{R}$ and $\boldsymbol{F}:\mathcal{X}\to\mathbb{R}^{d}$ as

$$\boldsymbol{F}=(0,\ldots,0,\underbrace{h}_{i\textrm{th position}},0,\ldots,0)^{T}\quad\textrm{and}\quad g=\varphi D^{\boldsymbol{e}_{i}}f_{0},$$

so that $g\nabla\cdot\boldsymbol{F}=\varphi D^{\boldsymbol{e}_{i}}f_{0}D^{\boldsymbol{e}_{i}}h$. Let $\boldsymbol{n}$ be the unit outward normal to $\partial\mathcal{X}$. By the integration by parts formula and the fact that $\varphi=0$ on $\partial\mathcal{X}$, we have

$$\begin{aligned}\int_{\mathcal{X}}\varphi D^{\boldsymbol{e}_{i}}f_{0}D^{\boldsymbol{e}_{i}}h\,\textrm{d}\boldsymbol{x}&=\int_{\mathcal{X}}g\nabla\cdot\boldsymbol{F}\,\textrm{d}\boldsymbol{x}\\&=-\int_{\mathcal{X}}\nabla g\cdot\boldsymbol{F}\,\textrm{d}\boldsymbol{x}+\int_{\partial\mathcal{X}}g\boldsymbol{F}\cdot\boldsymbol{n}\,\textrm{d}S\\&=-\int_{\mathcal{X}}\nabla g\cdot\boldsymbol{F}\,\textrm{d}\boldsymbol{x}\\&=-\int_{\mathcal{X}}h\left(D^{\boldsymbol{e}_{i}}\varphi\,D^{\boldsymbol{e}_{i}}f_{0}+\varphi D^{2\boldsymbol{e}_{i}}f_{0}\right)\textrm{d}\boldsymbol{x}\\&=-\int_{\mathcal{X}}h\left(D^{\boldsymbol{e}_{i}}\log\varphi\,D^{\boldsymbol{e}_{i}}f_{0}+D^{2\boldsymbol{e}_{i}}f_{0}\right)\textrm{d}P(\boldsymbol{x})\\&=\left\langle h,-\left(D^{2\boldsymbol{e}_{i}}f_{0}+D^{\boldsymbol{e}_{i}}f_{0}\,D^{\boldsymbol{e}_{i}}\log\varphi\right)\right\rangle.\end{aligned}$$

Based on the given assumptions, we know that $-\left(D^{2\boldsymbol{e}_{i}}f_{0}+D^{\boldsymbol{e}_{i}}f_{0}\,D^{\boldsymbol{e}_{i}}\log\varphi\right)\in C(\mathcal{X})\subset\bar{V}_{f_{0}}$. Therefore,

$$\begin{aligned}\phi_{f_{0}}^{\prime}[h]&=2\sum_{i=1}^{k}\int_{\mathcal{X}}D^{\boldsymbol{e}_{i}}f_{0}D^{\boldsymbol{e}_{i}}h\,\textrm{d}P\\&=2\sum_{i=1}^{k}\int_{\mathcal{X}}\varphi D^{\boldsymbol{e}_{i}}f_{0}D^{\boldsymbol{e}_{i}}h\,\textrm{d}\boldsymbol{x}\\&=\left\langle h,-2\sum_{i=1}^{k}\left(D^{2\boldsymbol{e}_{i}}f_{0}+D^{\boldsymbol{e}_{i}}f_{0}\,D^{\boldsymbol{e}_{i}}\log\varphi\right)\right\rangle.\end{aligned}$$ ∎

Before we bound the remainder of the first-order functional Taylor expansion, we provide a bound on the higher-order derivatives of a neural network.

Proposition 6.

Let $m$ be a non-negative integer. For any $f\in\mathcal{F}_{r_{n}}$ and any multi-index $\boldsymbol{\beta}$ with $|\boldsymbol{\beta}|=m$,

$$\sup_{\boldsymbol{x}\in\mathcal{X}}\left|D^{\boldsymbol{\beta}}f(\boldsymbol{x})\right|\leq VM^{m}m!.$$
Proof.

As $f\in\mathcal{F}_{r_{n}}$, $f$ can be represented as

$$f(\boldsymbol{x})=\alpha_{0}+\sum_{j=1}^{r_{n}}\alpha_{j}\sigma\left(\boldsymbol{\gamma}_{j}^{T}\boldsymbol{x}+\gamma_{0,j}\right).$$

A simple calculation yields

$$D^{\boldsymbol{\beta}}f(\boldsymbol{x})=\sum_{j=1}^{r_{n}}\alpha_{j}\boldsymbol{\gamma}_{j}^{\boldsymbol{\beta}}\sigma^{(m)}\left(\boldsymbol{\gamma}_{j}^{T}\boldsymbol{x}+\gamma_{0,j}\right),$$

where $\sigma^{(m)}(\cdot)$ denotes the $m$th derivative of $\sigma$. According to Minai and Williams (1993), we have

$$\sigma^{(m)}(z)=\sum_{a=1}^{m}(-1)^{a-1}C_{a}^{(m)}\sigma^{a}(z)\left(1-\sigma(z)\right)^{m+1-a},$$

where

$$\begin{aligned}C_{a}^{(m)}&=0\quad\forall m,\textrm{ if }a<1\textrm{ or }m\leq 0\textrm{ or }a>m,\\C_{1}^{(1)}&=1,\\C_{a}^{(m)}&=aC_{a}^{(m-1)}+(m+1-a)C_{a-1}^{(m-1)}.\end{aligned}$$

Therefore,

$$\begin{aligned}\sup_{\boldsymbol{x}\in\mathcal{X}}|D^{\boldsymbol{\beta}}f(\boldsymbol{x})|&=\sup_{\boldsymbol{x}\in\mathcal{X}}\left|\sum_{j=1}^{r_{n}}\alpha_{j}\boldsymbol{\gamma}_{j}^{\boldsymbol{\beta}}\sigma^{(m)}\left(\boldsymbol{\gamma}_{j}^{T}\boldsymbol{x}+\gamma_{0,j}\right)\right|\\&\leq\sum_{j=1}^{r_{n}}|\alpha_{j}|\left\|\boldsymbol{\gamma}_{j}\right\|_{\ell_{1}}^{m}\sup_{\boldsymbol{x}\in\mathcal{X}}\left|\sigma^{(m)}\left(\boldsymbol{\gamma}_{j}^{T}\boldsymbol{x}+\gamma_{0,j}\right)\right|\\&\leq VM^{m}\sup_{z\in\mathbb{R}}\left|\sigma^{(m)}(z)\right|\\&=VM^{m}\sup_{z\in\mathbb{R}}\left|\sum_{a=1}^{m}(-1)^{a-1}C_{a}^{(m)}\sigma^{a}(z)\left(1-\sigma(z)\right)^{m+1-a}\right|\\&\leq VM^{m}\sum_{a=1}^{m}C_{a}^{(m)}\\&\overset{(i)}{=}VM^{m}m!,\end{aligned}$$

where (i) follows from Proposition 21 in the supplementary material. ∎
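As a quick numerical sanity check of step (i), the recursion for $C_{a}^{(m)}$ can be iterated directly; the identity $\sum_{a=1}^{m}C_{a}^{(m)}=m!$ used in the last display then follows (a sketch; the function name is ours):

```python
import math

def sigmoid_derivative_coefs(m):
    """Coefficients C_a^(m) in the expansion of sigma^(m)(z) above,
    computed from the recursion of Minai and Williams (1993)."""
    C = {(1, 1): 1.0}
    for mm in range(2, m + 1):
        for a in range(1, mm + 1):
            C[(a, mm)] = (a * C.get((a, mm - 1), 0.0)
                          + (mm + 1 - a) * C.get((a - 1, mm - 1), 0.0))
    return [C[(a, m)] for a in range(1, m + 1)]

# Numerical check of the identity sum_a C_a^(m) = m! used in step (i).
for m in range(1, 8):
    assert sum(sigmoid_derivative_coefs(m)) == math.factorial(m)
```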

Lemma 7 (Rate of Convergence of Neural Network Sieve Estimators).
  1. (i)

    The sieve space $\mathcal{F}_{r_{n}}$ satisfies (C1).

  2. (ii)

    Suppose that $r_{n}^{2+1/d}\log^{2}r_{n}=\mathcal{O}(n)$. Then the rate of convergence $\rho_{n}$ of neural network sieve estimators is

    $$\rho_{n}=\mathcal{O}\left(\left(\frac{n}{\log^{2}n}\right)^{-\frac{1+1/d}{4(1+1/(2d))}}\right),$$

    and $\rho_{n}$ satisfies (C2).

Proof.
  1. (i)

    From Theorem 14.5 in Anthony and Bartlett (2009), we have

    $$N(u,\mathcal{F}_{r_{n}},\left\|\cdot\right\|_{\infty})\leq\left(\frac{4e[r_{n}(d+2)+1]\left(\frac{1}{4}V\right)^{2}}{u\left(\frac{1}{4}V-1\right)}\right)^{r_{n}(d+2)+1}=\left(\frac{e[r_{n}(d+2)+1]V^{2}}{u(V-4)}\right)^{r_{n}(d+2)+1},$$

    which implies that

    $$\log N(u,\mathcal{F}_{r_{n}},L_{2}(\mathbb{P}_{n}))\leq\log N(u,\mathcal{F}_{r_{n}},\left\|\cdot\right\|_{\infty})\lesssim_{d,V}(r_{n}\log r_{n})\log\frac{1}{u}. \quad (4)$$

    Hence, (C1) is satisfied with $H(u)=C_{d,V}\,(r_{n}\log r_{n})\log\frac{1}{u}$ by noting that

    $$\begin{aligned}\int_{0}^{1}\log^{1/2}\frac{1}{u}\,\textrm{d}u&=\int_{0}^{1/2}\log^{1/2}\frac{1}{u}\,\textrm{d}u+\int_{1/2}^{1}\log^{1/2}\frac{1}{u}\,\textrm{d}u\\&\leq\left.u\log^{1/2}\frac{1}{u}\right|_{0}^{1/2}+\frac{1}{2}\int_{0}^{1/2}\log^{-1/2}\frac{1}{u}\,\textrm{d}u+\frac{1}{2}\log^{1/2}2\\&\leq\log^{1/2}2+\frac{1}{4}\log^{-1/2}2<\infty.\end{aligned}$$
  2. (ii)

    Note that, for $\delta\leq 1/e$,

    $$\begin{aligned}\int_{0}^{\delta}\log^{1/2}\frac{1}{u}\,\textrm{d}u&=\left.u\log^{1/2}\frac{1}{u}\right|_{0}^{\delta}+\int_{0}^{\delta}\frac{1}{2}\log^{-1/2}\frac{1}{u}\,\textrm{d}u\\&\leq\delta\log^{1/2}\frac{1}{\delta}+\delta\log^{-1/2}\frac{1}{\delta}\\&\lesssim\delta\log^{1/2}\frac{1}{\delta}.\end{aligned}$$

    Let $\phi_{n}(\delta)=(r_{n}\log r_{n})^{1/2}\delta\log^{1/2}\frac{1}{\delta}$. Clearly, $\delta^{-\alpha}\phi_{n}(\delta)$ is decreasing on $(0,\infty)$ for $1\leq\alpha<2$. Note that

    $$\rho_{n}^{-2}\phi_{n}(\rho_{n})\lesssim\sqrt{n}\Leftrightarrow\rho_{n}^{-1}\log^{1/2}\rho_{n}^{-1}\lesssim\left(\frac{n}{r_{n}\log r_{n}}\right)^{1/2}. \quad (5)$$

    It follows from Makovoz (1996) that $\left\|f_{0}-\pi_{n}f_{0}\right\|\leq r_{n}^{-1/2-1/(2d)}$. By taking $\rho_{n}=r_{n}^{-1/2-1/(2d)}$ in (5), we obtain the following governing inequality:

    $$r_{n}^{1+\frac{1}{2d}}\log r_{n}\lesssim_{d}\sqrt{n}\Leftrightarrow r_{n}^{2+1/d}\log^{2}r_{n}=\mathcal{O}(n).$$

    Taking $r_{n}=\left(\frac{n}{\log^{2}n}\right)^{\frac{1}{2+\frac{1}{d}}}$, we have

    $$\rho_{n}=\mathcal{O}\left(\left(\frac{n}{\log^{2}n}\right)^{-\frac{1+1/d}{4(1+1/(2d))}}\right).$$

    To show that $\rho_{n}$ satisfies condition (C2), we note that $H(\rho_{n})\to\infty$ as long as $\rho_{n}\to 0$ as $n\to\infty$. The governing inequality is certainly satisfied based on the previous arguments. We also note that

    $$\begin{aligned}\rho_{n}&=\left(\frac{n}{\log^{2}n}\right)^{-\frac{1+1/d}{4(1+1/(2d))}}\\&=n^{-\frac{1}{4}}\,n^{\frac{1}{4}-\frac{1+1/d}{4(1+1/(2d))}}\,(\log^{2}n)^{\frac{1+1/d}{4(1+1/(2d))}}\\&=n^{-\frac{1}{4}}\,n^{-\frac{1/(2d)}{4(1+1/(2d))}}\,(\log^{2}n)^{\frac{1+1/d}{4(1+1/(2d))}}\\&=o(n^{-1/4}).\end{aligned}$$

    On the other hand, we have

    $$\rho_{n}\geq n^{-\frac{1+1/d}{4(1+1/(2d))}}=n^{-\frac{1}{2}}n^{\frac{1}{2}-\frac{1+1/d}{4(1+1/(2d))}}=n^{-\frac{1}{2}}n^{\frac{1}{2\left(2+\frac{1}{d}\right)}}\geq n^{-\frac{1}{2}+\frac{1}{2p}},$$

    where the last inequality follows from the assumption $p\geq 2+1/d$. ∎

Remark 3.

The rate of convergence we obtained has an additional $\log n$ term in the denominator compared with the results in Chen and Shen (1998), but this has little effect on the main result.

Proposition 8.

Under (C4) and the assumption of

$$r_{n}^{2+1/d}\log^{2}r_{n}=\mathcal{O}(n), \quad (6)$$

for any $f\in\{f\in\mathcal{F}_{r_{n}}:\|f-f_{0}\|\leq\rho_{n}\}$, we have

$$\left|\phi(f)-\phi(f_{0})-\phi_{f_{0}}^{\prime}[f-f_{0}]\right|=o(n^{-1/2}).$$
Proof.

Note that

$$\begin{aligned}\left|\phi(f)-\phi(f_{0})-\phi_{f_{0}}^{\prime}[f-f_{0}]\right|&=\left|\sum_{i=1}^{k}\int\left[\left(D^{\boldsymbol{e}_{i}}f(\boldsymbol{x})\right)^{2}-\left(D^{\boldsymbol{e}_{i}}f_{0}(\boldsymbol{x})\right)^{2}-2D^{\boldsymbol{e}_{i}}f_{0}(\boldsymbol{x})D^{\boldsymbol{e}_{i}}(f-f_{0})(\boldsymbol{x})\right]\textrm{d}P(\boldsymbol{x})\right|\\&=\sum_{i=1}^{k}\int\left(D^{\boldsymbol{e}_{i}}(f-f_{0})(\boldsymbol{x})\right)^{2}\textrm{d}P(\boldsymbol{x})\\&\leq 2\sum_{i=1}^{k}\int\left(D^{\boldsymbol{e}_{i}}(f-\pi_{r_{n}}f_{0})(\boldsymbol{x})\right)^{2}\textrm{d}P(\boldsymbol{x})+2\sum_{i=1}^{k}\int\left(D^{\boldsymbol{e}_{i}}(\pi_{r_{n}}f_{0}-f_{0})(\boldsymbol{x})\right)^{2}\textrm{d}P(\boldsymbol{x}),\end{aligned}$$

where the last inequality follows from the elementary inequality $(a+b)^{2}\leq 2(a^{2}+b^{2})$ and the triangle inequality. For the second term, it follows from Corollary 1 in Siegel and Xu (2020) that

$$\begin{aligned}2\sum_{i=1}^{k}\int\left(D^{\boldsymbol{e}_{i}}(\pi_{r_{n}}f_{0}-f_{0})(\boldsymbol{x})\right)^{2}\textrm{d}P(\boldsymbol{x})&=2\sum_{i=1}^{k}\int\left(D^{\boldsymbol{e}_{i}}(\pi_{r_{n}}f_{0}-f_{0})(\boldsymbol{x})\right)^{2}\varphi(\boldsymbol{x})\,\textrm{d}\boldsymbol{x}\\&\leq 2\left\|\varphi\right\|_{\infty}\left\|\pi_{r_{n}}f_{0}-f_{0}\right\|_{H^{s}(\mathcal{X})}^{2}\\&\lesssim_{\mathcal{X},d}n^{-1}=o(n^{-1/2}).\end{aligned}$$

For the first term, we use the Gagliardo-Nirenberg interpolation inequality (Theorem 12.87 in Leoni (2017)). For $m>1$ and $\theta=1-\frac{1}{m}$, there exists a constant $C$, independent of $f-\pi_{r_{n}}f_{0}$, such that

$$\left\|\nabla(f-\pi_{r_{n}}f_{0})\right\|\leq C\left\|f-\pi_{r_{n}}f_{0}\right\|^{1-\frac{1}{m}}\left\|\nabla^{m}(f-\pi_{r_{n}}f_{0})\right\|^{\frac{1}{m}}.$$

It then follows from Proposition 6 that

$$\begin{aligned}2\sum_{i=1}^{k}\int\left(D^{\boldsymbol{e}_{i}}(f-\pi_{r_{n}}f_{0})(\boldsymbol{x})\right)^{2}\textrm{d}P(\boldsymbol{x})&=2\sum_{i=1}^{k}\left\|D^{\boldsymbol{e}_{i}}(f-\pi_{r_{n}}f_{0})\right\|^{2}\\&\leq 2\left\|\nabla(f-\pi_{r_{n}}f_{0})\right\|^{2}\\&\leq 2C\left\|f-\pi_{r_{n}}f_{0}\right\|^{2-\frac{2}{m}}\left\|\nabla^{m}(f-\pi_{r_{n}}f_{0})\right\|^{\frac{2}{m}}\\&\leq 2C(2\rho_{n})^{2-\frac{2}{m}}\binom{d-1+m}{m}\left(\max_{\boldsymbol{\beta}:|\boldsymbol{\beta}|=m}\sup_{\boldsymbol{x}\in\mathcal{X}}\left|D^{\boldsymbol{\beta}}(f-\pi_{r_{n}}f_{0})\right|\right)^{\frac{2}{m}}\\&\leq 2C(2\rho_{n})^{2-\frac{2}{m}}\binom{d-1+m}{m}\left(2VM^{m}m!\right)^{\frac{2}{m}}\\&=8C\binom{d-1+m}{m}V^{\frac{2}{m}}M^{2}(m!)^{\frac{2}{m}}\rho_{n}^{2-\frac{2}{m}}.\end{aligned}$$

As we have shown in Lemma 7, under (6), $\rho_{n}=\left(n/\log^{2}n\right)^{-\frac{1+1/d}{4(1+1/(2d))}}$, and then

$$\begin{aligned}\rho_{n}^{2-\frac{2}{m}}&=\left(\frac{n}{\log^{2}n}\right)^{-\frac{1+1/d}{2(1+1/(2d))}\left(1-\frac{1}{m}\right)}\\&=n^{-\frac{1}{2}}\,n^{-\frac{1}{2m}\frac{m/(2d)-1-1/d}{1+1/(2d)}}\left(\log^{2}n\right)^{\frac{1+1/d}{2(1+1/(2d))}\left(1-\frac{1}{m}\right)}.\end{aligned}$$

By taking $m>2d+2$, we obtain that

$$2\sum_{i=1}^{k}\int\left(D^{\boldsymbol{e}_{i}}(f-\pi_{r_{n}}f_{0})(\boldsymbol{x})\right)^{2}\textrm{d}P(\boldsymbol{x})=o(n^{-1/2}),$$

where the implicit constants depend only on $d$, $V$, and $M$.

Therefore, we obtain that for $f\in\{f\in\mathcal{F}_{r_{n}}:\|f-f_{0}\|\leq\rho_{n}\}$,

$$\left|\phi(f)-\phi(f_{0})-\phi_{f_{0}}^{\prime}[f-f_{0}]\right|=o(n^{-1/2}).$$ ∎

Now we state and prove the asymptotic distribution of the sieve quasi-likelihood ratio statistic.

Theorem 9.

Suppose that $\eta_{n}=o(\delta_{n}^{2})$ and $\left\|\epsilon\right\|_{p,1}<\infty$ for some $p\geq 2+1/d$. Then, under (6) and $H_{0}$,

$$\frac{n}{\hat{\sigma}_{n}^{2}}\left[\mathbb{Q}_{n}(\hat{f}_{n}^{0})-\mathbb{Q}_{n}(\hat{f}_{n})\right]\xrightarrow{d}\chi_{1}^{2},$$

where $\hat{\sigma}_{n}^{2}$ is any consistent estimator of $\sigma^{2}$.

Proof.

While conditions (C1) and (C2) have been verified in Lemma 7, condition (C3) remains to be verified. According to Theorem 2.1 in Mhaskar (1996), we can find vectors $\{\gamma_{j}\}_{j=1}^{r_{n}}\subset\mathbb{R}^{d}$ and scalars $\{\gamma_{0,j}\}_{j=1}^{r_{n}}\subset\mathbb{R}$ such that for any $f\in\mathcal{W}^{m_{0},2}([-1,1]^{d})$, there exist coefficients $\alpha_{j}(f)$ satisfying

$$\left\|f-\sum_{j=1}^{r_{n}}\alpha_{j}(f)\sigma\left(\gamma_{j}^{T}x+\gamma_{0,j}\right)\right\|\lesssim r_{n}^{-m_{0}/d}\left\|f\right\|_{\mathcal{W}^{m_{0},2}([-1,1]^{d})}. \quad (7)$$

In addition, the functionals $\alpha_{j}$ are continuous linear functionals on $\mathcal{W}^{m_{0},2}([-1,1]^{d})$.

Based on the results from Goulaouic (1971) or Baouendi and Goulaouic (1974) and Lemma 3.2 in Mhaskar (1996), we can show that for an analytic function $f$ defined on a compact set $K$, there exist $a>1$ and coefficients $\alpha_{j}(f)$ such that

$$\left\|f-\sum_{j=1}^{r_{n}}\alpha_{j}(f)\sigma\left(\gamma_{j}^{T}x+\gamma_{0,j}\right)\right\|_{p}\lesssim a^{-r_{n}^{1/d}}, \quad (8)$$

where $\gamma_{j}$ and $\gamma_{0,j}$ are the same as those given in (7). Since $f$ is analytic, for every $f\in\mathcal{F}_{r_{n}}$ with $\left\|f-f_{0}\right\|\leq\rho_{n}$, there exists a neural network $\pi_{r_{n}}f\in\mathcal{F}_{r_{n}}$ with

$$\pi_{r_{n}}f=\sum_{j=1}^{r_{n}}\alpha_{j}(f)\sigma\left(\gamma_{j}^{T}x+\gamma_{0,j}\right),$$

such that $\left\|f-\pi_{r_{n}}f\right\|_{\infty}\lesssim a^{-r_{n}^{1/d}}$ for some $a>1$. For $f_{0}+u^{*}\in C^{m_{0}}(\mathcal{X})$, there exists a neural network $\pi_{r_{n}}(f_{0}+u^{*})\in\mathcal{F}_{r_{n}}$ with

$$\pi_{r_{n}}(f_{0}+u^{*})=\sum_{j=1}^{r_{n}}\alpha_{j}(f_{0}+u^{*})\sigma\left(\gamma_{j}^{T}x+\gamma_{0,j}\right),$$

such that $\left\|f_{0}+u^{*}-\pi_{r_{n}}(f_{0}+u^{*})\right\|_{\infty}\lesssim\rho_{n}$. By considering

$$\pi_{r_{n}}\tilde{f}_{n}(f)=(1-\delta_{n})\pi_{r_{n}}f+\delta_{n}\pi_{r_{n}}(f_{0}+u^{*}),$$

it is clear that $\pi_{r_{n}}\tilde{f}_{n}(f)\in\mathcal{F}_{r_{n}}$ and

$$\left\|\pi_{r_{n}}\tilde{f}_{n}(f)-\tilde{f}_{n}(f)\right\|_{\infty}\leq(1-\delta_{n})\left\|f-\pi_{r_{n}}f\right\|_{\infty}+\delta_{n}\left\|f_{0}+u^{*}-\pi_{r_{n}}(f_{0}+u^{*})\right\|_{\infty}=\mathcal{O}\left(\delta_{n}\rho_{n}\right).$$

Therefore, by choosing $\delta_{n}=\rho_{n}^{2}$ and noting that

$$\frac{\rho_{n}\delta_{n}}{\rho_{n}^{-1}\delta_{n}^{2}}=\rho_{n}=o(1),$$

it follows that the first requirement in (C3) is satisfied. For the second requirement, note that for $n$ large enough,

$$\begin{aligned}&\sup_{f\in\mathcal{F}_{n},\,\|f-f_{0}\|\leq\rho_{n}}n^{-1}\sum_{i=1}^{n}\epsilon_{i}\left(\pi_{r_{n}}\tilde{f}_{n}(f)(X_{i})-\tilde{f}_{n}(f)(X_{i})\right)\\=\;&\sup_{f\in\mathcal{F}_{n},\,\|f-f_{0}\|\leq\rho_{n}}(1-\delta_{n})n^{-1}\sum_{i=1}^{n}\epsilon_{i}\left(\pi_{r_{n}}f(X_{i})-f(X_{i})\right)+\delta_{n}n^{-1}\sum_{i=1}^{n}\epsilon_{i}\left(\pi_{r_{n}}(f_{0}+u^{*})(X_{i})-(f_{0}+u^{*})(X_{i})\right)\\\leq\;&\sup_{f\in\mathcal{F}_{n},\,\|f-f_{0}\|\leq\rho_{n}}\left\|\pi_{r_{n}}f-f\right\|_{\infty}n^{-1}\sum_{i=1}^{n}|\epsilon_{i}|+\delta_{n}\left\|\pi_{r_{n}}(f_{0}+u^{*})-(f_{0}+u^{*})\right\|_{\infty}\\=\;&o_{p}(\rho_{n}\delta_{n}^{2}).\end{aligned}$$

Hence (C3) is satisfied, and the desired claim follows from Corollary 3. ∎

4 A Simulation Study

We conducted a simulation study to investigate the type I error and power performance of our proposed test. The model for generating the simulation data is given as follows:

$$Y_{i}=8+X_{i}^{(1)}X_{i}^{(2)}+\exp\left(X_{i}^{(3)}X_{i}^{(4)}\right)+0.1X_{i}^{(5)}+\epsilon_{i},\quad i=1,\ldots,n,$$

where $\boldsymbol{X}_{i}=\left(X_{i}^{(1)},X_{i}^{(2)},X_{i}^{(3)},X_{i}^{(4)},X_{i}^{(5)},X_{i}^{(6)}\right)$, $\boldsymbol{X}_{1},\ldots,\boldsymbol{X}_{n}\sim\textrm{i.i.d. Unif}([-1,1]^{6})$, and $\epsilon_{1},\ldots,\epsilon_{n}\sim\textrm{i.i.d. }\mathcal{N}(0,1)$. Since $X^{(6)}$ is not included in the true model, we use $X^{(6)}$ to investigate whether the SQLR test has good control of the type I error, while the other five covariates are used to evaluate the power of the proposed test.
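For reference, the following sketch generates data from this model (a minimal illustration; the function name and seed handling are ours):

```python
import numpy as np

def simulate_data(n, seed=0):
    """Draw (X, Y) from the simulation model above."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n, 6))   # X_i ~ Unif([-1, 1]^6)
    eps = rng.standard_normal(n)              # eps_i ~ N(0, 1)
    Y = (8.0
         + X[:, 0] * X[:, 1]                  # X^(1) X^(2)
         + np.exp(X[:, 2] * X[:, 3])          # exp(X^(3) X^(4))
         + 0.1 * X[:, 4]                      # 0.1 X^(5); X^(6) is a null covariate
         + eps)
    return X, Y
```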

A subgradient method, as discussed in Section 7 of Boyd and Mutapcic (2008), was applied to obtain the neural network estimates because of the constraints on the sieve space $\mathcal{F}_{r_{n}}$. The step size at the $k$th iteration was chosen to be $0.1/\log(e+k)$ when fitting a neural network under the null hypothesis $H_{0}$, and $0.1/(300\log(e+k))$ under the alternative hypothesis $H_{1}$. These choices of step sizes ensure the convergence of the subgradient method. In terms of the structure of the neural networks, we set $r_{n}=\lfloor n^{1/2}\rfloor$ and $V=1000$ for the neural networks fitted under both $H_{0}$ and $H_{1}$. When fitting the neural network under $H_{0}$, the initial values for the weights were randomly assigned. We then used the fitted weights from the neural network under $H_{0}$ as initial values and set all extra weights to zero when fitting the neural network under $H_{1}$.
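The exact optimizer implementation is not spelled out above; the sketch below shows one plausible variant, a projected subgradient update that re-enforces the $\ell_{1}$ constraints of the sieve space (3) after each step. The parameter layout and the names `project_l1` and `subgradient_step` are our illustrative assumptions, not the authors' code.

```python
import numpy as np

def project_l1(w, radius):
    """Euclidean projection of w onto the l1 ball of the given radius,
    via the standard sorting-based algorithm."""
    if np.abs(w).sum() <= radius:
        return w
    u = np.sort(np.abs(w))[::-1]
    css = np.cumsum(u)
    k = np.nonzero(u * np.arange(1, w.size + 1) > css - radius)[0][-1]
    theta = (css[k] - radius) / (k + 1.0)
    return np.sign(w) * np.maximum(np.abs(w) - theta, 0.0)

def subgradient_step(params, grads, k, V, M, step0=0.1):
    """One projected-subgradient update with step size step0 / log(e + k).
    params["gamma"] stacks (gamma_{0,j}, gamma_j) per hidden unit, so the
    per-row projection enforces sum_i |gamma_{i,j}| <= M including the bias."""
    lr = step0 / np.log(np.e + k)
    alpha = project_l1(params["alpha"] - lr * grads["alpha"], V)
    gamma = np.stack([project_l1(g - lr * dg, M)
                      for g, dg in zip(params["gamma"], grads["gamma"])])
    return {"alpha": alpha, "gamma": gamma}
```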

Table 1 summarizes the empirical type I error and the empirical power under various sample sizes for the proposed neural-network-based SQLR test and the linear-regression-based $F$-test, based on 500 Monte Carlo iterations. The results show that both testing procedures control the empirical type I error well. In terms of empirical power, the $F$-test can only detect the linear component $0.1X^{(5)}$ of the simulated model, while the SQLR test can detect all components of the model. Therefore, when nonlinear patterns exist in the underlying function, the SQLR test is anticipated to be more powerful than the $F$-test. Even for the linear term, the performance of the two methods is comparable.

Table 1: Empirical type I error rate (for covariate $X^{(6)}$) and empirical power (for covariates $X^{(1)},\ldots,X^{(5)}$) for the neural-network-based SQLR test and the linear-regression-based $F$-test

| Covariate | SQLR $n=100$ | $500$ | $1000$ | $3000$ | $5000$ | $F$-test $n=100$ | $500$ | $1000$ | $3000$ | $5000$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| $X^{(1)}$ | 0.072 | 0.072 | 0.080 | 0.326 | 0.818 | 0.054 | 0.068 | 0.046 | 0.060 | 0.042 |
| $X^{(2)}$ | 0.058 | 0.088 | 0.152 | 0.504 | 0.932 | 0.066 | 0.062 | 0.070 | 0.054 | 0.058 |
| $X^{(3)}$ | 0.052 | 0.062 | 0.104 | 0.308 | 0.812 | 0.050 | 0.048 | 0.058 | 0.078 | 0.060 |
| $X^{(4)}$ | 0.064 | 0.072 | 0.132 | 0.486 | 0.920 | 0.054 | 0.066 | 0.048 | 0.056 | 0.064 |
| $X^{(5)}$ | 0.074 | 0.202 | 0.406 | 0.904 | 0.978 | 0.070 | 0.222 | 0.414 | 0.826 | 0.956 |
| $X^{(6)}$ (Type I error) | 0.054 | 0.058 | 0.046 | 0.042 | 0.060 | 0.046 | 0.054 | 0.038 | 0.032 | 0.054 |

5 Real Data Applications

We conducted two genetic association analyses by applying the proposed sieve quasi-likelihood ratio test based on neural networks to the gene expression data and the sequencing data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI). Studies have shown that the hippocampus plays a vital part in memory and learning (Mu and Gage, 2011) and that changes in hippocampal volume have a great impact on Alzheimer’s disease (Schuff et al., 2009). For both analyses, we first regressed the logarithm of the hippocampal volume on important covariates (i.e., age, gender, and education status) and then used the residuals as the response variable to fit neural networks. A total of 464 subjects and 15,837 gene expressions were obtained after quality control.
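As an illustration of this preprocessing step, the following sketch computes the residual response; the file name and column names are hypothetical placeholders, not the actual ADNI variable names:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical file and column names, for illustration only.
pheno = pd.read_csv("adni_phenotypes.csv")
covariates = sm.add_constant(pheno[["age", "gender", "education"]])
ols_fit = sm.OLS(np.log(pheno["hippocampus_volume"]), covariates).fit()
y_resid = ols_fit.resid   # response used to fit the neural networks
```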

Under the null hypothesis, a gene is not associated with the response; therefore, we can use the sample average of the response variable as the null estimator. When fitting neural networks under the alternative hypothesis, we set the number of hidden units to $r_{n}=\lfloor n^{1/2}\rfloor$ and the upper bound for the $\ell_{1}$-norm of the hidden-to-output weights to $V=1000$. In total, $3\times 10^{4}$ iterations were performed, with the learning rate at the $k$th iteration chosen to be $0.8/\log(e+k)$. Table 2 summarizes the top 10 significant genes detected by the SQLR test and the $F$-test. Based on the results, the top 10 genes with the smallest $P$-values detected by the two tests are similar.

Table 2: Top 10 significant genes detected by the neural-network-based SQLR test and the linear-regression-based $F$-test

| $F$-test gene | $P$-value | SQLR gene | $P$-value |
| --- | --- | --- | --- |
| SNRNP40 | 5.48E-05 | PPIH | 5.84E-05 |
| PPIH | 1.01E-04 | SNRNP40 | 6.91E-05 |
| GPR85 | 1.65E-04 | NOD2 | 1.22E-04 |
| DNAJB1 | 1.87E-04 | DNAJB1 | 1.66E-04 |
| WDR70 | 1.91E-04 | CTBP1-AS2 | 1.94E-04 |
| CYP4F2 | 2.64E-04 | GPR85 | 2.21E-04 |
| NOD2 | 2.84E-04 | WDR70 | 2.31E-04 |
| MEGF9 | 2.85E-04 | KAZALD1 | 2.59E-04 |
| CTBP1-AS2 | 3.35E-04 | CYP4F2 | 2.95E-04 |
| HNRNPAB | 3.58E-04 | HNRNPAB | 3.72E-04 |

To explore the performance of the proposed SQLR test for categorical predictors, we conducted a genetic association analysis by applying SQLR to the ADNI genotype data in the APOE gene. The APOE gene on chromosome 19 is a well-known AD gene (Strittmatter et al., 1993). For this analysis, we considered all available single-nucleotide polymorphisms (SNPs) in the APOE gene as input features and, for each SNP, conducted a single-locus association test while accounting for all other SNPs in the gene. We used the same response variable as the one used in the gene expression study. A total of 780 subjects and 169 SNPs were obtained after quality control.

As in the gene expression study, we used the sample average of the response variable as the null estimator, and the tuning parameters used to fit the neural networks were the same as those mentioned above. Table 3 summarizes the top 10 significant SNPs in the APOE gene detected by the SQLR method for neural networks and by the $F$-test in linear regression, along with their $P$-values.

As we can see from the results, the majority of the significant SNPs detected by the $F$-test and the SQLR test overlap. Whether these significant SNPs are biologically meaningful needs further investigation. This suggests that the SQLR test based on neural networks has the potential for wider applications; at least in this study, it performs as well as the $F$-test.

Table 3: Top 10 significant SNPs detected by the SQLR test for neural networks and the $F$-test in linear regression

| $F$-test SNP | $P$-value | SQLR SNP | $P$-value |
| --- | --- | --- | --- |
| rs10414043 | 1.10E-05 | rs10414043 | 1.18E-05 |
| rs7256200 | 1.10E-05 | rs7256200 | 1.18E-05 |
| rs769449 | 1.88E-05 | rs769449 | 2.00E-05 |
| rs438811 | 1.94E-05 | rs438811 | 2.28E-05 |
| rs10119 | 2.42E-05 | rs10119 | 2.59E-05 |
| rs483082 | 2.50E-05 | rs483082 | 2.91E-05 |
| rs75627662 | 5.32E-04 | rs75627662 | 5.44E-04 |
| rs_x139 | 1.76E-03 | rs1038025 | 3.42E-03 |
| rs59325138 | 3.01E-03 | rs59325138 | 3.67E-03 |
| rs1038025 | 3.15E-03 | rs1038026 | 4.34E-03 |

6 Discussion

Hypothesis-driven studies are quite common in biomedical and public health research. For instance, investigators are typically interested in detecting complex relationships (e.g., non-linear relationships) between genetic variants and diseases in genetic studies. Therefore, significance tests based on a flexible and powerful model are crucial in real-world applications. Although neural networks have achieved great success in pattern recognition, their black-box nature makes it difficult to conduct statistical inference based on them. To fill this gap, we proposed a sieve quasi-likelihood ratio test based on neural networks for testing complex associations. The asymptotic chi-squared distribution of the test statistic was derived and validated via simulation studies. We also evaluated the SQLR test by applying it to the gene expression and sequencing data from ADNI.

There are some limitations of the proposed method. First, the underlying function is required to be sufficiently smooth, which may not hold in some applications. Such a requirement is not needed for the goodness-of-fit test proposed in Shen et al. (2021); however, the construction of that test requires data splitting, which could potentially reduce its power. Our empirical studies also found that a suitable choice of the step size is crucial for good performance of the proposed method. Further studies will be conducted on how to choose a suitable step size so that guidance can be provided for real data applications.

In Section 2, we developed general theories for the SQLR test under the framework of nonparametric regression. The conditions (C1)-(C3) are easy to verify compared with the original ones in Shen and Shi (2005). Such results can be extended to deep neural networks and other models used in artificial intelligence, such as convolutional neural networks or long short-term memory recurrent neural networks, as long as one can obtain a good bound on the metric entropy of the class of functions.

Acknowledgements

This work is supported by the National Institute on Drug Abuse (Award No. R01DA043501) and the National Library of Medicine (Award No. R01LM012848).

References

  • Anthony and Bartlett (2009) Martin Anthony and Peter L Bartlett. Neural network learning: Theoretical foundations. Cambridge University Press, 2009.
  • Baouendi and Goulaouic (1974) MS Baouendi and C Goulaouic. Approximation of analytic functions on compact sets and Bernstein’s inequality. Transactions of the American Mathematical Society, 189:251–261, 1974.
  • Barron (1993) Andrew R Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information theory, 39(3):930–945, 1993.
  • Boyd and Mutapcic (2008) Stephen Boyd and Almir Mutapcic. Subgradient methods (notes for EE364B Winter 2006-07, Stanford University), 2008.
  • Chen and Shen (1998) Xiaohong Chen and Xiaotong Shen. Sieve extremum estimates for weakly dependent data. Econometrica, pages 289–314, 1998.
  • Chen and White (1999) Xiaohong Chen and Halbert White. Improved rates and asymptotic normality for nonparametric neural network estimators. IEEE Transactions on Information Theory, 45(2):682–691, 1999.
  • Fukumizu (1996) Kenji Fukumizu. A regularity condition of the information matrix of a multilayer perceptron network. Neural networks, 9(5):871–879, 1996.
  • Fukumizu et al. (2003) Kenji Fukumizu et al. Likelihood ratio of unidentifiable models and multilayer neural networks. Annals of Statistics, 31(3):833–851, 2003.
  • Giné et al. (2000) Evarist Giné, Rafał Latała, and Joel Zinn. Exponential and moment inequalities for U-statistics. In High Dimensional Probability II, pages 13–38. Springer, 2000.
  • Goulaouic (1971) Charles Goulaouic. Approximation polynômiale de fonctions CC^{\infty} et analytiques. Ann. Inst. Fourier Grenoble, 21:149–173, 1971.
  • Han and Wellner (2019) Qiyang Han and Jon A Wellner. Convergence rates of least squares regression estimators with heavy-tailed errors. Annals of Statistics, 47(4):2286–2319, 2019.
  • Horel and Giesecke (2020) Enguerrand Horel and Kay Giesecke. Significance tests for neural networks. Journal of Machine Learning Research, 21(227):1–29, 2020.
  • Leoni (2017) Giovanni Leoni. A first course in Sobolev spaces. American Mathematical Soc., 2017.
  • Makovoz (1996) Yuly Makovoz. Random approximants and neural networks. Journal of Approximation Theory, 85(1):98–109, 1996.
  • McDiarmid (1989) Colin McDiarmid. On the method of bounded differences. Surveys in combinatorics, 141(1):148–188, 1989.
  • Mhaskar (1996) Hrushikesh N Mhaskar. Neural networks for optimal approximation of smooth and analytic functions. Neural computation, 8(1):164–177, 1996.
  • Minai and Williams (1993) Ali A Minai and Ronald D Williams. On the derivatives of the sigmoid. Neural Networks, 6(6):845–853, 1993.
  • Mu and Gage (2011) Yangling Mu and Fred H Gage. Adult hippocampal neurogenesis and its role in Alzheimer’s disease. Molecular Neurodegeneration, 6(1):1–9, 2011.
  • Schmidt-Hieber et al. (2020) Johannes Schmidt-Hieber et al. Nonparametric regression using deep neural networks with ReLU activation function. Annals of Statistics, 48(4):1875–1897, 2020.
  • Schuff et al. (2009) N Schuff, N Woerner, L Boreta, T Kornfield, LM Shaw, JQ Trojanowski, PM Thompson, CR Jack Jr, MW Weiner, and the Alzheimer’s Disease Neuroimaging Initiative. MRI of hippocampal volume loss in early Alzheimer’s disease in relation to ApoE genotype and biomarkers. Brain, 132(4):1067–1077, 2009.
  • Shen (1997) Xiaotong Shen. On methods of sieves and penalization. The Annals of Statistics, pages 2555–2591, 1997.
  • Shen and Shi (2005) Xiaotong Shen and Jian Shi. Sieve likelihood ratio inference on general parameter space. Science in China Series A: Mathematics, 48(1):67–78, 2005.
  • Shen et al. (2019) Xiaoxi Shen, Chang Jiang, Lyudmila Sakhanenko, and Qing Lu. Asymptotic properties of neural network sieve estimators. arXiv preprint arXiv:1906.00875, 2019.
  • Shen et al. (2021) Xiaoxi Shen, Chang Jiang, Lyudmila Sakhanenko, and Qing Lu. A goodness-of-fit test based on neural network sieve estimators. Statistics & Probability Letters, page 109100, 2021.
  • Siegel and Xu (2020) Jonathan W Siegel and Jinchao Xu. Approximation rates for neural networks with general activation functions. Neural Networks, 2020.
  • Strittmatter et al. (1993) Warren J Strittmatter, Ann M Saunders, Donald Schmechel, Margaret Pericak-Vance, Jan Enghild, Guy S Salvesen, and Allen D Roses. Apolipoprotein E: high-avidity binding to beta-amyloid and increased frequency of type 4 allele in late-onset familial Alzheimer disease. Proceedings of the National Academy of Sciences, 90(5):1977–1981, 1993.
  • van de Geer (1987) Sara van de Geer. A new approach to least-squares estimation, with applications. The Annals of Statistics, pages 587–602, 1987.
  • van de Geer (1988) Sara van de Geer. Regression analysis and empirical processes. CWI, 1988.
  • van de Geer (1990) Sara van de Geer. Estimating a regression function. The Annals of Statistics, pages 907–924, 1990.
  • van de Geer (2000) Sara van de Geer. Empirical Processes in M-estimation, volume 6. Cambridge university press, 2000.
  • van der Vaart and Wellner (1996) Aad W van der Vaart and Jon A Wellner. Weak convergence. Springer, 1996.
  • Wu (1981) Chien-Fu Wu. Asymptotic theory of nonlinear least squares estimation. The Annals of Statistics, pages 501–513, 1981.

Supplementary Materials

Proof of Theorem 1

In this section, we take the sequence $\delta_{n}=o(n^{-1/2})$. The proof of the theorem relies on the following lemmas.

Lemma 10.

Under (C1)-(C3), for sufficiently large $n$,

$$\left\|\pi_{n}\tilde{f}_{n}(\hat{f}_{n})-f_{0}\right\|_{n}^{2}-\left\|\hat{f}_{n}-f_{0}\right\|_{n}^{2}\leq 2(1-\delta_{n})\left\langle\hat{f}_{n}-f_{0},\delta_{n}u^{*}\right\rangle_{n}+\mathcal{O}_{p}(\delta_{n}^{2}).$$
Proof.

We first note that

$$\begin{aligned}\left\|\pi_{n}\tilde{f}_{n}(\hat{f}_{n})-f_{0}\right\|_{n}^{2}&=\left\|\pi_{n}\tilde{f}_{n}(\hat{f}_{n})-\tilde{f}_{n}(\hat{f}_{n})+\tilde{f}_{n}(\hat{f}_{n})-f_{0}\right\|_{n}^{2}\\&=\left\|\pi_{n}\tilde{f}_{n}(\hat{f}_{n})-\tilde{f}_{n}(\hat{f}_{n})+(1-\delta_{n})(\hat{f}_{n}-f_{0})+\delta_{n}u^{*}\right\|_{n}^{2}\\&=\left\|\pi_{n}\tilde{f}_{n}(\hat{f}_{n})-\tilde{f}_{n}(\hat{f}_{n})\right\|_{n}^{2}+(1-\delta_{n})^{2}\left\|\hat{f}_{n}-f_{0}\right\|_{n}^{2}+\delta_{n}^{2}\left\|u^{*}\right\|_{n}^{2}\\&\qquad+2(1-\delta_{n})\left\langle\pi_{n}\tilde{f}_{n}(\hat{f}_{n})-\tilde{f}_{n}(\hat{f}_{n}),\hat{f}_{n}-f_{0}\right\rangle_{n}+2\delta_{n}\left\langle\pi_{n}\tilde{f}_{n}(\hat{f}_{n})-\tilde{f}_{n}(\hat{f}_{n}),u^{*}\right\rangle_{n}\\&\qquad+2\delta_{n}(1-\delta_{n})\left\langle\hat{f}_{n}-f_{0},u^{*}\right\rangle_{n}\\&\leq(1-\delta_{n})^{2}\left\|\hat{f}_{n}-f_{0}\right\|_{n}^{2}+2(1-\delta_{n})\left\langle\hat{f}_{n}-f_{0},\delta_{n}u^{*}\right\rangle_{n}+\delta_{n}^{2}\left\|u^{*}\right\|_{n}^{2}\\&\qquad+2(1-\delta_{n})\left\|\pi_{n}\tilde{f}_{n}(\hat{f}_{n})-\tilde{f}_{n}(\hat{f}_{n})\right\|_{n}\left\|\hat{f}_{n}-f_{0}\right\|_{n}\\&\qquad+2\delta_{n}\left\|\pi_{n}\tilde{f}_{n}(\hat{f}_{n})-\tilde{f}_{n}(\hat{f}_{n})\right\|_{n}\left\|u^{*}\right\|_{n}+\left\|\pi_{n}\tilde{f}_{n}(\hat{f}_{n})-\tilde{f}_{n}(\hat{f}_{n})\right\|_{n}^{2}.\end{aligned}$$

For a sufficiently large nn, the Strong Law of Large Numbers implies that un2u\left\|{u^{*}}\right\|_{n}\leq 2\left\|{u^{*}}\right\| a.s., and hence

δn2un2=𝒪p(δn2).\delta_{n}^{2}\left\|{u^{*}}\right\|_{n}^{2}=\mathcal{O}_{p}(\delta_{n}^{2}).

Moreover,

(1δn)2f^nf0n2f^nf0n2\displaystyle(1-\delta_{n})^{2}\left\|{\hat{f}_{n}-f_{0}}\right\|_{n}^{2}-\left\|{\hat{f}_{n}-f_{0}}\right\|_{n}^{2} =(2δn+δn2)f^nf0n2\displaystyle=(-2\delta_{n}+\delta_{n}^{2})\left\|{\hat{f}_{n}-f_{0}}\right\|_{n}^{2}
δn2f^nf0n2.\displaystyle\leq\delta_{n}^{2}\left\|{\hat{f}_{n}-f_{0}}\right\|_{n}^{2}.

On the other hand, under (C1) and (C2), it follows from Lemma 5.4 in van de Geer (2000) that

(supfnff0ρnff0n>8ρn)4exp(nρn2),\mathbb{P}\left(\sup_{\begin{subarray}{c}f\in\mathcal{F}_{n}\\ \|f-f_{0}\|\leq\rho_{n}\end{subarray}}\left\|{f-f_{0}}\right\|_{n}>8\rho_{n}\right)\leq 4\exp\left(-n\rho_{n}^{2}\right), (9)

which implies that f^nf0n=𝒪p(ρn)=op(1)\left\|{\hat{f}_{n}-f_{0}}\right\|_{n}=\mathcal{O}_{p}(\rho_{n})=o_{p}(1) and then

δn2f^nf0n2=op(δn2).\delta_{n}^{2}\left\|{\hat{f}_{n}-f_{0}}\right\|_{n}^{2}=o_{p}(\delta_{n}^{2}).

Under (C2) and (C3), we have

2(1δn)πnf~n(f^n)f~n(f^n)nf^nf0\displaystyle 2(1-\delta_{n})\left\|{\pi_{n}\tilde{f}_{n}(\hat{f}_{n})-\tilde{f}_{n}(\hat{f}_{n})}\right\|_{n}\left\|{\hat{f}_{n}-f_{0}}\right\| 2f^nf0nπnf~n(f^n)f~n(f^n)n\displaystyle\leq 2\left\|{\hat{f}_{n}-f_{0}}\right\|_{n}\left\|{\pi_{n}\tilde{f}_{n}(\hat{f}_{n})-\tilde{f}_{n}(\hat{f}_{n})}\right\|_{n}
=𝒪p(ρn)op(ρn1δn2)\displaystyle=\mathcal{O}_{p}(\rho_{n})\cdot o_{p}(\rho_{n}^{-1}\delta_{n}^{2})
=op(δn2),\displaystyle=o_{p}(\delta_{n}^{2}),

and for a large enough nn,

2δnπnf~n(f^n)f~n(f^n)nun=op(δnρn1δn2)=op(δn2),2\delta_{n}\left\|{\pi_{n}\tilde{f}_{n}(\hat{f}_{n})-\tilde{f}_{n}(\hat{f}_{n})}\right\|_{n}\left\|{u^{*}}\right\|_{n}=o_{p}(\delta_{n}\rho_{n}^{-1}\delta_{n}^{2})=o_{p}(\delta_{n}^{2}),

and

πnf~n(f^n)f~n(f^n)n2=op(ρn2δn2δn2)=op(δn2).\left\|{\pi_{n}\tilde{f}_{n}(\hat{f}_{n})-\tilde{f}_{n}(\hat{f}_{n})}\right\|_{n}^{2}=o_{p}(\rho_{n}^{-2}\delta_{n}^{2}\delta_{n}^{2})=o_{p}(\delta_{n}^{2}).

Therefore, we obtain

πnf~n(f^n)f0n2f^nf0n22(1δn)f^nf0,δnun+𝒪p(δn2).\left\|{\pi_{n}\tilde{f}_{n}(\hat{f}_{n})-f_{0}}\right\|_{n}^{2}-\left\|{\hat{f}_{n}-f_{0}}\right\|_{n}^{2}\leq 2(1-\delta_{n})\left\langle{\hat{f}_{n}-f_{0}},{\delta_{n}u^{*}}\right\rangle_{n}+\mathcal{O}_{p}(\delta_{n}^{2}). ∎

Lemma 11.

Under (C1) - (C3),

n1i=1nϵi(πnf~n(f^n)(𝑿i)f^n(𝑿i))=n1δni=1nϵiu(𝑿i)+op(δnn1/2).n^{-1}\sum_{i=1}^{n}\epsilon_{i}\left(\pi_{n}\tilde{f}_{n}(\hat{f}_{n})(\boldsymbol{X}_{i})-\hat{f}_{n}(\boldsymbol{X}_{i})\right)=n^{-1}\delta_{n}\sum_{i=1}^{n}\epsilon_{i}u^{*}(\boldsymbol{X}_{i})+o_{p}(\delta_{n}n^{-1/2}).
Proof.

Note that

n1i=1nϵi(πnf~n(f^n)(𝑿i)f^n(𝑿i))\displaystyle n^{-1}\sum_{i=1}^{n}\epsilon_{i}\left(\pi_{n}\tilde{f}_{n}(\hat{f}_{n})(\boldsymbol{X}_{i})-\hat{f}_{n}(\boldsymbol{X}_{i})\right)
=\displaystyle= n1i=1nϵi(πnf~n(f^n)(𝑿i)f~n(f^n)(𝑿i)+f~n(f^n)(𝑿i)f^n(𝑿i))\displaystyle n^{-1}\sum_{i=1}^{n}\epsilon_{i}\left(\pi_{n}\tilde{f}_{n}(\hat{f}_{n})(\boldsymbol{X}_{i})-\tilde{f}_{n}(\hat{f}_{n})(\boldsymbol{X}_{i})+\tilde{f}_{n}(\hat{f}_{n})(\boldsymbol{X}_{i})-\hat{f}_{n}(\boldsymbol{X}_{i})\right)
=\displaystyle= n1i=1nϵi(πnf~n(f^n)(𝑿i)f~n(f^n)(𝑿i))n1δni=1nϵi(f^n(𝑿i)f0(𝑿i))+n1δni=1nϵiu(𝑿i).\displaystyle n^{-1}\sum_{i=1}^{n}\epsilon_{i}\left(\pi_{n}\tilde{f}_{n}(\hat{f}_{n})(\boldsymbol{X}_{i})-\tilde{f}_{n}(\hat{f}_{n})(\boldsymbol{X}_{i})\right)-n^{-1}\delta_{n}\sum_{i=1}^{n}\epsilon_{i}\left(\hat{f}_{n}(\boldsymbol{X}_{i})-f_{0}(\boldsymbol{X}_{i})\right)+n^{-1}\delta_{n}\sum_{i=1}^{n}\epsilon_{i}u^{*}(\boldsymbol{X}_{i}). (10)

Now, we define

J(δ)=0δH1/2(u)duδ,J(\delta)=\int_{0}^{\delta}H^{1/2}(u)\textrm{d}u\vee\delta,

and

Δn=supfnff0ρnff0n.\Delta_{n}=\sup_{\begin{subarray}{c}f\in\mathcal{F}_{n}\\ \left\|{f-f_{0}}\right\|\leq\rho_{n}\end{subarray}}\left\|{f-f_{0}}\right\|_{n}.

Let ξ1,,ξn\xi_{1},\ldots,\xi_{n} be i.i.d. Rademacher random variables independent of ϵ1,,ϵn\epsilon_{1},\ldots,\epsilon_{n} and 𝑿1,,𝑿n\boldsymbol{X}_{1},\ldots,\boldsymbol{X}_{n}. It then follows from Corollary 2.2.8 in van der Vaart and Wellner (1996) that

𝔼[supfnff0ρn|1ni=1nξi(ff0)(𝑿i)|]\displaystyle\mathbb{E}\left[\sup_{\begin{subarray}{c}f\in\mathcal{F}_{n}\\ \left\|{f-f_{0}}\right\|\leq\rho_{n}\end{subarray}}\left|\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\xi_{i}(f-f_{0})(\boldsymbol{X}_{i})\right|\right] =𝔼[𝔼[supfnff0ρn|1ni=1nξi(ff0)(𝑿i)||Δn8ρn]]\displaystyle=\mathbb{E}\left[\mathbb{E}\left[\left.\sup_{\begin{subarray}{c}f\in\mathcal{F}_{n}\\ \left\|{f-f_{0}}\right\|\leq\rho_{n}\end{subarray}}\left|\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\xi_{i}(f-f_{0})(\boldsymbol{X}_{i})\right|\right|\Delta_{n}\leq 8\rho_{n}\right]\right]
𝔼[08ρnlogN(u,n,L2(n))du]\displaystyle\lesssim\mathbb{E}\left[\int_{0}^{8\rho_{n}}\sqrt{\log N(u,\mathcal{F}_{n},L_{2}(\mathbb{P}_{n}))}\textrm{d}u\right]
08ρnH1/2(u)du\displaystyle\lesssim\int_{0}^{8\rho_{n}}H^{1/2}(u)\textrm{d}u
J(8ρn).\displaystyle\lesssim J(8\rho_{n}).

Moreover, based on (C2), we obtain

J2(8ρn)nρn4\displaystyle\frac{J^{2}(8\rho_{n})}{n\rho_{n}^{4}} =1nρn4[(08ρnH1/2(u)du)2+64ρn2]\displaystyle=\frac{1}{n\rho_{n}^{4}}\left[\left(\int_{0}^{8\rho_{n}}H^{1/2}(u)\textrm{d}u\right)^{2}+64\rho_{n}^{2}\right]
=1nρn2[64(18ρn08ρnH1/2(u)du)2+64]\displaystyle=\frac{1}{n\rho_{n}^{2}}\left[64\left(\frac{1}{8\rho_{n}}\int_{0}^{8\rho_{n}}H^{1/2}(u)\textrm{d}u\right)^{2}+64\right]
=1nρn2(64H(λρn)+64) for some λ(0,8),\displaystyle=\frac{1}{n\rho_{n}^{2}}(64H(\lambda\rho_{n})+64)\textrm{ for some }\lambda\in(0,8),

where the last equality follows from the mean value theorem for integrals. Hence, J(8ρn)=𝒪(nρn2)J(8\rho_{n})=\mathcal{O}(\sqrt{n}\rho_{n}^{2}) and

𝔼[supfnff0ρn|i=1nξi(ff0)(𝑿i)|]nρn2.\mathbb{E}^{*}\left[\sup_{\begin{subarray}{c}f\in\mathcal{F}_{n}\\ \left\|{f-f_{0}}\right\|\leq\rho_{n}\end{subarray}}\left|\sum_{i=1}^{n}\xi_{i}(f-f_{0})(\boldsymbol{X}_{i})\right|\right]\lesssim n\rho_{n}^{2}.

By Proposition 20, we know that

supfnff0ρn|i=1nϵi(ff0)(𝑿i)|=𝒪p(nρn2),\sup_{\begin{subarray}{c}f\in\mathcal{F}_{n}\\ \left\|{f-f_{0}}\right\|\leq\rho_{n}\end{subarray}}\left|\sum_{i=1}^{n}\epsilon_{i}(f-f_{0})(\boldsymbol{X}_{i})\right|=\mathcal{O}_{p}\left(n\rho_{n}^{2}\right),

which implies that

n1δni=1nϵi(f^n(𝑿i)f0(𝑿i))=𝒪p(δnρn2)=op(δnn1/2).n^{-1}\delta_{n}\sum_{i=1}^{n}\epsilon_{i}\left(\hat{f}_{n}(\boldsymbol{X}_{i})-f_{0}(\boldsymbol{X}_{i})\right)=\mathcal{O}_{p}(\delta_{n}\rho_{n}^{2})=o_{p}(\delta_{n}n^{-1/2}). (11)

The desired claim then follows by combining (C3) with equations (10) and (11). ∎
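
As a purely numerical companion to Lemma 11 (not part of the proof), the following Python sketch estimates the symmetrized empirical process quantity E[sup_f |Σ_i ξ_i (f−f0)(X_i)|] over a toy finite class of single-neuron functions; the class, the regression function f0, and the sample sizes are illustrative assumptions, not objects from the paper. For a fixed finite class the supremum grows like √n, which is consistent with the nρn² envelope used above once ρn decays no faster than n^{−1/4}.

```python
# A minimal numerical companion to Lemma 11 (illustration only): estimate
# E[ sup_f | sum_i xi_i (f - f0)(X_i) | ] over a toy finite class of
# single-neuron functions f_{a,b}(x) = tanh(a x + b).
import numpy as np

rng = np.random.default_rng(0)

def f0(x):
    return np.sin(2 * np.pi * x)

# Illustrative finite "sieve": a small grid of (a, b) pairs.
grid = [(a, b) for a in np.linspace(-3, 3, 13) for b in np.linspace(-1, 1, 9)]

def sup_process(n, reps=200):
    """Monte Carlo estimate of E[ sup_f | sum_i xi_i (f - f0)(X_i) | ]."""
    sups = np.empty(reps)
    for r in range(reps):
        x = rng.uniform(0, 1, n)
        xi = rng.choice([-1.0, 1.0], size=n)          # Rademacher signs
        diffs = np.array([np.tanh(a * x + b) - f0(x) for a, b in grid])
        sups[r] = np.abs(diffs @ xi).max()
    return sups.mean()

for n in [100, 400, 1600]:
    # For a fixed finite class the supremum scales like sqrt(n), so this
    # ratio should stabilize as n grows.
    print(n, sup_process(n) / np.sqrt(n))
```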

We now prove Theorem 1.

Proof.

Note that for ff0ρn\|f-f_{0}\|\leq\rho_{n}, we have

f~n(f)f0\displaystyle\left\|{\tilde{f}_{n}(f)-f_{0}}\right\| =(1δn)f+δn(f0+u)f0\displaystyle=\left\|{(1-\delta_{n})f+\delta_{n}(f_{0}+u^{*})-f_{0}}\right\|
=(1δn)(ff0)+δnu\displaystyle=\left\|{(1-\delta_{n})(f-f_{0})+\delta_{n}u^{*}}\right\|
(1δn)ff0+δnu.\displaystyle\leq(1-\delta_{n})\left\|{f-f_{0}}\right\|+\delta_{n}\left\|{u^{*}}\right\|.

With probability tending to 1, f~n(f)f0ρn\left\|{\tilde{f}_{n}(f)-f_{0}}\right\|\leq\rho_{n}. Since

n(f)\displaystyle\mathbb{Q}_{n}(f) =n1i=1n(Yif(𝑿i))2\displaystyle=n^{-1}\sum_{i=1}^{n}(Y_{i}-f(\boldsymbol{X}_{i}))^{2}
=n1i=1n(ϵi+f0(𝑿i)f(𝑿i))2\displaystyle=n^{-1}\sum_{i=1}^{n}(\epsilon_{i}+f_{0}(\boldsymbol{X}_{i})-f(\boldsymbol{X}_{i}))^{2}
=n1i=1nϵi22n1i=1nϵi(f(𝑿i)f0(𝑿i))+ff0n2,\displaystyle=n^{-1}\sum_{i=1}^{n}\epsilon_{i}^{2}-2n^{-1}\sum_{i=1}^{n}\epsilon_{i}(f(\boldsymbol{X}_{i})-f_{0}(\boldsymbol{X}_{i}))+\left\|{f-f_{0}}\right\|_{n}^{2},

we have

n(f^n)\displaystyle\mathbb{Q}_{n}(\hat{f}_{n}) =n1i=1nϵi22n1i=1nϵi(f^n(𝑿i)f0(𝑿i))+f^nf0n2\displaystyle=n^{-1}\sum_{i=1}^{n}\epsilon_{i}^{2}-2n^{-1}\sum_{i=1}^{n}\epsilon_{i}(\hat{f}_{n}(\boldsymbol{X}_{i})-f_{0}(\boldsymbol{X}_{i}))+\left\|{\hat{f}_{n}-f_{0}}\right\|_{n}^{2}
n(πnf~n(f^n))\displaystyle\mathbb{Q}_{n}(\pi_{n}\tilde{f}_{n}(\hat{f}_{n})) =n1i=1nϵi22n1i=1nϵi(πnf~n(f^n)(𝑿i)f0(𝑿i))+πnf~n(f^n)f0n2.\displaystyle=n^{-1}\sum_{i=1}^{n}\epsilon_{i}^{2}-2n^{-1}\sum_{i=1}^{n}\epsilon_{i}(\pi_{n}\tilde{f}_{n}(\hat{f}_{n})(\boldsymbol{X}_{i})-f_{0}(\boldsymbol{X}_{i}))+\left\|{\pi_{n}\tilde{f}_{n}(\hat{f}_{n})-f_{0}}\right\|_{n}^{2}.

Subtracting these two equations, we have

n(f^n)=n(πnf~n(f^n))+2n1i=1nϵi(πnf~n(f^n)(𝑿i)f^n(𝑿i))+f^nf0n2πnf~n(f^n)f0n2.\mathbb{Q}_{n}(\hat{f}_{n})=\mathbb{Q}_{n}(\pi_{n}\tilde{f}_{n}(\hat{f}_{n}))+2n^{-1}\sum_{i=1}^{n}\epsilon_{i}(\pi_{n}\tilde{f}_{n}(\hat{f}_{n})(\boldsymbol{X}_{i})-\hat{f}_{n}(\boldsymbol{X}_{i}))+\left\|{\hat{f}_{n}-f_{0}}\right\|_{n}^{2}-\left\|{\pi_{n}\tilde{f}_{n}(\hat{f}_{n})-f_{0}}\right\|_{n}^{2}.

It then follows from the definition of f^n\hat{f}_{n} that

𝒪p(δn2)\displaystyle-\mathcal{O}_{p}(\delta_{n}^{2}) inffnn(f)n(f^n)\displaystyle\leq\inf_{f\in\mathcal{F}_{n}}\mathbb{Q}_{n}(f)-\mathbb{Q}_{n}(\hat{f}_{n})
n(πnf~n(f^n))n(f^n)\displaystyle\leq\mathbb{Q}_{n}(\pi_{n}\tilde{f}_{n}(\hat{f}_{n}))-\mathbb{Q}_{n}(\hat{f}_{n})
πnf~n(f^n)f0n2f^nf0n22n1i=1nϵi(πnf~n(f^n)(𝑿i)f^n(𝑿i))\displaystyle\leq\left\|{\pi_{n}\tilde{f}_{n}(\hat{f}_{n})-f_{0}}\right\|_{n}^{2}-\left\|{\hat{f}_{n}-f_{0}}\right\|_{n}^{2}-2n^{-1}\sum_{i=1}^{n}\epsilon_{i}(\pi_{n}\tilde{f}_{n}(\hat{f}_{n})(\boldsymbol{X}_{i})-\hat{f}_{n}(\boldsymbol{X}_{i})).

From Lemma 10, we know that

πnf~n(f^n)f0n2f^nf0n22(1δn)f^nf0,δnun+𝒪p(δn2).\left\|{\pi_{n}\tilde{f}_{n}(\hat{f}_{n})-f_{0}}\right\|_{n}^{2}-\left\|{\hat{f}_{n}-f_{0}}\right\|_{n}^{2}\leq 2(1-\delta_{n})\left\langle{\hat{f}_{n}-f_{0}},{\delta_{n}u^{*}}\right\rangle_{n}+\mathcal{O}_{p}(\delta_{n}^{2}).

From Lemma 11, we have

n1i=1nϵi(πnf~n(f^n)(𝑿i)f^n(𝑿i))=n1δni=1nϵiu(𝑿i)+op(δnn1/2).n^{-1}\sum_{i=1}^{n}\epsilon_{i}\left(\pi_{n}\tilde{f}_{n}(\hat{f}_{n})(\boldsymbol{X}_{i})-\hat{f}_{n}(\boldsymbol{X}_{i})\right)=n^{-1}\delta_{n}\sum_{i=1}^{n}\epsilon_{i}u^{*}(\boldsymbol{X}_{i})+o_{p}(\delta_{n}n^{-1/2}).

Therefore, putting all the pieces together, we have

𝒪p(δn2)\displaystyle-\mathcal{O}_{p}(\delta_{n}^{2}) 2(1δn)f^nf0,δnun2n1δni=1nϵiu(𝑿i)+op(δnn1/2)\displaystyle\leq 2(1-\delta_{n})\left\langle{\hat{f}_{n}-f_{0}},{\delta_{n}u^{*}}\right\rangle_{n}-2n^{-1}\delta_{n}\sum_{i=1}^{n}\epsilon_{i}u^{*}(\boldsymbol{X}_{i})+o_{p}(\delta_{n}n^{-1/2})
2f^nf0,δnun2n1δni=1nϵiu(𝑿i)+op(δnn1/2),\displaystyle\leq 2\left\langle{\hat{f}_{n}-f_{0}},{\delta_{n}u^{*}}\right\rangle_{n}-2n^{-1}\delta_{n}\sum_{i=1}^{n}\epsilon_{i}u^{*}(\boldsymbol{X}_{i})+o_{p}(\delta_{n}n^{-1/2}),

which implies that

f^nf0,un+n1i=1nϵiu(𝑿i)𝒪p(δn)+op(n1/2)=op(n1/2).-\left\langle{\hat{f}_{n}-f_{0}},{u^{*}}\right\rangle_{n}+n^{-1}\sum_{i=1}^{n}\epsilon_{i}u^{*}(\boldsymbol{X}_{i})\leq\mathcal{O}_{p}(\delta_{n})+o_{p}(n^{-1/2})=o_{p}(n^{-1/2}).

By replacing uu^{*} with u-u^{*}, we get

|f^nf0,unn1i=1nϵiu(𝑿i)|op(n1/2),\left|{\left\langle{\hat{f}_{n}-f_{0}},{u^{*}}\right\rangle_{n}-n^{-1}\sum_{i=1}^{n}\epsilon_{i}u^{*}(\boldsymbol{X}_{i})}\right|\leq o_{p}(n^{-1/2}),

and the desired result follows immediately. ∎
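
To see the conclusion of Theorem 1 numerically, one can replace the neural network sieve by a linear sieve, for which the least-squares estimator has a closed form and the identity ⟨f̂n−f0, u*⟩n = n^{−1}Σ_i ϵ_i u*(X_i) holds exactly whenever u* lies in the sieve space. The cosine basis, f0, and u* in the sketch below are illustrative assumptions, not objects from the paper.

```python
# A hedged sketch of Theorem 1 in a linear-sieve stand-in (illustration
# only): with a cosine basis, sieve least squares has a closed form, and
# when u* lies in the sieve span the empirical inner product
# <f_hat - f0, u*>_n equals n^{-1} sum_i eps_i u*(X_i) exactly.
import numpy as np

rng = np.random.default_rng(1)
n, K = 2000, 8                                     # illustrative choices

x = rng.uniform(0, 1, n)
f0 = lambda t: np.sin(2 * np.pi * t)               # assumed truth
ustar = lambda t: np.cos(2 * np.pi * t)            # direction u*, in the span

Phi = np.column_stack([np.cos(np.pi * k * x) for k in range(K)])  # basis
eps = rng.normal(0.0, 0.5, n)
y = f0(x) + eps

beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)     # sieve least squares
fhat = Phi @ beta                                  # fitted values at the X_i

lhs = np.mean((fhat - f0(x)) * ustar(x))           # <f_hat - f0, u*>_n
rhs = np.mean(eps * ustar(x))                      # n^{-1} sum_i eps_i u*(X_i)
print(lhs, rhs, np.sqrt(n) * (lhs - rhs))          # last value ~ 0
```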

Proof of Theorem 2

In what follows, we consider δn=f^nf0,v\delta_{n}=-\left\langle{\hat{f}_{n}-f_{0}},{v^{*}}\right\rangle and ηn=o(δn2)\eta_{n}=o(\delta_{n}^{2}). Under (C1)-(C3), it follows from Theorem 1 and Lemma 22 that, provided sup𝒙𝒳|v(𝒙)|<\sup_{\boldsymbol{x}\in\mathcal{X}}|v^{*}(\boldsymbol{x})|<\infty,

δn\displaystyle\delta_{n} =f^nf0,vn+op(n1/2)\displaystyle=-\left\langle{\hat{f}_{n}-f_{0}},{v^{*}}\right\rangle_{n}+o_{p}(n^{-1/2})
=1ni=1nϵiv(𝑿i)+op(n1/2)\displaystyle=-\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}v^{*}(\boldsymbol{X}_{i})+o_{p}(n^{-1/2})
=𝒪p(n1/2).\displaystyle=\mathcal{O}_{p}(n^{-1/2}).

The proof of Theorem 2 relies on the following lemmas.

Lemma 12 (Convergence Rate for f^n0\hat{f}_{n}^{0}).

Under (C1) and (C2),

f^n0f0=𝒪p(ρn).\left\|{\hat{f}_{n}^{0}-f_{0}}\right\|=\mathcal{O}_{p}(\rho_{n}).
Proof.

As πnf0f0f^nf0=𝒪p(ρn)\left\|{\pi_{n}f_{0}-f_{0}}\right\|\leq\left\|{\hat{f}_{n}-f_{0}}\right\|=\mathcal{O}_{p}(\rho_{n}), it suffices to show that f^n0πnf0=𝒪p(ρn)\left\|{\hat{f}_{n}^{0}-\pi_{n}f_{0}}\right\|=\mathcal{O}_{p}(\rho_{n}). Note that for any M>0M>0,

supfn0ff0>Mρn𝕄n(f)𝕄n(πnf0)supfnff0>Mρn𝕄n(f)𝕄n(πnf0).\sup_{\begin{subarray}{c}f\in\mathcal{F}_{n}^{0}\\ \left\|{f-f_{0}}\right\|>M\rho_{n}\end{subarray}}\mathbb{M}_{n}(f)-\mathbb{M}_{n}(\pi_{n}f_{0})\leq\sup_{\begin{subarray}{c}f\in\mathcal{F}_{n}\\ \left\|{f-f_{0}}\right\|>M\rho_{n}\end{subarray}}\mathbb{M}_{n}(f)-\mathbb{M}_{n}(\pi_{n}f_{0}).

Under (C2), we know that ηn=𝒪(ρn2)\eta_{n}=\mathcal{O}(\rho_{n}^{2}). It then follows from the definition of f^n0\hat{f}_{n}^{0} that, for any ε>0\varepsilon>0, there exists K>0K>0 such that,

(𝕄n(f^n0)𝕄n(πnf0)<Kρn2)<ε,\mathbb{P}\left(\mathbb{M}_{n}(\hat{f}_{n}^{0})-\mathbb{M}_{n}(\pi_{n}f_{0})<-K\rho_{n}^{2}\right)<\varepsilon,

and hence

(f^n0πnf0>Mρn)\displaystyle\mathbb{P}\left(\left\|{\hat{f}_{n}^{0}-\pi_{n}f_{0}}\right\|>M\rho_{n}\right)
\displaystyle\leq (f^n0πnf0>Mρn,𝕄n(f^n0)𝕄n(πnf0)Kρn2)+(𝕄n(f^n0)𝕄n(πnf0)<Kρn2)\displaystyle\mathbb{P}\left(\left\|{\hat{f}_{n}^{0}-\pi_{n}f_{0}}\right\|>M\rho_{n},\mathbb{M}_{n}(\hat{f}_{n}^{0})\geq\mathbb{M}_{n}(\pi_{n}f_{0})-K\rho_{n}^{2}\right)+\mathbb{P}\left(\mathbb{M}_{n}(\hat{f}_{n}^{0})-\mathbb{M}_{n}(\pi_{n}f_{0})<-K\rho_{n}^{2}\right)
\displaystyle\leq (supfn0fπnf0>Mρn𝕄n(f)𝕄n(πnf0)Kρn2)+ε\displaystyle\mathbb{P}\left(\sup_{\begin{subarray}{c}f\in\mathcal{F}_{n}^{0}\\ \left\|{f-\pi_{n}f_{0}}\right\|>M\rho_{n}\end{subarray}}\mathbb{M}_{n}(f)-\mathbb{M}_{n}(\pi_{n}f_{0})\geq-K\rho_{n}^{2}\right)+\varepsilon
\displaystyle\leq (supfnfπnf0>Mρn𝕄n(f)𝕄n(πnf0)Kρn2)+ε.\displaystyle\mathbb{P}\left(\sup_{\begin{subarray}{c}f\in\mathcal{F}_{n}\\ \left\|{f-\pi_{n}f_{0}}\right\|>M\rho_{n}\end{subarray}}\mathbb{M}_{n}(f)-\mathbb{M}_{n}(\pi_{n}f_{0})\geq-K\rho_{n}^{2}\right)+\varepsilon.

Note that

𝕄n(f)𝕄n(πnf0)=2ni=1nϵi(f(𝑿i)πnf0(𝑿i))+πnf0f0n2ff0n2.\mathbb{M}_{n}(f)-\mathbb{M}_{n}(\pi_{n}f_{0})=\frac{2}{n}\sum_{i=1}^{n}\epsilon_{i}\left(f(\boldsymbol{X}_{i})-\pi_{n}f_{0}(\boldsymbol{X}_{i})\right)+\left\|{\pi_{n}f_{0}-f_{0}}\right\|_{n}^{2}-\left\|{f-f_{0}}\right\|_{n}^{2}.

Let j,n={fn:2j1Mρnfπnf0<2jMρn}\mathcal{F}_{j,n}=\{f\in\mathcal{F}_{n}:2^{j-1}M\rho_{n}\leq\left\|{f-\pi_{n}f_{0}}\right\|<2^{j}M\rho_{n}\}. By a standard peeling argument, we have

(supfnfπnf0>Mρn𝕄n(f)𝕄n(πnf0)Kρn2)j=1(supfj,n𝕄n(f)𝕄n(πnf0)Kρn2)\mathbb{P}\left(\sup_{\begin{subarray}{c}f\in\mathcal{F}_{n}\\ \left\|{f-\pi_{n}f_{0}}\right\|>M\rho_{n}\end{subarray}}\mathbb{M}_{n}(f)-\mathbb{M}_{n}(\pi_{n}f_{0})\geq-K\rho_{n}^{2}\right)\leq\sum_{j=1}^{\infty}\mathbb{P}\left(\sup_{f\in\mathcal{F}_{j,n}}\mathbb{M}_{n}(f)-\mathbb{M}_{n}(\pi_{n}f_{0})\geq-K\rho_{n}^{2}\right).

Let M(f)=𝔼[𝕄n(f)]=ff02M(f)=\mathbb{E}[\mathbb{M}_{n}(f)]=-\left\|{f-f_{0}}\right\|^{2}. Then, on j,n\mathcal{F}_{j,n},

M(f)M(πnf0)22j2M2ρn2πnf0f02,M(f)-M(\pi_{n}f_{0})\lesssim-2^{2j-2}M^{2}\rho_{n}^{2}-\left\|{\pi_{n}f_{0}-f_{0}}\right\|^{2},

which implies that

supfj,n𝕄n(f)𝕄n(πnf0)\displaystyle\sup_{f\in\mathcal{F}_{j,n}}\mathbb{M}_{n}(f)-\mathbb{M}_{n}(\pi_{n}f_{0})
\displaystyle\leq supfj,n𝕄n(f)𝕄n(πnf0)(M(f)M(πnf0))+supfj,nM(f)M(πnf0)\displaystyle\sup_{f\in\mathcal{F}_{j,n}}\mathbb{M}_{n}(f)-\mathbb{M}_{n}(\pi_{n}f_{0})-(M(f)-M(\pi_{n}f_{0}))+\sup_{f\in\mathcal{F}_{j,n}}M(f)-M(\pi_{n}f_{0})
\displaystyle\lesssim supfj,n|2ni=1nϵi(f(𝑿i)πnf0(𝑿i))|+supfj,n|ff0n2ff02|\displaystyle\sup_{f\in\mathcal{F}_{j,n}}\left|{\frac{2}{n}\sum_{i=1}^{n}\epsilon_{i}(f(\boldsymbol{X}_{i})-\pi_{n}f_{0}(\boldsymbol{X}_{i}))}\right|+\sup_{f\in\mathcal{F}_{j,n}}\left|{\left\|{f-f_{0}}\right\|_{n}^{2}-\left\|{f-f_{0}}\right\|^{2}}\right|
+|πnf0f0n2πnf0f02|22j2M2ρn2.\displaystyle\qquad+\left|{\left\|{\pi_{n}f_{0}-f_{0}}\right\|_{n}^{2}-\left\|{\pi_{n}f_{0}-f_{0}}\right\|^{2}}\right|-2^{2j-2}M^{2}\rho_{n}^{2}.

Therefore

(supfj,n𝕄n(f)𝕄n(πnf0)Kρn2)P1+P2+P3,\displaystyle\mathbb{P}\left(\sup_{f\in\mathcal{F}_{j,n}}\mathbb{M}_{n}(f)-\mathbb{M}_{n}(\pi_{n}f_{0})\geq-K\rho_{n}^{2}\right)\leq P_{1}+P_{2}+P_{3},

where

P1\displaystyle P_{1} :=(supfj,n|1ni=1nϵi(f(𝑿i)πnf0(𝑿i))|(22j5M2K8)nρn2)\displaystyle:=\mathbb{P}\left(\sup_{f\in\mathcal{F}_{j,n}}\left|{\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\epsilon_{i}(f(\boldsymbol{X}_{i})-\pi_{n}f_{0}(\boldsymbol{X}_{i}))}\right|\geq\left(2^{2j-5}M^{2}-\frac{K}{8}\right)\sqrt{n}\rho_{n}^{2}\right)
P2\displaystyle P_{2} :=(supfj,nn|ff0n2ff02|(22j4M2K4)nρn2)\displaystyle:=\mathbb{P}\left(\sup_{f\in\mathcal{F}_{j,n}}\sqrt{n}\left|{\left\|{f-f_{0}}\right\|_{n}^{2}-\left\|{f-f_{0}}\right\|^{2}}\right|\geq\left(2^{2j-4}M^{2}-\frac{K}{4}\right)\sqrt{n}\rho_{n}^{2}\right)
P3\displaystyle P_{3} :=(|f0πnf0n2f0πnf02|(22j3M2K2)ρn2).\displaystyle:=\mathbb{P}\left(\left|{\left\|{f_{0}-\pi_{n}f_{0}}\right\|_{n}^{2}-\left\|{f_{0}-\pi_{n}f_{0}}\right\|^{2}}\right|\geq\left(2^{2j-3}M^{2}-\frac{K}{2}\right)\rho_{n}^{2}\right).

As f0πnf0ρn\left\|{f_{0}-\pi_{n}f_{0}}\right\|\lesssim\rho_{n}, by Markov’s inequality and the triangle inequality,

P3\displaystyle P_{3} [(22j3M2K2)ρn2]1𝔼[|f0πnf0n2f0πnf02|]\displaystyle\leq\left[\left(2^{2j-3}M^{2}-\frac{K}{2}\right)\rho_{n}^{2}\right]^{-1}\mathbb{E}\left[\left|{\left\|{f_{0}-\pi_{n}f_{0}}\right\|_{n}^{2}-\left\|{f_{0}-\pi_{n}f_{0}}\right\|^{2}}\right|\right]
[(22j3M2K2)ρn2]1.\displaystyle\lesssim\left[\left(2^{2j-3}M^{2}-\frac{K}{2}\right)\rho_{n}^{2}\right]^{-1}.

For P2P_{2}, it follows from Chebyshev’s inequality, the symmetrization inequality, the contraction principle, and the moment inequality (see Proposition 3.1 of Giné et al. (2000)) that

P2\displaystyle P_{2} [(22j4M2K4)nρn2]2𝔼[supfj,nn|ff0n2ff02|2]\displaystyle\leq\left[\left(2^{2j-4}M^{2}-\frac{K}{4}\right)\sqrt{n}\rho_{n}^{2}\right]^{-2}\mathbb{E}\left[\sup_{f\in\mathcal{F}_{j,n}}n\left|{\left\|{f-f_{0}}\right\|_{n}^{2}-\left\|{f-f_{0}}\right\|^{2}}\right|^{2}\right]
[(22j4M2K4)nρn2]2𝔼[supfj,n|1ni=1nξi(f(𝑿i)f0(𝑿i))2|2]\displaystyle\lesssim\left[\left(2^{2j-4}M^{2}-\frac{K}{4}\right)\sqrt{n}\rho_{n}^{2}\right]^{-2}\mathbb{E}\left[\sup_{f\in\mathcal{F}_{j,n}}\left|{\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\xi_{i}(f(\boldsymbol{X}_{i})-f_{0}(\boldsymbol{X}_{i}))^{2}}\right|^{2}\right]
[(22j4M2K4)nρn2]2{(𝔼[supfj,n|1ni=1nξi(f(𝑿i)f0(𝑿i))|])2+22jM2ρn2+n1}\displaystyle\lesssim\left[\left(2^{2j-4}M^{2}-\frac{K}{4}\right)\sqrt{n}\rho_{n}^{2}\right]^{-2}\left\{\left(\mathbb{E}\left[\sup_{f\in\mathcal{F}_{j,n}}\left|{\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\xi_{i}(f(\boldsymbol{X}_{i})-f_{0}(\boldsymbol{X}_{i}))}\right|\right]\right)^{2}+2^{2j}M^{2}\rho_{n}^{2}+n^{-1}\right\}
[(22j4M2K4)nρn2]2{22jM2ρn2+22jM2ρn2+n1},\displaystyle\lesssim\left[\left(2^{2j-4}M^{2}-\frac{K}{4}\right)\sqrt{n}\rho_{n}^{2}\right]^{-2}\left\{2^{2j}M^{2}\rho_{n}^{2}+2^{2j}M^{2}\rho_{n}^{2}+n^{-1}\right\},

where ξ1,,ξn\xi_{1},\ldots,\xi_{n} are i.i.d. Rademacher random variables independent of 𝑿1,,𝑿n\boldsymbol{X}_{1},\ldots,\boldsymbol{X}_{n}. Similarly, we have

P1\displaystyle P_{1} [(22j5M2K8)nρn2]2𝔼[supfj,n|1ni=1nϵi(f(𝑿i)πnf0(𝑿i))|2]\displaystyle\leq\left[\left(2^{2j-5}M^{2}-\frac{K}{8}\right)\sqrt{n}\rho_{n}^{2}\right]^{-2}\mathbb{E}\left[\sup_{f\in\mathcal{F}_{j,n}}\left|{\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\epsilon_{i}\left(f(\boldsymbol{X}_{i})-\pi_{n}f_{0}(\boldsymbol{X}_{i})\right)}\right|^{2}\right]
[(22j5M2K8)nρn2]2{22jM2ρn2+ϵ12222jM2ρn2+n1𝔼[max1in|ϵi|2]}\displaystyle\lesssim\left[\left(2^{2j-5}M^{2}-\frac{K}{8}\right)\sqrt{n}\rho_{n}^{2}\right]^{-2}\left\{2^{2j}M^{2}\rho_{n}^{2}+\left\|{\epsilon_{1}}\right\|_{2}^{2}2^{2j}M^{2}\rho_{n}^{2}+n^{-1}\mathbb{E}\left[\max_{1\leq i\leq n}|\epsilon_{i}|^{2}\right]\right\}
[(22j5M2K8)nρn2]2{22jM2ρn2+ϵ12222jM2ρn2+ϵ1p2n1+2/p}.\displaystyle\lesssim\left[\left(2^{2j-5}M^{2}-\frac{K}{8}\right)\sqrt{n}\rho_{n}^{2}\right]^{-2}\left\{2^{2j}M^{2}\rho_{n}^{2}+\left\|{\epsilon_{1}}\right\|_{2}^{2}2^{2j}M^{2}\rho_{n}^{2}+\left\|{\epsilon_{1}}\right\|_{p}^{2}n^{-1+2/p}\right\}.

Then

P1+P2\displaystyle P_{1}+P_{2} [(22j5M2K8)nρn2]2{22jM2ρn2+(1ϵ12)222jM2ρn2+(1ϵ1p)2n1+2/p}\displaystyle\lesssim\left[\left(2^{2j-5}M^{2}-\frac{K}{8}\right)\sqrt{n}\rho_{n}^{2}\right]^{-2}\left\{2^{2j}M^{2}\rho_{n}^{2}+(1\vee\left\|{\epsilon_{1}}\right\|_{2})^{2}2^{2j}M^{2}\rho_{n}^{2}+(1\vee\left\|{\epsilon_{1}}\right\|_{p})^{2}n^{-1+2/p}\right\}
Cϵ[(2jM22j5M2K8)2ρn2nρn4+1(22j5M2K8)2n22/pρn4].\displaystyle\lesssim C_{\epsilon}\left[\left(\frac{2^{j}M}{2^{2j-5}M^{2}-\frac{K}{8}}\right)^{2}\frac{\rho_{n}^{2}}{n\rho_{n}^{4}}+\frac{1}{\left(2^{2j-5}M^{2}-\frac{K}{8}\right)^{2}n^{2-2/p}\rho_{n}^{4}}\right].

Under (C2), we have ρnn12+12p\rho_{n}\gtrsim n^{-\frac{1}{2}+\frac{1}{2p}}, which implies that nρn2n1/pn\rho_{n}^{2}\gtrsim n^{1/p}. Moreover, for M2KM\geq\sqrt{2K}, we have 22j5M2K822j5M222j6M2=22j6M22^{2j-5}M^{2}-\frac{K}{8}\geq 2^{2j-5}M^{2}-2^{2j-6}M^{2}=2^{2j-6}M^{2}. Therefore, we obtain

j=1P1+P2+P3\displaystyle\sum_{j=1}^{\infty}P_{1}+P_{2}+P_{3} Cϵj=1[(2jM22j6M2)21nρn2+1(22j6M2)2+122j4M2]\displaystyle\lesssim C_{\epsilon}\sum_{j=1}^{\infty}\left[\left(\frac{2^{j}M}{2^{2j-6}M^{2}}\right)^{2}\frac{1}{n\rho_{n}^{2}}+\frac{1}{\left(2^{2j-6}M^{2}\right)^{2}}+\frac{1}{2^{2j-4}M^{2}}\right]
CϵM2,\displaystyle\lesssim\frac{C_{\epsilon}}{M^{2}},

which implies that, for a sufficiently large MM,

(supfnfπnf0>Mρn𝕄n(f)𝕄n(πnf0)Kρn2)<ε.\mathbb{P}\left(\sup_{\begin{subarray}{c}f\in\mathcal{F}_{n}\\ \left\|{f-\pi_{n}f_{0}}\right\|>M\rho_{n}\end{subarray}}\mathbb{M}_{n}(f)-\mathbb{M}_{n}(\pi_{n}f_{0})\geq-K\rho_{n}^{2}\right)<\varepsilon.

Therefore,

supn(f^n0πnf0>Mρn)<2ε,\sup_{n}\mathbb{P}\left(\left\|{\hat{f}_{n}^{0}-\pi_{n}f_{0}}\right\|>M\rho_{n}\right)<2\varepsilon,

which implies f^n0πnf0=𝒪p(ρn)\left\|{\hat{f}_{n}^{0}-\pi_{n}f_{0}}\right\|=\mathcal{O}_{p}(\rho_{n}). ∎

Lemma 13 (Local Approximation).

Suppose that unρnω=o(n1/2)u_{n}\rho_{n}^{\omega}=o(n^{-1/2}). Then, under (C1)-(C3) and H0H_{0},

n(f^n)n(f^n0)=f^nf~n(f^n)n2+op(n1),\mathbb{Q}_{n}(\hat{f}_{n})-\mathbb{Q}_{n}(\hat{f}_{n}^{0})=-\left\|{\hat{f}_{n}-\tilde{f}_{n}(\hat{f}_{n})}\right\|_{n}^{2}+o_{p}(n^{-1}),

and

n(πnf~n(f^n))n(f^n0)+op(n1).\mathbb{Q}_{n}(\pi_{n}\tilde{f}_{n}(\hat{f}_{n}))\leq\mathbb{Q}_{n}(\hat{f}_{n}^{0})+o_{p}(n^{-1}).
Proof.

First, note that for u=v/v2u^{*}=v^{*}/\left\|{v^{*}}\right\|^{2},

n(f^n)n(πnf~n(f^n))\displaystyle\mathbb{Q}_{n}(\hat{f}_{n})-\mathbb{Q}_{n}(\pi_{n}\tilde{f}_{n}(\hat{f}_{n}))
=\displaystyle= 2n1i=1nϵi(f^n(𝑿i)πnf~n(f^n)(𝑿i))2f^nf0,πnf~n(f^n)f^nnπnf~n(f^n)f^nn2\displaystyle-2n^{-1}\sum_{i=1}^{n}\epsilon_{i}\left(\hat{f}_{n}(\boldsymbol{X}_{i})-\pi_{n}\tilde{f}_{n}(\hat{f}_{n})(\boldsymbol{X}_{i})\right)-2\left\langle{\hat{f}_{n}-f_{0}},{\pi_{n}\tilde{f}_{n}(\hat{f}_{n})-\hat{f}_{n}}\right\rangle_{n}-\left\|{\pi_{n}\tilde{f}_{n}(\hat{f}_{n})-\hat{f}_{n}}\right\|_{n}^{2}
=\displaystyle= 2n1i=1nϵi(f^n(𝑿i)f~n(f^n)(𝑿i))+2f^nf0,f^nf~n(f^n)n\displaystyle-2n^{-1}\sum_{i=1}^{n}\epsilon_{i}\left(\hat{f}_{n}(\boldsymbol{X}_{i})-\tilde{f}_{n}(\hat{f}_{n})(\boldsymbol{X}_{i})\right)+2\left\langle{\hat{f}_{n}-f_{0}},{\hat{f}_{n}-\tilde{f}_{n}(\hat{f}_{n})}\right\rangle_{n}
2n1i=1nϵi(f~n(f^n)(𝑿i)πnf~n(f^n)(𝑿i))+2f^nf0,f~n(f^n)πnf~n(f^n)nπnf~n(f^n)f^nn2.\displaystyle-2n^{-1}\sum_{i=1}^{n}\epsilon_{i}\left(\tilde{f}_{n}(\hat{f}_{n})(\boldsymbol{X}_{i})-\pi_{n}\tilde{f}_{n}(\hat{f}_{n})(\boldsymbol{X}_{i})\right)+2\left\langle{\hat{f}_{n}-f_{0}},{\tilde{f}_{n}(\hat{f}_{n})-\pi_{n}\tilde{f}_{n}(\hat{f}_{n})}\right\rangle_{n}-\left\|{\pi_{n}\tilde{f}_{n}(\hat{f}_{n})-\hat{f}_{n}}\right\|_{n}^{2}.

Under (C3), it is clear that

2n1i=1nϵi(f~n(f^n)(𝑿i)πnf~n(f^n)(𝑿i))\displaystyle 2n^{-1}\sum_{i=1}^{n}\epsilon_{i}\left(\tilde{f}_{n}(\hat{f}_{n})(\boldsymbol{X}_{i})-\pi_{n}\tilde{f}_{n}(\hat{f}_{n})(\boldsymbol{X}_{i})\right) =op(δn2)\displaystyle=o_{p}(\delta_{n}^{2})
2f^nf0,f~n(f^n)πnf~n(f^n)n\displaystyle 2\left\langle{\hat{f}_{n}-f_{0}},{\tilde{f}_{n}(\hat{f}_{n})-\pi_{n}\tilde{f}_{n}(\hat{f}_{n})}\right\rangle_{n} 2f^nf0nf~n(f^n)πnf~n(f^n)n\displaystyle\leq 2\left\|{\hat{f}_{n}-f_{0}}\right\|_{n}\left\|{\tilde{f}_{n}(\hat{f}_{n})-\pi_{n}\tilde{f}_{n}(\hat{f}_{n})}\right\|_{n}
=𝒪p(ρn)op(ρn1δn2)=op(δn2).\displaystyle=\mathcal{O}_{p}(\rho_{n})\cdot o_{p}(\rho_{n}^{-1}\delta_{n}^{2})=o_{p}(\delta_{n}^{2}).

On the other hand, since f^nf~n(f^n)=δnv/v2\hat{f}_{n}-\tilde{f}_{n}(\hat{f}_{n})=-\delta_{n}v^{*}/\left\|{v^{*}}\right\|^{2}, we have

n(f^n)n(πnf~n(f^n))\displaystyle\mathbb{Q}_{n}(\hat{f}_{n})-\mathbb{Q}_{n}(\pi_{n}\tilde{f}_{n}(\hat{f}_{n}))
=\displaystyle= 2δnv2n1i=1nϵiv(𝑿i)2δnv2f^nf0,vnπnf~n(f^n)f^nn2+op(δn2)\displaystyle 2\delta_{n}\left\|{v^{*}}\right\|^{-2}n^{-1}\sum_{i=1}^{n}\epsilon_{i}v^{*}(\boldsymbol{X}_{i})-2\delta_{n}\left\|{v^{*}}\right\|^{-2}\left\langle{\hat{f}_{n}-f_{0}},{v^{*}}\right\rangle_{n}-\left\|{\pi_{n}\tilde{f}_{n}(\hat{f}_{n})-\hat{f}_{n}}\right\|_{n}^{2}+o_{p}(\delta_{n}^{2})
=\displaystyle= πnf~n(f^n)f^nn2+op(δnn1/2)+op(δn2),\displaystyle-\left\|{\pi_{n}\tilde{f}_{n}(\hat{f}_{n})-\hat{f}_{n}}\right\|_{n}^{2}+o_{p}(\delta_{n}n^{-1/2})+o_{p}(\delta_{n}^{2}),

where the last equality follows from Theorem 1. For a sufficiently large nn, we have

πnf~n(f^n)f^nn2\displaystyle\left\|{\pi_{n}\tilde{f}_{n}(\hat{f}_{n})-\hat{f}_{n}}\right\|_{n}^{2} =πnf~n(f^n)f~n(f^n)n2+f~n(f^n)f^nn2+2πnf~n(f^n)f~n(f^n),f~n(f^n)f^nn\displaystyle=\left\|{\pi_{n}\tilde{f}_{n}(\hat{f}_{n})-\tilde{f}_{n}(\hat{f}_{n})}\right\|_{n}^{2}+\left\|{\tilde{f}_{n}(\hat{f}_{n})-\hat{f}_{n}}\right\|_{n}^{2}+2\left\langle{\pi_{n}\tilde{f}_{n}(\hat{f}_{n})-\tilde{f}_{n}(\hat{f}_{n})},{\tilde{f}_{n}(\hat{f}_{n})-\hat{f}_{n}}\right\rangle_{n}
=op(ρn2δn4)+f^nf~n(f^n)n2+op(ρn1δn3)\displaystyle=o_{p}(\rho_{n}^{-2}\delta_{n}^{4})+\left\|{\hat{f}_{n}-\tilde{f}_{n}(\hat{f}_{n})}\right\|_{n}^{2}+o_{p}(\rho_{n}^{-1}\delta_{n}^{3})
=f^nf~n(f^n)n2+op(δn2),\displaystyle=\left\|{\hat{f}_{n}-\tilde{f}_{n}(\hat{f}_{n})}\right\|_{n}^{2}+o_{p}(\delta_{n}^{2}),

which implies that

n(f^n)n(πnf~n(f^n))=f^nf~n(f^n)n2+op(n1).\mathbb{Q}_{n}(\hat{f}_{n})-\mathbb{Q}_{n}(\pi_{n}\tilde{f}_{n}(\hat{f}_{n}))=-\left\|{\hat{f}_{n}-\tilde{f}_{n}(\hat{f}_{n})}\right\|_{n}^{2}+o_{p}(n^{-1}). (12)

By replacing f^n\hat{f}_{n} with f^n0\hat{f}_{n}^{0} and considering u=v/v2u^{*}=-v^{*}/\left\|{v^{*}}\right\|^{2}, we have

n(f^n0)n(πnf~n(f^n0))\displaystyle\mathbb{Q}_{n}(\hat{f}_{n}^{0})-\mathbb{Q}_{n}(\pi_{n}\tilde{f}_{n}(\hat{f}_{n}^{0}))
=\displaystyle= 2n1i=1nϵi(f^n0(𝑿i)f~n(f^n0)(𝑿i))+2f^n0f0,f^n0πnf~n(f^n0)n\displaystyle-2n^{-1}\sum_{i=1}^{n}\epsilon_{i}\left(\hat{f}_{n}^{0}(\boldsymbol{X}_{i})-\tilde{f}_{n}(\hat{f}_{n}^{0})(\boldsymbol{X}_{i})\right)+2\left\langle{\hat{f}_{n}^{0}-f_{0}},{\hat{f}_{n}^{0}-\pi_{n}\tilde{f}_{n}(\hat{f}_{n}^{0})}\right\rangle_{n}
2n1i=1nϵi(f~n(f^n0)(𝑿i)πnf~n(f^n0)(𝑿i))πnf~n(f^n0)f^n0n2.\displaystyle-2n^{-1}\sum_{i=1}^{n}\epsilon_{i}\left(\tilde{f}_{n}(\hat{f}_{n}^{0})(\boldsymbol{X}_{i})-\pi_{n}\tilde{f}_{n}(\hat{f}_{n}^{0})(\boldsymbol{X}_{i})\right)-\left\|{\pi_{n}\tilde{f}_{n}(\hat{f}_{n}^{0})-\hat{f}_{n}^{0}}\right\|_{n}^{2}.
=\displaystyle= 2δnv2n1i=1nϵiv(𝑿i)+2f^n0f^n,f^n0f~n(f^n0)n+2f^nf0,f^n0f~n(f^n0)n\displaystyle-2\delta_{n}\left\|{v^{*}}\right\|^{-2}n^{-1}\sum_{i=1}^{n}\epsilon_{i}v^{*}(\boldsymbol{X}_{i})+2\left\langle{\hat{f}_{n}^{0}-\hat{f}_{n}},{\hat{f}_{n}^{0}-\tilde{f}_{n}(\hat{f}_{n}^{0})}\right\rangle_{n}+2\left\langle{\hat{f}_{n}-f_{0}},{\hat{f}_{n}^{0}-\tilde{f}_{n}(\hat{f}_{n}^{0})}\right\rangle_{n}
+2f^n0f0,f~n(f^n0)πnf~n(f^n0)nπnf~n(f^n0)f^n0n2+op(δn2)\displaystyle+2\left\langle{\hat{f}_{n}^{0}-f_{0}},{\tilde{f}_{n}(\hat{f}_{n}^{0})-\pi_{n}\tilde{f}_{n}(\hat{f}_{n}^{0})}\right\rangle_{n}-\left\|{\pi_{n}\tilde{f}_{n}(\hat{f}_{n}^{0})-\hat{f}_{n}^{0}}\right\|_{n}^{2}+o_{p}(\delta_{n}^{2})
=\displaystyle= 2δnv2[n1i=1nϵiv(𝑿i)f^nf0,vn]+2f^n0f^n,f^n0f~n(f^n0)n\displaystyle-2\delta_{n}\left\|{v^{*}}\right\|^{-2}\left[n^{-1}\sum_{i=1}^{n}\epsilon_{i}v^{*}(\boldsymbol{X}_{i})-\left\langle{\hat{f}_{n}-f_{0}},{v^{*}}\right\rangle_{n}\right]+2\left\langle{\hat{f}_{n}^{0}-\hat{f}_{n}},{\hat{f}_{n}^{0}-\tilde{f}_{n}(\hat{f}_{n}^{0})}\right\rangle_{n}
πnf~n(f^n0)f^n0n2+op(δn2)\displaystyle-\left\|{\pi_{n}\tilde{f}_{n}(\hat{f}_{n}^{0})-\hat{f}_{n}^{0}}\right\|_{n}^{2}+o_{p}(\delta_{n}^{2})
=\displaystyle= 2f^n0f^n,f^n0f~n(f^n0)nπnf~n(f^n0)f^n0n2+op(n1)\displaystyle 2\left\langle{\hat{f}_{n}^{0}-\hat{f}_{n}},{\hat{f}_{n}^{0}-\tilde{f}_{n}(\hat{f}_{n}^{0})}\right\rangle_{n}-\left\|{\pi_{n}\tilde{f}_{n}(\hat{f}_{n}^{0})-\hat{f}_{n}^{0}}\right\|_{n}^{2}+o_{p}(n^{-1})
=\displaystyle= 2f^n0f^n,f^n0f~n(f^n0)nf^n0f~n(f^n0)n2+op(n1).\displaystyle 2\left\langle{\hat{f}_{n}^{0}-\hat{f}_{n}},{\hat{f}_{n}^{0}-\tilde{f}_{n}(\hat{f}_{n}^{0})}\right\rangle_{n}-\left\|{\hat{f}_{n}^{0}-\tilde{f}_{n}(\hat{f}_{n}^{0})}\right\|_{n}^{2}+o_{p}(n^{-1}).

For any f{fn0:ff0ρn}f\in\{f\in\mathcal{F}_{n}^{0}:\left\|{f-f_{0}}\right\|\leq\rho_{n}\}, under H0H_{0}, we have

0\displaystyle 0 =ϕ(f)ϕ(f0)=ϕf0[ff0]+𝒪(unff0ω)\displaystyle=\phi(f)-\phi(f_{0})=\phi_{f_{0}}^{\prime}[f-f_{0}]+\mathcal{O}(u_{n}\left\|{f-f_{0}}\right\|^{\omega})
=ff0,v+o(n1/2),\displaystyle=\left\langle{f-f_{0}},{v^{*}}\right\rangle+o(n^{-1/2}),

which implies that f^n0f^n,vn=op(n1/2)\left\langle{\hat{f}_{n}^{0}-\hat{f}_{n}},{v^{*}}\right\rangle_{n}=o_{p}(n^{-1/2}). Moreover, since f^n0f~n(f^n0)=f~n(f^n)f^n\hat{f}_{n}^{0}-\tilde{f}_{n}(\hat{f}_{n}^{0})=\tilde{f}_{n}(\hat{f}_{n})-\hat{f}_{n}, we have

f^n0f^n,f^n0f~n(f^n0)n\displaystyle\left\langle{\hat{f}_{n}^{0}-\hat{f}_{n}},{\hat{f}_{n}^{0}-\tilde{f}_{n}(\hat{f}_{n}^{0})}\right\rangle_{n} =f^n0f^n,f~n(f^n)f^nn\displaystyle=\left\langle{\hat{f}_{n}^{0}-\hat{f}_{n}},{\tilde{f}_{n}(\hat{f}_{n})-\hat{f}_{n}}\right\rangle_{n}
=f^n0f~n(f^n)+f~n(f^n)f^n,f~n(f^n)f^nn\displaystyle=\left\langle{\hat{f}_{n}^{0}-\tilde{f}_{n}(\hat{f}_{n})+\tilde{f}_{n}(\hat{f}_{n})-\hat{f}_{n}},{\tilde{f}_{n}(\hat{f}_{n})-\hat{f}_{n}}\right\rangle_{n}
=f^n0f~n(f^n),f~n(f^n)f^nnf~n(f^n)f^n,f~n(f^n)f^nn\displaystyle=\left\langle{\hat{f}_{n}^{0}-\tilde{f}_{n}(\hat{f}_{n})},{\tilde{f}_{n}(\hat{f}_{n})-\hat{f}_{n}}\right\rangle_{n}-\left\langle{\tilde{f}_{n}(\hat{f}_{n})-\hat{f}_{n}},{\tilde{f}_{n}(\hat{f}_{n})-\hat{f}_{n}}\right\rangle_{n}
=f^n0f^nδnv/v2,δnv/v2n+f^nπnf~n(f^n)n2\displaystyle=\left\langle{\hat{f}_{n}^{0}-\hat{f}_{n}-\delta_{n}v^{*}/\left\|{v^{*}}\right\|^{2}},{\delta_{n}v^{*}/\left\|{v^{*}}\right\|^{2}}\right\rangle_{n}+\left\|{\hat{f}_{n}-\pi_{n}\tilde{f}_{n}(\hat{f}_{n})}\right\|_{n}^{2}
=δnv2f^n0f^n,vn+f^nπnf~n(f^n)n2δn2vn2/v4\displaystyle=\delta_{n}\left\|{v^{*}}\right\|^{-2}\left\langle{\hat{f}_{n}^{0}-\hat{f}_{n}},{v^{*}}\right\rangle_{n}+\left\|{\hat{f}_{n}-\pi_{n}\tilde{f}_{n}(\hat{f}_{n})}\right\|_{n}^{2}-\delta_{n}^{2}\left\|{v^{*}}\right\|_{n}^{2}/\left\|{v^{*}}\right\|^{4}
=δnv2f^n0f^n,v+f^nπnf~n(f^n)n2\displaystyle=\delta_{n}\left\|{v^{*}}\right\|^{-2}\left\langle{\hat{f}_{n}^{0}-\hat{f}_{n}},{v^{*}}\right\rangle+\left\|{\hat{f}_{n}-\pi_{n}\tilde{f}_{n}(\hat{f}_{n})}\right\|_{n}^{2}
δn2v2+op(δnn1/2)+op(δn2),\displaystyle\quad-\delta_{n}^{2}\left\|{v^{*}}\right\|^{-2}+o_{p}(\delta_{n}n^{-1/2})+o_{p}(\delta_{n}^{2}),

where the last equality follows from Lemma 22 and vn2=v2+op(1)\left\|{v^{*}}\right\|_{n}^{2}=\left\|{v^{*}}\right\|^{2}+o_{p}(1) by the weak law of large numbers. Now, since δn=f^nf0,v\delta_{n}=-\left\langle{\hat{f}_{n}-f_{0}},{v^{*}}\right\rangle,

δnv2f^n0f^n,vδn2v2\displaystyle\delta_{n}\left\|{v^{*}}\right\|^{-2}\left\langle{\hat{f}_{n}^{0}-\hat{f}_{n}},{v^{*}}\right\rangle-\delta_{n}^{2}\left\|{v^{*}}\right\|^{-2} =δnv2(f^n0f^n,vδn)\displaystyle=\delta_{n}\left\|{v^{*}}\right\|^{-2}\left(\left\langle{\hat{f}_{n}^{0}-\hat{f}_{n}},{v^{*}}\right\rangle-\delta_{n}\right)
=δnv2(f^n0f^n,v+f^nf0,v)\displaystyle=\delta_{n}\left\|{v^{*}}\right\|^{-2}\left(\left\langle{\hat{f}_{n}^{0}-\hat{f}_{n}},{v^{*}}\right\rangle+\left\langle{\hat{f}_{n}-f_{0}},{v^{*}}\right\rangle\right)
=δnv2f^n0f0,v\displaystyle=\delta_{n}\left\|{v^{*}}\right\|^{-2}\left\langle{\hat{f}_{n}^{0}-f_{0}},{v^{*}}\right\rangle
=op(δnn1/2),\displaystyle=o_{p}(\delta_{n}n^{-1/2}),

and then

f^n0f^n,f^n0f~n(f^n0)n=f^nπnf~n(f^n)n2+op(n1).\left\langle{\hat{f}_{n}^{0}-\hat{f}_{n}},{\hat{f}_{n}^{0}-\tilde{f}_{n}(\hat{f}_{n}^{0})}\right\rangle_{n}=\left\|{\hat{f}_{n}-\pi_{n}\tilde{f}_{n}(\hat{f}_{n})}\right\|_{n}^{2}+o_{p}(n^{-1}).

Therefore,

n(f^n0)n(πnf~n(f^n0))=f^nπnf~n(f^n)n2+op(n1).\mathbb{Q}_{n}(\hat{f}_{n}^{0})-\mathbb{Q}_{n}(\pi_{n}\tilde{f}_{n}(\hat{f}_{n}^{0}))=\left\|{\hat{f}_{n}-\pi_{n}\tilde{f}_{n}(\hat{f}_{n})}\right\|_{n}^{2}+o_{p}(n^{-1}). (13)

From (12) and (13), we obtain

n(f^n)n(f^n0)\displaystyle\mathbb{Q}_{n}(\hat{f}_{n})-\mathbb{Q}_{n}(\hat{f}_{n}^{0}) inffnn(f)n(f^n0)+op(n1)\displaystyle\leq\inf_{f\in\mathcal{F}_{n}}\mathbb{Q}_{n}(f)-\mathbb{Q}_{n}(\hat{f}_{n}^{0})+o_{p}(n^{-1})
n(πnf~n(f^n0))n(f^n0)+op(n1)\displaystyle\leq\mathbb{Q}_{n}(\pi_{n}\tilde{f}_{n}(\hat{f}_{n}^{0}))-\mathbb{Q}_{n}(\hat{f}_{n}^{0})+o_{p}(n^{-1})
=f^nf~n(f^n)n2+op(n1)\displaystyle=-\left\|{\hat{f}_{n}-\tilde{f}_{n}(\hat{f}_{n})}\right\|_{n}^{2}+o_{p}(n^{-1})
=n(f^n)n(πnf~n(f^n))+op(n1).\displaystyle=\mathbb{Q}_{n}(\hat{f}_{n})-\mathbb{Q}_{n}(\pi_{n}\tilde{f}_{n}(\hat{f}_{n}))+o_{p}(n^{-1}).

This proves the desired result. ∎

The problem here is that πnf~n(f^n)\pi_{n}\tilde{f}_{n}(\hat{f}_{n}) may not be in n0\mathcal{F}_{n}^{0}, so we need to construct an approximate minimizer having similar properties. Set

fn[t]=πnf~n(f^n)+tvv2,t.f_{n}^{*}[t]=\pi_{n}\tilde{f}_{n}(\hat{f}_{n})+t\frac{v^{*}}{\left\|{v^{*}}\right\|^{2}},\quad t\in\mathbb{R}.

Note that, for any |t|n1/2|t|\lesssim n^{-1/2} and a sufficiently large nn,

f~n(f^n)f0\displaystyle\left\|{\tilde{f}_{n}(\hat{f}_{n})-f_{0}}\right\| f^nf0+|δn|vv2\displaystyle\leq\left\|{\hat{f}_{n}-f_{0}}\right\|+\left|{\delta_{n}}\right|\left\|{\frac{v^{*}}{\left\|{v^{*}}\right\|^{2}}}\right\|
f^nf0+f^nf0vvv2\displaystyle\leq\left\|{\hat{f}_{n}-f_{0}}\right\|+\left\|{\hat{f}_{n}-f_{0}}\right\|\left\|{v^{*}}\right\|\frac{\left\|{v^{*}}\right\|}{\left\|{v^{*}}\right\|^{2}}
=𝒪p(ρn)\displaystyle=\mathcal{O}_{p}(\rho_{n})
fn[t]f0\displaystyle\left\|{f_{n}^{*}[t]-f_{0}}\right\| πnf~n(f^n)f0+|t|vv2\leq\left\|{\pi_{n}\tilde{f}_{n}(\hat{f}_{n})-f_{0}}\right\|+\left|{t}\right|\left\|{\frac{v^{*}}{\left\|{v^{*}}\right\|^{2}}}\right\|
=𝒪p(ρn)+𝒪p(n1/2)\displaystyle=\mathcal{O}_{p}(\rho_{n})+\mathcal{O}_{p}(n^{-1/2})
=𝒪p(ρn).\displaystyle=\mathcal{O}_{p}(\rho_{n}).

Under H0H_{0} and unρnω=o(n1/2)u_{n}\rho_{n}^{\omega}=o(n^{-1/2}), we have

ϕ(πnfn[t])\displaystyle\phi(\pi_{n}f_{n}^{*}[t]) =ϕ(πnfn[t])ϕ(f0)\displaystyle=\phi(\pi_{n}f_{n}^{*}[t])-\phi(f_{0})
=πnfn[t]f0,v+r(t)\displaystyle=\left\langle{\pi_{n}f_{n}^{*}[t]-f_{0}},{v^{*}}\right\rangle+r(t)
=πnfn[t]fn[t],v+πnf~n(f^n)f0,v+t+r(t).\displaystyle=\left\langle{\pi_{n}f_{n}^{*}[t]-f_{n}^{*}[t]},{v^{*}}\right\rangle+\left\langle{\pi_{n}\tilde{f}_{n}(\hat{f}_{n})-f_{0}},{v^{*}}\right\rangle+t+r(t). (14)

By the CrC_{r}-inequality, there exists a constant Cω>0C_{\omega}>0 such that

πnfn[t]f0ωCω(πnfn[t]fn[t]ω+πnfnf0ω+|t|ω).\left\|{\pi_{n}f_{n}^{*}[t]-f_{0}}\right\|^{\omega}\leq C_{\omega}\left(\left\|{\pi_{n}f_{n}^{*}[t]-f_{n}^{*}[t]}\right\|^{\omega}+\left\|{\pi_{n}f_{n}^{*}-f_{0}}\right\|^{\omega}+|t|^{\omega}\right).

Let Δn=supfnff0ρnf~n(f)πnf~n(f)\Delta_{n}=\sup_{\begin{subarray}{c}f\in\mathcal{F}_{n}\\ \left\|{f-f_{0}}\right\|\leq\rho_{n}\end{subarray}}\left\|{\tilde{f}_{n}(f)-\pi_{n}\tilde{f}_{n}(f)}\right\|, then

t~\displaystyle\tilde{t} =2vΔn+unCω(Δnω+πnfnf0ω+nω/2vω)\displaystyle=2\left\|{v^{*}}\right\|\Delta_{n}+u_{n}C_{\omega}\left(\Delta_{n}^{\omega}+\left\|{\pi_{n}f_{n}^{*}-f_{0}}\right\|^{\omega}+n^{-\omega/2}\left\|{v^{*}}\right\|^{\omega}\right)
=𝒪p(ρn1δn2)+𝒪p(un(ρn1δn2)ω)+𝒪p(unρnω)+𝒪p(unnω/2)\displaystyle=\mathcal{O}_{p}(\rho_{n}^{-1}\delta_{n}^{2})+\mathcal{O}_{p}(u_{n}(\rho_{n}^{-1}\delta_{n}^{2})^{\omega})+\mathcal{O}_{p}(u_{n}\rho_{n}^{\omega})+\mathcal{O}_{p}(u_{n}n^{-\omega/2})
=op(n1/2).\displaystyle=o_{p}(n^{-1/2}).

Therefore, by (14),

ϕ(πnfn[t~])\displaystyle\phi(\pi_{n}f_{n}^{*}[\tilde{t}]) t~(|πnfn[t~]fn[t~],v|+|πnf~n(f^n)f0,v|+r(t~))0\displaystyle\geq\tilde{t}-\left(\left|{\left\langle{\pi_{n}f_{n}^{*}[\tilde{t}]-f_{n}^{*}[\tilde{t}]},{v^{*}}\right\rangle}\right|+\left|{\left\langle{\pi_{n}\tilde{f}_{n}(\hat{f}_{n})-f_{0}},{v^{*}}\right\rangle}\right|+r(\tilde{t})\right)\geq 0
ϕ(πnfn[t~])\displaystyle\phi(\pi_{n}f_{n}^{*}[-\tilde{t}]) t~+(|πnfn[t~]fn[t~],v|+|πnf~n(f^n)f0,v|+r(t~))0.\displaystyle\leq-\tilde{t}+\left(\left|{\left\langle{\pi_{n}f_{n}^{*}[\tilde{t}]-f_{n}^{*}[\tilde{t}]},{v^{*}}\right\rangle}\right|+\left|{\left\langle{\pi_{n}\tilde{f}_{n}(\hat{f}_{n})-f_{0}},{v^{*}}\right\rangle}\right|+r(\tilde{t})\right)\leq 0.

Furthermore, by continuity of ϕ(πnfn[t])\phi(\pi_{n}f_{n}^{*}[t]) and the intermediate value theorem, there exists some tt^{*}\in\mathbb{R} such that ϕ(πnfn[t])=0\phi(\pi_{n}f_{n}^{*}[t^{*}])=0 and |t|=op(n1/2)|t^{*}|=o_{p}(n^{-1/2}). This implies that πnfn[t]n0\pi_{n}f_{n}^{*}[t^{*}]\in\mathcal{F}_{n}^{0}. Clearly, πnfn[t]f0ρn\left\|{\pi_{n}f_{n}^{*}[t^{*}]-f_{0}}\right\|\leq\rho_{n} for a sufficiently large nn.

Lemma 14.

Under (C1)-(C3), we have

n(πnf~n(f^n))n(πnfn[t])=op(n1).\mathbb{Q}_{n}(\pi_{n}\tilde{f}_{n}(\hat{f}_{n}))-\mathbb{Q}_{n}(\pi_{n}f_{n}^{*}[t^{*}])=o_{p}(n^{-1}).
Proof.

Note that

n(πnf~n(f^n))n(πnfn[t])\displaystyle\mathbb{Q}_{n}(\pi_{n}\tilde{f}_{n}(\hat{f}_{n}))-\mathbb{Q}_{n}(\pi_{n}f_{n}^{*}[t^{*}])
=\displaystyle= 2n1i=1nϵi(πnf~n(f^n)(𝑿i)πnfn[t](𝑿i))πnfn[t]f0n2+πnf~n(f^n)f0n2\displaystyle-2n^{-1}\sum_{i=1}^{n}\epsilon_{i}\left(\pi_{n}\tilde{f}_{n}(\hat{f}_{n})(\boldsymbol{X}_{i})-\pi_{n}f_{n}^{*}[t^{*}](\boldsymbol{X}_{i})\right)-\left\|{\pi_{n}f_{n}^{*}[t^{*}]-f_{0}}\right\|_{n}^{2}+\left\|{\pi_{n}\tilde{f}_{n}(\hat{f}_{n})-f_{0}}\right\|_{n}^{2}
=\displaystyle= 2n1i=1nϵi(πnf~n(f^n)(𝑿i)πnfn[t](𝑿i))2πnf~n(f^n)f0,πnfn[t]f~n(f^n)n\displaystyle-2n^{-1}\sum_{i=1}^{n}\epsilon_{i}\left(\pi_{n}\tilde{f}_{n}(\hat{f}_{n})(\boldsymbol{X}_{i})-\pi_{n}f_{n}^{*}[t^{*}](\boldsymbol{X}_{i})\right)-2\left\langle{\pi_{n}\tilde{f}_{n}(\hat{f}_{n})-f_{0}},{\pi_{n}f_{n}^{*}[t^{*}]-\tilde{f}_{n}(\hat{f}_{n})}\right\rangle_{n}
πnfn[t]πnf~n(f^n)n2.\displaystyle-\left\|{\pi_{n}f_{n}^{*}[t^{*}]-\pi_{n}\tilde{f}_{n}(\hat{f}_{n})}\right\|_{n}^{2}.

Moreover, since

πnfn[t]πnf~n(f^n)n\displaystyle\left\|{\pi_{n}f_{n}^{*}[t^{*}]-\pi_{n}\tilde{f}_{n}(\hat{f}_{n})}\right\|_{n} πnfn[t]fn[t]n+fn[t]πnf~n(f^n)n\displaystyle\leq\left\|{\pi_{n}f_{n}^{*}[t^{*}]-f_{n}^{*}[t^{*}]}\right\|_{n}+\left\|{f_{n}^{*}[t^{*}]-\pi_{n}\tilde{f}_{n}(\hat{f}_{n})}\right\|_{n}
πnfn[t]fn[t]n+|t|vnv2\displaystyle\leq\left\|{\pi_{n}f_{n}^{*}[t^{*}]-f_{n}^{*}[t^{*}]}\right\|_{n}+|t^{*}|\frac{\left\|{v^{*}}\right\|_{n}}{\left\|{v^{*}}\right\|^{2}}
=𝒪p(ρn1δn2)+op(n1/2)=op(n1/2),\displaystyle=\mathcal{O}_{p}(\rho_{n}^{-1}\delta_{n}^{2})+o_{p}(n^{-1/2})=o_{p}(n^{-1/2}),

it follows from the Cauchy–Schwarz inequality that

πnf~n(f^n)f0,πnfn[t]f~n(f^n)n\displaystyle\left\langle{\pi_{n}\tilde{f}_{n}(\hat{f}_{n})-f_{0}},{\pi_{n}f_{n}^{*}[t^{*}]-\tilde{f}_{n}(\hat{f}_{n})}\right\rangle_{n} =πnf~n(f^n)f~n(f^n),πnfn[t]f~n(f^n)n\displaystyle=\left\langle{\pi_{n}\tilde{f}_{n}(\hat{f}_{n})-\tilde{f}_{n}(\hat{f}_{n})},{\pi_{n}f_{n}^{*}[t^{*}]-\tilde{f}_{n}(\hat{f}_{n})}\right\rangle_{n}
+f~n(f^n)f0,πnfn[t]f~n(f^n)n\displaystyle\quad+\left\langle{\tilde{f}_{n}(\hat{f}_{n})-f_{0}},{\pi_{n}f_{n}^{*}[t^{*}]-\tilde{f}_{n}(\hat{f}_{n})}\right\rangle_{n}
πnf~n(f^n)f~n(f^n)nπnfn[t]πnf~n(f^n)n\displaystyle\leq\left\|{\pi_{n}\tilde{f}_{n}(\hat{f}_{n})-\tilde{f}_{n}(\hat{f}_{n})}\right\|_{n}\left\|{\pi_{n}f_{n}^{*}[t^{*}]-\pi_{n}\tilde{f}_{n}(\hat{f}_{n})}\right\|_{n}
+f~n(f^n)f0,πnfn[t]f~n(f^n)n\displaystyle\quad+\left\langle{\tilde{f}_{n}(\hat{f}_{n})-f_{0}},{\pi_{n}f_{n}^{*}[t^{*}]-\tilde{f}_{n}(\hat{f}_{n})}\right\rangle_{n}
=op(n1)+f~n(f^n)f0,πnfn[t]f~n(f^n)n.\displaystyle=o_{p}(n^{-1})+\left\langle{\tilde{f}_{n}(\hat{f}_{n})-f_{0}},{\pi_{n}f_{n}^{*}[t^{*}]-\tilde{f}_{n}(\hat{f}_{n})}\right\rangle_{n}.

On the other hand, note that

f~n(f^n)f0,πnfn[t]πnf~n(f^n)n\displaystyle\left\langle{\tilde{f}_{n}(\hat{f}_{n})-f_{0}},{\pi_{n}f_{n}^{*}[t^{*}]-\pi_{n}\tilde{f}_{n}(\hat{f}_{n})}\right\rangle_{n} =f~n(f^n)f^n,πnfn[t]πnf~n(f^n)n+f^nf0,πnfn[t]fn[t]n\displaystyle=\left\langle{\tilde{f}_{n}(\hat{f}_{n})-\hat{f}_{n}},{\pi_{n}f_{n}^{*}[t^{*}]-\pi_{n}\tilde{f}_{n}(\hat{f}_{n})}\right\rangle_{n}+\left\langle{\hat{f}_{n}-f_{0}},{\pi_{n}f_{n}^{*}[t^{*}]-f_{n}^{*}[t^{*}]}\right\rangle_{n}
+f^nf0,fn[t]πnf~n(f^n)n\displaystyle\quad+\left\langle{\hat{f}_{n}-f_{0}},{f_{n}^{*}[t^{*}]-\pi_{n}\tilde{f}_{n}(\hat{f}_{n})}\right\rangle_{n}
f~n(f^n)f^nnπnfn[t]πnf~n(f^n)n\displaystyle\leq\left\|{\tilde{f}_{n}(\hat{f}_{n})-\hat{f}_{n}}\right\|_{n}\left\|{\pi_{n}f_{n}^{*}[t^{*}]-\pi_{n}\tilde{f}_{n}(\hat{f}_{n})}\right\|_{n}
+f^nf0nπnfn[t]fn[t]n+f^nf0,fn[t]πnf~n(f^n)n\displaystyle\quad+\left\|{\hat{f}_{n}-f_{0}}\right\|_{n}\left\|{\pi_{n}f_{n}^{*}[t^{*}]-f_{n}^{*}[t^{*}]}\right\|_{n}+\left\langle{\hat{f}_{n}-f_{0}},{f_{n}^{*}[t^{*}]-\pi_{n}\tilde{f}_{n}(\hat{f}_{n})}\right\rangle_{n}
𝒪p(δn)op(n1/2)+𝒪p(δn2)+f^nf0,fn[t]πnf~n(f^n)n\displaystyle\leq\mathcal{O}_{p}(\delta_{n})o_{p}(n^{-1/2})+\mathcal{O}_{p}(\delta_{n}^{2})+\left\langle{\hat{f}_{n}-f_{0}},{f_{n}^{*}[t^{*}]-\pi_{n}\tilde{f}_{n}(\hat{f}_{n})}\right\rangle_{n}
=op(n1)+f^nf0,fn[t]πnf~n(f^n)n\displaystyle=o_{p}(n^{-1})+\left\langle{\hat{f}_{n}-f_{0}},{f_{n}^{*}[t^{*}]-\pi_{n}\tilde{f}_{n}(\hat{f}_{n})}\right\rangle_{n}
=op(n1)+tf^nf0,vvn\displaystyle=o_{p}(n^{-1})+t^{*}\left\langle{\hat{f}_{n}-f_{0}},{\frac{v^{*}}{\left\|{v^{*}}\right\|}}\right\rangle_{n}
=op(n1)+op(n1/2)𝒪p(n1/2)=op(n1),\displaystyle=o_{p}(n^{-1})+o_{p}(n^{-1/2})\mathcal{O}_{p}(n^{-1/2})=o_{p}(n^{-1}),

which implies that

f~n(f^n)f0,πnfn[t]f~n(f^n)n\displaystyle\left\langle{\tilde{f}_{n}(\hat{f}_{n})-f_{0}},{\pi_{n}f_{n}^{*}[t^{*}]-\tilde{f}_{n}(\hat{f}_{n})}\right\rangle_{n}
=\displaystyle= f~n(f^n)f0,πnfn[t]πnf~n(f^n)n+f~n(f^n)f0,πnf~n(f^n)f~n(f^n)n\displaystyle\left\langle{\tilde{f}_{n}(\hat{f}_{n})-f_{0}},{\pi_{n}f_{n}^{*}[t^{*}]-\pi_{n}\tilde{f}_{n}(\hat{f}_{n})}\right\rangle_{n}+\left\langle{\tilde{f}_{n}(\hat{f}_{n})-f_{0}},{\pi_{n}\tilde{f}_{n}(\hat{f}_{n})-\tilde{f}_{n}(\hat{f}_{n})}\right\rangle_{n}
\displaystyle\leq op(n1)+f~n(f^n)f0nπnf~n(f^n)f~n(f^n)n\displaystyle o_{p}(n^{-1})+\left\|{\tilde{f}_{n}(\hat{f}_{n})-f_{0}}\right\|_{n}\left\|{\pi_{n}\tilde{f}_{n}(\hat{f}_{n})-\tilde{f}_{n}(\hat{f}_{n})}\right\|_{n}
=\displaystyle= op(n1).\displaystyle o_{p}(n^{-1}).

From (C3), we have

2n1i=1nϵi(πnf~n(f^n)(𝑿i)πnfn[t](𝑿i))\displaystyle 2n^{-1}\sum_{i=1}^{n}\epsilon_{i}\left(\pi_{n}\tilde{f}_{n}(\hat{f}_{n})(\boldsymbol{X}_{i})-\pi_{n}f_{n}^{*}[t^{*}](\boldsymbol{X}_{i})\right) =2n1i=1nϵi(πnf~n(f^n)(𝑿i)fn[t](𝑿i))\displaystyle=2n^{-1}\sum_{i=1}^{n}\epsilon_{i}\left(\pi_{n}\tilde{f}_{n}(\hat{f}_{n})(\boldsymbol{X}_{i})-f_{n}^{*}[t^{*}](\boldsymbol{X}_{i})\right)
+2n1i=1nϵi(fn[t](𝑿i)πnfn[t](𝑿i))\displaystyle\quad+2n^{-1}\sum_{i=1}^{n}\epsilon_{i}\left(f_{n}^{*}[t^{*}](\boldsymbol{X}_{i})-\pi_{n}f_{n}^{*}[t^{*}](\boldsymbol{X}_{i})\right)
=2v2n1ti=1nϵiv(𝑿i)+op(n1)\displaystyle=-2\left\|{v^{*}}\right\|^{-2}n^{-1}t^{*}\sum_{i=1}^{n}\epsilon_{i}v^{*}(\boldsymbol{X}_{i})+o_{p}(n^{-1})
=op(n1/2)𝒪p(n1/2)+op(n1)=op(n1),\displaystyle=o_{p}(n^{-1/2})\mathcal{O}_{p}(n^{-1/2})+o_{p}(n^{-1})=o_{p}(n^{-1}),

which proves the desired result. ∎

We are now ready to complete the proof of Theorem 2.

Proof.

Note that

nσ2f^nf~n(f^n)n2\displaystyle\frac{n}{\sigma^{2}}\left\|{\hat{f}_{n}-\tilde{f}_{n}(\hat{f}_{n})}\right\|_{n}^{2} =nσ2|f^nf0,v|2vn2v4\displaystyle=\frac{n}{\sigma^{2}}\left|{\left\langle{\hat{f}_{n}-f_{0}},{v^{*}}\right\rangle}\right|^{2}\frac{\left\|{v^{*}}\right\|_{n}^{2}}{\left\|{v^{*}}\right\|^{4}}
=|n1/2σ1v1i=1nϵiv(𝑿i)+op(1)|2vn2v2.\displaystyle=\left|{n^{-1/2}\sigma^{-1}\left\|{v^{*}}\right\|^{-1}\sum_{i=1}^{n}\epsilon_{i}v^{*}(\boldsymbol{X}_{i})+o_{p}(1)}\right|^{2}\frac{\left\|{v^{*}}\right\|_{n}^{2}}{\left\|{v^{*}}\right\|^{2}}.

It then follows from Theorem 1, the smoothness Assumption 1, the classical central limit theorem, and Slutsky’s theorem that

nσ2f^nf~n(f^n)n2𝑑χ12\displaystyle\frac{n}{\sigma^{2}}\left\|{\hat{f}_{n}-\tilde{f}_{n}(\hat{f}_{n})}\right\|_{n}^{2}\xrightarrow{d}\chi_{1}^{2}.

Based on the previous lemmas, we have

n(f^n0)n(f^n)\displaystyle\mathbb{Q}_{n}(\hat{f}_{n}^{0})-\mathbb{Q}_{n}(\hat{f}_{n}) n(πnfn[t])n(f^n)+op(n1)\displaystyle\leq\mathbb{Q}_{n}(\pi_{n}f_{n}^{*}[t^{*}])-\mathbb{Q}_{n}(\hat{f}_{n})+o_{p}(n^{-1})
=n(πnf~n(f^n))n(f^n)+op(n1)\displaystyle=\mathbb{Q}_{n}(\pi_{n}\tilde{f}_{n}(\hat{f}_{n}))-\mathbb{Q}_{n}(\hat{f}_{n})+o_{p}(n^{-1})
=f~n(f^n)f^nn2+op(n1)\displaystyle=\left\|{\tilde{f}_{n}(\hat{f}_{n})-\hat{f}_{n}}\right\|_{n}^{2}+o_{p}(n^{-1})
n(f^n0)n(f^n)\displaystyle\mathbb{Q}_{n}(\hat{f}_{n}^{0})-\mathbb{Q}_{n}(\hat{f}_{n}) n(f^n0)n(πnf~n(f^n0))op(n1)\displaystyle\geq\mathbb{Q}_{n}(\hat{f}_{n}^{0})-\mathbb{Q}_{n}(\pi_{n}\tilde{f}_{n}(\hat{f}_{n}^{0}))-o_{p}(n^{-1})
=f~n(f^n)f^nn2+op(n1),\displaystyle=\left\|{\tilde{f}_{n}(\hat{f}_{n})-\hat{f}_{n}}\right\|_{n}^{2}+o_{p}(n^{-1}),

so that the desired result follows. ∎
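
The chi-squared limit in Theorem 2 can be sanity-checked by Monte Carlo in a nested linear-sieve analogue, where the restricted and unrestricted least-squares fits have closed forms and n(Qn(f̂n0)−Qn(f̂n))/σ² reduces to the usual scaled difference of residual sums of squares under a single linear constraint. The basis, the true coefficients, and the constraint in the sketch below are illustrative assumptions, not the paper's neural network construction.

```python
# A Monte Carlo sanity check of the chi-squared limit in Theorem 2 via a
# nested linear-sieve analogue (illustration only): under H0 the scaled
# difference of residual sums of squares is chi^2 with 1 degree of freedom.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, K, sigma, reps = 400, 6, 1.0, 2000              # illustrative choices

def one_stat():
    x = rng.uniform(0, 1, n)
    Phi = np.column_stack([np.cos(np.pi * k * x) for k in range(K)])
    beta0 = np.array([0.5, 1.0, 0.0, 0.3, -0.2, 0.1])   # H0: third coef = 0
    y = Phi @ beta0 + rng.normal(0.0, sigma, n)
    b1, *_ = np.linalg.lstsq(Phi, y, rcond=None)        # unrestricted fit
    rss1 = np.sum((y - Phi @ b1) ** 2)
    Phi0 = np.delete(Phi, 2, axis=1)                    # restricted fit
    b0, *_ = np.linalg.lstsq(Phi0, y, rcond=None)
    rss0 = np.sum((y - Phi0 @ b0) ** 2)
    return (rss0 - rss1) / sigma**2   # = n (Q_n(f_hat0) - Q_n(f_hat)) / sigma^2

lrt = np.array([one_stat() for _ in range(reps)])
print(np.mean(lrt > stats.chi2.ppf(0.95, df=1)))        # should be near 0.05
```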

Rate of Convergence of Approximate Sieve Extremum Estimators

We start with a general result on the rate of convergence of sieve estimators in the nonparametric regression setup. The notation in this section is inherited from Section 2 of the main text.

Lemma 15.

For every nn and every δ>8f0πnf0\delta>8\left\|{f_{0}-\pi_{n}f_{0}}\right\|, we have

supfnδ/2<fπnf0δ(πnf0)(f)δ2.\sup_{\begin{subarray}{c}f\in\mathcal{F}_{n}\\ \delta/2<\left\|{f-\pi_{n}f_{0}}\right\|\leq\delta\end{subarray}}\mathbb{Q}(\pi_{n}f_{0})-\mathbb{Q}(f)\lesssim-\delta^{2}.
Proof.

First, note that

(πnf0)(f)\displaystyle\mathbb{Q}(\pi_{n}f_{0})-\mathbb{Q}(f) =𝔼[(Yπnf0(𝑿))2]𝔼[(Yf(𝑿))2]\displaystyle=\mathbb{E}\left[(Y-\pi_{n}f_{0}(\boldsymbol{X}))^{2}\right]-\mathbb{E}\left[(Y-f(\boldsymbol{X}))^{2}\right]
=𝔼[(πnf0(𝑿)f0(𝑿))2]𝔼[(f(𝑿)f0(𝑿))2]\displaystyle=\mathbb{E}\left[\left(\pi_{n}f_{0}(\boldsymbol{X})-f_{0}(\boldsymbol{X})\right)^{2}\right]-\mathbb{E}\left[(f(\boldsymbol{X})-f_{0}(\boldsymbol{X}))^{2}\right]
=πnf0f02ff02.\displaystyle=\left\|{\pi_{n}f_{0}-f_{0}}\right\|^{2}-\left\|{f-f_{0}}\right\|^{2}.

The triangle inequality gives

fπnf0\displaystyle\left\|{f-\pi_{n}f_{0}}\right\| ff0+πnf0f0\displaystyle\leq\left\|{f-f_{0}}\right\|+\left\|{\pi_{n}f_{0}-f_{0}}\right\|
=ff0πnf0f0+2πnf0f0.\displaystyle=\left\|{f-f_{0}}\right\|-\left\|{\pi_{n}f_{0}-f_{0}}\right\|+2\left\|{\pi_{n}f_{0}-f_{0}}\right\|.

Therefore, we have

ff0πnf0f0fπnf02πnf0f0\left\|{f-f_{0}}\right\|-\left\|{\pi_{n}f_{0}-f_{0}}\right\|\geq\left\|{f-\pi_{n}f_{0}}\right\|-2\left\|{\pi_{n}f_{0}-f_{0}}\right\|

so that for every ff satisfying fπnf0216πnf0f02\left\|{f-\pi_{n}f_{0}}\right\|^{2}\geq 16\left\|{\pi_{n}f_{0}-f_{0}}\right\|^{2}, i.e., fπnf04πnf0f0\left\|{f-\pi_{n}f_{0}}\right\|\geq 4\left\|{\pi_{n}f_{0}-f_{0}}\right\|, we have

ff0πnf0f0\displaystyle\left\|{f-f_{0}}\right\|-\left\|{\pi_{n}f_{0}-f_{0}}\right\| fπnf012πnf0f0\displaystyle\geq\left\|{f-\pi_{n}f_{0}}\right\|-\frac{1}{2}\left\|{\pi_{n}f_{0}-f_{0}}\right\|
=12fπnf00,\displaystyle=\frac{1}{2}\left\|{f-\pi_{n}f_{0}}\right\|\geq 0, (15)

which implies that ff0πnf0f0\left\|{f-f_{0}}\right\|\geq\left\|{\pi_{n}f_{0}-f_{0}}\right\|. By squaring both sides of (15), we obtain

14fπnf02\displaystyle\frac{1}{4}\left\|{f-\pi_{n}f_{0}}\right\|^{2} ff02+πnf0f022ff0πnf0f0\displaystyle\leq\left\|{f-f_{0}}\right\|^{2}+\left\|{\pi_{n}f_{0}-f_{0}}\right\|^{2}-2\left\|{f-f_{0}}\right\|\cdot\left\|{\pi_{n}f_{0}-f_{0}}\right\|
ff02+πnf0f022πnf0f02\displaystyle\leq\left\|{f-f_{0}}\right\|^{2}+\left\|{\pi_{n}f_{0}-f_{0}}\right\|^{2}-2\left\|{\pi_{n}f_{0}-f_{0}}\right\|^{2}
=ff02πnf0f02.\displaystyle=\left\|{f-f_{0}}\right\|^{2}-\left\|{\pi_{n}f_{0}-f_{0}}\right\|^{2}.

Hence, for δ>8πnf0f0\delta>8\left\|{\pi_{n}f_{0}-f_{0}}\right\|, we have

supfnδ/2<fπnf0δ(πnf0)(f)\displaystyle\sup_{\begin{subarray}{c}f\in\mathcal{F}_{n}\\ \delta/2<\left\|{f-\pi_{n}f_{0}}\right\|\leq\delta\end{subarray}}\mathbb{Q}(\pi_{n}f_{0})-\mathbb{Q}(f)
\displaystyle\leq supfnfπnf0>δ/2πnf0f02ff02\displaystyle\sup_{\begin{subarray}{c}f\in\mathcal{F}_{n}\\ \left\|{f-\pi_{n}f_{0}}\right\|>\delta/2\end{subarray}}\left\|{\pi_{n}f_{0}-f_{0}}\right\|^{2}-\left\|{f-f_{0}}\right\|^{2}
\displaystyle\leq supfnfπnf0>δ/2(14fπnf02)\displaystyle\sup_{\begin{subarray}{c}f\in\mathcal{F}_{n}\\ \left\|{f-\pi_{n}f_{0}}\right\|>\delta/2\end{subarray}}\left(-\frac{1}{4}\left\|{f-\pi_{n}f_{0}}\right\|^{2}\right)
δ2.\displaystyle\lesssim-\delta^{2}.

Lemma 16.

For every sufficiently large nn and δ>8f0πnf0\delta>8\left\|{f_{0}-\pi_{n}f_{0}}\right\|, under (C1),

𝔼[supfnδ/2<fπnf0δn|(nQ)(f)(nQ)(πnf0)|]0δH1/2(u)du+1\mathbb{E}\left[\sup_{\begin{subarray}{c}f\in\mathcal{F}_{n}\\ \delta/2<\left\|{f-\pi_{n}f_{0}}\right\|\leq\delta\end{subarray}}\sqrt{n}\left|{(\mathbb{Q}_{n}-Q)(f)-(\mathbb{Q}_{n}-Q)(\pi_{n}f_{0})}\right|\right]\lesssim\int_{0}^{\delta}H^{1/2}(u)\textrm{d}u+1.
Proof.

Since

n|(nQ)(πnf0)(nQ)(f)|\displaystyle\sqrt{n}\left|{(\mathbb{Q}_{n}-Q)(\pi_{n}f_{0})-(\mathbb{Q}_{n}-Q)(f)}\right|
=\displaystyle= n|1ni=1n(Yiπnf0(𝑿i))2𝔼[(Yπnf0(𝑿))2]1ni=1n(Yif(𝑿i))2+𝔼[(Yf(𝑿))2]|\displaystyle\sqrt{n}\left|{\frac{1}{n}\sum_{i=1}^{n}(Y_{i}-\pi_{n}f_{0}(\boldsymbol{X}_{i}))^{2}-\mathbb{E}\left[(Y-\pi_{n}f_{0}(\boldsymbol{X}))^{2}\right]-\frac{1}{n}\sum_{i=1}^{n}\left(Y_{i}-f(\boldsymbol{X}_{i})\right)^{2}+\mathbb{E}\left[(Y-f(\boldsymbol{X}))^{2}\right]}\right|
\displaystyle\leq |2ni=1nϵi(f(𝑿i)πnf0(𝑿i))|+|1ni=1n{(f0(𝑿i)πnf0(𝑿i))2𝔼[(f0(𝑿)πnf0(𝑿))2]}|\displaystyle\left|{\frac{2}{\sqrt{n}}\sum_{i=1}^{n}\epsilon_{i}(f(\boldsymbol{X}_{i})-\pi_{n}f_{0}(\boldsymbol{X}_{i}))}\right|+\left|{\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\left\{(f_{0}(\boldsymbol{X}_{i})-\pi_{n}f_{0}(\boldsymbol{X}_{i}))^{2}-\mathbb{E}\left[(f_{0}(\boldsymbol{X})-\pi_{n}f_{0}(\boldsymbol{X}))^{2}\right]\right\}}\right|
+|1ni=1n{(f(𝑿i)f0(𝑿i))2𝔼[(f(𝑿)f0(𝑿))2]}|,\displaystyle\qquad+\left|{\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\left\{(f(\boldsymbol{X}_{i})-f_{0}(\boldsymbol{X}_{i}))^{2}-\mathbb{E}\left[(f(\boldsymbol{X})-f_{0}(\boldsymbol{X}))^{2}\right]\right\}}\right|,

we obtain

𝔼[supfnδ/2<fπnf0δn|(nQ)(f)(nQ)(πnf0)|]\displaystyle\mathbb{E}\left[\sup_{\begin{subarray}{c}f\in\mathcal{F}_{n}\\ \delta/2<\left\|{f-\pi_{n}f_{0}}\right\|\leq\delta\end{subarray}}\sqrt{n}\left|{(\mathbb{Q}_{n}-Q)(f)-(\mathbb{Q}_{n}-Q)(\pi_{n}f_{0})}\right|\right]
\displaystyle\leq 𝔼[supfnδ/2<fπnf0δ|2ni=1nϵi(f(𝑿i)πnf0(𝑿i))|]\displaystyle\mathbb{E}\left[\sup_{\begin{subarray}{c}f\in\mathcal{F}_{n}\\ \delta/2<\left\|{f-\pi_{n}f_{0}}\right\|\leq\delta\end{subarray}}\left|{\frac{2}{\sqrt{n}}\sum_{i=1}^{n}\epsilon_{i}(f(\boldsymbol{X}_{i})-\pi_{n}f_{0}(\boldsymbol{X}_{i}))}\right|\right]
+𝔼[supfnδ/2<fπnf0δ|1ni=1n{(f(𝑿i)f0(𝑿i))2𝔼[(f(𝑿)f0(𝑿))2]}|]\displaystyle\qquad+\mathbb{E}\left[\sup_{\begin{subarray}{c}f\in\mathcal{F}_{n}\\ \delta/2<\left\|{f-\pi_{n}f_{0}}\right\|\leq\delta\end{subarray}}\left|{\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\left\{(f(\boldsymbol{X}_{i})-f_{0}(\boldsymbol{X}_{i}))^{2}-\mathbb{E}\left[(f(\boldsymbol{X})-f_{0}(\boldsymbol{X}))^{2}\right]\right\}}\right|\right]
+𝔼[|1ni=1n{(f0(𝑿i)πnf0(𝑿i))2𝔼[(f0(𝑿)πnf0(𝑿))2]}|]\displaystyle\qquad+\mathbb{E}\left[\left|{\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\left\{(f_{0}(\boldsymbol{X}_{i})-\pi_{n}f_{0}(\boldsymbol{X}_{i}))^{2}-\mathbb{E}\left[(f_{0}(\boldsymbol{X})-\pi_{n}f_{0}(\boldsymbol{X}))^{2}\right]\right\}}\right|\right]
:=\displaystyle:= P1+P2+P3.\displaystyle P_{1}+P_{2}+P_{3}.

We start by bounding P3P_{3}. As 0(f0(𝑿i)πnf0(𝑿i))2f0πnf020\leq(f_{0}(\boldsymbol{X}_{i})-\pi_{n}f_{0}(\boldsymbol{X}_{i}))^{2}\leq\left\|{f_{0}-\pi_{n}f_{0}}\right\|_{\infty}^{2}, it follows from Hoeffding’s inequality that

(|1ni=1n{(f0(𝑿i)πnf0(𝑿i))2𝔼[(f0(𝑿)πnf0(𝑿))2]}|t)2exp{2t2f0πnf04},\displaystyle\mathbb{P}\left(\left|{\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\left\{(f_{0}(\boldsymbol{X}_{i})-\pi_{n}f_{0}(\boldsymbol{X}_{i}))^{2}-\mathbb{E}\left[(f_{0}(\boldsymbol{X})-\pi_{n}f_{0}(\boldsymbol{X}))^{2}\right]\right\}}\right|\geq t\right)\leq 2\exp\left\{-\frac{2t^{2}}{\left\|{f_{0}-\pi_{n}f_{0}}\right\|_{\infty}^{4}}\right\},

and hence

P3=\displaystyle P_{3}= 0(|1ni=1n{(f0(𝑿i)πnf0(𝑿i))2𝔼[(f0(𝑿)πnf0(𝑿))2]}|t)dt\displaystyle\int_{0}^{\infty}\mathbb{P}\left(\left|{\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\left\{(f_{0}(\boldsymbol{X}_{i})-\pi_{n}f_{0}(\boldsymbol{X}_{i}))^{2}-\mathbb{E}\left[(f_{0}(\boldsymbol{X})-\pi_{n}f_{0}(\boldsymbol{X}))^{2}\right]\right\}}\right|\geq t\right)\textrm{d}t
\displaystyle\leq 02exp{2t2f0πnf04}dt\displaystyle\int_{0}^{\infty}2\exp\left\{-\frac{2t^{2}}{\left\|{f_{0}-\pi_{n}f_{0}}\right\|_{\infty}^{4}}\right\}\textrm{d}t
=\displaystyle= (π/2)1/2f0πnf020,as n,\displaystyle(\pi/2)^{1/2}\left\|{f_{0}-\pi_{n}f_{0}}\right\|_{\infty}^{2}\to 0,\quad\textrm{as }n\to\infty,

which implies that for a sufficiently large nn,

P3=𝔼[|1ni=1n{(f0(𝑿i)πnf0(𝑿i))2𝔼[(f0(𝑿)πnf0(𝑿))2]}|]1.P_{3}=\mathbb{E}\left[\left|{\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\left\{(f_{0}(\boldsymbol{X}_{i})-\pi_{n}f_{0}(\boldsymbol{X}_{i}))^{2}-\mathbb{E}\left[(f_{0}(\boldsymbol{X})-\pi_{n}f_{0}(\boldsymbol{X}))^{2}\right]\right\}}\right|\right]\leq 1.

On the other hand, it follows from the symmetrization inequality that

P2\displaystyle P_{2} 𝔼[supfnfπnf0δ|1ni=1n{(f(𝑿i)f0(𝑿i))2𝔼[(f(𝑿)f0(𝑿))2]}|]\displaystyle\leq\mathbb{E}\left[\sup_{\begin{subarray}{c}f\in\mathcal{F}_{n}\\ \left\|{f-\pi_{n}f_{0}}\right\|\leq\delta\end{subarray}}\left|{\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\left\{(f(\boldsymbol{X}_{i})-f_{0}(\boldsymbol{X}_{i}))^{2}-\mathbb{E}\left[(f(\boldsymbol{X})-f_{0}(\boldsymbol{X}))^{2}\right]\right\}}\right|\right]
𝔼[supfnfπnf0δ|1ni=1nξi(f(𝑿i)f0(𝑿i))2|],\displaystyle\lesssim\mathbb{E}\left[\sup_{\begin{subarray}{c}f\in\mathcal{F}_{n}\\ \left\|{f-\pi_{n}f_{0}}\right\|\leq\delta\end{subarray}}\left|{\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\xi_{i}(f(\boldsymbol{X}_{i})-f_{0}(\boldsymbol{X}_{i}))^{2}}\right|\right],

where ξ1,,ξn\xi_{1},\ldots,\xi_{n} are i.i.d. Rademacher random variables independent of 𝑿1,,𝑿n\boldsymbol{X}_{1},\ldots,\boldsymbol{X}_{n}. Moreover, since n\mathcal{F}_{n} is uniformly bounded, we know that

ff0f+f0<.\left\|{f-f_{0}}\right\|_{\infty}\leq\left\|{f}\right\|_{\infty}+\left\|{f_{0}}\right\|_{\infty}<\infty.

According to the contraction principle and Corollary 2.2.8 in van der Vaart and Wellner (1996),

𝔼[supfnfπnf0δ|1ni=1nξi(f(𝑿i)f0(𝑿i))2|]\displaystyle\mathbb{E}\left[\sup_{\begin{subarray}{c}f\in\mathcal{F}_{n}\\ \left\|{f-\pi_{n}f_{0}}\right\|\leq\delta\end{subarray}}\left|{\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\xi_{i}(f(\boldsymbol{X}_{i})-f_{0}(\boldsymbol{X}_{i}))^{2}}\right|\right]
\displaystyle\lesssim 𝔼[supfnfπnf0δ|1ni=1nξi(f(𝑿i)f0(𝑿i))|]\displaystyle\mathbb{E}\left[\sup_{\begin{subarray}{c}f\in\mathcal{F}_{n}\\ \left\|{f-\pi_{n}f_{0}}\right\|\leq\delta\end{subarray}}\left|{\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\xi_{i}(f(\boldsymbol{X}_{i})-f_{0}(\boldsymbol{X}_{i}))}\right|\right]
=\displaystyle= 𝔼[𝔼[supfnfπnf0δ|1ni=1nξi(f(𝑿i)f0(𝑿i))|]|Δnδ]\displaystyle\mathbb{E}\left[\left.\mathbb{E}\left[\sup_{\begin{subarray}{c}f\in\mathcal{F}_{n}\\ \left\|{f-\pi_{n}f_{0}}\right\|\leq\delta\end{subarray}}\left|\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\xi_{i}\left(f(\boldsymbol{X}_{i})-f_{0}(\boldsymbol{X}_{i})\right)\right|\right]\right|\Delta_{n}\leq\delta\right]
\displaystyle\lesssim 𝔼[0δlogN(u,n,L2(n))du]+𝔼[|1ni=1nξi(πnf0(𝑿i)f0(𝑿i))|]\displaystyle\mathbb{E}\left[\int_{0}^{\delta}\sqrt{\log N(u,\mathcal{F}_{n},L_{2}(\mathbb{P}_{n}))}\textrm{d}u\right]+\mathbb{E}\left[\left|{\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\xi_{i}(\pi_{n}f_{0}(\boldsymbol{X}_{i})-f_{0}(\boldsymbol{X}_{i}))}\right|\right]
\displaystyle\lesssim 0δH1/2(u)du+𝔼[|1ni=1nξi(πnf0(𝑿i)f0(𝑿i))|],\displaystyle\int_{0}^{\delta}H^{1/2}(u)\textrm{d}u+\mathbb{E}\left[\left|{\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\xi_{i}(\pi_{n}f_{0}(\boldsymbol{X}_{i})-f_{0}(\boldsymbol{X}_{i}))}\right|\right],

where Δn=supfnfπnf0δff0n\Delta_{n}=\sup_{\begin{subarray}{c}f\in\mathcal{F}_{n}\\ \left\|{f-\pi_{n}f_{0}}\right\|\leq\delta\end{subarray}}\left\|{f-f_{0}}\right\|_{n}. Similar to the arguments used in bounding P3P_{3}, we have

𝔼[|1ni=1nξi(πnf0(𝑿i)f0(𝑿i))|]\displaystyle\mathbb{E}\left[\left|{\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\xi_{i}(\pi_{n}f_{0}(\boldsymbol{X}_{i})-f_{0}(\boldsymbol{X}_{i}))}\right|\right]
=\displaystyle= 0(1n|i=1nξi(πnf0(𝑿i)f0(𝑿i))|t)dt\displaystyle\int_{0}^{\infty}\mathbb{P}\left(\frac{1}{\sqrt{n}}\left|{\sum_{i=1}^{n}\xi_{i}(\pi_{n}f_{0}(\boldsymbol{X}_{i})-f_{0}(\boldsymbol{X}_{i}))}\right|\geq t\right)\textrm{d}t
\displaystyle\lesssim 0exp{t22πnf0f02}dt\displaystyle\int_{0}^{\infty}\exp\left\{-\frac{t^{2}}{2\left\|{\pi_{n}f_{0}-f_{0}}\right\|_{\infty}^{2}}\right\}\textrm{d}t
\displaystyle\lesssim πnf0f00 as n,\displaystyle\left\|{\pi_{n}f_{0}-f_{0}}\right\|_{\infty}\to 0\textrm{ as }n\to\infty,

which implies that $\mathbb{E}\left[\left|{\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\xi_{i}(\pi_{n}f_{0}(\boldsymbol{X}_{i})-f_{0}(\boldsymbol{X}_{i}))}\right|\right]\leq 1$ for all sufficiently large $n$. Therefore, for all sufficiently large $n$,

P20δH1/2(u)du+1.P_{2}\lesssim\int_{0}^{\delta}H^{1/2}(u)\textrm{d}u+1.

Finally, since $\left\|{\epsilon}\right\|_{p,1}<\infty$ for some $p\geq 2$, it follows from the multiplier inequality (Lemma 2.9.1 in van der Vaart and Wellner (1996)) that, for every sufficiently large $n$,

\displaystyle P_{3}=\mathbb{E}\left[\sup_{\begin{subarray}{c}f\in\mathcal{F}_{n}\\ \delta/2<\left\|{f-\pi_{n}f_{0}}\right\|\leq\delta\end{subarray}}\left|{\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\epsilon_{i}(f(\boldsymbol{X}_{i})-f_{0}(\boldsymbol{X}_{i}))}\right|\right]
\displaystyle\lesssim max1kn𝔼[supfnδ/2<fπnf0δ|1ki=1kξi(f(𝑿i)f0(𝑿i))|]\displaystyle\max_{1\leq k\leq n}\mathbb{E}\left[\sup_{\begin{subarray}{c}f\in\mathcal{F}_{n}\\ \delta/2<\left\|{f-\pi_{n}f_{0}}\right\|\leq\delta\end{subarray}}\left|{\frac{1}{\sqrt{k}}\sum_{i=1}^{k}\xi_{i}(f(\boldsymbol{X}_{i})-f_{0}(\boldsymbol{X}_{i}))}\right|\right]
\displaystyle\lesssim 0δH1/2(u)du+1,\displaystyle\int_{0}^{\delta}H^{1/2}(u)\textrm{d}u+1,

where the last inequality follows because the upper bound on the local Rademacher complexity does not depend on $k$. Combining the bounds on $P_{1}$, $P_{2}$, and $P_{3}$, we obtain the desired result. ∎
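To make the symmetrization and Rademacher-complexity steps above concrete, the following minimal Monte Carlo sketch in Python compares the centered empirical process with twice its Rademacher-symmetrized counterpart over a small finite function class. The class $\{\sin(tx):t\in\{1,2,3,5\}\}$, the uniform design, and the sample sizes are illustrative assumptions, not objects from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 200, 2000
ts = np.array([1.0, 2.0, 3.0, 5.0])  # illustrative function class g_t(x) = sin(t x)

def class_values(x):
    # Rows index functions in the class; columns index sample points.
    return np.sin(np.outer(ts, x))

# Approximate the means E[g_t(X)] for X ~ Uniform(0, 1) on a large sample.
means = class_values(rng.uniform(0, 1, 10**6)).mean(axis=1)

emp, sym = [], []
for _ in range(reps):
    x = rng.uniform(0, 1, n)
    G = class_values(x)
    xi = rng.choice([-1.0, 1.0], size=n)          # Rademacher multipliers
    emp.append(np.abs((G - means[:, None]).sum(axis=1)).max() / np.sqrt(n))
    sym.append(np.abs((G * xi).sum(axis=1)).max() / np.sqrt(n))

# Symmetrization predicts the first number is at most the second.
print(f"E sup_t |n^(-1/2) sum (g_t(X_i) - E g_t)| ~ {np.mean(emp):.3f}")
print(f"2 E sup_t |n^(-1/2) sum xi_i g_t(X_i)|   ~ {2 * np.mean(sym):.3f}")
```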

Based on Lemma 15 and Lemma 16, the rate of convergence for the approximate sieve extremum estimators can be easily obtained via an application of Theorem 3.4.1 in van der Vaart and Wellner (1996).

Theorem 17.

Suppose that $\int_{0}^{\delta}H^{1/2}(u)\,\textrm{d}u\lesssim\phi_{n}(\delta)$ for some function $\phi_{n}:(0,\infty)\to\mathbb{R}$, for every sufficiently large $n$ and every $\delta$ with $8\left\|{f_{0}-\pi_{n}f_{0}}\right\|<\delta\leq\eta$. Suppose further that $\delta^{-\alpha}\phi_{n}(\delta)$ is decreasing on $(8\left\|{f_{0}-\pi_{n}f_{0}}\right\|,\infty)$ for some $\alpha<2$. Let $\rho_{n}\gtrsim\left\|{f_{0}-\pi_{n}f_{0}}\right\|$ satisfy

ρn2ϕn(ρn)n for every n.\rho_{n}^{-2}\phi_{n}(\rho_{n})\lesssim\sqrt{n}\textrm{ for every }n.

Then, if the optimization error of the approximate sieve extremum estimator $\hat{f}_{n}$ is $\mathcal{O}_{p}(\rho_{n}^{2})$ and $\left\|{\hat{f}_{n}-\pi_{n}f_{0}}\right\|=o_{p}(1)$, we get

f^nπnf0=𝒪p(ρn)\left\|{\hat{f}_{n}-\pi_{n}f_{0}}\right\|=\mathcal{O}_{p}(\rho_{n})

and

\left\|{\hat{f}_{n}-f_{0}}\right\|=\mathcal{O}_{p}(\max\left\{\rho_{n},\left\|{f_{0}-\pi_{n}f_{0}}\right\|\right\}).
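To see how the displayed inequality pins down the rate, the following sketch solves $\rho_{n}^{-2}\phi_{n}(\rho_{n})=\sqrt{n}$ numerically by bisection. The modulus $\phi_{n}(\delta)=\sqrt{r_{n}}\,\delta^{1/2}$ and the sieve-size choice $r_{n}=\sqrt{n}$ are assumptions invented purely for illustration; with them the closed-form solution is $\rho_{n}=(r_{n}/n)^{1/3}$.

```python
import numpy as np

def phi(delta, r_n):
    # Illustrative entropy-integral modulus phi_n(delta) = sqrt(r_n) * delta**0.5.
    return np.sqrt(r_n) * np.sqrt(delta)

def solve_rate(n, r_n, lo=1e-8, hi=1.0, iters=100):
    # rho**(-2) * phi(rho) is decreasing in rho, so bisect for the root of
    # rho**(-2) * phi(rho) = sqrt(n) on (lo, hi).
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid**-2 * phi(mid, r_n) > np.sqrt(n):
            lo = mid   # constraint violated: rho_n must be larger
        else:
            hi = mid
    return hi

for n in [10**3, 10**4, 10**5]:
    r_n = np.sqrt(n)   # assumed sieve dimension, for illustration only
    print(f"n={n:>6}: bisection rho_n = {solve_rate(n, r_n):.4f}, "
          f"closed form (r_n/n)^(1/3) = {(r_n / n) ** (1 / 3):.4f}")
```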

Rate of Convergence of Multiplier Processes

Proposition 18 (Proposition 5 in Han and Wellner (2019)).

Suppose that ϵ1,,ϵn\epsilon_{1},\ldots,\epsilon_{n} are i.i.d. mean zero random variables independent of i.i.d. random variables 𝐗1,,𝐗n\boldsymbol{X}_{1},\ldots,\boldsymbol{X}_{n}. Then, for any function class \mathcal{F},

\mathbb{E}\left[\sup_{f\in\mathcal{F}}\left|{\sum_{i=1}^{n}\epsilon_{i}f(\boldsymbol{X}_{i})}\right|\right]\leq\mathbb{E}\left[\sum_{k=1}^{n}\left(|\eta_{(k)}|-|\eta_{(k+1)}|\right)\mathbb{E}\left[\sup_{f\in\mathcal{F}}\left|{\sum_{i=1}^{k}\xi_{i}f(\boldsymbol{X}_{i})}\right|\right]\right],

where ξ1,,ξn\xi_{1},\ldots,\xi_{n} are i.i.d. Rademacher random variables independent of 𝐗1,,𝐗n\boldsymbol{X}_{1},\ldots,\boldsymbol{X}_{n} and ϵ1,,ϵn\epsilon_{1},\ldots,\epsilon_{n} and |η(1)||η(n)||η(n+1)|0|\eta_{(1)}|\geq\cdots\geq|\eta_{(n)}|\geq|\eta_{(n+1)}|\geq 0 are the reversed order statistics for {|ϵiϵi|}i=1n\{|\epsilon_{i}-\epsilon_{i}^{\prime}|\}_{i=1}^{n} with {ϵi}\{\epsilon_{i}^{\prime}\} being an indepedent copy of {ϵi}\{\epsilon_{i}\}.

As a consequence of Proposition 18, we can obtain the following result.

Proposition 19.

Assume the conditions of Proposition 18 hold and that

𝔼[supf|i=1kξif(𝑿i)|]kρk2 for all k=1,,n.\mathbb{E}\left[\sup_{f\in\mathcal{F}}\left|{\sum_{i=1}^{k}\xi_{i}f(\boldsymbol{X}_{i})}\right|\right]\lesssim k\rho_{k}^{2}\quad\textrm{ for all }k=1,\ldots,n.
  1. (i)

    If ϵp<\left\|{\epsilon}\right\|_{p}<\infty for some p1p\geq 1 and the sequence {kρk2}\{k\rho_{k}^{2}\} is non-decreasing, then

    𝔼[supf|i=1nϵif(𝑿i)|]ϵpn1+1pρn2.\mathbb{E}\left[\sup_{f\in\mathcal{F}}\left|{\sum_{i=1}^{n}\epsilon_{i}f(\boldsymbol{X}_{i})}\right|\right]\lesssim\left\|{\epsilon}\right\|_{p}n^{1+\frac{1}{p}}\rho_{n}^{2}.
  2. (ii)

    If ϵp,1<\left\|{\epsilon}\right\|_{p,1}<\infty for some p1p\geq 1 and the sequence {k11pρk2}\{k^{1-\frac{1}{p}}\rho_{k}^{2}\} is non-decreasing, then

    𝔼[supf|i=1nϵif(𝑿i)|]ϵp,1nρn2.\mathbb{E}\left[\sup_{f\in\mathcal{F}}\left|{\sum_{i=1}^{n}\epsilon_{i}f(\boldsymbol{X}_{i})}\right|\right]\lesssim\left\|{\epsilon}\right\|_{p,1}n\rho_{n}^{2}.
Proof.
  1. (i)

    According to Proposition 18, we have

    \displaystyle\mathbb{E}\left[\sup_{f\in\mathcal{F}}\left|{\sum_{i=1}^{n}\epsilon_{i}f(\boldsymbol{X}_{i})}\right|\right]\lesssim\mathbb{E}\left[\sum_{k=1}^{n}\left(|\eta_{(k)}|-|\eta_{(k+1)}|\right)k\rho_{k}^{2}\right]
    \displaystyle\leq n\rho_{n}^{2}\,\mathbb{E}\left[\sum_{k=1}^{n}\left(|\eta_{(k)}|-|\eta_{(k+1)}|\right)\right]
    \displaystyle=n\rho_{n}^{2}\,\mathbb{E}\left[|\eta_{(1)}|\right]
    \displaystyle=n\rho_{n}^{2}\,\mathbb{E}\left[\max_{1\leq i\leq n}|\epsilon_{i}-\epsilon_{i}^{\prime}|\right]
    \displaystyle\lesssim n\rho_{n}^{2}\,\mathbb{E}\left[\max_{1\leq i\leq n}|\epsilon_{i}|\right],
    where the second inequality uses that the sequence $\{k\rho_{k}^{2}\}$ is non-decreasing.

    Since $\left\|{\epsilon}\right\|_{p}<\infty$, it follows that

    𝔼[max1in|ϵi|]n1/pmax1inϵip=ϵpn1/p.\mathbb{E}\left[\max_{1\leq i\leq n}|\epsilon_{i}|\right]\leq n^{1/p}\max_{1\leq i\leq n}\left\|{\epsilon_{i}}\right\|_{p}=\left\|{\epsilon}\right\|_{p}\cdot n^{1/p}.

    Therefore, we get

    \mathbb{E}\left[\sup_{f\in\mathcal{F}}\left|{\sum_{i=1}^{n}\epsilon_{i}f(\boldsymbol{X}_{i})}\right|\right]\lesssim\left\|{\epsilon}\right\|_{p}n^{1+\frac{1}{p}}\rho_{n}^{2}.
  2. (ii)

    According to Proposition 18, we have

    \displaystyle\mathbb{E}\left[\sup_{f\in\mathcal{F}}\left|{\sum_{i=1}^{n}\epsilon_{i}f(\boldsymbol{X}_{i})}\right|\right]\leq\mathbb{E}\left[\sum_{k=1}^{n}k^{1/p}\left(|\eta_{(k)}|-|\eta_{(k+1)}|\right)\mathbb{E}\left[\sup_{f\in\mathcal{F}}\left|{k^{-1/p}\sum_{i=1}^{k}\xi_{i}f(\boldsymbol{X}_{i})}\right|\right]\right]
    \displaystyle\leq\mathbb{E}\left[\sum_{k=1}^{n}k^{1/p}\left(|\eta_{(k)}|-|\eta_{(k+1)}|\right)\right]\cdot\max_{1\leq k\leq n}\mathbb{E}\left[\sup_{f\in\mathcal{F}}\left|{k^{-1/p}\sum_{i=1}^{k}\xi_{i}f(\boldsymbol{X}_{i})}\right|\right].

    For $|\eta_{(k+1)}|<t\leq|\eta_{(k)}|$, we have $k=\textrm{Card}(\{i:|\epsilon_{i}-\epsilon_{i}^{\prime}|\geq t\})$, and then

    𝔼[k=1nk1/p(|η(k)||η(k+1)|)]\displaystyle\mathbb{E}\left[\sum_{k=1}^{n}k^{1/p}\left(|\eta_{(k)}|-|\eta_{(k+1)}|\right)\right] =𝔼[k=1n|η(k+1)||η(k)|k1/pdt]\displaystyle=\mathbb{E}\left[\sum_{k=1}^{n}\int_{|\eta_{(k+1)}|}^{|\eta_{(k)}|}k^{1/p}\textrm{d}t\right]
    =𝔼[0|η(1)|(Card({i:|ϵiϵi|t}))1/pdt]\displaystyle=\mathbb{E}\left[\int_{0}^{|\eta_{(1)}|}\left(\textrm{Card}(\{i:|\epsilon_{i}-\epsilon_{i}^{\prime}|\geq t\})\right)^{1/p}\textrm{d}t\right]
    0(i=1n(|ϵiϵi|t))1/pdt\displaystyle\lesssim\int_{0}^{\infty}\left(\sum_{i=1}^{n}\mathbb{P}(|\epsilon_{i}-\epsilon_{i}^{\prime}|\geq t)\right)^{1/p}\textrm{d}t
    \displaystyle\lesssim\int_{0}^{\infty}\left(\sum_{i=1}^{n}\mathbb{P}(|\epsilon_{i}|\geq t)\right)^{1/p}\textrm{d}t
    \displaystyle=n^{1/p}\int_{0}^{\infty}\mathbb{P}(|\epsilon|\geq t)^{1/p}\,\textrm{d}t=n^{1/p}\left\|{\epsilon}\right\|_{p,1},
    where the last equality uses that the $\epsilon_{i}$ are identically distributed.

    Therefore, if {k11/pρk2}\{k^{1-1/p}\rho_{k}^{2}\} is non-decreasing,

    \max_{1\leq k\leq n}\mathbb{E}\left[\sup_{f\in\mathcal{F}}\left|{k^{-1/p}\sum_{i=1}^{k}\xi_{i}f(\boldsymbol{X}_{i})}\right|\right]\lesssim n^{1-\frac{1}{p}}\rho_{n}^{2},

    which implies that

    \mathbb{E}\left[\sup_{f\in\mathcal{F}}\left|{\sum_{i=1}^{n}\epsilon_{i}f(\boldsymbol{X}_{i})}\right|\right]\lesssim n^{1/p}\left\|{\epsilon}\right\|_{p,1}\cdot n^{1-\frac{1}{p}}\rho_{n}^{2}=\left\|{\epsilon}\right\|_{p,1}\,n\rho_{n}^{2}. ∎

Remark 4.

If $\rho_{k}=k^{-\alpha}$, then $\{k\rho_{k}^{2}\}$ is non-decreasing precisely when $\rho_{k}\geq k^{-1/2}$ (i.e., $\alpha\leq 1/2$), and $\{k^{1-\frac{1}{p}}\rho_{k}^{2}\}$ is non-decreasing precisely when $\rho_{k}\geq k^{-\frac{1}{2}+\frac{1}{2p}}$ (i.e., $\alpha\leq\frac{1}{2}-\frac{1}{2p}$). In other words, to obtain the desired result, the “rate of convergence” $\rho_{n}$ should not converge to 0 too fast.
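These thresholds are easy to check numerically; in the sketch below, the exponent $p=3$ and the grid of $\alpha$ values are arbitrary illustrative choices.

```python
import numpy as np

p = 3.0
ks = np.arange(1, 200)
for alpha in [0.30, 0.45, 0.60]:
    rho = ks ** (-alpha)
    seq1 = ks * rho**2                  # non-decreasing iff alpha <= 1/2
    seq2 = ks ** (1 - 1 / p) * rho**2   # non-decreasing iff alpha <= 1/2 - 1/(2p)
    print(f"alpha={alpha}: k*rho_k^2 non-decreasing: {np.all(np.diff(seq1) >= 0)}, "
          f"k^(1-1/p)*rho_k^2 non-decreasing: {np.all(np.diff(seq2) >= 0)}")
```

With $p=3$ the second threshold is $\alpha\leq 1/3$, so $\alpha=0.30$ passes both checks, $\alpha=0.45$ passes only the first, and $\alpha=0.60$ passes neither.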

Proposition 19 yields the rate of convergence of the multiplier process via a straightforward application of Markov’s inequality.

Proposition 20.

Assume the conditions of Proposition 18 hold and that

𝔼[supf|i=1kξif(𝑿i)|]kρk2 for all k=1,,n.\mathbb{E}\left[\sup_{f\in\mathcal{F}}\left|{\sum_{i=1}^{k}\xi_{i}f(\boldsymbol{X}_{i})}\right|\right]\lesssim k\rho_{k}^{2}\quad\textrm{ for all }k=1,\ldots,n.
  1. (i)

    If ϵp<\left\|{\epsilon}\right\|_{p}<\infty for some p1p\geq 1 and the sequence {kρk2}\{k\rho_{k}^{2}\} is non-decreasing, then

    supf|i=1nϵif(𝑿i)|=𝒪p(n1+1pρn2).\sup_{f\in\mathcal{F}}\left|{\sum_{i=1}^{n}\epsilon_{i}f(\boldsymbol{X}_{i})}\right|=\mathcal{O}_{p}\left(n^{1+\frac{1}{p}}\rho_{n}^{2}\right).
  2. (ii)

    If ϵp,1<\left\|{\epsilon}\right\|_{p,1}<\infty for some p1p\geq 1 and the sequence {k11pρk2}\{k^{1-\frac{1}{p}}\rho_{k}^{2}\} is non-decreasing, then

    supf|i=1nϵif(𝑿i)|=𝒪p(nρn2).\sup_{f\in\mathcal{F}}\left|{\sum_{i=1}^{n}\epsilon_{i}f(\boldsymbol{X}_{i})}\right|=\mathcal{O}_{p}\left(n\rho_{n}^{2}\right).

Auxiliary Results

Proposition 21.

For every positive integer $m$,

a=1mCa(m)=m!.\sum_{a=1}^{m}C_{a}^{(m)}=m!.
Proof.

We prove this result by induction. For $m=1$, the identity holds trivially by definition. Now suppose that the result holds for $m$; then

\displaystyle\sum_{a=1}^{m+1}C_{a}^{(m+1)}=\sum_{a=1}^{m+1}\left\{aC_{a}^{(m)}+(m+2-a)C_{a-1}^{(m)}\right\}
=(i)a=1maCa(m)+a=1m(m+1a)Ca(m)\displaystyle\overset{(i)}{=}\sum_{a=1}^{m}aC_{a}^{(m)}+\sum_{a^{\prime}=1}^{m}(m+1-a^{\prime})C_{a^{\prime}}^{(m)}
=a=1m(m+1)Ca(m)\displaystyle=\sum_{a=1}^{m}(m+1)C_{a}^{(m)}
=(ii)(m+1)m!\displaystyle\overset{(ii)}{=}(m+1)m!
=(m+1)!,\displaystyle=(m+1)!,

where in equation (i) we substituted $a^{\prime}=a-1$ and used the conventions $C_{0}^{(m)}=C_{m+1}^{(m)}=0$, and equation (ii) follows from the induction hypothesis. Hence the desired result follows. ∎
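The recursion and boundary conventions used in the proof are easy to verify numerically; up to an index shift, the coefficients $C_{a}^{(m)}$ are the Eulerian numbers. In the sketch below, the base case $C_{1}^{(1)}=1$ (the $m=1$ case of the identity) and the conventions $C_{0}^{(m)}=C_{m+1}^{(m)}=0$ are exactly the assumptions from the proof.

```python
from math import factorial

def build_triangle(m_max):
    # Build C_a^(m) for m = 1..m_max via the recursion in the proof:
    # C_a^(m+1) = a * C_a^(m) + (m + 2 - a) * C_{a-1}^(m),
    # with conventions C_0^(m) = C_{m+1}^(m) = 0 and base case C_1^(1) = 1.
    C = {1: {1: 1}}
    for m in range(1, m_max):
        prev = C[m]
        C[m + 1] = {a: a * prev.get(a, 0) + (m + 2 - a) * prev.get(a - 1, 0)
                    for a in range(1, m + 2)}
    return C

for m, row in build_triangle(8).items():
    total = sum(row.values())
    assert total == factorial(m)   # the identity in Proposition 21
    print(f"m={m}: sum_a C_a^(m) = {total} = {m}!")
```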

Lemma 22.

Suppose that $M:=\sup_{\boldsymbol{x}\in\mathcal{X}}|v^{*}(\boldsymbol{x})|<\infty$. Then

supfnff0ρnn|ff0,vnff0,v|=op(1).\sup_{\begin{subarray}{c}f\in\mathcal{F}_{n}\\ \left\|{f-f_{0}}\right\|\leq\rho_{n}\end{subarray}}\sqrt{n}\left|{\left\langle{f-f_{0}},{v^{*}}\right\rangle_{n}-\left\langle{f-f_{0}},{v^{*}}\right\rangle}\right|=o_{p}(1).

In particular,

|f^nf0,vnf^nf0,v|=op(n1/2).\left|{\left\langle{\hat{f}_{n}-f_{0}},{v^{*}}\right\rangle_{n}-\left\langle{\hat{f}_{n}-f_{0}},{v^{*}}\right\rangle}\right|=o_{p}(n^{-1/2}).
Proof.

Consider the function class

n={(ff0)v:fn}.\mathcal{H}_{n}=\left\{(f-f_{0})v^{*}:f\in\mathcal{F}_{n}\right\}.

Let {f1,,fN}\{f_{1},\ldots,f_{N}\} be a minimal ϵ\epsilon-cover of n\mathcal{F}_{n} with respect to the L2(n)L_{2}(\mathbb{P}_{n})-norm so that N=N(ϵ,n,L2(n))N=N(\epsilon,\mathcal{F}_{n},L_{2}(\mathbb{P}_{n})). Define

hj=(fjf0)v,j=1,,N.h_{j}=(f_{j}-f_{0})v^{*},\quad j=1,\ldots,N.

Note that for any hnh\in\mathcal{H}_{n}, there exists fnf\in\mathcal{F}_{n} such that h=(ff0)vh=(f-f_{0})v^{*}. For such a function ff, we can find j{1,,N}j\in\{1,\ldots,N\} so that ffjn<ϵ\left\|{f-f_{j}}\right\|_{n}<\epsilon. Moreover, since

hhjn\displaystyle\left\|{h-h_{j}}\right\|_{n} =(ff0)v(fjf0)vn\displaystyle=\left\|{(f-f_{0})v^{*}-(f_{j}-f_{0})v^{*}}\right\|_{n}
=(ffj)vn\displaystyle=\left\|{(f-f_{j})v^{*}}\right\|_{n}
Mffjn\displaystyle\leq M\left\|{f-f_{j}}\right\|_{n}
\displaystyle<M\epsilon,

the functions $h_{1},\ldots,h_{N}$ form an $M\epsilon$-cover for $\mathcal{H}_{n}$, and hence

N(Mϵ,n,L2(n))N(ϵ,n,L2(n)),N(M\epsilon,\mathcal{H}_{n},L_{2}(\mathbb{P}_{n}))\leq N(\epsilon,\mathcal{F}_{n},L_{2}(\mathbb{P}_{n})),

which implies that $\log N(M\epsilon,\mathcal{H}_{n},L_{2}(\mathbb{P}_{n}))\leq H(\epsilon)$ under (C1). On the other hand, since $\mathcal{F}_{n}$ is uniformly bounded, we know that $B:=\sup_{f\in\mathcal{F}_{n}}\sup_{\boldsymbol{x}\in\mathcal{X}}|f(\boldsymbol{x})|<\infty$. Hence, for any $h\in\mathcal{H}_{n}$,

sup𝒙𝒳|h(𝒙)|\displaystyle\sup_{\boldsymbol{x}\in\mathcal{X}}|h(\boldsymbol{x})| sup𝒙𝒳|f(𝒙)f0(𝒙)|sup𝒙𝒳|v(𝒙)|\displaystyle\leq\sup_{\boldsymbol{x}\in\mathcal{X}}|f(\boldsymbol{x})-f_{0}(\boldsymbol{x})|\sup_{\boldsymbol{x}\in\mathcal{X}}|v^{*}(\boldsymbol{x})|
(B+sup𝒙𝒳|f0(𝒙)|)M<,\displaystyle\leq(B+\sup_{\boldsymbol{x}\in\mathcal{X}}|f_{0}(\boldsymbol{x})|)M<\infty,

which implies that $\mathcal{H}_{n}$ is uniformly bounded. It then follows from a theorem in van de Geer (2000) that $\mathcal{H}_{n}$ is a Donsker class. Thus, from Lemma 2.3.11 in van der Vaart and Wellner (1996), for any sequence $\delta_{n}\to 0$,

suphn,δnn|(nP)h|𝑝0,\sup_{h\in\mathcal{H}_{n,\delta_{n}}}\sqrt{n}\left|{(\mathbb{P}_{n}-P)h}\right|\xrightarrow{p}0,

where $\mathcal{H}_{n,\delta_{n}}=\left\{h_{1}-h_{2}:h_{1},h_{2}\in\mathcal{H}_{n},\left(P(h_{1}-h_{2}-P(h_{1}-h_{2}))^{2}\right)^{1/2}\leq\delta_{n}\right\}$. For $f\in\mathcal{F}_{n}$, set $h_{1}=(f-f_{0})v^{*}$ and $h_{2}=(\pi_{n}f_{0}-f_{0})v^{*}$. Since

P(h1h2P(h1h2))2\displaystyle P(h_{1}-h_{2}-P(h_{1}-h_{2}))^{2} =P(h1h2)2(P(h1h2))2\displaystyle=P(h_{1}-h_{2})^{2}-(P(h_{1}-h_{2}))^{2}
P(h1h2)2\displaystyle\leq P(h_{1}-h_{2})^{2}
P(fπnf0)2M2,\displaystyle\leq P(f-\pi_{n}f_{0})^{2}M^{2},

we obtain

supfnff0ρnn|ff0,vnff0,v|\displaystyle\sup_{\begin{subarray}{c}f\in\mathcal{F}_{n}\\ \left\|{f-f_{0}}\right\|\leq\rho_{n}\end{subarray}}\sqrt{n}\left|{\left\langle{f-f_{0}},{v^{*}}\right\rangle_{n}-\left\langle{f-f_{0}},{v^{*}}\right\rangle}\right|
\displaystyle\leq supfnfπnf0ρnn|ff0,vnff0,v|\displaystyle\sup_{\begin{subarray}{c}f\in\mathcal{F}_{n}\\ \left\|{f-\pi_{n}f_{0}}\right\|\leq\rho_{n}\end{subarray}}\sqrt{n}\left|{\left\langle{f-f_{0}},{v^{*}}\right\rangle_{n}-\left\langle{f-f_{0}},{v^{*}}\right\rangle}\right|
\displaystyle\leq suph1nh1h2Mρnn|(nP)(h1h2)|+n|(nP)h2|\displaystyle\sup_{\begin{subarray}{c}h_{1}\in\mathcal{H}_{n}\\ \left\|{h_{1}-h_{2}}\right\|\leq M\rho_{n}\end{subarray}}\sqrt{n}\left|{(\mathbb{P}_{n}-P)(h_{1}-h_{2})}\right|+\sqrt{n}|(\mathbb{P}_{n}-P)h_{2}|
=\displaystyle= op(1)+n|(nP)h2|.\displaystyle o_{p}(1)+\sqrt{n}|(\mathbb{P}_{n}-P)h_{2}|.

Moreover, let H(𝒙1,,𝒙n)=n1i=1n(πnf0(𝒙i)f0(𝒙i))v(𝒙i)H(\boldsymbol{x}_{1},\ldots,\boldsymbol{x}_{n})=n^{-1}\sum_{i=1}^{n}(\pi_{n}f_{0}(\boldsymbol{x}_{i})-f_{0}(\boldsymbol{x}_{i}))v^{*}(\boldsymbol{x}_{i}) and note that

sup𝒙i,𝒙i𝒳|H(𝒙1,,𝒙i,,𝒙n)H(𝒙1,,𝒙i,,𝒙n)|\displaystyle\sup_{\boldsymbol{x}_{i},\boldsymbol{x}_{i}^{\prime}\in\mathcal{X}}|H(\boldsymbol{x}_{1},\ldots,\boldsymbol{x}_{i},\ldots,\boldsymbol{x}_{n})-H(\boldsymbol{x}_{1},\ldots,\boldsymbol{x}_{i}^{\prime},\ldots,\boldsymbol{x}_{n})|
=\displaystyle= sup𝒙i,𝒙i𝒳1n|(πnf0f0)(𝒙i)v(𝒙i)(πnf0f0)(𝒙i)v(𝒙i)|\displaystyle\sup_{\boldsymbol{x}_{i},\boldsymbol{x}_{i}^{\prime}\in\mathcal{X}}\frac{1}{n}\left|{(\pi_{n}f_{0}-f_{0})(\boldsymbol{x}_{i})v^{*}(\boldsymbol{x}_{i})-(\pi_{n}f_{0}-f_{0})(\boldsymbol{x}_{i}^{\prime})v^{*}(\boldsymbol{x}_{i}^{\prime})}\right|
\displaystyle\leq\frac{2M}{n}\sup_{\boldsymbol{x}\in\mathcal{X}}|\pi_{n}f_{0}(\boldsymbol{x})-f_{0}(\boldsymbol{x})|.

By McDiarmid’s inequality (McDiarmid, 1989), for all $t>0$,

\displaystyle\mathbb{P}\left(\sqrt{n}|(\mathbb{P}_{n}-P)h_{2}|>t\right)\leq 2\exp\left\{-\frac{t^{2}}{2M^{2}\left(\sup_{\boldsymbol{x}\in\mathcal{X}}|\pi_{n}f_{0}(\boldsymbol{x})-f_{0}(\boldsymbol{x})|\right)^{2}}\right\}\to 0\textrm{ as }n\to\infty,

which implies that n|(nP)h2|=op(1)\sqrt{n}|(\mathbb{P}_{n}-P)h_{2}|=o_{p}(1) and hence the desired claim follows. ∎
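As a final numerical illustration of the McDiarmid step, the sketch below replaces $\pi_{n}f_{0}-f_{0}$ and $v^{*}$ by arbitrary bounded stand-ins and compares the tail of $\sqrt{n}|(\mathbb{P}_{n}-P)h_{2}|$ with the Gaussian bound used above. Both stand-in functions, the bounds $\sup|g|=0.1$ and $M=1$, and all sample sizes are assumptions made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 500, 5000

g = lambda x: 0.1 * np.cos(3 * x)   # stand-in for pi_n f_0 - f_0, sup|g| = 0.1
v = lambda x: np.sin(x)             # stand-in for v*, so M = sup|v*| <= 1
sup_g, M = 0.1, 1.0

# Approximate P h_2 = E[g(X) v*(X)] on a large sample.
x_big = rng.uniform(0, 1, 10**6)
mean_h = np.mean(g(x_big) * v(x_big))

# Monte Carlo replicates of sqrt(n) |(P_n - P) h_2|.
x = rng.uniform(0, 1, (reps, n))
dev = np.sqrt(n) * np.abs((g(x) * v(x)).mean(axis=1) - mean_h)

for t in [0.05, 0.10, 0.20]:
    bound = min(2 * np.exp(-t**2 / (2 * M**2 * sup_g**2)), 1.0)
    print(f"t={t}: empirical tail {np.mean(dev > t):.4f} <= bound {bound:.4f}")
```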