
thanks: This research was partly supported by the French Agence Nationale de la Recherche (ANR 2011 BS01 010 01 projet Calibration).

Optimal kernel selection for density estimation

M. Lerasle Univ. Nice Sophia Antipolis LJAD CNRS UMR 7351
06100 Nice France
mlerasle@unice.fr
   N. Magalhães INRIA, Select Project
Univ. Paris-Sud 11
Département de Mathématiques d'Orsay
91405 Orsay Cedex - France
nelo.moltermagalhaes@gmail.com
   P. Reynaud-Bouret Univ. Nice Sophia Antipolis LJAD CNRS UMR 7351
06100 Nice France
Patricia.REYNAUD-BOURET@unice.fr
Abstract

We provide new general kernel selection rules thanks to penalized least-squares criteria. We derive optimal oracle inequalities using adequate concentration tools. We also investigate the problem of minimal penalty as described in [BM07].

keywords:
density estimation, kernel estimators, optimal penalty, minimal penalty, oracle inequalities

1 Introduction

Concentration inequalities are central in the analysis of adaptive nonparametric statistics. They lead to sharp penalized criteria for model selection [Mas07], to select bandwidths and even approximation kernels for Parzen’s estimators in high dimension [GL11], to aggregate estimators [RT07] and to properly calibrate thresholds [DJKP96].

In the present work, we are interested in the selection of a general kernel estimator based on a least-squares density estimation approach. The problem has been considered in $L^{1}$-loss by Devroye and Lugosi [DL01]. Other methods combining log-likelihood and roughness/smoothness penalties have also been proposed in [EL99b, EL99a, EL01]. However, these estimators are usually quite difficult to compute in practice. We propose here to minimize penalized least-squares criteria and to obtain from them more easily computable estimators. Sharp concentration inequalities for U-statistics [GLZ00, Ada06, HRB03] control the variance term of the kernel estimators, whose asymptotic behavior has been precisely described, for instance in [MS11, MS15, DO13]. We derive from these bounds (see Proposition 4.1) a penalization method to select a kernel which satisfies an asymptotically optimal oracle inequality, i.e. with leading constant asymptotically equal to $1$.

In the spirit of [GN09], we use an extended definition of kernels that allows us to deal simultaneously with classical collections of estimators such as projection estimators, weighted projection estimators, or Parzen's estimators. This method can be used for example to select an optimal model in model selection (in accordance with [Mas07]) or to select an optimal bandwidth together with an optimal approximation kernel among a finite collection of Parzen's estimators. In this sense, our method deals, in particular, with the same problem as that of Goldenshluger and Lepski [GL11], and we establish in this framework that a leading constant 1 in the oracle inequality is indeed possible.

Another main consequence of concentration inequalities is to prove the existence of a minimal level of penalty, under which no oracle inequality can hold. Birgé and Massart shed light on this phenomenon in a Gaussian setting for model selection [BM07]. Moreover, in this setting, they prove that the optimal penalty is twice the minimal one. In addition, there is a sharp phase transition in the dimension of the selected models, leading in their case to an estimate of the optimal penalty (which is known up to a multiplicative constant). Indeed, starting from the idea that in many models the optimal penalty is twice the minimal one (this is the slope heuristic), Arlot and Massart [AM09] propose to detect the minimal penalty by the phase transition and to apply the "$\times 2$" rule (this is the slope algorithm). They prove that this algorithm works at least in some regression settings.

In the present work, we also show that minimal penalties exist in the density estimation setting. In particular, we exhibit a sharp "phase transition" in the behavior of the selected estimator around this minimal penalty. The analysis of this last result is not standard, however. First, the "slope heuristic" of [BM07] only holds in particular cases such as the selection of projection estimators, see also [Ler12]. As in the selection of a linear estimator in a regression setting [1], the heuristic can sometimes be corrected: for example for the selection of a bandwidth when the approximation kernel is fixed. In general, since there is no simple relation between the minimal penalty and the optimal one, the slope algorithm of [AM09] should only be used with care for kernel selection. Surprisingly, our work reveals that the minimal penalty can be negative. In this case, minimizing an unpenalized criterion leads to oracle estimators. To our knowledge, such a phenomenon has only been noticed previously in a very particular classification setting [FT06]. We illustrate all of these different behaviors by means of a simulation study.

In Section 2, after fixing the main notation, providing some examples and defining the framework, we explain our goal, describe what we mean by an oracle inequality and state the exponential inequalities that we shall need. Then we derive optimal penalties in Section 3 and study the problem of minimal penalties in Section 4. All of these results are illustrated for our three main examples: projection kernels, approximation kernels and weighted projection kernels. In Section 5, some simulations are performed in the approximation kernel case. The main proofs are detailed in Section 6 and technical results are discussed in the appendix.

2 Kernel selection for least-squares density estimation

2.1 Setting

Let $X,Y,X_{1},\ldots,X_{n}$ denote i.i.d. random variables taking values in the measurable space $(\mathbb{X},\mathcal{X},\mu)$, with common distribution $P$. Assume $P$ has density $s$ with respect to $\mu$ and that $s$ is uniformly bounded. Hence, $s$ belongs to $L^{2}$, where, for any $p\geq 1$,

L^{p}:=\left\{\left.t:\mathbb{X}\to\mathbb{R},\,\mbox{ s.t. }\,\left\lVert t\right\rVert_{p}^{p}:=\int\left\lvert t\right\rvert^{p}d\mu<\infty\right.\right\}\enspace.

Moreover, $\left\lVert\cdot\right\rVert=\left\lVert\cdot\right\rVert_{2}$ and $\langle\cdot,\cdot\rangle$ denote respectively the $L^{2}$-norm and the associated inner product, and $\left\lVert\cdot\right\rVert_{\infty}$ is the supremum norm. We systematically use $x\vee y$ and $x\wedge y$ for $\max(x,y)$ and $\min(x,y)$ respectively, and denote by $|A|$ the cardinality of the set $A$. Recall that $x_{+}=x\vee 0$ and, for any $y\in\mathbb{R}^{+}$, $\left\lfloor y\right\rfloor=\sup\{n\in\mathbb{N}\,\mbox{ s.t. }\,n\leq y\}$.

Let $\left\{k\right\}_{k\in\mathcal{K}}$ denote a collection of symmetric functions $k:\mathbb{X}^{2}\to\mathbb{R}$ indexed by some given finite set $\mathcal{K}$ such that

\sup_{x\in\mathbb{X}}~\int_{\mathbb{X}}k(x,y)^{2}d\mu(y)~\vee\sup_{(x,y)\in\mathbb{X}^{2}}\left\lvert k(x,y)\right\rvert<\infty\enspace.

A function $k$ satisfying these assumptions is called a kernel in the sequel. A kernel $k$ is associated with an estimator $\widehat{s}_{k}$ of $s$ defined for any $x\in\mathbb{X}$ by

\widehat{s}_{k}(x):=\frac{1}{n}\sum_{i=1}^{n}k(X_{i},x)\enspace.

Our aim is to select a “good” $\widehat{s}_{\hat{k}}$ in the family $\{\widehat{s}_{k},k\in\mathcal{K}\}$. Our results are expressed in terms of a constant $\Gamma\geq 1$ such that, for all $k\in\mathcal{K}$,

\sup_{x\in\mathbb{X}}~\int_{\mathbb{X}}k(x,y)^{2}d\mu(y)~\vee\sup_{(x,y)\in\mathbb{X}^{2}}\left\lvert k(x,y)\right\rvert\leq\Gamma n\enspace. (1)

This condition plays the same role as $\int|k(x,y)|s(y)d\mu(y)<\infty$, the milder condition used in [DL01] when working with $L^{1}$-losses. Before describing the method, let us give three examples of such estimators that are used for density estimation, and see how they can naturally be associated to some kernels. Section A of the appendix gives the computations leading to the corresponding $\Gamma$'s.

Example 1: Projection estimators.

Projection estimators are among the most classical density estimators. Given a linear subspace $S\subset L^{2}$, the projection estimator on $S$ is defined by

\widehat{s}_{S}=\arg\min_{t\in S}\left\{\left.\left\lVert t\right\rVert^{2}-\frac{2}{n}\sum_{i=1}^{n}t(X_{i})\right.\right\}\enspace.

Let $\mathcal{S}$ be a family of linear subspaces $S$ of $L^{2}$. For any $S\in\mathcal{S}$, let $(\varphi_{\ell})_{\ell\in\mathcal{I}_{S}}$ denote an orthonormal basis of $S$. The projection estimator $\widehat{s}_{S}$ can be computed and is equal to

\widehat{s}_{S}=\sum_{\ell\in\mathcal{I}_{S}}\left(\left.\frac{1}{n}\sum_{i=1}^{n}\varphi_{\ell}(X_{i})\right.\right)\varphi_{\ell}\enspace.

It is therefore easy to see that it is the estimator associated to the projection kernel $k_{S}$ defined for any $x$ and $y$ in $\mathbb{X}$ by

k_{S}(x,y):=\sum_{\ell\in\mathcal{I}_{S}}\varphi_{\ell}(x)\varphi_{\ell}(y)\enspace.

Notice that $k_{S}$ actually depends on the basis $(\varphi_{\ell})_{\ell\in\mathcal{I}_{S}}$ even if $\widehat{s}_{S}$ does not. In the sequel, we always assume that some orthonormal basis $(\varphi_{\ell})_{\ell\in\mathcal{I}_{S}}$ is given with $S$. Given a finite collection $\mathcal{S}$ of linear subspaces of $L^{2}$, one can choose the following constant $\Gamma$ in (1) for the collection $(k_{S})_{S\in\mathcal{S}}$:

\Gamma=1\vee\frac{1}{n}\sup_{S\in\mathcal{S}}\sup_{f\in S,\left\lVert f\right\rVert=1}\left\lVert f\right\rVert_{\infty}^{2}\enspace. (2)

Example 2: Parzen’s estimators.

Given a bounded symmetric integrable function $K:\mathbb{R}\to\mathbb{R}$ such that $\int_{\mathbb{R}}K(u)du=1$ and $K(0)>0$, and a bandwidth $h>0$, the Parzen estimator is defined by

\forall x\in\mathbb{R},\quad\widehat{s}_{K,h}(x)=\frac{1}{nh}\sum_{i=1}^{n}K\left(\left.\frac{x-X_{i}}{h}\right.\right)\enspace.

It can also naturally be seen as a kernel estimator, associated to the function $k_{K,h}$ defined for any $x$ and $y$ in $\mathbb{R}$ by

k_{K,h}(x,y):=\frac{1}{h}K\left(\left.\frac{x-y}{h}\right.\right)\enspace.

We shall call the function $k_{K,h}$ an approximation or Parzen kernel.
Given a finite collection of pairs $(K,h)\in\mathcal{H}$, one can choose $\Gamma=1$ in (1) if

h\geq\frac{\left\lVert K\right\rVert_{\infty}\left\lVert K\right\rVert_{1}}{n}\quad\mbox{ for any }(K,h)\in\mathcal{H}\enspace. (3)

Example 3: Weighted projection estimators.

Let $(\varphi_{i})_{i=1,\ldots,p}$ denote an orthonormal system in $L^{2}$ and let $w=(w_{i})_{i=1,\ldots,p}$ denote real numbers in $[0,1]$. The associated weighted kernel projection estimator of $s$ is defined by

\widehat{s}_{w}=\sum_{i=1}^{p}w_{i}\left(\left.\frac{1}{n}\sum_{j=1}^{n}\varphi_{i}(X_{j})\right.\right)\varphi_{i}\enspace.

These estimators are used to derive very sharp adaptive results. In particular, Pinsker's estimators are weighted kernel projection estimators (see for example [Rig06]). When $w\in\left\{0,1\right\}^{p}$, we recover a classical projection estimator. A weighted projection estimator is associated to the weighted projection kernel defined for any $x$ and $y$ in $\mathbb{X}$ by

k_{w}(x,y):=\sum_{i=1}^{p}w_{i}\varphi_{i}(x)\varphi_{i}(y)\enspace.

Given any finite collection $\mathcal{W}$ of weights, one can choose in (1)

\Gamma=1\vee\left(\left.\frac{1}{n}\sup_{x\in\mathbb{X}}\sum_{i=1}^{p}\varphi_{i}(x)^{2}\right.\right)\enspace. (4)

2.2 Oracle inequalities and penalized criterion

The goal is to estimate $s$ in the best possible way using a finite collection of kernel estimators $(\widehat{s}_{k})_{k\in\mathcal{K}}$. In other words, the purpose is to select from the data an estimator $\widehat{s}_{\widehat{k}}$ among $(\widehat{s}_{k})_{k\in\mathcal{K}}$ such that $\left\lVert\widehat{s}_{\widehat{k}}-s\right\rVert^{2}$ is as close as possible to $\inf_{k\in\mathcal{K}}\left\lVert\widehat{s}_{k}-s\right\rVert^{2}$. More precisely, our aim is to select $\widehat{k}$ such that, with high probability,

\left\lVert\widehat{s}_{{\widehat{k}}}-s\right\rVert^{2}\leq C_{n}\inf_{k\in\mathcal{K}}\left\lVert\widehat{s}_{k}-s\right\rVert^{2}+R_{n}\enspace, (5)

where $C_{n}\geq 1$ is the leading constant and $R_{n}>0$ is usually a remainder term. In this case, $\widehat{s}_{{\widehat{k}}}$ is said to satisfy an oracle inequality, as long as $R_{n}$ is small compared to $\inf_{k\in\mathcal{K}}\left\lVert\widehat{s}_{k}-s\right\rVert^{2}$ and $C_{n}$ is a bounded sequence. This means that the selected estimator does as well as the best estimator in the family, up to some multiplicative constant. The best case one can expect is to get $C_{n}$ close to $1$. This is why, when $C_{n}\to_{n\to\infty}1$, the corresponding oracle inequality is called asymptotically optimal. To do so, we study minimizers of penalized least-squares criteria. Note that in our three examples choosing $\widehat{k}\in\mathcal{K}$ amounts to choosing the smoothing parameter, that is, respectively, choosing $\widehat{S}\in\mathcal{S}$, $(\widehat{K},\widehat{h})\in\mathcal{H}$ or $\widehat{w}\in\mathcal{W}$.

Let $P_{n}$ denote the empirical measure, that is, for any real valued function $t$,

P_{n}(t):=\frac{1}{n}\sum_{i=1}^{n}t(X_{i})\enspace.

For any $t\in L^{2}$, let also $P(t):=\int_{\mathbb{X}}t(x)s(x)d\mu(x)$.
The least-squares contrast is defined, for any $t\in L^{2}$, by

\gamma(t):=\left\lVert t\right\rVert^{2}-2t\enspace.

Then for any given function $\operatorname{pen}:\mathcal{K}\to\mathbb{R}$, the least-squares penalized criterion is defined by

\mathcal{C}_{\operatorname{pen}}(k):=P_{n}\gamma(\widehat{s}_{k})+\operatorname{pen}(k)\enspace. (6)

Finally the selected $\widehat{k}\in\mathcal{K}$ is given by any minimizer of $\mathcal{C}_{\operatorname{pen}}(k)$, that is,

{\widehat{k}}\in\arg\min_{k\in\mathcal{K}}\left\{\left.\mathcal{C}_{\operatorname{pen}}(k)\right.\right\}\enspace. (7)

As $P\gamma(t)=\left\lVert t-s\right\rVert^{2}-\left\lVert s\right\rVert^{2}$, it is equivalent to minimize $\left\lVert\widehat{s}_{k}-s\right\rVert^{2}$ or $P\gamma(\widehat{s}_{k})$. As our goal is to select $\widehat{s}_{\widehat{k}}$ satisfying an oracle inequality, an ideal penalty $\operatorname{pen}_{\mathrm{id}}$ should satisfy $\mathcal{C}_{\operatorname{pen}_{\mathrm{id}}}(k)=P\gamma(\widehat{s}_{k})$, i.e. criterion (6) with

\operatorname{pen}_{\mathrm{id}}(k):=(P-P_{n})\gamma(\widehat{s}_{k})=2(P_{n}-P)(\widehat{s}_{k})\enspace.

To identify the main quantities of interest, let us introduce some notation and develop $\operatorname{pen}_{\mathrm{id}}(k)$. For all $k\in\mathcal{K}$, let

s_{k}(x):=\int_{\mathbb{X}}k(y,x)s(y)d\mu(y)=\mathbb{E}\left[\left.k(X,x)\right.\right],\qquad\forall x\in\mathbb{X}\enspace,

and

U_{k}:=\sum_{i\neq j=1}^{n}\left(\left.k(X_{i},X_{j})-s_{k}(X_{i})-s_{k}(X_{j})+\mathbb{E}\left[\left.k(X,Y)\right.\right]\right.\right)\enspace.

Because those quantities are fundamental in the sequel, let us also define $\Theta_{k}(x)=A_{k}(x,x)$, where, for $(x,y)\in\mathbb{X}^{2}$,

A_{k}(x,y):=\int_{\mathbb{X}}k(x,z)k(z,y)d\mu(z)\enspace. (8)

Denoting

\mbox{for all }x\in\mathbb{X},\quad\chi_{k}(x)=k(x,x)\enspace,

the ideal penalty is then equal to

\operatorname{pen}_{\mathrm{id}}(k)=2(P_{n}-P)(\widehat{s}_{k}-s_{k})+2(P_{n}-P)s_{k}
=2\left(\left.\frac{P\chi_{k}-Ps_{k}}{n}+\frac{(P_{n}-P)\chi_{k}}{n}+\frac{U_{k}}{n^{2}}+\left(\left.1-\frac{2}{n}\right.\right)(P_{n}-P)s_{k}\right.\right)\enspace. (9)

The main point is that, by using concentration inequalities, we obtain

\operatorname{pen}_{\mathrm{id}}(k)\simeq 2\left(\left.\frac{P\chi_{k}-Ps_{k}}{n}\right.\right)\enspace.

The term $Ps_{k}/n$ depends on $s$, which is unknown. Fortunately, it can be easily controlled as detailed in the sequel. Therefore one can hope that the choice

\operatorname{pen}(k)=2\frac{P\chi_{k}}{n}

is convenient. In general, this choice still depends on the unknown density $s$, but it can be easily estimated in a data-driven way by

\operatorname{pen}(k)=2\frac{P_{n}\chi_{k}}{n}\enspace.

The goal of Section 3 is to prove this heuristic and to show that $2P\chi_{k}/n$ and $2P_{n}\chi_{k}/n$ are optimal choices for the penalty, that is, they lead to an asymptotically optimal oracle inequality.
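To make the selection rule concrete, here is a minimal sketch (in Python; it instantiates $\mathcal{K}$ as a family of Gaussian Parzen kernels purely for illustration) of the criterion (6) with the data-driven penalty $2P_{n}\chi_{k}/n$. It uses that $P_{n}\gamma(\widehat{s}_{k})=\left\lVert\widehat{s}_{k}\right\rVert^{2}-2P_{n}(\widehat{s}_{k})$ and, for the Gaussian kernel, the closed form of $\left\lVert\widehat{s}_{K,h}\right\rVert^{2}$ coming from the convolution of two Gaussian densities.

```python
import numpy as np

SQRT_2PI = np.sqrt(2 * np.pi)

def crit_gaussian(X, h):
    """C_pen(k_{K,h}) = P_n gamma(s_hat_{K,h}) + 2 P_n chi_k / n for the Gaussian kernel K.
    P_n gamma(s_hat) = ||s_hat||^2 - 2 P_n(s_hat); the L2 norm uses K * K = N(0, 2) density."""
    X = np.asarray(X)
    n = len(X)
    D = (X[:, None] - X[None, :]) / h                                  # (X_i - X_j) / h
    norm_sq = np.sum(np.exp(-D ** 2 / 4) / (2 * np.sqrt(np.pi))) / (n ** 2 * h)
    Pn_s_hat = np.sum(np.exp(-D ** 2 / 2) / SQRT_2PI) / (n ** 2 * h)
    pen = 2 * (1 / SQRT_2PI) / (n * h)                                 # 2 P_n chi_k / n = 2 K(0) / (n h)
    return norm_sq - 2 * Pn_s_hat + pen

# toy usage: pick the bandwidth minimizing the penalized criterion over a small grid
rng = np.random.default_rng(1)
X = rng.standard_normal(200)
grid = [0.05 * i for i in range(1, 21)]
h_hat = min(grid, key=lambda h: crit_gaussian(X, h))
print("selected bandwidth:", h_hat)
```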

2.3 Concentration tools

To derive sharp oracle inequalities, we only need two fundamental concentration tools, namely a weak Bernstein's inequality and a concentration bound for degenerate U-statistics of order two. We state them here in the form most suitable for our purpose.

A weak Bernstein’s inequality.

Proposition 2.1.

For any bounded real valued function $f$, any $X_{1},\ldots,X_{n}$ i.i.d. with distribution $P$ and any $u>0$,

\mathbb{P}\left((P_{n}-P)f\geq\sqrt{\frac{2P\left(\left.f^{2}\right.\right)u}{n}}+\frac{\left\lVert f\right\rVert_{\infty}u}{3n}\right)\leq\exp(-u)\enspace.

The proof is straightforward and can be derived from either Bennett’s or Bernstein’s inequality [BLM13].

Concentration of degenerate U-statistics of order 2.

Proposition 2.2.

Let $X,X_{1},\ldots,X_{n}$ be i.i.d. random variables defined on a Polish space $\mathbb{X}$ equipped with its Borel $\sigma$-algebra and let $\left(f_{i,j}\right)_{1\leq i\not=j\leq n}$ denote bounded real valued symmetric measurable functions defined on $\mathbb{X}^{2}$, such that for any $i\not=j$, $f_{i,j}=f_{j,i}$ and

\forall~i,j\mbox{ s.t. }1\leq i\neq j\leq n,\qquad\mathbb{E}\left[\left.f_{i,j}(x,X)\right.\right]=0\qquad\mbox{for\;a.e.\;}x\mbox{ in }\mathbb{X}\enspace. (10)

Let $U$ be the following totally degenerate $U$-statistic of order $2$,

U=\sum_{1\leq i\neq j\leq n}f_{i,j}(X_{i},X_{j})\enspace.

Let $A$ be an upper bound of $\left\lvert f_{i,j}(x,y)\right\rvert$ for any $i,j,x,y$ and

B^{2}=\max\left(\left.\sup_{i,x\in\mathbb{X}}\sum_{j=1}^{i-1}\mathbb{E}\left[\left.f_{i,j}(x,X_{j})^{2}\right.\right],\sup_{j,t\in\mathbb{X}}\sum_{i=j+1}^{n}\mathbb{E}\left[\left.f_{i,j}(X_{i},t)^{2}\right.\right]\right.\right)
C^{2}=\sum_{1\leq i\neq j\leq n}\mathbb{E}\left[\left.f_{i,j}(X_{i},X_{j})^{2}\right.\right]
D=\sup_{(a,b)\in\mathcal{A}}\mathbb{E}\left[\left.\sum_{1\leq i<j\leq n}f_{i,j}(X_{i},X_{j})a_{i}(X_{i})b_{j}(X_{j})\right.\right]\enspace,

where $\mathcal{A}=\left\{\left.(a,b),\,\mbox{ s.t. }\,\mathbb{E}\left[\left.\sum_{i=1}^{n-1}a_{i}(X_{i})^{2}\right.\right]\leq 1,\;\mathbb{E}\left[\left.\sum_{j=2}^{n}b_{j}(X_{j})^{2}\right.\right]\leq 1\right.\right\}$.
Then there exists some absolute constant $\kappa>0$ such that, for any $u>0$, with probability larger than $1-2.7e^{-u}$,

U\leq\kappa\left(C\sqrt{u}+Du+Bu^{3/2}+Au^{2}\right)\enspace.

The present result is a simplification of Theorem 3.4.8 in [GN15], which provides explicit constants for variables defined on a Polish space. It is mainly inspired by [HRB03], where the result was stated only for real-valued variables. This inequality actually dates back to Giné, Latala and Zinn [GLZ00]. The result has been further generalized by Adamczak to U-statistics of any order [Ada06], though the constants there are not explicit.

3 Optimal penalties for kernel selection

The main aim of this section is to show that $2P\chi_{k}/n$ is a theoretical optimal penalty for kernel selection. This means that if $\operatorname{pen}(k)$ is close to $2P\chi_{k}/n$, the selected kernel $\widehat{k}$ satisfies an asymptotically optimal oracle inequality.

3.1 Main assumptions

To express our results in a simple form, a positive constant $\Upsilon$ is assumed to control, for any $k$ and $k^{\prime}$ in $\mathcal{K}$, all of the following quantities:

\left(\left.\Gamma(1+\left\lVert s\right\rVert_{\infty})\right.\right)\vee\sup_{k\in\mathcal{K}}\left\lVert s_{k}\right\rVert^{2}\leq\Upsilon\enspace, (11)
P\left(\left.\chi_{k}^{2}\right.\right)\leq\Upsilon nP\Theta_{k}\enspace, (12)
\left\lVert s_{k}-s_{k^{\prime}}\right\rVert_{\infty}\leq\Upsilon\vee\sqrt{\Upsilon n}\left\lVert s_{k}-s_{k^{\prime}}\right\rVert\enspace, (13)
\mathbb{E}\left[\left.A_{k}(X,Y)^{2}\right.\right]\leq\Upsilon P\Theta_{k}\enspace, (14)
\sup_{x\in\mathbb{X}}~\mathbb{E}\left[\left.A_{k}(X,x)^{2}\right.\right]\leq\Upsilon n\enspace, (15)
v_{k}^{2}:=\sup_{t\in\mathbb{B}_{k}}Pt^{2}\leq\Upsilon\vee\sqrt{\Upsilon P\Theta_{k}}\enspace, (16)

where $\mathbb{B}_{k}$ is the set of functions $t$ that can be written $t(x)=\int a(z)k(z,x)d\mu(z)$ for some $a\in L^{2}$ with $\left\lVert a\right\rVert\leq 1$.

These assumptions may seem very intricate. They are actually fulfilled by our three main examples under very mild conditions (see Section 3.3).

3.2 The optimal penalty theorem

In the sequel, $\square$ denotes a positive absolute constant whose value may change from line to line; when it carries an index, as in $\square_{\theta}$, it denotes a positive function of $\theta$ and only $\theta$, whose value may change from line to line.

Theorem 3.1.

If Assumptions (11), (12), (13), (14), (15) and (16) hold, then, for any $x\geq 1$, with probability larger than $1-\square|\mathcal{K}|^{2}e^{-x}$, for any $\theta\in(0,1)$, any minimizer $\widehat{k}$ of the penalized criterion (6) satisfies the following inequality:

\forall k\in\mathcal{K},\qquad(1-4\theta)\left\lVert s-\widehat{s}_{{\widehat{k}}}\right\rVert^{2}\leq(1+4\theta)\left\lVert s-\widehat{s}_{k}\right\rVert^{2}+\left(\left.\operatorname{pen}(k)-2\frac{P\chi_{k}}{n}\right.\right)-\left(\left.\operatorname{pen}\left(\left.{\widehat{k}}\right.\right)-2\frac{P\chi_{{\widehat{k}}}}{n}\right.\right)+\square\frac{\Upsilon x^{2}}{\theta n}\enspace. (17)

Assume moreover that there exist $C>0$, $\delta^{\prime}\geq\delta>0$ and $r\geq 0$ such that, for any $x\geq 1$, with probability larger than $1-Ce^{-x}$, for any $k\in\mathcal{K}$,

(\delta-1)\frac{P\Theta_{k}}{n}-\square r\frac{\Upsilon x^{2}}{n}\leq\operatorname{pen}(k)-\frac{2P\chi_{k}}{n}\leq(\delta^{\prime}-1)\frac{P\Theta_{k}}{n}+\square r\frac{\Upsilon x^{2}}{n}\enspace. (18)

Then for all $\theta\in(0,1)$ and all $x\geq 1$, the following holds with probability at least $1-\square(C+|\mathcal{K}|^{2})e^{-x}$:

\frac{(\delta\wedge 1)-5\theta}{(\delta^{\prime}\vee 1)+(4+\delta^{\prime})\theta}\left\lVert s-\widehat{s}_{{\widehat{k}}}\right\rVert^{2}\leq\inf_{k\in\mathcal{K}}\left\lVert s-\widehat{s}_{k}\right\rVert^{2}+\square\left(\left.r+\frac{1}{\theta^{3}}\right.\right)\frac{\Upsilon x^{2}}{n}\enspace.

Let us make some remarks.

  • First, this is an oracle inequality (see (5)) with leading constant $C_{n}$ and remainder term $R_{n}$ given by

    C_{n}=\frac{(\delta^{\prime}\vee 1)+(4+\delta^{\prime})\theta}{(\delta\wedge 1)-5\theta}\quad\mbox{and}\quad R_{n}=\square C_{n}(r+\theta^{-3})\frac{\Upsilon x^{2}}{n}\enspace,

    as long as

    • $\theta$ is small enough for $C_{n}$ to be positive,

    • $x$ is large enough for the probability to be large and

    • $n$ is large enough for $R_{n}$ to be negligible.

    Typically, $r,\delta,\delta^{\prime},\theta$ and $\Upsilon$ are bounded w.r.t. $n$, and $x$ has to be of the order of $\log(|\mathcal{K}|\vee n)$ for the remainder to be negligible. In particular, $\mathcal{K}$ may grow with $n$ as long as (i) $\log(|\mathcal{K}|\vee n)^{2}$ remains negligible with respect to $n$ and (ii) $\Upsilon$ does not depend on $n$.

  • If $\operatorname{pen}(k)=2P\chi_{k}/n$, that is if $\delta=\delta^{\prime}=1$ and $r=C=0$ in (18), the estimator $\widehat{s}_{{\widehat{k}}}$ satisfies an asymptotically optimal oracle inequality, i.e. $C_{n}\to_{n\to\infty}1$, since $\theta$ can be chosen as close to $0$ as desired. Take for instance $\theta=(\log n)^{-1}$.

  • In general $P\chi_{k}$ depends on the unknown $s$ and this last penalty cannot be used in practice. Fortunately, its empirical counterpart $\operatorname{pen}(k)=2P_{n}\chi_{k}/n$ satisfies (18) with $\delta=1-\theta$, $\delta^{\prime}=1+\theta$, $r=1/\theta$ and $C=2|\mathcal{K}|$ for any $\theta\in(0,1)$, in particular for $\theta=(\log n)^{-1}$ (see (34) in Proposition B.1). Hence, the estimator $\widehat{s}_{{\widehat{k}}}$ selected with this choice of penalty also satisfies an asymptotically optimal oracle inequality, by the same argument.

  • Finally, we only get an oracle inequality when $\delta>0$, that is, when $\operatorname{pen}(k)$ is larger than $(2P\chi_{k}-P\Theta_{k})/n$ up to some residual term. We discuss the necessity of this condition in Section 4.

3.3 Main examples

This section shows that Theorem 3.1 can be applied to our examples. In addition, it provides the computation of $2P\chi_{k}/n$ in some specific cases of special interest.

Example 1 (continued).

Proposition 3.2.

Let $\left\{k_{S},S\in\mathcal{S}\right\}$ be a collection of projection kernels. Assumptions (11), (12), (14), (15) and (16) hold for any $\Upsilon\geq\Gamma(1+\left\lVert s\right\rVert_{\infty})$, where $\Gamma$ is given by (2). In addition, Assumption (13) is satisfied under either of the following classical assumptions (see [Mas07, Chapter 7]):

\forall S,S^{\prime}\in\mathcal{S},\qquad\mbox{ either }S\subset S^{\prime}\mbox{ or }S^{\prime}\subset S\enspace, (19)

or

\forall S\in\mathcal{S},\qquad\left\lVert s_{k_{S}}\right\rVert_{\infty}\leq\frac{\Upsilon}{2}\enspace. (20)

These particular projection kernels satisfy, for all $(x,y)\in\mathbb{X}^{2}$,

A_{k_{S}}(x,y)=\int_{\mathbb{X}}k_{S}(x,z)k_{S}(y,z)d\mu(z)=\sum_{(i,j)\in\mathcal{I}^{2}_{S}}\varphi_{i}(x)\varphi_{j}(y)\int_{\mathbb{X}}\varphi_{i}(z)\varphi_{j}(z)d\mu(z)=k_{S}(x,y)\enspace.

In particular, $\Theta_{k_{S}}=\chi_{k_{S}}=\sum_{i\in\mathcal{I}_{S}}\varphi_{i}^{2}$ and $2P\chi_{k_{S}}-P\Theta_{k_{S}}=P\chi_{k_{S}}$.

Moreover, it appears that the function $\Theta_{k_{S}}$ is constant on some linear spaces $S$ of interest (see [Ler12] for more details). Let us mention one particular case studied further in the sequel. Suppose $\mathcal{S}$ is a collection of regular histogram spaces $S$ on $\mathbb{X}$, that is, any $S\in\mathcal{S}$ is a space of piecewise constant functions on a partition $\mathcal{I}_{S}$ of $\mathbb{X}$ such that $\mu(i)=1/D_{S}$ for any $i$ in $\mathcal{I}_{S}$. Assumption (20) is satisfied for this collection as soon as $\Upsilon\geq 2\left\lVert s\right\rVert_{\infty}$. The family $(\varphi_{i})_{i\in\mathcal{I}_{S}}$, where $\varphi_{i}=\sqrt{D_{S}}{\bf 1}_{i}$, is an orthonormal basis of $S$ and

\chi_{k_{S}}=\sum_{i\in\mathcal{I}_{S}}\varphi_{i}^{2}=D_{S}\enspace.

Hence, $P\chi_{k_{S}}=D_{S}$, and $2D_{S}/n$ can actually be used as a penalty to ensure that the selected estimator satisfies an asymptotically optimal oracle inequality. Moreover, in this example it is actually necessary to choose a penalty larger than $D_{S}/n$ to get an oracle inequality (see [Ler12] or Section 4 for more details).
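As an illustration, here is a short sketch (in Python; the sample, the grid of dimensions and the Beta density are illustrative choices) of this selection rule for regular histograms on $[0,1]$: since $P_{n}\gamma(\widehat{s}_{k_{S}})=-\sum_{\ell\in\mathcal{I}_{S}}(P_{n}\varphi_{\ell})^{2}$, the criterion with penalty $2D_{S}/n$ reduces to $-D_{S}\sum_{\ell}(N_{\ell}/n)^{2}+2D_{S}/n$, where $N_{\ell}$ is the number of observations in bin $\ell$.

```python
import numpy as np

def hist_criterion(X, D):
    """Penalized criterion for the regular histogram with D bins on [0, 1]:
    P_n gamma(s_hat_{k_S}) + 2 D / n = -D * sum_l (N_l / n)^2 + 2 D / n."""
    n = len(X)
    counts, _ = np.histogram(X, bins=D, range=(0.0, 1.0))
    return -D * np.sum((counts / n) ** 2) + 2 * D / n

# toy usage on a sample from a bounded density on [0, 1]
rng = np.random.default_rng(2)
X = rng.beta(2, 5, size=1000)
dims = range(1, 51)
D_hat = min(dims, key=lambda D: hist_criterion(X, D))
print("selected dimension:", D_hat)
```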

Example 2 (continued).

Proposition 3.3.

Let $\left\{k_{K,h},(K,h)\in\mathcal{H}\right\}$ be a collection of approximation kernels. Assumptions (11), (12), (13), (14), (15) and (16) hold with $\Gamma=1$, for any

\Upsilon\geq\max_{K}\left\{\left.\frac{K(0)}{\left\lVert K\right\rVert^{2}}\vee\left(\left.1+2\left\lVert s\right\rVert_{\infty}\left\lVert K\right\rVert_{1}^{2}\right.\right)\right.\right\}\enspace,

as soon as (3) is satisfied.

These approximation kernels satisfy, for all $x\in\mathbb{R}$,

\chi_{k_{K,h}}(x)=k_{K,h}(x,x)=\frac{K(0)}{h}\enspace,
\Theta_{k_{K,h}}(x)=A_{k_{K,h}}(x,x)=\frac{1}{h^{2}}\int_{\mathbb{R}}K\left(\left.\frac{x-y}{h}\right.\right)^{2}dy=\frac{\left\lVert K\right\rVert^{2}}{h}\enspace.

Therefore, the optimal penalty $2P\chi_{k_{K,h}}/n=2K(0)/(nh)$ can be computed in practice and yields an asymptotically optimal selection criterion. Surprisingly, the lower bound $2P\chi_{k_{K,h}}/n-P\Theta_{k_{K,h}}/n=(2K(0)-\left\lVert K\right\rVert^{2})/(nh)$ can be negative if $\left\lVert K\right\rVert^{2}>2K(0)$. In this case, a minimizer of (6) satisfies an oracle inequality even if this criterion is not penalized. This remarkable fact is illustrated in the simulation study of Section 5.

Example 3 (continued).

Proposition 3.4.

Let $\left\{k_{w},w\in\mathcal{W}\right\}$ be a collection of weighted projection kernels. Assumption (11) is valid for $\Upsilon\geq\Gamma(1+\left\lVert s\right\rVert_{\infty})$, where $\Gamma$ is given by (4). Moreover, (11) and (1) imply (12), (13), (14), (15) and (16).

For these weighted projection kernels, for all $x\in\mathbb{X}$,

\chi_{k_{w}}(x)=\sum_{i=1}^{p}w_{i}\varphi_{i}(x)^{2},\qquad\mbox{hence}\qquad P\chi_{k_{w}}=\sum_{i=1}^{p}w_{i}P\varphi_{i}^{2}\qquad\mbox{and}
\Theta_{k_{w}}(x)=\sum_{i,j=1}^{p}w_{i}w_{j}\varphi_{i}(x)\varphi_{j}(x)\int_{\mathbb{X}}\varphi_{i}(z)\varphi_{j}(z)d\mu(z)=\sum_{i=1}^{p}w_{i}^{2}\varphi_{i}(x)^{2}\leq\chi_{k_{w}}(x)\enspace.

In this case, the optimal penalty $2P\chi_{k_{w}}/n$ has to be estimated in general. However, in the following example it can still be computed directly.
Let $\mathbb{X}=[0,1]$ and let $\mu$ be the Lebesgue measure. Let $\varphi_{0}\equiv 1$ and, for any $j\geq 1$,

\varphi_{2j-1}(x)=\sqrt{2}\cos(2\pi jx),\qquad\varphi_{2j}(x)=\sqrt{2}\sin(2\pi jx)\enspace.

Consider some odd $p$ and a family of weight vectors $\mathcal{W}=\left\{w=(w_{i})_{i=0,\ldots,p}\right\}$ such that, for any $w\in\mathcal{W}$ and any $i=1,\ldots,p/2$, $w_{2i-1}=w_{2i}=\tau_{i}$. In this case, the values of the functions of interest do not depend on $x$:

\chi_{k_{w}}=w_{0}+\sum_{j=1}^{p/2}\tau_{j},\qquad\Theta_{k_{w}}=w_{0}^{2}+\sum_{j=1}^{p/2}\tau^{2}_{j}\enspace.

In particular, this family includes Pinsker's and Tikhonov's weights.
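For instance, here is a minimal sketch (in Python; the Tikhonov-type weights $\tau_{j}=1/(1+(j/j_{0})^{2})$ and the Beta sample are illustrative assumptions, not the only possible choices) of the criterion (6) for weighted projection kernels on the trigonometric basis above, with the empirical penalty $2P_{n}\chi_{k_{w}}/n=\frac{2}{n}\sum_{i}w_{i}P_{n}(\varphi_{i}^{2})$; here $P_{n}\gamma(\widehat{s}_{w})=\sum_{i}(w_{i}^{2}-2w_{i})(P_{n}\varphi_{i})^{2}$.

```python
import numpy as np

def trig_basis(X, p):
    """Values phi_i(X_m), i = 0..p (p odd), of the trigonometric basis on [0, 1]."""
    X = np.asarray(X)
    cols = [np.ones_like(X)]
    for j in range(1, (p + 1) // 2 + 1):
        cols.append(np.sqrt(2) * np.cos(2 * np.pi * j * X))
        cols.append(np.sqrt(2) * np.sin(2 * np.pi * j * X))
    return np.column_stack(cols)[:, : p + 1]        # shape (n, p + 1)

def weighted_crit(Phi, w):
    """C_pen(k_w) = sum_i (w_i^2 - 2 w_i) (P_n phi_i)^2 + (2/n) sum_i w_i P_n(phi_i^2)."""
    n = Phi.shape[0]
    a = Phi.mean(axis=0)                            # P_n(phi_i)
    b = (Phi ** 2).mean(axis=0)                     # P_n(phi_i^2)
    return np.sum((w ** 2 - 2 * w) * a ** 2) + 2 * np.sum(w * b) / n

# toy usage: select a Tikhonov-type smoothing level j0 among a small grid
rng = np.random.default_rng(3)
X = rng.beta(2, 2, size=500)                        # a bounded density on [0, 1]
p = 21                                              # odd p, as in the example above
Phi = trig_basis(X, p)
js = np.concatenate(([0], np.repeat(np.arange(1, p // 2 + 2), 2)))[: p + 1]  # frequency of phi_i
grid = [0.5, 1, 2, 4, 8, 16]
j0_hat = min(grid, key=lambda j0: weighted_crit(Phi, 1.0 / (1.0 + (js / j0) ** 2)))
print("selected smoothing level j0:", j0_hat)
```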

4 Minimal penalties for kernel selection

The purpose of this section is to see whether the lower bound $\operatorname{pen}_{\mathrm{min}}(k):=(2P\chi_{k}-P\Theta_{k})/n$ is sharp in Theorem 3.1. To do so, we first need the following result, which links $\left\lVert s-\widehat{s}_{k}\right\rVert$ to deterministic quantities, thanks to concentration tools.

4.1 Bias-Variance decomposition with high probability

Proposition 4.1.

Assume $\left\{k\right\}_{k\in\mathcal{K}}$ is a finite collection of kernels satisfying Assumptions (11), (12), (13), (14), (15) and (16). For all $x>1$ and all $\eta$ in $(0,1]$, with probability larger than $1-\square|\mathcal{K}|e^{-x}$,

\left\lVert s_{k}-\widehat{s}_{k}\right\rVert^{2}\leq(1+\eta)\frac{P\Theta_{k}}{n}+\square\frac{\Upsilon x^{2}}{\eta n}\enspace,
\frac{P\Theta_{k}}{n}\leq(1+\eta)\left\lVert s_{k}-\widehat{s}_{k}\right\rVert^{2}+\square\frac{\Upsilon x^{2}}{\eta n}\enspace.

Moreover, for all $x>1$ and all $\eta$ in $(0,1)$, with probability larger than $1-\square|\mathcal{K}|e^{-x}$, for all $k\in\mathcal{K}$, each of the following inequalities holds:

\left\lVert s-\widehat{s}_{k}\right\rVert^{2}\leq(1+\eta)\left(\left.\left\lVert s-s_{k}\right\rVert^{2}+\frac{P\Theta_{k}}{n}\right.\right)+\square\frac{\Upsilon x^{2}}{\eta^{3}n}\enspace,
\left\lVert s-s_{k}\right\rVert^{2}+\frac{P\Theta_{k}}{n}\leq(1+\eta)\left\lVert s-\widehat{s}_{k}\right\rVert^{2}+\square\frac{\Upsilon x^{2}}{\eta^{3}n}\enspace.

This means that, not only in expectation but also with high probability, the term $\left\lVert s-\widehat{s}_{k}\right\rVert^{2}$ can be decomposed into a bias term $\left\lVert s-s_{k}\right\rVert^{2}$ and a "variance" term $P\Theta_{k}/n$. The bias term measures the capacity of the kernel $k$ to approximate $s$, whereas $P\Theta_{k}/n$ is the price to pay for replacing $s_{k}$ by its empirical version $\widehat{s}_{k}$. In this sense, $P\Theta_{k}/n$ measures the complexity of the kernel $k$ in a way which is completely adapted to our problem of density estimation. Even if it does not seem a natural measure of complexity at first glance, note that in the previous examples it is indeed always linked to a natural complexity. When dealing with regular histograms defined on $[0,1]$, $P\Theta_{k_{S}}$ is the dimension of the considered space $S$, whereas for approximation kernels $P\Theta_{k_{K,h}}$ is proportional to the inverse of the considered bandwidth $h$.

4.2 Some general results about the minimal penalty

In this section, we assume that we are in the asymptotic regime where the number of observations $n\to\infty$. In particular, the asymptotic notation refers to this regime.

From now on, the family $\mathcal{K}=\mathcal{K}_{n}$ may depend on $n$ as long as both $\Gamma$ and $\Upsilon$ remain absolute constants that do not depend on it. Indeed, in the previous examples this seems a reasonable regime. Since $\mathcal{K}_{n}$ now depends on $n$, our selected $\widehat{k}=\widehat{k}_{n}$ also depends on $n$.

To prove that the lower bound $\operatorname{pen}_{\mathrm{min}}(k)$ is sharp, we need to show that the estimator chosen by minimizing (6) with a penalty smaller than $\operatorname{pen}_{\mathrm{min}}$ does not satisfy an oracle inequality. This is only possible if the $\left\lVert s-\widehat{s}_{k}\right\rVert^{2}$'s are not all of the same order and if they are larger than the remainder term $\square(r+\theta^{-3})\Upsilon x^{2}/n$. From an asymptotic point of view, we rewrite this, thanks to Proposition 4.1, as follows: for all $n\geq 1$, there exist $k_{0,n}$ and $k_{1,n}$ in $\mathcal{K}_{n}$ such that

\left\lVert s-s_{k_{1,n}}\right\rVert^{2}\!+\!\frac{P\Theta_{k_{1,n}}}{n}\!\gg\!\left\lVert s-s_{k_{0,n}}\right\rVert^{2}\!+\!\frac{P\Theta_{k_{0,n}}}{n}\!\gg\!\square\left(\left.\!r+\frac{1}{\theta^{3}}\!\right.\right)\frac{\Upsilon x^{2}}{n}\enspace, (21)

where $a_{n}\gg b_{n}$ means that $b_{n}/a_{n}\to_{n\to\infty}0$. More explicitly, denoting by $\mathrm{o}(1)$ a sequence depending only on $n$, tending to $0$ as $n$ tends to infinity and whose value may change from line to line, one assumes that there exist positive constants $c_{s}$ and $c_{R}$ such that, for all $n\geq 1$, there exist $k_{0,n}$ and $k_{1,n}$ in $\mathcal{K}_{n}$ such that

\left\lVert s-s_{k_{0,n}}\right\rVert^{2}+\frac{P\Theta_{k_{0,n}}}{n}\leq c_{s}~\mathrm{o}(1)\left(\left.\left\lVert s-s_{k_{1,n}}\right\rVert^{2}+\frac{P\Theta_{k_{1,n}}}{n}\right.\right) (22)
\frac{(\log(|\mathcal{K}_{n}|\vee n))^{3}}{n}\leq c_{R}~\mathrm{o}(1)\left(\left.\left\lVert s-s_{k_{0,n}}\right\rVert^{2}+\frac{P\Theta_{k_{0,n}}}{n}\right.\right)\enspace. (23)

We put a log-cube factor in the remainder term to allow some choices of $\theta=\theta_{n}\to_{n\to\infty}0$ and $r=r_{n}\to_{n\to\infty}+\infty$.

But (22) and (23) (or (21)) are not sufficient. Indeed, the following result explains what happens when the bias terms are always the leading terms.

Corollary 4.2.

Let $(\mathcal{K}_{n})_{n\geq 1}$ be a sequence of finite collections of kernels $k$ satisfying Assumptions (11), (12), (13), (14), (15), (16) for a positive constant $\Upsilon$ independent of $n$ and such that

\frac{1}{n}=c_{b}~\mathrm{o}(1)\inf_{k\in\mathcal{K}_{n}}\frac{\left\lVert s-s_{k}\right\rVert^{2}}{P\Theta_{k}}\enspace, (24)

for some positive constant $c_{b}$.

Assume that there exist real numbers $\delta^{\prime}\geq\delta$ of any sign and a sequence $(r_{n})_{n\geq 1}$ of nonnegative real numbers such that, for all $n\geq 1$, with probability larger than $1-\square/n^{2}$, for all $k\in\mathcal{K}_{n}$,

\delta\frac{P\Theta_{k}}{n}-\square_{\delta,\delta^{\prime},\Upsilon}\frac{r_{n}\log(n\vee|\mathcal{K}_{n}|)^{2}}{n}\leq\operatorname{pen}(k)-\frac{2P\chi_{k}-P\Theta_{k}}{n}\leq\delta^{\prime}\frac{P\Theta_{k}}{n}+\square_{\delta,\delta^{\prime},\Upsilon}\frac{r_{n}\log(n\vee|\mathcal{K}_{n}|)^{2}}{n}\enspace.

Then, with probability larger than $1-\square/n^{2}$,

\left\lVert s-\widehat{s}_{{\widehat{k}}_{n}}\right\rVert^{2}\leq(1+\square_{\delta,\delta^{\prime},\Upsilon,c_{b}}~\mathrm{o}(1))\inf_{k\in\mathcal{K}_{n}}\left\lVert s-\widehat{s}_{k}\right\rVert^{2}+\square_{\delta,\delta^{\prime},\Upsilon}\left(\left.r_{n}+\log n\right.\right)\frac{\log(n\vee|\mathcal{K}_{n}|)^{2}}{n}\enspace.

The proof easily follows by taking $\theta=(\log n)^{-1}$ in (17) and $\eta=2$ (for instance) in Proposition 4.1, and by using Assumption (24) and the bounds on $\operatorname{pen}(k)$. This result shows that the estimator $\widehat{s}_{{\widehat{k}}_{n}}$ satisfies an asymptotically optimal oracle inequality when condition (24) holds, whatever the values of $\delta$ and $\delta^{\prime}$, even when they are negative. This proves that the lower bound $\operatorname{pen}_{\mathrm{min}}$ is not sharp in this case.

Therefore, we have to assume that at least one bias $\left\lVert s-s_{k}\right\rVert^{2}$ is negligible with respect to $P\Theta_{k}/n$. Actually, to conclude, we assume that this happens for $k_{1,n}$ in (21).

Theorem 4.3.

Let $(\mathcal{K}_{n})_{n\geq 1}$ be a sequence of finite collections of kernels satisfying Assumptions (11), (12), (13), (14), (15), (16), with $\Upsilon$ not depending on $n$. Each $\mathcal{K}_{n}$ is also assumed to satisfy (22) and (23) with a kernel $k_{1,n}\in\mathcal{K}_{n}$ in (22) such that

\left\lVert s-s_{k_{1,n}}\right\rVert^{2}\leq c~\mathrm{o}(1)\frac{P\Theta_{k_{1,n}}}{n}\enspace, (25)

for some fixed positive constant $c$. Suppose that there exist $\delta\geq\delta^{\prime}>0$ and a sequence $(r_{n})_{n\geq 1}$ of nonnegative real numbers such that $r_{n}\leq\square\log(|\mathcal{K}_{n}|\vee n)$ and such that, for all $n\geq 1$, with probability larger than $1-\square n^{-2}$, for all $k\in\mathcal{K}_{n}$,

\frac{2P\chi_{k}-P\Theta_{k}}{n}-\delta\frac{P\Theta_{k}}{n}-\square_{\delta,\delta^{\prime},\Upsilon}\frac{r_{n}\log(|\mathcal{K}_{n}|\vee n)^{2}}{n}\leq\operatorname{pen}(k)\leq\frac{2P\chi_{k}-P\Theta_{k}}{n}-\delta^{\prime}\frac{P\Theta_{k}}{n}+\square_{\delta,\delta^{\prime},\Upsilon}\frac{r_{n}\log(|\mathcal{K}_{n}|\vee n)^{2}}{n}\enspace. (26)

Then, with probability larger than $1-\square/n^{2}$, the following holds:

P\Theta_{{\widehat{k}}_{n}}\geq\left(\left.\frac{\delta^{\prime}}{\delta}+\square_{\delta,\delta^{\prime},\Upsilon,c,c_{s},c_{R}}~\mathrm{o}(1)\right.\right)P\Theta_{k_{1,n}}\quad\mbox{and} (27)
\left\lVert s-\widehat{s}_{{\widehat{k}}_{n}}\right\rVert^{2}\geq\left(\left.\frac{\delta^{\prime}}{\delta}+\square_{\delta,\delta^{\prime},\Upsilon,c,c_{s},c_{R}}~\mathrm{o}(1)\right.\right)\left\lVert s-\widehat{s}_{k_{1,n}}\right\rVert^{2}\gg\left\lVert s-\widehat{s}_{k_{0,n}}\right\rVert^{2}\geq\inf_{k\in\mathcal{K}_{n}}\left\lVert s-\widehat{s}_{k}\right\rVert^{2}\enspace. (28)

By (28), under the conditions of Theorem 4.3, the estimator $\widehat{s}_{{\widehat{k}}_{n}}$ cannot satisfy an oracle inequality; hence, the lower bound $(2P\chi_{k}-P\Theta_{k})/n$ in Theorem 3.1 is sharp. This shows that $(2P\chi_{k}-P\Theta_{k})/n$ is a minimal penalty in the sense of [BM07] for kernel selection. When

\operatorname{pen}(k)=\frac{2P\chi_{k}-P\Theta_{k}}{n}+\kappa\frac{P\Theta_{k}}{n}\enspace,

the complexity $P\Theta_{{\widehat{k}}_{n}}$ exhibits a sharp phase transition when $\kappa$ becomes positive. Indeed, when $\kappa<0$, it follows from (27) that the complexity $P\Theta_{{\widehat{k}}_{n}}$ is asymptotically larger than $P\Theta_{k_{1,n}}$. On the other hand, as a consequence of Theorem 3.1, when $\kappa>0$ this complexity becomes smaller than

\square_{\kappa}n\inf_{k\in\mathcal{K}_{n}}\left(\left.\left\lVert s-s_{k}\right\rVert^{2}+\frac{P\Theta_{k}}{n}\right.\right)\leq\square_{\kappa}\left(\left.n\left\lVert s-s_{k_{0,n}}\right\rVert^{2}+P\Theta_{k_{0,n}}\right.\right)\ll\square_{\kappa}\left(\left.n\left\lVert s-s_{k_{1,n}}\right\rVert^{2}+P\Theta_{k_{1,n}}\right.\right)\leq\square_{\kappa}P\Theta_{k_{1,n}}\enspace.

4.3 Examples

Example 1 (continued).

Let $\mathcal{S}=\mathcal{S}_{n}$ be the collection of spaces of regular histograms on $[0,1]$ with dimensions in $\left\{1,\ldots,n\right\}$ and let $\widehat{S}=\widehat{S}_{n}$ be the space selected by the penalized criterion. Recall that, for any $S\in\mathcal{S}_{n}$, the orthonormal basis is defined by $\varphi_{i}=\sqrt{D_{S}}{\bf 1}_{i}$ and $P\Theta_{k_{S}}=D_{S}$. Assume that $s$ is $\alpha$-Hölderian, with $\alpha\in(0,1]$ and $\alpha$-Hölderian norm $L$. It is well known (see for instance Section 1.3.3 of [Bir06]) that the bias is bounded above by

\left\lVert s-s_{k_{S}}\right\rVert^{2}\leq\square_{L}D_{S}^{-2\alpha}\enspace.

In particular, if $D_{S_{1}}=n$,

\left\lVert s-s_{k_{S_{1}}}\right\rVert^{2}\leq\square_{L}n^{-2\alpha}\ll 1=\frac{D_{S_{1}}}{n}=\frac{P\Theta_{k_{S_{1}}}}{n}\enspace.

Thus, (25) holds for the kernel $k_{S_{1}}$. Moreover, if $D_{S_{0}}=\lfloor\sqrt{n}\rfloor$,

\frac{(\log(n\vee|\mathcal{S}_{n}|))^{3}}{n}\ll\left\lVert s-s_{k_{S_{0}}}\right\rVert^{2}+\frac{D_{S_{0}}}{n}\leq\square_{L}\left(\left.\frac{1}{n^{\alpha}}+\frac{1}{\sqrt{n}}\right.\right)\ll\left\lVert s-s_{k_{S_{1}}}\right\rVert^{2}+\frac{D_{S_{1}}}{n}\enspace.

Hence, (21) holds with $k_{0,n}=k_{S_{0}}$ and $k_{1,n}=k_{S_{1}}$. Therefore, Theorem 4.3 and Theorem 3.1 apply in this example. If $\operatorname{pen}(k_{S})=(1-\delta)D_{S}/n$, the dimension $D_{\widehat{S}_{n}}\geq\square_{\delta}n$ and $\widehat{s}_{k_{\widehat{S}_{n}}}$ is not consistent and does not satisfy an oracle inequality. On the other hand, if $\operatorname{pen}(k_{S})=(1+\delta)D_{S}/n$,

D_{{\widehat{S}_{n}}}\leq\square_{L,\delta}\left(\left.n^{1-\alpha}+\sqrt{n}\right.\right)\ll D_{S_{1}}=n

and $\widehat{s}_{k_{\widehat{S}_{n}}}$ satisfies an oracle inequality, which implies that, with probability larger than $1-\square/n^{2}$,

\left\lVert s-\widehat{s}_{k_{\widehat{S}_{n}}}\right\rVert^{2}\leq\square_{\alpha,L,\delta}n^{-2\alpha/(2\alpha+1)}\enspace,

by taking $D_{S}\simeq n^{1/(2\alpha+1)}$. It achieves the minimax rate of convergence over the class of $\alpha$-Hölderian functions.

From Theorem 3.1, the penalty $\operatorname{pen}(k_{S})=2D_{S}/n$ provides an estimator $\widehat{s}_{k_{\widehat{S}_{n}}}$ that achieves an asymptotically optimal oracle inequality. Therefore the optimal penalty is equal to $2$ times the minimal one. In particular, the slope heuristic of [BM07] holds in this example, as already noticed in [Ler12].

Finally, to illustrate Corollary 4.2, let us take $s(x)=2x$ and the collection of regular histograms with dimension in $\{1,\ldots,\lfloor n^{\beta}\rfloor\}$, with $\beta<1/3$. Simple calculations show that

\frac{\left\lVert s-s_{k_{S}}\right\rVert^{2}}{D_{S}}\geq\square D_{S}^{-3}\geq\square n^{-3\beta}\gg n^{-1}.

Hence (24) applies and the penalized estimator with penalty $\operatorname{pen}(k_{S})\simeq\delta\frac{D_{S}}{n}$ always satisfies an oracle inequality, even if $\delta=0$ or $\delta<0$. This was actually expected, since the criterion is likely to choose the largest dimension, which is also the oracle choice in this case.
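A small numerical sketch (in Python; the sample size and the value of $\beta$ are illustrative) of this last point: sampling from $s(x)=2x$ and minimizing the unpenalized criterion $P_{n}\gamma(\widehat{s}_{k_{S}})=-D_{S}\sum_{\ell}(N_{\ell}/n)^{2}$ over $D_{S}\in\{1,\ldots,\lfloor n^{\beta}\rfloor\}$ typically selects a dimension close to the largest admissible one, in line with the discussion above.

```python
import numpy as np

def unpenalized_crit(X, D):
    """P_n gamma(s_hat_{k_S}) for the regular D-bin histogram on [0, 1]."""
    counts, _ = np.histogram(X, bins=D, range=(0.0, 1.0))
    return -D * np.sum((counts / len(X)) ** 2)

rng = np.random.default_rng(4)
n, beta = 1000, 0.3
X = np.sqrt(rng.uniform(size=n))                 # inverse-cdf sample from s(x) = 2x on [0, 1]
dims = range(1, int(n ** beta) + 1)              # D_S in {1, ..., floor(n^beta)}
D_hat = min(dims, key=lambda D: unpenalized_crit(X, D))
print("largest admissible dimension:", int(n ** beta), "| selected:", D_hat)
```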

Example 2 (continued).

Let $K$ be a fixed function and let $\mathcal{H}=\mathcal{H}_{n}$ denote the following grid of bandwidths

\mathcal{H}=\left\{\left.\frac{\left\lVert K\right\rVert_{\infty}\left\lVert K\right\rVert_{1}}{i}\quad/\quad i=1,\ldots,n\right.\right\}\enspace

and let $\widehat{h}=\widehat{h}_{n}$ be the selected bandwidth. Assume as before that $s$ is a density on $[0,1]$ that belongs to the Nikol'ski class $\mathcal{N}(\alpha,L)$ with $\alpha\in(0,1]$ and $L>0$. By Proposition 1.5 in [Tsy09], if $K$ satisfies $\int\left\lvert u\right\rvert^{\alpha}\left\lvert K(u)\right\rvert du<\infty$, then

\left\lVert s-s_{k_{K,h}}\right\rVert^{2}\leq\square_{\alpha,K,L}h^{2\alpha}\enspace.

In particular, when $h_{1}=\left\lVert K\right\rVert_{\infty}\left\lVert K\right\rVert_{1}/n$,

\left\lVert s-s_{k_{K,h_{1}}}\right\rVert^{2}\leq\square_{\alpha,K,L}n^{-2\alpha}\ll\frac{P\Theta_{k_{K,h_{1}}}}{n}=\frac{\left\lVert K\right\rVert^{2}}{\left\lVert K\right\rVert_{\infty}\left\lVert K\right\rVert_{1}}\enspace.

On the other hand, for $h_{0}=\left\lVert K\right\rVert_{\infty}\left\lVert K\right\rVert_{1}/\left\lfloor\sqrt{n}\right\rfloor$,

\frac{(\log(n\vee|\mathcal{H}_{n}|))^{2}}{n}\ll\left\lVert s-s_{k_{K,h_{0}}}\right\rVert^{2}+\frac{P\Theta_{k_{K,h_{0}}}}{n}\leq\square_{K,\alpha,L}\left(\left.\frac{1}{n^{\alpha}}+\frac{1}{\sqrt{n}}\right.\right)\ll\left\lVert s-s_{k_{K,h_{1}}}\right\rVert^{2}+\frac{P\Theta_{k_{K,h_{1}}}}{n}\enspace.

Hence, (21) and (25) hold with kernels $k_{0,n}=k_{K,h_{0}}$ and $k_{1,n}=k_{K,h_{1}}$. Therefore, Theorem 4.3 and Theorem 3.1 apply in this example. If for some $\delta>0$ we set $\operatorname{pen}(k_{K,h})=(2K(0)-\left\lVert K\right\rVert^{2}-\delta\left\lVert K\right\rVert^{2})/(nh)$, then $\widehat{h}_{n}\leq\square_{\delta,K}n^{-1}$ and $\widehat{s}_{k_{K,\widehat{h}_{n}}}$ is not consistent and does not satisfy an oracle inequality. On the other hand, if $\operatorname{pen}(k_{K,h})=(2K(0)-\left\lVert K\right\rVert^{2}+\delta\left\lVert K\right\rVert^{2})/(nh)$, then

\widehat{h}_{n}\geq\square_{\delta,K,L}\left(\left.n^{1-\alpha}+\sqrt{n}\right.\right)^{-1}\gg\square_{\delta,K,L}n^{-1}\enspace,

and $\widehat{s}_{k_{K,\widehat{h}_{n}}}$ satisfies an oracle inequality, which implies that, with probability larger than $1-\square/n^{2}$,

\left\lVert s-\widehat{s}_{k_{K,\widehat{h}_{n}}}\right\rVert^{2}\leq\square_{\alpha,K,L,\delta}n^{-2\alpha/(2\alpha+1)}\enspace,

for $h=\left\lVert K\right\rVert_{\infty}\left\lVert K\right\rVert_{1}/\left\lfloor n^{1/(2\alpha+1)}\right\rfloor\in\mathcal{H}$. In particular, it achieves the minimax rate of convergence over the class $\mathcal{N}(\alpha,L)$. Finally, if $\operatorname{pen}(k_{K,h})=2K(0)/(nh)$, then $\widehat{s}_{k_{K,\widehat{h}_{n}}}$ achieves an asymptotically optimal oracle inequality, thanks to Theorem 3.1.

The minimal penalty is therefore

\operatorname{pen}_{\mathrm{min}}(k_{K,h})=\frac{2K(0)-\left\lVert K\right\rVert^{2}}{nh}\enspace.

In this case, the optimal penalty $\operatorname{pen}_{\mathrm{opt}}(k_{K,h})=2K(0)/(nh)$ derived from Theorem 3.1 is not twice the minimal one, but one still has, if $2K(0)\neq\left\lVert K\right\rVert^{2}$,

\operatorname{pen}_{\mathrm{opt}}(k_{K,h})=\frac{2K(0)}{2K(0)-\left\lVert K\right\rVert^{2}}\operatorname{pen}_{\mathrm{min}}(k_{K,h})\enspace,

even though they can be of opposite signs depending on $K$. This type of nontrivial relationship between the optimal and the minimal penalty has already been underlined in [1] in a regression framework for selecting linear estimators.

Note that if one allows two kernel functions $K_{1}$ and $K_{2}$ in the family of kernels such that $2K_{1}(0)\neq\left\lVert K_{1}\right\rVert^{2}$, $2K_{2}(0)\neq\left\lVert K_{2}\right\rVert^{2}$ and

\frac{2K_{1}(0)}{2K_{1}(0)-\left\lVert K_{1}\right\rVert^{2}}\neq\frac{2K_{2}(0)}{2K_{2}(0)-\left\lVert K_{2}\right\rVert^{2}}\enspace,

then there is no absolute constant multiplicative factor linking the minimal penalty and the optimal one.

5 Small simulation study

In this section, we illustrate Theorem 3.1 and Theorem 4.3 on simulated data. We focus on approximation kernels only, since projection kernels have already been discussed in [Ler12].

We observe an i.i.d. sample of size $n=100$ from the standard Gaussian distribution. For a fixed parameter $a\geq 0$ we consider the family of kernels

k_{K_{a},h}(x,y)=\frac{1}{h}K_{a}\left(\left.\frac{x-y}{h}\right.\right)\qquad\mbox{with}\quad h\in\mathcal{H}=\left\{\left.\frac{1}{2i},~i=1,\ldots,50\right.\right\}\enspace,

where, for $x\in\mathbb{R}$, $K_{a}(x)=\frac{1}{2\sqrt{2\pi}}\left(\left.e^{-\frac{(x-a)^{2}}{2}}+e^{-\frac{(x+a)^{2}}{2}}\right.\right)$.
In particular, the kernel estimator with $a=0$ is the classical Gaussian kernel estimator. Moreover,

K_{a}(0)=\frac{1}{\sqrt{2\pi}}\exp\left(\left.-\frac{a^{2}}{2}\right.\right)\quad\mbox{and}\quad\left\lVert K_{a}\right\rVert^{2}=\frac{1+e^{-a^{2}}}{4\sqrt{\pi}}\enspace.

Thus, depending on the value of $a$, the minimal penalty $(2K_{a}(0)-\left\lVert K_{a}\right\rVert^{2})/(nh)$ may be negative. We study the behavior of the penalized criterion

\mathcal{C}_{\operatorname{pen}}\left(\left.k_{K_{a},h}\right.\right)=P_{n}\gamma(\widehat{s}_{k_{K_{a},h}})+\operatorname{pen}(k_{K_{a},h})

with penalties of the form

\operatorname{pen}\left(\left.k_{K_{a},h}\right.\right)=\frac{2K_{a}(0)-\left\lVert K_{a}\right\rVert^{2}}{nh}+\kappa\frac{\left\lVert K_{a}\right\rVert^{2}}{nh}\enspace, (29)

for different values of $\kappa$ ($\kappa=-1,0,1$) and of $a$ ($a=0,1.5,2,3$). Figure 1 displays the estimators selected with the optimal penalty $2K_{a}(0)/(nh)$ for the different values of $a$, and Figure 2 shows the evolution of the different penalized criteria as a function of $1/h$.
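For reproducibility, here is a minimal sketch (in Python; the random seed is arbitrary, so the selected bandwidths will not exactly match the figures) of this experiment. It evaluates $\mathcal{C}_{\operatorname{pen}}(k_{K_{a},h})$ on the grid $\mathcal{H}$ for the penalties (29), using $P_{n}\gamma(\widehat{s}_{k_{K_{a},h}})=\left\lVert\widehat{s}_{k_{K_{a},h}}\right\rVert^{2}-2P_{n}(\widehat{s}_{k_{K_{a},h}})$ and the closed form of the $L^{2}$-norm obtained from correlations of Gaussian densities.

```python
import numpy as np

def g(t):
    """Correlation of two standard Gaussian densities, i.e. the N(0, 2) density."""
    return np.exp(-t ** 2 / 4) / (2 * np.sqrt(np.pi))

def K_a(x, a):
    return (np.exp(-(x - a) ** 2 / 2) + np.exp(-(x + a) ** 2 / 2)) / (2 * np.sqrt(2 * np.pi))

def crit(X, h, a, kappa):
    """C_pen(k_{K_a,h}) with pen = (2 K_a(0) - ||K_a||^2 + kappa ||K_a||^2) / (n h), cf. (29)."""
    n = len(X)
    D = (X[:, None] - X[None, :]) / h
    norm_sq = np.sum(2 * g(D) + g(D - 2 * a) + g(D + 2 * a)) / (4 * n ** 2 * h)  # ||s_hat||^2
    Pn_s_hat = np.sum(K_a(D, a)) / (n ** 2 * h)                                   # P_n(s_hat)
    K0 = np.exp(-a ** 2 / 2) / np.sqrt(2 * np.pi)
    K_norm_sq = (1 + np.exp(-a ** 2)) / (4 * np.sqrt(np.pi))
    pen = (2 * K0 - K_norm_sq + kappa * K_norm_sq) / (n * h)
    return norm_sq - 2 * Pn_s_hat + pen

rng = np.random.default_rng(5)
X = rng.standard_normal(100)                       # n = 100 standard Gaussian observations
H = [1 / (2 * i) for i in range(1, 51)]
for a in (0.0, 1.5, 2.0, 3.0):
    for kappa in (-1.0, 0.0, 1.0):
        h_hat = min(H, key=lambda h: crit(X, h, a, kappa))
        print(f"a = {a}, kappa = {kappa}: selected 1/h = {1 / h_hat:.0f}")
```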

Figure 1: Selected approximation kernel estimators when the penalty is the optimal one, i.e. $\frac{2K_{a}(0)}{nh}$.
Figure 2: Behavior of $P_{n}\gamma(\widehat{s}_{k_{K_{a},h}})$ (blue line) and $\mathcal{C}_{\operatorname{pen}}\left(k_{K_{a},h}\right)$ as a function of $1/h$, which is proportional to the complexity $P\Theta_{k_{K_{a},h}}$.

The contrast curves for $a=0$ in Figure 2 are classical. Without penalization, the criterion decreases and leads to the selection of the smallest bandwidth. At the minimal penalty, the curve is flat, and at the optimal penalty one selects a meaningful bandwidth, as shown in Figure 1.

When $a>0$, despite the choice of these unusual kernels, the reconstructions in Figure 1 for the optimal penalty are also meaningful. However, when $a=2$ or $a=3$, the criterion with minimal penalty is smaller than the unpenalized criterion, meaning that minimizing the latter already leads, by Theorem 3.1, to an oracle inequality. In our simulation, when $a=3$, the curves for the optimal criterion and the unpenalized one are so close that the same estimator is selected by both methods.

Figure 3: Behavior of $1/\hat{h}$, which is proportional to the complexity $P\Theta_{k_{K_{a},h}}$, for the estimator selected by the criterion whose penalty is given by (29), as a function of $\kappa$.

Finally, Figure 3 shows that in all cases the complexity of the selected estimator indeed exhibits a sharp phase transition around $\kappa=0$, i.e. at the minimal penalty.

6 Proofs

6.1 Proof of Theorem 3.1

The starting point to prove the oracle inequality is to notice that any minimizer $\widehat{k}$ of $\mathcal{C}_{\operatorname{pen}}$ satisfies

\left\lVert s-\widehat{s}_{{\widehat{k}}}\right\rVert^{2}\leq\left\lVert s-\widehat{s}_{k}\right\rVert^{2}+\left(\left.\operatorname{pen}(k)-\operatorname{pen}_{\mathrm{id}}(k)\right.\right)-\left(\left.\operatorname{pen}\left(\left.{\widehat{k}}\right.\right)-\operatorname{pen}_{\mathrm{id}}\left(\left.{\widehat{k}}\right.\right)\right.\right)\enspace.

Using the expression of the ideal penalty (9) we find

\left\lVert s-\widehat{s}_{{\widehat{k}}}\right\rVert^{2}\leq\left\lVert s-\widehat{s}_{k}\right\rVert^{2}+\left(\left.\operatorname{pen}(k)-2\frac{P\chi_{k}}{n}\right.\right)-\left(\left.\operatorname{pen}\left(\left.{\widehat{k}}\right.\right)-2\frac{P\chi_{{\widehat{k}}}}{n}\right.\right)
+2\frac{P(s_{k}-s_{{\widehat{k}}})}{n}+2\left(\left.1-\frac{2}{n}\right.\right)(P_{n}-P)(s_{{\widehat{k}}}-s_{k})
+2\frac{(P_{n}-P)(\chi_{{\widehat{k}}}-\chi_{k})}{n}+2\frac{U_{{\widehat{k}}}-U_{k}}{n^{2}}\enspace. (30)

By Proposition B.1 (see the appendix), for all $x>1$ and all $\theta$ in $(0,1)$, with probability larger than $1-(7.4|\mathcal{K}|+2|\mathcal{K}|^{2})e^{-x}$,

\left\lVert s-\widehat{s}_{{\widehat{k}}}\right\rVert^{2}\leq\left\lVert s-\widehat{s}_{k}\right\rVert^{2}+\left(\left.\operatorname{pen}(k)-2\frac{P\chi_{k}}{n}\right.\right)-\left(\left.\operatorname{pen}\left(\left.{\widehat{k}}\right.\right)-2\frac{P\chi_{{\widehat{k}}}}{n}\right.\right)
+\theta\left\lVert s-s_{\widehat{k}}\right\rVert^{2}+\theta\left\lVert s-s_{k}\right\rVert^{2}+\square\frac{\Upsilon}{\theta n}
+\left(\left.1-\frac{2}{n}\right.\right)\theta\left\lVert s-s_{\widehat{k}}\right\rVert^{2}+\left(\left.1-\frac{2}{n}\right.\right)\theta\left\lVert s-s_{k}\right\rVert^{2}+\square\frac{\Upsilon x^{2}}{\theta n}
+\theta\frac{P\Theta_{k}}{n}+\theta\frac{P\Theta_{\widehat{k}}}{n}+\square\frac{\Upsilon x}{\theta n}+\theta\frac{P\Theta_{k}}{n}+\theta\frac{P\Theta_{\widehat{k}}}{n}+\square\frac{\Upsilon x^{2}}{\theta n}\enspace.

Hence

ss^k^2\displaystyle\left\lVert s-\widehat{s}_{{\widehat{k}}}\right\rVert^{2} ss^k2+(pen(k)2Pχkn)(pen(k^)2Pχk^n)\displaystyle\leq\left\lVert s-\widehat{s}_{k}\right\rVert^{2}+\left(\left.\operatorname{pen}(k)-2\frac{P\chi_{k}}{n}\right.\right)-\left(\left.\operatorname{pen}\left(\left.{\widehat{k}}\right.\right)-2\frac{P\chi_{{\widehat{k}}}}{n}\right.\right)
+2θ[ssk^2+PΘk^n]+2θ[ssk2+PΘkn]+Υx2θn.\displaystyle+2\theta\left[\left.\left\lVert s-s_{\widehat{k}}\right\rVert^{2}+\frac{P\Theta_{\widehat{k}}}{n}\right.\right]+2\theta\left[\left.\left\lVert s-s_{k}\right\rVert^{2}+\frac{P\Theta_{k}}{n}\right.\right]+\square\frac{\Upsilon x^{2}}{\theta n}\enspace.

This bound holds using (11), (12) and (13) only. Now by Proposition 4.1 applied with η=1\eta=1, we have for all x>1x>1, for all θ(0,1)\theta\in(0,1), with probability larger than 1(16.8|𝒦|+2|𝒦|2)ex1-(16.8|\mathcal{K}|+2|\mathcal{K}|^{2})e^{-x},

ss^k^2\displaystyle\left\lVert s-\widehat{s}_{{\widehat{k}}}\right\rVert^{2} ss^k2+(pen(k)2Pχkn)(pen(k^)2Pχk^n)\displaystyle\leq\left\lVert s-\widehat{s}_{k}\right\rVert^{2}+\left(\left.\operatorname{pen}(k)-2\frac{P\chi_{k}}{n}\right.\right)-\left(\left.\operatorname{pen}\left(\left.{\widehat{k}}\right.\right)-2\frac{P\chi_{{\widehat{k}}}}{n}\right.\right)
+4θss^k^2+4θss^k2+Υx2θn.\displaystyle+4\theta\left\lVert s-\widehat{s}_{\widehat{k}}\right\rVert^{2}+4\theta\left\lVert s-\widehat{s}_{k}\right\rVert^{2}+\square\frac{\Upsilon x^{2}}{\theta n}\enspace.

This gives the first part of the theorem.

For the second part, by the condition (18) on the penalty, we find for all x>1x>1, for all θ\theta in (0,1)(0,1), with probability larger than 1(C+16.8|𝒦|+2|𝒦|2)ex1-(C+16.8|\mathcal{K}|+2|\mathcal{K}|^{2})e^{-x},

(14θ)ss^k^2(1+4θ)ss^k2+(δ1)+PΘkn+(1δ)+PΘk^n+(r+1θ)Υx2n.(1-4\theta)\left\lVert s-\widehat{s}_{{\widehat{k}}}\right\rVert^{2}\leq\\ (1+4\theta)\left\lVert s-\widehat{s}_{k}\right\rVert^{2}+(\delta^{\prime}-1)_{+}\frac{P\Theta_{k}}{n}+(1-\delta)_{+}\frac{P\Theta_{\widehat{k}}}{n}+\square\left(\left.r+\frac{1}{\theta}\right.\right)\frac{\Upsilon x^{2}}{n}\enspace.

By Proposition 4.1 applied with η=θ\eta=\theta, we have with probability larger than 1(C+26.2|𝒦|+2|𝒦|2)ex1-(C+26.2|\mathcal{K}|+2|\mathcal{K}|^{2})e^{-x},

(14θ)ss^k^2(1+4θ)ss^k2+(δ1)+(1+θ)ss^k2+(1δ)+(1+θ)ss^k^2+(r+1θ3)Υx2n,(1-4\theta)\left\lVert s-\widehat{s}_{{\widehat{k}}}\right\rVert^{2}\leq(1+4\theta)\left\lVert s-\widehat{s}_{k}\right\rVert^{2}+(\delta^{\prime}-1)_{+}(1+\theta)\left\lVert s-\widehat{s}_{k}\right\rVert^{2}\\ +(1-\delta)_{+}(1+\theta)\left\lVert s-\widehat{s}_{{\widehat{k}}}\right\rVert^{2}+\square\left(\left.r+\frac{1}{\theta^{3}}\right.\right)\frac{\Upsilon x^{2}}{n}\enspace,

that is

((δ1)θ(4+(1δ)+))ss^k^2((δ1)+θ(4+(δ1)+))ss^k2+(r+1θ3)Υx2n.\left(\left.(\delta\wedge 1)-\theta(4+(1-\delta)_{+})\right.\right)\left\lVert s-\widehat{s}_{{\widehat{k}}}\right\rVert^{2}\\ \leq\left(\left.(\delta^{\prime}\vee 1)+\theta(4+(\delta^{\prime}-1)_{+})\right.\right)\left\lVert s-\widehat{s}_{k}\right\rVert^{2}+\square\left(\left.r+\frac{1}{\theta^{3}}\right.\right)\frac{\Upsilon x^{2}}{n}\enspace.

Hence, because 1[(δ1)+(4+(δ1)+)θ](δ1)+(4+δ)θ1\leq[(\delta^{\prime}\vee 1)+(4+(\delta^{\prime}-1)_{+})\theta]\leq(\delta^{\prime}\vee 1)+(4+\delta^{\prime})\theta, we obtain the desired result.

6.2 Proof of Proposition 4.1

First, let us denote for all x𝕏x\in\mathbb{X}

FA,k(x):=𝔼[Ak(X,x)],ζk(x):=(k(y,x)sk(y))2𝑑μ(y),F_{A,k}(x):=\mathbb{E}\left[\left.A_{k}(X,x)\right.\right],\qquad\zeta_{k}(x):=\int\left(\left.k(y,x)-s_{k}(y)\right.\right)^{2}d\mu(y)\enspace,

and

UA,k:=ij=1n(Ak(Xi,Xj)FA,k(Xi)FA,k(Xj)+𝔼[Ak(X,Y)]).U_{A,k}:=\sum_{i\neq j=1}^{n}\left(\left.A_{k}(X_{i},X_{j})-F_{A,k}(X_{i})-F_{A,k}(X_{j})+\mathbb{E}\left[\left.A_{k}(X,Y)\right.\right]\right.\right)\enspace.

Some easy computations then provide the following useful equality

sks^k2=1nPnζk+1n2UA,k.\left\lVert s_{k}-\widehat{s}_{k}\right\rVert^{2}=\frac{1}{n}P_{n}\zeta_{k}+\frac{1}{n^{2}}U_{A,k}\enspace.

We need only treat the terms on the right-hand side, thanks to the probability tools of Section 2.3. Applying Proposition 2.1, we get, for any x1x\geq 1, with probability larger than 12|𝒦|ex1-2\left\lvert\mathcal{K}\right\rvert e^{-x},

|(PnP)ζk|2xnPζk2+ζkx3n.\left\lvert(P_{n}-P)\zeta_{k}\right\rvert\leq\sqrt{\frac{2x}{n}P\zeta_{k}^{2}}+\frac{\left\lVert\zeta_{k}\right\rVert_{\infty}x}{3n}\enspace.

One can then check the following link between ζk\zeta_{k} and Θk\Theta_{k}

Pζk=(k(y,x)sk(x))2s(y)𝑑μ(x)𝑑μ(y)=PΘksk2.P\zeta_{k}=\int\left(\left.k(y,x)-s_{k}(x)\right.\right)^{2}s(y)d\mu(x)d\mu(y)=P\Theta_{k}-\left\lVert s_{k}\right\rVert^{2}\enspace.
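Indeed, expanding the square and using that \int k(y,x)^{2}d\mu(x)=\Theta_{k}(y), that \int k(y,x)s(y)d\mu(y)=s_{k}(x) and that \int s\,d\mu=1, one gets

P\zeta_{k}=\int\Theta_{k}(y)s(y)d\mu(y)-2\int s_{k}(x)^{2}d\mu(x)+\left\lVert s_{k}\right\rVert^{2}\int s(y)d\mu(y)=P\Theta_{k}-\left\lVert s_{k}\right\rVert^{2}\enspace.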

Next, by (1) and (11)

ζk\displaystyle\left\lVert\zeta_{k}\right\rVert_{\infty} =supy𝕏(k(y,x)𝔼[k(X,x)])2𝑑μ(x)\displaystyle=\sup_{y\in\mathbb{X}}\int\left(\left.k(y,x)-\mathbb{E}\left[\left.k(X,x)\right.\right]\right.\right)^{2}d\mu(x)
4supy𝕏k(y,x)2𝑑μ(x)4Υn.\displaystyle\leq 4\sup_{y\in\mathbb{X}}\int k(y,x)^{2}d\mu(x)\leq 4\Upsilon n\enspace.

In particular, since ζk0\zeta_{k}\geq 0,

Pζk2ζkPζk4ΥnPΘk.P\zeta_{k}^{2}\leq\left\lVert\zeta_{k}\right\rVert_{\infty}P\zeta_{k}\leq 4\Upsilon nP\Theta_{k}\enspace.
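More precisely, since P_{n}\zeta_{k}-P\Theta_{k}=(P_{n}-P)\zeta_{k}-\left\lVert s_{k}\right\rVert^{2}, the previous bounds together with the elementary inequality 2ab\leq\theta a^{2}+b^{2}/\theta give, on the event where the above concentration inequality holds, for any \theta\in(0,1),

\left\lvert P_{n}\zeta_{k}-P\Theta_{k}\right\rvert\leq\sqrt{8\Upsilon xP\Theta_{k}}+\frac{4\Upsilon x}{3}+\left\lVert s_{k}\right\rVert^{2}\leq\theta P\Theta_{k}+\frac{2\Upsilon x}{\theta}+\frac{4\Upsilon x}{3}+\left\lVert s_{k}\right\rVert^{2}\enspace,

and \left\lVert s_{k}\right\rVert^{2}\leq\Upsilon under (11), so that, since x\geq 1 and \theta<1, the last three terms are indeed of the form \square\Upsilon x/\theta.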

It follows from these computations and from (11) that there exists an absolute constant \square such that, for any x1x\geq 1, with probability larger than 12|𝒦|ex1-2\left\lvert\mathcal{K}\right\rvert e^{-x}, for any θ(0,1)\theta\in(0,1),

|PnζkPΘk|θPΘk+Υxθ.\left\lvert P_{n}\zeta_{k}-P\Theta_{k}\right\rvert\leq\theta P\Theta_{k}+\square\frac{\Upsilon x}{\theta}\enspace.

We now need to control the term UA,kU_{A,k}. From Proposition 2.2, for any x1x\geq 1, with probability larger than 15.4|𝒦|ex1-5.4\left\lvert\mathcal{K}\right\rvert e^{-x},

|UA,k|n2n2(Cx+Dx+Bx3/2+Ax2).\frac{\left\lvert U_{A,k}\right\rvert}{n^{2}}\leq\frac{\square}{n^{2}}\left(\left.C\sqrt{x}+Dx+Bx^{3/2}+Ax^{2}\right.\right)\enspace.

By (1), (11) and Cauchy-Schwarz inequality,

A=4sup(x,y)𝕏2k(x,z)k(y,z)𝑑μ(z)4supx𝕏k(x,z)2𝑑μ(z)4Υn.A=4\sup_{(x,y)\in\mathbb{X}^{2}}\int k(x,z)k(y,z)d\mu(z)\leq 4\sup_{x\in\mathbb{X}}\int k(x,z)^{2}d\mu(z)\leq 4\Upsilon n\enspace.

In addition, by (15), B216supx𝕏𝔼[Ak(X,x)2]16Υn.B^{2}\leq 16\sup_{x\in\mathbb{X}}\mathbb{E}\left[\left.A_{k}(X,x)^{2}\right.\right]\leq 16\Upsilon n\enspace.
Moreover, applying Assumption (14),

C2ij=1n𝔼[Ak(Xi,Xj)2]n2𝔼[Ak(X,Y)2]n2ΥPΘk.C^{2}\leq\sum_{i\neq j=1}^{n}\mathbb{E}\left[\left.A_{k}(X_{i},X_{j})^{2}\right.\right]\leq n^{2}\mathbb{E}\left[\left.A_{k}(X,Y)^{2}\right.\right]\leq n^{2}\Upsilon P\Theta_{k}\enspace.

Finally, applying the Cauchy-Schwarz inequality and proceeding as for C2C^{2}, the quantity used to define DD can be bounded above as follows:

𝔼[i=1n1j=i+1nai(Xi)bj(Xj)Ak(Xi,Xj)]n𝔼[Ak(X,Y)2]nΥPΘk.\mathbb{E}\left[\left.\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}a_{i}(X_{i})b_{j}(X_{j})A_{k}(X_{i},X_{j})\right.\right]\leq n\sqrt{\mathbb{E}\left[\left.A_{k}(X,Y)^{2}\right.\right]}\leq n\sqrt{\Upsilon P\Theta_{k}}\enspace.
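Combining these bounds, and using again 2ab\leq\theta a^{2}+b^{2}/\theta together with \Upsilon\geq 1 (recall that \Gamma\geq 1), x\geq 1 and \theta\in(0,1), one gets for instance

\frac{C\sqrt{x}+Dx}{n^{2}}\leq\frac{2\sqrt{\Upsilon P\Theta_{k}}\,x}{n}\leq\theta\frac{P\Theta_{k}}{n}+\frac{\Upsilon x^{2}}{\theta n}\qquad\text{and}\qquad\frac{Bx^{3/2}+Ax^{2}}{n^{2}}\leq\frac{8\Upsilon x^{2}}{n}\enspace,

so that, up to the absolute constant \square of Proposition 2.2, all four terms are absorbed in the bound below.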

Hence for any x1x\geq 1, with probability larger than 15.4|𝒦|ex1-5.4\left\lvert\mathcal{K}\right\rvert e^{-x},

for any θ(0,1),|UA,k|n2θPΘkn+Υx2θn.\mbox{for any }\theta\in(0,1),\quad\frac{\left\lvert U_{A,k}\right\rvert}{n^{2}}\leq\theta\frac{P\Theta_{k}}{n}+\square\frac{\Upsilon x^{2}}{\theta n}\enspace.

Therefore, on the intersection of the two events above (which has probability larger than 1-7.4\left\lvert\mathcal{K}\right\rvert e^{-x}), for all θ(0,1)\theta\in(0,1),

|s^ksk2PΘkn|2θPΘkn+Υx2θn,\left\lvert\left\lVert\widehat{s}_{k}-s_{k}\right\rVert^{2}-\frac{P\Theta_{k}}{n}\right\rvert\leq 2\theta\frac{P\Theta_{k}}{n}+\square\frac{\Upsilon x^{2}}{\theta n}\enspace,

and the first part of the result follows by choosing θ=η/2\theta=\eta/2. Concerning the two remaining inequalities appearing in the proposition, we begin by expanding the loss. For all k𝒦k\in\mathcal{K}

s^ks2=s^ksk2+sks2+2s^ksk,sks.\left\lVert\widehat{s}_{k}-s\right\rVert^{2}=\left\lVert\widehat{s}_{k}-s_{k}\right\rVert^{2}+\left\lVert s_{k}-s\right\rVert^{2}+2\langle\widehat{s}_{k}-s_{k},s_{k}-s\rangle\enspace.

Then, for all x𝕏x\in\mathbb{X}

FA,k(x)sk(x)\displaystyle F_{A,k}(x)-s_{k}(x) =s(y)k(x,z)k(z,y)𝑑μ(z)𝑑μ(y)s(z)k(z,x)𝑑μ(z)\displaystyle=\int s(y)\int k(x,z)k(z,y)d\mu(z)d\mu(y)-\int s(z)k(z,x)d\mu(z)
=(s(y)k(z,y)𝑑μ(y)s(z))k(x,z)𝑑μ(z)\displaystyle=\int\left(\left.\int s(y)k(z,y)d\mu(y)-s(z)\right.\right)k(x,z)d\mu(z)
=(sk(z)s(z))k(z,x)𝑑μ(z).\displaystyle=\int\left(\left.s_{k}(z)-s(z)\right.\right)k(z,x)d\mu(z)\enspace.

Moreover, since PFA,k=sk2PF_{A,k}=\left\lVert s_{k}\right\rVert^{2}, we find

s^ksk,sks\displaystyle\langle\widehat{s}_{k}-s_{k},s_{k}-s\rangle =(s^k(x)(sk(x)s(x)))𝑑μ(x)+𝔼[sk(X)]sk2\displaystyle=\int\left(\left.\widehat{s}_{k}(x)\left(\left.s_{k}(x)-s(x)\right.\right)\right.\right)d\mu(x)+\mathbb{E}\left[\left.s_{k}(X)\right.\right]-\left\lVert s_{k}\right\rVert^{2}
=1ni=1n(k(x,Xi)(sk(x)s(x)))𝑑μ(x)+P(skFA,k)\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\int\left(\left.k(x,X_{i})\left(\left.s_{k}(x)-s(x)\right.\right)\right.\right)d\mu(x)+P(s_{k}-F_{A,k})
=1ni=1n(FA,k(Xi)sk(Xi))+P(skFA,k)\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\left(\left.F_{A,k}(X_{i})-s_{k}(X_{i})\right.\right)+P(s_{k}-F_{A,k})
=(PnP)(FA,ksk).\displaystyle=(P_{n}-P)(F_{A,k}-s_{k})\enspace.

This expression motivates us to apply Proposition 2.1 again to this term. By (1), (11) and the Cauchy-Schwarz inequality, we find

supx𝕏|FA,k(x)sk(x)|\displaystyle\sup_{x\in\mathbb{X}}\left\lvert F_{A,k}(x)-s_{k}(x)\right\rvert ssksupx𝕏|s(z)sk(z)|sskk(x,z)𝑑μ(z)\displaystyle\leq\left\lVert s-s_{k}\right\rVert\sup_{x\in\mathbb{X}}\int\frac{\left\lvert s(z)-s_{k}(z)\right\rvert}{\left\lVert s-s_{k}\right\rVert}k(x,z)d\mu(z)
ssksupx𝕏k(x,z)2𝑑μ(z)sskΥn.\displaystyle\leq\left\lVert s-s_{k}\right\rVert\sqrt{\sup_{x\in\mathbb{X}}\int k(x,z)^{2}d\mu(z)}\leq\left\lVert s-s_{k}\right\rVert\sqrt{\Upsilon n}\enspace.

Moreover,

P(FA,ksk)2\displaystyle P\left(\left.F_{A,k}-s_{k}\right.\right)^{2} ssk2P(|s(z)sk(z)|sskk(.,z)dμ(z))2\displaystyle\leq\left\lVert s-s_{k}\right\rVert^{2}P\left(\left.\int\frac{\left\lvert s(z)-s_{k}(z)\right\rvert}{\left\lVert s-s_{k}\right\rVert}k(.,z)d\mu(z)\right.\right)^{2}
ssk2vk2.\displaystyle\leq\left\lVert s-s_{k}\right\rVert^{2}v_{k}^{2}\enspace.

Thus by (16), for any θ,u>0\theta,u>0,

2P(FA,ksk)2xnθssk2+(ΥΥPΘk)x2θn\displaystyle\sqrt{\frac{2P\left(\left.F_{A,k}-s_{k}\right.\right)^{2}x}{n}}\leq\theta\left\lVert s-s_{k}\right\rVert^{2}+\frac{\left(\left.\Upsilon\vee\sqrt{\Upsilon P\Theta_{k}}\right.\right)x}{2\theta n}
θssk2+Υxθn(uθPΘkn+Υx216θun).\displaystyle\leq\theta\left\lVert s-s_{k}\right\rVert^{2}+\frac{\Upsilon x}{\theta n}\vee\left(\left.\frac{u}{\theta}\frac{P\Theta_{k}}{n}+\frac{\Upsilon x^{2}}{16\theta un}\right.\right)\enspace.

Hence, for any θ(0,1)\theta\in(0,1) and x1x\geq 1, taking u=θ2u=\theta^{2}

2P(FA,ksk)2xnθ(ssk2+PΘkn)+Υx2θ3n.\displaystyle\sqrt{\frac{2P\left(\left.F_{A,k}-s_{k}\right.\right)^{2}x}{n}}\leq\theta\left(\left.\left\lVert s-s_{k}\right\rVert^{2}+\frac{P\Theta_{k}}{n}\right.\right)+\square\frac{\Upsilon x^{2}}{\theta^{3}n}\enspace.

By Proposition 2.1, for all θ\theta in (0,1)(0,1), for all x>0x>0, with probability larger than 12|𝒦|ex1-2|\mathcal{K}|e^{-x},

2|s^ksk,sks|\displaystyle 2\left\lvert\langle\widehat{s}_{k}-s_{k},s_{k}-s\rangle\right\rvert 22P(FA,ksk)2xn+2sskΥnx3n\displaystyle\leq 2\sqrt{\frac{2P\left(\left.F_{A,k}-s_{k}\right.\right)^{2}x}{n}}+2\left\lVert s-s_{k}\right\rVert\sqrt{\Upsilon n}\frac{x}{3n}
3θ(ssk2+PΘkn)+Υx2θ3n.\displaystyle\leq 3\theta\left(\left.\left\lVert s-s_{k}\right\rVert^{2}+\frac{P\Theta_{k}}{n}\right.\right)+\square\frac{\Upsilon x^{2}}{\theta^{3}n}\enspace.

Putting together all of the above, one concludes that for all θ\theta in (0,1)(0,1), for all x>1x>1, with probability larger than 19.4|𝒦|ex1-9.4|\mathcal{K}|e^{-x}

s^ks2sks23θssk2+(1+4θ)PΘkn+Υx2θ3n\left\lVert\widehat{s}_{k}-s\right\rVert^{2}-\left\lVert s_{k}-s\right\rVert^{2}\leq 3\theta\left\lVert s-s_{k}\right\rVert^{2}+(1+4\theta)\frac{P\Theta_{k}}{n}+\square\frac{\Upsilon x^{2}}{\theta^{3}n}

and

s^ks2sks23θ(ssk2+PΘkn)+(1θ)PΘknΥx2θ3n.\left\lVert\widehat{s}_{k}-s\right\rVert^{2}-\left\lVert s_{k}-s\right\rVert^{2}\geq-3\theta\left(\left.\left\lVert s-s_{k}\right\rVert^{2}+\frac{P\Theta_{k}}{n}\right.\right)+(1-\theta)\frac{P\Theta_{k}}{n}-\square\frac{\Upsilon x^{2}}{\theta^{3}n}\enspace.

Choosing θ=η/4\theta=\eta/4 leads to the second part of the result.

6.3 Proof of Theorem 4.3

It follows from (17) (applied with θ=(logn)1\theta=\square(\log n)^{-1} and x=log(n|𝒦n|)x=\square\log(n\vee|\mathcal{K}_{n}|)) and Assumption (26) that, with probability larger than 1n21-\square n^{-2}, we have for any k𝒦nk\in\mathcal{K}_{n} and any n2n\geq 2

s^k^ns2(1+logn)s^ks2(1+δ)(1+logn)PΘkn+(1+δ)(1+logn)PΘk^nn+δ,δ,Υlog(|𝒦n|n)3n.\left\lVert\widehat{s}_{{\widehat{k}}_{n}}-s\right\rVert^{2}\leq\left(\left.1+\frac{\square}{\log n}\right.\right)\left\lVert\widehat{s}_{k}-s\right\rVert^{2}-(1+\delta^{\prime})\left(\left.1+\frac{\square}{\log n}\right.\right)\frac{P\Theta_{k}}{n}\\ +(1+\delta)\left(\left.1+\frac{\square}{\log n}\right.\right)\frac{P\Theta_{{\widehat{k}}_{n}}}{n}+\square_{\delta,\delta^{\prime},\Upsilon}\frac{\log(|\mathcal{K}_{n}|\vee n)^{3}}{n}\enspace. (31)

Applying this inequality with k=k1,nk=k_{1,n}, and using Proposition 4.1 with η=(logn)1/3\eta=\square(\log n)^{-1/3} and x=log(|𝒦n|n)x=\square\log(|\mathcal{K}_{n}|\vee n) to bound s^k^ns2\left\lVert\widehat{s}_{{\widehat{k}}_{n}}-s\right\rVert^{2} from below and s^k1,ns2\left\lVert\widehat{s}_{k_{1,n}}-s\right\rVert^{2} from above, we obtain asymptotically that, with probability larger than 1n21-\square n^{-2},

δ(1+δo(1))PΘk^nn(1+o(1))sk1,ns2δ(1+δo(1))PΘk1,nn+δ,δ,Υlog(|𝒦n|n)3n.-\delta(1+\square_{\delta}~\mathrm{o}(1))\frac{P\Theta_{{\widehat{k}}_{n}}}{n}\leq\left(\left.1+\mathrm{o}(1)\right.\right)\left\lVert s_{k_{1,n}}-s\right\rVert^{2}-\delta^{\prime}(1+\square_{\delta^{\prime}}~\mathrm{o}(1))\frac{P\Theta_{k_{1,n}}}{n}\\ +\square_{\delta,\delta^{\prime},\Upsilon}\frac{\log(|\mathcal{K}_{n}|\vee n)^{3}}{n}\enspace.

By Assumption (25), sk1,ns2co(1)PΘk1,nn\left\lVert s_{k_{1,n}}-s\right\rVert^{2}\leq c~\mathrm{o}(1)\frac{P\Theta_{k_{1,n}}}{n} and by (22),

(log(|𝒦n|n))3ncRcso(1)PΘk1,nn.\frac{\left(\left.\log(|\mathcal{K}_{n}|\vee n)\right.\right)^{3}}{n}\leq c_{R}c_{s}~\mathrm{o}(1)\frac{P\Theta_{k_{1,n}}}{n}\enspace.

This gives (27). In addition, on the event where (31) holds, Proposition 4.1 also yields, with probability larger than 1n21-\square n^{-2},

s^k^ns2(1+logn)s^k1,ns2(1+δ)PΘk1,nn+(1+δ)(1+o(1))s^k^ns2+δ,δ,Υlog(|𝒦n|n)3n.\left\lVert\widehat{s}_{{\widehat{k}}_{n}}-s\right\rVert^{2}\leq\left(\left.1+\frac{\square}{\log n}\right.\right)\left\lVert\widehat{s}_{k_{1,n}}-s\right\rVert^{2}-(1+\delta^{\prime})\frac{P\Theta_{k_{1,n}}}{n}\\ +(1+\delta)\left(\left.1+\mathrm{o}(1)\right.\right)\left\lVert\widehat{s}_{{\widehat{k}}_{n}}-s\right\rVert^{2}+\square_{\delta,\delta^{\prime},\Upsilon}\frac{\log(|\mathcal{K}_{n}|\vee n)^{3}}{n}\enspace.

Since s^k1,ns2PΘk1,nn\left\lVert\widehat{s}_{k_{1,n}}-s\right\rVert^{2}\simeq\frac{P\Theta_{k_{1,n}}}{n}, this leads to

(δ+δo(1))s^k^s2(δ+δ,co(1))s^k1,ns2+δ,δ,Υlog(|𝒦n|n)3n.(-\delta+\square_{\delta}~\mathrm{o}(1))\left\lVert\widehat{s}_{{\widehat{k}}}-s\right\rVert^{2}\leq\\ -(\delta^{\prime}+\square_{\delta^{\prime},c}~\mathrm{o}(1))\left\lVert\widehat{s}_{k_{1,n}}-s\right\rVert^{2}+\square_{\delta,\delta^{\prime},\Upsilon}~\frac{\log(|\mathcal{K}_{n}|\vee n)^{3}}{n}\enspace.

This leads to (28) by (21).

Appendix A Proofs for the examples

A.1 Computation of the constant Γ\Gamma for the three examples

We have to show for each family {k}k𝒦\left\{\left.k\right.\right\}_{k\in\mathcal{K}} (see (8) and (1)) that there exists a constant Γ1\Gamma\geq 1 such that for all k𝒦k\in\mathcal{K}

supx𝕏|Θk(x)|Γn,andsup(x,y)𝕏2|k(x,y)|Γn.\sup_{x\in\mathbb{X}}~\left\lvert\Theta_{k}(x)\right\rvert\leq\Gamma n,\quad\text{and}\quad\sup_{(x,y)\in\mathbb{X}^{2}}\left\lvert k(x,y)\right\rvert\leq\Gamma n\enspace.

Example 1: Projection kernels.

First, notice that, by the Cauchy-Schwarz inequality, |kS(x,y)|χkS(x)χkS(y)\left\lvert k_{S}(x,y)\right\rvert\leq\sqrt{\chi_{k_{S}}(x)\chi_{k_{S}}(y)} for all (x,y)𝕏2(x,y)\in\mathbb{X}^{2}, and that, by orthonormality, for any (x,x)𝕏2(x,x^{\prime})\in\mathbb{X}^{2},

AkS(x,x)=(i,j)S2φi(x)φj(x)𝕏φi(y)φj(y)𝑑μ(y)=kS(x,x).A_{k_{S}}(x,x^{\prime})=\sum_{(i,j)\in\mathcal{I}^{2}_{S}}\varphi_{i}(x)\varphi_{j}(x^{\prime})\int_{\mathbb{X}}\varphi_{i}(y)\varphi_{j}(y)d\mu(y)=k_{S}(x,x^{\prime})\enspace.

In particular, for any x𝕏x\in\mathbb{X}, ΘkS(x)=χkS(x)\Theta_{k_{S}}(x)=\chi_{k_{S}}(x). Hence, projection kernels satisfy (1) for Γ=1n1supS𝒮χkS\Gamma=1\vee n^{-1}\sup_{S\in\mathcal{S}}\left\lVert\chi_{k_{S}}\right\rVert_{\infty}. We conclude by writing

χkS=supx𝕏iSφi(x)2=sup(ai)i s.t. iSai2=1supx𝕏(iSaiφi(x))2.\left\lVert\chi_{k_{S}}\right\rVert_{\infty}=\sup_{x\in\mathbb{X}}\sum_{i\in\mathcal{I}_{S}}\varphi_{i}(x)^{2}=\sup_{\begin{subarray}{c}(a_{i})_{i\in\mathcal{I}}\,\mbox{ s.t. }\,\\ \sum_{i\in\mathcal{I}_{S}}a_{i}^{2}=1\end{subarray}}\sup_{x\in\mathbb{X}}\left(\left.\sum_{i\in\mathcal{I}_{S}}a_{i}\varphi_{i}(x)\right.\right)^{2}\enspace.

For fSf\in S we have f2=if,φi2\left\lVert f\right\rVert^{2}=\sum_{i\in\mathcal{I}}\langle f,\varphi_{i}\rangle^{2}. Hence with ai=f,φia_{i}=\langle f,\varphi_{i}\rangle,

χkS=supfS,f=1f2.\left\lVert\chi_{k_{S}}\right\rVert_{\infty}=\sup_{f\in S,\left\lVert f\right\rVert=1}\left\lVert f\right\rVert_{\infty}^{2}\enspace.
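As an illustration, for the regular histogram model S on [0,1] spanned by the orthonormal functions \varphi_{i}=\sqrt{D}\mathbf{1}_{[(i-1)/D,i/D)}, 1\leq i\leq D, this supremum equals D (take f=\varphi_{1}), so that a collection of such models satisfies (1) with \Gamma=1 as soon as all the dimensions D are smaller than n.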

Example 2: Approximation kernels.

First, sup(x,y)𝕏2|kK,h(x,y)|K/h.\sup_{(x,y)\in\mathbb{X}^{2}}\left\lvert k_{K,h}(x,y)\right\rvert\leq\left\lVert K\right\rVert_{\infty}/h. Second, since KL1K\in L^{1}

ΘkK,h(x)=1h2𝕏K(xyh)2𝑑y=K2hKK1h.\Theta_{k_{K,h}}(x)=\frac{1}{h^{2}}\int_{\mathbb{X}}K\left(\left.\frac{x-y}{h}\right.\right)^{2}dy=\frac{\left\lVert K\right\rVert^{2}}{h}\leq\frac{\left\lVert K\right\rVert_{\infty}\left\lVert K\right\rVert_{1}}{h}\enspace.

Now, since KL1K\in L^{1} and K(u)𝑑u=1\int K(u)du=1, we have K11\left\lVert K\right\rVert_{1}\geq 1; hence (1) holds with Γ=1\Gamma=1 as soon as hKK1/nh\geq\left\lVert K\right\rVert_{\infty}\left\lVert K\right\rVert_{1}/n.
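As an illustration, for the standard Gaussian kernel K(u)=e^{-u^{2}/2}/\sqrt{2\pi}, one has \left\lVert K\right\rVert_{1}=1, \left\lVert K\right\rVert_{\infty}=1/\sqrt{2\pi} and \left\lVert K\right\rVert^{2}=1/(2\sqrt{\pi}), so that this condition simply reads h\geq 1/(\sqrt{2\pi}\,n).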

Example 3: Weighted projection kernels.

For all x𝕏x\in\mathbb{X}

Θkw(x)=i,j=1pwiφi(x)wjφj(x)𝕏φi(y)φj(y)𝑑μ(y)=i=1pwi2φi(x)2.\Theta_{k_{w}}(x)=\sum_{i,j=1}^{p}w_{i}\varphi_{i}(x)w_{j}\varphi_{j}(x)\int_{\mathbb{X}}\varphi_{i}(y)\varphi_{j}(y)d\mu(y)=\sum_{i=1}^{p}w_{i}^{2}\varphi_{i}(x)^{2}\enspace.

From Cauchy-Schwarz inequality, for any (x,y)𝕏2(x,y)\in\mathbb{X}^{2},

|kw(x,y)|Θkw(x)Θkw(y).\left\lvert k_{w}(x,y)\right\rvert\leq\sqrt{\Theta_{k_{w}}(x)\Theta_{k_{w}}(y)}\enspace.

We thus find that kwk_{w} satisfies (1) for any Γ1n1supw𝒲Θkw\Gamma\geq 1\vee n^{-1}\sup_{w\in\mathcal{W}}\left\lVert\Theta_{k_{w}}\right\rVert_{\infty}. Since wi1w_{i}\leq 1, we obtain the announced result, which does not depend on 𝒲\mathcal{W}.

A.2 Proof of Proposition 3.2

Since skS2s2s\left\lVert s_{k_{S}}\right\rVert^{2}\leq\left\lVert s\right\rVert^{2}\leq\left\lVert s\right\rVert_{\infty}, we find that (11) only requires ΥΓ(1+s)\Upsilon\geq\Gamma(1+\left\lVert s\right\rVert_{\infty}). Assumption (12) holds: this follows from ΥΓ\Upsilon\geq\Gamma and

𝔼[χkS(X)2]χkSPχkSΓnPΘkS.\mathbb{E}\left[\left.\chi_{k_{S}}(X)^{2}\right.\right]\leq\left\lVert\chi_{k_{S}}\right\rVert_{\infty}P\chi_{k_{S}}\leq\Gamma nP\Theta_{k_{S}}\enspace.

Now for proving Assumption (14), we write

𝔼[AkS(X,Y)2]\displaystyle\mathbb{E}\left[\left.A_{k_{S}}(X,Y)^{2}\right.\right] =𝔼[kS(X,Y)2]=𝕏𝔼[kS(X,x)2]s(x)𝑑μ(x)\displaystyle=\mathbb{E}\left[\left.k_{S}(X,Y)^{2}\right.\right]=\int_{\mathbb{X}}\mathbb{E}\left[\left.k_{S}(X,x)^{2}\right.\right]s(x)d\mu(x)
s(i,j)S2𝔼[φi(X)φj(X)]𝕏φi(x)φj(x)𝑑μ(x)\displaystyle\leq\left\lVert s\right\rVert_{\infty}\sum_{(i,j)\in\mathcal{I}^{2}_{S}}\mathbb{E}\left[\left.\varphi_{i}(X)\varphi_{j}(X)\right.\right]\int_{\mathbb{X}}\varphi_{i}(x)\varphi_{j}(x)d\mu(x)
=sPΘkSΥPΘkS.\displaystyle=\left\lVert s\right\rVert_{\infty}P\Theta_{k_{S}}\leq\Upsilon P\Theta_{k_{S}}\enspace.

In the same way, Assumption (15) follows from sΓΥ\left\lVert s\right\rVert_{\infty}\Gamma\leq\Upsilon. Suppose now that (19) holds with S=S+SS=S+S^{\prime}, so that the basis (φi)i(\varphi_{i})_{i\in\mathcal{I}} of SS^{\prime} is included in the basis (φi)i𝒥(\varphi_{i})_{i\in\mathcal{J}} of SS. Since χkSΓn\left\lVert\chi_{k_{S}}\right\rVert_{\infty}\leq\Gamma n, we have

skS(x)skS(x)\displaystyle s_{k_{S}}(x)-s_{k_{S^{\prime}}}(x) =j𝒥(Pφj)φj(x)j𝒥(Pφj)2j𝒥φj(x)2\displaystyle=\sum_{j\in\mathcal{J}\setminus\mathcal{I}}\left(\left.P\varphi_{j}\right.\right)\varphi_{j}(x)\leq\sqrt{\sum_{j\in\mathcal{J}\setminus\mathcal{I}}\left(\left.P\varphi_{j}\right.\right)^{2}\sum_{j\in\mathcal{J}\setminus\mathcal{I}}\varphi_{j}(x)^{2}}
skSskSχkS1/2skSskSΓn.\displaystyle\leq\left\lVert s_{k_{S}}-s_{k_{S^{\prime}}}\right\rVert\left\lVert\chi_{k_{S}}\right\rVert^{1/2}_{\infty}\leq\left\lVert s_{k_{S}}-s_{k_{S^{\prime}}}\right\rVert\sqrt{\Gamma n}\enspace.

Hence, (13) holds in this case. Assumption (20) also implies (13), since

skSskSskS+skSΥ.\left\lVert s_{k_{S}}-s_{k_{S^{\prime}}}\right\rVert_{\infty}\leq\left\lVert s_{k_{S}}\right\rVert_{\infty}+\left\lVert s_{k_{S^{\prime}}}\right\rVert_{\infty}\leq\Upsilon\enspace.

Finally for (16), for any aL2a\in L^{2},

𝕏a(x)kS(x,y)𝑑μ(x)=ia,φiφi(y)=ΠS(a)(y),\int_{\mathbb{X}}a(x)k_{S}(x,y)d\mu(x)=\sum_{i\in\mathcal{I}}\langle a,\varphi_{i}\rangle\varphi_{i}(y)=\Pi_{S}(a)(y)\enspace,

where ΠS(a)\Pi_{S}(a) is the orthogonal projection of aa onto SS. Therefore, 𝔹kS\mathbb{B}_{k_{S}} is the unit ball in SS for the L2L^{2}-norm and, for any t𝔹kSt\in\mathbb{B}_{k_{S}}, 𝔼[t(X)2]st2s.\mathbb{E}\left[\left.t(X)^{2}\right.\right]\leq\left\lVert s\right\rVert_{\infty}\left\lVert t\right\rVert^{2}\leq\left\lVert s\right\rVert_{\infty}\enspace.

A.3 Proof of Proposition 3.3

First, since K11\left\lVert K\right\rVert_{1}\geq 1

skK,h2\displaystyle\left\lVert s_{k_{K,h}}\right\rVert^{2} =𝕏(𝕏s(y)1hK(xyh)𝑑y)2𝑑x\displaystyle=\int_{\mathbb{X}}\left(\left.\int_{\mathbb{X}}s(y)\frac{1}{h}K\left(\left.\frac{x-y}{h}\right.\right)dy\right.\right)^{2}dx
=𝕏(𝕏s(x+hz)K(z)𝑑z)2𝑑x\displaystyle=\int_{\mathbb{X}}\left(\left.\int_{\mathbb{X}}s(x+hz)K\left(\left.z\right.\right)dz\right.\right)^{2}dx
K12𝕏(𝕏s(x+hz)|K(z)|K1𝑑z)2𝑑x\displaystyle\leq\left\lVert K\right\rVert_{1}^{2}\int_{\mathbb{X}}\left(\left.\int_{\mathbb{X}}s(x+hz)\frac{\left\lvert K\left(\left.z\right.\right)\right\rvert}{\left\lVert K\right\rVert_{1}}dz\right.\right)^{2}dx
K12𝕏2s(x+hz)2|K(z)|K1𝑑x𝑑zsK12.\displaystyle\leq\left\lVert K\right\rVert_{1}^{2}\int_{\mathbb{X}^{2}}s(x+hz)^{2}\frac{\left\lvert K\left(\left.z\right.\right)\right\rvert}{\left\lVert K\right\rVert_{1}}dxdz\leq\left\lVert s\right\rVert_{\infty}\left\lVert K\right\rVert_{1}^{2}\enspace.

Hence, Assumption (11) holds if Υ1+sK12\Upsilon\geq 1+\left\lVert s\right\rVert_{\infty}\left\lVert K\right\rVert_{1}^{2}. Now, we have

P(χkK,h2)=K(0)2h2=PΘkK,hK(0)2K2hnPΘkK,hK(0)2K2KK1,P\left(\left.\chi_{k_{K,h}}^{2}\right.\right)=\frac{K(0)^{2}}{h^{2}}=P\Theta_{k_{K,h}}\frac{K(0)^{2}}{\left\lVert K\right\rVert^{2}h}\leq nP\Theta_{k_{K,h}}\frac{K(0)^{2}}{\left\lVert K\right\rVert^{2}\left\lVert K\right\rVert_{\infty}\left\lVert K\right\rVert_{1}}\enspace,

so it is sufficient to have ΥK(0)/K2\Upsilon\geq K(0)/\left\lVert K\right\rVert^{2} (since K(0)KK(0)\leq\left\lVert K\right\rVert_{\infty}) to ensure (12). Moreover, for any hh\in\mathcal{H} and any x𝕏x\in\mathbb{X},

skK,h(x)=𝕏s(y)1hK(xyh)𝑑y=𝕏s(x+zh)K(z)𝑑zsK1.s_{k_{K,h}}(x)=\int_{\mathbb{X}}s(y)\frac{1}{h}K\left(\left.\frac{x-y}{h}\right.\right)dy=\int_{\mathbb{X}}s(x+zh)K(z)dz\leq\left\lVert s\right\rVert_{\infty}\left\lVert K\right\rVert_{1}\enspace.

Therefore, Assumption (13) holds for Υ2sK1\Upsilon\geq 2\left\lVert s\right\rVert_{\infty}\left\lVert K\right\rVert_{1}. Then, on the one hand,

|AkK,h(x,y)|\displaystyle\left\lvert A_{k_{K,h}}(x,y)\right\rvert 1h2𝕏|K(xzh)K(yzh)|𝑑z\displaystyle\leq\frac{1}{h^{2}}\int_{\mathbb{X}}\left\lvert K\left(\left.\frac{x-z}{h}\right.\right)K\left(\left.\frac{y-z}{h}\right.\right)\right\rvert dz
1h𝕏|K(xyhu)K(u)|𝑑u\displaystyle\leq\frac{1}{h}\int_{\mathbb{X}}\left\lvert K\left(\left.\frac{x-y}{h}-u\right.\right)K\left(\left.u\right.\right)\right\rvert du
K2hKK1hPΘkK,hn.\displaystyle\leq\frac{\left\lVert K\right\rVert^{2}}{h}\wedge\frac{\left\lVert K\right\rVert_{\infty}\left\lVert K\right\rVert_{1}}{h}\leq P\Theta_{k_{K,h}}\wedge n\enspace.

On the other hand,

𝔼[|AkK,h(X,x)|]\displaystyle\mathbb{E}\left[\left.\left\lvert A_{k_{K,h}}(X,x)\right\rvert\right.\right] 1h𝕏2|K(xyhu)K(u)|𝑑us(y)𝑑y\displaystyle\leq\frac{1}{h}\int_{\mathbb{X}^{2}}\left\lvert K\left(\left.\frac{x-y}{h}-u\right.\right)K\left(\left.u\right.\right)\right\rvert du~s(y)dy
=𝕏2|K(v)K(u)|s(x+h(vu))𝑑u𝑑vsK12.\displaystyle=\int_{\mathbb{X}^{2}}\left\lvert K\left(\left.v\right.\right)K\left(\left.u\right.\right)\right\rvert s(x+h(v-u))dudv\leq\left\lVert s\right\rVert_{\infty}\left\lVert K\right\rVert_{1}^{2}\enspace.

Therefore,

supx𝕏𝔼[AkK,h(X,x)2]sup(x,y)𝕏2|AkK,h(x,y)|supx𝕏𝔼[|AkK,h(X,x)|](PΘkK,hn)sK12,\sup_{x\in\mathbb{X}}~\mathbb{E}\left[\left.A_{k_{K,h}}(X,x)^{2}\right.\right]\leq\sup_{(x,y)\in\mathbb{X}^{2}}\left\lvert A_{k_{K,h}}(x,y)\right\rvert~\sup_{x\in\mathbb{X}}~\mathbb{E}\left[\left.\left\lvert A_{k_{K,h}}(X,x)\right\rvert\right.\right]\\ \leq\left(\left.P\Theta_{k_{K,h}}\wedge n\right.\right)\left\lVert s\right\rVert_{\infty}\left\lVert K\right\rVert_{1}^{2}\enspace,

and 𝔼[AkK,h(X,Y)2]supx𝕏𝔼[AkK,h(X,x)2]sK12PΘkK,h.\mathbb{E}\left[\left.A_{k_{K,h}}(X,Y)^{2}\right.\right]\leq\sup_{x\in\mathbb{X}}~\mathbb{E}\left[\left.A_{k_{K,h}}(X,x)^{2}\right.\right]\leq\left\lVert s\right\rVert_{\infty}\left\lVert K\right\rVert_{1}^{2}P\Theta_{k_{K,h}}\enspace. Hence Assumptions (14) and (15) hold when ΥsK12\Upsilon\geq\left\lVert s\right\rVert_{\infty}\left\lVert K\right\rVert_{1}^{2}. Finally, let us prove that Assumption (16) is satisfied. Let t𝔹kK,ht\in\mathbb{B}_{k_{K,h}} and aL2a\in L^{2} be such that a=1\left\lVert a\right\rVert=1 and t(y)=𝕏a(x)1hK(xyh)𝑑xt(y)=\int_{\mathbb{X}}a(x)\frac{1}{h}K\left(\left.\frac{x-y}{h}\right.\right)dx for all y𝕏y\in\mathbb{X}. Then, the Cauchy-Schwarz inequality yields

t(y)1h𝕏a(x)2𝑑x𝕏K(xyh)2𝑑xKh.t(y)\leq\frac{1}{h}\sqrt{\int_{\mathbb{X}}a(x)^{2}dx}\sqrt{\int_{\mathbb{X}}K\left(\left.\frac{x-y}{h}\right.\right)^{2}dx}\leq\frac{\left\lVert K\right\rVert}{\sqrt{h}}\enspace.

Thus for any t𝔹kK,ht\in\mathbb{B}_{k_{K,h}}

Pt2t|t|,sKhs=sPΘkK,hΥPΘkK,h.Pt^{2}\leq\left\lVert t\right\rVert_{\infty}\langle\left\lvert t\right\rvert,s\rangle\leq\frac{\left\lVert K\right\rVert}{\sqrt{h}}\left\lVert s\right\rVert=\left\lVert s\right\rVert\sqrt{P\Theta_{k_{K,h}}}\leq\sqrt{\Upsilon P\Theta_{k_{K,h}}}\enspace.

We conclude that all the assumptions hold if

Υ(K(0)/K2)(1+2sK12).\Upsilon\geq\left(\left.K(0)/\left\lVert K\right\rVert^{2}\right.\right)\vee\left(\left.1+2\left\lVert s\right\rVert_{\infty}\left\lVert K\right\rVert_{1}^{2}\right.\right)\enspace.

A.4 Proof of Proposition 3.4

Let us define for convenience Φ(x):=i=1pφi(x)2\Phi(x):=\sum_{i=1}^{p}\varphi_{i}(x)^{2}, so Γ1n1Φ\Gamma\geq 1\vee n^{-1}\left\lVert\Phi\right\rVert_{\infty}. Then we have for these kernels: Φ(x)χkw(x)Θkw(x)\Phi(x)\geq\chi_{k_{w}}(x)\geq\Theta_{k_{w}}(x) for all x𝕏x\in\mathbb{X}. Moreover, denoting by Πs\Pi s the orthogonal projection of ss onto the linear span of (φi)i=1,,p(\varphi_{i})_{i=1,\ldots,p},

skw2=i=1pwi2(Pφi)2Πs2s2s.\displaystyle\left\lVert s_{k_{w}}\right\rVert^{2}=\sum_{i=1}^{p}w_{i}^{2}\left(\left.P\varphi_{i}\right.\right)^{2}\leq\left\lVert\Pi s\right\rVert^{2}\leq\left\lVert s\right\rVert^{2}\leq\left\lVert s\right\rVert_{\infty}\enspace.

Assumption (11) holds for this family if ΥΓ(1+s)\Upsilon\geq\Gamma(1+\left\lVert s\right\rVert_{\infty}). We prove in what follows that all the remaining assumptions are valid using only (1) and (11).
First, it follows from Cauchy-Schwarz inequality that, for any x𝕏x\in\mathbb{X}, χkw(x)2Φ(x)Θkw(x)\chi_{k_{w}}(x)^{2}\leq\Phi(x)\Theta_{k_{w}}(x). Assumption (12) is then automatically satisfied from the definition of Γ\Gamma

𝔼[χkw(X)2]ΦPΘkwΓnPΘkw.\mathbb{E}\left[\left.\chi_{k_{w}}(X)^{2}\right.\right]\leq\left\lVert\Phi\right\rVert_{\infty}P\Theta_{k_{w}}\leq\Gamma nP\Theta_{k_{w}}\enspace.

Now let ww and ww^{\prime} be any two vectors in [0,1]p[0,1]^{p}, we have

skw=i=1pwi(Pφi)φi,skwskw=i=1p(wiwi)(Pφi)φi.s_{k_{w}}=\sum_{i=1}^{p}w_{i}(P\varphi_{i})\varphi_{i},\qquad s_{k_{w}}-s_{k_{w^{\prime}}}=\sum_{i=1}^{p}(w_{i}-w_{i}^{\prime})\left(\left.P\varphi_{i}\right.\right)\varphi_{i}\enspace.

Hence skwskw2=i=1p(wiwi)2(Pφi)2\left\lVert s_{k_{w}}-s_{k_{w^{\prime}}}\right\rVert^{2}=\sum_{i=1}^{p}(w_{i}-w_{i}^{\prime})^{2}\left(\left.P\varphi_{i}\right.\right)^{2} and, by Cauchy-Schwarz inequality, for any x𝕏x\in\mathbb{X},

|skw(x)skw(x)|skwskwΦ(x)skwskwΓn.\left\lvert s_{k_{w}}(x)-s_{k_{w^{\prime}}}(x)\right\rvert\leq\left\lVert s_{k_{w}}-s_{k_{w^{\prime}}}\right\rVert\sqrt{\Phi(x)}\leq\left\lVert s_{k_{w}}-s_{k_{w^{\prime}}}\right\rVert\sqrt{\Gamma n}\enspace.

Assumption (13) follows using (11). Concerning Assumptions (14) and (15), let us first notice that by orthonormality, for any (x,x)𝕏2(x,x^{\prime})\in\mathbb{X}^{2},

Akw(x,x)=i=1pwi2φi(x)φi(x).A_{k_{w}}(x,x^{\prime})=\sum_{i=1}^{p}w_{i}^{2}\varphi_{i}(x)\varphi_{i}(x^{\prime})\enspace.

Therefore, Assumption (15) holds since

𝔼[Akw(X,x)2]\displaystyle\mathbb{E}\left[\left.A_{k_{w}}(X,x)^{2}\right.\right] =𝕏(i=1pwi2φi(y)φi(x))2s(y)𝑑μ(y)\displaystyle=\int_{\mathbb{X}}\left(\left.\sum_{i=1}^{p}w_{i}^{2}\varphi_{i}(y)\varphi_{i}(x)\right.\right)^{2}s(y)d\mu(y)
s1i,jpwi2wj2φi(x)φj(x)𝕏φi(y)φj(y)𝑑μ(y)\displaystyle\leq\left\lVert s\right\rVert_{\infty}\sum_{1\leq i,j\leq p}w_{i}^{2}w_{j}^{2}\varphi_{i}(x)\varphi_{j}(x)\int_{\mathbb{X}}\varphi_{i}(y)\varphi_{j}(y)d\mu(y)
=si=1pwi4φi(x)2sΦ(x)sΓn.\displaystyle=\left\lVert s\right\rVert_{\infty}\sum_{i=1}^{p}w_{i}^{4}\varphi_{i}(x)^{2}\leq\left\lVert s\right\rVert_{\infty}\Phi(x)\leq\left\lVert s\right\rVert_{\infty}\Gamma n\enspace.

Assumption (14) also holds from similar computations:

𝔼[Akw(X,Y)2]\displaystyle\mathbb{E}\left[\left.A_{k_{w}}(X,Y)^{2}\right.\right] =𝕏𝔼[(i=1pwi2φi(X)φi(x))2]s(x)𝑑μ(x)\displaystyle=\int_{\mathbb{X}}\mathbb{E}\left[\left.\left(\left.\sum_{i=1}^{p}w_{i}^{2}\varphi_{i}(X)\varphi_{i}(x)\right.\right)^{2}\right.\right]s(x)d\mu(x)
s1i,jpwi2wj2𝔼[φi(X)φj(X)]𝕏φi(x)φj(x)𝑑μ(x)\displaystyle\leq\left\lVert s\right\rVert_{\infty}\sum_{1\leq i,j\leq p}w_{i}^{2}w_{j}^{2}\mathbb{E}\left[\left.\varphi_{i}(X)\varphi_{j}(X)\right.\right]\int_{\mathbb{X}}\varphi_{i}(x)\varphi_{j}(x)d\mu(x)
sPΘkw.\displaystyle\leq\left\lVert s\right\rVert_{\infty}P\Theta_{k_{w}}\enspace.

We finish with the proof of (16). Let us prove that 𝔹kw=kw\mathbb{B}_{k_{w}}=\mathcal{E}_{k_{w}}, where

kw={t=i=1pwitiφi, s.t. i=1pti21}.\mathcal{E}_{k_{w}}=\left\{\left.t=\sum_{i=1}^{p}w_{i}t_{i}\varphi_{i},\,\mbox{ s.t. }\,\sum_{i=1}^{p}t_{i}^{2}\leq 1\right.\right\}\enspace.

First, notice that any t𝔹kwt\in\mathbb{B}_{k_{w}} can be written

𝕏a(x)kw(x,y)𝑑μ(x)=i=1pwia,φiφi(y).\displaystyle\int_{\mathbb{X}}a(x)k_{w}(x,y)d\mu(x)=\sum_{i=1}^{p}w_{i}\langle a,\varphi_{i}\rangle\varphi_{i}(y)\enspace.

Then, consider some tkwt\in\mathcal{E}_{k_{w}}. By definition, there exists a collection (ti)i=1,,p(t_{i})_{i=1,\ldots,p} such that t=i=1pwitiφit=\sum_{i=1}^{p}w_{i}t_{i}\varphi_{i}, and i=1pti21\sum_{i=1}^{p}t_{i}^{2}\leq 1. If a=i=1ptiφia=\sum_{i=1}^{p}t_{i}\varphi_{i}, a2=i=1pti21\left\lVert a\right\rVert^{2}=\sum_{i=1}^{p}t_{i}^{2}\leq 1 and a,φi=ti\langle a,\varphi_{i}\rangle=t_{i}, hence t𝔹kwt\in\mathbb{B}_{k_{w}}. Conversely, for t𝔹kwt\in\mathbb{B}_{k_{w}}, there exists some function aL2a\in L^{2} such that a21\left\lVert a\right\rVert^{2}\leq 1, and t=i=1pwia,φiφit=\sum_{i=1}^{p}w_{i}\langle a,\varphi_{i}\rangle\varphi_{i}. Since (φi)i=1,,p(\varphi_{i})_{i=1,\ldots,p} is an orthonormal system, one can take a=i=1pa,φiφia=\sum_{i=1}^{p}\langle a,\varphi_{i}\rangle\varphi_{i}. With ti=a,φit_{i}=\langle a,\varphi_{i}\rangle, we find a2=i=1pti2\left\lVert a\right\rVert^{2}=\sum_{i=1}^{p}t_{i}^{2} and tkwt\in\mathcal{E}_{k_{w}}. For any t𝔹kw=kwt\in\mathbb{B}_{k_{w}}=\mathcal{E}_{k_{w}}, t2=i=1pwi2ti2i=1pti21\left\lVert t\right\rVert^{2}=\sum_{i=1}^{p}w_{i}^{2}t_{i}^{2}\leq\sum_{i=1}^{p}t_{i}^{2}\leq 1. Hence Pt2st2s.Pt^{2}\leq\left\lVert s\right\rVert_{\infty}\left\lVert t\right\rVert^{2}\leq\left\lVert s\right\rVert_{\infty}\enspace.

Appendix B Concentration of the residual terms

The following proposition gathers the concentration bounds for the remaining terms appearing in the decomposition (30) used in the proof of Theorem 3.1.

Proposition B.1.

Let {k}k𝒦\left\{\left.k\right.\right\}_{k\in\mathcal{K}} denote a finite collection of kernels satisfying (1) and suppose that Assumptions (11), (12) and (13) hold. Then

θ(0,1),2P(sk^sk)nθssk^2+θssk2+2Υθn.\forall\theta\in(0,1),\qquad 2\frac{P(s_{{\widehat{k}}}-s_{k})}{n}\leq\theta\left\lVert s-s_{{\widehat{k}}}\right\rVert^{2}+\theta\left\lVert s-s_{k}\right\rVert^{2}+\frac{2\Upsilon}{\theta n}\enspace. (32)

For any x1x\geq 1, with probability larger than 12|𝒦|2ex1-2\left\lvert\mathcal{K}\right\rvert^{2}e^{-x}, for any (k,k)𝒦2(k,k^{\prime})\in\mathcal{K}^{2}, for any θ(0,1)\theta\in(0,1),

|2(PnP)(sksk)|θ(ssk2+ssk2)+Υx2θn.\left\lvert 2(P_{n}-P)(s_{k}-s_{k^{\prime}})\right\rvert\leq\theta\left(\left.\left\lVert s-s_{k^{\prime}}\right\rVert^{2}+\left\lVert s-s_{k}\right\rVert^{2}\right.\right)+\square\frac{\Upsilon x^{2}}{\theta n}\enspace. (33)

For any x1x\geq 1, with probability larger than 12|𝒦|ex1-2\left\lvert\mathcal{K}\right\rvert e^{-x}, for any k𝒦k\in\mathcal{K},

θ(0,1),|2(PnP)χk|θPΘk+Υxθ.\forall\theta\in(0,1),\qquad\left\lvert 2(P_{n}-P)\chi_{k}\right\rvert\leq\theta P\Theta_{k}+\square\frac{\Upsilon x}{\theta}\enspace. (34)

For any x1x\geq 1, with probability larger than 15.4|𝒦|ex1-5.4\left\lvert\mathcal{K}\right\rvert e^{-x}, for any k𝒦k\in\mathcal{K},

θ(0,1),2|Uk|n2θPΘkn+Υx2θn.\forall\theta\in(0,1),\qquad\frac{2\left\lvert U_{k}\right\rvert}{n^{2}}\leq\theta\frac{P\Theta_{k}}{n}+\square\frac{\Upsilon x^{2}}{\theta n}\enspace. (35)

Proof. First, for (32), notice that, by (13), for any θ(0,1)\theta\in(0,1),

2P(sk^sk)n\displaystyle 2\frac{P(s_{{\widehat{k}}}-s_{k})}{n} 2sk^skn2n(Υ(θ4nsksk^2+Υθ))\displaystyle\leq 2\frac{\left\lVert s_{{\widehat{k}}}-s_{k}\right\rVert_{\infty}}{n}\leq\frac{2}{n}\left(\left.\Upsilon\vee\left(\left.\frac{\theta}{4}n\left\lVert s_{k}-s_{{\widehat{k}}}\right\rVert^{2}+\frac{\Upsilon}{\theta}\right.\right)\right.\right)
θ2sksk^2+2Υθnθssk^2+θssk2+2Υθn.\displaystyle\leq\frac{\theta}{2}\left\lVert s_{k}-s_{{\widehat{k}}}\right\rVert^{2}+\frac{2\Upsilon}{\theta n}\leq\theta\left\lVert s-s_{{\widehat{k}}}\right\rVert^{2}+\theta\left\lVert s-s_{k}\right\rVert^{2}+\frac{2\Upsilon}{\theta n}\enspace.

Then, by Proposition 2.1, with probability larger than 1|𝒦|2ex1-\left\lvert\mathcal{K}\right\rvert^{2}e^{-x},

for any (k,k)𝒦2,(PnP)(sksk)2P(sksk)2xn+skskx3n.\mbox{for any }(k,k^{\prime})\in\mathcal{K}^{2},\quad(P_{n}-P)(s_{k}-s_{k^{\prime}})\leq\sqrt{\frac{2P\left(\left.s_{k}-s_{k^{\prime}}\right.\right)^{2}x}{n}}+\frac{\left\lVert s_{k}-s_{k^{\prime}}\right\rVert_{\infty}x}{3n}\enspace.

Since by (11) P(sksk)2ssksk2Υsksk2,P\left(\left.s_{k}-s_{k^{\prime}}\right.\right)^{2}\leq\left\lVert s\right\rVert_{\infty}\left\lVert s_{k}-s_{k^{\prime}}\right\rVert^{2}\leq\Upsilon\left\lVert s_{k}-s_{k^{\prime}}\right\rVert^{2}\enspace,

2P(sksk)2xnθ4sksk2+2Υxθn.\sqrt{\frac{2P\left(\left.s_{k}-s_{k^{\prime}}\right.\right)^{2}x}{n}}\leq\frac{\theta}{4}\left\lVert s_{k}-s_{k^{\prime}}\right\rVert^{2}+\frac{2\Upsilon x}{\theta n}\enspace.

Moreover, by (13) skskx3nθ4sksk2+Υx2θn.\frac{\left\lVert s_{k}-s_{k^{\prime}}\right\rVert_{\infty}x}{3n}\leq\frac{\theta}{4}\left\lVert s_{k}-s_{k^{\prime}}\right\rVert^{2}+\square\frac{\Upsilon x^{2}}{\theta n}\enspace. Hence, for x1x\geq 1, with probability larger than 1|𝒦|2ex1-\left\lvert\mathcal{K}\right\rvert^{2}e^{-x}

(PnP)(sksk)\displaystyle(P_{n}-P)(s_{k}-s_{k^{\prime}}) θ2sksk2+Υx2θn\displaystyle\leq\frac{\theta}{2}\left\lVert s_{k}-s_{k^{\prime}}\right\rVert^{2}+\square\frac{\Upsilon x^{2}}{\theta n}
θ(ssk2+ssk2)+Υx2θn,\displaystyle\leq\theta\left(\left.\left\lVert s-s_{k^{\prime}}\right\rVert^{2}+\left\lVert s-s_{k}\right\rVert^{2}\right.\right)+\square\frac{\Upsilon x^{2}}{\theta n}\enspace,

which gives (33). Now, using again Proposition 2.1, with probability larger than 1|𝒦|ex1-\left\lvert\mathcal{K}\right\rvert e^{-x}, for any k𝒦k\in\mathcal{K},

(PnP)χk2P(χk)2xn+χkx3n.(P_{n}-P)\chi_{k}\leq\sqrt{\frac{2P\left(\left.\chi_{k}\right.\right)^{2}x}{n}}+\frac{\left\lVert\chi_{k}\right\rVert_{\infty}x}{3n}\enspace.

By (1) and (11), for any k𝒦k\in\mathcal{K}, χksup(x,y)𝕏2|k(x,y)|ΓnΥn.\left\lVert\chi_{k}\right\rVert_{\infty}\leq\sup_{(x,y)\in\mathbb{X}^{2}}\left\lvert k(x,y)\right\rvert\leq\Gamma n\leq\Upsilon n\enspace.

Concerning (34), we get by (12), Pχk2ΥnPΘkP\chi_{k}^{2}\leq\Upsilon nP\Theta_{k}, hence, for any x1x\geq 1 we have with probability larger than 1|𝒦|ex1-\left\lvert\mathcal{K}\right\rvert e^{-x}

(PnP)χkθPΘk+(13+12θ)Υx.(P_{n}-P)\chi_{k}\leq\theta P\Theta_{k}+\left(\left.\frac{1}{3}+\frac{1}{2\theta}\right.\right)\Upsilon x\enspace.

For (35), we apply Proposition 2.2 to obtain with probability larger than 12.7|𝒦|ex1-2.7\left\lvert\mathcal{K}\right\rvert e^{-x}, for any k𝒦k\in\mathcal{K},

Ukn2n2(Cx+Dx+Bx3/2+Ax2),\displaystyle\frac{U_{k}}{n^{2}}\leq\frac{\square}{n^{2}}\left(\left.C\sqrt{x}+Dx+Bx^{3/2}+Ax^{2}\right.\right)\enspace,

where A,B,C,DA,B,C,D are defined as in Proposition 2.2. Let us evaluate all these terms. First, A4sup(x,y)𝕏2|k(x,y)|4ΥnA\leq 4\sup_{(x,y)\in\mathbb{X}^{2}}\left\lvert k(x,y)\right\rvert\leq 4\Upsilon n by (1) and (11). Next, C2n2𝔼[k(X,Y)2]n2sPΘkn2ΥPΘk.C^{2}\leq\square n^{2}\mathbb{E}\left[\left.k(X,Y)^{2}\right.\right]\leq\square n^{2}\left\lVert s\right\rVert_{\infty}P\Theta_{k}\leq\square n^{2}\Upsilon P\Theta_{k}\enspace.

Using (1), we find B24nsupx𝕏k(x,y)2s(y)𝑑μ(y)4nsΓ.B^{2}\leq 4n\sup_{x\in\mathbb{X}}\int k(x,y)^{2}s(y)d\mu(y)\leq 4n\left\lVert s\right\rVert_{\infty}\Gamma\enspace.

By (11), we consequently have B24ΥnB^{2}\leq 4\Upsilon n. Finally, using Cauchy-Schwarz inequality and proceeding as for C2C^{2},

𝔼[i=1n1j=i+1nai(Xi)bj(Xj)k(Xi,Xj)]n𝔼[k(X,Y)2]nΥPΘk.\mathbb{E}\left[\left.\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}a_{i}(X_{i})b_{j}(X_{j})k(X_{i},X_{j})\right.\right]\leq n\sqrt{\mathbb{E}\left[\left.k(X,Y)^{2}\right.\right]}\leq n\sqrt{\Upsilon P\Theta_{k}}\enspace.

Hence, DnΥPΘkD\leq n\sqrt{\Upsilon P\Theta_{k}} which gives (35).

References

  • [1] S. Arlot and F. Bach. Data-driven calibration of linear estimators with minimal penalties. In Advances in Neural Information Processing Systems 22, pages 46–54, 2009.
  • [Ada06] R. Adamczak. Moment inequalities for UU-statistics. Ann. Probab., 34(6):2288–2314, 2006.
  • [AM09] S. Arlot and P. Massart. Data-driven calibration of penalties for least-squares regression. J. Mach. Learn. Res., 10:245–279, 2009.
  • [Bir06] L. Birgé. Statistical estimation with model selection. Indag. Math. (N.S.), 17(4):497–537, 2006.
  • [BLM13] S. Boucheron, G. Lugosi, and P. Massart. Concentration inequalities. Oxford University Press, Oxford, 2013.
  • [BM07] L. Birgé and P. Massart. Minimal penalties for Gaussian model selection. Probab. Theory Related Fields, 138(1-2):33–73, 2007.
  • [DJKP96] D. L. Donoho, I. M. Johnstone, G. Kerkyacharian, and D. Picard. Density estimation by wavelet thresholding. Ann. Statist., 24(2):508–539, 1996.
  • [DL01] L. Devroye and G. Lugosi. Combinatorial methods in density estimation. Springer Series in Statistics. Springer-Verlag, 2001.
  • [DO13] P. Deheuvels and S. Ouadah. Uniform-in-bandwidth functional limit laws. J. Theoret. Probab., 26(3):697–721, 2013.
  • [EL99a] P. P. B. Eggermont and V. N. LaRiccia. Best asymptotic normality of the kernel density entropy estimator for smooth densities. IEEE Trans. Inform. Theory, 45(4):1321–1326, 1999.
  • [EL99b] P. P. B. Eggermont and V. N. LaRiccia. Optimal convergence rates for good’s nonparametric maximum likelihood density estimator. Ann. Statist., 27(5):1600–1615, 1999.
  • [EL01] P. P. B. Eggermont and V. N. LaRiccia. Maximum penalized likelihood estimation, volume I of Springer Series in Statistics. Springer-Verlag, New York, 2001.
  • [FT06] M. Fromont and C. Tuleau. Functional classification with margin conditions. In Learning theory, volume 4005 of Lecture Notes in Comput. Sci., pages 94–108. Springer, Berlin, 2006.
  • [GL11] A. Goldenshluger and O. Lepski. Bandwidth selection in kernel density estimation: oracle inequalities and adaptive minimax optimality. Ann. Statist., 39(3):1608–1632, 2011.
  • [GLZ00] E. Giné, R. Latała, and J. Zinn. Exponential and moment inequalities for UU-statistics. In High dimensional probability, II, volume 47 of Progr. Probab., pages 13–38. Birkhäuser Boston, 2000.
  • [GN09] E. Giné and R. Nickl. Uniform limit theorems for wavelet density estimators. Ann. Probab., 37(4):1605–1646, 2009.
  • [GN15] E Giné and R. Nickl. Mathematical foundations of infinite-dimensional statistical models. Cambridge University press, 2015.
  • [HRB03] C. Houdré and P. Reynaud-Bouret. Exponential inequalities, with constants, for U-statistics of order two. In Stochastic inequalities and applications, volume 56 of Progr. Probab., pages 55–69. Birkhäuser, Basel, 2003.
  • [Ler12] M. Lerasle. Optimal model selection in density estimation. Ann. Inst. H. Poincaré Probab. Statist., 48(3):884–908, 2012.
  • [Mas07] P. Massart. Concentration inequalities and model selection, volume 1896 of Lecture Notes in Mathematics. Springer, Berlin, 2007. Lectures from the 33rd Summer School in Saint-Flour.
  • [MS11] D. M. Mason and J. W. H. Swanepoel. A general result on the uniform in bandwidth consistency of kernel-type function estimators. TEST, 20(1):72–94, 2011.
  • [MS15] D. M. Mason and J. W. H. Swanepoel. Erratum to: A general result on the uniform in bandwidth consistency of kernel-type function estimators . TEST, 24(1):205–206, 2015.
  • [Rig06] P. Rigollet. Adaptive density estimation using the blockwise Stein method. Bernoulli, 12(2):351–370, 2006.
  • [RT07] P. Rigollet and A. B. Tsybakov. Linear and convex aggregation of density estimators. Math. Methods Statist., 16(3):260–280, 2007.
  • [Tsy09] A. B. Tsybakov. Introduction to nonparametric estimation. Springer Series in Statistics. Springer, New York, 2009.