

Estimator selection in the Gaussian setting

Yannick Baraud (baraud@unice.fr), Christophe Giraud (christophe.giraud@polytechnique.edu) and Sylvie Huet (sylvie.huet@jouy.inra.fr)
Université de Nice Sophia-Antipolis, Ecole Polytechnique and INRA

Université de Nice Sophia-Antipolis,
Laboratoire J-A Dieudonné, UMR CNRS 6621
Parc Valrose
06108, Nice cedex 02
France

Ecole Polytechnique,
CMAP, UMR CNRS 7641
route de Saclay
91128 Palaiseau Cedex
France

INRA MIAJ
78352, Jouy en Josas cedex
France

Abstract

We consider the problem of estimating the mean ff of a Gaussian vector YY with independent components of common unknown variance σ2\sigma^{2}. Our estimation procedure is based on estimator selection. More precisely, we start with an arbitrary and possibly infinite collection 𝔽{\mathbb{F}} of estimators of ff based on YY and, with the same data YY, aim at selecting an estimator among 𝔽{\mathbb{F}} with the smallest Euclidean risk. No assumptions on the estimators are made and their dependencies with respect to YY may be unknown. We establish a non-asymptotic risk bound for the selected estimator. As particular cases, our approach allows us to handle the problems of aggregation and model selection as well as those of choosing a window and a kernel for estimating a regression function, or tuning the parameter involved in a penalized criterion. We also derive oracle-type inequalities when 𝔽{\mathbb{F}} consists of linear estimators. For illustration, we carry out two simulation studies. One aims at comparing our procedure to cross-validation for choosing a tuning parameter. The other shows how to implement our approach to solve the problem of variable selection in practice.

AMS subject classifications: 62J05, 62J07, 62G05, 62G08, 62F07.

Keywords: Gaussian linear regression, Estimator selection, Model selection, Variable selection, Linear estimator, Kernel estimator, Ridge regression, Lasso, Elastic net, Random Forest, PLS1 regression.

1 Introduction

1.1 The setting and the approach

We consider the Gaussian regression framework

Yi=fi+εi,i=1,,nY_{i}=f_{i}+{\varepsilon}_{i},\ i=1,\ldots,n

where f=(f1,,fn)f=(f_{1},\ldots,f_{n}) is an unknown vector of n{\mathbb{R}}^{n} and the εi{\varepsilon}_{i} are independent centered Gaussian random variables with common variance σ2\sigma^{2}. Throughout the paper, σ2\sigma^{2} is assumed to be unknown, which corresponds to the practical case. Our aim is to estimate ff from the observation of YY. For specific forms of ff, this setting allows us to deal simultaneously with the following problems.

Example 1 (Signal denoising).

The vector ff is of the form

f=(F(x1),,F(xn))f=(F(x_{1}),\ldots,F(x_{n}))

where x1,,xnx_{1},\ldots,x_{n} are distinct points of a set 𝒳{\mathcal{X}} and FF is an unknown mapping from 𝒳{\mathcal{X}} into {\mathbb{R}}.

Example 2 (Linear regression).

The vector ff is assumed to be of the form

f=Xβf=X\beta (1)

where XX is an n×pn\times p matrix, β\beta is an unknown pp-dimensional vector and pp some integer larger than 1 (and possibly larger than nn). The columns of the matrix XX are usually called predictors. When pp is large, one may assume that the decomposition (1) is sparse in the sense that only a few βj\beta_{j} are non-zero. Estimating ff or finding the predictors associated with the non-zero coordinates of β\beta are classical issues. The latter is called variable selection.

Our estimation strategy is based on estimator selection. More precisely, we start with an arbitrary collection 𝔽={f^λ,λΛ}{\mathbb{F}}=\{\widehat{f}_{\lambda},\ \lambda\in\Lambda\} of estimators of ff based on YY and aim at selecting the one with the smallest Euclidean risk by using the same observation YY. The way the estimators f^λ\widehat{f}_{\lambda} depend on YY may be arbitrary and possibly unknown. For example, the f^λ\widehat{f}_{\lambda} may be obtained from the minimization of a criterion, a Bayesian procedure or the guess of some experts.

1.2 The motivation

The problem of choosing some best estimator among a family of candidate ones is central in Statistics. Let us present some examples.

Example 3 (Choosing a tuning parameter).

Many statistical procedures depend on a (possibly multi-dimensional) parameter λ\lambda that needs to be tuned in view of obtaining an estimator with the best possible performance. For example, in the context of linear regression as described in Example 2, the Lasso estimator (see Tibshirani (1996) and Chen et al. (1998)) defined by f^λ=Xβλ^\widehat{f}_{\lambda}=X\widehat{\beta_{\lambda}} with

βλ^=argminβp[YXβ2+λj=1p|βj|]\widehat{\beta_{\lambda}}=\arg\!\min_{\beta\in{\mathbb{R}}^{p}}\left[{\left\|{Y-X\beta}\right\|^{2}+\lambda\sum_{j=1}^{p}\left|{\beta_{j}}\right|}\right]

depends on the choice of the parameter λ0\lambda\geq 0. Selecting this parameter among a grid Λ+\Lambda\subset{\mathbb{R}}_{+} amounts to selecting a (suitable) estimator among the family 𝔽={f^λ,λΛ}{\mathbb{F}}=\{\widehat{f}_{\lambda},\ \lambda\in\Lambda\}.
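For concreteness, here is a minimal R sketch, not taken from the paper or its R package, of how such a family of Lasso estimators can be produced over a grid; it assumes a design matrix X and a response vector Y are available and uses the glmnet package, whose internal scaling of the l1 penalty differs from the display above by a constant factor.

## Minimal sketch (assumed inputs X, Y): one Lasso fit f_hat_lambda = X beta_hat_lambda
## per value of the tuning parameter. Note that glmnet scales the l1 penalty by 1/n,
## so its lambda grid plays the role of Lambda only up to a constant factor.
library(glmnet)

Lambda <- 10^seq(2, -3, length.out = 50)
fit    <- glmnet(X, Y, family = "gaussian", alpha = 1, lambda = Lambda,
                 intercept = FALSE, standardize = FALSE)

## Columns of F_hat are the candidate estimators f_hat_lambda, lambda in Lambda
F_hat <- predict(fit, newx = X)

Choosing a column of F_hat is then exactly the estimator selection problem addressed in this paper.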

Another dilemma for Statisticians is the choice of a procedure to solve a given problem. In the context of Example 3, there exist many competitors to the Lasso estimator and one may alternatively choose a procedure based on ridge regression (see Hoerl and Kennard (1970)), random forest or PLS (see Tenenhaus (1998), Helland (2001) and Helland (2006)). Similarly, for the problem of signal denoising as described in Example 1, popular approaches include spline smoothing, wavelet decompositions and kernel estimators. The choice of a kernel can be tricky.

Example 4 (Choosing a kernel).

Consider the problem described in Example 1 with 𝒳={\mathcal{X}}={\mathbb{R}}. For a kernel KK and a bandwidth h>0h>0, the Nadaraya-Watson estimator (see Nadaraya (1964) and Watson (1964)) f^K,hn\widehat{f}_{K,h}\in{\mathbb{R}}^{n} is defined as

f^K,h=(F^K,h(x1),,F^K,h(xn))\widehat{f}_{K,h}=\left({\widehat{F}_{K,h}(x_{1}),\ldots,\widehat{F}_{K,h}(x_{n})}\right)

where for xx\in{\mathbb{R}}

F^K,h(x)=j=1nK(xxjh)Yjj=1nK(xxjh).\widehat{F}_{K,h}(x)={\sum_{j=1}^{n}K\left({x-x_{j}\over h}\right)Y_{j}\over\sum_{j=1}^{n}K\left({x-x_{j}\over h}\right)}.

There exist many possible choices for the kernel KK, such as the Gaussian kernel K(x)=ex2/2K(x)=e^{-x^{2}/2}, the uniform kernel K(x)=𝟏|x|<1K(x)=\mathbf{1}_{|x|<1}, etc. Given a (finite) family 𝒦\mathcal{K} of candidate kernels KK and a grid +\mathcal{H}\subset{\mathbb{R}}_{+}^{*} of possible values of hh, one may consider the problem of selecting the best kernel estimator among the family 𝔽={f^λ,λ=(K,h)𝒦×}{\mathbb{F}}=\{\widehat{f}_{\lambda},\ \lambda=(K,h)\in\mathcal{K}\times\mathcal{H}\}.
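A short R sketch of this family of kernel estimators follows; the design points x and the observations Y are assumed to be available, and the kernels and bandwidths below are illustrative choices rather than prescriptions from the paper.

## Sketch: Nadaraya-Watson fits f_hat_{K,h} at the design points x_1,...,x_n
## for a few candidate kernels and bandwidths (x and Y assumed available).
kernels <- list(gaussian = function(u) exp(-u^2 / 2),
                uniform  = function(u) as.numeric(abs(u) < 1))
bandwidths <- c(0.05, 0.1, 0.2, 0.5)

nw_fit <- function(K, h, x, Y) {
  W <- outer(x, x, function(a, b) K((a - b) / h))   # W[i, j] = K((x_i - x_j)/h)
  as.vector(W %*% Y / rowSums(W))                   # F_hat_{K,h}(x_i), i = 1,...,n
}

## One estimator per pair lambda = (K, h)
F_family <- lapply(kernels, function(K)
  sapply(bandwidths, function(h) nw_fit(K, h, x, Y)))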

1.3 A look at the literature

A common way to address the above issues is to use some cross-validation scheme such as leave-one-out or VV-fold. Even though these resampling techniques are widely used in practice, little is known about their theoretical performance. For more details, we refer to Arlot and Celisse (2010) for a survey on cross-validation techniques applied to model selection. Compared to these approaches, as we shall see, the procedure we propose is less time consuming and easier to implement. Moreover, it does not require knowing how the estimators depend on the data YY and we can therefore handle the following problem.

Example 5 (Selecting among mute experts).

A Statistician is given a collection 𝔽={f^λ,λΛ}{\mathbb{F}}=\{\widehat{f}_{\lambda},\ \lambda\in\Lambda\} of estimators from a family Λ\Lambda of experts λ\lambda, each of whom keeps secret the way his/her estimator f^λ\widehat{f}_{\lambda} depends on the observation YY. The problem is then to find which expert λ\lambda is the closest to the truth.

Given a selection rule among 𝔽{\mathbb{F}}, an important issue is to compare the risk of the selected estimator to those of the candidate ones. Results in this direction are available in the context of model selection, which can be seen as a particular case of estimator selection. More precisely, for the purpose of selecting a suitable model one starts with a collection SS\SS of models, typically linear spaces chosen for their approximation properties with respect to ff, and one associates to each model SSSS\in\SS a suitable estimator f^S\widehat{f}_{S} with values in SS. Selecting a model then amounts to selecting an estimator among the collection 𝔽={f^S,SSS}{\mathbb{F}}=\{\widehat{f}_{S},\ S\in\SS\}. For this problem, selection rules based on the minimization of a penalized criterion have been proposed in the regression setting by Yang (1999), Baraud (2000), Birgé and Massart (2001) and Baraud et al (2009). Another way, usually called Lepski’s method, appears in a series of papers by Lepski (1990; 1991) and was originally designed to perform model selection among collections of nested models. Finally, we mention that other procedures based on resampling have interestingly emerged from the work of Arlot (2007; 2009) and Célisse (2008). A common feature of those approaches lies in the fact that the proposed selection rules apply to specific collections of estimators only.

An alternative to estimator selection is aggregation which aims at designing a suitable combination of given estimators in order to outperform each of these separately (and even the best combination of these) up to a remaining term. Aggregation techniques can be found in Catoni (1997; 2004), Juditsky and Nemirovski (2000), Nemirovski (2000), Yang (2000; 2001), Tsybakov (2003), Wegkamp (2003), Birgé (2006), Rigollet and Tsybakov (2007), Bunea, Tsybakov and Wegkamp (2007) and Goldenshluger (2009) for 𝕃p{\mathbb{L}}_{p}-losses. Most of the aggregation procedures are based on sample splitting, one part of the data being used for building the estimators, the remaining part for selecting among these. Such a device requires that the observations be i.i.d. or at least that one has at one's disposal two independent copies of the data. From this point of view our procedure differs from classical aggregation procedures since we use the whole data YY to build and select. In the Gaussian regression setting that is considered here, we mention the results of Leung and Barron (2006) for the problem of mixing least-squares estimators. Their procedure uses the same data YY to estimate and to aggregate but requires the variance to be known. Giraud (2008) extends their results to the case where it is unknown.

1.4 What is new here?

Our approach for solving the problem of estimator selection is new. We introduce a collection SS\SS of linear subspaces of n{\mathbb{R}}^{n} for approximating the estimators in 𝔽{\mathbb{F}} and use a penalized criterion to compare them. As already mentioned and as we shall see, this approach requires no assumption on the family of estimators at hand and is easy to implement, an R package being available at

http://w3.jouy.inra.fr/unites/miaj/public/perso/SylvieHuet_en.html.

A general way of comparing estimators in various statistical settings has been described in Baraud (2010). However, the procedure proposed there is mainly abstract and inadequate in the Gaussian framework we consider.

We prove a non-asymptotic risk bound for the estimator we select and show that this bound is optimal in the sense that it essentially cannot be improved (except for numerical constants maybe) by any other selection rule. For the sake of illustration and comparison, we apply our procedure to various problems among which aggregation, model selection, variable selection and selection among linear estimators. In each of these cases, our approach allows us to recover classical results in these areas as well as to establish new ones. In the context of aggregation we compute the aggregation rates for the unknown variance case. These rates turn out to be the same as those for the known variance case. For selecting an estimator among a family of linear ones, we propose a new procedure and establish a risk bound which requires almost no assumption on the considered family. Finally, our approach provides a way of selecting a suitable variable selection procedure among a family of candidate ones. It thus provides an alternative to cross-validation for which little is known.

The paper is organized as follows. In Section 2 we present our selection rule and the theoretical properties of the resulting estimator. For illustration, we show in Sections 3, 4 and 5 respectively, how the procedure can be used to aggregate preliminary estimators, select a linear estimator among a finite collection of candidate ones, or solve the problem of variable selection. Section 6 is devoted to two simulation studies. One aims at comparing the performance of our procedure to classical VV-fold cross-validation in view of selecting a tuning parameter among a grid. In the other, we compare the performance of the variable selection procedure we propose with that of some classical ones such as the Lasso, random forest, and others based on ridge and PLS regression. Finally, the proofs are postponed to Section 7.

Throughout the paper CC denotes a constant that may vary from line to line.

2 The procedure and the main result

2.1 The procedure

Given a collection 𝔽={f^λ,λΛ}{\mathbb{F}}=\{\widehat{f}_{\lambda},\lambda\in\Lambda\} of estimators of ff based on YY, the selection rule we propose is based on the choices of a family SS\SS of linear subspaces of n{\mathbb{R}}^{n}, a collection {SSλ,λΛ}\{\SS_{\lambda},\ \lambda\in\Lambda\} of (possibly random) subsets of SS\SS, a weight function Δ\Delta and a penalty function pen\mathop{\rm pen}\nolimits, both from SS\SS into +{\mathbb{R}}_{+}. We introduce those objects below and refer to Sections 3, 4 and 5 for examples.

2.1.1 The collection of estimators 𝔽{\mathbb{F}}

The collection 𝔽={f^λ,λΛ}{\mathbb{F}}=\{\widehat{f}_{\lambda},\lambda\in\Lambda\} can be arbitrary. In particular, 𝔽{\mathbb{F}} need not be finite nor countable and it may consist of a mix of estimators based on the minimization of a criterion, a Bayes procedure or the guess of some experts. The dependency of these estimators with respect to YY need not be known. Nevertheless, we shall see on examples how we can use this information, when available, to improve the performance of our estimation procedure.

2.1.2 The families SS\SS and SSλ\SS_{\lambda}

Let SS\SS be a family of linear spaces of n{\mathbb{R}}^{n} satisfying the following.

Assumption 1.

The family SS\SS is finite or countable and for all SSSS\in\SS, dim(S)n2\dim(S)\leq n-2.

To each estimator f^λ𝔽\widehat{f}_{\lambda}\in{\mathbb{F}}, we associate a (possibly random) subset SSλSS\SS_{\lambda}\subset\SS.

Typically, the family SS\SS should be chosen to possess good approximation properties with respect to the elements of 𝔽{\mathbb{F}} and SSλ\SS_{\lambda} with respect to f^λ\widehat{f}_{\lambda} specifically. One may take SSλ=SS\SS_{\lambda}=\SS but for computational reasons it will be convenient to allow SSλ\SS_{\lambda} to be smaller. The choices of SSλ\SS_{\lambda} may be made on the basis of the observation f^λ\widehat{f}_{\lambda}. We provide examples of SS\SS and SSλ\SS_{\lambda} in various statistical settings described in Sections 3 to 5.

2.1.3 The weight function Δ\Delta and the associated function penΔ{\mathop{\rm pen}_{\Delta}\nolimits}

We consider a function Δ\Delta from SS\SS into +{\mathbb{R}}_{+} and assume

Assumption 2.
Σ=SSSeΔ(S)<+.\Sigma=\sum_{S\in\SS}e^{-\Delta(S)}<+\infty. (2)

Whenever SS\SS is finite, inequality (2) automatically holds true. However, in practice Σ\Sigma should be kept to a reasonable size. When Σ=1\Sigma=1, eΔ(.)e^{-\Delta(.)} can be interpreted as a prior distribution on SS\SS and gives thus a Bayesian flavor to the procedure we propose. To the weight function Δ\Delta, we associate the function penΔ{\mathop{\rm pen}_{\Delta}\nolimits} mapping SS\SS into +{\mathbb{R}}_{+} and defined by

𝔼[(UpenΔ(S)ndim(S)V)+]=eΔ(S){\mathbb{E}}\left[{\left({U-{{\mathop{\rm pen}_{\Delta}\nolimits}(S)\over n-\dim(S)}V}\right)_{+}}\right]=e^{-\Delta(S)} (3)

where x+x_{+} denotes the positive part of xx\in{\mathbb{R}} and U,VU,V are two independent χ2\chi^{2} random variables with respectively dim(S)+1\dim(S)+1 and ndim(S)1n-\dim(S)-1 degrees of freedom. This function can be easily computed from the quantiles of the Fisher distribution as we shall see in Section 8.1. From a more theoretical point of view, it is shown in Baraud et al (2009) that under Assumption 3 below, there exists a positive constant CC (depending on κ\kappa only) such that

penΔ(S)C(dim(S)Δ(S)).{\mathop{\rm pen}_{\Delta}\nolimits}(S)\leq C(\dim(S)\vee\Delta(S)). (4)
Assumption 3.

There exists κ(0,1)\kappa\in(0,1) such that for all SSSS\in\SS,

1dim(S)Δ(S)κn.1\leq\dim(S)\vee\Delta(S)\leq\kappa n.
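For illustration, the function penΔ defined by (3) can be evaluated numerically; the R sketch below does so by Monte Carlo and root-finding rather than through the Fisher quantiles mentioned above, and is only meant to make the definition concrete (it is not the recipe of Section 8.1).

## Numerical sketch: solve E[(U - q V)_+] = exp(-Delta) in q, with U ~ chi^2(D+1)
## and V ~ chi^2(n-D-1) independent, and return pen_Delta(S) = q * (n - D), D = dim(S).
pen_Delta <- function(D, Delta, n, nsim = 1e5) {
  set.seed(1)                        # fix the draws so the function of q is smooth
  U <- rchisq(nsim, df = D + 1)
  V <- rchisq(nsim, df = n - D - 1)
  g <- function(q) mean(pmax(U - q * V, 0)) - exp(-Delta)
  q <- uniroot(g, lower = 0, upper = 100, extendInt = "downX")$root
  q * (n - D)
}

## Example: pen_Delta for a 5-dimensional space with weight Delta = 5 and n = 100
pen_Delta(D = 5, Delta = 5, n = 100)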

2.1.4 The selection criterion

The selection procedure we propose involves a penalty function pen\mathop{\rm pen}\nolimits from SS\SS into +{\mathbb{R}}_{+} with the following property.

Assumption 4.

The penalty function pen\mathop{\rm pen}\nolimits satisfies for some K>1K>1,

pen(S)KpenΔ(S)forallSSS.\mathop{\rm pen}\nolimits(S)\geq K{\mathop{\rm pen}_{\Delta}\nolimits}(S)\ \ \ {\rm for\ all}\ \ S\in\SS. (5)

Whenever equality holds in (5), it follows from (4) that pen(S)\mathop{\rm pen}\nolimits(S) measures the complexity of the model SS in terms of dimension and weight.

Denoting ΠS\Pi_{S} the projection operator onto a linear space SnS\subset{\mathbb{R}}^{n}, given the families SSλ\SS_{\lambda}, the penalty function pen\mathop{\rm pen}\nolimits and some positive number α\alpha, we define

critα(f^λ)=infSSSλ[YΠSf^λ2+αf^λΠSf^λ2+pen(S)σ^S2],{\rm crit}_{\alpha}(\widehat{f}_{\lambda})=\inf_{S\in\SS_{\lambda}}\left[{\left\|{Y-\Pi_{S}\widehat{f}_{\lambda}}\right\|^{2}+\alpha\left\|{\widehat{f}_{\lambda}-\Pi_{S}\widehat{f}_{\lambda}}\right\|^{2}+\mathop{\rm pen}\nolimits(S)\,\widehat{\sigma}^{2}_{S}}\right], (6)

where

σ^S2=YΠSY2ndim(S).\widehat{\sigma}^{2}_{S}={\left\|{Y-\Pi_{S}Y}\right\|^{2}\over n-\dim(S)}. (7)
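For illustration, here is a minimal R sketch of the criterion (6)-(7) and of the resulting selection rule; it is not the released R package. Each space of SS_lambda is assumed to be given through a matrix whose columns form an orthonormal basis of it together with its weight, pen_Delta refers to the sketch given after Assumption 3, and the penalty is taken equal to K times pen_Delta.

## Sketch of the selection rule (6)-(7): each candidate S in SS_lambda is given as a
## matrix B whose columns form an orthonormal basis of S (NULL for S = {0}), with
## weight Delta(S). Uses pen_Delta() from the previous sketch.
crit_alpha <- function(Y, f_hat, bases, Deltas, alpha = 0.5, K = 1.1) {
  n <- length(Y)
  vals <- mapply(function(B, Delta) {
    D <- if (is.null(B)) 0 else ncol(B)
    proj <- function(v) if (is.null(B)) 0 * v else B %*% crossprod(B, v)
    sigma2_S <- sum((Y - proj(Y))^2) / (n - D)
    sum((Y - proj(f_hat))^2) + alpha * sum((f_hat - proj(f_hat))^2) +
      K * pen_Delta(D, Delta, n) * sigma2_S
  }, bases, Deltas)
  min(vals)
}

## Selection: lambda_hat minimizes crit_alpha over the family of estimators
## (F_list is a list of estimators, SS_list[[l]] their approximation spaces)
## crit <- sapply(seq_along(F_list), function(l)
##           crit_alpha(Y, F_list[[l]], SS_list[[l]]$bases, SS_list[[l]]$Deltas))
## f_selected <- F_list[[which.min(crit)]]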

2.2 The main result

For all λΛ\lambda\in\Lambda let us set

A(f^λ,SSλ)=infSSSλ[f^λΠSf^λ2+pen(S)σ^S2].A(\widehat{f}_{\lambda},\SS_{\lambda})=\inf_{S\in\SS_{\lambda}}\left[{\left\|{\widehat{f}_{\lambda}-\Pi_{S}\widehat{f}_{\lambda}}\right\|^{2}+\mathop{\rm pen}\nolimits(S)\,\widehat{\sigma}^{2}_{S}}\right]. (8)

This quantity corresponds to an accuracy index for the estimator f^λ\widehat{f}_{\lambda} with respect to the family SSλ\SS_{\lambda}. The following result holds.

Theorem 1.

Let K>1,α>0,δ0K>1,\alpha>0,\delta\geq 0. Assume that Assumptions 12 and 4 hold. There exists a constant CC (given by (34)) depending on KK and α\alpha only such that for any f^λ^\widehat{f}_{\widehat{\lambda}} in 𝔽{\mathbb{F}} satisfying

critα(f^λ^)infλΛcritα(f^λ)+δ,{\rm crit}_{\alpha}(\widehat{f}_{\widehat{\lambda}})\leq\inf_{\lambda\in\Lambda}{\rm crit}_{\alpha}(\widehat{f}_{\lambda})+\delta, (9)

we have the following bounds

C𝔼[ff^λ^2]\displaystyle C{\mathbb{E}}\left[{\left\|{f-\widehat{f}_{\widehat{\lambda}}}\right\|^{2}}\right] \displaystyle\leq 𝔼[infλΛ[ff^λ2+A(f^λ,SSλ)]]+Σσ2+δ\displaystyle{\mathbb{E}}\left[{\inf_{\lambda\in\Lambda}\left[{\left\|{f-\widehat{f}_{\lambda}}\right\|^{2}+A(\widehat{f}_{\lambda},\SS_{\lambda})}\right]}\right]+\Sigma\sigma^{2}+\delta (10)
\displaystyle\leq infλΛ{𝔼[ff^λ2]+𝔼[A(f^λ,SSλ)]}+Σσ2+δ\displaystyle\inf_{\lambda\in\Lambda}\left\{{{\mathbb{E}}\left[{\left\|{f-\widehat{f}_{\lambda}}\right\|^{2}}\right]+{\mathbb{E}}\left[{A(\widehat{f}_{\lambda},\SS_{\lambda})}\right]}\right\}+\Sigma\sigma^{2}+\delta\;\; (11)

(provided that the quantity involved in the expectation in (10) is measurable). Furthermore, if equality holds in (5) and Assumption 3 is satisfied, for each λΛ\lambda\in\Lambda

  • if the set SSλ\SS_{\lambda} is non-random,

C𝔼[A(f^λ,SSλ)]\displaystyle C^{\prime}{\mathbb{E}}\left[{A(\widehat{f}_{\lambda},\SS_{\lambda})}\right] (12)
\displaystyle\leq 𝔼[ff^λ2]+infSSSλ[𝔼[f^λΠSf^λ2]+(dim(S)Δ(S))σ2]\displaystyle\!\!\!\!{\mathbb{E}}\left[{\left\|{f-\widehat{f}_{\lambda}}\right\|^{2}}\right]+\inf_{S\in\SS_{\lambda}}\left[{{\mathbb{E}}\left[{\left\|{\widehat{f}_{\lambda}-\Pi_{S}\widehat{f}_{\lambda}}\right\|^{2}}\right]+(\dim(S)\vee\Delta(S))\sigma^{2}}\right]
  • if there exists a (possibly random) linear space S^λSSλ\widehat{S}_{\lambda}\in\SS_{\lambda} such that f^λS^λ\widehat{f}_{\lambda}\in\widehat{S}_{\lambda} with probability 1,

C𝔼[A(f^λ,SSλ)]𝔼[ff^λ2]+𝔼[dim(S^λ)Δ(S^λ)]σ2,C^{\prime}{\mathbb{E}}\left[{A(\widehat{f}_{\lambda},\SS_{\lambda})}\right]\leq{\mathbb{E}}\left[{\left\|{f-\widehat{f}_{\lambda}}\right\|^{2}}\right]+{\mathbb{E}}\left[{\dim(\widehat{S}_{\lambda})\vee\Delta(\widehat{S}_{\lambda})}\right]\sigma^{2}, (13)

where CC^{\prime} is a positive constant only depending on κ\kappa and KK.

Let us now comment on Theorem 1.

It turns out that inequality (10) leaves no place for a substantial improvement in the sense that the bound we get is essentially optimal and cannot be improved (apart from constants) by any other selection rule among 𝔽{\mathbb{F}}. To see this, let us assume for simplicity that 𝔽{\mathbb{F}} is finite so that a measurable minimizer of critα{\rm crit}_{\alpha} always exists and δ\delta can be chosen as 0. Let K=1.1K=1.1, α=1/2\alpha=1/2 (to fix up the ideas), SS\SS a family of linear spaces satisfying the assumptions of Theorem 1 and pen\mathop{\rm pen}\nolimits, the penalty function achieving equality in (5). Besides, assume that SS\SS contains a linear space SS such that 1dim(S)n/21\leq\dim(S)\leq n/2 and associate to SS the weight Δ(S)=dim(S)\Delta(S)=\dim(S). If SSλ=SS\SS_{\lambda}=\SS for all λ\lambda, we deduce from (4) and (10) that for some universal constant CC^{\prime}, whatever 𝔽{\mathbb{F}} and fnf\in{\mathbb{R}}^{n}

C𝔼[ff^λ^2]\displaystyle C^{\prime}{\mathbb{E}}\left[{\left\|{f-\widehat{f}_{\widehat{\lambda}}}\right\|^{2}}\right] (14)
\displaystyle\leq 𝔼[infλΛ[ff^λ2+infSSS(f^λΠSf^λ2+pen(S)σ^S2)]]\displaystyle{\mathbb{E}}\left[{\inf_{\lambda\in\Lambda}\left[{\left\|{f-\widehat{f}_{\lambda}}\right\|^{2}+\inf_{S\in\SS}\left({\left\|{\widehat{f}_{\lambda}-\Pi_{S}\widehat{f}_{\lambda}}\right\|^{2}+\mathop{\rm pen}\nolimits(S)\widehat{\sigma}^{2}_{S}}\right)}\right]}\right]
\displaystyle\leq 𝔼[infλΛ[ff^λ2+f^λΠSf^λ2+dim(S)σ^S2]].\displaystyle{\mathbb{E}}\left[{\inf_{\lambda\in\Lambda}\left[{\left\|{f-\widehat{f}_{\lambda}}\right\|^{2}+\left\|{\widehat{f}_{\lambda}-\Pi_{S}\widehat{f}_{\lambda}}\right\|^{2}+\dim(S)\widehat{\sigma}^{2}_{S}}\right]}\right].

In the opposite direction, the following result holds.

Proposition 1.

There exists a universal constant CC, such that for any finite family 𝔽={f^λ,λΛ}{\mathbb{F}}=\{\widehat{f}_{\lambda},\ \lambda\in\Lambda\} of estimators and any selection rule λ~\widetilde{\lambda} based on YY among Λ\Lambda, there exists fSf\in S such that

C𝔼[ff^λ~2]𝔼[infλΛ[ff^λ2+f^λΠSf^λ2+dim(S)σ2]].C{\mathbb{E}}\left[{\left\|{f-\widehat{f}_{\widetilde{\lambda}}}\right\|^{2}}\right]\geq{\mathbb{E}}\left[{\inf_{\lambda\in\Lambda}\left[{\left\|{f-\widehat{f}_{\lambda}}\right\|^{2}+\left\|{\widehat{f}_{\lambda}-\Pi_{S}\widehat{f}_{\lambda}}\right\|^{2}+\dim(S)\sigma^{2}}\right]}\right]. (15)

We see that, up to the estimator σ^S2\widehat{\sigma}_{S}^{2} in place of σ2\sigma^{2} and numerical constants, the left-hand sides of (14) and (15) coincide.

In view of commenting (11) further, we continue assuming that 𝔽{\mathbb{F}} is finite so that we can keep δ=0\delta=0 in (11). A particular feature of (11) lies in the fact that the risk bound pays no price for considering a large collection 𝔽{\mathbb{F}} of estimators. In fact, it is non-increasing with respect to 𝔽{\mathbb{F}} (or equivalently Λ\Lambda) for inclusion. This means that if one adds a new estimator to the collection 𝔽{\mathbb{F}} (without changing either SS\SS or the families SSλ\SS_{\lambda} associated to the former estimators), the risk bound for f^λ^\widehat{f}_{\widehat{\lambda}} can only be improved. In contrast, the computation of the estimator f^λ^\widehat{f}_{\widehat{\lambda}} becomes all the more difficult as |𝔽|\left|{{\mathbb{F}}}\right| grows. More precisely, if the cardinalities of the families SSλ\SS_{\lambda} are not too large, the computation of f^λ^\widehat{f}_{\widehat{\lambda}} requires around |𝔽|\left|{{\mathbb{F}}}\right| steps.

The selection rule we use does not require knowing how the estimators depend on YY. In fact, as we shall see, a more important piece of information is the ranges of the estimators f^λ=f^λ(Y)\widehat{f}_{\lambda}=\widehat{f}_{\lambda}(Y) as YY varies in n{\mathbb{R}}^{n}. A situation of special interest occurs when each f^λ\widehat{f}_{\lambda} belongs to some (possibly random) linear space S^λ\widehat{S}_{\lambda} in SS\SS with probability one. By taking SSλ\SS_{\lambda} such that S^λSSλ\widehat{S}_{\lambda}\in\SS_{\lambda} for all λ\lambda, we deduce from Theorem 1 by using (11) and (13) the following corollary.

Corollary 1.

Assume that the Assumptions of Theorem 1 are satisfied, that Assumption 3 holds and that equality holds in (5). If for all λΛ\lambda\in\Lambda there exists a (possibly random) linear space S^λSSλ\widehat{S}_{\lambda}\in\SS_{\lambda} such that f^λS^λ\widehat{f}_{\lambda}\in\widehat{S}_{\lambda} with probability 1, then f^λ^\widehat{f}_{\widehat{\lambda}} satisfies

C𝔼[ff^λ^2]infλΛ[𝔼[ff^λ2]+𝔼[dim(S^λ)Δ(S^λ)]σ2]+δ,C{\mathbb{E}}\left[{\left\|{f-\widehat{f}_{\widehat{\lambda}}}\right\|^{2}}\right]\leq\inf_{\lambda\in\Lambda}\left[{{\mathbb{E}}\left[{\left\|{f-\widehat{f}_{\lambda}}\right\|^{2}}\right]+{\mathbb{E}}\left[{\dim(\widehat{S}_{\lambda})\vee\Delta(\widehat{S}_{\lambda})}\right]\sigma^{2}}\right]+\delta, (16)

for some CC depending on KK and κ\kappa only.

One may apply this result in the context of model selection. One starts with a collection of models SS={Sm,m}\SS=\left\{{S_{m},\ m\in\mathcal{M}}\right\} and associates to each SmS_{m} an estimator f^m\widehat{f}_{m} with values in SmS_{m}. By taking 𝔽={f^m,m}{\mathbb{F}}=\{\widehat{f}_{m},\ m\in\mathcal{M}\} (here Λ=\Lambda=\mathcal{M}) and SSm={Sm}\SS_{m}=\left\{{S_{m}}\right\} for all mm\in\mathcal{M}, our selection procedure leads to an estimator f^m^\widehat{f}_{\widehat{m}} which satisfies

C𝔼[ff^m^2]infm[𝔼[ff^m2]+(dim(Sm)Δ(Sm))σ2].C{\mathbb{E}}\left[{\left\|{f-\widehat{f}_{\widehat{m}}}\right\|^{2}}\right]\leq\inf_{m\in\mathcal{M}}\left[{{\mathbb{E}}\left[{\left\|{f-\widehat{f}_{m}}\right\|^{2}}\right]+\left({\dim(S_{m})\vee\Delta(S_{m})}\right)\sigma^{2}}\right]. (17)

When f^m=ΠSmY\widehat{f}_{m}=\Pi_{S_{m}}Y for all mm\in\mathcal{M}, our selection rule becomes

m^=argminm[Yf^m2+pen(Sm)σ^Sm2]\widehat{m}=\arg\min_{m\in\mathcal{M}}\left[{\left\|{Y-\widehat{f}_{m}}\right\|^{2}+\mathop{\rm pen}\nolimits(S_{m})\,\widehat{\sigma}^{2}_{S_{m}}}\right] (18)

and turns out to coincide with that described in Baraud et al (2009). Interestingly, Corollary 1 shows that this selection rule can still be used for families 𝔽{\mathbb{F}} of (non-linear) estimators of the form ΠSm^Y\Pi_{S_{\widehat{m}}}Y where the Sm^S_{\widehat{m}} are chosen randomly among SS\SS on the basis of YY, thus doing as if the linear spaces Sm^S_{\widehat{m}} were non-random. An estimator of the form ΠSm^Y\Pi_{S_{\widehat{m}}}Y can be interpreted as resulting from a model selection procedure among the family of projection estimators {ΠmY,m}\{\Pi_{m}Y,\ m\in\mathcal{M}\} and hence, (18) can be used to choose some best model selection rule among a collection of candidate ones.
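For projection estimators, the rule (18) takes the following form in R (again an illustration relying on the pen_Delta sketch above, not the authors' implementation); bases[[m]] holds an orthonormal basis of the model S_m and Deltas[m] its weight.

## Sketch of the model selection rule (18) for projection estimators f_hat_m = Pi_{S_m} Y:
## ||Y - Pi_{S_m} Y||^2 + pen(S_m) * sigma_hat^2_{S_m}
##   = ||Y - Pi_{S_m} Y||^2 * (1 + pen(S_m) / (n - dim(S_m))).
select_model <- function(Y, bases, Deltas, K = 1.1) {
  n <- length(Y)
  crit <- mapply(function(B, Delta) {
    D    <- ncol(B)
    fhat <- B %*% crossprod(B, Y)                 # projection of Y onto S_m
    sum((Y - fhat)^2) * (1 + K * pen_Delta(D, Delta, n) / (n - D))
  }, bases, Deltas)
  which.min(crit)                                 # index of the selected model
}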

3 Aggregation

In this section, we consider the problems of Model Selection Aggregation (MS), Convex Aggregation (Cv{\mathrm{Cv}}) and Linear Aggregation (L) defined below. Given M2M\geq 2 preliminary estimators of ff, denoted {ϕk,k=1,,M}\left\{{\phi_{k},\ k=1,\ldots,M}\right\}, our aim is to build an estimator f^\widehat{f} based on YY whose risk is as close as possible to infg𝔽Λfg2\inf_{g\in{\mathbb{F}}_{\Lambda}}\left\|{f-g}\right\|^{2} where

𝔽Λ={fλ=j=1Mλjϕj,λΛ}{\mathbb{F}}_{\Lambda}=\left\{{f_{\lambda}=\sum_{j=1}^{M}\lambda_{j}\phi_{j},\ \lambda\in\Lambda}\right\}

and, according to the aggregation problem at hand, Λ\Lambda is one of the three sets

ΛMS={λ{0,1}M,j=1Mλj=1},ΛCv={λ+M,j=1Mλj=1},ΛL=M.\Lambda_{{\rm MS}}=\left\{{\lambda\in\{0,1\}^{M},\ \sum_{j=1}^{M}\lambda_{j}=1}\right\},\ \Lambda_{{\mathrm{Cv}}}=\left\{{\lambda\in{\mathbb{R}}_{+}^{M},\ \sum_{j=1}^{M}\lambda_{j}=1}\right\},\ \Lambda_{{\rm L}}={\mathbb{R}}^{M}.

When Λ=ΛMS\Lambda=\Lambda_{{\rm MS}}, 𝔽Λ{\mathbb{F}}_{\Lambda} is the set {ϕ1,,ϕM}\left\{{\phi_{1},\ldots,\phi_{M}}\right\} consisting of the initial estimators. When Λ=ΛCv\Lambda=\Lambda_{{\mathrm{Cv}}}, 𝔽Λ{\mathbb{F}}_{\Lambda} is the convex hull of the ϕj\phi_{j}. In the literature, one may also find

ΛCv={λ[0,1]M,j=1Mλj1}\Lambda_{{\mathrm{Cv}}}^{\prime}=\left\{{\lambda\in[0,1]^{M},\ \sum_{j=1}^{M}\lambda_{j}\leq 1}\right\}

in place of ΛCv\Lambda_{{\mathrm{Cv}}} in which case 𝔽Λ{\mathbb{F}}_{\Lambda} is the convex hull of {0,ϕ1,,ϕM}\left\{{0,\phi_{1},\ldots,\phi_{M}}\right\}. Finally, when Λ=ΛL\Lambda=\Lambda_{{\rm L}}, 𝔽Λ{\mathbb{F}}_{\Lambda} is the linear span of the ϕj\phi_{j}.

Each of these three aggregation problems is solved separately if for each Λ{ΛMS,ΛCv,ΛL}\Lambda\in\left\{{\Lambda_{{\rm MS}},\Lambda_{{\mathrm{Cv}}},\Lambda_{{\rm L}}}\right\} one can design an estimator f^=f^(Λ)\widehat{f}=\widehat{f}(\Lambda) satisfying

𝔼[ff^2]Cinfg𝔽Λfg2Cψn,Λσ2{\mathbb{E}}\left[{\left\|{f-\widehat{f}}\right\|^{2}}\right]-C\inf_{g\in{\mathbb{F}}_{\Lambda}}\left\|{f-g}\right\|^{2}\leq C^{\prime}\psi_{n,\Lambda}\sigma^{2} (19)

with C=1C=1, C>0C^{\prime}>0 free of f,n,Mf,n,M and

ψn,Λ={MifΛ=ΛLnlog(eM/n)ifΛ=ΛCvandnMMifΛ=ΛCvandnMlogMifΛ=ΛMS.\psi_{n,\Lambda}=\left\{\begin{array}[]{cc}M&\mbox{if}\ \Lambda=\Lambda_{{\rm L}}\\ \sqrt{n\log(eM/\sqrt{n})}&\mbox{if}\ \Lambda=\Lambda_{{\mathrm{Cv}}}\ \mbox{and}\ \sqrt{n}\leq M\\ M&\mbox{if}\ \Lambda=\Lambda_{{\mathrm{Cv}}}\ \mbox{and}\ \sqrt{n}\geq M\\ \log M&\mbox{if}\ \Lambda=\Lambda_{{\rm MS}}.\end{array}\right. (20)

These problems have only been considered when the variance is known. The quantity ψn,Λ\psi_{n,\Lambda} then corresponds to the best possible upper bound in (19) over all possible fnf\in{\mathbb{R}}^{n} and preliminary estimators ϕj\phi_{j} and is called the optimal rate of aggregation. For a more precise definition, we refer the reader to Tsybakov (2003). Bunea et al (2007) considered the problem of solving these three problems simultaneously by building an estimator f^\widehat{f} which satisfies (19) simultaneously for all Λ{ΛMS,ΛCv,ΛL}\Lambda\in\left\{{\Lambda_{{\rm MS}},\Lambda_{{\mathrm{Cv}}},\Lambda_{{\rm L}}}\right\} and some constant C>1C>1. This is an interesting issue since it is impossible to know in practice which aggregation device should be used to achieve the smallest risk bound: as Λ\Lambda grows (for the inclusion), the bias infg𝔽Λfg2\inf_{g\in{\mathbb{F}}_{\Lambda}}\left\|{f-g}\right\|^{2} decreases while the rate ψn,Λ\psi_{n,\Lambda} increases.

The aim of this section is to show that our procedure provides a way of solving (or nearly solving) the three aggregation problems both separately and simultaneously when the variance is unknown.

Throughout this section, we consider the family SS¯\overline{\SS} consisting of the SmS_{m} defined for each m{1,,M}m\subset\left\{{1,\ldots,M}\right\} and mm\neq\varnothing as the linear span of the ϕj\phi_{j} for jmj\in m. Along this section, we shall use the weight function Δ\Delta defined on SS¯\overline{\SS} by

Δ(Sm)=|m|+log[(M|m|)],\Delta(S_{m})=|m|+\log\left[{\left(\!\!\begin{array}[]{c}M\\ |m|\\ \end{array}\!\!\right)}\right],

take α=1/2\alpha=1/2 and pen(.)=1.1penΔ(.)\mathop{\rm pen}\nolimits(.)=1.1{\mathop{\rm pen}_{\Delta}\nolimits}(.), thus taking K=1.1K=1.1. The choices of α\alpha and KK are only meant to fix ideas. Note that Δ\Delta satisfies Assumption 2 with Σ<1\Sigma<1. To avoid trivialities, we assume throughout that n4n\geq 4.

3.1 Solving the three aggregation problems separately

3.1.1 Linear Aggregation

Problem (L) is the easiest to solve. Let us take 𝔽=𝔽Λ{\mathbb{F}}={\mathbb{F}}_{\Lambda} with Λ=ΛL\Lambda=\Lambda_{{\rm L}} and

SS=SSL={S{1,,M}}\SS=\SS_{{\rm L}}=\left\{{S_{\{1,\ldots,M\}}}\right\} (21)

and SSλ=SSL\SS_{\lambda}=\SS_{{\rm L}} for all λΛL\lambda\in\Lambda_{{\rm L}}. Minimizing critα(fλ){\rm crit}_{\alpha}(f_{\lambda}) over fλ𝔽Λf_{\lambda}\in{\mathbb{F}}_{\Lambda} amounts to minimizing Yfλ2\left\|{Y-f_{\lambda}}\right\|^{2} over fλS{1,,M}f_{\lambda}\in S_{\{1,\ldots,M\}} and hence, the resulting estimator is merely f^L=ΠS{1,,M}Y\widehat{f}_{{\rm L}}=\Pi_{S_{\{1,\ldots,M\}}}Y. The risk of f^L\widehat{f}_{{\rm L}} satisfies

{\mathbb{E}}\left[{\left\|{f-\widehat{f}_{{\rm L}}}\right\|^{2}}\right]\leq\inf_{g\in{\mathbb{F}}_{\Lambda}}\left\|{f-g}\right\|^{2}+M\sigma^{2},

whatever nn and MM, which solves the problem of Linear Aggregation.

3.1.2 Model Selection Aggregation

To tackle Problem (MS), we take 𝔽=𝔽Λ{\mathbb{F}}={\mathbb{F}}_{\Lambda} with Λ=ΛMS\Lambda=\Lambda_{{\rm MS}}, that is, 𝔽Λ={ϕ1,,ϕM}{\mathbb{F}}_{\Lambda}=\left\{{\phi_{1},\ldots,\phi_{M}}\right\},

SS=SSMS={S{1},,S{M}}\SS=\SS_{{\rm MS}}=\{S_{\{1\}},\ldots,S_{\{M\}}\} (22)

and associate to each fλ=ϕjf_{\lambda}=\phi_{j} the collection SSλ\SS_{\lambda} reduced to {S{j}}\left\{{S_{\{j\}}}\right\}. Note that dim(S)1\dim(S)\leq 1 and Δ(S)=log(eM)dim(S)\Delta(S)=\log(eM)\geq\dim(S) for all SSSMSS\in\SS_{{\rm MS}}, so that under the assumption that log(eM)n/2\log(eM)\leq n/2 we may apply Corollary 1 with δ=0\delta=0 (since 𝔽Λ{\mathbb{F}}_{\Lambda} is finite), κ=1/2\kappa=1/2 and get that for some constant C>0C>0 the resulting estimator f^MS\widehat{f}_{{\rm MS}} satisfies

C𝔼[ff^MS2]infg𝔽Λfg2+log(M)σ2.C{\mathbb{E}}\left[{\left\|{f-\widehat{f}_{{\rm MS}}}\right\|^{2}}\right]\leq\inf_{g\in{\mathbb{F}}_{\Lambda}}\left\|{f-g}\right\|^{2}+\log(M)\sigma^{2}.

This risk bound is of the form (19) except for the constant CC which is not equal to 1. We do not know whether Problem (MS) can be solved or not with C=1C=1 when the variance σ2\sigma^{2} is unknown and MM is large (possibly larger than nn).

3.1.3 Convex aggregation

For this problem, we emphasize the aggregation rate with respect to the quantity

L=supj=1,,Mϕjσn.L=\sup_{j=1,\ldots,M}{\left\|{\phi_{j}}\right\|\over\sigma\sqrt{n}}. (23)

If M<nLM<\sqrt{n}L, take again the estimator f^L\widehat{f}_{{\rm L}}. Since the convex hull of the ϕj\phi_{j} is a subset of the linear space S{1,,M}S_{\{1,\ldots,M\}}, for Λ=ΛCv\Lambda=\Lambda_{{\mathrm{Cv}}} we have

𝔼[ff^L2]infg𝔽Λfg2+Mσ2.{\mathbb{E}}\left[{\left\|{f-\widehat{f}_{{\rm L}}}\right\|^{2}}\right]\leq\inf_{g\in{\mathbb{F}}_{\Lambda}}\left\|{f-g}\right\|^{2}+M\sigma^{2}.

Let us now turn to the case MnLM\geq\sqrt{n}L. More precisely, assume that

2nLMe1min{nLenL2,en/(2L)}2\leq\sqrt{n}L\leq M\leq e^{-1}\min\left\{{\sqrt{n}Le^{nL^{2}},e^{\sqrt{n}/(2L)}}\right\} (24)

and set d(n,M)=n/(2log(eM))d(n,M)=n/(2\log(eM)). We consider the family of estimators 𝔽=𝔽Λ{\mathbb{F}}={\mathbb{F}}_{\Lambda} with Λ=ΛCv\Lambda=\Lambda_{{\mathrm{Cv}}} and

SS=SSCv=SSλ={SmSS¯,|m|d(n,M)},λΛCv.\SS=\SS_{{\mathrm{Cv}}}=\SS_{\lambda}=\left\{{S_{m}\in\overline{\SS},\ |m|\leq d(n,M)}\right\},\ \ \forall\lambda\in\Lambda_{{\mathrm{Cv}}}. (25)

The set ΛCv\Lambda_{{\mathrm{Cv}}} being compact, λcritα(fλ)\lambda\mapsto{\rm crit}_{\alpha}(f_{\lambda}) admits a minimum λ^\widehat{\lambda} over ΛCv\Lambda_{{\mathrm{Cv}}} and we set f^Cv=f^λ^\widehat{f}_{{\mathrm{Cv}}}=\widehat{f}_{\widehat{\lambda}}.

Proposition 2.

There exists a universal constant C>1C>1 such that

𝔼[ff^Cv2]Cinfg𝔽Λfg2CnL2log(eM/nL2)σ2.{\mathbb{E}}\left[{\left\|{f-\widehat{f}_{{\mathrm{Cv}}}}\right\|^{2}}\right]-C\inf_{g\in{\mathbb{F}}_{\Lambda}}\left\|{f-g}\right\|^{2}\leq C\sqrt{nL^{2}\log(eM/\sqrt{nL^{2}})}\sigma^{2}.

This risk bound is of the form (19) except for the constant CC which is not equal to 1. Again, we do not know whether Problem (Cv{\mathrm{Cv}}) can be solved or not with C=1C=1 when the variance σ2\sigma^{2} is unknown and MM possibly larger than nn.

3.2 Solving the three problems simultaneously

Consider now three estimators f^L,f^MS,f^Cv\widehat{f}_{{\rm L}},\widehat{f}_{{\rm MS}},\widehat{f}_{{\mathrm{Cv}}} with values respectively in S{1,,M}S_{\{1,\ldots,M\}}, j=1MS{j}\bigcup_{j=1}^{M}S_{\{j\}} and the convex hull 𝒞{\mathcal{C}} of the ϕj\phi_{j} (we use a new notation for this convex hull to avoid ambiguity). One may take the estimators defined in Section 3.1 but any others would suit. The aim of this section is to select the one with the smallest risk to estimate ff. To do so, we apply our selection procedure with 𝔽={f^L,f^MS,f^Cv}{\mathbb{F}}=\{\widehat{f}_{{\rm L}},\widehat{f}_{{\rm MS}},\widehat{f}_{{\mathrm{Cv}}}\}, taking thus Λ={L,MS,Cv}\Lambda=\left\{{{\rm L},{\rm MS},{\mathrm{Cv}}}\right\}, and associate to each of these three estimators the families SSL,SSMS,SSCv\SS_{{\rm L}},\SS_{{\rm MS}},\SS_{{\mathrm{Cv}}} defined by (21), (22) and (25) respectively and choose SS=SSLSSMSSSCv\SS=\SS_{{\rm L}}\cup\SS_{{\rm MS}}\cup\SS_{{\mathrm{Cv}}}.

Proposition 3.

Assume that (24) holds and that log(eM)n/2\log(eM)\leq n/2. There exists a universal constant C>0C>0 such that whatever f^L,f^MS\widehat{f}_{{\rm L}},\widehat{f}_{{\rm MS}} and f^Cv\widehat{f}_{{\mathrm{Cv}}} with values in S{1,,M}S_{\{1,\ldots,M\}}, j=1MS{j}\bigcup_{j=1}^{M}S_{\{j\}} and 𝒞{\mathcal{C}} respectively, the selected estimator f^λ^\widehat{f}_{\widehat{\lambda}} satisfies for all fnf\in{\mathbb{R}}^{n},

C𝔼[ff^λ^2]infλ{L,MS,Cv}[𝔼[ff^λ2]+Bλ],C{\mathbb{E}}\left[{\left\|{f-\widehat{f}_{\widehat{\lambda}}}\right\|^{2}}\right]\leq\inf_{\lambda\in\left\{{{\rm L},{\rm MS},{\mathrm{Cv}}}\right\}}\left[{{\mathbb{E}}\left[{\left\|{f-\widehat{f}_{\lambda}}\right\|^{2}}\right]+B_{\lambda}}\right],

where

BL=σ2M,BMS=σ2logM,BCv=σ2[MnL2log(eM/nL2)].B_{{\rm L}}=\sigma^{2}M,\ \ B_{{\rm MS}}=\sigma^{2}\log M,\ \ B_{{\mathrm{Cv}}}=\sigma^{2}\left[{M\wedge\sqrt{nL^{2}\log(eM/\sqrt{nL^{2}})}}\right].

In particular, if f^L,f^MS\widehat{f}_{{\rm L}},\widehat{f}_{{\rm MS}} and f^Cv\widehat{f}_{{\mathrm{Cv}}} fulfill (19), then

C𝔼[ff^λ^2]infλ{L,MS,Cv}[infg𝔽λfg2+Bλ],C{\mathbb{E}}\left[{\left\|{f-\widehat{f}_{\widehat{\lambda}}}\right\|^{2}}\right]\leq\inf_{\lambda\in\left\{{{\rm L},{\rm MS},{\mathrm{Cv}}}\right\}}\left[{\inf_{g\in{\mathbb{F}}_{\lambda}}\left\|{f-g}\right\|^{2}+B_{\lambda}}\right],

where 𝔽λ{\mathbb{F}}_{\lambda} stands for 𝔽Λ{\mathbb{F}}_{\Lambda} when Λ=Λλ\Lambda=\Lambda_{\lambda}.

4 Selecting among linear estimators

In this section, we consider the situation where the estimators f^λ\widehat{f}_{\lambda} are linear, that is, are of the form f^λ=AλY\widehat{f}_{\lambda}=A_{\lambda}Y for some known and deterministic n×nn\times n matrix AλA_{\lambda}. As mentioned before, this setting covers many popular estimation procedures including kernel ridge estimators, spline smoothing, Nadaraya estimators, λ\lambda-nearest neighbors, projection estimators, low-pass filters, etc. In some cases AλA_{\lambda} is symmetric (e.g. kernel ridge, spline smoothing, projection estimators), in some others AλA_{\lambda} is non-symmetric and non-singular (as for Nadaraya estimators) and sometimes AλA_{\lambda} can be both singular and non-symmetric (low-pass filters, λ\lambda-nearest neighbors). A common feature of those procedures lies in the fact that they depend on a tuning parameter (possibly multidimensional) and their practical performances can be quite poor if this parameter is not suitably calibrated. A series of papers have investigated the calibration of some of these procedures. To mention a few of them, Cao and Golubev (2006) focus on spline smoothing, Zhang (2005) on kernel ridge regression, Goldenshluger and Lepski (2009) on kernel estimators and Arlot and Bach (2009) propose a procedure to select among symmetric linear estimators with spectrum in [0,1][0,1]. The procedure we present can handle all these cases in a unified framework. Throughout the section, we assume that Λ\Lambda is finite.

4.1 The families SSλ\SS_{\lambda}

To apply our selection procedure, we need to associate to each AλA_{\lambda} a suitable collection of approximation spaces SSλ\SS_{\lambda}. To do so, we introduce below a linear space SλS_{\lambda} which plays a key role in our analysis.

For the sake of simplicity, let us first consider the case where AλA_{\lambda} is non-singular. Then SλS_{\lambda} is defined as the linear span of the right-singular vectors of Aλ1IA^{-1}_{\lambda}-I associated to singular values smaller than 1. When AλA_{\lambda} is symmetric, SλS_{\lambda} is merely the linear span of the eigenvectors of AλA_{\lambda} associated to eigenvalues not smaller than 1/2. If none of the singular values are smaller than 1, then Sλ={0}S_{\lambda}=\left\{{0}\right\}.

Let us now extend the definition of SλS_{\lambda} to singular operators AλA_{\lambda}. Let us recall that n=ker(Aλ)rg(Aλ){\mathbb{R}}^{n}=\ker(A_{\lambda})\oplus\mathrm{rg}(A_{\lambda}^{*}) where AλA_{\lambda}^{*} stands for the transpose of AλA_{\lambda} and rg(Aλ)\mathrm{rg}(A_{\lambda}^{*}) for its range. The operator AλA_{\lambda} then induces a one to one operator between rg(Aλ)\mathrm{rg}(A^{*}_{\lambda}) and rg(Aλ)\mathrm{rg}(A_{\lambda}). Write Aλ+A_{\lambda}^{+} for the inverse of this operator from rg(Aλ)\mathrm{rg}(A_{\lambda}) to rg(Aλ)\mathrm{rg}(A_{\lambda}^{*}). The orthogonal projection operator from n{\mathbb{R}}^{n} onto rg(Aλ)\mathrm{rg}(A_{\lambda}^{*}) induces a linear operator from rg(Aλ)\mathrm{rg}(A_{\lambda}) into rg(Aλ)\mathrm{rg}(A_{\lambda}^{*}), denoted Π¯λ\overline{\Pi}_{\lambda}. Then SλS_{\lambda} is defined as the linear span of the right-singular vectors of Aλ+Π¯λA^{+}_{\lambda}-\overline{\Pi}_{\lambda} associated to singular values smaller than 1. Again if this set is empty, Sλ={0}S_{\lambda}=\left\{{0}\right\}. When AλA_{\lambda} is non-singular or symmetric, we recover the definition of SλS_{\lambda} given above.

For each λΛ\lambda\in\Lambda, take SSλ\SS_{\lambda} such that SSλ{Sλ}\SS_{\lambda}\supset\left\{{S_{\lambda}}\right\}. From a theoretical point of view, it is enough to take SSλ={Sλ}\SS_{\lambda}=\left\{{S_{\lambda}}\right\} but practically it may be wise to use a larger set and by doing so, to possibly improve the approximation of f^λ\widehat{f}_{\lambda} by elements of SSλ\SS_{\lambda}. One may for example take SSλ={Sλ1,,Sλn2}\SS_{\lambda}=\left\{{S_{\lambda}^{1},\ldots,S_{\lambda}^{n-2}}\right\} where SλkS_{\lambda}^{k} is the linear span of the right-singular vectors associated to the kk smallest singular values of Aλ+Π¯λA^{+}_{\lambda}-\overline{\Pi}_{\lambda}.
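For illustration, the space associated with a linear estimator A Y can be extracted from the matrix A with a few lines of R. The sketch below covers the symmetric and the non-singular cases described above (the general singular case would proceed in the same way with the restricted inverse and the projection onto the range of the transpose); it returns a matrix whose columns form an orthonormal basis of the space, or NULL when the space is {0}, so that it can be fed directly to the criterion (6).

## Sketch: approximation space S_lambda for a linear estimator f_hat_lambda = A %*% Y.
## Symmetric A: span of eigenvectors with eigenvalues >= 1/2.
## Non-singular A: span of right-singular vectors of A^{-1} - I with singular values < 1.
S_lambda_basis <- function(A, tol = 1e-10) {
  if (isSymmetric(A)) {
    e    <- eigen(A, symmetric = TRUE)
    keep <- e$values >= 1/2
    if (any(keep)) e$vectors[, keep, drop = FALSE] else NULL
  } else {
    s    <- svd(solve(A) - diag(nrow(A)))   # assumes A non-singular
    keep <- s$d < 1 - tol                   # tol guards against numerical ties at 1
    if (any(keep)) s$v[, keep, drop = FALSE] else NULL
  }
}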

4.2 Choices of SS\SS, Δ\Delta and pen\mathop{\rm pen}\nolimits

Take SS=λΛSSλ\SS=\bigcup_{\lambda\in\Lambda}\SS_{\lambda} and Δ\Delta of the form

Δ(S)=a(1dim(S))forallSSS\Delta(S)=a\left({1\vee\dim(S)}\right)\ \ {\rm for\ all}\ S\in\SS

where a1a\geq 1 satisfies Assumption 2 with Σ1\Sigma\leq 1. One may take a=(log|Λ|)1a=(\log\left|{\Lambda}\right|)\vee 1 even though this choice is not necessarily the best. Finally, for some K>1K>1, take pen(S)=KpenΔ(S)\mathop{\rm pen}\nolimits(S)=K{\mathop{\rm pen}_{\Delta}\nolimits}(S) for all SSSS\in\SS and select f^λ^\widehat{f}_{\widehat{\lambda}} by minimizing the criterion given by (6), taking thus δ=0\delta=0 in (9).

4.3 An oracle-type inequality for linear estimators

The following holds.

Corollary 2.

Let K>1K>1, κ(0,1)\kappa\in(0,1) and α>0\alpha>0. If Assumption 1 holds and Δ(S)κn\Delta(S)\leq\kappa n for all SSSS\in\SS, the estimator f^λ^\widehat{f}_{\widehat{\lambda}} satisfies

Ca1𝔼[ff^λ^2]infλ𝔼[ff^λ2]+σ2,Ca^{-1}{\mathbb{E}}\left[{\left\|{f-\widehat{f}_{\widehat{\lambda}}}\right\|^{2}}\right]\leq\inf_{\lambda}{\mathbb{E}}\left[{\left\|{f-\widehat{f}_{\lambda}}\right\|^{2}}\right]+\sigma^{2},

for some CC depending on K,αK,\alpha and κ\kappa only.

The problem of selecting some best linear estimator among a family of those has also been considered in Arlot and Bach (2009) in the Gaussian regression framework, and in Goldenshluger and Lepski (2009) in the multidimensional Gaussian white noise model. Arlot and Bach proposed a penalized procedure based on random penalties. Unlike ours, their approach requires that the operators be symmetric with eigenvalues in [0,1][0,1] and that the cardinality of Λ\Lambda is at most polynomial with respect to nn. Goldenshluger and Lepski proposed a selection rule among families of kernel estimators to solve the problem of structural adaptation. Their approach requires suitable assumptions on the kernels while ours requires nothing. Nevertheless, we restrict ourselves to the case of the Euclidean loss whereas Goldenshluger and Lepski considered more general 𝕃p{\mathbb{L}}_{p} ones.

5 Variable selection

Throughout this section, we consider the problem of variable selection introduced in Example 2 and assume that p2p\geq 2 in order to avoid trivialities. When pp is small enough (say smaller than 20), this problem can be solved by using a suitable variable selection procedure that explores all the subsets of {1,,p}\{1,\ldots,p\}. For example, one may use the penalized criterion introduced in Birgé and Massart (2001) when the variance is known, and the one in Baraud et al (2009) when it is not. When pp is larger, such an approach can no longer be applied since it becomes numerically intractable. To overcome this problem, algorithms based on the minimization of convex criteria have been proposed among which are the Lasso, the Dantzig selector of Candès and Tao (2007) and the elastic net of Zou and Hastie (2005). An alternative to those criteria is the forward-backward algorithm described in Zhang (2008), among others. Since there seems to be no evidence that one of these procedures outperforms all the others, it may be reasonable to mix them all and let the data decide which is the most appropriate to solve the problem at hand. As enlarging 𝔽{\mathbb{F}} can only improve the risk bound of our estimator, only CPU resources should limit the number of candidate estimators.

The procedure we propose could not only be used to select among those candidate procedures but also to select the tuning parameters they depend on. From this point of view, it provides an alternative to the cross-validation techniques which are quite popular but offer few theoretical guarantees.

5.1 Implementation roadmap

Start by choosing a family {\mathcal{L}} of variable selection procedures. Examples of such procedures are the Lasso, the Dantzig selector, the elastic net, among others. If necessary, associate to each \ell\in{\mathcal{L}} a family of tuning parameters HH_{\ell}. For example, in order to use the Lasso procedure one needs to choose a tuning parameter h>0h>0 among a grid HLasso+H_{{\rm Lasso}}\subset{\mathbb{R}}_{+}. If a selection procedure \ell requires no choice of tuning parameters, then one may take H={0}H_{\ell}=\left\{{0}\right\}. Let us denote by m^(,h)\widehat{m}(\ell,h) the subset of {1,,p}\{1,\ldots,p\} corresponding to the predictors selected by the procedure \ell for the choice of the tuning parameter hh. For m{1,,p}m\subset\left\{{1,\ldots,p}\right\}, let SmS_{m} be the linear span of the column vectors X.,jX_{.,j} for jmj\in m (with the convention S={0}S_{\varnothing}=\left\{{0}\right\}). For \ell\in{\mathcal{L}} and hHh\in H_{\ell}, associate to the subset m^(,h)\widehat{m}(\ell,h) an estimator f^(,h)\widehat{f}_{(\ell,h)} of ff with values in Sm^(,h)S_{\widehat{m}(\ell,h)} (one may for example take the projection of YY onto the random linear space Sm^(,h)S_{\widehat{m}(\ell,h)} but any other choice would suit). Finally, consider the family 𝔽={f^λ,λΛ}{\mathbb{F}}=\{\widehat{f}_{\lambda},\ \lambda\in\Lambda\} of these estimators by taking Λ=({}×H)\Lambda=\bigcup_{\ell\in{\mathcal{L}}}(\{\ell\}\times H_{\ell}) and set ^={m^(λ),λΛ}\widehat{\mathcal{M}}=\left\{{\widehat{m}(\lambda),\ \lambda\in\Lambda}\right\}. All along we assume that Λ\Lambda is finite (so that we take δ=0\delta=0 in (9)).
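The roadmap can be summarized by the following R sketch, in which two procedures (the Lasso and the elastic net, both fitted with glmnet) play the role of the family of variable selection procedures, their penalty grids play the role of the tuning sets, and each estimator is taken to be the projection of Y onto the span of the selected predictors; all object names are illustrative and X, Y are assumed available.

## Roadmap sketch (illustrative, not the authors' package).
library(glmnet)

procedures <- list(lasso = 1, enet = 0.5)        # glmnet mixing parameters alpha
H          <- 10^seq(1, -3, length.out = 20)     # common grid of penalties

## m_hat(l, h): predictors selected by procedure l with tuning parameter h
m_hat <- list()
for (l in names(procedures)) {
  fit <- glmnet(X, Y, alpha = procedures[[l]], lambda = H,
                intercept = FALSE, standardize = FALSE)
  m_hat[[l]] <- lapply(H, function(h) which(coef(fit, s = h)[-1, 1] != 0))
}

## f_hat_(l, h): projection of Y onto the span of the selected columns of X
proj_Y <- function(m) if (length(m) == 0) rep(0, length(Y)) else
  lm.fit(X[, m, drop = FALSE], Y)$fitted.values
F_hat  <- lapply(m_hat, function(ms) lapply(ms, proj_Y))

The selection among these candidates is then performed with the criterion (6), using for the approximation families one of the two random collections given below.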

The approximation spaces and the weight function

Throughout, we shall restrict ourselves to subsets of predictors with cardinality not larger than some Dmaxn2D_{\max}\leq n-2. In view of approximating the estimators f^λ\widehat{f}_{\lambda}, we suggest the collection SS\SS given by

SS={Sm|m{1,,p},card(m)Dmax}.\SS=\bigcup\left\{{S_{m}\big{|}\ m\subset\left\{{1,\ldots,p}\right\},{\rm card}(m)\leq D_{\max}}\right\}. (26)

We associate to SS\SS the weight function Δ\Delta defined for SSSS\in\SS by

Δ(S)=log[(pD)]+log(1+D)withD=dim(S).\Delta(S)=\log\left[{\left(\!\!\begin{array}[]{c}p\\ D\\ \end{array}\!\!\right)}\right]+\log(1+D)\ \ {\rm with}\ \ D=\dim(S). (27)

Since

\sum_{S\in\SS}e^{-\Delta(S)}=\sum_{D=0}^{p}\ \sum_{\substack{S\in\SS\\ \dim(S)=D}}e^{-\Delta(S)}\leq\sum_{D=0}^{p}e^{-\log(1+D)}\leq 1+\log(1+p),

Assumption 2 is satisfied with Σ=1+log(1+p)\Sigma=1+\log(1+p).

Let us now turn to the choices of the SSλSS\SS_{\lambda}\subset\SS. The criterion given by (6) cannot be computed when SSλ=SS\SS_{\lambda}=\SS for all λ\lambda as soon as pp is too large. In such a case, one must consider a smaller subset of SS\SS and we suggest for λ=(,h)Λ\lambda=(\ell,h)\in\Lambda

SS(,h)={Sm^(,h),hH}\SS_{(\ell,h)}=\left\{{S_{\widehat{m}(\ell,h^{\prime})},\ h^{\prime}\in H_{\ell}}\right\}

(where the SmS_{m} are defined above), or preferably

\SS_{(\ell,h)}=\left\{{S_{\widehat{m}(\ell^{\prime},h^{\prime})},\ \ell^{\prime}\in{\mathcal{L}},\ h^{\prime}\in H_{\ell^{\prime}}}\right\}

whenever this latter family is not too large. Note that these two families are random.

5.2 The results

Our choices of Δ\Delta and SSλ\SS_{\lambda} ensure that f^λSm^(λ)SSλ\widehat{f}_{\lambda}\in S_{\widehat{m}(\lambda)}\in\SS_{\lambda} for all λΛ\lambda\in\Lambda and that

Δ(Sm^(λ))2dim(Sm^(λ))logp.\Delta(S_{\widehat{m}(\lambda)})\leq 2\dim(S_{\widehat{m}(\lambda)})\log p.

Hence, by applying Corollary 1 with S^λ=Sm^(λ)\widehat{S}_{\lambda}=S_{\widehat{m}(\lambda)}, we get the following result.

Corollary 3.

Let K>1K>1, κ(0,1)\kappa\in(0,1) and DmaxD_{\max} be some positive integer satisfying Dmaxκn/(2logp)D_{\max}\leq\kappa n/(2\log p). Let ^={m^(λ),λΛ}\widehat{\mathcal{M}}=\{\widehat{m}(\lambda),\lambda\in\Lambda\} be a (finite) collection of random subsets of {1,,p}\{1,\ldots,p\} with cardinality not larger than DmaxD_{\max} based on the observation YY and {f^λ,λΛ}\{\widehat{f}_{\lambda},\lambda\in\Lambda\} a family of estimators of ff, also based on YY, such that f^λSm^(λ)\widehat{f}_{\lambda}\in S_{\widehat{m}(\lambda)}. By applying our selection procedure, the resulting estimator f^λ^\widehat{f}_{\widehat{\lambda}} satisfies

C𝔼[ff^λ^2]infλΛ[𝔼[ff^λ2]+𝔼[dim(Sm^(λ))]log(p)σ2],C{\mathbb{E}}\left[{\left\|{f-\widehat{f}_{\widehat{\lambda}}}\right\|^{2}}\right]\leq\inf_{\lambda\in\Lambda}\left[{{\mathbb{E}}\left[{\left\|{f-\widehat{f}_{\lambda}}\right\|^{2}}\right]+{\mathbb{E}}\left[{\dim(S_{\widehat{m}(\lambda)})}\right]\log(p)\sigma^{2}}\right],

where CC is a constant depending on the choices of KK and κ\kappa only.

Again, note that the risk bound we get is non-increasing with respect to Λ\Lambda. This means that if one adds a new variable selection procedure or considers more tuning parameters to increase Λ\Lambda, the risk bound we get can only be improved.

Without additional information on the estimators f^λ\widehat{f}_{\lambda} it is difficult to compare 𝔼[dim(Sm^(λ))]σ2{\mathbb{E}}\left[{\dim(S_{\widehat{m}(\lambda)})}\right]\sigma^{2} and 𝔼[ff^λ2]{\mathbb{E}}\left[{\|f-\widehat{f}_{\lambda}\|^{2}}\right]. If f^λ\widehat{f}_{\lambda} is of the form ΠSY\Pi_{S}Y for some deterministic subset SSSS\in\SS it is well-known that

𝔼[fΠSY2]=fΠSf2+dim(S)σ2dim(S)σ2.{\mathbb{E}}\left[{\left\|{f-\Pi_{S}Y}\right\|^{2}}\right]=\left\|{f-\Pi_{S}f}\right\|^{2}+\dim(S)\sigma^{2}\geq\dim(S)\sigma^{2}.

Under the assumption that fSmf\in S_{m^{*}} and that mm^{*} belongs to ^\widehat{\mathcal{M}} with probability close enough to 1, we can compare the risk of the estimator f^λ^\widehat{f}_{\widehat{\lambda}} to the cardinality of mm^{*}.

Corollary 4.

Assume that the assumptions of Corollary 3 hold and that f^λ=ΠSm^(λ)Y\widehat{f}_{\lambda}=\Pi_{S_{\widehat{m}(\lambda)}}Y for all λΛ\lambda\in\Lambda. If fSmf\in S_{m^{*}} for some non-void subset m{1,,p}m^{*}\subset\{1,\ldots,p\} with cardinality not larger than DmaxD_{\max}, then

C𝔼[ff^λ^2]log(p)|m|σ2+Rn(m)C{\mathbb{E}}\left[{\left\|{f-\widehat{f}_{\widehat{\lambda}}}\right\|^{2}}\right]\leq\log(p)|m^{*}|\sigma^{2}+R_{n}(m^{*})

where CC is a constant depending on KK and κ\kappa only, and

Rn(m)=(f2+nσ2)([m^])1/2.R_{n}(m^{*})=(\|f\|^{2}+n\sigma^{2})\left({{\mathbb{P}}\left[{m^{*}\not\in\widehat{\mathcal{M}}}\right]}\right)^{1/2}.

Zhao and Yu (2006) give sufficient conditions on the design XX to ensure that [m^]{\mathbb{P}}\left[{m^{*}\not\in\widehat{\mathcal{M}}}\right] is exponentially small with respect to nn when the family ^\widehat{\mathcal{M}} is obtained by using the LARS-Lasso algorithm with different values of the tuning parameter.

6 Simulation study

In the linear regression setting described in Example 2, we carry out a simulation study to evaluate the performance of our procedure for solving the following two problems.

We first consider the problem, described in Example 3, of tuning the smoothing parameter of the Lasso procedure for estimating ff. The performances of our procedure are compared with those of the VV-fold cross-validation method. Secondly, we consider the problem of variable selection. We solve it by using our criterion in view of selecting among a family {\mathcal{L}} of candidate variable selection procedures.

Our simulation study is based on a large number of examples which have been chosen in view of covering a large variety of situations. Most of these have been found in the literature in the context of Example 2 either for estimation or variable selection purposes when the number pp of predictors is large.

The section is organized as follows. The simulation design is given in the following section. Then, we describe how our procedure is applied for tuning the Lasso and performing variable selection. Finally, we give the results of the simulation study.

6.1 Simulation design

One example is determined by the number of observations nn, the number of variables pp, the n×pn\times p matrix XX, the values of the parameters β\beta, and the signal-to-noise ratio ρ\rho. It is denoted by ex(n,p,X,β,ρ){\mathrm{ex}}(n,p,X,\beta,\rho), and the set of all considered examples is denoted {\mathcal{E}}. For each example, we carry out 400 simulations of YY as a Gaussian random vector with expectation f=Xβf=X\beta and variance σ2In\sigma^{2}I_{n}, where InI_{n} is the n×nn\times n identity matrix, and σ2=f2/(nρ)\sigma^{2}=\|f\|^{2}/(n\rho).
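In R, one simulated response vector for a given example can be drawn as follows (a direct transcription of the description above, with illustrative names):

## Sketch of one simulated data set from ex(n, p, X, beta, rho):
## Y ~ N(f, sigma^2 I_n) with f = X beta and sigma^2 = ||f||^2 / (n * rho).
simulate_Y <- function(X, beta, rho) {
  f      <- as.vector(X %*% beta)
  sigma2 <- sum(f^2) / (length(f) * rho)
  f + rnorm(length(f), sd = sqrt(sigma2))
}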

The collection {\mathcal{E}} is composed of several collections e{\mathcal{E}}_{e} for e=1,,Ee=1,\ldots,E where each collection e{\mathcal{E}}_{e} is characterized by a vector of parameters βe\beta_{e}, and a set 𝒳e{\mathcal{X}}_{e} of matrices XX:

e={ex(n,p,X,β,ρ):(n,p),X𝒳e,β=βe,ρ}{\mathcal{E}}_{e}=\left\{{\mathrm{ex}}(n,p,X,\beta,\rho):(n,p)\in{\mathcal{I}},X\in{\mathcal{X}}_{e},\beta=\beta_{e},\rho\in{\mathcal{R}}\right\}

where ={5,10,20}{\mathcal{R}}=\{5,10,20\} and {\mathcal{I}} consists of pairs (n,p)(n,p) such that pp is smaller, equal or greater than nn. The examples are described in further details in Section 8.2. They are inspired by examples found in Tibshirani (1996), Zou and Hastie (2005), Zou (2006), and Huang et al. (2008) for comparing the Lasso method to the ridge, adaptive Lasso and elastic net methods. They make up a large variety of situations. They include cases where

  • the covariates are not, moderately or strongly correlated,

  • the covariates with zero coefficients are weakly or highly correlated with covariates with non-zero coefficients,

  • the covariates with non-zero coefficients are grouped and correlated within these groups,

  • the lasso method is known to be inconsistent,

  • few or many effects are present.

6.2 Tuning a smoothing parameter

We consider here the problem of tuning the smoothing parameter of the Lasso estimator as described in Example 3. Instead of considering the Lasso estimators for a fixed grid Λ\Lambda of smoothing parameters λ\lambda, we rather focus on the sequence {f^1,,f^Dmax}\{\widehat{f}_{1},\ldots,\widehat{f}_{D_{\max}}\} of estimators given by the DmaxD_{\max} first steps of the LARS-Lasso algorithm proposed by Efron et al. (2004). Hence, the tuning parameter is here the number hH={1,,Dmax}h\in H=\left\{{1,\ldots,D_{\max}}\right\} of steps. In our simulation study, we compare the performance of our criterion to that of the VV-fold cross-validation for the problem of selecting the best estimator among the collection 𝔽={f^1,,f^Dmax}{\mathbb{F}}=\{\widehat{f}_{1},\ldots,\widehat{f}_{D_{\max}}\}.

6.2.1 The estimator of ff based on our procedure

We recall that our selection procedure relies on the choices of families SS\SS, SSh\SS_{h} for hHh\in H, a weight function Δ\Delta, a penalty function pen\mathop{\rm pen}\nolimits and two universal constants K>1K>1 and α>0\alpha>0. We choose the family SS\SS defined by (26). We associate to f^h\widehat{f}_{h} the family SSh={Sm^(h)|hH}SS\SS_{h}=\{S_{\widehat{m}(h^{\prime})}|\ h^{\prime}\in H\}\subset\SS where the SmS_{m} are defined in Section 5.1 and m^(h){1,,p}\widehat{m}(h^{\prime})\subset\left\{{1,\ldots,p}\right\} is the set of indices corresponding to the predictors returned by the LARS-Lasso algorithm at step hHh^{\prime}\in H. We take pen(S)=KpenΔ(S)\mathop{\rm pen}\nolimits(S)=K{\mathop{\rm pen}_{\Delta}\nolimits}(S) with Δ(S)\Delta(S) defined by (27) and K=1.1K=1.1. This value of KK is consistent with what is suggested in Baraud et al. (2009). The choice of α\alpha is based on the following considerations. First, choosing α\alpha around one seems reasonable since it gives similar weights to the term YΠSf^λ2\|Y-\Pi_{S}\widehat{f}_{\lambda}\|^{2}, which measures how well the estimator fits the data, and to the approximation term f^λΠSf^λ2\|\widehat{f}_{\lambda}-\Pi_{S}\widehat{f}_{\lambda}\|^{2} involved in our criterion (6). Second, a simple calculation shows that the constant C1=C1(1.1,α)C^{-1}=C^{-1}(1.1,\alpha) involved in Theorem 1 is minimal for α\alpha close to 0.6. We therefore carried out our simulations for α\alpha varying from 0.2 to 1.5. The results being very similar for α\alpha between 0.5 and 1.2, we chose α=0.5\alpha=0.5. We denote by f^penΔ\widehat{f}_{{\mathop{\rm pen}_{\Delta}\nolimits}} the resulting estimator of ff.
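The following R sketch illustrates how this selection could be implemented; it is not the authors' implementation. The LARS-Lasso path is computed here with the lars package (the simulations above use the elasticnet library, which provides the same path), select_lasso_step and pen_Delta are hypothetical names (pen_Delta is sketched in Section 8.1), and the weight Δ(S)\Delta(S) of (27) is assumed, for the span of the predictors indexed by mm, to be of the form |m|+logC(p,|m|)|m|+\log C(p,|m|).

library(lars)

## minimal sketch: choose the number of LARS-Lasso steps by minimizing
## crit_alpha(hat f_h) = min over S in SS_h of
##   ||Y - Pi_S hat f_h||^2 + alpha ||hat f_h - Pi_S hat f_h||^2 + K pen_Delta(S) sigma_S^2
select_lasso_step <- function(X, Y, Dmax, K = 1.1, alpha = 0.5) {
  n <- nrow(X); p <- ncol(X)
  path <- lars(X, Y, type = "lasso", max.steps = Dmax)   # assumes the path has at least Dmax steps
  fhat <- predict(path, X, s = 1:Dmax, mode = "step")$fit # fitted values hat f_1, ..., hat f_Dmax
  supports <- lapply(1:Dmax, function(h) which(coef(path, s = h, mode = "step") != 0))
  projections <- lapply(supports, function(m) {           # orthogonal projector onto S_m
    Xm <- X[, m, drop = FALSE]                            # assumes Xm has full column rank
    Xm %*% qr.solve(crossprod(Xm), t(Xm))
  })
  crit <- sapply(1:Dmax, function(h) {
    min(mapply(function(m, P) {
      sigma2_S <- sum((Y - P %*% Y)^2) / (n - length(m))
      pen <- K * pen_Delta(length(m), n - length(m), length(m) + lchoose(p, length(m)))
      sum((Y - P %*% fhat[, h])^2) + alpha * sum((fhat[, h] - P %*% fhat[, h])^2) + pen * sigma2_S
    }, supports, projections))
  })
  which.min(crit)                                          # selected number of LARS-Lasso steps
}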

6.2.2 The estimator of ff based on VV-fold cross-validation

For each hHh\in H, the prediction error is estimated using a VV-fold cross-validation procedure, with V=n/10V=n/10. The estimator f^CV\widehat{f}_{CV} is chosen by minimizing the estimated prediction error.
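For comparison, a minimal R sketch (not the authors' code) of this VV-fold cross-validation benchmark is given below; cv_lasso_step is a hypothetical name, the lars package is used as in the previous sketch, and the path run on each training fold is assumed to have at least DmaxD_{\max} steps.

## minimal sketch: V-fold cross-validation over the number of LARS-Lasso steps
cv_lasso_step <- function(X, Y, Dmax, V = round(nrow(X) / 10)) {
  n <- nrow(X)
  folds <- sample(rep(1:V, length.out = n))
  errs <- matrix(NA, V, Dmax)
  for (v in 1:V) {
    test <- which(folds == v)
    path <- lars(X[-test, , drop = FALSE], Y[-test], type = "lasso", max.steps = Dmax)
    pred <- predict(path, X[test, , drop = FALSE], s = 1:Dmax, mode = "step")$fit
    errs[v, ] <- colMeans((Y[test] - pred)^2)
  }
  which.min(colMeans(errs))   # number of steps minimizing the estimated prediction error
}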

6.2.3 The results

The simulations were carried out with R (www.r-project.org) using the library elasticnet.

For each example ex{\mathrm{ex}}\in{\mathcal{E}}, we estimate on the basis of 400 simulations the oracle risk

Oex=𝔼(minhHff^h2),O_{{\mathrm{ex}}}={\mathbb{E}}\left(\min_{h\in H}\|f-\widehat{f}_{h}\|^{2}\right), (29)

and the Euclidean risks Rex(f^penΔ)R_{{\mathrm{ex}}}(\widehat{f}_{{\mathop{\rm pen}_{\Delta}\nolimits}}) and Rex(f^CV)R_{{\mathrm{ex}}}(\widehat{f}_{CV}) of f^penΔ\widehat{f}_{{\mathop{\rm pen}_{\Delta}\nolimits}} and f^CV\widehat{f}_{CV} respectively.

The results presented in Table 1 show that our procedure tends to choose a better estimator than the CV in the sense that the ratios Rex(f^penΔ)/OexR_{{\mathrm{ex}}}(\widehat{f}_{{\mathop{\rm pen}_{\Delta}\nolimits}})/O_{{\mathrm{ex}}} are closer to one than Rex(f^CV)/OexR_{{\mathrm{ex}}}(\widehat{f}_{CV})/O_{{\mathrm{ex}}}.

procedure mean std-err quantiles: 0%0\% 50%50\% 75%75\% 99%99\% 100%100\%
CV 1.18 0.08 1.05 1.18 1.24 1.36 1.38
penΔ{\mathop{\rm pen}_{\Delta}\nolimits} 1.065 0.06 1.01 1.055 1.084 1.18 2.27
Table 1: Mean, standard-error and quantiles of the ratios Rex/OexR_{{\mathrm{ex}}}/O_{{\mathrm{ex}}} calculated over all ex{\mathrm{ex}}\in{\mathcal{E}} such that Oex<nσ2/3O_{{\mathrm{ex}}}<n\sigma^{2}/3. The number of such examples equals 654, see Section 8.2.

Nevertheless, for a few examples these ratios are larger for our procedure than for the CV. These examples correspond to situations where the Lasso estimators are highly biased.

In practice, it is worth considering several estimation procedures in order to increase the chance of having good estimators of ff among the family 𝔽{\mathbb{F}}. Selecting among candidate procedures is the purpose of the following simulation experiment, carried out in the variable selection context.

6.3 Variable selection

In this section, we consider the problem of variable selection and use the procedure and notations introduced in Section 5. To solve this problem, we consider estimators of the form f^m^=ΠSm^Y\widehat{f}_{\widehat{m}}=\Pi_{S_{\widehat{m}}}Y where m^\widehat{m} is a random subset of {1,,p}\{1,\ldots,p\} depending on YY. Given a family ^={m^(,h),(,h)×H}\widehat{\mathcal{M}}=\{\widehat{m}(\ell,h),\ (\ell,h)\in{\mathcal{L}}\times H_{\ell}\} of such random sets, we consider the family 𝔽={f^m^(,h)|(,h)×H}{\mathbb{F}}=\{\widehat{f}_{\widehat{m}(\ell,h)}|\ (\ell,h)\in{\mathcal{L}}\times H_{\ell}\}. The descriptions of {\mathcal{L}} and HH_{\ell} are postponed to Section 8.3. Let us merely mention that {\mathcal{L}} gathers variable selection procedures based on the Lasso, ridge regression, Elastic net, PLS1 regression, Adaptive Lasso, Random Forest, and on an exhaustive search among the subsets of {1,,p}\left\{{1,\ldots,p}\right\} with small cardinality. For each procedure \ell, the parameter set HH_{\ell} corresponds to different choices of tuning parameters. For each λ=(,h)×H\lambda=(\ell,h)\in{\mathcal{L}}\times H_{\ell}, we take SSλ={Sm^(,h)}\SS_{\lambda}=\{S_{\widehat{m}(\ell,h)}\} so that our selection rule over 𝔽{\mathbb{F}} amounts to minimizing over ^\widehat{\mathcal{M}}

crit(m)=YΠSmY2+KpenΔ(Sm)σ^Sm2,{\rm crit}(m)=\|Y-\Pi_{S_{m}}Y\|^{2}+K\mathop{\rm pen}\nolimits_{\Delta}(S_{m})\widehat{\sigma}_{S_{m}}^{2}, (30)

where penΔ\mathop{\rm pen}\nolimits_{\Delta} is given by (3).
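A minimal R sketch (not the authors' code) of Criterion (30) for a candidate subset mm is given below; crit_subset is a hypothetical name, pen_Delta is the helper sketched in Section 8.1, and the weight Δ(Sm)\Delta(S_{m}) is again assumed to be of the form |m|+logC(p,|m|)|m|+\log C(p,|m|).

## minimal sketch: Criterion (30) for a candidate subset m of {1,...,p}
crit_subset <- function(m, X, Y, K = 1.1) {
  n <- nrow(X); p <- ncol(X)
  rss <- sum(lm.fit(X[, m, drop = FALSE], Y)$residuals^2)   # ||Y - Pi_{S_m} Y||^2
  sigma2_m <- rss / (n - length(m))
  rss + K * pen_Delta(length(m), n - length(m), length(m) + lchoose(p, length(m))) * sigma2_m
}

## the selected subset minimizes crit_subset over the candidate family hat M, e.g.
## hatM <- list(...)   # list of candidate subsets built as in Section 8.3
## best <- hatM[[which.min(sapply(hatM, crit_subset, X = X, Y = Y))]]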

6.3.1 Results

The simulations were carried out with R (www.r-project.org) using the libraries elasticnet, randomForest, pls and the program lm.ridge in the library MASS. We first select the tuning parameters associated to the procedures \ell in {\mathcal{L}}. More precisely, for each \ell we select an estimator among the collection 𝔽={f^m^(,h)|hH}{\mathbb{F}}_{\ell}=\{\widehat{f}_{\widehat{m}(\ell,h)}|\ h\in H_{\ell}\} by minimizing Criterion (30) over ^={m^(,h)|hH}\widehat{\mathcal{M}}_{\ell}=\left\{{\widehat{m}(\ell,h)}|h\in H_{\ell}\right\}. We denote by m^()\widehat{m}(\ell) the selected set and by f^m^()\widehat{f}_{\widehat{m}(\ell)} the corresponding projection estimator. For each example ex{\mathrm{ex}}\in{\mathcal{E}} and each method \ell\in{\mathcal{L}}, we estimate the risk

Rex,=𝔼(ff^m^()2)R_{{\mathrm{ex}},\ell}={\mathbb{E}}\left(\|f-\widehat{f}_{\widehat{m}(\ell)}\|^{2}\right)

of f^m^()\widehat{f}_{\widehat{m}(\ell)} on the basis of 400 simulations and we do the same to calculate that of our estimator f^m^\widehat{f}_{\widehat{m}},

Rex,all=𝔼(ff^m^2).R_{{\mathrm{ex}},\mbox{all}}={\mathbb{E}}\left(\|f-\widehat{f}_{\widehat{m}}\|^{2}\right).

Let us now define the minimum of these risks over all methods:

Rex,min=min{Rex,all,Rex,,}.R_{{\mathrm{ex}},\min}=\min\left\{R_{{\mathrm{ex}},\mathrm{all}},R_{{\mathrm{ex}},\ell},\ell\in{\mathcal{L}}\right\}.

We compare the ratios Rex,/Rex,minR_{{\mathrm{ex}},\ell}/R_{{\mathrm{ex}},\min} for {all}\ell\in{\mathcal{L}}\cup\{{\rm all}\} to judge the performances of the candidate procedures on each example ex{\mathrm{ex}}\in{\mathcal{E}}. The mean, standard deviations and quantiles of the sequence {Rex,/Rex,min,ex}\{R_{{\mathrm{ex}},\ell}/R_{{\mathrm{ex}},\min},\ {\mathrm{ex}}\in{\mathcal{E}}\} are presented in Table 2. In particular, the results show that

  • none of the procedures \ell in {\mathcal{L}} outperforms all the others simultaneously over all examples;

  • our procedure, corresponding to =all\ell={\rm all}, achieves the smallest mean value, which is moreover very close to one;

  • the variability of our procedure is small compared to that of the other procedures;

  • for all examples, our procedure selects an estimator whose risk does not exceed twice the minimal risk Rex,minR_{{\mathrm{ex}},\min}.

method mean std-err quantiles: 50%50\% 75%75\% 95%95\% 100%100\%
Lasso 2.82 9.40 1.12 1.33 6.38 127
ridge 1.76 1.90 1.42 1.82 2.87 36.9
pls 1.50 1.20 1.22 1.50 2.58 17
en 1.46 1.90 1.12 1.33 2.57 29
ALridge 1.20 0.31 1.15 1.26 1.51 5.78
ALpls 1.29 0.87 1.14 1.29 1.75 12.7
rFmse 4.13 9.50 1.38 2.04 19.2 118
rFpurity 3.99 10.00 1.42 2.06 15.1 138
exhaustive 22.9 45 6.30 24.5 92.9 430
all 1.16 0.16 1.12 1.25 1.47 1.95
Table 2: For each {all}\ell\in{\mathcal{L}}\cup\left\{{{\rm all}}\right\}, mean, standard-error and quantiles of the ratios Rex,/Rex,minR_{{\mathrm{ex}},\ell}/R_{{\mathrm{ex}},\min} calculated over all ex{\mathrm{ex}}\in{\mathcal{E}}. The number of examples in the collection {\mathcal{E}} is equal to 660.

The false discovery rate (FDR) and the true discovery rate (TDR) are also quantities of interest in the context of variable selection. They are given in Table 3 for each example when ρ=10\rho=10 and n=p=100n=p=100. Except for one example, the FDR is small, while the TDR varies considerably across the examples.

1{\mathcal{E}}_{1} 2{\mathcal{E}}_{2} 3{\mathcal{E}}_{3} 4{\mathcal{E}}_{4} 5{\mathcal{E}}_{5} 6{\mathcal{E}}_{6} 7{\mathcal{E}}_{7} 8{\mathcal{E}}_{8} 9{\mathcal{E}}_{9} 10{\mathcal{E}}_{10} 11{\mathcal{E}}_{11}
FDR 0.045 0.026 0.004 0.026 0.018 0.041 0.012 0.026 0.042 0.15 0.014
TDR 0.74 0.63 0.18 0.63 0.17 0.99 1 1 0.98 0.29 0.20
Table 3:   False discovery rate (FDR) and true discovery rate (TDR) using our method, for each example with ρ=10\rho=10 and n=p=100n=p=100.

7 Proofs

7.1 Proof of Theorem 1

Throughout this section, we use the following notations. For all λΛ\lambda\in\Lambda and SSSλS\in\SS_{\lambda}, we write

critα(f^λ,S)=YΠSf^λ2+σ2𝔭𝔢𝔫(S)+αf^λΠSf^λ2,{\rm crit}_{\alpha}(\widehat{f}_{\lambda},S)=\left\|{Y-\Pi_{S}\widehat{f}_{\lambda}}\right\|^{2}+\sigma^{2}\mathop{\mathfrak{pen}}\nolimits(S)+\alpha\left\|{\widehat{f}_{\lambda}-\Pi_{S}\widehat{f}_{\lambda}}\right\|^{2},

where

𝔭𝔢𝔫(S)=pen(S)σ^S2/σ2,for all SSS.\mathop{\mathfrak{pen}}\nolimits(S)=\mathop{\rm pen}\nolimits(S)\,\widehat{\sigma}^{2}_{S}/\sigma^{2},\ \ \textrm{for all }S\in\SS. (31)

For all λΛ\lambda\in\Lambda, let S(λ)SSλS(\lambda)\in\SS_{\lambda} be such that

critα(f^λ,S(λ))critα(f^λ)+δ.{\rm crit}_{\alpha}(\widehat{f}_{\lambda},S(\lambda))\leq{\rm crit}_{\alpha}(\widehat{f}_{\lambda})+\delta.

We also write ε=Yf{\varepsilon}=Y-f and S¯\overline{S} for the linear space generated by SS and ff. It follows that for all λΛ\lambda\in\Lambda and SSSλS\in\SS_{\lambda}

critα(f^λ^,S(λ^))critα(f^λ^)+δcritα(f^λ)+2δcritα(f^λ,S)+2δ{\rm crit}_{\alpha}(\widehat{f}_{\widehat{\lambda}},S(\widehat{\lambda}))\leq{\rm crit}_{\alpha}(\widehat{f}_{\widehat{\lambda}})+\delta\leq{\rm crit}_{\alpha}(\widehat{f}_{\lambda})+2\delta\leq{\rm crit}_{\alpha}(\widehat{f}_{\lambda},S){+2\delta}

and, by simple algebra, that

fΠS(λ^)f^λ^2+αf^λ^ΠS(λ^)f^λ^2\displaystyle\left\|{f-\Pi_{S(\widehat{\lambda})}\widehat{f}_{\widehat{\lambda}}}\right\|^{2}+\alpha\left\|{\widehat{f}_{\widehat{\lambda}}-\Pi_{S(\widehat{\lambda})}\widehat{f}_{\widehat{\lambda}}}\right\|^{2}
\displaystyle\leq fΠSf^λ2+αf^λΠSf^λ2+2σ2𝔭𝔢𝔫(S)+2δ\displaystyle\left\|{f-\Pi_{S}\widehat{f}_{\lambda}}\right\|^{2}+\alpha\left\|{\widehat{f}_{\lambda}-\Pi_{S}\widehat{f}_{\lambda}}\right\|^{2}+2\sigma^{2}\mathop{\mathfrak{pen}}\nolimits(S)+2\delta
+ 2ε,ΠS(λ^)f^λ^fσ2𝔭𝔢𝔫(S(λ^))+ 2ε,fΠSf^λσ2𝔭𝔢𝔫(S).\displaystyle\ +\ 2{\langle}{\varepsilon},\Pi_{S(\widehat{\lambda})}\widehat{f}_{\widehat{\lambda}}-f{\rangle}-\sigma^{2}\mathop{\mathfrak{pen}}\nolimits(S(\widehat{\lambda}))\ +\ 2{\langle}{\varepsilon},f-\Pi_{S}\widehat{f}_{\lambda}{\rangle}-\sigma^{2}\mathop{\mathfrak{pen}}\nolimits(S).

For λΛ\lambda\in\Lambda and SSSS\in\SS, let us set uλ,S=(ΠSf^λf)/ΠSf^λfu_{\lambda,S}=\left({\Pi_{S}\widehat{f}_{\lambda}-f}\right)/\left\|{\Pi_{S}\widehat{f}_{\lambda}-f}\right\| if ΠSf^λf\Pi_{S}\widehat{f}_{\lambda}\neq f and uλ,S=0u_{\lambda,S}=0 otherwise. For all λ\lambda and SS, we have uλ,SS¯u_{\lambda,S}\in\overline{S} and

fΠS(λ^)f^λ^2+αf^λ^ΠS(λ^)f^λ^2\displaystyle\left\|{f-\Pi_{S(\widehat{\lambda})}\widehat{f}_{\widehat{\lambda}}}\right\|^{2}+\alpha\left\|{\widehat{f}_{\widehat{\lambda}}-\Pi_{S(\widehat{\lambda})}\widehat{f}_{\widehat{\lambda}}}\right\|^{2}
\displaystyle\leq fΠSf^λ2+αf^λΠSf^λ2+2σ2𝔭𝔢𝔫(S)+2δ\displaystyle\left\|{f-\Pi_{S}\widehat{f}_{\lambda}}\right\|^{2}+\alpha\left\|{\widehat{f}_{\lambda}-\Pi_{S}\widehat{f}_{\lambda}}\right\|^{2}+2\sigma^{2}\mathop{\mathfrak{pen}}\nolimits(S){+2\delta}
+ 2|ε,uλ^,S(λ^)|ΠS(λ^)f^λ^fσ2𝔭𝔢𝔫(S(λ^))\displaystyle+\ 2\left|{{\langle}{\varepsilon},u_{\widehat{\lambda},S(\widehat{\lambda})}{\rangle}}\right|\left\|{\Pi_{S(\widehat{\lambda})}\widehat{f}_{\widehat{\lambda}}-f}\right\|-\sigma^{2}\mathop{\mathfrak{pen}}\nolimits(S(\widehat{\lambda}))
+ 2|ε,uλ,S|ΠSf^λfσ2𝔭𝔢𝔫(S)\displaystyle+\ 2\left|{{\langle}{\varepsilon},u_{\lambda,S}{\rangle}}\right|\left\|{\Pi_{S}\widehat{f}_{\lambda}-f}\right\|-\sigma^{2}\mathop{\mathfrak{pen}}\nolimits(S)
\displaystyle\leq fΠSf^λ2+αf^λΠSf^λ2+2σ2𝔭𝔢𝔫(S)+2δ\displaystyle\left\|{f-\Pi_{S}\widehat{f}_{\lambda}}\right\|^{2}+\alpha\left\|{\widehat{f}_{\lambda}-\Pi_{S}\widehat{f}_{\lambda}}\right\|^{2}+2\sigma^{2}\mathop{\mathfrak{pen}}\nolimits(S){+2\delta}
+K1fΠS(λ^)f^λ^2+KΠS¯(λ^)ε2σ2𝔭𝔢𝔫(S(λ^))\displaystyle+\ K^{-1}\left\|{f-\Pi_{S(\widehat{\lambda})}\widehat{f}_{\widehat{\lambda}}}\right\|^{2}+K\left\|{\Pi_{\bar{S}(\widehat{\lambda})}{\varepsilon}}\right\|^{2}-\sigma^{2}\mathop{\mathfrak{pen}}\nolimits(S(\widehat{\lambda}))
+K1fΠSf^λ2+KΠS¯ε2σ2𝔭𝔢𝔫(S)\displaystyle+\ K^{-1}\left\|{f-\Pi_{S}\widehat{f}_{\lambda}}\right\|^{2}+K\left\|{\Pi_{\bar{S}}{\varepsilon}}\right\|^{2}-\sigma^{2}\mathop{\mathfrak{pen}}\nolimits(S)

Hence, by using (5) and (31) we get

(1K1)fΠS(λ^)f^λ^2+αf^λ^ΠS(λ^)f^λ^2\displaystyle(1-K^{-1})\left\|{f-\Pi_{S(\widehat{\lambda})}\widehat{f}_{\widehat{\lambda}}}\right\|^{2}+\alpha\left\|{\widehat{f}_{\widehat{\lambda}}-\Pi_{S(\widehat{\lambda})}\widehat{f}_{\widehat{\lambda}}}\right\|^{2} (32)
\displaystyle\leq (1+K1)fΠSf^λ2+αf^λΠSf^λ2+2σ2𝔭𝔢𝔫(S)+Σ~+2δ\displaystyle(1+K^{-1})\left\|{f-\Pi_{S}\widehat{f}_{\lambda}}\right\|^{2}+\alpha\left\|{\widehat{f}_{\lambda}-\Pi_{S}\widehat{f}_{\lambda}}\right\|^{2}+2\sigma^{2}\mathop{\mathfrak{pen}}\nolimits(S)+\tilde{\Sigma}{+2\delta}
\displaystyle\leq 2(1+K1)ff^λ2+2δ\displaystyle 2(1+K^{-1})\left\|{f-\widehat{f}_{\lambda}}\right\|^{2}{+2\delta}
+(α+2(1+K1))f^λΠSf^λ2+2σ2𝔭𝔢𝔫(S)+Σ~\displaystyle+\left({\alpha+2(1+K^{-1})}\right)\left\|{\widehat{f}_{\lambda}-\Pi_{S}\widehat{f}_{\lambda}}\right\|^{2}+2\sigma^{2}\mathop{\mathfrak{pen}}\nolimits(S)+\tilde{\Sigma}

where

Σ~=2KSSS(ΠS¯ε2penΔ(S)ndim(S)YΠS¯Y2)+.\tilde{\Sigma}=2K\sum_{S\in\SS}\left({\left\|{\Pi_{\overline{S}}{\varepsilon}}\right\|^{2}-{{\mathop{\rm pen}_{\Delta}\nolimits}(S)\over n-\dim(S)}\left\|{Y-\Pi_{\overline{S}}Y}\right\|^{2}}\right)_{+}.

For each SSSS\in\SS,

YΠSY2ndim(S)YΠS¯Y2ndim(S){\left\|{Y-\Pi_{S}Y}\right\|^{2}\over n-\dim(S)}\geq{\left\|{Y-\Pi_{\overline{S}}Y}\right\|^{2}\over n-\dim(S)}

and since the variable YΠS¯Y2\left\|{Y-\Pi_{\overline{S}}Y}\right\|^{2} is independent of ΠS¯ε2\left\|{\Pi_{\overline{S}}{\varepsilon}}\right\|^{2} and is stochastically larger than εΠS¯ε2\left\|{{\varepsilon}-\Pi_{\overline{S}}{\varepsilon}}\right\|^{2}, we deduce from the definition of penΔ(S){\mathop{\rm pen}_{\Delta}\nolimits}(S) and (2), that on the one hand 𝔼(Σ~)2Kσ2Σ{\mathbb{E}}(\tilde{\Sigma})\leq 2K\sigma^{2}\Sigma.

On the other hand, since SS is arbitrary among SSλ\SS_{\lambda} and since

(1α+11K1)1ff^λ^2(1K1)fΠS(λ^)f^λ^2+αf^λ^ΠS(λ^)f^λ^2\left({{1\over\alpha}+{1\over 1-K^{-1}}}\right)^{-1}\left\|{f-\widehat{f}_{\widehat{\lambda}}}\right\|^{2}\leq(1-K^{-1})\left\|{f-\Pi_{S(\widehat{\lambda})}\widehat{f}_{\widehat{\lambda}}}\right\|^{2}+\alpha\left\|{\widehat{f}_{\widehat{\lambda}}-\Pi_{S(\widehat{\lambda})}\widehat{f}_{\widehat{\lambda}}}\right\|^{2}

we deduce from (32) that for all λΛ\lambda\in\Lambda,

ff^λ^2C1[ff^λ2+A(f^λ,SSλ)+Σ~+δ]\left\|{f-\widehat{f}_{\widehat{\lambda}}}\right\|^{2}\leq C^{-1}\left[{\left\|{f-\widehat{f}_{\lambda}}\right\|^{2}+A(\widehat{f}_{\lambda},\SS_{\lambda})+\tilde{\Sigma}+\delta}\right] (33)

with

C1=C1(K,α)=(1+αK1)(α+2(1+K1))α(1K1),C^{-1}=C^{-1}(K,\alpha)={\left({1+\alpha-K^{-1}}\right)\left({\alpha+2(1+K^{-1})}\right)\over\alpha(1-K^{-1})}, (34)

and (11) follows by taking the expectation on both sides of (33). Note that provided that

infλΛ[ff^λ2+A(f^λ,SSλ)]\inf_{\lambda\in\Lambda}\left[{\left\|{f-\widehat{f}_{\lambda}}\right\|^{2}+A(\widehat{f}_{\lambda},\SS_{\lambda})}\right]

is measurable, we have actually proved the stronger inequality

C𝔼[ff^λ^2]𝔼[infλΛ{ff^λ2+A(f^λ,SSλ)}]+σ2Σ+δ.C{\mathbb{E}}\left[{\left\|{f-\widehat{f}_{\widehat{\lambda}}}\right\|^{2}}\right]\leq{\mathbb{E}}\left[{\inf_{\lambda\in\Lambda}\left\{{\left\|{f-\widehat{f}_{\lambda}}\right\|^{2}+A(\widehat{f}_{\lambda},\SS_{\lambda})}\right\}}\right]+\sigma^{2}\Sigma+\delta. (35)

Let us now turn to the second part of the Theorem and fix some λΛ\lambda\in\Lambda. Since equality holds in (5), Assumption 3 and (4) yield

pen(S)=KpenΔ(S)C(κ,K)(dim(S)Δ(S)),SSS.\mathop{\rm pen}\nolimits(S)=K{\mathop{\rm pen}_{\Delta}\nolimits}(S)\leq C(\kappa,K)(\dim(S)\vee\Delta(S)),\ \ \forall S\in\SS.

If SSλ\SS_{\lambda} is non-random, for some C=C(κ,K)>0C^{\prime}=C^{\prime}(\kappa,K)>0 and all SSSλS\in\SS_{\lambda},

C𝔼[A(f^λ,SSλ)]\displaystyle C^{\prime}{\mathbb{E}}\left[{A(\widehat{f}_{\lambda},\SS_{\lambda})}\right]
\displaystyle\leq 𝔼[f^λΠSf^λ2]+(dim(S)Δ(S))𝔼[σ^S2],\displaystyle{\mathbb{E}}\left[{\left\|{\widehat{f}_{\lambda}-\Pi_{S}\widehat{f}_{\lambda}}\right\|^{2}}\right]+(\dim(S)\vee\Delta(S)){\mathbb{E}}\left[{\widehat{\sigma}_{S}^{2}}\right],
=\displaystyle= 𝔼[f^λΠSf^λ2]+dim(S)Δ(S)ndim(S)[fΠSf2+(ndim(S))σ2].\displaystyle{\mathbb{E}}\left[{\left\|{\widehat{f}_{\lambda}-\Pi_{S}\widehat{f}_{\lambda}}\right\|^{2}}\right]+{\dim(S)\vee\Delta(S)\over n-\dim(S)}\left[{\left\|{f-\Pi_{S}f}\right\|^{2}+(n-\dim(S))\sigma^{2}}\right].

Since fΠSf2fΠSf^λ2\left\|{f-\Pi_{S}f}\right\|^{2}\leq\left\|{f-\Pi_{S}\widehat{f}_{\lambda}}\right\|^{2}, we have

fΠSf2𝔼[fΠSf^λ2]2𝔼[ff^λ2]+2𝔼[f^λΠSf^λ2],\left\|{f-\Pi_{S}f}\right\|^{2}\leq{\mathbb{E}}\left[{\left\|{f-\Pi_{S}\widehat{f}_{\lambda}}\right\|^{2}}\right]\leq 2{\mathbb{E}}\left[{\left\|{f-\widehat{f}_{\lambda}}\right\|^{2}}\right]+2{\mathbb{E}}\left[{\left\|{\widehat{f}_{\lambda}-\Pi_{S}\widehat{f}_{\lambda}}\right\|^{2}}\right],

and under Assumption 3, (dim(S)Δ(S))/(ndim(S))κ(1κ)1(\dim(S)\vee\Delta(S))/(n-\dim(S))\leq\kappa(1-\kappa)^{-1}, and hence for all SSSλS\in\SS_{\lambda}

C𝔼[A(f^λ,SSλ)]\displaystyle C^{\prime}{\mathbb{E}}\left[{A(\widehat{f}_{\lambda},\SS_{\lambda})}\right] \displaystyle\leq (1+2κ1κ)𝔼[ff^λ2]+2κ1κ𝔼[f^λΠSf^λ2]\displaystyle\left({1+{2\kappa\over 1-\kappa}}\right){\mathbb{E}}\left[{\left\|{f-\widehat{f}_{\lambda}}\right\|^{2}}\right]+{2\kappa\over 1-\kappa}{\mathbb{E}}\left[{\left\|{\widehat{f}_{\lambda}-\Pi_{S}\widehat{f}_{\lambda}}\right\|^{2}}\right]
+(dim(S)Δ(S))σ2.\displaystyle\ \ +\ \ \left({\dim(S)\vee\Delta(S)}\right)\sigma^{2}.

This leads to (12).

Let us turn to the proof of (13). We set σ^λ2=σ^S^λ2\widehat{\sigma}_{\lambda}^{2}=\widehat{\sigma}^{2}_{\widehat{S}_{\lambda}}. Since with probability one f^λS^λSSλ\widehat{f}_{\lambda}\in\widehat{S}_{\lambda}\in\SS_{\lambda},

𝔼[A(f^λ,SSλ)]𝔼[pen(S^λ)σ^λ2]{\mathbb{E}}\left[{A(\widehat{f}_{\lambda},\SS_{\lambda})}\right]\leq{\mathbb{E}}\left[{\mathop{\rm pen}\nolimits(\widehat{S}_{\lambda})\widehat{\sigma}_{\lambda}^{2}}\right]

and it suffices thus to bound the right-hand side. Since equality holds in (5) and since f^λS^λ\widehat{f}_{\lambda}\in\widehat{S}_{\lambda}

pen(S^λ)σ^λ2\displaystyle\mathop{\rm pen}\nolimits(\widehat{S}_{\lambda})\,\widehat{\sigma}^{2}_{\lambda} =\displaystyle= KpenΔ(S^λ)ndim(S^λ)YΠS^λY2\displaystyle K{{\mathop{\rm pen}_{\Delta}\nolimits}(\widehat{S}_{\lambda})\over n-\dim(\widehat{S}_{\lambda})}\left\|{Y-\Pi_{\widehat{S}_{\lambda}}Y}\right\|^{2}
\displaystyle\leq KpenΔ(S^λ)ndim(S^λ)Yf^λ2=KpenΔ(S^λ)ndim(S^λ)f+εf^λ2\displaystyle K{{\mathop{\rm pen}_{\Delta}\nolimits}(\widehat{S}_{\lambda})\over n-\dim(\widehat{S}_{\lambda})}\left\|{Y-\widehat{f}_{\lambda}}\right\|^{2}=K{{\mathop{\rm pen}_{\Delta}\nolimits}(\widehat{S}_{\lambda})\over n-\dim(\widehat{S}_{\lambda})}\left\|{f+{\varepsilon}-\widehat{f}_{\lambda}}\right\|^{2}
\displaystyle\leq 2KpenΔ(S^λ)ndim(S^λ)[ff^λ2+ε2]\displaystyle 2K{{\mathop{\rm pen}_{\Delta}\nolimits}(\widehat{S}_{\lambda})\over n-\dim(\widehat{S}_{\lambda})}\left[{\left\|{f-\widehat{f}_{\lambda}}\right\|^{2}+\left\|{{\varepsilon}}\right\|^{2}}\right]
\displaystyle\leq 2KpenΔ(S^λ)ndim(S^λ)[ff^λ2+(ε22nσ2)++2nσ2].\displaystyle 2K{{\mathop{\rm pen}_{\Delta}\nolimits}(\widehat{S}_{\lambda})\over n-\dim(\widehat{S}_{\lambda})}\left[{\left\|{f-\widehat{f}_{\lambda}}\right\|^{2}+(\left\|{{\varepsilon}}\right\|^{2}-2n\sigma^{2})_{+}+2n\sigma^{2}}\right].

Under Assumption 3, 1Δ(S^λ)dim(S^λ)κn1\leq\Delta(\widehat{S}_{\lambda})\vee\dim(\widehat{S}_{\lambda})\leq\kappa n and we deduce from (4) that for some constant CC depending only on KK and κ\kappa

Cpen(S^λ)σ^λ2\displaystyle C\mathop{\rm pen}\nolimits(\widehat{S}_{\lambda})\,\widehat{\sigma}^{2}_{\lambda} \displaystyle\leq ff^λ2+(dim(S^λ)Δ(S^λ))σ2+(ε22nσ2)+,\displaystyle\left\|{f-\widehat{f}_{\lambda}}\right\|^{2}+\left({\dim(\widehat{S}_{\lambda})\vee\Delta(\widehat{S}_{\lambda})}\right)\sigma^{2}+(\left\|{{\varepsilon}}\right\|^{2}-2n\sigma^{2})_{+},

and the result follows from the fact that 𝔼[(ε22nσ2)+]3σ2{\mathbb{E}}[(\left\|{{\varepsilon}}\right\|^{2}-2n\sigma^{2})_{+}]\leq 3\sigma^{2} for all nn.

7.2 Proof of Proposition 1

For all λΛ\lambda\in\Lambda and fSf\in\ S, ff^λΠSf^λf^λ\left\|{f-\widehat{f}_{\lambda}}\right\|\geq\left\|{\Pi_{S}\widehat{f}_{\lambda}-\widehat{f}_{\lambda}}\right\| and hence,

ff^λ~2infλΛff^λ212infλΛ[ff^λ2+ΠSf^λf^λ2].\left\|{f-\widehat{f}_{\widetilde{\lambda}}}\right\|^{2}\geq\inf_{\lambda\in\Lambda}\left\|{f-\widehat{f}_{\lambda}}\right\|^{2}\geq{1\over 2}\inf_{\lambda\in\Lambda}\left[{\left\|{f-\widehat{f}_{\lambda}}\right\|^{2}+\left\|{\Pi_{S}\widehat{f}_{\lambda}-\widehat{f}_{\lambda}}\right\|^{2}}\right].

Besides, since the minimax rate of estimation over SS is of order dim(S)σ2\dim(S)\sigma^{2}, for some universal constant CC,

CsupfS𝔼[ff^λ~2]dim(S)σ2.C\sup_{f\in S}{\mathbb{E}}\left[{\left\|{f-\widehat{f}_{\widetilde{\lambda}}}\right\|^{2}}\right]\geq\dim(S)\sigma^{2}.

Putting these bounds together leads to the result.

7.3 Proof of Proposition 2

Under (24), it is not difficult to see that d(n,M)=n/(2log(eM))2d(n,M)=n/(2\log(eM))\geq 2 so that SS\SS is not empty and since for all SmSSCvS_{m}\in\SS_{{\mathrm{Cv}}}

(\dim(S_{m})\vee 1)\leq\Delta(S_{m})=|m|+\log\binom{M}{|m|}\leq|m|(1+\log M)\leq{n\over 2},

Assumptions 1 to 4 are satisfied with κ=1/2\kappa=1/2. Besides, the set ΛCv\Lambda_{{\mathrm{Cv}}} being compact, λcritα(fλ)\lambda\mapsto{\rm crit}_{\alpha}(f_{\lambda}) admits a minimum over ΛCv\Lambda_{{\mathrm{Cv}}} (we shall come back to the minimization of this criterion at the end of the subsection) and hence we can take δ=0\delta=0. By applying Theorem 1 and using (12), the resulting estimator f^Cv=f^λ^\widehat{f}_{{\mathrm{Cv}}}=\widehat{f}_{\widehat{\lambda}} satisfies for some universal constant C>0C>0

C𝔼[ff^Cv2]infg𝔽Λ{fg2+A¯(g,SS)},C{\mathbb{E}}\left[{\left\|{f-\widehat{f}_{{\mathrm{Cv}}}}\right\|^{2}}\right]\leq\inf_{g\in{\mathbb{F}}_{\Lambda}}\left\{{\left\|{f-g}\right\|^{2}+\overline{A}(g,\SS)}\right\}, (36)

where

A¯(g,SS)=infSSS[gΠSg2+(dim(S)Δ(S))σ2].\overline{A}(g,\SS)=\inf_{S\in\SS}\left[{\left\|{g-\Pi_{S}g}\right\|^{2}+\left({\dim(S)\vee\Delta(S)}\right)\sigma^{2}}\right]. (37)

We bound A¯(g,SS)\overline{A}(g,\SS) from above by using the following approximation result, the proof of which can be found in Makovoz (1996) (more precisely, we refer to the proof of his Theorem 2).

Lemma 1.

For all gg in the convex hull 𝔽Λ{\mathbb{F}}_{\Lambda} of the ϕj\phi_{j} and all D1D\geq 1, there exists m{1,,M}m\subset\left\{{1,\ldots,M}\right\} such that |m|=(2D)M|m|=(2D)\wedge M and

gΠSmg24D1supj=1,,Mϕj2.\left\|{g-\Pi_{S_{m}}g}\right\|^{2}\leq 4D^{-1}\sup_{j=1,\ldots,M}\left\|{\phi_{j}}\right\|^{2}.

By using this lemma and the fact that \log\binom{M}{D}\leq D\log(eM/D) for all D{1,,M}D\in\{1,\ldots,M\}, we get

\overline{A}(g,\SS)\leq\inf_{1\leq D\leq d(n,M)/2}\left[{{4nL^{2}\over D}+2D\left({1+\log(eM/(2D))}\right)}\right]\sigma^{2}.

Taking for DD the integer part of

x(n,M,L)=nL2log(eM/nL2)x(n,M,L)=\sqrt{{nL^{2}\over\log(eM/\sqrt{nL^{2}})}}

which belongs to [1,d(n,M)/2][1,d(n,M)/2] under (24), we get

A¯(g,SS)CnL2log(eM/nL2)σ2\overline{A}(g,\SS)\leq C^{\prime}\sqrt{nL^{2}\log(eM/\sqrt{nL^{2}})}\sigma^{2} (38)

for some universal constant C>0C^{\prime}>0 which together with (36) leads to the risk bound

𝔼[ff^Cv2]Cinfg𝔽Λfg2CnL2log(eM/nL2)σ2.{\mathbb{E}}\left[{\left\|{f-\widehat{f}_{{\mathrm{Cv}}}}\right\|^{2}}\right]-C\inf_{g\in{\mathbb{F}}_{\Lambda}}\left\|{f-g}\right\|^{2}\leq C\sqrt{nL^{2}\log(eM/\sqrt{nL^{2}})}\sigma^{2}.

Concerning the computation of f^Cv\widehat{f}_{{\mathrm{Cv}}}, note that

infλΛcritα(fλ)\displaystyle\inf_{\lambda\in\Lambda}{\rm crit}_{\alpha}(f_{\lambda}) =\displaystyle= infλΛinfSSSCv[YΠSfλ2+αfλΠSfλ2+pen(S)σ^S2]\displaystyle\inf_{\lambda\in\Lambda}\inf_{S\in\SS_{{\mathrm{Cv}}}}\left[{\left\|{Y-\Pi_{S}f_{\lambda}}\right\|^{2}+\alpha\left\|{f_{\lambda}-\Pi_{S}f_{\lambda}}\right\|^{2}+\mathop{\rm pen}\nolimits(S)\,\widehat{\sigma}^{2}_{S}}\right]
=\displaystyle= infSSSCv{[infλΛ(YΠSfλ2+αfλΠSfλ2)]+pen(S)σ^S2},\displaystyle\inf_{S\in\SS_{{\mathrm{Cv}}}}\left\{{\left[{\inf_{\lambda\in\Lambda}\left({\left\|{Y-\Pi_{S}f_{\lambda}}\right\|^{2}+\alpha\left\|{f_{\lambda}-\Pi_{S}f_{\lambda}}\right\|^{2}}\right)}\right]+\mathop{\rm pen}\nolimits(S)\,\widehat{\sigma}^{2}_{S}}\right\},

and hence, one can solve the problem of minimizing critα(fλ){\rm crit}_{\alpha}(f_{\lambda}) over λΛ\lambda\in\Lambda by proceeding in two steps. First, for each SS in the finite set SSCv\SS_{{\mathrm{Cv}}}, minimize the convex criterion

critα(S,fλ)=YΠSfλ2+αfλΠSfλ2{\rm crit}_{\alpha}(S,f_{\lambda})=\left\|{Y-\Pi_{S}f_{\lambda}}\right\|^{2}+\alpha\left\|{f_{\lambda}-\Pi_{S}f_{\lambda}}\right\|^{2}

over the convex and compact set ΛCv\Lambda_{{\mathrm{Cv}}}. Denote by f^Cv,S\widehat{f}_{{\mathrm{Cv}},S} the resulting minimizer. Then, minimize the quantity critα(S,f^Cv,S)+pen(S)σ^S2{\rm crit}_{\alpha}(S,\widehat{f}_{{\mathrm{Cv}},S})+\mathop{\rm pen}\nolimits(S)\,\widehat{\sigma}^{2}_{S} for SS varying among SSCv\SS_{{\mathrm{Cv}}}. Denoting by S^\widehat{S} such a minimizer, we have f^Cv=f^Cv,S^\widehat{f}_{{\mathrm{Cv}}}=\widehat{f}_{{\mathrm{Cv}},\widehat{S}}.

7.4 Proof of Proposition 3

By applying Theorem 1, we obtain that the selected estimator f^λ^\widehat{f}_{\widehat{\lambda}} satisfies

C𝔼[ff^λ^2]infλ{L,MS,Cv}[𝔼[ff^λ2]+𝔼[A(f^λ,SSλ)]].C{\mathbb{E}}\left[{\left\|{f-\widehat{f}_{\widehat{\lambda}}}\right\|^{2}}\right]\leq\inf_{\lambda\in\left\{{{\rm L},{\rm MS},{\mathrm{Cv}}}\right\}}\left[{{\mathbb{E}}\left[{\left\|{f-\widehat{f}_{\lambda}}\right\|^{2}}\right]+{\mathbb{E}}\left[{A(\widehat{f}_{\lambda},\SS_{\lambda})}\right]}\right].

Let us now bound 𝔼[A(f^λ,SSλ)]{\mathbb{E}}\left[{A(\widehat{f}_{\lambda},\SS_{\lambda})}\right] for each λΛ\lambda\in\Lambda.

If λ=L\lambda={\rm L}, by using (12) and the fact that f^LS{1,,M}\widehat{f}_{{\rm L}}\in S_{\{1,\ldots,M\}}, we have

C𝔼[A(f^L,SSL)]𝔼[ff^L2]+Mσ2.C^{\prime}{\mathbb{E}}\left[{A(\widehat{f}_{{\rm L}},\SS_{{\rm L}})}\right]\leq{\mathbb{E}}\left[{\left\|{f-\widehat{f}_{{\rm L}}}\right\|^{2}}\right]+M\sigma^{2}.

If λ=MS\lambda={\rm MS}, we may use (13) since with probability one f^MSSSMS\widehat{f}_{{\rm MS}}\in\SS_{{\rm MS}} and since dim(S)Δ(S)1+log(M)\dim(S)\vee\Delta(S)\leq 1+\log(M) for all SSSMSS\in\SS_{{\rm MS}}, we get

C𝔼[A(f^MS,SSMS)]𝔼[ff^MS2]+log(M)σ2.C^{\prime}{\mathbb{E}}\left[{A(\widehat{f}_{{\rm MS}},\SS_{{\rm MS}})}\right]\leq{\mathbb{E}}\left[{\left\|{f-\widehat{f}_{{\rm MS}}}\right\|^{2}}\right]+\log(M)\sigma^{2}.

Finally, let us turn to the case λ=Cv\lambda={\mathrm{Cv}} and denote by gg the best approximation of ff in 𝒞{\mathcal{C}}. Since f^Cv𝒞\widehat{f}_{{\mathrm{Cv}}}\in{\mathcal{C}}, for all SSSCvS\in\SS_{{\mathrm{Cv}}},

f^CvΠSf^Cv\displaystyle\left\|{\widehat{f}_{{\mathrm{Cv}}}-\Pi_{S}\widehat{f}_{{\mathrm{Cv}}}}\right\| \displaystyle\leq f^CvΠSg=f^Cvf+fg+gΠSg\displaystyle\left\|{\widehat{f}_{{\mathrm{Cv}}}-\Pi_{S}g}\right\|=\left\|{\widehat{f}_{{\mathrm{Cv}}}-f+f-g+g-\Pi_{S}g}\right\|
\displaystyle\leq 2ff^Cv+gΠSg,\displaystyle 2\left\|{f-\widehat{f}_{{\mathrm{Cv}}}}\right\|+\left\|{g-\Pi_{S}g}\right\|,

and hence by using (12)

C𝔼[A(f^Cv,SSCv)]𝔼[ff^Cv2]+A¯(g,SSCv)C^{\prime}{\mathbb{E}}\left[{A(\widehat{f}_{{\mathrm{Cv}}},\SS_{{\mathrm{Cv}}})}\right]\leq{\mathbb{E}}\left[{\left\|{f-\widehat{f}_{{\mathrm{Cv}}}}\right\|^{2}}\right]+\overline{A}(g,\SS_{{\mathrm{Cv}}})

where A¯(g,SSCv)\overline{A}(g,\SS_{{\mathrm{Cv}}}) is given by (37). By arguing as in Section 3.1.3, we deduce that under (24)

C𝔼[A(f^Cv,SSCv)]𝔼[ff^Cv2]+nL2log(eM/nL2)σ2.C^{\prime}{\mathbb{E}}\left[{A(\widehat{f}_{{\mathrm{Cv}}},\SS_{{\mathrm{Cv}}})}\right]\leq{\mathbb{E}}\left[{\left\|{f-\widehat{f}_{{\mathrm{Cv}}}}\right\|^{2}}\right]+\sqrt{nL^{2}\log(eM/\sqrt{nL^{2}})}\sigma^{2}.

By putting these bounds together we get the result.

7.5 Proof of Corollary 2

Since Assumptions 1 to 4 are fulfilled and 𝔽{\mathbb{F}} is finite, we may apply Theorem 1 and take δ=0\delta=0. By using (12), we have for some CC depending on K,αK,\alpha and κ\kappa,

C𝔼[ff^λ^2]\displaystyle C{\mathbb{E}}\left[{\left\|{f-\widehat{f}_{\widehat{\lambda}}}\right\|^{2}}\right]
\displaystyle\leq infλΛ{𝔼[ff^λ2]+𝔼[f^λΠSλf^λ2]+a(1+dim(Sλ))σ2}.\displaystyle\inf_{\lambda\in\Lambda}\left\{{{\mathbb{E}}\left[{\left\|{f-\widehat{f}_{\lambda}}\right\|^{2}}\right]+{\mathbb{E}}\left[{\left\|{\widehat{f}_{\lambda}-\Pi_{S_{\lambda}}\widehat{f}_{\lambda}}\right\|^{2}}\right]+a(1+\dim(S_{\lambda}))\sigma^{2}}\right\}.

For all λΛ\lambda\in\Lambda,

𝔼[ff^λ2]\displaystyle{\mathbb{E}}\left[{\|f-\widehat{f}_{\lambda}\|^{2}}\right] =\displaystyle= fAλf2+𝔼[Aλε2]\displaystyle\left\|{f-A_{\lambda}f}\right\|^{2}+{\mathbb{E}}\left[{\left\|{A_{\lambda}{\varepsilon}}\right\|^{2}}\right]
=\displaystyle= fAλf2+Tr(AλAλ)σ2\displaystyle\left\|{f-A_{\lambda}f}\right\|^{2}+{\rm Tr}(A_{\lambda}^{*}A_{\lambda})\sigma^{2}
\displaystyle\geq max{fAλf2,Tr(AλAλ)σ2}\displaystyle\max\left\{{\left\|{f-A_{\lambda}f}\right\|^{2},{\rm Tr}(A_{\lambda}^{*}A_{\lambda})\sigma^{2}}\right\}

and

𝔼[f^λΠSλf^λ2]\displaystyle{\mathbb{E}}\left[{\left\|{\widehat{f}_{\lambda}-\Pi_{S_{\lambda}}\widehat{f}_{\lambda}}\right\|^{2}}\right] =\displaystyle= (IΠSλ)Aλf2+𝔼[(IΠSλ)Aλε2],\displaystyle\left\|{(I-\Pi_{S_{\lambda}})A_{\lambda}f}\right\|^{2}+{\mathbb{E}}\left[{\left\|{(I-\Pi_{S_{\lambda}})A_{\lambda}{\varepsilon}}\right\|^{2}}\right],
\displaystyle\leq 2max{(IΠSλ)Aλf2,𝔼[Aλε2]}\displaystyle 2\max\left\{{\left\|{(I-\Pi_{S_{\lambda}})A_{\lambda}f}\right\|^{2},{\mathbb{E}}\left[{\left\|{A_{\lambda}{\varepsilon}}\right\|^{2}}\right]}\right\}
=\displaystyle= 2max{(IΠSλ)Aλf2,Tr(AλAλ)σ2}\displaystyle 2\max\left\{{\left\|{(I-\Pi_{S_{\lambda}})A_{\lambda}f}\right\|^{2},{\rm Tr}(A_{\lambda}^{*}A_{\lambda})\sigma^{2}}\right\}

and hence, Corollary 2 follows from the next lemma.

Lemma 2.

For all λΛ\lambda\in\Lambda we have

(i)\displaystyle(i) (IΠSλ)AλffAλf,\displaystyle\left\|{(I-\Pi_{S_{\lambda}})A_{\lambda}f}\right\|\,\leq\,\left\|{f-A_{\lambda}f}\right\|,
(ii)\displaystyle(ii) dim(Sλ) 4Tr(AλAλ).\displaystyle\dim(S_{\lambda})\,\leq\,4\,{\rm Tr}(A_{\lambda}^{*}A_{\lambda}).

Proof of Lemma 2: Writing f=f0+f1ker(Aλ)rg(Aλ)f=f_{0}+f_{1}\in\ker(A_{\lambda})\oplus\mathrm{rg}(A^{*}_{\lambda}) and using the fact that rg(Aλ)=ker(Aλ)\mathrm{rg}(A^{*}_{\lambda})=\ker(A_{\lambda})^{\perp} and the definition of Π¯λ\overline{\Pi}_{\lambda}, we obtain

fAλf2\displaystyle\left\|{f-A_{\lambda}f}\right\|^{2} =\displaystyle= f0+f1Aλf12\displaystyle\left\|{f_{0}+f_{1}-A_{\lambda}f_{1}}\right\|^{2}
=\displaystyle= f0Πker(Aλ)Aλf12+(IΠ¯λAλ)f12\displaystyle\left\|{f_{0}-\Pi_{\ker(A_{\lambda})}A_{\lambda}f_{1}}\right\|^{2}+\left\|{(I-\overline{\Pi}_{\lambda}A_{\lambda})f_{1}}\right\|^{2}
\displaystyle\geq (Aλ+Π¯λ)Aλf12\displaystyle\left\|{(A_{\lambda}^{+}-\overline{\Pi}_{\lambda})A_{\lambda}f_{1}}\right\|^{2}
\displaystyle\geq k=1mλsk2<Aλf,vk>2,\displaystyle\sum_{k=1}^{m_{\lambda}}s_{k}^{2}<A_{\lambda}f,v_{k}>^{2},

where s1smλs_{1}\geq\ldots\geq s_{m_{\lambda}} are the singular values of Aλ+Π¯λA^{+}_{\lambda}-\overline{\Pi}_{\lambda} counted with their multiplicity and (v1,,vmλ)(v_{1},\ldots,v_{m_{\lambda}}) is an orthonormal family of right-singular vectors associated to (s1,,smλ)(s_{1},\ldots,s_{m_{\lambda}}). If s1<1s_{1}<1, then Sλ=nS_{\lambda}={\mathbb{R}}^{n} and we have fAλf(IΠSλ)Aλf=0\left\|{f-A_{\lambda}f}\right\|\geq\left\|{(I-\Pi_{S_{\lambda}})A_{\lambda}f}\right\|=0. Otherwise, s11s_{1}\geq 1, we may consider kλk_{\lambda} as the largest kk such that sk1s_{k}\geq 1 and derive that

fAλf2\displaystyle\left\|{f-A_{\lambda}f}\right\|^{2} \displaystyle\geq k=1kλsk2<Aλf,vk>2\displaystyle\sum_{k=1}^{k_{\lambda}}s_{k}^{2}<A_{\lambda}f,v_{k}>^{2}
\displaystyle\geq k=1kλ<Aλf,vk>2=(IΠSλ)Aλf2,\displaystyle\sum_{k=1}^{k_{\lambda}}<A_{\lambda}f,v_{k}>^{2}\ =\ \left\|{(I-\Pi_{S_{\lambda}})A_{\lambda}f}\right\|^{2},

which proves the assertion (i)(i).

For the bound (ii)(ii), we set Mλ=Aλ+Π¯λM_{\lambda}=A^{+}_{\lambda}-\overline{\Pi}_{\lambda} and note that

(MλΠ¯λ)(MλΠ¯λ)=MλMλ+Π¯λΠ¯λMλΠ¯λΠ¯λMλ(M_{\lambda}-\overline{\Pi}_{\lambda})(M_{\lambda}-\overline{\Pi}_{\lambda})^{*}=M_{\lambda}M^{*}_{\lambda}+\overline{\Pi}_{\lambda}\overline{\Pi}_{\lambda}^{*}-M_{\lambda}\overline{\Pi}_{\lambda}^{*}-\overline{\Pi}_{\lambda}M_{\lambda}^{*}

induces a semi-positive quadratic form on rg(Aλ)\mathrm{rg}(A_{\lambda}^{*}). As a consequence the quadratic form (Mλ+Π¯λ)(Mλ+Π¯λ)(M_{\lambda}+\overline{\Pi}_{\lambda})(M_{\lambda}+\overline{\Pi}_{\lambda})^{*} is dominated by the quadratic form 2(MλMλ+Π¯λΠ¯λ)2(M_{\lambda}M_{\lambda}^{*}+\overline{\Pi}_{\lambda}\overline{\Pi}_{\lambda}^{*}) on rg(Aλ)\mathrm{rg}(A_{\lambda}^{*}). Furthermore

(Mλ+Π¯λ)(Mλ+Π¯λ)=(Aλ+)(Aλ+)=(AλAλ)+(M_{\lambda}+\overline{\Pi}_{\lambda})(M_{\lambda}+\overline{\Pi}_{\lambda})^{*}=(A_{\lambda}^{+})(A_{\lambda}^{+})^{*}=(A_{\lambda}^{*}A_{\lambda})^{+}

where (AλAλ)+(A_{\lambda}^{*}A_{\lambda})^{+} is the inverse of the linear operator Lλ:rg(Aλ)rg(Aλ)L_{\lambda}:\mathrm{rg}(A^{*}_{\lambda})\to\mathrm{rg}(A^{*}_{\lambda}) induced by AλAλA_{\lambda}^{*}A_{\lambda} restricted on rg(Aλ)\mathrm{rg}(A^{*}_{\lambda}). We then have that the quadratic form induced by (AλAλ)+(A_{\lambda}^{*}A_{\lambda})^{+} is dominated by the quadratic form

2(Aλ+Π¯λ)(Aλ+Π¯λ)+2Π¯λΠ¯λ2(A_{\lambda}^{+}-\overline{\Pi}_{\lambda})(A_{\lambda}^{+}-\overline{\Pi}_{\lambda})^{*}+2\overline{\Pi}_{\lambda}\overline{\Pi}_{\lambda}^{*}

on rg(Aλ)\mathrm{rg}(A_{\lambda}^{*}). In particular the sequence of the eigenvalues of (AλAλ)+(A_{\lambda}^{*}A_{\lambda})^{+} is dominated by the sequence (2sk2+2)k=1,mλ(2s_{k}^{2}+2)_{k=1,m_{\lambda}} so

Tr(AλAλ)=Tr(Lλ)\displaystyle{\rm Tr}(A_{\lambda}^{*}A_{\lambda})\ =\ {\rm Tr}(L_{\lambda}) \displaystyle\geq k=1mλ12(1+sk2)\displaystyle\sum_{k=1}^{m_{\lambda}}{1\over 2(1+s_{k}^{2})}
\displaystyle\geq k=kλ+1mλ12(1+sk2)dim(Sλ)/4,\displaystyle\sum_{k=k_{\lambda}+1}^{m_{\lambda}}{1\over 2(1+s_{k}^{2})}\ \geq\ {\rm dim}(S_{\lambda})/4,

which concludes the proof of Lemma 2.

7.6 Proof of Corollary 4

Throughout this section, we write SS_{*} for SmS_{m^{*}} and S^λ\widehat{S}_{\lambda} for Sm^(λ)S_{\widehat{m}(\lambda)} for short. By using (10) with δ=0\delta=0 and since Σ1+log(1+p)\Sigma\leq 1+\log(1+p), we have

C𝔼[ff^λ^2]𝔼[infλΛfΠS^λY2+pen(S^λ)σ^S^λ2]+(1+log(p+1))σ2,C{\mathbb{E}}\left[{\|f-\widehat{f}_{\widehat{\lambda}}\|^{2}}\right]\leq{\mathbb{E}}\left[{\inf_{\lambda\in\Lambda}\|f-\Pi_{\widehat{S}_{\lambda}}Y\|^{2}+\mathop{\rm pen}\nolimits(\widehat{S}_{\lambda})\widehat{\sigma}^{2}_{\widehat{S}_{\lambda}}}\right]+(1+\log(p+1))\sigma^{2},

for some constant C>0C>0 depending on KK only. Writing BB for the event B={m^}B=\left\{{m^{*}\notin\widehat{\mathcal{M}}}\right\}, we have

𝔼[infλΛ{fΠS^λY2+pen(S^λ)σ^S^λ2}]An+Rn{\mathbb{E}}\left[{\inf_{\lambda\in\Lambda}\left\{{\|f-\Pi_{\widehat{S}_{\lambda}}Y\|^{2}+\mathop{\rm pen}\nolimits(\widehat{S}_{\lambda})\widehat{\sigma}^{2}_{\widehat{S}_{\lambda}}}\right\}}\right]\leq A_{n}+R^{\prime}_{n}

where

An\displaystyle A_{n} =\displaystyle= 𝔼[fΠSY2+pen(S)σ^S2]\displaystyle{\mathbb{E}}\left[{\left\|{f-\Pi_{S_{*}}Y}\right\|^{2}+\mathop{\rm pen}\nolimits(S_{*})\widehat{\sigma}^{2}_{S_{*}}}\right]
Rn\displaystyle R^{\prime}_{n} =\displaystyle= 𝔼[infλΛ{fΠS^λY2+pen(S^λ)σ^S^λ2}𝟏B].\displaystyle{\mathbb{E}}\left[{\inf_{\lambda\in\Lambda}\left\{{\|f-\Pi_{\widehat{S}_{\lambda}}Y\|^{2}+\mathop{\rm pen}\nolimits(\widehat{S}_{\lambda})\widehat{\sigma}^{2}_{\widehat{S}_{\lambda}}}\right\}{\bf 1}_{B}}\right].

Let us bound AnA_{n} from above. Note that fΠSY2=ΠSε2\|f-\Pi_{S_{*}}Y\|^{2}=\|\Pi_{S_{*}}{\varepsilon}\|^{2} and σ^S2=(IΠS)ε2/(ndim(S))\widehat{\sigma}^{2}_{S_{*}}=\|(I-\Pi_{S_{*}}){\varepsilon}\|^{2}/(n-\dim(S_{*})) and since dim(S)Dmaxκn/(2logp)\dim(S_{*})\leq D_{\max}\leq\kappa n/(2\log p), by using (4) we get

An\displaystyle A_{n} \displaystyle\leq (dim(S)+pen(S))σ2C(1+log(p))dim(S)σ2,\displaystyle(\dim(S_{*})+\mathop{\rm pen}\nolimits(S_{*}))\sigma^{2}\leq C^{\prime}(1+\log(p))\dim(S_{*})\sigma^{2},

for some constant C>0C^{\prime}>0 depending on KK and κ\kappa only.

Let us now turn to RnR^{\prime}_{n}. For all λΛ\lambda\in\Lambda, writing fΠS^λY=(IΠS^λ)fΠS^λεf-\Pi_{\widehat{S}_{\lambda}}Y=(I-\Pi_{\widehat{S}_{\lambda}})f-\Pi_{\widehat{S}_{\lambda}}{\varepsilon} gives fΠS^λY2f2+ε2\|f-\Pi_{\widehat{S}_{\lambda}}Y\|^{2}\leq\|f\|^{2}+\left\|{{\varepsilon}}\right\|^{2} and

σ^S^λ2=YΠS^λY2ndim(S^λ)2f2+ε2ndim(S^λ).\widehat{\sigma}^{2}_{\widehat{S}_{\lambda}}={\|Y-\Pi_{\widehat{S}_{\lambda}}Y\|^{2}\over n-\dim(\widehat{S}_{\lambda})}\leq 2{\left\|{f}\right\|^{2}+\left\|{{\varepsilon}}\right\|^{2}\over n-\dim(\widehat{S}_{\lambda})}.

Since for all SSSS\in\SS, dim(S)Dmaxκn/(2logp)\dim(S)\leq D_{\max}\leq\kappa n/(2\log p), by using (4) again, there exists some positive constant cc depending on KK and κ\kappa only such that for all λΛ\lambda\in\Lambda, pen(S^λ)/(ndim(S^λ))c\mathop{\rm pen}\nolimits(\widehat{S}_{\lambda})/(n-\dim(\widehat{S}_{\lambda}))\leq c and hence,

infλΛ{fΠS^λY2+pen(S^λ)σ^S^λ2}𝟏B(1+2c)(f2+ε2)𝟏B.\inf_{\lambda\in\Lambda}\left\{{\|f-\Pi_{\widehat{S}_{\lambda}}Y\|^{2}+\mathop{\rm pen}\nolimits(\widehat{S}_{\lambda})\hat{\sigma}^{2}_{\widehat{S}_{\lambda}}}\right\}{\bf 1}_{B}\leq(1+2c)\left({\left\|{f}\right\|^{2}+\left\|{{\varepsilon}}\right\|^{2}}\right){\bf 1}_{B}.

Some calculation shows that 𝔼[(f2+ε2)2](f2+2nσ2)2{\mathbb{E}}\left[{\left({\left\|{f}\right\|^{2}+\left\|{{\varepsilon}}\right\|^{2}}\right)^{2}}\right]\leq\left({\left\|{f}\right\|^{2}+2n\sigma^{2}}\right)^{2} and hence, by Cauchy-Schwarz inequality

Rn(1+2c)(f2+2nσ2)(B).R^{\prime}_{n}\leq(1+2c)(\|f\|^{2}+2n\sigma^{2})\sqrt{{\mathbb{P}}(B)}.

The result follows by putting the bounds on AnA_{n} and RnR^{\prime}_{n} together.

References

  • Arlot, (2007) Arlot, S. (2007). Rééchantillonnage et Sélection de modèles. PhD thesis, University Paris XI.
  • Arlot, (2009) Arlot, S. (2009). Model selection by resampling penalization. Electron. J. Stat., 3:557–624.
  • Arlot and Bach, (2009) Arlot, S. and Bach, F. (2009). Data-driven calibration of linear estimators with minimal penalties. Advances in Neural Information Processing Systems (NIPS), 22:46–54.
  • Arlot and Celisse, (2010) Arlot, S. and Celisse, A. (2010). A survey of cross-validation procedures for model selection. Statistics Surveys, 4:40–79.
  • Baraud, (2000) Baraud, Y. (2000). Model selection for regression on a fixed design. Probab. Theory Related Fields, 117(4):467–493.
  • Baraud, (2010) Baraud, Y. (2010). Estimator selection with respect to Hellinger-type risks. Probab. Theory Relat. Fields.
  • Baraud et al., (2009) Baraud, Y., Giraud, C., and Huet, S. (2009). Gaussian model selection with an unknown variance. Ann. Statist., 37(2):630–672.
  • Birgé, (2006) Birgé, L. (2006). Model selection via testing: an alternative to (penalized) maximum likelihood estimators. Ann. Inst. H. Poincaré Probab. Statist., 42(3):273–325.
  • Birgé and Massart, (2001) Birgé, L. and Massart, P. (2001). Gaussian model selection. J. Eur. Math. Soc. (JEMS), 3(3):203–268.
  • Boulesteix and Strimmer, (2006) Boulesteix, A. and Strimmer, K. (2006). Partial least squares: a versatile tool for the analysis of high-dimensional genomic data. Briefings in Bioinformatics, 8(1):32–44.
  • Breiman, (2001) Breiman, L. (2001). Random forests. Machine Learning, 45:5–32.
  • Bunea et al., (2007) Bunea, F., Tsybakov, A. B., and Wegkamp, M. H. (2007). Aggregation for Gaussian regression. Ann. Statist., 35(4):1674–1697.
  • Candès and Tao, (2007) Candès, E. and Tao, T. (2007). The Dantzig selector: statistical estimation when pp is much larger than nn. Ann. Statist., 35(6):2313–2351.
  • Cao and Golubev, (2006) Cao, Y. and Golubev, Y. (2006). On oracle inequalities related to smoothing splines. Math. Methods Statist., 15(4):398–414 (2007).
  • Catoni, (1997) Catoni, O. (1997). Mixture approach to universal model selection. Technical report, Ecole Normale Supérieure, France.
  • Catoni, (2004) Catoni, O. (2004). Statistical learning theory and stochastic optimization. In Lecture notes from the 31st Summer School on Probability Theory held in Saint-Flour, July 8–25, 2001. Springer-Verlag, Berlin.
  • Celisse, (2008) Celisse, A. (2008). Model selection via cross-validation in density estimation, regression, and change-points detection. PhD thesis, University Paris XI.
  • Chen et al., (1998) Chen, S. S., Donoho, D. L., and Saunders, M. A. (1998). Atomic decomposition by basis pursuit. SIAM J. Sci. Comput., 20(1):33–61 (electronic).
  • Díaz-Uriarte and Alvares de Andrés, (2006) Díaz-Uriarte, R. and Alvares de Andrés, S. (2006). Gene selection and classification of microarray data using random forest. BMC Bioinformatics, 7(3).
  • Efron et al., (2004) Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression. Ann. Statist., 32(2):407–499. With discussion, and a rejoinder by the authors.
  • Genuer et al., (2010) Genuer, R., Poggi, J.-M., and Tuleau-Malot, C. (2010). Variable selection using random forests. Pattern Recognition Lett., to appear.
  • Giraud, (2008) Giraud, C. (2008). Mixing least-squares estimators when the variance is unknown. Bernoulli, 14(4):1089–1107.
  • Goldenshluger, (2009) Goldenshluger, A. (2009). A universal procedure for aggregating estimators. Ann. Statist., 37(1):542–568.
  • Goldenshluger and Lepski, (2009) Goldenshluger, A. and Lepski, O. (2009). Structural adaptation via 𝕃p\mathbb{L}_{p}-norm oracle inequalities. Probab. Theory Related Fields, 143(1-2):41–71.
  • Helland, (2006) Helland, I. (2006). Partial least squares regression. In Kotz, S., Balakrishnan, N., Read, C., Vidakovic, B., and Johnston, N., editors, Encyclopedia of statistical sciences (2nd ed.), volume 9, pages 5957–5962, New York. Wiley.
  • Helland, (2001) Helland, I. (2001). Some theoretical aspects of partial least squares regression. Chemometrics and Intelligent Laboratory Systems, 58:97–107.
  • Hoerl and Kennard, (1970) Hoerl, A. and Kennard, R. (1970). Ridge regression: bayes estimation for nonorthogonal problems. Technometrics, 12:55–67.
  • Hoerl and Kennard, (2006) Hoerl, A. and Kennard, R. (2006). Ridge regression. In Kotz, S., Balakrishnan, N., Read, C., Vidakovic, B., and Johnston, N., editors, Encyclopedia of statistical sciences (2nd ed.), volume 11, pages 7273–7280, New York. Wiley.
  • Huang et al., (2008) Huang, J., Ma, S., and Zhang, C.-H. (2008). Adaptive Lasso for sparse high-dimensional regression models. Statistica Sinica, 18(4):1603–1618.
  • Juditsky and Nemirovski, (2000) Juditsky, A. and Nemirovski, A. (2000). Functional aggregation for nonparametric regression. Ann. Statist., 28(3):681–712.
  • Lepskiĭ, (1990) Lepskiĭ, O. V. (1990). A problem of adaptive estimation in Gaussian white noise. Teor. Veroyatnost. i Primenen., 35(3):459–470.
  • Lepskiĭ, (1991) Lepskiĭ, O. V. (1991). Asymptotically minimax adaptive estimation. I. Upper bounds. Optimally adaptive estimates. Teor. Veroyatnost. i Primenen., 36(4):645–659.
  • (33) Lepskiĭ, O. V. (1992a). Asymptotically minimax adaptive estimation. II. Schemes without optimal adaptation. Adaptive estimates. Teor. Veroyatnost. i Primenen., 37(3):468–481.
  • (34) Lepskiĭ, O. V. (1992b). On problems of adaptive estimation in white Gaussian noise. In Topics in nonparametric estimation, volume 12 of Adv. Soviet Math., pages 87–106. Amer. Math. Soc., Providence, RI.
  • Leung and Barron, (2006) Leung, G. and Barron, A. R. (2006). Information theory and mixing least-squares regressions. IEEE Trans. Inform. Theory, 52(8):3396–3410.
  • Makovoz, (1996) Makovoz, Y. (1996). Random approximants and neural networks. J. Approx. Theory, 85(1):98–109.
  • Nadaraya, (1964) Nadaraya, E. A. (1964). On estimating regression. Theory of Probability and its Applications, 9(1):141–142.
  • Nemirovski, (2000) Nemirovski, A. (2000). Topics in non-parametric statistics. In Lectures on probability theory and statistics (Saint-Flour, 1998), volume 1738 of Lecture Notes in Math., pages 85–277. Springer, Berlin.
  • Rigollet and Tsybakov, (2007) Rigollet, P. and Tsybakov, A. B. (2007). Linear and convex aggregation of density estimators. Math. Methods Statist., 16(3):260–280.
  • Strobl et al., (2008) Strobl, C., Boulesteix, A.-L., Kneib, T., Augustin, T., and Zeileis, A. (2008). Conditional variable importance for random forests. BMC Bioinformatics, 9(307).
  • Strobl et al., (2007) Strobl, C., Boulesteix, A.-L., Zeileis, A., and Hothorn, T. (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics, 8(25).
  • Tenenhaus, (1998) Tenenhaus, M. (1998). La régression PLS. Éditions Technip, Paris. Théorie et pratique. [Theory and application].
  • Tibshirani, (1996) Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B, 58(1):267–288.
  • Tsybakov, (2003) Tsybakov, A. B. (2003). Optimal rates of aggregation. In Proceedings of the 16th Annual Conference on Learning Theory (COLT) and 7th Annual Workshop on Kernel Machines, pages 303–313. Lecture Notes in Artificial Intelligence 2777, Springer-Verlag, Berlin.
  • Watson, (1964) Watson, G. S. (1964). Smooth regression analysis. Sankhyā Ser. A, 26:359–372.
  • Wegkamp, (2003) Wegkamp, M. (2003). Model selection in nonparametric regression. Ann. Statist., 31:252–273.
  • Yang, (1999) Yang, Y. (1999). Model selection for nonparametric regression. Statist. Sinica, 9:475–499.
  • (48) Yang, Y. (2000a). Combining different procedures for adaptive regression. J. Multivariate Anal., 74(1):135–161.
  • (49) Yang, Y. (2000b). Mixing strategies for density estimation. Ann. Statist., 28(1):75–87.
  • Yang, (2001) Yang, Y. (2001). Adaptive regression by mixing. J. Amer. Statist. Assoc., 96(454):574–588.
  • Zhang, (2005) Zhang, T. (2005). Learning bounds for kernel regression using effective data dimensionality. Neural Comput., 17(9):2077–2098.
  • Zhang, (2008) Zhang, T. (2008). Adaptive forward-backward greedy algorithm for learning sparse representations. Technical report, Rutgers University, NJ.
  • Zhao and Yu, (2006) Zhao, P. and Yu, B. (2006). On model selection consistency of Lasso. J. Mach. Learn. Res., 7:2541–2563.
  • Zou, (2006) Zou, H. (2006). The adaptive lasso and its oracle properties. J. Amer. Statist. Assoc., 101(476):1418–1429.
  • Zou and Hastie, (2005) Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B Stat. Methodol., 67(2):301–320.

8 Appendix

8.1 Computation of penΔ(S){\mathop{\rm pen}_{\Delta}\nolimits}(S)

The penalty penΔ(S)\mathop{\rm pen}\nolimits_{\Delta}(S), defined in equation (3), is linked to the EDkhi function introduced in Baraud et al. (2009) (see Definition 3 therein) via the following formula:

penΔ(S)=ndim(S)ndim(S)1EDkhi(dim(S)+1,ndim(S)1,eΔ(S)dim(S)+1).\mathop{\rm pen}\nolimits_{\Delta}(S)={n-\dim(S)\over n-\dim(S)-1}\mathrm{EDkhi}\left(\dim(S)+1,n-\dim(S)-1,\frac{e^{-\Delta(S)}}{\dim(S)+1}\right).

Therefore, according to the result given in Section 6.1 of Baraud et al. (2009), penΔ(S)\mathop{\rm pen}\nolimits_{\Delta}(S) is the solution in xx of the following equation, where D=dim(S)D=\dim(S) and N=ndim(S)N=n-\dim(S):

eΔ(S)D+1\displaystyle\frac{e^{-\Delta(S)}}{D+1} =\displaystyle= (FD+3,N1xN1N(D+3))\displaystyle{\mathbb{P}}\left(F_{D+3,N-1}\geq x\frac{N-1}{N(D+3)}\right)
xN1N(D+1)(FD+1,N+1xN+1N(D+1)).\displaystyle\ \ \ \ \ -x\frac{N-1}{N(D+1)}{\mathbb{P}}\left(F_{D+1,N+1}\geq x\frac{N+1}{N(D+1)}\right).
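For illustration, a minimal R sketch (not the authors' code) solving this equation numerically is given below; pen_Delta is a hypothetical helper name, it takes D=dim(S)D=\dim(S), N=ndim(S)N=n-\dim(S) and Δ=Δ(S)\Delta=\Delta(S) as inputs and returns the corresponding value of penΔ(S)\mathop{\rm pen}\nolimits_{\Delta}(S).

## minimal sketch: compute pen_Delta(S) as the root in x of the equation above
pen_Delta <- function(D, N, Delta) {
  q <- exp(-Delta) / (D + 1)
  rhs <- function(x) {                        # right-hand side of the equation, as a function of x
    pf(x * (N - 1) / (N * (D + 3)), D + 3, N - 1, lower.tail = FALSE) -
      x * (N - 1) / (N * (D + 1)) *
        pf(x * (N + 1) / (N * (D + 1)), D + 1, N + 1, lower.tail = FALSE)
  }
  upper <- D + N                              # rhs(0) = 1 > q and rhs(x) -> 0 as x grows,
  while (rhs(upper) > q) upper <- 2 * upper   # so enlarge the bracket until a sign change occurs
  uniroot(function(x) rhs(x) - q, lower = 0, upper = upper)$root
}

## example: D = 5, N = 95 and Delta = 5 + log C(500, 5)
pen_Delta(5, 95, 5 + lchoose(500, 5))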

8.2 Simulated examples

The collection {\mathcal{E}} is composed of several collections 1,,11{\mathcal{E}}_{1},\ldots,{\mathcal{E}}_{11} that are detailed below. The collections 1{\mathcal{E}}_{1} to 10{\mathcal{E}}_{10} are composed of examples where XX is generated as nn independent centered Gaussian vectors with covariance matrix CC. For each e{1,,10}e\in\{1,\ldots,10\}, we define a p×pp\times p matrix CeC_{e} and a pp-vector of parameters βe\beta_{e}. We denote by 𝒳e{\mathcal{X}}_{e} the set of 5 matrices XX simulated as nn-i.i.d 𝒩p(0,Ce)\mathcal{N}_{p}(0,C_{e}). The collection e{\mathcal{E}}_{e} is then defined as follows:

e={ex(n,p,X,β,ρ),(n,p),X𝒳e,β=βe,ρ}{\mathcal{E}}_{e}=\left\{{\mathrm{ex}}(n,p,X,\beta,\rho),(n,p)\in{\mathcal{I}},X\in{\mathcal{X}}_{e},\beta=\beta_{e},\rho\in{\mathcal{R}}\right\}

where ={5,10,20}{\mathcal{R}}=\{5,10,20\} and

={(100,50),(100,100),(100,1000),(200,100),(200,200)}{\mathcal{I}}=\left\{(100,50),(100,100),(100,1000),(200,100),(200,200)\right\} (39)

in Section 6.2, and

={(100,50),(100,100),(200,100),(200,200)}{\mathcal{I}}=\left\{(100,50),(100,100),(200,100),(200,200)\right\} (40)

in Section 6.3.

Let us now describe the collections 1{\mathcal{E}}_{1} to 10{\mathcal{E}}_{10}.

Collection 1{\mathcal{E}}_{1}

The matrix CC equals the p×pp\times p identity matrix denoted IpI_{p}. The parameters β\beta satisfy βj=0\beta_{j}=0 for j16j\geq 16, βj=2.5\beta_{j}=2.5 for 1j51\leq j\leq 5, βj=1.5\beta_{j}=1.5 for 6j106\leq j\leq 10, βj=0.5\beta_{j}=0.5 for 11j1511\leq j\leq 15.

Collection 2{\mathcal{E}}_{2}

The matrix CC is such that Cjk=r|jk|C_{jk}=r^{|j-k|} for 1j,k151\leq j,k\leq 15 and 16j,kp16\leq j,k\leq p with r=0.5r=0.5. Otherwise Cj,k=0C_{j,k}=0. The parameters β\beta are as in Collection 1{\mathcal{E}}_{1}.

Collection 3{\mathcal{E}}_{3}

The matrix CC is as in Collection 2{\mathcal{E}}_{2} with r=0.95r=0.95; the parameters β\beta are as in Collection 1{\mathcal{E}}_{1}.

Collection 4{\mathcal{E}}_{4}

The matrix CC is such that Cjk=r|jk|C_{jk}=r^{|j-k|} for 1j,kp1\leq j,k\leq p, with r=0.5r=0.5; the parameters β\beta are as in Collection 1{\mathcal{E}}_{1}.

Collection 5{\mathcal{E}}_{5}

The matrix CC is as in Collection 4{\mathcal{E}}_{4} with r=0.95r=0.95; the parameters β\beta are as in Collection 1{\mathcal{E}}_{1}.

Collection 6{\mathcal{E}}_{6}

The matrix CC equals IpI_{p}. The parameters β\beta satisfy βj=0\beta_{j}=0 for j16j\geq 16, βj=1.5\beta_{j}=1.5 for j15j\leq 15.

Collection 7{\mathcal{E}}_{7}

The matrix CC satisfies Cj,k=(1ρ1)1lj=k+ρ1C_{j,k}=(1-\rho_{1})1\thinspace{\rm l}_{j=k}+\rho_{1} for 1j,k31\leq j,k\leq 3, Cj,k=Ck,j=ρ2C_{j,k}=C_{k,j}=\rho_{2} for j=4,k=1,2,3j=4,k=1,2,3, Cj,k=1lj=kC_{j,k}=1\thinspace{\rm l}_{j=k} for j,k5j,k\geq 5, with ρ1=.39\rho_{1}=.39 and ρ2=.23\rho_{2}=.23. The parameters β\beta satisfy βj=0\beta_{j}=0 for j4j\geq 4, βj=5.6\beta_{j}=5.6 for j3j\leq 3.

Collection 8{\mathcal{E}}_{8}

The matrix CC satisfies Cj,k=0.5|jk|C_{j,k}=0.5^{|j-k|} for j,k8j,k\leq 8, Cj,k=1lj=kC_{j,k}=1\thinspace{\rm l}_{j=k} for j,k9j,k\geq 9. The parameters β\beta satisfy βj=0\beta_{j}=0 for j{1,2,5}j\not\in\{1,2,5\}, β1=3\beta_{1}=3, β2=1.5\beta_{2}=1.5, β5=2\beta_{5}=2.

Collection 9{\mathcal{E}}_{9}

The matrix CC is defined as in Collection 8{\mathcal{E}}_{8}. The parameters β\beta satisfy βj=0\beta_{j}=0 for j9j\geq 9, βj=0.85\beta_{j}=0.85 for j8j\leq 8.

Collection 10{\mathcal{E}}_{10}

The matrix CC satisfies Cj,k=0.51ljk+1lj=kC_{j,k}=0.51\thinspace{\rm l}_{j\neq k}+1\thinspace{\rm l}_{j=k} for j,k40j,k\leq 40, Cj,k=1lj=kC_{j,k}=1\thinspace{\rm l}_{j=k} for j,k41j,k\geq 41. The parameters β\beta satisfy βj=2\beta_{j}=2 for 11j2011\leq j\leq 20 and 31j4031\leq j\leq 40, βj=0\beta_{j}=0 otherwise.

Collection 11{\mathcal{E}}_{11}

In this last example, we denote by 𝒳11{\mathcal{X}}_{11} the set of 5 matrices XX simulated as follows. For 1jp1\leq j\leq p, we denote by XjX_{j} the column jj of XX. Let EE be generated as nn i.i.d. 𝒩p(0,0.01Ip)\mathcal{N}_{p}(0,0.01I_{p}) and let Z1,Z2,Z3Z_{1},Z_{2},Z_{3} be generated as nn i.i.d. 𝒩3(0,I3)\mathcal{N}_{3}(0,I_{3}). Then for j=1,,5j=1,\ldots,5, Xj=Z1+EjX_{j}=Z_{1}+E_{j}, for j=6,,10j=6,\ldots,10, Xj=Z2+EjX_{j}=Z_{2}+E_{j}, for j=11,,15j=11,\ldots,15, Xj=Z3+EjX_{j}=Z_{3}+E_{j}, for j16j\geq 16, Xj=EjX_{j}=E_{j}. The parameters β\beta are as in Collection 6{\mathcal{E}}_{6}. The collection 11{\mathcal{E}}_{11} is defined as the set of examples ex(n,p,X,β,ρ){\mathrm{ex}}(n,p,X,\beta,\rho) for (n,p)(n,p)\in{\mathcal{I}}, X𝒳11X\in{\mathcal{X}}_{11}, and ρ\rho\in{\mathcal{R}}.

The collection {\mathcal{E}} is thus composed of 660 examples for {\mathcal{I}} chosen as in (40), and of 825 examples for {\mathcal{I}} chosen as in (39). For some of the examples, the Lasso estimators were highly biased, leading to high values of the ratio Oex/nσ2O_{{\mathrm{ex}}}/n\sigma^{2}, see Equation (29). We only keep the examples for which the Lasso estimator improves the risk of the naive estimator YY by a factor of at least 1/31/3. This convention leads us to remove 171 examples out of 825. These pathological examples come from the collections 1{\mathcal{E}}_{1}, 6{\mathcal{E}}_{6} and 7{\mathcal{E}}_{7} for n=100n=100 and p100p\geq 100, and from collections 2{\mathcal{E}}_{2} and 4{\mathcal{E}}_{4} when p=1000p=1000. The examples of collection 7{\mathcal{E}}_{7} were chosen by Zou (2006) to illustrate that the Lasso estimators may be highly biased. All the other examples correspond to matrices XX that are nearly orthogonal.

8.3 Procedures for calculating sets of predictors

Let ^=^\widehat{\mathcal{M}}=\bigcup_{\ell\in\mathcal{L}}\widehat{\mathcal{M}}_{\ell} where we recall that for \ell\in{\mathcal{L}}, ^={m^(,h)|hH}\widehat{\mathcal{M}}_{\ell}=\{\widehat{m}(\ell,h)|\ h\in H_{\ell}\}.

The Lasso procedure is described in Section 6.2. The collection is ^Lasso={m^(1),,m^(Dmax)}\widehat{\mathcal{M}}_{{\rm Lasso}}=\{\widehat{m}(1),\ldots,\widehat{m}(D_{\max})\}, where m^(h)\widehat{m}(h) is the set of indices corresponding to the predictors returned by the LARS-Lasso algorithm at step h{1,,Dmax}h\in\{1,\ldots,D_{\max}\} (see Section 6.2).

The ridge procedure is based on the minimization of YXβ2+hβ2\|Y-X\beta\|^{2}+h\|\beta\|^{2} with respect to β\beta, for some positive hh; see for example Hoerl and Kennard (2006). Tibshirani (1996) noted that in the case of a large number of small effects, ridge regression gives better results than the lasso for variable selection. For each hHridgeh\in H_{\mathrm{ridge}}, the regression coefficients β^(h)\widehat{\beta}(h) are calculated and a collection of predictor sets is built as follows. Let j1,,jpj_{1},\ldots,j_{p} be such that |β^j1(h)|>>|β^jp(h)||\widehat{\beta}_{j_{1}}(h)|>\ldots>|\widehat{\beta}_{j_{p}}(h)| and set

Mh={{j1,,jk},k=1,,Dmax}.M_{h}=\left\{{\{j_{1},\ldots,j_{k}\},\ k=1,\ldots,D_{\max}}\right\}.

Then, the collection ^ridge\widehat{\mathcal{M}}_{\mathrm{ridge}} is defined as ^ridge={Mh,hHridge}\widehat{\mathcal{M}}_{\mathrm{ridge}}=\{M_{h},h\in H_{\mathrm{ridge}}\}.
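A minimal R sketch (not the authors' code) of this construction, based on the function lm.ridge of the MASS library, could be the following; ridge_sets is a hypothetical name.

library(MASS)

## minimal sketch: nested candidate sets {j_1,...,j_k} built from the ridge coefficients
ridge_sets <- function(X, Y, h, Dmax) {
  beta_h <- drop(coef(lm.ridge(Y ~ X - 1, lambda = h)))   # ridge coefficients, no intercept
  ord <- order(abs(beta_h), decreasing = TRUE)            # j_1, ..., j_p
  lapply(1:Dmax, function(k) ord[1:k])
}

## the collection M_hat_ridge is then obtained by applying ridge_sets to every h in H_ridge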

The elastic net procedure proposed by Zou and Hastie (2005) mixes the 1\ell_{1} and 2\ell_{2} penalties of the Lasso and the ridge procedures. Let HridgeH_{\mathrm{ridge}} be a grid of values for the tuning parameter hh of the 2\ell_{2} penalty. We choose ^en={M(en,h):hHridge}\widehat{\mathcal{M}}_{\mathrm{en}}=\{M_{(\mathrm{en},h)}:h\in H_{\mathrm{ridge}}\} where M(en,h)M_{(\mathrm{en},h)} denotes the collection of the active sets of cardinality less than DmaxD_{\max}, selected by the elastic net procedure when the 2\ell_{2}-smoothing parameter equals hh. For each hHridgeh\in H_{\mathrm{ridge}} the collection M(en,h)M_{(\mathrm{en},h)} can be conveniently computed by first calculating the ridge regression coefficients and then applying the LARS-lasso algorithm, see Zou and Hastie (2005).

The partial least squares regression (PLSR1) aims to reduce the dimensionality of the regression problem by calculating a small number of components that are useful for predicting YY. Several applications of this procedure for analysing high-dimensional genomic data have been reviewed by Boulesteix and Strimmer (2006). In particular, it can be used for calculating subsets of covariates as we did for the ridge procedure. The PLSR1 procedure constructs, for a given hh, uncorrelated latent components t1,,tht_{1},\ldots,t_{h} that are highly correlated with the response YY, see Helland (2006). Let HplsH_{\mathrm{pls}} be a grid of values for the tuning parameter hh. For each hHplsh\in H_{\mathrm{pls}}, we write β^(h)\widehat{\beta}(h) for the PLS regression coefficients calculated with the first hh components. We then set ^PLS={Mh:hHpls}\widehat{\mathcal{M}}_{\mathrm{PLS}}=\{M_{h}:h\in H_{\mathrm{pls}}\}, where MhM_{h} is built from β^(h)\widehat{\beta}(h) as for the ridge procedure.
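A minimal R sketch (not the authors' code) of the corresponding construction with the pls library could be the following; pls_sets is a hypothetical name.

library(pls)

## minimal sketch: nested candidate sets from the PLSR1 coefficients with h components
pls_sets <- function(X, Y, h, Dmax) {
  fit <- plsr(Y ~ X, ncomp = h)
  beta_h <- drop(coef(fit, ncomp = h, intercept = FALSE))   # PLS regression coefficients
  ord <- order(abs(beta_h), decreasing = TRUE)
  lapply(1:Dmax, function(k) ord[1:k])
}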

The adaptive lasso procedure proposed by Zou (2006) starts with a preliminary estimator \widetilde{\beta}. One then applies the lasso procedure with the terms |\beta_{j}|, j=1,\ldots,p, in the \ell_{1} penalty replaced by the weighted terms |\beta_{j}|/|\widetilde{\beta}_{j}|^{\gamma}, j=1,\ldots,p, for some positive \gamma. The idea is to increase the penalty on coefficients whose preliminary estimates are close to zero, thus reducing the bias in the estimation of f and improving the variable selection accuracy. Zou showed that, if \widetilde{\beta} is a \sqrt{n}-consistent estimator of \beta, then the adaptive lasso procedure is consistent for variable selection in situations where the lasso is not. Much work has been devoted to this subject, see Huang et al. (2008) for example.

We apply the procedure with \gamma=1, considering two different preliminary estimators (a sketch of the reweighting step is given after this list):

- using the ridge estimator \widetilde{\beta}(h) as preliminary estimator: for each h\in H_{\mathrm{ridge}}, the adaptive lasso procedure is applied to calculate the collection M_{\mathrm{ALridge},h} of active sets of cardinality less than D_{\max}. The collection \widehat{\mathcal{M}}_{\mathrm{ALridge}} is thus defined as \widehat{\mathcal{M}}_{\mathrm{ALridge}}=\left\{M_{\mathrm{ALridge},h},\ h\in H_{\mathrm{ridge}}\right\}.

- using the PLSR1 estimator \widetilde{\beta}(h) as preliminary estimator: the procedure is the same as described just above. The collection \widehat{\mathcal{M}}_{\mathrm{ALpls}} is defined as \widehat{\mathcal{M}}_{\mathrm{ALpls}}=\left\{M_{\mathrm{ALpls},h},\ h\in H_{\mathrm{pls}}\right\}.
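Under our reading of the reweighting in Zou (2006), the adaptive lasso with \gamma=1 amounts to running the LARS-Lasso algorithm on a design whose columns are rescaled by the absolute preliminary coefficients. The sketch below illustrates this for one preliminary estimate \widetilde{\beta}(h) (beta_tilde); the small constant eps is our own safeguard against zero preliminary coefficients.

```python
# Hypothetical sketch: adaptive-lasso active sets for one preliminary estimate.
import numpy as np
from sklearn.linear_model import lars_path

def adaptive_lasso_sets(X, Y, beta_tilde, D_max, eps=1e-10):
    # Rescaling column j by |beta_tilde_j| turns the weighted l1 penalty
    # sum_j |beta_j| / |beta_tilde_j| into an ordinary l1 penalty.
    weights = np.abs(beta_tilde) + eps
    _, _, coefs = lars_path(X * weights, Y, method="lasso")
    return [set(np.flatnonzero(coefs[:, k]))
            for k in range(coefs.shape[1])
            if 0 < np.count_nonzero(coefs[:, k]) <= D_max]
```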

The random forest algorithm was proposed by Breiman (2001) for classification and regression problems. The procedure averages several regression trees calculated on bootstrap samples. The algorithm returns measures of variable importance that may be used for variable selection, see for example Díaz-Uriarte and Alvarez de Andrés (2006), Genuer et al. (2010), and Strobl et al. (2007; 2008).

Let us denote by h the number of variables randomly chosen at each split when constructing the trees, and set

H_{rF}=\{p/j\ |\ j\in\{3,2,1.5,1\}\}.

For each h\in H_{rF}, we consider the collection of index sets

M_{h}=\{\{j_{1},\ldots,j_{k}\},\ k=1,\ldots,D_{\max}\},

where j_{1},\ldots,j_{k} are the indices of the k variables with the largest importance measures. Two importance measures are proposed. The first one is based on the increase in the mean squared error of prediction after permutation of each of the variables. It leads to the collection \widehat{\mathcal{M}}_{\mathrm{rFmse}}=\{M_{h},\ h\in H_{rF}\}. The second one is based on the decrease in node impurities and leads similarly to the collection \widehat{\mathcal{M}}_{\mathrm{purity}}.
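A rough sketch of these two collections, assuming X, Y, D_max and H_rF are given, is shown below. Note that scikit-learn's permutation_importance is computed here on the training sample, whereas Breiman's original measure uses out-of-bag observations; the number of trees is an illustrative choice.

```python
# Hypothetical sketch: importance-based predictor sets from random forests.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

def forest_collections(X, Y, H_rF, D_max):
    rFmse, purity = [], []
    for h in H_rF:
        rf = RandomForestRegressor(n_estimators=500,
                                   max_features=int(round(h))).fit(X, Y)
        imp_mse = permutation_importance(rf, X, Y).importances_mean
        imp_node = rf.feature_importances_            # impurity-based importance
        for imp, coll in ((imp_mse, rFmse), (imp_node, purity)):
            order = np.argsort(-imp)
            coll.append([set(order[:k]) for k in range(1, D_max + 1)])
    return rFmse, purity
```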

The exhaustive procedure considers the collection of all subsets of \{1,\ldots,p\} with dimension smaller than D_{\max}. We denote this collection by \mathcal{M}_{\mathrm{exhaustive}}.
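Since this collection does not depend on the data, it can be enumerated directly, for instance as below (indices are 0-based and the bound D_max is taken as inclusive; drop the last value of k if a strict bound is intended).

```python
# Hypothetical sketch: all subsets of {0, ..., p-1} of cardinality up to D_max.
from itertools import combinations

def exhaustive_collection(p, D_max):
    return [set(c)
            for k in range(1, D_max + 1)
            for c in combinations(range(p), k)]
```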

Choice of tuning parameters

We have to choose D_{\max}, the largest number of predictors considered in the collection \widehat{\mathcal{M}}. For all methods except the exhaustive one, D_{\max} may be large, say D_{\max}\leq\min(n-2,p). Nevertheless, to save computing time, we chose D_{\max} just large enough for the dimension of the estimated subset to remain smaller than D_{\max}. For the exhaustive method, D_{\max} must be chosen so as to keep the computation feasible: D_{\max}=4 for p=50, D_{\max}=3 for p=100 and D_{\max}=2 for p=200.

For the ridge method we choose H_{\mathrm{ridge}}=\{10^{-3},10^{-2},10^{-1},1,5\}, and for the PLSR1 method, H_{\mathrm{pls}}=\{1,\ldots,5\}.