Estimator selection in the Gaussian setting
Abstract
We consider the problem of estimating the mean $\mu$ of a Gaussian vector $Y$ with independent components of common unknown variance $\sigma^2$. Our estimation procedure is based on estimator selection. More precisely, we start with an arbitrary and possibly infinite collection of estimators of $\mu$ based on $Y$ and, with the same data $Y$, aim at selecting an estimator among this collection with the smallest Euclidean risk. No assumption on the estimators is made and their dependencies with respect to $Y$ may be unknown. We establish a non-asymptotic risk bound for the selected estimator. As particular cases, our approach allows us to handle the problems of aggregation and model selection, as well as those of choosing a window and a kernel for estimating a regression function, or tuning the parameter involved in a penalized criterion. We also derive oracle-type inequalities when the collection consists of linear estimators. For illustration, we carry out two simulation studies. One aims at comparing our procedure to cross-validation for choosing a tuning parameter. The other shows how to implement our approach to solve the problem of variable selection in practice.
1 Introduction
1.1 The setting and the approach
We consider the Gaussian regression framework
$$Y_i = \mu_i + \varepsilon_i, \qquad i=1,\dots,n,$$
where $\mu=(\mu_1,\dots,\mu_n)'$ is an unknown vector of $\mathbb{R}^n$ and the $\varepsilon_i$ are independent centered Gaussian random variables with common variance $\sigma^2$. Throughout the paper, $\sigma^2$ is assumed to be unknown, which corresponds to the practical case. Our aim is to estimate $\mu$ from the observation of $Y=(Y_1,\dots,Y_n)'$. For specific forms of $\mu$, this setting allows us to deal simultaneously with the following problems.
Example 1 (Signal denoising).
The vector $\mu$ is of the form $\mu_i = F(x_i)$, $i=1,\dots,n$,
where $x_1,\dots,x_n$ are distinct points of a set $\mathcal{X}$ and $F$ is an unknown mapping from $\mathcal{X}$ into $\mathbb{R}$.
Example 2 (Linear regression).
The vector $\mu$ is assumed to be of the form
(1)  $\mu = X\beta$,
where $X$ is an $n\times p$ matrix, $\beta$ is an unknown $p$-dimensional vector and $p$ is some integer larger than 1 (and possibly larger than $n$). The columns of the matrix $X$ are usually called predictors. When $p$ is large, one may assume that the decomposition (1) is sparse, in the sense that only few coordinates of $\beta$ are non-zero. Estimating $\mu$ or finding the predictors associated to the non-zero coordinates of $\beta$ are classical issues. The latter problem is called variable selection.
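For concreteness, here is a minimal R sketch of this sparse linear-regression setting. The dimensions, sparsity level and noise standard deviation are arbitrary illustrative choices, not values taken from the paper.

```r
set.seed(1)
n <- 100; p <- 200                     # possibly p > n
X <- matrix(rnorm(n * p), n, p)        # design matrix: columns are the predictors
beta <- c(rep(2, 5), rep(0, p - 5))    # sparse coefficient vector: only 5 non-zero coordinates
sigma <- 1                             # unknown in practice; fixed here for simulation
mu <- drop(X %*% beta)                 # mean vector, decomposition (1)
Y <- mu + rnorm(n, sd = sigma)         # observed Gaussian vector
```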
Our estimation strategy is based on estimator selection. More precisely, we start with an arbitrary collection of estimators of $\mu$ based on $Y$ and aim at selecting the one with the smallest Euclidean risk by using the same observation $Y$. The way the estimators depend on $Y$ may be arbitrary and possibly unknown. For example, they may be obtained from the minimization of a criterion, a Bayesian procedure or the guess of some experts.
1.2 The motivation
The problem of choosing some best estimator among a family of candidate ones is central in Statistics. Let us present some examples.
Example 3 (Choosing a tuning parameter).
Many statistical procedures depend on a (possibly multi-dimensional) parameter that needs to be tuned in view of obtaining an estimator with the best possible performance. For example, in the context of linear regression as described in Example 2, the Lasso estimator (see Tibshirani (1996) and Chen et al. (1998)) defined by $\hat\mu_\lambda = X\hat\beta_\lambda$ with
$$\hat\beta_\lambda \in \mathop{\mathrm{argmin}}_{\beta\in\mathbb{R}^p}\Bigl\{\|Y - X\beta\|^2 + \lambda\|\beta\|_1\Bigr\}$$
depends on the choice of the parameter $\lambda>0$. Selecting this parameter among a grid $\Lambda\subset(0,+\infty)$ amounts to selecting a (suitable) estimator among the family $\{\hat\mu_\lambda,\ \lambda\in\Lambda\}$.
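As an illustration, the following R sketch builds such a family of candidate Lasso-type estimators, reusing X and Y from the previous sketch. It indexes the candidates by the number of steps of the LARS-Lasso path (as done later in the paper's simulations); the number of steps retained is an arbitrary illustrative choice.

```r
library(lars)                          # LARS-Lasso path (also used in the paper's simulations)
fit <- lars(X, Y, type = "lasso")      # computes the whole Lasso path
steps <- 1:10                          # candidate tuning parameters: number of LARS steps (illustrative)
# family of candidate estimators of mu: fitted values at each step of the path (n x length(steps))
mu.hat.lasso <- predict(fit, newx = X, s = steps, type = "fit", mode = "step")$fit
```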
Another dilemma for statisticians is the choice of a procedure to solve a given problem. In the context of Example 3, there exist many competitors to the Lasso estimator and one may alternatively choose a procedure based on ridge regression (see Hoerl and Kennard (1970)), random forests or PLS (see Tenenhaus (1998), Helland (2001) and Helland (2006)). Similarly, for the problem of signal denoising as described in Example 1, popular approaches include spline smoothing, wavelet decompositions and kernel estimators. The choice of a kernel may be tricky.
Example 4 (Choosing a kernel).
Consider the problem described in Example 1 with $\mathcal{X}\subset\mathbb{R}$. For a kernel $K$ and a bandwidth $h>0$, the Nadaraya-Watson estimator (see Nadaraya (1964) and Watson (1964)) estimates $F$ at a point $x\in\mathcal{X}$ by
$$\hat F_{K,h}(x) = \sum_{i=1}^n w_i(x)\,Y_i, \qquad\text{where}\quad w_i(x) = \frac{K\bigl((x-x_i)/h\bigr)}{\sum_{j=1}^n K\bigl((x-x_j)/h\bigr)}.$$
There exist many possible choices for the kernel $K$, such as the Gaussian kernel $K(u)=e^{-u^2/2}/\sqrt{2\pi}$, the uniform kernel $K(u)=\tfrac{1}{2}\mathbf{1}_{[-1,1]}(u)$, etc. Given a (finite) family of candidate kernels and a grid of possible values of $h$, one may consider the problem of selecting the best kernel estimator among the resulting family of estimators $\hat\mu_{K,h}=(\hat F_{K,h}(x_1),\dots,\hat F_{K,h}(x_n))'$.
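A minimal R sketch of such a family, assuming equispaced design points on [0,1], a sine signal and two illustrative kernels (Gaussian and uniform); the signal, noise level and bandwidth grid are arbitrary choices made only for illustration.

```r
set.seed(2)
x  <- seq(0, 1, length.out = 100)                   # distinct design points (illustrative)
Ys <- sin(2 * pi * x) + rnorm(100, sd = 0.3)        # noisy signal (illustrative F and sigma)
kernels    <- list(gauss = function(u) dnorm(u),
                   unif  = function(u) 0.5 * (abs(u) <= 1))
bandwidths <- c(0.05, 0.1, 0.2)                     # illustrative grid of h
nw <- function(K, h) {                              # Nadaraya-Watson fitted values at the design points
  W <- outer(x, x, function(a, b) K((a - b) / h))   # W[i, j] = K((x_i - x_j) / h)
  as.vector((W %*% Ys) / rowSums(W))
}
# family of candidate estimators of mu, indexed by (kernel, bandwidth)
mu.hat.nw <- lapply(kernels, function(K) sapply(bandwidths, function(h) nw(K, h)))
```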
1.3 A look at the literature
A common way to address the above issues is to use some cross-validation scheme such as leave-one-out or $V$-fold. Even though these resampling techniques are widely used in practice, little is known about their theoretical performance. For more details, we refer to Arlot and Celisse (2010) for a survey on cross-validation techniques applied to model selection. Compared to these approaches, as we shall see, the procedure we propose is less time consuming and easier to implement. Moreover, it does not require knowing how the estimators depend on the data and we can therefore handle the following problem.
Example 5 (Selecting among mute experts).
A statistician is given a collection of estimators produced by a family of experts, each of whom keeps secret the way his or her estimator depends on the observation $Y$. The problem is then to find which expert is the closest to the truth.
Given a selection rule among a family of candidate estimators, an important issue is to compare the risk of the selected estimator to those of the candidate ones. Results in this direction are available in the context of model selection, which can be seen as a particular case of estimator selection. More precisely, for the purpose of selecting a suitable model one starts with a collection of those, typically linear spaces chosen for their approximation properties with respect to the target, and one associates to each model a suitable estimator with values in it. Selecting a model then amounts to selecting an estimator among the resulting collection. For this problem, selection rules based on the minimization of a penalized criterion have been proposed in the regression setting by Yang (1999), Baraud (2000), Birgé and Massart (2001) and Baraud et al. (2009). Another way, usually called Lepski's method, appears in a series of papers by Lepski (1990; 1991; 1992a; 1992b) and was originally designed to perform model selection among collections of nested models. Finally, we mention that other procedures based on resampling have interestingly emerged from the work of Arlot (2007; 2009) and Celisse (2008). A common feature of those approaches lies in the fact that the proposed selection rules apply to specific collections of estimators only.
An alternative to estimator selection is aggregation, which aims at designing a suitable combination of given estimators in order to outperform each of these separately (and even the best combination of these) up to a remainder term. Aggregation techniques can be found in Catoni (1997; 2004), Juditsky and Nemirovski (2000), Nemirovski (2000), Yang (2000a; 2000b; 2001), Tsybakov (2003), Wegkamp (2003), Birgé (2006), Rigollet and Tsybakov (2007), Bunea, Tsybakov and Wegkamp (2007) and Goldenshluger (2009) for $L_2$-losses. Most of the aggregation procedures are based on a sample splitting, one part of the data being used for building the estimators, the remaining part for selecting among these. Such a device requires that the observations be i.i.d. or at least that one has at disposal two independent copies of the data. From this point of view our procedure differs from classical aggregation procedures since we use the whole data to build and select. In the Gaussian regression setting considered here, we mention the results of Leung and Barron (2006) for the problem of mixing least-squares estimators. Their procedure uses the same data to estimate and to aggregate but requires the variance to be known. Giraud (2008) extends their results to the case where it is unknown.
1.4 What is new here?
Our approach for solving the problem of estimator selection is new. We introduce a collection of linear subspaces of $\mathbb{R}^n$ for approximating the estimators of the collection and use a penalized criterion to compare them. As already mentioned and as we shall see, this approach requires no assumption on the family of estimators at hand and is easy to implement, an R package being available at http://w3.jouy.inra.fr/unites/miaj/public/perso/SylvieHuet_en.html.
A general way of comparing estimators in various statistical settings has been described in Baraud (2010). However, the procedure proposed there is mainly abstract and inadequate in the Gaussian framework we consider.
We prove a non-asymptotic risk bound for the estimator we select and show that this bound is optimal in the sense that it essentially cannot be improved (except maybe for numerical constants) by any other selection rule. For the sake of illustration and comparison, we apply our procedure to various problems among which aggregation, model selection, variable selection and selection among linear estimators. In each of these cases, our approach allows us to recover classical results in the area as well as to establish new ones. In the context of aggregation, we compute the aggregation rates for the unknown-variance case. These rates turn out to be the same as those for the known-variance case. For selecting an estimator among a family of linear ones, we propose a new procedure and establish a risk bound which requires almost no assumption on the considered family. Finally, our approach provides a way of selecting a suitable variable selection procedure among a family of candidate ones. It thus provides an alternative to cross-validation, for which little is known theoretically.
The paper is organized as follows. In Section 2 we present our selection rule and the theoretical properties of the resulting estimator. For illustration, we show in Sections 3, 4 and 5 respectively how the procedure can be used to aggregate preliminary estimators, select a linear estimator among a finite collection of candidate ones, or solve the problem of variable selection. Section 6 is devoted to two simulation studies. One aims at comparing the performance of our procedure to that of classical $V$-fold cross-validation for selecting a tuning parameter among a grid. In the other, we compare the performance of the variable selection procedure we propose to that of classical ones such as the Lasso, random forests, and others based on ridge and PLS regression. Finally, the proofs are postponed to Section 7.
Throughout the paper, $C$ denotes a constant that may vary from line to line.
2 The procedure and the main result
2.1 The procedure
Given a collection of estimators of $\mu$ based on $Y$, the selection rule we propose relies on the choice of a family of linear subspaces of $\mathbb{R}^n$, of a collection of (possibly random) subsets of this family (one for each estimator), of a weight function and of a penalty function, both defined on the family of linear subspaces and taking nonnegative values. We introduce these objects below and refer to Sections 3, 4 and 5 for examples.
2.1.1 The collection of estimators
The collection can be arbitrary. In particular, it need not be finite nor countable, and it may consist of a mix of estimators based on the minimization of a criterion, a Bayesian procedure or the guess of some experts. The way these estimators depend on $Y$ need not be known. Nevertheless, we shall see in examples how this information, when available, can be used to improve the performance of our estimation procedure.
2.1.2 The families and
Let us consider a family of linear subspaces of $\mathbb{R}^n$ satisfying the following.
Assumption 1.
The family is finite or countable and each of its elements has dimension at most $n-2$.
To each estimator of the collection, we associate a (possibly random) subset of this family of linear spaces.
Typically, the family should be chosen to possess good approximation properties with respect to the estimators at hand and with respect to $\mu$ specifically. One may associate the whole family to each estimator, but for computational reasons it will be convenient to allow the associated subsets to be smaller. Their choice may be made on the basis of the observation $Y$. We provide examples of such families in the various statistical settings described in Sections 3 to 5.
2.1.3 The weight function and the associated function
We consider a weight function mapping the family of linear spaces into $\mathbb{R}_+$ and assume the following.
Assumption 2.
(2) |
Whenever the family of linear spaces is finite, inequality (2) automatically holds true. However, in practice the quantity involved in (2) should be kept to a reasonable size. When it equals one, the weights define a prior distribution on the family, which gives a Bayesian flavor to the procedure we propose. To the weight function, we associate the function defined by
(3) |
where $(\cdot)_+$ denotes the positive part and the two random variables involved are independent $\chi^2$ random variables with the indicated numbers of degrees of freedom. This function can easily be computed from the quantiles of the Fisher distribution, as we shall see in Section 8.1. From a more theoretical point of view, it is shown in Baraud et al. (2009) that, under Assumption 3 below, there exists a positive constant (depending only on the constant appearing in that assumption) such that
(4) |
Assumption 3.
There exists such that for all ,
2.1.4 The selection criterion
The selection procedure we propose involves a penalty function mapping the family of linear spaces into $\mathbb{R}_+$ with the following property.
Assumption 4.
The penalty function satisfies for some ,
(5) |
Whenever equality holds in (5), it follows from (4) that the penalty measures the complexity of the model in terms of dimension and weight.
Denoting by $\Pi_S$ the orthogonal projection operator onto a linear space $S$, and given the families introduced above, the penalty function and some positive number, we define
(6) |
where
(7) |
2.2 The main result
For each estimator of the collection, let us set
(8) |
This quantity corresponds to an accuracy index for the estimator with respect to its associated family of linear spaces. The following result holds.
Theorem 1.
Let . Assume that Assumptions 1, 2 and 4 hold. There exists a constant (given by (34)) depending on and only such that for any in satisfying
(9) |
we have the following bounds
(10) | |||||
(11) |
(provided that the quantity involved in the expectation in (10) is measurable). Furthermore, if equality holds in (5) and Assumption 3 is satisfied, for each
• if the set is non-random,
(12)
• if there exists a (possibly random) linear space such that with probability 1,
(13)
where is a positive constant only depending on and .
Let us now comment on Theorem 1.
It turns out that inequality (10) leaves no place for a substantial improvement, in the sense that the bound we get is essentially optimal and cannot be improved (apart from constants) by any other selection rule among . To see this, let us assume for simplicity that is finite so that a measurable minimizer of always exists and can be chosen as 0. Let , (to fix ideas), a family of linear spaces satisfying the assumptions of Theorem 1 and , the penalty function achieving equality in (5). Besides, assume that contains a linear space such that and associate to it the weight . If for all , we deduce from (4) and (10) that for some universal constant , whatever and
(14) | |||||
In the opposite direction, the following result holds.
Proposition 1.
There exists a universal constant such that for any finite family of estimators and any selection rule based on $Y$ among this family, there exists a mean $\mu$ such that
(15) |
We see that, up to the estimator in place of and numerical constants, the left-hand sides of (14) and (15) coincide.
In view of commenting on (11) further, we continue to assume that the collection of estimators is finite, so that an exact minimizer of the criterion can be used in (11). A particular feature of (11) lies in the fact that the risk bound pays no price for considering a large collection of estimators. In fact, it is actually decreasing (for the inclusion) with respect to the collection. This means that if one adds a new estimator to the collection (without changing either the overall family of linear spaces or the families associated to the former estimators), the risk bound for the selected estimator can only improve. In contrast, the computation of the selected estimator becomes all the more demanding as the collection grows. More precisely, if the cardinalities of the families of approximation spaces are not too large, the computation essentially requires one evaluation of the criterion per pair consisting of a candidate estimator and one of its approximation spaces.
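To make this computational structure explicit, here is a schematic R sketch of the double minimization: for each candidate estimator, the criterion is minimized over its associated family of approximation spaces, and the estimator with the smallest value is selected. The function crit_term below is a hypothetical placeholder standing in for the penalized criterion of Section 2.1.4, not the paper's actual formula.

```r
# estimators: a list of candidate estimates of mu (each a numeric vector of length n)
# spaces: a list (same length) of families of approximation spaces, one family per estimator
# crit_term(Y, mu.hat, S): placeholder for the penalized criterion of Section 2.1.4 (hypothetical)
select_estimator <- function(Y, estimators, spaces, crit_term) {
  crit <- sapply(seq_along(estimators), function(l) {
    # inner minimization over the family of approximation spaces attached to estimator l
    min(sapply(spaces[[l]], function(S) crit_term(Y, estimators[[l]], S)))
  })
  estimators[[which.min(crit)]]          # outer minimization over the collection of estimators
}
```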
The selection rule we use does not require knowing how the estimators depend on $Y$. In fact, as we shall see, a more important piece of information is the range of each estimator as $Y$ varies. A situation of special interest occurs when each estimator belongs, with probability one, to some (possibly random) linear space of the family. By associating this linear space to the estimator, we deduce from Theorem 1, by using (11) and (13), the following corollary.
Corollary 1.
One may apply this result in the context of model selection. One starts with a collection of models and associates to each of them an estimator with values in the corresponding model. By associating to each such estimator the model it takes its values in, our selection procedure leads to an estimator which satisfies
(17) |
When for all , our selection rule becomes
(18) |
and turns out to coincide with that described in Baraud et al. (2009). Interestingly, Corollary 1 shows that this selection rule can still be used for families of (non-linear) estimators obtained by projecting $Y$ onto linear spaces that are chosen randomly, on the basis of $Y$, among the collection of models, thus doing as if these linear spaces were non-random. Such an estimator can be interpreted as resulting from a model selection procedure among the family of projection estimators and hence, (18) can be used to choose some best model selection rule among a collection of candidate ones.
3 Aggregation
In this section, we consider the problems of Model Selection Aggregation (MS), Convex Aggregation (C) and Linear Aggregation (L) defined below. Given $M$ preliminary estimators of $\mu$, denoted $\hat\mu_1,\dots,\hat\mu_M$, our aim is to build an estimator based on $Y$ whose risk is as close as possible to the smallest approximation error achievable over a set of combinations of these preliminary estimators. According to the aggregation problem at hand, this set is one of the following three: for Problem (MS), the set consisting of the $M$ initial estimators; for Problem (C), the convex hull of the $\hat\mu_j$ (in the literature, one may also find the set of combinations with nonnegative weights summing to at most one, in which case the relevant set is the convex hull of the $\hat\mu_j$ and 0); and for Problem (L), the linear span of the $\hat\mu_j$.
Each of these three aggregation problems is solved separately if, for each of them, one can design an estimator satisfying
(19) |
with a leading constant equal to one and a remainder term, free of $\mu$, satisfying
(20) |
These problems have so far only been considered when the variance is known. The remainder term then corresponds to the best possible upper bound in (19) over all possible means and preliminary estimators and is called the optimal rate of aggregation. For a more precise definition, we refer the reader to Tsybakov (2003). Bunea et al. (2007) considered the problem of solving these three problems simultaneously by building an estimator which satisfies (19) simultaneously for all three sets and some constant. This is an interesting issue since it is impossible to know in practice which aggregation device should be used to achieve the smallest risk bound: as the set of combinations grows (for the inclusion), the bias decreases while the rate increases.
The aim of this section is to show that our procedure provides a way of solving (or nearly solving) the three aggregation problems both separately and simultaneously when the variance is unknown.
Throughout this section, we consider the family of linear spaces consisting of the spaces defined, for each subset $m$ of $\{1,\dots,M\}$, as the linear span of the $\hat\mu_j$ for $j\in m$. All along this section, we shall use the weight function defined on this family by
and a penalty achieving equality in (5). These choices are only made to fix ideas. Note that this weight function satisfies Assumption 2. To avoid trivialities, a mild condition relating $M$ and $n$ is assumed all along.
3.1 Solving the three aggregation problems separately
3.1.1 Linear Aggregation
Problem (L) is the easiest to solve. Let us take the family reduced to the single linear space
(21) |
and associate this single space to every estimator. Minimizing the criterion then amounts to minimizing over this space and hence, the resulting estimator is merely the orthogonal projection of $Y$ onto the linear span of the preliminary estimators. The risk of this estimator satisfies
whatever $\mu$ and the preliminary estimators, which solves the problem of Linear Aggregation.
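To see why a term of order $M\sigma^2$ is the natural residual price here, recall the classical risk decomposition of the projection estimator onto a fixed linear space. The display below is a standard computation stated under the simplifying assumption that the span of the preliminary estimators is deterministic (for instance when they are built from an independent sample); the result above does not require this assumption.

```latex
\mathbb{E}\bigl[\|\Pi_S Y-\mu\|^2\bigr]
  = \|\mu-\Pi_S\mu\|^2 + \dim(S)\,\sigma^2
  \le \inf_{f\in S}\|f-\mu\|^2 + M\,\sigma^2 ,
\qquad S=\operatorname{span}(\hat\mu_1,\dots,\hat\mu_M).
```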
3.1.2 Model Selection Aggregation
To tackle Problem (MS), we take with , that is, ,
(22) |
and associate to each the collection reduced to . Note that and for all , so that under the assumption that we may apply Corollary 1 with (since is finite), and get that for some constant the resulting estimator satisfies
This risk bound is of the form (19) except for the constant, which is not equal to one. We do not know whether Problem (MS) can be solved or not with a leading constant equal to one when the variance is unknown and $M$ is large (possibly larger than $n$).
3.1.3 Convex aggregation
For this problem, we emphasize the aggregation rate with respect to the quantity
(23) |
If , take again the estimator . Since the convex hull of the is a subset of the linear space , for we have
Let us now turn to the case . More precisely, assume that
(24) |
and set . We consider the family of estimators with and
(25) |
The set being compact, admits a minimum over and we set .
Proposition 2.
There exists a universal constant such that
This risk bound is of the form (19) except for the constant, which is not equal to one. Again, we do not know whether Problem (C) can be solved or not with a leading constant equal to one when the variance is unknown and $M$ is possibly larger than $n$.
3.2 Solving the three problems simultaneously
Consider now three estimators with values respectively in the linear span of the preliminary estimators, the set of these preliminary estimators and their convex hull (a separate notation is used for this convex hull to avoid ambiguity). One may take the estimators defined in Section 3.1, but any others would suit. The aim of this section is to select the one with the smallest risk for estimating $\mu$. To do so, we apply our selection procedure to these three estimators, associating to them the families defined by (21), (22) and (25) respectively.
4 Selecting among linear estimators
In this section, we consider the situation where the estimators are linear, that is, of the form $AY$ for some known and deterministic $n\times n$ matrix $A$. As mentioned before, this setting covers many popular estimation procedures including kernel ridge estimators, spline smoothing, Nadaraya estimators, $k$-nearest neighbors, projection estimators, low-pass filters, etc. In some cases $A$ is symmetric (e.g. kernel ridge, spline smoothing, projection estimators), in some others $A$ is non-symmetric and non-singular (as for Nadaraya estimators) and sometimes $A$ can be both singular and non-symmetric (low-pass filters, $k$-nearest neighbors). A common feature of those procedures lies in the fact that they depend on a (possibly multidimensional) tuning parameter and their practical performance can be quite poor if this parameter is not suitably calibrated. A series of papers have investigated the calibration of some of these procedures. To mention a few of them, Cao and Golubev (2006) focus on spline smoothing, Zhang (2005) on kernel ridge regression, Goldenshluger and Lepski (2009) on kernel estimators, and Arlot and Bach (2009) propose a procedure to select among symmetric linear estimators with spectrum in $[0,1]$. The procedure we present handles all these cases in a unified framework. Throughout the section, we assume that the collection of matrices is finite.
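For concreteness, here is a small R sketch building two such matrices $A$: the ridge hat matrix, which is symmetric, and a $k$-nearest-neighbour smoother, which is generally neither symmetric nor invertible. It reuses the design matrix X and the design points x of the earlier sketches; the tuning values are illustrative.

```r
# ridge smoother: A = X (X'X + lambda I)^{-1} X'  (symmetric)
ridge_smoother <- function(X, lambda) X %*% solve(crossprod(X) + lambda * diag(ncol(X)), t(X))
# k-nearest-neighbour smoother on one covariate x: row i averages the k closest design points
knn_smoother <- function(x, k) {
  A <- matrix(0, length(x), length(x))
  for (i in seq_along(x)) A[i, order(abs(x - x[i]))[1:k]] <- 1 / k
  A                                       # generally non-symmetric and singular
}
A1 <- ridge_smoother(X, lambda = 1)       # one candidate linear estimator: A1 %*% Y
A2 <- knn_smoother(x, k = 5)              # another candidate: A2 %*% Ys
```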
4.1 The families
To apply our selection procedure, we need to associate to each matrix $A$ a suitable collection of approximation spaces. To do so, we introduce below a linear space, attached to $A$, which plays a key role in our analysis.
For the sake of simplicity, let us first consider the case where is non-singular. Then is defined as the linear span of the right-singular vectors of associated to singular values smaller than 1. When is symmetric, is merely the linear span of the eigenvectors of associated to eigenvalues not smaller than 1/2. If none of the singular values are smaller than 1, then .
Let us now extend the definition of to singular operators . Let us recall that where stands for the transpose of and for its range. The operator then induces a one to one operator between and . Write for the inverse of this operator from to . The orthogonal projection operator from onto induces a linear operator from into , denoted . Then is defined as the linear span of the right-singular vectors of associated to singular values smaller than 1. Again if this set is empty, . When is non-singular or symmetric, we recover the definition of given above.
For each matrix $A$, take a collection of approximation spaces containing the space introduced above. From a theoretical point of view, it is enough to take this collection reduced to that single space, but in practice it may be wise to use a larger set and, by doing so, possibly improve the approximation of the estimator $AY$ by elements of the collection. One may for example add the linear spans of the right-singular vectors associated to the smallest singular values.
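For a symmetric $A$, the space described above (spanned by the eigenvectors with eigenvalues not smaller than 1/2, as stated earlier in this section) can be computed directly from an eigendecomposition. A minimal R sketch, applied to the ridge smoother of the previous sketch; the handling of the degenerate case is an arbitrary convention of this illustration.

```r
# basis matrix of the span of the eigenvectors of a symmetric smoother A with eigenvalues >= 1/2
space_SA <- function(A, threshold = 0.5) {
  e <- eigen((A + t(A)) / 2, symmetric = TRUE)   # symmetrize to guard against rounding
  keep <- e$values >= threshold
  if (!any(keep)) return(NULL)                   # degenerate case: convention of this sketch only
  e$vectors[, keep, drop = FALSE]
}
basis_A1 <- space_SA(A1)                         # A1: the ridge smoother from the previous sketch
```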
4.2 Choices of , and
4.3 An oracle-type inequality for linear estimators
The following holds.
Corollary 2.
Let , and . If Assumption 1 holds and for all , the estimator satisfies
for some depending on and only.
The problem of selecting some best linear estimator among a family of those has also been considered by Arlot and Bach (2009) in the Gaussian regression framework, and by Goldenshluger and Lepski (2009) in the multidimensional Gaussian white noise model. Arlot and Bach proposed a penalized procedure based on random penalties. Unlike ours, their approach requires that the matrices be symmetric with eigenvalues in $[0,1]$ and that the cardinality of the family be at most polynomial with respect to $n$. Goldenshluger and Lepski proposed a selection rule among families of kernel estimators to solve the problem of structural adaptation. Their approach requires suitable assumptions on the kernels while ours requires none. Nevertheless, we restrict ourselves to the case of the Euclidean loss whereas Goldenshluger and Lepski considered more general ones.
5 Variable selection
Throughout this section, we consider the problem of variable selection introduced in Example 2 and rule out degenerate cases in order to avoid trivialities. When $p$ is small enough (say smaller than 20), this problem can be solved by using a suitable variable selection procedure that explores all the subsets of $\{1,\dots,p\}$. For example, one may use the penalized criterion introduced in Birgé and Massart (2001) when the variance is known, and the one in Baraud et al. (2009) when it is not. When $p$ is larger, such an approach can no longer be applied since it becomes numerically intractable. To overcome this problem, algorithms based on the minimization of convex criteria have been proposed, among which are the Lasso, the Dantzig selector of Candès and Tao (2007) and the elastic net of Zou and Hastie (2005). An alternative to those criteria is the forward-backward algorithm described in Zhang (2008), among others. Since there seems to be no evidence that one of these procedures outperforms all the others, it may be reasonable to mix them all and let the data decide which is the most appropriate to solve the problem at hand. As enlarging the collection of candidate estimators can only improve the risk bound of our estimator, only the CPU resources should limit the number of candidate estimators.
The procedure we propose could not only be used to select among those candidate procedures but also to select the tuning parameters they depend on. From this point of view, it provides an alternative to cross-validation techniques, which are quite popular but offer few theoretical guarantees.
5.1 Implementation roadmap
Start by choosing a family of variable selection procedures. Examples of such procedures are the Lasso, the Dantzig selector and the elastic net, among others. If necessary, associate to each procedure a family of tuning parameters. For example, in order to use the Lasso procedure one needs to choose a tuning parameter among a grid. If a selection procedure requires no tuning parameter, the corresponding family reduces to a single element. For each procedure and each value of its tuning parameter, denote by $\hat m$ the subset of $\{1,\dots,p\}$ corresponding to the selected predictors and by $S_{\hat m}$ the linear span of the columns of $X$ with indices in $\hat m$ (with the convention $S_{\emptyset}=\{0\}$). Associate to the subset $\hat m$ an estimator of $\mu$ with values in $S_{\hat m}$ (one may for example take the projection of $Y$ onto the random linear space $S_{\hat m}$, but any other choice would suit). Finally, take as collection of candidate estimators the family of these estimators, indexed by the procedures and their tuning parameters. All along we assume that this collection is finite (so that, as in Section 2.2, an exact minimizer of the criterion exists).
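A schematic R sketch of this roadmap, reusing X and Y from the sparse-regression sketch of Section 1.1. The two procedures shown (the LARS-Lasso path and ranking by absolute marginal correlation) and the subset sizes are illustrative choices; the use of projection estimators follows the text.

```r
library(lars)
# each candidate procedure returns, for each tuning value, a subset of selected predictors
subsets_lasso <- function(X, Y, steps = 1:10) {
  cf <- coef(lars(X, Y, type = "lasso"))                  # coefficients along the path (row 1 = null model)
  lapply(steps, function(k) which(cf[k + 1, ] != 0))
}
subsets_cor <- function(X, Y, sizes = 1:10) {             # illustrative competitor: marginal correlation ranking
  rk <- order(abs(cor(X, Y)), decreasing = TRUE)
  lapply(sizes, function(k) rk[1:k])
}
# associate to each subset m the projection of Y onto the span of the corresponding columns of X
proj_estimator <- function(X, Y, m) {
  if (length(m) == 0) return(rep(0, length(Y)))           # convention for the empty subset
  Xm <- X[, m, drop = FALSE]
  as.vector(Xm %*% qr.solve(Xm, Y))                       # least-squares fit = orthogonal projection
}
subsets <- c(subsets_lasso(X, Y), subsets_cor(X, Y))      # the family of candidate subsets
candidates <- lapply(subsets, function(m) proj_estimator(X, Y, m))   # the candidate estimators of mu
```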
The approximation spaces and the weight function
Throughout, we shall restrict ourselves to subsets of predictors with cardinality not larger than some given integer. In view of approximating the candidate estimators, we suggest the collection of linear spaces given by
(26) |
We associate to this collection the weight function defined, for each of its elements, by
(27) |
Since
Assumption 2 is satisfied.
Let us now turn to the choice of the families associated to the estimators. The criterion given by (6) cannot be computed with the whole collection above as soon as $p$ is too large. In such a case, one must consider a smaller subset of it and we suggest, for each candidate estimator,
(where the selected subsets are those defined above), or preferably
whenever this latter family is not too large. Note that these two families are random.
5.2 The results
Our choices of and ensure that for all and that
Hence, by applying Corollary 1, we get the following result.
Corollary 3.
Let , and be some positive integer satisfying . Let be a (finite) collection of random subsets of with cardinality not larger than based on the observation and a family of estimators , also based on , such that . By applying our selection procedure, the resulting estimator satisfies
where is a constant depending on the choices of and only.
Again, note that the risk bound we get is non-increasing with respect to the collection of candidate estimators. This means that if one adds a new variable selection procedure or considers more tuning parameters to enlarge this collection, the risk bound we get can only improve.
Without additional information on the estimators it is difficult to compare and . If is of the form for some deterministic subset it is well-known that
Under the assumption that and that belongs to with probability close enough to 1, we can compare the risk of the estimator to the cardinality of .
Corollary 4.
Assume that the assumptions of Corollary 3 hold and that for all . If for some non-void subset with cardinality not larger than , then
where is a constant depending on and only, and
Zhao and Yu (2006) give sufficient conditions on the design to ensure that this probability is exponentially small with respect to $n$ when the family is obtained by using the LARS-Lasso algorithm with different values of the tuning parameter.
6 Simulation study
In the linear regression setting described in Example 2, we carry out a simulation study to evaluate the performance of our procedure on the two following problems.
We first consider the problem, described in Example 3, of tuning the smoothing parameter of the Lasso procedure for estimating $\mu$. The performance of our procedure is compared with that of $V$-fold cross-validation. Secondly, we consider the problem of variable selection. We solve it by using our criterion in view of selecting among a family of candidate variable selection procedures.
Our simulation study is based on a large number of examples which have been chosen in view of covering a large variety of situations. Most of these have been found in the literature in the context of Example 2 either for estimation or variable selection purposes when the number of predictors is large.
The section is organized as follows. The simulation design is given in the following section. Then, we describe how our procedure is applied for tuning the Lasso and performing variable selection. Finally, we give the results of the simulation study.
6.1 Simulation design
One example is determined by the number of observations $n$, the number of variables $p$, the matrix $X$, the values of the parameters $\beta$ and the signal-to-noise ratio. For each example, we carry out 400 simulations of $Y$ as a Gaussian random vector with expectation $X\beta$ and covariance matrix $\sigma^2 I_n$, where $I_n$ is the identity matrix.
The set of examples is composed of several collections, each characterized by a vector of parameters $\beta$ and a set of design matrices $X$, with pairs $(n,p)$ chosen such that $p$ is smaller than, equal to or greater than $n$. The examples are described in further detail in Section 8.2. They are inspired by examples found in Tibshirani (1996), Zou and Hastie (2005), Zou (2006), and Huang et al. (2008) for comparing the Lasso method to the ridge, adaptive Lasso and elastic net methods. They make up a large variety of situations. They include cases where
• the covariates are not, moderately or strongly correlated,
• the covariates with zero coefficients are weakly or highly correlated with covariates with non-zero coefficients,
• the covariates with non-zero coefficients are grouped and correlated within these groups,
• the lasso method is known to be inconsistent,
• few or many effects are present.
6.2 Tuning a smoothing parameter
We consider here the problem of tuning the smoothing parameter of the Lasso estimator as described in Example 3. Instead of considering the Lasso estimators for a fixed grid of smoothing parameters, we rather focus on the sequence of estimators given by the first steps of the LARS-Lasso algorithm proposed by Efron et al. (2004). Hence, the tuning parameter here is the number of steps. In our simulation study, we compare the performance of our criterion with that of $V$-fold cross-validation for the problem of selecting the best estimator among this collection.
6.2.1 The estimator of based on our procedure
We recall that our selection procedure relies on the choices of a collection of approximation spaces for each candidate estimator, a weight function, a penalty function and two constants. We choose the collection of linear spaces defined by (26). We associate to each candidate estimator the family defined in Section 5.1, based on the set of indices corresponding to the predictors returned by the LARS-Lasso algorithm at each step. We take the weight function defined by (27) and a penalty constant consistent with what is suggested in Baraud et al. (2009). The choice of the remaining constant is based on the following considerations. First, choosing it around one seems reasonable since it weights similarly the term which measures how well the estimator fits the data and the approximation term involved in our criterion (6). Second, a simple calculation shows that the constant involved in Theorem 1 is minimal for a value close to 0.6. We therefore carried out our simulations for values varying from 0.2 to 1.5. The results being very similar for values between 0.5 and 1.2, we fix this constant once and for all. The estimator of $\mu$ obtained in this way is the one referred to as ours in the results below.
6.2.2 The estimator of based on -fold cross-validation
For each number of steps, the prediction error is estimated using a $V$-fold cross-validation procedure. The estimator is chosen by minimizing this estimated prediction error.
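A minimal R sketch of this baseline, assuming V = 10 folds and reusing X, Y, n and the lars package from the earlier sketches; the number of folds and of steps are illustrative choices, not necessarily those used in the paper.

```r
V <- 10
folds <- sample(rep(1:V, length.out = n))            # random partition of the observations
steps <- 1:10
cv_err <- sapply(steps, function(k) {
  mean(sapply(1:V, function(v) {
    fit <- lars(X[folds != v, ], Y[folds != v], type = "lasso")
    pred <- predict(fit, newx = X[folds == v, , drop = FALSE],
                    s = k, type = "fit", mode = "step")$fit
    mean((Y[folds == v] - pred)^2)                   # held-out prediction error at step k
  }))
})
k_cv <- steps[which.min(cv_err)]                     # number of steps selected by V-fold cross-validation
```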
6.2.3 The results
The simulations were carried out with R (www.r-project.org) using the library elasticnet.
For each example, we estimate on the basis of the 400 simulations the oracle risk
(29) |
and the Euclidean risks of the estimator selected by our procedure and of the estimator selected by cross-validation, respectively.
The results presented in Table 1 show that our procedure tends to choose a better estimator than cross-validation, in the sense that its risk ratios to the oracle are closer to one.
| procedure | mean | std-err | quantiles | | | 99% | |
|---|---|---|---|---|---|---|---|
| CV | 1.18 | 0.08 | 1.05 | 1.18 | 1.24 | 1.36 | 1.38 |
| our procedure | 1.065 | 0.06 | 1.01 | 1.055 | 1.084 | 1.18 | 2.27 |
Nevertheless, for a few examples these ratios are larger for our procedure than for the CV. These examples correspond to situations where the Lasso estimators are highly biased.
In practice, it is worth considering several estimation procedures in order to increase the chance of having good estimators of $\mu$ among the candidate family. Selecting among candidate procedures is the purpose of the following simulation experiment in the variable selection context.
6.3 Variable selection
In this section, we consider the problem of variable selection and use the procedure and notation introduced in Section 5. To solve this problem, we consider estimators obtained by projecting $Y$ onto the linear span of the columns of $X$ indexed by a random subset of predictors depending on $Y$. Given a family of such random subsets, we obtain a family of candidate estimators. The descriptions of the candidate procedures and of their collections of subsets are postponed to Section 8.3. Let us merely mention that we gather variable selection procedures based on the Lasso, ridge regression, elastic net, PLS1 regression, adaptive Lasso, random forests, and on an exhaustive search among the subsets of predictors with small cardinality. For each procedure, the parameter set corresponds to different choices of tuning parameters. For each candidate subset, the associated family of approximation spaces is reduced to its linear span, so that our selection rule amounts to minimizing over the candidate subsets
(30) |
where is given by (3).
6.3.1 Results
The simulations were carried out with R (www.r-project.org) using the libraries elasticnet, randomForest, pls and the function lm.ridge in the library MASS. We first select the tuning parameters associated with the procedures. More precisely, for each procedure we select an estimator among the corresponding collection by minimizing Criterion (30) over its tuning parameters. We denote by $\hat m$ the selected set of predictors and consider the corresponding projection estimator. For each example and each method, we estimate the risk
of this estimator on the basis of the 400 simulations, and we do the same to calculate that of our estimator, obtained by minimizing Criterion (30) over all procedures and tuning parameters simultaneously.
Let us now define the minimum of these risks over all methods:
We compare the ratios of each method's risk to this minimum in order to judge the performance of the candidate procedures on each example. The means, standard deviations and quantiles of these ratios over the examples are presented in Table 2. In particular, the results show that
• none of the candidate procedures outperforms all the others simultaneously over all examples,
• our procedure, corresponding to the row "all" in Table 2, achieves the smallest mean value; besides, this value is very close to one,
• the variability of our procedure is small compared to that of the others,
• for all examples, our procedure selects an estimator the risk of which does not exceed twice that of the oracle.
| method | mean | std-err | quantiles | | 95% | |
|---|---|---|---|---|---|---|
| Lasso | 2.82 | 9.40 | 1.12 | 1.33 | 6.38 | 127 |
| ridge | 1.76 | 1.90 | 1.42 | 1.82 | 2.87 | 36.9 |
| pls | 1.50 | 1.20 | 1.22 | 1.50 | 2.58 | 17 |
| en | 1.46 | 1.90 | 1.12 | 1.33 | 2.57 | 29 |
| ALridge | 1.20 | 0.31 | 1.15 | 1.26 | 1.51 | 5.78 |
| ALpls | 1.29 | 0.87 | 1.14 | 1.29 | 1.75 | 12.7 |
| rFmse | 4.13 | 9.50 | 1.38 | 2.04 | 19.2 | 118 |
| rFpurity | 3.99 | 10.00 | 1.42 | 2.06 | 15.1 | 138 |
| exhaustive | 22.9 | 45 | 6.30 | 24.5 | 92.9 | 430 |
| all | 1.16 | 0.16 | 1.12 | 1.25 | 1.47 | 1.95 |
The false discovery rate (FDR) and the true discovery rate (TDR) are also quantities of interest in the context of variable selection. These quantities are given in Table 3 for each example, for one particular configuration of the simulation parameters. Except for one example, the FDR is small, while the TDR varies a lot among the examples.
| FDR | 0.045 | 0.026 | 0.004 | 0.026 | 0.018 | 0.041 | 0.012 | 0.026 | 0.042 | 0.15 | 0.014 |
| TDR | 0.74 | 0.63 | 0.18 | 0.63 | 0.17 | 0.99 | 1 | 1 | 0.98 | 0.29 | 0.20 |
7 Proofs
7.1 Proof of Theorem 1
Throughout this section, we use the following notations. For all and , we write
where
(31) |
For all , let be such that
We also write and for the linear space generated by and . It follows from the facts that for all and
and simple algebra that
For and , let us set if and otherwise. For all and , we have and
Hence, by using (5) and (31) we get
(32) | |||||
where
For each ,
and since the variable is independent of and is stochastically larger than , we deduce from the definition of and (2), that on the one hand .
On the other hand, since is arbitrary among and since
we deduce from (32) that for all ,
(33) |
with
(34) |
and (11) follows by taking the expectation on both sides of (33). Note that provided that
is measurable, we have actually proved the stronger inequality
(35) |
Let us now turn to the second part of the Theorem, fixing some . Since equality holds in (5), under Assumption 3 by (4)
If is non-random, for some and all ,
Since , we have
and under Assumption 3, , and hence for all
which leads to (12).
Let us turn to the proof of (13). We set . Since with probability one ,
and it suffices thus to bound the right-hand side. Since equality holds in (5) and since
Under Assumption 3, and we deduce from (4) that for some constant depending only on and
and the result follows from the fact that for all .
7.2 Proof of Proposition 1
For all and , and hence,
Besides, since the minimax rate of estimation over is of order , for some universal constant ,
Putting these bounds together leads to the result.
7.3 Proof of Proposition 2
Under (24), it is not difficult to see that so that is not empty and since for all
Assumptions 1 to 4 are satisfied with . Besides, the set being compact, admits a minimum over (we shall come back to the minimization of this criterion at the end of the subsection) and hence we can take . By applying Theorem 1 and using (12), the resulting estimator satisfies for some universal constant
(36) |
where
(37) |
We bound this quantity from above by using the following approximation result, the proof of which can be found in Makovoz (1996) (more precisely, we refer to the proof of his Theorem 2).
Lemma 1.
For all in the convex hull of the and all , there exists such that and
By using this lemma and the fact that for all , we get
Taking for the integer part of
which belongs to under (24), we get
(38) |
for some universal constant which together with (36) leads to the risk bound
Concerning the computation of , note that
and hence, one can solve the problem of minimizing over by proceeding in two steps. First, for each in the finite set minimize the convex criterion
over the convex (and compact) set . Denote by the resulting minimizers. Then, minimize the quantity for varying among . Denoting by such a minimizer, we have that .
7.4 Proof of Proposition 3
If , by using (12) and the fact that , we have
If , we may use (13) since with probability one and since for all , we get
Finally, let us turn to the case and denote by the best approximation of in . Since , for all ,
and hence by using (12)
where is given by (37). By arguing as in Section 3.1.3, we deduce that under (24)
By putting these bounds together we get the result.
7.5 Proof of Corollary 2
Since Assumptions 1 to 4 are fulfilled and is finite, we may apply Theorem 1 and take . By using (12), we have for some depending on and ,
For all ,
and
and hence, Corollary 2 follows from the next lemma.
Lemma 2.
For all we have
Proof of Lemma 2: Writing and using the fact that and the definition of , we obtain
where are the singular values of counted with their multiplicity and is an orthonormal family of right-singular vectors associated to . If , then and we have . Otherwise, , we may consider as the largest such that and derive that
which proves the assertion .
For the bound , we set and note that
induces a semi-positive quadratic form on . As a consequence the quadratic form is dominated by the quadratic form on . Furthermore
where is the inverse of the linear operator induced by restricted on . We then have that the quadratic form induced by is dominated by the quadratic form
on . In particular the sequence of the eigenvalues of is dominated by the sequence so
which concludes the proof of Lemma 2.
7.6 Proof of Corollary 4
Throughout this section, we write for and for for short. By using (10) with and since , we have
for some constant depending on only. Writing for the event , we have
where
Let us bound from above. Note that and and since , by using (4) we get
for some constant depending on and only.
Let us now turn to . For all , and
Since for all , , by using (4) again, there exists some positive constant depending on and only such that for all , and hence,
Some calculation shows that and hence, by the Cauchy-Schwarz inequality
The result follows by putting the bounds on and together.
References
- Arlot, (2007) Arlot, S. (2007). Rééchantillonnage et Sélection de modèles. PhD thesis, University Paris XI.
- Arlot, (2009) Arlot, S. (2009). Model selection by resampling penalization. Electron. J. Stat., 3:557–624.
- Arlot and Bach, (2009) Arlot, S. and Bach, F. (2009). Data-driven calibration of linear estimators with minimal penalties. Advances in Neural Information Processing Systems (NIPS), 22:46–54.
- Arlot and Celisse, (2010) Arlot, S. and Celisse, A. (2010). A survey of cross-validation procedures for model selection. Statistics Surveys, 4:40–79.
- Baraud, (2000) Baraud, Y. (2000). Model selection for regression on a fixed design. Probab. Theory Related Fields, 117(4):467–493.
- Baraud, (2010) Baraud, Y. (2010). Estimator selection with respect to Hellinger-type risks. Probab. Theory Relat. Fields.
- Baraud et al., (2009) Baraud, Y., Giraud, C., and Huet, S. (2009). Gaussian model selection with an unknown variance. Ann. Statist., 37(2):630–672.
- Birgé, (2006) Birgé, L. (2006). Model selection via testing: an alternative to (penalized) maximum likelihood estimators. Ann. Inst. H. Poincaré Probab. Statist., 42(3):273–325.
- Birgé and Massart, (2001) Birgé, L. and Massart, P. (2001). Gaussian model selection. J. Eur. Math. Soc. (JEMS), 3(3):203–268.
- Boulesteix and Strimmer, (2006) Boulesteix, A. and Strimmer, K. (2006). Partial least squares: a versatile tool for the analysis of high-dimensional genomic data. Briefings in Bioinformatics, 8(1):32–44.
- Breiman, (2001) Breiman, L. (2001). Random forests. Machine Learning, 45:5–32.
- Bunea et al., (2007) Bunea, F., Tsybakov, A. B., and Wegkamp, M. H. (2007). Aggregation for Gaussian regression. Ann. Statist., 35(4):1674–1697.
- Candès and Tao, (2007) Candès, E. and Tao, T. (2007). The Dantzig selector: statistical estimation when $p$ is much larger than $n$. Ann. Statist., 35(6):2313–2351.
- Cao and Golubev, (2006) Cao, Y. and Golubev, Y. (2006). On oracle inequalities related to smoothing splines. Math. Methods Statist., 15(4):398–414 (2007).
- Catoni, (1997) Catoni, O. (1997). Mixture approach to universal model selection. Technical report, Ecole Normale Supérieure, France.
- Catoni, (2004) Catoni, O. (2004). Statistical learning theory and stochastic optimization. In Lecture notes from the 31st Summer School on Probability Theory held in Saint-Flour, July 8–25, 2001. Springer-Verlag, Berlin.
- Celisse, (2008) Celisse, A. (2008). Model selection via cross-validation in density estimation, regression, and change-points detection. PhD thesis, University Paris XI.
- Chen et al., (1998) Chen, S. S., Donoho, D. L., and Saunders, M. A. (1998). Atomic decomposition by basis pursuit. SIAM J. Sci. Comput., 20(1):33–61 (electronic).
- Díaz-Uriarte and Alvares de Andrés, (2006) Díaz-Uriarte, R. and Alvares de Andrés, S. (2006). Gene selection and classification of microarray data using random forest. BMC Bioinformatics, 7(3).
- Efron et al., (2004) Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression. Ann. Statist., 32(2):407–499. With discussion, and a rejoinder by the authors.
- Genuer et al., (2010) Genuer, R., Poggi, J.-M., and Tuleau-Malot, C. (2010). Variable selection using random forests. Pattern Recognition Lett., to appear.
- Giraud, (2008) Giraud, C. (2008). Mixing least-squares estimators when the variance is unknown. Bernoulli, 14(4):1089–1107.
- Goldenshluger, (2009) Goldenshluger, A. (2009). A universal procedure for aggregating estimators. Ann. Statist., 37(1):542–568.
- Goldenshluger and Lepski, (2009) Goldenshluger, A. and Lepski, O. (2009). Structural adaptation via -norm oracle inequalities. Probab. Theory Related Fields, 143(1-2):41–71.
- Helland, (2006) Helland, I. (2006). Partial least squares regression. In Kotz, S., Balakrishnan, N., Read, C., Vidakovic, B., and Johnston, N., editors, Encyclopedia of statistical sciences (2nd ed.), volume 9, pages 5957–5962, New York. Wiley.
- Helland, (2001) Helland, I. (2001). Some theoretical aspects of partial least squares regression. Chemometrics and Intelligent Laboratory Systems, 58:97–107.
- Hoerl and Kennard, (1970) Hoerl, A. and Kennard, R. (1970). Ridge regression: bayes estimation for nonorthogonal problems. Technometrics, 12:55–67.
- Hoerl and Kennard, (2006) Hoerl, A. and Kennard, R. (2006). Ridge regression. In Kotz, S., Balakrishnan, N., Read, C., Vidakovic, B., and Johnston, N., editors, Encyclopedia of statistical sciences (2nd ed.), volume 11, pages 7273–7280, New York. Wiley.
- Huang et al., (2008) Huang, J., Ma, S., and Zhang, C.-H. (2008). Adaptive Lasso for sparse high-dimensional regression models. Statistica Sinica, 18(4):1603–1618.
- Juditsky and Nemirovski, (2000) Juditsky, A. and Nemirovski, A. (2000). Functional aggregation for nonparametric regression. Ann. Statist., 28(3):681–712.
- Lepski˘ı, (1990) Lepskiĭ, O. V. (1990). A problem of adaptive estimation in Gaussian white noise. Teor. Veroyatnost. i Primenen., 35(3):459–470.
- Lepski˘ı, (1991) Lepskiĭ, O. V. (1991). Asymptotically minimax adaptive estimation. I. Upper bounds. Optimally adaptive estimates. Teor. Veroyatnost. i Primenen., 36(4):645–659.
- (33) Lepskiĭ, O. V. (1992a). Asymptotically minimax adaptive estimation. II. Schemes without optimal adaptation. Adaptive estimates. Teor. Veroyatnost. i Primenen., 37(3):468–481.
- (34) Lepskiĭ, O. V. (1992b). On problems of adaptive estimation in white Gaussian noise. In Topics in nonparametric estimation, volume 12 of Adv. Soviet Math., pages 87–106. Amer. Math. Soc., Providence, RI.
- Leung and Barron, (2006) Leung, G. and Barron, A. R. (2006). Information theory and mixing least-squares regressions. IEEE Trans. Inform. Theory, 52(8):3396–3410.
- Makovoz, (1996) Makovoz, Y. (1996). Random approximants and neural networks. J. Approx. Theory, 85(1):98–109.
- Nadaraya, (1964) Nadaraya, E. A. (1964). On estimating regression. Theory of Probability and its Applications, 9(1):141–142.
- Nemirovski, (2000) Nemirovski, A. (2000). Topics in non-parametric statistics. In Lectures on probability theory and statistics (Saint-Flour, 1998), volume 1738 of Lecture Notes in Math., pages 85–277. Springer, Berlin.
- Rigollet and Tsybakov, (2007) Rigollet, P. and Tsybakov, A. B. (2007). Linear and convex aggregation of density estimators. Math. Methods Statist., 16(3):260–280.
- Strobl et al., (2008) Strobl, C., Boulesteix, A.-L., Kneib, T., Augustin, T., and Zeileis, A. (2008). Conditional variable importance for random forests. BMC Bioinformatics, 9(307).
- Strobl et al., (2007) Strobl, C., Boulesteix, A.-L., Zeileis, A., and Hothorn, T. (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics, 8(25).
- Tenenhaus, (1998) Tenenhaus, M. (1998). La régression PLS. Éditions Technip, Paris. Théorie et pratique. [Theory and application].
- Tibshirani, (1996) Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B, 58(1):267–288.
- Tsybakov, (2003) Tsybakov, A. B. (2003). Optimal rates of aggregation. In Proceedings of the 16th Annual Conference on Learning Theory (COLT) and 7th Annual Workshop on Kernel Machines, pages 303–313. Lecture Notes in Artificial Intelligence 2777, Springer-Verlag, Berlin.
- Watson, (1964) Watson, G. S. (1964). Smooth regression analysis. Sankhyā Ser. A, 26:359–372.
- Wegkamp, (2003) Wegkamp, M. (2003). Model selection in nonparametric regression. Ann. Statist., 31:252–273.
- Yang, (1999) Yang, Y. (1999). Model selection for nonparametric regression. Statist. Sinica, 9:475–499.
- (48) Yang, Y. (2000a). Combining different procedures for adaptive regression. J. Multivariate Anal., 74(1):135–161.
- (49) Yang, Y. (2000b). Mixing strategies for density estimation. Ann. Statist., 28(1):75–87.
- Yang, (2001) Yang, Y. (2001). Adaptive regression by mixing. J. Amer. Statist. Assoc., 96(454):574–588.
- Zhang, (2005) Zhang, T. (2005). Learning bounds for kernel regression using effective data dimensionality. Neural Comput., 17(9):2077–2098.
- Zhang, (2008) Zhang, T. (2008). Adaptive forward-backward greedy algorithm for learning sparse representations. Technical report, Rutgers University, NJ.
- Zhao and Yu, (2006) Zhao, P. and Yu, B. (2006). On model selection consistency of Lasso. J. Mach. Learn. Res., 7:2541–2563.
- Zou, (2006) Zou, H. (2006). The adaptive lasso and its oracle properties. J. Amer. Statist. Assoc., 101(476):1418–1429.
- Zou and Hastie, (2005) Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B Stat. Methodol., 67(2):301–320.
8 Appendix
8.1 Computation of
8.2 Simulated examples
The collection of examples is composed of several sub-collections that are detailed below. The first sub-collections are composed of examples where the rows of $X$ are generated as independent centered Gaussian vectors with a given covariance matrix. For each of them, we define a covariance matrix and a vector of parameters $\beta$, and we simulate a set of 5 design matrices accordingly. The sub-collection is then defined as follows:
where and
(39) |
in Section 6.2, and
(40) |
in Section 6.3.
Let us now describe these collections.
Collection
The matrix equals the identity matrix denoted . The parameters satisfy for , for , for , for .
Collection
The matrix is such that , for and with . Otherwise . The parameters are as in Collection .
Collection
The matrix is as in Collection with , the parameters are as in Collection .
Collection
The matrix is such that , for , with , the parameters are as in Collection .
Collection
The matrix is as in Collection with , and the parameters are as in Collection .
Collection
The matrix equals . The parameters satisfy for , for .
Collection
The matrix satisfies for , for , for , with and . The parameters satisfy for , for .
Collection
The matrix satisfies for , for . The parameters satisfy for , , , .
Collection
The matrix is defined as in Example . The parameters satisfy for , for .
Collection
The matrix satisfies for , for . The parameters satisfy for and , otherwise.
Collection
In this last example, we denote by the set of 5 matrices simulated as follows. For , we denote by the column of . Let be generated as i.i.d. and let be generated as i.i.d. . Then for , , for , , for , , for , . The parameters are as in Collection . The collection is defined as the set of examples for , , and .
The collection is thus composed of 660 examples for the signal-to-noise ratio chosen as in (40), and 825 for the ratio chosen as in (39). For some of the examples, the Lasso estimators were highly biased, leading to high values of the oracle ratio, see Equation (29). We only keep the examples for which the Lasso estimator improves the risk of the naive estimator by a sufficient factor. This convention leads us to remove 171 examples out of 825. These pathological examples come from a few of the collections described above. The examples of one of these collections were chosen by Zou to illustrate that the Lasso estimators may be highly biased. All the other examples correspond to matrices that are nearly orthogonal.
8.3 Procedures for calculating sets of predictors
Let where we recall that for , .
The Lasso procedure is described in Section 6.2. The collection where is the set of indices corresponding to the predictors returned by the LARS-Lasso algorithm at step (see Section 6.2).
The ridge procedure is based on the minimization of the least-squares criterion penalized by the squared Euclidean norm of the coefficients, for some positive tuning parameter, see for example Hoerl and Kennard (2006). Tibshirani (1996) noted that in the case of a large number of small effects, ridge regression gives better results than the lasso for variable selection. For each value of the tuning parameter, the ridge regression coefficients are calculated and a collection of predictor sets is built as follows. Rank the predictors by decreasing absolute value of their estimated coefficients and set
Then, the collection is defined as .
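A small R sketch of this construction, reusing X and Y from the earlier sketches and assuming that the ranking is by absolute value of the ridge coefficients (as described above); the penalty grid and maximal subset size are illustrative.

```r
library(MASS)
ridge_subsets <- function(X, Y, lambdas = c(0.1, 1, 10), kmax = 10) {
  out <- list()
  for (lam in lambdas) {
    b  <- coef(lm.ridge(Y ~ X, lambda = lam))[-1]        # ridge coefficients (drop the intercept)
    rk <- order(abs(b), decreasing = TRUE)               # predictors ranked by |coefficient|
    out <- c(out, lapply(1:kmax, function(k) rk[1:k]))   # nested sets of the k top-ranked predictors
  }
  out
}
subsets_ridge <- ridge_subsets(X, Y)
```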
The elastic net procedure proposed by Zou and Hastie (2005) mixes the $\ell_1$ and $\ell_2$ penalties of the Lasso and the ridge procedures. A grid of values is given for the tuning parameter of the $\ell_2$ penalty. For each value in this grid, we collect the active sets of cardinality less than the maximal size that are selected by the elastic net procedure as its $\ell_1$-smoothing parameter varies. For each value of the $\ell_2$ parameter, this collection can be conveniently computed by first calculating the ridge regression coefficients and then applying the LARS-Lasso algorithm, see Zou and Hastie (2005).
The partial least squares regression (PLSR1) aims to reduce the dimensionality of the regression problem by calculating a small number of components that are useful for predicting the response. Several applications of this procedure for analysing high-dimensional genomic data have been reviewed by Boulesteix and Strimmer (2006). In particular, it can be used for calculating subsets of covariates as we did for the ridge procedure. The PLSR1 procedure constructs, for a given number of components, uncorrelated latent components that are highly correlated with the response, see Helland (2006). A grid of values is given for the tuning parameter, namely the number of components. For each value in this grid, the PLS regression coefficients calculated with the corresponding number of components are computed, and the collection of predictor sets is built from these coefficients as for the ridge procedure.
The adaptive lasso procedure proposed by Zou (2006) starts with a preliminary estimator of $\beta$. One then applies the lasso procedure, replacing the parameters in the $\ell_1$ penalty by parameters weighted according to the preliminary estimator raised to some positive power. The idea is to increase the penalty for coefficients that are close to zero, thus reducing the bias in the estimation of $\beta$ and improving the variable selection accuracy. Zou showed that, if the preliminary estimator is a $\sqrt{n}$-consistent estimator of $\beta$, then the adaptive lasso procedure is consistent in situations where the lasso is not. A lot of work has been done around this subject, see Huang et al. (2008) for example.
We apply the procedure, for a fixed value of the exponent, considering two different preliminary estimators:
- using the ridge estimator as preliminary estimator: for each value of the ridge tuning parameter, the adaptive lasso procedure is applied to compute the active sets of cardinality less than the maximal size; this gives the collection denoted ALridge in Table 2.
- using the PLSR1 estimator as preliminary estimator: the procedure is the same as described just above and gives the collection denoted ALpls in Table 2.
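A minimal R sketch of the ridge-initialized variant, reusing X, Y, lm.ridge and lars from the previous sketches. It uses the standard trick of rescaling the columns of X by the preliminary coefficients before running the LARS-Lasso path; the exponent gamma = 1 and the other tuning values are illustrative assumptions, not necessarily the paper's choices.

```r
library(MASS); library(lars)
adaptive_lasso_subsets <- function(X, Y, lambda_ridge = 1, gamma = 1, kmax = 10) {
  b0 <- coef(lm.ridge(Y ~ X, lambda = lambda_ridge))[-1]   # preliminary ridge coefficients
  w  <- abs(b0)^gamma                                      # adaptive weights
  Xw <- sweep(X, 2, w, "*")                                # rescaling columns turns the weighted penalty into a plain lasso
  cf <- coef(lars(Xw, Y, type = "lasso"))                  # lasso path on the rescaled design
  sets <- lapply(seq_len(nrow(cf)), function(i) which(cf[i, ] != 0))
  unique(Filter(function(m) length(m) >= 1 && length(m) <= kmax, sets))
}
subsets_al <- adaptive_lasso_subsets(X, Y)
```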
The random forest algorithm was proposed by Breiman (2001) for classification and regression problems. The procedure averages several regression trees calculated on bootstrap samples. The algorithm returns measures of variable importance that may be used for variable selection, see for example Díaz-Uriarte and Alvares de Andrés (2006), Genuer et al. (2010), Strobl et al. (2007; 2008).
Let us denote by the number of variables randomly chosen at each split when constructing the trees and
For each , we consider the set of indices
where the predictors are ranked according to the variable importance measures. Two importance measures are considered. The first one is based on the decrease in the mean squared error of prediction after permutation of each of the variables; it leads to the collection denoted rFmse. The second one is based on the decrease in node impurities and leads similarly to the collection denoted rFpurity.
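A minimal R sketch of this construction with the randomForest package, reusing X and Y from the earlier sketches; the values of mtry, the number of trees and the subset sizes are illustrative.

```r
library(randomForest)
rf_subsets <- function(X, Y, mtry_values = c(floor(sqrt(ncol(X))), floor(ncol(X) / 3)), kmax = 10) {
  out <- list()
  for (m in mtry_values) {
    rf  <- randomForest(x = X, y = Y, mtry = m, ntree = 500, importance = TRUE)
    imp_mse    <- importance(rf, type = 1)                 # permutation importance (%IncMSE)
    imp_purity <- importance(rf, type = 2)                 # node-impurity importance (IncNodePurity)
    for (imp in list(imp_mse, imp_purity)) {
      rk  <- order(imp, decreasing = TRUE)                 # predictors ranked by importance
      out <- c(out, lapply(1:kmax, function(k) rk[1:k]))   # nested sets of the k most important predictors
    }
  }
  out
}
subsets_rf <- rf_subsets(X, Y)
```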
The exhaustive procedure considers the collection of all subsets of predictors with cardinality smaller than a prescribed maximal size. This collection is denoted exhaustive in Table 2.
Choice of tuning parameters
We have to choose the largest number of predictors considered in the collections. For all methods except the exhaustive one, this number may be large. Nevertheless, to save computing time, we chose it large enough so that the dimension of the estimated subsets was always smaller than this bound. For the exhaustive method, it must be chosen small enough to make the calculation feasible, the admissible value depending on $p$.
For the ridge method we choose , and for the PLSR1 method, .