
Risk Consistency of Cross-Validation with Lasso-Type Procedures

Darren Homrighausen
Department of Statistics
Colorado State University
Fort Collins, CO 80523
darrenho@stat.colostate.edu
   Daniel J. McDonald
Department of Statistics
Indiana University
Bloomington, IN 47408
dajmcdon@indiana.edu
Abstract

The lasso and related sparsity inducing algorithms have been the target of substantial theoretical and applied research. Correspondingly, many results are known about their behavior for a fixed or optimally chosen tuning parameter specified up to unknown constants. In practice, however, this oracle tuning parameter is inaccessible so one must use the data to select one. Common statistical practice is to use a variant of cross-validation for this task. However, little is known about the theoretical properties of the resulting predictions with such data-dependent methods. We consider the high-dimensional setting with random design wherein the number of predictors p grows with the number of observations n. Under typical assumptions on the data generating process, similar to those in the literature, we recover oracle rates up to a log factor when choosing the tuning parameter with cross-validation. Under weaker conditions, when the true model is not necessarily linear, we show that the lasso remains risk consistent relative to its linear oracle. We also generalize these results to the group lasso and square-root lasso and investigate the predictive and model selection performance of cross-validation via simulation. Key words and phrases: Linear oracle; Model selection; Persistence; Regularization

1 Introduction

Following its introduction in the statistical (Tibshirani, 1996) and signal processing (Chen, Donoho, and Saunders, 1998) communities, \ell_{1}-regularized linear regression has become a fixture as both a data analysis tool and as a subject for deep theoretical investigations. In particular, for a response vector Y\in\mathbb{R}^{n}, design matrix \mathbb{X}\in\mathbb{R}^{n\times p}, and tuning parameter \lambda, we consider the lasso problem of finding

\widehat{\beta}_{\lambda}=\operatorname*{argmin}_{\beta}\frac{1}{n}\left|\left|Y-\mathbb{X}\beta\right|\right|_{2}^{2}+\lambda\left|\left|\beta\right|\right|_{1} (1)

for any \lambda, where \left|\left|\cdot\right|\right|_{2} and \left|\left|\cdot\right|\right|_{1} indicate the Euclidean and \ell_{1}-norms respectively. An equivalent but less computationally convenient specification of the lasso problem given by Equation (1) is the constrained optimization version:

\widehat{\beta}_{t}:=\widehat{\beta}(\mathcal{B}_{t})\in\operatorname*{argmin}_{\beta\in\mathcal{B}_{t}}\frac{1}{n}\left|\left|Y-\mathbb{X}\beta\right|\right|_{2}^{2} (2)

where \mathcal{B}_{t}:=\{\beta:\left|\left|\beta\right|\right|_{1}\leq t\}. By convexity, for each \lambda (or t), there is always at least one solution to these optimization problems. While it is true that the solution is not necessarily unique if \textrm{rank}(\mathbb{X})<p, this detail is unimportant for our purposes, and we abuse notation slightly by referring to \widehat{\beta}_{\lambda} as ‘the’ lasso solution.

These two optimization problems are equivalent mathematically, but they differ substantially in practice. Though the constrained optimization problem (Equation (2)) can be solved via quadratic programming, most algorithms use the Lagrangian formulation (Equation (1)). In this paper, we address both estimators as each is more amenable to theoretical treatment in different contexts.
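As a concrete illustration, the following minimal sketch fits the Lagrangian form in Equation (1) on simulated data using scikit-learn; it is not the authors' implementation, and the simulated design, sparsity level, and value of \lambda are placeholders. Note that scikit-learn's Lasso minimizes (1/(2n))||Y-\mathbb{X}\beta||_{2}^{2}+\alpha||\beta||_{1}, so \alpha=\lambda/2 corresponds to the scaling in Equation (1).

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 200
X = rng.standard_normal((n, p))
beta_star = np.zeros(p)
beta_star[:5] = 1.0                          # a sparse "true" coefficient vector
y = X @ beta_star + rng.standard_normal(n)

lam = np.sqrt(np.log(p) / n)                 # an oracle-style choice, up to constants
fit = Lasso(alpha=lam / 2, fit_intercept=False).fit(X, y)
beta_hat = fit.coef_                         # one solution of Equation (1)
```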

The literature contains numerous results regarding the statistical properties of the lasso, and, while it is beyond the scope of this paper to give a complete literature review, we highlight some of these results here. Early results about the asymptotic distribution of the lasso solution are shown in Knight and Fu (2000) under the assumption that the sample covariance matrix has a nonnegative definite limit and p is fixed. Many authors (e.g. Donoho, Elad, and Temlyakov, 2006; Meinshausen and Bühlmann, 2006; Meinshausen and Yu, 2009; Wainwright, 2009; Zhao and Yu, 2006; Zou, 2006) have investigated model selection properties of the lasso—showing that when the best predicting model is linear and sparse, the lasso will tend to asymptotically recover those predictors. The literature has considered this setting under various “irrepresentability” conditions which ensure that the relevant predictors are not too highly correlated with irrelevant ones. Bickel, Ritov, and Tsybakov (2009) analyze the lasso and the Dantzig selector (Candès and Tao, 2007) under restricted eigenvalue conditions with an oracle tuning parameter. Finally, Belloni, Chernozhukov, and Wang (2014) develop results for a related method, the square-root lasso, with heteroscedastic noise and oracle tuning parameter.

Theoretical results such as these and others depend critically on the choice of tuning parameters and are typically of the form: if t=t_{n}=o((n/\log p)^{1/4}), then \widehat{\beta}_{t_{n}} possesses some desired behavior (correct model selection, risk consistency, sign consistency, et cetera). However comforting such results may be, they say little about the properties of the lasso when the tuning parameter is chosen in a data-dependent way.

In this paper, we show that the lasso under random design with cross-validated tuning parameter is indeed risk consistent under some conditions on the joint distribution of the design matrix \mathbb{X} and the response vector Y. Additionally, we demonstrate that our framework can be used to show similar results for other lasso-type methods. Our results build on the previous theory presented in Homrighausen and McDonald (2014) and Homrighausen and McDonald (2013). Homrighausen and McDonald (2014) proves risk consistency for cross-validation under strong conditions on the data generating process, most notably n>p, and on the cross-validation procedure (requiring leave-one-out CV). The results in this paper differ from those in Homrighausen and McDonald (2013) in a number of ways. The current paper examines the Lagrangian formulation of the lasso problem under typical conditions, weakens the conditions on an upper bound for t, provides more refined results via concentration inequalities, examines the influence of K, and includes related results for the group lasso and the square-root lasso.

1.1 Overview of results

The criterion we focus on for this paper is risk consistency (alternatively known as persistence). That is, we investigate the difference between the prediction risk of the lasso estimator with tuning parameter estimated by cross-validation and the risk of the best linear oracle predictor (with oracle tuning parameter). Risk consistency of the lasso has previously been studied by Greenshtein and Ritov (2004); Bunea, Tsybakov, and Wegkamp (2007); van de Geer (2008) and Bartlett, Mendelson, and Neeman (2012). Their results, in contrast to ours, assume that the tuning parameter is selected in an oracle fashion.

We present two main results which make progressively stronger assumptions on the data generating process and use both forms of the lasso estimator. The first imposes strong conditions on the design matrix and assumes that the true model is both linear and sparse. The second is more general and allows the true model to be nonlinear, so that the linear model need not be correctly specified. For this reason, our focus is on risk consistency rather than estimation of a “true” parameter or correct identification of a “true” sparsity pattern. Additionally, well-known results of Shao (1993) imply that cross-validation leads to inconsistent model selection in general, suggesting that results for sparse recovery may not exist. Although prediction is an important goal, one is often interested in variable selection for more interpretable models or follow-up experiments. In light of the negative results in Shao (1993), we are unable to offer theoretical results which promise consistent model selection by cross-validation, but simulations in Section 4 suggest that it performs well nevertheless. Both the estimation and sparse recovery settings are frequently studied in the literature assuming that the tuning parameter is the oracle one and that the data generating model is linear (e.g. Bunea, Tsybakov, and Wegkamp, 2007; Candès and Plan, 2009; Donoho, Elad, and Temlyakov, 2006; Leng, Lin, and Wahba, 2006; Meinshausen and Bühlmann, 2006; Meinshausen and Yu, 2009).

In the first case, when the truth is linear, we examine the Lagrangian formulation in Equation (1). In this scenario, we prove convergence rates which differ only by a \log factor relative to those achievable with the oracle tuning parameter (e.g. Negahban, Ravikumar, Wainwright, and Yu, 2012; Bühlmann and van de Geer, 2011; Bunea, Tsybakov, and Wegkamp, 2007). That is, for an s^{*}-sparse linear model with restricted isometry conditions on the covariance of the design, the risk of the cross-validated estimator approaches the oracle risk at a rate of O(s^{*}\log(p)\log(n)/n). Under more general conditions, we follow the approach of Greenshtein and Ritov (2004) and examine the constrained optimization form in Equation (2). Using our methods, we require that the set of candidate predictors, \mathcal{B}_{t_{n}}, satisfies t_{n}=o\left(n^{1/4}/(m_{n}(\log p)^{1/4+1/(2q)})\right) where m_{n} is a sequence which approaches infinity and q characterizes the tail behavior of the data. This is essentially as quickly as one could hope relative to Greenshtein and Ritov (2004) under our more general assumptions on the design matrix. We note however that, using empirical process techniques, Bartlett, Mendelson, and Neeman (2012) have been able to improve the rate shown in Greenshtein and Ritov (2004) to t_{n}=o\left(n^{1/2}/(\log^{3/2}n\log^{3/2}(np))\right) for sub-Gaussian design and oracle tuning parameter.

1.2 Tuning parameter selection methods and outline of the paper

There are several proposed data-dependent techniques for choosing t or \lambda. Kato (2009) and Tibshirani and Taylor (2012) investigate estimating the “degrees of freedom” of a lasso solution. With an unbiased estimator of the degrees of freedom, the tuning parameter can be selected by minimizing the empirical risk penalized by a function of this estimator. Note that this approach requires an estimate of the variance, which is nontrivial when p>n (Giraud, Huet, and Verzelen, 2012; Homrighausen and McDonald, 2016). Another risk estimator is the adapted Bayesian information criterion proposed by Wang and Leng (2007) which uses a plug-in estimator based on the second-order Taylor’s expansion of the risk. Arlot and Massart (2009) and Saumard (2011) consider “slope heuristics” as a method for penalty selection with Gaussian noise, paying particular attention to the regressogram estimator in the first case and piecewise polynomials with p fixed in the second. Finally, Sun and Zhang (2012) present an algorithm to jointly estimate the regression coefficients and the noise level. This results in a data-driven value for the tuning parameter which possesses oracle properties under some regularity conditions.

However, many authors (e.g. Efron et al., 2004; Friedman et al., 2010; Greenshtein and Ritov, 2004; Tibshirani, 1996, 2011; Zou et al., 2007 and as discussed by van de Geer and Bühlmann, 2011, Section 2.4.1) recommend selecting t or \lambda in the lasso problem by minimizing a K-fold cross-validation estimator of the risk (see Section 2 for the precise definition). Cross-validation is generally well-studied in a number of contexts, especially model selection and risk estimation. In the context of model selection, Arlot and Celisse (2010) give a detailed survey of the literature emphasizing the relationship between the sizes of the validation set and the training set as well as discussing the positive bias of cross-validation as a risk estimator.

Some results supporting the use of cross-validation for statistical methods other than the lasso are known. For instance, Stone (1974, 1977) outlines various conditions under which cross-validated methods can result in good predictions. More recently, Dudoit and van der Laan (2005) find finite sample bounds for various cross-validation procedures. These results do not address the lasso nor parameter spaces with increasing dimensions, and furthermore, apply to choosing the best estimator from a finite collection of candidate estimators. Lecué and Mitchell (2012) provide oracle inequalities related to using cross-validation with the lasso; however, they treat the problem as aggregating a dictionary of lasso estimators with different tuning parameters, and the results are stated for fixed p rather than the high-dimensional setting investigated here. Most recently, Flynn, Hurvich, and Simonoff (2013) investigate numerous methods for tuning parameter selection in penalized regression, but the theoretical results hold only when p/n\rightarrow 0 and not for cross-validation. In particular, the authors state “to our knowledge the asymptotic properties of [K]-fold CV have not been fully established in the context of penalized regression” (Flynn, Hurvich, and Simonoff, 2013, p. 1033).

Rather than cross-validation, one may use information criteria such as AIC (Akaike, 1974) or BIC (Schwarz, 1978). These techniques are frequently advocated in the literature (for example Bühlmann and van de Geer, 2011; Wang et al., 2007; Tibshirani, 1996; Fan and Li, 2001), but the classical asymptotic arguments underlying these methods apply only for p fixed and rely on maximum likelihood estimates (or Bayesian posteriors) for all parameters including the noise. The theory in the high-dimensional setting supporting these methods is less complete. Recent work has developed new information criteria with supporting asymptotic results if \textrm{rank}(\mathbb{X})=p, where p is still allowed to increase. For example, the criterion developed by Wang et al. (2009) selects the correct model asymptotically even if p\rightarrow\infty as long as p/n\rightarrow 0. If p is allowed to increase more quickly than n, theoretical results assume \sigma^{2} is known to get around the difficult task of high-dimensional variance estimation (Chen and Chen, 2012; Zhang and Shen, 2010; Kim et al., 2012; Fan and Tang, 2013).

In Section 2, we outline the mathematical setup for the lasso prediction problem and discuss some empirical concerns. Section 3 contains the main result and associated conditions. Section 4 compares different choices of K for cross-validation via simulation, while Section 5 summarizes our contribution and presents some avenues for further research.

2 Notation and definitions

2.1 Preliminaries

Suppose we observe data Z_{i}^{\top}=(Y_{i},X_{i}^{\top}) consisting of predictor variables, X_{i}\in\mathbb{R}^{p_{n}}, and response variables, Y_{i}\in\mathbb{R}, where Z_{i}\sim\mu_{n} are independent and identically distributed for i=1,2,\ldots,n and the distribution \mu_{n} is in some class \mathcal{F} to be specified. Here, we use the notation p_{n} to allow the number of predictor variables to change with n. Similarly, we index the distribution \mu_{n} to emphasize its dependence on n. For simplicity, in what follows, we omit the subscript n when there is little risk of confusion.

We consider the problem of estimating a linear functional f(\mathcal{X})=\mathcal{X}^{\top}\beta for predicting \mathcal{Y}, when \mathcal{Z}^{\top}=(\mathcal{Y},\mathcal{X}^{\top})\sim\mu_{n} is a new, independent random variable from the same distribution as the data and \beta=(\beta_{1},\ldots,\beta_{p})^{\top}. For now, we do not assume a linear model, only the usual regression framework where \mathcal{Y}=f^{*}(\mathcal{X})+\epsilon, with \epsilon and \mathcal{X} independent and f^{*} some unknown function. We will use zero-based indexing for \mathcal{Z} so that \mathcal{Z}_{0}=\mathcal{Y}. To measure performance, we use the L^{2}-risk of the predictor \beta:

R\left(\beta\right):=\mathbb{E}_{\mu_{n}}\left[(\mathcal{Y}-\mathcal{X}^{\top}\beta)^{2}\right]. (3)

Note that this is a conditional expectation: for any estimator \widehat{\beta},

R\left(\widehat{\beta}\right):=\mathbb{E}_{\mu_{n}}\left[(\mathcal{Y}-\mathcal{X}^{\top}\widehat{\beta})^{2}\ |\ Z_{1},\ldots,Z_{n}\right], (4)

and the expectation is taken only over the new data \mathcal{Z} and not over any observables which may be used to choose \widehat{\beta}.

Using the n independent observations Z_{1},\ldots,Z_{n}, we can form the response vector Y:=(Y_{i})_{i=1}^{n} and the design matrix \mathbb{X}:=[X_{1},\ldots,X_{n}]^{\top}. Then, given a vector \beta, we write the squared-error empirical risk function as

\widehat{R}\left(\beta\right):=\frac{1}{n}||Y-\mathbb{X}\beta||_{2}^{2}=\frac{1}{n}\sum_{i=1}^{n}(Y_{i}-X_{i}^{\top}\beta)^{2}. (5)

Analogously to Equation (5), we write the K-fold cross-validation estimator of the risk with respect to the tuning parameter t, which we abbreviate to CV-risk or just CV, as

\widehat{R}_{V_{n}}\left(t\right)=\widehat{R}_{V_{n}}\left(\widehat{\beta}_{t}^{(v_{1})},\ldots,\widehat{\beta}_{t}^{(v_{K})}\right):=\frac{1}{K}\sum_{v\in V_{n}}\frac{1}{|v|}\sum_{r\in v}\left(Y_{r}-X_{r}^{\top}\widehat{\beta}_{t}^{(v)}\right)^{2}. (6)

Here, V_{n}=\{v_{1},\ldots,v_{K}\} is a set of validation sets, \widehat{\beta}_{t}^{(v)} is the estimator in Equation (2) with the observations in the validation set v removed, and |v| indicates the cardinality of v. Notice in particular that the cross-validation estimator of the risk is a function of t rather than a single predictor \beta—it uses K different predictors at a fixed t, averaging over their performance on the respective held-out sets. Over the course of the paper, we will freely exchange \lambda for t in this definition.

Lastly, we define the CV-risk minimizing choice of tuning parameter to be

\widehat{t}=\operatorname*{argmin}_{t\in T}\widehat{R}_{V_{n}}\left(t\right). (7)
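As a concrete illustration of Equations (6) and (7), the following minimal sketch computes the K-fold CV-risk curve over a user-supplied grid of tuning parameters and returns its minimizer; here we work with \lambda, using the fact that \lambda and t may be exchanged in this definition. This is not the authors' code: the grid, fold construction, and use of scikit-learn's Lasso (whose penalty \alpha corresponds to \lambda/2 under the scaling of Equation (1)) are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

def cv_risk_curve(X, y, lambdas, K=10, seed=0):
    """K-fold CV-risk, Equation (6), evaluated on the grid `lambdas`."""
    risks = np.zeros(len(lambdas))
    for train_idx, val_idx in KFold(K, shuffle=True, random_state=seed).split(X):
        for j, lam in enumerate(lambdas):
            fit = Lasso(alpha=lam / 2, fit_intercept=False, max_iter=10_000)
            fit.fit(X[train_idx], y[train_idx])
            resid = y[val_idx] - X[val_idx] @ fit.coef_
            risks[j] += np.mean(resid ** 2) / K   # average over the K folds
    return risks

# Equation (7): lambda-hat minimizes the CV-risk over the grid.
# lambdas = np.logspace(-3, 0, 50)
# lam_hat = lambdas[np.argmin(cv_risk_curve(X, y, lambdas))]
```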

In our setting, we will take T (or \Lambda) as an interval subset of the nonnegative real numbers which needs to be defined by the data-analyst. The choice of T is an important part of the performance of \widehat{\beta}_{\widehat{t}} and requires some explanation. First, we provide some insight into the computational load of using CV-risk to find \widehat{t}.

2.2 Computations

In practice, CV-risk is known to be time consuming and somewhat unstable due to the randomness associated with forming V_{n}. For a fixed v\in V_{n}, suppose \widehat{\beta}_{\lambda}^{(v)} is found for the entire lasso path via the lars (Efron et al., 2004) algorithm, which can be computed in the same computational complexity as a least squares fit. To fix ideas, suppose n>p, which means lars has computational complexity O(np^{2}+p^{3}). Hence, as the feature matrix \mathbb{X} with |v| rows removed has approximately n(K-1)/K rows, \widehat{\beta}_{\lambda}^{(v)} can be computed for all \lambda in O((n(K-1)/K)p^{2}+p^{3}) time. Repeating this K times means the computational complexity for forming \widehat{R}_{V_{n}}\left(\lambda\right) over all \lambda is O(n(K-1)p^{2}+p^{3}). If K is a fixed fraction of n, CV-risk has computational complexity of order (np)^{2}, which is a factor of n slower than a single lasso fit.

Though more expensive on a single processor, CV-risk is readily parallelizable over the K folds and therefore (ignoring communication costs between processors) CV-risk could actually be faster to compute than \widehat{R} (and hence \widehat{\beta}_{\lambda}) as n(K-1)/K<n. This advantage is of course lost if we ultimately compute \widehat{\beta}_{\widehat{\lambda}}. However, this computational advantage would be maintained if we instead report

\tilde{\beta}=\frac{1}{K}\sum_{v\in V_{n}}\tilde{\beta}^{(v)}, (8)

where \tilde{\beta}^{(v)} is the lasso estimate trained with the observations in v removed and with tuning parameter chosen by minimizing the empirical risk over the test observations in v. For K=4, for example, this would provide a 25% reduction in computation time. The properties of this approximation are an interesting avenue for further investigation.
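For completeness, here is a hedged sketch of the fold-averaged estimator in Equation (8): each fold tunes on its own held-out set and the resulting K coefficient vectors are averaged. The grid of candidate \lambda values and the scikit-learn parameterization are assumptions, not part of the paper.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

def fold_averaged_lasso(X, y, lambdas, K=4, seed=0):
    """Average of per-fold lasso fits, each tuned on its own validation fold."""
    beta_tilde = np.zeros(X.shape[1])
    for train_idx, val_idx in KFold(K, shuffle=True, random_state=seed).split(X):
        best_risk, best_coef = np.inf, None
        for lam in lambdas:
            coef = Lasso(alpha=lam / 2, fit_intercept=False,
                         max_iter=10_000).fit(X[train_idx], y[train_idx]).coef_
            risk = np.mean((y[val_idx] - X[val_idx] @ coef) ** 2)
            if risk < best_risk:
                best_risk, best_coef = risk, coef
        beta_tilde += best_coef / K        # Equation (8)
    return beta_tilde
```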

2.3 Choosing the sets \Lambda and T

The data analyst must be able to solve the optimization problem in Equation (7). For \Lambda, we must choose a lower bound: \Lambda=[\lambda_{n},\infty). This implies we must choose \lambda_{n} as a function of the data. While it is tempting to allow \lambda_{n}=0, this results in numerical and practical implementation issues if \textrm{rank}(\mathbb{X})<p and is unnecessary as the theory will show. However, the lower bound will have a nontrivial impact on the quality of the recovery, as choosing a value too large may eliminate the best solutions. We will see in the next section that under some assumptions on the data generating process, we can safely choose a particular \lambda_{n}>0 that allows order \log n coefficients to be selected without compromising the quality of the estimator.

In the case of T, an upper bound must be selected for any grid-search optimization procedure. As we will impose much weaker conditions on the data generating process, choosing such an upper bound is more complicated. Note that, by Equation (2), \widehat{\beta}_{t} must be in the \ell_{1}-ball with radius t. This constraint is only binding (Osborne, Presnell, and Turlach, 2000) if

t<\min_{\eta\in\mathcal{K}}||\widehat{\beta}_{\infty}+\eta||_{1}=:t_{0},

where \widehat{\beta}_{\infty}=\widehat{\beta}(\mathbb{R}^{p}) is a least-squares solution and \mathcal{K}:=\{a:\mathbb{X}a=0\}. Observe that \mathcal{K}=\{0\} if n\geq p and otherwise \mathcal{K} has dimension p-n, which would imply \widehat{\beta}_{\infty} is not unique. Both of these statements assume that the columns of \mathbb{X} contain a linearly independent set of size \min\{n,p\}. See Tibshirani (2013) for the more general situation. In either case, if t\geq t_{0}, then \widehat{\beta}_{t} is equal to a least-squares solution. Based on this argument and the fact that 0\in\mathcal{K}, it is tempting to define the upper bound to be t_{\max}:=||\widehat{\beta}_{\infty}||_{1}, where \widehat{\beta}_{\infty}=(\mathbb{X}^{\top}\mathbb{X})^{\dagger}\mathbb{X}^{\top}Y is the least-squares solution when (\cdot)^{\dagger} is given by the Moore-Penrose inverse. However, this upper bound has at least two problems. First, although theoretically well-defined, as with setting \lambda_{n}=0, it suffers from numerical and practical implementation issues if \textrm{rank}(\mathbb{X})<p. The second problem is that it grows much too fast, at least order \sqrt{n}, therefore potentially including solutions which will have low bias but very high variance.

Instead, we define an upper bound based on a rudimentary variance estimator, t_{\max}:=\left|\left|Y\right|\right|_{2}^{2}/a_{n}, where (a_{n}) is an increasing sequence of constants. Hence, we consider the optimization interval to be T=[0,t_{\max}]. We defer fixing a particular sequence (a_{n}) until the next section. As our main results demonstrate, this choice of T is enough to ensure that a risk consistent sequence of tuning parameters can be selected.
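The following is a minimal sketch of this data-driven search set T=[0,t_{\max}]. It assumes K-fold cross-validation (so b_{n}\approx n/K), anticipates the choice of a_{n} suggested later in Corollary 5 with q=2, and takes the slowly diverging sequence m_{n}=\log n purely for illustration; none of these specific choices are mandated by the paper.

```python
import numpy as np

def t_max(y, p, K=10):
    """Data-driven upper bound t_max = ||y||_2^2 / a_n for the search set T."""
    n = len(y)
    b_n = n // K                           # validation-fold size for K-fold CV
    m_n = np.log(n)                        # any slowly diverging sequence
    # a_n = n (log p)^{1/4 + 1/(2q)} m_n / b_n^{1/4}, evaluated at q = 2
    a_n = n * np.log(p) ** 0.5 * m_n / b_n ** 0.25
    return np.sum(y ** 2) / a_n

# T = [0, t_max(y, p)] is then searched by cross-validation as in Equation (7).
```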

Remark 1.

We emphasize here that using a cross-validated tuning parameter involves more than simply allowing the tuning parameter to be selected in a data-dependent manner. In order to meaningfully analyze tuning parameter selection, we allow the search set T and the tuning parameter to be chosen based on the data.

Remark 2.

The computational implementation of CV for an interval \Lambda (or T) deserves some discussion. Two widely used algorithms for lasso are glmnet (Friedman et al., 2010), which uses coordinate descent, and lars (Efron et al., 2004), which leverages the piece-wise linearity of the lasso solution as \lambda varies (the lasso path). The package glmnet is much faster than lars; however, glmnet examines a grid of values, \lambda_{j}\in\Lambda, j=1,\ldots,J say, and approximates the solution at each \lambda_{j} with increasing accuracy depending on the number of iterations. On the other hand, lars evaluates the entire solution path exactly, such that it is theoretically possible to choose any \lambda\in\Lambda via numerical optimization. However, optimizing Equation (7) with standard solvers can be difficult due to a possible lack of convexity. In both cases, the extremes of the interval \Lambda will affect the quality of the solution.

3 Main results

In this section, we present results for both forms of the lasso estimator in Equations (1) and (2) under more and less restrictive assumptions, respectively. First, we define the types of random variables \mathcal{Z} which we allow. To quantify the tail behavior of our data, we appeal to the notion of an Orlicz norm.

Definition 1.

For any random variable \xi and any function \psi that is nondecreasing and convex with \psi(0)=0, define the \psi-Orlicz norm

||\xi||_{\psi}=\inf\left\{c>0:\mathbb{E}\psi\left(\frac{|\xi|}{c}\right)\leq 1\right\}.

For any integer r\geq 1, we are interested in both the usual L^{r}-norm given by ||\xi||_{r}:=(\mathbb{E}|\xi|^{r})^{1/r} and the norm given by choosing \psi(x)=\psi_{r}(x):=\exp(x^{r})-1, represented notationally as ||\xi||_{\psi_{r}}. Note that if ||\xi||_{\psi_{r}}<\infty, then for sufficiently large x, there are constants C_{1},c_{2} such that P(|\xi|>x)\leq C_{1}\exp(-c_{2}x^{r}).

In the particular case of the \psi_{2}-Orlicz norm, if ||\xi||_{\psi_{2}}<\kappa, it holds that \mathbb{P}(|\xi|>x)\leq 2\exp(-x^{2}/\kappa^{2}) and \mathbb{E}[|\xi|^{k}]\leq 2\kappa^{k}\Gamma(k/2+1), where \Gamma(t)=\int_{0}^{\infty}x^{t-1}e^{-x}dx is the Gamma function. Additionally, (\mathbb{E}|\xi|^{r})^{1/r}=||\xi||_{r}\leq r!\,||\xi||_{\psi_{1}} and ||\xi||_{\psi_{1}}\leq(\log 2)^{1/r-1}||\xi||_{\psi_{r}}.
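As a standard worked example of Definition 1 (a routine calculation, not taken from the paper), consider \xi\sim N(0,1) and the \psi_{2}-Orlicz norm. For any c^{2}>2,

\mathbb{E}\,\psi_{2}\left(|\xi|/c\right)=\mathbb{E}\exp(\xi^{2}/c^{2})-1=\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}\exp\left(-x^{2}\left(\frac{1}{2}-\frac{1}{c^{2}}\right)\right)dx-1=\left(1-\frac{2}{c^{2}}\right)^{-1/2}-1.

Setting the right-hand side equal to 1 gives c^{2}=8/3, so ||\xi||_{\psi_{2}}=\sqrt{8/3}, and the tail bound above applies with any \kappa>\sqrt{8/3}.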

Before outlining our results, we define the following set of distributions.

Definition 2.

Let 1\leq q<\infty and C_{q} be a constant independent of n. Then define

\mathcal{F}_{q}:=\left\{(\mu_{n}):\mu_{n}\textrm{ is a measure on }\mathbb{R}^{p_{n}}\textrm{ and }\max_{1\leq j,k\leq p}||\xi_{j}\xi_{k}-\mathbb{E}_{\mu_{n}}\xi_{j}\xi_{k}||_{\psi_{q}}\leq C_{q}\right\};

that is, all centered cross products of a random variable \xi\sim\mu_{n} have uniformly finite \psi_{q}-Orlicz norm, independent of n. Define the analogous set \mathcal{F}_{\infty} which contains the measures \mu_{n} such that |\xi_{j}|\leq C_{q} almost surely \mu_{n} for each j=1,\ldots,p.

Remark 3.

To make subsequent expressions involving q sensible, interpret, for any 0<c<\infty, c/\infty=0 and \infty/\infty=1.

While \mu_{n} is a measure on \mathbb{R}^{p}, indexing with n is more natural than indexing with p given that our results allow p_{n} to increase with n. The moment condition in Definition 2 is a common one (Greenshtein and Ritov, 2004; Nardi and Rinaldo, 2008; Bartlett, Mendelson, and Neeman, 2012) for showing risk consistency of lasso-type methods in high dimensional settings.

To simplify our exposition, we make the following condition about the size of the validation sets for CV.

Condition 1.

The sequence of validation sets \{V_{n}\}_{n=1}^{\infty} is such that, as n\rightarrow\infty, |v|\asymp c_{n} for all v\in V_{n} and some sequence c_{n}.

This condition is intended to rule out some pathological cases of unbalanced validation sets, but standard CV methods satisfy it. For example, with K-fold cross-validation, we can take c_{n}=\lfloor n/K\rfloor, which is the integer part of n/K. For n design random variables X_{1},\ldots,X_{n} and oracle prediction function f^{*}, define \mathbf{f}_{n}^{*}:=(f^{*}(X_{1}),\ldots,f^{*}(X_{n}))^{\top}.

3.1 Persistence when f^{*} is linear

If we are willing to impose strong conditions on \mu_{n}, as in Bunea, Tsybakov, and Wegkamp (2007) and Meinshausen (2007), then we can obtain cross-validated lasso estimators which achieve nearly oracle rates.

If we assume that \mathbb{E}[\mathcal{Y}\ |\ \mathcal{X}]=f^{*}(\mathcal{X})=\mathcal{X}^{\top}\beta^{*}, then we can write the risk of an estimator \widehat{\beta}_{\widehat{\lambda}} as

R\left(\widehat{\beta}_{\widehat{\lambda}}\right)=\underbrace{R\left(\widehat{\beta}_{\widehat{\lambda}}\right)-R(\beta^{*})}_{\mbox{excess risk}}+\underbrace{R(\beta^{*})}_{\mbox{noise}}=\underbrace{R\left(\widehat{\beta}_{\widehat{\lambda}}\right)-\sigma^{2}}_{\mbox{excess risk}}+\underbrace{\sigma^{2}}_{\mbox{noise}},

where R(\beta^{*})=\mathbb{E}[(\mathcal{Y}-\mathcal{X}^{\top}\beta^{*})^{2}]=\sigma^{2}. We write the excess risk as \mathcal{E}(\widehat{\lambda}):=R(\widehat{\beta}_{\widehat{\lambda}})-\sigma^{2} and prove a convergence rate for \mathcal{E}(\widehat{\lambda}). In this case, targeting the excess risk is the same as estimating the conditional expectation of \mathcal{Y}, but if f^{*}(\mathcal{X}) is not linear (as in Section 3.2), the excess risk remains a meaningful way of assessing prediction behavior.

Theorem 3.

Assume the following conditions:

  1. M1:

    There exists a constant C_{q}<\infty independent of n such that |\mathcal{X}_{j}|<C_{q} almost surely for all j=1,\ldots,p.

  2. M2:

    \mathbb{E}[\mathcal{X}]=0 and \mathbb{E}[\mathcal{X}\mathcal{X}^{\top}]=\Sigma.

  3. M3:

    \epsilon\sim N(0,\sigma^{2}).

  4. M4:

    \exists\ 0<\nu<1 such that \Sigma and \Sigma^{-1} are diagonally dominant at level \nu; that is, |\sigma_{jj}|\geq\nu+\sum_{i\neq j}|\sigma_{ij}| for all 1\leq j\leq p.

  5. M5:

    ||\beta^{*}||_{0}=s^{*}, which is independent of n.

  6. M6:

    \lambda_{\min}\propto(\log{n}\log{p}/n)^{1/2}.

  7. M7:

    \log{p}=o(n/\log{n}).

Then, \mathcal{E}(\widehat{\lambda})=O_{p}\left((s^{*}\log n\log p)/n\right).

These conditions and the result warrant a few comments. First, note that condition M4 implies that \Sigma-(1-\nu)\textrm{diag}(\Sigma) is positive semi-definite. Second, as s^{*} is fixed, M6 and M7 ensure \lambda_{\min}\rightarrow 0 so that \Lambda will eventually allow models with s^{*} covariates. Thus, the procedure is asymptotically consistent for model selection (Meinshausen, 2007). Finally, note that \mathcal{E}(\widehat{\lambda}) is random due to the term R(\widehat{\beta}_{\widehat{\lambda}}). Here, we emphasize that R(\widehat{\beta}_{\widehat{\lambda}}) is a function of the data: the conditional expectation of a new test random variable \mathcal{Z} given the observed data used to choose both \widehat{\lambda} and \widehat{\beta}_{\lambda} as in Equation (4).

While the conditions of Theorem 3 are strong, they are typical of those used to prove persistence of the lasso estimator with oracle tuning parameter (for the case of fixed design, see Negahban et al., 2012). For instance, Bunea et al. (2007) prove an oracle rate for the lasso of O(s^{*}\log{p}/n) with \lambda_{\min}\propto\sigma\sqrt{\log{p}/n}. Under similar conditions, our result with cross-validated tuning parameter requires a larger \lambda_{\min} (resulting in smaller models) and a slower convergence rate to the oracle by a factor \log{n}. A reasonable choice of \Lambda suggested by Theorem 3 is \Lambda=[\lambda_{\min},\ \infty)=[(\log{p}\log{n}/n)^{1/2},\ \infty).

Proof of Theorem 3.

We have that, for all g>0,

\mathbb{P}\left(R(\widehat{\beta}_{\widehat{\lambda}})-\sigma^{2}>g\frac{s^{*}\log n\log p}{n}\right)\leq\mathbb{P}\left(\inf_{\lambda\in\Lambda}\left(R(\widehat{\beta}_{\lambda})-\sigma^{2}\right)>g\frac{s^{*}\log n\log p}{2n}\right)+2\mathbb{P}\left(\sup_{\lambda\in\Lambda}\left|R(\widehat{\beta}_{\lambda})-\sigma^{2}-\widehat{R}(\widehat{\beta}_{\lambda})\right|>g\frac{s^{*}\log n\log p}{2n}\right),

by the proof of Theorem 7 in Meinshausen (2007). Note that \widehat{R}(\widehat{\beta}_{\lambda}) is defined in Equation (5). Furthermore, the second term on the right hand side goes to 0 by that result. Now note that

\mathbb{P}\left(\inf_{\lambda\in\Lambda}\left(R(\widehat{\beta}_{\lambda})-\sigma^{2}\right)>g\frac{s^{*}\log n\log p}{n}\right)\leq\mathbb{P}\left(R(\widehat{\beta}_{\lambda_{\min}})-\sigma^{2}>g\frac{s^{*}\log n\log p}{n}\right).

By Corollary 1 in Bunea, Tsybakov, and Wegkamp (2007), for any \lambda, we have that

\mathbb{P}\left(R(\widehat{\beta}_{\lambda})-\sigma^{2}>g\frac{s^{*}\lambda^{2}}{1-\nu}\right)\leq 10p^{2}\exp\left(-c_{1}n\lambda^{2}\right)=10\exp\left(2\log p-c_{1}n\lambda^{2}\right).

Setting \lambda_{\min} proportional to (\log n\log p/n)^{1/2} is enough for the upper bound to go to zero as n\rightarrow\infty, yielding the result. ∎

3.2 Persistence when f^{*} is not linear

In order to derive results under more general conditions, we move to the linear oracle estimation framework. For any t, define the oracle estimator with respect to t as

\beta_{t}:=\operatorname*{argmin}_{\beta\in\mathcal{B}_{t}}R\left(\beta\right).

Suppose \widehat{t} is an estimator of t, such as one chosen by cross-validation. Then we can decompose the risk of an estimator \widehat{\beta}_{\widehat{t}} as follows:

R\left(\widehat{\beta}_{\widehat{t}}\right)=\underbrace{R\left(\widehat{\beta}_{\widehat{t}}\right)-R\left(\beta_{t}\right)}_{\mbox{excess risk}}\quad+\underbrace{R\left(\beta_{t}\right)-R_{*}}_{\mbox{approximation error}}+\quad\underbrace{R_{*}}_{\mbox{noise}},

where R_{*} is the risk of the mean function f^{*}. Because the data generating process is not necessarily linear, we study the performance of an estimator \widehat{\beta}_{\widehat{t}} via the excess risk of \widehat{\beta}_{\widehat{t}} relative to \beta_{t}, which we define as

\mathcal{E}(\widehat{t},t):=R\left(\widehat{\beta}_{\widehat{t}}\right)-R\left(\beta_{t}\right). (9)

Here, \mathcal{E}(\widehat{t},t) depends on the cross-validated tuning parameter \widehat{t} as well as the oracle set through t. Focusing on Equation (9) allows for meaningful theory when the approximation error does not necessarily converge to zero as n grows. This is particularly important here, as we do not assume that the conditional expectation of \mathcal{Y} given \mathcal{X} is linear. As t\rightarrow\infty, the approximation error decreases and hence we desire to take t=t_{n} for some increasing sequence (t_{n}). As discussed in the introduction, Greenshtein and Ritov (2004) show that if t_{n}=o((n/\log p)^{1/4}), then \mathcal{E}(t_{n},t_{n}) converges in probability to zero. Bartlett, Mendelson, and Neeman (2012) increase the size of this search set, allowing t_{n}=o(n^{1/2}/(\log^{3/2}n\log^{3/2}(np))) while still having \mathcal{E}(t_{n},t_{n})\xrightarrow{P}0.

Theorem 4.

Let (\mu_{n})\in\mathcal{F}_{q} and suppose that Condition 1 holds. Then for any sequences (a_{n}),\ (t_{n}) which satisfy a_{n}t_{n}=o(n),

\mathbb{P}_{\mu_{n}}\left(\mathcal{E}(\widehat{t},t_{n})>\delta\right)\leq\delta^{-1}\left(\Omega_{n,1}+\Omega_{n,2}\right)+2\mathbb{P}(D_{n}^{c})+\mathbb{P}(E_{n}^{c}). (10)

Here,

\Omega_{n,1}:=\left[1+\frac{2nC_{q}^{\prime}}{a_{n}}\right]^{2}\sqrt{(\log p)^{1+2/q}}\left(n^{-1/2}+c_{n}^{-1/2}+(n-c_{n})^{-1/2}\right),
\Omega_{n,2}:=(1+t_{n})^{2}\sqrt{\frac{(\log p)^{1+2/q}}{n}},
D_{n}:=\left\{t_{\max}\leq\frac{2nC_{q}^{\prime}}{a_{n}}\right\},
E_{n}:=\{t_{\max}\geq t_{n}\}

and C_{q}^{\prime}=C_{q}(\log 2)^{1/q-1}.

Remark 4.

The sets D_{n},\ E_{n} account for cases wherein t_{\max}=\left|\left|Y\right|\right|_{2}^{2}/a_{n} results in suboptimal sets T. If (a_{n}) is such that t_{\max} grows too quickly with non-negligible probability, then cross-validation may result in low-bias but high variance solutions. On the other hand, if t_{\max} is too small, then we will rule out possible estimators with lower risk. Here D_{n} calibrates a high-probability upper bound on t_{\max} based on (\mu_{n}) and the choice of (a_{n}), while E_{n} ensures that t_{\max} will be large enough to include low risk estimators.

Remark 5.

Usually in the oracle estimation framework, \widehat{t}=t_{n} and so the excess risk is necessarily nonnegative because the oracle predictor, \beta_{t_{n}}, is selected as the risk minimizer over \mathcal{B}_{t_{n}}. In that case, Equation (10) would correspond to convergence in probability. As we are examining the case where the optimization set is estimated, \mathcal{E}(\widehat{t},t_{n}) may be negative. However, we are only interested in bounding the case where \mathcal{E}(\widehat{t},t_{n}) is positive, i.e. the case where the estimator is worse than the oracle.

Define b_{n}=\min\{n-c_{n},c_{n}\}. It follows that (n^{-1/2}+c_{n}^{-1/2}+(n-c_{n})^{-1/2})\leq 3b_{n}^{-1/2}. We state the following corollary to Theorem 4, which outlines the conditions under which we can do asymptotically at least as well at predicting a new observation as the oracle linear model. That is, when the right hand side of Equation (10) goes to zero.

Corollary 5.

Under the conditions of Theorem 4, if a_{n}=n(\log p)^{1/4+1/(2q)}m_{n}/b_{n}^{1/4} and t_{n}=o\left(b_{n}^{1/4}/\left(m_{n}(\log p)^{1/4+1/(2q)}\right)\right), where m_{n} is any sequence which tends toward infinity and m_{n}=o(b_{n}^{1/4}), we have that for n large enough and p=o(\exp\{b_{n}^{q/(q+2)}\}),

\mathbb{P}_{\mu_{n}}\left(\mathcal{E}(\widehat{t},t_{n})>\delta\right)\leq\frac{1}{m_{n}^{2}\delta}\left(1+\sqrt{\frac{b_{n}}{n}}\right)+2\exp(-n/8e^{2}).

In particular, \mathbb{P}_{\mu_{n}}\left(\mathcal{E}(\widehat{t},t_{n})>\delta\right)\rightarrow 0.

Remark 6.

The rate at which \delta can be taken to zero quantifies the decay of the size of the ‘bad’ set where \mathcal{E}(\widehat{t},t_{n}) is large. For the corollary, both m_{n}=o(b_{n}^{1/4}) and \delta^{-1}=o(m_{n}^{2}). Therefore, it is necessary for \delta^{-1}=o(b_{n}^{1/2}) and hence, \delta must go to zero at a slower rate than b_{n}^{-1/2}.

Remark 7.

As q increases, which corresponds to (\mu_{n})\in\mathcal{F}_{q} putting less mass on the tails of the centered interactions of the components of \mathcal{Z}, the oracle set given by \mathcal{B}_{t_{n}} can grow faster; that is, the \ell_{1}-sparsity assumption on the linear oracle becomes weaker. When q=\infty, the random variables are bounded and we get the fastest rate of growth for t_{n}; that is, o(b_{n}^{1/4}/(m_{n}(\log p)^{1/4})).

The parameter b_{n} controls the minimum size of the validation versus training sets that comprise cross-validation. It must be true that b_{n} is strictly less than n. To get the best guarantee, b_{n} should increase as fast as possible. Hence, our results advocate a cross-validation scheme where the validation and training sets are asymptotically balanced, i.e. c_{n}\asymp n-c_{n}. This should be compared with the results in Shao (1993), which state that for model selection consistency, one should have c_{n}/n\rightarrow 1. However, Shao (1993) presents results for model selection while we focus on prediction error. Similarly, Dudoit and van der Laan (2005) provide oracle inequalities for cross-validation and also advocate for the validation set to grow as fast as possible. Note that for K-fold cross-validation, c_{n}=\lfloor n/K\rfloor so that b_{n}=O(n).

Additionally, the rates a_{n} and m_{n} deserve comment. It is instructive to compare this choice of t_{\max} with \left|\left|Y\right|\right|_{2}^{2}/n, a standard estimate of the noise variance in high dimensions (e.g. van de Geer and Bühlmann, 2011, p. 104). If a_{n}=n, then \left|\left|Y\right|\right|_{2}^{2}/a_{n} is an overestimate of the variance due to the presence of the regression function f^{*}. Our results state that we must choose a_{n} to increase slower than n, thereby increasing this overestimation and enlarging the potential search set T. While a_{n} depends on several parameters, b_{n}, n, and p are known to the analyst. Also, the choice of q depends on how much approximation error the analyst is willing to make. The m_{n} term is required to slow the growth of t_{n} just slightly. While this shrinks the size of the set \mathcal{B}_{t_{n}} relative to that used by Greenshtein and Ritov (2004), potentially eliminating some better solutions, effectively m_{n} quantifies their requirement t_{n}=o((n/\log p)^{1/4}), making explicit the need for t_{n} to grow more slowly than (n/\log p)^{1/4}. As such, if we set b_{n}\asymp n and q=\infty, we recover the rate shown by Greenshtein and Ritov (2004), where Corollary 5 implies that both \mathbb{P}_{\mu_{n}}\left(\mathcal{E}(\widehat{t},t_{n})>\delta\right)\rightarrow 0 and t_{n}=o((n/\log p)^{1/4}).

While the entire proof is contained in the supplementary material, we provide a brief sketch here.

Proof sketch of Theorem 4 and Corollary 5.

In order to control the behavior of \mathbb{P}_{\mu_{n}}\left(\mathcal{E}(\widehat{t},t_{n})>\delta\right), we require a number of steps.

  1.

    The first step is to form the decomposition

    \mathcal{E}(\widehat{t},t_{n})=R\left(\widehat{\beta}_{\widehat{t}}\right)-R\left(\beta_{t_{n}}\right)=\underbrace{R\left(\widehat{\beta}_{\widehat{t}}\right)-\widehat{R}_{V_{n}}\left(\widehat{t}\right)}_{(I)}+\underbrace{\widehat{R}_{V_{n}}\left(\widehat{t}\right)-\widehat{R}_{V_{n}}\left(t_{\max}\right)}_{(II)}+\underbrace{\widehat{R}_{V_{n}}\left(t_{\max}\right)-\widehat{R}\left(\widehat{\beta}_{t_{n}}\right)}_{(III)}+\underbrace{\widehat{R}\left(\widehat{\beta}_{t_{n}}\right)-R\left(\beta_{t_{n}}\right)}_{(IV)}.

    Here, \widehat{R}\left(\widehat{\beta}_{t_{n}}\right) and \widehat{R}_{V_{n}}\left(\widehat{t}\right) are defined in Equation (5) and Equation (6), respectively.

  2.

    By standard rules of probability, it is enough to control each term (I) through (IV) on the intersection D_{n}\cap E_{n} as long as \mathbb{P}(D_{n}^{c}) and \mathbb{P}(E_{n}^{c}) are both small.

  3.

    For each of the terms, we rewrite the difference in risks as a difference of quadratic forms involving the covariance matrices of the random variables and the extended coefficients. For example, R\left(\beta\right)=\mathbb{E}_{\mu_{n}}(\mathcal{Y}-\beta^{\top}\mathcal{X})^{2}=\gamma^{\top}\Sigma_{n}\gamma where \Sigma_{n}=\mathbb{E}_{\mu_{n}}(\mathcal{Z}\mathcal{Z}^{\top}) and \gamma^{\top}=(-1,\beta^{\top}), and \widehat{R}\left(\beta\right)=n^{-1}\left|\left|Y-\mathbb{X}\beta\right|\right|_{2}^{2}=\gamma^{\top}\widehat{\Sigma}_{n}\gamma, where \widehat{\Sigma}_{n}=n^{-1}\sum_{i=1}^{n}Z_{i}Z_{i}^{\top}. Thus, we have, for example, (IV)=\widehat{\gamma}^{\top}_{t_{n}}\widehat{\Sigma}_{n}\widehat{\gamma}_{t_{n}}-\gamma^{\top}_{t_{n}}\Sigma_{n}\gamma_{t_{n}}.

  4.

    After some algebraic manipulation and an application of Hölder’s inequality, we can rewrite these differences as the product of the squared L^{1}-norm of the coefficient vector and the expected value of the L^{\infty}-norm of the difference between an empirical and expected covariance. For example, R\left(\beta\right)-\widehat{R}\left(\beta\right)\leq||\gamma||^{2}_{1}||\Sigma_{n}-\widehat{\Sigma}_{n}||_{\infty}.

  5.

    Finally, the L^{1}-norm of the coefficient vector is bounded by t due to the constraint in Equation (2) while \mathbb{E}||\Sigma_{n}-\widehat{\Sigma}_{n}||_{\infty}=O((\log(p)/n)^{1/2}) by Nemirovski’s inequality and some intermediate lemmas.

  6.

    To move to the corollary, we use the Orlicz condition on (\mu_{n}) to show that \mathbb{P}(D_{n}^{c}) and \mathbb{P}(E_{n}^{c}) are both small. This ensures that, with high probability, t_{\max}<2nC_{q}^{\prime}/a_{n} and, for any sequences (a_{n}) and (t_{n}) which satisfy a_{n}t_{n}=o(n), t_{\max}>t_{n}.

We note here that our results generalize to other M-estimators which use an \ell_{1}-constraint. In particular, relative to the set of coefficients \beta\in\mathcal{B}_{t_{n}} with t_{n}=o\left(b_{n}^{1/4}/\left(m_{n}(\log p)^{1/4+1/(2q)}\right)\right), an empirical estimator with cross-validated tuning parameter has prediction risk which converges to the prediction risk of the oracle.

Corollary 6.

Consider the group lasso (Yuan and Lin, 2006)

\widehat{\beta}_{t}=\operatorname*{argmin}\left\{n^{-1}\left|\left|Y-\mathbb{X}\beta\right|\right|_{2}^{2}:\sum_{g\in G}\sqrt{|g|}\left|\left|\beta_{g}\right|\right|_{2}\leq t\right\},

and the square-root lasso (Belloni, Chernozhukov, and Wang, 2014)

\widehat{\beta}_{t}=\operatorname*{argmin}\left\{n^{-1}\left|\left|Y-\mathbb{X}\beta\right|\right|_{2}:\left|\left|\beta\right|\right|_{1}\leq t\right\}.

Let t_{n} and a_{n} be as in Corollary 5. Then, for n large enough, \log(p)=o(b_{n}^{q/(q+2)}), and \max_{g}\sqrt{|g|}=O(1), we have

\mathbb{P}_{\mu_{n}}\left(\mathcal{E}(\widehat{t},t_{n})>\delta\right)\leq\frac{1}{m_{n}^{2}\delta}\left(1+\sqrt{\frac{b_{n}}{n}}\right)+2e^{-n/8e^{2}}.
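The following is a hedged sketch, not the authors' implementation, of the two constrained estimators appearing in Corollary 6, written with the generic convex solver cvxpy; the argument `groups` is an assumed list of index arrays partitioning the predictor indices, and no attempt is made at computational efficiency.

```python
import cvxpy as cp
import numpy as np

def group_lasso_constrained(X, y, t, groups):
    """Group lasso, constrained form: sum_g sqrt(|g|) ||beta_g||_2 <= t."""
    n, p = X.shape
    beta = cp.Variable(p)
    penalty = sum(np.sqrt(len(g)) * cp.norm(beta[g], 2) for g in groups)
    prob = cp.Problem(cp.Minimize(cp.sum_squares(y - X @ beta) / n),
                      [penalty <= t])
    prob.solve()
    return beta.value

def sqrt_lasso_constrained(X, y, t):
    """Square-root lasso, constrained form: ||beta||_1 <= t."""
    n, p = X.shape
    beta = cp.Variable(p)
    prob = cp.Problem(cp.Minimize(cp.norm(y - X @ beta, 2) / n),
                      [cp.norm(beta, 1) <= t])
    prob.solve()
    return beta.value
```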

4 Simulations

So far, though we have given some indication about the asymptotic order of K, we haven’t provided any guidance for the choice of K with a fixed n and p. In this section, we investigate the predictive and model selection performance of K-fold cross-validation for a range of K.

We consider three criteria: prediction risk, sensitivity, and specificity. For prediction risk, we approximate R(\widehat{\beta}_{\widehat{\lambda}}) by the empirical risk of 500 test observations. For sensitivity and specificity, define the active set of a coefficient vector \beta to be \mathcal{S}(\beta):=\{j:|\beta_{j}|>0\}, along with \mathcal{S}^{*}:=\mathcal{S}(\beta^{*}) and \widehat{\mathcal{S}}:=\mathcal{S}(\widehat{\beta}_{\widehat{\lambda}}) to be the active sets of \beta^{*} and \widehat{\beta}_{\widehat{\lambda}}, respectively. Then sensitivity = |\mathcal{S}^{*}\cap\widehat{\mathcal{S}}|/|\mathcal{S}^{*}| and specificity = |(\mathcal{S}^{*})^{c}\cap\widehat{\mathcal{S}}^{c}|/|(\mathcal{S}^{*})^{c}|, where |\cdot| counts the number of elements in a set and \mathcal{A}^{c} indicates the complement of a set \mathcal{A}.
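For readers who wish to replicate these criteria, a minimal sketch of their computation is given below; the function name and interface are illustrative, not part of the paper's simulation code.

```python
import numpy as np

def selection_metrics(beta_star, beta_hat, X_test, y_test):
    """Prediction risk, sensitivity, and specificity of beta_hat."""
    S_star = np.flatnonzero(np.abs(beta_star) > 0)       # true active set
    S_hat = np.flatnonzero(np.abs(beta_hat) > 0)         # estimated active set
    risk = np.mean((y_test - X_test @ beta_hat) ** 2)    # test-set prediction risk
    sens = np.isin(S_star, S_hat).mean()                 # fraction of true nonzeros recovered
    inactive = np.setdiff1d(np.arange(len(beta_star)), S_star)
    spec = (~np.isin(inactive, S_hat)).mean()            # fraction of true zeros estimated as zero
    return risk, sens, spec
```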

4.1 Simulation details

Conditions:

For these simulations, we consider a wide range of possible conditions by varying the correlation in the design, \rho; the number of parameters, p; the sparsity, \alpha; and the signal-to-noise ratio, SNR. In all cases, we let n=100, p=75,350,1000, and set the measurement error variance \sigma^{2}=1. Lastly, we run each simulation condition combination 100 times. For these simulations, we assume that there exists a \beta^{*} such that the regression function is f^{*}(X)=X^{\top}\beta^{*} in order to make model selection meaningful.

The design matrices are produced in two steps. First, X_{ij}\stackrel{i.i.d.}{\sim}N(0,1) for 1\leq i\leq n and 1\leq j\leq p, forming the matrix \mathbb{X}\in\mathbb{R}^{n\times p}. Second, correlations are introduced by defining a matrix D that has all off-diagonal elements equal to \rho and diagonal elements equal to one. Then, we redefine \mathbb{X}\leftarrow\mathbb{X}D^{1/2}. For these simulations, we consider correlations \rho=0.2,0.5,0.95.

For sparsity, we take s^{*}=\lceil n^{\alpha}\rceil and generate the s^{*} non-zero elements of \beta^{*} from a Laplace distribution with parameter 1. We let \alpha=0.1,0.33,0.5. Also, to compensate for the random amount of signal in each observation, we vary the signal-to-noise ratio, defined to be \textrm{SNR}=\beta^{\top}D\beta. We consider \textrm{SNR}=1 and 10. Note that as the SNR increases the observations go from a high-noise and low-signal regime to a low-noise and high-signal one. Lastly, we consider two different noise distributions, \epsilon\sim N(0,1) and \epsilon\sim 3^{-1/2}t(3). Here t(3) indicates a t distribution with 3 degrees of freedom and the 3^{-1/2} term makes the variance equal to 1. Finally, we take K=\{3,\ 10,\ 25,\ 50,\ 75,\ 100\}, the last case being leave-one-out CV.
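A minimal sketch of this data-generating mechanism appears below. It follows the description above (equicorrelated Gaussian design, Laplace coefficients on \lceil n^{\alpha}\rceil coordinates, Gaussian noise), but the rescaling of \beta^{*} to hit the target SNR and the use of a Cholesky factor in place of the symmetric square root D^{1/2} are implementation assumptions.

```python
import numpy as np

def simulate(n=100, p=350, rho=0.5, alpha=0.33, snr=1.0, seed=0):
    """One simulated data set following the design described in Section 4.1."""
    rng = np.random.default_rng(seed)
    D = np.full((p, p), rho)
    np.fill_diagonal(D, 1.0)
    L = np.linalg.cholesky(D)                   # any square root of D suffices here
    X = rng.standard_normal((n, p)) @ L.T       # rows have covariance D
    s = int(np.ceil(n ** alpha))
    beta = np.zeros(p)
    beta[:s] = rng.laplace(scale=1.0, size=s)   # Laplace(1) nonzero coefficients
    beta *= np.sqrt(snr / (beta @ D @ beta))    # calibrate SNR = beta' D beta
    y = X @ beta + rng.standard_normal(n)       # N(0, 1) measurement error
    return X, y, beta
```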

4.2 Simulation results

Though we have run all of the above simulations, to save space we have only included plots from a select number of simulation conditions that are the most informative. For prediction risk, all considered K’s result in remarkably similar prediction risks. In Figure 1, for p=1000, SNR = 1, and \rho=0.9, we see that taking K=3 or K=10 results in slightly smaller prediction risks. This is comforting as both of these values of K are used frequently by default and they are the least computationally demanding.

[Figure 1: Prediction risk for p=1000 with \alpha=0.1, \alpha=0.4, and \alpha=0.6. The other parameters are set at \textrm{SNR}=1, \rho=0.9.]

For model selection, there is a more nuanced story. For sensitivity, which describes how often we would correctly identify a coefficient as nonzero, larger values of K tend to work better. For instance, in Figure 2, we see that K=3 is often decidedly worse than larger K, followed by K=10. As is often the case, this conclusion presents a trade-off with the results for specificity (Figure 3): smaller values of K tend to work better. In general, \widehat{\beta}_{\widehat{\lambda}} tends to have more nonzero entries as K increases holding all else constant. As the correlation parameter (\rho) or the signal-to-noise ratio (SNR) increases, all values of K have approximately the same performance.

[Figure 2: Sensitivity for p=350 with \alpha=0.6 and for p=1000 with \alpha=0.4 and \alpha=0.6. The other parameters are set at \textrm{SNR}=1, \rho=0.2.]
[Figure 3: Specificity for p=350 with \alpha=0.6 and for p=1000 with \alpha=0.4 and \alpha=0.6. The other parameters are set at \textrm{SNR}=1, \rho=0.2.]

5 Discussion

A common practice in data analysis is to attempt to estimate a coefficient vector \beta via lasso with its regularization parameter chosen via cross-validation. Unfortunately, no definitive theoretical results existed as to the effect of choosing the tuning parameter in this data-dependent way in the high-dimensional random design setting.

We have demonstrated that the lasso, group lasso, and square-root lasso with tuning parameters chosen by cross-validation are risk consistent relative to the linear oracle predictor. Under strong conditions on the joint distribution (as in related literature), our results differ from the standard convergence rates by a factor of \log n. Imposing more mild conditions on the joint distribution of the data, we can achieve the same rate of convergence to the risk of the linear oracle predictor with a data-dependent tuning parameter as with the optimal, yet inaccessible, oracle tuning parameter. Together, these results show that if the data analyst is interested in using lasso for prediction, choosing the tuning parameter via cross-validation is still a good procedure. In practice, our results justify data-driven sets \Lambda and T over which we can safely select tuning parameters. For K-fold CV, when f^{*} is linear, our theory suggests taking \Lambda=\left[(\log{p}\log{n}/n)^{1/2},\ \infty\right) while for f^{*} arbitrary, a reasonable choice assuming q=2 is T=\left[0,\left|\left|Y\right|\right|_{2}/(Kn^{3/4}(\log{n}\log{p})^{1/2})\right).

This work reveals some interesting open questions. First, our most general results do not apply for leave-one-out cross-validation as c_{n}=1 for all n and hence the upper-bounds become trivial. Leave-one-out cross-validation is more computationally demanding than K-fold cross-validation, but is still used in practice. These results also do not give any prescription for choosing K other than that it should be o(n). Our simulation study indicates that all K ranging from 3 to n have approximately the same prediction ability. For model selection, larger K tends to produce more nonzero coefficients and hence has better sensitivity but poorer specificity.

As there are many other methods for choosing the tuning parameter in the lasso problem, a direct comparison of the behavior of the lasso estimator with tuning parameter chosen via cross-validation versus a degrees-of-freedom-based method is of substantial interest and should be investigated. Also, our results depend strongly on the upper bound for T or the lower bound for \Lambda. However, in most cases, we will never need to use tuning parameters this extreme. So it makes sense to attempt to find more subtle theory to provide greater intuition for the behavior of lasso under cross-validation.

Acknowledgements

The authors gratefully acknowledge support from the National Science Foundation (DH, DMS–1407543; DJM, DMS–1407439) and the Institute of New Economic Thinking (DH and DJM, INO14-00020). Additionally, the authors would like to thank Cosma Shalizi and Ryan Tibshirani for reading preliminary versions of this manuscript and also Larry Wasserman for the inspiration and encouragement.

References

  • Akaike (1974) Hirotugu Akaike. A new look at the statistical model identification. Automatic Control, IEEE Transactions on, 19(6):716–723, 1974.
  • Arlot and Celisse (2010) Sylvain Arlot and Alain Celisse. A survey of cross-validation procedures for model selection. Statistics Surveys, 4:40–79, 2010.
  • Arlot and Massart (2009) Sylvain Arlot and Pascal Massart. Data-driven calibration of penalties for least-squares regression. The Journal of Machine Learning Research, 10:245–279, 2009.
  • Bartlett et al. (2012) Peter L. Bartlett, Shahar Mendelson, and Joseph Neeman. 1\ell_{1}-regularized linear regression: Persistence and oracle inequalities. Probability Theory and Related Fields, 154(1-2):193–224, 2012.
  • Belloni et al. (2014) Alexandre Belloni, Victor Chernozhukov, and Lie Wang. Pivotal estimation via square-root lasso in nonparametric regression. The Annals of Statistics, 42(2):757–788, 2014.
  • Bickel et al. (2009) Peter J. Bickel, Ya’acov Ritov, and Alexandre B. Tsybakov. Simultaneous analysis of lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.
  • Bühlmann and van de Geer (2011) Peter Bühlmann and Sara van de Geer. Statistics for high-dimensional data: Methods, theory and applications. Springer, New York, 2011.
  • Bunea et al. (2007) Florentina Bunea, Alexandre B. Tsybakov, and Marten H. Wegkamp. Sparsity oracle inequalities for the lasso. Electronic Journal of Statistics, 1:169–194, 2007.
  • Candès and Plan (2009) Emmanuel J. Candès and Yaniv Plan. Near-ideal model selection by 1\ell_{1} minimization. The Annals of Statistics, 37(5A):2145–2177, 2009.
  • Candès and Tao (2007) Emmanuel J. Candès and Terence Tao. The Dantzig selector: Statistical estimation when pp is much larger than nn. The Annals of Statistics, 35(6):2313–2351, 2007.
  • Chen and Chen (2012) Jiahua Chen and Zehua Chen. Extended BIC for small-n-large-P sparse GLM. Statistica Sinica, 22(2):555, 2012.
  • Chen et al. (1998) Scott Shaobing Chen, David L. Donoho, and Michael A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33–61, 1998.
  • Donoho et al. (2006) David. L. Donoho, Michael Elad, and Vladimir Temlyakov. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on Information Theory, 52(1):6–18, 2006.
  • Dudoit and van der Laan (2005) Sandrine Dudoit and Mark J. van der Laan. Asymptotics of cross-validation risk estimation in estimator selection and performance assessment. Statistical Methodology, pages 131–154, 2005.
  • Dümbgen et al. (2010) Lutz Dümbgen, Sara A. van de Geer, Mark C. Veraar, and Jon A. Wellner. Nemirovski’s inequalities revisited. American Mathematical Monthly, 117(2):138–160, 2010.
  • Efron et al. (2004) Bradley Efron, Trevor Hastie, Ian Johnstone, and Robert Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.
  • Fan and Li (2001) Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American statistical Association, 96(456):1348–1360, 2001.
  • Fan and Tang (2013) Yingying Fan and Cheng Yong Tang. Tuning parameter selection in high dimensional penalized likelihood. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(3):531–552, 2013.
  • Flynn et al. (2013) Cheryl J. Flynn, Clifford M. Hurvich, and Jeffrey S. Simonoff. Efficiency for regularization parameter selection in penalized likelihood estimation of misspecified models. Journal of the American Statistical Association, 108:1031–1043, 2013.
  • Friedman et al. (2010) Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010.
  • Giraud et al. (2012) Christophe Giraud, Sylvie Huet, and Nicolas Verzelen. High-dimensional regression with unknown variance. Statistical Science, 27(4):500–518, 2012.
  • Greenshtein and Ritov (2004) Eitan Greenshtein and Ya’acov Ritov. Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli, 10(6):971–988, 2004.
  • Homrighausen and McDonald (2013) Darren Homrighausen and Daniel J. McDonald. The lasso, persistence, and cross-validation. In Proceedings of The 30th International Conference on Machine Learning, pages 1031–1039, 2013.
  • Homrighausen and McDonald (2014) Darren Homrighausen and Daniel J. McDonald. Leave-one-out cross-validation is risk consistent for lasso. Machine Learning, 97(1-2):65–78, 2014.
  • Homrighausen and McDonald (2016) Darren Warren Homrighausen and Daniel J. McDonald. Risk estimation for high-dimensional lasso regression, 2016. URL http://arxiv.org/abs/1602.01522.
  • Kato (2009) Kengo Kato. On the degrees of freedom in shrinkage estimation. Journal of Multivariate Analysis, 100(7):1338–1352, 2009.
  • Kim et al. (2012) Yongdai Kim, Sunghoon Kwon, and Hosik Choi. Consistent model selection criteria on high dimensions. The Journal of Machine Learning Research, 13(1):1037–1057, 2012.
  • Knight and Fu (2000) Keith Knight and Wenjiang Fu. Asymptotics for lasso-type estimators. Annals of Statistics, 28(5):1356–1378, 2000.
  • Lecué and Mitchell (2012) Guillaume Lecué and Charles Mitchell. Oracle inequalities for cross-validation type procedures. Electronic Journal of Statistics, 6:1803–1837, 2012.
  • Leng et al. (2006) Chenlei Leng, Yi Lin, and Grace Wahba. A note on the lasso and related procedures in model selection. Statistica Sinica, 16(4):1273–1284, 2006.
  • Meinshausen (2007) Nicolai Meinshausen. Relaxed lasso. Computational Statistics & Data Analysis, 52(1):374–393, 2007.
  • Meinshausen and Bühlmann (2006) Nicolai Meinshausen and Peter Bühlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436–1462, 2006.
  • Meinshausen and Yu (2009) Nicolai Meinshausen and Bin Yu. Lasso-type recovery of sparse representations for high-dimensional data. The Annals of Statistics, 37(1):246–270, 2009.
  • Nardi and Rinaldo (2008) Yuval Nardi and Alessandro Rinaldo. On the asymptotic properties of the group lasso estimator for linear models. Electronic Journal of Statistics, 2:605–633, 2008.
  • Negahban et al. (2012) Sahand Negahban, Pradeep Ravikumar, Martin J. Wainwright, and Bin Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4):538–557, 2012.
  • Osborne et al. (2000) Michael R. Osborne, Brett Presnell, and Berwin A. Turlach. On the lasso and its dual. Journal of Computational and Graphical statistics, 9(2):319–337, 2000.
  • Saumard (2011) Adrien Saumard. The slope heuristics in heteroscedastic regression, 2011. URL http://arxiv.org/abs/1104.1050.
  • Schwarz (1978) G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6:461–464, 1978.
  • Shao (1993) Jun Shao. Linear model selection by cross-validation. Journal of the American Statistical Association, 88:486–494, 1993.
  • Stone (1974) M. Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society. Series B (Methodological), 36(2):111–147, 1974.
  • Stone (1977) M. Stone. Asymptotics for and against cross-validation. Biometrika, 64(1):29–35, 1977.
  • Sun and Zhang (2012) Tingni Sun and Cun-Hui Zhang. Scaled sparse linear regression. Biometrika, 99(4):879–898, 2012.
  • Tibshirani (1996) Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 58(1):267–288, 1996.
  • Tibshirani (2011) Robert Tibshirani. Regression shrinkage and selection via the lasso: A retrospective. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(3):273–282, 2011.
  • Tibshirani (2013) Ryan J. Tibshirani. The lasso problem and uniqueness. Electronic Journal of Statistics, 7:1456–1490, 2013.
  • Tibshirani and Taylor (2012) Ryan J. Tibshirani and Jonathan Taylor. Degrees of freedom in lasso problems. Annals of Statistics, 40:1198–1232, 2012.
  • van de Geer (2008) Sara A. van de Geer. High-dimensional generalized linear models and the lasso. The Annals of Statistics, 36(2):614–645, 2008.
  • van de Geer and Bühlmann (2011) Sara A. van de Geer and Peter Bühlmann. Statistics for High-Dimensional Data. Springer Verlag, 2011.
  • van der Vaart and Wellner (1996) Aad W. van der Vaart and Jon A. Wellner. Weak Convergence and Empirical Processes. Springer, New York, 1996.
  • Vershynin (2012) Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Yonina C. Eldar and Gitta Kutyniok, editors, Compressed Sensing: Theory and Applications, pages 210–268. Cambridge University Press, 2012.
  • Wainwright (2009) Martin J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using 1\ell_{1}-constrained quadratic programming (lasso). IEEE Transactions on Information Theory, 55(5):2183–2202, 2009.
  • Wang and Leng (2007) Hansheng Wang and Chenlei Leng. Unified LASSO estimation by least squares approximation. Journal of the American Statistical Association, 102(479):1039–1048, 2007.
  • Wang et al. (2007) Hansheng Wang, Runze Li, and Chih-Ling Tsai. Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika, 94(3):553–568, 2007.
  • Wang et al. (2009) Hansheng Wang, Bo Li, and Chenlei Leng. Shrinkage tuning parameter selection with a diverging number of parameters. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(3):671–683, 2009.
  • Yuan and Lin (2006) Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006.
  • Zhang and Shen (2010) Yongli Zhang and Xiaotong Shen. Model selection procedure for high-dimensional data. Statistical Analysis and Data Mining: The ASA Data Science Journal, 3(5):350–358, 2010.
  • Zhao and Yu (2006) Peng Zhao and Bin Yu. On model selection consistency of lasso. The Journal of Machine Learning Research, 7:2541–2563, 2006.
  • Zou (2006) Hui Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.
  • Zou et al. (2007) Hui Zou, Trevor Hastie, and Robert Tibshirani. On the degrees of freedom of the lasso. The Annals of Statistics, 35(5):2173–2192, 2007.

Appendix A Appendix

To show Theorem 4, we need a few preliminary results. We will show how to rewrite the risk as a quadratic form in Section A.1. In Section A.2, we state a concentration of measure result and some standard properties of Orlicz norms. In Section A.3 we prove some useful preliminary lemmas. Lastly, in Section A.4 we prove the main theorem and corollaries.

A.1 Squared-error loss and quadratic forms

We can rewrite the various formulas for the risk from Section 2 as quadratic forms. Define the parameter to be γ:=(1,β)\gamma^{\top}:=(-1,\beta^{\top}), with associated estimator γ^t:=(1,β^t)\widehat{\gamma}_{t}^{\top}:=(-1,\widehat{\beta}_{t}^{\top}). We can rewrite Equation (3) as

R(β)=𝔼μn[(𝒴β𝒳)2]=γΣnγR\left(\beta\right)=\mathbb{E}_{\mu_{n}}\left[(\mathcal{Y}-\beta^{\top}\mathcal{X})^{2}\right]=\gamma^{\top}\Sigma_{n}\gamma (11)

where Σn:=𝔼μn[𝒵𝒵]\Sigma_{n}:=\mathbb{E}_{\mu_{n}}[\mathcal{Z}\mathcal{Z}^{\top}]. Analogously, Equation (5) has the following form

R^(β)=1nY𝕏β22=γΣ^nγ,\widehat{R}\left(\beta\right)=\frac{1}{n}\left|\left|Y-\mathbb{X}\beta\right|\right|_{2}^{2}=\gamma^{\top}\widehat{\Sigma}_{n}\gamma,

where Σ^n=n1i=1nZiZi\widehat{\Sigma}_{n}=n^{-1}\sum_{i=1}^{n}Z_{i}Z_{i}^{\top}. Lastly, we rewrite Equation (6) as

R^Vn(t)=1KvVn(γ^t(v))TΣ^vγ^t(v),\widehat{R}_{V_{n}}\left(t\right)=\frac{1}{K}\sum_{v\in V_{n}}(\widehat{\gamma}^{(v)}_{t})^{T}\widehat{\Sigma}_{v}\widehat{\gamma}^{(v)}_{t}, (12)

where Σ^v=|v|1rvZrZr\widehat{\Sigma}_{v}=|v|^{-1}\sum_{r\in v}Z_{r}Z_{r}^{\top}, γ^t(v):=(1,β^t(v))\widehat{\gamma}^{(v)}_{t}:=(-1,\widehat{\beta}^{(v)}_{t})^{\top}, and

β^t(v):=argminβtγΣ^(v)γ,\widehat{\beta}_{t}^{(v)}:=\operatorname*{argmin}_{\beta\in\mathcal{B}_{t}}\gamma^{\top}\widehat{\Sigma}_{(v)}\gamma,

with Σ^(v):=(n|v|)1rvZrZr\widehat{\Sigma}_{(v)}:=(n-|v|)^{-1}\sum_{r\notin v}Z_{r}Z_{r}^{\top}.
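
These identities are purely algebraic; the following short numerical check (a sketch with simulated data, included only to make the bookkeeping concrete) verifies that \widehat{R}(\beta)=\gamma^{\top}\widehat{\Sigma}_{n}\gamma.

import numpy as np

# Check that (1/n)||Y - X beta||_2^2 equals gamma' Sigma_hat_n gamma
# with gamma = (-1, beta') and Z_i = (Y_i, X_i')'.
rng = np.random.default_rng(2)
n, p = 50, 10
X = rng.standard_normal((n, p))
Y = rng.standard_normal(n)
beta = rng.standard_normal(p)

Z = np.column_stack([Y, X])          # rows are the Z_i
Sigma_hat = Z.T @ Z / n              # empirical second-moment matrix
gamma = np.concatenate(([-1.0], beta))

lhs = np.mean((Y - X @ beta) ** 2)
rhs = gamma @ Sigma_hat @ gamma
assert np.isclose(lhs, rhs)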

A.2 Background Results

We use the following results in our proofs. First is a special case of Nemirovski’s inequality. See Dümbgen, van de Geer, Veraar, and Wellner (2010) for more general formulations.

Lemma 7 (Nemirovski’s inequality).

Let (ξi)iv(\xi_{i})_{i\in v} be independent random vectors in d\mathbb{R}^{d}, for d3d\geq 3 with 𝔼[ξi]=0\mathbb{E}[\xi_{i}]=0 and 𝔼ξi22<\mathbb{E}\left|\left|\xi_{i}\right|\right|_{2}^{2}<\infty. Then, for any validation set vv and distribution for the ξi\xi_{i}’s,

𝔼ivξi2\displaystyle\mathbb{E}\left|\left|\sum_{i\in v}\xi_{i}\right|\right|_{\infty}^{2} (2elogde)iv𝔼ξi22elogdiv𝔼ξi2.\displaystyle\leq(2e\log d-e)\sum_{i\in v}\mathbb{E}\left|\left|\xi_{i}\right|\right|_{\infty}^{2}\leq 2e\log d\sum_{i\in v}\mathbb{E}\left|\left|\xi_{i}\right|\right|_{\infty}^{2}.
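
As a quick numerical sanity check (a Monte Carlo sketch with arbitrary dimensions and standard Gaussian vectors; an illustration, not a proof), one can compare the two sides of the inequality:

import numpy as np

# Monte Carlo illustration of Nemirovski's inequality:
# E|| sum_i xi_i ||_inf^2 <= 2 e log(d) * sum_i E|| xi_i ||_inf^2.
rng = np.random.default_rng(3)
d, m, reps = 50, 30, 2000
xi = rng.standard_normal((reps, m, d))   # reps draws of m centered vectors in R^d
lhs = np.mean(np.max(np.abs(xi.sum(axis=1)), axis=1) ** 2)
rhs = 2 * np.e * np.log(d) * m * np.mean(np.max(np.abs(xi), axis=2) ** 2)
print(f"lhs = {lhs:.1f} <= rhs = {rhs:.1f}")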

Also, we need the following results about the Orlicz norms.

Lemma 8 (van der Vaart and Wellner 1996).

For any ψr\psi_{r}-Orlicz norm with 1<r1<r\in\mathbb{N} and sequence of \mathbb{R}-valued random variables (ζj)j=1,,m(\zeta_{j})_{j=1,\ldots,m}

max1jmζjψrΨlog1/r(m+1)max1jmζjψr,\left|\left|\max_{1\leq j\leq m}\zeta_{j}\right|\right|_{\psi_{r}}\leq\Psi\log^{1/r}(m+1)\max_{1\leq j\leq m}\left|\left|\zeta_{j}\right|\right|_{\psi_{r}},

where Ψ\Psi is a constant that depends only on ψr\psi_{r}.

Lemma 9 (Corollary 5.17 in Vershynin 2012).

Let ξ1,,ξn\xi_{1},\ldots,\xi_{n} be iid centered random variables and let ξiψ1κ\left|\left|\xi_{i}\right|\right|_{\psi_{1}}\leq\kappa. Then for every δ>0\delta>0,

(|i=1nξi|nδ)2exp(cnmin{δ2κ2,δκ}),\mathbb{P}\left(\left|\sum_{i=1}^{n}\xi_{i}\right|\geq n\delta\right)\leq 2\exp\left(-cn\min\left\{\frac{\delta^{2}}{\kappa^{2}},\frac{\delta}{\kappa}\right\}\right),

where c=1/8e2c=1/8e^{2}.

A.3 Supporting Lemmas

Several times in our proof of the main results we need to bound a quadratic form given by a symmetric matrix and an estimator indexed by a tuning parameter. To this end, we state the following lemma.

Lemma 10.

Suppose apa\in\mathbb{R}^{p} and Ap×pA\in\mathbb{R}^{p\times p}. Then

aAaa12A,a^{\top}Aa\leq\left|\left|a\right|\right|^{2}_{1}\left|\left|A\right|\right|_{\infty},

where A:=maxi,j|Aij|\left|\left|A\right|\right|_{\infty}:=\max_{i,j}|A_{ij}| is the entry-wise max norm.
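
The proof is immediate from bounding each summand by its absolute value:

a^{\top}Aa=\sum_{i,j}a_{i}A_{ij}a_{j}\leq\max_{i,j}|A_{ij}|\sum_{i,j}|a_{i}||a_{j}|=\left|\left|A\right|\right|_{\infty}\left|\left|a\right|\right|_{1}^{2}.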

We use the results of Section A.2 to derive the rate of convergence of the sample covariance matrix to the population covariance matrix.

Lemma 11.

Let v{1,2,,n}v\subseteq\{1,2,\ldots,n\} be an index set and let |v||v| be its number of elements. If μq\mu\in\mathcal{F}_{q}, then there exists a constant CC, depending only on qq, such that

𝔼μΣ^vΣnC(logp)1+2/q|v|,\mathbb{E}_{\mu}\left|\left|\widehat{\Sigma}_{v}-\Sigma_{n}\right|\right|_{\infty}\leq C\sqrt{\frac{(\log p)^{1+2/q}}{|v|}},

where it is understood that 2/=02/\infty=0.

Proof.

(of Lemma 11) Let ξr(p+1)2\xi_{r}\in\mathbb{R}^{(p+1)^{2}} be the vectorized version of the zero-mean matrix 1|v|(ZrZr𝔼𝒵𝒵)\frac{1}{|v|}(Z_{r}Z_{r}^{\top}-\mathbb{E}\mathcal{Z}\mathcal{Z}^{\top}). Then, by Jensen’s inequality

(𝔼Σ^vΣn)2𝔼Σ^vΣn2=𝔼rvξr2.\left(\mathbb{E}\left|\left|\widehat{\Sigma}_{v}-\Sigma_{n}\right|\right|_{\infty}\right)^{2}\leq\mathbb{E}\left|\left|\widehat{\Sigma}_{v}-\Sigma_{n}\right|\right|_{\infty}^{2}=\mathbb{E}\left|\left|\sum_{r\in v}\xi_{r}\right|\right|_{\infty}^{2}.

Using Lemma 7 with d=(p+1)2d=(p+1)^{2} and writing XL2(μ)2:=𝔼μX2\left|\left|X\right|\right|_{L_{2}(\mu)}^{2}:=\mathbb{E}_{\mu}X^{2}, we find

𝔼μrvξr2\displaystyle\mathbb{E}_{\mu}\left|\left|\sum_{r\in v}\xi_{r}\right|\right|_{\infty}^{2} 2elog((p+1)2)rvξrL2(μ)2\displaystyle\leq 2e\log\left((p+1)^{2}\right)\sum_{r\in v}\bigg{|}\bigg{|}\left|\left|\xi_{r}\right|\right|_{\infty}\bigg{|}\bigg{|}_{L_{2}(\mu)}^{2}
4elog(p+1)rv((2(log2)1/q1)ξrψq)2\displaystyle\leq 4e\log(p+1)\sum_{r\in v}\left(\left(2(\log 2)^{1/q-1}\right)\bigg{|}\bigg{|}\left|\left|\xi_{r}\right|\right|_{\infty}\bigg{|}\bigg{|}_{\psi_{q}}\right)^{2}
log(p+1)rv(log1/q((p+1)2+1)ξrψq)2\displaystyle\lesssim\log(p+1)\sum_{r\in v}\left(\log^{1/q}\left((p+1)^{2}+1\right)\bigg{|}\bigg{|}\left|\left|\xi_{r}\right|\right|_{\psi_{q}}\bigg{|}\bigg{|}_{\infty}\right)^{2}
log(p+1)1|v|2rv(log1/q((p+1)2+1)Cq)2\displaystyle\lesssim\log(p+1)\frac{1}{|v|^{2}}\sum_{r\in v}\left(\log^{1/q}\left((p+1)^{2}+1\right)C_{q}\right)^{2}
Clog(p+1)1|v|2rvlog2/q((p+1)2+1)\displaystyle\leq C^{\prime}\log(p+1)\frac{1}{|v|^{2}}\sum_{r\in v}\log^{2/q}\left((p+1)^{2}+1\right)
C(logp)1+2/q|v|.\displaystyle\leq C\frac{(\log p)^{1+2/q}}{|v|}.

Note that ψq\psi_{q} is the Orlicz norm induced by the measure μn\mu_{n}, and the third inequality follows by Lemma 8. ∎

Corollary 12.

By the definition of μn\mu_{n},

(|Y22n𝔼[Y12]|nδ)2exp(cnmin{δ2Cq2,δCq}),\mathbb{P}\left(\left|\left|\left|Y\right|\right|_{2}^{2}-n\mathbb{E}[Y_{1}^{2}]\right|\geq n\delta\right)\leq 2\exp\left(-cn\min\left\{\frac{\delta^{2}}{C_{q}^{{}^{\prime}2}},\frac{\delta}{C_{q}^{\prime}}\right\}\right),

where c=1/8e2c=1/8e^{2} is an absolute constant and Cq=(log2)1/q1CqC_{q}^{\prime}=(\log 2)^{1/q-1}C_{q}.

Proof.

This result follows immediately from Lemma 9 and the bound

ξψ1(log2)1/q1ξψq(log2)1/q1Cq=Cq.||\xi||_{\psi_{1}}\leq(\log 2)^{1/q-1}||\xi||_{\psi_{q}}\leq(\log 2)^{1/q-1}C_{q}=C_{q}^{\prime}.

A.4 Proof of Main Results

Proof of Theorem 4.

Let Dn,EnD_{n},E_{n} be any two sets. Then we can make the following decomposition:

((t^,tn)δ)\displaystyle\mathbb{P}\left(\mathcal{E}(\widehat{t},t_{n})\geq\delta\right) =((t^,tn)δDnEn)+((t^,tn)δDncEn)+\displaystyle=\mathbb{P}\left(\mathcal{E}(\widehat{t},t_{n})\geq\delta\cap D_{n}\cap E_{n}\right)+\mathbb{P}\left(\mathcal{E}(\widehat{t},t_{n})\geq\delta\cap D_{n}^{c}\cap E_{n}\right)+
+((t^,tn)δDnEnc)+((t^,tn)δDncEnc)\displaystyle\quad+\mathbb{P}\left(\mathcal{E}(\widehat{t},t_{n})\geq\delta\cap D_{n}\cap E_{n}^{c}\right)+\mathbb{P}\left(\mathcal{E}(\widehat{t},t_{n})\geq\delta\cap D_{n}^{c}\cap E_{n}^{c}\right)
((t^,tn)δDnEn)+2(Dnc)+(Enc).\displaystyle\leq\mathbb{P}\left(\mathcal{E}(\widehat{t},t_{n})\geq\delta\cap D_{n}\cap E_{n}\right)+2\mathbb{P}\left(D_{n}^{c}\right)+\mathbb{P}\left(E_{n}^{c}\right). (13)

Also,

(t^,tn)\displaystyle\mathcal{E}(\widehat{t},t_{n}) =R(β^t^)R(βtn)\displaystyle=R\left(\widehat{\beta}_{\widehat{t}}\right)-R\left(\beta_{t_{n}}\right)
=R(β^t^)R^Vn(t^)(I)+R^Vn(t^)R^Vn(tmax)(II)+\displaystyle=\underbrace{R\left(\widehat{\beta}_{\widehat{t}}\right)-\widehat{R}_{V_{n}}\left(\widehat{t}\right)}_{(I)}+\underbrace{\widehat{R}_{V_{n}}\left(\widehat{t}\right)-\widehat{R}_{V_{n}}\left(t_{\max}\right)}_{(II)}+
+R^Vn(tmax)R^(β^tn)(III)+R^(β^tn)R(βtn)(IV),\displaystyle\qquad+\underbrace{\widehat{R}_{V_{n}}\left(t_{\max}\right)-\widehat{R}\left(\widehat{\beta}_{t_{n}}\right)}_{(III)}+\underbrace{\widehat{R}\left(\widehat{\beta}_{t_{n}}\right)-R\left(\beta_{t_{n}}\right)}_{(IV)}, (14)

where we use the notation β^t=β^(t)\widehat{\beta}_{t}=\widehat{\beta}(\mathcal{B}_{t}). Now, for any tTnt\in T_{n}, R^Vn(t^)R^Vn(t)0,\widehat{R}_{V_{n}}\left(\widehat{t}\right)-\widehat{R}_{V_{n}}\left(t\right)\leq 0, by the definition of t^\widehat{t}, and thus (II)0(II)\leq 0.

Let Dn:={tmax2nCq/an}D_{n}:=\left\{t_{\max}\leq 2nC_{q}^{\prime}/a_{n}\right\} and En:={tmaxtn}E_{n}:=\{t_{\max}\geq t_{n}\}. On the set EnE_{n}, (III)R^Vn(tmax)R^(β^tmax)=:(III)~(III)\leq\widehat{R}_{V_{n}}\left(t_{\max}\right)-\widehat{R}\left(\widehat{\beta}_{t_{\max}}\right)=:\widetilde{(III)}. Taking the first term from Equation (13) and combining it with the decomposition in Equation (14), we see that

((t^,tn)δDnEn)\displaystyle\mathbb{P}\left(\mathcal{E}(\widehat{t},t_{n})\geq\delta\cap D_{n}\cap E_{n}\right) ((I)δ/3Dn)+((III)~δ/3DnEn)\displaystyle\leq\mathbb{P}\bigg{(}(I)\geq\delta/3\cap D_{n}\bigg{)}+\mathbb{P}\bigg{(}\widetilde{(III)}\geq\delta/3\cap D_{n}\cap E_{n}\bigg{)}
+((IV)δ/3)\displaystyle\quad+\mathbb{P}\bigg{(}(IV)\geq\delta/3\bigg{)} (15)

We break the remainder of this section into parts based on these terms.

Final predictor and cross-validation risk (I)(I):

Using the notation introduced in Section A.1, note that by equations (11) and (12)

R(β^t^)R^Vn(t^)\displaystyle R\left(\widehat{\beta}_{\widehat{t}}\right)-\widehat{R}_{V_{n}}\left(\,\widehat{t}\right) =γ^t^Σnγ^t^1KvVn(γ^t^(v))Σ^vγ^t^(v)\displaystyle=\widehat{\gamma}_{\widehat{t}}^{\top}\,\Sigma_{n}\,\widehat{\gamma}_{\widehat{t}}-\frac{1}{K}\sum_{v\in V_{n}}(\widehat{\gamma}_{\widehat{t}}^{(v)})^{\top}\widehat{\Sigma}_{v}\widehat{\gamma}_{\widehat{t}}^{(v)}
=[γ^t^Σnγ^t^γ^t^(Σ^n)γ^t^]+[γ^t^(Σ^n)γ^t^1KvVn(γ^t^(v))Σ^vγ^t^(v)]\displaystyle=\left[\widehat{\gamma}_{\widehat{t}}^{\top}\,\Sigma_{n}\,\widehat{\gamma}_{\widehat{t}}-\widehat{\gamma}_{\widehat{t}}^{\top}\left(\widehat{\Sigma}_{n}\right)\widehat{\gamma}_{\widehat{t}}\right]+\left[\widehat{\gamma}_{\widehat{t}}^{\top}\left(\widehat{\Sigma}_{n}\right)\widehat{\gamma}_{\widehat{t}}-\frac{1}{K}\sum_{v\in V_{n}}(\widehat{\gamma}_{\widehat{t}}^{(v)})^{\top}\widehat{\Sigma}_{v}\widehat{\gamma}_{\widehat{t}}^{(v)}\right]

Addressing each of the terms in order,

[γ^t^Σnγ^t^γ^t^(Σ^n)γ^t^]\displaystyle\left[\widehat{\gamma}_{\widehat{t}}^{\top}\,\Sigma_{n}\,\widehat{\gamma}_{\widehat{t}}-\widehat{\gamma}_{\widehat{t}}^{\top}\left(\widehat{\Sigma}_{n}\right)\widehat{\gamma}_{\widehat{t}}\right] =γ^t^(ΣnΣ^n)γ^t^γ^t^12ΣnΣ^n,\displaystyle=\widehat{\gamma}_{\widehat{t}}^{\top}\left(\Sigma_{n}-\widehat{\Sigma}_{n}\right)\widehat{\gamma}_{\widehat{t}}\leq\left|\left|\widehat{\gamma}_{\widehat{t}}\right|\right|_{1}^{2}\left|\left|\Sigma_{n}-\widehat{\Sigma}_{n}\right|\right|_{\infty},

where the inequality follows by Section A.3.

Likewise,

[γ^t^(Σ^n)γ^t^1KvVn(γ^t^(v))Σ^vγ^t^(v)]\displaystyle\left[\widehat{\gamma}_{\widehat{t}}^{\top}\left(\widehat{\Sigma}_{n}\right)\widehat{\gamma}_{\widehat{t}}-\frac{1}{K}\sum_{v\in V_{n}}(\widehat{\gamma}_{\widehat{t}}^{(v)})^{\top}\widehat{\Sigma}_{v}\widehat{\gamma}_{\widehat{t}}^{(v)}\right]
=(γ^t^Σ^nγ^t^1KvVn(γ^t^(v))Σ^nγ^t^(v))+(1KvVn(γ^t^(v))Σ^nγ^t^(v)1KvVn(γ^t^(v))Σ^vγ^t^(v))\displaystyle=\left(\widehat{\gamma}_{\widehat{t}}^{\top}\widehat{\Sigma}_{n}\widehat{\gamma}_{\widehat{t}}-\frac{1}{K}\sum_{v\in V_{n}}(\widehat{\gamma}_{\widehat{t}}^{(v)})^{\top}\widehat{\Sigma}_{n}\widehat{\gamma}_{\widehat{t}}^{(v)}\right)+\left(\frac{1}{K}\sum_{v\in V_{n}}(\widehat{\gamma}_{\widehat{t}}^{(v)})^{\top}\widehat{\Sigma}_{n}\widehat{\gamma}_{\widehat{t}}^{(v)}-\frac{1}{K}\sum_{v\in V_{n}}(\widehat{\gamma}_{\widehat{t}}^{(v)})^{\top}\widehat{\Sigma}_{v}\widehat{\gamma}_{\widehat{t}}^{(v)}\right)
=1KvVn(γ^t^Σ^nγ^t^(γ^t^(v))Σ^nγ^t^(v))+(1KvVn(γ^t^(v))(Σ^nΣ^v)γ^t^(v))\displaystyle=\frac{1}{K}\sum_{v\in V_{n}}\left(\widehat{\gamma}_{\widehat{t}}^{\top}\widehat{\Sigma}_{n}\widehat{\gamma}_{\widehat{t}}-(\widehat{\gamma}_{\widehat{t}}^{(v)})^{\top}\widehat{\Sigma}_{n}\widehat{\gamma}_{\widehat{t}}^{(v)}\right)+\left(\frac{1}{K}\sum_{v\in V_{n}}(\widehat{\gamma}_{\widehat{t}}^{(v)})^{\top}\left(\widehat{\Sigma}_{n}-\widehat{\Sigma}_{v}\right)\widehat{\gamma}_{\widehat{t}}^{(v)}\right)
1KvVn(γ^t^(v))(Σ^nΣ^v)γ^t^(v).\displaystyle\leq\frac{1}{K}\sum_{v\in V_{n}}(\widehat{\gamma}_{\widehat{t}}^{(v)})^{\top}\left(\widehat{\Sigma}_{n}-\widehat{\Sigma}_{v}\right)\widehat{\gamma}_{\widehat{t}}^{(v)}.

The last inequality follows as β^t^\widehat{\beta}_{\widehat{t}} is chosen to minimize γΣ^nγ\gamma^{\top}\widehat{\Sigma}_{n}\gamma over t^\mathcal{B}_{\widehat{t}}, while β^t^(v)t^\widehat{\beta}_{\widehat{t}}^{(v)}\in\mathcal{B}_{\widehat{t}}, and so for any vVnv\in V_{n},

γ^t^Σ^nγ^t^(γ^t^(v))Σ^nγ^t^(v).\widehat{\gamma}_{\widehat{t}}^{\top}\widehat{\Sigma}_{n}\widehat{\gamma}_{\widehat{t}}\leq(\widehat{\gamma}_{\widehat{t}}^{(v)})^{\top}\widehat{\Sigma}_{n}\widehat{\gamma}_{\widehat{t}}^{(v)}. (16)

Continuing and using Section A.3,

1KvVn(γ^t^(v))(Σ^nΣ^v)γ^t^(v)\displaystyle\frac{1}{K}\sum_{v\in V_{n}}(\widehat{\gamma}_{\widehat{t}}^{(v)})^{\top}\left(\widehat{\Sigma}_{n}-\widehat{\Sigma}_{v}\right)\widehat{\gamma}_{\widehat{t}}^{(v)} 1KvVnγ^t^(v)12Σ^nΣ^v\displaystyle\leq\frac{1}{K}\sum_{v\in V_{n}}\left|\left|\widehat{\gamma}_{\widehat{t}}^{(v)}\right|\right|_{1}^{2}\left|\left|\widehat{\Sigma}_{n}-\widehat{\Sigma}_{v}\right|\right|_{\infty}
1KvVnγ^t^(v)12(Σ^nΣn+ΣnΣ^v)\displaystyle\leq\frac{1}{K}\sum_{v\in V_{n}}\left|\left|\widehat{\gamma}_{\widehat{t}}^{(v)}\right|\right|_{1}^{2}\left(\left|\left|\widehat{\Sigma}_{n}-\Sigma_{n}\right|\right|_{\infty}+\left|\left|\Sigma_{n}-\widehat{\Sigma}_{v}\right|\right|_{\infty}\right)

Therefore,

(I)\displaystyle(I) γ^t^12ΣnΣ^n+1KvVn(γ^t^(v))(Σ^nΣ^v)γ^t^(v)\displaystyle\leq\left|\left|\widehat{\gamma}_{\widehat{t}}\right|\right|_{1}^{2}\left|\left|\Sigma_{n}-\widehat{\Sigma}_{n}\right|\right|_{\infty}+\frac{1}{K}\sum_{v\in V_{n}}\left(\widehat{\gamma}_{\widehat{t}}^{(v)}\right)^{\top}\left(\widehat{\Sigma}_{n}-\widehat{\Sigma}_{v}\right)\widehat{\gamma}_{\widehat{t}}^{(v)}
(1+tmax)2(2ΣnΣ^n+1KvVnΣnΣ^v).\displaystyle\leq(1+t_{\textrm{max}})^{2}\bigg{(}2\left|\left|\Sigma_{n}-\widehat{\Sigma}_{n}\right|\right|_{\infty}+\frac{1}{K}\sum_{v\in V_{n}}\left|\left|\Sigma_{n}-\widehat{\Sigma}_{v}\right|\right|_{\infty}\bigg{)}.

By Lemma 11 applied with v={1,,n}v=\{1,\ldots,n\}, so that |v|=n|v|=n,

𝔼ΣnΣ^nC1(logp)1+2/qn,\mathbb{E}\left|\left|\Sigma_{n}-\widehat{\Sigma}_{n}\right|\right|_{\infty}\leq C_{1}\sqrt{\frac{(\log p)^{1+2/q}}{n}},

while applying it to each vVn={v1,,vK}v\in V_{n}=\{v_{1},\ldots,v_{K}\}, each with |v|=cn|v|=c_{n}, shows

1KvVn𝔼ΣnΣ^vC2(logp)1+2/qcn.\frac{1}{K}\sum_{v\in V_{n}}\mathbb{E}\left|\left|\Sigma_{n}-\widehat{\Sigma}_{v}\right|\right|_{\infty}\leq C_{2}\sqrt{\frac{(\log p)^{1+2/q}}{c_{n}}}.

Combining these two bounds together gives

((I)δ3Dn)3δ𝔼[(I) 1Dn]\mathbb{P}\bigg{(}(I)\geq\frac{\delta}{3}\cap D_{n}\bigg{)}\leq\frac{3}{\delta}\mathbb{E}[(I)\,\mathbf{1}_{D_{n}}]\\
3δ(1+2n(log2)1/q1Cq/an)2(2C1(logp)1+2/qn+C2(logp)1+2/qcn).\leq\frac{3}{\delta}\left(1+\frac{2n(\log 2)^{1/q-1}C_{q}}{a_{n}}\right)^{2}\bigg{(}2C_{1}\sqrt{\frac{(\log p)^{1+2/q}}{n}}+C_{2}\sqrt{\frac{(\log p)^{1+2/q}}{c_{n}}}\bigg{)}. (17)

Cross-validation risk and empirical risk (III)(III):

Due to the discussion following Equation (14), it is sufficient to bound (III)~\widetilde{(III)} instead. Recall that Σ^(v)=(ncn)1rvZrZr.\widehat{\Sigma}_{(v)}=(n-c_{n})^{-1}\sum_{r\notin v}Z_{r}Z_{r}^{\top}.

Then,

R^Vn(tmax)R^(β^tmax)\displaystyle\widehat{R}_{V_{n}}\left(t_{\max}\right)-\widehat{R}\left(\widehat{\beta}_{t_{\max}}\right) =1KvVn(γ^tmax(v))Σ^vγ^tmax(v)γ^tmaxΣ^nγ^tmax\displaystyle=\frac{1}{K}\sum_{v\in V_{n}}(\widehat{\gamma}_{t_{\max}}^{(v)})^{\top}\widehat{\Sigma}_{v}\widehat{\gamma}_{t_{\max}}^{(v)}-\widehat{\gamma}_{t_{\max}}^{\top}\widehat{\Sigma}_{n}\widehat{\gamma}_{t_{\max}}
=1KvVn((γ^tmax(v))Σ^vγ^tmax(v)(γ^tmax(v))Σ^(v)γ^tmax(v))+\displaystyle=\frac{1}{K}\sum_{v\in V_{n}}\left((\widehat{\gamma}_{t_{\max}}^{(v)})^{\top}\widehat{\Sigma}_{v}\widehat{\gamma}_{t_{\max}}^{(v)}-(\widehat{\gamma}_{t_{\max}}^{(v)})^{\top}\widehat{\Sigma}_{(v)}\widehat{\gamma}_{t_{\max}}^{(v)}\right)+
+1KvVn((γ^tmax(v))Σ^(v)γ^tmax(v)γ^tmaxΣ^nγ^tmax)\displaystyle\qquad+\frac{1}{K}\sum_{v\in V_{n}}\left((\widehat{\gamma}_{t_{\max}}^{(v)})^{\top}\widehat{\Sigma}_{(v)}\widehat{\gamma}_{t_{\max}}^{(v)}-\widehat{\gamma}_{t_{\max}}^{\top}\widehat{\Sigma}_{n}\widehat{\gamma}_{t_{\max}}\right)
1KvVnγ^tmax(v)12Σ^vΣ^(v),\displaystyle\leq\frac{1}{K}\sum_{v\in V_{n}}\left|\left|\widehat{\gamma}_{t_{\max}}^{(v)}\right|\right|_{1}^{2}\left|\left|\widehat{\Sigma}_{v}-\widehat{\Sigma}_{(v)}\right|\right|_{\infty},

where the inequality follows by Lemma 10 applied to the first sum and by the fact that γ^tmax(v)\widehat{\gamma}_{t_{\max}}^{(v)} is chosen to minimize (γ^tmax(v))Σ^(v)γ^tmax(v)(\widehat{\gamma}_{t_{\max}}^{(v)})^{\top}\widehat{\Sigma}_{(v)}\widehat{\gamma}_{t_{\max}}^{(v)} over tmax\mathcal{B}_{t_{\max}}, which implies

(γ^tmax(v))Σ^(v)γ^tmax(v)γ^tmaxΣ^(v)γ^tmax.(\widehat{\gamma}_{t_{\max}}^{(v)})^{\top}\widehat{\Sigma}_{(v)}\widehat{\gamma}_{t_{\max}}^{(v)}\leq\widehat{\gamma}_{t_{\max}}^{\top}\widehat{\Sigma}_{(v)}\widehat{\gamma}_{t_{\max}}. (18)

Averaging Equation (18) over vVnv\in V_{n} and noting that K1vVnΣ^(v)=Σ^nK^{-1}\sum_{v\in V_{n}}\widehat{\Sigma}_{(v)}=\widehat{\Sigma}_{n} when the validation sets have common size cn=n/Kc_{n}=n/K shows that the second sum in the decomposition above is nonpositive. As before,

1KvVnγ^tmax(v)12Σ^vΣ^(v)(1+tmax)21KvVn(ΣnΣ^v+ΣnΣ^(v)).\frac{1}{K}\sum_{v\in V_{n}}\left|\left|\widehat{\gamma}_{t_{\max}}^{(v)}\right|\right|_{1}^{2}\left|\left|\widehat{\Sigma}_{v}-\widehat{\Sigma}_{(v)}\right|\right|_{\infty}\leq(1+t_{\textrm{max}})^{2}\frac{1}{K}\sum_{v\in V_{n}}\left(\left|\left|\Sigma_{n}-\widehat{\Sigma}_{v}\right|\right|_{\infty}+\left|\left|\Sigma_{n}-\widehat{\Sigma}_{(v)}\right|\right|_{\infty}\right).

A straightforward adaptation of Lemma 11 shows that

1KvVn𝔼ΣnΣ^(v)C3(logp)1+2/qncn.\frac{1}{K}\sum_{v\in V_{n}}\mathbb{E}\left|\left|\Sigma_{n}-\widehat{\Sigma}_{(v)}\right|\right|_{\infty}\leq C_{3}\sqrt{\frac{(\log p)^{1+2/q}}{n-c_{n}}}.

Therefore,

((III)~δ/3DnEn)3δ𝔼[(III)~ 1Dn]\mathbb{P}\bigg{(}\widetilde{(III)}\geq\delta/3\cap D_{n}\cap E_{n}\bigg{)}\leq\frac{3}{\delta}\mathbb{E}[\widetilde{(III)}\,\mathbf{1}_{D_{n}}]
3δ(1+2nCqan)2(C2(logp)1+2/qcn+C3(logp)1+2/qncn).\leq\frac{3}{\delta}\left(1+\frac{2nC_{q}^{\prime}}{a_{n}}\right)^{2}\left(C_{2}\sqrt{\frac{(\log p)^{1+2/q}}{c_{n}}}+C_{3}\sqrt{\frac{(\log p)^{1+2/q}}{n-c_{n}}}\right). (19)

Empirical risk and expected risk (IV)(IV):

A proof of this result is given by Greenshtein and Ritov (2004); we include a somewhat different argument for completeness. Observe

R(β^tn)R(βtn)\displaystyle R\left(\widehat{\beta}_{t_{n}}\right)-R\left(\beta_{t_{n}}\right) =R(β^tn)R^(β^tn)+R^(β^tn)R(βtn)\displaystyle=R\left(\widehat{\beta}_{t_{n}}\right)-\widehat{R}\left(\widehat{\beta}_{t_{n}}\right)+\widehat{R}\left(\widehat{\beta}_{t_{n}}\right)-R\left(\beta_{t_{n}}\right)
R(β^tn)R^(β^tn)+R^(βtn)R(βtn)\displaystyle\leq R\left(\widehat{\beta}_{t_{n}}\right)-\widehat{R}\left(\widehat{\beta}_{t_{n}}\right)+\widehat{R}\left(\beta_{t_{n}}\right)-R\left(\beta_{t_{n}}\right)
2supβtn|R(β)R^(β)|.\displaystyle\leq 2\sup_{\beta\in\mathcal{B}_{t_{n}}}\left|R\left(\beta\right)-\widehat{R}\left(\beta\right)\right|.

Using Lemma 10 (see Section A.1 for notation),

supβtn|R(β)R^(β)|supβtnγ12Σ^nΣn(1+tn)2Σ^nΣn.\sup_{\beta\in\mathcal{B}_{t_{n}}}\left|R\left(\beta\right)-\widehat{R}\left(\beta\right)\right|\leq\sup_{\beta\in\mathcal{B}_{t_{n}}}\left|\left|\gamma\right|\right|^{2}_{1}\left|\left|\widehat{\Sigma}_{n}-\Sigma_{n}\right|\right|_{\infty}\leq(1+t_{n})^{2}\left|\left|\widehat{\Sigma}_{n}-\Sigma_{n}\right|\right|_{\infty}.

Therefore,

((IV)δ/3)3δ(1+tn)2(C4(logp)1+2/qn).\mathbb{P}\bigg{(}(IV)\geq\delta/3\bigg{)}\leq\frac{3}{\delta}\left(1+t_{n}\right)^{2}\bigg{(}C_{4}\sqrt{\frac{(\log p)^{1+2/q}}{n}}\bigg{)}. (20)

The proof follows by combining Equation (13) with Equation (15), using the bounds from Equations (17), (19), and (20), and controlling (Dnc)\mathbb{P}(D_{n}^{c}) and (Enc)\mathbb{P}(E_{n}^{c}) via Lemmas 13 and 14 below.

Lastly, there are the various constants incurred in the course of this proof. In Lemma 8, the constant Ψ\Psi can be chosen arbitrarily small based on inspecting the proof of Lemma 2.2.2 in van der Vaart and Wellner (1996). As this constant premultiplies every term in Ωn,1\Omega_{n,1} and Ωn,2\Omega_{n,2}, we can without loss of generality set the constant equal to one. For instance, in Lemma 11, the constant CC is upper bounded by 8e1/2(log2)(2q)/(2q)81/qCqΨ8e^{1/2}(\log 2)^{(2-q)/(2q)}8^{1/q}C_{q}\Psi. This constant can be taken arbitrarily small by choosing Ψ\Psi small enough. ∎

Lemma 13.

Define the set Dn:={tmax2nCq/an},D_{n}:=\left\{t_{\max}\leq 2nC_{q}^{\prime}/a_{n}\right\}, where ana_{n} is the normalizing rate defined in Section 3.2 and Cq=Cq(log2)1/q1C_{q}^{\prime}=C_{q}(\log 2)^{1/q-1}. Then, (Dnc)ecn\mathbb{P}(D^{c}_{n})\leq e^{-cn}.

Proof.

By Corollary 12,

(Y22ann(𝔼[Y12]+δ)an)\displaystyle\mathbb{P}\left(\frac{\left|\left|Y\right|\right|_{2}^{2}}{a_{n}}\geq\frac{n(\mathbb{E}[Y_{1}^{2}]+\delta)}{a_{n}}\right) exp(cnmin{δ2Cq2,δCq}).\displaystyle\leq\exp\left(-cn\min\left\{\frac{\delta^{2}}{C_{q}^{{}^{\prime}2}},\frac{\delta}{C_{q}^{\prime}}\right\}\right).

Furthermore, 𝔼[Y12]Cq\mathbb{E}[Y_{1}^{2}]\leq C_{q}^{\prime}, so

(Y22ann(Cq+δ)an)\displaystyle\mathbb{P}\left(\frac{\left|\left|Y\right|\right|_{2}^{2}}{a_{n}}\geq\frac{n(C_{q}^{\prime}+\delta)}{a_{n}}\right) (Y22ann(𝔼[Y12]+δ)an).\displaystyle\leq\mathbb{P}\left(\frac{\left|\left|Y\right|\right|_{2}^{2}}{a_{n}}\geq\frac{n(\mathbb{E}[Y_{1}^{2}]+\delta)}{a_{n}}\right).

Therefore, setting δ=Cq\delta=C_{q}^{\prime} yields (Y22/an2nCq/an)ecn\mathbb{P}(\left|\left|Y\right|\right|_{2}^{2}/a_{n}\geq 2nC_{q}^{\prime}/a_{n})\leq e^{-cn}; since tmax=Y22/ant_{\max}=\left|\left|Y\right|\right|_{2}^{2}/a_{n}, this is the claim. ∎

Lemma 14.

Define the set En:={tmaxtn}.E_{n}:=\left\{t_{\max}\geq t_{n}\right\}. If antn=o(n)a_{n}t_{n}=o(n), then, for all n>2antn/𝔼[Y12]n>2a_{n}t_{n}/\mathbb{E}[Y_{1}^{2}], (Enc)exp{cn}.\mathbb{P}(E^{c}_{n})\leq\exp\left\{-cn\right\}.

Proof.

By Corollary 12,

(Y22ann(𝔼[Y12]δ)an)\displaystyle\mathbb{P}\left(\frac{\left|\left|Y\right|\right|_{2}^{2}}{a_{n}}\leq\frac{n(\mathbb{E}[Y_{1}^{2}]-\delta)}{a_{n}}\right) exp(cnmin{δ2Cq2,δCq}),\displaystyle\leq\exp\left(-cn\min\left\{\frac{\delta^{2}}{C_{q}^{{}^{\prime}2}},\frac{\delta}{C_{q}^{\prime}}\right\}\right),

for all δ>0\delta>0. Setting δ=𝔼[Y12]antn/n\delta=\mathbb{E}[Y_{1}^{2}]-a_{n}t_{n}/n therefore implies

(tmaxtn)\displaystyle\mathbb{P}\left(t_{\textrm{max}}\leq t_{n}\right) exp(cnmin{(𝔼[Y12]antn/n)2Cq2,𝔼[Y12]antn/nCq})\displaystyle\leq\exp\left(-cn\min\left\{\frac{(\mathbb{E}[Y_{1}^{2}]-a_{n}t_{n}/n)^{2}}{C_{q}^{{}^{\prime}2}},\frac{\mathbb{E}[Y_{1}^{2}]-a_{n}t_{n}/n}{C_{q}^{\prime}}\right\}\right)
exp(cnmin{𝔼[Y12]24Cq2,𝔼[Y12]2Cq}).\displaystyle\leq\exp\left(-cn\min\left\{\frac{\mathbb{E}[Y_{1}^{2}]^{2}}{4C_{q}^{{}^{\prime}2}},\frac{\mathbb{E}[Y_{1}^{2}]}{2C_{q}^{\prime}}\right\}\right).

Since 0<𝔼[Y12]/Cq10<\mathbb{E}[Y_{1}^{2}]/C_{q}^{\prime}\leq 1, the result follows. ∎

Proof of the results in Section 3.2.

For the square-root lasso (lasso\sqrt{\mbox{lasso}}), the result is nearly immediate as we are considering the same constraint set β1t\left|\left|\beta\right|\right|_{1}\leq t and the same search space for the tuning parameter T=[0,Y22/an]T=[0,\ \left|\left|Y\right|\right|_{2}^{2}/a_{n}]. However, in Equations (16) and (18), we relied on the estimator minimizing the empirical quadratic form. Here the estimator minimizes the square root of the same quadratic form over the same constraint set, and since xx2x\mapsto x^{2} is increasing on [0,)[0,\infty), it minimizes the quadratic form as well; hence γ^t^Σ^nγ^t^(γ^t^(v))Σ^nγ^t^(v)\widehat{\gamma}_{\widehat{t}}^{\top}\widehat{\Sigma}_{n}\widehat{\gamma}_{\widehat{t}}\leq(\widehat{\gamma}_{\widehat{t}}^{(v)})^{\top}\widehat{\Sigma}_{n}\widehat{\gamma}_{\widehat{t}}^{(v)} and (γ^tmax(v))Σ^(v)γ^tmax(v)γ^tmaxΣ^(v)γ^tmax(\widehat{\gamma}_{t_{\max}}^{(v)})^{\top}\widehat{\Sigma}_{(v)}\widehat{\gamma}_{t_{\max}}^{(v)}\leq\widehat{\gamma}_{t_{\max}}^{\top}\widehat{\Sigma}_{(v)}\widehat{\gamma}_{t_{\max}}, so Equations (16) and (18) continue to hold.

For the group lasso with maxg|g|=O(1)\max_{g}\sqrt{|g|}=O(1), the Cauchy–Schwarz inequality gives βg1|g|βg2\left|\left|\beta_{g}\right|\right|_{1}\leq\sqrt{|g|}\left|\left|\beta_{g}\right|\right|_{2} for each group, so tgG|g|βg2β1t\geq\sum_{g\in G}\sqrt{|g|}\left|\left|\beta_{g}\right|\right|_{2}\geq\left|\left|\beta\right|\right|_{1} and the argument for Theorem 4 still applies with tmaxt_{\textrm{max}} as before. We note that in this case, the oracle group linear model is restricted to the ball tn={β:gG|g|βg2tn}\mathcal{B}_{t_{n}}=\{\beta:\sum_{g\in G}\sqrt{|g|}\left|\left|\beta_{g}\right|\right|_{2}\leq t_{n}\} rather than the larger set {β:β1tn}\{\beta:\left|\left|\beta\right|\right|_{1}\leq t_{n}\}. ∎