
Dual Lasso Selector

Niharika Gauraha (niharika.gauraha@gmail.com)
Systems Science and Informatics Unit
Indian Statistical Institute
8th Mile, Mysore Road, Bangalore, India
Abstract

We consider the problem of model selection and estimation in sparse high-dimensional linear regression models with strongly correlated variables. First, we study the theoretical properties of the dual Lasso solution, and we show that joint consideration of the Lasso primal and its dual solutions is useful for selecting correlated active variables. Second, we argue that correlations among active predictors are not problematic, and we derive a new weaker condition on the design matrix, called the Pseudo Irrepresentable Condition (PIC). Third, we present a new variable selection procedure, the Dual Lasso Selector, and we prove that the PIC is a necessary and sufficient condition for consistent variable selection for the proposed method. Finally, by further combining the dual Lasso selector with Ridge estimation, even better prediction performance is achieved. We call the combination DLSelect+Ridge; it can be viewed as a new combined approach for inference in high-dimensional regression models with correlated variables. We illustrate the DLSelect+Ridge method and compare it with popular existing methods in terms of variable selection, prediction accuracy, estimation accuracy and computation speed on various simulated and real data examples.

Keywords: Dual Lasso Selector, Correlated Variable Selection, High-dimensional Statistics, Lasso, Lasso Dual, Ridge Regression

1 Introduction and Motivation

The use of microarray technologies has become popular for monitoring genome-wide expression changes in health and disease. Typically, a microarray data set is high dimensional in the sense that it usually has tens of thousands of gene expression profiles (variables) but only tens or hundreds of subjects (observations). In microarray analysis, a group of genes sharing the same biological pathway tend to have highly correlated expression levels (Segal et al., 2003), and the goal is to identify all (rather than a few) of them if they are related to the underlying biological process. This is one example where the need to select groups of correlated variables arises. In many applications it is required to identify all relevant correlated variables. In this paper, we consider the problem of model selection and estimation in sparse high-dimensional linear regression models with strongly correlated variables.

We start with the standard linear regression model as

\textbf{Y}=\textbf{X}\beta+\epsilon,   (1)

with response vector \textbf{Y}_{n\times 1}, design matrix \textbf{X}_{n\times p}, true underlying coefficient vector \beta_{p\times 1} and error vector \epsilon_{n\times 1}\sim N_{n}(0,I_{n}). In particular, we consider the case of a sparse high-dimensional linear model (p\gg n) with strong empirical correlation among a few variables. The Lasso is a widely used regularized regression method for finding sparse solutions; the Lasso estimator is defined as

\hat{\beta}_{Lasso}=\arg\min_{\beta\in\mathbb{R}^{p}}\left\{\frac{1}{2}\|\textbf{Y}-\textbf{X}\beta\|_{2}^{2}+\lambda\|\beta\|_{1}\right\},   (2)

where \lambda\geq 0 is the regularization parameter that controls the amount of regularization. It is known that the Lasso tends to select a single variable from a group of strongly correlated variables, even if many or all of these variables are important.
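The following is a small R illustration of this behaviour (not from the paper; the simulated data, the seed, and the use of the glmnet package mentioned in Section 6 are our own choices):

library(glmnet)
set.seed(1)
n  <- 100
x1 <- rnorm(n); x2 <- x1 + 0.01 * rnorm(n)      # two strongly correlated predictors
X  <- cbind(x1, x2, matrix(rnorm(n * 8), n, 8)) # plus 8 independent noise columns
Y  <- x1 + x2 + rnorm(n)                        # both x1 and x2 are active
fit <- cv.glmnet(X, Y, alpha = 1)               # alpha = 1 gives the Lasso
coef(fit, s = "lambda.min")                     # typically only one of x1, x2 is non-zero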

In the presence of correlated predictors, the strategy of clustering or grouping correlated predictors and then pursuing group-wise model fitting has been proposed; see Bühlmann et al. (2012) and Gauraha (2016). When the dimension is very high, or in the case of overlapping clusters, finding an appropriate group structure remains as difficult as the original problem. We note that clustering followed by model fitting is computationally expensive, not reliable, and does not scale to large, high-dimensional data sets, so we do not consider it further in this paper. An alternative approach is simultaneous clustering and model fitting, which involves a combination of two different penalties. For example, the Elastic Net (Zou and Hastie (2005)) is a combination of two regularization techniques: the \ell_{2} regularization provides grouping effects and the \ell_{1} regularization produces sparse models. Therefore, the Enet selects or drops highly correlated variables together, depending on the amount of \ell_{1} and \ell_{2} regularization.

The influence of correlations on Lasso prediction has been studied in Hebiri and Lederer (2013) and van de Geer and Lederer (2013), where it is shown that Lasso prediction works well in the presence of any degree of correlation with an appropriate amount of regularization. However, studies show that correlations are problematic for parameter estimation and variable selection. It has been proven that the design matrix must satisfy the following two conditions for the Lasso to perform exact variable selection: the irrepresentable condition (IC) (Zhao and Yu (2006)) and the beta-min condition (Bühlmann and van de Geer (2011)). Having highly correlated variables implies that the design matrix violates the IC, and the Lasso solution is not stable. When active covariates are highly correlated, the Lasso solution is not unique and the Lasso randomly selects one variable from the correlated group. However, even in the case of highly correlated variables, the corresponding dual Lasso solution is always unique. The dual of the Lasso problem (2), as shown in Wang et al. (2015), is given by

\sup_{\theta}\; g(\theta)=\frac{1}{2}\|\textbf{Y}\|_{2}^{2}-\frac{1}{2}\|\theta-\textbf{Y}\|_{2}^{2}
subject to |X_{j}^{T}\theta|\leq\lambda\; \text{ for all } j\in\{1,...,p\}   (3)

The intuition drawn from Osborne et al. (2000) and Wang et al. (2015) further motivates us to consider the Lasso optimum and its dual optimal solution together, which leads to the selection of correlated active predictors.

Exploiting the fact that the dual Lasso solution is unique, we propose a new variable selection procedure, the Dual Lasso Selector (DLS). For a given \lambda and a Lasso estimator \hat{\beta}_{Lasso}, we can compute the corresponding dual Lasso solution using the KKT conditions. Basically, the DLS active set corresponds to the predictors that satisfy the dual Lasso feasible boundary conditions (we discuss this in detail in a later section). We argue that correlations among active predictors are not problematic, and we define a new weaker condition on the design matrix that allows for correlation among active predictors, called the Pseudo Irrepresentable Condition (PIC). We prove that the Pseudo Irrepresentable Condition is a necessary and sufficient condition for the proposed dual Lasso selector to select the true active set (under the assumption of the beta-min condition) with high probability. Moreover, we use the \ell_{2} penalty (Ridge regression, Hoerl and Kennard (1970)), which is known to perform best in the case of correlated variables, to estimate the coefficients of the predictors selected by the dual Lasso selector. We call the combination of the two DLSelect+Ridge. Although DLSelect+Ridge resembles "Ridge post Lasso", it is conceptually different and behaves differently from the Lasso followed by the Ridge, especially in the presence of highly correlated variables. For example, DLSelect+Ridge looks like the Elastic-net, since both are combinations of \ell_{1} and \ell_{2} penalties, but the Elastic-net is a combination of Ridge regression followed by the Lasso. In addition, the Enet needs to cross-validate over a two-dimensional grid, O(k^{2}), to select its optimal regularization parameters, whereas DLSelect+Ridge needs to be cross-validated twice over a one-dimensional grid, O(k), where k is the length of the search space for a regularization parameter.

Our contribution is summarized as follows:

1. We briefly review the state-of-the-art methods for simultaneous clustering and model fitting using combinations of penalties, such as the Elastic-net, OSCAR and the Fused Lasso.

2. We study the theoretical properties of the Lasso and its dual optimal solution together, and we show that the selection of active correlated variables is related to the dual feasible boundary conditions.

3. By further exploiting the uniqueness property of the dual Lasso solution, we develop a variable selection algorithm to efficiently select the true active predictors (including correlated active predictors). We call this selection technique the Dual Lasso Selector.

4. We derive the Pseudo Irrepresentable Condition (PIC) for the design matrix, which allows for correlation between active covariates, and we show that under the assumption of the PIC the dual Lasso selector is variable selection consistent.

5. We propose a new combined approach, DLSelect+Ridge: the dual Lasso selecting predictors and the Ridge estimating their coefficients.

6. We study the theoretical properties of the combination DLSelect+Ridge.

7. We implement the DLSelect+Ridge method and empirically compare it with existing methods such as the Lasso and the Enet in terms of variable selection consistency, prediction accuracy, estimation accuracy and time complexity (using various simulations and real data examples).

We have organized the rest of the article in the following manner. We give notation and background in Section 2 and review relevant work in Section 3. In Section 4, we present the Dual Lasso Selector, define the PIC, and discuss variable selection consistency under this assumption on the design matrix. Section 5 is concerned with the illustration of the proposed method on real and simulated data sets. Section 6 gives computational details. We provide some concluding remarks in Section 7.

2 Notations and Background

In this section, we state the notation and assumptions used throughout the paper.

We consider the usual sparse high-dimensional linear regression model as given in (1), with p\gg n. For the design matrix \textbf{X}\in\mathbb{R}^{n\times p}, we denote the rows by x_{i}^{T}\in\mathbb{R}^{p},\ i=1,...,n, and the columns by X_{j}\in\mathbb{R}^{n},\ j=1,...,p. We assume that the design matrix \textbf{X}_{n\times p} is fixed, the data are centred and the predictors are standardized, so that \sum_{i=1}^{n}\textbf{Y}_{i}=0, \sum_{i=1}^{n}(X_{j})_{i}=0 and \frac{1}{n}\textbf{X}^{T}_{j}\textbf{X}_{j}=1 for all j=1,...,p. We denote by

S=\{j\in\{1,...,p\}:\beta_{j}\neq 0\},   (4)

the true active set; the cardinality of this set, s=|S|, is called the sparsity index. We assume that the true coefficient vector \beta is sparse, that is, s\ll p. We denote by \textbf{X}_{S} the restriction of \textbf{X} to the columns in S, and \beta_{S} is the vector \beta restricted to the support S, with 0 outside the support S. Without loss of generality we can assume that the first s variables are the active variables, and we partition the covariance matrix, C=\frac{1}{n}\textbf{X}^{T}\textbf{X}, for the active and the redundant variables as follows.

C=\left[\begin{array}{cc}C_{11}&C_{12}\\ C_{21}&C_{22}\end{array}\right]   (7)

Similarly, the coefficient vector \beta can be partitioned as \left[\begin{array}{c}\beta_{1}\\ \beta_{2}\end{array}\right]. The \ell_{1}-norm and (squared) \ell_{2}-norm are defined as

\|\beta\|_{1} =\sum_{j=1}^{p}|\beta_{j}|   (8)
\|\beta\|_{2}^{2} =\sum_{j=1}^{p}\beta_{j}^{2}.   (9)

Throughout the paper, we use the notation \lambda_{1}>0 for the \ell_{1} penalty and \lambda_{2}>0 for other penalty functions. For a vector a\in\mathbb{R}^{p}, we denote its sign vector as

\mathbb{S}(a)=\left\{\begin{array}{ll}1&\text{ if }a>0\\ -1&\text{ if }a<0\\ 0&\text{ if }a=0\end{array}\right.   (10)

We denote the sub-gradient of the \ell_{1}-norm evaluated at \beta\in\mathbb{R}^{p} as \tau\in\partial\|\beta\|_{1}, where \tau satisfies the following.

\tau_{i}=\left\{\begin{array}{ll}1&\text{ if }\beta_{i}>0\\ \left[-1,1\right]&\text{ if }\beta_{i}=0\\ -1&\text{ if }\beta_{i}<0\end{array}\right.   (14)

3 Review of Relevant Work

Given the huge literature on the use of Lasso-type penalties for variable selection, we provide only a brief overview here, with a focus on previous approaches that are closely related to our work. In particular, we briefly review the Lasso, the Ridge, and the state of the art in simultaneous clustering and model fitting using combinations of penalties for high-dimensional sparse linear models. In general, we define a penalized least squares method as follows.

\min_{\beta\in\mathbb{R}^{p}}\left\{\frac{1}{2}\|\textbf{Y}-\textbf{X}\beta\|_{2}^{2}+\mathcal{P}_{method}(\beta,\cdot)\right\}   (15)

where the penalty term \mathcal{P}_{method}(\beta,\cdot) differs across methods depending on the type and number of penalties used. In the following, we define various penalized least squares estimators in terms of the penalties they use, and we also mention their computational complexity, variable selection consistency, and grouping effect of selecting or dropping highly correlated predictors together.

1. Lasso: The Lasso method was proposed by Tibshirani (1996), and the Lasso penalty is defined as

\mathcal{P}_{Lasso}(\beta,\lambda_{1})=\lambda_{1}\|\beta\|_{1}   (16)

It uses a single \ell_{1} penalty, and due to the nature of the \ell_{1} penalty it simultaneously performs variable selection and estimation. The whole regularization path can be computed efficiently with the computational effort of a single OLS fit, by a modification of the LARS algorithm; see Efron et al. (2004). It does not provide a grouping effect; in fact the Lasso tends to select a single predictor from a group of highly correlated predictors.

2. Ridge Regression (RR): The Ridge method was proposed by Hoerl and Kennard (1970), and the Ridge penalty is defined as follows.

\mathcal{P}_{Ridge}(\beta,\lambda_{2})=\lambda_{2}\|\beta\|_{2}   (17)

It uses a single \ell_{2} penalty, and it always has a unique solution for a fixed regularization parameter \lambda_{2}. Although it is known to correctly detect the variable signs with reduced mean squared error in the presence of correlated variables, it does not provide variable selection. It provides a grouping effect with highly correlated variables, and the computational complexity of the Ridge is the same as the computational effort of a single OLS fit.

3. Elastic-net (Enet): The Enet method was introduced by Zou and Hastie (2005). The Enet penalty is a combination of the \ell_{1} and \ell_{2} penalties and is defined as follows.

\mathcal{P}_{Enet}(\beta,\lambda_{1},\lambda_{2})=\lambda_{1}\|\beta\|_{1}+\lambda_{2}\|\beta\|_{2}   (18)

The Enet addresses both limitations of the Lasso: it can select correlated predictors, and it can handle the s>n case. It provides a grouping effect, but it requires a search over a two-dimensional space to choose optimal values of its regularization parameters. Hence its effective time complexity depends on the length of the search space for the regularization parameters \lambda_{1} and \lambda_{2}.

4. Correlation Based Penalty (CP): The correlation based penalized least squares method was proposed by Tutz and Ulbricht (2009), and uses the following correlation-based penalty term

\mathcal{P}_{CP}(\beta,\lambda_{2})=\lambda_{2}\sum_{i=1}^{p-1}\sum_{j>i}\left\{\frac{(\beta_{i}-\beta_{j})^{2}}{1-\rho_{ij}}+\frac{(\beta_{i}+\beta_{j})^{2}}{1+\rho_{ij}}\right\}

It uses the single CP penalty, it always has a unique solution for a fixed regularization parameter, and the grouping effect strongly depends on the convexity of the penalty term. It does not provide variable selection; however, a boosted version of the penalized estimator allows variable selection. The major drawback is that it is not scalable to large high-dimensional problems.

5. Fused Lasso: The Fused Lasso method was given by Tibshirani et al. (2005), and the Fused Lasso penalty is defined as

\mathcal{P}_{Fused}(\beta,\lambda_{1},\lambda_{2})=\lambda_{1}\|\beta\|_{1}+\lambda_{2}\sum_{j=2}^{p}|\beta_{j}-\beta_{j-1}|   (19)

The first term encourages sparsity in the coefficients and the second term encourages sparsity in their differences. The major drawback of this method is that it requires the covariates to be in some meaningful order; it does not perform automated variable clustering for unordered features.

6. OSCAR: The OSCAR was introduced by Bondell and Reich (2008), and the OSCAR penalty is given as follows,

\mathcal{P}_{Oscar}(\beta,\lambda_{1},\lambda_{2})=\lambda_{1}\|\beta\|_{1}+\lambda_{2}\sum_{i<j}\max\{|\beta_{i}|,|\beta_{j}|\}

with |\beta_{1}|\leq...\leq|\beta_{p}|. The first term encourages sparsity in the coefficients and the second term encourages equi-sparsity in |\beta|. Its time complexity limits its scalability to ultra high-dimensional problems; moreover, it requires a two-dimensional grid search over the two parameters (\lambda_{1},\lambda_{2}).

7. L1CP: The L1CP penalty term is given as follows; see Anbari and Mkhadri (2014).

\mathcal{P}_{L1CP}(\beta,\lambda_{1},\lambda_{2})=\lambda_{1}\|\beta\|_{1}+\lambda_{2}\sum_{i=1}^{p-1}\sum_{j>i}\left\{\frac{(\beta_{i}-\beta_{j})^{2}}{1-\rho_{ij}}+\frac{(\beta_{i}+\beta_{j})^{2}}{1+\rho_{ij}}\right\}.   (20)

It performs variable selection with a grouping effect and estimation together, but it is not scalable to large-scale problems due to its expensive computation time, and it also requires a two-dimensional grid search over the two parameters (\lambda_{1},\lambda_{2}).

8. Clustered Lasso: The Clustered Lasso penalty is defined as

\mathcal{P}_{CL}(\beta,\lambda_{1},\lambda_{2})=\lambda_{1}\|\beta\|_{1}+\lambda_{2}\sum_{i<j}|\beta_{i}-\beta_{j}|

The first term encourages sparsity in the coefficients and the second term encourages equi-sparsity in |\beta|. It is similar to the Fused Lasso but does not require an ordering of the variables; see She (2010). It provides a grouping effect, but it requires a two-dimensional grid search over the two parameters (\lambda_{1},\lambda_{2}). It is computationally expensive since it has to check the equi-sparsity pattern for each pair of variables.

In the following table we summarize the properties discussed above for various regularization methods.

Method            Clustering/Ordering Required   Variable Selection   Grouping Effect   Scalability   Grid Search
Lasso             No                             Yes                  No                Yes           1D
Ridge             No                             No                   Yes               Yes           1D
CP                No                             No                   Yes               No            1D
Elastic-Net       No                             Yes                  Yes               Yes           2D
Fused-Lasso       Yes                            Yes                  Yes               Yes           2D
OSCAR             No                             Yes                  Yes               Yes           2D
L1CP              No                             Yes                  Yes               Yes           2D
Clustered-Lasso   No                             Yes                  Yes               Yes           2D
DLSelect+Ridge    No                             Yes                  Yes               Yes           1D
Table 1: Comparison of methods
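As a concrete illustration, the Lasso, Ridge and Elastic-net fits reviewed above can all be obtained with the R package glmnet used in Section 6. The following is a minimal sketch with simulated data of our own (note that glmnet's parametrization of the Enet penalty differs slightly from (18)):

library(glmnet)
set.seed(1)
n <- 100; p <- 50
X <- matrix(rnorm(n * p), n, p)
Y <- X[, 1:5] %*% rep(1, 5) + rnorm(n)
fit_lasso <- cv.glmnet(X, Y, alpha = 1)    # l1 penalty only, 1D grid over lambda
fit_ridge <- cv.glmnet(X, Y, alpha = 0)    # l2 penalty only, 1D grid over lambda
fit_enet  <- cv.glmnet(X, Y, alpha = 0.5)  # mixture; tuning alpha as well gives a 2D search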

4 Dual Lasso Selector

In this section, we present the Dual Lasso Selector, a new variable selection method for sparse high-dimensional regression models with correlated variables. First, we study the theoretical properties of the Lasso and dual Lasso solutions. Then, we show that the magnitude of the correlations between the predictors and the dual vector determines the set of active predictors. This is the basis for our correlated variable selection.

The dual problem of the Lasso problem (2) can be given as follows (we provide the detailed derivation of the Lasso's dual in Appendix A.1):

\sup_{\theta}\; \frac{1}{2}\|\textbf{Y}\|_{2}^{2}-\frac{1}{2}\|\theta-\textbf{Y}\|_{2}^{2}   (21)
subject to |X_{j}^{T}\theta|\leq\lambda \text{ for } j=1,...,p,   (22)

where \theta is the dual vector, as defined in equation (44). For a fixed \lambda\geq 0, let \hat{\beta}_{lasso}(\lambda) and \hat{\theta}(\lambda) denote the optimal solutions of the Lasso and its dual problem, respectively. Since it is implicit that the Lasso and its dual optimum depend on \lambda, we drop \lambda from the notation for simplicity. From the KKT conditions (the derivation of the KKT conditions is given in Appendix A.2), we get the following primal-dual relationship:

\hat{\theta}=\textbf{Y}-\textbf{X}\hat{\beta}_{lasso}.   (23)
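As an illustration, the dual vector can be computed from any Lasso fit via (23). The following is a minimal R sketch using glmnet (the function name lasso_dual is ours; since glmnet minimizes (1/(2n))\|\textbf{Y}-\textbf{X}\beta\|_{2}^{2}+\lambda\|\beta\|_{1}, its lambda corresponds to \lambda/n in (2), and the data are assumed centred and standardized as in Section 2):

library(glmnet)
lasso_dual <- function(X, Y, lambda) {
  fit  <- glmnet(X, Y, alpha = 1, lambda = lambda / nrow(X),
                 intercept = FALSE, standardize = FALSE)
  bhat <- as.vector(coef(fit))[-1]           # drop the (zero) intercept row
  list(beta_lasso = bhat,
       theta = as.vector(Y - X %*% bhat))    # dual vector, equation (23)
}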

It is worth mentioning the basic properties of the Lasso and its dual, which have already been derived and studied by various authors (see Tibshirani and Taylor (2011) and Wang et al. (2015) for more insights).

1. Uniqueness of the Lasso fit: There may not be a unique solution to the Lasso problem, because the criterion (2) is not strictly convex in \beta. But the least squares loss is strictly convex in \textbf{X}\beta, hence there is always a unique fitted value \textbf{X}\hat{\beta}.

2. Uniqueness of the dual vector: The dual problem is strictly convex in \theta, therefore the dual optimum \hat{\theta} is unique. Another argument for the uniqueness of \hat{\theta} is that it is a function of \textbf{X}\hat{\beta} (23), which itself is unique. The fact that the DLS can achieve consistent variable selection in situations (with correlated active predictors) where the Lasso is unstable for estimation of the true active set is related to the uniqueness of the dual Lasso solution.

3. Uniqueness of the sub-gradient: The sub-gradient of the \ell_{1} norm at any Lasso solution \hat{\beta} is unique because it is a function of \textbf{X}\hat{\beta} (see Appendix A.2). More specifically, suppose \hat{\beta} and \tilde{\beta} are two Lasso solutions for a fixed \lambda value; then they must have the same signs, sign(\hat{\beta})=sign(\tilde{\beta}), and it is not possible that \hat{\beta}_{j}>0 and \tilde{\beta}_{j}<0 for some j.

Let \hat{S}_{lasso} denote the support set, or active set, of the Lasso estimator \hat{\beta}_{lasso}, which is given as

\hat{S}_{lasso}(\lambda)=\{j\in\{1,...,p\}:(\hat{\beta}_{lasso})_{j}\neq 0\}   (24)

Similarly, we define the active set of the dual Lasso vector that corresponds to the active constraints of the dual optimization problem. We note that constraints are said to be active at a feasible point if that point lies on a boundary formed by the constraint.

\hat{S}_{dual}(\lambda)=\{j\in\{1,...,p\}:\;|X_{j}^{T}\theta|=\lambda\}   (25)
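In practice the boundary condition |X_{j}^{T}\theta|=\lambda holds only up to solver precision, so a small tolerance is needed when computing (25). A minimal R sketch (the function name and the tolerance are our own):

dual_active_set <- function(X, theta, lambda, tol = 1e-6) {
  # indices j whose dual constraint |X_j' theta| <= lambda is active (at the boundary)
  which(abs(crossprod(X, theta)) >= lambda - tol)
}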

Now, we define the following lemmas that will be used later for our mathematical derivations.

Lemma 1

The active set selected by the Lasso, \hat{S}_{lasso}(\lambda), is always contained in the active set selected by the dual Lasso, \hat{S}_{dual}(\lambda), that is

\hat{S}_{lasso}(\lambda)\subseteq\hat{S}_{dual}(\lambda).

Proof  From the KKT conditions (dual feasibility condition, see Appendix A.2), we have

|X_{j}^{T}\hat{\theta}|<\lambda\implies\hat{\beta}_{j}=0   (26)

The claim follows from the implication in (26): any j with \hat{\beta}_{j}\neq 0 must satisfy |X_{j}^{T}\hat{\theta}|=\lambda, while the converse need not hold since (26) is an implication and not an equivalence.

It is known that the irrepresentable condition (assuming the beta-min condition holds) is a necessary and sufficient condition for the Lasso to select the true model; see Zhao and Yu (2006) (for completeness, we prove the sufficiency part in Appendix A.4).

Lemma 2

Under the assumption of the Irrepresentable Condition (IC) on the design matrix, the active set selected by the Lasso, \hat{S}_{lasso}(\lambda), is equal to the active set selected by the dual Lasso, \hat{S}_{dual}(\lambda), that is

\hat{S}_{lasso}(\lambda)=\hat{S}_{dual}(\lambda).

The proof is worked out in Appendix A.3.

The IC may fail to hold due to the violation of either one (or both) of the following two conditions:

1. C_{11} is not invertible, which indicates strong correlation among the variables of the true active set.

2. The active predictors are correlated with the noise features (this situation is better explained in terms of the irrepresentable condition).

When there is strong correlation among the variables of the active set, C_{11} is not invertible, the IC does not hold, and the Lasso fails to perform variable selection. But we argue that the dual Lasso can still perform variable selection consistently even when C_{11} is not invertible, provided we impose a milder condition on the design matrix, which we call the Pseudo Irrepresentable Condition (PIC). The Pseudo Irrepresentable Condition is defined as follows.

Definition 3 (Pseudo Irrepresentable Condition (PIC))

We partition the covariance matrix as in (7). Then the Pseudo Irrepresentable Condition is said to be met for the set S with a constant \eta>0, if the following holds:

|X^{T}_{j}G\;sign(\beta_{1})|\leq 1-\eta, \text{ for all } j\in S^{c},   (27)

where G is a generalized inverse of the form \left[\begin{array}{cc}C_{A}^{-1}&0\\ 0&0\end{array}\right], and (27) holds for all C_{A}\in C_{R}, where C_{R}:=\{C_{rr}:rank(C_{rr})=rank(C_{11})=r,\ C_{rr}\subset C_{11}\} is the collection of full-rank r\times r submatrices of C_{11}.
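A minimal numerical check of the PIC is sketched below in R. This is our own illustration: we read |X_{j}^{T}G\,sign(\beta_{1})| as |(C_{21})_{j}\,G\,sign(\beta_{1})|, in analogy with the IC (58), and the brute-force enumeration of submatrices is only feasible for small active sets.

pic_holds <- function(C, S, sign_beta1, eta = 0.05) {
  Sc  <- setdiff(seq_len(ncol(C)), S)
  C11 <- C[S, S, drop = FALSE]
  C21 <- C[Sc, S, drop = FALSE]
  r   <- qr(C11)$rank
  ok  <- TRUE
  # check (27) for every full-rank r x r submatrix C_A of C11
  for (idx in combn(length(S), r, simplify = FALSE)) {
    CA <- C11[idx, idx, drop = FALSE]
    if (qr(CA)$rank < r) next                 # skip rank-deficient choices of C_A
    G  <- matrix(0, length(S), length(S))     # padded generalized inverse of C11
    G[idx, idx] <- solve(CA)
    ok <- ok && all(abs(C21 %*% G %*% sign_beta1) <= 1 - eta)
  }
  ok
}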

The following lemma gives a sufficient condition for the dual Lasso to recover the support. This lemma is similar in spirit to Lemma 2 of Omidiran and Wainwright (2010). Here, we do not assume that C_{11} is invertible.

Lemma 4 (Primal-dual Condition for Variable Selection)

Suppose that we can find a primal-dual pair (\hat{\beta},\hat{\theta}) that satisfies the KKT conditions

\textbf{X}^{T}(\textbf{Y}-\textbf{X}\hat{\beta})=\lambda\hat{v}, \text{ where } \hat{v}\in\partial\|\hat{\beta}\|_{1}   (28)
\hat{\theta} =\textbf{Y}-\textbf{X}\hat{\beta},   (29)

and the signed support recovery conditions

\hat{v}_{j} =sign(\beta_{j}) \text{ for all } j\in S,   (30)
\hat{\beta}_{j} =0 \text{ for all } j\in S^{c},   (31)
|\hat{v}_{j}| <1 \text{ for all } j\in S^{c}   (32)

Then \hat{\theta} is the unique optimal solution to the dual Lasso and \hat{S}_{dual} recovers the true active set.

We have already shown that the dual Lasso optimum \hat{\theta} is always unique, so it remains to show that \hat{S}_{dual} recovers the true active set. By (28) we have X_{j}^{T}\hat{\theta}=\lambda\hat{v}_{j} for all j; condition (30) gives |\hat{v}_{j}|=1, and hence |X_{j}^{T}\hat{\theta}|=\lambda, for all j\in S, while condition (32) gives |X_{j}^{T}\hat{\theta}|<\lambda for all j\in S^{c}. Therefore \hat{S}_{dual}=S.

Theorem 5

Under the assumption of the PIC on the design matrix \textbf{X}, the active set selected by the dual Lasso, \hat{S}_{dual}, is the same as the true active set S with high probability, that is

\hat{S}_{dual}=S.

When C_{11} is invertible, the PIC coincides with the IC, and under the assumption of the IC we have already shown that \hat{S}_{dual}=\hat{S}_{lasso}. In Appendix A.4, we prove that the PIC is a necessary and sufficient condition (the beta-min condition is implicit) for the dual Lasso to consistently select the true active set. The PIC may hold even when C_{11} is not invertible, which implies that the PIC is weaker than the IC. We illustrate this with the following example.

Let S=\{1,2,3,4\} be the active set, and let the covariance matrix C=\frac{\textbf{X}^{T}\textbf{X}}{n} be given as

C=\left[\begin{array}{ccccc}1&0&0&0&\rho\\ 0&1&0&0&\rho\\ 0&0&1&0&\rho\\ 0&0&0&1&\rho\\ \rho&\rho&\rho&\rho&1\end{array}\right],

where the active variables are uncorrelated and the noise variable is equally correlated with all active covariates. First of all, it is easy to check that C is positive semi-definite only for |\rho|\leq\frac{1}{2}, and that C satisfies the IC for |\rho|<\frac{1}{4}.

Now, we augment this matrix with two additional columns, namely copies of the first and second active variables, and we rearrange the columns so that we get the following covariance matrix; we redefine the set of active variables as S=\{1,2,3,4,5,6\} and we assume that |\rho|<\frac{1}{4}.

C=\left[\begin{array}{ccccccc}1&1&0&0&0&0&\rho\\ 1&1&0&0&0&0&\rho\\ 0&0&1&1&0&0&\rho\\ 0&0&1&1&0&0&\rho\\ 0&0&0&0&1&0&\rho\\ 0&0&0&0&0&1&\rho\\ \rho&\rho&\rho&\rho&\rho&\rho&1\end{array}\right].

We partition C as in (7), and it is clear that C_{11} is not invertible and the IC does not hold, hence the Lasso does not perform consistent variable selection. The rank of C_{11} is 4. Let us consider the (4\times 4) submatrices of C_{11} of rank four; the corresponding index sets S^{\prime}\subset S are S_{1}=\{\{1,3,5,6\},\{1,4,5,6\},\{2,3,5,6\},\{2,4,5,6\}\}. Further, we partition C_{11} as

C_{11}=\left[\begin{array}{cc}C_{rr}&C_{rr^{\prime}}\\ C_{r^{\prime}r}&C_{r^{\prime}r^{\prime}}\end{array}\right],   (35)

where C_{rr} has full rank and rank(C_{rr})=rank(C_{11}); let C_{R} be the set of the four such possible invertible submatrices of C_{11}. Then, considering the generalized inverses corresponding to them as

C_{11}^{+}=\left[\begin{array}{cc}C_{A}^{-1}&0\\ 0&0\end{array}\right],   (38)

where C_{A}\in C_{R} is invertible, the PIC holds for the design matrix \textbf{X} with the above inverse C_{11}^{+}. This can also be viewed as follows: the IC is satisfied for each reduced active set S^{\prime}\in S_{1} and the corresponding reduced design matrix \textbf{X}_{S^{\prime}}, and hence the Lasso randomly picks one element from the set S_{1} and sets the coefficient of the noise variable to zero (with high probability). Also, since the PIC holds, the dual Lasso will select the true active set S with high probability and will set the coefficient of the noise feature to zero.
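The example can be checked numerically with the pic_holds sketch given earlier; the value \rho=0.2 below is a hypothetical choice satisfying |\rho|<1/4:

rho <- 0.2
C <- diag(7)
C[1, 2] <- C[2, 1] <- 1                    # duplicated active pair (1,2)
C[3, 4] <- C[4, 3] <- 1                    # duplicated active pair (3,4)
C[7, 1:6] <- C[1:6, 7] <- rho              # noise variable correlated with all active ones
S <- 1:6
qr(C[S, S])$rank                           # 4 < 6, so C11 is not invertible and the IC fails
pic_holds(C, S, sign_beta1 = rep(1, 6))    # TRUE: the PIC holds for |rho| < 1/4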

4.1 Dual Lasso Selection and Ridge Estimation

Having shown that the joint consideration of the Lasso primal and its dual leads to correlated variable selection (under certain regularity conditions), we now combine the dual Lasso selection with Ridge estimation. Specifically, we use the \ell_{2} penalty (Ridge penalty), which is known to perform best in the case of correlated variables, to estimate the coefficients of the predictors selected by the dual Lasso. We develop an algorithm called DLSelect+RR, which is a two-stage procedure: dual Lasso selection followed by Ridge regression.

Input: dataset (\textbf{Y},\textbf{X})
Output: \hat{S} := the set of selected variables
\hat{\beta} := the estimated coefficient vector
Steps:
1. Perform the Lasso on the data (\textbf{Y},\textbf{X}). Denote the Lasso estimator by \hat{\beta}_{lasso}.
2. Compute the dual optimum as
\hat{\theta}=\textbf{Y}-\textbf{X}\hat{\beta}_{lasso}.
Denote the dual Lasso active set by \hat{S}_{dual} and set \hat{S}=\hat{S}_{dual}.
3. Compute the reduced design matrix as
\textbf{X}_{red}=\{X_{j}:j\in\hat{S}_{dual}\}.
4. Perform Ridge regression on the data (\textbf{Y},\textbf{X}_{red}) and obtain the Ridge estimates \hat{\beta}_{j} for j\in\hat{S}_{dual}. Set the remaining coefficients to zero:
\hat{\beta}_{j}=0 \text{ if } j\not\in\hat{S}_{dual}
return (\hat{S},\hat{\beta})
Algorithm 1 DLSelect+RR Algorithm
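A minimal R implementation sketch of Algorithm 1 using glmnet is given below. The function name, the cross-validation choices for both regularization parameters, and the boundary tolerance are our own assumptions; as before, glmnet's lambda corresponds to \lambda/n and the data are assumed centred and standardized.

library(glmnet)
dlselect_ridge <- function(X, Y, tol = 1e-7) {
  n <- nrow(X)
  # Step 1: Lasso fit, lambda chosen by cross-validation
  cv_lasso   <- cv.glmnet(X, Y, alpha = 1, intercept = FALSE, standardize = FALSE)
  beta_lasso <- as.vector(coef(cv_lasso, s = "lambda.min"))[-1]
  # Step 2: dual optimum theta = Y - X beta_lasso and its active (boundary) constraints
  theta  <- as.vector(Y - X %*% beta_lasso)
  lam    <- n * cv_lasso$lambda.min                 # lambda on the scale of (2)
  S_dual <- which(abs(crossprod(X, theta)) >= lam - tol)
  # Step 3: reduced design matrix
  X_red <- X[, S_dual, drop = FALSE]
  # Step 4: Ridge regression on the selected columns, zero elsewhere
  cv_ridge <- cv.glmnet(X_red, Y, alpha = 0, intercept = FALSE, standardize = FALSE)
  beta_hat <- numeric(ncol(X))
  beta_hat[S_dual] <- as.vector(coef(cv_ridge, s = "lambda.min"))[-1]
  list(S_hat = S_dual, beta_hat = beta_hat)
}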

If model selection works perfectly (under strong assumptions, i.e. the IC), then post-model-selection estimators are oracle estimators with well behaved properties (see Belloni and Chernozhukov (2013)). In the following we argue that for the combination of dual selection followed by \ell_{2} estimation, the prediction accuracy is at least as good as that of the Lasso. It has already been proven that the Lasso+OLS estimator (Belloni and Chernozhukov (2013)) performs at least as well as the Lasso in terms of the rate of convergence, and it has a smaller bias than the Lasso. Furthermore, the Lasso+mLS (Lasso + modified OLS) and Lasso+Ridge estimators have also been proven to be asymptotically unbiased under the irrepresentable condition and other regularity conditions; see Liu and Yu (2013). Under the irrepresentable condition the Lasso solution is unique, DLSelect+RR is the same as Lasso+Ridge, and the same argument holds for DLSelect+RR. Also, in the following section we show empirically that the prediction performance of DLSelect+RR is at least as good as that of the Lasso.

5 Numerical Studies

In this section, we apply DLSelect+Ridge for variable selection and estimation on simulated and real data and compare the results with those of the Lasso, Ridge and Elastic-net. We consider the True Positive Rate (TPR, or power) and the False Discovery Rate (FDR) as measures of variable selection performance, which are defined as follows.

TPR =\frac{|\hat{S}\cap S|}{|S|}   (39)
FDR =\frac{|\hat{S}\cap S^{c}|}{|\hat{S}|}

For prediction performance we consider the Mean Squared Prediction Error, which is defined as

MSE =\frac{1}{n}\|\textbf{Y}-\hat{\textbf{Y}}\|^{2}_{2},   (40)

where \hat{\textbf{Y}} is the predicted response vector, or an estimate \textbf{X}\hat{\beta} based on an estimator \hat{\beta}.
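For reference, these performance measures are straightforward to compute; a short R sketch (the function names are ours):

# TPR and FDR as in (39), MSE as in (40); S_hat and S are index vectors of selected/true variables
tpr <- function(S_hat, S) length(intersect(S_hat, S)) / length(S)
fdr <- function(S_hat, S) length(setdiff(S_hat, S)) / max(length(S_hat), 1)
mse <- function(Y, Y_hat) mean((Y - Y_hat)^2)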

Since our aim is to avoid false negatives, we do not report false positives; and since the Ridge does not perform variable selection, the TPR is not reported for the Ridge. The Ridge is considered as a competitor because its prediction performance is better than that of the Lasso for correlated designs.

5.1 Simulation Examples

We consider five different simulation settings, where we simulate data from the linear model (1) with a fixed design matrix \textbf{X} and \sigma=1. We generate the design matrix \textbf{X} once from a multivariate normal distribution N_{p}(0,\Sigma) with different structures for \Sigma, and keep it fixed for all replications.

For each simulation example, 100 data sets were generated, where each data set consists of a training set used to fit the model, an independent validation set used for tuning the regularization parameter, and an independent test set used for evaluating performance. We denote by \#/\#/\# the number of observations in the training, validation and test sets, respectively. For most of the simulation examples we fix the size of the active set to s=20 and the true coefficient vector as

\beta=\{\underbrace{1,...,1}_{20},\underbrace{0,...,0}_{480}\}.   (41)

We generate 100 data sets with sample sizes n/n/1000, where n=100,200,400,600. For each simulation example and each method, the MSE and TPR are computed over the 100 data sets. A suitable grid of values for the tuning parameters is considered, and all reported results are based on the median over the 100 simulation runs.

5.1.1 Block Diagonal Model

Here we generate the fixed design matrix \textbf{X}\sim N_{p}(0,\Sigma_{1}) with p=500, where \Sigma_{1} is a block diagonal matrix. The matrix \Sigma_{1} consists of 50 independent blocks B of size 10\times 10, defined as

B_{j,k}=\left\{\begin{array}{ll}1,& j=k\\ 0.9,& \text{otherwise}\end{array}\right.
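A short R sketch of this data-generating design (the sample size n below is one of the values listed above; mvrnorm is from the MASS package):

library(MASS)
p <- 500; n <- 200
B      <- matrix(0.9, 10, 10); diag(B) <- 1            # one 10 x 10 block
Sigma1 <- kronecker(diag(50), B)                        # 50 independent blocks
X      <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma1)
beta   <- c(rep(1, 20), rep(0, 480))                    # true coefficients, as in (41)
Y      <- X %*% beta + rnorm(n)                         # sigma = 1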

This simulation example is considered to show that, when the Lasso (due to collinearity) and the Ridge (due to noise) do not perform well, the Enet and DLSelect+Ridge perform quite well. From Table 2 it is easy to see that the Ridge performs poorly in terms of prediction for all simulation settings and the Lasso is not stable for variable selection. The Enet consistently selects the true active set, and DLSelect+Ridge competes with the Enet in all settings.

Table 2: Performance measures for block diagonal case
n Method MSE(SE) TPR
100 Lasso 22.37(1.31) 0.45
Ridge 565.58(3.31) NA
Enet 22.17(1.2) 1
DLSelect+Ridge 18.92(1.2) 0.6
200 Lasso 11.52(0.67) 0.6
Ridge 466.35(2.26) NA
Enet 11.37(0.63) 1
DLSelect+Ridge 8.45(0.55) 1
400 Lasso 6.88(0.31) 0.55
Ridge 417.59(2.07) NA
Enet 6.85(0.32) 1
DLSelect+Ridge 5.42(0.37) 1
600 Lasso 5.54(0.28) 0.65
Ridge 5.87(0.31) NA
Enet 5.34(0.25) 1
DLSelect+Ridge 3.53(0.22) 1

5.1.2 Single Block Model with Noise Features

Here we generate the fixed design matrix \textbf{X}\sim N_{p}(0,\Sigma_{2}) with p=500, where \Sigma_{2} is almost an identity matrix except that its first 20\times 20 block is a single highly correlated block. The matrix \Sigma_{2} is defined as

\Sigma_{j,k}=\left\{\begin{array}{ll}1,& j=k\\ 0.9,& j\neq k \text{ and } j,k\leq 20\\ 0,& \text{otherwise}\end{array}\right.

In this setting, the first twenty variables are active predictors and they are highly correlated, and the remaining 480 are independent noise variables. We generate 100 data sets with sample sizes n/n/1000, where n=100,200,400,600. The simulation results are reported in Table 3.

Table 3: Performance measures for single block with noise
n Method MSE(SE) TPR
100 Lasso 101.37(2.26) 0.4
Ridge 922.41(4.36) NA
Enet 102.91(2.37) 1
DLSelect+Ridge 88.00(3.2) 1
200 Lasso 15.55(0.59) 0.25
Ridge 627.66(2.67) NA
Enet 15.92(0.57) 1
DLSelect+Ridge 8.26(0.56) 1
400 Lasso 2.67(0.17) 0.2
Ridge 456.17(1.95) NA
Enet 2.50(0.15) 1
DLSelect+Ridge 1.16(0.081) 1
600 Lasso 1.46(0.06) 0.15
Ridge 5.87(0.26) NA
Enet 1.12(0.06) 1
DLSelect+Ridge 2.29(0.13) 1

From Table 3, it is clear that the Lasso and the Ridge perform poorly (a similar argument as for the block diagonal model applies). The Enet and DLSelect+Ridge consistently select the true active set with reduced prediction error.

5.1.3 Single Block Model without Noise Features

Here we generate the fixed design matrix \textbf{X}\sim N_{p}(0,\Sigma_{3}) with p=20, where \Sigma_{3} is a single block of highly correlated variables. The matrix \Sigma_{3} is defined as

\Sigma_{j,k}=\left\{\begin{array}{ll}1,& j=k\\ 0.99,& \text{otherwise}\end{array}\right.

The true coefficient vector is

\beta=\{\underbrace{1,...,1}_{20}\}.

We generate 100 data sets with sample sizes n/n/200, where n=20,200. The simulation results are reported in Table 4.

Table 4: Performance measures for single block without noise
n Method MSE(SE) TPR
20 Lasso 245.95(8.91) 0.25
Ridge 246.23(8.97) NA
Enet 244.95(8.85) 1
DLSelect+Ridge 249.99(7.15) 1
200 Lasso 4.75(0.42) 0.1
Ridge 4.64(0.43) NA
Enet 4.75(0.40) 1
DLSelect+Ridge 1.00(0.10) 1

From Table 4, it is apparent that the Lasso performs poorly in terms of variable selection as well as prediction accuracy. The Ridge's predictive performance improves as the sample size increases. The Enet and DLSelect+Ridge consistently select the true active set; however, DLSelect+Ridge has better prediction accuracy for the moderate sample size.

5.1.4 Toeplitz Model

Here we consider a special case of a Toeplitz matrix \Sigma_{4} to generate the fixed design matrix \textbf{X}\sim N_{p}(0,\Sigma_{4}) with p=500. The matrix \Sigma_{4} is defined as

\Sigma_{j,k}=\left\{\begin{array}{ll}1,& j=k\\ \rho^{|j-k|},& \text{otherwise}\end{array}\right.

where \rho=0.9. The true coefficient vector is as defined in (41), and we generate 100 data sets with sample sizes n/n/1000, where n=100,200,400,600. Table 5 shows the simulation results.
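In R, this Toeplitz covariance can be constructed in one line (a standard base-R idiom, with p and rho as above):

# Sigma4[j, k] = rho^|j - k| with rho = 0.9
Sigma4 <- 0.9 ^ abs(outer(1:500, 1:500, "-"))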

Table 5: Performance measures for Toeplitz settings
n Method MSE(SE) TPR
100 Lasso 10440.42(53.28) 0.36
Ridge 12478.77(23.98) NA
Enet 10352.06(23.21) 0.89
DLSelect+Ridge 7789.103(22.61) 1
200 Lasso 713.34(4.13) 0.51
Ridge 654.85(3.8) NA
Enet 651.08(3.86) 0.99
DLSelect+Ridge 97.17(1.70) 0.77
400 Lasso 145.15(2.12) 0.54
Ridge 98.63(1.13) NA
Enet 103.57(1.19) 1
DLSelect+Ridge 52.19(0.85) 1
600 Lasso 200.15(1.68) 0.65
Ridge 169.01(1.27) NA
Enet 169.66(1.31) 0.99
DLSelect+Ridge 22.35(0.47) 1

Table 5 shows that the Lasso and the Ridge perform poorly for all settings. DLSelect+Ridge consistently selects the true active set and has better prediction accuracy for moderate sample sizes.

5.1.5 Independent Predictor Model

Finally, we consider an identity covariance matrix to generate the fixed design matrix \textbf{X}\sim N_{p}(0,I) with p=500. In this setting all predictors are uncorrelated. The true coefficient vector is as defined in (41), and we generate 100 data sets with sample sizes n/n/1000, where n=100,200,400,600.

Table 6 shows the simulation results.

Table 6: Performance measures for independent settings
n Method MSE(SE) TPR
100 Lasso 14.50(3.74) 1
Ridge 153.47(0.91) NA
Enet 28.85(4.59) 1
DLSelect+Ridge 39.16(29.73) 1
200 Lasso 2.32(0.23) 1
Ridge 139.34(0.76) NA
Enet 2.45(0.24) 1
DLSelect+Ridge 3.48(0.51) 1
400 Lasso 1.56(0.10) 1
Ridge 118.21(0.73) NA
Enet 1.59(0.10) 1
DLSelect+Ridge 3.70(0.46) 1
600 Lasso 1.35(0.06) 1
Ridge 8.60(0.53) NA
Enet 1.37(0.06) 1
DLSelect+Ridge 3.37(0.35) 1

Table 6 shows that the Lasso gives the best prediction accuracy and the Ridge performs poorly for all settings. The Enet and DLSelect+Ridge compete with each other.

5.2 Real Data Example

In this section, we consider five real-world data sets to evaluate the prediction and variable selection performance of the proposed method DLSelect+Ridge. We randomly split each data set into two halves 100 times; the first half is used for training (with cross-validation) and the second half is used as a test set. For evaluating variable selection, for the first two data sets (UScrime and Prostate) we consider all the variables as relevant, and for the remaining data sets we select the ten variables most highly correlated with the response and another ten variables which are most correlated with the selected ones. The median MSE, standard error and median TPR over the 100 splits are reported for each example.

5.2.1 USCrime Data

This is a classical data set collected in 1960, where criminologists are mainly interested in the effect of punishment on crime rates. There are 15 independent variables, and the response is the rate of crimes in a particular category per head of population. For more details on this data set we refer to Ehrlich (1973). The performance measures are reported in Table 7.

Table 7: Performance measures for UScrime data
Method MSE(SE) TPR
Lasso 87725(36371) 0.54
Ridge 77153(26118) NA
Enet 83403(34342) 0.45
DLSelect+Ridge 78275(24625) 0.54

Here, we have considered all covariates as important variables. The Ridge regression outperforms the other methods, and DLSelect+Ridge performs better than the Lasso and the Enet in terms of prediction performance as well as variable selection.

5.2.2 Prostate Data

The Prostate data set has 97 observations and 9 covariates. This data set is the outcome of a study that examined the correlation between the level of prostate specific antigen and a number of clinical measures in men who were about to receive a radical prostatectomy. For further details on the data set we refer to Stamey et al. (1989).

Table 8: Performance measures for Prostate data
Method MSE(SE) TPR
Lasso 0.56(0.09) 0.63
Ridge 0.56(0.08) NA
Enet 0.55(0.09) 0.63
DLSelect+Ridge 0.56(0.07) 1

The performance measures are reported in Table 8. Here, we have considered all covariates as important variables. From the table, it is clear that all methods report almost the same prediction error, and DLSelect+Ridge performs better than the Lasso and the Enet in terms of variable selection.

5.2.3 Riboflavin Data

The riboflavin data set consists of n=71 observations of p=4088 predictors (gene expressions) and a univariate response, the (log-transformed) riboflavin production rate; see Bühlmann et al. (2014) for details. Since the ground truth is not available, we use the riboflavin data for the design matrix \textbf{X} with synthetic parameters \beta and simulated Gaussian errors \epsilon\sim N_{n}(0,\sigma^{2}I). We fix the size of the active set to s=20 and \sigma=1, and for the true active set we select the ten predictors most highly correlated with the response and another ten variables which are most correlated with those selected variables. The true coefficient vector is

\beta_{j}=\left\{\begin{array}{ll}1&\text{ if }j\in S\\ 0&\text{ if }j\not\in S\end{array}\right.

Then we compute the response using Equation (1). The performance measures are reported in Table 9.

Table 9: Performance measures for Riboflavin data
Method MSE(SE) TPR
Lasso 96.69(63) 0.27
Ridge 290.98(138) NA
Enet 92.44(65) 0.44
DLSelect+Ridge 88.31(54) 0.38

From Table 9, we conclude that the Enet outperforms the other methods in terms of variable selection, whereas DLSelect+Ridge performs better than the others in terms of prediction.

5.2.4 Myeloma Data

We consider another real data set, the Myeloma data (n=173, p=12625), for the design matrix \textbf{X} with synthetic parameters \beta and simulated Gaussian errors. We refer to Tian et al. (2003) for details on the Myeloma data set. In this example also, we set the active set and generate the response in the same way as in the previous example (Riboflavin). The performance measures are reported in Table 10.

Table 10: Performance measures for Myeloma data
Method MSE(SE) TPR
Lasso 68.29(27.73) 0.35
Ridge 239.58(58.25) NA
Enet 70.37(29.16) 0.58
DLSelect+Ridge 75.25(28.70) 0.52

From Table 10, the Enet outperforms the other methods in terms of variable selection as well as prediction performance.

5.2.5 Leukaemia Data

We consider the well-known Leukaemia data set of Golub et al. (1999). In this example also, we set the active set and generate the response in the same way as in the previous examples. The performance measures are reported in Table 11.

Table 11: Performance measures for Leukaemia data
Method MSE(SE) TPR
Lasso 111.53(97.3) 0.46
Ridge 182.11(97.2) NA
Enet 90.47(82.64) 0.6
DLSelect+Ridge 72.43(67.3) 0.5

From Table 11, it is clear that DLSelect+Ridge gives the best prediction performance, while the Enet performs better than the Lasso and DLSelect+Ridge in terms of variable selection.


We report the following for each method.
MSE (SE): the mean squared prediction error as defined in (40), with its standard error.
TPR (True Positive Rate): the ratio of the number of truly identified non-zero components of \beta to the sparsity index s.

6 Computational Details

Statistical analysis was performed in R 3.2.2. We used the package "glmnet" for the penalized regression method (the Lasso).

7 Concluding Remarks

The main achievements of this work are summarized as follows. We argued that correlations among active predictors are not problematic, as long as the PIC is satisfied by the design matrix. In particular, we proved that the dual Lasso performs consistent variable selection under the assumption of the PIC. Exploiting this result, we proposed the DLSelect+Ridge method. We illustrated the DLSelect+Ridge method on simulated and real high-dimensional data sets. The numerical studies based on the simulations and real examples show clearly that the proposed method is very competitive in terms of variable selection, prediction accuracy, estimation accuracy and computation speed.

Appendix A

A.1 Derivation of the Dual Form of the Lasso

In this section, we derive the Lagrange dual of the Lasso problem (2), which serves as the selection operator for our approach. That is, by considering the lasso and its dual simultaneously it is possible to identify the non-zero entries in the estimator. For more details on dual derivation and projection on polytope formed by the dual constraints, we refer to Wang et al. (2015).

We recall that the Lasso problem is defined as the following convex optimization problem.

\min_{\beta\in\mathbb{R}^{p}}\left\{\frac{1}{2}\|\textbf{Y}-\textbf{X}\beta\|_{2}^{2}+\lambda\|\beta\|_{1}\right\}   (42)

Since the above problem has no constraints, its dual problem is trivial. Therefore we introduce a new vector \textbf{r}=\textbf{Y}-\textbf{X}\beta; the Lasso problem can then be written as:

\min_{\beta\in\mathbb{R}^{p}} \left\{\frac{1}{2}\|\textbf{r}\|_{2}^{2}+\lambda\|\beta\|_{1}\right\}   (43)
subject to \textbf{r}=\textbf{Y}-\textbf{X}\beta

Now, to account for the constraint we introduce the dual vector \theta\in\mathbb{R}^{n}; then we get the following Lagrangian with \beta and \textbf{r} as primal variables.

L(\beta,\textbf{r},\theta)=\frac{1}{2}\|\textbf{r}\|_{2}^{2}+\lambda\|\beta\|_{1}+\theta^{T}(\textbf{Y}-\textbf{X}\beta-\textbf{r})   (44)

Then the dual function can be written as:

g(\theta) =\inf_{\beta,\textbf{r}}L(\beta,\textbf{r},\theta)
=\inf_{\beta,\textbf{r}}\left\{\frac{1}{2}\|\textbf{r}\|_{2}^{2}+\lambda\|\beta\|_{1}+\theta^{T}(\textbf{Y}-\textbf{X}\beta-\textbf{r})\right\}
=\theta^{T}\textbf{Y}+\inf_{\textbf{r}}\left\{\frac{1}{2}\|\textbf{r}\|_{2}^{2}-\theta^{T}\textbf{r}\right\}+\inf_{\beta}\left\{\lambda\|\beta\|_{1}-\theta^{T}\textbf{X}\beta\right\}
=\theta^{T}\textbf{Y}+\inf_{\textbf{r}}L_{1}(\textbf{r})+\inf_{\beta}L_{2}(\beta)

The first infimum is attained at \textbf{r}=\theta, which gives

\inf_{\textbf{r}}L_{1}(\textbf{r})=-\frac{1}{2}\|\theta\|_{2}^{2}   (45)

Since L_{2}(\beta) is non-differentiable, we consider its subdifferential

\partial L_{2}(\beta)=\lambda v-\textbf{X}^{T}\theta,

where v is a subgradient of \|\beta\|_{1}, satisfying \|v\|_{\infty}\leq 1 and v^{T}\beta=\|\beta\|_{1}. For L_{2} to attain a finite infimum, zero must belong to the subdifferential, so the following must hold.

\lambda v-\textbf{X}^{T}\theta =0
\implies\textbf{X}^{T}\theta =\lambda v
\therefore|X_{j}^{T}\theta|\leq\lambda\; \text{ for all } j\in\{1,...,p\}   (46)

From (45) and (46), we get the dual objective function:

g(\theta) =\theta^{T}\textbf{Y}-\frac{1}{2}\theta^{T}\theta
=\frac{1}{2}\|\textbf{Y}\|_{2}^{2}-\frac{1}{2}\|\theta-\textbf{Y}\|_{2}^{2}
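The last equality follows by completing the square:

\theta^{T}\textbf{Y}-\frac{1}{2}\|\theta\|_{2}^{2}
=\frac{1}{2}\|\textbf{Y}\|_{2}^{2}-\frac{1}{2}\left(\|\theta\|_{2}^{2}-2\theta^{T}\textbf{Y}+\|\textbf{Y}\|_{2}^{2}\right)
=\frac{1}{2}\|\textbf{Y}\|_{2}^{2}-\frac{1}{2}\|\theta-\textbf{Y}\|_{2}^{2}.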

Then the dual problem is given as:

\sup_{\theta}\; g(\theta)=\frac{1}{2}\|\textbf{Y}\|_{2}^{2}-\frac{1}{2}\|\theta-\textbf{Y}\|_{2}^{2}
subject to |X_{j}^{T}\theta|\leq\lambda\; \text{ for all } j\in\{1,...,p\}   (47)

A.2 Relationship Between the Lasso and its Dual Optimal

In this section, we derive the relationship between the Lasso optimal and its dual optimal.

For a fixed \lambda, the Lasso problem (2) is convex in \beta and strictly feasible since it has no constraints; therefore, by Slater's condition, strong duality holds. Suppose that \hat{\beta}, \hat{\textbf{r}} and \hat{\theta} are optimal primal and dual variables; then, by the KKT conditions, the following must hold.

0 \in\partial_{\beta}L(\hat{\beta},\hat{\textbf{r}},\hat{\theta})   (48)
\nabla_{\textbf{r}}L(\hat{\beta},\hat{\textbf{r}},\hat{\theta}) =\hat{\textbf{r}}-\hat{\theta}=0   (49)
\nabla_{\theta}L(\hat{\beta},\hat{\textbf{r}},\hat{\theta}) =\textbf{Y}-\textbf{X}\hat{\beta}-\hat{\textbf{r}}=0   (50)

From (48) we get

\textbf{X}^{T}\hat{\theta} =\lambda\hat{v}
\implies|X_{j}^{T}\hat{\theta}| \leq\lambda\; \text{ for all } j\in\{1,...,p\}

Or equivalently, for all j\in\{1,...,p\} the following must hold.

X_{j}^{T}\hat{\theta}=\left\{\begin{array}{ll}\lambda&\text{ if }\hat{\beta}_{j}>0\\ \in\left[-\lambda,\lambda\right]&\text{ if }\hat{\beta}_{j}=0\\ -\lambda&\text{ if }\hat{\beta}_{j}<0\end{array}\right.   (54)

From the above equation (54), we get the following important result.

|X_{j}^{T}\hat{\theta}|<\lambda\implies\hat{\beta}_{j}=0   (55)

Finally, from (49) and (50) we get the following equality,

\hat{\theta}=\textbf{Y}-\textbf{X}\hat{\beta}   (56)

and substituting the value of \hat{\theta} into (54) we get the following expression.

\textbf{X}^{T}(\textbf{Y}-\textbf{X}\hat{\beta})=\lambda v.   (57)

A.3 Proof of Lemma 2

Proof  Without loss of generality we can assume that the first s=|S| variables are the active variables, and we partition the empirical covariance matrix as in Equation (7), with \hat{\beta}=(\beta_{1}\;\beta_{2})^{T} and \hat{v}=(v_{1}\;v_{2})^{T} partitioned accordingly. Let us recall the IC (for the noiseless case, for simplicity); it is defined as follows.

Definition 6 (Irrepresentable Condition (IC))

The irrepresentable condition is said to be met for the set S with a constant \eta>0, if the following holds:

\|C_{21}C_{11}^{-1}sign(\beta_{1})\|_{\infty}\leq 1-\eta.   (58)

Under the IC, the Lasso solution is unique. If we further assume the beta-min condition, then the following holds; see Bühlmann and van de Geer (2011) for a detailed proof.

S=\hat{S}_{lasso}.

The proof of Lemma 2 is fairly simple; we prove it by contradiction. Let us assume that \hat{S}_{lasso}\neq\hat{S}_{dual}. Then, by Lemma 1, the Lasso active set \hat{S}_{lasso} is a proper subset of the dual active set \hat{S}_{dual}, and it follows that there exists some j\not\in\hat{S}_{lasso} for which the following condition is satisfied.

\hat{\beta}_{j}=0 \text{ and } |X_{j}^{T}\hat{\theta}|=\lambda

Substituting \hat{\theta}=\textbf{Y}-\textbf{X}\hat{\beta} (see Appendix A.2) and \textbf{Y}=\textbf{X}\beta, we get the following.

|X_{j}^{T}(\textbf{Y}-\textbf{X}\hat{\beta})| =\lambda
\implies|X_{j}^{T}\textbf{X}(\beta-\hat{\beta})| =\lambda

Under the IC the Lasso selects the true active set, so we have \beta_{2}=\hat{\beta}_{2}=0, and some algebraic simplification gives the following equality.

\implies|(C_{21})_{j}(\beta_{1}-\hat{\beta}_{1})|=\lambda   (59)

From the KKT condition (see Appendix A.2) we have:

\textbf{X}^{T}(\textbf{Y}-\textbf{X}\hat{\beta})=\lambda\hat{v},

where \|\hat{v}\|_{\infty}\leq 1 and \hat{v}^{T}\hat{\beta}=\|\hat{\beta}\|_{1}. By substituting \textbf{Y}=\textbf{X}\beta, we get

\textbf{X}^{T}\textbf{X}(\hat{\beta}-\beta) =-\lambda\hat{v}

We can write the above equation in terms of the partitions of C=\hat{\Sigma} as follows.

\left[\begin{array}{cc}C_{11}&C_{12}\\ C_{21}&C_{22}\end{array}\right]\left(\begin{array}{c}\beta_{1}-\hat{\beta_{1}}\\ \beta_{2}-\hat{\beta_{2}}\end{array}\right) =\lambda\left(\begin{array}{c}v_{1}\\ v_{2}\end{array}\right)

Since \beta_{2}=\hat{\beta_{2}}=0, we get the following equality.

\beta_{1}-\hat{\beta_{1}}=\lambda C_{11}^{-1}sign(\beta_{1})

Substituting value of β1β1^\beta_{1}-\hat{\beta_{1}} into the equation (59) we get

\[
\begin{aligned}
|(C_{21})_j\,\lambda C_{11}^{-1}\mathrm{sign}(\beta_1)| &= \lambda \qquad (60)\\
\implies |(C_{21})_j\,C_{11}^{-1}\mathrm{sign}(\beta_1)| &= 1 \qquad (61)
\end{aligned}
\]

This violates the IC. Hence, under the assumption of the IC,

\[
\hat{\beta}_j=0\implies|X_j^T\hat{\theta}|<\lambda.
\]

Therefore the following equality must hold, which completes the proof.

\[
\hat{S}_{lasso}(\lambda)=\hat{S}_{dual}(\lambda).
\]
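The sets $\hat{S}_{lasso}$ and $\hat{S}_{dual}$ can also be compared numerically. The sketch below (a hypothetical example of ours, again assuming scikit-learn's parameterisation $\lambda=n\alpha$ and using a numerical tolerance in place of exact equality $|X_j^T\hat{\theta}|=\lambda$) builds both sets from a single Lasso fit; with strongly correlated active variables the Lasso may drop one of them, while the dual active set still contains it.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical design with two strongly correlated active predictors (columns 0 and 1).
rng = np.random.default_rng(1)
n, p = 100, 20
X = rng.standard_normal((n, p))
X[:, 1] = X[:, 0] + 0.01 * rng.standard_normal(n)
beta = np.zeros(p)
beta[[0, 1, 2]] = [1.0, 1.0, 2.0]
y = X @ beta + 0.1 * rng.standard_normal(n)

alpha = 0.1
lam = n * alpha
fit = Lasso(alpha=alpha, fit_intercept=False, tol=1e-10, max_iter=200000).fit(X, y)
theta_hat = y - X @ fit.coef_                          # dual solution

S_lasso = set(np.flatnonzero(fit.coef_ != 0))
# Dual active set, with a small relative tolerance to account for solver accuracy.
S_dual = set(np.flatnonzero(np.abs(X.T @ theta_hat) >= lam * (1 - 1e-3)))
print("S_lasso:", sorted(S_lasso))                     # always a subset of S_dual
print("S_dual :", sorted(S_dual))
```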
 

Appendix A.4 IC implies Lasso Variable Selection

Proof  This result and its proof are from Bühlmann and van de Geer (2011). The IC depends on the covariance of the predictors, $C=\hat{\Sigma}$, and on the signs of the unknown true parameter $\beta$ (the beta-min condition is implicit). For simplicity, we prove the result for the noiseless case, where $\textbf{Y}=\textbf{X}\beta$. We first assume that the IC holds and show that the Lasso correctly identifies the active set $S$. From the KKT condition (57), substituting $\textbf{Y}=\textbf{X}\beta$, we get

\[
\textbf{X}^T\textbf{X}(\hat{\beta}-\beta)=-\lambda v
\]

\[
\left[\begin{array}{cc}C_{11}&C_{12}\\ C_{21}&C_{22}\end{array}\right]\left(\begin{array}{c}\beta_1-\hat{\beta}_1\\ \beta_2-\hat{\beta}_2\end{array}\right)=\lambda\left(\begin{array}{c}v_1\\ v_2\end{array}\right)
\]

We note that $\beta_2$, the subvector of the true parameter corresponding to $S^c$, is the zero vector by definition. After some simplification we obtain the following two equations:

\[
\begin{aligned}
C_{11}(\beta_1-\hat{\beta}_1)-C_{12}\hat{\beta}_2 &= \lambda v_1 \qquad (62)\\
C_{21}(\beta_1-\hat{\beta}_1)-C_{22}\hat{\beta}_2 &= \lambda v_2 \qquad (63)
\end{aligned}
\]

After some algebraic simplification of the first equation we get

\[
\beta_1-\hat{\beta}_1=C_{11}^{-1}(C_{12}\hat{\beta}_2+\lambda v_1)
\]

Substituting this value of $\beta_1-\hat{\beta}_1$ into the second equation gives

\[
C_{21}C_{11}^{-1}(C_{12}\hat{\beta}_2+\lambda v_1)-C_{22}\hat{\beta}_2=\lambda v_2.
\]

Multiplying both sides by $\hat{\beta}_2^T$ and using $\hat{\beta}_2^T v_2=\|\hat{\beta}_2\|_1$ gives

\[
\hat{\beta}_2^T(C_{22}-C_{21}C_{11}^{-1}C_{12})\hat{\beta}_2=-\lambda\|\hat{\beta}_2\|_1+\lambda\hat{\beta}_2^TC_{21}C_{11}^{-1}v_1.
\]

Applying Hölder's inequality to the term $\lambda\hat{\beta}_2^TC_{21}C_{11}^{-1}v_1$ on the right-hand side, and then the IC (recall that $v_1=\mathrm{sign}(\beta_1)$ under the beta-min condition), we get

\[
\lambda\hat{\beta}_2^TC_{21}C_{11}^{-1}v_1\leq\lambda\|\hat{\beta}_2\|_1\|C_{21}C_{11}^{-1}v_1\|_{\infty}\leq(1-\eta)\lambda\|\hat{\beta}_2\|_1.
\]

Substituting this bound back, we obtain

\[
\hat{\beta}_2^T(C_{22}-C_{21}C_{11}^{-1}C_{12})\hat{\beta}_2\leq-\eta\lambda\|\hat{\beta}_2\|_1.
\]

If $\hat{\beta}_2\neq 0$, then $\lambda\|\hat{\beta}_2\|_1>0$ and the right-hand side is strictly negative, so

\[
\hat{\beta}_2^T(C_{22}-C_{21}C_{11}^{-1}C_{12})\hat{\beta}_2<0.
\]

But the matrix $C_{22}-C_{21}C_{11}^{-1}C_{12}$ is positive semi-definite (it is the Schur complement of $C_{11}$ in $C$), so we have arrived at a contradiction. Therefore $\hat{\beta}_{S^c}=0$ for any Lasso solution; hence the Lasso correctly identifies all the zero components, and $\hat{S}_{lasso}\subseteq S$.
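The positive semi-definiteness used in the last step can also be checked numerically for any empirical covariance; a minimal sketch (with an arbitrary hypothetical design, for illustration only) is given below.

```python
import numpy as np

# The Schur complement C_22 - C_21 C_11^{-1} C_12 of a positive semi-definite
# covariance matrix is itself positive semi-definite.
rng = np.random.default_rng(2)
n, p, s = 30, 8, 3
X = rng.standard_normal((n, p))
C = X.T @ X / n                                   # empirical covariance, C = Sigma_hat

C11, C12 = C[:s, :s], C[:s, s:]
C21, C22 = C[s:, :s], C[s:, s:]
schur = C22 - C21 @ np.linalg.inv(C11) @ C12
print("smallest eigenvalue:", np.linalg.eigvalsh(schur).min())   # >= 0 up to rounding error
```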

Now we assume that the Lasso selects the true active set, and we show that the IC holds. It is given that $\hat{\beta}_2=\hat{\beta}_{S^c}=0$. Using the KKT conditions again and substituting $\hat{\beta}_2=0$ into (62) and (63), we get the following expressions.

\[
\begin{aligned}
C_{11}(\beta_1-\hat{\beta}_1) &= \lambda v_1\\
C_{21}(\beta_1-\hat{\beta}_1) &= \lambda v_2
\end{aligned}
\]

After solving the above we have

\[
C_{21}C_{11}^{-1}\lambda v_1=\lambda v_2
\]

Since $\|v_2\|_{\infty}<1$ and $v_1=\mathrm{sign}(\beta_1)$, we have the following inequality.

\[
\begin{aligned}
\|C_{21}C_{11}^{-1}\lambda\,\mathrm{sign}(\beta_1)\|_{\infty} &< \lambda\\
\implies \|C_{21}C_{11}^{-1}\mathrm{sign}(\beta_1)\|_{\infty} &< 1
\end{aligned}
\]
 

Appendix A.5 PIC implies Dual Lasso Variable Selection

Proof  The proof is similar to the one given for the IC (see Appendix A.4), except that $C_{11}^{-1}$ is replaced by a generalized inverse $C_{11}^{+}$. The PIC, like the IC, depends on the covariance of the predictors, $C=\hat{\Sigma}$, and on the signs of the unknown true parameter $\beta$. For simplicity, we prove the result for the noiseless case, where $\textbf{Y}=\textbf{X}\beta$. We first assume that the PIC holds and show that the dual Lasso correctly identifies the active set $S$. From the KKT condition (57), substituting $\textbf{Y}=\textbf{X}\beta$, we get

\[
\textbf{X}^T\textbf{X}(\hat{\beta}-\beta)=-\lambda v
\]

\[
\left[\begin{array}{cc}C_{11}&C_{12}\\ C_{21}&C_{22}\end{array}\right]\left(\begin{array}{c}\beta_1-\hat{\beta}_1\\ \beta_2-\hat{\beta}_2\end{array}\right)=\lambda\left(\begin{array}{c}v_1\\ v_2\end{array}\right)
\]

We note that $\beta_2$, the subvector of the true parameter corresponding to $S^c$, is the zero vector by definition. After some simplification we obtain the following two equations:

\[
\begin{aligned}
C_{11}(\beta_1-\hat{\beta}_1)-C_{12}\hat{\beta}_2 &= \lambda v_1 \qquad (64)\\
C_{21}(\beta_1-\hat{\beta}_1)-C_{22}\hat{\beta}_2 &= \lambda v_2 \qquad (65)
\end{aligned}
\]

After simplification of the first equation we get

\[
\beta_1-\hat{\beta}_1=C_{11}^{+}(C_{12}\hat{\beta}_2+\lambda v_1)
\]

Substituting this value of $\beta_1-\hat{\beta}_1$ into the second equation gives

\[
C_{21}C_{11}^{+}(C_{12}\hat{\beta}_2+\lambda v_1)-C_{22}\hat{\beta}_2=\lambda v_2.
\]

Multiplying both sides by $\hat{\beta}_2^T$ and using $\hat{\beta}_2^T v_2=\|\hat{\beta}_2\|_1$ gives

\[
\hat{\beta}_2^T(C_{22}-C_{21}C_{11}^{+}C_{12})\hat{\beta}_2=-\lambda\|\hat{\beta}_2\|_1+\lambda\hat{\beta}_2^TC_{21}C_{11}^{+}v_1.
\]

Applying Hölder's inequality to the term $\lambda\hat{\beta}_2^TC_{21}C_{11}^{+}v_1$ on the right-hand side (recall that $v_1=\mathrm{sign}(\beta_1)$ under the beta-min condition), we get

\[
\lambda\hat{\beta}_2^TC_{21}C_{11}^{+}v_1\leq\lambda\|\hat{\beta}_2\|_1\|C_{21}C_{11}^{+}\mathrm{sign}(\beta_1)\|_{\infty}.
\]

Substituting this bound back and using the PIC, $\|C_{21}C_{11}^{+}\mathrm{sign}(\beta_1)\|_{\infty}<1$, we obtain

\[
\hat{\beta}_2^T(C_{22}-C_{21}C_{11}^{+}C_{12})\hat{\beta}_2\leq-\lambda\big(1-\|C_{21}C_{11}^{+}\mathrm{sign}(\beta_1)\|_{\infty}\big)\|\hat{\beta}_2\|_1.
\]

If $\hat{\beta}_2\neq 0$, then $\lambda\|\hat{\beta}_2\|_1>0$ and the right-hand side is strictly negative, so

\[
\hat{\beta}_2^T(C_{22}-C_{21}C_{11}^{+}C_{12})\hat{\beta}_2<0.
\]

The matrix $C_{22}-C_{21}C_{11}^{+}C_{12}$ is positive semi-definite (a generalized Schur complement of the positive semi-definite matrix $C$), so we have arrived at a contradiction. Therefore $\hat{\beta}_{S^c}=0$ for any Lasso solution, and the Lasso correctly identifies all the zero components. By the same argument as in Lemma 2, it then follows that $|X_j^T\hat{\theta}|<\lambda$ for all $j\in S^c$. Therefore the PIC implies that the dual Lasso selects the true active set.
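Computationally, the only change relative to the IC is the use of a generalized inverse. The sketch below (a hypothetical design of ours with two identical active columns, so that $C_{11}$ is singular and the IC quantity is not even defined) evaluates $\|C_{21}C_{11}^{+}\mathrm{sign}(\beta_1)\|_{\infty}$ using the Moore–Penrose pseudoinverse.

```python
import numpy as np

def pic_value(C, S, sign_beta_S):
    """||C_21 C_11^{+} sign(beta_1)||_inf, with C_11^{+} the Moore-Penrose pseudoinverse."""
    S = np.asarray(S)
    Sc = np.setdiff1d(np.arange(C.shape[0]), S)
    C11 = C[np.ix_(S, S)]
    C21 = C[np.ix_(Sc, S)]
    return np.max(np.abs(C21 @ np.linalg.pinv(C11) @ sign_beta_S))

rng = np.random.default_rng(3)
n, p, s = 50, 10, 3
X = rng.standard_normal((n, p))
X[:, 1] = X[:, 0]                    # two identical active variables: C_11 is singular
X /= np.linalg.norm(X, axis=0)       # normalise the columns
C = X.T @ X

val = pic_value(C, np.arange(s), np.ones(s))
print("PIC quantity:", val, "-> PIC holds" if val < 1 else "-> PIC violated")
```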

Now we assume that the dual Lasso selects the true active set, and we show that the PIC holds. It is given that $|X_j^T\hat{\theta}|<\lambda$ for all $j\in S^c$. Therefore, by (55), $\hat{\beta}_{S^c}=0$ for any Lasso solution. Using the KKT conditions again and substituting $\hat{\beta}_2=0$ into (64) and (65), we get the following expressions.

\[
\begin{aligned}
C_{11}(\beta_1-\hat{\beta}_1) &= \lambda v_1\\
C_{21}(\beta_1-\hat{\beta}_1) &= \lambda v_2
\end{aligned}
\]

After solving the above we have

\[
C_{21}C_{11}^{+}\lambda v_1=\lambda v_2
\]

Since $|X_j^T\hat{\theta}|<\lambda$ for all $j\in S^c$, we have $\|v_2\|_{\infty}<1$; with $v_1=\mathrm{sign}(\beta_1)$, the PIC follows.

\[
\begin{aligned}
\|C_{21}C_{11}^{+}\lambda\,\mathrm{sign}(\beta_1)\|_{\infty} &< \lambda\\
\implies \|C_{21}C_{11}^{+}\mathrm{sign}(\beta_1)\|_{\infty} &< 1
\end{aligned}
\]
 

References

  • Anbari and Mkhadri (2014) Mohammed El Anbari and Abdallah Mkhadri. Penalized regression combining the l1 norm and a correlation based penalty. Sankhya, 76-B:82–102, 2014.
  • Belloni and Chernozhukov (2013) Alexandre Belloni and Victor Chernozhukov. Least squares after model selection in high-dimensional sparse models. Bernoulli, 19:521–547, 2013.
  • Bondell and Reich (2008) Howard D. Bondell and Brian J. Reich. Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with oscar. Biometrics, 64:115–123, 2008.
  • Bühlmann and van de Geer (2011) Peter Bühlmann and Sara van de Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Verlag, 2011.
  • Bühlmann et al. (2012) Peter Bühlmann, Philipp Rütimann, Sara van de Geer, and Cun-Hui Zhang. Correlated variables in regression: clustering and sparse estimation. Journal of Statistical Planning and Inference, 143:1835–1871, 2012.
  • Bühlmann et al. (2014) Peter Bühlmann, Markus Kalisch, and Lukas Meier. High-dimensional statistics with a view towards applications in biology. Annual Review of Statistics and Its Application, 1:255–278, 2014.
  • Efron et al. (2004) Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression. Ann. Statist., 32(2):407–499, 2004.
  • Ehrlich (1973) I Ehrlich. Participation in illegitimate activities: a theoretical and empirical investigation. Journal of Political Economy, 81:521–565, 1973.
  • Gauraha (2016) Niharika Gauraha. Stability feature selection using cluster representative lasso. In Proceedings of the 5th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM,, pages 381–386, 2016.
  • Golub et al. (1999) T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. H. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286:531–538, 1999.
  • Hebiri and Lederer (2013) Mohamed Hebiri and Johannes Lederer. How correlations influence lasso prediction. IEEE Trans. Inf. Theor., 59(3):1846–1854, 2013.
  • Hoerl and Kennard (1970) Arthur E. Hoerl and Robert W. Kennard. Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12:55–67, 1970.
  • Liu and Yu (2013) Hanzhong Liu and Bin Yu. Asymptotic properties of lasso+mls and lasso+ridge in sparse high-dimensional linear regression. Electron. J. Statist., 7:3124–3169, 2013.
  • Omidiran and Wainwright (2010) Dapo Omidiran and Martin J. Wainwright. High-dimensional variable selection with sparse random projections: Measurement sparsity and statistical efficiency. J. Mach. Learn. Res., 11:2361–2386, 2010.
  • Osborne et al. (2000) Michael R. Osborne, Brett Presnell, and Berwin A. Turlach. A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20(3):389–403, 2000.
  • Segal et al. (2003) M. Segal, K. Dahlquist, and B. Conklin. Regression approaches for microarray data analysis. Journal of Computational Biology, 10:961–980, 2003.
  • She (2010) Yiyuan She. Sparse regression with exact clustering. Electron. J. Statist., 4:1055–1096, 2010.
  • Stamey et al. (1989) T.A. Stamey, J.N. Kabalin, J.E. McNeal, I.M. Johnstone, F. Freiha, E.A. Redwine, and N. Yang. Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate: Ii. radical prostatectomy treated patients. Journal of Urology, 141(5):1076–1083, 1989.
  • Tian et al. (2003) E Tian, F Zhan, R Walker, E Rasmussen, Y Ma, B Barlogie, and JD Jr Shaughnessy. The role of the wnt-signaling antagonist dkk1 in the development of osteolytic lesions in multiple myeloma. N Engl J Med., 349(26):2483–2494, 2003.
  • Tibshirani (1996) Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1996.
  • Tibshirani et al. (2005) Robert Tibshirani, Michael Saunders, Saharon Rosset, Ji Zhu, and Keith Knight. Sparsity and smoothness via the fused lasso. J. R. Stat. Soc. Ser. B Stat. Methodol., 67(1):91–108, 2005. ISSN 1369-7412.
  • Tibshirani and Taylor (2011) Ryan Tibshirani and Jonathan Taylor. The solution path of the generalized lasso. Ann. Statist., 39(3):1335–1371, 2011.
  • Tutz and Ulbricht (2009) Gerhard Tutz and Jan Ulbricht. Penalized regression with correlation-based penalty. Statistics and Computing, 19(3):239–253, 2009.
  • van de Geer and Lederer (2013) Sara van de Geer and Johannes Lederer. The Lasso, correlated design, and improved oracle inequalities. Volume 9 of IMS Collections, pages 303–316. Institute of Mathematical Statistics, Beachwood, Ohio, USA, 2013.
  • Wang et al. (2015) Jie Wang, Jiayu Zhou, Peter Wonka, and Jieping Ye. Lasso screening rules via dual polytope projection. Journal of Machine Learning Research, 16:1063–1101, 2015.
  • Zhao and Yu (2006) Peng Zhao and Bin Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7:2541–2563, 2006.
  • Zou and Hastie (2005) Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. J. R. Statist. Soc. B, 67:301–320, 2005.