
Sparse group lasso and high dimensional multinomial classification.

Martin Vincent¹, Niels Richard Hansen
¹Corresponding author. Tel.: +4522860740. E-mail address: vincent@math.ku.dk (M. Vincent).
University of Copenhagen, Department of Mathematical Sciences, Universitetsparken 5, 2100 Copenhagen Ø, Denmark
Abstract

The sparse group lasso optimization problem is solved using a coordinate gradient descent algorithm. The algorithm is applicable to a broad class of convex loss functions. Convergence of the algorithm is established, and the algorithm is used to investigate the performance of the multinomial sparse group lasso classifier. On three real data examples the multinomial group lasso clearly outperforms the multinomial lasso, both in terms of achieved classification error rate and in terms of the number of features included in the model. The run-time of our sparse group lasso implementation is of the same order of magnitude as that of the multinomial lasso algorithm implemented in the R package glmnet. Our implementation scales well with the problem size. One of the high dimensional examples considered is a 50 class classification problem with 10k features, which amounts to estimating 500k parameters. The implementation is available as the R package msgl.

keywords:
Sparse group lasso, classification, high dimensional data analysis, coordinate gradient descent, penalized loss.
journal: Computational Statistics & Data Analysis

1 Introduction

The sparse group lasso is a regularization method that combines the lasso [1] and the group lasso [2]. Friedman et al. [3] proposed a coordinate descent approach for the sparse group lasso optimization problem. Simon et al. [4] used a generalized gradient descent algorithm for the sparse group lasso and considered applications of this method for linear, logistic and Cox regression. We present a sparse group lasso algorithm suitable for high dimensional problems. This algorithm is applicable to a broad class of convex loss functions. In the algorithm we combine three non-differentiable optimization methods: the coordinate gradient descent [5], the block coordinate descent [6] and a modified coordinate descent method.

Our main application is to multinomial regression. In Section 1.1 we introduce the general sparse group lasso optimization problem with a convex loss function. Part I investigates the performance of the multinomial sparse group lasso classifier. In Part II we present the general sparse group lasso algorithm and establish convergence.

The formulation of an efficient and robust sparse group lasso algorithm is not straightforward due to the non-differentiability of the penalty. Firstly, the sparse group lasso penalty is not completely separable, which is problematic when using a standard coordinate descent scheme. To obtain a robust algorithm an adjustment is necessary. Our solution is a minor modification of the coordinate descent algorithm; it efficiently treats the singularity at zero that cannot be separated out. Secondly, our algorithm is a Newton type algorithm; hence we sequentially optimize penalized quadratic approximations of the loss function. This approach raises another challenge: how to reduce the cost of computing the Hessian? In Section 4.6 we show that an upper bound on the Hessian is sufficient to determine whether the minimum over a block of coefficients is attained at zero. This enables us to update a large percentage of the blocks without computing the complete Hessian. In this way we reduce the run-time, provided that the upper bound of the Hessian can be computed efficiently. We found that this approach reduces the run-time on large data sets by a factor of more than 2.

Our focus is on applications of the multinomial sparse group lasso to problems with many classes. For this purpose we choose three multiclass classification problems. We found that the multinomial group lasso and sparse group lasso perform well on these problems. The error rates were substantially lower than the best obtained with multinomial lasso, and the low error rates were achieved for models with fewer features having non-zero coefficients. For example, we consider a text classification problem consisting of Amazon reviews with 50 classes and 10k textual features. This problem showed a large improvement in the error rates: from approximately 40% for the lasso to less than 20% for the group lasso.

We provide a generic implementation of the sparse group lasso algorithm in the form of a C++ template library. The implementation for multinomial and logistic sparse group lasso regression is available as an R package. For our implementation the time to compute the sparse group lasso solution is of the same order of magnitude as the time required for the multinomial lasso algorithm as implemented in the R-package glmnet. The computation time of our implementation scales well with the problem size.

1.1 Sparse group lasso

Consider a convex, bounded below and twice continuously differentiable function f:\mathbb{R}^{n}\to\mathbb{R}. We say that \hat{\beta}\in\mathbb{R}^{n} is a sparse group lasso minimizer if it is a solution to the unconstrained convex optimization problem

\operatorname*{minimize}\,f+\lambda\Phi (1)

where \Phi:\mathbb{R}^{n}\to\mathbb{R} is the sparse group lasso penalty (defined below) and \lambda>0.

Before defining the sparse group lasso penalty some notation is needed. We decompose the search space

\mathbb{R}^{n}=\mathbb{R}^{n_{1}}\times\dots\times\mathbb{R}^{n_{m}}

into m\in\mathbb{N} blocks having dimensions n_{i}\in\mathbb{N} for i=1,\dots,m, hence n=n_{1}+\dots+n_{m}. For a vector \beta\in\mathbb{R}^{n} we write \beta=({\beta}^{(1)},\dots,{\beta}^{(m)}) where {\beta}^{(1)}\in\mathbb{R}^{n_{1}},\dots,{\beta}^{(m)}\in\mathbb{R}^{n_{m}}. For J=1,\dots,m we call {\beta}^{(J)} the J'th block of \beta. We use the notation {\beta}^{(J)}_{i} to denote the i'th coordinate of the J'th block of \beta, whereas \beta_{i} is the i'th coordinate of \beta.

Definition 1 (Sparse group lasso penalty).

The sparse group lasso penalty is defined as

\Phi(\beta)\stackrel{\text{def}}{=}(1-\alpha)\sum_{J=1}^{m}\gamma_{J}\left\|{\beta}^{(J)}\right\|_{2}+\alpha\sum_{i=1}^{n}\xi_{i}\left|\beta_{i}\right|

for \alpha\in[0,1], group weights \gamma\in[0,\infty)^{m}, and parameter weights \xi=({\xi}^{(1)},\dots,{\xi}^{(m)})\in[0,\infty)^{n} where {\xi}^{(1)}\in[0,\infty)^{n_{1}},\dots,{\xi}^{(m)}\in[0,\infty)^{n_{m}}.

The sparse group lasso penalty includes the lasso penalty (\alpha=1) and the group lasso penalty (\alpha=0). Note also that for sufficiently large values of \lambda the minimizer of (1) is zero. The infimum of such \lambda values, denoted \lambda_{\text{max}}, is computable; see Section 4.2.
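
To make the definition concrete, a minimal Python sketch of the penalty in Definition 1 could look as follows; the block representation, argument names and layout are our own illustration and not the interface of the msgl package.

```python
import numpy as np

def sgl_penalty(beta, blocks, alpha, gamma, xi):
    """Sparse group lasso penalty of Definition 1.

    beta   : parameter vector of length n
    blocks : list of index arrays defining the m blocks
    alpha  : mixing parameter in [0, 1]
    gamma  : group weights, one per block
    xi     : parameter weights, one per coordinate
    """
    group_term = sum(gamma[J] * np.linalg.norm(beta[idx])
                     for J, idx in enumerate(blocks))
    lasso_term = np.sum(xi * np.abs(beta))
    return (1 - alpha) * group_term + alpha * lasso_term
```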

We emphasize that the sparse group lasso penalty is specified by

  • 1. a grouping of the parameters \beta=({\beta}^{(1)},\dots,{\beta}^{(m)}),

  • 2. and the weights \alpha, \gamma and \xi.

In Part I we consider multinomial regression; here the parameter grouping is given by the multinomial model. For multinomial as well as other regression models, the grouping can also reflect a grouping of the features.

Part I The multinomial sparse group lasso classifier

In this section we examine the characteristics of the multinomial sparse group lasso method. Our main interest is the application of the multinomial sparse group lasso classifier to problems with many classes. For this purpose we have chosen three classification problems based on three different data sets, with 10, 18 and 50 classes. In [7] the microRNA expression profile of different types of primary cancer samples is studied. In Section 3.1 we consider the problem of classifying the primary site based on the microRNA profiles in this data set. The Amazon reviews author classification problem, presented in [8], is studied in Section 3.2. The messenger RNA profile of different human muscle diseases is studied in [9]. We consider, in Section 3.3, the problem of classifying the disease based on the messenger RNA profiles in this data set. Table 1 summarizes the dimensions and characteristics of the data sets and the associated classification problems. Finally, in Section 4, we examine the characteristics of the method applied to simulated data sets.

2 Setup

Consider a classification problem with K classes, N samples, and p features. Assume given a data set (x_{1},y_{1}),\dots,(x_{N},y_{N}) where, for all i=1,\dots,N, x_{i}\in\mathbb{R}^{p} is the observed feature vector and y_{i}\in\{1,\dots,K\} is the categorical response. We organize the feature vectors in the N\times p design matrix

X\stackrel{\text{def}}{=}(x_{1}\cdots x_{N})^{T}.

As in [10] we use a symmetric parametrization of the multinomial model. With h:\{1,\dots,K\}\times\mathbb{R}^{p}\to\mathbb{R} given by

h(l,\eta)\stackrel{\text{def}}{=}\frac{\exp(\eta_{l})}{\sum_{k=1}^{K}\exp(\eta_{k})},

the multinomial model is specified by

P(y_{i}=l\,|\,x_{i})=h(l,{\beta}^{(0)}+\beta x_{i}).

The model parameters are organized in the K-dimensional vector, {\beta}^{(0)}, of intercept parameters together with the K\times p matrix

\beta\stackrel{\text{def}}{=}\left({\beta}^{(1)}\cdots{\beta}^{(p)}\right), (2)

where {\beta}^{(i)}\in\mathbb{R}^{K} are the parameters associated with the i'th feature. The lack of identifiability in this parametrization is in practice dealt with by the penalization.

The log-likelihood is

\ell({\beta}^{(0)},\beta)=\sum_{i=1}^{N}\log h(y_{i},{\beta}^{(0)}+\beta x_{i}). (3)

Our interest is the sparse group lasso penalized maximum likelihood estimator. That is, ({\beta}^{(0)},\beta) is estimated as a minimizer of the sparse group lasso penalized negative log-likelihood:

-\ell({\beta}^{(0)},\beta)+\lambda\left((1-\alpha)\sum_{J=1}^{p}\gamma_{J}\left\|{\beta}^{(J)}\right\|_{2}+\alpha\sum_{i=1}^{Kp}\xi_{i}\left|\beta_{i}\right|\right). (4)

In our applications we let \gamma_{J}=\sqrt{K} for all J=1,\dots,p and \xi_{i}=1 for all i=1,\dots,Kp, but other choices are possible in the implementation. Note that the parameter grouping, as part of the penalty specification, is given in terms of the columns in (2), i.e. m=p.
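
As an illustration of the objective in (4), a small Python sketch of the penalized negative log-likelihood with \gamma_{J}=\sqrt{K} and \xi_{i}=1 could look as follows; this is our own sketch and not the C++ implementation described in Section 5.

```python
import numpy as np

def penalized_nll(beta0, beta, X, y, lam, alpha):
    """Sparse group lasso penalized negative log-likelihood (4).

    beta0 : intercepts, shape (K,)
    beta  : coefficient matrix, shape (K, p); column j is the block beta^(j)
    X     : design matrix, shape (N, p)
    y     : class labels in {0, ..., K-1}, shape (N,)
    """
    K, p = beta.shape
    eta = beta0 + X @ beta.T                       # linear predictors, shape (N, K)
    log_prob = eta - np.logaddexp.reduce(eta, axis=1, keepdims=True)
    nll = -np.sum(log_prob[np.arange(len(y)), y])  # minus the log-likelihood (3)
    group = np.sqrt(K) * np.sum(np.linalg.norm(beta, axis=0))  # gamma_J = sqrt(K)
    lasso = np.sum(np.abs(beta))                                # xi_i = 1
    return nll + lam * ((1 - alpha) * group + alpha * lasso)
```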

3 Data examples

Data set          Features                   K    N     p
Cancer sites      microRNA expressions       18   162   217
Amazon reviews    Various textual features   50   1500  10k
Muscle diseases   Gene expression            10   107   22k
Table 1: Summary of data sets and the associated classification problems.

The data sets were preprocessed before applying the multinomial sparse group lasso estimator. Two preprocessing schemes were used: normalization and standardization. Normalization entails centering and scaling of the samples in order to obtain a design matrix with row means of 0 and row variance of 1. Standardization involves centering and scaling the features to obtain a design matrix with column means of 0 and column variance of 1. Note that the order in which normalization and standardization are applied matters.

The purpose of normalization is to remove technical (non-biological) variation. A range of different normalization procedures exist for biological data. Sample centering and scaling is one of the simpler procedures. We use this simple normalization procedure for the two biological data sets in this paper. Normalization is done before and independent from the sparse group lasso algorithm.

The purpose of standardization is to create a common scale for features. This ensures that differences in scale will not influence the penalty and thus the variable selection. Standardization is an option for the sparse group lasso implementation, and it is applied as the last preprocessing step for all three example data sets.

We want to compare the performance of the multinomial sparse group lasso estimator for different values of the regularization parameter \alpha. Applying the multinomial sparse group lasso estimator with a given \alpha\in[0,1] and \lambda-sequence, \lambda_{1},\ldots,\lambda_{d}>0, results in a sequence of estimated models with parameters \{\hat{\beta}(\lambda_{i},\alpha)\}_{i=1,\dots,d}. The generalization error can be estimated by cross validation [11]. For our applications we keep the sample ratio between classes in the cross validation subsets approximately fixed to that of the entire data set. Hence, we may compute a sequence, \{\widehat{\operatorname*{Err}}(\lambda_{i},\alpha)\}_{i=1,\dots,d}, of estimated expected generalization errors for the sequence of models. However, for given \alpha_{1} and \alpha_{2} we cannot simply compare \widehat{\operatorname*{Err}}(\lambda_{i},\alpha_{1}) and \widehat{\operatorname*{Err}}(\lambda_{i},\alpha_{2}), since the \lambda_{i} value is scaled differently for different values of \alpha. We will instead compare models with the same number of non-zero parameters or the same number of non-zero parameter groups, respectively. Define

\hat{\Theta}(\lambda,\alpha)\stackrel{\text{def}}{=}\sum_{J=1}^{p}I({\hat{\beta}}^{(J)}(\lambda,\alpha)\neq 0)

with \hat{\beta}(\lambda,\alpha) the estimator of \beta for the given values of \lambda and \alpha. That is, \hat{\Theta}(\lambda,\alpha) is the number of non-zero parameter blocks in the fitted model. Note that there is a one-to-one correspondence between parameter blocks and features in the design matrix. Furthermore, we define the total number of non-zero parameters as

\hat{\Pi}(\lambda,\alpha)\stackrel{\text{def}}{=}\sum_{i=1}^{n}I(\hat{\beta}_{i}(\lambda,\alpha)\neq 0).

In particular, we want to compare the fitted models with the same number of parameter blocks. There may, however, be more than one \lambda-value corresponding to a given value of \hat{\Theta}. Thus we compare the models on a subsequence of the \lambda-sequence. With \theta_{1}<\dots<\theta_{d^{\prime}} for d^{\prime}\leq d denoting the different elements of the set \{\hat{\Theta}(\lambda_{i},\alpha)\}_{i=1,\ldots,d} in increasing order, we define

\tilde{\lambda}_{i}(\alpha)\stackrel{\text{def}}{=}\min\left\{\lambda\,\left|\,\hat{\Theta}(\lambda,\alpha)=\theta_{i}\right.\right\}.

We then compare the characteristics of the multinomial sparse group lasso estimators for different \alpha values by comparing the estimates

\left\{\left(\widehat{\operatorname*{Err}}(\tilde{\lambda}_{i}(\alpha),\alpha),\hat{\Theta}(\tilde{\lambda}_{i}(\alpha),\alpha),\hat{\Pi}(\tilde{\lambda}_{i}(\alpha),\alpha)\right)\right\}_{i=1,\dots,d^{\prime}}.
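
The quantities \hat{\Theta} and \hat{\Pi} are straightforward to compute from a fitted coefficient matrix; a small Python sketch (our own illustration, with a hypothetical tolerance argument) is:

```python
import numpy as np

def model_sparsity(beta_hat, tol=0.0):
    """Number of non-zero parameter blocks (features) and parameters.

    beta_hat : fitted coefficient matrix, shape (K, p);
               column J is the parameter block of feature J.
    """
    nonzero = np.abs(beta_hat) > tol
    n_blocks = int(np.sum(nonzero.any(axis=0)))   # \hat{Theta}
    n_params = int(np.sum(nonzero))               # \hat{Pi}
    return n_blocks, n_params
```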

3.1 Cancer sites

Figure 1: Estimated expected generalization error, for different values of \alpha, for the microRNA cancer sites data set. The cross validation based estimate of the expected misclassification error is plotted against the number of non-zero parameter blocks in the model (left), and against the number of non-zero parameters in the model (right). The estimated standard error is approximately 0.03 for all models.

The data set consists of bead-based expression data for 217 microRNAs from normal and cancer tissue samples. The samples are divided into 11 normal classes, 16 tumor classes and 8 tumor cell line classes. For the purpose of this study we select the normal and tumor classes with more than 5 samples. This results in an 18 class data set with 162 samples. The data set is unbalanced, with the number of samples in each class ranging from 5 to 26 and with an average of 9 samples per class. Data was normalized and then standardized before running the sparse group lasso algorithm. For more information about this data set see [7]. The data set is available from the Gene Expression Omnibus with accession number GSE2564.

Figure 1 shows the result of a 10-fold cross validation for 5 different values of \alpha, including the lasso and group lasso. The \lambda-sequence runs from \lambda_{\text{max}} to 10^{-4}, with d=200. It is evident that the group lasso and sparse group lasso models achieve a lower expected error using fewer genes than the lasso model. However, models with a low \alpha value have a larger number of non-zero parameters than models with a high \alpha value. A reasonable compromise could be the model with \alpha=0.25. This model not only has a low estimated expected error, it also achieves this error with a lower estimated number of non-zero parameters than the group lasso.

3.2 Amazon reviews

Figure 2: Estimated expected generalization error, for different values of \alpha, for the Amazon reviews author classification problem. The cross validation based estimate of expected misclassification error is plotted against the number of non-zero parameter blocks in the model (left), and against the number of non-zero parameters in the model (right). The estimated standard error is approximately 0.01 for all models.

The Amazon review data set consists of 10k textual features (including lexical, syntactic, idiosyncratic and content features) extracted from 1500 customer reviews from the Amazon Commerce Website. The reviews were collected among the reviews from 50 authors, with 50 reviews per author. The primary classification task is to identify the author based on the textual features. The data and feature set were presented in [8] and can be found in the UCI machine learning repository [12]. In [8] a Synergetic Neural Network is used for author classification, and a 10-fold CV accuracy of 0.805, based on 22k features, is reported. The feature selection and training of the classifier were done separately.

We did 10-fold cross validation using the multinomial sparse group lasso for five different values of \alpha. The results are shown in Figure 2. The \lambda-sequence runs from \lambda_{\text{max}} to 10^{-4}, with d=100. The design matrix is sparse for this data set. Our implementation of the multinomial sparse group lasso algorithm utilizes the sparse design matrix to gain speed and for memory efficiency. No normalization was applied for this data set. Features were scaled to have variance 1, but were not centered.

For this data set it is evident that the lasso performs badly and that the group lasso performs best, in fact much better than the lasso. The group lasso achieves an accuracy of around 0.82 with a feature set of size \sim 1k. This outperforms the neural network in [8].

3.3 Muscle diseases

Figure 3: Estimated expected generalization error, for different values of \alpha, for the muscle disease classification problem. The cross validation based estimate of expected misclassification error is plotted against the number of non-zero parameter blocks in the model (left), and against the number of non-zero parameters in the model (right). The estimated standard error is approximately 0.04 for all models.

This data set consists of messenger RNA array expression data of 119 muscle biopsies from patients with various muscle diseases. The samples are divided into 13 diagnostic groups. For this study we only consider classes with more than 5 samples. This results in a classification problem with 107 samples and 10 classes. The data set is unbalanced with class sizes ranging from 4 to 20 samples per class. Data was normalized and then standardized before running the sparse group lasso algorithm. For background information on this data set, see [9]. The data set is available from the Gene Expression Omnibus with accession number GDS1956.

The results of a 10-fold cross validation are shown in Figure 3. The \lambda-sequence runs from \lambda_{\text{max}} to 10^{-5}, with d=200. We see the same trend as in the other two data examples. Again the group lasso models perform well, although not significantly better than the closest sparse group lasso models (\alpha=0.25). The lasso models perform reasonably well on this data set, but they are still outperformed by the sparse group lasso models.

4 A simulation study

In this section we investigate the characteristics of the sparse group lasso estimator on simulated data sets. We are primarily interested in trends in the generalization error as \alpha is varied and \hat{\lambda} is selected by cross validation on a relatively small training set. We suspect that this trend will depend on the distribution of the data. We restrict our attention to multiclass data where the distribution of the features given the class is Gaussian. Loosely speaking, we suspect that if the differences in the data distributions are very sparse, i.e. the centers of the Gaussian distributions are mostly identical across classes, the lasso will produce models with the lowest generalization error. If the data distribution is sparse, but not very sparse, then the optimal \alpha is in the interval (0,1). For a dense distribution, typically with differences among all classes, we expect the group lasso to perform best. The simulation study confirms this.

The mathematical formulation is as follows. Let

\mu=(\mu_{1}\cdots\mu_{K})

where \mu_{i}\in\mathbb{R}^{p} for i=1,\dots,K and p=p_{a}+p_{b}. Denote by \mathcal{D}_{\mu} a data set consisting of N samples from each of the K classes, each sampled from the Gaussian distribution with centers \mu_{1},\dots,\mu_{K}, respectively, and with a common covariance matrix \Sigma. Let \hat{\lambda} be the smallest \lambda-value with the minimal estimated expected generalization error, as determined by cross validation on \mathcal{D}_{\mu}. Denote by \operatorname*{Err}_{\mu}(\lambda,\alpha) the generalization error of the model \hat{\beta}(\lambda,\alpha) that has been estimated from the training set \mathcal{D}_{\mu}, by the sparse group lasso, for the given values of \lambda and \alpha. Then let

Z_{\mu}(\alpha)=\operatorname*{Err}_{\mu}(\hat{\lambda},\alpha)-\operatorname*{Err}_{\text{Bayes}}(\mu)

where \operatorname*{Err}_{\text{Bayes}}(\mu) is the Bayes rate. We are interested in trends in Z_{\mu}, as a function of \alpha, for different configurations of \mu_{1},\dots,\mu_{K}. To be specific, we will sample \mu_{1},\dots,\mu_{K} from one of the following distributions:

  • 1. A sparse model distribution, where the first p_{a} entries of \mu_{i} are i.i.d. with a distribution that is a mixture of the uniform distribution on [-2,2] and the degenerate distribution at 0 with point probability p_{0}.

  • 2. A dense model distribution, where the first p_{a} entries of \mu_{i} are i.i.d. Laplace distributed with location 0 and scale b.

The last p_{b} entries are zero. We take p_{a}=\lfloor 5/(1-p_{0})\rfloor throughout for the sparse model distribution. The within class covariance matrix \Sigma is constructed using features from the cancer site data set. Let \Sigma_{0} be the empirical covariance matrix of p randomly chosen features. To prevent the covariance matrix from becoming singular we take

\Sigma=(1-\delta)\Sigma_{0}+\delta\mathbf{I}

for \delta\in(0,1).

The primary quantity of interest is

\operatorname*{err}(\alpha)\stackrel{\text{def}}{=}\operatorname*{E}\left(Z_{\mu}(\alpha)\right), (5)

the expectation being over \mu and the data set \mathcal{D}_{\mu}. We are also interested in how well we can estimate the non-zero patterns of the \mu_{i}'s. Consider this as Kp two class classification problems, one for each parameter, where we predict \mu_{ij} to be non-zero if \hat{\beta}_{ij} is non-zero, and \mu_{ij} to be zero otherwise. We calculate the number of false positives, true positives, false negatives and true negatives. The positive predictive value (ppv) and the true positive rate (tpr) are of particular interest. The true positive rate measures how sensitive a given method is at discovering non-zero entries. The positive predictive value measures the precision with which the method is selecting the non-zero entries. We consider the following two quantities

\operatorname*{tpr}(\alpha)\stackrel{\text{def}}{=}\operatorname*{E}\left[\operatorname*{tpr}\left(\hat{\beta}(\hat{\lambda},\alpha)\right)\right]\text{ and }\operatorname*{ppv}(\alpha)\stackrel{\text{def}}{=}\operatorname*{E}\left[\operatorname*{ppv}\left(\hat{\beta}(\hat{\lambda},\alpha)\right)\right]. (6)
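
For a single configuration, the true positive rate and positive predictive value can be computed from the estimated and true non-zero patterns; a small Python sketch (our own illustration, with hypothetical argument names):

```python
import numpy as np

def tpr_ppv(beta_hat, mu, tol=0.0):
    """True positive rate and positive predictive value for recovering
    the non-zero pattern of mu from beta_hat (both of shape (K, p))."""
    pred = np.abs(beta_hat) > tol     # predicted non-zero entries
    true = np.abs(mu) > tol           # actual non-zero entries
    tp = np.sum(pred & true)
    fp = np.sum(pred & ~true)
    fn = np.sum(~pred & true)
    tpr = tp / (tp + fn) if tp + fn > 0 else np.nan
    ppv = tp / (tp + fp) if tp + fp > 0 else np.nan
    return tpr, ppv
```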
Figure 4: The estimated expected error gap (solid black line) for the three configurations. The central 95% of the distribution of Z_{\mu}(\alpha) is shown as the shaded area on the plot. The error gap for 5 randomly selected \mu-configurations is shown (red dashed lines).

In order to estimate the quantities (5) and (6) we sample M configurations of \mu from one of the above distributions. For each configuration we sample a training and a test data set of sizes NK and 100K, respectively. Using the training data set we fit the model \hat{\beta}(\hat{\lambda},\alpha) and estimate Z_{\mu}(\alpha) using the test data set. The estimates \widehat{\operatorname*{err}}(\alpha), \widehat{\operatorname*{tpr}}(\alpha) and \widehat{\operatorname*{ppv}}(\alpha) are the corresponding averages over the M configurations.

For this study we chose M=100, N=15, K=25, p_{b}=50, \delta=0.25 and the following three configuration distributions:

  • 1. Thin configurations, where the centers are distributed according to the sparse model distribution with p_{0}=0.95, as defined above.

  • 2. Sparse configurations, where the centers are distributed according to the sparse model distribution with p_{0}=0.80.

  • 3. Dense configurations, where the centers are distributed according to the dense model distribution with scale b=0.2 and p_{a}=25.

In Figure 4 we see that for thin configurations the lasso has the lowest estimated error gap, along with the sparse group lasso with \alpha=0.8. For the sparse configurations the results indicate that the optimal choice of \alpha is in the open interval (0,1), but in this case all choices of \alpha result in a comparable error gap. For the dense configurations the group lasso is among the methods with the lowest error gap.

Figure 5: The estimated expected true positive rate (solid black line) for the three configurations. The central 95% of the distribution of \operatorname*{tpr} is shown as the shaded area on the plot. The true positive rate for 5 randomly selected \mu-configurations is shown (red dashed lines).

Figure 6: The estimated expected positive predictive value (solid black line) for the three configurations. The central 95% of the distribution of \operatorname*{ppv} is shown as the shaded area on the plot. The positive predictive value for 5 randomly selected \mu-configurations is shown (red dashed lines).

In Figure 5 we plotted the true positive rate for the three configurations. Except for the thin configurations, the lasso is markedly less sensitive than the sparse group and group lasso methods. However, looking at Figure 6 we see that the sparse group and group lasso methods have a lower precision than the lasso, except for the dense configurations. We note that the group lasso has the worst precision, except for the dense configurations.

Part II The sparse group lasso algorithm

In this section we present the sparse group lasso algorithm. The algorithm is applicable to a broad class of loss functions. Specifically, we require that the loss function f:\mathbb{R}^{n}\to\mathbb{R} is convex, twice continuously differentiable and bounded below. Additionally, we require that all quadratic approximations around a point in the sublevel set

\left\{\beta\in\mathbb{R}^{n}\,\left|\,f(\beta)+\lambda\Phi(\beta)\leq f(\beta_{0})+\lambda\Phi(\beta_{0})\right.\right\}

are bounded below, where \beta_{0}\in\mathbb{R}^{n} is the initial point. The last requirement ensures that all subproblems are well defined.

The algorithm solves (1) for a decreasing sequence of \lambda values ranging from \lambda_{\text{max}} to a user specified \lambda_{\text{min}}. The algorithm consists of four nested main loops:

  • 1. A numerical continuation loop, decreasing \lambda.

  • 2. An outer coordinate gradient descent loop (Algorithm 1).

  • 3. A middle block coordinate descent loop (Algorithm 2).

  • 4. An inner modified coordinate descent loop (Algorithm 3).

In Sections 4.3 to 4.5 we discuss the outer, middle and inner loop, respectively. In Section 4.6 we develop a method that allows us to bypass the computation of large parts of the Hessian, thereby improving the performance of the middle loop. Section 5 provides a discussion of the available software solutions, as well as the run-time performance of the current implementation.

The theoretical basis of the optimization methods we apply can be found in [6, 5]. A short review tailored to this paper is given in Appendix A.

4.1 The sparse group lasso penalty

In this section we derive fundamental results regarding the sparse group lasso penalty.

We first observe that \Phi is separable in the sense that if, for any group J=1,\dots,m, we define the convex penalty {\Phi}^{(J)}:\mathbb{R}^{n_{J}}\to\mathbb{R} by

{\Phi}^{(J)}(\hat{x})\stackrel{\text{def}}{=}(1-\alpha)\gamma_{J}\left\|\hat{x}\right\|_{2}+\alpha\sum_{i=1}^{n_{J}}{\xi}^{(J)}_{i}\left|\hat{x}_{i}\right|

then \Phi(\beta)=\sum_{J=1}^{m}{\Phi}^{(J)}({\beta}^{(J)}). Separability of the penalty is required to ensure convergence of coordinate descent methods, see [5, 6] and Appendix A.

In a block coordinate descent scheme the primary minimization problem is solved by minimizing each block, one at a time, until convergence. We consider conditions ensuring that

0\in\underset{x\in\mathbb{R}^{n_{J}}}{\arg\min}\,g(x)+\lambda{\Phi}^{(J)}(x) (7)

for a given convex and twice continuously differentiable function g:\mathbb{R}^{n_{J}}\to\mathbb{R}. For J=1,\dots,m a straightforward calculation shows that the subgradient of {\Phi}^{(J)} at zero is

\partial{\Phi}^{(J)}(0)=(1-\alpha)\gamma_{J}B^{n_{J}}+\alpha\operatorname*{diag}({\xi}^{(J)})T^{n_{J}}

where B^{n}\stackrel{\text{def}}{=}\left\{x\in\mathbb{R}^{n}\,\left|\,\left\|x\right\|_{2}\leq 1\right.\right\}, T^{n}\stackrel{\text{def}}{=}[-1,1]^{n}, and where, for x\in\mathbb{R}^{n}, \operatorname*{diag}(x) denotes the n\times n diagonal matrix with diagonal x. For an introduction to the theory of subgradients see Chapter 4 in [13].

Proposition 1 below gives a necessary and sufficient condition for (7) to hold. Before we state the proposition the following definition is needed.

Definition 2.

For n\in\mathbb{N} we define the map \kappa:\mathbb{R}^{n}\times\mathbb{R}^{n}\to\mathbb{R}^{n} by

\kappa(v,z)_{i}\stackrel{\text{def}}{=}\begin{cases}0&\left|z_{i}\right|\leq v_{i}\\ z_{i}-\operatorname*{sgn}(z_{i})v_{i}&\text{otherwise}\end{cases}\quad\text{for }i=1,\dots,n

and the function K:\mathbb{R}^{n}\times\mathbb{R}^{n}\to\mathbb{R} by

K(v,z)\stackrel{\text{def}}{=}\left\|\kappa(v,z)\right\|_{2}^{2}=\sum_{\left\{i\,\left|\,\left|z_{i}\right|>v_{i}\right.\right\}}\left(z_{i}-\operatorname*{sgn}(z_{i})v_{i}\right)^{2}.
Proposition 1.

Assume given a>0, v,z\in\mathbb{R}^{n} and define the closed sets

Y=z+\operatorname*{diag}(v)T^{n}\quad\text{and}\quad X=aB^{n}+Y.

Then the following hold:

  • a. \kappa(v,z)=\underset{y\in Y}{\arg\min}\,\left\|y\right\|_{2}.

  • b. 0\in X if and only if K(v,z)\leq a^{2}.

  • c. If K(v,z)>a^{2} then \underset{x\in X}{\arg\min}\,\left\|x\right\|_{2}=\left(1-a/\sqrt{K(v,z)}\right)\kappa(v,z).

The proof of Proposition 1 is given in Appendix C. Proposition 1 implies that (7) holds if and only if

\sqrt{K(\lambda\alpha{\xi}^{(J)},\nabla g(0))}\leq\lambda(1-\alpha)\gamma_{J}.
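
For illustration, the map \kappa, the function K and the zero-block criterion above translate directly into code; the following minimal Python sketch is our own and not part of the sgl library:

```python
import numpy as np

def kappa(v, z):
    """kappa(v, z): coordinatewise soft-thresholding of z by v (Definition 2)."""
    return np.sign(z) * np.maximum(np.abs(z) - v, 0.0)

def K(v, z):
    """K(v, z) = ||kappa(v, z)||_2^2."""
    return np.sum(kappa(v, z) ** 2)

def block_is_zero(grad_J, lam, alpha, gamma_J, xi_J):
    """Check sqrt(K(lam*alpha*xi_J, grad_J)) <= lam*(1-alpha)*gamma_J,
    i.e. whether the minimum over block J is attained at zero."""
    return np.sqrt(K(lam * alpha * xi_J, grad_J)) <= lam * (1 - alpha) * gamma_J
```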

The following observations will prove to be valuable. Note that we use \preceq to denote coordinatewise ordering.

Lemma 1.

For any three vectors v,z,z^{\prime}\in\mathbb{R}^{n} the following hold:

  • a. K(v,z)=K(v,\left|z\right|).

  • b. K(v,z)\leq K(v,z^{\prime}) when \left|z\right|\preceq\left|z^{\prime}\right|.

Proof.

(a) is a simple calculation and (b) is a consequence of the definition and (a). ∎

4.2 The \lambda-sequence

For sufficiently large \lambda values the only solution to (1) is zero. We denote the infimum of these \lambda values by \lambda_{\text{max}}. By the above observations it is clear that

\lambda_{\text{max}}\stackrel{\text{def}}{=}\inf\left\{\lambda>0\,\left|\,\hat{\beta}(\lambda)=0\right.\right\}
=\inf\left\{\lambda>0\,\left|\,\forall J=1,\dots,m:\sqrt{K(\lambda\alpha{\xi}^{(J)},{\nabla f(0)}^{(J)})}\leq\lambda(1-\alpha)\gamma_{J}\right.\right\}
=\max_{J=1,\dots,m}\inf\left\{\lambda>0\,\left|\,\sqrt{K(\lambda\alpha{\xi}^{(J)},{\nabla f(0)}^{(J)})}\leq\lambda(1-\alpha)\gamma_{J}\right.\right\}.

It is possible to compute

\inf\left\{\lambda>0\,\left|\,\sqrt{K(\lambda\alpha{\xi}^{(J)},{\nabla f(0)}^{(J)})}\leq\lambda(1-\alpha)\gamma_{J}\right.\right\}

by using the fact that the function \lambda\to K(\lambda\alpha{\xi}^{(J)},{\nabla f(0)}^{(J)}) is piecewise quadratic and monotone.
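
As an illustration, the per-block infimum can also be found numerically by simple bisection, using the monotonicity noted above; the sketch below is our own and does not reproduce the piecewise quadratic computation used in the implementation:

```python
import numpy as np

def lambda_max_block(grad0_J, alpha, gamma_J, xi_J, tol=1e-10):
    """Smallest lambda for which block J of the solution is zero (Section 4.2),
    found by bisection; grad0_J is nabla f(0) restricted to block J."""
    def K(v, z):                            # K(v, z) of Definition 2
        return np.sum(np.maximum(np.abs(z) - v, 0.0) ** 2)

    def zero_at(lam):
        return np.sqrt(K(lam * alpha * xi_J, grad0_J)) <= lam * (1 - alpha) * gamma_J

    hi = 1.0
    while not zero_at(hi):                  # crude initial upper bound
        hi *= 2.0
    lo = 0.0
    while hi - lo > tol * max(hi, 1.0):
        mid = 0.5 * (lo + hi)
        lo, hi = (lo, mid) if zero_at(mid) else (mid, hi)
    return hi

# lambda_max is then the maximum of lambda_max_block over all blocks J.
```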

4.3 Outer loop

In the outer loop a coordinate gradient descent scheme is used. In this paper we use the simplest form of this scheme. In this simple form the coordinate gradient descent method is similar to Newton’s method; however the important difference is the way the non-differentiable penalty is handled. The convergence of the coordinate gradient descent method is not trivial and is established in [5].

The algorithm is based on a quadratic approximation of the loss function f, at the current estimate of the minimizer. The difference, \Delta, between the minimizer of the penalized quadratic approximation and the current estimate is then a descent direction. A new estimate of the minimizer of the objective is found by applying a line search in the direction of \Delta. We repeat this until a stopping condition is met, see Algorithm 1. Note that a line search is necessary in order to ensure global convergence. For most iterations, however, a step size of 1 will give a sufficient decrease in the objective. With q=\nabla f(\beta) and H=\nabla^{2}f(\beta) the quadratic approximation of f around the current estimate, \beta, is

q^{T}(x-\beta)+\frac{1}{2}(x-\beta)^{T}H(x-\beta)=q^{T}x-q^{T}\beta+\frac{1}{2}x^{T}Hx-\frac{1}{2}\left(\beta^{T}Hx+x^{T}H\beta\right)+\frac{1}{2}\beta^{T}H\beta.

Since H is symmetric, it follows that the quadratic approximation of f around \beta equals

Q(x)-q^{T}\beta+\frac{1}{2}\beta^{T}H\beta,

where Q:\mathbb{R}^{n}\to\mathbb{R} is defined by

Q(x)\stackrel{\text{def}}{=}(q-H\beta)^{T}x+\frac{1}{2}x^{T}Hx.

We have thus reduced problem (1) to the following penalized quadratic optimization problem

\underset{x\in\mathbb{R}^{n}}{\min}\,Q(x)+\lambda\Phi(x). (8)
Algorithm 1 Outer loop. Solve (1) by coordinate gradient descent.
Require: \beta=\beta_{0}
repeat
  Let q=\nabla f(\beta), H=\nabla^{2}f(\beta) and Q(x)=(q-H\beta)^{T}x+\frac{1}{2}x^{T}Hx.
  Compute \hat{\beta}=\underset{x\in\mathbb{R}^{n}}{\arg\min}\,Q(x)+\lambda\Phi(x).
  Compute step size t and set \beta=\beta+t\Delta, where \Delta=\hat{\beta}-\beta.
until stopping condition is met.

The convergence of Algorithm 1 follows from Theorem 1e in [5], which implies:

Proposition 2.

Every cluster point of the sequence \{\beta_{k}\}_{k\in\mathbb{N}} generated by Algorithm 1 is a solution of problem (1).

Remark 1.

The convergence of Algorithm 1 is ensured even if H is a (symmetric) positive definite matrix approximating \nabla^{2}f(\beta). For high dimensional problems it might be computationally beneficial to take H to be diagonal, e.g. as the diagonal of \nabla^{2}f(\beta).

4.4 Middle loop

In the middle loop the penalized quadratic optimization problem (8) is solved. The penalty \Phi is block separable, i.e.

Q(x)+\lambda\Phi(x)=Q(x)+\lambda\sum_{J=1}^{m}{\Phi}^{(J)}({x}^{(J)})

with {\Phi}^{(J)} convex, and we can therefore use the block coordinate descent method over the blocks {x}^{(1)},\dots,{x}^{(m)}. The block coordinate descent method will converge to a minimizer even for non-differentiable objectives if the non-differentiable parts are block separable, see [6]. Since \Phi is separable and Q is convex, twice continuously differentiable and bounded below, the block coordinate descent scheme converges to the minimizer of problem (8). Hence, our problem is reduced to the following collection of problems, one for each J=1,\dots,m,

\underset{\hat{x}\in\mathbb{R}^{n_{J}}}{\min}\,{Q}^{(J)}(\hat{x})+\lambda{\Phi}^{(J)}(\hat{x}) (9)

where {Q}^{(J)}:\mathbb{R}^{n_{J}}\to\mathbb{R} is the quadratic function

\hat{x}\to Q({x}^{(1)},\dots,{x}^{(J-1)},\hat{x},{x}^{(J+1)},\dots,{x}^{(m)})

up to an additive constant. We decompose an n\times n matrix H into block matrices in the following way

H=\begin{pmatrix}H_{11}&H_{12}&\cdots&H_{1m}\\ H_{21}&H_{22}&\cdots&H_{2m}\\ \vdots&\vdots&\ddots&\vdots\\ H_{m1}&H_{m2}&\cdots&H_{mm}\end{pmatrix}

where H_{IJ} is an n_{I}\times n_{J} matrix. By the symmetry of H it follows that

{Q}^{(J)}(\hat{x})=\hat{x}^{T}{(q-H\beta)}^{(J)}+\frac{1}{2}\left(2\sum_{I}\hat{x}^{T}H_{JI}{x}^{(I)}-\hat{x}^{T}H_{JJ}{x}^{(J)}+\hat{x}^{T}H_{JJ}\hat{x}\right)
=\hat{x}^{T}\left({q}^{(J)}+{[H(x-\beta)]}^{(J)}-H_{JJ}{x}^{(J)}\right)+\frac{1}{2}\hat{x}^{T}H_{JJ}\hat{x}

up to an additive constant. We may, therefore, redefine

{Q}^{(J)}(\hat{x})\stackrel{\text{def}}{=}\hat{x}^{T}{g}^{(J)}+\frac{1}{2}\hat{x}^{T}H_{JJ}\hat{x}

where the block gradient {g}^{(J)} is defined by

{g}^{(J)}\stackrel{\text{def}}{=}{q}^{(J)}+{\left[H(x-\beta)\right]}^{(J)}-H_{JJ}{x}^{(J)}. (10)

In practice a considerable fraction of the minimizers of the problems (9) will be zero. By Proposition 1 this is the case if and only if

\sqrt{K(\lambda\alpha{\xi}^{(J)},{g}^{(J)})}\leq\lambda(1-\alpha)\gamma_{J}.

These considerations lead us to Algorithm 2.

Algorithm 2 Middle loop. Solve (8) by block coordinate descent.
repeat
  Choose next block index J according to the cyclic rule.
  Compute the block gradient {g}^{(J)}.
  if \sqrt{K(\lambda\alpha{\xi}^{(J)},{g}^{(J)})}\leq\lambda(1-\alpha)\gamma_{J} then
   Let {x}^{(J)}=0.
  else
   Let {x}^{(J)}=\underset{\hat{x}\in\mathbb{R}^{n_{J}}}{\arg\min}\,{Q}^{(J)}(\hat{x})+\lambda{\Phi}^{(J)}(\hat{x}).
  end if
until stopping condition is met.

4.5 Inner loop

Finally we need to determine the minimizer of (9), i.e. the minimizer of

\underbrace{{Q}^{(J)}(\hat{x})+\lambda(1-\alpha)\gamma_{J}\left\|\hat{x}\right\|_{2}}_{\text{loss}}+\underbrace{\lambda\alpha\sum_{i=1}^{n_{J}}{\xi}^{(J)}_{i}\left|\hat{x}_{i}\right|}_{\text{penalty}}. (11)

The first two terms of (11) are considered the loss function and the last term the penalty. Note that the loss is not differentiable at zero (due to the L_{2}-norm), thus we cannot completely separate out the non-differentiable parts. This implies that ordinary block coordinate descent is not guaranteed to converge to a minimizer. Algorithm 3 adjusts for this problem, and we have the following proposition.

Proposition 3.

For any \epsilon>0 the cluster points of the sequence \{\hat{x}_{k}\}_{k\in\mathbb{N}} generated by Algorithm 3 are minimizers of (11).

Proof.

Since {Q}^{(J)}(0)+\lambda{\Phi}^{(J)}(0)=0, Algorithm 3 is a modified block coordinate descent scheme. Furthermore, J is chosen such that (11) is not optimal at 0. We can therefore apply Lemma 4 in Appendix B, from which the claim follows directly. ∎

Hence, for a given block J=1,\dots,m we need to solve the following two problems:

  • I. For each j=1,\dots,n_{J}, compute a minimizer of the function

    \mathbb{R}\ni\hat{x}\to{Q}^{(J)}({x}^{(J)}_{1},\dots,{x}^{(J)}_{j-1},\hat{x},{x}^{(J)}_{j+1},\dots,{x}^{(J)}_{n_{J}})+\lambda{\Phi}^{(J)}({x}^{(J)}_{1},\dots,{x}^{(J)}_{j-1},\hat{x},{x}^{(J)}_{j+1},\dots,{x}^{(J)}_{n_{J}}).

  • II. Compute a descent direction at zero for (11).

Regarding I

Writing out the equation we see that in the j'th iteration we need to find the minimizer of the function \omega:\mathbb{R}\to\mathbb{R} given by

\omega(\hat{x})\stackrel{\text{def}}{=}c\hat{x}+\frac{1}{2}h\hat{x}^{2}+\gamma\sqrt{\hat{x}^{2}+r}+\xi\left|\hat{x}\right| (12)

with c={g}^{(J)}_{j}+\sum_{i\neq j}(H_{JJ})_{ji}x_{i}, \gamma=\lambda(1-\alpha)\gamma_{J}, \xi=\lambda\alpha{\xi}^{(J)}_{j}, r=\sum_{i\neq j}x_{i}^{2}, and where h is the j'th diagonal element of the Hessian block H_{JJ}.

By convexity of f we conclude that h\geq 0. Lemma 2 below deals with the case h>0. Since the quadratic approximation Q is bounded below, the case h=0 implies that c=0, hence for h=0 we have \hat{x}=0.

Lemma 2.

If h>0 then the minimizer \hat{x} of \omega is given as follows:

  • a. If r=0 or \gamma=0 then

    \hat{x}=\begin{cases}\frac{\xi+\gamma-c}{h}&\text{if }c>\xi+\gamma\\ 0&\text{if }\left|c\right|\leq\xi+\gamma\\ \frac{-\xi-\gamma-c}{h}&\text{if }c<-\xi-\gamma\end{cases}

  • b. If r>0,\gamma>0 then \hat{x}=0 if \left|c\right|\leq\xi, and otherwise \hat{x} is the solution to

    c+\operatorname*{sgn}(\xi-c)\xi+h\hat{x}+\gamma\frac{\hat{x}}{\sqrt{\hat{x}^{2}+r}}=0.
Proof.

Simple calculations will show the results. ∎

For case (b) in the above lemma we solve the equation by applying a standard root finding method.
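
For illustration, the one-dimensional update of Lemma 2 could be implemented as follows, using a standard root finder (Brent's method from SciPy) for case (b); this is our own sketch, not the C++ implementation:

```python
import numpy as np
from scipy.optimize import brentq

def omega_minimizer(c, h, gamma, xi, r):
    """Minimizer of omega(x) = c*x + h*x^2/2 + gamma*sqrt(x^2 + r) + xi*|x|
    for h > 0 (Lemma 2)."""
    if r == 0.0 or gamma == 0.0:            # case (a): soft thresholding
        t = gamma + xi
        if c > t:
            return (t - c) / h
        if c < -t:
            return (-t - c) / h
        return 0.0
    if abs(c) <= xi:                        # case (b), minimum at zero
        return 0.0
    # case (b): solve c + sgn(xi - c)*xi + h*x + gamma*x/sqrt(x^2 + r) = 0
    def deriv(x):
        return c + np.sign(xi - c) * xi + h * x + gamma * x / np.sqrt(x * x + r)
    bound = (abs(c) + xi + gamma) / h       # the minimizer lies within this bound
    if c > xi:
        return brentq(deriv, -bound, 0.0)
    return brentq(deriv, 0.0, bound)
```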

Regarding II

For a convex function f:\mathbb{R}^{n}\to\mathbb{R} and a point x\in\mathbb{R}^{n}, the vector

\Delta=-\underset{\hat{x}\in\partial f(x)}{\arg\min}\,\left\|\hat{x}\right\|_{2}

is a descent direction at x provided f is not optimal at x, see [13], Section 8.4. We may use this fact to compute a descent direction at zero for the function (11). By Proposition 1 it follows that \Delta\in\mathbb{R}^{n_{J}} defined by

\Delta_{i}\stackrel{\text{def}}{=}-\begin{cases}0&\left|{g}^{(J)}_{i}\right|\leq\lambda\alpha{\xi}^{(J)}_{i}\\ {g}^{(J)}_{i}-\lambda\alpha{\xi}^{(J)}_{i}\operatorname*{sgn}({g}^{(J)}_{i})&\text{otherwise}\end{cases}

is a descent direction at zero for the function (11).

Algorithm 3 Inner loop. Compute the minimizer of (11) by a modified coordinate descent scheme.
repeat
  Choose next parameter index j according to the cyclic rule.
  Compute
{x}^{(J)}_{j}=\underset{\hat{x}\in\mathbb{R}}{\arg\min}\,{Q}^{(J)}({x}^{(J)}_{1},\dots,{x}^{(J)}_{j-1},\hat{x},{x}^{(J)}_{j+1},\dots,{x}^{(J)}_{n_{J}})+\lambda{\Phi}^{(J)}({x}^{(J)}_{1},\dots,{x}^{(J)}_{j-1},\hat{x},{x}^{(J)}_{j+1},\dots,{x}^{(J)}_{n_{J}}).
  if \left\|{x}^{(J)}\right\|_{2}<\epsilon and {Q}^{(J)}({x}^{(J)})+\lambda{\Phi}^{(J)}({x}^{(J)})\geq 0 then
   Compute a descent direction, \Delta, at zero for (11).
   Use line search to find t such that {Q}^{(J)}(t\Delta)+\lambda{\Phi}^{(J)}(t\Delta)<0.
   Let {x}^{(J)}=t\Delta.
  end if
until stopping condition is met.

4.6 Hessian upper bound optimization

In this section we present a way of reducing the number of blocks for which the block gradient needs to be computed. The aim is to reduce the computational costs of the algorithm.

In the middle loop, Algorithm 2, the block gradient (10) is computed for all m blocks. To determine whether a block is zero it is, in fact, sufficient to compute an upper bound on the block gradient. Since the gradient, q, is already computed we focus on the term involving the Hessian. That is, for J=1,\dots,m, we compute a b_{J}\in\mathbb{R} such that

\left|{\left[H(x-\beta)\right]}^{(J)}\right|\preceq b_{J}D_{n_{J}}

where D_{n}\stackrel{\text{def}}{=}(1,1,\dots,1)\in\mathbb{R}^{n}. We define

t_{J}\stackrel{\text{def}}{=}\sup\left\{x\geq 0\,\left|\,\sqrt{K_{J}(\lambda\alpha{\xi}^{(J)},\left|{q}^{(J)}\right|+xD_{n_{J}})}\leq\lambda(1-\alpha)\gamma_{J}\right.\right\}

when \sqrt{K_{J}(\lambda\alpha{\xi}^{(J)},\left|{q}^{(J)}\right|)}\leq\lambda(1-\alpha)\gamma_{J} and otherwise let t_{J}=0. When b_{J}<t_{J} it follows by Lemma 1 that

K_{J}(\lambda\alpha{\xi}^{(J)},{g}^{(J)})=K_{J}(\lambda\alpha{\xi}^{(J)},\left|{g}^{(J)}\right|)\leq K_{J}(\lambda\alpha{\xi}^{(J)},\left|{q}^{(J)}\right|+b_{J}D_{n_{J}})\leq\lambda^{2}(1-\alpha)^{2}\gamma_{J}^{2}

and by Proposition 1 this implies that block J is zero. The above considerations lead us to Algorithm 4. Note that it is possible to compute the t_{J}'s by using the fact that the function

\mathbb{R}\ni x\to K_{J}(\lambda\alpha{\xi}^{(J)},\left|{q}^{(J)}\right|+xD_{n_{J}})

is monotone and piecewise quadratic.

Algorithm 4 Middle loop with Hessian bound optimization.
repeat
  Choose next block index J according to the cyclic rule.
  Compute the upper bound b_{J}.
  if b_{J}<t_{J} then
   Let {x}^{(J)}=0.
  else
   Compute {g}^{(J)} and a new {x}^{(J)} (see Algorithm 2).
  end if
until stopping condition is met.

In Algorithm 4 it is only necessary to compute the block gradient for those blocks where {x}^{(J)}\neq 0 or where b_{J}\geq t_{J}. This is only beneficial if we can efficiently compute a sufficiently good bound b_{J}. For a broad class of loss functions this can be done using the Cauchy-Schwarz inequality.
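
As an illustration of the principle, and not of the specific bound used in our implementation, the Cauchy-Schwarz inequality gives \left|[H(x-\beta)]_{i}\right|\leq\left\|H_{i\cdot}\right\|_{2}\left\|x-\beta\right\|_{2}, so any cheaply available bound on the row norms of the Hessian within block J yields a valid b_{J}; a minimal sketch with hypothetical names:

```python
import numpy as np

def hessian_bounds(row_norm_bounds, x, beta):
    """Upper bounds b_J with |[H(x - beta)]^{(J)}| <= b_J * (1, ..., 1).

    row_norm_bounds : array of length m; entry J bounds ||H_{i.}||_2 for every
                      row i belonging to block J (assumed to be cheap to obtain
                      for the loss function at hand).
    """
    return np.asarray(row_norm_bounds) * np.linalg.norm(x - beta)

# Block J is set to zero without computing its block gradient whenever
# hessian_bounds(...)[J] < t_J (see Algorithm 4).
```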

To assess the performance of the Hessian bound scheme we used our multinomial sparse group lasso implementation with and without bound optimization (and with \alpha=0.5). Table 2 lists the ratio of the run-time without bound optimization to the run-time with bound optimization, on the three different data sets. The Hessian bound scheme decreases the run-time for the multinomial loss function, and the ratio increases with the number of blocks m in the data set. The same trend can be seen for other values of \alpha.

Data set   n      m     Ratio
Cancer     3.9k   217   1.14
Amazon     500k   10k   1.76
Muscle     220k   22k   2.47
Table 2: Timing of the Hessian bound optimization scheme.

5 Software

We provide two software solutions in relation to the current paper: an R package, msgl, with a relatively simple interface to our multinomial and logistic sparse group lasso regression routines, and a C++ template library, sgl. The sgl template library gives access to the generic sparse group lasso routines; the R package relies on this library. The sgl template library relies on several external libraries. We use the Armadillo C++ library [14] as our primary linear algebra engine. Armadillo is a C++ template library using expression template techniques to optimize the performance of matrix expressions. Furthermore, we utilize several Boost libraries [15]. Boost is a collection of free, peer-reviewed C++ libraries, many of which are template libraries. For an introduction to these libraries see, for example, [16]. Use of multiple processors for cross validation and subsampling is supported through OpenMP [17].

The msgl R package is available from CRAN. The sgl library is available upon request.

5.1 Run-time performance

Table 3 lists run-times of the current multinomial sparse group lasso implementation for the three real data examples. For comparison, glmnet uses 5.2s, 8.3s and 137.0s, respectively, to fit the lasso path for the three data sets in Table 3. glmnet is a fast implementation of the coordinate descent algorithm for fitting generalized linear models with the lasso or elastic net penalty [10]. glmnet cannot be used to fit models with the group lasso or sparse group lasso penalty.

Data set   n      m     Lasso    Sparse group lasso     Group lasso
                                 α=0.75      α=0.25
Cancer     3.9k   217   5.9s     4.8s        6.3s       6.0s
Muscle     220k   22k   25.0s    25.8s       37.7s      36.7s
Amazon     500k   10k   331.6s   246.7s      480.4s     285.1s
Table 3: Times for computing the multinomial sparse group lasso regression solutions for a lambda sequence of length 100, on a 2.20 GHz Intel Core i7 processor (using one thread). In all cases the sequence runs from \lambda_{\max} to 0.002. The numbers of samples in the Cancer, Muscle and Amazon data sets are 162, 107 and 1500, respectively. See also Table 1 and the discussions in Sections 3.1, 3.3 and 3.2, respectively.

6 Conclusion

We developed an algorithm for solving the sparse group lasso optimization problem with a general convex loss function. Furthermore, convergence of the algorithm was established in a general framework. This framework includes the sparse group lasso penalized negative-log-likelihood for the multinomial distribution, which is of primary interest for multiclass classification problems.

We implemented the algorithm as a C++ template library. An R package is available for the multinomial and the logistic regression loss functions. We presented applications to multiclass classification problems using three real data examples. The multinomial group lasso solution achieved optimal performance in all three examples in terms of estimated expected misclassification error. In one example some sparse group lasso solutions achieved comparable performance based on fewer features. If there is cost associated with the acquisition of each feature, the objective would be to minimize the cost while optimizing the classification performance. In general, the sparse group lasso solutions provide sparser solutions than the group lasso. These sparser solutions can be of interest for model selection purposes and for interpretation of the model.

Appendix A Block coordinate descent methods

In this section we review the theoretical basis of the optimization methods that we apply in the sparse group lasso algorithm. We use three slightly different methods: a coordinate gradient descent, a block coordinate descent and a modified block coordinate descent.

We are interested in unconstrained optimization problems on \mathbb{R}^{n} where the coordinates are naturally divided into m\in\mathbb{N} blocks with dimensions n_{i}\in\mathbb{N} for i=1,\dots,m. We decompose the search space

\mathbb{R}^{n}=\mathbb{R}^{n_{1}}\times\dots\times\mathbb{R}^{n_{m}}

and denote by P_{i} the orthogonal projection onto the i'th block. For a vector x\in\mathbb{R}^{n} we write x=({x}^{(1)},\dots,{x}^{(m)}) where {x}^{(1)}\in\mathbb{R}^{n_{1}},\dots,{x}^{(m)}\in\mathbb{R}^{n_{m}}. For i=1,\dots,m we call {x}^{(i)} the i'th block of x. We assume that the objective function F:\mathbb{R}^{n}\to\mathbb{R} is bounded below and of the form

F(x)=f(x)+\sum_{i=1}^{m}h_{i}({x}^{(i)})

where f:\mathbb{R}^{n}\to\mathbb{R} is convex and each h_{i}:\mathbb{R}^{n_{i}}\to\mathbb{R}, i=1,\dots,m, is convex. Furthermore, we assume that for any i=1,\dots,m and any x_{0}=({x_{0}}^{(1)},\dots,{x_{0}}^{(m)}) the function

\mathbb{R}^{n_{i}}\ni\hat{x}\to F({x_{0}}^{(1)},\dots,{x_{0}}^{(i-1)},\hat{x},{x_{0}}^{(i+1)},\dots,{x_{0}}^{(m)})

is hemivariate. A function is said to be hemivariate if it is not constant on any line segment of its domain.

A.1 Coordinate gradient descent

Algorithm 5 Coordinate gradient descent scheme.
repeat
  Compute the quadratic approximation Q of f around the current point x.
  Compute the search direction
x^{new}=\underset{\hat{x}\in\mathbb{R}^{n}}{\arg\min}\,Q(\hat{x})+\sum_{i=1}^{m}h_{i}\left({\hat{x}}^{(i)}\right).
  Let \Delta=x^{new}-x, compute the step size t using the Armijo rule and let x\leftarrow x+t\Delta.
until stopping condition is met.
Algorithm 6 Armijo rule.
Require: a\in(0,0.5) and b\in(0,1)
 Let \delta=\nabla f(x)^{T}\Delta+\sum_{i=1}^{m}\left(h_{i}(x_{i}+\Delta_{i})-h_{i}(x_{i})\right).
while F(x+t\Delta)>F(x)+ta\delta do
  t\leftarrow bt.
end while
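
A direct transcription of the Armijo rule in Algorithm 6 into Python could look as follows; this is our own sketch, where F, \nabla f(x) and the h_{i} are assumed to be given as callables and arrays:

```python
def armijo_step(F, grad_f_x, h_blocks, x, delta, a=0.1, b=0.5, t=1.0):
    """Armijo backtracking rule of Algorithm 6.

    F        : full objective, F(x) = f(x) + sum_i h_i(x^(i))
    grad_f_x : gradient of f at x (numpy array)
    h_blocks : list of (h_i, idx_i) pairs: block penalty and block index set
    delta    : search direction (assumed to be a descent direction)
    """
    dec = float(grad_f_x @ delta) + sum(
        h(x[idx] + delta[idx]) - h(x[idx]) for h, idx in h_blocks)
    F_x = F(x)
    while F(x + t * delta) > F_x + t * a * dec:
        t *= b
    return t
```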

For this scheme we make the additional assumption that f is twice continuously differentiable everywhere. The scheme is outlined in Algorithm 5, where the step size is chosen by the Armijo rule outlined in Algorithm 6. Theorem 1e in [5] implies the following:

Corollary 1.

If f is twice continuously differentiable then every cluster point of the sequence \{x_{k}\}_{k\in\mathbb{N}} generated by Algorithm 5 is a minimizer of F.

A.2 Block coordinate descent

Algorithm 7 Block coordinate descent.
repeat
  Choose next block index i according to the cyclic rule.
  {x}^{(i)}\leftarrow\underset{\hat{x}\in\mathbb{R}^{n_{i}}}{\arg\min}\,F(\hat{x}\oplus P_{i}^{\perp}x).
until some stopping condition is met.

The block coordinate descent scheme is outlined in Algorithm 7. By Corollary 2 below the block coordinate descent method converges to a coordinatewise minimum.

Definition 3.

A point p\in\mathbb{R}^{n} is said to be a coordinatewise minimizer of F if for each block i=1,\dots,m it holds that

F(p+(0,\dots,0,d_{i},0,\dots,0))\geq F(p)\text{ for all }d_{i}\in\mathbb{R}^{n_{i}}.

If f is differentiable then, by Lemma 3, the block coordinate descent method converges to a minimizer. Lemma 3 below is a simple consequence of the separability of F.

Lemma 3.

Let p\in\mathbb{R}^{n} be a coordinatewise minimizer of F. If f is differentiable at p then p is a stationary point of F.

Proposition 5.1 in [6] implies the following:

Corollary 2.

For the sequence \{x_{k}\}_{k\in\mathbb{N}} generated by the block coordinate descent algorithm (Algorithm 7) it holds that every cluster point of \{x_{k}\}_{k\in\mathbb{N}} is a coordinatewise minimizer of F.

Appendix B Modified block coordinate descent

Algorithm 8 Modified coordinate descent loop.
repeat
  Let i\leftarrow i+1\mod m.
  {x}^{(i)}\leftarrow\underset{\hat{x}\in\mathbb{R}^{n_{i}}}{\arg\min}\,F(\hat{x}\oplus P_{i}^{\perp}x).
  if \left\|x-p\right\|_{2}<\epsilon and F(x)\geq F(p) then
   Compute a descent direction \Delta at p for F.
   Use line search to find t such that F(p+t\Delta)<F(p).
   Let x\leftarrow p+t\Delta.
  end if
until stopping condition is met.

For this last scheme we make the additional assumption that f is twice continuously differentiable everywhere except at a given non-optimal point p\in\mathbb{R}^{n}. In this case the block coordinate descent method is no longer guaranteed to be globally convergent, as it may get stuck at p. One immediate solution is to compute a descent direction at p and then use a line search to find a starting point x_{0} with F(x_{0})<F(p). Since f is differentiable on the sublevel set \left\{x\in\mathbb{R}^{n}\,\left|\,F(x)<F(p)\right.\right\} it follows by the results above that the cluster points of the generated sequence are stationary points of F. This procedure is not efficient since it discards a carefully chosen starting point. Instead we apply the modified coordinate descent loop outlined in Algorithm 8.

Lemma 4.

Assume that f is differentiable everywhere except at p\in\mathbb{R}^{n}, and that F is not optimal at p. Then for any \epsilon>0 the cluster points of the sequence \{x^{k}\}_{k\in\mathbb{N}} generated by Algorithm 8 are minimizers of F.

Proof.

Let z be a cluster point of \{x^{k}\}. By Corollary 2, z is a coordinatewise minimizer of F. Then Lemma 3 implies that z is either p or a stationary point of F. We show by contradiction that p is not a cluster point of \{x^{k}\}_{k\in\mathbb{N}}; thus assume otherwise. The sequence \{F(x^{k})\}_{k\in\mathbb{N}} is decreasing; hence, if we can find a k^{\prime}\in\mathbb{N} such that F(x^{k^{\prime}})<F(p) we reach a contradiction (since this would conflict with the continuity of F). Choose k^{\prime} such that \left\|x^{k^{\prime}}-p\right\|_{2}<\epsilon. Since we may assume that F(x^{k^{\prime}})\geq F(p) it follows by the definition of Algorithm 8 that F(x^{k^{\prime}+1})<F(p). ∎

Appendix C Proof of Proposition 1

(a) Straightforward.

(b) If \left\|\kappa(v,z)\right\|_{2}\leq a then -\kappa(v,z)\in aB^{n}, hence 0\in X. For the other implication simply choose y_{0}\in Y such that -y_{0}\in aB^{n} and note that \left\|\kappa(v,z)\right\|_{2}\leq\left\|y_{0}\right\|_{2}\leq a.

(c) Assume \left\|\kappa(v,z)\right\|_{2}>a, and let x^{*}=(1-a/\left\|\kappa(v,z)\right\|_{2})\kappa(v,z). Then x^{*}\in X and \left\|x^{*}\right\|_{2}=\left\|\kappa(v,z)\right\|_{2}-a. The point x^{*} is in fact a minimizer. To see this let x^{\prime}\in X, that is

x^{\prime}=z+as+\operatorname*{diag}(v)t

for some s\in B^{n} and t\in T^{n}. It follows, by the triangle inequality and (a), that

\left\|x^{\prime}\right\|_{2}+a\geq\left\|x^{\prime}-as\right\|_{2}=\left\|z+\operatorname*{diag}(v)t\right\|_{2}\geq\left\|\kappa(v,z)\right\|_{2}.

So \left\|x^{\prime}\right\|_{2}\geq\left\|\kappa(v,z)\right\|_{2}-a=\left\|x^{*}\right\|_{2}, and since X is convex and x\to\left\|x\right\|_{2} is strictly convex the found minimizer x^{*} is the unique minimizer.

References

  • [1] R. Tibshirani, Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society, Series B 58 (1994) 267–288.
  • [2] L. Meier, S. van de Geer, P. Bühlmann, The group lasso for logistic regression, Journal of the Royal Statistical Society, Series B.
  • [3] J. Friedman, T. Hastie, R. Tibshirani, A note on the group lasso and a sparse group lasso, Tech. Rep. arXiv:1001.0736 (Jan 2010).
  • [4] N. Simon, J. Friedman, T. Hastie, R. Tibshirani, A sparse group lasso, Journal of Computational and Graphical Statistics.
  • [5] P. Tseng, S. Yun, A coordinate gradient descent method for nonsmooth separable minimization, Mathematical Programming 117 (1) (2009) 387–423.
  • [6] P. Tseng, Convergence of a block coordinate descent method for nondifferentiable minimization, Journal of optimization theory and applications 109 (3) (2001) 475–494.
  • [7] J. Lu, G. Getz, E. Miska, E. Alvarez-Saavedra, J. Lamb, D. Peck, A. Sweet-Cordero, B. Ebert, R. Mak, A. Ferrando, et al., MicroRNA expression profiles classify human cancers, Nature 435 (7043) (2005) 834–838.
  • [8] S. Liu, Z. Liu, J. Sun, L. Liu, Application of synergetic neural network in online writeprint identification, International Journal of Digital Content Technology and its Applications 5 (3) (2011) 126–135.
  • [9] M. Bakay, Z. Wang, G. Melcon, L. Schiltz, J. Xuan, P. Zhao, V. Sartorelli, J. Seo, E. Pegoraro, C. Angelini, et al., Nuclear envelope dystrophies show a transcriptional fingerprint suggesting disruption of Rb–MyoD pathways in muscle regeneration, Brain 129 (4) (2006) 996.
  • [10] J. Friedman, T. Hastie, R. Tibshirani, Regularization paths for generalized linear models via coordinate descent, Journal of Statistical Software 33 (1) (2010) 1–22.
    URL http://www.jstatsoft.org/v33/i01/
  • [11] T. Hastie, R. Tibshirani, J. H. Friedman, The elements of statistical learning: data mining, inference, and prediction, New York: Springer-Verlag, 2001.
  • [12] A. Frank, A. Asuncion, UCI machine learning repository (2010).
    URL http://archive.ics.uci.edu/ml
  • [13] D. Bertsekas, A. Nedić, A. Ozdaglar, Convex analysis and optimization, Athena Scientific optimization and computation series, Athena Scientific, 2003.
  • [14] C. Sanderson, Armadillo: An open source C++ linear algebra library for fast prototyping and computationally intensive experiments, Tech. rep., NICTA (October 2010).
  • [15] Boost C++ libraries.
    URL http://www.boost.org
  • [16] R. Demming, D. Duffy, Introduction to the Boost C++ libraries, Datasim Education, 2010.
  • [17] The OpenMP API specification for parallel programming.
    URL http://www.openmp.org