
An active-set method for sparse approximations
Part II: General piecewise-linear terms

Spyridon Pougkakiotis    Jacek Gondzio    Dionysios S. Kalogerias
Abstract

In this paper we present an efficient active-set method for the solution of convex quadratic programming problems with general piecewise-linear terms in the objective, with applications to sparse approximations and risk minimization. The method exploits the structure of the piecewise-linear terms appearing in the objective in order to significantly reduce its memory requirements, and thus improve its efficiency. We showcase the robustness of the proposed solver on a variety of problems arising in risk-averse portfolio selection, quantile regression, and binary classification via linear support vector machines. We provide computational evidence to demonstrate, on real-world datasets, the ability of the solver to efficiently handle a variety of problems, by comparing it against an efficient general-purpose interior point solver as well as a state-of-the-art alternating direction method of multipliers. This work complements the accompanying paper [“An active-set method for sparse approximations. Part I: Separable \ell_{1} terms”, S. Pougkakiotis, J. Gondzio, D. S. Kalogerias], in which we discuss the case of separable \ell_{1} terms, analyze the convergence, and propose general-purpose preconditioning strategies for the solution of its associated linear systems.

1 Introduction

In this paper we are interested in the solution of the following optimization problem

minxn{cx+12xQx+i=1l((Cx+d)i)++Dx1+δ𝒦(x)},s.t.Ax=b,\begin{split}\underset{x\in\mathbb{R}^{n}}{\text{min}}&\quad\left\{c^{\top}x+\frac{1}{2}x^{\top}Qx+\sum_{i=1}^{l}\left(\left(Cx+d\right)_{i}\right)_{+}+\|Dx\|_{1}+\delta_{\mathcal{K}}(x)\right\},\\ \textnormal{s.t.}&\quad\ Ax=b,\end{split} (1.1)

where cnc\in\mathbb{R}^{n}, Qn×nQ\in\mathbb{R}^{n\times n} is a positive semi-definite matrix, Cl×nC\in\mathbb{R}^{l\times n}, dld\in\mathbb{R}^{l}, Am×nA\in\mathbb{R}^{m\times n} is a linear constraint matrix with bmb\in\mathbb{R}^{m} a given right-hand side, Dn×nD\in\mathbb{R}^{n\times n} is a diagonal positive semi-definite (“weight”) matrix, and 𝒦\mathcal{K} is a closed convex set 𝒦{xn:x[al,au]}\mathcal{K}\coloneqq\{x\in\mathbb{R}^{n}\colon x\in\left[a_{l},a_{u}\right]\} with al,auna_{l},\ a_{u}\in\mathbb{R}^{n}, such that (al)i{},(au)i{+}(a_{l})_{i}\in\mathbb{R}\cup\{-\infty\},\ (a_{u})_{i}\in\mathbb{R}\cup\{+\infty\} for all i=1,,ni=1,\ldots,n. Additionally, ()+max{,0}(\cdot)_{+}\equiv\max\{\cdot,0\}, while δ𝒦()\delta_{\mathcal{K}}(\cdot) is an indicator function for the set 𝒦\mathcal{K}, that is, δ𝒦(x)=0\delta_{\mathcal{K}}(x)=0 if x𝒦x\in\mathcal{K} and δ𝒦(x)=\delta_{\mathcal{K}}(x)=\infty, otherwise.

We introduce an auxiliary variable wlw\in\mathbb{R}^{l}, and reformulate (1.1) in the following form:

min(x,w)n×l{cx+12xQx+i=1l(wi)++Dx1+δ𝒦(x)},s.t.Cx+dw=0l,Ax=b.\begin{split}\underset{(x,w)\ \in\ \mathbb{R}^{n}\times\mathbb{R}^{l}}{\text{min}}&\left\{c^{\top}x+\frac{1}{2}x^{\top}Qx+\sum_{i=1}^{l}\left(w_{i}\right)_{+}+\|Dx\|_{1}+\delta_{\mathcal{K}}(x)\right\},\\ \textnormal{s.t.}\quad\ \ &\ Cx+d-w=0_{l},\\ &\ Ax=b.\end{split} (P)
Remark 1.

Let us notice that the model in (P) can readily accommodate terms of the form of Cx+d1\|Cx+d\|_{1} where Cl×nC\in\mathbb{R}^{l\times n}, and dld\in\mathbb{R}^{l}. Indeed, letting c=C𝟙lc=-C^{\top}\mathds{1}_{l} and adding the constant term 𝟙ld-\mathds{1}_{l}^{\top}d in the objective of (P), we notice that

Cx+d1𝟙l(Cx+d)+j=1l(2(Cx+d)j)+,\|Cx+d\|_{1}\equiv-\mathds{1}_{l}^{\top}(Cx+d)+\sum_{j=1}^{l}\left(2(Cx+d)_{j}\right)_{+},

where 𝟙l(1,,1)l\mathds{1}_{l}\coloneqq(1,\ldots,1)^{\top}\in\mathbb{R}^{l}. Similarly, any piecewise-linear term of the form

i=1lmax{(C1x+d1)i,(C2x+d2)i},\sum_{i=1}^{l}\max\left\{\left(C_{1}x+d_{1}\right)_{i},\left(C_{2}x+d_{2}\right)_{i}\right\},

where C1,C2l×nC_{1},\ C_{2}\in\mathbb{R}^{l\times n} and d1,d2ld_{1},\ d_{2}\in\mathbb{R}^{l}, can also be readily modeled. Indeed, setting c=C2𝟙c=C_{2}^{\top}\mathds{1} and adding the term d2𝟙ld_{2}^{\top}\mathds{1}_{l} in the objective yields

𝟙l(C2x+d2)+i=1l((C1x+d1C2xd2)i)+i=1lmax{(C1x+d1)i,(C2x+d2)i}.\mathds{1}_{l}^{\top}\left(C_{2}x+d_{2}\right)+\sum_{i=1}^{l}\left(\left(C_{1}x+d_{1}-C_{2}x-d_{2}\right)_{i}\right)_{+}\equiv\sum_{i=1}^{l}\max\left\{\left(C_{1}x+d_{1}\right)_{i},\left(C_{2}x+d_{2}\right)_{i}\right\}.

Finally, it is important to note that model (P) allows for multiple piecewise-linear terms of the form \max\{Cx+d,0_{l}\}, \|Cx+d\|_{1} or \max\left\{C_{1}x+d_{1},C_{2}x+d_{2}\right\}, since we can always adjust l to account for more than one term. Hence, one can observe that (P) is quite general and can be used to model a plethora of very important problems that arise in practice.
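As a quick numerical sanity check of the two identities above, consider the following minimal sketch; the matrices and vectors it generates are random placeholders, not data from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
l, n = 5, 3
x = rng.standard_normal(n)
C, d = rng.standard_normal((l, n)), rng.standard_normal(l)
C1, d1 = rng.standard_normal((l, n)), rng.standard_normal(l)
C2, d2 = rng.standard_normal((l, n)), rng.standard_normal(l)
plus = lambda v: np.maximum(v, 0.0)  # elementwise (.)_+

# ||Cx + d||_1  =  -1'(Cx + d) + sum_j (2 (Cx + d)_j)_+
assert np.isclose(np.abs(C @ x + d).sum(),
                  -(C @ x + d).sum() + plus(2.0 * (C @ x + d)).sum())

# sum_i max{(C1 x + d1)_i, (C2 x + d2)_i}
#   =  1'(C2 x + d2) + sum_i ((C1 x + d1 - C2 x - d2)_i)_+
assert np.isclose(np.maximum(C1 @ x + d1, C2 @ x + d2).sum(),
                  (C2 @ x + d2).sum() + plus(C1 @ x + d1 - C2 @ x - d2).sum())
```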

In light of the discussion in Remark 1, it is easily observed that problem (P) can model a wide range of important problems arising in several application domains spanning, among others, operational research, machine learning, data science, and engineering. More specifically, various lasso and fused lasso instances (with applications to sparse approximations for classification and regression [19, 54], portfolio allocation [1], or medical diagnosis [23], among many others) can be readily modeled by (P). Additionally, various risk-minimization problems with linear random cost functions can be modeled by (P) (e.g. see [33, 44, 48]). Furthermore, even risk-minimization problems with nonlinear random cost functions, which are typically solved via Gauss-Newton schemes (e.g. see [10]), often require the solution of sub-problems like (P). Finally, continuous relaxations of integer programming problems with applications to operational research (e.g. [32]) often take the form of (P). Given the multitude of problems requiring easy access to (usually accurate) solutions of (P), the derivation of efficient, robust, and scalable solution methods is of paramount importance.

Problem (P) can be solved by various first- or second-order methods. In particular, using a standard reformulation that introduces several auxiliary variables, (P) can be written as a convex quadratic programming (QP) problem and efficiently solved by, among others, an interior point method (IPM; e.g. [19, 38]), an alternating direction method of multipliers (ADMM; e.g. [20]), or a proximal point method (e.g. [28, 43]). However, the reformulation of (P) into a convex QP is not expected to lead to scalable solution methods, since the dimension of the problem is significantly increased, and hence an already large-scale instance might be very difficult to solve in this way.

Alternatively, ADMM (or general splitting) schemes can be developed for the solution of (P) without the need for additional auxiliary variables (see Section 3). However, first-order schemes cannot, in general, be expected to consistently yield sufficiently accurate solutions (i.e. of 4-, 5- or 6-digit accuracy). If such a solution is sought, we have to employ a semismooth Newton method (SSN; e.g. [12, 24, 37]), or a combination of a proximal point method with an SSN scheme utilized for the solution of its associated sub-problems. SSN methods have been successfully applied to a plethora of problems; however, their success is heavily reliant on the properties of the problem at hand (e.g. the rank of the linear constraints, or the conditioning of the Hessian). On the other hand, the combination of the proximal point method with the SSN can circumvent these issues, since the associated nonsmooth sub-problems can be guaranteed to be well-defined and well-conditioned.

Various such solvers have been developed and analyzed in the literature. For example, the authors in [53] developed a dual augmented Lagrangian scheme combined with an SSN method for the solution of semidefinite programming problems, and obtained very promising results. This scheme was then utilized for the solution of linear programming problems in [35], and for lasso-regularized problems in [34]. A similar primal-dual approach for \ell_{1}-regularized convex quadratic programming problems was developed and analyzed in our accompanying paper [39] and was shown to be especially efficient for the solution of elastic-net linear regression and L^{1}-regularized partial differential equation constrained optimization problems. In fact, the active-set method developed in this work is a direct extension of the method given in [39], altered in a specific way so that it can efficiently handle most piecewise-linear terms that appear in practice, by restricting its memory requirements.

Indeed, we showcase that each of the nonsmooth terms in the objective of (P) can be utilized for reducing the memory requirements of the proposed method. In particular, when computing the Clarke subdifferential ([13]) of an appropriate augmented Lagrangian penalty function, we can show that the Clarke derivatives of such piecewise-linear terms act as projectors. As a result, we obtain an active-set scheme that significantly reduces the dimensions of the sub-problems, leading to better performance and reduced memory requirements. In particular, we observe that a thresholding operation (originating from the presence of \ell_{1} terms in the objective) determines which of the variables x are inactive, allowing us to discard entire columns of the matrices A and C when solving the associated sub-problems. Furthermore, the \max\{\cdot,0\} terms in the objective determine which rows of C are not needed, allowing us to further eliminate such rows.

We showcase the robustness and the efficiency of the resulting active-set scheme on various optimization problems arising in risk-averse portfolio selection, quantile regression, and binary classification via linear support vector machines. In each of these cases the proposed scheme is compared against the robust polynomially convergent regularized interior point method developed and analyzed in [38], as well as the well-known ADMM-based OSQP solver ([49]). We demonstrate the reduced memory requirements of the active set scheme (and hence its improved scalability), as compared to interior point and ADMM alternatives applied to QP reformulations, while showcasing its efficiency and robustness.

Structure of the paper

In Section 2 we briefly present the proposed inner-outer active-set method. In particular, in Section 2.1 we derive a proximal method of multipliers (outer scheme) for the solution of (P), assuming that we can find an ϵ\epsilon-optimal solution of the associated sub-problems. Then, in Section 2.2, we briefly present a semismooth Newton method (inner scheme) for finding approximate solutions to sub-problems arising from the proximal method of multipliers. Focusing on the structure of problem (P), by selecting appropriate Clarke derivatives, we show that the proposed inner-outer method is in fact an active-set scheme, the associated linear systems of which are well-conditioned and stable. In Section 2.3, we discuss an extension of the method for dealing with problems having arbitrary piecewise-linear terms in the objective (with applications to, among others, robust optimization) that are not currently covered within the paper.

Subsequently, in Section 3, we derive a proximal alternating direction method of multipliers for the approximate solution of (P) in order to obtain good initial estimates for the primal and dual variables of the problem. This can then be used to warm-start the proposed second-order algorithm. A good starting point for the algorithm could mean that only a small portion of the problem data matrices is used at each inner-outer iteration of the scheme, while the outer method is expected to achieve its local linear (and potentially superlinear) convergence rate in a small number of inner-outer iterations.

In Section 4, we showcase the efficiency and robustness of the approach on several important real-life applications arising in risk-averse portfolio optimization, statistical regression, and binary classification via linear support vector machines. Finally, we discuss our conclusions in Section 5.

Notation

Given a vector x\in\mathbb{R}^{n}, \|x\| denotes its Euclidean norm. Given a closed set \mathcal{K}\subset\mathbb{R}^{n}, \Pi_{\mathcal{K}}(x) denotes the Euclidean projection onto \mathcal{K}, that is \Pi_{\mathcal{K}}(x)\coloneqq\arg\min\{\|x-z\|\colon z\in\mathcal{K}\}, and we write \textnormal{dist}(z,\mathcal{K})\coloneqq\inf_{z^{\prime}\in\mathcal{K}}\|z-z^{\prime}\|. Given a convex function p\colon\mathbb{R}^{n}\mapsto\mathbb{R}, we define the proximity operator as \textbf{prox}_{p}(u)\coloneqq\arg\min_{x}\left\{p(x)+\frac{1}{2}\|u-x\|^{2}\right\}. Given an index set \mathcal{D}, |\mathcal{D}| denotes its cardinality. Given a rectangular matrix A\in\mathbb{R}^{m\times n} and an index set \mathcal{B}\subseteq\{1,\ldots,n\}, we denote the columns of A whose indices belong to \mathcal{B} by A_{\mathcal{B}}. Given a matrix A\in\mathbb{R}^{m\times n} and two index sets \mathcal{B}_{1}\subseteq\{1,\ldots,m\} and \mathcal{B}_{2}\subseteq\{1,\ldots,n\}, we denote the submatrix of A with rows in \mathcal{B}_{1} and columns in \mathcal{B}_{2} by A_{\left(\mathcal{B}_{1},\mathcal{B}_{2}\right)}. Finally, given an arbitrary vector d with n components and indices 1\leq i_{1}\leq i_{2}\leq n, we denote by d_{i_{1}:i_{2}} the vector containing the i_{1}-st up to the i_{2}-nd component of d. To avoid confusion, the index i is always used to denote entries of vectors or matrices, while k and j are reserved for iteration counters (outer and inner, respectively).

2 An active-set method

In this section we derive an active-set method for the solution of (P). The algorithm is an inner-outer scheme, which results by combining an outer proximal method of multipliers, and an inner semismooth Newton method for the solution of the PMM sub-problems. Following the discussion in [39, Section 2], we briefly derive the outer PMM scheme. Then, we briefly present the inner semismooth Newton method and discuss the solution of its associated linear systems, which is the main bottleneck of the algorithm.

2.1 Outer scheme: A primal-dual proximal method of multipliers

In this section, following and extending the developments in [39], we derive a primal-dual proximal method of multipliers for the approximate solution of (P). A convergence analysis is not provided, and the reader is referred to [39, Section 2], for an outline of such an analysis (since it applies directly to the case under consideration).

Given a penalty parameter β>0\beta>0, and some dual multiplier estimates y,zy,\ z, we follow [39, Section 2] to obtain the augmented Lagrangian corresponding to (P), which reads

β(x,w;y,z)cx+12xQx+i=1l(wi)++Dx1y([Cx+dwAxb])+β2[Cx+dwAxb]212βz2+12βz+βxβΠ𝒦(β1z+x)2.\begin{split}\mathcal{L}_{\beta}(x,w;y,z)\coloneqq\ &\ c^{\top}x+\frac{1}{2}x^{\top}Qx+\sum_{i=1}^{l}\left(w_{i}\right)_{+}+\|Dx\|_{1}-y^{\top}\left(\begin{bmatrix}Cx+d-w\\ Ax-b\end{bmatrix}\right)\\ &\quad\ +\frac{\beta}{2}\left\|\begin{bmatrix}Cx+d-w\\ Ax-b\end{bmatrix}\right\|^{2}-\frac{1}{2\beta}\|z\|^{2}+\frac{1}{2\beta}\|z+\beta x-\beta\Pi_{\mathcal{K}}(\beta^{-1}z+x)\|^{2}.\end{split} (2.1)

Indeed, this can be verified utilizing certain standard properties of Fenchel duality and of the proximity operator (see [39, Section 2]).

During iteration k0k\geq 0 of the proximal method of multipliers, we have the estimates (xk,yk,zk)(x_{k},y_{k},z_{k}) as well as the penalty parameters βk,ρk\beta_{k},\ \rho_{k}, such that ρkβkτk\rho_{k}\coloneqq\frac{\beta_{k}}{\tau_{k}}, where {τk}k=0\{\tau_{k}\}_{k=0}^{\infty} is a non-increasing positive sequence. For simplicity of exposition, let g(x,w)g1(x)+g2(w)g\left(x,w\right)\coloneqq g_{1}(x)+g_{2}(w), where g1(x)Dx1,g_{1}(x)\coloneqq\|Dx\|_{1}, and g2(w)i=1l(wi)+.g_{2}(w)\coloneqq\sum_{i=1}^{l}\left(w_{i}\right)_{+}. We consider the following regularized continuously differentiable function:

ϕ(x,w)ϕβk,ρk(x,w;xk,yk,zk)βk(x,w;yk,zk)+12ρkxxk2g(x,w).\begin{split}\scalebox{1.0}{\mbox{$\displaystyle\phi(x,w)\equiv\ \phi_{\beta_{k},\rho_{k}}(x,w;x_{k},y_{k},z_{k})\coloneqq$}}&\ \scalebox{1.0}{\mbox{$\displaystyle\mathcal{L}_{\beta_{k}}(x,w;y_{k},z_{k})+\frac{1}{2\rho_{k}}\left\|x-x_{k}\right\|^{2}-g\left(x,w\right)$}}.\end{split}

Notice that we introduce primal proximal regularizer only for variable xx, and not for ww. This is a very important algorithmic choice that departs from the developments in [39]. We want to develop a memory-efficient active-set method, and for that reason we avoid introducing a proximal term for the auxiliary variables ww. This choice, which does not hinder the stability of the proposed approach, leads to better memory efficiency of the resulting active-set scheme, by promoting a sparsification of the associated linear systems solved at each inner-outer iteration (this point will become clear in Section 2.2.1).

The minimization of the proximal augmented Lagrangian function can be written as

minx,w{βk(x,w;yk,zk)+12ρkxxk2}minx,w{ϕ(x,w)+g(x,w)},\underset{x,w}{\min}\ \left\{\mathcal{L}_{\beta_{k}}(x,w;y_{k},z_{k})+\frac{1}{2\rho_{k}}\left\|x-x_{k}\right\|^{2}\right\}\equiv\min_{x,w}\left\{\phi(x,w)+g(x,w)\right\},

and thus we need to find (x,w)n×l(x^{*},w^{*})\in\mathbb{R}^{n}\times\mathbb{R}^{l} such that

(ϕ(x,w))((x,w)(x,w))+g(x,w)g(x,w)0,(x,w)n×l.\scalebox{1.0}{\mbox{$\displaystyle\left(\nabla\phi(x^{*},w^{*})\right)^{\top}\left((x,w)-(x^{*},w^{*})\right)+g(x,w)-g(x^{*},w^{*})\geq 0,\quad\forall\ (x,w)\in\mathbb{R}^{n}\times\mathbb{R}^{l}$}}.

To that end, we observe that

\begin{split}\nabla_{x}\phi(x,w)=&\ c+Qx-\begin{bmatrix}C^{\top}&A^{\top}\end{bmatrix}y_{k}+\beta_{k}\begin{bmatrix}C^{\top}&A^{\top}\end{bmatrix}\left(\begin{bmatrix}Cx+d-w\\ Ax-b\end{bmatrix}\right)\\ &\qquad+(z_{k}+\beta_{k}x)-\beta_{k}\Pi_{\mathcal{K}}(\beta_{k}^{-1}z_{k}+x)+\rho_{k}^{-1}(x-x_{k}),\\ \nabla_{w}\phi(x,w)=&\ (y_{k})_{1:l}-\beta_{k}\left(Cx+d-w\right).\end{split}

By introducing the dual variable yykβk([Cx+dwAxb])l+my\coloneqq y_{k}-\beta_{k}\left(\begin{bmatrix}Cx+d-w\\ Ax-b\end{bmatrix}\right)\in\mathbb{R}^{l+m}, the optimality conditions of minx,w{ϕ(x,w)+g(x,w)}\underset{x,w}{\min}\ \left\{\phi(x,w)+g(x,w)\right\} can be written as

(0n+l,0l+m)βk,ρk(x,w,y;xk,yk,zk)k(x,w,y),(0_{n+l},0_{l+m})\in\mathcal{M}_{\beta_{k},\rho_{k}}(x,w,y;x_{k},y_{k},z_{k})\equiv\mathcal{M}_{k}(x,w,y), (2.2)

where

k(x,w,y){(u,v):urk(x,y)+[g1(x)g2(w)],v=[Cx+dwAxb]+βk1(yyk)},\begin{split}\mathcal{M}_{k}(x,w,y)\coloneqq\left\{\scalebox{1.0}{\mbox{$\displaystyle(u^{\prime},v^{\prime})\colon u^{\prime}\in r_{k}(x,y)+\begin{bmatrix}\partial g_{1}(x)\\ \partial g_{2}(w)\end{bmatrix},\ v^{\prime}=\begin{bmatrix}Cx+d-w\\ Ax-b\end{bmatrix}+\beta_{k}^{-1}(y-y_{k})$}}\right\},\end{split}
rk(x,y)[c+Qx[CA]y+(zk+βkx)βkΠ𝒦(βk1zk+x)+ρk1(xxk)y1:l].r_{k}(x,y)\coloneqq\begin{bmatrix}\scalebox{1.0}{\mbox{$\displaystyle c+Qx-\begin{bmatrix}C^{\top}&A^{\top}\end{bmatrix}y+(z_{k}+\beta_{k}x)-\beta_{k}\Pi_{\mathcal{K}}(\beta_{k}^{-1}z_{k}+x)$}}+\rho_{k}^{-1}(x-x_{k})\\ y_{1:l}\end{bmatrix}.

We now describe the proposed proximal method of multipliers in Algorithm PMM.

Algorithm PMM (proximal method of multipliers)

Input: (x0,w0,y0,z0)n×l×l+m×n(x_{0},w_{0},y_{0},z_{0})\in\mathbb{R}^{n}\times\mathbb{R}^{l}\times\mathbb{R}^{l+m}\times\mathbb{R}^{n}, β0,β>0\beta_{0},\ \beta_{\infty}>0, {τk}k=0\{\tau_{k}\}_{k=0}^{\infty}, such that τkτ>0\tau_{k}\geq\tau_{\infty}>0, k0\forall\ k\geq 0.

Choose a sequence of positive numbers {ϵk}\{\epsilon_{k}\} such that ϵk0\epsilon_{k}\rightarrow 0.
for (k=0,1,2,k=0,1,2,\ldotsdo
  Find (xk+1,wk+1,yk+1)(x_{k+1},w_{k+1},y_{k+1}) such that:
εkdist(0n+2l+m,k(xk+1,wk+1,yk+1))ϵk,\varepsilon_{k}\coloneqq\textnormal{dist}\left(0_{n+2l+m},\mathcal{M}_{k}\left(x_{k+1},w_{k+1},y_{k+1}\right)\right)\leq\epsilon_{k}, (2.3)
  where letting r^=rk(xk+1,yk+1)\hat{r}=r_{k}(x_{k+1},y_{k+1}) and 𝒰={un+l:ug(xk+1,wk+1)},\mathcal{U}=\left\{u\in\mathbb{R}^{n+l}\colon u\in\partial g(x_{k+1},w_{k+1})\right\}, we have

εk=[r^+Π𝒰(r^)[CA]xk+1+[db][Il0m,l]wk+1+βk1(yk+1yk)].\displaystyle\varepsilon_{k}=\left\|\begin{bmatrix}\hat{r}+\Pi_{\mathcal{U}}\left(-\hat{r}\right)\\ \begin{bmatrix}C\\ A\end{bmatrix}x_{k+1}+\begin{bmatrix}d\\ -b\end{bmatrix}-\begin{bmatrix}I_{l}\\ 0_{m,l}\end{bmatrix}w_{k+1}+\beta_{k}^{-1}(y_{k+1}-y_{k})\end{bmatrix}\right\|.

zk+1\displaystyle\ \ \ \ z_{k+1} =(zk+βkxk+1)βkΠ𝒦(βk1zk+xk+1).\displaystyle=\ (z_{k}+\beta_{k}x_{k+1})-\beta_{k}\Pi_{\mathcal{K}}\big{(}\beta_{k}^{-1}z_{k}+x_{k+1}\big{)}. (2.4)
βk+1\displaystyle\ \ \ \ \beta_{k+1} β,ρk+1=βk+1τk+1.\displaystyle\nearrow\beta_{\infty}\leq\infty,\quad\rho_{k+1}=\frac{\beta_{k+1}}{\tau_{k+1}}. (2.5)
end for
return (xk+1,wk+1,yk+1,zk+1)(x_{k+1},w_{k+1},y_{k+1},z_{k+1}).

Let us notice that given a triple (x~,w~,y~)(\tilde{x},\tilde{w},\tilde{y}), we can easily evaluate the distance in (2.3), due to the piecewise linear structure of the associated function g()g(\cdot). A trivial extension of the analysis in [39, Section 3] yields that Algorithm PMM is globally convergent under the standard assumption of primal-dual feasibility. Furthermore, given some additional conditions on the error sequence {ϵk}\{\epsilon_{k}\} one can show that in fact Algorithm PMM achieves a local linear convergence rate (which becomes superlinear if βk\beta_{k}\rightarrow\infty at a suitable rate). Finally, the algorithm exhibits a global linear convergence rate, assuming that the starting point is chosen properly. For more details, we refer the reader to [39, Theorems 2.2, 2.3].
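To fix ideas, the following Python sketch outlines the structure of Algorithm PMM for the box 𝒦 = [a_l, a_u]. The inner solver ssn_solve is a hypothetical placeholder standing in for Algorithm SSN, and the penalty/tolerance updates shown are one admissible choice rather than the tuning used in the paper.

```python
import numpy as np

def pmm(x0, w0, y0, z0, a_l, a_u, ssn_solve, beta0=10.0, beta_max=1e6,
        tau=1.0, eps0=1e-2, max_outer=200):
    """Schematic outer loop of Algorithm PMM (notation of Section 2.1).

    ssn_solve(x, w, y, z, beta, rho, eps) is assumed to return a triple
    (x, w, y) satisfying the inexactness criterion (2.3) with tolerance eps.
    """
    x, w, y, z = x0.copy(), w0.copy(), y0.copy(), z0.copy()
    beta, eps = beta0, eps0
    for _ in range(max_outer):
        rho = beta / tau
        # inner semismooth Newton solve of the PMM sub-problem
        x, w, y = ssn_solve(x, w, y, z, beta, rho, eps)
        # multiplier update (2.4): z <- (z + beta*x) - beta * Proj_K(z/beta + x)
        z = (z + beta * x) - beta * np.clip(z / beta + x, a_l, a_u)
        # penalty and tolerance updates (2.5): beta non-decreasing, eps -> 0
        beta = min(2.0 * beta, beta_max)
        eps *= 0.5
    return x, w, y, z
```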

2.2 Inner scheme: A semismooth Newton method

Next, we employ a semismooth Newton (SSN) method to solve problem (2.3) appearing in Algorithm PMM, and verify that the resulting inner-outer scheme admits an active-set interpretation. Given the estimates (xk,yk,zk)(x_{k},y_{k},z_{k}) as well as the penalties βk,ρk\beta_{k},\ \rho_{k}, we apply SSN to approximately solve (2.2). Given any bounded positive penalty ζk>0\zeta_{k}>0, the optimality conditions in (2.2) can be written as

^k(x,w,y)ζk[ζk1([xw]proxζkg(xζkrk(x,y)))[CA]x+[db][Il0m,l]w+βk1(yyk)]=[0n+l0l+m].\widehat{\mathcal{M}}_{k}\left(x,w,y\right)\coloneqq\zeta_{k}\begin{bmatrix}\zeta_{k}^{-1}\left(\begin{bmatrix}x\\ w\end{bmatrix}-\textbf{prox}_{\zeta_{k}g}\left(x-\zeta_{k}r_{k}(x,y)\right)\right)\\ \begin{bmatrix}C\\ A\end{bmatrix}x+\begin{bmatrix}d\\ -b\end{bmatrix}-\begin{bmatrix}I_{l}\\ 0_{m,l}\end{bmatrix}w+\beta_{k}^{-1}(y-y_{k})\end{bmatrix}=\begin{bmatrix}0_{n+l}\\ 0_{l+m}\end{bmatrix}. (2.6)

We set xk0=xkx_{k_{0}}=x_{k}, yk0=yky_{k_{0}}=y_{k}, and at every iteration kjk_{j} of SSN, we solve

[Hkj0n,lζkBg1,kj[CA]0l,n(IlBg2,kj)ζk[Bg2,kj0l,m]ζk[CA]ζk[Il0m,l]ζkβk1Il+m]Mkj[dxdwdy]=^k(xkj,wkj,ykj),\underbrace{\scalebox{1.0}{\mbox{$\displaystyle\begin{bmatrix}H_{k_{j}}&0_{n,l}&-\zeta_{k}B_{g_{1},k_{j}}\begin{bmatrix}C^{\top}&A^{\top}\end{bmatrix}\\ 0_{l,n}&\left(I_{l}-B_{g_{2},k_{j}}\right)&\zeta_{k}\begin{bmatrix}B_{g_{2},k_{j}}&0_{l,m}\end{bmatrix}\\ \zeta_{k}\begin{bmatrix}C\\ A\end{bmatrix}&\zeta_{k}\begin{bmatrix}-I_{l}\\ 0_{m,l}\end{bmatrix}&\zeta_{k}\beta_{k}^{-1}I_{l+m}\end{bmatrix}$}}}_{{M_{k_{j}}}}\begin{bmatrix}d_{x}\\ d_{w}\\ d_{y}\end{bmatrix}=-\widehat{\mathcal{M}}_{k}\left(x_{k_{j}},w_{k_{j}},y_{k_{j}}\right), (2.7)

where we have introduced the notation

HkjInBg1,kj+ζkβkBg1,kj((1+ρk1βk1)InBδ,kj+βk1Q),H_{k_{j}}\coloneqq I_{n}-B_{g_{1},k_{j}}+\zeta_{k}\beta_{k}B_{g_{1},k_{j}}\left(\left(1+\rho_{k}^{-1}\beta_{k}^{-1}\right)I_{n}-B_{\delta,k_{j}}+\beta_{k}^{-1}Q\right),

assuming that we are given some arbitrary matrices

Bδ,kjxCΠ𝒦(βk1zk+xkj),Bg1,kjxC(proxζkg1(xkjζk(rk(xkj,ykj))1:n)),Bg2,kjwC(proxζkg2(wkjζk(ykj)1:l)).\begin{split}&B_{\delta,k_{j}}\in\partial^{C}_{x}\Pi_{\mathcal{K}}\left(\beta_{k}^{-1}z_{k}+x_{k_{j}}\right),\\ &B_{g_{1},k_{j}}\in\partial^{C}_{x}\left(\textbf{prox}_{\zeta_{k}g_{1}}\left(x_{k_{j}}-\zeta_{k}(r_{k}(x_{k_{j}},y_{k_{j}}))_{1:n}\right)\right),\\ &B_{g_{2},k_{j}}\in\partial^{C}_{w}\left(\textbf{prox}_{\zeta_{k}g_{2}}\left(w_{k_{j}}-\zeta_{k}(y_{k_{j}})_{1:l}\right)\right).\end{split} (2.8)

The symbol xC()\partial_{x}^{C}(\cdot) denotes the Clarke subdifferential of a function (see [13]). In our case, any element of the Clarke subdifferential is a Newton derivative (see [14, Chapter 13]), since rk(,)r_{k}(\cdot,\cdot) and g(,)g1()+g2()g(\cdot,\cdot)\equiv g_{1}(\cdot)+g_{2}(\cdot) are piecewise continuously differentiable and regular functions. For any unu\in\mathbb{R}^{n} and any i{1,,n}i\in\{1,\ldots,n\}, it holds that

uiC(Π[ai,bi](ui))={{1},ifui(ai,bi),{0},ifui[ai,bi],[0,1],ifui{ai,bi}.\partial_{u_{i}}^{C}\left(\Pi_{[a_{i},b_{i}]}(u_{i})\right)=\begin{cases}\{1\},&\qquad\textnormal{if}\quad u_{i}\in(a_{i},b_{i}),\\ \{0\},&\qquad\textnormal{if}\quad u_{i}\notin[a_{i},b_{i}],\\ [0,1],&\qquad\textnormal{if}\quad u_{i}\in\{a_{i},b_{i}\}.\end{cases}

Since g1(x)=Dx1g_{1}(x)=\|Dx\|_{1} and D0nD\succeq 0_{n} is diagonal, we have that for all unu\in\mathbb{R}^{n} and any i{1,,n}i\in\{1,\ldots,n\},

(proxζkg1(u))i=soft[ζkD(i,i),ζkD(i,i)](ui)max{|ui|ζkD(i,i),0}sign(ui).\left(\textbf{prox}_{\zeta_{k}g_{1}}\left(u\right)\right)_{i}=\textbf{soft}_{\left[-\zeta_{k}D_{(i,i)},\zeta_{k}D_{(i,i)}\right]}\left(u_{i}\right)\equiv\max\left\{\lvert u_{i}\rvert-\zeta_{k}D_{(i,i)},0\right\}\textnormal{sign}(u_{i}).

Finally, since g2(w)=max{w,0}g_{2}(w)=\max\{w,0\}, we have that for all ulu\in\mathbb{R}^{l} and any i=1,,li=1,\ldots,l,

(proxζkg2(u))i=soft[0,ζk](ui)max{uiζk,0}+min{ui,0}.\left(\textbf{prox}_{\zeta_{k}g_{2}}\left(u\right)\right)_{i}=\textbf{soft}_{\left[0,\zeta_{k}\right]}\left(u_{i}\right)\equiv\max\{u_{i}-\zeta_{k},0\}+\min\{u_{i},0\}.

Then, we can show (e.g. see [14, Example 14.9]) that

(uC(proxζkg1(u)))i={{1},if|ui|>ζkD(i,i)orD(i,i)=0,{0},if|ui|<ζkD(i,i),[0,1],ifui=ζkD(i,i),for all i{1,,n},\displaystyle\left(\partial_{u}^{C}\left(\textbf{prox}_{\zeta_{k}g_{1}}\left(u\right)\right)\right)_{i}=\begin{cases}\{1\},&\qquad\textnormal{if}\quad\lvert u_{i}\rvert>\zeta_{k}D_{(i,i)}\ \textnormal{or}\ D_{(i,i)}=0,\\ \{0\},&\qquad\textnormal{if}\quad\lvert u_{i}\rvert<\zeta_{k}D_{(i,i)},\\ [0,1],&\qquad\textnormal{if}\quad u_{i}=\zeta_{k}D_{(i,i)},\end{cases}\ \textnormal{for all }i\in\{1,\ldots,n\},

and

(uC(proxζkg2(u)))i={{1},ifui>ζkorui<0,{0},if0<ui<ζk,[0,1],ifui=ζkorui=0,for all i{1,,l}.\left(\partial_{u}^{C}\left(\textbf{prox}_{\zeta_{k}g_{2}}\left(u\right)\right)\right)_{i}=\begin{cases}\{1\},&\qquad\textnormal{if}\quad u_{i}>\zeta_{k}\ \textnormal{or}\ u_{i}<0,\\ \{0\},&\qquad\textnormal{if}\quad 0<u_{i}<\zeta_{k},\\ [0,1],&\qquad\textnormal{if}\quad u_{i}=\zeta_{k}\ \textnormal{or}\ u_{i}=0,\end{cases}\ \textnormal{for all }i\in\{1,\ldots,l\}.
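For concreteness, the following is a minimal Python sketch of the three projection/proximity operations above together with (one admissible choice of) their diagonal Clarke-derivative elements; the diagonal of D is stored as a vector Ddiag, and the selections mirror those made later in (2.10)–(2.12).

```python
import numpy as np

def proj_box(u, a_l, a_u):
    """Projection onto K = [a_l, a_u] (componentwise clipping)."""
    return np.clip(u, a_l, a_u)

def prox_g1(u, zeta, Ddiag):
    """Soft-thresholding: prox of zeta*||D x||_1 with D = diag(Ddiag) >= 0."""
    return np.sign(u) * np.maximum(np.abs(u) - zeta * Ddiag, 0.0)

def prox_g2(u, zeta):
    """Prox of zeta * sum_i (u_i)_+ : shift the positive part, keep the rest."""
    return np.maximum(u - zeta, 0.0) + np.minimum(u, 0.0)

def B_delta(u, a_l, a_u):
    """Diagonal Clarke-derivative element of proj_box, as in (2.10)."""
    return ((u > a_l) & (u < a_u)).astype(float)

def B_g1(u, zeta, Ddiag):
    """Diagonal Clarke-derivative element of prox_g1, as in (2.11)."""
    return ((np.abs(u) > zeta * Ddiag) | (Ddiag == 0.0)).astype(float)

def B_g2(u, zeta):
    """Diagonal Clarke-derivative element of prox_g2, as in (2.12)."""
    return ((u <= 0.0) | (u >= zeta)).astype(float)
```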

We complete the derivation of the SSN, by defining a primal-dual merit function for globalizing the semismooth Newton method via a backtracking line-search scheme. Following [39], we employ the following merit function

\Theta_{k}(x,w,y)\coloneqq\left\|\widehat{\mathcal{M}}_{k}\left(x,w,y\right)\right\|^{2}. (2.9)

This function is often used for globalizing SSN schemes applied to nonsmooth equations of the form of (2.6). In Algorithm SSN, we outline a locally superlinearly convergent semismooth Newton method for the approximate solution of (2.3). The associated linear systems can be solved approximately (e.g. by Krylov subspace methods as in [39]), although in this work a suitable factorization scheme is utilized.

Algorithm SSN (semismooth Newton method)

Input: Let ϵk>0\epsilon_{k}>0, μ(0,12)\mu\in(0,\frac{1}{2}), δ(0,1)\delta\in(0,1), xk0(=xk)x_{k_{0}}(=x_{k}), wk0(=wk)w_{k_{0}}(=w_{k}), yk0(=yk)y_{k_{0}}(=y_{k}).

for (j=0,1,2,j=0,1,2,\ldotsdo
  Choose some MkjM_{k_{j}}, as in (2.7) and solve
M_{k_{j}}\begin{bmatrix}d_{x}\\ d_{w}\\ d_{y}\end{bmatrix}=-\widehat{\mathcal{M}}_{k}\left(x_{k_{j}},w_{k_{j}},y_{k_{j}}\right).
  (Line Search) Set αj=δmj\alpha_{j}=\delta^{m_{j}}, where mjm_{j} is the first non-negative integer for which:
Θk(x¯,w¯,y¯)(12μδmj)Θk(xkj,wkj,ykj)\Theta_{k}\left(\bar{x},\bar{w},\bar{y}\right)\leq\left(1-2\mu\delta^{m_{j}}\right)\Theta_{k}\left(x_{k_{j}},w_{k_{j}},y_{k_{j}}\right)
  where (x¯,w¯,y¯)=(xkj+δmjdx,wkj+δmjdw,ykj+δmjdy)\left(\bar{x},\bar{w},\bar{y}\right)=\left(x_{k_{j}}+\delta^{m_{j}}d_{x},w_{k_{j}}+\delta^{m_{j}}d_{w},y_{k_{j}}+\delta^{m_{j}}d_{y}\right).
  Set (xkj+1,wkj+1,ykj+1)=(x¯,w¯,y¯)\left(x_{k_{j+1}},w_{k_{j+1}},y_{k_{j+1}}\right)=\left(\bar{x},\bar{w},\bar{y}\right).
  if \left(\textnormal{dist}\left(0_{n+2l+m},\mathcal{M}_{k}\left(x_{k_{j+1}},w_{k_{j+1}},y_{k_{j+1}}\right)\right)\leq\epsilon_{k}\right) then
   return (xkj+1,wkj+1,ykj+1)(x_{k_{j+1}},w_{k_{j+1}},y_{k_{j+1}}).
  end if
end for

Local superlinear convergence of Algorithm SSN follows directly from [37, Theorem 3], since the equation in (2.6) is BD-regular (i.e. the Bouligand subdifferential at the optimum contains only nonsingular matrices) by construction, as it models “regularized” sub-problems arising from the proximal method of multipliers. On the other hand, if we assume that the directional derivative of (2.9) is continuous at the optimum, then we can mirror the analysis in [25] to show that Algorithm SSN is globally convergent. There is a wide literature on similar semismooth Newton schemes, and we refer the reader to [12, 25, 37, 40] and the references therein for more details.
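The following is a schematic sketch of one SSN sweep with the backtracking line search on (2.9). The callbacks residual and assemble_system, which evaluate the residual of (2.6) and build a matrix M_{k_j} as in (2.7), are hypothetical placeholders, and a dense solve stands in for the factorization strategy of Section 2.2.1.

```python
import numpy as np

def ssn(x, w, y, residual, assemble_system, eps, mu=0.25, delta=0.5,
        max_inner=20, alpha_min=1e-8):
    """Schematic Algorithm SSN.  residual(x, w, y) returns the vector of (2.6);
    assemble_system(x, w, y) returns a matrix M_{k_j} as in (2.7)."""
    for _ in range(max_inner):
        F = residual(x, w, y)
        if np.linalg.norm(F) <= eps:                       # proxy for criterion (2.3)
            break
        d = np.linalg.solve(assemble_system(x, w, y), -F)  # Newton direction
        dx, dw, dy = np.split(d, [x.size, x.size + w.size])
        # backtracking line search on Theta_k = ||residual||^2
        theta0, alpha = F @ F, 1.0
        while alpha > alpha_min:
            Fb = residual(x + alpha * dx, w + alpha * dw, y + alpha * dy)
            if Fb @ Fb <= (1.0 - 2.0 * mu * alpha) * theta0:
                break
            alpha *= delta
        x, w, y = x + alpha * dx, w + alpha * dw, y + alpha * dy
    return x, w, y
```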

2.2.1 The SSN linear systems

The major bottleneck of the previously presented inner-outer scheme is the solution of the associated linear systems given in (2.7). One could alter Algorithm SSN so that it does not require an exact solution. In turn this would allow for the utilization of preconditioned Krylov subspace solvers for the efficient solution of such systems (e.g. as in [39, Section 3]). In particular, any preconditioner derived in [21] can be utilized for the proposed solver. However, for simplicity of exposition, we employ a standard factorization approach. As will become evident in Section 4, the active-set nature of the proposed algorithm enables the use of factorization even for large-scale problems of interest, since most of the data are inactive at each inner-outer iteration kjk_{j}. Indeed, in the presence of multiple piecewise-linear terms in the objective, and assuming that the method is appropriately warm-started, one can ensure that most of the problem data are not incorporated when forming the active Hessian at each inner-outer iteration.

In what follows we derive the associated linear systems. Let k0k\geq 0 be an arbitrary iteration of Algorithm PMM, and j0j\geq 0 an arbitrary iteration of Algorithm SSN. Firstly, let us notice that any element BδxC(Π𝒦())B_{\delta}\in\partial_{x}^{C}\left(\Pi_{\mathcal{K}}\left(\cdot\right)\right) yields a Newton derivative (see [14, Theorem 14.8]). The same applies for any element Bg1xC(proxζkg1())B_{g_{1}}\in\partial_{x}^{C}\left(\textbf{prox}_{\zeta_{k}g_{1}}\left(\cdot\right)\right), and Bg2wC(proxζkg2())B_{g_{2}}\in\partial_{w}^{C}\left(\textbf{prox}_{\zeta_{k}g_{2}}\left(\cdot\right)\right). Thus, using (2.8) we can choose Bδ,kj,Bg1,kj,Bg2,kjB_{\delta,k_{j}},\ B_{g_{1},k_{j}},\ B_{g_{2},k_{j}} from the Bouligand subdifferential to improve computational efficiency, by reducing the active variables and constraint rows. To that end, we set Bδ,kjB_{\delta,k_{j}} as a diagonal matrix with

(Bδ,kj)(i,i){1,if(βk1zk+xkj)i(ali,aui),0,otherwise,,\left(B_{\delta,k_{j}}\right)_{(i,i)}\coloneqq\begin{cases}1,&\qquad\textnormal{if}\quad\left(\beta_{k}^{-1}z_{k}+x_{k_{j}}\right)_{i}\in\left(a_{l_{i}},a_{u_{i}}\right),\\ 0,&\qquad\textnormal{otherwise},\end{cases}, (2.10)

for all i{1,,n}i\in\{1,\ldots,n\}, Bg1,kjB_{g_{1},k_{j}} as the following diagonal matrix

(Bg1,kj)(i,i){1,if|(u^kj)i|>ζkD(i,i),orD(i,i)=0,0,otherwise,,\left(B_{g_{1},k_{j}}\right)_{(i,i)}\coloneqq\begin{cases}1,&\qquad\textnormal{if}\quad\left|\left(\widehat{u}_{k_{j}}\right)_{i}\right|>\zeta_{k}D_{(i,i)},\ \textnormal{or}\quad D_{(i,i)}=0,\\ 0,&\qquad\textnormal{otherwise},\end{cases}, (2.11)

for all i{1,,n}i\in\{1,\ldots,n\}, where u^kjxkjζk(rk(xkj,ykj))1:n\widehat{u}_{k_{j}}\coloneqq x_{k_{j}}-\zeta_{k}(r_{k}(x_{k_{j}},y_{k_{j}}))_{1:n}, and Bg2,kjB_{g_{2},k_{j}} as

(Bg2,kj)(i,i){1,if(wkj)iζk(ykj)i0,or(wkj)iζk(ykj)iζk,0,otherwise,\left(B_{g_{2},k_{j}}\right)_{(i,i)}\coloneqq\begin{cases}1,&\qquad\textnormal{if}\quad\left(w_{k_{j}}\right)_{i}-\zeta_{k}\left(y_{k_{j}}\right)_{i}\leq 0,\ \textnormal{or}\ \left(w_{k_{j}}\right)_{i}-\zeta_{k}\left(y_{k_{j}}\right)_{i}\geq\zeta_{k},\\ 0,&\qquad\textnormal{otherwise}\end{cases}, (2.12)

for all i{1,,l}i\in\{1,\ldots,l\}.
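The practical effect of these choices is easiest to see in code. Below is a minimal sketch (NumPy boolean masks; all names are hypothetical) of how the diagonal matrices translate into index sets and reduced problem data, illustrating the memory reduction rather than the full system assembly.

```python
import numpy as np

def active_sets(u_hat, zeta, Ddiag, w, y_1l):
    """Index sets induced by (2.11)-(2.12): B_g1 flags the active variables x,
    B_g2 flags the rows whose dual step is eliminated via (2.13)."""
    Bg1 = (np.abs(u_hat) > zeta * Ddiag) | (Ddiag == 0.0)
    Bg2 = (w - zeta * y_1l <= 0.0) | (w - zeta * y_1l >= zeta)
    return Bg1, ~Bg2                      # ~Bg2 = N_g2, the rows that remain

def reduce_data(C, A, Bg1, Ng2):
    """Reduced blocks used in (2.15): C_hat keeps rows N_g2 and columns B_g1
    of C, while A_hat keeps columns B_g1 of A."""
    return C[np.ix_(Ng2, Bg1)], A[:, Bg1]
```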

Given the aforementioned choices for the projection matrices, we can now explicitly eliminate certain variables from the system in (2.7), in order to obtain a saddle-point system. To that end, from the second block-equation in (2.7), we have

(IlBg2,kj)dw+ζkBg2,kj(dy)1:l=(wkjproxζkg2(wkjζk(ykj)1:l)).\left(I_{l}-B_{g_{2},k_{j}}\right)d_{w}+\zeta_{k}B_{g_{2},k_{j}}(d_{y})_{1:l}=-\left(w_{k_{j}}-\textbf{prox}_{\zeta_{k}g_{2}}\left(w_{k_{j}}-\zeta_{k}(y_{k_{j}})_{1:l}\right)\right).

Let g2,kj{i{1,,l}:(Bg2,kj)(i,i)=1}\mathcal{B}_{g_{2},k_{j}}\coloneqq\left\{i\in\{1,\ldots,l\}\colon\left(B_{g_{2},k_{j}}\right)_{(i,i)}=1\right\}. Then, we obtain

(dy)g2,kj=ζk1(wkjproxζkg2(wkjζk(ykj)1:l))g2,kj,(d_{y})_{\mathcal{B}_{g_{2},k_{j}}}=-\zeta_{k}^{-1}\left(w_{k_{j}}-\textbf{prox}_{\zeta_{k}g_{2}}\left(w_{k_{j}}-\zeta_{k}(y_{k_{j}})_{1:l}\right)\right)_{\mathcal{B}_{g_{2},k_{j}}}, (2.13)

where the right-hand side can be evaluated easily. Letting 𝒩g2,kj{1,,l}g2,kj\mathcal{N}_{g_{2},k_{j}}\coloneqq\{1,\ldots,l\}\setminus\mathcal{B}_{g_{2},k_{j}}, we have

(dw)𝒩g2,kj=(wkj)𝒩g2,kj.(d_{w})_{\mathcal{N}_{g_{2},k_{j}}}=-(w_{k_{j}})_{\mathcal{N}_{g_{2},k_{j}}}. (2.14)

On the other hand, from the third block-equation of (2.7), we observe that

(dw)g2,kj=(wkjdC(xkj+dx)βk1(ykj+dyyk)1:l)g2,kj,(d_{w})_{\mathcal{B}_{g_{2},k_{j}}}=-\left(w_{k_{j}}-d-C\left(x_{k_{j}}+d_{x}\right)-\beta_{k}^{-1}\left(y_{k_{j}}+d_{y}-y_{k}\right)_{1:l}\right)_{\mathcal{B}_{g_{2},k_{j}}},

which can be computed after solving the reduced system. We define the following index sets

g1,kj{i{1,,n}:(Bg1,kj)(i,i)=1},𝒩g1,kj{1,,n}g1,kj.\mathcal{B}_{g_{1},k_{j}}\coloneqq\left\{i\in\{1,\ldots,n\}\colon\left(B_{g_{1},k_{j}}\right)_{(i,i)}=1\right\},\qquad\mathcal{N}_{g_{1},k_{j}}\coloneqq\{1,\ldots,n\}\setminus\mathcal{B}_{g_{1},k_{j}}.

From the first block equation of (2.7), we obtain

(dx)𝒩g1,kj=(xkjproxζkg1(xkjζk(rβk,ρk(xkj,ykj))1:n))𝒩g1,kj.(d_{x})_{\mathcal{N}_{g_{1},k_{j}}}=-\left(x_{k_{j}}-\textbf{prox}_{\zeta_{k}g_{1}}\left(x_{k_{j}}-\zeta_{k}\left(r_{\beta_{k},\rho_{k}}(x_{k_{j}},y_{k_{j}})\right)_{1:n}\right)\right)_{\mathcal{N}_{g_{1},k_{j}}}.

After a straightforward (if tedious) calculation, we can eliminate (d_{y})_{\mathcal{B}_{g_{2},k_{j}}}, d_{w}, and (d_{x})_{\mathcal{N}_{g_{1},k_{j}}} from system (2.7). This results in the following saddle-point system:

[(ζk1Hkj)(g1,kj,g1,kj)[C^A^][C^A^]βk1Im+|𝒩g2,kj|]M^kj[dx,g1,kj[(dy)𝒩g2,kj(dy)(l+1:l+m)]]=r^kj,\underbrace{\begin{bmatrix}-\left(\zeta_{k}^{-1}H_{k_{j}}\right)_{\left(\mathcal{B}_{g_{1},k_{j}},\mathcal{B}_{g_{1},k_{j}}\right)}&\begin{bmatrix}\widehat{C}\\ \widehat{A}\end{bmatrix}^{\top}\\ \begin{bmatrix}\widehat{C}\\ \widehat{A}\end{bmatrix}&\beta_{k}^{-1}I_{m+\lvert\mathcal{N}_{g_{2},k_{j}}\rvert}\end{bmatrix}}_{\widehat{M}_{k_{j}}}\begin{bmatrix}d_{x,\mathcal{B}_{g_{1},k_{j}}}\\ \begin{bmatrix}\left(d_{y}\right)_{\mathcal{N}_{g_{2},k_{j}}}\\ (d_{y})_{(l+1:l+m)}\end{bmatrix}\end{bmatrix}=\widehat{r}_{k_{j}}, (2.15)

where C^C(𝒩g2,kj,g1,kj),A^Ag1,kj\widehat{C}\coloneqq C_{\left({\mathcal{N}_{g_{2},k_{j}}},\mathcal{B}_{g_{1},k_{j}}\right)},\ \widehat{A}\coloneqq A_{\mathcal{B}_{g_{1},k_{j}}}, and

r^kj[ζk1(M^k(xkj,wkj,ykj))g1,kj((M^k(xkj,wkj,ykj))(n+l+1:n+2l))𝒩g2,kj(M^k(xkj,wkj,ykj))(n+2l+1:n+2l+m)]+[Q(g1,kj,𝒩g1,kj)dx,𝒩g1,kj+(C(g2,kj,g1,kj))(dy)g2,kjC(𝒩g2,kj,𝒩g1,kj)(dx)𝒩g1,kj+(dw)𝒩g2,kjA𝒩g1,kjdx,𝒩g1,kj].\begin{split}\widehat{r}_{k_{j}}\coloneqq&\ \begin{bmatrix}\zeta_{k}^{-1}\left(\widehat{M}_{k}(x_{k_{j}},w_{k_{j}},y_{k_{j}})\right)_{\mathcal{B}_{g_{1},k_{j}}}\\ \left(\left(-\widehat{M}_{k}(x_{k_{j}},w_{k_{j}},y_{k_{j}})\right)_{(n+l+1:n+2l)}\right)_{\mathcal{N}_{g_{2},k_{j}}}\\ \left(-\widehat{M}_{k}(x_{k_{j}},w_{k_{j}},y_{k_{j}})\right)_{(n+2l+1:n+2l+m)}\end{bmatrix}\\ &\qquad+\begin{bmatrix}Q_{\left(\mathcal{B}_{g_{1},k_{j}},\mathcal{N}_{g_{1},k_{j}}\right)}d_{x,\mathcal{N}_{g_{1},k_{j}}}+\left(C_{\left({\mathcal{B}_{g_{2},k_{j}}},\mathcal{B}_{g_{1},k_{j}}\right)}\right)^{\top}(d_{y})_{\mathcal{B}_{g_{2},k_{j}}}\\ -C_{\left(\mathcal{N}_{g_{2},k_{j}},\mathcal{N}_{g_{1},k_{j}}\right)}(d_{x})_{\mathcal{N}_{g_{1},k_{j}}}+(d_{w})_{\mathcal{N}_{g_{2},k_{j}}}\\ -A_{\mathcal{N}_{g_{1},k_{j}}}d_{x,\mathcal{N}_{g_{1},k_{j}}}\end{bmatrix}.\end{split}

Notice that the coefficient matrix of (2.15) is symmetric and quasi-definite (see [50]). As such, it is strongly factorizable, that is, each symmetric permutation of it admits an LΔLL\Delta L^{\top} decomposition, with Δ\Delta diagonal. Alternatively, one could eliminate further variables in order to obtain a positive definite linear system. This could be beneficial in certain applications, but is omitted here for the sake of generality. We note that C^\widehat{C} contains only a subset of the columns and rows of CC. Similarly, A^\widehat{A} contains only a subset of the columns of AA. As we will verify in practice, relatively close to the solution of (P) the active-set matrices C^\widehat{C}, A^\widehat{A}, can be significantly smaller than CC and AA, respectively, allowing the proposed approach to solve large-scale instances, without requiring excessive memory. Finally, we note that if we had included a proximal term for the auxiliary variables ww (i.e. if we were to directly apply the algorithm given in [39]), then C^\widehat{C} would only contain a subset of the columns of CC but all of its rows (which, in general, are expected to be much more problematic, since it is often the case that lnl\gg n).
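For illustration, a small sketch assembling and solving the reduced system (2.15) is given below; a dense numpy.linalg.solve stands in for the LΔLᵀ factorization of the quasi-definite matrix that would actually be used, and H_BB denotes the (ℬ_{g1,kj}, ℬ_{g1,kj}) block of H_{k_j}.

```python
import numpy as np

def solve_reduced_system(H_BB, C_hat, A_hat, beta, zeta, rhs):
    """Assemble the coefficient matrix of (2.15) and solve for the reduced
    direction.  Returns (d_x)_{B_g1} and the kept dual components of d_y."""
    G = np.vstack([C_hat, A_hat])                      # reduced constraint block
    p, q = H_BB.shape[0], G.shape[0]
    K = np.block([[-H_BB / zeta,                G.T],
                  [G,            np.eye(q) / beta]])   # symmetric quasi-definite
    sol = np.linalg.solve(K, rhs)                      # placeholder for L*Delta*L^T
    return sol[:p], sol[p:]
```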

2.3 Extension to arbitrary piecewise-linear terms

Before closing this section we would like to notice that there are certain piecewise-linear terms that are not captured by the model in (P). In particular, given a set of KK pairs (Cr,dr)l×n×l(C_{r},d_{r})\in\mathbb{R}^{l\times n}\times\mathbb{R}^{l} (with K3K\geq 3), where r{1,,K}r\in\{1,\ldots,K\}, we could consider problems of the following form

\begin{split}\underset{x\in\mathbb{R}^{n}}{\text{min}}&\quad\left\{c^{\top}x+\frac{1}{2}x^{\top}Qx+\sum_{i=1}^{l}\max_{r\in\{1,\ldots,K\}}\{\left(C_{r}x+d_{r}\right)_{i}\}+\|Dx\|_{1}+\delta_{\mathcal{K}}(x)\right\},\\ \textnormal{s.t.}&\quad\ Ax=b.\end{split}

In order to do so, we would have to introduce KK auxiliary vectors wrlw_{r}\in\mathbb{R}^{l}, r{1,,K}r\in\{1,\ldots,K\}, and reformulate the problem in the following form

minxn{cx+12xQx+i=1lmaxr{1,,K}{(wr)i}+Dx1+δ𝒦(x)},s.t.Crx+drwr=0l,r{1,,K},Ax=b.\begin{split}\underset{x\in\mathbb{R}^{n}}{\text{min}}&\quad\left\{c^{\top}x+\frac{1}{2}x^{\top}Qx+\sum_{i=1}^{l}\max_{r\in\{1,\ldots,K\}}\{(w_{r})_{i}\}+\|Dx\|_{1}+\delta_{\mathcal{K}}(x)\right\},\\ \textnormal{s.t.}&\quad\ C_{r}x+d_{r}-w_{r}=0_{l},\qquad r\in\{1,\ldots,K\},\\ &\quad\ Ax=b.\end{split}

Subsequently, in the derivation of the semismooth Newton method, we would have to evaluate the proximity operator of g~(u)maxi{1,,K}(ui)\tilde{g}(u)\coloneqq\max_{i\in\{1,\ldots,K\}}(u_{i}), uKu\in\mathbb{R}^{K}, which admits a closed-form solution:

\left(\textbf{prox}_{\zeta\tilde{g}}(u)\right)_{i}=\min\{u_{i},s\},\quad\textnormal{where }s\in\mathbb{R}\textnormal{ is such that }\sum_{i=1}^{K}\left(u_{i}-s\right)_{+}=\zeta.

Then, the Clarke derivative of the previous could be computed by utilizing [14, Theorem 14.7]. This was not considered in this work for brevity of presentation, and is left as a future research direction. Indeed, this extension paves the way for the generalization of the proposed active-set scheme to a plethora of robust optimization problems that appear in practice (e.g. see [4]), as well as delivering an alternative to standard cutting-plane methods (e.g. see [29]), decomposition methods (e.g. see [3, 5, 18, 46]), or specialized interior point (e.g. see [22]) approaches appearing in the literature.
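A minimal sketch of this proximity operator is given below; the threshold s is found by the same sorting procedure used for projections onto the simplex, and the small example at the end only verifies the threshold condition Σᵢ(uᵢ − s)₊ = ζ.

```python
import numpy as np

def prox_max(u, zeta):
    """Prox of zeta * max_i(u_i): returns min(u, s), with s the unique scalar
    satisfying sum_i (u_i - s)_+ = zeta (computed by sorting)."""
    v = np.sort(u)[::-1]                                     # descending order
    cand = (np.cumsum(v) - zeta) / np.arange(1, u.size + 1)  # candidate thresholds
    rho = np.nonzero(v - cand > 0.0)[0][-1]                  # last admissible index
    s = cand[rho]
    return np.minimum(u, s), s

u, zeta = np.array([0.3, -1.2, 2.0, 0.9]), 0.5
x, s = prox_max(u, zeta)
assert abs(np.maximum(u - s, 0.0).sum() - zeta) < 1e-12      # threshold condition
```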

3 Warm-starting

Following the developments in [35, 39], we would like to find a starting point for Algorithm PMM that is relatively close to a primal-dual solution. This is crucial since, on the one hand, we can then expect to observe early linear convergence of Algorithm PMM, while on the other hand, we can expect to be close to identifying the correct active sets, which in turn implies that the memory requirements of Algorithm SSN are significantly reduced. To that end, we employ a proximal alternating direction method of multipliers (pADMM; e.g. see [20]). We reformulate (P) by introducing an artificial variable u\in\mathbb{R}^{n+l}, as

\begin{split}\underset{(x,w,u)\ \in\ \mathbb{R}^{n}\times\mathbb{R}^{l}\times\mathbb{R}^{n+l}}{\text{min}}&\left\{c^{\top}x+\frac{1}{2}x^{\top}Qx+\sum_{i=n+1}^{n+l}\left(u_{i}\right)_{+}+\|D(u_{1:n})\|_{1}+\delta_{\mathcal{K}}(u_{1:n})\right\},\\ \textnormal{s.t.}\qquad\ \ &\ \underbrace{\begin{bmatrix}C&-I_{l}&0_{l,l+n}\\ A&0_{m,l}&0_{m,l+n}\\ \multicolumn{2}{c}{-I_{l+n}}&I_{l+n}\end{bmatrix}}_{M_{r}}\begin{bmatrix}x\\ w\\ u\end{bmatrix}=\begin{bmatrix}-d\\ b\\ 0_{l+n}\end{bmatrix}.\\ \end{split} (P’)

Given a penalty σ>0\sigma>0, we associate the following augmented Lagrangian to (P’)

^σ(x,w,u,y)cx+12xQx+i=n+1n+l(ui)++D(u1:n)1+δ𝒦(u1:n)y(Mr[xwu][db0l+n])+σ2Mr[xwu][db0l+n]2,\begin{split}\widehat{\mathcal{L}}_{\sigma}(x,w,u,y)\coloneqq&\ c^{\top}x+\frac{1}{2}x^{\top}Qx+\sum_{i=n+1}^{n+l}\left(u_{i}\right)_{+}+\|D(u_{1:n})\|_{1}+\delta_{\mathcal{K}}(u_{1:n})\\ &\ -y^{\top}\left(M_{r}\begin{bmatrix}x\\ w\\ u\end{bmatrix}-\begin{bmatrix}-d\\ b\\ 0_{l+n}\end{bmatrix}\right)+\frac{\sigma}{2}\left\|M_{r}\begin{bmatrix}x\\ w\\ u\end{bmatrix}-\begin{bmatrix}-d\\ b\\ 0_{l+n}\end{bmatrix}\right\|^{2},\end{split}

where ym+n+2ly\in\mathbb{R}^{m+n+2l} denotes the dual multipliers associated to the linear equality constraints of (P’). Let an arbitrary positive definite matrix R(n+l)×(n+l)R\in\mathbb{R}^{(n+l)\times(n+l)} be given, and assume the notation vR2=vRv\|v\|^{2}_{R}=v^{\top}Rv, for any vn+lv\in\mathbb{R}^{n+l}. Also, as in Section 2.1, we denote g(x,w)=Dx1+i=1l(wi)+g(x,w)=\|Dx\|_{1}+\sum_{i=1}^{l}\left(w_{i}\right)_{+}. Algorithm pADMM describes a proximal ADMM for the approximate solution of (P’).

Algorithm pADMM (proximal ADMM)

Input: σ>0\sigma>0, R0R\succ 0, γ(0,1+52)\gamma\in\left(0,\frac{1+\sqrt{5}}{2}\right), (x0,w0,u0,y0)2n+3l+m(x_{0},w_{0},u_{0},y_{0})\in\mathbb{R}^{2n+3l+m}.

for (k=0,1,2,k=0,1,2,\ldotsdo
uk+1\displaystyle u_{k+1} =argmin𝑢{^σ(xk,wk,u,yk)}Π𝒦×l(proxσ1g([xkwk]+σ1(yk)(m+l+1:m+n+2l))).\displaystyle=\underset{u}{\arg\min}\left\{\widehat{\mathcal{L}}_{\sigma}\left(x_{k},w_{k},u,y_{k}\right)\right\}\equiv\Pi_{\mathcal{K}\times\mathbb{R}^{l}}\left(\textbf{prox}_{\sigma^{-1}g}\left(\begin{bmatrix}x_{k}\\ w_{k}\end{bmatrix}+\sigma^{-1}(y_{k})_{(m+l+1:m+n+2l)}\right)\right).
[xk+1wk+1]\displaystyle\begin{bmatrix}x_{k+1}\\ w_{k+1}\end{bmatrix} =argminx,w{^σ(x,w,uk+1,yk)+12[xxkwwk]R2}.\displaystyle=\underset{x,\ w}{\arg\min}\left\{\widehat{\mathcal{L}}_{\sigma}\left(x,w,u_{k+1},y_{k}\right)+\frac{1}{2}\left\|\begin{bmatrix}x-x_{k}\\ w-w_{k}\end{bmatrix}\right\|_{R}^{2}\right\}.
\displaystyle y_{k+1} =y_{k}-\gamma\sigma\left(M_{r}\begin{bmatrix}x_{k+1}\\ w_{k+1}\\ u_{k+1}\end{bmatrix}-\begin{bmatrix}-d\\ b\\ 0_{l+n}\end{bmatrix}\right).
end for

Let us notice that under certain standard assumptions on (P’), Algorithm pADMM converges globally, potentially at a linear rate (see [20]). We can employ the regularization matrix RR as a means of ensuring memory efficiency of Algorithm pADMM. For example, we can recover the prox-linear ADMM [20, Section 1.1], where given some sufficiently large constant σ^>0\hat{\sigma}>0, one defines

Rσ^In+lσ[CC+AA+σ1Off(Q)CC0l,l]0.R\coloneqq\hat{\sigma}I_{n+l}-\sigma\begin{bmatrix}C^{\top}C+A^{\top}A+\sigma^{-1}\textnormal{Off}(Q)&-C^{\top}\\ -C&0_{l,l}\end{bmatrix}\succ 0.

Given this choice of RR, the second step of Algorithm pADMM consists of only matrix-vector multiplications with A,CA,\ C and QQ, and thus no (non-diagonal) matrix inversion is required. If memory is not an issue, one could use a positive-definite diagonal regularizer RR, yielding a standard regularized ADMM. In our implementation, the user can specify which of the two strategies should be employed. In either case, the first step of Algorithm pADMM is trivial to solve, since we know that the proximity operator of g()g(\cdot) can be computed analytically as in Section 2.2.
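A small sketch of one way to build such an R is given below; the choice of \hat{\sigma} via an eigenvalue bound with a safety margin is an illustrative assumption, not the rule used in the implementation.

```python
import numpy as np

def prox_linear_R(C, A, Q, sigma, margin=1e-3):
    """Construct the prox-linear regularizer R of Section 3 (a sketch).
    Off(Q) is the off-diagonal part of Q; sigma_hat is picked just above
    sigma*lambda_max of the subtracted block so that R is positive definite."""
    n, l = A.shape[1], C.shape[0]
    OffQ = Q - np.diag(np.diag(Q))
    S = np.block([[C.T @ C + A.T @ A + OffQ / sigma, -C.T],
                  [-C,                np.zeros((l, l))]])
    sigma_hat = sigma * np.max(np.abs(np.linalg.eigvalsh(S))) + margin
    R = sigma_hat * np.eye(n + l) - sigma * S
    assert np.all(np.linalg.eigvalsh(R) > 0.0)   # R is positive definite
    return R
```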

Finally, once an approximate solution (x~,w~,u~,y~)(\tilde{x},\tilde{w},\tilde{u},\tilde{y}) is computed, we set the starting point of Algorithm PMM as (x0,w0,y0,z0)=(x~,w~,(y~)(1:m+l),z~)(x_{0},w_{0},y_{0},z_{0})=\left(\tilde{x},\tilde{w},\left(\tilde{y}\right)_{(1:m+l)},\tilde{z}\right), where

z~=(y~)(m+l+1:m+l+n)Πg1(u~1:n)((y~)(m+l+1:m+l+n)).\tilde{z}=\left(\tilde{y}\right)_{(m+l+1:m+l+n)}-\Pi_{\partial g_{1}(\tilde{u}_{1:n})}\left(\left(\tilde{y}\right)_{(m+l+1:m+l+n)}\right).

Indeed, an optimal primal-dual solution of (P’) satisfies

(y~)(m+l+1:m+2l+n)g(u~)+δ𝒦×l(u~),\left(\tilde{y}^{*}\right)_{(m+l+1:m+2l+n)}\in\partial g\left(\tilde{u}^{*}\right)+\partial\delta_{\mathcal{K}\times\mathbb{R}^{l}}\left(\tilde{u}^{*}\right),

thus the characterization of z~\tilde{z} in Algorithm PMM follows from the Appendix, where one can also find the termination criteria of Algorithm pADMM (see (A.2)).
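A componentwise sketch of the mapping defining \tilde{z} above is given next; since g_{1}(x)=\|Dx\|_{1}, the projection onto \partial g_{1}(\tilde{u}_{1:n}) reduces to a sign/clipping operation (Ddiag denotes the diagonal of D, and all names are illustrative).

```python
import numpy as np

def warm_start_z(u_1n, y_tail, Ddiag):
    """z = y_tail - Proj_{dg1(u)}(y_tail), where (dg1(u))_i is the singleton
    {D_ii * sign(u_i)} if u_i != 0, and the interval [-D_ii, D_ii] otherwise."""
    proj = np.where(u_1n != 0.0,
                    Ddiag * np.sign(u_1n),           # singleton case
                    np.clip(y_tail, -Ddiag, Ddiag))  # interval case
    return y_tail - proj
```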

4 Applications and numerical results

In this section we present various applications that can be modeled by problem (P), focusing on portfolio optimization, quantile regression, and binary classification. In particular, we first discuss (and numerically demonstrate) the effectiveness of the approach for the solution of single-period mean-risk portfolio optimization problems, where risk is measured via the conditional value at risk or the mean absolute semi-deviation. Subsequently, we apply the proposed scheme for the solution of quantile regression problems, demonstrating its scalability. Finally, we apply the active-set scheme for the solution of binary classification problems via linear support vector machines on certain large-scale datasets.

Implementation details

Before proceeding to the numerical results, we mention certain implementation details of the proposed algorithm. The implementation follows closely the developments in [39], and can be found on GitHub (https://github.com/spougkakiotis/Active_set_method_for_CQP_piecewise_LP). The code is written in MATLAB, and the experiments are run on a PC with a 2.2GHz Intel Core i7-8750H processor (hexa-core), 16GB of RAM, using the Windows 10 operating system.

We run Algorithm pADMM (warm-start) for at most 100 iterations (or until a 3-digit accurate solution is found). The user can choose whether the warm-starting scheme should be matrix-free or not. In the presented experiments, the matrix-free scheme was only used for the largest quantile regression and classification problems, for which standard ADMM crashed due to excessive memory requirements. Then, starting with \beta_{0}=10, \rho_{0}=50, we run Algorithms PMM and SSN. Following [39], when solving the PMM sub-problems using Algorithm SSN we use a predictor-corrector-like heuristic in which the first iteration is accepted without line-search and then line-search is activated for subsequent iterations. Algorithm PMM is allowed to run for at most 200 iterations, while Algorithm SSN is stopped after 20 inner iterations. An iterate is accepted as optimal if the conditions given in (A.1) are satisfied for the tolerance specified by the user. Most other implementation details follow directly from the developments in Sections 2.1–2.2. We refer the reader to the implementation on GitHub for additional details.

In the presented experiments we compare the proposed approach against the interior point solver given in [38], the implementation of which can be found on GitHub (https://github.com/spougkakiotis/IP_PMM), as well as the ADMM-based OSQP solver given in [49], the implementation of which can also be found on GitHub (https://github.com/osqp/osqp-matlab). We note that most problems considered herein were especially challenging for OSQP (mostly due to requesting highly accurate solutions), and for that reason we allow it to run for 50,000 iterations (unlike its default iteration threshold, which is 4000).

4.1 Portfolio optimization

We first consider the mean-risk portfolio selection problem (originally proposed in [36]), where we minimize some convex risk measure, while keeping the expected return of the portfolio above some desirable level. A variety of models for this problem have been intensely analyzed and solved in the literature (e.g. see [1, 31, 44]). The departure from the variance as a measure of risk often allows for great flexibility in the decision making of investors, enabling them to follow potential regulations as well as to better control the risk associated with an “optimal” portfolio. Optimality conditions and existence of solutions for several general deviation measures have been characterized in [45]. The comparison of portfolios obtained by minimizing different risk measures has also been considered (e.g. see [41, 42, 48]).

The method presented in this paper is general and not related to a specific model choice. Indeed, we would like to showcase the efficiency of our approach for obtaining accurate solutions to portfolio optimization problems with various risk measures of practical interest. Thus, we focus on the solution of a standard portfolio selection model that has been used in the literature. All numerical results are obtained on real-world datasets. As a result the problems are of medium-scale. Nonetheless, even for such medium-scale problems, we will be able to demonstrate the efficiency of the proposed approach, when compared to the efficient interior point method employed in the literature [19] for similar problems. We also note that both second-order solvers (i.e. IPM and active-set) significantly outperform OSQP on these instances, however it was included in the comparison for completeness. Some large-scale instances will be tackled in the following subsections, where the method will be applied to quantile regression and binary classification instances.

Let xnx\in\mathbb{R}^{n} represent a portfolio of nn financial instruments, such that

xi[ali,aui],withali1,aui1,for alli=1,,n,andi=1nxi=1.x_{i}\in\left[a_{l_{i}},a_{u_{i}}\right],\quad\textnormal{with}\ a_{l_{i}}\geq-1,\ a_{u_{i}}\leq 1,\ \textnormal{for all}\ i=1,\ldots,n,\quad\textnormal{and}\ \sum_{i=1}^{n}x_{i}=1.

This requirement indicates that short positions in each stock are restricted to some percentage (a_{l_{i}}\%) of the available wealth (assuming a_{l_{i}}<0), and no more than a_{u_{i}}\% of the total wealth can be invested in instrument i. Let \bm{\xi}\in\mathbb{R}^{n} denote a random vector, the i-th entry of which represents the random return of the i-th instrument. Then, the random loss (i.e. the negative of the random return) of a given portfolio x is given by f(x,\bm{\xi})\coloneqq-x^{\top}\bm{\xi}. In this paper we assume that \bm{\xi} follows some continuous distribution p(\bm{\xi}), as well as that there is a one-to-one correspondence between percentage return and monetary value (as in [44, Section 3]). Additionally, given some expected benchmark return r (e.g. the market index), we only consider portfolios that yield an expected return at or above this benchmark, i.e. \mathbb{E}[-f(x,\bm{\xi})]\geq r.

Finally, given the previously stated constraints, we would like to minimize some convex risk measure ϱ()\varrho(\cdot) of interest. However, in order to make sure that the problem is well-posed, while the transaction costs are not excessive, we include an 1\ell_{1} term in the objective. This is a well-known modeling choice that yields sparse portfolios, and thus regulates the transaction costs in the single-period portfolio setting (e.g. see [1, 16, 31]). Additionally, with an appropriate tuning of the 1\ell_{1} regularization parameter τ>0\tau>0, one could also control the amount of short positions (see [16]). It should be mentioned here that in the multi-period setting such an 1\ell_{1} term does not guarantee a reduction in the transaction costs (but mostly in the holding costs), and an additional total variation term should also be added in the objective (see [15, 19]). This is omitted here, since we focus on the single-period case, but the model in (P) could easily incorporate such an additional term (see Remark 1). By putting everything together, the model reads as

minxϱ(f(x,𝝃))+τx1,s.t.i=1nxi=1,𝔼[f(x,𝝃)]r,xi[ali,aui],i=1,,n.\begin{split}\min_{x}&\ \varrho\left(f(x,\bm{\xi})\right)+\tau\|x\|_{1},\\ \textnormal{s.t.}&\ \sum_{i=1}^{n}x_{i}=1,\\ &\ \mathbb{E}\left[-f(x,\bm{\xi})\right]\geq r,\\ &\ x_{i}\in\left[a_{l_{i}},a_{u_{i}}\right],\qquad i=1,\ldots,n.\end{split} (4.1)

There are several methods for solving such stochastic problems; let us mention two important variants: the parametric approach (e.g. as in [1, 44]), where one assumes that the returns follow some known distribution which is subsequently sampled to yield finite-dimensional optimization problems, and the sampling approach (e.g. as in [31]), where one obtains a finite number of samples (without assuming a specific distribution). Such samples are often obtained from historical observations, and this approach is also followed in this paper. It is well-known that historical data cannot fully predict the future (see [31]), however it is a widely-used practice. The reader is referred to [27] for an extensive discussion on probabilistic models for portfolio selection problems.

Additional soft or hard constraints can be included when solving a portfolio selection problem. Such constraints can either be incorporated directly via the use of a model (e.g. see [7, Section 2]) as hard constraints or by including appropriate penalty terms in the objective (soft constraints). It is important to note that the model given in (P) is quite general and as a result has great expressive power, allowing one to incorporate various important risk measures (and their combinations), as well as modeling constraints of interest.

Real-world datasets:

In what follows, we solve two different instances of problem (4.1). In particular, we consider two potential risk measures; the conditional value at risk (e.g. see [44]), as well as the mean absolute semi-deviation (e.g. see [47, Section 6.2.2.], noting that this is in fact equivalent to the mean absolute deviation originally proposed in [30]). In each of the above cases problem (4.1) has the form of (P) and thus Algorithm PMM can directly be applied.

We showcase the effectiveness of the proposed approach on 6 real datasets taken from [9]. Each dataset contains time series for weekly asset returns and market indexes for different major stock markets, namely, DowJones, NASDAQ100, FTSE100, SP500, NASDAQComp, and FF49Industries. In the first 5 markets the authors in [9] provide the market indexes, while for the last dataset the uniform allocation strategy is considered as a benchmark. Additional information on the datasets is collected in Table 1. We note that stocks with less than 10 years of observations have been disregarded.

Table 1: Portfolio optimization datasets.
Name # of assets # of data points Timeline
DowJones 28 1363 Feb. 1990–Apr. 2016
NASDAQ100 82 596 Nov. 2004–Apr. 2016
FTSE100 83 717 Jul. 2002–Apr. 2016
FF49Industries 49 2325 Jul. 1969–Jul. 2015
SP500 442 595 Nov. 2004–Apr. 2016
NASDAQComp 1203 685 Feb. 2003–Apr. 2016

4.1.1 Conditional value at risk

First, we consider portfolio optimization problems that seek a solution minimizing the conditional value at risk, a measure which is known to be coherent (see [2] for a definition of coherent risk measures). In particular, using the notation introduced earlier, we consider the following optimization problem

minxn{CVaRα(f(x,𝝃))+τx1+δ𝒦(x)},s.t.Ax=b,\underset{x\in\mathbb{R}^{n}}{\text{min}}\left\{\textnormal{CVaR}_{\alpha}\left(f(x,\bm{\xi})\right)+\tau\|x\|_{1}+\delta_{\mathcal{K}}(x)\right\},\qquad\textnormal{s.t.}\ Ax=b, (4.2)

where f(x,𝝃)f(x,\bm{\xi}) is the random cost function, Am×nA\in\mathbb{R}^{m\times n} models the linear constraint matrix of problem (4.1) (where an auxiliary variable has been introduced to transform the inequality constraint involving the expectation into an equality), and 𝒦[al,au]\mathcal{K}\coloneqq[a_{l},a_{u}]. In the above, 1α(0,1)1-\alpha\in(0,1) is the confidence level. It is well-known (see [44]) that, given a continuous random variable XX, the conditional value at risk can be computed as

CVaRα(X)mint{t+α1𝔼[(Xt)+]}.\textnormal{CVaR}_{\alpha}\left(X\right)\coloneqq\min_{t\in\mathbb{R}}\left\{t+\alpha^{-1}\mathbb{E}\left[\left(X-t\right)_{+}\right]\right\}.
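On a finite sample, the minimization over t in this formula can be carried out exactly: the objective is piecewise-linear and convex in t, so a minimizer can be found among the sample points, and (when lα is an integer) the optimal value equals the average of the lα largest losses. The following sketch is a minimal numerical check of this fact; the names are illustrative and not part of the solver.

```python
import numpy as np

def cvar_from_samples(losses, alpha):
    """Sample CVaR via the minimization formula above (illustrative sketch).

    Evaluates  t + (1/(l*alpha)) * sum_i (losses_i - t)_+  at every breakpoint
    t = losses_i (the objective is piecewise-linear and convex in t, so a
    minimizer lies at a breakpoint) and returns the smallest value.
    """
    losses = np.asarray(losses, dtype=float)
    l = losses.size
    values = [t + np.maximum(losses - t, 0.0).sum() / (l * alpha) for t in losses]
    return min(values)

# Quick check: when l*alpha is an integer, the value coincides with the
# average of the l*alpha largest losses.
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
print(cvar_from_samples(x, 0.05), np.sort(x)[-50:].mean())
```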

We can write problem (4.2) in the following equivalent form:

min(x,t)n×{t+1lαi=1l(ξixt)++Dx1+δ𝒦(x)},s.t.Ax=b,\underset{(x,t)\in\mathbb{R}^{n}\times\mathbb{R}}{\text{min}}\left\{t+\frac{1}{l\alpha}\sum_{i=1}^{l}\left(-\xi_{i}^{\top}x-t\right)_{+}+\|Dx\|_{1}+\delta_{\mathcal{K}}(x)\right\},\qquad\textnormal{s.t.}\ Ax=b,

where the expectation has been replaced by a sample average, since we assume the availability of a dataset {ξ1,,ξl}\{\xi_{1},\ldots,\xi_{l}\}. Introducing an auxiliary variable wlw\in\mathbb{R}^{l}, the previous problem can be re-written as

min(x,t,w)n××l{t+i=1l(wi)++Dx1+δ𝒦(x)},s.t.1lα(ξixt)wi=0,i=1,,l,Ax=b.\begin{split}\underset{(x,t,w)\in\mathbb{R}^{n}\times\mathbb{R}\times\mathbb{R}^{l}}{\text{min}}&\left\{t+\sum_{i=1}^{l}\left(w_{i}\right)_{+}+\|Dx\|_{1}+\delta_{\mathcal{K}}(x)\right\},\\ \textnormal{s.t.}\qquad\ \ &\ \frac{1}{l\alpha}\left(-\xi_{i}^{\top}x-t\right)-w_{i}=0,\quad i=1,\ldots,l,\\ &\ Ax=b.\end{split}
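The data of this reformulation can be assembled directly from the sample matrix. The sketch below is a minimal illustration (hypothetical names; dense matrices for clarity) of how the sampled CVaR model maps onto the data of (P), with decision vector (x, t).

```python
import numpy as np

def cvar_problem_data(Xi, alpha, tau):
    """Map the sampled CVaR model above onto the data of (P); illustrative only.

    Decision vector z = [x; t] in R^{n+1}.  Row i of C is
    (1/(l*alpha)) * [-xi_i^T, -1] and d = 0, so that C z + d - w = 0 recovers
    w_i = (1/(l*alpha)) * (-xi_i^T x - t).
    """
    l, n = Xi.shape
    C = np.hstack([-Xi, -np.ones((l, 1))]) / (l * alpha)
    d = np.zeros(l)
    c = np.zeros(n + 1)
    c[-1] = 1.0                                  # linear objective term: + t
    D = np.diag(np.r_[tau * np.ones(n), 0.0])    # l1 weight acts on x only
    return c, C, d, D
```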

We solve problem (4.2) using the proposed active-set method (AS), the interior point-proximal method of multipliers (IP-PMM) given in [38], and OSQP ([49]). We allow any short position, that is ali=1a_{l_{i}}=-1, and restrict investing more than 60%60\% of the available wealth in a single stock (i.e. aui=0.6a_{u_{i}}=0.6). We set tol=105\texttt{tol}=10^{-5} and τ=102\tau=10^{-2}, and run the three methods on each of the datasets described in Table 1 for varying confidence level. We report the confidence parameter α\alpha, the number of PMM, SSN, IP-PMM and OSQP iterations, the CPU time needed by each of the three schemes, as well as the number of factorizations used within the active-set (PMM-SSN) scheme. Indeed, it often happens that the active set is not altered from one iteration to the next, in which case the factorization does not need to be re-computed. The results are collected in Table 2. In all the numerical results that follow, the lowest running time exhibited by a solver, assuming it successfully converged, is presented in bold.

Table 2: CVaR portfolio selection: varying confidence level (tol = 10510^{-5}, τ=102\tau=10^{-2}).
Dataset 𝜶\bm{\alpha} Iterations Time (s)
PMM(SSN)[Fact.] IP–PMM OSQP AS IP–PMM OSQP
DowJones 0.05 35(169)[142] 18 32,575 0.65 0.71 10.94
0.10 36(166)[145] 21 42,625 0.75 0.83 13.61
0.15 36(144)[125] 13 43,100 0.67 0.53 14.45
NASDAQ100 0.05 31(156)[138] 19 30,575 0.66 0.60 7.92
0.10 33(168)[150] 16 34,700 0.65 0.50 8.81
0.15 33(163)[142] 15 40,500 0.62 0.57 10.46
FTSE100 0.05 32(179)[164] 17 21,825 0.77 0.90 10.96
0.10 34(171)[147] 19 35,825 0.80 0.99 17.70
0.15 35(179)[156] 11 36,550 0.93 0.59 18.24
FF49Industries 0.05 41(215)[186] 27 50,000^{\ddagger} 1.77 4.65 59.16^{\ddagger}
0.10 42(224)[183] 22 50,000^{\ddagger} 2.37 3.62 57.92^{\ddagger}
0.15 39(205)[157] 11 50,000^{\ddagger} 2.66 2.18 51.25^{\ddagger}
SP500 0.05 30(203)[193] 20 21,075 5.93 12.82 65.70
0.10 34(199)[187] 11 32,375 5.88 6.98 99.39
0.15 34(173)[160] 17 39,225 5.78 10.17 119.95
NASDAQComp 0.05 28(198)[192] 10 16,100 22.07 35.52 143.82
0.10 31(171)[169] 14 31,275 23.31 48.99 291.97
0.15 34(173)[167] 19 44,775 21.67 67.14 403.75
^{\ddagger} indicates that the solver reached the maximum number of iterations.

From Table 2 we observe that for the smaller instances (DowJones, NASDAQ100, FTSE100, FF49Industries) both second-order methods perform quite well and are comparable in terms of CPU time. Nonetheless, even in this case, the proposed active-set solver requires significantly less memory. Indeed, this can be seen from the fact that the two methods achieve similar times while the active-set scheme performs significantly more factorizations. On the other hand, for the larger instances (SP500, NASDAQComp) the proposed active-set scheme significantly outperforms the interior point method. This is mostly due to the efficiency gained from the lower memory requirements and cheaper factorizations of the active-set solver. We also observe that both methods are quite robust with respect to the confidence level and consistently outperform OSQP, which struggles to find a 5-digit-accurate solution. We note that OSQP could potentially be competitive for looser tolerances (e.g. for finding a 3-digit-accurate solution); however, the application under consideration dictates that an accurate solution is needed, since a small improvement in the portfolio output can translate into substantial profits in practice.

In order to observe the behaviour of the three solvers under a different 1\ell_{1} regularization value, we fix tol=105\texttt{tol}=10^{-5}, τ=101\tau=10^{-1}, and run them for varying confidence level. The results are collected in Table 3. Similar observations can be drawn from Table 3, while noting that both second-order approaches remain efficient for different values of the regularization parameter τ\tau. Notice, however, that one should be careful when choosing this regularization parameter, since otherwise the obtained portfolios could be meaningless. In light of this, we have chosen values for τ\tau that yield reasonable portfolios, with controlled short positions. Additionally, we should note that both second-order solvers were able to solve all problems to higher accuracy (e.g. up to 6 or 7 digits of accuracy); however, the differences in the obtained portfolios (in terms of positions and/or associated risk) were negligible, and hence such accuracies were not considered here.

Table 3: CVaR portfolio selection: varying confidence level (tol = 10510^{-5}, τ=101\tau=10^{-1}).
Dataset 𝜶\bm{\alpha} Iterations Time (s)
PMM(SSN)[Fact.] IP–PMM OSQP AS IP–PMM OSQP
DowJones 0.05 35(168)[144] 21 24,975 0.57 0.85 8.31
0.10 36(154)[133] 20 21,925 0.64 0.79 7.32
0.15 36(143)[128] 17 29,625 0.68 0.69 9.95
NASDAQ100 0.05 34(157)[124] 19 17,225 0.46 0.57 4.50
0.10 34(155)[133] 16 18,100 0.59 0.49 4.80
0.15 35(157)[132] 16 20,775 0.56 0.48 5.49
FTSE100 0.05 34(166)[137] 21 18,900 0.65 1.17 10.10
0.10 34(174)[157] 16 24,750 0.92 0.86 13.15
0.15 36(166)[147] 17 28,850 0.95 0.92 15.53
FF49Industries 0.05 43(215)[171] 19 48,000 1.63 3.00 59.33
0.10 41(184)[146] 22 40,475 1.57 3.53 48.48
0.15 41(160)[136] 19 35,525 1.45 3.06 40.91
SP500 0.05 32(163)[144] 20 21,425 4.68 12.26 66.29
0.10 33(165)[143] 18 22,100 5.86 12.43 73.56
0.15 35(169)[148] 19 32,225 5.61 11.51 98.25
NASDAQComp 0.05 32(190)[182] 26 17,200 15.55 90.78 157.83
0.10 34(180)[172] 15 13,925 22.75 51.81 127.26
0.15 34(170)[167] 18 14,675 24.69 62.40 137.42

4.1.2 Mean absolute semi-deviation

Next, we consider portfolio optimization problems that seek a solution minimizing the mean absolute semi-deviation, which is also coherent. We consider the following optimization problem

minxn{MAsD(f(x,𝝃))+τx1+δ𝒦(x)},s.t.Ax=b,\underset{x\in\mathbb{R}^{n}}{\text{min}}\left\{\textnormal{MAsD}\left(f(x,\bm{\xi})\right)+\tau\|x\|_{1}+\delta_{\mathcal{K}}(x)\right\},\qquad\textnormal{s.t.}\ Ax=b, (4.3)

where, given a continuous random variable XX, the associated risk is defined as

MAsD(X)𝔼[(X𝔼[X])+]12𝔼|X𝔼[X]|,\textnormal{MAsD}\left(X\right)\coloneqq\mathbb{E}\left[\left(X-\mathbb{E}[X]\right)_{+}\right]\equiv\frac{1}{2}\mathbb{E}\left|X-\mathbb{E}[X]\right|,

where the equivalence follows from [47, Proposition 6.1]. Given a dataset {ξ1,,ξl}\{\xi_{1},\ldots,\xi_{l}\}, problem (4.3) can be written as

min(x,w)n×l{i=1l(wi)++Dx1+δ𝒦(x)},s.t.1l(ξix+𝝁x)wi=0,i=1,,l,Ax=b,\begin{split}\underset{(x,w)\in\mathbb{R}^{n}\times\mathbb{R}^{l}}{\text{min}}&\left\{\sum_{i=1}^{l}\left(w_{i}\right)_{+}+\|Dx\|_{1}+\delta_{\mathcal{K}}(x)\right\},\\ \textnormal{s.t.}\quad\ \ &\ \frac{1}{l}\left(-\xi_{i}^{\top}x+\bm{\mu}^{\top}x\right)-w_{i}=0,\quad i=1,\ldots,l,\\ &\ Ax=b,\end{split}

where 𝝁1li=1lξi\bm{\mu}\coloneqq\frac{1}{l}\sum_{i=1}^{l}\xi_{i} is the sample mean of the returns. Note that this model is in the form of (P).
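As a minimal illustration (hypothetical names), the sketch below assembles the constraint rows of this model and numerically checks the identity MAsD(X) = E[(X − E[X])_+] = ½ E|X − E[X]|, which holds exactly on a sample because the empirical residuals sum to zero.

```python
import numpy as np

def masd_constraint_data(Xi):
    """Constraint rows of the sampled MAsD model above (illustrative sketch).

    Row i of C is (1/l) * (mu - xi_i)^T with mu the sample mean of the returns
    and d = 0, so that C x + d - w = 0 recovers
    w_i = (1/l) * (-xi_i^T x + mu^T x).
    """
    l, n = Xi.shape
    mu = Xi.mean(axis=0)
    C = (mu[None, :] - Xi) / l
    d = np.zeros(l)
    return C, d

# Check of MAsD(X) = E[(X - E[X])_+] = 0.5 * E|X - E[X]| on a sample:
rng = np.random.default_rng(1)
X = rng.standard_t(df=5, size=10_000)
print(np.maximum(X - X.mean(), 0.0).mean(), 0.5 * np.abs(X - X.mean()).mean())
```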

We fix tol=105\texttt{tol}=10^{-5} and run the three methods on the 6 datasets for two different sensible values of the regularization parameter τ\tau. The results are collected in Table 4.

Table 4: MAsD portfolio selection: varying regularization (tol = 10510^{-5}).
Dataset 𝝉\bm{\tau} Iterations Time (s)
PMM(SSN)[Fact.] IP–PMM OSQP AS IP–PMM OSQP
DowJones 0.01 51(142)[139] 10 32,350 1.05 0.56 10.47
0.05 52(142)[139] 13 14,050 0.72 0.54 4.60
NASDAQ100 0.01 47(167)[160] 11 18,800 0.67 0.26 5.02
0.05 48(176)[161] 10 40,550 0.67 0.27 10.72
FTSE100 0.01 46(153)[150] 10 18,325 0.79 0.27 9.56
0.05 46(153)[150] 14 47,125 0.74 0.35 23.29
FF49Industries 0.01 53(141)[136] 11 50,000^{\ddagger} 1.49 1.72 57.59^{\ddagger}
0.05 52(137)[132] 9 12,050 1.35 1.51 14.24
SP500 0.01 42(165)[163] 17 28,375 5.33 2.91 87.10
0.05 41(157)[153] 17 41,275 4.39 3.07 128.09
NASDAQComp 0.01 44(143)[133] 17 34,175 9.75 9.91 310.12
0.05 44(143)[132] 18 41,000 10.22 9.72 367.32
^{\ddagger} indicates that the solver reached the maximum number of iterations.

From Table 4 we observe that both second-order schemes are robust and comparable for all instances. In this case, the larger instances (SP500, NASDAQComp) were solved in comparable time by both solvers. Nevertheless, it is important to note that the active-set scheme is the better choice, since it has significantly lower memory requirements. Again, this can be readily observed from the fact that its factorizations are significantly cheaper than those of IP-PMM (since the active-set scheme performs significantly more factorizations and still achieves comparable overall times). Indeed, as will become clear when solving large-scale quantile regression and binary classification instances, the proposed active-set solver scales better than IP-PMM or OSQP. Additionally, we observe that the method is robust (i.e. it converges reliably to an accurate solution). Finally, as in the case of the CVaR instances, OSQP struggled to find accurate solutions within a reasonable number of iterations, making it a less efficient choice for such problems.

4.1.3 Extensions and alternative risk measures

Let us notice that the presented methodology can be easily extended to the multi-period case [7, 33]. Then, one could include an additional fused-lasso term in the objective function in order to ensure low transaction costs. It is important to note that in this case the 1\ell_{1} term added in the objective has the effect of reducing holding costs as well as short positions (e.g. see [15, 16, 19]). As noted in Remark 1, the additional fused-lasso term can be easily incorporated in the objective of (P). Multi-period portfolio selection problems are not considered here, however, one can expect a very similar behaviour to that observed in the single-period case.

Finally, we could easily deal with alternative risk measures, such as the variance (e.g. [36]), combination of CVaR and MAsD (e.g. see [6]), or approximations of other risk measures via multiple CVaR measures (see [26]). These were not included here for brevity of presentation.

4.2 Penalized quantile regression

Next we consider linear regression models of the following form

yi=β0+ξiβ+ϵi,i{1,,l}y_{i}=\beta_{0}+\xi_{i}^{\top}\beta+\epsilon_{i},\qquad i\in\{1,\ldots,l\}

where ξi\xi_{i} is a dd-dimensional vector of covariates, (β0,β)(\beta_{0},\beta) are the regression coefficients and ϵi\epsilon_{i} is some random error. A very popular problem in statistics is the estimation of the optimal coefficients, in the sense of minimizing a model of the following form:

min(β0,β)×d{1li=1l(yiβ0ξiβ)+λp(β)},\min_{(\beta_{0},\beta)\ \in\ \mathbb{R}\times\mathbb{R}^{d}}\left\{\frac{1}{l}\sum_{i=1}^{l}\ell\left(y_{i}-\beta_{0}-\xi_{i}^{\top}\beta\right)+\lambda p(\beta)\right\}, (4.4)

where ()\ell(\cdot) is some loss function and p()p(\cdot) is a penalty function with an associated regularization parameter λ0\lambda\geq 0. Following [54], we consider the elastic-net penalty,

p(β)pτ(β)τβ1+1τ2β22,0τ1.p(\beta)\equiv p_{\tau}(\beta)\coloneqq\tau\|\beta\|_{1}+\frac{1-\tau}{2}\|\beta\|_{2}^{2},\qquad 0\leq\tau\leq 1.

For the loss function, we employ the quantile loss

(w)ρα(w)(1α)w+αw+=12(|w|+(2α1)w),0<α<1,\ell(w)\equiv\rho_{\alpha}(w)\coloneqq\left(1-\alpha\right)w_{-}+\alpha w_{+}=\frac{1}{2}\left(\lvert w\rvert+(2\alpha-1)w\right),\qquad 0<\alpha<1, (4.5)

where ww\in\mathbb{R}. Notice that the case α=12\alpha=\frac{1}{2} yields the absolute loss. Letting x=[β0β]x=\begin{bmatrix}\beta_{0}&\beta^{\top}\end{bmatrix}^{\top}, and using Remark 1, we can re-write problem (4.4) in the form of (P), as

min(x,w)1+d×l{(α1)𝟙lw+12xQx+i=1l(wi)++Dx1},s.t.1l([1ξi]x+yi)wi=0,i=1,,l,\begin{split}\min_{(x,w)\ \in\ \mathbb{R}^{1+d}\times\mathbb{R}^{l}}&\ \left\{(\alpha-1)\mathds{1}_{l}^{\top}w+\frac{1}{2}x^{\top}Qx+\sum_{i=1}^{l}(w_{i})_{+}+\|Dx\|_{1}\right\},\\ \textnormal{s.t.}\qquad\ &\ \frac{1}{l}\left(-\begin{bmatrix}1&\xi_{i}^{\top}\end{bmatrix}x+y_{i}\right)-w_{i}=0,\qquad i=1,\ldots,l,\end{split}

where

Q=[001,d0d,1λ(1τ)Id],D=[001,d0d,1λτId].Q=\begin{bmatrix}0&0_{1,d}\\ 0_{d,1}&\lambda(1-\tau)I_{d}\end{bmatrix},\qquad D=\begin{bmatrix}0&0_{1,d}\\ 0_{d,1}&\lambda\tau I_{d}\end{bmatrix}.
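The following sketch (hypothetical names; dense matrices for clarity) assembles this data from a design matrix, and also verifies the identity used in (4.5).

```python
import numpy as np

def quantile_regression_data(Xi, y, alpha, lam, tau):
    """Data of the reformulated elastic-net quantile regression (sketch).

    Decision vector x = [beta_0; beta] in R^{1+d}.  Row i of C is
    (1/l) * [-1, -xi_i^T] and d = y / l, so that C x + d - w = 0 recovers
    w_i = (1/l) * (y_i - beta_0 - xi_i^T beta); the linear objective
    contribution is (alpha - 1) * 1^T w on the w-block.
    """
    l, d_feat = Xi.shape
    C = np.hstack([-np.ones((l, 1)), -Xi]) / l
    d = y / l
    Q = np.diag(np.r_[0.0, lam * (1 - tau) * np.ones(d_feat)])
    D = np.diag(np.r_[0.0, lam * tau * np.ones(d_feat)])
    return C, d, Q, D

# Check of the identity in (4.5): (1-a) w_- + a w_+ = 0.5 * (|w| + (2a - 1) w).
w = np.linspace(-2.0, 2.0, 9)
a = 0.8
lhs = (1 - a) * np.maximum(-w, 0.0) + a * np.maximum(w, 0.0)
rhs = 0.5 * (np.abs(w) + (2 * a - 1) * w)
print(np.allclose(lhs, rhs))   # True
```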
Real-world datasets:

In what follows, we solve several instances of problem (4.4). We showcase the effectiveness of the proposed approach on 5 regression problems taken from the LIBSVM library (see [11]). Additional information on the datasets is collected in Table 5.

Table 5: Quantile regression datasets.
Name # of data points # of features
space_ga 3107 6
abalone 4177 8
cpusmall 8192 12
cadata 20,640 8
E2006-tfidf 16,087 150,360

We fix tol=104\texttt{tol}=10^{-4}, λ=102\lambda=10^{-2}, and τ=0.5\tau=0.5, and run all three methods (active-set (AS), IP-PMM and OSQP) on the 5 instances for varying quantile level α\alpha. The results are collected in Table 6.

Table 6: Quantile regression: varying confidence level (tol = 10410^{-4}, λ=102\lambda=10^{-2}, τ=0.5\tau=0.5).
Problem 𝜶\bm{\alpha} Iterations Time (s)
PMM(SSN)[Fact.] IP–PMM OSQP AS IP–PMM OSQP
space_ga 0.50 15(43)[30] 19 2350 0.38 0.35 0.51
0.65 15(50)[35] 20 16,300 0.37 0.39 3.36
0.80 15(62)[48] 19 16,975 0.46 0.35 3.46
0.95 15(78)[77] 20 7400 0.48 0.36 1.51
abalone 0.50 14(68)[64] 35 6350 1.02 1.29 2.15
0.65 14(68)[63] 35 6300 0.89 1.36 2.17
0.80 14(79)[74] 27 12,425 0.89 1.01 4.13
0.95 14(87)[79] 18 2875 0.82 0.69 0.97
cpusmall 0.50 15(64)[64] 29 \doublebarwedge 2.01 2.57 \doublebarwedge
0.65 16(74)[74] 30 \doublebarwedge 2.33 2.81 \doublebarwedge
0.80 16(86)[85] 26 \doublebarwedge 2.64 2.37 \doublebarwedge
0.95 15(110)[110] 22 \doublebarwedge 2.85 1.96 \doublebarwedge
cadata 0.50 3(66)[65] 49 7125 2.96 14.63 16.10
0.65 4(85)[83] 42 9050 3.65 12.43 20.27
0.80 3(77)[76] 45 7275 3.45 13.13 16.45
0.95 3(225)[224] 80 7450 13.14 23.32 17.08
E2006-tfidf 0.50 14(27)[20] \dagger \dagger 84.82 \dagger \dagger
0.65 15(34)[26] \dagger \dagger 95.08 \dagger \dagger
0.80 16(46)[35] \dagger \dagger 112.24 \dagger \dagger
0.95 17(79)[71] \dagger \dagger 165.78 \dagger \dagger
\doublebarwedge indicates that the solver incorrectly identified the problem as infeasible.
\dagger indicates that the solver ran out of memory.

From Table 6 we observe that the three approaches are comparable for the smaller instances (space_ga, abalone, and cpusmall). The active-set scheme significantly outperforms IP-PMM and OSQP on the cadata problem, mainly due to its better numerical stability. Indeed, this problem is highly ill-conditioned, and this is reflected in the increased number of IP-PMM iterations needed to obtain a 4-digit-accurate solution. Finally, we observe that for the large-scale instance (E2006-tfidf), IP-PMM and OSQP crashed due to excessive memory requirements. In this case, the active-set scheme was warm-started by a matrix-free ADMM (as discussed in Section 3), since the standard ADMM also crashed due to excessive memory requirements. The active-set scheme, however, consistently solves these large-scale instances very efficiently, without running into any memory issues. The OSQP solver is competitive for certain instances, albeit slightly slower than both second-order solvers, but fails to solve the cpusmall instances because it incorrectly classifies them as infeasible. Additionally, we observe that it is less consistent than the second-order approaches, since its iteration count varies greatly with the problem parameters (see instances space_ga and abalone).

Next, we fix tol=104\texttt{tol}=10^{-4} and α=0.8\alpha=0.8, and run the three methods for varying regularization parameters τ\tau and λ\lambda. The results are collected in Table 7. We observe that both second-order methods are quite robust for a wide range of parameter values. OSQP is competitive for certain instances; however, its behaviour is greatly influenced by the problem parameters. Nonetheless, it was able to outperform the second-order solvers on two out of the four cadata instances. Finally, the active-set scheme consistently solved the large-scale instance (E2006-tfidf) for a wide range of parameter values, without running into any memory issues.

Table 7: Quantile regression: varying regularization (tol = 10410^{-4}, α=0.8\alpha=0.8).
Problem (𝝉,𝝀)(\bm{\tau},\bm{\lambda}) Iterations Time (s)
PMM(SSN)[Fact.] IP–PMM OSQP AS IP–PMM OSQP
space_ga (0.2, 5\cdot 10^{-2}) 15(58)[49] 16 37,125 0.51 0.30 8.00
(0.4, 1\cdot 10^{-2}) 15(61)[48] 26 12,050 0.46 0.49 2.48
(0.6, 5\cdot 10^{-3}) 15(67)[46] 25 13,025 0.44 0.46 2.68
(0.8, 1\cdot 10^{-3}) 15(82)[62] 27 20,575 0.61 0.49 4.21
abalone (0.2, 5\cdot 10^{-2}) 14(70)[65] 25 16,225 0.88 1.32 5.49
(0.4, 1\cdot 10^{-2}) 14(71)[67] 34 15,575 0.86 1.48 5.16
(0.6, 5\cdot 10^{-3}) 15(80)[76] 39 18,675 0.92 1.50 6.26
(0.8, 1\cdot 10^{-3}) 15(116)[109] 30 9975 1.44 1.16 3.34
cpusmall (0.2, 5\cdot 10^{-2}) 15(83)[81] 20 \doublebarwedge 1.98 2.14 \doublebarwedge
(0.4, 1\cdot 10^{-2}) 15(81)[78] 26 \doublebarwedge 2.10 2.39 \doublebarwedge
(0.6, 5\cdot 10^{-3}) 16(101)[94] 21 \doublebarwedge 2.79 1.94 \doublebarwedge
(0.8, 1\cdot 10^{-3}) 15(106)[103] 20 \doublebarwedge 3.16 1.85 \doublebarwedge
cadata (0.2, 5\cdot 10^{-2}) 9(147)[143] 62 5875 15.42 18.05 13.18
(0.4, 1\cdot 10^{-2}) 4(63)[62] 50 7350 2.76 14.68 17.41
(0.6, 5\cdot 10^{-3}) 5(121)[118] 45 9775 5.49 13.32 21.70
(0.8, 1\cdot 10^{-3}) 7(276)[276] 33 875 14.01 9.91 2.08
E2006-tfidf (0.2, 5\cdot 10^{-2}) 16(47)[38] \dagger \dagger 115.87 \dagger \dagger
(0.4, 1\cdot 10^{-2}) 16(43)[36] \dagger \dagger 118.89 \dagger \dagger
(0.6, 5\cdot 10^{-3}) 16(43)[34] \dagger \dagger 115.90 \dagger \dagger
(0.8, 1\cdot 10^{-3}) 17(48)[40] \dagger \dagger 128.38 \dagger \dagger
\doublebarwedge indicates that the solver incorrectly identified the problem as infeasible.
\dagger indicates that the solver ran out of memory.

4.3 Binary classification via linear support vector machines

Finally, we are interested in training a binary linear classifier using regularized soft-margin linear support vector machines (SVMs), [51]. More specifically, given a training dataset {(yi,ξi)}i=1l\{(y_{i},\xi_{i})\}_{i=1}^{l}, where yi{1,1}y_{i}\in\{-1,1\} are the labels and ξid\xi_{i}\in\mathbb{R}^{d} are the feature vectors (with dd the number of features), we would like to solve the following optimization problem

min(β0,β)×d{1li=1l(1yi(ξiββ0))++λ(τ1β1+τ22β22)},\min_{(\beta_{0},\beta)\in\mathbb{R}\times\mathbb{R}^{d}}\left\{\frac{1}{l}\sum_{i=1}^{l}\left(1-y_{i}\left(\xi_{i}^{\top}\beta-\beta_{0}\right)\right)_{+}+\lambda\left(\tau_{1}\|\beta\|_{1}+\frac{\tau_{2}}{2}\|\beta\|_{2}^{2}\right)\right\}, (4.6)

where λ>0\lambda>0 is a regularization parameter, and τ1,τ2>0\tau_{1},\ \tau_{2}>0 are the weights of the 1\ell_{1} and 2\ell_{2} regularizers, respectively. The standard Euclidean regularization is traditionally used as a trade-off between the margin of the classifier (the larger the better) and the correct classification of ξi\xi_{i}, for all i{1,,l}i\in\{1,\ldots,l\} (e.g. [17]). However, this often leads to a dense estimate for β\beta, which has led researchers in the machine learning community to consider the 1\ell_{1} regularizer instead, in order to encourage sparsity (e.g. [8]). It is well-known that the two regularizers can be combined, via the elastic-net penalty (see, for example, [52]), to obtain the effect of both, assuming that τ1\tau_{1} and τ2\tau_{2} are appropriately tuned.
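In analogy with the quantile regression model, problem (4.6) can be brought to the form of (P) by setting w_i = (1/l)(1 − y_i(ξ_i^T β − β_0)), so that the averaged hinge loss becomes the piecewise-linear term of (P). The sketch below (hypothetical names; dense matrices for clarity, whereas sparse structures would be preferable for high-dimensional feature spaces) assembles the corresponding data.

```python
import numpy as np

def svm_problem_data(Xi, y, lam, tau1, tau2):
    """Data of the reformulated elastic-net SVM (4.6); illustrative sketch.

    Decision vector x = [beta_0; beta] in R^{1+d}.  Row i of C is
    (1/l) * y_i * [1, -xi_i^T] and d = (1/l) * 1_l, so that C x + d - w = 0
    recovers w_i = (1/l) * (1 - y_i * (xi_i^T beta - beta_0)) and
    sum_i (w_i)_+ reproduces the averaged hinge loss.
    """
    l, d_feat = Xi.shape
    C = (y[:, None] * np.hstack([np.ones((l, 1)), -Xi])) / l
    d = np.full(l, 1.0 / l)
    Q = np.diag(np.r_[0.0, lam * tau2 * np.ones(d_feat)])  # Euclidean part
    D = np.diag(np.r_[0.0, lam * tau1 * np.ones(d_feat)])  # l1 part
    return C, d, Q, D
```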

Real-world datasets:

In what follows we consider elastic-net SVM instances of the form of (4.6). We showcase the effectiveness of the proposed active-set scheme on 3 large-scale binary classification datasets taken from the LIBSVM library ([11]). The problem names, as well as the number of features and training points are collected in Table 8.

Table 8: Binary classification datasets.
Name # of training points # of features
rcv1 20,242 47,236
real-sim 72,309 20,958
news20 19,996 1,355,191

In the experiments that follow, we only consider large-scale problems that neither IP-PMM nor OSQP can solve, due to excessive memory requirements. Nonetheless, we should note that, by following the developments in [19], IP-PMM could be specialized to problems of this form in order to tackle large-scale instances; the same can be said about the proposed active-set method. The schemes tested in this work employ factorization for the solution of their associated linear systems. The introduction of preconditioning, e.g. as in [39], could improve their efficiency in certain cases, but is outside the scope of this article. In the following experiments, we employ the (matrix-free) prox-linear ADMM (as described in Section 3) to avoid any memory issues in the warm-starting phase of the proposed algorithm.

We fix tol=105\texttt{tol}=10^{-5}, λ=102\lambda=10^{-2}, and run the active-set solver on all datasets given in Table 8 for varying values of the regularization parameters τ1\tau_{1} and τ2\tau_{2}. The results are collected in Table 9.

Table 9: Binary classification via elastic-net SVMs: varying regularization (tol = 10510^{-5}, λ=102\lambda=10^{-2}).
Problem 𝝉𝟏\bm{\tau_{1}} 𝝉𝟐\bm{\tau_{2}} Iterations Time (s)
PMM(SSN)[Fact.] AS
rcv1 0.2 0.2 48(139)[105] 43.36
0.8 0.2 47(100)[58] 23.28
0.2 0.8 47(165)[133] 54.81
5 5 39(85)[45] 20.57
real-sim 0.2 0.2 50(130)[100] 252.25
0.8 0.2 47(85)[48] 132.20
0.2 0.8 48(101)[68] 201.56
5 5 47(85)[48] 128.75
news20 0.2 0.2 50(142)[102] 213.12
0.8 0.2 41(85)[43] 133.94
0.2 0.8 70(212)[161] 74.42
5 5 41(85)[43] 130.33

We observe that the solver was able to consistently solve these instances without running into memory issues. In this case, the behaviour of the active-set solver was affected by the parameters τ1\tau_{1} and τ2\tau_{2} (indeed, see the number of SSN iterations for different regularization values); however, we consistently obtain convergence in a very reasonable amount of time. Overall, we observe that the proposed algorithm is very general and can be applied to a plethora of important applications arising in practice. We were able to showcase that the active-set nature of the method allows one to solve large-scale instances on a personal computer, without the need to employ iterative linear algebra (which could also complement the solver, as in [39]). The proposed method strikes a good balance between first-order methods, which are fast but unreliable, and second-order interior point methods, which are extremely robust but can struggle with problem size, ill-conditioning, and memory requirements. We have demonstrated that 1\ell_{1}-regularized convex quadratic problems with piecewise-linear terms can be solved very efficiently using the proposed active-set scheme, and we conjecture that the algorithm can be readily extended to deal with general nonlinear convex objectives, or with discretizations of stochastic two-stage problems.

5 Conclusions

In this paper we derived an efficient active-set method for the solution of convex quadratic optimization problems with piecewise-linear terms in the objective. The method, which complements our developments in the accompanying paper [“An active-set method for sparse approximations. Part I: Separable 1\ell_{1} terms”, S. Pougkakiotis, J. Gondzio, D. S. Kalogerias], arises by suitably combining a proximal method of multipliers with a semismooth Newton scheme, and admits an active-set interpretation. By exploiting the structure of the piecewise-linear terms in the objective, the method has very reasonable memory requirements, since it utilizes only a small active set at every inner-outer iteration. We warm-start the algorithm using an appropriate alternating direction method of multipliers, thereby ensuring faster convergence and reduced memory requirements.

We showcase the efficiency and robustness of the proposed scheme on a variety of optimization problems arising in risk-averse portfolio optimization, quantile regression, and binary classification via linear support vector machines. A numerical comparison against a robust interior point method and a state-of-the-art alternating direction method of multipliers demonstrates the viability of the approach as well as its limited memory requirements. In particular, we observe a significantly better behaviour, compared to the two other solvers, when dealing with large-scale instances. Overall, the approach remains efficient for a wide range of problems and strikes a good balance between cheap but unreliable first-order methods and expensive but highly reliable interior point methods.

Appendix A Appendix: Termination criteria

The optimality conditions of (P) can be written as

proxg1(xcQx+[CA]yz)=x,proxg2(wy1:l)=w,[Cx+dwAxb]= 0l+m,Π𝒦(x+z)=x,\begin{split}\textbf{prox}_{g_{1}}\left(x-c-Qx+\begin{bmatrix}C^{\top}&A^{\top}\end{bmatrix}y-z\right)=\ x,&\qquad\textbf{prox}_{g_{2}}\left(w-y_{1:l}\right)=\ w,\\ \begin{bmatrix}Cx+d-w\\ Ax-b\end{bmatrix}=\ 0_{l+m},&\qquad\Pi_{\mathcal{K}}(x+z)=\ x,\end{split}

and the termination criteria for Algorithm PMM (given a tolerance ϵ>0\epsilon>0) can be summarized as

xproxg1(xcQx+[CA]yz)1+cϵ,wproxg2(wy1:l)ϵ,[Cx+dwAxb]1+b+dϵ,xΠ𝒦(x+z)1+x+zϵ.\begin{split}\frac{\left\|x-\textbf{prox}_{g_{1}}\left(x-c-Qx+\begin{bmatrix}C^{\top}&A^{\top}\end{bmatrix}y-z\right)\right\|}{1+\|c\|_{\infty}}\leq\epsilon,&\quad\left\|w-\textbf{prox}_{g_{2}}\left(w-y_{1:l}\right)\right\|\leq\epsilon,\\ \frac{\left\|\begin{bmatrix}Cx+d-w\\ Ax-b\end{bmatrix}\right\|}{1+\|b\|_{\infty}+\|d\|_{\infty}}\leq\epsilon,&\quad\frac{\|x-\Pi_{\mathcal{K}}(x+z)\|}{1+\|x\|_{\infty}+\|z\|_{\infty}}\leq\epsilon.\end{split} (A.1)
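As an illustration, the residuals in (A.1) could be evaluated as in the following sketch (hypothetical names), assuming that the proximal operators prox_{g_1}, prox_{g_2} and the projection onto \mathcal{K} are supplied as callables and that the problem data are gathered in a dictionary; the closed forms of these operators follow from the earlier sections and are not repeated here.

```python
import numpy as np

def pmm_termination(x, w, y, z, data, prox_g1, prox_g2, proj_K, eps):
    """Evaluate the four relative residuals of (A.1); illustrative sketch.

    data is assumed to hold the problem matrices/vectors (c, Q, C, d, A, b);
    prox_g1, prox_g2 and the projection proj_K are supplied as callables.
    """
    c, Q, C, d, A, b = (data[k] for k in ("c", "Q", "C", "d", "A", "b"))
    l = C.shape[0]
    r1 = np.linalg.norm(x - prox_g1(x - c - Q @ x + C.T @ y[:l] + A.T @ y[l:] - z))
    r1 /= 1.0 + np.linalg.norm(c, np.inf)
    r2 = np.linalg.norm(w - prox_g2(w - y[:l]))
    r3 = np.linalg.norm(np.r_[C @ x + d - w, A @ x - b])
    r3 /= 1.0 + np.linalg.norm(b, np.inf) + np.linalg.norm(d, np.inf)
    r4 = np.linalg.norm(x - proj_K(x + z))
    r4 /= 1.0 + np.linalg.norm(x, np.inf) + np.linalg.norm(z, np.inf)
    return max(r1, r2, r3, r4) <= eps
```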

From the reformulation of (P) given in (P’), the termination criteria of Algorithm pADMM are as follows (upon noting that the variables of the algorithm are (x,w,u,y)(x,w,u,y))

c+Qx[CAIn0n,l]y1+cϵ,[Il0l,m+nIl]yϵ,Mr[xwu][db0l+n][db]+1ϵ,uΠ𝒦×l(proxg(u+y~))1+u+y~ϵ,\begin{split}\frac{\left\|c+Qx-\begin{bmatrix}C^{\top}&A^{\top}&-I_{n}&0_{n,l}\end{bmatrix}y\right\|}{1+\|c\|}\leq\epsilon,&\qquad\left\|\begin{bmatrix}I_{l}&0_{l,m+n}&I_{l}\end{bmatrix}y\right\|\leq\epsilon,\\ \frac{\left\|M_{r}\begin{bmatrix}x\\ w\\ u\end{bmatrix}-\begin{bmatrix}-d\\ b\\ 0_{l+n}\end{bmatrix}\right\|}{\left\|\begin{bmatrix}-d\\ b\end{bmatrix}\right\|+1}\leq\epsilon,&\qquad\frac{\left\|u-\Pi_{\mathcal{K}\times\mathbb{R}^{l}}\left(\textbf{prox}_{g}\left(u+\tilde{y}\right)\right)\right\|}{1+\|u\|+\left\|\tilde{y}\right\|}\leq\epsilon,\end{split} (A.2)

where y~y(l+m+1:2l+m+n)\tilde{y}\coloneqq y_{(l+m+1:2l+m+n)}.

References

  • [1] S. Alexander, T. F. Coleman, and Y. Li, Minimizing CVaR and VaR for a portfolio of derivatives, Journal of Banking & Finance, 30 (2006), pp. 583–605, https://doi.org/10.1016/j.jbankfin.2005.04.012.
  • [2] P. Artzner, F. Delbaen, J.-M. Eber, and D. Heath, Coherent measures of risk, Mathematical Finance, 9 (1999), pp. 203–228, https://doi.org/10.1111/1467-9965.00068.
  • [3] J. F. Benders, Partitioning procedures for solving mixed-variables programming problems, Numerische Mathematik, 4 (1962), pp. 238–252, https://doi.org/10.1007/BF01386316.
  • [4] D. Bertsimas and M. Sim, The price of robustness, Operations Research, 52 (2004), pp. 35–53, https://doi.org/10.1287/opre.1030.0065.
  • [5] J. R. Birge, Decomposition and partitioning methods for multistage stochastic linear programs, Operations Research, 33 (1985), pp. 989–1007, https://doi.org/10.1287/opre.33.5.989.
  • [6] C. Birungi and L. Muthoni, Analysis of risk measures in portfolio optimization for the Uganda securities exchange, Journal of Financial Risk Management, 10 (2021), pp. 135–152, https://doi.org/10.4236/jfrm.2021.102008.
  • [7] S. Boyd, E. Busseti, S. Diamond, R. N. Kahn, K. Koh, P. Nystrup, and J. Speth, Multi-period trading via convex optimization, Foundations and Trends® in Optimization, 3 (2017), pp. 1–76, https://doi.org/10.1561/2400000023.
  • [8] P. S. Bradley and O. L. Mangasarian, Feature selection via concave minimization and support vector machines, in Proceedings of the Fifteenth International Conference on Machine Learning, ICML ’98, San Francisco, CA, USA, 1998, Morgan Kaufmann Publishers Inc., pp. 82–90.
  • [9] R. Bruni, F. Cesarone, A. Scozzari, and F. Tardella, Real-world datasets for portfolio selection and solutions of some stochastic dominance portfolio models, Data in Brief, 8 (2016), pp. 858–862, https://doi.org/10.1016/j.dib.2016.06.031.
  • [10] J. V. Burke and M. C. Ferris, A Gauss-Newton method for convex composite optimization, Mathematical Programming, 71 (1995), pp. 179–194, https://doi.org/10.1007/BF01585997.
  • [11] C.-C. Chang and C.-J. Lin, LIBSVM: A library for support vector machines, ACM Transactions on Intelligent Systems and Technology, 2 (2011), pp. 27:1–27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
  • [12] J. Chen and L. Qi, Globally and superlinearly convergent inexact Newton-Krylov algorithms for solving nonsmooth equations, Numerical Linear Algebra with Applications, 17 (2010), pp. 155–174, https://doi.org/10.1002/nla.673.
  • [13] F. Clarke, Optimization and Nonsmooth Analysis, Classics in Applied Mathematics, SIAM, Philadelphia, 1990, https://doi.org/10.1137/1.9781611971309.
  • [14] C. Clason and T. Valkonen, Introduction to Nonsmooth Analysis and Optimization, 2020, https://arxiv.org/abs/arXiv:1912.08672.
  • [15] S. Corsaro, V. De Simone, and Z. Marino, Fused Lasso approach in portfolio selection, Annals of Operations Research, 299 (2021), pp. 47–59, https://doi.org/10.1007/s10479-019-03289-w.
  • [16] S. Corsaro, V. De Simone, Z. Marino, and F. Perla, 1\ell_{1}-Regularization for multi-period portfolio selection, Annals of Operations Research, 294 (2020), pp. 75–86, https://doi.org/10.1007/s10479-019-03308-w.
  • [17] C. Cortes and V. N. Vapnik, Support-vector networks, Machine Learning, 20 (1995), pp. 273–297, https://doi.org/10.1007/BF00994018.
  • [18] G. B. Dantzig and P. Wolfe, The decomposition algorithm for linear programming, Econometrica, 29 (1961), pp. 767–778, https://doi.org/10.2307/1911818.
  • [19] V. De Simone, D. di Serafino, J. Gondzio, S. Pougkakiotis, and M. Viola, Sparse approximations with interior point methods, SIAM Review, 64 (2022), pp. 954–988, https://doi.org/10.1137/21M1401103.
  • [20] W. Deng and W. Yin, On the global and linear convergence of the generalized alternating direction method of multipliers, Journal of Scientific Computing, 66 (2016), pp. 889–916, https://doi.org/10.1007/s10915-015-0048-x.
  • [21] J. Gondzio, S. Pougkakiotis, and J. W. Pearson, General-purpose preconditioning for regularized interior point methods, Computational Optimization and Applications, (2022), https://doi.org/10.1007/s10589-022-00424-5.
  • [22] B. L. Gorissen, Interior point methods can exploit structure of convex piecewise linear functions with application in radiation therapy, SIAM Journal on Optimization, 32 (2022), pp. 256–275, https://doi.org/10.1137/21M1402364.
  • [23] A. Gramfort, B. Thirion, and G. Varoquaux, Identifying predictive regions from fMRI with TV-L1 prior, in 2013 International Workshop on Pattern Recognition in Neuroimaging, 2013, pp. 17–20, https://doi.org/10.1109/PRNI.2013.14.
  • [24] S.-P. Han, J.-S. Pang, and N. Rangaraj, Globally convergent Newton methods for nonsmooth equations, Mathematics of Operations Research, 17 (1992), pp. 586–607, https://doi.org/10.1287/moor.17.3.586.
  • [25] E. Hans and T. Raasch, Global convergence of damped semismooth Newton methods for 1\ell_{1} Tikhonov regularization, Inverse Problems, 31 (2015), p. 025005, https://doi.org/10.1088/0266-5611/31/2/025005.
  • [26] K. Inui and M. Kijima, On the significance of expected shortfall as a coherent risk measure, Journal of Banking and Finance, 29 (2005), pp. 853–864, https://doi.org/10.1016/j.jbankfin.2004.08.005.
  • [27] P. Jorion, Risk management lessons from long-term capital management, European Financial Management, 6 (2000), pp. 277–300, https://doi.org/10.1111/1468-036X.00125.
  • [28] C. Kanzow and T. Lechner, Globalized inexact proximal Newton-type methods for nonconvex composite functions, Computational Optimization and Applications, 78 (2021), pp. 377–410, https://doi.org/10.1007/s10589-020-00243-6.
  • [29] J. E. Kelley, Jr., The cutting-plane method for solving convex programs, Journal of the Society for Industrial and Applied Mathematics, 8 (1960), pp. 703–712, https://doi.org/10.1137/0108053.
  • [30] H. Konno and H. Yamazaki, Mean-absolute deviation portfolio optimization model and its applications to Tokyo stock market, Management Science, 37 (1991), pp. 519–531, https://doi.org/10.1287/mnsc.37.5.519.
  • [31] P. Krokhmal, S. Uryasev, and J. Palmquist, Portfolio optimization with conditional value-at-risk objective and constraints, Journal of Risk, 4 (2002), pp. 43–68, https://doi.org/10.21314/JOR.2002.057.
  • [32] R. Lazimy, Mixed-integer quadratic programming, Mathematical Programming, 22 (1982), pp. 332–349, https://doi.org/10.1007/BF01581047.
  • [33] D. Li and W.-L. Ng, Optimal dynamic portfolio selection: Multiperiod mean-variance formulation, Mathematical Finance, 10 (2000), pp. 387–406, https://doi.org/10.1111/1467-9965.00100.
  • [34] X. Li, D. Sun, and K. C. Toh, A highly efficient semismooth Newton augmented Lagrangian method for solving lasso problems, SIAM Journal on Optimization, 28 (2018), pp. 433–458, https://doi.org/10.1137/16M1097572.
  • [35] X. Li, D. Sun, and K. C. Toh, An asymptotically superlinearly convergent semismooth Newton augmented Lagrangian method for linear programming, SIAM Journal on Optimization, 30 (2020), pp. 2410–2440, https://doi.org/10.1137/19M1251795.
  • [36] H. M. Markowitz, Portfolio selection, The Journal of Finance, 7 (1952), pp. 77–91, https://doi.org/10.2307/2975974.
  • [37] J. Martínez and L. Qi, Inexact Newton methods for solving nonsmooth equations, Journal of Computational and Applied Mathematics, 60 (1995), pp. 127–145, https://doi.org/10.1016/0377-0427(94)00088-I.
  • [38] S. Pougkakiotis and J. Gondzio, An interior point-proximal method of multipliers for convex quadratic programming, Computational Optimization and Applications, 78 (2021), pp. 307–351, https://doi.org/10.1007/s10589-020-00240-9.
  • [39] S. Pougkakiotis, J. Gondzio, and D. S. Kalogerias, An active-set method for sparse approximations. Part I: Separable 1\ell_{1} terms, 2022, https://arxiv.org/abs/arXiv:2201.10211.
  • [40] L. Qi, Convergence analysis of some algorithms for solving nonsmooth equations, Mathematics of Operations Research, 18 (1993), pp. 227–244, https://doi.org/10.1287/moor.18.1.227.
  • [41] H. Ramos, M. Righi, C. Guedes, and F. Müller, A comparison of risk measures for portfolio optimization with cardinality constraints, 2022, http://dx.doi.org/10.2139/ssrn.4141301.
  • [42] M. B. Righi and D. Borenstein, A simulation comparison of risk measures for portfolio optimization, Finance Research Letters, 24 (2018), pp. 105–112, https://doi.org/10.1016/j.frl.2017.07.013.
  • [43] R. T. Rockafellar, Monotone operators and the proximal point algorithm, SIAM Journal on Control and Optimization, 14 (1976), pp. 877–898, https://doi.org/10.1137/0314056.
  • [44] R. T. Rockafellar and S. Uryasev, Optimization of conditional value-at-risk, Journal of Risk, 2 (2000), pp. 21–41, https://doi.org/10.21314/JOR.2000.038.
  • [45] R. T. Rockafellar, S. Uryasev, and M. Zabarankin, Optimality conditions in portfolio analysis with general deviation measures, Mathematical Programming, 108 (2006), pp. 515–540, https://doi.org/10.1007/s10107-006-0721-9.
  • [46] A. Ruszczyński, A regularized decomposition method for minimizing a sum of polyhedral functions, Mathematical Programming, 35 (1986), pp. 309–333, https://doi.org/10.1007/BF01580883.
  • [47] A. Shapiro, D. Dentcheva, and A. Ruszczyński, Lectures on Stochastic Programming: Modeling and Theory, MOS-SIAM Series on Optimization, SIAM & Mathematical Optimization Society, Philadelphia, 2009, https://doi.org/10.1137/1.9780898718751.
  • [48] L. P. Silva, D. Alem, and F. L. Carvalho, Portfolio optimization using mean absolute deviation (MAD) and conditional value-at-risk (CVaR), Production, 27 (2017), https://doi.org/10.1590/0103-6513.208816.
  • [49] B. Stellato, G. Banjac, P. Goulart, A. Bemporad, and S. Boyd, OSQP: an operator splitting solver for quadratic programs, Mathematical Programming Computation, 12 (2020), pp. 637–672, https://doi.org/10.1007/s12532-020-00179-2.
  • [50] R. J. Vanderbei, Symmetric quasidefinite matrices, SIAM Journal on Optimization, 5 (1995), pp. 100–113, https://doi.org/10.1137/0805005.
  • [51] V. N. Vapnik, The Nature of Statistical Learning Theory, Information Science and Statistics, Springer, New York, NY, 2000, https://doi.org/10.1007/978-1-4757-3264-1.
  • [52] L. Wang, J. Zhu, and H. Zou, The doubly regularized support vector machine, Statistica Sinica, 16 (2006), pp. 589–615, http://www.jstor.org/stable/24307560.
  • [53] X. Y. Zhao, D. F. Sun, and K. C. Toh, A Newton-CG augmented Lagrangian method for semidefinite programming, SIAM Journal on Optimization, 20 (2010), pp. 1737–1765, https://doi.org/10.1137/080718206.
  • [54] H. Zou and T. Hastie, Regularization and variable selection via the elastic net, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67 (2005), pp. 301–320, https://doi.org/10.1111/j.1467-9868.2005.00503.x.