
A generic coordinate descent solver for nonsmooth convex optimization

Olivier Fercoq (olivier.fercoq@telecom-paristech.fr), LTCI, Télécom ParisTech, Université Paris-Saclay, 46 rue Barrault, 75634 Paris Cedex 13, France
Abstract

We present a generic coordinate descent solver for the minimization of a nonsmooth convex objective with structure. The method can deal in particular with problems with linear constraints. The implementation makes use of efficient residual updates and automatically determines which dual variables should be duplicated. A list of basic functional atoms is pre-compiled for efficiency and a modelling language in Python allows the user to combine them at run time. As a result, the algorithm can be used to solve a large variety of problems including Lasso, sparse multinomial logistic regression, linear and quadratic programs.

keywords:
Coordinate descent; convex optimization; generic solver; efficient implementation

1 Introduction

Coordinate descent methods decompose a large optimization problem into a sequence of one-dimensional optimization problems. The algorithm was first described for the minimization of quadratic functions by Gauss and Seidel in [49]. Coordinate descent methods have become indispensable in machine learning because they are very efficient for key problems, namely Lasso [20], logistic regression [67] and support vector machines [43, 52]. Moreover, the decomposition into small subproblems means that only a small part of the data is processed at each iteration, which makes coordinate descent easily scalable to high dimensions.

One of the main ingredients of an efficient coordinate descent solver is its ability to compute partial derivatives of the objective function efficiently [39]. In the case of least squares for instance, this involves the definition of a vector of residuals that is updated during the run of the algorithm. As this operation needs to be performed at each iteration, and millions of iterations are usually needed, the residual update and directional derivative computation must be coded in a compiled programming language.

Many coordinate descent solvers have been written in order to solve a large variety of problems. However, most of the existing solvers can only solve problems of the type

\[\min_{x\in\mathbb{R}^{N}}\sum_{j=1}^{J}f(A_{j}x-b_{j})+\sum_{i=1}^{I}g(x^{(i)})\]

where $x^{(i)}\in\mathbb{R}^{N_{i}}$ is the $i$th block of $x$, $\sum_{i=1}^{I}N_{i}=N$, $A_{j}\in\mathbb{R}^{M_{j}\times N}$ is a matrix and $b_{j}\in\mathbb{R}^{M_{j}}$ is a vector, and where $f$ is a convex differentiable function and $g$ is a convex lower-semicontinuous function whose proximal operator is easy to compute (a.k.a. a proximal-friendly convex function). Each piece of code usually covers only one type of function [14, 42]. Moreover, even when the user has a choice of objective function, the same function is used for every block [6].

In this work, we propose a generic coordinate descent method for the resolution of the convex optimization problem

\[\min_{x\in\mathbb{R}^{N}}\frac{1}{2}x^{\top}Qx+\sum_{j=1}^{J}c^{f}_{j}f_{j}(A^{f}_{j}x-b^{f}_{j})+\sum_{i=1}^{I}c^{g}_{i}g_{i}(D^{g}_{i}x^{(i)}-b^{g}_{i})+\sum_{l=1}^{L}c^{h}_{l}h_{l}(A^{h}_{l}x-b^{h}_{l})\,.\tag{1}\]

We shall call $f_{j}$, $g_{i}$ and $h_{l}$ atom functions. Each of them may be different. We will assume that the $f_{j}$'s are differentiable and convex, and that the $g_{i}$'s and $h_{l}$'s are proximal-friendly convex functions. As before, $A^{f}_{j}\in\mathbb{R}^{M^{f}_{j}\times N}$ and $A^{h}_{l}\in\mathbb{R}^{M^{h}_{l}\times N}$ are matrices, $D_{i}^{g}$ is a multiple of the identity matrix of size $N_{i}$, $b^{f}_{j}\in\mathbb{R}^{M^{f}_{j}}$, $b^{g}_{i}\in\mathbb{R}^{N_{i}}$ and $b^{h}_{l}\in\mathbb{R}^{M^{h}_{l}}$ are vectors, $c^{f}_{j}$, $c^{g}_{i}$ and $c^{h}_{l}$ are positive real numbers, and $Q$ is an $N\times N$ positive semi-definite matrix.
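For instance (this instantiation is ours, but it matches the modelling example of Section 6), the Lasso $\min_{x}\frac{1}{2}\lVert Ax-b\rVert_{2}^{2}+\lambda\lVert x\rVert_{1}$ with $A\in\mathbb{R}^{M\times N}$ is recovered from (1), with blocks reduced to single coordinates ($I=N$, $N_{i}=1$), by taking

\[Q=0,\quad J=M,\quad f_{j}(t)=t^{2},\quad c^{f}_{j}=\tfrac{1}{2},\quad A^{f}_{j}=A_{j,:},\quad b^{f}_{j}=b_{j},\qquad g_{i}(t)=|t|,\quad c^{g}_{i}=\lambda,\quad D^{g}_{i}=1,\quad b^{g}_{i}=0,\qquad L=0\,.\]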

The algorithm we implemented is described in [15] and can be downloaded on https://bitbucket.org/ofercoq/cd_solver. The present paper focuses on important implementation details about residual updates and dual variable duplication. The novelty of our code is that it allows a generic treatment of these algorithmic steps and includes a modelling interface in Python for the definition of the optimization problem. Note that unlike most coordinate descent implementations, it can deal with nonseparable nonsmooth objectives and linear constraints.

2 Literature review on coordinate descent methods

A thorough review on coordinate descent is beyond the scope of this paper. We shall refer the interested reader to the review papers [63] and [54]. Instead, for selected papers dealing with smooth functions, separable non-smooth functions or non-separable non-smooth functions, we list their main features. We also quickly review what has been done for non-convex functions. We sort papers in order of publication, except when there is an explicit dependency between a paper and a follow-up.

2.1 Smooth functions

Smooth objectives are a natural starting point for algorithmic innovations. The optimization problem at stake reads

\[\min_{x\in\mathbb{R}^{N}}f(x)\]

where $f$ is a convex differentiable function with Lipschitz-continuous partial derivatives.

In Table 1, we compare papers that introduced important improvements to coordinate descent methods. We shall in particular stress the seminal paper by Tseng and Yun [60]. It features coordinate gradient steps instead of exact minimization. Indeed, a coordinate gradient step gives a similar iteration complexity both in theory and in practice, for a much cheaper iteration cost. Moreover, this opened the door for many innovations: blocks of coordinates and the use of proximal operators were developed in the same paper. Another crucial step was made in [39]: Nesterov showed how randomization helps to establish finite-time complexity bounds and proposed an accelerated version of coordinate descent. He also proposed to use a non-uniform sampling of coordinates depending on the coordinate-wise Lipschitz constants.

Paper Rate Rand Grad Blck Par Acc Notable feature
Seidel ’74 [49] × N × × × × quadratic
Warga ’63 [62] × N × × × × strictly convex
Luo & Tseng ’92 [34] asymp N × × × × rate for weakly convex
Leventhal & Lewis ’08 [28] Y × × × × quadratic $f$
Tseng & Yun ’09 [60] asymp N × × line search, proximal operator
Nesterov ’12 [39] Y × × not eff 1st acc & 1st non-uniform
Beck & Tetruashvili ’13 [4] N × not eff finite time analysis cyclic CD
Lin & Xiao ’13 [33] Y × not eff improvements on [39, 47]
Lee & Sidford ’13 [27] Y × × 1st efficient accelerated
Liu et al ’13 [31] Y × × 1st asynchronous
Glasmachers & Dogan ’13 [22] × Y × × × heuristic sampling
Richtárik & Takáč ’16 [46] Y × × 1st arbitrary sampling
Allen-Zhu et al ’16 [2] Y × non-uniform sampling
Sun et al ’17 [55] Y × better asynchrony than [31]
Table 1: Selected papers for the minimization of smooth functions. Rate: we check whether the paper proves convergence (×), an asymptotic rate (asymp) or a finite-time iteration complexity (✓). Rand: deterministic (N) or randomized (Y) selection of coordinates. Grad: exact minimization (×) or use of partial derivatives (✓). Blck: the paper considers 1D coordinates (×) or blocks of coordinates (✓). Par: algorithm designed for parallel computing (✓). Acc: no momentum (×), accelerated but not efficient in practice (not eff), accelerated algorithm (✓).
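To make the setting of this section concrete, here is a minimal Python sketch (ours, not taken from any of the cited papers) of randomized coordinate gradient descent with coordinate-wise Lipschitz step sizes, in the spirit of [39]. The quadratic test objective and the naive partial derivative are illustrative assumptions; Section 3.2 explains how the actual implementation avoids recomputing $Ax$ at every iteration.

import numpy as np

def coordinate_gradient_descent(grad_i, lipschitz, x0, n_iters, rng=None):
    # Randomized coordinate gradient descent on a smooth function.
    # grad_i(x, i) returns the i-th partial derivative and lipschitz[i] is a
    # coordinate-wise Lipschitz constant of that partial derivative.
    rng = np.random.default_rng() if rng is None else rng
    x = x0.copy()
    for _ in range(n_iters):
        i = rng.integers(len(x))              # uniform coordinate sampling
        x[i] -= grad_i(x, i) / lipschitz[i]   # one-dimensional gradient step
    return x

# Illustrative use on f(x) = 0.5 * ||A x - b||^2 (a smooth quadratic).
rng = np.random.default_rng(0)
A, b = rng.standard_normal((50, 20)), rng.standard_normal(50)
lip = (A ** 2).sum(axis=0)                    # L_i = ||A_{:,i}||^2
grad_i = lambda x, i: A[:, i] @ (A @ x - b)   # naive partial derivative, recomputes A x
x = coordinate_gradient_descent(grad_i, lip, np.zeros(20), 5000, rng)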

2.2 Separable non-smooth functions

A large literature has been devoted to composite optimization problems with separable non-smooth functions:

\[\min_{x\in\mathbb{R}^{N}}f(x)+\sum_{i=1}^{n}g_{i}(x^{(i)})\]

where $f$ is a convex differentiable function with Lipschitz-continuous partial derivatives and, for all $i$, $g_{i}$ is a convex function whose proximal operator is easy to compute. Indeed, regularized expected risk minimization problems often fit into this framework, and this made the success of coordinate descent methods for machine learning applications.

Some papers study $\min_{x\in\mathbb{R}^{p}}f(x)+\sum_{i=1}^{n}g_{i}((Ax)^{(i)})$, where $f$ is strongly convex, and apply coordinate descent to a dual problem written as

\[\min_{y\in\mathbb{R}^{N}}f^{*}(-A^{\top}y)+\sum_{i=1}^{n}g^{*}_{i}(y^{(i)})\;.\]

One of the challenges of these works is to show that even though we are solving the dual problem, one can still recover rates for a sequence minimizing the primal objective.
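A standard construction for this step (our summary of the approach used, e.g., in the dual coordinate ascent analyses [51, 53]) associates with a dual iterate $y_{k}$ the primal point

\[x_{k}=\nabla f^{*}(-A^{\top}y_{k}),\]

and then bounds the primal suboptimality of $x_{k}$, or of an averaged iterate, in terms of the dual progress.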

We present our selection of papers devoted to this type of problems in Table 2.

Paper Prx Par Acc Dual Notable feature
Tseng & Yun ’09 [60] × × × 1st prox, line search, deterministic
S-Shwartz & Tewari ’09 [50] $\ell_1$ × × × 1st $\ell_1$-regularized w finite time bound
Bradley et al ’11 [7] $\ell_1$ × × $\ell_1$-regularized parallel
Richtárik & Takáč ’14 [47] × × × 1st proximal with finite time bound
S-Shwartz & Zhang ’13 [51] × × 1st dual
Richtárik & Takáč ’15 [48] × × 1st general parallel
Takáč et al ’13 [56] × 1st dual & parallel
S-Shwartz & Zhang ’14 [53] × acceleration in the primal
Yun ’14 [68] × × × analysis of cyclic CD
Fercoq & Richtárik ’15 [18] × 1st proximal and accelerated
Lin, Lu & Xiao ’14 [30] × × prox & accelerated on strong conv.
Richtárik & Takáč ’16 [45] × × 1st distributed
Fercoq et al ’14 [17] × distributed computation
Lu & Xiao ’15 [33] × × × improved complexity over [47, 39]
Li & Lin ’18 [29] × acceleration in the dual
Fercoq & Qu ’18 [16] × × restart for obj with error bound
Table 2: An overview of selected papers proposing and analyzing the iteration complexity of coordinate descent methods for separable non-smooth objectives. Prx: uses a proximal operator to deal with the non-smooth part of the objective. Par: updates several blocks of coordinates in parallel. Acc: uses momentum to obtain an improved rate of convergence. Dual: solves a dual problem but still proves rates in the primal (only relevant for weakly convex duals).

2.3 Non-separable non-smooth functions

Non-separable non-smooth objective functions are much more challenging for coordinate descent methods. One wishes to solve

\[\min_{x\in\mathbb{R}^{N}}f(x)+g(x)+h(Ax)\]

where $f$ is a convex differentiable function with Lipschitz-continuous partial derivatives, $g$ and $h$ are convex functions whose proximal operators are easy to compute and $A$ is a linear operator. Indeed, the linear operator introduces a coupling between the coordinates, and a naive approach leads to a method that does not converge to a minimizer [3]. When $h=\iota_{\{b\}}$, the convex indicator function of the set $\{b\}$, we have equality constraints.

We present our selection of papers devoted to this type of problems in Table 3.

Paper Rate Const PD-CD Notable feature
Platt ’99 [43] × P for SVM
Tseng & Yun ’09 [61] P adapts Gauss-Southwell rule
Tao et al ’12 [57] × P uses averages of subgradients
Necoara et al ’12 [38] P 2-coordinate descent
Nesterov ’12 [40] × P uses subgradients
Necoara & Clipici ’13 [37] P coupled constraints
Combettes & Pesquet ’14 [11] × 1st PD-CD, short step sizes
Bianchi et al ’14 [5] × distributed optimization
Hong et al ’14 [24] × × updates all dual variables
Fercoq & Richtárik ’17 [19] × P uses smoothing
Alacaoglu et al ’17 [1] 1st PD-CD w rate for constraints
Xu & Zhang ’18 [66] × better rate than [21]
Chambolle et al ’18 [9] × updates all primal variables
Fercoq & Bianchi ’19 [15] 1st PD-CD w long step sizes
Gao et al ’19 [21] × 1st primal-dual w rate for constraints
Latafat et al ’19 [26] linear conv w growth condition
Table 3: Selected papers for the minimization of non-smooth non-separable functions. Rate: does the paper prove rates? Const: can the method solve problems with linear equality constraints? PD-CD: the method is purely primal (P), the method updates some primal variables but all the Lagrange multipliers or some dual variables but all the primal variables (×), the method updates some primal and some dual variables at each iteration (✓).

2.4 Non-convex functions

The goal here is to find a local minimum of the problem

\[\operatorname*{loc-min}_{x\in\mathbb{R}^{N}}\ f(x)+\sum_{i=1}^{n}g_{i}(x^{(i)})\]

without any assumption on the convexity of $f$ nor of the $g_{i}$'s. The function $f$ should be continuously differentiable and the proximal operator of each function $g_{i}$ should be easily computable. Note that in the non-convex setting, the proximal operator may be set-valued.

Paper Conv Smth Nsmth Notable feature
Grippo & Sciandrone ’00 [23] × × 2-block or coordinate-wise quasiconvex
Tseng & Yun ’09 [60] × convergence under error bound
Hsieh & Dhillon ’11 [25] × × non-negative matrix factorization
Breheny & Huang ’11 [8] × × regularized least squares
Mazumder et al ’11 [35] × × regularized least squares
Razaviyayn et al ’13 [44] × requires uniqueness of the prox
Xu & Yin ’13 [64] × multiconvex
Lu & Xiao ’13 [32] × random sampling
Patrascu & Necoara ’15 [41] × randomized + 1 linear constraint
Xu & Yin ’17 [65] convergence under KL
Table 4: Selected papers for the local minimization of nonconvex problems. Conv: study only limit points (×) or convergence of the sequence proved (✓). Smth: can deal with non-convex smooth functions. Nsmth: can deal with non-convex and non-smooth functions.

3 Description of the Algorithm

3.1 General scheme

The algorithm we implemented is a coordinate descent primal-dual method developed in [15]. Let us introduce the notation $F(x)=\frac{1}{2}x^{\top}Qx+\sum_{j=1}^{J}c^{f}_{j}f_{j}(A^{f}_{j}x-b^{f}_{j})$, $G(x)=\sum_{i=1}^{I}c^{g}_{i}g_{i}(D^{g}_{i}x^{(i)}-b^{g}_{i})$, $H(z)=\sum_{l=1}^{L}c^{h}_{l}h_{l}(z^{(l)}-b^{h}_{l})$, $\mathcal{J}(i)=\{j\;:\;A^{h}_{j,i}\neq 0\}$, $\mathcal{I}(j)=\{i\;:\;A^{h}_{j,i}\neq 0\}$, $m_{j}=|\mathcal{I}(j)|$ and $\rho(A)$ the spectral radius of matrix $A$. We shall also denote $\mathcal{J}^{f}(i)=\{j\;:\;A^{f}_{j,i}\neq 0\}$, $\mathcal{J}^{Q}(i)=\{j\;:\;Q_{j,i}\neq 0\}$, $A^{f}\in\mathbb{R}^{\sum_{j}M^{f}_{j}\times N}$ the matrix which stacks the matrices $(A^{f}_{j})_{1\leq j\leq J}$ and $A^{h}$ the matrix which stacks the matrices $(A^{h}_{l})_{1\leq l\leq L}$. The algorithm then writes as Algorithm 1.

Algorithm 1 Coordinate-descent primal-dual algorithm with duplicated variables (PD-CD)

Input: Differentiable function $F:\mathbb{R}^{N}\to\mathbb{R}$, matrix $A^{h}\in\mathbb{R}^{M^{h}\times N}$, functions $G$ and $H$ whose proximal operators are available.

Initialization: Choose $x_{0}\in\mathbb{R}^{N}$, $\boldsymbol{y}_{0}\in\mathbb{R}^{\mathrm{nnz}(A^{h})}$. Denote $\mathcal{J}(i)=\{j\;:\;A^{h}_{j,i}\neq 0\}$, $\mathcal{I}(j)=\{i\;:\;A^{h}_{j,i}\neq 0\}$, $m_{j}=|\mathcal{I}(j)|$ and $\rho(A)$ the spectral radius of matrix $A$. Choose step sizes $\tau\in\mathbb{R}_{+}^{I}$ and $\sigma\in\mathbb{R}_{+}^{L}$ such that, for all $i\in\{1,\ldots,I\}$,

\[\tau_{i}<\frac{1}{\beta_{i}+\rho\left(\sum_{j\in\mathcal{J}(i)}m_{j}\sigma_{j}(A^{h})_{j,i}^{\top}A^{h}_{j,i}\right)}\,.\tag{2}\]

For all $i\in\{1,\dots,I\}$, set $w_{0}^{(i)}=\sum_{j\in\mathcal{J}(i)}(A^{h})_{j,i}^{\top}\,\boldsymbol{y}_{0}^{(j)}(i)$.
For all $j\in\{1,\dots,J\}$, set $z_{0}^{(j)}=\frac{1}{m_{j}}\sum_{i\in\mathcal{I}(j)}\boldsymbol{y}_{0}^{(j)}(i)$.

Iteration $k$: Define:

\begin{align*}
\overline{y}_{k+1}&=\operatorname{prox}_{\sigma,H^{\star}}\big(z_{k}+D(\sigma)A^{h}x_{k}\big)\\
\overline{x}_{k+1}&=\operatorname{prox}_{\tau,G}\Big(x_{k}-D(\tau)\big(\nabla F(x_{k})+2(A^{h})^{\top}\overline{y}_{k+1}-w_{k}\big)\Big)\,.
\end{align*}

For $i=i_{k+1}\sim U(\{1,\ldots,I\})$ and for each $j\in\mathcal{J}(i_{k+1})$, update:

\begin{align*}
x_{k+1}^{(i)}&=\overline{x}_{k+1}^{(i)}\\
\boldsymbol{y}_{k+1}^{(j)}(i)&=\overline{y}_{k+1}^{(j)}\\
w_{k+1}^{(i)}&=w_{k}^{(i)}+\sum_{j\in\mathcal{J}(i)}(A^{h})_{j,i}^{\top}\,\big(\boldsymbol{y}_{k+1}^{(j)}(i)-\boldsymbol{y}_{k}^{(j)}(i)\big)\\
z_{k+1}^{(j)}&=z_{k}^{(j)}+\frac{1}{m_{j}}\big(\boldsymbol{y}_{k+1}^{(j)}(i)-\boldsymbol{y}_{k}^{(j)}(i)\big)\,.
\end{align*}

Otherwise, set $x_{k+1}^{(i^{\prime})}=x_{k}^{(i^{\prime})}$, $w_{k+1}^{(i^{\prime})}=w_{k}^{(i^{\prime})}$, $z_{k+1}^{(j^{\prime})}=z_{k}^{(j^{\prime})}$ and $\boldsymbol{y}_{k+1}^{(j^{\prime})}(i^{\prime})=\boldsymbol{y}_{k}^{(j^{\prime})}(i^{\prime})$.

We will denote $U_{1},\ldots,U_{I}$ the columns of the identity matrix corresponding to the blocks of $x=(x^{(1)},\ldots,x^{(I)})$, so that $U_{i}x^{(i)}\in\mathbb{R}^{N}$, and $V_{1},\ldots,V_{J}$ the columns of the identity matrix corresponding to the blocks of $A^{f}x-b^{f}=(A^{f}_{1}x-b^{f}_{1},\ldots,A^{f}_{J}x-b^{f}_{J})$.

3.2 Computation of partial derivatives

For simplicity of implementation, we assume that $G$ is separable and that the blocks of variables follow the block structure of $G$. This implies in particular that at each iteration, only $\nabla_{i}F(x_{k})$ needs to be computed. This partial derivative needs to be computed efficiently because this is done at each iteration of the algorithm. We now describe the efficient residual update method, which is classically used in coordinate descent implementations [39].

Denote $r^{f,x}_{k}=A^{f}x_{k}-b^{f}$ and $r^{Q,x}_{k}=Qx_{k}$. By the chain rule, we have

\[\nabla_{i}F(x_{k})=\sum_{j=1}^{J}c^{f}_{j}(A^{f})^{\top}_{j,i}\nabla f_{j}(A^{f}_{j}x_{k}-b^{f}_{j})+U_{i}^{\top}Qx_{k}=\sum_{j\in\mathcal{J}^{f}(i)}c^{f}_{j}(A^{f})^{\top}_{j,i}\nabla f_{j}\big((r^{f,x}_{k})_{j}\big)+U_{i}^{\top}r^{Q,x}_{k}\,.\]

If $r^{f,x}_{k}$ and $r^{Q,x}_{k}$ are pre-computed, only $O(|\mathcal{J}^{f}(i)|+|\mathcal{J}^{Q}(i)|)$ operations are needed.

For an efficient implementation, we update the residuals $r^{f,x}_{k}$ as follows, using the fact that only the coordinate block $i_{k+1}$ is updated:

\begin{align*}
r^{f,x}_{k+1}&=A^{f}x_{k+1}-b^{f}=A^{f}\big(x_{k}+U_{i_{k+1}}(x_{k+1}^{(i_{k+1})}-x_{k}^{(i_{k+1})})\big)-b^{f}=r^{f,x}_{k}+A^{f}U_{i_{k+1}}(x_{k+1}^{(i_{k+1})}-x_{k}^{(i_{k+1})})\\
&=r^{f,x}_{k}+\sum_{j\in\mathcal{J}^{f}(i_{k+1})}V_{j}A^{f}_{j,i_{k+1}}(x_{k+1}^{(i_{k+1})}-x_{k}^{(i_{k+1})})
\end{align*}

Hence, updating $r^{f,x}_{k+1}$ also requires only $O(|\mathcal{J}^{f}(i_{k+1})|)$ operations.

Similarly, updating the residuals $r^{Q,x}_{k}$, $r^{h,x}_{k}=A^{h}x_{k}-b^{h}$, $w_{k}$ and $z_{k}$ can be done in $O(|\mathcal{J}^{Q}(i_{k+1})|)$ and $O(|\mathcal{J}(i_{k+1})|)$ operations respectively.

Although this technique is well known, it is not trivial how to write it in a generic fashion, since residual updates are needed at each iteration and should be written in a compiled language. We coded the residual update using abstract atom functions in order to achieve this goal.
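For illustration, here is a minimal pure-Python sketch (ours, not the compiled implementation) of the residual-update idea for the term $\sum_{j}\frac{1}{2}(A^{f}_{j}x-b^{f}_{j})^{2}$: the partial derivative is read off the maintained residual and the residual is refreshed after a single-coordinate move. Sparse column access plays the role of the sets $\mathcal{J}^{f}(i)$.

import numpy as np
import scipy.sparse as sp

def partial_derivative(A_csc, residual, i):
    # i-th partial derivative of 0.5 * ||A x - b||^2, read from the residual A x - b.
    start, end = A_csc.indptr[i], A_csc.indptr[i + 1]
    rows, vals = A_csc.indices[start:end], A_csc.data[start:end]
    return vals @ residual[rows]               # only the rows j in J^f(i) are touched

def update_residual(A_csc, residual, i, delta):
    # Refresh A x - b in place after the move x_i <- x_i + delta.
    start, end = A_csc.indptr[i], A_csc.indptr[i + 1]
    residual[A_csc.indices[start:end]] += delta * A_csc.data[start:end]

# Tiny coordinate descent loop on 0.5 * ||A x - b||^2 using the two helpers.
rng = np.random.default_rng(0)
A = sp.random(30, 10, density=0.3, format="csc", random_state=0)
b = rng.standard_normal(30)
x = np.zeros(10)
residual = A @ x - b                           # computed once, then maintained
lip = np.asarray(A.multiply(A).sum(axis=0)).ravel() + 1e-12
for k in range(1000):
    i = rng.integers(10)
    delta = -partial_derivative(A, residual, i) / lip[i]
    x[i] += delta
    update_residual(A, residual, i, delta)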

3.3 Computation of proximal operators using atom functions

Another major step in the method is the computation of the $i^{\text{th}}$ coordinate of $\operatorname{prox}_{\tau,G}(x)$ for a given $x\in\mathbb{R}^{N}$.

As $D^{g}$ is assumed to be diagonal, $G$ is separable. Hence, by the change of variable $\bar{z}=D^{g}_{i}\bar{x}-b^{g}_{i}$,

\begin{align*}
(\operatorname{prox}_{\tau,G}(x))_{i}&=\arg\min_{\bar{x}\in\mathbb{R}^{N_{i}}}c_{i}^{g}g_{i}(D^{g}_{i}\bar{x}-b^{g}_{i})+\frac{1}{2\tau_{i}}\lVert\bar{x}-x^{(i)}\rVert^{2}\\
&=(D_{i}^{g})^{-1}\Big(b^{g}_{i}+\arg\min_{\bar{z}\in\mathbb{R}^{N_{i}}}c_{i}^{g}g_{i}(\bar{z})+\frac{1}{2\tau_{i}}\lVert(D_{i}^{g})^{-1}(b^{g}_{i}+\bar{z})-x^{(i)}\rVert^{2}\Big)\\
&=(D_{i}^{g})^{-1}\Big(b^{g}_{i}+\arg\min_{\bar{z}\in\mathbb{R}^{N_{i}}}g_{i}(\bar{z})+\frac{1}{2c_{i}^{g}(D_{i}^{g})^{2}\tau_{i}}\lVert\bar{z}-(D^{g}_{i}x^{(i)}-b^{g}_{i})\rVert^{2}\Big)\\
&=(D_{i}^{g})^{-1}\Big(b^{g}_{i}+\operatorname{prox}_{c_{i}^{g}(D_{i}^{g})^{2}\tau_{i}g_{i}}(D_{i}^{g}x^{(i)}-b_{i}^{g})\Big)
\end{align*}

where we used the abuse of notation that $D_{i}^{g}$ is either the scaled identity matrix or any of its diagonal elements. This derivation shows that to compute $(\operatorname{prox}_{\tau,G}(x))_{i}$ we only need linear algebra and the proximal operator of the atom function $g_{i}$.

We can similarly compute the proximal operator of $H$. To compute $\operatorname{prox}_{\sigma,H^{\star}}$, we use Moreau's formula:

\[\operatorname{prox}_{\sigma,H^{\star}}(z)=z-D(\sigma)\operatorname{prox}_{\sigma^{-1},H}(D(\sigma)^{-1}z)\]
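A small Python sketch (ours; the actual implementation is compiled) of these two formulas, assuming only that the atom provides its scalar proximal operator: the prox of $x\mapsto c\,g(Dx-b)$ is obtained from the prox of $g$, and the prox of $H^{\star}$ from the prox of $H$ via Moreau's identity.

import numpy as np

def prox_scaled_atom(prox_g, x_i, tau_i, c=1.0, D=1.0, b=0.0):
    # prox of x -> c * g(D x - b) with step tau_i, built from the prox of the atom g.
    return (b + prox_g(D * x_i - b, c * D ** 2 * tau_i)) / D

def prox_conjugate(prox_h, z, sigma):
    # Moreau's formula: prox_{sigma, H*}(z) = z - sigma * prox_{1/sigma, H}(z / sigma).
    return z - sigma * prox_h(z / sigma, 1.0 / sigma)

# Example with the absolute-value atom g(t) = |t|, whose prox is soft-thresholding.
prox_abs = lambda t, step: np.sign(t) * np.maximum(np.abs(t) - step, 0.0)
print(prox_scaled_atom(prox_abs, np.array([1.3, -0.2]), tau_i=0.5, c=2.0))  # soft-thresholding by 1.0
print(prox_conjugate(prox_abs, np.array([3.0, -0.4]), sigma=2.0))           # projection onto [-1, 1]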

3.4 Duplication of dual variables

Algorithm 1 maintains duplicated dual variables $\boldsymbol{y}_{k}\in\mathbb{R}^{\mathrm{nnz}(A^{h})}$ as well as averaged dual variables $z_{k}\in\mathbb{R}^{M^{h}}$, where $M^{h}=\sum_{l=1}^{L}M^{h}_{l}$ and $A^{h}_{l,i}$ is of size $M_{l}^{h}\times N_{i}$. The sets $\mathcal{J}(i)$ for all $i$ are given by the sparse column format representation of $A^{h}$. Yet, for all $i$, we need to construct the set of indices of $\boldsymbol{y}_{k+1}$ that need to be updated. This is the table dual_vars_to_update in the code. Moreover, as $H$ is not separable in general, in order to compute $\bar{y}_{k+1}^{(j)}$ for $j\in\mathcal{J}(i_{k+1})$, we need to determine the set of dual indices $j^{\prime}$ that belong to the same block as $j$ with respect to the block decomposition of $H$. This is the purpose of the tables inv_blocks_h and blocks_h.

This procedure allows us to compute only the entries of $\bar{y}_{k+1}$ that are required for the update of $\boldsymbol{y}_{k}$.
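As a sketch of how such index sets can be read off a sparse matrix (the exact table layout of cd_solver is more involved; the variable names below are ours), the compressed sparse column and row representations of $A^{h}$ directly give $\mathcal{J}(i)$, $\mathcal{I}(j)$ and $m_{j}$:

import numpy as np
import scipy.sparse as sp

Ah = sp.random(6, 4, density=0.4, format="csc", random_state=0)

# J(i): rows j with A^h_{j,i} != 0, read from the CSC index pointer of column i.
J_of_i = [Ah.indices[Ah.indptr[i]:Ah.indptr[i + 1]] for i in range(Ah.shape[1])]

# I(j) and m_j = |I(j)|: columns i with A^h_{j,i} != 0, read from the CSR view.
Ah_csr = Ah.tocsr()
I_of_j = [Ah_csr.indices[Ah_csr.indptr[j]:Ah_csr.indptr[j + 1]] for j in range(Ah.shape[0])]
m = np.array([len(cols) for cols in I_of_j])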

4 Code structure

Figure 1: Code structure when using no screening and no inertial acceleration

The code is organized in nine files. The main file is cd_solver.pyx. It contains the Python callable and the data structure for the problem definition. The other files are atoms.pyx/pxd, algorithm.pyx/pxd, helpers.pyx/pxd and screening.pyx/pxd. They contain the definition of the atom functions, the algorithms and the functions for computing the objective value. In Figure 1, we show, for each subfunction, the function in which it is used. The user needs to call the Python class Problem and the Python function coordinate_descent. Atom functions can be added by the user without modifying the main algorithm.

All tables are defined using Numpy’s array constructor in the coordinate_descent function. The main loop of coordinate descent and the atom functions are pre-compiled for efficiency.

5 Atom functions

The code allows us to define atom functions independently of the coordinate descent algorithm. As an example, we provide in Figure 2 the code for the square function atom.

cdef DOUBLE square(DOUBLE[:] x, DOUBLE[:] buff, int nb_coord, MODE mode,
        DOUBLE prox_param, DOUBLE prox_param2) nogil:
    # Function x -> x**2
    cdef int i
    cdef DOUBLE val = 0.
    if mode == GRAD:
        for i in range(nb_coord):
            buff[i] = 2. * x[i]
        return buff[0]
    elif mode == PROX:
        for i in range(nb_coord):
            buff[i] = x[i] / (1. + 2. * prox_param)
        return buff[0]
    elif mode == PROX_CONJ:
        return prox_conj(square, x, buff, nb_coord, prox_param, prox_param2)
    elif mode == LIPSCHITZ:
        buff[0] = 2.
        return buff[0]
    elif mode == VAL_CONJ:
        return val_conj_not_implemented(square, x, buff, nb_coord)
    else:  # mode == VAL
        for i in range(nb_coord):
            val += x[i] * x[i]
        return val
Figure 2: Code for the square function atom

As inputs, it gets x (an array of numbers giving the point where the operation takes place), buff (a buffer for vectorial outputs), nb_coord (the size of x), mode, prox_param and prox_param2 (numbers needed when computing the proximal operator). The input mode can be:

  • GRAD in order to compute the gradient.

  • PROX to compute the proximal operator.

  • PROX_CONJ uses Moreau’s formula to compute the proximal operator of the conjugate function.

  • LIPSCHITZ to return the Lipschitz constant of the gradient.

  • VAL_CONJ to return the value of the conjugate function. As this mode is used only by compute_smoothed_gap for printing purposes, its implementation is optional and can be approximated using the helper function val_conj_not_implemented. Indeed, for a small $\epsilon>0$, $h^{*}(y)=\sup_{z}\langle z,y\rangle-h(z)\approx\sup_{z}\langle z,y\rangle-h(z)-\frac{\epsilon}{2}\lVert z\rVert^{2}=\langle p,y\rangle-h(p)-\frac{\epsilon}{2}\lVert p\rVert^{2}$, where $p=\operatorname{prox}_{h/\epsilon}(y/\epsilon)$.

  • VAL to return the value of the function.

Some functions naturally require multi-dimensional inputs, like $\lVert\cdot\rVert_{2}$ or the log-sum-exp function. For consistency, we define all the atoms with multi-dimensional inputs: for an atom function $f_{0}:\mathbb{R}\to\mathbb{R}$, we extend it to an atom function $f:\mathbb{R}^{N_{i}}\to\mathbb{R}$ by $f(x)=\sum_{l=1}^{N_{i}}f_{0}(x_{l})$.

For efficiency purposes, we bypass the square atom function when computing gradients and implement this case directly in the algorithm.
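To make the atom interface concrete, here is a pure-Python analogue (ours; the real atoms are compiled Cython functions with the signature of Figure 2, and the mode is an enum rather than a string) of an absolute-value atom supporting the PROX and VAL modes described above; its proximal operator is soft-thresholding.

import numpy as np

def absolute_value(x, buff, nb_coord, mode, prox_param=0.0, prox_param2=0.0):
    # Pure-Python sketch of an atom for t -> |t|, mirroring the interface of Figure 2.
    if mode == "PROX":
        # The prox of prox_param * |.| is soft-thresholding by prox_param.
        for i in range(nb_coord):
            buff[i] = np.sign(x[i]) * max(abs(x[i]) - prox_param, 0.0)
        return buff[0]
    elif mode == "VAL":
        return sum(abs(x[i]) for i in range(nb_coord))
    else:
        raise NotImplementedError(mode)

x, buff = np.array([1.5, -0.3]), np.zeros(2)
absolute_value(x, buff, 2, "PROX", prox_param=0.5)   # buff becomes [1.0, 0.0]
val = absolute_value(x, buff, 2, "VAL")              # val == 1.8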

6 Modelling language

In order to use the code in all its generality, we defined a modelling language that can be used to define the optimization problem (1) that we want to solve.

The user defines a problem using the class Problem. Its arguments can be:

  • N the number of variables, blocks the blocks of coordinates coded in the same fashion as the indptr index of sparse matrices (default [0, 1, …, N]; see the short example after this list), x_init the initial primal point (default 0) and y_init the initial duplicated dual variable (default 0)

  • Lists of strings f, g and h that code for the atom functions used. The function string_to_func is responsible for linking the atom function that corresponds to the string. Our convention is that the string code is exactly the name of the function in atoms.pyx. The size of the input of each atom function is defined in blocks_f, blocks and blocks_h. The function strings f, g or h may be absent, which means that the function does not appear in the problem to solve.

  • Arrays and matrices cf, Af, bf, cg, Dg, bg, ch, Ah, bh, Q. The class initiator transforms matrices into the sparse column format and checks whether Dg is diagonal.
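As a small illustration of the indptr-style blocks convention mentioned above (the values are chosen by us), a problem with N=6 variables split into three blocks of sizes 2, 3 and 1 would pass:

import numpy as np

# Block i spans the coordinate indices blocks[i]:blocks[i+1].
blocks = np.array([0, 2, 5, 6])      # three blocks: {0, 1}, {2, 3, 4}, {5}
block_sizes = np.diff(blocks)        # array([2, 3, 1])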

For instance, in order to solve the Lasso problem, $\min_{x}\frac{1}{2}\lVert Ax-b\rVert^{2}_{2}+\lambda\lVert x\rVert_{1}$, one can type

pb_lasso = cd_solver.Problem(N=A.shape[1], f=["square"]*A.shape[0], Af=A,
                             bf=b, cf=[0.5]*A.shape[0], g=["abs"]*A.shape[1],
                             cg=[lam]*A.shape[1])  # lam holds the regularization parameter lambda

cd_solver.coordinate_descent(pb_lasso)

7 Extensions

7.1 Non-uniform probabilities

We added the following feature for improved efficiency. Under the argument sampling='kink_half', the algorithm periodically detects the set of blocks $I_{\mathrm{kink}}$ such that $i\in I_{\mathrm{kink}}$ if $x^{(i)}$ is at a kink of $g_{i}$. Then, block $i$ is selected with probability law

\[\mathbb{P}(i_{k+1}=i)=\begin{cases}\frac{1}{n}&\text{ if }|I_{\mathrm{kink}}|=n\\ \frac{1}{2n}&\text{ if }|I_{\mathrm{kink}}|<n\text{ and }i\in I_{\mathrm{kink}}\\ \frac{1}{2n}+\frac{1}{2(n-|I_{\mathrm{kink}}|)}&\text{ if }|I_{\mathrm{kink}}|<n\text{ and }i\not\in I_{\mathrm{kink}}\end{cases}\]

The rationale for this probability law is that blocks at kinks are likely to incur no move when we try to update them. We thus put more computational effort on non-kinks. On the other hand, we still keep an update probability of at least $\frac{1}{2n}$ for each block, so even in unfavourable cases, we should not observe too much degradation in performance compared to the uniform law.
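A small Python sketch (ours) of sampling a block index from this law, assuming the set of blocks at kinks has already been detected:

import numpy as np

def kink_half_probabilities(n, kink):
    # kink is a boolean array of length n marking blocks currently at a kink of g_i.
    n_kink = int(kink.sum())
    if n_kink == n:
        return np.full(n, 1.0 / n)
    p = np.full(n, 1.0 / (2 * n))            # weight kept by every block
    p[~kink] += 1.0 / (2 * (n - n_kink))     # extra weight on the non-kink blocks
    return p

rng = np.random.default_rng(0)
kink = np.array([True, True, False, False, False])
p = kink_half_probabilities(5, kink)         # sums to 1
i = rng.choice(5, p=p)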

7.2 Acceleration

We also coded accelerated coordinate descent [18], as well as its restarted [16] and primal-dual [1] variants. The algorithm is given in Algorithm 2. As before, $\bar{y}_{k+1}$ and $\bar{x}_{k+1}$ should not be computed in full: only the relevant coordinates should be computed.

Algorithm 2 Smooth, accelerate, randomize the Coordinate Descent (APPROX/SMART-CD)

Input: Differentiable function $F:\mathbb{R}^{N}\to\mathbb{R}$, matrix $A^{h}\in\mathbb{R}^{M^{h}\times N}$, functions $G$ and $H$ whose proximal operators are available.

Initialization: Choose $x_{0}\in\mathbb{R}^{N}$, $\dot{y}_{0}\in\mathbb{R}^{M^{h}}$. Choose $\gamma_{1}>0$ and denote $B_{0}^{i}=\beta_{i}+\frac{\rho\big((A^{h}_{:,i})^{\top}A^{h}_{:,i}\big)}{\gamma_{1}}$.

Set $s=0$, $\theta_{0}=\frac{1}{n}$, $c_{0}=1$, $\hat{x}_{0}=0\in\mathbb{R}^{N}$ and $\tilde{x}_{0}=x_{0}$.

Iteration $k$: Define:

\begin{align*}
\overline{y}_{k+1}&=\operatorname{prox}_{\gamma^{-1}_{k+1},H^{\star}}\big(\dot{y}_{s}+D(\gamma_{k+1})^{-1}(c_{k}A^{h}\hat{x}_{k}+A^{h}\tilde{x}_{k})\big)\\
\overline{x}_{k+1}&=\operatorname{prox}_{\frac{\theta_{0}}{\theta_{k}}B_{k}^{-1},G}\Big(\tilde{x}_{k}-\frac{\theta_{0}}{\theta_{k}}D(B_{k})^{-1}\big(\nabla F(c_{k}\hat{x}_{k}+\tilde{x}_{k})+(A^{h})^{\top}\overline{y}_{k+1}\big)\Big)\,.
\end{align*}

For $i=i_{k+1}\sim U(\{1,\ldots,I\})$, update:

\begin{align*}
\tilde{x}_{k+1}^{(i)}&=\overline{x}_{k+1}^{(i)}\\
\hat{x}_{k+1}^{(i)}&=\hat{x}_{k}^{(i)}-\frac{1-\theta_{k}/\theta_{0}}{c_{k}}(\tilde{x}_{k+1}^{(i)}-\tilde{x}_{k}^{(i)})
\end{align*}

Otherwise, set $\tilde{x}_{k+1}^{(i^{\prime})}=\tilde{x}_{k}^{(i^{\prime})}$ and $\hat{x}_{k+1}^{(i^{\prime})}=\hat{x}_{k}^{(i^{\prime})}$.

Compute $\theta_{k+1}\in(0,1)$ as the unique positive root of

\[\begin{cases}\theta^{3}+\theta^{2}+\theta_{k}^{2}\theta-\theta_{k}^{2}=0&\text{ if }h\neq 0\\ \theta^{2}+\theta_{k}^{2}\theta-\theta_{k}^{2}=0&\text{ if }h=0\end{cases}\]

Update $\gamma_{k+2}=\frac{\gamma_{k+1}}{1+\theta_{k+1}}$, $c_{k+1}=(1-\theta_{k+1})c_{k}$ and $B_{k+1}^{i}=\beta_{i}+\frac{\rho\big((A^{h}_{:,i})^{\top}A^{h}_{:,i}\big)}{\gamma_{k+2}}$ for all $i$.
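For reference, a short Python sketch (ours, with SciPy assumed available) of this root-finding step: the quadratic case has a closed form, and the cubic polynomial is negative at 0 and positive at 1, so its unique positive root can be bracketed on (0, 1).

import numpy as np
from scipy.optimize import brentq

def next_theta(theta_k, h_is_zero):
    if h_is_zero:
        # Closed-form positive root of theta^2 + theta_k^2 * theta - theta_k^2 = 0.
        return (-theta_k**2 + np.sqrt(theta_k**4 + 4 * theta_k**2)) / 2
    # Cubic case: bracket the root of theta^3 + theta^2 + theta_k^2 * theta - theta_k^2 on (0, 1).
    p = lambda t: t**3 + t**2 + theta_k**2 * t - theta_k**2
    return brentq(p, 0.0, 1.0)

theta = 1.0 / 5                     # theta_0 = 1/n with n = 5, for example
theta = next_theta(theta, h_is_zero=False)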

If Restart($k$) is true:

Set $x_{k+1}=\tilde{x}_{k+1}+c_{k}\hat{x}_{k+1}$.

Set $\dot{y}_{s+1}=\overline{y}_{k+1}$ and $s\leftarrow s+1$.

Reset $\hat{x}_{k+1}\leftarrow 0$, $\tilde{x}_{k+1}\leftarrow x_{k+1}$, $c_{k+1}\leftarrow 1$, $\theta_{k+1}\leftarrow\theta_{0}$, $\beta_{k+1}\leftarrow\beta_{1}$.

The accelerated algorithms improve the worst case guarantee as explained in Table 5:

 $h=0$ $h\neq 0$
PD-CD Alg. 1 $O(1/k)$ $O(1/\sqrt{k})$
APPROX / SMART-CD Alg. 2 $O(1/k^{2})$ $O(1/k)$
Table 5: Convergence speed of the algorithms implemented

However, accelerated algorithms do not take advantage of regularity properties of the objective such as strong convexity. Hence, they are not guaranteed to be faster, even though restart may help.

7.3 Variable screening

The code includes the Gap Safe screening method presented in [36]. Note that the method has been studied only for the case where $h=0$. Given a non-differentiability point $\hat{x}^{(i)}$ of the function $g_{i}$ where the subdifferential $\partial g_{i}(\hat{x}^{(i)})$ has a non-empty interior, a test is derived to check whether $\hat{x}^{(i)}$ is the $i^{\text{th}}$ variable of an optimal solution. If this is the case, one can set $x^{(i)}=\hat{x}^{(i)}$ and stop updating this variable. This may lead to a huge speed-up in some cases. As the test relies on the computation of the duality gap, which has a nonnegligible cost, it is only performed from time to time.

In order to state Gap Safe screening in a general setting, we need the concept of the polar of a support function. Let $C$ be a convex set. The support function of $C$ is $\sigma_{C}$ defined by

\[\sigma_{C}(x)=\sup_{y\in C}\langle y,x\rangle\;.\]

The polar of $\sigma_{C}$ is

\[\sigma_{C}^{\circ}(\bar{x})=\sup_{x:\sigma_{C}(x)\leq 1}\langle\bar{x},x\rangle\;.\]

In particular, if $\bar{x}\in C$, then $\sigma_{C}^{\circ}(\bar{x})\leq 1$. Denote $f(z)=\sum_{j=1}^{J}c^{f}_{j}f_{j}(z^{(j)}-b^{f}_{j})$, so that $(\nabla f(A^{f}x))_{j}=c^{f}_{j}\nabla f_{j}(A^{f}_{j}x-b_{j}^{f})$, $G_{i}(x)=c^{g}_{i}g_{i}(D_{i}^{g}x^{(i)}-b_{i}^{g})$, $x_{\star}$ a solution to the optimization problem (1), and suppose we have a set $\mathcal{R}$ such that $(\nabla f(A^{f}x_{\star}-b^{f}),Qx_{\star})\in\mathcal{R}$. Gap Safe screening states that

\[\max_{(\zeta,\omega)\in\mathcal{R}}\sigma^{\circ}_{\partial G_{i}(\hat{x}^{(i)})}\big((A^{f})_{i}^{\top}\zeta+\omega^{(i)}\big)<1\quad\Rightarrow\quad x^{(i)}_{\star}=\hat{x}^{(i)}\;.\]

Denote $z=\nabla f(A^{f}x)$ and $w=Qx$ (we reuse the notation $w$ and $z$ here for the purpose of this section). We choose $\mathcal{R}$ as a sphere centered at

\[(\bar{\zeta},\bar{\omega})=\frac{(z,w)}{\max\big(1,\max_{1\leq i\leq I}\sigma_{\operatorname{dom}G_{i}^{*}}^{\circ}((A_{i}^{f})^{\top}z+w^{(i)})\big)}\]

and with radius

\[r=\sqrt{\frac{2\,\mathrm{Gap}(x,\bar{\zeta},\bar{\omega})}{L_{f,Q}}}\]

where

\[L_{f,Q}=\max\big(\max_{1\leq j\leq J}L(\nabla f_{j}),\ \max_{1\leq i\leq I}\rho(Q_{i,i})\big)\]

and

\[\mathrm{Gap}(x,\bar{\zeta},\bar{\omega})=\frac{1}{2}x^{\top}Qx+f(A^{f}x)+G(x)+G^{*}\big(-(A^{f})^{\top}\bar{\zeta}-\bar{\omega}\big)+\frac{1}{2}\bar{\omega}^{\top}Q^{\dagger}\bar{\omega}+f^{*}(\bar{\zeta})\;.\]

Note that as $\bar{\omega}$ is a rescaled version of $w=Qx$, we do not need to know $Q^{\dagger}$ in order to compute $\bar{\omega}^{\top}Q^{\dagger}\bar{\omega}$. It is proved in [36] that this set $\mathcal{R}$ contains $(\nabla f(A^{f}x_{\star}-b^{f}),Qx_{\star})$ for any optimal solution $x_{\star}$ to the primal problem. In the case where $G_{i}$ is a norm and $\hat{x}^{(i)}=0$, these expressions simplify since $\sigma^{\circ}_{\partial G_{i}(0)}=\sigma^{\circ}_{\operatorname{dom}G_{i}^{*}}$ is nothing other than the dual norm associated with $G_{i}$.

For the estimation of $\max_{(\zeta,\omega)\in\mathcal{R}}\sigma^{\circ}_{\partial G_{i}(\hat{x}^{(i)})}((A^{f})_{i}^{\top}\zeta+\omega^{(i)})$, we use the fact that the polar of a support function is sublinear and positively homogeneous. Indeed, we have

\begin{align*}
\sigma^{\circ}_{\partial G_{i}(\hat{x}^{(i)})}\big((A^{f})_{i}^{\top}\zeta+\omega^{(i)}\big)&=\sup_{x\;:\;\sigma_{\partial G_{i}(\hat{x}^{(i)})}(x)\leq 1}\langle(A^{f})_{i}^{\top}\zeta+\omega^{(i)},x\rangle=\sup_{x\;:\;\sigma_{c^{g}_{i}D^{g}_{i}\partial g_{i}(D^{g}_{i}\hat{x}^{(i)}-b^{g}_{i})}(x)\leq 1}\langle(A^{f})_{i}^{\top}\zeta+\omega^{(i)},x\rangle\\
&=\frac{1}{c^{g}_{i}D^{g}_{i}}\sigma^{\circ}_{\partial g_{i}(D^{g}_{i}\hat{x}^{(i)}-b^{g}_{i})}\big((A^{f})_{i}^{\top}\zeta+\omega^{(i)}\big)=\frac{1}{c^{g}_{i}D^{g}_{i}}\sigma^{\circ}_{\partial g_{i}(\hat{x}_{g_{i}})}\big((A^{f})_{i}^{\top}\zeta+\omega^{(i)}\big)\\
&\leq\frac{1}{c^{g}_{i}D^{g}_{i}}\sigma^{\circ}_{\partial g_{i}(\hat{x}_{g_{i}})}\big((A^{f})_{i}^{\top}\bar{\zeta}+\bar{\omega}^{(i)}\big)+r\sup_{(u,v):\lVert(u,v)\rVert=1}\frac{1}{c^{g}_{i}D^{g}_{i}}\sigma^{\circ}_{\partial g_{i}(\hat{x}_{g_{i}})}\big((A^{f})_{i}^{\top}u+v^{(i)}\big)\;.
\end{align*}

Here $\hat{x}_{g_{i}}=D^{g}_{i}\hat{x}^{(i)}-b^{g}_{i}$ is a point where $\partial g_{i}(\hat{x}_{g_{i}})$ has a nonempty interior. Some care should be taken when $\partial G_{i}(\hat{x}^{(i)})$ is unbounded, so we first check whether $(A^{f})_{i}^{\top}\zeta+\omega^{(i)}\in\operatorname{dom}\sigma^{\circ}_{\partial g_{i}(\hat{x}_{g_{i}})}$ for all $(\zeta,\omega)\in\mathcal{R}$.

Here also, the novelty lies in the genericity of the implementation.
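To give one concrete instance of the test (a sketch, specialized by us to the Lasso $\min_{x}\frac{1}{2}\lVert Ax-b\rVert_{2}^{2}+\lambda\lVert x\rVert_{1}$, for which the polar above reduces to the classical dual-norm sphere test of [36]):

import numpy as np

def gap_safe_screen_lasso(A, b, x, lam):
    # Gap Safe sphere test for the Lasso: returns a mask of coordinates that can be
    # safely fixed to zero (a sketch of the general test of this section).
    residual = A @ x - b
    corr = A.T @ residual
    theta = -residual / max(lam, np.max(np.abs(corr)))      # dual-feasible rescaled residual
    primal = 0.5 * residual @ residual + lam * np.abs(x).sum()
    dual = 0.5 * (b @ b) - 0.5 * lam ** 2 * np.sum((theta - b / lam) ** 2)
    radius = np.sqrt(2 * max(primal - dual, 0.0)) / lam
    col_norms = np.sqrt((A ** 2).sum(axis=0))
    return np.abs(A.T @ theta) + radius * col_norms < 1.0

rng = np.random.default_rng(0)
A, b = rng.standard_normal((40, 15)), rng.standard_normal(40)
screened = gap_safe_screen_lasso(A, b, np.zeros(15), lam=0.5 * np.max(np.abs(A.T @ b)))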

8 Numerical validation

8.1 Performance

In order to evaluate the performance of the implementation, we compare our implementation with a pure Python coordinate descent solver and code written for specific problems: Scikit learn’s Lasso solver and Liblinear’s SVM solver. We run the code on an Intel Xeon CPU at 3.07GHz.

Lasso
Pure Python 308.76s
cd_solver 0.43s
Scikit-learn Lasso 0.11s
SVM
Pure Python 126.24s
cd_solver 0.31s
Liblinear SVM 0.13s
Table 6: Comparison of our code with a pure Python code and reference implementations for performing $100N$ coordinate descent iterations for the Lasso problem on the Leukemia dataset with regularization parameter $\lambda=0.1\lVert(A^{f})^{\top}b^{f}\rVert_{\infty}$, and for $10N$ coordinate descent iterations for the dual SVM problem on the RCV1 dataset with penalty parameter $C=10$.

We can see in Table 6 that our code is hundreds of times faster than the pure Python code. This is due to the compiled nature of our code, which does not pay an interpretation overhead at each of the huge number of iterations required by coordinate descent. On the other hand, our code is about 4 times slower than state-of-the-art coordinate descent implementations designed for a specific problem, as can be seen in both examples we chose. This overhead is the price of genericity.

We believe that, except for critical applications like Lasso or SVM, a 4 times speed-up does not justify writing a new code from scratch, since a separate piece of code for each problem makes it difficult to maintain and to improve with future algorithmic advances.

8.2 Genericity

We tested our algorithm on the following problems:

  • Lasso problem

    \[\min_{x\in\mathbb{R}^{n}}\frac{1}{2}\lVert Ax-b\rVert_{2}^{2}+\lambda\lVert x\rVert_{1}\]
  • Binomial logistic regression

    \[\min_{x\in\mathbb{R}^{n}}\sum_{i=1}^{m}\log\big(1+\exp(b_{i}(Ax)_{i})\big)+\frac{\lambda}{2}\lVert x\rVert_{2}^{2}\]

    where $b_{i}\in\{-1,1\}$ for all $i$.

  • Sparse binomial logistic regression

    \[\min_{x\in\mathbb{R}^{n}}\sum_{i=1}^{m}\log\big(1+\exp(b_{i}(Ax)_{i})\big)+\lambda\lVert x\rVert_{1}\]
  • Dual SVM without intercept

    \[\min_{x\in\mathbb{R}^{n}}\frac{1}{2\alpha}\lVert A^{\top}D(b)x\rVert^{2}_{2}-e^{\top}x+\iota_{[0,1]^{n}}(x)\]

    where $\iota_{[0,1]^{n}}$ is the convex indicator function of the set $[0,1]^{n}$ and encodes the constraint $x\in[0,1]^{n}$.

  • Dual SVM with intercept

    \[\min_{x\in\mathbb{R}^{n}}\frac{1}{2\alpha}\lVert A^{\top}D(b)x\rVert^{2}_{2}-e^{\top}x+\iota_{[0,1]^{n}}(x)+\iota_{\{0\}}(b^{\top}x)\]
  • Linearly constrained quadratic program

    \[\min_{x\in\mathbb{R}^{n}}\frac{1}{2}\lVert(A^{f})^{\top}x-b^{f}\rVert^{2}_{2}+\iota_{\{0\}}(A^{h}x-b^{h})\]
  • Linear program

    \[\min_{x\in\mathbb{R}^{n}}c^{\top}x+\iota_{\mathbb{R}^{n}_{+}}(x)+\iota_{\mathbb{R}^{m}_{-}}(Ax-b)\]
  • TV+$\ell_{1}$-regularized regression

    \[\min_{x\in\mathbb{R}^{n_{1}n_{2}n_{3}}}\frac{1}{2}\lVert Ax-b\rVert_{2}^{2}+\alpha_{1}\lVert Dx\rVert_{2,1}+\alpha_{2}\lVert x\rVert_{1}\]

    where $D$ is the discrete gradient operator and $\lVert y\rVert_{2,1}=\sum_{i,j,k}\sqrt{\sum_{l=1}^{3}y_{i,j,k,l}^{2}}$.

  • Sparse multinomial logistic regression

    \[\min_{x\in\mathbb{R}^{n\times q}}\sum_{i=1}^{m}\log\Big(\sum_{j=1}^{q}\exp\big(\sum_{l=1}^{n}A_{i,l}x_{l,j}\big)\Big)+\sum_{i=1}^{n}\sum_{j=1}^{q}x_{i,j}b_{i,j}+\sum_{l=1}^{n}\sqrt{\sum_{j=1}^{q}x_{l,j}^{2}}\]

    where $b_{i,j}\in\{0,1\}$ for all $i,j$.

This list demonstrates that the method is able to deal with differentiable functions, separable or nonseparable nondifferentiable functions, as well as use several types of atom function in a single problem.

8.3 Benchmarking

In Table 7, we compare the performance of Algorithm 1 with and without screening, Algorithm 2 with and without screening as well as 2 alternative solvers for 3 problems exhibiting various situations:

  • Lasso: the nonsmooth function in the Lasso problem is separable;

  • the TV-regularized regression problem has a nonsmooth, nonseparable regularizer whose matrix $A^{h}$ is sparse;

  • the dual SVM with intercept has a single linear nonseparable constraint.

For Algorithm 2, we set the restart with a variable sequence as in [16]. We did not tune the algorithmic parameters for each instance. We evaluate the precision of a primal-dual pair $(x,y)$ as follows. We define the smoothed gap [59] as

\begin{align*}
\mathcal{G}_{\beta,\gamma}(x,y,\zeta,\omega)=\ &\frac{1}{2}x^{\top}Qx+f(A^{f}x)+G(x)+\max_{y^{\prime}}\Big(\langle A^{h}x,y^{\prime}\rangle-H^{*}(y^{\prime})-\frac{\beta}{2}\lVert y-y^{\prime}\rVert^{2}\Big)\\
&+H^{*}(y)+\frac{1}{2}\omega^{\top}Q^{\dagger}\omega+f^{*}(\zeta)+\max_{x^{\prime}}\Big(\langle-(A^{h})^{\top}y-(A^{f})^{\top}\zeta-\omega,x^{\prime}\rangle-G(x^{\prime})-\frac{\gamma}{2}\lVert x-x^{\prime}\rVert^{2}\Big)
\end{align*}

and we choose the positive parameters $\beta$ and $\gamma$ as

\[\beta=\operatorname{dist}(A^{h}x,\operatorname{dom}H)\qquad\gamma=\operatorname{dist}\big(-(A^{h})^{\top}y-(A^{f})^{\top}\zeta-\omega,\operatorname{dom}G^{*}\big)\;.\]

It is shown in [59] that when $\mathcal{G}_{\beta,\gamma}(x,y,\zeta,\omega)$, $\beta$ and $\gamma$ are small, then the objective value and feasibility gaps are small.

For the Lasso problems, we compare our implementations of the coordinate descent method with scikit-learn's coordinate descent [42] and the augmented Lagrangian solver OSQP called through CVXPY [12]. As in Table 6, there is a factor of about 4 between Alg. 1 without screening and scikit-learn. Acceleration and screening allow us to reduce this gap without sacrificing generality. OSQP is efficient on small problems but is not competitive on larger instances.

For the TV-regularized regression problems, we compare ourselves with FISTA, where the proximal operator of the total variation is computed inexactly, and with L-BFGS, where the total variation is smoothed with a decreasing smoothing parameter. Those two methods were implemented for [13]. They manage to solve the problem to an acceptable accuracy in a few hours. As the problem is rather large, we did not run OSQP on it. For our methods, as $h$ is nonzero, we cannot use variable screening with the current theory. Alg. 1 quickly reduces the objective value but fails to reach a high precision in a reasonable amount of time. On the other hand, Alg. 2 is the quickest among the four solvers tested here.

The third problem we tried is the dual support vector machine with intercept. A well-known solver is libsvm [10], which implements SMO [43], a 2-coordinate descent method that ensures the feasibility of the constraint at each iteration. The conclusions are similar to what we have seen above. The specialized solver remains a natural choice. OSQP can only solve small instances. Alg. 1 has trouble finding high-accuracy solutions. Alg. 2 is competitive with respect to the specialized solver.

Problem Dataset Alternative solver 1 Alternative solver 2 Algorithm 1 Algorithm 2
Lasso ($\lambda=0.1\lVert X^{\top}y\rVert_{\infty}$) triazines ($J$=186, $I$=60) Scikit-learn: 0.005s OSQP: 0.033s 0.107s; scr: 0.101s 0.066s; scr: 0.079s
$\epsilon=10^{-6}$ scm1d ($J$=9,803, $I$=280) Scikit-learn: 24.40s OSQP: 33.21s 91.97s; scr: 8.73s 24.13s; scr: 3.21s
news20.binary ($J$=19,996, $I$=1,355,191) Scikit-learn: 64.6s OSQP: >2,000s 267.3s; scr: 114.4s 169.8s; scr: 130.7s
TV-regularized regression fMRI [58, 13] ($\alpha_{1}=\alpha_{2}=5\cdot 10^{-3}$) inexact FISTA: 24,341s LBFGS+homotopy: 6,893s >25,000s 2,734s
$\epsilon=1$ ($J$=768, $I$=65,280, $L$=195,840)
Dual SVM with intercept ionosphere ($J$=14, $I$=351) libsvm: 0.04s OSQP: 0.12s 3.23s 0.42s
$\epsilon=10^{-3}$ leukemia ($J$=7,129, $I$=72) libsvm: 0.1s OSQP: 3.2s 25.5s 0.8s
madelon ($J$=500, $I$=2,600) libsvm: 50s OSQP: 37s 3842s 170s
gisette ($J$=5,000, $I$=6,000) libsvm: 70s OSQP: 936s 170s 901s
rcv1 ($J$=47,236, $I$=20,242) libsvm: 195s OSQP: memory error >5000s 63s
Table 7: Time to reach precision $\epsilon$. scr = with gap safe screening. For SVM, we scale the data so that in each column of the data matrix $A$, $A_{i,j}\in[-1,1]$.

9 Conclusion

This paper introduces a generic coordinate descent solver. The technical challenge behind the implementation is that the low-level operations, namely partial derivative and proximal operator computations, fundamentally need to be written in a compiled language. We solved this by using pre-defined atom functions that are combined at run time through a Python interface.

We show how genericity allows us to decouple algorithm development from a particular application problem. As an example, our software can solve at least 12 types of optimization problems on large instances using primal-dual coordinate descent, momentum acceleration, restart and variable screening.

As future work, apart from keeping the algorithm up to date with the state of the art, we plan to bind our solver with CVXPY in order to further simplify the user experience.

References

  • [1] Ahmet Alacaoglu, Quoc Tran Dinh, Olivier Fercoq, and Volkan Cevher. Smooth primal-dual coordinate descent algorithms for nonsmooth convex optimization. In Advances in Neural Information Processing Systems, pages 5852–5861, 2017.
  • [2] Zeyuan Allen-Zhu, Zheng Qu, Peter Richtárik, and Yang Yuan. Even faster accelerated coordinate descent using non-uniform sampling. In International Conference on Machine Learning, pages 1110–1119, 2016.
  • [3] Alfred Auslender. Optimisation: Méthodes Numériques. Masson, Paris, 1976.
  • [4] Amir Beck and Luba Tetruashvili. On the convergence of block coordinate descent type methods. SIAM journal on Optimization, 23(4):2037–2060, 2013.
  • [5] Pascal Bianchi, Walid Hachem, and Iutzeler Franck. A stochastic coordinate descent primal-dual algorithm and applications. In 2014 IEEE International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6. IEEE, 2014.
  • [6] Mathieu Blondel and Fabian Pedregosa. Lightning: large-scale linear classification, regression and ranking in Python, 2016.
  • [7] Joseph K. Bradley, Aapo Kyrola, Danny Bickson, and Carlos Guestrin. Parallel Coordinate Descent for L1-Regularized Loss Minimization. In 28th International Conference on Machine Learning, 2011.
  • [8] Patrick Breheny and Jian Huang. Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. The annals of applied statistics, 5(1):232, 2011.
  • [9] Antonin Chambolle, Matthias J Ehrhardt, Peter Richtárik, and Carola-Bibiane Schönlieb. Stochastic primal-dual hybrid gradient algorithm with arbitrary sampling and imaging applications. SIAM Journal on Optimization, 28(4):2783–2808, 2018.
  • [10] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.
  • [11] Patrick L Combettes and Jean-Christophe Pesquet. Stochastic Quasi-Fejér Block-Coordinate Fixed Point Iterations with Random Sweeping. SIAM Journal on Optimization, 25(2):1221–1248, 2015.
  • [12] Steven Diamond and Stephen Boyd. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1–5, 2016.
  • [13] Elvis Dohmatob, Alexandre Gramfort, Bertrand Thirion, and Gaėl Varoquaux. Benchmarking solvers for TV-l1 least-squares and logistic regression in brain imaging. In Pattern Recognition in Neuroimaging (PRNI). IEEE, 2014.
  • [14] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. Liblinear: A library for large linear classification. Journal of machine learning research, 9(Aug):1871–1874, 2008.
  • [15] Olivier Fercoq and Pascal Bianchi. A coordinate-descent primal-dual algorithm with large step size and possibly nonseparable functions. SIAM Journal on Optimization, 29(1):100–134, 2019.
  • [16] Olivier Fercoq and Zheng Qu. Restarting the accelerated coordinate descent method with a rough strong convexity estimate. arXiv preprint arXiv:1803.05771, 2018.
  • [17] Olivier Fercoq, Zheng Qu, Peter Richtárik, and Martin Takáč. Fast distributed coordinate descent for minimizing non-strongly convex losses. In IEEE International Workshop on Machine Learning for Signal Processing, 2014.
  • [18] Olivier Fercoq and Peter Richtárik. Accelerated, parallel and proximal coordinate descent. SIAM Journal on Optimization, 25(4):1997–2023, 2015.
  • [19] Olivier Fercoq and Peter Richtárik. Smooth minimization of nonsmooth functions with parallel coordinate descent methods. In Modeling and Optimization: Theory and Applications, pages 57–96. Springer, 2017.
  • [20] Jerome Friedman, Trevor Hastie, Holger Höfling, and Robert Tibshirani. Pathwise coordinate optimization. Ann. Appl. Stat., 1(2):302–332, 2007.
  • [21] Xiang Gao, Yangyang Xu, and Shuzhong Zhang. Randomized primal-dual proximal block coordinate updates. arXiv preprint arXiv:1605.05969, 2016.
  • [22] Tobias Glasmachers and Urun Dogan. Accelerated coordinate descent with adaptive coordinate frequencies. In Asian Conference on Machine Learning, pages 72–86, 2013.
  • [23] Luigi Grippo and Marco Sciandrone. On the convergence of the block nonlinear gauss–seidel method under convex constraints. Operations research letters, 26(3):127–136, 2000.
  • [24] Mingyi Hong, Tsung-Hui Chang, Xiangfeng Wang, Meisam Razaviyayn, Shiqian Ma, and Zhi-Quan Luo. A block successive upper bound minimization method of multipliers for linearly constrained convex optimization. arXiv preprint arXiv:1401.7079, 2014.
  • [25] Cho-Jui Hsieh and Inderjit S Dhillon. Fast coordinate descent methods with variable selection for non-negative matrix factorization. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1064–1072. ACM, 2011.
  • [26] Puya Latafat, Nikolaos M Freris, and Panagiotis Patrinos. A new randomized block-coordinate primal-dual proximal algorithm for distributed optimization. IEEE Transactions on Automatic Control, 2019.
  • [27] Yin Tat Lee and Aaron Sidford. Efficient Accelerated Coordinate Descent Methods and Faster Algorithms for Solving Linear Systems. arXiv:1305.1922, 2013.
  • [28] Dennis Leventhal and Adrian Lewis. Randomized Methods for Linear Constraints: Convergence Rates and Conditioning. Mathematics of Operations Research, 35(3):641–654, 2010.
  • [29] Huan Li and Zhouchen Lin. On the complexity analysis of the primal solutions for the accelerated randomized dual coordinate ascent. arXiv preprint arXiv:1807.00261, 2018.
  • [30] Qihang Lin, Zhaosong Lu, and Lin Xiao. An accelerated proximal coordinate gradient method. In Advances in Neural Information Processing Systems, pages 3059–3067, 2014.
  • [31] Ji Liu, Stephen J Wright, Christopher Ré, Victor Bittorf, and Srikrishna Sridhar. An asynchronous parallel stochastic coordinate descent algorithm. The Journal of Machine Learning Research, 16(1):285–322, 2015.
  • [32] Zhaosong Lu and Lin Xiao. Randomized block coordinate non-monotone gradient method for a class of nonlinear programming. arXiv preprint arXiv:1306.5918, 1273, 2013.
  • [33] Zhaosong Lu and Lin Xiao. On the complexity analysis of randomized block-coordinate descent methods. Mathematical Programming, 152(1-2):615–642, 2015.
  • [34] Zhi-Quan Luo and Paul Tseng. On the convergence of the coordinate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications, 72(1):7–35, 1992.
  • [35] Rahul Mazumder, Jerome H Friedman, and Trevor Hastie. Sparsenet: Coordinate descent with nonconvex penalties. Journal of the American Statistical Association, 106(495):1125–1138, 2011.
  • [36] Eugène Ndiaye. Safe Optimization Algorithms for Variable Selection and Hyperparameter Tuning. PhD thesis, Université Paris-Saclay, 2018.
  • [37] Ion Necoara and Dragos Clipici. Efficient parallel coordinate descent algorithm for convex optimization problems with separable constraints: application to distributed MPC. Journal of Process Control, 23:243–253, 2013.
  • [38] Ion Necoara, Yurii Nesterov, and Francois Glineur. Efficiency of randomized coordinate descent methods on optimization problems with linearly coupled constraints. Technical report, Politehnica University of Bucharest, 2012.
  • [39] Yurii Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.
  • [40] Yurii Nesterov. Subgradient Methods for Huge-Scale Optimization Problems. CORE DISCUSSION PAPER 2012/2, 2012.
  • [41] Andrei Patrascu and Ion Necoara. Efficient random coordinate descent algorithms for large-scale structured nonconvex optimization. Journal of Global Optimization, 61(1):19–46, 2015.
  • [42] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in python. Journal of machine learning research, 12(Oct):2825–2830, 2011.
  • [43] John C. Platt. Fast Training of Support Vector Machines using Sequential Minimal Optimization. In Bernhard Schölkopf, Christopher J. C. Burges, and Alexander J. Smola, editors, Advances in Kernel Methods - Support Vector Learning. MIT Press, 1999. http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets.
  • [44] Meisam Razaviyayn, Mingyi Hong, and Zhi-Quan Luo. A unified convergence analysis of block successive minimization methods for nonsmooth optimization. SIAM Journal on Optimization, 23(2):1126–1153, 2013.
  • [45] Peter Richtárik and Martin Takáč. Distributed coordinate descent method for learning with big data. The Journal of Machine Learning Research, 17(1):2657–2681, 2016.
  • [46] Peter Richtárik and Martin Takáč. On Optimal Probabilities in Stochastic Coordinate Descent Methods. Preprint arXiv:1310.3438, 2013.
  • [47] Peter Richtárik and Martin Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, Ser. A, 144(1-2):1–38, 2014.
  • [48] Peter Richtárik and Martin Takáč. Parallel coordinate descent methods for big data optimization. Mathematical Programming, pages 1–52, 2015.
  • [49] Ludwig Seidel. Ueber ein Verfahren, die Gleichungen, auf welche die Methode der kleinsten Quadrate führt, sowie lineäre Gleichungen überhaupt, durch successive Annäherung aufzulösen:(Aus den Abhandl. dk bayer. Akademie II. Cl. XI. Bd. III. Abth. Verlag der k. Akademie, 1874.
  • [50] Shai Shalev-Shwartz and Ambuj Tewari. Stochastic methods for $\ell_{1}$-regularized loss minimization. Journal of Machine Learning Research, 12:1865–1892, 2011.
  • [51] Shai Shalev-Shwartz and Tong Zhang. Proximal Stochastic Dual Coordinate Ascent. arXiv:1211.2717, 2012.
  • [52] Shai Shalev-Shwartz and Tong Zhang. Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization. Journal of Machine Learning Research, 14(1):567–599, 2013.
  • [53] Shai Shalev-Shwartz and Tong Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. In International Conference on Machine Learning, pages 64–72, 2014.
  • [54] Hao-Jun Michael Shi, Shenyinying Tu, Yangyang Xu, and Wotao Yin. A primer on coordinate descent algorithms. arXiv preprint arXiv:1610.00040, 2016.
  • [55] Tao Sun, Robert Hannah, and Wotao Yin. Asynchronous coordinate descent under more realistic assumptions. In Advances in Neural Information Processing Systems, pages 6182–6190, 2017.
  • [56] Martin Takáč, Avleen Bijral, Peter Richtárik, and Nathan Srebro. Mini-batch primal and dual methods for SVMs. In 30th International Conference on Machine Learning, 2013.
  • [57] Qing Tao, Kang Kong, Dejun Chu, and Gaowei Wu. Stochastic Coordinate Descent Methods for Regularized Smooth and Nonsmooth Losses. Machine Learning and Knowledge Discovery in Databases, pages 537–552, 2012.
  • [58] Sabrina M Tom, Craig R Fox, Christopher Trepel, and Russell A Poldrack. The neural basis of loss aversion in decision-making under risk. Science, 315(5811):515–518, 2007.
  • [59] Quoc Tran-Dinh, Olivier Fercoq, and Volkan Cevher. A smooth primal-dual optimization framework for nonsmooth composite convex minimization. SIAM Journal on Optimization, 28(1):96–134, 2018.
  • [60] Paul Tseng and Sangwoon Yun. A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming, 117:387–423, 2009.
  • [61] Paul Tseng and Sangwoon Yun. Block-Coordinate Gradient Descent Method for Linearly Constrained Non Separable Optimization. Journal of Optimization Theory and Applications, 140:513–535, 2009.
  • [62] Jack Warga. Minimizing certain convex functions. Journal of the Society for Industrial & Applied Mathematics, 11(3):588–593, 1963.
  • [63] Stephen J Wright. Coordinate descent algorithms. Mathematical Programming, 151(1):3–34, 2015.
  • [64] Yangyang Xu and Wotao Yin. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM Journal on imaging sciences, 6(3):1758–1789, 2013.
  • [65] Yangyang Xu and Wotao Yin. A globally convergent algorithm for nonconvex optimization based on block coordinate update. Journal of Scientific Computing, 72(2):700–734, 2017.
  • [66] Yangyang Xu and Shuzhong Zhang. Accelerated primal–dual proximal block coordinate updating methods for constrained convex optimization. Computational Optimization and Applications, 70(1):91–128, 2018.
  • [67] Hsiang-Fu Yu, Fang-Lan Huang, and Chih-Jen Lin. Dual coordinate descent methods for logistic regression and maximum entropy models. Machine Learning, 85(1-2):41–75, 2011.
  • [68] Sangwoon Yun. On the iteration complexity of cyclic coordinate gradient descent methods. SIAM Journal on Optimization, 24(3):1567–1580, 2014.