
An Inexact Successive Quadratic Approximation Method for Convex L-1 Regularized Optimization

Richard H. Byrd Department of Computer Science, University of Colorado, Boulder, CO, USA. This author was supported by National Science Foundation grant DMS-1216554 and Department of Energy grant DE-SC0001774.    Jorge Nocedal Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, IL, USA. This author was supported by National Science Foundation grant DMS-0810213, and by Department of Energy grant DE-FG02-87ER25047.    Figen Oztoprak Istanbul Technical University. This author was supported by Department of Energy grant DE-SC0001774, and by a grant from Tubitak.
Abstract

We study a Newton-like method for the minimization of an objective function ϕ\phi that is the sum of a smooth convex function and an 1\ell_{1} regularization term. This method, which is sometimes referred to in the literature as a proximal Newton method, computes a step by minimizing a piecewise quadratic model qkq_{k} of the objective function ϕ\phi. In order to make this approach efficient in practice, it is imperative to perform this inner minimization inexactly. In this paper, we give inexactness conditions that guarantee global convergence and that can be used to control the local rate of convergence of the iteration. Our inexactness conditions are based on a semi-smooth function that represents a (continuous) measure of the optimality conditions of the problem, and that embodies the soft-thresholding iteration. We give careful consideration to the algorithm employed for the inner minimization, and report numerical results on two test sets originating in machine learning.

1 Introduction

In this paper, we study an inexact Newton-like method for solving optimization problems of the form

minxnϕ(x)=f(x)+μx1,\min_{x\in\mathbb{R}^{n}}\ \phi(x)=f(x)+\mu\|x\|_{1}, (1.1)

where ff is a smooth convex function and μ>0\mu>0 is a (fixed) regularization parameter. The method constructs, at every iteration, a piecewise quadratic model of ϕ\phi and minimizes this model inexactly to obtain a new estimate of the solution.

The piecewise quadratic model is defined, at an iterate xkx_{k}, as

qk(x)=f(xk)+g(xk)T(xxk)+12(xxk)THk(xxk)+μx1,q_{k}(x)=f(x_{k})+g(x_{k})^{T}(x-x_{k})+\textstyle\frac{1}{2}(x-x_{k})^{T}H_{k}(x-x_{k})+\mu\|x\|_{1}, (1.2)

where g(xk)=deff(xk)g(x_{k})\stackrel{{\scriptstyle\rm def}}{{=}}\nabla f(x_{k}) and HkH_{k} denotes the Hessian 2f(xk)\nabla^{2}f(x_{k}) or a quasi-Newton approximation to it. After computing an approximate solution x^\hat{x} of this model, the algorithm performs a backtracking line search along the direction dk=x^xkd_{k}=\hat{x}-x_{k} to ensure decrease in the objective ϕ\phi.

We refer to this method as the successive quadratic approximation method in analogy to the successive quadratic programming method for nonlinear programming. This method is also known in the literature as a “proximal Newton method” [20, 24], but we prefer not to use the term “proximal” in this context since the quadratic term in (1.2)(\ref{quadm}) is better interpreted as a second-order model rather than as a term that simply restricts the size of the step. The paper covers both the cases when the quadratic model qkq_{k} is constructed with an exact Hessian or a quasi-Newton approximation.

The two crucial ingredients in the inexact successive quadratic approximation method are the algorithm used for the minimization of the model qkq_{k}, and the criterion that controls the degree of inexactness in this minimization. In the first part of the paper, we propose an inexactness criterion for the minimization of qkq_{k} and prove that it guarantees global convergence of the iterates, and that it can be used to control the local rate of convergence. This criterion is based on the optimality conditions for the minimization of (1.2)(\ref{quadm}), expressed in the form of a semi-smooth function that is derived from the soft-thresholding operator.

The second part of the paper is devoted to the practical implementation of the method. Here, the choice of algorithm for the inner minimization of the model qkq_{k} is vital, and we consider two options: fista [4], which is a first-order method, and an orthant-based method [1, 7, 8]. The latter is a second-order method where each iteration consists of an orthant-face identification phase, followed by the minimization of a smooth model restricted to that orthant. The subspace minimization can be performed by computing a quasi-Newton step or a Newton-CG step (we explore both options). A projected backtracking line search is then applied; see section 5.3.

Some recent work on successive quadratic approximation methods for problem (1.1) includes: Hsieh et al. [12], where (1.2) is solved using a coordinate descent method, and which focuses on the inverse covariance selection problem; [25], which also employs coordinate descent but uses a different working set identification than [12], and makes use of a quasi-Newton model; and Olsen et al. [18], where the inner solver is fista. None of these papers address convergence for inexact solutions of the subproblem. Recently Lee, Sun and Saunders [13] presented an inexact proximal Newton method that, at first glance, appears to be very close to the method presented here. Their inexactness criterion is, however, different from ours and suffers from a number of drawbacks, as discussed in section 2.

Inexact methods for solving generalized equations have been studied by Patriksson [21], and more recently by Dontchev and Rockafellar [10]. Special cases of the general methods described in those papers result in inexact sequential quadratic approximation algorithms. Patriksson [21] presents convergence analyses based on two conditions for controlling inexactness. The first is based on running the subproblem solver for a limited number of steps. The second rule requires that the residual norm be sufficiently small, but it does not cover the inexactness conditions presented in this paper (since the residual is computed differently and their inexactness measure is different from ours). The rule suggested in Dontchev and Rockafellar [10] is very general, but it too does not cover the condition presented in this paper. Our rule, and those presented in [10, 13], are inspired by the classical inexactness condition proposed by Dembo et al. [9], and reduce to it for the smooth unconstrained minimization case (i.e. when μ=0\mu=0).

Another line of research that is relevant to this paper is the global and rate of convergence analysis for inexact proximal-gradient algorithms, which can be seen as special cases of sequential quadratic approximation without acceleration [15, 23, 26]. The inexactness conditions applied in those papers require that the subproblem objective function value be ϵ\epsilon-close to the optimal subproblem objective [15, 26], or that the approximate solution be exact with respect to an ϵ\epsilon-perturbed subdifferential [23], for a decreasing sequence {ϵ}\{\epsilon\}.

Our interest in the successive quadratic approximation method is motivated by the fact that it has not received sufficient attention from a practical perspective, where inexact solutions to the inner problem (1.2) are imperative. Although a number of studies have been devoted to the formulation and analysis of proximal Newton methods for convex composite optimization problems, as mentioned above, the viability of the approach in practice has not been fully explored.

This paper is organized into six sections. In section 2 we outline the algorithm, including the inexactness criteria that govern the solution of the subproblem (1.2)(\ref{quadm}). In sections 3 and 4, we analyze the global and local convergence properties of the algorithm. Numerical experiments are reported in section 5. The paper concludes in section 6 with a summary of our findings, and a list of questions to explore.


Notation. In the remainder, we let g(xk)=f(xk)g(x_{k})=\nabla f(x_{k}), and let \|\cdot\| denote any vector norm. We sometimes abbreviate successive quadratic approximation method as “SQA method”, and note that this algorithm is often referred to in the literature as the “proximal Newton method”.

2 The Algorithm

Given an iterate xkx_{k}, an iteration of the algorithm begins by forming the model (1.2)(\ref{quadm}), where μ>0\mu>0 is a given scalar and Hk0H_{k}\succ 0 is an approximation to the Hessian 2f(xk)\nabla^{2}f(x_{k}). Next, the algorithm computes an approximate minimizer x^\hat{x} of the subproblem

minxnqk(x)=f(xk)+g(xk)T(xxk)+12(xxk)THk(xxk)+μx1.\min_{x\in\mathbb{R}^{n}}\ q_{k}(x)=f(x_{k})+g(x_{k})^{T}(x-x_{k})+\textstyle\frac{1}{2}(x-x_{k})^{T}H_{k}(x-x_{k})+\mu\|x\|_{1}. (2.1)

The point x^\hat{x} defines the search direction dk=x^xkd_{k}=\hat{x}-x_{k}. The algorithm then performs a backtracking line search along the direction dkd_{k} that ensures sufficient decrease in the objective ϕ\phi. The minimization of (2.1) should be performed by a method that exploits the structure of this problem.

In order to compute an adequate approximate solution to (1.2)(\ref{quadm}), we need some measure of closeness to optimality. In the case of smooth unconstrained optimization, (i.e. (1.1)(\ref{l1prob}) with μ=0\mu=0), the norm of the gradient is a standard measure of optimality, and it is common [9] to require the approximate solution x^\hat{x} to satisfy the condition

g(xk)+Hk(x^xk)ηkg(xk),0<ηk<1.\|g(x_{k})+H_{k}(\hat{x}-x_{k})\|\leq\eta_{k}\|g(x_{k})\|,\quad\quad 0<\eta_{k}<1. (2.2)

The term on the left side of (2.2)(\ref{smoothcond}) is a measure of optimality for the model qk(x)q_{k}(x), in the unconstrained case.

For problem (1.1), the length of the iterative soft-thresholding (ISTA) step is a natural measure of optimality. The ISTA iteration is given by

xista=argminxg(xk)T(xxk)+12τxxk2+μx1,x_{\rm ista}=\arg\min_{x}\ g(x_{k})^{T}(x-x_{k})+{\frac{1}{2\tau}}\|x-x_{k}\|^{2}+\mu\|x\|_{1}, (2.3)

where τ>0\tau>0 is a fixed parameter. It is easy to verify that xistaxk\|x_{\rm ista}-x_{k}\| is zero if and only if xkx_{k} is a solution of problem (1.1). We need to express xistaxk\|x_{\rm ista}-x_{k}\| in a way that is convenient for our analysis, and for this purpose we note [16] that some algebraic manipulations show that xistaxk=τF(xk)\|x_{\rm ista}-x_{k}\|=\tau\|F(x_{k})\|, where

F(x)=g(x)P[μ,μ](g(x)x/τ).F(x)=g(x)-P_{[-\mu,\mu]}(g(x)-x/\tau). (2.4)

Here P[μ,μ](x)P_{[-\mu,\mu]}(x) denotes the component-wise projection of xx onto the interval [μ,μ][-\mu,\mu], and τ\tau is the positive scalar from (2.3).
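As a concrete illustration (not part of the software tested later, and with function names of our own choosing), the following Python/NumPy sketch evaluates the measure (2.4) and checks numerically that the ISTA step (2.3), which has the closed-form soft-thresholding solution, satisfies xistaxk=τF(xk)\|x_{\rm ista}-x_{k}\|=\tau\|F(x_{k})\|.

```python
import numpy as np

def soft_threshold(z, t):
    """Componentwise soft-thresholding operator."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def F(x, g, mu, tau):
    """Optimality measure (2.4): F(x) = g - P_[-mu,mu](g - x/tau)."""
    return g - np.clip(g - x / tau, -mu, mu)

# Check that the ISTA step length equals tau * ||F(x)|| at a random point.
rng = np.random.default_rng(0)
x, g = rng.standard_normal(5), rng.standard_normal(5)
mu, tau = 0.5, 0.1
x_ista = soft_threshold(x - tau * g, tau * mu)   # closed-form solution of (2.3)
assert np.allclose(np.linalg.norm(x_ista - x), tau * np.linalg.norm(F(x, g, mu, tau)))
```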

One can directly verify that (2.4) is a valid optimality measure by noting that F(x)=0F(x)=0 is equivalent to the standard necessary optimality condition for (1.1)(\ref{l1prob}):

gi(x)+μ=0\displaystyle g_{i}(x^{*})+\mu=0 for i s.t. xi>0,\displaystyle\mbox{for }i\mbox{ s.t. }x_{i}^{*}>0,
gi(x)μ=0\displaystyle g_{i}(x^{*})-\mu=0 for i s.t. xi<0,\displaystyle\mbox{for }i\mbox{ s.t. }x_{i}^{*}<0,
μgi(x)μ\displaystyle-\mu\leq g_{i}(x^{*})\leq\mu for i s.t. xi=0.\displaystyle\mbox{for }i\mbox{ s.t. }x_{i}^{*}=0.

For the objective qkq_{k} of (2.1), this function takes the form

Fq(xk;x)=g(xk)+Hk(xxk)P[μ,μ](g(xk)+Hk(xxk)x/τ).F_{q}(x_{k};x)=g(x_{k})+H_{k}(x-x_{k})-P_{[-\mu,\mu]}(g(x_{k})+H_{k}(x-x_{k})-x/\tau). (2.5)

Using the measures (2.4)(\ref{Fdef}) and (2.5)(\ref{Fqdef}) in a manner similar to (2.2)(\ref{smoothcond}), leads to the condition Fq(xk;x^)ηkF(xk)\|F_{q}(x_{k};\hat{x})\|\leq\eta_{k}\|F(x_{k})\|. However, depending on the method used to approximately solve (2.1)(\ref{lassop}), this does not guarantee that x^xk\hat{x}-x_{k} is a descent direction for ϕ\phi. To achieve this, we impose the additional condition that the quadratic model is decreased at x^\hat{x}.


Inexactness Conditions. A point x^\hat{x} is considered an acceptable approximate solution of subproblem (2.1) if

Fq(xk;x^)ηkFq(xk;xk) and qk(x^)<qk(xk),\|F_{q}(x_{k};\hat{x})\|\leq\eta_{k}\|F_{q}(x_{k};x_{k})\|\ \mbox{ and }\ \ q_{k}(\hat{x})<q_{k}(x_{k}),\ (2.6)

for some parameter 0ηk<10\leq\eta_{k}<1, where \|\cdot\| is any norm. (Note that Fq(xk;xk)=F(xk)F_{q}(x_{k};x_{k})=F(x_{k}), so that the first condition can also be written as Fq(xk;x^)ηkF(xk)\|F_{q}(x_{k};\hat{x})\|\leq\eta_{k}\|F(x_{k})\|.)
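In code, the test (2.6) amounts to evaluating the semi-smooth residual (2.5) at the trial point and comparing it with the residual at xkx_{k}. A minimal sketch, with helper names of our own (the model qkq_{k} is passed in as a callable):

```python
import numpy as np

def Fq(xk, x, g, H, mu, tau):
    """Semi-smooth residual (2.5) of the model q_k, evaluated at the trial point x."""
    gq = g + H @ (x - xk)                      # gradient of the smooth part of q_k at x
    return gq - np.clip(gq - x / tau, -mu, mu)

def accept_inexact(xk, xhat, g, H, mu, tau, eta, qk):
    """Inexactness test (2.6): residual reduction plus simple decrease in q_k."""
    lhs = np.linalg.norm(Fq(xk, xhat, g, H, mu, tau))
    rhs = eta * np.linalg.norm(Fq(xk, xk, g, H, mu, tau))   # equals eta * ||F(x_k)||
    return lhs <= rhs and qk(xhat) < qk(xk)
```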


The method is summarized in Algorithm 2.1.

Algorithm 2.1: Inexact Successive Quadratic Approximation (SQA) Method for Problem (1.1)(\ref{l1prob})
Choose an initial iterate x0x_{0}.
Select constants θ(0,1/2)\theta\in(0,1/2) and 0<τ<10<\tau<1 (which is used in definitions (2.4), (2.5)).
for k=0,,k=0,\cdots, until the optimality conditions of (1.1) are satisfied:

1. Compute (or update) the model Hessian HkH_{k} and form the piecewise quadratic model (1.2)(\ref{quadm}).

2. Compute an inexact solution x^\hat{x} of (1.2)(\ref{quadm}) satisfying conditions (2.6).

3. Perform a backtracking line search along the direction d=x^xkd=\hat{x}-x_{k}: starting with α=1\alpha=1, find α(0,1]\alpha\in(0,1] such that

ϕ(xk)ϕ(xk+αd)θ(k(xk)k(xk+αd)),\phi(x_{k})-\phi(x_{k}+\alpha d)\geq\theta(\ell_{k}(x_{k})-\ell_{k}(x_{k}+\alpha d)), (2.7)

where k\ell_{k} is the following piecewise linear model of ϕ\phi at xkx_{k}:

k(x)=f(xk)+g(xk)T(xxk)+μx1.\ell_{k}(x)=f(x_{k})+g(x_{k})^{T}(x-x_{k})+\mu\|x\|_{1}. (2.8)

4. Set xk+1=xk+αdx_{k+1}=x_{k}+\alpha d.
end(for)

For now, we simply assume that the sequence {ηk}\{\eta_{k}\} in (2.6) satisfies ηk[0,1)\eta_{k}\in[0,1), but in section 4 we show that by choosing {ηk}\{\eta_{k}\} and the parameter τ\tau appropriately, the algorithm achieves a fast rate of convergence. One may wonder whether the backtracking line search of Step 3 might hinder sparsity of the iterates. Our numerical experience indicates that this is not the case because, in our tests, Algorithm 2.1 almost always accepts the unit steplength (α=1\alpha=1).
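To make the overall structure concrete, here is a minimal NumPy sketch of Algorithm 2.1 in which the subproblem (2.1) is solved by a plain ISTA loop terminated by condition (2.6). It forms HkH_{k} as a dense matrix, uses a forcing sequence modeled on the choice ηk=max{1/k,0.1}\eta_{k}=\max\{1/k,0.1\} of section 5 (capped below 1), and is intended only as a reading aid under these assumptions, not as a description of the implementations tested below.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def F_measure(x, g, mu, tau):
    # optimality measure (2.4): residual of the soft-thresholding fixed point
    return g - np.clip(g - x / tau, -mu, mu)

def sqa(f, grad, hess, x0, mu, tau=0.5, theta=0.1, max_outer=100, tol=1e-5):
    """Sketch of Algorithm 2.1 with an ISTA inner loop for subproblem (2.1)."""
    x = x0.copy()
    for k in range(max_outer):
        g, H = grad(x), hess(x)
        if np.linalg.norm(F_measure(x, g, mu, tau), np.inf) <= tol:
            break
        eta = min(0.9, max(1.0 / (k + 1), 0.1))        # forcing sequence eta_k < 1
        rhs = eta * np.linalg.norm(F_measure(x, g, mu, tau))
        L = np.linalg.norm(H, 2)                       # Lipschitz constant of the model gradient
        q = lambda y: g @ (y - x) + 0.5 * (y - x) @ H @ (y - x) + mu * np.abs(y).sum()
        y = x.copy()
        for _ in range(10000):                         # ISTA on q_k, stopped by test (2.6)
            y = soft_threshold(y - (g + H @ (y - x)) / L, mu / L)
            gq = g + H @ (y - x)
            if np.linalg.norm(F_measure(y, gq, mu, tau)) <= rhs and q(y) < q(x):
                break
        d = y - x
        ell = lambda z: g @ (z - x) + mu * np.abs(z).sum()   # l_k minus the constant f(x_k)
        phi = lambda z: f(z) + mu * np.abs(z).sum()
        alpha = 1.0                                    # backtracking on condition (2.7)
        while phi(x) - phi(x + alpha * d) < theta * (ell(x) - ell(x + alpha * d)) and alpha > 1e-12:
            alpha *= 0.5
        x = x + alpha * d
    return x
```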

It is worth pointing out that Lee et al. [13] recently proposed and analyzed an inexactness criterion that is similar to the first inequality of (2.6)(\ref{inx}). The main difference is that they use the subgradient of qkq_{k} on the left side of the inequality, and both norms are scaled by Hk1H_{k}^{-1}. They claim similar convergence results to ours, but a worrying consequence of the lack of continuity of the subgradient of qkq_{k} is that their inexactness condition can fail for vectors xx arbitrarily close to the exact minimizer of qkq_{k}. As a result, their criterion is not an appropriate termination test for the inner iteration. (In addition, their use of the scaling Hk1H_{k}^{-1} precludes setting Hk=2f(xk)H_{k}=\nabla^{2}f(x_{k}), except for small or highly structured problems.)

3 Global Convergence

In this section, we show that Algorithm 2.1 is globally convergent under certain assumptions on the function ff and the (approximate) Hessians HkH_{k}. Specifically, we assume that ff is a differentiable function with Lipschitz continuous gradient, i.e., there is a constant M>0M>0 such that

g(x)g(y)Mxy,\|g(x)-g(y)\|\leq M\|x-y\|, (3.1)

for all x,yx,y. We denote by λmin(Hk)\lambda_{\min}(H_{k}) and λmax(Hk)\lambda_{\max}(H_{k}) the smallest and largest eigenvalues of HkH_{k}, respectively.

Theorem 3.1

Suppose that ff is a smooth function that is bounded below and that satisfies (3.1)(\ref{lips}). Let {xk}\{x_{k}\} be the sequence of iterates generated by Algorithm 2.1, and suppose that there exist constants 0<λΛ0<\lambda\leq\Lambda such that the sequence {Hk}\{H_{k}\} satisfies

λmin(Hk)λ>0andλmax(Hk)Λ,\lambda_{\min}(H_{k})\geq\lambda>0\quad\mbox{and}\quad\lambda_{\max}(H_{k})\leq\Lambda,

for all kk. Then

limkF(xk)=0.\lim_{k\rightarrow\infty}F(x_{k})=0. (3.2)

Proof. We first show that if x^\hat{x} is an approximate solution of (1.2)(\ref{quadm}) that satisfies the inexactness conditions (2.6)(\ref{inx}), then there is a constant γ>0\gamma>0 (independent of kk) such that for all k{0,1,}k\in\{0,1,\cdots\}

k(xk)k(x^)γF(xk)2,\ell_{k}(x_{k})-\ell_{k}(\hat{x})\geq\gamma\|F(x_{k})\|^{2}, (3.3)

where k\ell_{k} and FF are defined in (2.8)(\ref{ell}) and (2.4)(\ref{Fdef}). To see this, note that by (2.6)

0>qk(x^)qk(xk)=k(x^)k(xk)+12(x^xk)THk(x^xk),0>q_{k}(\hat{x})-q_{k}(x_{k})=\ell_{k}(\hat{x})-\ell_{k}(x_{k})+\textstyle\frac{1}{2}(\hat{x}-x_{k})^{T}H_{k}(\hat{x}-x_{k}),

and therefore

k(xk)k(x^)>12(x^xk)THk(x^xk)12λx^xk2.\ell_{k}(x_{k})-\ell_{k}(\hat{x})>\textstyle\frac{1}{2}(\hat{x}-x_{k})^{T}H_{k}(\hat{x}-x_{k})\geq\textstyle\frac{1}{2}\lambda\|\hat{x}-x_{k}\|^{2}. (3.4)

Next, since F(xk)=Fq(xk;xk)F(x_{k})=F_{q}(x_{k};x_{k}), and using (2.6)(\ref{inx}) and the contraction property of the projection, we have that

(1ηk)F(xk)\displaystyle(1-\eta_{k})\|F(x_{k})\| =(1ηk)Fq(xk;xk)\displaystyle=(1-\eta_{k})\|F_{q}(x_{k};x_{k})\|
Fq(xk;xk)Fq(xk,x^)\displaystyle\leq\|F_{q}(x_{k};x_{k})\|-\|F_{q}(x_{k},\hat{x})\|
Fq(xk;x^)Fq(xk;xk)\displaystyle\leq\|F_{q}(x_{k};\hat{x})-F_{q}(x_{k};x_{k})\|
=Hk(x^xk)P[μ,μ](g(xk)+Hk(x^xk)1τx^)+P[μ,μ](g(xk)1τxk)\displaystyle=\|H_{k}(\hat{x}-x_{k})-P_{[-\mu,\mu]}(g(x_{k})+H_{k}(\hat{x}-x_{k})-\textstyle\frac{1}{\tau}\hat{x})+P_{[-\mu,\mu]}(g(x_{k})-\textstyle\frac{1}{\tau}x_{k})\|
Hk(x^xk)+1τ(x^xk)Hk(x^xk)\displaystyle\leq\|H_{k}(\hat{x}-x_{k})\|+\|\textstyle\frac{1}{\tau}(\hat{x}-x_{k})-H_{k}(\hat{x}-x_{k})\|
1τx^xk+2Hkx^xk\displaystyle\leq\textstyle\frac{1}{\tau}\|\hat{x}-x_{k}\|+2\|H_{k}\|\|\hat{x}-x_{k}\|
=(1τ+2Hk)x^xk\displaystyle=\left(\textstyle\frac{1}{\tau}+2\|H_{k}\|\right)\|\hat{x}-x_{k}\|
(1τ+2Λ)x^xk.\displaystyle\leq\left(\textstyle\frac{1}{\tau}+2\Lambda\right)\|\hat{x}-x_{k}\|.

Combining this expression with (3.4)(\ref{lpre}), we obtain (3.3)(\ref{ldec}) for

γ=λ2(1η1τ+2Λ)2.\gamma=\frac{\lambda}{2}\left(\frac{1-\eta}{\frac{1}{\tau}+2\Lambda}\right)^{2}.

Note that γ>0\gamma>0 since τ,λ,Λ>0\tau,\lambda,\Lambda>0; here η<1\eta<1 denotes an upper bound on the sequence {ηk}\{\eta_{k}\}, which we assume to be bounded away from 1.

Let us define the search direction as d=x^xkd=\hat{x}-x_{k}. We now show that by performing a line search along dd we can ensure that the algorithm provides sufficient decrease in the objective function ϕ\phi, and this will allow us to establish the limit (3.2).

Since g(x)g(x) satisfies the Lipschitz condition (3.1)(\ref{lips}), we have

f(xk+αd)+μxk+αd1f(xk)+αg(xk)Td+M2α2d2+μxk+αd1,f(x_{k}+\alpha d)+\mu\|x_{k}+\alpha d\|_{1}\leq f(x_{k})+\alpha g(x_{k})^{T}d+\frac{M}{2}\alpha^{2}\|d\|^{2}+\mu\|x_{k}+\alpha d\|_{1},

and thus

f(xk)+μxk1f(xk+αd)μxk+αd1αg(xk)TdM2α2d2μxk+αd1+μxk1.\ f(x_{k})+\mu\|x_{k}\|_{1}-f(x_{k}+\alpha d)-\mu\|x_{k}+\alpha d\|_{1}\geq-\alpha g(x_{k})^{T}d-\frac{M}{2}\alpha^{2}\|d\|^{2}-\mu\|x_{k}+\alpha d\|_{1}+\mu\|x_{k}\|_{1}.

Recalling the definition of k\ell_{k}, we have

ϕ(xk)ϕ(xk+αd)k(xk)k(xk+αd)M2α2d2.\phi(x_{k})-\phi(x_{k}+\alpha d)\geq\ell_{k}(x_{k})-\ell_{k}(x_{k}+\alpha d)-\frac{M}{2}\alpha^{2}\|d\|^{2}.

By convexity of the 1\ell_{1}-norm, we have that

(xk)(xk+αd)α((xk)(xk+d)).\ell(x_{k})-\ell(x_{k}+\alpha d)\geq\alpha(\ell(x_{k})-\ell(x_{k}+d)).

Combining this inequality with (3.4), and recalling that xk+d=x^x_{k}+d=\hat{x}, we obtain for θ(0,1)\theta\in(0,1),

ϕ(xk)ϕ(xk+αd)θ((xk)(xk+αd))\displaystyle\phi(x_{k})-\phi(x_{k}+\alpha d)-\theta(\ell(x_{k})-\ell(x_{k}+\alpha d)) (1θ)((xk)(xk+αd))M2α2d2\displaystyle\geq(1-\theta)(\ell(x_{k})-\ell(x_{k}+\alpha d))-\frac{M}{2}\alpha^{2}\|d\|^{2}
(1θ)α((xk)(xk+d))M2α2d2\displaystyle\geq(1-\theta)\alpha(\ell(x_{k})-\ell(x_{k}+d))-\frac{M}{2}\alpha^{2}\|d\|^{2}
(1θ)αλ2d2M2α2d2\displaystyle\geq(1-\theta)\alpha\frac{\lambda}{2}\|d\|^{2}-\frac{M}{2}\alpha^{2}\|d\|^{2}
=12αd2((1θ)λMα)\displaystyle=\textstyle\frac{1}{2}\alpha\|d\|^{2}\left((1-\theta)\lambda-M\alpha\right)
0,\displaystyle\geq 0, (3.5)

provided ((1θ)λMα)0\left((1-\theta)\lambda-M\alpha\right)\geq 0. Therefore, the sufficient decrease condition (2.7) is satisfied for any steplength α\alpha satisfying

0α(1θ)λM,0\leq\alpha\leq(1-\theta)\frac{\lambda}{M},

and if the backtracking line search cuts the steplength in half (say) after each trial, we have that the steplength chosen by the line search satisfies

α(1θ)λ2M.\alpha\geq(1-\theta)\frac{\lambda}{2M}.

Thus, from (3.5) and (3.3) we obtain

ϕ(xk)ϕ(xk+αd)θ(1θ)λ2MγF(xk)2.\phi(x_{k})-\phi(x_{k}+\alpha d)\geq\theta(1-\theta)\frac{\lambda}{2M}\gamma\|F(x_{k})\|^{2}.

Since ff is assumed to be bounded below, so is the objective function ϕ\phi, and given that the decrease in ϕ\phi at each iteration is at least a constant multiple of F(xk)2\|F(x_{k})\|^{2}, we obtain the limit (3.2)(\ref{limit}). \Box

We note that to establish this convergence result it was not necessary to assume convexity of ff.

4 Local Convergence

To analyze the local convergence rate of the successive quadratic approximation method, we use the theory developed in Chapter 7 of Facchinei and Pang [11]. To do this, we first show that, if xx^{*} is a nonsingular minimizer of ϕ\phi, then the functions Fq(x;):nnF_{q}(x;\cdot):\mathbb{R}^{n}\rightarrow\mathbb{R}^{n} are a family of uniformly Lipschitzian nonsingular homeomorphisms for all xx in a neighborhood of xx^{*}.

Lemma 4.1

If HH is a symmetric positive definite matrix with smallest eigenvalue λ>0\lambda>0, then the function of yy given by

Fq(x;y)=g(x)+H(yx)P[μ,μ](g(x)+H(yx)y/τ),F_{q}(x;y)=g(x)+H(y-x)-P_{[-\mu,\mu]}(g(x)+H(y-x)-y/\tau),

is strongly monotone if τ<1/H\tau<1/\|H\|. Specifically, for any vectors y,zny,z\in\mathbb{R}^{n},

(zy)T(Fq(x;z)Fq(x;y))12λzy2.(z-y)^{T}(F_{q}(x;z)-F_{q}(x;y))\geq\textstyle\frac{1}{2}{\lambda}\|z-y\|^{2}. (4.1)

Proof. It is straightforward to show that for any scalars aba\neq b and interval [μ,μ][-\mu,\mu],

0P[μ,μ](a)P[μ,μ](b)ab1.0\leq{P_{[-\mu,\mu]}(a)-P_{[-\mu,\mu]}(b)\over a-b}\leq 1. (4.2)

Therefore for any vectors yy and zz, and for any index i{1,,n}i\in\{1,\cdots,n\} we have

F_{q}(x;z)_{i}-F_{q}(x;y)_{i} = H(z-y)_{i}-P_{[-\mu,\mu]}(g_{i}(x)+H(z-x)_{i}-z_{i}/\tau)
\quad\quad +P_{[-\mu,\mu]}(g_{i}(x)+H(y-x)_{i}-y_{i}/\tau)
= H(z-y)_{i}-\bar{d}_{i}[H(z-y)_{i}-(z_{i}-y_{i})/\tau],

where d¯i[0,1]\bar{d}_{i}\in[0,1] is a scalar implied by (4.2). This implies that

Fq(x;z)Fq(x;y)=H(zy)+D(1τIH)(zy),F_{q}(x;z)-F_{q}(x;y)=H(z-y)+D(\textstyle\frac{1}{\tau}I-H)(z-y),

where D=diag(d¯i)D={\rm diag}(\bar{d}_{i}). Hence

(zy)T(Fq(x;z)Fq(x;y))=(zy)TH(zy)+(zy)TD(1τIH)(zy).(z-y)^{T}(F_{q}(x;z)-F_{q}(x;y))=(z-y)^{T}H(z-y)+(z-y)^{T}D(\textstyle\frac{1}{\tau}I-H)(z-y). (4.3)

Since the right hand side is a quadratic form, we symmetrize the matrix, and if we let w=zyw=z-y, the right side is

wT[H+τ1D12(DH+HD)]w.w^{T}[H+\tau^{-1}D-\textstyle\frac{1}{2}(DH+HD)]w. (4.4)

To show that the symmetric matrix inside the square brackets is positive definite, we note that since (τHD)T(τHD)=τ2H2τ(HD+DH)+D2(\tau H-D)^{T}(\tau H-D)=\tau^{2}H^{2}-\tau(HD+DH)+D^{2} is positive semi-definite, we have that

wT(HD+DH)wwT(τH2+τ1D2)w.w^{T}(HD+DH)w\leq w^{T}(\tau H^{2}+\tau^{-1}D^{2})w.

Substituting this into (4.4) yields

wT[H+τ1D12(DH+HD)]w\displaystyle w^{T}\left[H+\tau^{-1}D-\textstyle\frac{1}{2}(DH+HD)\right]w wT[Hτ2H2+DD2/2τ]w\displaystyle\geq w^{T}\left[H-{\tau\over 2}H^{2}+{D-D^{2}/2\over\tau}\right]w
wT[Hτ2H2]w,\displaystyle\geq w^{T}\left[H-{\tau\over 2}H^{2}\right]w,

since DD2/2D-D^{2}/2 is positive semi-definite given that the elements of the diagonal matrix DD are in [0,1][0,1]. If λi\lambda_{i} is an eigenvalue of HH, the corresponding eigenvalue of the matrix Hτ2H2H-{\tau\over 2}H^{2} is λiτλi2/2λi/2\lambda_{i}-\tau\lambda_{i}^{2}/2\geq\lambda_{i}/2 since our assumption on τ\tau implies 1>τHτλi1>\tau\|H\|\geq\tau\lambda_{i}. Therefore, we have from (4.3) that

(zy)T(Fq(x;z)Fq(x;y))12λzy2.(z-y)^{T}(F_{q}(x;z)-F_{q}(x;y))\geq\textstyle\frac{1}{2}\lambda\|z-y\|^{2}.

\Box

Inequality (4.1) establishes that Fq(x;)F_{q}(x;\cdot) is strongly monotone. Next we show that, when HH is defined as the Hessian of ff, the functions Fq(x;)F_{q}(x;\cdot) are homeomorphisms and that they represent an accurate approximation to the function FF defined in (2.4).

Theorem 4.2

If 2f(x)\nabla^{2}f(x^{*}) is positive definite and τ<1/2f(x)\tau<1/\|\nabla^{2}f(x^{*})\|, then there is a neighborhood 𝒩\cal{N} of xx^{*} such that for all x𝒩x\in\cal{N} the functions of yy given by

Fq(x;y)=g(x)+2f(x)(yx)P[μ,μ](g(x)+2f(x)(yx)y/τ)F_{q}(x;y)=g(x)+\nabla^{2}f(x)(y-x)-P_{[-\mu,\mu]}(g(x)+\nabla^{2}f(x)(y-x)-y/\tau) (4.5)

are a family of homeomorphisms from n\mathbb{R}^{n} to n\mathbb{R}^{n}, whose inverses Fq1(x;)F_{q}^{-1}(x;\cdot) are uniformly Lipschitz continuous. In addition, if 2f(x)\nabla^{2}f(x) is Lipschitz continuous, then there exists a constant β>0\beta>0 such that

F(y)Fq(x;y)βxy2\|F(y)-F_{q}(x;y)\|\leq\beta\|x-y\|^{2} (4.6)

for any x𝒩x\in\cal{N}.

Proof. Since 2f(x)\nabla^{2}f(x) is continuous, there is a neighborhood 𝒩{\cal N} of xx^{*} and a positive constant λ\lambda such that λmin(2f(x))λ>0\lambda_{min}(\nabla^{2}f(x))\geq\lambda>0 and τ2f(x)<1\tau\|\nabla^{2}f(x)\|<1, for all x𝒩x\in\cal{N}. It follows from Lemma 4.1 that for any such xx, the function Fq(x;y)F_{q}(x;y) given by (4.5) is strongly (or uniformly) monotone with constant greater than λ/2\lambda/2. We now invoke the Uniform Monotonicity Theorem (see e.g. Theorem 6.4.4 in [19]), which states that if a function F:nnF:\mathbb{R}^{n}\rightarrow\mathbb{R}^{n} is continuous and uniformly monotone, then FF is a homeomorphism of n\mathbb{R}^{n} onto itself. We therefore conclude that Fq(x;y)F_{q}(x;y) is a homeomorphism.

In addition, we have from (4.1)(\ref{smono}) and the Cauchy-Schwarz inequality that

zyFq(x;z)Fq(x;y)(zy)T(Fq(x;z)Fq(x;y))12λzy2,\|z-y\|\|F_{q}(x;z)-F_{q}(x;y)\|\geq(z-y)^{T}(F_{q}(x;z)-F_{q}(x;y))\geq\textstyle\frac{1}{2}\lambda\|z-y\|^{2},

which implies Lipschitz continuity of Fq1(x;)F_{q}^{-1}(x;\cdot) with constant 2/λ2/\lambda. To establish (4.6)(\ref{fqerror}), note that

F(y)Fq(x;y)\displaystyle F(y)-F_{q}(x;y) =g(y)P[μ,μ](g(y)y/τ)(g(x)+2f(x)(yx))\displaystyle=g(y)-P_{[-\mu,\mu]}(g(y)-y/\tau)-\left(g(x)+\nabla^{2}f(x)(y-x)\right)
+P[μ,μ](g(x)+2f(x)(yx)y/τ),\displaystyle+P_{[-\mu,\mu]}(g(x)+\nabla^{2}f(x)(y-x)-y/\tau),

and thus

F(y)Fq(x;y)\displaystyle\|F(y)-F_{q}(x;y)\| \displaystyle\leq g(y)(g(x)+2f(x)(yx))\displaystyle\|g(y)-(g(x)+\nabla^{2}f(x)(y-x))\|
+P[μ,μ](g(y)y/τ)P[μ,μ](g(x)+2f(x)(yx)y/τ)\displaystyle+\|P_{[-\mu,\mu]}(g(y)-y/\tau)-P_{[-\mu,\mu]}(g(x)+\nabla^{2}f(x)(y-x)-y/\tau)\|
\displaystyle\leq 2g(y)(g(x)+2f(x)(yx))\displaystyle 2\|g(y)-(g(x)+\nabla^{2}f(x)(y-x))\|
=\displaystyle= O(yx2),\displaystyle O(\|y-x\|^{2}),

by the non-expansiveness of a projection onto a convex set and Taylor’s theorem. \Box

Theorem 4.2 shows that Fq(x;y)F_{q}(x;y) defines a strong nonsingular Newton approximation in the sense of Definition 7.2.2 of Facchinei and Pang [11]. This implies quadratic convergence for the (exact) successive quadratic approximation (SQA) method.

Theorem 4.3

If 2f(x)\nabla^{2}f(x) is Lipschitz continuous and positive definite at xx^{*}, and τ<1/2f(x)\tau<1/\|\nabla^{2}f(x^{*})\|, then there is a neighborhood of xx^{*} such that, if x0x_{0} lies in that neighborhood, the iteration that defines xk+1x_{k+1} as the unique solution to

Fq(xk;xk+1)=0F_{q}(x_{k};x_{k+1})=0

converges quadratically to xx^{*}.

Proof. By Theorem 4.2, Fq(xk;y)F_{q}(x_{k};y) satisfies the definition of a nonsingular strong Newton approximation of FF at xx^{*}, given by Facchinei and Pang ([11], 7.2.2) and thus by Theorem 7.2.5 of that book the local convergence is quadratic. \Box

Now we consider the inexact SQA algorithm that, at each step, computes a point yy satisfying

Fq(xk;y)=rk,F_{q}(x_{k};y)=r_{k}, (4.7)

where rkr_{k} is a vector such that rkηkF(xk)\|r_{k}\|\leq\eta_{k}\|F(x_{k})\| with ηk<1\eta_{k}<1; see (2.6). We obtain the following result for a method that sets xk+1=yx_{k+1}=y.

Theorem 4.4

Suppose that 2f(x)\nabla^{2}f(x) is Lipschitz continuous and positive definite at xx^{*}, τ<1/2f(x)\tau<1/\|\nabla^{2}f(x^{*})\|, and that xk+1x_{k+1} is computed by solving

Fq(xk;xk+1)=rk,F_{q}(x_{k};x_{k+1})=r_{k},

where rkηkF(xk)\|r_{k}\|\leq\eta_{k}\|F(x_{k})\|. Then, there is a neighborhood 𝒩\cal{N} of xx^{*} and a value η¯>0\bar{\eta}>0 such that if ηkη¯\eta_{k}\leq\bar{\eta} for all kk and if x0𝒩x_{0}\in\cal{N} then the sequence {xk}\{x_{k}\} converges Q-linearly to xx^{*}. In addition if ηk0\eta_{k}\rightarrow 0, then the convergence rate of {xk}\{x_{k}\} is Q-superlinear. Finally, if for some η~\tilde{\eta}, ηkη~F(xk)\eta_{k}\leq\tilde{\eta}\|F(x_{k})\| then the convergence rate is Q-quadratic.

Proof. By Theorem 4.2, the iteration described in the statement of the theorem satisfies all the conditions of Theorem 7.2.8 of [11]. The results then follow immediately from that theorem. \Box

We have shown above that the inexact successive quadratic approximation (SQA) method with αk=1\alpha_{k}=1 yields a fast rate of convergence. We now show that this inexact SQA algorithm will select the steplength αk=1\alpha_{k}=1 in a neighborhood of the solution. In order to do so, we strengthen the inexactness conditions (2.6) slightly so that they read

Fq(xk;x^)ηkFq(xk;xk) and qk(x^)qk(xk)ζ(k(x^)k(xk)),\|F_{q}(x_{k};\hat{x})\|\leq\eta_{k}\|F_{q}(x_{k};x_{k})\|\ \mbox{ and }\ \ q_{k}(\hat{x})-q_{k}(x_{k})\leq\zeta(\ell_{k}(\hat{x})-\ell_{k}(x_{k})), (4.8)

where ηk<1\eta_{k}<1, ζ(θ,1/2)\zeta\in(\theta,1/2) and θ\theta is the input parameter of Algorithm 2.1 used in (2.7). Thus, instead of simple decrease, we now impose sufficient decrease in qkq_{k}.
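For reference, the strengthened test (4.8) differs from (2.6) only in its second part. A small sketch under the same conventions as before (our own helper names, with qkq_{k} and k\ell_{k} written up to the common constant f(xk)f(x_{k}), which cancels in the differences):

```python
import numpy as np

def accept_strengthened(xk, xhat, g, H, mu, tau, eta, zeta):
    """Strengthened inexactness test (4.8): residual reduction plus
    sufficient (rather than simple) decrease in the model q_k."""
    def Fq(x):
        gq = g + H @ (x - xk)
        return gq - np.clip(gq - x / tau, -mu, mu)
    def ell(x):                      # piecewise linear model (2.8), minus f(x_k)
        return g @ (x - xk) + mu * np.abs(x).sum()
    def q(x):                        # piecewise quadratic model (2.1), minus f(x_k)
        return ell(x) + 0.5 * (x - xk) @ H @ (x - xk)
    return (np.linalg.norm(Fq(xhat)) <= eta * np.linalg.norm(Fq(xk))
            and q(xhat) - q(xk) <= zeta * (ell(xhat) - ell(xk)))
```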

Lemma 4.5

If HkH_{k} is positive definite, the inexactness condition (4.8)(\ref{inxx}) is satisfied by any sufficiently accurate solution to (2.1)(\ref{lassop}).

Proof. If we denote by y¯\bar{y} the (exact) minimizer of qkq_{k}, we claim that

qk(xk)qk(y¯)12(k(xk)k(y¯)).q_{k}(x_{k})-q_{k}(\bar{y})\geq\textstyle\frac{1}{2}(\ell_{k}(x_{k})-\ell_{k}(\bar{y})). (4.9)

To see this, note that qk(y)=k(y)+12(yxk)THk(yxk)q_{k}(y)=\ell_{k}(y)+{1\over 2}(y-x_{k})^{T}H_{k}(y-x_{k}), and since k\ell_{k} and qkq_{k} are convex and y¯\bar{y} minimizes qkq_{k}, there exists a vector vk(y¯)v\in\partial\ell_{k}(\bar{y}) such that v+Hk(y¯xk)=0v+H_{k}(\bar{y}-x_{k})=0. By convexity

k(xk)k(y¯)+vT(xky¯)=k(y¯)+(y¯xk)THk(y¯xk).\ell_{k}(x_{k})\geq\ell_{k}(\bar{y})+v^{T}(x_{k}-\bar{y})=\ell_{k}(\bar{y})+(\bar{y}-x_{k})^{T}H_{k}(\bar{y}-x_{k}). (4.10)

Therefore,

qk(xk)qk(y¯)=k(xk)k(y¯)12(y¯xk)THk(y¯xk)12(k(xk)k(y¯)),q_{k}(x_{k})-q_{k}(\bar{y})=\ell_{k}(x_{k})-\ell_{k}(\bar{y})-\textstyle\frac{1}{2}(\bar{y}-x_{k})^{T}H_{k}(\bar{y}-x_{k})\\ \geq\textstyle\frac{1}{2}(\ell_{k}(x_{k})-\ell_{k}(\bar{y})),

which proves (4.9).

Now consider the continuous function qk(x)qk(xk)ζ(k(x)k(xk))q_{k}(x)-q_{k}(x_{k})-\zeta(\ell_{k}(x)-\ell_{k}(x_{k})). By (4.9)(\ref{masia}) its value at x=y¯x=\bar{y} is

qk(y¯)qk(xk)ζ(k(y¯)k(xk))(12ζ)(k(y¯)k(xk))<0,q_{k}(\bar{y})-q_{k}(x_{k})-\zeta(\ell_{k}(\bar{y})-\ell_{k}(x_{k}))\leq(\textstyle\frac{1}{2}-\zeta)(\ell_{k}(\bar{y})-\ell_{k}(x_{k}))<0, (4.11)

where the last inequality follows from (4.10). Therefore by continuity, the value of this function for any xx in some neighborhood of y¯\bar{y} is negative, implying that (4.8)(\ref{inxx}) is satisfied by any approximate solution x^\hat{x} sufficiently close to y¯\bar{y}. \Box

Theorem 4.6

Suppose that Hk=2f(xk)H_{k}=\nabla^{2}f(x_{k}) in Algorithm 2.1, and that we modify Step 2 in that algorithm to require that the approximate solution x^\hat{x} satisfies (4.8) instead of (2.6). If we assume that 2f(x)\nabla^{2}f(x) is Lipschitz continuous, then for all kk sufficiently large we have αk=1\alpha_{k}=1.

Proof. Given that x^=xk+dk\hat{x}=x_{k}+d_{k} satisfies (4.8)(\ref{inxx}), it follows from Taylor’s theorem, the Lipschitz continuity of 2f(x)\nabla^{2}f(x), and equation (3.4)(\ref{lpre}) that for some constant ρ>0\rho>0

ϕ(xk+dk)ϕ(xk)\displaystyle\phi(x_{k}+d_{k})-\phi(x_{k}) =\displaystyle= [ϕ(xk+dk)ϕ(xk)qk(xk+dk)+qk(xk)]\displaystyle[\phi(x_{k}+d_{k})-\phi(x_{k})-q_{k}(x_{k}+d_{k})+q_{k}(x_{k})]
(qk(xk)qk(xk+dk))\displaystyle-(q_{k}(x_{k})-q_{k}(x_{k}+d_{k}))
\displaystyle\leq ζ(k(xk)k(xk+dk))+ρdk3\displaystyle-\zeta(\ell_{k}(x_{k})-\ell_{k}(x_{k}+d_{k}))+\rho\|d_{k}\|^{3}
\displaystyle\leq θ(k(xk+dk)k(xk))+(ζθ)(k(xk+dk)k(xk))+ρdk3\displaystyle\theta(\ell_{k}(x_{k}+d_{k})-\ell_{k}(x_{k}))+(\zeta-\theta)(\ell_{k}(x_{k}+d_{k})-\ell_{k}(x_{k}))+\rho\|d_{k}\|^{3}
\displaystyle\leq θ(k(xk+dk)k(xk))(ζθ)λ2dk2+ρdk3\displaystyle\theta(\ell_{k}(x_{k}+d_{k})-\ell_{k}(x_{k}))-(\zeta-\theta){\lambda\over 2}\|d_{k}\|^{2}+\rho\|d_{k}\|^{3}
\displaystyle\leq θ(k(xk+dk)k(xk))\displaystyle\theta(\ell_{k}(x_{k}+d_{k})-\ell_{k}(x_{k}))

if dk(ζθ)λ/2ρ\|d_{k}\|\leq(\zeta-\theta)\lambda/2\rho . Since the global convergence analysis implies dk0\|d_{k}\|\rightarrow 0, we have from (2.7) that eventually the steplength αk=1\alpha_{k}=1 is accepted and used. \Box


We note that (4.8) is stronger than (2.6), and therefore, all the results presented in this and the previous section apply also to the strengthened condition (4.8). Theorem 4.6 implies that if Algorithm 2.1 is run with the strengthened accuracy condition (4.8)(\ref{inxx}), and Hk=2f(xk)H_{k}=\nabla^{2}f(x_{k}), then once the iterates are close enough to a nonsingular minimizer xx^{*}, the iterates have the linear, superlinear or quadratic convergence rates described in Theorem 4.4 if ηk\eta_{k} is chosen appropriately.

5 Numerical Results

One of the goals of this paper is to investigate whether the successive quadratic approximation (sqa) method is, in fact, an effective approach for solving convex 1\ell_{1} regularized problems of the form (1.1). Indeed, it is reasonable to ask whether it might be more effective to apply an algorithm such as ista or fista, directly to problem (1.1), rather than performing an inner iteration on the subproblem (2.1). Note that each iteration of fista requires an evaluation of the gradient of the objective (1.1), whereas each inner iteration for the subproblem (2.1) involves the product of HkH_{k} times a vector.

To study this question, we explore various algorithmic options within the successive quadratic approximation method, and evaluate their performance using data sets with different characteristics. One of the data sets concerns the covariance selection problem (where the unknown is a matrix), and the other involves a logistic objective function (where the unknown is a vector). Our benchmark is fista applied directly to problem (1.1). fista enjoys convergence guarantees when applied to problem (1.1), and is generally regarded as an effective method.

We employ two types of methods for solving the subproblem (2.1) in the successive quadratic approximation method: fista and an orthant based method (obm) [1, 7, 8]. The orthant based method (described in detail in section 5.3) is a two-phase method in which an active orthant face of n\mathbb{R}^{n} is first identified, and a subspace minimization is then performed with respect to the variables that define the orthant face. The subspace phase can be performed by means of a Newton-CG iteration, or by computing a quasi-Newton step; we consider both options.


The methods employed in our numerical tests are as follows.

  • FISTA. This is the fista algorithm [4] applied to the original problem (1.1). We used the implementation from the TFOCS package, called N83 [5]. This implementation differs from the (adaptive) algorithm described by Beck and Teboulle [4] in the way the Lipschitz parameter is updated, and performed significantly better in our test set than the method in [4].

  • PNOPT. This is the sequential quadratic approximation (proximal Newton) method of Lee, Sun and Saunders [13]. The Hessian HkH_{k} in the subproblem (2.1) is updated using the limited memory BFGS formula, with a (default) memory of 50. (The pnopt package also allows for the use of the exact Hessian, but since this matrix must be formed and factored at each iteration, its use is impractical.) The subproblem (2.1) is solved using the N83 implementation of fista mentioned above. pnopt provides the option of using sparsa [27] instead of N83 as an inner solver, but the performance of sparsa was not robust in our tests, and we will not report results with it.

  • SQA. This is the successive quadratic approximation method described in Algorithm 2.1. We implemented three variants that differ in the method used to solve the subproblem (2.1).

    • SQA-FISTA. This is an sqa method using fista-n83 to solve the subproblem (2.1). The matrix HkH_{k} is the exact Hessian 2f(xk)\nabla^{2}f(x_{k}); each inner fista iteration requires two multiplications with HkH_{k}.

    • SQA-OBM-CG. This is an sqa method that employs an orthant based method to solve the subproblem (2.1). The obm method performs the subspace minimization step using a Newton-CG iteration. The number of CG iterations varies during the course of the (outer) iteration according to the rule min{3,1+k/10}\min\{3,1+\lfloor k/10\rfloor\}, where kk is the outer iteration number.

    • SQA-OBM-QN. This is an sqa method where the inner solver is an obm method in which the subspace phase consists of a limited memory BFGS step, with a memory of 50. The correction pairs used to update the quasi-Newton matrix employ gradient differences from the outer iteration (as in pnopt).

The initial point was set to the zero vector in all experiments, and the iteration was terminated if F(xk)105\|F(x_{k})\|_{\infty}\leq 10^{-5}, where FF is defined in (2.4). The maximum number of outer iterations for all solvers was 30003000. In the SQA method, the parameter ηk\eta_{k} in the inexactness condition (2.6) was defined as ηk=max{1/k,0.1}\eta_{k}=\max\{1/k,0.1\}, and we set θ=0.1\theta=0.1 in (2.7). For pnopt we set ‘ftol’=1e161e-16, and ‘xtol’=1e161e-16 (so that those two tests do not terminate the iteration prematurely), and chose ‘Lbfgs_mem’=50.

We noted above that Algorithm 2 can employ the inexactness conditions (2.6) or (4.8). We implemented both conditions, with ζ=θ=0.1\zeta=\theta=0.1, and obtained identical results in all our runs.

We now describe the numerical tests performed with these methods.

5.1 Inverse Covariance Estimation Problems

The task of estimating a high dimensional sparse inverse covariance matrix is closely tied to the topic of Gaussian Markov random fields [22], and arises in a variety of recognition tasks. This model can be used to recover a sparse social or genetic network from user or experimental data.

A popular approach to solving this problem [2, 3] is to minimize the negative log likelihood function, under the assumption of normality, with an additional 1\ell_{1} term to enforce sparsity in the estimated inverse covariance matrix. We can write the optimization problem as

minPn×ntr(SP)logdetP+μP,\min_{P\in\mathbb{R}^{n\times n}}\ {\rm tr}(SP)-\log\det P+\mu\|P\|, (5.1)

where SS is a given sample covariance matrix, PP denotes the unknown inverse covariance matrix, μ\mu is the regularization parameter, and P=defvec(P)1\|P\|\stackrel{{\scriptstyle\rm def}}{{=}}\|vec(P)\|_{1}. We note that the Hessian of the first two terms in (5.1) has a very special structure: it is given by P1P1P^{-1}\otimes P^{-1}.
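For the smooth part of (5.1), the gradient and a Hessian-vector product follow directly from this structure. A sketch (using an explicit inverse for brevity; a practical code would instead work with a Cholesky factor of PP):

```python
import numpy as np

def cov_grad(P, S):
    # gradient of tr(S P) - log det P with respect to P, namely S - P^{-1}
    return S - np.linalg.inv(P)

def cov_hessvec(P, V):
    # Hessian (P^{-1} kron P^{-1}) applied to vec(V): equals vec(P^{-1} V P^{-1})
    Pinv = np.linalg.inv(P)
    return Pinv @ V @ Pinv
```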

Since the objective is not defined when PP is not positive definite, we define it as ++\infty in that case to ensure that all iterates remain positive definite. Such a strategy could, however, be detrimental to a solver like fista, and to avoid this we selected the starting point so that this situation did not occur.

We employ three data sets: the well-known Estrogen and Leukemia test sets [14], and the problem given in Olsen et al. [18], which we call OONR. The characteristics of the data sets are given in Table 1, where nnz(PμP^{\ast}_{\mu}) denotes the number of nonzeros in the solution.

Table 1.

Data set number of features μ\mu nnz(PμP^{\ast}_{\mu})
Estrogen 692 0.5 10,614 (2.22%)
Leukemia 1255 0.5 34,781 (2.21%)
OONR 500 0.5 1856 (0.74%)

The performance of the algorithms on these three test problems is given in Tables 2, 3 and 4. We note that fista does not perform inner iterations since it is applied directly to the original problem (1.1), and that pnopt-fista does not compute Hessian vector products because the matrix HkH_{k} in the model (2.1) is defined by quasi-Newton updating. Each inner iteration of sqa-obm-qn performs a Hessian-vector multiplication to compute the subproblem objective, and a multiplication of the inverse Hessian times a vector to compute the unconstrained minimizer on the active orthant face; we report these as two Hessian-vector products in Tables 2, 3 and 4.


Table 2. ESTROGEN; μ=0.5\mu=0.5, optimality tolerance = 10510^{-5}

solver FISTA SQA PNOPT SQA SQA
inner solver fista fista obm-qn obm-cg
outer iterations 808 9 43 44 8
inner iterations - 183 2134 64 93
function/gradient evals 1751 10 44 45 10
Hessian-vect mults - 417 - 2 213
time (s) 208.87 51.54 355.15 38.74 26.95
For PNOPT we report the number of prox. evaluations

Table 3. LEUKEMIA; μ=0.5\mu=0.5, optimality tolerance = 10510^{-5}

solver FISTA SQA PNOPT SQA SQA
inner solver fista fista obm-qn obm-cg
outer iterations 838 8 >488>488^{\ast\ast} 101 8
inner iterations - 187 - 196 101
function/gradient evals 1803 9 - 103 9
Hessian-vect mults - 420 - 4 239
time (s) 1048.77 239.23 - 171.41 140.33
∗ out of memory for memory size = 50; we decreased the memory size to 5
∗∗ exit with message: “Relative change in function value below ftol”
∗∗ optimality error is below 1e41e-4 after iteration 73, it is 2.3136e052.3136e-05 at termination

Table 4. OONR; μ=0.5\mu=0.5, optimality tolerance = 10510^{-5}

solver FISTA SQA PNOPT SQA SQA
inner solver fista fista obm-qn obm-cg
outer iterations 212 10 39 37 7
inner iterations - 80 761 37 60
function/gradient evals 461 11 41 44 9
Hessian-vect mults - 193 - 2 125
time (s) 23.53 10.14 70.37 12.73 7.09

We now comment on the results given in Tables 2-4. For the inverse covariance selection problem (5.1), Hessian-vector products are not as expensive as for other problems (cf. Tables 7 and 8); in fact, these products are not much costlier than computations with the limited memory BFGS matrix. This fact, combined with the effectiveness of the obm method, makes sqa-obm-cg the most efficient of the methods tested. obm is a good subproblem solver due to its ability to estimate the set of zero variables quickly, so that the subspace step is computed in a small reduced space (the density of PμP_{\mu}^{\ast} is less than 2.5%2.5\% for the three test problems). In addition, the obm-cg method can decrease Fq\|F_{q}\| drastically in a single iteration, often yielding a high quality sqa step and thus a low number of outer iterations.

We note that the quasi-Newton algorithms sqa-obm-qn and pnopt are different methods because of the subproblem solvers they employ. sqa-obm-qn uses the two-phase obm method in which the quasi-Newton step is computed in a subspace, whereas pnopt applies the fista iteration to subproblem (1.2) where HkH_{k} is a quasi-Newton matrix. Although the number of outer iterations of both methods is comparable for problems Estrogen and OONR, there is a large difference in the number of inner iterations due to the power of the obm approach.

Note that the number of inner fista iterations in sqa-fista is always smaller than the number of iterations required by fista. We repeated the experiment with problem OONR using looser optimality tolerances (TOL); the total number of fista iterations is given in Table 5.

Table 5. Effect of convergence tolerance TOL; OONR

TOL 10210^{-2} 10310^{-3} 10410^{-4}
FISTA (# of outer iterations) 30 74 136
SQA-FISTA (# of inner iterations) 47 56 69

These results are typical for the covariance selection problems, where sqa-fista is clearly more efficient than fista; we will see that this is not the case for the problems considered next.

5.2 Logistic Regression Problems

In our second set of experiments the function ff in (1.1) is given by a logistic function. Given NN data pairs (zi,yi)(z_{i},y_{i}), with zin,yi{1,1},i=1,,Nz_{i}\in{\mathbb{R}}^{n},\,y_{i}\in\{-1,1\},\,i=1,\dots,N, the optimization problem is given by

minx1Ni=1Nlog(1+exp(yixTzi))+μx1.\min_{x}\ \ \frac{1}{N}\sum_{i=1}^{N}\log(1+\exp(-y_{i}x^{T}z_{i}))+\mu\|x\|_{1}.
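For completeness, the pieces of this objective needed by the methods above (function value, gradient, and Hessian-vector product of the smooth part) can be sketched as follows, with Z the N-by-n data matrix and y the label vector; this is an illustration only and omits the numerical safeguards a production code would use.

```python
import numpy as np

def logistic_f(x, Z, y):
    # f(x) = (1/N) * sum_i log(1 + exp(-y_i z_i^T x))
    m = y * (Z @ x)
    return np.mean(np.logaddexp(0.0, -m))

def logistic_grad(x, Z, y):
    m = y * (Z @ x)
    s = 1.0 / (1.0 + np.exp(m))          # sigma(-m_i)
    return -(Z.T @ (y * s)) / len(y)

def logistic_hessvec(x, Z, y, v):
    # Hessian-vector product (1/N) Z^T diag(w) Z v, with w_i = sigma(m_i) * sigma(-m_i)
    m = y * (Z @ x)
    s = 1.0 / (1.0 + np.exp(-m))         # sigma(m_i)
    return Z.T @ ((s * (1.0 - s)) * (Z @ v)) / len(y)
```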

We employed the data given in Table 6, which was downloaded from the LIBSVM repository. The values of the regularization parameter μ\mu were taken from Lee et al. [13].


Table 6. Test problems for logistic regression tests

Data set NN number of features μ\mu nnz(xμx^{\ast}_{\mu})
Gisette (scaled) 6,000 5,000 6.67e-04 482 (9.64%)
RCV1 (binary) 20,242 47,236 3.31e-04 140 (0.30%)


Table 7. GISETTE; μ=6.67e04\mu=6.67e-04, optimality tolerance = 10510^{-5}

solver FISTA SQA PNOPT SQA SQA
inner solver fista fista obm-qn obm-cg
outer iterations 1023 11 237 253 10
inner iterations - 1744 25260 1075 770
function/gradient evals 2200 12 240 254 11
Hessian-vector mults - 3761 - 2 3321
time (s) 185.55 311.28 108.84 38.47 273.10
For PNOPT we report the number of prox. evaluations.

Table 8. RCV1; μ=3.31e04\mu=3.31e-04, optimality tolerance = 10510^{-5}

solver FISTA SQA PNOPT SQA SQA
inner solver fista fista obm-qn obm-cg
outer iterations 90 7 19 18 6
inner iterations - 366 1148 27 54
function/gradient evals 184 8 20 19 7
Hessian-vector mults - 738 - 2 120
time (s) 1.95 7.52 11.23 0.92 1.33

For the logistic regression problems, Hessian-vector products are expensive, particularly for gisette, where the data set is dense. As a result, the obm variant that employs quasi-Newton approximations, namely sqa-obm-qn, performs best (even though sqa-obm-cg requires a smaller number of outer iterations). Note that sqa-fista is not efficient; in fact it requires a much larger number of inner iterations than the total number of iterations in fista. In Table 9 we observe the effect of the optimality tolerance, on these two methods, using problem gisette.

Table 9. Effect of convergence tolerance TOL ; Gisette

TOL 10410^{-4} 10510^{-5} 10610^{-6}
FISTA (# of outer iterations) 605 1023 2555
SQA-FISTA (# of inner iterations) 1249 1744 2002

We observe from Table 9 that fista requires a smaller number of iterations; it is only for a very high accuracy of 10610^{-6} that sqa-fista becomes competitive. This is in stark contrast with Table 5.

In summary, for the logistic regression problems the advantage of the sqa method is less pronounced than for the inverse covariance estimation problems, and is achieved only through the appropriate choice of model Hessian HkH_{k} (quasi-Newton) and the appropriate choice of inner solver (active set obm method).

5.3 Description of the orthant based method (OBM)

We conclude this section by describing the orthant-based method used in our experiments to solve the subproblem (2.1). We let tt denote the iteration counter of the obm method, and let ztz_{t} denote its iterates.

Given an iterate ztz_{t}, the method defines an orthant face Ωt\Omega_{t} of n\mathbb{R}^{n} by

Ωt=cl({dn:sgn(di)=sgn([ωt]i),i=1,,n}),\Omega_{t}=\mbox{{cl}}(\{d\in\mathbb{R}^{n}:\mbox{{sgn}}(d_{i})=\mbox{{sgn}}([\omega_{t}]_{i}),\ i=1,...,n\}), (5.2)

with

[ωt]i={sgn([zt]i)if [zt]i0sgn([vt]i)if [zt]i=0,[\omega_{t}]_{i}=\begin{cases}\mbox{{sgn}}([z_{t}]_{i})\quad\ \ \mbox{if }[z_{t}]_{i}\neq 0\\ \mbox{{sgn}}(-[v_{t}]_{i})\quad\mbox{if }[z_{t}]_{i}=0,\end{cases} (5.3)

where vtv_{t} is the minimum norm subgradient of qkq_{k} computed at ztz_{t}, i.e.,

[v_{t}]_{i}=\begin{cases}[\nabla q_{k}(z_{t})]_{i}+\mu&\mbox{ if }[z_{t}]_{i}>0\ \mbox{ or }\ ([z_{t}]_{i}=0\ \land\ [\nabla q_{k}(z_{t})]_{i}+\mu<0),\\ [\nabla q_{k}(z_{t})]_{i}-\mu&\mbox{ if }[z_{t}]_{i}<0\ \mbox{ or }\ ([z_{t}]_{i}=0\ \land\ [\nabla q_{k}(z_{t})]_{i}-\mu>0),\\ 0&\mbox{ if }[z_{t}]_{i}=0\ \mbox{ and }\ 0\in[[\nabla q_{k}(z_{t})]_{i}-\mu,\,[\nabla q_{k}(z_{t})]_{i}+\mu].\end{cases} (5.4)

Defining Ωt\Omega_{t} in this manner was proposed, among others, by Andrew and Gao [1]. In the relative interior of Ωt\Omega_{t}, the model function qkq_{k} is differentiable. The active set in the orthant-based method, defined as At={i:[ωt]i=0}A_{t}=\{i:[\omega_{t}]_{i}=0\}, determines the variables that are kept at zero, while the rest of the variables are chosen to minimize a (smooth) quadratic model. Specifically, the search direction dtd_{t} of the algorithm is given by dt=z^ztd_{t}=\hat{z}-z_{t}, where z^\hat{z} is a solution of

\min_{z\in\mathbb{R}^{n}}\ \ \psi(z)=q_{k}(z_{t})+(z-z_{t})^{T}v_{t}+\textstyle\frac{1}{2}(z-z_{t})^{T}H_{k}(z-z_{t})
\mbox{s.t. }\ z_{i}=[z_{t}]_{i},\ i\in A_{t}. (5.5)

Note that, up to an additive constant, \psi(z)=f(x_{k})+(g(x_{k})+\omega_{t}\mu)^{T}(z-x_{k})+\frac{1}{2}(z-x_{k})^{T}H_{k}(z-x_{k}) on the subspace defined by the constraints in (5.5).
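The orthant-face identification (5.3)-(5.4) and the projection used in the line search are simple componentwise operations. A sketch with function names of our own, where gq denotes the gradient of the smooth part of qkq_{k} evaluated at ztz_{t}:

```python
import numpy as np

def min_norm_subgradient(z, gq, mu):
    """Minimum-norm subgradient (5.4) of q_k at z."""
    v = np.where(z > 0, gq + mu, np.where(z < 0, gq - mu, 0.0))
    zero = (z == 0)
    v = np.where(zero & (gq + mu < 0), gq + mu, v)
    v = np.where(zero & (gq - mu > 0), gq - mu, v)
    return v

def orthant_signs(z, v):
    """Signs (5.3) defining the orthant face; zeros give the active set A_t."""
    return np.where(z != 0, np.sign(z), np.sign(-v))

def project_onto_orthant(z, omega):
    """Projection onto the orthant face (5.2): zero out components with the wrong sign."""
    return np.where(np.sign(z) == omega, z, 0.0)
```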

In the obm-cg variant, we set Hk=2f(xk)H_{k}=\nabla^{2}f(x_{k}), and perform an approximate minimization of this problem using the projected conjugate gradient iteration [17]. In the obm-qn version, HkH_{k} is a limited memory BFGS matrix and z^\hat{z} is the exact solution of (5.5). This requires computation of the inverse reduced Hessian Rk=(ZkTHkZk)1R_{k}=(Z_{k}^{T}H_{k}Z_{k})^{-1}, where ZkZ_{k} is a basis for the subspace of free variables in (5.5). The matrix RkR_{k} can be updated using the compact representations of quasi-Newton matrices [6]. After the direction dt=z^ztd_{t}=\hat{z}-z_{t} has been computed, the obm method performs a line search along dtd_{t}, projecting the iterate back onto the orthant face Ωt\Omega_{t}, until a sufficient reduction in the function qkq_{k} has been obtained. Although this algorithm performed reliably in our tests, its convergence has not been proved (to the best of our knowledge) because the orthant face identification procedure (5.2)-(5.4) can lead to arbitrarily small steps.

Our obm-qn algorithm differs from the owl method in two respects: it does not realign the direction zztz-z_{t} so that the signs of its components match those of vtv_{t}, and it performs the minimization of the model exactly, while the owl method computes only an approximate solution, defined by computing the reduced inverse Hessian ZkTHk1ZkZ_{k}^{T}H_{k}^{-1}Z_{k} instead of the inverse of the reduced Hessian RkR_{k}.

6 Final Remarks

One of the key ingredients in making the successive quadratic approximation (or proximal Newton) method practical for problem (1.1) is the ability to terminate the inner iteration as soon as a step of sufficiently good quality is computed. In this paper, we have proposed such an inexactness criterion; it employs an optimality measure that is tailored to the structure of the problem. We have shown that the resulting algorithm is globally convergent, that its rate of convergence can be controlled through an inexactness parameter, and that the inexact method will naturally accept unit step lengths in a neighborhood of the solution. We have also argued that our inexactness criterion is preferable to the one proposed by Lee et al. [13].

The method presented in this paper can use any algorithm for the inner minimization of the subproblem (1.2). In particular, all the results are applicable to the case when this inner minimization is performed using a coordinate descent algorithm [12, 25]. In our numerical tests we employed fista and an orthant-based method as the inner solvers, and found the latter method to be particularly effective. The efficacy of the successive quadratic approximation approach depends on the choice of matrix HkH_{k} in (1.2), which is problem dependent: when Hessian-vector products are expensive to compute, then a quasi-Newton approximation is most efficient; otherwise defining HkH_{k} as the exact Hessian and implementing a Newton-CG iteration is likely to give the best results.


Acknowledgement. The authors thank Jong-Shi Pang for his insights and advice throughout the years. The theory presented by Facchinei and Pang [11] in the context of variational inequalities was used in our analysis, showing the power and generality of that masterful book.

References

  • [1] G. Andrew and J. Gao. Scalable training of L1{L}_{1}-regularized log-linear models. In Proceedings of the 24th international conference on Machine Learning, pages 33–40. ACM, 2007.
  • [2] O. Banerjee, L. El Ghaoui, and A. d’Aspremont. Model selection through sparse maximum likelihood estimation for multivariate gaussian or binary data. The Journal of Machine Learning Research, 9:485–516, 2008.
  • [3] O. Banerjee, L. El Ghaoui, A. d’Aspremont, and G. Natsoulis. Convex optimization techniques for fitting sparse Gaussian graphical models. In Proceedings of the 23rd international conference on Machine learning, pages 89–96. ACM, 2006.
  • [4] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
  • [5] Stephen R Becker, Emmanuel J Candès, and Michael C Grant. Templates for convex cone problems with applications to sparse signal recovery. Mathematical Programming Computation, 3(3):165–218, 2011.
  • [6] R. H. Byrd, J. Nocedal, and R. Schnabel. Representations of quasi-Newton matrices and their use in limited memory methods. Mathematical Programming, 63(4):129–156, 1994.
  • [7] Richard H Byrd, Gillian M Chin, Jorge Nocedal, and Yuchen Wu. Sample size selection in optimization methods for machine learning. Mathematical programming, 134(1):127–155, 2012.
  • [8] R. H. Byrd, G. M. Chin, J. Nocedal, and F. Oztoprak. A family of second-order methods for convex L1 regularized optimization. Technical report, Optimization Center Report 2012/2, Northwestern University, 2012.
  • [9] R. S. Dembo, S. C. Eisenstat, and T. Steihaug. Inexact-Newton methods. SIAM Journal on Numerical Analysis, 19(2):400–408, 1982.
  • [10] A. L. Dontchev and R. T. Rockafellar. Convergence of inexact Newton methods for generalized equations. Mathematical Programming, pages 1–23, 2013.
  • [11] F. Facchinei and J.S. Pang. Finite-dimensional variational inequalities and complementarity problems, volume 1. Springer Verlag, 2003.
  • [12] C. J. Hsieh, M. A. Sustik, P. Ravikumar, and I. S. Dhillon. Sparse inverse covariance matrix estimation using quadratic approximation. Advances in Neural Information Processing Systems (NIPS), 24, 2011.
  • [13] J. D. Lee, Y. Sun, and M. A. Saunders. Proximal Newton-type methods for minimizing composite functions. arXiv preprint arXiv:1206.1623, 2012.
  • [14] L. Li and K. C. Toh. An inexact interior point method for L1-regularized sparse covariance selection. Mathematical Programming Computation, 2(3):291–315, 2010.
  • [15] M. W. Schmidt, N. Le Roux, and F. Bach. Convergence rates of inexact proximal-gradient methods for convex optimization. NIPS, pages 1458–1466, 2011.
  • [16] A. Milzarek and M. Ulbrich. A semismooth Newton method with multi-dimensional filter globalization for l1-optimization. Technical report, Technical University, Munich, 2010.
  • [17] J. Nocedal and S. J. Wright. Numerical Optimization. Springer Series in Operations Research. Springer, 1999.
  • [18] Peder Olsen, Figen Oztoprak, Jorge Nocedal, and Steven Rennie. Newton-like methods for sparse inverse covariance estimation. In Advances in Neural Information Processing Systems 25, pages 764–772, 2012.
  • [19] J. M. Ortega and W. C. Rheinboldt. Iterative Solution of Nonlinear Equations in Several Variables. Academic Press, London, 1970.
  • [20] M. Patriksson. Nonlinear Programming and Variational Inequality Problems, a Unified Approach. Kluwer Academic Publishers, Dordrecht, The Netherlands, 1998.
  • [21] Michael Patriksson. Cost approximation: a unified framework of descent algorithms for nonlinear programs. SIAM Journal on Optimization, 8(2):561–582, 1998.
  • [22] J. D. Picka. Gaussian Markov random fields: theory and applications. Technometrics, 48(1):146–147, 2006.
  • [23] S. Salzo and S. Villa. Inexact and accelerated proximal point algorithms. Journal of Convex Analysis, 19(4), 2012.
  • [24] S. Sra, S. Nowozin, and S.J. Wright. Optimization for Machine Learning. MIT Press, 2011.
  • [25] X. Tan and K. Scheinberg. Complexity of inexact proximal newton method. Technical report, Dept of ISE, Lehigh University, 2013.
  • [26] Rachael Tappenden, Peter Richtárik, and Jacek Gondzio. Inexact coordinate descent: complexity and preconditioning. arXiv preprint arXiv:1304.5530, 2013.
  • [27] S.J. Wright, R.D. Nowak, and M.A.T. Figueiredo. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57(7):2479–2493, 2009.