An Inexact Successive Quadratic Approximation Method for Convex L-1 Regularized Optimization
Abstract
We study a Newton-like method for the minimization of an objective function that is the sum of a smooth convex function and an ℓ1 regularization term. This method, which is sometimes referred to in the literature as a proximal Newton method, computes a step by minimizing a piecewise quadratic model of the objective function. In order to make this approach efficient in practice, it is imperative to perform this inner minimization inexactly. In this paper, we give inexactness conditions that guarantee global convergence and that can be used to control the local rate of convergence of the iteration. Our inexactness conditions are based on a semi-smooth function that represents a (continuous) measure of the optimality conditions of the problem, and that embodies the soft-thresholding iteration. We give careful consideration to the algorithm employed for the inner minimization, and report numerical results on two test sets originating in machine learning.
1 Introduction
In this paper, we study an inexact Newton-like method for solving optimization problems of the form
min_{x ∈ ℝⁿ}  F(x) := f(x) + μ ‖x‖₁,        (1.1)
where f is a smooth convex function and μ > 0 is a (fixed) regularization parameter. The method constructs, at every iteration, a piecewise quadratic model of F and minimizes this model inexactly to obtain a new estimate of the solution.
The piecewise quadratic model q_k is defined, at an iterate x_k, as
q_k(x) = f(x_k) + ∇f(x_k)ᵀ(x − x_k) + ½ (x − x_k)ᵀ H_k (x − x_k) + μ ‖x‖₁,        (1.2)
where ∇f(x_k) denotes the gradient of f at x_k and H_k denotes the Hessian ∇²f(x_k) or a quasi-Newton approximation to it. After computing an approximate solution x̂_k of this model, the algorithm performs a backtracking line search along the direction d_k = x̂_k − x_k to ensure decrease in the objective F.
We refer to this method as the successive quadratic approximation method, in analogy to the successive quadratic programming method for nonlinear programming. This method is also known in the literature as a “proximal Newton method” [20, 24], but we prefer not to use the term “proximal” in this context since the quadratic term in (1.2) is better interpreted as a second-order model than as a term that simply restricts the size of the step. The paper covers both the case when the quadratic model is constructed with the exact Hessian and the case when a quasi-Newton approximation is used.
The two crucial ingredients in the inexact successive quadratic approximation method are the algorithm used for the minimization of the model q_k and the criterion that controls the degree of inexactness in this minimization. In the first part of the paper, we propose an inexactness criterion for the minimization of q_k, prove that it guarantees global convergence of the iterates, and show that it can be used to control the local rate of convergence. This criterion is based on the optimality conditions for the minimization of q_k, expressed in the form of a semi-smooth function that is derived from the soft-thresholding operator.
The second part of the paper is devoted to the practical implementation of the method. Here, the choice of algorithm for the inner minimization of the model q_k is vital, and we consider two options: fista [4], which is a first-order method, and an orthant-based method [1, 7, 8]. The latter is a second-order method in which each iteration consists of an orthant-face identification phase, followed by the minimization of a smooth model restricted to that orthant. The subspace minimization can be performed by computing a quasi-Newton step or a Newton-CG step (we explore both options). A projected backtracking line search is then applied; see section 5.3.
Some recent work on successive quadratic approximation methods for problem (1.1) includes: Hsieh et al. [12], where (1.2) is solved using a coordinate descent method, and which focuses on the inverse covariance selection problem; [25], which also employs coordinate descent but uses a different working set identification than [12] and makes use of a quasi-Newton model; and Olsen et al. [18], where the inner solver is fista. None of these papers addresses convergence for inexact solutions of the subproblem. Recently, Lee, Sun and Saunders [13] presented an inexact proximal Newton method that, at first glance, appears to be very close to the method presented here. Their inexactness criterion is, however, different from ours and suffers from a number of drawbacks, as discussed in section 2.
Inexact methods for solving generalized equations have been studied by Patriksson [21], and more recently by Dontchev and Rockafellar [10]. Special cases of the general methods described in those papers result in inexact successive quadratic approximation algorithms. Patriksson [21] presents convergence analyses based on two conditions for controlling inexactness. The first is based on running the subproblem solver for a limited number of steps. The second rule requires that the residual norm be sufficiently small, but it does not cover the inexactness conditions presented in this paper (since the residual is computed differently and their inexactness measure is different from ours). The rule suggested in Dontchev and Rockafellar [10] is very general, but it too does not cover the condition presented in this paper. Our rule, like those presented in [10, 13], is inspired by the classical inexactness condition proposed by Dembo et al. [9], and reduces to it in the smooth unconstrained case (i.e., when μ = 0).
Another line of research that is relevant to this paper is the global and rate-of-convergence analysis of inexact proximal-gradient algorithms, which can be seen as special cases of successive quadratic approximation without acceleration [15, 23, 26]. The inexactness conditions applied in those papers require that the subproblem objective value be ε_k-close to the optimal subproblem objective [15, 26], or that the approximate solution be exact with respect to an ε_k-perturbed subdifferential [23], for a decreasing sequence {ε_k}.
Our interest in the successive quadratic approximation method is motivated by the fact that it has not received sufficient attention from a practical perspective, where inexact solutions to the inner problem (1.2) are imperative. Although a number of studies have been devoted to the formulation and analysis of proximal Newton methods for convex composite optimization problems, as mentioned above, the viability of the approach in practice has not been fully explored.
The paper is organized as follows. In section 2 we outline the algorithm, including the inexactness criteria that govern the solution of the subproblem (2.1). In sections 3 and 4, we analyze the global and local convergence properties of the algorithm. Numerical experiments are reported in section 5. The paper concludes in section 6 with a summary of our findings and a list of questions to explore.
Notation. In the remainder of the paper, ‖·‖ denotes any vector norm. We sometimes abbreviate the successive quadratic approximation method as the “SQA method”, and note that this algorithm is often referred to in the literature as the “proximal Newton method”.
2 The Algorithm
Given an iterate x_k, an iteration of the algorithm begins by forming the model q_k defined in (1.2), where μ > 0 is a given scalar and H_k is an approximation to the Hessian ∇²f(x_k). Next, the algorithm computes an approximate minimizer x̂_k of the subproblem
min_{x ∈ ℝⁿ}  q_k(x).        (2.1)
The point x̂_k defines the search direction d_k = x̂_k − x_k. The algorithm then performs a backtracking line search along d_k that ensures sufficient decrease in the objective F. The minimization of (2.1) should be performed by a method that exploits the structure of this problem.
In order to compute an adequate approximate solution to (2.1), we need some measure of closeness to optimality. In the case of smooth unconstrained optimization (i.e., with μ = 0), the norm of the gradient is a standard measure of optimality, and it is common [9] to require the approximate solution x̂_k to satisfy the condition
‖∇f(x_k) + H_k (x̂_k − x_k)‖ ≤ η_k ‖∇f(x_k)‖.        (2.2)
The term on the left-hand side of (2.2) is a measure of optimality for the model q_k in the unconstrained case.
For problem (1.1), the length of the iterative soft-thresholding (ISTA) step is a natural measure of optimality. The ISTA iteration is given by
x̄(x) = arg min_z { ∇f(x)ᵀ(z − x) + (τ/2) ‖z − x‖²₂ + μ ‖z‖₁ },        (2.3)
where τ > 0 is a fixed parameter. It is easy to verify that x − x̄(x) is zero if and only if x is a solution of problem (1.1). We need to express x − x̄(x) in a way that is convenient for our analysis, and for this purpose we note [16] that some algebraic manipulations show that τ(x − x̄(x)) = F_τ(x), where
F_τ(x) = ∇f(x) − P_[−μ,μ]( ∇f(x) − τ x ).        (2.4)
Here P_[−μ,μ](·) denotes the component-wise projection of its argument onto the interval [−μ, μ], and τ is a positive scalar.
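For completeness, the algebraic manipulation mentioned above can be made explicit with the soft-thresholding identity S_λ(z) = z − P_[−λ,λ](z); the display below is our own verification sketch, written in the notation introduced here.

```latex
% Verification that \tau (x - \bar{x}(x)) = F_\tau(x), using
% S_\lambda(z) = z - P_{[-\lambda,\lambda]}(z) and the fact that the ISTA
% step (2.3) is soft-thresholding with threshold \mu/\tau.
\begin{align*}
\bar{x}(x) &= S_{\mu/\tau}\!\left(x - \tfrac{1}{\tau}\nabla f(x)\right)
            = x - \tfrac{1}{\tau}\nabla f(x)
              - P_{[-\mu/\tau,\,\mu/\tau]}\!\left(x - \tfrac{1}{\tau}\nabla f(x)\right), \\
\tau\bigl(x - \bar{x}(x)\bigr)
           &= \nabla f(x) + P_{[-\mu,\,\mu]}\bigl(\tau x - \nabla f(x)\bigr)
            = \nabla f(x) - P_{[-\mu,\,\mu]}\bigl(\nabla f(x) - \tau x\bigr)
            = F_\tau(x).
\end{align*}
```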
One can directly verify that (2.4) is a valid optimality measure by noting that F_τ(x) = 0 is equivalent to the standard necessary optimality condition for (1.1), namely 0 ∈ ∇f(x) + μ ∂‖x‖₁.
For the objective q_k of (2.1), this function takes the form
F_τ^k(x) = ∇f(x_k) + H_k (x − x_k) − P_[−μ,μ]( ∇f(x_k) + H_k (x − x_k) − τ x ).        (2.5)
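In practice both measures are cheap to evaluate, since the projection is a component-wise clipping operation. The following sketch is our own illustration (the routine and variable names are not from the paper); it assumes the gradient and the Hessian, or Hessian approximation, are available as numpy objects.

```python
import numpy as np

def opt_measure(x, grad_smooth, mu, tau):
    """Residual (2.4)/(2.5): grad_smooth - P_[-mu,mu](grad_smooth - tau*x).

    `grad_smooth` is the gradient of the smooth part of the objective at x;
    the measure vanishes exactly at minimizers of the L1-regularized problem.
    """
    return grad_smooth - np.clip(grad_smooth - tau * x, -mu, mu)

# Measure (2.4) for F(x) = f(x) + mu*||x||_1:
#     opt_measure(x, grad_f(x), mu, tau)
# Measure (2.5) for the model q_k, whose smooth part has gradient
# grad_f(x_k) + H_k (x - x_k):
#     opt_measure(x, grad_f(x_k) + H_k @ (x - x_k), mu, tau)
```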
Using the measures F_τ and F_τ^k in a manner similar to (2.2) leads to the condition ‖F_τ^k(x̂_k)‖ ≤ η_k ‖F_τ(x_k)‖. However, depending on the method used to approximately solve (2.1), this does not guarantee that d_k = x̂_k − x_k is a descent direction for F. To achieve this, we impose the additional condition that the quadratic model is decreased at x̂_k.
Inexactness Conditions. A point x̂_k is considered an acceptable approximate solution of subproblem (2.1) if
‖F_τ^k(x̂_k)‖ ≤ η_k ‖F_τ(x_k)‖   and   q_k(x̂_k) ≤ q_k(x_k),        (2.6)
for some parameter η_k ∈ [0, 1), where ‖·‖ is any norm. (Note that F_τ^k(x_k) = F_τ(x_k), so that the first condition can also be written as ‖F_τ^k(x̂_k)‖ ≤ η_k ‖F_τ^k(x_k)‖.)
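As an illustration of how conditions (2.6) can be checked inside an inner solver, the following sketch reuses opt_measure from the sketch above and evaluates the model directly from its ingredients; all names are ours and are only illustrative.

```python
import numpy as np

def acceptable_inner_iterate(x_hat, x_k, f_k, grad_fk, H_k, mu, tau, eta_k):
    """Return True if the trial iterate x_hat satisfies conditions (2.6)."""
    def q(x):                       # model (1.2) evaluated at x
        d = x - x_k
        return f_k + grad_fk @ d + 0.5 * d @ (H_k @ d) + mu * np.abs(x).sum()

    res_model = opt_measure(x_hat, grad_fk + H_k @ (x_hat - x_k), mu, tau)  # (2.5)
    res_outer = opt_measure(x_k, grad_fk, mu, tau)                          # (2.4)
    return (np.linalg.norm(res_model) <= eta_k * np.linalg.norm(res_outer)
            and q(x_hat) <= q(x_k))
```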
The method is summarized in Algorithm 2.1.
Algorithm 2.1: Inexact Successive Quadratic Approximation (SQA) Method for Problem (1.1)
Choose an initial iterate x_0.
Select constants σ ∈ (0, 1) (used in (2.7)) and τ > 0 (which is used in definitions (2.4), (2.5)).
for k = 0, 1, 2, … until the optimality conditions of (1.1) are satisfied:
1.
Compute (or update) the model Hessian H_k and form the piecewise quadratic model q_k given in (1.2);
2.
Compute an inexact solution x̂_k of subproblem (2.1) satisfying conditions (2.6).
3.
Perform a backtracking line search along the direction d_k = x̂_k − x_k: starting with α = 1, find a steplength α_k such that
F(x_k + α_k d_k) ≤ F(x_k) + α_k σ [ ℓ_k(x_k + d_k) − F(x_k) ],        (2.7)
where ℓ_k is the following piecewise linear model of F at x_k:
ℓ_k(x) = f(x_k) + ∇f(x_k)ᵀ(x − x_k) + μ ‖x‖₁.        (2.8)
4.
Set x_{k+1} = x_k + α_k d_k.
end(for)
For now, we simply assume that the sequence {η_k} in (2.6) satisfies η_k ≤ η for some η < 1, but in section 4 we show that by choosing {η_k} and the parameter τ appropriately, the algorithm achieves a fast rate of convergence. One may wonder whether the backtracking line search of Step 3 might hinder sparsity of the iterates. Our numerical experience indicates that this is not the case because, in our tests, Algorithm 2.1 almost always accepts the unit steplength (α_k = 1).
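To fix ideas, here is a minimal numpy sketch of the outer iteration of Algorithm 2.1; it assumes an inner_solver routine returning an approximate minimizer of q_k that satisfies (2.6). The function names, default constants, and the halving rule in the backtracking line search are our own illustrative choices, not prescriptions from the paper.

```python
import numpy as np

def sqa(x0, f, grad_f, hess, inner_solver, mu, tau,
        sigma=1e-4, eta=0.1, tol=1e-6, max_iter=100):
    """Sketch of the outer loop of Algorithm 2.1 (inexact SQA)."""
    x = x0.copy()
    for k in range(max_iter):
        g = grad_f(x)
        # Stop on the optimality measure (2.4).
        if np.linalg.norm(g - np.clip(g - tau * x, -mu, mu)) <= tol:
            break
        H = hess(x)                         # exact Hessian or quasi-Newton matrix
        x_hat = inner_solver(x, g, H, eta)  # inexact solution of (2.1) satisfying (2.6)
        d = x_hat - x

        # Backtracking line search (2.7): sufficient decrease is measured against
        # the reduction predicted by the piecewise linear model (2.8).
        F_x = f(x) + mu * np.abs(x).sum()
        delta = g @ d + mu * np.abs(x + d).sum() - mu * np.abs(x).sum()
        alpha = 1.0
        while (f(x + alpha * d) + mu * np.abs(x + alpha * d).sum()
               > F_x + sigma * alpha * delta):
            alpha *= 0.5
        x = x + alpha * d
    return x
```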
It is worth pointing out that Lee et al. [13] recently proposed and analyzed an inexactness criterion that is similar to the first inequality of (2.6). The main difference is that they use the subgradient of q_k on the left-hand side of the inequality, and both norms are scaled by the model Hessian H_k. They claim convergence results similar to ours, but a worrying consequence of the lack of continuity of the subgradient of q_k is that their inexactness condition can fail for vectors arbitrarily close to the exact minimizer of q_k. As a result, their criterion is not an appropriate termination test for the inner iteration. (In addition, their use of this scaling precludes setting H_k to the exact Hessian, except for small or highly structured problems.)
3 Global Convergence
In this section, we show that Algorithm 2.1 is globally convergent under certain assumptions on the function f and the (approximate) Hessians H_k. Specifically, we assume that f is a differentiable function with Lipschitz continuous gradient, i.e., there is a constant L > 0 such that
‖∇f(x) − ∇f(y)‖ ≤ L ‖x − y‖        (3.1)
for all x, y. We denote by λ_min(H_k) and λ_max(H_k) the smallest and largest eigenvalues of H_k, respectively.
Theorem 3.1
Suppose that f is a smooth function that is bounded below and that satisfies (3.1). Let {x_k} be the sequence of iterates generated by Algorithm 2.1, and suppose that there exist constants 0 < λ ≤ Λ such that the sequence {H_k} satisfies
λ ≤ λ_min(H_k) ≤ λ_max(H_k) ≤ Λ for all k. Then
lim_{k→∞} ‖F_τ(x_k)‖ = 0.        (3.2)
Proof. We first show that if x̂_k is an approximate solution of (2.1) that satisfies the inexactness conditions (2.6), then there is a constant, independent of k, such that for all k
(3.3) |
where and are defined in and . To see this, note that by (2.6)
and therefore
(3.4) |
Next, since , and using and the contraction property of the projection, we have that
Combining this expression with , we obtain for
Note that as and .
Let us define the search direction as d_k = x̂_k − x_k. We now show that by performing a line search along d_k we can ensure that the algorithm provides sufficient decrease in the objective function F, and this will allow us to establish the limit (3.2).
Since ∇f satisfies the Lipschitz condition (3.1), we have
and thus
Recalling the definition of , we have
By convexity of the ℓ₁-norm, we have that
Combining this inequality with (3.4), and recalling that , we obtain for ,
(3.5) |
provided . Therefore, the sufficient decrease condition (2.7) is satisfied for any steplength satisfying
and if the backtracking line search cuts the steplength in half (say) after each trial, we have that the steplength chosen by the line search satisfies
Thus, from (3.5) and (3.3) we obtain
Since f is assumed to be bounded below, so is the objective function F, and given that the decrease in F is proportional to the quantity above, we obtain the limit (3.2).
We note that to establish this convergence result it was not necessary to assume convexity of f.
4 Local Convergence
To analyze the local convergence rate of the successive quadratic approximation method, we use the theory developed in Chapter 7 of Facchinei and Pang [11]. To do this, we first show that, if x* is a nonsingular minimizer of F, then the functions defined in (4.5) below are a family of uniformly Lipschitzian nonsingular homeomorphisms for all x_k in a neighborhood of x*.
Lemma 4.1
If H is a symmetric positive definite matrix with smallest eigenvalue λ_min(H), then the function of z given by
is strongly monotone if . Specifically, for any vectors ,
(4.1) |
Proof. It is straightforward to show that for any scalars and interval ,
(4.2) |
Therefore for any vectors and , and for any index we have
where is a scalar implied by (4.2). This implies that
where . Hence
(4.3) |
Since the right hand side is a quadratic form, we symmetrize the matrix, and if we let , the right side is
(4.4) |
To show that the symmetric matrix inside the square brackets is positive definite, we note that since is positive semi-definite, we have that
Substituting this into (4.4) yields
since is positive semi-definite given that the elements of the diagonal matrix are in . If is an eigenvalue of , the corresponding eigenvalue of the matrix is since our assumption on implies . Therefore, we have from (4.3) that
Inequality (4.1) establishes that the function is strongly monotone. Next we show that, when H is defined as the Hessian of f, the functions (4.5) are homeomorphisms and that they represent an accurate approximation to the function F_τ defined in (2.4).
Theorem 4.2
If ∇²f(x*) is positive definite and τ satisfies the condition of Lemma 4.1, then there is a neighborhood of x* such that for all x_k in that neighborhood the functions of z given by
F_τ^k(z) = ∇f(x_k) + ∇²f(x_k)(z − x_k) − P_[−μ,μ]( ∇f(x_k) + ∇²f(x_k)(z − x_k) − τ z )        (4.5)
are a family of homeomorphisms from ℝⁿ to ℝⁿ, whose inverses are uniformly Lipschitz continuous. In addition, if ∇²f is Lipschitz continuous, then there exists a constant c > 0 such that
(4.6) |
for any .
Proof. Since is continuous, there is a neighborhood of and a positive constant such that and , for all . It follows from Lemma 4.1 that for any such , the function given by (4.5) is strongly (or uniformly) monotone with constant greater than . We now invoke the Uniform Monotonicity Theorem (see e.g. Theorem 6.4.4 in [19]), which states that if a function is continuous and uniformly monotone, then is a homeomorphism of onto itself. We therefore conclude that is a homeomorphism.
In addition, we have from (4.1) and the Cauchy–Schwarz inequality that
which implies Lipschitz continuity of with constant . To establish , note that
and thus
by the non-expansiveness of a projection onto a convex set and Taylor’s theorem.
Theorem 4.2 shows that (4.5) defines a strong nonsingular Newton approximation in the sense of Definition 7.2.2 of Facchinei and Pang [11]. This implies quadratic convergence for the (exact) successive quadratic approximation (SQA) method.
Theorem 4.3
If ∇²f is Lipschitz continuous and positive definite at x*, and τ satisfies the condition of Lemma 4.1, then there is a neighborhood of x* such that, if x_0 lies in that neighborhood, the iteration that defines x_{k+1} as the unique solution of the equation F_τ^k(z) = 0
converges quadratically to x*.
Proof. By Theorem 4.2, (4.5) satisfies the definition of a nonsingular strong Newton approximation of F_τ at x* given by Facchinei and Pang ([11], Definition 7.2.2), and thus by Theorem 7.2.5 of that book the local convergence is quadratic.
Now we consider the inexact SQA algorithm that, at each step, computes a point x̂_k satisfying
F_τ^k(x̂_k) = r_k,        (4.7)
where r_k is a vector such that ‖r_k‖ ≤ η_k ‖F_τ(x_k)‖, with η_k < 1; see (2.6). We obtain the following result for a method that sets x_{k+1} = x̂_k.
Theorem 4.4
Suppose that ∇²f is Lipschitz continuous and positive definite at x*, and that x_{k+1} is computed by solving
where the residual r_k satisfies ‖r_k‖ ≤ η_k ‖F_τ(x_k)‖. Then, there is a neighborhood of x* and a value η̄ > 0 such that, if η_k ≤ η̄ for all k and x_0 lies in that neighborhood, the sequence {x_k} converges Q-linearly to x*. In addition, if η_k → 0, then the convergence rate of {x_k} is Q-superlinear. Finally, if η_k ≤ θ ‖F_τ(x_k)‖ for some θ > 0, then the convergence rate is Q-quadratic.
Proof. By Theorem 4.2, the iteration described in the statement of the theorem satisfies all the conditions of Theorem 7.2.8 of [11]. The results then follow immediately from that theorem.
We have shown above that the inexact successive quadratic approximation (SQA) method with unit steplengths yields a fast rate of convergence. We now show that this inexact SQA algorithm will select the steplength α_k = 1 in a neighborhood of the solution. In order to do so, we strengthen the inexactness conditions (2.6) slightly so that they read
(4.8) |
where σ is the input parameter of Algorithm 2.1 used in (2.7). Thus, instead of simple decrease, we now impose sufficient decrease in q_k.
Lemma 4.5
If H_k is positive definite, the inexactness condition (4.8) is satisfied by any sufficiently accurate solution to (2.1).
Proof. If we denote by x_k* the (exact) minimizer of q_k, we claim that
(4.9) |
To see this, note that , and since and are convex and minimizes , there exists a vector such that . By convexity
(4.10) |
Therefore,
which proves (4.9).
Now consider the continuous function . By its value at is
(4.11) |
where the last inequality follows from (4.10). Therefore, by continuity, the value of this function at any point in some neighborhood of x_k* is negative, implying that (4.8) is satisfied by any approximate solution sufficiently close to x_k*.
Theorem 4.6
Suppose that the assumptions of Theorem 4.4 hold and that x̂_k satisfies the strengthened inexactness conditions (4.8). Then, for all x_k sufficiently close to x*, the unit steplength α_k = 1 satisfies the sufficient decrease condition (2.7) and is therefore accepted by the backtracking line search of Algorithm 2.1.
Proof. Given that satisfies , it follows from Taylor’s theorem, the Lipschitz continuity of , and equation that for some constant
if . Since the global convergence analysis implies , we have from (2.7) that eventually the steplength is accepted and used.
We note that (4.8) is stronger than (2.6), and therefore all the results presented in this and the previous section apply also to the strengthened condition (4.8). Theorem 4.6 implies that if Algorithm 2.1 is run with the strengthened accuracy condition (4.8), then once the iterates are close enough to a nonsingular minimizer x*, they exhibit the linear, superlinear or quadratic convergence rates described in Theorem 4.4, provided the sequence {η_k} is chosen appropriately.
5 Numerical Results
One of the goals of this paper is to investigate whether the successive quadratic approximation (sqa) method is, in fact, an effective approach for solving convex ℓ1-regularized problems of the form (1.1). Indeed, it is reasonable to ask whether it might be more effective to apply an algorithm such as ista or fista directly to problem (1.1), rather than performing an inner iteration on the subproblem (2.1). Note that each iteration of fista requires an evaluation of the gradient of the objective (1.1), whereas each inner iteration for the subproblem (2.1) involves the product of H_k times a vector.
To study this question, we explore various algorithmic options within the successive quadratic approximation method, and evaluate their performance using data sets with different characteristics. One of the data sets concerns the covariance selection problem (where the unknown is a matrix), and the other involves a logistic objective function (where the unknown is a vector). Our benchmark is fista applied directly to problem (1.1). fista enjoys convergence guarantees when applied to problem (1.1), and is generally regarded as an effective method.
We employ two types of methods for solving the subproblem (2.1) in the successive quadratic approximation method: fista and an orthant-based method (obm) [1, 7, 8]. The orthant-based method (described in detail in section 5.3) is a two-phase method in which an active orthant face is first identified, and a subspace minimization is then performed with respect to the variables that define the orthant face. The subspace phase can be performed by means of a Newton-CG iteration or by computing a quasi-Newton step; we consider both options.
The methods employed in our numerical tests are as follows.
-
FISTA. This is the fista algorithm [4] applied to the original problem (1.1). We used the implementation from the TFOCS package, called N83 [5]. This implementation differs from the (adaptive) algorithm described by Beck and Teboulle [4] in the way the Lipschitz parameter is updated, and performed significantly better in our test set than the method in [4].
-
PNOPT. This is the sequential quadratic approximation (proximal Newton) method of Lee, Sun and Saunders [13]. The Hessian in the subproblem (2.1) is updated using the limited memory BFGS formula, with a (default) memory of 50. (The pnopt package also allows for the use of the exact Hessian, but since this matrix must be formed and factored at each iteration, its use is impractical.) The subproblem (2.1) is solved using the N83 implementation of fista mentioned above. pnopt provides the option of using sparsa [27] instead of N83 as an inner solver, but the performance of sparsa was not robust in our tests, and we will not report results with it.
-
SQA. This is the successive quadratic approximation method described in Algorithm 2.1. We implemented three variants that differ in the method used to solve the subproblem (2.1).
-
SQA-FISTA. This is an sqa method using fista-n83 to solve the subproblem (2.1). The matrix H_k is the exact Hessian ∇²f(x_k); each inner fista iteration requires two multiplications with this matrix.
-
SQA-OBM-CG. This is an sqa method that employs an orthant-based method to solve the subproblem (2.1). The obm method performs the subspace minimization step using a Newton-CG iteration. The number of CG iterations varies during the course of the (outer) iteration according to a rule that depends on the outer iteration number.
-
SQA-OBM-QN. This is an sqa method where the inner solver is an obm method in which the subspace phase consists of a limited memory BFGS step, with a memory of 50. The correction pairs used to update the quasi-Newton matrix employ gradient differences from the outer iteration (as in pnopt).
The initial point was set to the zero vector in all experiments, and the iteration was terminated when the optimality measure defined in (2.4) fell below the prescribed tolerance. A maximum number of outer iterations was imposed on all solvers. In the SQA method, the parameter η_k in the inexactness condition (2.6) and the constant σ in (2.7) were given fixed values. For pnopt we set ‘ftol’ and ‘xtol’ to very small values (so that those two tests do not terminate the iteration prematurely), and chose ‘Lbfgs_mem’=50.
We noted above that Algorithm 2.1 can employ the inexactness conditions (2.6) or (4.8). We implemented both conditions and obtained identical results in all our runs.
We now describe the numerical tests performed with these methods.
5.1 Inverse Covariance Estimation Problems
The task of estimating a high dimensional sparse inverse covariance matrix is closely tied to the topic of Gaussian Markov random fields [22], and arises in a variety of recognition tasks. This model can be used to recover a sparse social or genetic network from user or experimental data.
A popular approach to solving this problem [2, 3] is to minimize the negative log likelihood function, under the assumption of normality, with an additional term to enforce sparsity in the estimated inverse covariance matrix. We can write the optimization problem as
min_{X ≻ 0}  −log det X + trace(S X) + μ ‖X‖₁,        (5.1)
where S is a given sample covariance matrix, X denotes the unknown inverse covariance matrix, μ is the regularization parameter, and ‖X‖₁ = Σ_{i,j} |X_{ij}|. We note that the Hessian of the first two terms in (5.1) has a very special structure: it is given by X⁻¹ ⊗ X⁻¹.
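As an illustration of this structure, the gradient of the smooth part of (5.1) is S − X⁻¹, and a Hessian-vector product with a symmetric direction V reduces to two matrix products with X⁻¹. The sketch below is our own (it forms X⁻¹ explicitly for clarity; a careful implementation would rely on Cholesky-based solves instead).

```python
import numpy as np

def covsel_smooth_oracle(X, S):
    """Gradient and Hessian action for f(X) = -log det X + trace(S X).

    Assumes X is symmetric positive definite; returns the gradient and a
    closure applying the Hessian X^{-1} (kron) X^{-1} to a direction V.
    """
    X_inv = np.linalg.inv(X)
    grad = S - X_inv                      # gradient of the smooth part of (5.1)

    def hess_vec(V):
        # Action of the Hessian on V, i.e. mat(X^{-1} (kron) X^{-1} vec(V)).
        return X_inv @ V @ X_inv

    return grad, hess_vec
```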
Since the objective is not defined when X is not positive definite, we define it as +∞ in that case to ensure that all iterates remain positive definite. Such a strategy could, however, be detrimental to a solver like fista, and to avoid this we selected the starting point so that this situation did not occur.
We employ three data sets: the well-known Estrogen and Leukemia test sets [14], and the problem given in Olsen et al. [18], which we call OONR. The characteristics of the data sets are given in Table 1, where nnz(X*) denotes the number of nonzeros in the solution X*.
Table 1.
Data set | number of features | μ | nnz(X*)
---|---|---|---|
Estrogen | 692 | 0.5 | 10,614 (2.22%) |
Leukemia | 1255 | 0.5 | 34,781 (2.21%) |
OONR | 500 | 0.5 | 1856 (0.74%) |
The performance of the algorithms on these three test problems is given in Tables 2, 3 and 4. We note that fista does not perform inner iterations since it is applied directly to the original problem (1.1), and that pnopt-fista does not compute Hessian-vector products because the matrix H_k in the model (2.1) is defined by quasi-Newton updating. Each inner iteration of sqa-obm-qn performs a Hessian-vector multiplication to compute the subproblem objective, and a multiplication of the inverse Hessian times a vector to compute the unconstrained minimizer on the active orthant face; we report these as two Hessian-vector products in Tables 2, 3 and 4.
Table 2. ESTROGEN; , optimality tolerance =
solver | FISTA | SQA | PNOPT | SQA | SQA |
---|---|---|---|---|---|
inner solver | - | fista | fista | obm-qn | obm-cg
outer iterations | 808 | 9 | 43 | 44 | 8 |
inner iterations | - | 183 | 2134∗ | 64 | 93 |
function/gradient evals | 1751 | 10 | 44 | 45 | 10 |
Hessian-vect mults | - | 417 | - | 2 | 213 |
time (s) | 208.87 | 51.54 | 355.15 | 38.74 | 26.95 |
∗ For PNOPT we report the number of prox. evaluations |
Table 3. LEUKEMIA; , optimality tolerance =
solver | FISTA | SQA | PNOPT∗ | SQA∗ | SQA |
---|---|---|---|---|---|
inner solver | - | fista | fista | obm-qn | obm-cg
outer iterations | 838 | 8 | - | 101 | 8
inner iterations | - | 187 | - | 196 | 101 |
function/gradient evals | 1803 | 9 | - | 103 | 9 |
Hessian-vect mults | - | 420 | - | 4 | 239 |
time (s) | 1048.77 | 239.23 | - | 171.41 | 140.33 |
∗ out of memory for memory size = 50, we decrease memory size to 5 | |||||
∗∗ exit with message: “Relative change in function value below ftol” | |||||
∗∗ optimality error is below after iteration 73, it is at termination |
Table 4. OONR; , optimality tolerance =
solver | FISTA | SQA | PNOPT | SQA | SQA
---|---|---|---|---|---
inner solver | - | fista | fista | obm-qn | obm-cg
outer iterations | 212 | 10 | 39 | 37 | 7 |
inner iterations | - | 80 | 761 | 37 | 60
function/gradient evals | 461 | 11 | 41 | 44 | 9 |
Hessian-vect mults | - | 193 | - | 2 | 125
time (s) | 23.53 | 10.14 | 70.37 | 12.73 | 7.09 |
We now comment on the results given in Tables 2-4. For the inverse covariance selection problem (5.1), Hessian-vector products are not as expensive as for other problems (cf. Tables 7 and 8); in fact, these products are not much costlier than computations with the limited memory BFGS matrix. This fact, combined with the effectiveness of the obm method, makes sqa-obm-cg the most efficient of the methods tested. obm is a good subproblem solver due to its ability to estimate the set of zero variables quickly, so that the subspace step is computed in a small reduced space (the density of the solution is below 2.3% for the three test problems; see Table 1). In addition, the obm-cg method can decrease the model q_k drastically in a single iteration, often yielding a high quality sqa step and thus a low number of outer iterations.
We note that the quasi-Newton algorithms sqa-obm-qn and pnopt are different methods because of the subproblem solvers they employ. sqa-obm-qn uses the two-phase obm method in which the quasi-Newton step is computed in a subspace, whereas pnopt applies the fista iteration to subproblem (2.1), where H_k is a quasi-Newton matrix. Although the number of outer iterations of both methods is comparable for problems Estrogen and OONR, there is a large difference in the number of inner iterations due to the power of the obm approach.
Note that the number of inner fista iterations in sqa-fista is always smaller than the number of iterations of fista. We repeated the experiment with problem OONR using looser optimality tolerances (TOL); the total number of fista iterations is given in Table 5.
Table 5. Effect of convergence tolerance TOL; OONR
TOL | |||
---|---|---|---|
FISTA (# of outer iterations) | 30 | 74 | 136 |
SQA-FISTA (# of inner iterations) | 47 | 56 | 69 |
These results are typical for the covariance selection problems, where sqa-fista is clearly more efficient than fista; we will see that this is not the case for the problems considered next.
5.2 Logistic Regression Problems
In our second set of experiments the function f in (1.1) is a logistic loss. Given data pairs (z_i, y_i), i = 1, …, N, with z_i ∈ ℝⁿ and y_i ∈ {−1, 1}, the optimization problem is given by

min_w  (1/N) Σ_{i=1}^{N} log(1 + exp(−y_i wᵀ z_i)) + μ ‖w‖₁.
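For this objective the Hessian of the smooth part is (1/N) Zᵀ D Z with a diagonal D, so a Hessian-vector product costs two passes over the data. The following sketch is our own illustration, written for the averaged formulation above; the names are illustrative and not taken from the paper.

```python
import numpy as np

def logistic_smooth_oracle(w, Z, y):
    """Averaged logistic loss: rows of Z are the data vectors, y in {-1,+1}."""
    N = Z.shape[0]
    t = y * (Z @ w)                          # y_i * w^T z_i
    f = np.mean(np.logaddexp(0.0, -t))       # (1/N) sum log(1 + exp(-t_i))
    p = np.exp(-np.logaddexp(0.0, t))        # 1 / (1 + exp(t_i)), computed stably
    grad = -(Z.T @ (y * p)) / N
    D = p * (1.0 - p)                        # diagonal weights of the Hessian

    def hess_vec(v):
        # (1/N) Z^T diag(D) Z v  (the labels drop out since y_i^2 = 1)
        return (Z.T @ (D * (Z @ v))) / N

    return f, grad, hess_vec
```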
We employed the data given in Table 6, which was downloaded from the LIBSVM repository. The values of the regularization parameter μ were taken from Lee et al. [13].
Table 6. Test problems for logistic regression tests
Data set | number of data points | number of features | μ | nnz(x*)
---|---|---|---|---
Gisette (scaled) | 6,000 | 5,000 | 6.67e-04 | 482 (9.64%)
RCV1 (binary) | 20,242 | 47,236 | 3.31e-04 | 140 (0.30%)
Table 7. GISETTE; , optimality tolerance =
solver | FISTA | SQA | PNOPT | SQA | SQA |
---|---|---|---|---|---|
inner solver | - | fista | fista | obm-qn | obm-cg
outer iterations | 1023 | 11 | 237 | 253 | 10 |
inner iterations | - | 1744 | 25260∗ | 1075 | 770
function/gradient evals | 2200 | 12 | 240 | 254 | 11 |
Hessian-vector mults | - | 3761 | - | 2 | 3321
time | 185.55 | 311.28 | 108.84 | 38.47 | 273.10 |
∗ For PNOPT we report the number of prox. evaluations. |
Table 8. RCV1; , optimality tolerance =
solver | FISTA | SQA | PNOPT | SQA | SQA
---|---|---|---|---|---
inner solver | - | fista | fista | obm-qn | obm-cg
outer iterations | 90 | 7 | 19 | 18 | 6 |
inner iterations | - | 366 | 1148 | 27 | 54
function/gradient evals | 184 | 8 | 20 | 19 | 7 |
Hessian-vector mults | - | 738 | - | 2 | 120
time (s) | 1.95 | 7.52 | 11.23 | 0.92 | 1.33 |
For the logistic regression problems, Hessian-vector products are expensive, particularly for gisette, where the data set is dense. As a result, the obm variant that employs quasi-Newton approximations, namely sqa-obm-qn, performs best (even though sqa-obm-cg requires a smaller number of outer iterations). Note that sqa-fista is not efficient; in fact, it requires a much larger number of inner iterations than the total number of iterations in fista. In Table 9 we observe the effect of the optimality tolerance on these two methods, using problem gisette.
Table 9. Effect of convergence tolerance TOL ; Gisette
TOL | |||
---|---|---|---|
FISTA (# of outer iterations) | 605 | 1023 | 2555 |
SQA-FISTA (# of inner iterations) | 1249 | 1744 | 2002 |
We observe from Table 9 that fista requires a smaller number of iterations; it is only for very high accuracy that sqa-fista becomes competitive. This is in stark contrast with Table 5.
In summary, for the logistic regression problems the advantage of the sqa method is less pronounced than for the inverse covariance estimation problems, and is achieved only through the appropriate choice of model Hessian (quasi-Newton) and the appropriate choice of inner solver (active set obm method).
5.3 Description of the orthant based method (OBM)
We conclude this section by describing the orthant-based method used in our experiments to solve the subproblem (2.1). We let j denote the iteration counter of the obm method, and let z_j denote its iterates.
Given an iterate z_j, the method defines an orthant face of ℝⁿ by
Ω_j = { x ∈ ℝⁿ : ζ_i x_i ≥ 0, i = 1, …, n },        (5.2)
with
ζ_i = sign(z_{j,i}) if z_{j,i} ≠ 0,   and   ζ_i = −sign(ν_i) if z_{j,i} = 0,        (5.3)
where ν is the minimum norm subgradient of q_k computed at z_j. Letting g = ∇f(x_k) + H_k(z_j − x_k) denote the gradient of the smooth part of q_k at z_j, the components of ν are given by
ν_i = g_i + μ sign(z_{j,i})  if z_{j,i} ≠ 0;   ν_i = g_i + μ  if z_{j,i} = 0 and g_i + μ < 0;   ν_i = g_i − μ  if z_{j,i} = 0 and g_i − μ > 0;   ν_i = 0  otherwise.        (5.4)
Defining the orthant face in this manner was proposed, among others, by Andrew and Gao [1]. In the relative interior of Ω_j, the model function q_k is differentiable. The active set in the orthant-based method, consisting of the indices i for which ζ_i = 0, determines the variables that are kept at zero, while the rest of the variables are chosen to minimize a (smooth) quadratic model. Specifically, the search direction of the algorithm is given by the solution d_j of
min_{d ∈ ℝⁿ}  νᵀ d + ½ dᵀ H_k d    s.t.  d_i = 0 for all i with ζ_i = 0.        (5.5)
Note that .
In the obm-cg variant, we set H_k = ∇²f(x_k) and perform an approximate minimization of this problem using the projected conjugate gradient iteration [17]. In the obm-qn version, H_k is a limited memory BFGS matrix and d_j is the exact solution of (5.5). This requires computation of the inverse of the reduced Hessian Zᵀ H_k Z, where Z is a basis for the subspace defined by the constraints in (5.5). The matrix can be updated using the compact representations of quasi-Newton matrices [6]. After the direction d_j has been computed, the obm method performs a line search along d_j, projecting the iterates back onto the orthant face Ω_j, until a sufficient reduction in the function q_k has been obtained. Although this algorithm performed reliably in our tests, its convergence has not been proved (to the best of our knowledge) because the orthant face identification procedure (5.2)-(5.5) can lead to arbitrarily small steps.
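The basic ingredients of the orthant-based method described above — the minimum norm subgradient, the orthant-face signs, and the projection used in the line search — are simple to express; the following sketch is our own illustration, and its names and conventions are not those of the paper.

```python
import numpy as np

def min_norm_subgradient(g, z, mu):
    """Minimum norm subgradient (5.4); g is the gradient of the smooth part of q_k at z."""
    nu = np.where(z != 0, g + mu * np.sign(z), 0.0)
    at_zero = (z == 0)
    nu = np.where(at_zero & (g + mu < 0), g + mu, nu)
    nu = np.where(at_zero & (g - mu > 0), g - mu, nu)
    return nu

def orthant_signs(z, nu):
    """Sign vector defining the orthant face: sign of z_i, or of -nu_i when z_i = 0."""
    zeta = np.sign(z).astype(float)
    zeta[z == 0] = -np.sign(nu[z == 0])
    return zeta

def project_onto_orthant(x, zeta):
    """Projection used in the line search: zero out components that leave the orthant
    face (components with zeta_i = 0 are kept at zero)."""
    return np.where(np.sign(x) * zeta > 0, x, 0.0)
```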
Our obm-qn algorithm differs from the owl method in two respects: it does not realign the direction d_j so that the signs of its components match those of the steepest descent direction −ν, and it performs the minimization of the model exactly, while the owl method computes only an approximate solution, defined by computing the reduced inverse Hessian Zᵀ H_k⁻¹ Z instead of the inverse of the reduced Hessian (Zᵀ H_k Z)⁻¹.
6 Final Remarks
One of the key ingredients in making the successive quadratic approximation (or proximal Newton) method practical for problem (1.1) is the ability to terminate the inner iteration as soon as a step of sufficiently good quality is computed. In this paper, we have proposed such an inexactness criterion; it employs an optimality measure that is tailored to the structure of the problem. We have shown that the resulting algorithm is globally convergent, that its rate of convergence can be controlled through an inexactness parameter, and that the inexact method will naturally accept unit step lengths in a neighborhood of the solution. We have also argued that our inexactness criterion is preferable to the one proposed by Lee et al. [13].
The method presented in this paper can use any algorithm for the inner minimization of the subproblem (1.2). In particular, all the results are applicable to the case when this inner minimization is performed using a coordinate descent algorithm [12, 25]. In our numerical tests we employed fista and an orthant-based method as the inner solvers, and found the latter method to be particularly effective. The efficacy of the successive quadratic approximation approach depends on the choice of the matrix H_k in (1.2), which is problem dependent: when Hessian-vector products are expensive to compute, a quasi-Newton approximation is most efficient; otherwise, defining H_k as the exact Hessian and implementing a Newton-CG iteration is likely to give the best results.
Acknowledgement. The authors thank Jong-Shi Pang for his insights and advice throughout the years. The theory presented by Facchinei and Pang [11] in the context of variational inequalities was used in our analysis, showing the power and generality of that masterful book.
References
- [1] G. Andrew and J. Gao. Scalable training of -regularized log-linear models. In Proceedings of the 24th international conference on Machine Learning, pages 33–40. ACM, 2007.
- [2] O. Banerjee, L. El Ghaoui, and A. d’Aspremont. Model selection through sparse maximum likelihood estimation for multivariate gaussian or binary data. The Journal of Machine Learning Research, 9:485–516, 2008.
- [3] O. Banerjee, L. El Ghaoui, A. d’Aspremont, and G. Natsoulis. Convex optimization techniques for fitting sparse Gaussian graphical models. In Proceedings of the 23rd international conference on Machine learning, pages 89–96. ACM, 2006.
- [4] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
- [5] Stephen R Becker, Emmanuel J Candès, and Michael C Grant. Templates for convex cone problems with applications to sparse signal recovery. Mathematical Programming Computation, 3(3):165–218, 2011.
- [6] R. H. Byrd, J. Nocedal, and R. Schnabel. Representations of quasi-Newton matrices and their use in limited memory methods. Mathematical Programming, 63(4):129–156, 1994.
- [7] Richard H Byrd, Gillian M Chin, Jorge Nocedal, and Yuchen Wu. Sample size selection in optimization methods for machine learning. Mathematical programming, 134(1):127–155, 2012.
- [8] R. H. Byrd, G. M. Chin, J. Nocedal, and F. Oztoprak. A family of second-order methods for convex L1-regularized optimization. Technical report, Optimization Center Report 2012/2, Northwestern University, 2012.
- [9] R. S. Dembo, S. C. Eisenstat, and T. Steihaug. Inexact-Newton methods. SIAM Journal on Numerical Analysis, 19(2):400–408, 1982.
- [10] A. L. Dontchev and R. T. Rockafellar. Convergence of inexact Newton methods for generalized equations. Mathematical Programming, pages 1–23, 2013.
- [11] F. Facchinei and J.S. Pang. Finite-dimensional variational inequalities and complementarity problems, volume 1. Springer Verlag, 2003.
- [12] C. J. Hsieh, M. A. Sustik, P. Ravikumar, and I. S. Dhillon. Sparse inverse covariance matrix estimation using quadratic approximation. Advances in Neural Information Processing Systems (NIPS), 24, 2011.
- [13] Jason D. Lee, Yuekai Sun, and Michael A. Saunders. Proximal Newton-type methods for minimizing composite functions. arXiv preprint arXiv:1206.1623, 2012.
- [14] L. Li and K. C. Toh. An inexact interior point method for L1-regularized sparse covariance selection. Mathematical Programming Computation, 2(3):291–315, 2010.
- [15] M. W. Schmidt, N. Le Roux, and F. Bach. Convergence rates of inexact proximal-gradient methods for convex optimization. NIPS, pages 1458–1466, 2011.
- [16] A. Milzarek and M. Ulbrich. A semismooth Newton method with multi-dimensional filter globalization for l1-optimization. Technical report, Technical University, Munich, 2010.
- [17] J. Nocedal and S. J. Wright. Numerical Optimization. Springer Series in Operations Research. Springer, 1999.
- [18] Peder Olsen, Figen Oztoprak, Jorge Nocedal, and Steven Rennie. Newton-like methods for sparse inverse covariance estimation. In Advances in Neural Information Processing Systems 25, pages 764–772, 2012.
- [19] J. M. Ortega and W. C. Rheinboldt. Iterative Solution of Nonlinear Equations in Several Variables. Academic Press, London, 1970.
- [20] M. Patriksson. Nonlinear Programming and Variational Inequality Problems, a Unified Approach. Kluwer Academic Publishers, Dordrecht, The Netherlands, 1998.
- [21] Michael Patriksson. Cost approximation: a unified framework of descent algorithms for nonlinear programs. SIAM Journal on Optimization, 8(2):561–582, 1998.
- [22] J. D. Picka. Gaussian Markov random fields: theory and applications. Technometrics, 48(1):146–147, 2006.
- [23] S. Salzo and S. Villa. Inexact and accelerated proximal point algorithms. Journal of Convex Analysis, 19(4), 2012.
- [24] S. Sra, S. Nowozin, and S. J. Wright. Optimization for Machine Learning. MIT Press, 2011.
- [25] X. Tan and K. Scheinberg. Complexity of inexact proximal newton method. Technical report, Dept of ISE, Lehigh University, 2013.
- [26] Rachael Tappenden, Peter Richtárik, and Jacek Gondzio. Inexact coordinate descent: complexity and preconditioning. arXiv preprint arXiv:1304.5530, 2013.
- [27] S.J. Wright, R.D. Nowak, and M.A.T. Figueiredo. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57(7):2479–2493, 2009.