
Nonlinear conjugate gradient for smooth convex functions

Supported in part by a grant from the Natural Sciences and Engineering Research Council (NSERC) of Canada.

Sahar Karimi Department of Combinatorics & Optimization, University of Waterloo, 200 University Ave. W., Waterloo, ON, N2L 3G1, Canada, sahar.karimi@gmail.com.    Stephen Vavasis Department of Combinatorics & Optimization, University of Waterloo, 200 University Ave. W., Waterloo, ON, N2L 3G1, Canada, vavasis@uwaterloo.ca.
Abstract

The method of nonlinear conjugate gradients (NCG) is widely used in practice for unconstrained optimization, but it satisfies weak complexity bounds at best when applied to smooth convex functions. In contrast, Nesterov’s accelerated gradient (AG) method is optimal up to constant factors for this class.

However, when specialized to quadratic functions, conjugate gradient is optimal in a strong sense among function-gradient methods. Therefore, there is seemingly a gap in the menu of available algorithms: NCG, the optimal algorithm for quadratic functions that also exhibits good practical performance for general functions, has poor complexity bounds compared to AG.

We propose an NCG method called C+AG (“conjugate plus accelerated gradient”) to close this gap, that is, it is optimal for quadratic functions and still satisfies the best possible complexity bound for more general smooth convex functions. It takes conjugate gradient steps until insufficient progress is made, at which time it switches to accelerated gradient steps, and later retries conjugate gradient. The proposed method has the following theoretical properties: (i) It is identical to linear conjugate gradient (and hence terminates finitely) if the objective function is quadratic; (ii) Its running-time bound is $O(\epsilon^{-1/2})$ gradient evaluations for an $L$-smooth convex function, where $\epsilon$ is the desired residual reduction; (iii) Its running-time bound is $O(\sqrt{L/\ell}\ln(1/\epsilon))$ if the function is both $L$-smooth and $\ell$-strongly convex. We also conjecture and outline a proof that a variant of the method has the property: (iv) It is $n$-step quadratically convergent for a function whose second derivative is smooth and invertible at the optimizer. Note that the bounds in (ii) and (iii) match AG and are the best possible, i.e., they match lower bounds up to constant factors for the classes of functions under consideration. On the other hand, (i) and (iv) match NCG.

In computational tests, the function-gradient evaluation count for the C+AG method typically behaves as whichever is better of AG or classical NCG. In some test cases it outperforms both.

1 First-order methods for smooth convex functions

The problem under consideration is minimizing an unconstrained $L$-smooth convex function $f:\mathbb{R}^{n}\rightarrow\mathbb{R}$. Recall that a convex function is $L$-smooth if it is differentiable and for all ${\bm{x}},\bm{y}\in\mathbb{R}^{n}$,

f(\bm{y})-f({\bm{x}})-\nabla f({\bm{x}})^{T}(\bm{y}-{\bm{x}})\leq L\|{\bm{x}}-\bm{y}\|^{2}/2. \qquad (1)

Note that convexity implies that the quantity on the left-hand side of this inequality is nonnegative. We will also sometimes consider $\ell$-strongly convex functions, which are defined to be those functions satisfying

f(\bm{y})-f({\bm{x}})-\nabla f({\bm{x}})^{T}(\bm{y}-{\bm{x}})\geq\ell\|{\bm{x}}-\bm{y}\|^{2}/2 \qquad (2)

for all ${\bm{x}},\bm{y}\in\mathbb{R}^{n}$ for some modulus $\ell\geq 0$. (This definition presumes differentiability. The class of strongly convex functions includes nondifferentiable functions as well, but those functions are not smooth and are therefore not relevant to this paper.) Note that $\ell=0$ means simply that $f$ is differentiable and convex.

Nonlinear conjugate gradient (NCG) [21] is perhaps the most widely used first-order method for unconstrained optimization. It is an extension of linear conjugate gradient (LCG), introduced by Hestenes and Stiefel [13] to minimize convex quadratic functions. The smoothness modulus $L$ of a quadratic function $f({\bm{x}})={\bm{x}}^{T}A{\bm{x}}/2-\bm{b}^{T}{\bm{x}}$, where $A\in\mathbb{R}^{n\times n}$ is positive semidefinite, is the maximum eigenvalue of $A$. LCG is optimal for quadratic functions in the strong sense that $f({\bm{x}}_{k})$, where ${\bm{x}}_{k}$ is the $k$th iterate, is minimized over all choices of ${\bm{x}}_{k}$ lying in the $k$th Krylov space, leading to the estimate of $k=O(\ln(1/\epsilon)\sqrt{L/\ell})$ iterations to reduce the residual of $f$ by a factor $\epsilon$.

Stated precisely, Daniel [9, Theorem 1.2.2] proved the following bound for the $k$th iteration of LCG:

f({\bm{x}}_{k})-f^{*}\leq 4\left(\frac{1-\sqrt{\ell/L}}{1+\sqrt{\ell/L}}\right)^{2k}(f({\bm{x}}_{0})-f^{*}). \qquad (3)

Here and in the remainder of the paper, $f^{*}$ stands for $\min_{{\bm{x}}\in\mathbb{R}^{n}}f({\bm{x}})$ and is assumed to be attained. Therefore, if we define the residual reduction

\epsilon_{1}:=\frac{f({\bm{x}}_{k})-f^{*}}{f({\bm{x}}_{0})-f^{*}},

then a rearrangement and application of simple bounds in (3) shows that the reduction $\epsilon_{1}$ is achieved after $k=\ln(4/\epsilon_{1})\cdot\sqrt{L/\ell}/2$ iterations.
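One way to carry out the rearrangement (a short derivation included here for completeness) uses the elementary bound $(1-x)/(1+x)\leq 1-x\leq e^{-x}$ for $x=\sqrt{\ell/L}\in(0,1]$, so that (3) gives

\epsilon_{1}\leq 4\left(\frac{1-\sqrt{\ell/L}}{1+\sqrt{\ell/L}}\right)^{2k}\leq 4e^{-2k\sqrt{\ell/L}},

which is at most a prescribed $\epsilon_{1}$ once $k\geq\frac{1}{2}\ln(4/\epsilon_{1})\sqrt{L/\ell}$.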

However, if the class of problems is enlarged to include all smooth, strongly convex functions, then NCG fails to satisfy this complexity bound and could perform even worse than steepest descent as shown by Nemirovsky and Yudin [18]. Nonetheless, in practice, nonlinear conjugate gradient is difficult to beat even by algorithms with better theoretical guarantees, as observed in our experiments and by others, e.g., Carmon et al. [4].

Several first-order methods have been proposed for smooth strongly convex functions that achieve the running time $O(\ln(1/\epsilon)\sqrt{L/\ell})$, including Nemirovsky and Yudin’s [18] method, which requires solution of a two-dimensional subproblem on each iteration, Nesterov’s accelerated gradient (AG) [19], and Geometric Descent (GD) [2]. In the case of GD, the bound is on iterations rather than gradient evaluations since each iteration requires additional gradient evaluations for the line search. It is known that $\Omega(\ln(1/\epsilon)\sqrt{L/\ell})$ gradient evaluations are required to reduce the objective function residual by a factor of $\epsilon$; see, e.g., the book of Nesterov [20] for a discussion of the lower bound result. Since the algorithms mentioned in this paragraph match the lower bound, these algorithms are optimal up to a constant factor. In the case of smooth convex (not strongly convex) functions, AG requires $O(\epsilon^{-1/2})$ iterations.

The precise bound for accelerated gradient for strongly convex functions is stated below in (14), and this bound can be rearranged and estimated as follows. Define the residual reduction

\epsilon_{2}:=\frac{f({\bm{x}}_{k})-f^{*}}{L\|{\bm{x}}_{0}-{\bm{x}}^{*}\|^{2}}.

This reduction can be attained after no more than $k=\ln(1/\epsilon_{2})\cdot\sqrt{L/\ell}$ iterations.

A technical issue in comparing the bounds is the different definitions of ϵ1\epsilon_{1} and ϵ2\epsilon_{2}. A standard result of extremal eigenvalues states that

\ell\|{\bm{x}}_{0}-{\bm{x}}^{*}\|^{2}/2\leq f({\bm{x}}_{0})-f^{*}\leq L\|{\bm{x}}_{0}-{\bm{x}}^{*}\|^{2}/2.

Therefore, it follows that $\epsilon_{2}\leq 2\epsilon_{1}\leq(L/\ell)\epsilon_{2}$. Thus, $\epsilon_{1}$ tightly controls $\epsilon_{2}$, whereas $\epsilon_{2}$ only loosely controls $\epsilon_{1}$, and therefore the LCG bound is slightly better than the corresponding AG bound in this technical sense.

In practice, conjugate gradient for quadratic functions is typically superior to AG applied to quadratic functions. The reason appears to be not so much the technical point mentioned in the previous paragraph as the fact that LCG is optimal in a stronger sense than AG: the iterate ${\bm{x}}_{k}$ is the true minimizer of $f({\bm{x}})$ over the $k$th Krylov space, which is defined to be $\mathop{\rm span}\{\bm{b},A\bm{b},\ldots,A^{k-1}\bm{b}\}$ assuming ${\bm{x}}_{0}=\bm{0}$. For quadratic problems in which the eigenvalues lie in a small number of clusters, one expects convergence much faster than $O(\ln(1/\epsilon)\sqrt{L/\ell})$. But even in the case of spread-out eigenvalues, conjugate gradient often bests accelerated gradient. Table 1 shows the results for these algorithms for three $1000\times 1000$ matrices:

A_{1} =\mathop{\rm Diag}(\underbrace{1,\ldots,1}_{500},\underbrace{1000,\ldots,1000}_{500}),
A_{2} =\mathop{\rm Diag}(\underbrace{1,\ldots,1}_{250},\underbrace{500,\ldots,500}_{250},\underbrace{1000,\ldots,1000}_{500}),
A_{3} =\mathop{\rm Diag}(1,4,9,\ldots,1000^{2}),

two of which have clustered eigenvalues, and the third has spread-out eigenvalues.

Table 1: Minimization of $f({\bm{x}})={\bm{x}}^{T}A{\bm{x}}/2-\bm{b}^{T}{\bm{x}}$ for diagonal matrices $A_{1},A_{2},A_{3}$ described in the text with $\bm{b}^{T}=[\sin 1,\sin 2,\ldots,\sin 1000]$. Tolerance is $\|\nabla f({\bm{x}})\|\leq 10^{-8}$.

                                          LCG    C+AG    NCG        AG
$A_1$: # iterations                         2       3      2      9167
$A_1$: # function-gradient evaluations      2      27      5     18357
$A_2$: # iterations                         3       4      3     10267
$A_2$: # function-gradient evaluations      3      30      7     20557
$A_3$: # iterations                      1509    1512   1515   $>10^6$
$A_3$: # function-gradient evaluations   1509    3065   3036   $>10^6$

Note that the three conjugate gradient methods require roughly the same number of iterations, all much less than AG. The function-gradient evaluations are: 1 per LCG iteration (i.e., a single matrix-vector multiplication), 2 per C+AG iteration, approximately 2 per NCG iteration, and 1 per AG iteration. The requirement for more than one function evaluation per NCG iteration stems from the need to select a step-size. We return to this point below. Table 1 shows results for the adaptive versions of C+AG and AG for the case that $L$ is not known a priori, discussed in Section 6. Thus, further function-gradient evaluations are needed to estimate $L$, and the number of function-gradient evaluations is increased accordingly. The reader will note from Table 1 that C+AG required only 2 iterations for $A_2$ but 27 function evaluations due to the overhead of initially estimating $L$, and a similar pattern is noted for $A_3$. The adaptive estimation doubles the function-gradient evaluations for AG from 1 to 2 per iteration.

In the previous paragraph and for the remainder of this paper, we speak of a “function-gradient evaluation”, which means, for a given 𝒙{\bm{x}}, the simultaneous evaluation of f(𝒙)f({\bm{x}}) and f(𝒙)\nabla f({\bm{x}}). We regard function-gradient evaluation as the atom of work. This metric may be inaccurate in the setting that f(𝒙)f({\bm{x}}) can be evaluated much more efficiently than f(𝒙)\nabla f({\bm{x}}) for an algorithm that separately uses ff and f\nabla f. However, in most applications evaluating f,ff,\nabla f together is not much more expensive than evaluating either separately. Theoretically, gradient evaluation is never more than a constant factor larger than function evaluation due to the existence of reverse-mode automatic differentiation; see, e.g., [21].
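As an illustration (our own Matlab sketch, not taken from the paper's code; the function name fgrad_quadratic is ours), a function-gradient oracle for the diagonal quadratic test problems of Table 1 returns both values from a single pass, so one call counts as one function-gradient evaluation:

function [fx, gx] = fgrad_quadratic(x, d, b)
% f(x) = x'*Diag(d)*x/2 - b'*x for a diagonal test matrix Diag(d);
% one call = one function-gradient evaluation
    Ax = d.*x;                 % the single matrix-vector product
    fx = 0.5*(x'*Ax) - b'*x;
    gx = Ax - b;
end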

It is also known that NCG is fast for functions that are nearly quadratic. In particular, a classic result of Cohen [5] shows that NCG exhibits nn-step quadratic convergence near the minimizer provided that the objective function is nearly quadratic and strongly convex in this neighborhood.

Thus, there is seemingly a gap in the menu of available algorithms because NCG, though optimal for quadratic functions and fast for nearly quadratic functions, is not able to achieve reasonable complexity bounds for the larger class of smooth, strongly convex functions. On the other hand, the methods with the good complexity bounds for this class are all suboptimal for quadratic functions.

We propose C+AG to close this gap. The method is based on the following ideas:

1. A line-search for NCG described in Section 2 is used that is exact for quadratic functions but could be poor for general functions. Failure of the line-search is caught by the progress measure described in the next item.

2. A progress measure described in Section 3 for nonlinear conjugate gradient, based on Nesterov’s estimate sequence, is checked. The progress measure is inexpensive to evaluate on every iteration.

3. Restarting in the traditional sense of NCG (i.e., an occasional steepest descent step) is used, as described in Section 4.

4. On iterations when insufficient progress is made, C+AG switches to AG to guarantee reduction in the progress measure, as explained in Section 5.

5. C+AG switches from AG back to NCG after a test measuring nearness to a quadratic is satisfied. When CG resumes, the progress measure may be modified to a more elaborate procedure discussed in Appendix A, although this elaborate progress measure is disabled by default.

When applied to a quadratic function, the progress measure is always satisfied (this is proved in Section 3), and hence all steps will be NCG (and therefore LCG) steps. The fact that C+AG satisfies the $O(\ln(1/\epsilon)\sqrt{L/\ell})$ running-time bound of AG for general convex functions follows because of the progress measure. The conjectured final theoretical property of the method, namely, $n$-step quadratic convergence, is described in Appendix B. Computational testing in Section 8 indicates that the running time of the method measured in function-gradient evaluations is roughly equal to whichever of AG or NCG is better for the problem at hand, and in some cases it outperforms both of them.

Both AG and C+AG use prior knowledge of $L$ and $\ell$. However, $L$ can be estimated by both methods, and $\ell$ may be taken to be 0 if no further information is available. Thus, both algorithms are usable, albeit slower, in the absence of prior knowledge of these parameters. We return to this topic in Section 6.

We conclude this introductory section with a few remarks about our previous unpublished manuscript [16]. That work explored connections between AG, GD, and LCG, and proposed a hybrid of NCG and GD. However, that work did not address the number of function-gradient evaluations because the evaluation count for the line search was not analyzed. Closer to what is proposed here is a hybrid algorithm from the first author’s PhD thesis [15]. Compared to the PhD thesis, the algorithm proposed herein has several improvements, including the line search and the method for switching from AG to CG, as well as additional analysis.

2 NCG and line search

The full C+AG algorithm is presented in Section 7. In this section, we present the NCG portion of C+AG, which is identical to most other NCG routines and is as follows

${\bm{x}}_{0}:=\bm{0}$;
$\bm{p}_{0}:=-\nabla f({\bm{x}}_{0})$;
$k:=0$;
while $\|\nabla f({\bm{x}}_{k})\|>\mathtt{tol}$
    Select $\alpha_{k}$;
    ${\bm{x}}_{k+1}:={\bm{x}}_{k}+\alpha_{k}\bm{p}_{k}$;
    Confirm that ${\bm{x}}_{k+1}$ satisfies the progress measure.
    Select $\beta_{k+1}$;
    $\bm{p}_{k+1}:=-\nabla f({\bm{x}}_{k+1})+\beta_{k+1}\bm{p}_{k}$;
    $k:=k+1$;
end while

For $\beta_{k+1}$ we use the Hager-Zhang formula [12], which is as follows (writing $\bm{g}_{k}:=\nabla f({\bm{x}}_{k})$ and $\bm{g}_{k+1}:=\nabla f({\bm{x}}_{k+1})$):

\hat{\bm{y}} :=\bm{g}_{k+1}-\bm{g}_{k};
\beta^{1} :=\left(\hat{\bm{y}}-\bm{p}_{k}\cdot\frac{2\|\hat{\bm{y}}\|^{2}}{\hat{\bm{y}}^{T}\bm{p}_{k}}\right)^{T}\frac{\bm{g}_{k+1}}{\hat{\bm{y}}^{T}\bm{p}_{k}},
\beta^{2} :=\frac{-1}{\|\bm{p}_{k}\|\cdot\min\left(.01\|\bm{g}_{0}\|,\|\bm{g}_{k+1}\|\right)},
\beta_{k+1} :=\max(\beta^{1},\beta^{2}).
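In Matlab-like notation (a sketch in our own variable names, with gk, gk1, pk, g0 standing for $\bm{g}_{k}$, $\bm{g}_{k+1}$, $\bm{p}_{k}$, $\bm{g}_{0}$), the update reads:

yhat   = gk1 - gk;                                    % gradient difference
denom  = yhat'*pk;                                    % yhat^T p_k
beta1  = (yhat - pk*(2*(yhat'*yhat)/denom))'*(gk1/denom);
beta2  = -1/(norm(pk)*min(0.01*norm(g0), norm(gk1))); % truncation safeguard
betak1 = max(beta1, beta2);
pk1    = -gk1 + betak1*pk;                            % next search direction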

For αk\alpha_{k}, we use the following formulas.

\tilde{{\bm{x}}} :={\bm{x}}_{k}+\bm{p}_{k}/L, \qquad (4)
\bm{s} :=L(\nabla f(\tilde{{\bm{x}}})-\nabla f({\bm{x}}_{k})), \qquad (5)
\alpha_{k} :=\frac{-\nabla f({\bm{x}}_{k})^{T}\bm{p}_{k}}{\bm{p}_{k}^{T}\bm{s}}. \qquad (6)

The rationale for this choice of αk\alpha_{k} is as follows. In the case that f(𝒙)=𝒙TA𝒙/2𝒃T𝒙f({\bm{x}})={\bm{x}}^{T}A{\bm{x}}/2-\bm{b}^{T}{\bm{x}} (quadratic), f(𝒙k)=A𝒙k𝒃\nabla f({\bm{x}}_{k})=A{\bm{x}}_{k}-\bm{b} and f(𝒙~)=A𝒙~𝒃=A𝒙k𝒃+A𝒑k/L\nabla f(\tilde{\bm{x}})=A\tilde{\bm{x}}-\bm{b}=A{\bm{x}}_{k}-\bm{b}+A\bm{p}_{k}/L, and therefore 𝒔=A𝒑k\bm{s}=A\bm{p}_{k} and 𝒑kT𝒔=𝒑kTA𝒑k\bm{p}_{k}^{T}\bm{s}=\bm{p}_{k}^{T}A\bm{p}_{k}. In this case, αk\alpha_{k} is the exact minimizer of the univariate quadratic function αf(𝒙k+α𝒑k)\alpha\mapsto f({\bm{x}}_{k}+\alpha\bm{p}_{k}).
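A minimal Matlab sketch of this step-size rule (our own notation; gradf is an assumed handle for $\nabla f$, and xk, pk hold ${\bm{x}}_{k}$, $\bm{p}_{k}$):

gk     = gradf(xk);
xtilde = xk + pk/L;                       % trial point (4)
s      = L*(gradf(xtilde) - gk);          % equals A*pk exactly when f is quadratic (5)
denom  = pk'*s;
if denom > 0
    alphak = -(gk'*pk)/denom;             % exact line minimizer for quadratic f (6)
else
    alphak = NaN;                         % step-size computation failed; restart (Section 4)
end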

In the case of a nonquadratic function, this choice of αk\alpha_{k} could be inaccurate, which could lead to a poor choice for 𝒙k+1{\bm{x}}_{k+1}. The progress measure in the next section will catch this failure and select a new search direction 𝒑k\bm{p}_{k} in this case.

As for the choice of βk\beta_{k}, we mention that Dai and Yuan [7] carried out an extensive convergence study of different formulas for βk\beta_{k} (not including the Hager-Zhang formula, but covering many others), and in the end, proposed a certain hybrid formula for βk\beta_{k}. Note that Dai and Kou [8] propose another family with the interesting property [8, Theorem 4.2] that convergence is guaranteed for smooth, strongly convex functions, although no rate is derived.

Our experiments (not reported here) suggest that the choice of βk\beta_{k} is not critical to the performance of C+AG. The Dai-Yuan paper points out that certain choices of βk\beta_{k} will lead to nonconvergence if the line-search fails to satisfy the strong Wolfe conditions. However, our line-search does not satisfy even the weak Wolfe conditions. In other words, the criteria used in the previous literature to distinguish among the choices of formulas for βk\beta_{k} may not be applicable to C+AG.

3 Progress measure

The progress measure is taken from Nesterov’s [20] analysis of the AG method. The basis of the progress measure is a sequence of strongly convex quadratic functions ϕ0(𝒙),ϕ1(𝒙),\phi_{0}({\bm{x}}),\phi_{1}({\bm{x}}),\ldots called an “estimate sequence” in [20].

The sequence is initialized with

ϕ0(𝒙):=f(𝒙0)+L2𝒙𝒗02,\phi_{0}({\bm{x}}):=f({\bm{x}}_{0})+\frac{L}{2}\|{\bm{x}}-\bm{v}_{0}\|^{2}, (7)

where 𝒗0:=𝒙0\bm{v}_{0}:={\bm{x}}_{0}. Then ϕk\phi_{k} for k1k\geq 1 is defined in terms of its predecessor via:

ϕk+1(𝒙)=(1θk)ϕk(𝒙)+θk[f(𝒙¯k)+f(𝒙¯k)T(𝒙𝒙¯k)+2𝒙𝒙¯k2].\phi_{k+1}({\bm{x}})=(1-\theta_{k})\phi_{k}({\bm{x}})+\theta_{k}\left[f(\bar{\bm{x}}_{k})+\nabla f(\bar{\bm{x}}_{k})^{T}({\bm{x}}-\bar{\bm{x}}_{k})+\frac{\ell}{2}\|{\bm{x}}-\bar{\bm{x}}_{k}\|^{2}\right]. (8)

Here, θk(0,1)\theta_{k}\in(0,1) and 𝒙¯kn\bar{\bm{x}}_{k}\in\mathbb{R}^{n} are detailed below. Note that the Hessian of ϕ0\phi_{0} and of the square-bracketed quadratic function in (8) are both multiples of II, and therefore inductively the Hessian of ϕk\phi_{k} is a multiple of II for all kk. In other words, for each kk there exists γk,𝒗k,ϕk\gamma_{k},\bm{v}_{k},\phi^{*}_{k} such that ϕk(𝒙)=ϕk+(γk/2)𝒙𝒗k2\phi_{k}({\bm{x}})=\phi_{k}^{*}+(\gamma_{k}/2)\|{\bm{x}}-\bm{v}_{k}\|^{2}, where γk>0\gamma_{k}>0. The formulas for γk+1\gamma_{k+1}, 𝒗k+1\bm{v}_{k+1}, ϕk+1\phi_{k+1}^{*} are straightforward to obtain from (8) given γk\gamma_{k}, 𝒗k\bm{v}_{k}, ϕk\phi_{k}^{*} as well as θk\theta_{k} and 𝒙¯k\bar{\bm{x}}_{k}. These formulas appear in [20, p. 73] and are repeated here for the sake of completeness:

γk+1\displaystyle\gamma_{k+1} :=(1θk)γk+θk,\displaystyle:=(1-\theta_{k})\gamma_{k}+\theta_{k}\ell, (9)
𝒗k+1\displaystyle\bm{v}_{k+1} :=1γk+1[(1θk)γk𝒗k+θk𝒙¯kθkf(𝒙¯k)],\displaystyle:=\frac{1}{\gamma_{k+1}}[(1-\theta_{k})\gamma_{k}\bm{v}_{k}+\theta_{k}\ell\bar{\bm{x}}_{k}-\theta_{k}\nabla f(\bar{\bm{x}}_{k})], (10)
ϕk+1\displaystyle\phi_{k+1}^{*} :=(1θk)ϕk+θkf(𝒙¯k)θk22γk+1f(𝒙¯k)2\displaystyle:=(1-\theta_{k})\phi_{k}^{*}+\theta_{k}f(\bar{\bm{x}}_{k})-\frac{\theta_{k}^{2}}{2\gamma_{k+1}}\|\nabla f(\bar{\bm{x}}_{k})\|^{2}
+θk(1θk)γkγk+1(𝒙¯k𝒗k2/2+f(𝒙¯k)T(𝒗k𝒙¯k)).\displaystyle\hphantom{:=}\quad\mbox{}+\frac{\theta_{k}(1-\theta_{k})\gamma_{k}}{\gamma_{k+1}}(\ell\|\bar{\bm{x}}_{k}-\bm{v}_{k}\|^{2}/2+\nabla f(\bar{\bm{x}}_{k})^{T}(\bm{v}_{k}-\bar{\bm{x}}_{k})). (11)

We call the sequence 𝒙¯0,𝒙¯1,\bar{\bm{x}}_{0},\bar{\bm{x}}_{1},\ldots the gradient sequence because the gradient is evaluated at these points.

To complete the formulation of ϕk()\phi_{k}(\cdot), we now specify θk\theta_{k} and 𝒙¯k\bar{\bm{x}}_{k}. Scalar θk\theta_{k} is selected as the positive root of the quadratic equation

Lθk2+(γk)θkγk=0.L\theta_{k}^{2}+(\gamma_{k}-\ell)\theta_{k}-\gamma_{k}=0. (12)

The fact that θk(0,1)\theta_{k}\in(0,1) is easily seen by observing the sign-change in this quadratic over [0,1][0,1]. A consequence of (12) and (9) is the identity

θk22γk+1=12L.\frac{\theta_{k}^{2}}{2\gamma_{k+1}}=\frac{1}{2L}. (13)
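For completeness, (13) can be verified directly: the quadratic equation (12) gives

L\theta_{k}^{2}=\gamma_{k}-(\gamma_{k}-\ell)\theta_{k}=(1-\theta_{k})\gamma_{k}+\theta_{k}\ell=\gamma_{k+1}

by (9), and hence $\theta_{k}^{2}/(2\gamma_{k+1})=\theta_{k}^{2}/(2L\theta_{k}^{2})=1/(2L)$. Explicitly, the positive root is $\theta_{k}=\bigl(-(\gamma_{k}-\ell)+\sqrt{(\gamma_{k}-\ell)^{2}+4L\gamma_{k}}\bigr)/(2L)$.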

Finally, C+AG has three cases for selecting gradient sequence 𝒙¯k\bar{\bm{x}}_{k}. The first case is 𝒙¯k:=𝒙k\bar{\bm{x}}_{k}:={\bm{x}}_{k}. The second and third cases are presented later.

The principal theorem assuring progress is Theorem 2.2.2 of Nesterov [20], which we restate here for the sake of completeness. The validity of this theorem does not depend on how the gradient sequence 𝒙¯k\bar{\bm{x}}_{k} is selected in the definition of ϕk()\phi_{k}(\cdot).

Theorem 1

(Nesterov) Suppose that for each k=0,1,2,3,k=0,1,2,3,\ldots, there exists a point 𝐱k{\bm{x}}_{k}, called the witness, satisfying f(𝐱k)min𝐱ϕk(𝐱)ϕkf({\bm{x}}_{k})\leq\min_{\bm{x}}\phi_{k}({\bm{x}})\equiv\phi_{k}^{*}. Then

f(𝒙k)fLmin((1/L)k,4(k+2)2)𝒙0𝒙2.f({\bm{x}}_{k})-f^{*}\leq L\min\left(\left(1-\sqrt{\ell/L}\right)^{k},\frac{4}{(k+2)^{2}}\right)\|{\bm{x}}_{0}-{\bm{x}}^{*}\|^{2}. (14)

The sufficient-progress test used for CG is the condition of the theorem, namely, f(𝒙k)ϕkf({\bm{x}}_{k})\leq\phi_{k}^{*}. That is, in the case of CG, the witness sequence is also the main sequence computed by CG. As long as this test holds, the algorithm continues to take CG steps, and the right-hand side of (14) assures us that the optimal complexity bound is achieved (up to constants) in the case of both smooth convex functions and smooth, strongly convex functions.

The main result of this section is the following theorem, which states that if the objective function is quadratic, then the progress measure is always satisfied by conjugate gradient.

Theorem 2

Assume 𝐯0=𝐱0\bm{v}_{0}={\bm{x}}_{0}. If f(𝐱)=12𝐱TA𝐱𝐛T𝐱f({\bm{x}})=\frac{1}{2}{\bm{x}}^{T}A{\bm{x}}-\bm{b}^{T}{\bm{x}}, where AA is positive definite, and if the gradient sequence 𝐱¯k\bar{\bm{x}}_{k} and witness sequence are the same as the CG sequence 𝐱¯k:=𝐱k\bar{\bm{x}}_{k}:={\bm{x}}_{k}, then f(𝐱k)ϕkf({\bm{x}}_{k})\leq\phi_{k}^{*} for every iteration kk.

Proof. The result clearly holds when $k=0$ by (7). Assuming $f({\bm{x}}_{k})\leq\phi_{k}^{*}$, we now show that $f({\bm{x}}_{k+1})\leq\phi_{k+1}^{*}$. For $k\geq 0$,

ϕk+1\displaystyle\phi_{k+1}^{*} =(1θk)ϕk+θkf(𝒙k)θk22γk+1f(𝒙k)2\displaystyle=(1-\theta_{k})\phi_{k}^{*}+\theta_{k}f({\bm{x}}_{k})-\frac{\theta_{k}^{2}}{2\gamma_{k+1}}\|\nabla f({\bm{x}}_{k})\|^{2}
+θk(1θk)γkγk+1(𝒙k𝒗k2+f(𝒙k)T(𝒗k𝒙k))\displaystyle\hphantom{}\quad\mbox{}+\frac{\theta_{k}(1-\theta_{k})\gamma_{k}}{\gamma_{k+1}}\left(\ell\|{\bm{x}}_{k}-\bm{v}_{k}\|^{2}+\nabla f({\bm{x}}_{k})^{T}(\bm{v}_{k}-{\bm{x}}_{k})\right) (15)
f(𝒙k)θk22γk+1f(𝒙k)2+θk(1θk)γkγk+1(f(𝒙k)T(𝒗k𝒙k))\displaystyle\geq f({\bm{x}}_{k})-\frac{\theta_{k}^{2}}{2\gamma_{k+1}}\|\nabla f({\bm{x}}_{k})\|^{2}+\frac{\theta_{k}(1-\theta_{k})\gamma_{k}}{\gamma_{k+1}}\left(\nabla f({\bm{x}}_{k})^{T}(\bm{v}_{k}-{\bm{x}}_{k})\right) (16)
=f(𝒙k)12Lf(𝒙k)2+θk(1θk)γkγk+1(f(𝒙k)T(𝒗k𝒙k)).\displaystyle=f({\bm{x}}_{k})-\frac{1}{2L}\|\nabla f({\bm{x}}_{k})\|^{2}+\frac{\theta_{k}(1-\theta_{k})\gamma_{k}}{\gamma_{k+1}}\left(\nabla f({\bm{x}}_{k})^{T}(\bm{v}_{k}-{\bm{x}}_{k})\right). (17)
=f(𝒙k)12Lf(𝒙k)2\displaystyle=f({\bm{x}}_{k})-\frac{1}{2L}\|\nabla f({\bm{x}}_{k})\|^{2} (18)
f(𝒙kf(𝒙k)/L)\displaystyle\geq f({\bm{x}}_{k}-\nabla f({\bm{x}}_{k})/L) (19)
f(𝒙k+1).\displaystyle\geq f({\bm{x}}_{k+1}). (20)

Here, (15) is a restatement of (11) under the assumption that 𝒙¯k=𝒙k\bar{\bm{x}}_{k}={\bm{x}}_{k}. Line (16) follows from the induction hypothesis. Line (17) follows from (13).

Line (18) follows because, according to the recursive formula (10), $\bm{v}_{k}$ lies in the affine space ${\bm{x}}_{0}+\mathop{\rm span}\{\nabla f({\bm{x}}_{0}),\ldots,\nabla f({\bm{x}}_{k-1})\}$. Similarly, ${\bm{x}}_{k}$ lies in this space. Therefore, $\bm{v}_{k}-{\bm{x}}_{k}$ lies in the $k$th Krylov subspace $\mathop{\rm span}\{\nabla f({\bm{x}}_{0}),\ldots,\nabla f({\bm{x}}_{k-1})\}$. It is well known (see, e.g., [10]) that $\nabla f({\bm{x}}_{k})$ is orthogonal to every vector in this space.

Line (19) is a standard inequality for LL-smooth convex functions; see, e.g., p. 57 of [20]. Finally (20) follows because 𝒙kf(𝒙k)/L{\bm{x}}_{k}-\nabla f({\bm{x}}_{k})/L lies in the k+1k+1-dimensional affine space 𝒙0+span{f(𝒙0),,f(𝒙k)}{\bm{x}}_{0}+\mathop{\rm span}\{\nabla f({\bm{x}}_{0}),\ldots,\nabla f({\bm{x}}_{k})\} while 𝒙k+1{\bm{x}}_{k+1} minimizes ff over the same affine space, another well known property of CG. \square
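As a numerical sanity check (our own Matlab sketch, not part of the paper's code), the following script runs linear CG on a random positive definite quadratic while maintaining the estimate sequence (9)–(12) with $\bar{\bm{x}}_{k}:={\bm{x}}_{k}$ and asserts the progress measure of Theorem 2 at every iteration:

n = 50;
B = randn(n); A = B'*B + eye(n);             % random positive definite Hessian
b = randn(n,1);
f = @(x) 0.5*x'*A*x - b'*x;   g = @(x) A*x - b;
L = max(eig(A));   ell = min(eig(A));
x = zeros(n,1);  v = x;  gam = L;  phistar = f(x);  p = -g(x);
for k = 1:n
    gk = g(x);
    if norm(gk) < 1e-12, break, end
    theta   = (-(gam-ell) + sqrt((gam-ell)^2 + 4*L*gam))/(2*L);    % positive root of (12)
    gamnew  = (1-theta)*gam + theta*ell;                           % (9)
    vnew    = ((1-theta)*gam*v + theta*ell*x - theta*gk)/gamnew;   % (10)
    phistar = (1-theta)*phistar + theta*f(x) - theta^2/(2*gamnew)*norm(gk)^2 ...
              + theta*(1-theta)*gam/gamnew*(ell*norm(x-v)^2/2 + gk'*(v-x));  % (11)
    gam = gamnew;  v = vnew;
    alpha = -(gk'*p)/(p'*A*p);                                     % exact line search
    x = x + alpha*p;
    assert(f(x) <= phistar + 1e-8*max(1,abs(phistar)))             % Theorem 2
    beta = (g(x)'*g(x))/(gk'*gk);                                  % Fletcher-Reeves beta
    p = -g(x) + beta*p;
end
disp('progress measure held on every CG iteration')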

We note that this proof yields another proof of Daniel’s result that conjugate gradient converges at a linear rate with a factor 1const/L1-\mbox{const}\sqrt{\ell/L} per iteration. Specifically, the two bounds are

f(𝒙k)f(𝒙){4(1/L1+/L)2k(f(𝒙0)f) (Daniel)(1/L)kL𝒙0𝒙2 (Nesterov + this proof).f({\bm{x}}_{k})-f({\bm{x}}^{*})\leq\left\{\begin{array}[]{l}4\left(\frac{1-\sqrt{\ell/L}}{1+\sqrt{\ell/L}}\right)^{2k}(f({\bm{x}}_{0})-f^{*})\mbox{ (Daniel)}\\ \left(1-\sqrt{\ell/L}\right)^{k}\cdot L\|{\bm{x}}_{0}-{\bm{x}}^{*}\|^{2}\mbox{ (Nesterov + this proof)}.\end{array}\right.

We see that the reduction factor is better in Daniel’s bound because of the exponent $2k$ rather than $k$, which means a per-iteration reduction of approximately $1-2\sqrt{\ell/L}$ (assuming $\ell\ll L$) instead of $1-\sqrt{\ell/L}$. Furthermore, the multiplicative constant in Daniel’s bound is better as noted earlier.

A limitation of this proof is that it requires 𝒗0=𝒙0\bm{v}_{0}={\bm{x}}_{0}. In the case that the CG iterations follow a sequence of AG iterations, this equality will not hold. We return to this point in Appendix A.

4 Restarting CG

The classical method for coping with failure to make progress in nonlinear CG is ‘restarting’, which means taking a step using the steepest descent direction instead of the conjugate gradient direction. Our proposed C+AG method, when it detects insufficient progress on the kkth iteration (i.e., the inequality f(𝒙k)ϕkf({\bm{x}}_{k})\leq\phi_{k}^{*} fails to hold), attempts two fall-back solutions. The first fall-back is restarting CG in the traditional manner (steepest descent). It should be noted that this restart does not play a role in the theoretical analysis of the algorithm, but we found that in practice it improves performance (versus immediately proceeding to the second fall-back of the AG method).

The issue of when to restart has received significant attention in the previous literature. One guideline is to restart every $n+1$ iterations. In more modern codes, restarting is done every $kn+1$ iterations, where $k\geq 1$ is a constant; e.g., Hager-Zhang’s CG-Descent restarts every $6n+1$ iterations. We also have implemented this restart in our code, although it is usually not activated because some other test fails first.

Three other conditions for restarting from the previous literature, collectively termed the “Beale-Powell” conditions by Dai and Yuan [6], are

|f(𝒙k1)Tf(𝒙k)|c1f(𝒙k)2,(BP1)|\nabla f({\bm{x}}_{k-1})^{T}\nabla f({\bm{x}}_{k})|\geq c_{1}\|\nabla f({\bm{x}}_{k})\|^{2},\quad\mbox{(BP1)} (21)
𝒑kTf(𝒙k)c2f(𝒙k)2,(BP2)\bm{p}_{k}^{T}\nabla f({\bm{x}}_{k})\geq-c_{2}\|\nabla f({\bm{x}}_{k})\|^{2},\quad\mbox{(BP2)} (22)
𝒑kTf(𝒙k)c3f(𝒙k)2,(BP3)\bm{p}_{k}^{T}\nabla f({\bm{x}}_{k})\leq-c_{3}\|\nabla f({\bm{x}}_{k})\|^{2},\quad\mbox{(BP3)} (23)

where c1(0,)c_{1}\in(0,\infty), c2[0,1]c_{2}\in[0,1], and c3(1,)c_{3}\in(1,\infty). It should be noted that in exact arithmetic on a quadratic objective function, none of these tests will ever be activated. It should also be noted that Powell [24] as well as Dai and Yuan consider other restart directions (which we have not implemented) besides the steepest descent direction.

We have found that including these tests (BP1)–(BP3) caused the code to take more iterations in the majority of tests (although the difference in iteration counts was never large), so we have turned them off by default. In other words, the default values are c1=c_{1}=\infty, c2=0c_{2}=0 (i.e., we restart if 𝒑k\bm{p}_{k} is not a descent direction), and c3=c_{3}=\infty. Presumably this is because there are cases when progress is possible according to our progress measure even though a BP condition fails.

One additional restart condition in our code is that if the denominator of (6) is nonpositive, then we restart because this means that calculation of αk\alpha_{k} has failed.
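With the default constants, the restart logic therefore reduces to the following check (a Matlab-like sketch in our own notation; gk holds $\nabla f({\bm{x}}_{k})$, pk the current direction, and s the vector from (5)):

% restart, i.e., replace pk by the steepest-descent direction, if pk is not a
% descent direction (BP2 with c2 = 0) or the denominator of (6) is nonpositive
if (gk'*pk >= 0) || (pk'*s <= 0)
    pk = -gk;
end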

A recent paper by Buhler et al. [3] considers nonlinear conjugate gradient for problems in machine learning and in particular the role of (BP) restarts. The motivation for their work is somewhat different from ours: they consider general objective functions, not necessarily smooth convex functions, and therefore accelerated gradient is not available for their class of problems. Their first main concern, costly line search, is not an issue for C+AG since its line search requires only one additional gradient per iteration. Their second main concern, nonconjugacy in search directions, is also less of a concern for C+AG since conjugacy is not an ingredient of the C+AG progress measure. They conclude that (BP1)–(BP3) (and (BP1) in particular) are important for a successful nonlinear CG implementation. The fact that their conclusions differ from ours arises from the different aims of the papers.

5 Switching to AG in the case of lack of progress

If the inequality f(𝒙k)ϕkf({\bm{x}}_{k})\leq\phi_{k}^{*} fails to hold also for the steepest descent direction described in the previous section, then the second fall-back is to switch to a sequence of AG steps.

AG steps continue until the following “almost-quadratic” test is satisfied. Let

qk:=f(𝒙¯k)T(f(𝒙¯k)+f(𝒙k+1))2L.q_{k}:=-\frac{\nabla f(\bar{\bm{x}}_{k})^{T}(\nabla f(\bar{\bm{x}}_{k})+\nabla f({\bm{x}}_{k+1}))}{2L}.

The termination test for AG is

f(𝒙k+1)f(𝒙¯k)+45qk.f({\bm{x}}_{k+1})\leq f(\bar{\bm{x}}_{k})+\frac{4}{5}q_{k}. (24)

The rationale for this formula is as follows. One checks with a few lines of algebra that if ff were quadratic, then the identity f(𝒙k+1)=f(𝒙¯k)+qkf({\bm{x}}_{k+1})=f(\bar{\bm{x}}_{k})+q_{k} holds, assuming that 𝒙k+1{\bm{x}}_{k+1} is defined as 𝒙¯kf(𝒙¯k)/L\bar{\bm{x}}_{k}-\nabla f(\bar{\bm{x}}_{k})/L as in AG (see l. 46 of Algorithm C+AG, Part II below ). Thus, (24) tests whether ff in the neighborhood of 𝒙¯k\bar{\bm{x}}_{k} is behaving approximately as a quadratic function would behave, which indicates that conjugate gradient might succeed.
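For completeness, here is that algebra, assuming $f$ is quadratic with Hessian $A$ and writing $\bm{g}:=\nabla f(\bar{\bm{x}}_{k})$, so that ${\bm{x}}_{k+1}=\bar{\bm{x}}_{k}-\bm{g}/L$:

f({\bm{x}}_{k+1})=f(\bar{\bm{x}}_{k})-\frac{\|\bm{g}\|^{2}}{L}+\frac{\bm{g}^{T}A\bm{g}}{2L^{2}},\qquad \nabla f({\bm{x}}_{k+1})=\bm{g}-\frac{A\bm{g}}{L},

so that $\bm{g}^{T}(\bm{g}+\nabla f({\bm{x}}_{k+1}))=2\|\bm{g}\|^{2}-\bm{g}^{T}A\bm{g}/L$ and hence $q_{k}=-\|\bm{g}\|^{2}/L+\bm{g}^{T}A\bm{g}/(2L^{2})=f({\bm{x}}_{k+1})-f(\bar{\bm{x}}_{k})$.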

Note that (24) involves an extra function-gradient evaluation since an ordinary AG iteration evaluates f,ff,\nabla f at 𝒙¯k\bar{\bm{x}}_{k} but not at 𝒙k+1{\bm{x}}_{k+1}. Therefore, (24) is checked only every 8 iterations to reduce the overhead of the test.

Overall, the C+AG algorithm requires two function-gradient evaluations for a CG step, one to obtain the gradient at ${\bm{x}}_{k}$ and the second to evaluate $\nabla f(\tilde{\bm{x}})$ in (5). Similarly, the first fall-back requires two function-gradient evaluations. Finally, the accelerated gradient step requires another gradient evaluation. Therefore, the maximum number of function-gradient evaluations on a single iteration before progress is made is five, and the bound (14) of Theorem 1 applies to our algorithm up to a constant factor of 5. In practice, on almost every step, either the CG step succeeds or AG is used, meaning the actual number of function-gradient evaluations per iteration lies between 1 and 2.

6 Estimating LL

The C+AG method as described thus far requires knowledge of $L$ and $\ell$, the moduli of smoothness and strong convexity respectively. In case these parameters are not known, the method estimates $L$ and simply assumes $\ell=0$. For many problems (including our examples) $\ell\ll L$, meaning that convergence is governed by the second parenthesized factor of (14). Thus, taking $\ell=0$ does not appear to significantly impede convergence.

As for estimating LL, we use the following procedure outlined in §10.4.2 of Beck [1]. Starting from an initial guess of L=1L=1, we first decrease LL by multiplying by 1/21/\sqrt{2} or increase by multiplying by 2\sqrt{2} until the condition

f(𝒙0f(𝒙0)/L)f(𝒙0)f(𝒙0)2/(2L)f({\bm{x}}_{0}-\nabla f({\bm{x}}_{0})/L)\leq f({\bm{x}}_{0})-\|\nabla f({\bm{x}}_{0})\|^{2}/(2L)

is satisfied. This means that C+AG and AG require O(|log(L)|)O(|\log(L)|) iterations for the initial estimate. On subsequent iterations, we re-estimate LL on the first iteration of consecutive sequences of CG iterations and on every iteration of AG. On iterations after the first, LL cannot decrease; it will either stay the same or increase. Beck proves that using this procedure does not worsen the iteration-complexity of AG when compared to the case that LL is known a priori. However, this procedure increases the number of function-gradient evaluations per AG step from one to approximately two. In the case of the conjugate gradient steps, the impact is negligible since only the first in a sequence requires the extra function-gradient evaluation.

It is not necessary to re-estimate LL on every CG iteration because the CG step uses LL only as the finite-difference step size in its procedure to determine αk\alpha_{k} as in (4)–(6). This dependence is mild in the sense that, assuming ff is quadratic, (4)–(6) will compute the same (correct) value of αk\alpha_{k} regardless of the step size. All computational tests in Tables 1 and 2 use the adaptive version of C+AG described in this section.

There are procedures in the literature for estimating \ell also, for example, the procedure by Carmon et al. [4]. However, Carmon et al.’s procedure is costly as it involves several trial steps and a guess-and-update procedure for \ell. Even in the pure quadratic case, estimating \ell apparently requires multiple steps of the Lanczos method [10] and would be costly. We have not implemented it since, as mentioned above, taking =0\ell=0 appears to be satisfactory.

7 Pseudocode for C+AG

In this section we present pseudocode for the C+AG algorithm. The pseudocode includes the procedure for estimating LL. It does not, however, include the elaborate procedure for finding witnesses after CG is restarted as described in Appendix A. We did not include this in the pseudocode because it is disabled by default as discussed in the appendix. See the concluding remarks of the paper for a downloadable implementation in Matlab, which does include that procedure as an option. The input arguments are ff and f\nabla f, the objective function and gradient, 𝒙0{\bm{x}}_{0}, the initial guess, gtol, the termination criterion, and LL, \ell, the smoothness modulus and the modulus of strong convexity respectively. Note that =0\ell=0 is a valid argument, and L=NaNL=\mathrm{NaN} (not a number) is a signal to the algorithm to estimate LL.

Algorithm 1 ComputeThetaGamma (implementation of (9) and (12))
Input: $L,\ell,\gamma_{k}$
Output: $\theta_{k},\gamma_{k+1}$
Solve Lθk2+(γk)θkγk=0L\theta_{k}^{2}+(\gamma_{k}-\ell)\theta_{k}-\gamma_{k}=0 via the quadratic formula for the positive root θk\theta_{k}.
Let γk+1:=(1θk)γk+θk\gamma_{k+1}:=(1-\theta_{k})\gamma_{k}+\theta_{k}\ell
Algorithm 2 EstimateLInitial
Input: $f(\cdot)$, ${\bm{x}}_{0}$, $f({\bm{x}}_{0})$, $\nabla f({\bm{x}}_{0})$, $L_{\rm init}$ ($=1$ by default)
Output: $L$
L:=LinitL:=L_{\rm init}
for k=1:100k=1:100 do
     if f(𝒙0f(𝒙0)/L)<f(𝒙0)f(𝒙0)2/(2L)f({\bm{x}}_{0}-\nabla f({\bm{x}}_{0})/L)<f({\bm{x}}_{0})-\|\nabla f({\bm{x}}_{0})\|^{2}/(2L) then
         L:=L/2L:=L/\sqrt{2}
     else
         return EstimateL(f(),𝒙0,f(𝒙0),f(𝒙0),L)(f(\cdot),{\bm{x}}_{0},f({\bm{x}}_{0}),\nabla f({\bm{x}}_{0}),L)
     end if
end for
Throw error: “ff may be unbounded below”
Algorithm 3 EstimateL
Input: $f(\cdot)$, ${\bm{x}}_{0}$, $f({\bm{x}}_{0})$, $\nabla f({\bm{x}}_{0})$, $L$
Output: $L$
for k=1:60k=1:60 do
     f1:=f(𝒙0f(𝒙0)/L)f_{1}:=f({\bm{x}}_{0}-\nabla f({\bm{x}}_{0})/L)
     if f1f(𝒙0)f(𝒙0)2/(2L)f_{1}\geq f({\bm{x}}_{0})-\|\nabla f({\bm{x}}_{0})\|^{2}/(2L) and |f1f(𝒙0)|1011|f(𝒙0)||f_{1}-f({\bm{x}}_{0})|\geq 10^{-11}|f({\bm{x}}_{0})| then
         L:=2LL:=\sqrt{2}L
     else
         return LL
     end if
end for
Throw error: “Line search failed to determine LL; possible incorrect gradient function or excessive roundoff error”
Algorithm 4 ComputeVPhiStar (implementation of (10) and (11))
Input: $\theta_{k},\gamma_{k},\gamma_{k+1},\ell,\bm{v}_{k},\phi_{k}^{*},\bar{\bm{x}}_{k},f(\bar{\bm{x}}_{k}),\nabla f(\bar{\bm{x}}_{k})$
Output: $\bm{v}_{k+1},\phi_{k+1}^{*}$
Let 𝒗k+1:=1γk+1[(1θk)γk𝒗k+θk𝒙¯kθkf(𝒙¯k)].\bm{v}_{k+1}:=\frac{1}{\gamma_{k+1}}[(1-\theta_{k})\gamma_{k}\bm{v}_{k}+\theta_{k}\ell\bar{\bm{x}}_{k}-\theta_{k}\nabla f(\bar{\bm{x}}_{k})].
Let ϕk+1:=(1θk)ϕk+θkf(𝒙¯k)θk22γk+1f(𝒙¯k)2+θk(1θk)γkγk+1(𝒙¯k𝒗k2/2+f(𝒙¯k)T(𝒗k𝒙¯k)).\begin{aligned} \mbox{Let }\phi_{k+1}^{*}:=&(1-\theta_{k})\phi_{k}^{*}+\theta_{k}f(\bar{\bm{x}}_{k})-\frac{\theta_{k}^{2}}{2\gamma_{k+1}}\|\nabla f(\bar{\bm{x}}_{k})\|^{2}\\ &\>\mbox{}+\frac{\theta_{k}(1-\theta_{k})\gamma_{k}}{\gamma_{k+1}}(\ell\|\bar{\bm{x}}_{k}-\bm{v}_{k}\|^{2}/2+\nabla f(\bar{\bm{x}}_{k})^{T}(\bm{v}_{k}-\bar{\bm{x}}_{k})).\end{aligned}
Algorithm 5 C+AG (Part I)
1: Input: $f(\cdot),\nabla f(\cdot),{\bm{x}}_{0}\in\mathbb{R}^{n},\ell,L,\texttt{gtol}$
2: Output: ${\bm{x}}^{*}$ such that $\|\nabla f({\bm{x}}^{*})\|\leq\texttt{gtol}$
3:f0:=f(𝒙0)f_{0}:=f({\bm{x}}_{0}); 𝒈0:=f(𝒙0)\bm{g}_{0}:=\nabla f({\bm{x}}_{0}); ϕ0:=f0\phi_{0}^{*}:=f_{0}; 𝒗0:=𝒙0\bm{v}_{0}:={\bm{x}}_{0}; onlyAG:=𝐟𝐚𝐥𝐬𝐞\texttt{onlyAG}:=\mathbf{false};
4:icg:=0i_{cg}:=0; iag:=0i_{ag}:=0; 𝒑0:=𝒈0\bm{p}_{0}:=-\bm{g}_{0}; LNaNFlag:=𝐟𝐚𝐥𝐬𝐞\texttt{LNaNFlag}:=\mathbf{false}
5:if L=NaNL=\mathrm{NaN} then
6:     L:=EstimateLInit(f(),𝒙0,f(𝒙0),f(𝒙0),1)L:=\texttt{EstimateLInit}(f(\cdot),{\bm{x}}_{0},f({\bm{x}}_{0}),\nabla f({\bm{x}}_{0}),1); :=0\ell:=0; LNaNFlag:=𝐭𝐫𝐮𝐞\texttt{LNaNFlag}:=\mathbf{true}
7:end if
8:γ0:=L\gamma_{0}:=L;
9:for k=0,1,k=0,1,\ldots do
10:     θk,γk+1:=ComputeThetaGamma(L,,γk)\theta_{k},\gamma_{k+1}:=\texttt{ComputeThetaGamma}(L,\ell,\gamma_{k})
11:     for whichsteptype:=1:3\texttt{whichsteptype}:=1:3 do
12:         if whichsteptype2\texttt{whichsteptype}\leq 2 then
13:              if onlyAG then continue  end if
14:              if icg6n+1i_{cg}\geq 6n+1 or whichsteptype=2\texttt{whichsteptype}=2 then
15:                  𝒑k:=𝒈k\bm{p}_{k}:=-\bm{g}_{k}; icg:=0i_{cg}:=0
16:              end if
17:              if icg=0i_{cg}=0 and k>0k>0 and LNaNFlag then
18:                  L:=EstimateL(f(),𝒙k,f(𝒙k),f(𝒙k),L)L:=\texttt{EstimateL}(f(\cdot),{\bm{x}}_{k},f({\bm{x}}_{k}),\nabla f({\bm{x}}_{k}),L)
19:              end if
20:              icg:=icg+1;iag:=0i_{cg}:=i_{cg}+1;\>i_{ag}:=0
21:              𝒙~:=𝒙k+𝒑k/L;\tilde{\bm{x}}:={\bm{x}}_{k}+\bm{p}_{k}/L; f~:=f(𝒙~)\tilde{f}:=f(\tilde{\bm{x}}); 𝒈~:=f(𝒙~)\tilde{\bm{g}}:=\nabla f(\tilde{\bm{x}})
22:              if 𝒈~gtol\|\tilde{\bm{g}}\|\leq\texttt{gtol} then 𝐫𝐞𝐭𝐮𝐫𝐧𝒙~\mathbf{return}\>\tilde{\bm{x}}  end if
23:              Ap:=L(𝒈~𝒈k)\texttt{Ap}:=L(\tilde{\bm{g}}-\bm{g}_{k}); pAp:=𝒑kTAp\texttt{pAp}:=\bm{p}_{k}^{T}\texttt{Ap};
24:              if 𝒈kT𝒑k0 or pAp0\bm{g}_{k}^{T}\bm{p}_{k}\geq 0\textbf{ or }\texttt{pAp}\leq 0 then continue end if
25:              αk:=𝒈kT𝒑k/pAp\alpha_{k}:=-\bm{g}_{k}^{T}\bm{p}_{k}/\texttt{pAp}
26:              𝒙k+1:=𝒙k+αk𝒑k{\bm{x}}_{k+1}:={\bm{x}}_{k}+\alpha_{k}\bm{p}_{k}; fk+1:=f(𝒙k+1)f_{k+1}:=f({\bm{x}}_{k+1}); 𝒈k+1:=f(𝒙k+1)\bm{g}_{k+1}:=\nabla f({\bm{x}}_{k+1})
27:              if 𝒈k+1gtol\|\bm{g}_{k+1}\|\leq\texttt{gtol} then 𝐫𝐞𝐭𝐮𝐫𝐧𝒙k+1\mathbf{return}\>{\bm{x}}_{k+1}  end if
28:              𝒗k+1,ϕk+1:=ComputeVPhiStar(θk,γk,γk+1,,𝒗k,ϕk,𝒙k,fk,𝒈k)\bm{v}_{k+1},\phi_{k+1}^{*}:=\texttt{ComputeVPhiStar}(\theta_{k},\gamma_{k},\gamma_{k+1},\ell,\bm{v}_{k},\phi_{k}^{*},{\bm{x}}_{k},f_{k},\bm{g}_{k})
29:              if fk+1ϕk+1f_{k+1}\leq\phi_{k+1}^{*} then
30:                  𝒚^:=𝒈k+1𝒈k\hat{\bm{y}}:=\bm{g}_{k+1}-\bm{g}_{k}
31:                  β1:=(𝒚^𝒑k2𝒚^2𝒚^T𝒑k)T𝒈k+1𝒚^T𝒑k\beta^{1}:=\left(\hat{\bm{y}}-\bm{p}_{k}\cdot\frac{2\|\hat{\bm{y}}\|^{2}}{\hat{\bm{y}}^{T}\bm{p}_{k}}\right)^{T}\frac{\bm{g}_{k+1}}{\hat{\bm{y}}^{T}\bm{p}_{k}}
32:                  β2:=1𝒑kmin(.01𝒈0,𝒈k+1).\beta^{2}:=\frac{-1}{\|\bm{p}_{k}\|\cdot\min\left(.01\|\bm{g}_{0}\|,\|\bm{g}_{k+1}\|\right)}.
33:                  βk+1:=max(β1,β2)\beta_{k+1}:=\max(\beta^{1},\beta^{2})
34:                  𝒑k+1:=𝒈k+1+βk+1𝒑k\bm{p}_{k+1}:=-\bm{g}_{k+1}+\beta_{k+1}\bm{p}_{k}
35:                  break (terminate whichsteptype loop)
36:              end if
Algorithm 6 C+AG Part II
37:         else(whichsteptype=3)\>\>(\texttt{whichsteptype}=3)
38:              if 𝐧𝐨𝐭onlyAG\mathbf{not}\>\texttt{onlyAG} then
39:                  onlyAG:=𝐭𝐫𝐮𝐞\texttt{onlyAG}:=\mathbf{true}
40:                  iag:=0i_{ag}:=0; icg:=0i_{cg}:=0
41:              end if
42:              iag:=iag+1i_{ag}:=i_{ag}+1
43:              𝒙¯k:=θkγk𝒗k+γk+1𝒙kγk+θk\bar{{\bm{x}}}_{k}:=\frac{\theta_{k}\gamma_{k}\bm{v}_{k}+\gamma_{k+1}{\bm{x}}_{k}}{\gamma_{k}+\theta_{k}\ell}; f¯k:=f(𝒙¯k)\bar{f}_{k}:=f(\bar{{\bm{x}}}_{k}); 𝒈¯k:=f(𝒙¯k)\bar{\bm{g}}_{k}:=\nabla f(\bar{{\bm{x}}}_{k})
44:              if 𝒈¯kgtol\|\bar{\bm{g}}_{k}\|\leq\texttt{gtol} then return 𝒙¯k\bar{\bm{x}}_{k}  end if
45:              if LNaNFlag then L:=EstimateL(f(),𝒙k,f(𝒙k),f(𝒙k),L)L:=\texttt{EstimateL}(f(\cdot),{\bm{x}}_{k},f({\bm{x}}_{k}),\nabla f({\bm{x}}_{k}),L)  end if
46:              𝒙k+1:=𝒙¯k𝒈¯k/L{\bm{x}}_{k+1}:=\bar{{\bm{x}}}_{k}-\bar{\bm{g}}_{k}/L
47:              𝒗k+1,ϕk+1:=ComputeVPhiStar(θk,γk,γk+1,,𝒗k,ϕk,𝒙¯k,f¯k,𝒈¯k)\bm{v}_{k+1},\phi_{k+1}^{*}:=\texttt{ComputeVPhiStar}(\theta_{k},\gamma_{k},\gamma_{k+1},\ell,\bm{v}_{k},\phi_{k}^{*},\bar{\bm{x}}_{k},\bar{f}_{k},\bar{\bm{g}}_{k})
48:              if iag0(mod 8)𝐚𝐧𝐝f(𝒙k+1)f(𝒙¯k)45𝒈¯kT(𝒈¯k+f(𝒙k+1))2Li_{ag}\equiv 0\>(\mbox{mod $8$})\>\mathbf{and}\>f({\bm{x}}_{k+1})\leq f(\bar{\bm{x}}_{k})-\frac{4}{5}\cdot\frac{\bar{\bm{g}}_{k}^{T}(\bar{\bm{g}}_{k}+\nabla f({\bm{x}}_{k+1}))}{2L} then
49:                  fk+1:=f(𝒙k+1)f_{k+1}:=f({\bm{x}}_{k+1}); 𝒈k+1:=f(𝒙k+1)\bm{g}_{k+1}:=\nabla f({\bm{x}}_{k+1})
50:                  𝒑k+1:=𝒈k+1\bm{p}_{k+1}:=-\bm{g}_{k+1}
51:                  onlyAG:=𝐟𝐚𝐥𝐬𝐞\texttt{onlyAG}:=\mathbf{false}
52:              end if
53:         end if
54:     end for
55:end for

The mathematical structure of the algorithm has already been described in previous sections, and in this section we will describe a few computational details. The flag onlyAG indicates that the algorithm is currently executing a block of AG statements. As mentioned in Section 5, once AG iterations begin, they continue uninterrupted until (24) holds, and this is checked every 8th iteration.

The variable $i_{cg}$ counts the number of consecutive CG iterations without a restart. A restart means that $\bm{p}_{k+1}$ is defined to be $-\bm{g}_{k}$, i.e., a steepest descent step is taken. The code forces a restart after $6n+1$ iterations. Such a test is typical in NCG algorithms. The theory for $n$-step quadratic convergence requires a restart every $n+1$ iterations, but well known codes, e.g., CG-Descent [12], often use a longer interval.

The for-loop on l. 11 tries first a CG step and second a steepest descent step in order to make progress. Line 25 computes αk\alpha_{k} as discussed in Section 2. The test for sufficient progress in a CG step appears in l. 29. If sufficient progress is attained, then the next CG search direction 𝒑k+1\bm{p}_{k+1} is computed.

Line 38 can be reached if both attempts at CG fail to make sufficient progress. It is also reached if the onlyAG flag is set to ‘true’, meaning that the algorithm is currently inside a block of consecutive AG iterations. The statements beginning with 43 are the formulation of the AG method taken from [20]. Statement l. 48 implements the test for resuming CG described in Section 5.

8 Computational experiments

In this section we report on computational experiments with C+AG. We compared it to two other codes: AG coded by us based on (2.2.8) of [20] and CG-Descent by Hager and Zhang. We used C version 6.8 of CG-Descent with the Matlab front-end. We used it in memoryless mode, i.e., NCG rather than L-BFGS.

As for our own code C+AG, we discovered via experimentation that the more elaborate calculation of $\bar{\bm{x}}_{k}$ described in Appendix A diminished the performance of the code. In particular, as mentioned in Appendix A, the additional overhead of 100% per iteration slowed the code down without improving the iteration count, and thus the elaborate calculation was not used.

To the best of our knowledge, there is no standard benchmark set of smooth convex functions with a known L,L,\ell, so instead we constructed our own test set of eight problem instances based on formulations that have occurred in the recent literature in data science.

The convergence criterion is f(𝒙k)gtol\|\nabla f({\bm{x}}_{k})\|\leq\texttt{gtol}, where gtol=108\texttt{gtol}=10^{-8} for the ABPDN and LL problems described below, while gtol=106\texttt{gtol}=10^{-6} for the HR problems. (None of the algorithms were able to converge to 10810^{-8} for the HR problems in fewer than 21062\cdot 10^{6} iterations, so we loosened the criterion for that problem.) In the case of CG-Descent, the convergence criterion is based on the \infty-norm rather than 2-norm. (The \infty-norm convergence criterion is hardwired into the code, and we did not attempt to rewrite this fairly complex software.) Via some experiments not reported here, we found that stopping when f(𝒙k)3gtol/n\|\nabla f({\bm{x}}_{k})\|_{\infty}\leq 3\cdot\texttt{gtol}/\sqrt{n} approximately reproduced the criterion f(𝒙k)gtol\|\nabla f({\bm{x}}_{k})\|\leq\texttt{gtol} used for the other methods, so this was the termination criterion for CG-Descent.

The reported results for our C+AG routine used L=𝐍𝐚𝐍L=\mathbf{NaN} and =0\ell=0 in all cases, in other words, adaptive estimation of LL. We assume most users would select this mode since L,L,\ell are often not known in advance. For AG, we tested both the known and unknown L,L,\ell modes; this is indicated in the table below by “AG” and “AG/EstL.”

As mentioned in the introduction, we have reported function-gradient evaluation counts since that is usually taken as the primary work-unit of first-order methods. We did not report wall-clock time because CG-Descent, having been written in highly optimized C, has an advantage over AG and C+AG, which are written in Matlab.

The first four problems are based on BPDN (basis pursuit denoising), that is, the unconstrained convex optimization problem:

min12A𝒙𝒃2+λ𝒙1\min\frac{1}{2}\|A{\bm{x}}-\bm{b}\|^{2}+\lambda\|{\bm{x}}\|_{1}

in which $\lambda>0$ and $A\in\mathbb{R}^{m\times n}$ has fewer rows than columns, so that the problem is neither strongly convex nor smooth. However, the following approximation (called ABPDN) is both smooth and strongly convex on any bounded domain:

min12A𝒙𝒃2+λi=1nxi2+δ\min\frac{1}{2}\|A{\bm{x}}-\bm{b}\|^{2}+\lambda\sum_{i=1}^{n}\sqrt{x_{i}^{2}+\delta}

where δ>0\delta>0 is a fixed scalar. It is easy to see that as δ0\delta\rightarrow 0, the original problem is recovered. As δ0\delta\rightarrow 0, 0\ell\rightarrow 0 and LL\rightarrow\infty, where ,L\ell,L are the moduli of strong and smooth convexity respectively.
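A minimal Matlab sketch (our own, using a generic explicit matrix $A$ rather than the implicit DCT-based operator used in the actual tests) of the ABPDN objective and its gradient:

function [fx, gx] = abpdn_fgrad(x, A, b, lambda, delta)
% smoothed basis-pursuit-denoising objective and gradient
    r  = A*x - b;                     % residual
    s  = sqrt(x.^2 + delta);          % smooth surrogate for |x_i|
    fx = 0.5*(r'*r) + lambda*sum(s);
    gx = A'*r + lambda*(x./s);
end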

In our tests of ABPDN we took AA to be a subset of n\sqrt{n} rows of the discrete-cosine transform matrix of size n×nn\times n, where nn is an even power of 2. (This matrix and its transpose, although dense, can be applied in O(nlnn)O(n\ln n) operations.) The subset of rows was selected to be those numbered by the first m=nm=\sqrt{n} prime integers in order to get reproducible pseudorandomness in the choices. Similarly, in order to obtain a pseudorandom 𝒃\bm{b}, we selected 𝒃m\bm{b}\in\mathbb{R}^{m} according to the formula bi=sin(i2)b_{i}=\sin(i^{2}). The value of λ\lambda was fixed at 10310^{-3} in all tests. Finally, we varied δ=104\delta=10^{-4} and δ=5106\delta=5\cdot 10^{-6} and we tried both n=65536n=65536 and n=262144n=262144 for a total of four test cases.

The 1\ell_{1}-penalty term in the original BPDN formulation leads to many entries of the solution 𝒙{\bm{x}}^{*} equal to 0. This is no longer the case for ABPDN since the nondifferentiability at xi=0x_{i}=0 has been smoothed out. However, a simple scaling argument suggests that the threshold for “nearly zero” should be magnitude bounded above by δ\sqrt{\delta}. We determined that our computed solutions for the four ABPDN solutions had between 78% to 82% of entries nearly zero.

The second test case is logistic loss (LL), which is as follows:

f(𝒙)=R(A𝒙)+λ𝒙2/2,f({\bm{x}})=R(A{\bm{x}})+\lambda\|{\bm{x}}\|^{2}/2,

where $A$ is a given $m\times n$ matrix and $\lambda>0$ is a regularization parameter. Here, $R(\bm{v})=\sum_{i=1}^{m}r(v_{i})$ where $r(v)=\ln(1+e^{-v})$. Row $i$ of $A$, $i=1:m$, is of the form $\bm{e}^{T}/\sqrt{n}+\bm{z}_{i}$, where $\bm{e}\in\mathbb{R}^{n}$ is the vector of all 1's and $\bm{z}_{i}\sim N(\bm{0},\sigma^{2}I)$ with $\sigma=0.4$. We seeded Matlab's random number generator with the same seed on each run for reproducibility. This formulation arises from the problem of identifying the best halfspace that contains noisy data points. We ran this test with $A\in\mathbb{R}^{6000\times 3000}$ and two values of $\lambda$, namely, $\lambda=10^{-4}$ and $\lambda=5\cdot 10^{-6}$. This function is smooth and strongly convex; the parameter $\lambda$ controls the strong convexity.
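A minimal Matlab sketch (our own) of this objective and gradient, with the log term computed in a numerically stable form:

function [fx, gx] = logloss_fgrad(x, A, lambda)
% f(x) = sum_i log(1+exp(-(A*x)_i)) + lambda*||x||^2/2
    v  = A*x;
    rv = max(-v,0) + log(1 + exp(-abs(v)));   % log(1+exp(-v)) computed stably
    fx = sum(rv) + 0.5*lambda*(x'*x);
    dr = -1./(1 + exp(v));                    % r'(v) = -1/(1+e^v)
    gx = A'*dr + lambda*x;
end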

The third test case is Huber regression (HR). Given a matrix Am×nA\in\mathbb{R}^{m\times n} and a vector 𝒃m\bm{b}\in\mathbb{R}^{m}, Huber regression [14] has been proposed as a means to make linear least-squares regression more robust against outliers. In more detail, a cutoff τ>0\tau>0 is selected in order to define the function

ζ(t)={τ22τt,tτ,t2,t[τ,τ],τ2+2τt,tτ.\zeta(t)=\left\{\begin{array}[]{ll}-\tau^{2}-2\tau t,&t\leq-\tau,\\ t^{2},&t\in[-\tau,\tau],\\ -\tau^{2}+2\tau t,&t\geq\tau.\end{array}\right.

Note that $\zeta(t)$ is differentiable and convex and behaves like $t^{2}$ for small $|t|$ and like $O(|t|)$ for large $|t|$. Finally, $f({\bm{x}})=\sum_{i=1}^{m}\zeta(A(i,:){\bm{x}}-b_{i})$. This function is smooth but not strongly convex, i.e., $\ell=0$. We chose $A$ to be the $(n+1)\times n$ sparse matrix with $1$'s on the main diagonal and $-1$'s on the first subdiagonal. The vector $\bm{b}$ was taken to be all 1's except for the final entry, which is $-1.1n$. (This $\bm{b}$ is close to $A{\bm{x}}$ where ${\bm{x}}=[1,2,\ldots,n]^{T}$; in other words, $\bm{b}$ is close to a right-hand side that has a 0-residual solution.) We fixed $n=10000$. We selected two choices for $\tau$, namely, $\tau=250$ and $\tau=1000$.
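A minimal Matlab sketch (our own) of the resulting objective and gradient built from $\zeta$:

function [fx, gx] = huber_fgrad(x, A, b, tau)
% f(x) = sum_i zeta(A(i,:)*x - b_i), with zeta quadratic on [-tau,tau]
% and linear with slope +-2*tau outside
    t  = A*x - b;
    in = abs(t) <= tau;                           % quadratic region
    z  = (t.^2).*in + (2*tau*abs(t) - tau^2).*(~in);
    dz = (2*t).*in + (2*tau*sign(t)).*(~in);
    fx = sum(z);
    gx = A'*dz;
end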

Our results for the four algorithms on the eight problems are shown in Table 2. We set a limit of 10610^{6} function-gradient evaluations for C+AG and AG.

Table 2: Function-gradient evaluation counts for four algorithms in columns 3–6 on eight problems. Bold indicates the best for each row. The second column of the table indicates the percentage of C+AG iterations spent in AG iterations.
Problem %AG C+AG AG AG/EstL CG-Descent
ABPDN(n=65536,δ=104)(n=65536,\delta=10^{-4}) .03% 55,891 518,019 982,919 82,472
ABPDN(n=65536,δ=5106)(n=65536,\delta=5\cdot 10^{-6}) 14% 226,141 660,355 >106>10^{6} 165,207
ABPDN(n=262144,δ=104)(n=262144,\delta=10^{-4}) .02% 80,335 901,418 >106>10^{6} 130,040
ABPDN(n=262144,δ=5106)(n=262144,\delta=5\cdot 10^{-6}) 0% 483,420 >106>10^{6} >106>10^{6} 532,706
LL(λ=104)(\lambda=10^{-4}) 0% 148 106,507 >106>10^{6} 128
LL(λ=5106)(\lambda=5\cdot 10^{-6}) 0% 140 362,236 >106>10^{6} 125
HR(τ=250)(\tau=250) 64% 160,115 >106>10^{6} >106>10^{6} 946,488
HR(τ=1000)(\tau=1000) 60% 95,416 >106>10^{6} >106>10^{6} 245,376

The results in Table 2 show that C+AG outperformed both AG and CG-Descent in most cases. In all cases, C+AG did about as well as or better than whichever of AG or CG-Descent performed better.

9 Conclusions

We have presented an algorithm called C+AG, which is a variant of nonlinear conjugate gradient tailored to smooth, convex objective functions. It has a guaranteed rate of convergence equal to that of accelerated gradient, which is optimal for the class of problems under consideration in the function-gradient evaluation model of computation. The method reduces to linear conjugate gradient in the case that the objective function is quadratic.

The code and all the test cases are implemented in Matlab and available on Github under the project named ‘ConjugatePlusAcceleratedGradient’.

The C+AG code outperformed both AG and CG on most test cases tried. One interesting future direction is as follows. The method switches between CG and AG steps depending on the progress measure. A more elegant approach would be a formulation that interpolates continuously between the two steps. Indeed, it is possible (see, e.g., [16]) to write down a single parameterized update step in which one parameter value corresponds to AG and another to CG. But we do not know of a systematic way to select this parameter that leads to a complexity bound.

Another direction to pursue is incorporating restarts into AG. In particular, [22] proposes a restart method that appears to work well in practice according to their experiments. Both AG and C+AG could use these restarts, although presumably the former would benefit more than the latter since AG steps are not used as frequently in C+AG as in pure AG. However, to the best of our knowledge, the restart method is not guaranteed in the following sense: it has not been proved that the bound (14) of Theorem 1 on AG iterations remains valid in the presence of restarts. Since our goal here was to provide a method with the iteration guarantee of AG, we did not incorporate their restarts.

Yet another extension would be to use memory of previous iterates. In the context of NCG, adding memory yields L-BFGS [21]. It is clearly possible to hybridize L-BFGS with AG, and it is likely that the progress measure proposed still holds for L-BFGS in the case that it is applied to a quadratic since L-BFGS reduces to LCG in this case. Note that some authors, e.g. [25], have also proposed adding memory to AG-like methods, so many combinations are possible.

Another interesting direction is to develop a conjugate gradient iteration suitable for constrained convex problems. Conjugate gradient is known to extend to the special case of minimizing a convex quadratic function with a Euclidean-ball constraint; see e.g. Gould et al. [11] and further remarks on the complexity in Paquette and Vavasis [23].

10 Acknowledgements

The authors are grateful to the referees of the previous version of this paper for their helpful comments.

Appendix A Restarting CG after AG

As mentioned in Section 5, if the sufficient-progress test fails for a CG iteration, then the method can fall back to AG iterations. After the AG termination test holds, the method resumes CG, say at iteration mm.

An issue with restarting CG is that the sufficient-progress test may not hold for the CG iterations even if the function is purely quadratic on the remaining iterations. This is because the proof of sufficient progress in Theorem 2 assumed that 𝒗0=𝒙0\bm{v}_{0}={\bm{x}}_{0}. However, we would not expect the equality 𝒙m=𝒗m{\bm{x}}_{m}=\bm{v}_{m} to hold, where mm is the first iteration of CG, and therefore Theorem 2 does not apply.

We can address this issue with a more elaborate CG sufficient-progress test after resuming CG. The more elaborate test increases the number of function-gradient evaluations per CG iteration from two to four. According to our experiments, this extra overhead outweighed any benefit from the more elaborate procedure.

We did observe a benefit from the elaborate procedure in a contrived test case. In this contrived test, the objective function ff is quadratic, and the C+AG code is modified in two ways. First, the initial block of iterations is AG instead of CG, i.e., the only_ag flag is set to true at initialization. Second, instead of setting 𝒗0:=𝒙0\bm{v}_{0}:={\bm{x}}_{0}, we add random noise to the right-hand side of this assignment. For this contrived test, we indeed observed, for some choices of the random noise, that CG was never able to restart unless the elaborate progress-measure computation was enabled, and therefore the total number of function-gradient evaluations with the elaborate procedure was much smaller than without it. However, we did not observe this behavior in any naturally occurring test case.

Therefore, our default setting is not to use the elaborate test. This means that with our default settings, there is no theoretical guarantee that CG iterations can continue indefinitely upon their resumption after AG even if the objective function is purely quadratic on the level set of 𝒙m{\bm{x}}_{m} when CG resumes. However, the other theoretical guarantees of our method (that it has the same complexity as accelerated gradient and that it reduces to CG on a quadratic function) remain.

Despite the fact that the test is turned off by default, it is available in the code and may prove useful, and therefore we describe it here for the sake of completeness. Also, the proof-sketch of nn-step quadratic convergence in the next appendix requires that this procedure be used.

Above, we defined mm to be the iteration at which CG iterations are resumed. To simplify notation in the remainder of the section, assume m=0m=0. Thus, we consider CG starting with 𝒗0𝒙0\bm{v}_{0}\neq{\bm{x}}_{0}. As in Theorem 2, we assume the objective function f(𝒙)f({\bm{x}}) is quadratic, say f(𝒙)=𝒙TA𝒙/2𝒃T𝒙f({\bm{x}})={\bm{x}}^{T}A{\bm{x}}/2-\bm{b}^{T}{\bm{x}}. Recall that linear CG generates a sequence of vectors 𝒑0,𝒑1,\bm{p}_{0},\bm{p}_{1},\ldots (see, e.g., [21], §5.1) such that 𝒑iTA𝒑j=0\bm{p}_{i}^{T}A\bm{p}_{j}=0 whenever iji\neq j (called “conjugacy”) and such that for all kk, span{𝒑0,,𝒑k1}=span{𝒑0,A𝒑0,,Ak1𝒑0}=span{f(𝒙0),,f(𝒙k1)}\mathop{\rm span}\{\bm{p}_{0},\ldots,\bm{p}_{k-1}\}=\mathop{\rm span}\{\bm{p}_{0},A\bm{p}_{0},\ldots,A^{k-1}\bm{p}_{0}\}=\mathop{\rm span}\{\nabla f({\bm{x}}_{0}),\ldots,\nabla f({\bm{x}}_{k-1})\}, called the kkth Krylov space and denoted KkK_{k} for k=1,2,k=1,2,\ldots (where K0:={𝟎}K_{0}:=\{\bm{0}\}). Other properties are that 𝒙k{\bm{x}}_{k} is the minimizer of ff over the affine set 𝒙0+Kk{\bm{x}}_{0}+K_{k} and the equation 𝒙k+1:=𝒙k+αk𝒑k{\bm{x}}_{k+1}:={\bm{x}}_{k}+\alpha_{k}\bm{p}_{k}.
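To fix ideas, the following short sketch (our own Python/NumPy illustration, not the authors' Matlab code; the matrix A, vector b, and starting point are randomly generated placeholders) runs textbook linear CG on a positive-definite quadratic and checks numerically that the directions it produces are conjugate, i.e., p_i^T A p_j = 0 for i ≠ j.

import numpy as np

def linear_cg(A, b, x0, iters):
    # Textbook linear CG for f(x) = x'Ax/2 - b'x (see [21], Section 5.1).
    x = x0.copy()
    r = A @ x - b          # gradient of f at x
    p = -r
    dirs = []
    for _ in range(iters):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)
        x = x + alpha * p
        r_new = r + alpha * Ap
        beta = (r_new @ r_new) / (r @ r)
        dirs.append(p.copy())
        p = -r_new + beta * p
        r = r_new
    return x, np.array(dirs)

rng = np.random.default_rng(0)
n = 30
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)     # positive definite, moderate condition number
b = rng.standard_normal(n)
x, P = linear_cg(A, b, np.zeros(n), 10)
C = P @ A @ P.T                 # should be (numerically) diagonal by conjugacy
print(np.max(np.abs(C - np.diag(np.diag(C)))))   # essentially zero up to roundoff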

The new progress-checking procedure augments the Krylov space with two additional directions on each iteration to assure that the analogs of (18) and (20), that is, (41) and (43) below, hold for the method. In more detail, it computes additional sequences of vectors 𝒘0,𝒘1,𝒘2,\bm{w}_{0},\bm{w}_{1},\bm{w}_{2},\ldots; 𝒛1,𝒛2,𝒛3,\bm{z}_{1},\bm{z}_{2},\bm{z}_{3},\ldots; and 𝒙¯0,𝒙¯1,𝒙¯2\bar{\bm{x}}_{0},\bar{\bm{x}}_{1},\bar{\bm{x}}_{2}\ldots; and scalars α¯0,α¯1,α¯2,\bar{\alpha}_{0},\bar{\alpha}_{1},\bar{\alpha}_{2},\ldots; β¯1,β¯2,β¯3,\bar{\beta}_{1},\bar{\beta}_{2},\bar{\beta}_{3},\ldots; and μ1,μ2,μ3,\mu_{1},\mu_{2},\mu_{3},\ldots whose formulas are forthcoming. Let K¯0:=span{𝒘0}\bar{K}_{0}:=\mathop{\rm span}\{\bm{w}_{0}\} and for k1k\geq 1, K¯k:=span{𝒑0,,𝒑k1,𝒘k,𝒛k}=Kkspan{𝒘k,𝒛k}\bar{K}_{k}:=\mathop{\rm span}\{\bm{p}_{0},\ldots,\bm{p}_{k-1},\bm{w}_{k},\bm{z}_{k}\}=K_{k}\oplus\mathop{\rm span}\{\bm{w}_{k},\bm{z}_{k}\}. For each k1k\geq 1, define

𝒖k:=𝒙¯k1f(𝒙¯k1)/L𝒙k,\bm{u}_{k}:=\bar{\bm{x}}_{k-1}-\nabla f(\bar{\bm{x}}_{k-1})/L-{\bm{x}}_{k}, (25)

which is not computed by the method but used in its derivation. The inductive construction of these vectors and scalars has the following properties.

  1.

    For all k0k\geq 0, 𝒗k𝒙kK¯k\bm{v}_{k}-{\bm{x}}_{k}\in\bar{K}_{k}. Furthermore, when writing 𝒗k𝒙k\bm{v}_{k}-{\bm{x}}_{k} as a linear combination of 𝒑0,,𝒑k1,𝒘k,𝒛k\bm{p}_{0},\ldots,\bm{p}_{k-1},\bm{w}_{k},\bm{z}_{k}, the coefficients of 𝒘k,𝒛k\bm{w}_{k},\bm{z}_{k} are 1,01,0 respectively. For this property and the next, in the degenerate case that 𝒑0,,𝒑k1,𝒘k,𝒛k\bm{p}_{0},\ldots,\bm{p}_{k-1},\bm{w}_{k},\bm{z}_{k} are dependent, this property is understood to mean that there exists a way to write 𝒗k𝒙k\bm{v}_{k}-{\bm{x}}_{k} with coefficients 1,01,0.

  2.

    For all k1k\geq 1, 𝒖kK¯k\bm{u}_{k}\in\bar{K}_{k}. Furthermore, when writing 𝒖k\bm{u}_{k} as a linear combination of 𝒑0,,𝒑k1,𝒘k,𝒛k\bm{p}_{0},\ldots,\bm{p}_{k-1},\bm{w}_{k},\bm{z}_{k}, the coefficients of 𝒘k,𝒛k\bm{w}_{k},\bm{z}_{k} are μk,1\mu_{k},1 respectively.

  3.

    For k1k\geq 1, the vectors 𝒑0,,𝒑k1,𝒘k,𝒛k\bm{p}_{0},\ldots,\bm{p}_{k-1},\bm{w}_{k},\bm{z}_{k} of K¯k\bar{K}_{k} are conjugate with respect to AA.

  4.

    For k=0k=0, define 𝒙¯0:=𝒙0α¯0𝒘0\bar{\bm{x}}_{0}:={\bm{x}}_{0}-\bar{\alpha}_{0}\bm{w}_{0}, and for k1k\geq 1, define 𝒙¯k:=𝒙kα¯k𝒘kβ¯k𝒛k\bar{\bm{x}}_{k}:={\bm{x}}_{k}-\bar{\alpha}_{k}\bm{w}_{k}-\bar{\beta}_{k}\bm{z}_{k}. Observe that 𝒙¯k𝒙k\bar{\bm{x}}_{k}-{\bm{x}}_{k} lies in K¯k\bar{K}_{k}. Then for all k0k\geq 0, 𝒙¯k\bar{\bm{x}}_{k} is the minimizer of ff over 𝒙k+K¯k{\bm{x}}_{k}+\bar{K}_{k}.

  5.

    For all k0k\geq 0, vector 𝒗k+1\bm{v}_{k+1} is computed (as usual) from 𝒗k\bm{v}_{k} according to (10).

To initiate this sequence of properties, let

𝒘0:=𝒗0𝒙0.\bm{w}_{0}:=\bm{v}_{0}-{\bm{x}}_{0}. (26)

This formula implies Property 1 holds. Properties 2–3 do not apply when k=0k=0. Assure Property 4 by solving a 1-dimensional minimization problem for α¯0\bar{\alpha}_{0}, that is,

α¯0:=argminαf(𝒙0α𝒘0)\bar{\alpha}_{0}:=\mathop{\rm argmin}_{\alpha}f({\bm{x}}_{0}-\alpha\bm{w}_{0}) (27)

Finally, Property 5 specifies how 𝒗1\bm{v}_{1} is computed.
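In the quadratic case the initialization (26)–(27) has a closed form. The following lines are an illustrative Python sketch under that assumption (the names init_restart, A, b, x0, v0 are ours, not from the released code):

# Sketch of the initialization (26)-(27) when f(x) = x'Ax/2 - b'x.
# A, b, x0, v0 are NumPy arrays supplied by the caller.
def init_restart(A, b, x0, v0):
    w0 = v0 - x0                                  # (26)
    g0 = A @ x0 - b                               # gradient of f at x0
    alpha_bar0 = (w0 @ g0) / (w0 @ (A @ w0))      # exact 1-D minimizer in (27)
    xbar0 = x0 - alpha_bar0 * w0                  # minimizer of f over x0 + span{w0}
    return w0, alpha_bar0, xbar0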

Next, for the induction step, assume all the above properties hold for kk. Recall that (10) is used to obtain 𝒗k+1\bm{v}_{k+1}. Observe from (10) that 𝒗k+1\bm{v}_{k+1} is written as a linear combination

𝒗k+1:=σ𝒗k+(1σ)𝒙¯kτf(𝒙¯k),\bm{v}_{k+1}:=\sigma\bm{v}_{k}+(1-\sigma)\bar{\bm{x}}_{k}-\tau\nabla f(\bar{\bm{x}}_{k}), (28)

where the scalars σ,τ\sigma,\tau may be read off from (10). Note: we omit the subscripts kk on σ\sigma, τ\tau and some other scalars in this discussion for clarity, since kk is fixed. In the upcoming formulas, if k=0k=0, take β¯0\bar{\beta}_{0}, 𝒑1\bm{p}_{-1}, and 𝒛0\bm{z}_{0} to be zero. Let us define

𝒘k+1\displaystyle\bm{w}_{k+1} :=(σ(1σ)α¯k)𝒘k(1σ)β¯k𝒛k+τα¯kA𝒘k+τβ¯kA𝒛k+δ1𝒑k1+δ2𝒑k,\displaystyle:=(\sigma-(1-\sigma)\bar{\alpha}_{k})\bm{w}_{k}-(1-\sigma)\bar{\beta}_{k}\bm{z}_{k}+\tau\bar{\alpha}_{k}A\bm{w}_{k}+\tau\bar{\beta}_{k}A\bm{z}_{k}+\delta_{1}\bm{p}_{k-1}+\delta_{2}\bm{p}_{k}, (29)
𝒛k+1\displaystyle\bm{z}_{k+1} :=α¯k𝒘kβ¯k𝒛k+α¯kA𝒘k/L+β¯kA𝒛k/L+ϵ1𝒑k1+ϵ2𝒑kμk+1𝒘k+1,\displaystyle:=-\bar{\alpha}_{k}\bm{w}_{k}-\bar{\beta}_{k}\bm{z}_{k}+\bar{\alpha}_{k}A\bm{w}_{k}/L+\bar{\beta}_{k}A\bm{z}_{k}/L+\epsilon_{1}\bm{p}_{k-1}+\epsilon_{2}\bm{p}_{k}-\mu_{k+1}\bm{w}_{k+1}, (30)

where δ1,δ2,ϵ1,ϵ2,μk+1\delta_{1},\delta_{2},\epsilon_{1},\epsilon_{2},\mu_{k+1} are yet to be determined. We now establish that these formulas satisfy the above properties.

Starting with Property 1, we have

𝒗k+1𝒙k+1\displaystyle\bm{v}_{k+1}-{\bm{x}}_{k+1} =𝒗k𝒙k+(1σ)(𝒙¯k𝒗k)τf(𝒙¯k)αk𝒑k\displaystyle=\bm{v}_{k}-{\bm{x}}_{k}+(1-\sigma)(\bar{\bm{x}}_{k}-\bm{v}_{k})-\tau\nabla f(\bar{\bm{x}}_{k})-\alpha_{k}\bm{p}_{k} (31)
=σ(𝒗k𝒙k)+(1σ)(𝒙¯k𝒙k)τ(A𝒙¯k𝒃)αk𝒑k\displaystyle=\sigma(\bm{v}_{k}-{\bm{x}}_{k})+(1-\sigma)(\bar{\bm{x}}_{k}-{\bm{x}}_{k})-\tau(A\bar{\bm{x}}_{k}-\bm{b})-\alpha_{k}\bm{p}_{k} (32)
=σ(𝒗k𝒙k)+(1σ)(𝒙¯k𝒙k)τ(A𝒙k𝒃)τA(𝒙¯k𝒙k)αk𝒑k.\displaystyle=\sigma(\bm{v}_{k}-{\bm{x}}_{k})+(1-\sigma)(\bar{\bm{x}}_{k}-{\bm{x}}_{k})-\tau(A{\bm{x}}_{k}-\bm{b})-\tau A(\bar{\bm{x}}_{k}-{\bm{x}}_{k})-\alpha_{k}\bm{p}_{k}. (33)

Here, (31) follows by substituting (28) for 𝒗k+1\bm{v}_{k+1} and 𝒙k+1:=𝒙k+αk𝒑k{\bm{x}}_{k+1}:={\bm{x}}_{k}+\alpha_{k}\bm{p}_{k} for 𝒙k{\bm{x}}_{k}; (32) follows from rearranging and from substituting the definition of ff; (33) adds and subtracts τA𝒙k\tau A{\bm{x}}_{k}.

Consider the terms of (33) separately. The first term σ(𝒗k𝒙k)\sigma(\bm{v}_{k}-{\bm{x}}_{k}) lies in K¯k\bar{K}_{k} with coefficients of σ,0\sigma,0 on 𝒘k,𝒛k\bm{w}_{k},\bm{z}_{k} respectively by Property 1 inductively. The second term (1σ)(𝒙¯k𝒙k)(1-\sigma)(\bar{\bm{x}}_{k}-{\bm{x}}_{k}) lies in K¯k\bar{K}_{k} with coefficients (1σ)α¯k,(1σ)β¯k-(1-\sigma)\bar{\alpha}_{k},-(1-\sigma)\bar{\beta}_{k} on 𝒘k,𝒛k\bm{w}_{k},\bm{z}_{k} respectively by Property 4 inductively. The third term τ(A𝒙k𝒃)-\tau(A{\bm{x}}_{k}-\bm{b}) lies in span{𝒑0,,𝒑k}\mathop{\rm span}\{\bm{p}_{0},\ldots,\bm{p}_{k}\} by the usual properties of CG. The fourth term τA(𝒙¯k𝒙k)-\tau A(\bar{\bm{x}}_{k}-{\bm{x}}_{k}) lies in Aspan{𝒑0,,𝒑k1,𝒘k,𝒛k}A\mathop{\rm span}\{\bm{p}_{0},\ldots,\bm{p}_{k-1},\bm{w}_{k},\bm{z}_{k}\}= span{𝒑1,,𝒑k,A𝒘k,A𝒛k}\mathop{\rm span}\{\bm{p}_{1},\ldots,\bm{p}_{k},A\bm{w}_{k},A\bm{z}_{k}\}. In this expansion, the coefficients of A𝒘kA\bm{w}_{k} and A𝒛kA\bm{z}_{k} are τα¯k,τβ¯k\tau\bar{\alpha}_{k},\tau\bar{\beta}_{k} respectively. Finally, the last term obviously lies in span{𝒑k}\mathop{\rm span}\{\bm{p}_{k}\}. Therefore, if we define 𝒘k+1\bm{w}_{k+1} according to (29), then Property 1 holds for k+1k+1.

Next we turn to Property 2. Observe that

𝒖k+1\displaystyle\bm{u}_{k+1} =𝒙¯kf(𝒙¯k)/L𝒙k+1\displaystyle=\bar{\bm{x}}_{k}-\nabla f(\bar{\bm{x}}_{k})/L-{\bm{x}}_{k+1} (34)
=(𝒙¯k𝒙k)αk𝒑k(A𝒙k𝒃)/LA(𝒙¯k𝒙k)/L.\displaystyle=(\bar{\bm{x}}_{k}-{\bm{x}}_{k})-\alpha_{k}\bm{p}_{k}-(A{\bm{x}}_{k}-\bm{b})/L-A(\bar{\bm{x}}_{k}-{\bm{x}}_{k})/L. (35)

Here, (34) follows from the definition (25) while (35) follows from the substitutions f(𝒙¯k)=A𝒙¯k𝒃\nabla f(\bar{\bm{x}}_{k})=A\bar{\bm{x}}_{k}-\bm{b} and 𝒙k+1=𝒙k+αk𝒑k{\bm{x}}_{k+1}={\bm{x}}_{k}+\alpha_{k}\bm{p}_{k} and adding and subtracting A𝒙k/LA{\bm{x}}_{k}/L.

Considering (35) term by term, the first term (𝒙¯k𝒙k)(\bar{\bm{x}}_{k}-{\bm{x}}_{k}) lies in K¯k\bar{K}_{k} with coefficients α¯k,β¯k-\bar{\alpha}_{k},-\bar{\beta}_{k} on 𝒘k,𝒛k\bm{w}_{k},\bm{z}_{k} inductively by Property 4. The second term lies in K¯k+1\bar{K}_{k+1}. The third term (A𝒙k𝒃)/L-(A{\bm{x}}_{k}-\bm{b})/L lies in span{𝒑0,,𝒑k}\mathop{\rm span}\{\bm{p}_{0},\ldots,\bm{p}_{k}\}. As in the preceding derivation, the fourth term A(𝒙¯k𝒙k)/L-A(\bar{\bm{x}}_{k}-{\bm{x}}_{k})/L lies in span{𝒑0,,𝒑k,A𝒘k,A𝒛k}\mathop{\rm span}\{\bm{p}_{0},\ldots,\bm{p}_{k},A\bm{w}_{k},A\bm{z}_{k}\} with coefficients α¯k/L,β¯k/L\bar{\alpha}_{k}/L,\bar{\beta}_{k}/L on A𝒘k,A𝒛kA\bm{w}_{k},A\bm{z}_{k} respectively. Again, we see now that (30) is chosen precisely so that the induction carries through on Property 2.

Now we turn to Property 3. By the induction hypothesis on Property 3, 𝒘k,𝒛k\bm{w}_{k},\bm{z}_{k} are conjugate to each other and to 𝒑0,,𝒑k1\bm{p}_{0},\ldots,\bm{p}_{k-1}. This implies A𝒘k,A𝒛kA\bm{w}_{k},A\bm{z}_{k} are conjugate to 𝒑0,𝒑k2\bm{p}_{0}\ldots,\bm{p}_{k-2} since for j=0,,k2j=0,\ldots,k-2, 𝒑jTA(A𝒘k)=(A𝒑j)TA𝒘k\bm{p}_{j}^{T}A(A\bm{w}_{k})=(A\bm{p}_{j})^{T}A\bm{w}_{k}, and A𝒑jspan{𝒑0,,𝒑k1}A\bm{p}_{j}\in\mathop{\rm span}\{\bm{p}_{0},\ldots,\bm{p}_{k-1}\}, and similarly for A𝒛kA\bm{z}_{k}. Thus, we see term-by-term that every term in the definitions of 𝒘k+1,𝒛k+1\bm{w}_{k+1},\bm{z}_{k+1} in (29)–(30) is already conjugate to 𝒑0,,𝒑k2\bm{p}_{0},\ldots,\bm{p}_{k-2}. Thus, to ensure 𝒘k+1\bm{w}_{k+1} is also conjugate to 𝒑k1\bm{p}_{k-1} and 𝒑k\bm{p}_{k}, solve a system of two linear equations for the scalars δ1,δ2\delta_{1},\delta_{2} that appear in (29). (In fact, by conjugacy, the system decouples into two univariate linear equations that are solved via single division operations.) Similarly, to assure 𝒛k+1\bm{z}_{k+1} is conjugate to 𝒑k1,𝒑k,𝒘k+1\bm{p}_{k-1},\bm{p}_{k},\bm{w}_{k+1}, we solve three linear equations for ϵ1,ϵ2,μk+1\epsilon_{1},\epsilon_{2},\mu_{k+1} from (30), which again decouple into three separate equations.
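The following Python fragment sketches (in the quadratic case, with names of our choosing) how these decoupled equations can be solved: w_raw and z_raw stand for the right-hand sides of (29)–(30) with the terms in δ1, δ2, ε1, ε2, μ_{k+1} omitted, and p_prev, p_curr stand for p_{k−1}, p_k.

# Enforce conjugacy of the new directions against p_{k-1}, p_k, and w_{k+1}.
# Each unknown is a single ratio of A-inner products; A and the vectors are
# NumPy arrays.
def conj_coeff(A, v, p):
    # Coefficient t such that (v + t*p) is A-conjugate to p.
    return -(v @ (A @ p)) / (p @ (A @ p))

def correct_directions(A, w_raw, z_raw, p_prev, p_curr):
    d1 = conj_coeff(A, w_raw, p_prev)
    d2 = conj_coeff(A, w_raw, p_curr)
    w_next = w_raw + d1 * p_prev + d2 * p_curr                 # completes (29)
    e1 = conj_coeff(A, z_raw, p_prev)
    e2 = conj_coeff(A, z_raw, p_curr)
    mu = (z_raw @ (A @ w_next)) / (w_next @ (A @ w_next))
    z_next = z_raw + e1 * p_prev + e2 * p_curr - mu * w_next   # completes (30)
    return w_next, z_next, mu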

Next, for Property 4, recall again from the theory of CG that minimizing f(𝒙)f({\bm{x}}) over 𝒙k+1+K¯k+1{\bm{x}}_{k+1}+\bar{K}_{k+1}, for which we have a conjugate basis of K¯k+1\bar{K}_{k+1}, reduces to uncoupled 1-dimensional minimizations, one per basis vector; indeed, this is the main purpose of conjugacy. However, 𝒙k+1{\bm{x}}_{k+1} is already minimal with respect to 𝒑0,,𝒑k\bm{p}_{0},\ldots,\bm{p}_{k} by the usual properties of CG, so it is necessary to minimize only with respect to 𝒘k+1,𝒛k+1\bm{w}_{k+1},\bm{z}_{k+1}. Therefore, to obtain 𝒙¯k+1\bar{\bm{x}}_{k+1}, one defines

𝒙¯k+1:=𝒙k+1α¯k+1𝒘k+1β¯k+1𝒛k+1,\bar{\bm{x}}_{k+1}:={\bm{x}}_{k+1}-\bar{\alpha}_{k+1}\bm{w}_{k+1}-\bar{\beta}_{k+1}\bm{z}_{k+1}, (36)

where

(α¯k+1,β¯k+1):=argminα,βf(𝒙k+1α𝒘k+1β𝒛k+1),(\bar{\alpha}_{k+1},\bar{\beta}_{k+1}):=\mathop{\rm argmin}_{\alpha,\beta}f({\bm{x}}_{k+1}-\alpha\bm{w}_{k+1}-\beta\bm{z}_{k+1}), (37)

a problem that decouples into two simple equations.
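Explicitly, since 𝒘k+1\bm{w}_{k+1} and 𝒛k+1\bm{z}_{k+1} are conjugate by Property 3, the two stationarity conditions of (37) separate; a short calculation of ours, valid for the quadratic ff, gives

\bar{\alpha}_{k+1}=\frac{\bm{w}_{k+1}^{T}\nabla f({\bm{x}}_{k+1})}{\bm{w}_{k+1}^{T}A\bm{w}_{k+1}},\qquad\bar{\beta}_{k+1}=\frac{\bm{z}_{k+1}^{T}\nabla f({\bm{x}}_{k+1})}{\bm{z}_{k+1}^{T}A\bm{z}_{k+1}}.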

Thus, we have established the following result:

Theorem 3

Assume the sequences α¯k\bar{\alpha}_{k}’s, β¯k\bar{\beta}_{k}’s, μk\mu_{k}’s, 𝐯k\bm{v}_{k}’s, 𝐰k\bm{w}_{k}’s, 𝐳k\bm{z}_{k}’s, and 𝐱¯k\bar{\bm{x}}_{k}’s are defined by (10) and (26)–(37). Then Properties 1–5 above hold.

Now, finally, we can propose a progress measure assured to hold for quadratic functions, namely, f(𝒙¯k)ϕkf(\bar{\bm{x}}_{k})\leq\phi^{*}_{k}. The proof that this holds is a straightforward rewriting of Theorem 2.

Theorem 4

If f(𝐱)=12𝐱TA𝐱𝐛T𝐱f({\bm{x}})=\frac{1}{2}{\bm{x}}^{T}A{\bm{x}}-\bm{b}^{T}{\bm{x}}, where AA is positive definite, then f(𝐱¯k)ϕkf(\bar{\bm{x}}_{k})\leq\phi_{k}^{*} for every iteration kk, where 𝐱k{\bm{x}}_{k} is the sequence of conjugate gradient iterates and 𝐱¯k\bar{\bm{x}}_{k} is defined by (36).

Proof. The result clearly holds when k=0k=0 by (7) and the fact that f(𝒙¯0)f(𝒙0)f(\bar{\bm{x}}_{0})\leq f({\bm{x}}_{0}) by the minimality of 𝒙¯0\bar{\bm{x}}_{0} in the line 𝒙0+span{𝒘0}{\bm{x}}_{0}+\mathop{\rm span}\{\bm{w}_{0}\}. Assuming f(𝒙¯k)ϕkf(\bar{\bm{x}}_{k})\leq\phi_{k}^{*}, we now show that f(𝒙¯k+1)ϕk+1f(\bar{\bm{x}}_{k+1})\leq\phi_{k+1}^{*}. For k0k\geq 0,

ϕk+1\displaystyle\phi_{k+1}^{*} =(1θk)ϕk+θkf(𝒙¯k)θk22γk+1f(𝒙¯k)2\displaystyle=(1-\theta_{k})\phi_{k}^{*}+\theta_{k}f(\bar{\bm{x}}_{k})-\frac{\theta_{k}^{2}}{2\gamma_{k+1}}\|\nabla f(\bar{\bm{x}}_{k})\|^{2}
+θk(1θk)γkγk+1(𝒙¯k𝒗k2+f(𝒙¯k)T(𝒗k𝒙¯k))\displaystyle\hphantom{}\quad\mbox{}+\frac{\theta_{k}(1-\theta_{k})\gamma_{k}}{\gamma_{k+1}}\left(\ell\|\bar{\bm{x}}_{k}-\bm{v}_{k}\|^{2}+\nabla f(\bar{\bm{x}}_{k})^{T}(\bm{v}_{k}-\bar{\bm{x}}_{k})\right) (38)
f(𝒙¯k)θk22γk+1f(𝒙¯k)2+θk(1θk)γkγk+1(f(𝒙¯k)T(𝒗k𝒙¯k))\displaystyle\geq f(\bar{\bm{x}}_{k})-\frac{\theta_{k}^{2}}{2\gamma_{k+1}}\|\nabla f(\bar{\bm{x}}_{k})\|^{2}+\frac{\theta_{k}(1-\theta_{k})\gamma_{k}}{\gamma_{k+1}}\left(\nabla f(\bar{\bm{x}}_{k})^{T}(\bm{v}_{k}-\bar{\bm{x}}_{k})\right) (39)
=f(𝒙¯k)12Lf(𝒙¯k)2+θk(1θk)γkγk+1(f(𝒙¯k)T(𝒗k𝒙¯k))\displaystyle=f(\bar{\bm{x}}_{k})-\frac{1}{2L}\|\nabla f(\bar{\bm{x}}_{k})\|^{2}+\frac{\theta_{k}(1-\theta_{k})\gamma_{k}}{\gamma_{k+1}}\left(\nabla f(\bar{\bm{x}}_{k})^{T}(\bm{v}_{k}-\bar{\bm{x}}_{k})\right) (40)
=f(𝒙¯k)12Lf(𝒙¯k)2\displaystyle=f(\bar{\bm{x}}_{k})-\frac{1}{2L}\|\nabla f(\bar{\bm{x}}_{k})\|^{2} (41)
f(𝒙¯kf(𝒙¯k)/L)\displaystyle\geq f(\bar{\bm{x}}_{k}-\nabla f(\bar{\bm{x}}_{k})/L) (42)
f(𝒙¯k+1).\displaystyle\geq f(\bar{\bm{x}}_{k+1}). (43)

Here, (38) is a restatement of (11). Line (39) follows from the induction hypothesis f(𝒙¯k)ϕkf(\bar{\bm{x}}_{k})\leq\phi_{k}^{*} together with dropping the nonnegative term involving 𝒙¯k𝒗k2\ell\|\bar{\bm{x}}_{k}-\bm{v}_{k}\|^{2}. Line (40) follows from (13).

Line (41) follows because, according to the recursive formula (10), 𝒗k\bm{v}_{k} lies in the affine space 𝒙k+K¯k{\bm{x}}_{k}+\bar{K}_{k}. Also, 𝒙¯k\bar{\bm{x}}_{k} lies in this affine space and is optimal for ff over this space. Therefore, f(𝒙¯k)\nabla f(\bar{\bm{x}}_{k}) is orthogonal to every vector in K¯k\bar{K}_{k} and in particular to 𝒗k𝒙¯k\bm{v}_{k}-\bar{\bm{x}}_{k}.

Line (42) is the standard inequality for LL-smooth convex functions. Finally, (43) follows because 𝒙¯kf(𝒙¯k)/L=𝒙k+1+𝒖k+1\bar{\bm{x}}_{k}-\nabla f(\bar{\bm{x}}_{k})/L={\bm{x}}_{k+1}+\bm{u}_{k+1} by (25) and 𝒖k+1K¯k+1\bm{u}_{k+1}\in\bar{K}_{k+1} by Property 2. On the other hand, 𝒙¯k+1\bar{\bm{x}}_{k+1} minimizes ff over 𝒙k+1+K¯k+1{\bm{x}}_{k+1}+\bar{K}_{k+1}. \square

This concludes the discussion of how to compute 𝒙¯k+1\bar{\bm{x}}_{k+1} in the case that ff is quadratic. In the general case, we do not have a matrix AA but only ff and f\nabla f. Therefore, we propose a method to evaluate the quantities in the preceding paragraphs in the general case. Assume inductively that we have 𝒘k,𝒛k,𝒘^k,𝒛^k\bm{w}_{k},\bm{z}_{k},\hat{\bm{w}}_{k},\hat{\bm{z}}_{k}, where 𝒘^k\hat{\bm{w}}_{k} corresponds to A𝒘kA\bm{w}_{k} in the above equations and similarly 𝒛^k\hat{\bm{z}}_{k} corresponds to A𝒛kA\bm{z}_{k}. Then all the formulas above may be evaluated using these vectors. Note that we already have vectors corresponding to A𝒑k1A\bm{p}_{k-1} and A𝒑kA\bm{p}_{k}; these are found in (4)–(6). We need these vectors to solve for the scalars δ1,δ2,ϵ1,ϵ2\delta_{1},\delta_{2},\epsilon_{1},\epsilon_{2} assuring conjugacy.

To continue the induction, it is necessary to evaluate 𝒘^k+1,𝒛^k+1\hat{\bm{w}}_{k+1},\hat{\bm{z}}_{k+1} corresponding to A𝒘k+1,A𝒛k+1A\bm{w}_{k+1},A\bm{z}_{k+1}. Multiplying both sides of (29) and (30) by the fictitious AA shows that we need to obtain the analog of the product A(α¯kA𝒘k+β¯kA𝒛k)A(\bar{\alpha}_{k}A\bm{w}_{k}+\bar{\beta}_{k}A\bm{z}_{k}). Note that this combination of A𝒘k,A𝒛kA\bm{w}_{k},A\bm{z}_{k} appears in both right-hand sides of (29) and (30). This may be found by evaluating f(𝒙k+1+α¯k𝒘^k+β¯k𝒛^k)\nabla f({\bm{x}}_{k+1}+\bar{\alpha}_{k}\hat{\bm{w}}_{k}+\bar{\beta}_{k}\hat{\bm{z}}_{k}) and then subtracting f(𝒙k+1)\nabla f({\bm{x}}_{k+1}) (which is already evaluated).
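A minimal sketch of this gradient-difference computation (Python; the names hat_product, grad_f, what_k, zhat_k, abar_k, bbar_k are ours; the returned vector equals the desired product exactly when ff is quadratic and only approximately otherwise):

# what_k and zhat_k play the roles of A*w_k and A*z_k; grad_f is the gradient
# oracle; grad_at_x_next = grad_f(x_next) is assumed already computed.
def hat_product(grad_f, x_next, grad_at_x_next, what_k, zhat_k, abar_k, bbar_k):
    d = abar_k * what_k + bbar_k * zhat_k
    return grad_f(x_next + d) - grad_at_x_next    # equals A d for quadratic f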

We also need to evaluate f(𝒙¯k+1)f(\bar{\bm{x}}_{k+1}) and f(𝒙¯k+1)\nabla f(\bar{\bm{x}}_{k+1}) in order to obtain (25) and check the progress measure f(𝒙¯k+1)ϕk+1f(\bar{\bm{x}}_{k+1})\leq\phi_{k+1}^{*}. Thus, the cost of carrying out the method of this section is two extra function-gradient evaluations per iteration. As mentioned earlier, our computational experiments indicate that this 100% overhead is not compensated for by any reduction in iteration count in any test case, and therefore the method described in this section is disabled by default.

Appendix B Toward nn-step quadratic convergence

In previous sections, we showed the proposed C+AG method has the same complexity bound up to a constant factor as accelerated gradient. In this section we suggest that it has an additional complexity bound known to apply to nonlinear conjugate gradient, namely, nn-step quadratic convergence. This property is mainly of theoretical interest since nn-step quadratic convergence is not usually observed in practice (see, e.g., [21, p. 124]) for larger problems. Therefore, we will provide only a sketch of a possible proof and leave the proof, which is likely to be complicated, to future research.

We propose to follow the proof given by McCormick and Ritter [17] rather than the original proof by Cohen [5]. They argue that classical NCG is nn-step quadratically convergent provided that the line-search is sufficiently accurate, that the objective function is strongly convex in a neighborhood of the minimizer, and that its second derivative is Lipschitz continuous. Finally, they assume that nonlinear CG is restarted every nn iterations.

In order to apply this result to C+AG, we first need to argue that the sufficient-progress test will not prevent CG iterations. We showed in the preceding appendix that for quadratic functions, the sufficient-progress test always holds. A strongly convex function with a Lipschitz second derivative is approximated arbitrarily closely by a quadratic function in a sufficiently small neighborhood of the root. In particular, if we assume that LL is strictly larger than the largest eigenvalue of the Hessian in a neighborhood of the root, then inequality (42) is strictly satisfied, and small deviations from a quadratic function are dominated by the slack in (42).

Another issue with [17] is that their version of NCG implements a step-limiter operation that is not present in our algorithm. It is not clear whether the step-limiter is a requirement for nn-step quadratic convergence or merely a device to simplify their analysis. Furthermore, [17] assume the Polak-Ribière choice of βk\beta_{k} [21] rather than the Hager-Zhang step that we use. Note that [17] predates [12] by several decades. We conjecture that the analysis of [17] extends to [12].

The line-search accuracy required by [17] is as follows. On each iteration kk, |αkαk|O(𝒑k)|\alpha_{k}-\alpha_{k}^{*}|\leq O(\|\bm{p}_{k}\|), where αk\alpha_{k} is chosen by C+AG in (4)–(6) and αk=argmin{f(𝒙k+α𝒑k):α}\alpha_{k}^{*}=\mathop{\rm argmin}\{f({\bm{x}}_{k}+\alpha\bm{p}_{k}):\alpha\in\mathbb{R}\}. We sketch an argument that could prove |αkαk|O(𝒑k)|\alpha_{k}-\alpha_{k}^{*}|\leq O(\|\bm{p}_{k}\|) as follows.

Let ψ(α):=f(𝒙k+α𝒑k)\psi(\alpha):=f({\bm{x}}_{k}+\alpha\bm{p}_{k}). It follows from [17] that 𝒑k=Θ(𝒙k𝒙)\|\bm{p}_{k}\|=\Theta(\|{\bm{x}}_{k}-{\bm{x}}^{*}\|). In other words, there is both an upper and lower bound on

𝒑k𝒙k𝒙\frac{\|\bm{p}_{k}\|}{\|{\bm{x}}_{k}-{\bm{x}}^{*}\|}

independent of kk. It also follows from their results that ψ(0)<0\psi^{\prime}(0)<0. Write ψ′′(α)=s+ϵ(α)\psi^{\prime\prime}(\alpha)=s+\epsilon(\alpha), where s>0s>0 is the estimate of ψ′′\psi^{\prime\prime} obtained from (4)–(6), that is s=L(ψ(1/L)ψ(0))s=L(\psi^{\prime}(1/L)-\psi^{\prime}(0)). It follows by strong convexity and the equation 𝒑k=Θ(𝒙k𝒙)\|\bm{p}_{k}\|=\Theta(\|{\bm{x}}_{k}-{\bm{x}}^{*}\|) that αkO(1)\alpha_{k}\leq O(1) and αkO(1)\alpha_{k}^{*}\leq O(1). Again by these same assumptions, we can also derive that s=Θ(𝒑k2)s=\Theta(\|\bm{p}_{k}\|^{2}). By the Lipschitz assumption, ϵ(α)O(𝒑k3)\epsilon(\alpha)\leq O(\|\bm{p}_{k}\|^{3}) for αO(1)\alpha\leq O(1).

It also follows from the assumption of strong convexity and proximity to the root that |ψ(0)|O(𝒑k2)|\psi^{\prime}(0)|\leq O(\|\bm{p}_{k}\|^{2}). The true minimizer αk\alpha_{k}^{*} is the root of ψ\psi^{\prime} and therefore satisfies the equation

ψ(0)+0αk(s+ϵ(α))𝑑α=0,\psi^{\prime}(0)+\int_{0}^{\alpha_{k}^{*}}(s+\epsilon(\alpha))\,d\alpha=0,

i.e.,

ψ(0)+sαk+0αkϵ(α)𝑑α=0.\psi^{\prime}(0)+s\alpha_{k}^{*}+\int_{0}^{\alpha_{k}^{*}}\epsilon(\alpha)\,d\alpha=0.

The computed minimizer is αk=ψ(0)/s\alpha_{k}=-\psi^{\prime}(0)/s. The integral in the previous line is bounded by O(𝒑k3)O(\|\bm{p}_{k}\|^{3}), whereas ψ(0)-\psi^{\prime}(0) is O(𝒑k2)O(\|\bm{p}_{k}\|^{2}), and s=Θ(𝒑k2)s=\Theta(\|\bm{p}_{k}\|^{2}). Therefore, the intermediate value theorem applied to the above univariate equation for αk\alpha_{k}^{*} tells us that αk=ψ(0)/s+O(𝒑k)=αk+O(𝒑k)\alpha_{k}^{*}=-\psi^{\prime}(0)/s+O(\|\bm{p}_{k}\|)=\alpha_{k}+O(\|\bm{p}_{k}\|). This concludes the explanation of accuracy of the line-search.
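For concreteness, the secant-style step discussed above can be written as the following Python sketch (our illustration only, not the full line search of the C+AG code; grad_f, x_k, p_k, and L are assumed given):

# psi(alpha) = f(x_k + alpha*p_k); estimate s of psi'' from two gradient
# evaluations and return alpha_k = -psi'(0)/s, as in the argument above.
def secant_step(grad_f, x_k, p_k, L):
    dpsi0 = grad_f(x_k) @ p_k               # psi'(0)
    dpsi1 = grad_f(x_k + p_k / L) @ p_k     # psi'(1/L)
    s = L * (dpsi1 - dpsi0)                 # estimate of psi''
    return -dpsi0 / s                       # computed minimizer alpha_k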

References

  • [1] Amir Beck. First-Order Methods in Optimization. SIAM, 2017.
  • [2] S. Bubeck, Y. T. Lee, and Mohit Singh. A geometric alternative to Nesterov’s accelerated gradient descent. http://arxiv.org/abs/1506.08187, 2015.
  • [3] C. K. Buhler, H. Benson, and D. Shanno. Regularized step directions in conjugate gradient minimization for machine learning. arXiv preprint arXiv:2110.06308, 2021.
  • [4] Yair Carmon, John C. Duchi, Oliver Hinder, and Aaron Sidford. “Convex until proven guilty”: dimension-free acceleration of gradient descent on non-convex functions. In Proceedings of 2017 International Conference on Machine Learning, 2017.
  • [5] A. Cohen. Rate of convergence of several conjugate gradient algorithms. SIAM Journal on Numerical Analysis, 9(2):248–259, 1972.
  • [6] Y. Dai and Y. Yuan. Convergence properties of Beale-Powell restart algorithms. Science in China, Ser. A, 41:1142–1150, 1998.
  • [7] Y. Dai and Y. Yuan. An efficient hybrid conjugate gradient method for unconstrained optimization. Annals of Operations Research, 103:33–47, 2001.
  • [8] Yu-Hong Dai and Cai-Xia Kou. A nonlinear conjugate gradient algorithm with an optimal property and an improved Wolfe line search. SIAM J. Optimization, 23:296–320, 2013.
  • [9] James W. Daniel. The conjugate gradient method for linear and nonlinear operator equations. SIAM Journal on Numerical Analysis, 4(1):10–26, 1967.
  • [10] G. H. Golub and C. F. Van Loan. Matrix Computations, 3rd Edition. Johns Hopkins University Press, Baltimore, 1996.
  • [11] N. Gould, S. Lucidi, M. Roma, and P. Toint. Solving the trust-region subproblem using the Lanczos method. SIAM Journal on Optimization, 9(2):504–525, 1999.
  • [12] W. Hager and H. Zhang. A new conjugate gradient method with guaranteed descent and an efficient line search. SIAM J. Optimization, 16:170–192, 2005.
  • [13] Magnus Rudolph Hestenes and Eduard Stiefel. Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards, 49:409–436, 1952.
  • [14] P. J. Huber. Robust regression: asymptotics, conjectures, and Monte Carlo. Annals of Statistics, 1(5):799–821, 1973.
  • [15] Sahar Karimi. On the relationship between conjugate gradient and optimal first-order methods for convex optimization. PhD thesis, University of Waterloo, 2014.
  • [16] Sahar Karimi and Stephen Vavasis. A single potential governing convergence of conjugate gradient, accelerated gradient and geometric descent. https://arxiv.org/abs/1712.09498, 2017.
  • [17] G. P. McCormick and K. Ritter. Alternative proofs of the convergence properties of the conjugate-gradient method. J. Optimization Theory and Applications, 13(5):497–518, 1974.
  • [18] A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. John Wiley and Sons, Chichester, 1983. Translated by E. R. Dawson from Slozhnost’ Zadach i Effektivnost’ Metodov Optimizatsii, 1979, Glavnaya redaktsiya fiziko-matematicheskoi literatury, Izdatelstva “Nauka”.
  • [19] Y. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k2)O(1/k^{2}). Doklady AN SSSR (translated as Soviet Math. Dokl.), 269(3):543–547, 1983.
  • [20] Y. Nesterov. Introductory Lectures on Convex Optimization. Kluwer, 2003.
  • [21] J. Nocedal and S. Wright. Numerical Optimization, 2nd Edition. Springer, New York, 2006.
  • [22] B. O’Donoghue and E. Candès. Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics, 15(3):715–732, 2015.
  • [23] C. Paquette and S. Vavasis. Potential-based analyses of first-order methods for constrained and composite optimization. http://arxiv.org/abs/1903.08497, 2019.
  • [24] M. J. D. Powell. Restart procedures for the conjugate gradient method. Mathematical Programming, 12:241–254, 1977.
  • [25] J. Zhang, B. O’Donoghue, and S. Boyd. Globally convergent type-I Anderson acceleration for nonsmooth fixed-point iterations. SIAM Journal on Optimization, 30(4):3170–3197, 2020.