
Brian Irwin · Eldad Haber
Department of Earth, Ocean and Atmospheric Sciences, The University of British Columbia, Vancouver, BC, Canada
E-mail: {birwin, haber}@eoas.ubc.ca

Secant Penalized BFGS: A Noise Robust Quasi-Newton Method Via Penalizing The Secant Condition

Brian Irwin (corresponding author) · Eldad Haber
(Received: date / Accepted: date)
Abstract

In this paper, we introduce a new variant of the BFGS method designed to perform well when gradient measurements are corrupted by noise. We show that treating the secant condition with a penalty method approach motivated by regularized least squares estimation allows one to smoothly interpolate between updating the inverse Hessian approximation with the original BFGS update formula and not updating the inverse Hessian approximation at all. Furthermore, we find that the curvature condition is smoothly relaxed as the interpolation moves towards not updating the inverse Hessian approximation, disappearing entirely when the inverse Hessian approximation is not updated. These developments lead to a method we refer to as secant penalized BFGS (SP-BFGS), which relaxes the secant condition based on the amount of noise in the gradient measurements. SP-BFGS provides a means of incrementally updating the new inverse Hessian approximation with a controlled amount of bias towards the previous inverse Hessian approximation, replacing the overwriting nature of the original BFGS update with an averaging nature that resists the destructive effects of noise and can cope with negative curvature measurements. We discuss the theoretical properties of SP-BFGS, including convergence when minimizing strongly convex functions in the presence of uniformly bounded noise. Finally, we present extensive numerical experiments using over 30 problems from the CUTEst test problem set that demonstrate the superior performance of SP-BFGS compared to BFGS in the presence of both noisy function and gradient evaluations.

Keywords:
Quasi-Newton Methods · Secant Condition · Penalty Methods · Least Squares Estimation · Measurement Error · Noise Robust Optimization

1 Introduction

Over the past 50 years, quasi-Newton methods have proved to be some of the most economical and effective methods for a variety of optimization problems. Originally conceived to provide some of the advantages of second order methods without the full cost of Newton’s method, quasi-Newton methods, which are also referred to as variable metric methods Johnson2019_quasiNewton_notes , are based on the observation that by differencing observed gradients, one can calculate approximate curvature information. This approximate curvature information can then be used to improve the speed of convergence, especially in comparison to first order methods, such as gradient descent. There are currently a variety of different quasi-Newton methods, with the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method 10.1093/imamat/6.1.76 ; 10.1093/comjnl/13.3.317 ; 10.2307/2004873 ; 10.2307/2004840 almost certainly being the best known quasi-Newton method.

Modern quasi-Newton methods were developed for problems involving the optimization of smooth functions without constraints. The BFGS method is the best known quasi-Newton method because in practice it has demonstrated superior performance due to its very effective self-correcting properties Nocedal2006 . Accordingly, BFGS has since been extended to handle box constraints doi:10.1137/0916069 , and shown to be effective even for some nonsmooth optimization problems Lewis2013NonsmoothOV . Furthermore, a limited memory version of BFGS known as L-BFGS Liu1989 has become a favourite algorithm for solving optimization problems with a very large number of variables, as it avoids directly storing approximate inverse Hessian matrices. However, BFGS and its relatives were not designed to explicitly handle noisy optimization problems, and noise can unacceptably degrade the performance of these methods.

The authors of doi:10.1137/140954362 make the important observation that quasi-Newton updating is inherently an overwriting process rather than an averaging process. Fundamentally, differencing noisy gradients can produce harmful effects because the resulting approximate curvature information may be inaccurate, and this inaccurate curvature information may overwrite accurate curvature information. Newton's method can naturally be viewed as a local rescaling of coordinates so that the rescaled problem is better conditioned than the original problem. Quasi-Newton methods attempt to perform a similar rescaling, but instead of using the (inverse) Hessian matrix to obtain curvature information for the rescaling, they use differences of gradients to obtain curvature information. Thus, it should be unsurprising that inaccurate curvature information obtained from differencing noisy gradients can be problematic because it means the resulting rescaling of the problem can be poor, and the conditioning of the rescaled problem could be even worse than the conditioning of the original problem.

With the above in mind, several works have dealt with how to improve the performance of quasi-Newton methods in the presence of noise. Many recent works focus on the empirical risk minimization (ERM) problem, which is ubiquitous in machine learning. For example, in doi:10.1137/140954362 the authors propose a technique designed for the stochastic approximation (SA) regime that employs subsampled Hessian-vector products to collect curvature information pointwise and at spaced intervals, in contrast to the classical approach of computing the difference of gradients at each iteration. This work is built upon in pmlr-v51-moritz16 , where the authors present a stochastic L-BFGS algorithm that draws upon the variance reduction approach of Johnson2013AcceleratingSG . In doi:10.1137/15M1053141 , the authors outline a stochastic damped limited-memory BFGS (SdLBFGS) method that employs damping techniques used in sequential quadratic programming (SQP). A stochastic block BFGS method that updates the approximate inverse Hessian matrix using a sketch of the Hessian matrix is proposed in pmlr-v48-gower16 . Further work on stochastic L-BFGS algorithms, including convergence results, can be found in pmlr-v2-schraudolph07a ; 10.5555/2789272.2912100 ; Zhao2018StochasticLI ; 8626766 .

Despite the importance of the ERM problem due to the current prevalence of machine learning, there are still a variety of important noisy optimization problems that arise in other contexts. In engineering design, numerical simulations are often employed in place of conducting costly, if even feasible, physical experiments. In this context, one tries to find optimal design parameters using the numerical simulation instead of physical experiments. Some examples from aerospace engineering, including interplanetary trajectory and wing design, can be found in Fasano2019 ; doi:10.1002/0470855487 ; doi:10.2514/1.J057294 . Examples from materials engineering include stable composite design doi:10.1177/1464420716664921 and ternary alloy composition Graf2017 , amongst others Munoz-Rojas2016 , while examples from electrical engineering include power system operation doi:10.1002/9780470466971 , hardware verification Gal_2020_HowToCatchALion , and antenna design Koziel2014 . Noise is often an unavoidable property of such numerical simulations, as the simulations can include stochastic internal components, and floating point arithmetic vulnerable to roundoff error. Apart from the analysis of the BFGS method with bounded errors in doi:10.1137/19M1240794 , there is relatively little work on the behaviour of quasi-Newton methods in the presence of general bounded noise. As optimizing noisy numerical simulations does not always fit the framework of the ERM problem, analyses of the behaviour of quasi-Newton methods in the presence of general bounded noise are of practical value when optimizing numerical simulations.

1.1 Contributions

Noise is inevitably introduced into machine learning problems by the approximations required to handle large datasets, and into numerical simulations by the effects of finite precision arithmetic and by simulator components that are inherently stochastic. In this paper, we return to the fundamental theory underlying the design of quasi-Newton methods, which allows us to design a new variant of the BFGS method that explicitly handles the corrupting effects of noise. We do this as follows:

  1. In Section 2, we review the setup and derivation of the original BFGS method.

  2. In Section 3, motivated by regularized least squares estimation, we treat the secant condition of BFGS with a penalty method. This creates a new BFGS update formula that we refer to as secant penalized BFGS (SP-BFGS), which we show reduces to the original BFGS update formula in a limiting case, as expected.

  3. In Section 4, we present an algorithmic framework for practically implementing SP-BFGS updating. We also discuss implementation details, including how to perform a line search and choose the penalty parameter in the presence of noise.

  4. In Section 5, we discuss the theoretical properties of SP-BFGS, including how the penalty parameter influences the eigenvalues of the approximate inverse Hessian. This allows us to show that under appropriate conditions SP-BFGS iterations are guaranteed to converge linearly to a neighborhood of the global minimizer when minimizing strongly convex functions in the presence of uniformly bounded noise.

  5. In Section 6, we study the empirical performance of SP-BFGS updating compared to BFGS updating by performing extensive numerical experiments with both convex and nonconvex objective functions corrupted by function and gradient noise. Results from a diverse set of over 30 problems from the CUTEst test problem set demonstrate that intelligently implemented SP-BFGS updating frequently outperforms BFGS updating in the presence of noise.

  6. Finally, Section 7 concludes the paper and outlines directions for further work.

2 Mathematical Background

In this section, as preliminaries to the main results of this paper, we review the setup and derivation of the original BFGS method.

2.1 BFGS Setup

The BFGS method was originally designed to solve the following unconstrained optimization problem

\min_{x} \big\{ \phi(x) \big\}   (1)

with $x \in \mathbb{R}^{n}$, $\phi: \mathbb{R}^{n} \mapsto \mathbb{R}$, and $\phi$ being a smooth twice continuously differentiable and nonnoisy function. Below, we use the notational conventions of Nocedal2006 , including $\phi_{k} = \phi(x_{k})$. We begin by using the Taylor expansion of $\phi$ to build a local quadratic model $m_{k}$ of the objective function $\phi$ at the $k^{th}$ iterate $x_{k}$ of the optimization procedure

\phi(x_{k} + p) \approx \phi_{k} + \nabla\phi_{k}^{T} p + \frac{1}{2} p^{T} B_{k} p = m_{k}(p)   (2)

where $B_{k}$ is an $n \times n$ symmetric positive definite matrix that approximates the Hessian matrix (i.e. $B_{k} \approx \nabla^{2}\phi_{k}$). By setting the gradient of $m_{k}$ to zero, we see that the unique minimizer $p_{k}$ of this local quadratic model is

p_{k} = -B_{k}^{-1} \nabla\phi_{k}   (3)

and thus it is natural to update the next iterate $x_{k+1}$ as

x_{k+1} = x_{k} + \alpha_{k} p_{k}   (4)

where $\alpha_{k}$ is the step size along the direction $p_{k}$, which is often chosen using a line search.

To avoid computing $B_{k}$ from scratch at each iteration $k$, we use the curvature information from recent gradient evaluations to update $B_{k}$, and thus relatively economically form $B_{k+1}$. A Taylor expansion of $\nabla\phi$ reveals

\nabla\phi(x_{k} + p) \approx \nabla\phi_{k} + \nabla^{2}\phi_{k} p   (5)

and so it is reasonable to require that the new approximate Hessian $B_{k+1}$ satisfies

\nabla\phi_{k+1} = \nabla\phi_{k} + \alpha_{k} B_{k+1} p_{k}   (6)

which rearranges to

B_{k+1} \alpha_{k} p_{k} = \nabla\phi_{k+1} - \nabla\phi_{k} .   (7)

Now, define the two new quantities $s_{k}$ and $y_{k}$ as

s_{k} \coloneqq x_{k+1} - x_{k} = \alpha_{k} p_{k} ,   (8a)
y_{k} \coloneqq \nabla\phi_{k+1} - \nabla\phi_{k} .   (8b)

Thus, we arrive at (9), which is known as the secant condition

B_{k+1} s_{k} = y_{k} .   (9)

In words, the secant condition dictates that the new approximate Hessian $B_{k+1}$ must map the measured displacement $s_{k}$ into the measured difference of gradients $y_{k}$. If we denote the approximate inverse Hessian $H_{k} = B_{k}^{-1} \approx \nabla^{2}\phi_{k}^{-1}$, then the secant condition can be equivalently expressed as (10)

H_{k+1} y_{k} = s_{k} .   (10)

As $H_{k+1}$ is not yet uniquely determined, to obtain the BFGS update formula, we impose a minimum norm restriction. Specifically, we choose $H_{k+1}$ to be the solution of the following quadratic program over matrices

\min_{H} \bigg\{ \frac{1}{2} \left\| W^{1/2} (H - H_{k}) W^{1/2} \right\|_{F}^{2} \bigg\} \quad \text{s.t.} \quad H = H^{T}, \quad H y_{k} = s_{k}   (11)

where $\|\cdot\|_{F}$ denotes the Frobenius norm, and $W^{1/2}$ the principal square root (see MatrixAnalysisHornJohnson or a similar reference) of a symmetric positive definite weight matrix $W$ satisfying

W s_{k} = y_{k} .   (12)

As we will see, choosing the weight matrix $W$ to satisfy (12) ensures that the resulting optimization method is scale invariant. The weight matrix $W$ can be chosen to be any symmetric positive definite matrix satisfying (12), and the specific choice of $W$ is not of great importance, as $W$ will not appear directly in the main results of this paper. However, as a concrete example from Nocedal2006 , one could assume $W = \bar{G}_{k}$, where $\bar{G}_{k}$ is the average Hessian defined by

\bar{G}_{k} = \int_{0}^{1} \nabla^{2}\phi(x_{k} + t \alpha_{k} p_{k}) \, dt .   (13)
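
As a simple illustration of the secant condition, consider the special case of a quadratic objective: if $\phi(x) = c + b^{T}x + \frac{1}{2}x^{T}Ax$ with $A$ constant and symmetric, then $\nabla\phi(x) = b + Ax$, so

y_{k} = \nabla\phi_{k+1} - \nabla\phi_{k} = A(x_{k+1} - x_{k}) = A s_{k}

and the secant condition (9) is satisfied exactly by $B_{k+1} = A$. For general smooth $\phi$, (9) instead asks $B_{k+1}$ to reproduce the average curvature of $\phi$ along $s_{k}$, in the sense of (12) and (13).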

2.2 Solving For The BFGS Update

To solve the quadratic program given by (11), we set up a Lagrangian $\mathcal{L}(H, q, \Gamma)$ involving the constraints. Recalling that

\left\| W^{1/2} (H - H_{k}) W^{1/2} \right\|^{2}_{F} = \operatorname{Tr}\big( W (H - H_{k}) W (H - H_{k})^{T} \big) ,   (14)

this gives the Lagrangian defined by (15) below

\mathcal{L} = \frac{1}{2} \operatorname{Tr}\big( W (H - H_{k}) W (H - H_{k})^{T} \big) + \operatorname{Tr}\big( (H y_{k} - s_{k}) q^{T} \big) + \operatorname{Tr}\big( \Gamma (H - H^{T}) \big)   (15)

where $q$ is a vector of Lagrange multipliers associated with the secant condition, and $\Gamma$ is a matrix of Lagrange multipliers associated with the symmetry condition. Taking the derivative of the Lagrangian $\mathcal{L}(H, q, \Gamma)$ with respect to the matrix $H$ yields

\frac{\partial \mathcal{L}(H, q, \Gamma)}{\partial H} = W (H - H_{k}) W + q y_{k}^{T} + \Gamma^{T} - \Gamma   (16)

and so we have the Karush-Kuhn-Tucker (KKT) system defined by the three equations (17a), (17b), and (17c) below

W (H - H_{k}) W + q y_{k}^{T} + \Gamma^{T} - \Gamma = 0 ,   (17a)
H y_{k} - s_{k} = 0 ,   (17b)
H - H^{T} = 0 .   (17c)

For brevity, we omit the details of the solution of the KKT system defined above because it is a limiting case of the system solved in Theorem 3.1. For an alternative geometric solution technique, we refer the interested reader to Section 2 of doi:10.1080/10556780802367205 . The minimizer $H^{*} = H_{k+1}$ is given by the well known BFGS update formula

H_{k+1} = \bigg( I - \frac{s_{k} y_{k}^{T}}{s_{k}^{T} y_{k}} \bigg) H_{k} \bigg( I - \frac{y_{k} s_{k}^{T}}{s_{k}^{T} y_{k}} \bigg) + \frac{s_{k} s_{k}^{T}}{s_{k}^{T} y_{k}}   (18)

which, if we define the curvature parameter $\rho_{k} = \frac{1}{s_{k}^{T} y_{k}}$, can be equivalently written as

H_{k+1} = \big( I - \rho_{k} s_{k} y_{k}^{T} \big) H_{k} \big( I - \rho_{k} y_{k} s_{k}^{T} \big) + \rho_{k} s_{k} s_{k}^{T} .   (19)

Applying the Sherman-Morrison-Woodbury formula (see 10.2307/2030425 ) to the BFGS update formula immediately above, one can also write the BFGS update in terms of the approximate Hessian $B_{k} = H_{k}^{-1}$ instead of the approximate inverse Hessian. Again, for brevity, the details are omitted because they are a special case of Theorem 3.2 shown later. The result is

B_{k+1} = B_{k} - \frac{B_{k} s_{k} s_{k}^{T} B_{k}}{s_{k}^{T} B_{k} s_{k}} + \frac{y_{k} y_{k}^{T}}{s_{k}^{T} y_{k}} = B_{k} - \frac{B_{k} s_{k} s_{k}^{T} B_{k}}{s_{k}^{T} B_{k} s_{k}} + \rho_{k} y_{k} y_{k}^{T} .   (20)

To ensure the updated approximate Hessian $B_{k+1}$ is positive definite, we must enforce that

s_{k}^{T} B_{k+1} s_{k} > 0 .   (21)

Substituting $B_{k+1} s_{k} = y_{k}$ from the secant condition, the condition (21) becomes

s_{k}^{T} y_{k} > 0   (22)

which is known as the curvature condition, as it is equivalent to

\frac{1}{\rho_{k}} > 0 .   (23)
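
To make the update concrete, the following Julia sketch implements the inverse Hessian update (19) with the curvature condition (22) as a guard. The function name bfgs_update, the dense matrix representation, and the choice to simply skip the update on failure are our own illustrative assumptions, not the authors' implementation.

    using LinearAlgebra

    # Minimal sketch of the BFGS inverse Hessian update (19).
    # H is the current inverse Hessian approximation H_k,
    # s = x_{k+1} - x_k, and y = ∇ϕ_{k+1} - ∇ϕ_k.
    function bfgs_update(H::AbstractMatrix, s::AbstractVector, y::AbstractVector)
        sty = dot(s, y)
        sty > 0 || return copy(H)        # curvature condition (22) failed: skip the update
        ρ = 1 / sty                      # curvature parameter ρ_k
        V = I - ρ * s * y'               # (I - ρ_k s_k y_k^T)
        return V * H * V' + ρ * s * s'   # update formula (19)
    end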

3 Derivation Of Secant Penalized BFGS

In this section, having reviewed the construction of the original BFGS method, we now show how treating the secant condition with a penalty method approach motivated by regularized least squares estimation allows one to generalize the original BFGS update.

3.1 Penalizing The Secant Condition

By applying a penalty method (see Chapter 17 of Nocedal2006 ) to the secant condition instead of directly enforcing the secant condition as a constraint, we obtain the problem

\min_{H} \bigg\{ \frac{1}{2} \left\| W^{1/2} (H - H_{k}) W^{1/2} \right\|_{F}^{2} + \frac{\beta_{k}}{2} \left\| W^{1/2} (H y_{k} - s_{k}) \right\|_{2}^{2} \bigg\} \quad \text{s.t.} \quad H = H^{T}   (24)

where $\beta_{k} \in [0, +\infty]$ is a penalty parameter that determines how strongly to penalize violations of the secant condition. As we will see, one recovers the solution to the constrained problem (11) in the limit $\beta_{k} = +\infty$, so $\beta_{k}$ can be intuitively thought of as the cost of violating the secant condition. By treating the symmetry constraint with a matrix $\Gamma$ of Lagrange multipliers again, we obtain the following Lagrangian

\mathcal{L} = \frac{1}{2} \operatorname{Tr}\big( W (H - H_{k}) W (H - H_{k})^{T} \big) + \frac{\beta_{k}}{2} \left\| W^{1/2} (H y_{k} - s_{k}) \right\|_{2}^{2} + \operatorname{Tr}\big( \Gamma (H - H^{T}) \big) .   (25)

Defining the residual associated with the secant condition as $r_{k}(H) \coloneqq H y_{k} - s_{k}$ and $u \coloneqq \beta_{k} W r_{k}$, the first order optimality conditions of (25) can be written as the system

W (H - H_{k}) W + u y_{k}^{T} + \Gamma^{T} - \Gamma = 0 ,   (26a)
H y_{k} - s_{k} - \frac{W^{-1} u}{\beta_{k}} = 0 ,   (26b)
H - H^{T} = 0 .   (26c)

Note that, as expected, in the limit $\beta_{k} = +\infty$, the system given by (26a), (26b), and (26c) reduces to the KKT system given by (17a), (17b), and (17c).

We now find an explicit closed form solution to the problem given by (24), which is given in Theorem 3.1.

Theorem 3.1 (SP-BFGS Update)

The update formula given by the minimizer $H^{*}$ of the problem defined by (24), which can be obtained by solving the system given by (26a), (26b), and (26c), is the SP-BFGS update

H_{k+1} = \big( I - \omega_{k} s_{k} y_{k}^{T} \big) H_{k} \big( I - \omega_{k} y_{k} s_{k}^{T} \big) + \omega_{k} \bigg[ \frac{\gamma_{k}}{\omega_{k}} + (\gamma_{k} - \omega_{k}) y_{k}^{T} H_{k} y_{k} \bigg] s_{k} s_{k}^{T}   (27)

where

\gamma_{k} = \frac{1}{(s_{k}^{T} y_{k} + \frac{1}{\beta_{k}})} , \quad \omega_{k} = \frac{1}{(s_{k}^{T} y_{k} + \frac{2}{\beta_{k}})} .   (28)
Proof

See Appendix A. ∎

At this point, a few comments are in order regarding the SP-BFGS update given by (27). First, observe that as $\beta_{k} \rightarrow +\infty$, we have that $\omega_{k} \rightarrow \rho_{k}$ and $\gamma_{k} \rightarrow \rho_{k}$. As a result, when $\beta_{k} = +\infty$, one recovers the original BFGS update, as expected. Second, also observe that as $\beta_{k} \rightarrow 0$, we have that $\omega_{k} \rightarrow 0$ and $\gamma_{k} \rightarrow 0$. As a result, we see that in the case $\beta_{k} = 0$ the SP-BFGS update reduces to $H_{k+1} = H_{k}$. This is again expected because as $\beta_{k} \rightarrow 0$, the cost of violating the secant condition goes to zero, and the minimum norm symmetric update is simply $H_{k+1} = H_{k}$.
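
This limiting behaviour is easy to see in code. The Julia sketch of (27) and (28) below is our own illustration (the function name sp_bfgs_update and the convention of passing β = Inf to recover BFGS are assumptions, not the authors' reference implementation); with β = Inf it reproduces the BFGS update (19), and with β = 0 it returns $H_{k}$ unchanged.

    using LinearAlgebra

    # Minimal sketch of the SP-BFGS inverse Hessian update (27)-(28).
    # β = Inf recovers the BFGS update (19); β = 0 leaves H unchanged.
    function sp_bfgs_update(H::AbstractMatrix, s::AbstractVector, y::AbstractVector, β::Real)
        β == 0 && return copy(H)                  # zero penalty: keep the previous approximation
        sty = dot(s, y)
        if sty <= -1 / β                          # SP-BFGS curvature condition (29) fails;
            return copy(H)                        # see Section 4.4 for better recovery options
        end
        γ = 1 / (sty + 1 / β)                     # γ_k from (28)
        ω = 1 / (sty + 2 / β)                     # ω_k from (28)
        V = I - ω * s * y'
        return V * H * V' + ω * (γ / ω + (γ - ω) * dot(y, H * y)) * (s * s')
    end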

We now examine what the analog of the curvature condition (22) is for SP-BFGS. Lemma 1 demonstrates that (29) is the SP-BFGS analog of the BFGS curvature condition (22).

Lemma 1 (Positive Definiteness Of SP-BFGS Update)

If $H_{k}$ is positive definite, then the $H_{k+1}$ given by the SP-BFGS update (27) is positive definite if and only if the SP-BFGS curvature condition

s_{k}^{T} y_{k} > -\frac{1}{\beta_{k}}   (29)

is satisfied.

Proof

See Appendix B. ∎

The result in Lemma 1 warrants some discussion. First, the limiting behaviour with respect to $\beta_{k}$ is consistent with Theorem 3.1. As $\beta_{k} \rightarrow +\infty$, condition (29) reduces to the BFGS curvature condition (22). As $\beta_{k} \rightarrow 0$, condition (29) reduces to no condition at all, as $s_{k}^{T} y_{k} > -\infty$ is always true. This is consistent with the observation that when $\beta_{k} = 0$, the minimum norm symmetric update is $H_{k+1} = H_{k}$, and in this case $H_{k+1}$ is guaranteed to be positive definite if $H_{k}$ is positive definite, regardless of $s_{k}^{T} y_{k}$.

From the proof of Lemma 1 (see (93)), it is now clear that

y_{k}^{T} H_{k+1} y_{k} = \bigg( \frac{\beta_{k} y_{k}^{T} s_{k}}{1 + \beta_{k} y_{k}^{T} s_{k}} \bigg) y_{k}^{T} s_{k} + \bigg( \frac{1}{1 + \beta_{k} y_{k}^{T} s_{k}} \bigg) y_{k}^{T} H_{k} y_{k}   (30)

and so $y_{k}^{T} H_{k+1} y_{k}$ is a convex combination of $y_{k}^{T} s_{k}$ and $y_{k}^{T} H_{k} y_{k}$. Thus, $H_{k+1}$ interpolates between the current inverse Hessian approximation $H_{k}$ and the original BFGS update, and as $\beta_{k}$ decreases, the interpolation is increasingly biased towards the current approximation $H_{k}$. From a regularized least squares estimation perspective, $\beta_{k}$ plays the role of a regularization parameter that controls the amount of bias in the estimate of $H_{k+1}$. Note that this behaviour is somewhat similar to the behaviour of Powell damping Powell1978 , although Powell damping was introduced to handle approximating a potentially indefinite Hessian of the Lagrangian in constrained optimization problems, and not noise.

We finish introducing the SP-BFGS update by applying the Sherman-Morrison-Woodbury formula to (27), which allows us to write the update in terms of the approximate Hessian $B_{k}$ instead of the approximate inverse Hessian $H_{k}$. The result is given in Theorem 3.2.

Theorem 3.2 (SP-BFGS Inverse Update)

The SP-BFGS update formula given by (27) can be written in terms of $B_{k} = H_{k}^{-1}$ as

B_{k+1} = B_{k} - \frac{\omega_{k} \Big[ \big( (\omega_{k} - \gamma_{k}) y_{k}^{T} B_{k}^{-1} y_{k} - \frac{\gamma_{k}}{\omega_{k}} \big) B_{k} s_{k} s_{k}^{T} B_{k} + (1 - \omega_{k} s_{k}^{T} y_{k}) (B_{k} s_{k} y_{k}^{T} + y_{k} s_{k}^{T} B_{k}) + \omega_{k} (s_{k}^{T} B_{k} s_{k}) y_{k} y_{k}^{T} \Big]}{\big( (\omega_{k} - \gamma_{k}) y_{k}^{T} B_{k}^{-1} y_{k} - \frac{\gamma_{k}}{\omega_{k}} \big) \big( \omega_{k} s_{k}^{T} B_{k} s_{k} \big) - (1 - \omega_{k} y_{k}^{T} s_{k})^{2}} .

Proof

See Appendix C. ∎

Note that the limiting behaviour of Theorem 3.2 with respect to $\beta_{k}$ is again consistent. When $\beta_{k} = +\infty$, we obtain the original BFGS inverse update (20), and when $\beta_{k} = 0$, we obtain $B_{k+1} = B_{k}$. One complication with respect to the SP-BFGS inverse update (98) is that $B_{k+1}$ cannot in general be expressed solely in terms of $B_{k}$ due to the presence of $y_{k}^{T} B_{k}^{-1} y_{k}$ (i.e. $y_{k}^{T} H_{k} y_{k}$) in the denominator.

4 Algorithmic Framework

We now outline how to practically implement SP-BFGS updating. We consider the situation where one has access to noise corrupted versions of a smooth function $\phi$ and its gradient $\nabla\phi$ that can be decomposed as

f(x) = \phi(x) + \epsilon(x) ,   (31)
g(x) = \nabla\phi(x) + e(x) .   (32)

In (31) and (32), $\phi$ is a smooth twice continuously differentiable function as in Section 2.1, and $\epsilon(x)$ is a scalar representing noise in the function evaluations. Similarly, $\nabla\phi$ is the gradient of the smooth function $\phi$, while $e(x)$ is a vector representing noise in the gradient evaluations. Similar decompositions are used in DFONoisyFunctionsQuasiNewton ; Gal_2020_HowToCatchALion ; doi:10.1137/19M1240794 .

4.1 Minimization Routine

Algorithm 1 outlines a general procedure for minimizing a noisy function with noisy function and gradient values $f$ and $g$ that can be decomposed as shown in (31) and (32). The inputs to the procedure in Algorithm 1 are a means of evaluating the noisy objective function $f(x)$ and gradient $g(x)$, the starting point $x^{0}$, and an initial inverse Hessian approximation $H^{0}$. As the best convergence/stopping test is problem dependent, we note that standard gradient and function value based tests can be employed in conjunction with smoothing and noise estimation techniques (e.g. see Section 3.3.4 of DFONoisyFunctionsQuasiNewton ). In the next several subsections, we discuss how to choose the penalty parameter $\beta_{k}$ and step size $\alpha_{k}$, and appropriate courses of action for when the SP-BFGS curvature condition (29) fails.

Algorithm 1 SP-BFGS Minimization Routine
1:  procedure SP-BFGS-Minimize($f(x)$, $g(x)$, $x^{0}$, $H^{0}$)
2:      $k \leftarrow 0$
3:      $H_{k} \leftarrow H^{0}$
4:      $x_{k} \leftarrow x^{0}$
5:      while Not Converged/Stopped do
6:          $p_{k} \leftarrow -H_{k} g_{k}$
7:          Choose step size $\alpha_{k}$
8:          $x_{k+1} \leftarrow x_{k} + \alpha_{k} p_{k}$
9:          $s_{k} \leftarrow x_{k+1} - x_{k}$
10:         $y_{k} \leftarrow g_{k+1} - g_{k}$
11:         Choose penalty parameter $\beta_{k}$
12:         if $s_{k}^{T} y_{k} > -\frac{1}{\beta_{k}}$ then
13:             $\gamma_{k} \leftarrow \frac{1}{(s_{k}^{T} y_{k} + \frac{1}{\beta_{k}})}$, $\omega_{k} \leftarrow \frac{1}{(s_{k}^{T} y_{k} + \frac{2}{\beta_{k}})}$
14:             $H_{k+1} \leftarrow \big( I - \omega_{k} s_{k} y_{k}^{T} \big) H_{k} \big( I - \omega_{k} y_{k} s_{k}^{T} \big) + \omega_{k} \big[ \frac{\gamma_{k}}{\omega_{k}} + (\gamma_{k} - \omega_{k}) y_{k}^{T} H_{k} y_{k} \big] s_{k} s_{k}^{T}$
15:         else
16:             Trigger SP-BFGS curvature condition failure recovery procedure
17:         end if
18:         $k \leftarrow k + 1$
19:     end while
20: end procedure
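
For concreteness, a minimal Julia realization of this loop is sketched below. It assumes the sp_bfgs_update function from the sketch after Theorem 3.1 is in scope, and it substitutes a fixed step size and the simple penalty rule (42) for the line search and penalty selection strategies that Sections 4.2 through 4.4 discuss in detail; the function name sp_bfgs_minimize and its keyword arguments are our own illustrative choices.

    using LinearAlgebra

    # Skeleton of Algorithm 1. Assumes sp_bfgs_update (Section 3 sketch) is in scope.
    # The fixed step size α and the penalty rule β_k = Ns*‖s_k‖₂ are placeholders; see
    # Sections 4.2-4.4 for how α_k, β_k, and curvature condition failures should be handled.
    function sp_bfgs_minimize(f, g, x0, H0; iters = 100, Ns = 1.0, α = 1e-3)
        x, H = copy(x0), copy(H0)
        gx = g(x)
        for k in 1:iters
            p = -H * gx                       # search direction p_k = -H_k g_k
            xnew = x + α * p                  # a line search using f would choose α_k here
            gnew = g(xnew)
            s, y = xnew - x, gnew - gx
            β = Ns * norm(s)                  # penalty rule (42)
            if dot(s, y) > -1 / β             # SP-BFGS curvature condition (29)
                H = sp_bfgs_update(H, s, y, β)
            end                               # else: recovery procedure of Section 4.4
            x, gx = xnew, gnew
        end
        return x, H
    end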

4.2 Choosing The Penalty Parameter $\beta_{k}$

As the choice of $\beta_{k}$ determines how strongly to bias the estimate of $H_{k+1}$ towards $H_{k}$, the choice of $\beta_{k}$ is fundamentally connected to the amount of noise present in the measured gradients $g_{k+1}$ and $g_{k}$. In brief, if the amount of noise present in the measured gradients is large, $\beta_{k}$ should be small to avoid overfitting the noise, and if the amount of noise present in the measured gradients is small, $\beta_{k}$ should be large to avoid underfitting curvature information. To make this point more rigorous, we introduce the following assumption.

Assumption 1 (Uniform Gradient Noise Bound)

There exists a nonnegative constant $\bar{\epsilon}_{g} \geq 0$ such that

\left\| g(x) - \nabla\phi(x) \right\|_{2} = \left\| e(x) \right\|_{2} \leq \bar{\epsilon}_{g} , \qquad \forall x \in \mathbb{R}^{n} .   (33)

As $\nabla\phi(x)$ is continuous, for each $k \geq 0$ we have

\left\| \lim_{\alpha_{k} \downarrow 0} \big[ \nabla\phi(x_{k} + \alpha_{k} p_{k}) \big] - \nabla\phi(x_{k}) \right\|_{2} = 0 .   (34)

However, due to noise we cannot in general guarantee

\left\| \lim_{\alpha_{k} \downarrow 0} \big[ g(x_{k} + \alpha_{k} p_{k}) \big] - g(x_{k}) \right\|_{2} = 0 .   (35)

Using the continuity of $\nabla\phi(x)$, Assumption 1, and the triangle inequality, one can conclude that

0 \leq \left\| \lim_{\alpha_{k} \downarrow 0} \big[ g(x_{k} + \alpha_{k} p_{k}) \big] - g(x_{k}) \right\|_{2} \leq 2\bar{\epsilon}_{g} .   (36)

As a result, it is now clear that in the presence of uniformly bounded gradient noise, sending the step size $\alpha_{k}$ to zero, and thus $s_{k}$ to zero, only bounds the difference of measured gradients within a ball with radius dependent on the gradient noise bound $\bar{\epsilon}_{g}$.

As $g_{k+1}$ and $g_{k}$ can be decomposed into smooth and noise components, so can $s_{k}^{T} y_{k}$, giving

s_{k}^{T} y_{k} = s_{k}^{T} y_{k}^{smooth} + s_{k}^{T} y_{k}^{noise} = s_{k}^{T} [\nabla\phi_{k+1} - \nabla\phi_{k}] + s_{k}^{T} [e_{k+1} - e_{k}] .   (37)

In conjunction with the Cauchy-Schwarz inequality, Assumption 1 implies that

-2\bar{\epsilon}_{g} \left\| s_{k} \right\|_{2} \leq s_{k}^{T} [e_{k+1} - e_{k}] \leq 2\bar{\epsilon}_{g} \left\| s_{k} \right\|_{2}   (38)

and so we have the lower and upper bounds

-2\bar{\epsilon}_{g} \left\| s_{k} \right\|_{2} + s_{k}^{T} y_{k}^{smooth} \leq s_{k}^{T} y_{k} \leq s_{k}^{T} y_{k}^{smooth} + 2\bar{\epsilon}_{g} \left\| s_{k} \right\|_{2} .   (39)

From (39), it is clear that the bound on the effect of the noise grows linearly with $\left\| s_{k} \right\|_{2}$. However, by using the average Hessian $\bar{G}_{k}$ from (13) and applying Taylor's theorem to $\nabla\phi$, it is also clear that

s_{k}^{T} y_{k}^{smooth} = s_{k}^{T} \bar{G}_{k} s_{k} = O\big( \left\| s_{k} \right\|_{2}^{2} \big)   (40)

and so

s_{k}^{T} y_{k} = O\big( \left\| s_{k} \right\|_{2}^{2} \big) + O\big( \left\| s_{k} \right\|_{2} \big)   (41)

where the $O\big( \left\| s_{k} \right\|_{2}^{2} \big)$ term is due to the true curvature of the smooth function $\phi$, and the $O\big( \left\| s_{k} \right\|_{2} \big)$ term is due to noise. Thus, we have now illustrated an important general behaviour given Assumption 1. As $\left\| s_{k} \right\|_{2}$ dominates $\left\| s_{k} \right\|_{2}^{2}$ as $\left\| s_{k} \right\|_{2} \rightarrow 0$, the effects of noise can dominate the true curvature for small $s_{k}$. Conversely, as $\left\| s_{k} \right\|_{2}^{2}$ dominates $\left\| s_{k} \right\|_{2}$ as $\left\| s_{k} \right\|_{2} \rightarrow +\infty$, the true curvature can dominate the effects of noise for large $s_{k}$.

Given the above analysis, a simple strategy for choosing $\beta_{k}$ is to make $\beta_{k}$ grow linearly with $\left\| s_{k} \right\|_{2}$, such as

\beta_{k} = N_{s} \left\| s_{k} \right\|_{2}   (42)

where $N_{s} > 0$ is a slope parameter. As $\left\| s_{k} \right\|_{2} \rightarrow 0$, $H_{k+1} \rightarrow H_{k}$, which is desirable because the effects of noise likely dominate as $\left\| s_{k} \right\|_{2} \rightarrow 0$. Increasingly biasing the estimate of $H_{k+1}$ towards $H_{k}$ reduces how much $H_{k+1}$ can be corrupted by noise, and relaxes the SP-BFGS curvature condition (29), reducing the likelihood of needing to trigger a recovery procedure described in Section 4.4. Also, as shown earlier, because $\nabla\phi$ is continuous, the true difference of gradients is guaranteed to go to zero as $s_{k}$ approaches zero. As a result, without noise present, it is natural that $H_{k+1} \rightarrow H_{k}$ as $s_{k} \rightarrow 0$. In the presence of noise, we wish for this behaviour to be preserved. Informally, one can intuitively think of wanting $H_{k}$ to behave as an approximate average inverse Hessian, and the averaging should remove the corrupting effects of noise, leaving $H_{k}$ to behave as if no noise were present. Similarly, as $\left\| s_{k} \right\|_{2} \rightarrow +\infty$, $\beta_{k} \rightarrow +\infty$, and one recovers the BFGS update in the limit, which is desirable because the effects of noise are likely dominated by the true curvature as $\left\| s_{k} \right\|_{2} \rightarrow +\infty$. The slope parameter $N_{s}$ dictates how sensitive $\beta_{k}$ is to $\left\| s_{k} \right\|_{2}$, and should be set inversely proportional to the gradient noise level (i.e. $\bar{\epsilon}_{g}$). Intuitively, if the gradient noise level is low, $\beta_{k}$ should grow quickly with $\left\| s_{k} \right\|_{2}$, as the effect of noise diminishes quickly, and vice versa.

It may also be desirable to modify (42) to

\beta_{k} = \max\big\{ N_{s} \left\| s_{k} \right\|_{2} - N_{o} , 0 \big\}   (43)

where $N_{o} > 0$ is an intercept parameter. The inclusion of $N_{o}$ allows one to stop updating $H_{k}$ if $\left\| s_{k} \right\|_{2}$ is sufficiently small. For example, it may be desirable to stop updating $H_{k}$ when one is very close to a stationary point, as gradient measurements are likely heavily dominated by noise.
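
As a small illustration, the two rules can be written as Julia one-liners (the function names beta_linear and beta_affine are our own, as is the choice of passing only the measured step s):

    using LinearAlgebra

    # Penalty parameter rules (42) and (43); Ns and No are the slope and intercept parameters.
    beta_linear(s, Ns) = Ns * norm(s)                      # rule (42)
    beta_affine(s, Ns, No) = max(Ns * norm(s) - No, 0.0)   # rule (43)

In the experiments of Section 6, $N_{s}$ is taken inversely proportional to $\bar{\epsilon}_{g}$, for example $N_{s} = 1/\bar{\epsilon}_{g}$ for the quadratic test problem of Section 6.1.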

4.3 Choosing The Step Size $\alpha_{k}$

Classically, during BFGS updating $\alpha_{k}$ is chosen to satisfy the Armijo-Wolfe conditions. As function and gradient evaluations are not corrupted by noise in the classical BFGS setting, we can write the Armijo condition, also known as the sufficient decrease condition, as

\phi_{k+1} \leq \phi_{k} + c_{1} \alpha_{k} \nabla\phi_{k}^{T} p_{k}   (44)

and the Wolfe condition, also known as the curvature condition, as

\nabla\phi_{k+1}^{T} p_{k} \geq c_{2} \nabla\phi_{k}^{T} p_{k}   (45)

where $0 < c_{1} < c_{2} < 1$, with well known choices being $c_{1} = 10^{-4}$ and $c_{2} = 0.9$. Observe that by subtracting $\nabla\phi_{k}^{T} p_{k}$ from both sides of (45) and multiplying by $\alpha_{k}$, (45) becomes

y_{k}^{T} s_{k} = [\nabla\phi_{k+1} - \nabla\phi_{k}]^{T} \alpha_{k} p_{k} \geq (c_{2} - 1) \nabla\phi_{k}^{T} \alpha_{k} p_{k} .   (46)

If $p_{k}$ is a descent direction then $\nabla\phi_{k}^{T} p_{k} < 0$, and combined with $(c_{2} - 1) < 0$ and $\alpha_{k} > 0$, one sees that (46) implies

y_{k}^{T} s_{k} \geq (c_{2} - 1) \nabla\phi_{k}^{T} s_{k} > 0   (47)

so (45) effectively enforces (22) when no gradient noise is present.

In the presence of noisy gradients, we argue that in general it no longer makes sense to enforce the Wolfe condition (45). With gradient noise present, (45) becomes

[\nabla\phi_{k+1} + e_{k+1}]^{T} p_{k} \geq c_{2} [\nabla\phi_{k} + e_{k}]^{T} p_{k}   (48)

which can behave erratically once the noise vectors $e_{k+1}$ and $e_{k}$ start to dominate the gradient of $\phi$. For example, the noise vectors $e_{k+1}$ and $e_{k}$ can cause both sides of (48) to erratically change sign, in which case whether or not the Wolfe condition is satisfied can be governed by randomness more than anything else.

We argue that because the SP-BFGS update allows one to relax the curvature condition based on the value of $\beta_{k}$ as shown in the SP-BFGS curvature condition (29), it is appropriate to drop the Wolfe condition entirely in the presence of gradient noise and instead employ only a version of the sufficient decrease condition when choosing $\alpha_{k}$. In the situation where gradient noise is present but function noise is not (i.e. $f(x) = \phi(x)$ in (31)), one can use a backtracking line search based on the sufficient decrease condition, which can guarantee convergence to a neighborhood of a stationary point of $\phi$. The situation where noise is present in both function and gradient evaluations is trickier. Similar to the approach presented in Section 4.2 of DFONoisyFunctionsQuasiNewton , one option is to use a backtracking line search with a relaxed sufficient decrease condition of the form

f_{k+1} \leq f_{k} + c_{1} \alpha_{k} g_{k}^{T} p_{k} + 2\epsilon_{A}   (49)

where $\epsilon_{A} \geq 0$ is a noise tolerance parameter and $p_{k} = -H_{k} g_{k}$. In Theorem 4.2 of DFONoisyFunctionsQuasiNewton , the authors show that under Assumptions 1 and 2, using the iteration (4) and a backtracking line search governed by the relaxed Armijo condition (49) with $p_{k} = -g_{k}$ guarantees linear convergence to a neighborhood of the global minimizer for strongly convex functions.

Assumption 2 (Uniform Function Noise Bound)

There exists a nonnegative constant $\bar{\epsilon}_{f} \geq 0$ such that

\left| f(x) - \phi(x) \right| = \left| \epsilon(x) \right| \leq \bar{\epsilon}_{f} , \qquad \forall x \in \mathbb{R}^{n} .   (50)

We agree with the authors of DFONoisyFunctionsQuasiNewton that it is possible to prove an extension of Theorem 4.2 of DFONoisyFunctionsQuasiNewton to a quasi-Newton iteration with positive definite $H_{k}$, and briefly outline why in Section 5.2. A quasi-Newton extension of Theorem 4.2 of DFONoisyFunctionsQuasiNewton is relevant to SP-BFGS updating because, as we will formally see in Section 5.1, control of $\beta_{k}$ makes it possible to uniformly bound the minimum and maximum eigenvalues of $H_{k+1}$.
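
A backtracking line search based on (49) can be sketched as follows in Julia; the function name backtracking_armijo is our own, and the default parameter values simply mirror the choices reported in Section 6 rather than being prescribed by the theory.

    using LinearAlgebra

    # Backtracking line search sketch based on the relaxed sufficient decrease condition (49).
    # f is the noisy objective, fx = f(x), gx = g(x) is the noisy gradient at x, p is the
    # search direction (e.g. p = -H*gx), and eps_A is the noise tolerance parameter ϵ_A.
    function backtracking_armijo(f, x, fx, gx, p; c1 = 1e-4, τ = 0.5, α0 = 1.0,
                                 eps_A = 0.0, max_backtracks = 45)
        α = α0
        for _ in 1:max_backtracks
            if f(x + α * p) <= fx + c1 * α * dot(gx, p) + 2 * eps_A   # condition (49)
                return α
            end
            α *= τ
        end
        return 0.0   # budget exhausted: take no step, as done for the test problem in Section 6.1
    end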

4.4 Failed SP-BFGS Curvature Condition Recovery Procedure

In the classical BFGS scenario where no gradient noise is present, the curvature condition (22) may fail if $\alpha_{k}$ is not chosen based on the Armijo-Wolfe conditions and $\phi$ is not strongly convex. One of the most common strategies to handle this scenario is to skip the BFGS update (i.e. set $H_{k+1} = H_{k}$) when this occurs, which corresponds to an SP-BFGS update with $\beta_{k} = 0$. However, this simple strategy has the downside of potentially producing poor inverse Hessian approximations if updates are skipped too frequently.

Conditionally skipping BFGS updates is an option in the presence of noisy gradients as well. In addition to skipping BFGS updates when (22) fails, as described above, another course of action sometimes recommended in the presence of noise is to replace (22) with

s_{k}^{T} y_{k} \geq \varepsilon \left\| s_{k} \right\|_{2}^{2}   (51)

where $\varepsilon > 0$ is a small positive constant, and skip the BFGS update if (51) is not satisfied. This strategy may be somewhat effective if $\left\| s_{k} \right\|_{2}$ is large, but reduces back to the initial update skipping approach as $\left\| s_{k} \right\|_{2} \rightarrow 0$. A similar strategy (e.g. see Section 3.3.3 of DFONoisyFunctionsQuasiNewton ) is to replace (22) with

s_{k}^{T} y_{k} \geq \zeta \left\| s_{k} \right\|_{2} \left\| y_{k} \right\|_{2}   (52)

and skip the BFGS update if (52) is not satisfied for some $\zeta \in (0, 1)$. Notice that none of the aforementioned update skipping strategies allow for curvature information to be incorporated if the measured curvature $s_{k}^{T} y_{k}$ is negative.

Unlike in the classical BFGS scenario, with SP-BFGS updating, curvature information can be incorporated even if the measured curvature $s_{k}^{T} y_{k}$ is negative by decreasing $\beta_{k}$ towards 0. In addition to having the option of conditionally skipping updates (i.e. setting $\beta_{k} = 0$), one can also alternatively relax the SP-BFGS curvature condition by decreasing $\beta_{k}$ towards 0 if (29) fails. Since $\beta_{k}$ is chosen after $s_{k}$ and $y_{k}$ are fixed in Algorithm 1, one can solve for $\beta_{k}$ values satisfying (29) when $s_{k}^{T} y_{k} < 0$, yielding

\beta_{k} = -\frac{1}{c_{3} (s_{k}^{T} y_{k})}   (53)

for all $c_{3} > 1$, assuming that $s_{k} \neq 0$ and $y_{k} \neq 0$. Note that if $s_{k} \neq 0$ and $y_{k} \neq 0$ and (29) fails, then the measured curvature $s_{k}^{T} y_{k}$ must be negative. The choice of $c_{3}$ determines how much to shrink $\beta_{k}$ compared to the largest value of $\beta_{k}$ that still satisfies (29) and thus guarantees the positive definiteness of $H_{k+1}$. Hence, if the value of $\beta_{k}$ produced by (42) or (43) is too large and (29) fails, one can choose an acceptable value of $\beta_{k}$ by using (53) and selecting a $c_{3} > 1$. Thus, instead of skipping the update (i.e. setting $\beta_{k} = 0$) if (29) fails, one can reduce $\beta_{k}$ towards 0, which has the effect of reducing the magnitude of the update by increasing how much $H_{k+1}$ is biased towards $H_{k}$. An approach based on reducing $\beta_{k}$ towards 0 never entirely skips incorporating measured curvature information, even if the measured curvature information is negative, but instead weights how heavily the measured curvature information affects $H_{k+1}$.
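
As a sketch, the recovery rule (53) can be combined with the curvature condition check in a few lines of Julia (the helper name recover_beta and the default c3 = 2 are our own illustrative choices; the rule relies on the fact noted above that $s_{k}^{T} y_{k} < 0$ whenever (29) fails with finite $\beta_{k}$):

    using LinearAlgebra

    # If the SP-BFGS curvature condition (29) fails, shrink β using rule (53) so that
    # s'y > -1/β holds again and H_{k+1} stays positive definite.
    function recover_beta(s, y, β; c3 = 2.0)
        sty = dot(s, y)
        sty > -1 / β && return β      # (29) already holds; keep the proposed β
        return -1 / (c3 * sty)        # rule (53); requires sty < 0 and c3 > 1
    end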

5 Convergence of SP-BFGS

In this section, we discuss relevant theoretical and convergence properties of SP-BFGS. First, it is important to note that for specific choices of the sequence of penalty parameters $\beta_{k}$, known convergence results already exist. Specifically, if $\beta_{k} = +\infty$ for all $k$, then SP-BFGS updating is equivalent to BFGS updating. Although there are not many works on the convergence properties of BFGS updating in the presence of uniformly bounded noise, such as in Assumptions 1 and 2, in doi:10.1137/19M1240794 the authors provide convergence results for a BFGS variant that employs an Armijo-Wolfe line search and lengthens the differencing interval in the presence of uniformly bounded function and gradient noise. At the other extreme, if $\beta_{k} = 0$ for all $k$, then one obtains a scaled gradient method for general $H^{0} \succ 0$, and this becomes the gradient method when $H^{0} = I$. Convergence analyses of the gradient method in the presence of uniformly bounded function and gradient noise for both a fixed step size and backtracking line search are provided in Section 4 of DFONoisyFunctionsQuasiNewton .

Given that perhaps the defining feature of SP-BFGS updating is the ability to vary $\beta_{k}$ at each iteration, we focus our attention on how varying $\beta_{k}$ can influence convergence behaviour in this section. As a result, most of the ensuing analysis centers around situations where the condition number of $H_{k}$ can be bounded. We do not employ the approach of bounding the cosine of the angle between the descent direction $p_{k}$ and the negative gradient above zero, and then showing that the condition number of $H_{k}$ is bounded, which is similar to the approaches taken when no noise is present in 10.2307/2157646 ; 10.2307/2157680 , and when noise is present in doi:10.1137/19M1240794 . Although it may be possible to apply the strategies employed in 10.2307/2157646 ; 10.2307/2157680 ; doi:10.1137/19M1240794 to establish convergence results for SP-BFGS, such an analysis is complicated enough that it is beyond the scope of this initial paper.

5.1 The Influence of $\beta_{k}$ on $H_{k+1}$

We first examine how $\beta_{k}$ determines how much the maximum and minimum eigenvalues $\lambda_{max}(H_{k+1})$ and $\lambda_{min}(H_{k+1})$ can change. In what follows, $\lambda(H)$ denotes the set of eigenvalues $\lambda_{1}, \dots, \lambda_{n}$ of the matrix $H \in \mathbb{R}^{n \times n}$. We provide upper bounds on $\lambda_{max}(B_{k+1})$ and $\lambda_{max}(H_{k+1})$ via Theorem 5.1. As $H_{k} = B_{k}^{-1}$, $1/\lambda_{min}(H_{k+1}) = \lambda_{max}(B_{k+1})$, and putting an upper bound on $\lambda_{max}(B_{k+1})$ is equivalent to putting a lower bound on $\lambda_{min}(H_{k+1})$.

Theorem 5.1 (Eigenvalue Upper Bounds)

When $H_{k+1}$ is given by the SP-BFGS update (27), the following upper bounds (54) and (55) hold

\lambda_{max}(H_{k+1}) \leq \operatorname{Tr}(H_{k+1}) \leq \Big[ \big( 1 + \gamma_{k} \left\| y_{k} \right\|_{2} \left\| s_{k} \right\|_{2} \big)^{2} \Big] \operatorname{Tr}(H_{k}) + \gamma_{k} \left\| s_{k} \right\|_{2}^{2} ,   (54)
\lambda_{max}(B_{k+1}) \leq \operatorname{Tr}(B_{k+1}) \leq \Big[ 1 + \beta_{k} \left\| y_{k} \right\|_{2} \left\| s_{k} \right\|_{2} \Big] \operatorname{Tr}(B_{k}) + \gamma_{k} \left\| y_{k} \right\|_{2}^{2} .   (55)
Proof

See Appendix D. ∎

With Theorem 5.1 in hand, we now formally see that when $s_{k}^{T} y_{k} > 0$, as $\beta_{k}$ increases from 0 to $+\infty$, an upper bound on $\lambda_{max}(H_{k+1})$ interpolates between $\operatorname{Tr}(H_{k})$ and $+\infty$, and an upper bound on $\lambda_{max}(B_{k+1})$ interpolates between $\operatorname{Tr}(B_{k})$ and $+\infty$. Similarly, when $s_{k}^{T} y_{k} < 0$, as $\beta_{k}$ increases from 0 towards $-\frac{1}{(s_{k}^{T} y_{k})}$, an upper bound on $\lambda_{max}(H_{k+1})$ interpolates between $\operatorname{Tr}(H_{k})$ and $+\infty$, and an upper bound on $\lambda_{max}(B_{k+1})$ interpolates between $\operatorname{Tr}(B_{k})$ and $+\infty$. Standard BFGS updating corresponds to setting $\beta_{k} = +\infty$ for all $k$, and as this is the largest possible value of $\beta_{k}$, one can no longer formally guarantee that $\lambda_{max}(B_{k+1})$ and $\lambda_{max}(H_{k+1})$ are bounded from above at each iteration because the measured curvature $s_{k}^{T} y_{k}$ may become arbitrarily close to zero due to the effects of noise. The key takeaway is that upper bounds on $\lambda_{max}(H_{k+1})$ and $\lambda_{max}(B_{k+1})$ can be tightened arbitrarily close to $\operatorname{Tr}(H_{k})$ and $\operatorname{Tr}(B_{k})$ by shrinking $\beta_{k}$ towards zero, as $s_{k}$, $y_{k}$, and $H_{k}$ are fixed before the value of $\beta_{k}$ is chosen in Algorithm 1.

Thus, if one must enforce a bound of the form $\lambda_{max}(H_{k+1}) \leq C_{H}$ or a bound of the form $\lambda_{max}(B_{k+1}) \leq C_{B}$ for all $k \geq 0$, where $C_{H} > \operatorname{Tr}(H_{0}) > 0$ and $C_{B} > \operatorname{Tr}(B_{0}) > 0$ are positive constants, there exist nontrivial sequences of sufficiently small $\beta_{k}$ with $\lim_{k \rightarrow \infty} \beta_{k} = 0$ that ensure the bounds hold for all $k$. To see this, observe that the interval $[\lambda_{max}(H_{0}), C_{H}]$ can be partitioned into subintervals corresponding to each iteration, and the sum of the subinterval lengths cannot exceed $C_{H} - \lambda_{max}(H_{0})$, which can be guaranteed by assigning a small enough value of $\beta_{k}$ to each subinterval, as this guarantees the maximum eigenvalue does not grow too much at each iteration $k$. Furthermore, although there clearly exist sequences of $\beta_{k}$ that ensure the bounds hold for all $k$ that satisfy $\beta_{k} = 0$ for all $k \geq K$, where $K$ is a positive integer, there also exist sequences of $\beta_{k}$ that ensure the bounds hold for all $k$ where $\beta_{k}$ instead only approaches zero in the limit $k \rightarrow \infty$.
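
The bound (54) is also easy to check numerically. The short Julia script below is our own spot check on randomly generated data (not one of the paper's experiments); it forms $H_{k+1}$ from (27) and (28) for several values of $\beta_{k}$ and compares $\lambda_{max}(H_{k+1})$ against the right hand side of (54).

    using LinearAlgebra

    n = 5
    A = randn(n, n); H = A * A' + I        # random symmetric positive definite H_k
    s, y = randn(n), randn(n)
    for β in (1e-2, 1.0, 1e2)
        dot(s, y) > -1 / β || continue     # only test when the curvature condition (29) holds
        γ = 1 / (dot(s, y) + 1 / β)
        ω = 1 / (dot(s, y) + 2 / β)
        V = I - ω * s * y'
        Hnew = V * H * V' + ω * (γ / ω + (γ - ω) * dot(y, H * y)) * (s * s')
        bound = (1 + γ * norm(y) * norm(s))^2 * tr(H) + γ * norm(s)^2
        println("β = $β: λmax(H_{k+1}) = ", maximum(eigvals(Symmetric(Hnew))), " ≤ ", bound)
    end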

5.2 Minimization Of Strongly Convex Functions

Having established that SP-BFGS iterations can maintain bounds on the maximum and minimum eigenvalues of the approximate inverse Hessians via sufficiently small choices of $\beta_{k}$, we now consider minimizing strongly convex functions in the presence of bounded noise. We introduce Assumption 3, and the notation $x^{\star}$ to denote the argument of the unique minimum of $\phi$, and $\phi^{\star} = \phi(x^{\star})$ to denote the minimum.

Assumption 3 (Strong Convexity of $\phi$)

The function $\phi \in C^{2}$ is twice continuously differentiable and there exist positive constants $0 < m \leq M$ such that

m I \preceq \nabla^{2}\phi(x) \preceq M I , \qquad \forall x \in \mathbb{R}^{n} .   (56)

We also state a general result in Lemma 2 that establishes a region where $-H_{k} g_{k}$ may not provide a descent direction with respect to $\phi$ due to noise dominating gradient measurements. Outside of this region, $-H_{k} g_{k}$ is guaranteed to provide a descent direction for $\phi$.

Lemma 2 (Region Where Gradient Noise Can Dominate $\nabla\phi$)

Suppose Assumptions 1 and 3, and the decomposition in (32) apply. Let $H$ be a symmetric positive definite matrix bounded by $\psi I \preceq H \preceq \Psi I$, where $0 < \psi \leq \Psi$. Define the neighborhood $\mathcal{N}_{1}(\psi, \Psi)$ as

\mathcal{N}_{1}(\psi, \Psi) \equiv \bigg\{ x \ \big| \ \phi(x) \leq \phi^{\star} + \frac{1}{2m} \bigg( \frac{\Psi \bar{\epsilon}_{g}}{\psi} \bigg)^{2} \bigg\} .   (57)

For all $x \notin \mathcal{N}_{1}$, $\nabla\phi(x)^{T} H g(x) > 0$. Contrapositively, for all $x$ such that $\nabla\phi(x)^{T} H g(x) \leq 0$, $x \in \mathcal{N}_{1}$.

Proof

See Appendix E. ∎

Applying Lemma 2 in the context of SP-BFGS updating makes several convergence properties clear. First, if one chooses $\beta_{k}$ such that $\psi I \preceq H_{k} \preceq \Psi I$ for all $k$ (i.e. the eigenvalues of the approximate inverse Hessian are uniformly bounded from above and below for all $k$), by Lemma 2 it becomes clear that in the presence of gradient noise and absence of function noise (i.e. $f(x) = \phi(x)$ in (31)), the iterates of SP-BFGS with a backtracking line search based on (49) with $\epsilon_{A} = 0$ in the worst case approach $\mathcal{N}_{1}$ as $k \rightarrow \infty$. To see this, observe that $-H_{k} g_{k}$ is guaranteed to provide a descent direction outside of $\mathcal{N}_{1}$ and the sufficient decrease condition guarantees that $\alpha_{k}$ is not too large, while backtracking guarantees that $\alpha_{k}$ is not too small. For more background, see Chapter 3 of Nocedal2006 .

Second, if both function and gradient noise are present, and one again chooses $\beta_{k}$ such that the bounds $\psi I \preceq H_{k} \preceq \Psi I$ hold for all $k$, under additional conditions a worst case analysis in Theorem 5.2 shows that an approach using a sufficiently small fixed step size $\alpha$ approaches $\mathcal{N}_{1}$ at a linear rate as $k \rightarrow \infty$. For a general quasi-Newton iteration of the form

x_{k+1} = x_{k} - \alpha H_{k} g_{k}   (58)

with constant step size $\alpha$ and $H_{k} \succ 0$, Theorem 5.2 establishes linear convergence to the region where noise can dominate $\nabla\phi$ (i.e. $\mathcal{N}_{1}$ in Lemma 2).

Theorem 5.2 (Linear Convergence For Sufficiently Small Fixed $\alpha$)

Suppose that Assumptions 1 and 3 hold. Further suppose that $H_{k}$ is symmetric positive definite and bounded by $\psi I \preceq H_{k} \preceq \Psi I$, where $0 < \psi \leq \Psi$. Let $\psi$ be such that the inequality

\nabla\phi_{k}^{T} H_{k} g_{k} \geq \psi \nabla\phi_{k}^{T} g_{k}   (59)

is true for all $k$. Let $\{x_{k}\}$ be the iterates generated by (58), where the constant step size $\alpha$ satisfies

0 < \alpha \leq \frac{\psi}{\Psi^{2} M} .   (60)

Then for all $k$ such that $x_{k} \notin \mathcal{N}_{1}(\psi, \Psi)$, one has the Q-linear convergence result

\phi_{k+1} - \bigg[ \phi^{\star} + \frac{1}{2m} \bigg( \frac{\Psi \bar{\epsilon}_{g}}{\psi} \bigg)^{2} \bigg] \leq (1 - \alpha \psi m) \bigg( \phi_{k} - \bigg[ \phi^{\star} + \frac{1}{2m} \bigg( \frac{\Psi \bar{\epsilon}_{g}}{\psi} \bigg)^{2} \bigg] \bigg) .   (61)

Similarly, for any $x_{0} \notin \mathcal{N}_{1}(\psi, \Psi)$, one has the R-linear convergence result

\phi_{k+1} - \phi^{\star} \leq (1 - \alpha \psi m)^{k} \bigg( \phi_{0} - \bigg[ \phi^{\star} + \frac{1}{2m} \bigg( \frac{\Psi \bar{\epsilon}_{g}}{\psi} \bigg)^{2} \bigg] \bigg) + \frac{1}{2m} \bigg( \frac{\Psi \bar{\epsilon}_{g}}{\psi} \bigg)^{2} .   (62)
Proof

See Appendix F. ∎

Theorem 5.2 can be considered a quasi-Newton extension of Theorem 4.1 from DFONoisyFunctionsQuasiNewton , which lays the foundation for Theorem 4.2 from DFONoisyFunctionsQuasiNewton . To extend the convergence result of Theorem 5.2 to the backtracking line search approach based on (49), see that (60), (115), and Assumption 2 combined imply

f(x_{k} - \alpha H_{k} g_{k}) \leq f(x_{k}) - \frac{\alpha \psi}{2} \big( \left\| \nabla\phi_{k} \right\|_{2}^{2} - \left\| e_{k} \right\|_{2}^{2} \big) + 2\bar{\epsilon}_{f}   (63)

and so if $\epsilon_{A} > \bar{\epsilon}_{f}$, comparing (49) and (63) makes it clear that (49) will be satisfied for sufficiently small $\alpha$. Hence, the backtracking line search always finds an $\alpha_{k}$ satisfying (49). For brevity, we defer a full, rigorous quasi-Newton extension of Theorem 4.2 of DFONoisyFunctionsQuasiNewton to future work and instead investigate the performance of an approach based on (49) via numerical experiments in Section 6.

6 Numerical Experiments

In this section, we test instances of Algorithm 1 on a diverse set of 33 test problems for unconstrained minimization. The set of test problems includes convex and nonconvex functions, and well known pathological functions such as the Rosenbrock function 10.1093/comjnl/3.3.175 and its relatives. Described in Section 6.1, the first test problem is similar to the one used in the numerical experiments section of doi:10.1137/19M1240794 , and involves an ill conditioned quadratic function. The other 32 problems are selected problems from the CUTEst test problem set gould-orban-cutest , and are used for tests in Section 6.2. Code for running these numerical experiments was written in the Julia programming language doi:10.1137/141000671 , and utilizes the NLPModels.jl orban-siqueira-nlpmodels-2020 , CUTEst.jl orban-siqueira-cutest-2020 , and Distributions.jl 2019arXiv190708611B ; Distributions.jl-2019 packages. In all the numerical experiments that follow, noise $\epsilon(x)$ was added to function evaluations by uniformly sampling from the interval $[-\bar{\epsilon}_{f}, \bar{\epsilon}_{f}]$, and noise $e(x)$ was added to the gradient evaluations by uniformly sampling from the closed Euclidean ball $\left\| x \right\|_{2} \leq \bar{\epsilon}_{g}$.

6.1 Ill Conditioned Quadratic Function With Additive Gradient Noise Only

The first test problem is strongly convex and consists of the 4-dimensional quadratic function given by

\phi(x) = \frac{1}{2} x^{T} T x   (64)

where the eigenvalues of $T$ are $\lambda(T) = \{10^{-2}, 1, 10^{2}, 10^{4}\}$. Consequently, the strong convexity parameter is $m = 10^{-2}$, the Lipschitz constant is $M = 10^{4}$, and the condition number of the Hessian $T$ is $10^{6}$. For this test problem, no noise was added to the function evaluations (i.e. $f(x) = \phi(x)$ in (31)), and $\bar{\epsilon}_{g} = 1$. As a result, in this scenario $\mathcal{N}_{1}$ from Lemma 2 with $\psi = \Psi = 1$ (i.e. the smallest possible $\mathcal{N}_{1}$) becomes

\mathcal{N}_{1}(1, 1) = \big\{ x \ \big| \ \phi(x) \leq 50 \big\} .   (65)

Following the discussion in Section 4.2, we set the penalty parameters via the formula $\beta_{k} = \frac{1}{\bar{\epsilon}_{g}} \left\| s_{k} \right\|_{2} + 10^{-10}$, which corresponds to a choice of $N_{s} = 1$ in (42). The $10^{-10}$ term was added as a small perturbation to provide numerical stability. The step size $\alpha_{k}$ was chosen using a backtracking line search based on the sufficient decrease condition (49) with $p_{k} = -H_{k} g_{k}$, where $g_{k}$ is defined by (32), $\epsilon_{A} = 0$, and $c_{1} = 10^{-4}$. At each iteration, backtracking started from the initial step size $\alpha^{0} = 1$, decreasing by a factor of $\tau = 1/2$ each time the sufficient decrease condition failed. If the backtracking line search exceeded the maximum number of 75 backtracks, we set $\alpha_{k} = 0$. However, the maximum number of backtracks was never exceeded when performing experiments with this first test problem.
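
To make this setup concrete in outline, the following Julia sketch constructs a matrix $T$ with the prescribed eigenvalues and a noisy gradient oracle matching the noise model described above. The random orthogonal similarity used to build $T$, the random seed, and the variable names are our own illustrative choices; the paper specifies only the spectrum of $T$, the noise model, and the starting point.

    using LinearAlgebra, Random

    Random.seed!(0)
    n = 4
    Q = Matrix(qr(randn(n, n)).Q)                 # random orthogonal matrix
    T = Q * Diagonal([1e-2, 1.0, 1e2, 1e4]) * Q'  # Hessian with λ(T) = {1e-2, 1, 1e2, 1e4}
    eps_g = 1.0                                   # gradient noise bound ϵ̄_g

    ϕ(x) = 0.5 * dot(x, T * x)                    # true objective (64); here f = ϕ (no function noise)

    # Noisy gradient oracle (32): e(x) drawn uniformly from the ball ‖e‖₂ ≤ ϵ̄_g.
    function g(x)
        u = randn(length(x)); u ./= norm(u)       # uniform direction on the unit sphere
        r = eps_g * rand()^(1 / length(x))        # radius giving a uniform draw from the ball
        return T * x + r * u
    end

    x0 = 1e5 * ones(n)                            # starting point (66)
    H0 = Matrix{Float64}(I, n, n)                 # initial inverse Hessian approximation H^0 = I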

Algorithm 1 was initialized using the matrix and starting point

H^{0} = I , \qquad x^{0} = 10^{5} \cdot [1, 1, 1, 1]^{T}   (66)

given in (66), with $\left\| \nabla\phi(x^{0}) \right\|_{2} \approx 10^{9}$. Figures 1, 2, and 3 compare the performance of 30 independent runs of SP-BFGS vs. BFGS over a fixed budget of 100 iterations. The relevant curvature condition failed an average of 25.7 total iterations per BFGS run, and 0.6 total iterations per SP-BFGS run. For the sake of comparability, both SP-BFGS and BFGS skipped the update if the relevant curvature condition failed. Observe that SP-BFGS reduces the objective function value by several more orders of magnitude compared to BFGS on average, and maintains significantly better inverse Hessian approximations than BFGS in the presence of gradient noise.

Figure 1: Base 10 logarithm of the optimality gap vs. the iteration number kk for 3030 independent runs. After 100100 iterations, SP-BFGS has an average log10(ϕ100ϕ)\log_{10}(\phi_{100}-\phi^{\star}) of 5.03-5.03 while BFGS has an average log10(ϕ100ϕ)\log_{10}(\phi_{100}-\phi^{\star}) of 1.27-1.27. Observe that both SP-BFGS and BFGS appear to enter 𝒩1(1,1)\mathcal{N}_{1}(1,1), which corresponds to values less than log10(50)1.7\log_{10}(50)\approx 1.7 on the y-axis, but SP-BFGS makes more progress inside 𝒩1(1,1)\mathcal{N}_{1}(1,1). Outside of 𝒩1(1,1)\mathcal{N}_{1}(1,1), the performance of SP-BFGS and BFGS is almost indistinguishable.
Figure 2: Base 10 logarithm of the Euclidean norm of the true gradient ϕk\nabla\phi_{k} vs. the iteration number kk for 3030 independent runs. Note that the BFGS values appear to vary more wildly than the SP-BFGS values.
Figure 3: Base 10 logarithm of the condition number of the true Hessian 2ϕk\nabla^{2}\phi_{k} scaled by the approximate inverse Hessian HkH_{k} at each iteration kk for 3030 independent runs. As ideally one wants Hk2ϕk=IH_{k}\nabla^{2}\phi_{k}=I, which has a condition number of 11, the ideal value on these plots is log10(1)=0\log_{10}(1)=0. Observe how the BFGS approximation deteriorates massively inside 𝒩1(1,1)\mathcal{N}_{1}(1,1), and how SP-BFGS avoids this massive deterioration. From examining the BFGS HkH_{k}, the authors were able to determine that in the region of deterioration, the values of the entries of HkH_{k} are often smaller than 10510^{-5}.

6.2 CUTEst Test Problems With Various Additive Noise Combinations

The remaining 3232 test problems were selected from the CUTEst problem set, the successor of CUTEr gould-orban-toint-cuter . At the time of writing, SIF files and descriptions of all 3232 test problems can be found at https://www.cuter.rl.ac.uk/Problems/mastsif.shtml. As a brief summary, some of the problems can be interpreted as least squares type problems (e.g. ARGTRGLS), some of the problems are ill conditioned or singular type problems (e.g. BOXPOWER), some of the problems are well known nonlinear optimization test problems (e.g. ROSENBR) or extensions of them (e.g. ROSENBRTU, SROSENBR), and some of the problems come from real applications (e.g. COATING, HEART6LS, VIBRBEAM). As shown in Tables 2 and 3, the selected CUTEst test problems vary in size from 22-dimensional to 10001000-dimensional.

Using these 3232 CUTEst test problems and a fixed budget of 20002000 objective function evaluations (not 20002000 iterations) per test, we tested the performance of SP-BFGS compared to BFGS with various combinations of function and gradient noise levels ϵ¯f\bar{\epsilon}_{f} and ϵ¯g\bar{\epsilon}_{g}. For all the experiments in Tables 1, 2, and 3, as well as the additional experiments in Appendix G, both SP-BFGS and BFGS skipped updating if the curvature condition failed. In Tables 1, 2, and 4, the SP-BFGS penalty parameter was set as βk=108ϵ¯gsk2+1010\beta_{k}=\frac{10^{8}}{\bar{\epsilon}_{g}}\left\|s_{k}\right\|_{2}+10^{-10}, as the authors heuristically found that setting Ns=108ϵ¯gN_{s}=\frac{10^{8}}{\bar{\epsilon}_{g}} works well in practice for a variety of problems. With regard to the backtracking line search based on (49), we set α0=1\alpha^{0}=1, ϵA=ϵ¯f\epsilon_{A}=\bar{\epsilon}_{f}, c1=104c_{1}=10^{-4}, τ=1/2\tau=1/2, and the maximum number of backtracks as 4545. We define Δoptlog10(ϕbestϕ)\Delta_{opt}\coloneqq\log_{10}(\phi_{best}-\phi^{\star}) as a measure of the optimality gap, and use ϕbest\phi_{best} to denote the smallest value of the true function ϕ\phi measured at any point during an algorithm run. The true minimum value ϕ\phi^{\star} for each CUTEst problem was obtained from that problem's SIF file. The sample variance (i.e. the variance with Bessel’s correction) is denoted by s2()s^{2}(\cdot).
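For reference, the penalty parameter rule and the reported optimality gap statistic described above can be written compactly as follows; this is a sketch with placeholder names, not the experiment code.

```julia
using LinearAlgebra

# Penalty parameter rule used for the CUTEst experiments in Tables 1, 2, and 4.
penalty_parameter(s_k, ϵg_bar) = (1e8 / ϵg_bar) * norm(s_k) + 1e-10

# Optimality gap statistic Δopt = log10(ϕ_best - ϕ_star): ϕ_best is the smallest
# true function value observed at any evaluated point during a run, and ϕ_star is
# the known minimum taken from the problem's SIF file.
Δopt(ϕ_best, ϕ_star) = log10(ϕ_best - ϕ_star)
```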

Table 1 compares the performance of SP-BFGS vs. BFGS on the Rosenbrock function (i.e. ROSENBR) corrupted by different combinations of function and gradient noise of varying orders of magnitude. Observe that SP-BFGS outperforms BFGS with respect to the mean and median optimality gap for every noise combination in Table 1, sometimes by several orders of magnitude. Tables 2 and 3 compare the performance of SP-BFGS vs. BFGS on the 3232 CUTEst test problems with both function and gradient noise present. Gradient noise was generated using ϵ¯g=104ϕ(x0)2\bar{\epsilon}_{g}=10^{-4}\left\|\nabla\phi(x^{0})\right\|_{2}, and function noise was generated using ϵ¯f=104|ϕ(x0)|\bar{\epsilon}_{f}=10^{-4}\left|\phi(x^{0})\right|, both chosen to ensure that noise does not initially dominate function or gradient evaluations. Note that as the noise in these numerical experiments is additive, the signal to noise ratio of gradient measurements decreases as a stationary point is approached. Overall, SP-BFGS outperforms BFGS on approximately 70%70\% of the CUTEst problems with both function and gradient noise present, and performs at least as well as BFGS on approximately 90%90\% of these problems. Referring to Appendix G, with only gradient noise present, these percentages become 80%80\% and 95%95\% respectively.

ϵ¯f\bar{\epsilon}_{f} ϵ¯g\bar{\epsilon}_{g} Mean(Δopt\Delta_{opt}) Median(Δopt\Delta_{opt}) Min(Δopt\Delta_{opt}) Max(Δopt\Delta_{opt}) s2(Δopt)s^{2}(\Delta_{opt}) Mean(II)
SPBFGS With No Function Noise
0 10410^{-4} -1.4E+01 -1.4E+01 -1.8E+01 -1.2E+01 1.4E+00 114
0 10210^{-2} -1.3E+01 -1.3E+01 -1.5E+01 -8.3E+00 2.9E+00 104
0 10010^{0} -2.1E+00 -1.8E+00 -5.7E+00 -9.2E-01 9.4E-01 153
0 10210^{2} 3.5E-02 2.9E-01 -1.9E+00 7.9E-01 3.9E-01 90
BFGS With No Function Noise
0 10410^{-4} -1.1E+01 -1.0E+01 -1.4E+01 -8.8E+00 1.8E+00 263
0 10210^{-2} -6.6E+00 -6.6E+00 -9.6E+00 -4.3E+00 1.6E+00 281
0 10010^{0} -1.5E+00 -1.2E+00 -3.3E+00 -5.4E-01 6.3E-01 279
0 10210^{2} 1.1E-01 4.3E-01 -2.4E+00 6.5E-01 4.7E-01 373
SPBFGS With Low Function Noise Level
10410^{-4} 10410^{-4} -1.4E+01 -1.4E+01 -1.5E+01 -1.3E+01 1.9E-01 1980
10410^{-4} 10210^{-2} -1.0E+01 -1.0E+01 -1.2E+01 -8.0E+00 1.3E+00 1964
10410^{-4} 10010^{0} -2.1E+00 -2.0E+00 -3.6E+00 -1.6E+00 2.0E-01 1759
10410^{-4} 10210^{2} 8.7E-02 3.1E-01 -2.2E+00 9.1E-01 4.5E-01 1720
BFGS With Low Function Noise Level
10410^{-4} 10410^{-4} -1.1E+01 -1.2E+01 -1.5E+01 -8.7E+00 1.7E+00 1980
10410^{-4} 10210^{-2} -6.6E+00 -6.5E+00 -8.8E+00 -4.7E+00 1.2E+00 1975
10410^{-4} 10010^{0} -1.2E+00 -1.1E+00 -1.8E+00 -8.6E-01 5.9E-02 1936
10410^{-4} 10210^{2} 9.5E-02 5.1E-01 -3.1E+00 9.2E-01 8.5E-01 1934
SPBFGS With Medium Function Noise Level
10210^{-2} 10410^{-4} -1.4E+01 -1.4E+01 -1.5E+01 -1.3E+01 3.4E-01 1981
10210^{-2} 10210^{-2} -1.0E+01 -1.0E+01 -1.3E+01 -7.5E+00 1.5E+00 1977
10210^{-2} 10010^{0} -3.4E+00 -3.0E+00 -7.5E+00 -2.0E+00 1.7E+00 1934
10210^{-2} 10210^{2} -1.8E-01 1.7E-01 -3.7E+00 7.4E-01 1.0E+00 1890
BFGS With Medium Function Noise Level
10210^{-2} 10410^{-4} -1.1E+01 -1.1E+01 -1.4E+01 -8.5E+00 1.4E+00 1981
10210^{-2} 10210^{-2} -6.7E+00 -6.7E+00 -1.0E+01 -4.9E+00 1.7E+00 1979
10210^{-2} 10010^{0} -1.8E+00 -1.5E+00 -3.8E+00 -9.1E-01 6.3E-01 1961
10210^{-2} 10210^{2} 1.4E-01 3.9E-01 -2.3E+00 8.5E-01 6.1E-01 1953
SPBFGS With High Function Noise Level
10010^{0} 10410^{-4} -1.4E+01 -1.4E+01 -1.5E+01 -1.3E+01 2.2E-01 1980
10010^{0} 10210^{-2} -1.0E+01 -1.0E+01 -1.2E+01 -7.3E+00 9.6E-01 1978
10010^{0} 10010^{0} -3.1E+00 -2.8E+00 -5.1E+00 -1.7E+00 8.9E-01 1969
10010^{0} 10210^{2} -2.2E-01 1.1E-02 -1.9E+00 8.4E-01 7.6E-01 1943
BFGS With High Function Noise Level
10010^{0} 10410^{-4} -1.1E+01 -1.1E+01 -1.3E+01 -9.0E+00 1.4E+00 1980
10010^{0} 10210^{-2} -6.7E+00 -6.4E+00 -9.1E+00 -5.0E+00 1.5E+00 1980
10010^{0} 10010^{0} -1.8E+00 -1.4E+00 -5.3E+00 -8.2E-01 1.1E+00 1973
10010^{0} 10210^{2} -2.9E-02 3.7E-01 -2.1E+00 8.9E-01 7.9E-01 1965
Table 1: Performance of SP-BFGS vs. BFGS on the Rosenbrock function (i.e. ROSENBR) corrupted by noise. Δoptlog10(ϕbestϕ)\Delta_{opt}\coloneqq\log_{10}(\phi_{best}-\phi^{\star}) measures the optimality gap, where ϕbest\phi_{best} denotes the smallest value of the true function ϕ\phi measured at any point during an algorithm run. The number of objective function evaluations is fixed at 20002000, but the number of iterations II can vary. Statistics are calculated from a sample of 3030 runs per algorithm.
SP-BFGS With Function And Gradient Noise
Problem Dim. Mean(Δopt\Delta_{opt}) Median(Δopt\Delta_{opt}) Min(Δopt\Delta_{opt}) Max(Δopt\Delta_{opt}) s2(Δopt)s^{2}(\Delta_{opt})
ARGTRGLS 200 4.5E-02 4.8E-02 1.7E-02 8.0E-02 2.5E-04
ARWHEAD 500 -2.5E+00 -2.5E+00 -2.6E+00 -2.5E+00 2.6E-04
BEALE 2 -1.1E+01 -1.1E+01 -1.4E+01 -9.8E+00 8.0E-01
BOX3 3 -7.1E+00 -6.8E+00 -8.9E+00 -6.5E+00 6.2E-01
BOXPOWER 100 -3.8E+00 -3.8E+00 -4.2E+00 -3.5E+00 5.0E-02
BROWNBS 2 -1.2E+00 -7.4E-01 -5.2E+00 2.0E+00 3.5E+00
BROYDNBDLS 50 -6.2E+00 -6.2E+00 -6.4E+00 -6.0E+00 6.9E-03
CHAINWOO 100 1.7E+00 1.8E+00 7.7E-03 2.1E+00 1.6E-01
CHNROSNB 50 -4.2E+00 -4.0E+00 -5.5E+00 -3.6E+00 3.8E-01
COATING 134 -2.7E-02 -1.2E-02 -1.3E-01 9.6E-02 3.5E-03
COOLHANSLS 9 -1.2E+00 -1.1E+00 -1.6E+00 -8.7E-01 1.7E-02
CUBE 2 -5.2E+00 -4.7E+00 -8.9E+00 -3.1E+00 2.2E+00
CYCLOOCFLS 20 -8.4E+00 -8.5E+00 -9.1E+00 -6.9E+00 3.0E-01
EXTROSNB 10 -5.2E+00 -5.2E+00 -5.2E+00 -5.1E+00 1.3E-03
FMINSRF2 64 -8.7E+00 -8.7E+00 -8.7E+00 -8.6E+00 2.6E-04
GENHUMPS 5 4.1E-02 2.4E-01 -2.9E+00 7.8E-01 4.5E-01
GENROSE 5 -9.4E+00 -9.3E+00 -9.9E+00 -9.1E+00 5.6E-02
HEART6LS 6 -3.5E-01 2.7E-01 -2.0E+00 1.2E+00 1.5E+00
HELIX 3 -6.1E+00 -6.0E+00 -7.4E+00 -4.5E+00 5.0E-01
MANCINO 30 -2.1E+00 -2.1E+00 -2.5E+00 -1.9E+00 1.2E-02
METHANB8LS 31 -3.8E+00 -3.9E+00 -4.2E+00 -3.4E+00 3.6E-02
MODBEALE 200 1.1E+00 1.0E+00 4.7E-01 1.8E+00 1.8E-01
NONDIA 10 -4.2E-03 -4.3E-03 -4.4E-03 -3.2E-03 9.1E-08
POWELLSG 4 -6.1E+00 -6.0E+00 -7.9E+00 -4.6E+00 9.1E-01
POWER 10 -3.9E+00 -3.8E+00 -4.9E+00 -3.3E+00 1.9E-01
ROSENBR 2 -8.6E+00 -8.5E+00 -1.1E+01 -6.3E+00 1.8E+00
ROSENBRTU 2 -1.8E+01 -1.8E+01 -2.0E+01 -1.7E+01 4.0E-01
SBRYBND 500 3.9E+00 3.9E+00 3.9E+00 3.9E+00 9.2E-06
SINEVAL 2 -1.4E+01 -1.4E+01 -1.5E+01 -1.3E+01 3.7E-01
SNAIL 2 -1.2E+01 -1.2E+01 -1.4E+01 -1.1E+01 2.9E-01
SROSENBR 1000 5.0E-01 5.0E-01 2.9E-01 6.8E-01 8.6E-03
VIBRBEAM 8 1.5E+00 1.5E+00 1.2E+00 2.1E+00 2.6E-02
Table 2: Performance of SP-BFGS on 3232 selected CUTEst test problems with noise added to both function and gradient evaluations. The number of objective function evaluations is fixed at 20002000. Δoptlog10(ϕbestϕ)\Delta_{opt}\coloneqq\log_{10}(\phi_{best}-\phi^{\star}) measures the optimality gap, where ϕbest\phi_{best} denotes the smallest value of the true function ϕ\phi measured at any point during an algorithm run. Statistics are calculated from a sample of 3030 runs per algorithm, and the Dim. column gives the problem dimension. The SPBFGS penalty parameter was set as βk=108ϵ¯gsk2+1010\beta_{k}=\frac{10^{8}}{\bar{\epsilon}_{g}}\left\|s_{k}\right\|_{2}+10^{-10}. For each problem, function noise was generated using ϵ¯f=104|ϕ(x0)|\bar{\epsilon}_{f}=10^{-4}\left|\phi(x^{0})\right|, and gradient noise was generated using ϵ¯g=104ϕ(x0)2\bar{\epsilon}_{g}=10^{-4}\left\|\nabla\phi(x^{0})\right\|_{2}, where the starting point x0x^{0} varies by CUTEst problem.
BFGS With Function And Gradient Noise
Problem Dim. Mean(Δopt\Delta_{opt}) Median(Δopt\Delta_{opt}) Min(Δopt\Delta_{opt}) Max(Δopt\Delta_{opt}) s2(Δopt)s^{2}(\Delta_{opt})
ARGTRGLS 200 5.6E-02 5.5E-02 2.4E-02 8.4E-02 2.7E-04
ARWHEAD 500 -2.5E+00 -2.5E+00 -2.6E+00 -2.5E+00 4.1E-04
BEALE 2 -7.7E+00 -7.8E+00 -9.7E+00 -6.1E+00 7.1E-01
BOX3 3 -6.5E+00 -6.5E+00 -6.7E+00 -6.4E+00 4.7E-03
BOXPOWER 100 -3.7E+00 -3.7E+00 -4.2E+00 -3.4E+00 3.4E-02
BROWNBS 2 6.8E-01 1.3E+00 -3.2E+00 3.1E+00 2.9E+00
BROYDNBDLS 50 -6.0E+00 -6.0E+00 -6.3E+00 -5.7E+00 2.6E-02
CHAINWOO 100 1.7E+00 1.7E+00 1.2E+00 2.1E+00 5.9E-02
CHNROSNB 50 -4.2E+00 -4.1E+00 -5.7E+00 -3.4E+00 4.4E-01
COATING 134 -3.7E-02 -5.7E-02 -1.6E-01 8.0E-02 4.1E-03
COOLHANSLS 9 -1.0E+00 -1.0E+00 -2.0E+00 -4.5E-01 7.2E-02
CUBE 2 -1.6E+00 -1.4E+00 -3.6E+00 -9.7E-01 4.1E-01
CYCLOOCFLS 20 -7.2E+00 -7.2E+00 -9.1E+00 -5.8E+00 8.7E-01
EXTROSNB 10 -5.2E+00 -5.2E+00 -5.2E+00 -5.1E+00 1.8E-03
FMINSRF2 64 -8.6E+00 -8.7E+00 -8.8E+00 -8.2E+00 2.8E-02
GENHUMPS 5 1.2E-01 1.2E-01 -1.2E+00 8.1E-01 2.3E-01
GENROSE 5 -7.5E+00 -7.6E+00 -9.1E+00 -6.2E+00 7.3E-01
HEART6LS 6 3.1E-01 6.1E-01 -1.9E+00 1.2E+00 1.4E+00
HELIX 3 -4.5E+00 -4.7E+00 -7.0E+00 -2.7E+00 1.1E+00
MANCINO 30 -1.6E+00 -1.6E+00 -1.8E+00 -1.3E+00 1.3E-02
METHANB8LS 31 -3.9E+00 -3.8E+00 -4.4E+00 -3.6E+00 5.5E-02
MODBEALE 200 1.1E+00 1.1E+00 2.9E-01 1.8E+00 1.6E-01
NONDIA 10 -3.7E-03 -3.8E-03 -4.4E-03 -2.6E-03 3.1E-07
POWELLSG 4 -5.2E+00 -5.2E+00 -7.6E+00 -4.2E+00 7.1E-01
POWER 10 -3.5E+00 -3.5E+00 -4.1E+00 -2.9E+00 1.0E-01
ROSENBR 2 -5.9E+00 -5.5E+00 -9.2E+00 -4.5E+00 1.4E+00
ROSENBRTU 2 -1.6E+01 -1.6E+01 -1.8E+01 -1.4E+01 1.5E+00
SBRYBND 500 3.9E+00 3.9E+00 3.9E+00 3.9E+00 2.7E-05
SINEVAL 2 -1.1E+01 -1.1E+01 -1.3E+01 -8.9E+00 1.3E+00
SNAIL 2 -9.4E+00 -9.2E+00 -1.2E+01 -8.0E+00 7.2E-01
SROSENBR 1000 5.4E-01 5.4E-01 3.6E-01 7.8E-01 6.9E-03
VIBRBEAM 8 1.7E+00 1.7E+00 1.2E+00 2.0E+00 2.9E-02
Table 3: Performance of BFGS on 3232 selected CUTEst test problems with noise added to both function and gradient evaluations. The number of objective function evaluations is fixed at 20002000. Δoptlog10(ϕbestϕ)\Delta_{opt}\coloneqq\log_{10}(\phi_{best}-\phi^{\star}) measures the optimality gap, where ϕbest\phi_{best} denotes the smallest value of the true function ϕ\phi measured at any point during an algorithm run. Statistics are calculated from a sample of 3030 runs per algorithm, and the Dim. column gives the problem dimension. For each problem, function noise was generated using ϵ¯f=104|ϕ(x0)|\bar{\epsilon}_{f}=10^{-4}\left|\phi(x^{0})\right|, and gradient noise was generated using ϵ¯g=104ϕ(x0)2\bar{\epsilon}_{g}=10^{-4}\left\|\nabla\phi(x^{0})\right\|_{2}, where the starting point x0x^{0} varies by CUTEst problem.

7 Final Remarks

In this paper, we introduced SP-BFGS, a new variant of the BFGS method designed to resist the corrupting effects of noise. Motivated by regularized least squares estimation, we derived the SP-BFGS update by applying a penalty method to the secant condition. We argued that with an appropriate choice of penalty parameter, SP-BFGS updating is robust to the corrupting effects of noise that can destroy the performance of BFGS. We empirically validated this claim by performing numerical experiments on a diverse set of over 3030 test problems with both function and gradient noise of varying orders of magnitude. The results of these numerical experiments showed that SP-BFGS can outperform BFGS approximately 70%70\% or more of the time, and performs at least as well as BFGS approximately 90%90\% or more of the time. Furthermore, a theoretical analysis confirmed that with appropriate choices of penalty parameter, it is possible to guarantee that SP-BFGS is not corrupted arbitrarily badly by noise, unlike standard BFGS. In the future, we believe it is worth investigating the performance of SP-BFGS in the presence of other types of noise, including multiplicative stochastic noise and deterministic noise, and we also believe it is worthwhile to study the use of noise estimation techniques in conjunction with SP-BFGS updating. The authors are also working to publish a limited memory version of SP-BFGS for high dimensional noisy problems.

Acknowledgements.
EH and BI’s work is supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) and the University of British Columbia (UBC).

Conflict of interest

The authors declare that they have no conflict of interest.

References

  • (1) Aydin, L., Aydin, O., Artem, H.S., Mert, A.: Design of dimensionally stable composites using efficient global optimization method. Proceedings of the Institution of Mechanical Engineers, Part L: Journal of Materials: Design and Applications 233(2), 156–168 (2019). DOI 10.1177/1464420716664921. URL https://doi.org/10.1177/1464420716664921
  • (2) Berahas, A.S., Byrd, R.H., Nocedal, J.: Derivative-free optimization of noisy functions via quasi-newton methods. SIAM Journal on Optimization 29, 965–993 (2019)
  • (3) Besançon, M., Anthoff, D., Arslan, A., Byrne, S., Lin, D., Papamarkou, T., Pearson, J.: Distributions.jl: Definition and modeling of probability distributions in the juliastats ecosystem. arXiv e-prints arXiv:1907.08611 (2019)
  • (4) Bezanson, J., Edelman, A., Karpinski, S., Shah, V.B.: Julia: A fresh approach to numerical computing. SIAM Review 59(1), 65–98 (2017). DOI 10.1137/141000671. URL https://doi.org/10.1137/141000671
  • (5) Bons, N.P., He, X., Mader, C.A., Martins, J.R.R.A.: Multimodality in aerodynamic wing design optimization. AIAA Journal 57(3), 1004–1018 (2019). DOI 10.2514/1.J057294. URL https://doi.org/10.2514/1.J057294
  • (6) Broyden, C.G.: The Convergence of a Class of Double-rank Minimization Algorithms 1. General Considerations. IMA Journal of Applied Mathematics 6(1), 76–90 (1970). DOI 10.1093/imamat/6.1.76. URL https://doi.org/10.1093/imamat/6.1.76
  • (7) Byrd, R.H., Hansen, S.L., Nocedal, J., Singer, Y.: A stochastic quasi-newton method for large-scale optimization. SIAM Journal on Optimization 26(2), 1008–1031 (2016). DOI 10.1137/140954362. URL https://doi.org/10.1137/140954362
  • (8) Byrd, R.H., Lu, P., Nocedal, J., Zhu, C.: A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing 16(5), 1190–1208 (1995). DOI 10.1137/0916069. URL https://doi.org/10.1137/0916069
  • (9) Byrd, R.H., Nocedal, J.: A tool for the analysis of quasi-newton methods with application to unconstrained minimization. SIAM Journal on Numerical Analysis 26(3), 727–739 (1989). URL http://www.jstor.org/stable/2157680
  • (10) Byrd, R.H., Nocedal, J., Yuan, Y.X.: Global convergence of a class of quasi-newton methods on convex problems. SIAM Journal on Numerical Analysis 24(5), 1171–1190 (1987). URL http://www.jstor.org/stable/2157646
  • (11) Chang, D., Sun, S., Zhang, C.: An accelerated linearly convergent stochastic l-bfgs algorithm. IEEE Transactions on Neural Networks and Learning Systems 30(11), 3338–3346 (2019)
  • (12) Fasano, G., Pintér, J.D.: Modeling and Optimization in Space Engineering: State of the Art and New Challenges. Springer International Publishing (2019)
  • (13) Fletcher, R.: A new approach to variable metric algorithms. The Computer Journal 13(3), 317–322 (1970). DOI 10.1093/comjnl/13.3.317. URL https://doi.org/10.1093/comjnl/13.3.317
  • (14) Gal, R., Haber, E., Irwin, B., Saleh, B., Ziv, A.: How to catch a lion in the desert: on the solution of the coverage directed generation (CDG) problem. Optimization and Engineering (2020). DOI 10.1007/s11081-020-09507-w. URL https://doi.org/10.1007%2Fs11081-020-09507-w
  • (15) Goldfarb, D.: A family of variable-metric methods derived by variational means. Mathematics of Computation 24(109), 23–26 (1970). URL http://www.jstor.org/stable/2004873
  • (16) Gould, N.I.M., Orban, D., contributors: The Constrained and Unconstrained Testing Environment with safe threads (CUTEst) for optimization software. https://github.com/ralna/CUTEst (2019)
  • (17) Gould, N.I.M., Orban, D., Toint, P.L.: CUTEr a Constrained and Unconstrained Testing Environment, revisited. https://www.cuter.rl.ac.uk (2001)
  • (18) Gower, R., Goldfarb, D., Richtarik, P.: Stochastic block bfgs: Squeezing more curvature out of data. In: M.F. Balcan, K.Q. Weinberger (eds.) Proceedings of The 33rd International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 48, pp. 1869–1878. PMLR, New York, New York, USA (2016). URL http://proceedings.mlr.press/v48/gower16.html
  • (19) Graf, P.A., Billups, S.: Mdtri: robust and efficient global mixed integer search of spaces of multiple ternary alloys. Computational Optimization and Applications 68(3), 671–687 (2017). DOI 10.1007/s10589-017-9922-9. URL https://doi.org/10.1007/s10589-017-9922-9
  • (20) Güler, O., Gürtuna, F., Shevchenko, O.: Duality in quasi-newton methods and new variational characterizations of the dfp and bfgs updates. Optimization Methods and Software 24(1), 45–62 (2009). DOI 10.1080/10556780802367205. URL https://doi.org/10.1080/10556780802367205
  • (21) Hager, W.W.: Updating the inverse of a matrix. SIAM Review 31(2), 221–239 (1989). URL http://www.jstor.org/stable/2030425
  • (22) Horn, R.A., Johnson, C.R.: Matrix analysis, 2nd edn. Cambridge University Press, New York (2013)
  • (23) Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Advances in Neural Information Processing Systems, pp. 315–323 (2013)
  • (24) Johnson, S.G.: Quasi-newton optimization: Origin of the bfgs update (2019). URL https://ocw.mit.edu/courses/mathematics/18-335j-introduction-to-numerical-methods-spring-2019/week-11/MIT18_335JS19_lec30.pdf
  • (25) Keane, A.J., Nair, P.B.: Computational Approaches for Aerospace Design: The Pursuit of Excellence. John Wiley & Sons, Ltd (2005). DOI 10.1002/0470855487. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/0470855487
  • (26) Koziel, S., Ogurtsov, S.: Antenna Design by Simulation-Driven Optimization. Springer International Publishing (2014)
  • (27) Lewis, A.S., Overton, M.L.: Nonsmooth optimization via quasi-newton methods. Mathematical Programming 141, 135–163 (2013)
  • (28) Lin, D., White, J.M., Byrne, S., Bates, D., Noack, A., Pearson, J., Arslan, A., Squire, K., Anthoff, D., Papamarkou, T., Besançon, M., Drugowitsch, J., Schauer, M., other contributors: JuliaStats/Distributions.jl: a Julia package for probability distributions and associated functions. https://github.com/JuliaStats/Distributions.jl (2019). DOI 10.5281/zenodo.2647458. URL https://doi.org/10.5281/zenodo.2647458
  • (29) Liu, D.C., Nocedal, J.: On the limited memory bfgs method for large scale optimization. Mathematical Programming 45(1), 503–528 (1989). DOI 10.1007/BF01589116. URL https://doi.org/10.1007/BF01589116
  • (30) Mokhtari, A., Ribeiro, A.: Global convergence of online limited memory bfgs. Journal of Machine Learning Research 16(1), 3151–3181 (2015)
  • (31) Moritz, P., Nishihara, R., Jordan, M.: A linearly-convergent stochastic l-bfgs algorithm. In: A. Gretton, C.C. Robert (eds.) Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, vol. 51, pp. 249–258. PMLR, Cadiz, Spain (2016). URL http://proceedings.mlr.press/v51/moritz16.html
  • (32) Muñoz-Rojas, P.A.: Computational Modeling, Optimization and Manufacturing Simulation of Advanced Engineering Materials. Springer (2016)
  • (33) Nocedal, J., Wright, S.: Numerical Optimization. Springer, New York (2006)
  • (34) Orban, D., Siqueira, A.S., contributors: CUTEst.jl: Julia’s CUTEst interface. https://github.com/JuliaSmoothOptimizers/CUTEst.jl (2020). DOI 10.5281/zenodo.1188851
  • (35) Orban, D., Siqueira, A.S., contributors: NLPModels.jl: Data structures for optimization models. https://github.com/JuliaSmoothOptimizers/NLPModels.jl (2020). DOI 10.5281/zenodo.2558627
  • (36) Powell, M.J.D.: Algorithms for nonlinear constraints that use lagrangian functions. Mathematical Programming 14(1), 224–248 (1978). DOI 10.1007/BF01588967. URL https://doi.org/10.1007/BF01588967
  • (37) Rosenbrock, H.H.: An Automatic Method for Finding the Greatest or Least Value of a Function. The Computer Journal 3(3), 175–184 (1960). DOI 10.1093/comjnl/3.3.175. URL https://doi.org/10.1093/comjnl/3.3.175
  • (38) Schraudolph, N.N., Yu, J., Günter, S.: A stochastic quasi-newton method for online convex optimization. In: M. Meila, X. Shen (eds.) Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, vol. 2, pp. 436–443. PMLR, San Juan, Puerto Rico (2007). URL http://proceedings.mlr.press/v2/schraudolph07a.html
  • (39) Shanno, D.F.: Conditioning of quasi-newton methods for function minimization. Mathematics of Computation 24(111), 647–656 (1970). URL http://www.jstor.org/stable/2004840
  • (40) Wang, X., Ma, S., Goldfarb, D., Liu, W.: Stochastic quasi-newton methods for nonconvex stochastic optimization. SIAM Journal on Optimization 27(2), 927–956 (2017). DOI 10.1137/15M1053141. URL https://doi.org/10.1137/15M1053141
  • (41) Xie, Y., Byrd, R.H., Nocedal, J.: Analysis of the bfgs method with errors. SIAM Journal on Optimization 30(1), 182–209 (2020). DOI 10.1137/19M1240794. URL https://doi.org/10.1137/19M1240794
  • (42) Zhao, R., Haskell, W.B., Tan, V.Y.F.: Stochastic l-bfgs: Improved convergence rates and practical acceleration strategies. IEEE Transactions on Signal Processing 66, 1155–1169 (2018)
  • (43) Zhu, J.: Optimization of Power System Operation. John Wiley & Sons, Ltd (2008). DOI 10.1002/9780470466971. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/9780470466971

Appendix A Proof of Theorem 3.1

To produce the SP-BFGS update, we first rearrange (26a), revealing that

(HHk)=W1(uykT+ΓTΓ)W1(H-H_{k})=-W^{-1}(uy_{k}^{T}+\Gamma^{T}-\Gamma)W^{-1} (67)

and so, by the symmetry requirement H=HTH=H^{T}, transposing (67) gives

uykT+ΓTΓ=(uykT+ΓTΓ)T=ykuT+ΓΓTuy_{k}^{T}+\Gamma^{T}-\Gamma=(uy_{k}^{T}+\Gamma^{T}-\Gamma)^{T}=y_{k}u^{T}+\Gamma-\Gamma^{T} (68)

which rearranges to

ΓTΓ=12(ykuTuykT)\Gamma^{T}-\Gamma=\frac{1}{2}(y_{k}u^{T}-uy_{k}^{T}) (69)

and so

(HHk)=12W1(ykuT+uykT)W1.(H-H_{k})=-\frac{1}{2}W^{-1}(y_{k}u^{T}+uy_{k}^{T})W^{-1}. (70)

Next, we right multiply (70) by yky_{k} to get

(HHk)yk=12W1(ykuTW1yk+u(ykTW1yk))(H-H_{k})y_{k}=-\frac{1}{2}W^{-1}\bigg{(}y_{k}u^{T}W^{-1}y_{k}+u(y_{k}^{T}W^{-1}y_{k})\bigg{)} (71)

and use (26b) to get that

sk+W1uβkHkyk=12W1(ykuTW1yk+u(ykTW1yk)).s_{k}+\frac{W^{-1}u}{\beta_{k}}-H_{k}y_{k}=-\frac{1}{2}W^{-1}\bigg{(}y_{k}u^{T}W^{-1}y_{k}+u(y_{k}^{T}W^{-1}y_{k})\bigg{)}. (72)

We now left multiply both sides by 2W-2W and rearrange, giving

2W(skHkyk)=ykuTW1yk+u(ykTW1yk+2βk).-2W(s_{k}-H_{k}y_{k})=y_{k}u^{T}W^{-1}y_{k}+u\bigg{(}y_{k}^{T}W^{-1}y_{k}+\frac{2}{\beta_{k}}\bigg{)}. (73)

This can be rearranged so that uu is isolated, giving

u=2W(skHkyk)ykuTW1ykykTW1yk+2βk=2W(skHkyk)+ykuTW1ykykTW1yk+2βk.u=\frac{-2W(s_{k}-H_{k}y_{k})-y_{k}u^{T}W^{-1}y_{k}}{y_{k}^{T}W^{-1}y_{k}+\frac{2}{\beta_{k}}}=-\frac{2W(s_{k}-H_{k}y_{k})+y_{k}u^{T}W^{-1}y_{k}}{y_{k}^{T}W^{-1}y_{k}+\frac{2}{\beta_{k}}}. (74)

To get rid of the uTu^{T} on the right hand side, we first left multiply both sides by ykTW1y_{k}^{T}W^{-1}, and then transpose to get

uTW1yk=2(skHkyk)Tyk+(ykTW1yk)(uTW1yk)ykTW1yk+2βku^{T}W^{-1}y_{k}=-\frac{2(s_{k}-H_{k}y_{k})^{T}y_{k}+(y_{k}^{T}W^{-1}y_{k})(u^{T}W^{-1}y_{k})}{y_{k}^{T}W^{-1}y_{k}+\frac{2}{\beta_{k}}} (75)

where we have taken advantage of the fact that the transpose of a scalar is the same scalar. This now allows us to solve for uTW1yku^{T}W^{-1}y_{k} using some basic algebra, resulting in

uTW1yk=(skHkyk)TykykTW1yk+1βk.u^{T}W^{-1}y_{k}=-\frac{(s_{k}-H_{k}y_{k})^{T}y_{k}}{y_{k}^{T}W^{-1}y_{k}+\frac{1}{\beta_{k}}}. (76)

Substituting (76) into (74) gives

u=ykykT(skHkyk)(ykTW1yk+2βk)(ykTW1yk+1βk)2W(skHkyk)ykTW1yk+2βk.u=\frac{y_{k}y_{k}^{T}(s_{k}-H_{k}y_{k})}{(y_{k}^{T}W^{-1}y_{k}+\frac{2}{\beta_{k}})(y_{k}^{T}W^{-1}y_{k}+\frac{1}{\beta_{k}})}-\frac{2W(s_{k}-H_{k}y_{k})}{y_{k}^{T}W^{-1}y_{k}+\frac{2}{\beta_{k}}}. (77)

Now, if we substitute the expression for uu in (77) into (70), after some simplification we get

(HHk)=1(ykTW1yk+2βk)[(skHkyk)ykTW1+W1yk(skHkyk)TykT(skHkyk)(ykTW1yk+1βk)W1ykykTW1].(H-H_{k})=\frac{1}{(y_{k}^{T}W^{-1}y_{k}+\frac{2}{\beta_{k}})}\bigg{[}(s_{k}-H_{k}y_{k})y_{k}^{T}W^{-1}+W^{-1}y_{k}(s_{k}-H_{k}y_{k})^{T}\\ -\frac{y_{k}^{T}(s_{k}-H_{k}y_{k})}{(y_{k}^{T}W^{-1}y_{k}+\frac{1}{\beta_{k}})}W^{-1}y_{k}y_{k}^{T}W^{-1}\bigg{]}. (78)

Now, we simplify further by using the fact that Wsk=ykWs_{k}=y_{k}, and thus W1yk=skW^{-1}y_{k}=s_{k}, revealing

H=Hk+(skHkyk)skT+sk(skHkyk)T(ykTsk+2βk)ykT(skHkyk)(ykTsk+2βk)(ykTsk+1βk)skskTH=H_{k}+\frac{(s_{k}-H_{k}y_{k})s_{k}^{T}+s_{k}(s_{k}-H_{k}y_{k})^{T}}{(y_{k}^{T}s_{k}+\frac{2}{\beta_{k}})}-\frac{y_{k}^{T}(s_{k}-H_{k}y_{k})}{(y_{k}^{T}s_{k}+\frac{2}{\beta_{k}})(y_{k}^{T}s_{k}+\frac{1}{\beta_{k}})}s_{k}s_{k}^{T} (79)

which, after a bit of algebra, reveals that the update formula solving the system defined by (26a), (26b), and (26c) can be expressed as

H=HkHkykskT+skykTHkT(ykTsk+2βk)+[ykTsk+2βk+ykTHkyk(ykTsk+2βk)(ykTsk+1βk)]skskT.H^{*}=H_{k}-\frac{H_{k}y_{k}s_{k}^{T}+s_{k}y_{k}^{T}H_{k}^{T}}{(y_{k}^{T}s_{k}+\frac{2}{\beta_{k}})}+\bigg{[}\frac{y_{k}^{T}s_{k}+\frac{2}{\beta_{k}}+y_{k}^{T}H_{k}y_{k}}{(y_{k}^{T}s_{k}+\frac{2}{\beta_{k}})(y_{k}^{T}s_{k}+\frac{1}{\beta_{k}})}\bigg{]}s_{k}s_{k}^{T}. (80)

We can make (80) look similar to the common form of the BFGS update given in (19) by defining the two quantities γk\gamma_{k} and ωk\omega_{k} as in (28) and observing that completing the square gives

H=(IskykT(ykTsk+2βk))Hk(IykskT(ykTsk+2βk))+[ykTsk+2βk+ykTHkyk(ykTsk+2βk)(ykTsk+1βk)ykTHkyk(ykTsk+2βk)2]skskTH^{*}=\bigg{(}I-\frac{s_{k}y_{k}^{T}}{(y_{k}^{T}s_{k}+\frac{2}{\beta_{k}})}\bigg{)}H_{k}\bigg{(}I-\frac{y_{k}s_{k}^{T}}{(y_{k}^{T}s_{k}+\frac{2}{\beta_{k}})}\bigg{)}\\ +\bigg{[}\frac{y_{k}^{T}s_{k}+\frac{2}{\beta_{k}}+y_{k}^{T}H_{k}y_{k}}{(y_{k}^{T}s_{k}+\frac{2}{\beta_{k}})(y_{k}^{T}s_{k}+\frac{1}{\beta_{k}})}-\frac{y_{k}^{T}H_{k}y_{k}}{(y_{k}^{T}s_{k}+\frac{2}{\beta_{k}})^{2}}\bigg{]}s_{k}s_{k}^{T} (81)

which is equivalent to

H=(IωkskykT)Hk(IωkykskT)+ωk[γkωk+(γkωk)ykTHkyk]skskTH^{*}=\bigg{(}I-\omega_{k}s_{k}y_{k}^{T}\bigg{)}H_{k}\bigg{(}I-\omega_{k}y_{k}s_{k}^{T}\bigg{)}+\omega_{k}\bigg{[}\frac{\gamma_{k}}{\omega_{k}}+(\gamma_{k}-\omega_{k})y_{k}^{T}H_{k}y_{k}\bigg{]}s_{k}s_{k}^{T} (82)

concluding the proof.
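As a numerical sanity check of (82), the sketch below implements the update with γk = 1/(ykTsk + 1/βk) and ωk = 1/(ykTsk + 2/βk), as can be read off by comparing (80) with (82), and verifies the limiting behaviour discussed in the paper: a very large penalty parameter recovers the standard BFGS update, while a very small one leaves Hk unchanged. This is an illustrative sketch, not the authors' implementation.

```julia
using LinearAlgebra

# SP-BFGS inverse Hessian update (82), with γ and ω as read off from (80)–(82).
function spbfgs_update(H, s, y, β)
    γ = 1 / (dot(y, s) + 1 / β)
    ω = 1 / (dot(y, s) + 2 / β)
    A = I - ω * s * y'
    return A * H * A' + ω * (γ / ω + (γ - ω) * dot(y, H * y)) * s * s'
end

H = Matrix(1.0I, 4, 4)
s = [1.0, 0.5, -0.3, 0.2]
y = [0.9, 0.6, -0.2, 0.3]          # yᵀs > 0, so the curvature condition holds

# β → ∞ recovers the standard BFGS update ...
ρ = 1 / dot(y, s)
H_bfgs = (I - ρ * s * y') * H * (I - ρ * y * s') + ρ * s * s'
@assert norm(spbfgs_update(H, s, y, 1e12) - H_bfgs) < 1e-8

# ... while β → 0 leaves the inverse Hessian approximation unchanged.
@assert norm(spbfgs_update(H, s, y, 1e-12) - H) < 1e-8
```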

Appendix B Proof of Lemma 1

The Hk+1H_{k+1} given by (27) has the general form

Hk+1=GTHkG+dskskTH_{k+1}=G^{T}H_{k}G+ds_{k}s_{k}^{T} (83)

with the specific choices

G=IωkykskT,d=ωk[γkωk+(γkωk)ykTHkyk].G=I-\omega_{k}y_{k}s_{k}^{T},\quad d=\omega_{k}\bigg{[}\frac{\gamma_{k}}{\omega_{k}}+(\gamma_{k}-\omega_{k})y_{k}^{T}H_{k}y_{k}\bigg{]}. (84)

By definition, Hk+1H_{k+1} is positive definite if

vTHk+1v>0,vn0 .v^{T}H_{k+1}v>0,\quad\forall v\in\mathbb{R}^{n}\setminus 0\text{~{}~{}}. (85)

We first show that (29) is a sufficient condition for Hk+1H_{k+1} to be positive definite, given that HkH_{k} is positive definite. By applying (83) to (85), we see that

vT(GTHkG+dskskT)v>0,vn0v^{T}\bigg{(}G^{T}H_{k}G+ds_{k}s_{k}^{T}\bigg{)}v>0,\quad\forall v\in\mathbb{R}^{n}\setminus 0 (86)

must be true for the choices of GG and dd in (84) if Hk+1H_{k+1} is positive definite. Substituting (84) into (86) reveals that

(vωk(skTv)yk)THk(vωk(skTv)yk)+ωk[γkωk+(γkωk)ykTHkyk](skTv)2>0\bigg{(}v-\omega_{k}(s_{k}^{T}v)y_{k}\bigg{)}^{T}H_{k}\bigg{(}v-\omega_{k}(s_{k}^{T}v)y_{k}\bigg{)}+\omega_{k}\bigg{[}\frac{\gamma_{k}}{\omega_{k}}+(\gamma_{k}-\omega_{k})y_{k}^{T}H_{k}y_{k}\bigg{]}(s_{k}^{T}v)^{2}>0 (87)

must be true for all vn0v\in\mathbb{R}^{n}\setminus 0 if Hk+1H_{k+1} is positive definite. Both (skTv)2(s_{k}^{T}v)^{2} and vTGTHkGvv^{T}G^{T}H_{k}Gv are always nonnegative. To see that vTGTHkGv0v^{T}G^{T}H_{k}Gv\geq 0, note that because HkH_{k} is positive definite, it has a principal square root Hk1/2H_{k}^{1/2}, and so

vTGTHkGv=vTGTHk1/2Hk1/2Gv=Hk1/2Gv220 .v^{T}G^{T}H_{k}Gv=v^{T}G^{T}H_{k}^{1/2}H_{k}^{1/2}Gv=\left\|H_{k}^{1/2}Gv\right\|_{2}^{2}\geq 0\text{~{}}. (88)

We now observe that if d>0d>0, the right term d(skTv)2d(s_{k}^{T}v)^{2} in (87) is zero if and only if (skTv)=0(s_{k}^{T}v)=0. However, if (skTv)=0(s_{k}^{T}v)=0, then the left term vTGTHkGvv^{T}G^{T}H_{k}Gv in (87) is zero only when v=0v=0. Hence, the condition d>0d>0 guarantees that (87) is true for all vv excluding the zero vector, and thus that Hk+1H_{k+1} is positive definite. The condition d>0d>0 expands to

γk+ωk(γkωk)ykTHkyk>0 .\gamma_{k}+\omega_{k}(\gamma_{k}-\omega_{k})y_{k}^{T}H_{k}y_{k}>0\text{~{}}. (89)

Using the definitions of γk\gamma_{k} and ωk\omega_{k} in (28), it is clear that (γkωk)0(\gamma_{k}-\omega_{k})\geq 0, as βk\beta_{k} can only take nonnegative values. Furthermore, as HkH_{k} is positive definite, ykTHkyk0y_{k}^{T}H_{k}y_{k}\geq 0 for all yky_{k}. As it is possible for (γkωk)ykTHkyk(\gamma_{k}-\omega_{k})y_{k}^{T}H_{k}y_{k} to be zero, we require γk>0\gamma_{k}>0. The condition γk>0\gamma_{k}>0 immediately gives (29), as γk\gamma_{k} can only be positive if the denominator in its definition is positive. Finally, as βk\beta_{k} can only take nonnegative values, (29) also ensures that ωk\omega_{k} is nonnegative, and so when (29) is true, ωk(γkωk)ykTHkyk0\omega_{k}(\gamma_{k}-\omega_{k})y_{k}^{T}H_{k}y_{k}\geq 0. In summary, we have shown that the condition (29) ensures that the left term in (89) is positive and the right term is nonnegative, so d>0d>0, and thus Hk+1H_{k+1} is positive definite.

We now show that (29) is a necessary condition for Hk+1H_{k+1} to be positive definite, given that HkH_{k} is positive definite. If Hk+1H_{k+1} is positive definite, then

ykTHk+1yk>0y_{k}^{T}H_{k+1}y_{k}>0 (90)

assuming yk0y_{k}\neq 0. Substituting (26b) into (90) gives

ykT[sk+W1uβk]>0y_{k}^{T}\bigg{[}s_{k}+\frac{W^{-1}u}{\beta_{k}}\bigg{]}>0 (91)

and using (76) shows that (91) is equivalent to

ykT[sk+γk(Hkyksk)βk]>0.y_{k}^{T}\bigg{[}s_{k}+\frac{\gamma_{k}(H_{k}y_{k}-s_{k})}{\beta_{k}}\bigg{]}>0. (92)

Now, some algebra shows that

ykT[sk+γk(Hkyksk)βk]=ykTsk+11+βkykTsk[ykTHkykykTsk]=(111+βkykTsk)ykTsk+(11+βkykTsk)ykTHkyk=(βkykTsk1+βkykTsk)ykTsk+(11+βkykTsk)ykTHkyk=βk(ykTsk)2+ykTHkyk1+βkykTsk\begin{split}y_{k}^{T}\bigg{[}s_{k}+\frac{\gamma_{k}(H_{k}y_{k}-s_{k})}{\beta_{k}}\bigg{]}&=y_{k}^{T}s_{k}+\frac{1}{1+\beta_{k}y_{k}^{T}s_{k}}\bigg{[}y_{k}^{T}H_{k}y_{k}-y_{k}^{T}s_{k}\bigg{]}\\ &=\bigg{(}1-\frac{1}{1+\beta_{k}y_{k}^{T}s_{k}}\bigg{)}y_{k}^{T}s_{k}+\bigg{(}\frac{1}{1+\beta_{k}y_{k}^{T}s_{k}}\bigg{)}y_{k}^{T}H_{k}y_{k}\\ &=\bigg{(}\frac{\beta_{k}y_{k}^{T}s_{k}}{1+\beta_{k}y_{k}^{T}s_{k}}\bigg{)}y_{k}^{T}s_{k}+\bigg{(}\frac{1}{1+\beta_{k}y_{k}^{T}s_{k}}\bigg{)}y_{k}^{T}H_{k}y_{k}\\ &=\frac{\beta_{k}(y_{k}^{T}s_{k})^{2}+y_{k}^{T}H_{k}y_{k}}{1+\beta_{k}y_{k}^{T}s_{k}}\end{split} (93)

and we also know that because HkH_{k} is positive definite, ykTHkyk>0y_{k}^{T}H_{k}y_{k}>0 for all yk0y_{k}\neq 0, by definition βk0\beta_{k}\geq 0, and by the definition of the square of a real number, (ykTsk)20(y_{k}^{T}s_{k})^{2}\geq 0. As a result,

ykT[sk+W1uβk]=βk(ykTsk)2+ykTHkyk1+βkykTsk>0y_{k}^{T}\bigg{[}s_{k}+\frac{W^{-1}u}{\beta_{k}}\bigg{]}=\frac{\beta_{k}(y_{k}^{T}s_{k})^{2}+y_{k}^{T}H_{k}y_{k}}{1+\beta_{k}y_{k}^{T}s_{k}}>0 (94)

is guaranteed only if the denominator 1+βkykTsk1+\beta_{k}y_{k}^{T}s_{k} is positive, which occurs when

skTyk>1βk.s_{k}^{T}y_{k}>-\frac{1}{\beta_{k}}. (95)

This establishes that (29) is a necessary condition for Hk+1H_{k+1} to be positive definite, given that HkH_{k} is positive definite, and concludes the proof.
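The role of condition (29) can also be illustrated numerically: with a finite penalty parameter, the update can absorb a measured negative curvature pair and still produce a positive definite Hk+1, provided skTyk > -1/βk. The sketch below (illustrative only) applies the explicit form (80) of the update to such a pair.

```julia
using LinearAlgebra

# Negative curvature pair: sᵀy = -0.55 < 0, so standard BFGS would have to skip
# the update, but condition (29) holds because sᵀy = -0.55 > -1/β = -1.0.
H = Matrix(1.0I, 2, 2)
s = [1.0, -0.5]
y = [-0.4, 0.3]
β = 1.0

# SP-BFGS update written in the explicit form (80).
a = dot(y, s) + 2 / β
b = dot(y, s) + 1 / β
Hnew = H - (H * y * s' + s * y' * H') / a +
       ((a + dot(y, H * y)) / (a * b)) * s * s'

@assert isposdef(Symmetric(Hnew))   # Hk+1 remains positive definite
```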

Appendix C Proof of Theorem 3.2

The Sherman-Morrison-Woodbury formula says

(A+UCV)1=A1A1U(C1+VA1U)1VA1.(A+UCV)^{-1}=A^{-1}-A^{-1}U(C^{-1}+VA^{-1}U)^{-1}VA^{-1}. (96)

Now, observe that the SP-BFGS update (27) can be written in the factored form

Hk+1=Hk+ωk[skHkyk][γk(1ωk+ykTHkyk)110][skTykTHk].H_{k+1}=H_{k}+\omega_{k}\big{[}s_{k}\quad H_{k}y_{k}\big{]}\left[\begin{array}[]{cc}\gamma_{k}\big{(}\frac{1}{\omega_{k}}+y_{k}^{T}H_{k}y_{k}\big{)}&-1\\ -1&0\end{array}\right]\left[\begin{array}[]{c}s_{k}^{T}\\ y_{k}^{T}H_{k}\end{array}\right]. (97)

Applying the Sherman-Morrison-Woodbury formula (96) to the factored SP-BFGS update (97) with

A=Hk,U=ωk[skHkyk],C=[γk(1ωk+ykTHkyk)110],V=[skTykTHk]A=H_{k},\quad U=\omega_{k}\big{[}s_{k}\quad H_{k}y_{k}\big{]},\quad C=\left[\begin{array}[]{cc}\gamma_{k}\big{(}\frac{1}{\omega_{k}}+y_{k}^{T}H_{k}y_{k}\big{)}&-1\\ -1&0\end{array}\right],\quad V=\left[\begin{array}[]{c}s_{k}^{T}\\ y_{k}^{T}H_{k}\end{array}\right]

yields

Hk+11=Hk1Hk1ωk[skHkyk]([γk(1ωk+ykTHkyk)110]1+[skTykTHk]Hk1ωk[skHkyk])1[skTykTHk]Hk1.H_{k+1}^{-1}=H_{k}^{-1}-H_{k}^{-1}\omega_{k}\big{[}s_{k}\quad H_{k}y_{k}\big{]}\bigg{(}\left[\begin{array}[]{cc}\gamma_{k}\big{(}\frac{1}{\omega_{k}}+y_{k}^{T}H_{k}y_{k}\big{)}&-1\\ -1&0\end{array}\right]^{-1}+\left[\begin{array}[]{c}s_{k}^{T}\\ y_{k}^{T}H_{k}\end{array}\right]H_{k}^{-1}\omega_{k}\big{[}s_{k}\quad H_{k}y_{k}\big{]}\bigg{)}^{-1}\left[\begin{array}[]{c}s_{k}^{T}\\ y_{k}^{T}H_{k}\end{array}\right]H_{k}^{-1}.

Inverting CC here gives

C1=[γk(1ωk+ykTHkyk)110]1=[011γk(1ωk+ykTHkyk)]C^{-1}=\left[\begin{array}[]{cc}\gamma_{k}\big{(}\frac{1}{\omega_{k}}+y_{k}^{T}H_{k}y_{k}\big{)}&-1\\ -1&0\end{array}\right]^{-1}=\left[\begin{array}[]{cc}0&-1\\ -1&-\gamma_{k}\big{(}\frac{1}{\omega_{k}}+y_{k}^{T}H_{k}y_{k}\big{)}\end{array}\right]

and we also have

VA1U=[skTykTHk]Hk1ωk[skHkyk]=ωk[skTykTHk][Hk1skyk]=[ωkskTHk1skωkskTykωkykTskωkykTHkyk]\begin{split}VA^{-1}U&=\left[\begin{array}[]{c}s_{k}^{T}\\ y_{k}^{T}H_{k}\end{array}\right]H_{k}^{-1}\omega_{k}\big{[}s_{k}\quad H_{k}y_{k}\big{]}\\ &=\omega_{k}\left[\begin{array}[]{c}s_{k}^{T}\\ y_{k}^{T}H_{k}\end{array}\right]\big{[}H_{k}^{-1}s_{k}\quad y_{k}\big{]}\\ &=\left[\begin{array}[]{cc}\omega_{k}s_{k}^{T}H_{k}^{-1}s_{k}&\omega_{k}s_{k}^{T}y_{k}\\ \omega_{k}y_{k}^{T}s_{k}&\omega_{k}y_{k}^{T}H_{k}y_{k}\end{array}\right]\end{split}

which is just a 2×22\times 2 matrix with real entries. Now, it becomes clear that

(C1+VA1U)=([γk(1ωk+ykTHkyk)110]1+[skTykTHk]Hk1ωk[skHkyk])=[ωkskTHk1sk1+ωkskTyk1+ωkykTskωkykTHkykγk(1ωk+ykTHkyk)].\begin{split}(C^{-1}+VA^{-1}U)&=\bigg{(}\left[\begin{array}[]{cc}\gamma_{k}\big{(}\frac{1}{\omega_{k}}+y_{k}^{T}H_{k}y_{k}\big{)}&-1\\ -1&0\end{array}\right]^{-1}+\left[\begin{array}[]{c}s_{k}^{T}\\ y_{k}^{T}H_{k}\end{array}\right]H_{k}^{-1}\omega_{k}\big{[}s_{k}\quad H_{k}y_{k}\big{]}\bigg{)}\\ &=\left[\begin{array}[]{cc}\omega_{k}s_{k}^{T}H_{k}^{-1}s_{k}&-1+\omega_{k}s_{k}^{T}y_{k}\\ -1+\omega_{k}y_{k}^{T}s_{k}&\omega_{k}y_{k}^{T}H_{k}y_{k}-\gamma_{k}\big{(}\frac{1}{\omega_{k}}+y_{k}^{T}H_{k}y_{k}\big{)}\end{array}\right].\end{split}

For notational compactness, let

D=(C1+VA1U)=[ωkskTHk1sk1+ωkskTyk1+ωkykTskωkykTHkykγk(1ωk+ykTHkyk)]D=(C^{-1}+VA^{-1}U)=\left[\begin{array}[]{cc}\omega_{k}s_{k}^{T}H_{k}^{-1}s_{k}&-1+\omega_{k}s_{k}^{T}y_{k}\\ -1+\omega_{k}y_{k}^{T}s_{k}&\omega_{k}y_{k}^{T}H_{k}y_{k}-\gamma_{k}\big{(}\frac{1}{\omega_{k}}+y_{k}^{T}H_{k}y_{k}\big{)}\end{array}\right]

so

D1=1det(D)[ωkykTHkykγk(1ωk+ykTHkyk)1ωkskTyk1ωkykTskωkskTHk1sk]D^{-1}=\frac{1}{\det(D)}\left[\begin{array}[]{cc}\omega_{k}y_{k}^{T}H_{k}y_{k}-\gamma_{k}\big{(}\frac{1}{\omega_{k}}+y_{k}^{T}H_{k}y_{k}\big{)}&1-\omega_{k}s_{k}^{T}y_{k}\\ 1-\omega_{k}y_{k}^{T}s_{k}&\omega_{k}s_{k}^{T}H_{k}^{-1}s_{k}\end{array}\right]

where the determinant of DD is

det(D)=(ωkykTHkykγk(1ωk+ykTHkyk))(ωkskTHk1sk)(1ωkykTsk)2=((ωkγk)ykTHkykγkωk)(ωkskTHk1sk)(1ωkykTsk)2\begin{split}\det(D)&=\bigg{(}\omega_{k}y_{k}^{T}H_{k}y_{k}-\gamma_{k}\bigg{(}\frac{1}{\omega_{k}}+y_{k}^{T}H_{k}y_{k}\bigg{)}\bigg{)}\bigg{(}\omega_{k}s_{k}^{T}H_{k}^{-1}s_{k}\bigg{)}-(1-\omega_{k}y_{k}^{T}s_{k})^{2}\\ &=\bigg{(}(\omega_{k}-\gamma_{k})y_{k}^{T}H_{k}y_{k}-\frac{\gamma_{k}}{\omega_{k}}\bigg{)}\bigg{(}\omega_{k}s_{k}^{T}H_{k}^{-1}s_{k}\bigg{)}-(1-\omega_{k}y_{k}^{T}s_{k})^{2}\end{split}

and we have used the fact that ykTsk=skTyky_{k}^{T}s_{k}=s_{k}^{T}y_{k}, as this is a scalar quantity. Next,

Udet(D)D1V=ωk[skHkyk][ωkykTHkykγk(1ωk+ykTHkyk)1ωkskTyk1ωkykTskωkskTHk1sk][skTykTHk]=ωk[skHkyk][ωkykTHkykskTγk(1ωk+ykTHkyk)skT+(1ωkskTyk)ykTHk(1ωkykTsk)skT+ωkskTHk1skykTHk]\begin{split}U\det(D)D^{-1}V&=\omega_{k}\big{[}s_{k}\quad H_{k}y_{k}\big{]}\left[\begin{array}[]{cc}\omega_{k}y_{k}^{T}H_{k}y_{k}-\gamma_{k}(\frac{1}{\omega_{k}}+y_{k}^{T}H_{k}y_{k})&1-\omega_{k}s_{k}^{T}y_{k}\\ 1-\omega_{k}y_{k}^{T}s_{k}&\omega_{k}s_{k}^{T}H_{k}^{-1}s_{k}\end{array}\right]\left[\begin{array}[]{c}s_{k}^{T}\\ y_{k}^{T}H_{k}\end{array}\right]\\ &=\omega_{k}\big{[}s_{k}\quad H_{k}y_{k}\big{]}\left[\begin{array}[]{cc}\omega_{k}y_{k}^{T}H_{k}y_{k}s_{k}^{T}-\gamma_{k}(\frac{1}{\omega_{k}}+y_{k}^{T}H_{k}y_{k})s_{k}^{T}+(1-\omega_{k}s_{k}^{T}y_{k})y_{k}^{T}H_{k}\\ (1-\omega_{k}y_{k}^{T}s_{k})s_{k}^{T}+\omega_{k}s_{k}^{T}H_{k}^{-1}s_{k}y_{k}^{T}H_{k}\end{array}\right]\end{split}

so Udet(D)D1VU\det(D)D^{-1}V fully expanded becomes

ωk[sk(ωkykTHkykskTγk(1ωk+ykTHkyk)skT+(1ωkskTyk)ykTHk)+Hkyk((1ωkykTsk)skT+ωkskTHk1skykTHk)].\omega_{k}\bigg{[}s_{k}\bigg{(}\omega_{k}y_{k}^{T}H_{k}y_{k}s_{k}^{T}-\gamma_{k}(\frac{1}{\omega_{k}}+y_{k}^{T}H_{k}y_{k})s_{k}^{T}+(1-\omega_{k}s_{k}^{T}y_{k})y_{k}^{T}H_{k}\bigg{)}+H_{k}y_{k}\bigg{(}(1-\omega_{k}y_{k}^{T}s_{k})s_{k}^{T}+\omega_{k}s_{k}^{T}H_{k}^{-1}s_{k}y_{k}^{T}H_{k}\bigg{)}\bigg{]}.

This looks rather ugly at the moment, but we continue by breaking the problem down further, noting that

sk(ωkykTHkykskTγk(1ωk+ykTHkyk)skT+(1ωkskTyk)ykTHk)=((ωkγk)ykTHkykγkωk)skskT+(1ωkskTyk)skykTHks_{k}\bigg{(}\omega_{k}y_{k}^{T}H_{k}y_{k}s_{k}^{T}-\gamma_{k}\bigg{(}\frac{1}{\omega_{k}}+y_{k}^{T}H_{k}y_{k}\bigg{)}s_{k}^{T}+(1-\omega_{k}s_{k}^{T}y_{k})y_{k}^{T}H_{k}\bigg{)}=\\ \bigg{(}(\omega_{k}-\gamma_{k})y_{k}^{T}H_{k}y_{k}-\frac{\gamma_{k}}{\omega_{k}}\bigg{)}s_{k}s_{k}^{T}+(1-\omega_{k}s_{k}^{T}y_{k})s_{k}y_{k}^{T}H_{k}

and

Hkyk((1ωkykTsk)skT+ωkskTHk1skykTHk)=(1ωkykTsk)HkykskT+ωkHkyk(skTHk1sk)ykTHk.H_{k}y_{k}\bigg{(}(1-\omega_{k}y_{k}^{T}s_{k})s_{k}^{T}+\omega_{k}s_{k}^{T}H_{k}^{-1}s_{k}y_{k}^{T}H_{k}\bigg{)}=\\ (1-\omega_{k}y_{k}^{T}s_{k})H_{k}y_{k}s_{k}^{T}+\omega_{k}H_{k}y_{k}(s_{k}^{T}H_{k}^{-1}s_{k})y_{k}^{T}H_{k}.

The above intermediate results further simplify Udet(D)D1VU\det(D)D^{-1}V to

ωk[((ωkγk)ykTHkykγkωk)skskT+(1ωkskTyk)(skykTHk+HkykskT)+ωkHkyk(skTHk1sk)ykTHk].\omega_{k}\bigg{[}\bigg{(}(\omega_{k}-\gamma_{k})y_{k}^{T}H_{k}y_{k}-\frac{\gamma_{k}}{\omega_{k}}\bigg{)}s_{k}s_{k}^{T}+(1-\omega_{k}s_{k}^{T}y_{k})(s_{k}y_{k}^{T}H_{k}+H_{k}y_{k}s_{k}^{T})+\omega_{k}H_{k}y_{k}(s_{k}^{T}H_{k}^{-1}s_{k})y_{k}^{T}H_{k}\bigg{]}.

Left and right multiplying the line immediately above by A1=Hk1A^{-1}=H_{k}^{-1} gives

ωk[((ωkγk)ykTHkykγkωk)Hk1skskTHk1+(1ωkskTyk)(Hk1skykT+ykskTHk1)+ωkyk(skTHk1sk)ykT]\omega_{k}\bigg{[}\bigg{(}(\omega_{k}-\gamma_{k})y_{k}^{T}H_{k}y_{k}-\frac{\gamma_{k}}{\omega_{k}}\bigg{)}H_{k}^{-1}s_{k}s_{k}^{T}H_{k}^{-1}+(1-\omega_{k}s_{k}^{T}y_{k})(H_{k}^{-1}s_{k}y_{k}^{T}+y_{k}s_{k}^{T}H_{k}^{-1})+\omega_{k}y_{k}(s_{k}^{T}H_{k}^{-1}s_{k})y_{k}^{T}\bigg{]}

and thus, after dividing out det(D)\det(D) and applying Bk=Hk1B_{k}=H_{k}^{-1}, we arrive at the following final formula

Bk+1=Bkωk[((ωkγk)ykTBk1ykγkωk)BkskskTBk+(1ωkskTyk)(BkskykT+ykskTBk)+ωk(skTBksk)ykykT]((ωkγk)ykTBk1ykγkωk)(ωkskTBksk)(1ωkykTsk)2B_{k+1}=B_{k}-\frac{\omega_{k}\bigg{[}\bigg{(}(\omega_{k}-\gamma_{k})y_{k}^{T}B_{k}^{-1}y_{k}-\frac{\gamma_{k}}{\omega_{k}}\bigg{)}B_{k}s_{k}s_{k}^{T}B_{k}+(1-\omega_{k}s_{k}^{T}y_{k})(B_{k}s_{k}y_{k}^{T}+y_{k}s_{k}^{T}B_{k})+\omega_{k}(s_{k}^{T}B_{k}s_{k})y_{k}y_{k}^{T}\bigg{]}}{\big{(}(\omega_{k}-\gamma_{k})y_{k}^{T}B_{k}^{-1}y_{k}-\frac{\gamma_{k}}{\omega_{k}}\big{)}\big{(}\omega_{k}s_{k}^{T}B_{k}s_{k}\big{)}-(1-\omega_{k}y_{k}^{T}s_{k})^{2}} (98)

for the SP-BFGS inverse update, which concludes the proof.
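Theorem 3.2 also admits a quick numerical cross-check: for a positive definite Hk and a curvature pair satisfying (29), the Bk+1 produced by (98) should coincide with the inverse of the Hk+1 produced by (80). The sketch below performs this check; it is illustrative only, with γk and ωk as read off from (80)–(82), the two scalar factors in (98) written out as Dhat and Ehat, and ykTBk−1yk evaluated as ykTHkyk.

```julia
using LinearAlgebra

H = [2.0 0.3; 0.3 1.0]                 # Hk (positive definite)
B = inv(H)                             # Bk = Hk⁻¹
s = [1.0, 0.5]
y = [0.8, 0.3]
β = 10.0

γ = 1 / (dot(y, s) + 1 / β)
ω = 1 / (dot(y, s) + 2 / β)

# Direct update (80) for Hk+1.
a = dot(y, s) + 2 / β
b = dot(y, s) + 1 / β
Hnew = H - (H * y * s' + s * y' * H') / a +
       ((a + dot(y, H * y)) / (a * b)) * s * s'

# Inverse update (98) for Bk+1.
Dhat = (ω - γ) * dot(y, H * y) - γ / ω          # ykᵀBk⁻¹yk = ykᵀHkyk
Ehat = 1 - ω * dot(s, y)
num  = Dhat * (B * s * s' * B) + Ehat * (B * s * y' + y * s' * B) +
       ω * dot(s, B * s) * (y * y')
Bnew = B - ω * num / (Dhat * ω * dot(s, B * s) - Ehat^2)

@assert norm(Bnew - inv(Hnew)) < 1e-8
```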

Appendix D Proof of Theorem 5.1

Referring to Theorem 3.2, taking the trace of both sides of (98) and applying the linearity and cyclic invariance properties of the trace yields

Tr(Bk+1)=κ1Tr(Bk)+κ2Bksk22+2κ3(ykTBksk)+κ4yk22\operatorname{Tr}(B_{k+1})=\kappa_{1}\operatorname{Tr}(B_{k})+\kappa_{2}\left\|B_{k}s_{k}\right\|_{2}^{2}+2\kappa_{3}(y_{k}^{T}B_{k}s_{k})+\kappa_{4}\left\|y_{k}\right\|_{2}^{2} (99)

where

κ1=1,κ2=ωkD^[D^(ωkskTBksk)(E^)2],\kappa_{1}=1,\quad\kappa_{2}=-\frac{\omega_{k}\hat{D}}{[\hat{D}(\omega_{k}s_{k}^{T}B_{k}s_{k})-(\hat{E})^{2}]},\\ (100)
κ3=ωkE^[D^(ωkskTBksk)(E^)2],κ4=(ωk)2skTBksk[D^(ωkskTBksk)(E^)2]\kappa_{3}=-\frac{\omega_{k}\hat{E}}{[\hat{D}(\omega_{k}s_{k}^{T}B_{k}s_{k})-(\hat{E})^{2}]},\quad\kappa_{4}=-\frac{(\omega_{k})^{2}s_{k}^{T}B_{k}s_{k}}{[\hat{D}(\omega_{k}s_{k}^{T}B_{k}s_{k})-(\hat{E})^{2}]} (101)

with D^\hat{D} and E^\hat{E} defined as

D^=[(ωkγk)(ykTBk1yk)γkωk],E^=(1ωkskTyk)=2ωkβk.\hat{D}=\bigg{[}(\omega_{k}-\gamma_{k})(y_{k}^{T}B_{k}^{-1}y_{k})-\frac{\gamma_{k}}{\omega_{k}}\bigg{]},\quad\hat{E}=(1-\omega_{k}s_{k}^{T}y_{k})=\frac{2\omega_{k}}{\beta_{k}}. (102)

We now observe that after applying some basic algebra, and recalling that BkB_{k} is positive definite, one can deduce that for all βk[0,+]\beta_{k}\in[0,+\infty], the following inequalities hold

(ωkγk)0,1γkωk,D^1,02ωkβk1.(\omega_{k}-\gamma_{k})\leq 0,\quad 1\leq\frac{\gamma_{k}}{\omega_{k}},\quad\hat{D}\leq-1,\quad 0\leq\frac{2\omega_{k}}{\beta_{k}}\leq 1. (103)

By minimizing the absolute value of the common denominator in κ2,κ3\kappa_{2},\kappa_{3}, and κ4\kappa_{4} using the inequalities above, we can obtain the bounds

1skTBkskκ20,0κ4ωkγk-\frac{1}{s_{k}^{T}B_{k}s_{k}}\leq\kappa_{2}\leq 0,\qquad 0\leq\kappa_{4}\leq\omega_{k}\leq\gamma_{k} (104)
0κ32ωkβk1skTBksk+2ωkβk2βkβk2.0\leq\kappa_{3}\leq\frac{2\omega_{k}}{\beta_{k}}\frac{1}{s_{k}^{T}B_{k}s_{k}+\frac{2\omega_{k}}{\beta_{k}}\frac{2}{\beta_{k}}}\leq\frac{\beta_{k}}{2}. (105)

As a result,

Tr(Bk+1)\displaystyle\operatorname{Tr}(B_{k+1}) Tr(Bk)+2κ3|ykTBksk|+κ4yk22\displaystyle\leq\operatorname{Tr}(B_{k})+2\kappa_{3}|y_{k}^{T}B_{k}s_{k}|+\kappa_{4}\left\|y_{k}\right\|_{2}^{2} (106)
Tr(Bk)+βkyk2λmax(Bk)sk2+γkyk22\displaystyle\leq\operatorname{Tr}(B_{k})+\beta_{k}\left\|y_{k}\right\|_{2}\lambda_{max}(B_{k})\left\|s_{k}\right\|_{2}+\gamma_{k}\left\|y_{k}\right\|_{2}^{2} (107)

and applying λmax(Bk)Tr(Bk)\lambda_{max}(B_{k})\leq\operatorname{Tr}(B_{k}) establishes (55). Similarly, referring to (80) reveals the upper bound

Tr(Hk+1)Tr(Hk)+2ωk|ykTHksk|+[γk+ωkγk(ykTHkyk)]sk22.\operatorname{Tr}(H_{k+1})\leq\operatorname{Tr}(H_{k})+2\omega_{k}|y_{k}^{T}H_{k}s_{k}|+\bigg{[}\gamma_{k}+\omega_{k}\gamma_{k}(y_{k}^{T}H_{k}y_{k})\bigg{]}\left\|s_{k}\right\|_{2}^{2}. (108)

To establish (54), we apply λmax(Hk)Tr(Hk)\lambda_{max}(H_{k})\leq\operatorname{Tr}(H_{k}) and ωkγk\omega_{k}\leq\gamma_{k} to the line above, and then factor. This completes the proof.

Appendix E Proof of Lemma 2

The angle condition ϕ(x)THg(x)>0\nabla\phi(x)^{T}Hg(x)>0 expands to

ϕ(x)THg(x)=ϕ(x)THϕ(x)+ϕ(x)THe(x)>0\nabla\phi(x)^{T}Hg(x)=\nabla\phi(x)^{T}H\nabla\phi(x)+\nabla\phi(x)^{T}He(x)>0 (109)

and by applying the Cauchy-Schwarz inequality and Assumption 1, we see that if

ψϕ(x)22>Ψϕ(x)2ϵ¯g\psi\left\|\nabla\phi(x)\right\|_{2}^{2}>\Psi\left\|\nabla\phi(x)\right\|_{2}\bar{\epsilon}_{g} (110)

then ϕ(x)THg(x)>0\nabla\phi(x)^{T}Hg(x)>0. Contrapositively, if ϕ(x)THg(x)0\nabla\phi(x)^{T}Hg(x)\leq 0 then

ϕ(x)2Ψϵ¯gψ.\left\|\nabla\phi(x)\right\|_{2}\leq\frac{\Psi\bar{\epsilon}_{g}}{\psi}. (111)

As ϕ\phi is m-strongly convex due to Assumption 3, we have

ϕϕ(x)+minv{ϕ(x)Tv+m2v22}=ϕ(x)12mϕ(x)22.\phi^{\star}\geq\phi(x)+\min_{v}\bigg{\{}\nabla\phi(x)^{T}v+\frac{m}{2}\left\|v\right\|_{2}^{2}\bigg{\}}=\phi(x)-\frac{1}{2m}\left\|\nabla\phi(x)\right\|_{2}^{2}. (112)

Squaring (111) and then combining it with (112) gives 𝒩1(ψ,Ψ)\mathcal{N}_{1}(\psi,\Psi), completing the proof.

Appendix F Proof of Theorem 5.2

As ϕC2\phi\in C^{2} by Assumption 3, applying Taylor’s theorem and using (58) and strong convexity gives

ϕk+1\displaystyle\phi_{k+1} =ϕk+ϕkT[xk+1xk]+12[xk+1xk]T2ϕ(u)[xk+1xk]\displaystyle=\phi_{k}+\nabla\phi_{k}^{T}[x_{k+1}-x_{k}]+\frac{1}{2}[x_{k+1}-x_{k}]^{T}\nabla^{2}\phi(u)[x_{k+1}-x_{k}]
ϕkαϕkTHkgk+α2M2Hkgk22\displaystyle\leq\phi_{k}-\alpha\nabla\phi_{k}^{T}H_{k}g_{k}+\frac{\alpha^{2}M}{2}\left\|H_{k}g_{k}\right\|_{2}^{2}

where uu is some convex combination of xk+1x_{k+1} and xkx_{k}. Proceeding, note that the smallest 𝒩1\mathcal{N}_{1} from Lemma 2 occurs when ψ=Ψ\psi=\Psi, and in this case ϕkTgk>0\nabla\phi_{k}^{T}g_{k}>0 if xk𝒩1x_{k}\notin\mathcal{N}_{1}. Hence, for all possible choices of 𝒩1\mathcal{N}_{1} it is true that ϕkTgk>0\nabla\phi_{k}^{T}g_{k}>0 if xk𝒩1x_{k}\notin\mathcal{N}_{1}. Combining this with (59) gives

ϕkTHkgkψϕkTgk>0\nabla\phi_{k}^{T}H_{k}g_{k}\geq\psi\nabla\phi_{k}^{T}g_{k}>0 (113)

if xk𝒩1x_{k}\notin\mathcal{N}_{1}. With (113) in hand, continuing to bound terms gives

ϕk+1\displaystyle\phi_{k+1} ϕkαψϕkT[ϕk+ek]+α2Ψ2M2ϕk+ek22\displaystyle\leq\phi_{k}-\alpha\psi\nabla\phi_{k}^{T}[\nabla\phi_{k}+e_{k}]+\frac{\alpha^{2}\Psi^{2}M}{2}\left\|\nabla\phi_{k}+e_{k}\right\|_{2}^{2}
=ϕkαΨ(ψΨαΨM2)ϕk22αΨ(ψΨαΨM)ϕkTek+α2Ψ2M2ek22\displaystyle=\phi_{k}-\alpha\Psi\bigg{(}\frac{\psi}{\Psi}-\frac{\alpha\Psi M}{2}\bigg{)}\left\|\nabla\phi_{k}\right\|_{2}^{2}-\alpha\Psi\bigg{(}\frac{\psi}{\Psi}-\alpha\Psi M\bigg{)}\nabla\phi_{k}^{T}e_{k}+\frac{\alpha^{2}\Psi^{2}M}{2}\left\|e_{k}\right\|_{2}^{2}
ϕkαΨ(ψΨαΨM2)ϕk22+αΨ(ψΨαΨM)ϕk2ek2+α2Ψ2M2ek22\displaystyle\leq\phi_{k}-\alpha\Psi\bigg{(}\frac{\psi}{\Psi}-\frac{\alpha\Psi M}{2}\bigg{)}\left\|\nabla\phi_{k}\right\|_{2}^{2}+\alpha\Psi\bigg{(}\frac{\psi}{\Psi}-\alpha\Psi M\bigg{)}\left\|\nabla\phi_{k}\right\|_{2}\left\|e_{k}\right\|_{2}+\frac{\alpha^{2}\Psi^{2}M}{2}\left\|e_{k}\right\|_{2}^{2}
ϕkαΨ(ψΨαΨM2)ϕk22+αΨ(ψΨαΨM)[12ϕk22+12ek22]\displaystyle\leq\phi_{k}-\alpha\Psi\bigg{(}\frac{\psi}{\Psi}-\frac{\alpha\Psi M}{2}\bigg{)}\left\|\nabla\phi_{k}\right\|_{2}^{2}+\alpha\Psi\bigg{(}\frac{\psi}{\Psi}-\alpha\Psi M\bigg{)}\bigg{[}\frac{1}{2}\left\|\nabla\phi_{k}\right\|_{2}^{2}+\frac{1}{2}\left\|e_{k}\right\|_{2}^{2}\bigg{]}
+α2Ψ2M2ek22\displaystyle\qquad+\frac{\alpha^{2}\Psi^{2}M}{2}\left\|e_{k}\right\|_{2}^{2}

where the last inequality follows from expanding

0(12ϕk212ek2)2=12ϕk22ϕk2ek2+12ek220\leq\bigg{(}\frac{1}{\sqrt{2}}\left\|\nabla\phi_{k}\right\|_{2}-\frac{1}{\sqrt{2}}\left\|e_{k}\right\|_{2}\bigg{)}^{2}=\frac{1}{2}\left\|\nabla\phi_{k}\right\|_{2}^{2}-\left\|\nabla\phi_{k}\right\|_{2}\left\|e_{k}\right\|_{2}+\frac{1}{2}\left\|e_{k}\right\|_{2}^{2} (114)

and using (60). Simplifying the last inequality reveals that

ϕk+1ϕkαψ2ϕk22+αψ2ek22.\phi_{k+1}\leq\phi_{k}-\frac{\alpha\psi}{2}\left\|\nabla\phi_{k}\right\|_{2}^{2}+\frac{\alpha\psi}{2}\left\|e_{k}\right\|_{2}^{2}. (115)

Since ϕ\phi is m-strongly convex by Assumption 3, we can apply

ϕk222m(ϕkϕ)\left\|\nabla\phi_{k}\right\|_{2}^{2}\geq 2m(\phi_{k}-\phi^{\star}) (116)

as shown in the proof of Lemma 2 (see Appendix E), which combined with (115) and Assumption 1 gives

ϕk+1ϕkαψm(ϕkϕ)+αψ2(Ψϵ¯gψ)2.\phi_{k+1}\leq\phi_{k}-\alpha\psi m(\phi_{k}-\phi^{\star})+\frac{\alpha\psi}{2}\bigg{(}\frac{\Psi\bar{\epsilon}_{g}}{\psi}\bigg{)}^{2}. (117)

Subtracting ϕ\phi^{\star} from both sides, we get

ϕk+1ϕ(1αψm)(ϕkϕ)+αψ2(Ψϵ¯gψ)2\phi_{k+1}-\phi^{\star}\leq(1-\alpha\psi m)(\phi_{k}-\phi^{\star})+\frac{\alpha\psi}{2}\bigg{(}\frac{\Psi\bar{\epsilon}_{g}}{\psi}\bigg{)}^{2} (118)

which, by subtracting 12m(Ψϵ¯gψ)2\frac{1}{2m}(\frac{\Psi\bar{\epsilon}_{g}}{\psi})^{2} from both sides and simplifying, gives

ϕk+1ϕ12m(Ψϵ¯gψ)2\displaystyle\phi_{k+1}-\phi^{\star}-\frac{1}{2m}\bigg{(}\frac{\Psi\bar{\epsilon}_{g}}{\psi}\bigg{)}^{2} (1αψm)(ϕkϕ)+αψ2(Ψϵ¯gψ)212m(Ψϵ¯gψ)2\displaystyle\leq(1-\alpha\psi m)(\phi_{k}-\phi^{\star})+\frac{\alpha\psi}{2}\bigg{(}\frac{\Psi\bar{\epsilon}_{g}}{\psi}\bigg{)}^{2}-\frac{1}{2m}\bigg{(}\frac{\Psi\bar{\epsilon}_{g}}{\psi}\bigg{)}^{2}
=(1αψm)(ϕkϕ)+(αψm1)12m(Ψϵ¯gψ)2\displaystyle=(1-\alpha\psi m)(\phi_{k}-\phi^{\star})+(\alpha\psi m-1)\frac{1}{2m}\bigg{(}\frac{\Psi\bar{\epsilon}_{g}}{\psi}\bigg{)}^{2}
=(1αψm)(ϕk[ϕ+12m(Ψϵ¯gψ)2])\displaystyle=(1-\alpha\psi m)\bigg{(}\phi_{k}-\bigg{[}\phi^{\star}+\frac{1}{2m}\bigg{(}\frac{\Psi\bar{\epsilon}_{g}}{\psi}\bigg{)}^{2}\bigg{]}\bigg{)}

thus establishing the Q-linear result (61). We obtain (62) by recursively applying the worst case bound in (61), noting that in the worst case if x0𝒩1x_{0}\notin\mathcal{N}_{1}, then the sequence of iterates {xk}\{x_{k}\} remains outside of 𝒩1\mathcal{N}_{1}, only approaching 𝒩1\mathcal{N}_{1} in the limit kk\rightarrow\infty.

Appendix G Extended Numerical Experiments

Tables 4 and 5 compare the performance of SP-BFGS vs. BFGS on the 3232 CUTEst test problems with only gradient noise present (i.e. ϵ¯f=0\bar{\epsilon}_{f}=0). Gradient noise was generated using ϵ¯g=104ϕ(x0)2\bar{\epsilon}_{g}=10^{-4}\left\|\nabla\phi(x^{0})\right\|_{2}, where the starting point x0x^{0} varies by CUTEst problem, to ensure that noise does not initially dominate gradient evaluations. Overall, SP-BFGS outperforms BFGS on approximately 80%80\% of the CUTEst problems with only gradient noise present, and performs at least as well as BFGS on approximately 95%95\% of these problems.

SP-BFGS With Gradient Noise Only
Problem Dim. Mean(Δopt\Delta_{opt}) Median(Δopt\Delta_{opt}) Min(Δopt\Delta_{opt}) Max(Δopt\Delta_{opt}) s2(Δopt)s^{2}(\Delta_{opt})
ARGTRGLS 200 -9.6E-02 -9.6E-02 -1.0E-01 -8.5E-02 1.9E-05
ARWHEAD 500 -2.8E+00 -2.8E+00 -2.8E+00 -2.7E+00 1.7E-03
BEALE 2 -1.4E+01 -1.4E+01 -1.6E+01 -7.0E+00 4.1E+00
BOX3 3 -6.7E+00 -6.5E+00 -1.1E+01 -6.3E+00 6.3E-01
BOXPOWER 100 -2.7E+00 -2.7E+00 -3.1E+00 -2.3E+00 4.6E-02
BROWNBS 2 -4.5E+00 -5.9E+00 -8.0E+00 1.1E+00 8.4E+00
BROYDNBDLS 50 -5.4E+00 -5.4E+00 -5.9E+00 -5.0E+00 3.4E-02
CHAINWOO 100 1.6E+00 1.7E+00 7.6E-02 2.1E+00 1.5E-01
CHNROSNB 50 -3.2E+00 -3.0E+00 -4.9E+00 -2.6E+00 4.5E-01
COATING 134 3.4E-01 3.4E-01 1.8E-01 4.2E-01 3.1E-03
COOLHANSLS 9 -9.4E-01 -9.4E-01 -1.2E+00 -4.8E-01 4.2E-02
CUBE 2 -2.7E+00 -2.5E+00 -5.8E+00 -1.7E+00 7.5E-01
CYCLOOCFLS 20 -7.4E+00 -7.2E+00 -9.3E+00 -5.9E+00 8.1E-01
EXTROSNB 10 -5.1E+00 -5.2E+00 -5.3E+00 -4.7E+00 3.0E-02
FMINSRF2 64 -8.6E+00 -8.7E+00 -8.8E+00 -8.1E+00 3.4E-02
GENHUMPS 5 -2.7E+00 -2.6E+00 -5.2E+00 -1.0E+00 1.1E+00
GENROSE 5 -1.2E+01 -1.2E+01 -1.4E+01 -8.9E+00 2.0E+00
HEART6LS 6 1.0E+00 1.2E+00 -1.8E+00 1.2E+00 5.0E-01
HELIX 3 -5.7E+00 -5.9E+00 -8.7E+00 -3.4E+00 1.4E+00
MANCINO 30 -1.0E+00 -1.0E+00 -1.4E+00 -7.0E-01 3.7E-02
METHANB8LS 31 -3.6E+00 -3.6E+00 -4.0E+00 -3.3E+00 3.1E-02
MODBEALE 200 1.2E+00 1.2E+00 3.8E-01 1.9E+00 1.8E-01
NONDIA 10 -3.5E-03 -3.6E-03 -4.3E-03 -1.1E-03 6.6E-07
POWELLSG 4 -5.7E+00 -5.3E+00 -9.3E+00 -4.0E+00 1.6E+00
POWER 10 -3.5E+00 -3.5E+00 -4.4E+00 -2.8E+00 1.3E-01
ROSENBR 2 -1.1E+01 -1.2E+01 -1.4E+01 -5.1E+00 4.4E+00
ROSENBRTU 2 -1.9E+01 -1.9E+01 -2.2E+01 -1.7E+01 1.1E+00
SBRYBND 500 3.9E+00 3.9E+00 3.9E+00 3.9E+00 2.0E-05
SINEVAL 2 -1.3E+01 -1.3E+01 -1.8E+01 -1.1E+01 3.3E+00
SNAIL 2 -1.5E+01 -1.6E+01 -1.8E+01 -1.2E+01 1.6E+00
SROSENBR 1000 -9.7E-01 -9.7E-01 -1.3E+00 -4.8E-01 3.2E-02
VIBRBEAM 8 1.6E+00 1.6E+00 1.2E+00 2.8E+00 9.1E-02
Table 4: Performance of SP-BFGS on 3232 selected CUTEst test problems with noise added to gradient evaluations only (i.e. ϵ¯f=0\bar{\epsilon}_{f}=0). The number of objective function evaluations is fixed at 20002000. Δoptlog10(ϕbestϕ)\Delta_{opt}\coloneqq\log_{10}(\phi_{best}-\phi^{\star}) measures the optimality gap, where ϕbest\phi_{best} denotes the smallest value of the true function ϕ\phi measured at any point during an algorithm run. Statistics are calculated from a sample of 3030 runs per algorithm, and the Dim. column gives the problem dimension. The SPBFGS penalty parameter was set as βk=108ϵ¯gsk2+1010\beta_{k}=\frac{10^{8}}{\bar{\epsilon}_{g}}\left\|s_{k}\right\|_{2}+10^{-10}. For each problem, gradient noise was generated using ϵ¯g=104ϕ(x0)2\bar{\epsilon}_{g}=10^{-4}\left\|\nabla\phi(x^{0})\right\|_{2}, where the starting point x0x^{0} varies by CUTEst problem.
BFGS With Gradient Noise Only
Problem Dim. Mean(Δopt\Delta_{opt}) Median(Δopt\Delta_{opt}) Min(Δopt\Delta_{opt}) Max(Δopt\Delta_{opt}) s2(Δopt)s^{2}(\Delta_{opt})
ARGTRGLS 200 -9.2E-02 -9.3E-02 -9.9E-02 -8.2E-02 1.7E-05
ARWHEAD 500 -2.5E+00 -2.5E+00 -2.6E+00 -2.5E+00 7.6E-04
BEALE 2 -8.3E+00 -8.5E+00 -1.2E+01 -5.8E+00 3.9E+00
BOX3 3 -6.4E+00 -6.4E+00 -6.6E+00 -6.3E+00 2.3E-03
BOXPOWER 100 -2.8E+00 -2.8E+00 -3.3E+00 -2.4E+00 6.1E-02
BROWNBS 2 6.8E-02 1.0E+00 -8.2E+00 3.6E+00 1.0E+01
BROYDNBDLS 50 -5.1E+00 -5.1E+00 -5.3E+00 -4.9E+00 1.5E-02
CHAINWOO 100 1.7E+00 1.8E+00 1.1E+00 2.2E+00 5.8E-02
CHNROSNB 50 -2.9E+00 -2.7E+00 -4.5E+00 -2.1E+00 3.8E-01
COATING 134 3.6E-01 3.7E-01 2.1E-01 4.2E-01 2.6E-03
COOLHANSLS 9 -5.6E-01 -6.4E-01 -1.3E+00 1.9E-01 1.9E-01
CUBE 2 -1.1E+00 -1.1E+00 -1.8E+00 -9.6E-01 5.6E-02
CYCLOOCFLS 20 -6.5E+00 -6.5E+00 -8.3E+00 -5.1E+00 5.4E-01
EXTROSNB 10 -5.1E+00 -5.1E+00 -5.3E+00 -4.9E+00 8.1E-03
FMINSRF2 64 -8.2E+00 -8.2E+00 -8.7E+00 -7.3E+00 1.4E-01
GENHUMPS 5 -1.5E+00 -1.2E+00 -4.0E+00 -2.8E-01 8.4E-01
GENROSE 5 -6.8E+00 -6.7E+00 -8.7E+00 -5.9E+00 4.8E-01
HEART6LS 6 1.2E+00 1.2E+00 1.2E+00 1.2E+00 1.9E-04
HELIX 3 -4.8E+00 -4.6E+00 -8.1E+00 -2.6E+00 2.1E+00
MANCINO 30 -8.3E-01 -8.8E-01 -1.2E+00 -3.3E-01 5.3E-02
METHANB8LS 31 -3.5E+00 -3.4E+00 -3.9E+00 -3.3E+00 2.8E-02
MODBEALE 200 1.0E+00 1.1E+00 -6.2E-01 2.1E+00 3.6E-01
NONDIA 10 1.2E-03 1.3E-03 -4.4E-03 1.3E-02 2.2E-05
POWELLSG 4 -5.3E+00 -5.0E+00 -8.0E+00 -3.6E+00 1.6E+00
POWER 10 -3.4E+00 -3.4E+00 -4.3E+00 -2.8E+00 1.4E-01
ROSENBR 2 -6.1E+00 -5.9E+00 -1.0E+01 -3.7E+00 2.9E+00
ROSENBRTU 2 -1.5E+01 -1.5E+01 -1.8E+01 -1.4E+01 1.6E+00
SBRYBND 500 3.9E+00 3.9E+00 3.9E+00 3.9E+00 2.0E-05
SINEVAL 2 -1.2E+01 -1.3E+01 -1.7E+01 -8.5E+00 4.0E+00
SNAIL 2 -1.1E+01 -1.1E+01 -1.6E+01 -8.2E+00 3.5E+00
SROSENBR 1000 -9.1E-01 -8.8E-01 -1.3E+00 -5.1E-01 3.1E-02
VIBRBEAM 8 1.7E+00 1.6E+00 1.4E+00 2.6E+00 1.0E-01
Table 5: Performance of BFGS on 3232 selected CUTEst test problems with noise added to gradient evaluations only (i.e. ϵ¯f=0\bar{\epsilon}_{f}=0). The number of objective function evaluations is fixed at 20002000. Δoptlog10(ϕbestϕ)\Delta_{opt}\coloneqq\log_{10}(\phi_{best}-\phi^{\star}) measures the optimality gap, where ϕbest\phi_{best} denotes the smallest value of the true function ϕ\phi measured at any point during an algorithm run. Statistics are calculated from a sample of 3030 runs per algorithm, and the Dim. column gives the problem dimension. For each problem, gradient noise was generated using ϵ¯g=104ϕ(x0)2\bar{\epsilon}_{g}=10^{-4}\left\|\nabla\phi(x^{0})\right\|_{2}, where the starting point x0x^{0} varies by CUTEst problem.