
Brian Irwin · Eldad Haber
Department of Earth, Ocean and Atmospheric Sciences, The University of British Columbia, Vancouver, BC, Canada
E-mail: {birwin, haber}@eoas.ubc.ca

Secant Penalized BFGS: A Noise Robust Quasi-Newton Method Via Penalizing The Secant Condition

Brian Irwin (corresponding author) · Eldad Haber
(Received: date / Accepted: date)
Abstract

In this paper, we introduce a new variant of the BFGS method designed to perform well when gradient measurements are corrupted by noise. We show that treating the secant condition with a penalty method approach motivated by regularized least squares estimation allows one to smoothly interpolate between updating the inverse Hessian approximation with the original BFGS update formula and not updating the inverse Hessian approximation at all. Furthermore, we find that the curvature condition is smoothly relaxed as the interpolation moves towards not updating the inverse Hessian approximation, disappearing entirely when the inverse Hessian approximation is not updated. These developments lead to a method we refer to as secant penalized BFGS (SP-BFGS), which relaxes the secant condition based on the amount of noise in the gradient measurements. SP-BFGS provides a means of incrementally updating the new inverse Hessian approximation with a controlled amount of bias towards the previous inverse Hessian approximation, replacing the overwriting nature of the original BFGS update with an averaging nature that resists the destructive effects of noise and can cope with negative curvature measurements. We discuss the theoretical properties of SP-BFGS, including convergence when minimizing strongly convex functions in the presence of uniformly bounded noise. Finally, we present extensive numerical experiments using over 30 problems from the CUTEst test problem set that demonstrate the superior performance of SP-BFGS compared to BFGS in the presence of both noisy function and gradient evaluations.

Keywords:
Quasi-Newton Methods · Secant Condition · Penalty Methods · Least Squares Estimation · Measurement Error · Noise Robust Optimization

1 Introduction

Over the past 50 years, quasi-Newton methods have proved to be some of the most economical and effective methods for a variety of optimization problems. Originally conceived to provide some of the advantages of second order methods without the full cost of Newton’s method, quasi-Newton methods, which are also referred to as variable metric methods Johnson2019_quasiNewton_notes , are based on the observation that by differencing observed gradients, one can calculate approximate curvature information. This approximate curvature information can then be used to improve the speed of convergence, especially in comparison to first order methods, such as gradient descent. There are currently a variety of different quasi-Newton methods, with the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method 10.1093/imamat/6.1.76 ; 10.1093/comjnl/13.3.317 ; 10.2307/2004873 ; 10.2307/2004840 almost certainly being the best known quasi-Newton method.

Modern quasi-Newton methods were developed for problems involving the optimization of smooth functions without constraints. The BFGS method is the best known quasi-Newton method because in practice it has demonstrated superior performance due to its very effective self-correcting properties Nocedal2006 . Accordingly, BFGS has since been extended to handle box constraints doi:10.1137/0916069 , and shown to be effective even for some nonsmooth optimization problems Lewis2013NonsmoothOV . Furthermore, a limited memory version of BFGS known as L-BFGS Liu1989 has become a favourite algorithm for solving optimization problems with a very large number of variables, as it avoids directly storing approximate inverse Hessian matrices. However, BFGS and its relatives were not designed to explicitly handle noisy optimization problems, and noise can unacceptably degrade the performance of these methods.

The authors of doi:10.1137/140954362 make the important observation that quasi-Newton updating is inherently an overwriting process rather than an averaging process. Fundamentally, differencing noisy gradients can produce harmful effects because the resulting approximate curvature information may be inaccurate, and this inaccurate curvature information may overwrite accurate curvature information. Newton's method can naturally be viewed as a local rescaling of coordinates so that the rescaled problem is better conditioned than the original problem. Quasi-Newton methods attempt to perform a similar rescaling, but instead of using the (inverse) Hessian matrix to obtain curvature information for the rescaling, they use differences of gradients to obtain curvature information. Thus, it should be unsurprising that inaccurate curvature information obtained from differencing noisy gradients can be problematic because it means the resulting rescaling of the problem can be poor, and the conditioning of the rescaled problem could be even worse than the conditioning of the original problem.

With the above in mind, several works have dealt with how to improve the performance of quasi-Newton methods in the presence of noise. Many recent works focus on the empirical risk minimization (ERM) problem, which is ubiquitous in machine learning. For example, in doi:10.1137/140954362 the authors propose a technique designed for the stochastic approximation (SA) regime that employs subsampled Hessian-vector products to collect curvature information pointwise and at spaced intervals, in contrast to the classical approach of computing the difference of gradients at each iteration. This work is built upon in pmlr-v51-moritz16 , where the authors present a stochastic L-BFGS algorithm that draws upon the variance reduction approach of Johnson2013AcceleratingSG . In doi:10.1137/15M1053141 , the authors outline a stochastic damped limited-memory BFGS (SdLBFGS) method that employs damping techniques used in sequential quadratic programming (SQP). A stochastic block BFGS method that updates the approximate inverse Hessian matrix using a sketch of the Hessian matrix is proposed in pmlr-v48-gower16 . Further work on stochastic L-BFGS algorithms, including convergence results, can be found in pmlr-v2-schraudolph07a ; 10.5555/2789272.2912100 ; Zhao2018StochasticLI ; 8626766 .

Despite the importance of the ERM problem due to the current prevalence of machine learning, there are still a variety of important noisy optimization problems that arise in other contexts. In engineering design, numerical simulations are often employed in place of conducting costly, if even feasible, physical experiments. In this context, one tries to find optimal design parameters using the numerical simulation instead of physical experiments. Some examples from aerospace engineering, including interplanetary trajectory and wing design, can be found in Fasano2019 ; doi:10.1002/0470855487 ; doi:10.2514/1.J057294 . Examples from materials engineering include stable composite design doi:10.1177/1464420716664921 and ternary alloy composition Graf2017 , amongst others Munoz-Rojas2016 , while examples from electrical engineering include power system operation doi:10.1002/9780470466971 , hardware verification Gal_2020_HowToCatchALion , and antenna design Koziel2014 . Noise is often an unavoidable property of such numerical simulations, as the simulations can include stochastic internal components, and floating point arithmetic vulnerable to roundoff error. Apart from the analysis of the BFGS method with bounded errors in doi:10.1137/19M1240794 , there is relatively little work on the behaviour of quasi-Newton methods in the presence of general bounded noise. As optimizing noisy numerical simulations does not always fit the framework of the ERM problem, analyses of the behaviour of quasi-Newton methods in the presence of general bounded noise are of practical value when optimizing numerical simulations.

1.1 Contributions

Noise is inevitably introduced into machine learning problems by the approximations required to handle large datasets, and into numerical simulations by the effects of finite precision arithmetic and by simulator components that are inherently stochastic. In this paper, we return to the fundamental theory underlying the design of quasi-Newton methods, which allows us to design a new variant of the BFGS method that explicitly handles the corrupting effects of noise. We do this as follows:

  1. In Section 2, we review the setup and derivation of the original BFGS method.

  2. In Section 3, motivated by regularized least squares estimation, we treat the secant condition of BFGS with a penalty method. This creates a new BFGS update formula that we refer to as secant penalized BFGS (SP-BFGS), which we show reduces to the original BFGS update formula in a limiting case, as expected.

  3. In Section 4, we present an algorithmic framework for practically implementing SP-BFGS updating. We also discuss implementation details, including how to perform a line search and choose the penalty parameter in the presence of noise.

  4. In Section 5, we discuss the theoretical properties of SP-BFGS, including how the penalty parameter influences the eigenvalues of the approximate inverse Hessian. This allows us to show that under appropriate conditions SP-BFGS iterations are guaranteed to converge linearly to a neighborhood of the global minimizer when minimizing strongly convex functions in the presence of uniformly bounded noise.

  5. In Section 6, we study the empirical performance of SP-BFGS updating compared to BFGS updating by performing extensive numerical experiments with both convex and nonconvex objective functions corrupted by function and gradient noise. Results from a diverse set of over 30 problems from the CUTEst test problem set demonstrate that intelligently implemented SP-BFGS updating frequently outperforms BFGS updating in the presence of noise.

  6. Finally, Section 7 concludes the paper and outlines directions for further work.

2 Mathematical Background

In this section, as preliminaries to the main results of this paper, we review the setup and derivation of the original BFGS method.

2.1 BFGS Setup

The BFGS method was originally designed to solve the following unconstrained optimization problem

\min_{x} \big\{ \phi(x) \big\}   (1)

with $x \in \mathbb{R}^{n}$, $\phi: \mathbb{R}^{n} \mapsto \mathbb{R}$, and $\phi$ being a smooth twice continuously differentiable and nonnoisy function. Below, we use the notational conventions of Nocedal2006 , including $\phi_{k} = \phi(x_{k})$. We begin by using the Taylor expansion of $\phi$ to build a local quadratic model $m_{k}$ of the objective function $\phi$ at the $k^{th}$ iterate $x_{k}$ of the optimization procedure

\phi(x_{k} + p) \approx \phi_{k} + \nabla\phi_{k}^{T} p + \frac{1}{2} p^{T} B_{k} p = m_{k}(p)   (2)

where $B_{k}$ is an $n \times n$ symmetric positive definite matrix that approximates the Hessian matrix (i.e. $B_{k} \approx \nabla^{2}\phi_{k}$). By setting the gradient of $m_{k}$ to zero, we see that the unique minimizer $p_{k}$ of this local quadratic model is

p_{k} = -B_{k}^{-1} \nabla\phi_{k}   (3)

and thus it is natural to update the next iterate $x_{k+1}$ as

x_{k+1} = x_{k} + \alpha_{k} p_{k}   (4)

where $\alpha_{k}$ is the step size along the direction $p_{k}$, which is often chosen using a line search.

To avoid computing $B_{k}$ from scratch at each iteration $k$, we use the curvature information from recent gradient evaluations to update $B_{k}$, and thus relatively economically form $B_{k+1}$. A Taylor expansion of $\nabla\phi$ reveals

\nabla\phi(x_{k} + p) \approx \nabla\phi_{k} + \nabla^{2}\phi_{k} p   (5)

and so it is reasonable to require that the new approximate Hessian $B_{k+1}$ satisfies

\nabla\phi_{k+1} = \nabla\phi_{k} + \alpha_{k} B_{k+1} p_{k}   (6)

which rearranges to

B_{k+1} \alpha_{k} p_{k} = \nabla\phi_{k+1} - \nabla\phi_{k} .   (7)

Now, define the two new quantities $s_{k}$ and $y_{k}$ as

s_{k} \coloneqq x_{k+1} - x_{k} = \alpha_{k} p_{k} ,   (8a)
y_{k} \coloneqq \nabla\phi_{k+1} - \nabla\phi_{k} .   (8b)

Thus, we arrive at (9), which is known as the secant condition

B_{k+1} s_{k} = y_{k} .   (9)

In words, the secant condition dictates that the new approximate Hessian $B_{k+1}$ must map the measured displacement $s_{k}$ into the measured difference of gradients $y_{k}$. If we denote the approximate inverse Hessian $H_{k} = B_{k}^{-1} \approx \nabla^{2}\phi_{k}^{-1}$, then the secant condition can be equivalently expressed as (10)

H_{k+1} y_{k} = s_{k} .   (10)

As $H_{k+1}$ is not yet uniquely determined, to obtain the BFGS update formula, we impose a minimum norm restriction. Specifically, we choose $H_{k+1}$ to be the solution of the following quadratic program over matrices

\min_{H} \bigg\{ \frac{1}{2} \left\| W^{1/2} (H - H_{k}) W^{1/2} \right\|_{F}^{2} \bigg\} \quad \text{s.t.} \quad H = H^{T}, \quad H y_{k} = s_{k}   (11)

where $\|\cdot\|_{F}$ denotes the Frobenius norm, and $W^{1/2}$ the principal square root (see MatrixAnalysisHornJohnson or a similar reference) of a symmetric positive definite weight matrix $W$ satisfying

W s_{k} = y_{k} .   (12)

As we will see, choosing the weight matrix $W$ to satisfy (12) ensures that the resulting optimization method is scale invariant. The weight matrix $W$ can be chosen to be any symmetric positive definite matrix satisfying (12), and the specific choice of $W$ is not of great importance, as $W$ will not appear directly in the main results of this paper. However, as a concrete example from Nocedal2006 , one could assume $W = \bar{G}_{k}$, where $\bar{G}_{k}$ is the average Hessian defined by

\bar{G}_{k} = \int_{0}^{1} \nabla^{2}\phi(x_{k} + t \alpha_{k} p_{k}) \, dt .   (13)
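
As a simple illustration of the secant condition, consider the special case of a quadratic objective: if $\phi(x) = c + b^{T}x + \frac{1}{2}x^{T}Ax$ with $A$ constant and symmetric, then $\nabla\phi(x) = b + Ax$, so

y_{k} = \nabla\phi_{k+1} - \nabla\phi_{k} = A(x_{k+1} - x_{k}) = A s_{k}

and the secant condition (9) is satisfied exactly by $B_{k+1} = A$. For general smooth $\phi$, (9) instead asks $B_{k+1}$ to reproduce the average curvature of $\phi$ along $s_{k}$, in the sense of (12) and (13).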

2.2 Solving For The BFGS Update

To solve the quadratic program given by (11), we set up a Lagrangian $\mathcal{L}(H, q, \Gamma)$ involving the constraints. Recalling that

\left\| W^{1/2} (H - H_{k}) W^{1/2} \right\|^{2}_{F} = \operatorname{Tr}\big( W (H - H_{k}) W (H - H_{k})^{T} \big) ,   (14)

this gives the Lagrangian defined by (15) below

\mathcal{L} = \frac{1}{2} \operatorname{Tr}\big( W (H - H_{k}) W (H - H_{k})^{T} \big) + \operatorname{Tr}\big( (H y_{k} - s_{k}) q^{T} \big) + \operatorname{Tr}\big( \Gamma (H - H^{T}) \big)   (15)

where $q$ is a vector of Lagrange multipliers associated with the secant condition, and $\Gamma$ is a matrix of Lagrange multipliers associated with the symmetry condition. Taking the derivative of the Lagrangian $\mathcal{L}(H, q, \Gamma)$ with respect to the matrix $H$ yields

\frac{\partial \mathcal{L}(H, q, \Gamma)}{\partial H} = W (H - H_{k}) W + q y_{k}^{T} + \Gamma^{T} - \Gamma   (16)

and so we have the Karush-Kuhn-Tucker (KKT) system defined by the three equations (17a), (17b), and (17c) below

W (H - H_{k}) W + q y_{k}^{T} + \Gamma^{T} - \Gamma = 0 ,   (17a)
H y_{k} - s_{k} = 0 ,   (17b)
H - H^{T} = 0 .   (17c)

For brevity, we omit the details of the solution of the KKT system defined above because it is a limiting case of the system solved in Theorem 3.1. For an alternative geometric solution technique, we refer the interested reader to Section 2 of doi:10.1080/10556780802367205 . The minimizer $H^{*} = H_{k+1}$ is given by the well known BFGS update formula

H_{k+1} = \bigg( I - \frac{s_{k} y_{k}^{T}}{s_{k}^{T} y_{k}} \bigg) H_{k} \bigg( I - \frac{y_{k} s_{k}^{T}}{s_{k}^{T} y_{k}} \bigg) + \frac{s_{k} s_{k}^{T}}{s_{k}^{T} y_{k}}   (18)

which, if we define the curvature parameter $\rho_{k} = \frac{1}{s_{k}^{T} y_{k}}$, can be equivalently written as

H_{k+1} = \big( I - \rho_{k} s_{k} y_{k}^{T} \big) H_{k} \big( I - \rho_{k} y_{k} s_{k}^{T} \big) + \rho_{k} s_{k} s_{k}^{T} .   (19)

Applying the Sherman-Morrison-Woodbury formula (see 10.2307/2030425 ) to the BFGS update formula immediately above, one can also write the BFGS update in terms of the approximate Hessian $B_{k} = H_{k}^{-1}$ instead of the approximate inverse Hessian. Again, for brevity, the details are omitted because they are a special case of Theorem 3.2 shown later. The result is

B_{k+1} = B_{k} - \frac{B_{k} s_{k} s_{k}^{T} B_{k}}{s_{k}^{T} B_{k} s_{k}} + \frac{y_{k} y_{k}^{T}}{s_{k}^{T} y_{k}} = B_{k} - \frac{B_{k} s_{k} s_{k}^{T} B_{k}}{s_{k}^{T} B_{k} s_{k}} + \rho_{k} y_{k} y_{k}^{T} .   (20)

To ensure the updated approximate Hessian $B_{k+1}$ is positive definite, we must enforce that

s_{k}^{T} B_{k+1} s_{k} > 0 .   (21)

Substituting $B_{k+1} s_{k} = y_{k}$ from the secant condition, the condition (21) becomes

s_{k}^{T} y_{k} > 0   (22)

which is known as the curvature condition, as it is equivalent to

\frac{1}{\rho_{k}} > 0 .   (23)
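
To make the update concrete, the following Julia sketch implements the inverse Hessian update (19) with the curvature condition (22) as a guard. The function name bfgs_update, the dense matrix representation, and the choice to simply skip the update on failure are our own illustrative assumptions, not the authors' implementation.

    using LinearAlgebra

    # Minimal sketch of the BFGS inverse Hessian update (19).
    # H is the current inverse Hessian approximation H_k,
    # s = x_{k+1} - x_k, and y = ∇ϕ_{k+1} - ∇ϕ_k.
    function bfgs_update(H::AbstractMatrix, s::AbstractVector, y::AbstractVector)
        sty = dot(s, y)
        sty > 0 || return copy(H)        # curvature condition (22) failed: skip the update
        ρ = 1 / sty                      # curvature parameter ρ_k
        V = I - ρ * s * y'               # (I - ρ_k s_k y_k^T)
        return V * H * V' + ρ * s * s'   # update formula (19)
    end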

3 Derivation Of Secant Penalized BFGS

In this section, having reviewed the construction of the original BFGS method, we now show how treating the secant condition with a penalty method approach motivated by regularized least squares estimation allows one to generalize the original BFGS update.

3.1 Penalizing The Secant Condition

By applying a penalty method (see Chapter 17 of Nocedal2006 ) to the secant condition instead of directly enforcing the secant condition as a constraint, we obtain the problem

\min_{H} \bigg\{ \frac{1}{2} \left\| W^{1/2} (H - H_{k}) W^{1/2} \right\|_{F}^{2} + \frac{\beta_{k}}{2} \left\| W^{1/2} (H y_{k} - s_{k}) \right\|_{2}^{2} \bigg\} \quad \text{s.t.} \quad H = H^{T}   (24)

where $\beta_{k} \in [0, +\infty]$ is a penalty parameter that determines how strongly to penalize violations of the secant condition. As we will see, one recovers the solution to the constrained problem (11) in the limit $\beta_{k} = +\infty$, so $\beta_{k}$ can be intuitively thought of as the cost of violating the secant condition. By treating the symmetry constraint with a matrix $\Gamma$ of Lagrange multipliers again, we obtain the following Lagrangian

\mathcal{L} = \frac{1}{2} \operatorname{Tr}\big( W (H - H_{k}) W (H - H_{k})^{T} \big) + \frac{\beta_{k}}{2} \left\| W^{1/2} (H y_{k} - s_{k}) \right\|_{2}^{2} + \operatorname{Tr}\big( \Gamma (H - H^{T}) \big) .   (25)

Defining the residual associated with the secant condition as $r_{k}(H) \coloneqq H y_{k} - s_{k}$ and $u \coloneqq \beta_{k} W r_{k}$, the first order optimality conditions of (25) can be written as the system

W (H - H_{k}) W + u y_{k}^{T} + \Gamma^{T} - \Gamma = 0 ,   (26a)
H y_{k} - s_{k} - \frac{W^{-1} u}{\beta_{k}} = 0 ,   (26b)
H - H^{T} = 0 .   (26c)

Note that, as expected, in the limit $\beta_{k} = +\infty$, the system given by (26a), (26b), and (26c) reduces to the KKT system given by (17a), (17b), and (17c).

We now find an explicit closed form solution to the problem given by (24), which is given in Theorem 3.1.

Theorem 3.1 (SP-BFGS Update)

The update formula given by the minimizer $H^{*}$ of the problem defined by (24), which can be obtained by solving the system given by (26a), (26b), and (26c), is the SP-BFGS update

H_{k+1} = \big( I - \omega_{k} s_{k} y_{k}^{T} \big) H_{k} \big( I - \omega_{k} y_{k} s_{k}^{T} \big) + \omega_{k} \bigg[ \frac{\gamma_{k}}{\omega_{k}} + (\gamma_{k} - \omega_{k}) y_{k}^{T} H_{k} y_{k} \bigg] s_{k} s_{k}^{T}   (27)

where

\gamma_{k} = \frac{1}{(s_{k}^{T} y_{k} + \frac{1}{\beta_{k}})} , \quad \omega_{k} = \frac{1}{(s_{k}^{T} y_{k} + \frac{2}{\beta_{k}})} .   (28)
Proof

See Appendix A. ∎

At this point, a few comments are in order regarding the SP-BFGS update given by (27). First, observe that as $\beta_{k} \rightarrow +\infty$, we have that $\omega_{k} \rightarrow \rho_{k}$ and $\gamma_{k} \rightarrow \rho_{k}$. As a result, when $\beta_{k} = +\infty$, one recovers the original BFGS update, as expected. Second, also observe that as $\beta_{k} \rightarrow 0$, we have that $\omega_{k} \rightarrow 0$ and $\gamma_{k} \rightarrow 0$. As a result, we see that in the case $\beta_{k} = 0$ the SP-BFGS update reduces to $H_{k+1} = H_{k}$. This is again expected because as $\beta_{k} \rightarrow 0$, the cost of violating the secant condition goes to zero, and the minimum norm symmetric update is simply $H_{k+1} = H_{k}$.
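
This limiting behaviour is easy to see in code. The Julia sketch of (27) and (28) below is our own illustration (the function name sp_bfgs_update and the convention of passing β = Inf to recover BFGS are assumptions, not the authors' reference implementation); with β = Inf it reproduces the BFGS update (19), and with β = 0 it returns $H_{k}$ unchanged.

    using LinearAlgebra

    # Minimal sketch of the SP-BFGS inverse Hessian update (27)-(28).
    # β = Inf recovers the BFGS update (19); β = 0 leaves H unchanged.
    function sp_bfgs_update(H::AbstractMatrix, s::AbstractVector, y::AbstractVector, β::Real)
        β == 0 && return copy(H)                  # zero penalty: keep the previous approximation
        sty = dot(s, y)
        if sty <= -1 / β                          # SP-BFGS curvature condition (29) fails;
            return copy(H)                        # see Section 4.4 for better recovery options
        end
        γ = 1 / (sty + 1 / β)                     # γ_k from (28)
        ω = 1 / (sty + 2 / β)                     # ω_k from (28)
        V = I - ω * s * y'
        return V * H * V' + ω * (γ / ω + (γ - ω) * dot(y, H * y)) * (s * s')
    end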

We now examine what the analog of the curvature condition (22) is for SP-BFGS. Lemma 1 demonstrates that (29) is the SP-BFGS analog of the BFGS curvature condition (22).

Lemma 1 (Positive Definiteness Of SP-BFGS Update)

If $H_{k}$ is positive definite, then the $H_{k+1}$ given by the SP-BFGS update (27) is positive definite if and only if the SP-BFGS curvature condition

s_{k}^{T} y_{k} > -\frac{1}{\beta_{k}}   (29)

is satisfied.

Proof

See Appendix B. ∎

The result in Lemma 1 warrants some discussion. First, the limiting behaviour with respect to $\beta_{k}$ is consistent with Theorem 3.1. As $\beta_{k} \rightarrow +\infty$, condition (29) reduces to the BFGS curvature condition (22). As $\beta_{k} \rightarrow 0$, condition (29) reduces to no condition at all, as $s_{k}^{T} y_{k} > -\infty$ is always true. This is consistent with the observation that when $\beta_{k} = 0$, the minimum norm symmetric update is $H_{k+1} = H_{k}$, and in this case $H_{k+1}$ is guaranteed to be positive definite if $H_{k}$ is positive definite, regardless of $s_{k}^{T} y_{k}$.

From the proof of Lemma 1 (see (93)), it is now clear that

y_{k}^{T} H_{k+1} y_{k} = \bigg( \frac{\beta_{k} y_{k}^{T} s_{k}}{1 + \beta_{k} y_{k}^{T} s_{k}} \bigg) y_{k}^{T} s_{k} + \bigg( \frac{1}{1 + \beta_{k} y_{k}^{T} s_{k}} \bigg) y_{k}^{T} H_{k} y_{k}   (30)

and so $y_{k}^{T} H_{k+1} y_{k}$ is a convex combination of $y_{k}^{T} s_{k}$ and $y_{k}^{T} H_{k} y_{k}$. Thus, $H_{k+1}$ interpolates between the current inverse Hessian approximation $H_{k}$ and the original BFGS update, and as $\beta_{k}$ decreases, the interpolation is increasingly biased towards the current approximation $H_{k}$. From a regularized least squares estimation perspective, $\beta_{k}$ plays the role of a regularization parameter that controls the amount of bias in the estimate of $H_{k+1}$. Note that this behaviour is somewhat similar to the behaviour of Powell damping Powell1978 , although Powell damping was introduced to handle approximating a potentially indefinite Hessian of the Lagrangian in constrained optimization problems, and not noise.

We finish introducing the SP-BFGS update by applying the Sherman-Morrison-Woodbury formula to (27), which allows us to write the update in terms of the approximate Hessian $B_{k}$ instead of the approximate inverse Hessian $H_{k}$. The result is given in Theorem 3.2.

Theorem 3.2 (SP-BFGS Inverse Update)

The SP-BFGS update formula given by (27) can be written in terms of $B_{k} = H_{k}^{-1}$ as

B_{k+1} = B_{k} - \frac{\omega_{k} \Big[ \big( (\omega_{k} - \gamma_{k}) y_{k}^{T} B_{k}^{-1} y_{k} - \frac{\gamma_{k}}{\omega_{k}} \big) B_{k} s_{k} s_{k}^{T} B_{k} + (1 - \omega_{k} s_{k}^{T} y_{k}) (B_{k} s_{k} y_{k}^{T} + y_{k} s_{k}^{T} B_{k}) + \omega_{k} (s_{k}^{T} B_{k} s_{k}) y_{k} y_{k}^{T} \Big]}{\big( (\omega_{k} - \gamma_{k}) y_{k}^{T} B_{k}^{-1} y_{k} - \frac{\gamma_{k}}{\omega_{k}} \big) \big( \omega_{k} s_{k}^{T} B_{k} s_{k} \big) - (1 - \omega_{k} y_{k}^{T} s_{k})^{2}} .

Proof

See Appendix C. ∎

Note that the limiting behaviour of Theorem 3.2 with respect to $\beta_{k}$ is again consistent. When $\beta_{k} = +\infty$, we obtain the original BFGS inverse update (20), and when $\beta_{k} = 0$, we obtain $B_{k+1} = B_{k}$. One complication with respect to the SP-BFGS inverse update (98) is that $B_{k+1}$ cannot in general be expressed solely in terms of $B_{k}$ due to the presence of $y_{k}^{T} B_{k}^{-1} y_{k}$ (i.e. $y_{k}^{T} H_{k} y_{k}$) in the denominator.

4 Algorithmic Framework

We now outline how to practically implement SP-BFGS updating. We consider the situation where one has access to noise corrupted versions of a smooth function $\phi$ and its gradient $\nabla\phi$ that can be decomposed as

f(x) = \phi(x) + \epsilon(x) ,   (31)
g(x) = \nabla\phi(x) + e(x) .   (32)

In (31) and (32), $\phi$ is a smooth twice continuously differentiable function as in Section 2.1, and $\epsilon(x)$ is a scalar representing noise in the function evaluations. Similarly, $\nabla\phi$ is the gradient of the smooth function $\phi$, while $e(x)$ is a vector representing noise in the gradient evaluations. Similar decompositions are used in DFONoisyFunctionsQuasiNewton ; Gal_2020_HowToCatchALion ; doi:10.1137/19M1240794 .

4.1 Minimization Routine

Algorithm 1 outlines a general procedure for minimizing a noisy function with noisy function and gradient values $f$ and $g$ that can be decomposed as shown in (31) and (32). The inputs to the procedure in Algorithm 1 are a means of evaluating the noisy objective function $f(x)$ and gradient $g(x)$, the starting point $x^{0}$, and an initial inverse Hessian approximation $H^{0}$. As the best convergence/stopping test is problem dependent, we note that standard gradient and function value based tests can be employed in conjunction with smoothing and noise estimation techniques (e.g. see Section 3.3.4 of DFONoisyFunctionsQuasiNewton ). In the next several subsections, we discuss how to choose the penalty parameter $\beta_{k}$ and step size $\alpha_{k}$, and appropriate courses of action for when the SP-BFGS curvature condition (29) fails.

Algorithm 1 SP-BFGS Minimization Routine
1:  procedure SP-BFGS-Minimize($f(x)$, $g(x)$, $x^{0}$, $H^{0}$)
2:      $k \leftarrow 0$
3:      $H_{k} \leftarrow H^{0}$
4:      $x_{k} \leftarrow x^{0}$
5:      while Not Converged/Stopped do
6:          $p_{k} \leftarrow -H_{k} g_{k}$
7:          Choose step size $\alpha_{k}$
8:          $x_{k+1} \leftarrow x_{k} + \alpha_{k} p_{k}$
9:          $s_{k} \leftarrow x_{k+1} - x_{k}$
10:         $y_{k} \leftarrow g_{k+1} - g_{k}$
11:         Choose penalty parameter $\beta_{k}$
12:         if $s_{k}^{T} y_{k} > -\frac{1}{\beta_{k}}$ then
13:             $\gamma_{k} \leftarrow \frac{1}{(s_{k}^{T} y_{k} + \frac{1}{\beta_{k}})}$, $\omega_{k} \leftarrow \frac{1}{(s_{k}^{T} y_{k} + \frac{2}{\beta_{k}})}$
14:             $H_{k+1} \leftarrow \big( I - \omega_{k} s_{k} y_{k}^{T} \big) H_{k} \big( I - \omega_{k} y_{k} s_{k}^{T} \big) + \omega_{k} \big[ \frac{\gamma_{k}}{\omega_{k}} + (\gamma_{k} - \omega_{k}) y_{k}^{T} H_{k} y_{k} \big] s_{k} s_{k}^{T}$
15:         else
16:             Trigger SP-BFGS curvature condition failure recovery procedure
17:         end if
18:         $k \leftarrow k + 1$
19:     end while
20: end procedure
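
For concreteness, a minimal Julia realization of this loop is sketched below. It assumes the sp_bfgs_update function from the sketch after Theorem 3.1 is in scope, and it substitutes a fixed step size and the simple penalty rule (42) for the line search and penalty selection strategies that Sections 4.2 through 4.4 discuss in detail; the function name sp_bfgs_minimize and its keyword arguments are our own illustrative choices.

    using LinearAlgebra

    # Skeleton of Algorithm 1. Assumes sp_bfgs_update (Section 3 sketch) is in scope.
    # The fixed step size α and the penalty rule β_k = Ns*‖s_k‖₂ are placeholders; see
    # Sections 4.2-4.4 for how α_k, β_k, and curvature condition failures should be handled.
    function sp_bfgs_minimize(f, g, x0, H0; iters = 100, Ns = 1.0, α = 1e-3)
        x, H = copy(x0), copy(H0)
        gx = g(x)
        for k in 1:iters
            p = -H * gx                       # search direction p_k = -H_k g_k
            xnew = x + α * p                  # a line search using f would choose α_k here
            gnew = g(xnew)
            s, y = xnew - x, gnew - gx
            β = Ns * norm(s)                  # penalty rule (42)
            if dot(s, y) > -1 / β             # SP-BFGS curvature condition (29)
                H = sp_bfgs_update(H, s, y, β)
            end                               # else: recovery procedure of Section 4.4
            x, gx = xnew, gnew
        end
        return x, H
    end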

4.2 Choosing The Penalty Parameter $\beta_{k}$

As the choice of $\beta_{k}$ determines how strongly to bias the estimate of $H_{k+1}$ towards $H_{k}$, the choice of $\beta_{k}$ is fundamentally connected to the amount of noise present in the measured gradients $g_{k+1}$ and $g_{k}$. In brief, if the amount of noise present in the measured gradients is large, $\beta_{k}$ should be small to avoid overfitting the noise, and if the amount of noise present in the measured gradients is small, $\beta_{k}$ should be large to avoid underfitting curvature information. To make this point more rigorous, we introduce the following assumption.

Assumption 1 (Uniform Gradient Noise Bound)

There exists a nonnegative constant $\bar{\epsilon}_{g} \geq 0$ such that

\left\| g(x) - \nabla\phi(x) \right\|_{2} = \left\| e(x) \right\|_{2} \leq \bar{\epsilon}_{g} , \qquad \forall x \in \mathbb{R}^{n} .   (33)

As $\nabla\phi(x)$ is continuous, for each $k \geq 0$ we have

\left\| \lim_{\alpha_{k} \downarrow 0} \big[ \nabla\phi(x_{k} + \alpha_{k} p_{k}) \big] - \nabla\phi(x_{k}) \right\|_{2} = 0 .   (34)

However, due to noise we cannot in general guarantee

\left\| \lim_{\alpha_{k} \downarrow 0} \big[ g(x_{k} + \alpha_{k} p_{k}) \big] - g(x_{k}) \right\|_{2} = 0 .   (35)

Using the continuity of $\nabla\phi(x)$, Assumption 1, and the triangle inequality, one can conclude that

0 \leq \left\| \lim_{\alpha_{k} \downarrow 0} \big[ g(x_{k} + \alpha_{k} p_{k}) \big] - g(x_{k}) \right\|_{2} \leq 2\bar{\epsilon}_{g} .   (36)

As a result, it is now clear that in the presence of uniformly bounded gradient noise, sending the step size $\alpha_{k}$ to zero, and thus $s_{k}$ to zero, only bounds the difference of measured gradients within a ball with radius dependent on the gradient noise bound $\bar{\epsilon}_{g}$.

As $g_{k+1}$ and $g_{k}$ can be decomposed into smooth and noise components, so can $s_{k}^{T} y_{k}$, giving

s_{k}^{T} y_{k} = s_{k}^{T} y_{k}^{smooth} + s_{k}^{T} y_{k}^{noise} = s_{k}^{T} [\nabla\phi_{k+1} - \nabla\phi_{k}] + s_{k}^{T} [e_{k+1} - e_{k}] .   (37)

In conjunction with the Cauchy-Schwarz inequality, Assumption 1 implies that

-2\bar{\epsilon}_{g} \left\| s_{k} \right\|_{2} \leq s_{k}^{T} [e_{k+1} - e_{k}] \leq 2\bar{\epsilon}_{g} \left\| s_{k} \right\|_{2}   (38)

and so we have the lower and upper bounds

-2\bar{\epsilon}_{g} \left\| s_{k} \right\|_{2} + s_{k}^{T} y_{k}^{smooth} \leq s_{k}^{T} y_{k} \leq s_{k}^{T} y_{k}^{smooth} + 2\bar{\epsilon}_{g} \left\| s_{k} \right\|_{2} .   (39)

From (39), it is clear that the bound on the effect of the noise grows linearly with $\left\| s_{k} \right\|_{2}$. However, by using the average Hessian $\bar{G}_{k}$ from (13) and applying Taylor's theorem to $\nabla\phi$, it is also clear that

s_{k}^{T} y_{k}^{smooth} = s_{k}^{T} \bar{G}_{k} s_{k} = O\big( \left\| s_{k} \right\|_{2}^{2} \big)   (40)

and so

s_{k}^{T} y_{k} = O\big( \left\| s_{k} \right\|_{2}^{2} \big) + O\big( \left\| s_{k} \right\|_{2} \big)   (41)

where the $O\big( \left\| s_{k} \right\|_{2}^{2} \big)$ term is due to the true curvature of the smooth function $\phi$, and the $O\big( \left\| s_{k} \right\|_{2} \big)$ term is due to noise. Thus, we have now illustrated an important general behaviour given Assumption 1. As $\left\| s_{k} \right\|_{2}$ dominates $\left\| s_{k} \right\|_{2}^{2}$ as $\left\| s_{k} \right\|_{2} \rightarrow 0$, the effects of noise can dominate the true curvature for small $s_{k}$. Conversely, as $\left\| s_{k} \right\|_{2}^{2}$ dominates $\left\| s_{k} \right\|_{2}$ as $\left\| s_{k} \right\|_{2} \rightarrow +\infty$, the true curvature can dominate the effects of noise for large $s_{k}$.

Given the above analysis, a simple strategy for choosing $\beta_{k}$ is to make $\beta_{k}$ grow linearly with $\left\| s_{k} \right\|_{2}$, such as

\beta_{k} = N_{s} \left\| s_{k} \right\|_{2}   (42)

where $N_{s} > 0$ is a slope parameter. As $\left\| s_{k} \right\|_{2} \rightarrow 0$, $H_{k+1} \rightarrow H_{k}$, which is desirable because the effects of noise likely dominate as $\left\| s_{k} \right\|_{2} \rightarrow 0$. Increasingly biasing the estimate of $H_{k+1}$ towards $H_{k}$ reduces how much $H_{k+1}$ can be corrupted by noise, and relaxes the SP-BFGS curvature condition (29), reducing the likelihood of needing to trigger a recovery procedure described in Section 4.4. Also, as shown earlier, because $\nabla\phi$ is continuous, the true difference of gradients is guaranteed to go to zero as $s_{k}$ approaches zero. As a result, without noise present, it is natural that $H_{k+1} \rightarrow H_{k}$ as $s_{k} \rightarrow 0$. In the presence of noise, we wish for this behaviour to be preserved. Informally, one can intuitively think of wanting $H_{k}$ to behave as an approximate average inverse Hessian, and the averaging should remove the corrupting effects of noise, leaving $H_{k}$ to behave as if no noise were present. Similarly, as $\left\| s_{k} \right\|_{2} \rightarrow +\infty$, $\beta_{k} \rightarrow +\infty$, and one recovers the BFGS update in the limit, which is desirable because the effects of noise are likely dominated by the true curvature as $\left\| s_{k} \right\|_{2} \rightarrow +\infty$. The slope parameter $N_{s}$ dictates how sensitive $\beta_{k}$ is to $\left\| s_{k} \right\|_{2}$, and should be set inversely proportional to the gradient noise level (i.e. $\bar{\epsilon}_{g}$). Intuitively, if the gradient noise level is low, $\beta_{k}$ should grow quickly with $\left\| s_{k} \right\|_{2}$, as the effect of noise diminishes quickly, and vice versa.

It may also be desirable to modify (42) to

\beta_{k} = \max\big\{ N_{s} \left\| s_{k} \right\|_{2} - N_{o} , 0 \big\}   (43)

where $N_{o} > 0$ is an intercept parameter. The inclusion of $N_{o}$ allows one to stop updating $H_{k}$ if $\left\| s_{k} \right\|_{2}$ is sufficiently small. For example, it may be desirable to stop updating $H_{k}$ when one is very close to a stationary point, as gradient measurements are likely heavily dominated by noise.
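
As a small illustration, the two rules can be written as Julia one-liners (the function names beta_linear and beta_affine are our own, as is the choice of passing only the measured step s):

    using LinearAlgebra

    # Penalty parameter rules (42) and (43); Ns and No are the slope and intercept parameters.
    beta_linear(s, Ns) = Ns * norm(s)                      # rule (42)
    beta_affine(s, Ns, No) = max(Ns * norm(s) - No, 0.0)   # rule (43)

In the experiments of Section 6, $N_{s}$ is taken inversely proportional to $\bar{\epsilon}_{g}$, for example $N_{s} = 1/\bar{\epsilon}_{g}$ for the quadratic test problem of Section 6.1.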

4.3 Choosing The Step Size $\alpha_{k}$

Classically, during BFGS updating $\alpha_{k}$ is chosen to satisfy the Armijo-Wolfe conditions. As function and gradient evaluations are not corrupted by noise in the classical BFGS setting, we can write the Armijo condition, also known as the sufficient decrease condition, as

\phi_{k+1} \leq \phi_{k} + c_{1} \alpha_{k} \nabla\phi_{k}^{T} p_{k}   (44)

and the Wolfe condition, also known as the curvature condition, as

\nabla\phi_{k+1}^{T} p_{k} \geq c_{2} \nabla\phi_{k}^{T} p_{k}   (45)

where $0 < c_{1} < c_{2} < 1$, with well known choices being $c_{1} = 10^{-4}$ and $c_{2} = 0.9$. Observe that by subtracting $\nabla\phi_{k}^{T} p_{k}$ from both sides of (45) and multiplying by $\alpha_{k}$, (45) becomes

y_{k}^{T} s_{k} = [\nabla\phi_{k+1} - \nabla\phi_{k}]^{T} \alpha_{k} p_{k} \geq (c_{2} - 1) \nabla\phi_{k}^{T} \alpha_{k} p_{k} .   (46)

If $p_{k}$ is a descent direction then $\nabla\phi_{k}^{T} p_{k} < 0$, and combined with $(c_{2} - 1) < 0$ and $\alpha_{k} > 0$, one sees that (46) implies

y_{k}^{T} s_{k} \geq (c_{2} - 1) \nabla\phi_{k}^{T} s_{k} > 0   (47)

so (45) effectively enforces (22) when no gradient noise is present.

In the presence of noisy gradients, we argue that in general it no longer makes sense to enforce the Wolfe condition (45). With gradient noise present, (45) becomes

[\nabla\phi_{k+1} + e_{k+1}]^{T} p_{k} \geq c_{2} [\nabla\phi_{k} + e_{k}]^{T} p_{k}   (48)

which can behave erratically once the noise vectors $e_{k+1}$ and $e_{k}$ start to dominate the gradient of $\phi$. For example, the noise vectors $e_{k+1}$ and $e_{k}$ can cause both sides of (48) to erratically change sign, in which case whether or not the Wolfe condition is satisfied can be governed by randomness more than anything else.

We argue that because the SP-BFGS update allows one to relax the curvature condition based on the value of $\beta_{k}$ as shown in the SP-BFGS curvature condition (29), it is appropriate to drop the Wolfe condition entirely in the presence of gradient noise and instead employ only a version of the sufficient decrease condition when choosing $\alpha_{k}$. In the situation where gradient noise is present but function noise is not (i.e. $f(x) = \phi(x)$ in (31)), one can use a backtracking line search based on the sufficient decrease condition, which can guarantee convergence to a neighborhood of a stationary point of $\phi$. The situation where noise is present in both function and gradient evaluations is trickier. Similar to the approach presented in Section 4.2 of DFONoisyFunctionsQuasiNewton , one option is to use a backtracking line search with a relaxed sufficient decrease condition of the form

f_{k+1} \leq f_{k} + c_{1} \alpha_{k} g_{k}^{T} p_{k} + 2\epsilon_{A}   (49)

where $\epsilon_{A} \geq 0$ is a noise tolerance parameter and $p_{k} = -H_{k} g_{k}$. In Theorem 4.2 of DFONoisyFunctionsQuasiNewton , the authors show that under Assumptions 1 and 2, using the iteration (4) and a backtracking line search governed by the relaxed Armijo condition (49) with $p_{k} = -g_{k}$ guarantees linear convergence to a neighborhood of the global minimizer for strongly convex functions.

Assumption 2 (Uniform Function Noise Bound)

There exists a nonnegative constant $\bar{\epsilon}_{f} \geq 0$ such that

\left| f(x) - \phi(x) \right| = \left| \epsilon(x) \right| \leq \bar{\epsilon}_{f} , \qquad \forall x \in \mathbb{R}^{n} .   (50)

We agree with the authors of DFONoisyFunctionsQuasiNewton that it is possible to prove an extension of Theorem 4.2 of DFONoisyFunctionsQuasiNewton to a quasi-Newton iteration with positive definite $H_{k}$, and briefly outline why in Section 5.2. A quasi-Newton extension of Theorem 4.2 of DFONoisyFunctionsQuasiNewton is relevant to SP-BFGS updating because, as we will formally see in Section 5.1, control of $\beta_{k}$ makes it possible to uniformly bound the minimum and maximum eigenvalues of $H_{k+1}$.
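
A backtracking line search based on (49) can be sketched as follows in Julia; the function name backtracking_armijo is our own, and the default parameter values simply mirror the choices reported in Section 6 rather than being prescribed by the theory.

    using LinearAlgebra

    # Backtracking line search sketch based on the relaxed sufficient decrease condition (49).
    # f is the noisy objective, fx = f(x), gx = g(x) is the noisy gradient at x, p is the
    # search direction (e.g. p = -H*gx), and eps_A is the noise tolerance parameter ϵ_A.
    function backtracking_armijo(f, x, fx, gx, p; c1 = 1e-4, τ = 0.5, α0 = 1.0,
                                 eps_A = 0.0, max_backtracks = 45)
        α = α0
        for _ in 1:max_backtracks
            if f(x + α * p) <= fx + c1 * α * dot(gx, p) + 2 * eps_A   # condition (49)
                return α
            end
            α *= τ
        end
        return 0.0   # budget exhausted: take no step, as done for the test problem in Section 6.1
    end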

4.4 Failed SP-BFGS Curvature Condition Recovery Procedure

In the classical BFGS scenario where no gradient noise is present, the curvature condition (22) may fail if $\alpha_{k}$ is not chosen based on the Armijo-Wolfe conditions and $\phi$ is not strongly convex. One of the most common strategies to handle this scenario is to skip the BFGS update (i.e. set $H_{k+1} = H_{k}$) when this occurs, which corresponds to an SP-BFGS update with $\beta_{k} = 0$. However, this simple strategy has the downside of potentially producing poor inverse Hessian approximations if updates are skipped too frequently.

Conditionally skipping BFGS updates is an option in the presence of noisy gradients as well. In addition to skipping BFGS updates when (22) fails, as described above, another course of action sometimes recommended in the presence of noise is to replace (22) with

s_{k}^{T} y_{k} \geq \varepsilon \left\| s_{k} \right\|_{2}^{2}   (51)

where $\varepsilon > 0$ is a small positive constant, and skip the BFGS update if (51) is not satisfied. This strategy may be somewhat effective if $\left\| s_{k} \right\|_{2}$ is large, but reduces back to the initial update skipping approach as $\left\| s_{k} \right\|_{2} \rightarrow 0$. A similar strategy (e.g. see Section 3.3.3 of DFONoisyFunctionsQuasiNewton ) is to replace (22) with

s_{k}^{T} y_{k} \geq \zeta \left\| s_{k} \right\|_{2} \left\| y_{k} \right\|_{2}   (52)

and skip the BFGS update if (52) is not satisfied for some $\zeta \in (0, 1)$. Notice that none of the aforementioned update skipping strategies allow for curvature information to be incorporated if the measured curvature $s_{k}^{T} y_{k}$ is negative.

Unlike in the classical BFGS scenario, with SP-BFGS updating, curvature information can be incorporated even if the measured curvature $s_{k}^{T} y_{k}$ is negative by decreasing $\beta_{k}$ towards 0. In addition to having the option of conditionally skipping updates (i.e. setting $\beta_{k} = 0$), one can also alternatively relax the SP-BFGS curvature condition by decreasing $\beta_{k}$ towards 0 if (29) fails. Since $\beta_{k}$ is chosen after $s_{k}$ and $y_{k}$ are fixed in Algorithm 1, one can solve for $\beta_{k}$ values satisfying (29) when $s_{k}^{T} y_{k} < 0$, yielding

\beta_{k} = -\frac{1}{c_{3} (s_{k}^{T} y_{k})}   (53)

for all $c_{3} > 1$, assuming that $s_{k} \neq 0$ and $y_{k} \neq 0$. Note that if $s_{k} \neq 0$ and $y_{k} \neq 0$ and (29) fails, then the measured curvature $s_{k}^{T} y_{k}$ must be negative. The choice of $c_{3}$ determines how much to shrink $\beta_{k}$ compared to the largest value of $\beta_{k}$ that still satisfies (29) and thus guarantees the positive definiteness of $H_{k+1}$. Hence, if the value of $\beta_{k}$ produced by (42) or (43) is too large and (29) fails, one can choose an acceptable value of $\beta_{k}$ by using (53) and selecting a $c_{3} > 1$. Thus, instead of skipping the update (i.e. setting $\beta_{k} = 0$) if (29) fails, one can reduce $\beta_{k}$ towards 0, which has the effect of reducing the magnitude of the update by increasing how much $H_{k+1}$ is biased towards $H_{k}$. An approach based on reducing $\beta_{k}$ towards 0 never entirely skips incorporating measured curvature information, even if the measured curvature information is negative, but instead weights how heavily the measured curvature information affects $H_{k+1}$.
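
As a sketch, the recovery rule (53) can be combined with the curvature condition check in a few lines of Julia (the helper name recover_beta and the default c3 = 2 are our own illustrative choices; the rule relies on the fact noted above that $s_{k}^{T} y_{k} < 0$ whenever (29) fails with finite $\beta_{k}$):

    using LinearAlgebra

    # If the SP-BFGS curvature condition (29) fails, shrink β using rule (53) so that
    # s'y > -1/β holds again and H_{k+1} stays positive definite.
    function recover_beta(s, y, β; c3 = 2.0)
        sty = dot(s, y)
        sty > -1 / β && return β      # (29) already holds; keep the proposed β
        return -1 / (c3 * sty)        # rule (53); requires sty < 0 and c3 > 1
    end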

5 Convergence of SP-BFGS

In this section, we discuss relevant theoretical and convergence properties of SP-BFGS. First, it is important to note that for specific choices of the sequence of penalty parameters $\beta_{k}$, known convergence results already exist. Specifically, if $\beta_{k} = +\infty$ for all $k$, then SP-BFGS updating is equivalent to BFGS updating. Although there are not many works on the convergence properties of BFGS updating in the presence of uniformly bounded noise, such as in Assumptions 1 and 2, in doi:10.1137/19M1240794 the authors provide convergence results for a BFGS variant that employs an Armijo-Wolfe line search and lengthens the differencing interval in the presence of uniformly bounded function and gradient noise. At the other extreme, if $\beta_{k} = 0$ for all $k$, then one obtains a scaled gradient method for general $H^{0} \succ 0$, and this becomes the gradient method when $H^{0} = I$. Convergence analyses of the gradient method in the presence of uniformly bounded function and gradient noise for both a fixed step size and backtracking line search are provided in Section 4 of DFONoisyFunctionsQuasiNewton .

Given that perhaps the defining feature of SP-BFGS updating is the ability to vary $\beta_{k}$ at each iteration, we focus our attention on how varying $\beta_{k}$ can influence convergence behaviour in this section. As a result, most of the ensuing analysis centers around situations where the condition number of $H_{k}$ can be bounded. We do not employ the approach of bounding the cosine of the angle between the descent direction $p_{k}$ and the negative gradient above zero, and then showing that the condition number of $H_{k}$ is bounded, which is similar to the approaches taken when no noise is present in 10.2307/2157646 ; 10.2307/2157680 , and when noise is present in doi:10.1137/19M1240794 . Although it may be possible to apply the strategies employed in 10.2307/2157646 ; 10.2307/2157680 ; doi:10.1137/19M1240794 to establish convergence results for SP-BFGS, such an analysis is complicated enough that it is beyond the scope of this initial paper.

5.1 The Influence of $\beta_{k}$ on $H_{k+1}$

We first examine how $\beta_{k}$ determines how much the maximum and minimum eigenvalues $\lambda_{max}(H_{k+1})$ and $\lambda_{min}(H_{k+1})$ can change. In what follows, $\lambda(H)$ denotes the set of eigenvalues $\lambda_{1}, \dots, \lambda_{n}$ of the matrix $H \in \mathbb{R}^{n \times n}$. We provide upper bounds on $\lambda_{max}(B_{k+1})$ and $\lambda_{max}(H_{k+1})$ via Theorem 5.1. As $H_{k} = B_{k}^{-1}$, $1/\lambda_{min}(H_{k+1}) = \lambda_{max}(B_{k+1})$, and putting an upper bound on $\lambda_{max}(B_{k+1})$ is equivalent to putting a lower bound on $\lambda_{min}(H_{k+1})$.

Theorem 5.1 (Eigenvalue Upper Bounds)

When $H_{k+1}$ is given by the SP-BFGS update (27), the following upper bounds (54) and (55) hold

\lambda_{max}(H_{k+1}) \leq \operatorname{Tr}(H_{k+1}) \leq \Big[ \big( 1 + \gamma_{k} \left\| y_{k} \right\|_{2} \left\| s_{k} \right\|_{2} \big)^{2} \Big] \operatorname{Tr}(H_{k}) + \gamma_{k} \left\| s_{k} \right\|_{2}^{2} ,   (54)
\lambda_{max}(B_{k+1}) \leq \operatorname{Tr}(B_{k+1}) \leq \Big[ 1 + \beta_{k} \left\| y_{k} \right\|_{2} \left\| s_{k} \right\|_{2} \Big] \operatorname{Tr}(B_{k}) + \gamma_{k} \left\| y_{k} \right\|_{2}^{2} .   (55)
Proof

See Appendix D. ∎

With Theorem 5.1 in hand, we now formally see that when $s_{k}^{T} y_{k} > 0$, as $\beta_{k}$ increases from 0 to $+\infty$, an upper bound on $\lambda_{max}(H_{k+1})$ interpolates between $\operatorname{Tr}(H_{k})$ and $+\infty$, and an upper bound on $\lambda_{max}(B_{k+1})$ interpolates between $\operatorname{Tr}(B_{k})$ and $+\infty$. Similarly, when $s_{k}^{T} y_{k} < 0$, as $\beta_{k}$ increases from 0 towards $-\frac{1}{(s_{k}^{T} y_{k})}$, an upper bound on $\lambda_{max}(H_{k+1})$ interpolates between $\operatorname{Tr}(H_{k})$ and $+\infty$, and an upper bound on $\lambda_{max}(B_{k+1})$ interpolates between $\operatorname{Tr}(B_{k})$ and $+\infty$. Standard BFGS updating corresponds to setting $\beta_{k} = +\infty$ for all $k$, and as this is the largest possible value of $\beta_{k}$, one can no longer formally guarantee that $\lambda_{max}(B_{k+1})$ and $\lambda_{max}(H_{k+1})$ are bounded from above at each iteration because the measured curvature $s_{k}^{T} y_{k}$ may become arbitrarily close to zero due to the effects of noise. The key takeaway is that upper bounds on $\lambda_{max}(H_{k+1})$ and $\lambda_{max}(B_{k+1})$ can be tightened arbitrarily close to $\operatorname{Tr}(H_{k})$ and $\operatorname{Tr}(B_{k})$ by shrinking $\beta_{k}$ towards zero, as $s_{k}$, $y_{k}$, and $H_{k}$ are fixed before the value of $\beta_{k}$ is chosen in Algorithm 1.

Thus, if one must enforce a bound of the form $\lambda_{max}(H_{k+1}) \leq C_{H}$ or a bound of the form $\lambda_{max}(B_{k+1}) \leq C_{B}$ for all $k \geq 0$, where $C_{H} > \operatorname{Tr}(H_{0}) > 0$ and $C_{B} > \operatorname{Tr}(B_{0}) > 0$ are positive constants, there exist nontrivial sequences of sufficiently small $\beta_{k}$ with $\lim_{k \rightarrow \infty} \beta_{k} = 0$ that ensure the bounds hold for all $k$. To see this, observe that the interval $[\lambda_{max}(H_{0}), C_{H}]$ can be partitioned into subintervals corresponding to each iteration, and the sum of the subinterval lengths cannot exceed $C_{H} - \lambda_{max}(H_{0})$, which can be guaranteed by assigning a small enough value of $\beta_{k}$ to each subinterval, as this guarantees the maximum eigenvalue does not grow too much at each iteration $k$. Furthermore, although there clearly exist sequences of $\beta_{k}$ that ensure the bounds hold for all $k$ that satisfy $\beta_{k} = 0$ for all $k \geq K$, where $K$ is a positive integer, there also exist sequences of $\beta_{k}$ that ensure the bounds hold for all $k$ where $\beta_{k}$ instead only approaches zero in the limit $k \rightarrow \infty$.
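
The bound (54) is also easy to check numerically. The short Julia script below is our own spot check on randomly generated data (not one of the paper's experiments); it forms $H_{k+1}$ from (27) and (28) for several values of $\beta_{k}$ and compares $\lambda_{max}(H_{k+1})$ against the right hand side of (54).

    using LinearAlgebra

    n = 5
    A = randn(n, n); H = A * A' + I        # random symmetric positive definite H_k
    s, y = randn(n), randn(n)
    for β in (1e-2, 1.0, 1e2)
        dot(s, y) > -1 / β || continue     # only test when the curvature condition (29) holds
        γ = 1 / (dot(s, y) + 1 / β)
        ω = 1 / (dot(s, y) + 2 / β)
        V = I - ω * s * y'
        Hnew = V * H * V' + ω * (γ / ω + (γ - ω) * dot(y, H * y)) * (s * s')
        bound = (1 + γ * norm(y) * norm(s))^2 * tr(H) + γ * norm(s)^2
        println("β = $β: λmax(H_{k+1}) = ", maximum(eigvals(Symmetric(Hnew))), " ≤ ", bound)
    end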

5.2 Minimization Of Strongly Convex Functions

Having established that SP-BFGS iterations can maintain bounds on the maximum and minimum eigenvalues of the approximate inverse Hessians via sufficiently small choices of $\beta_{k}$, we now consider minimizing strongly convex functions in the presence of bounded noise. We introduce Assumption 3, and the notation $x^{\star}$ to denote the argument of the unique minimum of $\phi$, and $\phi^{\star} = \phi(x^{\star})$ to denote the minimum.

Assumption 3 (Strong Convexity of $\phi$)

The function $\phi \in C^{2}$ is twice continuously differentiable and there exist positive constants $0 < m \leq M$ such that

m I \preceq \nabla^{2}\phi(x) \preceq M I , \qquad \forall x \in \mathbb{R}^{n} .   (56)

We also state a general result in Lemma 2 that establishes a region where $-H_{k} g_{k}$ may not provide a descent direction with respect to $\phi$ due to noise dominating gradient measurements. Outside of this region, $-H_{k} g_{k}$ is guaranteed to provide a descent direction for $\phi$.

Lemma 2 (Region Where Gradient Noise Can Dominate $\nabla\phi$)

Suppose Assumptions 1 and 3, and the decomposition in (32) apply. Let $H$ be a symmetric positive definite matrix bounded by $\psi I \preceq H \preceq \Psi I$, where $0 < \psi \leq \Psi$. Define the neighborhood $\mathcal{N}_{1}(\psi, \Psi)$ as

\mathcal{N}_{1}(\psi, \Psi) \equiv \bigg\{ x \ \big| \ \phi(x) \leq \phi^{\star} + \frac{1}{2m} \bigg( \frac{\Psi \bar{\epsilon}_{g}}{\psi} \bigg)^{2} \bigg\} .   (57)

For all $x \notin \mathcal{N}_{1}$, $\nabla\phi(x)^{T} H g(x) > 0$. Contrapositively, for all $x$ such that $\nabla\phi(x)^{T} H g(x) \leq 0$, $x \in \mathcal{N}_{1}$.

Proof

See Appendix E. ∎

Applying Lemma 2 in the context of SP-BFGS updating makes several convergence properties clear. First, if one chooses $\beta_{k}$ such that $\psi I \preceq H_{k} \preceq \Psi I$ for all $k$ (i.e. the eigenvalues of the approximate inverse Hessian are uniformly bounded from above and below for all $k$), by Lemma 2 it becomes clear that in the presence of gradient noise and absence of function noise (i.e. $f(x) = \phi(x)$ in (31)), the iterates of SP-BFGS with a backtracking line search based on (49) with $\epsilon_{A} = 0$ in the worst case approach $\mathcal{N}_{1}$ as $k \rightarrow \infty$. To see this, observe that $-H_{k} g_{k}$ is guaranteed to provide a descent direction outside of $\mathcal{N}_{1}$ and the sufficient decrease condition guarantees that $\alpha_{k}$ is not too large, while backtracking guarantees that $\alpha_{k}$ is not too small. For more background, see Chapter 3 of Nocedal2006 .

Second, if both function and gradient noise are present, and one again chooses $\beta_{k}$ such that the bounds $\psi I \preceq H_{k} \preceq \Psi I$ hold for all $k$, under additional conditions a worst case analysis in Theorem 5.2 shows that an approach using a sufficiently small fixed step size $\alpha$ approaches $\mathcal{N}_{1}$ at a linear rate as $k \rightarrow \infty$. For a general quasi-Newton iteration of the form

x_{k+1} = x_{k} - \alpha H_{k} g_{k}   (58)

with constant step size $\alpha$ and $H_{k} \succ 0$, Theorem 5.2 establishes linear convergence to the region where noise can dominate $\nabla\phi$ (i.e. $\mathcal{N}_{1}$ in Lemma 2).

Theorem 5.2 (Linear Convergence For Sufficiently Small Fixed $\alpha$)

Suppose that Assumptions 1 and 3 hold. Further suppose that $H_{k}$ is symmetric positive definite and bounded by $\psi I \preceq H_{k} \preceq \Psi I$, where $0 < \psi \leq \Psi$. Let $\psi$ be such that the inequality

\nabla\phi_{k}^{T} H_{k} g_{k} \geq \psi \nabla\phi_{k}^{T} g_{k}   (59)

is true for all $k$. Let $\{x_{k}\}$ be the iterates generated by (58), where the constant step size $\alpha$ satisfies

0 < \alpha \leq \frac{\psi}{\Psi^{2} M} .   (60)

Then for all $k$ such that $x_{k} \notin \mathcal{N}_{1}(\psi, \Psi)$, one has the Q-linear convergence result

\phi_{k+1} - \bigg[ \phi^{\star} + \frac{1}{2m} \bigg( \frac{\Psi \bar{\epsilon}_{g}}{\psi} \bigg)^{2} \bigg] \leq (1 - \alpha \psi m) \bigg( \phi_{k} - \bigg[ \phi^{\star} + \frac{1}{2m} \bigg( \frac{\Psi \bar{\epsilon}_{g}}{\psi} \bigg)^{2} \bigg] \bigg) .   (61)

Similarly, for any $x_{0} \notin \mathcal{N}_{1}(\psi, \Psi)$, one has the R-linear convergence result

\phi_{k+1} - \phi^{\star} \leq (1 - \alpha \psi m)^{k} \bigg( \phi_{0} - \bigg[ \phi^{\star} + \frac{1}{2m} \bigg( \frac{\Psi \bar{\epsilon}_{g}}{\psi} \bigg)^{2} \bigg] \bigg) + \frac{1}{2m} \bigg( \frac{\Psi \bar{\epsilon}_{g}}{\psi} \bigg)^{2} .   (62)
Proof

See Appendix F. ∎

Theorem 5.2 can be considered a quasi-Newton extension of Theorem 4.1 from DFONoisyFunctionsQuasiNewton , which lays the foundation for Theorem 4.2 from DFONoisyFunctionsQuasiNewton . To extend the convergence result of Theorem 5.2 to the backtracking line search approach based on (49), see that (60), (115), and Assumption 2 combined imply

f(x_{k} - \alpha H_{k} g_{k}) \leq f(x_{k}) - \frac{\alpha \psi}{2} \big( \left\| \nabla\phi_{k} \right\|_{2}^{2} - \left\| e_{k} \right\|_{2}^{2} \big) + 2\bar{\epsilon}_{f}   (63)

and so if $\epsilon_{A} > \bar{\epsilon}_{f}$, comparing (49) and (63) makes it clear that (49) will be satisfied for sufficiently small $\alpha$. Hence, the backtracking line search always finds an $\alpha_{k}$ satisfying (49). For brevity, we defer a full, rigorous quasi-Newton extension of Theorem 4.2 of DFONoisyFunctionsQuasiNewton to future work and instead investigate the performance of an approach based on (49) via numerical experiments in Section 6.

6 Numerical Experiments

In this section, we test instances of Algorithm 1 on a diverse set of 33 test problems for unconstrained minimization. The set of test problems includes convex and nonconvex functions, and well known pathological functions such as the Rosenbrock function 10.1093/comjnl/3.3.175 and its relatives. Described in Section 6.1, the first test problem is similar to the one used in the numerical experiments section of doi:10.1137/19M1240794 , and involves an ill conditioned quadratic function. The other 32 problems are selected problems from the CUTEst test problem set gould-orban-cutest , and are used for tests in Section 6.2. Code for running these numerical experiments was written in the Julia programming language doi:10.1137/141000671 , and utilizes the NLPModels.jl orban-siqueira-nlpmodels-2020 , CUTEst.jl orban-siqueira-cutest-2020 , and Distributions.jl 2019arXiv190708611B ; Distributions.jl-2019 packages. In all the numerical experiments that follow, noise $\epsilon(x)$ was added to function evaluations by uniformly sampling from the interval $[-\bar{\epsilon}_{f}, \bar{\epsilon}_{f}]$, and noise $e(x)$ was added to the gradient evaluations by uniformly sampling from the closed Euclidean ball $\left\| x \right\|_{2} \leq \bar{\epsilon}_{g}$.

6.1 Ill Conditioned Quadratic Function With Additive Gradient Noise Only

The first test problem is strongly convex and consists of the 4-dimensional quadratic function given by

\phi(x) = \frac{1}{2} x^{T} T x   (64)

where the eigenvalues of $T$ are $\lambda(T) = \{10^{-2}, 1, 10^{2}, 10^{4}\}$. Consequently, the strong convexity parameter is $m = 10^{-2}$, the Lipschitz constant is $M = 10^{4}$, and the condition number of the Hessian $T$ is $10^{6}$. For this test problem, no noise was added to the function evaluations (i.e. $f(x) = \phi(x)$ in (31)), and $\bar{\epsilon}_{g} = 1$. As a result, in this scenario $\mathcal{N}_{1}$ from Lemma 2 with $\psi = \Psi = 1$ (i.e. the smallest possible $\mathcal{N}_{1}$) becomes

\mathcal{N}_{1}(1, 1) = \big\{ x \ \big| \ \phi(x) \leq 50 \big\} .   (65)

Following the discussion in Section 4.2, we set the penalty parameters via the formula $\beta_{k} = \frac{1}{\bar{\epsilon}_{g}} \left\| s_{k} \right\|_{2} + 10^{-10}$, which corresponds to a choice of $N_{s} = 1$ in (42). The $10^{-10}$ term was added as a small perturbation to provide numerical stability. The step size $\alpha_{k}$ was chosen using a backtracking line search based on the sufficient decrease condition (49) with $p_{k} = -H_{k} g_{k}$, where $g_{k}$ is defined by (32), $\epsilon_{A} = 0$, and $c_{1} = 10^{-4}$. At each iteration, backtracking started from the initial step size $\alpha^{0} = 1$, decreasing by a factor of $\tau = 1/2$ each time the sufficient decrease condition failed. If the backtracking line search exceeded the maximum number of 75 backtracks, we set $\alpha_{k} = 0$. However, the maximum number of backtracks was never exceeded when performing experiments with this first test problem.
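
To make this setup concrete in outline, the following Julia sketch constructs a matrix $T$ with the prescribed eigenvalues and a noisy gradient oracle matching the noise model described above. The random orthogonal similarity used to build $T$, the random seed, and the variable names are our own illustrative choices; the paper specifies only the spectrum of $T$, the noise model, and the starting point.

    using LinearAlgebra, Random

    Random.seed!(0)
    n = 4
    Q = Matrix(qr(randn(n, n)).Q)                 # random orthogonal matrix
    T = Q * Diagonal([1e-2, 1.0, 1e2, 1e4]) * Q'  # Hessian with λ(T) = {1e-2, 1, 1e2, 1e4}
    eps_g = 1.0                                   # gradient noise bound ϵ̄_g

    ϕ(x) = 0.5 * dot(x, T * x)                    # true objective (64); here f = ϕ (no function noise)

    # Noisy gradient oracle (32): e(x) drawn uniformly from the ball ‖e‖₂ ≤ ϵ̄_g.
    function g(x)
        u = randn(length(x)); u ./= norm(u)       # uniform direction on the unit sphere
        r = eps_g * rand()^(1 / length(x))        # radius giving a uniform draw from the ball
        return T * x + r * u
    end

    x0 = 1e5 * ones(n)                            # starting point (66)
    H0 = Matrix{Float64}(I, n, n)                 # initial inverse Hessian approximation H^0 = I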

Algorithm 1 was initialized using the matrix and starting point

H^{0} = I , \qquad x^{0} = 10^{5} \cdot [1, 1, 1, 1]^{T}   (66)

given in (66), with $\left\| \nabla\phi(x^{0}) \right\|_{2} \approx 10^{9}$. Figures 1, 2, and 3 compare the performance of 30 independent runs of SP-BFGS vs. BFGS over a fixed budget of 100 iterations. The relevant curvature condition failed an average of 25.7 total iterations per BFGS run, and 0.6 total iterations per SP-BFGS run. For the sake of comparability, both SP-BFGS and BFGS skipped the update if the relevant curvature condition failed. Observe that SP-BFGS reduces the objective function value by several more orders of magnitude compared to BFGS on average, and maintains significantly better inverse Hessian approximations than BFGS in the presence of gradient noise.

Figure 1: Base 10 logarithm of the optimality gap vs. the iteration number kk for 3030 independent runs. After 100100 iterations, SP-BFGS has an average log10(ϕ100ϕ)\log_{10}(\phi_{100}-\phi^{\star}) of 5.03-5.03 while BFGS has an average log10(ϕ100ϕ)\log_{10}(\phi_{100}-\phi^{\star}) of 1.27-1.27. Observe that both SP-BFGS and BFGS appear to enter 𝒩1(1,1)\mathcal{N}_{1}(1,1), which corresponds to values less than log10(50)1.7\log_{10}(50)\approx 1.7 on the y-axis, but SP-BFGS makes more progress inside 𝒩1(1,1)\mathcal{N}_{1}(1,1). Outside of 𝒩1(1,1)\mathcal{N}_{1}(1,1), the performance of SP-BFGS and BFGS is almost indistinguishable.
Figure 2: Base 10 logarithm of the Euclidean norm of the true gradient ϕk\nabla\phi_{k} vs. the iteration number kk for 3030 independent runs. Note that the BFGS values appear to vary more wildly than the SP-BFGS values.
Figure 3: Base 10 logarithm of the condition number of the true Hessian 2ϕk\nabla^{2}\phi_{k} scaled by the approximate inverse Hessian HkH_{k} at each iteration kk for 3030 independent runs. As ideally one wants Hk2ϕk=IH_{k}\nabla^{2}\phi_{k}=I, which has a condition number of 11, the ideal value on these plots is log10(1)=0\log_{10}(1)=0. Observe how the BFGS approximation deteriorates massively inside 𝒩1(1,1)\mathcal{N}_{1}(1,1), and how SP-BFGS avoids this massive deterioration. From examining the BFGS HkH_{k}, the authors were able to determine that in the region of deterioration, the values of the entries of HkH_{k} are often smaller than 10510^{-5}.

6.2 CUTEst Test Problems With Various Additive Noise Combinations

The remaining 3232 test problems were selected from the CUTEst problem set, the successor of CUTEr gould-orban-toint-cuter . At the time of writing, SIF files and descriptions of all 3232 test problems can be found at https://www.cuter.rl.ac.uk/Problems/mastsif.shtml. As a brief summary, some of the problems can be interpreted as least squares type problems (e.g. ARGTRGLS), some of the problems are ill conditioned or singular type problems (e.g. BOXPOWER), some of the problems are well known nonlinear optimization test problems (e.g. ROSENBR) or extensions of them (e.g. ROSENBRTU, SROSENBR), and some of the problems come from real applications (e.g. COATING, HEART6LS, VIBRBEAM). As shown in Tables 2 and 3, the selected CUTEst test problems vary in size from 22-dimensional to 10001000-dimensional.

Using these 3232 CUTEst test problems and a fixed budget of 20002000 objective function evaluations (not 20002000 iterations) per test, we tested the performance of SP-BFGS compared to BFGS with various combinations of function and gradient noise levels ϵ¯f\bar{\epsilon}_{f} and ϵ¯g\bar{\epsilon}_{g}. For all the experiments in Tables 1, 2, and 3, as well as the additional experiments in Appendix G, both SP-BFGS and BFGS skipped updating if the curvature condition failed. In Tables 1, 2, and 4, the SP-BFGS penalty parameter was set as βk=108ϵ¯gsk2+1010\beta_{k}=\frac{10^{8}}{\bar{\epsilon}_{g}}\left\|s_{k}\right\|_{2}+10^{-10}, as the authors heuristically found that setting Ns=108ϵ¯gN_{s}=\frac{10^{8}}{\bar{\epsilon}_{g}} works well in practice for a variety of problems. With regard to the backtracking line search based on (49), we set α0=1\alpha^{0}=1, ϵA=ϵ¯f\epsilon_{A}=\bar{\epsilon}_{f}, c1=104c_{1}=10^{-4}, τ=1/2\tau=1/2, and the maximum number of backtracks as 4545. We define Δoptlog10(ϕbestϕ)\Delta_{opt}\coloneqq\log_{10}(\phi_{best}-\phi^{\star}) as a measure of the optimality gap, and use ϕbest\phi_{best} to denote the smallest value of the true function ϕ\phi measured at any point during an algorithm run. The true minimum value ϕ\phi^{\star} for each CUTEst problem was obtained from that problem's SIF file. The sample variance (i.e. the variance with Bessel’s correction) is denoted by s2()s^{2}(\cdot).
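For reference, the penalty parameter rule and the reported optimality gap statistic described above can be written compactly as follows; this is a sketch with placeholder names, not the experiment code.

```julia
using LinearAlgebra

# Penalty parameter rule used for the CUTEst experiments in Tables 1, 2, and 4.
penalty_parameter(s_k, ϵg_bar) = (1e8 / ϵg_bar) * norm(s_k) + 1e-10

# Optimality gap statistic Δopt = log10(ϕ_best - ϕ_star): ϕ_best is the smallest
# true function value observed at any evaluated point during a run, and ϕ_star is
# the known minimum taken from the problem's SIF file.
Δopt(ϕ_best, ϕ_star) = log10(ϕ_best - ϕ_star)
```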

Table 1 compares the performance of SP-BFGS vs. BFGS on the Rosenbrock function (i.e. ROSENBR) corrupted by different combinations of function and gradient noise of varying orders of magnitude. Observe that SP-BFGS outperforms BFGS with respect to the mean and median optimality gap for every noise combination in Table 1, sometimes by several orders of magnitude. Tables 2 and 3 compare the performance of SP-BFGS vs. BFGS on the 3232 CUTEst test problems with both function and gradient noise present. Gradient noise was generated using ϵ¯g=104ϕ(x0)2\bar{\epsilon}_{g}=10^{-4}\left\|\nabla\phi(x^{0})\right\|_{2}, and function noise was generated using ϵ¯f=104|ϕ(x0)|\bar{\epsilon}_{f}=10^{-4}\left|\phi(x^{0})\right|, both chosen to ensure that noise does not initially dominate function or gradient evaluations. Note that as the noise in these numerical experiments is additive, the signal to noise ratio of gradient measurements decreases as a stationary point is approached. Overall, SP-BFGS outperforms BFGS on approximately 70%70\% of the CUTEst problems with both function and gradient noise present, and performs at least as well as BFGS on approximately 90%90\% of these problems. Referring to Appendix G, with only gradient noise present, these percentages become 80%80\% and 95%95\% respectively.

ϵ¯f\bar{\epsilon}_{f} ϵ¯g\bar{\epsilon}_{g} Mean(Δopt\Delta_{opt}) Median(Δopt\Delta_{opt}) Min(Δopt\Delta_{opt}) Max(Δopt\Delta_{opt}) s2(Δopt)s^{2}(\Delta_{opt}) Mean(II)
SPBFGS With No Function Noise
0 10410^{-4} -1.4E+01 -1.4E+01 -1.8E+01 -1.2E+01 1.4E+00 114
0 10210^{-2} -1.3E+01 -1.3E+01 -1.5E+01 -8.3E+00 2.9E+00 104
0 10010^{0} -2.1E+00 -1.8E+00 -5.7E+00 -9.2E-01 9.4E-01 153
0 10210^{2} 3.5E-02 2.9E-01 -1.9E+00 7.9E-01 3.9E-01 90
BFGS With No Function Noise
0 10410^{-4} -1.1E+01 -1.0E+01 -1.4E+01 -8.8E+00 1.8E+00 263
0 10210^{-2} -6.6E+00 -6.6E+00 -9.6E+00 -4.3E+00 1.6E+00 281
0 10010^{0} -1.5E+00 -1.2E+00 -3.3E+00 -5.4E-01 6.3E-01 279
0 10210^{2} 1.1E-01 4.3E-01 -2.4E+00 6.5E-01 4.7E-01 373
SPBFGS With Low Function Noise Level
10410^{-4} 10410^{-4} -1.4E+01 -1.4E+01 -1.5E+01 -1.3E+01 1.9E-01 1980
10410^{-4} 10210^{-2} -1.0E+01 -1.0E+01 -1.2E+01 -8.0E+00 1.3E+00 1964
10410^{-4} 10010^{0} -2.1E+00 -2.0E+00 -3.6E+00 -1.6E+00 2.0E-01 1759
10410^{-4} 10210^{2} 8.7E-02 3.1E-01 -2.2E+00 9.1E-01 4.5E-01 1720
BFGS With Low Function Noise Level
10410^{-4} 10410^{-4} -1.1E+01 -1.2E+01 -1.5E+01 -8.7E+00 1.7E+00 1980
10410^{-4} 10210^{-2} -6.6E+00 -6.5E+00 -8.8E+00 -4.7E+00 1.2E+00 1975
10410^{-4} 10010^{0} -1.2E+00 -1.1E+00 -1.8E+00 -8.6E-01 5.9E-02 1936
10410^{-4} 10210^{2} 9.5E-02 5.1E-01 -3.1E+00 9.2E-01 8.5E-01 1934
SPBFGS With Medium Function Noise Level
10210^{-2} 10410^{-4} -1.4E+01 -1.4E+01 -1.5E+01 -1.3E+01 3.4E-01 1981
10210^{-2} 10210^{-2} -1.0E+01 -1.0E+01 -1.3E+01 -7.5E+00 1.5E+00 1977
10210^{-2} 10010^{0} -3.4E+00 -3.0E+00 -7.5E+00 -2.0E+00 1.7E+00 1934
10210^{-2} 10210^{2} -1.8E-01 1.7E-01 -3.7E+00 7.4E-01 1.0E+00 1890
BFGS With Medium Function Noise Level
10210^{-2} 10410^{-4} -1.1E+01 -1.1E+01 -1.4E+01 -8.5E+00 1.4E+00 1981
10210^{-2} 10210^{-2} -6.7E+00 -6.7E+00 -1.0E+01 -4.9E+00 1.7E+00 1979
10210^{-2} 10010^{0} -1.8E+00 -1.5E+00 -3.8E+00 -9.1E-01 6.3E-01 1961
10210^{-2} 10210^{2} 1.4E-01 3.9E-01 -2.3E+00 8.5E-01 6.1E-01 1953
SPBFGS With High Function Noise Level
10010^{0} 10410^{-4} -1.4E+01 -1.4E+01 -1.5E+01 -1.3E+01 2.2E-01 1980
10010^{0} 10210^{-2} -1.0E+01 -1.0E+01 -1.2E+01 -7.3E+00 9.6E-01 1978
10010^{0} 10010^{0} -3.1E+00 -2.8E+00 -5.1E+00 -1.7E+00 8.9E-01 1969
10010^{0} 10210^{2} -2.2E-01 1.1E-02 -1.9E+00 8.4E-01 7.6E-01 1943
BFGS With High Function Noise Level
10010^{0} 10410^{-4} -1.1E+01 -1.1E+01 -1.3E+01 -9.0E+00 1.4E+00 1980
10010^{0} 10210^{-2} -6.7E+00 -6.4E+00 -9.1E+00 -5.0E+00 1.5E+00 1980
10010^{0} 10010^{0} -1.8E+00 -1.4E+00 -5.3E+00 -8.2E-01 1.1E+00 1973
10010^{0} 10210^{2} -2.9E-02 3.7E-01 -2.1E+00 8.9E-01 7.9E-01 1965
Table 1: Performance of SP-BFGS vs. BFGS on the Rosenbrock function (i.e. ROSENBR) corrupted by noise. Δoptlog10(ϕbestϕ)\Delta_{opt}\coloneqq\log_{10}(\phi_{best}-\phi^{\star}) measures the optimality gap, where ϕbest\phi_{best} denotes the smallest value of the true function ϕ\phi measured at any point during an algorithm run. The number of objective function evaluations is fixed at 20002000, but the number of iterations II can vary. Statistics are calculated from a sample of 3030 runs per algorithm.
SP-BFGS With Function And Gradient Noise
Problem Dim. Mean(Δopt\Delta_{opt}) Median(Δopt\Delta_{opt}) Min(Δopt\Delta_{opt}) Max(Δopt\Delta_{opt}) s2(Δopt)s^{2}(\Delta_{opt})
ARGTRGLS 200 4.5E-02 4.8E-02 1.7E-02 8.0E-02 2.5E-04
ARWHEAD 500 -2.5E+00 -2.5E+00 -2.6E+00 -2.5E+00 2.6E-04
BEALE 2 -1.1E+01 -1.1E+01 -1.4E+01 -9.8E+00 8.0E-01
BOX3 3 -7.1E+00 -6.8E+00 -8.9E+00 -6.5E+00 6.2E-01
BOXPOWER 100 -3.8E+00 -3.8E+00 -4.2E+00 -3.5E+00 5.0E-02
BROWNBS 2 -1.2E+00 -7.4E-01 -5.2E+00 2.0E+00 3.5E+00
BROYDNBDLS 50 -6.2E+00 -6.2E+00 -6.4E+00 -6.0E+00 6.9E-03
CHAINWOO 100 1.7E+00 1.8E+00 7.7E-03 2.1E+00 1.6E-01
CHNROSNB 50 -4.2E+00 -4.0E+00 -5.5E+00 -3.6E+00 3.8E-01
COATING 134 -2.7E-02 -1.2E-02 -1.3E-01 9.6E-02 3.5E-03
COOLHANSLS 9 -1.2E+00 -1.1E+00 -1.6E+00 -8.7E-01 1.7E-02
CUBE 2 -5.2E+00 -4.7E+00 -8.9E+00 -3.1E+00 2.2E+00
CYCLOOCFLS 20 -8.4E+00 -8.5E+00 -9.1E+00 -6.9E+00 3.0E-01
EXTROSNB 10 -5.2E+00 -5.2E+00 -5.2E+00 -5.1E+00 1.3E-03
FMINSRF2 64 -8.7E+00 -8.7E+00 -8.7E+00 -8.6E+00 2.6E-04
GENHUMPS 5 4.1E-02 2.4E-01 -2.9E+00 7.8E-01 4.5E-01
GENROSE 5 -9.4E+00 -9.3E+00 -9.9E+00 -9.1E+00 5.6E-02
HEART6LS 6 -3.5E-01 2.7E-01 -2.0E+00 1.2E+00 1.5E+00
HELIX 3 -6.1E+00 -6.0E+00 -7.4E+00 -4.5E+00 5.0E-01
MANCINO 30 -2.1E+00 -2.1E+00 -2.5E+00 -1.9E+00 1.2E-02
METHANB8LS 31 -3.8E+00 -3.9E+00 -4.2E+00 -3.4E+00 3.6E-02
MODBEALE 200 1.1E+00 1.0E+00 4.7E-01 1.8E+00 1.8E-01
NONDIA 10 -4.2E-03 -4.3E-03 -4.4E-03 -3.2E-03 9.1E-08
POWELLSG 4 -6.1E+00 -6.0E+00 -7.9E+00 -4.6E+00 9.1E-01
POWER 10 -3.9E+00 -3.8E+00 -4.9E+00 -3.3E+00 1.9E-01
ROSENBR 2 -8.6E+00 -8.5E+00 -1.1E+01 -6.3E+00 1.8E+00
ROSENBRTU 2 -1.8E+01 -1.8E+01 -2.0E+01 -1.7E+01 4.0E-01
SBRYBND 500 3.9E+00 3.9E+00 3.9E+00 3.9E+00 9.2E-06
SINEVAL 2 -1.4E+01 -1.4E+01 -1.5E+01 -1.3E+01 3.7E-01
SNAIL 2 -1.2E+01 -1.2E+01 -1.4E+01 -1.1E+01 2.9E-01
SROSENBR 1000 5.0E-01 5.0E-01 2.9E-01 6.8E-01 8.6E-03
VIBRBEAM 8 1.5E+00 1.5E+00 1.2E+00 2.1E+00 2.6E-02
Table 2: Performance of SP-BFGS on 3232 selected CUTEst test problems with noise added to both function and gradient evaluations. The number of objective function evaluations is fixed at 20002000. Δoptlog10(ϕbestϕ)\Delta_{opt}\coloneqq\log_{10}(\phi_{best}-\phi^{\star}) measures the optimality gap, where ϕbest\phi_{best} denotes the smallest value of the true function ϕ\phi measured at any point during an algorithm run. Statistics are calculated from a sample of 3030 runs per algorithm, and the Dim. column gives the problem dimension. The SPBFGS penalty parameter was set as βk=108ϵ¯gsk2+1010\beta_{k}=\frac{10^{8}}{\bar{\epsilon}_{g}}\left\|s_{k}\right\|_{2}+10^{-10}. For each problem, function noise was generated using ϵ¯f=104|ϕ(x0)|\bar{\epsilon}_{f}=10^{-4}\left|\phi(x^{0})\right|, and gradient noise was generated using ϵ¯g=104ϕ(x0)2\bar{\epsilon}_{g}=10^{-4}\left\|\nabla\phi(x^{0})\right\|_{2}, where the starting point x0x^{0} varies by CUTEst problem.
BFGS With Function And Gradient Noise
Problem Dim. Mean(Δopt\Delta_{opt}) Median(Δopt\Delta_{opt}) Min(Δopt\Delta_{opt}) Max(Δopt\Delta_{opt}) s2(Δopt)s^{2}(\Delta_{opt})
ARGTRGLS 200 5.6E-02 5.5E-02 2.4E-02 8.4E-02 2.7E-04
ARWHEAD 500 -2.5E+00 -2.5E+00 -2.6E+00 -2.5E+00 4.1E-04
BEALE 2 -7.7E+00 -7.8E+00 -9.7E+00 -6.1E+00 7.1E-01
BOX3 3 -6.5E+00 -6.5E+00 -6.7E+00 -6.4E+00 4.7E-03
BOXPOWER 100 -3.7E+00 -3.7E+00 -4.2E+00 -3.4E+00 3.4E-02
BROWNBS 2 6.8E-01 1.3E+00 -3.2E+00 3.1E+00 2.9E+00
BROYDNBDLS 50 -6.0E+00 -6.0E+00 -6.3E+00 -5.7E+00 2.6E-02
CHAINWOO 100 1.7E+00 1.7E+00 1.2E+00 2.1E+00 5.9E-02
CHNROSNB 50 -4.2E+00 -4.1E+00 -5.7E+00 -3.4E+00 4.4E-01
COATING 134 -3.7E-02 -5.7E-02 -1.6E-01 8.0E-02 4.1E-03
COOLHANSLS 9 -1.0E+00 -1.0E+00 -2.0E+00 -4.5E-01 7.2E-02
CUBE 2 -1.6E+00 -1.4E+00 -3.6E+00 -9.7E-01 4.1E-01
CYCLOOCFLS 20 -7.2E+00 -7.2E+00 -9.1E+00 -5.8E+00 8.7E-01
EXTROSNB 10 -5.2E+00 -5.2E+00 -5.2E+00 -5.1E+00 1.8E-03
FMINSRF2 64 -8.6E+00 -8.7E+00 -8.8E+00 -8.2E+00 2.8E-02
GENHUMPS 5 1.2E-01 1.2E-01 -1.2E+00 8.1E-01 2.3E-01
GENROSE 5 -7.5E+00 -7.6E+00 -9.1E+00 -6.2E+00 7.3E-01
HEART6LS 6 3.1E-01 6.1E-01 -1.9E+00 1.2E+00 1.4E+00
HELIX 3 -4.5E+00 -4.7E+00 -7.0E+00 -2.7E+00 1.1E+00
MANCINO 30 -1.6E+00 -1.6E+00 -1.8E+00 -1.3E+00 1.3E-02
METHANB8LS 31 -3.9E+00 -3.8E+00 -4.4E+00 -3.6E+00 5.5E-02
MODBEALE 200 1.1E+00 1.1E+00 2.9E-01 1.8E+00 1.6E-01
NONDIA 10 -3.7E-03 -3.8E-03 -4.4E-03 -2.6E-03 3.1E-07
POWELLSG 4 -5.2E+00 -5.2E+00 -7.6E+00 -4.2E+00 7.1E-01
POWER 10 -3.5E+00 -3.5E+00 -4.1E+00 -2.9E+00 1.0E-01
ROSENBR 2 -5.9E+00 -5.5E+00 -9.2E+00 -4.5E+00 1.4E+00
ROSENBRTU 2 -1.6E+01 -1.6E+01 -1.8E+01 -1.4E+01 1.5E+00
SBRYBND 500 3.9E+00 3.9E+00 3.9E+00 3.9E+00 2.7E-05
SINEVAL 2 -1.1E+01 -1.1E+01 -1.3E+01 -8.9E+00 1.3E+00
SNAIL 2 -9.4E+00 -9.2E+00 -1.2E+01 -8.0E+00 7.2E-01
SROSENBR 1000 5.4E-01 5.4E-01 3.6E-01 7.8E-01 6.9E-03
VIBRBEAM 8 1.7E+00 1.7E+00 1.2E+00 2.0E+00 2.9E-02
Table 3: Performance of BFGS on 3232 selected CUTEst test problems with noise added to both function and gradient evaluations. The number of objective function evaluations is fixed at 20002000. Δoptlog10(ϕbestϕ)\Delta_{opt}\coloneqq\log_{10}(\phi_{best}-\phi^{\star}) measures the optimality gap, where ϕbest\phi_{best} denotes the smallest value of the true function ϕ\phi measured at any point during an algorithm run. Statistics are calculated from a sample of 3030 runs per algorithm, and the Dim. column gives the problem dimension. For each problem, function noise was generated using ϵ¯f=104|ϕ(x0)|\bar{\epsilon}_{f}=10^{-4}\left|\phi(x^{0})\right|, and gradient noise was generated using ϵ¯g=104ϕ(x0)2\bar{\epsilon}_{g}=10^{-4}\left\|\nabla\phi(x^{0})\right\|_{2}, where the starting point x0x^{0} varies by CUTEst problem.

7 Final Remarks

In this paper, we introduced SP-BFGS, a new variant of the BFGS method designed to resist the corrupting effects of noise. Motivated by regularized least squares estimation, we derived the SP-BFGS update by applying a penalty method to the secant condition. We argued that with an appropriate choice of penalty parameter, SP-BFGS updating is robust to the corrupting effects of noise that can destroy the performance of BFGS. We empirically validated this claim by performing numerical experiments on a diverse set of over 3030 test problems with both function and gradient noise of varying orders of magnitude. The results of these numerical experiments showed that SP-BFGS can outperform BFGS approximately 70%70\% or more of the time, and performs at least as well as BFGS approximately 90%90\% or more of the time. Furthermore, a theoretical analysis confirmed that with appropriate choices of penalty parameter, it is possible to guarantee that SP-BFGS is not corrupted arbitrarily badly by noise, unlike standard BFGS. In the future, we believe it is worth investigating the performance of SP-BFGS in the presence of other types of noise, including multiplicative stochastic noise and deterministic noise, and we also believe it is worthwhile to study the use of noise estimation techniques in conjunction with SP-BFGS updating. The authors are also working to publish a limited memory version of SP-BFGS for high dimensional noisy problems.

Acknowledgements.
EH and BI’s work is supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) and the University of British Columbia (UBC).

Conflict of interest

The authors declare that they have no conflict of interest.

References

  • (1) Aydin, L., Aydin, O., Artem, H.S., Mert, A.: Design of dimensionally stable composites using efficient global optimization method. Proceedings of the Institution of Mechanical Engineers, Part L: Journal of Materials: Design and Applications 233(2), 156–168 (2019). DOI 10.1177/1464420716664921. URL https://doi.org/10.1177/1464420716664921
  • (2) Berahas, A.S., Byrd, R.H., Nocedal, J.: Derivative-free optimization of noisy functions via quasi-newton methods. SIAM Journal on Optimization 29, 965–993 (2019)
  • (3) Besançon, M., Anthoff, D., Arslan, A., Byrne, S., Lin, D., Papamarkou, T., Pearson, J.: Distributions.jl: Definition and modeling of probability distributions in the juliastats ecosystem. arXiv e-prints arXiv:1907.08611 (2019)
  • (4) Bezanson, J., Edelman, A., Karpinski, S., Shah, V.B.: Julia: A fresh approach to numerical computing. SIAM Review 59(1), 65–98 (2017). DOI 10.1137/141000671. URL https://doi.org/10.1137/141000671
  • (5) Bons, N.P., He, X., Mader, C.A., Martins, J.R.R.A.: Multimodality in aerodynamic wing design optimization. AIAA Journal 57(3), 1004–1018 (2019). DOI 10.2514/1.J057294. URL https://doi.org/10.2514/1.J057294
  • (6) Broyden, C.G.: The Convergence of a Class of Double-rank Minimization Algorithms 1. General Considerations. IMA Journal of Applied Mathematics 6(1), 76–90 (1970). DOI 10.1093/imamat/6.1.76. URL https://doi.org/10.1093/imamat/6.1.76
  • (7) Byrd, R.H., Hansen, S.L., Nocedal, J., Singer, Y.: A stochastic quasi-newton method for large-scale optimization. SIAM Journal on Optimization 26(2), 1008–1031 (2016). DOI 10.1137/140954362. URL https://doi.org/10.1137/140954362
  • (8) Byrd, R.H., Lu, P., Nocedal, J., Zhu, C.: A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing 16(5), 1190–1208 (1995). DOI 10.1137/0916069. URL https://doi.org/10.1137/0916069
  • (9) Byrd, R.H., Nocedal, J.: A tool for the analysis of quasi-newton methods with application to unconstrained minimization. SIAM Journal on Numerical Analysis 26(3), 727–739 (1989). URL http://www.jstor.org/stable/2157680
  • (10) Byrd, R.H., Nocedal, J., Yuan, Y.X.: Global convergence of a class of quasi-newton methods on convex problems. SIAM Journal on Numerical Analysis 24(5), 1171–1190 (1987). URL http://www.jstor.org/stable/2157646
  • (11) Chang, D., Sun, S., Zhang, C.: An accelerated linearly convergent stochastic l-bfgs algorithm. IEEE Transactions on Neural Networks and Learning Systems 30(11), 3338–3346 (2019)
  • (12) Fasano, G., Pintér, J.D.: Modeling and Optimization in Space Engineering: State of the Art and New Challenges. Springer International Publishing (2019)
  • (13) Fletcher, R.: A new approach to variable metric algorithms. The Computer Journal 13(3), 317–322 (1970). DOI 10.1093/comjnl/13.3.317. URL https://doi.org/10.1093/comjnl/13.3.317
  • (14) Gal, R., Haber, E., Irwin, B., Saleh, B., Ziv, A.: How to catch a lion in the desert: on the solution of the coverage directed generation (CDG) problem. Optimization and Engineering (2020). DOI 10.1007/s11081-020-09507-w. URL https://doi.org/10.1007%2Fs11081-020-09507-w
  • (15) Goldfarb, D.: A family of variable-metric methods derived by variational means. Mathematics of Computation 24(109), 23–26 (1970). URL http://www.jstor.org/stable/2004873
  • (16) Gould, N.I.M., Orban, D., contributors: The Constrained and Unconstrained Testing Environment with safe threads (CUTEst) for optimization software. https://github.com/ralna/CUTEst (2019)
  • (17) Gould, N.I.M., Orban, D., Toint, P.L.: CUTEr a Constrained and Unconstrained Testing Environment, revisited. https://www.cuter.rl.ac.uk (2001)
  • (18) Gower, R., Goldfarb, D., Richtarik, P.: Stochastic block bfgs: Squeezing more curvature out of data. In: M.F. Balcan, K.Q. Weinberger (eds.) Proceedings of The 33rd International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 48, pp. 1869–1878. PMLR, New York, New York, USA (2016). URL http://proceedings.mlr.press/v48/gower16.html
  • (19) Graf, P.A., Billups, S.: Mdtri: robust and efficient global mixed integer search of spaces of multiple ternary alloys. Computational Optimization and Applications 68(3), 671–687 (2017). DOI 10.1007/s10589-017-9922-9. URL https://doi.org/10.1007/s10589-017-9922-9
  • (20) Güler, O., Gürtuna, F., Shevchenko, O.: Duality in quasi-newton methods and new variational characterizations of the dfp and bfgs updates. Optimization Methods and Software 24(1), 45–62 (2009). DOI 10.1080/10556780802367205. URL https://doi.org/10.1080/10556780802367205
  • (21) Hager, W.W.: Updating the inverse of a matrix. SIAM Review 31(2), 221–239 (1989). URL http://www.jstor.org/stable/2030425
  • (22) Horn, R.A., Johnson, C.R.: Matrix analysis, 2nd edn. Cambridge University Press, New York (2013)
  • (23) Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Advances in Neural Information Processing Systems, pp. 315–323 (2013)
  • (24) Johnson, S.G.: Quasi-newton optimization: Origin of the bfgs update (2019). URL https://ocw.mit.edu/courses/mathematics/18-335j-introduction-to-numerical-methods-spring-2019/week-11/MIT18_335JS19_lec30.pdf
  • (25) Keane, A.J., Nair, P.B.: Computational Approaches for Aerospace Design: The Pursuit of Excellence. John Wiley & Sons, Ltd (2005). DOI 10.1002/0470855487. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/0470855487
  • (26) Koziel, S., Ogurtsov, S.: Antenna Design by Simulation-Driven Optimization. Springer International Publishing (2014)
  • (27) Lewis, A.S., Overton, M.L.: Nonsmooth optimization via quasi-newton methods. Mathematical Programming 141, 135–163 (2013)
  • (28) Lin, D., White, J.M., Byrne, S., Bates, D., Noack, A., Pearson, J., Arslan, A., Squire, K., Anthoff, D., Papamarkou, T., Besançon, M., Drugowitsch, J., Schauer, M., other contributors: JuliaStats/Distributions.jl: a Julia package for probability distributions and associated functions. https://github.com/JuliaStats/Distributions.jl (2019). DOI 10.5281/zenodo.2647458. URL https://doi.org/10.5281/zenodo.2647458
  • (29) Liu, D.C., Nocedal, J.: On the limited memory bfgs method for large scale optimization. Mathematical Programming 45(1), 503–528 (1989). DOI 10.1007/BF01589116. URL https://doi.org/10.1007/BF01589116
  • (30) Mokhtari, A., Ribeiro, A.: Global convergence of online limited memory bfgs. Journal of Machine Learning Research 16(1), 3151–3181 (2015)
  • (31) Moritz, P., Nishihara, R., Jordan, M.: A linearly-convergent stochastic l-bfgs algorithm. In: A. Gretton, C.C. Robert (eds.) Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, vol. 51, pp. 249–258. PMLR, Cadiz, Spain (2016). URL http://proceedings.mlr.press/v51/moritz16.html
  • (32) Muñoz-Rojas, P.A.: Computational Modeling, Optimization and Manufacturing Simulation of Advanced Engineering Materials. Springer (2016)
  • (33) Nocedal, J., Wright, S.: Numerical Optimization. Springer, New York (2006)
  • (34) Orban, D., Siqueira, A.S., contributors: CUTEst.jl: Julia’s CUTEst interface. https://github.com/JuliaSmoothOptimizers/CUTEst.jl (2020). DOI 10.5281/zenodo.1188851
  • (35) Orban, D., Siqueira, A.S., contributors: NLPModels.jl: Data structures for optimization models. https://github.com/JuliaSmoothOptimizers/NLPModels.jl (2020). DOI 10.5281/zenodo.2558627
  • (36) Powell, M.J.D.: Algorithms for nonlinear constraints that use lagrangian functions. Mathematical Programming 14(1), 224–248 (1978). DOI 10.1007/BF01588967. URL https://doi.org/10.1007/BF01588967
  • (37) Rosenbrock, H.H.: An Automatic Method for Finding the Greatest or Least Value of a Function. The Computer Journal 3(3), 175–184 (1960). DOI 10.1093/comjnl/3.3.175. URL https://doi.org/10.1093/comjnl/3.3.175
  • (38) Schraudolph, N.N., Yu, J., Günter, S.: A stochastic quasi-newton method for online convex optimization. In: M. Meila, X. Shen (eds.) Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, vol. 2, pp. 436–443. PMLR, San Juan, Puerto Rico (2007). URL http://proceedings.mlr.press/v2/schraudolph07a.html
  • (39) Shanno, D.F.: Conditioning of quasi-newton methods for function minimization. Mathematics of Computation 24(111), 647–656 (1970). URL http://www.jstor.org/stable/2004840
  • (40) Wang, X., Ma, S., Goldfarb, D., Liu, W.: Stochastic quasi-newton methods for nonconvex stochastic optimization. SIAM Journal on Optimization 27(2), 927–956 (2017). DOI 10.1137/15M1053141. URL https://doi.org/10.1137/15M1053141
  • (41) Xie, Y., Byrd, R.H., Nocedal, J.: Analysis of the bfgs method with errors. SIAM Journal on Optimization 30(1), 182–209 (2020). DOI 10.1137/19M1240794. URL https://doi.org/10.1137/19M1240794
  • (42) Zhao, R., Haskell, W.B., Tan, V.Y.F.: Stochastic l-bfgs: Improved convergence rates and practical acceleration strategies. IEEE Transactions on Signal Processing 66, 1155–1169 (2018)
  • (43) Zhu, J.: Optimization of Power System Operation. John Wiley & Sons, Ltd (2008). DOI 10.1002/9780470466971. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/9780470466971

Appendix A Proof of Theorem 3.1

To produce the SP-BFGS update, we first rearrange (26a), revealing that

(HHk)=W1(uykT+ΓTΓ)W1(H-H_{k})=-W^{-1}(uy_{k}^{T}+\Gamma^{T}-\Gamma)W^{-1} (67)

and so, by the symmetry requirement H=HTH=H^{T}, transposing (67) gives

uykT+ΓTΓ=(uykT+ΓTΓ)T=ykuT+ΓΓTuy_{k}^{T}+\Gamma^{T}-\Gamma=(uy_{k}^{T}+\Gamma^{T}-\Gamma)^{T}=y_{k}u^{T}+\Gamma-\Gamma^{T} (68)

which rearranges to

ΓTΓ=12(ykuTuykT)\Gamma^{T}-\Gamma=\frac{1}{2}(y_{k}u^{T}-uy_{k}^{T}) (69)

and so

(HHk)=12W1(ykuT+uykT)W1.(H-H_{k})=-\frac{1}{2}W^{-1}(y_{k}u^{T}+uy_{k}^{T})W^{-1}. (70)

Next, we right multiply (70) by yky_{k} to get

(HHk)yk=12W1(ykuTW1yk+u(ykTW1yk))(H-H_{k})y_{k}=-\frac{1}{2}W^{-1}\bigg{(}y_{k}u^{T}W^{-1}y_{k}+u(y_{k}^{T}W^{-1}y_{k})\bigg{)} (71)

and use (26b) to get that

sk+W1uβkHkyk=12W1(ykuTW1yk+u(ykTW1yk)).s_{k}+\frac{W^{-1}u}{\beta_{k}}-H_{k}y_{k}=-\frac{1}{2}W^{-1}\bigg{(}y_{k}u^{T}W^{-1}y_{k}+u(y_{k}^{T}W^{-1}y_{k})\bigg{)}. (72)

We now left multiply both sides by 2W-2W and rearrange, giving

2W(skHkyk)=ykuTW1yk+u(ykTW1yk+2βk).-2W(s_{k}-H_{k}y_{k})=y_{k}u^{T}W^{-1}y_{k}+u\bigg{(}y_{k}^{T}W^{-1}y_{k}+\frac{2}{\beta_{k}}\bigg{)}. (73)

This can be rearranged so that uu is isolated, giving

u=2W(skHkyk)ykuTW1ykykTW1yk+2βk=2W(skHkyk)+ykuTW1ykykTW1yk+2βk.u=\frac{-2W(s_{k}-H_{k}y_{k})-y_{k}u^{T}W^{-1}y_{k}}{y_{k}^{T}W^{-1}y_{k}+\frac{2}{\beta_{k}}}=-\frac{2W(s_{k}-H_{k}y_{k})+y_{k}u^{T}W^{-1}y_{k}}{y_{k}^{T}W^{-1}y_{k}+\frac{2}{\beta_{k}}}. (74)

To get rid of the uTu^{T} on the right hand side, we first left multiply both sides by ykTW1y_{k}^{T}W^{-1}, and then transpose to get

uTW1yk=2(skHkyk)Tyk+(ykTW1yk)(uTW1yk)ykTW1yk+2βku^{T}W^{-1}y_{k}=-\frac{2(s_{k}-H_{k}y_{k})^{T}y_{k}+(y_{k}^{T}W^{-1}y_{k})(u^{T}W^{-1}y_{k})}{y_{k}^{T}W^{-1}y_{k}+\frac{2}{\beta_{k}}} (75)

where we have taken advantage of the fact that the transpose of a scalar is the same scalar. This now allows us to solve for uTW1yku^{T}W^{-1}y_{k} using some basic algebra, resulting in

uTW1yk=(skHkyk)TykykTW1yk+1βk.u^{T}W^{-1}y_{k}=-\frac{(s_{k}-H_{k}y_{k})^{T}y_{k}}{y_{k}^{T}W^{-1}y_{k}+\frac{1}{\beta_{k}}}. (76)

Substituting (76) into (74) gives

u=ykykT(skHkyk)(ykTW1yk+2βk)(ykTW1yk+1βk)2W(skHkyk)ykTW1yk+2βk.u=\frac{y_{k}y_{k}^{T}(s_{k}-H_{k}y_{k})}{(y_{k}^{T}W^{-1}y_{k}+\frac{2}{\beta_{k}})(y_{k}^{T}W^{-1}y_{k}+\frac{1}{\beta_{k}})}-\frac{2W(s_{k}-H_{k}y_{k})}{y_{k}^{T}W^{-1}y_{k}+\frac{2}{\beta_{k}}}. (77)

Now, if we substitute the expression for uu in (77) into (70), after some simplification we get

(HHk)=1(ykTW1yk+2βk)[(skHkyk)ykTW1+W1yk(skHkyk)TykT(skHkyk)(ykTW1yk+1βk)W1ykykTW1].(H-H_{k})=\frac{1}{(y_{k}^{T}W^{-1}y_{k}+\frac{2}{\beta_{k}})}\bigg{[}(s_{k}-H_{k}y_{k})y_{k}^{T}W^{-1}+W^{-1}y_{k}(s_{k}-H_{k}y_{k})^{T}\\ -\frac{y_{k}^{T}(s_{k}-H_{k}y_{k})}{(y_{k}^{T}W^{-1}y_{k}+\frac{1}{\beta_{k}})}W^{-1}y_{k}y_{k}^{T}W^{-1}\bigg{]}. (78)

Now, we simplify further by using the fact that Wsk=ykWs_{k}=y_{k}, and thus W1yk=skW^{-1}y_{k}=s_{k}, revealing

H=Hk+(skHkyk)skT+sk(skHkyk)T(ykTsk+2βk)ykT(skHkyk)(ykTsk+2βk)(ykTsk+1βk)skskTH=H_{k}+\frac{(s_{k}-H_{k}y_{k})s_{k}^{T}+s_{k}(s_{k}-H_{k}y_{k})^{T}}{(y_{k}^{T}s_{k}+\frac{2}{\beta_{k}})}-\frac{y_{k}^{T}(s_{k}-H_{k}y_{k})}{(y_{k}^{T}s_{k}+\frac{2}{\beta_{k}})(y_{k}^{T}s_{k}+\frac{1}{\beta_{k}})}s_{k}s_{k}^{T} (79)

which, after a bit of algebra, reveals that the update formula solving the system defined by (26a), (26b), and (26c) can be expressed as

H=HkHkykskT+skykTHkT(ykTsk+2βk)+[ykTsk+2βk+ykTHkyk(ykTsk+2βk)(ykTsk+1βk)]skskT.H^{*}=H_{k}-\frac{H_{k}y_{k}s_{k}^{T}+s_{k}y_{k}^{T}H_{k}^{T}}{(y_{k}^{T}s_{k}+\frac{2}{\beta_{k}})}+\bigg{[}\frac{y_{k}^{T}s_{k}+\frac{2}{\beta_{k}}+y_{k}^{T}H_{k}y_{k}}{(y_{k}^{T}s_{k}+\frac{2}{\beta_{k}})(y_{k}^{T}s_{k}+\frac{1}{\beta_{k}})}\bigg{]}s_{k}s_{k}^{T}. (80)

We can make (80) look similar to the common form of the BFGS update given in (19) by defining the two quantities γk\gamma_{k} and ωk\omega_{k} as in (28) and observing that completing the square gives

H=(IskykT(ykTsk+2βk))Hk(IykskT(ykTsk+2βk))+[ykTsk+2βk+ykTHkyk(ykTsk+2βk)(ykTsk+1βk)ykTHkyk(ykTsk+2βk)2]skskTH^{*}=\bigg{(}I-\frac{s_{k}y_{k}^{T}}{(y_{k}^{T}s_{k}+\frac{2}{\beta_{k}})}\bigg{)}H_{k}\bigg{(}I-\frac{y_{k}s_{k}^{T}}{(y_{k}^{T}s_{k}+\frac{2}{\beta_{k}})}\bigg{)}\\ +\bigg{[}\frac{y_{k}^{T}s_{k}+\frac{2}{\beta_{k}}+y_{k}^{T}H_{k}y_{k}}{(y_{k}^{T}s_{k}+\frac{2}{\beta_{k}})(y_{k}^{T}s_{k}+\frac{1}{\beta_{k}})}-\frac{y_{k}^{T}H_{k}y_{k}}{(y_{k}^{T}s_{k}+\frac{2}{\beta_{k}})^{2}}\bigg{]}s_{k}s_{k}^{T} (81)

which is equivalent to

H=(IωkskykT)Hk(IωkykskT)+ωk[γkωk+(γkωk)ykTHkyk]skskTH^{*}=\bigg{(}I-\omega_{k}s_{k}y_{k}^{T}\bigg{)}H_{k}\bigg{(}I-\omega_{k}y_{k}s_{k}^{T}\bigg{)}+\omega_{k}\bigg{[}\frac{\gamma_{k}}{\omega_{k}}+(\gamma_{k}-\omega_{k})y_{k}^{T}H_{k}y_{k}\bigg{]}s_{k}s_{k}^{T} (82)

concluding the proof.
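As a numerical sanity check of (82), the sketch below implements the update with γk = 1/(ykTsk + 1/βk) and ωk = 1/(ykTsk + 2/βk), as can be read off by comparing (80) with (82), and verifies the limiting behaviour discussed in the paper: a very large penalty parameter recovers the standard BFGS update, while a very small one leaves Hk unchanged. This is an illustrative sketch, not the authors' implementation.

```julia
using LinearAlgebra

# SP-BFGS inverse Hessian update (82), with γ and ω as read off from (80)–(82).
function spbfgs_update(H, s, y, β)
    γ = 1 / (dot(y, s) + 1 / β)
    ω = 1 / (dot(y, s) + 2 / β)
    A = I - ω * s * y'
    return A * H * A' + ω * (γ / ω + (γ - ω) * dot(y, H * y)) * s * s'
end

H = Matrix(1.0I, 4, 4)
s = [1.0, 0.5, -0.3, 0.2]
y = [0.9, 0.6, -0.2, 0.3]          # yᵀs > 0, so the curvature condition holds

# β → ∞ recovers the standard BFGS update ...
ρ = 1 / dot(y, s)
H_bfgs = (I - ρ * s * y') * H * (I - ρ * y * s') + ρ * s * s'
@assert norm(spbfgs_update(H, s, y, 1e12) - H_bfgs) < 1e-8

# ... while β → 0 leaves the inverse Hessian approximation unchanged.
@assert norm(spbfgs_update(H, s, y, 1e-12) - H) < 1e-8
```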

Appendix B Proof of Lemma 1

The Hk+1H_{k+1} given by (27) has the general form

Hk+1=GTHkG+dskskTH_{k+1}=G^{T}H_{k}G+ds_{k}s_{k}^{T} (83)

with the specific choices

G=IωkykskT,d=ωk[γkωk+(γkωk)ykTHkyk].G=I-\omega_{k}y_{k}s_{k}^{T},\quad d=\omega_{k}\bigg{[}\frac{\gamma_{k}}{\omega_{k}}+(\gamma_{k}-\omega_{k})y_{k}^{T}H_{k}y_{k}\bigg{]}. (84)

By definition, Hk+1H_{k+1} is positive definite if

vTHk+1v>0,vn0 .v^{T}H_{k+1}v>0,\quad\forall v\in\mathbb{R}^{n}\setminus 0\text{~{}~{}}. (85)

We first show that (29) is a sufficient condition for Hk+1H_{k+1} to be positive definite, given that HkH_{k} is positive definite. By applying (83) to (85), we see that

vT(GTHkG+dskskT)v>0,vn0v^{T}\bigg{(}G^{T}H_{k}G+ds_{k}s_{k}^{T}\bigg{)}v>0,\quad\forall v\in\mathbb{R}^{n}\setminus 0 (86)

must be true for the choices of GG and dd in (84) if Hk+1H_{k+1} is positive definite. Substituting (84) into (86) reveals that

(vωk(skTv)yk)THk(vωk(skTv)yk)+ωk[γkωk+(γkωk)ykTHkyk](skTv)2>0\bigg{(}v-\omega_{k}(s_{k}^{T}v)y_{k}\bigg{)}^{T}H_{k}\bigg{(}v-\omega_{k}(s_{k}^{T}v)y_{k}\bigg{)}+\omega_{k}\bigg{[}\frac{\gamma_{k}}{\omega_{k}}+(\gamma_{k}-\omega_{k})y_{k}^{T}H_{k}y_{k}\bigg{]}(s_{k}^{T}v)^{2}>0 (87)

must be true for all vn0v\in\mathbb{R}^{n}\setminus 0 if Hk+1H_{k+1} is positive definite. Both (skTv)2(s_{k}^{T}v)^{2} and vTGTHkGvv^{T}G^{T}H_{k}Gv are always nonnegative. To see that vTGTHkGv0v^{T}G^{T}H_{k}Gv\geq 0, note that because HkH_{k} is positive definite, it has a principal square root Hk1/2H_{k}^{1/2}, and so

vTGTHkGv=vTGTHk1/2Hk1/2Gv=Hk1/2Gv220 .v^{T}G^{T}H_{k}Gv=v^{T}G^{T}H_{k}^{1/2}H_{k}^{1/2}Gv=\left\|H_{k}^{1/2}Gv\right\|_{2}^{2}\geq 0\text{~{}}. (88)

We now observe that if d>0d>0, the right term d(skTv)2d(s_{k}^{T}v)^{2} in (87) is zero if and only if (skTv)=0(s_{k}^{T}v)=0. However, if (skTv)=0(s_{k}^{T}v)=0, then the left term vTGTHkGvv^{T}G^{T}H_{k}Gv in (87) is zero only when v=0v=0. Hence, the condition d>0d>0 guarantees that (87) is true for all vv excluding the zero vector, and thus that Hk+1H_{k+1} is positive definite. The condition d>0d>0 expands to

γk+ωk(γkωk)ykTHkyk>0 .\gamma_{k}+\omega_{k}(\gamma_{k}-\omega_{k})y_{k}^{T}H_{k}y_{k}>0\text{~{}}. (89)

Using the definitions of γk\gamma_{k} and ωk\omega_{k} in (28), it is clear that (γkωk)0(\gamma_{k}-\omega_{k})\geq 0, as βk\beta_{k} can only take nonnegative values. Furthermore, as HkH_{k} is positive definite, ykTHkyk0y_{k}^{T}H_{k}y_{k}\geq 0 for all yky_{k}. As it is possible for (γkωk)ykTHkyk(\gamma_{k}-\omega_{k})y_{k}^{T}H_{k}y_{k} to be zero, we require γk>0\gamma_{k}>0. The condition γk>0\gamma_{k}>0 immediately gives (29), as γk\gamma_{k} can only be positive if the denominator in its definition is positive. Finally, as βk\beta_{k} can only take nonnegative values, (29) also ensures that ωk\omega_{k} is nonnegative, and so when (29) is true, ωk(γkωk)ykTHkyk0\omega_{k}(\gamma_{k}-\omega_{k})y_{k}^{T}H_{k}y_{k}\geq 0. In summary, we have shown that the condition (29) ensures that the left term in (89) is positive and the right term is nonnegative, so d>0d>0, and thus Hk+1H_{k+1} is positive definite.

We now show that (29) is a necessary condition for Hk+1H_{k+1} to be positive definite, given that HkH_{k} is positive definite. If Hk+1H_{k+1} is positive definite, then

ykTHk+1yk>0y_{k}^{T}H_{k+1}y_{k}>0 (90)

assuming yk0y_{k}\neq 0. Substituting (26b) into (90) gives

ykT[sk+W1uβk]>0y_{k}^{T}\bigg{[}s_{k}+\frac{W^{-1}u}{\beta_{k}}\bigg{]}>0 (91)

and using (76) shows that (91) is equivalent to

ykT[sk+γk(Hkyksk)βk]>0.y_{k}^{T}\bigg{[}s_{k}+\frac{\gamma_{k}(H_{k}y_{k}-s_{k})}{\beta_{k}}\bigg{]}>0. (92)

Now, some algebra shows that

ykT[sk+γk(Hkyksk)βk]=ykTsk+11+βkykTsk[ykTHkykykTsk]=(111+βkykTsk)ykTsk+(11+βkykTsk)ykTHkyk=(βkykTsk1+βkykTsk)ykTsk+(11+βkykTsk)ykTHkyk=βk(ykTsk)2+ykTHkyk1+βkykTsk\begin{split}y_{k}^{T}\bigg{[}s_{k}+\frac{\gamma_{k}(H_{k}y_{k}-s_{k})}{\beta_{k}}\bigg{]}&=y_{k}^{T}s_{k}+\frac{1}{1+\beta_{k}y_{k}^{T}s_{k}}\bigg{[}y_{k}^{T}H_{k}y_{k}-y_{k}^{T}s_{k}\bigg{]}\\ &=\bigg{(}1-\frac{1}{1+\beta_{k}y_{k}^{T}s_{k}}\bigg{)}y_{k}^{T}s_{k}+\bigg{(}\frac{1}{1+\beta_{k}y_{k}^{T}s_{k}}\bigg{)}y_{k}^{T}H_{k}y_{k}\\ &=\bigg{(}\frac{\beta_{k}y_{k}^{T}s_{k}}{1+\beta_{k}y_{k}^{T}s_{k}}\bigg{)}y_{k}^{T}s_{k}+\bigg{(}\frac{1}{1+\beta_{k}y_{k}^{T}s_{k}}\bigg{)}y_{k}^{T}H_{k}y_{k}\\ &=\frac{\beta_{k}(y_{k}^{T}s_{k})^{2}+y_{k}^{T}H_{k}y_{k}}{1+\beta_{k}y_{k}^{T}s_{k}}\end{split} (93)

and we also know that because HkH_{k} is positive definite, ykTHkyk>0y_{k}^{T}H_{k}y_{k}>0 for all yk0y_{k}\neq 0, by definition βk0\beta_{k}\geq 0, and by the definition of the square of a real number, (ykTsk)20(y_{k}^{T}s_{k})^{2}\geq 0. As a result,

ykT[sk+W1uβk]=βk(ykTsk)2+ykTHkyk1+βkykTsk>0y_{k}^{T}\bigg{[}s_{k}+\frac{W^{-1}u}{\beta_{k}}\bigg{]}=\frac{\beta_{k}(y_{k}^{T}s_{k})^{2}+y_{k}^{T}H_{k}y_{k}}{1+\beta_{k}y_{k}^{T}s_{k}}>0 (94)

is guaranteed only if the denominator 1+βkykTsk1+\beta_{k}y_{k}^{T}s_{k} is positive, which occurs when

skTyk>1βk.s_{k}^{T}y_{k}>-\frac{1}{\beta_{k}}. (95)

This establishes that (29) is a necessary condition for Hk+1H_{k+1} to be positive definite, given that HkH_{k} is positive definite, and concludes the proof.
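The role of condition (29) can also be illustrated numerically: with a finite penalty parameter, the update can absorb a measured negative curvature pair and still produce a positive definite Hk+1, provided skTyk > -1/βk. The sketch below (illustrative only) applies the explicit form (80) of the update to such a pair.

```julia
using LinearAlgebra

# Negative curvature pair: sᵀy = -0.55 < 0, so standard BFGS would have to skip
# the update, but condition (29) holds because sᵀy = -0.55 > -1/β = -1.0.
H = Matrix(1.0I, 2, 2)
s = [1.0, -0.5]
y = [-0.4, 0.3]
β = 1.0

# SP-BFGS update written in the explicit form (80).
a = dot(y, s) + 2 / β
b = dot(y, s) + 1 / β
Hnew = H - (H * y * s' + s * y' * H') / a +
       ((a + dot(y, H * y)) / (a * b)) * s * s'

@assert isposdef(Symmetric(Hnew))   # Hk+1 remains positive definite
```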

Appendix C Proof of Theorem 3.2

The Sherman-Morrison-Woodbury formula says

(A+UCV)1=A1A1U(C1+VA1U)1VA1.(A+UCV)^{-1}=A^{-1}-A^{-1}U(C^{-1}+VA^{-1}U)^{-1}VA^{-1}. (96)

Now, observe that the SP-BFGS update (27) can be written in the factored form

Hk+1=Hk+ωk[skHkyk][γk(1ωk+ykTHkyk)110][skTykTHk].H_{k+1}=H_{k}+\omega_{k}\big{[}s_{k}\quad H_{k}y_{k}\big{]}\left[\begin{array}[]{cc}\gamma_{k}\big{(}\frac{1}{\omega_{k}}+y_{k}^{T}H_{k}y_{k}\big{)}&-1\\ -1&0\end{array}\right]\left[\begin{array}[]{c}s_{k}^{T}\\ y_{k}^{T}H_{k}\end{array}\right]. (97)

Applying the Sherman-Morrison-Woodbury formula (96) to the factored SP-BFGS update (97) with

A=Hk,U=ωk[skHkyk],C=[γk(1ωk+ykTHkyk)110],V=[skTykTHk]A=H_{k},\quad U=\omega_{k}\big{[}s_{k}\quad H_{k}y_{k}\big{]},\quad C=\left[\begin{array}[]{cc}\gamma_{k}\big{(}\frac{1}{\omega_{k}}+y_{k}^{T}H_{k}y_{k}\big{)}&-1\\ -1&0\end{array}\right],\quad V=\left[\begin{array}[]{c}s_{k}^{T}\\ y_{k}^{T}H_{k}\end{array}\right]

yields

Hk+11=Hk1Hk1ωk[skHkyk]([γk(1ωk+ykTHkyk)110]1+[skTykTHk]Hk1ωk[skHkyk])1[skTykTHk]Hk1.H_{k+1}^{-1}=H_{k}^{-1}-H_{k}^{-1}\omega_{k}\big{[}s_{k}\quad H_{k}y_{k}\big{]}\bigg{(}\left[\begin{array}[]{cc}\gamma_{k}\big{(}\frac{1}{\omega_{k}}+y_{k}^{T}H_{k}y_{k}\big{)}&-1\\ -1&0\end{array}\right]^{-1}+\left[\begin{array}[]{c}s_{k}^{T}\\ y_{k}^{T}H_{k}\end{array}\right]H_{k}^{-1}\omega_{k}\big{[}s_{k}\quad H_{k}y_{k}\big{]}\bigg{)}^{-1}\left[\begin{array}[]{c}s_{k}^{T}\\ y_{k}^{T}H_{k}\end{array}\right]H_{k}^{-1}.

Inverting CC here gives

C1=[γk(1ωk+ykTHkyk)110]1=[011γk(1ωk+ykTHkyk)]C^{-1}=\left[\begin{array}[]{cc}\gamma_{k}\big{(}\frac{1}{\omega_{k}}+y_{k}^{T}H_{k}y_{k}\big{)}&-1\\ -1&0\end{array}\right]^{-1}=\left[\begin{array}[]{cc}0&-1\\ -1&-\gamma_{k}\big{(}\frac{1}{\omega_{k}}+y_{k}^{T}H_{k}y_{k}\big{)}\end{array}\right]

and we also have

VA1U=[skTykTHk]Hk1ωk[skHkyk]=ωk[skTykTHk][Hk1skyk]=[ωkskTHk1skωkskTykωkykTskωkykTHkyk]\begin{split}VA^{-1}U&=\left[\begin{array}[]{c}s_{k}^{T}\\ y_{k}^{T}H_{k}\end{array}\right]H_{k}^{-1}\omega_{k}\big{[}s_{k}\quad H_{k}y_{k}\big{]}\\ &=\omega_{k}\left[\begin{array}[]{c}s_{k}^{T}\\ y_{k}^{T}H_{k}\end{array}\right]\big{[}H_{k}^{-1}s_{k}\quad y_{k}\big{]}\\ &=\left[\begin{array}[]{cc}\omega_{k}s_{k}^{T}H_{k}^{-1}s_{k}&\omega_{k}s_{k}^{T}y_{k}\\ \omega_{k}y_{k}^{T}s_{k}&\omega_{k}y_{k}^{T}H_{k}y_{k}\end{array}\right]\end{split}

which is just a 2×22\times 2 matrix with real entries. Now, it becomes clear that

(C1+VA1U)=([γk(1ωk+ykTHkyk)110]1+[skTykTHk]Hk1ωk[skHkyk])=[ωkskTHk1sk1+ωkskTyk1+ωkykTskωkykTHkykγk(1ωk+ykTHkyk)].\begin{split}(C^{-1}+VA^{-1}U)&=\bigg{(}\left[\begin{array}[]{cc}\gamma_{k}\big{(}\frac{1}{\omega_{k}}+y_{k}^{T}H_{k}y_{k}\big{)}&-1\\ -1&0\end{array}\right]^{-1}+\left[\begin{array}[]{c}s_{k}^{T}\\ y_{k}^{T}H_{k}\end{array}\right]H_{k}^{-1}\omega_{k}\big{[}s_{k}\quad H_{k}y_{k}\big{]}\bigg{)}\\ &=\left[\begin{array}[]{cc}\omega_{k}s_{k}^{T}H_{k}^{-1}s_{k}&-1+\omega_{k}s_{k}^{T}y_{k}\\ -1+\omega_{k}y_{k}^{T}s_{k}&\omega_{k}y_{k}^{T}H_{k}y_{k}-\gamma_{k}\big{(}\frac{1}{\omega_{k}}+y_{k}^{T}H_{k}y_{k}\big{)}\end{array}\right].\end{split}

For notational compactness, let

D=(C1+VA1U)=[ωkskTHk1sk1+ωkskTyk1+ωkykTskωkykTHkykγk(1ωk+ykTHkyk)]D=(C^{-1}+VA^{-1}U)=\left[\begin{array}[]{cc}\omega_{k}s_{k}^{T}H_{k}^{-1}s_{k}&-1+\omega_{k}s_{k}^{T}y_{k}\\ -1+\omega_{k}y_{k}^{T}s_{k}&\omega_{k}y_{k}^{T}H_{k}y_{k}-\gamma_{k}\big{(}\frac{1}{\omega_{k}}+y_{k}^{T}H_{k}y_{k}\big{)}\end{array}\right]

so

D1=1det(D)[ωkykTHkykγk(1ωk+ykTHkyk)1ωkskTyk1ωkykTskωkskTHk1sk]D^{-1}=\frac{1}{\det(D)}\left[\begin{array}[]{cc}\omega_{k}y_{k}^{T}H_{k}y_{k}-\gamma_{k}\big{(}\frac{1}{\omega_{k}}+y_{k}^{T}H_{k}y_{k}\big{)}&1-\omega_{k}s_{k}^{T}y_{k}\\ 1-\omega_{k}y_{k}^{T}s_{k}&\omega_{k}s_{k}^{T}H_{k}^{-1}s_{k}\end{array}\right]

where the determinant of DD is

det(D)=(ωkykTHkykγk(1ωk+ykTHkyk))(ωkskTHk1sk)(1ωkykTsk)2=((ωkγk)ykTHkykγkωk)(ωkskTHk1sk)(1ωkykTsk)2\begin{split}\det(D)&=\bigg{(}\omega_{k}y_{k}^{T}H_{k}y_{k}-\gamma_{k}\bigg{(}\frac{1}{\omega_{k}}+y_{k}^{T}H_{k}y_{k}\bigg{)}\bigg{)}\bigg{(}\omega_{k}s_{k}^{T}H_{k}^{-1}s_{k}\bigg{)}-(1-\omega_{k}y_{k}^{T}s_{k})^{2}\\ &=\bigg{(}(\omega_{k}-\gamma_{k})y_{k}^{T}H_{k}y_{k}-\frac{\gamma_{k}}{\omega_{k}}\bigg{)}\bigg{(}\omega_{k}s_{k}^{T}H_{k}^{-1}s_{k}\bigg{)}-(1-\omega_{k}y_{k}^{T}s_{k})^{2}\end{split}

and we have used the fact that ykTsk=skTyky_{k}^{T}s_{k}=s_{k}^{T}y_{k}, as this is a scalar quantity. Next,

Udet(D)D1V=ωk[skHkyk][ωkykTHkykγk(1ωk+ykTHkyk)1ωkskTyk1ωkykTskωkskTHk1sk][skTykTHk]=ωk[skHkyk][ωkykTHkykskTγk(1ωk+ykTHkyk)skT+(1ωkskTyk)ykTHk(1ωkykTsk)skT+ωkskTHk1skykTHk]\begin{split}U\det(D)D^{-1}V&=\omega_{k}\big{[}s_{k}\quad H_{k}y_{k}\big{]}\left[\begin{array}[]{cc}\omega_{k}y_{k}^{T}H_{k}y_{k}-\gamma_{k}(\frac{1}{\omega_{k}}+y_{k}^{T}H_{k}y_{k})&1-\omega_{k}s_{k}^{T}y_{k}\\ 1-\omega_{k}y_{k}^{T}s_{k}&\omega_{k}s_{k}^{T}H_{k}^{-1}s_{k}\end{array}\right]\left[\begin{array}[]{c}s_{k}^{T}\\ y_{k}^{T}H_{k}\end{array}\right]\\ &=\omega_{k}\big{[}s_{k}\quad H_{k}y_{k}\big{]}\left[\begin{array}[]{cc}\omega_{k}y_{k}^{T}H_{k}y_{k}s_{k}^{T}-\gamma_{k}(\frac{1}{\omega_{k}}+y_{k}^{T}H_{k}y_{k})s_{k}^{T}+(1-\omega_{k}s_{k}^{T}y_{k})y_{k}^{T}H_{k}\\ (1-\omega_{k}y_{k}^{T}s_{k})s_{k}^{T}+\omega_{k}s_{k}^{T}H_{k}^{-1}s_{k}y_{k}^{T}H_{k}\end{array}\right]\end{split}

so Udet(D)D1VU\det(D)D^{-1}V fully expanded becomes

ωk[sk(ωkykTHkykskTγk(1ωk+ykTHkyk)skT+(1ωkskTyk)ykTHk)+Hkyk((1ωkykTsk)skT+ωkskTHk1skykTHk)].\omega_{k}\bigg{[}s_{k}\bigg{(}\omega_{k}y_{k}^{T}H_{k}y_{k}s_{k}^{T}-\gamma_{k}(\frac{1}{\omega_{k}}+y_{k}^{T}H_{k}y_{k})s_{k}^{T}+(1-\omega_{k}s_{k}^{T}y_{k})y_{k}^{T}H_{k}\bigg{)}+H_{k}y_{k}\bigg{(}(1-\omega_{k}y_{k}^{T}s_{k})s_{k}^{T}+\omega_{k}s_{k}^{T}H_{k}^{-1}s_{k}y_{k}^{T}H_{k}\bigg{)}\bigg{]}.

This looks rather ugly at the moment, but we continue by breaking the problem down further, noting that

sk(ωkykTHkykskTγk(1ωk+ykTHkyk)skT+(1ωkskTyk)ykTHk)=((ωkγk)ykTHkykγkωk)skskT+(1ωkskTyk)skykTHks_{k}\bigg{(}\omega_{k}y_{k}^{T}H_{k}y_{k}s_{k}^{T}-\gamma_{k}\bigg{(}\frac{1}{\omega_{k}}+y_{k}^{T}H_{k}y_{k}\bigg{)}s_{k}^{T}+(1-\omega_{k}s_{k}^{T}y_{k})y_{k}^{T}H_{k}\bigg{)}=\\ \bigg{(}(\omega_{k}-\gamma_{k})y_{k}^{T}H_{k}y_{k}-\frac{\gamma_{k}}{\omega_{k}}\bigg{)}s_{k}s_{k}^{T}+(1-\omega_{k}s_{k}^{T}y_{k})s_{k}y_{k}^{T}H_{k}

and

Hkyk((1ωkykTsk)skT+ωkskTHk1skykTHk)=(1ωkykTsk)HkykskT+ωkHkyk(skTHk1sk)ykTHk.H_{k}y_{k}\bigg{(}(1-\omega_{k}y_{k}^{T}s_{k})s_{k}^{T}+\omega_{k}s_{k}^{T}H_{k}^{-1}s_{k}y_{k}^{T}H_{k}\bigg{)}=\\ (1-\omega_{k}y_{k}^{T}s_{k})H_{k}y_{k}s_{k}^{T}+\omega_{k}H_{k}y_{k}(s_{k}^{T}H_{k}^{-1}s_{k})y_{k}^{T}H_{k}.

The above intermediate results further simplify Udet(D)D1VU\det(D)D^{-1}V to

ωk[((ωkγk)ykTHkykγkωk)skskT+(1ωkskTyk)(skykTHk+HkykskT)+ωkHkyk(skTHk1sk)ykTHk].\omega_{k}\bigg{[}\bigg{(}(\omega_{k}-\gamma_{k})y_{k}^{T}H_{k}y_{k}-\frac{\gamma_{k}}{\omega_{k}}\bigg{)}s_{k}s_{k}^{T}+(1-\omega_{k}s_{k}^{T}y_{k})(s_{k}y_{k}^{T}H_{k}+H_{k}y_{k}s_{k}^{T})+\omega_{k}H_{k}y_{k}(s_{k}^{T}H_{k}^{-1}s_{k})y_{k}^{T}H_{k}\bigg{]}.

Left and right multiplying the line immediately above by A1=Hk1A^{-1}=H_{k}^{-1} gives

ωk[((ωkγk)ykTHkykγkωk)Hk1skskTHk1+(1ωkskTyk)(Hk1skykT+ykskTHk1)+ωkyk(skTHk1sk)ykT]\omega_{k}\bigg{[}\bigg{(}(\omega_{k}-\gamma_{k})y_{k}^{T}H_{k}y_{k}-\frac{\gamma_{k}}{\omega_{k}}\bigg{)}H_{k}^{-1}s_{k}s_{k}^{T}H_{k}^{-1}+(1-\omega_{k}s_{k}^{T}y_{k})(H_{k}^{-1}s_{k}y_{k}^{T}+y_{k}s_{k}^{T}H_{k}^{-1})+\omega_{k}y_{k}(s_{k}^{T}H_{k}^{-1}s_{k})y_{k}^{T}\bigg{]}

and thus, after dividing out det(D)\det(D) and applying Bk=Hk1B_{k}=H_{k}^{-1}, we arrive at the following final formula

Bk+1=Bkωk[((ωkγk)ykTBk1ykγkωk)BkskskTBk+(1ωkskTyk)(BkskykT+ykskTBk)+ωk(skTBksk)ykykT]((ωkγk)ykTBk1ykγkωk)(ωkskTBksk)(1ωkykTsk)2B_{k+1}=B_{k}-\frac{\omega_{k}\bigg{[}\bigg{(}(\omega_{k}-\gamma_{k})y_{k}^{T}B_{k}^{-1}y_{k}-\frac{\gamma_{k}}{\omega_{k}}\bigg{)}B_{k}s_{k}s_{k}^{T}B_{k}+(1-\omega_{k}s_{k}^{T}y_{k})(B_{k}s_{k}y_{k}^{T}+y_{k}s_{k}^{T}B_{k})+\omega_{k}(s_{k}^{T}B_{k}s_{k})y_{k}y_{k}^{T}\bigg{]}}{\big{(}(\omega_{k}-\gamma_{k})y_{k}^{T}B_{k}^{-1}y_{k}-\frac{\gamma_{k}}{\omega_{k}}\big{)}\big{(}\omega_{k}s_{k}^{T}B_{k}s_{k}\big{)}-(1-\omega_{k}y_{k}^{T}s_{k})^{2}} (98)

for the SP-BFGS inverse update, which concludes the proof.
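Theorem 3.2 also admits a quick numerical cross-check: for a positive definite Hk and a curvature pair satisfying (29), the Bk+1 produced by (98) should coincide with the inverse of the Hk+1 produced by (80). The sketch below performs this check; it is illustrative only, with γk and ωk as read off from (80)–(82), the two scalar factors in (98) written out as Dhat and Ehat, and ykTBk−1yk evaluated as ykTHkyk.

```julia
using LinearAlgebra

H = [2.0 0.3; 0.3 1.0]                 # Hk (positive definite)
B = inv(H)                             # Bk = Hk⁻¹
s = [1.0, 0.5]
y = [0.8, 0.3]
β = 10.0

γ = 1 / (dot(y, s) + 1 / β)
ω = 1 / (dot(y, s) + 2 / β)

# Direct update (80) for Hk+1.
a = dot(y, s) + 2 / β
b = dot(y, s) + 1 / β
Hnew = H - (H * y * s' + s * y' * H') / a +
       ((a + dot(y, H * y)) / (a * b)) * s * s'

# Inverse update (98) for Bk+1.
Dhat = (ω - γ) * dot(y, H * y) - γ / ω          # ykᵀBk⁻¹yk = ykᵀHkyk
Ehat = 1 - ω * dot(s, y)
num  = Dhat * (B * s * s' * B) + Ehat * (B * s * y' + y * s' * B) +
       ω * dot(s, B * s) * (y * y')
Bnew = B - ω * num / (Dhat * ω * dot(s, B * s) - Ehat^2)

@assert norm(Bnew - inv(Hnew)) < 1e-8
```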

Appendix D Proof of Theorem 5.1

Referring to Theorem 3.2, taking the trace of both sides of (98) and applying the linearity and cyclic invariance properties of the trace yields

Tr(Bk+1)=κ1Tr(Bk)+κ2Bksk22+2κ3(ykTBksk)+κ4yk22\operatorname{Tr}(B_{k+1})=\kappa_{1}\operatorname{Tr}(B_{k})+\kappa_{2}\left\|B_{k}s_{k}\right\|_{2}^{2}+2\kappa_{3}(y_{k}^{T}B_{k}s_{k})+\kappa_{4}\left\|y_{k}\right\|_{2}^{2} (99)

where

κ1=1,κ2=ωkD^[D^(ωkskTBksk)(E^)2],\kappa_{1}=1,\quad\kappa_{2}=-\frac{\omega_{k}\hat{D}}{[\hat{D}(\omega_{k}s_{k}^{T}B_{k}s_{k})-(\hat{E})^{2}]},\\ (100)
κ3=ωkE^[D^(ωkskTBksk)(E^)2],κ4=(ωk)2skTBksk[D^(ωkskTBksk)(E^)2]\kappa_{3}=-\frac{\omega_{k}\hat{E}}{[\hat{D}(\omega_{k}s_{k}^{T}B_{k}s_{k})-(\hat{E})^{2}]},\quad\kappa_{4}=-\frac{(\omega_{k})^{2}s_{k}^{T}B_{k}s_{k}}{[\hat{D}(\omega_{k}s_{k}^{T}B_{k}s_{k})-(\hat{E})^{2}]} (101)

with D^\hat{D} and E^\hat{E} defined as

D^=[(ωkγk)(ykTBk1yk)γkωk],E^=(1ωkskTyk)=2ωkβk.\hat{D}=\bigg{[}(\omega_{k}-\gamma_{k})(y_{k}^{T}B_{k}^{-1}y_{k})-\frac{\gamma_{k}}{\omega_{k}}\bigg{]},\quad\hat{E}=(1-\omega_{k}s_{k}^{T}y_{k})=\frac{2\omega_{k}}{\beta_{k}}. (102)

We now observe that after applying some basic algebra, and recalling that BkB_{k} is positive definite, one can deduce that for all βk[0,+]\beta_{k}\in[0,+\infty], the following inequalities hold

(ωkγk)0,1γkωk,D^1,02ωkβk1.(\omega_{k}-\gamma_{k})\leq 0,\quad 1\leq\frac{\gamma_{k}}{\omega_{k}},\quad\hat{D}\leq-1,\quad 0\leq\frac{2\omega_{k}}{\beta_{k}}\leq 1. (103)

By minimizing the absolute value of the common denominator in κ2,κ3\kappa_{2},\kappa_{3}, and κ4\kappa_{4} using the inequalities above, we can obtain the bounds

1skTBkskκ20,0κ4ωkγk-\frac{1}{s_{k}^{T}B_{k}s_{k}}\leq\kappa_{2}\leq 0,\qquad 0\leq\kappa_{4}\leq\omega_{k}\leq\gamma_{k} (104)
0κ32ωkβk1skTBksk+2ωkβk2βkβk2.0\leq\kappa_{3}\leq\frac{2\omega_{k}}{\beta_{k}}\frac{1}{s_{k}^{T}B_{k}s_{k}+\frac{2\omega_{k}}{\beta_{k}}\frac{2}{\beta_{k}}}\leq\frac{\beta_{k}}{2}. (105)

As a result,

Tr(Bk+1)\displaystyle\operatorname{Tr}(B_{k+1}) Tr(Bk)+2κ3|ykTBksk|+κ4yk22\displaystyle\leq\operatorname{Tr}(B_{k})+2\kappa_{3}|y_{k}^{T}B_{k}s_{k}|+\kappa_{4}\left\|y_{k}\right\|_{2}^{2} (106)
Tr(Bk)+βkyk2λmax(Bk)sk2+γkyk22\displaystyle\leq\operatorname{Tr}(B_{k})+\beta_{k}\left\|y_{k}\right\|_{2}\lambda_{max}(B_{k})\left\|s_{k}\right\|_{2}+\gamma_{k}\left\|y_{k}\right\|_{2}^{2} (107)

and applying λmax(Bk)Tr(Bk)\lambda_{max}(B_{k})\leq\operatorname{Tr}(B_{k}) establishes (55). Similarly, referring to (80) reveals the upper bound

Tr(Hk+1)Tr(Hk)+2ωk|ykTHksk|+[γk+ωkγk(ykTHkyk)]sk22.\operatorname{Tr}(H_{k+1})\leq\operatorname{Tr}(H_{k})+2\omega_{k}|y_{k}^{T}H_{k}s_{k}|+\bigg{[}\gamma_{k}+\omega_{k}\gamma_{k}(y_{k}^{T}H_{k}y_{k})\bigg{]}\left\|s_{k}\right\|_{2}^{2}. (108)

To establish (54), we apply λmax(Hk)Tr(Hk)\lambda_{max}(H_{k})\leq\operatorname{Tr}(H_{k}) and ωkγk\omega_{k}\leq\gamma_{k} to the line above, and then factor. This completes the proof.

Appendix E Proof of Lemma 2

The angle condition ϕ(x)THg(x)>0\nabla\phi(x)^{T}Hg(x)>0 expands to

ϕ(x)THg(x)=ϕ(x)THϕ(x)+ϕ(x)THe(x)>0\nabla\phi(x)^{T}Hg(x)=\nabla\phi(x)^{T}H\nabla\phi(x)+\nabla\phi(x)^{T}He(x)>0 (109)

and by applying the Cauchy-Schwarz inequality and Assumption 1, we see that if

ψϕ(x)22>Ψϕ(x)2ϵ¯g\psi\left\|\nabla\phi(x)\right\|_{2}^{2}>\Psi\left\|\nabla\phi(x)\right\|_{2}\bar{\epsilon}_{g} (110)

then ϕ(x)THg(x)>0\nabla\phi(x)^{T}Hg(x)>0. Contrapositively, if ϕ(x)THg(x)0\nabla\phi(x)^{T}Hg(x)\leq 0 then

ϕ(x)2Ψϵ¯gψ.\left\|\nabla\phi(x)\right\|_{2}\leq\frac{\Psi\bar{\epsilon}_{g}}{\psi}. (111)

As ϕ\phi is m-strongly convex due to Assumption 3, we have

ϕϕ(x)+minv{ϕ(x)Tv+m2v22}=ϕ(x)12mϕ(x)22.\phi^{\star}\geq\phi(x)+\min_{v}\bigg{\{}\nabla\phi(x)^{T}v+\frac{m}{2}\left\|v\right\|_{2}^{2}\bigg{\}}=\phi(x)-\frac{1}{2m}\left\|\nabla\phi(x)\right\|_{2}^{2}. (112)

Squaring (111) and then combining it with (112) gives 𝒩1(ψ,Ψ)\mathcal{N}_{1}(\psi,\Psi), completing the proof.

Appendix F Proof of Theorem 5.2

As ϕC2\phi\in C^{2} by Assumption 3, applying Taylor’s theorem and using (58) and strong convexity gives

ϕk+1\displaystyle\phi_{k+1} =ϕk+ϕkT[xk+1xk]+12[xk+1xk]T2ϕ(u)[xk+1xk]\displaystyle=\phi_{k}+\nabla\phi_{k}^{T}[x_{k+1}-x_{k}]+\frac{1}{2}[x_{k+1}-x_{k}]^{T}\nabla^{2}\phi(u)[x_{k+1}-x_{k}]
ϕkαϕkTHkgk+α2M2Hkgk22\displaystyle\leq\phi_{k}-\alpha\nabla\phi_{k}^{T}H_{k}g_{k}+\frac{\alpha^{2}M}{2}\left\|H_{k}g_{k}\right\|_{2}^{2}

where uu is some convex combination of xk+1x_{k+1} and xkx_{k}. Proceeding, note that the smallest 𝒩1\mathcal{N}_{1} from Lemma 2 occurs when ψ=Ψ\psi=\Psi, and in this case ϕkTgk>0\nabla\phi_{k}^{T}g_{k}>0 if xk𝒩1x_{k}\notin\mathcal{N}_{1}. Hence, for all possible choices of 𝒩1\mathcal{N}_{1} it is true that ϕkTgk>0\nabla\phi_{k}^{T}g_{k}>0 if xk𝒩1x_{k}\notin\mathcal{N}_{1}. Combining this with (59) gives

ϕkTHkgkψϕkTgk>0\nabla\phi_{k}^{T}H_{k}g_{k}\geq\psi\nabla\phi_{k}^{T}g_{k}>0 (113)

if xk𝒩1x_{k}\notin\mathcal{N}_{1}. With (113) in hand, continuing to bound terms gives

ϕk+1\displaystyle\phi_{k+1} ϕkαψϕkT[ϕk+ek]+α2Ψ2M2ϕk+ek22\displaystyle\leq\phi_{k}-\alpha\psi\nabla\phi_{k}^{T}[\nabla\phi_{k}+e_{k}]+\frac{\alpha^{2}\Psi^{2}M}{2}\left\|\nabla\phi_{k}+e_{k}\right\|_{2}^{2}
=ϕkαΨ(ψΨαΨM2)ϕk22αΨ(ψΨαΨM)ϕkTek+α2Ψ2M2ek22\displaystyle=\phi_{k}-\alpha\Psi\bigg{(}\frac{\psi}{\Psi}-\frac{\alpha\Psi M}{2}\bigg{)}\left\|\nabla\phi_{k}\right\|_{2}^{2}-\alpha\Psi\bigg{(}\frac{\psi}{\Psi}-\alpha\Psi M\bigg{)}\nabla\phi_{k}^{T}e_{k}+\frac{\alpha^{2}\Psi^{2}M}{2}\left\|e_{k}\right\|_{2}^{2}
ϕkαΨ(ψΨαΨM2)ϕk22+αΨ(ψΨαΨM)ϕk2ek2+α2Ψ2M2ek22\displaystyle\leq\phi_{k}-\alpha\Psi\bigg{(}\frac{\psi}{\Psi}-\frac{\alpha\Psi M}{2}\bigg{)}\left\|\nabla\phi_{k}\right\|_{2}^{2}+\alpha\Psi\bigg{(}\frac{\psi}{\Psi}-\alpha\Psi M\bigg{)}\left\|\nabla\phi_{k}\right\|_{2}\left\|e_{k}\right\|_{2}+\frac{\alpha^{2}\Psi^{2}M}{2}\left\|e_{k}\right\|_{2}^{2}
ϕkαΨ(ψΨαΨM2)ϕk22+αΨ(ψΨαΨM)[12ϕk22+12ek22]\displaystyle\leq\phi_{k}-\alpha\Psi\bigg{(}\frac{\psi}{\Psi}-\frac{\alpha\Psi M}{2}\bigg{)}\left\|\nabla\phi_{k}\right\|_{2}^{2}+\alpha\Psi\bigg{(}\frac{\psi}{\Psi}-\alpha\Psi M\bigg{)}\bigg{[}\frac{1}{2}\left\|\nabla\phi_{k}\right\|_{2}^{2}+\frac{1}{2}\left\|e_{k}\right\|_{2}^{2}\bigg{]}
+α2Ψ2M2ek22\displaystyle\qquad+\frac{\alpha^{2}\Psi^{2}M}{2}\left\|e_{k}\right\|_{2}^{2}

where the last inequality follows from expanding

0(12ϕk212ek2)2=12ϕk22ϕk2ek2+12ek220\leq\bigg{(}\frac{1}{\sqrt{2}}\left\|\nabla\phi_{k}\right\|_{2}-\frac{1}{\sqrt{2}}\left\|e_{k}\right\|_{2}\bigg{)}^{2}=\frac{1}{2}\left\|\nabla\phi_{k}\right\|_{2}^{2}-\left\|\nabla\phi_{k}\right\|_{2}\left\|e_{k}\right\|_{2}+\frac{1}{2}\left\|e_{k}\right\|_{2}^{2} (114)

and using (60). Simplifying the last inequality reveals that

ϕk+1ϕkαψ2ϕk22+αψ2ek22.\phi_{k+1}\leq\phi_{k}-\frac{\alpha\psi}{2}\left\|\nabla\phi_{k}\right\|_{2}^{2}+\frac{\alpha\psi}{2}\left\|e_{k}\right\|_{2}^{2}. (115)

Since ϕ\phi is m-strongly convex by Assumption 3, we can apply

ϕk222m(ϕkϕ)\left\|\nabla\phi_{k}\right\|_{2}^{2}\geq 2m(\phi_{k}-\phi^{\star}) (116)

as shown in the proof of Lemma 2 (see Appendix E), which combined with (115) and Assumption 1 gives

ϕk+1ϕkαψm(ϕkϕ)+αψ2(Ψϵ¯gψ)2.\phi_{k+1}\leq\phi_{k}-\alpha\psi m(\phi_{k}-\phi^{\star})+\frac{\alpha\psi}{2}\bigg{(}\frac{\Psi\bar{\epsilon}_{g}}{\psi}\bigg{)}^{2}. (117)

Subtracting ϕ\phi^{\star} from both sides, we get

ϕk+1ϕ(1αψm)(ϕkϕ)+αψ2(Ψϵ¯gψ)2\phi_{k+1}-\phi^{\star}\leq(1-\alpha\psi m)(\phi_{k}-\phi^{\star})+\frac{\alpha\psi}{2}\bigg{(}\frac{\Psi\bar{\epsilon}_{g}}{\psi}\bigg{)}^{2} (118)

which, by subtracting 12m(Ψϵ¯gψ)2\frac{1}{2m}(\frac{\Psi\bar{\epsilon}_{g}}{\psi})^{2} from both sides and simplifying, gives

ϕk+1ϕ12m(Ψϵ¯gψ)2\displaystyle\phi_{k+1}-\phi^{\star}-\frac{1}{2m}\bigg{(}\frac{\Psi\bar{\epsilon}_{g}}{\psi}\bigg{)}^{2} (1αψm)(ϕkϕ)+αψ2(Ψϵ¯gψ)212m(Ψϵ¯gψ)2\displaystyle\leq(1-\alpha\psi m)(\phi_{k}-\phi^{\star})+\frac{\alpha\psi}{2}\bigg{(}\frac{\Psi\bar{\epsilon}_{g}}{\psi}\bigg{)}^{2}-\frac{1}{2m}\bigg{(}\frac{\Psi\bar{\epsilon}_{g}}{\psi}\bigg{)}^{2}
=(1αψm)(ϕkϕ)+(αψm1)12m(Ψϵ¯gψ)2\displaystyle=(1-\alpha\psi m)(\phi_{k}-\phi^{\star})+(\alpha\psi m-1)\frac{1}{2m}\bigg{(}\frac{\Psi\bar{\epsilon}_{g}}{\psi}\bigg{)}^{2}
=(1αψm)(ϕk[ϕ+12m(Ψϵ¯gψ)2])\displaystyle=(1-\alpha\psi m)\bigg{(}\phi_{k}-\bigg{[}\phi^{\star}+\frac{1}{2m}\bigg{(}\frac{\Psi\bar{\epsilon}_{g}}{\psi}\bigg{)}^{2}\bigg{]}\bigg{)}

thus establishing the Q-linear result (61). We obtain (62) by recursively applying the worst case bound in (61), noting that in the worst case if x0𝒩1x_{0}\notin\mathcal{N}_{1}, then the sequence of iterates {xk}\{x_{k}\} remains outside of 𝒩1\mathcal{N}_{1}, only approaching 𝒩1\mathcal{N}_{1} in the limit kk\rightarrow\infty.

Appendix G Extended Numerical Experiments

Tables 4 and 5 compare the performance of SP-BFGS vs. BFGS on the 3232 CUTEst test problems with only gradient noise present (i.e. ϵ¯f=0\bar{\epsilon}_{f}=0). Gradient noise was generated using ϵ¯g=104ϕ(x0)2\bar{\epsilon}_{g}=10^{-4}\left\|\nabla\phi(x^{0})\right\|_{2}, where the starting point x0x^{0} varies by CUTEst problem, to ensure that noise does not initially dominate gradient evaluations. Overall, SP-BFGS outperforms BFGS on approximately 80%80\% of the CUTEst problems with only gradient noise present, and performs at least as well as BFGS on approximately 95%95\% of these problems.

SP-BFGS With Gradient Noise Only
Problem Dim. Mean(Δopt\Delta_{opt}) Median(Δopt\Delta_{opt}) Min(Δopt\Delta_{opt}) Max(Δopt\Delta_{opt}) s2(Δopt)s^{2}(\Delta_{opt})
ARGTRGLS 200 -9.6E-02 -9.6E-02 -1.0E-01 -8.5E-02 1.9E-05
ARWHEAD 500 -2.8E+00 -2.8E+00 -2.8E+00 -2.7E+00 1.7E-03
BEALE 2 -1.4E+01 -1.4E+01 -1.6E+01 -7.0E+00 4.1E+00
BOX3 3 -6.7E+00 -6.5E+00 -1.1E+01 -6.3E+00 6.3E-01
BOXPOWER 100 -2.7E+00 -2.7E+00 -3.1E+00 -2.3E+00 4.6E-02
BROWNBS 2 -4.5E+00 -5.9E+00 -8.0E+00 1.1E+00 8.4E+00
BROYDNBDLS 50 -5.4E+00 -5.4E+00 -5.9E+00 -5.0E+00 3.4E-02
CHAINWOO 100 1.6E+00 1.7E+00 7.6E-02 2.1E+00 1.5E-01
CHNROSNB 50 -3.2E+00 -3.0E+00 -4.9E+00 -2.6E+00 4.5E-01
COATING 134 3.4E-01 3.4E-01 1.8E-01 4.2E-01 3.1E-03
COOLHANSLS 9 -9.4E-01 -9.4E-01 -1.2E+00 -4.8E-01 4.2E-02
CUBE 2 -2.7E+00 -2.5E+00 -5.8E+00 -1.7E+00 7.5E-01
CYCLOOCFLS 20 -7.4E+00 -7.2E+00 -9.3E+00 -5.9E+00 8.1E-01
EXTROSNB 10 -5.1E+00 -5.2E+00 -5.3E+00 -4.7E+00 3.0E-02
FMINSRF2 64 -8.6E+00 -8.7E+00 -8.8E+00 -8.1E+00 3.4E-02
GENHUMPS 5 -2.7E+00 -2.6E+00 -5.2E+00 -1.0E+00 1.1E+00
GENROSE 5 -1.2E+01 -1.2E+01 -1.4E+01 -8.9E+00 2.0E+00
HEART6LS 6 1.0E+00 1.2E+00 -1.8E+00 1.2E+00 5.0E-01
HELIX 3 -5.7E+00 -5.9E+00 -8.7E+00 -3.4E+00 1.4E+00
MANCINO 30 -1.0E+00 -1.0E+00 -1.4E+00 -7.0E-01 3.7E-02
METHANB8LS 31 -3.6E+00 -3.6E+00 -4.0E+00 -3.3E+00 3.1E-02
MODBEALE 200 1.2E+00 1.2E+00 3.8E-01 1.9E+00 1.8E-01
NONDIA 10 -3.5E-03 -3.6E-03 -4.3E-03 -1.1E-03 6.6E-07
POWELLSG 4 -5.7E+00 -5.3E+00 -9.3E+00 -4.0E+00 1.6E+00
POWER 10 -3.5E+00 -3.5E+00 -4.4E+00 -2.8E+00 1.3E-01
ROSENBR 2 -1.1E+01 -1.2E+01 -1.4E+01 -5.1E+00 4.4E+00
ROSENBRTU 2 -1.9E+01 -1.9E+01 -2.2E+01 -1.7E+01 1.1E+00
SBRYBND 500 3.9E+00 3.9E+00 3.9E+00 3.9E+00 2.0E-05
SINEVAL 2 -1.3E+01 -1.3E+01 -1.8E+01 -1.1E+01 3.3E+00
SNAIL 2 -1.5E+01 -1.6E+01 -1.8E+01 -1.2E+01 1.6E+00
SROSENBR 1000 -9.7E-01 -9.7E-01 -1.3E+00 -4.8E-01 3.2E-02
VIBRBEAM 8 1.6E+00 1.6E+00 1.2E+00 2.8E+00 9.1E-02
Table 4: Performance of SP-BFGS on 3232 selected CUTEst test problems with noise added to gradient evaluations only (i.e. ϵ¯f=0\bar{\epsilon}_{f}=0). The number of objective function evaluations is fixed at 20002000. Δoptlog10(ϕbestϕ)\Delta_{opt}\coloneqq\log_{10}(\phi_{best}-\phi^{\star}) measures the optimality gap, where ϕbest\phi_{best} denotes the smallest value of the true function ϕ\phi measured at any point during an algorithm run. Statistics are calculated from a sample of 3030 runs per algorithm, and the Dim. column gives the problem dimension. The SPBFGS penalty parameter was set as βk=108ϵ¯gsk2+1010\beta_{k}=\frac{10^{8}}{\bar{\epsilon}_{g}}\left\|s_{k}\right\|_{2}+10^{-10}. For each problem, gradient noise was generated using ϵ¯g=104ϕ(x0)2\bar{\epsilon}_{g}=10^{-4}\left\|\nabla\phi(x^{0})\right\|_{2}, where the starting point x0x^{0} varies by CUTEst problem.
BFGS With Gradient Noise Only
Problem Dim. Mean(Δopt\Delta_{opt}) Median(Δopt\Delta_{opt}) Min(Δopt\Delta_{opt}) Max(Δopt\Delta_{opt}) s2(Δopt)s^{2}(\Delta_{opt})
ARGTRGLS 200 -9.2E-02 -9.3E-02 -9.9E-02 -8.2E-02 1.7E-05
ARWHEAD 500 -2.5E+00 -2.5E+00 -2.6E+00 -2.5E+00 7.6E-04
BEALE 2 -8.3E+00 -8.5E+00 -1.2E+01 -5.8E+00 3.9E+00
BOX3 3 -6.4E+00 -6.4E+00 -6.6E+00 -6.3E+00 2.3E-03
BOXPOWER 100 -2.8E+00 -2.8E+00 -3.3E+00 -2.4E+00 6.1E-02
BROWNBS 2 6.8E-02 1.0E+00 -8.2E+00 3.6E+00 1.0E+01
BROYDNBDLS 50 -5.1E+00 -5.1E+00 -5.3E+00 -4.9E+00 1.5E-02
CHAINWOO 100 1.7E+00 1.8E+00 1.1E+00 2.2E+00 5.8E-02
CHNROSNB 50 -2.9E+00 -2.7E+00 -4.5E+00 -2.1E+00 3.8E-01
COATING 134 3.6E-01 3.7E-01 2.1E-01 4.2E-01 2.6E-03
COOLHANSLS 9 -5.6E-01 -6.4E-01 -1.3E+00 1.9E-01 1.9E-01
CUBE 2 -1.1E+00 -1.1E+00 -1.8E+00 -9.6E-01 5.6E-02
CYCLOOCFLS 20 -6.5E+00 -6.5E+00 -8.3E+00 -5.1E+00 5.4E-01
EXTROSNB 10 -5.1E+00 -5.1E+00 -5.3E+00 -4.9E+00 8.1E-03
FMINSRF2 64 -8.2E+00 -8.2E+00 -8.7E+00 -7.3E+00 1.4E-01
GENHUMPS 5 -1.5E+00 -1.2E+00 -4.0E+00 -2.8E-01 8.4E-01
GENROSE 5 -6.8E+00 -6.7E+00 -8.7E+00 -5.9E+00 4.8E-01
HEART6LS 6 1.2E+00 1.2E+00 1.2E+00 1.2E+00 1.9E-04
HELIX 3 -4.8E+00 -4.6E+00 -8.1E+00 -2.6E+00 2.1E+00
MANCINO 30 -8.3E-01 -8.8E-01 -1.2E+00 -3.3E-01 5.3E-02
METHANB8LS 31 -3.5E+00 -3.4E+00 -3.9E+00 -3.3E+00 2.8E-02
MODBEALE 200 1.0E+00 1.1E+00 -6.2E-01 2.1E+00 3.6E-01
NONDIA 10 1.2E-03 1.3E-03 -4.4E-03 1.3E-02 2.2E-05
POWELLSG 4 -5.3E+00 -5.0E+00 -8.0E+00 -3.6E+00 1.6E+00
POWER 10 -3.4E+00 -3.4E+00 -4.3E+00 -2.8E+00 1.4E-01
ROSENBR 2 -6.1E+00 -5.9E+00 -1.0E+01 -3.7E+00 2.9E+00
ROSENBRTU 2 -1.5E+01 -1.5E+01 -1.8E+01 -1.4E+01 1.6E+00
SBRYBND 500 3.9E+00 3.9E+00 3.9E+00 3.9E+00 2.0E-05
SINEVAL 2 -1.2E+01 -1.3E+01 -1.7E+01 -8.5E+00 4.0E+00
SNAIL 2 -1.1E+01 -1.1E+01 -1.6E+01 -8.2E+00 3.5E+00
SROSENBR 1000 -9.1E-01 -8.8E-01 -1.3E+00 -5.1E-01 3.1E-02
VIBRBEAM 8 1.7E+00 1.6E+00 1.4E+00 2.6E+00 1.0E-01
Table 5: Performance of BFGS on 3232 selected CUTEst test problems with noise added to gradient evaluations only (i.e. ϵ¯f=0\bar{\epsilon}_{f}=0). The number of objective function evaluations is fixed at 20002000. Δoptlog10(ϕbestϕ)\Delta_{opt}\coloneqq\log_{10}(\phi_{best}-\phi^{\star}) measures the optimality gap, where ϕbest\phi_{best} denotes the smallest value of the true function ϕ\phi measured at any point during an algorithm run. Statistics are calculated from a sample of 3030 runs per algorithm, and the Dim. column gives the problem dimension. For each problem, gradient noise was generated using ϵ¯g=104ϕ(x0)2\bar{\epsilon}_{g}=10^{-4}\left\|\nabla\phi(x^{0})\right\|_{2}, where the starting point x0x^{0} varies by CUTEst problem.