
Primal-Dual Damping algorithms for optimization

Xinzhe Zuo⋆, Stanley Osher⋆, Wuchen Li†
⋆ Department of Mathematics, University of California, Los Angeles
{zxz, sjo}@math.ucla.edu
† Department of Mathematics, University of South Carolina
wuchen@mailbox.sc.edu
Abstract

We propose an unconstrained optimization method based on the well-known primal-dual hybrid gradient (PDHG) algorithm. We first formulate the optimality condition of the unconstrained optimization problem as a saddle point problem. We then compute the minimizer by applying generalized primal-dual hybrid gradient algorithms. Theoretically, we demonstrate that the continuous-time limit of the proposed algorithm forms a class of second-order differential equations, which contains and extends the heavy ball ODEs and Hessian-driven damping dynamics. Following the Lyapunov analysis of the ODE system, we prove the linear convergence of the algorithm for strongly convex functions. Experimentally, we showcase the advantage of our algorithms on several convex and non-convex optimization problems by comparing their performance with other well-known algorithms, such as Nesterov’s accelerated gradient methods. In particular, we demonstrate that our algorithm is efficient in training two-layer and convolutional neural networks in supervised learning problems.

Keywords— Optimization; Primal-dual hybrid gradient algorithms; Primal-dual damping dynamics.

1 Introduction

Optimization is one of the essential building blocks in many applications, including scientific computing and machine learning problems. One of the classical algorithms for unconstrained optimization problems is the gradient descent method, which updates the state variable in the negative gradient direction at each step [Boyd and Vandenberghe, 2004]. Nowadays, accelerated gradient descent methods have been widely studied. Typical examples include Nesterov’s accelerated gradient method [Nesterov, 1983], Polyak’s heavy ball method [Polyak, 1964], and Hessian-driven damping methods [Chen and Luo, 2019, Attouch et al., 2019, 2020, 2021].

On the other hand, several first-order methods have been introduced to solve linearly constrained optimization problems. Typical examples include the primal-dual hybrid gradient (PDHG) method [Chambolle and Pock, 2011] and the alternating direction method of multipliers (ADMM) [Boyd et al., 2011, Gabay and Mercier, 1976]. They are designed to solve inf-sup saddle point problems, applying a gradient descent direction to the minimization variable and a gradient ascent direction to the maximization variable. Both PDHG and ADMM are designed for solving optimization problems with affine constraints. Ouyang et al. [2015] proposed accelerated linearized ADMM, which incorporates a multi-step acceleration scheme into linearized ADMM. Recently, the PDHG method has been extended to nonlinearly constrained minimization problems [Valkonen, 2014].

In this paper, we study a general class of accelerated first-order methods for unconstrained optimization problems. We reformulate the original optimization problem into an inf-sup type saddle point problem, whose saddle point solves the optimality condition. We then apply a linearized preconditioned primal-dual hybrid gradient algorithm to compute the proposed saddle point problem.

The main description of the algorithm is as follows. Consider the following inf-sup problem for a $\mathcal{C}^{2}$ strongly convex function $f$ over $\mathbb{R}^{d}$:

\inf_{{\bm{x}}\in\mathbb{R}^{d}}\sup_{{\bm{p}}\in\mathbb{R}^{d}}\quad\langle\nabla f({\bm{x}}),{\bm{p}}\rangle-\frac{\varepsilon}{2}\left\lVert{\bm{p}}\right\rVert^{2}\,, (1.1)

where ${\bm{p}}$ is a constructed “dual variable”, $\varepsilon>0$ is a constant, $\langle\cdot,\cdot\rangle$ is the Euclidean inner product, and $\|\cdot\|$ is the Euclidean norm. We later prove that the solution to the saddle point problem (1.1) gives the global minimum of $f$. We propose a linearized preconditioned PDHG algorithm for solving the above inf-sup problem:

{\bm{p}}^{n+1}={\bm{p}}^{n}+\sigma{\bm{A}}({\bm{x}}^{n})\nabla f({\bm{x}}^{n})-\sigma\varepsilon{\bm{A}}({\bm{x}}^{n}){\bm{p}}^{n+1}\,, (1.2a)
\tilde{{\bm{p}}}^{n+1}={\bm{p}}^{n+1}+\omega({\bm{p}}^{n+1}-{\bm{p}}^{n})\,, (1.2b)
{\bm{x}}^{n+1}={\bm{x}}^{n}-\tau{\bm{C}}({\bm{x}}^{n})\tilde{{\bm{p}}}^{n+1}\,, (1.2c)

where $n=1,2,\cdots$ is the iteration step, $\tau,\sigma>0$ are stepsizes for the updates of ${\bm{x}}$ and ${\bm{p}}$, respectively, and $\omega>0$ is a parameter. In the above algorithm, ${\bm{C}}({\bm{x}}^{n})={\bm{B}}({\bm{x}}^{n})\nabla^{2}f({\bm{x}}^{n})$, where ${\bm{A}}({\bm{x}}^{n})\in\mathbb{R}^{d\times d}$ and ${\bm{B}}({\bm{x}}^{n})\in\mathbb{R}^{d\times d}$ act as preconditioners on the updates of ${\bm{p}}^{n+1}$ and ${\bm{x}}^{n+1}$, respectively. This paper only focuses on the simple case where ${\bm{A}}({\bm{x}})=A{\mathbb{I}}$ for some constant $A>0$. Although a second-order term $\nabla^{2}f({\bm{x}}^{n})$ appears in the update of ${\bm{x}}^{n+1}$ (hidden in ${\bm{C}}({\bm{x}}^{n})$), our algorithm is still a first-order algorithm when we choose ${\bm{B}}({\bm{x}}^{n})\nabla^{2}f({\bm{x}}^{n})={\bm{C}}({\bm{x}}^{n})$ for some ${\bm{C}}$ that is easy to compute. For example, we find that ${\bm{C}}={\mathbb{I}}$ is a very good choice in most of our numerical examples. See the numerical sections for empirical choices of parameters.

Our method forms a class of ordinary differential equation systems in terms of $({\bm{x}},{\bm{p}})$ in the continuous limit $\tau,\sigma\rightarrow 0$. We call it the primal-dual damping (PDD) dynamics. We show that the PDD dynamics form a class of second-order ODEs, which contains and extends the inertial Hessian-driven damping dynamics [Chen and Luo, 2019, Attouch et al., 2019]. Theoretically, we analyze the convergence property of the PDD dynamics. If $f$ is a quadratic function of ${\bm{x}}$, with constant ${\bm{A}}$, ${\bm{B}}$, the PDD dynamics satisfy a linear ODE system. Under suitable choices of parameters, we obtain a convergence acceleration similar to that of the heavy ball ODE [Siegel, 2019]. Moreover, for a general nonlinear function $f$, we have the following informal theorem characterizing the convergence speed of our algorithm:

Theorem 1.1 (Informal).

Let $f:\mathbb{R}^{d}\to\mathbb{R}$ be a $\mathcal{C}^{4}$ strongly convex function. Let ${\bm{x}}^{*}$ be the global minimum of $f$ and ${\bm{p}}^{*}=0$. Then, the iterates $({\bm{x}}^{n},{\bm{p}}^{n})$ produced by Eq. 1.2 converge to the saddle point $({\bm{x}}^{*},{\bm{p}}^{*})$ if $\tau$, $\sigma$ are small enough. Moreover,

\|{\bm{p}}^{n}\|^{2}+\|\nabla f({\bm{x}}^{n})\|^{2}\leq(\|{\bm{p}}^{0}\|^{2}+\|\nabla f({\bm{x}}^{0})\|^{2})(1-\frac{\mu^{2}}{M+\delta})^{n},

where $\mu=\min_{{\bm{x}}}\lambda_{\min}(\nabla^{2}f({\bm{x}}){\bm{C}}({\bm{x}}))$, ${\bm{C}}({\bm{x}})={\bm{B}}({\bm{x}})\nabla^{2}f({\bm{x}})$, $\delta>0$ depends on the initial condition, and $M>0$ depends on ${\bm{C}}({\bm{x}})^{T}\big{(}\nabla^{3}f({\bm{x}})\nabla f({\bm{x}})+(\nabla^{2}f({\bm{x}}))^{2}\big{)}{\bm{C}}({\bm{x}})$, $\tau$, $\sigma$, $A$, $\varepsilon$, and $\omega$. The detailed version is given in Theorem 3.10.

Numerically, we test the algorithms on both convex and non-convex optimization problems. In convex optimization, we demonstrate the fast convergence of the proposed algorithm with selected preconditioners, compared with the gradient descent method, Nesterov's accelerated gradient method, and the heavy ball damping method. This justifies the convergence analysis. We also test our algorithm on several well-known non-convex optimization problems. Some examples, such as the Rosenbrock and Ackley functions, demonstrate the potential advantage of our algorithms in converging to the global minimizer. In particular, we compare our algorithms with the stochastic gradient descent method and Adam for training two-layer and convolutional neural network functions in supervised learning problems. This showcases the potential advantage of the proposed methods in terms of convergence speed and test accuracy.

PDHG has been widely used for linearly constrained optimization problems [Chambolle and Pock, 2011]. Recently, Valkonen [2014] applied PDHG to nonlinearly constrained optimization problems and proved asymptotic convergence for saddle point problems with nonlinear coupling. This setting is different from our PDHG algorithm for computing unconstrained optimizations, and we show linear convergence for a particular nonlinearly coupled saddle point problem. Meanwhile, Nesterov accelerated gradient methods and Hessian damping algorithms can also be formulated both as discrete-time updates and as continuous-time second-order ODEs. Wibisono et al. [2016] also introduced the idea of the Bregman Lagrangian to study a family of accelerated methods in the continuous-time limit, which forms a nonlinear second-order ODE. Compared to them, the PDD algorithm induces a generalized second-order ODE system, which contains both the heavy ball ODE [Siegel, 2019] and Hessian damping dynamics [Chen and Luo, 2019, Attouch et al., 2019, 2020, 2021]. For example, when $C={\mathbb{I}}$, algorithm Eq. 1.2 can be viewed as another time discretization of the Hessian damping dynamics [Chen and Luo, 2019, Attouch et al., 2019], providing a different update in discrete time. We only evaluate the gradient of $f$ once per iteration, whereas Attouch's algorithm [Attouch et al., 2020] evaluates the gradient of $f$ twice. In numerical experiments, we demonstrate that the proposed algorithm outperforms Nesterov accelerated methods and Hessian-driven damping methods on some non-convex optimization problems, including supervised learning problems for training neural network functions.

Our work is also related to preconditioning, an important technique in numerical linear algebra [Trefethen and Bau, 2022] and numerical PDEs [Rees, 2010, Park et al., 2021]. In general, preconditioning aims to reduce the condition number of some operators to improve convergence speed. One famous example is preconditioning gradient descent by the inverse of the Hessian matrix, which gives rise to Newton’s method. In recent years, preconditioning techniques have also been developed for training neural networks [Osher et al., 2022, Kingma and Ba, 2014]. Adam [Kingma and Ba, 2014] is arguably one of the most popular optimizers for training deep neural networks. It can also be viewed as a preconditioned algorithm using a diagonal preconditioner that approximates the diagonal of the Fisher information matrix [Pascanu and Bengio, 2013]. Shortly after Chambolle and Pock [2011] developed PDHG for constrained optimization, the same authors also studied the preconditioned PDHG method [Pock and Chambolle, 2011], in which they developed a simple diagonal preconditioner that can guarantee convergence without the need to compute step sizes. Liu et al. [2021] proposed non-diagonal preconditioners for PDHG and showed close connections between preconditioned PDHG and ADMM. Park et al. [2021] studied the preconditioned Nesterov’s accelerated gradient method and proved convergence in the induced norm. Jacobs et al. [2019] introduced a preconditioned norm in the primal update of the PDHG method and improved the step size restriction of the PDHG method.

Our paper is organized as follows. In Section 2 we provide some background and derivations of our algorithm. We also provide the ODE formulations for our primal-dual damping dynamics. In Section 3, we prove our main convergence results for the algorithm. In Section 4 we showcase the advantage of our algorithm through several convex and non-convex examples. In particular, we show that our algorithm can train neural networks and is competitive with commonly used optimizers, such as SGD with Nesterov’s momentum and Adam. We conclude in Section 5 with more discussions and future directions.

2 Primal-dual damping algorithms for optimizations

In this section, we first review PDHG algorithms for constrained optimization problems. We then construct a saddle point problem for the unconstrained optimization problem and apply the preconditioned PDHG algorithm to compute the proposed saddle point problem. Lastly, we derive an ODE system by taking the limit of the stepsizes in the PDHG algorithm. It forms a second-order ODE, which generalizes the Hessian-driven damping dynamics. We analyze the convergence properties of the ODE system for quadratic optimization problems.

2.1 Review PDHG for constrained optimization

In Chambolle and Pock [2011], the following saddle point problem was considered:

\min_{x\in X}\max_{y\in Y}\langle Kx,y\rangle+G(x)-F^{*}(y)\,, (2.1)

where $X$ and $Y$ are two finite-dimensional real vector spaces equipped with inner product $\langle\cdot,\cdot\rangle$ and norm $\|\cdot\|=\langle\cdot,\cdot\rangle^{1/2}$. The map $K:X\to Y$ is a continuous linear operator. $G:X\to[0,+\infty]$ and $F^{*}:Y\to[0,+\infty]$ are proper, convex, lower semi-continuous (l.s.c.) functions. $F^{*}$ is the convex conjugate of a convex l.s.c. function $F$. It is straightforward to verify that (2.1) is the primal-dual formulation of the nonlinear primal problem

\min_{x\in X}F(Kx)+G(x)\,.

Then the PDHG algorithm for saddle point problem 2.1 is given by

y^{n+1}=(I+\sigma\partial F^{*})^{-1}(y^{n}+\sigma K\tilde{x}^{n})\,, (2.2a)
x^{n+1}=(I+\tau\partial G)^{-1}(x^{n}-\tau K^{*}y^{n+1})\,, (2.2b)
\tilde{x}^{n+1}=x^{n+1}+\omega(x^{n+1}-x^{n})\,, (2.2c)

where $I$ is the identity operator and $(I+\sigma\partial F^{*})^{-1}$ is the resolvent operator, which is defined in the same way as the proximal operator:

(I+\tau\partial F)^{-1}(y)=\operatorname*{arg\,min}_{x}\frac{\|x-y\|^{2}}{2\tau}+F(x)=\textrm{prox}_{\tau F}(y)\,.

When $\omega=1$, Chambolle and Pock [2011] proved convergence if $\tau\sigma\|K\|^{2}<1$, where $\|\cdot\|$ is the induced operator norm. It is worth noting that the convergence analysis requires that $K$ is a linear operator.
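
For concreteness, the resolvent map can be evaluated in closed form for simple choices of $F$. The following Python snippet is our own illustration (not part of the PDHG scheme above): it evaluates the proximal operator of $\tau\|\cdot\|_{1}$ (soft-thresholding) and of a quadratic $\tau\frac{1}{2}x^{T}Qx$; the function names prox_l1 and prox_quadratic are ours.

import numpy as np

def prox_l1(y, tau):
    # prox_{tau*||.||_1}(y): soft-thresholding, the unique minimizer of
    # ||x - y||^2 / (2*tau) + ||x||_1.
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)

def prox_quadratic(y, tau, Q):
    # prox of tau * (1/2) x^T Q x for symmetric positive semidefinite Q:
    # solve (I + tau*Q) x = y.
    return np.linalg.solve(np.eye(len(y)) + tau * Q, y)

y = np.array([2.0, -0.5, 0.1])
print(prox_l1(y, tau=0.3))
print(prox_quadratic(y, tau=0.3, Q=np.diag([1.0, 2.0, 3.0])))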

2.2 Saddle point problem for unconstrained optimization

We consider the problem of minimizing a $\mathcal{C}^{2}$ strongly convex function $f:\mathbb{R}^{d}\to\mathbb{R}$ over $\mathbb{R}^{d}$. Instead of directly solving for $\nabla f({\bm{x}}^{*})=0$, we consider the following saddle point problem:

\inf_{{\bm{x}}\in\mathbb{R}^{d}}\sup_{{\bm{p}}\in\mathbb{R}^{d}}\quad\langle\nabla f({\bm{x}}),{\bm{p}}\rangle\,, (2.3)

due to the following proposition.

Proposition 2.1.

Let $f:\mathbb{R}^{d}\to\mathbb{R}$ be a $\mathcal{C}^{2}$ strongly convex function. Then the saddle point of (2.3) is the unique global minimum of $f$.

Proof.

Directly differentiating 2.3 and setting the derivatives to 0 yields

\nabla f({\bm{x}}^{*})=0\,,
\nabla^{2}f({\bm{x}}^{*}){\bm{p}}^{*}=0\,.

By the strong convexity of $f$, we obtain that ${\bm{x}}^{*}$ is the unique global minimum and ${\bm{p}}^{*}=0$. ∎

Recall that ${\bm{p}}^{*}=0$ by the optimality condition. Thus we make the following change to our saddle point formulation by adding a regularization term to (2.3):

\inf_{{\bm{x}}\in\mathbb{R}^{d}}\sup_{{\bm{p}}\in\mathbb{R}^{d}}\quad\langle\nabla f({\bm{x}}),{\bm{p}}\rangle-\frac{\varepsilon}{2}\left\lVert{\bm{p}}\right\rVert^{2}\,, (2.4)

where $\varepsilon>0$ is a constant. This regularization term further drives ${\bm{p}}$ to $0$. Similar to Proposition 2.1, we have the following proposition.

Proposition 2.2.

Let $f:\mathbb{R}^{d}\to\mathbb{R}$ be a $\mathcal{C}^{2}$ strongly convex function. Then the saddle point of (2.4) is the unique global minimum of $f$.

Proof.

Directly differentiating 2.4 and setting derivatives to 0 yields

\nabla f({\bm{x}}^{*})=\varepsilon{\bm{p}}^{*}\,,
\nabla^{2}f({\bm{x}}^{*}){\bm{p}}^{*}=0\,.

Since $f$ is strongly convex, we have $\nabla^{2}f({\bm{x}}^{*})\succ 0$, and the second equation implies ${\bm{p}}^{*}=0$. Then the first equation implies $\nabla f({\bm{x}}^{*})=0$. Since $f$ is strongly convex, we conclude that ${\bm{x}}^{*}$ is the unique global minimum. ∎
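
To illustrate Proposition 2.2 on a concrete instance, the following snippet (our own example with a two-dimensional quadratic) checks that the first-order conditions of the saddle point problem (2.4) are satisfied at $({\bm{x}}^{*},0)$, where ${\bm{x}}^{*}$ is the global minimizer of $f$.

import numpy as np

# f(x) = 0.5 x^T Q x + b^T x, a strongly convex quadratic (our own example).
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
eps = 0.7

x_star = np.linalg.solve(Q, -b)      # unique global minimizer of f
p_star = np.zeros(2)

# First-order conditions of the saddle problem (2.4) at (x*, p*):
#   grad_x L = Hess f(x) p         = Q p,
#   grad_p L = grad f(x) - eps * p = Q x + b - eps * p.
grad_x_L = Q @ p_star
grad_p_L = (Q @ x_star + b) - eps * p_star
print(grad_x_L, grad_p_L)            # both vanish at (x*, 0)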

2.3 PDHG for unconstrained optimization

We apply the scheme given by (2.2) to the saddle point problem (2.4) (set $G=0$, $F^{*}({\bm{p}})=\frac{\varepsilon}{2}\|{\bm{p}}\|^{2}$, and identify $K{\bm{x}}$ with $\nabla f({\bm{x}})$ in (2.1)). Thus,

{\bm{p}}^{n+1}=\operatorname*{arg\,max}_{{\bm{p}}}\quad\langle\nabla f({\bm{x}}^{n}),{\bm{p}}\rangle-\frac{\varepsilon}{2}\left\lVert{\bm{p}}\right\rVert^{2}-\frac{\left\lVert{\bm{p}}-{\bm{p}}^{n}\right\rVert^{2}_{{\bm{A}}({\bm{x}}^{n})^{-1}}}{2\sigma}\,, (2.5a)
\widetilde{{\bm{p}}}^{n+1}=\omega({\bm{p}}^{n+1}-{\bm{p}}^{n})+{\bm{p}}^{n+1}\,, (2.5b)
{\bm{x}}^{n+1}=\operatorname*{arg\,min}_{{\bm{x}}}\quad\langle\nabla f({\bm{x}}),\widetilde{{\bm{p}}}^{n+1}\rangle+\frac{\left\lVert{\bm{x}}-{\bm{x}}^{n}\right\rVert^{2}_{{\bm{B}}({\bm{x}}^{n})^{-1}}}{2\tau}\,, (2.5c)

where we have added symmetric positive definite matrices ${\bm{A}}({\bm{x}}^{n}),{\bm{B}}({\bm{x}}^{n})\in\mathbb{R}^{d\times d}$ as preconditioners for the updates of ${\bm{p}}$ and ${\bm{x}}$, respectively. We also denote the norm $\left\lVert{\bm{h}}\right\rVert^{2}_{{\bm{A}}^{-1}}$ as ${\bm{h}}^{T}{\bm{A}}^{-1}{\bm{h}}$, where ${\bm{h}}\in\mathbb{R}^{d}$.

As mentioned, the convergence analysis of PDHG relies on the assumption that $K$ is a linear operator. So we cannot apply the same convergence analysis to (2.5), since $\nabla f({\bm{x}})$ is not necessarily linear in ${\bm{x}}$. By taking the optimality conditions of (2.5), we find that ${\bm{p}}^{n+1}$ and ${\bm{x}}^{n+1}$ solve

{\bm{p}}^{n+1}+\sigma\varepsilon{\bm{A}}({\bm{x}}^{n}){\bm{p}}^{n+1}-{\bm{p}}^{n}-\sigma{\bm{A}}({\bm{x}}^{n})\nabla f({\bm{x}}^{n})=0\,, (2.6a)
\tau{\bm{B}}({\bm{x}}^{n})\nabla^{2}f({\bm{x}}^{n+1})\big{[}(1+\omega)\big{(}{\mathbb{I}}+\sigma\varepsilon{\bm{A}}({\bm{x}}^{n})\big{)}^{-1}\sigma{\bm{A}}({\bm{x}}^{n})\nabla f({\bm{x}}^{n})+\big{(}(\omega+1)\big{(}{\mathbb{I}}+\sigma\varepsilon{\bm{A}}({\bm{x}}^{n})\big{)}^{-1}-\omega{\mathbb{I}}\big{)}{\bm{p}}^{n}\big{]}+({\bm{x}}^{n+1}-{\bm{x}}^{n})=0\,, (2.6b)

where we substitute the update Eq. 2.5b into Eq. 2.6b. We use ${\mathbb{I}}$ to represent the identity matrix in Eq. 2.6b.

Note that the update for ${\bm{x}}^{n+1}$ in Eq. 2.6b is implicit, unless $\nabla^{2}f({\bm{x}})$ does not depend on ${\bm{x}}$. We also remark that the update for ${\bm{x}}^{n+1}$ in Eq. 2.6b becomes explicit if we perform a gradient step instead of a proximal step in Eq. 2.5c. To be more precise, when ${\bm{B}}={\mathbb{I}}$, the linearized version of Eq. 2.5c can be written as

{\bm{x}}^{n+1}=\mathrm{prox}_{\tau\langle\nabla f(\cdot),\widetilde{{\bm{p}}}^{n+1}\rangle}({\bm{x}}^{n})\,.

Taking a gradient step instead of a proximal step yields

{\bm{x}}^{n+1}={\bm{x}}^{n}-\tau\nabla^{2}f({\bm{x}}^{n})\widetilde{{\bm{p}}}^{n+1}\,. (2.7)

For a general choice of preconditioner ${\bm{B}}({\bm{x}}^{n})$, the linearized version of Eq. 2.5c satisfies

{\bm{x}}^{n+1}={\bm{x}}^{n}-\tau{\bm{B}}({\bm{x}}^{n})\nabla^{2}f({\bm{x}}^{n})\widetilde{{\bm{p}}}^{n+1}={\bm{x}}^{n}-\tau{\bm{C}}({\bm{x}}^{n})\widetilde{{\bm{p}}}^{n+1}\,.

Here we always denote a matrix function ${\bm{C}}$ such that

{\bm{C}}({\bm{x}}^{n}):={\bm{B}}({\bm{x}}^{n})\nabla^{2}f({\bm{x}}^{n})\,.

For simplicity of presentation, we only consider the simple case where ${\bm{A}}({\bm{x}}^{n})=A{\mathbb{I}}$ for some $A>0$. We now summarize the linearized update Eq. 2.6 into the following algorithm.

Algorithm 1 Linearized Primal-Dual Damping Algorithm
Initial guesses ${\bm{x}}^{0}\in\mathbb{R}^{d}$, ${\bm{p}}^{0}\in\mathbb{R}^{d}$; stepsizes $\tau>0$, $\sigma>0$; parameters $A>0$, $\varepsilon>0$, $\omega>0$, ${\bm{C}}\succ 0$.
while $n=1,2,\cdots$, not converged do
    ${\bm{p}}^{n+1}=\frac{1}{1+\sigma\varepsilon A}{\bm{p}}^{n}+\frac{\sigma A}{1+\sigma\varepsilon A}\nabla f({\bm{x}}^{n})$;
    $\tilde{{\bm{p}}}^{n+1}={\bm{p}}^{n+1}+\omega({\bm{p}}^{n+1}-{\bm{p}}^{n})$;
    ${\bm{x}}^{n+1}={\bm{x}}^{n}-\tau{\bm{C}}({\bm{x}}^{n})\tilde{{\bm{p}}}^{n+1}$;
end while

We note that Algorithm 1 and update Eq. 2.6 are different methods for solving saddle point problem Eq. 2.3. In this paper, we focus on the computation and analysis of Algorithm 1.
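
For readers who prefer code, the following is a minimal Python sketch of Algorithm 1 with the simple choices ${\bm{A}}({\bm{x}})=A{\mathbb{I}}$ and ${\bm{C}}={\mathbb{I}}$, applied to a strongly convex quadratic. The function name pdd and the parameter values are our own illustrative choices, not the tuned settings used in the numerical sections.

import numpy as np

def pdd(grad_f, x0, tau=0.1, sigma=0.1, A=1.0, eps=1.0, omega=1.0,
        C=None, max_iter=1000, tol=1e-10):
    # Linearized primal-dual damping (Algorithm 1) with A(x) = A*I.
    # C is a fixed preconditioner matrix; C = None means C = I.
    x, p = x0.astype(float).copy(), np.zeros_like(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        # p-update: p^{n+1} = (p^n + sigma*A*grad f(x^n)) / (1 + sigma*eps*A)
        p_new = (p + sigma * A * g) / (1.0 + sigma * eps * A)
        # extrapolation: p_tilde = p^{n+1} + omega*(p^{n+1} - p^n)
        p_tilde = p_new + omega * (p_new - p)
        # x-update: x^{n+1} = x^n - tau * C * p_tilde
        x = x - tau * (p_tilde if C is None else C @ p_tilde)
        p = p_new
    return x

# Example: minimize f(x) = 0.5 x^T Q x for an ill-conditioned diagonal Q.
Q = np.diag([1.0, 10.0, 100.0])
x_out = pdd(lambda x: Q @ x, np.ones(3))
print(np.linalg.norm(Q @ x_out))   # gradient norm at the returned point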

2.4 PDD dynamics

An approach for analyzing optimization algorithms is to first study the continuous limit of the algorithm using ODEs [Su et al., 2015, Siegel, 2019, Attouch et al., 2019]. The advantage of doing so is that ODEs provide insights into the convergence properties of the algorithm.

We first reformulate the proposed algorithm Eq. 2.6 into a first-order ODE system.

Proposition 2.3.

As $\tau,\sigma\to 0$ and $\sigma\omega\to\gamma$, both the update in Eq. 2.6 and Algorithm 1 can be formulated as discrete-time updates of the following ODE system:

\dot{{\bm{p}}}={\bm{A}}({\bm{x}})\nabla f({\bm{x}})-\varepsilon{\bm{A}}({\bm{x}}){\bm{p}}\,, (2.8a)
\dot{{\bm{x}}}=-{\bm{C}}({\bm{x}})({\bm{p}}+\gamma({\bm{A}}({\bm{x}})\nabla f({\bm{x}})-\varepsilon{\bm{A}}({\bm{x}}){\bm{p}}))\,, (2.8b)

where ${\bm{C}}({\bm{x}})={\bm{B}}({\bm{x}})\nabla^{2}f({\bm{x}})$ and the initial condition satisfies ${\bm{x}}(0)={\bm{x}}^{0}$, ${\bm{p}}(0)={\bm{p}}^{0}$. Suppose that $\nabla f$ is Lipschitz continuous and each entry of the matrices ${\bm{A}}$, ${\bm{C}}$ is continuous and bounded. Then, there exists a unique solution to the ODE system Eq. 2.8. A stationary state $({\bm{x}}^{*},{\bm{p}}^{*})$ of the ODE system Eq. 2.8 satisfies

\nabla f({\bm{x}}^{*})=0,\quad{\bm{p}}^{*}=0.
Proof.

Rearranging Eq. 2.6a and Eq. 2.6b, we have

\frac{{\bm{p}}^{n+1}-{\bm{p}}^{n}}{\sigma}={\bm{A}}({\bm{x}}^{n})\nabla f({\bm{x}}^{n})-\varepsilon{\bm{A}}({\bm{x}}^{n}){\bm{p}}^{n+1}\,,
\frac{{\bm{x}}^{n+1}-{\bm{x}}^{n}}{\tau}=-{\bm{B}}({\bm{x}}^{n})\nabla^{2}f({\bm{x}}^{n+1})\big{[}(1+\omega)\big{(}{\mathbb{I}}+\sigma\varepsilon{\bm{A}}({\bm{x}}^{n})\big{)}^{-1}\sigma{\bm{A}}({\bm{x}}^{n})\nabla f({\bm{x}}^{n})+\big{(}(\omega+1)\big{(}{\mathbb{I}}+\sigma\varepsilon{\bm{A}}({\bm{x}}^{n})\big{)}^{-1}-\omega{\mathbb{I}}\big{)}{\bm{p}}^{n}\big{]}\,.

Taking the limit as $\tau,\sigma\to 0$ and $\sigma\omega\to\gamma$, we obtain

\dot{{\bm{p}}}={\bm{A}}({\bm{x}})\nabla f({\bm{x}})-\varepsilon{\bm{A}}({\bm{x}}){\bm{p}}\,,
\dot{{\bm{x}}}=-{\bm{B}}({\bm{x}})\nabla^{2}f({\bm{x}})({\bm{p}}+\gamma({\bm{A}}({\bm{x}})\nabla f({\bm{x}})-\varepsilon{\bm{A}}({\bm{x}}){\bm{p}}))\,.

Similarly, the update in Algorithm 1 also converges to the ODE system Eq. 2.8. Clearly, a stationary state satisfies ${\bm{p}}^{*}=0$, $\nabla f({\bm{x}}^{*})=0$. ∎

Proposition 2.4 (Primal-dual damping second order ODE).

The ODE system Eq. 2.8 satisfies the following second-order ODE

\ddot{{\bm{x}}}+\big{[}\varepsilon{\bm{A}}+\gamma{\bm{C}}{\bm{A}}\nabla^{2}f({\bm{x}})-\dot{{\bm{C}}}{\bm{C}}^{-1}\big{]}\dot{{\bm{x}}}+{\bm{C}}{\bm{A}}\nabla f({\bm{x}})=0\,. (2.9)

Here $\dot{\bm{C}}=\frac{d}{dt}{\bm{C}}({\bm{x}}(t))$.

The proof follows by direct calculations and can be found in Appendix C. We note that the formulation given by Eq. 2.9 includes several important special cases in the literature. In a word, we view Eq. 2.9 as a preconditioned accelerated gradient flow.

Example 2.1.

Let ${\bm{C}}={\bm{A}}={\mathbb{I}}$ and $\gamma\neq 0$. Then equation Eq. 2.9 satisfies

\ddot{{\bm{x}}}+\varepsilon\dot{{\bm{x}}}+\gamma\nabla^{2}f({\bm{x}})\dot{{\bm{x}}}+\nabla f({\bm{x}})=0\,, (2.10)

which is an inertial system with Hessian-driven damping [Attouch et al., 2020].

Remark 2.5.

In the case of ${\bm{C}}={\bm{A}}={\mathbb{I}}$, although the derived second-order ODE Eq. 2.9 is the same as the one in Attouch et al. [2020] at the continuous-time level, our Algorithm 1 provides a different time discretization from the one in Attouch et al. [2020].

Example 2.2.

Let ${\bm{C}}={\bm{A}}={\mathbb{I}}$, $\gamma(t)=0$. Then equation Eq. 2.9 satisfies the heavy ball ODE [Siegel, 2019]

\ddot{{\bm{x}}}+\varepsilon\dot{{\bm{x}}}+\nabla f({\bm{x}})=0\,. (2.11)
Example 2.3.

Let ${\bm{C}}={\bm{A}}={\mathbb{I}}$, $\gamma(t)=0$, $\varepsilon(t)=\frac{3}{t}$. Then equation Eq. 2.9 satisfies the Nesterov ODE [Su et al., 2015]:

\ddot{{\bm{x}}}+\frac{3}{t}\dot{{\bm{x}}}+\nabla f({\bm{x}})=0\,. (2.12)
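
The special cases above can be explored numerically. The sketch below is our own illustration: it integrates the PDD ODE system Eq. 2.8 with ${\bm{C}}={\bm{A}}={\mathbb{I}}$ by forward Euler on a quadratic objective, so that $\gamma=0$ recovers the heavy ball ODE Eq. 2.11 and $\gamma>0$ adds the Hessian-driven damping of Eq. 2.10; the step size, horizon, and parameter values are arbitrary choices for illustration.

import numpy as np

def integrate_pdd_ode(grad_f, x0, eps=1.0, gamma=0.5, dt=1e-3, T=20.0):
    # Forward-Euler integration of the PDD ODE (2.8) with C = A = I:
    #   p' = grad f(x) - eps*p,
    #   x' = -(p + gamma*(grad f(x) - eps*p)).
    x, p = x0.astype(float).copy(), np.zeros_like(x0, dtype=float)
    for _ in range(int(T / dt)):
        g = grad_f(x)
        dp = g - eps * p
        dx = -(p + gamma * dp)
        x, p = x + dt * dx, p + dt * dp
    return x

# Quadratic f(x) = 0.5 x^T Q x: gamma = 0 gives the heavy ball ODE (2.11),
# gamma > 0 adds the Hessian-driven damping term of (2.10).
Q = np.diag([1.0, 50.0])
for gamma in (0.0, 0.5):
    x_T = integrate_pdd_ode(lambda x: Q @ x, np.array([1.0, 1.0]), gamma=gamma)
    print(gamma, np.linalg.norm(x_T))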

We next provide a convergence analysis of ODE Eq. 2.8 for quadratic optimization problems. We demonstrate the importance of preconditioners in characterizing the convergence speed of ODE Eq. 2.8.

Theorem 2.6.

Suppose $f({\bm{x}})=\frac{1}{2}{\bm{x}}^{T}{\bm{Q}}{\bm{x}}$ for some symmetric positive definite matrix ${\bm{Q}}\in\mathbb{R}^{d\times d}$. Assume ${\bm{A}}$, ${\bm{B}}$ are constant matrices. In this case, equation Eq. 2.8 satisfies the linear ODE system:

\begin{pmatrix}\dot{{\bm{x}}}\\ \dot{{\bm{p}}}\end{pmatrix}=\begin{pmatrix}-\gamma{\bm{B}}{\bm{Q}}{\bm{A}}{\bm{Q}}&-{\bm{B}}{\bm{Q}}({\mathbb{I}}-\gamma\varepsilon{\bm{A}})\\ {\bm{A}}{\bm{Q}}&-\varepsilon{\bm{A}}\end{pmatrix}\begin{pmatrix}{\bm{x}}\\ {\bm{p}}\end{pmatrix}\,.

Suppose that ${\bm{A}}$ commutes with ${\bm{Q}}$, such that ${\bm{A}}{\bm{Q}}={\bm{Q}}{\bm{A}}$. Suppose ${\bm{A}}$ and ${\bm{B}}{\bm{Q}}{\bm{A}}{\bm{Q}}$ are simultaneously diagonalizable and have positive eigenvalues. Let $\mu_{1}\geq\ldots\geq\mu_{n}>0$ be the eigenvalues of ${\bm{B}}{\bm{Q}}{\bm{A}}{\bm{Q}}$ and $a_{i}$ the $i$-th eigenvalue of ${\bm{A}}$ (not necessarily in descending order) in the same basis. Then

  (a) The solution of the ODE system 2.8 converges to $({\bm{x}}^{*},{\bm{p}}^{*})=(0,0)$:

    \|({\bm{x}}(t),{\bm{p}}(t))\|\leq\|({\bm{x}}_{0},{\bm{p}}_{0})\|\exp(\alpha t)\,,

    where

    \alpha=\max_{i}\frac{1}{2}\big{[}-\gamma\mu_{i}-\varepsilon a_{i}+\Re\big{(}\sqrt{(\gamma\mu_{i}+\varepsilon)^{2}-4\mu_{i}}\big{)}\big{]}\,.

  (b) When ${\bm{A}}={\mathbb{I}},\varepsilon=0$, the optimal convergence rate is achieved at $\gamma^{*}=\frac{2\sqrt{\mu_{1}}}{\sqrt{\mu_{n}(2\mu_{1}-\mu_{n})}}$. The corresponding rate is $\alpha=\frac{-\sqrt{\mu_{n}}}{\sqrt{2-\frac{1}{\kappa}}}$, where $\kappa=\mu_{1}/\mu_{n}>1$.

  (c) Moreover, when $\gamma=\varepsilon=0$, the system will not converge for any initial data $({\bm{x}}_{0},{\bm{p}}_{0})\neq(0,0)$.

  (d) If ${\bm{A}}={\mathbb{I}}$, $\gamma\leq\frac{1}{\sqrt{\mu_{1}}}$, $\varepsilon=2\sqrt{\mu^{\prime}}-\gamma\mu^{\prime}$ for some $\mu^{\prime}\leq\mu_{n}$, then

    \alpha=-\sqrt{\mu^{\prime}}-\frac{\gamma}{2}(\mu_{n}-\mu^{\prime})\leq-\sqrt{\mu^{\prime}}\,.

We defer the proof to Appendix B.

Remark 2.7.

If $\omega$ is bounded, then we have $\gamma=\mathcal{O}(\sigma)$. Then, in the limit as $\sigma\to 0$, we also have $\gamma\to 0$. By Theorem 2.6 (c), the ODE system 2.8 does not converge for any initial data.

Remark 2.8.

If $\mu^{\prime}$ is an estimate of the smallest eigenvalue $\mu_{n}$, then the convergence speed of the solution of the heavy ball ODE is $\exp(-\sqrt{\mu^{\prime}}t)$. In Theorem 2.6 (d), if $\gamma=0$ and $\mu^{\prime}=\mu_{n}$, then $\alpha=-\sqrt{\mu_{n}}$, which is the same as the convergence rate of the heavy ball ODE [Siegel, 2019]. However, if $\gamma>0$ and $\mu^{\prime}<\mu_{n}$, then we have $\alpha=-\sqrt{\mu^{\prime}}-\frac{\gamma}{2}(\mu_{n}-\mu^{\prime})<-\sqrt{\mu^{\prime}}$, which converges faster than the heavy ball ODE.
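
As a numerical sanity check of the rate formula in Theorem 2.6 (a), the snippet below (our own illustration, taking ${\bm{A}}={\bm{B}}={\mathbb{I}}$ and a diagonal ${\bm{Q}}$, so that the $\mu_{i}$ are the eigenvalues of ${\bm{Q}}^{2}$ and $a_{i}=1$) compares the spectral abscissa of the linear ODE system with the closed-form expression for $\alpha$.

import numpy as np

# Quadratic f(x) = 0.5 x^T Q x with A = B = I, so mu_i are eigenvalues of Q^2.
q = np.array([1.0, 3.0, 10.0])          # eigenvalues of the diagonal Q
Q = np.diag(q)
I = np.eye(3)
gamma, eps = 0.2, 0.5

# Block matrix of the linear ODE system in Theorem 2.6 (with A = B = I).
M = np.block([[-gamma * Q @ Q, -Q * (1.0 - gamma * eps)],
              [Q,              -eps * I]])
abscissa = np.max(np.linalg.eigvals(M).real)

# Closed-form rate from Theorem 2.6 (a) with a_i = 1 and mu_i = q_i^2.
mu = q ** 2
alpha = np.max(0.5 * (-gamma * mu - eps
                      + np.real(np.sqrt((gamma * mu + eps) ** 2 - 4 * mu + 0j))))
print(abscissa, alpha)   # the two values agree up to numerical error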

3 Lyapunov Analysis

In this section, we present the main theoretical result of this paper. We provide the convergence analysis for general objective functions in both continuous-time ODEs Eq. 2.8 and discrete-time Algorithm 1. From now on, we make the following two assumptions for the convergence analysis.

Assumption 3.1.

There exist two constants $L\geq\mu>0$ such that $\mu{\mathbb{I}}\preceq{\bm{C}}_{0}({\bm{x}})\preceq L{\mathbb{I}}$ for all ${\bm{x}}$, where ${\bm{C}}_{0}({\bm{x}})=\nabla^{2}f({\bm{x}}){\bm{B}}({\bm{x}})\nabla^{2}f({\bm{x}})$, and $\mu\leq 1$.

Assumption 3.2.

There exists a constant $L^{\prime}>0$ such that

{\bm{C}}({\bm{x}})^{T}\big{(}\nabla^{3}f({\bm{x}})\nabla f({\bm{x}})+(\nabla^{2}f({\bm{x}}))^{2}\big{)}{\bm{C}}({\bm{x}})\preceq L^{\prime}{\mathbb{I}} (3.1)

for all ${\bm{x}}$, where ${\bm{C}}({\bm{x}})={\bm{B}}({\bm{x}})\nabla^{2}f({\bm{x}})$.

3.1 Continuous time Lyapunov analysis

In this subsection, we establish convergence results of the ODE system Eq. 2.8.

Theorem 3.3.

Consider the ODE system Eq. 2.8 with an initial condition $({\bm{x}}(0),{\bm{p}}(0))\in\mathbb{R}^{2d}$. Define the functional

{\mathcal{I}}({\bm{x}},{\bm{p}})=\frac{1}{2}(\|{\bm{p}}\|^{2}+\|\nabla f({\bm{x}})\|^{2})\,. (3.2)

Suppose Assumption 3.1 holds. Then we have

{\mathcal{I}}({\bm{x}}(t),{\bm{p}}(t))\leq{\mathcal{I}}({\bm{x}}(0),{\bm{p}}(0))\exp(-2\lambda t)\,, (3.3)

where

\lambda=\min\Big{\{}\mu\gamma A-\frac{1}{2}|A-\mu(1-\varepsilon\gamma A)|,\ L\gamma A-\frac{1}{2}|A-L(1-\varepsilon\gamma A)|,\ \varepsilon A-\frac{1}{2}|A-\mu(1-\varepsilon\gamma A)|,\ \varepsilon A-\frac{1}{2}|A-L(1-\varepsilon\gamma A)|\Big{\}}\,.

In particular, when $\gamma=\frac{1}{\mu},\varepsilon=1,A=\frac{\mu+L}{2+(\mu+L)\varepsilon\gamma}$, then $\lambda=\frac{\mu}{2}$.

Proof.

It is straightforward to compute the following:

\frac{\operatorname{d}\!{{\mathcal{I}}}}{\operatorname{d}\!{t}}=\langle{\bm{p}},\dot{{\bm{p}}}\rangle+\langle\nabla f,\nabla^{2}f\dot{{\bm{x}}}\rangle=-\nabla f^{T}{\bm{C}}_{0}\gamma{\bm{A}}\nabla f-{\bm{p}}^{T}\varepsilon{\bm{A}}{\bm{p}}+\nabla f^{T}\big{(}{\bm{A}}-{\bm{C}}_{0}({\mathbb{I}}-\varepsilon\gamma{\bm{A}})\big{)}{\bm{p}}\,. (3.4)

We shall find λ\lambda such that ddt+2λ0\frac{\operatorname{d}\!{{\mathcal{I}}}}{\operatorname{d}\!{t}}+2\lambda{\mathcal{I}}\leq 0. Then we obtain the exponential convergence by Gronwall’s inequality, i.e.,

(𝒙(t),𝒑(t))(𝒙(0),𝒑(0))exp(2λt).{\mathcal{I}}({\bm{x}}(t),{\bm{p}}(t))\leq{\mathcal{I}}({\bm{x}}(0),{\bm{p}}(0))\exp(-2\lambda t)\ .

We can compute

ddt+2λ\displaystyle\frac{\operatorname{d}\!{{\mathcal{I}}}}{\operatorname{d}\!{t}}+2\lambda{\mathcal{I}} =fT(𝑪0γ𝑨+λ𝕀)f+𝒑T(ε𝑨+λ𝕀)𝒑\displaystyle=\nabla f^{T}\big{(}-{\bm{C}}_{0}\gamma{\bm{A}}+\lambda{\mathbb{I}}\big{)}\nabla f+{\bm{p}}^{T}\big{(}-\varepsilon{\bm{A}}+\lambda{\mathbb{I}}\big{)}{\bm{p}}
+fT(𝑨𝑪0(𝕀εγ𝑨))𝒑.\displaystyle\qquad+\nabla f^{T}\big{(}{\bm{A}}-{\bm{C}}_{0}({\mathbb{I}}-\varepsilon\gamma{\bm{A}})\big{)}{\bm{p}}\,. (3.5)

By Lemma A.1, we obtain the following sufficient conditions for ddt+2λ0\frac{\operatorname{d}\!{{\mathcal{I}}}}{\operatorname{d}\!{t}}+2\lambda{\mathcal{I}}\leq 0

εA+λ+12|ξi(1εγA)A|\displaystyle-\varepsilon A+\lambda+\frac{1}{2}|\xi_{i}(1-\varepsilon\gamma A)-A| 0\displaystyle\leq 0 (3.6a)
λξiγA+12|ξi(1εγA)A|\displaystyle\lambda-\xi_{i}\gamma A+\frac{1}{2}|\xi_{i}(1-\varepsilon\gamma A)-A| 0\displaystyle\leq 0 (3.6b)

where $\xi_{i}({\bm{x}})$ is an eigenvalue of ${\bm{C}}_{0}({\bm{x}})$. By our assumptions, we have $L\geq\xi_{1}({\bm{x}})\geq\ldots\geq\xi_{n}({\bm{x}})\geq\mu$. Eq. 3.6 gives two upper bounds on $\lambda$. Define $g_{1}(\xi)=\varepsilon A-\frac{1}{2}|\xi(1-\varepsilon\gamma A)-A|$ and $g_{2}(\xi)=\xi\gamma A-\frac{1}{2}|\xi(1-\varepsilon\gamma A)-A|$ on the interval $[\mu,L]$. Then Eq. 3.6 implies that

\lambda\leq g_{j}(\xi_{i})\,, (3.7)

for all $i=1,\ldots,n$ and $j=1,2$. Since each $g_{j}(\xi)$ is piecewise linear in $\xi$, it is not hard to see that

\min_{\xi\in[\mu,L]}g_{j}(\xi)=\min\{g_{j}(\mu),g_{j}(L)\}\,,

for $j=1,2$. This proves the formula for $\lambda$. When $A=\frac{\mu+L}{2+(\mu+L)\varepsilon\gamma}$, we have $g_{1}(\mu)=g_{1}(L)$, and

\mu(1-\varepsilon\gamma A)-A=-L(1-\varepsilon\gamma A)+A\,.

Further, requiring $g_{1}(\mu)=g_{2}(\mu)$ yields $\varepsilon=\mu\gamma$. And we obtain

\lambda=\mu\gamma A-\frac{1}{2}|A-\mu(1-\varepsilon\gamma A)|=\mu\gamma A-\frac{1}{2}(A-\mu(1-\varepsilon\gamma A))=\frac{\mu}{2}+A(\gamma\mu-\frac{1}{2}\gamma^{2}\mu^{2}-\frac{1}{2})=\frac{\mu}{2}-\frac{A}{2}(\gamma\mu-1)^{2}. (3.8)

We note that $\lambda$ is maximized by taking $\gamma=\mu^{-1}$, and we obtain $\lambda=\frac{\mu}{2}$. ∎
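
As a quick numeric check of this parameter choice, the following snippet (our own illustration, with hypothetical values of $\mu$ and $L$) evaluates the four terms in the formula for $\lambda$ with $\gamma=1/\mu$, $\varepsilon=1$, $A=\frac{\mu+L}{2+(\mu+L)\varepsilon\gamma}$ and confirms that their minimum agrees with $\mu/2$ up to floating-point rounding.

# Hypothetical values of the bounds mu and L from Assumption 3.1.
mu, L = 0.5, 4.0

gamma, eps = 1.0 / mu, 1.0
A = (mu + L) / (2.0 + (mu + L) * eps * gamma)

lam = min(mu * gamma * A - 0.5 * abs(A - mu * (1 - eps * gamma * A)),
          L * gamma * A - 0.5 * abs(A - L * (1 - eps * gamma * A)),
          eps * A - 0.5 * abs(A - mu * (1 - eps * gamma * A)),
          eps * A - 0.5 * abs(A - L * (1 - eps * gamma * A)))
print(lam, mu / 2)   # the two values agree up to rounding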

3.2 Discrete time Lyapunov analysis

In this subsection, we study the convergence criterion for the discretized linearized PDHG flow given by Eq. 1.2 and Algorithm 1.

From now on, we assume that $f$ is a $\mathcal{C}^{4}$ strongly convex function.

We can rewrite the iterations as

𝒑n+1\displaystyle{\bm{p}}^{n+1} =11+σεA𝒑n+σA1+σεAf(𝒙n),\displaystyle=\frac{1}{1+\sigma\varepsilon A}{\bm{p}}^{n}+\frac{\sigma A}{1+\sigma\varepsilon A}\nabla f({\bm{x}}^{n})\,, (3.9a)
𝒙n+1\displaystyle{\bm{x}}^{n+1} =𝒙nτ𝑩(𝒙n)2f(𝒙n)(1εγA1+σεA𝒑n+σA+γA1+σεAf(𝒙n)),\displaystyle={\bm{x}}^{n}-\tau{\bm{B}}({\bm{x}}^{n})\nabla^{2}f({\bm{x}}^{n})\left(\frac{1-\varepsilon\gamma A}{1+\sigma\varepsilon A}{\bm{p}}^{n}+\frac{\sigma A+\gamma A}{1+\sigma\varepsilon A}\nabla f({\bm{x}}^{n})\right)\,, (3.9b)

where γ=σω\gamma=\sigma\omega. We define the following notations which will be used later.

𝑵(𝒙n)=11+σεA(𝑩(𝒙n)2f(𝒙n)(σA+γA)𝑩(𝒙n)2f(𝒙n)(1εγA)στAστεA).{\bm{N}}({\bm{x}}^{n})=\frac{1}{1+\sigma\varepsilon A}\begin{pmatrix}{\bm{B}}({\bm{x}}^{n})\nabla^{2}f({\bm{x}}^{n})(\sigma A+\gamma A)&{\bm{B}}({\bm{x}}^{n})\nabla^{2}f({\bm{x}}^{n})(1-\varepsilon\gamma A)\\ -\frac{\sigma}{\tau}A&\frac{\sigma}{\tau}\varepsilon A\end{pmatrix}\,. (3.10)

And

𝑯(𝒙n)=sym((2f(𝐱n)00𝕀)𝐍(𝐱n)).{\bm{H}}({\bm{x}}^{n})=\rm{sym}\begin{pmatrix}\begin{pmatrix}\nabla^{2}f({\bm{x}}^{n})&0\\ 0&{\mathbb{I}}\end{pmatrix}\cdot{\bm{N}}({\bm{x}}^{n})\end{pmatrix}\,.
Remark 3.4.

The matrices ${\bm{N}}({\bm{x}}^{n})$ and ${\bm{H}}({\bm{x}}^{n})$ also depend on $\tau$, $\sigma$, $A$, $\varepsilon$, and $\omega$.

Define the Lyapunov functional in discrete time as

(𝒙n,𝒑n)=12f(𝒙n)2+12𝒑n2.{\mathcal{I}}({\bm{x}}^{n},{\bm{p}}^{n})=\frac{1}{2}\|\nabla f({\bm{x}}^{n})\|^{2}+\frac{1}{2}\|{\bm{p}}^{n}\|^{2}\,.
Theorem 3.5.

Suppose that there exist positive constants $\lambda,M_{1}\in\mathbb{R}_{+}$ such that

{\bm{H}}({\bm{x}})\succeq\lambda{\mathbb{I}}\,,
{\bm{N}}({\bm{x}})^{T}\nabla^{2}{\mathcal{I}}(\tilde{{\bm{x}}},\tilde{{\bm{p}}}){\bm{N}}({\bm{x}})\preceq M_{1}{\mathbb{I}},

for all ${\bm{x}},\tilde{{\bm{x}}}\in\mathbb{R}^{n}$. If $\tau=a\frac{\lambda}{M_{1}}$ for some $a\in(0,2)$, then the functional ${\mathcal{I}}({\bm{x}}^{n},{\bm{p}}^{n})$ decreases geometrically, i.e.,

{\mathcal{I}}({\bm{x}}^{n},{\bm{p}}^{n})\leq{\mathcal{I}}({\bm{x}}^{0},{\bm{p}}^{0})\big{(}1+(a^{2}-2a)\frac{\lambda^{2}}{2M_{1}}\big{)}^{n}\,.
Proof.

It follows from our definition of 𝑵(𝒙n){\bm{N}}({\bm{x}}^{n}) that

(𝒙n+1𝒙n𝒑n+1𝒑n)=τ𝑵(𝒙n)(f(𝒙n)𝒑n),\begin{pmatrix}{\bm{x}}^{n+1}-{\bm{x}}^{n}\\ {\bm{p}}^{n+1}-{\bm{p}}^{n}\end{pmatrix}=-\tau{\bm{N}}({\bm{x}}^{n})\begin{pmatrix}\nabla f({\bm{x}}^{n})\\ {\bm{p}}^{n}\end{pmatrix}\,, (3.11)

By the mean-value theorem, we obtain

(𝒙n+1,𝒑n+1)(𝒙n,𝒑n)\displaystyle{\mathcal{I}}({\bm{x}}^{n+1},{\bm{p}}^{n+1})-{\mathcal{I}}({\bm{x}}^{n},{\bm{p}}^{n})
=\displaystyle= (𝒙(𝒙n,𝒑n)𝒑(𝒙n,𝒑n))T(𝒙n+1𝒙n𝒑n+1𝒑n)+12(𝒙n+1𝒙n𝒑n+1𝒑n)T2(𝒙~,𝒑~)(𝒙n+1𝒙n𝒑n+1𝒑n)\displaystyle\begin{pmatrix}\nabla_{{\bm{x}}}{\mathcal{I}}({\bm{x}}^{n},{\bm{p}}^{n})\\ \nabla_{{\bm{p}}}{\mathcal{I}}({\bm{x}}^{n},{\bm{p}}^{n})\end{pmatrix}^{T}\begin{pmatrix}{\bm{x}}^{n+1}-{\bm{x}}^{n}\\ {\bm{p}}^{n+1}-{\bm{p}}^{n}\end{pmatrix}+\frac{1}{2}\begin{pmatrix}{\bm{x}}^{n+1}-{\bm{x}}^{n}\\ {\bm{p}}^{n+1}-{\bm{p}}^{n}\end{pmatrix}^{T}\nabla^{2}{\mathcal{I}}(\tilde{{\bm{x}}},\tilde{{\bm{p}}})\begin{pmatrix}{\bm{x}}^{n+1}-{\bm{x}}^{n}\\ {\bm{p}}^{n+1}-{\bm{p}}^{n}\end{pmatrix}

where (𝒙~,𝒑~)(\tilde{{\bm{x}}},\tilde{{\bm{p}}}) is in between (𝒙n+1,𝒑n+1)({\bm{x}}^{n+1},{\bm{p}}^{n+1}) and (𝒙n,𝒑n)({\bm{x}}^{n},{\bm{p}}^{n}). And

𝒙(𝒙n,𝒑n)\displaystyle\nabla_{{\bm{x}}}{\mathcal{I}}({\bm{x}}^{n},{\bm{p}}^{n}) =2f(𝒙n)f(𝒙n),\displaystyle=\nabla^{2}f({\bm{x}}^{n})\nabla f({\bm{x}}^{n})\,,
𝒑(𝒙n,𝒑n)\displaystyle\nabla_{{\bm{p}}}{\mathcal{I}}({\bm{x}}^{n},{\bm{p}}^{n}) =𝒑n,\displaystyle={\bm{p}}^{n}\,,
2(𝒙n,𝒑n)\displaystyle\nabla^{2}{\mathcal{I}}({\bm{x}}^{n},{\bm{p}}^{n}) =(3f(𝒙n)f(𝒙n)+2f(𝒙n)2f(𝒙n)00𝕀).\displaystyle=\begin{pmatrix}\nabla^{3}f({\bm{x}}^{n})\nabla f({\bm{x}}^{n})+\nabla^{2}f({\bm{x}}^{n})\nabla^{2}f({\bm{x}}^{n})&0\\ 0&{\mathbb{I}}\end{pmatrix}\,.

Then using Eq. 3.11 and definition of 𝑯(𝒙n){\bm{H}}({\bm{x}}^{n}), we obtain

(𝒙n+1,𝒑n+1)(𝒙n,𝒑n)\displaystyle{\mathcal{I}}({\bm{x}}^{n+1},{\bm{p}}^{n+1})-{\mathcal{I}}({\bm{x}}^{n},{\bm{p}}^{n})
=\displaystyle= τ(f(𝒙n)𝒑n)T(2f(𝒙n)00𝕀)𝑵(𝒙n)(f(𝒙n)𝒑n)\displaystyle-\tau\begin{pmatrix}\nabla f({\bm{x}}^{n})\\ {\bm{p}}^{n}\end{pmatrix}^{T}\begin{pmatrix}\nabla^{2}f({\bm{x}}^{n})&0\\ 0&{\mathbb{I}}\end{pmatrix}\cdot{\bm{N}}({\bm{x}}^{n})\begin{pmatrix}\nabla f({\bm{x}}^{n})\\ {\bm{p}}^{n}\end{pmatrix}
+τ22(f(𝒙n)𝒑n)T𝑵(𝒙n)T2(𝒙~,𝒑~)𝑵(𝒙n)(f(𝒙n)𝒑n)\displaystyle+\frac{\tau^{2}}{2}\begin{pmatrix}\nabla f({\bm{x}}^{n})\\ {\bm{p}}^{n}\end{pmatrix}^{T}{\bm{N}}({\bm{x}}^{n})^{T}\nabla^{2}{\mathcal{I}}(\tilde{{\bm{x}}},\tilde{{\bm{p}}}){\bm{N}}({\bm{x}}^{n})\begin{pmatrix}\nabla f({\bm{x}}^{n})\\ {\bm{p}}^{n}\end{pmatrix}
=\displaystyle= τ(f(𝒙n)𝒑n)T𝑯(𝒙n)(f(𝒙n)𝒑n)\displaystyle-\tau\begin{pmatrix}\nabla f({\bm{x}}^{n})\\ {\bm{p}}^{n}\end{pmatrix}^{T}{\bm{H}}({\bm{x}}^{n})\begin{pmatrix}\nabla f({\bm{x}}^{n})\\ {\bm{p}}^{n}\end{pmatrix}
+τ22(f(𝒙n)𝒑n)T𝑵(𝒙n)T2(𝒙~,𝒑~)𝑵(𝒙n)(f(𝒙n)𝒑n),\displaystyle+\frac{\tau^{2}}{2}\begin{pmatrix}\nabla f({\bm{x}}^{n})\\ {\bm{p}}^{n}\end{pmatrix}^{T}{\bm{N}}({\bm{x}}^{n})^{T}\nabla^{2}{\mathcal{I}}(\tilde{{\bm{x}}},\tilde{{\bm{p}}}){\bm{N}}({\bm{x}}^{n})\begin{pmatrix}\nabla f({\bm{x}}^{n})\\ {\bm{p}}^{n}\end{pmatrix}\,, (3.12)

From Eq. 3.2 and our assumption on 𝑵(𝒙){\bm{N}}({\bm{x}}) and 𝑯(𝒙){\bm{H}}({\bm{x}}), we obtain

(𝒙n+1,𝒑n+1)(𝒙n,𝒑n)\displaystyle{\mathcal{I}}({\bm{x}}^{n+1},{\bm{p}}^{n+1})-{\mathcal{I}}({\bm{x}}^{n},{\bm{p}}^{n}) (τλ+τ2M12)(𝒙n,𝒑n)\displaystyle\leq\big{(}-\tau\lambda+\frac{\tau^{2}M_{1}}{2}\big{)}{\mathcal{I}}({\bm{x}}^{n},{\bm{p}}^{n})
=M12((τλM1)2λ2M12)(𝒙n,𝒑n)\displaystyle=\frac{M_{1}}{2}\big{(}(\tau-\frac{\lambda}{M_{1}})^{2}-\frac{\lambda^{2}}{M_{1}^{2}}\big{)}{\mathcal{I}}({\bm{x}}^{n},{\bm{p}}^{n})
=(a22a)λ22M1(𝒙n,𝒑n),\displaystyle=(a^{2}-2a)\frac{\lambda^{2}}{2M_{1}}{\mathcal{I}}({\bm{x}}^{n},{\bm{p}}^{n})\,, (3.13)

where we used τ=aλM1\tau=a\frac{\lambda}{M_{1}}. Hence,

(𝒙n+1,𝒑n+1)(𝒙n,𝒑n)(1+(a22a)λ22M1)(𝒙0,𝒑0)(1+(a22a)λ22M1)n+1.{\mathcal{I}}({\bm{x}}^{n+1},{\bm{p}}^{n+1})\leq{\mathcal{I}}({\bm{x}}^{n},{\bm{p}}^{n})\big{(}1+(a^{2}-2a)\frac{\lambda^{2}}{2M_{1}}\big{)}\leq{\mathcal{I}}({\bm{x}}^{0},{\bm{p}}^{0})\big{(}1+(a^{2}-2a)\frac{\lambda^{2}}{2M_{1}}\big{)}^{n+1}\,.

When 0<a<20<a<2, we have a22a<0a^{2}-2a<0. Thus we obtain the desired convergence result. ∎

Theorem 3.6.

Let f:df:\mathbb{R}^{d}\to\mathbb{R} be a 𝒞4\mathcal{C}^{4} strongly convex function. Suppose (𝐱0,𝐩0)({\bm{x}}^{0},{\bm{p}}^{0}) satisfies

(𝒙0,𝒑0)1/2δτD0𝑵(𝒙)23,{\mathcal{I}}({\bm{x}}^{0},{\bm{p}}^{0})^{1/2}\leq\frac{\delta}{\tau D_{0}\|{\bm{N}}({\bm{x}})\|_{2}^{3}}\,, (3.14)

for some δ>0\delta>0 and all 𝐱{\bm{x}}. Here

D0=sup𝒙,𝒑,𝒙,𝒑(𝒙𝒑)T(3(𝒙,𝒑)(𝒙𝒑))(𝒙𝒑)(𝒙𝒑)23.D_{0}=\sup_{{\bm{x}},{\bm{p}},{\bm{x}}^{\prime},{\bm{p}}^{\prime}}\frac{\begin{pmatrix}{\bm{x}}^{\prime}\\ {\bm{p}}^{\prime}\end{pmatrix}^{T}\left(\nabla^{3}{\mathcal{I}}({\bm{x}},{\bm{p}})\begin{pmatrix}{\bm{x}}^{\prime}\\ {\bm{p}}^{\prime}\end{pmatrix}\right)\begin{pmatrix}{\bm{x}}^{\prime}\\ {\bm{p}}^{\prime}\end{pmatrix}}{\left\|\begin{pmatrix}{\bm{x}}^{\prime}\\ {\bm{p}}^{\prime}\end{pmatrix}\right\|_{2}^{3}}\,.

Suppose further that there exist positive constants $\lambda,M_{2}\in\mathbb{R}_{+}$ such that

𝑯(𝒙)\displaystyle{\bm{H}}({\bm{x}}) λ𝕀,\displaystyle\succeq\lambda{\mathbb{I}}\,,
𝑵(𝒙)T2(𝒙,𝒑)𝑵(𝒙)\displaystyle{\bm{N}}({\bm{x}})^{T}\nabla^{2}{\mathcal{I}}({\bm{x}},{\bm{p}}){\bm{N}}({\bm{x}}) M2𝕀\displaystyle\preceq M_{2}{\mathbb{I}}

for all 𝐱n{\bm{x}}\in\mathbb{R}^{n}. If τ=aλM2+δ\tau=a\frac{\lambda}{M_{2}+\delta} for some a(0,2)a\in(0,2), then the functional (𝐱n,𝐩n){\mathcal{I}}({\bm{x}}^{n},{\bm{p}}^{n}) decreases geometrically, i.e.

(𝒙n,𝒑n)(𝒙0,𝒑0)(1+a22a2λ2M2+δ)n.{\mathcal{I}}({\bm{x}}^{n},{\bm{p}}^{n})\leq{\mathcal{I}}({\bm{x}}^{0},{\bm{p}}^{0})\big{(}1+\frac{a^{2}-2a}{2}\frac{\lambda^{2}}{M_{2}+\delta}\big{)}^{n}\,.
Remark 3.7.

Note that the constant M2M_{2} in Theorem 3.6 can be better than the constant M1M_{1} in Theorem 3.5 because 𝑵{\bm{N}} and 2\nabla^{2}{\mathcal{I}} are evaluated at the same 𝒙{\bm{x}} in Theorem 3.6.

Proof.

We will prove it by induction. Using the mean-value theorem, we have

(𝒙n+1,𝒑n+1)(𝒙n,𝒑n)\displaystyle{\mathcal{I}}({\bm{x}}^{n+1},{\bm{p}}^{n+1})-{\mathcal{I}}({\bm{x}}^{n},{\bm{p}}^{n})
=\displaystyle= (𝒙(𝒙n,𝒑n)𝒑(𝒙n,𝒑n))T(𝒙n+1𝒙n𝒑n+1𝒑n)+12(𝒙n+1𝒙n𝒑n+1𝒑n)T2(𝒙n,𝒑n)(𝒙n+1𝒙n𝒑n+1𝒑n)\displaystyle\begin{pmatrix}\nabla_{{\bm{x}}}{\mathcal{I}}({\bm{x}}^{n},{\bm{p}}^{n})\\ \nabla_{{\bm{p}}}{\mathcal{I}}({\bm{x}}^{n},{\bm{p}}^{n})\end{pmatrix}^{T}\begin{pmatrix}{\bm{x}}^{n+1}-{\bm{x}}^{n}\\ {\bm{p}}^{n+1}-{\bm{p}}^{n}\end{pmatrix}+\frac{1}{2}\begin{pmatrix}{\bm{x}}^{n+1}-{\bm{x}}^{n}\\ {\bm{p}}^{n+1}-{\bm{p}}^{n}\end{pmatrix}^{T}\nabla^{2}{\mathcal{I}}({\bm{x}}^{n},{\bm{p}}^{n})\begin{pmatrix}{\bm{x}}^{n+1}-{\bm{x}}^{n}\\ {\bm{p}}^{n+1}-{\bm{p}}^{n}\end{pmatrix}
+16(𝒙n+1𝒙n𝒑n+1𝒑n)T(3(𝒙~n,𝒑~n)(𝒙n+1𝒙n𝒑n+1𝒑n))(𝒙n+1𝒙n𝒑n+1𝒑n),\displaystyle\qquad+\frac{1}{6}\begin{pmatrix}{\bm{x}}^{n+1}-{\bm{x}}^{n}\\ {\bm{p}}^{n+1}-{\bm{p}}^{n}\end{pmatrix}^{T}\left(\nabla^{3}{\mathcal{I}}(\tilde{{\bm{x}}}^{n},\tilde{{\bm{p}}}^{n})\begin{pmatrix}{\bm{x}}^{n+1}-{\bm{x}}^{n}\\ {\bm{p}}^{n+1}-{\bm{p}}^{n}\end{pmatrix}\right)\begin{pmatrix}{\bm{x}}^{n+1}-{\bm{x}}^{n}\\ {\bm{p}}^{n+1}-{\bm{p}}^{n}\end{pmatrix}\,, (3.15)

where (𝒙~n,𝒑~n)(\tilde{{\bm{x}}}^{n},\tilde{{\bm{p}}}^{n}) is in between (𝒙n+1,𝒑n+1)({\bm{x}}^{n+1},{\bm{p}}^{n+1}) and (𝒙n,𝒑n)({\bm{x}}^{n},{\bm{p}}^{n}). By Eq. 3.2 and Eq. 3.11, we can bound

(𝒙1,𝒑1)(𝒙0,𝒑0)\displaystyle{\mathcal{I}}({\bm{x}}^{1},{\bm{p}}^{1})-{\mathcal{I}}({\bm{x}}^{0},{\bm{p}}^{0})
=\displaystyle= τ(f(𝒙0)𝒑0)T𝑯(𝒙0)(f(𝒙0)𝒑0)\displaystyle-\tau\begin{pmatrix}\nabla f({\bm{x}}^{0})\\ {\bm{p}}^{0}\end{pmatrix}^{T}{\bm{H}}({\bm{x}}^{0})\begin{pmatrix}\nabla f({\bm{x}}^{0})\\ {\bm{p}}^{0}\end{pmatrix}
+τ22(f(𝒙0)𝒑0)T𝑵(𝒙0)T2(𝒙0,𝒑0)𝑵(𝒙0)(f(𝒙0)𝒑0)\displaystyle+\frac{\tau^{2}}{2}\begin{pmatrix}\nabla f({\bm{x}}^{0})\\ {\bm{p}}^{0}\end{pmatrix}^{T}{\bm{N}}({\bm{x}}^{0})^{T}\nabla^{2}{\mathcal{I}}({\bm{x}}^{0},{\bm{p}}^{0}){\bm{N}}({\bm{x}}^{0})\begin{pmatrix}\nabla f({\bm{x}}^{0})\\ {\bm{p}}^{0}\end{pmatrix}
τ36(f(𝒙0)𝒑0)T𝑵(𝒙0)T(3(𝒙~0,𝒑~0)𝑵(𝒙0)(f(𝒙0)𝒑0))𝑵(𝒙0)(f(𝒙0)𝒑0)\displaystyle-\frac{\tau^{3}}{6}\begin{pmatrix}\nabla f({\bm{x}}^{0})\\ {\bm{p}}^{0}\end{pmatrix}^{T}{\bm{N}}({\bm{x}}^{0})^{T}\left(\nabla^{3}{\mathcal{I}}(\tilde{{\bm{x}}}^{0},\tilde{{\bm{p}}}^{0}){\bm{N}}({\bm{x}}^{0})\begin{pmatrix}\nabla f({\bm{x}}^{0})\\ {\bm{p}}^{0}\end{pmatrix}\right){\bm{N}}({\bm{x}}^{0})\begin{pmatrix}\nabla f({\bm{x}}^{0})\\ {\bm{p}}^{0}\end{pmatrix}
\displaystyle\leq τ(f(𝒙0)𝒑0)T𝑯(𝒙0)(f(𝒙0)𝒑0)\displaystyle-\tau\begin{pmatrix}\nabla f({\bm{x}}^{0})\\ {\bm{p}}^{0}\end{pmatrix}^{T}{\bm{H}}({\bm{x}}^{0})\begin{pmatrix}\nabla f({\bm{x}}^{0})\\ {\bm{p}}^{0}\end{pmatrix}
+τ22(f(𝒙0)𝒑0)T𝑵(𝒙0)T2(𝒙0,𝒑0)𝑵(𝒙0)(f(𝒙0)𝒑0)\displaystyle+\frac{\tau^{2}}{2}\begin{pmatrix}\nabla f({\bm{x}}^{0})\\ {\bm{p}}^{0}\end{pmatrix}^{T}{\bm{N}}({\bm{x}}^{0})^{T}\nabla^{2}{\mathcal{I}}({\bm{x}}^{0},{\bm{p}}^{0}){\bm{N}}({\bm{x}}^{0})\begin{pmatrix}\nabla f({\bm{x}}^{0})\\ {\bm{p}}^{0}\end{pmatrix}
+τ36(f(𝒙0)𝒑0)T(D0𝑵(𝒙0)23(f(𝒙0)𝒑0)2)(f(𝒙0)𝒑0)\displaystyle+\frac{\tau^{3}}{6}\begin{pmatrix}\nabla f({\bm{x}}^{0})\\ {\bm{p}}^{0}\end{pmatrix}^{T}\left(D_{0}\|{\bm{N}}({\bm{x}}^{0})\|_{2}^{3}\left\|\begin{pmatrix}\nabla f({\bm{x}}^{0})\\ {\bm{p}}^{0}\end{pmatrix}\right\|_{2}\right)\begin{pmatrix}\nabla f({\bm{x}}^{0})\\ {\bm{p}}^{0}\end{pmatrix}
=\displaystyle= τ(f(𝒙0)𝒑0)T𝑯(𝒙0)(f(𝒙0)𝒑0)\displaystyle-\tau\begin{pmatrix}\nabla f({\bm{x}}^{0})\\ {\bm{p}}^{0}\end{pmatrix}^{T}{\bm{H}}({\bm{x}}^{0})\begin{pmatrix}\nabla f({\bm{x}}^{0})\\ {\bm{p}}^{0}\end{pmatrix}
+τ22(f(𝒙0)𝒑0)T𝑵(𝒙0)T2(𝒙0,𝒑0)𝑵(𝒙0)(f(𝒙0)𝒑0)\displaystyle+\frac{\tau^{2}}{2}\begin{pmatrix}\nabla f({\bm{x}}^{0})\\ {\bm{p}}^{0}\end{pmatrix}^{T}{\bm{N}}({\bm{x}}^{0})^{T}\nabla^{2}{\mathcal{I}}({\bm{x}}^{0},{\bm{p}}^{0}){\bm{N}}({\bm{x}}^{0})\begin{pmatrix}\nabla f({\bm{x}}^{0})\\ {\bm{p}}^{0}\end{pmatrix}
+τ36(f(𝒙0)𝒑0)TD0𝑵(𝒙0)23(𝒙0,𝒑0)1/2(f(𝒙0)𝒑0)\displaystyle+\frac{\tau^{3}}{6}\begin{pmatrix}\nabla f({\bm{x}}^{0})\\ {\bm{p}}^{0}\end{pmatrix}^{T}D_{0}\|{\bm{N}}({\bm{x}}^{0})\|_{2}^{3}{\mathcal{I}}({\bm{x}}_{0},{\bm{p}}_{0})^{1/2}\begin{pmatrix}\nabla f({\bm{x}}^{0})\\ {\bm{p}}^{0}\end{pmatrix}
\displaystyle\leq τ(f(𝒙0)𝒑0)T𝑯(𝒙0)(f(𝒙0)𝒑0)\displaystyle-\tau\begin{pmatrix}\nabla f({\bm{x}}^{0})\\ {\bm{p}}^{0}\end{pmatrix}^{T}{\bm{H}}({\bm{x}}^{0})\begin{pmatrix}\nabla f({\bm{x}}^{0})\\ {\bm{p}}^{0}\end{pmatrix}
+τ22(f(𝒙0)𝒑0)T𝑵(𝒙0)T2(𝒙0,𝒑0)𝑵(𝒙0)(f(𝒙0)𝒑0)\displaystyle+\frac{\tau^{2}}{2}\begin{pmatrix}\nabla f({\bm{x}}^{0})\\ {\bm{p}}^{0}\end{pmatrix}^{T}{\bm{N}}({\bm{x}}^{0})^{T}\nabla^{2}{\mathcal{I}}({\bm{x}}^{0},{\bm{p}}^{0}){\bm{N}}({\bm{x}}^{0})\begin{pmatrix}\nabla f({\bm{x}}^{0})\\ {\bm{p}}^{0}\end{pmatrix}
+τ2δ6(f(𝒙0)𝒑0)T(f(𝒙0)𝒑0),\displaystyle+\frac{\tau^{2}\delta}{6}\begin{pmatrix}\nabla f({\bm{x}}^{0})\\ {\bm{p}}^{0}\end{pmatrix}^{T}\begin{pmatrix}\nabla f({\bm{x}}^{0})\\ {\bm{p}}^{0}\end{pmatrix}\,, (3.16)

where the last inequality is by our assumption on (𝒙0,𝒑0)({\bm{x}}^{0},{\bm{p}}^{0}). Using our assumptions on the lower bound of 𝑯{\bm{H}} and the upper bound of 𝑵T2𝑵{\bm{N}}^{T}\cdot\nabla^{2}{\mathcal{I}}\cdot{\bm{N}}, we obtain

(𝒙1,𝒑1)(𝒙0,𝒑0)\displaystyle{\mathcal{I}}({\bm{x}}^{1},{\bm{p}}^{1})-{\mathcal{I}}({\bm{x}}^{0},{\bm{p}}^{0}) (τλ+τ2δ6+τ2M22)(𝒙0,𝒑0)\displaystyle\leq\big{(}-\tau\lambda+\frac{\tau^{2}\delta}{6}+\frac{\tau^{2}M_{2}}{2}\big{)}{\mathcal{I}}({\bm{x}}^{0},{\bm{p}}^{0})
(τλ+τ2(δ+M2)2)(𝒙0,𝒑0)\displaystyle\leq\big{(}-\tau\lambda+\frac{\tau^{2}(\delta+M_{2})}{2}\big{)}{\mathcal{I}}({\bm{x}}^{0},{\bm{p}}^{0})
=12(a22a)λ2M2+δ(𝒙0,𝒑0),\displaystyle=\frac{1}{2}(a^{2}-2a)\frac{\lambda^{2}}{M_{2}+\delta}{\mathcal{I}}({\bm{x}}^{0},{\bm{p}}^{0})\,, (3.17)

where we used τ=aλM2+δ\tau=a\frac{\lambda}{M_{2}+\delta} for some a(0,2)a\in(0,2). Hence,

(𝒙1,𝒑1)(𝒙0,𝒑0)(1+a22a2λ2M2+δ).{\mathcal{I}}({\bm{x}}^{1},{\bm{p}}^{1})\leq{\mathcal{I}}({\bm{x}}^{0},{\bm{p}}^{0})\big{(}1+\frac{a^{2}-2a}{2}\frac{\lambda^{2}}{M_{2}+\delta}\big{)}\,.

This proves the base case. Now suppose it holds that

(𝒙n,𝒑n)(𝒙0,𝒑0)(1+a22a2λ2M2+δ)n,{\mathcal{I}}({\bm{x}}^{n},{\bm{p}}^{n})\leq{\mathcal{I}}({\bm{x}}^{0},{\bm{p}}^{0})\big{(}1+\frac{a^{2}-2a}{2}\frac{\lambda^{2}}{M_{2}+\delta}\big{)}^{n}\,,

for some n1n\geq 1. In particular, this implies that

(𝒙n,𝒑n)<(𝒙0,𝒑0),{\mathcal{I}}({\bm{x}}^{n},{\bm{p}}^{n})<{\mathcal{I}}({\bm{x}}^{0},{\bm{p}}^{0})\,,

which yields

τD0𝑵(𝒙)23(𝒙n,𝒑n)1/2<τD0𝑵(𝒙)23(𝒙0,𝒑0)1/2δ.\tau D_{0}\|{\bm{N}}({\bm{x}})\|_{2}^{3}{\mathcal{I}}({\bm{x}}^{n},{\bm{p}}^{n})^{1/2}<\tau D_{0}\|{\bm{N}}({\bm{x}})\|_{2}^{3}{\mathcal{I}}({\bm{x}}^{0},{\bm{p}}^{0})^{1/2}\leq\delta\,.

Then, repeating the derivation of Eq. 3.2 and Eq. 3.2 yields

(𝒙n+1,𝒑n+1)(𝒙n,𝒑n)(1+a22a2λ2M2+δ).{\mathcal{I}}({\bm{x}}^{n+1},{\bm{p}}^{n+1})\leq{\mathcal{I}}({\bm{x}}^{n},{\bm{p}}^{n})\big{(}1+\frac{a^{2}-2a}{2}\frac{\lambda^{2}}{M_{2}+\delta}\big{)}\,.

Combining with our induction hypothesis, we conclude that

(𝒙n+1,𝒑n+1)(𝒙0,𝒑0)(1+a22a2λ2M2+δ)n+1.{\mathcal{I}}({\bm{x}}^{n+1},{\bm{p}}^{n+1})\leq{\mathcal{I}}({\bm{x}}^{0},{\bm{p}}^{0})\big{(}1+\frac{a^{2}-2a}{2}\frac{\lambda^{2}}{M_{2}+\delta}\big{)}^{n+1}\,.

The proof is complete by induction. ∎

Corollary 3.8.

Suppose Assumption 3.1 and Assumption 3.2 hold. When σ=τ\sigma=\tau, γ=1σμμ,ε=1,A=μ+L2+(μ+L)εγ\gamma=\frac{1-\sigma\mu}{\mu},\varepsilon=1,A=\frac{\mu+L}{2+(\mu+L)\varepsilon\gamma}, we have

𝑯(𝒙)μ4𝕀.{\bm{H}}({\bm{x}})\succeq\frac{\mu}{4}{\mathbb{I}}.
Proof.

By definition of 𝑯{\bm{H}}, we can compute

(1+σεA)𝑯(𝒙)=(𝑪0(𝒙)(σ𝑨+γ𝑨)12𝑪0(𝒙)(1εγA)12η𝑨12𝑪0(𝒙)(1εγA)12η𝑨ηε𝑨),(1+\sigma\varepsilon A)\cdot{\bm{H}}({\bm{x}})=\begin{pmatrix}{\bm{C}}_{0}({\bm{x}})(\sigma{\bm{A}}+\gamma{\bm{A}})&\frac{1}{2}{\bm{C}}_{0}({\bm{x}})(1-\varepsilon\gamma A)-\frac{1}{2}\eta{\bm{A}}\\ \frac{1}{2}{\bm{C}}_{0}({\bm{x}})(1-\varepsilon\gamma A)-\frac{1}{2}\eta{\bm{A}}&\eta\varepsilon{\bm{A}}\end{pmatrix}\,,

where η=σ/τ=1\eta=\sigma/\tau=1, 𝑪0(𝒙)=2f(𝒙)𝑩(𝒙)2f(𝒙){\bm{C}}_{0}({\bm{x}})=\nabla^{2}f({\bm{x}}){\bm{B}}({\bm{x}})\nabla^{2}f({\bm{x}}). We want to find some constant λ>0\lambda>0, such that

(𝒛𝒘)T𝑯(𝒙)(𝒛𝒘)λ(𝒛2+𝒘2).\begin{pmatrix}{\bm{z}}\\ {\bm{w}}\end{pmatrix}^{T}{\bm{H}}({\bm{x}})\begin{pmatrix}{\bm{z}}\\ {\bm{w}}\end{pmatrix}\geq\lambda(\|{\bm{z}}\|^{2}+\|{\bm{w}}\|^{2})\,.

Observe that

(𝒛𝒘)T𝑯(𝒙)(𝒛𝒘)λ(𝒛2+𝒘2)\displaystyle\begin{pmatrix}{\bm{z}}\\ {\bm{w}}\end{pmatrix}^{T}{\bm{H}}({\bm{x}})\begin{pmatrix}{\bm{z}}\\ {\bm{w}}\end{pmatrix}-\lambda(\|{\bm{z}}\|^{2}+\|{\bm{w}}\|^{2})
=𝒛T(𝑪0(γ+σ)A/(1+σεA)λ𝕀)𝒛+𝒘T(εA/(1+σεA)λ𝕀)𝒘\displaystyle={\bm{z}}^{T}\big{(}{\bm{C}}_{0}(\gamma+\sigma)A/(1+\sigma\varepsilon A)-\lambda{\mathbb{I}}\big{)}{\bm{z}}+{\bm{w}}^{T}\big{(}\varepsilon A/(1+\sigma\varepsilon A)-\lambda{\mathbb{I}}\big{)}{\bm{w}}
+𝒛T(A+𝑪0(𝕀εγA))𝒘/(1+σεA),\displaystyle\qquad+{\bm{z}}^{T}\big{(}-A+{\bm{C}}_{0}({\mathbb{I}}-\varepsilon\gamma A)\big{)}{\bm{w}}/(1+\sigma\varepsilon A)\,, (3.18)

which is almost the same as Eq. 3.1. Thus, following a similar procedure in Theorem 3.3 with the provided parameters, we obtain that

λμ21+Aσ21+σAμ4.\lambda\geq\frac{\mu}{2}\frac{1+\frac{A\sigma}{2}}{1+\sigma A}\geq\frac{\mu}{4}\,.

This implies

𝑯(𝒙)μ4𝕀.\displaystyle{\bm{H}}({\bm{x}})\succeq\frac{\mu}{4}{\mathbb{I}}\,. (3.19)

Corollary 3.9.

Let f:df:\mathbb{R}^{d}\to\mathbb{R} be a 𝒞4\mathcal{C}^{4} strongly convex function. Suppose Assumption 3.1 and Assumption 3.2 hold. If σ=τ\sigma=\tau, γ=1σμμ,ε=1,A=μ+L2+(μ+L)εγ\gamma=\frac{1-\sigma\mu}{\mu},\varepsilon=1,A=\frac{\mu+L}{2+(\mu+L)\varepsilon\gamma}, we have

  1. (1)
    𝑵(𝒙)2max{L,1}(A(σ+2γ+2)+1)(1+σA).\|{\bm{N}}({\bm{x}})\|_{2}\leq\frac{\max\{L,1\}\cdot\big{(}A(\sigma+2\gamma+2)+1\big{)}}{(1+\sigma A)}\,.
  2. (2)
    𝑵(𝒙)T2(𝒙,𝒑)𝑵(𝒙)(3+σA+2A)2(1+σA)2max{L,1}𝕀.{\bm{N}}({\bm{x}})^{T}\nabla^{2}{\mathcal{I}}({\bm{x}},{\bm{p}}){\bm{N}}({\bm{x}})\preceq\frac{(3+\sigma A+2A)^{2}}{(1+\sigma A)^{2}}\cdot\max\{L^{\prime},1\}\cdot{\mathbb{I}}\,.
Proof.

We can decompose

(1+σA)𝑵(𝒙)=(𝑩(𝒙)2f(𝒙)00𝕀)((σ+γ)𝑨(𝕀γ𝑨)𝑨𝑨).(1+\sigma A)\cdot{\bm{N}}({\bm{x}})=\begin{pmatrix}{\bm{B}}({\bm{x}})\nabla^{2}f({\bm{x}})&0\\ 0&{\mathbb{I}}\end{pmatrix}\begin{pmatrix}(\sigma+\gamma){\bm{A}}&({\mathbb{I}}-\gamma{\bm{A}})\\ -{\bm{A}}&{\bm{A}}\end{pmatrix}\,.

Observe that

((σ+γ)𝑨(𝕀γ𝑨)𝑨𝑨)=((σ+γ)𝕀γ𝕀𝕀𝕀)(𝑨00𝑨)+(0𝕀00).\begin{pmatrix}(\sigma+\gamma){\bm{A}}&({\mathbb{I}}-\gamma{\bm{A}})\\ -{\bm{A}}&{\bm{A}}\end{pmatrix}=\begin{pmatrix}(\sigma+\gamma){\mathbb{I}}&-\gamma{\mathbb{I}}\\ -{\mathbb{I}}&{\mathbb{I}}\end{pmatrix}\cdot\begin{pmatrix}{\bm{A}}&0\\ 0&{\bm{A}}\end{pmatrix}+\begin{pmatrix}0&{\mathbb{I}}\\ 0&0\end{pmatrix}\,.

Therefore,

((σ+γ)𝑨(𝕀γ𝑨)𝑨𝑨)2\displaystyle\bigg{\|}\begin{pmatrix}(\sigma+\gamma){\bm{A}}&({\mathbb{I}}-\gamma{\bm{A}})\\ -{\bm{A}}&{\bm{A}}\end{pmatrix}\bigg{\|}_{2} A((σ+γ)𝕀γ𝕀𝕀𝕀)2+(0𝕀00)2\displaystyle\leq A\bigg{\|}\begin{pmatrix}(\sigma+\gamma){\mathbb{I}}&-\gamma{\mathbb{I}}\\ -{\mathbb{I}}&{\mathbb{I}}\end{pmatrix}\bigg{\|}_{2}+\bigg{\|}\begin{pmatrix}0&{\mathbb{I}}\\ 0&0\end{pmatrix}\bigg{\|}_{2}
A((σ+γ)𝕀γ𝕀𝕀𝕀)2+1\displaystyle\leq A\bigg{\|}\begin{pmatrix}(\sigma+\gamma){\mathbb{I}}&-\gamma{\mathbb{I}}\\ -{\mathbb{I}}&{\mathbb{I}}\end{pmatrix}\bigg{\|}_{2}+1
A(σ+2γ+2)+1.\displaystyle\leq A(\sigma+2\gamma+2)+1\,. (3.20)

To get the last inequality, we consider (𝒛,𝒘)({\bm{z}},{\bm{w}}) such that 𝒛2+𝒘2=1\|{\bm{z}}\|^{2}+\|{\bm{w}}\|^{2}=1. Thus

((σ+γ)𝕀γ𝕀𝕀𝕀)(𝒛𝒘)2\displaystyle\bigg{\|}\begin{pmatrix}(\sigma+\gamma){\mathbb{I}}&-\gamma{\mathbb{I}}\\ -{\mathbb{I}}&{\mathbb{I}}\end{pmatrix}\begin{pmatrix}{\bm{z}}\\ {\bm{w}}\end{pmatrix}\bigg{\|}_{2} =((σ+γ)𝒛γ𝒘𝒛+𝒘)\displaystyle=\bigg{\|}\begin{pmatrix}(\sigma+\gamma){\bm{z}}-\gamma{\bm{w}}\\ -{\bm{z}}+{\bm{w}}\end{pmatrix}\bigg{\|}
σ(𝒛𝒘)+γ𝒛𝒘+𝒛𝒘\displaystyle\leq\sigma\bigg{\|}\begin{pmatrix}{\bm{z}}\\ {\bm{w}}\end{pmatrix}\bigg{\|}+\gamma\|{\bm{z}}-{\bm{w}}\|+\|{\bm{z}}-{\bm{w}}\|
σ+2γ+2.\displaystyle\leq\sigma+2\gamma+2\,.

We now have

(1+σA)𝑵(𝒙)2\displaystyle(1+\sigma A)\|{\bm{N}}({\bm{x}})\|_{2} (𝑩(𝒙)2f(𝒙)00𝕀)2((σ+γ)𝑨(𝕀γ𝑨)𝑨𝑨)2\displaystyle\leq\bigg{\|}\begin{pmatrix}{\bm{B}}({\bm{x}})\nabla^{2}f({\bm{x}})&0\\ 0&{\mathbb{I}}\end{pmatrix}\bigg{\|}_{2}\bigg{\|}\begin{pmatrix}(\sigma+\gamma){\bm{A}}&({\mathbb{I}}-\gamma{\bm{A}})\\ -{\bm{A}}&{\bm{A}}\end{pmatrix}\bigg{\|}_{2}
max{L,1}(A(σ+2γ+2)+1).\displaystyle\leq\max\{L,1\}\cdot\big{(}A(\sigma+2\gamma+2)+1\big{)}\,.

This proves part (1) of our Corollary. It follows that

𝑵(𝒙)T2(𝒙,𝒑)𝑵(𝒙)\displaystyle{\bm{N}}({\bm{x}})^{T}\nabla^{2}{\mathcal{I}}({\bm{x}},{\bm{p}}){\bm{N}}({\bm{x}})
=1(1+σA)2((σ+γ)𝑨𝑨(𝕀γ𝑨)𝑨)(𝑪(𝒙)T(3f(𝒙)f(𝒙)+(2f(𝒙))2)𝑪(𝒙)00𝕀)\displaystyle=\frac{1}{(1+\sigma A)^{2}}\begin{pmatrix}(\sigma+\gamma){\bm{A}}&-{\bm{A}}\\ ({\mathbb{I}}-\gamma{\bm{A}})&{\bm{A}}\end{pmatrix}\cdot\begin{pmatrix}{\bm{C}}({\bm{x}})^{T}\big{(}\nabla^{3}f({\bm{x}})\nabla f({\bm{x}})+(\nabla^{2}f({\bm{x}}))^{2}\big{)}{\bm{C}}({\bm{x}})&0\\ 0&{\mathbb{I}}\end{pmatrix}
((σ+γ)𝑨(𝕀γ𝑨)𝑨𝑨),\displaystyle\hskip 56.9055pt\cdot\begin{pmatrix}(\sigma+\gamma){\bm{A}}&({\mathbb{I}}-\gamma{\bm{A}})\\ -{\bm{A}}&{\bm{A}}\end{pmatrix}\,, (3.21)

where we recall 𝑪(𝒙)=𝑩(𝒙)2f(𝒙){\bm{C}}({\bm{x}})={\bm{B}}({\bm{x}})\nabla^{2}f({\bm{x}}).

By assumption, there exists L>0L^{\prime}>0, such that

𝑪(𝒙)T(3f(𝒙)f(𝒙)+(2f(𝒙))2)𝑪(𝒙)L𝕀,{\bm{C}}({\bm{x}})^{T}\big{(}\nabla^{3}f({\bm{x}})\nabla f({\bm{x}})+(\nabla^{2}f({\bm{x}}))^{2}\big{)}{\bm{C}}({\bm{x}})\preceq L^{\prime}{\mathbb{I}}\,,

for all 𝒙{\bm{x}}. Then, combining Eq. 3.2 and Eq. 3.2, we obtain that

𝑵(𝒙)T2(𝒙,𝒑)𝑵(𝒙)2\displaystyle\|{\bm{N}}({\bm{x}})^{T}\nabla^{2}{\mathcal{I}}({\bm{x}},{\bm{p}}){\bm{N}}({\bm{x}})\|_{2} (A(σ+2γ+2)+1)2(1+σA)2max{L,1}\displaystyle\leq\frac{\big{(}A(\sigma+2\gamma+2)+1\big{)}^{2}}{(1+\sigma A)^{2}}\cdot\max\{L^{\prime},1\}
(3+σA+2A)2(1+σA)2max{L,1},\displaystyle\leq\frac{(3+\sigma A+2A)^{2}}{(1+\sigma A)^{2}}\cdot\max\{L^{\prime},1\}\,, (3.22)

where we have used γA<1\gamma A<1 to derive the last inequality. ∎

Theorem 3.10 (Restatement of Theorem 1.1).

Suppose Assumption 3.1 and Assumption 3.2 hold. Let \sigma=\tau, \gamma=\frac{1-\sigma\mu}{\mu}, \varepsilon=1, A=\frac{\mu+L}{2+(\mu+L)\varepsilon\gamma}. Suppose further that Eq. 3.14 holds for some \delta>0 and all {\bm{x}}. If \tau=\frac{1}{4}\frac{\mu}{\delta+36\max\{L^{\prime},1\}}, then

(𝒙n,𝒑n)(𝒙0,𝒑0)(1μ2/32δ+36max{L,1})n.{\mathcal{I}}({\bm{x}}^{n},{\bm{p}}^{n})\leq{\mathcal{I}}({\bm{x}}^{0},{\bm{p}}^{0})\big{(}1-\frac{\mu^{2}/32}{\delta+36\max\{L^{\prime},1\}}\big{)}^{n}\,.
Proof.

By Assumption 3.1 and Assumption 3.2, we have μLL\mu\leq L\leq L^{\prime}. Thus μ/L1\mu/L^{\prime}\leq 1 and σ=τ<1/36\sigma=\tau<1/36. Moreover,

γ=1μσ1136=3536.\gamma=\frac{1}{\mu}-\sigma\geq 1-\frac{1}{36}=\frac{35}{36}\,.

And

A=μ+L2+(μ+L)εγ<1γ3635.A=\frac{\mu+L}{2+(\mu+L)\varepsilon\gamma}<\frac{1}{\gamma}\leq\frac{36}{35}\,.

Then it follows

3+σA+2A1+σA3+σA+2A<6.\frac{3+\sigma A+2A}{1+\sigma A}\leq 3+\sigma A+2A<6\,.

By Corollary 3.9, we have

𝑵(𝒙)T2(𝒙,𝒑)𝑵(𝒙)236max{L,1}.\|{\bm{N}}({\bm{x}})^{T}\nabla^{2}{\mathcal{I}}({\bm{x}},{\bm{p}}){\bm{N}}({\bm{x}})\|_{2}\leq 36\max\{L^{\prime},1\}\,.

Combining this with Theorem 3.6 and Corollary 3.8, we finish the proof. ∎

Remark 3.11.

The choice of parameters in Theorem 3.10 may not be optimal. The main purpose of Theorem 3.10 is to establish that Algorithm 1 converges geometrically for a concrete choice of parameters.

4 Numerical experiments

We test our PDD algorithm using several convex and non-convex functions and compare the results with other commonly used optimizers, such as gradient descent, Nesterov’s accelerated gradient (NAG), IGAHD (inertial gradient algorithm with Hessian damping) [Attouch et al., 2020], and IGAHD-SC (inertial gradient algorithm with Hessian damping for strongly convex functions) [Attouch et al., 2020].

4.1 Summary of algorithms

For reference, we write down the iterations of gradient descent, NAG, IGAHD, and IGAHD-SC for comparison.

Gradient descent:

𝒙n+1=𝒙nτgdf(𝒙n),{\bm{x}}^{n+1}={\bm{x}}^{n}-\tau_{\textrm{gd}}\nabla f({\bm{x}}^{n})\,,

where τgd>0\tau_{\textrm{gd}}>0 is a stepsize.

NAG:

𝒚n+1\displaystyle{\bm{y}}^{n+1} =𝒙nτnagf(𝒙n),\displaystyle={\bm{x}}^{n}-\tau_{\textrm{nag}}\nabla f({\bm{x}}^{n})\,,
𝒙n+1\displaystyle{\bm{x}}^{n+1} =𝒚n+1+βnag(𝒚n𝒚n1),\displaystyle={\bm{y}}^{n+1}+\beta_{\textrm{nag}}({\bm{y}}^{n}-{\bm{y}}^{n-1})\,,

where τnag>0\tau_{\textrm{nag}}>0 is a stepsize, and βnag>0\beta_{\textrm{nag}}>0 is a parameter.
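For concreteness, the following is a minimal NumPy sketch of the two updates above; the function names and the bookkeeping of the last two NAG iterates are ours, and grad_f, tau, and beta are supplied by the user.

import numpy as np

def gd_step(x, grad_f, tau):
    # Gradient descent: x^{n+1} = x^n - tau * grad f(x^n)
    return x - tau * grad_f(x)

def nag_step(x_n, y_n, y_nm1, grad_f, tau, beta):
    # NAG as displayed above: y^{n+1} = x^n - tau * grad f(x^n),
    # x^{n+1} = y^{n+1} + beta * (y^n - y^{n-1});
    # the caller carries the last two y iterates between calls.
    y_np1 = x_n - tau * grad_f(x_n)
    x_np1 = y_np1 + beta * (y_n - y_nm1)
    return x_np1, y_np1, y_n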

IGAHD: Suppose f\nabla f is L1L_{1}-Lipschitz.

𝒚n\displaystyle{\bm{y}}^{n} =𝒙n+αn(𝒙n𝒙n1)β(1)τatt(f(𝒙n)f(𝒙n1))β(1)τattnf(𝒙n1),\displaystyle={\bm{x}}^{n}+\alpha_{n}({\bm{x}}^{n}-{\bm{x}}^{n-1})-\beta^{(1)}\sqrt{\tau_{\textrm{att}}}(\nabla f({\bm{x}}^{n})-\nabla f({\bm{x}}^{n-1}))-\frac{\beta^{(1)}\sqrt{\tau_{\textrm{att}}}}{n}\nabla f({\bm{x}}^{n-1})\,,
𝒙n+1\displaystyle{\bm{x}}^{n+1} =𝒚nτattf(𝒚n).\displaystyle={\bm{y}}^{n}-\tau_{\textrm{att}}\nabla f({\bm{y}}^{n})\,.

Here αn=1αn\alpha_{n}=1-\frac{\alpha}{n} for some α3\alpha\geq 3. β(1)\beta^{(1)} needs to satisfy

0β(1)2τatt.0\leq\beta^{(1)}\leq 2\sqrt{\tau_{\textrm{att}}}\,.

And τatt>0\tau_{\textrm{att}}>0 is a stepsize, which needs to satisfy

τatt1L1.\tau_{\textrm{att}}\leq\frac{1}{L_{1}}\,.
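A minimal sketch of one IGAHD iteration as displayed above; the function name igahd_step and the caching of the previous iterate and its gradient are ours.

import numpy as np

def igahd_step(x_n, x_nm1, g_nm1, grad_f, tau, n, alpha=3.0, beta1=0.0):
    # Requires n >= 1, alpha >= 3, and 0 <= beta1 <= 2*sqrt(tau).
    g_n = grad_f(x_n)                       # first gradient evaluation, at x^n
    alpha_n = 1.0 - alpha / n
    y_n = (x_n + alpha_n * (x_n - x_nm1)
           - beta1 * np.sqrt(tau) * (g_n - g_nm1)
           - (beta1 * np.sqrt(tau) / n) * g_nm1)
    x_np1 = y_n - tau * grad_f(y_n)         # second gradient evaluation, at y^n
    return x_np1, x_n, g_n                  # pass (x^{n+1}, x^n, grad f(x^n)) to the next call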
Remark 4.1.

As mentioned earlier, each iteration of IGAHD evaluates \nabla f(\cdot) twice: at {\bm{x}}^{n} and at {\bm{y}}^{n}. In contrast, gradient descent, NAG, and our method Eq. 1.2 require only one gradient evaluation per iteration. Chen and Luo [2021] proposed an algorithm slightly different from IGAHD that requires only one gradient evaluation per iteration.

IGAHD-SC: Suppose f is m_{1}-strongly convex and \nabla f is L_{1}-Lipschitz.

𝒙n+1=𝒙n+1m1τatt1+m1τatt(𝒙n𝒙n1)β(2)τatt1+m1τatt(f(𝒙n)f(𝒙n1))τatt1+m1τattf(𝒙n).{\bm{x}}^{n+1}={\bm{x}}^{n}+\frac{1-\sqrt{m_{1}\tau_{\textrm{att}}}}{1+\sqrt{m_{1}\tau_{\textrm{att}}}}({\bm{x}}^{n}-{\bm{x}}^{n-1})-\frac{\beta^{(2)}\sqrt{\tau_{\textrm{att}}}}{1+\sqrt{m_{1}\tau_{\textrm{att}}}}(\nabla f({\bm{x}}^{n})-\nabla f({\bm{x}}^{n-1}))-\frac{\tau_{\textrm{att}}}{1+\sqrt{m_{1}\tau_{\textrm{att}}}}\nabla f({\bm{x}}^{n})\,.

Here β(2)\beta^{(2)} and L1L_{1} need to satisfy

β(2)1m1,L1min{m18β(2),m12τatt+m1τatt2β(2)m1+1τatt+m12}.\displaystyle\beta^{(2)}\leq\frac{1}{\sqrt{m_{1}}}\,,\quad L_{1}\leq\min\Big{\{}\frac{\sqrt{m_{1}}}{8\beta^{(2)}},\frac{\frac{\sqrt{m_{1}}}{2\tau_{\textrm{att}}}+\frac{m_{1}}{\sqrt{\tau_{\textrm{att}}}}}{2\beta^{(2)}m_{1}+\frac{1}{\sqrt{\tau_{\textrm{att}}}}+\frac{\sqrt{m_{1}}}{2}}\Big{\}}\,. (4.1)
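Similarly, a sketch of one IGAHD-SC iteration; the function name is ours, and checking the conditions in Eq. 4.1 on \beta^{(2)} and L_{1} is left to the caller.

import numpy as np

def igahd_sc_step(x_n, x_nm1, g_nm1, grad_f, tau, m1, beta2):
    # One IGAHD-SC update as displayed above, with s = sqrt(m1 * tau).
    s = np.sqrt(m1 * tau)
    g_n = grad_f(x_n)
    x_np1 = (x_n + (1.0 - s) / (1.0 + s) * (x_n - x_nm1)
             - beta2 * np.sqrt(tau) / (1.0 + s) * (g_n - g_nm1)
             - tau / (1.0 + s) * g_n)
    return x_np1, x_n, g_n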

4.2 Regularized log-sum-exp

Consider the regularized log-sum-exp function

f(𝒙)=log(i=1nexp(𝒒iT𝒙))+12𝒙T𝑸𝒙,f({\bm{x}})=\log\left(\sum_{i=1}^{n}\exp({\bm{q}}_{i}^{T}{\bm{x}})\right)+\frac{1}{2}{\bm{x}}^{T}{\bm{Q}}{\bm{x}}\,,

where n=100, {\bm{Q}}={\bm{Q}}^{T}\succ 0, and {\bm{q}}_{i}^{T} is the i-th row of {\bm{Q}}. {\bm{Q}} is chosen to be diagonally dominant, i.e., Q_{i,i}>\sum_{j\neq i}|Q_{i,j}|. In this case, we may choose the diagonal preconditioner {\bm{C}}({\bm{x}})=\big{(}\textrm{diag}({\bm{Q}})\big{)}^{-1}. We compare the performance of gradient descent, preconditioned gradient descent, PDD with {\bm{C}}({\bm{x}})={\mathbb{I}}, PDD with the diagonal preconditioner, NAG, and IGAHD-SC [Attouch et al., 2020] for minimizing f. The stepsize of gradient descent is \tau_{\textrm{gd}}=\frac{2}{3\lambda_{1}+\lambda_{n}}, where \lambda_{1} and \lambda_{n} are the maximum and minimum eigenvalues of {\bm{Q}}, respectively. For the purely quadratic objective \frac{1}{2}{\bm{x}}^{T}{\bm{Q}}{\bm{x}}, the optimal stepsize of gradient descent is \frac{2}{\lambda_{1}+\lambda_{n}}; since our objective also contains the log-sum-exp term, we take a slightly smaller stepsize, without which gradient descent does not converge. Similarly, for NAG we choose \tau_{\textrm{nag}}=\frac{4}{30\lambda_{1}+\lambda_{n}} and \beta_{\textrm{nag}}=\frac{\sqrt{3\kappa^{\prime}+1}-2}{\sqrt{3\kappa^{\prime}+1}+2} with \kappa^{\prime}=10\lambda_{1}/\lambda_{n}, which is slightly more conservative than the optimal choice of NAG parameters for a purely quadratic objective, so as to guarantee convergence. For PDD with {\bm{C}}({\bm{x}})={\mathbb{I}}, we choose \tau_{\textrm{pdd}}=\sigma_{\textrm{pdd}}=\frac{2}{\lambda_{1}+\lambda_{n}}, \varepsilon=1, A=10, \omega=1. For PDD with the diagonal preconditioner {\bm{C}}({\bm{x}})=\big{(}\textrm{diag}({\bm{Q}})\big{)}^{-1}, we choose \tau_{\textrm{pdd}}=\sigma_{\textrm{pdd}}=0.5, \varepsilon=1, A=1, \omega=1. We use the same {\bm{C}}({\bm{x}})=\big{(}\textrm{diag}({\bm{Q}})\big{)}^{-1} as a preconditioner for gradient descent, with the same stepsize \tau_{\textrm{pdd}}=0.5. For IGAHD-SC (‘att’), we need m_{1}, a lower bound on the smallest eigenvalue of \nabla^{2}f({\bm{x}}); in this example we estimate m_{1} by the smallest eigenvalue of {\bm{Q}}. We set \tau_{\textrm{att}}=0.0016 via grid search. \beta^{(2)} in IGAHD-SC is found by solving (see Theorem 11, Eq. (26) of Attouch et al. [2020])

\frac{\sqrt{m_{1}}}{8\beta^{(2)}}=\frac{\frac{\sqrt{m_{1}}}{2\tau_{\textrm{att}}}+\frac{m_{1}}{\sqrt{\tau_{\textrm{att}}}}}{2\beta^{(2)}m_{1}+\frac{1}{\sqrt{\tau_{\textrm{att}}}}+\frac{\sqrt{m_{1}}}{2}}\,,

which gives

β(2)=τatt+τattm1/24+8m1τatt2m1τatt.\beta^{(2)}=\frac{\sqrt{\tau_{\textrm{att}}}+\tau_{\textrm{att}}\sqrt{m_{1}}/2}{4+8\sqrt{m_{1}}\sqrt{\tau_{\textrm{att}}}-2m_{1}\tau_{\textrm{att}}}\,. (4.2)

The initial condition is 𝒙0=np.ones(n)0.1{\bm{x}}^{0}=\textrm{np.ones}(n)*0.1. The result is presented in Fig. 1(a).
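The following is a NumPy sketch of this test problem under the choices above; the random construction of a diagonally dominant {\bm{Q}} is a hypothetical example, and the function names are ours.

import numpy as np

n = 100
rng = np.random.default_rng(0)
Q = rng.standard_normal((n, n))
Q = 0.5 * (Q + Q.T)
np.fill_diagonal(Q, np.abs(Q).sum(axis=1) + 1.0)   # symmetric, strictly diagonally dominant, hence positive definite

def f(x):
    z = Q @ x                                      # z_i = q_i^T x with q_i^T the i-th row of Q
    return z.max() + np.log(np.sum(np.exp(z - z.max()))) + 0.5 * x @ Q @ x

def grad_f(x):
    z = Q @ x
    w = np.exp(z - z.max())
    w /= w.sum()                                   # softmax weights
    return Q.T @ w + Q @ x

C_diag = 1.0 / np.diag(Q)                          # diagonal preconditioner C(x) = diag(Q)^{-1}

def beta2_from_eq_4_2(m1, tau_att):                # beta^(2) from Eq. 4.2
    return (np.sqrt(tau_att) + tau_att * np.sqrt(m1) / 2) / (4 + 8 * np.sqrt(m1 * tau_att) - 2 * m1 * tau_att)

x0 = np.ones(n) * 0.1                              # initial condition used above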

4.3 Quadratic minus cosine function

Consider the function

f(𝒙)=𝒙2cos(𝒄T𝒙),f({\bm{x}})=\|{\bm{x}}\|^{2}-\cos({\bm{c}}^{T}{\bm{x}})\,,

where 𝒄{\bm{c}} is a vector in 100\mathbb{R}^{100} with 𝒄2=1.9\|{\bm{c}}\|^{2}=1.9. Then a direct calculation shows that 0.1𝕀2f(𝒙)3.9𝕀0.1{\mathbb{I}}\preceq\nabla^{2}f({\bm{x}})\preceq 3.9{\mathbb{I}} for any 𝒙{\bm{x}}. This allows us to choose the optimal stepsize for gradient descent and NAG. When minimizing ff using gradient descent, we can choose τgd=20.1+3.9=0.5\tau_{\textrm{gd}}=\frac{2}{0.1+3.9}=0.5. Meanwhile, for NAG, we may choose τnag=433.9+0.1\tau_{\textrm{nag}}=\frac{4}{3*3.9+0.1}, and β=3κ+123κ+1+2\beta=\frac{\sqrt{3\kappa+1}-2}{\sqrt{3\kappa+1}+2}, where κ=3.9/0.1\kappa=3.9/0.1. For PDD with 𝑪(𝒙)=𝕀{\bm{C}}({\bm{x}})={\mathbb{I}}, we choose τpdd=σpdd=0.5\tau_{\textrm{pdd}}=\sigma_{\textrm{pdd}}=0.5, ε=1\varepsilon=1, A=1A=1, ω=1\omega=1. For IGAHD-SC (‘att’), we choose m1=0.1m_{1}=0.1, τatt=0.55\tau_{\textrm{att}}=0.55 via grid search and β(2)\beta^{(2)} is given by Eq. 4.2. The initial condition is 𝒙0=np.ones(n)5{\bm{x}}^{0}=\textrm{np.ones}(n)*5. The result is presented in Fig. 1(b).
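A sketch of this objective and its gradient; the Hessian bounds follow from \nabla^{2}f({\bm{x}})=2{\mathbb{I}}+\cos({\bm{c}}^{T}{\bm{x}}){\bm{c}}{\bm{c}}^{T} and \|{\bm{c}}\|^{2}=1.9. The particular choice of {\bm{c}} below is a hypothetical example.

import numpy as np

d = 100
c = np.ones(d)
c *= np.sqrt(1.9) / np.linalg.norm(c)   # any vector with ||c||^2 = 1.9

def f(x):
    return x @ x - np.cos(c @ x)

def grad_f(x):
    # grad f(x) = 2x + sin(c^T x) c; the Hessian is 2I + cos(c^T x) c c^T
    return 2.0 * x + np.sin(c @ x) * c

x0 = np.ones(d) * 5.0                   # initial condition used above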

(a) Regularized log-sum-exp
(b) Quadratic minus cosine
Figure 1: Comparison of gradient descent, NAG, PDD, and IGAHD-SC (we use ‘att’ as a shorthand for this method) on minimizing (a) the regularized log-sum-exp function and (b) the quadratic minus cosine function. The y-axis represents the 2-norm of the gradient of the objective function on a logarithmic scale. The x-axis represents the number of iterations on a logarithmic scale.

4.4 Rosenbrock function

4.4.1 Two dimensions

The 2-dimensional Rosenbrock function is defined as

f(x,y)=(ax)2+b(yx2)2,f(x,y)=(a-x)^{2}+b(y-x^{2})^{2}\,,

where a common choice of parameters is a=1, b=100. This is a non-convex function whose global minimum (x^{*},y^{*})=(a,a^{2}) lies inside a long, narrow, parabolic-shaped flat valley. Finding the valley is trivial; converging to the global minimum, however, is difficult. We compare the performance of gradient descent, NAG, PDD with {\bm{C}}({\bm{x}})={\mathbb{I}}, and IGAHD [Attouch et al., 2020] when minimizing the Rosenbrock function starting from (-3,-4). The stepsize of gradient descent is \tau_{\textrm{gd}}=0.0002. The parameters of NAG are \tau_{\textrm{nag}}=0.0002 and \beta_{\textrm{nag}}=0.9. The parameters of PDD are \tau_{\textrm{pdd}}=\sigma_{\textrm{pdd}}=0.005, \varepsilon=1, \omega=1, A=5. The stepsize of the PDD method is larger than \tau_{\textrm{gd}} and \tau_{\textrm{nag}} because gradient descent and NAG do not converge with larger stepsizes. For IGAHD (‘att’), we choose \tau_{\textrm{att}}=0.00045, \alpha=3, \beta^{(1)}=\sqrt{\tau_{\textrm{att}}}/14. The convergence result and the optimization trajectories are shown in Fig. 2.
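A sketch of the Rosenbrock objective and its gradient with a=1, b=100, as used above (function names ours):

import numpy as np

a, b = 1.0, 100.0

def f(x, y):
    return (a - x)**2 + b * (y - x**2)**2

def grad_f(x, y):
    # df/dx = -2(a - x) - 4 b x (y - x^2),  df/dy = 2 b (y - x^2)
    return np.array([-2.0 * (a - x) - 4.0 * b * x * (y - x**2),
                     2.0 * b * (y - x**2)])

x0 = np.array([-3.0, -4.0])   # starting point used above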

(a) Convergence comparison
(b) Optimization trajectories
Figure 2: Minimizing the Rosenbrock function with gradient descent, NAG, PDD with 𝑪(𝒙)=𝕀{\bm{C}}({\bm{x}})={\mathbb{I}} and IGAHD (‘att’). The left panel shows the convergence speed of each method. The right panel shows the optimization trajectories of each method.

4.4.2 N dimensions

The NN-dimensional coupled Rosenbrock function is defined as

f(𝒙)=i=1N1((axi)2+b(xi+1xi2)2),f({\bm{x}})=\sum_{i=1}^{N-1}\big{(}(a-x_{i})^{2}+b(x_{i+1}-x_{i}^{2})^{2}\big{)}\,,

where we choose a=1 and b=100 as in the two-dimensional case, and we set N=100. The global minimum is at {\bm{x}}^{*}=(1,1,\ldots,1). We compare the performance of the four algorithms starting from {\bm{x}}_{0}=(0,\ldots,0). The stepsize of gradient descent is \tau_{\textrm{gd}}=0.001. The parameters of NAG are \tau_{\textrm{nag}}=0.0008 and \beta_{\textrm{nag}}=0.95. The parameters of PDD are \tau_{\textrm{pdd}}=\sigma_{\textrm{pdd}}=0.01, \varepsilon=0.5, \omega=1, A=5. The stepsize of the PDD method is larger than \tau_{\textrm{gd}} and \tau_{\textrm{nag}} because gradient descent and NAG do not converge with larger stepsizes. For IGAHD (‘att’), we choose \tau_{\textrm{att}}=0.0002, \alpha=3, \beta^{(1)}=2\sqrt{\tau_{\textrm{att}}}. The result is summarized in Fig. 3.
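A vectorized sketch of the coupled Rosenbrock objective and its gradient for N=100:

import numpy as np

a, b, N = 1.0, 100.0, 100

def f(x):
    return np.sum((a - x[:-1])**2 + b * (x[1:] - x[:-1]**2)**2)

def grad_f(x):
    g = np.zeros_like(x)
    g[:-1] += -2.0 * (a - x[:-1]) - 4.0 * b * x[:-1] * (x[1:] - x[:-1]**2)
    g[1:] += 2.0 * b * (x[1:] - x[:-1]**2)
    return g

x0 = np.zeros(N)   # starting point used above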

Figure 3: Comparison of gradient descent, NAG, PDD with 𝑪(𝒙)=𝕀{\bm{C}}({\bm{x}})={\mathbb{I}} and IGAHD (‘att’) on minimizing the 100-dimensional coupled Rosenbrock function. The y-axis represents the distance between the current iterate and the global minimum on a logarithmic scale. The x-axis represents the number of iterations on a logarithmic scale.

4.5 Ackley function

We consider minimizing the two-dimensional Ackley function given by

f(x,y)=20exp(0.20.5(x2+y2))exp[0.5(cos(2πx)+cos(2πy))]+e+20,f(x,y)=-20\exp\big{(}-0.2\sqrt{0.5(x^{2}+y^{2})}\big{)}-\exp\big{[}0.5\big{(}\cos(2\pi x)+\cos(2\pi y)\big{)}\big{]}+\mathrm{e}+20\,,

which has many local minima. The unique global minimum is located at (x,y)=(0,0)(x^{*},y^{*})=(0,0). We compare the performance of gradient descent, NAG, PDD, and IGAHD (‘att’) for minimizing the two-dimensional Ackley function starting from (x0,y0)=(2.5,4)(x_{0},y_{0})=(2.5,4). The stepsize of gradient descent is τgd=0.002\tau_{\textrm{gd}}=0.002. The stepsize of NAG is τnag=0.002\tau_{\textrm{nag}}=0.002, βnag=0.9\beta_{\textrm{nag}}=0.9. The parameters of PDD are τpdd=σpdd=0.002\tau_{\textrm{pdd}}=\sigma_{\textrm{pdd}}=0.002, ε=1\varepsilon=1, ω=1\omega=1, A=1A=1. For IGAHD (‘att’), we choose τatt=0.01\tau_{\textrm{att}}=0.01, α=3\alpha=3, β(1)=2τatt\beta^{(1)}=2*\sqrt{\tau_{\textrm{att}}}. The results are summarized in Fig. 4.
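A sketch of the two-dimensional Ackley function and its gradient; the gradient is undefined exactly at the origin (the global minimizer), so a small guard is included.

import numpy as np

def f(x, y):
    r = np.sqrt(0.5 * (x**2 + y**2))
    return (-20.0 * np.exp(-0.2 * r)
            - np.exp(0.5 * (np.cos(2 * np.pi * x) + np.cos(2 * np.pi * y)))
            + np.e + 20.0)

def grad_f(x, y):
    r = max(np.sqrt(0.5 * (x**2 + y**2)), 1e-12)   # guard the derivative at the origin
    e1 = np.exp(-0.2 * r)
    e2 = np.exp(0.5 * (np.cos(2 * np.pi * x) + np.cos(2 * np.pi * y)))
    # d/dx of the first term is 2 x e1 / r; d/dx of the second term is pi sin(2 pi x) e2
    return np.array([2.0 * x * e1 / r + np.pi * np.sin(2 * np.pi * x) * e2,
                     2.0 * y * e1 / r + np.pi * np.sin(2 * np.pi * y) * e2])

x0 = np.array([2.5, 4.0])   # starting point used above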

(a) Convergence comparison
(b) Optimization trajectories
Figure 4: Minimizing the Ackley function with gradient descent, NAG, PDD and IGAHD (‘att’). The left panel shows the convergence speed of each method. The right panel shows the optimization trajectories of each method.
Remark 4.2.

We remark that our algorithm has no stochasticity. In general, it is not guaranteed to converge to the global minimum of non-convex functions. For example, it does not converge to the global minimum for the Griewank, Drop-Wave, and Rastrigin functions.

4.6 Neural network training

4.6.1 MNIST with a two-layer neural network

We consider the classification problem on the MNIST handwritten digit data set with a two-layer neural network. The network has an input layer of size 784=28\times 28, two hidden layers of size 32, and an output layer of size 10. We use the ReLU activation function between the layers, and the loss is the cross-entropy loss. We use a batch size of 200 for all the algorithms. The stepsize of gradient descent is \tau_{\textrm{gd}}=0.001. The parameters of NAG are \tau_{\textrm{nag}}=0.001 and \textrm{momentum}=0.9. The parameters of PDD are \tau_{\textrm{pdd}}=0.001, \sigma_{\textrm{pdd}}=5, \varepsilon=0.005, \omega=1, A=1. For IGAHD (‘att’), we choose \tau_{\textrm{att}}=0.001, \alpha=3, \beta^{(1)}=0.01. For Adam, we choose \tau_{\textrm{adam}}=0.001, \beta_{1}=0.9, \beta_{2}=0.999.
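A hypothetical PyTorch sketch of the described network; the paper does not list its training code, so the snippet only fixes the layer sizes and the loss stated above.

import torch.nn as nn

model = nn.Sequential(
    nn.Flatten(),                        # 28 x 28 input flattened to 784
    nn.Linear(28 * 28, 32), nn.ReLU(),   # first hidden layer of size 32
    nn.Linear(32, 32), nn.ReLU(),        # second hidden layer of size 32
    nn.Linear(32, 10),                   # output layer of size 10
)
loss_fn = nn.CrossEntropyLoss()          # cross-entropy loss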

Algorithm SGD NAG PDD Adam Att
train loss 2.223 ±\pm 0.034 0.964 ±\pm 0.244 0.433 ±\pm 0.270 0.589 ±\pm 0.282 0.591 ±\pm 0.288
test acc 29.3 ±\pm 8.3 % 71.2 ±\pm 9.4 % 85.4 ±\pm 10.4 % 79.1 ±\pm 11.3 % 80.8 ±\pm 11.4 %
Table 1: Average training loss and test accuracy of different algorithms for MNIST handwritten digit recognition over 60 random seeds.
Figure 5: Training a two-layer neural network with the MNIST data set using gradient descent, NAG, PDD, Adam, and IGAHD (‘att’). The left panel shows the convergence speed of training loss. The right panel shows the test accuracy of each method. The x-axis represents the number of iterations in terms of mini-batches.

4.6.2 CIFAR10 with CNN

We train a convolutional neural network on the CIFAR10 data set with SGD, NAG, PDD, Adam, and IGAHD (‘att’). The architecture is as follows. The network consists of two convolutional layers: the first has 32 output channels with 3\times 3 filters, and the second has 64 output channels with 4\times 4 filters. Each convolutional layer is followed by a ReLU activation and then a 2\times 2 max-pooling layer. Lastly, we have three fully connected layers of size (64\cdot 4\cdot 4,120), (120,84), and (84,10). The loss is the cross-entropy loss. The stepsize of SGD is \tau_{\textrm{gd}}=0.01. The parameters of NAG are \tau_{\textrm{nag}}=0.005 and \textrm{momentum}=0.9. The parameters of PDD are \tau_{\textrm{pdd}}=0.005, \sigma_{\textrm{pdd}}=5, \varepsilon=0.005, \omega=1, A=1. For IGAHD (‘att’), we choose \tau_{\textrm{att}}=0.005, \alpha=3, \beta^{(1)}=0.01. For Adam, we choose \tau_{\textrm{adam}}=0.005, \beta_{1}=0.9, \beta_{2}=0.999.
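A hypothetical PyTorch sketch of the described architecture. Since paddings and strides are not stated, the flattened size (given as 64\cdot 4\cdot 4 in the text) is inferred at run time via LazyLinear.

import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),    # 32 channels, 3x3 filters
    nn.Conv2d(32, 64, kernel_size=4), nn.ReLU(), nn.MaxPool2d(2),   # 64 channels, 4x4 filters
    nn.Flatten(),
    nn.LazyLinear(120), nn.ReLU(),   # fully connected layers (120, 84, 10); input size inferred on first forward pass
    nn.Linear(120, 84), nn.ReLU(),
    nn.Linear(84, 10),
)
loss_fn = nn.CrossEntropyLoss()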

Figure 6: Training loss (left panel) and test accuracy (right panel) of a convolutional neural network on the CIFAR10 data set. The x-axis represents the number of iterations in terms of mini-batches.
Algorithm SGD NAG PDD Adam Att
train loss 2.038 ±\pm 0.070 1.347 ±\pm 0.100 0.697 ±\pm 0.077 0.927 ±\pm 0.128 0.879 ±\pm 0.092
test acc 27.5 ±\pm 2.2 % 51.2 ±\pm 0.7 % 70.3 ±\pm 0.5 % 64.4 ±\pm 2.7 % 66.8 ±\pm 0.6 %
Table 2: Average training loss and test accuracy of different algorithms for CIFAR10 data set over 60 random seeds.

5 Discussion

This paper presents primal-dual hybrid gradient algorithms for solving unconstrained optimization problems. We reformulate the optimality condition of the optimization problem as a saddle-point problem and then solve the proposed saddle-point problem with a preconditioned PDHG method. We present a geometric convergence analysis for strongly convex objective functions. In numerical experiments, we demonstrate that the proposed method works efficiently on several non-convex optimization problems, such as the Rosenbrock and Ackley functions. In particular, it can efficiently train two-layer and convolutional neural networks in supervised learning problems.

So far, our convergence study is limited to strongly convex objective functions, not general convex ones. Meanwhile, the choices of preconditioners and stepsizes are independent of time. We also have not discussed the optimal choices of parameters or general proximal operators in the updates of the algorithms. These generalized choices of functions and parameters, and their convergence properties, have been intensively studied for Nesterov's accelerated gradient methods and Attouch's Hessian-driven damping methods. In future work, we shall explore the convergence properties of PDHG methods for convex functions with time-dependent parameters. We will also investigate the convergence of similar algorithms in scientific computing problems involving implicit-in-time updates of partial differential equations [Li et al., 2022, 2023, Liu et al., 2023].

Acknowledgement: X. Zuo and S. Osher’s work was partly supported by AFOSR MURI FP 9550-18-1-502 and ONR grants: N00014-20-1-2093 and N00014-20-1-2787. W. Li’s work was supported by AFOSR MURI FP 9550-18-1-502, AFOSR YIP award No. FA9550-23-1-0087, and NSF RTG: 2038080.

References

  • Attouch et al. [2019] H. Attouch, Z. Chbani, and H. Riahi. Fast proximal methods via time scaling of damped inertial dynamics. SIAM Journal on Optimization, 29(3):2227–2256, 2019.
  • Attouch et al. [2020] H. Attouch, Z. Chbani, J. Fadili, and H. Riahi. First-order optimization algorithms via inertial systems with Hessian driven damping. Mathematical Programming, pages 1–43, 2020.
  • Attouch et al. [2021] H. Attouch, Z. Chbani, J. Fadili, and H. Riahi. Convergence of iterates for first-order optimization algorithms with inertia and Hessian driven damping. Optimization, pages 1–40, 2021.
  • Boyd and Vandenberghe [2004] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
  • Boyd et al. [2011] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1):1–122, 2011.
  • Chambolle and Pock [2011] A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40:120–145, 2011.
  • Chen and Luo [2019] L. Chen and H. Luo. First order optimization methods based on Hessian-driven Nesterov accelerated gradient flow. arXiv preprint arXiv:1912.09276, 2019.
  • Chen and Luo [2021] L. Chen and H. Luo. A unified convergence analysis of first order convex optimization methods via strong Lyapunov functions. arXiv preprint arXiv:2108.00132, 2021.
  • Gabay and Mercier [1976] D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications, 2(1):17–40, 1976. ISSN 0898-1221. doi: https://doi.org/10.1016/0898-1221(76)90003-1. URL https://www.sciencedirect.com/science/article/pii/0898122176900031.
  • Jacobs et al. [2019] M. Jacobs, F. Léger, W. Li, and S. Osher. Solving large-scale optimization problems with a convergence rate independent of grid size. SIAM Journal on Numerical Analysis, 57(3):1100–1123, 2019.
  • Kingma and Ba [2014] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Li et al. [2022] W. Li, S. Liu, and S. Osher. Controlling conservation laws II: Compressible Navier–Stokes equations. Journal of Computational Physics, 463:111264, 2022.
  • Li et al. [2023] W. Li, S. Liu, and S. Osher. Controlling conservation laws I: Entropy–entropy flux. Journal of Computational Physics, 480:112019, 2023.
  • Liu et al. [2023] S. Liu, S. Osher, W. Li, and C.-W. Shu. A primal-dual approach for solving conservation laws with implicit in time approximations. Journal of Computational Physics, 472:111654, 2023.
  • Liu et al. [2021] Y. Liu, Y. Xu, and W. Yin. Acceleration of primal–dual methods by preconditioning and simple subproblem procedures. Journal of Scientific Computing, 86(2):21, 2021.
  • Nesterov [1983] Y. E. Nesterov. A method of solving a convex programming problem with convergence rate 𝒪(1k2)\mathcal{O}(\frac{1}{k^{2}}). In Doklady Akademii Nauk, volume 269, pages 543–547. Russian Academy of Sciences, 1983.
  • Osher et al. [2022] S. Osher, B. Wang, P. Yin, X. Luo, F. Barekat, M. Pham, and A. Lin. Laplacian smoothing gradient descent. Research in the Mathematical Sciences, 9(3):55, 2022.
  • Ouyang et al. [2015] Y. Ouyang, Y. Chen, G. Lan, and E. Pasiliao Jr. An accelerated linearized alternating direction method of multipliers. SIAM Journal on Imaging Sciences, 8(1):644–681, 2015.
  • Park et al. [2021] J.-H. Park, A. J. Salgado, and S. M. Wise. Preconditioned accelerated gradient descent methods for locally lipschitz smooth objectives with applications to the solution of nonlinear PDEs. Journal of Scientific Computing, 89(1):17, 2021.
  • Pascanu and Bengio [2013] R. Pascanu and Y. Bengio. Revisiting natural gradient for deep networks. arXiv preprint arXiv:1301.3584, 2013.
  • Pock and Chambolle [2011] T. Pock and A. Chambolle. Diagonal preconditioning for first order primal-dual algorithms in convex optimization. In 2011 International Conference on Computer Vision, pages 1762–1769, 2011. doi: 10.1109/ICCV.2011.6126441.
  • Polyak [1964] B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
  • Rees [2010] T. Rees. Preconditioning iterative methods for PDE constrained optimization. PhD thesis, University of Oxford, Oxford, UK, 2010.
  • Siegel [2019] J. W. Siegel. Accelerated first-order methods: Differential equations and Lyapunov functions. arXiv preprint arXiv:1903.05671, 2019.
  • Su et al. [2015] W. Su, S. Boyd, and E. J. Candes. A differential equation for modeling Nesterov’s accelerated gradient method: theory and insights. arXiv preprint arXiv:1503.01243, 2015.
  • Trefethen and Bau [2022] L. N. Trefethen and D. Bau. Numerical linear algebra, volume 181. SIAM, 2022.
  • Valkonen [2014] T. Valkonen. A primal–dual hybrid gradient method for nonlinear operators with applications to MRI. Inverse Problems, 30(5):055012, 2014.
  • Wibisono et al. [2016] A. Wibisono, A. C. Wilson, and M. I. Jordan. A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 113(47):E7351–E7358, 2016.

Appendix A Matrix lemma

Lemma A.1.

Let {\bm{A}},{\bm{B}},{\bm{C}}\in\mathbb{R}^{n\times n} be real symmetric matrices that are simultaneously diagonalizable. Suppose that

λA,i+|λC,i|20\lambda_{A,i}+\frac{|\lambda_{C,i}|}{2}\leq 0\,
λB,i+|λC,i|20\lambda_{B,i}+\frac{|\lambda_{C,i}|}{2}\leq 0\,

for all i, where \lambda_{A,i},\lambda_{B,i},\lambda_{C,i} are the i-th eigenvalues of {\bm{A}},{\bm{B}},{\bm{C}}, respectively, in the common eigenbasis. Then

𝒙T𝑨𝒙+𝒚T𝑩𝒚+𝒙T𝑪𝒚0,{\bm{x}}^{T}{\bm{A}}{\bm{x}}+{\bm{y}}^{T}{\bm{B}}{\bm{y}}+{\bm{x}}^{T}{\bm{C}}{\bm{y}}\leq 0,

for all 𝐱,𝐲n{\bm{x}},{\bm{y}}\in\mathbb{R}^{n}.

Proof.

Let 𝒙,𝒚n{\bm{x}},{\bm{y}}\in\mathbb{R}^{n}. By our assumption, there exists 𝑸{\bm{Q}} unitary such that 𝑨,𝑩,𝑪{\bm{A}},{\bm{B}},{\bm{C}} are simultaneously diagonalizable by 𝑸{\bm{Q}}. Set 𝒙~=𝑸𝒙\tilde{{\bm{x}}}={\bm{Q}}{\bm{x}} and 𝒚~=𝑸𝒚\tilde{{\bm{y}}}={\bm{Q}}{\bm{y}}. Then we can compute

𝒙T𝑨𝒙+𝒚T𝑩𝒚+𝒙T𝑪𝒚\displaystyle{\bm{x}}^{T}{\bm{A}}{\bm{x}}+{\bm{y}}^{T}{\bm{B}}{\bm{y}}+{\bm{x}}^{T}{\bm{C}}{\bm{y}} =i=1nx~i2λA,i+y~i2λB,i+x~iλC,iy~i\displaystyle=\sum_{i=1}^{n}\tilde{x}_{i}^{2}\lambda_{A,i}+\tilde{y}_{i}^{2}\lambda_{B,i}+\tilde{x}_{i}\lambda_{C,i}\tilde{y}_{i}
i=1nx~i2(λA,i+|λC,i|2)+y~i2(λB,i+|λC,i|2)\displaystyle\leq\sum_{i=1}^{n}\tilde{x}_{i}^{2}\big{(}\lambda_{A,i}+\frac{|\lambda_{C,i}|}{2}\big{)}+\tilde{y}_{i}^{2}\big{(}\lambda_{B,i}+\frac{|\lambda_{C,i}|}{2}\big{)}
0,\displaystyle\leq 0\,,

where the first inequality follows from αxy(x2+y2)|α|/2\alpha xy\leq(x^{2}+y^{2})|\alpha|/2 for any α,x,y\alpha,x,y\in\mathbb{R}. ∎
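A quick numerical sanity check of Lemma A.1 (a sketch; the construction of the common eigenbasis and of eigenvalues satisfying the hypotheses is ours):

import numpy as np

rng = np.random.default_rng(0)
n = 5
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))    # common orthogonal eigenbasis
lam_C = rng.standard_normal(n)
lam_A = -np.abs(lam_C) / 2 - rng.random(n)          # enforce lam_A + |lam_C|/2 <= 0
lam_B = -np.abs(lam_C) / 2 - rng.random(n)          # enforce lam_B + |lam_C|/2 <= 0
A, B, C = (Q @ np.diag(lam) @ Q.T for lam in (lam_A, lam_B, lam_C))

x, y = rng.standard_normal(n), rng.standard_normal(n)
print(x @ A @ x + y @ B @ y + x @ C @ y <= 1e-12)   # expected: True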

Appendix B Proof of Theorem 2.6

B.1 Part (a)

We have the following system of ODE:

(𝒙˙𝒑˙)=(γ𝑩𝑸𝑨𝑸𝑩𝑸(𝕀γε𝑨)𝑨𝑸ε𝑨)(𝒙𝒑).\begin{pmatrix}\dot{{\bm{x}}}\\ \dot{{\bm{p}}}\end{pmatrix}=\begin{pmatrix}-\gamma{\bm{B}}{\bm{Q}}{\bm{A}}{\bm{Q}}&-{\bm{B}}{\bm{Q}}({\mathbb{I}}-\gamma\varepsilon{\bm{A}})\\ {\bm{A}}{\bm{Q}}&-\varepsilon{\bm{A}}\end{pmatrix}\begin{pmatrix}{\bm{x}}\\ {\bm{p}}\end{pmatrix}\,. (B.1)

Let us compute the eigenvalues of the above system. Let α\alpha be an eigenvalue, then α\alpha satisfies

det(γ𝑩𝑸𝑨𝑸α𝕀𝑩𝑸(𝕀γε𝑨)𝑨𝑸ε𝑨α𝕀)\displaystyle\mathrm{det}\begin{pmatrix}-\gamma{\bm{B}}{\bm{Q}}{\bm{A}}{\bm{Q}}-\alpha{\mathbb{I}}&-{\bm{B}}{\bm{Q}}({\mathbb{I}}-\gamma\varepsilon{\bm{A}})\\ {\bm{A}}{\bm{Q}}&-\varepsilon{\bm{A}}-\alpha{\mathbb{I}}\end{pmatrix} =0\displaystyle=0
det((γ𝑩𝑸𝑨𝑸α𝕀)(ε𝑨α𝕀)+𝑩𝑸(𝕀γε𝑨)𝑨𝑸)\displaystyle\mathrm{det}\big{(}(-\gamma{\bm{B}}{\bm{Q}}{\bm{A}}{\bm{Q}}-\alpha{\mathbb{I}})(-\varepsilon{\bm{A}}-\alpha{\mathbb{I}})+{\bm{B}}{\bm{Q}}({\mathbb{I}}-\gamma\varepsilon{\bm{A}}){\bm{A}}{\bm{Q}}\big{)} =0\displaystyle=0
det(α2𝕀+α(ε𝑨+γ𝑩𝑸𝑨𝑸)+γε𝑩𝑸𝑨𝑸𝑨+𝑩𝑸𝑨𝑸γε𝑩𝑸𝑨𝑨𝑸)\displaystyle\mathrm{det}\big{(}\alpha^{2}{\mathbb{I}}+\alpha(\varepsilon{\bm{A}}+\gamma{\bm{B}}{\bm{Q}}{\bm{A}}{\bm{Q}})+\gamma\varepsilon{\bm{B}}{\bm{Q}}{\bm{A}}{\bm{Q}}{\bm{A}}+{\bm{B}}{\bm{Q}}{\bm{A}}{\bm{Q}}-\gamma\varepsilon{\bm{B}}{\bm{Q}}{\bm{A}}{\bm{A}}{\bm{Q}}\big{)} =0\displaystyle=0
det(α2𝕀+α(ε𝑨+γ𝑩𝑸𝑨𝑸)+𝑩𝑸𝑨𝑸)\displaystyle\mathrm{det}\big{(}\alpha^{2}{\mathbb{I}}+\alpha(\varepsilon{\bm{A}}+\gamma{\bm{B}}{\bm{Q}}{\bm{A}}{\bm{Q}})+{\bm{B}}{\bm{Q}}{\bm{A}}{\bm{Q}}\big{)} =0.\displaystyle=0\,.

The last equality holds because {\bm{A}} commutes with {\bm{Q}}. We assume that {\bm{A}} and {\bm{B}}{\bm{Q}}{\bm{A}}{\bm{Q}} are simultaneously diagonalizable, with eigenvalues a_{i} and \mu_{i}, respectively. Thus,

0=\alpha^{2}+\alpha(\varepsilon a_{i}+\gamma\mu_{i})+\mu_{i}\,,
\alpha=\frac{-\varepsilon a_{i}-\gamma\mu_{i}\pm\sqrt{(\varepsilon a_{i}+\gamma\mu_{i})^{2}-4\mu_{i}}}{2}\,.

If \gamma>0 and \varepsilon\geq 0, then the real parts of the eigenvalues are negative, and the system converges. The convergence rate depends on the largest real part of the eigenvalues, which is

\max_{i}\frac{1}{2}\big{[}-\gamma\mu_{i}-\varepsilon a_{i}+\Re\big{(}\sqrt{(\gamma\mu_{i}+\varepsilon a_{i})^{2}-4\mu_{i}}\big{)}\big{]}\,.
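A small numerical check of the characteristic roots above in the diagonal case; the scalars a_i, q_i below are hypothetical, and we take {\bm{B}}={\mathbb{I}}.

import numpy as np

gamma, eps = 0.7, 0.3
a_i, q_i = 1.5, 2.0                  # hypothetical eigenvalues of A and Q in the common basis (B = I)
mu_i = a_i * q_i**2                  # corresponding eigenvalue of B Q A Q

# The i-th 2x2 block of the system matrix in Eq. B.1
M = np.array([[-gamma * mu_i, -q_i * (1.0 - gamma * eps * a_i)],
              [a_i * q_i, -eps * a_i]])

# Roots of alpha^2 + alpha (eps a_i + gamma mu_i) + mu_i = 0
roots = np.roots([1.0, eps * a_i + gamma * mu_i, mu_i])
print(np.sort_complex(np.linalg.eigvals(M)))
print(np.sort_complex(roots))        # should agree with the eigenvalues of M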

B.2 Part (c)

When γ=ε=0\gamma=\varepsilon=0, we see that α\alpha is purely imaginary. Thus solutions to Eq. B.1 will be oscillatory and will not converge.

B.3 Part (b)

Let us define

g(γ)=maxi{μi(γ+(γ24/μi))2}.g(\gamma)=\max_{i}\big{\{}\frac{\mu_{i}\big{(}-\gamma+\Re\big{(}\sqrt{\gamma^{2}-4/\mu_{i}}\big{)}\big{)}}{2}\big{\}}\,.

Essentially, we would like to find γ=argminγg(γ)\gamma^{*}=\operatorname*{arg\,min}_{\gamma}g(\gamma). We then define

γ(μ)\displaystyle\gamma(\mu) :=argminγμ(γ+(γ24/μ))2\displaystyle:=\operatorname*{arg\,min}_{\gamma}\frac{\mu\big{(}-\gamma+\Re\big{(}\sqrt{\gamma^{2}-4/\mu}\big{)}\big{)}}{2}
=2μ.\displaystyle=\frac{2}{\sqrt{\mu}}\,.

Observe that if γ2/μn\gamma\geq 2/\sqrt{\mu_{n}}, then γ24/μi0\gamma^{2}-4/\mu_{i}\geq 0 for all ii. Thus

g(γ)\displaystyle g(\gamma) =maxi{μi(γ+γ24/μi)2}.\displaystyle=\max_{i}\big{\{}\frac{\mu_{i}\big{(}-\gamma+\sqrt{\gamma^{2}-4/\mu_{i}}\big{)}}{2}\big{\}}\,.

For \mu\in[\mu_{n},\mu_{1}] and \gamma\geq 2/\sqrt{\mu_{n}}, one can check that the function \mu\big{(}-\gamma+\sqrt{\gamma^{2}-4/\mu}\big{)} is increasing in \mu by computing the partial derivative with respect to \mu. Moreover, -\gamma+\sqrt{\gamma^{2}-4/\mu_{1}} is increasing in \gamma, since its derivative with respect to \gamma is -1+\gamma/\sqrt{\gamma^{2}-4/\mu_{1}}\geq 0. Then, for \gamma\geq 2/\sqrt{\mu_{n}}, we get

g(γ)=μ1(γ+γ24/μ1)2g(2/μn)=μ1(κ1κ)μn/2,g(\gamma)=\frac{\mu_{1}\big{(}-\gamma+\sqrt{\gamma^{2}-4/\mu_{1}}\big{)}}{2}\geq g(2/\sqrt{\mu_{n}})=\sqrt{\mu_{1}}(\sqrt{\kappa-1}-\sqrt{\kappa})\approx-\sqrt{\mu_{n}}/2\,,

where κ=μ1/μn>1\kappa=\mu_{1}/\mu_{n}>1. The last approximation is valid for μ1/μn1\mu_{1}/\mu_{n}\gg 1. This shows that γ2/μn\gamma^{*}\leq 2/\sqrt{\mu_{n}}. Similarly, if γ2/μ1\gamma\leq 2/\sqrt{\mu_{1}}, then γ24/μi0\gamma^{2}-4/\mu_{i}\leq 0 for all ii. Thus

g(γ)\displaystyle g(\gamma) =maxi{μiγ2}\displaystyle=\max_{i}\big{\{}\frac{-\mu_{i}\gamma}{2}\big{\}}
=μnγ2\displaystyle=\frac{-\mu_{n}\gamma}{2}
μnμ1=g(2/μ1).\displaystyle\geq-\frac{\mu_{n}}{\sqrt{\mu_{1}}}=g(2/\sqrt{\mu_{1}})\,.

This shows that γ2/μ1\gamma^{*}\geq 2/\sqrt{\mu_{1}}. Combining with our previous observation, we get γ[2/μ1,2/μn]\gamma^{*}\in[2/\sqrt{\mu_{1}},2/\sqrt{\mu_{n}}]. Now let us fix some γ[2/μ1,2/μn]\gamma^{\prime}\in[2/\sqrt{\mu_{1}},2/\sqrt{\mu_{n}}]. Let j=inf{i:1in,γ24/μi0}j=\inf\{i:1\leq i\leq n,\gamma^{\prime 2}-4/\mu_{i}\leq 0\}. By our assumption on γ\gamma^{\prime}, we know that 1<j<n1<j<n. Now for 1ij11\leq i\leq j-1, we have

μi(γ+(γ24/μi))2=μi(γ+γ24/μi)2μ1(γ+γ24/μ1)2.\frac{\mu_{i}\big{(}-\gamma^{\prime}+\Re\big{(}\sqrt{\gamma^{\prime 2}-4/\mu_{i}}\big{)}\big{)}}{2}=\frac{\mu_{i}\big{(}-\gamma^{\prime}+\sqrt{\gamma^{\prime 2}-4/\mu_{i}}\big{)}}{2}\leq\frac{\mu_{1}\big{(}-\gamma^{\prime}+\sqrt{\gamma^{\prime 2}-4/\mu_{1}}\big{)}}{2}\,.

And for jknj\leq k\leq n, we have

μk(γ+(γ24/μk))2=μkγ2μnγ2.\frac{\mu_{k}\big{(}-\gamma^{\prime}+\Re\big{(}\sqrt{\gamma^{\prime 2}-4/\mu_{k}}\big{)}\big{)}}{2}=\frac{-\mu_{k}\gamma^{\prime}}{2}\leq\frac{-\mu_{n}\gamma^{\prime}}{2}\,.

It is thus clear that for γ[2/μ1,2/μn]\gamma^{\prime}\in[2/\sqrt{\mu_{1}},2/\sqrt{\mu_{n}}],

g(γ)\displaystyle g(\gamma^{\prime}) =max{μ1(γ+γ24/μ1)2,μnγ2}.\displaystyle=\max\big{\{}\frac{\mu_{1}\big{(}-\gamma^{\prime}+\sqrt{\gamma^{\prime 2}-4/\mu_{1}}\big{)}}{2},\frac{-\mu_{n}\gamma^{\prime}}{2}\big{\}}\,.

It is straightforward to calculate that for γ[2μ1,2μ1μn(2μ1μn)]\gamma\in[\frac{2}{\sqrt{\mu_{1}}},\frac{2\sqrt{\mu_{1}}}{\sqrt{\mu_{n}(2\mu_{1}-\mu_{n})}}], we have

μnγ2μ1(γ+γ24/μ1)2.\frac{-\mu_{n}\gamma}{2}\geq\frac{\mu_{1}\big{(}-\gamma+\sqrt{\gamma^{2}-4/\mu_{1}}\big{)}}{2}\,.

So

g(γ)=μnγ2g(2μ1μn(2μ1μn))=μn21κ.g(\gamma)=\frac{-\mu_{n}\gamma}{2}\geq g(\frac{2\sqrt{\mu_{1}}}{\sqrt{\mu_{n}(2\mu_{1}-\mu_{n})}})=\frac{-\sqrt{\mu_{n}}}{\sqrt{2-\frac{1}{\kappa}}}\,.

And for γ[2μ1μn(2μ1μn),2/μn]\gamma\in[\frac{2\sqrt{\mu_{1}}}{\sqrt{\mu_{n}(2\mu_{1}-\mu_{n})}},2/\sqrt{\mu_{n}}] we have

μnγ2μ1(γ+γ24/μ1)2.\frac{-\mu_{n}\gamma}{2}\leq\frac{\mu_{1}\big{(}-\gamma+\sqrt{\gamma^{2}-4/\mu_{1}}\big{)}}{2}\,.

This implies

g(γ)=μ1(γ+γ24/μ1)2g(2μ1μn(2μ1μn))=μn21κ.g(\gamma)=\frac{\mu_{1}\big{(}-\gamma+\sqrt{\gamma^{2}-4/\mu_{1}}\big{)}}{2}\geq g(\frac{2\sqrt{\mu_{1}}}{\sqrt{\mu_{n}(2\mu_{1}-\mu_{n})}})=\frac{-\sqrt{\mu_{n}}}{\sqrt{2-\frac{1}{\kappa}}}\,.

This shows γ=2μ1μn(2μ1μn)\gamma^{*}=\frac{2\sqrt{\mu_{1}}}{\sqrt{\mu_{n}(2\mu_{1}-\mu_{n})}}.
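A numerical sanity check of the optimal \gamma^{*} above, on a hypothetical spectrum:

import numpy as np

mu = np.array([9.0, 4.0, 1.0, 0.25])            # hypothetical eigenvalues: mu_1 = 9, mu_n = 0.25

def g(gamma):
    # g(gamma) = max_i mu_i (-gamma + Re sqrt(gamma^2 - 4/mu_i)) / 2
    return np.max(mu * (-gamma + np.real(np.sqrt(gamma**2 - 4.0 / mu + 0j))) / 2.0)

grid = np.linspace(2 / np.sqrt(mu[0]), 2 / np.sqrt(mu[-1]), 100001)
gamma_numeric = grid[np.argmin([g(t) for t in grid])]
gamma_star = 2 * np.sqrt(mu[0]) / np.sqrt(mu[-1] * (2 * mu[0] - mu[-1]))
print(gamma_numeric, gamma_star)                # the two values should agree up to the grid resolution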

B.4 Part (d)

Define \Delta_{\gamma}(\mu,\varepsilon)=(\gamma\mu+\varepsilon)^{2}-4\mu and g_{\gamma}(\mu)=2\sqrt{\mu}-\gamma\mu. Then, for \mu\geq 0, we have \Delta_{\gamma}(\mu,\varepsilon)\leq 0 if and only if \varepsilon\leq g_{\gamma}(\mu). Note that g_{\gamma}^{\prime}(\mu)=\frac{1}{\sqrt{\mu}}-\gamma\geq 0 for \mu\leq\mu_{1} if \gamma\leq\frac{1}{\sqrt{\mu_{1}}}. Therefore, \Delta_{\gamma}(\mu,\varepsilon)\leq 0 for all \mu\in[\mu_{n},\mu_{1}] if \gamma\leq\frac{1}{\sqrt{\mu_{1}}} and \varepsilon\leq g_{\gamma}(\mu_{n}). In particular, \Delta_{\gamma}(\mu,\varepsilon)\leq 0 for all \mu\in[\mu_{n},\mu_{1}] if \varepsilon=g_{\gamma}(\mu^{\prime}) for some \mu^{\prime}\leq\mu_{n}. We have

α\displaystyle\alpha =maxi12[γμiε+((γμi+ε)24μi)]\displaystyle=\max_{i}\frac{1}{2}\big{[}-\gamma\mu_{i}-\varepsilon+\Re\big{(}\sqrt{(\gamma\mu_{i}+\varepsilon)^{2}-4\mu_{i}}\big{)}\big{]}
=maxi12[γμiε]\displaystyle=\max_{i}\frac{1}{2}\big{[}-\gamma\mu_{i}-\varepsilon\big{]}
=maxi12[γμi2μ+γμ]\displaystyle=\max_{i}\frac{1}{2}\big{[}-\gamma\mu_{i}-2\sqrt{\mu^{\prime}}+\gamma\mu^{\prime}\big{]}
=μγ(μnμ)2.\displaystyle=-\sqrt{\mu^{\prime}}-\frac{\gamma(\mu_{n}-\mu^{\prime})}{2}.

Appendix C Proof of Proposition 2.4

We directly compute

𝒙¨\displaystyle\ddot{{\bm{x}}} =𝑪((𝕀γε𝑨)𝒑˙+γ𝑨2f(𝒙)𝒙˙)𝑪˙((𝕀γε𝑨)𝒑+γ𝑨f(𝒙))\displaystyle=-{\bm{C}}\big{(}({\mathbb{I}}-\gamma\varepsilon{\bm{A}})\dot{{\bm{p}}}+\gamma{\bm{A}}\nabla^{2}f({\bm{x}})\dot{{\bm{x}}}\big{)}-\dot{{\bm{C}}}\big{(}({\mathbb{I}}-\gamma\varepsilon{\bm{A}}){\bm{p}}+\gamma{\bm{A}}\nabla f({\bm{x}})\big{)}
=𝑪((𝕀γε𝑨)(𝑨f(𝒙)ε𝑨𝒑)+γ𝑨2f(𝒙)𝒙˙)\displaystyle=-{\bm{C}}\big{(}({\mathbb{I}}-\gamma\varepsilon{\bm{A}})({\bm{A}}\nabla f({\bm{x}})-\varepsilon{\bm{A}}{\bm{p}})+\gamma{\bm{A}}\nabla^{2}f({\bm{x}})\dot{{\bm{x}}}\big{)}
𝑪˙((𝕀γε𝑨)𝒑+γ𝑨f(𝒙))\displaystyle\qquad-\dot{{\bm{C}}}\big{(}({\mathbb{I}}-\gamma\varepsilon{\bm{A}}){\bm{p}}+\gamma{\bm{A}}\nabla f({\bm{x}})\big{)}
=𝑪[(𝕀γε𝑨)𝑨f(𝒙)+ε𝑨(𝑪1𝒙˙+γ𝑨f(𝒙))+γ𝑨2f(𝒙)𝒙˙]+𝑪˙𝑪1𝒙˙\displaystyle=-{\bm{C}}\big{[}({\mathbb{I}}-\gamma\varepsilon{\bm{A}}){\bm{A}}\nabla f({\bm{x}})+\varepsilon{\bm{A}}({\bm{C}}^{-1}\dot{{\bm{x}}}+\gamma{\bm{A}}\nabla f({\bm{x}}))+\gamma{\bm{A}}\nabla^{2}f({\bm{x}})\dot{{\bm{x}}}\big{]}+\dot{{\bm{C}}}{\bm{C}}^{-1}\dot{{\bm{x}}}
\displaystyle=-{\bm{C}}\big{[}{\bm{A}}\nabla f({\bm{x}})+\varepsilon{\bm{A}}{\bm{C}}^{-1}\dot{{\bm{x}}}+\gamma{\bm{A}}\nabla^{2}f({\bm{x}})\dot{{\bm{x}}}\big{]}+\dot{{\bm{C}}}{\bm{C}}^{-1}\dot{{\bm{x}}}\,.

In the third equality we used {\bm{C}}^{-1}\dot{{\bm{x}}}=-\big{(}({\mathbb{I}}-\gamma\varepsilon{\bm{A}}){\bm{p}}+\gamma{\bm{A}}\nabla f({\bm{x}})\big{)} and the fact that {\bm{A}} commutes with {\mathbb{I}}-\gamma\varepsilon{\bm{A}}; in the last equality, the terms \gamma\varepsilon{\bm{A}}^{2}\nabla f({\bm{x}}) cancel.