
Primal-Dual Damping algorithms for optimization

Xinzhe Zuo⋆, Stanley Osher⋆, Wuchen Li†
⋆ Department of Mathematics, University of California, Los Angeles
{zxz, sjo}@math.ucla.edu
† Department of Mathematics, University of South Carolina
wuchen@mailbox.sc.edu
Abstract

We propose an unconstrained optimization method based on the well-known primal-dual hybrid gradient (PDHG) algorithm. We first formulate the optimality condition of the unconstrained optimization problem as a saddle point problem. We then compute the minimizer by applying generalized primal-dual hybrid gradient algorithms. Theoretically, we demonstrate that the continuous-time limit of the proposed algorithm forms a class of second-order differential equations, which contains and extends the heavy ball ODEs and Hessian-driven damping dynamics. Following the Lyapunov analysis of the ODE system, we prove the linear convergence of the algorithm for strongly convex functions. Experimentally, we showcase the advantage of our algorithms on several convex and non-convex optimization problems by comparing their performance with other well-known algorithms, such as Nesterov’s accelerated gradient methods. In particular, we demonstrate that our algorithm is efficient in training two-layer and convolutional neural networks in supervised learning problems.

Keywords— Optimization; Primal-dual hybrid gradient algorithms; Primal-dual damping dynamics.

1 Introduction

Optimization is one of the essential building blocks in many applications, including scientific computing and machine learning problems. One of the classical algorithms for unconstrained optimization problems is the gradient descent method, which updates the state variable in the negative gradient direction at each step [Boyd and Vandenberghe, 2004]. Nowadays, accelerated gradient descent methods have been widely studied. Typical examples include Nesterov’s accelerated gradient method [Nesterov, 1983], Polyak’s heavy ball method [Polyak, 1964], and Hessian-driven damping methods [Chen and Luo, 2019, Attouch et al., 2019, 2020, 2021].

On the other hand, several first-order methods have been introduced to solve linearly constrained optimization problems. Typical examples include the primal-dual hybrid gradient (PDHG) method [Chambolle and Pock, 2011] and the alternating direction method of multipliers (ADMM) [Boyd et al., 2011, Gabay and Mercier, 1976]. They are designed to solve inf-sup saddle point problems, applying a gradient descent direction to the minimization variable and a gradient ascent direction to the maximization variable. Both PDHG and ADMM are designed for solving optimization problems with affine constraints. Ouyang et al. [2015] proposed accelerated linearized ADMM, which incorporates a multi-step acceleration scheme into linearized ADMM. Recently, the PDHG method has been extended to nonlinearly constrained minimization problems [Valkonen, 2014].

In this paper, we study a general class of accelerated first-order methods for unconstrained optimization problems. We reformulate the original optimization problem into an inf-sup type saddle point problem, whose saddle point solves the optimality condition. We then apply a linearized preconditioned primal-dual hybrid gradient algorithm to compute the proposed saddle point problem.

The main description of the algorithm is as follows. Consider the following inf-sup problem for a $\mathcal{C}^{2}$ strongly convex function $f$ over $\mathbb{R}^{d}$:

\inf_{{\bm{x}}\in\mathbb{R}^{d}}\sup_{{\bm{p}}\in\mathbb{R}^{d}}\quad\langle\nabla f({\bm{x}}),{\bm{p}}\rangle-\frac{\varepsilon}{2}\left\lVert{\bm{p}}\right\rVert^{2}\,, (1.1)

where ${\bm{p}}$ is a constructed “dual variable”, $\varepsilon>0$ is a constant, $\langle\cdot,\cdot\rangle$ is the Euclidean inner product, and $\|\cdot\|$ is the Euclidean norm. We later prove that the solution to the saddle point problem (1.1) gives the global minimum of $f$. We propose a linearized preconditioned PDHG algorithm for solving the above inf-sup problem:

{\bm{p}}^{n+1}={\bm{p}}^{n}+\sigma{\bm{A}}({\bm{x}}^{n})\nabla f({\bm{x}}^{n})-\sigma\varepsilon{\bm{A}}({\bm{x}}^{n}){\bm{p}}^{n+1}\,, (1.2a)
\tilde{{\bm{p}}}^{n+1}={\bm{p}}^{n+1}+\omega({\bm{p}}^{n+1}-{\bm{p}}^{n})\,, (1.2b)
{\bm{x}}^{n+1}={\bm{x}}^{n}-\tau{\bm{C}}({\bm{x}}^{n})\tilde{{\bm{p}}}^{n+1}\,, (1.2c)

where $n=1,2,\cdots$ is the iteration step, $\tau,\sigma>0$ are stepsizes for the updates of ${\bm{x}}$ and ${\bm{p}}$, respectively, and $\omega>0$ is a parameter. In the above algorithm, ${\bm{C}}({\bm{x}}^{n})={\bm{B}}({\bm{x}}^{n})\nabla^{2}f({\bm{x}}^{n})$, where ${\bm{A}}({\bm{x}}^{n})\in\mathbb{R}^{d\times d}$ and ${\bm{B}}({\bm{x}}^{n})\in\mathbb{R}^{d\times d}$ act as preconditioners on the updates of ${\bm{p}}^{n+1}$ and ${\bm{x}}^{n+1}$, respectively. This paper only focuses on the simple case where ${\bm{A}}({\bm{x}})=A{\mathbb{I}}$ for some constant $A>0$. Although a second-order term $\nabla^{2}f({\bm{x}}^{n})$ appears in the update of ${\bm{x}}^{n+1}$ (hidden in ${\bm{C}}({\bm{x}}^{n})$), our algorithm is still a first-order algorithm when we choose ${\bm{B}}({\bm{x}}^{n})\nabla^{2}f({\bm{x}}^{n})={\bm{C}}({\bm{x}}^{n})$ for some ${\bm{C}}$ that is easy to compute. For example, we find that ${\bm{C}}={\mathbb{I}}$ is a very good choice in most of our numerical examples. See the numerical sections for empirical choices of parameters.

Our method forms a class of ordinary differential equation systems in terms of $({\bm{x}},{\bm{p}})$ in the continuous limit $\tau,\sigma\rightarrow 0$. We call it the primal-dual damping (PDD) dynamics. We show that the PDD dynamics form a class of second-order ODEs, which contains and extends the inertial Hessian-driven damping dynamics [Chen and Luo, 2019, Attouch et al., 2019]. Theoretically, we analyze the convergence property of the PDD dynamics. If $f$ is a quadratic function of ${\bm{x}}$, with constant ${\bm{A}}$, ${\bm{B}}$, the PDD dynamics satisfy a linear ODE system. Under suitable choices of parameters, we obtain a convergence acceleration similar to that of the heavy ball ODE [Siegel, 2019]. Moreover, for a general nonlinear function $f$, we have the following informal theorem characterizing the convergence speed of our algorithm:

Theorem 1.1 (Informal).

Let $f:\mathbb{R}^{d}\to\mathbb{R}$ be a $\mathcal{C}^{4}$ strongly convex function. Let ${\bm{x}}^{*}$ be the global minimum of $f$ and ${\bm{p}}^{*}=0$. Then, the iterates $({\bm{x}}^{n},{\bm{p}}^{n})$ produced by Eq. 1.2 converge to the saddle point $({\bm{x}}^{*},{\bm{p}}^{*})$ if $\tau$, $\sigma$ are small enough. Moreover,

\|{\bm{p}}^{n}\|^{2}+\|\nabla f({\bm{x}}^{n})\|^{2}\leq(\|{\bm{p}}^{0}\|^{2}+\|\nabla f({\bm{x}}^{0})\|^{2})(1-\frac{\mu^{2}}{M+\delta})^{n},

where $\mu=\min_{{\bm{x}}}\lambda_{\min}(\nabla^{2}f({\bm{x}}){\bm{C}}({\bm{x}}))$, ${\bm{C}}({\bm{x}})={\bm{B}}({\bm{x}})\nabla^{2}f({\bm{x}})$, $\delta>0$ depends on the initial condition, and $M>0$ depends on ${\bm{C}}({\bm{x}})^{T}\big{(}\nabla^{3}f({\bm{x}})\nabla f({\bm{x}})+(\nabla^{2}f({\bm{x}}))^{2}\big{)}{\bm{C}}({\bm{x}})$, $\tau$, $\sigma$, $A$, $\varepsilon$, and $\omega$. The detailed version is given in Theorem 3.10.

Numerically, we test the algorithms on both convex and non-convex optimization problems. In convex optimization, we demonstrate the fast convergence of the proposed algorithm with selected preconditioners, compared with the gradient descent method, Nesterov's accelerated gradient method, and the heavy ball damping method. This justifies the convergence analysis. We also test our algorithm on several well-known non-convex optimization problems. Some examples, such as the Rosenbrock and Ackley functions, demonstrate the potential advantage of our algorithms in converging to the global minimizer. In particular, we compare our algorithms with the stochastic gradient descent method and Adam for training two-layer and convolutional neural network functions in supervised learning problems. This showcases the potential advantage of the proposed methods in terms of convergence speed and test accuracy.

PDHG has been widely used for linearly constrained optimization problems [Chambolle and Pock, 2011]. Recently, Valkonen [2014] applied PDHG to nonlinearly constrained optimization problems and proved asymptotic convergence for saddle point problems with nonlinear coupling. This setting is different from our PDHG algorithm for computing unconstrained optimizations, and we show linear convergence for a particular nonlinearly coupled saddle point problem. Meanwhile, Nesterov accelerated gradient methods and Hessian damping algorithms can also be formulated both as discrete-time updates and as continuous-time second-order ODEs. Wibisono et al. [2016] also introduced the idea of the Bregman Lagrangian to study a family of accelerated methods in the continuous-time limit, which forms a nonlinear second-order ODE. Compared to them, the PDD algorithm induces a generalized second-order ODE system, which contains both the heavy ball ODE [Siegel, 2019] and Hessian damping dynamics [Chen and Luo, 2019, Attouch et al., 2019, 2020, 2021]. For example, when $C={\mathbb{I}}$, algorithm Eq. 1.2 can be viewed as another time discretization of the Hessian damping dynamics [Chen and Luo, 2019, Attouch et al., 2019], providing a different update in discrete time. We only evaluate the gradient of $f$ once per iteration, whereas Attouch's algorithm [Attouch et al., 2020] evaluates the gradient of $f$ twice. In numerical experiments, we demonstrate that the proposed algorithm outperforms Nesterov accelerated methods and Hessian-driven damping methods on some non-convex optimization problems, including supervised learning problems for training neural network functions.

Our work is also related to preconditioning, an important technique in numerical linear algebra [Trefethen and Bau, 2022] and numerical PDEs [Rees, 2010, Park et al., 2021]. In general, preconditioning aims to reduce the condition number of some operators to improve convergence speed. One famous example is preconditioning gradient descent by the inverse of the Hessian matrix, which gives rise to Newton’s method. In recent years, preconditioning techniques have also been developed for training neural networks [Osher et al., 2022, Kingma and Ba, 2014]. Adam [Kingma and Ba, 2014] is arguably one of the most popular optimizers for training deep neural networks. It can also be viewed as a preconditioned algorithm using a diagonal preconditioner that approximates the diagonal of the Fisher information matrix [Pascanu and Bengio, 2013]. Shortly after Chambolle and Pock [2011] developed PDHG for constrained optimization, the same authors also studied the preconditioned PDHG method [Pock and Chambolle, 2011], in which they developed a simple diagonal preconditioner that can guarantee convergence without the need to compute step sizes. Liu et al. [2021] proposed non-diagonal preconditioners for PDHG and showed close connections between preconditioned PDHG and ADMM. Park et al. [2021] studied the preconditioned Nesterov’s accelerated gradient method and proved convergence in the induced norm. Jacobs et al. [2019] introduced a preconditioned norm in the primal update of the PDHG method and improved the step size restriction of the PDHG method.

Our paper is organized as follows. In Section 2 we provide some background and derivations of our algorithm. We also provide the ODE formulations for our primal-dual damping dynamics. In Section 3, we prove our main convergence results for the algorithm. In Section 4 we showcase the advantage of our algorithm through several convex and non-convex examples. In particular, we show that our algorithm can train neural networks and is competitive with commonly used optimizers, such as SGD with Nesterov’s momentum and Adam. We conclude in Section 5 with more discussions and future directions.

2 Primal-dual damping algorithms for optimizations

In this section, we first review PDHG algorithms for constrained optimization problems. We then construct a saddle point problem for the unconstrained optimization problem and apply the preconditioned PDHG algorithm to compute the proposed saddle point problem. Lastly, we derive an ODE system by taking the limit of the stepsizes in the PDHG algorithm. It forms a second-order ODE, which generalizes the Hessian-driven damping dynamics. We analyze the convergence properties of the ODE system for quadratic optimization problems.

2.1 Review PDHG for constrained optimization

In Chambolle and Pock [2011], the following saddle point problem was considered:

\min_{x\in X}\max_{y\in Y}\langle Kx,y\rangle+G(x)-F^{*}(y)\,, (2.1)

where $X$ and $Y$ are two finite-dimensional real vector spaces equipped with inner product $\langle\cdot,\cdot\rangle$ and norm $\|\cdot\|=\langle\cdot,\cdot\rangle^{1/2}$. The map $K:X\to Y$ is a continuous linear operator. $G:X\to[0,+\infty]$ and $F^{*}:Y\to[0,+\infty]$ are proper, convex, lower semi-continuous (l.s.c.) functions. $F^{*}$ is the convex conjugate of a convex l.s.c. function $F$. It is straightforward to verify that (2.1) is the primal-dual formulation of the nonlinear primal problem

\min_{x\in X}F(Kx)+G(x)\,.

Then the PDHG algorithm for saddle point problem 2.1 is given by

y^{n+1}=(I+\sigma\partial F^{*})^{-1}(y^{n}+\sigma K\tilde{x}^{n})\,, (2.2a)
x^{n+1}=(I+\tau\partial G)^{-1}(x^{n}-\tau K^{*}y^{n+1})\,, (2.2b)
\tilde{x}^{n+1}=x^{n+1}+\omega(x^{n+1}-x^{n})\,, (2.2c)

where $I$ is the identity operator and $(I+\sigma\partial F^{*})^{-1}$ is the resolvent operator, which is defined in the same way as the proximal operator:

(I+\tau\partial F)^{-1}(y)=\operatorname*{arg\,min}_{x}\frac{\|x-y\|^{2}}{2\tau}+F(x)=\textrm{prox}_{\tau F}(y)\,.

When $\omega=1$, Chambolle and Pock [2011] proved convergence if $\tau\sigma\|K\|^{2}<1$, where $\|\cdot\|$ is the induced operator norm. It is worth noting that the convergence analysis requires that $K$ is a linear operator.
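
For concreteness, the resolvent map can be evaluated in closed form for simple choices of $F$. The following Python snippet is our own illustration (not part of the PDHG scheme above): it evaluates the proximal operator of $\tau\|\cdot\|_{1}$ (soft-thresholding) and of a quadratic $\tau\frac{1}{2}x^{T}Qx$; the function names prox_l1 and prox_quadratic are ours.

import numpy as np

def prox_l1(y, tau):
    # prox_{tau*||.||_1}(y): soft-thresholding, the unique minimizer of
    # ||x - y||^2 / (2*tau) + ||x||_1.
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)

def prox_quadratic(y, tau, Q):
    # prox of tau * (1/2) x^T Q x for symmetric positive semidefinite Q:
    # solve (I + tau*Q) x = y.
    return np.linalg.solve(np.eye(len(y)) + tau * Q, y)

y = np.array([2.0, -0.5, 0.1])
print(prox_l1(y, tau=0.3))
print(prox_quadratic(y, tau=0.3, Q=np.diag([1.0, 2.0, 3.0])))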

2.2 Saddle point problem for unconstrained optimization

We consider the problem of minimizing a $\mathcal{C}^{2}$ strongly convex function $f:\mathbb{R}^{d}\to\mathbb{R}$ over $\mathbb{R}^{d}$. Instead of directly solving for $\nabla f({\bm{x}}^{*})=0$, we consider the following saddle point problem:

\inf_{{\bm{x}}\in\mathbb{R}^{d}}\sup_{{\bm{p}}\in\mathbb{R}^{d}}\quad\langle\nabla f({\bm{x}}),{\bm{p}}\rangle\,, (2.3)

due to the following proposition.

Proposition 2.1.

Let $f:\mathbb{R}^{d}\to\mathbb{R}$ be a $\mathcal{C}^{2}$ strongly convex function. Then the saddle point of (2.3) is the unique global minimum of $f$.

Proof.

Directly differentiating 2.3 and setting the derivatives to 0 yields

\nabla f({\bm{x}}^{*})=0\,,
\nabla^{2}f({\bm{x}}^{*}){\bm{p}}^{*}=0\,.

By the strong convexity of $f$, we obtain that ${\bm{x}}^{*}$ is the unique global minimum and ${\bm{p}}^{*}=0$. ∎

Recall that ${\bm{p}}^{*}=0$ by the optimality condition. Thus we make the following change to our saddle point formulation by adding a regularization term to (2.3):

\inf_{{\bm{x}}\in\mathbb{R}^{d}}\sup_{{\bm{p}}\in\mathbb{R}^{d}}\quad\langle\nabla f({\bm{x}}),{\bm{p}}\rangle-\frac{\varepsilon}{2}\left\lVert{\bm{p}}\right\rVert^{2}\,, (2.4)

where $\varepsilon>0$ is a constant. This regularization term further drives ${\bm{p}}$ to $0$. Similar to Proposition 2.1, we have the following proposition.

Proposition 2.2.

Let $f:\mathbb{R}^{d}\to\mathbb{R}$ be a $\mathcal{C}^{2}$ strongly convex function. Then the saddle point of (2.4) is the unique global minimum of $f$.

Proof.

Directly differentiating 2.4 and setting derivatives to 0 yields

\nabla f({\bm{x}}^{*})=\varepsilon{\bm{p}}^{*}\,,
\nabla^{2}f({\bm{x}}^{*}){\bm{p}}^{*}=0\,.

Since $f$ is strongly convex, we have $\nabla^{2}f({\bm{x}}^{*})\succ 0$, and the second equation implies ${\bm{p}}^{*}=0$. Then the first equation implies $\nabla f({\bm{x}}^{*})=0$. Since $f$ is strongly convex, we conclude that ${\bm{x}}^{*}$ is the unique global minimum. ∎
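
To illustrate Proposition 2.2 on a concrete instance, the following snippet (our own example with a two-dimensional quadratic) checks that the first-order conditions of the saddle point problem (2.4) are satisfied at $({\bm{x}}^{*},0)$, where ${\bm{x}}^{*}$ is the global minimizer of $f$.

import numpy as np

# f(x) = 0.5 x^T Q x + b^T x, a strongly convex quadratic (our own example).
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
eps = 0.7

x_star = np.linalg.solve(Q, -b)      # unique global minimizer of f
p_star = np.zeros(2)

# First-order conditions of the saddle problem (2.4) at (x*, p*):
#   grad_x L = Hess f(x) p         = Q p,
#   grad_p L = grad f(x) - eps * p = Q x + b - eps * p.
grad_x_L = Q @ p_star
grad_p_L = (Q @ x_star + b) - eps * p_star
print(grad_x_L, grad_p_L)            # both vanish at (x*, 0)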

2.3 PDHG for unconstrained optimization

We apply the scheme given by (2.2) to the saddle point problem (2.4) (set $G=0$, $F^{*}({\bm{p}})=\frac{\varepsilon}{2}\|{\bm{p}}\|^{2}$, and identify $K{\bm{x}}$ with $\nabla f({\bm{x}})$ in (2.1)). Thus,

{\bm{p}}^{n+1}=\operatorname*{arg\,max}_{{\bm{p}}}\quad\langle\nabla f({\bm{x}}^{n}),{\bm{p}}\rangle-\frac{\varepsilon}{2}\left\lVert{\bm{p}}\right\rVert^{2}-\frac{\left\lVert{\bm{p}}-{\bm{p}}^{n}\right\rVert^{2}_{{\bm{A}}({\bm{x}}^{n})^{-1}}}{2\sigma}\,, (2.5a)
\widetilde{{\bm{p}}}^{n+1}=\omega({\bm{p}}^{n+1}-{\bm{p}}^{n})+{\bm{p}}^{n+1}\,, (2.5b)
{\bm{x}}^{n+1}=\operatorname*{arg\,min}_{{\bm{x}}}\quad\langle\nabla f({\bm{x}}),\widetilde{{\bm{p}}}^{n+1}\rangle+\frac{\left\lVert{\bm{x}}-{\bm{x}}^{n}\right\rVert^{2}_{{\bm{B}}({\bm{x}}^{n})^{-1}}}{2\tau}\,, (2.5c)

where we have added symmetric positive definite matrices ${\bm{A}}({\bm{x}}^{n}),{\bm{B}}({\bm{x}}^{n})\in\mathbb{R}^{d\times d}$ as preconditioners for the updates of ${\bm{p}}$ and ${\bm{x}}$, respectively. We also denote the norm $\left\lVert{\bm{h}}\right\rVert^{2}_{{\bm{A}}^{-1}}$ as ${\bm{h}}^{T}{\bm{A}}^{-1}{\bm{h}}$, where ${\bm{h}}\in\mathbb{R}^{d}$.

As mentioned, the convergence analysis of PDHG relies on the assumption that $K$ is a linear operator. So we cannot apply the same convergence analysis to (2.5), since $\nabla f({\bm{x}})$ is not necessarily linear in ${\bm{x}}$. By taking the optimality conditions of (2.5), we find that ${\bm{p}}^{n+1}$ and ${\bm{x}}^{n+1}$ solve

{\bm{p}}^{n+1}+\sigma\varepsilon{\bm{A}}({\bm{x}}^{n}){\bm{p}}^{n+1}-{\bm{p}}^{n}-\sigma{\bm{A}}({\bm{x}}^{n})\nabla f({\bm{x}}^{n})=0\,, (2.6a)
\tau{\bm{B}}({\bm{x}}^{n})\nabla^{2}f({\bm{x}}^{n+1})\big{[}(1+\omega)\big{(}{\mathbb{I}}+\sigma\varepsilon{\bm{A}}({\bm{x}}^{n})\big{)}^{-1}\sigma{\bm{A}}({\bm{x}}^{n})\nabla f({\bm{x}}^{n})+\big{(}(\omega+1)\big{(}{\mathbb{I}}+\sigma\varepsilon{\bm{A}}({\bm{x}}^{n})\big{)}^{-1}-\omega{\mathbb{I}}\big{)}{\bm{p}}^{n}\big{]}+({\bm{x}}^{n+1}-{\bm{x}}^{n})=0\,, (2.6b)

where we substitute the update Eq. 2.5b into Eq. 2.6b. We use ${\mathbb{I}}$ to represent the identity matrix in Eq. 2.6b.

Note that the update for ${\bm{x}}^{n+1}$ in Eq. 2.6b is implicit, unless $\nabla^{2}f({\bm{x}})$ does not depend on ${\bm{x}}$. We also remark that the update for ${\bm{x}}^{n+1}$ in Eq. 2.6b becomes explicit if we perform a gradient step instead of a proximal step in Eq. 2.5c. To be more precise, when ${\bm{B}}={\mathbb{I}}$, the linearized version of Eq. 2.5c can be written as

{\bm{x}}^{n+1}=\mathrm{prox}_{\tau\langle\nabla f(\cdot),\widetilde{{\bm{p}}}^{n+1}\rangle}({\bm{x}}^{n})\,.

Taking a gradient step instead of a proximal step yields

{\bm{x}}^{n+1}={\bm{x}}^{n}-\tau\nabla^{2}f({\bm{x}}^{n})\widetilde{{\bm{p}}}^{n+1}\,. (2.7)

For a general choice of preconditioner ${\bm{B}}({\bm{x}}^{n})$, the linearized version of Eq. 2.5c satisfies

{\bm{x}}^{n+1}={\bm{x}}^{n}-\tau{\bm{B}}({\bm{x}}^{n})\nabla^{2}f({\bm{x}}^{n})\widetilde{{\bm{p}}}^{n+1}={\bm{x}}^{n}-\tau{\bm{C}}({\bm{x}}^{n})\widetilde{{\bm{p}}}^{n+1}\,.

Here we always denote a matrix function ${\bm{C}}$ such that

{\bm{C}}({\bm{x}}^{n}):={\bm{B}}({\bm{x}}^{n})\nabla^{2}f({\bm{x}}^{n})\,.

For simplicity of presentation, we only consider the simple case where ${\bm{A}}({\bm{x}}^{n})=A{\mathbb{I}}$ for some $A>0$. We now summarize the linearized update Eq. 2.6 into the following algorithm.

Algorithm 1 Linearized Primal-Dual Damping Algorithm
Initial guesses ${\bm{x}}^{0}\in\mathbb{R}^{d}$, ${\bm{p}}^{0}\in\mathbb{R}^{d}$; stepsizes $\tau>0$, $\sigma>0$; parameters $A>0$, $\varepsilon>0$, $\omega>0$, ${\bm{C}}\succ 0$.
while $n=1,2,\cdots$, not converged do
    ${\bm{p}}^{n+1}=\frac{1}{1+\sigma\varepsilon A}{\bm{p}}^{n}+\frac{\sigma A}{1+\sigma\varepsilon A}\nabla f({\bm{x}}^{n})$;
    $\tilde{{\bm{p}}}^{n+1}={\bm{p}}^{n+1}+\omega({\bm{p}}^{n+1}-{\bm{p}}^{n})$;
    ${\bm{x}}^{n+1}={\bm{x}}^{n}-\tau{\bm{C}}({\bm{x}}^{n})\tilde{{\bm{p}}}^{n+1}$;
end while

We note that Algorithm 1 and update Eq. 2.6 are different methods for solving saddle point problem Eq. 2.3. In this paper, we focus on the computation and analysis of Algorithm 1.
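
For readers who prefer code, the following is a minimal Python sketch of Algorithm 1 with the simple choices ${\bm{A}}({\bm{x}})=A{\mathbb{I}}$ and ${\bm{C}}={\mathbb{I}}$, applied to a strongly convex quadratic. The function name pdd and the parameter values are our own illustrative choices, not the tuned settings used in the numerical sections.

import numpy as np

def pdd(grad_f, x0, tau=0.1, sigma=0.1, A=1.0, eps=1.0, omega=1.0,
        C=None, max_iter=1000, tol=1e-10):
    # Linearized primal-dual damping (Algorithm 1) with A(x) = A*I.
    # C is a fixed preconditioner matrix; C = None means C = I.
    x, p = x0.astype(float).copy(), np.zeros_like(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        # p-update: p^{n+1} = (p^n + sigma*A*grad f(x^n)) / (1 + sigma*eps*A)
        p_new = (p + sigma * A * g) / (1.0 + sigma * eps * A)
        # extrapolation: p_tilde = p^{n+1} + omega*(p^{n+1} - p^n)
        p_tilde = p_new + omega * (p_new - p)
        # x-update: x^{n+1} = x^n - tau * C * p_tilde
        x = x - tau * (p_tilde if C is None else C @ p_tilde)
        p = p_new
    return x

# Example: minimize f(x) = 0.5 x^T Q x for an ill-conditioned diagonal Q.
Q = np.diag([1.0, 10.0, 100.0])
x_out = pdd(lambda x: Q @ x, np.ones(3))
print(np.linalg.norm(Q @ x_out))   # gradient norm at the returned point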

2.4 PDD dynamics

An approach for analyzing optimization algorithms is to first study the continuous limit of the algorithm using ODEs [Su et al., 2015, Siegel, 2019, Attouch et al., 2019]. The advantage of doing so is that ODEs provide insights into the convergence properties of the algorithm.

We first reformulate the proposed algorithm Eq. 2.6 into a first-order ODE system.

Proposition 2.3.

As $\tau,\sigma\to 0$ and $\sigma\omega\to\gamma$, both the update in Eq. 2.6 and Algorithm 1 can be formulated as discrete-time updates of the following ODE system:

\dot{{\bm{p}}}={\bm{A}}({\bm{x}})\nabla f({\bm{x}})-\varepsilon{\bm{A}}({\bm{x}}){\bm{p}}\,, (2.8a)
\dot{{\bm{x}}}=-{\bm{C}}({\bm{x}})({\bm{p}}+\gamma({\bm{A}}({\bm{x}})\nabla f({\bm{x}})-\varepsilon{\bm{A}}({\bm{x}}){\bm{p}}))\,, (2.8b)

where ${\bm{C}}({\bm{x}})={\bm{B}}({\bm{x}})\nabla^{2}f({\bm{x}})$ and the initial condition satisfies ${\bm{x}}(0)={\bm{x}}^{0}$, ${\bm{p}}(0)={\bm{p}}^{0}$. Suppose that $\nabla f$ is Lipschitz continuous and each entry of the matrices ${\bm{A}}$, ${\bm{C}}$ is continuous and bounded. Then, there exists a unique solution to the ODE system Eq. 2.8. A stationary state $({\bm{x}}^{*},{\bm{p}}^{*})$ of the ODE system Eq. 2.8 satisfies

\nabla f({\bm{x}}^{*})=0,\quad{\bm{p}}^{*}=0.
Proof.

Rearranging Eq. 2.6a and Eq. 2.6b, we have

\frac{{\bm{p}}^{n+1}-{\bm{p}}^{n}}{\sigma}={\bm{A}}({\bm{x}}^{n})\nabla f({\bm{x}}^{n})-\varepsilon{\bm{A}}({\bm{x}}^{n}){\bm{p}}^{n+1}\,,
\frac{{\bm{x}}^{n+1}-{\bm{x}}^{n}}{\tau}=-{\bm{B}}({\bm{x}}^{n})\nabla^{2}f({\bm{x}}^{n+1})\big{[}(1+\omega)\big{(}{\mathbb{I}}+\sigma\varepsilon{\bm{A}}({\bm{x}}^{n})\big{)}^{-1}\sigma{\bm{A}}({\bm{x}}^{n})\nabla f({\bm{x}}^{n})+\big{(}(\omega+1)\big{(}{\mathbb{I}}+\sigma\varepsilon{\bm{A}}({\bm{x}}^{n})\big{)}^{-1}-\omega{\mathbb{I}}\big{)}{\bm{p}}^{n}\big{]}\,.

Taking the limit as $\tau,\sigma\to 0$ and $\sigma\omega\to\gamma$, we obtain

\dot{{\bm{p}}}={\bm{A}}({\bm{x}})\nabla f({\bm{x}})-\varepsilon{\bm{A}}({\bm{x}}){\bm{p}}\,,
\dot{{\bm{x}}}=-{\bm{B}}({\bm{x}})\nabla^{2}f({\bm{x}})({\bm{p}}+\gamma({\bm{A}}({\bm{x}})\nabla f({\bm{x}})-\varepsilon{\bm{A}}({\bm{x}}){\bm{p}}))\,.

Similarly, the update in Algorithm 1 also converges to the ODE system Eq. 2.8. Clearly, a stationary state satisfies ${\bm{p}}^{*}=0$, $\nabla f({\bm{x}}^{*})=0$. ∎

Proposition 2.4 (Primal-dual damping second order ODE).

The ODE system Eq. 2.8 satisfies the following second-order ODE

\ddot{{\bm{x}}}+\big{[}\varepsilon{\bm{A}}+\gamma{\bm{C}}{\bm{A}}\nabla^{2}f({\bm{x}})-\dot{{\bm{C}}}{\bm{C}}^{-1}\big{]}\dot{{\bm{x}}}+{\bm{C}}{\bm{A}}\nabla f({\bm{x}})=0\,. (2.9)

Here $\dot{\bm{C}}=\frac{d}{dt}{\bm{C}}({\bm{x}}(t))$.

The proof follows by direct calculations and can be found in Appendix C. We note that the formulation given by Eq. 2.9 includes several important special cases in the literature. In a word, we view Eq. 2.9 as a preconditioned accelerated gradient flow.

Example 2.1.

Let ${\bm{C}}={\bm{A}}={\mathbb{I}}$ and $\gamma\neq 0$. Then equation Eq. 2.9 satisfies

\ddot{{\bm{x}}}+\varepsilon\dot{{\bm{x}}}+\gamma\nabla^{2}f({\bm{x}})\dot{{\bm{x}}}+\nabla f({\bm{x}})=0\,, (2.10)

which is an inertial system with Hessian-driven damping [Attouch et al., 2020].

Remark 2.5.

In the case of ${\bm{C}}={\bm{A}}={\mathbb{I}}$, although the derived second-order ODE Eq. 2.9 is the same as the one in Attouch et al. [2020] at the continuous-time level, our Algorithm 1 provides a different time discretization from the one in Attouch et al. [2020].

Example 2.2.

Let ${\bm{C}}={\bm{A}}={\mathbb{I}}$, $\gamma(t)=0$. Then equation Eq. 2.9 satisfies the heavy ball ODE [Siegel, 2019]

\ddot{{\bm{x}}}+\varepsilon\dot{{\bm{x}}}+\nabla f({\bm{x}})=0\,. (2.11)
Example 2.3.

Let ${\bm{C}}={\bm{A}}={\mathbb{I}}$, $\gamma(t)=0$, $\varepsilon(t)=\frac{3}{t}$. Then equation Eq. 2.9 satisfies the Nesterov ODE [Su et al., 2015]:

\ddot{{\bm{x}}}+\frac{3}{t}\dot{{\bm{x}}}+\nabla f({\bm{x}})=0\,. (2.12)
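
The special cases above can be explored numerically. The sketch below is our own illustration: it integrates the PDD ODE system Eq. 2.8 with ${\bm{C}}={\bm{A}}={\mathbb{I}}$ by forward Euler on a quadratic objective, so that $\gamma=0$ recovers the heavy ball ODE Eq. 2.11 and $\gamma>0$ adds the Hessian-driven damping of Eq. 2.10; the step size, horizon, and parameter values are arbitrary choices for illustration.

import numpy as np

def integrate_pdd_ode(grad_f, x0, eps=1.0, gamma=0.5, dt=1e-3, T=20.0):
    # Forward-Euler integration of the PDD ODE (2.8) with C = A = I:
    #   p' = grad f(x) - eps*p,
    #   x' = -(p + gamma*(grad f(x) - eps*p)).
    x, p = x0.astype(float).copy(), np.zeros_like(x0, dtype=float)
    for _ in range(int(T / dt)):
        g = grad_f(x)
        dp = g - eps * p
        dx = -(p + gamma * dp)
        x, p = x + dt * dx, p + dt * dp
    return x

# Quadratic f(x) = 0.5 x^T Q x: gamma = 0 gives the heavy ball ODE (2.11),
# gamma > 0 adds the Hessian-driven damping term of (2.10).
Q = np.diag([1.0, 50.0])
for gamma in (0.0, 0.5):
    x_T = integrate_pdd_ode(lambda x: Q @ x, np.array([1.0, 1.0]), gamma=gamma)
    print(gamma, np.linalg.norm(x_T))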

We next provide a convergence analysis of ODE Eq. 2.8 for quadratic optimization problems. We demonstrate the importance of preconditioners in characterizing the convergence speed of ODE Eq. 2.8.

Theorem 2.6.

Suppose $f({\bm{x}})=\frac{1}{2}{\bm{x}}^{T}{\bm{Q}}{\bm{x}}$ for some symmetric positive definite matrix ${\bm{Q}}\in\mathbb{R}^{d\times d}$. Assume ${\bm{A}}$, ${\bm{B}}$ are constant matrices. In this case, equation Eq. 2.8 satisfies the linear ODE system:

\begin{pmatrix}\dot{{\bm{x}}}\\ \dot{{\bm{p}}}\end{pmatrix}=\begin{pmatrix}-\gamma{\bm{B}}{\bm{Q}}{\bm{A}}{\bm{Q}}&-{\bm{B}}{\bm{Q}}({\mathbb{I}}-\gamma\varepsilon{\bm{A}})\\ {\bm{A}}{\bm{Q}}&-\varepsilon{\bm{A}}\end{pmatrix}\begin{pmatrix}{\bm{x}}\\ {\bm{p}}\end{pmatrix}\,.

Suppose that ${\bm{A}}$ commutes with ${\bm{Q}}$, such that ${\bm{A}}{\bm{Q}}={\bm{Q}}{\bm{A}}$. Suppose ${\bm{A}}$ and ${\bm{B}}{\bm{Q}}{\bm{A}}{\bm{Q}}$ are simultaneously diagonalizable and have positive eigenvalues. Let $\mu_{1}\geq\ldots\geq\mu_{n}>0$ be the eigenvalues of ${\bm{B}}{\bm{Q}}{\bm{A}}{\bm{Q}}$ and $a_{i}$ the $i$-th eigenvalue of ${\bm{A}}$ (not necessarily in descending order) in the same basis. Then

  (a) The solution of the ODE system 2.8 converges to $({\bm{x}}^{*},{\bm{p}}^{*})=(0,0)$:

    \|({\bm{x}}(t),{\bm{p}}(t))\|\leq\|({\bm{x}}_{0},{\bm{p}}_{0})\|\exp(\alpha t)\,,

    where

    \alpha=\max_{i}\frac{1}{2}\big{[}-\gamma\mu_{i}-\varepsilon a_{i}+\Re\big{(}\sqrt{(\gamma\mu_{i}+\varepsilon)^{2}-4\mu_{i}}\big{)}\big{]}\,.

  (b) When ${\bm{A}}={\mathbb{I}},\varepsilon=0$, the optimal convergence rate is achieved at $\gamma^{*}=\frac{2\sqrt{\mu_{1}}}{\sqrt{\mu_{n}(2\mu_{1}-\mu_{n})}}$. The corresponding rate is $\alpha=\frac{-\sqrt{\mu_{n}}}{\sqrt{2-\frac{1}{\kappa}}}$, where $\kappa=\mu_{1}/\mu_{n}>1$.

  (c) Moreover, when $\gamma=\varepsilon=0$, the system will not converge for any initial data $({\bm{x}}_{0},{\bm{p}}_{0})\neq(0,0)$.

  (d) If ${\bm{A}}={\mathbb{I}}$, $\gamma\leq\frac{1}{\sqrt{\mu_{1}}}$, $\varepsilon=2\sqrt{\mu^{\prime}}-\gamma\mu^{\prime}$ for some $\mu^{\prime}\leq\mu_{n}$, then

    \alpha=-\sqrt{\mu^{\prime}}-\frac{\gamma}{2}(\mu_{n}-\mu^{\prime})\leq-\sqrt{\mu^{\prime}}\,.

We defer the proof to Appendix B.

Remark 2.7.

If $\omega$ is bounded, then we have $\gamma=\mathcal{O}(\sigma)$. Then, in the limit as $\sigma\to 0$, we also have $\gamma\to 0$. By Theorem 2.6 (c), the ODE system 2.8 does not converge for any initial data.

Remark 2.8.

If $\mu^{\prime}$ is an estimate of the smallest eigenvalue $\mu_{n}$, then the convergence speed of the solution of the heavy ball ODE is $\exp(-\sqrt{\mu^{\prime}}t)$. In Theorem 2.6 (d), if $\gamma=0$ and $\mu^{\prime}=\mu_{n}$, then $\alpha=-\sqrt{\mu_{n}}$, which is the same as the convergence rate of the heavy ball ODE [Siegel, 2019]. However, if $\gamma>0$ and $\mu^{\prime}<\mu_{n}$, then we have $\alpha=-\sqrt{\mu^{\prime}}-\frac{\gamma}{2}(\mu_{n}-\mu^{\prime})<-\sqrt{\mu^{\prime}}$, which converges faster than the heavy ball ODE.
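
As a numerical sanity check of the rate formula in Theorem 2.6 (a), the snippet below (our own illustration, taking ${\bm{A}}={\bm{B}}={\mathbb{I}}$ and a diagonal ${\bm{Q}}$, so that the $\mu_{i}$ are the eigenvalues of ${\bm{Q}}^{2}$ and $a_{i}=1$) compares the spectral abscissa of the linear ODE system with the closed-form expression for $\alpha$.

import numpy as np

# Quadratic f(x) = 0.5 x^T Q x with A = B = I, so mu_i are eigenvalues of Q^2.
q = np.array([1.0, 3.0, 10.0])          # eigenvalues of the diagonal Q
Q = np.diag(q)
I = np.eye(3)
gamma, eps = 0.2, 0.5

# Block matrix of the linear ODE system in Theorem 2.6 (with A = B = I).
M = np.block([[-gamma * Q @ Q, -Q * (1.0 - gamma * eps)],
              [Q,              -eps * I]])
abscissa = np.max(np.linalg.eigvals(M).real)

# Closed-form rate from Theorem 2.6 (a) with a_i = 1 and mu_i = q_i^2.
mu = q ** 2
alpha = np.max(0.5 * (-gamma * mu - eps
                      + np.real(np.sqrt((gamma * mu + eps) ** 2 - 4 * mu + 0j))))
print(abscissa, alpha)   # the two values agree up to numerical error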

3 Lyapunov Analysis

In this section, we present the main theoretical result of this paper. We provide the convergence analysis for general objective functions in both continuous-time ODEs Eq. 2.8 and discrete-time Algorithm 1. From now on, we make the following two assumptions for the convergence analysis.

Assumption 3.1.

There exist two constants $L\geq\mu>0$ such that $\mu{\mathbb{I}}\preceq{\bm{C}}_{0}({\bm{x}})\preceq L{\mathbb{I}}$ for all ${\bm{x}}$, where ${\bm{C}}_{0}({\bm{x}})=\nabla^{2}f({\bm{x}}){\bm{B}}({\bm{x}})\nabla^{2}f({\bm{x}})$, and $\mu\leq 1$.

Assumption 3.2.

There exists a constant $L^{\prime}>0$ such that

{\bm{C}}({\bm{x}})^{T}\big{(}\nabla^{3}f({\bm{x}})\nabla f({\bm{x}})+(\nabla^{2}f({\bm{x}}))^{2}\big{)}{\bm{C}}({\bm{x}})\preceq L^{\prime}{\mathbb{I}} (3.1)

for all ${\bm{x}}$, where ${\bm{C}}({\bm{x}})={\bm{B}}({\bm{x}})\nabla^{2}f({\bm{x}})$.

3.1 Continuous time Lyapunov analysis

In this subsection, we establish convergence results of the ODE system Eq. 2.8.

Theorem 3.3.

Consider the ODE system Eq. 2.8 with an initial condition $({\bm{x}}(0),{\bm{p}}(0))\in\mathbb{R}^{2d}$. Define the functional

{\mathcal{I}}({\bm{x}},{\bm{p}})=\frac{1}{2}(\|{\bm{p}}\|^{2}+\|\nabla f({\bm{x}})\|^{2})\,. (3.2)

Suppose Assumption 3.1 holds. Then we have

{\mathcal{I}}({\bm{x}}(t),{\bm{p}}(t))\leq{\mathcal{I}}({\bm{x}}(0),{\bm{p}}(0))\exp(-2\lambda t)\,, (3.3)

where

\lambda=\min\Big{\{}\mu\gamma A-\frac{1}{2}|A-\mu(1-\varepsilon\gamma A)|,\ L\gamma A-\frac{1}{2}|A-L(1-\varepsilon\gamma A)|,\ \varepsilon A-\frac{1}{2}|A-\mu(1-\varepsilon\gamma A)|,\ \varepsilon A-\frac{1}{2}|A-L(1-\varepsilon\gamma A)|\Big{\}}\,.

In particular, when $\gamma=\frac{1}{\mu},\varepsilon=1,A=\frac{\mu+L}{2+(\mu+L)\varepsilon\gamma}$, then $\lambda=\frac{\mu}{2}$.

Proof.

It is straightforward to compute the following:

\frac{\operatorname{d}\!{{\mathcal{I}}}}{\operatorname{d}\!{t}}=\langle{\bm{p}},\dot{{\bm{p}}}\rangle+\langle\nabla f,\nabla^{2}f\dot{{\bm{x}}}\rangle=-\nabla f^{T}{\bm{C}}_{0}\gamma{\bm{A}}\nabla f-{\bm{p}}^{T}\varepsilon{\bm{A}}{\bm{p}}+\nabla f^{T}\big{(}{\bm{A}}-{\bm{C}}_{0}({\mathbb{I}}-\varepsilon\gamma{\bm{A}})\big{)}{\bm{p}}\,. (3.4)

We shall find λ\lambda such that ddt+2λ0\frac{\operatorname{d}\!{{\mathcal{I}}}}{\operatorname{d}\!{t}}+2\lambda{\mathcal{I}}\leq 0. Then we obtain the exponential convergence by Gronwall’s inequality, i.e.,

(𝒙(t),𝒑(t))(𝒙(0),𝒑(0))exp(2λt).{\mathcal{I}}({\bm{x}}(t),{\bm{p}}(t))\leq{\mathcal{I}}({\bm{x}}(0),{\bm{p}}(0))\exp(-2\lambda t)\ .

We can compute

ddt+2λ\displaystyle\frac{\operatorname{d}\!{{\mathcal{I}}}}{\operatorname{d}\!{t}}+2\lambda{\mathcal{I}} =fT(𝑪0γ𝑨+λ𝕀)f+𝒑T(ε𝑨+λ𝕀)𝒑\displaystyle=\nabla f^{T}\big{(}-{\bm{C}}_{0}\gamma{\bm{A}}+\lambda{\mathbb{I}}\big{)}\nabla f+{\bm{p}}^{T}\big{(}-\varepsilon{\bm{A}}+\lambda{\mathbb{I}}\big{)}{\bm{p}}
+fT(𝑨𝑪0(𝕀εγ𝑨))𝒑.\displaystyle\qquad+\nabla f^{T}\big{(}{\bm{A}}-{\bm{C}}_{0}({\mathbb{I}}-\varepsilon\gamma{\bm{A}})\big{)}{\bm{p}}\,. (3.5)

By Lemma A.1, we obtain the following sufficient conditions for ddt+2λ0\frac{\operatorname{d}\!{{\mathcal{I}}}}{\operatorname{d}\!{t}}+2\lambda{\mathcal{I}}\leq 0

εA+λ+12|ξi(1εγA)A|\displaystyle-\varepsilon A+\lambda+\frac{1}{2}|\xi_{i}(1-\varepsilon\gamma A)-A| 0\displaystyle\leq 0 (3.6a)
λξiγA+12|ξi(1εγA)A|\displaystyle\lambda-\xi_{i}\gamma A+\frac{1}{2}|\xi_{i}(1-\varepsilon\gamma A)-A| 0\displaystyle\leq 0 (3.6b)

where $\xi_{i}({\bm{x}})$ is an eigenvalue of ${\bm{C}}_{0}({\bm{x}})$. By our assumptions, we have $L\geq\xi_{1}({\bm{x}})\geq\ldots\geq\xi_{n}({\bm{x}})\geq\mu$. Eq. 3.6 gives two upper bounds on $\lambda$. Define $g_{1}(\xi)=\varepsilon A-\frac{1}{2}|\xi(1-\varepsilon\gamma A)-A|$ and $g_{2}(\xi)=\xi\gamma A-\frac{1}{2}|\xi(1-\varepsilon\gamma A)-A|$ on the interval $[\mu,L]$. Then Eq. 3.6 implies that

\lambda\leq g_{j}(\xi_{i})\,, (3.7)

for all $i=1,\ldots,n$ and $j=1,2$. Since each $g_{j}(\xi)$ is piecewise linear in $\xi$, it is not hard to see that

\min_{\xi\in[\mu,L]}g_{j}(\xi)=\min\{g_{j}(\mu),g_{j}(L)\}\,,

for $j=1,2$. This proves the formula for $\lambda$. When $A=\frac{\mu+L}{2+(\mu+L)\varepsilon\gamma}$, we have $g_{1}(\mu)=g_{1}(L)$, and

\mu(1-\varepsilon\gamma A)-A=-L(1-\varepsilon\gamma A)+A\,.

Further, requiring $g_{1}(\mu)=g_{2}(\mu)$ yields $\varepsilon=\mu\gamma$. And we obtain

\lambda=\mu\gamma A-\frac{1}{2}|A-\mu(1-\varepsilon\gamma A)|=\mu\gamma A-\frac{1}{2}(A-\mu(1-\varepsilon\gamma A))=\frac{\mu}{2}+A(\gamma\mu-\frac{1}{2}\gamma^{2}\mu^{2}-\frac{1}{2})=\frac{\mu}{2}-\frac{A}{2}(\gamma\mu-1)^{2}. (3.8)

We note that $\lambda$ is maximized by taking $\gamma=\mu^{-1}$, and we obtain $\lambda=\frac{\mu}{2}$. ∎
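
As a quick numeric check of this parameter choice, the following snippet (our own illustration, with hypothetical values of $\mu$ and $L$) evaluates the four terms in the formula for $\lambda$ with $\gamma=1/\mu$, $\varepsilon=1$, $A=\frac{\mu+L}{2+(\mu+L)\varepsilon\gamma}$ and confirms that their minimum agrees with $\mu/2$ up to floating-point rounding.

# Hypothetical values of the bounds mu and L from Assumption 3.1.
mu, L = 0.5, 4.0

gamma, eps = 1.0 / mu, 1.0
A = (mu + L) / (2.0 + (mu + L) * eps * gamma)

lam = min(mu * gamma * A - 0.5 * abs(A - mu * (1 - eps * gamma * A)),
          L * gamma * A - 0.5 * abs(A - L * (1 - eps * gamma * A)),
          eps * A - 0.5 * abs(A - mu * (1 - eps * gamma * A)),
          eps * A - 0.5 * abs(A - L * (1 - eps * gamma * A)))
print(lam, mu / 2)   # the two values agree up to rounding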

3.2 Discrete time Lyapunov analysis

In this subsection, we study the convergence criterion for the discretized linearized PDHG flow given by Eq. 1.2 and Algorithm 1.

From now on, we assume that $f$ is a $\mathcal{C}^{4}$ strongly convex function.

We can rewrite the iterations as

𝒑n+1\displaystyle{\bm{p}}^{n+1} =11+σεA𝒑n+σA1+σεAf(𝒙n),\displaystyle=\frac{1}{1+\sigma\varepsilon A}{\bm{p}}^{n}+\frac{\sigma A}{1+\sigma\varepsilon A}\nabla f({\bm{x}}^{n})\,, (3.9a)
𝒙n+1\displaystyle{\bm{x}}^{n+1} =𝒙nτ𝑩(𝒙n)2f(𝒙n)(1εγA1+σεA𝒑n+σA+γA1+σεAf(𝒙n)),\displaystyle={\bm{x}}^{n}-\tau{\bm{B}}({\bm{x}}^{n})\nabla^{2}f({\bm{x}}^{n})\left(\frac{1-\varepsilon\gamma A}{1+\sigma\varepsilon A}{\bm{p}}^{n}+\frac{\sigma A+\gamma A}{1+\sigma\varepsilon A}\nabla f({\bm{x}}^{n})\right)\,, (3.9b)

where γ=σω\gamma=\sigma\omega. We define the following notations which will be used later.

𝑵(𝒙n)=11+σεA(𝑩(𝒙n)2f(𝒙n)(σA+γA)𝑩(𝒙n)2f(𝒙n)(1εγA)στAστεA).{\bm{N}}({\bm{x}}^{n})=\frac{1}{1+\sigma\varepsilon A}\begin{pmatrix}{\bm{B}}({\bm{x}}^{n})\nabla^{2}f({\bm{x}}^{n})(\sigma A+\gamma A)&{\bm{B}}({\bm{x}}^{n})\nabla^{2}f({\bm{x}}^{n})(1-\varepsilon\gamma A)\\ -\frac{\sigma}{\tau}A&\frac{\sigma}{\tau}\varepsilon A\end{pmatrix}\,. (3.10)

And

𝑯(𝒙n)=sym((2f(𝐱n)00𝕀)𝐍(𝐱n)).{\bm{H}}({\bm{x}}^{n})=\rm{sym}\begin{pmatrix}\begin{pmatrix}\nabla^{2}f({\bm{x}}^{n})&0\\ 0&{\mathbb{I}}\end{pmatrix}\cdot{\bm{N}}({\bm{x}}^{n})\end{pmatrix}\,.
Remark 3.4.

The matrices ${\bm{N}}({\bm{x}}^{n})$ and ${\bm{H}}({\bm{x}}^{n})$ also depend on $\tau$, $\sigma$, $A$, $\varepsilon$, and $\omega$.

Define the Lyapunov functional in discrete time as

(𝒙n,𝒑n)=12f(𝒙n)2+12𝒑n2.{\mathcal{I}}({\bm{x}}^{n},{\bm{p}}^{n})=\frac{1}{2}\|\nabla f({\bm{x}}^{n})\|^{2}+\frac{1}{2}\|{\bm{p}}^{n}\|^{2}\,.
Theorem 3.5.

Suppose that there exist positive constants $\lambda,M_{1}\in\mathbb{R}_{+}$ such that

{\bm{H}}({\bm{x}})\succeq\lambda{\mathbb{I}}\,,
{\bm{N}}({\bm{x}})^{T}\nabla^{2}{\mathcal{I}}(\tilde{{\bm{x}}},\tilde{{\bm{p}}}){\bm{N}}({\bm{x}})\preceq M_{1}{\mathbb{I}},

for all ${\bm{x}},\tilde{{\bm{x}}}\in\mathbb{R}^{n}$. If $\tau=a\frac{\lambda}{M_{1}}$ for some $a\in(0,2)$, then the functional ${\mathcal{I}}({\bm{x}}^{n},{\bm{p}}^{n})$ decreases geometrically, i.e.,

{\mathcal{I}}({\bm{x}}^{n},{\bm{p}}^{n})\leq{\mathcal{I}}({\bm{x}}^{0},{\bm{p}}^{0})\big{(}1+(a^{2}-2a)\frac{\lambda^{2}}{2M_{1}}\big{)}^{n}\,.
Proof.

It follows from our definition of 𝑵(𝒙n){\bm{N}}({\bm{x}}^{n}) that

(𝒙n+1𝒙n𝒑n+1𝒑n)=τ𝑵(𝒙n)(f(𝒙n)𝒑n),\begin{pmatrix}{\bm{x}}^{n+1}-{\bm{x}}^{n}\\ {\bm{p}}^{n+1}-{\bm{p}}^{n}\end{pmatrix}=-\tau{\bm{N}}({\bm{x}}^{n})\begin{pmatrix}\nabla f({\bm{x}}^{n})\\ {\bm{p}}^{n}\end{pmatrix}\,, (3.11)

By the mean-value theorem, we obtain

(𝒙n+1,𝒑n+1)(𝒙n,𝒑n)\displaystyle{\mathcal{I}}({\bm{x}}^{n+1},{\bm{p}}^{n+1})-{\mathcal{I}}({\bm{x}}^{n},{\bm{p}}^{n})
=\displaystyle= (𝒙(𝒙n,𝒑n)𝒑(𝒙n,𝒑n))T(𝒙n+1𝒙n𝒑n+1𝒑n)+12(𝒙n+1𝒙n𝒑n+1𝒑n)T2(𝒙~,𝒑~)(𝒙n+1𝒙n𝒑n+1𝒑n)\displaystyle\begin{pmatrix}\nabla_{{\bm{x}}}{\mathcal{I}}({\bm{x}}^{n},{\bm{p}}^{n})\\ \nabla_{{\bm{p}}}{\mathcal{I}}({\bm{x}}^{n},{\bm{p}}^{n})\end{pmatrix}^{T}\begin{pmatrix}{\bm{x}}^{n+1}-{\bm{x}}^{n}\\ {\bm{p}}^{n+1}-{\bm{p}}^{n}\end{pmatrix}+\frac{1}{2}\begin{pmatrix}{\bm{x}}^{n+1}-{\bm{x}}^{n}\\ {\bm{p}}^{n+1}-{\bm{p}}^{n}\end{pmatrix}^{T}\nabla^{2}{\mathcal{I}}(\tilde{{\bm{x}}},\tilde{{\bm{p}}})\begin{pmatrix}{\bm{x}}^{n+1}-{\bm{x}}^{n}\\ {\bm{p}}^{n+1}-{\bm{p}}^{n}\end{pmatrix}

where (𝒙~,𝒑~)(\tilde{{\bm{x}}},\tilde{{\bm{p}}}) is in between (𝒙n+1,𝒑n+1)({\bm{x}}^{n+1},{\bm{p}}^{n+1}) and (𝒙n,𝒑n)({\bm{x}}^{n},{\bm{p}}^{n}). And

𝒙(𝒙n,𝒑n)\displaystyle\nabla_{{\bm{x}}}{\mathcal{I}}({\bm{x}}^{n},{\bm{p}}^{n}) =2f(𝒙n)f(𝒙n),\displaystyle=\nabla^{2}f({\bm{x}}^{n})\nabla f({\bm{x}}^{n})\,,
𝒑(𝒙n,𝒑n)\displaystyle\nabla_{{\bm{p}}}{\mathcal{I}}({\bm{x}}^{n},{\bm{p}}^{n}) =𝒑n,\displaystyle={\bm{p}}^{n}\,,
2(𝒙n,𝒑n)\displaystyle\nabla^{2}{\mathcal{I}}({\bm{x}}^{n},{\bm{p}}^{n}) =(3f(𝒙n)f(𝒙n)+2f(𝒙n)2f(𝒙n)00𝕀).\displaystyle=\begin{pmatrix}\nabla^{3}f({\bm{x}}^{n})\nabla f({\bm{x}}^{n})+\nabla^{2}f({\bm{x}}^{n})\nabla^{2}f({\bm{x}}^{n})&0\\ 0&{\mathbb{I}}\end{pmatrix}\,.

Then using Eq. 3.11 and definition of 𝑯(𝒙n){\bm{H}}({\bm{x}}^{n}), we obtain

(𝒙n+1,𝒑n+1)(𝒙n,𝒑n)\displaystyle{\mathcal{I}}({\bm{x}}^{n+1},{\bm{p}}^{n+1})-{\mathcal{I}}({\bm{x}}^{n},{\bm{p}}^{n})
=\displaystyle= τ(f(𝒙n)𝒑n)T(2f(𝒙n)00𝕀)𝑵(𝒙n)(f(𝒙n)𝒑n)\displaystyle-\tau\begin{pmatrix}\nabla f({\bm{x}}^{n})\\ {\bm{p}}^{n}\end{pmatrix}^{T}\begin{pmatrix}\nabla^{2}f({\bm{x}}^{n})&0\\ 0&{\mathbb{I}}\end{pmatrix}\cdot{\bm{N}}({\bm{x}}^{n})\begin{pmatrix}\nabla f({\bm{x}}^{n})\\ {\bm{p}}^{n}\end{pmatrix}
+τ22(f(𝒙n)𝒑n)T𝑵(𝒙n)T2(𝒙~,𝒑~)𝑵(𝒙n)(f(𝒙n)𝒑n)\displaystyle+\frac{\tau^{2}}{2}\begin{pmatrix}\nabla f({\bm{x}}^{n})\\ {\bm{p}}^{n}\end{pmatrix}^{T}{\bm{N}}({\bm{x}}^{n})^{T}\nabla^{2}{\mathcal{I}}(\tilde{{\bm{x}}},\tilde{{\bm{p}}}){\bm{N}}({\bm{x}}^{n})\begin{pmatrix}\nabla f({\bm{x}}^{n})\\ {\bm{p}}^{n}\end{pmatrix}
=\displaystyle= τ(f(𝒙n)𝒑n)T𝑯(𝒙n)(f(𝒙n)𝒑n)\displaystyle-\tau\begin{pmatrix}\nabla f({\bm{x}}^{n})\\ {\bm{p}}^{n}\end{pmatrix}^{T}{\bm{H}}({\bm{x}}^{n})\begin{pmatrix}\nabla f({\bm{x}}^{n})\\ {\bm{p}}^{n}\end{pmatrix}
+τ22(f(𝒙n)𝒑n)T𝑵(𝒙n)T2(𝒙~,𝒑~)𝑵(𝒙n)(f(𝒙n)𝒑n),\displaystyle+\frac{\tau^{2}}{2}\begin{pmatrix}\nabla f({\bm{x}}^{n})\\ {\bm{p}}^{n}\end{pmatrix}^{T}{\bm{N}}({\bm{x}}^{n})^{T}\nabla^{2}{\mathcal{I}}(\tilde{{\bm{x}}},\tilde{{\bm{p}}}){\bm{N}}({\bm{x}}^{n})\begin{pmatrix}\nabla f({\bm{x}}^{n})\\ {\bm{p}}^{n}\end{pmatrix}\,, (3.12)

From Eq. 3.2 and our assumption on 𝑵(𝒙){\bm{N}}({\bm{x}}) and 𝑯(𝒙){\bm{H}}({\bm{x}}), we obtain

(𝒙n+1,𝒑n+1)(𝒙n,𝒑n)\displaystyle{\mathcal{I}}({\bm{x}}^{n+1},{\bm{p}}^{n+1})-{\mathcal{I}}({\bm{x}}^{n},{\bm{p}}^{n}) (τλ+τ2M12)(𝒙n,𝒑n)\displaystyle\leq\big{(}-\tau\lambda+\frac{\tau^{2}M_{1}}{2}\big{)}{\mathcal{I}}({\bm{x}}^{n},{\bm{p}}^{n})
=M12((τλM1)2λ2M12)(𝒙n,𝒑n)\displaystyle=\frac{M_{1}}{2}\big{(}(\tau-\frac{\lambda}{M_{1}})^{2}-\frac{\lambda^{2}}{M_{1}^{2}}\big{)}{\mathcal{I}}({\bm{x}}^{n},{\bm{p}}^{n})
=(a22a)λ22M1(𝒙n,𝒑n),\displaystyle=(a^{2}-2a)\frac{\lambda^{2}}{2M_{1}}{\mathcal{I}}({\bm{x}}^{n},{\bm{p}}^{n})\,, (3.13)

where we used τ=aλM1\tau=a\frac{\lambda}{M_{1}}. Hence,

(𝒙n+1,𝒑n+1)(𝒙n,𝒑n)(1+(a22a)λ22M1)(𝒙0,𝒑0)(1+(a22a)λ22M1)n+1.{\mathcal{I}}({\bm{x}}^{n+1},{\bm{p}}^{n+1})\leq{\mathcal{I}}({\bm{x}}^{n},{\bm{p}}^{n})\big{(}1+(a^{2}-2a)\frac{\lambda^{2}}{2M_{1}}\big{)}\leq{\mathcal{I}}({\bm{x}}^{0},{\bm{p}}^{0})\big{(}1+(a^{2}-2a)\frac{\lambda^{2}}{2M_{1}}\big{)}^{n+1}\,.

When 0<a<20<a<2, we have a22a<0a^{2}-2a<0. Thus we obtain the desired convergence result. ∎

Theorem 3.6.

Let f:df:\mathbb{R}^{d}\to\mathbb{R} be a 𝒞4\mathcal{C}^{4} strongly convex function. Suppose (𝐱0,𝐩0)({\bm{x}}^{0},{\bm{p}}^{0}) satisfies

(𝒙0,𝒑0)1/2δτD0𝑵(𝒙)23,{\mathcal{I}}({\bm{x}}^{0},{\bm{p}}^{0})^{1/2}\leq\frac{\delta}{\tau D_{0}\|{\bm{N}}({\bm{x}})\|_{2}^{3}}\,, (3.14)

for some δ>0\delta>0 and all 𝐱{\bm{x}}. Here

D0=sup𝒙,𝒑,𝒙,𝒑(𝒙𝒑)T(3(𝒙,𝒑)(𝒙𝒑))(𝒙𝒑)(𝒙𝒑)23.D_{0}=\sup_{{\bm{x}},{\bm{p}},{\bm{x}}^{\prime},{\bm{p}}^{\prime}}\frac{\begin{pmatrix}{\bm{x}}^{\prime}\\ {\bm{p}}^{\prime}\end{pmatrix}^{T}\left(\nabla^{3}{\mathcal{I}}({\bm{x}},{\bm{p}})\begin{pmatrix}{\bm{x}}^{\prime}\\ {\bm{p}}^{\prime}\end{pmatrix}\right)\begin{pmatrix}{\bm{x}}^{\prime}\\ {\bm{p}}^{\prime}\end{pmatrix}}{\left\|\begin{pmatrix}{\bm{x}}^{\prime}\\ {\bm{p}}^{\prime}\end{pmatrix}\right\|_{2}^{3}}\,.

Suppose further that there exist positive constants $\lambda,M_{2}\in\mathbb{R}_{+}$ such that

𝑯(𝒙)\displaystyle{\bm{H}}({\bm{x}}) λ𝕀,\displaystyle\succeq\lambda{\mathbb{I}}\,,
𝑵(𝒙)T2(𝒙,𝒑)𝑵(𝒙)\displaystyle{\bm{N}}({\bm{x}})^{T}\nabla^{2}{\mathcal{I}}({\bm{x}},{\bm{p}}){\bm{N}}({\bm{x}}) M2𝕀\displaystyle\preceq M_{2}{\mathbb{I}}

for all 𝐱n{\bm{x}}\in\mathbb{R}^{n}. If τ=aλM2+δ\tau=a\frac{\lambda}{M_{2}+\delta} for some a(0,2)a\in(0,2), then the functional (𝐱n,𝐩n){\mathcal{I}}({\bm{x}}^{n},{\bm{p}}^{n}) decreases geometrically, i.e.

(𝒙n,𝒑n)(𝒙0,𝒑0)(1+a22a2λ2M2+δ)n.{\mathcal{I}}({\bm{x}}^{n},{\bm{p}}^{n})\leq{\mathcal{I}}({\bm{x}}^{0},{\bm{p}}^{0})\big{(}1+\frac{a^{2}-2a}{2}\frac{\lambda^{2}}{M_{2}+\delta}\big{)}^{n}\,.
Remark 3.7.

Note that the constant M2M_{2} in Theorem 3.6 can be better than the constant M1M_{1} in Theorem 3.5 because 𝑵{\bm{N}} and 2\nabla^{2}{\mathcal{I}} are evaluated at the same 𝒙{\bm{x}} in Theorem 3.6.

Proof.

We will prove it by induction. Using the mean-value theorem, we have

(𝒙n+1,𝒑n+1)(𝒙n,𝒑n)\displaystyle{\mathcal{I}}({\bm{x}}^{n+1},{\bm{p}}^{n+1})-{\mathcal{I}}({\bm{x}}^{n},{\bm{p}}^{n})
=\displaystyle= (𝒙(𝒙n,𝒑n)𝒑(𝒙n,𝒑n))T(𝒙n+1𝒙n𝒑n+1𝒑n)+12(𝒙n+1𝒙n𝒑n+1𝒑n)T2(𝒙n,𝒑n)(𝒙n+1𝒙n𝒑n+1𝒑n)\displaystyle\begin{pmatrix}\nabla_{{\bm{x}}}{\mathcal{I}}({\bm{x}}^{n},{\bm{p}}^{n})\\ \nabla_{{\bm{p}}}{\mathcal{I}}({\bm{x}}^{n},{\bm{p}}^{n})\end{pmatrix}^{T}\begin{pmatrix}{\bm{x}}^{n+1}-{\bm{x}}^{n}\\ {\bm{p}}^{n+1}-{\bm{p}}^{n}\end{pmatrix}+\frac{1}{2}\begin{pmatrix}{\bm{x}}^{n+1}-{\bm{x}}^{n}\\ {\bm{p}}^{n+1}-{\bm{p}}^{n}\end{pmatrix}^{T}\nabla^{2}{\mathcal{I}}({\bm{x}}^{n},{\bm{p}}^{n})\begin{pmatrix}{\bm{x}}^{n+1}-{\bm{x}}^{n}\\ {\bm{p}}^{n+1}-{\bm{p}}^{n}\end{pmatrix}
+16(𝒙n+1𝒙n𝒑n+1𝒑n)T(3(𝒙~n,𝒑~n)(𝒙n+1𝒙n𝒑n+1𝒑n))(𝒙n+1𝒙n𝒑n+1𝒑n),\displaystyle\qquad+\frac{1}{6}\begin{pmatrix}{\bm{x}}^{n+1}-{\bm{x}}^{n}\\ {\bm{p}}^{n+1}-{\bm{p}}^{n}\end{pmatrix}^{T}\left(\nabla^{3}{\mathcal{I}}(\tilde{{\bm{x}}}^{n},\tilde{{\bm{p}}}^{n})\begin{pmatrix}{\bm{x}}^{n+1}-{\bm{x}}^{n}\\ {\bm{p}}^{n+1}-{\bm{p}}^{n}\end{pmatrix}\right)\begin{pmatrix}{\bm{x}}^{n+1}-{\bm{x}}^{n}\\ {\bm{p}}^{n+1}-{\bm{p}}^{n}\end{pmatrix}\,, (3.15)

where (𝒙~n,𝒑~n)(\tilde{{\bm{x}}}^{n},\tilde{{\bm{p}}}^{n}) is in between (𝒙n+1,𝒑n+1)({\bm{x}}^{n+1},{\bm{p}}^{n+1}) and (𝒙n,𝒑n)({\bm{x}}^{n},{\bm{p}}^{n}). By Eq. 3.2 and Eq. 3.11, we can bound

(𝒙1,𝒑1)(𝒙0,𝒑0)\displaystyle{\mathcal{I}}({\bm{x}}^{1},{\bm{p}}^{1})-{\mathcal{I}}({\bm{x}}^{0},{\bm{p}}^{0})
=\displaystyle= τ(f(𝒙0)𝒑0)T𝑯(𝒙0)(f(𝒙0)𝒑0)\displaystyle-\tau\begin{pmatrix}\nabla f({\bm{x}}^{0})\\ {\bm{p}}^{0}\end{pmatrix}^{T}{\bm{H}}({\bm{x}}^{0})\begin{pmatrix}\nabla f({\bm{x}}^{0})\\ {\bm{p}}^{0}\end{pmatrix}
+τ22(f(𝒙0)𝒑0)T𝑵(𝒙0)T2(𝒙0,𝒑0)𝑵(𝒙0)(f(𝒙0)𝒑0)\displaystyle+\frac{\tau^{2}}{2}\begin{pmatrix}\nabla f({\bm{x}}^{0})\\ {\bm{p}}^{0}\end{pmatrix}^{T}{\bm{N}}({\bm{x}}^{0})^{T}\nabla^{2}{\mathcal{I}}({\bm{x}}^{0},{\bm{p}}^{0}){\bm{N}}({\bm{x}}^{0})\begin{pmatrix}\nabla f({\bm{x}}^{0})\\ {\bm{p}}^{0}\end{pmatrix}
τ36(f(𝒙0)𝒑0)T𝑵(𝒙0)T(3(𝒙~0,𝒑~0)𝑵(𝒙0)(f(𝒙0)𝒑0))𝑵(𝒙0)(f(𝒙0)𝒑0)\displaystyle-\frac{\tau^{3}}{6}\begin{pmatrix}\nabla f({\bm{x}}^{0})\\ {\bm{p}}^{0}\end{pmatrix}^{T}{\bm{N}}({\bm{x}}^{0})^{T}\left(\nabla^{3}{\mathcal{I}}(\tilde{{\bm{x}}}^{0},\tilde{{\bm{p}}}^{0}){\bm{N}}({\bm{x}}^{0})\begin{pmatrix}\nabla f({\bm{x}}^{0})\\ {\bm{p}}^{0}\end{pmatrix}\right){\bm{N}}({\bm{x}}^{0})\begin{pmatrix}\nabla f({\bm{x}}^{0})\\ {\bm{p}}^{0}\end{pmatrix}
\displaystyle\leq τ(f(𝒙0)𝒑0)T𝑯(𝒙0)(f(𝒙0)𝒑0)\displaystyle-\tau\begin{pmatrix}\nabla f({\bm{x}}^{0})\\ {\bm{p}}^{0}\end{pmatrix}^{T}{\bm{H}}({\bm{x}}^{0})\begin{pmatrix}\nabla f({\bm{x}}^{0})\\ {\bm{p}}^{0}\end{pmatrix}
+τ22(f(𝒙0)𝒑0)T𝑵(𝒙0)T2(𝒙0,𝒑0)𝑵(𝒙0)(f(𝒙0)𝒑0)\displaystyle+\frac{\tau^{2}}{2}\begin{pmatrix}\nabla f({\bm{x}}^{0})\\ {\bm{p}}^{0}\end{pmatrix}^{T}{\bm{N}}({\bm{x}}^{0})^{T}\nabla^{2}{\mathcal{I}}({\bm{x}}^{0},{\bm{p}}^{0}){\bm{N}}({\bm{x}}^{0})\begin{pmatrix}\nabla f({\bm{x}}^{0})\\ {\bm{p}}^{0}\end{pmatrix}
+τ36(f(𝒙0)𝒑0)T(D0𝑵(𝒙0)23(f(𝒙0)𝒑0)2)(f(𝒙0)𝒑0)\displaystyle+\frac{\tau^{3}}{6}\begin{pmatrix}\nabla f({\bm{x}}^{0})\\ {\bm{p}}^{0}\end{pmatrix}^{T}\left(D_{0}\|{\bm{N}}({\bm{x}}^{0})\|_{2}^{3}\left\|\begin{pmatrix}\nabla f({\bm{x}}^{0})\\ {\bm{p}}^{0}\end{pmatrix}\right\|_{2}\right)\begin{pmatrix}\nabla f({\bm{x}}^{0})\\ {\bm{p}}^{0}\end{pmatrix}
=\displaystyle= τ(f(𝒙0)𝒑0)T𝑯(𝒙0)(f(𝒙0)𝒑0)\displaystyle-\tau\begin{pmatrix}\nabla f({\bm{x}}^{0})\\ {\bm{p}}^{0}\end{pmatrix}^{T}{\bm{H}}({\bm{x}}^{0})\begin{pmatrix}\nabla f({\bm{x}}^{0})\\ {\bm{p}}^{0}\end{pmatrix}
+τ22(f(𝒙0)𝒑0)T𝑵(𝒙0)T2(𝒙0,𝒑0)𝑵(𝒙0)(f(𝒙0)𝒑0)\displaystyle+\frac{\tau^{2}}{2}\begin{pmatrix}\nabla f({\bm{x}}^{0})\\ {\bm{p}}^{0}\end{pmatrix}^{T}{\bm{N}}({\bm{x}}^{0})^{T}\nabla^{2}{\mathcal{I}}({\bm{x}}^{0},{\bm{p}}^{0}){\bm{N}}({\bm{x}}^{0})\begin{pmatrix}\nabla f({\bm{x}}^{0})\\ {\bm{p}}^{0}\end{pmatrix}
+τ36(f(𝒙0)𝒑0)TD0𝑵(𝒙0)23(𝒙0,𝒑0)1/2(f(𝒙0)𝒑0)\displaystyle+\frac{\tau^{3}}{6}\begin{pmatrix}\nabla f({\bm{x}}^{0})\\ {\bm{p}}^{0}\end{pmatrix}^{T}D_{0}\|{\bm{N}}({\bm{x}}^{0})\|_{2}^{3}{\mathcal{I}}({\bm{x}}_{0},{\bm{p}}_{0})^{1/2}\begin{pmatrix}\nabla f({\bm{x}}^{0})\\ {\bm{p}}^{0}\end{pmatrix}
\displaystyle\leq τ(f(𝒙0)𝒑0)T𝑯(𝒙0)(f(𝒙0)𝒑0)\displaystyle-\tau\begin{pmatrix}\nabla f({\bm{x}}^{0})\\ {\bm{p}}^{0}\end{pmatrix}^{T}{\bm{H}}({\bm{x}}^{0})\begin{pmatrix}\nabla f({\bm{x}}^{0})\\ {\bm{p}}^{0}\end{pmatrix}
+τ22(f(𝒙0)𝒑0)T𝑵(𝒙0)T2(𝒙0,𝒑0)𝑵(𝒙0)(f(𝒙0)𝒑0)\displaystyle+\frac{\tau^{2}}{2}\begin{pmatrix}\nabla f({\bm{x}}^{0})\\ {\bm{p}}^{0}\end{pmatrix}^{T}{\bm{N}}({\bm{x}}^{0})^{T}\nabla^{2}{\mathcal{I}}({\bm{x}}^{0},{\bm{p}}^{0}){\bm{N}}({\bm{x}}^{0})\begin{pmatrix}\nabla f({\bm{x}}^{0})\\ {\bm{p}}^{0}\end{pmatrix}
+τ2δ6(f(𝒙0)𝒑0)T(f(𝒙0)𝒑0),\displaystyle+\frac{\tau^{2}\delta}{6}\begin{pmatrix}\nabla f({\bm{x}}^{0})\\ {\bm{p}}^{0}\end{pmatrix}^{T}\begin{pmatrix}\nabla f({\bm{x}}^{0})\\ {\bm{p}}^{0}\end{pmatrix}\,, (3.16)

where the last inequality is by our assumption on (𝒙0,𝒑0)({\bm{x}}^{0},{\bm{p}}^{0}). Using our assumptions on the lower bound of 𝑯{\bm{H}} and the upper bound of 𝑵T2𝑵{\bm{N}}^{T}\cdot\nabla^{2}{\mathcal{I}}\cdot{\bm{N}}, we obtain

(𝒙1,𝒑1)(𝒙0,𝒑0)\displaystyle{\mathcal{I}}({\bm{x}}^{1},{\bm{p}}^{1})-{\mathcal{I}}({\bm{x}}^{0},{\bm{p}}^{0}) (τλ+τ2δ6+τ2M22)(𝒙0,𝒑0)\displaystyle\leq\big{(}-\tau\lambda+\frac{\tau^{2}\delta}{6}+\frac{\tau^{2}M_{2}}{2}\big{)}{\mathcal{I}}({\bm{x}}^{0},{\bm{p}}^{0})
(τλ+τ2(δ+M2)2)(𝒙0,𝒑0)\displaystyle\leq\big{(}-\tau\lambda+\frac{\tau^{2}(\delta+M_{2})}{2}\big{)}{\mathcal{I}}({\bm{x}}^{0},{\bm{p}}^{0})
=12(a22a)λ2M2+δ(𝒙0,𝒑0),\displaystyle=\frac{1}{2}(a^{2}-2a)\frac{\lambda^{2}}{M_{2}+\delta}{\mathcal{I}}({\bm{x}}^{0},{\bm{p}}^{0})\,, (3.17)

where we used τ=aλM2+δ\tau=a\frac{\lambda}{M_{2}+\delta} for some a(0,2)a\in(0,2). Hence,

(𝒙1,𝒑1)(𝒙0,𝒑0)(1+a22a2λ2M2+δ).{\mathcal{I}}({\bm{x}}^{1},{\bm{p}}^{1})\leq{\mathcal{I}}({\bm{x}}^{0},{\bm{p}}^{0})\big{(}1+\frac{a^{2}-2a}{2}\frac{\lambda^{2}}{M_{2}+\delta}\big{)}\,.

This proves the base case. Now suppose it holds that

(𝒙n,𝒑n)(𝒙0,𝒑0)(1+a22a2λ2M2+δ)n,{\mathcal{I}}({\bm{x}}^{n},{\bm{p}}^{n})\leq{\mathcal{I}}({\bm{x}}^{0},{\bm{p}}^{0})\big{(}1+\frac{a^{2}-2a}{2}\frac{\lambda^{2}}{M_{2}+\delta}\big{)}^{n}\,,

for some n1n\geq 1. In particular, this implies that

(𝒙n,𝒑n)<(𝒙0,𝒑0),{\mathcal{I}}({\bm{x}}^{n},{\bm{p}}^{n})<{\mathcal{I}}({\bm{x}}^{0},{\bm{p}}^{0})\,,

which yields

τD0𝑵(𝒙)23(𝒙n,𝒑n)1/2<τD0𝑵(𝒙)23(𝒙0,𝒑0)1/2δ.\tau D_{0}\|{\bm{N}}({\bm{x}})\|_{2}^{3}{\mathcal{I}}({\bm{x}}^{n},{\bm{p}}^{n})^{1/2}<\tau D_{0}\|{\bm{N}}({\bm{x}})\|_{2}^{3}{\mathcal{I}}({\bm{x}}^{0},{\bm{p}}^{0})^{1/2}\leq\delta\,.

Then, repeating the derivation of Eq. 3.2 and Eq. 3.2 yields

(𝒙n+1,𝒑n+1)(𝒙n,𝒑n)(1+a22a2λ2M2+δ).{\mathcal{I}}({\bm{x}}^{n+1},{\bm{p}}^{n+1})\leq{\mathcal{I}}({\bm{x}}^{n},{\bm{p}}^{n})\big{(}1+\frac{a^{2}-2a}{2}\frac{\lambda^{2}}{M_{2}+\delta}\big{)}\,.

Combining with our induction hypothesis, we conclude that

(𝒙n+1,𝒑n+1)(𝒙0,𝒑0)(1+a22a2λ2M2+δ)n+1.{\mathcal{I}}({\bm{x}}^{n+1},{\bm{p}}^{n+1})\leq{\mathcal{I}}({\bm{x}}^{0},{\bm{p}}^{0})\big{(}1+\frac{a^{2}-2a}{2}\frac{\lambda^{2}}{M_{2}+\delta}\big{)}^{n+1}\,.

The proof is complete by induction. ∎

Corollary 3.8.

Suppose Assumption 3.1 and Assumption 3.2 hold. When σ=τ\sigma=\tau, γ=1σμμ,ε=1,A=μ+L2+(μ+L)εγ\gamma=\frac{1-\sigma\mu}{\mu},\varepsilon=1,A=\frac{\mu+L}{2+(\mu+L)\varepsilon\gamma}, we have

𝑯(𝒙)μ4𝕀.{\bm{H}}({\bm{x}})\succeq\frac{\mu}{4}{\mathbb{I}}.
Proof.

By definition of 𝑯{\bm{H}}, we can compute

(1+σεA)𝑯(𝒙)=(𝑪0(𝒙)(σ𝑨+γ𝑨)12𝑪0(𝒙)(1εγA)12η𝑨12𝑪0(𝒙)(1εγA)12η𝑨ηε𝑨),(1+\sigma\varepsilon A)\cdot{\bm{H}}({\bm{x}})=\begin{pmatrix}{\bm{C}}_{0}({\bm{x}})(\sigma{\bm{A}}+\gamma{\bm{A}})&\frac{1}{2}{\bm{C}}_{0}({\bm{x}})(1-\varepsilon\gamma A)-\frac{1}{2}\eta{\bm{A}}\\ \frac{1}{2}{\bm{C}}_{0}({\bm{x}})(1-\varepsilon\gamma A)-\frac{1}{2}\eta{\bm{A}}&\eta\varepsilon{\bm{A}}\end{pmatrix}\,,

where η=σ/τ=1\eta=\sigma/\tau=1, 𝑪0(𝒙)=2f(𝒙)𝑩(𝒙)2f(𝒙){\bm{C}}_{0}({\bm{x}})=\nabla^{2}f({\bm{x}}){\bm{B}}({\bm{x}})\nabla^{2}f({\bm{x}}). We want to find some constant λ>0\lambda>0, such that

(𝒛𝒘)T𝑯(𝒙)(𝒛𝒘)λ(𝒛2+𝒘2).\begin{pmatrix}{\bm{z}}\\ {\bm{w}}\end{pmatrix}^{T}{\bm{H}}({\bm{x}})\begin{pmatrix}{\bm{z}}\\ {\bm{w}}\end{pmatrix}\geq\lambda(\|{\bm{z}}\|^{2}+\|{\bm{w}}\|^{2})\,.

Observe that

(𝒛𝒘)T𝑯(𝒙)(𝒛𝒘)λ(𝒛2+𝒘2)\displaystyle\begin{pmatrix}{\bm{z}}\\ {\bm{w}}\end{pmatrix}^{T}{\bm{H}}({\bm{x}})\begin{pmatrix}{\bm{z}}\\ {\bm{w}}\end{pmatrix}-\lambda(\|{\bm{z}}\|^{2}+\|{\bm{w}}\|^{2})
=𝒛T(𝑪0(γ+σ)A/(1+σεA)λ𝕀)𝒛+𝒘T(εA/(1+σεA)λ𝕀)𝒘\displaystyle={\bm{z}}^{T}\big{(}{\bm{C}}_{0}(\gamma+\sigma)A/(1+\sigma\varepsilon A)-\lambda{\mathbb{I}}\big{)}{\bm{z}}+{\bm{w}}^{T}\big{(}\varepsilon A/(1+\sigma\varepsilon A)-\lambda{\mathbb{I}}\big{)}{\bm{w}}
+𝒛T(A+𝑪0(𝕀εγA))𝒘/(1+σεA),\displaystyle\qquad+{\bm{z}}^{T}\big{(}-A+{\bm{C}}_{0}({\mathbb{I}}-\varepsilon\gamma A)\big{)}{\bm{w}}/(1+\sigma\varepsilon A)\,, (3.18)

which is almost the same as Eq. 3.1. Thus, following a similar procedure in Theorem 3.3 with the provided parameters, we obtain that

λμ21+Aσ21+σAμ4.\lambda\geq\frac{\mu}{2}\frac{1+\frac{A\sigma}{2}}{1+\sigma A}\geq\frac{\mu}{4}\,.

This implies

𝑯(𝒙)μ4𝕀.\displaystyle{\bm{H}}({\bm{x}})\succeq\frac{\mu}{4}{\mathbb{I}}\,. (3.19)

Corollary 3.9.

Let f:df:\mathbb{R}^{d}\to\mathbb{R} be a 𝒞4\mathcal{C}^{4} strongly convex function. Suppose Assumption 3.1 and Assumption 3.2 hold. If σ=τ\sigma=\tau, γ=1σμμ,ε=1,A=μ+L2+(μ+L)εγ\gamma=\frac{1-\sigma\mu}{\mu},\varepsilon=1,A=\frac{\mu+L}{2+(\mu+L)\varepsilon\gamma}, we have

  1. (1)
    𝑵(𝒙)2max{L,1}(A(σ+2γ+2)+1)(1+σA).\|{\bm{N}}({\bm{x}})\|_{2}\leq\frac{\max\{L,1\}\cdot\big{(}A(\sigma+2\gamma+2)+1\big{)}}{(1+\sigma A)}\,.
  2. (2)
    𝑵(𝒙)T2(𝒙,𝒑)𝑵(𝒙)(3+σA+2A)2(1+σA)2max{L,1}𝕀.{\bm{N}}({\bm{x}})^{T}\nabla^{2}{\mathcal{I}}({\bm{x}},{\bm{p}}){\bm{N}}({\bm{x}})\preceq\frac{(3+\sigma A+2A)^{2}}{(1+\sigma A)^{2}}\cdot\max\{L^{\prime},1\}\cdot{\mathbb{I}}\,.
Proof.

We can decompose

(1+σA)𝑵(𝒙)=(𝑩(𝒙)2f(𝒙)00𝕀)((σ+γ)𝑨(𝕀γ𝑨)𝑨𝑨).(1+\sigma A)\cdot{\bm{N}}({\bm{x}})=\begin{pmatrix}{\bm{B}}({\bm{x}})\nabla^{2}f({\bm{x}})&0\\ 0&{\mathbb{I}}\end{pmatrix}\begin{pmatrix}(\sigma+\gamma){\bm{A}}&({\mathbb{I}}-\gamma{\bm{A}})\\ -{\bm{A}}&{\bm{A}}\end{pmatrix}\,.

Observe that

((σ+γ)𝑨(𝕀γ𝑨)𝑨𝑨)=((σ+γ)𝕀γ𝕀𝕀𝕀)(𝑨00𝑨)+(0𝕀00).\begin{pmatrix}(\sigma+\gamma){\bm{A}}&({\mathbb{I}}-\gamma{\bm{A}})\\ -{\bm{A}}&{\bm{A}}\end{pmatrix}=\begin{pmatrix}(\sigma+\gamma){\mathbb{I}}&-\gamma{\mathbb{I}}\\ -{\mathbb{I}}&{\mathbb{I}}\end{pmatrix}\cdot\begin{pmatrix}{\bm{A}}&0\\ 0&{\bm{A}}\end{pmatrix}+\begin{pmatrix}0&{\mathbb{I}}\\ 0&0\end{pmatrix}\,.

Therefore,

((σ+γ)𝑨(𝕀γ𝑨)𝑨𝑨)2\displaystyle\bigg{\|}\begin{pmatrix}(\sigma+\gamma){\bm{A}}&({\mathbb{I}}-\gamma{\bm{A}})\\ -{\bm{A}}&{\bm{A}}\end{pmatrix}\bigg{\|}_{2} A((σ+γ)𝕀γ𝕀𝕀𝕀)2+(0𝕀00)2\displaystyle\leq A\bigg{\|}\begin{pmatrix}(\sigma+\gamma){\mathbb{I}}&-\gamma{\mathbb{I}}\\ -{\mathbb{I}}&{\mathbb{I}}\end{pmatrix}\bigg{\|}_{2}+\bigg{\|}\begin{pmatrix}0&{\mathbb{I}}\\ 0&0\end{pmatrix}\bigg{\|}_{2}
A((σ+γ)𝕀γ𝕀𝕀𝕀)2+1\displaystyle\leq A\bigg{\|}\begin{pmatrix}(\sigma+\gamma){\mathbb{I}}&-\gamma{\mathbb{I}}\\ -{\mathbb{I}}&{\mathbb{I}}\end{pmatrix}\bigg{\|}_{2}+1
A(σ+2γ+2)+1.\displaystyle\leq A(\sigma+2\gamma+2)+1\,. (3.20)

To get the last inequality, we consider (𝒛,𝒘)({\bm{z}},{\bm{w}}) such that 𝒛2+𝒘2=1\|{\bm{z}}\|^{2}+\|{\bm{w}}\|^{2}=1. Thus

((σ+γ)𝕀γ𝕀𝕀𝕀)(𝒛𝒘)2\displaystyle\bigg{\|}\begin{pmatrix}(\sigma+\gamma){\mathbb{I}}&-\gamma{\mathbb{I}}\\ -{\mathbb{I}}&{\mathbb{I}}\end{pmatrix}\begin{pmatrix}{\bm{z}}\\ {\bm{w}}\end{pmatrix}\bigg{\|}_{2} =((σ+γ)𝒛γ𝒘𝒛+𝒘)\displaystyle=\bigg{\|}\begin{pmatrix}(\sigma+\gamma){\bm{z}}-\gamma{\bm{w}}\\ -{\bm{z}}+{\bm{w}}\end{pmatrix}\bigg{\|}
σ(𝒛𝒘)+γ𝒛𝒘+𝒛𝒘\displaystyle\leq\sigma\bigg{\|}\begin{pmatrix}{\bm{z}}\\ {\bm{w}}\end{pmatrix}\bigg{\|}+\gamma\|{\bm{z}}-{\bm{w}}\|+\|{\bm{z}}-{\bm{w}}\|
σ+2γ+2.\displaystyle\leq\sigma+2\gamma+2\,.

We now have

(1+σA)𝑵(𝒙)2\displaystyle(1+\sigma A)\|{\bm{N}}({\bm{x}})\|_{2} (𝑩(𝒙)2f(𝒙)00𝕀)2((σ+γ)𝑨(𝕀γ𝑨)𝑨𝑨)2\displaystyle\leq\bigg{\|}\begin{pmatrix}{\bm{B}}({\bm{x}})\nabla^{2}f({\bm{x}})&0\\ 0&{\mathbb{I}}\end{pmatrix}\bigg{\|}_{2}\bigg{\|}\begin{pmatrix}(\sigma+\gamma){\bm{A}}&({\mathbb{I}}-\gamma{\bm{A}})\\ -{\bm{A}}&{\bm{A}}\end{pmatrix}\bigg{\|}_{2}
max{L,1}(A(σ+2γ+2)+1).\displaystyle\leq\max\{L,1\}\cdot\big{(}A(\sigma+2\gamma+2)+1\big{)}\,.

This proves part (1) of our Corollary. It follows that

𝑵(𝒙)T2(𝒙,𝒑)𝑵(𝒙)\displaystyle{\bm{N}}({\bm{x}})^{T}\nabla^{2}{\mathcal{I}}({\bm{x}},{\bm{p}}){\bm{N}}({\bm{x}})
=1(1+σA)2((σ+γ)𝑨𝑨(𝕀γ𝑨)𝑨)(𝑪(𝒙)T(3f(𝒙)f(𝒙)+(2f(𝒙))2)𝑪(𝒙)00𝕀)\displaystyle=\frac{1}{(1+\sigma A)^{2}}\begin{pmatrix}(\sigma+\gamma){\bm{A}}&-{\bm{A}}\\ ({\mathbb{I}}-\gamma{\bm{A}})&{\bm{A}}\end{pmatrix}\cdot\begin{pmatrix}{\bm{C}}({\bm{x}})^{T}\big{(}\nabla^{3}f({\bm{x}})\nabla f({\bm{x}})+(\nabla^{2}f({\bm{x}}))^{2}\big{)}{\bm{C}}({\bm{x}})&0\\ 0&{\mathbb{I}}\end{pmatrix}
((σ+γ)𝑨(𝕀γ𝑨)𝑨𝑨),\displaystyle\hskip 56.9055pt\cdot\begin{pmatrix}(\sigma+\gamma){\bm{A}}&({\mathbb{I}}-\gamma{\bm{A}})\\ -{\bm{A}}&{\bm{A}}\end{pmatrix}\,, (3.21)

where we recall 𝑪(𝒙)=𝑩(𝒙)2f(𝒙){\bm{C}}({\bm{x}})={\bm{B}}({\bm{x}})\nabla^{2}f({\bm{x}}).

By assumption, there exists L>0L^{\prime}>0, such that

𝑪(𝒙)T(3f(𝒙)f(𝒙)+(2f(𝒙))2)𝑪(𝒙)L𝕀,{\bm{C}}({\bm{x}})^{T}\big{(}\nabla^{3}f({\bm{x}})\nabla f({\bm{x}})+(\nabla^{2}f({\bm{x}}))^{2}\big{)}{\bm{C}}({\bm{x}})\preceq L^{\prime}{\mathbb{I}}\,,

for all 𝒙{\bm{x}}. Then, combining Eq. 3.2 and Eq. 3.2, we obtain that

𝑵(𝒙)T2(𝒙,𝒑)𝑵(𝒙)2\displaystyle\|{\bm{N}}({\bm{x}})^{T}\nabla^{2}{\mathcal{I}}({\bm{x}},{\bm{p}}){\bm{N}}({\bm{x}})\|_{2} (A(σ+2γ+2)+1)2(1+σA)2max{L,1}\displaystyle\leq\frac{\big{(}A(\sigma+2\gamma+2)+1\big{)}^{2}}{(1+\sigma A)^{2}}\cdot\max\{L^{\prime},1\}
(3+σA+2A)2(1+σA)2max{L,1},\displaystyle\leq\frac{(3+\sigma A+2A)^{2}}{(1+\sigma A)^{2}}\cdot\max\{L^{\prime},1\}\,, (3.22)

where we have used γA<1\gamma A<1 to derive the last inequality. ∎

Theorem 3.10 (Restatement of Theorem 1.1).

Suppose Assumption 3.1 and Assumption 3.2 hold. Let \sigma=\tau, \gamma=\frac{1-\sigma\mu}{\mu}, \varepsilon=1, A=\frac{\mu+L}{2+(\mu+L)\varepsilon\gamma}. Suppose further that Eq. 3.14 holds for some \delta>0 and all {\bm{x}}. If \tau=\frac{1}{4}\frac{\mu}{\delta+36\max\{L^{\prime},1\}}, then

(𝒙n,𝒑n)(𝒙0,𝒑0)(1μ2/32δ+36max{L,1})n.{\mathcal{I}}({\bm{x}}^{n},{\bm{p}}^{n})\leq{\mathcal{I}}({\bm{x}}^{0},{\bm{p}}^{0})\big{(}1-\frac{\mu^{2}/32}{\delta+36\max\{L^{\prime},1\}}\big{)}^{n}\,.
Proof.

By Assumption 3.1 and Assumption 3.2, we have μLL\mu\leq L\leq L^{\prime}. Thus μ/L1\mu/L^{\prime}\leq 1 and σ=τ<1/36\sigma=\tau<1/36. Moreover,

γ=1μσ1136=3536.\gamma=\frac{1}{\mu}-\sigma\geq 1-\frac{1}{36}=\frac{35}{36}\,.

And

A=μ+L2+(μ+L)εγ<1γ3635.A=\frac{\mu+L}{2+(\mu+L)\varepsilon\gamma}<\frac{1}{\gamma}\leq\frac{36}{35}\,.

Then it follows

3+σA+2A1+σA3+σA+2A<6.\frac{3+\sigma A+2A}{1+\sigma A}\leq 3+\sigma A+2A<6\,.

By Corollary 3.9, we have

𝑵(𝒙)T2(𝒙,𝒑)𝑵(𝒙)236max{L,1}.\|{\bm{N}}({\bm{x}})^{T}\nabla^{2}{\mathcal{I}}({\bm{x}},{\bm{p}}){\bm{N}}({\bm{x}})\|_{2}\leq 36\max\{L^{\prime},1\}\,.

Combining this with Theorem 3.6 and Corollary 3.8, we finish the proof. ∎

Remark 3.11.

The choice of parameters in Theorem 3.10 may not be optimal. The main purpose of Theorem 3.10 is to establish that Algorithm 1 converges geometrically for a concrete choice of parameters.

4 Numerical experiments

We test our PDD algorithm using several convex and non-convex functions and compare the results with other commonly used optimizers, such as gradient descent, Nesterov’s accelerated gradient (NAG), IGAHD (inertial gradient algorithm with Hessian damping) [Attouch et al., 2020], and IGAHD-SC (inertial gradient algorithm with Hessian damping for strongly convex functions) [Attouch et al., 2020].

4.1 Summary of algorithms

For reference, we write down the iterations of gradient descent, NAG, IGAHD, and IGAHD-SC for comparison.

Gradient descent:

𝒙n+1=𝒙nτgdf(𝒙n),{\bm{x}}^{n+1}={\bm{x}}^{n}-\tau_{\textrm{gd}}\nabla f({\bm{x}}^{n})\,,

where τgd>0\tau_{\textrm{gd}}>0 is a stepsize.

NAG:

𝒚n+1\displaystyle{\bm{y}}^{n+1} =𝒙nτnagf(𝒙n),\displaystyle={\bm{x}}^{n}-\tau_{\textrm{nag}}\nabla f({\bm{x}}^{n})\,,
𝒙n+1\displaystyle{\bm{x}}^{n+1} =𝒚n+1+βnag(𝒚n𝒚n1),\displaystyle={\bm{y}}^{n+1}+\beta_{\textrm{nag}}({\bm{y}}^{n}-{\bm{y}}^{n-1})\,,

where τnag>0\tau_{\textrm{nag}}>0 is a stepsize, and βnag>0\beta_{\textrm{nag}}>0 is a parameter.
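For concreteness, the following is a minimal NumPy sketch of the two updates above; the function names and the bookkeeping of the last two NAG iterates are ours, and grad_f, tau, and beta are supplied by the user.

import numpy as np

def gd_step(x, grad_f, tau):
    # Gradient descent: x^{n+1} = x^n - tau * grad f(x^n)
    return x - tau * grad_f(x)

def nag_step(x_n, y_n, y_nm1, grad_f, tau, beta):
    # NAG as displayed above: y^{n+1} = x^n - tau * grad f(x^n),
    # x^{n+1} = y^{n+1} + beta * (y^n - y^{n-1});
    # the caller carries the last two y iterates between calls.
    y_np1 = x_n - tau * grad_f(x_n)
    x_np1 = y_np1 + beta * (y_n - y_nm1)
    return x_np1, y_np1, y_n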

IGAHD: Suppose f\nabla f is L1L_{1}-Lipschitz.

𝒚n\displaystyle{\bm{y}}^{n} =𝒙n+αn(𝒙n𝒙n1)β(1)τatt(f(𝒙n)f(𝒙n1))β(1)τattnf(𝒙n1),\displaystyle={\bm{x}}^{n}+\alpha_{n}({\bm{x}}^{n}-{\bm{x}}^{n-1})-\beta^{(1)}\sqrt{\tau_{\textrm{att}}}(\nabla f({\bm{x}}^{n})-\nabla f({\bm{x}}^{n-1}))-\frac{\beta^{(1)}\sqrt{\tau_{\textrm{att}}}}{n}\nabla f({\bm{x}}^{n-1})\,,
𝒙n+1\displaystyle{\bm{x}}^{n+1} =𝒚nτattf(𝒚n).\displaystyle={\bm{y}}^{n}-\tau_{\textrm{att}}\nabla f({\bm{y}}^{n})\,.

Here αn=1αn\alpha_{n}=1-\frac{\alpha}{n} for some α3\alpha\geq 3. β(1)\beta^{(1)} needs to satisfy

0β(1)2τatt.0\leq\beta^{(1)}\leq 2\sqrt{\tau_{\textrm{att}}}\,.

And τatt>0\tau_{\textrm{att}}>0 is a stepsize, which needs to satisfy

τatt1L1.\tau_{\textrm{att}}\leq\frac{1}{L_{1}}\,.
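A minimal sketch of one IGAHD iteration as displayed above; the function name igahd_step and the caching of the previous iterate and its gradient are ours.

import numpy as np

def igahd_step(x_n, x_nm1, g_nm1, grad_f, tau, n, alpha=3.0, beta1=0.0):
    # Requires n >= 1, alpha >= 3, and 0 <= beta1 <= 2*sqrt(tau).
    g_n = grad_f(x_n)                       # first gradient evaluation, at x^n
    alpha_n = 1.0 - alpha / n
    y_n = (x_n + alpha_n * (x_n - x_nm1)
           - beta1 * np.sqrt(tau) * (g_n - g_nm1)
           - (beta1 * np.sqrt(tau) / n) * g_nm1)
    x_np1 = y_n - tau * grad_f(y_n)         # second gradient evaluation, at y^n
    return x_np1, x_n, g_n                  # pass (x^{n+1}, x^n, grad f(x^n)) to the next call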
Remark 4.1.

As mentioned earlier, each iteration of IGAHD evaluates \nabla f(\cdot) twice: at {\bm{x}}^{n} and at {\bm{y}}^{n}. In contrast, gradient descent, NAG, and our method Eq. 1.2 require only one gradient evaluation per iteration. Chen and Luo [2021] proposed an algorithm slightly different from IGAHD that requires only one gradient evaluation per iteration.

IGAHD-SC: Suppose f is m_{1}-strongly convex and \nabla f is L_{1}-Lipschitz.

𝒙n+1=𝒙n+1m1τatt1+m1τatt(𝒙n𝒙n1)β(2)τatt1+m1τatt(f(𝒙n)f(𝒙n1))τatt1+m1τattf(𝒙n).{\bm{x}}^{n+1}={\bm{x}}^{n}+\frac{1-\sqrt{m_{1}\tau_{\textrm{att}}}}{1+\sqrt{m_{1}\tau_{\textrm{att}}}}({\bm{x}}^{n}-{\bm{x}}^{n-1})-\frac{\beta^{(2)}\sqrt{\tau_{\textrm{att}}}}{1+\sqrt{m_{1}\tau_{\textrm{att}}}}(\nabla f({\bm{x}}^{n})-\nabla f({\bm{x}}^{n-1}))-\frac{\tau_{\textrm{att}}}{1+\sqrt{m_{1}\tau_{\textrm{att}}}}\nabla f({\bm{x}}^{n})\,.

Here β(2)\beta^{(2)} and L1L_{1} need to satisfy

β(2)1m1,L1min{m18β(2),m12τatt+m1τatt2β(2)m1+1τatt+m12}.\displaystyle\beta^{(2)}\leq\frac{1}{\sqrt{m_{1}}}\,,\quad L_{1}\leq\min\Big{\{}\frac{\sqrt{m_{1}}}{8\beta^{(2)}},\frac{\frac{\sqrt{m_{1}}}{2\tau_{\textrm{att}}}+\frac{m_{1}}{\sqrt{\tau_{\textrm{att}}}}}{2\beta^{(2)}m_{1}+\frac{1}{\sqrt{\tau_{\textrm{att}}}}+\frac{\sqrt{m_{1}}}{2}}\Big{\}}\,. (4.1)
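Similarly, a sketch of one IGAHD-SC iteration; the function name is ours, and checking the conditions in Eq. 4.1 on \beta^{(2)} and L_{1} is left to the caller.

import numpy as np

def igahd_sc_step(x_n, x_nm1, g_nm1, grad_f, tau, m1, beta2):
    # One IGAHD-SC update as displayed above, with s = sqrt(m1 * tau).
    s = np.sqrt(m1 * tau)
    g_n = grad_f(x_n)
    x_np1 = (x_n + (1.0 - s) / (1.0 + s) * (x_n - x_nm1)
             - beta2 * np.sqrt(tau) / (1.0 + s) * (g_n - g_nm1)
             - tau / (1.0 + s) * g_n)
    return x_np1, x_n, g_n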

4.2 Regularized log-sum-exp

Consider the regularized log-sum-exp function

f(𝒙)=log(i=1nexp(𝒒iT𝒙))+12𝒙T𝑸𝒙,f({\bm{x}})=\log\left(\sum_{i=1}^{n}\exp({\bm{q}}_{i}^{T}{\bm{x}})\right)+\frac{1}{2}{\bm{x}}^{T}{\bm{Q}}{\bm{x}}\,,

where n=100, {\bm{Q}}={\bm{Q}}^{T}\succ 0, and {\bm{q}}_{i}^{T} is the i-th row of {\bm{Q}}. {\bm{Q}} is chosen to be diagonally dominant, i.e., Q_{i,i}>\sum_{j\neq i}|Q_{i,j}|. In this case, we may choose the diagonal preconditioner {\bm{C}}({\bm{x}})=\big{(}\textrm{diag}({\bm{Q}})\big{)}^{-1}. We compare the performance of gradient descent, preconditioned gradient descent, PDD with {\bm{C}}({\bm{x}})={\mathbb{I}}, PDD with the diagonal preconditioner, NAG, and IGAHD-SC [Attouch et al., 2020] for minimizing f. The stepsize of gradient descent is \tau_{\textrm{gd}}=\frac{2}{3\lambda_{1}+\lambda_{n}}, where \lambda_{1} and \lambda_{n} are the maximum and minimum eigenvalues of {\bm{Q}}, respectively. For the purely quadratic objective \frac{1}{2}{\bm{x}}^{T}{\bm{Q}}{\bm{x}}, the optimal stepsize of gradient descent is \frac{2}{\lambda_{1}+\lambda_{n}}; since our objective also contains the log-sum-exp term, we take a slightly smaller stepsize, without which gradient descent does not converge. Similarly, for NAG we choose \tau_{\textrm{nag}}=\frac{4}{30\lambda_{1}+\lambda_{n}} and \beta_{\textrm{nag}}=\frac{\sqrt{3\kappa^{\prime}+1}-2}{\sqrt{3\kappa^{\prime}+1}+2} with \kappa^{\prime}=10\lambda_{1}/\lambda_{n}, which is slightly more conservative than the optimal choice of NAG parameters for a purely quadratic objective, so as to guarantee convergence. For PDD with {\bm{C}}({\bm{x}})={\mathbb{I}}, we choose \tau_{\textrm{pdd}}=\sigma_{\textrm{pdd}}=\frac{2}{\lambda_{1}+\lambda_{n}}, \varepsilon=1, A=10, \omega=1. For PDD with the diagonal preconditioner {\bm{C}}({\bm{x}})=\big{(}\textrm{diag}({\bm{Q}})\big{)}^{-1}, we choose \tau_{\textrm{pdd}}=\sigma_{\textrm{pdd}}=0.5, \varepsilon=1, A=1, \omega=1. We use the same {\bm{C}}({\bm{x}})=\big{(}\textrm{diag}({\bm{Q}})\big{)}^{-1} as a preconditioner for gradient descent, with the same stepsize \tau_{\textrm{pdd}}=0.5. For IGAHD-SC (‘att’), we need m_{1}, a lower bound on the smallest eigenvalue of \nabla^{2}f({\bm{x}}); in this example we estimate m_{1} by the smallest eigenvalue of {\bm{Q}}. We set \tau_{\textrm{att}}=0.0016 via grid search. \beta^{(2)} in IGAHD-SC is found by solving (see Theorem 11, Eq. (26) of Attouch et al. [2020])

\frac{\sqrt{m_{1}}}{8\beta^{(2)}}=\frac{\frac{\sqrt{m_{1}}}{2\tau_{\textrm{att}}}+\frac{m_{1}}{\sqrt{\tau_{\textrm{att}}}}}{2\beta^{(2)}m_{1}+\frac{1}{\sqrt{\tau_{\textrm{att}}}}+\frac{\sqrt{m_{1}}}{2}}\,,

which gives

β(2)=τatt+τattm1/24+8m1τatt2m1τatt.\beta^{(2)}=\frac{\sqrt{\tau_{\textrm{att}}}+\tau_{\textrm{att}}\sqrt{m_{1}}/2}{4+8\sqrt{m_{1}}\sqrt{\tau_{\textrm{att}}}-2m_{1}\tau_{\textrm{att}}}\,. (4.2)

The initial condition is 𝒙0=np.ones(n)0.1{\bm{x}}^{0}=\textrm{np.ones}(n)*0.1. The result is presented in Fig. 1(a).
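The following is a NumPy sketch of this test problem under the choices above; the random construction of a diagonally dominant {\bm{Q}} is a hypothetical example, and the function names are ours.

import numpy as np

n = 100
rng = np.random.default_rng(0)
Q = rng.standard_normal((n, n))
Q = 0.5 * (Q + Q.T)
np.fill_diagonal(Q, np.abs(Q).sum(axis=1) + 1.0)   # symmetric, strictly diagonally dominant, hence positive definite

def f(x):
    z = Q @ x                                      # z_i = q_i^T x with q_i^T the i-th row of Q
    return z.max() + np.log(np.sum(np.exp(z - z.max()))) + 0.5 * x @ Q @ x

def grad_f(x):
    z = Q @ x
    w = np.exp(z - z.max())
    w /= w.sum()                                   # softmax weights
    return Q.T @ w + Q @ x

C_diag = 1.0 / np.diag(Q)                          # diagonal preconditioner C(x) = diag(Q)^{-1}

def beta2_from_eq_4_2(m1, tau_att):                # beta^(2) from Eq. 4.2
    return (np.sqrt(tau_att) + tau_att * np.sqrt(m1) / 2) / (4 + 8 * np.sqrt(m1 * tau_att) - 2 * m1 * tau_att)

x0 = np.ones(n) * 0.1                              # initial condition used above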

4.3 Quadratic minus cosine function

Consider the function

f(𝒙)=𝒙2cos(𝒄T𝒙),f({\bm{x}})=\|{\bm{x}}\|^{2}-\cos({\bm{c}}^{T}{\bm{x}})\,,

where 𝒄{\bm{c}} is a vector in 100\mathbb{R}^{100} with 𝒄2=1.9\|{\bm{c}}\|^{2}=1.9. Then a direct calculation shows that 0.1𝕀2f(𝒙)3.9𝕀0.1{\mathbb{I}}\preceq\nabla^{2}f({\bm{x}})\preceq 3.9{\mathbb{I}} for any 𝒙{\bm{x}}. This allows us to choose the optimal stepsize for gradient descent and NAG. When minimizing ff using gradient descent, we can choose τgd=20.1+3.9=0.5\tau_{\textrm{gd}}=\frac{2}{0.1+3.9}=0.5. Meanwhile, for NAG, we may choose τnag=433.9+0.1\tau_{\textrm{nag}}=\frac{4}{3*3.9+0.1}, and β=3κ+123κ+1+2\beta=\frac{\sqrt{3\kappa+1}-2}{\sqrt{3\kappa+1}+2}, where κ=3.9/0.1\kappa=3.9/0.1. For PDD with 𝑪(𝒙)=𝕀{\bm{C}}({\bm{x}})={\mathbb{I}}, we choose τpdd=σpdd=0.5\tau_{\textrm{pdd}}=\sigma_{\textrm{pdd}}=0.5, ε=1\varepsilon=1, A=1A=1, ω=1\omega=1. For IGAHD-SC (‘att’), we choose m1=0.1m_{1}=0.1, τatt=0.55\tau_{\textrm{att}}=0.55 via grid search and β(2)\beta^{(2)} is given by Eq. 4.2. The initial condition is 𝒙0=np.ones(n)5{\bm{x}}^{0}=\textrm{np.ones}(n)*5. The result is presented in Fig. 1(b).
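A sketch of this objective and its gradient; the Hessian bounds follow from \nabla^{2}f({\bm{x}})=2{\mathbb{I}}+\cos({\bm{c}}^{T}{\bm{x}}){\bm{c}}{\bm{c}}^{T} and \|{\bm{c}}\|^{2}=1.9. The particular choice of {\bm{c}} below is a hypothetical example.

import numpy as np

d = 100
c = np.ones(d)
c *= np.sqrt(1.9) / np.linalg.norm(c)   # any vector with ||c||^2 = 1.9

def f(x):
    return x @ x - np.cos(c @ x)

def grad_f(x):
    # grad f(x) = 2x + sin(c^T x) c; the Hessian is 2I + cos(c^T x) c c^T
    return 2.0 * x + np.sin(c @ x) * c

x0 = np.ones(d) * 5.0                   # initial condition used above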

(a) Regularized log-sum-exp
(b) Quadratic minus cosine
Figure 1: Comparison of gradient descent, NAG, PDD, and IGAHD-SC (we use ‘att’ as a shorthand for this method) on minimizing (a) the regularized log-sum-exp function and (b) the quadratic minus cosine function. The y-axis represents the 2-norm of the gradient of the objective function on a logarithmic scale. The x-axis represents the number of iterations on a logarithmic scale.

4.4 Rosenbrock function

4.4.1 Two dimensions

The 2-dimensional Rosenbrock function is defined as

f(x,y)=(ax)2+b(yx2)2,f(x,y)=(a-x)^{2}+b(y-x^{2})^{2}\,,

where a common choice of parameters is a=1, b=100. This is a non-convex function whose global minimum (x^{*},y^{*})=(a,a^{2}) lies inside a long, narrow, parabolic-shaped flat valley. Finding the valley is trivial; converging to the global minimum, however, is difficult. We compare the performance of gradient descent, NAG, PDD with {\bm{C}}({\bm{x}})={\mathbb{I}}, and IGAHD [Attouch et al., 2020] when minimizing the Rosenbrock function starting from (-3,-4). The stepsize of gradient descent is \tau_{\textrm{gd}}=0.0002. The parameters of NAG are \tau_{\textrm{nag}}=0.0002 and \beta_{\textrm{nag}}=0.9. The parameters of PDD are \tau_{\textrm{pdd}}=\sigma_{\textrm{pdd}}=0.005, \varepsilon=1, \omega=1, A=5. The stepsize of the PDD method is larger than \tau_{\textrm{gd}} and \tau_{\textrm{nag}} because gradient descent and NAG do not converge with larger stepsizes. For IGAHD (‘att’), we choose \tau_{\textrm{att}}=0.00045, \alpha=3, \beta^{(1)}=\sqrt{\tau_{\textrm{att}}}/14. The convergence result and the optimization trajectories are shown in Fig. 2.
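A sketch of the Rosenbrock objective and its gradient with a=1, b=100, as used above (function names ours):

import numpy as np

a, b = 1.0, 100.0

def f(x, y):
    return (a - x)**2 + b * (y - x**2)**2

def grad_f(x, y):
    # df/dx = -2(a - x) - 4 b x (y - x^2),  df/dy = 2 b (y - x^2)
    return np.array([-2.0 * (a - x) - 4.0 * b * x * (y - x**2),
                     2.0 * b * (y - x**2)])

x0 = np.array([-3.0, -4.0])   # starting point used above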

(a) Convergence comparison
(b) Optimization trajectories
Figure 2: Minimizing the Rosenbrock function with gradient descent, NAG, PDD with 𝑪(𝒙)=𝕀{\bm{C}}({\bm{x}})={\mathbb{I}} and IGAHD (‘att’). The left panel shows the convergence speed of each method. The right panel shows the optimization trajectories of each method.

4.4.2 N dimensions

The NN-dimensional coupled Rosenbrock function is defined as

f(𝒙)=i=1N1((axi)2+b(xi+1xi2)2),f({\bm{x}})=\sum_{i=1}^{N-1}\big{(}(a-x_{i})^{2}+b(x_{i+1}-x_{i}^{2})^{2}\big{)}\,,

where we choose a=1 and b=100 as in the two-dimensional case, and we set N=100. The global minimum is at {\bm{x}}^{*}=(1,1,\ldots,1). We compare the performance of the four algorithms starting from {\bm{x}}_{0}=(0,\ldots,0). The stepsize of gradient descent is \tau_{\textrm{gd}}=0.001. The parameters of NAG are \tau_{\textrm{nag}}=0.0008 and \beta_{\textrm{nag}}=0.95. The parameters of PDD are \tau_{\textrm{pdd}}=\sigma_{\textrm{pdd}}=0.01, \varepsilon=0.5, \omega=1, A=5. The stepsize of the PDD method is larger than \tau_{\textrm{gd}} and \tau_{\textrm{nag}} because gradient descent and NAG do not converge with larger stepsizes. For IGAHD (‘att’), we choose \tau_{\textrm{att}}=0.0002, \alpha=3, \beta^{(1)}=2\sqrt{\tau_{\textrm{att}}}. The result is summarized in Fig. 3.
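A vectorized sketch of the coupled Rosenbrock objective and its gradient for N=100:

import numpy as np

a, b, N = 1.0, 100.0, 100

def f(x):
    return np.sum((a - x[:-1])**2 + b * (x[1:] - x[:-1]**2)**2)

def grad_f(x):
    g = np.zeros_like(x)
    g[:-1] += -2.0 * (a - x[:-1]) - 4.0 * b * x[:-1] * (x[1:] - x[:-1]**2)
    g[1:] += 2.0 * b * (x[1:] - x[:-1]**2)
    return g

x0 = np.zeros(N)   # starting point used above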

Figure 3: Comparison of gradient descent, NAG, PDD with 𝑪(𝒙)=𝕀{\bm{C}}({\bm{x}})={\mathbb{I}} and IGAHD (‘att’) on minimizing the 100-dimensional coupled Rosenbrock function. The y-axis represents the distance between the current iterate and the global minimum on a logarithmic scale. The x-axis represents the number of iterations on a logarithmic scale.

4.5 Ackley function

We consider minimizing the two-dimensional Ackley function given by

f(x,y)=20exp(0.20.5(x2+y2))exp[0.5(cos(2πx)+cos(2πy))]+e+20,f(x,y)=-20\exp\big{(}-0.2\sqrt{0.5(x^{2}+y^{2})}\big{)}-\exp\big{[}0.5\big{(}\cos(2\pi x)+\cos(2\pi y)\big{)}\big{]}+\mathrm{e}+20\,,

which has many local minima. The unique global minimum is located at (x,y)=(0,0)(x^{*},y^{*})=(0,0). We compare the performance of gradient descent, NAG, PDD, and IGAHD (‘att’) for minimizing the two-dimensional Ackley function starting from (x0,y0)=(2.5,4)(x_{0},y_{0})=(2.5,4). The stepsize of gradient descent is τgd=0.002\tau_{\textrm{gd}}=0.002. The stepsize of NAG is τnag=0.002\tau_{\textrm{nag}}=0.002, βnag=0.9\beta_{\textrm{nag}}=0.9. The parameters of PDD are τpdd=σpdd=0.002\tau_{\textrm{pdd}}=\sigma_{\textrm{pdd}}=0.002, ε=1\varepsilon=1, ω=1\omega=1, A=1A=1. For IGAHD (‘att’), we choose τatt=0.01\tau_{\textrm{att}}=0.01, α=3\alpha=3, β(1)=2τatt\beta^{(1)}=2*\sqrt{\tau_{\textrm{att}}}. The results are summarized in Fig. 4.
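A sketch of the two-dimensional Ackley function and its gradient; the gradient is undefined exactly at the origin (the global minimizer), so a small guard is included.

import numpy as np

def f(x, y):
    r = np.sqrt(0.5 * (x**2 + y**2))
    return (-20.0 * np.exp(-0.2 * r)
            - np.exp(0.5 * (np.cos(2 * np.pi * x) + np.cos(2 * np.pi * y)))
            + np.e + 20.0)

def grad_f(x, y):
    r = max(np.sqrt(0.5 * (x**2 + y**2)), 1e-12)   # guard the derivative at the origin
    e1 = np.exp(-0.2 * r)
    e2 = np.exp(0.5 * (np.cos(2 * np.pi * x) + np.cos(2 * np.pi * y)))
    # d/dx of the first term is 2 x e1 / r; d/dx of the second term is pi sin(2 pi x) e2
    return np.array([2.0 * x * e1 / r + np.pi * np.sin(2 * np.pi * x) * e2,
                     2.0 * y * e1 / r + np.pi * np.sin(2 * np.pi * y) * e2])

x0 = np.array([2.5, 4.0])   # starting point used above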

(a) Convergence comparison
(b) Optimization trajectories
Figure 4: Minimizing the Ackley function with gradient descent, NAG, PDD and IGAHD (‘att’). The left panel shows the convergence speed of each method. The right panel shows the optimization trajectories of each method.
Remark 4.2.

We remark that our algorithm has no stochasticity. In general, it is not guaranteed to converge to the global minimum of non-convex functions. For example, it does not converge to the global minimum for the Griewank, Drop-Wave, and Rastrigin functions.

4.6 Neural network training

4.6.1 MNIST with a two-layer neural network

We consider the classification problem on the MNIST handwritten digit data set with a two-layer neural network. The network has an input layer of size 784=28\times 28, two hidden layers of size 32, and an output layer of size 10. We use the ReLU activation function between the layers, and the loss is the cross-entropy loss. We use a batch size of 200 for all the algorithms. The stepsize of gradient descent is \tau_{\textrm{gd}}=0.001. The parameters of NAG are \tau_{\textrm{nag}}=0.001 and \textrm{momentum}=0.9. The parameters of PDD are \tau_{\textrm{pdd}}=0.001, \sigma_{\textrm{pdd}}=5, \varepsilon=0.005, \omega=1, A=1. For IGAHD (‘att’), we choose \tau_{\textrm{att}}=0.001, \alpha=3, \beta^{(1)}=0.01. For Adam, we choose \tau_{\textrm{adam}}=0.001, \beta_{1}=0.9, \beta_{2}=0.999.
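A hypothetical PyTorch sketch of the described network; the paper does not list its training code, so the snippet only fixes the layer sizes and the loss stated above.

import torch.nn as nn

model = nn.Sequential(
    nn.Flatten(),                        # 28 x 28 input flattened to 784
    nn.Linear(28 * 28, 32), nn.ReLU(),   # first hidden layer of size 32
    nn.Linear(32, 32), nn.ReLU(),        # second hidden layer of size 32
    nn.Linear(32, 10),                   # output layer of size 10
)
loss_fn = nn.CrossEntropyLoss()          # cross-entropy loss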

Algorithm SGD NAG PDD Adam Att
train loss 2.223 ±\pm 0.034 0.964 ±\pm 0.244 0.433 ±\pm 0.270 0.589 ±\pm 0.282 0.591 ±\pm 0.288
test acc 29.3 ±\pm 8.3 % 71.2 ±\pm 9.4 % 85.4 ±\pm 10.4 % 79.1 ±\pm 11.3 % 80.8 ±\pm 11.4 %
Table 1: Average training loss and test accuracy of different algorithms for MNIST handwritten digit recognition over 60 random seeds.
Figure 5: Training a two-layer neural network with the MNIST data set using gradient descent, NAG, PDD, Adam, and IGAHD (‘att’). The left panel shows the convergence speed of training loss. The right panel shows the test accuracy of each method. The x-axis represents the number of iterations in terms of mini-batches.

4.6.2 CIFAR10 with CNN

We train a convolutional neural network on the CIFAR10 data set with SGD, NAG, PDD, Adam, and IGAHD (‘att’). The architecture is as follows. The network consists of two convolutional layers: the first has 32 output channels with 3\times 3 filters, and the second has 64 output channels with 4\times 4 filters. Each convolutional layer is followed by a ReLU activation and then a 2\times 2 max-pooling layer. Lastly, we have three fully connected layers of size (64\cdot 4\cdot 4,120), (120,84), and (84,10). The loss is the cross-entropy loss. The stepsize of SGD is \tau_{\textrm{gd}}=0.01. The parameters of NAG are \tau_{\textrm{nag}}=0.005 and \textrm{momentum}=0.9. The parameters of PDD are \tau_{\textrm{pdd}}=0.005, \sigma_{\textrm{pdd}}=5, \varepsilon=0.005, \omega=1, A=1. For IGAHD (‘att’), we choose \tau_{\textrm{att}}=0.005, \alpha=3, \beta^{(1)}=0.01. For Adam, we choose \tau_{\textrm{adam}}=0.005, \beta_{1}=0.9, \beta_{2}=0.999.
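A hypothetical PyTorch sketch of the described architecture. Since paddings and strides are not stated, the flattened size (given as 64\cdot 4\cdot 4 in the text) is inferred at run time via LazyLinear.

import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),    # 32 channels, 3x3 filters
    nn.Conv2d(32, 64, kernel_size=4), nn.ReLU(), nn.MaxPool2d(2),   # 64 channels, 4x4 filters
    nn.Flatten(),
    nn.LazyLinear(120), nn.ReLU(),   # fully connected layers (120, 84, 10); input size inferred on first forward pass
    nn.Linear(120, 84), nn.ReLU(),
    nn.Linear(84, 10),
)
loss_fn = nn.CrossEntropyLoss()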

Figure 6: Training loss (left panel) and test accuracy (right panel) of a convolutional neural network on the CIFAR10 data set. The x-axis represents the number of iterations in terms of mini-batches.
Algorithm SGD NAG PDD Adam Att
train loss 2.038 ±\pm 0.070 1.347 ±\pm 0.100 0.697 ±\pm 0.077 0.927 ±\pm 0.128 0.879 ±\pm 0.092
test acc 27.5 ±\pm 2.2 % 51.2 ±\pm 0.7 % 70.3 ±\pm 0.5 % 64.4 ±\pm 2.7 % 66.8 ±\pm 0.6 %
Table 2: Average training loss and test accuracy of different algorithms for CIFAR10 data set over 60 random seeds.

5 Discussion

This paper presents primal-dual hybrid gradient algorithms for solving unconstrained optimization problems. We reformulate the optimality condition of the optimization problem as a saddle-point problem and then solve the proposed saddle-point problem with a preconditioned PDHG method. We present a geometric convergence analysis for strongly convex objective functions. In numerical experiments, we demonstrate that the proposed method works efficiently on several non-convex optimization problems, such as the Rosenbrock and Ackley functions. In particular, it can efficiently train two-layer and convolutional neural networks in supervised learning problems.

So far, our convergence study is limited to strongly convex objective functions, not general convex ones. Meanwhile, the choices of preconditioners and stepsizes are independent of time. We also have not discussed the optimal choices of parameters or general proximal operators in the updates of the algorithms. These generalized choices of functions and parameters, and their convergence properties, have been intensively studied for Nesterov's accelerated gradient methods and Attouch's Hessian-driven damping methods. In future work, we shall explore the convergence properties of PDHG methods for convex functions with time-dependent parameters. We will also investigate the convergence of similar algorithms in scientific computing problems involving implicit-in-time updates of partial differential equations [Li et al., 2022, 2023, Liu et al., 2023].

Acknowledgement: X. Zuo and S. Osher’s work was partly supported by AFOSR MURI FP 9550-18-1-502 and ONR grants: N00014-20-1-2093 and N00014-20-1-2787. W. Li’s work was supported by AFOSR MURI FP 9550-18-1-502, AFOSR YIP award No. FA9550-23-1-0087, and NSF RTG: 2038080.

References

  • Attouch et al. [2019] H. Attouch, Z. Chbani, and H. Riahi. Fast proximal methods via time scaling of damped inertial dynamics. SIAM Journal on Optimization, 29(3):2227–2256, 2019.
  • Attouch et al. [2020] H. Attouch, Z. Chbani, J. Fadili, and H. Riahi. First-order optimization algorithms via inertial systems with Hessian driven damping. Mathematical Programming, pages 1–43, 2020.
  • Attouch et al. [2021] H. Attouch, Z. Chbani, J. Fadili, and H. Riahi. Convergence of iterates for first-order optimization algorithms with inertia and Hessian driven damping. Optimization, pages 1–40, 2021.
  • Boyd and Vandenberghe [2004] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
  • Boyd et al. [2011] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1):1–122, 2011.
  • Chambolle and Pock [2011] A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40:120–145, 2011.
  • Chen and Luo [2019] L. Chen and H. Luo. First order optimization methods based on Hessian-driven Nesterov accelerated gradient flow. arXiv preprint arXiv:1912.09276, 2019.
  • Chen and Luo [2021] L. Chen and H. Luo. A unified convergence analysis of first order convex optimization methods via strong Lyapunov functions. arXiv preprint arXiv:2108.00132, 2021.
  • Gabay and Mercier [1976] D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications, 2(1):17–40, 1976. ISSN 0898-1221. doi: https://doi.org/10.1016/0898-1221(76)90003-1. URL https://www.sciencedirect.com/science/article/pii/0898122176900031.
  • Jacobs et al. [2019] M. Jacobs, F. Léger, W. Li, and S. Osher. Solving large-scale optimization problems with a convergence rate independent of grid size. SIAM Journal on Numerical Analysis, 57(3):1100–1123, 2019.
  • Kingma and Ba [2014] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Li et al. [2022] W. Li, S. Liu, and S. Osher. Controlling conservation laws II: Compressible Navier–Stokes equations. Journal of Computational Physics, 463:111264, 2022.
  • Li et al. [2023] W. Li, S. Liu, and S. Osher. Controlling conservation laws I: Entropy–entropy flux. Journal of Computational Physics, 480:112019, 2023.
  • Liu et al. [2023] S. Liu, S. Osher, W. Li, and C.-W. Shu. A primal-dual approach for solving conservation laws with implicit in time approximations. Journal of Computational Physics, 472:111654, 2023.
  • Liu et al. [2021] Y. Liu, Y. Xu, and W. Yin. Acceleration of primal–dual methods by preconditioning and simple subproblem procedures. Journal of Scientific Computing, 86(2):21, 2021.
  • Nesterov [1983] Y. E. Nesterov. A method of solving a convex programming problem with convergence rate 𝒪(1k2)\mathcal{O}(\frac{1}{k^{2}}). In Doklady Akademii Nauk, volume 269, pages 543–547. Russian Academy of Sciences, 1983.
  • Osher et al. [2022] S. Osher, B. Wang, P. Yin, X. Luo, F. Barekat, M. Pham, and A. Lin. Laplacian smoothing gradient descent. Research in the Mathematical Sciences, 9(3):55, 2022.
  • Ouyang et al. [2015] Y. Ouyang, Y. Chen, G. Lan, and E. Pasiliao Jr. An accelerated linearized alternating direction method of multipliers. SIAM Journal on Imaging Sciences, 8(1):644–681, 2015.
  • Park et al. [2021] J.-H. Park, A. J. Salgado, and S. M. Wise. Preconditioned accelerated gradient descent methods for locally lipschitz smooth objectives with applications to the solution of nonlinear PDEs. Journal of Scientific Computing, 89(1):17, 2021.
  • Pascanu and Bengio [2013] R. Pascanu and Y. Bengio. Revisiting natural gradient for deep networks. arXiv preprint arXiv:1301.3584, 2013.
  • Pock and Chambolle [2011] T. Pock and A. Chambolle. Diagonal preconditioning for first order primal-dual algorithms in convex optimization. In 2011 International Conference on Computer Vision, pages 1762–1769, 2011. doi: 10.1109/ICCV.2011.6126441.
  • Polyak [1964] B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
  • Rees [2010] T. Rees. Preconditioning iterative methods for PDE constrained optimization. PhD thesis, University of Oxford, Oxford, UK, 2010.
  • Siegel [2019] J. W. Siegel. Accelerated first-order methods: Differential equations and Lyapunov functions. arXiv preprint arXiv:1903.05671, 2019.
  • Su et al. [2015] W. Su, S. Boyd, and E. J. Candes. A differential equation for modeling Nesterov’s accelerated gradient method: theory and insights. arXiv preprint arXiv:1503.01243, 2015.
  • Trefethen and Bau [2022] L. N. Trefethen and D. Bau. Numerical linear algebra, volume 181. SIAM, 2022.
  • Valkonen [2014] T. Valkonen. A primal–dual hybrid gradient method for nonlinear operators with applications to MRI. Inverse Problems, 30(5):055012, 2014.
  • Wibisono et al. [2016] A. Wibisono, A. C. Wilson, and M. I. Jordan. A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 113(47):E7351–E7358, 2016.

Appendix A Matrix lemma

Lemma A.1.

Let {\bm{A}},{\bm{B}},{\bm{C}}\in\mathbb{R}^{n\times n} be real symmetric matrices that are simultaneously diagonalizable. Suppose that

λA,i+|λC,i|20\lambda_{A,i}+\frac{|\lambda_{C,i}|}{2}\leq 0\,
λB,i+|λC,i|20\lambda_{B,i}+\frac{|\lambda_{C,i}|}{2}\leq 0\,

for all i, where \lambda_{A,i},\lambda_{B,i},\lambda_{C,i} are the i-th eigenvalues of {\bm{A}},{\bm{B}},{\bm{C}}, respectively, in the common eigenbasis. Then

𝒙T𝑨𝒙+𝒚T𝑩𝒚+𝒙T𝑪𝒚0,{\bm{x}}^{T}{\bm{A}}{\bm{x}}+{\bm{y}}^{T}{\bm{B}}{\bm{y}}+{\bm{x}}^{T}{\bm{C}}{\bm{y}}\leq 0,

for all 𝐱,𝐲n{\bm{x}},{\bm{y}}\in\mathbb{R}^{n}.

Proof.

Let 𝒙,𝒚n{\bm{x}},{\bm{y}}\in\mathbb{R}^{n}. By our assumption, there exists 𝑸{\bm{Q}} unitary such that 𝑨,𝑩,𝑪{\bm{A}},{\bm{B}},{\bm{C}} are simultaneously diagonalizable by 𝑸{\bm{Q}}. Set 𝒙~=𝑸𝒙\tilde{{\bm{x}}}={\bm{Q}}{\bm{x}} and 𝒚~=𝑸𝒚\tilde{{\bm{y}}}={\bm{Q}}{\bm{y}}. Then we can compute

𝒙T𝑨𝒙+𝒚T𝑩𝒚+𝒙T𝑪𝒚\displaystyle{\bm{x}}^{T}{\bm{A}}{\bm{x}}+{\bm{y}}^{T}{\bm{B}}{\bm{y}}+{\bm{x}}^{T}{\bm{C}}{\bm{y}} =i=1nx~i2λA,i+y~i2λB,i+x~iλC,iy~i\displaystyle=\sum_{i=1}^{n}\tilde{x}_{i}^{2}\lambda_{A,i}+\tilde{y}_{i}^{2}\lambda_{B,i}+\tilde{x}_{i}\lambda_{C,i}\tilde{y}_{i}
i=1nx~i2(λA,i+|λC,i|2)+y~i2(λB,i+|λC,i|2)\displaystyle\leq\sum_{i=1}^{n}\tilde{x}_{i}^{2}\big{(}\lambda_{A,i}+\frac{|\lambda_{C,i}|}{2}\big{)}+\tilde{y}_{i}^{2}\big{(}\lambda_{B,i}+\frac{|\lambda_{C,i}|}{2}\big{)}
0,\displaystyle\leq 0\,,

where the first inequality follows from αxy(x2+y2)|α|/2\alpha xy\leq(x^{2}+y^{2})|\alpha|/2 for any α,x,y\alpha,x,y\in\mathbb{R}. ∎
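A quick numerical sanity check of Lemma A.1 (a sketch; the construction of the common eigenbasis and of eigenvalues satisfying the hypotheses is ours):

import numpy as np

rng = np.random.default_rng(0)
n = 5
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))    # common orthogonal eigenbasis
lam_C = rng.standard_normal(n)
lam_A = -np.abs(lam_C) / 2 - rng.random(n)          # enforce lam_A + |lam_C|/2 <= 0
lam_B = -np.abs(lam_C) / 2 - rng.random(n)          # enforce lam_B + |lam_C|/2 <= 0
A, B, C = (Q @ np.diag(lam) @ Q.T for lam in (lam_A, lam_B, lam_C))

x, y = rng.standard_normal(n), rng.standard_normal(n)
print(x @ A @ x + y @ B @ y + x @ C @ y <= 1e-12)   # expected: True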

Appendix B Proof of Theorem 2.6

B.1 Part (a)

We have the following system of ODE:

(𝒙˙𝒑˙)=(γ𝑩𝑸𝑨𝑸𝑩𝑸(𝕀γε𝑨)𝑨𝑸ε𝑨)(𝒙𝒑).\begin{pmatrix}\dot{{\bm{x}}}\\ \dot{{\bm{p}}}\end{pmatrix}=\begin{pmatrix}-\gamma{\bm{B}}{\bm{Q}}{\bm{A}}{\bm{Q}}&-{\bm{B}}{\bm{Q}}({\mathbb{I}}-\gamma\varepsilon{\bm{A}})\\ {\bm{A}}{\bm{Q}}&-\varepsilon{\bm{A}}\end{pmatrix}\begin{pmatrix}{\bm{x}}\\ {\bm{p}}\end{pmatrix}\,. (B.1)

Let us compute the eigenvalues of the above system. Let α\alpha be an eigenvalue, then α\alpha satisfies

det(γ𝑩𝑸𝑨𝑸α𝕀𝑩𝑸(𝕀γε𝑨)𝑨𝑸ε𝑨α𝕀)\displaystyle\mathrm{det}\begin{pmatrix}-\gamma{\bm{B}}{\bm{Q}}{\bm{A}}{\bm{Q}}-\alpha{\mathbb{I}}&-{\bm{B}}{\bm{Q}}({\mathbb{I}}-\gamma\varepsilon{\bm{A}})\\ {\bm{A}}{\bm{Q}}&-\varepsilon{\bm{A}}-\alpha{\mathbb{I}}\end{pmatrix} =0\displaystyle=0
det((γ𝑩𝑸𝑨𝑸α𝕀)(ε𝑨α𝕀)+𝑩𝑸(𝕀γε𝑨)𝑨𝑸)\displaystyle\mathrm{det}\big{(}(-\gamma{\bm{B}}{\bm{Q}}{\bm{A}}{\bm{Q}}-\alpha{\mathbb{I}})(-\varepsilon{\bm{A}}-\alpha{\mathbb{I}})+{\bm{B}}{\bm{Q}}({\mathbb{I}}-\gamma\varepsilon{\bm{A}}){\bm{A}}{\bm{Q}}\big{)} =0\displaystyle=0
det(α2𝕀+α(ε𝑨+γ𝑩𝑸𝑨𝑸)+γε𝑩𝑸𝑨𝑸𝑨+𝑩𝑸𝑨𝑸γε𝑩𝑸𝑨𝑨𝑸)\displaystyle\mathrm{det}\big{(}\alpha^{2}{\mathbb{I}}+\alpha(\varepsilon{\bm{A}}+\gamma{\bm{B}}{\bm{Q}}{\bm{A}}{\bm{Q}})+\gamma\varepsilon{\bm{B}}{\bm{Q}}{\bm{A}}{\bm{Q}}{\bm{A}}+{\bm{B}}{\bm{Q}}{\bm{A}}{\bm{Q}}-\gamma\varepsilon{\bm{B}}{\bm{Q}}{\bm{A}}{\bm{A}}{\bm{Q}}\big{)} =0\displaystyle=0
det(α2𝕀+α(ε𝑨+γ𝑩𝑸𝑨𝑸)+𝑩𝑸𝑨𝑸)\displaystyle\mathrm{det}\big{(}\alpha^{2}{\mathbb{I}}+\alpha(\varepsilon{\bm{A}}+\gamma{\bm{B}}{\bm{Q}}{\bm{A}}{\bm{Q}})+{\bm{B}}{\bm{Q}}{\bm{A}}{\bm{Q}}\big{)} =0.\displaystyle=0\,.

The last equality holds because {\bm{A}} commutes with {\bm{Q}}. We assume that {\bm{A}} and {\bm{B}}{\bm{Q}}{\bm{A}}{\bm{Q}} are simultaneously diagonalizable, with eigenvalues a_{i} and \mu_{i}, respectively. Thus,

0=\alpha^{2}+\alpha(\varepsilon a_{i}+\gamma\mu_{i})+\mu_{i}\,,
\alpha=\frac{-\varepsilon a_{i}-\gamma\mu_{i}\pm\sqrt{(\varepsilon a_{i}+\gamma\mu_{i})^{2}-4\mu_{i}}}{2}\,.

If \gamma>0 and \varepsilon\geq 0, then the real parts of the eigenvalues are negative, and the system converges. The convergence rate depends on the largest real part of the eigenvalues, which is

\max_{i}\frac{1}{2}\big{[}-\gamma\mu_{i}-\varepsilon a_{i}+\Re\big{(}\sqrt{(\gamma\mu_{i}+\varepsilon a_{i})^{2}-4\mu_{i}}\big{)}\big{]}\,.
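A small numerical check of the characteristic roots above in the diagonal case; the scalars a_i, q_i below are hypothetical, and we take {\bm{B}}={\mathbb{I}}.

import numpy as np

gamma, eps = 0.7, 0.3
a_i, q_i = 1.5, 2.0                  # hypothetical eigenvalues of A and Q in the common basis (B = I)
mu_i = a_i * q_i**2                  # corresponding eigenvalue of B Q A Q

# The i-th 2x2 block of the system matrix in Eq. B.1
M = np.array([[-gamma * mu_i, -q_i * (1.0 - gamma * eps * a_i)],
              [a_i * q_i, -eps * a_i]])

# Roots of alpha^2 + alpha (eps a_i + gamma mu_i) + mu_i = 0
roots = np.roots([1.0, eps * a_i + gamma * mu_i, mu_i])
print(np.sort_complex(np.linalg.eigvals(M)))
print(np.sort_complex(roots))        # should agree with the eigenvalues of M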

B.2 Part (c)

When γ=ε=0\gamma=\varepsilon=0, we see that α\alpha is purely imaginary. Thus solutions to Eq. B.1 will be oscillatory and will not converge.

B.3 Part (b)

Let us define

g(γ)=maxi{μi(γ+(γ24/μi))2}.g(\gamma)=\max_{i}\big{\{}\frac{\mu_{i}\big{(}-\gamma+\Re\big{(}\sqrt{\gamma^{2}-4/\mu_{i}}\big{)}\big{)}}{2}\big{\}}\,.

Essentially, we would like to find γ=argminγg(γ)\gamma^{*}=\operatorname*{arg\,min}_{\gamma}g(\gamma). We then define

γ(μ)\displaystyle\gamma(\mu) :=argminγμ(γ+(γ24/μ))2\displaystyle:=\operatorname*{arg\,min}_{\gamma}\frac{\mu\big{(}-\gamma+\Re\big{(}\sqrt{\gamma^{2}-4/\mu}\big{)}\big{)}}{2}
=2μ.\displaystyle=\frac{2}{\sqrt{\mu}}\,.

Observe that if γ2/μn\gamma\geq 2/\sqrt{\mu_{n}}, then γ24/μi0\gamma^{2}-4/\mu_{i}\geq 0 for all ii. Thus

g(γ)\displaystyle g(\gamma) =maxi{μi(γ+γ24/μi)2}.\displaystyle=\max_{i}\big{\{}\frac{\mu_{i}\big{(}-\gamma+\sqrt{\gamma^{2}-4/\mu_{i}}\big{)}}{2}\big{\}}\,.

For \mu\in[\mu_{n},\mu_{1}] and \gamma\geq 2/\sqrt{\mu_{n}}, one can check that the function \mu\big{(}-\gamma+\sqrt{\gamma^{2}-4/\mu}\big{)} is increasing in \mu by computing the partial derivative with respect to \mu. Moreover, -\gamma+\sqrt{\gamma^{2}-4/\mu_{1}} is increasing in \gamma, since its derivative with respect to \gamma is -1+\gamma/\sqrt{\gamma^{2}-4/\mu_{1}}\geq 0. Then, for \gamma\geq 2/\sqrt{\mu_{n}}, we get

g(γ)=μ1(γ+γ24/μ1)2g(2/μn)=μ1(κ1κ)μn/2,g(\gamma)=\frac{\mu_{1}\big{(}-\gamma+\sqrt{\gamma^{2}-4/\mu_{1}}\big{)}}{2}\geq g(2/\sqrt{\mu_{n}})=\sqrt{\mu_{1}}(\sqrt{\kappa-1}-\sqrt{\kappa})\approx-\sqrt{\mu_{n}}/2\,,

where κ=μ1/μn>1\kappa=\mu_{1}/\mu_{n}>1. The last approximation is valid for μ1/μn1\mu_{1}/\mu_{n}\gg 1. This shows that γ2/μn\gamma^{*}\leq 2/\sqrt{\mu_{n}}. Similarly, if γ2/μ1\gamma\leq 2/\sqrt{\mu_{1}}, then γ24/μi0\gamma^{2}-4/\mu_{i}\leq 0 for all ii. Thus

g(γ)\displaystyle g(\gamma) =maxi{μiγ2}\displaystyle=\max_{i}\big{\{}\frac{-\mu_{i}\gamma}{2}\big{\}}
=μnγ2\displaystyle=\frac{-\mu_{n}\gamma}{2}
μnμ1=g(2/μ1).\displaystyle\geq-\frac{\mu_{n}}{\sqrt{\mu_{1}}}=g(2/\sqrt{\mu_{1}})\,.

This shows that γ2/μ1\gamma^{*}\geq 2/\sqrt{\mu_{1}}. Combining with our previous observation, we get γ[2/μ1,2/μn]\gamma^{*}\in[2/\sqrt{\mu_{1}},2/\sqrt{\mu_{n}}]. Now let us fix some γ[2/μ1,2/μn]\gamma^{\prime}\in[2/\sqrt{\mu_{1}},2/\sqrt{\mu_{n}}]. Let j=inf{i:1in,γ24/μi0}j=\inf\{i:1\leq i\leq n,\gamma^{\prime 2}-4/\mu_{i}\leq 0\}. By our assumption on γ\gamma^{\prime}, we know that 1<j<n1<j<n. Now for 1ij11\leq i\leq j-1, we have

μi(γ+(γ24/μi))2=μi(γ+γ24/μi)2μ1(γ+γ24/μ1)2.\frac{\mu_{i}\big{(}-\gamma^{\prime}+\Re\big{(}\sqrt{\gamma^{\prime 2}-4/\mu_{i}}\big{)}\big{)}}{2}=\frac{\mu_{i}\big{(}-\gamma^{\prime}+\sqrt{\gamma^{\prime 2}-4/\mu_{i}}\big{)}}{2}\leq\frac{\mu_{1}\big{(}-\gamma^{\prime}+\sqrt{\gamma^{\prime 2}-4/\mu_{1}}\big{)}}{2}\,.

And for jknj\leq k\leq n, we have

μk(γ+(γ24/μk))2=μkγ2μnγ2.\frac{\mu_{k}\big{(}-\gamma^{\prime}+\Re\big{(}\sqrt{\gamma^{\prime 2}-4/\mu_{k}}\big{)}\big{)}}{2}=\frac{-\mu_{k}\gamma^{\prime}}{2}\leq\frac{-\mu_{n}\gamma^{\prime}}{2}\,.

It is thus clear that for γ[2/μ1,2/μn]\gamma^{\prime}\in[2/\sqrt{\mu_{1}},2/\sqrt{\mu_{n}}],

g(γ)\displaystyle g(\gamma^{\prime}) =max{μ1(γ+γ24/μ1)2,μnγ2}.\displaystyle=\max\big{\{}\frac{\mu_{1}\big{(}-\gamma^{\prime}+\sqrt{\gamma^{\prime 2}-4/\mu_{1}}\big{)}}{2},\frac{-\mu_{n}\gamma^{\prime}}{2}\big{\}}\,.

It is straightforward to calculate that for γ[2μ1,2μ1μn(2μ1μn)]\gamma\in[\frac{2}{\sqrt{\mu_{1}}},\frac{2\sqrt{\mu_{1}}}{\sqrt{\mu_{n}(2\mu_{1}-\mu_{n})}}], we have

μnγ2μ1(γ+γ24/μ1)2.\frac{-\mu_{n}\gamma}{2}\geq\frac{\mu_{1}\big{(}-\gamma+\sqrt{\gamma^{2}-4/\mu_{1}}\big{)}}{2}\,.

So

g(γ)=μnγ2g(2μ1μn(2μ1μn))=μn21κ.g(\gamma)=\frac{-\mu_{n}\gamma}{2}\geq g(\frac{2\sqrt{\mu_{1}}}{\sqrt{\mu_{n}(2\mu_{1}-\mu_{n})}})=\frac{-\sqrt{\mu_{n}}}{\sqrt{2-\frac{1}{\kappa}}}\,.

And for γ[2μ1μn(2μ1μn),2/μn]\gamma\in[\frac{2\sqrt{\mu_{1}}}{\sqrt{\mu_{n}(2\mu_{1}-\mu_{n})}},2/\sqrt{\mu_{n}}] we have

μnγ2μ1(γ+γ24/μ1)2.\frac{-\mu_{n}\gamma}{2}\leq\frac{\mu_{1}\big{(}-\gamma+\sqrt{\gamma^{2}-4/\mu_{1}}\big{)}}{2}\,.

This implies

g(γ)=μ1(γ+γ24/μ1)2g(2μ1μn(2μ1μn))=μn21κ.g(\gamma)=\frac{\mu_{1}\big{(}-\gamma+\sqrt{\gamma^{2}-4/\mu_{1}}\big{)}}{2}\geq g(\frac{2\sqrt{\mu_{1}}}{\sqrt{\mu_{n}(2\mu_{1}-\mu_{n})}})=\frac{-\sqrt{\mu_{n}}}{\sqrt{2-\frac{1}{\kappa}}}\,.

This shows γ=2μ1μn(2μ1μn)\gamma^{*}=\frac{2\sqrt{\mu_{1}}}{\sqrt{\mu_{n}(2\mu_{1}-\mu_{n})}}.
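A numerical sanity check of the optimal \gamma^{*} above, on a hypothetical spectrum:

import numpy as np

mu = np.array([9.0, 4.0, 1.0, 0.25])            # hypothetical eigenvalues: mu_1 = 9, mu_n = 0.25

def g(gamma):
    # g(gamma) = max_i mu_i (-gamma + Re sqrt(gamma^2 - 4/mu_i)) / 2
    return np.max(mu * (-gamma + np.real(np.sqrt(gamma**2 - 4.0 / mu + 0j))) / 2.0)

grid = np.linspace(2 / np.sqrt(mu[0]), 2 / np.sqrt(mu[-1]), 100001)
gamma_numeric = grid[np.argmin([g(t) for t in grid])]
gamma_star = 2 * np.sqrt(mu[0]) / np.sqrt(mu[-1] * (2 * mu[0] - mu[-1]))
print(gamma_numeric, gamma_star)                # the two values should agree up to the grid resolution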

B.4 Part (d)

Define \Delta_{\gamma}(\mu,\varepsilon)=(\gamma\mu+\varepsilon)^{2}-4\mu and g_{\gamma}(\mu)=2\sqrt{\mu}-\gamma\mu. Then, for \mu\geq 0, we have \Delta_{\gamma}(\mu,\varepsilon)\leq 0 if and only if \varepsilon\leq g_{\gamma}(\mu). Note that g_{\gamma}^{\prime}(\mu)=\frac{1}{\sqrt{\mu}}-\gamma\geq 0 for \mu\leq\mu_{1} if \gamma\leq\frac{1}{\sqrt{\mu_{1}}}. Therefore, \Delta_{\gamma}(\mu,\varepsilon)\leq 0 for all \mu\in[\mu_{n},\mu_{1}] if \gamma\leq\frac{1}{\sqrt{\mu_{1}}} and \varepsilon\leq g_{\gamma}(\mu_{n}). In particular, \Delta_{\gamma}(\mu,\varepsilon)\leq 0 for all \mu\in[\mu_{n},\mu_{1}] if \varepsilon=g_{\gamma}(\mu^{\prime}) for some \mu^{\prime}\leq\mu_{n}. We have

α\displaystyle\alpha =maxi12[γμiε+((γμi+ε)24μi)]\displaystyle=\max_{i}\frac{1}{2}\big{[}-\gamma\mu_{i}-\varepsilon+\Re\big{(}\sqrt{(\gamma\mu_{i}+\varepsilon)^{2}-4\mu_{i}}\big{)}\big{]}
=maxi12[γμiε]\displaystyle=\max_{i}\frac{1}{2}\big{[}-\gamma\mu_{i}-\varepsilon\big{]}
=maxi12[γμi2μ+γμ]\displaystyle=\max_{i}\frac{1}{2}\big{[}-\gamma\mu_{i}-2\sqrt{\mu^{\prime}}+\gamma\mu^{\prime}\big{]}
=μγ(μnμ)2.\displaystyle=-\sqrt{\mu^{\prime}}-\frac{\gamma(\mu_{n}-\mu^{\prime})}{2}.

Appendix C Proof of Proposition 2.4

We directly compute

𝒙¨\displaystyle\ddot{{\bm{x}}} =𝑪((𝕀γε𝑨)𝒑˙+γ𝑨2f(𝒙)𝒙˙)𝑪˙((𝕀γε𝑨)𝒑+γ𝑨f(𝒙))\displaystyle=-{\bm{C}}\big{(}({\mathbb{I}}-\gamma\varepsilon{\bm{A}})\dot{{\bm{p}}}+\gamma{\bm{A}}\nabla^{2}f({\bm{x}})\dot{{\bm{x}}}\big{)}-\dot{{\bm{C}}}\big{(}({\mathbb{I}}-\gamma\varepsilon{\bm{A}}){\bm{p}}+\gamma{\bm{A}}\nabla f({\bm{x}})\big{)}
=𝑪((𝕀γε𝑨)(𝑨f(𝒙)ε𝑨𝒑)+γ𝑨2f(𝒙)𝒙˙)\displaystyle=-{\bm{C}}\big{(}({\mathbb{I}}-\gamma\varepsilon{\bm{A}})({\bm{A}}\nabla f({\bm{x}})-\varepsilon{\bm{A}}{\bm{p}})+\gamma{\bm{A}}\nabla^{2}f({\bm{x}})\dot{{\bm{x}}}\big{)}
𝑪˙((𝕀γε𝑨)𝒑+γ𝑨f(𝒙))\displaystyle\qquad-\dot{{\bm{C}}}\big{(}({\mathbb{I}}-\gamma\varepsilon{\bm{A}}){\bm{p}}+\gamma{\bm{A}}\nabla f({\bm{x}})\big{)}
=𝑪[(𝕀γε𝑨)𝑨f(𝒙)+ε𝑨(𝑪1𝒙˙+γ𝑨f(𝒙))+γ𝑨2f(𝒙)𝒙˙]+𝑪˙𝑪1𝒙˙\displaystyle=-{\bm{C}}\big{[}({\mathbb{I}}-\gamma\varepsilon{\bm{A}}){\bm{A}}\nabla f({\bm{x}})+\varepsilon{\bm{A}}({\bm{C}}^{-1}\dot{{\bm{x}}}+\gamma{\bm{A}}\nabla f({\bm{x}}))+\gamma{\bm{A}}\nabla^{2}f({\bm{x}})\dot{{\bm{x}}}\big{]}+\dot{{\bm{C}}}{\bm{C}}^{-1}\dot{{\bm{x}}}
\displaystyle=-{\bm{C}}\big{[}{\bm{A}}\nabla f({\bm{x}})+\varepsilon{\bm{A}}{\bm{C}}^{-1}\dot{{\bm{x}}}+\gamma{\bm{A}}\nabla^{2}f({\bm{x}})\dot{{\bm{x}}}\big{]}+\dot{{\bm{C}}}{\bm{C}}^{-1}\dot{{\bm{x}}}\,.

In the third equality we used {\bm{C}}^{-1}\dot{{\bm{x}}}=-\big{(}({\mathbb{I}}-\gamma\varepsilon{\bm{A}}){\bm{p}}+\gamma{\bm{A}}\nabla f({\bm{x}})\big{)} and the fact that {\bm{A}} commutes with {\mathbb{I}}-\gamma\varepsilon{\bm{A}}; in the last equality, the terms \gamma\varepsilon{\bm{A}}^{2}\nabla f({\bm{x}}) cancel.