
Oracle complexities of augmented Lagrangian methods for nonsmooth manifold optimization

Kangkang Deng, Department of Mathematics, National University of Defense Technology, Changsha, 410073, China (freedeng1208@gmail.com)    Jiang Hu (corresponding author), Massachusetts General Hospital and Harvard Medical School, Harvard University, Boston, MA 02114, USA (hujiangopt@gmail.com)    Jiayuan Wu, College of Engineering, Peking University, Beijing 100871, China (1901110043@pku.edu.cn)    Zaiwen Wen, Beijing International Center for Mathematical Research, Center for Machine Learning Research and College of Engineering, Peking University, Beijing 100871, China (wenzw@pku.edu.cn)
Abstract

In this paper, we present two novel manifold inexact augmented Lagrangian methods, ManIAL for the deterministic setting and StoManIAL for the stochastic setting, for solving nonsmooth manifold optimization problems. Using the Riemannian gradient method as a subroutine, we establish an \mathcal{O}(\epsilon^{-3}) oracle complexity result for ManIAL, matching the best-known complexity result. Our algorithm relies on the careful selection of penalty parameters and the precise control of termination criteria for subproblems. Moreover, for cases where the smooth term takes an expectation form, the proposed StoManIAL utilizes a Riemannian recursive momentum method as a subroutine and achieves an oracle complexity of \tilde{\mathcal{O}}(\epsilon^{-3.5}), which improves on the best-known \mathcal{O}(\epsilon^{-4}) result. Numerical experiments conducted on sparse principal component analysis and sparse canonical correlation analysis demonstrate that our proposed methods outperform an existing method with the previously best-known complexity result. To the best of our knowledge, these are the first complexity results for augmented Lagrangian methods applied to nonsmooth manifold optimization problems.

1 Introduction

We are concerned with the following nonsmooth composite manifold optimization problem

(1.1) \min_{x\in\mathcal{M}}\varphi(x):=f(x)+h(\mathcal{A}x),

where f:\mathbb{R}^{n}\rightarrow\mathbb{R} is a continuously differentiable function, \mathcal{A}:\mathbb{R}^{n}\rightarrow\mathbb{R}^{m} is a linear mapping, h:\mathbb{R}^{m}\rightarrow(-\infty,+\infty] is a proper closed convex function, and \mathcal{M} is an embedded Riemannian submanifold. Hence, the objective function of (1.1) can be nonconvex and nonsmooth. Manifold optimization with smooth objective functions (i.e., smooth manifold optimization) has been extensively studied in the literature; see [1, 6, 39, 21] for a comprehensive review. Recently, manifold optimization with nonsmooth objective functions has been growing in popularity due to its wide applications in sparse principal component analysis [25], nonnegative principal component analysis [44, 24], semidefinite programming [8, 41], etc. Different from smooth manifold optimization, solving problem (1.1) is challenging because of the nonsmoothness. A range of algorithms has recently been developed to address nonsmooth optimization problems. Notably, one category of algorithms is the augmented Lagrangian framework, introduced by [37, 46]. Compared with other existing works, this algorithmic framework can handle a general \mathcal{A} along with possible extra constraints [46]. However, both works only provide asymptotic convergence results, and the iteration complexity or oracle complexity remains unclear. This naturally raises the following question:

Can we design an augmented Lagrangian algorithm framework for (1.1) with oracle complexity guarantee?

We answer this question affirmatively, giving two novel augmented Lagrangian algorithms that achieve an oracle complexity of \mathcal{O}(\epsilon^{-3}) in the deterministic setting and \tilde{\mathcal{O}}(\epsilon^{-3.5}) in the stochastic setting. The main contributions of this paper are given as follows:

  • Developing an inexact augmented Lagrangian algorithm framework for solving problem (1.1). Our algorithm relies on carefully selecting penalty parameters and controlling termination criteria for subproblems. In particular, when the Riemannian gradient descent method is used for the inner iterates, we prove that our algorithm, ManIAL, finds an ϵ\epsilon-stationary point with 𝒪(ϵ3)\mathcal{O}(\epsilon^{-3}) calls to the first-order oracle (refer to Theorem 3.13). To the best of our knowledge, this is the first complexity result for the augmented Lagrangian method for solving (1.1).

  • Addressing scenarios where the objective function ff takes the form f(x):=𝔼ξ[f(x;ξ)]f(x):=\mathbb{E}_{\xi}[f(x;\xi)]. We introduce a stochastic augmented Lagrangian method. Owing to the smoothness of the subproblem, we are able to apply the Riemannian recursive momentum method as a subroutine within our framework. Consequently, we establish that StoManIAL achieves the oracle complexity of 𝒪~(ϵ3.5)\tilde{\mathcal{O}}(\epsilon^{-3.5}) (refer to Theorem 4.10), which is better than the best-known 𝒪(ϵ4)\mathcal{O}(\epsilon^{-4}) result [30].

  • Our algorithm framework is highly flexible, allowing for the application of various Riemannian (stochastic) optimization methods, including first-order methods, semi-smooth Newton methods, and others, to solve sub-problems. This is due to the smoothness properties of the subproblems. Additionally, we provide two options for selecting the stopping criteria for subproblems, based on either a given residual or a given number of iterations, further enhancing the flexibility of the algorithm.

1.1 Related works

When the objective function is geodesically convex over a Riemannian manifold, the subgradient-type methods are studied in [4, 15, 14]. An asymptotic convergence result is first established in [14], while an iteration complexity of 𝒪(ϵ2)\mathcal{O}(\epsilon^{-2}) is established in [4, 15]. For the general nonsmooth problem on the Riemannian manifold, the Riemannian subgradient may not exist. By sampling points within a neighborhood of the current iteration point at which the objective function ff is differentiable, the Riemannian gradient sampling methods are developed in [18, 19].

There are many works addressing nonsmooth problems in the form of (1.1). When 𝒜\mathcal{A} in (1.1) is the identity operator and the proximal operator of hh can be calculated analytically, Riemannian proximal gradient methods [10, 22, 23] are proposed. These methods achieve an outer iteration complexity of 𝒪(ϵ2)\mathcal{O}(\epsilon^{-2}) for attaining an ϵ\epsilon-stationary solution. It should be emphasized that each iteration in these algorithms involves solving a subproblem that lacks an explicit solution, and they utilize the semismooth Newton method to solve it. For the general 𝒜\mathcal{A}, Peng et al. [37] develop the Riemannian smoothing gradient method by utilizing a smoothing technique from the Moreau envelope. They show that the algorithm achieves an iteration complexity of 𝒪(ϵ3)\mathcal{O}(\epsilon^{-3}). Moreover, Beck and Rosset [3] propose a dynamic smoothing gradient descent on manifold and obtain an iteration complexity of 𝒪(ϵ3)\mathcal{O}(\epsilon^{-3}). Very recently, based on a similar smoothing technique, a Riemannian alternating direction method of multipliers (RADMM) is proposed in [29] with an iteration complexity result of 𝒪(ϵ4)\mathcal{O}(\epsilon^{-4}) for driving a Karush–Kuhn–Tucker (KKT) residual based stationarity. This complexity result is a bit worse than 𝒪(ϵ3)\mathcal{O}(\epsilon^{-3}) [37] because there is a lack of adaptive adjustment of the smoothing parameter. There is also a line of research based on the augmented Lagrangian function. Deng and Peng [12] proposed a manifold inexact augmented Lagrangian method for solving problem (1.1). Zhou et al. [46] consider a more general problem and propose a globalized semismooth Newton method to solve its subproblem efficiently. In these two papers, they only give the asymptotic convergence result, while the iteration complexity is missing.

When the manifold \mathcal{M} is specified by equality constraints c(x)=0, e.g., the Stiefel manifold, problem (1.1) can be regarded as a constrained optimization problem with a nonsmooth and nonconvex objective function. Given that \mathcal{M} is often nonconvex, we only list the related works in the case where the constraint functions are nonconvex. The papers [32, 38] propose and study the iteration complexity of augmented Lagrangian methods for solving nonlinearly constrained nonconvex composite optimization problems. The iteration complexity results they achieve for an \epsilon-stationary point are both \tilde{\mathcal{O}}(\epsilon^{-3}). More specifically, [38] uses the accelerated gradient method of [16] to obtain the approximate stationary point, while the authors in [32] obtain such an approximate stationary point by applying an inner accelerated proximal method as in [9, 26], whose generated subproblems are convex. It is worth mentioning that both of these papers make a strong assumption on how the feasibility of an iterate is related to its stationarity. Lin et al. [34] propose an inexact proximal-point penalty method that solves a sequence of penalty subproblems. Under a non-singularity condition, they show a complexity result of \tilde{\mathcal{O}}(\epsilon^{-3}).

Finally, we point out that there exist several works for nonsmooth expectation problems on the Riemannian manifold, i.e., when f(x)=𝔼[f(x,ξ)]f(x)=\mathbb{E}[f(x,\xi)] in problem (1.1). In particular, Li et al. [30] focus on a class of weakly convex problems on the Stiefel manifold. They present a Riemannian stochastic subgradient method and show that it has an iteration complexity of 𝒪(ϵ4)\mathcal{O}(\epsilon^{-4}) for driving a natural stationarity measure below ϵ\epsilon. Peng et al. [37] propose a Riemannian stochastic smoothing method and give an iteration complexity of 𝒪(ϵ5)\mathcal{O}(\epsilon^{-5}) for driving a natural stationary point. Li et al. [33] propose a stochastic inexact augmented Lagrangian method to solve problems involving a nonconvex composite objective and nonconvex smooth functional constraints. Under certain regularity conditions, they establish an oracle complexity result of 𝒪(ϵ5)\mathcal{O}(\epsilon^{-5}) with a stochastic first-order oracle. Wang et al. [40] propose two Riemannian stochastic proximal gradient methods for minimizing a nonsmooth function over the Stiefel manifold. They show that the proposed algorithm finds an ϵ\epsilon-stationary point with 𝒪(ϵ3)\mathcal{O}(\epsilon^{-3}) first-order oracle complexity. It should be emphasized that each oracle call involves a subroutine whose iteration complexity remains unclear.

In Table 1, we summarize our complexity results and those of several existing methods for producing an \epsilon-stationary point. The definitions of the first-order oracle in the deterministic and stochastic settings are provided in Definitions 3.12 and 4.9, respectively. The notation \tilde{\mathcal{O}} hides a logarithmic factor. Here, we only consider the scenario where both the objective function and the nonlinear constraint are nonconvex. For the iteration complexity of ALM in the convex case, please refer to [2, 35, 43, 42, 27], etc. We do not include the proximal-type algorithms in [10, 40, 22, 23] since those algorithms use the semismooth Newton method to solve the subproblem, for which the overall oracle complexity is not clear. The table shows that our algorithms achieve better oracle complexity results in both the deterministic and stochastic settings. In particular, in the stochastic setting, our algorithm establishes an oracle complexity of \tilde{\mathcal{O}}(\epsilon^{-3.5}), which is better than the best-known \mathcal{O}(\epsilon^{-4}) result.

Table 1: Comparison of the oracle complexity results of several methods in the literature with our method to produce an \epsilon-stationary point
Algorithms | \mathcal{A} | Constraints | Objective | Complexity
iALM [38] | \mathcal{I} | c(x)=0 | deterministic | \mathcal{O}(\epsilon^{-4})
iPPP [34] | \mathcal{I} | c(x)=0 | deterministic | \tilde{\mathcal{O}}(\epsilon^{-3})^{1}
improved iALM [32] | \mathcal{I} | c(x)=0 | deterministic | \tilde{\mathcal{O}}(\epsilon^{-3})
Sto-iALM [33] | \mathcal{I} | \mathbb{E}_{\xi}[c(x,\xi)]=0 | stochastic | \mathcal{O}(\epsilon^{-5})
RADMM [29] | general | compact submanifold | deterministic | \mathcal{O}(\epsilon^{-4})
DSGM [3] | general | compact submanifold | deterministic | \mathcal{O}(\epsilon^{-3})
subgradient [30] | general | Stiefel manifold | deterministic | \mathcal{O}(\epsilon^{-4})
subgradient [30] | general | Stiefel manifold | stochastic | \mathcal{O}(\epsilon^{-4})
smoothing [37] | general | compact submanifold | deterministic | \mathcal{O}(\epsilon^{-3})
smoothing [37] | general | compact submanifold | stochastic | \mathcal{O}(\epsilon^{-5})
this paper | general | compact submanifold | deterministic | \mathcal{O}(\epsilon^{-3})
this paper | general | compact submanifold | stochastic | \tilde{\mathcal{O}}(\epsilon^{-3.5})
^{1} We only give the case where c(x) is nonconvex and a non-singularity condition holds on c(x).

1.2 Notation

Let \left<\cdot,\cdot\right> and \|\cdot\| be the Euclidean inner product and the induced Euclidean norm. Given a matrix A, we use \|A\|_{F} to denote the Frobenius norm and \|A\|_{1}:=\sum_{ij}|A_{ij}| to denote the entrywise \ell_{1} norm. For a vector x, we use \|x\|_{2} and \|x\|_{1} to denote its Euclidean norm and \ell_{1} norm, respectively. The indicator function of a set \mathcal{C}, denoted by \delta_{\mathcal{C}}, is 0 on \mathcal{C} and +\infty otherwise. The distance from x to \mathcal{C} is denoted by \mathrm{dist}(x,\mathcal{C}):=\min_{y\in\mathcal{C}}\|x-y\|. We use \nabla f(x) and \operatorname{grad}f(x) to denote the Euclidean gradient and the Riemannian gradient of f, respectively.

2 Preliminaries

In this section, we introduce some necessary concepts for Riemannian optimization and the Moreau envelope.

2.1 Riemannian optimization

An n-dimensional smooth manifold \mathcal{M} is an n-dimensional topological manifold equipped with a smooth structure, where each point has a neighborhood that is diffeomorphic to the n-dimensional Euclidean space. For every x\in\mathcal{M}, there exists a chart (U,\psi) such that U is an open set containing x and \psi is a diffeomorphism between U and an open set \psi(U) in the Euclidean space. A tangent vector \eta_{x} to \mathcal{M} at x is defined as the tangent of a parametrized curve \gamma on \mathcal{M} with \gamma(0)=x, namely

ηxu:=γ˙(0)u=d(u(γ(t)))dt|t=0,ux,\eta_{x}u:=\dot{\gamma}(0)u=\left.\frac{d(u(\gamma(t)))}{dt}\right|_{t=0},\forall u\in\wp_{x}\mathcal{M},

where x\wp_{x}\mathcal{M} is the set of all real-valued functions ff defined in a neighborhood of xx in \mathcal{M}. Then, the tangent space TxT_{x}\mathcal{M} of a manifold \mathcal{M} at xx is defined as the set of all tangent vectors at point xx. The manifold \mathcal{M} is called a Riemannian manifold if it is equipped with an inner product on the tangent space TxT_{x}\mathcal{M} at each xx\in\mathcal{M}. If \mathcal{M} is a Riemannian submanifold of an Euclidean space \mathcal{E}, the inner product is defined as the Euclidean inner product: ηx,ξx=tr(ηxξx)\left<\eta_{x},\xi_{x}\right>=\mathrm{tr}(\eta_{x}^{\top}\xi_{x}). The Riemannian gradient gradf(x)Tx\operatorname{grad}f(x)\in T_{x}\mathcal{M} is the unique tangent vector satisfying

gradf(x),ξ=df(x)[ξ],ξTx.\left<\operatorname{grad}f(x),\xi\right>=df(x)[\xi],\forall\xi\in T_{x}\mathcal{M}.

If \mathcal{M} is a compact Riemannian manifold embedded in an Euclidean space, we have that gradf(x)=𝒫Tx(f(x))\operatorname{grad}f(x)=\mathcal{P}_{T_{x}\mathcal{M}}(\nabla f(x)), where f(x)\nabla f(x) is the Euclidean gradient, 𝒫Tx\mathcal{P}_{T_{x}\mathcal{M}} is the projection operator onto the tangent space TxT_{x}\mathcal{M}. The retraction operator is one of the most important ingredients for manifold optimization, which turns an element of TxT_{x}\mathcal{M} into a point in \mathcal{M}.

Definition 2.1 (Retraction, [1]).

A retraction on a manifold \mathcal{M} is a smooth mapping :T\mathcal{R}:T\mathcal{M}\rightarrow\mathcal{M} with the following properties. Let x:Tx\mathcal{R}_{x}:T_{x}\mathcal{M}\rightarrow\mathcal{M} be the restriction of \mathcal{R} at xx. It satisfies

  • \mathcal{R}_{x}(0_{x})=x, where 0_{x} is the zero element of T_{x}\mathcal{M},

  • d\mathcal{R}_{x}(0_{x})=id_{T_{x}\mathcal{M}}, where id_{T_{x}\mathcal{M}} is the identity mapping on T_{x}\mathcal{M}.

We have the following Lipschitz-type inequalities on the retraction on the compact submanifold.

Proposition 2.2 ([7, Appendix B]).

Let \mathcal{R} be a retraction operator on a compact submanifold \mathcal{M}. Then, there exist two positive constants α,β\alpha,\beta such that for all xx\in\mathcal{M} and all uTxu\in T_{x}\mathcal{M}, we have

(2.1) {x(u)xαu,x(u)xuβu2.\left\{\begin{aligned} \|\mathcal{R}_{x}(u)-x\|&\leq\alpha\|u\|,\\ \|\mathcal{R}_{x}(u)-x-u\|&\leq\beta\|u\|^{2}.\end{aligned}\right.
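For illustration, the following minimal Python sketch (not part of the paper's analysis) numerically estimates the constants \alpha and \beta in (2.1) for the projection-like retraction \mathcal{R}_{x}(u)=(x+u)/\|x+u\| on the unit sphere; the retraction choice, the dimension, and the sampling scheme are our own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

def retract_sphere(x, u):
    # Projection-like retraction on the unit sphere: R_x(u) = (x + u) / ||x + u||.
    v = x + u
    return v / np.linalg.norm(v)

alpha_est, beta_est = 0.0, 0.0
for _ in range(1000):
    x = rng.standard_normal(n)
    x /= np.linalg.norm(x)
    u = rng.standard_normal(n)
    u -= x * (x @ u)                                  # project onto the tangent space T_x M
    u *= rng.uniform(0.01, 1.0) / np.linalg.norm(u)   # random tangent vector of moderate norm
    Rxu = retract_sphere(x, u)
    nu = np.linalg.norm(u)
    alpha_est = max(alpha_est, np.linalg.norm(Rxu - x) / nu)          # first bound in (2.1)
    beta_est = max(beta_est, np.linalg.norm(Rxu - x - u) / nu ** 2)   # second bound in (2.1)

print(f"empirical alpha ~ {alpha_est:.3f}, empirical beta ~ {beta_est:.3f}")
```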

Another basic ingredient for manifold optimization is the vector transport.

Definition 2.3 (Vector Transport, [1]).

The vector transport 𝒯\mathcal{T} is a smooth mapping

(2.2) TTT:(ηx,ξx)𝒯ηx(ξx)TT\mathcal{M}\oplus T\mathcal{M}\rightarrow T\mathcal{M}:(\eta_{x},\xi_{x})\mapsto\mathcal{T}_{\eta_{x}}(\xi_{x})\in T\mathcal{M}

satisfying the following properties for all xx\in\mathcal{M}:

  • for any ξxTx\xi_{x}\in T_{x}\mathcal{M}, 𝒯0xξx=ξx\mathcal{T}_{0_{x}}\xi_{x}=\xi_{x},

  • 𝒯ηx(aξx+bγx)=a𝒯ηx(ξx)+b𝒯ηx(γx)\mathcal{T}_{\eta_{x}}(a\xi_{x}+b\gamma_{x})=a\mathcal{T}_{\eta_{x}}(\xi_{x})+b\mathcal{T}_{\eta_{x}}(\gamma_{x}).

When there exists \mathcal{R} such that y=x(ηx)y=\mathcal{R}_{x}(\eta_{x}), we write 𝒯xy(ξx)=𝒯ηx(ξx)\mathcal{T}_{x}^{y}(\xi_{x})=\mathcal{T}_{\eta_{x}}(\xi_{x}).

2.2 Moreau envelope and retraction smoothness

We first introduce the concept of the Moreau envelope.

Definition 2.4.

For a proper, convex and lower semicontinuous function h:nh:\mathbb{R}^{n}\rightarrow\mathbb{R}, the Moreau envelope of hh with the parameter μ>0\mu>0 is given by

Mhμ(y):=infzn{h(z)+12μzy2}.M_{h}^{\mu}(y):=\inf_{z\in\mathbb{R}^{n}}\left\{h(z)+\frac{1}{2\mu}\|z-y\|^{2}\right\}.

The following result demonstrates that the Moreau envelope of a convex function is not only continuously differentiable but also possesses a bounded gradient norm.

Proposition 2.5.

[5, Lemma 3.3] Let h be proper, convex, and \ell_{h}-Lipschitz continuous. Then the Moreau envelope M_{h}^{\mu} has a Lipschitz continuous gradient over \mathbb{R}^{n} with constant \mu^{-1}, and the gradient is given by

(2.3) Mhμ(x)=1μ(xproxμh(x))h(proxμh(x)),\nabla M_{h}^{\mu}(x)=\frac{1}{\mu}\left(x-\mathrm{prox}_{\mu h}(x)\right)\in\partial h\left(\mathrm{prox}_{\mu h}(x)\right),

where \partial h denotes the subdifferential of h, and \mathrm{prox}_{\mu h}(x):=\arg\min_{y\in\mathbb{R}^{n}}\{h(y)+\frac{1}{2\mu}\|y-x\|^{2}\} is the proximal operator. Moreover, it holds that

(2.4) Mhμ(x)h,xproxμh(x)μh.\|\nabla M_{h}^{\mu}(x)\|\leq\ell_{h},\;\|x-\mathrm{prox}_{\mu h}(x)\|\leq\mu\ell_{h}.
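As an illustration of Proposition 2.5, the following sketch (our own example, not from the paper) takes h=\lambda\|\cdot\|_{1}, whose proximal operator is the soft-thresholding map, evaluates \nabla M_{h}^{\mu} via (2.3), and numerically checks the two bounds in (2.4) with \ell_{h}=\lambda\sqrt{m}.

```python
import numpy as np

def prox_l1(v, t):
    # Proximal operator of t * ||.||_1 (soft-thresholding).
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def moreau_grad_l1(x, mu, lam):
    # Gradient of the Moreau envelope of h = lam * ||.||_1, following (2.3).
    return (x - prox_l1(x, mu * lam)) / mu

rng = np.random.default_rng(1)
x = rng.standard_normal(100)
mu, lam = 0.1, 0.5
ell_h = lam * np.sqrt(x.size)                  # Lipschitz constant of lam * ||.||_1 in the 2-norm
g = moreau_grad_l1(x, mu, lam)

assert np.linalg.norm(g) <= ell_h + 1e-12                              # first bound in (2.4)
assert np.linalg.norm(x - prox_l1(x, mu * lam)) <= mu * ell_h + 1e-12  # second bound in (2.4)
print("bounds in (2.4) hold for this l1 example")
```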

Next, we provide the definition of retraction smoothness. This concept plays a crucial role in the convergence analysis of the algorithms proposed in the subsequent sections.

Definition 2.6.

[7, Retraction smooth] A function f:\mathcal{M}\rightarrow\mathbb{R} is said to be retraction smooth (retr-smooth for short) with constant \ell and a retraction \mathcal{R} if for all x,y\in\mathcal{M} it holds that

(2.5) f(y)f(x)+gradf(x),η+2η2,f(y)\leq f(x)+\left<\operatorname{grad}f(x),\eta\right>+\frac{\ell}{2}\|\eta\|^{2},

where ηTx\eta\in T_{x}\mathcal{M} and y=x(η)y=\mathcal{R}_{x}(\eta).

Similar to the Lipschitz continuity of gradient in Euclidean space, we give the following extended definition of the Lipschitz continuity of the Riemannian gradient.

Definition 2.7.

The Riemannian gradient \operatorname{grad}f is Lipschitz continuous with respect to a vector transport \mathcal{T} if there exists a positive constant L such that for all x\in\mathcal{M}, \xi\in T_{x}\mathcal{M}, and y=\mathcal{R}_{x}(\xi),

(2.6) gradf(x)𝒯yx(gradf(y))Lξ.\|\operatorname{grad}f(x)-\mathcal{T}_{y}^{x}(\operatorname{grad}f(y))\|\leq L\|\xi\|.

The following lemma shows that the Lipschitz continuity of gradf(x)\operatorname{grad}f(x) can be deduced by the Lipschitz continuity of f(x)\nabla f(x).

Lemma 2.8.

Suppose that \mathcal{M} is a compact submanifold embedded in the Euclidean space. Given x,y\in\mathcal{M} and u\in T_{x}\mathcal{M}, the vector transport \mathcal{T} is defined as \mathcal{T}_{x}^{y}(u):=D\mathcal{R}_{x}[\xi](u), where y=\mathcal{R}_{x}(\xi). Denote \zeta:=\max_{x\in\text{conv}(\mathcal{M})}\|D^{2}\mathcal{R}_{x}(\cdot)\| and G:=\max_{x\in\mathcal{M}}\|\nabla f(x)\|, and let L_{p} be the Lipschitz constant of \mathcal{P}_{T_{x}\mathcal{M}} over x\in\mathcal{M}. If f has a Lipschitz continuous gradient with constant L, i.e., \|\nabla f(x)-\nabla f(y)\|\leq L\|x-y\|, then we have that

(2.7) gradf(x)𝒯yx(gradf(y))((αLp+ζ)G+αL)ξ.\|\operatorname{grad}f(x)-\mathcal{T}_{y}^{x}(\operatorname{grad}f(y))\|\leq((\alpha L_{p}+\zeta)G+\alpha L)\|\xi\|.

Proof 2.9.

It follows from the definition of Riemannian gradient and vector transport that

(2.8) \|\operatorname{grad}f(y)-\mathcal{T}_{x}^{y}(\operatorname{grad}f(x))\|
\leq \|\operatorname{grad}f(y)-\operatorname{grad}f(x)\|+\|\operatorname{grad}f(x)-\mathcal{T}_{x}^{y}(\operatorname{grad}f(x))\|
\leq \|\operatorname{grad}f(y)-\mathcal{P}_{T_{x}\mathcal{M}}(\nabla f(y))\|+\|\mathcal{P}_{T_{x}\mathcal{M}}(\nabla f(y)-\nabla f(x))\|
\quad+\|D\mathcal{R}_{x}(0_{x})[\operatorname{grad}f(x)]-D\mathcal{R}_{x}(\xi)[\operatorname{grad}f(x)]\|
\leq (L_{p}G+L)\|y-x\|+\zeta\|\xi\|\,\|\operatorname{grad}f(x)\|
\leq ((\alpha L_{p}+\zeta)G+\alpha L)\|\xi\|,

where the third inequality utilizes the Lipschitz continuity of \nabla f(x) and of \mathcal{P}_{T_{x}\mathcal{M}} over x\in\mathcal{M}, and the last inequality follows from (2.1) and \|\operatorname{grad}f(x)\|\leq\|\nabla f(x)\|\leq G. The proof is completed.

3 Augmented Lagrangian method

In this section, we present an augmented Lagrangian method to solve (1.1). The subproblem is solved by the Riemannian gradient descent method. We also give the iteration complexity result. Throughout this paper, we make the following assumptions.

Assumption 1.

The following assumptions hold:

  1. 1.

    The manifold \mathcal{M} is a compact Riemannian submanifold embedded in n\mathbb{R}^{n};

  2. 2.

    The function ff is f\ell_{\nabla f}-smooth but not necessarily convex. The function hh is convex and h\ell_{h}-Lipschitz continuous but is not necessarily smooth; φ\varphi is bounded from below.

3.1 KKT condition and stationary point

We say x\in\mathcal{M} is an \epsilon-stationary point of problem (1.1) if there exists \zeta\in\partial h(\mathcal{A}x) such that \left\|\mathcal{P}_{T_{x}\mathcal{M}}\left(\nabla f(x)+\mathcal{A}^{*}\zeta\right)\right\|\leq\epsilon. As shown in [45], no finite-time algorithm can guarantee \epsilon-stationarity in the nonconvex nonsmooth setting. Motivated by the notion of (\delta,\epsilon)-stationarity introduced in [45], we introduce a generalized \epsilon-stationarity for problem (1.1) based on the Lagrangian function.

Since problem (1.1) contains the nonsmooth term, we introduce the auxiliary variable y=𝒜xy=\mathcal{A}x, which results in the following equivalent split problem:

(3.1) minx,ymf(x)+h(y),s.t.𝒜x=y.\min_{x\in\mathcal{M},y\in\mathbb{R}^{m}}f(x)+h(y),\;\mathrm{s.t.}\;\mathcal{A}x=y.

The Lagrangian function l:×m×ml:\mathcal{M}\times\mathbb{R}^{m}\times\mathbb{R}^{m}\rightarrow\mathbb{R} of (3.1) is given by:

(3.2) l(x,y,z)=f(x)+h(y)z,𝒜xy,l(x,y,z)=f(x)+h(y)-\left<z,\mathcal{A}x-y\right>,

where zz is the corresponding Lagrangian multiplier. Then the KKT condition of (1.1) is given as follows [12]: A point xx\in\mathcal{M} satisfies the KKT condition if there exists y,zmy,z\in\mathbb{R}^{m} such that

(3.3) {0=𝒫Tx(f(x)𝒜z),0z+h(y),0=𝒜xy.\left\{\begin{aligned} 0&=\mathcal{P}_{T_{x}\mathcal{M}}\left(\nabla f(x)-\mathcal{A}^{*}z\right),\\ 0&\in z+\partial h(y),\\ 0&=\mathcal{A}x-y.\end{aligned}\right.

Based on the KKT condition, we give the definition of ϵ\epsilon-stationary point for (1.1):

Definition 3.1.

We say xx\in\mathcal{M} is an ϵ\epsilon-stationary point of (1.1) if there exists y,zmy,z\in\mathbb{R}^{m} such that

(3.4) {𝒫Tx(f(x)𝒜z)ϵ,dist(z,h(y))ϵ,𝒜xyϵ.\left\{\begin{aligned} \left\|\mathcal{P}_{T_{x}\mathcal{M}}\left(\nabla f(x)-\mathcal{A}^{*}z\right)\right\|\leq\epsilon,\\ \mathrm{dist}(-z,\partial h(y))\leq\epsilon,\\ \|\mathcal{A}x-y\|\leq\epsilon.\end{aligned}\right.

In other words, (x,y,z)(x,y,z) is an ϵ\epsilon-KKT point pair of (1.1).
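For concreteness, the sketch below computes the three residuals in (3.4) for an illustrative instance in which \mathcal{M} is the Stiefel manifold, \mathcal{A} is the identity, and h=\lambda\|\cdot\|_{1}; the helper names (proj_tangent_stiefel, dist_to_l1_subdiff, grad_f, lam) are hypothetical and not from the paper.

```python
import numpy as np

def proj_tangent_stiefel(X, G):
    # Project G onto the tangent space of the Stiefel manifold at X (X^T X = I).
    S = X.T @ G
    return G - X @ (S + S.T) / 2

def dist_to_l1_subdiff(z, y, lam):
    # Euclidean distance from -z to the subdifferential of lam * ||.||_1 at y.
    t = -z
    d = np.where(y != 0, np.abs(t - lam * np.sign(y)), np.maximum(np.abs(t) - lam, 0.0))
    return np.linalg.norm(d)

def kkt_residuals(X, Y, Z, grad_f, lam):
    # Residuals of the epsilon-KKT system (3.4) when A is the identity and h = lam * ||.||_1.
    r_stat = np.linalg.norm(proj_tangent_stiefel(X, grad_f(X) - Z))  # stationarity
    r_dual = dist_to_l1_subdiff(Z, Y, lam)                           # -z near the subdifferential of h at y
    r_feas = np.linalg.norm(X - Y)                                   # A x = y (here A = I)
    return r_stat, r_dual, r_feas
```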

3.2 Augmented Lagrangian function of (1.1)

Before constructing the augmented Lagrangian function of (1.1), we first focus on the equivalent split problem (3.1). In particular, the augmented Lagrangian function of (3.1) is constructed as follows:

(3.5) 𝕃σ(x,y;z):\displaystyle\mathbb{L}_{\sigma}(x,y;z): =l(x,y,z)+σ2𝒜xy2\displaystyle=l(x,y,z)+\frac{\sigma}{2}\|\mathcal{A}x-y\|^{2}
=f(x)+h(y)z,𝒜xy+σ2𝒜xy2,\displaystyle=f(x)+h(y)-\left<z,\mathcal{A}x-y\right>+\frac{\sigma}{2}\|\mathcal{A}x-y\|^{2},

where \sigma is the penalty parameter. Note that the augmented Lagrangian function (3.5) involves both the variable x and the auxiliary variable y. However, for the original problem (1.1), which only features the variable x, the corresponding augmented Lagrangian function should solely involve x together with the associated multiplier.

An intuitive way of introducing the augmented Lagrangian function associated with (1.1) is to eliminate the auxiliary variables in (3.5), i.e.,

(3.6) σ(x,z):=minym𝕃σ(x,y;z).\mathcal{L}_{\sigma}(x,z):=\min_{y\in\mathbb{R}^{m}}\mathbb{L}_{\sigma}(x,y;z).

Note that the optimal solution yy^{*} of the above minimization problem has the following closed-form solution:

(3.7) y=proxh/σ(𝒜xz/σ).y^{*}=\mathrm{prox}_{h/\sigma}(\mathcal{A}x-z/\sigma).

Therefore, one can obtain the explicit form of σ(x,z)\mathcal{L}_{\sigma}(x,z):

(3.8) \mathcal{L}_{\sigma}(x,z)= f(x)+h(\mathrm{prox}_{h/\sigma}(\mathcal{A}(x)-z/\sigma))+\frac{\sigma}{2}\left\|\mathcal{A}(x)-z/\sigma-\mathrm{prox}_{h/\sigma}(\mathcal{A}(x)-z/\sigma)\right\|_{2}^{2}-\frac{1}{2\sigma}\|z\|_{2}^{2}
= f(x)+M_{h}^{1/\sigma}\left(\mathcal{A}(x)-\frac{1}{\sigma}z\right)-\frac{1}{2\sigma}\|z\|_{2}^{2},

where Mh1/σM_{h}^{1/\sigma} is the Moreau envelope given in Definition 2.4. It is important to emphasize that the augmented Lagrangian function (3.6) only retains xx as the primal variable. The Lagrangian multiplier zz corresponds to the nonsmooth function hh. When comparing σ\mathcal{L}_{\sigma} with 𝕃σ\mathbb{L}_{\sigma}, it becomes evident that σ\mathcal{L}_{\sigma} involves fewer variables. We will also demonstrate that the augmented Lagrangian function (3.6) possesses favorable properties, such as being continuously differentiable. This characteristic proves to be advantageous when it comes to designing and analyzing algorithms [13, 31].

3.3 Algorithm

As shown in [12], a manifold augmented Lagrangian method based on the augmented Lagrangian function (3.5) can be written as

(3.9) \left\{\begin{aligned} (x^{k+1},y^{k+1})&=\arg\min_{x\in\mathcal{M},\,y\in\mathbb{R}^{m}}\mathbb{L}_{\sigma_{k}}(x,y;z^{k}),\\ z^{k+1}&=z^{k}-\sigma_{k}(\mathcal{A}x^{k+1}-y^{k+1}),\\ \sigma_{k+1}&=\sigma_{k}\uparrow.\end{aligned}\right.

By the definition of σ\mathcal{L}_{\sigma} and (3.7), we know that (3.9) can be rewritten as

(3.10) {xk+1=argminxσk(x,zk),yk+1=proxh/σk(𝒜xk+1zk/σk),zk+1=zkσk(𝒜xk+1yk+1),σk+1=σk.\left\{\begin{aligned} x^{k+1}&=\arg\min_{x\in\mathcal{M}}\mathcal{L}_{\sigma_{k}}(x,z^{k}),\\ y^{k+1}&=\mathrm{prox}_{h/\sigma_{k}}(\mathcal{A}x^{k+1}-z^{k}/\sigma_{k}),\\ z^{k+1}&=z^{k}-\sigma_{k}(\mathcal{A}x^{k+1}-y^{k+1}),\\ \sigma_{k+1}&=\sigma_{k}\uparrow.\end{aligned}\right.

We call (3.10) the augmented Lagrangian method based on (3.6) for (1.1). The authors in [12] show that an inexact version of (3.10) converges to a stationary point. However, the iteration complexity result is unknown. Motivated by [38], we design a novel manifold inexact augmented Lagrangian framework, outlined in Algorithm 1, to efficiently find an \epsilon-KKT point of (1.1). The key distinction from (3.10) lies in the update strategy for the Lagrangian multiplier z^{k} and the careful selection of \sigma_{k}. This strategic choice enables us to bound the norm of z^{k} and ensures that \frac{\|z^{k}\|}{\sigma_{k}} converges to zero.

1:  Given: increasing positive sequence {σk}k1\{\sigma_{k}\}_{k\geq 1}, decreasing positive sequence {ϵk}k1\{\epsilon_{k}\}_{k\geq 1}.
2:  Initialize: Primal variable x0x^{0}, dual variable z0=0z^{0}=0, and dual step size β0\beta_{0}. Denote y0=𝒜x0y^{0}=\mathcal{A}x^{0}.
3:  for k=0,1,2,k=0,1,2,\cdots do
4:     Apply subroutine to solve the following subproblem
(3.11) minxσk(x,zk)\min_{x\in\mathcal{M}}\mathcal{L}_{\sigma_{k}}(x,z^{k})
and return xk+1x^{k+1} with two options:
  • Option I: xk+1x^{k+1} satisfies

    (3.12) gradσk(xk+1,zk)ϵk.\|\operatorname{grad}\mathcal{L}_{\sigma_{k}}(x^{k+1},z^{k})\|\leq\epsilon_{k}.
  • Option II: return xk+1x^{k+1} with the fixed iteration number 2k2^{k}.

5:     Update the auxiliary variable:
yk+1=proxh/σk(𝒜xk+1zk/σk).y^{k+1}=\mathrm{prox}_{h/\sigma_{k}}(\mathcal{A}x^{k+1}-z^{k}/\sigma_{k}).
6:     Update the dual step size:
(3.13) βk+1=β0min(𝒜x0y0log22𝒜xk+1yk+1(k+1)2log(k+2),1)\beta_{k+1}=\beta_{0}\min\left(\frac{\|\mathcal{A}x^{0}-y^{0}\|\log^{2}2}{\|\mathcal{A}x^{k+1}-y^{k+1}\|(k+1)^{2}\log(k+2)},1\right)
7:     Update the dual variable:
(3.14) zk+1=zkβk+1(𝒜xk+1yk+1).z^{k+1}=z^{k}-\beta_{k+1}(\mathcal{A}x^{k+1}-y^{k+1}).
8:  end for
Algorithm 1 (ManIAL)

In Algorithm 1, we introduce the dual step size \beta_{k}, which is used in the update of the dual variable z^{k+1} (also called the Lagrangian multiplier). The update rule (3.13) for \beta_{k+1} ensures that the dual variable z^{k+1} remains bounded as k\rightarrow\infty. The detailed derivation is given in the proof of Theorem 3.8. Moreover, for the convenience of checking the termination criteria and for the subsequent convergence analysis, we introduce an auxiliary sequence denoted by y^{k}. The main difference between our algorithm and the one presented in [12] lies in our algorithm’s reliance on a more refined selection of step sizes and improved control over the accuracy of the x-subproblem solutions. Another crucial difference is the stopping criterion for the subproblem. In particular, we provide two options based on either a given residual or a given number of iterations, further enhancing the flexibility of the algorithm. These modifications allow us to derive results pertaining to the iteration complexity of the algorithm.
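The dual update in Lines 6-7 of Algorithm 1 can be stated in a few lines of code. The sketch below implements (3.13) and (3.14); the small safeguard against division by zero is our own addition.

```python
import numpy as np

def manial_dual_update(z, r_new, r0_norm, beta0, k):
    """One ManIAL dual update: step size (3.13), then multiplier step (3.14).

    z       : current multiplier z^k
    r_new   : residual vector A x^{k+1} - y^{k+1}
    r0_norm : ||A x^0 - y^0||, stored at initialization
    beta0   : initial dual step size beta_0
    k       : outer iteration counter
    """
    r_norm = np.linalg.norm(r_new)
    denom = max(r_norm * (k + 1) ** 2 * np.log(k + 2), 1e-16)  # safeguard (our addition)
    beta = beta0 * min(r0_norm * np.log(2) ** 2 / denom, 1.0)  # (3.13)
    z_new = z - beta * r_new                                   # (3.14)
    return beta, z_new
```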

The function σ(x,z)\mathcal{L}_{\sigma}(x,z) with respect to xx exhibits favorable properties, such as continuous differentiability, enabling a broader range of choices in selecting subproblem-solving algorithms. Let us denote ψk(x):=σk(x,zk)\psi_{k}(x):=\mathcal{L}_{\sigma_{k}}(x,z^{k}). By the property of the proximal mapping and the Moreau envelope in Proposition 2.5, one can readily conclude that ψk(x)\psi_{k}(x) is continuously differentiable and

(3.15) ψk(x)\displaystyle\nabla\psi_{k}(x) =f(x)+σk𝒜(𝒜x1σkzkproxh/σk(𝒜(x)1σkzk)).\displaystyle=\nabla f(x)+\sigma_{k}\mathcal{A}^{*}\left(\mathcal{A}x-\frac{1}{\sigma_{k}}z^{k}-\mbox{prox}_{h/\sigma_{k}}(\mathcal{A}(x)-\frac{1}{\sigma_{k}}z^{k})\right).

The main computational cost of Algorithm 1 lies in solving the subproblem (3.11). In the next subsection, we will give a subroutine to solve (3.11).
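As a concrete instance, the following sketch evaluates \psi_{k}(x)=\mathcal{L}_{\sigma_{k}}(x,z^{k}) as in (3.8) and its Euclidean gradient (3.15) for the illustrative choice h=\lambda\|\cdot\|_{1} with a matrix A representing \mathcal{A}; the callables f, grad_f and the parameter lam are assumed to be supplied by the user, and the Riemannian gradient is obtained by projecting the returned Euclidean gradient onto T_{x}\mathcal{M}.

```python
import numpy as np

def prox_l1(v, t):
    # Proximal operator of t * ||.||_1 (soft-thresholding).
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def psi_value_and_grad(x, z, sigma, A, f, grad_f, lam):
    """Value of psi_k(x) as in (3.8) and its Euclidean gradient as in (3.15),
    specialized to h = lam * ||.||_1 and a matrix A (illustrative choices)."""
    u = A @ x - z / sigma
    p = prox_l1(u, lam / sigma)                      # prox_{h/sigma}(A x - z/sigma), cf. (3.7)
    moreau = lam * np.abs(p).sum() + sigma / 2 * np.linalg.norm(p - u) ** 2
    val = f(x) + moreau - np.linalg.norm(z) ** 2 / (2 * sigma)
    grad = grad_f(x) + sigma * A.T @ (u - p)         # Euclidean gradient (3.15)
    return val, grad
```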

3.4 Subproblem solver

In this subsection, we give a subroutine to find each xk+1x^{k+1} in Algorithm 1. We first present a general algorithm for smooth problems on Riemannian manifolds, and then try to apply the algorithm to solve the subproblem (3.11) in Algorithm 1 by checking the related condition. Let us focus on the following smooth problem on the Riemannian manifold:

(3.16) minxψ(x),\min_{x\in\mathcal{M}}\psi(x),

where \mathcal{M} is a Riemannian submanifold embedded in Euclidean space. Denote ψmin:=infxψ(x)\psi_{\min}:=\inf_{x\in\mathcal{M}}\psi(x). We assume that ψ\psi is retr-smooth with LL and \mathcal{R}, i.e.,

(3.17) ψ(x(η))ψ(x)+gradψ(x),η+L2η2,x,ηTx.\psi(\mathcal{R}_{x}(\eta))\leq\psi(x)+\langle\operatorname{grad}\psi(x),\eta\rangle+\frac{L}{2}\|\eta\|^{2},~{}\forall x\in\mathcal{M},\eta\in T_{x}\mathcal{M}.
1:  Given: x0x_{0}\in\mathcal{M}, L>0L>0, an integer T>0T>0.
2:  for t=0,1,,Tt=0,1,\cdots,T do
3:     Set the stepsize αt=1L\alpha_{t}=\frac{1}{L}.
4:     Compute the next iterate xt+1x_{t+1}:
(3.18) xt+1=xt(αtgradψ(xt)).x_{t+1}=\mathcal{R}_{x_{t}}(-\alpha_{t}\operatorname{grad}\psi(x_{t})).
5:  end for
Output: x=xtx_{*}=x_{t^{*}}, where t=argmin0tTgradψ(xt)t^{*}=\arg\min_{0\leq t\leq T}\|\operatorname{grad}\psi(x_{t})\|.
Algorithm 2 Riemannian gradient descent method, RGD(ψ,L,T,x0)\text{RGD}(\psi,L,T,x_{0})
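A minimal Python sketch of Algorithm 2 for the case \mathcal{M}=\mathrm{St}(n,p) is given below; the QR-based retraction and the tangent-space projection are standard choices assumed for illustration, and grad_psi denotes a user-supplied Euclidean gradient of \psi.

```python
import numpy as np

def qr_retraction(X, U):
    # QR-based retraction on the Stiefel manifold (one common choice of R).
    Q, R = np.linalg.qr(X + U)
    signs = np.where(np.diag(R) < 0, -1.0, 1.0)   # make the factorization unique
    return Q * signs

def proj_tangent_stiefel(X, G):
    S = X.T @ G
    return G - X @ (S + S.T) / 2

def rgd(grad_psi, L, T, X0):
    """Riemannian gradient descent (Algorithm 2) with constant step 1/L; returns the
    iterate with the smallest Riemannian gradient norm, as in the output step."""
    X, best_X, best_norm = X0, X0, np.inf
    for _ in range(T + 1):                          # t = 0, 1, ..., T
        G = proj_tangent_stiefel(X, grad_psi(X))    # Riemannian gradient
        g = np.linalg.norm(G)
        if g < best_norm:
            best_X, best_norm = X, g
        X = qr_retraction(X, -G / L)                # (3.18) with alpha_t = 1/L
    return best_X, best_norm
```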

We adopt a Riemannian gradient descent method (RGD) [1, 6] to solve problem (3.16). The detailed algorithm is given in Algorithm 2. The following lemma gives the iteration complexity result of Algorithm 2.

Lemma 3.2 (Iteration complexity of RGD).

Suppose that ψ\psi is retr-smooth with constant LL and a retraction operator \mathcal{R}. Let {xt}\{x_{t}\} be the iterative sequence of Algorithm 2. Given an integer TT, it holds that

gradψ(x)2L(ψ(x0)ψmin)T.\|\operatorname{grad}\psi(x_{*})\|\leq\frac{\sqrt{2L(\psi(x_{0})-\psi_{\min})}}{\sqrt{T}}.

Proof 3.3.

It follows from (3.17) and (3.18) that

(3.19) ψ(xt+1)\displaystyle\psi(x_{t+1}) ψ(xt)αtgradψ(xt),gradψ(xt)+αt2L2gradψ(xt)2\displaystyle\leq\psi(x_{t})-\alpha_{t}\langle\operatorname{grad}\psi(x_{t}),\operatorname{grad}\psi(x_{t})\rangle+\frac{\alpha_{t}^{2}L}{2}\|\operatorname{grad}\psi(x_{t})\|^{2}
ψ(xt)12Lgradψ(xt)2.\displaystyle\leq\psi(x_{t})-\frac{1}{2L}\|\operatorname{grad}\psi(x_{t})\|^{2}.

By summing both sides of the above inequality over t=0,\cdots,T, we get

(3.20) \psi(x_{T+1})\leq\psi(x_{0})-\frac{1}{2L}\sum_{t=0}^{T}\|\operatorname{grad}\psi(x_{t})\|^{2}.

This implies that

(3.21) \|\operatorname{grad}\psi(x_{*})\|=\min_{0\leq t\leq T}\|\operatorname{grad}\psi(x_{t})\|\leq\sqrt{\frac{2L(\psi(x_{0})-\psi_{\min})}{T+1}}\leq\frac{\sqrt{2L(\psi(x_{0})-\psi_{\min})}}{\sqrt{T}}.

We complete the proof.

To apply Algorithm 2 for computing xk+1x^{k+1} in Algorithm 1, it is crucial to demonstrate that ψk(x)\psi_{k}(x) is retr-smooth, meaning it satisfies (3.17). The following lemma provides the essential insight.

Lemma 3.4.

Suppose that Assumption 1 holds. Then, for ψk\psi_{k}, there exists finite GG such that ψk(x)G\|\nabla\psi_{k}(x)\|\leq G for all xx\in\mathcal{M}, and ψk\psi_{k} is retr-smooth in the sense that

(3.22) ψk(x(η))ψk(x)+η,gradψk(x)+Lk2η2\psi_{k}(\mathcal{R}_{x}(\eta))\leq\psi_{k}(x)+\left<\eta,\operatorname{grad}\psi_{k}(x)\right>+\frac{L_{k}}{2}\|\eta\|^{2}

for all ηTx\eta\in T_{x}\mathcal{M}, where Lk=α2(f+σk𝒜2)+2Gβ.L_{k}=\alpha^{2}(\ell_{\nabla f}+\sigma_{k}\|\mathcal{A}\|^{2})+2G\beta.

Proof 3.5.

By Proposition 2.5, one can show that \nabla\psi_{k} is Lipschitz continuous with constant \ell_{\nabla f}+\sigma_{k}\|\mathcal{A}\|^{2}. Since \nabla f is continuous on the compact manifold \mathcal{M} and, by (2.4), \left\|\sigma_{k}\mathcal{A}^{*}\left(\mathcal{A}x-z^{k}/\sigma_{k}-\mathrm{prox}_{h/\sigma_{k}}(\mathcal{A}x-z^{k}/\sigma_{k})\right)\right\|\leq\|\mathcal{A}\|\ell_{h}, there exists G>0 such that \|\nabla\psi_{k}(x)\|\leq G for all x\in\mathcal{M} and any k>0. The proof is completed by combining (2.1) with Lemma 2.7 in [7].

With Lemmas 3.2 and 3.4, we can apply Algorithm 2 to find each xk+1x^{k+1}. The lemma below gives the inner iteration complexity for the kk-th outer iteration of Algorithm 1.

Lemma 3.6.

Suppose that Assumption 1 holds. Let (xk,yk,zk)(x^{k},y^{k},z^{k}) be the kk-iterate generated in Algorithm 1. We can find xk+1x^{k+1} with Option I by the following call

(3.23) xk+1=RGD(ψk,Lk,Tk,xk),x^{k+1}=RGD(\psi_{k},L_{k},T_{k},x^{k}),

where Lk=α2(f+σk𝒜2)+2Gβ,Tk=Lk(ψk(xk)ψk,min)ϵk2L_{k}=\alpha^{2}(\ell_{\nabla f}+\sigma_{k}\|\mathcal{A}\|^{2})+2G\beta,~{}~{}T_{k}=\frac{L_{k}(\psi_{k}(x^{k})-\psi_{k,\min})}{\epsilon_{k}^{2}}, and ψk,min=infxψk(x)\psi_{k,\min}=\inf_{x\in\mathcal{M}}\psi_{k}(x). Moreover, we can also find xk+1x^{k+1} with Option II by the following call

(3.24) xk+1=RGD(ψk,Lk,2k,xk),x^{k+1}=RGD(\psi_{k},L_{k},2^{k},x^{k}),

This implies that

(3.25) gradψk(xk+1)Lk(ψk(xk)ψk,min)2k.\|\operatorname{grad}\psi_{k}(x^{k+1})\|\leq\frac{\sqrt{L_{k}(\psi_{k}(x^{k})-\psi_{k,\min})}}{\sqrt{2^{k}}}.

Proof 3.7.

For Option I, by Lemma 3.2, we find x^{k+1} satisfying

(3.26) \|\operatorname{grad}\psi_{k}(x^{k+1})\|\leq\frac{\sqrt{L_{k}(\psi_{k}(x^{k})-\psi_{k,\min})}}{\sqrt{T_{k}}}\leq\epsilon_{k},

where the second inequality holds since T_{k}=\frac{L_{k}(\psi_{k}(x^{k})-\psi_{k,\min})}{\epsilon_{k}^{2}}. For Option II, the bound (3.25) directly follows from Lemma 3.2.

3.5 Convergence analysis of ManIAL

We have shown the oracle complexity for solving the subproblems of ManIAL. Now, let us first investigate the outer iteration complexity and conclude this subsection with total oracle complexity.

3.5.1 Outer iteration complexity of ManIAL

Without specifying a subroutine to obtain xk+1x^{k+1}, we establish the outer iteration complexity result of Algorithm 1 in the following lemma.

Theorem 3.8 (Iteration complexity of ManIAL with Option I).

Suppose that Assumption 1 holds. For b>1b>1, if σk=bk\sigma_{k}=b^{k} and ϵk=1/σk\epsilon_{k}=1/\sigma_{k}, given ϵ>0\epsilon>0, then Algorithm 1 with Option I needs at most K:=logb(h+zmax+1ϵ)K:=\log_{b}\left(\frac{\ell_{h}+\|z_{\max}\|+1}{\epsilon}\right) iterations to produce an ϵ\epsilon-KKT solution pair (xK+1,yK+1,z¯K+1)(x^{K+1},y^{K+1},\bar{z}^{K+1}) of (3.4), where

(3.27) z¯k+1:=zkσk(𝒜xk+1yk+1),k0,zmax:=β0π26𝒜x0y0.\bar{z}^{k+1}:=z^{k}-\sigma_{k}(\mathcal{A}x^{k+1}-y^{k+1}),\forall k\geq 0,~{}~{}z_{\max}:=\frac{\beta_{0}\pi^{2}}{6}\|\mathcal{A}x^{0}-y^{0}\|.

Proof 3.9.

It follows from the update rule of xk+1x^{k+1} in Algorithm 1 with Option I and (3.15) that

(3.28) 𝒫Tx(f(xk+1)+σk𝒜(𝒜xk+1zk/σkproxh/σk(𝒜xk+1zk/σk)))ϵk.\left\|\mathcal{P}_{T_{x}\mathcal{M}}\left(\nabla f(x^{k+1})+\sigma_{k}\mathcal{A}^{*}\left(\mathcal{A}x^{k+1}-z^{k}/\sigma_{k}-\mathrm{prox}_{h/\sigma_{k}}(\mathcal{A}x^{k+1}-z^{k}/\sigma_{k})\right)\right)\right\|\leq\epsilon_{k}.

Moreover, the update rule of yk+1y^{k+1} implies that

(3.29) 0zkσk(𝒜xk+1yk+1)+h(yk+1).0\in z^{k}-\sigma_{k}(\mathcal{A}x^{k+1}-y^{k+1})+\partial h(y^{k+1}).

Combining (3.28), (3.29) and the definition of z¯K+1\bar{z}^{K+1} in (3.27) yields

(3.30) {𝒫Txk+1(f(xk+1)+𝒜z¯k+1)ϵk,dist(z¯k+1,h(yk+1))=0.\left\{\begin{aligned} \left\|\mathcal{P}_{T_{x^{k+1}}\mathcal{M}}\left(\nabla f(x^{k+1})+\mathcal{A}^{*}\bar{z}^{k+1}\right)\right\|\leq\epsilon_{k},\\ \mathrm{dist}(-\bar{z}^{k+1},\partial h(y^{k+1}))=0.\end{aligned}\right.

Before bounding the feasibility condition 𝒜xk+1yk+1\|\mathcal{A}x^{k+1}-y^{k+1}\|, we first give a uniform upper bound of the dual variable zk+1z^{k+1}. By (3.13), (3.14), and z0=0z^{0}=0, we have that for any k>1k>1,

(3.31) zk+1\displaystyle\|z^{k+1}\| l=1k+1βl𝒜xlyll=1βl𝒜xlyl\displaystyle\leq\sum_{l=1}^{k+1}\beta_{l}\|\mathcal{A}x^{l}-y^{l}\|\leq\sum_{l=1}^{\infty}\beta_{l}\|\mathcal{A}x^{l}-y^{l}\|
β0𝒜x0y0log22l=11(l+1)2log(l+2)\displaystyle\leq\beta_{0}\|\mathcal{A}x^{0}-y^{0}\|\log^{2}2\sum_{l=1}^{\infty}\frac{1}{(l+1)^{2}\log(l+2)}
β0π26𝒜x0y0=zmax,\displaystyle\leq\frac{\beta_{0}\pi^{2}}{6}\|\mathcal{A}x^{0}-y^{0}\|=z_{\max},

where the last inequality utilizes \log(l+2)>1 for l\geq 1, \log^{2}2<1, and \sum_{l=1}^{\infty}\frac{1}{(l+1)^{2}}\leq\sum_{l=1}^{\infty}\frac{1}{l^{2}}=\frac{\pi^{2}}{6}. For the feasibility bound \|\mathcal{A}x^{k+1}-y^{k+1}\|, we have that for all k\geq 1,

(3.32) 𝒜xk+1yk+1\displaystyle\|\mathcal{A}x^{k+1}-y^{k+1}\| 𝒜xk+1zk/σkyk+1+1σkzk\displaystyle\leq\|\mathcal{A}x^{k+1}-z^{k}/\sigma_{k}-y^{k+1}\|+\frac{1}{\sigma_{k}}\|z^{k}\|
h+zkσkh+zmaxσk,\displaystyle\leq\frac{\ell_{h}+\|z^{k}\|}{\sigma_{k}}\leq\frac{\ell_{h}+\|z_{\max}\|}{\sigma_{k}},

where the second inequality utilizes (2.4). By the definition of σk,ϵk\sigma_{k},\epsilon_{k} and KK, we have that

(3.33) {𝒫TxK+1(f(xK+1)+𝒜z¯K+1)ϵ,dist(z¯K+1,h(yK+1))=0,𝒜xK+1yK+1ϵ.\left\{\begin{aligned} \left\|\mathcal{P}_{T_{x^{K+1}}\mathcal{M}}\left(\nabla f(x^{K+1})+\mathcal{A}^{*}\bar{z}^{K+1}\right)\right\|&\leq\epsilon,\\ \mathrm{dist}(-\bar{z}^{K+1},\partial h(y^{K+1}))&=0,\\ \|\mathcal{A}x^{K+1}-y^{K+1}\|&\leq\epsilon.\end{aligned}\right.

Therefore, (xK+1,yK+1,z¯K+1)(x^{K+1},y^{K+1},\bar{z}^{K+1}) is an ϵ\epsilon-KKT point of (3.4).

Analogously, we also have the following outer iteration complexity for ManIAL with Option II.

Theorem 3.10 (Iteration complexity of ManIAL with Option II).

Suppose that Assumption 1 holds. Let \sigma_{k}=2^{k/3}. Given \epsilon>0, Algorithm 1 with Option II needs at most

K:=log2((2c1+h+zmax)3ϵ3)K:=\log_{2}\left(\frac{(\sqrt{2c_{1}}+\ell_{h}+z_{\max})^{3}}{\epsilon^{3}}\right)

iterations to produce an \epsilon-KKT point pair (x^{K+1},y^{K+1},\bar{z}^{K+1}), where c_{1} is defined in the proof, and \bar{z}^{K+1} and z_{\max} are defined in (3.27).

Proof 3.11.

Similar to Theorem 3.8, for any kk\in\mathbb{N}, we have that

(3.34) dist(z¯k+1,h(yk+1))=0,𝒜xk+1yk+1\displaystyle\mathrm{dist}(-\bar{z}^{k+1},\partial h(y^{k+1}))=0,~{}~{}\|\mathcal{A}x^{k+1}-y^{k+1}\| h+zmaxσk.\displaystyle\leq\frac{\ell_{h}+\|z_{\max}\|}{\sigma_{k}}.

Moreover, it follows from Lemma 3.6 that

(3.35) 𝒫Txk+1(f(xk+1)+𝒜z¯k+1)\displaystyle\left\|\mathcal{P}_{T_{x^{k+1}}\mathcal{M}}\left(\nabla f(x^{k+1})+\mathcal{A}^{*}\bar{z}^{k+1}\right)\right\| Lk(ψk(xk)ψk,min)2k.\displaystyle\leq\frac{\sqrt{L_{k}(\psi_{k}(x_{k})-\psi_{k,\min})}}{\sqrt{2^{k}}}.

Next, we show that ψk(xk)ψk,min\psi_{k}(x_{k})-\psi_{k,\min} is bounded for any k1k\geq 1. By the properties of the Moreau envelope, we have that

(3.36) ψk(x)f(x)+h(𝒜xzk/σk)ψk(x)+h2+zmax22σ0.\psi_{k}(x)\leq f(x)+h(\mathcal{A}x-z^{k}/\sigma_{k})\leq\psi_{k}(x)+\frac{\ell_{h}^{2}+\|z_{\max}\|^{2}}{2\sigma_{0}}.

Since \mathcal{M} is a compact submanifold, z^{k} is bounded by z_{\max}, and \sigma_{k}\geq\sigma_{0} is bounded below, there exist \overline{\phi} and \underline{\phi} such that \underline{\phi}\leq f(x)+h(\mathcal{A}x-z^{k}/\sigma_{k})\leq\overline{\phi} for any x\in\mathcal{M} and k\geq 1. Therefore, we have that

(3.37) ψk(xk)ψk,minϕ¯ϕ¯+h2+zmax22σ0.\psi_{k}(x^{k})-\psi_{k,\min}\leq\overline{\phi}-\underline{\phi}+\frac{\ell_{h}^{2}+\|z_{\max}\|^{2}}{2\sigma_{0}}.

Let us denote c0=ϕ¯ϕ¯+h2+zmax22σ0,c1=c0α2𝒜2,c2=c0(α2f+2Gβ)c_{0}=\overline{\phi}-\underline{\phi}+\frac{\ell_{h}^{2}+\|z_{\max}\|^{2}}{2\sigma_{0}},c_{1}=c_{0}\alpha^{2}\|\mathcal{A}\|^{2},c_{2}=c_{0}(\alpha^{2}\ell_{\nabla f}+2G\beta). By the fact that σk=2k/3\sigma_{k}=2^{k/3}, we have

(3.38) Lk(ψk(xk)ψk,min)\displaystyle L_{k}(\psi_{k}(x_{k})-\psi_{k,\min})
=\displaystyle= (α2(f+σk𝒜2)+2Gβ)(ψk(xk)ψk,min)\displaystyle(\alpha^{2}(\ell_{\nabla f}+\sigma_{k}\|\mathcal{A}\|^{2})+2G\beta)(\psi_{k}(x_{k})-\psi_{k,\min})
=\displaystyle= c12k/3+c2.\displaystyle c_{1}2^{k/3}+c_{2}.

Combining with (3.34) and (3.35) yields

(3.39) {𝒫Txk+1(f(xk+1)+𝒜z¯k+1)c12k/3+c22k/2,𝒜xk+1yk+1h+zmax2k/3,dist(z¯k+1,h(yk+1))=0.\left\{\begin{aligned} \left\|\mathcal{P}_{T_{x^{k+1}}\mathcal{M}}\left(\nabla f(x^{k+1})+\mathcal{A}^{*}\bar{z}^{k+1}\right)\right\|&\leq\frac{\sqrt{c_{1}2^{k/3}+c_{2}}}{2^{k/2}},\\ \|\mathcal{A}x^{k+1}-y^{k+1}\|&\leq\frac{\ell_{h}+\|z_{\max}\|}{2^{k/3}},\\ \mathrm{dist}(-\bar{z}^{k+1},\partial h(y^{k+1}))&=0.\end{aligned}\right.

Without loss of generality, we assume that c_{1}2^{k/3}\geq c_{2}. This implies that \frac{\sqrt{c_{1}2^{k/3}+c_{2}}}{2^{k/2}}\leq\frac{\sqrt{2c_{1}}}{2^{k/3}}. Letting K=\log_{2}\left(\frac{(\sqrt{2c_{1}}+\ell_{h}+z_{\max})^{3}}{\epsilon^{3}}\right), one can show that

(3.40) {𝒫TxK+1(f(xK+1)+𝒜z¯K+1)ϵ,𝒜xK+1yK+1ϵ,dist(z¯K+1,h(yK+1))=0.\left\{\begin{aligned} \left\|\mathcal{P}_{T_{x^{K+1}}\mathcal{M}}\left(\nabla f(x^{K+1})+\mathcal{A}^{*}\bar{z}^{K+1}\right)\right\|&\leq\epsilon,\\ \|\mathcal{A}x^{K+1}-y^{K+1}\|&\leq\epsilon,\\ \mathrm{dist}(-\bar{z}^{K+1},\partial h(y^{K+1}))&=0.\end{aligned}\right.

This implies that (xK+1,yK+1,z¯K+1)(x^{K+1},y^{K+1},\bar{z}^{K+1}) is an ϵ\epsilon-KKT point pair.

3.5.2 Overall oracle complexity of ManIAL

Before giving the overall oracle complexity of ManIAL, we give the definition of a first-order oracle for (1.1).

Definition 3.12 (first-order oracle).

For problem (1.1), a first-order oracle can be defined as follows: compute the Euclidean gradient f(x)\nabla f(x), the proximal operator proxh(x)\operatorname{prox}_{h}(x), and the retraction operator \mathcal{R}.

Now, let us combine the inner oracle complexity and the outer iteration complexity to derive the overall oracle complexity of ManIAL.

Theorem 3.13 (Oracle complexity of ManIAL).

Suppose that Assumption 1 holds. We use Algorithm 2 as the subroutine to solve the subproblems in ManIAL. The following holds:

  • (a)

    If we set σk=bk\sigma_{k}=b^{k} for some b>1b>1 and ϵk=1/σk\epsilon_{k}=1/\sigma_{k}, then ManIAL with Option I finds an ϵ\epsilon-KKT point, after at most 𝒪~(ϵ3)\tilde{\mathcal{O}}(\epsilon^{-3}) calls to the first-order oracle.

  • (b)

    If we set \sigma_{k}=2^{k/3}, then ManIAL with Option II finds an \epsilon-KKT point after at most \mathcal{O}(\epsilon^{-3}) calls to the first-order oracle.

Proof 3.14.

Let KK denote the number of (outer) iterations of Algorithm 1 to reach the accuracy ϵ\epsilon. Let us show (a) first. It follows from Theorem 3.8 that K=logb(h+zmax+1ϵ)K=\log_{b}\left(\frac{\ell_{h}+\|z_{\max}\|+1}{\epsilon}\right). By Lemma 3.6, in the kk-th iterate, one needs at most Lk(ψk(xk)ψk,min)ϵk2\frac{L_{k}(\psi_{k}(x^{k})-\psi_{k,\min})}{\epsilon_{k}^{2}} iterations of Algorithm 2. Therefore, one can bound TT:

(3.41) T=\displaystyle T= k=1KLk(ψk(xk)ψk,min)ϵk2\displaystyle\sum_{k=1}^{K}\frac{L_{k}(\psi_{k}(x^{k})-\psi_{k,\min})}{\epsilon_{k}^{2}}
\displaystyle\leq k=1Kc1σk+c2ϵk2k=1K(c1ϵk3+c2ϵk2)\displaystyle\sum_{k=1}^{K}\frac{c_{1}\sigma_{k}+c_{2}}{\epsilon_{k}^{2}}\leq\sum_{k=1}^{K}\left(\frac{c_{1}}{\epsilon_{k}^{3}}+\frac{c_{2}}{\epsilon_{k}^{2}}\right)
\displaystyle\leq K(c1ϵ3+c2ϵ2)\displaystyle K\left(\frac{c_{1}}{\epsilon^{3}}+\frac{c_{2}}{\epsilon^{2}}\right)
\displaystyle\leq logb(h+zmax+1ϵ)(c1ϵ3+c2ϵ2),\displaystyle\log_{b}\left(\frac{\ell_{h}+\|z_{\max}\|+1}{\epsilon}\right)\left(\frac{c_{1}}{\epsilon^{3}}+\frac{c_{2}}{\epsilon^{2}}\right),

where the first inequality follows from (3.38). Secondly, it follows from Theorem 3.10 that K=log2((2c1+h+zmax)3ϵ3)K=\log_{2}\left(\frac{(\sqrt{2c_{1}}+\ell_{h}+z_{\max})^{3}}{\epsilon^{3}}\right). Then we have

(3.42) T=\displaystyle T= k=1K2k=k=1K(2k+12k)\displaystyle\sum_{k=1}^{K}2^{k}=\sum_{k=1}^{K}(2^{k+1}-2^{k})
=\displaystyle= 2K+12=2((2c1+h+zmax)3ϵ31).\displaystyle 2^{K+1}-2=2\left(\frac{\left(\sqrt{2c_{1}}+\ell_{h}+z_{\max}\right)^{3}}{\epsilon^{3}}-1\right).

The proof is completed.

It should be emphasized that ManIAL with Option I, as well as existing works such as [34], achieves an oracle complexity of \mathcal{O}(\epsilon^{-3}) up to a logarithmic factor. In contrast, ManIAL with Option II achieves an oracle complexity of \mathcal{O}(\epsilon^{-3}) without the logarithmic factor. In that sense, ManIAL with Option II achieves a better oracle complexity result.

4 Stochastic augmented Lagrangian method

In this section, we focus on the case that ff in problem (1.1) has the expectation form, i.e., f(x)=𝔼ξ𝒟[f(x,ξ)]f(x)=\mathbb{E}_{\xi\in\mathcal{D}}[f(x,\xi)]. In particular, we consider the following optimization problem on manifolds:

(4.1) minxF(x):=𝔼ξ𝒟[f(x,ξ)]+h(𝒜x).\min_{x\in\mathcal{M}}F(x):=\mathbb{E}_{\xi\in\mathcal{D}}[f(x,\xi)]+h(\mathcal{A}x).

Here, we assume that the function hh is convex and h\ell_{h}-Lipschitz continuous, f(x,ξ)\nabla f(x,\xi) is f\ell_{\nabla f}-Lipschitz continuous. The following assumptions are made for the stochastic gradient f(x,ξ)\nabla f(x,\xi):

Assumption 2.

For any x, the algorithm generates a sample \xi\sim\mathcal{D} and returns a stochastic gradient \nabla f(x,\xi). There exists a parameter \delta>0 such that

(4.2) 𝔼ξ[f(x,ξ)]=f(x),\displaystyle\mathbb{E}_{\xi}\left[\nabla f(x,\xi)\right]=\nabla f(x),
(4.3) 𝔼ξ[f(x,ξ)f(x)2]δ2.\displaystyle\mathbb{E}_{\xi}\left[\|\nabla f(x,\xi)-\nabla f(x)\|^{2}\right]\leq\delta^{2}.

Moreover, f(x,ξ)\nabla f(x,\xi) is f\ell_{\nabla f}-Lipschitz continuous.

Similarly, we give the definition of ϵ\epsilon-stationary point for problem (4.1):

Definition 4.1.

We say xx\in\mathcal{M} is an ϵ\epsilon-stationary point of (4.1) if there exists y,zmy,z\in\mathbb{R}^{m} such that

(4.4) {𝔼ξ[𝒫Tx(f(x,ξ)𝒜z)]ϵ,dist(z,h(y))ϵ,𝒜xyϵ.\left\{\begin{aligned} \mathbb{E}_{\xi}\left[\left\|\mathcal{P}_{T_{x}\mathcal{M}}\left(\nabla f(x,\xi)-\mathcal{A}^{*}z\right)\right\|\right]\leq\epsilon,\\ \mathrm{dist}(-z,\partial h(y))\leq\epsilon,\\ \|\mathcal{A}x-y\|\leq\epsilon.\end{aligned}\right.

In other words, (x,y,z)(x,y,z) is an ϵ\epsilon-KKT point pair of (4.1).

4.1 Algorithm

We define the augmented Lagrangian function of (4.1):

(4.5) σ(x,z):\displaystyle\mathcal{L}_{\sigma}(x,z): =minym{𝔼ξ[f(x,ξ)]+h(y)z,Axy+σ2Axy2}\displaystyle=\min_{y\in\mathbb{R}^{m}}\left\{\mathbb{E}_{\xi}[f(x,\xi)]+h(y)-\langle z,Ax-y\rangle+\frac{\sigma}{2}\|Ax-y\|^{2}\right\}
=𝔼ξ[f(x,ξ)]+Mh1/σ(Axz/σ)12σz2.\displaystyle=\mathbb{E}_{\xi}[f(x,\xi)]+M_{h}^{1/\sigma}(Ax-z/\sigma)-\frac{1}{2\sigma}\|z\|^{2}.

To efficiently find an ϵ\epsilon-stationary point of (4.1), we design a Riemannian stochastic gradient-type method based on the framework of the stochastic manifold inexact augmented Lagrangian method (StoManIAL), which is given in Algorithm 3. To simplify the notation, in the kk-iterate, let us denote

(4.6) ψk(x,ξ):=f(x,ξ)+Mh1/σk(Axzk/σk)12σkzk2,\psi_{k}(x,\xi):=f(x,\xi)+M_{h}^{1/\sigma_{k}}(Ax-z^{k}/\sigma_{k})-\frac{1}{2\sigma_{k}}\|z^{k}\|^{2},

and \psi_{k}(x):=\mathbb{E}_{\xi}[\psi_{k}(x,\xi)]. Under Assumption 2, it can be easily shown that

(4.7) \mathbb{E}_{\xi}\|\nabla\psi_{k}(x,\xi)-\nabla\psi_{k}(x)\|^{2}\leq\delta^{2}.

Moreover, as shown in (3.15), one can derive the Euclidean gradient of ψk(x,ξ)\psi_{k}(x,\xi) as follows

(4.8) ψk(x,ξ)\displaystyle\nabla\psi_{k}(x,\xi) =f(x,ξ)+σk𝒜(𝒜x1σkzkproxh/σk(𝒜(x)1σkzk)).\displaystyle=\nabla f(x,\xi)+\sigma_{k}\mathcal{A}^{*}\left(\mathcal{A}x-\frac{1}{\sigma_{k}}z^{k}-\mbox{prox}_{h/\sigma_{k}}(\mathcal{A}(x)-\frac{1}{\sigma_{k}}z^{k})\right).

Note that, unlike in ManIAL, the condition (4.10) of Option I in StoManIAL cannot be checked in the numerical experiments due to the expectation operation, although its validity can be assured by the convergence rate result of the subroutine, which will be presented in the next subsection. In contrast, the condition in Option II is checkable.

0:  Initial point x^{0}\in\mathcal{M}, y^{0}, z^{0}; \epsilon, \{\sigma_{k}\}_{k}. Set k=0.
1:  while not converged do
2:     Solve the following manifold optimization problem
(4.9) \min_{x\in\mathcal{M}}\mathcal{L}_{\sigma_{k}}(x,z^{k})
and return x^{k+1} with two options:
  • Option I: x^{k+1} satisfies
    (4.10) \mathbb{E}_{\xi}\left[\left\|\operatorname{grad}\psi_{k}(x^{k+1},\xi)\right\|\right]\leq\epsilon_{k}.
  • Option II: return x^{k+1} with the fixed iteration number 2^{k}.
3:     Update the auxiliary variable:
y^{k+1}=\mathrm{prox}_{h/\sigma_{k}}(\mathcal{A}x^{k+1}-z^{k}/\sigma_{k}).
4:     Update the dual step size:
\beta_{k+1}=\beta_{1}\min\left(\frac{\|\mathcal{A}x^{1}-y^{1}\|\log^{2}2}{\|\mathcal{A}x^{k+1}-y^{k+1}\|(k+1)\log^{2}(k+2)},1\right).
5:     Update the dual variable:
(4.11) z^{k+1}=z^{k}-\beta_{k+1}(\mathcal{A}x^{k+1}-y^{k+1}).
6:     Set k=k+1.
7:  end while
Algorithm 3 (StoManIAL)

4.2 Subproblem solver

In this subsection, we present a Riemannian stochastic optimization method to solve the xx-subproblem in Algorithm 3. As will be shown in Lemma 4.3, the function σk(x,zk)\mathcal{L}_{\sigma_{k}}(x,z^{k}) is retr-smooth with respect to xx, and its retr-smoothness parameter depends on σk\sigma_{k}, which, in turn, is ultimately determined by a prescribed tolerance ϵk\epsilon_{k}. To achieve a favorable overall complexity outcome with a lower order, it is essential to employ a subroutine whose complexity result exhibits a low-order dependence on the smoothness parameter σk\sigma_{k}.

A recent work, the momentum-based variance-reduced stochastic gradient method known as Storm [11], has demonstrated efficacy in solving nonconvex smooth stochastic optimization problems in Euclidean space, yielding an overall complexity result of low order. Although a Riemannian version of Storm (RSRM) [17] matches the same overall complexity result, its dependence on the smoothness parameter is too strong for our purpose. Therefore, we introduce a modified Riemannian version of Storm. In the next subsection, we elaborate on its application to determining x^{k+1} within Algorithm 3.

Consider the following general stochastic optimization problem on a Riemannian manifold

(4.12) minx𝔼[ψ(x,ξ)].\min_{x\in\mathcal{M}}\mathbb{E}[\psi(x,\xi)].

Suppose the following conditions hold: for some finite constants L,δL,\delta

(4.13) 𝔼ξ[ψ(x,ξ)]=ψ(x),𝔼ξ[ψ(x,ξ)ψ(x)2]δ2,\displaystyle\mathbb{E}_{\xi}\left[\nabla\psi(x,\xi)\right]=\nabla\psi(x),~{}~{}\mathbb{E}_{\xi}\left[\|\nabla\psi(x,\xi)-\nabla\psi(x)\|^{2}\right]\leq\delta^{2},
(4.14) gradψ(x,ξ)𝒯yxgradψ(y,ξ)Ld,x,y,d=x1(y),\displaystyle\|\operatorname{grad}\psi(x,\xi)-\mathcal{T}_{y}^{x}\operatorname{grad}\psi(y,\xi)\|\leq L\|d\|,~{}~{}\forall x,y\in\mathcal{M},d=\mathcal{R}_{x}^{-1}(y),
(4.15) ψ(y,ξ)ψ(x,ξ)+d,gradψ(x,ξ)+L2d2,x,y,d=x1(y).\displaystyle\psi(y,\xi)\leq\psi(x,\xi)+\langle d,\operatorname{grad}\psi(x,\xi)\rangle+\frac{L}{2}\|d\|^{2},~{}~{}\forall x,y\in\mathcal{M},d=\mathcal{R}_{x}^{-1}(y).

With the above conditions, we give the modified Riemannian version of Storm in Algorithm 4 and the complexity result in Lemma 4.2. The proof is similar to Theorem 1 in [11]. We provide it in the Appendix.

0:  Initial point x1x_{1}, iteration limit TT, parameters κ,w,c\kappa,w,c, sample set {ξt}t1\{\xi_{t}\}_{t\geq 1}.
1:  Compute d_{1}=\operatorname{grad}\psi(x_{1},\xi_{1}) and set G_{1}=\|\operatorname{grad}\psi(x_{1},\xi_{1})\|.
2:  for t=1,\ldots,T do
3:     Update \eta_{t}=\frac{\kappa}{(w+\sum_{i=1}^{t}G_{i}^{2})^{1/3}}
4:     Update xt+1x_{t+1} via xt+1=xt(ηtdt).x_{t+1}=\mathcal{R}_{x_{t}}(-\eta_{t}d_{t}).
5:     Update at+1=cηt2a_{t+1}=c\eta_{t}^{2} and Gt+1=gradψ(xt+1,ξt+1)G_{t+1}=\|\operatorname{grad}\psi(x_{t+1},\xi_{t+1})\|.
6:     Compute dt+1=gradψ(xt+1;ξt+1)+(1at+1)𝒯xtxt+1(dtgradψ(xt;ξt))d_{t+1}=\operatorname{grad}\psi(x_{t+1};\xi_{t+1})+(1-a_{t+1})\mathcal{T}_{x_{t}}^{x_{t+1}}(d_{t}-\operatorname{grad}\psi(x_{t};\xi_{t})).
7:  end for
8:  Output: Return x^\hat{x} uniformly at random from x1,,xTx_{1},\cdots,x_{T}.
Algorithm 4 Riemannian stochastic recursive momentum gradient descent method: RStorm(ψ,L,T,x1)\text{RStorm}(\psi,L,T,x_{1})
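The following Python sketch mirrors Algorithm 4; the callables sample, stoch_grad, retraction, proj_tangent, and transport are placeholders assumed to be supplied by the user (for instance, the Stiefel retraction and projection from the earlier sketches), with stoch_grad(X, xi) returning the Euclidean stochastic gradient \nabla\psi(X,\xi).

```python
import numpy as np

def rstorm(stoch_grad, sample, T, X0, retraction, proj_tangent, transport,
           kappa, w, c, seed=0):
    """A sketch of RStorm (Algorithm 4): recursive-momentum gradient estimates d_t,
    adaptive step sizes eta_t, and a uniformly sampled output among x_1, ..., x_T."""
    rng = np.random.default_rng(seed)
    X = X0
    xi = sample()
    g = proj_tangent(X, stoch_grad(X, xi))          # grad psi(x_1, xi_1)
    d = g                                           # d_1
    G_sq = np.linalg.norm(g) ** 2                   # running sum of G_i^2
    iterates = [X]
    for t in range(1, T + 1):
        eta = kappa / (w + G_sq) ** (1.0 / 3.0)     # eta_t
        X_new = retraction(X, -eta * d)             # x_{t+1} = R_{x_t}(-eta_t d_t)
        a = c * eta ** 2                            # a_{t+1}
        xi_new = sample()
        g_new = proj_tangent(X_new, stoch_grad(X_new, xi_new))   # grad psi(x_{t+1}, xi_{t+1})
        # d_{t+1} = grad psi(x_{t+1}; xi_{t+1}) + (1 - a_{t+1}) T_{x_t}^{x_{t+1}}(d_t - grad psi(x_t; xi_t))
        d = g_new + (1 - a) * transport(X, X_new, d - g)
        G_sq += np.linalg.norm(g_new) ** 2
        X, g = X_new, g_new
        iterates.append(X)
    return iterates[rng.integers(T)]                # uniform over x_1, ..., x_T
```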
Lemma 4.2.

Suppose that the conditions (4.13)-(4.15) hold. Let us denote G=\max_{x\in\mathcal{M}}\|\nabla\psi(x,\xi)\|. For any b>0, we write \kappa=\frac{bG^{\frac{2}{3}}}{L}. Set c=10L^{2}+G^{2}/\left(7L\kappa^{3}\right) and w=\max\left((4L\kappa)^{3},2G^{2},\left(\frac{c\kappa}{4L}\right)^{3}\right). Then, the output \hat{x} of Algorithm 4 satisfies

(4.16) 𝔼[gradψ(x^)]=𝔼[1Tt=1Tgradψ(xt)]w1/62M+2M3/4T+2δ1/3T1/3,\mathbb{E}[\|\operatorname{grad}\psi(\hat{x})\|]=\mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T}\left\|\operatorname{grad}\psi\left(x_{t}\right)\right\|\right]\leq\frac{w^{1/6}\sqrt{2M}+2M^{3/4}}{\sqrt{T}}+\frac{2\delta^{1/3}}{T^{1/3}},

where M=6κ(ψ(x1)ψmin)+w1/3δ22L2κ2+κ2c2L2ln(T+2)M=\frac{6}{\kappa}\left(\psi\left(x_{1}\right)-\psi_{\min}\right)+\frac{w^{1/3}\delta^{2}}{2L^{2}\kappa^{2}}+\frac{\kappa^{2}c^{2}}{L^{2}}\ln(T+2).
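The parameter choices above can be transcribed directly; the following small helper (ours, purely for illustration) computes \kappa, c, and w from G, L, and the free scaling b as prescribed by Lemma 4.2.

```python
def rstorm_parameters(G, L, b=1.0):
    """Parameter choices of Lemma 4.2 (illustrative transcription)."""
    kappa = b * G ** (2.0 / 3.0) / L                        # kappa = b G^{2/3} / L
    c = 10.0 * L ** 2 + G ** 2 / (7.0 * L * kappa ** 3)     # c = 10 L^2 + G^2 / (7 L kappa^3)
    w = max((4.0 * L * kappa) ** 3,                         # w = max{(4 L kappa)^3, 2 G^2, (c kappa / (4L))^3}
            2.0 * G ** 2,
            (c * kappa / (4.0 * L)) ** 3)
    return kappa, c, w
```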

To apply Algorithm 4 to find xk+1x^{k+1}, we need to validate conditions (4.13)-(4.15) for the associated subproblem.

Lemma 4.3.

Suppose that Assumption 1 holds. Then, for ψk\psi_{k}, there exists finite GG such that ψk(x,ξ)G\|\nabla\psi_{k}(x,\xi)\|\leq G for all xx\in\mathcal{M}. Moreover, the function ψk(x,ξ)\psi_{k}(x,\xi) is retr-smooth with constant L^k\hat{L}_{k} and the Riemannian gradient gradψk(x,ξ)\operatorname{grad}\psi_{k}(x,\xi) is Lipschitz continuous with constant L¯k\bar{L}_{k}, where

(4.17) L^k:=α2f+α2𝒜σk+2Gβ,L¯k:=(αLp+ζ)G+αf+𝒜σk,\hat{L}_{k}:=\alpha^{2}\ell_{\nabla f}+\alpha^{2}\|\mathcal{A}\|\sigma_{k}+2G\beta,\;\bar{L}_{k}:=(\alpha L_{p}+\zeta)G+\alpha\ell_{\nabla f}+\|\mathcal{A}\|\sigma_{k},

and LpL_{p} and ζ\zeta are defined in Lemma 2.8. Moreover, under Assumption 2, conditions (4.13)-(4.15) with constants Lk:=max{L^k,L¯k}L_{k}:=\max\{\hat{L}_{k},\bar{L}_{k}\} and δ\delta hold for subproblem (4.9).

Proof 4.4.

We first show (4.17). By Proposition 2.5, one can show that \nabla\psi_{k} is Lipschitz continuous with constant \ell_{\nabla f}+\sigma_{k}\|\mathcal{A}\|^{2}. The retraction smoothness of \psi_{k}(x,\xi) then follows directly from Lemma 3.4, and the Lipschitz continuity of \operatorname{grad}\psi_{k}(x,\xi) follows from Lemma 2.8. These yield conditions (4.14) and (4.15). Under Assumption 2, condition (4.13) follows directly from the definition of \psi_{k}(x) and (4.7). We complete the proof.

As shown in Lemma 4.3, the Lipschitz constant L_{k} increases as the penalty parameter \sigma_{k} increases. Therefore, we analyze how the right-hand side of (4.16) depends on L as L tends to infinity. Noting that \kappa=\mathcal{O}(1/L), c=\mathcal{O}(L^{2}), w=\mathcal{O}(1), and M=\mathcal{O}(L), we can simplify (4.16) using \mathcal{O}-notation:

(4.18) 𝔼[gradψ(x^)]𝒪~(L1/2+L3/4T1/2+1T1/3),\mathbb{E}[\|\operatorname{grad}\psi(\hat{x})\|]\leq\tilde{\mathcal{O}}\left(\frac{L^{1/2}+L^{3/4}}{T^{1/2}}+\frac{1}{T^{1/3}}\right),

where \tilde{\mathcal{O}} hides a logarithmic dependence on T.

With Lemma 4.3, we are able to apply Algorithm 4 to find xk+1x^{k+1}. The lemma below gives the oracle complexity for the kk-th outer iteration of Algorithm 3. This is a direct application of Lemmas 4.2 and 4.3, and we omit the proof.

Lemma 4.5.

Suppose that Assumptions 1 and 2 hold. Let (x^{k},y^{k},z^{k}) be the k-th iterate generated by Algorithm 3. We can find x^{k+1} with Option I by the following call

(4.19) xk+1=RStorm(ψk,Lk,Tk,xk),x^{k+1}=\text{RStorm}(\psi_{k},L_{k},T_{k},x^{k}),

where Lk=max{L^k,L¯k},Tk:=𝒪~(1ϵk3+σk1.5+σkϵk2).L_{k}=\max\{\hat{L}_{k},\bar{L}_{k}\},~{}~{}T_{k}:=\tilde{\mathcal{O}}\left(\frac{1}{\epsilon_{k}^{3}}+\frac{\sigma_{k}^{1.5}+\sigma_{k}}{\epsilon_{k}^{2}}\right). Moreover, we can also find xk+1x^{k+1} with Option II by the following call

(4.20) xk+1=RStorm(ψk,Lk,2k,xk),x^{k+1}=\text{RStorm}(\psi_{k},L_{k},2^{k},x^{k}),

which implies that

(4.21) 𝔼ξ[gradψk(xk+1,ξ)]𝒪~(σk3/4+σk1/22k/2+12k/3).\mathbb{E}_{\xi}\left[\|\operatorname{grad}\psi_{k}(x^{k+1},\xi)\|\right]\leq\tilde{\mathcal{O}}\left(\frac{\sigma_{k}^{3/4}+\sigma_{k}^{1/2}}{2^{k/2}}+\frac{1}{2^{k/3}}\right).
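As a concrete illustration of the schedules in Lemma 4.5, the following sketch (ours, not the authors' code) returns the penalty parameter and the RStorm iteration budget for the k-th outer iteration under the two options; the constants and logarithmic factors hidden in \tilde{\mathcal{O}}(\cdot) are set to one purely for illustration.

```python
import math

def stomanial_schedule(k, option="I", b=2.0):
    """Penalty/budget schedule for the k-th call to RStorm (illustrative only)."""
    if option == "I":
        sigma_k = b ** k                          # sigma_k = b^k
        eps_k = 1.0 / sigma_k                     # eps_k = 1 / sigma_k
        # T_k = O~(1/eps_k^3 + (sigma_k^{1.5} + sigma_k)/eps_k^2), hidden constants dropped
        T_k = math.ceil(1.0 / eps_k ** 3 + (sigma_k ** 1.5 + sigma_k) / eps_k ** 2)
    else:                                         # Option II: fixed, doubling budget
        sigma_k = 2.0 ** (2.0 * k / 7.0)          # sigma_k = 2^{2k/7}
        eps_k = None                              # no explicit tolerance
        T_k = 2 ** k                              # T_k = 2^k inner iterations
    return sigma_k, eps_k, T_k
```

The outer loop of Algorithm 3 then computes x^{k+1}=\text{RStorm}(\psi_{k},L_{k},T_{k},x^{k}) with L_{k} as in Lemma 4.3, followed by the y- and z-updates, which we do not reproduce here.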

4.3 Convergence analysis of StoManIAL

We have shown the oracle complexity for solving the subproblems of StoManIAL. Now, let us first investigate the outer iteration complexity and conclude this subsection with the overall oracle complexity.

4.3.1 Outer iteration complexity of StoManIAL

Without specifying a subroutine to obtain x^{k+1}, we establish the outer iteration complexity of Algorithm 3 in the following theorem. Since the proof is similar to that of Theorem 3.8, we omit it.

Theorem 4.6 (Iteration complexity of StoManIAL with Option I).

Suppose that Assumptions 1 and 2 hold. For b>1b>1, if σk=bk\sigma_{k}=b^{k} and ϵk=1/σk\epsilon_{k}=1/\sigma_{k}, given ϵ>0\epsilon>0, then Algorithm 3 needs at most K:=logb(h+zmax+1ϵ)K:=\log_{b}\left(\frac{\ell_{h}+\|z_{\max}\|+1}{\epsilon}\right) iterations to produce an ϵ\epsilon-KKT solution pair (xK+1,yK+1,z¯K+1)(x^{K+1},y^{K+1},\bar{z}^{K+1}) of (3.4), where

(4.22) z¯k+1:=zkσk(𝒜xk+1yk+1),k0,zmax:=β0π26𝒜x0y0.\bar{z}^{k+1}:=z^{k}-\sigma_{k}(\mathcal{A}x^{k+1}-y^{k+1}),\forall k\geq 0,~{}~{}z_{\max}:=\frac{\beta_{0}\pi^{2}}{6}\|\mathcal{A}x^{0}-y^{0}\|.

Analogous to Theorem 4.6, we also have the following theorem on the iteration complexity of StoManIAL with a prescribed inner iteration budget.

Theorem 4.7 (Iteration complexity of StoManIAL with Option II).

Suppose that Assumptions 1 and 2 hold. Let \sigma_{k}=2^{2k/7}. Given \epsilon>0, StoManIAL with Option II needs at most \mathcal{O}(\log_{2}(\epsilon^{-7/2})) iterations to produce an \epsilon-KKT point pair (x^{K+1},y^{K+1},\bar{z}^{K+1}), where \bar{z}^{K+1} is defined in (4.22).

Proof 4.8.

Similar to Theorem 3.8, for any kk\in\mathbb{N}, we have that

(4.23) dist(z¯k+1,h(yk+1))\displaystyle\mathrm{dist}(-\bar{z}^{k+1},\partial h(y^{k+1})) =0,\displaystyle=0,
𝒜xk+1yk+1\displaystyle\|\mathcal{A}x^{k+1}-y^{k+1}\| h+zmaxσk.\displaystyle\leq\frac{\ell_{h}+\|z_{\max}\|}{\sigma_{k}}.

Moreover, it follows from Lemma 4.5 that

(4.24) 𝔼ξ[𝒫Txk+1(f(xk+1,ξ)+𝒜z¯k+1)]\displaystyle\mathbb{E}_{\xi}\left[\left\|\mathcal{P}_{T_{x^{k+1}}\mathcal{M}}\left(\nabla f(x^{k+1},\xi)+\mathcal{A}^{*}\bar{z}^{k+1}\right)\right\|\right] 𝒪(σk3/4+σk1/22k/2+12k/3).\displaystyle\leq\mathcal{O}\left(\frac{\sigma_{k}^{3/4}+\sigma_{k}^{1/2}}{2^{k/2}}+\frac{1}{2^{k/3}}\right).

Since σk=22k7\sigma_{k}=2^{\frac{2k}{7}}, we have that

(4.25) 𝔼ξ[𝒫Txk+1(f(xk+1,ξ)+𝒜z¯k+1)]𝒪(22k/7+25k/14+2k/3),\displaystyle\mathbb{E}_{\xi}\left[\left\|\mathcal{P}_{T_{x^{k+1}}\mathcal{M}}\left(\nabla f(x^{k+1},\xi)+\mathcal{A}^{*}\bar{z}^{k+1}\right)\right\|\right]\leq\mathcal{O}(2^{-2k/7}+2^{-5k/14}+2^{-k/3}),

and

𝒜xk+1yk+1𝒪(22k/7).\|\mathcal{A}x^{k+1}-y^{k+1}\|\leq\mathcal{O}(2^{-2k/7}).

Letting K=\mathcal{O}(\log_{2}(\epsilon^{-\frac{7}{2}})) so that 2^{-2K/7}=\mathcal{O}(\epsilon), one can show that

(4.26) {𝔼ξ[𝒫TxK+1(f(xK+1,ξ)+𝒜z¯K+1)]𝒪(ϵ),dist(z¯K+1,h(yK+1))=0,𝒜xK+1yK+1𝒪(ϵ),\left\{\begin{aligned} \mathbb{E}_{\xi}\left[\left\|\mathcal{P}_{T_{x^{K+1}}\mathcal{M}}\left(\nabla f(x^{K+1},\xi)+\mathcal{A}^{*}\bar{z}^{K+1}\right)\right\|\right]&\leq\mathcal{O}(\epsilon),\\ \mathrm{dist}(-\bar{z}^{K+1},\partial h(y^{K+1}))&=0,\\ \|\mathcal{A}x^{K+1}-y^{K+1}\|&\leq\mathcal{O}(\epsilon),\end{aligned}\right.

which implies that (xK+1,yK+1,z¯K+1)(x^{K+1},y^{K+1},\bar{z}^{K+1}) is an ϵ\epsilon-KKT point pair.

4.3.2 Overall oracle complexity of StoManIAL

To measure the oracle complexity of our algorithm, we give the definition of a stochastic first-order oracle for (4.1).

Definition 4.9 (stochastic first-order oracle).

For problem (4.1), a stochastic first-order oracle is defined as follows: compute the Euclidean gradient \nabla f(x,\xi) for a given sample \xi\in\mathcal{D}, evaluate the proximal operator \operatorname{prox}_{h}, and evaluate the retraction operator \mathcal{R}.
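To make the accounting used in Theorem 4.10 explicit, one may view the oracle as a thin wrapper that counts the three operations in Definition 4.9; the interface below is a hypothetical sketch of ours, not part of the paper.

```python
class StochasticFirstOrderOracle:
    """Counts the operations of Definition 4.9 (illustrative interface)."""

    def __init__(self, grad_f_xi, prox_h, retraction):
        self._grad_f_xi, self._prox_h, self._retr = grad_f_xi, prox_h, retraction
        self.calls = 0

    def grad(self, x, xi):          # Euclidean gradient for one sample xi
        self.calls += 1
        return self._grad_f_xi(x, xi)

    def prox(self, x):              # proximal operator of h
        self.calls += 1
        return self._prox_h(x)

    def retract(self, x, v):        # retraction R_x(v)
        self.calls += 1
        return self._retr(x, v)
```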

We are now able to establish the overall oracle complexity in the following theorem.

Theorem 4.10 (Oracle complexity of StoManIAL).

Suppose that Assumptions 1 and 2 hold. We choose Algorithm 4 as the subroutine of StoManIAL. The following holds:

  • (a)

    If we set σk=bk\sigma_{k}=b^{k} for some b>1b>1 and ϵk=1/σk\epsilon_{k}=1/\sigma_{k}, then StoManIAL with Option I finds an ϵ\epsilon-KKT point, after at most 𝒪~(ϵ3.5)\tilde{\mathcal{O}}(\epsilon^{-3.5}) calls to the stochastic first-order oracle.

  • (b)

If we set \sigma_{k}=2^{2k/7}, then StoManIAL with Option II finds an \epsilon-KKT point after at most \tilde{\mathcal{O}}(\epsilon^{-3.5}) calls to the stochastic first-order oracle.

Proof 4.11.

Let K denote the number of (outer) iterations of Algorithm 3 needed to reach the accuracy \epsilon. Let us show (a) first. It follows from Theorem 4.6 that K=\log_{b}\left(\frac{\ell_{h}+\|z_{\max}\|}{\sigma_{0}\epsilon}\right)+1. Combining this with Lemma 4.5, one can bound the number T of calls to the stochastic first-order oracle:

(4.27) T=\sum_{k=1}^{K}\tilde{\mathcal{O}}\left(\frac{1}{\epsilon_{k}^{3}}+\frac{\sigma_{k}^{1.5}+\sigma_{k}}{\epsilon_{k}^{2}}\right)=\sum_{k=1}^{K}\tilde{\mathcal{O}}\left(\frac{1}{\epsilon_{k}^{3}}+\frac{1}{\epsilon_{k}^{3.5}}\right)
\displaystyle\leq K𝒪~(1ϵ3+1ϵ3.5)=𝒪~(logb(ϵ1)ϵ3.5).\displaystyle K\tilde{\mathcal{O}}\left(\frac{1}{\epsilon^{3}}+\frac{1}{\epsilon^{3.5}}\right)=\tilde{\mathcal{O}}(\log_{b}(\epsilon^{-1})\epsilon^{-3.5}).

In terms of (b), it follows from Theorem 4.7 and Lemma 4.5 that

(4.28) T=\sum_{k=1}^{K}l_{k}=\sum_{k=1}^{K}2^{k}=2^{K+1}-2\leq 2\cdot 2^{K}=\mathcal{O}\left(2^{\log_{2}(\epsilon^{-7/2})}\right)=\tilde{\mathcal{O}}(\epsilon^{-3.5}).

The proof is completed.

5 Numerical results

In this section, we demonstrate the performance of the proposed methods ManIAL and StoManIAL on sparse principal component analysis and sparse canonical correlation analysis. We use ManIAL-I and ManIAL-II to denote ManIAL with Option I and Option II, respectively. Since the stopping rule of StoManIAL with Option I is not checkable in practice, we only report StoManIAL with Option II. We compare these algorithms with the Riemannian subgradient method (denoted Rsub) in [30], which achieves the previously best-known oracle complexity for solving (4.1). All the tests were performed in MATLAB 2022a on a ThinkPad X1 with 4 cores and 32GB memory. The following relative KKT residual of problem (1.1) is used as the stopping criterion for our ManIAL:

(5.1) error:=max{ηp,ηd,ηC}tol,\text{error}:=\max\left\{\eta_{p},\eta_{d},\eta_{C}\right\}\leq\mbox{tol},

where “tol” is a given accuracy tolerance and

(5.2) \eta_{p}:=\frac{\|\mathcal{A}x^{k}-y^{k}\|}{1+\|\mathcal{A}x^{k}\|+\|y^{k}\|},\quad
\eta_{d}:=\frac{\|\mathcal{P}_{T_{x^{k}}\mathcal{M}}(\nabla f(x^{k})-\mathcal{A}^{*}z^{k})\|}{1+\|\nabla f(x^{k})\|},\quad
\eta_{C}:=\frac{\|z^{k}-\operatorname{prox}_{h^{*}}(z^{k}-\mathcal{A}x^{k})\|}{1+\|z^{k}\|}.

In the following experiments, we will first run our algorithms ManIAL-I and ManIAL-II, and terminate them when either condition (5.1) is satisfied or the maximum iteration steps of 10,000 are reached. The obtained function value of ManIAL-I is denoted as FMF_{M}. For the other algorithms, we terminate them when either the objective function value satisfies F(xk)FM+1010F(x^{k})\leq F_{M}+10^{-10} or the maximum iteration steps of 10,000 are reached.
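For reference, the stopping test (5.1)-(5.2) can be evaluated as in the following sketch (our own illustration, with all problem-specific operators passed in as callables); for instance, when h=\mu\|\cdot\|_{1}, the Moreau identity gives \operatorname{prox}_{h^{*}}(u)=\min(\max(u,-\mu),\mu) entrywise.

```python
import numpy as np

def relative_kkt_residual(x, y, z, A, A_adj, grad_f, proj_tangent, prox_h_star):
    """Relative KKT residual (5.1)-(5.2); all operators are supplied as callables."""
    Ax = A(x)
    eta_p = np.linalg.norm(Ax - y) / (1.0 + np.linalg.norm(Ax) + np.linalg.norm(y))
    gf = grad_f(x)
    eta_d = np.linalg.norm(proj_tangent(x, gf - A_adj(z))) / (1.0 + np.linalg.norm(gf))
    eta_c = np.linalg.norm(z - prox_h_star(z - Ax)) / (1.0 + np.linalg.norm(z))
    return max(eta_p, eta_d, eta_c)
```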

5.1 Sparse principal component analysis

Given a data set {b1,,bm}\{b_{1},\ldots,b_{m}\} where bin×1b_{i}\in\mathbb{R}^{n\times 1}, the sparse principal component analysis (SPCA) problem is

(5.3) minXn×ri=1mbiXXTbi22+μX1,s.t. XTX=Ir,\displaystyle\min_{X\in\mathbb{R}^{n\times r}}\sum_{i=1}^{m}\|b_{i}-XX^{T}b_{i}\|_{2}^{2}+\mu\|X\|_{1},~{}~{}\mbox{s.t. }~{}~{}X^{T}X=I_{r},

where \mu>0 is a regularization parameter. Let B=[b_{1},\cdots,b_{m}]^{T}\in\mathbb{R}^{m\times n}; then problem (5.3) can be rewritten as:

(5.4) minXn×rtr(XTBTBX)+μX1,s.t. XTX=Ir.\displaystyle\min_{X\in\mathbb{R}^{n\times r}}-\mathrm{tr}(X^{T}B^{T}BX)+\mu\|X\|_{1},~{}~{}\mbox{s.t. }~{}~{}X^{T}X=I_{r}.

Here, the feasible set is the Stiefel manifold \texttt{St}(n,r):=\{X\in\mathbb{R}^{n\times r}~{}:~{}X^{\top}X=I_{r}\}. The tangent space of \texttt{St}(n,r) at X is T_{X}\texttt{St}(n,r)=\{\eta\in\mathbb{R}^{n\times r}~{}:~{}X^{\top}\eta+\eta^{\top}X=0\}. Given any U\in\mathbb{R}^{n\times r}, the projection of U onto T_{X}\texttt{St}(n,r) is \mathcal{P}_{T_{X}\texttt{St}(n,r)}(U)=U-X\frac{U^{\top}X+X^{\top}U}{2} [1]. In our experiments, the data matrix B\in\mathbb{R}^{m\times n} is generated by the MATLAB function \texttt{randn}(m,n), so all entries of B follow the standard Gaussian distribution. We then shift the columns of B to have zero mean and normalize them. We use the polar decomposition as the retraction mapping.
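For the Stiefel manifold and the \ell_{1} term in (5.4), the ingredients mentioned above (tangent-space projection, polar retraction, and the proximal operator of \mu\|\cdot\|_{1}) admit the following standard implementations; this is a minimal sketch of ours, not the experimental code.

```python
import numpy as np

def proj_tangent_stiefel(X, U):
    """Projection of U onto T_X St(n, r)."""
    return U - X @ (X.T @ U + U.T @ X) / 2.0

def polar_retraction(X, xi):
    """Polar retraction R_X(xi): the orthogonal factor of X + xi."""
    U, _, Vt = np.linalg.svd(X + xi, full_matrices=False)
    return U @ Vt

def prox_l1(X, t):
    """Proximal operator of t*||.||_1 (entrywise soft-thresholding)."""
    return np.sign(X) * np.maximum(np.abs(X) - t, 0.0)
```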

Figure 1: The performance of our algorithms on SPCA with random data (n=1000). Panels (a)–(c): r=10 with \mu=0.4,0.6,0.8; panels (d)–(f): r=20 with \mu=0.4,0.6,0.8.

We first compare the performance of the proposed algorithms on the SPCA problem with random data, where we set m=5000. For StoManIAL, we partition the samples into 100 subsets, and in each iteration we sample one subset. The tolerance is set to \mathrm{tol}=10^{-8}\times n\times r. Figure 1 presents the results of the four algorithms for fixed n=1000 and varying r=10,20 and \mu=0.4,0.6,0.8. The horizontal axis represents CPU time, while the vertical axis represents the objective function value gap F(x^{k})-F_{M}, where F_{M} is given by ManIAL-I. The results indicate that StoManIAL outperforms the deterministic versions. Moreover, the deterministic versions ManIAL-I and ManIAL-II have similar performance. In conclusion, compared with Rsub, our algorithms achieve better performance.

Next, we conduct experiments on two real datasets: coil100 [36] and mnist [28]. The coil100 dataset contains 7,200 RGB images of 100 objects taken from different angles, and the mnist dataset contains 8,000 grayscale digit images of size 28\times 28. Our experiments, illustrated in Figures 2 and 3, demonstrate that our algorithms outperform Rsub.

Figure 2: The performance of our algorithms on SPCA with mnist data (n=784). Panels (a)–(c): r=1 with \mu=0.1,0.2,0.3; panels (d)–(f): r=2 with \mu=0.1,0.2,0.3.
Figure 3: The performance of our algorithms on SPCA with coil100 data (n=1024). Panels (a)–(c): r=1 with \mu=0.1,0.2,0.3; panels (d)–(f): r=2 with \mu=0.1,0.2,0.3.

5.2 Sparse canonical correlation analysis

Canonical correlation analysis (CCA), first proposed by Hotelling [20], aims to identify associations between two sets of variables. The sparse CCA (SCCA) model improves the interpretability of the canonical variables by restricting the linear combinations to a subset of the original variables. Suppose that there are two data sets, X\in\mathbb{R}^{n\times p} containing p variables and Y\in\mathbb{R}^{n\times q} containing q variables, both obtained from n observations. Let \hat{\Sigma}_{xx}=\frac{1}{n}X^{\top}X and \hat{\Sigma}_{yy}=\frac{1}{n}Y^{\top}Y be the sample covariance matrices of X and Y, respectively, and let \hat{\Sigma}_{xy}=\frac{1}{n}X^{\top}Y be the sample cross-covariance matrix. Then the SCCA model can be formulated as

(5.5) minUp×r,Vq×rtr(UΣ^xyV)+μ1U1+μ2V1,s.t.UΣ^xxU=VΣ^yyV=Ir.\min_{U\in\mathbb{R}^{p\times r},V\in\mathbb{R}^{q\times r}}-\mathrm{tr}(U^{\top}\hat{\Sigma}_{xy}V)+\mu_{1}\|U\|_{1}+\mu_{2}\|V\|_{1},\;\mathrm{s.t.}\;U^{\top}\hat{\Sigma}_{xx}U=V^{\top}\hat{\Sigma}_{yy}V=I_{r}.

Let us denote \mathcal{M}_{1}=\{U\in\mathbb{R}^{p\times r}~{}:~{}U^{\top}\hat{\Sigma}_{xx}U=I_{r}\} and \mathcal{M}_{2}=\{V\in\mathbb{R}^{q\times r}~{}:~{}V^{\top}\hat{\Sigma}_{yy}V=I_{r}\}. Without loss of generality, we assume that \hat{\Sigma}_{xx} and \hat{\Sigma}_{yy} are positive definite (otherwise, we can add a small multiple of the identity matrix). In this case, \mathcal{M}_{1} and \mathcal{M}_{2} are two generalized Stiefel manifolds, and problem (5.5) is a nonsmooth problem on the product manifold \mathcal{M}:=\mathcal{M}_{1}\times\mathcal{M}_{2}. Moreover, since \hat{\Sigma}_{xy}=\frac{1}{n}X^{\top}Y=\frac{1}{n}\sum_{i=1}^{n}x_{i}^{\top}y_{i}, where x_{i} and y_{i} denote the i-th rows of X and Y, respectively, we can rewrite (5.5) in the following finite-sum form:

(5.6) minUp×r,Vq×r1nintr(UxiyiV)+μ1U1+μ2V1,s.t.U1,V2.\min_{U\in\mathbb{R}^{p\times r},V\in\mathbb{R}^{q\times r}}-\frac{1}{n}\sum_{i}^{n}\mathrm{tr}(U^{\top}x_{i}^{\top}y_{i}V)+\mu_{1}\|U\|_{1}+\mu_{2}\|V\|_{1},\;\mathrm{s.t.}\;U\in\mathcal{M}_{1},V\in\mathcal{M}_{2}.

Therefore, we can apply our algorithms ManIAL and StoManIAL to solve (5.6). Owing to the poor performance of Rsub on the SPCA problem, we only report comparisons among our own algorithms. We compare the proposed algorithms on the SCCA problem with random data, where we set n=5000. For StoManIAL, we partition the samples into 100 subsets, and in each iteration we sample one subset. The tolerance is set to \mathrm{tol}=10^{-8}\times p\times r. Figure 4 presents the results for fixed p=q=200 and varying r=1,2 and \mu=0.2,0.3,0.4. The results show that all of our algorithms have similar performance.
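For the generalized Stiefel constraints in (5.6), one convenient feasibility-restoring map is R_{U}(\xi)=(U+\xi)\left[(U+\xi)^{\top}S(U+\xi)\right]^{-1/2} with S=\hat{\Sigma}_{xx} (and analogously with \hat{\Sigma}_{yy} for V); the sketch below is our own illustration, assuming S is symmetric positive definite (after the small regularization mentioned above, if needed).

```python
import numpy as np

def gen_stiefel_retraction(U, xi, S):
    """Map U + xi back onto {W : W^T S W = I_r}, for symmetric positive definite S."""
    W = U + xi
    M = W.T @ S @ W                                         # r x r symmetric positive definite
    evals, evecs = np.linalg.eigh(M)
    M_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T   # M^{-1/2}
    return W @ M_inv_sqrt
```

Here S plays the role of \hat{\Sigma}_{xx} (for U) or \hat{\Sigma}_{yy} (for V).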

Figure 4: The performance of our algorithms on SCCA with random data (n=200). Panels (a)–(c): r=10 with \mu=0.4,0.6,0.8; panels (d)–(f): r=20 with \mu=0.4,0.6,0.8.

6 Conclusions

This paper proposes two novel manifold inexact augmented Lagrangian methods, namely ManIAL and StoManIAL, with oracle complexity guarantees. To the best of our knowledge, these are the first complexity results for augmented Lagrangian methods for solving nonsmooth problems on Riemannian manifolds. We prove that ManIAL and StoManIAL achieve oracle complexities of \mathcal{O}(\epsilon^{-3}) and \tilde{\mathcal{O}}(\epsilon^{-3.5}), respectively. Numerical experiments on SPCA and SCCA problems demonstrate that our proposed methods outperform an existing method with the previously best-known complexity result.

References

  • [1] P.-A. Absil, R. Mahony, and R. Sepulchre, Optimization Algorithms on Matrix Manifolds, Princeton University Press, Princeton, NJ, 2008.
  • [2] N. Aybat and G. Iyengar, A first-order augmented Lagrangian method for compressed sensing, SIAM Journal on Optimization, 22 (2012), p. 429.
  • [3] A. Beck and I. Rosset, A dynamic smoothing technique for a class of nonsmooth optimization problems on manifolds, SIAM Journal on Optimization, 33 (2023), pp. 1473–1493, https://doi.org/10.1137/22M1489447.
  • [4] G. C. Bento, O. P. Ferreira, and J. G. Melo, Iteration-complexity of gradient, subgradient and proximal point methods on Riemannian manifolds, Journal of Optimization Theory and Applications, 173 (2017), pp. 548–562.
  • [5] A. Böhm and S. J. Wright, Variable smoothing for weakly convex composite functions, Journal of optimization theory and applications, 188 (2021), pp. 628–649.
  • [6] N. Boumal, An introduction to optimization on smooth manifolds, Cambridge University Press, Cambridge, England, 2023, https://doi.org/10.1017/9781009166164.
  • [7] N. Boumal, P.-A. Absil, and C. Cartis, Global rates of convergence for nonconvex optimization on manifolds, IMA Journal of Numerical Analysis, 39 (2019), pp. 1–33.
  • [8] S. Burer and R. D. Monteiro, A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization, Mathematical programming, 95 (2003), pp. 329–357.
  • [9] Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford, Accelerated methods for nonconvex optimization, SIAM Journal on Optimization, 28 (2018), pp. 1751–1772.
  • [10] S. Chen, S. Ma, A. Man-Cho So, and T. Zhang, Proximal gradient method for nonsmooth optimization over the Stiefel manifold, SIAM Journal on Optimization, 30 (2020), pp. 210–239.
  • [11] A. Cutkosky and F. Orabona, Momentum-based variance reduction in non-convex sgd, Advances in neural information processing systems, 32 (2019).
  • [12] K. Deng and Z. Peng, A manifold inexact augmented Lagrangian method for nonsmooth optimization on Riemannian submanifolds in Euclidean space, IMA Journal of Numerical Analysis, (2022), https://doi.org/10.1093/imanum/drac018.
  • [13] Z. Deng, K. Deng, J. Hu, and Z. Wen, An augmented Lagrangian primal-dual semismooth newton method for multi-block composite optimization, arXiv preprint arXiv:2312.01273, (2023).
  • [14] O. Ferreira and P. Oliveira, Subgradient algorithm on Riemannian manifolds, Journal of Optimization Theory and Applications, 97 (1998), pp. 93–104.
  • [15] O. P. Ferreira, M. S. Louzeiro, and L. Prudente, Iteration-complexity of the subgradient method on Riemannian manifolds with lower bounded curvature, Optimization, 68 (2019), pp. 713–729.
  • [16] S. Ghadimi and G. Lan, Accelerated gradient methods for nonconvex nonlinear and stochastic programming, Mathematical Programming, 156 (2016), pp. 59–99.
  • [17] A. Han and J. Gao, Riemannian stochastic recursive momentum method for non-convex optimization, arXiv preprint arXiv:2008.04555, (2020).
  • [18] S. Hosseini, W. Huang, and R. Yousefpour, Line search algorithms for locally Lipschitz functions on Riemannian manifolds, SIAM Journal on Optimization, 28 (2018), pp. 596–619.
  • [19] S. Hosseini and A. Uschmajew, A Riemannian gradient sampling algorithm for nonsmooth optimization on manifolds, SIAM Journal on Optimization, 27 (2017), pp. 173–189.
  • [20] H. Hotelling, Relations between two sets of variates, in Breakthroughs in statistics: methodology and distribution, Springer, 1992, pp. 162–190.
  • [21] J. Hu, X. Liu, Z.-W. Wen, and Y.-X. Yuan, A brief introduction to manifold optimization, Journal of the Operations Research Society of China, 8 (2020), pp. 199–248.
  • [22] W. Huang and K. Wei, Riemannian proximal gradient methods, Mathematical Programming, 194 (2022), pp. 371–413.
  • [23] W. Huang and K. Wei, An inexact Riemannian proximal gradient method, Computational Optimization and Applications, 85 (2023), pp. 1–32.
  • [24] B. Jiang, X. Meng, Z. Wen, and X. Chen, An exact penalty approach for optimization with nonnegative orthogonality constraints, Mathematical Programming, 198 (2023), pp. 855–897.
  • [25] I. T. Jolliffe, N. T. Trendafilov, and M. Uddin, A modified principal component technique based on the LASSO, Journal of computational and Graphical Statistics, 12 (2003), pp. 531–547.
  • [26] W. Kong, J. G. Melo, and R. D. Monteiro, Complexity of a quadratic penalty accelerated inexact proximal point method for solving linearly constrained nonconvex composite programs, SIAM Journal on Optimization, 29 (2019), pp. 2566–2593.
  • [27] G. Lan and R. D. Monteiro, Iteration-complexity of first-order augmented Lagrangian methods for convex programming, Mathematical Programming, 155 (2016), pp. 511–547.
  • [28] Y. LeCun, The mnist database of handwritten digits, http://yann. lecun. com/exdb/mnist/, (1998).
  • [29] J. Li, S. Ma, and T. Srivastava, A Riemannian ADMM, arXiv preprint arXiv:2211.02163, (2022).
  • [30] X. Li, S. Chen, Z. Deng, Q. Qu, Z. Zhu, and A. Man-Cho So, Weakly convex optimization over Stiefel manifold using Riemannian subgradient-type methods, SIAM Journal on Optimization, 31 (2021), pp. 1605–1634.
  • [31] X. Li, D. Sun, and K.-C. Toh, A highly efficient semismooth Newton augmented Lagrangian method for solving lasso problems, SIAM Journal on Optimization, 28 (2018), pp. 433–458.
  • [32] Z. Li, P.-Y. Chen, S. Liu, S. Lu, and Y. Xu, Rate-improved inexact augmented Lagrangian method for constrained nonconvex optimization, in International Conference on Artificial Intelligence and Statistics, PMLR, 2021, pp. 2170–2178.
  • [33] Z. Li, P.-Y. Chen, S. Liu, S. Lu, and Y. Xu, Stochastic inexact augmented Lagrangian method for nonconvex expectation constrained optimization, Computational Optimization and Applications, 87 (2024), pp. 117–147.
  • [34] Q. Lin, R. Ma, and Y. Xu, Inexact proximal-point penalty methods for constrained non-convex optimization, arXiv preprint arXiv:1908.11518, (2019).
  • [35] Z. Lu and Z. Zhou, Iteration-complexity of first-order augmented Lagrangian methods for convex conic programming, SIAM journal on optimization, 33 (2023), pp. 1159–1190.
  • [36] S. A. Nene, S. K. Nayar, H. Murase, et al., Columbia object image library (coil-20), (1996).
  • [37] Z. Peng, W. Wu, J. Hu, and K. Deng, Riemannian smoothing gradient type algorithms for nonsmooth optimization problem on compact Riemannian submanifold embedded in Euclidean space, Applied Mathematics & Optimization, 88 (2023), p. 85.
  • [38] M. F. Sahin, A. Eftekhari, A. Alacaoglu, F. L. Gómez, and V. Cevher, An inexact augmented Lagrangian framework for nonconvex optimization with nonlinear constraints, in Proceedings of NeurIPS 2019, 2019.
  • [39] H. Sato, Riemannian optimization and its applications, Springer Cham, Cham, 2021, https://doi.org/10.1007/978-3-030-62391-3.
  • [40] B. Wang, S. Ma, and L. Xue, Riemannian stochastic proximal gradient methods for nonsmooth optimization over the Stiefel manifold, Journal of machine learning research, 23 (2022), pp. 1–33.
  • [41] Y. Wang, K. Deng, H. Liu, and Z. Wen, A decomposition augmented Lagrangian method for low-rank semidefinite programming, SIAM Journal on Optimization, 33 (2023), pp. 1361–1390.
  • [42] Y. Xu, First-order methods for constrained convex programming based on linearized augmented Lagrangian function, INFORMS Journal on Optimization, 3 (2021), pp. 89–117.
  • [43] Y. Xu, Iteration complexity of inexact augmented Lagrangian methods for constrained convex programming, Mathematical Programming, 185 (2021), pp. 199–244.
  • [44] R. Zass and A. Shashua, Nonnegative sparse PCA, Advances in neural information processing systems, 19 (2006).
  • [45] J. Zhang, H. Lin, S. Jegelka, S. Sra, and A. Jadbabaie, Complexity of finding stationary points of nonconvex nonsmooth functions, in International Conference on Machine Learning, PMLR, 2020, pp. 11173–11182.
  • [46] Y. Zhou, C. Bao, C. Ding, and J. Zhu, A semismooth Newton based augmented Lagrangian method for nonsmooth optimization on matrix manifolds, Mathematical Programming, (2022), pp. 1–61.

Appendix A Technical lemmas

The next lemma is a technical observation and follows from Lemma 3 in [11].

Lemma A.1.

Suppose that the conditions (4.13)-(4.15) hold. Let us denote st=dtgradψ(xt)s_{t}=d_{t}-\operatorname{grad}\psi(x_{t}). Then we have

𝔼[(gradψ(xt,ξt)gradψ(xt))ηt11(1at)2st1]=0\displaystyle\mathbb{E}\left[\left(\operatorname{grad}\psi\left({x}_{t},\xi_{t}\right)-\operatorname{grad}\psi\left({x}_{t}\right)\right)\cdot\eta_{t-1}^{-1}\left(1-a_{t}\right)^{2}{s}_{t-1}\right]=0
𝔼[(gradψ(xt,ξt)gradψ(xt1,ξt)gradψ(xt)+gradψ(xt1))ηt11(1at)2st1]=0.\displaystyle\mathbb{E}\left[\left(\operatorname{grad}\psi\left({x}_{t},\xi_{t}\right)-\operatorname{grad}\psi\left({x}_{t-1},\xi_{t}\right)-\operatorname{grad}\psi\left({x}_{t}\right)+\operatorname{grad}\psi\left({x}_{t-1}\right)\right)\cdot\eta_{t-1}^{-1}\left(1-a_{t}\right)^{2}{s}_{t-1}\right]=0.

Lemma A.2.

Suppose that the conditions (4.13)-(4.15) hold and that \eta_{t}\leq\frac{1}{L} for all t in Algorithm 4. Then

𝔼[ψ(xt+1)ψ(xt)]𝔼[ηt/2gradψ(xt)2+ηt/2dtgradψ(xt)2].\mathbb{E}\left[\psi\left(x_{t+1}\right)-\psi\left(x_{t}\right)\right]\leq\mathbb{E}\left[-\eta_{t}/2\left\|\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}+\eta_{t}/2\left\|d_{t}-\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}\right].

Proof A.3.

By retr-smoothness of ψ\psi, we have

ψ(xt+1)\displaystyle\psi\left(x_{t+1}\right) ψ(xt)gradψ(xt),ηtdt+ηt2L2dt2\displaystyle\leq\psi\left(x_{t}\right)-\left\langle\operatorname{grad}\psi\left(x_{t}\right),\eta_{t}d_{t}\right\rangle+\frac{\eta_{t}^{2}L}{2}\left\|d_{t}\right\|^{2}
=ψ(xt)ηt2gradψ(xt)2ηt2dt2+ηt2dtgradψ(xt)2+Lηt22dt2\displaystyle=\psi\left(x_{t}\right)-\frac{\eta_{t}}{2}\left\|\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}-\frac{\eta_{t}}{2}\left\|d_{t}\right\|^{2}+\frac{\eta_{t}}{2}\left\|d_{t}-\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}+\frac{L\eta_{t}^{2}}{2}\left\|d_{t}\right\|^{2}
=ψ(xt)ηt2gradψ(xt)2+ηt2dtgradψ(xt)2(ηt2Lηt22)dt2\displaystyle=\psi\left(x_{t}\right)-\frac{\eta_{t}}{2}\left\|\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}+\frac{\eta_{t}}{2}\left\|d_{t}-\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}-\left(\frac{\eta_{t}}{2}-\frac{L\eta_{t}^{2}}{2}\right)\left\|d_{t}\right\|^{2}
ψ(xt)ηt2gradψ(xt)2+ηt2dtgradψ(xt)2,\displaystyle\leq\psi\left(x_{t}\right)-\frac{\eta_{t}}{2}\left\|\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}+\frac{\eta_{t}}{2}\left\|d_{t}-\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2},

where for the last inequality we choose ηt1L\eta_{t}\leq\frac{1}{L}.

The following technical lemma, which follows from Lemma 3 in [11], provides a recursive bound on the variance term \|s_{t}\|^{2}.

Lemma A.4.

Suppose that the conditions (4.13)-(4.15) hold. With the notation in Algorithm 4, let st=dtgradψ(xt)s_{t}=d_{t}-\operatorname{grad}\psi\left(x_{t}\right), we have

𝔼[st2]𝔼[2at2Gt2\displaystyle\mathbb{E}\left[\|s_{t}\|^{2}\right]\leq\mathbb{E}\left[2a_{t}^{2}G_{t}^{2}\right. +(1at)2(1+4L2ηt12)st12\displaystyle+(1-a_{t})^{2}(1+4L^{2}\eta_{t-1}^{2})\|s_{t-1}\|^{2}
+4(1at)2L2ηt12gradψ(xt1)2].\displaystyle+\left.4(1-a_{t})^{2}L^{2}\eta_{t-1}^{2}\|\operatorname{grad}\psi(x_{t-1})\|^{2}\right].

Proof A.5.

By the definition of s_{t} and the notation in Algorithm 4, we have s_{t}=d_{t}-\operatorname{grad}\psi\left(x_{t}\right)=\operatorname{grad}\psi\left(x_{t},\xi_{t}\right)+(1-a_{t})\mathcal{T}_{x_{t-1}}^{x_{t}}\left(d_{t-1}-\operatorname{grad}\psi\left(x_{t-1},\xi_{t}\right)\right)-\operatorname{grad}\psi\left(x_{t}\right). Hence, we can write

𝔼[st2]=𝔼[gradψ(xt,ξt)+(1at)𝒯xt1xt(dt1gradψ(xt1,ξt))gradψ(xt)2]\displaystyle\mathbb{E}{\left[\left\|s_{t}\right\|^{2}\right]=\mathbb{E}\left[\left\|\operatorname{grad}\psi\left(x_{t},\xi_{t}\right)+\left(1-a_{t}\right)\mathcal{T}_{x_{t-1}}^{x_{t}}\left(d_{t-1}-\operatorname{grad}\psi\left(x_{t-1},\xi_{t}\right)\right)-\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}\right]}
=𝔼[at(gradψ(xt,ξt)gradψ(xt))+(1at)(gradψ(xt,ξt)𝒯xt1xtgradψ(xt1,ξt))\displaystyle=\mathbb{E}\left[\|a_{t}(\operatorname{grad}\psi(x_{t},\xi_{t})-\operatorname{grad}\psi(x_{t}))+(1-a_{t})(\operatorname{grad}\psi(x_{t},\xi_{t})-\mathcal{T}_{x_{t-1}}^{x_{t}}\operatorname{grad}\psi(x_{t-1},\xi_{t}))\right.
+(1at)(𝒯xt1xtgradψ(xt1)gradψ(xt))+(1at)𝒯xt1xt(dt1gradψ(xt1))2]\displaystyle+(1-a_{t})(\mathcal{T}_{x_{t-1}}^{x_{t}}\operatorname{grad}\psi(x_{t-1})-\operatorname{grad}\psi(x_{t}))+\left.\left(1-a_{t}\right)\mathcal{T}_{x_{t-1}}^{x_{t}}\left(d_{t-1}-\operatorname{grad}\psi\left(x_{t-1}\right)\right)\|^{2}\right]
𝔼[2at2gradψ(xt,ξt)gradψ(xt)2+(1at)2st12+2(1at)2\displaystyle\leq\mathbb{E}\left[2a_{t}^{2}\left\|\operatorname{grad}\psi\left(x_{t},\xi_{t}\right)-\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}+\left(1-a_{t}\right)^{2}\left\|s_{t-1}\right\|^{2}+2(1-a_{t})^{2}\right.
gradψ(xt,ξt)𝒯xt1xtgradψ(xt1,ξt)gradψ(xt)+𝒯xt1xtgradψ(xt1)2]\displaystyle\left.\left\|\operatorname{grad}\psi\left(x_{t},\xi_{t}\right)-\mathcal{T}_{x_{t-1}}^{x_{t}}\operatorname{grad}\psi\left(x_{t-1},\xi_{t}\right)-\operatorname{grad}\psi\left(x_{t}\right)+\mathcal{T}_{x_{t-1}}^{x_{t}}\operatorname{grad}\psi\left(x_{t-1}\right)\right\|^{2}\right]
𝔼[2at2gradψ(xt,ξt)2+(1at)2st12\displaystyle\leq\mathbb{E}\left[2a_{t}^{2}\|\operatorname{grad}\psi\left(x_{t},\xi_{t}\right)\|^{2}+(1-a_{t})^{2}\|s_{t-1}\|^{2}\right.
+2(1at)2gradψ(xt,ξt)𝒯xt1xtgradψ(xt1,ξt)2]\displaystyle\left.+2(1-a_{t})^{2}\|\operatorname{grad}\psi\left(x_{t},\xi_{t}\right)-\mathcal{T}_{x_{t-1}}^{x_{t}}\operatorname{grad}\psi\left(x_{t-1},\xi_{t}\right)\|^{2}\right]
𝔼[2at2Gt2+(1at)2st12+2(1at)2L2ηt12dt12]\displaystyle\leq\mathbb{E}\left[2a_{t}^{2}G_{t}^{2}+(1-a_{t})^{2}\|s_{t-1}\|^{2}+2(1-a_{t})^{2}L^{2}\eta_{t-1}^{2}\|d_{t-1}\|^{2}\right]
𝔼[2at2Gt2+(1at)2st12+4(1at)2L2ηt12(st12+gradψ(xt1)2)]\displaystyle\leq\mathbb{E}\left[2a_{t}^{2}G_{t}^{2}+(1-a_{t})^{2}\|s_{t-1}\|^{2}+4(1-a_{t})^{2}L^{2}\eta_{t-1}^{2}(\|s_{t-1}\|^{2}+\|\operatorname{grad}\psi(x_{t-1})\|^{2})\right]
𝔼[2at2Gt2+(1at)2(1+4L2ηt12)st12+4(1at)2L2ηt12gradψ(xt1)2],\displaystyle\leq\mathbb{E}\left[2a_{t}^{2}G_{t}^{2}+(1-a_{t})^{2}(1+4L^{2}\eta_{t-1}^{2})\|s_{t-1}\|^{2}+4(1-a_{t})^{2}L^{2}\eta_{t-1}^{2}\|\operatorname{grad}\psi(x_{t-1})\|^{2}\right],

where the first inequality uses Lemma A.1 and \|x+y\|^{2}\leq 2\|x\|^{2}+2\|y\|^{2}, the second inequality follows from \mathbb{E}\|x-\mathbb{E}[x]\|^{2}\leq\mathbb{E}\|x\|^{2}, the third inequality uses the update rule of x_{t} together with (4.14), and the fourth inequality uses \|d_{t-1}\|^{2}\leq 2\|s_{t-1}\|^{2}+2\|\operatorname{grad}\psi(x_{t-1})\|^{2}.

Appendix B The proof of Lemma 4.2

Proof B.1.

First, since w(4Lκ)3w\geq(4L\kappa)^{3}, we have from the update formula of ηt\eta_{t} that ηtκw1/314L<1L\eta_{t}\leq\frac{\kappa}{w^{1/3}}\leq\frac{1}{4L}<\frac{1}{L}. It follows from Lemma A.2 that

(B.1) 𝔼[ψ(xt+1)ψ(xt)]𝔼[ηt/2gradψ(xt)2+ηt/2dtgradψ(xt)2].\mathbb{E}\left[\psi\left(x_{t+1}\right)-\psi\left(x_{t}\right)\right]\leq\mathbb{E}\left[-\eta_{t}/2\left\|\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}+\eta_{t}/2\left\|d_{t}-\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}\right].

Consider the potential function \Phi_{t}=\psi\left(x_{t}\right)+\frac{1}{12L^{2}\eta_{t-1}}\left\|s_{t}\right\|^{2}, with the convention \eta_{0}:=\kappa/w^{1/3}. Since a_{t+1}=c\eta_{t}^{2} and w\geq\left(\frac{c\kappa}{4L}\right)^{3}, we have that for all t

at+1cηtκw1/3cκ4Lw1/31.a_{t+1}\leq\frac{c\eta_{t}\kappa}{w^{1/3}}\leq\frac{c\kappa}{4Lw^{1/3}}\leq 1.

Then, we first consider ηt1st+12ηt11st2\eta_{t}^{-1}\left\|s_{t+1}\right\|^{2}-\eta_{t-1}^{-1}\left\|s_{t}\right\|^{2}. Using Lemma A.4, we obtain

𝔼[ηt1st+12ηt11st2]\displaystyle\mathbb{E}{\left[\eta_{t}^{-1}\left\|s_{t+1}\right\|^{2}-\eta_{t-1}^{-1}\left\|s_{t}\right\|^{2}\right]}
\displaystyle\leq 𝔼[2at+12ηtGt+12+(1at+1)2(1+4L2ηt2)st2ηt+4(1at+1)2L2ηtgradψ(xt)2\displaystyle\mathbb{E}\left[2\frac{a_{t+1}^{2}}{\eta_{t}}G_{t+1}^{2}+\frac{\left(1-a_{t+1}\right)^{2}\left(1+4L^{2}\eta_{t}^{2}\right)\left\|s_{t}\right\|^{2}}{\eta_{t}}+4\left(1-a_{t+1}\right)^{2}L^{2}\eta_{t}\left\|\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}\right.
st2ηt1]\displaystyle\left.-\frac{\left\|s_{t}\right\|^{2}}{\eta_{t-1}}\right]
\displaystyle\leq 𝔼[2at+12ηtGt+12At+(ηt1(1at+1)(1+4L2ηt2)ηt11)st2Bt+4L2ηtgradψ(xt)2Ct].\displaystyle\mathbb{E}[\underbrace{2\frac{a_{t+1}^{2}}{\eta_{t}}G_{t+1}^{2}}_{A_{t}}+\underbrace{\left(\eta_{t}^{-1}\left(1-a_{t+1}\right)\left(1+4L^{2}\eta_{t}^{2}\right)-\eta_{t-1}^{-1}\right)\left\|s_{t}\right\|^{2}}_{B_{t}}+\underbrace{4L^{2}\eta_{t}\left\|\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}}_{C_{t}}].

Let us focus on the terms of this expression individually. For the first term, AtA_{t}, observing that w2G2w\geq 2G^{2}\geq G2+Gt+12G^{2}+G_{t+1}^{2}, we have

t=1TAt\displaystyle\sum_{t=1}^{T}A_{t} =t=1T2c2ηt3Gt+12=t=1T2κ3c2Gt+12w+i=1tGi2\displaystyle=\sum_{t=1}^{T}2c^{2}\eta_{t}^{3}G_{t+1}^{2}=\sum_{t=1}^{T}\frac{2\kappa^{3}c^{2}G_{t+1}^{2}}{w+\sum_{i=1}^{t}G_{i}^{2}}
t=1T2κ3c2Gt+12G2+i=1t+1Gi22κ3c2ln(1+t=1T+1Gt2G2)\displaystyle\leq\sum_{t=1}^{T}\frac{2\kappa^{3}c^{2}G_{t+1}^{2}}{G^{2}+\sum_{i=1}^{t+1}G_{i}^{2}}\leq 2\kappa^{3}c^{2}\ln\left(1+\sum_{t=1}^{T+1}\frac{G_{t}^{2}}{G^{2}}\right)
2κ3c2ln(T+2),\displaystyle\leq 2\kappa^{3}c^{2}\ln(T+2),

where in the second to last inequality we used Lemma 4 in [11]. For the second term BtB_{t}, we have

Bt(ηt1ηt11+ηt1(4L2ηt2at+1))st2(ηt1ηt11+ηt(4L2c))st2.B_{t}\leq\left(\eta_{t}^{-1}-\eta_{t-1}^{-1}+\eta_{t}^{-1}\left(4L^{2}\eta_{t}^{2}-a_{t+1}\right)\right)\left\|s_{t}\right\|^{2}\leq\left(\eta_{t}^{-1}-\eta_{t-1}^{-1}+\eta_{t}\left(4L^{2}-c\right)\right)\left\|s_{t}\right\|^{2}.

Let us estimate 1ηt1ηt1\frac{1}{\eta_{t}}-\frac{1}{\eta_{t-1}}. Using the concavity of x1/3x^{1/3}, we have (x+y)1/3x1/3+yx2/3/3(x+y)^{1/3}\leq x^{1/3}+yx^{-2/3}/3. Therefore:

1ηt1ηt1\displaystyle\frac{1}{\eta_{t}}-\frac{1}{\eta_{t-1}} =1κ[(w+i=1tGi2)1/3(w+i=1t1Gi2)1/3]Gt23κ(w+i=1t1Gi2)2/3\displaystyle=\frac{1}{\kappa}\left[\left(w+\sum_{i=1}^{t}G_{i}^{2}\right)^{1/3}-\left(w+\sum_{i=1}^{t-1}G_{i}^{2}\right)^{1/3}\right]\leq\frac{G_{t}^{2}}{3\kappa\left(w+\sum_{i=1}^{t-1}G_{i}^{2}\right)^{2/3}}
Gt23κ(wG2+i=1tGi2)2/3Gt23κ(w/2+i=1tGi2)2/3\displaystyle\leq\frac{G_{t}^{2}}{3\kappa\left(w-G^{2}+\sum_{i=1}^{t}G_{i}^{2}\right)^{2/3}}\leq\frac{G_{t}^{2}}{3\kappa\left(w/2+\sum_{i=1}^{t}G_{i}^{2}\right)^{2/3}}
22/3Gt23κ(w+i=1tGi2)2/322/3G23κ3ηt222/3G212Lκ3ηtG27Lκ3ηt,\displaystyle\leq\frac{2^{2/3}G_{t}^{2}}{3\kappa\left(w+\sum_{i=1}^{t}G_{i}^{2}\right)^{2/3}}\leq\frac{2^{2/3}G^{2}}{3\kappa^{3}}\eta_{t}^{2}\leq\frac{2^{2/3}G^{2}}{12L\kappa^{3}}\eta_{t}\leq\frac{G^{2}}{7L\kappa^{3}}\eta_{t},

where we use ηt14L\eta_{t}\leq\frac{1}{4L}. Further, since c=10L2+G2/(7Lκ3)c=10L^{2}+G^{2}/\left(7L\kappa^{3}\right), we have

ηt(4L2c)6L2ηtG2ηt/(7Lκ3).\eta_{t}\left(4L^{2}-c\right)\leq-6L^{2}\eta_{t}-G^{2}\eta_{t}/\left(7L\kappa^{3}\right).

Thus, we obtain Bt6L2ηtst2B_{t}\leq-6L^{2}\eta_{t}\left\|s_{t}\right\|^{2}. Putting all this together yields:

(B.2) 112L2t=1T(st+12ηtst2ηt1)κ3c26L2ln(T+2)+t=1T[ηt3gradψ(xt)2ηt2st2].\frac{1}{12L^{2}}\sum_{t=1}^{T}\left(\frac{\left\|s_{t+1}\right\|^{2}}{\eta_{t}}-\frac{\left\|s_{t}\right\|^{2}}{\eta_{t-1}}\right)\leq\frac{\kappa^{3}c^{2}}{6L^{2}}\ln(T+2)+\sum_{t=1}^{T}\left[\frac{\eta_{t}}{3}\left\|\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}-\frac{\eta_{t}}{2}\left\|s_{t}\right\|^{2}\right].

Now, we are ready to analyze the potential Φt\Phi_{t}. It follows from (B.1) that

𝔼[Φt+1Φt]𝔼[ηt2gradψ(xt)2+ηt2st2+112L2(st+12ηtst2ηt1)].\mathbb{E}\left[\Phi_{t+1}-\Phi_{t}\right]\leq\mathbb{E}\left[-\frac{\eta_{t}}{2}\left\|\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}+\frac{\eta_{t}}{2}\left\|s_{t}\right\|^{2}+\frac{1}{12L^{2}}\left(\frac{\|s_{t+1}\|^{2}}{\eta_{t}}-\frac{\left\|s_{t}\right\|^{2}}{\eta_{t-1}}\right)\right].

Summing over tt and using (B.2), we obtain

𝔼[ΦT+1Φ1]\displaystyle\mathbb{E}\left[\Phi_{T+1}-\Phi_{1}\right] t=1T𝔼[ηt2gradψ(xt)2+ηt2st2+112L2(st+12ηtst2ηt1)]\displaystyle\leq\sum_{t=1}^{T}\mathbb{E}\left[-\frac{\eta_{t}}{2}\left\|\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}+\frac{\eta_{t}}{2}\left\|s_{t}\right\|^{2}+\frac{1}{12L^{2}}\left(\frac{\|s_{t+1}\|^{2}}{\eta_{t}}-\frac{\left\|s_{t}\right\|^{2}}{\eta_{t-1}}\right)\right]
𝔼[κ3c26L2ln(T+2)t=1Tηt6gradψ(xt)2].\displaystyle\leq\mathbb{E}\left[\frac{\kappa^{3}c^{2}}{6L^{2}}\ln(T+2)-\sum_{t=1}^{T}\frac{\eta_{t}}{6}\left\|\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}\right].

Reordering the terms, we have

\mathbb{E}\left[\sum_{t=1}^{T}\eta_{t}\left\|\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}\right] \leq\mathbb{E}\left[6\left(\Phi_{1}-\Phi_{T+1}\right)+\frac{\kappa^{3}c^{2}}{L^{2}}\ln(T+2)\right]
\leq 6\left(\psi\left(x_{1}\right)-\psi_{\min}\right)+\frac{\mathbb{E}\left[\left\|s_{1}\right\|^{2}\right]}{2L^{2}\eta_{0}}+\frac{\kappa^{3}c^{2}}{L^{2}}\ln(T+2)
\leq 6\left(\psi\left(x_{1}\right)-\psi_{\min}\right)+\frac{w^{1/3}\delta^{2}}{2L^{2}\kappa}+\frac{\kappa^{3}c^{2}}{L^{2}}\ln(T+2),

where the last inequality uses the definition of d_{1}, which gives \mathbb{E}[\|s_{1}\|^{2}]\leq\delta^{2}, together with \eta_{0}=\kappa/w^{1/3}. Now, we relate \mathbb{E}\left[\sum_{t=1}^{T}\eta_{t}\left\|\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}\right] to \mathbb{E}\left[\sum_{t=1}^{T}\left\|\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}\right]. First, since \eta_{t} is decreasing,

𝔼[t=1Tηtgradψ(xt)2]𝔼[ηTt=1Tgradψ(xt)2]\mathbb{E}\left[\sum_{t=1}^{T}\eta_{t}\left\|\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}\right]\geq\mathbb{E}\left[\eta_{T}\sum_{t=1}^{T}\left\|\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}\right]

Therefore, setting M=\frac{1}{\kappa}\left[6\left(\psi\left(x_{1}\right)-\psi_{\min}\right)+\frac{w^{1/3}\delta^{2}}{2L^{2}\kappa}+\frac{\kappa^{3}c^{2}}{L^{2}}\ln(T+2)\right], we get

(B.3) \mathbb{E}\left[\sqrt{\sum_{t=1}^{T}\left\|\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}}\right]^{2} \leq\mathbb{E}\left[\frac{6\left(\psi\left(x_{1}\right)-\psi_{\min}\right)+\frac{w^{1/3}\delta^{2}}{2L^{2}\kappa}+\frac{\kappa^{3}c^{2}}{L^{2}}\ln(T+2)}{\eta_{T}}\right]
=𝔼[κMηT]𝔼[M(w+t=1TGt2)1/3].\displaystyle=\mathbb{E}\left[\frac{\kappa M}{\eta_{T}}\right]\leq\mathbb{E}\left[M\left(w+\sum_{t=1}^{T}G_{t}^{2}\right)^{1/3}\right].

It follows from (4.13) that

(B.4) Gt2=gradψ(xt)+gradψ(xt,ξt)gradψ(xt)22gradψ(xt)2+2δ2.G_{t}^{2}=\left\|\operatorname{grad}\psi\left(x_{t}\right)+\operatorname{grad}\psi\left(x_{t},\xi_{t}\right)-\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}\leq 2\left\|\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}+2\delta^{2}.

Plugging this in (B.3) and using (a+b)1/3a1/3+b1/3(a+b)^{1/3}\leq a^{1/3}+b^{1/3} we obtain:

𝔼[t=1Tgradψ(xt)2]2\displaystyle\mathbb{E}\left[\sqrt{\sum_{t=1}^{T}\left\|\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}}\right]^{2}
\displaystyle\leq 𝔼[M(w+2t=1Tδ2)1/3+M(2t=1Tgradψ(xt)2)1/3]\displaystyle\mathbb{E}\left[M\left(w+2\sum_{t=1}^{T}\delta^{2}\right)^{1/3}+M\left(2\sum_{t=1}^{T}\left\|\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}\right)^{1/3}\right]
\displaystyle\leq M(w+2Tδ2)1/3+𝔼[21/3M(t=1Tgradψ(xt)2)2/3]\displaystyle M\left(w+2T\delta^{2}\right)^{1/3}+\mathbb{E}\left[2^{1/3}M\left(\sqrt{\sum_{t=1}^{T}\left\|\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}}\right)^{2/3}\right]
\displaystyle\leq M(w+2Tδ2)1/3+21/3M(𝔼[t=1Tgradψ(xt)2])2/3,\displaystyle M\left(w+2T\delta^{2}\right)^{1/3}+2^{1/3}M\left(\mathbb{E}\left[\sqrt{\sum_{t=1}^{T}\left\|\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}}\right]\right)^{2/3},

where we have used the concavity of xxax\mapsto x^{a} for all a1a\leq 1 to move expectations inside the exponents. Now, define X=t=1Tgradψ(xt)2X=\sqrt{\sum_{t=1}^{T}\left\|\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}}. Then the above can be rewritten as:

(𝔼[X])2M(w+2Tδ2)1/3+21/3M(𝔼[X])2/3.(\mathbb{E}[X])^{2}\leq M\left(w+2T\delta^{2}\right)^{1/3}+2^{1/3}M(\mathbb{E}[X])^{2/3}.

Note that this implies that either (\mathbb{E}[X])^{2}\leq 2M\left(w+2T\delta^{2}\right)^{1/3}, or (\mathbb{E}[X])^{2}\leq 2\cdot 2^{1/3}M(\mathbb{E}[X])^{2/3}. Solving for \mathbb{E}[X] in these two cases, we obtain

𝔼[X]2M(w+2Tδ2)1/6+2M3/4.\mathbb{E}[X]\leq\sqrt{2M}\left(w+2T\delta^{2}\right)^{1/6}+2M^{3/4}.

Finally, observe that by Cauchy-Schwarz we have t=1Tgradψ(xt)/TX/T\sum_{t=1}^{T}\left\|\operatorname{grad}\psi\left(x_{t}\right)\right\|/T\leq X/\sqrt{T} so that

𝔼[t=1Tgradψ(xt)T]\displaystyle\mathbb{E}\left[\sum_{t=1}^{T}\frac{\left\|\operatorname{grad}\psi\left(x_{t}\right)\right\|}{T}\right] 2M(w+2Tδ2)1/6+2M3/4T\displaystyle\leq\frac{\sqrt{2M}\left(w+2T\delta^{2}\right)^{1/6}+2M^{3/4}}{\sqrt{T}}
\leq\frac{w^{1/6}\sqrt{2M}+2M^{3/4}}{\sqrt{T}}+\frac{2\delta^{1/3}}{T^{1/3}},

where we used (a+b)^{1/6}\leq a^{1/6}+b^{1/6} in the last inequality.