
Oracle complexities of augmented Lagrangian methods for nonsmooth manifold optimization

Kangkang Deng, Department of Mathematics, National University of Defense Technology, Changsha, 410073, China (freedeng1208@gmail.com)    Jiang Hu (corresponding author), Massachusetts General Hospital and Harvard Medical School, Harvard University, Boston, MA 02114, USA (hujiangopt@gmail.com)    Jiayuan Wu, College of Engineering, Peking University, Beijing 100871, China (1901110043@pku.edu.cn)    Zaiwen Wen, Beijing International Center for Mathematical Research, Center for Machine Learning Research and College of Engineering, Peking University, Beijing 100871, China (wenzw@pku.edu.cn)
Abstract

In this paper, we present two novel manifold inexact augmented Lagrangian methods, ManIAL for the deterministic setting and StoManIAL for the stochastic setting, for solving nonsmooth manifold optimization problems. Using the Riemannian gradient method as a subroutine, we establish an \mathcal{O}(\epsilon^{-3}) oracle complexity result for ManIAL, matching the best-known complexity result. Our algorithm relies on the careful selection of penalty parameters and the precise control of termination criteria for subproblems. Moreover, for cases where the smooth term takes an expectation form, the proposed StoManIAL utilizes a Riemannian recursive momentum method as a subroutine and achieves an oracle complexity of \tilde{\mathcal{O}}(\epsilon^{-3.5}), which improves on the best-known \mathcal{O}(\epsilon^{-4}) result. Numerical experiments conducted on sparse principal component analysis and sparse canonical correlation analysis demonstrate that our proposed methods outperform an existing method with the previously best-known complexity result. To the best of our knowledge, these are the first complexity results for augmented Lagrangian methods applied to nonsmooth manifold optimization problems.

1 Introduction

We are concerned with the following nonsmooth composite manifold optimization problem

(1.1) \min_{x\in\mathcal{M}}\varphi(x):=f(x)+h(\mathcal{A}x),

where f:\mathbb{R}^{n}\rightarrow\mathbb{R} is a continuously differentiable function, \mathcal{A}:\mathbb{R}^{n}\rightarrow\mathbb{R}^{m} is a linear mapping, h:\mathbb{R}^{m}\rightarrow(-\infty,+\infty] is a proper closed convex function, and \mathcal{M} is an embedded Riemannian submanifold. Hence, the objective function of (1.1) can be nonconvex and nonsmooth. Manifold optimization with smooth objective functions (i.e., smooth manifold optimization) has been extensively studied in the literature; see [1, 6, 39, 21] for a comprehensive review. Recently, manifold optimization with nonsmooth objective functions has been growing in popularity due to its wide applications in sparse principal component analysis [25], nonnegative principal component analysis [44, 24], semidefinite programming [8, 41], etc. Different from smooth manifold optimization, solving problem (1.1) is challenging because of the nonsmoothness. A range of algorithms has recently been developed to address nonsmooth optimization problems. Notably, one category of algorithms is the augmented Lagrangian framework, introduced by [37, 46]. Compared with other existing works, this algorithmic framework can handle a general \mathcal{A} along with possible extra constraints [46]. However, both works only provide asymptotic convergence results, and the iteration complexity or oracle complexity remains unclear. This naturally raises the following question:

Can we design an augmented Lagrangian algorithm framework for (1.1) with oracle complexity guarantee?

We answer this question affirmatively, giving two novel augmented Lagrangian algorithms that achieve an oracle complexity of \mathcal{O}(\epsilon^{-3}) in the deterministic setting and \tilde{\mathcal{O}}(\epsilon^{-3.5}) in the stochastic setting. The main contributions of this paper are given as follows:

  • Developing an inexact augmented Lagrangian algorithm framework for solving problem (1.1). Our algorithm relies on carefully selecting penalty parameters and controlling termination criteria for subproblems. In particular, when the Riemannian gradient descent method is used for the inner iterates, we prove that our algorithm, ManIAL, finds an ϵ\epsilon-stationary point with 𝒪(ϵ3)\mathcal{O}(\epsilon^{-3}) calls to the first-order oracle (refer to Theorem 3.13). To the best of our knowledge, this is the first complexity result for the augmented Lagrangian method for solving (1.1).

  • Addressing scenarios where the objective function ff takes the form f(x):=𝔼ξ[f(x;ξ)]f(x):=\mathbb{E}_{\xi}[f(x;\xi)]. We introduce a stochastic augmented Lagrangian method. Owing to the smoothness of the subproblem, we are able to apply the Riemannian recursive momentum method as a subroutine within our framework. Consequently, we establish that StoManIAL achieves the oracle complexity of 𝒪~(ϵ3.5)\tilde{\mathcal{O}}(\epsilon^{-3.5}) (refer to Theorem 4.10), which is better than the best-known 𝒪(ϵ4)\mathcal{O}(\epsilon^{-4}) result [30].

  • Our algorithm framework is highly flexible, allowing for the application of various Riemannian (stochastic) optimization methods, including first-order methods, semi-smooth Newton methods, and others, to solve sub-problems. This is due to the smoothness properties of the subproblems. Additionally, we provide two options for selecting the stopping criteria for subproblems, based on either a given residual or a given number of iterations, further enhancing the flexibility of the algorithm.

1.1 Related works

When the objective function is geodesically convex over a Riemannian manifold, the subgradient-type methods are studied in [4, 15, 14]. An asymptotic convergence result is first established in [14], while an iteration complexity of 𝒪(ϵ2)\mathcal{O}(\epsilon^{-2}) is established in [4, 15]. For the general nonsmooth problem on the Riemannian manifold, the Riemannian subgradient may not exist. By sampling points within a neighborhood of the current iteration point at which the objective function ff is differentiable, the Riemannian gradient sampling methods are developed in [18, 19].

There are many works addressing nonsmooth problems in the form of (1.1). When 𝒜\mathcal{A} in (1.1) is the identity operator and the proximal operator of hh can be calculated analytically, Riemannian proximal gradient methods [10, 22, 23] are proposed. These methods achieve an outer iteration complexity of 𝒪(ϵ2)\mathcal{O}(\epsilon^{-2}) for attaining an ϵ\epsilon-stationary solution. It should be emphasized that each iteration in these algorithms involves solving a subproblem that lacks an explicit solution, and they utilize the semismooth Newton method to solve it. For the general 𝒜\mathcal{A}, Peng et al. [37] develop the Riemannian smoothing gradient method by utilizing a smoothing technique from the Moreau envelope. They show that the algorithm achieves an iteration complexity of 𝒪(ϵ3)\mathcal{O}(\epsilon^{-3}). Moreover, Beck and Rosset [3] propose a dynamic smoothing gradient descent on manifold and obtain an iteration complexity of 𝒪(ϵ3)\mathcal{O}(\epsilon^{-3}). Very recently, based on a similar smoothing technique, a Riemannian alternating direction method of multipliers (RADMM) is proposed in [29] with an iteration complexity result of 𝒪(ϵ4)\mathcal{O}(\epsilon^{-4}) for driving a Karush–Kuhn–Tucker (KKT) residual based stationarity. This complexity result is a bit worse than 𝒪(ϵ3)\mathcal{O}(\epsilon^{-3}) [37] because there is a lack of adaptive adjustment of the smoothing parameter. There is also a line of research based on the augmented Lagrangian function. Deng and Peng [12] proposed a manifold inexact augmented Lagrangian method for solving problem (1.1). Zhou et al. [46] consider a more general problem and propose a globalized semismooth Newton method to solve its subproblem efficiently. In these two papers, they only give the asymptotic convergence result, while the iteration complexity is missing.

When the manifold \mathcal{M} is specified by equality constraints c(x)=0, e.g., the Stiefel manifold, problem (1.1) can be regarded as a constrained optimization problem with a nonsmooth and nonconvex objective function. Given that \mathcal{M} is often nonconvex, we only list the related works in the case where the constraint functions are nonconvex. The papers [32, 38] propose and study the iteration complexity of augmented Lagrangian methods for solving nonlinearly constrained nonconvex composite optimization problems. The iteration complexity results they achieve for an \epsilon-stationary point are both \tilde{\mathcal{O}}(\epsilon^{-3}). More specifically, [38] uses the accelerated gradient method of [16] to obtain the approximate stationary point, while the authors in [32] obtain such an approximate stationary point by applying an inner accelerated proximal method as in [9, 26], whose generated subproblems are convex. It is worth mentioning that both of these papers make a strong assumption on how the feasibility of an iterate is related to its stationarity. Lin et al. [34] propose an inexact proximal-point penalty method that solves a sequence of penalty subproblems. Under a non-singularity condition, they show a complexity result of \tilde{\mathcal{O}}(\epsilon^{-3}).

Finally, we point out that there exist several works for nonsmooth expectation problems on the Riemannian manifold, i.e., when f(x)=𝔼[f(x,ξ)]f(x)=\mathbb{E}[f(x,\xi)] in problem (1.1). In particular, Li et al. [30] focus on a class of weakly convex problems on the Stiefel manifold. They present a Riemannian stochastic subgradient method and show that it has an iteration complexity of 𝒪(ϵ4)\mathcal{O}(\epsilon^{-4}) for driving a natural stationarity measure below ϵ\epsilon. Peng et al. [37] propose a Riemannian stochastic smoothing method and give an iteration complexity of 𝒪(ϵ5)\mathcal{O}(\epsilon^{-5}) for driving a natural stationary point. Li et al. [33] propose a stochastic inexact augmented Lagrangian method to solve problems involving a nonconvex composite objective and nonconvex smooth functional constraints. Under certain regularity conditions, they establish an oracle complexity result of 𝒪(ϵ5)\mathcal{O}(\epsilon^{-5}) with a stochastic first-order oracle. Wang et al. [40] propose two Riemannian stochastic proximal gradient methods for minimizing a nonsmooth function over the Stiefel manifold. They show that the proposed algorithm finds an ϵ\epsilon-stationary point with 𝒪(ϵ3)\mathcal{O}(\epsilon^{-3}) first-order oracle complexity. It should be emphasized that each oracle call involves a subroutine whose iteration complexity remains unclear.

In Table 1, we summarize our complexity results and those of several existing methods for producing an \epsilon-stationary point. The definitions of the first-order oracle in the deterministic and stochastic settings are provided in Definitions 3.12 and 4.9, respectively. The notation \tilde{\mathcal{O}} hides a logarithmic factor. Here, we only consider the scenario where both the objective function and the nonlinear constraint are nonconvex. For the iteration complexity of ALM in the convex case, please refer to [2, 35, 43, 42, 27], etc. We do not include the proximal-type algorithms in [10, 40, 22, 23] since those algorithms use the semismooth Newton method to solve the subproblem, for which the overall oracle complexity is not clear. The table shows that our algorithms achieve better oracle complexity results in both the deterministic and stochastic settings. In particular, in the stochastic setting, our algorithm establishes an oracle complexity of \tilde{\mathcal{O}}(\epsilon^{-3.5}), which is better than the best-known \mathcal{O}(\epsilon^{-4}) result.

Table 1: Comparison of the oracle complexity results of several methods in the literature with our method to produce an \epsilon-stationary point
Algorithms | \mathcal{A} | Constraints | Objective | Complexity
iALM [38] | \mathcal{I} | c(x)=0 | deterministic | \mathcal{O}(\epsilon^{-4})
iPPP [34] | \mathcal{I} | c(x)=0 | deterministic | \tilde{\mathcal{O}}(\epsilon^{-3})^{1}
improved iALM [32] | \mathcal{I} | c(x)=0 | deterministic | \tilde{\mathcal{O}}(\epsilon^{-3})
Sto-iALM [33] | \mathcal{I} | \mathbb{E}_{\xi}[c(x,\xi)]=0 | stochastic | \mathcal{O}(\epsilon^{-5})
RADMM [29] | general | compact submanifold | deterministic | \mathcal{O}(\epsilon^{-4})
DSGM [3] | general | compact submanifold | deterministic | \mathcal{O}(\epsilon^{-3})
subgradient [30] | general | Stiefel manifold | deterministic | \mathcal{O}(\epsilon^{-4})
subgradient [30] | general | Stiefel manifold | stochastic | \mathcal{O}(\epsilon^{-4})
smoothing [37] | general | compact submanifold | deterministic | \mathcal{O}(\epsilon^{-3})
smoothing [37] | general | compact submanifold | stochastic | \mathcal{O}(\epsilon^{-5})
this paper | general | compact submanifold | deterministic | \mathcal{O}(\epsilon^{-3})
this paper | general | compact submanifold | stochastic | \tilde{\mathcal{O}}(\epsilon^{-3.5})
^{1} We only give the case where c(x) is nonconvex and a non-singularity condition holds on c(x).

1.2 Notation

Let \left<\cdot,\cdot\right> and \|\cdot\| be the Euclidean inner product and the induced Euclidean norm. Given a matrix A, we use \|A\|_{F} to denote the Frobenius norm and \|A\|_{1}:=\sum_{ij}|A_{ij}| to denote the entrywise \ell_{1} norm. For a vector x, we use \|x\|_{2} and \|x\|_{1} to denote its Euclidean norm and \ell_{1} norm, respectively. The indicator function of a set \mathcal{C}, denoted by \delta_{\mathcal{C}}, is 0 on \mathcal{C} and +\infty otherwise. The distance from x to \mathcal{C} is denoted by \mathrm{dist}(x,\mathcal{C}):=\min_{y\in\mathcal{C}}\|x-y\|. We use \nabla f(x) and \operatorname{grad}f(x) to denote the Euclidean gradient and the Riemannian gradient of f, respectively.

2 Preliminaries

In this section, we introduce some necessary concepts for Riemannian optimization and the Moreau envelope.

2.1 Riemannian optimization

An n-dimensional smooth manifold \mathcal{M} is an n-dimensional topological manifold equipped with a smooth structure, where each point has a neighborhood that is diffeomorphic to the n-dimensional Euclidean space. For every x\in\mathcal{M}, there exists a chart (U,\psi) such that U is an open set containing x and \psi is a diffeomorphism between U and an open set \psi(U) in the Euclidean space. A tangent vector \eta_{x} to \mathcal{M} at x is defined as the tangent of a parametrized curve \gamma on \mathcal{M} with \gamma(0)=x, namely

ηxu:=γ˙(0)u=d(u(γ(t)))dt|t=0,ux,\eta_{x}u:=\dot{\gamma}(0)u=\left.\frac{d(u(\gamma(t)))}{dt}\right|_{t=0},\forall u\in\wp_{x}\mathcal{M},

where x\wp_{x}\mathcal{M} is the set of all real-valued functions ff defined in a neighborhood of xx in \mathcal{M}. Then, the tangent space TxT_{x}\mathcal{M} of a manifold \mathcal{M} at xx is defined as the set of all tangent vectors at point xx. The manifold \mathcal{M} is called a Riemannian manifold if it is equipped with an inner product on the tangent space TxT_{x}\mathcal{M} at each xx\in\mathcal{M}. If \mathcal{M} is a Riemannian submanifold of an Euclidean space \mathcal{E}, the inner product is defined as the Euclidean inner product: ηx,ξx=tr(ηxξx)\left<\eta_{x},\xi_{x}\right>=\mathrm{tr}(\eta_{x}^{\top}\xi_{x}). The Riemannian gradient gradf(x)Tx\operatorname{grad}f(x)\in T_{x}\mathcal{M} is the unique tangent vector satisfying

gradf(x),ξ=df(x)[ξ],ξTx.\left<\operatorname{grad}f(x),\xi\right>=df(x)[\xi],\forall\xi\in T_{x}\mathcal{M}.

If \mathcal{M} is a compact Riemannian manifold embedded in an Euclidean space, we have that gradf(x)=𝒫Tx(f(x))\operatorname{grad}f(x)=\mathcal{P}_{T_{x}\mathcal{M}}(\nabla f(x)), where f(x)\nabla f(x) is the Euclidean gradient, 𝒫Tx\mathcal{P}_{T_{x}\mathcal{M}} is the projection operator onto the tangent space TxT_{x}\mathcal{M}. The retraction operator is one of the most important ingredients for manifold optimization, which turns an element of TxT_{x}\mathcal{M} into a point in \mathcal{M}.

Definition 2.1 (Retraction, [1]).

A retraction on a manifold \mathcal{M} is a smooth mapping :T\mathcal{R}:T\mathcal{M}\rightarrow\mathcal{M} with the following properties. Let x:Tx\mathcal{R}_{x}:T_{x}\mathcal{M}\rightarrow\mathcal{M} be the restriction of \mathcal{R} at xx. It satisfies

  • \mathcal{R}_{x}(0_{x})=x, where 0_{x} is the zero element of T_{x}\mathcal{M},

  • d\mathcal{R}_{x}(0_{x})=id_{T_{x}\mathcal{M}}, where id_{T_{x}\mathcal{M}} is the identity mapping on T_{x}\mathcal{M}.

We have the following Lipschitz-type inequalities on the retraction on the compact submanifold.

Proposition 2.2 ([7, Appendix B]).

Let \mathcal{R} be a retraction operator on a compact submanifold \mathcal{M}. Then, there exist two positive constants α,β\alpha,\beta such that for all xx\in\mathcal{M} and all uTxu\in T_{x}\mathcal{M}, we have

(2.1) {x(u)xαu,x(u)xuβu2.\left\{\begin{aligned} \|\mathcal{R}_{x}(u)-x\|&\leq\alpha\|u\|,\\ \|\mathcal{R}_{x}(u)-x-u\|&\leq\beta\|u\|^{2}.\end{aligned}\right.
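For illustration, the following minimal Python sketch (not part of the paper's analysis) numerically estimates the constants \alpha and \beta in (2.1) for the projection-like retraction \mathcal{R}_{x}(u)=(x+u)/\|x+u\| on the unit sphere; the retraction choice, the dimension, and the sampling scheme are our own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

def retract_sphere(x, u):
    # Projection-like retraction on the unit sphere: R_x(u) = (x + u) / ||x + u||.
    v = x + u
    return v / np.linalg.norm(v)

alpha_est, beta_est = 0.0, 0.0
for _ in range(1000):
    x = rng.standard_normal(n)
    x /= np.linalg.norm(x)
    u = rng.standard_normal(n)
    u -= x * (x @ u)                                  # project onto the tangent space T_x M
    u *= rng.uniform(0.01, 1.0) / np.linalg.norm(u)   # random tangent vector of moderate norm
    Rxu = retract_sphere(x, u)
    nu = np.linalg.norm(u)
    alpha_est = max(alpha_est, np.linalg.norm(Rxu - x) / nu)          # first bound in (2.1)
    beta_est = max(beta_est, np.linalg.norm(Rxu - x - u) / nu ** 2)   # second bound in (2.1)

print(f"empirical alpha ~ {alpha_est:.3f}, empirical beta ~ {beta_est:.3f}")
```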

Another basic ingredient for manifold optimization is the vector transport.

Definition 2.3 (Vector Transport, [1]).

The vector transport 𝒯\mathcal{T} is a smooth mapping

(2.2) TTT:(ηx,ξx)𝒯ηx(ξx)TT\mathcal{M}\oplus T\mathcal{M}\rightarrow T\mathcal{M}:(\eta_{x},\xi_{x})\mapsto\mathcal{T}_{\eta_{x}}(\xi_{x})\in T\mathcal{M}

satisfying the following properties for all xx\in\mathcal{M}:

  • for any ξxTx\xi_{x}\in T_{x}\mathcal{M}, 𝒯0xξx=ξx\mathcal{T}_{0_{x}}\xi_{x}=\xi_{x},

  • 𝒯ηx(aξx+bγx)=a𝒯ηx(ξx)+b𝒯ηx(γx)\mathcal{T}_{\eta_{x}}(a\xi_{x}+b\gamma_{x})=a\mathcal{T}_{\eta_{x}}(\xi_{x})+b\mathcal{T}_{\eta_{x}}(\gamma_{x}).

When there exists \mathcal{R} such that y=x(ηx)y=\mathcal{R}_{x}(\eta_{x}), we write 𝒯xy(ξx)=𝒯ηx(ξx)\mathcal{T}_{x}^{y}(\xi_{x})=\mathcal{T}_{\eta_{x}}(\xi_{x}).

2.2 Moreau envelope and retraction smoothness

We first introduce the concept of the Moreau envelope.

Definition 2.4.

For a proper, convex and lower semicontinuous function h:nh:\mathbb{R}^{n}\rightarrow\mathbb{R}, the Moreau envelope of hh with the parameter μ>0\mu>0 is given by

Mhμ(y):=infzn{h(z)+12μzy2}.M_{h}^{\mu}(y):=\inf_{z\in\mathbb{R}^{n}}\left\{h(z)+\frac{1}{2\mu}\|z-y\|^{2}\right\}.

The following result demonstrates that the Moreau envelope of a convex function is not only continuously differentiable but also possesses a bounded gradient norm.

Proposition 2.5.

[5, Lemma 3.3] Let h be proper, convex, and \ell_{h}-Lipschitz continuous. Then the Moreau envelope M_{h}^{\mu} has a Lipschitz continuous gradient over \mathbb{R}^{n} with constant \mu^{-1}, and the gradient is given by

(2.3) Mhμ(x)=1μ(xproxμh(x))h(proxμh(x)),\nabla M_{h}^{\mu}(x)=\frac{1}{\mu}\left(x-\mathrm{prox}_{\mu h}(x)\right)\in\partial h\left(\mathrm{prox}_{\mu h}(x)\right),

where \partial h denotes the subdifferential of h, and \mathrm{prox}_{\mu h}(x):=\arg\min_{y\in\mathbb{R}^{n}}\{h(y)+\frac{1}{2\mu}\|y-x\|^{2}\} is the proximal operator. Moreover, it holds that

(2.4) Mhμ(x)h,xproxμh(x)μh.\|\nabla M_{h}^{\mu}(x)\|\leq\ell_{h},\;\|x-\mathrm{prox}_{\mu h}(x)\|\leq\mu\ell_{h}.
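As an illustration of Proposition 2.5, the following sketch (our own example, not from the paper) takes h=\lambda\|\cdot\|_{1}, whose proximal operator is the soft-thresholding map, evaluates \nabla M_{h}^{\mu} via (2.3), and numerically checks the two bounds in (2.4) with \ell_{h}=\lambda\sqrt{m}.

```python
import numpy as np

def prox_l1(v, t):
    # Proximal operator of t * ||.||_1 (soft-thresholding).
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def moreau_grad_l1(x, mu, lam):
    # Gradient of the Moreau envelope of h = lam * ||.||_1, following (2.3).
    return (x - prox_l1(x, mu * lam)) / mu

rng = np.random.default_rng(1)
x = rng.standard_normal(100)
mu, lam = 0.1, 0.5
ell_h = lam * np.sqrt(x.size)                  # Lipschitz constant of lam * ||.||_1 in the 2-norm
g = moreau_grad_l1(x, mu, lam)

assert np.linalg.norm(g) <= ell_h + 1e-12                              # first bound in (2.4)
assert np.linalg.norm(x - prox_l1(x, mu * lam)) <= mu * ell_h + 1e-12  # second bound in (2.4)
print("bounds in (2.4) hold for this l1 example")
```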

Next, we provide the definition of retraction smoothness. This concept plays a crucial role in the convergence analysis of the algorithms proposed in the subsequent sections.

Definition 2.6.

[7, Retraction smooth] A function f:\mathcal{M}\rightarrow\mathbb{R} is said to be retraction smooth (retr-smooth for short) with constant \ell and a retraction \mathcal{R} if for all x,y\in\mathcal{M} it holds that

(2.5) f(y)f(x)+gradf(x),η+2η2,f(y)\leq f(x)+\left<\operatorname{grad}f(x),\eta\right>+\frac{\ell}{2}\|\eta\|^{2},

where ηTx\eta\in T_{x}\mathcal{M} and y=x(η)y=\mathcal{R}_{x}(\eta).

Similar to the Lipschitz continuity of gradient in Euclidean space, we give the following extended definition of the Lipschitz continuity of the Riemannian gradient.

Definition 2.7.

The Riemannian gradient \operatorname{grad}f is Lipschitz continuous with respect to a vector transport \mathcal{T} if there exists a positive constant L such that for all x\in\mathcal{M}, \xi\in T_{x}\mathcal{M}, and y=\mathcal{R}_{x}(\xi),

(2.6) gradf(x)𝒯yx(gradf(y))Lξ.\|\operatorname{grad}f(x)-\mathcal{T}_{y}^{x}(\operatorname{grad}f(y))\|\leq L\|\xi\|.

The following lemma shows that the Lipschitz continuity of gradf(x)\operatorname{grad}f(x) can be deduced by the Lipschitz continuity of f(x)\nabla f(x).

Lemma 2.8.

Suppose that \mathcal{M} is a compact submanifold embedded in the Euclidean space. Given x,y\in\mathcal{M} and u\in T_{x}\mathcal{M}, the vector transport \mathcal{T} is defined as \mathcal{T}_{x}^{y}(u):=D\mathcal{R}_{x}[\xi](u), where y=\mathcal{R}_{x}(\xi). Denote \zeta:=\max_{x\in\text{conv}(\mathcal{M})}\|D^{2}\mathcal{R}_{x}(\cdot)\| and G:=\max_{x\in\mathcal{M}}\|\nabla f(x)\|, and let L_{p} be the Lipschitz constant of \mathcal{P}_{T_{x}\mathcal{M}} over x\in\mathcal{M}. If f has a Lipschitz continuous gradient with constant L, i.e., \|\nabla f(x)-\nabla f(y)\|\leq L\|x-y\|, then we have that

(2.7) gradf(x)𝒯yx(gradf(y))((αLp+ζ)G+αL)ξ.\|\operatorname{grad}f(x)-\mathcal{T}_{y}^{x}(\operatorname{grad}f(y))\|\leq((\alpha L_{p}+\zeta)G+\alpha L)\|\xi\|.

Proof 2.9.

It follows from the definition of Riemannian gradient and vector transport that

(2.8) \|\operatorname{grad}f(y)-\mathcal{T}_{x}^{y}(\operatorname{grad}f(x))\|
\leq \|\operatorname{grad}f(y)-\operatorname{grad}f(x)\|+\|\operatorname{grad}f(x)-\mathcal{T}_{x}^{y}(\operatorname{grad}f(x))\|
\leq \|\operatorname{grad}f(y)-\mathcal{P}_{T_{x}\mathcal{M}}(\nabla f(y))\|+\|\mathcal{P}_{T_{x}\mathcal{M}}(\nabla f(y)-\nabla f(x))\|
\quad+\|D\mathcal{R}_{x}(0_{x})[\operatorname{grad}f(x)]-D\mathcal{R}_{x}(\xi)[\operatorname{grad}f(x)]\|
\leq (L_{p}G+L)\|y-x\|+\zeta\|\xi\|\,\|\operatorname{grad}f(x)\|
\leq ((\alpha L_{p}+\zeta)G+\alpha L)\|\xi\|,

where the third inequality utilizes the Lipschitz continuity of \nabla f(x) and of \mathcal{P}_{T_{x}\mathcal{M}} over x\in\mathcal{M}, and the last inequality follows from (2.1) and \|\operatorname{grad}f(x)\|\leq\|\nabla f(x)\|\leq G. The proof is completed.

3 Augmented Lagrangian method

In this section, we present an augmented Lagrangian method to solve (1.1). The subproblem is solved by the Riemannian gradient descent method. We also give the iteration complexity result. Throughout this paper, we make the following assumptions.

Assumption 1.

The following assumptions hold:

  1. 1.

    The manifold \mathcal{M} is a compact Riemannian submanifold embedded in n\mathbb{R}^{n};

  2. 2.

    The function ff is f\ell_{\nabla f}-smooth but not necessarily convex. The function hh is convex and h\ell_{h}-Lipschitz continuous but is not necessarily smooth; φ\varphi is bounded from below.

3.1 KKT condition and stationary point

We say x\in\mathcal{M} is an \epsilon-stationary point of problem (1.1) if there exists \zeta\in\partial h(\mathcal{A}x) such that \left\|\mathcal{P}_{T_{x}\mathcal{M}}\left(\nabla f(x)+\mathcal{A}^{*}\zeta\right)\right\|\leq\epsilon. As shown in [45], no finite-time algorithm can guarantee \epsilon-stationarity in the nonconvex nonsmooth setting. Motivated by the notion of (\delta,\epsilon)-stationarity introduced in [45], we introduce a generalized \epsilon-stationarity for problem (1.1) based on the Lagrangian function.

Since problem (1.1) contains the nonsmooth term, we introduce the auxiliary variable y=𝒜xy=\mathcal{A}x, which results in the following equivalent split problem:

(3.1) minx,ymf(x)+h(y),s.t.𝒜x=y.\min_{x\in\mathcal{M},y\in\mathbb{R}^{m}}f(x)+h(y),\;\mathrm{s.t.}\;\mathcal{A}x=y.

The Lagrangian function l:×m×ml:\mathcal{M}\times\mathbb{R}^{m}\times\mathbb{R}^{m}\rightarrow\mathbb{R} of (3.1) is given by:

(3.2) l(x,y,z)=f(x)+h(y)z,𝒜xy,l(x,y,z)=f(x)+h(y)-\left<z,\mathcal{A}x-y\right>,

where zz is the corresponding Lagrangian multiplier. Then the KKT condition of (1.1) is given as follows [12]: A point xx\in\mathcal{M} satisfies the KKT condition if there exists y,zmy,z\in\mathbb{R}^{m} such that

(3.3) {0=𝒫Tx(f(x)𝒜z),0z+h(y),0=𝒜xy.\left\{\begin{aligned} 0&=\mathcal{P}_{T_{x}\mathcal{M}}\left(\nabla f(x)-\mathcal{A}^{*}z\right),\\ 0&\in z+\partial h(y),\\ 0&=\mathcal{A}x-y.\end{aligned}\right.

Based on the KKT condition, we give the definition of ϵ\epsilon-stationary point for (1.1):

Definition 3.1.

We say xx\in\mathcal{M} is an ϵ\epsilon-stationary point of (1.1) if there exists y,zmy,z\in\mathbb{R}^{m} such that

(3.4) {𝒫Tx(f(x)𝒜z)ϵ,dist(z,h(y))ϵ,𝒜xyϵ.\left\{\begin{aligned} \left\|\mathcal{P}_{T_{x}\mathcal{M}}\left(\nabla f(x)-\mathcal{A}^{*}z\right)\right\|\leq\epsilon,\\ \mathrm{dist}(-z,\partial h(y))\leq\epsilon,\\ \|\mathcal{A}x-y\|\leq\epsilon.\end{aligned}\right.

In other words, (x,y,z)(x,y,z) is an ϵ\epsilon-KKT point pair of (1.1).
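For concreteness, the sketch below computes the three residuals in (3.4) for an illustrative instance in which \mathcal{M} is the Stiefel manifold, \mathcal{A} is the identity, and h=\lambda\|\cdot\|_{1}; the helper names (proj_tangent_stiefel, dist_to_l1_subdiff, grad_f, lam) are hypothetical and not from the paper.

```python
import numpy as np

def proj_tangent_stiefel(X, G):
    # Project G onto the tangent space of the Stiefel manifold at X (X^T X = I).
    S = X.T @ G
    return G - X @ (S + S.T) / 2

def dist_to_l1_subdiff(z, y, lam):
    # Euclidean distance from -z to the subdifferential of lam * ||.||_1 at y.
    t = -z
    d = np.where(y != 0, np.abs(t - lam * np.sign(y)), np.maximum(np.abs(t) - lam, 0.0))
    return np.linalg.norm(d)

def kkt_residuals(X, Y, Z, grad_f, lam):
    # Residuals of the epsilon-KKT system (3.4) when A is the identity and h = lam * ||.||_1.
    r_stat = np.linalg.norm(proj_tangent_stiefel(X, grad_f(X) - Z))  # stationarity
    r_dual = dist_to_l1_subdiff(Z, Y, lam)                           # -z near the subdifferential of h at y
    r_feas = np.linalg.norm(X - Y)                                   # A x = y (here A = I)
    return r_stat, r_dual, r_feas
```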

3.2 Augmented Lagrangian function of (1.1)

Before constructing the augmented Lagrangian function of (1.1), we first focus on the equivalent split problem (3.1). In particular, the augmented Lagrangian function of (3.1) is constructed as follows:

(3.5) 𝕃σ(x,y;z):\displaystyle\mathbb{L}_{\sigma}(x,y;z): =l(x,y,z)+σ2𝒜xy2\displaystyle=l(x,y,z)+\frac{\sigma}{2}\|\mathcal{A}x-y\|^{2}
=f(x)+h(y)z,𝒜xy+σ2𝒜xy2,\displaystyle=f(x)+h(y)-\left<z,\mathcal{A}x-y\right>+\frac{\sigma}{2}\|\mathcal{A}x-y\|^{2},

where \sigma is the penalty parameter. Note that the augmented Lagrangian function (3.5) involves both the variable x and the auxiliary variable y. However, for the original problem (1.1), which only features the variable x, the corresponding augmented Lagrangian function should solely involve x together with the associated multiplier.

An intuitive way of introducing the augmented Lagrangian function associated with (1.1) is to eliminate the auxiliary variables in (3.5), i.e.,

(3.6) σ(x,z):=minym𝕃σ(x,y;z).\mathcal{L}_{\sigma}(x,z):=\min_{y\in\mathbb{R}^{m}}\mathbb{L}_{\sigma}(x,y;z).

Note that the optimal solution yy^{*} of the above minimization problem has the following closed-form solution:

(3.7) y=proxh/σ(𝒜xz/σ).y^{*}=\mathrm{prox}_{h/\sigma}(\mathcal{A}x-z/\sigma).

Therefore, one can obtain the explicit form of σ(x,z)\mathcal{L}_{\sigma}(x,z):

(3.8) \mathcal{L}_{\sigma}(x,z)= f(x)+h(\mathrm{prox}_{h/\sigma}(\mathcal{A}(x)-z/\sigma))+\frac{\sigma}{2}\left\|\mathcal{A}(x)-z/\sigma-\mathrm{prox}_{h/\sigma}(\mathcal{A}(x)-z/\sigma)\right\|_{2}^{2}-\frac{1}{2\sigma}\|z\|_{2}^{2}
= f(x)+M_{h}^{1/\sigma}\left(\mathcal{A}(x)-\frac{1}{\sigma}z\right)-\frac{1}{2\sigma}\|z\|_{2}^{2},

where Mh1/σM_{h}^{1/\sigma} is the Moreau envelope given in Definition 2.4. It is important to emphasize that the augmented Lagrangian function (3.6) only retains xx as the primal variable. The Lagrangian multiplier zz corresponds to the nonsmooth function hh. When comparing σ\mathcal{L}_{\sigma} with 𝕃σ\mathbb{L}_{\sigma}, it becomes evident that σ\mathcal{L}_{\sigma} involves fewer variables. We will also demonstrate that the augmented Lagrangian function (3.6) possesses favorable properties, such as being continuously differentiable. This characteristic proves to be advantageous when it comes to designing and analyzing algorithms [13, 31].

3.3 Algorithm

As shown in [12], a manifold augmented Lagrangian method based on the augmented Lagrangian function (3.5) can be written as

(3.9) \left\{\begin{aligned} (x^{k+1},y^{k+1})&=\arg\min_{x\in\mathcal{M},\,y\in\mathbb{R}^{m}}\mathbb{L}_{\sigma_{k}}(x,y;z^{k}),\\ z^{k+1}&=z^{k}-\sigma_{k}(\mathcal{A}x^{k+1}-y^{k+1}),\\ \sigma_{k+1}&=\sigma_{k}\uparrow.\end{aligned}\right.

By the definition of σ\mathcal{L}_{\sigma} and (3.7), we know that (3.9) can be rewritten as

(3.10) {xk+1=argminxσk(x,zk),yk+1=proxh/σk(𝒜xk+1zk/σk),zk+1=zkσk(𝒜xk+1yk+1),σk+1=σk.\left\{\begin{aligned} x^{k+1}&=\arg\min_{x\in\mathcal{M}}\mathcal{L}_{\sigma_{k}}(x,z^{k}),\\ y^{k+1}&=\mathrm{prox}_{h/\sigma_{k}}(\mathcal{A}x^{k+1}-z^{k}/\sigma_{k}),\\ z^{k+1}&=z^{k}-\sigma_{k}(\mathcal{A}x^{k+1}-y^{k+1}),\\ \sigma_{k+1}&=\sigma_{k}\uparrow.\end{aligned}\right.

We call (3.10) the augmented Lagrangian method based on (3.6) for (1.1). The authors in [12] show that an inexact version of (3.10) converges to a stationary point. However, the iteration complexity result is unknown. Motivated by [38], we design a novel manifold inexact augmented Lagrangian framework, outlined in Algorithm 1, to efficiently find an \epsilon-KKT point of (1.1). The key distinction from (3.10) lies in the update strategy for the Lagrangian multiplier z^{k} and the careful selection of \sigma_{k}. This strategic choice enables us to bound the norm of z^{k} and ensures that \frac{\|z^{k}\|}{\sigma_{k}} converges to zero.

1:  Given: increasing positive sequence {σk}k1\{\sigma_{k}\}_{k\geq 1}, decreasing positive sequence {ϵk}k1\{\epsilon_{k}\}_{k\geq 1}.
2:  Initialize: Primal variable x0x^{0}, dual variable z0=0z^{0}=0, and dual step size β0\beta_{0}. Denote y0=𝒜x0y^{0}=\mathcal{A}x^{0}.
3:  for k=0,1,2,k=0,1,2,\cdots do
4:     Apply subroutine to solve the following subproblem
(3.11) minxσk(x,zk)\min_{x\in\mathcal{M}}\mathcal{L}_{\sigma_{k}}(x,z^{k})
and return xk+1x^{k+1} with two options:
  • Option I: xk+1x^{k+1} satisfies

    (3.12) gradσk(xk+1,zk)ϵk.\|\operatorname{grad}\mathcal{L}_{\sigma_{k}}(x^{k+1},z^{k})\|\leq\epsilon_{k}.
  • Option II: return xk+1x^{k+1} with the fixed iteration number 2k2^{k}.

5:     Update the auxiliary variable:
yk+1=proxh/σk(𝒜xk+1zk/σk).y^{k+1}=\mathrm{prox}_{h/\sigma_{k}}(\mathcal{A}x^{k+1}-z^{k}/\sigma_{k}).
6:     Update the dual step size:
(3.13) βk+1=β0min(𝒜x0y0log22𝒜xk+1yk+1(k+1)2log(k+2),1)\beta_{k+1}=\beta_{0}\min\left(\frac{\|\mathcal{A}x^{0}-y^{0}\|\log^{2}2}{\|\mathcal{A}x^{k+1}-y^{k+1}\|(k+1)^{2}\log(k+2)},1\right)
7:     Update the dual variable:
(3.14) zk+1=zkβk+1(𝒜xk+1yk+1).z^{k+1}=z^{k}-\beta_{k+1}(\mathcal{A}x^{k+1}-y^{k+1}).
8:  end for
Algorithm 1 (ManIAL)

In Algorithm 1, we introduce the dual step size \beta_{k}, which is used in the update of the dual variable z^{k+1} (also called the Lagrangian multiplier). The update rule (3.13) for \beta_{k+1} ensures that the dual variable z^{k+1} remains bounded as k\rightarrow\infty. The detailed derivation is given in the proof of Theorem 3.8. Moreover, for the convenience of checking the termination criteria and for the subsequent convergence analysis, we introduce an auxiliary sequence denoted by y^{k}. The main difference between our algorithm and the one presented in [12] lies in our algorithm’s reliance on a more refined selection of step sizes and improved control over the accuracy of the x-subproblem solutions. Another crucial difference is the stopping criterion for the subproblem. In particular, we provide two options based on either a given residual or a given number of iterations, further enhancing the flexibility of the algorithm. These modifications allow us to derive results pertaining to the iteration complexity of the algorithm.
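The dual update in Lines 6-7 of Algorithm 1 can be stated in a few lines of code. The sketch below implements (3.13) and (3.14); the small safeguard against division by zero is our own addition.

```python
import numpy as np

def manial_dual_update(z, r_new, r0_norm, beta0, k):
    """One ManIAL dual update: step size (3.13), then multiplier step (3.14).

    z       : current multiplier z^k
    r_new   : residual vector A x^{k+1} - y^{k+1}
    r0_norm : ||A x^0 - y^0||, stored at initialization
    beta0   : initial dual step size beta_0
    k       : outer iteration counter
    """
    r_norm = np.linalg.norm(r_new)
    denom = max(r_norm * (k + 1) ** 2 * np.log(k + 2), 1e-16)  # safeguard (our addition)
    beta = beta0 * min(r0_norm * np.log(2) ** 2 / denom, 1.0)  # (3.13)
    z_new = z - beta * r_new                                   # (3.14)
    return beta, z_new
```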

The function σ(x,z)\mathcal{L}_{\sigma}(x,z) with respect to xx exhibits favorable properties, such as continuous differentiability, enabling a broader range of choices in selecting subproblem-solving algorithms. Let us denote ψk(x):=σk(x,zk)\psi_{k}(x):=\mathcal{L}_{\sigma_{k}}(x,z^{k}). By the property of the proximal mapping and the Moreau envelope in Proposition 2.5, one can readily conclude that ψk(x)\psi_{k}(x) is continuously differentiable and

(3.15) ψk(x)\displaystyle\nabla\psi_{k}(x) =f(x)+σk𝒜(𝒜x1σkzkproxh/σk(𝒜(x)1σkzk)).\displaystyle=\nabla f(x)+\sigma_{k}\mathcal{A}^{*}\left(\mathcal{A}x-\frac{1}{\sigma_{k}}z^{k}-\mbox{prox}_{h/\sigma_{k}}(\mathcal{A}(x)-\frac{1}{\sigma_{k}}z^{k})\right).

The main computational cost of Algorithm 1 lies in solving the subproblem (3.11). In the next subsection, we will give a subroutine to solve (3.11).
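As a concrete instance, the following sketch evaluates \psi_{k}(x)=\mathcal{L}_{\sigma_{k}}(x,z^{k}) as in (3.8) and its Euclidean gradient (3.15) for the illustrative choice h=\lambda\|\cdot\|_{1} with a matrix A representing \mathcal{A}; the callables f, grad_f and the parameter lam are assumed to be supplied by the user, and the Riemannian gradient is obtained by projecting the returned Euclidean gradient onto T_{x}\mathcal{M}.

```python
import numpy as np

def prox_l1(v, t):
    # Proximal operator of t * ||.||_1 (soft-thresholding).
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def psi_value_and_grad(x, z, sigma, A, f, grad_f, lam):
    """Value of psi_k(x) as in (3.8) and its Euclidean gradient as in (3.15),
    specialized to h = lam * ||.||_1 and a matrix A (illustrative choices)."""
    u = A @ x - z / sigma
    p = prox_l1(u, lam / sigma)                      # prox_{h/sigma}(A x - z/sigma), cf. (3.7)
    moreau = lam * np.abs(p).sum() + sigma / 2 * np.linalg.norm(p - u) ** 2
    val = f(x) + moreau - np.linalg.norm(z) ** 2 / (2 * sigma)
    grad = grad_f(x) + sigma * A.T @ (u - p)         # Euclidean gradient (3.15)
    return val, grad
```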

3.4 Subproblem solver

In this subsection, we give a subroutine to find each xk+1x^{k+1} in Algorithm 1. We first present a general algorithm for smooth problems on Riemannian manifolds, and then try to apply the algorithm to solve the subproblem (3.11) in Algorithm 1 by checking the related condition. Let us focus on the following smooth problem on the Riemannian manifold:

(3.16) minxψ(x),\min_{x\in\mathcal{M}}\psi(x),

where \mathcal{M} is a Riemannian submanifold embedded in Euclidean space. Denote ψmin:=infxψ(x)\psi_{\min}:=\inf_{x\in\mathcal{M}}\psi(x). We assume that ψ\psi is retr-smooth with LL and \mathcal{R}, i.e.,

(3.17) ψ(x(η))ψ(x)+gradψ(x),η+L2η2,x,ηTx.\psi(\mathcal{R}_{x}(\eta))\leq\psi(x)+\langle\operatorname{grad}\psi(x),\eta\rangle+\frac{L}{2}\|\eta\|^{2},~{}\forall x\in\mathcal{M},\eta\in T_{x}\mathcal{M}.
1:  Given: x0x_{0}\in\mathcal{M}, L>0L>0, an integer T>0T>0.
2:  for t=0,1,,Tt=0,1,\cdots,T do
3:     Set the stepsize αt=1L\alpha_{t}=\frac{1}{L}.
4:     Compute the next iterate xt+1x_{t+1}:
(3.18) xt+1=xt(αtgradψ(xt)).x_{t+1}=\mathcal{R}_{x_{t}}(-\alpha_{t}\operatorname{grad}\psi(x_{t})).
5:  end for
Output: x=xtx_{*}=x_{t^{*}}, where t=argmin0tTgradψ(xt)t^{*}=\arg\min_{0\leq t\leq T}\|\operatorname{grad}\psi(x_{t})\|.
Algorithm 2 Riemannian gradient descent method, RGD(ψ,L,T,x0)\text{RGD}(\psi,L,T,x_{0})
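A minimal Python sketch of Algorithm 2 for the case \mathcal{M}=\mathrm{St}(n,p) is given below; the QR-based retraction and the tangent-space projection are standard choices assumed for illustration, and grad_psi denotes a user-supplied Euclidean gradient of \psi.

```python
import numpy as np

def qr_retraction(X, U):
    # QR-based retraction on the Stiefel manifold (one common choice of R).
    Q, R = np.linalg.qr(X + U)
    signs = np.where(np.diag(R) < 0, -1.0, 1.0)   # make the factorization unique
    return Q * signs

def proj_tangent_stiefel(X, G):
    S = X.T @ G
    return G - X @ (S + S.T) / 2

def rgd(grad_psi, L, T, X0):
    """Riemannian gradient descent (Algorithm 2) with constant step 1/L; returns the
    iterate with the smallest Riemannian gradient norm, as in the output step."""
    X, best_X, best_norm = X0, X0, np.inf
    for _ in range(T + 1):                          # t = 0, 1, ..., T
        G = proj_tangent_stiefel(X, grad_psi(X))    # Riemannian gradient
        g = np.linalg.norm(G)
        if g < best_norm:
            best_X, best_norm = X, g
        X = qr_retraction(X, -G / L)                # (3.18) with alpha_t = 1/L
    return best_X, best_norm
```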

We adopt a Riemannian gradient descent method (RGD) [1, 6] to solve problem (3.16). The detailed algorithm is given in Algorithm 2. The following lemma gives the iteration complexity result of Algorithm 2.

Lemma 3.2 (Iteration complexity of RGD).

Suppose that ψ\psi is retr-smooth with constant LL and a retraction operator \mathcal{R}. Let {xt}\{x_{t}\} be the iterative sequence of Algorithm 2. Given an integer TT, it holds that

gradψ(x)2L(ψ(x0)ψmin)T.\|\operatorname{grad}\psi(x_{*})\|\leq\frac{\sqrt{2L(\psi(x_{0})-\psi_{\min})}}{\sqrt{T}}.

Proof 3.3.

It follows from (3.17) and (3.18) that

(3.19) ψ(xt+1)\displaystyle\psi(x_{t+1}) ψ(xt)αtgradψ(xt),gradψ(xt)+αt2L2gradψ(xt)2\displaystyle\leq\psi(x_{t})-\alpha_{t}\langle\operatorname{grad}\psi(x_{t}),\operatorname{grad}\psi(x_{t})\rangle+\frac{\alpha_{t}^{2}L}{2}\|\operatorname{grad}\psi(x_{t})\|^{2}
ψ(xt)12Lgradψ(xt)2.\displaystyle\leq\psi(x_{t})-\frac{1}{2L}\|\operatorname{grad}\psi(x_{t})\|^{2}.

By summing both sides of the above inequality over t=0,\cdots,T, we get

(3.20) \psi(x_{T+1})\leq\psi(x_{0})-\frac{1}{2L}\sum_{t=0}^{T}\|\operatorname{grad}\psi(x_{t})\|^{2}.

This implies that

(3.21) \|\operatorname{grad}\psi(x_{*})\|=\min_{0\leq t\leq T}\|\operatorname{grad}\psi(x_{t})\|\leq\sqrt{\frac{2L(\psi(x_{0})-\psi_{\min})}{T+1}}\leq\frac{\sqrt{2L(\psi(x_{0})-\psi_{\min})}}{\sqrt{T}}.

We complete the proof.

To apply Algorithm 2 for computing xk+1x^{k+1} in Algorithm 1, it is crucial to demonstrate that ψk(x)\psi_{k}(x) is retr-smooth, meaning it satisfies (3.17). The following lemma provides the essential insight.

Lemma 3.4.

Suppose that Assumption 1 holds. Then, for ψk\psi_{k}, there exists finite GG such that ψk(x)G\|\nabla\psi_{k}(x)\|\leq G for all xx\in\mathcal{M}, and ψk\psi_{k} is retr-smooth in the sense that

(3.22) ψk(x(η))ψk(x)+η,gradψk(x)+Lk2η2\psi_{k}(\mathcal{R}_{x}(\eta))\leq\psi_{k}(x)+\left<\eta,\operatorname{grad}\psi_{k}(x)\right>+\frac{L_{k}}{2}\|\eta\|^{2}

for all ηTx\eta\in T_{x}\mathcal{M}, where Lk=α2(f+σk𝒜2)+2Gβ.L_{k}=\alpha^{2}(\ell_{\nabla f}+\sigma_{k}\|\mathcal{A}\|^{2})+2G\beta.

Proof 3.5.

By Proposition 2.5, one can show that \nabla\psi_{k} is Lipschitz continuous with constant \ell_{\nabla f}+\sigma_{k}\|\mathcal{A}\|^{2}. Since \nabla f is continuous on the compact manifold \mathcal{M} and, by (2.4), \left\|\sigma_{k}\mathcal{A}^{*}\left(\mathcal{A}x-z^{k}/\sigma_{k}-\mathrm{prox}_{h/\sigma_{k}}(\mathcal{A}x-z^{k}/\sigma_{k})\right)\right\|\leq\|\mathcal{A}\|\ell_{h}, there exists G>0 such that \|\nabla\psi_{k}(x)\|\leq G for all x\in\mathcal{M} and any k>0. The proof is completed by combining (2.1) with Lemma 2.7 in [7].

With Lemmas 3.2 and 3.4, we can apply Algorithm 2 to find each xk+1x^{k+1}. The lemma below gives the inner iteration complexity for the kk-th outer iteration of Algorithm 1.

Lemma 3.6.

Suppose that Assumption 1 holds. Let (xk,yk,zk)(x^{k},y^{k},z^{k}) be the kk-iterate generated in Algorithm 1. We can find xk+1x^{k+1} with Option I by the following call

(3.23) xk+1=RGD(ψk,Lk,Tk,xk),x^{k+1}=RGD(\psi_{k},L_{k},T_{k},x^{k}),

where Lk=α2(f+σk𝒜2)+2Gβ,Tk=Lk(ψk(xk)ψk,min)ϵk2L_{k}=\alpha^{2}(\ell_{\nabla f}+\sigma_{k}\|\mathcal{A}\|^{2})+2G\beta,~{}~{}T_{k}=\frac{L_{k}(\psi_{k}(x^{k})-\psi_{k,\min})}{\epsilon_{k}^{2}}, and ψk,min=infxψk(x)\psi_{k,\min}=\inf_{x\in\mathcal{M}}\psi_{k}(x). Moreover, we can also find xk+1x^{k+1} with Option II by the following call

(3.24) xk+1=RGD(ψk,Lk,2k,xk),x^{k+1}=RGD(\psi_{k},L_{k},2^{k},x^{k}),

This implies that

(3.25) gradψk(xk+1)Lk(ψk(xk)ψk,min)2k.\|\operatorname{grad}\psi_{k}(x^{k+1})\|\leq\frac{\sqrt{L_{k}(\psi_{k}(x^{k})-\psi_{k,\min})}}{\sqrt{2^{k}}}.

Proof 3.7.

For Option I, by Lemma 3.2, we find x^{k+1} satisfying

(3.26) \|\operatorname{grad}\psi_{k}(x^{k+1})\|\leq\frac{\sqrt{L_{k}(\psi_{k}(x^{k})-\psi_{k,\min})}}{\sqrt{T_{k}}}\leq\epsilon_{k},

where the second inequality holds since T_{k}=\frac{L_{k}(\psi_{k}(x^{k})-\psi_{k,\min})}{\epsilon_{k}^{2}}. For Option II, the bound (3.25) directly follows from Lemma 3.2.

3.5 Convergence analysis of ManIAL

We have shown the oracle complexity for solving the subproblems of ManIAL. Now, let us first investigate the outer iteration complexity and conclude this subsection with total oracle complexity.

3.5.1 Outer iteration complexity of ManIAL

Without specifying a subroutine to obtain xk+1x^{k+1}, we establish the outer iteration complexity result of Algorithm 1 in the following lemma.

Theorem 3.8 (Iteration complexity of ManIAL with Option I).

Suppose that Assumption 1 holds. For b>1b>1, if σk=bk\sigma_{k}=b^{k} and ϵk=1/σk\epsilon_{k}=1/\sigma_{k}, given ϵ>0\epsilon>0, then Algorithm 1 with Option I needs at most K:=logb(h+zmax+1ϵ)K:=\log_{b}\left(\frac{\ell_{h}+\|z_{\max}\|+1}{\epsilon}\right) iterations to produce an ϵ\epsilon-KKT solution pair (xK+1,yK+1,z¯K+1)(x^{K+1},y^{K+1},\bar{z}^{K+1}) of (3.4), where

(3.27) z¯k+1:=zkσk(𝒜xk+1yk+1),k0,zmax:=β0π26𝒜x0y0.\bar{z}^{k+1}:=z^{k}-\sigma_{k}(\mathcal{A}x^{k+1}-y^{k+1}),\forall k\geq 0,~{}~{}z_{\max}:=\frac{\beta_{0}\pi^{2}}{6}\|\mathcal{A}x^{0}-y^{0}\|.

Proof 3.9.

It follows from the update rule of xk+1x^{k+1} in Algorithm 1 with Option I and (3.15) that

(3.28) 𝒫Tx(f(xk+1)+σk𝒜(𝒜xk+1zk/σkproxh/σk(𝒜xk+1zk/σk)))ϵk.\left\|\mathcal{P}_{T_{x}\mathcal{M}}\left(\nabla f(x^{k+1})+\sigma_{k}\mathcal{A}^{*}\left(\mathcal{A}x^{k+1}-z^{k}/\sigma_{k}-\mathrm{prox}_{h/\sigma_{k}}(\mathcal{A}x^{k+1}-z^{k}/\sigma_{k})\right)\right)\right\|\leq\epsilon_{k}.

Moreover, the update rule of yk+1y^{k+1} implies that

(3.29) 0zkσk(𝒜xk+1yk+1)+h(yk+1).0\in z^{k}-\sigma_{k}(\mathcal{A}x^{k+1}-y^{k+1})+\partial h(y^{k+1}).

Combining (3.28), (3.29) and the definition of z¯K+1\bar{z}^{K+1} in (3.27) yields

(3.30) {𝒫Txk+1(f(xk+1)+𝒜z¯k+1)ϵk,dist(z¯k+1,h(yk+1))=0.\left\{\begin{aligned} \left\|\mathcal{P}_{T_{x^{k+1}}\mathcal{M}}\left(\nabla f(x^{k+1})+\mathcal{A}^{*}\bar{z}^{k+1}\right)\right\|\leq\epsilon_{k},\\ \mathrm{dist}(-\bar{z}^{k+1},\partial h(y^{k+1}))=0.\end{aligned}\right.

Before bounding the feasibility condition 𝒜xk+1yk+1\|\mathcal{A}x^{k+1}-y^{k+1}\|, we first give a uniform upper bound of the dual variable zk+1z^{k+1}. By (3.13), (3.14), and z0=0z^{0}=0, we have that for any k>1k>1,

(3.31) zk+1\displaystyle\|z^{k+1}\| l=1k+1βl𝒜xlyll=1βl𝒜xlyl\displaystyle\leq\sum_{l=1}^{k+1}\beta_{l}\|\mathcal{A}x^{l}-y^{l}\|\leq\sum_{l=1}^{\infty}\beta_{l}\|\mathcal{A}x^{l}-y^{l}\|
β0𝒜x0y0log22l=11(l+1)2log(l+2)\displaystyle\leq\beta_{0}\|\mathcal{A}x^{0}-y^{0}\|\log^{2}2\sum_{l=1}^{\infty}\frac{1}{(l+1)^{2}\log(l+2)}
β0π26𝒜x0y0=zmax,\displaystyle\leq\frac{\beta_{0}\pi^{2}}{6}\|\mathcal{A}x^{0}-y^{0}\|=z_{\max},

where the last inequality utilizes \log(l+2)>1 for l\geq 1, \log^{2}2<1, and \sum_{l=1}^{\infty}\frac{1}{(l+1)^{2}}\leq\sum_{l=1}^{\infty}\frac{1}{l^{2}}=\frac{\pi^{2}}{6}. For the feasibility bound \|\mathcal{A}x^{k+1}-y^{k+1}\|, we have that for all k\geq 1,

(3.32) 𝒜xk+1yk+1\displaystyle\|\mathcal{A}x^{k+1}-y^{k+1}\| 𝒜xk+1zk/σkyk+1+1σkzk\displaystyle\leq\|\mathcal{A}x^{k+1}-z^{k}/\sigma_{k}-y^{k+1}\|+\frac{1}{\sigma_{k}}\|z^{k}\|
h+zkσkh+zmaxσk,\displaystyle\leq\frac{\ell_{h}+\|z^{k}\|}{\sigma_{k}}\leq\frac{\ell_{h}+\|z_{\max}\|}{\sigma_{k}},

where the second inequality utilizes (2.4). By the definition of σk,ϵk\sigma_{k},\epsilon_{k} and KK, we have that

(3.33) {𝒫TxK+1(f(xK+1)+𝒜z¯K+1)ϵ,dist(z¯K+1,h(yK+1))=0,𝒜xK+1yK+1ϵ.\left\{\begin{aligned} \left\|\mathcal{P}_{T_{x^{K+1}}\mathcal{M}}\left(\nabla f(x^{K+1})+\mathcal{A}^{*}\bar{z}^{K+1}\right)\right\|&\leq\epsilon,\\ \mathrm{dist}(-\bar{z}^{K+1},\partial h(y^{K+1}))&=0,\\ \|\mathcal{A}x^{K+1}-y^{K+1}\|&\leq\epsilon.\end{aligned}\right.

Therefore, (xK+1,yK+1,z¯K+1)(x^{K+1},y^{K+1},\bar{z}^{K+1}) is an ϵ\epsilon-KKT point of (3.4).

Analogously, we also have the following outer iteration complexity for ManIAL with Option II.

Theorem 3.10 (Iteration complexity of ManIAL with Option II).

Suppose that Assumption 1 holds. Let \sigma_{k}=2^{k/3}. Given \epsilon>0, Algorithm 1 with Option II needs at most

K:=log2((2c1+h+zmax)3ϵ3)K:=\log_{2}\left(\frac{(\sqrt{2c_{1}}+\ell_{h}+z_{\max})^{3}}{\epsilon^{3}}\right)

iterations to produce an \epsilon-KKT point pair (x^{K+1},y^{K+1},\bar{z}^{K+1}), where c_{1} is defined in the proof, and \bar{z}^{K+1} and z_{\max} are defined in (3.27).

Proof 3.11.

Similar to Theorem 3.8, for any kk\in\mathbb{N}, we have that

(3.34) dist(z¯k+1,h(yk+1))=0,𝒜xk+1yk+1\displaystyle\mathrm{dist}(-\bar{z}^{k+1},\partial h(y^{k+1}))=0,~{}~{}\|\mathcal{A}x^{k+1}-y^{k+1}\| h+zmaxσk.\displaystyle\leq\frac{\ell_{h}+\|z_{\max}\|}{\sigma_{k}}.

Moreover, it follows from Lemma 3.6 that

(3.35) 𝒫Txk+1(f(xk+1)+𝒜z¯k+1)\displaystyle\left\|\mathcal{P}_{T_{x^{k+1}}\mathcal{M}}\left(\nabla f(x^{k+1})+\mathcal{A}^{*}\bar{z}^{k+1}\right)\right\| Lk(ψk(xk)ψk,min)2k.\displaystyle\leq\frac{\sqrt{L_{k}(\psi_{k}(x_{k})-\psi_{k,\min})}}{\sqrt{2^{k}}}.

Next, we show that ψk(xk)ψk,min\psi_{k}(x_{k})-\psi_{k,\min} is bounded for any k1k\geq 1. By the properties of the Moreau envelope, we have that

(3.36) ψk(x)f(x)+h(𝒜xzk/σk)ψk(x)+h2+zmax22σ0.\psi_{k}(x)\leq f(x)+h(\mathcal{A}x-z^{k}/\sigma_{k})\leq\psi_{k}(x)+\frac{\ell_{h}^{2}+\|z_{\max}\|^{2}}{2\sigma_{0}}.

Since \mathcal{M} is a compact submanifold, z^{k} is bounded by z_{\max}, and \sigma_{k}\geq\sigma_{0} is bounded below, there exist \overline{\phi} and \underline{\phi} such that \underline{\phi}\leq f(x)+h(\mathcal{A}x-z^{k}/\sigma_{k})\leq\overline{\phi} for any x\in\mathcal{M} and k\geq 1. Therefore, we have that

(3.37) ψk(xk)ψk,minϕ¯ϕ¯+h2+zmax22σ0.\psi_{k}(x^{k})-\psi_{k,\min}\leq\overline{\phi}-\underline{\phi}+\frac{\ell_{h}^{2}+\|z_{\max}\|^{2}}{2\sigma_{0}}.

Let us denote c0=ϕ¯ϕ¯+h2+zmax22σ0,c1=c0α2𝒜2,c2=c0(α2f+2Gβ)c_{0}=\overline{\phi}-\underline{\phi}+\frac{\ell_{h}^{2}+\|z_{\max}\|^{2}}{2\sigma_{0}},c_{1}=c_{0}\alpha^{2}\|\mathcal{A}\|^{2},c_{2}=c_{0}(\alpha^{2}\ell_{\nabla f}+2G\beta). By the fact that σk=2k/3\sigma_{k}=2^{k/3}, we have

(3.38) Lk(ψk(xk)ψk,min)\displaystyle L_{k}(\psi_{k}(x_{k})-\psi_{k,\min})
=\displaystyle= (α2(f+σk𝒜2)+2Gβ)(ψk(xk)ψk,min)\displaystyle(\alpha^{2}(\ell_{\nabla f}+\sigma_{k}\|\mathcal{A}\|^{2})+2G\beta)(\psi_{k}(x_{k})-\psi_{k,\min})
=\displaystyle= c12k/3+c2.\displaystyle c_{1}2^{k/3}+c_{2}.

Combining with (3.34) and (3.35) yields

(3.39) {𝒫Txk+1(f(xk+1)+𝒜z¯k+1)c12k/3+c22k/2,𝒜xk+1yk+1h+zmax2k/3,dist(z¯k+1,h(yk+1))=0.\left\{\begin{aligned} \left\|\mathcal{P}_{T_{x^{k+1}}\mathcal{M}}\left(\nabla f(x^{k+1})+\mathcal{A}^{*}\bar{z}^{k+1}\right)\right\|&\leq\frac{\sqrt{c_{1}2^{k/3}+c_{2}}}{2^{k/2}},\\ \|\mathcal{A}x^{k+1}-y^{k+1}\|&\leq\frac{\ell_{h}+\|z_{\max}\|}{2^{k/3}},\\ \mathrm{dist}(-\bar{z}^{k+1},\partial h(y^{k+1}))&=0.\end{aligned}\right.

Without loss of generality, we assume that c_{1}2^{k/3}\geq c_{2}. This implies that \frac{\sqrt{c_{1}2^{k/3}+c_{2}}}{2^{k/2}}\leq\frac{\sqrt{2c_{1}}}{2^{k/3}}. Letting K=\log_{2}\left(\frac{(\sqrt{2c_{1}}+\ell_{h}+z_{\max})^{3}}{\epsilon^{3}}\right), one can show that

(3.40) {𝒫TxK+1(f(xK+1)+𝒜z¯K+1)ϵ,𝒜xK+1yK+1ϵ,dist(z¯K+1,h(yK+1))=0.\left\{\begin{aligned} \left\|\mathcal{P}_{T_{x^{K+1}}\mathcal{M}}\left(\nabla f(x^{K+1})+\mathcal{A}^{*}\bar{z}^{K+1}\right)\right\|&\leq\epsilon,\\ \|\mathcal{A}x^{K+1}-y^{K+1}\|&\leq\epsilon,\\ \mathrm{dist}(-\bar{z}^{K+1},\partial h(y^{K+1}))&=0.\end{aligned}\right.

This implies that (xK+1,yK+1,z¯K+1)(x^{K+1},y^{K+1},\bar{z}^{K+1}) is an ϵ\epsilon-KKT point pair.

3.5.2 Overall oracle complexity of ManIAL

Before giving the overall oracle complexity of ManIAL, we give the definition of a first-order oracle for (1.1).

Definition 3.12 (first-order oracle).

For problem (1.1), a first-order oracle can be defined as follows: compute the Euclidean gradient f(x)\nabla f(x), the proximal operator proxh(x)\operatorname{prox}_{h}(x), and the retraction operator \mathcal{R}.

Now, let us combine the inner oracle complexity and the outer iteration complexity to derive the overall oracle complexity of ManIAL.

Theorem 3.13 (Oracle complexity of ManIAL).

Suppose that Assumption 1 holds. We use Algorithm 2 as the subroutine to solve the subproblems in ManIAL. The following holds:

  • (a)

    If we set σk=bk\sigma_{k}=b^{k} for some b>1b>1 and ϵk=1/σk\epsilon_{k}=1/\sigma_{k}, then ManIAL with Option I finds an ϵ\epsilon-KKT point, after at most 𝒪~(ϵ3)\tilde{\mathcal{O}}(\epsilon^{-3}) calls to the first-order oracle.

  • (b)

    If we set \sigma_{k}=2^{k/3}, then ManIAL with Option II finds an \epsilon-KKT point after at most \mathcal{O}(\epsilon^{-3}) calls to the first-order oracle.

Proof 3.14.

Let KK denote the number of (outer) iterations of Algorithm 1 to reach the accuracy ϵ\epsilon. Let us show (a) first. It follows from Theorem 3.8 that K=logb(h+zmax+1ϵ)K=\log_{b}\left(\frac{\ell_{h}+\|z_{\max}\|+1}{\epsilon}\right). By Lemma 3.6, in the kk-th iterate, one needs at most Lk(ψk(xk)ψk,min)ϵk2\frac{L_{k}(\psi_{k}(x^{k})-\psi_{k,\min})}{\epsilon_{k}^{2}} iterations of Algorithm 2. Therefore, one can bound TT:

(3.41) T=\displaystyle T= k=1KLk(ψk(xk)ψk,min)ϵk2\displaystyle\sum_{k=1}^{K}\frac{L_{k}(\psi_{k}(x^{k})-\psi_{k,\min})}{\epsilon_{k}^{2}}
\displaystyle\leq k=1Kc1σk+c2ϵk2k=1K(c1ϵk3+c2ϵk2)\displaystyle\sum_{k=1}^{K}\frac{c_{1}\sigma_{k}+c_{2}}{\epsilon_{k}^{2}}\leq\sum_{k=1}^{K}\left(\frac{c_{1}}{\epsilon_{k}^{3}}+\frac{c_{2}}{\epsilon_{k}^{2}}\right)
\displaystyle\leq K(c1ϵ3+c2ϵ2)\displaystyle K\left(\frac{c_{1}}{\epsilon^{3}}+\frac{c_{2}}{\epsilon^{2}}\right)
\displaystyle\leq logb(h+zmax+1ϵ)(c1ϵ3+c2ϵ2),\displaystyle\log_{b}\left(\frac{\ell_{h}+\|z_{\max}\|+1}{\epsilon}\right)\left(\frac{c_{1}}{\epsilon^{3}}+\frac{c_{2}}{\epsilon^{2}}\right),

where the first inequality follows from (3.38). Secondly, it follows from Theorem 3.10 that K=log2((2c1+h+zmax)3ϵ3)K=\log_{2}\left(\frac{(\sqrt{2c_{1}}+\ell_{h}+z_{\max})^{3}}{\epsilon^{3}}\right). Then we have

(3.42) T=\displaystyle T= k=1K2k=k=1K(2k+12k)\displaystyle\sum_{k=1}^{K}2^{k}=\sum_{k=1}^{K}(2^{k+1}-2^{k})
=\displaystyle= 2K+12=2((2c1+h+zmax)3ϵ31).\displaystyle 2^{K+1}-2=2\left(\frac{\left(\sqrt{2c_{1}}+\ell_{h}+z_{\max}\right)^{3}}{\epsilon^{3}}-1\right).

The proof is completed.

It should be emphasized that ManIAL with Option I, as well as existing works such as [34], achieves an oracle complexity of \mathcal{O}(\epsilon^{-3}) up to a logarithmic factor. In contrast, ManIAL with Option II achieves an oracle complexity of \mathcal{O}(\epsilon^{-3}) without the logarithmic factor. In that sense, ManIAL with Option II achieves a better oracle complexity result.

4 Stochastic augmented Lagrangian method

In this section, we focus on the case that ff in problem (1.1) has the expectation form, i.e., f(x)=𝔼ξ𝒟[f(x,ξ)]f(x)=\mathbb{E}_{\xi\in\mathcal{D}}[f(x,\xi)]. In particular, we consider the following optimization problem on manifolds:

(4.1) minxF(x):=𝔼ξ𝒟[f(x,ξ)]+h(𝒜x).\min_{x\in\mathcal{M}}F(x):=\mathbb{E}_{\xi\in\mathcal{D}}[f(x,\xi)]+h(\mathcal{A}x).

Here, we assume that the function hh is convex and h\ell_{h}-Lipschitz continuous, f(x,ξ)\nabla f(x,\xi) is f\ell_{\nabla f}-Lipschitz continuous. The following assumptions are made for the stochastic gradient f(x,ξ)\nabla f(x,\xi):

Assumption 2.

For any x, the algorithm generates a sample \xi\sim\mathcal{D} and returns a stochastic gradient \nabla f(x,\xi). There exists a parameter \delta>0 such that

(4.2) 𝔼ξ[f(x,ξ)]=f(x),\displaystyle\mathbb{E}_{\xi}\left[\nabla f(x,\xi)\right]=\nabla f(x),
(4.3) 𝔼ξ[f(x,ξ)f(x)2]δ2.\displaystyle\mathbb{E}_{\xi}\left[\|\nabla f(x,\xi)-\nabla f(x)\|^{2}\right]\leq\delta^{2}.

Moreover, f(x,ξ)\nabla f(x,\xi) is f\ell_{\nabla f}-Lipschitz continuous.

Similarly, we give the definition of ϵ\epsilon-stationary point for problem (4.1):

Definition 4.1.

We say xx\in\mathcal{M} is an ϵ\epsilon-stationary point of (4.1) if there exists y,zmy,z\in\mathbb{R}^{m} such that

(4.4) {𝔼ξ[𝒫Tx(f(x,ξ)𝒜z)]ϵ,dist(z,h(y))ϵ,𝒜xyϵ.\left\{\begin{aligned} \mathbb{E}_{\xi}\left[\left\|\mathcal{P}_{T_{x}\mathcal{M}}\left(\nabla f(x,\xi)-\mathcal{A}^{*}z\right)\right\|\right]\leq\epsilon,\\ \mathrm{dist}(-z,\partial h(y))\leq\epsilon,\\ \|\mathcal{A}x-y\|\leq\epsilon.\end{aligned}\right.

In other words, (x,y,z)(x,y,z) is an ϵ\epsilon-KKT point pair of (4.1).

4.1 Algorithm

We define the augmented Lagrangian function of (4.1):

(4.5) σ(x,z):\displaystyle\mathcal{L}_{\sigma}(x,z): =minym{𝔼ξ[f(x,ξ)]+h(y)z,Axy+σ2Axy2}\displaystyle=\min_{y\in\mathbb{R}^{m}}\left\{\mathbb{E}_{\xi}[f(x,\xi)]+h(y)-\langle z,Ax-y\rangle+\frac{\sigma}{2}\|Ax-y\|^{2}\right\}
=𝔼ξ[f(x,ξ)]+Mh1/σ(Axz/σ)12σz2.\displaystyle=\mathbb{E}_{\xi}[f(x,\xi)]+M_{h}^{1/\sigma}(Ax-z/\sigma)-\frac{1}{2\sigma}\|z\|^{2}.

To efficiently find an ϵ\epsilon-stationary point of (4.1), we design a Riemannian stochastic gradient-type method based on the framework of the stochastic manifold inexact augmented Lagrangian method (StoManIAL), which is given in Algorithm 3. To simplify the notation, in the kk-iterate, let us denote

(4.6) ψk(x,ξ):=f(x,ξ)+Mh1/σk(Axzk/σk)12σkzk2,\psi_{k}(x,\xi):=f(x,\xi)+M_{h}^{1/\sigma_{k}}(Ax-z^{k}/\sigma_{k})-\frac{1}{2\sigma_{k}}\|z^{k}\|^{2},

and \psi_{k}(x):=\mathbb{E}_{\xi}[\psi_{k}(x,\xi)]. Under Assumption 2, it can be easily shown that

(4.7) \mathbb{E}_{\xi}\|\nabla\psi_{k}(x,\xi)-\nabla\psi_{k}(x)\|^{2}\leq\delta^{2}.

Moreover, as shown in (3.15), one can derive the Euclidean gradient of ψk(x,ξ)\psi_{k}(x,\xi) as follows

(4.8) ψk(x,ξ)\displaystyle\nabla\psi_{k}(x,\xi) =f(x,ξ)+σk𝒜(𝒜x1σkzkproxh/σk(𝒜(x)1σkzk)).\displaystyle=\nabla f(x,\xi)+\sigma_{k}\mathcal{A}^{*}\left(\mathcal{A}x-\frac{1}{\sigma_{k}}z^{k}-\mbox{prox}_{h/\sigma_{k}}(\mathcal{A}(x)-\frac{1}{\sigma_{k}}z^{k})\right).

Note that, unlike in ManIAL, the condition (4.10) of Option I in StoManIAL cannot be checked in the numerical experiments due to the expectation operation, although its validity can be assured by the convergence rate result of the subroutine, which will be presented in the next subsection. In contrast, the condition in Option II is checkable.

0:  Initial point x^{0}\in\mathcal{M}, y^{0}, z^{0}; \epsilon, \{\sigma_{k}\}_{k}. Set k=0.
1:  while not converged do
2:     Solve the following manifold optimization problem
(4.9) \min_{x\in\mathcal{M}}\mathcal{L}_{\sigma_{k}}(x,z^{k})
and return x^{k+1} with two options:
  • Option I: x^{k+1} satisfies
    (4.10) \mathbb{E}_{\xi}\left[\left\|\operatorname{grad}\psi_{k}(x^{k+1},\xi)\right\|\right]\leq\epsilon_{k}.
  • Option II: return x^{k+1} with the fixed iteration number 2^{k}.
3:     Update the auxiliary variable:
y^{k+1}=\mathrm{prox}_{h/\sigma_{k}}(\mathcal{A}x^{k+1}-z^{k}/\sigma_{k}).
4:     Update the dual step size:
\beta_{k+1}=\beta_{1}\min\left(\frac{\|\mathcal{A}x^{1}-y^{1}\|\log^{2}2}{\|\mathcal{A}x^{k+1}-y^{k+1}\|(k+1)\log^{2}(k+2)},1\right).
5:     Update the dual variable:
(4.11) z^{k+1}=z^{k}-\beta_{k+1}(\mathcal{A}x^{k+1}-y^{k+1}).
6:     Set k=k+1.
7:  end while
Algorithm 3 (StoManIAL)

4.2 Subproblem solver

In this subsection, we present a Riemannian stochastic optimization method to solve the xx-subproblem in Algorithm 3. As will be shown in Lemma 4.3, the function σk(x,zk)\mathcal{L}_{\sigma_{k}}(x,z^{k}) is retr-smooth with respect to xx, and its retr-smoothness parameter depends on σk\sigma_{k}, which, in turn, is ultimately determined by a prescribed tolerance ϵk\epsilon_{k}. To achieve a favorable overall complexity outcome with a lower order, it is essential to employ a subroutine whose complexity result exhibits a low-order dependence on the smoothness parameter σk\sigma_{k}.

A recent work, the momentum-based variance-reduced stochastic gradient method known as Storm [11], has demonstrated efficacy in solving nonconvex smooth stochastic optimization problems in Euclidean space, yielding an overall complexity result of low order. Although a Riemannian version of Storm (RSRM) [17] matches the same overall complexity result, its dependence on the smoothness parameter is too strong for our purpose. Therefore, we introduce a modified Riemannian version of Storm. In the next subsection, we elaborate on its application to determining x^{k+1} within Algorithm 3.

Consider the following general stochastic optimization problem on a Riemannian manifold

(4.12) minx𝔼[ψ(x,ξ)].\min_{x\in\mathcal{M}}\mathbb{E}[\psi(x,\xi)].

Suppose the following conditions hold: for some finite constants L,δL,\delta

(4.13) 𝔼ξ[ψ(x,ξ)]=ψ(x),𝔼ξ[ψ(x,ξ)ψ(x)2]δ2,\displaystyle\mathbb{E}_{\xi}\left[\nabla\psi(x,\xi)\right]=\nabla\psi(x),~{}~{}\mathbb{E}_{\xi}\left[\|\nabla\psi(x,\xi)-\nabla\psi(x)\|^{2}\right]\leq\delta^{2},
(4.14) gradψ(x,ξ)𝒯yxgradψ(y,ξ)Ld,x,y,d=x1(y),\displaystyle\|\operatorname{grad}\psi(x,\xi)-\mathcal{T}_{y}^{x}\operatorname{grad}\psi(y,\xi)\|\leq L\|d\|,~{}~{}\forall x,y\in\mathcal{M},d=\mathcal{R}_{x}^{-1}(y),
(4.15) ψ(y,ξ)ψ(x,ξ)+d,gradψ(x,ξ)+L2d2,x,y,d=x1(y).\displaystyle\psi(y,\xi)\leq\psi(x,\xi)+\langle d,\operatorname{grad}\psi(x,\xi)\rangle+\frac{L}{2}\|d\|^{2},~{}~{}\forall x,y\in\mathcal{M},d=\mathcal{R}_{x}^{-1}(y).

With the above conditions, we give the modified Riemannian version of Storm in Algorithm 4 and the complexity result in Lemma 4.2. The proof is similar to Theorem 1 in [11]. We provide it in the Appendix.

0:  Initial point x1x_{1}, iteration limit TT, parameters κ,w,c\kappa,w,c, sample set {ξt}t1\{\xi_{t}\}_{t\geq 1}.
1:  Compute d_{1}=\operatorname{grad}\psi(x_{1},\xi_{1}) and set G_{1}=\|\operatorname{grad}\psi(x_{1},\xi_{1})\|.
2:  for t=1,\ldots,T do
3:     Update \eta_{t}=\frac{\kappa}{(w+\sum_{i=1}^{t}G_{i}^{2})^{1/3}}
4:     Update xt+1x_{t+1} via xt+1=xt(ηtdt).x_{t+1}=\mathcal{R}_{x_{t}}(-\eta_{t}d_{t}).
5:     Update at+1=cηt2a_{t+1}=c\eta_{t}^{2} and Gt+1=gradψ(xt+1,ξt+1)G_{t+1}=\|\operatorname{grad}\psi(x_{t+1},\xi_{t+1})\|.
6:     Compute dt+1=gradψ(xt+1;ξt+1)+(1at+1)𝒯xtxt+1(dtgradψ(xt;ξt))d_{t+1}=\operatorname{grad}\psi(x_{t+1};\xi_{t+1})+(1-a_{t+1})\mathcal{T}_{x_{t}}^{x_{t+1}}(d_{t}-\operatorname{grad}\psi(x_{t};\xi_{t})).
7:  end for
8:  Output: Return x^\hat{x} uniformly at random from x1,,xTx_{1},\cdots,x_{T}.
Algorithm 4 Riemannian stochastic recursive momentum gradient descent method: RStorm(ψ,L,T,x1)\text{RStorm}(\psi,L,T,x_{1})
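The following Python sketch mirrors Algorithm 4; the callables sample, stoch_grad, retraction, proj_tangent, and transport are placeholders assumed to be supplied by the user (for instance, the Stiefel retraction and projection from the earlier sketches), with stoch_grad(X, xi) returning the Euclidean stochastic gradient \nabla\psi(X,\xi).

```python
import numpy as np

def rstorm(stoch_grad, sample, T, X0, retraction, proj_tangent, transport,
           kappa, w, c, seed=0):
    """A sketch of RStorm (Algorithm 4): recursive-momentum gradient estimates d_t,
    adaptive step sizes eta_t, and a uniformly sampled output among x_1, ..., x_T."""
    rng = np.random.default_rng(seed)
    X = X0
    xi = sample()
    g = proj_tangent(X, stoch_grad(X, xi))          # grad psi(x_1, xi_1)
    d = g                                           # d_1
    G_sq = np.linalg.norm(g) ** 2                   # running sum of G_i^2
    iterates = [X]
    for t in range(1, T + 1):
        eta = kappa / (w + G_sq) ** (1.0 / 3.0)     # eta_t
        X_new = retraction(X, -eta * d)             # x_{t+1} = R_{x_t}(-eta_t d_t)
        a = c * eta ** 2                            # a_{t+1}
        xi_new = sample()
        g_new = proj_tangent(X_new, stoch_grad(X_new, xi_new))   # grad psi(x_{t+1}, xi_{t+1})
        # d_{t+1} = grad psi(x_{t+1}; xi_{t+1}) + (1 - a_{t+1}) T_{x_t}^{x_{t+1}}(d_t - grad psi(x_t; xi_t))
        d = g_new + (1 - a) * transport(X, X_new, d - g)
        G_sq += np.linalg.norm(g_new) ** 2
        X, g = X_new, g_new
        iterates.append(X)
    return iterates[rng.integers(T)]                # uniform over x_1, ..., x_T
```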
Lemma 4.2.

Suppose that the conditions (4.13)-(4.15) hold. Let us denote G=\max_{x\in\mathcal{M}}\|\nabla\psi(x,\xi)\|. For any b>0, we write \kappa=\frac{bG^{\frac{2}{3}}}{L}. Set c=10L^{2}+G^{2}/\left(7L\kappa^{3}\right) and w=\max\left((4L\kappa)^{3},2G^{2},\left(\frac{c\kappa}{4L}\right)^{3}\right). Then, the output \hat{x} of Algorithm 4 satisfies

(4.16) 𝔼[gradψ(x^)]=𝔼[1Tt=1Tgradψ(xt)]w1/62M+2M3/4T+2δ1/3T1/3,\mathbb{E}[\|\operatorname{grad}\psi(\hat{x})\|]=\mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T}\left\|\operatorname{grad}\psi\left(x_{t}\right)\right\|\right]\leq\frac{w^{1/6}\sqrt{2M}+2M^{3/4}}{\sqrt{T}}+\frac{2\delta^{1/3}}{T^{1/3}},

where M=6κ(ψ(x1)ψmin)+w1/3δ22L2κ2+κ2c2L2ln(T+2)M=\frac{6}{\kappa}\left(\psi\left(x_{1}\right)-\psi_{\min}\right)+\frac{w^{1/3}\delta^{2}}{2L^{2}\kappa^{2}}+\frac{\kappa^{2}c^{2}}{L^{2}}\ln(T+2).
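The parameter choices above can be transcribed directly; the following small helper (ours, purely for illustration) computes \kappa, c, and w from G, L, and the free scaling b as prescribed by Lemma 4.2.

```python
def rstorm_parameters(G, L, b=1.0):
    """Parameter choices of Lemma 4.2 (illustrative transcription)."""
    kappa = b * G ** (2.0 / 3.0) / L                        # kappa = b G^{2/3} / L
    c = 10.0 * L ** 2 + G ** 2 / (7.0 * L * kappa ** 3)     # c = 10 L^2 + G^2 / (7 L kappa^3)
    w = max((4.0 * L * kappa) ** 3,                         # w = max{(4 L kappa)^3, 2 G^2, (c kappa / (4L))^3}
            2.0 * G ** 2,
            (c * kappa / (4.0 * L)) ** 3)
    return kappa, c, w
```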

To apply Algorithm 4 to find xk+1x^{k+1}, we need to validate conditions (4.13)-(4.15) for the associated subproblem.

Lemma 4.3.

Suppose that Assumption 1 holds. Then, for ψk\psi_{k}, there exists finite GG such that ψk(x,ξ)G\|\nabla\psi_{k}(x,\xi)\|\leq G for all xx\in\mathcal{M}. Moreover, the function ψk(x,ξ)\psi_{k}(x,\xi) is retr-smooth with constant L^k\hat{L}_{k} and the Riemannian gradient gradψk(x,ξ)\operatorname{grad}\psi_{k}(x,\xi) is Lipschitz continuous with constant L¯k\bar{L}_{k}, where

(4.17) L^k:=α2f+α2𝒜σk+2Gβ,L¯k:=(αLp+ζ)G+αf+𝒜σk,\hat{L}_{k}:=\alpha^{2}\ell_{\nabla f}+\alpha^{2}\|\mathcal{A}\|\sigma_{k}+2G\beta,\;\bar{L}_{k}:=(\alpha L_{p}+\zeta)G+\alpha\ell_{\nabla f}+\|\mathcal{A}\|\sigma_{k},

and LpL_{p} and ζ\zeta are defined in Lemma 2.8. Moreover, under Assumption 2, conditions (4.13)-(4.15) with constants Lk:=max{L^k,L¯k}L_{k}:=\max\{\hat{L}_{k},\bar{L}_{k}\} and δ\delta hold for subproblem (4.9).

Proof 4.4.

We first show (4.17). By Proposition 2.5, one can show that \nabla\psi_{k} is Lipschitz continuous with constant \ell_{\nabla f}+\sigma_{k}\|\mathcal{A}\|^{2}. The retraction smoothness of \psi_{k}(x,\xi) then follows directly from Lemma 3.4, and the Lipschitz continuity of \operatorname{grad}\psi_{k}(x,\xi) follows from Lemma 2.8. These yield conditions (4.14) and (4.15). Under Assumption 2, condition (4.13) follows directly from the definition of \psi_{k}(x) and (4.7). We complete the proof.

As shown in Lemma 4.3, the Lipschitz constant L_{k} increases as the penalty parameter \sigma_{k} increases. Therefore, we analyze how the right-hand side of (4.16) depends on L as L tends to infinity. Noting that \kappa=\mathcal{O}(1/L), c=\mathcal{O}(L^{2}), w=\mathcal{O}(1), and M=\mathcal{O}(L), we can simplify (4.16) using \mathcal{O}-notation:

(4.18) 𝔼[gradψ(x^)]𝒪~(L1/2+L3/4T1/2+1T1/3),\mathbb{E}[\|\operatorname{grad}\psi(\hat{x})\|]\leq\tilde{\mathcal{O}}\left(\frac{L^{1/2}+L^{3/4}}{T^{1/2}}+\frac{1}{T^{1/3}}\right),

where \tilde{\mathcal{O}} hides a logarithmic dependence on T.

With Lemma 4.3, we are able to apply Algorithm 4 to find xk+1x^{k+1}. The lemma below gives the oracle complexity for the kk-th outer iteration of Algorithm 3. This is a direct application of Lemmas 4.2 and 4.3, and we omit the proof.

Lemma 4.5.

Suppose that Assumptions 1 and 2 hold. Let (x^{k},y^{k},z^{k}) be the k-th iterate generated by Algorithm 3. We can find x^{k+1} with Option I by the following call

(4.19) xk+1=RStorm(ψk,Lk,Tk,xk),x^{k+1}=\text{RStorm}(\psi_{k},L_{k},T_{k},x^{k}),

where Lk=max{L^k,L¯k},Tk:=𝒪~(1ϵk3+σk1.5+σkϵk2).L_{k}=\max\{\hat{L}_{k},\bar{L}_{k}\},~{}~{}T_{k}:=\tilde{\mathcal{O}}\left(\frac{1}{\epsilon_{k}^{3}}+\frac{\sigma_{k}^{1.5}+\sigma_{k}}{\epsilon_{k}^{2}}\right). Moreover, we can also find xk+1x^{k+1} with Option II by the following call

(4.20) xk+1=RStorm(ψk,Lk,2k,xk),x^{k+1}=\text{RStorm}(\psi_{k},L_{k},2^{k},x^{k}),

which implies that

(4.21) 𝔼ξ[gradψk(xk+1,ξ)]𝒪~(σk3/4+σk1/22k/2+12k/3).\mathbb{E}_{\xi}\left[\|\operatorname{grad}\psi_{k}(x^{k+1},\xi)\|\right]\leq\tilde{\mathcal{O}}\left(\frac{\sigma_{k}^{3/4}+\sigma_{k}^{1/2}}{2^{k/2}}+\frac{1}{2^{k/3}}\right).
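As a concrete illustration of the schedules in Lemma 4.5, the following sketch (ours, not the authors' code) returns the penalty parameter and the RStorm iteration budget for the k-th outer iteration under the two options; the constants and logarithmic factors hidden in \tilde{\mathcal{O}}(\cdot) are set to one purely for illustration.

```python
import math

def stomanial_schedule(k, option="I", b=2.0):
    """Penalty/budget schedule for the k-th call to RStorm (illustrative only)."""
    if option == "I":
        sigma_k = b ** k                          # sigma_k = b^k
        eps_k = 1.0 / sigma_k                     # eps_k = 1 / sigma_k
        # T_k = O~(1/eps_k^3 + (sigma_k^{1.5} + sigma_k)/eps_k^2), hidden constants dropped
        T_k = math.ceil(1.0 / eps_k ** 3 + (sigma_k ** 1.5 + sigma_k) / eps_k ** 2)
    else:                                         # Option II: fixed, doubling budget
        sigma_k = 2.0 ** (2.0 * k / 7.0)          # sigma_k = 2^{2k/7}
        eps_k = None                              # no explicit tolerance
        T_k = 2 ** k                              # T_k = 2^k inner iterations
    return sigma_k, eps_k, T_k
```

The outer loop of Algorithm 3 then computes x^{k+1}=\text{RStorm}(\psi_{k},L_{k},T_{k},x^{k}) with L_{k} as in Lemma 4.3, followed by the y- and z-updates, which we do not reproduce here.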

4.3 Convergence analysis of StoManIAL

We have shown the oracle complexity for solving the subproblems of StoManIAL. Now, let us first investigate the outer iteration complexity and conclude this subsection with the overall oracle complexity.

4.3.1 Outer iteration complexity of StoManIAL

Without specifying a subroutine to obtain x^{k+1}, we establish the outer iteration complexity of Algorithm 3 in the following theorem. Since the proof is similar to that of Theorem 3.8, we omit it.

Theorem 4.6 (Iteration complexity of StoManIAL with Option I).

Suppose that Assumptions 1 and 2 hold. For b>1b>1, if σk=bk\sigma_{k}=b^{k} and ϵk=1/σk\epsilon_{k}=1/\sigma_{k}, given ϵ>0\epsilon>0, then Algorithm 3 needs at most K:=logb(h+zmax+1ϵ)K:=\log_{b}\left(\frac{\ell_{h}+\|z_{\max}\|+1}{\epsilon}\right) iterations to produce an ϵ\epsilon-KKT solution pair (xK+1,yK+1,z¯K+1)(x^{K+1},y^{K+1},\bar{z}^{K+1}) of (3.4), where

(4.22) z¯k+1:=zkσk(𝒜xk+1yk+1),k0,zmax:=β0π26𝒜x0y0.\bar{z}^{k+1}:=z^{k}-\sigma_{k}(\mathcal{A}x^{k+1}-y^{k+1}),\forall k\geq 0,~{}~{}z_{\max}:=\frac{\beta_{0}\pi^{2}}{6}\|\mathcal{A}x^{0}-y^{0}\|.

Analogous to Theorem 4.6, we also have the following theorem on the iteration complexity of StoManIAL with a prescribed inner iteration budget.

Theorem 4.7 (Iteration complexity of StoManIAL with Option II).

Suppose that Assumptions 1 and 2 hold. Let \sigma_{k}=2^{2k/7}. Given \epsilon>0, StoManIAL with Option II needs at most \mathcal{O}(\log_{2}(\epsilon^{-7/2})) iterations to produce an \epsilon-KKT point pair (x^{K+1},y^{K+1},\bar{z}^{K+1}), where \bar{z}^{K+1} is defined in (4.22).

Proof 4.8.

Similar to Theorem 3.8, for any kk\in\mathbb{N}, we have that

(4.23) dist(z¯k+1,h(yk+1))\displaystyle\mathrm{dist}(-\bar{z}^{k+1},\partial h(y^{k+1})) =0,\displaystyle=0,
𝒜xk+1yk+1\displaystyle\|\mathcal{A}x^{k+1}-y^{k+1}\| h+zmaxσk.\displaystyle\leq\frac{\ell_{h}+\|z_{\max}\|}{\sigma_{k}}.

Moreover, it follows from Lemma 4.5 that

(4.24) 𝔼ξ[𝒫Txk+1(f(xk+1,ξ)+𝒜z¯k+1)]\displaystyle\mathbb{E}_{\xi}\left[\left\|\mathcal{P}_{T_{x^{k+1}}\mathcal{M}}\left(\nabla f(x^{k+1},\xi)+\mathcal{A}^{*}\bar{z}^{k+1}\right)\right\|\right] 𝒪(σk3/4+σk1/22k/2+12k/3).\displaystyle\leq\mathcal{O}\left(\frac{\sigma_{k}^{3/4}+\sigma_{k}^{1/2}}{2^{k/2}}+\frac{1}{2^{k/3}}\right).

Since σk=22k7\sigma_{k}=2^{\frac{2k}{7}}, we have that

(4.25) 𝔼ξ[𝒫Txk+1(f(xk+1,ξ)+𝒜z¯k+1)]𝒪(22k/7+25k/14+2k/3),\displaystyle\mathbb{E}_{\xi}\left[\left\|\mathcal{P}_{T_{x^{k+1}}\mathcal{M}}\left(\nabla f(x^{k+1},\xi)+\mathcal{A}^{*}\bar{z}^{k+1}\right)\right\|\right]\leq\mathcal{O}(2^{-2k/7}+2^{-5k/14}+2^{-k/3}),

and

𝒜xk+1yk+1𝒪(22k/7).\|\mathcal{A}x^{k+1}-y^{k+1}\|\leq\mathcal{O}(2^{-2k/7}).

Letting K=\mathcal{O}(\log_{2}(\epsilon^{-\frac{7}{2}})) so that 2^{-2K/7}=\mathcal{O}(\epsilon), one can show that

(4.26) {𝔼ξ[𝒫TxK+1(f(xK+1,ξ)+𝒜z¯K+1)]𝒪(ϵ),dist(z¯K+1,h(yK+1))=0,𝒜xK+1yK+1𝒪(ϵ),\left\{\begin{aligned} \mathbb{E}_{\xi}\left[\left\|\mathcal{P}_{T_{x^{K+1}}\mathcal{M}}\left(\nabla f(x^{K+1},\xi)+\mathcal{A}^{*}\bar{z}^{K+1}\right)\right\|\right]&\leq\mathcal{O}(\epsilon),\\ \mathrm{dist}(-\bar{z}^{K+1},\partial h(y^{K+1}))&=0,\\ \|\mathcal{A}x^{K+1}-y^{K+1}\|&\leq\mathcal{O}(\epsilon),\end{aligned}\right.

which implies that (xK+1,yK+1,z¯K+1)(x^{K+1},y^{K+1},\bar{z}^{K+1}) is an ϵ\epsilon-KKT point pair.

4.3.2 Overall oracle complexity of StoManIAL

To measure the oracle complexity of our algorithm, we give the definition of a stochastic first-order oracle for (4.1).

Definition 4.9 (stochastic first-order oracle).

For problem (4.1), a stochastic first-order oracle is defined as follows: compute the Euclidean gradient \nabla f(x,\xi) for a given sample \xi\in\mathcal{D}, evaluate the proximal operator \operatorname{prox}_{h}, and evaluate the retraction operator \mathcal{R}.
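To make the accounting used in Theorem 4.10 explicit, one may view the oracle as a thin wrapper that counts the three operations in Definition 4.9; the interface below is a hypothetical sketch of ours, not part of the paper.

```python
class StochasticFirstOrderOracle:
    """Counts the operations of Definition 4.9 (illustrative interface)."""

    def __init__(self, grad_f_xi, prox_h, retraction):
        self._grad_f_xi, self._prox_h, self._retr = grad_f_xi, prox_h, retraction
        self.calls = 0

    def grad(self, x, xi):          # Euclidean gradient for one sample xi
        self.calls += 1
        return self._grad_f_xi(x, xi)

    def prox(self, x):              # proximal operator of h
        self.calls += 1
        return self._prox_h(x)

    def retract(self, x, v):        # retraction R_x(v)
        self.calls += 1
        return self._retr(x, v)
```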

We are now able to establish the overall oracle complexity in the following theorem.

Theorem 4.10 (Oracle complexity of StoManIAL).

Suppose that Assumptions 1 and 2 hold. We choose Algorithm 4 as the subroutine of StoManIAL. The following holds:

  • (a)

    If we set σk=bk\sigma_{k}=b^{k} for some b>1b>1 and ϵk=1/σk\epsilon_{k}=1/\sigma_{k}, then StoManIAL with Option I finds an ϵ\epsilon-KKT point, after at most 𝒪~(ϵ3.5)\tilde{\mathcal{O}}(\epsilon^{-3.5}) calls to the stochastic first-order oracle.

  • (b)

If we set \sigma_{k}=2^{2k/7}, then StoManIAL with Option II finds an \epsilon-KKT point after at most \tilde{\mathcal{O}}(\epsilon^{-3.5}) calls to the stochastic first-order oracle.

Proof 4.11.

Let K denote the number of (outer) iterations of Algorithm 3 needed to reach the accuracy \epsilon. Let us show (a) first. It follows from Theorem 4.6 that K=\log_{b}\left(\frac{\ell_{h}+\|z_{\max}\|}{\sigma_{0}\epsilon}\right)+1. Combining this with Lemma 4.5, one can bound the number T of calls to the stochastic first-order oracle:

(4.27) T=\sum_{k=1}^{K}\tilde{\mathcal{O}}\left(\frac{1}{\epsilon_{k}^{3}}+\frac{\sigma_{k}^{1.5}+\sigma_{k}}{\epsilon_{k}^{2}}\right)=\sum_{k=1}^{K}\tilde{\mathcal{O}}\left(\frac{1}{\epsilon_{k}^{3}}+\frac{1}{\epsilon_{k}^{3.5}}\right)
\displaystyle\leq K𝒪~(1ϵ3+1ϵ3.5)=𝒪~(logb(ϵ1)ϵ3.5).\displaystyle K\tilde{\mathcal{O}}\left(\frac{1}{\epsilon^{3}}+\frac{1}{\epsilon^{3.5}}\right)=\tilde{\mathcal{O}}(\log_{b}(\epsilon^{-1})\epsilon^{-3.5}).

In terms of (b), it follows from Theorem 4.7 and Lemma 4.5 that

(4.28) T=\sum_{k=1}^{K}l_{k}=\sum_{k=1}^{K}2^{k}=2^{K+1}-2\leq 2\cdot 2^{K}=\mathcal{O}\left(2^{\log_{2}(\epsilon^{-7/2})}\right)=\tilde{\mathcal{O}}(\epsilon^{-3.5}).

The proof is completed.

5 Numerical results

In this section, we demonstrate the performance of the proposed methods ManIAL and StoManIAL on sparse principal component analysis and sparse canonical correlation analysis. We use ManIAL-I and ManIAL-II to denote ManIAL with Option I and Option II, respectively. Since the stopping rule of StoManIAL with Option I is not checkable in practice, we only report StoManIAL with Option II. We compare these algorithms with the Riemannian subgradient method (denoted Rsub) in [30], which achieves the previously best-known oracle complexity for solving (4.1). All the tests were performed in MATLAB 2022a on a ThinkPad X1 with 4 cores and 32GB memory. The following relative KKT residual of problem (1.1) is used as the stopping criterion for our ManIAL:

(5.1) error:=max{ηp,ηd,ηC}tol,\text{error}:=\max\left\{\eta_{p},\eta_{d},\eta_{C}\right\}\leq\mbox{tol},

where “tol” is a given accuracy tolerance and

(5.2) \eta_{p}:=\frac{\|\mathcal{A}x^{k}-y^{k}\|}{1+\|\mathcal{A}x^{k}\|+\|y^{k}\|},\quad
\eta_{d}:=\frac{\|\mathcal{P}_{T_{x^{k}}\mathcal{M}}(\nabla f(x^{k})-\mathcal{A}^{*}z^{k})\|}{1+\|\nabla f(x^{k})\|},\quad
\eta_{C}:=\frac{\|z^{k}-\operatorname{prox}_{h^{*}}(z^{k}-\mathcal{A}x^{k})\|}{1+\|z^{k}\|}.

In the following experiments, we will first run our algorithms ManIAL-I and ManIAL-II, and terminate them when either condition (5.1) is satisfied or the maximum iteration steps of 10,000 are reached. The obtained function value of ManIAL-I is denoted as FMF_{M}. For the other algorithms, we terminate them when either the objective function value satisfies F(xk)FM+1010F(x^{k})\leq F_{M}+10^{-10} or the maximum iteration steps of 10,000 are reached.
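For reference, the stopping test (5.1)-(5.2) can be evaluated as in the following sketch (our own illustration, with all problem-specific operators passed in as callables); for instance, when h=\mu\|\cdot\|_{1}, the Moreau identity gives \operatorname{prox}_{h^{*}}(u)=\min(\max(u,-\mu),\mu) entrywise.

```python
import numpy as np

def relative_kkt_residual(x, y, z, A, A_adj, grad_f, proj_tangent, prox_h_star):
    """Relative KKT residual (5.1)-(5.2); all operators are supplied as callables."""
    Ax = A(x)
    eta_p = np.linalg.norm(Ax - y) / (1.0 + np.linalg.norm(Ax) + np.linalg.norm(y))
    gf = grad_f(x)
    eta_d = np.linalg.norm(proj_tangent(x, gf - A_adj(z))) / (1.0 + np.linalg.norm(gf))
    eta_c = np.linalg.norm(z - prox_h_star(z - Ax)) / (1.0 + np.linalg.norm(z))
    return max(eta_p, eta_d, eta_c)
```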

5.1 Sparse principal component analysis

Given a data set {b1,,bm}\{b_{1},\ldots,b_{m}\} where bin×1b_{i}\in\mathbb{R}^{n\times 1}, the sparse principal component analysis (SPCA) problem is

(5.3) minXn×ri=1mbiXXTbi22+μX1,s.t. XTX=Ir,\displaystyle\min_{X\in\mathbb{R}^{n\times r}}\sum_{i=1}^{m}\|b_{i}-XX^{T}b_{i}\|_{2}^{2}+\mu\|X\|_{1},~{}~{}\mbox{s.t. }~{}~{}X^{T}X=I_{r},

where \mu>0 is a regularization parameter. Let B=[b_{1},\cdots,b_{m}]^{T}\in\mathbb{R}^{m\times n}; then problem (5.3) can be rewritten as:

(5.4) minXn×rtr(XTBTBX)+μX1,s.t. XTX=Ir.\displaystyle\min_{X\in\mathbb{R}^{n\times r}}-\mathrm{tr}(X^{T}B^{T}BX)+\mu\|X\|_{1},~{}~{}\mbox{s.t. }~{}~{}X^{T}X=I_{r}.

Here, the feasible set is the Stiefel manifold \texttt{St}(n,r):=\{X\in\mathbb{R}^{n\times r}~{}:~{}X^{\top}X=I_{r}\}. The tangent space of \texttt{St}(n,r) at X is T_{X}\texttt{St}(n,r)=\{\eta\in\mathbb{R}^{n\times r}~{}:~{}X^{\top}\eta+\eta^{\top}X=0\}. Given any U\in\mathbb{R}^{n\times r}, the projection of U onto T_{X}\texttt{St}(n,r) is \mathcal{P}_{T_{X}\texttt{St}(n,r)}(U)=U-X\frac{U^{\top}X+X^{\top}U}{2} [1]. In our experiments, the data matrix B\in\mathbb{R}^{m\times n} is generated by the MATLAB function \texttt{randn}(m,n), so all entries of B follow the standard Gaussian distribution. We then shift the columns of B to have zero mean and normalize them. We use the polar decomposition as the retraction mapping.
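For the Stiefel manifold and the \ell_{1} term in (5.4), the ingredients mentioned above (tangent-space projection, polar retraction, and the proximal operator of \mu\|\cdot\|_{1}) admit the following standard implementations; this is a minimal sketch of ours, not the experimental code.

```python
import numpy as np

def proj_tangent_stiefel(X, U):
    """Projection of U onto T_X St(n, r)."""
    return U - X @ (X.T @ U + U.T @ X) / 2.0

def polar_retraction(X, xi):
    """Polar retraction R_X(xi): the orthogonal factor of X + xi."""
    U, _, Vt = np.linalg.svd(X + xi, full_matrices=False)
    return U @ Vt

def prox_l1(X, t):
    """Proximal operator of t*||.||_1 (entrywise soft-thresholding)."""
    return np.sign(X) * np.maximum(np.abs(X) - t, 0.0)
```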

Figure 1: The performance of our algorithms on SPCA with random data (n=1000). Panels (a)–(c): r=10 with \mu=0.4,0.6,0.8; panels (d)–(f): r=20 with \mu=0.4,0.6,0.8.

We first compare the performance of the proposed algorithms on the SPCA problem with random data, where we set m=5000. For StoManIAL, we partition the samples into 100 subsets, and in each iteration we sample one subset. The tolerance is set to \mathrm{tol}=10^{-8}\times n\times r. Figure 1 presents the results of the four algorithms for fixed n=1000 and varying r=10,20 and \mu=0.4,0.6,0.8. The horizontal axis represents CPU time, while the vertical axis represents the objective function value gap F(x^{k})-F_{M}, where F_{M} is given by ManIAL-I. The results indicate that StoManIAL outperforms the deterministic versions. Moreover, the deterministic versions ManIAL-I and ManIAL-II have similar performance. In conclusion, compared with Rsub, our algorithms achieve better performance.

Next, we conduct experiments on two real datasets: coil100 [36] and mnist [28]. The coil100 dataset contains 7,200 RGB images of 100 objects taken from different angles, and the mnist dataset contains 8,000 grayscale digit images of size 28\times 28. Our experiments, illustrated in Figures 2 and 3, demonstrate that our algorithms outperform Rsub.

Figure 2: The performance of our algorithms on SPCA with mnist data (n=784). Panels (a)–(c): r=1 with \mu=0.1,0.2,0.3; panels (d)–(f): r=2 with \mu=0.1,0.2,0.3.
Figure 3: The performance of our algorithms on SPCA with coil100 data (n=1024). Panels (a)–(c): r=1 with \mu=0.1,0.2,0.3; panels (d)–(f): r=2 with \mu=0.1,0.2,0.3.

5.2 Sparse canonical correlation analysis

Canonical correlation analysis (CCA), first proposed by Hotelling [20], aims to identify associations between two sets of variables. The sparse CCA (SCCA) model improves the interpretability of the canonical variables by restricting the linear combinations to a subset of the original variables. Suppose that there are two data sets, X\in\mathbb{R}^{n\times p} containing p variables and Y\in\mathbb{R}^{n\times q} containing q variables, both obtained from n observations. Let \hat{\Sigma}_{xx}=\frac{1}{n}X^{\top}X and \hat{\Sigma}_{yy}=\frac{1}{n}Y^{\top}Y be the sample covariance matrices of X and Y, respectively, and let \hat{\Sigma}_{xy}=\frac{1}{n}X^{\top}Y be the sample cross-covariance matrix. Then the SCCA model can be formulated as

(5.5) minUp×r,Vq×rtr(UΣ^xyV)+μ1U1+μ2V1,s.t.UΣ^xxU=VΣ^yyV=Ir.\min_{U\in\mathbb{R}^{p\times r},V\in\mathbb{R}^{q\times r}}-\mathrm{tr}(U^{\top}\hat{\Sigma}_{xy}V)+\mu_{1}\|U\|_{1}+\mu_{2}\|V\|_{1},\;\mathrm{s.t.}\;U^{\top}\hat{\Sigma}_{xx}U=V^{\top}\hat{\Sigma}_{yy}V=I_{r}.

Let us denote \mathcal{M}_{1}=\{U\in\mathbb{R}^{p\times r}~{}:~{}U^{\top}\hat{\Sigma}_{xx}U=I_{r}\} and \mathcal{M}_{2}=\{V\in\mathbb{R}^{q\times r}~{}:~{}V^{\top}\hat{\Sigma}_{yy}V=I_{r}\}. Without loss of generality, we assume that \hat{\Sigma}_{xx} and \hat{\Sigma}_{yy} are positive definite (otherwise, we can add a small multiple of the identity matrix). In this case, \mathcal{M}_{1} and \mathcal{M}_{2} are two generalized Stiefel manifolds, and problem (5.5) is a nonsmooth problem on the product manifold \mathcal{M}:=\mathcal{M}_{1}\times\mathcal{M}_{2}. Moreover, since \hat{\Sigma}_{xy}=\frac{1}{n}X^{\top}Y=\frac{1}{n}\sum_{i=1}^{n}x_{i}^{\top}y_{i}, where x_{i} and y_{i} denote the i-th rows of X and Y, respectively, we can rewrite (5.5) in the following finite-sum form:

(5.6) minUp×r,Vq×r1nintr(UxiyiV)+μ1U1+μ2V1,s.t.U1,V2.\min_{U\in\mathbb{R}^{p\times r},V\in\mathbb{R}^{q\times r}}-\frac{1}{n}\sum_{i}^{n}\mathrm{tr}(U^{\top}x_{i}^{\top}y_{i}V)+\mu_{1}\|U\|_{1}+\mu_{2}\|V\|_{1},\;\mathrm{s.t.}\;U\in\mathcal{M}_{1},V\in\mathcal{M}_{2}.

Therefore, we can apply our algorithms ManIAL and StoManIAL to solve (5.6). Owing to the poor performance of Rsub on the SPCA problem, we only report comparisons among our own algorithms. We compare the proposed algorithms on the SCCA problem with random data, where we set n=5000. For StoManIAL, we partition the samples into 100 subsets, and in each iteration we sample one subset. The tolerance is set to \mathrm{tol}=10^{-8}\times p\times r. Figure 4 presents the results for fixed p=q=200 and varying r=1,2 and \mu=0.2,0.3,0.4. The results show that all of our algorithms have similar performance.
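For the generalized Stiefel constraints in (5.6), one convenient feasibility-restoring map is R_{U}(\xi)=(U+\xi)\left[(U+\xi)^{\top}S(U+\xi)\right]^{-1/2} with S=\hat{\Sigma}_{xx} (and analogously with \hat{\Sigma}_{yy} for V); the sketch below is our own illustration, assuming S is symmetric positive definite (after the small regularization mentioned above, if needed).

```python
import numpy as np

def gen_stiefel_retraction(U, xi, S):
    """Map U + xi back onto {W : W^T S W = I_r}, for symmetric positive definite S."""
    W = U + xi
    M = W.T @ S @ W                                         # r x r symmetric positive definite
    evals, evecs = np.linalg.eigh(M)
    M_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T   # M^{-1/2}
    return W @ M_inv_sqrt
```

Here S plays the role of \hat{\Sigma}_{xx} (for U) or \hat{\Sigma}_{yy} (for V).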

Figure 4: The performance of our algorithms on SCCA with random data (n=200). Panels (a)–(c): r=10 with \mu=0.4,0.6,0.8; panels (d)–(f): r=20 with \mu=0.4,0.6,0.8.

6 Conclusions

This paper proposes two novel manifold inexact augmented Lagrangian methods, namely ManIAL and StoManIAL, with oracle complexity guarantees. To the best of our knowledge, these are the first complexity results for augmented Lagrangian methods for solving nonsmooth problems on Riemannian manifolds. We prove that ManIAL and StoManIAL achieve oracle complexities of \mathcal{O}(\epsilon^{-3}) and \tilde{\mathcal{O}}(\epsilon^{-3.5}), respectively. Numerical experiments on SPCA and SCCA problems demonstrate that our proposed methods outperform an existing method with the previously best-known complexity result.

References

  • [1] P.-A. Absil, R. Mahony, and R. Sepulchre, Optimization Algorithms on Matrix Manifolds, Princeton University Press, Princeton, NJ, 2008.
  • [2] N. Aybat and G. Iyengar, A first-order augmented Lagrangian method for compressed sensing, SIAM Journal on Optimization, 22 (2012), p. 429.
  • [3] A. Beck and I. Rosset, A dynamic smoothing technique for a class of nonsmooth optimization problems on manifolds, SIAM Journal on Optimization, 33 (2023), pp. 1473–1493, https://doi.org/10.1137/22M1489447.
  • [4] G. C. Bento, O. P. Ferreira, and J. G. Melo, Iteration-complexity of gradient, subgradient and proximal point methods on Riemannian manifolds, Journal of Optimization Theory and Applications, 173 (2017), pp. 548–562.
  • [5] A. Böhm and S. J. Wright, Variable smoothing for weakly convex composite functions, Journal of optimization theory and applications, 188 (2021), pp. 628–649.
  • [6] N. Boumal, An introduction to optimization on smooth manifolds, Cambridge University Press, Cambridge, England, 2023, https://doi.org/10.1017/9781009166164.
  • [7] N. Boumal, P.-A. Absil, and C. Cartis, Global rates of convergence for nonconvex optimization on manifolds, IMA Journal of Numerical Analysis, 39 (2019), pp. 1–33.
  • [8] S. Burer and R. D. Monteiro, A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization, Mathematical programming, 95 (2003), pp. 329–357.
  • [9] Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford, Accelerated methods for nonconvex optimization, SIAM Journal on Optimization, 28 (2018), pp. 1751–1772.
  • [10] S. Chen, S. Ma, A. Man-Cho So, and T. Zhang, Proximal gradient method for nonsmooth optimization over the Stiefel manifold, SIAM Journal on Optimization, 30 (2020), pp. 210–239.
  • [11] A. Cutkosky and F. Orabona, Momentum-based variance reduction in non-convex sgd, Advances in neural information processing systems, 32 (2019).
  • [12] K. Deng and Z. Peng, A manifold inexact augmented Lagrangian method for nonsmooth optimization on Riemannian submanifolds in Euclidean space, IMA Journal of Numerical Analysis, (2022), https://doi.org/10.1093/imanum/drac018.
  • [13] Z. Deng, K. Deng, J. Hu, and Z. Wen, An augmented Lagrangian primal-dual semismooth newton method for multi-block composite optimization, arXiv preprint arXiv:2312.01273, (2023).
  • [14] O. Ferreira and P. Oliveira, Subgradient algorithm on Riemannian manifolds, Journal of Optimization Theory and Applications, 97 (1998), pp. 93–104.
  • [15] O. P. Ferreira, M. S. Louzeiro, and L. Prudente, Iteration-complexity of the subgradient method on Riemannian manifolds with lower bounded curvature, Optimization, 68 (2019), pp. 713–729.
  • [16] S. Ghadimi and G. Lan, Accelerated gradient methods for nonconvex nonlinear and stochastic programming, Mathematical Programming, 156 (2016), pp. 59–99.
  • [17] A. Han and J. Gao, Riemannian stochastic recursive momentum method for non-convex optimization, arXiv preprint arXiv:2008.04555, (2020).
  • [18] S. Hosseini, W. Huang, and R. Yousefpour, Line search algorithms for locally Lipschitz functions on Riemannian manifolds, SIAM Journal on Optimization, 28 (2018), pp. 596–619.
  • [19] S. Hosseini and A. Uschmajew, A Riemannian gradient sampling algorithm for nonsmooth optimization on manifolds, SIAM Journal on Optimization, 27 (2017), pp. 173–189.
  • [20] H. Hotelling, Relations between two sets of variates, in Breakthroughs in statistics: methodology and distribution, Springer, 1992, pp. 162–190.
  • [21] J. Hu, X. Liu, Z.-W. Wen, and Y.-X. Yuan, A brief introduction to manifold optimization, Journal of the Operations Research Society of China, 8 (2020), pp. 199–248.
  • [22] W. Huang and K. Wei, Riemannian proximal gradient methods, Mathematical Programming, 194 (2022), pp. 371–413.
  • [23] W. Huang and K. Wei, An inexact Riemannian proximal gradient method, Computational Optimization and Applications, 85 (2023), pp. 1–32.
  • [24] B. Jiang, X. Meng, Z. Wen, and X. Chen, An exact penalty approach for optimization with nonnegative orthogonality constraints, Mathematical Programming, 198 (2023), pp. 855–897.
  • [25] I. T. Jolliffe, N. T. Trendafilov, and M. Uddin, A modified principal component technique based on the LASSO, Journal of computational and Graphical Statistics, 12 (2003), pp. 531–547.
  • [26] W. Kong, J. G. Melo, and R. D. Monteiro, Complexity of a quadratic penalty accelerated inexact proximal point method for solving linearly constrained nonconvex composite programs, SIAM Journal on Optimization, 29 (2019), pp. 2566–2593.
  • [27] G. Lan and R. D. Monteiro, Iteration-complexity of first-order augmented Lagrangian methods for convex programming, Mathematical Programming, 155 (2016), pp. 511–547.
  • [28] Y. LeCun, The mnist database of handwritten digits, http://yann. lecun. com/exdb/mnist/, (1998).
  • [29] J. Li, S. Ma, and T. Srivastava, A Riemannian ADMM, arXiv preprint arXiv:2211.02163, (2022).
  • [30] X. Li, S. Chen, Z. Deng, Q. Qu, Z. Zhu, and A. Man-Cho So, Weakly convex optimization over Stiefel manifold using Riemannian subgradient-type methods, SIAM Journal on Optimization, 31 (2021), pp. 1605–1634.
  • [31] X. Li, D. Sun, and K.-C. Toh, A highly efficient semismooth Newton augmented Lagrangian method for solving lasso problems, SIAM Journal on Optimization, 28 (2018), pp. 433–458.
  • [32] Z. Li, P.-Y. Chen, S. Liu, S. Lu, and Y. Xu, Rate-improved inexact augmented Lagrangian method for constrained nonconvex optimization, in International Conference on Artificial Intelligence and Statistics, PMLR, 2021, pp. 2170–2178.
  • [33] Z. Li, P.-Y. Chen, S. Liu, S. Lu, and Y. Xu, Stochastic inexact augmented Lagrangian method for nonconvex expectation constrained optimization, Computational Optimization and Applications, 87 (2024), pp. 117–147.
  • [34] Q. Lin, R. Ma, and Y. Xu, Inexact proximal-point penalty methods for constrained non-convex optimization, arXiv preprint arXiv:1908.11518, (2019).
  • [35] Z. Lu and Z. Zhou, Iteration-complexity of first-order augmented Lagrangian methods for convex conic programming, SIAM journal on optimization, 33 (2023), pp. 1159–1190.
  • [36] S. A. Nene, S. K. Nayar, H. Murase, et al., Columbia object image library (coil-20), (1996).
  • [37] Z. Peng, W. Wu, J. Hu, and K. Deng, Riemannian smoothing gradient type algorithms for nonsmooth optimization problem on compact Riemannian submanifold embedded in Euclidean space, Applied Mathematics & Optimization, 88 (2023), p. 85.
  • [38] M. F. Sahin, A. Eftekhari, A. Alacaoglu, F. L. Gómez, and V. Cevher, An inexact augmented Lagrangian framework for nonconvex optimization with nonlinear constraints, in Proceedings of NeurIPS 2019, 2019.
  • [39] H. Sato, Riemannian optimization and its applications, Springer Cham, Cham, 2021, https://doi.org/10.1007/978-3-030-62391-3.
  • [40] B. Wang, S. Ma, and L. Xue, Riemannian stochastic proximal gradient methods for nonsmooth optimization over the Stiefel manifold, Journal of machine learning research, 23 (2022), pp. 1–33.
  • [41] Y. Wang, K. Deng, H. Liu, and Z. Wen, A decomposition augmented Lagrangian method for low-rank semidefinite programming, SIAM Journal on Optimization, 33 (2023), pp. 1361–1390.
  • [42] Y. Xu, First-order methods for constrained convex programming based on linearized augmented Lagrangian function, INFORMS Journal on Optimization, 3 (2021), pp. 89–117.
  • [43] Y. Xu, Iteration complexity of inexact augmented Lagrangian methods for constrained convex programming, Mathematical Programming, 185 (2021), pp. 199–244.
  • [44] R. Zass and A. Shashua, Nonnegative sparse PCA, Advances in neural information processing systems, 19 (2006).
  • [45] J. Zhang, H. Lin, S. Jegelka, S. Sra, and A. Jadbabaie, Complexity of finding stationary points of nonconvex nonsmooth functions, in International Conference on Machine Learning, PMLR, 2020, pp. 11173–11182.
  • [46] Y. Zhou, C. Bao, C. Ding, and J. Zhu, A semismooth Newton based augmented Lagrangian method for nonsmooth optimization on matrix manifolds, Mathematical Programming, (2022), pp. 1–61.

Appendix A Technical lemmas

The next lemma is a technical observation and follows from Lemma 3 in [11].

Lemma A.1.

Suppose that the conditions (4.13)-(4.15) hold. Let us denote st=dtgradψ(xt)s_{t}=d_{t}-\operatorname{grad}\psi(x_{t}). Then we have

𝔼[(gradψ(xt,ξt)gradψ(xt))ηt11(1at)2st1]=0\displaystyle\mathbb{E}\left[\left(\operatorname{grad}\psi\left({x}_{t},\xi_{t}\right)-\operatorname{grad}\psi\left({x}_{t}\right)\right)\cdot\eta_{t-1}^{-1}\left(1-a_{t}\right)^{2}{s}_{t-1}\right]=0
𝔼[(gradψ(xt,ξt)gradψ(xt1,ξt)gradψ(xt)+gradψ(xt1))ηt11(1at)2st1]=0.\displaystyle\mathbb{E}\left[\left(\operatorname{grad}\psi\left({x}_{t},\xi_{t}\right)-\operatorname{grad}\psi\left({x}_{t-1},\xi_{t}\right)-\operatorname{grad}\psi\left({x}_{t}\right)+\operatorname{grad}\psi\left({x}_{t-1}\right)\right)\cdot\eta_{t-1}^{-1}\left(1-a_{t}\right)^{2}{s}_{t-1}\right]=0.

Lemma A.2.

Suppose that the conditions (4.13)-(4.15) hold and that \eta_{t}\leq\frac{1}{L} for all t in Algorithm 4. Then

𝔼[ψ(xt+1)ψ(xt)]𝔼[ηt/2gradψ(xt)2+ηt/2dtgradψ(xt)2].\mathbb{E}\left[\psi\left(x_{t+1}\right)-\psi\left(x_{t}\right)\right]\leq\mathbb{E}\left[-\eta_{t}/2\left\|\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}+\eta_{t}/2\left\|d_{t}-\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}\right].

Proof A.3.

By retr-smoothness of ψ\psi, we have

ψ(xt+1)\displaystyle\psi\left(x_{t+1}\right) ψ(xt)gradψ(xt),ηtdt+ηt2L2dt2\displaystyle\leq\psi\left(x_{t}\right)-\left\langle\operatorname{grad}\psi\left(x_{t}\right),\eta_{t}d_{t}\right\rangle+\frac{\eta_{t}^{2}L}{2}\left\|d_{t}\right\|^{2}
=ψ(xt)ηt2gradψ(xt)2ηt2dt2+ηt2dtgradψ(xt)2+Lηt22dt2\displaystyle=\psi\left(x_{t}\right)-\frac{\eta_{t}}{2}\left\|\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}-\frac{\eta_{t}}{2}\left\|d_{t}\right\|^{2}+\frac{\eta_{t}}{2}\left\|d_{t}-\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}+\frac{L\eta_{t}^{2}}{2}\left\|d_{t}\right\|^{2}
=ψ(xt)ηt2gradψ(xt)2+ηt2dtgradψ(xt)2(ηt2Lηt22)dt2\displaystyle=\psi\left(x_{t}\right)-\frac{\eta_{t}}{2}\left\|\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}+\frac{\eta_{t}}{2}\left\|d_{t}-\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}-\left(\frac{\eta_{t}}{2}-\frac{L\eta_{t}^{2}}{2}\right)\left\|d_{t}\right\|^{2}
ψ(xt)ηt2gradψ(xt)2+ηt2dtgradψ(xt)2,\displaystyle\leq\psi\left(x_{t}\right)-\frac{\eta_{t}}{2}\left\|\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}+\frac{\eta_{t}}{2}\left\|d_{t}-\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2},

where for the last inequality we choose ηt1L\eta_{t}\leq\frac{1}{L}.

The following technical lemma, which follows from Lemma 3 in [11], provides a recursive bound on the variance term \|s_{t}\|^{2}.

Lemma A.4.

Suppose that the conditions (4.13)-(4.15) hold. With the notation in Algorithm 4, let st=dtgradψ(xt)s_{t}=d_{t}-\operatorname{grad}\psi\left(x_{t}\right), we have

𝔼[st2]𝔼[2at2Gt2\displaystyle\mathbb{E}\left[\|s_{t}\|^{2}\right]\leq\mathbb{E}\left[2a_{t}^{2}G_{t}^{2}\right. +(1at)2(1+4L2ηt12)st12\displaystyle+(1-a_{t})^{2}(1+4L^{2}\eta_{t-1}^{2})\|s_{t-1}\|^{2}
+4(1at)2L2ηt12gradψ(xt1)2].\displaystyle+\left.4(1-a_{t})^{2}L^{2}\eta_{t-1}^{2}\|\operatorname{grad}\psi(x_{t-1})\|^{2}\right].

Proof A.5.

By the definition of s_{t} and the notation in Algorithm 4, we have s_{t}=d_{t}-\operatorname{grad}\psi\left(x_{t}\right)=\operatorname{grad}\psi\left(x_{t},\xi_{t}\right)+(1-a_{t})\mathcal{T}_{x_{t-1}}^{x_{t}}\left(d_{t-1}-\operatorname{grad}\psi\left(x_{t-1},\xi_{t}\right)\right)-\operatorname{grad}\psi\left(x_{t}\right). Hence, we can write

𝔼[st2]=𝔼[gradψ(xt,ξt)+(1at)𝒯xt1xt(dt1gradψ(xt1,ξt))gradψ(xt)2]\displaystyle\mathbb{E}{\left[\left\|s_{t}\right\|^{2}\right]=\mathbb{E}\left[\left\|\operatorname{grad}\psi\left(x_{t},\xi_{t}\right)+\left(1-a_{t}\right)\mathcal{T}_{x_{t-1}}^{x_{t}}\left(d_{t-1}-\operatorname{grad}\psi\left(x_{t-1},\xi_{t}\right)\right)-\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}\right]}
=𝔼[at(gradψ(xt,ξt)gradψ(xt))+(1at)(gradψ(xt,ξt)𝒯xt1xtgradψ(xt1,ξt))\displaystyle=\mathbb{E}\left[\|a_{t}(\operatorname{grad}\psi(x_{t},\xi_{t})-\operatorname{grad}\psi(x_{t}))+(1-a_{t})(\operatorname{grad}\psi(x_{t},\xi_{t})-\mathcal{T}_{x_{t-1}}^{x_{t}}\operatorname{grad}\psi(x_{t-1},\xi_{t}))\right.
+(1at)(𝒯xt1xtgradψ(xt1)gradψ(xt))+(1at)𝒯xt1xt(dt1gradψ(xt1))2]\displaystyle+(1-a_{t})(\mathcal{T}_{x_{t-1}}^{x_{t}}\operatorname{grad}\psi(x_{t-1})-\operatorname{grad}\psi(x_{t}))+\left.\left(1-a_{t}\right)\mathcal{T}_{x_{t-1}}^{x_{t}}\left(d_{t-1}-\operatorname{grad}\psi\left(x_{t-1}\right)\right)\|^{2}\right]
𝔼[2at2gradψ(xt,ξt)gradψ(xt)2+(1at)2st12+2(1at)2\displaystyle\leq\mathbb{E}\left[2a_{t}^{2}\left\|\operatorname{grad}\psi\left(x_{t},\xi_{t}\right)-\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}+\left(1-a_{t}\right)^{2}\left\|s_{t-1}\right\|^{2}+2(1-a_{t})^{2}\right.
gradψ(xt,ξt)𝒯xt1xtgradψ(xt1,ξt)gradψ(xt)+𝒯xt1xtgradψ(xt1)2]\displaystyle\left.\left\|\operatorname{grad}\psi\left(x_{t},\xi_{t}\right)-\mathcal{T}_{x_{t-1}}^{x_{t}}\operatorname{grad}\psi\left(x_{t-1},\xi_{t}\right)-\operatorname{grad}\psi\left(x_{t}\right)+\mathcal{T}_{x_{t-1}}^{x_{t}}\operatorname{grad}\psi\left(x_{t-1}\right)\right\|^{2}\right]
𝔼[2at2gradψ(xt,ξt)2+(1at)2st12\displaystyle\leq\mathbb{E}\left[2a_{t}^{2}\|\operatorname{grad}\psi\left(x_{t},\xi_{t}\right)\|^{2}+(1-a_{t})^{2}\|s_{t-1}\|^{2}\right.
+2(1at)2gradψ(xt,ξt)𝒯xt1xtgradψ(xt1,ξt)2]\displaystyle\left.+2(1-a_{t})^{2}\|\operatorname{grad}\psi\left(x_{t},\xi_{t}\right)-\mathcal{T}_{x_{t-1}}^{x_{t}}\operatorname{grad}\psi\left(x_{t-1},\xi_{t}\right)\|^{2}\right]
𝔼[2at2Gt2+(1at)2st12+2(1at)2L2ηt12dt12]\displaystyle\leq\mathbb{E}\left[2a_{t}^{2}G_{t}^{2}+(1-a_{t})^{2}\|s_{t-1}\|^{2}+2(1-a_{t})^{2}L^{2}\eta_{t-1}^{2}\|d_{t-1}\|^{2}\right]
𝔼[2at2Gt2+(1at)2st12+4(1at)2L2ηt12(st12+gradψ(xt1)2)]\displaystyle\leq\mathbb{E}\left[2a_{t}^{2}G_{t}^{2}+(1-a_{t})^{2}\|s_{t-1}\|^{2}+4(1-a_{t})^{2}L^{2}\eta_{t-1}^{2}(\|s_{t-1}\|^{2}+\|\operatorname{grad}\psi(x_{t-1})\|^{2})\right]
𝔼[2at2Gt2+(1at)2(1+4L2ηt12)st12+4(1at)2L2ηt12gradψ(xt1)2],\displaystyle\leq\mathbb{E}\left[2a_{t}^{2}G_{t}^{2}+(1-a_{t})^{2}(1+4L^{2}\eta_{t-1}^{2})\|s_{t-1}\|^{2}+4(1-a_{t})^{2}L^{2}\eta_{t-1}^{2}\|\operatorname{grad}\psi(x_{t-1})\|^{2}\right],

where the first inequality uses Lemma A.1 and \|x+y\|^{2}\leq 2\|x\|^{2}+2\|y\|^{2}, the second inequality follows from \mathbb{E}\|x-\mathbb{E}[x]\|^{2}\leq\mathbb{E}\|x\|^{2}, the third inequality uses the update rule of x_{t} together with (4.14), and the fourth inequality uses \|d_{t-1}\|^{2}\leq 2\|s_{t-1}\|^{2}+2\|\operatorname{grad}\psi(x_{t-1})\|^{2}.

Appendix B The proof of Lemma 4.2

Proof B.1.

First, since w(4Lκ)3w\geq(4L\kappa)^{3}, we have from the update formula of ηt\eta_{t} that ηtκw1/314L<1L\eta_{t}\leq\frac{\kappa}{w^{1/3}}\leq\frac{1}{4L}<\frac{1}{L}. It follows from Lemma A.2 that

(B.1) 𝔼[ψ(xt+1)ψ(xt)]𝔼[ηt/2gradψ(xt)2+ηt/2dtgradψ(xt)2].\mathbb{E}\left[\psi\left(x_{t+1}\right)-\psi\left(x_{t}\right)\right]\leq\mathbb{E}\left[-\eta_{t}/2\left\|\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}+\eta_{t}/2\left\|d_{t}-\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}\right].

Consider the potential function \Phi_{t}=\psi\left(x_{t}\right)+\frac{1}{12L^{2}\eta_{t-1}}\left\|s_{t}\right\|^{2}, with the convention \eta_{0}:=\kappa/w^{1/3}. Since a_{t+1}=c\eta_{t}^{2} and w\geq\left(\frac{c\kappa}{4L}\right)^{3}, we have that for all t

at+1cηtκw1/3cκ4Lw1/31.a_{t+1}\leq\frac{c\eta_{t}\kappa}{w^{1/3}}\leq\frac{c\kappa}{4Lw^{1/3}}\leq 1.

Then, we first consider ηt1st+12ηt11st2\eta_{t}^{-1}\left\|s_{t+1}\right\|^{2}-\eta_{t-1}^{-1}\left\|s_{t}\right\|^{2}. Using Lemma A.4, we obtain

𝔼[ηt1st+12ηt11st2]\displaystyle\mathbb{E}{\left[\eta_{t}^{-1}\left\|s_{t+1}\right\|^{2}-\eta_{t-1}^{-1}\left\|s_{t}\right\|^{2}\right]}
\displaystyle\leq 𝔼[2at+12ηtGt+12+(1at+1)2(1+4L2ηt2)st2ηt+4(1at+1)2L2ηtgradψ(xt)2\displaystyle\mathbb{E}\left[2\frac{a_{t+1}^{2}}{\eta_{t}}G_{t+1}^{2}+\frac{\left(1-a_{t+1}\right)^{2}\left(1+4L^{2}\eta_{t}^{2}\right)\left\|s_{t}\right\|^{2}}{\eta_{t}}+4\left(1-a_{t+1}\right)^{2}L^{2}\eta_{t}\left\|\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}\right.
st2ηt1]\displaystyle\left.-\frac{\left\|s_{t}\right\|^{2}}{\eta_{t-1}}\right]
\displaystyle\leq 𝔼[2at+12ηtGt+12At+(ηt1(1at+1)(1+4L2ηt2)ηt11)st2Bt+4L2ηtgradψ(xt)2Ct].\displaystyle\mathbb{E}[\underbrace{2\frac{a_{t+1}^{2}}{\eta_{t}}G_{t+1}^{2}}_{A_{t}}+\underbrace{\left(\eta_{t}^{-1}\left(1-a_{t+1}\right)\left(1+4L^{2}\eta_{t}^{2}\right)-\eta_{t-1}^{-1}\right)\left\|s_{t}\right\|^{2}}_{B_{t}}+\underbrace{4L^{2}\eta_{t}\left\|\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}}_{C_{t}}].

Let us focus on the terms of this expression individually. For the first term, AtA_{t}, observing that w2G2w\geq 2G^{2}\geq G2+Gt+12G^{2}+G_{t+1}^{2}, we have

t=1TAt\displaystyle\sum_{t=1}^{T}A_{t} =t=1T2c2ηt3Gt+12=t=1T2κ3c2Gt+12w+i=1tGi2\displaystyle=\sum_{t=1}^{T}2c^{2}\eta_{t}^{3}G_{t+1}^{2}=\sum_{t=1}^{T}\frac{2\kappa^{3}c^{2}G_{t+1}^{2}}{w+\sum_{i=1}^{t}G_{i}^{2}}
t=1T2κ3c2Gt+12G2+i=1t+1Gi22κ3c2ln(1+t=1T+1Gt2G2)\displaystyle\leq\sum_{t=1}^{T}\frac{2\kappa^{3}c^{2}G_{t+1}^{2}}{G^{2}+\sum_{i=1}^{t+1}G_{i}^{2}}\leq 2\kappa^{3}c^{2}\ln\left(1+\sum_{t=1}^{T+1}\frac{G_{t}^{2}}{G^{2}}\right)
2κ3c2ln(T+2),\displaystyle\leq 2\kappa^{3}c^{2}\ln(T+2),

where in the second to last inequality we used Lemma 4 in [11]. For the second term BtB_{t}, we have

Bt(ηt1ηt11+ηt1(4L2ηt2at+1))st2(ηt1ηt11+ηt(4L2c))st2.B_{t}\leq\left(\eta_{t}^{-1}-\eta_{t-1}^{-1}+\eta_{t}^{-1}\left(4L^{2}\eta_{t}^{2}-a_{t+1}\right)\right)\left\|s_{t}\right\|^{2}\leq\left(\eta_{t}^{-1}-\eta_{t-1}^{-1}+\eta_{t}\left(4L^{2}-c\right)\right)\left\|s_{t}\right\|^{2}.

Let us estimate 1ηt1ηt1\frac{1}{\eta_{t}}-\frac{1}{\eta_{t-1}}. Using the concavity of x1/3x^{1/3}, we have (x+y)1/3x1/3+yx2/3/3(x+y)^{1/3}\leq x^{1/3}+yx^{-2/3}/3. Therefore:

1ηt1ηt1\displaystyle\frac{1}{\eta_{t}}-\frac{1}{\eta_{t-1}} =1κ[(w+i=1tGi2)1/3(w+i=1t1Gi2)1/3]Gt23κ(w+i=1t1Gi2)2/3\displaystyle=\frac{1}{\kappa}\left[\left(w+\sum_{i=1}^{t}G_{i}^{2}\right)^{1/3}-\left(w+\sum_{i=1}^{t-1}G_{i}^{2}\right)^{1/3}\right]\leq\frac{G_{t}^{2}}{3\kappa\left(w+\sum_{i=1}^{t-1}G_{i}^{2}\right)^{2/3}}
Gt23κ(wG2+i=1tGi2)2/3Gt23κ(w/2+i=1tGi2)2/3\displaystyle\leq\frac{G_{t}^{2}}{3\kappa\left(w-G^{2}+\sum_{i=1}^{t}G_{i}^{2}\right)^{2/3}}\leq\frac{G_{t}^{2}}{3\kappa\left(w/2+\sum_{i=1}^{t}G_{i}^{2}\right)^{2/3}}
22/3Gt23κ(w+i=1tGi2)2/322/3G23κ3ηt222/3G212Lκ3ηtG27Lκ3ηt,\displaystyle\leq\frac{2^{2/3}G_{t}^{2}}{3\kappa\left(w+\sum_{i=1}^{t}G_{i}^{2}\right)^{2/3}}\leq\frac{2^{2/3}G^{2}}{3\kappa^{3}}\eta_{t}^{2}\leq\frac{2^{2/3}G^{2}}{12L\kappa^{3}}\eta_{t}\leq\frac{G^{2}}{7L\kappa^{3}}\eta_{t},

where we use ηt14L\eta_{t}\leq\frac{1}{4L}. Further, since c=10L2+G2/(7Lκ3)c=10L^{2}+G^{2}/\left(7L\kappa^{3}\right), we have

ηt(4L2c)6L2ηtG2ηt/(7Lκ3).\eta_{t}\left(4L^{2}-c\right)\leq-6L^{2}\eta_{t}-G^{2}\eta_{t}/\left(7L\kappa^{3}\right).

Thus, we obtain Bt6L2ηtst2B_{t}\leq-6L^{2}\eta_{t}\left\|s_{t}\right\|^{2}. Putting all this together yields:

(B.2) 112L2t=1T(st+12ηtst2ηt1)κ3c26L2ln(T+2)+t=1T[ηt3gradψ(xt)2ηt2st2].\frac{1}{12L^{2}}\sum_{t=1}^{T}\left(\frac{\left\|s_{t+1}\right\|^{2}}{\eta_{t}}-\frac{\left\|s_{t}\right\|^{2}}{\eta_{t-1}}\right)\leq\frac{\kappa^{3}c^{2}}{6L^{2}}\ln(T+2)+\sum_{t=1}^{T}\left[\frac{\eta_{t}}{3}\left\|\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}-\frac{\eta_{t}}{2}\left\|s_{t}\right\|^{2}\right].

Now, we are ready to analyze the potential Φt\Phi_{t}. It follows from (B.1) that

𝔼[Φt+1Φt]𝔼[ηt2gradψ(xt)2+ηt2st2+112L2(st+12ηtst2ηt1)].\mathbb{E}\left[\Phi_{t+1}-\Phi_{t}\right]\leq\mathbb{E}\left[-\frac{\eta_{t}}{2}\left\|\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}+\frac{\eta_{t}}{2}\left\|s_{t}\right\|^{2}+\frac{1}{12L^{2}}\left(\frac{\|s_{t+1}\|^{2}}{\eta_{t}}-\frac{\left\|s_{t}\right\|^{2}}{\eta_{t-1}}\right)\right].

Summing over tt and using (B.2), we obtain

𝔼[ΦT+1Φ1]\displaystyle\mathbb{E}\left[\Phi_{T+1}-\Phi_{1}\right] t=1T𝔼[ηt2gradψ(xt)2+ηt2st2+112L2(st+12ηtst2ηt1)]\displaystyle\leq\sum_{t=1}^{T}\mathbb{E}\left[-\frac{\eta_{t}}{2}\left\|\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}+\frac{\eta_{t}}{2}\left\|s_{t}\right\|^{2}+\frac{1}{12L^{2}}\left(\frac{\|s_{t+1}\|^{2}}{\eta_{t}}-\frac{\left\|s_{t}\right\|^{2}}{\eta_{t-1}}\right)\right]
𝔼[κ3c26L2ln(T+2)t=1Tηt6gradψ(xt)2].\displaystyle\leq\mathbb{E}\left[\frac{\kappa^{3}c^{2}}{6L^{2}}\ln(T+2)-\sum_{t=1}^{T}\frac{\eta_{t}}{6}\left\|\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}\right].

Reordering the terms, we have

\mathbb{E}\left[\sum_{t=1}^{T}\eta_{t}\left\|\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}\right] \leq\mathbb{E}\left[6\left(\Phi_{1}-\Phi_{T+1}\right)+\frac{\kappa^{3}c^{2}}{L^{2}}\ln(T+2)\right]
\leq 6\left(\psi\left(x_{1}\right)-\psi_{\min}\right)+\frac{\mathbb{E}\left[\left\|s_{1}\right\|^{2}\right]}{2L^{2}\eta_{0}}+\frac{\kappa^{3}c^{2}}{L^{2}}\ln(T+2)
\leq 6\left(\psi\left(x_{1}\right)-\psi_{\min}\right)+\frac{w^{1/3}\delta^{2}}{2L^{2}\kappa}+\frac{\kappa^{3}c^{2}}{L^{2}}\ln(T+2),

where the last inequality uses the definition of d_{1}, which gives \mathbb{E}[\|s_{1}\|^{2}]\leq\delta^{2}, together with \eta_{0}=\kappa/w^{1/3}. Now, we relate \mathbb{E}\left[\sum_{t=1}^{T}\eta_{t}\left\|\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}\right] to \mathbb{E}\left[\sum_{t=1}^{T}\left\|\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}\right]. First, since \eta_{t} is decreasing,

𝔼[t=1Tηtgradψ(xt)2]𝔼[ηTt=1Tgradψ(xt)2]\mathbb{E}\left[\sum_{t=1}^{T}\eta_{t}\left\|\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}\right]\geq\mathbb{E}\left[\eta_{T}\sum_{t=1}^{T}\left\|\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}\right]

Therefore, setting M=\frac{1}{\kappa}\left[6\left(\psi\left(x_{1}\right)-\psi_{\min}\right)+\frac{w^{1/3}\delta^{2}}{2L^{2}\kappa}+\frac{\kappa^{3}c^{2}}{L^{2}}\ln(T+2)\right], we get

(B.3) \mathbb{E}\left[\sqrt{\sum_{t=1}^{T}\left\|\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}}\right]^{2} \leq\mathbb{E}\left[\frac{6\left(\psi\left(x_{1}\right)-\psi_{\min}\right)+\frac{w^{1/3}\delta^{2}}{2L^{2}\kappa}+\frac{\kappa^{3}c^{2}}{L^{2}}\ln(T+2)}{\eta_{T}}\right]
=𝔼[κMηT]𝔼[M(w+t=1TGt2)1/3].\displaystyle=\mathbb{E}\left[\frac{\kappa M}{\eta_{T}}\right]\leq\mathbb{E}\left[M\left(w+\sum_{t=1}^{T}G_{t}^{2}\right)^{1/3}\right].

It follows from (4.13) that

(B.4) Gt2=gradψ(xt)+gradψ(xt,ξt)gradψ(xt)22gradψ(xt)2+2δ2.G_{t}^{2}=\left\|\operatorname{grad}\psi\left(x_{t}\right)+\operatorname{grad}\psi\left(x_{t},\xi_{t}\right)-\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}\leq 2\left\|\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}+2\delta^{2}.

Plugging this in (B.3) and using (a+b)1/3a1/3+b1/3(a+b)^{1/3}\leq a^{1/3}+b^{1/3} we obtain:

𝔼[t=1Tgradψ(xt)2]2\displaystyle\mathbb{E}\left[\sqrt{\sum_{t=1}^{T}\left\|\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}}\right]^{2}
\displaystyle\leq 𝔼[M(w+2t=1Tδ2)1/3+M(2t=1Tgradψ(xt)2)1/3]\displaystyle\mathbb{E}\left[M\left(w+2\sum_{t=1}^{T}\delta^{2}\right)^{1/3}+M\left(2\sum_{t=1}^{T}\left\|\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}\right)^{1/3}\right]
\displaystyle\leq M(w+2Tδ2)1/3+𝔼[21/3M(t=1Tgradψ(xt)2)2/3]\displaystyle M\left(w+2T\delta^{2}\right)^{1/3}+\mathbb{E}\left[2^{1/3}M\left(\sqrt{\sum_{t=1}^{T}\left\|\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}}\right)^{2/3}\right]
\displaystyle\leq M(w+2Tδ2)1/3+21/3M(𝔼[t=1Tgradψ(xt)2])2/3,\displaystyle M\left(w+2T\delta^{2}\right)^{1/3}+2^{1/3}M\left(\mathbb{E}\left[\sqrt{\sum_{t=1}^{T}\left\|\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}}\right]\right)^{2/3},

where we have used the concavity of xxax\mapsto x^{a} for all a1a\leq 1 to move expectations inside the exponents. Now, define X=t=1Tgradψ(xt)2X=\sqrt{\sum_{t=1}^{T}\left\|\operatorname{grad}\psi\left(x_{t}\right)\right\|^{2}}. Then the above can be rewritten as:

(𝔼[X])2M(w+2Tδ2)1/3+21/3M(𝔼[X])2/3.(\mathbb{E}[X])^{2}\leq M\left(w+2T\delta^{2}\right)^{1/3}+2^{1/3}M(\mathbb{E}[X])^{2/3}.

Note that this implies that either (\mathbb{E}[X])^{2}\leq 2M\left(w+2T\delta^{2}\right)^{1/3}, or (\mathbb{E}[X])^{2}\leq 2\cdot 2^{1/3}M(\mathbb{E}[X])^{2/3}. Solving for \mathbb{E}[X] in these two cases, we obtain

𝔼[X]2M(w+2Tδ2)1/6+2M3/4.\mathbb{E}[X]\leq\sqrt{2M}\left(w+2T\delta^{2}\right)^{1/6}+2M^{3/4}.

Finally, observe that by Cauchy-Schwarz we have t=1Tgradψ(xt)/TX/T\sum_{t=1}^{T}\left\|\operatorname{grad}\psi\left(x_{t}\right)\right\|/T\leq X/\sqrt{T} so that

𝔼[t=1Tgradψ(xt)T]\displaystyle\mathbb{E}\left[\sum_{t=1}^{T}\frac{\left\|\operatorname{grad}\psi\left(x_{t}\right)\right\|}{T}\right] 2M(w+2Tδ2)1/6+2M3/4T\displaystyle\leq\frac{\sqrt{2M}\left(w+2T\delta^{2}\right)^{1/6}+2M^{3/4}}{\sqrt{T}}
\leq\frac{w^{1/6}\sqrt{2M}+2M^{3/4}}{\sqrt{T}}+\frac{2\delta^{1/3}}{T^{1/3}},

where we used (a+b)^{1/6}\leq a^{1/6}+b^{1/6} in the last inequality.