
General Constrained Matrix Optimization

Casey Garner, Gilad Lerman, Shuzhong Zhang
School of Mathematics, University of Minnesota (garne214@umn.edu, lerman@umn.edu)
Department of Industrial and Systems Engineering, University of Minnesota (zhangs@umn.edu)
Abstract

This paper presents and analyzes the first matrix optimization model which allows general coordinate and spectral constraints. The breadth of problems our model covers is exemplified by a lengthy list of examples from the literature, including semidefinite programming, matrix completion, and quadratically constrained quadratic programs (QCQPs), and we demonstrate our model enables completely novel formulations of numerous problems. Our solution methodology leverages matrix factorization and constrained manifold optimization to develop an equivalent reformulation of our general matrix optimization model for which we design a feasible, first-order algorithm. We prove our algorithm converges to (\epsilon,\epsilon)-approximate first-order KKT points of our reformulation in \mathcal{O}(1/\epsilon^{2}) iterations. The method we develop applies to a special class of constrained manifold optimization problems and is one of the first to generate a sequence of feasible points which converges to a KKT point. We validate our model and method through numerical experimentation. Our first experiment presents a generalized version of semidefinite programming which allows novel eigenvalue constraints, and our second numerical experiment compares our method to the classical semidefinite relaxation approach for solving QCQPs. For the QCQP numerical experiments, we demonstrate our method is able to dominate the classical state-of-the-art approach, solving more than ten times as many problems as the standard solution procedure.

Keywords: spectrally constrained optimization · constrained manifold optimization · quadratically constrained quadratic programs · semidefinite programming · matrix completion · nonconvex optimization · iteration complexity

MSC codes: 90C26, 90C52, 65K10, 68W40

1 Introduction

The ubiquity of matrix optimization is unquestionable and the utility of the models is even more so. From matrix completion [10, 12], various forms of principal component analysis [26, 54], max-cut [23], robust subspace recovery [29, 44], covariance estimation [19], semidefinite programming [43], eigenvalue optimization [32, 34], Markov processes [9], and more, we see applications of minimizing and maximizing functions whose arguments are matrices subject to various forms of constraints. These constraints come in two varieties: coordinate constraints and spectral constraints. The coordinate constraints are the most readily visible and most prominently leveraged. The spectral constraints, which require the eigenvalues and/or singular values to satisfy some set of conditions, are present but to a lesser degree. One of the most common forms of spectral constraint is enforcing a matrix to be positive definite or semidefinite; however, this is a relatively tame condition when we consider the vast possibilities of restricting the spectrum.

The study of how to solve matrix optimization problems with general spectral constraints is still in its infancy. Our prior work [22] was the first model allowing general linear inequality constraints on the spectrum of a symmetric matrix. Shortly following our first paper, Ito and Lourenço published the second work on this problem. In [25], they utilized a lesser-known result of Gowda [24] to enable solving matrix optimization problems with general convex constraints on the spectrum; however, neither of these works addressed how to solve matrix optimization problems with both general coordinate and spectral constraints.

This paper presents the first model and algorithm which enable general constraints on both the coordinates and spectrum of a matrix. Namely, in these pages we develop and analyze a method to find approximate solutions to models of the form:

\min\; F(\bm{X})    (1)
s.t.\; \bm{G}(\bm{X})=\bm{0},
\bm{g}(\bm{\sigma}(\bm{X}))\leq\bm{0},
\bm{X}\in\mathbb{S},

where \mathbb{S} can be any one of many sets including \mathbb{R}^{m\times n}, \mathcal{S}^{n\times n}, and \mathbb{R}^{n_{1}\times\cdots\times n_{k}}, i.e., real m\times n matrices, real n\times n symmetric matrices, and real tensors, and \bm{\sigma}:\mathbb{S}\rightarrow\mathbb{R}^{p} maps the matrix or tensor \bm{X} to its ordered vector of singular values, eigenvalues, or generalized singular values. So, the method we develop extends beyond matrices to tensors, enabling a modeling power not yet leveraged in the literature.

Remark 1.1.

It is worth noting the only reason we choose to present (1) with equality constraints on the coordinates, i.e., \bm{G}(\bm{X})=\bm{0}, and not an additional set of inequality constraints, e.g., \bm{H}(\bm{X})\leq\bm{0}, is for ease of presentation through a reduction of notation and terms. The analysis to come admits additional inequality constraints with minor alterations to the assumptions, and our numerical experiments on quadratically constrained quadratic programs, which contain inequality constraints, further support this.

2 Motivating Examples

The generality of (1) is vast, covering multiple important applications and problem classes. To showcase this fact, let us investigate some examples and see how (1) enables new constraints on the spectrum which were previously impossible to impose, while also extending essential models in the literature.

2.1 Generalized Semidefinite Programs

The impact of semidefinite programming (SDP) is nearly impossible to overstate, though it took some time before the salience of the model was truly realized and established. Today, we have seen applications of SDP extend from control theory to economics and most places in-between [47, 43]. The standard form of a semidefinite program can be expressed as

\min\; \langle\bm{C}_{0},\bm{X}\rangle    (2)
s.t.\; \langle\bm{C}_{i},\bm{X}\rangle=b_{i},\;i=1,\ldots,p,
\bm{X}\succeq\bm{0},

where \bm{C}_{i}\in\mathcal{S}^{n\times n} for i=0,1,\ldots,p. By appropriately defining the functions in (1) and letting \mathbb{S}=\mathcal{S}^{n\times n}, we see the standard SDP model is a special instance of our general matrix model, making (1) a generalization of semidefinite programming. We can extend SDPs and define the following generalized SDP model:

\min\; \langle\bm{C}_{0},\bm{X}\rangle    (3)
s.t.\; \langle\bm{C}_{i},\bm{X}\rangle=b_{i},\;i=1,\ldots,p,
\bm{A}\bm{\lambda}(\bm{X})\leq\bm{b},
\bm{X}\in\mathcal{S}^{n\times n},

where \bm{\lambda}(\bm{X}):=(\lambda_{1}(\bm{X}),\ldots,\lambda_{n}(\bm{X}))^{\top} is the vector of eigenvalues of \bm{X} listed in descending order, i.e., \lambda_{i}(\bm{X})\geq\lambda_{i+1}(\bm{X}) for i=1,\ldots,n-1. So, letting \bm{b}=\bm{0} and \bm{A}=-\bm{I}, we see (3) becomes (2). Thus, our general matrix model enables entirely new classes of constraints on the eigenvalues not yet attempted, while extending an essential problem class. Clearly, (3) is no longer a convex formulation. For example, a constraint as simple as \lambda_{1}(\bm{X})\geq c_{1} and \lambda_{n-1}(\bm{X})\leq c_{n-1} with 0<c_{n-1}<c_{1} would be nonconvex. In the case when \bm{X} is a Laplacian matrix, the condition \lambda_{n-1}(\bm{X})\leq c_{n-1} relates to limiting the connectivity of a graph, which might well be desirable in certain design circumstances. Furthermore, the eigenvectors associated with the second smallest eigenvalue are known as Fiedler vectors, and they have important applications to combinatorial optimization problems; see [17] and the references therein.
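To make the linear spectral constraint in (3) concrete, the following minimal NumPy sketch (our own illustration, not the paper's implementation; the helper name gen_sdp_feasible is ours) checks whether a symmetric matrix satisfies both the linear coordinate constraints and the eigenvalue constraint \bm{A}\bm{\lambda}(\bm{X})\leq\bm{b}. Taking \bm{A}=-\bm{I} and \bm{b}=\bm{0} recovers the usual positive semidefiniteness test, as discussed above.

import numpy as np

def gen_sdp_feasible(X, C_list, b_lin, A, b_spec, tol=1e-8):
    """Feasibility check for the generalized SDP (3):
    <C_i, X> = b_i and A @ lambda(X) <= b_spec,
    with lambda(X) the eigenvalues of X in descending order."""
    # coordinate (linear equality) constraints <C_i, X> = b_i
    lin_ok = all(abs(np.sum(Ci * X) - bi) <= tol
                 for Ci, bi in zip(C_list, b_lin))
    # spectral constraint A @ lambda(X) <= b_spec
    lam = np.sort(np.linalg.eigvalsh(X))[::-1]   # eigenvalues, descending
    spec_ok = np.all(A @ lam <= b_spec + tol)
    return lin_ok and spec_ok

# Example: recover the standard SDP cone X >= 0 via A = -I, b_spec = 0.
n = 4
rng = np.random.default_rng(0)
M = rng.standard_normal((n, n))
X = M @ M.T                                      # a PSD test matrix
print(gen_sdp_feasible(X, [np.eye(n)], [np.trace(X)], -np.eye(n), np.zeros(n)))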

2.2 Matrix Preconditioning

The subject of matrix preconditioning often arises in practical applications [13, 45]. The topic stems from the desire to find a relatively simple invertible matrix to improve the stability and solvability of various numerical procedures, e.g., solving a system of linear equations or minimizing a quadratic form. One of many preconditioning models is to locate good feasible solutions to

\min\; \frac{1}{2}\|\bm{X}\bm{A}\bm{X}^{\top}-\bm{I}\|_{F}^{2}    (4)
s.t.\; \bm{X}\in\mathcal{C},

where \mathcal{C} enforces some structure upon the matrix \bm{X}. Assuming \bm{A}\in\mathcal{S}^{n\times n}, the pre- and post-multiplication by \bm{X} ensures symmetry is maintained. This model relates directly to (1) through the constraint set \mathcal{C}. For example, it would be reasonable to desire \bm{X} to be a well-conditioned block diagonal matrix. Writing this explicitly, we have the model

\min\; \frac{1}{2}\|\bm{X}\bm{A}\bm{X}^{\top}-\bm{I}\|_{F}^{2}    (5)
s.t.\; \bm{X}=\begin{pmatrix}\bm{X}_{1}& &\bm{0}\\ &\ddots& \\ \bm{0}& &\bm{X}_{p}\end{pmatrix}\in\mathcal{S}^{n\times n},
\lambda_{1}(\bm{X}_{i})-\kappa\lambda_{s}(\bm{X}_{i})\leq 0,
\lambda_{s}(\bm{X}_{i})\geq\delta,
\bm{X}_{i}\in\mathcal{S}^{s\times s},\;i=1,\ldots,p,

where \delta and \kappa are nonnegative and sp=n. Note, the eigenvalue constraints ensure the condition number of each block \bm{X}_{i} is bounded above by \kappa, so every matrix forming \bm{X} is well-conditioned. This single example demonstrates the versatility of (1) for formulating optimization models related to matrix preconditioning.
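The following short sketch (an illustration of ours, under the assumption that SciPy is available; the helper names precond_objective and blocks_feasible are hypothetical) evaluates the objective of (5) for a block-diagonal candidate and checks the per-block eigenvalue constraints described above.

import numpy as np
from scipy.linalg import block_diag

def precond_objective(blocks, A):
    """Objective of (5): 0.5 * ||X A X^T - I||_F^2 for block-diagonal X."""
    X = block_diag(*blocks)
    R = X @ A @ X.T - np.eye(A.shape[0])
    return 0.5 * np.linalg.norm(R, 'fro')**2

def blocks_feasible(blocks, kappa, delta):
    """Check lambda_max(X_i) <= kappa * lambda_min(X_i) and lambda_min(X_i) >= delta."""
    for Xi in blocks:
        lam = np.linalg.eigvalsh(Xi)             # ascending eigenvalues of the block
        if lam[0] < delta or lam[-1] > kappa * lam[0]:
            return False
    return True

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6)); A = 0.5 * (A + A.T)
blocks = [np.eye(3) + 0.1 * np.diag(rng.random(3)) for _ in range(2)]
print(precond_objective(blocks, A), blocks_feasible(blocks, kappa=2.0, delta=0.5))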

2.3 Inverse Eigenvalue and Singular Value Problems and Constrained PCA

A large body of literature exists on what are known as inverse eigenvalue and singular value problems [15, 16]. The general concern is to form a matrix from prescribed spectral information. For example, the entire spectrum of the matrix may be provided, or some subset of it. Then, the goal is to construct a matrix that has this structure, possibly subject to additional structural constraints. The prevalence of problems such as this is hard to overstate, as their applications include seismic tomography, remote sensing, geophysics, circuit theory, and more; see the aforementioned works by Chu for further discussion. These problems naturally relate to (1) as it models both coordinate and spectral constraints. For example, a simple instance of an inverse singular value problem is the following projection problem,

\min\; \frac{1}{2}\|\bm{X}-\bm{A}\|_{F}^{2}    (6)
s.t.\; \bm{X}\in\mathcal{C}_{1},
\bm{\sigma}(\bm{X})\in\mathcal{C}_{2},

where \bm{A}\in\mathbb{R}^{m\times n}, \bm{\sigma}(\bm{X}):=(\sigma_{1}(\bm{X}),\ldots,\sigma_{\min(m,n)}(\bm{X})) is the vector of singular values of \bm{X} in descending order, and \mathcal{C}_{1} and \mathcal{C}_{2} correspond to coordinate and spectral constraints respectively. So, (6) seeks to project \bm{A} onto general matrix constraints. Hence, if \bm{A} satisfies the constraints it is a solution to the inverse singular value problem described by \mathcal{C}_{1} and \mathcal{C}_{2}; otherwise the model seeks the closest matrix to \bm{A} which solves the inverse problem.

The relevance of this problem is further demonstrated by the fact (6) becomes the principal component analysis (PCA) model under the appropriate selection of the constraints. Letting \mathcal{C}_{1}=\mathbb{R}^{m\times n} and \mathcal{C}_{2}=\{\bm{z}\in\mathbb{R}^{\min(m,n)}_{\geq}\;|\;z_{1}>0,\;z_{2}=\cdots=z_{\min(m,n)}=0\}, the model becomes

\min\; \frac{1}{2}\|\bm{X}-\bm{A}\|_{F}^{2}
s.t.\; \text{rank}(\bm{X})=1,\;\bm{X}\in\mathbb{R}^{m\times n},

whose solution is the first principal component of \bm{A}. However, our model extends beyond this to enable constrained versions of PCA. For example, we have the following nonnegative PCA model,

\min\; \frac{1}{2}\|\bm{X}-\bm{A}\|_{F}^{2}
s.t.\; \text{rank}(\bm{X})=1,
\bm{X}\geq\bm{0},\;\bm{X}\in\mathbb{R}^{m\times n},

where \bm{X}\geq\bm{0} enforces X_{ij}\geq 0 for all i and j. We can also move beyond simple rank constraints to enforce far more general constraints on the singular values. For instance, we could constrain the top k singular values to account for between p_{1} and p_{2} percent of the total information of the matrix, that is,

\min\; \frac{1}{2}\|\bm{X}-\bm{A}\|_{F}^{2}    (7)
s.t.\; p_{1}\leq\frac{\sigma_{1}(\bm{X})+\cdots+\sigma_{k}(\bm{X})}{\sigma_{1}(\bm{X})+\cdots+\sigma_{\min(m,n)}(\bm{X})}\leq p_{2},
\bm{X}\geq\bm{0},\;\bm{X}\in\mathbb{R}^{m\times n},

where we note (7) is a nonconvex model due to the spectral constraint. Thus, (1) enables novel extensions of PCA which have not yet been investigated and may be of much interest for future work.
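As a small illustration of the unconstrained-in-coordinates case, the sketch below (ours, not from the paper; the helpers best_rank_one and energy_ratio_ok are hypothetical names) computes the rank-one projection appearing in the PCA model above via the leading singular triple, and evaluates the singular-value ratio constraint of (7) for a given matrix.

import numpy as np

def best_rank_one(A):
    """Solution of min 0.5||X - A||_F^2 s.t. rank(X) = 1 (Eckart-Young):
    keep only the leading singular triple of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return s[0] * np.outer(U[:, 0], Vt[0, :])

def energy_ratio_ok(X, k, p1, p2):
    """Spectral constraint of (7): p1 <= (sigma_1+...+sigma_k)/(sum of all sigma) <= p2."""
    s = np.linalg.svd(X, compute_uv=False)
    ratio = s[:k].sum() / s.sum()
    return p1 <= ratio <= p2

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 4))
X1 = best_rank_one(A)
print(np.linalg.matrix_rank(X1), energy_ratio_ok(A, k=2, p1=0.3, p2=0.9))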

2.4 Quadratically Constrained Quadratic Programs

A problem of immense importance in signal processing and communications is that of solving quadratically constrained quadratic programs (QCQP), which can be formulated as,

\min\; \bm{x}^{\top}\bm{C}\bm{x}    (8)
s.t.\; \bm{x}^{\top}\bm{F}_{i}\bm{x}\geq g_{i},\;i=1,\ldots,p,
\bm{x}^{\top}\bm{H}_{i}\bm{x}=l_{i},\;i=1,\ldots,q,
\bm{x}\in\mathbb{R}^{n},

where \bm{C},\bm{F}_{1},\ldots,\bm{F}_{p},\bm{H}_{1},\ldots,\bm{H}_{q} are general real symmetric matrices. To highlight one example, let us consider the Boolean quadratic program (BQP),

\min\; \bm{x}^{\top}\bm{C}\bm{x}    (9)
s.t.\; x_{i}^{2}=1,\;i=1,\ldots,n,
\bm{x}\in\mathbb{R}^{n}.

The BQP is known to be NP-hard and has applications in multiple-input-multiple-output (MIMO) detection and multiuser detection. A well-known and important instance of BQP is the Max-Cut problem, which seeks to locate a maximum cut of a graph, i.e., a partition of the vertices of a graph into two sets S_{1} and S_{2} such that the number of edges connecting vertices in S_{1} to vertices in S_{2} is maximized.

A powerful approach taken to try and solve (8) is to first reformulate it as the equivalent matrix optimization model,

\min\; \langle\bm{C},\bm{X}\rangle    (10)
s.t.\; \langle\bm{F}_{i},\bm{X}\rangle\geq g_{i},\;i=1,\ldots,p,
\langle\bm{H}_{i},\bm{X}\rangle=l_{i},\;i=1,\ldots,q,
\bm{X}\succeq\bm{0},\;\text{rank}(\bm{X})=1,

then form and solve a semidefinite programming relaxation (SDR) of (10) by dropping the rank-1 condition, and finally apply randomization techniques to find feasible solutions to (8). This approach of convexification plus randomization for QCQPs has received much attention in recent years [36], and numerous results have been published proving the accuracy of SDR for certain QCQPs. For example, the influential 0.878 approximation result for the Max-Cut problem demonstrates the solution to the SDR can sometimes be good [23]; however, dropping the rank-1 condition in (10) is not the only way to relax the problem. Our new general matrix optimization model admits several alternatives which we demonstrate can significantly outperform the state-of-the-art SDR approach. For instance, we can approximate the rank-1 condition with different eigenvalue constraints, for example,

\{\bm{X}\in\mathcal{S}^{n\times n}\;|\;\lambda_{1}(\bm{X})\geq\delta,\;\lambda_{i}(\bm{X})\in[0,\delta],\;i=2,\ldots,n\},    (11)
\{\bm{X}\in\mathcal{S}^{n\times n}\;|\;\lambda_{i}(\bm{X})\geq\alpha_{i}\lambda_{i+1}(\bm{X}),\;i=1,\ldots,n-1\},    (12)

where \delta\geq 0 and \alpha_{i}\geq 1 for i=1,\ldots,n-1. In [22], we saw relaxations of this form applied to solving systems of quadratic equations were able to best the classical SDR approach. Later in Section 8.2, we demonstrate through a large number of numerical experiments that our general matrix optimization framework using (11) to relax the rank constraint dominates the SDR approach on a set of QCQPs.
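To make the eigenvalue relaxation (11) and the usual rank-one rounding step tangible, the sketch below (an illustration of ours; the helper names in_relaxed_set_11 and leading_eig_candidate are hypothetical and this is not the paper's algorithm) tests membership in the set (11) and extracts a candidate vector from a lifted matrix via its leading eigenpair, the standard extraction used after solving a relaxation.

import numpy as np

def in_relaxed_set_11(X, delta, tol=1e-8):
    """Membership test for the set (11): lambda_1(X) >= delta and
    lambda_i(X) in [0, delta] for i = 2,...,n."""
    lam = np.sort(np.linalg.eigvalsh(X))[::-1]   # eigenvalues, descending
    return lam[0] >= delta - tol and np.all((lam[1:] >= -tol) & (lam[1:] <= delta + tol))

def leading_eig_candidate(X):
    """Extract x from X using the leading eigenpair so x x^T is the best
    rank-1 approximation of X."""
    lam, Q = np.linalg.eigh(X)
    return np.sqrt(max(lam[-1], 0.0)) * Q[:, -1]

rng = np.random.default_rng(3)
x_true = np.sign(rng.standard_normal(6))         # a Boolean vector as in (9)
X = np.outer(x_true, x_true)                     # its feasible rank-1 lift in (10)
print(in_relaxed_set_11(X, delta=0.5), leading_eig_candidate(X)[:3])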

2.5 Structured Linear Regression

Across the scientific disciplines few mathematical paradigms are utilized more than linear regression. The framework we are proposing can also be applied to perform novel structured linear regression between vector spaces. For example, let \{(\bm{x}^{i},\bm{b}^{i})\}_{i=1}^{N}\subset\mathbb{R}^{n}\times\mathbb{R}^{m} be a set of observed data points. Our goal is to locate a constrained linear transform \mathcal{L}:\mathbb{R}^{n}\rightarrow\mathbb{R}^{m} which solves or nearly solves \mathcal{L}(\bm{x}^{i})=\bm{b}^{i} for all i. Since all such linear transforms are representable by matrices \bm{A}\in\mathbb{R}^{m\times n}, this leads to the following general matrix optimization model,

\min\; \frac{1}{2}\sum_{i=1}^{N}\|\bm{A}\bm{x}^{i}-\bm{b}^{i}\|^{2}    (13)
s.t.\; \bm{A}\in\mathcal{C}_{1},
\bm{\sigma}(\bm{A})\in\mathcal{C}_{2},

where \mathcal{C}_{1} and \mathcal{C}_{2} represent the coordinate and spectral constraints on \bm{A}\in\mathbb{R}^{m\times n} respectively. For example, say we know the process generating our data set \{(\bm{x}^{i},\bm{b}^{i})\}_{i=1}^{N} is given by a symmetric tri-diagonal matrix with a particular spectral structure, e.g., \text{rank}(\bm{A})\leq r; then we can leverage this information to form the structured linear regression model

\min\; \frac{1}{2}\sum_{i=1}^{N}\|\bm{A}\bm{x}^{i}-\bm{b}^{i}\|^{2}    (14)
s.t.\; \bm{A}\in\mathcal{T}_{n},
\text{rank}(\bm{A})\leq r,

where \mathcal{T}_{n}:=\{\bm{X}\in\mathcal{S}^{n\times n}\;|\;X_{ij}=0\text{ if }|i-j|\geq 2\}. This framework directly relates to the work in robust subspace recovery [29, 30], while enabling a much greater freedom to leverage spectral information which might be available to a practitioner.
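The coordinate structure in (14) is easy to work with numerically. The following sketch (ours; the helpers tridiag_project and regression_loss are hypothetical names, not part of the paper's code) projects a symmetric matrix onto the tri-diagonal set \mathcal{T}_{n} and evaluates the least-squares objective of (13)/(14) for a given data set.

import numpy as np

def tridiag_project(X):
    """Projection onto T_n: symmetrize, then zero all entries with |i - j| >= 2."""
    n = X.shape[0]
    mask = np.abs(np.subtract.outer(np.arange(n), np.arange(n))) <= 1
    Xs = 0.5 * (X + X.T)
    return np.where(mask, Xs, 0.0)

def regression_loss(A, xs, bs):
    """Objective of (13)/(14): 0.5 * sum_i ||A x^i - b^i||^2."""
    return 0.5 * sum(np.linalg.norm(A @ x - b)**2 for x, b in zip(xs, bs))

rng = np.random.default_rng(4)
n, N = 5, 20
A_true = tridiag_project(rng.standard_normal((n, n)))
xs = [rng.standard_normal(n) for _ in range(N)]
bs = [A_true @ x for x in xs]
print(regression_loss(A_true, xs, bs), np.linalg.matrix_rank(A_true))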

2.6 Matrix Completion

The task of filling in the missing entries of a matrix, also known as matrix completion, has found immense success in both image recovery [50] and recommender systems [14], among other applications [20]. Formally, the matrix completion problem can be formulated as: given a partially known matrix \bm{M}\in\mathbb{R}^{m\times n} such that M_{ij} is known for all (i,j)\in\Omega, recover the unknown entries of \bm{M}. A classic formulation of this problem is the rank minimization problem,

\min\; \text{rank}(\bm{X})    (15)
s.t.\; X_{i,j}=M_{i,j},\;(i,j)\in\Omega,
\bm{X}\in\mathbb{R}^{m\times n}.

Since (15) is an extremely challenging problem, it is common to replace the rank function with a convex approximation of it to obtain,

\min\; \|\bm{X}\|_{*}    (16)
s.t.\; X_{i,j}=M_{i,j},\;(i,j)\in\Omega,
\bm{X}\in\mathbb{R}^{m\times n},

where \|\bm{X}\|_{*} is the nuclear norm of \bm{X}. The main benefit of (16) is that it is convex, and in the seminal work of Candès and Recht [11] it was proven that if enough entries of \bm{M} are known then the matrix can be perfectly recovered with high probability. Central to this approach is the assumption the matrix being recovered has a low-rank structure, which very well may not be the case. In [22], our experiments demonstrated that if spectral information was known about the matrix being recovered then our methods could nearly double the chance of recovery when compared to solving (16).

Thus, we are motivated to apply (1) to propose the following matrix completion model with spectral constraints,

\min\; F(\bm{X})    (17)
s.t.\; X_{i,j}=M_{i,j},\;(i,j)\in\Omega,
\bm{\sigma}(\bm{X})\in\mathcal{C},
\bm{X}\in\mathbb{R}^{m\times n},

where \mathcal{C} is derived from the spectral information at our disposal and F:\mathbb{R}^{m\times n}\rightarrow\mathbb{R} is an objective function chosen to bias the matrix as dictated by the user. For example, if a subset of our known entries is noisy, then it would make sense to let F(\bm{X}):=\frac{1}{2}\|P_{\Omega^{\prime}}(\bm{X}-\bm{M})\|_{F}^{2}, where

\left[P_{\Omega^{\prime}}(\bm{X})\right]_{ij}=\begin{cases}X_{ij}, & (i,j)\in\Omega^{\prime}\\ 0, & \text{otherwise,}\end{cases}

for all i,j, where \Omega^{\prime} is the subset of the known entries which are noisy. Hence, in this example we replace \Omega in the constraint with \Omega^{\prime\prime}, the set of known noiseless entries, yielding the noisy matrix completion model with spectral constraints,

\min\; \frac{1}{2}\|P_{\Omega^{\prime}}(\bm{X}-\bm{M})\|_{F}^{2}    (18)
s.t.\; X_{i,j}=M_{i,j},\;(i,j)\in\Omega^{\prime\prime},
\bm{\sigma}(\bm{X})\in\mathcal{C},
\bm{X}\in\mathbb{R}^{m\times n}.

It is worth noting that learning structural information about the spectrum for particular problems is very possible in practice; see for example Theorem 1 of [27].
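As a concrete illustration of the coordinate part of (18), the sketch below (ours, not the paper's implementation; the helper names P, noisy_completion_loss, and equality_feasible are hypothetical) implements the mask operator keeping only the entries indexed by \Omega^{\prime}, evaluates the objective over the noisy entries, and checks the equality constraints on the clean index set \Omega^{\prime\prime}.

import numpy as np

def P(X, idx):
    """Mask operator keeping only the entries indexed by idx = (rows, cols)."""
    out = np.zeros_like(X)
    out[idx] = X[idx]
    return out

def noisy_completion_loss(X, M, noisy_idx):
    """Objective of (18): 0.5 * ||P_{Omega'}(X - M)||_F^2 over the noisy entries."""
    return 0.5 * np.linalg.norm(P(X - M, noisy_idx), 'fro')**2

def equality_feasible(X, M, clean_idx, tol=1e-10):
    """Coordinate constraints of (18): X_ij = M_ij on the clean index set Omega''."""
    return np.allclose(X[clean_idx], M[clean_idx], atol=tol)

rng = np.random.default_rng(5)
M = rng.standard_normal((4, 4))
noisy_idx = (np.array([0, 1]), np.array([2, 3]))   # Omega'  (noisy known entries)
clean_idx = (np.array([2, 3]), np.array([0, 1]))   # Omega'' (noiseless known entries)
X = M.copy(); X[0, 2] += 0.1                       # perturb one noisy entry
print(noisy_completion_loss(X, M, noisy_idx), equality_feasible(X, M, clean_idx))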

2.7 And More…

The preceding examples should be sufficient to demonstrate the breadth of problems modeled and extended by (1). Further examples include low-rank matrix optimization [53, 37], tensor completion [21], and synchronization problems [31, 41], among others; however, we do not have the space here to elaborate on these important applications.

3 An Equivalent Reformulation

The primary approach we take in this paper to solve (1) is to form a decomposed model by utilizing standard matrix and tensor factorizations. For example, in the case \mathbb{S}=\mathcal{S}^{n\times n} in (1) we utilize the eigenvalue decomposition. It is well known a symmetric matrix \bm{X} has an eigenvalue/spectral decomposition \bm{X}=\bm{Q}\operatorname{Diag}(\bm{\lambda}(\bm{X}))\bm{Q}^{\top}, where \bm{Q}\in\mathcal{O}(n,n), the matrix manifold of orthogonal matrices, and \operatorname{Diag}(\bm{\lambda}(\bm{X})) is the real diagonal matrix formed by the eigenvalues of \bm{X}. Thus, using the eigenvalue decomposition, we can equivalently rewrite

\min\; F(\bm{X})    (19)
s.t.\; \bm{G}(\bm{X})=\bm{0},
\bm{g}(\bm{\lambda}(\bm{X}))\leq\bm{0},
\bm{X}\in\mathcal{S}^{n\times n},

as the decomposed model

\min\; f(\bm{Q},\bm{\lambda}):=F(\bm{Q}\operatorname{Diag}(\bm{\lambda})\bm{Q}^{\top})    (20)
s.t.\; \bm{G}(\bm{Q}\operatorname{Diag}(\bm{\lambda})\bm{Q}^{\top})=\bm{0},
\bm{\tilde{g}}(\bm{\lambda})\leq\bm{0},
\bm{Q}\in\mathcal{O}(n,n),\;\bm{\lambda}\in\mathbb{R}^{n},

where \bm{\tilde{g}} additionally enforces the condition that the elements of \bm{\lambda} are in decreasing order, i.e., \lambda_{i}\geq\lambda_{i+1} for all i. Now, it is clear the global minimums of (19) and (20) are equivalent, but there might be some concern that the decomposed model introduces new local minimums which do not correspond to local minimums in the original model; however, we prove this is not the case for local minimizers of (19) which have no repeated eigenvalues. Since a minor perturbation of the constraints in (19) ensures all feasible matrices have unique eigenvalues, our result demonstrates the strong tie between the local minimums of the decomposed model and the original, which supports our decomposition approach to solving (1).
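The mechanics of the decomposition are straightforward to verify numerically. The sketch below (ours, for illustration; the helpers assemble and decomposed_objective are hypothetical names) maps (\bm{Q},\bm{\lambda}) back to \bm{X}, evaluates f(\bm{Q},\bm{\lambda})=F(\bm{Q}\operatorname{Diag}(\bm{\lambda})\bm{Q}^{\top}), and checks the descending-order condition imposed by \bm{\tilde{g}}.

import numpy as np

def assemble(Q, lam):
    """Map the decomposed variables back to the matrix variable: X = Q diag(lam) Q^T."""
    return Q @ np.diag(lam) @ Q.T

def decomposed_objective(F, Q, lam):
    """f(Q, lam) := F(Q diag(lam) Q^T), the objective of the decomposed model (20)."""
    return F(assemble(Q, lam))

rng = np.random.default_rng(6)
n = 5
X = rng.standard_normal((n, n)); X = 0.5 * (X + X.T)
lam, Q = np.linalg.eigh(X)
lam, Q = lam[::-1], Q[:, ::-1]                   # reorder eigenpairs to descending order
F = lambda Y: 0.5 * np.linalg.norm(Y - np.eye(n))**2
print(np.allclose(assemble(Q, lam), X),          # round trip X -> (Q, lam) -> X
      np.isclose(decomposed_objective(F, Q, lam), F(X)),
      np.all(np.diff(lam) <= 0))                 # ordering condition enforced by g-tilde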

Before stating and proving our main result, we first introduce a few definitions and prove a set of lemmas. Let \mathbb{R}^{n}_{\geq}:=\{\bm{x}\in\mathbb{R}^{n}\;|\;x_{1}\geq x_{2}\geq\cdots\geq x_{n}\}, let \|\cdot\|_{F} and \|\cdot\| denote the Frobenius and Euclidean norms respectively, and given \bm{X}\in\mathcal{S}^{n\times n}, \bm{Q}\in\mathcal{O}(n,n), \bm{\lambda}\in\mathbb{R}^{n}_{\geq} and \delta>0 define

\bm{B}\left((\bm{Q},\bm{\lambda}),\delta\right):=\left\{\bar{\bm{Q}}\operatorname{Diag}(\bar{\bm{\lambda}})\bar{\bm{Q}}^{\top}\;|\;\|\bar{\bm{Q}}-\bm{Q}\|_{F}\leq\delta,\;\|\bar{\bm{\lambda}}-\bm{\lambda}\|\leq\delta,\;\bar{\bm{Q}}\in\mathcal{O}(n,n),\;\bar{\bm{\lambda}}\in\mathbb{R}^{n}_{\geq}\right\},    (21)
\bm{B}\left(\bm{X},\delta\right):=\left\{\bar{\bm{X}}\in\mathcal{S}^{n\times n}\;|\;\|\bar{\bm{X}}-\bm{X}\|_{F}\leq\delta\right\}.    (22)
Lemma 1.

Given \bm{X}\in\mathcal{S}^{n\times n}, \delta>0, and a spectral decomposition of \bm{X} such that \bm{X}=\bm{Q}_{x}\operatorname{Diag}(\bm{\lambda}(\bm{X}))\bm{Q}_{x}^{\top} with \bm{Q}_{x}\in\mathcal{O}(n,n), there exists \delta^{\prime}>0 such that

\bm{B}\left((\bm{Q}_{x},\bm{\lambda}(\bm{X})),\delta^{\prime}\right)\subseteq\bm{B}\left(\bm{X},\delta\right).
Proof.

Define

\delta^{\prime}:=\min\left\{\frac{\delta}{2},\;\frac{\delta}{2(\delta+2\|\bm{\lambda}(\bm{X})\|)}\right\}.

Then, for all \bm{Q}\in\mathcal{O}(n,n) such that \|\bm{Q}-\bm{Q}_{x}\|_{F}\leq\delta^{\prime} and all \bm{\lambda}\in\mathbb{R}^{n}_{\geq} such that \|\bm{\lambda}-\bm{\lambda}(\bm{X})\|\leq\delta^{\prime},

\|\bm{Q}\operatorname{Diag}(\bm{\lambda})\bm{Q}^{\top}-\bm{X}\|_{F}
= \|\bm{Q}\operatorname{Diag}(\bm{\lambda})\bm{Q}^{\top}-\bm{Q}_{x}\operatorname{Diag}(\bm{\lambda}(\bm{X}))\bm{Q}_{x}^{\top}\|_{F}
\leq \|(\bm{Q}-\bm{Q}_{x})\operatorname{Diag}(\bm{\lambda})\bm{Q}^{\top}\|_{F}+\|\operatorname{Diag}(\bm{\lambda})-\operatorname{Diag}(\bm{\lambda}(\bm{X}))\|_{F}+\|\bm{Q}_{x}\operatorname{Diag}(\bm{\lambda}(\bm{X}))(\bm{Q}-\bm{Q}_{x})^{\top}\|_{F}
\leq \|\bm{Q}-\bm{Q}_{x}\|_{F}\|\bm{\lambda}\|+\|\bm{\lambda}-\bm{\lambda}(\bm{X})\|+\|\bm{\lambda}(\bm{X})\|\cdot\|\bm{Q}-\bm{Q}_{x}\|_{F}
\leq \delta^{\prime}(2\|\bm{\lambda}(\bm{X})\|+\delta^{\prime})+\delta^{\prime}
\leq \frac{\delta}{2(\delta+2\|\bm{\lambda}(\bm{X})\|)}\left(2\|\bm{\lambda}(\bm{X})\|+\delta\right)+\frac{\delta}{2}
= \delta,

and the result follows by the definition of \bm{B}\left((\bm{Q}_{x},\bm{\lambda}(\bm{X})),\delta^{\prime}\right). ∎

This lemma is crucial to be able to claim that any local minimum of (19) generates a local minimum of (20). We next move to a lemma which demonstrates an equivalence between different representations of the same symmetric matrix. Note, in Lemma 2, we let \bm{e}_{i} denote the vector of all zeros except for a one in the i-th entry.

Lemma 2.

Given \bm{X}\in\mathcal{S}^{n\times n} with unique eigenvalues, \delta\geq 0, and \bm{E}\in\left[\pm\bm{e}_{1}\;|\;\cdots\;|\;\pm\bm{e}_{n}\right], we have for any spectral decomposition of \bm{X}, i.e., \bm{X}=\bm{Q}_{x}\operatorname{Diag}(\bm{\lambda}(\bm{X}))\bm{Q}_{x}^{\top} with \bm{Q}_{x}\in\mathcal{O}(n,n), that

\bm{B}\left((\bm{Q}_{x},\bm{\lambda}(\bm{X})),\delta\right)=\bm{B}\left((\bm{Q}_{x}\bm{E},\bm{\lambda}(\bm{X})),\delta\right).
Proof.

Let \bm{Q}\operatorname{Diag}(\bm{\lambda})\bm{Q}^{\top}\in\bm{B}\left((\bm{Q}_{x},\bm{\lambda}(\bm{X})),\delta\right). Then, \|\bm{Q}-\bm{Q}_{x}\|_{F}\leq\delta, \|\bm{\lambda}-\bm{\lambda}(\bm{X})\|\leq\delta, and

\|\bm{Q}\bm{E}-\bm{Q}_{x}\bm{E}\|_{F}=\|\bm{Q}-\bm{Q}_{x}\|_{F}\leq\delta,

which implies

\bm{Q}\bm{E}\operatorname{Diag}(\bm{\lambda})\bm{E}^{\top}\bm{Q}^{\top}=\bm{Q}\operatorname{Diag}(\bm{\lambda})\bm{Q}^{\top}\in\bm{B}\left((\bm{Q}_{x}\bm{E},\bm{\lambda}(\bm{X})),\delta\right).

On the other hand, letting \tilde{\bm{Q}}\operatorname{Diag}(\bm{\lambda})\tilde{\bm{Q}}^{\top}\in\bm{B}\left((\bm{Q}_{x}\bm{E},\bm{\lambda}(\bm{X})),\delta\right) and setting \bm{Q}=\tilde{\bm{Q}}\bm{E}, we see

\|\bm{Q}-\bm{Q}_{x}\|_{F}=\|\tilde{\bm{Q}}\bm{E}-\bm{Q}_{x}\|_{F}=\|\tilde{\bm{Q}}-\bm{Q}_{x}\bm{E}\|_{F}\leq\delta.

Thus,

\tilde{\bm{Q}}\operatorname{Diag}(\bm{\lambda})\tilde{\bm{Q}}^{\top}=\bm{Q}\operatorname{Diag}(\bm{\lambda})\bm{Q}^{\top}\in\bm{B}\left((\bm{Q}_{x},\bm{\lambda}(\bm{X})),\delta\right). ∎

This lemma demonstrates that if \bm{X} has unique eigenvalues, then the neighborhoods about the different representations of \bm{X} create the same image in the space of symmetric matrices. From this we shall be able to conclude that a local minimum at any one of the representations implies a local minimum at all of the representations. Finally, we want to demonstrate that we can generate a ball in the space of symmetric matrices which is contained inside the image of the balls over decompositions. Note, in the event the argument is a matrix, \|\cdot\|_{2} denotes the matrix 2-norm.

Lemma 3.

Given \bm{X}\in\mathcal{S}^{n\times n} with unique eigenvalues, \delta^{\prime}>0, and the spectral decomposition \bm{X}=\bm{Q}_{x}\operatorname{Diag}(\bm{\lambda}(\bm{X}))\bm{Q}_{x}^{\top}, there exists \delta^{\prime\prime}>0 such that

\bm{B}\left(\bm{X},\delta^{\prime\prime}\right)\subset\bm{B}\left((\bm{Q}_{x},\bm{\lambda}(\bm{X})),\delta^{\prime}\right).
Proof.

Define \tau:=\min_{1\leq i\leq n-1}\;|\lambda_{i}(\bm{X})-\lambda_{i+1}(\bm{X})| and let

\delta^{\prime\prime}=\min\left\{1,\;\tau/2,\;\delta^{\prime},\;\frac{\tau^{2}}{\sqrt{4n\tau^{2}+16n}}\delta^{\prime}\right\}.

Then, given \bm{Z}\in\mathcal{S}^{n\times n} such that \|\bm{Z}-\bm{X}\|_{F}\leq\delta^{\prime\prime}, it follows

\|\bm{\lambda}(\bm{Z})-\bm{\lambda}(\bm{X})\|\leq\|\bm{Z}-\bm{X}\|_{F}\leq\delta^{\prime\prime},

which implies the eigenvalues of \bm{Z} are simple since \delta^{\prime\prime}\leq\tau/2. By Lemma 2, the result shall follow provided \bm{Z}\in\bm{B}((\bm{Q}_{x}\bm{E},\bm{\lambda}(\bm{X})),\delta^{\prime}) for some \bm{E} as defined in Lemma 2. Now, by Theorem 3.3.7 in [39] there exist normalized eigenvectors \bm{u}_{j} and \bm{v}_{j} of \bm{Z} and \bm{X} corresponding to \lambda_{j}(\bm{Z}) and \lambda_{j}(\bm{X}) respectively for j=1,\ldots,n such that

\|\bm{u}_{j}-\bm{v}_{j}\|\leq\gamma(1+\gamma^{2})^{1/2},    (23)

where \gamma=\|\bm{Z}-\bm{X}\|_{2}/(\tau-\|\bm{Z}-\bm{X}\|_{2}). Since \bm{Z} and \bm{X} both have simple eigenvalues, there exist \bm{E} and \bm{Q}_{z} such that

\bm{Q}_{x}\bm{E}=\left[\bm{v}_{1}\;|\;\cdots\;|\;\bm{v}_{n}\right]\text{ and }\bm{Q}_{z}=\left[\bm{u}_{1}\;|\;\cdots\;|\;\bm{u}_{n}\right]

with \bm{Z}=\bm{Q}_{z}\operatorname{Diag}(\bm{\lambda}(\bm{Z}))\bm{Q}_{z}^{\top} and \bm{Q}_{z}\in\mathcal{O}(n,n). Noting \gamma\leq 2\delta^{\prime\prime}/\tau, we see

\|\bm{Q}_{z}-\bm{Q}_{x}\bm{E}\|_{F}^{2}
\leq \sum_{i=1}^{n}\|\bm{u}_{i}-\bm{v}_{i}\|^{2}
\leq n\gamma^{2}(1+\gamma^{2})
\leq \frac{4n}{\tau^{2}}(\delta^{\prime\prime})^{2}\left(1+\frac{4(\delta^{\prime\prime})^{2}}{\tau^{2}}\right)
\leq \left(\frac{4n\tau^{2}+16n}{\tau^{4}}\right)(\delta^{\prime\prime})^{2}
\leq (\delta^{\prime})^{2},

where the second inequality followed from (23) and the remaining inequalities were a result of the definition of \delta^{\prime\prime}. ∎

We now combine our lemmas to prove our main result, but first we need to define a local minimum for our models.

Definition 1.

Define the following feasible regions for (19) and (20):

\mathcal{X}_{1}:=\left\{\bm{X}\in\mathcal{S}^{n\times n}\;|\;\bm{G}(\bm{X})=\bm{0},\;\bm{\lambda}(\bm{X})\in\mathcal{C}\right\},
\mathcal{X}_{2}:=\left\{(\bm{Q},\bm{\lambda})\in\mathcal{O}(n,n)\times\mathbb{R}^{n}_{\geq}\;|\;\bm{G}(\bm{Q}\operatorname{Diag}(\bm{\lambda})\bm{Q}^{\top})=\bm{0},\;\bm{\lambda}\in\mathcal{C}\right\},

where \mathcal{C} is a closed subset of \mathbb{R}^{n}. Then, the matrix \bm{X}^{*}\in\mathcal{S}^{n\times n} is a local minimum of (19) if there exists \delta_{1}>0 such that F(\bm{X}^{*})\leq F(\bm{X}) for all \bm{X}\in\bm{B}(\bm{X}^{*},\delta_{1})\cap\mathcal{X}_{1}. Similarly, (\bm{Q}^{*},\bm{\lambda}^{*})\in\mathcal{O}(n,n)\times\mathbb{R}^{n}_{\geq} is a local minimum of (20) provided there exists \delta_{2}>0 such that f(\bm{Q}^{*},\bm{\lambda}^{*})\leq f(\bm{Q},\bm{\lambda}) for all

(\bm{Q},\bm{\lambda})\in\left\{(\bar{\bm{Q}},\bar{\bm{\lambda}})\in\mathcal{O}(n,n)\times\mathbb{R}^{n}_{\geq}\;|\;\|\bar{\bm{Q}}-\bm{Q}^{*}\|_{F}\leq\delta_{2},\;\|\bar{\bm{\lambda}}-\bm{\lambda}^{*}\|\leq\delta_{2}\right\}\cap\mathcal{X}_{2}.

We now present our main theorem for this section.

Theorem 4.

A matrix \bm{X}^{*}\in\mathcal{S}^{n\times n} with unique eigenvalues is a local minimum of (19) if and only if one of its spectral decompositions is a local minimum of (20). Furthermore, if one of the spectral decompositions of \bm{X}^{*} is a local minimum of (20), then all of its spectral decompositions are local minimums.

Proof.

Assume \bm{X}^{*} is a local minimum of (19). Thus, there exists \delta_{1}>0 such that F(\bm{X}^{*})\leq F(\bm{X}) for all \bm{X}\in\bm{B}(\bm{X}^{*},\delta_{1})\cap\mathcal{X}_{1}. Let \bm{Q}^{*}\operatorname{Diag}(\bm{\lambda}(\bm{X}^{*}))(\bm{Q}^{*})^{\top} be any spectral decomposition of \bm{X}^{*}. Then by Lemma 1 there exists \delta_{2}>0 such that \bm{B}\left((\bm{Q}^{*},\bm{\lambda}(\bm{X}^{*})),\delta_{2}\right)\subseteq\bm{B}(\bm{X}^{*},\delta_{1}). Therefore, it follows that f(\bm{Q}^{*},\bm{\lambda}(\bm{X}^{*}))\leq f(\bm{Q},\bm{\lambda}) for all

(\bm{Q},\bm{\lambda})\in\left\{(\bar{\bm{Q}},\bar{\bm{\lambda}})\in\mathcal{O}(n,n)\times\mathbb{R}^{n}_{\geq}\;|\;\|\bar{\bm{Q}}-\bm{Q}^{*}\|_{F}\leq\delta_{2},\;\|\bar{\bm{\lambda}}-\bm{\lambda}(\bm{X}^{*})\|\leq\delta_{2}\right\}\cap\mathcal{X}_{2}.

Since no particular spectral decomposition of \bm{X}^{*} was selected, any spectral decomposition of \bm{X}^{*} generates a local minimum of (20). To prove the other direction, assume (\bm{Q}^{*},\bm{\lambda}^{*})\in\mathcal{O}(n,n)\times\mathbb{R}^{n}_{\geq} is a local minimum of (20). Then there exists \delta_{2}>0 such that f(\bm{Q}^{*},\bm{\lambda}^{*})\leq f(\bm{Q},\bm{\lambda}) for all

(\bm{Q},\bm{\lambda})\in\left\{(\bar{\bm{Q}},\bar{\bm{\lambda}})\in\mathcal{O}(n,n)\times\mathbb{R}^{n}_{\geq}\;|\;\|\bar{\bm{Q}}-\bm{Q}^{*}\|_{F}\leq\delta_{2},\;\|\bar{\bm{\lambda}}-\bm{\lambda}^{*}\|\leq\delta_{2}\right\}\cap\mathcal{X}_{2},

which is equivalent to stating F(\bm{Q}^{*}\operatorname{Diag}(\bm{\lambda}^{*})(\bm{Q}^{*})^{\top})\leq F(\bar{\bm{X}}) for all \bar{\bm{X}}\in\bm{B}\left((\bm{Q}^{*},\bm{\lambda}^{*}),\delta_{2}\right)\cap\mathcal{X}_{1}. By Lemma 3, it follows there exists \delta_{1}>0 such that \bm{B}(\bm{Q}^{*}\operatorname{Diag}(\bm{\lambda}^{*})(\bm{Q}^{*})^{\top},\delta_{1})\cap\mathcal{X}_{1}\subseteq\bm{B}\left((\bm{Q}^{*},\bm{\lambda}^{*}),\delta_{2}\right)\cap\mathcal{X}_{1}, which implies \bm{X}^{*}=\bm{Q}^{*}\operatorname{Diag}(\bm{\lambda}^{*})(\bm{Q}^{*})^{\top} is a local minimum of (19). ∎

Thus, we have demonstrated the local minimums of the two models are identical in the case of local minimizers without repeated eigenvalues. This might seem like a strong restriction, but in reality it is always possible to use the eigenvalue constraints to ensure no repeated eigenvalues occur. Additionally, the continuity of the eigenvalues of a symmetric matrix with respect to perturbations of the coordinates of the matrix ensures that such a restriction only alters the value of the objective function in a boundable fashion, assuming some degree of smoothness in the objective function; see Theorem 4 in [22] for a more substantial discussion on this point.

4 A General Block Manifold Model

The decomposition approach outlined in Section 3 leads us to formulate and study a new optimization model, which generalizes beyond the decomposition approach for symmetric matrices presented in (20). Thus, the central model guiding the remainder of our discussion is the block manifold model:

\min\; f(\bm{x},\bm{y})    (24)
s.t.\; \bm{c}(\bm{x},\bm{y})=\bm{0},
\bm{g}(\bm{y})\leq\bm{0},
\bm{x}\in\mathcal{M},\;\bm{y}\in\mathbb{R}^{n},

where f:\mathcal{M}\times\mathbb{R}^{n}\rightarrow\mathbb{R}, \bm{c}:\mathcal{M}\times\mathbb{R}^{n}\rightarrow\mathbb{R}^{p}, and \bm{g}:\mathbb{R}^{n}\rightarrow\mathbb{R}^{s} are differentiable and \mathcal{M} is a differentiable submanifold of a Euclidean space. Depending on the choice of \mathbb{S} in (1), the corresponding manifold in (24) will vary, as dictated by the decomposition utilized. For example, if \mathbb{S}=\mathbb{R}^{m\times n}, then we shall apply a singular value decomposition to form the variables \bm{x} and \bm{y} in (24), and \mathcal{M} shall be a Cartesian product of Stiefel manifolds. In this fashion, (24) can be utilized to model all presented versions of (1). In the remainder of this paper, we shall develop, analyze, and test an algorithm which converges to approximate first-order KKT points of (24) when viewed as a constrained Riemannian optimization problem.
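For the rectangular case mentioned above, the brief sketch below (our illustration, with hypothetical helper names on_stiefel and assemble_rect) shows how the SVD splits \bm{X}\in\mathbb{R}^{m\times n} into the manifold variable \bm{x}=(\bm{U},\bm{V}) on a product of Stiefel manifolds and the Euclidean variable \bm{y}=\bm{\sigma}, matching the structure of (24).

import numpy as np

def on_stiefel(U, tol=1e-10):
    """Check U^T U = I, i.e., membership in the Stiefel manifold St(n, k)."""
    k = U.shape[1]
    return np.allclose(U.T @ U, np.eye(k), atol=tol)

def assemble_rect(U, sigma, V):
    """Rectangular analogue of the decomposition: X = U diag(sigma) V^T with
    (U, V) on a product of Stiefel manifolds and sigma in R^min(m,n)."""
    return U @ np.diag(sigma) @ V.T

rng = np.random.default_rng(7)
m, n = 6, 4
X = rng.standard_normal((m, n))
U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(on_stiefel(U), on_stiefel(Vt.T),           # x = (U, V) lies on St(m,4) x St(n,4)
      np.allclose(assemble_rect(U, s, Vt.T), X)) # y = sigma recovers X exactly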

4.1 Riemannian Manifold Optimization Revisited

As optimization over smooth manifolds shall be a major aspect of our discussion, we begin by presenting a brief introduction to the major geometric objects to be leveraged. Our presentation of Riemannian geometry for optimization is restricted to differentiable submanifolds of Euclidean spaces, i.e., vector spaces with a defined inner product, as this covers our examples of interest; however, our later analysis holds for general smooth Riemannian manifolds. The notation and descriptions presented in this section are inspired by and drawn heavily from Chapters 3 and 10 of [7]. For readers interested in further details, the authors highly recommend Boumal's excellent text.

At a high level, a smooth manifold is a generalization of linear spaces in such a way that calculus operations such as differentiation and integration are well-defined and made possible. In our presentation, we let \mathcal{E} refer to a general finite dimensional vector space. If \mathcal{E} additionally has a defined inner product, then we call \mathcal{E} a Euclidean space. The simplified definition of an embedded submanifold of \mathcal{E} we choose to present is provided by Boumal:

Definition 2.

[Definition 3.6, [7]] Let \mathcal{M} be a subset of \mathcal{E}. We say \mathcal{M} is a (smooth) embedded submanifold of \mathcal{E} if either of the following holds:

  1. \mathcal{M} is an open subset of \mathcal{E}. Then we call \mathcal{M} an open submanifold. If \mathcal{M}=\mathcal{E}, we also call it a linear manifold.

  2. For a fixed integer k\geq 1 and for each \bm{x}\in\mathcal{M}, there exists a neighborhood U of \bm{x} in \mathcal{E} and a smooth function h:U\rightarrow\mathbb{R}^{k} such that

    (a) if \bm{y}\in U, then h(\bm{y})=\bm{0} if and only if \bm{y}\in\mathcal{M}; and

    (b) \operatorname{rank}Dh(\bm{x})=k.

    Such a function h is called a local defining function of \mathcal{M} at \bm{x}.

The set \mathcal{E} is often referred to as the embedding space or the ambient space of \mathcal{M}.

An experienced student of Riemannian geometry will note that, via the inverse function theorem, the charts appearing in the more traditional presentation of general smooth manifolds can be obtained from U and h.

Definition 3.

[Definitions 3.7 & 3.10, [7]] Let \mathcal{M} be an embedded submanifold of \mathcal{E}. For all \bm{x}\in\mathcal{M}, the tangent space of \mathcal{M} at \bm{x} is defined as the set

\mathcal{T}_{\bm{x}}\mathcal{M}:=\{c^{\prime}(0)\;|\;c:\mathcal{I}\rightarrow\mathcal{M}\;\text{ is smooth around }0\text{ and }c(0)=\bm{x}\},    (25)

and elements \bm{v}\in\mathcal{T}_{\bm{x}}\mathcal{M} are called tangent vectors to \mathcal{M} at \bm{x}. So, \bm{v}\in\mathcal{T}_{\bm{x}}\mathcal{M} if and only if there is a smooth curve c on \mathcal{M} through \bm{x} such that c^{\prime}(0)=\bm{v}.

In the case of embedded submanifolds of a vector space, a local defining function at a point \bm{x}\in\mathcal{M} provides a simple way to describe the set of tangent vectors at \bm{x}.

Theorem 5.

[Theorem 3.8, [7]] Let \mathcal{M} be an embedded submanifold of \mathcal{E}. Consider \bm{x}\in\mathcal{M} and the set \mathcal{T}_{\bm{x}}\mathcal{M} (25). If \mathcal{M} is an open submanifold of \mathcal{E}, then \mathcal{T}_{\bm{x}}\mathcal{M}=\mathcal{E}. Otherwise, \mathcal{T}_{\bm{x}}\mathcal{M}=\ker\left(Dh(\bm{x})\right) with h any local defining function at \bm{x}.

The tangent space of \mathcal{M} at a point \bm{x} is intended to be a reasonable local approximation of the manifold. This idea is seen most clearly from the description of the tangent space in Theorem 5 as the kernel of the differential of a local defining function.

We have new spaces over which we are defining functions, namely embedded submanifolds of vector spaces. Thus, it is prudent for us to consider what it means for such functions to be smooth over these manifolds. This is clarified in the following definition.

Definition 4.

[Definition 3.23, [7]] Let \mathcal{M} and \mathcal{M}^{\prime} be embedded submanifolds of vector spaces \mathcal{E} and \mathcal{E}^{\prime} respectively. A map F:\mathcal{M}\rightarrow\mathcal{M}^{\prime} is smooth at \bm{x}\in\mathcal{M} if there exists a function \bar{F}:U\rightarrow\mathcal{E}^{\prime} which is smooth (in the usual sense) on a neighborhood U of \bm{x} in \mathcal{E} and such that \bar{F} restricted to \mathcal{M}\cap U is equal to F. The function \bar{F} is called a local smooth extension of F at \bm{x}. We say F is smooth if it is smooth for all \bm{x}\in\mathcal{M}.

The set of all tangent spaces over the manifold can be bundled together to form another manifold known as the tangent bundle.

Definition 5.

[Definition 3.35, [7]] The tangent bundle of a manifold \mathcal{M} is the disjoint union of the tangent spaces of \mathcal{M}:

T\mathcal{M}:=\{(\bm{x},\bm{v})\;|\;\bm{x}\in\mathcal{M},\;\bm{v}\in T_{\bm{x}}\mathcal{M}\}.    (26)

Retraction operations on manifolds, to be defined shortly, and maps sending elements of the manifold \mathcal{M} to tangent vectors naturally take place over the tangent bundle. A particularly important class of such maps is known as vector fields.

Definition 6.

[Definition 3.37, [7]] A vector field on a manifold \mathcal{M} is a map V:\mathcal{M}\rightarrow T\mathcal{M} such that V(\bm{x})\in T_{\bm{x}}\mathcal{M} for all \bm{x}\in\mathcal{M}. If V is a smooth map, we say it is a smooth vector field.

In order to define distances and gradients on a manifold an inner product must be introduced over the tangent spaces of the manifold.

Definition 7.

[Definition 3.43, [7]] An inner product on T_{\bm{x}}\mathcal{M} is a bilinear, symmetric, positive definite function \langle\cdot,\cdot\rangle_{\bm{x}}:T_{\bm{x}}\mathcal{M}\times T_{\bm{x}}\mathcal{M}\rightarrow\mathbb{R}. It induces a norm for tangent vectors: \|\bm{v}\|_{\bm{x}}:=\sqrt{\langle\bm{v},\bm{v}\rangle_{\bm{x}}}.

A metric on \mathcal{M} follows from a particular choice of inner product \langle\cdot,\cdot\rangle_{\bm{x}} for each \bm{x}\in\mathcal{M}. As the inner product depends on the point on the manifold where it is taken, an important property for an inner product is that it varies smoothly over the manifold. Such inner products are known as Riemannian metrics. More precisely,

Definition 8.

[Definition 3.44, [7]] A metric \langle\cdot,\cdot\rangle_{\bm{x}} on \mathcal{M} is a Riemannian metric if it varies smoothly with \bm{x}, in the sense that if V and W are two smooth vector fields on \mathcal{M}, then the function \bm{x}\mapsto\langle V(\bm{x}),W(\bm{x})\rangle_{\bm{x}} is smooth from \mathcal{M} to \mathbb{R}.

The presence of a Riemannian metric produces a Riemannian manifold, and from this metric we can define a notion of distance between points in the same connected component of a Riemannian manifold, i.e., points which can be connected by a continuous curve on the manifold.

Definition 9.

Let \mathcal{M} be a Riemannian manifold. The length of a piecewise smooth curve (see Definition 10.2, [7]) c:[a,b]\rightarrow\mathcal{M} is defined as

L(c)=\int_{a}^{b}\|c^{\prime}(t)\|_{c(t)}\;dt.

The Riemannian distance between \bm{x},\bm{y}\in\mathcal{M} is then \text{dist}(\bm{x},\bm{y})=\inf_{c}L(c), where the infimum is taken over all piecewise regular curve segments on \mathcal{M} connecting \bm{x} and \bm{y}.

An interesting by-product of this notion of distance on a Riemannian manifold is that the Riemannian distance in tandem with \mathcal{M} forms a metric space (Theorem 10.3, [7]). Now, for the case of submanifolds of Euclidean spaces, that is, vector spaces with a defined inner product, e.g., \mathcal{E}=\mathbb{R}^{n} with the standard inner product \langle\bm{x},\bm{y}\rangle=\bm{x}^{\top}\bm{y}, a particular choice of inner product is of great importance.

Definition 10.

[Definition 3.47, [7]] Let \mathcal{M} be an embedded submanifold of a Euclidean space \mathcal{E}. Equipped with the Riemannian metric obtained by restriction of the metric of \mathcal{E}, we call \mathcal{M} a Riemannian submanifold of \mathcal{E}.

Riemannian submanifolds of Euclidean spaces are of immense practical interest, as many applications of optimization over manifolds consist of optimizing over such spaces. For example, the sphere S^{n-1}:=\{\bm{x}\in\mathbb{R}^{n}\;|\;\|\bm{x}\|=1\}, the Stiefel manifold \text{St}_{n,k}:=\{\bm{X}\in\mathbb{R}^{n\times k}\;|\;\bm{X}^{\top}\bm{X}=\bm{I}_{k\times k}\}, the low-rank spectrahedron \{\bm{X}\in\mathbb{R}^{n\times n}\;|\;\text{Tr}(\bm{X})=1,\bm{X}\succeq\bm{0},\text{rank}(\bm{X})\leq r\leq n\}, and any Cartesian product of these are all Riemannian submanifolds of Euclidean spaces.

In the case of a function \bar{F}:\mathcal{E}\rightarrow\mathcal{E}^{\prime} which is a smooth map between two vector spaces, the differential of \bar{F} at \bm{x}\in\mathcal{E} is a linear operator from \mathcal{E} to \mathcal{E}^{\prime}, denoted D\bar{F}(\bm{x}):\mathcal{E}\rightarrow\mathcal{E}^{\prime}, defined as

D\bar{F}(\bm{x})[\bm{v}]=\lim_{t\rightarrow 0}\frac{\bar{F}(\bm{x}+t\bm{v})-\bar{F}(\bm{x})}{t}.

From a smooth curve c over \mathcal{M}, a smooth curve over \mathcal{M}^{\prime} can be formed as the map t\mapsto F(c(t)). From this idea, we can define the differential of a map F between two embedded submanifolds of vector spaces.

Definition 11.

[Definition 3.27, [7]] Let \mathcal{M} and \mathcal{M}^{\prime} be two embedded submanifolds of vector spaces \mathcal{E} and \mathcal{E}^{\prime} respectively. The differential of F:\mathcal{M}\rightarrow\mathcal{M}^{\prime} at \bm{x} is a linear operator DF(\bm{x}):T_{\bm{x}}\mathcal{M}\rightarrow T_{F(\bm{x})}\mathcal{M}^{\prime} defined by

DF(\bm{x})[\bm{v}]=\frac{d}{dt}F(c(t))\bigg{|}_{t=0},

where c is a smooth curve on \mathcal{M} passing through \bm{x} at t=0 with c^{\prime}(0)=\bm{v}.

With the definition of a differential and Riemannian metrics presented, we can now introduce the Riemannian gradient of a smooth function f:\mathcal{M}\rightarrow\mathbb{R} over a Riemannian manifold.

Definition 12.

[Definition 3.50, [7]] Let f:\mathcal{M}\rightarrow\mathbb{R} be a smooth function on a Riemannian manifold \mathcal{M}. The Riemannian gradient of f is the vector field \operatorname{grad}f:\mathcal{M}\rightarrow T\mathcal{M} uniquely defined by the identities:

\forall(\bm{x},\bm{v})\in T\mathcal{M},\;\;\;Df(\bm{x})[\bm{v}]=\langle\bm{v},\operatorname{grad}f(\bm{x})\rangle_{\bm{x}},

where Df(\bm{x}) is the differential of f as defined above and \langle\cdot,\cdot\rangle_{\bm{x}} is the Riemannian metric.

In the event \mathcal{M} is a Riemannian submanifold of a Euclidean space, computing the Riemannian gradient of f is easily accomplished through a smooth extension of f.

Proposition 6.

[Proposition 3.53, [7]] Let \mathcal{M} be a Riemannian submanifold of \mathcal{E} endowed with the metric \langle\cdot,\cdot\rangle and let f:\mathcal{M}\rightarrow\mathbb{R} be a smooth function. The Riemannian gradient of f is given by

\operatorname{grad}f(\bm{x})=\text{Proj}_{\bm{x}}\left(\operatorname{grad}\bar{f}(\bm{x})\right),    (27)

where \bar{f} is any smooth extension of f in a neighborhood of \bm{x}\in\mathcal{M} and \text{Proj}_{\bm{x}}(\cdot) is the projection operator from \mathcal{E} to T_{\bm{x}}\mathcal{M}, orthogonal with respect to the inner product over \mathcal{E}.
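To see Proposition 6 in action on a familiar manifold, the following small sketch (our illustration; the helpers sphere_proj and riem_grad_sphere are hypothetical names) computes the Riemannian gradient of a quadratic on the unit sphere by projecting the Euclidean gradient of a smooth extension onto the tangent space, and verifies the result is tangent at the base point.

import numpy as np

def sphere_proj(x, v):
    """Orthogonal projection onto T_x S^{n-1} = {v : x^T v = 0}."""
    return v - (x @ v) * x

def riem_grad_sphere(A, x):
    """Proposition 6 on the unit sphere: project the Euclidean gradient of the
    smooth extension f_bar(x) = 0.5 x^T A x (namely A x, for symmetric A)."""
    return sphere_proj(x, A @ x)

rng = np.random.default_rng(8)
n = 5
A = rng.standard_normal((n, n)); A = 0.5 * (A + A.T)
x = rng.standard_normal(n); x /= np.linalg.norm(x)
g = riem_grad_sphere(A, x)
print(np.isclose(x @ g, 0.0))                    # the Riemannian gradient is tangent at x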

Another crucial operation over manifolds is the retraction.

Definition 13.

[Definition 3.41, [7]] A retraction on \mathcal{M} is a smooth map \text{Retr}_{\mathcal{M}}:T\mathcal{M}\rightarrow\mathcal{M} with the following properties. For each \bm{x}\in\mathcal{M}, let \text{Retr}_{\mathcal{M},\bm{x}}:T_{\bm{x}}\mathcal{M}\rightarrow\mathcal{M} be the restriction of \text{Retr}_{\mathcal{M}} at \bm{x}, such that \text{Retr}_{\mathcal{M},\bm{x}}(\bm{v})=\text{Retr}_{\mathcal{M}}(\bm{x},\bm{v}). Then,

  1. \text{Retr}_{\mathcal{M},\bm{x}}(\bm{0})=\bm{x} for all \bm{x}\in\mathcal{M}, and

  2. D\text{Retr}_{\mathcal{M},\bm{x}}(\bm{0}):T_{\bm{x}}\mathcal{M}\rightarrow T_{\bm{x}}\mathcal{M} is the identity map: D\text{Retr}_{\mathcal{M},\bm{x}}(\bm{0})[\bm{v}]=\bm{v}.

Equivalently, each curve c(t)=\text{Retr}_{\mathcal{M},\bm{x}}(t\bm{v}) satisfies c(0)=\bm{x} and c^{\prime}(0)=\bm{v}.
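A standard example of Definition 13 is the QR-based retraction on the Stiefel manifold. The sketch below (ours; qr_retraction is a hypothetical helper name, not the retraction used later in the paper) checks the two defining properties numerically: the retraction fixes the base point at the zero tangent vector, and its differential at zero acts as the identity, seen through a finite-difference comparison.

import numpy as np

def qr_retraction(X, V):
    """QR-based retraction on the Stiefel manifold: Retr_X(V) = qf(X + V),
    the Q factor of a thin QR with the signs fixed so R has positive diagonal."""
    Q, R = np.linalg.qr(X + V)
    return Q * np.sign(np.diag(R))

rng = np.random.default_rng(9)
n, k = 6, 3
X, _ = np.linalg.qr(rng.standard_normal((n, k)))  # a point on St(n, k)
W = rng.standard_normal((n, k))
V = W - X @ (X.T @ W + W.T @ X) / 2               # project W onto T_X St(n, k)
t = 1e-6
print(np.allclose(qr_retraction(X, 0 * V), X))    # property 1: Retr_X(0) = X
print(np.linalg.norm((qr_retraction(X, t * V) - X) / t - V))  # ~O(t): D Retr_X(0) = Id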

From an optimization perspective, a retraction is a way of returning to the manifold if one has moved away from the manifold in the ambient space. A very special retraction available on each manifold is known as the Riemannian exponential map whose retraction curves are geodesics.

From classical results on differential equations, we know for all (\bm{x},\bm{v})\in T\mathcal{M} there exists a unique maximal geodesic on \mathcal{M}, \gamma_{\bm{v}}:I\rightarrow\mathcal{M}, such that \gamma_{\bm{v}}(0)=\bm{x} and \gamma_{\bm{v}}^{\prime}(0)=\bm{v}. The Riemannian exponential map is a special retraction which aligns with this unique geodesic.

Definition 14.

[Definition 10.16,[7]] Consider the following subset of the tangent bundle:

\mathcal{O}=\{(\bm{x},\bm{v})\in T\mathcal{M}\;|\;\gamma_{\bm{v}}\text{ is defined on an interval containing }[0,1]\}.

The exponential map \text{Exp}:\mathcal{O}\rightarrow\mathcal{M} is defined by \text{Exp}(\bm{x},\bm{v})=\gamma_{\bm{v}}(1).

Thus, the exponential map satisfies \text{Exp}(\bm{x},0)=\bm{x} and \frac{d}{dt}\left(\text{Exp}(\bm{x},t\bm{v})\right)|_{t=0}=\bm{v}. The exponential map is smooth over its domain and is a diffeomorphism onto its image when restricted to a potentially smaller subset of its domain. The injectivity radius of a manifold captures the size of the domain on which the exponential map is smoothly invertible.

Definition 15.

[Definitions 10.19 & 10.28 in [7]] The injectivity radius of a Riemannian manifold \mathcal{M} at a point \bm{x}, denoted by \text{inj}(\bm{x}), is the supremum over radii r>0 such that \text{Exp}(\bm{x},\cdot) is defined and is a diffeomorphism on the open ball

B(\bm{x},r)=\left\{\bm{v}\in\mathcal{T}_{\bm{x}}\mathcal{M}\;|\;\|\bm{v}\|_{\bm{x}}<r\right\}.

The injectivity radius \text{inj}(\mathcal{M}) of a Riemannian manifold \mathcal{M} is the infimum of \text{inj}(\bm{x}) over \bm{x}\in\mathcal{M}.

The injectivity radius becomes important in the analysis of our algorithm as it determines a ball on which we can always ensure a tangent vector exists which moves us between points on the manifold \mathcal{M} which are sufficiently close together. For compact Riemannian submanifolds, e.g., the Stiefel manifold or products of Stiefel manifolds, the injectivity radius over the manifold is strictly positive. Another useful property over compact Riemannian submanifolds is a Lipschitz-type condition over any defined retraction.

Proposition 7.

For a retraction \text{Retr}_{\mathcal{M}}:T\mathcal{M}\rightarrow\mathcal{M} on a compact Riemannian submanifold of \mathcal{E}, there exists L_{1}>0 such that

\|\text{Retr}_{\mathcal{M}}(\bm{x},\bm{v})-\bm{x}\|_{\mathcal{M}}\leq L_{1}\|\bm{v}\|_{\bm{x}}    (28)

for all \bm{x}\in\mathcal{M} and \bm{v}\in T_{\bm{x}}\mathcal{M}, where \|\cdot\|_{\mathcal{M}} is the norm induced by the inner product of the embedding space \mathcal{E}.

This result was proven as an intermediate step of an argument in Appendix B of [8] and was concisely stated as Proposition 2.7 in [52]. As a note, Proposition 7 holds for any valid retraction, though the constant L_{1} will depend upon the retraction. Lastly, through the use of retractions, a natural extension of the traditional gradient Lipschitz condition was proposed by Boumal et al. [8].

Definition 16.

The function f:\mathcal{M}\rightarrow\mathbb{R} has a gradient Lipschitz retraction with constant L_{R}\geq 0 with respect to the retraction \text{Retr}_{\mathcal{M}}:T\mathcal{M}\rightarrow\mathcal{M} if for all \bm{x}\in\mathcal{M} and \bm{v}\in T_{\bm{x}}\mathcal{M} we have

f(\text{Retr}_{\mathcal{M}}(\bm{x},\bm{v}))\leq f(\bm{x})+\langle\operatorname{grad}f(\bm{x}),\bm{v}\rangle_{\bm{x}}+\frac{L_{R}}{2}\|\bm{v}\|_{\bm{x}}^{2}.    (29)

This condition enables many convergence results over \mathbb{R}^{n} to be extended, relatively easily, to Riemannian manifolds. Our analysis makes heavy use of this condition, and, in the case of compact Riemannian submanifolds, Lemma 2.7 of [8] shows this condition follows, with minor additional assumptions, from the classic gradient Lipschitz condition.
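For intuition, one can probe (29) empirically for a specific function and retraction. The sketch below (our illustration only; sphere_retract is a hypothetical helper and the estimated constant is not a proof of (29)) samples random points and tangent directions on the sphere with a quadratic objective and records the smallest L_{R} consistent with the sampled pairs.

import numpy as np

def sphere_retract(x, v):
    """Metric-projection retraction on S^{n-1}: Retr_x(v) = (x + v) / ||x + v||."""
    w = x + v
    return w / np.linalg.norm(w)

rng = np.random.default_rng(10)
n = 5
A = rng.standard_normal((n, n)); A = 0.5 * (A + A.T)
f = lambda x: 0.5 * x @ A @ x

# Empirically estimate the smallest L_R satisfying (29) over random (x, v).
ratios = []
for _ in range(1000):
    x = rng.standard_normal(n); x /= np.linalg.norm(x)
    v = rng.standard_normal(n); v -= (x @ v) * x           # a tangent direction at x
    grad = A @ x - (x @ A @ x) * x                         # Riemannian gradient (Prop. 6)
    gap = f(sphere_retract(x, v)) - f(x) - grad @ v
    ratios.append(2.0 * gap / (v @ v))
print(max(ratios))   # any L_R at least this value satisfies (29) on the sampled pairs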

4.2 Additional Definitions and Notation

To facilitate our discussion, some additional notation and conventions must be set in order. For 𝒄\bm{c} and 𝒈\bm{g} we denote their component functions as 𝒄(𝒙,𝒚)=[c1(𝒙,𝒚),,cp(𝒙,𝒚)]\bm{c}(\bm{x},\bm{y})=[c_{1}(\bm{x},\bm{y}),\ldots,c_{p}(\bm{x},\bm{y})]^{\top} and 𝒈(𝒚)=[g1(𝒚),,gs(𝒚)]\bm{g}(\bm{y})=[g_{1}(\bm{y}),\ldots,g_{s}(\bm{y})]^{\top}, where ci:×nc_{i}:\mathcal{M}\times\mathbb{R}^{n}\rightarrow\mathbb{R} and gj:ng_{j}:\mathbb{R}^{n}\rightarrow\mathbb{R} for i=1,,pi=1,\ldots,p and j=1,,sj=1,\ldots,s. We let yf\nabla_{y}f and yci\nabla_{y}c_{i} denote the Euclidean gradients of ff and cic_{i} with respect to 𝒚\bm{y}. The Jacobian of 𝒄\bm{c} is denoted

y𝒄(𝒙,𝒚)=[yc1(𝒙,𝒚)||ycp(𝒙,𝒚)]n×p.\nabla_{y}\bm{c}(\bm{x},\bm{y})=\bigg{[}\nabla_{y}c_{1}(\bm{x},\bm{y})\;|\cdots\;|\;\nabla_{y}c_{p}(\bm{x},\bm{y})\bigg{]}\in\mathbb{R}^{n\times p}.

The feasible region of (24) is defined as

𝒞:={(𝒙,𝒚)×n|𝒄(𝒙,𝒚)=𝟎,𝒈(𝒚)𝟎}.\mathcal{C}:=\left\{(\bm{x},\bm{y})\in\mathcal{M}\times\mathbb{R}^{n}\;|\;\bm{c}(\bm{x},\bm{y})=\bm{0},\;\bm{g}(\bm{y})\leq\bm{0}\right\}. (30)

We further define the following slices of the feasible region. For all 𝒙\bm{x}\in\mathcal{M} and 𝒚n\bm{y}\in\mathbb{R}^{n} we define the sets:

𝒞y(𝒙):={𝒚n|𝒄(𝒙,𝒚)=𝟎,𝒈(𝒚)𝟎},\mathcal{C}_{y}(\bm{x}):=\left\{\bm{y}\in\mathbb{R}^{n}\;|\;\bm{c}(\bm{x},\bm{y})=\bm{0},\;\bm{g}(\bm{y})\leq\bm{0}\right\}, (31)
𝒞x(𝒚):={𝒙|𝒄(𝒙,𝒚)=𝟎}.\mathcal{C}_{x}(\bm{y}):=\left\{\bm{x}\in\mathcal{M}\;|\;\bm{c}(\bm{x},\bm{y})=\bm{0}\right\}. (32)

The level set of ff at the point (𝒙¯,𝒚¯)𝒞(\bar{\bm{x}},\bar{\bm{y}})\in\mathcal{C} is given by

L(𝒙¯,𝒚¯):={(𝒙,𝒚)𝒞|f(𝒙,𝒚)f(𝒙¯,𝒚¯)}.L(\bar{\bm{x}},\bar{\bm{y}}):=\{(\bm{x},\bm{y})\in\mathcal{C}\;|\;f(\bm{x},\bm{y})\leq f(\bar{\bm{x}},\bar{\bm{y}})\}.

We define :=×n\mathcal{M}^{\prime}:=\mathcal{M}\times\mathbb{R}^{n} and shall often let 𝒛\bm{z} denote elements of \mathcal{M}^{\prime}. Letting z¯=(𝒙¯,𝒚¯)\bar{z}=(\bar{\bm{x}},\bar{\bm{y}}), we will frequently abbreviate L(𝒙¯,𝒚¯)L(\bar{\bm{x}},\bar{\bm{y}}) as L(𝒛¯)L(\bar{\bm{z}}). The set of active inequality constraints at (𝒙,𝒚)(\bm{x},\bm{y}) is the set 𝒜(𝒚):={i|gi(𝒚)=0,i=1,,s}\mathcal{A}(\bm{y}):=\{i\;|\;g_{i}(\bm{y})=0,\;i=1,\ldots,s\}.

4.3 A Constrained Manifold Reformulation

To present a clear formulation of optimality conditions for (24), we recast the model as a nonlinear programming model over a Riemannian manifold. Remembering =×n\mathcal{M}^{\prime}=\mathcal{M}\times\mathbb{R}^{n}, we let i1:i_{1}:\mathcal{M}^{\prime}\rightarrow\mathcal{M} be the projection onto \mathcal{M}, i.e., i1(𝒙,𝒚)=𝒙i_{1}(\bm{x},\bm{y})=\bm{x}, and i2:ni_{2}:\mathcal{M}^{\prime}\rightarrow\mathbb{R}^{n} be the projection onto n\mathbb{R}^{n}, i.e., i2(𝒙,𝒚)=𝒚i_{2}(\bm{x},\bm{y})=\bm{y}. Then, with 𝒛=(𝒙,𝒚)\bm{z}=(\bm{x},\bm{y})\in\mathcal{M}^{\prime}, we define the functions

f¯(𝒛):=f(i1(𝒛),i2(𝒛))=f(𝒙,𝒚),c¯j(𝒛):=cj(i1(𝒛),i2(𝒛)), and g¯k(𝒛):=gk(i1(𝒛),i2(𝒛)),\bar{f}(\bm{z}):=f(i_{1}(\bm{z}),i_{2}(\bm{z}))=f(\bm{x},\bm{y}),\;\;\;\bar{c}_{j}(\bm{z}):=c_{j}(i_{1}(\bm{z}),i_{2}(\bm{z})),\;\text{ and }\;\bar{g}_{k}(\bm{z}):=g_{k}(i_{1}(\bm{z}),i_{2}(\bm{z})),

for j=1,,pj=1,\ldots,p and k=1,,sk=1,\ldots,s. We then equivalently rewrite (24) as

min\displaystyle\min f¯(𝒛)\displaystyle\;\bar{f}(\bm{z}) (33)
s.t. 𝒄¯(𝒛)=𝟎,\displaystyle\;\bar{\bm{c}}(\bm{z})=\bm{0},
𝒈¯(𝒛)𝟎,\displaystyle\;\bar{\bm{g}}(\bm{z})\leq\bm{0},
𝒛.\displaystyle\;\bm{z}\in\mathcal{M}^{\prime}.

This is a nonlinear program over a Riemannian manifold, and programs of this form have only recently begun to be studied in the literature. Necessary and sufficient optimality conditions for this problem class were derived in 2014 by Yang et al. [49]. Since that paper, other investigations have developed further theory and algorithms [6, 35, 28, 48, 40, 38]. From these developments, we state the first-order necessary conditions for (33).

Definition 17 (First-Order KKT Points [49]).

The point 𝐳\bm{z}^{*}\in\mathcal{M}^{\prime} such that 𝐜¯(𝐳)=𝟎\bar{\bm{c}}(\bm{z}^{*})=\bm{0} and 𝐠¯(𝐳)𝟎\bar{\bm{g}}(\bm{z}^{*})\leq\bm{0} is a first-order KKT point of (33) if there exists 𝛌p\bm{\lambda}^{*}\in\mathbb{R}^{p} and 𝛍+s\bm{\mu}^{*}\in\mathbb{R}^{s}_{+} such that

\displaystyle\operatorname{grad}\bar{f}(\bm{z}^{*})+\sum_{i=1}^{p}\lambda_{i}^{*}\cdot\operatorname{grad}\bar{c}_{i}(\bm{z}^{*})+\sum_{j=1}^{s}\mu^{*}_{j}\cdot\operatorname{grad}\bar{g}_{j}(\bm{z}^{*})=\bm{0},
μjg¯j(𝒛)=0,j=1,,s.\displaystyle\mu_{j}^{*}\cdot\bar{g}_{j}(\bm{z}^{*})=0,\;j=1,\ldots,s.\; (34)

Similarly, a feasible point 𝐳\bm{z}^{*}\in\mathcal{M}^{\prime} is an ϵ\epsilon-approximate first-order KKT point of (33) provided there exists 𝛌p\bm{\lambda}^{*}\in\mathbb{R}^{p} and 𝛍+s\bm{\mu}^{*}\in\mathbb{R}^{s}_{+} such that

\displaystyle\bigg{\|}\operatorname{grad}\bar{f}(\bm{z}^{*})+\sum_{i=1}^{p}\lambda_{i}^{*}\cdot\operatorname{grad}\bar{c}_{i}(\bm{z}^{*})+\sum_{j=1}^{s}\mu^{*}_{j}\cdot\operatorname{grad}\bar{g}_{j}(\bm{z}^{*})\bigg{\|}_{\bm{z}^{*}}\leq\epsilon, (35)
μjg¯j(𝒛)=0,j=1,,s.\displaystyle\mu_{j}^{*}\cdot\bar{g}_{j}(\bm{z}^{*})=0,\;j=1,\ldots,s.\;

Often constraint qualifications are necessary to strengthen results. In this work, we shall assume the linear independence constraint qualification (LICQ) for manifold optimization, which is analogous to the standard property of the same name in Euclidean space.

Definition 18 (LICQ [49]).

The linear independence constraint qualification (LICQ) is said to hold at \bm{z}\in\mathcal{M}^{\prime} with \bar{\bm{c}}(\bm{z})=\bm{0} and \bar{\bm{g}}(\bm{z})\leq\bm{0} provided \{\operatorname{grad}\bar{c}_{i}(\bm{z})\}_{i=1}^{p}\cup\{\operatorname{grad}\bar{g}_{j}(\bm{z})\}_{j\in\mathcal{A}(\bm{y})} forms a linearly independent set in \mathcal{T}_{\bm{z}}\mathcal{M}^{\prime}.

Under the LICQ, we obtain the fact that (34) is a first-order necessary condition.

Proposition 8 (Theorem 4.1, [49]).

If \bm{z}^{*} is a local minimum of (33) and the LICQ holds at \bm{z}^{*}, then there exist multipliers \bm{\lambda}^{*}\in\mathbb{R}^{p} and \bm{\mu}^{*}\in\mathbb{R}^{s}_{+} such that (34) holds.

In Section 6 we present our new algorithm, which, through the lens of constrained Riemannian optimization, can escape stationary points of (24) and converge to first-order KKT points of the equivalent reformulation (33). Since finding exact stationary points and KKT points is in general an intractable task, we further develop notions of approximate stationarity and approximate KKT points in the upcoming section.

5 Approximate Stationary and KKT Points

From investigating (24) and (33), we recognize there are multiple manners of measuring approximate optimality, and our algorithm development takes advantage of these different approaches. Here we outline different types of approximate solutions and our proposed schemes for measuring them.

Definition 19 (Approximate Stationary Points in 𝒙\bm{x}).

The point (𝐱,𝐲)𝒞(\bm{x}^{*},\bm{y}^{*})\in\mathcal{C} is a first-order stationary point of (24) with respect to 𝐱\bm{x} if there exists 𝛌p\bm{\lambda}^{*}\in\mathbb{R}^{p} such that

gradf(𝒙,𝒚)+i=1pλigradci(𝒙,𝒚)=𝟎.\operatorname{grad}f(\bm{x}^{*},\bm{y}^{*})+\sum_{i=1}^{p}\lambda_{i}^{*}\cdot\operatorname{grad}c_{i}(\bm{x}^{*},\bm{y}^{*})=\bm{0}. (36)

We say (𝐱,𝐲)𝒞(\bm{x}^{*},\bm{y}^{*})\in\mathcal{C} is an ϵ\epsilon-approximate first-order stationary point of (24) with respect to 𝐱\bm{x} if there exists 𝛌p\bm{\lambda}^{*}\in\mathbb{R}^{p} such that

gradf(𝒙,𝒚)+i=1pλigradci(𝒙,𝒚)𝒙ϵ.\bigg{\|}\operatorname{grad}f(\bm{x}^{*},\bm{y}^{*})+\sum_{i=1}^{p}\lambda_{i}^{*}\cdot\operatorname{grad}c_{i}(\bm{x}^{*},\bm{y}^{*})\bigg{\|}_{\bm{x}^{*}}\leq\epsilon. (37)
Definition 20 (Approximate Stationary Points in 𝒚\bm{y}).

The point (𝐱,𝐲)𝒞(\bm{x}^{*},\bm{y}^{*})\in\mathcal{C} is a first-order stationary point of (24) with respect to 𝐲\bm{y} if there exists 𝛌p\bm{\lambda}^{*}\in\mathbb{R}^{p} and 𝛍+s\bm{\mu}^{*}\in\mathbb{R}^{s}_{+} such that

\displaystyle\nabla_{y}f(\bm{x}^{*},\bm{y}^{*})+\sum_{i=1}^{p}\lambda_{i}^{*}\cdot\nabla_{y}c_{i}(\bm{x}^{*},\bm{y}^{*})+\sum_{j=1}^{s}\mu_{j}^{*}\cdot\nabla g_{j}(\bm{y}^{*})=\bm{0}, (38)
μjgj(𝒚)=0,j=1,,s.\displaystyle\mu_{j}^{*}\cdot g_{j}(\bm{y}^{*})=0,\;j=1,\ldots,s.\;

We say (𝐱,𝐲)𝒞(\bm{x}^{*},\bm{y}^{*})\in\mathcal{C} is an ϵ\epsilon-approximate first-order stationary point of (24) with respect to 𝐲\bm{y} provided there exists 𝛌p\bm{\lambda}^{*}\in\mathbb{R}^{p} and 𝛍+s\bm{\mu}^{*}\in\mathbb{R}^{s}_{+} such that

\displaystyle\bigg{\|}\nabla_{y}f(\bm{x}^{*},\bm{y}^{*})+\sum_{i=1}^{p}\lambda_{i}^{*}\cdot\nabla_{y}c_{i}(\bm{x}^{*},\bm{y}^{*})+\sum_{j=1}^{s}\mu_{j}^{*}\cdot\nabla g_{j}(\bm{y}^{*})\bigg{\|}\leq\epsilon, (39)
μjgj(𝒚)=0,j=1,,s.\displaystyle\mu_{j}^{*}\cdot g_{j}(\bm{y}^{*})=0,\;j=1,\ldots,s.

Combining these definitions, we obtain our notion of stationary points for (24).

Definition 21 (Approximate Stationary Points of (24)).

The point (𝐱,𝐲)𝒞(\bm{x}^{*},\bm{y}^{*})\in\mathcal{C} is a first-order stationary point of (24) if both (36) and (38) hold for some set of multipliers; the point (𝐱,𝐲)𝒞(\bm{x}^{*},\bm{y}^{*})\in\mathcal{C} is an ϵ\epsilon-approximate first-order stationary point of (24) if both (37) and (39) hold.

Clearly, it is possible for a point to be an approximate stationary point of (24) and fail to be a local minimum. So, we consider the stronger constrained Riemannian KKT conditions given in Definition 17. However, in order to design algorithms which provide guarantees, we must be able to measure these conditions; we introduce the next result for this purpose.

Proposition 9.

Define the following measures:

mx(𝒙,𝒚):=|min𝒅𝒯𝒙{gradf(𝒙,𝒚),𝒅𝒙|grad𝒄i(𝒙,𝒚),𝒅𝒙=0,i=1,,p,𝒅𝒙1}|,m_{x}(\bm{x},\bm{y}):=\bigg{|}\;\underset{\bm{d}\in\mathcal{T}_{\bm{x}}\mathcal{M}}{\min}\{\langle\operatorname{grad}f(\bm{x},\bm{y}),\bm{d}\rangle_{\bm{x}}\;|\;\langle\operatorname{grad}\bm{c}_{i}(\bm{x},\bm{y}),\bm{d}\rangle_{\bm{x}}=0,\;i=1,\ldots,p,\;\|\bm{d}\|_{\bm{x}}\leq 1\}\bigg{|}, (40)
my(𝒙,𝒚):=|min𝒅n{yf(𝒙,𝒚),𝒅|yci(𝒙,𝒚),𝒅=0,i=1,,p,ygj(𝒙,𝒚),𝒅0,j𝒜(𝒚),𝒅1}|,m_{y}(\bm{x},\bm{y}):=\bigg{|}\;\underset{\bm{d}\in\mathbb{R}^{n}}{\min}\{\langle\nabla_{y}f(\bm{x},\bm{y}),\bm{d}\rangle\;|\;\\ \langle\nabla_{y}c_{i}(\bm{x},\bm{y}),\bm{d}\rangle=0,\;i=1,\ldots,p,\;\langle\nabla_{y}g_{j}(\bm{x},\bm{y}),\bm{d}\rangle\leq 0,\;j\in\mathcal{A}(\bm{y}),\;\|\bm{d}\|\leq 1\}\bigg{|}, (41)
mKKT(𝒙,𝒚):=|min𝒅𝒯𝒛{gradf¯(𝒛),𝒅𝒛|gradc¯i(𝒛),𝒅𝒛=0,i=1,,p,gradg¯j(𝒛),𝒅𝒛0,j𝒜(𝒛),𝒅𝒛1}|.m_{KKT}(\bm{x},\bm{y}):=\bigg{|}\;\underset{\bm{d}\in\mathcal{T}_{\bm{z}}\mathcal{M}^{\prime}}{\min}\{\langle\operatorname{grad}\bar{f}(\bm{z}),\bm{d}\rangle_{\bm{z}}\;|\;\\ \langle\operatorname{grad}\bar{c}_{i}(\bm{z}),\bm{d}\rangle_{\bm{z}}=0,\;i=1,\ldots,p,\langle\operatorname{grad}\bar{g}_{j}(\bm{z}),\bm{d}\rangle_{\bm{z}}\leq 0,\;j\in\mathcal{A}(\bm{z}),\;\|\bm{d}\|_{\bm{z}}\leq 1\}\bigg{|}. (42)

Assume the LICQ holds at (\bm{x}^{*},\bm{y}^{*})\in\mathcal{C}. If \max\left\{m_{x}(\bm{x}^{*},\bm{y}^{*}),m_{y}(\bm{x}^{*},\bm{y}^{*})\right\}\leq\epsilon, then (\bm{x}^{*},\bm{y}^{*}) is an \epsilon-approximate first-order stationary point of (24), and if m_{KKT}(\bm{x}^{*},\bm{y}^{*})\leq\epsilon, then (\bm{x}^{*},\bm{y}^{*}) is an \epsilon-approximate first-order KKT point of (24), viewing the problem as the constrained Riemannian optimization problem (33).

Proof.

The proof of this claim is an exercise in duality with the utilization of the LICQ, i.e., Definition 18. Under this assumption, we see from the form of the product manifold \mathcal{M}^{\prime} that {gradci(𝒙,𝒚)}i=1p\{\text{grad}\;c_{i}(\bm{x}^{*},\bm{y}^{*})\}_{i=1}^{p} forms a linearly independent set over 𝒯𝒙\mathcal{T}_{\bm{x}^{*}}\mathcal{M}; therefore, forming the dual problem defining (40), we obtain

max𝝀pmin𝒅𝒙1gradf(𝒙,𝒚)+i=1pλigradci(𝒙,𝒚),𝒅𝒙=min𝝀pgradf(𝒙,𝒚)+i=1pλigradci(𝒙,𝒚)𝒙=mx(𝒙,𝒚),\underset{\bm{\lambda}\in\mathbb{R}^{p}}{\max}\;\underset{\|\bm{d}\|_{\bm{x}^{*}}\leq 1}{\min}\;\bigg{\langle}\text{grad}\;f(\bm{x}^{*},\bm{y}^{*})+\sum_{i=1}^{p}\;\lambda_{i}\;\text{grad}\;c_{i}(\bm{x}^{*},\bm{y}^{*}),\bm{d}\bigg{\rangle}_{\bm{x}^{*}}\\ =\underset{\bm{\lambda}\in\mathbb{R}^{p}}{\min}\;\bigg{\|}\text{grad}\;f(\bm{x}^{*},\bm{y}^{*})+\sum_{i=1}^{p}\;\lambda_{i}\;\text{grad}\;c_{i}(\bm{x}^{*},\bm{y}^{*})\bigg{\|}_{\bm{x}^{*}}=m_{x}(\bm{x}^{*},\bm{y}^{*}), (43)

where the last equality follows by strong duality. Thus, m_{x}(\bm{x},\bm{y}) provides an exact measure of the approximate stationarity with respect to \bm{x} per Definition 19. In a similar fashion, since the LICQ implies the Mangasarian–Fromovitz constraint qualification (MFCQ) over manifolds [6], we see the dual problem of the convex program defining (42) is

max𝝀p,𝝁+|𝒜(𝒚)|min𝒅𝒛1gradf¯(𝒛)+i=1pλigradc¯i(𝒛)+j𝒜(𝒚)μjgradg¯j(𝒛),𝒅𝒛\displaystyle\underset{\bm{\lambda}\in\mathbb{R}^{p},\;\bm{\mu}\in\mathbb{R}^{|\mathcal{A}(\bm{y}^{*})|}_{+}}{\max}\;\;\underset{\|\bm{d}\|_{\bm{z}^{*}}\leq 1}{\min}\;\bigg{\langle}\text{grad}\;\bar{f}(\bm{z}^{*})+\sum_{i=1}^{p}\;\lambda_{i}\;\text{grad}\;\bar{c}_{i}(\bm{z}^{*})+\sum_{j\in\mathcal{A}(\bm{y}^{*})}\;\mu_{j}\;\text{grad}\;\bar{g}_{j}(\bm{z}^{*})\;,\bm{d}\bigg{\rangle}_{\bm{z}^{*}}
\displaystyle=\underset{\bm{\lambda}\in\mathbb{R}^{p},\;\bm{\mu}\in\mathbb{R}^{|\mathcal{A}(\bm{y}^{*})|}_{+}}{\min}\;\bigg{\|}\text{grad}\;\bar{f}(\bm{z}^{*})+\sum_{i=1}^{p}\;\lambda_{i}\;\text{grad}\;\bar{c}_{i}(\bm{z}^{*})+\sum_{j\in\mathcal{A}(\bm{y}^{*})}\;\mu_{j}\;\text{grad}\;\bar{g}_{j}(\bm{z}^{*})\bigg{\|}_{\bm{z}^{*}}
=mKKT(𝒙,𝒚),\displaystyle=m_{KKT}(\bm{x}^{*},\bm{y}^{*}),

where, as before, the last equality follows by strong duality due to the constraint qualification. Note, throughout we let \bm{z}^{*}=(\bm{x}^{*},\bm{y}^{*}). Looking at Definition 17, we then see by setting \mu_{j}=0 for all j\in\{1,\ldots,s\}\setminus\mathcal{A}(\bm{y}^{*}) that the condition m_{KKT}(\bm{x}^{*},\bm{y}^{*})\leq\epsilon is equivalent to (35). The argument proving the equivalence of (41) and (39) is similar, and we leave it to the reader. ∎

Proposition 9 provides a practical manner of measuring the optimality of a proposed feasible point without directly computing any dual variables. Now, in order to provide a convergence rate for our proposed method, we require a slightly altered version of the approximate stationary and KKT points. This new definition is only required for the y-subproblem and the z-subproblem due to the inequality constraints, and it requires a notion of “almost” active constraints.

Definition 22.

Given a set of inequality constraints g_{j}(\bm{y})\leq 0 for j=1,\ldots,s and \delta\geq 0, the set of \delta-almost active constraints is \mathcal{A}^{\delta}(\bm{y}):=\{j\in\{1,\ldots,s\}\;|\;g_{j}(\bm{y})\in[-\delta,0]\}. Under this definition, we define the following variants of m_{y} and m_{\text{KKT}}:

myδ(𝒙,𝒚):=|min𝒅n{yf(𝒙,𝒚),𝒅|yci(𝒙,𝒚),𝒅=0,i=1,,p,ygj(𝒙,𝒚),𝒅0,j𝒜δ(𝒚),𝒅1}|,m^{\delta}_{y}(\bm{x},\bm{y}):=\bigg{|}\;\underset{\bm{d}\in\mathbb{R}^{n}}{\min}\{\langle\nabla_{y}f(\bm{x},\bm{y}),\bm{d}\rangle\;|\;\\ \langle\nabla_{y}c_{i}(\bm{x},\bm{y}),\bm{d}\rangle=0,\;i=1,\ldots,p,\;\langle\nabla_{y}g_{j}(\bm{x},\bm{y}),\bm{d}\rangle\leq 0,\;j\in\mathcal{A}^{\delta}(\bm{y}),\;\|\bm{d}\|\leq 1\}\bigg{|}, (44)
mKKTδ(𝒙,𝒚):=|min𝒅𝒯𝒛{gradf¯(𝒛),𝒅𝒛|gradc¯i(𝒛),𝒅𝒛=0,i=1,,p,gradg¯j(𝒛),𝒅𝒛0,j𝒜δ(𝒛),𝒅𝒛1}|.m_{KKT}^{\delta}(\bm{x},\bm{y}):=\bigg{|}\;\underset{\bm{d}\in\mathcal{T}_{\bm{z}}\mathcal{M}^{\prime}}{\min}\{\langle\operatorname{grad}\bar{f}(\bm{z}),\bm{d}\rangle_{\bm{z}}\;|\;\\ \langle\operatorname{grad}\bar{c}_{i}(\bm{z}),\bm{d}\rangle_{\bm{z}}=0,\;i=1,\ldots,p,\langle\operatorname{grad}\bar{g}_{j}(\bm{z}),\bm{d}\rangle_{\bm{z}}\leq 0,\;j\in\mathcal{A}^{\delta}(\bm{z}),\;\|\bm{d}\|_{\bm{z}}\leq 1\}\bigg{|}. (45)

We then say a point is a (\delta,\epsilon)-approximate stationary point with respect to \bm{y} if m_{y}^{\delta}(\bm{x},\bm{y})\leq\epsilon and a (\delta,\epsilon)-approximate first-order KKT point if m_{KKT}^{\delta}(\bm{x},\bm{y})\leq\epsilon. Note, the original definitions of m_{y} and m_{KKT} are recovered if \delta is set to zero.

There is no fundamental difference between computing (44) and (45) and computing the prior measures (41) and (42), though possibly more constraints must be included. This definition is necessary for us to obtain a convergence rate for our method because it enables us to determine a stepsize which can be taken in our algorithm without violating our error bound conditions for the inequality constraints.

Remark 5.1.

We shall compute minimizers of the subproblems defining (40), (41), and (42) in order to obtain directions which decrease the value of the objective function. It is crucial to recognize all of these models are convex programs with linear objectives and therefore are tractable subproblems which can be handled by the efficient solution procedures available in standard optimization solvers.
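To make Remark 5.1 concrete, the sketch below poses the convex program defining (44) (and hence (41) when \delta=0) in CVXPY. The gradient arrays, the constraint values, and the routine name are our own illustrative conventions rather than part of the formal model; any conic solver shipped with CVXPY can handle this linear-objective program.

```python
import cvxpy as cp
import numpy as np

def stationarity_measure_y(grad_f_y, grad_c_y, grad_g_y, g_vals, delta=0.0):
    """Solve the convex program defining m_y^delta in (44).

    grad_f_y : (n,)   gradient of f with respect to y at (x, y)
    grad_c_y : (n, p) columns are the gradients of c_i(x, y) in y
    grad_g_y : (n, s) columns are the gradients of g_j(y)
    g_vals   : (s,)   values g_j(y) at the (feasible) point y
    Returns (m_y^delta, d_y) where d_y attains the minimum.
    """
    n = grad_f_y.shape[0]
    d = cp.Variable(n)
    constraints = [cp.norm(d, 2) <= 1,
                   grad_c_y.T @ d == 0]
    # delta-almost active set: feasibility gives g_vals <= 0 already.
    almost_active = np.where(g_vals >= -delta)[0]
    if almost_active.size > 0:
        constraints.append(grad_g_y[:, almost_active].T @ d <= 0)
    prob = cp.Problem(cp.Minimize(grad_f_y @ d), constraints)
    prob.solve()
    return abs(prob.value), d.value
```

The measures (40), (42), and (45) admit the same treatment once the relevant tangent space is parameterized and the appropriate Riemannian gradients are supplied.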

Algorithm 1 Staged-BCD for (24)

Input: Initial point (\bm{x}^{0},\bm{y}^{0})\in\mathcal{C}; positive tolerances \epsilon_{1},\epsilon_{2}; nonnegative tolerances \delta_{1},\delta_{2}; constants \alpha,\gamma\in(0,1); base step-length t_{b}>0

Steps:

1: For k=0,1,\ldots do

2:       t_{k}=t_{b}

3:       If m^{\delta_{1}}_{y}(\bm{x}^{k},\bm{y}^{k})>\epsilon_{1} do

4:             Set \bm{d}_{y}^{k} to be a minimizer which defines m^{\delta_{1}}_{y}(\bm{x}^{k},\bm{y}^{k})

5:             While f(\bm{x}^{k},\Pi_{\mathcal{C}_{y}}(\bm{y}^{k}+t_{k}\bm{d}_{y}^{k};\bm{x}^{k}))>f(\bm{x}^{k},\bm{y}^{k})-\alpha t_{k}m^{\delta_{1}}_{y}(\bm{x}^{k},\bm{y}^{k}) do

6:                    t_{k}=\gamma\cdot t_{k}

7:             End While

8:             Set \bm{x}^{k+1}=\bm{x}^{k} and \bm{y}^{k+1}=\Pi_{\mathcal{C}_{y}}(\bm{y}^{k}+t_{k}\bm{d}_{y}^{k};\bm{x}^{k})

9:       ElseIf m_{x}(\bm{x}^{k},\bm{y}^{k})>\epsilon_{1} do

10:             Set \bm{d}_{x}^{k} to be a minimizer which defines m_{x}(\bm{x}^{k},\bm{y}^{k})

11:             While f(\Pi_{\mathcal{C}_{x}}(\text{\rm Retr}_{\mathcal{M}}(\bm{x}^{k},t_{k}\bm{d}_{x}^{k});\bm{y}^{k}),\bm{y}^{k})>f(\bm{x}^{k},\bm{y}^{k})-\alpha t_{k}m_{x}(\bm{x}^{k},\bm{y}^{k}) do

12:                    t_{k}=\gamma\cdot t_{k}

13:             End While

14:             Set \bm{x}^{k+1}=\Pi_{\mathcal{C}_{x}}(\text{\rm Retr}_{\mathcal{M}}(\bm{x}^{k},t_{k}\bm{d}_{x}^{k});\bm{y}^{k}) and \bm{y}^{k+1}=\bm{y}^{k}

15:       ElseIf m_{KKT}^{\delta_{2}}(\bm{x}^{k},\bm{y}^{k})>\epsilon_{2} do

16:             Set \bm{d}^{k} to be a minimizer which defines m_{KKT}^{\delta_{2}}(\bm{x}^{k},\bm{y}^{k})

17:             While f(\Pi_{\mathcal{C}}(\text{\rm Retr}_{\mathcal{M}^{\prime}}(\bm{z}^{k},t_{k}\bm{d}^{k})))>f(\bm{x}^{k},\bm{y}^{k})-\alpha t_{k}m_{KKT}^{\delta_{2}}(\bm{x}^{k},\bm{y}^{k}) do

18:                    t_{k}=\gamma\cdot t_{k}

19:             End While

20:             (\bm{x}^{k+1},\bm{y}^{k+1})=\Pi_{\mathcal{C}}(\text{\rm Retr}_{\mathcal{M}^{\prime}}(\bm{z}^{k},t_{k}\bm{d}^{k}))

21:       Else

22:             Return (\bm{x}^{k},\bm{y}^{k}) as solution.

23:       End If

24: End For

6 Algorithm

We now present our algorithm for solving the block manifold optimization model (24) and describe how it computes approximate first-order KKT points of the constrained Riemannian optimization reformulation (33). Our procedure, presented in Algorithm 1, has three different updating phases: a \bm{y}-update, an \bm{x}-update, and a joint update.

The first updating phase in the algorithm is the 𝒚\bm{y}-updating phase. We let 𝒙k\bm{x}^{k}\in\mathcal{M} be fixed and seek to improve the value of the objective function with respect to 𝒚\bm{y} by computing a descent direction and then projecting back onto the feasible region. The projection in this step is defined as

Π𝒞y(𝒚;𝒙)argmin𝒚¯𝒞𝒚(𝒙){𝒚¯𝒚}.\Pi_{\mathcal{C}_{y}}(\bm{y};\bm{x})\in\operatorname*{argmin}_{\bar{\bm{y}}\in\mathcal{C}_{\bm{y}}(\bm{x})}\;\left\{\|\bar{\bm{y}}-\bm{y}\|\right\}. (46)

There are a few reasons we do the y-update first. First, in the event \{\bm{y}\in\mathbb{R}^{n}\;|\;\bm{c}(\bm{x},\bm{y})=\bm{0},\;\bm{g}(\bm{y})\leq\bm{0}\} is a convex set for any fixed \bm{x}\in\mathcal{M}, the projection operation is relatively simple because it becomes the projection onto a convex subset of \mathbb{R}^{n}. We shall see important examples, e.g., QCQPs, where this occurs. Second, there is no explicit reference to a manifold in this projection, which reduces the computational difficulty in most cases. Third, given our applications of interest, we have \mathcal{M} as either the Stiefel manifold or a product of Stiefel manifolds; thus, the dimension of \mathcal{M} is typically larger than n. So, the memory cost of operating over \bm{y} can be significantly less than over \bm{x}. Finally, given the form of Algorithm 1, placing the y-update first means it will likely be undertaken the most, further amplifying the cost savings.
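Purely as an illustration of the control flow just described, a Python sketch of the \bm{y}-updating phase is given below; f, m_y_and_dir, and project_Cy are placeholder callables we introduce for exposition (the last standing in for the projection (46)), and the default tolerances are arbitrary.

```python
def y_update(f, x, y, m_y_and_dir, project_Cy,
             eps1=1e-3, alpha=0.5, gamma=0.5, t_base=1.0):
    """One y-updating phase of Algorithm 1 (sketch only).

    f            : callable evaluating f(x, y)
    m_y_and_dir  : callable returning (m_y^{delta_1}(x, y), d_y)
    project_Cy   : callable projecting a point onto C_y(x), cf. (46)
    Returns the updated y (or y unchanged if the phase is skipped).
    """
    m_val, d_y = m_y_and_dir(x, y)
    if m_val <= eps1:
        return y                        # phase not entered
    t = t_base
    f_cur = f(x, y)
    y_trial = project_Cy(y + t * d_y, x)
    # Backtracking line search with the sufficient-decrease test of Algorithm 1.
    while f(x, y_trial) > f_cur - alpha * t * m_val:
        t *= gamma
        y_trial = project_Cy(y + t * d_y, x)
    return y_trial
```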

The second phase in the algorithm is the 𝒙\bm{x}-updating phase. This phase is similar to the 𝒚\bm{y}-update in the sense that 𝒚k\bm{y}^{k} is fixed and we seek to decrease ff by updating 𝒙\bm{x} through the computation of a descent direction and performing line-search in that direction with projections back onto the feasible region. The projection in this phase is defined as

Π𝒞x(𝒙;𝒚)argmin𝒙¯𝒞𝒙(𝒚){𝒙¯𝒙}.\Pi_{\mathcal{C}_{x}}(\bm{x};\bm{y})\in\operatorname*{argmin}_{\bar{\bm{x}}\in\mathcal{C}_{\bm{x}}(\bm{y})}\;\left\{\|\bar{\bm{x}}-\bm{x}\|_{\mathcal{M}}\right\}. (47)

This phase shall only be entered if m_{y}^{\delta_{1}}(\bm{x}^{k},\bm{y}^{k})\leq\epsilon_{1}, so it is unlikely to be visited as frequently as the \bm{y}-update step. A discussion of this projection operation is given in the coming sections.

The third and final phase of the algorithm updates \bm{x} and \bm{y} jointly. This phase is only entered when it appears progress cannot be made by further updating only \bm{x} and \bm{y} sequentially. Similar to the prior two phases, a joint descent direction is computed by treating the original problem as a constrained Riemannian optimization problem over the product manifold \mathcal{M}^{\prime}. In this phase, a point \bm{z}\in\mathcal{M}^{\prime} is projected onto the set defined by both the functional constraints and the manifold; the projection is defined as

\Pi_{\mathcal{C}}(\bm{z})\in\operatorname*{argmin}_{\bar{\bm{z}}\in\mathcal{C}}\;\left\{\|\bar{\bm{z}}-\bm{z}\|_{\mathcal{M}^{\prime}}\right\}. (48)

This projection operation shall also be discussed in the forthcoming sections.

6.1 A Feasible Constrained Riemannian Optimization Method

By design, Algorithm 1 is a feasible method, i.e., every iterate generated by the procedure satisfies the original constraints. Other methods such as penalty, augmented Lagrangian, sequential quadratic programming, primal-dual algorithms, and interior-point methods for constrained Riemannian optimization cannot return a feasible solution until the algorithm converges to a solution [51, 35, 28, 48, 40, 38, 4]; however, this might never occur or could potentially take a prohibitively long period of time. Our algorithm is one of the few feasible constrained Riemannian optimization methods. Recently, Weber and Sra [46] developed a Frank-Wolfe method for a constrained Riemannian optimization model, but the constraint set they considered requires the strong conditions of being compact and geodesically convex. So, our algorithm, as far as we are aware, presents the first feasible constrained Riemannian optimization method for problems without geodesic convexity.

Developing such an approach comes with natural pros and cons. On the positive side, our staged block descent method can be terminated at any time with a feasible solution in hand. On the negative side, in order for our method to maintain feasibility, projections onto \mathcal{C}_{\bm{x}} and \mathcal{C} must be computed, which could potentially prove difficult. The next sections describe some potential methods to approximate optimal projections onto \mathcal{C}_{\bm{x}} and \mathcal{C} with provable guarantees for our applications of interest.

6.2 Approximate Alternating Projection Approach

The method of alternating projections has a long history, and more recently alternating projections onto manifolds have been investigated [3, 33, 42]. Our algorithm requires projections onto the sets \mathcal{C}_{x} and \mathcal{C}, which can be highly nonconvex; this could pose a problem. Recently, however, Drusvyatskiy and Lewis [18] proposed an approximate alternating projection method for finding feasible points of the constraint set

{G(𝒙)𝟎H(𝒙)=𝟎𝒙𝒬,\begin{cases}G(\bm{x})\leq\bm{0}\\ H(\bm{x})=\bm{0}\\ \bm{x}\in\mathcal{Q},\end{cases} (49)

where G and H are C^{2}-smooth and \mathcal{Q} is a closed subset of Euclidean space. In the case of attempting to project onto \mathcal{C}_{x} or \mathcal{C}, our constraint sets of interest are of the same form as (49), with \mathcal{Q} either the Stiefel manifold, a product of Stiefel manifolds, or a product of Stiefel manifolds and \mathbb{R}^{n}, which are all closed subsets of Euclidean spaces. Thus, computing unique and exact projections onto \mathcal{Q} for points near \mathcal{Q} is possible; see Proposition 7 in [1] and Appendix A in [5] for a full description of optimal and unique projections onto the Stiefel manifold. This in turn yields optimal and unique projections onto our product manifolds of interest. Thus, by [18], we have the following result:

Theorem 10.

[Theorem 1, [18]] If 𝒬=Stn,k,i=1rStni,ki,Stn,k×n\mathcal{Q}=\text{St}_{n,k},\;\otimes_{i=1}^{r}\text{St}_{n_{i},k_{i}},\;\text{St}_{n,k}\times\mathbb{R}^{n} or i=1rStni,ki×m\otimes_{i=1}^{r}\text{St}_{n_{i},k_{i}}\times\mathbb{R}^{m} intersects a closed set MM separably (see Definition 2, [18]) at a point 𝐱¯\bar{\bm{x}} and Φ\Phi is an inexact projection onto MM (see Definition 3, [18]) around 𝐱¯\bar{\bm{x}}, then starting from a point nearby 𝐳𝒬\bm{z}\in\mathcal{Q}, the inexact alternating projection scheme

𝒛P𝒬(Φ(𝒛))\bm{z}\leftarrow P_{\mathcal{Q}}\left(\Phi(\bm{z})\right) (50)

converges linearly to a point in the intersection M𝒬M\cap\mathcal{Q}.

Furthermore, the authors in [18] propose an inexact projection method onto constraint sets defined by equality and inequality constraints which satisfy the conditions of Theorem 10 under the LICQ. Thus, we can approximate the exact projections in the \bm{x}-updating phase and the joint-updating phase of Algorithm 1 with (50). A potential critique is that the computed point in the intersection might not be the optimal projection onto the feasible region. This is possible; however, for the implementation of our algorithm to be successful, only a sufficient improvement of the objective function is necessary, and this is certainly obtainable with almost optimal projections. In Section 8, we shall see that we are able to compute global minima of nonconvex problems using Algorithm 1 with inexact projections.
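To fix ideas, the following NumPy sketch mirrors the scheme (50) for \mathcal{Q} a single Stiefel manifold: the exact projection onto St(n,k) is the polar factor from the thin SVD (for full-rank inputs, cf. [1, 5]), while inexact_proj_M is a placeholder for a user-supplied inexact projection onto the functional constraint set in the sense of [18]; that routine, and the stopping rule, are assumptions of the sketch rather than prescriptions of the paper.

```python
import numpy as np

def project_stiefel(A):
    """Exact projection of a full-rank matrix A onto St(n, k):
    the polar factor U @ Vt from the thin SVD A = U S Vt."""
    U, _, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ Vt

def inexact_alternating_projection(Z0, inexact_proj_M, max_iter=100, tol=1e-10):
    """Scheme (50): Z <- P_Q(Phi(Z)) with Q the Stiefel manifold and
    Phi = inexact_proj_M an inexact projection toward the constraint set M."""
    Z = Z0
    for _ in range(max_iter):
        W = inexact_proj_M(Z)            # inexact step toward M
        Z_new = project_stiefel(W)        # exact projection back onto St(n, k)
        if np.linalg.norm(Z_new - Z) <= tol:
            return Z_new
        Z = Z_new
    return Z
```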

6.3 A Simple Penalty Approach to Approximate Projections

Another method for approximating Π𝒞x\Pi_{\mathcal{C}_{x}} and Π𝒞\Pi_{\mathcal{C}} is via a localized penalty approach. For example, let (𝒙^,𝒚^)𝒞(\hat{\bm{x}},\hat{\bm{y}})\in\mathcal{C} and 𝒅\bm{d}\in\mathcal{E}, with \mathcal{E} the Euclidean space in which \mathcal{M} is embedded. Then, we can approximate the projection of (Retr(𝒙^,t𝒅),𝒚^)(\text{\rm Retr}_{\mathcal{M}}(\hat{\bm{x}},t\bm{d}),\hat{\bm{y}}) onto 𝒞x(𝒚^)\mathcal{C}_{x}(\hat{\bm{y}}) by solving the manifold optimization model

min\displaystyle\min P(c(𝒙,𝒚^))\displaystyle\;{P}(c(\bm{x},\hat{\bm{y}})) (51)
s.t. 𝒙,\displaystyle\;\bm{x}\in\mathcal{M},

where {P}:\mathbb{R}^{p}\rightarrow[0,\infty) is a penalty function such that {P}(\bm{0})=0, and where the procedure used to solve (51) is initialized at \text{\rm Retr}_{\mathcal{M}}(\hat{\bm{x}},t\bm{d}). Ideally, initializing at nearly feasible points, e.g., (\text{\rm Retr}_{\mathcal{M}}(\hat{\bm{x}},t\bm{d}),\hat{\bm{y}}), shall produce not only feasible points but points which are nearly optimal projections. In testing, we see that for small values of t, feasible solutions can readily be computed which are nearer to (\text{\rm Retr}_{\mathcal{M}}(\hat{\bm{x}},t\bm{d}),\hat{\bm{y}}) than the original feasible point (\hat{\bm{x}},\hat{\bm{y}}). This does not guarantee optimal projections, but it is sufficient to ensure feasibility and general improvement of the objective function.
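A minimal sketch of this penalty strategy, assuming the quadratic penalty P(\bm{v})=\|\bm{v}\|^{2} and a user-supplied Euclidean gradient oracle for X\mapsto\|\bm{c}(X,\hat{\bm{y}})\|^{2} (both our own illustrative choices), runs a few Riemannian gradient steps on (51) over the Stiefel manifold starting from the nearly feasible point \text{Retr}_{\mathcal{M}}(\hat{\bm{x}},t\bm{d}):

```python
import numpy as np

def qr_retraction(X, V):
    """QR retraction on the Stiefel manifold (as in the earlier sketch)."""
    Q, R = np.linalg.qr(X + V)
    signs = np.sign(np.diag(R))
    signs[signs == 0] = 1.0
    return Q * signs

def tangent_projection(X, D):
    """Project D onto the tangent space of St(n, k) at X."""
    sym = (X.T @ D + D.T @ X) / 2
    return D - X @ sym

def penalty_projection(X_init, euclid_grad_penalty, step=1e-2, iters=200):
    """Approximate the projection onto C_x(y_hat) by minimizing
    P(c(X, y_hat)) = ||c(X, y_hat)||^2 over the Stiefel manifold,
    warm-started at the nearly feasible point X_init = Retr(x_hat, t d).
    The fixed step size is an arbitrary choice for illustration."""
    X = X_init
    for _ in range(iters):
        G = euclid_grad_penalty(X)           # Euclidean gradient of the penalty
        rg = tangent_projection(X, G)         # Riemannian gradient on St(n, k)
        X = qr_retraction(X, -step * rg)      # retract the descent step
    return X
```

In practice, any off-the-shelf manifold optimization routine could replace this hand-rolled loop; the essential point is the warm start at the retracted point.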

7 Convergence Analysis

In this section, we prove that, under reasonable assumptions, Algorithm 1 converges to an (\epsilon,\epsilon)-approximate first-order KKT point, as defined in Definition 22, in \mathcal{O}(1/\epsilon^{2}) iterations. We begin our analysis by defining several constants which depend upon the initialization of Algorithm 1. For all (\bm{x}^{0},\bm{y}^{0})\in\mathcal{C} we define the following terms:

𝒢f,y(𝒙0,𝒚0):=sup{yf(𝒙,𝒚+𝒅)|(𝒙,𝒚)L(𝒙0,𝒚0),𝒅1},\mathcal{G}_{f,y}(\bm{x}^{0},\bm{y}^{0}):=\sup\left\{\|\nabla_{y}f(\bm{x},\bm{y}+\bm{d})\|\;|\;(\bm{x},\bm{y})\in L(\bm{x}^{0},\bm{y}^{0}),\;\|\bm{d}\|\leq 1\right\}, (52)
𝒢gi,y(𝒙0,𝒚0):=sup{ygi(𝒚+𝒅)|(𝒙,𝒚)L(𝒙0,𝒚0),𝒅1},\mathcal{G}_{g_{i},y}(\bm{x}^{0},\bm{y}^{0}):=\sup\left\{\|\nabla_{y}g_{i}(\bm{y}+\bm{d})\|\;|\;(\bm{x},\bm{y})\in L(\bm{x}^{0},\bm{y}^{0}),\;\|\bm{d}\|\leq 1\right\}, (53)
𝒢g,y(𝒙0,𝒚0):=max1is𝒢gi,y(𝒙0,𝒚0),\mathcal{G}_{g,y}(\bm{x}^{0},\bm{y}^{0}):=\max_{1\leq i\leq s}\mathcal{G}_{g_{i},y}(\bm{x}^{0},\bm{y}^{0}), (54)
gi,y(𝒙0,𝒚0):=sup{y2gi(𝒚+𝒅)2|(𝒙,𝒚)L(𝒙0,𝒚0),𝒅1},\mathcal{H}_{g_{i},y}(\bm{x}^{0},\bm{y}^{0}):=\sup\left\{\|\nabla_{y}^{2}g_{i}(\bm{y}+\bm{d})\|_{2}\;|\;(\bm{x},\bm{y})\in L(\bm{x}^{0},\bm{y}^{0}),\;\|\bm{d}\|\leq 1\right\}, (55)
g,y(𝒙0,𝒚0):=max1isgi,y(𝒙0,𝒚0),\mathcal{H}_{g,y}(\bm{x}^{0},\bm{y}^{0}):=\max_{1\leq i\leq s}\mathcal{H}_{g_{i},y}(\bm{x}^{0},\bm{y}^{0}), (56)
𝒢f,x(𝒙0,𝒚0):=sup{gradf(Retr(𝒙,𝒅),𝒚)Retr(𝒙,𝒅)|(𝒙,𝒚)L(𝒙0,𝒚0),𝒅𝒙1,𝒅T𝒙},\mathcal{G}_{f,x}(\bm{x}^{0},\bm{y}^{0}):=\sup\bigg{\{}\|\operatorname{grad}f(\text{\rm Retr}_{\mathcal{M}}(\bm{x},\bm{d}),\bm{y})\|_{\text{\rm Retr}_{\mathcal{M}}(\bm{x},\bm{d})}\;\\ \bigg{|}\;(\bm{x},\bm{y})\in L(\bm{x}^{0},\bm{y}^{0}),\;\|\bm{d}\|_{\bm{x}}\leq 1,\;\bm{d}\in T_{\bm{x}}\mathcal{M}\bigg{\}}, (57)
𝒢f¯,z(𝒛0):=sup{gradf¯(Retr(𝒛,𝒅))𝒛|𝒛L(𝒛0),𝒅𝒛1,𝒅T𝒛},\mathcal{G}_{\bar{f},z}(\bm{z}^{0}):=\sup\left\{\|\operatorname{grad}\bar{f}(\text{\rm Retr}_{\mathcal{M}^{\prime}}(\bm{z},\bm{d}))\|_{\bm{z}}\;|\;\bm{z}\in L(\bm{z}^{0}),\;\|\bm{d}\|_{\bm{z}}\leq 1,\;\bm{d}\in T_{\bm{z}}\mathcal{M}^{\prime}\right\}, (58)
𝒢ci,x(𝒙0,𝒚0):=sup{abs(d2dt2(ci(Retr(𝒙,t𝒅),𝒚)))|t=a|a[0,1],(𝒙,𝒚)L(𝒙0,𝒚0),𝒅𝒙1,𝒅T𝒙},\mathcal{G}_{c_{i},x}(\bm{x}^{0},\bm{y}^{0}):=\sup\bigg{\{}\text{abs}\left(\frac{d^{2}}{dt^{2}}\left(c_{i}(\text{\rm Retr}_{\mathcal{M}}(\bm{x},t\bm{d}),\bm{y})\right)\right)\big{|}_{t=a}\;\\ \bigg{|}\;a\in[0,1],\;(\bm{x},\bm{y})\in L(\bm{x}^{0},\bm{y}^{0}),\;\|\bm{d}\|_{\bm{x}}\leq 1,\;\bm{d}\in T_{\bm{x}}\mathcal{M}\bigg{\}}, (59)
𝒢c,x(𝒙0,𝒚0):=𝒢c1,x(𝒙0,𝒚0)2++𝒢cp,x(𝒙0,𝒚0)2,\mathcal{G}_{c,x}(\bm{x}^{0},\bm{y}^{0}):=\sqrt{\mathcal{G}_{c_{1},x}(\bm{x}^{0},\bm{y}^{0})^{2}+\cdots+\mathcal{G}_{c_{p},x}(\bm{x}^{0},\bm{y}^{0})^{2}}, (60)
𝒢c¯i,z(𝒙0,𝒚0):=sup{abs(d2dt2(c¯i(Retr(𝒛,t𝒅)))|t=a|a[0,1],𝒛=(𝒙,𝒚)L(𝒙0,𝒚0),𝒅𝒛1,𝒅T𝒛},\mathcal{G}_{\bar{c}_{i},z}(\bm{x}^{0},\bm{y}^{0}):=\sup\bigg{\{}\text{abs}\left(\frac{d^{2}}{dt^{2}}\left(\bar{c}_{i}(\text{\rm Retr}_{\mathcal{M}^{\prime}}(\bm{z},t\bm{d})\right)\right)\big{|}_{t=a}\;\\ \bigg{|}\;a\in[0,1],\;\bm{z}=(\bm{x},\bm{y})\in L(\bm{x}^{0},\bm{y}^{0}),\;\|\bm{d}\|_{\bm{z}}\leq 1,\;\bm{d}\in T_{\bm{z}}\mathcal{M}^{\prime}\bigg{\}}, (61)
𝒢c¯,z(𝒙0,𝒚0):=𝒢c¯1,x(𝒙0,𝒚0)2++𝒢c¯p,x(𝒙0,𝒚0)2,\mathcal{G}_{\bar{c},z}(\bm{x}^{0},\bm{y}^{0}):=\sqrt{\mathcal{G}_{\bar{c}_{1},x}(\bm{x}^{0},\bm{y}^{0})^{2}+\cdots+\mathcal{G}_{\bar{c}_{p},x}(\bm{x}^{0},\bm{y}^{0})^{2}}, (62)
ci,y(𝒙0,𝒚0):=sup{y2ci(𝒙,𝒚+𝒅)2|(𝒙,𝒚)L(𝒙0,𝒚0),𝒅1},\mathcal{H}_{c_{i},y}(\bm{x}^{0},\bm{y}^{0}):=\sup\left\{\|\nabla_{y}^{2}c_{i}(\bm{x},\bm{y}+\bm{d})\|_{2}\;|\;(\bm{x},\bm{y})\in L(\bm{x}^{0},\bm{y}^{0}),\;\|\bm{d}\|\leq 1\right\}, (63)
c,y(𝒙0,𝒚0):=12c1,y(𝒙0,𝒚0)2++cp,y(𝒙0,𝒚0)2.\mathcal{H}_{c,y}(\bm{x}^{0},\bm{y}^{0}):=\frac{1}{2}\sqrt{\mathcal{H}_{c_{1},y}(\bm{x}^{0},\bm{y}^{0})^{2}+\cdots+\mathcal{H}_{c_{p},y}(\bm{x}^{0},\bm{y}^{0})^{2}}. (64)

Under the assumptions we shall impose, it can be readily argued that the constants defined above are all positive and finite. Each of these constants serves to bound first- and second-order derivatives of the objective and constraint functions, which is crucial in order to produce a convergence rate for our algorithm. Note, to reduce notational clutter we shall almost always drop the dependence on (\bm{x}^{0},\bm{y}^{0}) when using these constants in the remainder of the paper.

We now present the assumptions we utilize in our analysis. The assumptions can be divided into four categories: smoothness assumptions on the objective and functional constraints, error-bound conditions on the constraints, constraint qualifications, and manifold conditions.

Assumption 1.

The function f:\mathcal{M}\times\mathbb{R}^{n}\rightarrow\mathbb{R} is twice continuously differentiable, and \bm{c}:\mathcal{M}\times\mathbb{R}^{n}\rightarrow\mathbb{R}^{p} and \bm{g}:\mathcal{M}\times\mathbb{R}^{n}\rightarrow\mathbb{R}^{s} are three times continuously differentiable.

Assumption 2.

The functions f:×nf:\mathcal{M}\times\mathbb{R}^{n}\rightarrow\mathbb{R} and 𝐜:×np\bm{c}:\mathcal{M}\times\mathbb{R}^{n}\rightarrow\mathbb{R}^{p} satisfy the following Lipschitz conditions:

  1. 1.

    There exists Ly0L_{y}\geq 0 such that for all (𝒙,𝒚1)(\bm{x},\bm{y}^{1}) and (𝒙,𝒚2)(\bm{x},\bm{y}^{2}) in ×n\mathcal{M}\times\mathbb{R}^{n}

    yf(𝒙,𝒚1)yf(𝒙,𝒚2)Ly𝒚1𝒚2.\|\nabla_{y}f(\bm{x},\bm{y}^{1})-\nabla_{y}f(\bm{x},\bm{y}^{2})\|\leq L_{y}\|\bm{y}^{1}-\bm{y}^{2}\|. (65)
  2. 2.

    f(,𝒚)f(\cdot,\bm{y}) has a gradient Lipschitz retraction with respect to both the exponential map over \mathcal{M} and the retraction operator utilized in Algorithm 1 with constant Lx1L_{x}\geq 1 for all 𝒚n\bm{y}\in\mathbb{R}^{n}.

  3. 3.

    There exists Lxy0L_{xy}\geq 0 such that for all (𝒙1,𝒚)(\bm{x}^{1},\bm{y}) and (𝒙2,𝒚)(\bm{x}^{2},\bm{y}) in ×n\mathcal{M}\times\mathbb{R}^{n}

    yf(𝒙1,𝒚)yf(𝒙2,𝒚)Lxy𝒙1𝒙2.\|\nabla_{y}f(\bm{x}^{1},\bm{y})-\nabla_{y}f(\bm{x}^{2},\bm{y})\|\leq L_{xy}\|\bm{x}^{1}-\bm{x}^{2}\|_{\mathcal{M}}. (66)
  4. 4.

    There exists Lc0L_{c}\geq 0 such that for all (𝒙1,𝒚)(\bm{x}^{1},\bm{y}) and (𝒙2,𝒚)(\bm{x}^{2},\bm{y}) in ×n\mathcal{M}\times\mathbb{R}^{n} we have

    y𝒄(𝒙1,𝒚)y𝒄(𝒙2,𝒚)2Lc𝒙1𝒙2.\|\nabla_{y}\bm{c}(\bm{x}^{1},\bm{y})-\nabla_{y}\bm{c}(\bm{x}^{2},\bm{y})\|_{2}\leq L_{c}\|\bm{x}^{1}-\bm{x}^{2}\|_{\mathcal{M}}. (67)
Assumption 3.

The LICQ holds for all iterates generated by Algorithm 1 in the Euclidean and Riemannian senses for 𝐜\bm{c} and 𝐠\bm{g} and 𝐜¯\bar{\bm{c}} and 𝐠¯\bar{\bm{g}} respectively.

Assumption 4.

There exist constants \delta_{x},\delta_{y},\delta_{z}>0 and \tau_{x},\tau_{y},\tau_{z}>0 such that

  1. 1.

    For all 𝒚n\bm{y}\in\mathbb{R}^{n} such that dist(𝒚,𝒞y(𝒙))δy\text{\rm dist}(\bm{y},\mathcal{C}_{y}(\bm{x}))\leq\delta_{y}, independent of 𝒙\bm{x}\in\mathcal{M}, it follows

    dist(𝒚,𝒞y(𝒙))τy(𝒄(𝒙,𝒚)+max(𝒈(𝒚),0)),\text{\rm dist}(\bm{y},\mathcal{C}_{y}(\bm{x}))\leq\tau_{y}\left(\|\bm{c}(\bm{x},\bm{y})\|+\max(\bm{g}(\bm{y}),0)\right), (68)

    where dist(𝒚,𝒞y(𝒙)):=inf𝒚^𝒞y(𝒙)𝒚^𝒚\text{\rm dist}(\bm{y},\mathcal{C}_{y}(\bm{x})):=\underset{\hat{\bm{y}}\in\mathcal{C}_{y}(\bm{x})}{\inf}\|\hat{\bm{y}}-\bm{y}\| and max(𝒈(𝒚),0):=max(g1(𝒚),,gs(𝒚),0)\max(\bm{g}(\bm{y}),0):=\max(g_{1}(\bm{y}),\ldots,g_{s}(\bm{y}),0).

  2. 2.

    For all 𝒙\bm{x}\in\mathcal{M} such that dist(𝒙,𝒞x(𝒚))δx\text{\rm dist}(\bm{x},\mathcal{C}_{x}(\bm{y}))\leq\delta_{x}, independent of 𝒚n\bm{y}\in\mathbb{R}^{n}, it follows

    dist(𝒙,𝒞x(𝒚))τx𝒄(𝒙,𝒚),\text{\rm dist}(\bm{x},\mathcal{C}_{x}(\bm{y}))\leq\tau_{x}\|\bm{c}(\bm{x},\bm{y})\|, (69)

    where dist(𝒙,𝒞x(𝒚)):=inf{𝒙¯𝒙|𝒙¯𝒞x(𝒚)}\text{\rm dist}(\bm{x},\mathcal{C}_{x}(\bm{y})):=\inf\left\{\|\bar{\bm{x}}-\bm{x}\|_{\mathcal{M}}\;|\;\bar{\bm{x}}\in\mathcal{C}_{x}(\bm{y})\right\}.

  3. 3.

    For all 𝒛=(𝒙,𝒚)×n\bm{z}=(\bm{x},\bm{y})\in\mathcal{M}\times\mathbb{R}^{n} such that dist((𝒙,𝒚),𝒞)δz\text{\rm dist}((\bm{x},\bm{y}),\mathcal{C})\leq\delta_{z}, we have that

    dist((𝒙,𝒚),𝒞)τz(𝒄(𝒙,𝒚)+max(𝒈(𝒚),0)),\text{\rm dist}((\bm{x},\bm{y}),\mathcal{C})\leq\tau_{z}\left(\|\bm{c}(\bm{x},\bm{y})\|+\max(\bm{g}(\bm{y}),0)\right), (70)

    where dist((𝐱,𝐲),𝒞):=inf{𝐱¯𝐱+𝐲¯𝐲|(𝐱¯,𝐲¯)𝒞}\text{\rm dist}((\bm{x},\bm{y}),\mathcal{C}):=\inf\left\{\|\bar{\bm{x}}-\bm{x}\|_{\mathcal{M}}+\|\bar{\bm{y}}-\bm{y}\|\;|\;(\bar{\bm{x}},\bar{\bm{y}})\in\mathcal{C}\right\}.

Assumption 5.

The injectivity radius of \mathcal{M} is strictly positive and Proposition 7 holds for the exponential mapping over \mathcal{M} and the retraction operation utilized in Algorithm 1.

Assumption 6.

The Riemannian distance and the norm inherited from the Euclidean space in which \mathcal{M} is embedded are locally equivalent, i.e., there exist constants ρ>0\rho>0 and Γ1\Gamma\geq 1 such that for all 𝐱,𝐲\bm{x},\bm{y}\in\mathcal{M}, if 𝐱𝐲ρ\|\bm{x}-\bm{y}\|_{\mathcal{M}}\leq\rho, then 𝐱𝐲dist(𝐱,𝐲)Γ𝐱𝐲\|\bm{x}-\bm{y}\|_{\mathcal{M}}\leq\text{\rm dist}(\bm{x},\bm{y})\leq\Gamma\|\bm{x}-\bm{y}\|_{\mathcal{M}}.

It is worth noting that these assumptions are fairly standard. The first assumption asks for sufficient differentiability and is readily satisfied by all of the motivating examples. The Lipschitz conditions in the second assumption can follow from a simple Lipschitz condition on the original problem before decomposing; see Appendix A. Constraint qualifications are a staple in the analysis of optimization algorithms, and error bound conditions are a common occurrence in constrained problems. Finally, Assumption 5 holds for any compact Riemannian submanifold, for example, the Stiefel manifold, and Assumption 6 is expected due to the locally Euclidean nature of manifolds.

7.1 Y-Updating Phase

Lemma 11.

Let δ1>0\delta_{1}>0. If myδ1(𝐱k,𝐲k)>ϵ1m_{y}^{\delta_{1}}(\bm{x}^{k},\bm{y}^{k})>\epsilon_{1}, then the y-updating phase is entered and shall terminate with

f(𝒙k+1,𝒚k+1)f(𝒙k,𝒚k)tlow,yα1ϵ1f(\bm{x}^{k+1},\bm{y}^{k+1})\leq f(\bm{x}^{k},\bm{y}^{k})-t_{\text{low},y}\cdot\alpha_{1}\cdot\epsilon_{1} (71)

where

t_{\text{low},y}:=\gamma\min\left\{1,\delta_{y},\frac{\delta_{1}}{\mathcal{G}_{g,y}},\frac{2(1-\alpha_{1})\epsilon_{1}}{{L_{y}}+2\tau_{y}\mathcal{G}_{f,y}(\mathcal{H}_{c,y}+\frac{1}{2}\mathcal{H}_{g,y})+L_{y}\tau_{y}^{2}(\mathcal{H}_{c,y}+\frac{1}{2}\mathcal{H}_{g,y})^{2}}\right\}. (72)
Proof.

Let (𝒙k,𝒚k)𝒞(\bm{x}^{k},\bm{y}^{k})\in\mathcal{C} and assume myδ1(𝒙k,𝒚k)>ϵ1m_{y}^{\delta_{1}}(\bm{x}^{k},\bm{y}^{k})>\epsilon_{1}. Let 𝒅yk\bm{d}_{y}^{k} be a minimizer of the subproblem defining myδ1(𝒙k,𝒚k)m^{\delta_{1}}_{y}(\bm{x}^{k},\bm{y}^{k}). Then by the gradient Lipschitz condition on ff with respect to 𝒚\bm{y},

f(𝒙k,𝒚k+t𝒅yk)\displaystyle f(\bm{x}^{k},\bm{y}^{k}+t\bm{d}_{y}^{k}) f(𝒙k,𝒚k)+tyf(𝒙k,𝒚k),𝒅yk+Ly2t2\displaystyle\leq f(\bm{x}^{k},\bm{y}^{k})+t\langle\nabla_{y}f(\bm{x}^{k},\bm{y}^{k}),\bm{d}_{y}^{k}\rangle+\frac{L_{y}}{2}t^{2}
=f(𝒙k,𝒚k)tmyδ1(𝒙k,𝒚k)+Ly2t2.\displaystyle=f(\bm{x}^{k},\bm{y}^{k})-tm^{\delta_{1}}_{y}(\bm{x}^{k},\bm{y}^{k})+\frac{L_{y}}{2}t^{2}. (73)

From the definition of 𝒅yk\bm{d}_{y}^{k} and the feasibility of (𝒙k,𝒚k)(\bm{x}^{k},\bm{y}^{k}) it follows

𝒄(𝒙k,𝒚k+t𝒅yk)=t22(y2c1(𝒙k,𝒚¯)[𝒅yk],𝒅yky2cp(𝒙k,𝒚¯)[𝒅yk],𝒅yk)\bm{c}(\bm{x}^{k},\bm{y}^{k}+t\bm{d}_{y}^{k})=\frac{t^{2}}{2}\begin{pmatrix}\langle\nabla_{y}^{2}c_{1}(\bm{x}^{k},\bar{\bm{y}})[\bm{d}_{y}^{k}],\bm{d}_{y}^{k}\rangle\\ \vdots\\ \langle\nabla_{y}^{2}c_{p}(\bm{x}^{k},\bar{\bm{y}})[\bm{d}_{y}^{k}],\bm{d}_{y}^{k}\rangle\end{pmatrix}

where 𝒚¯\bar{\bm{y}} lies between 𝒚k\bm{y}^{k} and 𝒚k+t𝒅yk\bm{y}^{k}+t\bm{d}_{y}^{k}. Thus,

𝒄(𝒙k,𝒚k+t𝒅yk)\displaystyle\|\bm{c}(\bm{x}^{k},\bm{y}^{k}+t\bm{d}_{y}^{k})\| t22[y2c1(𝒙k,𝒚¯)22++y2cp(𝒙k,𝒚¯)22]1/2\displaystyle\leq\frac{t^{2}}{2}\left[\|\nabla_{y}^{2}c_{1}(\bm{x}^{k},\bar{\bm{y}})\|_{2}^{2}+\cdots+\|\nabla_{y}^{2}c_{p}(\bm{x}^{k},\bar{\bm{y}})\|_{2}^{2}\right]^{1/2}
(63)t22[c1,y2++cp,y2]1/2\displaystyle\overset{\eqref{eq:Hci}}{\leq}\frac{t^{2}}{2}\left[\mathcal{H}_{c_{1},y}^{2}+\cdots+\mathcal{H}_{c_{p},y}^{2}\right]^{1/2}
=(64)t2c,y.\displaystyle\overset{\eqref{eq:Hc}}{=}t^{2}\mathcal{H}_{c,y}.

We must similarly bound the potential violation of the inequality constraints. We begin by dealing with those constraints i such that g_{i}(\bm{y}^{k})<-\delta_{1}. By the mean value theorem, for some \tilde{\bm{y}} between \bm{y}^{k} and \bm{y}^{k}+t\bm{d}_{y}^{k}

gi(𝒚k+t𝒅yk)=gi(𝒚k)+tygi(𝒚~),𝒅yk<δ1+t𝒢gi,y.g_{i}(\bm{y}^{k}+t\bm{d}_{y}^{k})=g_{i}(\bm{y}^{k})+t\langle\nabla_{y}g_{i}(\tilde{\bm{y}}),\bm{d}_{y}^{k}\rangle<-\delta_{1}+t\mathcal{G}_{g_{i},y}.

Thus, if t\leq\delta_{1}/\mathcal{G}_{g_{i},y}, then we know the i-th constraint shall not be violated. Next, we bound the error of the \delta_{1}-almost active constraints. Expanding g_{i} for i\in\mathcal{A}^{\delta_{1}}(\bm{y}^{k}) we have

gi(𝒚k+t𝒅yk)\displaystyle g_{i}(\bm{y}^{k}+t\bm{d}_{y}^{k}) =gi(𝒚k)+tygi(𝒚k),𝒅yk+t22y2gi(𝒚~)[𝒅yk],𝒅yk\displaystyle=g_{i}(\bm{y}^{k})+t\langle\nabla_{y}g_{i}(\bm{y}^{k}),\bm{d}_{y}^{k}\rangle+\frac{t^{2}}{2}\langle\nabla_{y}^{2}g_{i}(\tilde{\bm{y}})[\bm{d}_{y}^{k}],\bm{d}_{y}^{k}\rangle
t22y2gi(𝒚~)[𝒅yk],𝒅yk\displaystyle\leq\frac{t^{2}}{2}\langle\nabla_{y}^{2}g_{i}(\tilde{\bm{y}})[\bm{d}_{y}^{k}],\bm{d}_{y}^{k}\rangle
(55)t22gi,y.\displaystyle\overset{\eqref{eq:Hgy}}{\leq}\frac{t^{2}}{2}\mathcal{H}_{g_{i},y}. (74)

So, assuming tδyt\leq\delta_{y} and tδ1/𝒢g,yt\leq\delta_{1}/\mathcal{G}_{g,y}, it then follows by the error bound condition, (68), that

dist(𝒚k+t𝒅yk,𝒞y(𝒙k))\displaystyle\text{dist}(\bm{y}^{k}+t\bm{d}_{y}^{k},\mathcal{C}_{y}(\bm{x}^{k})) τy(𝒄(𝒙k,𝒚k+t𝒅yk)+max(𝒈(𝒚k+t𝒅yk),𝟎))\displaystyle\leq\tau_{y}\left(\|\bm{c}(\bm{x}^{k},\bm{y}^{k}+t\bm{d}_{y}^{k})\|+\max\big{(}\bm{g}(\bm{y}^{k}+t\bm{d}_{y}^{k}),\bm{0}\big{)}\right)
τy(c,y+12g,y)t2.\displaystyle\leq\tau_{y}\left(\mathcal{H}_{c,y}+\frac{1}{2}\mathcal{H}_{g,y}\right)t^{2}. (75)

Hence, it follows

f(𝒙k,Π𝒞y(𝒙k)(𝒚k+t𝒅yk))\displaystyle f(\bm{x}^{k},\Pi_{\mathcal{C}_{y}(\bm{x}^{k})}(\bm{y}^{k}+t\bm{d}_{y}^{k}))
f(𝒙k,𝒚k+t𝒅yk)+yf(𝒙k,𝒚k+t𝒅yk),Π𝒞y(𝒙k)(𝒚k+t𝒅yk)(𝒚k+t𝒅yk)+Ly2dist(𝒚k+t𝒅yk,𝒞y(𝒙k))2\displaystyle\leq f(\bm{x}^{k},\bm{y}^{k}+t\bm{d}_{y}^{k})+\langle\nabla_{y}f(\bm{x}^{k},\bm{y}^{k}+t\bm{d}_{y}^{k}),\Pi_{\mathcal{C}_{y}(\bm{x}^{k})}(\bm{y}^{k}+t\bm{d}_{y}^{k})-(\bm{y}^{k}+t\bm{d}_{y}^{k})\rangle+\frac{L_{y}}{2}\text{dist}(\bm{y}^{k}+t\bm{d}_{y}^{k},\mathcal{C}_{y}(\bm{x}^{k}))^{2}
(7.1)f(𝒙k,𝒚k)tmyδ1(𝒙k,𝒚k)+Ly2t2+yf(𝒙k,𝒚k+t𝒅yk),Π𝒞y(𝒙k)(𝒚k+t𝒅yk)(𝒚k+t𝒅yk)\displaystyle\overset{\eqref{eq:y_help1}}{\leq}f(\bm{x}^{k},\bm{y}^{k})-tm^{\delta_{1}}_{y}(\bm{x}^{k},\bm{y}^{k})+\frac{L_{y}}{2}t^{2}+\langle\nabla_{y}f(\bm{x}^{k},\bm{y}^{k}+t\bm{d}_{y}^{k}),\Pi_{\mathcal{C}_{y}(\bm{x}^{k})}(\bm{y}^{k}+t\bm{d}_{y}^{k})-(\bm{y}^{k}+t\bm{d}_{y}^{k})\rangle
+Ly2dist(𝒚k+t𝒅yk,𝒞y(𝒙k))2\displaystyle\hskip 144.54pt+\frac{L_{y}}{2}\text{dist}(\bm{y}^{k}+t\bm{d}_{y}^{k},\mathcal{C}_{y}(\bm{x}^{k}))^{2}
(7.1)f(𝒙k,𝒚k)tmyδ1(𝒙k,𝒚k)+Ly2t2+yf(𝒙k,𝒚k+t𝒅yk),Π𝒞y(𝒙k)(𝒚k+t𝒅yk)(𝒚k+t𝒅yk)\displaystyle\overset{\eqref{eq:y_help2}}{\leq}f(\bm{x}^{k},\bm{y}^{k})-tm^{\delta_{1}}_{y}(\bm{x}^{k},\bm{y}^{k})+\frac{L_{y}}{2}t^{2}+\langle\nabla_{y}f(\bm{x}^{k},\bm{y}^{k}+t\bm{d}_{y}^{k}),\Pi_{\mathcal{C}_{y}(\bm{x}^{k})}(\bm{y}^{k}+t\bm{d}_{y}^{k})-(\bm{y}^{k}+t\bm{d}_{y}^{k})\rangle
+12Lyτy2(c,y+12g,y)2t4\displaystyle\hskip 144.54pt+\frac{1}{2}L_{y}\tau_{y}^{2}(\mathcal{H}_{c,y}+\frac{1}{2}\mathcal{H}_{g,y})^{2}t^{4}
(52)f(𝒙k,𝒚k)tmyδ1(𝒙k,𝒚k)+Ly2t2+𝒢f,ydist(𝒚k+t𝒅yk,𝒞y(𝒙k))+12Lyτy2(c,y+12g,y)2t4\displaystyle\overset{\eqref{eq:Gfynew}}{\leq}f(\bm{x}^{k},\bm{y}^{k})-tm^{\delta_{1}}_{y}(\bm{x}^{k},\bm{y}^{k})+\frac{L_{y}}{2}t^{2}+\mathcal{G}_{f,y}\text{dist}(\bm{y}^{k}+t\bm{d}_{y}^{k},\mathcal{C}_{y}(\bm{x}^{k}))+\frac{1}{2}L_{y}\tau_{y}^{2}(\mathcal{H}_{c,y}+\frac{1}{2}\mathcal{H}_{g,y})^{2}t^{4}
(7.1)f(𝒙k,𝒚k)tmyδ1(𝒙k,𝒚k)+Ly2t2+𝒢f,yτy(c,y+12g,y)t2+12Lyτy2(c,y+12g,y)2t4.\displaystyle\overset{\eqref{eq:y_help2}}{\leq}f(\bm{x}^{k},\bm{y}^{k})-tm^{\delta_{1}}_{y}(\bm{x}^{k},\bm{y}^{k})+\frac{L_{y}}{2}t^{2}+\mathcal{G}_{f,y}\tau_{y}(\mathcal{H}_{c,y}+\frac{1}{2}\mathcal{H}_{g,y})t^{2}+\frac{1}{2}L_{y}\tau_{y}^{2}(\mathcal{H}_{c,y}+\frac{1}{2}\mathcal{H}_{g,y})^{2}t^{4}.

Thus, assuming t1t\leq 1 it follows

f(𝒙k,Π𝒞y(𝒙k)(𝒚k+t𝒅yk))\displaystyle f(\bm{x}^{k},\Pi_{\mathcal{C}_{y}(\bm{x}^{k})}(\bm{y}^{k}+t\bm{d}_{y}^{k}))
f(𝒙k,𝒚k)myδ1(𝒙k,𝒚k)t+(Ly2+τy𝒢f,y(c,y+12g,y)+12Lyτy2(c,y+12g,y)2)t2.\displaystyle\hskip 7.22743pt\leq f(\bm{x}^{k},\bm{y}^{k})-m^{\delta_{1}}_{y}(\bm{x}^{k},\bm{y}^{k})t+\left(\frac{L_{y}}{2}+\tau_{y}\mathcal{G}_{f,y}(\mathcal{H}_{c,y}+\frac{1}{2}\mathcal{H}_{g,y})+\frac{1}{2}L_{y}\tau_{y}^{2}(\mathcal{H}_{c,y}+\frac{1}{2}\mathcal{H}_{g,y})^{2}\right)t^{2}. (76)

Therefore, a sufficient decrease shall be obtained provided

tmin{1,δy,δ1𝒢g,y,(1α1)myδ1(𝒙k,𝒚k)Ly2+τy𝒢f,y(c,y+12g,y)+12Lyτy2(c,y+12g,y)2}.t\leq\min\left\{1,\delta_{y},\frac{\delta_{1}}{\mathcal{G}_{g,y}},\frac{(1-\alpha_{1})m^{\delta_{1}}_{y}(\bm{x}^{k},\bm{y}^{k})}{\frac{L_{y}}{2}+\tau_{y}\mathcal{G}_{f,y}(\mathcal{H}_{c,y}+\frac{1}{2}\mathcal{H}_{g,y})+\frac{1}{2}L_{y}\tau_{y}^{2}(\mathcal{H}_{c,y}+\frac{1}{2}\mathcal{H}_{g,y})^{2}}\right\}.

Given the structure of the line-search procedure in the y-updating phase, it follows the computed stepsize tkt_{k} shall be lower bounded by

tlow,y:=γmin{1,δy,δ1𝒢g,y,2(1α1)ϵ1Ly+2τy𝒢f,y(c,y+12g,y)+Lyτy2(c,y+12g,y)2},t_{\text{low},y}:=\gamma\min\left\{1,\delta_{y},\frac{\delta_{1}}{\mathcal{G}_{g,y}},\frac{2(1-\alpha_{1})\epsilon_{1}}{{L_{y}}+2\tau_{y}\mathcal{G}_{f,y}(\mathcal{H}_{c,y}+\frac{1}{2}\mathcal{H}_{g,y})+L_{y}\tau_{y}^{2}(\mathcal{H}_{c,y}+\frac{1}{2}\mathcal{H}_{g,y})^{2}}\right\},

and by combining this lower bound with the sufficient decrease inequality we obtain our result. ∎

It is worth remarking that the bound t_{\text{low},y} is very pessimistic and follows from the worst possible scenario which might occur given the bounds on the norms of the Hessians and gradients of the objective and constraint functions.

7.2 X-Updating Phase

For our analysis of the x-updating phase, we begin by proving a useful lemma.

Lemma 12.

Assume \mathcal{M} is a compact Riemannian manifold with positive injectivity radius \kappa=\text{\rm inj}(\mathcal{M})>0. Then for any two points \bm{x},\bm{y}\in\mathcal{M} with \text{\rm dist}(\bm{x},\bm{y})\leq\kappa/2, there exists \bm{d}\in\mathcal{T}_{\bm{x}}\mathcal{M} such that \text{\rm Exp}(\bm{x},\bm{d})=\bm{y} and \|\bm{d}\|_{\bm{x}}<\kappa.

Proof.

Define the function h:\mathcal{M}\rightarrow\mathbb{R} such that h(\bm{x}):=\min_{\|\bm{v}\|_{\bm{x}}=\kappa/2,\;\bm{v}\in\mathcal{T}_{\bm{x}}\mathcal{M}}\;\text{\rm dist}(\bm{x},\text{Exp}(\bm{x},\bm{v})). By Proposition 10.22 in [7], since \|\bm{v}\|_{\bm{x}}=\kappa/2<\text{inj}(\mathcal{M}), every such \bm{v} satisfies \text{\rm dist}(\bm{x},\text{Exp}(\bm{x},\bm{v}))=\|\bm{v}\|_{\bm{x}}=\kappa/2, and hence h(\bm{x})=\kappa/2. Thus, h is a constant function over \mathcal{M} and \min_{\bm{x}\in\mathcal{M}}h(\bm{x})=\kappa/2. Since \mathcal{U}_{\bm{x}}:=\{\text{Exp}(\bm{x},\bm{v})\;|\;\|\bm{v}\|_{\bm{x}}<\kappa,\;\bm{v}\in\mathcal{T}_{\bm{x}}\mathcal{M}\} is a neighborhood of \bm{x}, it follows \{\tilde{\bm{x}}\in\mathcal{M}\;|\;\text{\rm dist}(\tilde{\bm{x}},\bm{x})\leq\kappa/2\}\subset\mathcal{U}_{\bm{x}}, yielding the result. ∎

Lemma 13.

If mx(𝐱k,𝐲k)>ϵ2m_{x}(\bm{x}^{k},\bm{y}^{k})>\epsilon_{2} and the x-updating phase is entered, then the update shall terminate with

f(𝒙k+1,𝒚k+1)f(𝒙k,𝒚k)tlow,xα2ϵ2f(\bm{x}^{k+1},\bm{y}^{k+1})\leq f(\bm{x}^{k},\bm{y}^{k})-t_{\text{low},x}\cdot\alpha_{2}\cdot\epsilon_{2} (77)

where

tlow,x:=γmin{1,δxL1,ρL1,κ𝒢c,xΓτx,8(1α2)ϵ2Lx𝒢c,x2Γ2τx2+4(Lx+𝒢f,x𝒢c,xΓτx)}.t_{\text{low},x}:=\gamma\min\left\{1,\frac{\delta_{x}}{L_{1}},\frac{\rho}{L_{1}},\sqrt{\frac{\kappa_{\mathcal{M}}}{\mathcal{G}_{c,x}\Gamma\tau_{x}}},\frac{8(1-\alpha_{2})\epsilon_{2}}{L_{x}\mathcal{G}_{c,x}^{2}\Gamma^{2}\tau_{x}^{2}+4(L_{x}+\mathcal{G}_{f,x}\mathcal{G}_{c,x}\Gamma\tau_{x})}\right\}. (78)
Proof.

Let (𝒙k,𝒚k)𝒞(\bm{x}^{k},\bm{y}^{k})\in\mathcal{C} satisfy myδ1(𝒙k,𝒚k)ϵ1m^{\delta_{1}}_{y}(\bm{x}^{k},\bm{y}^{k})\leq\epsilon_{1} and mx(𝒙k,𝒚k)>ϵ2m_{x}(\bm{x}^{k},\bm{y}^{k})>\epsilon_{2} and let 𝒅xk\bm{d}_{x}^{k} be a minimizer defining mx(𝒙k,𝒚k)m_{x}(\bm{x}^{k},\bm{y}^{k}). By the gradient Lipschitz retraction assumption,

f(Retr(𝒙k,t𝒅xk),𝒚k)\displaystyle f(\text{\rm Retr}_{\mathcal{M}}(\bm{x}^{k},t\bm{d}_{x}^{k}),\bm{y}^{k}) f(𝒙k,𝒚k)+tgradf(𝒙k,𝒚k),𝒅xk𝒙k+Lxt22𝒅xk𝒙k2\displaystyle\leq f(\bm{x}^{k},\bm{y}^{k})+t\langle\operatorname{grad}f(\bm{x}^{k},\bm{y}^{k}),\bm{d}_{x}^{k}\rangle_{\bm{x}^{k}}+\frac{L_{x}t^{2}}{2}\|\bm{d}_{x}^{k}\|_{\bm{x}^{k}}^{2}
f(𝒙k,𝒚k)tmx(𝒙k,𝒚k)+Lxt22𝒅xk𝒙k2.\displaystyle\leq f(\bm{x}^{k},\bm{y}^{k})-tm_{x}(\bm{x}^{k},\bm{y}^{k})+\frac{L_{x}t^{2}}{2}\|\bm{d}_{x}^{k}\|_{\bm{x}^{k}}^{2}. (79)

For i=1,,pi=1,\ldots,p, define the functions hi(t)=ci(Retr(𝒙k,t𝒅xk),𝒚k)h_{i}(t)=c_{i}(\text{\rm Retr}_{\mathcal{M}}(\bm{x}^{k},t\bm{d}_{x}^{k}),\bm{y}^{k}). By Taylor’s theorem,

hi(t)=hi(0)+tgradci(𝒙k,𝒚k),𝒅xk𝒙k+t22hi′′(a)h_{i}(t)=h_{i}(0)+t\langle\operatorname{grad}c_{i}(\bm{x}^{k},\bm{y}^{k}),\bm{d}_{x}^{k}\rangle_{\bm{x}^{k}}+\frac{t^{2}}{2}h_{i}^{\prime\prime}(a)

for some a(0,1)a\in(0,1). Therefore, it follows by the definition of 𝒅xk\bm{d}_{x}^{k} and (59) that

|ci(Retr(𝒙k,t𝒅xk),𝒚k)|t22𝒢ci,x|c_{i}(\text{\rm Retr}_{\mathcal{M}}(\bm{x}^{k},t\bm{d}_{x}^{k}),\bm{y}^{k})|\leq\frac{t^{2}}{2}\mathcal{G}_{c_{i},x}

for all ii, which implies 𝒄(Retr(𝒙k,t𝒅xk),𝒚k)t22𝒢c,x\|\bm{c}(\text{\rm Retr}_{\mathcal{M}}(\bm{x}^{k},t\bm{d}_{x}^{k}),\bm{y}^{k})\|\leq\frac{t^{2}}{2}\mathcal{G}_{c,x}. Thus, by Assumption 4 and (69), for a small enough stepsize tt, to be specified shortly, the error bound condition applies and

dist(Retr(𝒙k,t𝒅xk),𝒞x(𝒚k))τx𝒄(Retr(𝒙k,t𝒅xk),𝒚k)τx2𝒢c,xt2.\displaystyle\text{\rm dist}(\text{\rm Retr}_{\mathcal{M}}(\bm{x}^{k},t\bm{d}_{x}^{k}),\mathcal{C}_{x}(\bm{y}^{k}))\leq\tau_{x}\|\bm{c}(\text{\rm Retr}_{\mathcal{M}}(\bm{x}^{k},t\bm{d}_{x}^{k}),\bm{y}^{k})\|\leq\frac{\tau_{x}}{2}\mathcal{G}_{c,x}t^{2}. (80)

Since \mathcal{M} is either St(n1,m1)\text{St}(n_{1},m_{1}) or St(n1,m1)××St(np,mp)\text{St}(n_{1},m_{1})\times\cdots\times\text{St}(n_{p},m_{p}), we can precisely specify how small the stepsize needs to be so that the error bound condition applies. For example, for the Stiefel manifold there exists a global constant L1L_{1} such that

Retr(𝒙k,t𝒅xk)𝒙kL1t𝒅xk𝒙kL1t\|\text{\rm Retr}_{\mathcal{M}}(\bm{x}^{k},t\bm{d}_{x}^{k})-\bm{x}^{k}\|_{\mathcal{M}}\leq L_{1}t\|\bm{d}_{x}^{k}\|_{\bm{x}^{k}}\leq L_{1}t (81)

for all 𝒙k\bm{x}^{k} and 𝒅xk\bm{d}_{x}^{k} for both the polar retraction and the QR retraction [52]. Therefore, if tδx/L1t\leq\delta_{x}/L_{1}, then the error bound condition (69) shall hold making (80) a valid bound.

Define 𝒙k(t):=Retr(𝒙k,t𝒅xk)\bm{x}^{k}(t):=\text{\rm Retr}_{\mathcal{M}}(\bm{x}^{k},t\bm{d}_{x}^{k}) and its projection onto the constraint set 𝒙pk(t):=Π𝒞x(𝒚k)(𝒙k(t))\bm{x}_{p}^{k}(t):=\Pi_{\mathcal{C}_{x}(\bm{y}^{k})}\left(\bm{x}^{k}(t)\right). We now argue for the existence of a direction 𝒅~(t)T𝒙k(t)\tilde{\bm{d}}(t)\in T_{\bm{x}^{k}(t)}\mathcal{M} such that

𝒙pk(t)=Exp(𝒙k(t),𝒅~(t))\bm{x}_{p}^{k}(t)=\text{Exp}\left(\bm{x}^{k}(t),\tilde{\bm{d}}(t)\right) (82)

when tt is sufficiently small. Since \mathcal{M} is either St(n1,m1)\text{St}(n_{1},m_{1}) or St(n1,m1)××St(np,mp)\text{St}(n_{1},m_{1})\times\cdots\times\text{St}(n_{p},m_{p}) for some positive integers nin_{i} and mim_{i}, it follows there is a positive global lower bound on the injectivity radius, κ\kappa_{\mathcal{M}}, over \mathcal{M}. From (80), (81), and Assumption 6, if additionally tmin{ρ/L1,κ/(τxΓ𝒢c,x)}t\leq\min\{\rho/L_{1},\sqrt{\kappa_{\mathcal{M}}/(\tau_{x}\Gamma\mathcal{G}_{c,x})}\} it follows

dist(𝒙pk(t),𝒙k(t))κ/2\text{\rm dist}(\bm{x}_{p}^{k}(t),\bm{x}^{k}(t))\leq\kappa_{\mathcal{M}}/2

and by Lemma 12 there exists 𝒅~(t)𝒯𝒙k(t)\tilde{\bm{d}}(t)\in\mathcal{T}_{\bm{x}^{k}(t)}\mathcal{M} with 𝒅~(t)𝒙k(t)<κ\|\tilde{\bm{d}}(t)\|_{\bm{x}^{k}(t)}<\kappa_{\mathcal{M}} such that (82) holds. Furthermore, by Proposition 10.22 in [7], Assumption 6, and (80),

dist(𝒙pk(t),𝒙k(t))=𝒅~(t)𝒙k(t)τxΓ2𝒢c,xt2.\text{\rm dist}(\bm{x}_{p}^{k}(t),\bm{x}^{k}(t))=\|\tilde{\bm{d}}(t)\|_{\bm{x}^{k}(t)}\leq\frac{\tau_{x}\Gamma}{2}\mathcal{G}_{c,x}t^{2}. (83)

Thus, by the gradient Lipschitz retraction condition of ff with respect to 𝒙\bm{x}

f(𝒙pk(t),𝒚k)\displaystyle f(\bm{x}^{k}_{p}(t),\bm{y}^{k}) (82)f(𝒙k(t),𝒚k)+gradxf(𝒙k(t),𝒚k),𝒅~(t)𝒙k(t)+Lx2𝒅~(t)𝒙k(t)2\displaystyle\overset{\eqref{eq:x_help23}}{\leq}f({\bm{x}}^{k}(t),\bm{y}^{k})+\langle\operatorname{grad}_{x}f({\bm{x}}^{k}(t),\bm{y}^{k}),\tilde{\bm{d}}(t)\rangle_{{\bm{x}}^{k}(t)}+\frac{L_{x}}{2}\|\tilde{\bm{d}}(t)\|^{2}_{{\bm{x}}^{k}(t)}
(57)f(𝒙k(t),𝒚k)+𝒢f,x𝒅~(t)𝒙k(t)+Lx2𝒅~(t)𝒙k(t)2\displaystyle\overset{\eqref{eq:Gfxnew}}{\leq}f({\bm{x}}^{k}(t),\bm{y}^{k})+\mathcal{G}_{f,x}\|\tilde{\bm{d}}(t)\|_{{\bm{x}}^{k}(t)}+\frac{L_{x}}{2}\|\tilde{\bm{d}}(t)\|^{2}_{{\bm{x}}^{k}(t)}
(83)f(𝒙k(t),𝒚k)+12𝒢f,x𝒢c,xΓτxt2+Lx2(12𝒢c,xΓτxt2)2\displaystyle\overset{\eqref{eq:x_help4}}{\leq}f({\bm{x}}^{k}(t),\bm{y}^{k})+\frac{1}{2}\mathcal{G}_{f,x}\mathcal{G}_{c,x}\Gamma\tau_{x}t^{2}+\frac{L_{x}}{2}\left(\frac{1}{2}\mathcal{G}_{c,x}\Gamma\tau_{x}t^{2}\right)^{2}
(7.2)f(𝒙k,𝒚k)tmx(𝒙k,𝒚k)+(12Lx+12𝒢f,x𝒢c,xΓτx)t2+18Lx𝒢c,x2Γ2τx2t4\displaystyle\overset{\eqref{eq:x_help1}}{\leq}f(\bm{x}^{k},\bm{y}^{k})-tm_{x}(\bm{x}^{k},\bm{y}^{k})+\left(\frac{1}{2}L_{x}+\frac{1}{2}\mathcal{G}_{f,x}\mathcal{G}_{c,x}\Gamma\tau_{x}\right)t^{2}+\frac{1}{8}L_{x}\mathcal{G}_{c,x}^{2}\Gamma^{2}\tau_{x}^{2}t^{4}
f(𝒙k,𝒚k)tmx(𝒙k,𝒚k)+(12Lx+12𝒢f,x𝒢c,xΓτx+18Lx𝒢c,x2Γ2τx2)t2\displaystyle\leq f(\bm{x}^{k},\bm{y}^{k})-tm_{x}(\bm{x}^{k},\bm{y}^{k})+\left(\frac{1}{2}L_{x}+\frac{1}{2}\mathcal{G}_{f,x}\mathcal{G}_{c,x}\Gamma\tau_{x}+\frac{1}{8}L_{x}\mathcal{G}_{c,x}^{2}\Gamma^{2}\tau_{x}^{2}\right)t^{2}

where the last inequality follows assuming t1t\leq 1. Hence, if

tmin{1,δxL1,ρL1,κ𝒢c,xΓτx,8(1α2)mx(𝒙k,𝒚k)Lx𝒢c,x2Γ2τx2+4(Lx+𝒢f,x𝒢c,xΓτx)}t\leq\min\left\{1,\frac{\delta_{x}}{L_{1}},\frac{\rho}{L_{1}},\sqrt{\frac{\kappa_{\mathcal{M}}}{\mathcal{G}_{c,x}\Gamma\tau_{x}}},\frac{8(1-\alpha_{2})m_{x}(\bm{x}^{k},\bm{y}^{k})}{L_{x}\mathcal{G}_{c,x}^{2}\Gamma^{2}\tau_{x}^{2}+4(L_{x}+\mathcal{G}_{f,x}\mathcal{G}_{c,x}\Gamma\tau_{x})}\right\} (84)

then the while loop in the x-updating phase shall terminate with a guaranteed descent. Lastly, given (84) and the x-updating phase procedure it follows tktlow,xt_{k}\geq t_{\text{low},x} for all k0k\geq 0 where

tlow,x=γmin{1,δxL1,ρL1,κ𝒢c,xΓτx,8(1α2)ϵ2Lx𝒢c,x2Γ2τx2+4(Lx+𝒢f,x𝒢c,xΓτx)}.t_{\text{low},x}=\gamma\min\left\{1,\frac{\delta_{x}}{L_{1}},\frac{\rho}{L_{1}},\sqrt{\frac{\kappa_{\mathcal{M}}}{\mathcal{G}_{c,x}\Gamma\tau_{x}}},\frac{8(1-\alpha_{2})\epsilon_{2}}{L_{x}\mathcal{G}_{c,x}^{2}\Gamma^{2}\tau_{x}^{2}+4(L_{x}+\mathcal{G}_{f,x}\mathcal{G}_{c,x}\Gamma\tau_{x})}\right\}.

The result then follows as before via similar argumentation. ∎

7.3 Joint-Updating Phase

We open our analysis of the joint-updating phase of Algorithm 1 by proving f¯\bar{f}, as defined in Section 4.3, has a gradient Lipschitz retraction over {\mathcal{M}}^{\prime}.

Lemma 14.

Under Assumptions 2 and 5, \bar{f}:\mathcal{M}^{\prime}\rightarrow\mathbb{R} has a gradient Lipschitz retraction with respect to the retraction over the product manifold, \text{\rm Retr}_{\mathcal{M}^{\prime}}:\mathcal{T}\mathcal{M}^{\prime}\rightarrow\mathcal{M}^{\prime}, defined such that ((\bm{x},\bm{y}),(\eta_{x},\eta_{y}))\mapsto(\text{\rm Retr}_{\mathcal{M}}(\bm{x},\eta_{x}),\bm{y}+\eta_{y}), with constant L_{R}=L_{xy}L_{1}+\max\left(L_{x},L_{y}\right).

Proof.

Let 𝒛=(𝒙,𝒚)×n\bm{z}=(\bm{x},\bm{y})\in\mathcal{M}\times\mathbb{R}^{n} and ηz=(ηx,ηy)𝒯𝒙×n\eta_{z}=(\eta_{x},\eta_{y})\in\mathcal{T}_{\bm{x}}\mathcal{M}\times\mathbb{R}^{n}. Then,

f¯(Retr(𝒛,ηz))\displaystyle\bar{f}(\text{\rm Retr}_{\mathcal{M}^{\prime}}(\bm{z},\eta_{z}))
=f(Retr(𝒙,ηx),𝒚+ηy)\displaystyle\hskip 21.68121pt=f(\text{\rm Retr}_{\mathcal{M}}(\bm{x},\eta_{x}),\bm{y}+\eta_{y})
f(Retr(𝒙,ηx),𝒚)+yf(Retr(𝒙,ηx),𝒚),ηy+Ly2ηy2\displaystyle\hskip 21.68121pt\leq f(\text{\rm Retr}_{\mathcal{M}}(\bm{x},\eta_{x}),\bm{y})+\langle\nabla_{y}f(\text{\rm Retr}_{\mathcal{M}}(\bm{x},\eta_{x}),\bm{y}),\eta_{y}\rangle+\frac{L_{y}}{2}\|\eta_{y}\|^{2}
f(𝒙,𝒚)+gradf(𝒙,𝒚),ηx𝒙+yf(Retr(𝒙,ηx),𝒚),ηy+Ly2ηy2+Lx2ηx𝒙2\displaystyle\hskip 21.68121pt\leq f(\bm{x},\bm{y})+\langle\operatorname{grad}f(\bm{x},\bm{y}),\eta_{x}\rangle_{\bm{x}}+\langle\nabla_{y}f(\text{\rm Retr}_{\mathcal{M}}(\bm{x},\eta_{x}),\bm{y}),\eta_{y}\rangle+\frac{L_{y}}{2}\|\eta_{y}\|^{2}+\frac{L_{x}}{2}\|\eta_{x}\|_{\bm{x}}^{2}
=f(𝒙,𝒚)+yf(𝒙,𝒚),ηy+yf(Retr(𝒙,ηx),𝒚)yf(𝒙,𝒚),ηy\displaystyle\hskip 21.68121pt=f(\bm{x},\bm{y})+\langle\nabla_{y}f(\bm{x},\bm{y}),\eta_{y}\rangle+\langle\nabla_{y}f(\text{\rm Retr}_{\mathcal{M}}(\bm{x},\eta_{x}),\bm{y})-\nabla_{y}f(\bm{x},\bm{y}),\eta_{y}\rangle
+gradf(𝒙,𝒚),ηx𝒙+Ly2ηy2+Lx2ηx𝒙2,\displaystyle\hskip 72.26999pt+\langle\operatorname{grad}f(\bm{x},\bm{y}),\eta_{x}\rangle_{\bm{x}}+\frac{L_{y}}{2}\|\eta_{y}\|^{2}+\frac{L_{x}}{2}\|\eta_{x}\|_{\bm{x}}^{2}, (85)

where both inequalities follow from the gradient Lipschitz conditions on f. Investigating the third term we see

|yf(Retr(𝒙,ηx),𝒚)yf(𝒙,𝒚),ηy|\displaystyle|\langle\nabla_{y}f(\text{\rm Retr}_{\mathcal{M}}(\bm{x},\eta_{x}),\bm{y})-\nabla_{y}f(\bm{x},\bm{y}),\eta_{y}\rangle| yf(Retr(𝒙,ηx),𝒚)yf(𝒙,𝒚)ηy\displaystyle\leq\|\nabla_{y}f(\text{\rm Retr}_{\mathcal{M}}(\bm{x},\eta_{x}),\bm{y})-\nabla_{y}f(\bm{x},\bm{y})\|\cdot\|\eta_{y}\|
(66)LxyRetr(𝒙,ηx)𝒙ηy\displaystyle\overset{\eqref{eq:f_grad_xy}}{\leq}L_{xy}\|\text{\rm Retr}_{\mathcal{M}}(\bm{x},\eta_{x})-\bm{x}\|_{\mathcal{M}}\cdot\|\eta_{y}\|
(81)LxyL1ηx𝒙ηy\displaystyle\overset{\eqref{eq:retraction_cst}}{\leq}L_{xy}L_{1}\cdot\|\eta_{x}\|_{\bm{x}}\cdot\|\eta_{y}\|
LxyL12(ηx𝒙2+ηy2),\displaystyle\leq\frac{L_{xy}L_{1}}{2}\cdot\left(\|\eta_{x}\|_{\bm{x}}^{2}+\|\eta_{y}\|^{2}\right), (86)

where the mixed Lipschitz condition for f was utilized to obtain the second inequality and the third followed from Assumption 5 and Proposition 7. Combining (85) and (86), we obtain our result. ∎

Remark 7.1.

Note that this lemma depends on the retraction being applied. Due to Assumption 2, Lemma 14 applies to both the exponential map and the chosen retraction utilized in Algorithm 1.

We now present the improvement guarantee for the joint-updating phase if this part of the algorithm begins with a point which is not a (\delta_{2},\epsilon_{3})-approximate first-order KKT point of (24). The argument is similar to the analysis of both the x and y-updating phases.

Lemma 15.

Let δ2>0\delta_{2}>0. If mKKTδ2(𝐱k,𝐲k)>ϵ3m^{\delta_{2}}_{KKT}(\bm{x}^{k},\bm{y}^{k})>\epsilon_{3} and the joint-updating phase is entered, then the update shall terminate with

f(𝒙k+1,𝒚k+1)f(𝒙k,𝒚k)tlow,kktα3ϵ3f(\bm{x}^{k+1},\bm{y}^{k+1})\leq f(\bm{x}^{k},\bm{y}^{k})-t_{\text{low},kkt}\cdot\alpha_{3}\cdot\epsilon_{3} (87)

where

tlow,kkt:=γmin{1,δzL1+1,ρL1+1,κτzΓ(𝒢c¯,z+g,y),(1α3)ϵ3LR2+τz2Γ𝒢f¯,z(𝒢c¯,z+g,y)+18LRτz2Γ2(𝒢c¯,z+g,y)2}.t_{\text{low},\text{kkt}}:=\gamma\min\bigg{\{}1,\frac{\delta_{z}}{L_{1}+1},\frac{\rho}{L_{1}+1},\sqrt{\frac{\kappa_{\mathcal{M}}}{\tau_{z}\Gamma(\mathcal{G}_{\bar{c},z}+\mathcal{H}_{g,y})}},\\ \frac{(1-\alpha_{3})\epsilon_{3}}{\frac{L_{R}}{2}+\frac{\tau_{z}}{2}\Gamma\mathcal{G}_{\bar{f},z}(\mathcal{G}_{\bar{c},z}+\mathcal{H}_{g,y})+\frac{1}{8}L_{R}\tau_{z}^{2}\Gamma^{2}\left(\mathcal{G}_{\bar{c},z}+\mathcal{H}_{g,y}\right)^{2}}\bigg{\}}. (88)
Proof.

Let 𝒛k=(𝒙k,𝒚k)𝒞\bm{z}^{k}=(\bm{x}^{k},\bm{y}^{k})\in\mathcal{C} satisfy myδ1(𝒙k,𝒚k)ϵ1m^{\delta_{1}}_{y}(\bm{x}^{k},\bm{y}^{k})\leq\epsilon_{1}, mx(𝒙k,𝒚k)ϵ2m_{x}(\bm{x}^{k},\bm{y}^{k})\leq\epsilon_{2}, and mKKTδ2(𝒙k,𝒚k)>ϵ3m^{\delta_{2}}_{KKT}(\bm{x}^{k},\bm{y}^{k})>\epsilon_{3}. Let 𝒅k=(𝒅xk,𝒅yk)𝒯𝒙k×n\bm{d}^{k}=(\bm{d}_{x}^{k},\bm{d}_{y}^{k})\in\mathcal{T}_{\bm{x}^{k}}\mathcal{M}\times\mathbb{R}^{n} be a minimizer defining mKKTδ2(𝒙k,𝒚k)m^{\delta_{2}}_{KKT}(\bm{x}^{k},\bm{y}^{k}). By Lemma 14, we know f¯\bar{f} has a gradient Lipschitz retraction. So, using the definition of the retraction on the product manifold \mathcal{M}^{\prime} we have

f(Retr(𝒙k,t𝒅xk),𝒚k+t𝒅yk)\displaystyle f(\text{\rm Retr}_{\mathcal{M}}(\bm{x}^{k},t\bm{d}_{x}^{k}),\bm{y}^{k}+t\bm{d}_{y}^{k}) f(𝒙k,𝒚k)+tgradf¯(𝒛k),𝒅k𝒛k+LRt22𝒅k𝒛k2\displaystyle\leq f(\bm{x}^{k},\bm{y}^{k})+t\langle\operatorname{grad}\bar{f}(\bm{z}^{k}),\bm{d}^{k}\rangle_{\bm{z}^{k}}+\frac{L_{R}t^{2}}{2}\|\bm{d}^{k}\|_{\bm{z}^{k}}^{2}
f(𝒙k,𝒚k)tmKKTδ2(𝒙k,𝒚k)+12LRt2.\displaystyle\leq f(\bm{x}^{k},\bm{y}^{k})-t\cdot m^{\delta_{2}}_{KKT}(\bm{x}^{k},\bm{y}^{k})+\frac{1}{2}L_{R}t^{2}. (89)

This demonstrates a first-order improvement on the value of the objective function can be obtained. Next, we demonstrate only a second-order violation occurs in the constraints. Similar to the x-updating phase, for i=1,,pi=1,\ldots,p, define the functions h¯i(t)=c¯i(Retr(𝒛k,t𝒅k))\bar{h}_{i}(t)=\bar{c}_{i}(\text{\rm Retr}_{\mathcal{M}^{\prime}}(\bm{z}^{k},t\bm{d}^{k})). By Taylor’s theorem,

h¯i(t)=h¯i(0)+tgradc¯i(𝒛k),𝒅k𝒛k+t22h¯i′′(a)\bar{h}_{i}(t)=\bar{h}_{i}(0)+t\langle\operatorname{grad}\bar{c}_{i}(\bm{z}^{k}),\bm{d}^{k}\rangle_{\bm{z}^{k}}+\frac{t^{2}}{2}\bar{h}_{i}^{\prime\prime}(a)

for some a(0,1)a\in(0,1). Therefore, it follows by the definition of 𝒅k\bm{d}^{k} and (61) that

|c¯i(Retr(𝒛k,t𝒅k))|t22𝒢c¯i,z which implies 𝒄(Retr(𝒙k,t𝒅xk),𝒚k+t𝒅yk)t22𝒢c¯,z.|\bar{c}_{i}(\text{\rm Retr}_{\mathcal{M}^{\prime}}(\bm{z}^{k},t\bm{d}^{k}))|\leq\frac{t^{2}}{2}\mathcal{G}_{\bar{c}_{i},z}\text{ which implies }\|\bm{c}(\text{\rm Retr}_{\mathcal{M}}(\bm{x}^{k},t\bm{d}_{x}^{k}),\bm{y}^{k}+t\bm{d}_{y}^{k})\|\leq\frac{t^{2}}{2}\mathcal{G}_{\bar{c},z}.

Since the inequality constraints only depend upon 𝒚\bm{y}, the prior argument to prove (7.1) stands provided tδ2/𝒢g,yt\leq\delta_{2}/\mathcal{G}_{g,y}. Then, since our utilized retraction over \mathcal{M} satisfies (81), if tδz/(L1+1)t\leq\delta_{z}/(L_{1}+1) we have

Retr(𝒛k,t𝒅k)𝒛k\displaystyle\|\text{\rm Retr}_{\mathcal{M}^{\prime}}(\bm{z}^{k},t\bm{d}^{k})-\bm{z}^{k}\|_{\mathcal{M}^{\prime}} =Retr(𝒙k,t𝒅xk)𝒙k+(𝒚k+t𝒅yk)𝒚k\displaystyle=\|\text{\rm Retr}_{\mathcal{M}}(\bm{x}^{k},t\bm{d}_{x}^{k})-\bm{x}^{k}\|_{\mathcal{M}}+\|(\bm{y}^{k}+t\bm{d}_{y}^{k})-\bm{y}^{k}\|
tL1𝒅xk𝒙k+t𝒅yk\displaystyle\leq tL_{1}\|\bm{d}_{x}^{k}\|_{\bm{x}^{k}}+t\|\bm{d}_{y}^{k}\|
δz,\displaystyle\leq\delta_{z}, (90)

and the error bound condition (70) holds. Therefore,

dist((Retr(𝒙k,t𝒅xk),𝒚k+t𝒅yk),𝒞)τz2(𝒢c¯,z+g,y)t2.\text{dist}\left((\text{\rm Retr}_{\mathcal{M}}(\bm{x}^{k},t\bm{d}_{x}^{k}),\bm{y}^{k}+t\bm{d}_{y}^{k}),\mathcal{C}\right)\leq\frac{\tau_{z}}{2}\left(\mathcal{G}_{\bar{c},z}+\mathcal{H}_{g,y}\right)t^{2}. (91)

Define 𝒛k(t)=(𝒙k(t),𝒚k(t)):=(Retr(𝒙k,t𝒅xk),𝒚k+t𝒅yk)\bm{z}^{k}(t)=(\bm{x}^{k}(t),\bm{y}^{k}(t)):=(\text{\rm Retr}_{\mathcal{M}}(\bm{x}^{k},t\bm{d}_{x}^{k}),\bm{y}^{k}+t\bm{d}_{y}^{k}) and (𝒙pk(t),𝒚pk(t))=Π𝒞(𝒙k(t),𝒚k(t))(\bm{x}_{p}^{k}(t),\bm{y}_{p}^{k}(t))=\Pi_{\mathcal{C}}(\bm{x}^{k}(t),\bm{y}^{k}(t)). By Lemma 12, Assumption 6, and the fact the exponential map over the product manifold \mathcal{M}^{\prime} is the Cartesian product of the exponential maps over \mathcal{M} and n\mathbb{R}^{n}, if

tmin{ρ/(L1+1),κ/(τzΓ(𝒢c¯,z+g,y))},t\leq\min\{\rho/(L_{1}+1),\sqrt{{\kappa_{\mathcal{M}}}/(\tau_{z}\Gamma(\mathcal{G}_{\bar{c},z}+\mathcal{H}_{g,y}))}\},

then there exists 𝒅~(t)=(𝒅x~(t),𝒅y~(t))𝒯(𝒙k(t),𝒚k(t))\tilde{\bm{d}}(t)=(\tilde{\bm{d}_{x}}(t),\tilde{\bm{d}_{y}}(t))\in\mathcal{T}_{(\bm{x}^{k}(t),\bm{y}^{k}(t))}\mathcal{M}^{\prime} such that

(𝒙pk(t),𝒚pk(t))=(Exp(𝒙k(t),t𝒅x~(t)),𝒚k+t𝒅y~(t)).(\bm{x}^{k}_{p}(t),\bm{y}^{k}_{p}(t))=\left(\text{Exp}(\bm{x}^{k}(t),t\tilde{\bm{d}_{x}}(t)),\bm{y}^{k}+t\tilde{\bm{d}_{y}}(t)\right).

Furthermore, we have

dist((𝒙pk(t),𝒚pk(t)),(𝒙k(t),𝒚k(t)))=𝒅~(t)𝒛k(t)τzΓ2(𝒢c¯,z+g,y)t2.\text{\rm dist}((\bm{x}^{k}_{p}(t),\bm{y}^{k}_{p}(t)),(\bm{x}^{k}(t),\bm{y}^{k}(t)))=\|\tilde{\bm{d}}(t)\|_{\bm{z}^{k}(t)}\leq\frac{\tau_{z}\Gamma}{2}\left(\mathcal{G}_{\bar{c},z}+\mathcal{H}_{g,y}\right)t^{2}. (92)

Therefore, by the gradient Lipschitz retraction on f¯\bar{f} we have

f(𝒙pk(t),𝒚pk(t))\displaystyle f(\bm{x}^{k}_{p}(t),\bm{y}^{k}_{p}(t))
f(𝒙k(t),𝒚k(t))+gradzf¯(𝒛k(t)),𝒅~(t)𝒛k(t)+LR2𝒅~(t)𝒛k(t)2\displaystyle\leq f(\bm{x}^{k}(t),\bm{y}^{k}(t))+\langle\operatorname{grad}_{z}\bar{f}(\bm{z}^{k}(t)),\tilde{\bm{d}}(t)\rangle_{\bm{z}^{k}(t)}+\frac{L_{R}}{2}\|\tilde{\bm{d}}(t)\|_{\bm{z}^{k}(t)}^{2}
f(𝒙k(t),𝒚k(t))+𝒢f¯,z𝒅~(t)𝒛k(t)+LR2𝒅~(t)𝒛k(t)2\displaystyle\leq f(\bm{x}^{k}(t),\bm{y}^{k}(t))+\mathcal{G}_{\bar{f},z}\|\tilde{\bm{d}}(t)\|_{\bm{z}^{k}(t)}+\frac{L_{R}}{2}\|\tilde{\bm{d}}(t)\|_{\bm{z}^{k}(t)}^{2}
(92)f(𝒙k(t),𝒚k(t))+𝒢f¯,z(τzΓ2(𝒢c¯,z+g,y)t2)+LR2(τzΓ2(𝒢c¯,z+g,y)t2)2\displaystyle\overset{\eqref{eq:joint_1}}{\leq}f(\bm{x}^{k}(t),\bm{y}^{k}(t))+\mathcal{G}_{\bar{f},z}\left(\frac{\tau_{z}\Gamma}{2}\left(\mathcal{G}_{\bar{c},z}+\mathcal{H}_{g,y}\right)t^{2}\right)+\frac{L_{R}}{2}\left(\frac{\tau_{z}\Gamma}{2}\left(\mathcal{G}_{\bar{c},z}+\mathcal{H}_{g,y}\right)t^{2}\right)^{2}
(7.3)f(𝒙k,𝒚k)tmKKTδ2(𝒙k,𝒚k)+12LRt2\displaystyle\overset{\eqref{eq:f_improve}}{\leq}f(\bm{x}^{k},\bm{y}^{k})-tm^{\delta_{2}}_{KKT}(\bm{x}^{k},\bm{y}^{k})+\frac{1}{2}L_{R}t^{2}
+𝒢f¯,z(τzΓ2(𝒢c¯,z+g,y)t2)+LR2(τzΓ2(𝒢c¯,z+g,y)t2)2\displaystyle\hskip 144.54pt+\mathcal{G}_{\bar{f},z}\left(\frac{\tau_{z}\Gamma}{2}\left(\mathcal{G}_{\bar{c},z}+\mathcal{H}_{g,y}\right)t^{2}\right)+\frac{L_{R}}{2}\left(\frac{\tau_{z}\Gamma}{2}\left(\mathcal{G}_{\bar{c},z}+\mathcal{H}_{g,y}\right)t^{2}\right)^{2}
f(𝒙k,𝒚k)tmKKTδ2(𝒙k,𝒚k)+(LR2+τzΓ2𝒢f¯,z(𝒢c¯,z+g,y)+18LRΓ2τz2(𝒢c¯,z+g,y)2)t2,\displaystyle\leq f(\bm{x}^{k},\bm{y}^{k})-tm^{\delta_{2}}_{KKT}(\bm{x}^{k},\bm{y}^{k})+\left(\frac{L_{R}}{2}+\frac{\tau_{z}\Gamma}{2}\mathcal{G}_{\bar{f},z}(\mathcal{G}_{\bar{c},z}+\mathcal{H}_{g,y})+\frac{1}{8}L_{R}\Gamma^{2}\tau_{z}^{2}\left(\mathcal{G}_{\bar{c},z}+\mathcal{H}_{g,y}\right)^{2}\right)t^{2},

where the last inequality follows assuming t1t\leq 1. Thus, a sufficient decrease shall be obtained provided

tmin{1,δzL1+1,ρL1+1,κτzΓ(𝒢c¯,z+g,y),(1α3)mKKTδ2(𝒙k,𝒚k)LR2+τz2Γ𝒢f¯,z(𝒢c¯,z+g,y)+18LRτz2Γ2(𝒢c¯,z+g,y)2}.t\leq\min\bigg{\{}1,\frac{\delta_{z}}{L_{1}+1},\frac{\rho}{L_{1}+1},\sqrt{\frac{\kappa_{\mathcal{M}}}{\tau_{z}\Gamma(\mathcal{G}_{\bar{c},z}+\mathcal{H}_{g,y})}},\\ \frac{(1-\alpha_{3})m^{\delta_{2}}_{KKT}(\bm{x}^{k},\bm{y}^{k})}{\frac{L_{R}}{2}+\frac{\tau_{z}}{2}\Gamma\mathcal{G}_{\bar{f},z}(\mathcal{G}_{\bar{c},z}+\mathcal{H}_{g,y})+\frac{1}{8}L_{R}\tau_{z}^{2}\Gamma^{2}\left(\mathcal{G}_{\bar{c},z}+\mathcal{H}_{g,y}\right)^{2}}\bigg{\}}. (93)

Hence, as in the prior arguments, it follows the stepsize in the joint-updating phase is lower bounded by

tlow,kkt=γmin{1,δzL1+1,ρL1+1,κτzΓ(𝒢c¯,z+g,y),(1α3)ϵ3LR2+τz2Γ𝒢f¯,z(𝒢c¯,z+g,y)+18LRτz2Γ2(𝒢c¯,z+g,y)2}t_{\text{low},\text{kkt}}=\gamma\min\bigg{\{}1,\frac{\delta_{z}}{L_{1}+1},\frac{\rho}{L_{1}+1},\sqrt{\frac{\kappa_{\mathcal{M}}}{\tau_{z}\Gamma(\mathcal{G}_{\bar{c},z}+\mathcal{H}_{g,y})}},\\ \frac{(1-\alpha_{3})\epsilon_{3}}{\frac{L_{R}}{2}+\frac{\tau_{z}}{2}\Gamma\mathcal{G}_{\bar{f},z}(\mathcal{G}_{\bar{c},z}+\mathcal{H}_{g,y})+\frac{1}{8}L_{R}\tau_{z}^{2}\Gamma^{2}\left(\mathcal{G}_{\bar{c},z}+\mathcal{H}_{g,y}\right)^{2}}\bigg{\}} (94)

and the result follows. ∎

7.4 Overall Complexity

Using the prior updating phases, we can derive the overall complexity of Algorithm 1 for locating an approximate first-order KKT point of (24).

Theorem 16.

Assume the initial level-set L(\bm{x}^{0},\bm{y}^{0}) is bounded, there exists a finite lower bound f^{*} such that f^{*}\leq f(\bm{x},\bm{y}) for all (\bm{x},\bm{y})\in L(\bm{x}^{0},\bm{y}^{0}), and Assumptions 1-6 hold. Then Algorithm 1 shall locate an (\epsilon,\epsilon)-approximate first-order KKT point of (24) in at most \mathcal{O}(1/\epsilon^{2}) iterations.

Proof.

Assume δ1=δ2=ϵ1=ϵ2=ϵ3=ϵ\delta_{1}=\delta_{2}=\epsilon_{1}=\epsilon_{2}=\epsilon_{3}=\epsilon in Algorithm 1. For the sake of obtaining a contradiction, assume additionally the algorithm never terminates. By Lemmas 11, 13, and 15, it follows

f(𝒙k+1,𝒚k+1)f(𝒙k,𝒚k)(min1i3αi)min{tlow,y,tlow,x,tlow,kkt}ϵf(\bm{x}^{k+1},\bm{y}^{k+1})\leq f(\bm{x}^{k},\bm{y}^{k})-\left(\min_{1\leq i\leq 3}\alpha_{i}\right)\cdot\min\{t_{\text{low},y},t_{\text{low},x},t_{\text{low,kkt}}\}\cdot\epsilon

for all k0k\geq 0. For ϵ\epsilon small enough there exist positive constants C1,C2C_{1},C_{2}, and C3C_{3} such that

tlow,y=C1ϵ,tlow,x=C2ϵ,andtlow,kkt=C3ϵ.t_{\text{low},y}=C_{1}\epsilon,\;\;t_{\text{low},x}=C_{2}\epsilon,\;\;\text{and}\;\;t_{\text{low},kkt}=C_{3}\epsilon.

Hence, letting Θ:=(min1i3αi)(min1i3Ci)\Theta:=\left(\min_{1\leq i\leq 3}\alpha_{i}\right)\cdot\left(\min_{1\leq i\leq 3}C_{i}\right) we have f(𝒙k+1,𝒚k+1)f(𝒙k,𝒚k)Θϵ2f(\bm{x}^{k+1},\bm{y}^{k+1})\leq f(\bm{x}^{k},\bm{y}^{k})-\Theta\epsilon^{2} for all k0k\geq 0. Since the algorithm does not terminate, by repeated application of the above inequality we have

f(\bm{x}^{k+1},\bm{y}^{k+1})\leq f(\bm{x}^{0},\bm{y}^{0})-\Theta(k+1)\epsilon^{2}.

So, if k>f(𝒙0,𝒚0)fΘϵ2+1k>\lceil\frac{f(\bm{x}^{0},\bm{y}^{0})-f^{*}}{\Theta\epsilon^{2}}+1\rceil, then f(𝒙k+1,𝒚k+1)<ff(\bm{x}^{k+1},\bm{y}^{k+1})<f^{*} which violates our assumptions. Therefore, the algorithm must terminate in at most 𝒪(1/ϵ2)\mathcal{O}(1/\epsilon^{2}) iterations. ∎
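To make the dependence on \epsilon concrete, the following minimal sketch (with purely hypothetical values for the initial gap f(\bm{x}^{0},\bm{y}^{0})-f^{*} and the constant \Theta; the function name is ours) evaluates the worst-case iteration count implied by the final bound in the proof.

```python
import math

def worst_case_iterations(f0, f_star, theta, eps):
    """Worst-case iteration count implied by the proof of Theorem 16:
    the algorithm must terminate once k exceeds ceil((f0 - f*) / (Theta * eps^2) + 1)."""
    return math.ceil((f0 - f_star) / (theta * eps ** 2) + 1)

# Hypothetical values purely for illustration.
f0, f_star, theta = 10.0, 0.0, 0.5
for eps in (1e-1, 1e-2, 1e-3):
    print(eps, worst_case_iterations(f0, f_star, theta, eps))  # count grows like 1/eps^2
```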

8 Numerical Experiments

The purpose of this section is twofold. First, we want to demonstrate the efficacy of the modeling framework given by (1), and, second, we want to exhibit the theory and applicability of Algorithm 1.

8.1 Generalized Semidefinite Programs

To demonstrate the novelty of our modeling framework and the ability of our method to solve general matrix optimization models with nonconvex constraints, we present a set of numerical tests on generalized semidefinite programs. The instance of (3) we considered in our experiments was

min\displaystyle\min 𝑰,𝑿\displaystyle\;\;\langle-\bm{I},\bm{X}\rangle (95)
s.t. 𝑨i,𝑿=i,i=1,,s,\displaystyle\;\;\langle\bm{A}_{i},\bm{X}\rangle=\ell_{i},\;\;i=1,\ldots,s,
λn(𝑿)b1,\displaystyle\;\;\lambda_{n}(\bm{X})\leq b_{1},
λn(𝑿)+λn1(𝑿)b2,\displaystyle\;\;\lambda_{n}(\bm{X})+\lambda_{n-1}(\bm{X})\leq b_{2},
\displaystyle\;\;\hskip 36.135pt\vdots
λn(𝑿)+λn1(𝑿)++λ1(𝑿)bn,\displaystyle\;\;\lambda_{n}(\bm{X})+\lambda_{n-1}(\bm{X})+\cdots+\lambda_{1}(\bm{X})\leq b_{n},
λn(𝑿)0,\displaystyle\;\;\lambda_{n}(\bm{X})\geq 0,
𝑿𝒮n×n,\displaystyle\;\;\bm{X}\in\mathcal{S}^{n\times n},

where we designed the problem parameters such that a feasible point existed. A few observations are notable about (95). First, the objective function is lower bounded by -b_{n}, which follows directly from the second-to-last inequality constraint in the model since \langle-\bm{I},\bm{X}\rangle=-\sum_{i=1}^{n}\lambda_{i}(\bm{X})\geq-b_{n}. Thus, we have a simple way of measuring whether our algorithm is able to compute a global minimizer. Second, the eigenvalue constraints in the model are nonconvex since the left-hand side of each inequality, a sum of the smallest eigenvalues of \bm{X}, is a concave function.

We generated our test problems by randomly constructing a positive definite matrix \bm{C}, setting \ell_{i}=\langle\bm{A}_{i},\bm{C}\rangle for i=1,\ldots,s, with each \bm{A}_{i} a randomly generated symmetric matrix, and setting b_{i} equal to the sum of the i smallest eigenvalues of \bm{C}. Hence, \bm{C} is a feasible solution to (95), and because it makes the second-to-last inequality tight it is a globally optimal solution.
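For concreteness, the following sketch illustrates one way such a test instance could be generated; it is a minimal illustration assuming NumPy, and the particular scalings and names are ours, not necessarily those used in our experiments.

```python
import numpy as np

def generate_instance(n, s, rng=np.random.default_rng(0)):
    """Generate a random instance of (95) with a known global minimizer C."""
    # Random positive definite matrix C (the planted feasible/optimal point).
    B = rng.standard_normal((n, n))
    C = B @ B.T + n * np.eye(n)

    # Random symmetric A_i and right-hand sides ell_i = <A_i, C>.
    A = [(M + M.T) / 2 for M in rng.standard_normal((s, n, n))]
    ell = np.array([np.sum(Ai * C) for Ai in A])

    # b_i = sum of the i smallest eigenvalues of C, so C is feasible and
    # makes the trace constraint (the second-to-last inequality) tight.
    eigs = np.linalg.eigvalsh(C)          # returned in ascending order
    b = np.cumsum(eigs)                   # b_1, ..., b_n
    return C, A, ell, b
```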

We conducted ten randomized tests for varying problem sizes of (95). The results are displayed in Table 1. For each randomized test we computed the following values:

  • the absolute difference between the objective value of the final iterate generated by Algorithm 1 and the optimal objective value, i.e., |f(𝑿final)f||f(\bm{X}^{\text{final}})-f^{*}|

  • the feasibility error of the final iterate computed by Algorithm 1 with respect to the equality constraints, i.e., 𝑮(𝑿final)\|\bm{G}(\bm{X}^{\text{final}})\|

  • the feasibility error of the final iterate produced by Algorithm 1 with respect to the inequality constraints, i.e., max(𝒈(𝝀(𝑿final)),𝟎)\|\max\left(\bm{g}(\bm{\lambda}(\bm{X}^{\text{final}})),\bm{0}\right)\|

Table 1 also tracks the minimum, median, and maximum values of these three quantities across the ten randomized experiments for each problem size. It also counts the total number of problems solved for each dimension; we consider a problem solved if the objective value at termination was within 10^{-6} of the optimal value and the point located had a feasibility error less than 10^{-6} in both the equality and inequality constraints.

Dim.  Solved  Dist. to Opt. Value (|f(\bm{X}^{\text{final}})-f^{*}|)  Feasibility Error: Equality  Feasibility Error: Inequality
n=5 9 [0, 1.48e-11, 1.22] [2.22e-16, 1.12e-08, 1.78e-07] [0, 4.86e-7, 9.99e-7]
n=10 8 [0, 1.31e-08, 3.68] [7.11e-15, 2.51e-14, 2.57e-07] [0, 0, 9.53e-7]
n=25 10 [0, 2.27e-13, 5.89e-8] [7.69e-14, 1.42e-13, 2.06e-07] [0, 0, 9.69e-7]
n=50 10 [0, 1.59e-12, 3.44e-7] [6.96e-13, 4.40e-09, 1.30e-07] [0, 4.54e-7, 9.69e-7]
n=100 8 [1.82e-12, 5.46e-12, 4.72] [4.33e-12, 3.22e-10, 1.89e-08] [0, 4.47e-7, 9.93e-7]
Table 1: Displays the results of the randomized tests conducted on (95). Each entry in the columns provides the minimum, median, and maximum value of the respective quantity for the ten randomized experiments conducted on each problem size nn. Any numerical value obtained below 102010^{-20} was set to zero in the table.

From the table we notice Algorithm 1 was able to find near globally optimal solutions to the vast majority of the problems. Even though (95) is a nonconvex problem, our method successfully solved 45 of the 50 problem instances. Furthermore, the performance of our method seems largely independent of the problem size, at least for the subset of generalized SDPs represented by (95). A question worth pursuing in the future is applications of generalized SDPs, especially as an approach to relax problems which have classically been rewritten as semidefinite programs.

8.2 Quadratically Constrained Quadratic Programs

An important instance of a QCQP is

min\displaystyle\min 𝒙𝑪𝒙\displaystyle\;\bm{x}^{\top}\bm{C}\bm{x} (96)
s.t. 𝒙𝑨i𝒙1,i=1,,m,\displaystyle\;\bm{x}^{\top}\bm{A}_{i}\bm{x}\geq 1,\;i=1,\ldots,m,
𝒙n,\displaystyle\;\bm{x}\in\mathbb{R}^{n},

where \bm{C} and \bm{A}_{i} are positive semidefinite. Though the objective function in (96) is convex, the constraint set is nonconvex since it is the intersection of the exteriors of m ellipsoids. This model is well known to be NP-hard and, in general, more difficult than Max-Cut; a comparison of the approximation ratios between the two problems supports this position. Problem (96) has applications in communications through the multicast downlink transmit beamforming problem [2].

A popular and powerful approach to tackle (96) follows by relaxing the problem to a semidefinite program and then applying a randomization procedure to compute feasible solutions to the problem. Namely, (96) is rewritten as the equivalent matrix optimization model,

min\displaystyle\min 𝑪,𝑿\displaystyle\;\langle\bm{C},\bm{X}\rangle (97)
s.t. 𝑨i,𝑿1,i=1,,m,\displaystyle\;\langle\bm{A}_{i},\bm{X}\rangle\geq 1,\;i=1,\ldots,m,
𝑿𝟎,rank(𝑿)=1,\displaystyle\;\bm{X}\succeq\bm{0},\;\text{rank}(\bm{X})=1,

and then relaxed to the semidefinite program

min\displaystyle\min 𝑪,𝑿\displaystyle\;\langle\bm{C},\bm{X}\rangle (98)
s.t. 𝑨i,𝑿1,i=1,,m,\displaystyle\;\langle-\bm{A}_{i},\bm{X}\rangle\leq-1,\;i=1,\ldots,m,
𝑿𝟎.\displaystyle\;\bm{X}\succeq\bm{0}.

A global minimizer \bm{X}^{*}_{sdp} of (98) is then computed and utilized to define Gaussian random variables \bm{\xi}_{\ell}\sim\mathcal{N}(\bm{0},\bm{X}_{sdp}^{*}) for \ell=1,\ldots,L. From these random variables, feasible solutions \bm{x}(\bm{\xi}_{\ell}) to (96) are formed such that

𝒙(𝝃)=𝝃min1im𝝃𝑨i𝝃\bm{x}(\bm{\xi}_{\ell})=\frac{\bm{\xi}_{\ell}}{\sqrt{\min_{1\leq i\leq m}\bm{\xi}_{\ell}^{\top}\bm{A}_{i}\bm{\xi}_{\ell}}} (99)

for =1,,L\ell=1,\ldots,L. Randomization procedures such as this combined with semidefinite program relaxations have proven extremely effective at locating approximate solutions to challenging QCQPs; see [36] for a survey of semidefinite relaxation and randomization applied to QCQPs.
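To fix ideas, a minimal sketch of the randomization step (99) is given below; it assumes NumPy, takes as input a solution of (98) computed by any SDP solver, and the function and variable names are hypothetical.

```python
import numpy as np

def sdr_randomization(C, A_list, X_sdp, L=20, rng=np.random.default_rng(0)):
    """Randomization step of the SDR approach, cf. (99): sample xi ~ N(0, X_sdp)
    and scale each sample so that min_i xi^T A_i xi = 1."""
    n = C.shape[0]
    samples = rng.multivariate_normal(np.zeros(n), X_sdp, size=L)
    best_x, best_val = None, np.inf
    for xi in samples:
        scale = min(xi @ Ai @ xi for Ai in A_list)
        if scale <= 0:                     # degenerate sample; skip it
            continue
        x_feas = xi / np.sqrt(scale)       # now x_feas^T A_i x_feas >= 1 for all i
        val = x_feas @ C @ x_feas
        if val < best_val:
            best_x, best_val = x_feas, val
    return best_x, best_val
```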

The model and algorithm we have developed present new approaches for solving (96) by enabling novel nonconvex relaxations of the rank-1 constraint in the equivalent model (97). Rather than removing the rank condition completely to obtain a convex problem, we instead solve

min\displaystyle\min 𝑪,𝑿\displaystyle\;\langle\bm{C},\bm{X}\rangle (100)
s.t. 𝑨i,𝑿1,i=1,,m,\displaystyle\;\langle-\bm{A}_{i},\bm{X}\rangle\leq-1,\;i=1,\ldots,m,
λ1(𝑿)δ,λi(𝑿)[0,δ],i=2,,n,\displaystyle\;\lambda_{1}(\bm{X})\geq\delta,\;\lambda_{i}(\bm{X})\in[0,\delta],\;i=2,\ldots,n,
𝑿𝒮n×n,\displaystyle\;\bm{X}\in\mathcal{S}^{n\times n},

where the parameter δ0\delta\geq 0 allows us to alter the tightness of our approximation. If δ=0\delta=0, then (100) is equivalent to (97). For small positive values of δ\delta, the eigenvalue constraint enforces a near rank-1 constraint, as the bottom n1n-1 eigenvalues of 𝑿\bm{X} are then forced to be within δ\delta of zero. Once an approximate solution to (100) is computed using Algorithm 1, we form feasible solutions to (96) in two different fashions:

  1. (A1.)

    Projection Approach: After computing a stationary point \bm{X}^{*} of (100), the top eigenvalue \lambda_{1}(\bm{X}^{*}) and corresponding eigenvector \bm{v}_{1} of \bm{X}^{*} are computed and the vector \bm{\xi}_{*}=\sqrt{\lambda_{1}(\bm{X}^{*})}\bm{v}_{1} is formed. Then, using (99), the vector \bm{\xi}_{*} is scaled so that it satisfies the quadratic inequality constraints in (96). This scaled version of \bm{\xi}_{*} is then used as an approximate solution to (96); a minimal sketch of this projection step is given after this list.

  2. (A2.)

    Randomization Approach: After computing a stationary point to (100), the stationary point is utilized to generate randomized solutions in the same manner as done in the semidefinite relaxation and randomization approach described above.
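Below is a minimal sketch of the projection approach (A1.), assuming NumPy and a stationary point X_star of (100) returned by Algorithm 1; the names are hypothetical and the scaling mirrors (99).

```python
import numpy as np

def projection_candidate(C, A_list, X_star):
    """Approach (A1.): form xi = sqrt(lambda_1) v_1 from the leading eigenpair of
    a stationary point X_star of (100), then rescale as in (99)."""
    eigvals, eigvecs = np.linalg.eigh(X_star)      # ascending eigenvalues
    lam1, v1 = eigvals[-1], eigvecs[:, -1]
    xi = np.sqrt(max(lam1, 0.0)) * v1
    scale = min(xi @ Ai @ xi for Ai in A_list)     # assumed positive here
    x_feas = xi / np.sqrt(scale)                   # feasible for (96)
    return x_feas, x_feas @ C @ x_feas
```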

Figure 1: Displays the feasible region and level sets for an instance of (101) with m=25m=25. The white region represents all of the feasible points outside the interior of the mm ellipsoids while the concentric circles centered at the origin are the level sets of the norm-squared objective function.

To compare our model (100) and the two approximation approaches detailed above to the semidefinite program relaxation (SDR) approach with randomization, we conducted forty numerical experiments. In order to check whether our method was able to compute global minima, we conducted our numerical experiments with n=2 so that a brute force grid search could compute a global optimum of (96) in a reasonable amount of time. The form of (96) we considered in our experiments was

min\displaystyle\min 𝒙2\displaystyle\;\|\bm{x}\|^{2} (101)
s.t. 𝒙𝑨i𝒙1,i=1,,m,\displaystyle\;\bm{x}^{\top}\bm{A}_{i}\bm{x}\geq 1,\;i=1,\ldots,m,
𝒙2,\displaystyle\;\bm{x}\in\mathbb{R}^{2},

where we randomly generated positive definite matrices 𝑨i\bm{A}_{i} for i=1,,mi=1,\ldots,m. An example of the feasible region and level sets of the objective can be seen in Figure 1. In our experiments, we conducted ten randomized tests for m=5,10,25m=5,10,25, and 5050, and solved (100) with δ=101,103\delta=10^{-1},10^{-3}, and 10610^{-6} for all settings. When generating points via randomization, we sampled 20 points from the Gaussian distributions. Since our model is nonconvex, we can benefit from initializing our algorithm multiple times and resolving, whereas the SDR approach cannot. Thus, for these tests, we solved each instance of (100) three times. One initialization of our algorithm was formed by taking the solution to (98) and projecting it onto the constraint set in (100); the other two came from randomly generating a symmetric matrix and projecting onto the constraint set. Tables 2 and 3 display the numerical results of the experiments.
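For reference, the brute-force grid search used to estimate global optima of (101) for n=2 could be sketched as follows; the grid range and resolution below are illustrative choices of ours and NumPy is assumed.

```python
import numpy as np

def grid_search_optimum(A_list, radius=5.0, num=1001):
    """Brute-force estimate of the global optimum of (101) for n = 2:
    keep grid points satisfying x^T A_i x >= 1 for all i and return
    the smallest squared norm among them."""
    grid = np.linspace(-radius, radius, num)
    X1, X2 = np.meshgrid(grid, grid)
    pts = np.stack([X1.ravel(), X2.ravel()], axis=1)
    feasible = np.ones(len(pts), dtype=bool)
    for Ai in A_list:
        feasible &= np.einsum('ij,jk,ik->i', pts, Ai, pts) >= 1.0
    return np.inf if not feasible.any() else np.min(np.sum(pts[feasible] ** 2, axis=1))
```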

m   Test   Optimal   SDR: Orig., Random   Our Model, \delta=10^{-1}: Orig., Random, Project   \delta=10^{-3}: Orig., Random, Project   \delta=10^{-6}: Orig., Random, Project
5 1 1.10 1.10 1.10 1.10 1.10 1.10 1.10 1.10 1.10 1.10 1.10 1.10
2 1.49 1.49 1.49 1.49 1.49 1.49 1.49 1.49 1.49 1.49 1.49 1.49
3 2.86 1.91 3.17 2.74 2.94 2.87 2.85 2.86 2.85 2.85 2.85 2.85
4 2.33 1.77 2.41 2.25 2.33 2.34 2.33 2.33 2.33 2.33 2.33 2.33
5 1.67 1.66 1.83 1.67 1.68 1.68 1.67 1.67 1.67 1.67 1.67 1.67
6 2.26 1.76 2.59 2.20 2.27 2.28 2.26 2.26 2.26 2.26 2.26 2.26
7 1.91 1.67 1.94 1.87 1.91 1.92 1.91 1.91 1.91 1.91 1.91 1.91
8 1.13 1.13 1.13 1.13 1.13 1.13 1.13 1.13 1.13 1.13 1.13 1.13
9 2.57 1.93 2.59 2.49 2.64 2.58 2.57 2.57 2.57 2.57 2.57 2.57
10 2.18 1.84 2.35 2.14 2.19 2.19 2.17 2.18 2.17 2.17 2.17 2.17
10 1 5.41 1.95 9.05 5.19 5.68 5.57 5.56 5.57 5.56 5.56 5.56 5.56
2 3.91 1.89 5.05 3.68 4.41 3.92 3.91 3.92 3.91 3.91 3.91 3.91
3 2.37 1.88 2.60 2.31 2.42 2.36 2.36 2.36 2.36 2.36 2.36 2.36
4 1.78 1.72 2.11 1.77 1.79 1.80 1.78 1.78 1.78 1.78 1.78 1.78
5 2.98 1.92 3.18 2.87 3.01 2.98 2.98 2.98 2.98 2.98 2.98 2.98
6 4.49 1.89 5.34 4.21 4.59 4.49 4.95 4.97 4.95 4.48 4.48 4.48
7 3.45 1.87 3.74 3.26 3.49 3.47 3.45 3.45 3.45 3.45 3.45 3.45
8 3.85 1.89 5.79 3.64 3.99 3.85 3.85 3.85 3.85 3.85 3.85 3.85
9 3.06 1.88 3.27 2.90 3.09 3.06 3.05 3.06 3.05 3.05 3.05 3.05
10 2.97 1.94 3.00 2.87 2.98 2.97 2.97 2.98 2.97 2.97 2.97 2.97
Table 2: Showcases the results for the 20 tests conducted on (101) with m=5 and 10. The column titled Optimal estimates the optimal value of (101) within ±0.013 of the exact value. Under SDR, the column Orig. gives the optimal objective values of (98), while the column Random provides the best objective value obtained via the randomization process (99) using twenty samples. The header Our Model presents the best objective values obtained via (100) and our approximation procedures for different values of \delta: Orig. states the best objective value of (100) obtained from three initializations of Algorithm 1; Random gives the best randomized solution obtained from sampling based on our three solutions to (100); and Project gives the best value obtained from our three projected solutions computed by solving (100) and projecting as described in (A1.). Values in blue denote a globally optimal solution, within a tolerance.
m   Test   Optimal   SDR: Orig., Random   Our Model, \delta=10^{-1}: Orig., Random, Project   \delta=10^{-3}: Orig., Random, Project   \delta=10^{-6}: Orig., Random, Project
25 1 6.05 1.91 6.23 5.58 6.07 6.06 6.04 6.05 6.05 6.05 6.05 6.05
2 4.84 1.96 6.06 4.53 4.92 4.85 4.83 4.84 4.83 4.83 4.83 4.83
3 4.14 1.95 4.29 3.89 4.24 4.16 4.14 4.15 4.14 4.14 4.14 4.14
4 7.67 1.94 9.62 7.03 7.84 7.68 7.66 7.67 7.67 7.67 7.67 7.67
5 10.06 1.96 13.55 9.20 10.08 10.07 10.05 10.09 10.06 10.06 10.06 10.06
6 3.53 1.91 4.32 3.35 3.53 3.53 3.53 3.53 3.53 3.53 3.53 3.53
7 8.07 1.96 9.13 7.43 8.21 8.08 8.06 8.08 8.07 8.07 8.07 8.07
8 4.70 1.95 7.12 4.39 4.72 4.71 4.69 4.70 4.70 4.70 4.70 4.70
9 7.69 1.97 7.76 7.09 7.83 7.70 7.68 7.69 7.69 7.69 7.69 7.69
10 4.91 1.93 6.83 4.59 4.95 4.91 4.90 4.92 4.91 4.91 4.91 4.91
50 1 5.31 1.95 5.67 4.94 5.34 5.33 5.31 5.31 5.31 5.31 5.31 5.31
2 9.35 1.97 10.14 8.54 9.40 9.35 9.34 9.35 9.35 9.35 9.35 9.35
3 7.25 1.95 7.34 6.66 7.39 7.27 7.24 7.25 7.25 7.25 7.25 7.25
4 9.49 1.96 11.01 8.66 9.54 9.49 9.47 9.50 9.48 9.48 9.48 9.48
5 9.13 1.95 10.44 8.33 9.20 9.13 9.12 9.13 9.12 9.12 9.12 9.12
6 7.69 1.96 8.07 7.03 7.70 7.69 7.68 7.69 7.69 7.69 7.69 7.69
7 14.93 1.97 17.22 13.58 16.66 14.94 17.17 17.19 17.18 14.92 14.92 14.92
8 11.26 1.95 13.43 11.36 12.57 12.50 12.48 12.50 12.49 11.26 11.26 11.26
9 5.82 1.96 6.39 5.40 5.91 5.84 5.82 5.83 5.82 5.82 5.82 5.82
10 11.79 1.97 12.47 10.90 12.10 11.98 11.95 11.97 11.96 11.96 11.96 11.96
Table 3: Showcases the results for the 20 tests conducted on (101) with m=25 and 50. The descriptions of the entries in this table are identical to those in Table 2.
Figure 2: Displays the feasible points computed by each method for Test 8 with m=50m=50. The yellow star is an estimated global optimal solution, the red circles are the randomized solutions generated from the SDR approach, the green circles are the candidate points generated from our method via (A2.), and magenta circles are the projected solutions obtained via (A1.). From left to right, the three panels display the results for (100) with δ=101\delta=10^{-1}, δ=103\delta=10^{-3}, and δ=106\delta=10^{-6}.

The tables showcase our method was able to drastically outperform the standard SDR approach for solving (101). Each of the blue values in Tables 2 and 3 denotes an objective value within 0.09-0.0126 of the global optimal value estimated via grid search. In the 40 numerical tests, the SDR approach was only able to compute a near optimal solution 3 times, while our approach with \delta=10^{-6}, with both approaches (A1.) and (A2.), was able to solve 38 of the 40 tests. Similarly, with \delta=10^{-1} and 10^{-3}, the projection approach was able to compute near optimal solutions in 24 and 35 of the experiments, respectively. Furthermore, our projection approach, (A1.), was able to successfully compute near optimal solutions without the need for randomization. Figure 2 displays an example of the feasible points generated by the different methods.

A few observations can be made from Figure 2. First, the randomized solutions generated from the semidefinite programming approach spread themselves widely across the feasible region, while the spread of the randomized solutions from our approach is highly dependent upon the value of \delta. For the green circles in the far left panel of Figure 2, we see the spread is less than that of the red circles, and we note the sampled distribution of points collapses around roughly four local minima when \delta decreases to 10^{-3} in the middle panel. So, while the randomized points computed via (100) seem to cluster around local minima, the randomized points computed from SDR do not seem to exhibit any such behavior. Second, the projected solutions obtained via (A1.) seem to locate local minima in this example without the need for randomization; they are the magenta circles in Figure 2. This numerical experiment was an interesting example because it was the only instance where our method with the two larger \delta values was unable to compute a global solution while the smallest value was able to do so. The far right panel in Figure 2 displays the results for \delta=10^{-6} on this problem. From the figure we can see that one of the projected solutions located the global optimal value, and furthermore all of the randomized points generated by our method collapsed very tightly around four different local minima, including the global minima.

Overall, the numerical results demonstrate our method presents a novel way of approaching a difficult NP-hard problem and can best the classical and effective SDR approach with randomization. The observations made here also suggest the general behavior of our approach is worth further investigation. Also, it should be noted the procedure presented in this section extends to any problem, not just QCQPs, where rank constraints are removed in order to obtain easier and sometimes convex problems. Hence, our model could greatly impact the way numerous problems are solved.

9 Conclusion

We have set out in this paper to present the first general matrix optimization framework which enables the modeling of both coordinate and spectral constraints. Further, we developed and theoretically guaranteed the performance of an algorithm which computes stationary points for our model. Under standard assumptions, we proved our algorithm converges to an (\epsilon,\epsilon)-approximate first-order KKT point in \mathcal{O}(1/\epsilon^{2}) iterations, and we demonstrated our approach could compute global minima of challenging nonconvex problems. In our numerical experiments, we demonstrated a novel nonconvex relaxation of a quadratically constrained quadratic program was solvable by our method to global optimality and significantly outperformed the state-of-the-art semidefinite relaxation approach with randomization. We also presented a generalized form of semidefinite programming which extends a current essential paradigm.

The results of this paper are exciting. We have extended matrix optimization into a new arena and demonstrated novel nonconvex relaxations are possible which can outperform current frameworks. However, there is still much work to be done on this research program. For starters, though our algorithm performed well in our numerical experiments, the method is not computationally efficient enough to be applied to large-scale problems. The development of more efficient methods for solving (1) is of first importance moving forward, and we believe there are several ways to utilize developments in nonconvex optimization to these ends. Also, investigating theoretical guarantees for approximation methods resulting from our nonconvex relaxations of rank-one constraints would be an intriguing avenue for future work.

References

  • [1] P-A Absil and Jérôme Malick. Projection-like retractions on matrix manifolds. SIAM Journal on Optimization, 22(1):135–158, 2012.
  • [2] Alex B. Gershman, Nicholas D. Sidiropoulos, Shahram Shahbazpanahi, Mats Bengtsson, and Björn Ottersten. Convex optimization-based beamforming: From receive to transmit and network designs. IEEE Signal Processing Magazine, 27(3):62–75, 2010.
  • [3] Fredrik Andersson and Marcus Carlsson. Alternating projections on nontangential manifolds. Constructive approximation, 38(3):489–525, 2013.
  • [4] Roberto Andreani, Kelvin R. Couto, Orizon P. Ferreira, and Gabriel Haeser. Constraint qualifications and strong global convergence properties of an augmented lagrangian method on riemannian manifolds. SIAM Journal on Optimization, 34(2):1799–1825, 2024.
  • [5] MV Balashov and AA Tremba. Error bound conditions and convergence of optimization methods on smooth and proximally smooth manifolds. Optimization, 71(3):711–735, 2022.
  • [6] Ronny Bergmann and Roland Herzog. Intrinsic formulation of kkt conditions and constraint qualifications on smooth manifolds. SIAM Journal on Optimization, 29(4):2423–2444, 2019.
  • [7] Nicolas Boumal. An introduction to optimization on smooth manifolds. Cambridge University Press, 2023.
  • [8] Nicolas Boumal, Pierre-Antoine Absil, and Coralia Cartis. Global Rates of Convergence for Nonconvex Optimization on Manifolds. IMA Journal of Numerical Analysis, 39(1):1–33, 2019.
  • [9] Stephen Boyd, Persi Diaconis, and Lin Xiao. Fastest mixing markov chain on a graph. SIAM review, 46(4):667–689, 2004.
  • [10] Jian-Feng Cai, Emmanuel J. Candès, and Zuowei Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.
  • [11] Emmanuel Candes and Benjamin Recht. Exact matrix completion via convex optimization. Communications of the ACM, 55(6):111–119, 2012.
  • [12] Emmanuel J. Candes and Yaniv Plan. Matrix completion with noise. Proceedings of the IEEE, 98(6):925–936, 2010.
  • [13] Ke Chen. Matrix Preconditioning Techniques and Applications. Cambridge Monographs on Applied and Computational Mathematics. Cambridge University Press, 2005.
  • [14] Zhaoliang Chen and Shiping Wang. A review on matrix completion for recommender systems. Knowledge and Information Systems, 64(1):1–34, 2022.
  • [15] Moody Chu and Gene Golub. Inverse eigenvalue problems: theory, algorithms, and applications. OUP Oxford, 2005.
  • [16] Moody T Chu. Inverse eigenvalue problems. SIAM review, 40(1):1–39, 1998.
  • [17] Nair Maria Maia De Abreu. Old and new results on algebraic connectivity of graphs. Linear algebra and its applications, 423(1):53–73, 2007.
  • [18] Dmitriy Drusvyatskiy and Adrian S Lewis. Inexact alternating projections on nonconvex sets. arXiv preprint arXiv:1811.01298, 2018.
  • [19] Jianqing Fan, Yuan Liao, and Han Liu. An overview of the estimation of large covariance and precision matrices. The Econometrics Journal, 19(1):C1–C32, 2016.
  • [20] Maryam Fazel. Matrix rank minimization with applications. PhD thesis, Stanford University, 2002.
  • [21] Silvia Gandy, Benjamin Recht, and Isao Yamada. Tensor completion and low-n-rank tensor recovery via convex optimization. Inverse problems, 27(2):025010, 2011.
  • [22] Casey Garner, Gilad Lerman, and Shuzhong Zhang. Spectrally constrained optimization. Journal of Scientific Computing, 100(3):89, 2024.
  • [23] Michel X Goemans and David P Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM (JACM), 42(6):1115–1145, 1995.
  • [24] M Seetharama Gowda. Optimizing certain combinations of spectral and linear/distance functions over spectral sets. arXiv preprint arXiv:1902.06640, 2019.
  • [25] Masaru Ito and Bruno F Lourenço. Eigenvalue programming beyond matrices. arXiv preprint arXiv:2311.04637, 2023.
  • [26] Bo Jiang, Shiqian Ma, and Shuzhong Zhang. Tensor principal component analysis via convex optimization. Mathematical Programming, 150(2):423–457, 2015.
  • [27] Yoni Kasten, Amnon Geifman, Meirav Galun, and Ronen Basri. Algebraic characterization of essential matrices and their averaging in multiview settings. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5895–5903, 2019.
  • [28] Zhijian Lai and Akiko Yoshise. Riemannian interior point methods for constrained optimization on manifolds. Journal of Optimization Theory and Applications, 201(1):433–469, 2024.
  • [29] Gilad Lerman and Tyler Maunu. An overview of robust subspace recovery. Proceedings of the IEEE, 106(8):1380–1410, 2018.
  • [30] Gilad Lerman, Michael B McCoy, Joel A Tropp, and Teng Zhang. Robust Computation of Linear Models by Convex Relaxation. Foundations of Computational Mathematics, 15(2):363–410, 2015.
  • [31] Gilad Lerman and Yunpeng Shi. Robust group synchronization via cycle-edge message passing. Foundations of Computational Mathematics, 22(6):1665–1741, 2022.
  • [32] Adrian S Lewis. The mathematics of eigenvalue optimization. Mathematical Programming, 97:155–176, 2003.
  • [33] Adrian S Lewis and Jérôme Malick. Alternating projections on manifolds. Mathematics of Operations Research, 33(1):216–234, 2008.
  • [34] Adrian S Lewis and Michael L Overton. Eigenvalue optimization. Acta numerica, 5:149–190, 1996.
  • [35] Changshuo Liu and Nicolas Boumal. Simple algorithms for optimization on riemannian manifolds with constraints. Applied Mathematics & Optimization, 82:949–981, 2020.
  • [36] Zhi-Quan Luo, Wing-Kin Ma, Anthony Man-Cho So, Yinyu Ye, and Shuzhong Zhang. Semidefinite relaxation of quadratic optimization problems. IEEE Signal Processing Magazine, 27(3):20–34, 2010.
  • [37] Luong Trung Nguyen, Junhan Kim, and Byonghyo Shim. Low-rank matrix completion: A contemporary survey. IEEE Access, 7:94215–94237, 2019.
  • [38] Mitsuaki Obara, Takayuki Okuno, and Akiko Takeda. Sequential quadratic optimization for nonlinear optimization problems on riemannian manifolds. SIAM Journal on Optimization, 32(2):822–853, 2022.
  • [39] James M. Ortega. Numerical Analysis. Society for Industrial and Applied Mathematics, 1990.
  • [40] Anton Schiela and Julian Ortiz. An sqp method for equality constrained optimization on hilbert manifolds. SIAM Journal on Optimization, 31(3):2255–2284, 2021.
  • [41] Yunpeng Shi and Gilad Lerman. Message passing least squares framework and its application to rotation synchronization. In International conference on machine learning, pages 8796–8806. PMLR, 2020.
  • [42] Guangjing Song and Michael K Ng. Fast alternating projections on manifolds based on tangent spaces. arXiv preprint arXiv:2003.10324, 2020.
  • [43] Lieven Vandenberghe and Stephen Boyd. Semidefinite Programming. SIAM review, 38(1):49–95, 1996.
  • [44] Namrata Vaswani, Thierry Bouwmans, Sajid Javed, and Praneeth Narayanamurthy. Robust subspace learning: Robust pca, robust subspace tracking, and robust subspace recovery. IEEE Signal Processing Magazine, 35(4):32–55, 2018.
  • [45] A. J. Wathen. Preconditioning. Acta Numerica, 24:329–376, 2015.
  • [46] Melanie Weber and Suvrit Sra. Riemannian optimization via Frank-Wolfe methods. Mathematical Programming, 199(1):525–556, 2023.
  • [47] Henry Wolkowicz, Romesh Saigal, and Lieven Vandenberghe. Handbook of semidefinite programming: theory, algorithms, and applications, volume 27. Springer Science & Business Media, 2012.
  • [48] Yuya Yamakawa and Hiroyuki Sato. Sequential optimality conditions for nonlinear optimization on riemannian manifolds and a globally convergent augmented lagrangian method. Computational Optimization and Applications, 81(2):397–421, 2022.
  • [49] Wei Hong Yang, Lei-Hong Zhang, and Ruyi Song. Optimality conditions for the nonlinear programming problems on riemannian manifolds. Pacific Journal of Optimization, 10(2):415–434, 2014.
  • [50] Debing Zhang, Yao Hu, Jieping Ye, Xuelong Li, and Xiaofei He. Matrix completion by truncated nuclear norm regularization. In 2012 IEEE Conference on computer vision and pattern recognition, pages 2192–2199. IEEE, 2012.
  • [51] Junyu Zhang, Shiqian Ma, and Shuzhong Zhang. Primal-dual optimization algorithms over riemannian manifolds: an iteration complexity analysis. Mathematical Programming, 184(1-2):445–490, 2020.
  • [52] Junyu Zhang and Shuzhong Zhang. A Cubic Regularized Newton’s Method over Riemannian Manifolds. arXiv preprint arXiv:1805.05565, 2018.
  • [53] Zhihui Zhu, Qiuwei Li, Gongguo Tang, and Michael B Wakin. Global optimality in low-rank matrix optimization. IEEE Transactions on Signal Processing, 66(13):3614–3628, 2018.
  • [54] Hui Zou, Trevor Hastie, and Robert Tibshirani. Sparse principal component analysis. Journal of computational and graphical statistics, 15(2):265–286, 2006.

Appendix A Gradient Lipschitz Properties of the Reformulation

Theorem 17.

Assume F:𝒮n×nF:\mathcal{S}^{n\times n}\rightarrow\mathbb{R} is gradient Lipschitz with parameter LF>0L_{F}>0, i.e.,

F(𝑿1)F(𝑿2)FLF𝑿1𝑿2F,𝑿1,𝑿2𝒮n×n,\|\nabla F(\bm{X}_{1})-\bm{\nabla}F(\bm{X}_{2})\|_{F}\leq L_{F}\|\bm{X}_{1}-\bm{X}_{2}\|_{F},\;\forall\bm{X}_{1},\bm{X}_{2}\in\mathcal{S}^{n\times n},

and define f:𝒪(n,n)×nf:\mathcal{O}(n,n)\times\mathbb{R}^{n}\rightarrow\mathbb{R} using the spectral decomposition of 𝐗𝒮n×n\bm{X}\in\mathcal{S}^{n\times n} such that,

f(𝑸,𝝀):=F(𝑸Diag(𝝀)𝑸).f(\bm{Q},\bm{\lambda}):=F(\bm{Q}\text{\rm Diag}(\bm{\lambda})\bm{Q}^{\top}).

If we restrict the domain of ff to 𝒪(n,n)×𝒞\mathcal{O}(n,n)\times\mathcal{C} where 𝒞\mathcal{C} is a bounded subset of n\mathbb{R}^{n} such that 𝛌D\|\bm{\lambda}\|\leq D for all 𝛌𝒞\bm{\lambda}\in\mathcal{C}, then ff satisfies the Lipschitz conditions in Assumption 2.

Proof.

Let 𝑸=[𝒒1||𝒒n]𝒪(n,n)\bm{Q}=[\bm{q}_{1}\;|\;\ldots\;|\bm{q}_{n}]\in\mathcal{O}(n,n) be fixed. Define the function H:nH:\mathbb{R}^{n}\rightarrow\mathbb{R} such that, H(𝝀;𝑸):=F(𝑸Diag(𝝀)𝑸).H(\bm{\lambda};\bm{Q}):=F(\bm{Q}\text{Diag}(\bm{\lambda})\bm{Q}^{\top}). Then,

H(𝝀;𝑸)=F(𝑸Diag(𝝀)𝑸)=F(i=1nλi𝒒i𝒒i),H(\bm{\lambda};\bm{Q})=F(\bm{Q}\text{Diag}(\bm{\lambda})\bm{Q}^{\top})=F\left(\sum_{i=1}^{n}\lambda_{i}\bm{q}_{i}\bm{q}_{i}^{\top}\right),

and computing the partial derivatives of HH we have,

H(𝝀;𝑸)λi\displaystyle\frac{\partial H(\bm{\lambda};\bm{Q})}{\partial\lambda_{i}} =ddt(F(λ1𝒒1𝒒1++λi1𝒒i1𝒒i1+(λi+t)𝒒i𝒒i++λn𝒒n𝒒n))|t=0\displaystyle=\frac{d}{dt}\left(F\left(\lambda_{1}\bm{q}_{1}\bm{q}_{1}^{\top}+\cdots+\lambda_{i-1}\bm{q}_{i-1}\bm{q}_{i-1}^{\top}+(\lambda_{i}+t)\bm{q}_{i}\bm{q}_{i}^{\top}+\cdots+\lambda_{n}\bm{q}_{n}\bm{q}_{n}^{\top}\right)\right)\bigg{|}_{t=0}
=ddt(F(𝑸Diag(𝝀)𝑸+t𝒒i𝒒i))|t=0\displaystyle=\frac{d}{dt}\left(F\left(\bm{Q}\text{Diag}(\bm{\lambda})\bm{Q}^{\top}+t\bm{q}_{i}\bm{q}_{i}^{\top}\right)\right)\bigg{|}_{t=0}
=F(𝑸Diag(𝝀)𝑸),𝒒i𝒒i.\displaystyle=\langle\nabla F(\bm{Q}\text{Diag}(\bm{\lambda})\bm{Q}^{\top}),\bm{q}_{i}\bm{q}_{i}^{\top}\rangle.

Then,

𝝀H(𝝀1;𝑸)𝝀H(𝝀2;𝑸)2\displaystyle\|\nabla_{\bm{\lambda}}H(\bm{\lambda}_{1};\bm{Q})-\nabla_{\bm{\lambda}}H(\bm{\lambda}_{2};\bm{Q})\|^{2} =i=1n[F(𝑸Diag(𝝀1)𝑸)F(𝑸Diag(𝝀2)𝑸),𝒒i𝒒i]2\displaystyle=\sum_{i=1}^{n}\left[\langle\nabla F(\bm{Q}\text{Diag}(\bm{\lambda}_{1})\bm{Q}^{\top})-\nabla F(\bm{Q}\text{Diag}(\bm{\lambda}_{2})\bm{Q}^{\top}),\bm{q}_{i}\bm{q}_{i}^{\top}\rangle\right]^{2}
i=1nF(𝑸Diag(𝝀1)𝑸)F(𝑸Diag(𝝀2)𝑸)F2𝒒i𝒒iF2\displaystyle\leq\sum_{i=1}^{n}\|\nabla F(\bm{Q}\text{Diag}(\bm{\lambda}_{1})\bm{Q}^{\top})-\nabla F(\bm{Q}\text{Diag}(\bm{\lambda}_{2})\bm{Q}^{\top})\|_{F}^{2}\|\bm{q}_{i}\bm{q}_{i}^{\top}\|_{F}^{2}
i=1nLF2𝑸Diag(𝝀1)𝑸𝑸Diag(𝝀2)𝑸F2\displaystyle\leq\sum_{i=1}^{n}L_{F}^{2}\|\bm{Q}\text{Diag}(\bm{\lambda}_{1})\bm{Q}^{\top}-\bm{Q}\text{Diag}(\bm{\lambda}_{2})\bm{Q}^{\top}\|_{F}^{2}
i=1nLF2𝑸(Diag(𝝀1)Diag(𝝀2))𝑸F2\displaystyle\leq\sum_{i=1}^{n}L_{F}^{2}\|\bm{Q}(\text{Diag}(\bm{\lambda}_{1})-\text{Diag}(\bm{\lambda}_{2}))\bm{Q}^{\top}\|_{F}^{2}
=nLF2𝝀1𝝀22,\displaystyle=nL_{F}^{2}\|\bm{\lambda}_{1}-\bm{\lambda}_{2}\|^{2},

where the second inequality follows from the gradient Lipschitz condition on FF. Thus,

𝝀H(𝝀1;𝑸)𝝀H(𝝀2;𝑸)nLF𝝀1𝝀2,𝝀1,𝝀2n.\|\nabla_{\bm{\lambda}}H(\bm{\lambda}_{1};\bm{Q})-\nabla_{\bm{\lambda}}H(\bm{\lambda}_{2};\bm{Q})\|\leq\sqrt{n}L_{F}\|\bm{\lambda}_{1}-\bm{\lambda}_{2}\|,\;\;\forall\bm{\lambda}_{1},\bm{\lambda}_{2}\in\mathbb{R}^{n}.
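The partial derivative formula for H derived above can be checked numerically. The following sketch is our own illustration using the assumed quadratic test function F(\bm{X})=\frac{1}{2}\|\bm{X}-\bm{M}\|_{F}^{2}; it compares \langle\nabla F(\bm{Q}\text{Diag}(\bm{\lambda})\bm{Q}^{\top}),\bm{q}_{i}\bm{q}_{i}^{\top}\rangle against a finite-difference quotient of H in \lambda_{i}.

```python
import numpy as np

rng = np.random.default_rng(0)
n, i, h = 4, 1, 1e-6

# Assumed test function: F(X) = 0.5 * ||X - M||_F^2, so grad F(X) = X - M.
M = rng.standard_normal((n, n)); M = (M + M.T) / 2
F = lambda X: 0.5 * np.linalg.norm(X - M) ** 2
gradF = lambda X: X - M

Q, _ = np.linalg.qr(rng.standard_normal((n, n)))   # a random orthogonal matrix
lam = rng.standard_normal(n)
H = lambda l: F(Q @ np.diag(l) @ Q.T)              # H(lambda; Q) = F(Q Diag(lambda) Q^T)

analytic = np.sum(gradF(Q @ np.diag(lam) @ Q.T) * np.outer(Q[:, i], Q[:, i]))
lam_h = lam.copy(); lam_h[i] += h
finite_diff = (H(lam_h) - H(lam)) / h
print(analytic, finite_diff)                       # should agree up to O(h)
```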

For fixed 𝝀n\bm{\lambda}\in\mathbb{R}^{n}, define the function G:n×nG:\mathbb{R}^{n\times n}\rightarrow\mathbb{R} such that,

G(𝑼;𝝀)=F(𝑼Diag(𝝀)𝑼).G(\bm{U};\bm{\lambda})=F(\bm{U}\text{Diag}(\bm{\lambda})\bm{U}^{\top}).

We next compute the gradient of GG. Let 𝑿=𝑼Diag(𝝀)𝑼\bm{X}=\bm{U}\text{Diag}(\bm{\lambda})\bm{U}^{\top} and 𝑬n×n\bm{E}\in\mathbb{R}^{n\times n}. Then,

ddt(G(𝑼+t𝑬))|t=0\displaystyle\frac{d}{dt}\left(G(\bm{U}+t\bm{E})\right)\bigg{|}_{t=0} =ddt(F([𝑼+t𝑬]Diag(𝝀)[𝑼+t𝑬]))|t=0\displaystyle=\frac{d}{dt}\left(F(\left[\bm{U}+t\bm{E}\right]\text{Diag}(\bm{\lambda})\left[\bm{U}+t\bm{E}\right]^{\top})\right)\bigg{|}_{t=0}
\displaystyle=\frac{d}{dt}\left(F\left(\bm{X}+t(\bm{E}\text{Diag}(\bm{\lambda})\bm{U}^{\top}+\bm{U}\text{Diag}(\bm{\lambda})\bm{E}^{\top})+\mathcal{O}(t^{2})\right)\right)\bigg{|}_{t=0}
=F(𝑿),𝑬Diag(𝝀)𝑼+𝑼Diag(𝝀)𝑬\displaystyle=\langle\nabla F(\bm{X}),\;\bm{E}\text{Diag}(\bm{\lambda})\bm{U}^{\top}+\bm{U}\text{Diag}(\bm{\lambda})\bm{E}^{\top}\rangle
=F(𝑿)𝑼Diag(𝝀)+F(𝑿)𝑼Diag(𝝀),𝑬\displaystyle=\langle\nabla F(\bm{X})\bm{U}\text{Diag}(\bm{\lambda})+\nabla F(\bm{X})^{\top}\bm{U}\text{Diag}(\bm{\lambda}),\;\bm{E}\rangle
=G(𝑼),𝑬.\displaystyle=\langle\nabla G(\bm{U}),\bm{E}\rangle.

Thus,

G(𝑼)=F(𝑿)𝑼Diag(𝝀)+F(𝑿)𝑼Diag(𝝀).\nabla G(\bm{U})=\nabla F(\bm{X})\bm{U}\text{Diag}(\bm{\lambda})+\nabla F(\bm{X})^{\top}\bm{U}\text{Diag}(\bm{\lambda}).
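Likewise, the expression for \nabla G(\bm{U}) can be verified entrywise with finite differences; the sketch below reuses the same assumed quadratic test function and is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, h = 4, 1e-6

M = rng.standard_normal((n, n)); M = (M + M.T) / 2
F = lambda X: 0.5 * np.linalg.norm(X - M) ** 2     # assumed quadratic test function
gradF = lambda X: X - M

U = rng.standard_normal((n, n))
D = np.diag(rng.standard_normal(n))                # Diag(lambda) with a fixed lambda
G = lambda V: F(V @ D @ V.T)

X = U @ D @ U.T
analytic = gradF(X) @ U @ D + gradF(X).T @ U @ D   # the formula for grad G(U) above

finite_diff = np.zeros((n, n))
for a in range(n):
    for b in range(n):
        E = np.zeros((n, n)); E[a, b] = h
        finite_diff[a, b] = (G(U + E) - G(U)) / h

print(np.max(np.abs(analytic - finite_diff)))      # should be O(h)
```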

We leverage Lemma 2.7 of [8] to prove GG has a gradient Lipschitz retraction over the manifold of orthogonal matrices, i.e., we shall show GG is gradient Lipschitz over the convex hull of 𝒪(n,n)\mathcal{O}(n,n). Let 𝑼,𝑽n×n\bm{U},\bm{V}\in\mathbb{R}^{n\times n} be contained in the convex hull of 𝒪(n,n)\mathcal{O}(n,n), 𝝀n\bm{\lambda}\in\mathbb{R}^{n} be fixed, 𝑿=𝑼Diag(𝝀)𝑼\bm{X}=\bm{U}\text{Diag}(\bm{\lambda})\bm{U}^{\top}, and 𝒀=𝑽Diag(𝝀)𝑽\bm{Y}=\bm{V}\text{Diag}(\bm{\lambda})\bm{V}^{\top}. Then,

G(𝑼)G(𝑽)F\displaystyle\|\nabla G(\bm{U})-\nabla G(\bm{V})\|_{F} =(F(𝑿)+F(𝑿))𝚺𝒙:=𝑼Diag(𝝀)(F(𝒀)+F(𝒀))𝚺Y:=𝑽Diag(𝝀)F\displaystyle=\|\overbrace{\left(\nabla F(\bm{X})+\nabla F(\bm{X})^{\top}\right)}^{\bm{\Sigma_{x}}:=}\bm{U}\text{Diag}(\bm{\lambda})-\overbrace{\left(\nabla F(\bm{Y})+\nabla F(\bm{Y})^{\top}\right)}^{\bm{\Sigma}_{Y}:=}\bm{V}\text{Diag}(\bm{\lambda})\|_{F}
𝚺x𝑼𝚺x𝑽+𝚺x𝑽𝚺y𝑽F𝝀\displaystyle\leq\|\bm{\Sigma}_{x}\bm{U}-\bm{\Sigma}_{x}\bm{V}+\bm{\Sigma}_{x}\bm{V}-\bm{\Sigma}_{y}\bm{V}\|_{F}\cdot\|\bm{\lambda}\|
D𝚺x(𝑼𝑽)+(𝚺x𝚺y)𝑽F\displaystyle\leq D\|\bm{\Sigma}_{x}\left(\bm{U}-\bm{V}\right)+\left(\bm{\Sigma}_{x}-\bm{\Sigma}_{y}\right)\bm{V}\|_{F}
D𝚺xF𝑼𝑽F+D𝑽F𝚺x𝚺yF.\displaystyle\leq D\|\bm{\Sigma}_{x}\|_{F}\|\bm{U}-\bm{V}\|_{F}+D\|\bm{V}\|_{F}\|\bm{\Sigma}_{x}-\bm{\Sigma}_{y}\|_{F}. (102)

Since we are assuming FF is gradient Lipschitz over the set of symmetric matrices, 𝝀\|\bm{\lambda}\| is bounded, and 𝑼\bm{U} and 𝑽\bm{V} are contained in the convex hull of 𝒪(n,n)\mathcal{O}(n,n), which is a bounded subset of n×n\mathbb{R}^{n\times n}, it follows there exists MF>0M_{F}>0 such that,

F(𝑿)FMF,𝑿=𝑼Diag(𝝀)𝑼,\|\nabla F(\bm{X})\|_{F}\leq M_{F},\;\forall\bm{X}=\bm{U}\operatorname{Diag}(\bm{\lambda})\bm{U}^{\top}, (103)

with 𝑼\bm{U} and 𝝀\bm{\lambda} as defined above. Additionally, be virtue of 𝑼\bm{U} being in the convex hull of 𝒪(n,n)\mathcal{O}(n,n) it follows 𝑼\bm{U} is the convex combination of orthogonal matrices 𝑨1,,𝑨p\bm{A}_{1},\ldots,\bm{A}_{p}. From this a simple bound on the norm of 𝑼\bm{U} is achieved. Assume 𝑼=ipαi𝑨i\bm{U}=\sum_{i}^{p}\alpha_{i}\bm{A}_{i} with αi>0\alpha_{i}>0 and i=1pαi=1\sum_{i=1}^{p}\alpha_{i}=1. Then,

𝑼F=ipαi𝑨iFipαi𝑨iFi=1pαin=n.\|\bm{U}\|_{F}=\|\sum_{i}^{p}\alpha_{i}\bm{A}_{i}\|_{F}\leq\sum_{i}^{p}\alpha_{i}\|\bm{A}_{i}\|_{F}\leq\sum_{i=1}^{p}\alpha_{i}\sqrt{n}=\sqrt{n}. (104)

Next, we bound the last term in (102). So,

𝚺x𝚺yF\displaystyle\|\bm{\Sigma}_{x}-\bm{\Sigma}_{y}\|_{F} 2F(𝑿)F(𝒀)F\displaystyle\leq 2\|\nabla F(\bm{X})-\nabla F(\bm{Y})\|_{F}
2LF𝑼Diag(𝝀)𝑼𝑽Diag(𝝀)𝑽F\displaystyle\leq 2L_{F}\|\bm{U}\operatorname{Diag}(\bm{\lambda})\bm{U}^{\top}-\bm{V}\operatorname{Diag}(\bm{\lambda})\bm{V}^{\top}\|_{F}
=2LF𝑼Diag(𝝀)𝑼𝑼Diag(𝝀)𝑽+𝑼Diag(𝝀)𝑽𝑽Diag(𝝀)𝑽F\displaystyle=2L_{F}\|\bm{U}\operatorname{Diag}(\bm{\lambda})\bm{U}^{\top}-\bm{U}\operatorname{Diag}(\bm{\lambda})\bm{V}^{\top}+\bm{U}\operatorname{Diag}(\bm{\lambda})\bm{V}^{\top}-\bm{V}\operatorname{Diag}(\bm{\lambda})\bm{V}^{\top}\|_{F}
=2LF𝑼Diag(𝝀)(𝑼𝑽)+(𝑼𝑽)Diag(𝝀)𝑽F\displaystyle=2L_{F}\|\bm{U}\operatorname{Diag}(\bm{\lambda})\left(\bm{U}-\bm{V}\right)^{\top}+\left(\bm{U}-\bm{V}\right)\operatorname{Diag}(\bm{\lambda})\bm{V}^{\top}\|_{F}
2LF(𝑼Diag(𝝀)F+Diag(𝝀)𝑽F)𝑼𝑽F\displaystyle\leq 2L_{F}\left(\|\bm{U}\operatorname{Diag}(\bm{\lambda})\|_{F}+\|\operatorname{Diag}(\bm{\lambda})\bm{V}^{\top}\|_{F}\right)\|\bm{U}-\bm{V}\|_{F}
4LFDn𝑼𝑽F.\displaystyle\leq 4L_{F}D\sqrt{n}\|\bm{U}-\bm{V}\|_{F}. (105)

Combining (103), (104), and (105) with (102) we obtain,

G(𝑼)G(𝑽)F\displaystyle\|\nabla G(\bm{U})-\nabla G(\bm{V})\|_{F} 2DMF𝑼𝑽F+4LFD2n𝑼𝑽F\displaystyle\leq 2DM_{F}\|\bm{U}-\bm{V}\|_{F}+4L_{F}D^{2}n\|\bm{U}-\bm{V}\|_{F}
=2D(MF+2LFDn)𝑼𝑽F,\displaystyle=2D\left(M_{F}+2L_{F}Dn\right)\|\bm{U}-\bm{V}\|_{F},

and since this holds for any 𝑼\bm{U} and 𝑽\bm{V} in the convex hull of 𝒪(n,n)\mathcal{O}(n,n) it follows GG is gradient Lipschitz over the convex hull of the orthogonal matrices. Therefore, by Lemma 2.7 in [8], GG restricted to 𝒪(n,n)\mathcal{O}(n,n) has a gradient Lipschitz retraction and this holds for all 𝝀𝒞\bm{\lambda}\in\mathcal{C}.

Lastly, let 𝑼,𝑽𝒪(n,n)\bm{U},\bm{V}\in\mathcal{O}(n,n) with 𝑿\bm{X} and 𝒀\bm{Y} defined as before, 𝒖i\bm{u}_{i} and 𝒗i\bm{v}_{i} be the ii-th columns of 𝑼\bm{U} and 𝑽\bm{V} respectively, and 𝝀n\bm{\lambda}\in\mathbb{R}^{n}. Then,

λH(𝝀;𝑼)λH(𝝀;𝑽)2\displaystyle\|\nabla_{\lambda}H(\bm{\lambda};\bm{U})-\nabla_{\lambda}H(\bm{\lambda};\bm{V})\|^{2} =i=1n(F(𝑿),𝒖i𝒖iF(𝒀),𝒗i𝒗i)2\displaystyle=\sum_{i=1}^{n}\left(\langle\nabla F(\bm{X}),\bm{u}_{i}\bm{u}_{i}^{\top}\rangle-\langle\nabla F(\bm{Y}),\bm{v}_{i}\bm{v}_{i}^{\top}\rangle\right)^{2}
=i=1n(F(𝑿)F(𝒀),𝒖i𝒖iF(𝒀),𝒗i𝒗i𝒖i𝒖i)2\displaystyle=\sum_{i=1}^{n}\left(\langle\nabla F(\bm{X})-\nabla F(\bm{Y}),\bm{u}_{i}\bm{u}_{i}^{\top}\rangle-\langle\nabla F(\bm{Y}),\bm{v}_{i}\bm{v}_{i}^{\top}-\bm{u}_{i}\bm{u}_{i}^{\top}\rangle\right)^{2}
i=1n(F(𝑿)F(𝒀)𝒖i𝒖i+F(𝒀)𝒗i𝒗i𝒖i𝒖i)2\displaystyle\leq\sum_{i=1}^{n}\left(\|\nabla F(\bm{X})-\nabla F(\bm{Y})\|\cdot\|\bm{u}_{i}\bm{u}_{i}^{\top}\|+\|\nabla F(\bm{Y})\|\cdot\|\bm{v}_{i}\bm{v}_{i}^{\top}-\bm{u}_{i}\bm{u}_{i}^{\top}\|\right)^{2}
i=1n(2LFDn𝑼𝑽F+MF𝒗i𝒗i𝒖i𝒖i)2,\displaystyle\leq\sum_{i=1}^{n}\left(2L_{F}D\sqrt{n}\|\bm{U}-\bm{V}\|_{F}+M_{F}\|\bm{v}_{i}\bm{v}_{i}^{\top}-\bm{u}_{i}\bm{u}_{i}^{\top}\|\right)^{2}, (106)

where we used the bounds (103) and (105). Focusing on the last term in (106),

𝒗i𝒗i𝒖i𝒖iF\displaystyle\|\bm{v}_{i}\bm{v}_{i}^{\top}-\bm{u}_{i}\bm{u}_{i}^{\top}\|_{F} =𝑽𝒆i𝒆i𝑽𝑼𝒆i𝒆i𝑼F\displaystyle=\|\bm{V}\bm{e}_{i}\bm{e}_{i}^{\top}\bm{V}^{\top}-\bm{U}\bm{e}_{i}\bm{e}_{i}^{\top}\bm{U}^{\top}\|_{F}
=𝑽𝒆i𝒆i𝑽𝑽𝒆i𝒆i𝑼+𝑽𝒆i𝒆i𝑼𝑼𝒆i𝒆i𝑼F\displaystyle=\|\bm{V}\bm{e}_{i}\bm{e}_{i}^{\top}\bm{V}^{\top}-\bm{V}\bm{e}_{i}\bm{e}_{i}^{\top}\bm{U}^{\top}+\bm{V}\bm{e}_{i}\bm{e}_{i}^{\top}\bm{U}^{\top}-\bm{U}\bm{e}_{i}\bm{e}_{i}^{\top}\bm{U}^{\top}\|_{F}
=𝑽𝒆i𝒆i(𝑽𝑼)+(𝑽𝑼)𝒆i𝒆i𝑼F\displaystyle=\|\bm{V}\bm{e}_{i}\bm{e}_{i}^{\top}\left(\bm{V}-\bm{U}\right)^{\top}+\left(\bm{V}-\bm{U}\right)\bm{e}_{i}\bm{e}_{i}^{\top}\bm{U}^{\top}\|_{F}
(𝒆i𝒆iF(𝑽F+𝑼F))𝑼𝑽F\displaystyle\leq\left(\|\bm{e}_{i}\bm{e}_{i}^{\top}\|_{F}\cdot(\|\bm{V}\|_{F}+\|\bm{U}\|_{F})\right)\|\bm{U}-\bm{V}\|_{F}
=2n𝑼𝑽F.\displaystyle=2\sqrt{n}\|\bm{U}-\bm{V}\|_{F}. (107)

Combining (106) and (107),

λH(𝝀;𝑼)λH(𝝀;𝑽)2\displaystyle\|\nabla_{\lambda}H(\bm{\lambda};\bm{U})-\nabla_{\lambda}H(\bm{\lambda};\bm{V})\|^{2} i=1n(2LFDn𝑼𝑽F+2nMF𝑼𝑽F)2,\displaystyle\leq\sum_{i=1}^{n}\left(2L_{F}D\sqrt{n}\|\bm{U}-\bm{V}\|_{F}+2\sqrt{n}M_{F}\|\bm{U}-\bm{V}\|_{F}\right)^{2},
=i=1n(2n(LFD+MF)𝑼𝑽F)2,\displaystyle=\sum_{i=1}^{n}\left(2\sqrt{n}(L_{F}D+M_{F})\|\bm{U}-\bm{V}\|_{F}\right)^{2},
=4n2(LFD+MF)2𝑼𝑽F2.\displaystyle=4n^{2}\left(L_{F}D+M_{F}\right)^{2}\|\bm{U}-\bm{V}\|_{F}^{2}.