General Constrained Matrix Optimization
Abstract
This paper presents and analyzes the first matrix optimization model which allows general coordinate and spectral constraints. The breadth of problems our model covers is exemplified by a lengthy list of examples from the literature, including semidefinite programming, matrix completion, and quadratically constrained quadratic programs (QCQPs), and we demonstrate our model enables completely novel formulations of numerous problems. Our solution methodology leverages matrix factorization and constrained manifold optimization to develop an equivalent reformulation of our general matrix optimization model for which we design a feasible, first-order algorithm. We prove our algorithm converges to -approximate first-order KKT points of our reformulation in iterations. The method we developed applies to a special class of constrained manifold optimization problems and is one of the first to generate a sequence of feasible points converging to a KKT point. We validate our model and method through numerical experimentation. Our first experiment presents a generalized version of semidefinite programming which allows novel eigenvalue constraints, and our second numerical experiment compares our method to the classical semidefinite relaxation approach for solving QCQPs. For the QCQP numerical experiments, we demonstrate our method is able to dominate the classical state-of-the-art approach, solving more than ten times as many problems as the standard solution procedure.
Keywords: spectrally constrained optimization, constrained manifold optimization, quadratically constrained quadratic programs, semidefinite programming, matrix completion, nonconvex optimization, iteration complexity
MSC codes: 90C26, 90C52, 65K10, 68W40
1 Introduction
The ubiquity of matrix optimization is unquestionable and the utility of the models is even more so. From matrix completion [10, 12], various forms of principal component analysis [26, 54], max-cut [23], robust subspace recovery [29, 44], covariance estimation [19], semidefinite programming [43], eigenvalue optimization [32, 34], Markov processes [9], etc., we see the applications of minimizing and maximizing functions whose arguments are matrices subject to various forms of constraints. These constraints come in two varieties: coordinate constraints and spectral constraints. The coordinate constraints are the most readily visible and most prominently leveraged. The spectral constraints which require the eigenvalues and/or singular values to satisfy some set of conditions are present but to a lesser degree. One of the most common forms of spectral constraints is given by enforcing a matrix to be positive definite or semidefinite; however, this is a relatively tame condition when we consider the vast possibilities of restricting the spectrum.
The study of how to solve matrix optimization problems with general spectral constraints is still in its infancy. Our prior work [22] was the first model allowing general linear inequality constraints on the spectrum of a symmetric matrix. Shortly after our first paper, Ito and Lourenço published the second paper on this problem. In [25], they utilized the work of a relatively unknown paper by Gowda [24] to enable solving matrix optimization problems with general convex constraints on the spectrum; however, neither of these works addressed how to solve matrix optimization problems with both general coordinate and spectral constraints.
This paper presents the first model and algorithm which enables general constraints on both the coordinates and spectrum of a matrix. Namely, in these pages we develop and analyze a method to find approximate solutions to models of the form:
(1) | ||||
s.t. | ||||
where can be any one of many sets including: , , and , i.e., real matrices, real symmetric matrices, and real tensors, and maps the matrix or tensor to its ordered vector of singular values, eigenvalues, or generalized singular values. So, the method we develop extends beyond even matrices to tensors, enabling a novel modeling power not yet leveraged in the literature.
Remark 1.1.
It is worth noting the only reason we choose to present (1) with equality constraints only on the coordinates, i.e., , and not an additional set of inequality constraints, e.g., , was for ease of presentation through a reduction of notation and terms. The analysis to come admits additional inequality constraints with minor alterations to the assumptions, and our numerical experiments on quadratically constrained quadratic programs contain inequality constraints, further supporting this.
2 Motivating Examples
The generality of (1) is vast, including multiple important applications and problem classes. To showcase this fact, let us investigate some examples and see how (1) enables new constraints on the spectrum which before had been impossible, while also extending essential models in the literature.
2.1 Generalized Semidefinite Programs
The impact of semidefinite programming (SDP) is nearly impossible to overstate, though it took some time before the salience of the model was truly realized and established. Today, we have seen applications of SDP extend from control theory to economics and most places in-between [47, 43]. The standard form of a semidefinite program can be expressed as
(2) | ||||
s.t. | ||||
where for . By appropriately defining the functions in (1) and letting , we see the standard SDP model is a special instance of our general matrix model, making (1) a generalization of semidefinite programming. We can extend SDPs and define the following generalized SDP model:
(3) | ||||
s.t. | ||||
where is the vector of eigenvalues of listed in descending order, i.e., for . So, letting and , we see (3) becomes (2). Thus, our general matrix model enables entirely new classes of constraints on the eigenvalues not yet attempted, while extending an essential problem class. Clearly, (3) is no longer a convex formulation. For example, a constraint as simple as and with would be nonconvex. In the case when is a Laplacian matrix, the condition relates to limiting the connectivity of a graph which might well be desirable in certain design circumstances. Furthermore, the eigenvectors associated with the second smallest eigenvalue are known as Fiedler vectors and they have important applications to combinatorial optimization problems; see [17] and the references therein.
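To make the connectivity interpretation concrete, the following minimal sketch (assuming numpy; the small adjacency matrix and the bound beta are hypothetical stand-ins, since the exact constraint symbols are not reproduced here) checks whether a graph Laplacian satisfies an upper bound on its second-smallest eigenvalue, the kind of nonconvex spectral constraint permitted in (3).

```python
import numpy as np

def algebraic_connectivity(adjacency):
    """Second-smallest eigenvalue of the graph Laplacian L = D - A."""
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    return np.linalg.eigvalsh(laplacian)[1]  # eigvalsh returns eigenvalues in ascending order

# Hypothetical instance: a 4-node graph and an assumed connectivity bound beta.
adjacency = np.array([[0., 1., 1., 0.],
                      [1., 0., 1., 0.],
                      [1., 1., 0., 1.],
                      [0., 0., 1., 0.]])
beta = 1.0
print(algebraic_connectivity(adjacency) <= beta)
```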
2.2 Matrix Preconditioning
The subject of matrix preconditioning often arises in practical applications [13, 45]. The topic is broached from the desire to find a relatively simple invertible matrix to improve the stability and solvability of various numerical procedures, e.g., solving a system of linear equations or minimizing a quadratic form. One of many preconditioning models is to locate good feasible solutions to
(4) | ||||
s.t. |
where enforces some structure upon the matrix . Assuming , the pre- and post-multiplication by ensures symmetry is maintained. This model relates directly to (1) through the constraint set . For example, it would be reasonable to desire to be a well-conditioned block diagonal matrix. Writing this explicitly, we have the model
(5) | ||||
s.t. | ||||
where and are nonnegative and . Note, the eigenvalue constraint ensures the condition number of is bounded above by and each matrix forming is well-conditioned. This single example demonstrates the versatility of (1) for modeling optimization problems related to matrix preconditioning.
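As a small illustration of the eigenvalue constraint in (5), the sketch below (assuming numpy and scipy; the bounds lo and hi and the block sizes are hypothetical stand-ins for the nonnegative spectral bounds of the model) constructs a block-diagonal preconditioner whose blocks have eigenvalues in [lo, hi] and confirms the resulting condition number is at most hi/lo.

```python
import numpy as np
from scipy.linalg import block_diag

def condition_number(sym_pd_matrix):
    """Condition number of a symmetric positive definite matrix via its eigenvalues."""
    eigvals = np.linalg.eigvalsh(sym_pd_matrix)
    return eigvals[-1] / eigvals[0]

rng = np.random.default_rng(0)
lo, hi = 0.5, 2.0                      # assumed spectral bounds
blocks = []
for size in (3, 4, 5):                 # hypothetical block sizes
    q, _ = np.linalg.qr(rng.standard_normal((size, size)))
    eigs = rng.uniform(lo, hi, size)
    blocks.append(q @ np.diag(eigs) @ q.T)   # symmetric block with spectrum in [lo, hi]
M = block_diag(*blocks)
print(condition_number(M) <= hi / lo + 1e-12)
```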
2.3 Inverse Eigenvalue and Singular Value Problems and Constrained PCA
A large body of literature exists on what are known as inverse eigenvalue and singular value problems [15, 16]. The general concern is to form a matrix from prescribed spectral information. For example, the entire spectrum of the matrix may be provided or some subset of it. Then, the goal is to construct a matrix that has this structure, possibly subject to additional structural constraints. The prevalence of problems such as these is hard to overstate as applications of them include: seismic tomography, remote sensing, geophysics, circuit theory, and more; see the aforementioned works by Chu for further discussion. These problems naturally relate to (1) as it models both coordinate and spectral constraints. For example, a simple instance of an inverse singular value problem is the following projection problem,
(6) | ||||
s.t. | ||||
where , is the vector of singular values of in descending order, and and correspond to coordinate and spectral constraints respectively. So, (6) seeks to project onto general matrix constraints. Hence, if satisfies the constraints it is a solution to the inverse singular value problem described by and , otherwise the model seeks to find the closest matrix to which solves the inverse problem.
The relevance of this problem is further demonstrated by the fact (6) becomes the principal component analysis (PCA) model by the appropriate selection of the constraints. Letting and , the model becomes
s.t. |
whose solution is the first principal component of . However, our model extends beyond this to enable constrained versions of PCA. For example, we have the following nonnegative PCA model,
s.t. | |||
where enforces for all and . We can also move beyond simple rank constraints to enforce far more general constraints on the singular values. For instance, we could constrain the top -singular values to account for between and percent of the total information of the matrix, that is,
(7) | ||||
s.t. | ||||
where we note (7) is a nonconvex model due to the spectral constraint. Thus, (1) enables novel extensions of PCA which have not yet been investigated and could be of much interest for future investigations.
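A minimal feasibility check for the spectral constraint in (7) might look as follows (assuming numpy; the matrix, k, and the bounds alpha and beta are hypothetical, and whether "information" is measured by the singular values or their squares is a modeling assumption).

```python
import numpy as np

def topk_energy_fraction(matrix, k, squared=True):
    """Fraction of total spectral energy captured by the top-k singular values;
    squared singular values correspond to the Frobenius norm."""
    svals = np.linalg.svd(matrix, compute_uv=False)   # descending order
    vals = svals ** 2 if squared else svals
    return vals[:k].sum() / vals.sum()

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 20))
alpha, beta, k = 0.30, 0.60, 3
fraction = topk_energy_fraction(A, k)
print(alpha <= fraction <= beta)   # feasibility of the spectral constraint in (7)
```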
2.4 Quadratically Constrained Quadratic Programs
A problem of immense importance in signal processing and communications is that of solving quadratically constrained quadratic programs (QCQP), which can be formulated as,
(8) | ||||
s.t. | ||||
where are general real symmetric matrices. To highlight one example, let us consider the Boolean quadratic program (BQP),
(9) | ||||
s.t. | ||||
The BQP is known to be an NP-Hard problem and has applications in multiple-input-multiple-output (MIMO) detection and multiuser detection. A well-known and important instance of BQP is the Max-Cut problem, which seeks to locate a maximum cut of a graph, i.e., a partition of the vertices of a graph into two sets and such that the number of edges connecting vertices in to vertices in is maximized.
A powerful approach taken to try and solve (8) is to first reformulate it as the equivalent matrix optimization model,
(10) | ||||
s.t. | ||||
then form and solve a semidefinite program relaxation (SDR) of (10) by dropping the rank-1 condition, and finally apply randomization techniques to find feasible solutions to (8). This approach of convexification plus randomization for QCQPs has received much attention in recent years [36], and numerous results have been published proving the accuracy of SDR for certain QCQPs. For example, the influential 0.878 approximation result for the Max-Cut problem demonstrates the solution to the SDR can sometimes be good [23]; however, dropping the rank-1 condition in (10) is not the only way to relax the problem. Our new general matrix optimization model admits several alternatives which we demonstrate can significantly outperform the state-of-the-art SDR approach. For example, we can approximate the rank-1 condition with different eigenvalue constraints, such as
(11) |
(12) |
where and , for . In [22], we saw relaxations of this form applied to solving systems of quadratic equations were able to best the classical SDR approach. Later in Section 8.2, we demonstrate through a large number of numerical experiments that our general matrix optimization framework using (11) to relax the rank constraints dominates the SDR approach to solving a set of QCQPs.
2.5 Structured Linear Regression
Across the scientific disciplines few mathematical paradigms are utilized more than linear regression. The framework we are proposing can also be applied to perform novel structured linear regression between vector spaces. For example, let be a set of observed data points. Our goal is to locate a constrained linear transform which solves or nearly solves for all . Since all such linear transforms are representable by matrices , this leads to the following general matrix optimization model,
(13) | ||||
s.t. | ||||
where and represent the coordinate and spectral constraints on respectively. For example, say we know the process generating our data set is given by a symmetric tri-diagonal matrix with a particular spectral structure, e.g., , then we can leverage this information to form the structured linear regression model
(14) | ||||
s.t. | ||||
where . This directly relates to the work in robust subspace recovery [29, 30], where our framework enables much greater freedom to leverage spectral information which might be available to a practitioner.
2.6 Matrix Completion
The task of filling-in the missing entries of a matrix, also known as matrix completion, has found immense success in both image recovery [50] and recommender systems [14], among other applications [20]. Formally, the matrix completion problem can be formulated as: given a partially known matrix such that for all are known, recover the unknown entries in . A classic formulation of this problem is the rank minimization problem,
(15) | ||||
s.t. | ||||
Due to the fact (15) is an extremely challenging problem, it is common to replace the rank function with a convex approximation of it to obtain,
(16) | ||||
s.t. | ||||
where is the nuclear norm of . The main benefit of (16) is the fact it is convex, and in the seminal work of Candès and Recht [11] they proved if enough entries of were known then the matrix could be perfectly recovered with high probability. Central to this approach is the assumption the matrix being recovered has a low-rank structure, which very well may not be the case. In [22], our experiments demonstrated if spectral information was known about the matrix being recovered then our methods could nearly double the chance of recovery when compared to solving (16).
Thus, we are motivated to apply (1) to propose the following matrix completion model with spectral constraints,
(17) | ||||
s.t. | ||||
where is derived from the spectral information at our disposal and is an objective function chosen to bias the matrix as dictated by the user. For example, it might be the case that a subset of our known entries is noisy; then it would make sense to let , where
for all , and is the subset of the known entries which are noisy. Hence, in this example we would replace in the constraint with to be the set of known noiseless entries, yielding the noisy matrix completion model with spectral constraints,
(18) | ||||
s.t. | ||||
It is worth noting that learning structural information about the spectrum for particular problems is very possible in practice; see for example Theorem 1 of [27].
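For intuition only, the toy sketch below (assuming numpy; the observation mask and the singular-value band [s_min, s_max] are hypothetical, and this heuristic is not Algorithm 1) alternates between matching the known entries and clipping the singular values, illustrating how coordinate and spectral constraints of the type in (17) interact.

```python
import numpy as np

def project_spectrum(matrix, s_min, s_max):
    """Clip the singular values into [s_min, s_max]; for a band of this form the
    clipping is exactly the nearest-point projection in Frobenius norm."""
    U, svals, Vt = np.linalg.svd(matrix, full_matrices=False)
    return U @ np.diag(np.clip(svals, s_min, s_max)) @ Vt

def toy_completion(M_obs, mask, s_min, s_max, iters=100):
    """Alternate between the spectral band and the known coordinates."""
    X = np.where(mask, M_obs, 0.0)
    for _ in range(iters):
        X = project_spectrum(X, s_min, s_max)
        X = np.where(mask, M_obs, X)   # restore the known entries
    return X

rng = np.random.default_rng(2)
M_true = rng.standard_normal((30, 5)) @ rng.standard_normal((5, 30))
mask = rng.random(M_true.shape) < 0.5                # hypothetical observation pattern
s_max = 1.1 * np.linalg.norm(M_true, 2)              # assumed a priori spectral bound
X_hat = toy_completion(M_true * mask, mask, 0.0, s_max)
print(np.linalg.norm((X_hat - M_true)[mask]))        # known entries are matched exactly
```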
2.7 And More…
The prior examples should hopefully be sufficient to see the breadth of problems modeled and extended by (1). Further examples include: low-rank matrix optimization [53, 37], tensor completion [21], and synchronization problems [31, 41], among others; however, we do not have the space here to elaborate on these important applications.
3 An Equivalent Reformulation
The primary approach we take in this paper to solve (1) is to form a decomposed model by utilizing standard matrix and tensor factorizations. For example, in the case in (1) we utilize the eigenvalue decomposition. It is well known a symmetric matrix has an eigenvalue/spectral decomposition such that where , the matrix manifold of orthogonal matrices, and is the real diagonal matrix formed by the eigenvalues of . Thus, using the eigenvalue decomposition, we can equivalently rewrite
(19) | ||||
s.t. | ||||
as the decomposed model
(20) | ||||
s.t. | ||||
where additionally enforces the condition the elements of are in decreasing order, i.e., for all . Now, it is clear the global minimums of (19) and (20) are equivalent, but there might be some concern that the decomposed model introduces new local minimums which do not correspond to local minimums in the original model; however, we prove this is not the case for local minimizers of (19) which have no repeated eigenvalues. Since a minor perturbation of the constraint in (19) ensures all feasible matrices have unique eigenvalues, our result demonstrates the strong tie between the local minimums of the decomposed model and the original which supports our decomposition approach to solving (1).
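A minimal numerical check of the decomposition underlying (20) (assuming numpy; the matrix is random and purely illustrative) verifies that any symmetric matrix is represented exactly by an orthogonal factor together with its ordered eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6
A = rng.standard_normal((n, n))
X = (A + A.T) / 2                      # a symmetric matrix as in (19)

lam, U = np.linalg.eigh(X)             # eigh returns ascending eigenvalues
lam, U = lam[::-1], U[:, ::-1]         # reorder so the eigenvalues are decreasing, as in (20)

print(np.allclose(U.T @ U, np.eye(n)))            # U lies on the orthogonal manifold
print(np.allclose(U @ np.diag(lam) @ U.T, X))     # the decomposed point represents X exactly
print(np.all(np.diff(lam) <= 1e-12))              # eigenvalues are in decreasing order
```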
Before stating and proving our main result, we first introduce a few definitions and prove a set of lemmas. Let , and denote the Frobenius and Euclidean norms respectively, and given , , and we define
(21) |
(22) |
Lemma 1.
Given , , and a spectral decomposition of such that with , there exists such that
Proof.
Define
Then, for all such that and such that
and the result follows by the definition of . ∎
This lemma is crucial to be able to claim that any local minimum in (19) generates a local minimum in (20). We next move to a lemma which demonstrates an equivalence between different representations of the same symmetric matrix. Note, in Lemma 2, we let denote the vector of all zeros except for a single one in the -th entry.
Lemma 2.
Given with unique eigenvalues, , and , we have for any spectral decomposition of , i.e., with , that
Proof.
Let . Then, , , and
which implies
On the other hand, letting and setting we see
Thus,
∎
This lemma demonstrates if we are in the setting where has only unique eigenvalues then any of the neighborhoods about the different representations of create the same image in the space of symmetric matrices. From this we shall be able to conclude that a local minimum at any one of the representations shall imply a local minimum at all of the representations. Finally, we want to demonstrate that we can generate a ball in the space of symmetric matrices which shall be contained inside the image of the balls over decompositions. Note, in the event the argument is a matrix, denotes the matrix 2-norm.
Lemma 3.
Given with unique eigenvalues, , and the spectral decomposition , there exists such that
Proof.
Define and let
Then, given such that it follows
which implies the eigenvalues of are simple. By Lemma 2, the result shall follow provided for some as defined in Lemma 2. Now, by Theorem 3.3.7 in [39] it follows there exist normalized eigenvectors and of and corresponding to and respectively for such that
(23) |
where . Since and both have simple eigenvalues there exists and such that
with and . Noting we see
where the second inequality followed from (23) and the remaining inequalities were a result of the definition of . ∎
We now combine our lemmas to prove our main result, but first we need to define a local minimum for our models.
Definition 1.
We now present our main theorem for this section.
Theorem 4.
Proof.
Assume is a local minimum of (19). Thus, there exists such that for all . Let be any spectral decomposition of . Then by Lemma 1 there exists such that Therefore, it follows that for all
Since no particular spectral decomposition of was selected, any spectral decomposition of generates a local minimum of (20). To prove the other direction, assume is a local minimum of (20). Then there exists such that for all
which is equivalent to stating for all . By Lemma 3, it follows there exists such that which implies is a local minimum of (19). ∎
Thus, we have demonstrated the local minimums of the two models are identical in the case of local minimizers without repeated eigenvalues. This might seem like a strong restriction, but in reality it is always possible to use the eigenvalue constraints to ensure no repeated eigenvalues occur. Additionally, the continuity of the eigenvalues of a symmetric matrix with respect to perturbations of the coordinates of the matrix ensures that such a restriction only alters the value of the objective function in a boundable fashion, assuming some degree of smoothness in the objective function; see Theorem 4 in [22] for a more substantial discussion on this point.
4 A General Block Manifold Model
The decomposition approach outlined in Section 3 leads us to formulate and study a new optimization model, which generalizes beyond the decomposition approach for symmetric matrices presented in (20). Thus, the central model guiding the remainder of our discussion is the block manifold model:
(24) | ||||
s.t. | ||||
where , , and are differentiable and is a differentiable submanifold in a Euclidean space. Depending on the choice of in (1), the corresponding manifold in (24) will vary, as dictated by the decomposition utilized. For example, if , then we shall apply a singular value decomposition to form the variables and in (24) and shall be a Cartesian product of Stiefel manifolds. In this fashion, (24) can be utilized to model all presented versions of (1). In the remainder of this paper, we shall develop, analyze, and test an algorithm which converges to approximate first-order KKT points of (24) when viewed as a constrained Riemannian optimization problem.
4.1 Riemannian Manifold Optimization Revisited
As optimization over smooth manifolds shall be a major aspect of our discussion, we begin by presenting a brief introduction to the major geometric objects to be leveraged. Our presentation of Riemannian geometry for optimization is restricted to differentiable submanifolds of Euclidean spaces, i.e., vector spaces with a defined inner product, as this covers our examples of interest; however, our later analysis holds for general smooth Riemannian manifolds. The notation and descriptions presented in this section are inspired and drawn heavily from Chapters 3 and 10 of [7]. For readers interested in further details, the authors highly recommend Boumal’s excellent text.
At a high level, a smooth manifold is a generalization of linear spaces in such a way that calculus operations such as differentiation and integration remain well-defined. In our presentation, we let refer to a general finite dimensional vector space. If additionally has a defined inner product, then we call a Euclidean space. The simplified definition of an embedded submanifold of we choose to present is provided by Boumal:
Definition 2.
[Definition 3.6, [7]] Let be a subset of . We say is a (smooth) embedded submanifold of if either of the following holds:
-
1.
is an open subset of . Then we call an open submanifold. If , we also call it a linear manifold.
-
2.
For a fixed integer and for each there exists a neighborhood of in and a smooth function such that
-
(a)
If , then if and only if ; and
-
(b)
rank .
Such a function is called a local defining function of at .
-
(a)
The set is often referred to as the embedding space or the ambient space of .
An experienced student of Riemannian geometry shall note that from and via the inverse function theorem, the charts appearing in the more traditional presentation of general smooth manifolds can be obtained.
Definition 3.
[Definitions 3.7 & 3.10, [7]] Let be an embedded submanifold of . For all , the tangent space of at is defined as the set
(25) |
and elements are called tangent vectors to at . So, if and only if there is a smooth curve on through such that .
In the case of embedded submanifolds of a vector space, a local defining function at a point presents a simple manner to define the set of tangent vectors at .
Theorem 5.
The tangent space of at a point is intended to be a reasonable approximation of the manifold locally. This idea is seen most clearly from the description of the tangent space in Theorem 5 as the kernel of the differential of a local defining function.
We have new spaces over which we are defining functions, namely embedded submanifolds of vector spaces. Thus, it is prudent for us to consider what it means for such functions to be smooth over these manifolds. This is clarified in the following definition.
Definition 4.
[Definition 3.23, [7]] Let and be embedded submanifolds of vector spaces and respectively. A map is smooth at if there exists a function which is smooth (in the usual sense) on a neighborhood of in and such that restricted to is equal to . The function is called a local smooth extension of at . We say is smooth if it is smooth for all .
The set of all tangent spaces over the manifold can be bundled together to form another manifold known as the tangent bundle.
Definition 5.
[Definition 3.35, [7]] The tangent bundle of a manifold is the disjoint union of the tangent spaces of :
(26) |
Retraction operations on manifolds, to be defined shortly, and mapping elements of the manifold to tangent vectors naturally take place over the tangent bundle. A particularly important class of maps are known as vector fields.
Definition 6.
[Definition 3.37, [7]] A vector field on a manifold is a map such that for all . If is a smooth map, we say it is a smooth vector field.
In order to define distances and gradients on a manifold an inner product must be introduced over the tangent spaces of the manifold.
Definition 7.
[Definition 3.43, [7]] An inner product on is a bilinear, symmetric, positive definite function . It induces a norm for tangent vectors: .
A metric on follows from a particular choice of inner product for each . As the inner product over the manifold depends on the point on the manifold where the inner product is taken, an important property for an inner product is that it smoothly varies over the manifold. Such inner products are known as Riemannian metrics. More precisely,
Definition 8.
[Definition 3.44, [7]] A metric on is a Riemannian metric if it varies smoothly with , in the sense that if , , are two smooth vector fields on then the function is smooth from to .
The presence of a Riemannian metric produces a Riemannian manifold and from this metric we can define a notion of distance between connected components, i.e., points which can be connected by a continuous curve on the manifold, on a Riemannian manifold.
Definition 9.
Let be a Riemannian manifold. The length of a piecewise smooth curve, see Definition 10.2, [7], is defined as
The Riemannian distance between is then where the infimum is taken over all piecewise regular curve segments on connecting and .
An interesting by-product of this notion of distance on a Riemannian manifold is that the Riemannian distance in tandem with forms a metric space (Theorem 10.3, [7]). Now, for the case of submanifolds of Euclidean spaces, that is, vector spaces with a defined inner product, e.g., with the standard inner product, a particular choice of inner product is of great importance.
Definition 10.
[Definition 3.47, [7]] Let be an embedded submanifold of a Euclidean space . Equipped with the Riemannian metric obtained by restriction of the metric of , we call a Riemannian submanifold of .
Riemannian submanifolds of Euclidean spaces are of immense practical interest as many applications of optimization over manifolds consist of optimizing over such spaces. For example, the sphere , the Stiefel manifold , the low-rank spectrahedron , and any Cartesian product of these are all Riemannian submanifolds of Euclidean spaces.
In the case of a function which is a smooth map between two vector spaces, the differential of at is a linear operator from to , denoted , defined as
From a smooth curve over , a smooth curve over can be formed as the map . From this idea, we can define the differential of a map between two embedded submanifolds of vector spaces.
Definition 11.
[Definition 3.27, [7]] Let and be two embedded submanifolds of vector spaces and respectively. The differential of at is a linear operator defined by
where is a smooth curve on passing through at with .
With the definition of a differential and Riemannian metrics presented, we can now introduce the definition of the Riemannian gradient of a smooth function over a Riemannian manifold.
Definition 12.
[Definition 3.50, [7]] Let be a smooth function on a Riemannian manifold . The Riemannian gradient of is the vector field uniquely defined by the identities:
where is the differential of as defined above and is the Riemannian metric.
In the event is a Riemannian submanifold of a Euclidean space, then computing the Riemannian gradient of is easily accomplished through a smooth extension of .
Proposition 6.
[Proposition 3.53, [7]] Let be a Riemannian submanifold of endowed with the metric and let be a smooth function. The Riemannian gradient of is given by
(27) |
where is any smooth extension of in a neighborhood of and is the projection operator from to , orthogonal with respect to the inner product over .
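For instance, on the Stiefel manifold with the embedded metric the projection in (27) takes the well-known form Proj_X(G) = G - X (X^T G + G^T X)/2; the sketch below (assuming numpy, with a hypothetical smooth extension f(X) = ||X - B||_F^2 / 2) computes the Riemannian gradient this way and checks it is tangent at X.

```python
import numpy as np

def stiefel_riemannian_grad(X, euclidean_grad):
    """Project an ambient (Euclidean) gradient onto the tangent space of the
    Stiefel manifold at X, yielding the Riemannian gradient as in (27)."""
    sym = (X.T @ euclidean_grad + euclidean_grad.T @ X) / 2
    return euclidean_grad - X @ sym

rng = np.random.default_rng(4)
n, p = 8, 3
X, _ = np.linalg.qr(rng.standard_normal((n, p)))   # a point with X^T X = I_p

B = rng.standard_normal((n, p))                    # hypothetical data for f(X) = 0.5*||X - B||_F^2
riem_grad = stiefel_riemannian_grad(X, X - B)      # the Euclidean gradient of f is X - B

# Tangent vectors V at X satisfy X^T V + V^T X = 0.
print(np.allclose(X.T @ riem_grad + riem_grad.T @ X, 0))
```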
Another crucial operation over manifolds is the retraction.
Definition 13.
[Definition 3.41, [7]] A retraction on is a smooth map with the following properties. For each , let be the restriction of at , such that . Then,
-
1.
for all , and
-
2.
is the identity map: .
Equivalently, each curve satisfies and .
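As a concrete example, one standard retraction on the Stiefel manifold maps X + V to the Q factor of its thin QR factorization; a minimal sketch (assuming numpy; the tangent vector is generated from a random skew-symmetric matrix purely for illustration) is given below.

```python
import numpy as np

def qr_retraction(X, V):
    """QR-based retraction on the Stiefel manifold: the Q factor of X + V with
    the sign convention that the diagonal of R is nonnegative."""
    Q, R = np.linalg.qr(X + V)
    signs = np.sign(np.diag(R))
    signs[signs == 0] = 1.0
    return Q * signs

rng = np.random.default_rng(5)
n, p = 8, 3
X, _ = np.linalg.qr(rng.standard_normal((n, p)))
Omega = rng.standard_normal((p, p))
V = X @ (Omega - Omega.T) / 2                               # a tangent vector at X

print(np.allclose(qr_retraction(X, np.zeros_like(V)), X))   # R_X(0) = X
Y = qr_retraction(X, 0.1 * V)
print(np.allclose(Y.T @ Y, np.eye(p)))                      # the retraction stays on the manifold
```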
From an optimization perspective, a retraction is a way of returning to the manifold if one has moved away from the manifold in the ambient space. A very special retraction available on each manifold is known as the Riemannian exponential map whose retraction curves are geodesics.
From classical results on differential equations, we know for all there exists a unique maximal geodesic on , such that and . The Riemannian exponential mapping is a special retraction which aligns with this unique geodesic.
Definition 14.
[Definition 10.16,[7]] Consider the following subset of the tangent bundle:
The exponential map is defined by
Thus, the exponential map has and . The exponential map is smooth over its domain and is a diffeomorphism onto its image over a potentially restricted subset of its domain. The injectivity radius of a manifold captures the size of the domain on which the exponential map is smoothly invertible.
Definition 15.
[Definitions 10.19 & 10.28 in [7]] The injectivity radius of a Riemannian manifold at a point , denoted by , is the supremum over radii such that is defined and is a diffeomorphism on the open ball
The injectivity radius of a Riemannian manifold is the infimum of over .
The injectivity radius shall become important in the analysis of our algorithm as it shall determine a ball on which we can always ensure a tangent vector exists which moves us between points on the manifold which are sufficiently close together. For compact Riemannian submanifolds, e.g., the Stiefel manifold or products of Stiefel manifolds, the injectivity radius over the manifold is strictly positive. Another useful property over compact Riemannian submanifolds is a Lipschitz-type condition over any defined retraction.
Proposition 7.
For a retraction on a compact Riemannian submanifold of there exists such that
(28) |
for all and , where is the induced norm from the inner product of the embedding space .
This result was proven as an intermediate step of an argument in Appendix B of [8] and was concisely stated as Proposition 2.7 in [52]. As a note, Proposition 7 holds for any valid retraction, though the constant shall change dependent upon the retraction. Lastly, through the use of retractions a natural extension of the traditional gradient Lipschitz condition was proposed by Boumal et al. [8].
Definition 16.
The function has a gradient Lipschitz retraction with constant with respect to the retraction if for all and we have
(29) |
This condition enables many convergence results over to be extended, relatively easily, to Riemannian manifolds. Our analysis makes heavy use of this condition, and, in the case of compact Riemannian submanifolds, Lemma 2.7 of [8] shows this condition follows with minor additional assumptions from the classic gradient Lipschitz condition.
4.2 Additional Definitions and Notation
To facilitate our discussion, some additional notation and conventions must be set in order. For and we denote their component functions as and , where and for and . We let and denote the Euclidean gradients of and with respect to . The Jacobian of is denoted
The feasible region of (24) is defined as
(30) |
We further define the following slices of the feasible region. For all and we define the sets:
(31) |
(32) |
The level set of at the point is given by
We define and shall often let denote elements of . Letting , we will frequently abbreviate as . The set of active inequality constraints at is the set .
4.3 A Constrained Manifold Reformulation
To present a clear formulation of optimality conditions for (24), we recast the model as a nonlinear programming model over a Riemannian manifold. Remembering , we let be the projection onto , i.e., , and be the projection onto , i.e., . Then, with , we define the functions
for and . We then equivalently rewrite (24) as
(33) | ||||
s.t. | ||||
This is a nonlinear program over a Riemannian manifold, and programs of this form have only recently begun to be studied in the literature. Necessary and sufficient conditions for this problem class were derived in 2014 by Yang et al. [49]. Since that paper, other investigations have been undertaken developing theory and algorithms [6, 35, 28, 48, 40, 38]. From these developments, we state the first-order necessary conditions for (33).
Definition 17 (First-Order KKT Points [49]).
Often constraint qualifications are necessary to strengthen results. In this work, we shall assume the linear independence constraint qualification (LICQ) for manifold optimization, which is analogous to the standard property by the same name in Euclidean space.
Definition 18 (LICQ [49]).
The linear independence constraint qualification (LICQ) is said to hold at with and provided forms a linearly independent set on .
Under the LICQ, we obtain the fact (17) is a first-order necessary condition.
Proposition 8 (Theorem 4.1, [49]).
In Section 6 we present our new algorithm, which from the lens of constrained Riemannian optimization can escape stationary points of (24) and converge to first-order KKT points of the equivalent reformulation (33). Since finding exact stationary points and KKT points is in-general an intractable task, we further develop notions of approximate stationarity and KKT points in the upcoming section.
5 Approximate Stationary and KKT Points
From investigating (24) and (33), we recognize there are multiple manners of measuring approximate optimality, and our algorithm development takes advantage of these different approaches. Here we outline different types of approximate solutions and our proposed schemes for measuring them.
Definition 19 (Approximate Stationary Points in ).
Definition 20 (Approximate Stationary Points in ).
Combining these definitions, we obtain our notion of stationary points for (24).
Definition 21 (Approximate Stationary Points of (24)).
Clearly, it is possible a point could be an approximate stationary point of (24) and fail to be a local minimum. So, we consider the stronger constrained Riemannian KKT formulation given in Definition 17. However, in-order to generate algorithms which provide guarantees, we must be able to measure these conditions; we introduce the next result for this purpose.
Proposition 9.
Define the following measures:
(40) |
(41) |
Proof.
The proof of this claim is an exercise in duality with the utilization of the LICQ, i.e., Definition 18. Under this assumption, we see from the form of the product manifold that forms a linearly independent set over ; therefore, forming the dual problem defining (40), we obtain
(43) |
where the last equality follows by strong duality. Thus, provides an exact measure of the approximate stationarity of per Definition 19. In a similar fashion, since the LICQ implies the Mangasarian-Fromovitz constraint qualification (MFCQ) over manifolds [6], we see the dual problem of the convex program defining (42) is
where as before the last equality follows by strong duality due to the constraint qualification. Note, throughout we let . Looking at Definition 17, we then see by setting for all that is equivalent to (35). The argument to prove the equivalence of (41) and (39) is similar and we leave it to the reader. ∎
Proposition 9 provides a practical manner of measuring the optimality of a proposed feasible point without directly computing any dual variables. Now, in order to provide a convergence rate for our proposed method, we require a slightly altered version of the approximate stationary and KKT points. This new definition is only required for the -subproblem and the -subproblem due to the inequality constraints and requires a notion of "almost" active constraints.
Definition 22.
Given a set of inequality constraints for , the set of almost active constraints is Under this definition, we define the following variants of and :
(44) |
(45) |
We then say a point is an -approximate stationary point with respect to and an -approximate first-order KKT point if and respectively. Note, the original definitions of and are retrieved if are set to zero.
There is no fundamental difference between computing (44) and (45) as compared to the prior definitions (41) and (42), though it is possible more conditions must be satisfied. This definition is necessary for us to clearly obtain a convergence rate of our method because it enables us to determine the stepsize we can take in our algorithm so as to not violate our error bound conditions for the inequality constraints.
Remark 5.1.
We shall be computing minimizers of the form represented by (40), (41), and (42) in-order to compute directions to decrease the value of the objective function. It is crucial to recognize all of these models are convex programs with linear objectives and therefore are tractable subproblems which admit numerous efficient solution procedures maintained in standard optimization solvers.
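To illustrate the tractability claim, the schematic below (assuming cvxpy is available; the gradient vector and linearized equality Jacobian are random, hypothetical stand-ins for the quantities defining (40)-(42)) solves a direction-finding subproblem with a linear objective, linearized equality constraints, and a unit-ball restriction.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(6)
dim, n_eq = 10, 3
grad = rng.standard_normal(dim)           # stand-in for the gradient at the current iterate
J_eq = rng.standard_normal((n_eq, dim))   # stand-in for the linearized equality constraints

d = cp.Variable(dim)
subproblem = cp.Problem(cp.Minimize(grad @ d),
                        [J_eq @ d == 0, cp.norm(d, 2) <= 1])
subproblem.solve()
print(subproblem.value, np.linalg.norm(J_eq @ d.value))
```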
Input: Initial point ; positive tolerances, , ; nonnegative tolerances, ; constants, ; base step-length,
Steps:
1: For: do
2:
3: If do
4: Set to be a minimizer which defines
5: While do
6:
7: End While
8: Set and .
9: ElseIf do
10: Set to be a minimizer which defines
11: While do
12:
13: End While
14: Set and .
15: ElseIf do
16: Set to be a minimizer which defines
17: While do
18:
19: End While
20:
21: Else
22: Return as solution.
23: End If
24: End For
6 Algorithm
We now present our algorithm for solving the block manifold optimization model (24) and describe how it proceeds to compute approximate first-order KKT points of the equivalent constrained Riemannian optimization model (33). Our procedure presented in Algorithm 1 has three different updating phases: a -update, -update, and a joint-update.
The first updating phase in the algorithm is the -updating phase. We let be fixed and seek to improve the value of the objective function with respect to by computing a descent direction and then projecting back onto the feasible region. The projection in this step is defined as
(46) |
There are a few reasons we do the -update first. In the event is a convex set for any fixed , then the projection operation is relatively simple because it becomes the projection onto a convex subset of . We shall see important examples, e.g., QCQPs, where this occurs. Second, there is no explicit reference of a manifold in this projection, which reduces the computational difficulty in most cases. Third, given our applications of interest, we have as either the Stiefel manifold or a product of Stiefel manifolds; thus, the dimension of is larger than . So, the memory cost of operating over can be significantly less than over . Finally, given the form of Algorithm 1, placing the -update first means it will likely be undertaken the most, further amplifying the cost savings.
The second phase in the algorithm is the -updating phase. This phase is similar to the -update in the sense that is fixed and we seek to decrease by updating through the computation of a descent direction and performing line-search in that direction with projections back onto the feasible region. The projection in this phase is defined as
(47) |
This phase shall only be entered if , so it is unlikely to be as frequently visited as the -update step. A discussion about this projection operation shall be held in the coming sections.
The third and final phase of the algorithm updates and jointly. This phase is only entered when it appears progress cannot be made by further updating only and sequentially. Similar to the prior two phases, a joint descent direction is computed by treating the original problem as a constrained Riemannian optimization problem over the product manifold . In this phase, a point is projected onto both the functional constraints and the manifold and is defined as
(48) |
This projection operation shall also be discussed in the forthcoming sections.
6.1 A Feasible Constrained Riemannian Optimization Method
By design, Algorithm 1 is a feasible method, i.e., every iterate generated by the procedure satisfies the original constraints. Other methods such as penalty, augmented Lagrangian, sequential quadratic programming, primal-dual algorithms, and interior-point methods for constrained Riemannian optimization cannot return a feasible solution until the algorithm converges to a solution [51, 35, 28, 48, 40, 38, 4]; however, this might never occur or could potentially take a prohibitively long period of time. Our algorithm is one of the few feasible constrained Riemannian optimization methods. Recently, Weber and Sra [46] developed a Frank-Wolfe method for a constrained Riemannian optimization model, but the constraint set they considered requires the strong conditions of being compact and geodesically convex. So, our algorithm, as far as we are aware, presents the first feasible constrained Riemannian optimization method for problems without geodesic convexity.
Developing such an approach comes with natural pros and cons. On the positive side, our staged block descent method can be terminated at any time with a feasible solution in-hand. On the negative side, in-order for our method to maintain feasibility, projections onto and must be computed, which could potentially prove difficult. The next sections describe some potential methods to approximate optimal projections onto and with provable guarantees for our applications of interest.
6.2 Approximate Alternating Projection Approach
The method of alternating projections has a long history and more recently investigations have been made into dealing with alternating projections onto manifolds [3, 33, 42]. Our algorithm asks for the projections onto the sets and , which can be highly nonconvex; this could pose a problem. Recently, however, Drusvyatskiy and Lewis [18] proposed an approximate alternating projection method for finding feasible points of the constraint set
(49) |
where and are -smooth and is a closed subset of Euclidean space. In the case of attempting to project onto or , our constraint sets of interest are of the same form as (49), with either the Stiefel manifold, a product of Stiefel manifolds, or a product of Stiefel manifolds and , which are all closed subsets of Euclidean spaces. Thus, computing unique and exact projections onto for points near is possible; see Proposition 7 in [1] and Appendix A in [5] for a full description of optimal and unique projections onto the Stiefel manifold. This then implies optimal and unique projections onto our product manifolds of interest. Thus, by [18], we have the following result:
Theorem 10.
Furthermore, the authors in [18] propose an inexact projection method onto constraint sets defined by equality and inequality constraints which satisfy the conditions of Theorem 10 under the LICQ. Thus, we can approximate the exact projections in the -updating phase and the joint-updating phase in Algorithm 1 with (50). A potential critique is that the computed point in the intersection might not be the optimal projection onto the feasible region. This is possible; however, for the implementation of our algorithm to be successful only a sufficient improvement of the objective function is necessary and this is certainly obtainable with almost optimal projections. In Section 8, we shall see that we are able to compute global minimums to nonconvex problems using Algorithm 1 with inexact projections.
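A toy version of the alternating scheme described above (assuming numpy; the coordinate constraint fixing a single entry to zero is hypothetical) alternates between the Stiefel manifold, whose nearest-point projection is the orthogonal polar factor, and a simple coordinate constraint set.

```python
import numpy as np

def project_to_stiefel(A):
    """Nearest point on the Stiefel manifold in Frobenius norm (polar factor)."""
    U, _, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ Vt

def fix_entry(X):
    """Hypothetical coordinate constraint: force the (0, 0) entry to zero."""
    Y = X.copy()
    Y[0, 0] = 0.0
    return Y

rng = np.random.default_rng(7)
X, _ = np.linalg.qr(rng.standard_normal((6, 2)))
for _ in range(500):                      # alternate projections between the two sets
    X = project_to_stiefel(fix_entry(X))

print(np.allclose(X.T @ X, np.eye(2)), abs(X[0, 0]))   # on the manifold, entry nearly zero
```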
6.3 A Simple Penalty Approach to Approximate Projections
Another method for approximating and is via a localized penalty approach. For example, let and , with the Euclidean space in which is embedded. Then, we can approximate the projection of onto by solving the manifold optimization model
(51) | ||||
s.t. |
by initializing procedures to solve (51) at , where is a penalty function such that . Ideally, initializing algorithms to solve (51) at points which are nearly feasible, e.g., , shall not only compute feasible points but points which are nearly optimal projections. In testing, we see for small values of , feasible solutions can readily be computed which are nearer to than the original feasible point . This does not guarantee optimal projections but this is sufficient to ensure feasibility and general improvement of the objective function.
7 Convergence Analysis
In this section, we prove under reasonable assumptions Algorithm 1 converges to an -approximate first-order KKT point, as defined in Definition 22, in iterations. We begin our analysis by defining several constants which depend upon the initialization of Algorithm 1. For all we define the following terms:
(52) |
(53) |
(54) |
(55) |
(56) |
(57) |
(58) |
(59) |
(60) |
(61) |
(62) |
(63) |
(64) |
Under the assumptions we shall impose, it can be simply argued the defined constants above are all positive and finite. Each of these constants focuses on being able to bound the first and second-order derivatives of the objective and constraint functions. This is crucial in-order to produce a convergence rate for our algorithm. Note, we shall almost always drop the dependence on when we use these constants in the remainder of the paper to reduce notational clutter.
We now present the assumptions we utilize in our analysis. The assumptions can be divided into four categories: smoothness assumptions on the objective and functional constraints, error-bound conditions on the constraints, constraint qualifications, and manifold conditions.
Assumption 1.
The function is twice continuously differentiable and and are three times continuously differentiable.
Assumption 2.
The functions and satisfy the following Lipschitz conditions:
-
1.
There exists such that for all and in
(65) -
2.
has a gradient Lipschitz retraction with respect to both the exponential map over and the retraction operator utilized in Algorithm 1 with constant for all .
-
3.
There exists such that for all and in
(66) -
4.
There exists such that for all and in we have
(67)
Assumption 3.
The LICQ holds for all iterates generated by Algorithm 1 in the Euclidean and Riemannian senses for and and and respectively.
Assumption 4.
There exist constants , and and , and such that
-
1.
For all such that , independent of , it follows
(68) where and .
-
2.
For all such that , independent of , it follows
(69) where .
-
3.
For all such that , we have that
(70) where .
Assumption 5.
Assumption 6.
The Riemannian distance and the norm inherited from the Euclidean space in which is embedded are locally equivalent, i.e., there exist constants and such that for all , if , then .
It is worth noting that these assumptions are fairly standard. The first assumption asks for sufficient differentiability and is readily satisfied by all of the motivating examples. The Lipschitz conditions in the second assumption can follow from a simple Lipschitz condition on the original problem before decomposing; see Appendix A. Constraint qualifications are a staple in the analysis of optimization algorithms, and error bound conditions are a common occurrence in constrained problems. Finally, Assumption 5 holds for any compact Riemannian submanifold, for example, the Stiefel manifold, and Assumption 6 is expected due to the locally Euclidean nature of manifolds.
7.1 Y-Updating Phase
Lemma 11.
Let . If , then the y-updating phase is entered and shall terminate with
(71) |
where
(72) |
Proof.
Let and assume . Let be a minimizer of the subproblem defining . Then by the gradient Lipschitz condition on with respect to ,
(73) |
From the definition of and the feasibility of it follows
where lies between and . Thus,
We must similarly bound the potential violation of the inequality constraints. We begin by dealing with those constraints such that . Therefore, for some between and
Thus, if , then we know the -th constraint shall not be violated. Next, we bound the error of the -active constraints. Expanding for we have
(74) |
So, assuming and , it then follows by the error bound condition, (68), that
(75) |
Hence, it follows
Thus, assuming it follows
(76) |
Therefore, a sufficient decrease shall be obtained provided
Given the structure of the line-search procedure in the y-updating phase, it follows the computed stepsize shall be lower bounded by
and by combining this lower bound with the sufficient decrease inequality we obtain our result. ∎
It is worth remarking the bound is very pessimistic and follows from the worst possible scenario which might occur given the bounds on the norms of the Hessians and gradients for the objective and constraint functions.
7.2 X-Updating Phase
For our analysis of the x-updating phase, we begin by proving a useful lemma.
Lemma 12.
Assume is a compact Riemannian manifold with positive injectivity radius . Then for any two points with , there exists such that and .
Proof.
Define the function such that By Proposition 10.22 in [7], since it follows . Thus, is a constant function over . Therefore, Since is a neighborhood of , it follows yielding the result. ∎
Lemma 13.
If and the x-updating phase is entered, then the update shall terminate with
(77) |
where
(78) |
Proof.
Let satisfy and and let be a minimizer defining . By the gradient Lipschitz retraction assumption,
(79) |
For , define the functions . By Taylor’s theorem,
for some . Therefore, it follows by the definition of and (59) that
for all , which implies . Thus, by Assumption 4 and (69), for a small enough stepsize , to be specified shortly, the error bound condition applies and
(80) |
Since is either or , we can precisely specify how small the stepsize needs to be so that the error bound condition applies. For example, for the Stiefel manifold there exists a global constant such that
(81) |
for all and for both the polar retraction and the QR retraction [52]. Therefore, if , then the error bound condition (69) shall hold making (80) a valid bound.
Define and its projection onto the constraint set . We now argue for the existence of a direction such that
(82) |
when is sufficiently small. Since is either or for some positive integers and , it follows there is a positive global lower bound on the injectivity radius, , over . From (80), (81), and Assumption 6, if additionally it follows
and by Lemma 12 there exists with such that (82) holds. Furthermore, by Proposition 10.22 in [7], Assumption 6, and (80),
(83) |
Thus, by the gradient Lipschitz retraction condition of with respect to
where the last inequality follows assuming . Hence, if
(84) |
then the while loop in the x-updating phase shall terminate with a guaranteed descent. Lastly, given (84) and the x-updating phase procedure it follows for all where
The result then follows as before via similar argumentation. ∎
7.3 Joint-Updating Phase
We open our analysis of the joint-updating phase of Algorithm 1 by proving , as defined in Section 4.3, has a gradient Lipschitz retraction over .
Lemma 14.
Proof.
Let and . Then,
(85) |
where both inequalities followed by the gradient Lipschitz conditions on . Investigating the third term we see
(86) |
where the mixed Lipschitz condition for was utilized to obtain the second inequality and the third followed by Assumption 5 and Proposition 7. Combining (7.3) and (7.3), we obtain our result. ∎
Remark 7.1.
We now present the improvement guarantee for the joint-updating phase if this part of the algorithm begins with a point which is not an -approximate first-order KKT point of (24). The argument is similar to the analysis of both the and -updating phases.
Lemma 15.
Let . If and the joint-updating phase is entered, then the update shall terminate with
(87) |
where
(88) |
Proof.
Let satisfy , , and . Let be a minimizer defining . By Lemma 14, we know has a gradient Lipschitz retraction. So, using the definition of the retraction on the product manifold we have
(89) |
This demonstrates a first-order improvement on the value of the objective function can be obtained. Next, we demonstrate only a second-order violation occurs in the constraints. Similar to the x-updating phase, for , define the functions . By Taylor’s theorem,
for some . Therefore, it follows by the definition of and (61) that
Since the inequality constraints only depend upon , the prior argument to prove (7.1) stands provided . Then, since our utilized retraction over satisfies (81), if we have
(90) |
and the error bound condition (70) holds. Therefore,
(91) |
Define and . By Lemma 12, Assumption 6, and the fact the exponential map over the product manifold is the Cartesian product of the exponential maps over and , if
then there exists such that
Furthermore, we have
(92) |
Therefore, by the gradient Lipschitz retraction on we have
where the last inequality follows assuming . Thus, a sufficient decrease shall be obtained provided
(93) |
Hence, as in the prior arguments, it follows the stepsize in the joint-updating phase is lower bounded by
(94) |
and the result follows. ∎
7.4 Overall Complexity
Using the prior updating phases, we can derive the overall complexity of Algorithm 1 for locating an approximate first-order KKT point of (24).
Theorem 16.
Proof.
Assume in Algorithm 1. For the sake of obtaining a contradiction, assume additionally the algorithm never terminates. By Lemmas 11, 13, and 15, it follows
for all . For small enough there exist positive constants , and such that
Hence, letting we have for all . Since the algorithm does not terminate, by repeated application of the above inequality we have
So, if , then which violates our assumptions. Therefore, the algorithm must terminate in at most iterations. ∎
8 Numerical Experiments
The purpose of this section is twofold. First, we want to demonstrate the efficacy of the modeling framework given by (1), and, second, we want to exhibit the theory and applicability of Algorithm 1.
8.1 Generalized Semidefinite Programs
To demonstrate the novelty of our modeling framework and the ability of our method to solve general matrix optimization models with nonconvex constraints, we present a set of numerical tests on generalized semidefinite programs. The instance of (3) we considered in our experiments was
(95) | ||||
s.t. | ||||
where we designed the problem parameters such that a feasible point existed. A few observations are notable about (95). First, the objective function is lower bounded by if we look at the second to last inequality constraint in the model. Thus, we have a simple way of measuring if our algorithm is able to compute a global minimizer. Second, the eigenvalue constraint in the model is nonconvex since the left-hand side of each of the inequality constraints is a concave function.
We generated our test problems by randomly constructing a positive definite matrix , setting for , with a randomly generated symmetric matrix, and setting equal to the sum of the smallest eigenvalues of . Hence, is a feasible solution to (95) and because it makes the second to last inequality tight it yields a globally optimal solution.
We conducted ten randomized tests for varying problem sizes of (95). The results are displayed in Table 1. For each randomized test we computed the following values:
-
•
the absolute difference between the objective value of the final iterate generated by Algorithm 1 and the optimal objective value, i.e.,
-
•
the feasibility error of the final iterate computed by Algorithm 1 with respect to the equality constraints, i.e.,
-
•
the feasibility error of the final iterate produced by Algorithm 1 with respect to the inequality constraints, i.e.,
Table 1 also tracks the minimum, median, and maximum values of these three quantities across the ten randomized experiments for each problem size. It also counts the total number of problems solved for each dimension; we consider a problem solved if the objective value computed when the algorithm terminated was within of the optimal value, and the point located had a feasibility error less than in both the equality and inequality constraints.
Dim. | Solved (of 10) | Dist. to Opt. Value [min, med, max] | Feasibility Error: Equality [min, med, max] | Feasibility Error: Inequality [min, med, max]
---|---|---|---|---
n=5 | 9 | [0, 1.48e-11, 1.22] | [2.22e-16, 1.12e-08, 1.78e-07] | [0, 4.86e-7, 9.99e-7] |
n=10 | 8 | [0, 1.31e-08, 3.68] | [7.11e-15, 2.51e-14, 2.57e-07] | [0, 0, 9.53e-7] |
n=25 | 10 | [0, 2.27e-13, 5.89e-8] | [7.69e-14, 1.42e-13, 2.06e-07] | [0, 0, 9.69e-7] |
n=50 | 10 | [0, 1.59e-12, 3.44e-7] | [6.96e-13, 4.40e-09, 1.30e-07] | [0, 4.54e-7, 9.69e-7] |
n=100 | 8 | [1.82e-12, 5.46e-12, 4.72] | [4.33e-12, 3.22e-10, 1.89e-08] | [0, 4.47e-7, 9.93e-7] |
From the table we see that Algorithm 1 was able to find near globally optimal solutions to the vast majority of the problems. Even though (95) is a nonconvex problem, our method successfully solved 45 of the 50 problem instances. Furthermore, our method's performance appears to be independent of the problem size, at least for the subset of generalized SDPs represented by (95). A question worth pursuing in the future is the application of generalized SDPs, especially as an approach for relaxing problems which have classically been rewritten as semidefinite programs.
8.2 Quadratically Constrained Quadratic Programs
An important instance of a QCQP is
(96)
s.t.
where and are positive semidefinite. Though the objective function in (96) is convex, the constraint set is nonconvex since it is the intersection of the exteriors of ellipsoids. This model is well known to be NP-hard and, in general, more difficult than Max-Cut; a comparison of the approximation ratios for the two problems supports this position. Problem (96) has applications in communications through the multicast downlink transmit beamforming problem [2].
A popular and powerful approach to tackling (96) is to relax the problem to a semidefinite program and then apply a randomization procedure to compute feasible solutions. Namely, (96) is rewritten as the equivalent matrix optimization model,
(97)
s.t.
and then relaxed to the semidefinite program
(98)
s.t.
A global minimizer of (98) is then computed and utilized to define Gaussian random variables for . From these random variables, feasible solutions to (96) are formed, , such that
(99)
for . Randomization procedures such as this, combined with semidefinite relaxations, have proven extremely effective at locating approximate solutions to challenging QCQPs; see [36] for a survey of semidefinite relaxation and randomization applied to QCQPs.
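To make the relax-then-randomize pipeline concrete, the following sketch implements it with NumPy and cvxpy. The specific QCQP form (minimize x^T C x subject to x^T A_i x >= 1 for each i), the choice of cvxpy, and all names are assumptions on our part made for illustration; they should be checked against (96)-(99).

```python
import numpy as np
import cvxpy as cp

def sdr_with_randomization(C, A_list, num_samples=20, seed=0):
    """Illustrative SDR plus randomization for an assumed QCQP
    min x^T C x  s.t.  x^T A_i x >= 1, i = 1, ..., m  (cf. [36])."""
    n = C.shape[0]

    # Semidefinite relaxation: drop the rank-1 condition on X = x x^T.
    X = cp.Variable((n, n), PSD=True)
    constraints = [cp.trace(Ai @ X) >= 1 for Ai in A_list]
    cp.Problem(cp.Minimize(cp.trace(C @ X)), constraints).solve()
    X_star = 0.5 * (X.value + X.value.T)  # symmetrize against solver round-off

    # Randomization: sample x ~ N(0, X*) and rescale so that every quadratic
    # inequality constraint holds, keeping the best objective value found.
    rng = np.random.default_rng(seed)
    best_x, best_val = None, np.inf
    for _ in range(num_samples):
        x = rng.multivariate_normal(np.zeros(n), X_star)
        s = min(float(x @ Ai @ x) for Ai in A_list)
        if s <= 0:           # cannot be scaled into feasibility; discard sample
            continue
        x = x / np.sqrt(s)   # now x^T A_i x >= 1 for all i
        val = float(x @ C @ x)
        if val < best_val:
            best_x, best_val = x, val
    return best_x, best_val
```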
The model and algorithm we have developed present new approaches for solving (96) by enabling novel nonconvex relaxations of the rank-1 constraint in the equivalent model (97). Rather than removing the rank condition completely to obtain a convex problem, we instead solve
(100)
s.t.
where the parameter allows us to alter the tightness of our approximation. If , then (100) is equivalent to (97). For small positive values of , the eigenvalue constraint enforces a near rank-1 constraint, as the bottom eigenvalues of are then forced to be within of zero. Once an approximate solution to (100) is computed using Algorithm 1, we form feasible solutions to (96) in two different fashions:
- (A1) Projection Approach: After computing a stationary point of (100), the top eigenvalue and corresponding eigenvector of the stationary point are computed and the associated rank-1 candidate vector is formed. Then, using (99), this vector is scaled so that it satisfies the quadratic inequality constraints in (96), and the scaled vector is used as an approximate solution to (96).
- (A2) Randomization Approach: After computing a stationary point of (100), the stationary point is utilized to generate randomized solutions in the same manner as in the semidefinite relaxation and randomization approach described above. (Illustrative sketches of both approaches, under assumed problem data, follow this list.)
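Under the same assumed problem form as the sketch above, the projection approach (A1) can be written in a few lines; in particular, forming the candidate vector as the top eigenvector scaled by the square root of the top eigenvalue is our reading of the description and is an assumption rather than a restatement of the paper's formula.

```python
import numpy as np

def projection_approach(X_stat, C, A_list):
    """Illustrative version of (A1): rank-1 candidate from a stationary point
    of (100), rescaled to satisfy the assumed constraints x^T A_i x >= 1."""
    # Top eigenpair of the stationary point (eigh returns eigenvalues in ascending order).
    eigvals, eigvecs = np.linalg.eigh(X_stat)
    lam1, v1 = eigvals[-1], eigvecs[:, -1]
    x = np.sqrt(max(lam1, 0.0)) * v1  # assumed rank-1 candidate sqrt(lam_1) * v_1

    # Rescale, as in (99), so that every quadratic inequality constraint holds.
    s = min(float(x @ Ai @ x) for Ai in A_list)
    if s <= 0:
        raise ValueError("rank-1 candidate cannot be scaled into feasibility")
    x = x / np.sqrt(s)
    return x, float(x @ C @ x)
```

Approach (A2) can reuse the randomization loop from the earlier sketch, with the stationary point of (100) taking the place of the SDR solution as the Gaussian covariance.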
To compare our model (100) and the two approximation approaches detailed above with the semidefinite relaxation (SDR) approach with randomization, we conducted forty numerical experiments. In order to check whether our method was able to compute global minima, we conducted our numerical experiments with so that a brute-force grid search could compute a global optimum of (96) in a reasonable amount of time. The form of (96) we considered in our experiments was
(101)
s.t.
where we randomly generated positive definite matrices for . An example of the feasible region and the level sets of the objective can be seen in Figure 1. In our experiments, we conducted ten randomized tests for , and , and solved (100) with , and for all settings. When generating points via randomization, we sampled 20 points from the Gaussian distributions. Since our model is nonconvex, we can benefit from initializing our algorithm multiple times and re-solving, whereas the SDR approach cannot. Thus, for these tests, we solved each instance of (100) three times. One initialization of our algorithm was formed by taking the solution to (98) and projecting it onto the constraint set in (100); the other two came from randomly generating a symmetric matrix and projecting it onto the constraint set. Tables 2 and 3 display the numerical results of the experiments.
(In the table below, the two SDR columns report the SDR approach, and the three Orig./Random/Project column groups under "Ours" correspond to the three values of with which (100) was solved.)

m | Test | Optimal | SDR Orig. | SDR Random | Ours Orig. | Ours Random | Ours Project | Ours Orig. | Ours Random | Ours Project | Ours Orig. | Ours Random | Ours Project
---|---|---|---|---|---|---|---|---|---|---|---|---|---
5 | 1 | 1.10 | 1.10 | 1.10 | 1.10 | 1.10 | 1.10 | 1.10 | 1.10 | 1.10 | 1.10 | 1.10 | 1.10 |
2 | 1.49 | 1.49 | 1.49 | 1.49 | 1.49 | 1.49 | 1.49 | 1.49 | 1.49 | 1.49 | 1.49 | 1.49 | |
3 | 2.86 | 1.91 | 3.17 | 2.74 | 2.94 | 2.87 | 2.85 | 2.86 | 2.85 | 2.85 | 2.85 | 2.85 | |
4 | 2.33 | 1.77 | 2.41 | 2.25 | 2.33 | 2.34 | 2.33 | 2.33 | 2.33 | 2.33 | 2.33 | 2.33 | |
5 | 1.67 | 1.66 | 1.83 | 1.67 | 1.68 | 1.68 | 1.67 | 1.67 | 1.67 | 1.67 | 1.67 | 1.67 | |
6 | 2.26 | 1.76 | 2.59 | 2.20 | 2.27 | 2.28 | 2.26 | 2.26 | 2.26 | 2.26 | 2.26 | 2.26 | |
7 | 1.91 | 1.67 | 1.94 | 1.87 | 1.91 | 1.92 | 1.91 | 1.91 | 1.91 | 1.91 | 1.91 | 1.91 | |
8 | 1.13 | 1.13 | 1.13 | 1.13 | 1.13 | 1.13 | 1.13 | 1.13 | 1.13 | 1.13 | 1.13 | 1.13 | |
9 | 2.57 | 1.93 | 2.59 | 2.49 | 2.64 | 2.58 | 2.57 | 2.57 | 2.57 | 2.57 | 2.57 | 2.57 | |
10 | 2.18 | 1.84 | 2.35 | 2.14 | 2.19 | 2.19 | 2.17 | 2.18 | 2.17 | 2.17 | 2.17 | 2.17 | |
10 | 1 | 5.41 | 1.95 | 9.05 | 5.19 | 5.68 | 5.57 | 5.56 | 5.57 | 5.56 | 5.56 | 5.56 | 5.56 |
2 | 3.91 | 1.89 | 5.05 | 3.68 | 4.41 | 3.92 | 3.91 | 3.92 | 3.91 | 3.91 | 3.91 | 3.91 | |
3 | 2.37 | 1.88 | 2.60 | 2.31 | 2.42 | 2.36 | 2.36 | 2.36 | 2.36 | 2.36 | 2.36 | 2.36 | |
4 | 1.78 | 1.72 | 2.11 | 1.77 | 1.79 | 1.80 | 1.78 | 1.78 | 1.78 | 1.78 | 1.78 | 1.78 | |
5 | 2.98 | 1.92 | 3.18 | 2.87 | 3.01 | 2.98 | 2.98 | 2.98 | 2.98 | 2.98 | 2.98 | 2.98 | |
6 | 4.49 | 1.89 | 5.34 | 4.21 | 4.59 | 4.49 | 4.95 | 4.97 | 4.95 | 4.48 | 4.48 | 4.48 | |
7 | 3.45 | 1.87 | 3.74 | 3.26 | 3.49 | 3.47 | 3.45 | 3.45 | 3.45 | 3.45 | 3.45 | 3.45 | |
8 | 3.85 | 1.89 | 5.79 | 3.64 | 3.99 | 3.85 | 3.85 | 3.85 | 3.85 | 3.85 | 3.85 | 3.85 | |
9 | 3.06 | 1.88 | 3.27 | 2.90 | 3.09 | 3.06 | 3.05 | 3.06 | 3.05 | 3.05 | 3.05 | 3.05 | |
10 | 2.97 | 1.94 | 3.00 | 2.87 | 2.98 | 2.97 | 2.97 | 2.98 | 2.97 | 2.97 | 2.97 | 2.97 |
(Column layout as in the previous table.)

m | Test | Optimal | SDR Orig. | SDR Random | Ours Orig. | Ours Random | Ours Project | Ours Orig. | Ours Random | Ours Project | Ours Orig. | Ours Random | Ours Project
---|---|---|---|---|---|---|---|---|---|---|---|---|---
25 | 1 | 6.05 | 1.91 | 6.23 | 5.58 | 6.07 | 6.06 | 6.04 | 6.05 | 6.05 | 6.05 | 6.05 | 6.05 |
2 | 4.84 | 1.96 | 6.06 | 4.53 | 4.92 | 4.85 | 4.83 | 4.84 | 4.83 | 4.83 | 4.83 | 4.83 | |
3 | 4.14 | 1.95 | 4.29 | 3.89 | 4.24 | 4.16 | 4.14 | 4.15 | 4.14 | 4.14 | 4.14 | 4.14 | |
4 | 7.67 | 1.94 | 9.62 | 7.03 | 7.84 | 7.68 | 7.66 | 7.67 | 7.67 | 7.67 | 7.67 | 7.67 | |
5 | 10.06 | 1.96 | 13.55 | 9.20 | 10.08 | 10.07 | 10.05 | 10.09 | 10.06 | 10.06 | 10.06 | 10.06 | |
6 | 3.53 | 1.91 | 4.32 | 3.35 | 3.53 | 3.53 | 3.53 | 3.53 | 3.53 | 3.53 | 3.53 | 3.53 | |
7 | 8.07 | 1.96 | 9.13 | 7.43 | 8.21 | 8.08 | 8.06 | 8.08 | 8.07 | 8.07 | 8.07 | 8.07 | |
8 | 4.70 | 1.95 | 7.12 | 4.39 | 4.72 | 4.71 | 4.69 | 4.70 | 4.70 | 4.70 | 4.70 | 4.70 | |
9 | 7.69 | 1.97 | 7.76 | 7.09 | 7.83 | 7.70 | 7.68 | 7.69 | 7.69 | 7.69 | 7.69 | 7.69 | |
10 | 4.91 | 1.93 | 6.83 | 4.59 | 4.95 | 4.91 | 4.90 | 4.92 | 4.91 | 4.91 | 4.91 | 4.91 | |
50 | 1 | 5.31 | 1.95 | 5.67 | 4.94 | 5.34 | 5.33 | 5.31 | 5.31 | 5.31 | 5.31 | 5.31 | 5.31 |
2 | 9.35 | 1.97 | 10.14 | 8.54 | 9.40 | 9.35 | 9.34 | 9.35 | 9.35 | 9.35 | 9.35 | 9.35 | |
3 | 7.25 | 1.95 | 7.34 | 6.66 | 7.39 | 7.27 | 7.24 | 7.25 | 7.25 | 7.25 | 7.25 | 7.25 | |
4 | 9.49 | 1.96 | 11.01 | 8.66 | 9.54 | 9.49 | 9.47 | 9.50 | 9.48 | 9.48 | 9.48 | 9.48 | |
5 | 9.13 | 1.95 | 10.44 | 8.33 | 9.20 | 9.13 | 9.12 | 9.13 | 9.12 | 9.12 | 9.12 | 9.12 | |
6 | 7.69 | 1.96 | 8.07 | 7.03 | 7.70 | 7.69 | 7.68 | 7.69 | 7.69 | 7.69 | 7.69 | 7.69 | |
7 | 14.93 | 1.97 | 17.22 | 13.58 | 16.66 | 14.94 | 17.17 | 17.19 | 17.18 | 14.92 | 14.92 | 14.92 | |
8 | 11.26 | 1.95 | 13.43 | 11.36 | 12.57 | 12.50 | 12.48 | 12.50 | 12.49 | 11.26 | 11.26 | 11.26 | |
9 | 5.82 | 1.96 | 6.39 | 5.40 | 5.91 | 5.84 | 5.82 | 5.83 | 5.82 | 5.82 | 5.82 | 5.82 | |
10 | 11.79 | 1.97 | 12.47 | 10.90 | 12.10 | 11.98 | 11.95 | 11.97 | 11.96 | 11.96 | 11.96 | 11.96 |
[Figure 2: an example of the feasible points generated by the different methods.]
The tables showcase that our method was able to drastically outperform the standard SDR approach for solving (101). Each of the blue values in Tables 2 and 3 denotes an objective value within of the global optimal value estimated via grid search. In the 40 numerical tests, the SDR approach was only able to compute a near optimal solution 3 times, while our approach with , using both (A1) and (A2), was able to solve 38 of the 40 tests. Similarly, with and , the projection approach was able to compute near optimal solutions in 24 and 35 of the experiments, respectively. Furthermore, our projection approach, (A1), was able to compute near optimal solutions without the need for randomization. Figure 2 displays an example of the feasible points generated by the different methods.
A few observations can be made from Figure 2. First, the randomized solutions generated from the semidefinite programming approach spread themselves widely across the feasible region, while the spread of the randomized solutions from our approach is highly dependent upon the value of . For the green circles in the far left panel of Figure 2, the spread is less than that of the red circles, and the sampled points collapse around roughly four local minima when decreases to in the middle panel. So, while the randomized points computed via (100) cluster around local minima, the randomized points computed from SDR do not exhibit any such behavior. Second, the projected solutions obtained via (A1) locate local minima in this example without the need for randomization; they are the magenta circles in Figure 2. This numerical experiment is an interesting example because it was the only instance where our method with the two larger values was unable to compute a global solution while the smallest value was able to do so. The far right panel in Figure 2 displays the results for on this problem. From the figure we can see that one of the projected solutions located the global optimal value, and, furthermore, all of the randomized points generated by our method collapsed very tightly around four different local minima, including the global minima.
Overall, the numerical results demonstrate that our method presents a novel way of approaching a difficult NP-hard problem and can best the classical and effective SDR approach with randomization. The general behavior of our approach merits further investigation in light of the observations stated here. Also, it should be noted that the procedure presented in this section extends to any problem, not just QCQPs, where rank constraints are removed in order to obtain easier and sometimes convex problems. Hence, our model could greatly impact the way numerous problems are solved.
9 Conclusion
We set out in this paper to present the first general matrix optimization framework which enables the modeling of both coordinate and spectral constraints. Further, we developed and theoretically guaranteed the performance of an algorithm which computes stationary points for our model. Under standard assumptions, we proved our algorithm converges to an -approximate first-order KKT point in iterations, and we demonstrated our approach can compute global minima of challenging nonconvex problems. In our numerical experiments, we demonstrated that a novel nonconvex relaxation of a quadratically constrained quadratic program was solvable by our method to global optimality and significantly outperformed the state-of-the-art semidefinite relaxation approach with randomization. We also presented a generalized form of semidefinite programming which extends a current essential paradigm.
The results of this paper are exciting. We have extended matrix optimization into a new arena and demonstrated novel nonconvex relaxations are possible which can outperform current frameworks. However, there is still much work to be done on this research program. For starters, though our algorithm performed well in our numerical experiments, the method is not computationally efficient enough to be applied to large-scale problems. The development of more efficient methods for solving (1) is of first importance moving forward, and we believe there are several ways to utilize developments in nonconvex optimization to these ends. Also, investigating theoretical guarantees for approximation methods resulting from our nonconvex relaxations of rank-one constraints would be an intriguing avenue for future work.
References
- [1] P-A Absil and Jérôme Malick. Projection-like retractions on matrix manifolds. SIAM Journal on Optimization, 22(1):135–158, 2012.
- [2] Alex B. Gershman, Nicholas D. Sidiropoulos, Shahram Shahbazpanahi, Mats Bengtsson, and Björn Ottersten. Convex optimization-based beamforming: From receive to transmit and network designs. IEEE Signal Processing Magazine, 27(3):62–75, 2010.
- [3] Fredrik Andersson and Marcus Carlsson. Alternating projections on nontangential manifolds. Constructive approximation, 38(3):489–525, 2013.
- [4] Roberto Andreani, Kelvin R. Couto, Orizon P. Ferreira, and Gabriel Haeser. Constraint qualifications and strong global convergence properties of an augmented Lagrangian method on Riemannian manifolds. SIAM Journal on Optimization, 34(2):1799–1825, 2024.
- [5] MV Balashov and AA Tremba. Error bound conditions and convergence of optimization methods on smooth and proximally smooth manifolds. Optimization, 71(3):711–735, 2022.
- [6] Ronny Bergmann and Roland Herzog. Intrinsic formulation of KKT conditions and constraint qualifications on smooth manifolds. SIAM Journal on Optimization, 29(4):2423–2444, 2019.
- [7] Nicolas Boumal. An introduction to optimization on smooth manifolds. Cambridge University Press, 2023.
- [8] Nicolas Boumal, Pierre-Antoine Absil, and Coralia Cartis. Global Rates of Convergence for Nonconvex Optimization on Manifolds. IMA Journal of Numerical Analysis, 39(1):1–33, 2019.
- [9] Stephen Boyd, Persi Diaconis, and Lin Xiao. Fastest mixing Markov chain on a graph. SIAM review, 46(4):667–689, 2004.
- [10] Jian-Feng Cai, Emmanuel J. Candès, and Zuowei Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.
- [11] Emmanuel Candes and Benjamin Recht. Exact matrix completion via convex optimization. Communications of the ACM, 55(6):111–119, 2012.
- [12] Emmanuel J. Candes and Yaniv Plan. Matrix completion with noise. Proceedings of the IEEE, 98(6):925–936, 2010.
- [13] Ke Chen. Matrix Preconditioning Techniques and Applications. Cambridge Monographs on Applied and Computational Mathematics. Cambridge University Press, 2005.
- [14] Zhaoliang Chen and Shiping Wang. A review on matrix completion for recommender systems. Knowledge and Information Systems, 64(1):1–34, 2022.
- [15] Moody Chu and Gene Golub. Inverse eigenvalue problems: theory, algorithms, and applications. OUP Oxford, 2005.
- [16] Moody T Chu. Inverse eigenvalue problems. SIAM review, 40(1):1–39, 1998.
- [17] Nair Maria Maia De Abreu. Old and new results on algebraic connectivity of graphs. Linear algebra and its applications, 423(1):53–73, 2007.
- [18] Dmitriy Drusvyatskiy and Adrian S Lewis. Inexact alternating projections on nonconvex sets. arXiv preprint arXiv:1811.01298, 2018.
- [19] Jianqing Fan, Yuan Liao, and Han Liu. An overview of the estimation of large covariance and precision matrices. The Econometrics Journal, 19(1):C1–C32, 2016.
- [20] Maryam Fazel. Matrix rank minimization with applications. PhD thesis, Stanford University, 2002.
- [21] Silvia Gandy, Benjamin Recht, and Isao Yamada. Tensor completion and low-n-rank tensor recovery via convex optimization. Inverse problems, 27(2):025010, 2011.
- [22] Casey Garner, Gilad Lerman, and Shuzhong Zhang. Spectrally constrained optimization. Journal of Scientific Computing, 100(3):89, 2024.
- [23] Michel X Goemans and David P Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM (JACM), 42(6):1115–1145, 1995.
- [24] M Seetharama Gowda. Optimizing certain combinations of spectral and linear distance functions over spectral sets. arXiv preprint arXiv:1902.06640, 2019.
- [25] Masaru Ito and Bruno F Lourenço. Eigenvalue programming beyond matrices. arXiv preprint arXiv:2311.04637, 2023.
- [26] Bo Jiang, Shiqian Ma, and Shuzhong Zhang. Tensor principal component analysis via convex optimization. Mathematical Programming, 150(2):423–457, 2015.
- [27] Yoni Kasten, Amnon Geifman, Meirav Galun, and Ronen Basri. Algebraic characterization of essential matrices and their averaging in multiview settings. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5895–5903, 2019.
- [28] Zhijian Lai and Akiko Yoshise. Riemannian interior point methods for constrained optimization on manifolds. Journal of Optimization Theory and Applications, 201(1):433–469, 2024.
- [29] Gilad Lerman and Tyler Maunu. An overview of robust subspace recovery. Proceedings of the IEEE, 106(8):1380–1410, 2018.
- [30] Gilad Lerman, Michael B McCoy, Joel A Tropp, and Teng Zhang. Robust Computation of Linear Models by Convex Relaxation. Foundations of Computational Mathematics, 15(2):363–410, 2015.
- [31] Gilad Lerman and Yunpeng Shi. Robust group synchronization via cycle-edge message passing. Foundations of Computational Mathematics, 22(6):1665–1741, 2022.
- [32] Adrian S Lewis. The mathematics of eigenvalue optimization. Mathematical Programming, 97:155–176, 2003.
- [33] Adrian S Lewis and Jérôme Malick. Alternating projections on manifolds. Mathematics of Operations Research, 33(1):216–234, 2008.
- [34] Adrian S Lewis and Michael L Overton. Eigenvalue optimization. Acta numerica, 5:149–190, 1996.
- [35] Changshuo Liu and Nicolas Boumal. Simple algorithms for optimization on Riemannian manifolds with constraints. Applied Mathematics & Optimization, 82:949–981, 2020.
- [36] Zhi-Quan Luo, Wing-Kin Ma, Anthony Man-Cho So, Yinyu Ye, and Shuzhong Zhang. Semidefinite relaxation of quadratic optimization problems. IEEE Signal Processing Magazine, 27(3):20–34, 2010.
- [37] Luong Trung Nguyen, Junhan Kim, and Byonghyo Shim. Low-rank matrix completion: A contemporary survey. IEEE Access, 7:94215–94237, 2019.
- [38] Mitsuaki Obara, Takayuki Okuno, and Akiko Takeda. Sequential quadratic optimization for nonlinear optimization problems on Riemannian manifolds. SIAM Journal on Optimization, 32(2):822–853, 2022.
- [39] James M. Ortega. Numerical Analysis. Society for Industrial and Applied Mathematics, 1990.
- [40] Anton Schiela and Julian Ortiz. An SQP method for equality constrained optimization on Hilbert manifolds. SIAM Journal on Optimization, 31(3):2255–2284, 2021.
- [41] Yunpeng Shi and Gilad Lerman. Message passing least squares framework and its application to rotation synchronization. In International conference on machine learning, pages 8796–8806. PMLR, 2020.
- [42] Guangjing Song and Michael K Ng. Fast alternating projections on manifolds based on tangent spaces. arXiv preprint arXiv:2003.10324, 2020.
- [43] Lieven Vandenberghe and Stephen Boyd. Semidefinite Programming. SIAM review, 38(1):49–95, 1996.
- [44] Namrata Vaswani, Thierry Bouwmans, Sajid Javed, and Praneeth Narayanamurthy. Robust subspace learning: Robust PCA, robust subspace tracking, and robust subspace recovery. IEEE Signal Processing Magazine, 35(4):32–55, 2018.
- [45] A. J. Wathen. Preconditioning. Acta Numerica, 24:329–376, 2015.
- [46] Melanie Weber and Suvrit Sra. Riemannian optimization via Frank-Wolfe methods. Mathematical Programming, 199(1):525–556, 2023.
- [47] Henry Wolkowicz, Romesh Saigal, and Lieven Vandenberghe. Handbook of semidefinite programming: theory, algorithms, and applications, volume 27. Springer Science & Business Media, 2012.
- [48] Yuya Yamakawa and Hiroyuki Sato. Sequential optimality conditions for nonlinear optimization on Riemannian manifolds and a globally convergent augmented Lagrangian method. Computational Optimization and Applications, 81(2):397–421, 2022.
- [49] Wei Hong Yang, Lei-Hong Zhang, and Ruyi Song. Optimality conditions for the nonlinear programming problems on Riemannian manifolds. Pacific Journal of Optimization, 10(2):415–434, 2014.
- [50] Debing Zhang, Yao Hu, Jieping Ye, Xuelong Li, and Xiaofei He. Matrix completion by truncated nuclear norm regularization. In 2012 IEEE Conference on computer vision and pattern recognition, pages 2192–2199. IEEE, 2012.
- [51] Junyu Zhang, Shiqian Ma, and Shuzhong Zhang. Primal-dual optimization algorithms over Riemannian manifolds: an iteration complexity analysis. Mathematical Programming, 184(1-2):445–490, 2020.
- [52] Junyu Zhang and Shuzhong Zhang. A Cubic Regularized Newton’s Method over Riemannian Manifolds. arXiv preprint arXiv:1805.05565, 2018.
- [53] Zhihui Zhu, Qiuwei Li, Gongguo Tang, and Michael B Wakin. Global optimality in low-rank matrix optimization. IEEE Transactions on Signal Processing, 66(13):3614–3628, 2018.
- [54] Hui Zou, Trevor Hastie, and Robert Tibshirani. Sparse principal component analysis. Journal of computational and graphical statistics, 15(2):265–286, 2006.
Appendix A Gradient Lipschitz Properties of the Reformulation
Theorem 17.
Assume is gradient Lipschitz with parameter , i.e.,
and define using the spectral decomposition of such that,
If we restrict the domain of to where is a bounded subset of such that for all , then satisfies the Lipschitz conditions in Assumption 2.
Proof.
Let be fixed. Define the function such that, Then,
and computing the partial derivatives of we have,
Then,
where the second inequality follows from the gradient Lipschitz condition on . Thus,
For fixed , define the function such that,
We next compute the gradient of . Let and . Then,
Thus,
We leverage Lemma 2.7 of [8] to prove has a gradient Lipschitz retraction over the manifold of orthogonal matrices, i.e., we shall show is gradient Lipschitz over the convex hull of . Let be contained in the convex hull of , be fixed, , and . Then,
(102)
Since we are assuming is gradient Lipschitz over the set of symmetric matrices, is bounded, and and are contained in the convex hull of , which is a bounded subset of , it follows that there exists such that,
(103)
with and as defined above. Additionally, by virtue of being in the convex hull of , it follows that is a convex combination of orthogonal matrices . From this, a simple bound on the norm of is obtained. Assume with and . Then,
(104)
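In placeholder notation of our own (with $U = \sum_i a_i Q_i$, $a_i \ge 0$, $\sum_i a_i = 1$, and each $Q_i$ an $n \times n$ orthogonal matrix), a bound of this type follows from the triangle inequality:
\[
\|U\|_F = \Big\|\sum_i a_i Q_i\Big\|_F \le \sum_i a_i \|Q_i\|_F = \sqrt{n},
\]
since $\|Q_i\|_F = \sqrt{\operatorname{trace}(Q_i^\top Q_i)} = \sqrt{n}$ for every orthogonal $Q_i$.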
Next, we bound the last term in (102). So,
(105)
Combining (103), (104), and (105) with (102), we obtain,
and since this holds for any and in the convex hull of , it follows that is gradient Lipschitz over the convex hull of the orthogonal matrices. Therefore, by Lemma 2.7 in [8], restricted to has a gradient Lipschitz retraction, and this holds for all .