
Sparse Quadratic Optimisation over the
Stiefel Manifold with Application to
Permutation Synchronisation

Florian Bernard (TU Munich, University of Bonn), Daniel Cremers (TU Munich), Johan Thunberg (Halmstad University)
Abstract

We address the non-convex optimisation problem of finding a sparse matrix on the Stiefel manifold (matrices with mutually orthogonal columns of unit length) that maximises (or minimises) a quadratic objective function. Optimisation problems on the Stiefel manifold occur for example in spectral relaxations of various combinatorial problems, such as graph matching, clustering, or permutation synchronisation. Although sparsity is a desirable property in such settings, it is mostly neglected in spectral formulations since existing solvers, e.g. based on eigenvalue decomposition, are unable to account for sparsity while at the same time maintaining global optimality guarantees. We fill this gap and propose a simple yet effective sparsity-promoting modification of the Orthogonal Iteration algorithm for finding the dominant eigenspace of a matrix. By doing so, we can guarantee that our method finds a Stiefel matrix that is globally optimal with respect to the quadratic objective function, while in addition being sparse. As a motivating application we consider the task of permutation synchronisation, which can be understood as a constrained clustering problem that has particular relevance for matching multiple images or 3D shapes in computer vision, computer graphics, and beyond. We demonstrate that the proposed approach outperforms previous methods in this domain.

1 Introduction

We are interested in optimisation problems of the form

\[
\operatorname*{arg\,max}_{U\in\operatorname{St}(m,d)}~f(U)\quad\text{with}\quad f(U)=\operatorname{tr}(U^{T}WU)\quad\text{and}\quad\operatorname{St}(m,d):=\{X\in\mathbb{R}^{m\times d}:X^{T}X=\mathbf{I}_{d}\}\,,\tag{1}
\]

where $\mathbf{I}_{d}$ is the identity matrix of dimension $d$, $W\in\mathbb{R}^{m\times m}$, and the set $\operatorname{St}(m,d)$ denotes the Stiefel manifold ($m\geq d$). Throughout the paper, w.l.o.g. we consider a maximisation formulation and we assume that $W$ is a symmetric and positive semidefinite matrix (see Lemma 1 in Sec. 2). Despite the non-convexity of Problem (1), problems of this form can be solved to global optimality based on the eigenvalue decomposition, i.e. by setting $U^{*}=V_{d}$, where $V_{d}\in\mathbb{R}^{m\times d}$ denotes an orthogonal basis of eigenvectors corresponding to the $d$ largest eigenvalues of $W$, see Lemma 2. The fact that we can efficiently find global optima of Problem (1) makes it a popular relaxation formulation of various difficult combinatorial problems. This includes spectral relaxations [43, 26, 17] of the NP-hard quadratic assignment problem (QAP) [35], spectral clustering [33, 44], or spectral permutation synchronisation [34, 39, 31]. Yet, a major difficulty of such approaches is to discretise the (relaxed) continuous solution in order to obtain a feasible solution of the original (discrete) problem. For example, in the QAP one optimises over the set of permutation matrices, in clustering one optimises over the set of matrices whose rows are canonical basis vectors, and in permutation synchronisation one optimises over multiple permutation matrices that are stacked into a large block matrix. Often such combinatorial sets can be characterised by Stiefel matrices that are sparse – for example, the set of (signed) $d{\times}d$ permutation matrices can be characterised by matrices in $\operatorname{St}(d,d)$ that have exactly $d$ non-zero elements, whereas any element of $\operatorname{St}(d,d)$ with more than $d$ non-zero elements is not a signed permutation matrix.

The most common approach to obtain sparse solutions in an optimisation problem is to integrate an explicit sparsity-promoting regulariser. However, a major hurdle when considering quadratic optimisation problems over the Stiefel manifold is that the incorporation of sparsity-promoting terms is not compatible with solutions based on eigenvalue decomposition, so that desirable global optimality guarantees are generally no longer maintained.

We depart from the common path of integrating explicit regularisers and instead exploit the orthogonal-invariance present in Problem (1) to ensure sparsity. To be more specific, for any orthogonal matrix $Q\in\mathbb{O}(d):=\operatorname{St}(d,d)$ it holds that $f(U)=\operatorname{tr}(U^{T}WU)=\operatorname{tr}(U^{T}WUQQ^{T})=\operatorname{tr}((UQ)^{T}W(UQ))=f(UQ)$. Hence, if $\bar{U}\in\operatorname{St}(m,d)$ is a solution to Problem (1), so is $\bar{U}Q\in\operatorname{St}(m,d)$ for any $Q\in\mathbb{O}(d)$. The subspace $\operatorname{im}(\bar{U}Q)$ spanned by $\bar{U}Q$ for a given $\bar{U}\in\operatorname{St}(m,d)$ (and arbitrary $Q\in\mathbb{O}(d)$) is equal to the subspace $\operatorname{im}(\bar{U})$. Motivated by this observation, we utilise orthogonal-invariance in order to find a solution $U^{*}\in\{U\in\operatorname{St}(m,d):\operatorname{im}(U)=\operatorname{im}(\bar{U})\}$ such that $U^{*}$ is sparse. To this end, we build upon the additional degrees of freedom due to $Q\in\mathbb{O}(d)$, which allow us to rotate a given solution $\bar{U}\in\operatorname{St}(m,d)$ to a sparser representation $U^{*}=\bar{U}Q\in\operatorname{St}(m,d)$ – most notably, while remaining a globally optimal solution to Problem (1).

Main contributions. We summarise our main contributions as follows: (i) For the first time we propose an algorithm that exploits the orthogonal-invariance in quadratic optimisation problems over the Stiefel manifold while simultaneously accounting for sparsity in the solution. (ii) Despite its simplicity, our algorithm is effective as it builds on a modification of the well-established Orthogonal Iteration algorithm for finding the most dominant eigenspace of a given matrix. (iii) Our algorithm is guaranteed to converge to the dominant subspace with the same convergence rate as the original Orthogonal Iteration algorithm, and our solution constitutes a global optimiser of Problem (1). (iv) We experimentally confirm the efficacy of our approach in the context of the permutation synchronisation problem.

2 Preliminaries & Related Work

In this section we clarify our assumptions, introduce additional preliminaries, and provide references to related work. Let $\lambda_{1},\lambda_{2},\ldots,\lambda_{m}$ be the eigenvalues of $W\in\mathbb{R}^{m\times m}$, ordered decreasingly. We impose the following assumption on $W$:

Assumption 1 (Separated eigenspace).

We assume that $\lambda_{d}>\lambda_{d+1}$.

Throughout the paper we also assume that $W$ is symmetric and positive semidefinite (p.s.d.), which, however, is not a restriction, as the following straightforward result indicates:

Lemma 1 (Symmetry and positive semidefiniteness).

In Problem (1), if $W$ is not symmetric or not p.s.d., there is an equivalent optimisation problem (i.e. with the same optimisers) in which $W$ has been replaced by a symmetric p.s.d. matrix $\tilde{W}$.

Proof.

See Appendix. ∎

Next, we define the notion of a dominant subspace and convergence of matrix sequences to such a subspace.

Definition 1 (Dominant invariant subspace).

The $d$-dimensional dominant invariant subspace (or dominant subspace for short) of the matrix $W\in\mathbb{R}^{m\times m}$ is defined as the subspace $\operatorname{im}(V_{d})\subseteq\mathbb{R}^{m}$, where $V_{d}\in\operatorname{St}(m,d)$ is the matrix whose columns are the $d$ eigenvectors corresponding to the $d$ largest eigenvalues of $W$.

Definition 2 (Convergence).

We say that a sequence of matrices $\{U_{t}\}$ converges to the dominant subspace of $W$ if $\lim_{t\rightarrow\infty}\|V_{d}V_{d}^{T}U_{t}-U_{t}\|=0$.
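To make Definition 2 concrete, the residual can be evaluated numerically. The following is a minimal NumPy sketch (the helper names are ours, and we compute $V_d$ by a dense eigendecomposition for illustration only):

```python
import numpy as np

def dominant_subspace(W, d):
    """Orthonormal basis V_d of the d-dimensional dominant subspace of symmetric W."""
    _, eigvecs = np.linalg.eigh(W)      # eigenvalues in ascending order
    return eigvecs[:, -d:]              # eigenvectors of the d largest eigenvalues

def subspace_residual(W, d, U):
    """Residual ||V_d V_d^T U_t - U_t|| from Definition 2; it vanishes exactly
    when the columns of U lie in the dominant subspace of W."""
    V_d = dominant_subspace(W, d)
    return np.linalg.norm(V_d @ (V_d.T @ U) - U)
```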

There exists a close relation between the dominant subspace of $W$ and solutions of Problem (1):

Lemma 2 (Solution to Problem (1)).

Problem (1) is solved by any matrix $U^{*}$ that forms an orthogonal basis of the $d$-dimensional dominant subspace of $W$, i.e. $U^{*}\in\{U\in\operatorname{St}(m,d):\operatorname{im}(U)=\operatorname{im}(V_{d})\}$.

Proof.

See Appendix. ∎

Basic algorithms for computing eigenvectors. In order to find the dominant subspace of $W$ we can use algorithms for finding eigenvectors. The Power method [21] is an efficient way of finding the single most dominant eigenvector of a given matrix, i.e. it considers the case $d=1$. It proceeds by iteratively updating a given initial $v_{0}\in\operatorname{St}(m,1)$ via the update $v_{t+1}\leftarrow Wv_{t}/\|Wv_{t}\|$. In order to find the $d$ most dominant eigenvectors, one can use the Orthogonal Iteration algorithm [21], which generalises the Power method to the case $d>1$. The algorithm proceeds by repeatedly computing $V_{t+1}R_{t+1}\leftarrow WV_{t}$ based on the (thin) QR-decomposition of $WV_{t}$, where $V_{t}\in\operatorname{St}(m,d)$ and $R_{t}\in\mathbb{R}^{d\times d}$ is upper triangular. For $t\rightarrow\infty$ the sequence $\{V_{t}\}$ converges (under mild conditions on the initial $V_{0}$) to the dominant invariant subspace as long as $|\lambda_{d}|>|\lambda_{d+1}|$, see Thm. 8.2.2 in [21].
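For reference, a minimal NumPy sketch of the classical Orthogonal Iteration (the function signature and the fixed iteration budget are our own simplifications; in practice one iterates until a convergence criterion is met):

```python
import numpy as np

def orthogonal_iteration(W, d, num_iters=100, seed=0):
    """Classical Orthogonal Iteration: V_{t+1} R_{t+1} <- QR(W V_t)."""
    rng = np.random.default_rng(seed)
    V, _ = np.linalg.qr(rng.standard_normal((W.shape[0], d)))  # random Stiefel init
    for _ in range(num_iters):
        V, _ = np.linalg.qr(W @ V)  # thin QR of W V_t
    return V  # orthogonal basis of the dominant subspace (under Assumption 1)
```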

Sparse Stiefel optimisation. There are numerous approaches for addressing general optimisation problems over the Stiefel manifold, including generic manifold optimisation techniques (e.g. [3, 14]) and Stiefel-specific approaches (e.g. [30, 46]). In the following we focus on works that consider sparse optimisation over the Stiefel manifold and are most relevant to our approach. In order to promote sparse solutions, sparsity-inducing regularisers can be utilised, for example via the minimisation of the (non-convex) $\ell_{p}$-‘norm’ for $0\leq p<1$, or the (convex) $\ell_{1}$-norm [36]. However, the non-smoothness of such regularisers often constitutes a computational obstacle.

The optimisation of non-smooth functions over the Stiefel manifold has been considered in [16], where the sum of a non-convex smooth and a convex non-smooth function is optimised via a proximal gradient method. In [27], the authors consider the optimisation of a sum of non-smooth weakly convex functions over the Stiefel manifold using Riemannian subgradient-type methods. Yet, in practice, differentiable surrogates of non-smooth sparsity-promoting terms are often considered [40, 29, 38, 15]. Instead of minimising $\ell_{p}$-‘norms’ with $0\leq p\leq 1$, on the Stiefel manifold one may instead maximise $\ell_{p}$-norms with $p>2$, such as the $\ell_{3}$-norm [47] or the $\ell_{4}$-norm [49]. Further motivation for sparsity-promoting higher-order norms in this context can be found in [37, 50, 28].

Optimisation over the Stiefel manifold has a close connection to optimisation over the Grassmannian manifold [3]. There are numerous approaches for Grassmannian manifold optimisation (e.g. [19, 18]), including sparse optimisation via an $\ell_{1}$-norm regulariser [45] and the optimisation of non-convex and non-smooth objective functions via a projected Riemannian subgradient method [52]. In our case, due to the rotation-invariance $f(U)=f(UQ)$ for any $Q\in\mathbb{O}(d)$ in Problem (1), finding the dominant subspace of $W$ could also be posed as an optimisation problem over the Grassmannian manifold. However, we are not only interested in identifying this subspace (which can for example be done via the Orthogonal Iteration algorithm [21], or via Grassmann-Rayleigh quotient iterations [4]), but we want to find a specific choice of coordinates for which the representation of the subspace is sparse.

3 Proposed Method for Sparse Quadratic Optimisation over the Stiefel Manifold

Input: $W\in\mathbb{R}^{m\times m}$, $U_{0}\in\mathbb{R}^{m\times d}$, $\epsilon>0$
Output: $U^{*}$
Initialise: $t\leftarrow 0$
repeat
       $U_{t+1}R_{t+1}\leftarrow WU_{t}Z(U_{t})$ // unique QR-decomposition
       $t\leftarrow t+1$
until convergence
$U^{*}\leftarrow U_{t+1}$
Algorithm 1: Overview of our proposed algorithm. In terms of convergence properties our algorithm is equivalent to the Orthogonal Iteration algorithm and thus produces a $U^{*}$ in the dominant subspace of $W$. However, our modification introduces the matrix $Z(U_{t})$ with the purpose of promoting sparsity of the solution (see Sec. 3.2 for how we choose $Z(U_{t})$).

In this section we introduce our Algorithm 1, which can be seen as a modification of the popular Orthogonal Iteration algorithm. The main difference is that we iteratively weigh the matrix $U_{t}\in\operatorname{St}(m,d)$ by a matrix $Z(U_{t})\in\mathbb{R}^{d\times d}$, where the matrix function $Z$ maps onto the set of full-rank matrices. The purpose of the matrix $Z(U_{t})$ (see Sec. 3.2 for our specific choice) is to promote sparsity. Intuitively, we characterise sparsity as having few elements that are large whereas most elements are close to zero, which is formally expressed in (4).

First, we focus on the overall interpretation of our approach: we want to ensure that we retrieve a $U^{*}$ that is a global maximiser of Problem (1) and, in some relaxed sense, also sparse. However, we cannot in general augment the objective function in Problem (1) with an explicit sparsity regulariser and expect that the respective solution is still a maximiser of the original objective. Instead, we steer the solution towards being sparser by weighing our matrix of interest $U_{t}$ with the matrix $Z(U_{t})$.

3.1 Convergence

We start by ensuring that, under the assumption that $Z(U_{t})$ is full rank, the sequence $\{U_{t}\}$ generated by Algorithm 1 converges to the dominant subspace of $W$. This can straightforwardly be shown using the following result.

Lemma 3.

Consider the two algorithms:

      Algorithm A: $V_{t+1}R_{t+1}\leftarrow WV_{t}X_{t}$,

      Algorithm B: $\bar{V}_{t+1}\bar{R}_{t+1}\leftarrow W\bar{V}_{t}$,

where $X_{t}\in\mathbb{R}^{d\times d}$ is full rank, and for each of the two left-hand sides above the two matrices in the product are obtained by the unique (thin) QR-factorisation of the corresponding right-hand side, where $R_{t+1}$ and $\bar{R}_{t+1}$ are upper triangular with positive diagonal. Now, if $V_{t}=\bar{V}_{t}\in\operatorname{St}(m,d)$ up to a rotation from the right, then $V_{t^{\prime}}$ is equal to $\bar{V}_{t^{\prime}}$ up to a rotation from the right for all $t^{\prime}>t$.

Proof.

Suppose $V_{t}=\bar{V}_{t}\bar{Q}_{t}$, where $\bar{Q}_{t}\in\mathbb{O}(d)$. It holds that $V_{t+1}R_{t+1}=WV_{t}X_{t}=W\bar{V}_{t}\bar{Q}_{t}X_{t}=\bar{V}_{t+1}(\bar{R}_{t+1}\bar{Q}_{t}X_{t})=\bar{V}_{t+1}\tilde{Q}_{t+1}\tilde{R}_{t+1}$, where $\tilde{Q}_{t+1}$ and $\tilde{R}_{t+1}$ are the two matrices in the QR-factorisation of the matrix $\bar{R}_{t+1}\bar{Q}_{t}X_{t}$. By the uniqueness of the QR-factorisation, this means that $V_{t+1}=\bar{V}_{t+1}\tilde{Q}_{t+1}$ and $R_{t+1}=\tilde{R}_{t+1}$. The result now follows readily by induction. ∎

Thus, if $Z(U_{t})$ is full rank for each $t$, we can identify our Algorithm 1 with Algorithm A in Lemma 3 and conclude that the sequence $\{U_{t}\}$ converges to the dominant subspace of $W$ as $t\rightarrow\infty$. This is because Algorithm B in Lemma 3 corresponds to the Orthogonal Iteration algorithm, which is known to converge to the dominant subspace of $W$ [21].

We continue by investigating the behaviour in the limit, where the columns of $U_{t}$ are in the dominant subspace of $W$. Let us consider the update $U_{t+1}R_{t+1}\leftarrow WU_{t}Z(U_{t})$ of Algorithm 1, and further assume that the $d$ largest eigenvalues of $W$ are equal, which is for example the case for synchronisation problems assuming cycle consistency (cf. Sec. 4).

Lemma 4.

Assume that the $d$ largest eigenvalues of the p.s.d. matrix $W$ are all equal to one (if they are all equal to some value $\lambda_{1}\neq 1$, we can w.l.o.g. consider $\lambda_{1}^{-1}W$ in place of $W$, since the optimisers of Problem (1) are invariant to a scaling of $W$) and strictly larger than the remaining eigenvalues, and assume that the columns of $U_{t}\in\operatorname{St}(m,d)$ are contained in the dominant subspace of $W$. Provided that $Z(U_{t})$ is full rank, it holds that

\[
WU_{t}Z(U_{t})=U_{t}Z(U_{t})=U_{t+1}R_{t+1}=U_{t}Q_{t+1}R_{t+1},
\]

where $U_{t+1}R_{t+1}$ is the unique QR-factorisation of $U_{t}Z(U_{t})$ and $Q_{t+1}R_{t+1}$ is the unique QR-factorisation of $Z(U_{t})$.

Proof.

Since the $d$ largest eigenvalues of $W$ are all equal to one and the columns of $U_{t}$ are in the dominant subspace, it holds that $WU_{t}=U_{t}U_{t}^{T}U_{t}=U_{t}$, which explains the first equality. While the second equality follows by definition, the third equality remains to be proven. Let $Q_{t+1}\tilde{R}_{t+1}$ be the unique QR-decomposition of $Z(U_{t})$. We want to show that $U_{t+1}=U_{t}Q_{t+1}$ and $R_{t+1}=\tilde{R}_{t+1}$. From the QR-decomposition of $Z(U_{t})$ it follows that $U_{t}Z(U_{t})=U_{t}Q_{t+1}\tilde{R}_{t+1}$, while from the QR-decomposition of $U_{t}Z(U_{t})$ it follows that $U_{t}Z(U_{t})=U_{t+1}R_{t+1}$. The uniqueness of the QR-decomposition implies $U_{t}Q_{t+1}=U_{t+1}$ and $\tilde{R}_{t+1}=R_{t+1}$. ∎

Hence, under the assumptions of Lemma 4, the columns of $U_{t^{\prime}}$ span the dominant subspace of $W$ for any $t^{\prime}\geq t$. The update in Algorithm 1 reduces to the form

\begin{align}
Q_{t+1}R_{t+1}&\leftarrow Z(U_{t}),\tag{2}\\
U_{t+1}&\leftarrow U_{t}Q_{t+1},\tag{3}
\end{align}

i.e. updating $U_{t}$ to obtain $U_{t+1}$ simplifies to multiplying $U_{t}$ with the $Q$-matrix from the QR-factorisation of $Z(U_{t})$.

3.2 Choosing $Z(U_{t})$ to Promote Sparsity

We now turn our attention to the choice of $Z(U_{t})$ in Algorithm 1. We know that, as long as $Z(U_{t})$ is full rank in each iteration $t$, the columns of $U_{t}$ converge to the dominant subspace of $W$. In addition to our objective function $f$ in Problem (1), we now introduce the secondary objective

\[
g(U)=\sum_{i=1}^{m}\sum_{j=1}^{d}(U_{ij})^{p}=\operatorname{tr}(U^{T}(U.^{\wedge(p-1)})),\tag{4}
\]

where, for $p$ a positive integer, the notation $U.^{\wedge p}$ means raising each element of $U$ to the power $p$. For $U\in\operatorname{St}(m,d)$ and $p$ larger than 2, the maximisation of $g$ promotes sparsity in a relaxed sense. To be specific, if $p$ is odd, a few larger elements (with value close to 1) lead to a larger value of $g$ than many smaller elements (with value close to 0), so that sparsity and non-negativity are promoted simultaneously. Analogously, if $p$ is even, a few elements with value close to $\pm 1$ lead to a larger value of $g$ than many smaller elements close to 0.
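The following small sketch illustrates this effect for $p=3$ on a single unit-norm column (i.e. one column of a Stiefel matrix); the function name is ours:

```python
import numpy as np

def g(U, p=3):
    """Secondary objective (4): sum of all entries of U raised to the power p."""
    return np.sum(U ** p)

e = np.zeros(5); e[0] = 1.0        # sparse unit vector:  g = 1
u = np.full(5, 1 / np.sqrt(5))     # dense unit vector:   g = 5 * 5**(-1.5) ~ 0.447
print(g(e), g(u))                  # the sparse vector attains the larger value
```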

Let us consider the maximisation of

\[
g(UQ)=\operatorname{tr}(h(U,Q))\tag{5}
\]

with respect to $Q$, where $h(U,Q)=(UQ)^{T}((UQ).^{\wedge(p-1)})$, $Q\in\mathbb{O}(d)$, and $U\in\operatorname{St}(m,d)$. This means that we want to rotate $U$ by $Q$ in such a way that the secondary objective $g$ is maximised after the rotation. By applying the rotation from the right, we ensure that the columns of $U$ span the same space after the rotation, while a higher value of the secondary objective $g$ in (4) is achieved.

The extrinsic gradient of $g(UQ)$ with respect to $Q$ (for $Q$ relaxed to lie in $\mathbb{R}^{d\times d}$) at $Q=\mathbf{I}$ is $p\,h(U,\mathbf{I})$. Thus the (manifold) gradient of $g(UQ)$ at $Q=\mathbf{I}$ is given by the projection of the extrinsic gradient at $Q=\mathbf{I}$ onto the tangent space, which reads

\[
\nabla_{Q}\,g(UQ)\big|_{Q=\mathbf{I}}=p\,(h(U,\mathbf{I})-h^{T}(U,\mathbf{I})).\tag{6}
\]

For a small step size $\alpha$, we can perform a first-order approximation of gradient ascent by choosing $Z=\mathbf{I}+\alpha p(h(U,\mathbf{I})-h^{T}(U,\mathbf{I}))$, and then updating $U$ in terms of the $Q$-matrix of the QR-factorisation of $Z$. Now, let us investigate the behaviour in the limit under the assumptions of Lemma 4, i.e. that the $d$ largest eigenvalues of $W$ are equal. In this context the $Q$-matrix of the QR-factorisation is a retraction [3] and serves as a first-order approximation of the exponential map. With that, the updates become

\begin{align}
Q_{t+1}R_{t+1}&\leftarrow\mathbf{I}+\alpha_{t}\,(h(U_{t},\mathbf{I})-h^{T}(U_{t},\mathbf{I})),\tag{7}\\
U_{t+1}&\leftarrow U_{t}Q_{t+1}.\tag{8}
\end{align}

This choice of the matrix $Z(U_{t})$ ensures that it is full rank, since the sum of the identity matrix and a skew-symmetric matrix is always full rank. Hence, for a sufficiently small step size $\alpha$, we iteratively move in an ascent direction on $\mathbb{O}(d)$ by utilising the QR-retraction.
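Putting the pieces together, the following NumPy sketch implements Algorithm 1 with this choice of $Z(U_{t})$ for $p=3$ (function names, the fixed iteration budget, the sign-fix enforcing the unique QR-factorisation, and the max-absolute-entry reading of the $\infty$-norm step size from Sec. 4.3 are our own assumptions):

```python
import numpy as np

def h(U):
    """h(U, I) = U^T (U.^(p-1)) for p = 3, cf. (5)."""
    return U.T @ (U ** 2)

def sparse_orthogonal_iteration(W, d, num_iters=200, seed=0):
    """Sketch of Algorithm 1 with the sparsity-promoting Z(U_t) from (7)."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((W.shape[0], d)))
    for _ in range(num_iters):
        H = h(U)
        grad = H - H.T                                # ascent direction on O(d), cf. (6)
        alpha = 1.0 / max(np.abs(grad).max(), 1e-12)  # step size alpha_t, cf. Sec. 4.3
        Z = np.eye(d) + alpha * grad                  # full rank: identity + skew-symmetric
        Q, R = np.linalg.qr(W @ U @ Z)                # modified Orthogonal Iteration step
        sgn = np.sign(np.diag(R)); sgn[sgn == 0] = 1.0
        U = Q * sgn                                   # enforce diag(R) > 0 (unique QR)
    return U
```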

4 Application to Permutation Synchronisation

4.1 Permutation Synchronisation in a Nutshell

Permutation synchronisation is a procedure to improve matchings between multiple objects [23, 34], and related concepts have been utilised to address diverse tasks, such as multi-alignment [7, 5, 25], multi-shape matching [22, 24, 20], multi-image matching [51, 42, 9, 13, 12], or multi-graph matching [48, 6, 41], among many others. Permutation synchronisation refers to the process of establishing cycle consistency in the set of pairwise permutation matrices that encode correspondences between points in multiple objects. A typical application in computer vision is one where the points are feature descriptors or key-points and the objects are images, as shown in Fig. 2.

Let $k$ denote the number of objects, where each object $i$ contains $m_{i}$ points. For $\mathbf{1}_{p}$ being the $p$-dimensional vector of all ones, and vector inequalities understood element-wise, let $P_{ij}\in\mathbb{P}_{m_{i}m_{j}}:=\{X\in\{0,1\}^{m_{i}\times m_{j}}:X\mathbf{1}_{m_{j}}\leq\mathbf{1}_{m_{i}},\,\mathbf{1}_{m_{i}}^{T}X\leq\mathbf{1}_{m_{j}}^{T}\}$ be the partial permutation matrix that represents the correspondence between the $m_{i}$ points in object $i$ and the $m_{j}$ points in object $j$. In the case of bijective permutations, the set of pairwise permutations $\mathcal{P}:=\{P_{ij}\}_{i,j=1}^{k}$ is said to be cycle-consistent if for all $i,j,\ell$ it holds that $P_{i\ell}P_{\ell j}=P_{ij}$.

We define the set of partial permutation matrices with full row-rank as $\overline{\mathbb{P}}_{m_{i}d}:=\{X\in\mathbb{P}_{m_{i}d}:X\mathbf{1}_{d}=\mathbf{1}_{m_{i}}\}$, where $d$ denotes the total number of distinct points across all objects. Cycle consistency is known to be equivalent to the existence of so-called object-to-universe matchings $\mathcal{U}:=\{P_{i}\in\overline{\mathbb{P}}_{m_{i}d}\}_{i=1}^{k}$ such that for all $i,j$ we can write $P_{ij}=P_{i}P_{j}^{T}$ (see [23, 34] for details). The object-to-universe characterisation of cycle consistency is also valid in the case of non-bijective (i.e. partial) permutations (see [42, 8] for details).
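For illustration, a small NumPy sketch (with our own helper names) that builds pairwise matchings from object-to-universe matchings and checks the cycle-consistency condition in the bijective case:

```python
import numpy as np

def pairwise_from_universe(P_list):
    """Build pairwise matchings P_ij = P_i P_j^T from object-to-universe matchings P_i."""
    return {(i, j): Pi @ Pj.T for i, Pi in enumerate(P_list)
                              for j, Pj in enumerate(P_list)}

def is_cycle_consistent(P, k):
    """Check P_il P_lj = P_ij for all triples (i, l, j) in the bijective case."""
    return all(np.array_equal(P[(i, l)] @ P[(l, j)], P[(i, j)])
               for i in range(k) for l in range(k) for j in range(k))
```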

Given the noisy (i.e. not cycle-consistent) set of pairwise permutations $\mathcal{P}=\{P_{ij}\}_{i,j=1}^{k}$, permutation synchronisation can be phrased as the optimisation problem

\[
\operatorname*{arg\,max}_{\{P_{i}\in\overline{\mathbb{P}}_{m_{i}d}\}}~\sum_{i,j}\operatorname{tr}(P_{ij}^{T}P_{i}P_{j}^{T})\quad\Leftrightarrow\quad\operatorname*{arg\,max}_{P\in\mathbb{U}}~\operatorname{tr}(P^{T}WP)\,,\tag{9}
\]

where for $m:=\sum_{i}m_{i}$ we define the set $\mathbb{U}:=\overline{\mathbb{P}}_{m_{1}d}\times\ldots\times\overline{\mathbb{P}}_{m_{k}d}\subset\mathbb{R}^{m\times d}$, the $(m{\times}d)$-dimensional block matrix $P=[P_{1}^{T},\ldots,P_{k}^{T}]^{T}$, and the block matrix $W:=[P_{ij}]_{ij}\in\mathbb{R}^{m\times m}$, in order to allow for a compact matrix representation of the objective.
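As a concrete illustration of this construction, a short sketch (our own helper) that assembles $W$ from the pairwise matchings:

```python
import numpy as np

def build_W(P, sizes):
    """Assemble the block matrix W = [P_ij]_ij, where P maps (i, j) to the
    m_i x m_j pairwise matching and sizes = [m_1, ..., m_k]."""
    offsets = np.concatenate(([0], np.cumsum(sizes)))
    W = np.zeros((offsets[-1], offsets[-1]))
    for i in range(len(sizes)):
        for j in range(len(sizes)):
            W[offsets[i]:offsets[i + 1], offsets[j]:offsets[j + 1]] = P[(i, j)]
    return W
```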

4.2 Proposed Permutation Synchronisation Approach

The core idea of existing spectral permutation synchronisation approaches [34, 39, 31] is to replace the feasible set $\mathbb{U}$ in Problem (9) with the Stiefel manifold $\operatorname{St}(m,d)$, so that we obtain an instance of Problem (1). We utilise Algorithm 1 to obtain a (globally optimal) solution $U^{*}\in\operatorname{St}(m,d)$ of this spectral formulation. In our algorithm we choose $p=3$ to promote sparsity and non-negativity in the resulting $U^{*}$ via the function $g(U)=\operatorname{tr}(U^{T}(U.^{\wedge 2}))=\sum_{i=1}^{m}\sum_{j=1}^{d}U_{ij}^{3}$. With that, in addition to $U^{*}$ being an orthogonal matrix, it contains few large elements that are close to 1 and many smaller elements that are close to 0. As such, we can readily project the matrix $U^{*}$ onto the set $\mathbb{U}$ in terms of a Euclidean projection. The Euclidean projection is obtained by projecting each of the $k$ blocks of $U^{*}=[U_{1}^{*T},\ldots,U_{k}^{*T}]^{T}$ individually onto the set of partial permutations, i.e. $P_{i}=\operatorname{proj}_{\overline{\mathbb{P}}_{m_{i}d}}(U_{i}^{*})=\operatorname*{arg\,max}_{X\in\overline{\mathbb{P}}_{m_{i}d}}\operatorname{tr}(X^{T}U_{i}^{*})$, see e.g. [8], which amounts to a (partial) linear assignment problem [32, 11] that we solve based on the efficient implementation in [10].
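A sketch of this block-wise projection, using SciPy's linear assignment solver in place of the implementation of [10] (assuming $m_{i}\leq d$, so that every row of the block is assigned to a distinct universe column):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def project_block(U_i):
    """Euclidean projection of one block U_i* onto the partial permutations:
    a (partial) linear assignment problem maximising tr(X^T U_i*)."""
    rows, cols = linear_sum_assignment(U_i, maximize=True)
    P_i = np.zeros_like(U_i)
    P_i[rows, cols] = 1.0
    return P_i
```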

4.3 Experimental Results

We experimentally compare our proposed approach with various methods for permutation synchronisation and perform an evaluation on both real and synthetic datasets. In particular, our comparison includes two existing spectral approaches, namely MatchEig [31] and Spectral [34], where for the latter we use the efficient implementation from the authors of [51]. In addition, we also compare against the alternating minimisation method MatchALS [51], and against the non-negative matrix factorisation approach NmfSync [8]. To emphasise that the methods MatchEig and MatchALS do not guarantee cycle consistency (but instead aim to improve the initial matchings), in all plots we show the results of the respective methods as dashed lines. We use the fscore to measure the quality of the obtained multi-matchings, defined as $f=\frac{2\,p\,r}{p+r}$, where $p$ and $r$ denote precision and recall, respectively. All experiments are run on a MacBook Pro (2.8 GHz quad-core i7, 16 GB RAM), where for $\epsilon=10^{-5}$ we use $f(U_{t})/f(U_{t+1})\geq 1-\epsilon$ as the convergence criterion in Algorithm 1, and the step size $\alpha_{t}=\|h(U_{t},\mathbf{I})-h^{T}(U_{t},\mathbf{I})\|_{\infty}^{-1}$ in (7).

Real data. In this experiment we use the CMU house image sequence [1] comprising 111 frames within the experimental protocol of [34]. We generate a sequence of permutation synchronisation problem instances with a gradually increasing number of objects $k$ by extracting the respective pairwise matchings for $k$ objects. To this end, we vary $k$ from 20 to 111 and sample the pairwise matchings evenly-spaced from the $111\times 111$ pairwise matchings. Quantitative results in terms of the fscore, the objective value of Problem (9), and the runtimes are shown in Fig. 1, in which the individual problem instances vary along the horizontal axis. We can see that our proposed method dominates other approaches in terms of the fscore and the objective value (the reported objective values are divided by $k^{2}$ to normalise the scale in problems of different sizes), while being among the fastest. Note that we do not report the objective value for MatchEig and MatchALS, since they do not lead to cycle-consistent matchings, so that the obtained solutions do not lie in the feasible set of Problem (9). Qualitative results of the matching between one pair of images for $k=111$ are shown in Fig. 2. As expected, our proposed approach clearly outperforms the Spectral baseline: our method is guaranteed to converge to the same subspace that is spanned by the spectral solution, while at the same time providing a sparser and less negative solution that is thereby closer to the feasible set $\mathbb{U}$.


Figure 1: Quantitative results on the CMU house sequence in terms of the fscore (\uparrow), objective value of Problem (9) (\uparrow), and runtime (\downarrow). The individual instances of permutation synchronisation problems vary along the horizontal axis. Methods that do not guarantee cycle consistency are shown as dashed lines.
[Figure 2 image panels: Input, MatchEig, MatchALS, Spectral, NmfSync, Ours]
Figure 2: Comparison of matchings between the first and last image of the CMU house sequence obtained by several methods. The colour of the dots indicates the ground truth correspondence, and the lines show the obtained matchings (green: correct, red: wrong). Overall, our approach obtains the best matchings, see also Fig. 1.

Synthetic data. We reproduce the procedure described in [8] for generating synthetic instances for the synchronisation of partial permutations. Four different parameters are considered for generating a problem instance: the universe size $d$, the number of objects $k$ that are to be matched, the observation rate $\rho$, and the error rate $\sigma$ (see [8] for details). One of these parameters varies in each experimental setting, while the others are kept fixed. Each individual experiment is repeated 5 times with different random seeds. In Fig. 3 we compare the performance of MatchEig [31], MatchALS [51], Spectral [34], NmfSync [8] and Ours. The first row shows the fscore, and the second row the respective runtimes. Note that we did not run MatchALS on the larger instances since its runtime is prohibitively long. The methods MatchEig and MatchALS do not guarantee cycle consistency and are thus shown as dashed lines. It can be seen that in most settings our method obtains superior performance in terms of the fscore, while being almost as fast as the most efficient methods, which are also based on a spectral relaxation (MatchEig and Spectral).

Figure 3: Experimental comparison of permutation synchronisation methods on synthetic data. Each column shows a different varying parameter. The first row shows the fscore (\uparrow) and the second row the runtime (\downarrow). Methods that do not guarantee cycle consistency are shown with dashed lines. Our method guarantees cycle consistency and leads to higher fscores in most settings.

5 Discussion and Limitations

Our algorithm has several favourable properties: it finds a globally optimal solution to Problem (1), it is computationally efficient, it has the same convergence behaviour as the Orthogonal Iteration algorithm [21], and it promotes (approximately) sparse solutions. Naturally, since our solution is globally optimal (with respect to Problem (1)) and thus converges to $\operatorname{im}(V_{d})$, the amount of sparsity that we can achieve is limited by the sparsest orthogonal basis that spans $\operatorname{im}(V_{d})$. In turn, we interpret sparsity in a looser sense, meaning that there are few elements that are large whereas most elements are close to zero. Furthermore, since the secondary sparsity-promoting problem (with objective (4)) is generally non-convex, we cannot guarantee that we attain its global optimum.

Since permutation synchronisation was our primary motivation, it is the main focus of this paper. In the case of permutations, the assumption that the eigenvalues corresponding to $V_{d}$ are equal (as we explain in Sec. 3.2) is reasonable, since this must hold for cycle-consistent bijective matchings. However, broadening the scope of our approach to other problems in which this assumption may not be valid will require further theoretical analysis and possibly a different strategy for choosing the matrix $Z(U_{t})$. Studying the universality of the proposed method, as well as analysing different convergence criteria and different matrix functions $Z(U_{t})$, are open problems that we leave for future work.

The proposed method comprises a variation of a well-known iterative procedure for computing the dominant subspace of a matrix. The key component in this procedure is the QR-factorisation, which is differentiable almost everywhere and is therefore well-suited to be applied within a differentiable programming context (e.g. for end-to-end training of neural networks).

6 Conclusion

We propose an efficient algorithm to find a matrix that forms an orthogonal basis of the dominant subspace of a given matrix WW while additionally promoting sparsity and non-negativity. Our procedure inherits favourable properties from the Orthogonal Iteration algorithm, namely it is simple to implement, converges almost everywhere, and is computationally efficient. Moreover, our method is designed to generate sparser solutions compared to the Orthogonal Iteration algorithm. This is achieved by rotating our matrix of interest in each iteration by a matrix corresponding to the gradient of a secondary objective that promotes sparsity (and optionally non-negativity).

The considered problem setting is relevant for various spectral formulations of difficult combinatorial problems, as they occur in applications like multi-matching, permutation synchronisation, or assignment problems. Here, the combination of orthogonality, sparsity, and non-negativity is desirable, since these are properties that characterise binary matrices such as permutation matrices. Experimentally we show that our proposed method outperforms existing methods in the context of partial permutation synchronisation, while at the same time having a favourable runtime.

Broader Impact

The key contribution of this paper is an effective and efficient optimisation algorithm for addressing sparse optimisation problems over the Stiefel manifold. Given the fundamental nature of our contribution, we do not see any direct ethical concerns or negative societal impacts related to our work.

Overall, there are numerous opportunities to use the proposed method for various types of multi-matching problems over networks and graphs. The addressed problem is highly relevant within the fields of machine learning (e.g. for data canonicalisation to facilitate efficient learning with non-Euclidean data), computer vision (e.g. for 3D reconstruction in structure from motion, or image alignment), computer graphics (e.g. for bringing 3D shapes into correspondence) and other related areas.

The true power of permutation synchronisation methods appears when the number of objects to synchronise is large. With large quantities of data becoming increasingly available, the introduction of our scalable optimisation procedure that can account for orthogonality – while promoting sparsity and non-negativity – is, on the one hand, an important contribution to the synchronisation community. On the other hand, it has the potential to impact more general non-convex optimisation problems, particularly in the context of spectral relaxations of difficult and large combinatorial problems.

Acknowledgement

JT was supported by the Swedish Research Council (2019-04769).

References

  • [1] CMU/VASC image database. http://www.cs.cmu.edu/afs/cs/project/vision/vasc/idb/www/html/motion/.
  • [2] P.-A. Absil, C. G. Baker, and K. A. Gallivan. Trust-region methods on riemannian manifolds. Foundations of Computational Mathematics, 7(3):303–330, 2007.
  • [3] P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, 2009.
  • [4] P.-A. Absil, R. Mahony, R. Sepulchre, and P. Van Dooren. A grassmann–rayleigh quotient iteration for computing invariant subspaces. SIAM review, 44(1):57–73, 2002.
  • [5] F. Arrigoni, B. Rossi, and A. Fusiello. Spectral synchronization of multiple views in se (3). SIAM Journal on Imaging Sciences, 9(4):1963–1990, 2016.
  • [6] F. Bernard, C. Theobalt, and M. Moeller. DS*: Tighter Lifting-Free Convex Relaxations for Quadratic Matching Problems. In CVPR, 2018.
  • [7] F. Bernard, J. Thunberg, P. Gemmar, F. Hertel, A. Husch, and J. Goncalves. A solution for multi-alignment by transformation synchronisation. In CVPR, 2015.
  • [8] F. Bernard, J. Thunberg, J. Goncalves, and C. Theobalt. Synchronisation of partial multi-matchings via non-negative factorisations. Pattern Recognition, 92:146–155, 2019.
  • [9] F. Bernard, J. Thunberg, P. Swoboda, and C. Theobalt. Hippi: Higher-order projected power iterations for scalable multi-matching. In CVPR, 2019.
  • [10] F. Bernard, N. Vlassis, P. Gemmar, A. Husch, J. Thunberg, J. Goncalves, and F. Hertel. Fast correspondences for statistical shape models of brain structures. In Medical Imaging 2016: Image Processing, volume 9784, page 97840R. International Society for Optics and Photonics, 2016.
  • [11] D. P. Bertsekas. Network Optimization: Continuous and Discrete Models. Athena Scientific, 1998.
  • [12] T. Birdal, V. Golyanik, C. Theobalt, and L. J. Guibas. Quantum permutation synchronization. In CVPR, 2021.
  • [13] T. Birdal and U. Simsekli. Probabilistic permutation synchronization using the riemannian structure of the birkhoff polytope. In CVPR, 2019.
  • [14] N. Boumal, B. Mishra, P.-A. Absil, and R. Sepulchre. Manopt, a Matlab toolbox for optimization on manifolds. Journal of Machine Learning Research, 15(42):1455–1459, 2014.
  • [15] A. Breloy, S. Kumar, Y. Sun, and D. P. Palomar. Majorization-minimization on the stiefel manifold with application to robust sparse pca. IEEE Transactions on Signal Processing, 69:1507–1520, 2021.
  • [16] S. Chen, S. Ma, A. Man-Cho So, and T. Zhang. Proximal gradient method for nonsmooth optimization over the stiefel manifold. SIAM Journal on Optimization, 30(1):210–239, 2020.
  • [17] T. Cour, P. Srinivasan, and J. Shi. Balanced graph matching. NIPS, 2006.
  • [18] L. De Lathauwer, L. Hoegaerts, and J. Vandewalle. A grassmann-rayleigh quotient iteration for dimensionality reduction in ica. In International Conference on Independent Component Analysis and Signal Separation, pages 335–342. Springer, 2004.
  • [19] A. Edelman, T. A. Arias, and S. T. Smith. The geometry of algorithms with orthogonality constraints. SIAM journal on Matrix Analysis and Applications, 20(2):303–353, 1998.
  • [20] M. Gao, Z. Lähner, J. Thunberg, D. Cremers, and F. Bernard. Isometric multi-shape matching. In CVPR, 2021.
  • [21] G. H. Golub and C. F. Van Loan. Matrix computations, volume 4. JHU press, 2013.
  • [22] Q. Huang, Z. Liang, H. Wang, S. Zuo, and C. Bajaj. Tensor maps for synchronizing heterogeneous shape collections. ACM Transactions on Graphics (TOG), 38(4):1–18, 2019.
  • [23] Q.-X. Huang and L. Guibas. Consistent shape maps via semidefinite programming. In Symposium on Geometry Processing, 2013.
  • [24] R. Huang, J. Ren, P. Wonka, and M. Ovsjanikov. Consistent zoomout: Efficient spectral map synchronization. In Computer Graphics Forum, volume 39, pages 265–278. Wiley Online Library, 2020.
  • [25] X. Huang, Z. Liang, X. Zhou, Y. Xie, L. J. Guibas, and Q. Huang. Learning transformation synchronization. In CVPR, 2019.
  • [26] M. Leordeanu and M. Hebert. A Spectral Technique for Correspondence Problems Using Pairwise Constraints. In ICCV, 2005.
  • [27] X. Li, S. Chen, Z. Deng, Q. Qu, Z. Zhu, and A. Man-Cho So. Weakly convex optimization over stiefel manifold using riemannian subgradient-type methods. SIAM Journal on Optimization, 31(3):1605–1634, 2021.
  • [28] Y. Li and Y. Bresler. Global geometry of multichannel sparse blind deconvolution on the sphere. In NeurIPS, pages 1132–1143, 2018.
  • [29] C. Lu, S. Yan, and Z. Lin. Convex sparse spectral clustering: Single-view to multi-view. IEEE Transactions on Image Processing, 25(6):2833–2843, 2016.
  • [30] J. H. Manton. Optimization algorithms exploiting unitary constraints. IEEE Transactions on Signal Processing, 50(3):635–650, 2002.
  • [31] E. Maset, F. Arrigoni, and A. Fusiello. Practical and Efficient Multi-View Matching. In ICCV, 2017.
  • [32] J. Munkres. Algorithms for the Assignment and Transportation Problems. Journal of the Society for Industrial and Applied Mathematics, 5(1):32–38, Mar. 1957.
  • [33] A. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. NIPS, 14:849–856, 2001.
  • [34] D. Pachauri, R. Kondor, and V. Singh. Solving the multi-way matching problem by permutation synchronization. In NIPS, 2013.
  • [35] P. M. Pardalos, F. Rendl, and H. Wolkowicz. The Quadratic Assignment Problem - A Survey and Recent Developments. DIMACS Series in Discrete Mathematics, 1993.
  • [36] Q. Qu, J. Sun, and J. Wright. Finding a sparse vector in a subspace: Linear sparsity using alternating directions. IEEE Transactions on Information Theory, 62(10):5855–5880, 2016.
  • [37] Q. Qu, Y. Zhai, X. Li, Y. Zhang, and Z. Zhu. Geometric analysis of nonconvex optimization landscapes for overcomplete learning. In International Conference on Learning Representations, 2019.
  • [38] Q. Qu, Z. Zhu, X. Li, M. C. Tsakiris, J. Wright, and R. Vidal. Finding the sparsest vectors in a subspace: Theory, algorithms, and applications. arXiv preprint arXiv:2001.06970, 2020.
  • [39] Y. Shen, Q. Huang, N. Srebro, and S. Sanghavi. Normalized spectral map synchronization. NIPS, 29:4925–4933, 2016.
  • [40] J. Song, P. Babu, and D. P. Palomar. Sparse generalized eigenvalue problem via smooth optimization. IEEE Transactions on Signal Processing, 63(7):1627–1642, 2015.
  • [41] P. Swoboda, D. Kainmüller, A. Mokarian, C. Theobalt, and F. Bernard. A convex relaxation for multi-graph matching. In CVPR, 2019.
  • [42] R. Tron, X. Zhou, C. Esteves, and K. Daniilidis. Fast multi-image matching via density-based clustering. In CVPR, 2017.
  • [43] S. Umeyama. An eigendecomposition approach to weighted graph matching problems. IEEE transactions on pattern analysis and machine intelligence, 10(5):695–703, 1988.
  • [44] U. Von Luxburg. A tutorial on spectral clustering. Statistics and computing, 17(4):395–416, 2007.
  • [45] Q. Wang, J. Gao, and H. Li. Grassmannian manifold optimization assisted sparse spectral clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5258–5266, 2017.
  • [46] Z. Wen and W. Yin. A feasible method for optimization with orthogonality constraints. Mathematical Programming, 142(1):397–434, 2013.
  • [47] Y. Xue, Y. Shen, V. Lau, J. Zhang, and K. B. Letaief. Blind data detection in massive MIMO via $\ell_3$-norm maximization over the Stiefel manifold. IEEE Transactions on Wireless Communications, 2020.
  • [48] J. Yan, X.-C. Yin, W. Lin, C. Deng, H. Zha, and X. Yang. A short survey of recent advances in graph matching. In Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval, pages 167–174, 2016.
  • [49] Y. Zhai, Z. Yang, Z. Liao, J. Wright, and Y. Ma. Complete dictionary learning via $\ell_4$-norm maximization over the orthogonal group. Journal of Machine Learning Research, 21(165):1–68, 2020.
  • [50] Y. Zhang, H.-W. Kuo, and J. Wright. Structured local optima in sparse blind deconvolution. IEEE Transactions on Information Theory, 66(1):419–452, 2019.
  • [51] X. Zhou, M. Zhu, and K. Daniilidis. Multi-image matching via fast alternating minimization. In ICCV, 2015.
  • [52] Z. Zhu, T. Ding, D. Robinson, M. Tsakiris, and R. Vidal. A linearly convergent method for non-smooth non-convex optimization on the grassmannian with applications to robust subspace and dictionary learning. NeurIPS, 32:9442–9452, 2019.

Appendix A Proofs

Proof of Lemma 1

Proof.

If the matrix $W$ is not symmetric, we can split $W$ into the sum of a symmetric part $W_{s}$ and a skew-symmetric part $W_{a}$. It holds that $\operatorname{tr}(U^{T}W_{a}U)=\frac{1}{2}\operatorname{tr}(U^{T}W_{a}U)+\frac{1}{2}\operatorname{tr}(U^{T}W_{a}^{T}U)=\frac{1}{2}\operatorname{tr}(U^{T}W_{a}U)-\frac{1}{2}\operatorname{tr}(U^{T}W_{a}U)=0$. Further, if $W_{s}$ is not positive semidefinite, we can shift its eigenvalues via $\tilde{W}=W_{s}-\lambda_{m}\mathbf{I}_{m}$ to make it p.s.d. Since $\alpha\operatorname{tr}(U^{T}\mathbf{I}_{m}U)=\alpha d$ is constant for any scalar $\alpha$, the term $\lambda_{m}\mathbf{I}_{m}$ does not affect the optimisers of Problem (1). Thus, we can replace $W$ in (1) by the p.s.d. matrix $\tilde{W}$ without affecting the optimisers. ∎
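A quick numerical sanity check of this symmetrise-and-shift argument (a sketch, not part of the paper's experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 6, 2
W = rng.standard_normal((m, m))              # generic non-symmetric W
W_s = 0.5 * (W + W.T)                        # symmetric part (skew part contributes zero)
lam_min = np.linalg.eigvalsh(W_s).min()
W_tilde = W_s - lam_min * np.eye(m)          # eigenvalue shift -> p.s.d.

U, _ = np.linalg.qr(rng.standard_normal((m, d)))  # arbitrary Stiefel point
# The objectives differ by the constant d * lam_min for every U,
# hence both problems have the same optimisers:
print(np.trace(U.T @ W @ U) - np.trace(U.T @ W_tilde @ U))  # equals d * lam_min
```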

Proof of Lemma 2

Proof.

Consider the eigenvalue decomposition $(\Lambda,V)$ of $W$, i.e. $W=V\Lambda V^{T}$, where $\Lambda=\operatorname{diag}(\lambda_{1},\ldots,\lambda_{m})$ contains the decreasingly ordered (non-negative) eigenvalues on its diagonal and $V\in\operatorname{St}(m,m)$. Any $U\in\operatorname{St}(m,d)$ can be written as the product of $V$ and another matrix $R\in\operatorname{St}(m,d)$, i.e. $U=VR$. So, instead of optimising over $U$, we can optimise over $R$. Let the $i$-th row of $R$ be denoted by $r_{i}$ for $i=1,2,\ldots,m$. It holds that $\operatorname{tr}(U^{T}WU)=\operatorname{tr}(R^{T}\Lambda R)=\sum_{i=1}^{m}\lambda_{i}\|r_{i}\|_{2}^{2}$. Thus, an equivalent formulation of Problem (1) is the optimisation problem

\[
\max_{R=[r_{1}^{T},r_{2}^{T},\ldots,r_{m}^{T}]^{T}\in\operatorname{St}(m,d)}~\sum_{i=1}^{m}\lambda_{i}\|r_{i}\|_{2}^{2}.\tag{A10}
\]

We observe that $0\leq\|r_{i}\|_{2}^{2}\leq 1$, and that $\sum_{i=1}^{m}\|r_{i}\|_{2}^{2}=\operatorname{tr}(RR^{T})=\operatorname{tr}(R^{T}R)=d$. Hence, a relaxation of (A10) is given by

\begin{align}
\max_{p\in\mathbb{R}^{m}}~&\sum_{i=1}^{m}\lambda_{i}p_{i},\tag{A11}\\
\text{s.t.}\quad&0\leq p_{i}\leq 1,\tag{A12}\\
&\sum_{i=1}^{m}p_{i}=d.\tag{A13}
\end{align}

This is a linear programming problem for which an optimal solution is given by $p_{1}=\ldots=p_{d}=1$ and $p_{d+1}=\ldots=p_{m}=0$ (since the $\lambda_{i}$'s are provided in decreasing order). Now we choose $R^{*}=[\mathbf{I}_{d},\mathbf{0}]^{T}\in\mathbb{R}^{m\times d}$, so that

\[
r_{1}^{*}=e_{1}^{T},~r_{2}^{*}=e_{2}^{T},~\ldots,~r_{d}^{*}=e_{d}^{T}\quad\text{and}\quad r_{d+1}^{*}=\ldots=r_{m}^{*}=\mathbf{0}_{d}^{T},\tag{A14}
\]

where $e_{i}\in\mathbb{R}^{d}$ is the unit vector whose $i$-th element equals 1. We observe that for this choice $R^{*}\in\operatorname{St}(m,d)$, and that its objective value in (A10) equals the optimal value of the problem defined by (A11)-(A13). Since the latter problem is a relaxation of the former, $R^{*}$ is an optimal solution to Problem (A10). The corresponding optimal $U$ for Problem (1) is $U^{*}=VR^{*}=V_{d}$. The observation that for any $U^{\prime}=V_{d}Q$ with $Q\in\mathbb{O}(d)$ we have $\operatorname{tr}(U^{\prime T}WU^{\prime})=\operatorname{tr}(Q^{T}V_{d}^{T}WV_{d}Q)=\operatorname{tr}(V_{d}^{T}WV_{d})$ concludes the proof. ∎

Appendix B Additional Experiments

In the following we provide further evaluations of our proposed algorithm.

B.1 Step Size and Comparison to Two-Stage Approaches

In this section, on the one hand, we experimentally confirm that our approach of choosing the step size $\alpha_{t}$ (see Sec. 4.3) is valid and that in practice it is not necessary to perform line search. On the other hand, we verify that our proposed algorithm leads to results that are comparable to two-stage approaches derived from Lemma 4. Such two-stage approaches first determine a matrix $U_{0}$ whose columns span the $d$-dimensional dominant subspace $\operatorname{im}(V_{d})$ of $W$, and subsequently utilise the updates in equations (2) and (3) in order to make $U_{0}$ sparser. As explained, this corresponds to finding a matrix $Q\in\mathbb{O}(d)$ that maximises our secondary objective

\[
g(U_{0}Q)=\sum_{i=1}^{m}\sum_{j=1}^{d}(U_{0}Q)_{ij}^{p}\,.\tag{A15}
\]

We compare our proposed algorithm to two different settings of two-stage approaches:

  1. Our algorithm as stage two. Our proposed algorithm forms the second stage of a two-stage approach. To this end, in the first stage we use the Orthogonal Iteration algorithm [21] to find the matrix $V_{d}$ that spans the $d$-dimensional dominant subspace of $W$. Subsequently, in the second stage, we initialise $U_{0}\leftarrow V_{d}$ and, according to Lemma 4, make use of the updates in equations (2) and (3) in order to make $U_{t}$ iteratively sparser. We consider two variants for the second stage:

     (a) In the variant denoted Ours/2-stage we run the second-stage updates for exactly the number of iterations that the Orthogonal Iteration required in the first stage to find $V_{d}$ (with convergence threshold $\epsilon=10^{-5}$). We use the step size $\alpha_{t}$ as described in Sec. 4.3.

     (b) In the variant denoted Ours/2-stage/bt we utilise backtracking line search (as implemented in the ManOpt toolbox [14]) in order to find a suitable step size $\alpha_{t}$ in each iteration. Here, we run the algorithm until convergence w.r.t. $g$, i.e. until $g(U_{t}Q_{t})/g(U_{t+1}Q_{t+1})\geq 1-\epsilon$ for $\epsilon=10^{-5}$.

  2. Manifold optimisation as stage two. Further, we consider the trust-regions method [2] to find a (local) maximiser of (A15) in the second stage. Here, the optimisation over the Riemannian manifold $\mathbb{O}(d)$ is performed using the ManOpt toolbox [14]. For the first stage, we consider three different initialisations for finding the matrix $V_{d}$ that spans the $d$-dimensional dominant subspace of $W$: the Matlab functions eig() and eigs(), as well as our own implementation of the Orthogonal Iteration algorithm [21]. We call these methods Eig+ManOpt, Eigs+ManOpt and OrthIt+ManOpt, respectively.

Results are shown in Figs. 4 and 5 for the CMU house sequence and the synthetic dataset, respectively. We observe the following:

  • In terms of solution quality (fscore and objective), all considered methods are comparable in most cases. For the real dataset (Fig. 4) Eigs+ManOpt performs worse due to numerical reasons. For the largest considered permutation synchronisation problems (the right-most column in the synthetic data setting shown in Fig. 5) Ours leads to the best results on average.

  • In terms of runtime, overall Ours is among the fastest, considering both the real and the synthetic data experiments. In the real dataset, where $d=30$ is relatively small, Eigs+ManOpt is the fastest (but with poor solution quality), while Eig+ManOpt is the slowest (with solution quality comparable to Ours). Methods that utilise the Orthogonal Iteration have comparable runtimes in the real data experiments.

    Most notably, in the largest considered synthetic data setting (right-most column in Fig. 5) Ours is among the fastest (together with Ours/2-stage), while Ours has the largest fscore on average (as mentioned above) – this indicates that Ours is particularly well-suited for permutation synchronisation problems with increasing size.

  • Overall Ours is the simplest method, see Algorithm 1: the solution is computed in one single stage rather than in two consecutive stages, and it does not require line search, as can be seen by comparing Ours with Ours/2-stage/bt across all experiments.


Figure 4: Comparison of Ours to different two-stage approaches on permutation synchronisation problems from the CMU house sequence (see Sec. 4.3 for details). We consider the fscore (\uparrow), objective value (\uparrow), and runtime (\downarrow). The individual instances of permutation synchronisation problems vary along the horizontal axis.

Figure 5: Comparison of Ours to different two-stage approaches on synthetic permutation synchronisation problems (see Sec. 4.3 for details). Each column shows a different varying parameter. The first row shows the fscore (\uparrow) and the second row the runtime (\downarrow). Note that the right-most column shows the largest considered permutation synchronisation instances – for these Ours obtains the best fscore while being among the fastest (together with Ours/2-stage).

B.2 Comparison to Riemannian Subgradient and Evaluation of Different $p$

In Fig. 6 we compare Ours with $p=3$ and $p=4$ to the Riemannian subgradient-type method (with QR-retraction) by Li et al. [27] with an $\ell_{1}$-norm sparsity-inducing penalty. In the qualitative results (bottom) we can observe that the Riemannian subgradient-type method and Ours ($p=4$) obtain sparse solutions with few elements of large absolute value (both positive and negative), whereas Ours ($p=3$) obtains a sparse and (mostly) non-negative solution. Since for permutation synchronisation we are interested in non-negative solutions, Ours with $p=3$ thus outperforms the two alternatives quantitatively (top).


Figure 6: Comparison of Ours (with $p=3$ and $p=4$) to the Riemannian subgradient-type method by Li et al. [27]. Here, permutation synchronisation problems from the CMU house sequence (see Sec. 4.3 for details) are evaluated. Top: we consider the fscore (\uparrow), objective value (\uparrow), and runtime (\downarrow), where the individual instances of permutation synchronisation problems vary along the horizontal axis. Bottom: for each of the three methods we show the obtained $U$-matrix for $k=20$ (before projection). It can be seen that the Riemannian subgradient-type method and Ours ($p=4$) obtain sparse solutions with few elements of large absolute value (both positive and negative), whereas Ours ($p=3$) obtains a sparse and (mostly) non-negative solution.

B.3 Effect of Sparsity-Promoting Secondary Objective

In Fig. 7 we illustrate the effect of our sparsity-promoting secondary objective. It can clearly be seen that our method (right) results in a significantly sparser solution compared to the Orthogonal Iteration algorithm (left).


Figure 7: Illustration of the effect of our sparsity-promoting secondary objective $g$ for a synthetic permutation synchronisation problem ($k=5$, $d=30$, $\rho=0.9$, $\sigma=0.3$, cf. Sec. 4.3). The matrix $U$ obtained by the Orthogonal Iteration algorithm (left) is not sparse. Our method gives a sparse and mostly non-negative $U$ (right).