
Deep Extrinsic Manifold Representation for Vision Tasks

Tongtong Zhang    Xian Wei    Yuanxiang Li
Abstract

Non-Euclidean data is frequently encountered across different fields, yet there is limited literature on the fundamental challenge of training neural networks with manifold representations as outputs. We introduce an approach termed Deep Extrinsic Manifold Representation (DEMR) for visual tasks in this context. DEMR incorporates extrinsic manifold embedding into deep neural networks, which helps generate manifold representations. Rather than directly optimizing a complex geodesic loss, DEMR optimizes the computation graph within the embedded Euclidean space, allowing adaptability to various architectural requirements. We provide supporting evidence for the proposed concept on two types of manifolds, $SE(3)$ and its associated quotient manifolds, together with theoretical guarantees on feasibility, asymptotic properties, and generalization capability. The experimental results show that DEMR adapts effectively to point cloud alignment, producing outputs in $SE(3)$, as well as to illumination subspace learning with outputs on the Grassmann manifold.


1 Introduction

Data in non-Euclidean geometric spaces has applications across various domains, such as motion estimation in robotics (Byravan & Fox, 2017) and shape analysis in medical imaging (Bermudez et al., 2018; Huang et al., 2021; Yang et al., 2022). Deep learning has revolutionized these fields. However, deep neural networks (DNNs) typically generate feature vectors in Euclidean space, which is not universally suitable for certain computer vision tasks, such as estimating probability distributions for classification or rigid motion estimation. Learning problems involving manifolds can be classified according to how the manifold assumption is applied. The first category involves signal processing on a manifold structure, with the resulting output situated in Euclidean space. For example, geometric deep-learning approaches extract features from graphs, meshes, and other structures; the encoded features are then fed to decoders for tasks such as classification and regression in Euclidean space (Bronstein et al., 2017; Cao et al., 2020; Can et al., 2021; Bronstein et al., 2021), or they serve as latent codes for generative models (Ni et al., 2021). The second category establishes continuous mappings between data residing on the same manifold to enable regression. For instance, (Steinke et al., 2010) addresses regression between manifolds through a regularization functional, while (Fang et al., 2023) performs statistical analysis over deep neural network-based mappings between manifolds. The third category focuses on deep learning models with distinct Euclidean inputs and manifold outputs. This line of research often emphasizes a specific type of manifold, such as deep rotation manifold regression (Zhou et al., 2019; Levinson et al., 2020; Chen et al., 2021).

(a) Intrinsic manifold regression via energy minimization (Boumal, 2020)
(b) Intrinsic manifold regression from the tangent bundle (Zhang, 2020)
(c) Extrinsic manifold regression (Lin et al., 2017; Lee, 2021)
Figure 1: Manifold regression explores the relationship between a manifold-valued variable and a value in a vector space. A typical intrinsic manifold regression finds the best-fitted geodesic curve $\gamma$ on $\mathcal{M}$ via (a) minimizing a complex energy function of distance and smoothness, or (b) updating parameters in the local tangent bundle $T\mathcal{M}$. Extrinsic manifold regression (c) models the relationship in the extrinsically embedded space.

This paper centers on the third category, which entails generating manifold representations from DNNs. It is worth noting that models producing outputs on a manifold are typically regularized using geometric metrics, which fall into two types: intrinsic manifold loss and extrinsic manifold loss (Bhattacharya & Patrangenaru, 2003; Bhattacharya et al., 2012), as depicted in Figure 1. Intrinsic methods aim to identify the geodesic that best fits the data so as to preserve the geometric structure (Fletcher, 2011, 2013; Cornea et al., 2017; Shin & Oh, 2022). However, the inherent characteristics of intrinsic distances pose challenges for DNN architectures. First, many intrinsic losses incorporate intricate geodesic distances, which induce longer gradient flows through the entire computation graph (Fletcher, 2011; Hinkle et al., 2012; Fletcher, 2013; Shi et al., 2009; Berkels et al., 2013). Second, directly fitting a geodesic by minimizing distance and smoothness energy in Euclidean space may yield off-manifold points (Chen et al., 2021; Khayatkhoei et al., 2018).

In contrast, extrinsic regression uses embeddings in a higher-dimensional Euclidean space to create a non-parametric proxy estimator. The estimate on the manifold $\mathcal{M}$ can then be obtained using $J^{-1}$, the inverse of the extrinsic embedding $J$. Extensive investigations in (Bhattacharya et al., 2012; Lin et al., 2017) have established that extrinsic regression offers superior computational benefits compared to intrinsic regression. Many regression models are customized for specific applications, utilizing exclusive information to simplify model formulations that include explicit explanatory variables; this customization is evident in applications such as shapes on shape-space manifolds (Berkels et al., 2013; Fletcher, 2011).

However, within the computer vision community, deep neural networks often face large amounts of diverse multimedia data. Traditional manifold regression models struggle with such varied modeling tasks due to their limited representational power. Some recent works have addressed this challenge, such as (Fang et al., 2023), which processes manifold inputs and provides empirical evidence. This paper presents the idea of embedding manifolds extrinsically at the final regression layer of various neural networks, including ResNet and PointNet; we call this idea Deep Extrinsic Manifold Representation (DEMR). The approach bridges the gap between traditional extrinsic manifold regression and neural networks in computer vision from two perspectives. Firstly, the conventional choice of proxy estimator, often represented by kernel functions, is substituted with DNN feature extractors. Feature extractors such as ResNet or PointNet, renowned for their efficacy in specific tasks, significantly elevate the representational power of feature extraction.

Secondly, to project the neural network output onto the preimage of $J(\cdot)$, we depart from the deterministic projection methods employed in traditional extrinsic manifold regression. Instead, we opt for a learnable linear layer commonly found in DNN settings. This learnable projection module aligns seamlessly with most DNN architectures and eliminates the need for the manual design of the projection function $Pr(\cdot)$, a step typically required in prior extrinsic manifold regression models to match the type of manifold. These choices not only enhance the model's representational power compared to traditional regression models but also preserve existing neural network architectures.

Contribution

We facilitate the generation of manifold outputs from standard DNN architectures through extrinsic manifold embedding. In particular, we elucidate why pose regression tasks perform more effectively when treated as a specialized instance of DEMR. Additionally, we offer theoretical substantiation regarding the feasibility, asymptotic properties, and generalization ability of DEMR for $SE(3)$ and the Grassmann manifold. Finally, the efficacy of DEMR is validated through its application to two classic computer vision tasks: relative point cloud transformation estimation on $SE(3)$ and illumination subspace estimation on the Grassmann manifold.

2 DEMR

2.1 Problem Formulation

Estimation in the embedded space

For a distribution $\mathcal{Q}$ on a manifold $\mathcal{M}$ of dimension $d$, the extrinsic embedding $\tilde{\mathcal{M}}=J(\mathcal{M})$ from $\mathcal{M}$ to the Euclidean space $\mathbb{E}=\mathbb{R}^{N}$ carries the distribution $\tilde{\mathcal{Q}}=\mathcal{Q}\circ J$ and is a closed subset of $\mathbb{R}^{N}$, where $d\ll N$. In extrinsic manifold regression, for every $u\in\mathbb{E}$ there exists a compact projection set $Pr(u)=\{x\in\tilde{\mathcal{M}}:\|x-u\|\leq\|y-u\|,\forall y\in\tilde{\mathcal{M}}\}$, mapping $u$ to the closest point on $\tilde{\mathcal{M}}$. The extrinsic mean set of $\mathcal{Q}$ is $\mu^{ext}=J^{-1}(Pr(\mu))$, where $\mu$ is the mean set of $\tilde{\mathcal{Q}}$. In DEMR, $\mu$ is produced by the neural network, and the estimate on $\mathcal{M}$ is then computed deterministically.

Figure 2: DEMR pipeline, with black arrows indicating the forward process and optimization shown in the red box.
Pipeline design

The pipeline of DEMR is shown in Figure 2. For an input $x\in\mathcal{X}$ with corresponding ground truth $y_{gt}\in\mathcal{M}$, the feedforward process contains two steps: first, the deep estimate $\hat{y}^{E}$ is produced in the embedded space $\mathbb{R}^{N}$; then the output manifold representation is $\hat{y}=J^{-1}(Pr(\hat{y}^{E}))$, where $J^{-1}$ is the inverse of the extrinsic embedding $J$. Since $\hat{y}^{E}\in\mathbb{R}^{N}$, where $\mathbb{R}^{N}$ denotes the real-valued vector space of dimension $N$, $Pr$ can be dropped within the pipeline. The training loss is computed between $\hat{y}^{E}$ and the extrinsic embedding $y^{E}_{gt}=J(y_{gt})$. Hence the gradient used in backpropagation is computed within $\mathbb{R}^{N}$, leaving the original DNN architecture unchanged. Moreover, this implies that DEMR can be applied directly to most existing DNN architectures by simply adjusting the dimensionality of the final output layer.
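To make the forward and training procedure concrete, the following is a minimal PyTorch-style sketch of the pipeline in Figure 2, where feature_extractor, linear_head, J, and J_inv are generic placeholders for the backbone, the final linear layer, and the embedding pair; the names are illustrative, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def demr_train_step(feature_extractor, linear_head, J, optimizer, x, y_gt):
    """One DEMR training step: the loss is a plain MSE in the embedded space R^N."""
    y_hat_E = linear_head(feature_extractor(x))   # deep estimate \hat{y}^E in R^N
    y_gt_E = J(y_gt)                              # extrinsic embedding of the ground truth
    loss = F.mse_loss(y_hat_E, y_gt_E)
    optimizer.zero_grad()
    loss.backward()                               # gradients stay in R^N; the DNN is unchanged
    optimizer.step()
    return loss.item()

@torch.no_grad()
def demr_predict(feature_extractor, linear_head, J_inv, x):
    """Inference: map the Euclidean output back onto the manifold with the deterministic J^{-1}."""
    return J_inv(linear_head(feature_extractor(x)))
```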

Similar to the population regression function in extrinsic regression, the estimator in the embedded space, $F_{NN}(x)$, i.e., the neural network acting in the extrinsic embedded space $\mathbb{R}^{N}$, aims to attain the conditional Fréchet mean if it exists: $F_{NN}(x)=\arg\min_{m\in\mathcal{M}}\int_{\mathcal{M}}L_{extr}^{2}(m,y)\,\tilde{\mathcal{Q}}(\mathrm{d}y|x)=\arg\min_{m\in\mathcal{M}}\int_{\mathcal{M}}\|J(m)-J(y)\|^{2}\,\tilde{\mathcal{Q}}(\mathrm{d}y|x)$, where $L_{extr}$ is the extrinsic distance, $\tilde{\mathcal{Q}}(\mathrm{d}y|x)$ denotes the conditional distribution of $Y$ given $X=x$, and $\mathcal{Q}(\cdot|x)=\tilde{\mathcal{Q}}(\cdot|x)\circ J^{-1}$ is the conditional probability measure on $\mathcal{M}$. Therefore, in both the training and evaluation phases of DEMR, the computational burden is shared by the deterministic conversion $J^{-1}$, which requires no gradient computation.

Reformulation of neural network for images

The input data sample $\mathcal{I}$ is fed to the differentiable feature extractor $T(\cdot)$, which can be composed of various modules, such as ResNet, PointNet, etc., according to the input:

$\hat{y}^{E}=F_{NN}(\mathcal{I})=b+P\cdot T(\mathcal{I})=b+P\cdot(c_{1}^{\mathcal{I}},\ldots,c_{n}^{\mathcal{I}})^{\top}\triangleq b+P\cdot C_{fm}\triangleq F_{L}(C_{fm})$ (1)

where $\cdot$ denotes matrix multiplication, $b$ is the bias, and $C_{fm}$ is the matrix of feature maps from $T(\cdot)$, decomposed into column vectors $c_{k}^{\mathcal{I}}$. Therefore, the DNN in DEMR serves as the composite mapping from the raw input $\mathcal{I}$ to the preimage of $J^{-1}$.

Projection $Pr$ onto the preimage of $J$

For an extrinsic embedding function $J(\cdot)$, a pivotal issue remains: the estimate given above might not lie in the preimage $\mathcal{IM}$ of $J(\cdot)$.

Since $\hat{y}_{i}\in\mathbb{R}^{N}$, and $T(\mathcal{I})$ and $\mathcal{IM}$ are both Euclidean, the projection between them can be represented as a linear transform $Pr(\cdot)$ in matrix form. Rather than the deterministic linear projection of extrinsic manifold regression (Lin et al., 2017; Lee, 2021), DEMR adopts a learnable projection realized by linear layers within the deep framework, i.e., $F_{L}(\cdot)$ in Equation 1; the output of the DNN is then $Pr(F_{NN}(\mathcal{I}))\in\mathbb{R}^{N}$, and $J^{-1}(Pr(C_{fm}))\in\mathcal{M}$. Therefore, the final output of DEMR on the manifold is

$\hat{y}=J^{-1}(\hat{y}^{E})=J^{-1}(Pr(C_{fm}))=J^{-1}(F_{L}(C_{fm}))=J^{-1}\big(\arg\min_{q\in\tilde{\mathcal{M}}}\|q-\hat{F}_{NN}(\mathcal{I})\|^{2}\big)$
DIMR

In contrast, the architecture adopted in (Lohit & Turaga, 2017) uses a geodesic loss to train the neural network. The intrinsic geodesic distance is $d_{intr}(y_{gt},\hat{y})=\|Log_{y_{gt}}\hat{y}\|$ in Figure 3, where $Log$ is the logarithmic map of $\mathcal{M}$. We call this setting Deep Intrinsic Manifold Representation (DIMR) for convenience; its model parameter set $\Theta_{NN}$ is updated with the gradients of $L_{intr}$.

Figure 3: DIMR pipeline with geodesic loss on $\mathcal{M}$, with the black arrow indicating the forward process.

2.2 The extrinsic embedding $J$

The embedding $J$ is designed to preserve geometric properties to a great extent, which can be specified as equivariance. $J$ is an equivariant embedding if there is a group homomorphism $\phi:G\rightarrow GL(D,\mathbb{R})$ from $G$ to the general linear group $GL(D,\mathbb{R})$ of degree $D$ such that $J(gq)=\phi(g)J(q)$ for all $g\in G$ and $q\in\mathcal{M}$. The choice of $J$ is not unique, and the choices below are equivariant embeddings. For the orthogonal group embedding $J_{O}$, consider $R_{1},R_{2}\in O(n)$ with singular value decompositions $J_{O}(R_{1})=U_{1}\Sigma_{1}V_{1}^{\top}$ and $J_{O}(R_{2})=U_{2}\Sigma_{2}V_{2}^{\top}$, and let $J_{O}(R_{1}R_{2})\triangleq K=U_{1}V_{1}^{\top}\Sigma_{K}U_{2}V_{2}^{\top}$. Then $R_{1}R_{2}=U_{1}V_{1}^{\top}U_{2}V_{2}^{\top}$, and with $\phi(R_{1})=R_{1}\Sigma_{K}U_{2}\Sigma_{2}^{-1}U_{2}^{\top}$ we obtain $J_{O}(R_{1}R_{2})=\phi(R_{1})U_{2}\Sigma_{2}V_{2}^{\top}=\phi(R_{1})J_{O}(R_{2})$.

The proof for the Grassmannian embedding $J_{\mathcal{G}}$ follows the same idea as the proof for $O(n)$, since both are based on matrix decompositions.

2.2.1 Matrix Lie Group

9D: SVD of rank 9

An intuitive embedding choice for a matrix Lie group is to parameterize each matrix entry. For the orthogonal group $O(n)$ of dimension $n$, the natural embeddings are $J_{O}:O(n)\rightarrow\mathbb{R}^{n^{2}}$ and $J_{SO}:SO(n)\rightarrow\mathbb{R}^{n^{2}}$. The inverse $J_{O}^{-1}$ aims to produce $n$ orthogonal vectors; Gram-Schmidt orthogonalization (Zhou et al., 2019) and its variations are common ways to reparameterize orthogonal vectors. Singular Value Decomposition (SVD) is another convenient way to produce an orthogonal matrix (Levinson et al., 2020):

$\mu^{ext}=\hat{F}_{NN}(x)=\begin{cases}UV^{\top},&\det(\mu)>0\\ UHV^{\top},&\text{otherwise}\end{cases}$ (2)

where $U$ and $V$ come from the singular value decomposition $\mu=UDV^{\top}$ with the singular values arranged in descending order, and $H$ is the identity matrix $I_{n}$ with its last entry replaced by $-1$. Specifically for $SO(3)$, $J^{-1}_{SO(3)}$ can also be obtained via the cross product.
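As a concrete illustration, below is a minimal NumPy sketch of the SVD-based inverse embedding in Equation (2), in the style of (Levinson et al., 2020); the function name is ours.

```python
import numpy as np

def j_inv_so3_svd(x9):
    """Map a 9-dimensional Euclidean output to SO(3) via SVD (Equation 2)."""
    m = np.asarray(x9, dtype=float).reshape(3, 3)
    U, _, Vt = np.linalg.svd(m)                  # m = U diag(s) Vt, singular values descending
    H = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # flip last axis if det < 0
    return U @ H @ Vt                            # orthogonal matrix with det = +1
```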

As for the special Euclidean group, the group of isometries of Euclidean space, i.e., the rigid body transformations preserving Euclidean distance: a rigid body transformation can be given by a pair $(A,t)$ of an affine transformation matrix $A$ and a translation $t$, or it can be written in matrix form of size $n+1$. The special Euclidean group can be regarded as the semidirect product of the rotation group $SO(n)$ and the translation group $T(n)$, i.e., $SE(n)=SO(n)\ltimes T(n)$.

6D: cross product

Specifically for orthogonal matrices in $SO(3)$, the invertible embedding $J$ can be realized more conveniently via the cross-product operation $\times$. For a 6-dimensional Euclidean vector $x=[x_{a},x_{b}]$ as the network output, where $x$ is the concatenation of $x_{a}$ and $x_{b}$, let $x_{c}=x_{a}\times x_{b}$; then $J^{-1}(x)=[x_{a}^{\top},x_{b}^{\top},x_{c}^{\top}]$.
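A minimal NumPy sketch of the 6D inverse embedding is given below. Note that, as in the 6D representation of (Zhou et al., 2019), the two 3-vectors are orthonormalized by Gram-Schmidt before the cross product is taken, so that the resulting matrix is a valid rotation; the function name is ours.

```python
import numpy as np

def j_inv_so3_6d(x6):
    """Map a 6-dimensional Euclidean output [x_a, x_b] to SO(3) via the cross product."""
    x = np.asarray(x6, dtype=float)
    a, b = x[:3], x[3:]
    b1 = a / np.linalg.norm(a)                   # first column: normalized x_a
    b2 = b - np.dot(b1, b) * b1                  # Gram-Schmidt: remove the component along b1
    b2 = b2 / np.linalg.norm(b2)
    b3 = np.cross(b1, b2)                        # x_c = x_a x x_b after orthonormalization
    return np.stack([b1, b2, b3], axis=1)        # columns form an orthonormal, right-handed frame
```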

2.2.2 The Quotient Manifold of Lie Group

The real Grassmann manifold $\mathcal{G}(m,\mathbb{R}^{n})$ parameterizes all $m$-dimensional linear subspaces of $\mathbb{R}^{n}$ and can also be defined as a quotient manifold: $\mathcal{G}(m,\mathbb{R}^{n})=O(n)/(O(m)\times O(n-m))=SO(n)/(SO(m)\times SO(n-m))=\mathcal{V}(m,n)/SO(m)$. Since it is a quotient space and we care about the subspace itself rather than a particular basis, we can convert the problem of embedding $Y\in\mathcal{G}(m,\mathbb{R}^{n})$ into finding mappings for $YY^{\top}\in SPD^{++}_{m}$, whose basis can be recovered by diagonal decomposition, referred to as $\mathtt{DD}$. In fact, $\mathtt{DD}$ is a special case of $\mathtt{SVD}$ for a symmetric matrix. Thus, for a distribution $\mathcal{Q}$ defined on $\mathcal{G}(m,\mathbb{R}^{n})$, given $\mu=\hat{F}(x)\in\mathbb{R}^{n\times m}$ and its diagonal decomposition ($\mathtt{DD}$) $\mu=U\Sigma U^{\top}$, the inverse embedding $J^{-1}_{\mathcal{G}}$ is given by $\mu^{ext}=J^{-1}_{\mathcal{G}}(\mu)=USU^{\top}$, and the first $m$ column vectors of $U$ constitute the subspace.
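The following is a minimal NumPy sketch of $J^{-1}_{\mathcal{G}}$ as described above, assuming the network output has been reshaped to a square matrix that is symmetrized before the diagonal decomposition; the function name and shapes are illustrative.

```python
import numpy as np

def j_inv_grassmann(mu, m):
    """Recover an m-dimensional subspace of R^n from a Euclidean estimate mu (n x n)."""
    mu = np.asarray(mu, dtype=float)
    sym = 0.5 * (mu + mu.T)                      # symmetrize so the DD (eigendecomposition) is real
    eigvals, U = np.linalg.eigh(sym)             # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]            # reorder to descending
    basis = U[:, order[:m]]                      # top-m eigenvectors span the estimated subspace
    return basis, basis @ basis.T                # basis and the extrinsic representative UU^T
```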

2.3 DEMR as a generalization of previous research

DNN with Euclidean output

When the output space is a vector space, $J(\cdot)$ becomes an identity mapping. Thus, a DNN with Euclidean output can be treated as a degenerate form of DEMR. The output of $F_{NN}$ is linearly transformed from the subspace spanned by the basis $C_{fm}$, causing failure for newly extracted features $c^{\prime}\notin C_{fm}$. This corresponds to the failure case in (Zhou et al., 2019) when the dimension of the last layer equals the output dimension.

Absolute/relative pose regression

From Equation 1, the estimate $F_{NN}(\mathcal{I})$ is linearly transformed from the subspace spanned by $c_{k}^{\mathcal{I}}$. Hence, in typical DNN-based pose regression tasks, the predicted pose $\hat{y}$ can be seen as a linear combination of feature maps extracted from input poses, resulting in a failure to extrapolate to unseen poses in the test set. This is a common problem because training samples are often drawn from limited poses. (Zhou et al., 2019) suggests that it is the discontinuity in the output space that incurs poor generalization. Indeed, a better assumption for APR is that the pose estimate lies on $SE(3)$, composed of a rotation and a translation. The manifold assumption renders the estimator more capable of continuous interpolation and extrapolation from the input poses, because the continuity and symmetry of $SE(3)$ greatly benefit the deep learning task. The detailed analysis is in Section 2.2.1 and the validation in Section 4.1. Thereby, APR on $SE(3)$ can be regarded as a particular case solved by DEMR; (Sattler et al., 2019; Zhou et al., 2019) revealed parts of the idea behind DEMR.

3 Analysis

In line with the experimental setup, the analysis is performed on the special orthogonal group, its quotient space, and the special Euclidean group. (All proofs are included in the Appendix.)

3.1 Feasibility of DEMR

Before conducting optimization in the extrinsic embedding space $\mathbb{R}^{N}$, the primary concern is whether the geometry of the extrinsic embedded space properly reflects the intrinsic geometry of $\mathcal{M}$.

The extrinsic embedding $J(\cdot)$ is a diffeomorphism and thus preserves geometric continuity, which is advantageous for extrinsic embeddings. Since we want to observe conformance between $L_{extr}$ and $L_{intr}$, it is natural to bridge the two distances via smoothness. First, we need the diffeomorphism between manifolds to be bilipschitz.

Lemma 3.1.

Suppose that $\mathcal{M}_{1},\mathcal{M}_{2}$ are smooth and compact Riemannian manifolds and $f:\mathcal{M}_{1}\rightarrow\mathcal{M}_{2}$ is a diffeomorphism. Then $f$ is bilipschitz w.r.t. the Riemannian distance.

Then Proposition 3.2 shows the conformity of the extrinsic and intrinsic distances, which enables an intrinsic loss to be represented indirectly in Euclidean space.

Proposition 3.2.

For a smooth embedding $J:\mathcal{M}\rightarrow\mathbb{R}^{m}$, where the $n$-manifold $\mathcal{M}$ is compact with metric $\rho$ and the metric of $\mathbb{R}^{m}$ is denoted by $d$: for any two sequences $\{x_{k}\},\{y_{k}\}$ of points in $\mathcal{M}$ and their images $\{J(x_{k})\},\{J(y_{k})\}$, if $\lim_{n\rightarrow\infty}d(J(x_{n}),J(y_{n}))=0$, then $\lim_{n\rightarrow\infty}\rho(x_{n},y_{n})=0$.

3.2 Asymptotic MLE

In this part, we demonstrate that DEMR for $SO(3)$ is an approximate maximum likelihood estimator (MLE) of the $SO(3)$ response, and that for the Grassmann manifold $\mathcal{G}(m,\mathbb{R}^{n})$ DEMR is the MLE of the Grassmann response.

Note that we adopt a new error model in conformity with DEMR. In previous work such as (Levinson et al., 2020), the noise matrix $N$ is assumed to have random entries $n_{ij}\sim N(0,\sigma)$, which is not reasonable, because $N$ also lies on $\mathcal{M}$ and there is innate structure among the entries of $N$. Here we assume $N\in\mathcal{M}$, so the probability $Q$ on $\mathcal{M}$ must be established first.

3.2.1 Lie group for transformations

As (Bourmaud et al., 2015) suggested, we consider connected, unimodular matrix Lie groups, which include the categories most frequently used in computer vision: $SE(3)$, $SO(3)$, $SL(3)$, etc. Since $SO(3)$ is a degenerate case of $SE(3)$ without translation, here we consider the concentrated Gaussian distribution on $SE(3)$. The probability density function (pdf) takes the form $P(x;\Sigma)=\alpha\exp\big(-\frac{1}{2}[\log_{\mathcal{M}}(x)]_{\mathcal{M}}^{\vee\top}\Sigma^{-1}[\log_{\mathcal{M}}(x)]_{\mathcal{M}}^{\vee}\big)$, where $\alpha$ is the normalizing factor, $x\in SO(3)$, and the covariance matrix $\Sigma$ is positive definite.

The maps $[\cdot]^{\wedge}$ and $[\cdot]^{\vee}$ are linear isomorphisms, rearranging Euclidean representations into antisymmetric matrices and back: $[\cdot]^{\wedge}:\mathbb{R}^{3}\rightarrow so(3),\ \theta\mapsto[\theta]_{\times}$ and $[\cdot]^{\vee}:so(3)\rightarrow\mathbb{R}^{3},\ [\theta]_{\times}\mapsto\theta$, where $[\cdot]_{\times}$ denotes the antisymmetric matrix form of a vector. For a vector $\theta=[\theta_{1},\theta_{2},\theta_{3}]^{\top}$, we have $\theta^{\wedge}=[\theta]_{\times}$ and $[\theta]_{\times}^{\vee}=\theta$.
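For reference, a minimal NumPy sketch of the $[\cdot]^{\wedge}$ and $[\cdot]^{\vee}$ maps on $so(3)$ is given below; the function names are ours.

```python
import numpy as np

def hat(theta):
    """[.]^: R^3 -> so(3), the antisymmetric matrix [theta]_x."""
    t1, t2, t3 = theta
    return np.array([[0.0, -t3,  t2],
                     [ t3, 0.0, -t1],
                     [-t2,  t1, 0.0]])

def vee(Theta):
    """[.]^v: so(3) -> R^3, the inverse of hat."""
    return np.array([Theta[2, 1], Theta[0, 2], Theta[1, 0]])
```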

Proposition 3.3.

$J^{-1}_{SO}:\mathbb{R}^{9}\rightarrow SO(3)$ gives an approximation of the MLE of rotations on $SO(3)$, assuming $\Sigma=\sigma I$, where $I$ is the identity matrix and $\sigma$ an arbitrary real value.

DEMR on the special Euclidean group $SE(3)$ also approximately provides the maximum likelihood estimate, and the proof shares similar ideas with that for $SO(3)$.

Proposition 3.4.

$J^{-1}_{SE}:\mathbb{R}^{9}\rightarrow SE(3)$, where the rotation part comes from $\mathtt{SVD}$ (see Supplementary Material), gives an approximation of the MLE of transformations on $SE(3)$, assuming $\Sigma=\sigma I$, where $I$ is the identity matrix and $\sigma$ an arbitrary real value.

Proof.

For $x\sim\mathcal{N}_{\mathcal{G}}(\mu,\Sigma)$, we have $x=\mu\exp_{\mathcal{M}}([N]^{\vee}_{\mathcal{M}})$, and the simplified log-likelihood function is

$L(Y;F_{\Theta_{NN}},\Sigma)=[\log_{\mathcal{M}}(F_{\Theta_{NN}}-Y)]_{\mathcal{M}}^{\vee\top}\Sigma^{-1}[\log_{\mathcal{M}}(F_{\Theta_{NN}}-Y)]_{\mathcal{M}}^{\vee}$

with $[\log_{\mathcal{M}}(F_{\Theta_{NN}}-Y)]_{\mathcal{M}}^{\vee}$ set to $\epsilon$; if the pdf is concentrated around the group identity, i.e., the fluctuation of $\epsilon$ is small, the distribution of the noise $\epsilon$ can be approximated by $\mathcal{N}_{\mathbb{R}^{3}}(\mathbf{0}_{3\times 1},\Sigma)$ on $\mathbb{R}^{3}$. Then $\arg\max_{Y\in SO(3)}L(Y;F_{\Theta_{NN}},\Sigma)\approx\arg\min_{Y\in SO(3)}(F_{\Theta_{NN}}-Y)^{\top}(F_{\Theta_{NN}}-Y)=\arg\min_{Y\in SO(3)}\|F_{\Theta_{NN}}-Y\|_{F}^{2}$. ∎

3.2.2 Grassmann for subspaces

One manifold version of the Gaussian distribution on the Grassmann and Stiefel manifolds is the Matrix Angular Central Gaussian (MACG). However, the matrix representation of linear subspaces should preserve the consistency of eigenvalues across permutations and sign flips; thus we resort to the symmetric $UU^{\top}$ on the Symmetric Positive Definite (SPD) manifold for $U\in\mathcal{G}(m,\mathbb{R}^{N})$.

Similarly, the error model for the manifold output is modified to $F_{\Theta_{NN}}=YY^{\top}+N$, where both $YY^{\top}$ and $N$ are semi-positive definite matrices lying on the SPD manifold. The Gaussian distribution extended to the SPD manifold, for a random 2nd-order tensor $X\in\mathbb{R}^{n\times n}$, is

$P(X;M,\mathcal{S})=\frac{1}{\sqrt{(2\pi)^{n}}|\mathcal{S}|}\exp\big(-\frac{1}{2}(X-M)^{\top}\mathcal{S}^{-1}(X-M)\big)$ (3)

with mean $M\in\mathbb{R}^{m\times m}$ and 4th-order covariance tensor $\mathcal{S}\in\mathbb{R}^{m\times m\times m\times m}$ inheriting symmetries from three dimensions. For the inverse of the Grassmann manifold extrinsic embedding, $J_{\mathcal{G}}^{-1}$, DEMR provides the MLE; the proofs are given in the Supplementary Material.

Proposition 3.5.

DEMR with $J^{-1}_{\mathcal{G}}$ gives the MLE of an element on $\mathcal{G}(m,\mathbb{R}^{n})$, assuming $\Sigma$ is an identity matrix.

3.3 Generalization Ability

3.3.1 Failure of DNN with Euclidean output space

In light of the analysis in Section 2.3, the output space is produced by a linear transformation of the convolutional feature space spanned by the extracted features. The feature maps extracted by the neural network, organized as matrices in Equation (1), form the basis. Denote the linear subspace spanned by the feature-map basis $\{c_{1}^{\mathcal{I}},\ldots,c_{n}^{\mathcal{I}}\}$ by $\mathbb{R}_{feature}$ and let $\mathbb{R}_{output}$ be the low-dimensional output space spanned by $\{o_{1},\ldots,o_{n_{o}}\}$, where $n_{o}$ denotes the output dimension; then $F_{NN}(\mathcal{I})\in\mathbb{R}_{output}$. For a new test input $\mathcal{I}^{\prime}$ whose extracted feature map belongs to the complementary space of $\mathbb{R}_{feature}$, we have $F_{NN}(\mathcal{I}^{\prime})\notin\mathbb{R}_{output}$. This accounts for the failure of some DNN models with Euclidean output.

3.3.2 Representation power of structured output space

This section studies the enhancement of the representational power of DEMR when the output space is endowed with geometric structure. As a linear action, a representation of a Lie group is a smooth group homomorphism $\Pi:G\rightarrow GL(V)$ on the $n$-dimensional vector space $V$, where $GL(V)$ is the general linear group of all invertible linear transformations.

Proposition 3.6.

Any element of dimension $n$ on $SO(n)$ belongs to the image of $J^{-1}_{SO(n)}$ from known rotations within a certain range, if the Euclidean input of $J^{-1}_{SO(n)}$ has more than $n$ dimensions.

Corollary 3.7.

Any element of dimension $n$ on $SE(n)$ belongs to the image of $J^{-1}_{SE(n)}$ from known rotations within a certain range, if the Euclidean input of $J^{-1}_{SE(n)}$ has more than $n$ dimensions.

Then, for a linear representation of a Lie group in matrix form, given a set of basis vectors of $V$, we show that the output of DEMR extrapolates beyond the input samples better than common deep learning settings with unstructured output, which resolves the problem raised in (Sattler et al., 2019).

4 Experiments

In this section, we demonstrate the effectiveness of applying extrinsic embedding in deep learning settings on two manifolds representative of computer vision. The experiments address the following questions:

  • Does DEMR improve performance on the given tasks?

  • Does the geometric structure boost model performance on unseen cases, e.g., the ability to extrapolate beyond the training set?

  • Does extrinsic embedding yield valid geometric restrictions?

The validations are conducted on two canonical manifold applications in computer vision.

4.1 Task I: affine motions on $SE(3)$

Estimating the relative position and rotation between two point clouds has a wide range of downstream applications. The reference and target point clouds $P_{R},P_{t}\in\mathbb{R}^{N\times 3}$ have the same size and shape, with no scale transformation. The relative translation and rotation can be arranged separately as vectors or together in a matrix lying on $SE(3)$.

4.1.1 Experimental Setup

Training Detail

During training, at each iteration, a randomly chosen point cloud from the 2,290 airplanes is transformed by rotations and translations randomly sampled in batches. The rotations are sampled according to the model, i.e., models producing axis-angles are fed rotations sampled as axis-angles, and so on.

Comparison metrics

To validate the model's ability to preserve geometric structure, geodesic distance is the metric of choice at the testing stage. (Zhou et al., 2019; Levinson et al., 2020) use the minimal angular difference to evaluate the difference between rotations, which is not entirely compatible with Euclidean groups, since it evaluates the translation part and the rotation part separately. For two rotations $R_{1},R_{2}$ and $R^{\prime}=R_{1}R_{2}^{-1}$ with trace $tr(R^{\prime})$, the angular error is $L_{angle}=\cos^{-1}((tr(R^{\prime})-1)/2)$. For the intrinsic metric, the geodesic distance between two group elements is defined with the Frobenius norm for matrices, $d_{int}(M_{1},M_{2})=\|\log(M_{1}^{-1}M_{2})\|$, where $\log(M_{1}^{-1}M_{2})$ denotes the logarithm map on the Lie group. For the extrinsic metric, we take the mean squared error (MSE) between $J_{SE}(M_{1})$ and $J_{SE}(M_{2})$.

Table 1: On the task of estimating relative affine transformations on SE(3)SE(3) for point clouds, the results are reported by the geodesic distance between the estimation and the ground truth. The best results across models are emphasized, where the smallest errors come from extrinsic embedding.
Mode avg median std
euler 26.83 15.54 32.23
axis 31.96 10.66 27.75
6D 10.28 5.36 21.54
9D 10.23 6.09 25.30

For models with Euler-angle and axis-angle output, the predicted rotation and translation are produced in Euclidean form and then reorganized into the Lie algebra $se(3)$. Finally, the distances are calculated on $SO(3)$ by applying the exponential map to the $se(3)$ forms. (A minimal sketch of the evaluation metrics is given below.)
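The following NumPy/SciPy sketch illustrates the metrics described above (angular difference, intrinsic geodesic distance via the matrix logarithm, and extrinsic MSE between embedded matrices); the function names are ours.

```python
import numpy as np
from scipy.linalg import logm

def angular_error(R1, R2):
    """Minimal angular difference L_angle = arccos((tr(R1 R2^{-1}) - 1) / 2), in radians."""
    cos_val = (np.trace(R1 @ R2.T) - 1.0) / 2.0
    return np.arccos(np.clip(cos_val, -1.0, 1.0))

def geodesic_distance(M1, M2):
    """Intrinsic metric d_int(M1, M2) = ||log(M1^{-1} M2)||_F on the matrix Lie group."""
    L = logm(np.linalg.inv(M1) @ M2)
    return float(np.linalg.norm(L, 'fro'))

def extrinsic_mse(M1, M2):
    """Extrinsic metric: MSE between the flattened embeddings J_SE(M1) and J_SE(M2)."""
    return float(np.mean((np.ravel(M1) - np.ravel(M2)) ** 2))
```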

Architecture detail

Similar to (Zhou et al., 2019; Levinson et al., 2020), the backbone for point cloud feature extraction is a weight-sharing Siamese architecture containing two simplified PointNet structures (Qi et al., 2017), $\Phi_{i}:\mathbb{R}^{N\times 3}\rightarrow\mathbb{R}^{1024}$, $i=1,2$. After extracting the feature of each point with an MLP, both $\Phi_{1}$ and $\Phi_{2}$ use max pooling to produce single vectors $z_{1},z_{2}$ respectively, as representations of the features across all points in $P_{r},P_{t}$. Concatenating $z_{1},z_{2}$ into $z$, another MLP maps the concatenated feature to the final higher-order Euclidean representation $Y_{\mathbb{E}}$. The Euler-angle, axis-angle, and quaternion representations are obtained directly from the last linear layer of the backbone, with output dimensions 3, 3, and 4 respectively. For manifold output, the $SE(3)$ representation is given by both the cross product and SVD, with details in the Supplementary Material. The translation part is produced by another $\mathtt{fc}$ layer of 3 dimensions. Because the MSE between two isometric matrices is computed entrywise, the total MSE loss is the sum of the rotation loss and the translation loss.
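A minimal PyTorch sketch of this backbone is given below; the layer widths and class names are illustrative choices rather than the exact configuration used in the experiments.

```python
import torch
import torch.nn as nn

class SimplePointNet(nn.Module):
    """Per-point MLP followed by max pooling: (B, N, 3) -> (B, 1024)."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                                 nn.Linear(64, 256), nn.ReLU(),
                                 nn.Linear(256, 1024))

    def forward(self, pts):
        return self.mlp(pts).max(dim=1).values

class RelativePoseNet(nn.Module):
    """Weight-sharing Siamese backbone; the head outputs the embedded rotation and translation."""
    def __init__(self, rot_dim=9, trans_dim=3):
        super().__init__()
        self.rot_dim = rot_dim
        self.phi = SimplePointNet()                 # shared between reference and target clouds
        self.head = nn.Sequential(nn.Linear(2048, 512), nn.ReLU(),
                                  nn.Linear(512, rot_dim + trans_dim))

    def forward(self, p_ref, p_tgt):
        z = torch.cat([self.phi(p_ref), self.phi(p_tgt)], dim=1)
        out = self.head(z)
        return out[:, :self.rot_dim], out[:, self.rot_dim:]  # e.g. 9D rotation embedding, 3D translation
```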

Estimation accuracy and conformance

To validate the necessity of a structured output space, we first compare the affine-transformation estimation network with different output formats. The translation part naturally takes the form of a vector, while the rotation part can be given in Euclidean representations or as $SO(3)$ in matrix form. The dataset is composed of generated point cloud pairs with random transformations, where the rotations are uniformly sampled from the various representations, namely Euler angles and axis-angles in 3-dimensional vector space, and $SO(3)$. The translation part is randomly sampled from a standard normal distribution.

The advantage of the manifold output produced by baseline DEMR is revealed in Table 1 and Figure 4: output via extrinsic embedding takes the lead across all three statistics.

Figure 4: Comparison of the cumulative distributions of position errors for the pose regression task on $SE(3)$.
Table 2: Comparison of different embeddings. DEMR takes the lead in prediction accuracy with its two embedding functions, namely 6D and 9D. Even as the ratio of unseen cases increases, DEMR maintains better performance throughout.
10% 20% 40% 60% 80%
avg med std avg med std avg med std avg med std avg med std
euler 77.81 54.42 77.52 35.58 13.07 48.94 28.82 17.02 30.89 27.22 16.27 35.52 27.22 16.27 31.75
axis 75.57 50.71 78.60 36.95 12.38 54.40 20.73 12.51 27.63 21.95 12.75 27.15 18.96 10.66 24.75
6D 71.09 29.39 80.09 17.12 5.25 40.54 10.85 5.07 19.60 9.76 4.93 19.60 10.29 5.01 21.08
9D 70.72 30.99 63.18 19.83 5.62 39.22 12.91 6.28 22.66 12.24 6.40 24.18 10.23 5.31 22.60
Table 3: Test results reported as the average $D_{\mathcal{G}}$, with an example result for one test sample and the epoch at which each model converged.
Input GrassmannNet (Lohit & Turaga, 2017) DIMR DEMR
(example input and reconstructed subspace results shown as images)
avg $D_{\mathcal{G}}(U_{gt},U_{out})$  9.6826  3.4551  3.4721
epoch 320 680 140
Generalization ability

To evaluate the improvement of DEMR in generalization ability compared with unstructured output spaces, the training set contains only a small portion of the whole affine transformation space, while the test set encompasses affine transformations sampled from the whole rotation space. To keep the sampling step fair across the compared representations, sampling is conducted uniformly in the respective representation spaces. For models with axis-angle and Euler-angle output, the input point cloud pair is constructed with rotation representations whose entries are sampled i.i.d. from Euclidean ranges. For Euler representations, the lower bounds of the ranges are $-\pi$, and the upper bounds are $\pi$ multiplied by $90\%,80\%,60\%,40\%,20\%$, while the test set is sampled from $[-\pi,\pi]$. The training set with axis-angle representations is obtained by sampling Euclidean ranges in the same manner. When constructing a training set on $SO(3)$, the uniform sampling is conducted in the same way as the axis-angle sampling, segments are taken with the same ratios as in the two aforementioned settings, and the exponential map is then used to yield random samples on $SO(3)$. Facing unseen cases, DEMR has better extrapolation capability than plain deep learning settings with unstructured output spaces. As shown in Table 2, results from deep extrinsic learning show great competence regardless of the size of the training portion.

4.2 Task II: illumination subspace, the Grassmann manifold

Changes in pose, expression, and illumination conditions are inevitable in real-world applications and seriously impair the performance of face recognition models. It is widely acknowledged that an image set of a human face under different conditions lies close to a low-dimensional Euclidean subspace. After vectorizing the images of each person into one matrix in the Yale-B dataset (Georghiades et al., 2001), experiments reveal that the top $d=5$ principal components (PCs) generally capture more than $90\%$ of the singular values. As an essential application of DEMR, we take a feature-extraction CNN as $T(\cdot)$ in Equation 1, estimate in the high-dimensional Euclidean space, and obtain the final result by inversely mapping the estimate onto the Grassmann manifold.
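As an illustration of how the ground-truth subspace can be constructed, the NumPy sketch below stacks the vectorized images of one subject and keeps the top $d=5$ left singular vectors; this is one plausible reading of the procedure, and the helper name is ours.

```python
import numpy as np

def illumination_subspace(images, d=5):
    """images: (64, 784) vectorized images of one subject under 64 illumination conditions."""
    X = np.asarray(images, dtype=float).T            # (784, 64): one image per column
    U, s, _ = np.linalg.svd(X, full_matrices=False)  # left singular vectors = principal components
    U_d = U[:, :d]                                   # top-d PCs E_i^1, ..., E_i^d (orthonormal)
    return U_d, U_d @ U_d.T                          # basis U_i and the network target U_i U_i^T
```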

4.2.1 Experimental Setup

Architectural detail

A similar work sharing the same background is (Lohit & Turaga, 2017), which assumes the output of the last layer of the neural network to lie on a matrix manifold or its tangent space. We take GrassmannNet under the former assumption in (Lohit & Turaga, 2017) as the baseline, with the training loss set to the mean squared error. For all compared models, the training-set ratio is 0.8, and all illumination angles are used. In addition, the data preparation step also follows (Lohit & Turaga, 2017), and other training details are recorded in the supplementary material. On the Extended Yale Face Database B, each grayscale image of $168\times 192$ is first resized to $256\times 256$ as the CNN input, while for the subspace ground truth the images are resized to $28\times 28=784$ pixels for convenience of computation.

Dataset processing

The input face image $I_{i}^{j}$ denotes the $i$-th face under the $j$-th illumination condition, where $j=1,\ldots,64$ for the Yale B dataset. If we adopt the top $d$ PCs, the output is assumed to take the form $\{E_{i}^{1},E_{i}^{2},\ldots,E_{i}^{d}\}$, $k=1,\ldots,d$, with $\langle E_{i}^{k},E_{i}^{l}\rangle=\delta_{kl}$, where $\delta_{kl}$ is the Kronecker delta. Each PC is rearranged as a vector, so $vec(E_{i}^{k})$ has size $784\times 1$; we define $U_{i}=[vec(E_{i}^{1}),vec(E_{i}^{2}),\ldots,vec(E_{i}^{d})]$, and the output of the network is $U_{i}U_{i}^{\top}$. Most of the intrinsic distances defined on the Grassmann manifold are based on principal angles (PAs): for $P_{1},P_{2}\in\mathcal{G}(m,\mathbb{R}^{n})$ with SVD $P_{1}^{\top}P_{2}=USV^{\top}$, where $S=diag\{\cos(\theta_{1}),\cos(\theta_{2}),\ldots,\cos(\theta_{m})\}$, the Binet-Cauchy (BC) distance is $1-\Pi_{k=1}^{K}\cos^{2}\theta_{k}$ and the Martin (MA) distance is $\log\Pi_{k=1}^{K}(\cos^{2}\theta_{k})^{-1}$. For comparison of effectiveness and training advantage, the deep intrinsic learning takes $D_{\mathcal{G}}(U_{gt},U_{out})$ as both the training loss and the testing metric, where $U_{gt}$ and $U_{out}$ denote the ground truth and the network output respectively.
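A minimal NumPy sketch of the principal-angle based distances mentioned above is given below, assuming $P_1, P_2$ are $n\times m$ matrices with orthonormal columns; the function names are ours.

```python
import numpy as np

def principal_angle_cosines(P1, P2):
    """Singular values of P1^T P2 are cos(theta_1), ..., cos(theta_m)."""
    s = np.linalg.svd(P1.T @ P2, compute_uv=False)
    return np.clip(s, 0.0, 1.0)

def binet_cauchy_distance(P1, P2):
    """BC distance: 1 - prod_k cos^2(theta_k)."""
    c = principal_angle_cosines(P1, P2)
    return 1.0 - np.prod(c ** 2)

def martin_distance(P1, P2, eps=1e-12):
    """Martin distance: log prod_k cos^{-2}(theta_k), guarded against vanishing cosines."""
    c = principal_angle_cosines(P1, P2)
    return float(np.sum(-2.0 * np.log(c + eps)))
```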

Estimation accuracy and conformance

The results of illumination subspace estimation are given in Table 3. Measured by the average geodesic loss, the DEMR architecture achieves nearly the same performance as DIMR while drastically reducing the training time to convergence. With the batch size set to 10, we record the training loss every 10 epochs and report the approximate epoch at which the training loss converged in the third row of Table 3. Moreover, both DIMR and DEMR surpass GrassmannNet on the task of illumination subspace estimation, and DEMR still takes fewer epochs to train than GrassmannNet. In the first row of Table 3, the results from GrassmannNet, DIMR, and DEMR on one of the test samples are displayed. In addition, the conformance of the extrinsic and intrinsic metrics is reported in the supplementary material.

5 Conclusion

This paper presents Deep Extrinsic Manifold Representation (DEMR), which incorporates extrinsic embedding into DNNs to circumvent the direct computation of intrinsic quantities. Experimental results demonstrate that retaining the geometric structure of the manifold enhances overall performance and generalization ability on the original tasks. Furthermore, extrinsic embedding exhibits superior computational advantages over intrinsic methods. Looking ahead, we envision a more unified, parameterized formulation of extrinsic embedding techniques within deep learning settings.

References

  • Berkels et al. (2013) Berkels, B., Fletcher, P. T., Heeren, B., Rumpf, M., and Wirth, B. Discrete geodesic regression in shape space. In International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition, pp.  108–122. Springer, 2013.
  • Bermudez et al. (2018) Bermudez, C., Plassard, A. J., Davis, L. T., Newton, A. T., Resnick, S. M., and Landman, B. A. Learning implicit brain mri manifolds with deep learning. In Medical Imaging 2018: Image Processing, volume 10574, pp.  105741L. International Society for Optics and Photonics, 2018.
  • Bhattacharya & Patrangenaru (2003) Bhattacharya, R. and Patrangenaru, V. Large sample theory of intrinsic and extrinsic sample means on manifolds. The Annals of Statistics, 31(1):1–29, 2003.
  • Bhattacharya et al. (2012) Bhattacharya, R. N., Ellingson, L., Liu, X., Patrangenaru, V., and Crane, M. Extrinsic analysis on manifolds is computationally faster than intrinsic analysis with applications to quality control by machine vision. Applied Stochastic Models in Business and Industry, 28(3):222–235, 2012.
  • Boumal (2020) Boumal, N. An introduction to optimization on smooth manifolds. Available online, May, 3, 2020.
  • Bourmaud et al. (2015) Bourmaud, G., Mégret, R., Arnaudon, M., and Giremus, A. Continuous-discrete extended kalman filter on matrix lie groups using concentrated gaussian distributions. Journal of Mathematical Imaging and Vision, 51:209–228, 2015.
  • Bronstein et al. (2017) Bronstein, M. M., Bruna, J., LeCun, Y., Szlam, A., and Vandergheynst, P. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
  • Bronstein et al. (2021) Bronstein, M. M., Bruna, J., Cohen, T., and Veličković, P. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. arXiv preprint arXiv:2104.13478, 2021.
  • Byravan & Fox (2017) Byravan, A. and Fox, D. Se3-nets: Learning rigid body motion using deep neural networks. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp.  173–180. IEEE, 2017.
  • Can et al. (2021) Can, U., Utku, A., Unal, I., and Alatas, B. Deeper in data science: Geometric deep learning. PROCEEDINGS BOOKS, pp.  21, 2021.
  • Cao et al. (2020) Cao, W., Yan, Z., He, Z., and He, Z. A comprehensive survey on geometric deep learning. IEEE Access, 8:35929–35949, 2020.
  • Chen et al. (2021) Chen, J., Yin, Y., Birdal, T., Chen, B., Guibas, L., and Wang, H. Projective manifold gradient layer for deep rotation regression. arXiv preprint arXiv:2110.11657, 2021.
  • Cornea et al. (2017) Cornea, E., Zhu, H., Kim, P., Ibrahim, J. G., and Initiative, A. D. N. Regression models on riemannian symmetric spaces. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(2):463–482, 2017.
  • Fang et al. (2023) Fang, Y., Ohn, I., Gupta, V., and Lin, L. Intrinsic and extrinsic deep learning on manifolds. arXiv preprint arXiv:2302.08606, 2023.
  • Fletcher (2013) Fletcher, P. T. Geodesic regression and the theory of least squares on riemannian manifolds. International journal of computer vision, 105(2):171–185, 2013.
  • Fletcher (2011) Fletcher, T. Geodesic regression on riemannian manifolds. In Proceedings of the Third International Workshop on Mathematical Foundations of Computational Anatomy-Geometrical and Statistical Methods for Modelling Biological Shape Variability, pp.  75–86, 2011.
  • Georghiades et al. (2001) Georghiades, A., Belhumeur, P., and Kriegman, D. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Trans. Pattern Anal. Mach. Intelligence, 23(6):643–660, 2001.
  • Hinkle et al. (2012) Hinkle, J., Muralidharan, P., Fletcher, P. T., and Joshi, S. Polynomial regression on riemannian manifolds. In European conference on computer vision, pp.  1–14. Springer, 2012.
  • Huang et al. (2021) Huang, Z., Cai, H., Dan, T., Lin, Y., Laurienti, P., and Wu, G. Detecting brain state changes by geometric deep learning of functional dynamics on riemannian manifold. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp.  543–552. Springer, 2021.
  • Khayatkhoei et al. (2018) Khayatkhoei, M., Singh, M. K., and Elgammal, A. Disconnected manifold learning for generative adversarial networks. Advances in Neural Information Processing Systems, 31, 2018.
  • Lee (2021) Lee, H. Robust extrinsic regression analysis for manifold valued data. arXiv preprint arXiv:2101.11872, 2021.
  • Levinson et al. (2020) Levinson, J., Esteves, C., Chen, K., Snavely, N., Kanazawa, A., Rostamizadeh, A., and Makadia, A. An analysis of svd for deep rotation estimation. arXiv preprint arXiv:2006.14616, 2020.
  • Lin et al. (2017) Lin, L., St. Thomas, B., Zhu, H., and Dunson, D. B. Extrinsic local regression on manifold-valued data. Journal of the American Statistical Association, 112(519):1261–1273, 2017.
  • Lohit & Turaga (2017) Lohit, S. and Turaga, P. Learning invariant riemannian geometric representations using deep nets. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp.  1329–1338, 2017.
  • Ni et al. (2021) Ni, Y., Koniusz, P., Hartley, R., and Nock, R. Manifold learning benefits gans. arXiv preprint arXiv:2112.12618, 2021.
  • Qi et al. (2017) Qi, C. R., Su, H., Mo, K., and Guibas, L. J. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  652–660, 2017.
  • Sattler et al. (2019) Sattler, T., Zhou, Q., Pollefeys, M., and Leal-Taixe, L. Understanding the limitations of cnn-based absolute camera pose regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  3302–3312, 2019.
  • Shi et al. (2009) Shi, X., Styner, M., Lieberman, J., Ibrahim, J. G., Lin, W., and Zhu, H. Intrinsic regression models for manifold-valued data. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp.  192–199. Springer, 2009.
  • Shin & Oh (2022) Shin, H.-Y. and Oh, H.-S. Robust geodesic regression. International Journal of Computer Vision, pp.  1–26, 2022.
  • Steinke et al. (2010) Steinke, F., Hein, M., and Schölkopf, B. Nonparametric regression between general riemannian manifolds. SIAM Journal on Imaging Sciences, 3(3):527–563, 2010.
  • Yang et al. (2022) Yang, C.-H., Vemuri, B. C., et al. Nested grassmannians for dimensionality reduction with applications. Machine Learning for Biomedical Imaging, 1(IPMI 2021 special issue):1–10, 2022.
  • Zhang (2020) Zhang, Y. Bayesian geodesic regression on riemannian manifolds. arXiv preprint arXiv:2009.05108, 2020.
  • Zhou et al. (2019) Zhou, Y., Barnes, C., Lu, J., Yang, J., and Li, H. On the continuity of rotation representations in neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  5745–5753, 2019.

Appendix

5.0.1 Proofs in Section 3

Consistency of extrinsic loss and intrinsic loss
Proof of lemma 3.1
Lemma 5.1.

Suppose that $\mathcal{M}_{1},\mathcal{M}_{2}$ are smooth and compact Riemannian manifolds and $f:\mathcal{M}_{1}\rightarrow\mathcal{M}_{2}$ is a diffeomorphism. Then $f$ is bilipschitz w.r.t. the Riemannian distance.

Proof.

For $\mathcal{M}_{1}$ and $\mathcal{M}_{2}$ with smooth metrics $\rho_{1}$ and $\rho_{2}$ respectively, consider the smooth and continuous map $\mathrm{d}f:U_{1}\rightarrow T\mathcal{M}_{2}$, where $U_{1}$ denotes the unit tangent bundle of $\mathcal{M}_{1}$ and $T\mathcal{M}_{2}$ the tangent bundle of $\mathcal{M}_{2}$; then the function $\phi(u)=|\mathrm{d}f(u)|$, $u\in U_{1}$, is continuous as well. Since $U_{1}$ is compact, the maximum $C$ exists. Denoting the length of a path $r:[a,b]\rightarrow\mathcal{M}_{1}$ by $L(\cdot)$, we have $L(f\circ r)=\int_{a}^{b}|(f\circ r)^{\prime}|\mathrm{d}t\leq C\int_{a}^{b}|r^{\prime}(t)|\mathrm{d}t=CL(r)$, thus $\rho_{2}(f(t_{1}),f(t_{2}))\leq C\rho_{1}(t_{1},t_{2})$ for $t_{1},t_{2}\in\mathcal{M}_{1}$; then $f$ is $C$-Lipschitz and $f^{-1}$ is $\frac{1}{C}$-Lipschitz. ∎

Proof of proposition 3.2

The extrinsic metric should reflect the tendency of the intrinsic metric. Here we pay particular attention to consistency when the losses are small, because this consistency is crucial for the convergence of the extrinsic loss. More specifically, for a smooth embedding $J:\mathcal{M}\rightarrow\mathbb{R}^{m}$, where the $n$-manifold $\mathcal{M}$ is compact with metric $\rho$ and the metric of $\mathbb{R}^{m}$ is $d$, a vanishing change in $d$ implies a vanishing change in $\rho$, which is a direct result if $J$ is bilipschitz. Here the compactness condition on the embedded space is relaxed, so that $J$ is only required to be locally bilipschitz, while the proof follows the same idea as the lemma above.

Proposition 5.2.

For a smooth embedding $J:\mathcal{M}\rightarrow\mathbb{R}^{m}$, where the $n$-manifold $\mathcal{M}$ is compact with metric $\rho$ and the metric of $\mathbb{R}^{m}$ is denoted by $d$: for any two sequences $\{x_{k}\},\{y_{k}\}$ of points in $\mathcal{M}$ and their images $\{J(x_{k})\},\{J(y_{k})\}$, if $\lim_{n\rightarrow\infty}d(J(x_{n}),J(y_{n}))=0$, then $\lim_{n\rightarrow\infty}\rho(x_{n},y_{n})=0$.

Proof.

Given that every compact submanifold $Y_{s}=J(\mathcal{M})\subset Y$, where $Y\subset\mathbb{R}^{N}$, has a positive normal injectivity radius, there exists a positive constant $r$ such that $B_{r}v_{Y_{s}}=\{v\in v_{Y_{s}}:|v|\leq r\}$, with $v_{Y_{s}}$ the normal bundle of $Y_{s}$ in $Y$, and the normal exponential map $\exp_{Y_{s}}:v_{Y_{s}}\rightarrow Y$ is a diffeomorphism onto its image $S=\exp(B_{r}(v_{Y_{s}}))$. Because the exponential map $\exp_{Y_{s}}$ of $\mathcal{M}$ is a diffeomorphism, $S$ is an open neighborhood of $Y_{s}$, and the inverse of $\exp_{Y_{s}}$, $\log_{Y_{s}}$, is smooth. Let the retraction $g=p\circ\log_{Y_{s}}:S\rightarrow Y_{s}$ be the composition of the projection $p$ and $\log_{Y_{s}}$, and let the closure of $B_{r^{\prime}}v_{Y_{s}}$ be $\bar{B}_{r^{\prime}}v_{Y_{s}}$ for $0<r^{\prime}<r$; then its image under $\exp_{Y_{s}}$ is a compact submanifold with boundary $Z\subset Y_{s}$. Hence there exists a real constant $C_{1}\geq 0$ such that $g$ is $C_{1}$-Lipschitz on $Z$. From Lemma 3.1 there exists a real constant $C_{2}\geq 0$ such that $J$ is $C_{2}$-Lipschitz. Moreover, since $g$ is the right inverse of the inclusion map $i:Y_{s}\rightarrow Y$, it follows that $i$ is $C_{1}$-bilipschitz when restricted to a sufficiently small ball $B_{\epsilon}$, where $\epsilon<r^{\prime}$.

Next, define the map $\phi(y_{1},y_{2})=\frac{d_{Y_{s}}(y_{1},y_{2})}{d_{Y}(y_{1},y_{2})}$, where $(y_{1},y_{2})\in Y_{s}\times Y_{s}$ and $y_{1}\neq y_{2}$; $\phi$ is continuous since the distance functions $d_{Y_{s}}$ and $d_{Y}$ are continuous. On a compact subset $S_{comp}\subset Y_{s}\times Y_{s}$, there is a real value $C_{3}>0$ such that $\phi$ is bounded by $C_{3}$ on $S_{comp}$. Finally, $J$ is locally $C_{max}$-bilipschitz, where $C_{max}=\max(C_{1},C_{2},C_{3})$. Further, for any two sequences $\{x_{k}\},\{y_{k}\}$ of points in $\mathcal{M}$ and their images $\{J(x_{k})\},\{J(y_{k})\}$, if $\lim_{n\rightarrow\infty}d(J(x_{n}),J(y_{n}))=0$, then $\rho(x_{n},y_{n})\leq C_{max}\,d(J(x_{n}),J(y_{n}))$ and $\lim_{n\rightarrow\infty}\rho(x_{n},y_{n})=0$. ∎

5.0.2 Asymptotic maximum likelihood estimation (MLE)

Proof of proposition 3.3
Proposition 5.3.

$J^{-1}_{SO}:\mathbb{R}^{9}\rightarrow SO(3)$ gives an approximation of the MLE of rotations on $SO(3)$, assuming $\Sigma=\sigma I$, where $I$ is the identity matrix and $\sigma$ an arbitrary real value.

Proof.

For $x\sim\mathcal{N}_{\mathcal{G}}(\mu,\Sigma)$, we have $x=\mu\exp_{\mathcal{M}}([N]^{\vee}_{\mathcal{M}})$, and the simplified log-likelihood function is

$L(Y;F_{\Theta_{NN}},\Sigma)=[\log_{\mathcal{M}}(F_{\Theta_{NN}}-Y)]_{\mathcal{M}}^{\vee\top}\Sigma^{-1}[\log_{\mathcal{M}}(F_{\Theta_{NN}}-Y)]_{\mathcal{M}}^{\vee}$ (4)

with $[\log_{\mathcal{M}}(F_{\Theta_{NN}}-Y)]_{\mathcal{M}}^{\vee}$ set to $\epsilon$; if the pdf is concentrated around the group identity, i.e., the fluctuation of $\epsilon$ is small, the distribution of the noise $\epsilon$ can be approximated by $\mathcal{N}_{\mathbb{R}^{3}}(\mathbf{0}_{3\times 1},\Sigma)$ on $\mathbb{R}^{3}$. Then $\arg\max_{Y\in SO(3)}L(Y;F_{\Theta_{NN}},\Sigma)\approx\arg\min_{Y\in SO(3)}(F_{\Theta_{NN}}-Y)^{\top}(F_{\Theta_{NN}}-Y)=\arg\min_{Y\in SO(3)}\|F_{\Theta_{NN}}-Y\|_{F}^{2}$. ∎

Proof of proposition 3.4
Proposition 5.4.

$J^{-1}_{SE}:\mathbb{R}^{9}\rightarrow SE(3)$, where the rotation part comes from $\mathtt{SVD}$, gives an approximation of the MLE of transformations on $SE(3)$, assuming $\Sigma=\sigma I$, where $I$ is the identity matrix and $\sigma$ an arbitrary real value.

Proof.

For $x\sim\mathcal{N}_{\mathcal{G}}(\mu,\Sigma)$, we have $x=\mu\exp_{\mathcal{M}}([N]^{\vee}_{\mathcal{M}})$, and the simplified log-likelihood function is

$L(Y;F_{\Theta_{NN}},\Sigma)=[\log_{\mathcal{M}}(F_{\Theta_{NN}}-Y)]_{\mathcal{M}}^{\vee\top}\Sigma^{-1}[\log_{\mathcal{M}}(F_{\Theta_{NN}}-Y)]_{\mathcal{M}}^{\vee}$ (5)

with $[\log_{\mathcal{M}}(F_{\Theta_{NN}}-Y)]_{\mathcal{M}}^{\vee}$ set to $\epsilon$; if the pdf is concentrated around the group identity, i.e., the fluctuation of $\epsilon$ is small, the distribution of the noise $\epsilon$ can be approximated by $\mathcal{N}_{\mathbb{R}^{3}}(\mathbf{0}_{3\times 1},\Sigma)$ on $\mathbb{R}^{3}$. Then $\arg\max_{Y\in SO(3)}L(Y;F_{\Theta_{NN}},\Sigma)\approx\arg\min_{Y\in SO(3)}(F_{\Theta_{NN}}-Y)^{\top}(F_{\Theta_{NN}}-Y)=\arg\min_{Y\in SO(3)}\|F_{\Theta_{NN}}-Y\|_{F}^{2}$. ∎

Proof of proposition 3.5

For $\mathcal{G}(m,\mathbb{R}^{n})$, the probability density function of the MACG is

$P(x;R_{\mu},\Sigma)=\|\Sigma\|^{-\frac{r}{2}}\|(R_{\mu}x)^{\top}\Sigma^{-1}(R_{\mu}x)\|^{-\frac{p}{2}}=\|\Sigma\|^{-\frac{r}{2}}\|x^{\top}\Sigma^{-1}x\|^{-\frac{p}{2}}$ (6)

where $R_{\mu}\in O(n)$ and the covariance matrix $\Sigma$ is symmetric positive-definite.

As clarified in Section 2.2.2, we obtain the Grassmann matrix by computing equivalently on $SPD_{m}^{++}$ with the diffeomorphic mapping $J^{-1}_{\mathcal{G}}=\mathtt{UU^{\top}}$. The error model is modified to $F_{\Theta_{NN}}=Y^{\top}Y+N$, where both $Y^{\top}Y$ and $N$ are semi-positive definite matrices on $SPD^{++}_{m}$. The Gaussian distribution extended to the SPD manifold, for a random 2nd-order tensor (matrix) $X\in\mathbb{R}^{m\times m}$, is

$P(X;M,\mathcal{S})=\frac{1}{\sqrt{(2\pi)^{m}}|\mathcal{S}|}\exp\big(-\frac{1}{2}(X-M)^{\top}\mathcal{S}^{-1}(X-M)\big)$ (7)

with mean $M\in\mathbb{R}^{m\times m}$ and 4th-order covariance tensor $\mathcal{S}\in\mathbb{R}^{m\times m\times m\times m}$ inheriting symmetries from three dimensions. More specifically, $\mathcal{S}^{mn}_{ij}=\mathcal{S}^{nm}_{ij}=\mathcal{S}^{mn}_{ji}$ and $\mathcal{S}^{mn}_{ij}=\mathcal{S}^{ij}_{mn}$; thereby there is a vectorized version of Equation (3),

$P(X;\mu,\Sigma)=\frac{1}{\sqrt{(2\pi)^{\tilde{m}}}|\Sigma|}\exp\big(-\frac{1}{2}(X-\mu)^{\top}\Sigma^{-1}(X-\mu)\big)$ (8)

where $\tilde{m}=m+\frac{m(m-1)}{2}$, indicating the unfolded parameters:

$\mu=\begin{bmatrix}\mu_{11}\\ \mu_{22}\\ \vdots\\ \mu_{mm}\\ \mu_{12}\\ \mu_{13}\\ \vdots\\ \mu_{(m-1)m}\end{bmatrix}\quad\Sigma=\begin{bmatrix}\mathcal{S}^{11}_{11}&\mathcal{S}^{22}_{11}&\ldots&\mathcal{S}^{mm}_{11}&\sqrt{2}\mathcal{S}^{12}_{11}&\sqrt{2}\mathcal{S}^{13}_{11}&\ldots&\sqrt{2}\mathcal{S}^{1m}_{11}\\ \mathcal{S}^{11}_{22}&\mathcal{S}^{22}_{22}&\ldots&\mathcal{S}^{mm}_{22}&\sqrt{2}\mathcal{S}^{12}_{22}&\sqrt{2}\mathcal{S}^{13}_{22}&\ldots&\sqrt{2}\mathcal{S}^{1m}_{22}\\ \vdots&\vdots&&\vdots&\vdots&&\vdots\\ \mathcal{S}^{11}_{mm}&\mathcal{S}^{22}_{mm}&\ldots&\mathcal{S}^{mm}_{mm}&\sqrt{2}\mathcal{S}^{12}_{mm}&\sqrt{2}\mathcal{S}^{13}_{mm}&\ldots&\sqrt{2}\mathcal{S}^{1m}_{mm}\\ \sqrt{2}\mathcal{S}^{12}_{11}&\sqrt{2}\mathcal{S}^{12}_{22}&\ldots&\sqrt{2}\mathcal{S}^{12}_{mm}&2\mathcal{S}^{12}_{12}&2\sqrt{2}\mathcal{S}^{13}_{12}&\ldots&2\mathcal{S}^{1m}_{12}\\ \sqrt{2}\mathcal{S}^{13}_{11}&\mathcal{S}^{13}_{22}&\ldots&\sqrt{2}\mathcal{S}^{13}_{mm}&2\mathcal{S}^{13}_{13}&2\mathcal{S}^{13}_{13}&\ldots&2\mathcal{S}^{1m}_{13}\\ \vdots&\vdots&&\vdots&\vdots&&\vdots\\ \sqrt{2}\mathcal{S}^{1m}_{11}&\sqrt{2}\mathcal{S}_{22}^{2m}&\ldots&\sqrt{2}\mathcal{S}^{mm}_{1m}&2\mathcal{S}^{1m}_{12}&2\mathcal{S}^{13}_{1m}&\ldots&2\mathcal{S}^{mm}_{mm}\\ \end{bmatrix}$ (9)

Since each entry of $X$ comes from rearranging DNN outputs, which can be assumed not to correlate with the other entries, we have $\mathcal{S}^{ab}_{cd}=1$ when $a=b=c=d$ and $\mathcal{S}^{ab}_{cd}=0$ otherwise; the likelihood can then be further simplified with $\Sigma$ being an identity matrix. This yields the proof of Proposition 3.5.

Proposition 5.5.

DEMR with $J^{-1}_{\mathcal{G}}$ gives the MLE of an element on $\mathcal{G}(m,\mathbb{R}^{n})$, assuming $\Sigma$ is an identity matrix.

Proof.

The log-likelihood function of the pdf in Equation (8) gives $\arg\max_{Y\in SPD^{++}_{m}}L(Y;F_{\Theta_{NN}},\Sigma)=\arg\min_{Y\in SPD^{++}_{m}}(\tilde{F}_{\Theta_{NN}}-\tilde{Y})^{\top}(\tilde{F}_{\Theta_{NN}}-\tilde{Y})=\arg\min_{Y\in SPD^{++}_{m}}\|\tilde{F}_{\Theta_{NN}}-\tilde{Y}\|_{F}^{2}$, where $\tilde{F}_{\Theta_{NN}}$ and $\tilde{Y}$ are the vectorized versions of $F_{\Theta_{NN}}$ and $Y$ respectively. ∎

5.1 Generalization Ability

5.1.1 Proof of proposition 3.6

Proposition 5.6.

Any element of dimension $n$ on $SO(n)$ belongs to the image of $J^{-1}_{SO(n)}$ from known rotations within a certain range, if the Euclidean input of $J^{-1}_{SO(n)}$ has more than $n$ dimensions.

Proof.

The image of $J^{-1}_{SO(n)}$ is a set of $n$ orthogonal vectors that span the $n$-dimensional vector space, or it can be regarded as a parameterization of rotations around $n$ orthogonal axes, so they can be written as $\theta=[\theta_{1},\theta_{2},\ldots,\theta_{n}]^{\top}\triangleq\mathbf{u}\theta\triangleq\omega t\in\mathbb{R}^{n}$.

For the identity $\dot{R}=R[\omega]_{\times}\in T_{R}SO(n)$ with constant $\omega$, its solution is $R(t)=R_{0}\exp([\omega]_{\times}t)\overset{R_{0}=I}{=}\exp([\omega]_{\times}t)=\exp([\theta]_{\times})=\sum_{k}\frac{\theta^{k}}{k!}([u]_{\times})^{k}$, which means that, given $n$ orthogonal basis vectors, any rotation can be represented by transformations on $SO(n)$. ∎
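As a small numerical check of this identity, the matrix exponential of a skew-symmetric matrix $[\theta]_{\times}$ is orthogonal with unit determinant; the sketch below uses the $[\cdot]^{\wedge}$ map from Section 3.2.1.

```python
import numpy as np
from scipy.linalg import expm

def hat(theta):
    """[.]^: R^3 -> so(3)."""
    t1, t2, t3 = theta
    return np.array([[0.0, -t3,  t2],
                     [ t3, 0.0, -t1],
                     [-t2,  t1, 0.0]])

theta = np.array([0.3, -1.2, 0.7])                 # theta = u * angle
R = expm(hat(theta))                               # R = exp([theta]_x)
assert np.allclose(R.T @ R, np.eye(3), atol=1e-8)  # orthogonality
assert np.isclose(np.linalg.det(R), 1.0)           # det = +1, so R lies in SO(3)
```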


5.1.2 Proof of corollary 3.7

Corollary 5.7.

Any element of dimension $n$ on $SE(n)$ belongs to the image of $J^{-1}_{SE(n)}$ from known rotations within a certain range, if the Euclidean input of $J^{-1}_{SE(n)}$ has more than $n$ dimensions.