
Deep Extrinsic Manifold Representation for Vision Tasks

Tongtong Zhang    Xian Wei    Yuanxiang Li
Abstract

Non-Euclidean data is frequently encountered across different fields, yet there is limited literature on the fundamental challenge of training neural networks with manifold representations as outputs. We introduce an approach termed Deep Extrinsic Manifold Representation (DEMR) for visual tasks in this context. DEMR incorporates extrinsic manifold embedding into deep neural networks, which helps generate manifold representations. Rather than directly optimizing a complex geodesic loss, DEMR optimizes the computation graph within the embedded Euclidean space, allowing adaptability to various architectural requirements. We provide supporting evidence for the proposed concept on two types of manifolds, $SE(3)$ and its associated quotient manifolds, together with theoretical guarantees on feasibility, asymptotic properties, and generalization capability. The experimental results show that DEMR adapts effectively to point cloud alignment, producing outputs in $SE(3)$, as well as to illumination subspace learning with outputs on the Grassmann manifold.


1 Introduction

Data in non-Euclidean geometric spaces has applications across various domains, such as motion estimation in robotics (Byravan & Fox, 2017) and shape analysis in medical imaging (Bermudez et al., 2018; Huang et al., 2021; Yang et al., 2022). Deep learning has revolutionized these fields. However, deep neural networks (DNNs) typically generate feature vectors in Euclidean space, which is not universally suitable for certain computer vision tasks, such as estimating probability distributions for classification or rigid motion estimation. Learning problems involving manifolds can be classified according to how the manifold assumption is applied. The first category involves signal processing on a manifold structure, with the resulting output situated in Euclidean space. For example, geometric deep-learning approaches extract features from graphs, meshes, and other structures; the encoded features are then fed to decoders for tasks such as classification and regression in Euclidean space (Bronstein et al., 2017; Cao et al., 2020; Can et al., 2021; Bronstein et al., 2021), or they serve as latent codes for generative models (Ni et al., 2021). The second category establishes continuous mappings between data residing on the same manifold to enable regression. For instance, (Steinke et al., 2010) addresses regression between manifolds through a regularization functional, while (Fang et al., 2023) performs statistical analysis over deep neural network-based mappings between manifolds. The third category focuses on deep learning models with distinct Euclidean inputs and manifold outputs. This line of research often emphasizes a specific type of manifold, such as deep rotation manifold regression (Zhou et al., 2019; Levinson et al., 2020; Chen et al., 2021).

(a) Intrinsic manifold regression via energy minimization (Boumal, 2020)
(b) Intrinsic manifold regression from the tangent bundle (Zhang, 2020)
(c) Extrinsic manifold regression (Lin et al., 2017; Lee, 2021)
Figure 1: Manifold regression explores the relationship between a manifold-valued variable and a value in a vector space. A typical intrinsic manifold regression finds the best-fitted geodesic curve $\gamma$ on $\mathcal{M}$ via (a) minimizing a complex energy function of distance and smoothness, or (b) updating parameters in the local tangent bundle $T\mathcal{M}$. Extrinsic manifold regression (c) models the relationship in the extrinsically embedded space.

This paper centers on the third category, which entails generating manifold representations from DNNs. It is worth noting that models producing outputs on a manifold are typically regularized using geometric metrics, which fall into two types: intrinsic manifold loss and extrinsic manifold loss (Bhattacharya & Patrangenaru, 2003; Bhattacharya et al., 2012), as depicted in Figure 1. Intrinsic methods aim to identify the geodesic that best fits the data so as to preserve the geometric structure (Fletcher, 2011, 2013; Cornea et al., 2017; Shin & Oh, 2022). However, the inherent characteristics of intrinsic distances pose challenges for DNN architectures. First, many intrinsic losses incorporate intricate geodesic distances, which induce longer gradient flows through the entire computation graph (Fletcher, 2011; Hinkle et al., 2012; Fletcher, 2013; Shi et al., 2009; Berkels et al., 2013). Second, directly fitting a geodesic by minimizing distance and smoothness energy in Euclidean space may yield off-manifold points (Chen et al., 2021; Khayatkhoei et al., 2018).

In contrast, extrinsic regression uses embeddings in a higher-dimensional Euclidean space to create a non-parametric proxy estimator. The estimate on the manifold $\mathcal{M}$ can then be obtained using $J^{-1}$, the inverse of the extrinsic embedding $J$. Extensive investigations in (Bhattacharya et al., 2012; Lin et al., 2017) have established that extrinsic regression offers superior computational benefits compared to intrinsic regression. Many regression models are customized for specific applications, utilizing exclusive information to simplify model formulations that include explicit explanatory variables; this customization is evident in applications such as shapes on shape-space manifolds (Berkels et al., 2013; Fletcher, 2011).

However, within the computer vision community, deep neural networks often face large amounts of diverse multimedia data. Traditional manifold regression models struggle with such varied modeling tasks due to their limited representational power. Some recent works have addressed this challenge, such as (Fang et al., 2023), which processes manifold inputs and provides empirical evidence. This paper presents the idea of embedding manifolds extrinsically at the final regression layer of various neural networks, including ResNet and PointNet; we call this idea Deep Extrinsic Manifold Representation (DEMR). The approach bridges the gap between traditional extrinsic manifold regression and neural networks in computer vision from two perspectives. Firstly, the conventional choice of proxy estimator, often represented by kernel functions, is substituted with DNN feature extractors. Feature extractors such as ResNet or PointNet, renowned for their efficacy in specific tasks, significantly elevate the representational power of feature extraction.

Secondly, to project the neural network output onto the preimage of $J(\cdot)$, we depart from the deterministic projection methods employed in traditional extrinsic manifold regression. Instead, we opt for a learnable linear layer commonly found in DNN settings. This learnable projection module aligns seamlessly with most DNN architectures and eliminates the need for the manual design of the projection function $Pr(\cdot)$, a step typically required in prior extrinsic manifold regression models to match the type of manifold. These choices not only enhance the model's representational power compared to traditional regression models but also preserve existing neural network architectures.

Contribution

We facilitate the generation of manifold outputs from standard DNN architectures through extrinsic manifold embedding. In particular, we elucidate why pose regression tasks perform more effectively when treated as a specialized instance of DEMR. Additionally, we offer theoretical substantiation regarding the feasibility, asymptotic properties, and generalization ability of DEMR for $SE(3)$ and the Grassmann manifold. Finally, the efficacy of DEMR is validated through its application to two classic computer vision tasks: relative point cloud transformation estimation on $SE(3)$ and illumination subspace estimation on the Grassmann manifold.

2 DEMR

2.1 Problem Formulation

Estimation in the embedded space

For a distribution $\mathcal{Q}$ on a manifold $\mathcal{M}$ of dimension $d$, the extrinsic embedding $\tilde{\mathcal{M}}=J(\mathcal{M})$ from $\mathcal{M}$ to the Euclidean space $\mathbb{E}=\mathbb{R}^{N}$ carries the distribution $\tilde{\mathcal{Q}}=\mathcal{Q}\circ J$ and is a closed subset of $\mathbb{R}^{N}$, where $d\ll N$. In extrinsic manifold regression, for every $u\in\mathbb{E}$ there exists a compact projection set $Pr(u)=\{x\in\tilde{\mathcal{M}}:\|x-u\|\leq\|y-u\|,\forall y\in\tilde{\mathcal{M}}\}$, mapping $u$ to the closest point on $\tilde{\mathcal{M}}$. The extrinsic mean set of $\mathcal{Q}$ is $\mu^{ext}=J^{-1}(Pr(\mu))$, where $\mu$ is the mean set of $\tilde{\mathcal{Q}}$. In DEMR, $\mu$ is produced by the neural network, and the estimate on $\mathcal{M}$ is then computed deterministically.

Figure 2: DEMR pipeline, with black arrows indicating the forward process and optimization shown in the red box.
Pipeline design

The pipeline of DEMR is shown in Figure 2. For an input $x\in\mathcal{X}$ with corresponding ground truth $y_{gt}\in\mathcal{M}$, the feedforward process contains two steps: first, the deep estimate $\hat{y}^{E}$ is produced in the embedded space $\mathbb{R}^{N}$; then the output manifold representation is $\hat{y}=J^{-1}(Pr(\hat{y}^{E}))$, where $J^{-1}$ is the inverse of the extrinsic embedding $J$. Since $\hat{y}^{E}\in\mathbb{R}^{N}$, where $\mathbb{R}^{N}$ denotes the real-valued vector space of dimension $N$, $Pr$ can be dropped within the pipeline. The training loss is computed between $\hat{y}^{E}$ and the extrinsic embedding $y^{E}_{gt}=J(y_{gt})$. Hence the gradient used in backpropagation is computed within $\mathbb{R}^{N}$, leaving the original DNN architecture unchanged. Moreover, this implies that DEMR can be applied directly to most existing DNN architectures by simply adjusting the dimensionality of the final output layer.
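To make the forward and training procedure concrete, the following is a minimal PyTorch-style sketch of the pipeline in Figure 2, where feature_extractor, linear_head, J, and J_inv are generic placeholders for the backbone, the final linear layer, and the embedding pair; the names are illustrative, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def demr_train_step(feature_extractor, linear_head, J, optimizer, x, y_gt):
    """One DEMR training step: the loss is a plain MSE in the embedded space R^N."""
    y_hat_E = linear_head(feature_extractor(x))   # deep estimate \hat{y}^E in R^N
    y_gt_E = J(y_gt)                              # extrinsic embedding of the ground truth
    loss = F.mse_loss(y_hat_E, y_gt_E)
    optimizer.zero_grad()
    loss.backward()                               # gradients stay in R^N; the DNN is unchanged
    optimizer.step()
    return loss.item()

@torch.no_grad()
def demr_predict(feature_extractor, linear_head, J_inv, x):
    """Inference: map the Euclidean output back onto the manifold with the deterministic J^{-1}."""
    return J_inv(linear_head(feature_extractor(x)))
```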

Similar to the population regression function in extrinsic regression, the estimator in the embedded space, $F_{NN}(x)$, i.e., the neural network acting in the extrinsic embedded space $\mathbb{R}^{N}$, aims to attain the conditional Fréchet mean if it exists: $F_{NN}(x)=\arg\min_{m\in\mathcal{M}}\int_{\mathcal{M}}L_{extr}^{2}(m,y)\,\tilde{\mathcal{Q}}(\mathrm{d}y|x)=\arg\min_{m\in\mathcal{M}}\int_{\mathcal{M}}\|J(m)-J(y)\|^{2}\,\tilde{\mathcal{Q}}(\mathrm{d}y|x)$, where $L_{extr}$ is the extrinsic distance, $\tilde{\mathcal{Q}}(\mathrm{d}y|x)$ denotes the conditional distribution of $Y$ given $X=x$, and $\mathcal{Q}(\cdot|x)=\tilde{\mathcal{Q}}(\cdot|x)\circ J^{-1}$ is the conditional probability measure on $\mathcal{M}$. Therefore, in both the training and evaluation phases of DEMR, the computational burden is shared by the deterministic conversion $J^{-1}$, which requires no gradient computation.

Reformulation of neural network for images

The input data sample $\mathcal{I}$ is fed to the differentiable feature extractor $T(\cdot)$, which can be composed of various modules, such as ResNet, PointNet, etc., according to the input:

$\hat{y}^{E}=F_{NN}(\mathcal{I})=b+P\cdot T(\mathcal{I})=b+P\cdot(c_{1}^{\mathcal{I}},\ldots,c_{n}^{\mathcal{I}})^{\top}\triangleq b+P\cdot C_{fm}\triangleq F_{L}(C_{fm})$ (1)

where $\cdot$ denotes matrix multiplication, $b$ is the bias, and $C_{fm}$ is the matrix of feature maps from $T(\cdot)$, decomposed into column vectors $c_{k}^{\mathcal{I}}$. Therefore, the DNN in DEMR serves as the composite mapping from the raw input $\mathcal{I}$ to the preimage of $J^{-1}$.

Projection $Pr$ onto the preimage of $J$

For an extrinsic embedding function $J(\cdot)$, a pivotal issue remains: the estimate given above might not lie in the preimage $\mathcal{IM}$ of $J(\cdot)$.

Since $\hat{y}_{i}\in\mathbb{R}^{N}$, and $T(\mathcal{I})$ and $\mathcal{IM}$ are both Euclidean, the projection between them can be represented as a linear transform $Pr(\cdot)$ in matrix form. Rather than the deterministic linear projection of extrinsic manifold regression (Lin et al., 2017; Lee, 2021), DEMR adopts a learnable projection realized by linear layers within the deep framework, i.e., $F_{L}(\cdot)$ in Equation 1; the output of the DNN is then $Pr(F_{NN}(\mathcal{I}))\in\mathbb{R}^{N}$, and $J^{-1}(Pr(C_{fm}))\in\mathcal{M}$. Therefore, the final output of DEMR on the manifold is

$\hat{y}=J^{-1}(\hat{y}^{E})=J^{-1}(Pr(C_{fm}))=J^{-1}(F_{L}(C_{fm}))=J^{-1}\big(\arg\min_{q\in\tilde{\mathcal{M}}}\|q-\hat{F}_{NN}(\mathcal{I})\|^{2}\big)$
DIMR

In contrast, the architecture adopted in (Lohit & Turaga, 2017) uses a geodesic loss to train the neural network. The intrinsic geodesic distance is $d_{intr}(y_{gt},\hat{y})=\|Log_{y_{gt}}\hat{y}\|$ in Figure 3, where $Log$ is the logarithmic map of $\mathcal{M}$. We call this setting Deep Intrinsic Manifold Representation (DIMR) for convenience; its model parameter set $\Theta_{NN}$ is updated with the gradients of $L_{intr}$.

Figure 3: DIMR pipeline with geodesic loss on $\mathcal{M}$, with the black arrow indicating the forward process.

2.2 The extrinsic embedding $J$

The embedding $J$ is designed to preserve geometric properties to a great extent, which can be specified as equivariance. $J$ is an equivariant embedding if there is a group homomorphism $\phi:G\rightarrow GL(D,\mathbb{R})$ from $G$ to the general linear group $GL(D,\mathbb{R})$ of degree $D$ such that $J(gq)=\phi(g)J(q)$ for all $g\in G$ and $q\in\mathcal{M}$. The choice of $J$ is not unique, and the choices below are equivariant embeddings. For the orthogonal group embedding $J_{O}$, consider $R_{1},R_{2}\in O(n)$ with singular value decompositions $J_{O}(R_{1})=U_{1}\Sigma_{1}V_{1}^{\top}$ and $J_{O}(R_{2})=U_{2}\Sigma_{2}V_{2}^{\top}$, and let $J_{O}(R_{1}R_{2})\triangleq K=U_{1}V_{1}^{\top}\Sigma_{K}U_{2}V_{2}^{\top}$. Then $R_{1}R_{2}=U_{1}V_{1}^{\top}U_{2}V_{2}^{\top}$, and with $\phi(R_{1})=R_{1}\Sigma_{K}U_{2}\Sigma_{2}^{-1}U_{2}^{\top}$ we obtain $J_{O}(R_{1}R_{2})=\phi(R_{1})U_{2}\Sigma_{2}V_{2}^{\top}=\phi(R_{1})J_{O}(R_{2})$.

The proof for the Grassmannian embedding $J_{\mathcal{G}}$ follows the same idea as the proof for $O(n)$, since both are based on matrix decompositions.

2.2.1 Matrix Lie Group

9D: SVD of rank 9

An intuitive embedding choice for a matrix Lie group is to parameterize each matrix entry. For the orthogonal group $O(n)$ of dimension $n$, the natural embeddings are $J_{O}:O(n)\rightarrow\mathbb{R}^{n^{2}}$ and $J_{SO}:SO(n)\rightarrow\mathbb{R}^{n^{2}}$. The inverse $J_{O}^{-1}$ aims to produce $n$ orthogonal vectors; Gram-Schmidt orthogonalization (Zhou et al., 2019) and its variations are common ways to reparameterize orthogonal vectors. Singular Value Decomposition (SVD) is another convenient way to produce an orthogonal matrix (Levinson et al., 2020):

$\mu^{ext}=\hat{F}_{NN}(x)=\begin{cases}UV^{\top},&\det(\mu)>0\\ UHV^{\top},&\text{otherwise}\end{cases}$ (2)

where $U$ and $V$ come from the singular value decomposition $\mu=UDV^{\top}$ with the singular values arranged in descending order, and $H$ is the identity matrix $I_{n}$ with its last entry replaced by $-1$. Specifically for $SO(3)$, $J^{-1}_{SO(3)}$ can also be obtained via the cross product.
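As a concrete illustration, below is a minimal NumPy sketch of the SVD-based inverse embedding in Equation (2), in the style of (Levinson et al., 2020); the function name is ours.

```python
import numpy as np

def j_inv_so3_svd(x9):
    """Map a 9-dimensional Euclidean output to SO(3) via SVD (Equation 2)."""
    m = np.asarray(x9, dtype=float).reshape(3, 3)
    U, _, Vt = np.linalg.svd(m)                  # m = U diag(s) Vt, singular values descending
    H = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # flip last axis if det < 0
    return U @ H @ Vt                            # orthogonal matrix with det = +1
```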

As for the special Euclidean group, the group of isometries of Euclidean space, i.e., the rigid body transformations preserving Euclidean distance: a rigid body transformation can be given by a pair $(A,t)$ of an affine transformation matrix $A$ and a translation $t$, or it can be written in matrix form of size $n+1$. The special Euclidean group can be regarded as the semidirect product of the rotation group $SO(n)$ and the translation group $T(n)$, i.e., $SE(n)=SO(n)\ltimes T(n)$.

6D: cross product

Specifically for orthogonal matrices in $SO(3)$, the invertible embedding $J$ can be realized more conveniently via the cross-product operation $\times$. For a 6-dimensional Euclidean vector $x=[x_{a},x_{b}]$ as the network output, where $x$ is the concatenation of $x_{a}$ and $x_{b}$, let $x_{c}=x_{a}\times x_{b}$; then $J^{-1}(x)=[x_{a}^{\top},x_{b}^{\top},x_{c}^{\top}]$.
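A minimal NumPy sketch of the 6D inverse embedding is given below. Note that, as in the 6D representation of (Zhou et al., 2019), the two 3-vectors are orthonormalized by Gram-Schmidt before the cross product is taken, so that the resulting matrix is a valid rotation; the function name is ours.

```python
import numpy as np

def j_inv_so3_6d(x6):
    """Map a 6-dimensional Euclidean output [x_a, x_b] to SO(3) via the cross product."""
    x = np.asarray(x6, dtype=float)
    a, b = x[:3], x[3:]
    b1 = a / np.linalg.norm(a)                   # first column: normalized x_a
    b2 = b - np.dot(b1, b) * b1                  # Gram-Schmidt: remove the component along b1
    b2 = b2 / np.linalg.norm(b2)
    b3 = np.cross(b1, b2)                        # x_c = x_a x x_b after orthonormalization
    return np.stack([b1, b2, b3], axis=1)        # columns form an orthonormal, right-handed frame
```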

2.2.2 The Quotient Manifold of Lie Group

The real Grassmann manifold $\mathcal{G}(m,\mathbb{R}^{n})$ parameterizes all $m$-dimensional linear subspaces of $\mathbb{R}^{n}$ and can also be defined as a quotient manifold: $\mathcal{G}(m,\mathbb{R}^{n})=O(n)/(O(m)\times O(n-m))=SO(n)/(SO(m)\times SO(n-m))=\mathcal{V}(m,n)/SO(m)$. Since it is a quotient space and we care about the subspace itself rather than a particular basis, we can convert the problem of embedding $Y\in\mathcal{G}(m,\mathbb{R}^{n})$ into finding mappings for $YY^{\top}\in SPD^{++}_{m}$, whose basis can be recovered by diagonal decomposition, referred to as $\mathtt{DD}$. In fact, $\mathtt{DD}$ is a special case of $\mathtt{SVD}$ for a symmetric matrix. Thus, for a distribution $\mathcal{Q}$ defined on $\mathcal{G}(m,\mathbb{R}^{n})$, given $\mu=\hat{F}(x)\in\mathbb{R}^{n\times m}$ and its diagonal decomposition ($\mathtt{DD}$) $\mu=U\Sigma U^{\top}$, the inverse embedding $J^{-1}_{\mathcal{G}}$ is given by $\mu^{ext}=J^{-1}_{\mathcal{G}}(\mu)=USU^{\top}$, and the first $m$ column vectors of $U$ constitute the subspace.
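The following is a minimal NumPy sketch of $J^{-1}_{\mathcal{G}}$ as described above, assuming the network output has been reshaped to a square matrix that is symmetrized before the diagonal decomposition; the function name and shapes are illustrative.

```python
import numpy as np

def j_inv_grassmann(mu, m):
    """Recover an m-dimensional subspace of R^n from a Euclidean estimate mu (n x n)."""
    mu = np.asarray(mu, dtype=float)
    sym = 0.5 * (mu + mu.T)                      # symmetrize so the DD (eigendecomposition) is real
    eigvals, U = np.linalg.eigh(sym)             # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]            # reorder to descending
    basis = U[:, order[:m]]                      # top-m eigenvectors span the estimated subspace
    return basis, basis @ basis.T                # basis and the extrinsic representative UU^T
```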

2.3 DEMR as a generalization of previous research

DNN with Euclidean output

When the output space is a vector space, $J(\cdot)$ becomes an identity mapping. Thus, a DNN with Euclidean output can be treated as a degenerate form of DEMR. The output of $F_{NN}$ is linearly transformed from the subspace spanned by the basis $C_{fm}$, causing failure for newly extracted features $c^{\prime}\notin C_{fm}$. This corresponds to the failure case in (Zhou et al., 2019) when the dimension of the last layer equals the output dimension.

Absolute/relative pose regression

From Equation 1, the estimate $F_{NN}(\mathcal{I})$ is linearly transformed from the subspace spanned by $c_{k}^{\mathcal{I}}$. Hence, in typical DNN-based pose regression tasks, the predicted pose $\hat{y}$ can be seen as a linear combination of feature maps extracted from input poses, resulting in a failure to extrapolate to unseen poses in the test set. This is a common problem because training samples are often drawn from limited poses. (Zhou et al., 2019) suggests that it is the discontinuity in the output space that incurs poor generalization. Indeed, a better assumption for APR is that the pose estimate lies on $SE(3)$, composed of a rotation and a translation. The manifold assumption renders the estimator more capable of continuous interpolation and extrapolation from the input poses, because the continuity and symmetry of $SE(3)$ greatly benefit the deep learning task. The detailed analysis is in Section 2.2.1 and the validation in Section 4.1. Thereby, APR on $SE(3)$ can be regarded as a particular case solved by DEMR; (Sattler et al., 2019; Zhou et al., 2019) revealed parts of the idea behind DEMR.

3 Analysis

In line with the experimental setup, the analysis is performed on the special orthogonal group, its quotient space, and the special Euclidean group. (All proofs are included in the Appendix.)

3.1 Feasibility of DEMR

Before conducting optimization in the extrinsic embedding space $\mathbb{R}^{N}$, the primary concern is whether the geometry of the extrinsic embedded space properly reflects the intrinsic geometry of $\mathcal{M}$.

The extrinsic embedding $J(\cdot)$ is a diffeomorphism and thus preserves geometric continuity, which is advantageous for extrinsic embeddings. Since we want to observe conformance between $L_{extr}$ and $L_{intr}$, it is natural to bridge the two distances via smoothness. First, we need the diffeomorphism between manifolds to be bilipschitz.

Lemma 3.1.

Suppose that $\mathcal{M}_{1},\mathcal{M}_{2}$ are smooth and compact Riemannian manifolds and $f:\mathcal{M}_{1}\rightarrow\mathcal{M}_{2}$ is a diffeomorphism. Then $f$ is bilipschitz w.r.t. the Riemannian distance.

Then Proposition 3.2 shows the conformity of the extrinsic and intrinsic distances, which enables an intrinsic loss to be represented indirectly in Euclidean space.

Proposition 3.2.

For a smooth embedding $J:\mathcal{M}\rightarrow\mathbb{R}^{m}$, where the $n$-manifold $\mathcal{M}$ is compact with metric $\rho$ and the metric of $\mathbb{R}^{m}$ is denoted by $d$: for any two sequences $\{x_{k}\},\{y_{k}\}$ of points in $\mathcal{M}$ and their images $\{J(x_{k})\},\{J(y_{k})\}$, if $\lim_{n\rightarrow\infty}d(J(x_{n}),J(y_{n}))=0$, then $\lim_{n\rightarrow\infty}\rho(x_{n},y_{n})=0$.

3.2 Asymptotic MLE

In this part, we demonstrate that DEMR for $SO(3)$ is an approximate maximum likelihood estimator (MLE) of the $SO(3)$ response, and that for the Grassmann manifold $\mathcal{G}(m,\mathbb{R}^{n})$ DEMR is the MLE of the Grassmann response.

Note that we adopt a new error model in conformity with DEMR. In previous work such as (Levinson et al., 2020), the noise matrix $N$ is assumed to have random entries $n_{ij}\sim N(0,\sigma)$, which is not reasonable, because $N$ also lies on $\mathcal{M}$ and there is innate structure among the entries of $N$. Here we assume $N\in\mathcal{M}$, so the probability $Q$ on $\mathcal{M}$ must be established first.

3.2.1 Lie group for transformations

As (Bourmaud et al., 2015) suggested, we consider connected, unimodular matrix Lie groups, which include the categories most frequently used in computer vision: $SE(3)$, $SO(3)$, $SL(3)$, etc. Since $SO(3)$ is a degenerate case of $SE(3)$ without translation, here we consider the concentrated Gaussian distribution on $SE(3)$. The probability density function (pdf) takes the form $P(x;\Sigma)=\alpha\exp\big(-\frac{1}{2}[\log_{\mathcal{M}}(x)]_{\mathcal{M}}^{\vee\top}\Sigma^{-1}[\log_{\mathcal{M}}(x)]_{\mathcal{M}}^{\vee}\big)$, where $\alpha$ is the normalizing factor, $x\in SO(3)$, and the covariance matrix $\Sigma$ is positive definite.

The maps $[\cdot]^{\wedge}$ and $[\cdot]^{\vee}$ are linear isomorphisms, rearranging Euclidean representations into antisymmetric matrices and back: $[\cdot]^{\wedge}:\mathbb{R}^{3}\rightarrow so(3),\ \theta\mapsto[\theta]_{\times}$ and $[\cdot]^{\vee}:so(3)\rightarrow\mathbb{R}^{3},\ [\theta]_{\times}\mapsto\theta$, where $[\cdot]_{\times}$ denotes the antisymmetric matrix form of a vector. For a vector $\theta=[\theta_{1},\theta_{2},\theta_{3}]^{\top}$, we have $\theta^{\wedge}=[\theta]_{\times}$ and $[\theta]_{\times}^{\vee}=\theta$.
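For reference, a minimal NumPy sketch of the $[\cdot]^{\wedge}$ and $[\cdot]^{\vee}$ maps on $so(3)$ is given below; the function names are ours.

```python
import numpy as np

def hat(theta):
    """[.]^: R^3 -> so(3), the antisymmetric matrix [theta]_x."""
    t1, t2, t3 = theta
    return np.array([[0.0, -t3,  t2],
                     [ t3, 0.0, -t1],
                     [-t2,  t1, 0.0]])

def vee(Theta):
    """[.]^v: so(3) -> R^3, the inverse of hat."""
    return np.array([Theta[2, 1], Theta[0, 2], Theta[1, 0]])
```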

Proposition 3.3.

$J^{-1}_{SO}:\mathbb{R}^{9}\rightarrow SO(3)$ gives an approximation of the MLE of rotations on $SO(3)$, assuming $\Sigma=\sigma I$, where $I$ is the identity matrix and $\sigma$ an arbitrary real value.

DEMR on the special Euclidean group $SE(3)$ also approximately provides the maximum likelihood estimate, and the proof shares similar ideas with that for $SO(3)$.

Proposition 3.4.

$J^{-1}_{SE}:\mathbb{R}^{9}\rightarrow SE(3)$, where the rotation part comes from $\mathtt{SVD}$ (see Supplementary Material), gives an approximation of the MLE of transformations on $SE(3)$, assuming $\Sigma=\sigma I$, where $I$ is the identity matrix and $\sigma$ an arbitrary real value.

Proof.

For $x\sim\mathcal{N}_{\mathcal{G}}(\mu,\Sigma)$, we have $x=\mu\exp_{\mathcal{M}}([N]^{\vee}_{\mathcal{M}})$, and the simplified log-likelihood function is

$L(Y;F_{\Theta_{NN}},\Sigma)=[\log_{\mathcal{M}}(F_{\Theta_{NN}}-Y)]_{\mathcal{M}}^{\vee\top}\Sigma^{-1}[\log_{\mathcal{M}}(F_{\Theta_{NN}}-Y)]_{\mathcal{M}}^{\vee}$

with $[\log_{\mathcal{M}}(F_{\Theta_{NN}}-Y)]_{\mathcal{M}}^{\vee}$ set to $\epsilon$; if the pdf is concentrated around the group identity, i.e., the fluctuation of $\epsilon$ is small, the distribution of the noise $\epsilon$ can be approximated by $\mathcal{N}_{\mathbb{R}^{3}}(\mathbf{0}_{3\times 1},\Sigma)$ on $\mathbb{R}^{3}$. Then $\arg\max_{Y\in SO(3)}L(Y;F_{\Theta_{NN}},\Sigma)\approx\arg\min_{Y\in SO(3)}(F_{\Theta_{NN}}-Y)^{\top}(F_{\Theta_{NN}}-Y)=\arg\min_{Y\in SO(3)}\|F_{\Theta_{NN}}-Y\|_{F}^{2}$. ∎

3.2.2 Grassmann for subspaces

One manifold version of the Gaussian distribution on the Grassmann and Stiefel manifolds is the Matrix Angular Central Gaussian (MACG). However, the matrix representation of linear subspaces should preserve the consistency of eigenvalues across permutations and sign flips; thus we resort to the symmetric $UU^{\top}$ on the Symmetric Positive Definite (SPD) manifold for $U\in\mathcal{G}(m,\mathbb{R}^{N})$.

Similarly, the error model for the manifold output is modified to $F_{\Theta_{NN}}=YY^{\top}+N$, where both $YY^{\top}$ and $N$ are semi-positive definite matrices lying on the SPD manifold. The Gaussian distribution extended to the SPD manifold, for a random 2nd-order tensor $X\in\mathbb{R}^{n\times n}$, is

$P(X;M,\mathcal{S})=\frac{1}{\sqrt{(2\pi)^{n}}|\mathcal{S}|}\exp\big(-\frac{1}{2}(X-M)^{\top}\mathcal{S}^{-1}(X-M)\big)$ (3)

with mean $M\in\mathbb{R}^{m\times m}$ and 4th-order covariance tensor $\mathcal{S}\in\mathbb{R}^{m\times m\times m\times m}$ inheriting symmetries from three dimensions. For the inverse of the Grassmann manifold extrinsic embedding, $J_{\mathcal{G}}^{-1}$, DEMR provides the MLE; the proofs are given in the Supplementary Material.

Proposition 3.5.

DEMR with $J^{-1}_{\mathcal{G}}$ gives the MLE of an element on $\mathcal{G}(m,\mathbb{R}^{n})$, assuming $\Sigma$ is an identity matrix.

3.3 Generalization Ability

3.3.1 Failure of DNN with Euclidean output space

In light of the analysis in Section 2.3, the output space is produced by a linear transformation of the convolutional feature space spanned by the extracted features. The feature maps extracted by the neural network, organized as matrices in Equation (1), form the basis. Denote the linear subspace spanned by the feature-map basis $\{c_{1}^{\mathcal{I}},\ldots,c_{n}^{\mathcal{I}}\}$ by $\mathbb{R}_{feature}$ and let $\mathbb{R}_{output}$ be the low-dimensional output space spanned by $\{o_{1},\ldots,o_{n_{o}}\}$, where $n_{o}$ denotes the output dimension; then $F_{NN}(\mathcal{I})\in\mathbb{R}_{output}$. For a new test input $\mathcal{I}^{\prime}$ whose extracted feature map belongs to the complementary space of $\mathbb{R}_{feature}$, we have $F_{NN}(\mathcal{I}^{\prime})\notin\mathbb{R}_{output}$. This accounts for the failure of some DNN models with Euclidean output.

3.3.2 Representation power of structured output space

This section studies the enhancement of the representational power of DEMR when the output space is endowed with geometric structure. As a linear action, a representation of a Lie group is a smooth group homomorphism $\Pi:G\rightarrow GL(V)$ on the $n$-dimensional vector space $V$, where $GL(V)$ is the general linear group of all invertible linear transformations.

Proposition 3.6.

Any element of dimension $n$ on $SO(n)$ belongs to the image of $J^{-1}_{SO(n)}$ from known rotations within a certain range, if the Euclidean input of $J^{-1}_{SO(n)}$ has more than $n$ dimensions.

Corollary 3.7.

Any element of dimension $n$ on $SE(n)$ belongs to the image of $J^{-1}_{SE(n)}$ from known rotations within a certain range, if the Euclidean input of $J^{-1}_{SE(n)}$ has more than $n$ dimensions.

Then, for a linear representation of a Lie group in matrix form, given a set of basis vectors of $V$, we show that the output of DEMR extrapolates beyond the input samples better than common deep learning settings with unstructured output, which resolves the problem raised in (Sattler et al., 2019).

4 Experiments

In this section, we demonstrate the effectiveness of applying extrinsic embedding in deep learning settings on two manifolds representative of computer vision. The experiments address the following questions:

  • Does DEMR improve performance on the given tasks?

  • Does the geometric structure boost model performance on unseen cases, e.g., the ability to extrapolate beyond the training set?

  • Does extrinsic embedding yield valid geometric restrictions?

The validations are conducted on two canonical manifold applications in computer vision.

4.1 Task I: affine motions on $SE(3)$

Estimating the relative position and rotation between two point clouds has a wide range of downstream applications. The reference and target point clouds $P_{R},P_{t}\in\mathbb{R}^{N\times 3}$ have the same size and shape, with no scale transformation. The relative translation and rotation can be arranged separately as vectors or together in a matrix lying on $SE(3)$.

4.1.1 Experimental Setup

Training Detail

During training, at each iteration, a randomly chosen point cloud from the 2,290 airplanes is transformed by rotations and translations randomly sampled in batches. The rotations are sampled according to the model, i.e., models producing axis-angles are fed rotations sampled as axis-angles, and so on.

Comparison metrics

To validate the model's ability to preserve geometric structure, geodesic distance is the metric of choice at the testing stage. (Zhou et al., 2019; Levinson et al., 2020) use the minimal angular difference to evaluate the difference between rotations, which is not entirely compatible with Euclidean groups, since it evaluates the translation part and the rotation part separately. For two rotations $R_{1},R_{2}$ and $R^{\prime}=R_{1}R_{2}^{-1}$ with trace $tr(R^{\prime})$, the angular error is $L_{angle}=\cos^{-1}((tr(R^{\prime})-1)/2)$. For the intrinsic metric, the geodesic distance between two group elements is defined with the Frobenius norm for matrices, $d_{int}(M_{1},M_{2})=\|\log(M_{1}^{-1}M_{2})\|$, where $\log(M_{1}^{-1}M_{2})$ denotes the logarithm map on the Lie group. For the extrinsic metric, we take the mean squared error (MSE) between $J_{SE}(M_{1})$ and $J_{SE}(M_{2})$.

Table 1: On the task of estimating relative affine transformations on SE(3)SE(3) for point clouds, the results are reported by the geodesic distance between the estimation and the ground truth. The best results across models are emphasized, where the smallest errors come from extrinsic embedding.
Mode avg median std
euler 26.83 15.54 32.23
axis 31.96 10.66 27.75
6D 10.28 5.36 21.54
9D 10.23 6.09 25.30

For models with Euler-angle and axis-angle output, the predicted rotation and translation are produced in Euclidean form and then reorganized into the Lie algebra $se(3)$. Finally, the distances are calculated on $SO(3)$ by applying the exponential map to the $se(3)$ forms. (A minimal sketch of the evaluation metrics is given below.)
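The following NumPy/SciPy sketch illustrates the metrics described above (angular difference, intrinsic geodesic distance via the matrix logarithm, and extrinsic MSE between embedded matrices); the function names are ours.

```python
import numpy as np
from scipy.linalg import logm

def angular_error(R1, R2):
    """Minimal angular difference L_angle = arccos((tr(R1 R2^{-1}) - 1) / 2), in radians."""
    cos_val = (np.trace(R1 @ R2.T) - 1.0) / 2.0
    return np.arccos(np.clip(cos_val, -1.0, 1.0))

def geodesic_distance(M1, M2):
    """Intrinsic metric d_int(M1, M2) = ||log(M1^{-1} M2)||_F on the matrix Lie group."""
    L = logm(np.linalg.inv(M1) @ M2)
    return float(np.linalg.norm(L, 'fro'))

def extrinsic_mse(M1, M2):
    """Extrinsic metric: MSE between the flattened embeddings J_SE(M1) and J_SE(M2)."""
    return float(np.mean((np.ravel(M1) - np.ravel(M2)) ** 2))
```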

Architecture detail

Similar to (Zhou et al., 2019; Levinson et al., 2020), the backbone for point cloud feature extraction is a weight-sharing Siamese architecture containing two simplified PointNet structures (Qi et al., 2017), $\Phi_{i}:\mathbb{R}^{N\times 3}\rightarrow\mathbb{R}^{1024}$, $i=1,2$. After extracting the feature of each point with an MLP, both $\Phi_{1}$ and $\Phi_{2}$ use max pooling to produce single vectors $z_{1},z_{2}$ respectively, as representations of the features across all points in $P_{r},P_{t}$. Concatenating $z_{1},z_{2}$ into $z$, another MLP maps the concatenated feature to the final higher-order Euclidean representation $Y_{\mathbb{E}}$. The Euler-angle, axis-angle, and quaternion representations are obtained directly from the last linear layer of the backbone, with output dimensions 3, 3, and 4 respectively. For manifold output, the $SE(3)$ representation is given by both the cross product and SVD, with details in the Supplementary Material. The translation part is produced by another $\mathtt{fc}$ layer of 3 dimensions. Because the MSE between two isometric matrices is computed entrywise, the total MSE loss is the sum of the rotation loss and the translation loss.
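A minimal PyTorch sketch of this backbone is given below; the layer widths and class names are illustrative choices rather than the exact configuration used in the experiments.

```python
import torch
import torch.nn as nn

class SimplePointNet(nn.Module):
    """Per-point MLP followed by max pooling: (B, N, 3) -> (B, 1024)."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                                 nn.Linear(64, 256), nn.ReLU(),
                                 nn.Linear(256, 1024))

    def forward(self, pts):
        return self.mlp(pts).max(dim=1).values

class RelativePoseNet(nn.Module):
    """Weight-sharing Siamese backbone; the head outputs the embedded rotation and translation."""
    def __init__(self, rot_dim=9, trans_dim=3):
        super().__init__()
        self.rot_dim = rot_dim
        self.phi = SimplePointNet()                 # shared between reference and target clouds
        self.head = nn.Sequential(nn.Linear(2048, 512), nn.ReLU(),
                                  nn.Linear(512, rot_dim + trans_dim))

    def forward(self, p_ref, p_tgt):
        z = torch.cat([self.phi(p_ref), self.phi(p_tgt)], dim=1)
        out = self.head(z)
        return out[:, :self.rot_dim], out[:, self.rot_dim:]  # e.g. 9D rotation embedding, 3D translation
```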

Estimation accuracy and conformance

To validate the necessity of a structured output space, we first compare the affine-transformation estimation network with different output formats. The translation part naturally takes the form of a vector, while the rotation part can be given in Euclidean representations or as $SO(3)$ in matrix form. The dataset is composed of generated point cloud pairs with random transformations, where the rotations are uniformly sampled from the various representations, namely Euler angles and axis-angles in 3-dimensional vector space, and $SO(3)$. The translation part is randomly sampled from a standard normal distribution.

The advantage of the manifold output produced by baseline DEMR is revealed in Table 1 and Figure 4: output via extrinsic embedding takes the lead across all three statistics.

Figure 4: Comparison of the cumulative distributions of position errors for the pose regression task on $SE(3)$.
Table 2: Comparison of different embeddings. DEMR takes the lead in prediction accuracy with its two embedding functions, namely 6D and 9D. Even as the ratio of unseen cases increases, DEMR maintains better performance throughout.
10% 20% 40% 60% 80%
avg med std avg med std avg med std avg med std avg med std
euler 77.81 54.42 77.52 35.58 13.07 48.94 28.82 17.02 30.89 27.22 16.27 35.52 27.22 16.27 31.75
axis 75.57 50.71 78.60 36.95 12.38 54.40 20.73 12.51 27.63 21.95 12.75 27.15 18.96 10.66 24.75
6D 71.09 29.39 80.09 17.12 5.25 40.54 10.85 5.07 19.60 9.76 4.93 19.60 10.29 5.01 21.08
9D 70.72 30.99 63.18 19.83 5.62 39.22 12.91 6.28 22.66 12.24 6.40 24.18 10.23 5.31 22.60
Table 3: Test results reported as the average $D_{\mathcal{G}}$, with an example result for one test sample and the epoch at which each model converged.
Input GrassmannNet (Lohit & Turaga, 2017) DIMR DEMR
(example input and reconstructed subspace results shown as images)
avg $D_{\mathcal{G}}(U_{gt},U_{out})$  9.6826  3.4551  3.4721
epoch 320 680 140
Generalization ability

To evaluate the improvement of DEMR in generalization ability compared with unstructured output spaces, the training set contains only a small portion of the whole affine transformation space, while the test set encompasses affine transformations sampled from the whole rotation space. To keep the sampling step fair across the compared representations, sampling is conducted uniformly in the respective representation spaces. For models with axis-angle and Euler-angle output, the input point cloud pair is constructed with rotation representations whose entries are sampled i.i.d. from Euclidean ranges. For Euler representations, the lower bounds of the ranges are $-\pi$, and the upper bounds are $\pi$ multiplied by $90\%,80\%,60\%,40\%,20\%$, while the test set is sampled from $[-\pi,\pi]$. The training set with axis-angle representations is obtained by sampling Euclidean ranges in the same manner. When constructing a training set on $SO(3)$, the uniform sampling is conducted in the same way as the axis-angle sampling, segments are taken with the same ratios as in the two aforementioned settings, and the exponential map is then used to yield random samples on $SO(3)$. Facing unseen cases, DEMR has better extrapolation capability than plain deep learning settings with unstructured output spaces. As shown in Table 2, results from deep extrinsic learning show great competence regardless of the size of the training portion.

4.2 Task II: illumination subspace, the Grassmann manifold

Changes in pose, expression, and illumination conditions are inevitable in real-world applications and seriously impair the performance of face recognition models. It is widely acknowledged that an image set of a human face under different conditions lies close to a low-dimensional Euclidean subspace. After vectorizing the images of each person into one matrix in the Yale-B dataset (Georghiades et al., 2001), experiments reveal that the top $d=5$ principal components (PCs) generally capture more than $90\%$ of the singular values. As an essential application of DEMR, we take a feature-extraction CNN as $T(\cdot)$ in Equation 1, estimate in the high-dimensional Euclidean space, and obtain the final result by inversely mapping the estimate onto the Grassmann manifold.
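As an illustration of how the ground-truth subspace can be constructed, the NumPy sketch below stacks the vectorized images of one subject and keeps the top $d=5$ left singular vectors; this is one plausible reading of the procedure, and the helper name is ours.

```python
import numpy as np

def illumination_subspace(images, d=5):
    """images: (64, 784) vectorized images of one subject under 64 illumination conditions."""
    X = np.asarray(images, dtype=float).T            # (784, 64): one image per column
    U, s, _ = np.linalg.svd(X, full_matrices=False)  # left singular vectors = principal components
    U_d = U[:, :d]                                   # top-d PCs E_i^1, ..., E_i^d (orthonormal)
    return U_d, U_d @ U_d.T                          # basis U_i and the network target U_i U_i^T
```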

4.2.1 Experimental Setup

Architectural detail

A similar work sharing the same background is (Lohit & Turaga, 2017), which assumes the output of the last layer of the neural network to lie on a matrix manifold or its tangent space. We take GrassmannNet under the former assumption in (Lohit & Turaga, 2017) as the baseline, with the training loss set to the mean squared error. For all compared models, the training-set ratio is 0.8, and all illumination angles are used. In addition, the data preparation step also follows (Lohit & Turaga, 2017), and other training details are recorded in the supplementary material. On the Extended Yale Face Database B, each grayscale image of $168\times 192$ is first resized to $256\times 256$ as the CNN input, while for the subspace ground truth the images are resized to $28\times 28=784$ pixels for convenience of computation.

Dataset processing

The input face image $I_{i}^{j}$ denotes the $i$-th face under the $j$-th illumination condition, where $j=1,\ldots,64$ for the Yale B dataset. If we adopt the top $d$ PCs, the output is assumed to take the form $\{E_{i}^{1},E_{i}^{2},\ldots,E_{i}^{d}\}$, $k=1,\ldots,d$, with $\langle E_{i}^{k},E_{i}^{l}\rangle=\delta_{kl}$, where $\delta_{kl}$ is the Kronecker delta. Each PC is rearranged as a vector, so $vec(E_{i}^{k})$ has size $784\times 1$; we define $U_{i}=[vec(E_{i}^{1}),vec(E_{i}^{2}),\ldots,vec(E_{i}^{d})]$, and the output of the network is $U_{i}U_{i}^{\top}$. Most of the intrinsic distances defined on the Grassmann manifold are based on principal angles (PAs): for $P_{1},P_{2}\in\mathcal{G}(m,\mathbb{R}^{n})$ with SVD $P_{1}^{\top}P_{2}=USV^{\top}$, where $S=diag\{\cos(\theta_{1}),\cos(\theta_{2}),\ldots,\cos(\theta_{m})\}$, the Binet-Cauchy (BC) distance is $1-\Pi_{k=1}^{K}\cos^{2}\theta_{k}$ and the Martin (MA) distance is $\log\Pi_{k=1}^{K}(\cos^{2}\theta_{k})^{-1}$. For comparison of effectiveness and training advantage, the deep intrinsic learning takes $D_{\mathcal{G}}(U_{gt},U_{out})$ as both the training loss and the testing metric, where $U_{gt}$ and $U_{out}$ denote the ground truth and the network output respectively.
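A minimal NumPy sketch of the principal-angle based distances mentioned above is given below, assuming $P_1, P_2$ are $n\times m$ matrices with orthonormal columns; the function names are ours.

```python
import numpy as np

def principal_angle_cosines(P1, P2):
    """Singular values of P1^T P2 are cos(theta_1), ..., cos(theta_m)."""
    s = np.linalg.svd(P1.T @ P2, compute_uv=False)
    return np.clip(s, 0.0, 1.0)

def binet_cauchy_distance(P1, P2):
    """BC distance: 1 - prod_k cos^2(theta_k)."""
    c = principal_angle_cosines(P1, P2)
    return 1.0 - np.prod(c ** 2)

def martin_distance(P1, P2, eps=1e-12):
    """Martin distance: log prod_k cos^{-2}(theta_k), guarded against vanishing cosines."""
    c = principal_angle_cosines(P1, P2)
    return float(np.sum(-2.0 * np.log(c + eps)))
```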

Estimation accuracy and conformance

The results of illumination subspace estimation are given in Table 3. Measured by the average geodesic loss, the DEMR architecture achieves nearly the same performance as DIMR while drastically reducing the training time to convergence. With the batch size set to 10, we record the training loss every 10 epochs and report the approximate epoch at which the training loss converged in the third row of Table 3. Moreover, both DIMR and DEMR surpass GrassmannNet on the task of illumination subspace estimation, and DEMR still takes fewer epochs to train than GrassmannNet. In the first row of Table 3, the results from GrassmannNet, DIMR, and DEMR on one of the test samples are displayed. In addition, the conformance of the extrinsic and intrinsic metrics is reported in the supplementary material.

5 Conclusion

This paper presents Deep Extrinsic Manifold Representation (DEMR), which incorporates extrinsic embedding into DNNs to circumvent the direct computation of intrinsic quantities. Experimental results demonstrate that retaining the geometric structure of the manifold enhances overall performance and generalization ability on the original tasks. Furthermore, extrinsic embedding exhibits superior computational advantages over intrinsic methods. Looking ahead, we envision a more unified, parameterized formulation of extrinsic embedding techniques within deep learning settings.

References

  • Berkels et al. (2013) Berkels, B., Fletcher, P. T., Heeren, B., Rumpf, M., and Wirth, B. Discrete geodesic regression in shape space. In International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition, pp.  108–122. Springer, 2013.
  • Bermudez et al. (2018) Bermudez, C., Plassard, A. J., Davis, L. T., Newton, A. T., Resnick, S. M., and Landman, B. A. Learning implicit brain mri manifolds with deep learning. In Medical Imaging 2018: Image Processing, volume 10574, pp.  105741L. International Society for Optics and Photonics, 2018.
  • Bhattacharya & Patrangenaru (2003) Bhattacharya, R. and Patrangenaru, V. Large sample theory of intrinsic and extrinsic sample means on manifolds. The Annals of Statistics, 31(1):1–29, 2003.
  • Bhattacharya et al. (2012) Bhattacharya, R. N., Ellingson, L., Liu, X., Patrangenaru, V., and Crane, M. Extrinsic analysis on manifolds is computationally faster than intrinsic analysis with applications to quality control by machine vision. Applied Stochastic Models in Business and Industry, 28(3):222–235, 2012.
  • Boumal (2020) Boumal, N. An introduction to optimization on smooth manifolds. Available online, May, 3, 2020.
  • Bourmaud et al. (2015) Bourmaud, G., Mégret, R., Arnaudon, M., and Giremus, A. Continuous-discrete extended kalman filter on matrix lie groups using concentrated gaussian distributions. Journal of Mathematical Imaging and Vision, 51:209–228, 2015.
  • Bronstein et al. (2017) Bronstein, M. M., Bruna, J., LeCun, Y., Szlam, A., and Vandergheynst, P. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
  • Bronstein et al. (2021) Bronstein, M. M., Bruna, J., Cohen, T., and Veličković, P. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. arXiv preprint arXiv:2104.13478, 2021.
  • Byravan & Fox (2017) Byravan, A. and Fox, D. Se3-nets: Learning rigid body motion using deep neural networks. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp.  173–180. IEEE, 2017.
  • Can et al. (2021) Can, U., Utku, A., Unal, I., and Alatas, B. Deeper in data science: Geometric deep learning. PROCEEDINGS BOOKS, pp.  21, 2021.
  • Cao et al. (2020) Cao, W., Yan, Z., He, Z., and He, Z. A comprehensive survey on geometric deep learning. IEEE Access, 8:35929–35949, 2020.
  • Chen et al. (2021) Chen, J., Yin, Y., Birdal, T., Chen, B., Guibas, L., and Wang, H. Projective manifold gradient layer for deep rotation regression. arXiv preprint arXiv:2110.11657, 2021.
  • Cornea et al. (2017) Cornea, E., Zhu, H., Kim, P., Ibrahim, J. G., and Initiative, A. D. N. Regression models on riemannian symmetric spaces. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(2):463–482, 2017.
  • Fang et al. (2023) Fang, Y., Ohn, I., Gupta, V., and Lin, L. Intrinsic and extrinsic deep learning on manifolds. arXiv preprint arXiv:2302.08606, 2023.
  • Fletcher (2013) Fletcher, P. T. Geodesic regression and the theory of least squares on riemannian manifolds. International journal of computer vision, 105(2):171–185, 2013.
  • Fletcher (2011) Fletcher, T. Geodesic regression on riemannian manifolds. In Proceedings of the Third International Workshop on Mathematical Foundations of Computational Anatomy-Geometrical and Statistical Methods for Modelling Biological Shape Variability, pp.  75–86, 2011.
  • Georghiades et al. (2001) Georghiades, A., Belhumeur, P., and Kriegman, D. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Trans. Pattern Anal. Mach. Intelligence, 23(6):643–660, 2001.
  • Hinkle et al. (2012) Hinkle, J., Muralidharan, P., Fletcher, P. T., and Joshi, S. Polynomial regression on riemannian manifolds. In European conference on computer vision, pp.  1–14. Springer, 2012.
  • Huang et al. (2021) Huang, Z., Cai, H., Dan, T., Lin, Y., Laurienti, P., and Wu, G. Detecting brain state changes by geometric deep learning of functional dynamics on riemannian manifold. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp.  543–552. Springer, 2021.
  • Khayatkhoei et al. (2018) Khayatkhoei, M., Singh, M. K., and Elgammal, A. Disconnected manifold learning for generative adversarial networks. Advances in Neural Information Processing Systems, 31, 2018.
  • Lee (2021) Lee, H. Robust extrinsic regression analysis for manifold valued data. arXiv preprint arXiv:2101.11872, 2021.
  • Levinson et al. (2020) Levinson, J., Esteves, C., Chen, K., Snavely, N., Kanazawa, A., Rostamizadeh, A., and Makadia, A. An analysis of svd for deep rotation estimation. arXiv preprint arXiv:2006.14616, 2020.
  • Lin et al. (2017) Lin, L., St. Thomas, B., Zhu, H., and Dunson, D. B. Extrinsic local regression on manifold-valued data. Journal of the American Statistical Association, 112(519):1261–1273, 2017.
  • Lohit & Turaga (2017) Lohit, S. and Turaga, P. Learning invariant riemannian geometric representations using deep nets. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp.  1329–1338, 2017.
  • Ni et al. (2021) Ni, Y., Koniusz, P., Hartley, R., and Nock, R. Manifold learning benefits gans. arXiv preprint arXiv:2112.12618, 2021.
  • Qi et al. (2017) Qi, C. R., Su, H., Mo, K., and Guibas, L. J. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  652–660, 2017.
  • Sattler et al. (2019) Sattler, T., Zhou, Q., Pollefeys, M., and Leal-Taixe, L. Understanding the limitations of cnn-based absolute camera pose regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  3302–3312, 2019.
  • Shi et al. (2009) Shi, X., Styner, M., Lieberman, J., Ibrahim, J. G., Lin, W., and Zhu, H. Intrinsic regression models for manifold-valued data. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp.  192–199. Springer, 2009.
  • Shin & Oh (2022) Shin, H.-Y. and Oh, H.-S. Robust geodesic regression. International Journal of Computer Vision, pp.  1–26, 2022.
  • Steinke et al. (2010) Steinke, F., Hein, M., and Schölkopf, B. Nonparametric regression between general riemannian manifolds. SIAM Journal on Imaging Sciences, 3(3):527–563, 2010.
  • Yang et al. (2022) Yang, C.-H., Vemuri, B. C., et al. Nested grassmannians for dimensionality reduction with applications. Machine Learning for Biomedical Imaging, 1(IPMI 2021 special issue):1–10, 2022.
  • Zhang (2020) Zhang, Y. Bayesian geodesic regression on riemannian manifolds. arXiv preprint arXiv:2009.05108, 2020.
  • Zhou et al. (2019) Zhou, Y., Barnes, C., Lu, J., Yang, J., and Li, H. On the continuity of rotation representations in neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  5745–5753, 2019.

Appendix

5.0.1 Proofs in Section 3

Consistency of extrinsic loss and intrinsic loss
Proof of lemma 3.1
Lemma 5.1.

Suppose that $\mathcal{M}_{1},\mathcal{M}_{2}$ are smooth and compact Riemannian manifolds and $f:\mathcal{M}_{1}\rightarrow\mathcal{M}_{2}$ is a diffeomorphism. Then $f$ is bilipschitz w.r.t. the Riemannian distance.

Proof.

For $\mathcal{M}_{1}$ and $\mathcal{M}_{2}$ with smooth metrics $\rho_{1}$ and $\rho_{2}$ respectively, consider the smooth and continuous map $\mathrm{d}f:U_{1}\rightarrow T\mathcal{M}_{2}$, where $U_{1}$ denotes the unit tangent bundle of $\mathcal{M}_{1}$ and $T\mathcal{M}_{2}$ the tangent bundle of $\mathcal{M}_{2}$; then the function $\phi(u)=|\mathrm{d}f(u)|$, $u\in U_{1}$, is continuous as well. Since $U_{1}$ is compact, the maximum $C$ exists. Denoting the length of a path $r:[a,b]\rightarrow\mathcal{M}_{1}$ by $L(\cdot)$, we have $L(f\circ r)=\int_{a}^{b}|(f\circ r)^{\prime}|\mathrm{d}t\leq C\int_{a}^{b}|r^{\prime}(t)|\mathrm{d}t=CL(r)$, thus $\rho_{2}(f(t_{1}),f(t_{2}))\leq C\rho_{1}(t_{1},t_{2})$ for $t_{1},t_{2}\in\mathcal{M}_{1}$; then $f$ is $C$-Lipschitz and $f^{-1}$ is $\frac{1}{C}$-Lipschitz. ∎

Proof of proposition 3.2

The extrinsic metric should reflect the tendency of the intrinsic metric. Here we pay particular attention to consistency when the losses are small, because this consistency is crucial for the convergence of the extrinsic loss. More specifically, for a smooth embedding $J:\mathcal{M}\rightarrow\mathbb{R}^{m}$, where the $n$-manifold $\mathcal{M}$ is compact with metric $\rho$ and the metric of $\mathbb{R}^{m}$ is $d$, a vanishing change in $d$ implies a vanishing change in $\rho$, which is a direct result if $J$ is bilipschitz. Here the compactness condition on the embedded space is relaxed, so that $J$ is only required to be locally bilipschitz, while the proof follows the same idea as the lemma above.

Proposition 5.2.

For a smooth embedding $J:\mathcal{M}\rightarrow\mathbb{R}^{m}$, where the $n$-manifold $\mathcal{M}$ is compact with metric $\rho$ and the metric of $\mathbb{R}^{m}$ is denoted by $d$: for any two sequences $\{x_{k}\},\{y_{k}\}$ of points in $\mathcal{M}$ and their images $\{J(x_{k})\},\{J(y_{k})\}$, if $\lim_{n\rightarrow\infty}d(J(x_{n}),J(y_{n}))=0$, then $\lim_{n\rightarrow\infty}\rho(x_{n},y_{n})=0$.

Proof.

Given that every compact submanifold $Y_{s}=J(\mathcal{M})\subset Y$, where $Y\subset\mathbb{R}^{N}$, has a positive normal injectivity radius, there exists a positive constant $r$ such that $B_{r}v_{Y_{s}}=\{v\in v_{Y_{s}}:|v|\leq r\}$, with $v_{Y_{s}}$ the normal bundle of $Y_{s}$ in $Y$, and the normal exponential map $\exp_{Y_{s}}:v_{Y_{s}}\rightarrow Y$ is a diffeomorphism onto its image $S=\exp(B_{r}(v_{Y_{s}}))$. Because the exponential map $\exp_{Y_{s}}$ of $\mathcal{M}$ is a diffeomorphism, $S$ is an open neighborhood of $Y_{s}$, and the inverse of $\exp_{Y_{s}}$, $\log_{Y_{s}}$, is smooth. Let the retraction $g=p\circ\log_{Y_{s}}:S\rightarrow Y_{s}$ be the composition of the projection $p$ and $\log_{Y_{s}}$, and let the closure of $B_{r^{\prime}}v_{Y_{s}}$ be $\bar{B}_{r^{\prime}}v_{Y_{s}}$ for $0<r^{\prime}<r$; then its image under $\exp_{Y_{s}}$ is a compact submanifold with boundary $Z\subset Y_{s}$. Hence there exists a real constant $C_{1}\geq 0$ such that $g$ is $C_{1}$-Lipschitz on $Z$. From Lemma 3.1 there exists a real constant $C_{2}\geq 0$ such that $J$ is $C_{2}$-Lipschitz. Moreover, since $g$ is the right inverse of the inclusion map $i:Y_{s}\rightarrow Y$, it follows that $i$ is $C_{1}$-bilipschitz when restricted to a sufficiently small ball $B_{\epsilon}$, where $\epsilon<r^{\prime}$.

Next, define the map $\phi(y_{1},y_{2})=\frac{d_{Y_{s}}(y_{1},y_{2})}{d_{Y}(y_{1},y_{2})}$, where $(y_{1},y_{2})\in Y_{s}\times Y_{s}$ and $y_{1}\neq y_{2}$; $\phi$ is continuous since the distance functions $d_{Y_{s}}$ and $d_{Y}$ are continuous. On a compact subset $S_{comp}\subset Y_{s}\times Y_{s}$, there is a real value $C_{3}>0$ such that $\phi$ is bounded by $C_{3}$ on $S_{comp}$. Finally, $J$ is locally $C_{max}$-bilipschitz, where $C_{max}=\max(C_{1},C_{2},C_{3})$. Further, for any two sequences $\{x_{k}\},\{y_{k}\}$ of points in $\mathcal{M}$ and their images $\{J(x_{k})\},\{J(y_{k})\}$, if $\lim_{n\rightarrow\infty}d(J(x_{n}),J(y_{n}))=0$, then $\rho(x_{n},y_{n})\leq C_{max}\,d(J(x_{n}),J(y_{n}))$ and $\lim_{n\rightarrow\infty}\rho(x_{n},y_{n})=0$. ∎

5.0.2 Asymptotic maximum likelihood estimation (MLE)

Proof of proposition 3.3
Proposition 5.3.

$J^{-1}_{SO}:\mathbb{R}^{9}\rightarrow SO(3)$ gives an approximation of the MLE of rotations on $SO(3)$, assuming $\Sigma=\sigma I$, where $I$ is the identity matrix and $\sigma$ an arbitrary real value.

Proof.

For $x\sim\mathcal{N}_{\mathcal{G}}(\mu,\Sigma)$, we have $x=\mu\exp_{\mathcal{M}}([N]^{\vee}_{\mathcal{M}})$, and the simplified log-likelihood function is

$L(Y;F_{\Theta_{NN}},\Sigma)=[\log_{\mathcal{M}}(F_{\Theta_{NN}}-Y)]_{\mathcal{M}}^{\vee\top}\Sigma^{-1}[\log_{\mathcal{M}}(F_{\Theta_{NN}}-Y)]_{\mathcal{M}}^{\vee}$ (4)

with $[\log_{\mathcal{M}}(F_{\Theta_{NN}}-Y)]_{\mathcal{M}}^{\vee}$ set to $\epsilon$; if the pdf is concentrated around the group identity, i.e., the fluctuation of $\epsilon$ is small, the distribution of the noise $\epsilon$ can be approximated by $\mathcal{N}_{\mathbb{R}^{3}}(\mathbf{0}_{3\times 1},\Sigma)$ on $\mathbb{R}^{3}$. Then $\arg\max_{Y\in SO(3)}L(Y;F_{\Theta_{NN}},\Sigma)\approx\arg\min_{Y\in SO(3)}(F_{\Theta_{NN}}-Y)^{\top}(F_{\Theta_{NN}}-Y)=\arg\min_{Y\in SO(3)}\|F_{\Theta_{NN}}-Y\|_{F}^{2}$. ∎

Proof of proposition 3.4
Proposition 5.4.

$J^{-1}_{SE}:\mathbb{R}^{9}\rightarrow SE(3)$, where the rotation part comes from $\mathtt{SVD}$, gives an approximation of the MLE of transformations on $SE(3)$, assuming $\Sigma=\sigma I$, where $I$ is the identity matrix and $\sigma$ an arbitrary real value.

Proof.

For $x\sim\mathcal{N}_{\mathcal{G}}(\mu,\Sigma)$, we have $x=\mu\exp_{\mathcal{M}}([N]^{\vee}_{\mathcal{M}})$, and the simplified log-likelihood function is

$L(Y;F_{\Theta_{NN}},\Sigma)=[\log_{\mathcal{M}}(F_{\Theta_{NN}}-Y)]_{\mathcal{M}}^{\vee\top}\Sigma^{-1}[\log_{\mathcal{M}}(F_{\Theta_{NN}}-Y)]_{\mathcal{M}}^{\vee}$ (5)

with $[\log_{\mathcal{M}}(F_{\Theta_{NN}}-Y)]_{\mathcal{M}}^{\vee}$ set to $\epsilon$; if the pdf is concentrated around the group identity, i.e., the fluctuation of $\epsilon$ is small, the distribution of the noise $\epsilon$ can be approximated by $\mathcal{N}_{\mathbb{R}^{3}}(\mathbf{0}_{3\times 1},\Sigma)$ on $\mathbb{R}^{3}$. Then $\arg\max_{Y\in SO(3)}L(Y;F_{\Theta_{NN}},\Sigma)\approx\arg\min_{Y\in SO(3)}(F_{\Theta_{NN}}-Y)^{\top}(F_{\Theta_{NN}}-Y)=\arg\min_{Y\in SO(3)}\|F_{\Theta_{NN}}-Y\|_{F}^{2}$. ∎

Proof of proposition 3.5

For $\mathcal{G}(m,\mathbb{R}^{n})$, the probability density function of the MACG is

$P(x;R_{\mu},\Sigma)=\|\Sigma\|^{-\frac{r}{2}}\|(R_{\mu}x)^{\top}\Sigma^{-1}(R_{\mu}x)\|^{-\frac{p}{2}}=\|\Sigma\|^{-\frac{r}{2}}\|x^{\top}\Sigma^{-1}x\|^{-\frac{p}{2}}$ (6)

where $R_{\mu}\in O(n)$ and the covariance matrix $\Sigma$ is symmetric positive-definite.

As clarified in Section 2.2.2, we obtain the Grassmann matrix by computing equivalently on $SPD_{m}^{++}$ with the diffeomorphic mapping $J^{-1}_{\mathcal{G}}=\mathtt{UU^{\top}}$. The error model is modified to $F_{\Theta_{NN}}=Y^{\top}Y+N$, where both $Y^{\top}Y$ and $N$ are semi-positive definite matrices on $SPD^{++}_{m}$. The Gaussian distribution extended to the SPD manifold, for a random 2nd-order tensor (matrix) $X\in\mathbb{R}^{m\times m}$, is

$P(X;M,\mathcal{S})=\frac{1}{\sqrt{(2\pi)^{m}}|\mathcal{S}|}\exp\big(-\frac{1}{2}(X-M)^{\top}\mathcal{S}^{-1}(X-M)\big)$ (7)

with mean $M\in\mathbb{R}^{m\times m}$ and 4th-order covariance tensor $\mathcal{S}\in\mathbb{R}^{m\times m\times m\times m}$ inheriting symmetries from three dimensions. More specifically, $\mathcal{S}^{mn}_{ij}=\mathcal{S}^{nm}_{ij}=\mathcal{S}^{mn}_{ji}$ and $\mathcal{S}^{mn}_{ij}=\mathcal{S}^{ij}_{mn}$; thereby there is a vectorized version of Equation (3),

$P(X;\mu,\Sigma)=\frac{1}{\sqrt{(2\pi)^{\tilde{m}}}|\Sigma|}\exp\big(-\frac{1}{2}(X-\mu)^{\top}\Sigma^{-1}(X-\mu)\big)$ (8)

where $\tilde{m}=m+\frac{m(m-1)}{2}$, indicating the unfolded parameters:

$\mu=\begin{bmatrix}\mu_{11}\\ \mu_{22}\\ \vdots\\ \mu_{mm}\\ \mu_{12}\\ \mu_{13}\\ \vdots\\ \mu_{(m-1)m}\end{bmatrix}\quad\Sigma=\begin{bmatrix}\mathcal{S}^{11}_{11}&\mathcal{S}^{22}_{11}&\ldots&\mathcal{S}^{mm}_{11}&\sqrt{2}\mathcal{S}^{12}_{11}&\sqrt{2}\mathcal{S}^{13}_{11}&\ldots&\sqrt{2}\mathcal{S}^{1m}_{11}\\ \mathcal{S}^{11}_{22}&\mathcal{S}^{22}_{22}&\ldots&\mathcal{S}^{mm}_{22}&\sqrt{2}\mathcal{S}^{12}_{22}&\sqrt{2}\mathcal{S}^{13}_{22}&\ldots&\sqrt{2}\mathcal{S}^{1m}_{22}\\ \vdots&\vdots&&\vdots&\vdots&&\vdots\\ \mathcal{S}^{11}_{mm}&\mathcal{S}^{22}_{mm}&\ldots&\mathcal{S}^{mm}_{mm}&\sqrt{2}\mathcal{S}^{12}_{mm}&\sqrt{2}\mathcal{S}^{13}_{mm}&\ldots&\sqrt{2}\mathcal{S}^{1m}_{mm}\\ \sqrt{2}\mathcal{S}^{12}_{11}&\sqrt{2}\mathcal{S}^{12}_{22}&\ldots&\sqrt{2}\mathcal{S}^{12}_{mm}&2\mathcal{S}^{12}_{12}&2\sqrt{2}\mathcal{S}^{13}_{12}&\ldots&2\mathcal{S}^{1m}_{12}\\ \sqrt{2}\mathcal{S}^{13}_{11}&\mathcal{S}^{13}_{22}&\ldots&\sqrt{2}\mathcal{S}^{13}_{mm}&2\mathcal{S}^{13}_{13}&2\mathcal{S}^{13}_{13}&\ldots&2\mathcal{S}^{1m}_{13}\\ \vdots&\vdots&&\vdots&\vdots&&\vdots\\ \sqrt{2}\mathcal{S}^{1m}_{11}&\sqrt{2}\mathcal{S}_{22}^{2m}&\ldots&\sqrt{2}\mathcal{S}^{mm}_{1m}&2\mathcal{S}^{1m}_{12}&2\mathcal{S}^{13}_{1m}&\ldots&2\mathcal{S}^{mm}_{mm}\\ \end{bmatrix}$ (9)

Since each entry of $X$ comes from rearranging DNN outputs, which can be assumed not to correlate with the other entries, we have $\mathcal{S}^{ab}_{cd}=1$ when $a=b=c=d$ and $\mathcal{S}^{ab}_{cd}=0$ otherwise; the likelihood can then be further simplified with $\Sigma$ being an identity matrix. This yields the proof of Proposition 3.5.

Proposition 5.5.

DEMR with $J^{-1}_{\mathcal{G}}$ gives the MLE of an element on $\mathcal{G}(m,\mathbb{R}^{n})$, assuming $\Sigma$ is an identity matrix.

Proof.

The log-likelihood function of the pdf in Equation (8) gives $\arg\max_{Y\in SPD^{++}_{m}}L(Y;F_{\Theta_{NN}},\Sigma)=\arg\min_{Y\in SPD^{++}_{m}}(\tilde{F}_{\Theta_{NN}}-\tilde{Y})^{\top}(\tilde{F}_{\Theta_{NN}}-\tilde{Y})=\arg\min_{Y\in SPD^{++}_{m}}\|\tilde{F}_{\Theta_{NN}}-\tilde{Y}\|_{F}^{2}$, where $\tilde{F}_{\Theta_{NN}}$ and $\tilde{Y}$ are the vectorized versions of $F_{\Theta_{NN}}$ and $Y$ respectively. ∎

5.1 Generalization Ability

5.1.1 Proof of proposition 3.6

Proposition 5.6.

Any element of dimension $n$ on $SO(n)$ belongs to the image of $J^{-1}_{SO(n)}$ from known rotations within a certain range, if the Euclidean input of $J^{-1}_{SO(n)}$ has more than $n$ dimensions.

Proof.

The image of $J^{-1}_{SO(n)}$ is a set of $n$ orthogonal vectors that span the $n$-dimensional vector space, or it can be regarded as a parameterization of rotations around $n$ orthogonal axes, so they can be written as $\theta=[\theta_{1},\theta_{2},\ldots,\theta_{n}]^{\top}\triangleq\mathbf{u}\theta\triangleq\omega t\in\mathbb{R}^{n}$.

For the identity $\dot{R}=R[\omega]_{\times}\in T_{R}SO(n)$ with constant $\omega$, its solution is $R(t)=R_{0}\exp([\omega]_{\times}t)\overset{R_{0}=I}{=}\exp([\omega]_{\times}t)=\exp([\theta]_{\times})=\sum_{k}\frac{\theta^{k}}{k!}([u]_{\times})^{k}$, which means that, given $n$ orthogonal basis vectors, any rotation can be represented by transformations on $SO(n)$. ∎
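As a small numerical check of this identity, the matrix exponential of a skew-symmetric matrix $[\theta]_{\times}$ is orthogonal with unit determinant; the sketch below uses the $[\cdot]^{\wedge}$ map from Section 3.2.1.

```python
import numpy as np
from scipy.linalg import expm

def hat(theta):
    """[.]^: R^3 -> so(3)."""
    t1, t2, t3 = theta
    return np.array([[0.0, -t3,  t2],
                     [ t3, 0.0, -t1],
                     [-t2,  t1, 0.0]])

theta = np.array([0.3, -1.2, 0.7])                 # theta = u * angle
R = expm(hat(theta))                               # R = exp([theta]_x)
assert np.allclose(R.T @ R, np.eye(3), atol=1e-8)  # orthogonality
assert np.isclose(np.linalg.det(R), 1.0)           # det = +1, so R lies in SO(3)
```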


5.1.2 Proof of corollary 3.7

Corollary 5.7.

Any element of dimension $n$ on $SE(n)$ belongs to the image of $J^{-1}_{SE(n)}$ from known rotations within a certain range, if the Euclidean input of $J^{-1}_{SE(n)}$ has more than $n$ dimensions.