
Probabilistic Rotation Representation With an Efficiently Computable Bingham Loss Function and Its Application to Pose Estimation$^{*}$

Hiroya Sato$^{1,2}$, Takuya Ikeda$^{1}$, and Koichi Nishiwaki$^{1}$

This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

$^{*}$ This is a contribution in Woven Planet Holdings, Inc. Part of the work is the result of Summer Internship Program.

$^{1}$ All authors are with Woven Planet Holdings, Inc., 3 Chome-2-1 Nihonbashimuromachi, Chuo City, Tokyo, 103-0022, Japan, [firstname.lastname]@woven-planet.global

$^{2}$ Hiroya Sato is with the Department of Mechano-Informatics, Graduate School of Information Science and Technology, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-8656, Japan. h-sato@jsk.t.u-tokyo.ac.jp

Abstract

In recent years, deep learning frameworks have been widely used for object pose estimation. While the quaternion is a common choice for the rotation representation of a 6D pose, it cannot represent the uncertainty of the observation. To handle this uncertainty, the Bingham distribution is one promising solution because it has suitable features, such as a smooth representation over $\mathrm{SO}(3)$, in addition to the ability to represent ambiguity. However, it requires a complex computation of its normalizing constant, which is the bottleneck of loss computation when training neural networks based on the Bingham representation. We therefore propose a fast-computable and easy-to-implement loss function for the Bingham distribution. We also examine the parametrization of the Bingham distribution and show an application based on our loss function.

I INTRODUCTION

Recently, there have been many research efforts on pose estimation based on deep learning frameworks, such as [1, 2]. In these works, the quaternion is widely used as the rotation representation. However, since a single quaternion can only represent a single rotation, it cannot capture the uncertainty of the observation. Handling uncertainty is quite important, especially when a target object is occluded or has a symmetric shape [3, 4].

Many researchers have considered how to represent the ambiguity of rotations. One way to represent it is to use the Bingham distribution [5], which has two main advantages. Firstly, the Bingham distribution is a probability distribution that is consistent with the spatial rotation group $\mathrm{SO}(3)$ (described in detail in Section III-B) and is easy to parametrize. Secondly, a continuous representation, which is suitable for neural networks, can be derived from this distribution; Peretroukhin et al. [6] give one example. Because of these characteristics, we choose the Bingham distribution as our probabilistic rotation representation.

To optimize a probability distribution, the negative log-likelihood (NLL) is in general a common choice of loss function. Since the NLL is defined in a mathematically natural way, there are plenty of theorems about it. Moreover, it has the potential for wide application because it can be easily extended to more complicated models, such as mixture models [7, 8].

Figure 1: Inference sample of our Bingham model. The upper figure shows that the handle is visible and the corresponding distribution is low-variance, while the lower figure shows that the mug’s handle is occluded and the distribution is widely spread.

The Bingham distribution, however, has a normalizing constant that is difficult to calculate, which is the key bottleneck of Bingham NLL computation. Since the normalizing constant depends on the distribution’s parameters, we must compute it for each parameter value during optimization. To the best of our knowledge, one common solution is to prepare a pre-computed table of the constants so that the repeated, complicated calculation during optimization can be avoided. However, it takes time to create the table.

In this paper, we introduce a fast-computable and easy-to-implement NLL function for the Bingham distribution, which is based on a novel algorithm by Chen et al. [9]. We also examine the parametrization of the Bingham distribution. Furthermore, we show its application to object pose estimation in Section V and show how easily our method can be applied to an existing pose estimator.

II RELATED WORKS

II-A Continuous Rotation Representations

Four-dimensional rotation representations are widely used. In particular, the quaternion is a popular representation; it is used, for example, in PoseCNN [1], PoseNet [10], and 6D-VNet [11]. Another 4-dimensional representation is the axis-rotation representation, introduced, for example, in MapNet [12]. Although these 4D representations are valid in some cases, it is known that every $d$-dimensional ($d<5$) rotation representation is “discontinuous” (in the sense of [13]), because the 3-dimensional real projective space $\mathbb{R}P^{3}\ (\cong\mathrm{SO}(3))$ cannot be embedded in $\mathbb{R}^{d}$ unless $d\geq 5$ [14]. Because these representations have singular points that prevent the network from regressing stably, a continuous rotation representation is preferable for training neural networks.

Researchers have proposed various higher-dimensional representations. For example, a 9D representation based on the singular value decomposition (SVD) was proposed in [15]. A 10D representation was proposed in [6] using a 4-dimensional symmetric matrix, which can be used to parametrize the Bingham distribution. We will examine other continuous parametrizations in Section II-B.

II-B Expressions of Rotational Ambiguity

There are several ways to express rotational ambiguity. The authors of [3, 16] captured the ambiguity by using multiple quaternions and minimizing a special loss function. KOSNet [4] uses three parameters to describe the camera’s rotation (elevation, azimuth, and rotation around the optical axis) and employs a Gaussian distribution to describe their ambiguity. These two representations, however, are discontinuous, as described in Section II-A, since their dimensions are both less than 5.

Another approach is to use the Bingham distribution. This distribution is easy to parametrize and has been used in the pose estimation field: as the model distribution of a Bayesian filter for online pose estimation [17], for visual self-localization [7], for multiview fusion [8], and for describing the pose ambiguity of objects with symmetric shapes [18]. Moreover, this distribution admits a continuous parametrization; Peretroukhin et al. [6] give one example. We adopt the Bingham distribution as our probabilistic rotation representation.

II-C Loss Functions for Bingham Representation

An NLL loss function is the common choice and has advantages, as described in the Introduction. When computing the Bingham loss, the main barrier is the computation of the normalizing constant. One solution to this problem is to spend time pre-computing a table of normalizing constants. For example, Kume et al. [19] use the saddlepoint approximation technique to construct this table. However, even with a pre-computed table, we also need to implement a smooth interpolation function, as described in [18], to compute constants missing from the table and the derivative of each constant for backpropagation. Thus, it is desirable to use the Bingham NLL loss without a pre-computed table.

Instead of using the Bingham NLL loss, another loss function is defined in [6]. The authors defined a QCQP loss to avoid these troublesome computations. While it shows moderate performance by calculating a uni-modal Bingham loss implicitly, it is difficult to extend to more complicated models such as mixture models [7, 8], because the normalizing constant is required to mix multiple Bingham distributions. In contrast, our Bingham loss can be easily extended to mixture models.

III ROTATION REPRESENTATIONS

III-A Quaternion and Spatial Rotation

III-A1 Quaternion

We introduce symbols $i,j,k$ that satisfy the property

$$i^{2}=j^{2}=k^{2}=ijk=-1. \tag{1}$$

A quaternion is an expression of the form

$$q=w+xi+yj+zk, \tag{2}$$

where $w,x,y,z$ are real numbers and $i,j,k$ are called the imaginary units of the quaternion. The set of quaternions forms a 4D vector space whose basis is $\{1,i,j,k\}$. Therefore, we identify a quaternion $q$ defined in (2) with

$$\bm{q}=(w,x,y,z)^{\top}\in\mathbb{R}^{4}. \tag{3}$$

III-A2 Product of Quaternions

For any quaternion $q^{\prime}=a+bi+cj+dk$, we can define the product of quaternions $q^{\prime}q$ using rule (1). The set of nonzero quaternions forms a group under this multiplication. The product $q^{\prime}q$ can also be identified with an element of $\mathbb{R}^{4}$, which we denote by $\bm{q}^{\prime}\odot\bm{q}\in\mathbb{R}^{4}$. Note that $q^{\prime}q\neq qq^{\prime}$ in general. Since $\bm{q}^{\prime}\odot\bm{q}$ is bilinear w.r.t. $\bm{q}^{\prime}$ and $\bm{q}$, we can define matrices $\Omega_{L}(\bm{q}^{\prime})$ and $\Omega_{R}(\bm{q})$ satisfying

$$\bm{q}^{\prime}\odot\bm{q}=\Omega_{L}(\bm{q}^{\prime})\,\bm{q}=\Omega_{R}(\bm{q})\,\bm{q}^{\prime}. \tag{4}$$

$\Omega_{L}(\bm{q}^{\prime})$ and $\Omega_{R}(\bm{q})$ can be written in closed form:

$$\Omega_{L}(\bm{q}^{\prime})=\begin{pmatrix}a&-b&-c&-d\\ b&a&-d&c\\ c&d&a&-b\\ d&-c&b&a\end{pmatrix}, \tag{5}$$

$$\Omega_{R}(\bm{q})=\begin{pmatrix}w&-x&-y&-z\\ x&w&z&-y\\ y&-z&w&x\\ z&y&-x&w\end{pmatrix}. \tag{6}$$
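As a quick sanity check of (4)-(6), the following minimal sketch builds both matrices and compares them against the Hamilton product written out from rule (1). The function names (`omega_l`, `omega_r`, `quat_mul`, `matvec`) are our own illustrative choices, and plain Python lists are used instead of a linear algebra library.

```python
def omega_l(p):
    """Left-multiplication matrix of (5): omega_l(p) applied to q gives p ⊙ q."""
    a, b, c, d = p
    return [[a, -b, -c, -d],
            [b,  a, -d,  c],
            [c,  d,  a, -b],
            [d, -c,  b,  a]]


def omega_r(q):
    """Right-multiplication matrix of (6): omega_r(q) applied to p gives p ⊙ q."""
    w, x, y, z = q
    return [[w, -x, -y, -z],
            [x,  w,  z, -y],
            [y, -z,  w,  x],
            [z,  y, -x,  w]]


def matvec(m, v):
    """Plain matrix-vector product for nested lists."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]


def quat_mul(p, q):
    """Hamilton product p ⊙ q expanded directly from i^2 = j^2 = k^2 = ijk = -1."""
    a, b, c, d = p
    w, x, y, z = q
    return [a * w - b * x - c * y - d * z,
            a * x + b * w + c * z - d * y,
            a * y - b * z + c * w + d * x,
            a * z + b * y - c * x + d * w]


p = [1, 2, 3, 4]
q = [5, 6, 7, 8]
product = quat_mul(p, q)           # p ⊙ q
via_left = matvec(omega_l(p), q)   # Omega_L(p) q
via_right = matvec(omega_r(q), p)  # Omega_R(q) p
```

All three results agree, while `quat_mul(q, p)` differs from `quat_mul(p, q)`, which illustrates the non-commutativity noted above.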

III-A3 Conjugate, Norm, and Unit Quaternion

The conjugate of $\bm{q}$ is defined by $\overline{\bm{q}}=(w,-x,-y,-z)^{\top}$. In general, $\overline{\bm{q}\odot\bm{q}^{\prime}}=\overline{\bm{q}}^{\prime}\odot\overline{\bm{q}}$. In particular, if $\bm{q}=\bm{q}^{\prime}$, then

$$\bm{q}\odot\overline{\bm{q}}=\overline{\bm{q}}\odot\bm{q}=w^{2}+x^{2}+y^{2}+z^{2}. \tag{7}$$

From definitions (5) and (6), we get

$$\Omega_{L}(\overline{\bm{q}})=\Omega_{L}(\bm{q})^{\top},\quad\Omega_{R}(\overline{\bm{q}})=\Omega_{R}(\bm{q})^{\top}. \tag{8}$$

We define the norm of a quaternion $\|\bm{q}\|$ as

$$\|\bm{q}\|=\sqrt{\bm{q}\odot\overline{\bm{q}}}=\sqrt{\overline{\bm{q}}\odot\bm{q}}. \tag{9}$$

We call $\bm{q}$ a unit quaternion if $\|\bm{q}\|=1$. For a unit quaternion, its inverse coincides with its conjugate: $\bm{q}^{-1}=\overline{\bm{q}}$. Using (8), we can see that $\Omega_{L}(\bm{q})$ and $\Omega_{R}(\bm{q})$ are both orthogonal:

$$\Omega_{L}(\bm{q})^{\top}\Omega_{L}(\bm{q})=\Omega_{R}(\bm{q})^{\top}\Omega_{R}(\bm{q})=I_{4}, \tag{10}$$

where $\bm{q}$ is any unit quaternion and $I_{4}$ is the 4-dimensional identity matrix.
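The step from (8) to (10) can be spelled out as follows. This short derivation additionally uses the identities $\Omega_{L}(\bm{p})\Omega_{L}(\bm{q})=\Omega_{L}(\bm{p}\odot\bm{q})$ and $\Omega_{R}(\bm{p})\Omega_{R}(\bm{q})=\Omega_{R}(\bm{q}\odot\bm{p})$, which are not stated above but follow from the associativity of the quaternion product and definition (4).

```latex
\begin{align*}
\Omega_{L}(\bm{q})^{\top}\Omega_{L}(\bm{q})
  &= \Omega_{L}(\overline{\bm{q}})\,\Omega_{L}(\bm{q})
   = \Omega_{L}(\overline{\bm{q}}\odot\bm{q})
   = \|\bm{q}\|^{2}\, I_{4}, \\
\Omega_{R}(\bm{q})^{\top}\Omega_{R}(\bm{q})
  &= \Omega_{R}(\overline{\bm{q}})\,\Omega_{R}(\bm{q})
   = \Omega_{R}(\bm{q}\odot\overline{\bm{q}})
   = \|\bm{q}\|^{2}\, I_{4}.
\end{align*}
```

Here the last equality in each line uses (7) together with the fact that $\Omega_{L}$ and $\Omega_{R}$ of a real quaternion $(s,0,0,0)^{\top}$ equal $s\,I_{4}$. Setting $\|\bm{q}\|=1$ gives (10).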

We denote the set of unit quaternions by $\mathbb{S}^{3}$ because it is homeomorphic to the 3-sphere.

III-A4 Unit Quaternion and Spatial Rotation

It is well known that unit quaternions can represent spatial rotations. The mapping $R:\mathbb{S}^{3}\to\mathrm{SO}(3)$ defined below is in fact a group homomorphism:

$$R(\bm{q})=\begin{pmatrix}1-2y^{2}-2z^{2}&-2wz+2xy&2wy+2xz\\ 2wz+2xy&1-2x^{2}-2z^{2}&-2wx+2yz\\ -2wy+2xz&2wx+2yz&1-2x^{2}-2y^{2}\end{pmatrix}. \tag{14}$$

Crucially, antipodal unit quaternions represent the same rotation; namely, $R(-\bm{q})=R(\bm{q})$.
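The antipodal property is easy to confirm numerically. The sketch below transcribes (14) into a small helper (the name `rotation_matrix` is ours) and evaluates it for a 90 degree rotation about the $z$-axis and for its antipode.

```python
import math


def rotation_matrix(q):
    """Rotation matrix R(q) of (14) for a unit quaternion q = (w, x, y, z)."""
    w, x, y, z = q
    return [
        [1 - 2 * y * y - 2 * z * z, -2 * w * z + 2 * x * y, 2 * w * y + 2 * x * z],
        [2 * w * z + 2 * x * y, 1 - 2 * x * x - 2 * z * z, -2 * w * x + 2 * y * z],
        [-2 * w * y + 2 * x * z, 2 * w * x + 2 * y * z, 1 - 2 * x * x - 2 * y * y],
    ]


# Unit quaternion for a 90 degree rotation about z: (cos 45deg, 0, 0, sin 45deg).
s = math.sqrt(0.5)
q = [s, 0.0, 0.0, s]

R_pos = rotation_matrix(q)
R_neg = rotation_matrix([-component for component in q])  # antipodal quaternion
```

Up to floating-point rounding, both `R_pos` and `R_neg` equal the matrix that maps the $x$-axis to the $y$-axis, because every entry of (14) is quadratic in the components of $\bm{q}$.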

III-B Definition of Bingham Distribution and Its Properties

The Bingham distribution [5] is a probability distribution on the unit sphere $\mathbb{S}^{d-1}\subset\mathbb{R}^{d}$ with antipodal symmetry, which is consistent with the antipodal property of unit quaternions. We set $d=4$ throughout this paper because we only consider $\mathbb{S}^{3}$. We define the Bingham distribution as follows:

$$\mathfrak{B}(\bm{q};D,\bm{\lambda})=\frac{1}{\mathcal{C}(\bm{\lambda})}\exp\left(\bm{q}^{\top}D\,\mathrm{diag}(\bm{\lambda})\,D^{\top}\bm{q}\right), \tag{15}$$

where $\bm{q}\in\mathbb{S}^{3}$, $D\in\mathrm{O}(4)$, and $\bm{\lambda}\in\mathbb{R}^{4}$. Here $\mathrm{O}(n)$ denotes the $n$-dimensional orthogonal group. Note that it is a Lie group and that its Lie algebra $\mathfrak{o}(n)$ is formed by the $n$-dimensional skew-symmetric matrices.

Here we define $\mathrm{diag}:\mathbb{R}^{m}\to\mathbb{R}^{m\times m}$ as follows:

$$\mathrm{diag}:\begin{pmatrix}v_{1}\\ \vdots\\ v_{m}\end{pmatrix}\mapsto\begin{pmatrix}v_{1}&&\\ &\ddots&\\ &&v_{m}\end{pmatrix}. \tag{16}$$

$\mathcal{C}(\bm{\lambda})$ is called the normalizing factor or the normalizing constant of the Bingham distribution $\mathfrak{B}(\bm{q};D,\bm{\lambda})$. It is defined as

$$\mathcal{C}(\bm{\lambda})=\int_{\bm{q}\in\mathbb{S}^{3}}\exp\left(\bm{q}^{\top}\mathrm{diag}(\bm{\lambda})\,\bm{q}\right)\mathrm{d}_{\mathbb{S}^{3}}(\bm{q}), \tag{17}$$

where $\mathrm{d}_{\mathbb{S}^{3}}(\cdot)$ is the uniform measure on $\mathbb{S}^{3}$. Note that the normalizing factor depends only on $\bm{\lambda}$. It is easy to check that for any $c\in\mathbb{R}$,

$$\mathfrak{B}(\bm{q};D,\bm{\lambda}+c)=\mathfrak{B}(\bm{q};D,\bm{\lambda}), \tag{18}$$

where $\bm{\lambda}+c=(\lambda_{1}+c,\dots,\lambda_{4}+c)$. Therefore, we can choose $\bm{\lambda}$ so that it satisfies

$$0=\lambda_{1}\geq\lambda_{2}\geq\lambda_{3}\geq\lambda_{4} \tag{19}$$

by shifting $\bm{\lambda}$ and reordering the columns of $D$ if necessary. The processed $D$ and $\bm{\lambda}$ are denoted by $D_{\text{shifted}}$ and $\bm{\lambda}_{\text{shifted}}$, respectively. It follows directly from the Rayleigh quotient formula that

$$\arg\max_{\bm{q}\in\mathbb{S}^{3}}\mathfrak{B}(\bm{q};D,\bm{\lambda})=\bm{q}_{\lambda_{1}}, \tag{20}$$

where $\bm{q}_{\lambda_{1}}$ is the column vector of $D$ corresponding to the maximum entry of $\bm{\lambda}$. If we sort $\bm{\lambda}$ as in (19), $\bm{q}_{\lambda_{1}}$ coincides with the left-most column vector of $D$.
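The sort-and-shift step behind (19) and the mode extraction of (20) can be sketched as below. This is a minimal illustration with hypothetical helper names (`sort_and_shift`, `mode_quaternion`), using nested lists where `D[r][k]` is row `r`, column `k`; it is not taken from the authors' code.

```python
def sort_and_shift(D, lam):
    """Reorder the columns of D and the entries of lam so that (19) holds."""
    # Column indices ordered by decreasing eigenvalue.
    order = sorted(range(len(lam)), key=lambda k: lam[k], reverse=True)
    lam_max = lam[order[0]]
    # Subtracting the maximum leaves the distribution unchanged by (18).
    lam_shifted = [lam[k] - lam_max for k in order]
    D_shifted = [[row[k] for k in order] for row in D]
    return D_shifted, lam_shifted


def mode_quaternion(D, lam):
    """Mode of the distribution (20): the left-most column after sorting."""
    D_shifted, _ = sort_and_shift(D, lam)
    return [row[0] for row in D_shifted]


# Toy input: identity eigenvectors with unsorted eigenvalues.
D = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]
lam = [-3.0, 2.0, -1.0, 0.5]

D_shifted, lam_shifted = sort_and_shift(D, lam)
mode = mode_quaternion(D, lam)
```

In this toy input the largest eigenvalue belongs to the second column, so the mode is the second basis vector and the shifted eigenvalues start at zero.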

III-C Parametrization of Bingham Distribution

III-C1 Representation using Symmetric Matrix

There are several choices for the parametrization of the Bingham distribution. Firstly, we introduce the 10D parametrization using a symmetric matrix, which was proposed in [6]. Since every symmetric matrix can be diagonalized by some orthogonal matrix, we can rewrite the distribution (15) as

$$\mathfrak{B}_{\text{sym}}(\bm{q};A)=\frac{1}{\mathcal{C}(\bm{\lambda})}\exp\left(\bm{q}^{\top}A\,\bm{q}\right), \tag{21}$$

where $A$ is a 4-dimensional symmetric matrix and $\bm{\lambda}$ is the vector of its eigenvalues. If the eigenvalues of $A$ are sorted and shifted so as to satisfy (19), then we call it $A_{\text{shifted}}$. We assume that all parameters are shifted.

Here we define a bijective map $\mathrm{triu}:\mathrm{Sym}_{4}\to\mathbb{R}^{10}$ as follows:

$$\mathrm{triu}:\begin{pmatrix}\theta_{1}&\theta_{2}&\theta_{3}&\theta_{4}\\ \theta_{2}&\theta_{5}&\theta_{6}&\theta_{7}\\ \theta_{3}&\theta_{6}&\theta_{8}&\theta_{9}\\ \theta_{4}&\theta_{7}&\theta_{9}&\theta_{10}\end{pmatrix}\mapsto\begin{pmatrix}\theta_{1}\\ \vdots\\ \theta_{10}\end{pmatrix}, \tag{22}$$

where $\mathrm{Sym}_{n}$ denotes the set of $n$-dimensional symmetric matrices. We can use this for the 10D parametrization $\mathcal{P}_{10}$ of the Bingham distribution as follows:

$$\mathcal{P}_{10}:\mathbb{R}^{10}\ni\bm{\theta}=\begin{pmatrix}\theta_{1}\\ \vdots\\ \theta_{10}\end{pmatrix}\mapsto\mathfrak{B}_{\text{sym}}\left(\bm{q};\mathrm{triu}(\bm{\theta})\right). \tag{23}$$

We call this representation the Peretroukhin representation here.
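The correspondence between a 10D vector and a symmetric matrix in (22) amounts to filling the upper triangle row by row and mirroring it. A minimal sketch, with helper names of our own choosing (`sym_from_theta`, `theta_from_sym`):

```python
def sym_from_theta(theta):
    """Build the symmetric 4x4 matrix whose upper triangle, read row by row,
    is theta_1, ..., theta_10 (the layout shown in (22))."""
    if len(theta) != 10:
        raise ValueError("expected 10 parameters")
    A = [[0.0] * 4 for _ in range(4)]
    k = 0
    for i in range(4):
        for j in range(i, 4):
            A[i][j] = theta[k]
            A[j][i] = theta[k]  # mirror to keep A symmetric
            k += 1
    return A


def theta_from_sym(A):
    """Read the upper triangle of A back into a 10D vector."""
    return [A[i][j] for i in range(4) for j in range(i, 4)]


theta = [float(v) for v in range(1, 11)]
A = sym_from_theta(theta)
```

With `theta = (1, ..., 10)` the second row of `A` is `(2, 5, 6, 7)`, matching the index pattern in (22), and `theta_from_sym(A)` recovers the input, which reflects the bijectivity of the map.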

III-C2 Representations of Orthogonal Matrices

The Peretroukhin representation defined above is simple; however, it requires an eigenvalue decomposition to calculate the normalizing factor $\mathcal{C}(\bm{\lambda})$, which has a high computational cost. A reasonable alternative is for the network to directly infer an orthogonal matrix $D$ and the diagonal entries $\bm{\lambda}$, and then reconstruct $A=D\,\mathrm{diag}(\bm{\lambda})\,D^{\top}$.

To parametrize $D\in\mathrm{O}(4)$, we introduce the following two strategies. The first is the Cayley transformation [20], which is a commonly used representation for orthogonal matrices. It is expressed as follows:

$$\operatorname{cay}:\mathbb{R}^{6}\ni\bm{\theta}\mapsto\left(I-S(\bm{\theta})\right)^{-1}\left(I+S(\bm{\theta})\right)\in\mathrm{O}(4), \tag{24}$$

where $S:\mathbb{R}^{6}\to\mathfrak{o}(4)$ is defined as follows:

$$S:\begin{pmatrix}\theta_{1}\\ \vdots\\ \theta_{6}\end{pmatrix}\mapsto\begin{pmatrix}0&\theta_{1}&-\theta_{2}&\theta_{3}\\ -\theta_{1}&0&\theta_{4}&-\theta_{5}\\ \theta_{2}&-\theta_{4}&0&\theta_{6}\\ -\theta_{3}&\theta_{5}&-\theta_{6}&0\end{pmatrix}. \tag{25}$$

This representation can express every rotation matrix in $\mathrm{SO}(4)$ that does not have $-1$ as an eigenvalue. We call this orthogonal matrix representation, $\operatorname{cay}$, the Cayley representation.
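To make (24) and (25) concrete, the sketch below evaluates the Cayley map with plain Python lists. The linear solve is a small Gauss-Jordan routine written for this illustration (in practice a linear algebra library would be used), and all function names are our own.

```python
def skew(theta):
    """Skew-symmetric matrix S(theta) of (25)."""
    t1, t2, t3, t4, t5, t6 = theta
    return [[0.0, t1, -t2, t3],
            [-t1, 0.0, t4, -t5],
            [t2, -t4, 0.0, t6],
            [-t3, t5, -t6, 0.0]]


def solve(M, B):
    """Solve M X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    aug = [list(M[i]) + list(B[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def cayley(theta):
    """Cayley map (24): (I - S)^{-1} (I + S)."""
    S = skew(theta)
    eye = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    i_minus_s = [[eye[i][j] - S[i][j] for j in range(4)] for i in range(4)]
    i_plus_s = [[eye[i][j] + S[i][j] for j in range(4)] for i in range(4)]
    return solve(i_minus_s, i_plus_s)


D = cayley([0.3, -1.2, 0.5, 2.0, -0.7, 0.1])
# Gram matrix D^T D, which should be close to the identity.
gram = [[sum(D[k][i] * D[k][j] for k in range(4)) for j in range(4)]
        for i in range(4)]
```

Since $I-S$ is always invertible for a skew-symmetric $S$, the map is defined for every $\bm{\theta}\in\mathbb{R}^{6}$, and `gram` is the identity up to rounding.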

The second is to use the 4-dimensional representation defined by Birdal et al. [21]. Although there are orthogonal matrices that cannot be represented in this 4D form, it works well in practice, for example in [7]. This representation uses the orthogonality of the matrix representation of a unit quaternion. That is, as in (5),

$$\operatorname{bir}:\mathbb{R}^{4}\ni\bm{\theta}=\begin{pmatrix}\theta_{1}\\ \vdots\\ \theta_{4}\end{pmatrix}\mapsto\Omega_{L}(\bm{\theta})\in\mathrm{O}(4). \tag{26}$$

We call this orthogonal matrix representation, $\operatorname{bir}$, the Birdal representation.
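A minimal sketch of (26) follows. Because $\Omega_{L}(\bm{\theta})$ is orthogonal only when $\bm{\theta}$ has unit norm (see (10)), this sketch normalizes the input first; that normalization step is our assumption about how a raw network output would be handled and is not stated in the text. The name `birdal` is likewise illustrative.

```python
import math


def birdal(theta):
    """Birdal representation (26): Omega_L of the normalized 4-vector theta."""
    norm = math.sqrt(sum(t * t for t in theta))
    if norm == 0.0:
        raise ValueError("theta must be nonzero")
    a, b, c, d = (t / norm for t in theta)  # assumed normalization step
    return [[a, -b, -c, -d],
            [b, a, -d, c],
            [c, d, a, -b],
            [d, -c, b, a]]


D = birdal([1.0, 2.0, 3.0, 4.0])
# Gram matrix D^T D, which should be close to the identity.
gram = [[sum(D[k][i] * D[k][j] for k in range(4)) for j in range(4)]
        for i in range(4)]
first_column = [row[0] for row in D]
```

The first column of the result is the normalized input itself, so when $\bm{\lambda}$ already satisfies (19) the mode quaternion (20) is simply $\bm{\theta}/\|\bm{\theta}\|$.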

III-C3 Representations of Eigenvalues

The simplest choice for the representation of $\bm{\lambda}$ is for the network to infer $\bm{\lambda}\in\mathbb{R}^{4}$ directly, and then to sort and shift it so as to satisfy (19):

$$\Lambda_{4}:\mathbb{R}^{4}\ni\begin{pmatrix}\theta_{1}\\ \theta_{2}\\ \theta_{3}\\ \theta_{4}\end{pmatrix}\mapsto\begin{pmatrix}0\\ \lambda_{2}-\lambda_{1}\\ \lambda_{3}-\lambda_{1}\\ \lambda_{4}-\lambda_{1}\end{pmatrix}\in\mathbb{R}^{4}, \tag{27}$$

where $\lambda_{i}$ $(i=1,\dots,4)$ is a permutation of $\{\theta_{1},\dots,\theta_{4}\}$ satisfying $\lambda_{1}\geq\dots\geq\lambda_{4}$.

Another approach is to use the softplus function [22], defined as

$$\phi(x)=\log(1+\exp(x)). \tag{28}$$

Note that $\phi(x)>0$ for all $x\in\mathbb{R}$. Using the softplus function, we can define a 3D representation as follows [7]:

$$\Lambda_{3}:\mathbb{R}^{3}\ni\begin{pmatrix}\theta_{1}\\ \theta_{2}\\ \theta_{3}\end{pmatrix}\mapsto\begin{pmatrix}0\\ -\phi(\theta_{1})\\ -\phi(\theta_{1})-\phi(\theta_{2})\\ -\phi(\theta_{1})-\phi(\theta_{2})-\phi(\theta_{3})\end{pmatrix}\in\mathbb{R}^{4}. \tag{29}$$

The resulting tuple automatically satisfies (19).
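Both eigenvalue maps are short enough to write out directly. In the sketch below the function names (`softplus`, `lambda4`, `lambda3`) are ours, and the softplus is evaluated in a numerically stable form that is algebraically equal to (28).

```python
import math


def softplus(x):
    """Softplus (28), written as max(x, 0) + log(1 + exp(-|x|)) to avoid overflow."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def lambda4(theta):
    """4D eigenvalue map (27): sort in decreasing order, then subtract the maximum."""
    lam = sorted(theta, reverse=True)
    return [value - lam[0] for value in lam]


def lambda3(theta):
    """3D eigenvalue map (29): cumulative sums of negated softplus values."""
    out = [0.0]
    for t in theta:
        out.append(out[-1] - softplus(t))
    return out


lam_a = lambda4([1.0, 3.0, -2.0, 0.5])
lam_b = lambda3([0.0, 0.0, 0.0])
lam_c = lambda3([-5.0, 2.0, 10.0])
```

For `lambda3`, the ordering (19) holds without any sorting because each step subtracts a strictly positive softplus value; for `lambda4` it holds because of the explicit sort.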

So far we have introduced two representations for $D$ and two for $\bm{\lambda}$. Together with the Peretroukhin representation, we now have five choices for the parametrization of the Bingham distribution:

$$\begin{array}{lrcl}\bullet&\mathcal{P}_{10}:&\mathbb{R}^{10}\ni\bm{\theta}&\mapsto\ \mathfrak{B}_{\text{sym}}\left(\bm{q};\mathrm{triu}(\bm{\theta})\right)\\ \bullet&\mathcal{P}_{4+3}:&\mathbb{R}^{4}\times\mathbb{R}^{3}\ni(\bm{d},\bm{\lambda})&\mapsto\ \mathfrak{B}\left(\bm{q};\operatorname{bir}(\bm{d}),\Lambda_{3}(\bm{\lambda})\right)\\ \bullet&\mathcal{P}_{4+4}:&\mathbb{R}^{4}\times\mathbb{R}^{4}\ni(\bm{d},\bm{\lambda})&\mapsto\ \mathfrak{B}\left(\bm{q};\operatorname{bir}(\bm{d}),\Lambda_{4}(\bm{\lambda})\right)\\ \bullet&\mathcal{P}_{6+3}:&\mathbb{R}^{6}\times\mathbb{R}^{3}\ni(\bm{d},\bm{\lambda})&\mapsto\ \mathfrak{B}\left(\bm{q};\operatorname{cay}(\bm{d}),\Lambda_{3}(\bm{\lambda})\right)\\ \bullet&\mathcal{P}_{6+4}:&\mathbb{R}^{6}\times\mathbb{R}^{4}\ni(\bm{d},\bm{\lambda})&\mapsto\ \mathfrak{B}\left(\bm{q};\operatorname{cay}(\bm{d}),\Lambda_{4}(\bm{\lambda})\right)\end{array} \tag{30}$$

Note that these representations are all continuous in the sense of [13]. We will compare these representations in Section V.

Algorithm 1 Our implementation of the loss function
function Integrator($f_{\text{integrand}}$, $\bm{\lambda}$)
    $N_{\text{min}}\leftarrow 15$; $N\leftarrow 200$
    $r\leftarrow 2.5$; $\omega_{d}\leftarrow 0.5$
    Define $c$ as in (33); $d\leftarrow c/2$
    Define $h,p_{1},p_{2}$ as in (33)
    $S\leftarrow 0$
    for $n=-N-1,\dots,N$ do
        $S\leftarrow S+w(|nh|)\cdot f_{\text{integrand}}(nh,\bm{\lambda})\cdot e^{nh\sqrt{-1}}$    ▷ $w$ is defined in (34)
    end for
    return the real part of $\pi e^{c}hS$
end function

function BinghamLoss($D$, $\bm{\lambda}$, $\bm{q}_{\text{gt}}$)
    $D_{\text{shifted}}$, $\bm{\lambda}_{\text{shifted}}$ $\leftarrow$ Sort&Shift($D$, $\bm{\lambda}$)
    $A_{\text{shifted}}\leftarrow D_{\text{shifted}}\,\mathrm{diag}(\bm{\lambda}_{\text{shifted}})\,D_{\text{shifted}}^{\top}$
    $\mathcal{C}\leftarrow$ Integrator($\mathcal{F}$, $\bm{\lambda}_{\text{shifted}}$)    ▷ see (38)
    return $-\bm{q}_{\text{gt}}^{\top}A_{\text{shifted}}\,\bm{q}_{\text{gt}}+\ln\mathcal{C}$
end function

IV EFFICIENT COMPUTATION OF LOSS FUNCTION

IV-A Definition of Loss Function

The negative log-likelihood function of the Bingham distribution can be written as follows:

$$\mathcal{L}(D,\bm{\lambda},\bm{q}_{\text{gt}})=-\bm{q}_{\text{gt}}^{\top}D\,\mathrm{diag}(\bm{\lambda})\,D^{\top}\bm{q}_{\text{gt}}+\ln\mathcal{C}(\bm{\lambda}). \tag{31}$$

Computing $\mathcal{C}(\bm{\lambda})$ had been a hard problem until a highly efficient computation method was proposed in [9]. Our loss function is implemented mainly based on [9].
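To see why the derivative of $\mathcal{C}$ is needed for backpropagation, it helps to differentiate (31) with respect to the eigenvalues. In the short derivation below, $\bm{d}_{i}$ denotes the $i$-th column of $D$; this symbol is introduced here only for illustration.

```latex
\begin{align*}
\bm{q}_{\text{gt}}^{\top} D\,\mathrm{diag}(\bm{\lambda})\,D^{\top}\bm{q}_{\text{gt}}
  &= \sum_{i=1}^{4} \lambda_{i}\,\bigl(\bm{d}_{i}^{\top}\bm{q}_{\text{gt}}\bigr)^{2}, \\
\frac{\partial\mathcal{L}}{\partial\lambda_{i}}
  &= -\bigl(\bm{d}_{i}^{\top}\bm{q}_{\text{gt}}\bigr)^{2}
     + \frac{1}{\mathcal{C}(\bm{\lambda})}\,
       \frac{\partial\mathcal{C}}{\partial\lambda_{i}}(\bm{\lambda}).
\end{align*}
```

The first term is cheap to evaluate, so the cost of the gradient with respect to $\bm{\lambda}$ is dominated by $\mathcal{C}(\bm{\lambda})$ and its partial derivatives.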

IV-B Calculation of Normalizing Constant and Its Derivative

The whole procedure is shown in Algorithm 1. Let $r,\omega_{d}$ be real numbers satisfying

$$r\geq 2\quad\text{and}\quad\frac{1}{r}\leq\omega_{d}\leq 1. \tag{32}$$

We chose $r=2.5$, $\omega_{d}=0.5$ here. Let $c,h,p_{1},p_{2}$ be defined as

$$c=\frac{N_{\text{min}}\pi}{r^{2}(1+r)\omega_{d}},\quad h=\sqrt{\frac{2\pi d(1+r)}{\omega_{d}N}},\quad p_{1}=\sqrt{\frac{Nh}{\omega_{d}}},\quad p_{2}=\sqrt{\frac{\omega_{d}Nh}{4}}, \tag{33}$$

where $d$ is any positive number satisfying $d<c$. We chose $d=c/2$ here. $N$ is a positive integer satisfying $N\geq N_{\text{min}}$. One can choose $N_{\text{min}}$ arbitrarily; however, too small an $N_{\text{min}}$ may lead to unstable computation. We chose $N_{\text{min}}=15$ here.
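For orientation, the constants in (33) can be evaluated for the settings chosen in this section ($N_{\text{min}}=15$, $r=2.5$, $\omega_{d}=0.5$, $d=c/2$) together with $N=200$, the value used in Algorithm 1. The variable names below are our own.

```python
import math

N_MIN, N = 15, 200
R, OMEGA_D = 2.5, 0.5

# Condition (32) on r and omega_d.
condition_32 = R >= 2 and 1 / R <= OMEGA_D <= 1

c = N_MIN * math.pi / (R ** 2 * (1 + R) * OMEGA_D)
d = c / 2  # any 0 < d < c is allowed; c / 2 is the choice made in the text
h = math.sqrt(2 * math.pi * d * (1 + R) / (OMEGA_D * N))
p1 = math.sqrt(N * h / OMEGA_D)
p2 = math.sqrt(OMEGA_D * N * h / 4)
```

With these settings, $c\approx 4.31$, $h\approx 0.69$, $p_{1}\approx 16.6$, and $p_{2}\approx 4.15$, so the sampling points $nh$ cover roughly $|t|\leq 138$.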

We define a function $w$ parametrized by $p_{1},p_{2}$ in (33) as follows:

$$w(x)=\frac{1}{2}\,\mathrm{erfc}\left(\frac{x}{p_{1}}-p_{2}\right), \tag{34}$$

where $\mathrm{erfc}$ is the complementary error function:

$$\mathrm{erfc}(x)=1-\frac{2}{\sqrt{\pi}}\int_{0}^{x}e^{-t^{2}}\,dt. \tag{35}$$

Then we define

$$\mathcal{F}(t,\bm{\lambda})=\prod_{k=1}^{4}\left(-\lambda_{k}+t\sqrt{-1}+c\right)^{-1/2}, \tag{36}$$

and, for each $i=1,\dots,4$, we get

$$\frac{\partial\mathcal{F}}{\partial\lambda_{i}}(t,\bm{\lambda})=\frac{1}{2}\left(-\lambda_{i}+t\sqrt{-1}+c\right)^{-1}\mathcal{F}(t,\bm{\lambda}). \tag{37}$$

Now we can calculate the normalizing constant $\mathcal{C}$ and its derivatives as

$$\mathcal{C}(\bm{\lambda})=\pi e^{c}h\sum_{n=-N-1}^{N}w(|nh|)\,\mathcal{F}(nh,\bm{\lambda})\,e^{nh\sqrt{-1}}, \tag{38}$$

$$\frac{\partial\mathcal{C}}{\partial\lambda_{i}}(\bm{\lambda})=\pi e^{c}h\sum_{n=-N-1}^{N}w(|nh|)\,\frac{\partial\mathcal{F}}{\partial\lambda_{i}}(nh,\bm{\lambda})\,e^{nh\sqrt{-1}}, \tag{39}$$

for each $i=1,\dots,4$. Although $\mathcal{C}(\bm{\lambda})$ and $\partial\mathcal{C}/\partial\lambda_{i}(\bm{\lambda})$ should be exactly real, the computed values may be complex numbers with a very small imaginary part. In our implementation shown in Algorithm 1, we ignore the imaginary part, assuming that it is sufficiently small.
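Equations (33)-(39) translate almost line by line into code. The following is a minimal pure-Python sketch under the parameter choices stated in this section; it is not the authors' implementation, and the names (`weight`, `integrand`, `normalizing_constant`, `bingham_nll`) are hypothetical. It assumes that $\bm{\lambda}$ has already been shifted so that all entries are non-positive, which keeps the real part of every factor in (36) positive and makes the principal branch of the complex power the appropriate one.

```python
import cmath
import math

N_MIN, N = 15, 200
R, OMEGA_D = 2.5, 0.5
C_SHIFT = N_MIN * math.pi / (R ** 2 * (1 + R) * OMEGA_D)        # c in (33)
D_STRIP = C_SHIFT / 2                                            # d = c / 2
H = math.sqrt(2 * math.pi * D_STRIP * (1 + R) / (OMEGA_D * N))   # h in (33)
P1 = math.sqrt(N * H / OMEGA_D)
P2 = math.sqrt(OMEGA_D * N * H / 4)


def weight(x):
    """Weight function w of (34)."""
    return 0.5 * math.erfc(x / P1 - P2)


def integrand(t, lam):
    """F(t, lambda) of (36)."""
    prod = 1.0 + 0.0j
    for l in lam:
        prod *= complex(-l + C_SHIFT, t) ** -0.5
    return prod


def normalizing_constant(lam):
    """Return C(lambda) from (38) and its four partial derivatives from (39)."""
    total = 0.0j
    grads = [0.0j] * len(lam)
    for n in range(-N - 1, N + 1):
        t = n * H
        common = weight(abs(t)) * integrand(t, lam) * cmath.exp(1j * t)
        total += common
        for i, l in enumerate(lam):
            grads[i] += common * 0.5 / complex(-l + C_SHIFT, t)  # factor from (37)
    scale = math.pi * math.exp(C_SHIFT) * H
    # The imaginary parts are discarded, as in Algorithm 1.
    return scale * total.real, [scale * g.real for g in grads]


def bingham_nll(D, lam, q_gt):
    """NLL (31) for nested-list D (row-major), eigenvalues lam, and unit q_gt."""
    lam_max = max(lam)
    lam_shifted = [l - lam_max for l in lam]  # shift only; sorting does not change the value
    proj = [sum(D[r][k] * q_gt[r] for r in range(4)) for k in range(4)]  # D^T q_gt
    quad = sum(l * p * p for l, p in zip(lam_shifted, proj))
    C, _ = normalizing_constant(lam_shifted)
    return -quad + math.log(C)
```

A useful check is the uniform case $\bm{\lambda}=\bm{0}$: with the surface measure on $\mathbb{S}^{3}$, the integral (17) is the surface area $2\pi^{2}$ and each partial derivative is $\pi^{2}/2$ by symmetry, and the sketch reproduces these values closely. Because the loop has only $2N+2$ terms, no lookup table or interpolation is required.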

It is noteworthy that if we denote the true value of $\mathcal{C}(\bm{\lambda})$ by $\mathcal{C}_{\text{truth}}(\bm{\lambda})$, we get

$$|\mathcal{C}_{\text{truth}}(\bm{\lambda})-\mathcal{C}(\bm{\lambda})|=O\left(\sqrt{N}e^{-c\sqrt{N}}\right) \tag{40}$$

for a constant $c>0$ independent of $N$ [23]. This means that we can achieve any accuracy if we set a large enough $N$. In this paper, we set $N=200$ in consideration of the computation time.

V APPLICATION TO POSE ESTIMATOR

Figure 2: Overview of our network implementation. We changed the loss function and the dimension of the final output from 4 to $d$. As we will describe in Section V-D, we chose $d=10$. Our method can also obtain the pose ambiguity.

V-A Implementation

In this section, we introduce an application of our representation to an existing pose estimator. We use PoseCNN [1] as the backbone framework. Our implementation is based on PoseCNN-PyTorch [24] by NVIDIA Research Projects. We call our network “Bingham-PoseCNN” for convenience. Fig. 2 shows the overview of Bingham-PoseCNN. “Ours” in the figure marks the points changed from the original PoseCNN. Apart from the loss function, we changed only the dimension of the final FCN layer from 4 to $d$ ($d$ varies from 7 to 10).

Our Bingham NLL loss function is described in Section IV. In contrast, the original PoseCNN used the following loss functions for training:

$$\operatorname{PLOSS}(\bm{q},\bm{q}_{\text{gt}})=\frac{1}{2m}\sum_{\bm{x}\in\mathcal{M}}\|R(\bm{q}_{\text{gt}})\bm{x}-R(\bm{q})\bm{x}\|^{2}, \tag{41}$$

$$\operatorname{SLOSS}(\bm{q},\bm{q}_{\text{gt}})=\frac{1}{2m}\sum_{\bm{x}_{1}\in\mathcal{M}}\min_{\bm{x}_{2}\in\mathcal{M}}\left\|R(\bm{q}_{\text{gt}})\bm{x}_{1}-R(\bm{q})\bm{x}_{2}\right\|^{2}, \tag{42}$$

where $\mathcal{M}$ denotes the set of $m$ points on the mesh model of each object. PoseCNN uses PLoss if the object has no symmetry and SLoss if the object has symmetry. While this requires annotating the symmetry of each object, our method does not need these annotations. In addition, our method does not need mesh models of the target objects.

V-B Dataset

We tested our model on the YCB-Video dataset, which is the same dataset used in [1]. This dataset provides 80 videos for training and 2949 keyframes for testing, extracted from the remaining 12 videos. In addition, the YCB-Video dataset contains 80000 synthetic images, which we also used for training.

V-C Evaluation Metrics

We used the ADD and ADD-S metrics to evaluate the performance of our pose estimator. Let $(R,\bm{t})$ be the pair of the groundtruth rotation $R$ and translation $\bm{t}$, and let $(\widehat{R},\widehat{\bm{t}})$ be the pair of the estimated rotation and translation. Then, ADD and ADD-S are defined as follows:

$$\text{ADD}=\frac{1}{m}\sum_{\bm{x}\in\mathcal{M}}\|(R\bm{x}+\bm{t})-(\widehat{R}\bm{x}+\widehat{\bm{t}})\|, \tag{43}$$

$$\text{ADD-S}=\frac{1}{m}\sum_{\bm{x}_{1}\in\mathcal{M}}\min_{\bm{x}_{2}\in\mathcal{M}}\|(R\bm{x}_{1}+\bm{t})-(\widehat{R}\bm{x}_{2}+\widehat{\bm{t}})\|, \tag{44}$$

where $\mathcal{M}$ is the set of the $m$ points sampled from the surface of the 3D mesh model. These metrics are the same as those used in PoseCNN [1].
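The difference between the two metrics is easiest to see on a symmetric point set. The sketch below implements (43) and (44) with plain lists; the function names and the four-point toy model are illustrative and are not taken from the YCB-Video tooling.

```python
import math


def transform(R, t, x):
    """Apply the rigid transform R x + t to a 3D point."""
    return [sum(R[i][j] * x[j] for j in range(3)) + t[i] for i in range(3)]


def add_metric(R, t, R_hat, t_hat, points):
    """ADD (43): mean distance between corresponding transformed points."""
    total = 0.0
    for x in points:
        total += math.dist(transform(R, t, x), transform(R_hat, t_hat, x))
    return total / len(points)


def add_s_metric(R, t, R_hat, t_hat, points):
    """ADD-S (44): mean distance to the closest estimated point."""
    estimated = [transform(R_hat, t_hat, x) for x in points]
    total = 0.0
    for x1 in points:
        gt_point = transform(R, t, x1)
        total += min(math.dist(gt_point, e) for e in estimated)
    return total / len(points)


# Toy model that is symmetric under a 180 degree rotation about the z-axis.
points = [[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, -1.0, 0.0]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
half_turn = [[-1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, 1.0]]
zero = [0.0, 0.0, 0.0]

add_value = add_metric(identity, zero, half_turn, zero, points)
add_s_value = add_s_metric(identity, zero, half_turn, zero, points)
```

For this toy model, an estimate that is off by exactly the symmetry of the object gives an ADD of 2 but an ADD-S of 0, which is why ADD-S is the more informative metric for symmetric objects.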

V-D Results

V-D1 Comparison of Bingham Representations

Table I shows the mean values of the area under the curve (AUC) of ADD and ADD-S. Peretroukhin’s 10D representation achieves the best score among the 6 representations. Thus, we adopt this representation for our network. The 8D representation using the Birdal representation comes next.

It can be seen that the 4D eigenvalue representations achieve higher scores than the 3D representations. This may be related to the fact that $\Lambda_{4}$, defined in (27), tends to give smaller eigenvalues than $\Lambda_{3}$, defined in (29). It is known that the distribution is concentrated if the eigenvalues are small, as we will describe in Section VI-A1. In a well-trained network, the mode quaternion gets closer to the given quaternion as the dispersion of the distribution becomes smaller. This is a possible reason why $\Lambda_{4}$ gives better results than $\Lambda_{3}$.

TABLE I: Comparison of Parametrization. Only the mode quaternion is used for ADDs calculation for Bingham representation.

| Param | Ortho. Matrix | Diag. Entries | Dim | ADD | ADD-S |
|---|---|---|---|---|---|
| $\mathcal{P}_{4+3}$ | Birdal [21] | 3D | 7 | 49.2 | 72.9 |
| $\mathcal{P}_{4+4}$ | Birdal [21] | 4D | 8 | 53.0 | 74.2 |
| $\mathcal{P}_{6+3}$ | Cayley [20] | 3D | 9 | 13.5 | 58.8 |
| $\mathcal{P}_{6+4}$ | Cayley [20] | 4D | 10 | 23.5 | 66.2 |
| $\mathcal{P}_{10}$ | Symmetric Matrix [6] | - | 10 | 55.1 | 75.1 |
| Quaternion [1] | - | - | 4 | 52.9 | 74.1 |

V-D2 Conventional Quaternion vs Bingham’s Mode Quaternion

TABLE II: Area under the curve of Fig. 4. “Ratio” is the ratio of Ours score to Original PoseCNN’s score, in percent. Ratios > 95.0% are shown in bold. The first six numeric columns are for RGB input and the last six are for RGB + Depth input.

| Object | RGB ADD Ours | RGB ADD Original | RGB ADD-S Ours | RGB ADD-S Original | RGB Ratio ADD | RGB Ratio ADD-S | RGB+Depth ADD Ours | RGB+Depth ADD Original | RGB+Depth ADD-S Ours | RGB+Depth ADD-S Original | RGB+Depth Ratio ADD | RGB+Depth Ratio ADD-S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 002_master_chef_can | 61.0 | 60.5 | 86.7 | 88.7 | 100.9 | 97.7 | 70.7 | 70.0 | 93.1 | 93.4 | 101.0 | 99.7 |
| 003_cracker_box | 36.6 | 61.2 | 63.3 | 79.5 | 59.8 | 79.6 | 64.8 | 79.1 | 72.8 | 85.4 | 81.9 | 85.3 |
| 004_sugar_box | 56.7 | 51.6 | 76.7 | 73.4 | 109.8 | 104.5 | 91.3 | 90.7 | 94.7 | 93.7 | 100.6 | 101.1 |
| 005_tomato_soup_can | 71.1 | 69.5 | 83.7 | 82.6 | 102.4 | 101.3 | 86.8 | 87.8 | 93.2 | 93.5 | 98.8 | 99.7 |
| 006_mustard_bottle | 88.7 | 84.5 | 94.0 | 92.1 | 105.0 | 102.1 | 94.9 | 90.6 | 96.6 | 93.5 | 104.8 | 103.2 |
| 007_tuna_fish_can | 73.5 | 68.4 | 91.5 | 87.8 | 107.5 | 104.2 | 84.7 | 84.3 | 96.9 | 95.1 | 100.4 | 101.9 |
| 008_pudding_box | 29.3 | 67.8 | 53.8 | 83.4 | 43.2 | 64.5 | 81.8 | 86.0 | 90.3 | 93.7 | 95.1 | 96.4 |
| 009_gelatin_box | 87.8 | 80.2 | 92.8 | 89.4 | 109.4 | 103.8 | 65.4 | 95.3 | 67.4 | 97.2 | 68.7 | 69.4 |
| 010_potted_meat_can | 60.1 | 59.7 | 79.3 | 78.2 | 100.6 | 101.3 | 80.4 | 78.6 | 89.4 | 88.3 | 102.3 | 101.2 |
| 011_banana | 69.8 | 77.4 | 85.4 | 89.9 | 90.2 | 95.0 | 81.2 | 89.1 | 90.0 | 95.1 | 91.2 | 94.6 |
| 019_pitcher_base | 68.6 | 67.8 | 84.4 | 83.8 | 101.1 | 100.6 | 92.3 | 93.9 | 96.2 | 96.5 | 98.3 | 99.7 |
| 021_bleach_cleanser | 50.7 | 51.1 | 67.5 | 70.1 | 99.2 | 96.3 | 85.2 | 84.2 | 93.9 | 91.5 | 101.2 | 102.6 |
| 024_bowl | 4.3 | 4.9 | 60.5 | 74.2 | 87.7 | 81.6 | 17.9 | 8.4 | 90.4 | 78.1 | 213.8 | 115.7 |
| 025_mug | 71.2 | 47.4 | 87.8 | 72.4 | 150.3 | 121.4 | 77.7 | 84.6 | 88.8 | 95.0 | 91.9 | 93.5 |
| 035_power_drill | 61.4 | 52.7 | 77.9 | 72.7 | 116.6 | 107.2 | 87.4 | 86.3 | 92.8 | 91.7 | 101.4 | 101.2 |
| 036_wood_block | 0.9 | 1.3 | 21.7 | 15.8 | 69.7 | 137.3 | 39.5 | 29.0 | 86.2 | 88.7 | 136.1 | 97.1 |
| 037_scissors | 43.5 | 50.1 | 65.1 | 68.8 | 86.7 | 94.6 | 62.1 | 72.8 | 77.1 | 82.2 | 85.3 | 93.8 |
| 040_large_marker | 55.1 | 55.2 | 66.9 | 67.2 | 99.8 | 99.5 | 82.3 | 86.1 | 90.4 | 93.2 | 95.6 | 96.9 |
| 051_large_clamp | 43.2 | 12.9 | 68.4 | 38.9 | 334.9 | 175.8 | 63.1 | 56.7 | 81.6 | 76.8 | 111.3 | 106.3 |
| 052_extra_large_clamp | 8.1 | 6.3 | 37.6 | 38.6 | 128.5 | 97.5 | 27.3 | 9.2 | 49.6 | 40.6 | 295.8 | 122.1 |
| 061_foam_brick | 50.3 | 56.8 | 83.9 | 90.0 | 88.5 | 93.2 | 63.4 | 67.2 | 95.0 | 96.6 | 94.3 | 98.3 |
| all | 55.1 | 53.0 | 75.1 | 74.1 | 104.0 | 101.4 | 75.5 | 75.8 | 88.3 | 88.7 | 99.6 | 99.5 |

Table II compares PoseCNN with our Bingham representation against PoseCNN with the conventional quaternion representation. To compare the two, we evaluate the Bingham representation at its mode quaternion, described in (20). With our loss function, the Bingham representation achieves performance comparable to that of the quaternion representation: averaged over all objects, the ratios lie between 99.5% and 104.0%.

VI DISCUSSIONS

VI-A Evaluation of Inferred Probabilistic Representation

Table II evaluates only the mode quaternion. Our method, however, also provides information about the ambiguity or uncertainty of the inference result. Because the inferred results are probability distributions, they can be interpreted in several ways. We examine them under two interpretations: confidence and shape ambiguity.

VI-A1 Interpret as Confidence (Epistemic uncertainty)

Fig. 1 shows an example of an inference result of our model. The plots in the right column show poses sampled from the inferred distribution. The algorithm for sampling from the Bingham distribution is based on [25].

In Fig. 1, if the handle of the mug is visible, the resulting distribution has low variance and is concentrated, which can be interpreted as the inferred result having high confidence. In contrast, if the handle is occluded, the distribution is widely spread, which can be interpreted as the inferred result having low confidence.

To explain this more quantitatively, we introduce a random variable $\Delta Q$. Given a random quaternion drawn from the estimated distribution, $Q\sim\mathfrak{B}(D,\bm{\lambda})$, and the groundtruth $\bm{q}_{\text{gt}}$, we define $\Delta Q$ as follows:

\begin{align}
\Delta Q &= 2\arccos\left(\left|\mathop{\mathrm{Re}}(Q\odot\bm{q}_{\text{gt}}^{-1})\right|\right) \notag\\
&= 2\arccos\left(\left|\bm{q}_{\text{gt}}^{\top}Q\right|\right), \tag{45}
\end{align}

where $\mathop{\mathrm{Re}}(\bm{q})$ is the real part of $\bm{q}$; that is, if $\bm{q}=w+xi+yj+zk$, then $\mathop{\mathrm{Re}}(\bm{q})=w$. A realization of $\Delta Q$ represents the difference between a rotation sampled from the inferred distribution and the groundtruth.
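As an illustration of (45), the following is a minimal sketch that evaluates $\Delta Q$ for given unit quaternions and estimates $\operatorname{E}[\Delta Q]$ by averaging over samples. The function names `quat_angle_error` and `mean_angle_error` are hypothetical, quaternions are assumed to be stored as `(w, x, y, z)` tuples of unit norm, and the samples are assumed to come from an external Bingham sampler such as the one in [25], which is not implemented here.

```python
import math


def quat_angle_error(q, q_gt):
    """Rotation angle (radians) between unit quaternions q and q_gt, as in (45)."""
    dot = sum(a * b for a, b in zip(q, q_gt))
    # Clamp because rounding can push |dot| slightly above 1, where acos fails.
    return 2.0 * math.acos(min(1.0, abs(dot)))


def mean_angle_error(samples, q_gt):
    """Monte Carlo estimate of E[Delta Q] from quaternions sampled elsewhere."""
    return sum(quat_angle_error(q, q_gt) for q in samples) / len(samples)


# Hypothetical test values: the identity and a 90-degree rotation about z.
identity = (1.0, 0.0, 0.0, 0.0)
quarter_turn_z = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
```

The absolute value makes the result identical for $\bm{q}$ and $-\bm{q}$, which represent the same rotation.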

Next, we introduce an indicator of inference uncertainty proposed in [6]. The authors empirically found that the smaller the trace of the shifted parameter matrix,

\begin{equation}
\operatorname{tr}(A_{\text{shifted}})=\operatorname{tr}(A)-4\max(\bm{\lambda}), \tag{46}
\end{equation}

becomes, the more confident the estimation is. The upper plot in Fig. 3 shows the inference results for 025_mug as an example. The red and green points in the plot correspond to the minimum and maximum values of the trace, respectively. The lower plots show the distribution of $\Delta Q$ at the corresponding points in the upper plot: the left one corresponds to the red point and the right one to the green point. We can see that if the trace is large, $\Delta Q$ is widely distributed, that is, the confidence is low. We can also see that $\operatorname{E}[\Delta Q]$ becomes smaller as the trace decreases. This implies that an inference with a large trace may have a large error.
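The indicator in (46) can be computed from the eigenvalues alone, because the trace of a symmetric matrix equals the sum of its eigenvalues. The following is a minimal sketch under that observation; the function name `shifted_trace` is hypothetical, and the four eigenvalues of $A$ are assumed to be available already.

```python
def shifted_trace(eigenvalues):
    """Return tr(A) - 4 * max(lambda) as in (46), given the 4 eigenvalues of A."""
    lam = [float(v) for v in eigenvalues]
    if len(lam) != 4:
        raise ValueError("expected the 4 eigenvalues of a 4x4 parameter matrix")
    # tr(A) equals the sum of the eigenvalues of A.
    return sum(lam) - 4.0 * max(lam)
```

Adding a constant $c$ to every eigenvalue raises $\operatorname{tr}(A)$ by $4c$ and $4\max(\bm{\lambda})$ by $4c$, so the indicator does not change. It is also never positive, and it equals zero only when all four eigenvalues coincide.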

Figure 3: An example of the inference result of our network. The upper plot shows the relation between $\operatorname{E}[\Delta Q]$ and $\operatorname{tr}(A_{\text{shifted}})$. The red and green points are the minimum and the maximum trace, respectively. The lower plots show the distribution of $\Delta Q$ at the red and green points of the upper plot.

VI-A2 Interpret as Rotation Symmetry (Aleatoric uncertainty)

In Fig. 1, we can see that the rotations are zonally spread around the $z$-axis. We can interpret this as the mug having rotational symmetry around the $z$-axis in this view. We can examine the symmetry characteristics of the observed objects quantitatively by inspecting the eigenvalues. According to [26], for eigenvalues sorted as in (19), $\bm{\lambda}$ gives

  • a bipolar distribution, if $\lambda_{2}+\lambda_{3}<\lambda_{4}$,

  • a circular distribution, if $\lambda_{2}+\lambda_{3}=\lambda_{4}$,

  • a spherical distribution, if $\lambda_{2}+\lambda_{3}>\lambda_{4}$,

  • a uniform distribution, if $\lambda_{2}=\lambda_{3}=\lambda_{4}$.

The orientation of the symmetry axis is determined by the orthogonal matrix DD.
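The classification above can be sketched as a small function. This is only an illustration: the name `bingham_shape` and the tolerance `tol` are hypothetical, and the caller is assumed to pass $\lambda_{2}$, $\lambda_{3}$, $\lambda_{4}$ already ordered and normalized as in (19). The uniform case is tested first because the listed conditions overlap (for example, three equal positive values also satisfy the spherical inequality).

```python
import math


def bingham_shape(lam2, lam3, lam4, tol=1e-6):
    """Classify the distribution shape from eigenvalues ordered as in (19)."""
    # Exact equality is rarely met by network outputs, so compare within tol.
    if math.isclose(lam2, lam3, abs_tol=tol) and math.isclose(lam3, lam4, abs_tol=tol):
        return "uniform"
    pair_sum = lam2 + lam3
    if math.isclose(pair_sum, lam4, abs_tol=tol):
        return "circular"
    return "bipolar" if pair_sum < lam4 else "spherical"
```

In practice the tolerance decides how close to the circular boundary an inferred distribution must be to count as rotationally symmetric, so it would need to be chosen for the scale of the eigenvalues at hand.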

Figure 4: Results of our method and PoseCNN [1]. The top row shows the average distance threshold curves, the middle row shows the translation threshold curves, and the bottom row shows histograms of the rotation angle error. These metrics are described in [1].

VI-B Explanation of How Our Bingham Representation Works

Figure 5: Toy examples for explaining how the network learns the Bingham parameters. Groundtruths are shown as rotation angles around the $z$-axis instead of quaternions. The first two and the last entries of the inferred 10D parameter are shown in the bottom row.

Fig. 5 shows synthetic images of a red cup. The groundtruth is shown as the angle of rotation around the $z$-axis instead of as a quaternion. The inferred distributions and their 10D parameters are also shown. We use this figure to explain the learning mechanism of our network.

Our model is represented mathematically as follows:

\begin{equation}
\bm{\theta}_{i}=F(I_{i}), \tag{47}
\end{equation}

where $I_{i}$ is an input image, $\bm{\theta}_{i}$ is the inferred parameter of the distribution, and $F$ is the network to be trained. Suppose that we have pairs $\{I_{i},\bm{q}_{i}\}_{i=1}^{N}$ whose images are similar to each other but whose groundtruth quaternions are all different. The middle column of Fig. 5 corresponds to this circumstance. In this situation, there is a $\bm{\theta}$ satisfying

\begin{equation}
F(I_{i})\approx\bm{\theta}\quad\text{for all $i=1,\dots,N$}. \tag{48}
\end{equation}

Let $M$ be a function that transforms a given parameter vector into a parameter matrix of the Bingham distribution. Then our problem becomes

Find $\bm{\theta}$ that minimizes $\displaystyle\sum_{i=1}^{N}\mathcal{L}(M(\bm{\theta}),\bm{q}_{i})$.

By solving this problem, we obtain a $\bm{\theta}$ that is optimized for all of the given quaternions $\bm{q}_{i}$. Because $\mathcal{L}$ is the NLL, the problem is equivalent to “solve the maximum likelihood estimation (MLE) problem for each $M(F(I_{i}))$, given $\bm{q}_{i}$”. This means that the resulting parameter carries information about the distribution of the quaternions that share similar views.

In Fig. 5, the inferred results in the left and right columns are both concentrated because the quaternions that give similar views are close to each other. In contrast, the result in the middle column is zonally spread because the quaternions sharing that view are widely spread. Our network thus learns a parameter that covers the rotations sharing a similar view.
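To make this mechanism concrete, the following is a minimal sketch under an assumption that is not stated in this section: that the Bingham density has the form $\exp(\bm{q}^{\top}A\bm{q})/C(A)$. Under that assumption the summed NLL equals $N\left(\log C(A)-\operatorname{tr}(AS)\right)$ with the scatter matrix $S=\frac{1}{N}\sum_{i}\bm{q}_{i}\bm{q}_{i}^{\top}$, so the groundtruth quaternions influence the fitted parameter only through $S$. The helper names `z_rotation_quat` and `scatter_matrix` and the chosen angles are hypothetical.

```python
import math


def z_rotation_quat(angle):
    """Unit quaternion (w, x, y, z) for a rotation by `angle` about the z-axis."""
    half = 0.5 * angle
    return (math.cos(half), 0.0, 0.0, math.sin(half))


def scatter_matrix(quats):
    """S = (1/N) * sum_i q_i q_i^T, returned as a 4x4 nested list."""
    n = len(quats)
    return [[sum(q[i] * q[j] for q in quats) / n for j in range(4)]
            for i in range(4)]


# Views that look alike at every angle about z (like the middle column of Fig. 5).
spread = [z_rotation_quat(2.0 * math.pi * k / 360) for k in range(360)]
# Views that look alike only within a few degrees (like the outer columns).
narrow = [z_rotation_quat(math.radians(d)) for d in (-2, -1, 0, 1, 2)]

S_spread = scatter_matrix(spread)
S_narrow = scatter_matrix(narrow)
```

For the spread set, `S_spread` is close to a diagonal matrix with entries (0.5, 0, 0, 0.5): two directions carry equal weight, which corresponds to a zonal distribution around the $z$-axis. For the narrow set, nearly all of the weight sits on a single direction, which corresponds to a concentrated distribution.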

VI-C Adapting to Objects with Discrete Symmetry

In Table II, objects with discrete symmetry, such as 036_wood_block and 052_extra_large_clamp, obtained relatively low scores in ADD and ADD-S. This is because a single Bingham distribution cannot capture ambiguity with multiple modes well. The scores could be improved by introducing a mixture Bingham representation, such as the one introduced in [7]. Our NLL loss function is easy to extend to such mixtures, compared to non-NLL losses such as the QCQP loss presented in [6].
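One reason an NLL loss extends naturally to mixtures is that the mixture likelihood is a weighted sum of component likelihoods, so the mixture NLL can be assembled from per-component log-likelihoods. The following is a generic sketch of that combination step, not the formulation of [7]; the function name `mixture_nll` is hypothetical, and the per-component Bingham log-likelihoods are assumed to be computed elsewhere.

```python
import math


def mixture_nll(weights, component_log_liks):
    """NLL of one sample under a mixture, given per-component log-likelihoods."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    # log(w_k) + log p_k(q) for every component with non-zero weight.
    terms = [math.log(w) + ll
             for w, ll in zip(weights, component_log_liks) if w > 0.0]
    # Log-sum-exp with the maximum factored out to avoid underflow.
    peak = max(terms)
    return -(peak + math.log(sum(math.exp(t - peak) for t in terms)))
```

With a single component of weight 1, this reduces to the negative log-likelihood of that component.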

VII CONCLUSIONS

We proposed and implemented a Bingham NLL loss function that does not require a pre-computed lookup table. The loss is directly computable, so no interpolation is needed. We also showed that our loss function is easy to implement and to use for training, and that it can easily be introduced into an existing 6D pose estimator. We tested it with PoseCNN as an example and showed that our representation successfully expresses the ambiguity of rotation, while evaluating the peak of the distribution gives performance comparable to that of the original PoseCNN. Furthermore, we examined the relationship between various parametrizations of the Bingham distribution and their performance from the perspective of object pose estimation. In future work, we would like to handle mixture Bingham distributions based on this loss function for greater capability, especially for objects with discrete symmetry.

References

  • [1] Y. Xiang, T. Schmidt, V. Narayanan, and D. Fox, “PoseCNN: A convolutional neural network for 6d object pose estimation in cluttered scenes,” 2018.
  • [2] M. Bui, S. Zakharov, S. Albarqouni, S. Ilic, and N. Navab, “When regression meets manifold learning for object recognition and pose estimation,” in 2018 IEEE International Conference on Robotics and Automation (ICRA), May 2018, pp. 6140–6146.
  • [3] F. Manhardt, D. M. Arroyo, C. Rupprecht, B. Busam, T. Birdal, N. Navab, and F. Tombari, “Explaining the ambiguity of object detection and 6d pose from visual data,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 6841–6850.
  • [4] K. Hashimoto, D.-N. Ta, E. Cousineau, and R. Tedrake, “KOSNet: A Unified Keypoint, Orientation and Scale Network for Probabilistic 6D Pose Estimation,” 2019.
  • [5] C. Bingham, “An Antipodally Symmetric Distribution on the Sphere,” The Annals of Statistics, vol. 2, no. 6, pp. 1201–1225, 1974. [Online]. Available: https://doi.org/10.1214/aos/1176342874
  • [6] V. Peretroukhin, M. Giamou, D. M. Rosen, W. N. Greene, N. Roy, and J. Kelly, “A Smooth Representation of SO(3) for Deep Rotation Learning with Uncertainty,” in Proceedings of Robotics: Science and Systems (RSS’20), Jul. 12–16 2020.
  • [7] H. Deng, M. Bui, N. Navab, L. Guibas, S. Ilic, and T. Birdal, “Deep Bingham Networks: Dealing with Uncertainty and Ambiguity in Pose Estimation,” 2020.
  • [8] S. Riedel, Z. C. Marton, and S. Kriegel, “Multi-view orientation estimation using Bingham mixture models,” 2016 20th IEEE Int. Conf. Autom. Qual. Testing, Robot. AQTR 2016 - Proc., vol. 2, 2016.
  • [9] Y. Chen and K. Tanaka, “Maximum likelihood estimation of the Fisher–Bingham distribution via efficient calculation of its normalizing constant,” Statistics and Computing, vol. 31, 07 2021.
  • [10] A. Kendall, M. Grimes, and R. Cipolla, “PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization,” 12 2015, pp. 2938–2946.
  • [11] W. Zou, D. Wu, S. Tian, C. Xiang, X. Li, and L. Zhang, “End-to-end 6dof pose estimation from monocular rgb images,” IEEE Transactions on Consumer Electronics, vol. 67, no. 1, pp. 87–96, 2021.
  • [12] S. Brahmbhatt, J. Gu, K. Kim, J. Hays, and J. Kautz, “Geometry-aware learning of maps for camera localization,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [13] Y. Zhou, C. Barnes, J. Lu, J. Yang, and H. Li, “On the continuity of rotation representations in neural networks,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 5738–5746.
  • [14] D. M. Davis, “Embeddings of real projective spaces,” Boletin De La Sociedad Matematica Mexicana, vol. 4, pp. 115–122, 1998.
  • [15] J. Levinson, C. Esteves, K. Chen, N. Snavely, A. Kanazawa, A. Rostamizadeh, and A. Makadia, “An analysis of SVD for deep rotation estimation,” in Advances in Neural Information Processing Systems 34, 2020, to appear.
  • [16] M. Bui, T. Birdal, H. Deng, S. Albarqouni, L. Guibas, S. Ilic, and N. Navab, “6D Camera Relocalization in Ambiguous Scenes via Continuous Multimodal Inference,” in Computer Vision – ECCV 2020, A. Vedaldi, H. Bischof, T. Brox, and J.-M. Frahm, Eds.   Cham: Springer International Publishing, 2020, pp. 139–157.
  • [17] R. A. Srivatsan, M. Xu, N. Zevallos, and H. Choset, “Bingham distribution-based linear filter for online pose estimation,” Robot. Sci. Syst., vol. 13, 2017.
  • [18] I. Gilitschenski, R. Sahoo, W. Schwarting, A. Amini, S. Karaman, and D. Rus, “Deep orientation uncertainty learning based on a bingham loss,” in International Conference on Learning Representations, 2020.
  • [19] A. Kume, S. Preston, and A. Wood, “Saddlepoint approximations for the normalizing constant of Fisher-Bingham distributions on products of spheres and Stiefel manifolds,” Biometrika, vol. 4, 12 2013.
  • [20] E. Hairer and G. Wanner, Geometric Numerical Integration, ser. Springer Series in Computational Mathematics.   Berlin/Heidelberg: Springer-Verlag, 2006, vol. 31. [Online]. Available: http://link.springer.com/10.1007/3-540-30666-8
  • [21] T. Birdal, U. Şimşekli, M. O. Eken, and S. Ilic, “Bayesian Pose Graph Optimization via Bingham Distributions and Tempered Geodesic MCMC,” in Proceedings of the 32nd International Conference on Neural Information Processing Systems, ser. NIPS’18.   Red Hook, NY, USA: Curran Associates Inc., 2018, pp. 306–317.
  • [22] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in Proceedings of the 27th International Conference on Machine Learning, ser. ICML’10.   Madison, WI, USA: Omnipress, 2010, pp. 807–814.
  • [23] K. Tanaka, “Error control of a numerical formula for the fourier transform by ooura’s continuous euler transform and fractional fft,” Journal of Computational and Applied Mathematics, vol. 266, pp. 73–86, 2014.
  • [24] NVIDIA, “PoseCNN-PyTorch,” 2020. [Online]. Available: https://github.com/NVlabs/PoseCNN-PyTorch
  • [25] J. Kent, A. Ganeiber, and K. Mardia, “A new method to simulate the bingham and related distributions in directional data analysis with applications,” 10 2013.
  • [26] K. Kunze and H. Schaeben, “The bingham distribution of quaternions and its spherical radon transform in texture analysis,” Mathematical Geology, vol. 36, pp. 917–943, 11 2004.