
A Birotation Solution for Relative Pose Problems

Hongbo Zhao, Ziwei Long, Mengtan Zhang, Hanli Wang, Qijun Chen, and Rui Fan

Corresponding author: Rui Fan. Hongbo Zhao and Mengtan Zhang are with the Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University, Shanghai, P. R. China (e-mail: hongbozhao@tongji.edu.cn; zmt200110@tongji.edu.cn). Ziwei Long and Qijun Chen are with the College of Electronics & Information Engineering, Tongji University, Shanghai, P. R. China (e-mail: zwlong@tongji.edu.cn; qjchen@tongji.edu.cn). Hanli Wang is with the College of Electronics and Information Engineering, the School of Computer Science and Technology, and the Key Laboratory of Embedded System and Service Computing (Ministry of Education), Tongji University, Shanghai, P. R. China (e-mail: hanliwang@tongji.edu.cn). Rui Fan is with the College of Electronics & Information Engineering, Shanghai Institute of Intelligent Science and Technology, Shanghai Research Institute for Intelligent Autonomous Systems, the State Key Laboratory of Intelligent Autonomous Systems, and Frontiers Science Center for Intelligent Autonomous Systems, Tongji University, Shanghai, P. R. China (e-mail: rui.fan@ieee.org).
Abstract

Relative pose estimation, a fundamental computer vision problem, has been extensively studied for decades. Existing methods either estimate and decompose the essential matrix or directly estimate the rotation and translation to obtain the solution. In this article, we break the mold by tackling this traditional problem with a novel birotation solution. We first introduce three basis transformations, each associated with a geometric metric to quantify the distance between the relative pose to be estimated and its corresponding basis transformation. Three energy functions, designed based on these metrics, are then minimized on the Riemannian manifold \mathrm{SO(3)} by iteratively updating the two rotation matrices. The two rotation matrices and the basis transformation corresponding to the minimum energy are ultimately utilized to recover the relative pose. Extensive quantitative and qualitative evaluations across diverse relative pose estimation tasks demonstrate the superior performance of our proposed birotation solution. Source code, demo video, and datasets will be available at mias.group/birotation-solution upon publication.

Index Terms:
Relative pose, birotation solution, basis transformations

I Introduction

Relative pose estimation is the core of various computer vision applications, including stereo camera extrinsic calibration [1, 2, 3], visual odometry [4, 5, 6], and so forth. This process estimates a transformation matrix, defined as follows:

\boldsymbol{P}=\begin{pmatrix}\boldsymbol{R}&\boldsymbol{t}\\ \boldsymbol{0}^{\top}&1\end{pmatrix}, \qquad (1)

where \boldsymbol{R}\in\mathrm{SO(3)} denotes the rotation matrix, \boldsymbol{t}\in\mathbb{R}^{3} represents the translation vector, and \boldsymbol{0} is a column vector of zeros. This problem can be solved by estimating an essential matrix \boldsymbol{E}, which satisfies the following epipolar constraint:

{\boldsymbol{p}^{C}_{2}}^{\top}\boldsymbol{E}{\boldsymbol{p}^{C}_{1}}=0, \qquad (2)

and decomposing it into \boldsymbol{R} and \boldsymbol{t}, where \boldsymbol{p}^{C}_{1}=(x_{1},y_{1},z_{1})^{\top} and \boldsymbol{p}^{C}_{2}=(x_{2},y_{2},z_{2})^{\top} represent a pair of 3-D correspondences in the reference and target camera coordinate systems, respectively (throughout this article, the subscripts "1" and "2" denote the reference and target views, respectively). Alternatively, \boldsymbol{P} can be directly estimated based on the following relation:

\tilde{\boldsymbol{p}}^{C}_{2}=\boldsymbol{P}\tilde{\boldsymbol{p}}^{C}_{1}, \qquad (3)

where \tilde{\boldsymbol{p}}^{C} denotes the homogeneous coordinates of \boldsymbol{p}^{C}. Although both methods provide viable solutions for determining the relative pose, they still have significant limitations.
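As a quick numerical illustration (a minimal NumPy sketch with our own variable names, not code from this article), the epipolar constraint in (2) can be checked for a candidate essential matrix and a batch of correspondences:

import numpy as np

def epipolar_residuals(E, p1, p2):
    # Residuals of the epipolar constraint (2): p2_n^T E p1_n for every n.
    # E      : (3, 3) candidate essential matrix.
    # p1, p2 : (N, 3) 3-D correspondences in the reference / target camera frames.
    return np.einsum('ni,ij,nj->n', p2, E, p1)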

In the first approach, the essential matrix inherently possesses only five degrees of freedom (DoFs), yet its estimation typically requires solving for eight or nine unknown parameters, which can undermine both the uniqueness and stability of the solution [7, 8]. In particular, overfitting to inaccurately matched correspondences can substantially impair the performance of relative pose estimation [9]. To address this issue, researchers have resorted to correspondence refinement strategies [10, 11, 12, 13] and additional geometric constraints [8, 14]. Additionally, this approach cannot solve pure rotation problems, as the essential matrix degenerates to a zero matrix under such conditions [15].

In the second approach, the choice of the optimization algorithm hinges on the availability of depth information. When depth is known, relative pose estimation can be formulated as a perspective-n-point (PnP) problem [16], which can be solved reliably and efficiently using methods such as efficient PnP [17]. In contrast, when depth is unknown, the scale of the translation vector \boldsymbol{t} becomes ambiguous, necessitating its treatment as a unit vector [18]. Normalizing \boldsymbol{t}, nevertheless, increases the risk of the optimization process falling into local minima, posing significant challenges in achieving stable and accurate solutions [19].

Substituting the translation vector with a rotation matrix and using two rotation matrices to solve the relative pose problem can avoid the instability caused by the absence of depth information while introducing only one additional dimensionality to the solution space [19]. Although this concept has been investigated in previous studies [2, 3, 19], its practical application remains predominantly confined to stereo camera extrinsic online calibration, where the rotation matrix approximates the identity matrix and the translation vector aligns closely with the X-axis [3, 19]. Therefore, these methods fall short of providing a stable solution for general relative pose estimation problems.

To address this well-studied yet enduring problem, we propose a novel birotation solution in this article. Our study effectively expands existing methods, initially designed for stereo camera extrinsic online calibration, to tackle a broader range of relative pose problems. Specifically, we first define three basis transformations along the X-, Y-, and Z-axes. Three geometric metrics are then designed to quantify the distances between the relative pose to be estimated and these basis transformations. Each metric can be employed to construct an energy function E_{i}, which includes a regularization term to guarantee the uniqueness of the solution. The energy function E_{i} is minimized on the Riemannian manifold \mathrm{SO(3)} through iterative updates of the reference and target camera coordinate systems with two independent rotation matrices, \boldsymbol{R}_{1,i} and \boldsymbol{R}_{2,i}. These optimization processes yield three locally optimal birotation solutions, each defined by a pair of rotations, a scaling factor, and a coordinate axis. The birotation solution with the minimal geometric metric is then selected as the optimal solution, from which the relative pose \boldsymbol{P} can be directly derived. Extensive experiments across a variety of relative pose estimation tasks demonstrate the feasibility of our proposed solution and its greater accuracy over all other state-of-the-art (SoTA) methods.

In a nutshell, our contributions are as follows:

  • We revisit the relative pose problem from a fresh perspective and propose a novel birotation solution.

  • We introduce three basis transformations and corresponding geometric metrics to expand relevant prior works, which primarily focus on stereo camera extrinsic calibration, to solve general relative pose problems.

  • We propose a highly efficient algorithm for relative pose estimation based on the birotation solution, which simultaneously optimizes three energy functions derived from geometric metrics.

  • We demonstrate the feasibility and superiority of our solution through quantitative and qualitative evaluations across diverse relative pose estimation tasks.

The remainder of this article is organized as follows: Sect. II reviews previous relative pose estimation algorithms. The proposed basis transformations, geometric metrics, birotation solutions, and the relative pose estimation algorithm are detailed in Sect. III. Sect. IV presents experimental results that demonstrate the effectiveness and superiority of our algorithm. The potential risks associated with the uniqueness of the birotation solution are discussed in Sect. V. Finally, we conclude this article and suggest directions for future work in Sect. VI.

II Related Work

For the relative pose problem, the essential matrix estimation process under the epipolar constraint in (2) can be formulated as follows:

\underset{\boldsymbol{E}}{\arg\min}\sum^{N}_{n=1}\|{\boldsymbol{p}^{C}_{2,n}}^{\top}\boldsymbol{E}{\boldsymbol{p}^{C}_{1,n}}\|^{2}_{2}, \qquad (4)

where N represents the number of 3-D correspondences, and \boldsymbol{p}^{C}_{1,n} and \boldsymbol{p}^{C}_{2,n} denote the n-th pair of 3-D correspondences in the reference and target camera coordinate systems, respectively. Although the eight-point method [7] can effectively solve such an N-point problem, it necessitates highly accurate correspondences. The five-point method, combined with random sample consensus (RANSAC) [8, 9], proves effective in addressing cases where correspondences are less reliable. Furthermore, existing local optimization approaches [20, 21, 14] either suffer from low efficiency or are prone to becoming trapped in local minima. While manifold-based approaches [22, 23, 14] provide insightful constructions of the solution space, they often fail to guarantee iterative convergence in complex relative pose problems [12]. To overcome these challenges, the study [12] reformulates the N-point problem as a quadratically constrained quadratic program (QCQP) and utilizes semi-definite relaxation to achieve a globally optimal solution, demonstrating polynomial-time convergence.

Different from the above-mentioned methods, the study [18] directly estimates \boldsymbol{R} and \boldsymbol{t} by solving the following optimization problem based on the epipolar constraint:

\underset{\boldsymbol{R},\boldsymbol{t}}{\arg\min}\sum^{N}_{n=1}\left\|{\boldsymbol{p}^{C}_{1,n}}^{\top}[\boldsymbol{t}]_{\times}\boldsymbol{R}\boldsymbol{p}^{C}_{2,n}\right\|^{2}_{2}, \qquad (5)

where [\boldsymbol{t}]_{\times} denotes the cross-product (skew-symmetric) matrix of \boldsymbol{t}. However, this method demonstrates limited stability in estimating \boldsymbol{t}, primarily due to the significant non-convexity of the energy function introduced by [\boldsymbol{t}]_{\times} and the normalization of \boldsymbol{t} [19].

In recent years, data-driven methods, such as PoseNet [24], have gained significant popularity. Prior studies predominantly utilize convolutional neural networks, often employing a ResNet-18, to directly regress the 6-DoF relative poses, which include three Euler angles and three translation vector elements. Nevertheless, these methods often demonstrate limited interpretability, generalization, and accuracy. Recent advancements, such as SC-Depth [25] and SCIPaD [26], have incorporated a monocular depth estimation branch to improve the overall performance of relative pose estimation. However, the insufficient geometric constraints between depth and relative pose incorporated during training limit the accuracy of these approaches.

As discussed above, the use of birotation for relative pose estimation remains confined to stereo camera extrinsic rectification. The study [3] uses two independent rotation matrices, \boldsymbol{R}_{1} and \boldsymbol{R}_{2}, to separately rotate the left and right camera coordinate systems. It derives the Euler angles of \boldsymbol{R}_{1} and \boldsymbol{R}_{2} by formulating and solving the following optimization problem based on stereo rectification:

\underset{\boldsymbol{R}_{1},\boldsymbol{R}_{2}}{\arg\min}\sum^{N}_{n=1}\left\|\frac{\boldsymbol{r}_{1,2}^{\top}\boldsymbol{p}^{C}_{1,n}}{\boldsymbol{r}_{1,3}^{\top}\boldsymbol{p}^{C}_{1,n}}-\frac{\boldsymbol{r}_{2,2}^{\top}\boldsymbol{p}^{C}_{2,n}}{\boldsymbol{r}_{2,3}^{\top}\boldsymbol{p}^{C}_{2,n}}\right\|^{2}_{2}, \qquad (6)

where \boldsymbol{r}_{1,j} and \boldsymbol{r}_{2,j} denote the j-th row vectors of \boldsymbol{R}_{1} and \boldsymbol{R}_{2}, respectively. The recent study [19] improves this approach by incorporating geometric constraints to ensure solution uniqueness. Nonetheless, these approaches are applicable only to relatively ideal stereo vision models (with small perturbations) and cannot be generalized to broader relative pose estimation problems.

III Methodology

While it is geometrically feasible to obtain an ideal stereo vision model (a transformation along the X-axis) by rotating two cameras, solving for such a pure rotation problem becomes highly challenging, particularly when the actual relative pose significantly deviates from the ideal model. This challenge arises because existing birotation-based methods [3, 19] primarily utilize local optimization algorithms to address a non-convex problem. Their effectiveness is influenced by both the initialization and the penalty term in each energy function, which can misdirect the optimization process. This greatly limits their applicability to general relative pose problems. Therefore, Sect. III-A first introduces two additional transformations along the Y- and Z-axes, respectively. Subsequently, Sect. III-B defines the geometric metrics corresponding to these three basis transformations. These geometric metrics are then employed to formulate energy functions, the minimization of which yields three birotation solutions, as presented in Sect. III-C. Building upon these solutions, a novel relative pose estimation algorithm is introduced in Sect. III-D.

III-A Basis Transformations

Previous birotation-based methods [3, 19] have been confined to an ideal stereo vision model (both cameras have an identical intrinsic matrix), primarily because their energy functions are formulated based on the constraint that \boldsymbol{p}_{1}=(u_{1},v_{1})^{\top} and \boldsymbol{p}_{2}=(u_{2},v_{2})^{\top}, a given pair of correspondences in the two images, have the same vertical coordinates (i.e., v_{1}=v_{2}). Similar constraints can be observed when a single camera translates solely along either the Y-axis or the Z-axis. Specifically, the horizontal coordinates of \boldsymbol{p}_{1} and \boldsymbol{p}_{2} are identical (i.e., u_{1}=u_{2}) in the former case, while the ratios of their horizontal to vertical coordinates are identical (i.e., \frac{u_{1}}{v_{1}}=\frac{u_{2}}{v_{2}}) in the latter case.

Nevertheless, in general relative pose problems, the intrinsic matrices of the reference and target cameras may differ. It is therefore necessary to reformulate the above-mentioned constraints based on the following basis transformations:

{\boldsymbol{p}^{C}_{1}}={\boldsymbol{p}^{C}_{2}}+s_{i}\boldsymbol{l}_{i},\ \ i\in\{1,2,3\}, \qquad (7)

where \boldsymbol{l}_{1}=(1,0,0)^{\top}, \boldsymbol{l}_{2}=(0,1,0)^{\top}, and \boldsymbol{l}_{3}=(0,0,1)^{\top} are unit vectors representing pure translations along the X-, Y-, and Z-axes of the camera coordinate system, respectively, and s_{i}\in(-\infty,+\infty) denotes the scaling factor corresponding to each transformation. \boldsymbol{p}^{C}_{1,2} and \boldsymbol{p}_{1,2} have the following relation:

\begin{cases}\boldsymbol{p}^{C}_{1}=z_{1}\bar{\boldsymbol{p}}_{1}=z_{1}\boldsymbol{K}^{-1}_{1}\tilde{\boldsymbol{p}}_{1},\\ \boldsymbol{p}^{C}_{2}=z_{2}\bar{\boldsymbol{p}}_{2}=z_{2}\boldsymbol{K}^{-1}_{2}\tilde{\boldsymbol{p}}_{2},\end{cases} \qquad (8)

where \bar{\boldsymbol{p}}_{1,2} represents the normalized coordinates of \boldsymbol{p}^{C}_{1,2} with respect to depth, \tilde{\boldsymbol{p}}_{1,2}=(u_{1,2},v_{1,2},1)^{\top} denotes the homogeneous coordinates of \boldsymbol{p}_{1,2}, and

\boldsymbol{K}_{1}=\begin{pmatrix}f_{x,1}&0&u_{o,1}\\ 0&f_{y,1}&v_{o,1}\\ 0&0&1\end{pmatrix} \qquad (9)

and

\boldsymbol{K}_{2}=\begin{pmatrix}f_{x,2}&0&u_{o,2}\\ 0&f_{y,2}&v_{o,2}\\ 0&0&1\end{pmatrix} \qquad (10)

represent the intrinsic matrices of the reference and target cameras, respectively. By combining (7) and (8), we can reformulate the aforementioned three constraints into the following expressions:

\text{X-Axis:}\ \ {\boldsymbol{p}^{C}_{1}}={\boldsymbol{p}^{C}_{2}}+s_{1}\boldsymbol{l}_{1}\rightarrow\frac{y_{1}}{z_{1}}=\frac{y_{2}}{z_{2}}\rightarrow\frac{v_{1}-v_{o,1}}{f_{y,1}}=\frac{v_{2}-v_{o,2}}{f_{y,2}}, \qquad (11)

\text{Y-Axis:}\ \ {\boldsymbol{p}^{C}_{1}}={\boldsymbol{p}^{C}_{2}}+s_{2}\boldsymbol{l}_{2}\rightarrow\frac{x_{1}}{z_{1}}=\frac{x_{2}}{z_{2}}\rightarrow\frac{u_{1}-u_{o,1}}{f_{x,1}}=\frac{u_{2}-u_{o,2}}{f_{x,2}}, \qquad (12)

\text{Z-Axis:}\ \ {\boldsymbol{p}^{C}_{1}}={\boldsymbol{p}^{C}_{2}}+s_{3}\boldsymbol{l}_{3}\rightarrow\frac{x_{1}}{y_{1}}=\frac{x_{2}}{y_{2}}\rightarrow\frac{u_{1}-u_{o,1}}{v_{1}-v_{o,1}}=\frac{u_{2}-u_{o,2}}{v_{2}-v_{o,2}}, \qquad (13)

which serve as the foundation for developing geometric metrics in the subsequent subsection.
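For illustration, the constraints in (11)-(13) can be evaluated directly from pixel coordinates and camera intrinsics. The following minimal sketch (our own variable names, not the authors' code) returns the three constraint residuals for a batch of correspondences:

import numpy as np

def basis_constraint_residuals(p1, p2, K1, K2):
    # p1, p2 : (N, 2) pixel coordinates in the reference / target image.
    # K1, K2 : (3, 3) intrinsic matrices, as defined in (9) and (10).
    fx1, fy1, uo1, vo1 = K1[0, 0], K1[1, 1], K1[0, 2], K1[1, 2]
    fx2, fy2, uo2, vo2 = K2[0, 0], K2[1, 1], K2[0, 2], K2[1, 2]
    # X-axis transformation (11): equal normalized vertical coordinates.
    rx = (p1[:, 1] - vo1) / fy1 - (p2[:, 1] - vo2) / fy2
    # Y-axis transformation (12): equal normalized horizontal coordinates.
    ry = (p1[:, 0] - uo1) / fx1 - (p2[:, 0] - uo2) / fx2
    # Z-axis transformation (13): equal ratios of horizontal to vertical offsets.
    rz = (p1[:, 0] - uo1) / (p1[:, 1] - vo1) - (p2[:, 0] - uo2) / (p2[:, 1] - vo2)
    return np.stack([rx, ry, rz], axis=1)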

III-B Geometric Metrics

Existing methods [3, 19] formulate the energy function as the sum of absolute residuals calculated based on the constraint presented in (11). In this subsection, we comprehensively analyze the drawbacks of existing methods in terms of absolute residual accumulation, and introduce three geometric metrics that quantify the distances between the relative pose to be estimated and the basis transformations.

Figure 1: (a) Definition of \boldsymbol{\Phi}=(\phi_{1},\phi_{2},\phi_{3}); (b) Comparison between {\epsilon}_{1} and {\epsilon}^{\prime}_{1}; (c) Comparison between {\epsilon}_{2} and {\epsilon}^{\prime}_{2}; (d) Comparison between {\epsilon}_{3} and {\epsilon}^{\prime}_{3}.

As illustrated in Fig. 1(a), \bar{\boldsymbol{p}}_{1} and \bar{\boldsymbol{p}}_{2}, the normalized coordinates of a pair of correspondences with respect to depth, can be respectively expressed using \boldsymbol{\Phi}_{1}=(\phi_{1,1},\phi_{2,1},\phi_{3,1})^{\top} and \boldsymbol{\Phi}_{2}=(\phi_{1,2},\phi_{2,2},\phi_{3,2})^{\top} as follows:

\begin{cases}\bar{\boldsymbol{p}}_{1}=\left(\frac{x_{1}}{z_{1}},\frac{y_{1}}{z_{1}},1\right)^{\top}=(\tan(\phi_{2,1}),\tan(\phi_{1,1}),1)^{\top},\\ \bar{\boldsymbol{p}}_{2}=\left(\frac{x_{2}}{z_{2}},\frac{y_{2}}{z_{2}},1\right)^{\top}=(\tan(\phi_{2,2}),\tan(\phi_{1,2}),1)^{\top}.\end{cases} \qquad (14)

These angles satisfy the following relationship:

\frac{\tan(\phi_{1})\tan(\phi_{3})}{\tan(\phi_{2})}=1. \qquad (15)

Combining (14) and (15) results in the following expressions:

\begin{cases}\frac{x_{1}}{y_{1}}=\frac{x_{1}}{z_{1}}\frac{z_{1}}{y_{1}}=\tan(\phi_{3,1}),\\ \frac{x_{2}}{y_{2}}=\frac{x_{2}}{z_{2}}\frac{z_{2}}{y_{2}}=\tan(\phi_{3,2}).\end{cases} \qquad (16)

According to (14) and (16), an absolute residual vector \boldsymbol{\epsilon}, formulated based on the geometric constraints presented in (11)-(13), can be expressed in terms of \boldsymbol{\Phi}_{1} and \boldsymbol{\Phi}_{2} as follows:

\boldsymbol{\epsilon}=\left(\left|\frac{y_{1}}{z_{1}}-\frac{y_{2}}{z_{2}}\right|,\left|\frac{x_{1}}{z_{1}}-\frac{x_{2}}{z_{2}}\right|,\left|\frac{x_{1}}{y_{1}}-\frac{x_{2}}{y_{2}}\right|\right)=\left|\tan(\boldsymbol{\Phi}_{1})-\tan(\boldsymbol{\Phi}_{2})\right|. \qquad (17)

{\epsilon}_{i}, the i-th element of \boldsymbol{\epsilon}, is illustrated in Fig. 1(b)-(d), respectively. As the relative pose to be estimated nears alignment with the i-th basis transformation, \Delta\phi_{i}=\lvert\phi_{i,1}-\phi_{i,2}\rvert approaches zero. The ratio of {\epsilon}_{i} to \Delta\phi_{i} can be formulated as follows:

\lim_{\Delta\phi_{i}\to 0}\frac{{\epsilon}_{i}}{\Delta\phi_{i}}=\lim_{\phi_{i,2}\to\phi_{i,1}}\frac{\left|\tan(\phi_{i,1})-\tan(\phi_{i,2})\right|}{\Delta\phi_{i}}=\frac{1}{\cos^{2}(\phi_{i,1})}, \qquad (18)

which leads to the following expression:

{\epsilon}_{i}\approx\frac{\Delta\phi_{i}}{\cos^{2}(\phi_{i,1})}, \qquad (19)

indicating that, for a fixed \Delta\phi_{i}, {\epsilon}_{i} grows monotonically with \lvert\phi_{i,1}\rvert. Therefore, the implicit weights of the residuals currently in use are subject to pixel coordinates. To illustrate this bias, we visualize the normalized implicit weights under different constraints, as shown in Fig. 2. Specifically, under the first constraint, the implicit weight increases along the vertical axis (see Fig. 2(a)), peaking near the top and bottom image boundaries and reaching a minimum near the horizontal midline. Similarly, under the second constraint, the implicit weight increases along the horizontal axis (see Fig. 2(b)), peaking near the left and right image boundaries and reaching a minimum near the vertical midline. Nonetheless, as illustrated in Fig. 2(c), under the third constraint, the implicit weights along the horizontal midline reach the maximum, whereas those along the vertical midline reach the minimum. In addition, the implicit weights are symmetric about the image center and decrease monotonically as the line passing through the center rotates from the horizontal to the vertical midline. This phenomenon emphasizes the instability of previous relative pose estimation approaches, particularly when correspondences from specific regions, such as treetops or sky, are utilized.

Figure 2: Visualizations of the normalized implicit weights under different constraints: (a) Constraint given in (11); (b) Constraint given in (12); (c) Constraint given in (13).
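The weight maps in Fig. 2 can be reproduced from (19), since the implicit weight equals 1/\cos^{2}(\phi_{i,1})=1+\tan^{2}(\phi_{i,1}). The sketch below (a single pinhole camera with hypothetical intrinsics is assumed; our own code) generates the three raw weight maps, which can then be normalized for visualization:

import numpy as np

def implicit_weight_maps(H, W, fx, fy, uo, vo):
    # Implicit residual weights 1 / cos^2(phi_i) = 1 + tan^2(phi_i) over an H x W image.
    u, v = np.meshgrid(np.arange(W, dtype=float), np.arange(H, dtype=float))
    xn = (u - uo) / fx                      # x / z
    yn = (v - vo) / fy                      # y / z
    w1 = 1.0 + yn ** 2                      # constraint (11), tan(phi_1) = y / z
    w2 = 1.0 + xn ** 2                      # constraint (12), tan(phi_2) = x / z
    yn_safe = np.where(np.abs(yn) < 1e-6, 1e-6, yn)
    w3 = 1.0 + (xn / yn_safe) ** 2          # constraint (13), tan(phi_3) = x / y
    return w1, w2, w3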

To overcome this limitation, we adjust the three constraints in (11)-(13) using an inverse trigonometric function, as follows:

\text{X-Axis:}\ \ \arctan\frac{v_{1}-v_{o,1}}{f_{y,1}}=\arctan\frac{v_{2}-v_{o,2}}{f_{y,2}}, \qquad (20)

\text{Y-Axis:}\ \ \arctan\frac{u_{1}-u_{o,1}}{f_{x,1}}=\arctan\frac{u_{2}-u_{o,2}}{f_{x,2}}, \qquad (21)

\text{Z-Axis:}\ \ \arctan\frac{u_{1}-u_{o,1}}{v_{1}-v_{o,1}}=\arctan\frac{u_{2}-u_{o,2}}{v_{2}-v_{o,2}}. \qquad (22)

The new absolute residual vector \boldsymbol{\epsilon}^{\prime} can therefore be expressed as follows:

\boldsymbol{\epsilon}^{\prime}=|\boldsymbol{\Phi}_{1}-\boldsymbol{\Phi}_{2}|=(\Delta\phi_{1},\Delta\phi_{2},\Delta\phi_{3}), \qquad (23)

which eliminates the influence of implicit weights in residual accumulation. Comparisons between the absolute residuals {\epsilon}_{i} and {\epsilon}^{\prime}_{i} are also given in Fig. 1(b)-(d).
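A small sketch of the adjusted residuals follows (our own function; arctan2 is used for numerical robustness in place of the plain arctan in (23), which is our choice):

import numpy as np

def angular_residuals(p1_bar, p2_bar):
    # Residual vector epsilon' = |Phi_1 - Phi_2| from (23).
    # p1_bar, p2_bar : (N, 3) depth-normalized coordinates (x/z, y/z, 1).
    def phis(p):
        x, y, z = p[:, 0], p[:, 1], p[:, 2]
        return np.stack([np.arctan2(y, z),      # phi_1
                         np.arctan2(x, z),      # phi_2
                         np.arctan2(x, y)],     # phi_3
                        axis=1)
    return np.abs(phis(p1_bar) - phis(p2_bar))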

In accordance with the new constraints defined in (20)-(22), we introduce three geometric metrics that correspond to the three basis transformations as follows:

\text{X-Axis:}\ \ d_{1}=\iint\left(\Delta\arctan\frac{v-v_{o}}{f}\right)^{2}\,du\,dv, \qquad (24)

\text{Y-Axis:}\ \ d_{2}=\iint\left(\Delta\arctan\frac{u-u_{o}}{f}\right)^{2}\,du\,dv, \qquad (25)

\text{Z-Axis:}\ \ d_{3}=\iint\left(\Delta\arctan\frac{u-u_{o}}{v-v_{o}}\right)^{2}\,du\,dv. \qquad (26)

The value of d_{i} represents the distance between the actual relative pose and the i-th basis transformation. A smaller d_{i} indicates a closer alignment between the actual relative pose and the i-th basis transformation. When exactly one of the three distances approaches zero, the actual relative pose aligns with the corresponding basis transformation. Additionally, it is worth noting that when the camera remains stationary, all three distances approach zero.

III-C The Birotation Solution

Building on the above-mentioned basis transformations and geometric metrics, we introduce three birotation solutions to the relative pose estimation problem.

We use two rotation matrices, \boldsymbol{R}_{1} and \boldsymbol{R}_{2}, to apply pure rotation transformations to the reference and target camera coordinate systems, respectively, in order to align them with the i-th basis transformation. This process can be formulated as follows:

\begin{cases}{\boldsymbol{p}^{C}_{1}}^{\prime}=\boldsymbol{R}_{1}\boldsymbol{p}^{C}_{1}=z_{1}\boldsymbol{R}_{1}\bar{\boldsymbol{p}}_{1},\\ {\boldsymbol{p}^{C}_{2}}^{\prime}=\boldsymbol{R}_{2}\boldsymbol{p}^{C}_{2}=z_{2}\boldsymbol{R}_{2}\bar{\boldsymbol{p}}_{2},\\ {\boldsymbol{p}^{C}_{1}}^{\prime}={\boldsymbol{p}^{C}_{2}}^{\prime}+s_{i}\boldsymbol{l}_{i},\ \ i\in\{1,2,3\},\end{cases} \qquad (27)

where \boldsymbol{p}^{C^{\prime}}_{1,2} denotes the transformed coordinates of \boldsymbol{p}^{C}_{1,2} in the new camera coordinate systems. We refer to (27) as the i-th birotation model for each i\in\{1,2,3\}, where the corresponding parameters \boldsymbol{R}_{1}, \boldsymbol{R}_{2}, s_{i}, and \boldsymbol{l}_{i} together constitute the i-th birotation solution for the relative pose problem.

By substituting the i-th birotation model into the geometric metrics presented in (24)-(26), the three geometric metrics can be reformulated as functions of \boldsymbol{R}_{1} and \boldsymbol{R}_{2}, as follows:

d_{i}=\iint\left(\arctan\frac{\boldsymbol{r}_{1,j}^{\top}\bar{\boldsymbol{p}}_{1}}{\boldsymbol{r}_{1,k}^{\top}\bar{\boldsymbol{p}}_{1}}-\arctan\frac{\boldsymbol{r}_{2,j}^{\top}\bar{\boldsymbol{p}}_{2}}{\boldsymbol{r}_{2,k}^{\top}\bar{\boldsymbol{p}}_{2}}\right)^{2}\,du\,dv, \qquad (28)

where (i\ j\ k) forms an even permutation [27] of the set \{1,2,3\} when i\in\{1,3\}, and an odd permutation when i=2. Given the scaling factor s_{i}, the i-th birotation solution can be obtained by minimizing d_{i}.

Figure 3: An overview of a single iteration in our proposed relative pose estimation algorithm. For each i\in\{1,2,3\}, two rotation matrices \boldsymbol{R}_{1,i} and \boldsymbol{R}_{2,i}, obtained from the previous iteration, are used to rotate N pairs of correspondences. The transformed correspondences \boldsymbol{p}^{C^{\prime}}_{1,n,i} and \boldsymbol{p}^{C^{\prime}}_{2,n,i} are then used to compute the residual vectors \boldsymbol{e}_{1}, \boldsymbol{e}_{2}, and \boldsymbol{e}_{3} via (34). The residuals, along with the diagonal matrix \boldsymbol{\Lambda} (used for outlier removal), are incorporated into the energy functions defined in (36). To optimize the energy, the rotation matrices \boldsymbol{R}_{1,i} and \boldsymbol{R}_{2,i} are mapped to the tangent space \mathfrak{so}(3) via the logarithmic map, allowing the increment vector \Delta\boldsymbol{\Theta}_{i} to be solved in closed form. The updated rotation matrices \boldsymbol{R}^{\prime}_{1,i} and \boldsymbol{R}^{\prime}_{2,i} are then obtained on the Riemannian manifold \mathrm{SO(3)} through the exponential map. If convergence has not yet been achieved, \boldsymbol{R}^{\prime}_{1,i} and \boldsymbol{R}^{\prime}_{2,i} are passed to the next iteration; otherwise, they are used to derive the final relative pose via (31).

The relative pose \boldsymbol{P} and the essential matrix \boldsymbol{E} can be derived from the birotation solutions. By multiplying both sides of \tilde{\boldsymbol{p}}^{C}_{2}=\boldsymbol{P}\tilde{\boldsymbol{p}}^{C}_{1} with the matrix \begin{pmatrix}\boldsymbol{R}_{2}&\boldsymbol{0}\end{pmatrix}, we have the following expression:

\boldsymbol{R}_{2}\boldsymbol{p}^{C}_{2}=\boldsymbol{R}_{2}\boldsymbol{R}\boldsymbol{p}^{C}_{1}+\boldsymbol{R}_{2}\boldsymbol{t}. \qquad (29)

According to the i-th birotation model defined in (27), the following relationship holds:

\boldsymbol{p}^{C^{\prime}}_{2}=\boldsymbol{R}_{2}\boldsymbol{p}^{C}_{2}=\boldsymbol{p}^{C^{\prime}}_{1}-s_{i}\boldsymbol{l}_{i}=\boldsymbol{R}_{1}\boldsymbol{p}^{C}_{1}-s_{i}\boldsymbol{l}_{i}. \qquad (30)

By combining (29) and (30), the rotation matrix \boldsymbol{R} and translation vector \boldsymbol{t} can be determined from the i-th birotation solution as follows:

\begin{cases}\boldsymbol{R}=\boldsymbol{R}^{\top}_{2}\boldsymbol{R}_{1},\\ \boldsymbol{t}=-s_{i}\boldsymbol{R}^{\top}_{2}\boldsymbol{l}_{i}=-s_{i}\boldsymbol{r}_{2,i}.\end{cases} \qquad (31)

Ideally, the three birotation solutions yield an identical relative pose. The essential matrix \boldsymbol{E} can then be derived from the birotation solutions as follows:

\boldsymbol{E}=\frac{1}{\|\boldsymbol{t}\|_{2}}[\boldsymbol{t}]_{\times}\boldsymbol{R}=-[\boldsymbol{r}_{2,i}]_{\times}\boldsymbol{R}^{\top}_{2}\boldsymbol{R}_{1} \qquad (32)
=\begin{cases}(\boldsymbol{0},\boldsymbol{r}_{2,3},-\boldsymbol{r}_{2,2})\boldsymbol{R}_{1},&\text{if}\ i=1,\\ (-\boldsymbol{r}_{2,3},\boldsymbol{0},\boldsymbol{r}_{2,1})\boldsymbol{R}_{1},&\text{if}\ i=2,\\ (\boldsymbol{r}_{2,2},-\boldsymbol{r}_{2,1},\boldsymbol{0})\boldsymbol{R}_{1},&\text{if}\ i=3,\end{cases}
=\begin{cases}\boldsymbol{r}_{2,3}\boldsymbol{r}^{\top}_{1,2}-\boldsymbol{r}_{2,2}\boldsymbol{r}^{\top}_{1,3},&\text{if}\ i=1,\\ \boldsymbol{r}_{2,1}\boldsymbol{r}^{\top}_{1,3}-\boldsymbol{r}_{2,3}\boldsymbol{r}^{\top}_{1,1},&\text{if}\ i=2,\\ \boldsymbol{r}_{2,2}\boldsymbol{r}^{\top}_{1,1}-\boldsymbol{r}_{2,1}\boldsymbol{r}^{\top}_{1,2},&\text{if}\ i=3.\end{cases}

However, it should be noted that none of the three birotation solutions is unique, as each solution has six DoFs (three from each rotation matrix), while the relative pose problem with ambiguous scale has only five DoFs. Specifically, rotating both the reference and target camera coordinate systems by the same angle around the translation vector does not change their relative pose, which geometrically demonstrates the redundant DoF in each birotation solution. Thus, additional geometric constraints should be incorporated into our proposed relative pose estimation algorithm, detailed in the next subsection, to resolve this ambiguity.
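For reference, (31) and (32) translate almost verbatim into code. The sketch below (our own variable names; the index i is 1-based, as in the text) recovers \boldsymbol{R}, \boldsymbol{t}, and \boldsymbol{E} from a birotation solution:

import numpy as np

def pose_from_birotation(R1, R2, i, s=1.0):
    # Relative pose from the i-th birotation solution, following (31).
    R = R2.T @ R1
    t = -s * R2[i - 1]           # -s * R2^T l_i equals -s times the i-th row of R2
    return R, t

def essential_from_birotation(R1, R2, i):
    # Essential matrix from the i-th birotation solution, following (32).
    r = R2[i - 1]                # i-th row of R2
    rx = np.array([[0.0, -r[2], r[1]],
                   [r[2], 0.0, -r[0]],
                   [-r[1], r[0], 0.0]])
    return -rx @ R2.T @ R1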

III-D Relative Pose Estimation Algorithm

According to the constraints presented in (20)-(22) and the birotation models defined in (27), we can obtain three N-entry residual vectors using N pairs of correspondences, as follows:

\boldsymbol{e}_{i}(\boldsymbol{\theta}_{1,i},\boldsymbol{\theta}_{2,i})=(e_{i,1},e_{i,2},\dots,e_{i,N})^{\top},\ \ i\in\{1,2,3\}, \qquad (33)

where

e_{i,n}=\arctan\frac{\boldsymbol{r}_{1,j,i}^{\top}\bar{\boldsymbol{p}}_{1,n}}{\boldsymbol{r}_{1,k,i}^{\top}\bar{\boldsymbol{p}}_{1,n}}-\arctan\frac{\boldsymbol{r}_{2,j,i}^{\top}\bar{\boldsymbol{p}}_{2,n}}{\boldsymbol{r}_{2,k,i}^{\top}\bar{\boldsymbol{p}}_{2,n}},\ \ n\in[1,N]\cap\mathbb{Z}, \qquad (34)

\boldsymbol{\theta}_{1,i} and \boldsymbol{\theta}_{2,i} denote the rotation vectors corresponding to the rotation matrices \boldsymbol{R}_{1,i} and \boldsymbol{R}_{2,i} in the i-th birotation model, respectively, \bar{\boldsymbol{p}}_{1,n} and \bar{\boldsymbol{p}}_{2,n} represent the normalized coordinates of the n-th pair of correspondences with respect to depth, \boldsymbol{r}_{1,j,i} and \boldsymbol{r}_{1,k,i} denote the j-th and k-th row vectors of \boldsymbol{R}_{1,i}, respectively, and \boldsymbol{r}_{2,j,i} and \boldsymbol{r}_{2,k,i} are the j-th and k-th row vectors of \boldsymbol{R}_{2,i}, respectively. \hat{d}_{i}, the discretized approximation of d_{i} defined in (28), can be yielded by accumulating the residuals defined in (34) as follows:

\hat{d}_{i}=\boldsymbol{e}^{\top}_{i}(\boldsymbol{\theta}_{1,i},\boldsymbol{\theta}_{2,i})\boldsymbol{\Lambda}\boldsymbol{e}_{i}(\boldsymbol{\theta}_{1,i},\boldsymbol{\theta}_{2,i}),\ \ i\in\{1,2,3\}, \qquad (35)

where \boldsymbol{\Lambda} is a diagonal matrix in which each element \lambda_{n} adaptively takes a value of either 0 or 1 to indicate the reliability of the n-th pair of correspondences. By incorporating a regularization term into \hat{d}_{i}, the complete energy function for the i-th birotation model can be formulated as follows:

E_{i}(\boldsymbol{\theta}_{1,i},\boldsymbol{\theta}_{2,i})=\boldsymbol{e}^{\top}_{i}\boldsymbol{\Lambda}\boldsymbol{e}_{i}+\alpha R(\boldsymbol{\theta}_{1,i},\boldsymbol{\theta}_{2,i}),\ \ i\in\{1,2,3\}, \qquad (36)

where R(\boldsymbol{\theta}_{1,i},\boldsymbol{\theta}_{2,i})=\|\boldsymbol{\theta}_{1,i}\|^{2}_{2}+\|\boldsymbol{\theta}_{2,i}\|^{2}_{2} denotes the L_{2} regularization term, and \alpha>0 represents the regularization parameter. The regularization term is incorporated to reduce the energy function's non-convexity and ensure the solution's uniqueness.
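The residual and energy computation in (33)-(36) reduces to a few lines. A minimal sketch follows (our own variable names; arctan2 is used in place of the arctan of a ratio, which is our numerical choice):

import numpy as np

def residuals_and_energy(R1, R2, p1_bar, p2_bar, lam, i, theta1, theta2, alpha=1e-3):
    # e_i from (34) and E_i from (36) for the i-th birotation model.
    # p1_bar, p2_bar : (N, 3) depth-normalized correspondences.
    # lam            : (N,) 0/1 reliability weights (diagonal of Lambda).
    j, k = {1: (1, 2), 2: (0, 2), 3: (0, 1)}[i]     # zero-based row indices (j, k)
    q1 = p1_bar @ R1.T                              # rows r_{1,*}^T p_bar_1
    q2 = p2_bar @ R2.T
    e = np.arctan2(q1[:, j], q1[:, k]) - np.arctan2(q2[:, j], q2[:, k])
    return e, e @ (lam * e) + alpha * (theta1 @ theta1 + theta2 @ theta2)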

The i-th birotation solution can be yielded by minimizing (36) using our newly designed iterative optimization approach. Specifically, based on (31), the three birotation solutions are initialized using the prior rotation \boldsymbol{R}_{init} and translation \boldsymbol{t}_{init} as follows:

\begin{cases}\boldsymbol{r}_{2,1,1}=-\frac{\boldsymbol{t}_{init}}{\|\boldsymbol{t}_{init}\|},\\ \boldsymbol{r}_{2,2,1}=\boldsymbol{l}_{3}\times\boldsymbol{r}_{2,1,1},\\ \boldsymbol{r}_{2,3,1}=\boldsymbol{r}_{2,1,1}\times\boldsymbol{r}_{2,2,1},\end{cases} \qquad (37)

\begin{cases}\boldsymbol{r}_{2,2,2}=-\frac{\boldsymbol{t}_{init}}{\|\boldsymbol{t}_{init}\|},\\ \boldsymbol{r}_{2,3,2}=\boldsymbol{l}_{1}\times\boldsymbol{r}_{2,2,2},\\ \boldsymbol{r}_{2,1,2}=\boldsymbol{r}_{2,2,2}\times\boldsymbol{r}_{2,3,2},\end{cases} \qquad (38)

\begin{cases}\boldsymbol{r}_{2,3,3}=-\frac{\boldsymbol{t}_{init}}{\|\boldsymbol{t}_{init}\|},\\ \boldsymbol{r}_{2,1,3}=\boldsymbol{l}_{2}\times\boldsymbol{r}_{2,3,3},\\ \boldsymbol{r}_{2,2,3}=\boldsymbol{r}_{2,3,3}\times\boldsymbol{r}_{2,1,3},\end{cases} \qquad (39)

\boldsymbol{R}_{2,i}=(\boldsymbol{r}_{2,1,i},\boldsymbol{r}_{2,2,i},\boldsymbol{r}_{2,3,i})^{\top}, \qquad (40)

\boldsymbol{R}_{1,i}=\boldsymbol{R}_{2,i}\boldsymbol{R}_{init},\ \ i\in\{1,2,3\}. \qquad (41)
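A minimal sketch of this initialization is given below (our own code; the intermediate cross products are additionally normalized so that each \boldsymbol{R}_{2,i} is orthonormal, and \boldsymbol{t}_{init} is assumed not to be parallel to the axis it is crossed with):

import numpy as np

def initialize_birotations(R_init, t_init):
    # Initialize (R_{1,i}, R_{2,i}) for i = 1, 2, 3, following (37)-(41).
    l = np.eye(3)                                    # l_1, l_2, l_3
    a = -t_init / np.linalg.norm(t_init)             # row aligned with -t_init
    solutions = []
    for i, axis in zip((1, 2, 3), (2, 0, 1)):        # cross with l_3, l_1, l_2
        b = np.cross(l[axis], a)
        b /= np.linalg.norm(b)                       # normalization (our addition)
        c = np.cross(a, b)
        rows = {1: (a, b, c), 2: (c, a, b), 3: (b, c, a)}[i]
        R2 = np.stack(rows)                          # rows r_{2,1,i}, r_{2,2,i}, r_{2,3,i}
        R1 = R2 @ R_init                             # (41)
        solutions.append((R1, R2))
    return solutions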

An overview of a single iteration of the optimization process is illustrated in Fig. 3. We first derive a first-order approximation of \boldsymbol{e}_{i}, which is then applied to the energy function E_{i} as follows:

E_{i}\approx(\boldsymbol{e}_{i}+\boldsymbol{J}^{\top}_{i}\Delta\boldsymbol{\Theta}_{i})^{\top}\boldsymbol{\Lambda}(\boldsymbol{e}_{i}+\boldsymbol{J}^{\top}_{i}\Delta\boldsymbol{\Theta}_{i})+\alpha R, \qquad (42)

where \Delta\boldsymbol{\Theta}_{i}=(\Delta\boldsymbol{\theta}^{\top}_{1,i},\Delta\boldsymbol{\theta}^{\top}_{2,i})^{\top} denotes the increment vector at each iteration, and

\boldsymbol{J}_{i}=\begin{pmatrix}\frac{\partial e_{i,1}}{\partial\boldsymbol{\theta}_{1,i}},\frac{\partial e_{i,2}}{\partial\boldsymbol{\theta}_{1,i}},\cdots,\frac{\partial e_{i,N}}{\partial\boldsymbol{\theta}_{1,i}}\\ \frac{\partial e_{i,1}}{\partial\boldsymbol{\theta}_{2,i}},\frac{\partial e_{i,2}}{\partial\boldsymbol{\theta}_{2,i}},\cdots,\frac{\partial e_{i,N}}{\partial\boldsymbol{\theta}_{2,i}}\end{pmatrix}^{\top}, \qquad (43)

is a matrix storing the gradients of \boldsymbol{e}_{i} with respect to \boldsymbol{\theta}_{1,i} and \boldsymbol{\theta}_{2,i}, computed using the left perturbation model [18, 19], which avoids the need to compute the Baker-Campbell-Hausdorff Jacobian matrix at each iteration. Under the definitions of \boldsymbol{p}^{C^{\prime}}_{1,n,i}=\boldsymbol{R}_{1,i}\boldsymbol{p}^{C}_{1,n}=(x^{\prime}_{1,n,i},y^{\prime}_{1,n,i},z^{\prime}_{1,n,i})^{\top} and \boldsymbol{p}^{C^{\prime}}_{2,n,i}=\boldsymbol{R}_{2,i}\boldsymbol{p}^{C}_{2,n}=(x^{\prime}_{2,n,i},y^{\prime}_{2,n,i},z^{\prime}_{2,n,i})^{\top}, we can derive the following expressions:

\begin{cases}\frac{\partial e_{1,n}}{\partial\boldsymbol{\theta}_{1,1}}=\begin{pmatrix}-1,\frac{x^{\prime}_{1,n,1}y^{\prime}_{1,n,1}}{{y^{\prime}_{1,n,1}}^{2}+{z^{\prime}_{1,n,1}}^{2}},\frac{x^{\prime}_{1,n,1}z^{\prime}_{1,n,1}}{{y^{\prime}_{1,n,1}}^{2}+{z^{\prime}_{1,n,1}}^{2}}\end{pmatrix},\\ \frac{\partial e_{1,n}}{\partial\boldsymbol{\theta}_{2,1}}=\begin{pmatrix}1,-\frac{x^{\prime}_{2,n,1}y^{\prime}_{2,n,1}}{{y^{\prime}_{2,n,1}}^{2}+{z^{\prime}_{2,n,1}}^{2}},-\frac{x^{\prime}_{2,n,1}z^{\prime}_{2,n,1}}{{y^{\prime}_{2,n,1}}^{2}+{z^{\prime}_{2,n,1}}^{2}}\end{pmatrix},\end{cases} \qquad (44)

\begin{cases}\frac{\partial e_{2,n}}{\partial\boldsymbol{\theta}_{1,2}}=\begin{pmatrix}-\frac{x^{\prime}_{1,n,2}y^{\prime}_{1,n,2}}{{x^{\prime}_{1,n,2}}^{2}+{z^{\prime}_{1,n,2}}^{2}},1,-\frac{y^{\prime}_{1,n,2}z^{\prime}_{1,n,2}}{{x^{\prime}_{1,n,2}}^{2}+{z^{\prime}_{1,n,2}}^{2}}\end{pmatrix},\\ \frac{\partial e_{2,n}}{\partial\boldsymbol{\theta}_{2,2}}=\begin{pmatrix}\frac{x^{\prime}_{2,n,2}y^{\prime}_{2,n,2}}{{x^{\prime}_{2,n,2}}^{2}+{z^{\prime}_{2,n,2}}^{2}},-1,\frac{y^{\prime}_{2,n,2}z^{\prime}_{2,n,2}}{{x^{\prime}_{2,n,2}}^{2}+{z^{\prime}_{2,n,2}}^{2}}\end{pmatrix},\end{cases} \qquad (45)

\begin{cases}\frac{\partial e_{3,n}}{\partial\boldsymbol{\theta}_{1,3}}=\begin{pmatrix}\frac{x^{\prime}_{1,n,3}z^{\prime}_{1,n,3}}{{x^{\prime}_{1,n,3}}^{2}+{y^{\prime}_{1,n,3}}^{2}},\frac{y^{\prime}_{1,n,3}z^{\prime}_{1,n,3}}{{x^{\prime}_{1,n,3}}^{2}+{y^{\prime}_{1,n,3}}^{2}},-1\end{pmatrix},\\ \frac{\partial e_{3,n}}{\partial\boldsymbol{\theta}_{2,3}}=\begin{pmatrix}-\frac{x^{\prime}_{2,n,3}z^{\prime}_{2,n,3}}{{x^{\prime}_{2,n,3}}^{2}+{y^{\prime}_{2,n,3}}^{2}},-\frac{y^{\prime}_{2,n,3}z^{\prime}_{2,n,3}}{{x^{\prime}_{2,n,3}}^{2}+{y^{\prime}_{2,n,3}}^{2}},1\end{pmatrix}.\end{cases} \qquad (46)

Setting the derivative of (42) with respect to \Delta\boldsymbol{\Theta}_{i} to zero yields the following expression:

\Delta\boldsymbol{\Theta}_{i}=-({\boldsymbol{J}_{i}}\boldsymbol{\Lambda}{\boldsymbol{J}_{i}}^{\top}+\alpha\boldsymbol{I})^{-1}{\boldsymbol{J}_{i}}\boldsymbol{\Lambda}\boldsymbol{e}_{i}. \qquad (47)

Subsequently, \boldsymbol{R}_{1,i} and \boldsymbol{R}_{2,i} are updated using the left perturbation model, as follows:

\begin{cases}\boldsymbol{R}^{\prime}_{1,i}=\exp([\Delta\boldsymbol{\theta}_{1,i}]_{\times})\boldsymbol{R}_{1,i},\\ \boldsymbol{R}^{\prime}_{2,i}=\exp([\Delta\boldsymbol{\theta}_{2,i}]_{\times})\boldsymbol{R}_{2,i},\end{cases} \qquad (48)

where \boldsymbol{R}^{\prime}_{1,i} and \boldsymbol{R}^{\prime}_{2,i} denote the updated rotation matrices after one iteration. The iterative process terminates when either the value or the change rate of \frac{\hat{d}_{i}}{N} falls below predefined thresholds. After the iteration terminates, the final solution is selected from the three birotation solutions based on the following criteria:

\boldsymbol{R}_{1},\boldsymbol{R}_{2}=\boldsymbol{R}_{1,\hat{i}},\boldsymbol{R}_{2,\hat{i}},\quad\hat{i}=\arg\min_{i}\beta_{i}\hat{d}_{i}, \qquad (49)

where the weighting coefficients (\beta_{1},\beta_{2},\beta_{3}) can be empirically adjusted to prioritize one of the three birotation solutions, which can greatly enhance the reliability of the final solution. The relative rotation \boldsymbol{R} and the line along which the translation vector \boldsymbol{t} lies can then be derived from \boldsymbol{R}_{1}, \boldsymbol{R}_{2}, and their corresponding index \hat{i} using (31).
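For concreteness, one increment-and-update step per (47) and (48) is sketched below (our own variable names; \boldsymbol{J}_{i} is passed with shape 6\times N, consistent with (43)):

import numpy as np

def so3_exp(w):
    # Exponential map from a rotation vector (so(3)) to a rotation matrix (SO(3)).
    th = np.linalg.norm(w)
    if th < 1e-12:
        return np.eye(3)
    k = w / th
    K = np.array([[0.0, -k[2], k[1]], [k[2], 0.0, -k[0]], [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(th) * K + (1.0 - np.cos(th)) * (K @ K)

def birotation_update(J, e, lam, R1, R2, alpha=1e-3):
    # One damped increment (47) followed by the left-perturbation update (48).
    # J : (6, N) Jacobian, e : (N,) residuals, lam : (N,) diagonal of Lambda.
    JL = J * lam                                              # J @ diag(lam)
    dTheta = -np.linalg.solve(JL @ J.T + alpha * np.eye(6), JL @ e)
    return so3_exp(dTheta[:3]) @ R1, so3_exp(dTheta[3:]) @ R2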

In addition, although the absolute value of the scaling factor s_{\hat{i}} is unknown, its sign can be determined from the transformed coordinates \boldsymbol{p}^{C^{\prime}}_{1,n,\hat{i}} and \boldsymbol{p}^{C^{\prime}}_{2,n,\hat{i}} using the following expression when z^{\prime}_{1,n,\hat{i}} and z^{\prime}_{2,n,\hat{i}} are positive:

\begin{cases}s_{\hat{i}}>0,\quad\text{if}\quad\hat{i}=1\quad\text{and}\quad\frac{x^{\prime}_{1,n,\hat{i}}}{z^{\prime}_{1,n,\hat{i}}}>\frac{x^{\prime}_{2,n,\hat{i}}}{z^{\prime}_{2,n,\hat{i}}},\\ s_{\hat{i}}>0,\quad\text{if}\quad\hat{i}=2\quad\text{and}\quad\frac{y^{\prime}_{1,n,\hat{i}}}{z^{\prime}_{1,n,\hat{i}}}>\frac{y^{\prime}_{2,n,\hat{i}}}{z^{\prime}_{2,n,\hat{i}}},\\ s_{\hat{i}}>0,\quad\text{if}\quad\hat{i}=3\quad\text{and}\quad\left|\frac{x^{\prime}_{1,n,\hat{i}}}{z^{\prime}_{1,n,\hat{i}}}\right|<\left|\frac{x^{\prime}_{2,n,\hat{i}}}{z^{\prime}_{2,n,\hat{i}}}\right|,\\ s_{\hat{i}}<0,\quad\text{otherwise}.\end{cases} \qquad (50)

When the depth of either \boldsymbol{p}^{C^{\prime}}_{1,n,\hat{i}} or \boldsymbol{p}^{C^{\prime}}_{2,n,\hat{i}} is negative, the direction of the translation is determined by computing the depths of both \boldsymbol{p}^{C}_{1,n} and \boldsymbol{p}^{C}_{2,n}. The translation yielding two positive depths is selected as the valid solution. With convergence-related parameters fixed, the computational complexity of our relative pose estimation algorithm remains \mathcal{O}(N).
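A compact sketch of the sign test in (50), assuming positive depths for the chosen correspondence (our own code):

def scale_sign(i_hat, q1, q2):
    # q1, q2 : (3,) transformed coordinates p'_1 and p'_2 of one correspondence.
    if i_hat == 1:
        positive = q1[0] / q1[2] > q2[0] / q2[2]
    elif i_hat == 2:
        positive = q1[1] / q1[2] > q2[1] / q2[2]
    else:
        positive = abs(q1[0] / q1[2]) < abs(q2[0] / q2[2])
    return 1.0 if positive else -1.0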

IV Experiments

Sect. IV-A provides implementation details on the proposed algorithm and the methods used for comparison. Sect. IV-B introduces the four public datasets utilized for performance comparison, along with the datasets newly created for ablation studies and robustness analyses. The evaluation metrics are detailed in Sect. IV-C. Sect. IV-D presents extensive quantitative and qualitative comparisons between the proposed method and previous SoTA approaches. In Sect. IV-E, ablation studies are conducted to evaluate the contributions of the energy functions and the effectiveness of the constraint adjustments. Finally, Sect. IV-F analyzes the robustness of our approach under different initialization strategies, correspondence noise levels, and matching error rates.

IV-A Implementation Details

IV-A1 Correspondence Matching

For both comparisons with SoTA methods and ablation studies, we adopt LoFTR [28], trained on the MegaDepth dataset [29], to obtain dense correspondences. This is primarily because d_{i} in (28) can be closely approximated by \hat{d}_{i} in (35) when correspondences are dense. Additionally, we employ SIFT [30] in conjunction with LightGlue [31], also trained on the MegaDepth dataset, to further compare the robustness of our algorithm with that of other relative pose estimation algorithms under sparse correspondence conditions.

TABLE I: Quantitative comparisons of relative pose algorithms on the ScanNet [32] dataset.
Algorithm Dense Correspondence Sparse Correspondence
@1^{\circ} @3^{\circ} @5^{\circ} @10^{\circ} @1^{\circ} @3^{\circ} @5^{\circ} @10^{\circ}
Nistér [8] 0.56 6.84 15.24 31.49 0.74 7.82 16.54 33.44
Helmke et al. [14] 0.05 0.89 2.76 7.42 0.08 2.21 5.51 14.11
Zhao [12] 0.00 0.04 0.23 1.49 0.00 0.07 0.40 2.12
Ling et al. [18] 0.83 8.39 16.55 32.15 0.58 8.11 17.18 34.56
Zhao et al. [19] 0.63 6.34 12.83 26.16 0.74 6.62 13.93 29.48
Ours 1.08 9.56 18.65 33.08 1.05 8.53 17.75 33.94
TABLE II: Quantitative comparisons of relative pose algorithms on the YFCC100M [33] dataset.
Algorithm Dense Correspondence Sparse Correspondence
@1^{\circ} @3^{\circ} @5^{\circ} @10^{\circ} @1^{\circ} @3^{\circ} @5^{\circ} @10^{\circ}
Nistér [8] 6.57 26.84 41.65 62.01 6.95 26.74 41.18 60.74
Helmke et al. [14] 0.84 6.03 11.58 22.80 1.50 10.17 18.68 33.75
Zhao [12] 0.04 0.76 2.44 9.04 0.07 0.96 2.86 9.78
Ling et al. [18] 9.50 31.81 46.59 65.51 10.00 31.13 45.13 63.53
Zhao et al. [19] 5.04 21.25 33.36 52.27 5.61 21.85 33.71 52.19
Ours 10.15 32.02 45.89 63.45 9.49 29.19 41.67 58.22
TABLE III: Quantitative comparisons of relative pose algorithms on the pure rotation dataset.
Rotation Angle Algorithm Dense Correspondence Sparse Correspondence
@1^{\circ} @3^{\circ} @5^{\circ} @10^{\circ} @1^{\circ} @3^{\circ} @5^{\circ} @10^{\circ}
\leq 5^{\circ} Nistér [8] 47.90 52.06 52.89 53.52 64.87 70.69 71.86 72.73
Helmke et al. [14] 47.92 52.59 53.57 54.44 67.24 73.91 75.47 76.77
Zhao [12] 89.00 96.33 97.80 98.90 90.13 96.71 98.03 99.01
Ling et al. [18] 88.75 96.25 97.75 98.87 88.82 96.27 97.76 98.88
Zhao et al. [19] 68.36 74.09 75.24 76.10 76.11 82.19 83.41 84.32
Hartley [34] 91.45 97.15 98.29 99.14 92.83 97.61 98.57 99.28
Ours 91.77 97.26 98.35 99.18 90.54 96.85 98.11 99.05
\leq 10^{\circ} Nistér [8] 53.47 58.09 59.02 59.71 66.30 72.51 73.75 74.68
Helmke et al. [14] 52.02 56.99 58.00 58.75 68.76 75.76 77.37 78.69
Zhao [12] 89.55 96.52 97.91 98.96 89.69 96.42 97.76 98.77
Ling et al. [18] 88.87 96.29 97.77 98.89 88.94 96.31 97.79 98.89
Zhao et al. [19] 69.56 75.69 76.92 77.83 77.34 83.81 85.10 86.06
Hartley [34] 90.80 96.63 97.80 98.68 91.64 96.91 97.97 98.76
Ours 90.67 96.59 97.78 98.66 90.63 96.88 98.13 99.06
\leq 15^{\circ} Nistér [8] 54.00 58.72 59.66 60.37 68.67 75.69 77.09 78.14
Helmke et al. [14] 51.75 57.20 58.39 59.43 72.44 80.18 81.83 83.08
Zhao [12] 87.54 94.50 95.90 96.94 82.82 89.20 90.48 91.44
Ling et al. [18] 88.83 96.28 97.77 98.88 88.90 96.30 97.78 98.89
Zhao et al. [19] 69.97 76.72 78.07 79.08 80.28 87.01 88.36 89.37
Hartley [34] 89.72 95.53 96.69 97.56 90.24 96.15 97.33 98.22
Ours 90.03 96.23 97.47 98.40 89.75 96.14 97.41 98.37

IV-A2 Compared SoTA Algorithms

We first compare our algorithm with the classical five-point algorithm used in conjunction with RANSAC [8] (implemented in OpenCV with default parameters), which is an explicit solver whose output is then used as the initialization for subsequent local optimization methods. These local optimization methods include the manifold-based method [14], the method that directly estimates rotation and translation [18], and the birotation-based method [19]. Moreover, we also use the published code of the QCQP-based global optimization method [12] to conduct comparative experiments. In addition, the homography estimation method [34], as implemented in OpenCV (all parameters are set by default), is also used for further comparison under the pure rotation condition. In the monocular visual odometry task, we also compare our method with two SoTA data-driven methods, SC-Depth [25] and SCIPaD [26], both trained on sequences 00–08 of the KITTI Odometry dataset [35].

IV-A3 Parameter Settings in Our Algorithm

The regularization parameter \alpha defined in (36) is set to 1\times 10^{-3}. The coefficients (\beta_{1},\beta_{2},\beta_{3}) in (49) are empirically set to (0.25,1,1) in the stereo camera extrinsic calibration experiment, (1,1,0.25) in the monocular visual odometry experiment, and the default values (1,1,1) in all other experiments. This is because the ground-truth relative poses in the stereo camera extrinsic calibration and monocular visual odometry tasks typically align more closely with the first and third basis transformations defined in (7), respectively. Our method is also initialized using the output of the five-point algorithm with RANSAC [8]. In addition, the upper quartile method [36] is employed in each iteration to compute \boldsymbol{\Lambda} in (36) and remove outliers. The iteration terminates when either the value of \frac{\hat{d}_{i}}{N} falls below 10^{-8} or its rate of change falls below 10^{-6}.
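As a hedged illustration of the outlier-removal step (the exact rule of the upper quartile method [36] may differ from what is shown; the Tukey upper fence is assumed here), the diagonal of \boldsymbol{\Lambda} can be recomputed from the current residual magnitudes as follows:

import numpy as np

def reliability_weights(e):
    # 0/1 diagonal entries of Lambda from residual magnitudes |e_n|.
    r = np.abs(e)
    q1, q3 = np.percentile(r, [25, 75])
    return (r <= q3 + 1.5 * (q3 - q1)).astype(float)   # Tukey upper fence (assumed)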

TABLE IV: Quantitative comparisons of relative pose estimation algorithms for stereo camera extrinsic calibration. Given that the rotation error is generally small, its unit is reported in 10^{-3} radians in the table.
Scenes Correspondence Algorithm \Delta\bar{\theta}_{1} \Delta\bar{\theta}_{2} \Delta\bar{\theta}_{3} \Delta\bar{t}_{1} \Delta\bar{t}_{2} \Delta\bar{t}_{3} @1 @3 @5 @10
Indoor Dense Nistér [8] 1.85 6.07 1.77 6.02 6.52 11.23 5.47 26.30 42.40 64.50
Helmke et al. [14] 8.61 8.77 7.01 17.33 12.30 29.34 0.37 6.35 13.99 34.02
Zhao [12] 2.67 13.74 3.13 61.06 15.74 75.79 0.00 4.99 16.18 29.89
Ling et al. [18] 0.37 2.51 0.49 5.77 2.44 8.85 30.21 60.10 72.95 84.06
Zhao et al. [19] 41.53 4.11 6.28 6.69 2.22 3.79 32.08 66.01 77.34 86.74
Ours 0.27 2.14 0.43 0.70 1.52 2.85 39.10 70.74 81.41 90.25
Sparse Nistér [8] 2.01 7.52 2.24 11.98 6.29 13.60 6.44 25.48 40.79 62.34
Helmke et al. [14] 2.65 7.93 3.82 15.89 13.08 24.95 0.63 7.63 16.28 34.78
Zhao [12] 2.74 13.34 3.42 51.61 15.65 67.92 0.00 5.28 17.18 31.65
Ling et al. [18] 0.40 2.43 0.43 7.73 2.54 10.97 29.00 57.31 69.27 81.31
Zhao et al. [19] 41.09 6.59 4.97 13.47 2.79 5.20 35.55 68.01 78.28 86.85
Ours 0.25 2.17 0.38 2.85 1.66 4.19 36.67 70.41 80.60 89.47
Outdoor Dense Nistér [8] 1.31 6.40 1.03 2.36 3.35 15.82 2.86 16.19 28.36 52.39
Helmke et al. [14] 2.24 6.18 3.19 6.51 7.50 17.51 0.77 8.28 19.38 44.64
Zhao [12] 1.10 1.24 0.66 4.48 6.00 11.83 0.00 11.72 40.89 67.99
Ling et al. [18] 0.40 1.40 0.29 0.40 1.55 7.46 13.55 39.69 56.50 76.31
Zhao et al. [19] 0.26 0.87 0.22 1.32 0.87 1.67 44.71 80.12 88.07 94.04
Ours 0.23 0.96 0.20 0.12 0.64 1.30 52.15 83.69 90.22 95.11
Sparse Nistér [8] 1.22 5.91 1.15 2.37 3.43 14.89 4.71 19.61 32.86 55.13
Helmke et al. [14] 2.54 6.83 3.76 8.55 6.79 17.42 1.84 10.31 22.49 46.63
Zhao [12] 0.43 1.14 0.31 1.36 6.01 9.34 0.19 11.86 40.22 68.71
Ling et al. [18] 0.39 1.28 0.34 0.46 1.62 7.58 16.98 42.38 57.84 76.46
Zhao et al. [19] 0.25 0.83 0.24 1.31 0.83 1.41 52.36 83.12 89.87 94.93
Ours 0.25 0.77 0.23 0.10 0.55 0.91 64.36 88.06 92.83 96.42

IV-B Datasets

IV-B1 Public Datasets

  • The ScanNet dataset [32] is a large-scale dataset, consisting of 2.5 million indoor RGB-D image pairs with ground-truth camera poses. Following the study [37], we select 1,500 image pairs to evaluate the performance of relative pose estimation algorithms in indoor scenarios.

  • The YFCC100M dataset [33] is a large-scale public multimedia collection containing approximately 99.2 million images with ground-truth camera poses. Following the study in [37], we select 4,000 image pairs to evaluate the performance of relative pose estimation algorithms in outdoor scenarios.

  • The KITTI Odometry dataset [35] consists of well-rectified stereo video sequences and ground-truth trajectories collected from 22 real-world driving scenarios. Existing data-driven approaches, including the two networks we compare [25, 26], are typically trained on sequences 0 to 8 and evaluated on sequences 9 and 10. In our experiments, we evaluate all explicitly programming-based algorithms on sequences 0 to 10. Although this may introduce a slight bias when compared to data-driven approaches, the experimental results presented in Sect. IV-D3 demonstrate the superiority of our method over them. Additionally, we select 300 consecutive video frames from KITTI Odometry sequence 0, where the relative pose closely aligns with the third basis transformation, to conduct an ablation study. The results demonstrate the advantage of using three energy functions for relative pose estimation over a single energy function. The images used in the ablation study are publicly available for result reproduction.

  • The KITTI Raw dataset [35] consists of raw video sequences and ground-truth trajectories collected from 56 real-world driving scenarios. In our experiments, we select around 500 images as reference frames and generate three subsets of target frames by applying random rotations with angles of up to 5^{\circ}, 10^{\circ}, and 15^{\circ}, respectively. These subsets are used to evaluate the performance of relative pose estimation algorithms when only rotation is involved. The images used for the pure rotation experiment are publicly available for result reproduction.

Figure 4: Qualitative comparisons of relative pose algorithms on the stereo camera extrinsic calibration task: (a) Left images rectified based on the ground-truth extrinsic parameters; (b-f) Disparity maps generated from stereo images rectified based on the extrinsic parameters estimated using [8], [18], [12], [19], and our algorithm, respectively.
Figure 5: Quantitative comparisons on the KITTI Odometry dataset, where "Seq i" refers to the i-th sequence. The figures present the [\min,\max] ranges along with our results, which are highlighted in red when optimal. A smaller area enclosed by lines indicates better performance of the corresponding algorithm. SC-Depth, SCIPaD, Nistér [8], Helmke et al. [14], Zhao et al. [19], Zhao [12], Ling et al. [18], and ours.
Figure 6: Comparisons of estimated trajectories on the KITTI Odometry dataset: (a) Estimated trajectories when correspondences are dense; (b) Estimated trajectories when correspondences are sparse. Ground truth, SCIPaD, Nistér [8], Zhao [12], Ling et al. [18], and ours.

IV-B2 New Real-World Datasets

As discussed above, ablation studies are required to compare the performance of our algorithm when using three energy functions versus a single energy function. Therefore, in addition to the 300 images selected from the KITTI Odometry dataset, we also create two additional datasets (each containing 300 image pairs), in which the cameras are positioned on the left and right, and top and bottom, respectively. Given that a relatively ideal experimental setup may not comprehensively reflect the performance of our algorithm, we introduce slight perturbations to the stereo rig in practice. Moreover, we use the same stereo rig to collect an additional 600 image pairs (300 indoor and 300 outdoor) for the stereo camera extrinsic online calibration experiment.

IV-B3 New Synthetic Datasets

We also create two synthetic datasets to further evaluate the robustness of our algorithm. We first generate a synthetic 3-D point cloud by uniformly sampling points within a cubic volume in the world coordinate system. All cameras have the same intrinsic parameters. The reference camera coordinate system is aligned with the world coordinate system, whereas the target camera coordinate systems are randomly generated. The synthetic 3-D points are then projected onto each camera’s image plane to generate corresponding 2-D pixel coordinates. The details of the two synthetic datasets are as follows:

  • The first synthetic dataset is created by adding Gaussian noise to the 2-D correspondences, with the standard deviation $\sigma$ ranging from 0 to 2 pixels in increments of 0.02 pixels. For each noise level, we generate 100 pairs of reference and target views, each observing 200 3-D points.

  • The second synthetic dataset is created under varying correspondence matching error rates, ranging from 0 to 0.3 in increments of 0.01. Correct correspondences are perturbed with slight Gaussian noise ($\sigma=0.1$ pixels), while erroneous correspondences are perturbed with stronger Gaussian noise ($\sigma=10$ pixels) to simulate outliers. For each error rate, we generate 100 pairs of reference and target views, each observing 200 3-D points.

All the synthetic datasets are publicly available for result reproduction.
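As a concrete illustration of the generation pipeline described above, the following sketch samples 3-D points in a cube, projects them into a reference camera aligned with the world frame and a randomly posed target camera, and perturbs the resulting pixel coordinates with Gaussian noise. The intrinsic matrix, cube extent, and pose ranges are assumptions made purely for illustration; they are not the exact parameters of our datasets.

```python
import numpy as np
from scipy.spatial.transform import Rotation

rng = np.random.default_rng(0)
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])                       # shared intrinsic matrix (assumed values)

def project(points_w, R, t, noise_sigma):
    """Project 3-D world points into a camera with pose (R, t), then add Gaussian pixel noise."""
    points_c = points_w @ R.T + t                     # world -> camera coordinates
    uv = points_c @ K.T
    uv = uv[:, :2] / uv[:, 2:3]                       # perspective division
    return uv + rng.normal(scale=noise_sigma, size=uv.shape)

# 200 3-D points uniformly sampled inside a cube in front of both cameras (assumed extent).
points_w = rng.uniform([-5.0, -5.0, 10.0], [5.0, 5.0, 20.0], size=(200, 3))

# Reference camera aligned with the world frame; target camera pose randomly generated
# (a small rotation is used here so that the points remain in view).
R_tgt = Rotation.from_rotvec(rng.uniform(-0.2, 0.2, size=3)).as_matrix()
t_tgt = rng.uniform(-1.0, 1.0, size=3)
uv_ref = project(points_w, np.eye(3), np.zeros(3), noise_sigma=0.1)
uv_tgt = project(points_w, R_tgt, t_tgt, noise_sigma=0.1)
```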

IV-C Evaluation Metrics

To quantify the algorithm’s performance with respect to individual motion components, we follow the studies [18] and [19] to compute the average absolute rotation error $\Delta\bar{\boldsymbol{\theta}}=\frac{1}{M}\sum^{M}_{m=1}\lvert\hat{\boldsymbol{\theta}}_{m}-\boldsymbol{\theta}^{\ast}\rvert=(\Delta\bar{\theta}_{1},\Delta\bar{\theta}_{2},\Delta\bar{\theta}_{3})$ in radians and the average absolute translation error $\Delta\bar{\boldsymbol{t}}=\frac{1}{M}\sum^{M}_{m=1}\lvert\hat{\boldsymbol{t}}_{m}-\boldsymbol{t}^{\ast}\rvert=(\Delta\bar{t}_{1},\Delta\bar{t}_{2},\Delta\bar{t}_{3})$ in millimeters, respectively, where $M$ denotes the total number of frames used for evaluation, $\hat{\boldsymbol{\theta}}_{m}$ denotes the $m$-th rotation vector derived from the estimated rotation matrix $\hat{\boldsymbol{R}}_{m}$, $\boldsymbol{\theta}^{\ast}$ represents the ground-truth rotation vector, and $\hat{\boldsymbol{t}}_{m}$ and $\boldsymbol{t}^{\ast}$ denote the $m$-th estimated and ground-truth translation vectors, respectively. To further quantify the overall performance of the algorithm in rotation and translation estimation, we follow the study [12] to compute the rotation error $\varepsilon_{\boldsymbol{r}}=\arccos\frac{1}{2}\left(\mathrm{tr}(\boldsymbol{R}^{*\top}\hat{\boldsymbol{R}})-1\right)$ and the translation error $\varepsilon_{\boldsymbol{t}}=\arccos\left(\left|\frac{\boldsymbol{t}^{*}}{\|\boldsymbol{t}^{*}\|}\cdot\frac{\hat{\boldsymbol{t}}}{\|\hat{\boldsymbol{t}}\|}\right|\right)$, both measured in degrees. Moreover, we follow the study [37] to quantify the algorithm’s overall relative pose estimation performance using the area under the cumulative pose error curve (AUC) in percent, computed at thresholds of $1^{\circ}$, $3^{\circ}$, $5^{\circ}$, and $10^{\circ}$, respectively, where the pose error is defined as the maximum of $\varepsilon_{\boldsymbol{r}}$ and $\varepsilon_{\boldsymbol{t}}$. In this article, $@\psi$ denotes the AUC value at the threshold $\psi$, with higher values indicating better performance. It is important to note that when only rotational motion is involved, the AUC is calculated solely based on the rotation error. Additionally, when evaluating the algorithm’s performance on the monocular visual odometry task, we adopt the commonly used metrics presented in [35], including the average translational error $e_{\boldsymbol{t}}$ (%), the rotational error $e_{\boldsymbol{r}}$ ($^{\circ}$/100 m), and the absolute trajectory error (ATE) in meters [38]. To ensure fair comparisons, following the study [26], the scales of all estimated trajectories are aligned with the ground truth using 7-DoF optimization.
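A minimal sketch of the pose error and AUC computations is given below (NumPy, assuming degree-valued errors); the per-frame absolute rotation and translation errors, ATE, and the 7-DoF trajectory alignment are omitted for brevity, and the function names are illustrative only.

```python
import numpy as np

def rotation_error_deg(R_gt, R_est):
    """Angle of the residual rotation between ground-truth and estimated rotation matrices."""
    cos = np.clip((np.trace(R_gt.T @ R_est) - 1.0) / 2.0, -1.0, 1.0)
    return np.rad2deg(np.arccos(cos))

def translation_error_deg(t_gt, t_est):
    """Angle between ground-truth and estimated translation directions (sign-agnostic)."""
    cos = abs(np.dot(t_gt / np.linalg.norm(t_gt), t_est / np.linalg.norm(t_est)))
    return np.rad2deg(np.arccos(np.clip(cos, 0.0, 1.0)))

def pose_auc(errors_deg, threshold_deg):
    """Area under the cumulative pose error curve up to a threshold, normalized to [0, 1]."""
    errors = np.r_[0.0, np.sort(np.asarray(errors_deg, dtype=float))]
    recall = np.r_[0.0, np.arange(1, len(errors)) / (len(errors) - 1)]
    last = np.searchsorted(errors, threshold_deg)
    x = np.r_[errors[:last], threshold_deg]            # clip the curve at the threshold
    y = np.r_[recall[:last], recall[last - 1]]
    return np.trapz(y, x) / threshold_deg

# Pose error of each image pair: the maximum of its rotation and translation errors.
# auc_at_5deg = 100.0 * pose_auc(pose_errors, threshold_deg=5.0)
```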

IV-D Comparisons with SoTA Algorithms

This subsection first evaluates the performance of our algorithm under challenging conditions. Additional experiments on the pure rotation dataset are then conducted to validate the effectiveness of our method when only rotations are involved. Finally, experiments on stereo camera extrinsic online calibration and monocular visual odometry are conducted to demonstrate the practicality of our algorithm in specific tasks.

IV-D1 Performance under Challenging Conditions

Since this study aims to extend the existing birotation-based approach [19] to general relative pose problems, it is necessary to evaluate the performance of our method on two public datasets, ScanNet and YFCC100M, where the ground-truth relative poses can be significantly different from any of the three predefined basis transformations, making correspondence matching considerably more challenging. Following the study [37], we compute the AUC (%) at varying thresholds $\psi$ ($^{\circ}$) to quantify the performance of relative pose estimation algorithms on both datasets.

As shown in Table I, our algorithm achieves SoTA performance on the ScanNet dataset for both dense and sparse correspondences, which demonstrates the effectiveness of our method in indoor scenarios. Moreover, the outdoor relative pose estimation results presented in Table II indicate that our approach performs on par with the best-performing method [18] when correspondences are dense, and achieves the second-best performance when correspondences are sparse. Notably, while all methods use the output of the algorithm [8] as the initial relative pose, the previous birotation-based approach [19] performs significantly worse than the algorithm [8], whereas our method achieves substantial improvements. This underscores the limitation of the previous study [19], which was primarily designed for relatively ideal stereo vision models, and demonstrates the effectiveness of our approach in broadening the applicability of the birotation solution by introducing two additional basis transformations. In addition, although our method performs slightly worse than the algorithm presented in [18] in outdoor scenarios, this discrepancy may be attributed to the outlier removal strategy incorporated in their approach. Unlike the upper quartile method used in our algorithm, their strategy depends on an empirically tuned hyperparameter (extensive hyperparameter tuning experiments are conducted in this study to ensure fair comparisons), which limits its practicality across diverse datasets.

IV-D2 Performance under Pure Rotation Conditions

Pure rotation represents a corner case in relative pose estimation. Nonetheless, existing algorithms developed for general relative pose problems rarely take this corner case into consideration. Therefore, it is essential to evaluate the performance of existing methods under pure rotation conditions using the curated pure rotation dataset detailed in Sect. IV-B1. The results are shown in Table III.

As expected, the approaches [8, 14] that estimate the essential matrix perform poorly across all thresholds, which validates the limitations of these methods in this corner case, as discussed in Sect. I. While the approach [12] also estimates the essential matrix, it achieves significantly better performance, primarily because it incorporates statistical analyses to explicitly identify pure rotation cases. In addition, the algorithm [18], which directly estimates relative poses, demonstrates strong robustness in pure rotation scenarios. We attribute this to the normalization of the translation vector and the decoupled optimization of rotation and translation in their method.

Moreover, the previous birotation-based method [19] struggles to stably solve pure rotation problems, primarily due to its sensitivity to the initialization of relative poses induced by the penalty term. In contrast, our proposed algorithm demonstrates superior robustness across all evaluation thresholds, regardless of whether the correspondences are dense or sparse. In addition, our approach performs on par with the homography estimation approach [34], which is specifically designed for pure rotation scenarios, further validating the accuracy of our algorithm in this corner case.

IV-D3 Performance on Two Specific Tasks

Additionally, we conduct extensive experiments to demonstrate the applicability of our relative pose estimation algorithm to two specific computer vision tasks: stereo camera extrinsic online calibration and monocular visual odometry.

In addition to AUC, we follow the studies [18] and [19] to compute the average absolute rotation and translation errors in the stereo camera extrinsic online calibration experiment. As shown in Table IV, our algorithm consistently achieves SoTA performance in both indoor and outdoor scenarios. Notably, it reduces the average absolute errors of the three translation components by approximately 73%, 33%, and 22% in indoor scenarios, and by 74%, 30%, and 28% in outdoor scenarios. As expected, the translation vectors estimated by the algorithm [18] are occasionally unsatisfactory, primarily due to the significant non-convexity of its energy function shown in (5). Moreover, although the previous birotation-based method [19], specifically designed for stereo camera extrinsic online calibration, achieves relatively high AUC, it yields unexpectedly large average absolute rotation and translation errors in indoor scenarios due to outliers in the estimated relative poses. These outliers are primarily caused by the inherent non-uniqueness of the birotation solution, as discussed in Sect. V. In contrast, our method ensures a unique solution by incorporating a regularization term into the energy function defined in (36). Furthermore, in outdoor scenarios, our method outperforms the method [19], primarily because the adjusted constraints defined in (20)-(22) eliminate the influence of implicit weights in residual accumulation, as described in Sect. III-B. In addition, we provide qualitative comparisons of different relative pose estimation approaches by visualizing disparity maps estimated using pre-trained deep stereo networks [39, 40]. These disparity maps are generated from stereo images rectified using the extrinsic parameters estimated by different relative pose estimation algorithms. As illustrated in Fig. 4, our method produces disparity maps with fewer erroneous regions compared to other approaches, which further validates its greater accuracy in both outdoor and indoor scenarios.

Quantitative comparisons of relative pose estimation algorithms for monocular visual odometry are depicted in Fig. 5, where our method achieves SoTA performance across all three metrics. Specifically, when correspondences are dense, our method achieves improvements of around 3% in ATE, 7% in $e_{\boldsymbol{r}}$, and 6% in $e_{\boldsymbol{t}}$, while these gains increase to around 6%, 21%, and 17% when using sparse correspondences. Notably, our algorithm outperforms existing approaches on Seqs 0, 2, 5, 8, and 9, which were collected along complex motion trajectories. These results underscore the robustness of our method for monocular visual odometry, particularly in challenging scenarios. As expected, the previous birotation-based method [19] performs poorly on this task, as the ground-truth relative poses are mostly aligned with the third basis transformation, making the approach originally designed for relatively ideal stereo vision models infeasible. Moreover, the two SoTA data-driven approaches underperform several explicit programming-based methods, as current learning-based visual odometry frameworks lack sufficient geometric constraints during training to ensure accurate relative pose estimation. We also visualize the trajectories obtained by the five top-performing approaches to further compare their monocular visual odometry performance. As illustrated in Fig. 6, our approach yields trajectories that closely align with the ground truth in all sequences when using both dense and sparse correspondences, which demonstrates its high accuracy and robustness in the monocular visual odometry task.

IV-E Ablation Studies

TABLE V: Ablation study on energy function contributions.
Ground-Truth Relative Pose | $E_{1}$ | $E_{2}$ | $E_{3}$ | $@1^{\circ}$ | $@3^{\circ}$ | $@5^{\circ}$ | $@10^{\circ}$
Left-right | ✓ | | | 55.94 | 85.02 | 91.01 | 95.51
Left-right | | ✓ | | 54.35 | 83.37 | 89.62 | 94.54
Left-right | | | ✓ | 0.19 | 0.29 | 0.30 | 0.54
Left-right | ✓ | ✓ | ✓ | 55.94 | 85.02 | 91.01 | 95.51
Top-bottom | ✓ | | | 3.69 | 20.66 | 32.16 | 48.49
Top-bottom | | ✓ | | 8.84 | 44.67 | 63.69 | 81.36
Top-bottom | | | ✓ | 1.10 | 5.33 | 7.85 | 11.79
Top-bottom | ✓ | ✓ | ✓ | 8.93 | 44.54 | 63.61 | 80.94
Front-back | ✓ | | | 2.13 | 6.73 | 8.71 | 10.49
Front-back | | ✓ | | 1.69 | 3.89 | 4.67 | 5.69
Front-back | | | ✓ | 14.61 | 63.97 | 77.91 | 88.85
Front-back | ✓ | ✓ | ✓ | 14.61 | 63.97 | 77.91 | 88.85

In our proposed method, three energy functions are simultaneously optimized to obtain the optimal birotation solution. To validate the necessity of this strategy, we conduct an ablation study on the contribution of each energy function. As shown in Table V, optimizing a single energy function achieves performance comparable to optimizing all three energy functions when the ground-truth relative poses are closely aligned with the corresponding basis transformation. However, its performance degrades significantly in the other two cases, demonstrating the practical necessity of optimizing all three energy functions.

TABLE VI: Ablation study on constraint adjustments.
Dataset | Adjustments | $@1^{\circ}$ | $@3^{\circ}$ | $@5^{\circ}$ | $@10^{\circ}$
ScanNet [32] | | 0.97 | 9.17 | 17.28 | 31.03
ScanNet [32] | ✓ | 1.08 | 9.56 | 18.65 | 33.08
YFCC100M [33] | | 9.49 | 29.73 | 42.61 | 59.46
YFCC100M [33] | ✓ | 10.15 | 32.02 | 45.89 | 63.45
TABLE VII: Comparisons of different initialization strategies on the ScanNet and YFCC100M datasets.
Dataset | Algorithm | Dense Correspondence ($@1^{\circ}$ / $@3^{\circ}$ / $@5^{\circ}$ / $@10^{\circ}$) | Sparse Correspondence ($@1^{\circ}$ / $@3^{\circ}$ / $@5^{\circ}$ / $@10^{\circ}$)
ScanNet | Nistér [8] | 0.56 / 6.84 / 15.24 / 31.49 | 0.74 / 7.82 / 16.54 / 33.44
ScanNet | +Ours | 1.08 / 9.56 / 18.65 / 33.08 | 1.05 / 8.53 / 17.75 / 33.94
ScanNet | Hartley [7] | 0.50 / 7.50 / 16.28 / 32.71 | 0.55 / 6.92 / 15.38 / 32.14
ScanNet | +Ours | 0.98 / 9.22 / 18.12 / 32.74 | 1.09 / 8.65 / 17.69 / 33.57
ScanNet | Barath et al. [13] | 0.89 / 9.30 / 18.26 / 34.41 | 0.66 / 8.01 / 17.15 / 35.11
ScanNet | +Ours | 1.11 / 9.59 / 18.86 / 34.07 | 1.08 / 8.98 / 18.23 / 34.46
YFCC100M | Nistér [8] | 6.57 / 26.84 / 41.65 / 62.01 | 6.95 / 26.74 / 41.18 / 60.74
YFCC100M | +Ours | 10.15 / 32.02 / 45.89 / 63.45 | 9.49 / 29.19 / 41.67 / 58.22
YFCC100M | Hartley [7] | 8.60 / 30.16 / 44.83 / 64.05 | 8.04 / 28.24 / 42.31 / 61.54
YFCC100M | +Ours | 10.14 / 32.06 / 45.83 / 63.23 | 9.61 / 29.35 / 41.93 / 58.41
YFCC100M | Barath et al. [13] | 9.36 / 33.17 / 48.53 / 67.40 | 9.44 / 32.32 / 46.48 / 64.34
YFCC100M | +Ours | 10.60 / 33.22 / 47.40 / 64.88 | 9.89 / 30.27 / 42.90 / 59.39
Figure 7: Average rotation and translation errors with respect to the correspondence noise levels and matching error rates. Nistér [8], Ling et al. [18], Zhao et al. [19], and ours.

In addition, to demonstrate the superiority of the adjusted constraints shown in (20)-(22) for residual accumulation, we conduct another ablation study on the ScanNet and YFCC100M datasets, both of which exhibit high diversity in ground-truth relative poses. As shown in Table VI, the adjusted constraints enable our algorithm to increase the average AUC by approximately 8% on the ScanNet dataset and around 7% on the YFCC100M dataset across different thresholds.

IV-F Robustness Analyses

We first evaluate the performance of our algorithm with respect to different initialization strategies on the ScanNet and YFCC100M datasets. In addition to the default method [8], we also use two other explicit pose estimation algorithms: the direct linear transformation approach [7] and the five-point algorithm combined with marginalizing sample consensus (MAGSAC) [13]. As shown in Table VII, our method performs stably across different initial relative poses on both datasets. Moreover, it achieves significant improvements over the methods used for initialization on the ScanNet dataset. However, on the YFCC100M dataset, our method performs slightly worse than the algorithm [13], as MAGSAC is particularly effective at removing a large number of outliers.

We also compute the average rotation and translation errors of the four top-performing algorithms on the two synthetic datasets to evaluate the robustness of our method under varying correspondence noise levels and matching error rates, respectively. As illustrated in Fig. 7(a)–(b), when the standard deviation of Gaussian noise on correspondences is below 0.8 pixels, our algorithm outperforms existing approaches. Nonetheless, as the standard deviation exceeds 0.8 pixels, the average rotation and translation errors of our method increase more significantly than those of the other methods. Additionally, as shown in Fig. 7(c)–(d), our method consistently achieves the lowest rotation and translation errors among the compared approaches when the correspondence matching error rate is below 0.28. Nevertheless, as the proportion of incorrect correspondences increases further, the accuracy of our method gradually degrades. These results indicate that our method is highly effective in scenarios with relatively accurate correspondences, but its performance degrades under extremely low-quality correspondence matching.

V Discussion

As mentioned earlier in the stereo camera extrinsic online calibration experiment, recovering the relative pose from a birotation solution is non-unique, resulting in outliers in the estimated relative poses. This section provides detailed analyses of this issue.

We begin by revisiting the non-uniqueness in essential matrix decomposition. Four possible solutions exist because the epipolar constraint remains satisfied when one or both camera coordinate systems are rotated by $\pi$ radians around the translation vector. Notably, when the relative pose aligns with one of the basis transformations, the epipolar constraint (2) simplifies to the corresponding constraint in (11), (12), or (13). Therefore, similar to the essential matrix decomposition, recovering the relative pose from a birotation solution also yields four possible solutions. Specifically, the constraint in (11) can be derived from the following two transformations:

\boldsymbol{p}^{C}_{1}=\boldsymbol{p}^{C}_{2}+s_{1}\boldsymbol{l}_{1}, (51)
\boldsymbol{p}^{C}_{1}=\boldsymbol{R}_{x}(\pi)\boldsymbol{p}^{C}_{2}+s_{1}\boldsymbol{l}_{1}, (52)

where

\boldsymbol{R}_{x}(\pi)=\begin{pmatrix}1&0&0\\ 0&-1&0\\ 0&0&-1\end{pmatrix} (53)

denotes a $\pi$-radian rotation around the X-axis. The first birotation solution thus corresponds to four possible relative poses expressed as follows:

\boldsymbol{R}=\boldsymbol{R}^{\top}_{2}\boldsymbol{R}_{1},\ \ \boldsymbol{t}=-|s_{1}|\boldsymbol{R}^{\top}_{2}\boldsymbol{l}_{1}, (54)
\boldsymbol{R}=\boldsymbol{R}^{\top}_{2}\boldsymbol{R}_{1},\ \ \boldsymbol{t}=|s_{1}|\boldsymbol{R}^{\top}_{2}\boldsymbol{l}_{1}, (55)
\boldsymbol{R}=\boldsymbol{R}^{\top}_{2}\boldsymbol{R}_{x}(\pi)\boldsymbol{R}_{1},\ \ \boldsymbol{t}=-|s_{1}|\boldsymbol{R}^{\top}_{2}\boldsymbol{l}_{1}, (56)
\boldsymbol{R}=\boldsymbol{R}^{\top}_{2}\boldsymbol{R}_{x}(\pi)\boldsymbol{R}_{1},\ \ \boldsymbol{t}=|s_{1}|\boldsymbol{R}^{\top}_{2}\boldsymbol{l}_{1}. (57)

Similarly, the constraint in (12) can also be derived from the following two transformations:

\boldsymbol{p}^{C}_{1}=\boldsymbol{p}^{C}_{2}+s_{2}\boldsymbol{l}_{2}, (58)
\boldsymbol{p}^{C}_{1}=\boldsymbol{R}_{y}(\pi)\boldsymbol{p}^{C}_{2}+s_{2}\boldsymbol{l}_{2}, (59)

where

\boldsymbol{R}_{y}(\pi)=\begin{pmatrix}-1&0&0\\ 0&1&0\\ 0&0&-1\end{pmatrix} (60)

denotes a $\pi$-radian rotation around the Y-axis. The second birotation solution corresponds to four possible relative poses expressed as follows:

\boldsymbol{R}=\boldsymbol{R}^{\top}_{2}\boldsymbol{R}_{1},\ \ \boldsymbol{t}=-|s_{2}|\boldsymbol{R}^{\top}_{2}\boldsymbol{l}_{2}, (61)
\boldsymbol{R}=\boldsymbol{R}^{\top}_{2}\boldsymbol{R}_{1},\ \ \boldsymbol{t}=|s_{2}|\boldsymbol{R}^{\top}_{2}\boldsymbol{l}_{2}, (62)
\boldsymbol{R}=\boldsymbol{R}^{\top}_{2}\boldsymbol{R}_{y}(\pi)\boldsymbol{R}_{1},\ \ \boldsymbol{t}=-|s_{2}|\boldsymbol{R}^{\top}_{2}\boldsymbol{l}_{2}, (63)
\boldsymbol{R}=\boldsymbol{R}^{\top}_{2}\boldsymbol{R}_{y}(\pi)\boldsymbol{R}_{1},\ \ \boldsymbol{t}=|s_{2}|\boldsymbol{R}^{\top}_{2}\boldsymbol{l}_{2}. (64)

Additionally, the constraint represented in (13) can be derived from the following two transformations:

\boldsymbol{p}^{C}_{1}=\boldsymbol{p}^{C}_{2}+s_{3}\boldsymbol{l}_{3}, (65)
\boldsymbol{p}^{C}_{1}=\boldsymbol{R}_{z}(\pi)\boldsymbol{p}^{C}_{2}+s_{3}\boldsymbol{l}_{3}, (66)

where

\boldsymbol{R}_{z}(\pi)=\begin{pmatrix}-1&0&0\\ 0&-1&0\\ 0&0&1\end{pmatrix} (67)

denotes a $\pi$-radian rotation around the Z-axis. The third birotation solution thus corresponds to four possible relative poses expressed as follows:

\boldsymbol{R}=\boldsymbol{R}^{\top}_{2}\boldsymbol{R}_{1},\ \ \boldsymbol{t}=-|s_{3}|\boldsymbol{R}^{\top}_{2}\boldsymbol{l}_{3}, (68)
\boldsymbol{R}=\boldsymbol{R}^{\top}_{2}\boldsymbol{R}_{1},\ \ \boldsymbol{t}=|s_{3}|\boldsymbol{R}^{\top}_{2}\boldsymbol{l}_{3}, (69)
\boldsymbol{R}=\boldsymbol{R}^{\top}_{2}\boldsymbol{R}_{z}(\pi)\boldsymbol{R}_{1},\ \ \boldsymbol{t}=-|s_{3}|\boldsymbol{R}^{\top}_{2}\boldsymbol{l}_{3}, (70)
\boldsymbol{R}=\boldsymbol{R}^{\top}_{2}\boldsymbol{R}_{z}(\pi)\boldsymbol{R}_{1},\ \ \boldsymbol{t}=|s_{3}|\boldsymbol{R}^{\top}_{2}\boldsymbol{l}_{3}. (71)

Similar to the process of determining the correct relative pose after essential matrix decomposition, all possible relative poses from birotation solutions can be enumerated, and the correct one can be identified by checking the depth signs of the reconstructed 3-D points. Nonetheless, in our proposed relative pose estimation algorithm, this process can be omitted when the relative pose is appropriately initialized. When the initial relative pose obtained from essential matrix decomposition is close to the correct solution, the regularization term in (36) ensures that the birotation solution does not converge to an incorrect one. Although our approach consistently produces unique solutions, as demonstrated in our experiments, there remains a potential risk of yielding non-unique solutions in practical applications, which we leave for future work.
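For completeness, the sketch below enumerates the four candidate poses in (54)-(57) for the first birotation solution and selects among them via the depth-sign (cheirality) check described above. The linear triangulation routine and the convention that a point $\boldsymbol{p}^{C}_{2}$ in the second camera frame maps to $\boldsymbol{p}^{C}_{1}=\boldsymbol{R}\boldsymbol{p}^{C}_{2}+\boldsymbol{t}$ in the first are assumptions made here for illustration, not a verbatim excerpt of our implementation.

```python
import numpy as np

def triangulate(x1, x2, R, t):
    """Linear triangulation of normalized correspondences, assuming p1 = R p2 + t
    (3-D points are expressed in the second camera frame)."""
    P2 = np.hstack([np.eye(3), np.zeros((3, 1))])      # second camera as the reference frame
    P1 = np.hstack([R, t.reshape(3, 1)])               # first camera observes R X + t
    points = []
    for a, b in zip(x1, x2):                            # a, b: normalized image coordinates (x, y)
        A = np.vstack([b[0] * P2[2] - P2[0], b[1] * P2[2] - P2[1],
                       a[0] * P1[2] - P1[0], a[1] * P1[2] - P1[1]])
        X = np.linalg.svd(A)[2][-1]                     # null vector of the DLT system
        points.append(X[:3] / X[3])
    return np.asarray(points)

def recover_pose(R1, R2, l1, s1, x1, x2):
    """Enumerate the four candidate poses (54)-(57) of the first birotation solution and keep
    the one yielding the most points with positive depth in both views (cheirality check)."""
    Rx_pi = np.diag([1.0, -1.0, -1.0])                  # pi-radian rotation around the X-axis
    candidates = [(R2.T @ R1, -abs(s1) * R2.T @ l1),
                  (R2.T @ R1, abs(s1) * R2.T @ l1),
                  (R2.T @ Rx_pi @ R1, -abs(s1) * R2.T @ l1),
                  (R2.T @ Rx_pi @ R1, abs(s1) * R2.T @ l1)]
    best, best_count = None, -1
    for R, t in candidates:
        X2 = triangulate(x1, x2, R, t)                  # points in the second camera frame
        X1 = X2 @ R.T + t                               # the same points in the first camera frame
        count = int(np.sum((X1[:, 2] > 0) & (X2[:, 2] > 0)))
        if count > best_count:
            best, best_count = (R, t), count
    return best                                         # the candidate passing the cheirality check
```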

VI Conclusion and Future Work

This article introduced a novel birotation solution for general relative pose problems. To expand the applicability of existing birotation-based methods, this study first presented three basis transformations and the corresponding geometric metrics, which serve as the foundation for the formulation of three energy functions. Relative poses can be accurately and robustly estimated by simultaneously minimizing these energy functions. The resulting high-performing relative pose estimation algorithm overcomes the key limitations of existing birotation-based methods, particularly by reducing reliance on initialization quality and alleviating the non-uniqueness in relative pose recovery. Extensive experiments conducted on both real-world and synthetic datasets demonstrated the superior performance of our algorithm over existing relative pose estimation methods across varying correspondence densities, as well as its strong robustness to different correspondence noise levels and matching error rates.

In the future, we are committed to further advancing our approach along the following three directions: (1) improving its efficiency by avoiding the simultaneous optimization of all three energy functions, (2) enhancing its robustness by incorporating a self-evolving correspondence matching network, and (3) ensuring the uniqueness of relative pose recovery from a birotation solution.

References

  • [1] E. Dexheimer et al., “Information-theoretic online multi-camera extrinsic calibration,” IEEE Robotics and Automation Letters, vol. 7, no. 2, pp. 4757–4764, 2022.
  • [2] T. Dang et al., “Continuous stereo self-calibration by camera parameter tracking,” IEEE Transactions on Image Processing, vol. 18, no. 7, pp. 1536–1550, 2009.
  • [3] P. Hansen et al., “Online continuous stereo extrinsic parameter estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).   IEEE, 2012, pp. 1059–1066.
  • [4] D. Nistér et al., “Visual odometry,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).   IEEE, 2004, DOI: 10.1109/CVPR.2004.1315094.
  • [5] A. Levin and R. Szeliski, “Visual odometry and map correlation,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR).   IEEE, 2004, DOI: 10.1109/CVPR.2004.1315088.
  • [6] W. Chen et al., “LEAP-VO: Long-term effective any point tracking for visual odometry,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 19844–19853.
  • [7] R. I. Hartley, “In defense of the eight-point algorithm,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 6, pp. 580–593, 1997.
  • [8] D. Nistér, “An efficient solution to the five-point relative pose problem,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 6, pp. 756–770, 2004.
  • [9] O. Chum and J. Matas, “Optimal randomized RANSAC,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 8, pp. 1472–1482, 2008.
  • [10] T. Wei et al., “Generalized differentiable RANSAC,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 17649–17660.
  • [11] Y.-Y. Jau et al., “Deep keypoint-based camera pose estimation with geometric constraints,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2020, pp. 4950–4957.
  • [12] J. Zhao, “An efficient solution to non-minimal case essential matrix estimation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 4, pp. 1777–1792, 2020.
  • [13] D. Barath et al., “MAGSAC: Marginalizing sample consensus,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 10197–10205.
  • [14] U. Helmke et al., “Essential matrix estimation using Gauss-Newton iterations on a manifold,” International Journal of Computer Vision, vol. 74, pp. 117–136, 2007.
  • [15] Q. Cai et al., “Equivalent constraints for two-view geometry: Pose solution/pure rotation identification and 3D reconstruction,” International Journal of Computer Vision, vol. 127, pp. 163–180, 2019.
  • [16] Y. Zheng et al., “Revisiting the PnP problem: A fast, general and optimal solution,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2013, pp. 2344–2351.
  • [17] V. Lepetit et al., “EPnP: An accurate O(n) solution to the PnP problem,” International Journal of Computer Vision, vol. 81, pp. 155–166, 2009.
  • [18] Y. Ling and S. Shen, “High-precision online markerless stereo extrinsic calibration,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2016, pp. 1771–1778.
  • [19] H. Zhao et al., “Dive deeper into rectifying homography for stereo camera online self-calibration,” in IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 14479–14485.
  • [20] K. Kanatani and Y. Sugaya, “Unified computation of strict maximum likelihood for geometric fitting,” Journal of Mathematical Imaging and Vision, vol. 38, pp. 1–13, 2010.
  • [21] Z. Zhang, “Determining the epipolar geometry and its uncertainty: A review,” International Journal of Computer Vision, vol. 27, pp. 161–195, 1998.
  • [22] Y. Ma et al., “Optimization criteria and geometric algorithms for motion and structure estimation,” International Journal of Computer Vision, vol. 44, pp. 219–249, 2001.
  • [23] R. Tron and K. Daniilidis, “The space of essential matrices as a Riemannian quotient manifold,” SIAM Journal on Imaging Sciences, vol. 10, no. 3, pp. 1416–1445, 2017.
  • [24] A. Kendall et al., “PoseNet: A convolutional network for real-time 6-DOF camera relocalization,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 2938–2946.
  • [25] J.-W. Bian et al., “Unsupervised scale-consistent depth learning from video,” International Journal of Computer Vision, vol. 129, no. 9, pp. 2548–2564, 2021.
  • [26] Y. Feng et al., “SCIPaD: Incorporating spatial clues into unsupervised pose-depth joint learning,” IEEE Transactions on Intelligent Vehicles, 2024, DOI: 10.1109/TIV.2024.3460868.
  • [27] J. D. Dixon and B. Mortimer, Permutation groups.   Springer Science & Business Media, 1996, vol. 163.
  • [28] J. Sun et al., “LoFTR: Detector-free local feature matching with transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 8922–8931.
  • [29] Z. Li and N. Snavely, “MegaDepth: Learning single-view depth prediction from internet photos,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 2041–2050.
  • [30] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, pp. 91–110, 2004.
  • [31] P. Lindenberger et al., “LightGlue: Local feature matching at light speed,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 17627–17638.
  • [32] A. Dai et al., “ScanNet: Richly-annotated 3D reconstructions of indoor scenes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 5828–5839.
  • [33] B. Thomee et al., “YFCC100M: The new data in multimedia research,” Communications of the ACM, vol. 59, no. 2, pp. 64–73, 2016.
  • [34] R. Hartley and A. Zisserman, Multiple view geometry in computer vision.   Cambridge University Press, 2003.
  • [35] A. Geiger et al., “Vision meets robotics: The KITTI dataset,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.
  • [36] P. J. Huber, “Robust estimation of a location parameter,” in Breakthroughs in statistics: Methodology and distribution.   Springer, 1992, pp. 492–518.
  • [37] P.-E. Sarlin et al., “SuperGlue: Learning feature matching with graph neural networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 4938–4947.
  • [38] J. Sturm et al., “A benchmark for the evaluation of RGB-D SLAM systems,” in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2012, pp. 573–580.
  • [39] J.-R. Chang and Y.-S. Chen, “Pyramid stereo matching network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 5410–5418.
  • [40] L. Lipson et al., “RAFT-Stereo: Multilevel recurrent field transforms for stereo matching,” in 2021 International Conference on 3D Vision (3DV).   IEEE, 2021, pp. 218–227.