Tensor-Based Synchronization and the
Low-Rankness of the Block Trifocal Tensor

Daniel Miao and Gilad Lerman (School of Mathematics, University of Minnesota; miao0022@umn.edu, lerman@umn.edu) and Joe Kileel (Department of Mathematics and Oden Institute for Computational Engineering and Sciences, University of Texas at Austin; jkileel@math.utexas.edu)
Abstract

The block tensor of trifocal tensors provides crucial geometric information on the three-view geometry of a scene. The underlying synchronization problem seeks to recover camera poses (locations and orientations up to a global transformation) from the block trifocal tensor. We establish an explicit Tucker factorization of this tensor, revealing a low multilinear rank of $(6,4,4)$ independent of the number of cameras under appropriate scaling conditions. We prove that this rank constraint provides sufficient information for camera recovery in the noiseless case. The constraint motivates a synchronization algorithm based on the higher-order singular value decomposition of the block trifocal tensor. Experimental comparisons with state-of-the-art global synchronization methods on real datasets demonstrate the potential of this algorithm for significantly improving location estimation accuracy. Overall this work suggests that higher-order interactions in synchronization problems can be exploited to improve performance, beyond the usual pairwise-based approaches.

1 Introduction

Synchronization is crucial for the success of many data-intensive applications, including structure from motion, simultaneous localization and mapping (SLAM), and community detection. This problem involves estimating global states from relative measurements between states. While many studies have explored synchronization in different contexts using pairwise measurements, few have considered measurements between three or more states. In real-world scenarios, relying solely on pairwise measurements often fails to capture the full complexity of the system. For instance, in networked systems, interactions frequently occur among groups of nodes, necessitating approaches that can handle higher-order relationships. Extending synchronization to consider measurements between three or more states, however, increases computational complexity and requires sophisticated mathematical models. Addressing these challenges is vital for advancing various technological fields. For example, higher-order synchronization can improve the accuracy of 3D reconstructions in structure from motion by leveraging more complex geometric relationships. In SLAM, it enhances mapping and localization precision in dynamic environments by considering multi-robot interactions. Similarly, in social networks, it could lead to more accurate identification of tightly-knit groups. Developing efficient algorithms to handle higher-order measurements will open new research avenues and make systems more resilient and accurate.

In this work, we focus on a specific instance of the synchronization problem within the context of structure from motion in 3D computer vision, where each state represents the orientation and location of a camera. Traditional approaches rely on relative measurements encoded by fundamental matrices, which describe the relative projective geometry between pairs of images. Instead, we consider higher-order relative measurements encoded in trifocal tensors, which capture the projective information between triplets of images. Trifocal tensors uniquely determine the geometry of three views, even in the collinear case [1], making them more favorable than triplets of fundamental matrices for synchronization. To understand the structure and properties of trifocal tensors in multi-view geometry, we carefully study the mathematical properties of the block tensor of trifocal tensors. We then use these theoretical insights to develop effective synchronization algorithms.

Directly relevant previous works. In the structure from motion problem, synchronization has traditionally been done using incremental methods, such as Bundler [2] and COLMAP [3]. These methods process images sequentially, gradually recovering camera poses. However, the order of image processing can impact reconstruction quality, as error may significantly accumulate. Bundle adjustment [4], which jointly optimizes camera parameters and 3D points, has been used to limit drifting but is computationally expensive.

Alternatively, global synchronization methods have been proposed. These methods process multiple images simultaneously, avoiding iterative procedures and offering more rigorous and robust solutions. Global methods generally optimize noisy and corrupted measurements by exploiting the structure of relative measurements and imposing constraints. Many global methods solve for orientation and location separately, using structures on $SO(3)$ and the set of locations. Solutions for retrieving camera poses from pairwise measurements have been developed for camera orientations [5, 6, 7, 8, 9, 10], camera locations [11, 12, 13], and both simultaneously [14, 15, 16, 17]. Some methods explore the structure on fundamental or essential matrices [18, 19, 20].

Several attempts to extract information from trifocal tensors include works by: Leonardos et al. [21], which parameterizes calibrated trifocal tensors with non-collinear pinhole cameras as a quotient Riemannian manifold and uses the manifold structure to estimate individual trifocal tensors robustly; Larsson et al. [22], which proposes minimal solvers to determine calibrated radial trifocal tensors for use in an incremental pipeline, handling distorted images with constraints invariant to radial displacement; and Moulon et al. [23], which introduces a structure from motion pipeline, retrieving global rotations via cleaning the estimation graph and solving a least squares problem, and solving for translations by estimating trifocal tensors individually by linear programs. To our knowledge, no prior works develop a global pipeline where the synchronization operates directly on trifocal tensors.

Contribution of this work. The main contributions of this work are as follows:

  • We establish an explicit Tucker factorization of the block trifocal tensor when its blocks are suitably scaled, demonstrating a low multilinear rank of $(6,4,4)$. Moreover, we prove that this rank constraint is sufficient to determine the scales and fully characterizes camera poses in the noiseless case.

  • We develop a method for synchronizing trifocal tensors by enforcing this low rank constraint on the block tensor. We validate the effectiveness of our method through tests on several real datasets in structure from motion.

2 Low-rankness of the block trifocal tensor

We first briefly review relevant background material in Section 2.1. Then we present the main new construction and theoretical results in Section 2.2.

2.1 Background

2.1.1 Cameras and 3D geometry

Given a collection of $n$ images $I_1,\dots,I_n$ of a 3D scene, let $t_i\in\mathbb{R}^3$ and $R_i\in SO(3)$ denote the location and orientation of the camera associated with the image $I_i$ in the global coordinate system. Moreover, each camera is associated with a calibration matrix $K_i$ that encodes the intrinsic parameters of the camera, including the focal length, the principal point, and the skew parameter. Then, the $3\times 4$ camera matrix has the form $P_i=K_iR_i[I_{3\times 3},-t_i]$ and is defined up to nonzero scale. Three-dimensional world points $X$ are represented as vectors in $\mathbb{R}^4$ in homogeneous coordinates, and the projection of $X$ into the image corresponding to $P$ is $x=PX$. A 3D world line $L$ can be represented via Plücker coordinates as a vector in $\mathbb{R}^6$. Then the projection of $L$ onto the image corresponding to $P$ is $l=\mathcal{P}L$, where $\mathcal{P}$ is the $3\times 6$ line projection matrix. It can be written as $\mathcal{P}=\left[P^2\wedge P^3;\, P^3\wedge P^1;\, P^1\wedge P^2\right]$, where $P^i$ is the $i$-th row of the camera matrix $P$ and the wedge denotes the exterior product. Explicitly, the $(i,j)$ element of the line projection matrix is the determinant of the $2\times 2$ submatrix of $P$ obtained by omitting the $i$-th row and selecting the columns given by the $j$-th pair in $[(1,2),(1,3),(1,4),(2,3),(2,4),(3,4)]$; the elements of the second row are multiplied by $-1$.
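To make the line projection matrix concrete, here is a minimal sketch in Python with NumPy (the function name line_projection is ours, not from the paper's code) that computes $\mathcal{P}$ from a $3\times 4$ camera matrix by the determinant recipe just described.

import numpy as np
from itertools import combinations

def line_projection(P):
    # 3x6 line projection matrix of a 3x4 camera matrix P.
    # Entry (i, j) is the determinant of the 2x2 submatrix of P obtained by
    # omitting row i and keeping the j-th column pair from
    # [(1,2), (1,3), (1,4), (2,3), (2,4), (3,4)]; the second row is negated.
    pairs = list(combinations(range(4), 2))      # the six column pairs, 0-indexed
    S = np.zeros((3, 6))
    for i in range(3):
        rows = [r for r in range(3) if r != i]   # omit the i-th row
        for j, (c1, c2) in enumerate(pairs):
            S[i, j] = np.linalg.det(P[np.ix_(rows, [c1, c2])])
    S[1, :] *= -1                                # sign convention on the second row
    return S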

To retrieve global poses, relative measurements of pairs or triplets of images are needed. Let $x_i$ and $x_j$ be any pair of corresponding keypoints in images $I_i$ and $I_j$ respectively, meaning that they are images of a common world point. The fundamental matrix $F_{ij}$ is a $3\times 3$ matrix such that $x_i^TF_{ij}x_j=0$. It is known that $F_{ij}$ encodes the relative orientation $R_{ij}=R_iR_j^T$ and translation $t_{ij}=R_i(t_i-t_j)$ through $F_{ij}=K_i^{-T}[t_{ij}]_\times R_{ij}K_j^{-1}$. The essential matrix corresponds to the calibrated case, where $K_i=I_{3\times 3}$ for all $i$.
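As a small illustration of this encoding, the relative pose formulas translate directly into NumPy (a sketch under our naming; skew and fundamental are our hypothetical helpers, not the paper's code):

import numpy as np

def skew(v):
    # 3x3 skew-symmetric matrix [v]_x such that [v]_x w = v x w
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def fundamental(Ki, Ri, ti, Kj, Rj, tj):
    # F_ij = K_i^{-T} [t_ij]_x R_ij K_j^{-1} from global poses (R_i, t_i)
    Rij = Ri @ Rj.T                  # relative orientation R_ij = R_i R_j^T
    tij = Ri @ (ti - tj)             # relative translation t_ij = R_i (t_i - t_j)
    return np.linalg.inv(Ki).T @ skew(tij) @ Rij @ np.linalg.inv(Kj)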

2.1.2 Trifocal tensors

Analogous to the fundamental matrix, the trifocal tensor $T_{ijk}$ is a $3\times 3\times 3$ tensor that relates features across images and characterizes the relative pose between a triplet of cameras $P_i,P_j,P_k$. The trifocal tensor $T_{ijk}$ corresponding to cameras $P_i,P_j,P_k$ can be calculated by

(T_{ijk})_{wqr}=(-1)^{w+1}\det\begin{bmatrix}\sim P_{i}^{w}\\ P_{j}^{q}\\ P_{k}^{r}\end{bmatrix}, \qquad (1)

where $P_i^w$ is the $w$-th row of $P_i$, and $\sim P_i^w$ is the $2\times 4$ submatrix of $P_i$ omitting the $w$-th row. The trifocal tensor determines the geometry of three cameras up to a global projective ambiguity, or up to a scaled rigid transformation in the calibrated case. In addition to point correspondences, trifocal tensors satisfy constraints for corresponding lines, and mixtures thereof. For example, let $l_i,l_j,l_k$ be corresponding image lines in the views of cameras $P_i,P_j,P_k$ respectively; then the lines are related through the trifocal tensor $T_{ijk}$ by $\left(l_j^T[(T_{ijk})_{1::},(T_{ijk})_{2::},(T_{ijk})_{3::}]l_k\right)[l_i]_\times=0^T$, where $[l_i]_\times$ denotes the $3\times 3$ skew-symmetric matrix corresponding to the cross product by $l_i$. We refer to [1] for more details on the properties of a trifocal tensor. We include the standard derivation of the trifocal tensor in Section A.1.
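Formula (1) translates directly into code; the sketch below (our naming, 0-indexed, so the sign $(-1)^{w+1}$ becomes $(-1)^w$) evaluates a $3\times 3\times 3$ trifocal tensor from three camera matrices.

import numpy as np

def trifocal(Pi, Pj, Pk):
    # (T_ijk)_{wqr} = (-1)^{w+1} det[~Pi^w; Pj^q; Pk^r], formula (1);
    # with 0-indexed w the sign (-1)^{w+1} becomes (-1)^w.
    T = np.zeros((3, 3, 3))
    for w in range(3):
        keep = [r for r in range(3) if r != w]   # ~Pi^w: the 2x4 submatrix omitting row w
        for q in range(3):
            for r in range(3):
                M = np.vstack([Pi[keep], Pj[q][None, :], Pk[r][None, :]])   # 4x4 matrix
                T[w, q, r] = (-1) ** w * np.linalg.det(M)
    return T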

Since corresponding lines put constraints on the trifocal tensor, one advantage of incorporating trifocal tensors into structure from motion pipelines is that they can be estimated purely from line correspondences or from a mixture of points and lines. Fundamental matrices cannot be estimated directly from line correspondences, so the effectiveness of pairwise methods is limited on datasets where feature points are scarce. Furthermore, trifocal tensors have the potential to improve location estimation. From pairwise measurements, one can only obtain the relative direction but not its scale, and location estimation in the pairwise setting is a “notoriously difficult problem” (quoting from pages 316-317 of [24]). However, trifocal tensors encode the relative scales of the directions and can greatly simplify the location estimation procedure. We refer to several works on characterizing the complexity of minimal problems for individual trifocal tensors [25, 26], and on developing methods for solving certain minimal problems [27, 28, 29, 30, 31, 32, 33]. We also refer to [34] for a survey on structure from motion, which discusses minimal problem solvers from the perspective of computational algebraic geometry.

2.1.3 Tucker decomposition and the multilinear rank of tensors

We review basic material on the Tucker decomposition and the multilinear rank of a tensor. We refer to [35] for more details while adopting its notation. Let $T\in\mathbb{R}^{I_1\times I_2\times\cdots\times I_N}$ be an order $N$ tensor. The mode-$i$ flattening (or matricization) $T_{(i)}\in\mathbb{R}^{I_i\times(I_1\cdots I_{i-1}I_{i+1}\cdots I_N)}$ is the rearrangement of $T$ into a matrix by taking mode-$i$ fibers to be columns of the flattened matrix. By convention, the ordering of the columns in the flattening follows the lexicographic order of the modes excluding $i$. The symbols $\otimes$ and $\odot$ denote the Kronecker product and the Hadamard product respectively. The norm on tensors is defined as $\|T\|=\|T_{(1)}\|_F$. The $i$-rank of $T$ is the column rank of $T_{(i)}$ and is denoted $\mathrm{rank}_i(T)$. Let $R_i=\mathrm{rank}_i(T)$. Then the multilinear rank of $T$ is defined as $\mathrm{mlrank}(T)=(R_1,R_2,\dots,R_N)$. The $i$-mode product of $T$ with a matrix $U\in\mathbb{R}^{m\times I_i}$ is a tensor in $\mathbb{R}^{I_1\times\cdots\times I_{i-1}\times m\times I_{i+1}\times\cdots\times I_N}$ such that

(T\times_iU)_{j_1\cdots j_{i-1}kj_{i+1}\cdots j_N}=\sum_{j_i=1}^{I_i}T_{j_1j_2\cdots j_N}U_{kj_i}.
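In NumPy terms, the mode-$i$ flattening and the $i$-mode product can be realized as below (a convention sketch, ours: the column ordering of this flattening may differ from the lexicographic convention above, but the column space, rank, and left singular vectors are unaffected).

import numpy as np

def flatten(T, i):
    # mode-i flattening: mode-i fibers become the columns of a matrix
    return np.moveaxis(T, i, 0).reshape(T.shape[i], -1)

def mode_product(T, U, i):
    # i-mode product T x_i U: contract mode i of T against the rows of U
    out = np.tensordot(U, T, axes=(1, i))   # new axis of size U.shape[0] comes first
    return np.moveaxis(out, 0, i)           # move it back into position i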

Then, the Tucker decomposition of $T\in\mathbb{R}^{I_1\times I_2\times\cdots\times I_N}$ is a decomposition of the following form:

T=\mathcal{G}\times_1A_1\times_2A_2\times_3\cdots\times_NA_N=\llbracket\mathcal{G};A_1,A_2,\dots,A_N\rrbracket,

where $\mathcal{G}\in\mathbb{R}^{Q_1\times\cdots\times Q_N}$ is the core tensor, and $A_n\in\mathbb{R}^{I_n\times Q_n}$ are the factor matrices. Without loss of generality, the factor matrices can be assumed to have orthonormal columns. Given the multilinear rank of the core tensor $(R_1,\dots,R_N)$, the Tucker decomposition approximation problem can be written as

\underset{\mathcal{G}\in\mathbb{R}^{R_1\times\cdots\times R_N},\,A_i\in\mathbb{R}^{I_i\times R_i}}{\arg\min}\|T-\llbracket\mathcal{G};A_1,A_2,\dots,A_N\rrbracket\|. \qquad (2)

A standard way of solving (2) is the higher-order singular value decomposition (HOSVD). The HOSVD is computed with the following steps. First, for each $i$ calculate the factor matrix $A_i$ as the $R_i$ leading left singular vectors of $T_{(i)}$. Second, set the core tensor as $\mathcal{G}=T\times_1A_1^T\times_2\cdots\times_NA_N^T$. Though the solution from HOSVD will not be the optimal solution to (2), it satisfies a quasi-optimality property: if $T^*$ is the optimal solution and $T'$ the solution from HOSVD, then

\|T-T'\|\leq\sqrt{N}\|T-T^*\|. \qquad (3)
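The two HOSVD steps amount to a few lines on top of the helpers above (a sketch, ours; tucker reassembles $\llbracket\mathcal{G};A_1,\dots,A_N\rrbracket$):

import numpy as np
# assumes flatten() and mode_product() from the sketch above

def hosvd(T, ranks):
    # rank-(R_1, ..., R_N) HOSVD: returns the core G and the factor matrices A_i
    factors = []
    for i, Ri in enumerate(ranks):
        U, _, _ = np.linalg.svd(flatten(T, i), full_matrices=False)
        factors.append(U[:, :Ri])            # R_i leading left singular vectors of T_(i)
    G = T
    for i, A in enumerate(factors):
        G = mode_product(G, A.T, i)          # core: G = T x_1 A_1^T ... x_N A_N^T
    return G, factors

def tucker(G, factors):
    # reassemble [[G; A_1, ..., A_N]]
    T = G
    for i, A in enumerate(factors):
        T = mode_product(T, A, i)
    return T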

2.2 Low Tucker rank of the block trifocal tensor and one shot camera retrieval

Suppose we are given a set of camera matrices $\{P_i\}_{i=1}^n$ with $n\geq 3$ and scales fixed on each camera matrix. Define the block trifocal tensor $T^n$ to be the $3n\times 3n\times 3n$ tensor whose $3\times 3\times 3$ block in position $ijk$ is the trifocal tensor corresponding to the triplet of cameras $P_i,P_j,P_k$. We assume that for all blocks with overlapping indices, the corresponding $3\times 3\times 3$ tensor is also calculated using formula (1). We summarize key properties of $T^n$ in Proposition 1 and Theorem 1. The proof of Proposition 1 is by direct computation and can be found in Section A.3.
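The construction is straightforward to assemble in code; the sketch below (ours) builds $T^n$ from a list of camera matrices using the trifocal function sketched in Section 2.1.2.

import numpy as np
# assumes trifocal() from the sketch in Section 2.1.2

def block_trifocal(Ps):
    # 3n x 3n x 3n block trifocal tensor; block (i,j,k) is T_ijk from formula (1),
    # including blocks with repeated indices (in particular, T_iii = 0 automatically)
    n = len(Ps)
    Tn = np.zeros((3 * n, 3 * n, 3 * n))
    for i in range(n):
        for j in range(n):
            for k in range(n):
                Tn[3*i:3*i+3, 3*j:3*j+3, 3*k:3*k+3] = trifocal(Ps[i], Ps[j], Ps[k])
    return Tn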

Proposition 1.

For the block trifocal tensor $T^n$ and all distinct $i,j\in[n]$, we have the following properties:

  • (i)

    $T^n_{iii}=0_{3\times 3\times 3}$

  • (ii)

    The $T^n_{jii}$ blocks are rearrangements of elements of the fundamental matrix $F_{ij}$ up to signs.

  • (iii)

    The $T^n_{iji}$ and $T^n_{iij}$ blocks encode the epipoles.

  • (iv)

    The horizontal slices $T^n(i,:,:)$ of $T^n$ are skew-symmetric.

  • (v)

    When all cameras are calibrated, three singular values of $T^n_{(1)}$ are equal.

Theorem 1 (Tucker factorization and low multilinear rank of block trifocal tensor).

The block trifocal tensor $T^n$ admits a Tucker factorization $T^n=\mathcal{G}\times_1\mathcal{P}\times_2\mathcal{C}\times_3\mathcal{C}$, where $\mathcal{G}\in\mathbb{R}^{6\times 4\times 4}$, $\mathcal{P}\in\mathbb{R}^{3n\times 6}$, and $\mathcal{C}\in\mathbb{R}^{3n\times 4}$. If the $n$ cameras that produce $T^n$ are not all collinear, then $\mathrm{mlrank}(T^n)=(6,4,4)$. If the $n$ cameras that produce $T^n$ are collinear, then $\mathrm{mlrank}(T^n)\preceq(6,4,4)$.

Proof.

We can explicitly calculate that $T^n=\mathcal{G}\times_1\mathcal{P}\times_2\mathcal{C}\times_3\mathcal{C}$. The details of the calculation are in Section A.2. The specific forms of $\mathcal{G},\mathcal{C},\mathcal{P}$ are the following. The horizontal slices of the core are

\left\{\begin{bmatrix}0&0&0&0\\ 0&0&0&0\\ 0&0&0&1\\ 0&0&-1&0\end{bmatrix},\begin{bmatrix}0&0&0&0\\ 0&0&0&-1\\ 0&0&0&0\\ 0&1&0&0\end{bmatrix},\begin{bmatrix}0&0&0&0\\ 0&0&1&0\\ 0&-1&0&0\\ 0&0&0&0\end{bmatrix},\begin{bmatrix}0&0&0&1\\ 0&0&0&0\\ 0&0&0&0\\ -1&0&0&0\end{bmatrix},\begin{bmatrix}0&0&-1&0\\ 0&0&0&0\\ 1&0&0&0\\ 0&0&0&0\end{bmatrix},\begin{bmatrix}0&1&0&0\\ -1&0&0&0\\ 0&0&0&0\\ 0&0&0&0\end{bmatrix}\right\}.

The factor matrices are $\mathcal{C}=[P_1,P_2,\dots,P_n]^T\in\mathbb{R}^{3n\times 4}$ and $\mathcal{P}=[S_1,S_2,\dots,S_n]^T\in\mathbb{R}^{3n\times 6}$, where the $P_i$ are the camera matrices and the $S_i$ are the corresponding line projection matrices.

Now, we suppose that the $n$ cameras are not collinear. We first show that $\mathcal{C}$ and $\mathcal{P}$ both have full rank. From [1], the null space of a camera matrix $P_i$ is generated by the camera center. For the sake of contradiction, suppose that $\mathrm{rank}(\mathcal{C})<4$. Then there exists $x\in\mathbb{R}^4$ such that $x\neq 0$ and $\mathcal{C}x=0$. This means that $P_ix=0$ for all $i=1,\dots,n$. Then $x$ is the camera center for all cameras, which means that the cameras are centered at one point and are collinear. Similarly, every vector in the null space of the line projection matrix $S_i$ is a line that passes through the camera center [1]. For the sake of contradiction, suppose that $\mathrm{rank}(\mathcal{P})<6$. Then there exists $x\in\mathbb{R}^6$ such that $x\neq 0$ and $\mathcal{P}x=0$. This implies that $S_ix=0$ for all $i=1,\dots,n$, which means that $x$ is a line that passes through all of the camera centers. Again the cameras are collinear, which is a contradiction. Next we write the flattening of the block trifocal tensor as $T^n_{(1)}=\mathcal{P}\mathcal{G}_{(1)}(\mathcal{C}\otimes\mathcal{C})^T$. Then $\mathcal{P}\in\mathbb{R}^{3n\times 6}$ has rank 6, and $(\mathcal{C}\otimes\mathcal{C})^T\in\mathbb{R}^{16\times 9n^2}$ has rank 16. Given the specific form of $\mathcal{G}$, where $\mathcal{G}_{(1)}\in\mathbb{R}^{6\times 16}$, it is easy to check that $\mathrm{rank}(\mathcal{G}_{(1)})=6$. Thus, $\mathrm{rank}(T^n_{(1)})=6$. Similarly, we can show that $\mathrm{rank}(T^n_{(2)})=4$ and $\mathrm{rank}(T^n_{(3)})=4$. This implies that the multilinear rank of the block trifocal tensor is $(6,4,4)$ when the $n$ cameras are not collinear.

When the $n$ cameras are collinear, the individual factors in each flattening may be rank deficient, so that $\mathrm{rank}(T^n_{(1)})\leq 6$, $\mathrm{rank}(T^n_{(2)})\leq 4$, and $\mathrm{rank}(T^n_{(3)})\leq 4$. This implies $\mathrm{mlrank}(T^n)\preceq(6,4,4)$. ∎
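Theorem 1 is easy to sanity-check numerically. The snippet below (ours; random Gaussian cameras are generic, so the non-collinearity hypothesis holds almost surely) builds a small block trifocal tensor and reports the ranks of its three flattenings.

import numpy as np
# assumes block_trifocal() and flatten() from the earlier sketches

rng = np.random.default_rng(0)
Ps = [rng.standard_normal((3, 4)) for _ in range(6)]    # n = 6 generic cameras
Tn = block_trifocal(Ps)
print([int(np.linalg.matrix_rank(flatten(Tn, i))) for i in range(3)])
# expected output: [6, 4, 4]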

The theorem inspires a straightforward way of retrieving global poses from the block trifocal tensor, which we summarize in the following claim.

Proposition 2 (One shot camera pose retrieval).

Given the block trifocal tensor $T^n$ produced by cameras $P_1,P_2,\dots,P_n$, the cameras can be retrieved from $T^n$ up to a global projective ambiguity using the higher-order SVD. The cameras are given by the leading 4 left singular vectors of $T^n_{(2)}$ or $T^n_{(3)}$.

Using the higher-order SVD on $T^n$, we can get a Tucker decomposition of the block trifocal tensor $T^n=\hat{\mathcal{G}}\times_1\hat{\mathcal{P}}\times_2\hat{\mathcal{C}}\times_3\hat{\mathcal{C}}'$. The Tucker factorization is not unique [35], as we can apply an invertible linear transformation to one of the factor matrices and apply its inverse to the core tensor; this invertible linear transformation can be interpreted as the global projective ambiguity of projective 3D reconstruction algorithms. Thus, the cameras can be retrieved by taking the leading four left singular vectors of the mode-2 and mode-3 flattenings of the block tensor.
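Concretely, one-shot retrieval in the sense of Proposition 2 is just a truncated SVD of a flattening (a sketch under our naming, for a noiseless block tensor Tn):

import numpy as np
# assumes flatten() from Section 2.1.3 and a noiseless block tensor Tn

U, _, _ = np.linalg.svd(flatten(Tn, 1), full_matrices=False)   # mode-2 flattening (0-indexed)
C_hat = U[:, :4]        # 3n x 4 stack of cameras, up to a global 4x4 projective transform
P1_hat = C_hat[0:3, :]  # recovered first camera, up to scale and that global ambiguity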

Very importantly however, in practice each trifocal tensor block in $T^n$ can be estimated from image data only up to an unknown multiplicative scale [1]. The following theorem establishes that the multilinear rank constraints provide sufficient information for determining the correct scales. In the statement, $\odot_b$ denotes blockwise scalar multiplication; thus the $(i,j,k)$-block of $\lambda\odot_bT^n$ is $\lambda_{ijk}T^n_{ijk}\in\mathbb{R}^{3\times 3\times 3}$.

Theorem 2.

Let $T^n\in\mathbb{R}^{3n\times 3n\times 3n}$ be a block trifocal tensor corresponding to $n\geq 4$ calibrated or uncalibrated cameras in generic position. Let $\lambda\in\mathbb{R}^{n\times n\times n}$ be a block scaling with $\lambda_{ijk}$ nonzero iff $i,j,k$ are not all equal. Assume that $\lambda\odot_bT^n\in\mathbb{R}^{3n\times 3n\times 3n}$ has multilinear rank $(6,4,4)$, where $\odot_b$ denotes blockwise scalar multiplication. Then there exist $\alpha,\beta,\gamma\in\mathbb{R}^n$ such that $\lambda_{ijk}=\alpha_i\beta_j\gamma_k$ whenever $i,j,k$ are not all the same.

Sketch.

The idea is to identify certain submatrices in the flattenings of $\lambda\odot_bT^n$ which must have determinant 0, and use these to solve for $\lambda$. A proof is in Appendix A.4. We remark that the proof technique extends that of [36, Theorem 5.1], which showed a similar result for a matrix problem. ∎

Theorem 2 is the basic guarantee for our algorithm development below. We stress that the ambiguities brought by $\alpha,\beta,\gamma$ are not problematic for purposes of recovering the camera matrices by Proposition 2. Indeed, $(\alpha\otimes\beta\otimes\gamma)\odot_bT^n=\mathcal{G}\times_1(D_\alpha\mathcal{P})\times_2(D_\beta\mathcal{C})\times_3(D_\gamma\mathcal{C})$, where $D_\alpha\in\mathbb{R}^{3n\times 3n}$ is the diagonal matrix with each entry of $\alpha$ triplicated, etc. Hence the camera matrices can still be recovered from the higher-order SVD, up to individual scales (as expected) and a global projective transformation.

3 Synchronization of the block trifocal tensor

In this section, we develop a heuristic method for synchronizing the block trifocal tensor $T^n$ by exploiting the multilinear rank of $T^n$ from Theorem 1. Let $\hat{T}^n$ denote the estimated block trifocal tensor, and $T^n$ the ground truth. Assume that there are $n$ images and a set of trifocal tensor estimates $\hat{T}_{ijk}$ for $(i,j,k)\in\Omega$, where $\Omega$ is the set of indices whose corresponding trifocal tensor is estimated. Note that each estimated trifocal tensor $\hat{T}_{ijk}$ has an unknown scale $\lambda_{ijk}\in\mathbb{R}^*$ associated with it. We always assume that we observe the $iii$ blocks, as they are 0. We form the block trifocal tensor $\hat{T}^n$ by plugging in the estimates $\hat{T}_{ijk}$ and setting the unobserved positions ($(i,j,k)\not\in\Omega$) to $3\times 3\times 3$ tensors of all zeros. Let $W_\Omega\in\{0,1\}^{3n\times 3n\times 3n}$ denote the block tensor whose $(i,j,k)$ blocks are ones for $(i,j,k)\in\Omega$ and zeros otherwise, and let $W_{\Omega^C}$ denote the opposite. In our experiments, we observe that the HOSVD is quite robust for retrieving camera poses against noise, which arises, e.g., from numerical sensitivities when first estimating relative poses [37]. Therefore we develop an algorithm that projects $\hat{T}^n$ onto the set of tensors of multilinear rank $(6,4,4)$ while completing the tensor and retrieving an appropriate set of scales. Specifically, we can write our problem as

\underset{\Lambda}{\min}\;\|\Lambda\odot\hat{T}^n-\mathcal{P}_\tau(\Lambda\odot\hat{T}^n)\|^2 \qquad (4)

where $\Lambda\in\mathbb{R}^{3n\times 3n\times 3n}$, each $3\times 3\times 3$ block of $\Lambda$ is uniform, the $\Lambda_{ijk}$ blocks are zero for $(i,j,k)\not\in\Omega$, and $\Lambda$ satisfies a normalization condition like $\|\Lambda\|^2=1$ to avoid its vanishing. However, we drop this normalization constraint in our implementation, as we never observe $\Lambda$ vanishing in practice. (For convenience, we formulate this section with the notation of $\Lambda\in\mathbb{R}^{3n\times 3n\times 3n}$ and Hadamard multiplication, rather than $\lambda\in\mathbb{R}^{n\times n\times n}$ and blockwise scalar multiplication from Theorem 2.) Furthermore, in problem (4), $\mathcal{P}_\tau$ denotes the exact projection onto the set $\Gamma=\{T\in\mathbb{R}^{3n\times 3n\times 3n}:\mathrm{mlrank}(T)=(6,4,4)\}$. Note that though HOSVD provides an efficient way to project onto $\Gamma$, it is quasi-optimal and not the exact projection. The exact projection is much harder to calculate, and in general NP-hard. The algorithm below adopts an alternating projection strategy to estimate the best set of scales.

3.1 Higher-order SVD with a hard threshold (HOSVD-HT)

The key idea of our algorithm is to use the relative scales on the rank-truncated tensor as a heuristic to retrieve scales for the estimated block tensor. There are two main challenges in calculating the rank-truncated tensor. First, the exact projection $\mathcal{P}_\tau$ onto $\Gamma$ is expensive and difficult to calculate. Second, many blocks in the block tensor will be unknown if the corresponding images lack common feature points, and directly projecting the uncompleted tensor would be inaccurate. We apply an HOSVD framework with imputations to tackle these challenges. Regarding the first challenge, HOSVD is a simple, efficient, and quasi-optimal (3) projection onto $\Gamma$. Though inexact, it is a reliable approximation. For the second challenge, the tensor $\hat{T}^n$ must be completed. We adopt the matrix completion idea of HARD-IMPUTE [38], where the matrix is filled in iteratively with the rank-truncated matrix obtained using the hard-thresholded SVD. In other words, we complete the missing blocks with the corresponding blocks of the rank-truncated tensor. We define three hyperparameters $l_1,l_2,l_3$ that correspond to the thresholding parameters of the hard-thresholded SVD on modes $1,2,3$ of the block tensor respectively. Specifically, for each mode-$i$ flattening $T^n_{(i)}$, we calculate the full SVD $T^n_{(i)}=USV^T$. Since our tensor scales cubically with the number of cameras, we suggest using a randomized SVD; we refer to [39] for different randomized strategies. Assume the singular values on the diagonal of $S$ are sorted in descending order, as usual. We return the factor matrix $A_i$ as the top $a_i$ left singular vectors in $U$, where $a_i=\max\{k:S_{kk}>l_i\}$. Our adapted truncation method is summarized in Algorithm 1.

Algorithm 1 HOSVD-HT
Input: $\hat{T}^n\in\mathbb{R}^{3n\times 3n\times 3n}$: the estimated block tensor; $l_1,l_2,l_3\in\mathbb{R}$: the thresholds for modes $1,2,3$ respectively
Output: $\hat{T}^r\in\mathbb{R}^{3n\times 3n\times 3n}$: the rank-truncated tensor
for $i$ = 1 to 3 do
   Perform the randomized SVD on the mode-$i$ flattening such that $\hat{T}^n_{(i)}\leftarrow USV^T$
   $a_i\leftarrow\max\{k:S_{kk}>l_i\}$
   $A_i\leftarrow$ first $a_i$ columns of $U$
end for
$\mathcal{G}\leftarrow\hat{T}^n\times_1A_1^T\times_2A_2^T\times_3A_3^T$
$\hat{T}^r\leftarrow\llbracket\mathcal{G};A_1,A_2,A_3\rrbracket$
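A direct NumPy transcription of Algorithm 1 might look as follows (a sketch, ours: we use a full SVD where the paper suggests a randomized one, and reuse the helpers from Section 2.1.3).

import numpy as np
# assumes flatten() and mode_product() from Section 2.1.3

def hosvd_ht(T_hat, thresholds):
    # Algorithm 1: HOSVD with hard singular-value thresholds (l_1, l_2, l_3)
    factors = []
    for i, li in enumerate(thresholds):
        U, s, _ = np.linalg.svd(flatten(T_hat, i), full_matrices=False)
        a_i = int(np.sum(s > li))            # a_i = max{k : s_k > l_i}, s sorted descending
        factors.append(U[:, :a_i])
    G = T_hat
    for i, A in enumerate(factors):
        G = mode_product(G, A.T, i)          # core tensor
    Tr = G
    for i, A in enumerate(factors):
        Tr = mode_product(Tr, A, i)          # rank-truncated tensor
    return Tr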

From now on, we refer to the hard-thresholded HOSVD as HOSVD-HT and denote the operation by $\mathcal{P}_{ht}$.

3.2 Scale recovery

HOSVD-HT provides an efficient way of projecting $\hat{T}^n$ onto the set of tensors with truncated rank. To recover scales, we use the rank-truncated tensor's relative scale as a heuristic to adjust the scale on our estimated block trifocal tensor $\hat{T}^n$. For each step, we solve

\Lambda^{(t+1)}=\underset{\Lambda}{\arg\min}\|\Lambda\odot\hat{T}^n-\mathcal{P}_{ht}(\Lambda^{(t)}\odot\hat{T}^n)\|^2\quad\text{s.t. }\Lambda_{ijk}=0_{3\times 3\times 3}\text{ for }(i,j,k)\in\Omega^C, \qquad (5)

where we drop the normalization condition on $\Lambda$ because in practice it is not needed. We solve (5) for each observed block separately. Denoting $\mathcal{P}_{ht}(\Lambda^{(t)}\odot\hat{T}^n)$ by $(\hat{T}^n_r)^{(t)}$, we have

\Lambda_{ijk}^{(t+1)}=\underset{\mu}{\arg\min}\|\mu\cdot\hat{T}_{ijk}^n-(\hat{T}_r^n)_{ijk}^{(t)}\|^2=\frac{\mathrm{trace}\big((\hat{T}_{ijk}^n)_{(1)}^T((\hat{T}^n_r)_{ijk}^{(t)})_{(1)}\big)}{\|(\hat{T}_{ijk}^n)_{(1)}\|_F^2}. \qquad (6)

Recall that our strategy for completing the tensor is to impute the tensor with the entries of the rank-truncated tensor from HOSVD-HT. Specifically, given the current imputed tensor $(\hat{T}^n)^{(t)}$, we calculate $\mathcal{P}_{ht}((\hat{T}^n)^{(t)})$ and the new scales $\Lambda^{(t+1)}$. Then we update with

(\hat{T}^n)^{(t+1)}=(\Lambda^{(t+1)}\odot(\hat{T}^n)^{(t)}\odot W_\Omega)+\mathcal{P}_{ht}((\hat{T}^n)^{(t)})\odot W_{\Omega^C}. \qquad (7)
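One iteration of (6)-(7) can be sketched as follows (our code; Omega is the list of observed block index triples, and the small constant guarding the denominator is ours, to avoid division by zero on an all-zero block).

import numpy as np
# assumes hosvd_ht() from Section 3.1

def sync_step(T_hat, Omega, thresholds):
    # one update: blockwise least-squares scales (6) on observed blocks,
    # imputation (7) from the rank-truncated tensor elsewhere
    Tr = hosvd_ht(T_hat, thresholds)
    T_next = Tr.copy()                       # unobserved blocks <- truncated tensor
    for (i, j, k) in Omega:
        B = T_hat[3*i:3*i+3, 3*j:3*j+3, 3*k:3*k+3]
        Br = Tr[3*i:3*i+3, 3*j:3*j+3, 3*k:3*k+3]
        mu = np.sum(B * Br) / max(np.sum(B * B), 1e-12)    # closed form of (6)
        T_next[3*i:3*i+3, 3*j:3*j+3, 3*k:3*k+3] = mu * B   # rescaled observed block
    return T_next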

3.3 Synchronization algorithm

Now we summarize our synchronization framework in Algorithm 2.

Algorithm 2 Synchronization of the block trifocal tensor
Input: $\hat{T}^n\in\mathbb{R}^{3n\times 3n\times 3n}$; $W_\Omega,W_{\Omega^C}\in\{0,1\}^{3n\times 3n\times 3n}$; $l_1,l_2,l_3\in\mathbb{R}$
Output: $\mathcal{C}\in\mathbb{R}^{3n\times 4}$: camera matrices up to a $4\times 4$ projective ambiguity and camera-wise scales
Initialize $\hat{T}^n$ by imputing unobserved blocks randomly to get $(\hat{T}^n)^{(0)}$
while not converged do
   Calculate $\mathcal{P}_{ht}((\hat{T}^n)^{(t)})$ using HOSVD-HT
   Calculate $\Lambda^{(t+1)}$ in (5) using (6)
   $(\hat{T}^n)^{(t+1)}\leftarrow(\Lambda^{(t+1)}\odot(\hat{T}^n)^{(t)}\odot W_\Omega)+\mathcal{P}_{ht}((\hat{T}^n)^{(t)})\odot W_{\Omega^C}$
   $t\leftarrow t+1$
end while
$(\mathcal{G},A_1,A_2,A_3)\leftarrow\mathrm{HOSVD}((\hat{T}^n)^{(t)})$
$\mathcal{C}\leftarrow$ first 4 columns of $A_2$
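Putting the pieces together, the main loop of Algorithm 2 reduces to iterating sync_step and then reading the cameras off a flattening (a sketch under our naming; here a plain iteration cap stands in for the scale-variance stopping rule discussed below).

import numpy as np
# assumes sync_step() from Section 3.2 and flatten() from Section 2.1.3

def synchronize(T_hat, Omega, thresholds, max_iter=50):
    # Algorithm 2 sketch: alternate scale recovery and imputation, then HOSVD
    T = T_hat.copy()    # in practice the unobserved blocks are first imputed randomly
    for _ in range(max_iter):
        T = sync_step(T, Omega, thresholds)
    U, _, _ = np.linalg.svd(flatten(T, 1), full_matrices=False)
    return U[:, :4]     # 3n x 4 camera stack, up to projective ambiguity and scales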

We have observed that the algorithm can overfit, as the recovered scales can experience sudden and huge leaps. Our stopping criterion for the algorithm is when we observe sudden jumps in the variance of the new scales or when we exceed a maximum number of iterations. Another challenge in structure from motion datasets is that estimates may be highly corrupted. The HOSVD framework mainly consists of retrieving a dominant subspace from each flattening. Thus, it is natural to replace the SVD on each flattening with a more robust subspace recovery method, such as Tyler's M-estimator (TME) [40] or a recent extension of TME that incorporates the dimension of the subspace into the algorithm [41]. We refer to Appendix A.5.2 for more details and provide an implementation there.

4 Numerical experiments

We conduct experiments with Algorithm 2 on two benchmark real datasets, the EPFL datasets [42] and the Photo Tourism datasets [11]. We observe that the algorithm performs better in the calibrated setting, and since the calibration matrix is usually known in practice, we restrict our scope of experiments to calibrated trifocal tensors. We compare against three state-of-the-art synchronization methods based on two-view measurements: NRFM [18], LUD [12], and BATA [43] initialized with MPLS [9]. NRFM relies on nonconvex optimization and requires a good initialization; we test NRFM with an initialization obtained from LUD and with a random initialization. We refer to A.6 in the appendix for a comprehensive summary of numerical results, including rotation and translation estimation errors. We include our code in the following GitHub repository: TrifocalSync.

4.1 EPFL dataset

For EPFL, we follow the experimental setup and adopt code from [44] and test an entire structure from motion pipeline. We first describe the structure from motion pipeline for the EPFL experiments.

  • Step 1 (feature detection and feature matching). We obtain matched features across pairs of images using a modern deep-learning-based feature detection and matching algorithm, GlueStick [45]. Though we do not implement this in our experiments, methods have been developed to further screen corrupted keypoint matches or obtain matches robustly, such as [46, 47, 48]. Keypoints across a triplet of cameras are matched from pairs, and a keypoint is included only if it appears in all pair combinations of the three images.

  • Step 2 (estimation and refinement of trifocal tensors). With the triplet matches, we calculate the trifocal tensors for triplets with more than 11 correspondences. To obtain an even sparser graph, one can skip the estimation of trifocal tensors and rely on imputation for triplets that have fewer point correspondences than some threshold larger than 11; this can further speed up the trifocal tensor estimation process. We apply STE from [41] to find 40% of the correspondences as inliers, then use at most 30 inlier point correspondences to linearly estimate the trifocal tensor. To refine the estimates, we apply bundle adjustment on the inliers and delete triplets with reprojection error larger than 1 pixel.

  • Step 3 (synchronization). We synchronize the estimated block trifocal tensor with a robust variant of the SVD using the framework described in Algorithm 2. The robustness comes from replacing the SVD with a robust subspace recovery method [41]; more details can be found in Section A.5.2. Recall that the cameras we retrieve are up to a global projective ambiguity. When comparing with ground truth poses, we first align our estimated cameras with the ground truth cameras by finding a $4\times 4$ projective transformation. Then we round the cameras to calibrated cameras and compare.

We test our full pipeline on the EPFL datasets on a personal machine with a 2 GHz Intel Core i5 with 4 cores and 16GB of memory. To test NRFM [18], LUD [12], and BATA [43] initialized with MPLS [9], we estimate the corresponding essential matrices using GC-RANSAC [49]. We did not include blocks corresponding to two views in our trifocal tensor pipeline. The mean and median translation errors are summarized in Figure 1, and more comprehensive results can be found in Table 1 and Table 2 in the appendix.

Figure 1: EPFL translation error comparison between our method, NRFM initialized by LUD, LUD, and NRFM initialized randomly. BATA(MPLS) stands for BATA initialized by MPLS. HZ8 stands for HerzP8, FP11 for FountainP11, HZ25 for HerzP25, EN10 for EntryP10, CS19 for CastleP19, CS30 for CastleP30.

The EPFL datasets generally have a plethora of point correspondences, so the trifocal tensors are estimated accurately. When the dataset focuses on a single scene, our algorithm retrieves locations competitively: it achieves the best location estimation on 4 out of 6 datasets. The translation error bars are not visible for FP11, HZ8, and EN10 due to the accuracy that we achieve. However, our pipeline is incapable of accurately processing CastleP19 and CastleP30. The main reason is that our algorithm relies on a very dense observation graph to ensure a high completion rate. CastleP19 and CastleP30 are datasets where the camera scans portions of the general area sequentially, so that not many triplets have overlapping features; our method is not suitable for this type of dataset. However, it is possible to apply our algorithm in parallel on groups of neighboring frames, so that the completion rate is high in each group, and then merge the results to obtain a larger reconstruction. Rotations for the two-view methods are estimated by rejecting outliers while iteratively applying [10]. We also compare against [43] for location estimation, initialized with a state-of-the-art global rotation estimation method [9]. Our algorithm achieves superior rotation estimation on only 2 of the 6 datasets. See Tables 1 and 2 in the appendix for comprehensive errors.

4.2 Photo Tourism

We conduct experiments on the Photo Tourism datasets, which consist of internet images of real-world scenes; each scene has hundreds to thousands of images. The datasets [11] provide essential matrix estimates, and we estimate the trifocal tensors from the given essential matrices. To limit the computational cost for tensors, we downsample the datasets by choosing cameras with more than a certain percentage of observations in the corresponding block frontal slice, while maintaining a decent number of cameras. Note that this may not be the optimal way of extracting a dense subset in general. The maximum number of cameras we select for each dataset is 225. The largest dataset, Piccadilly, has 2031 cameras initially; we randomly sample 1000 cameras and then run our procedure. For Roman Forum and Piccadilly, the two-view methods further deleted cameras during the robust rotation estimation process or the parallel rigidity test, so we rerun and report the trifocal tensor synchronization algorithm on the further downsampled data. We initialize the hard thresholding parameters for HOSVD-HT by first imputing the trifocal tensor with small random entries and then calculating the singular values for each of the flattenings. We take $l_i$ to be the tertile singular value of the mode-$i$ flattening, and we keep this parameter fixed for the synchronization process. Recall that the $jii$ blocks in the block trifocal tensor correspond to elements of the essential matrix $E_{ij}$; we also include these essential matrix estimates in the block trifocal tensor. The Photo Tourism experiments were run on an HPC center with 32 cores, but the only procedure that can benefit from parallel computing in a single experiment is the scale retrieval. Mean and median translation errors are summarized in Figure 2. Fully comprehensive results can be found in Tables 3 and 4 in Section A.6.

(a) Photo Tourism mean translation errors
(b) Photo Tourism median translation errors
Figure 2: Photo Tourism translation error comparison between our method, NRFM initialized by LUD, LUD, NRFM initialized randomly, and BATA initialized with MPLS. Note that we have not been able to acquire results for Piccadilly for BATA + MPLS.

Our method achieves competitive translation errors on 8 of the 14 datasets tested. Similar to the observation in the EPFL experiments, our algorithm performs well when the viewing graph is dense, in other words, when the estimation percentage is high. We achieve better locations on 6 of the 8 datasets where the estimation percentage exceeds 60%, and on only 2 of the 6 datasets where the estimation percentage falls below 60%. We achieve reasonable rotation estimates on 10 of the 14 datasets, but not as good as LUD; see Table 4 for comprehensive results. Since the block trifocal tensor scales cubically with the number of cameras, our algorithm's runtime is longer than that of most two-view global methods. This could be alleviated by synchronizing dense subsets in parallel and merging the results to construct a larger reconstruction.

Additional remark: Trifocal tensors can be estimated from line correspondences or a mix of point and line correspondences, while fundamental matrices are estimated only from point correspondences. There are many situations where accurate point correspondences are in short supply but there is a plethora of clear and distinct lines; for example, see the datasets in a recent SfM method using lines [50]. We demonstrate the potential of our method to be adapted to process datasets with only lines or very few points. Due to the limited availability of well-annotated line datasets, we provide a small synthetic experiment that simulates a case where only line correspondences are present. We first generate 20 random camera matrices, then generate 25 lines that are projected onto and shared across all images. We add about 0.02 percent noise, in terms of the relative Frobenius norms between the line equation parameters and the noise. We estimate the trifocal tensor of three different views from line correspondences linearly. One remark is that our synchronization method works well only when the signs of the initial unknown scales are mostly uniform. We manually use ground truth trifocal tensors to correct the signs of the scales. This has not been an issue in the previous experiments due to bundle adjustment for EPFL and the overall good estimates in Photo Tourism. In practice, the sign of the scale on a trifocal tensor can be corrected via triangulation of points or reconstruction of lines, correcting the sign using the depths of the reconstructed points or intersecting line segments. We synchronized the trifocal tensors with Algorithm 2 and achieved a mean rotation error of 0.61 degrees, a median rotation error of 0.49 degrees, a mean location error of 0.76, and a median location error of 0.74.

5 Conclusion

In this work, we introduced the block tensor of trifocal tensors characterizing the three-view geometry of a scene. We established an explicit Tucker factorization of the block trifocal tensor and proved it has a low multilinear rank of $(6,4,4)$ under appropriate scaling. We developed a synchronization algorithm based on tensor decomposition that retrieves an appropriate set of scales, and synchronizes rotations and translations simultaneously. On several real data benchmarks we demonstrated state-of-the-art performance in terms of camera location estimation, and saw particular advantages on smaller and denser sets of images. Overall, this work suggests that higher-order interactions in synchronization problems have the potential to improve performance over pairwise-based methods.

There are several limitations to our tensor-based synchronization method. First, our rotation estimates are not as strong as our location estimates. Second, our algorithm's performance is affected by the estimation percentage of trifocal tensors within the block trifocal tensor; one could incorporate more robust completion methods and explore new approaches for processing sparse triplet graphs. Further, our block trifocal tensor scales cubically with the number of cameras and becomes computationally expensive for large datasets. We can develop methods for extracting dense subgraphs, synchronizing in parallel, and then merging results to obtain a larger reconstruction, similarly to the distributed algorithms of [51] and [52]. Moreover, our synchronization method's success depends on accurate trifocal tensor estimates, which motivates further work on robust estimation of multi-view tensors. Algorithm 2 could also be made more robust by adding outlier rejection techniques. Finally, we plan to extend our theory by proving convergence of our algorithm and exploring structures for even higher-order tensors, such as quadrifocal tensors.

Acknowledgement

D.M. and G.L. were supported in part by NSF award DMS 2152766. J.K. was supported in part by NSF awards DMS 2309782 and CISE-IIS 2312746, the DOE award SC0025312, and start-up grants from the College of Natural Science and Oden Institute at the University of Texas at Austin.

We thank Shaohan Li and Feng Yu for helpful discussions on processing EPFL and Photo Tourism. We also thank Hongyi Fan for helpful advice and references on estimating trifocal tensors.

References

  • [1] Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.
  • [2] Noah Snavely, Steven Seitz, and Richard Szeliski. Photo Tourism: Exploring photo collections in 3D. In Proceedings of the ACM Special Interest Group on Computer Graphics and Interactive Techniques Conference, SIGGRAPH 2006, pages 835–846, 2006.
  • [3] Johannes Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, pages 4104–4113, 2016.
  • [4] Bill Triggs, Philip McLauchlan, Richard Hartley, and Andrew Fitzgibbon. Bundle adjustment — A modern synthesis. In Bill Triggs, Andrew Zisserman, and Richard Szeliski, editors, Vision Algorithms: Theory and Practice, pages 298–372, Berlin, Heidelberg, 2000. Springer Berlin Heidelberg.
  • [5] Venu Madhav Govindu. Lie-algebraic averaging for globally consistent motion estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2004, volume 1, pages 1–8, 2004.
  • [6] Richard Hartley, Jochen Trumpf, Yuchao Dai, and Hongdong Li. Rotation averaging. International Journal of Computer Vision, 103:267–305, 2013.
  • [7] Avishek Chatterjee and Venu Madhav Govindu. Robust relative rotation averaging. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):958–972, 2018.
  • [8] Avishek Chatterjee and Venu Madhav Govindu. Efficient and robust large-scale rotation averaging. In Proceedings of the IEEE International Conference on Computer Vision, ICCV 2013, pages 521–528, 2013.
  • [9] Yunpeng Shi and Gilad Lerman. Message passing least squares framework and its application to rotation synchronization. In Proceedings of the International Conference on Machine Learning, ICML 2020, pages 8796–8806, 2020.
  • [10] Mica Arie-Nachimson, Shahar Kovalsky, Ira Kemelmacher-Shlizerman, Amit Singer, and Ronen Basri. Global motion estimation from point matches. In Proceedings of the International Conference on 3D Imaging, Modeling, Processing, Visualization & Transmission, 3DIMPVT 2012, pages 81–88, 2012.
  • [11] Kyle Wilson and Noah Snavely. Robust global translations with 1DSfM. In Proceedings of the European Conference on Computer Vision, ECCV 2014, pages 61–75, 2014.
  • [12] Onur Ozyesil and Amit Singer. Robust camera location estimation by convex programming. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, pages 2674–2683, 2015.
  • [13] Thomas Goldstein, Paul Hand, Choongbum Lee, Vladislav Voroninski, and Stefano Soatto. ShapeFit and ShapeKick for robust, scalable structure from motion. In Proceedings of the European Conference on Computer Vision, ECCV 2016, pages 289–304, 2016.
  • [14] David Rosen, Luca Carlone, Afonso Bandeira, and John Leonard. SE-Sync: A certifiably correct algorithm for synchronization over the special Euclidean group. International Journal of Robotics Research, 38(2-3):95–125, 2019.
  • [15] Federica Arrigoni, Beatrice Rossi, and Andrea Fusiello. Spectral synchronization of multiple views in SE(3). SIAM Journal on Imaging Sciences, 9(4):1963–1990, 2016.
  • [16] Mihai Cucuringu, Yaron Lipman, and Amit Singer. Sensor network localization by eigenvector synchronization over the Euclidean group. ACM Transactions on Sensor Networks, 8(3), 2012.
  • [17] Jesus Briales and Javier Gonzalez-Jimenez. Cartan-Sync: Fast and global SE(d)-synchronization. IEEE Robotics and Automation Letters, 2(4):2127–2134, 2017.
  • [18] Soumyadip Sengupta, Tal Amir, Meirav Galun, Tom Goldstein, David Jacobs, Amit Singer, and Ronen Basri. A new rank constraint on multi-view fundamental matrices, and its application to camera location recovery. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, pages 4798–4806, 2017.
  • [19] Yoni Kasten, Amnon Geifman, Meirav Galun, and Ronen Basri. Algebraic characterization of essential matrices and their averaging in multiview settings. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, pages 5895–5903, 2019.
  • [20] Yoni Kasten, Amnon Geifman, Meirav Galun, and Ronen Basri. GPSfM: Global projective SFM using algebraic constraints on multi-view fundamental matrices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, pages 3264–3272, 2019.
  • [21] Spyridon Leonardos, Roberto Tron, and Kostas Daniilidis. A metric parametrization for trifocal tensors with non-colinear pinholes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, pages 259–267, 2015.
  • [22] Viktor Larsson, Nicolas Zobernig, Kasim Taskin, and Marc Pollefeys. Calibration-free structure-from-motion with calibrated radial trifocal tensors. In Proceedings of the European Conference on Computer Vision, ECCV 2020, pages 382–399, 2020.
  • [23] Pierre Moulon, Pascal Monasse, and Renaud Marlet. Global fusion of relative motions for robust, accurate and scalable structure-from-motion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2013, pages 3248–3255, 2013.
  • [24] Onur Ozyesil, Vladislav Voroninski, Ronen Basri, and Amit Singer. A survey of structure from motion. Acta Numerica, 26:305–364, 2017.
  • [25] Joe Kileel. Minimal problems for the calibrated trifocal variety. SIAM Journal on Applied Algebra and Geometry, 1(1):575–598, 2017.
  • [26] Timothy Duff, Kathlen Kohn, Anton Leykin, and Tomas Pajdla. PLMP-point-line minimal problems in complete multi-view visibility. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, pages 1675–1684, 2019.
  • [27] Ricardo Fabbri, Timothy Duff, Hongyi Fan, Margaret Regan, David da Costa de Pinho, Elias Tsigaridas, Charles Wampler, Jonathan Hauenstein, Peter Giblin, Benjamin Kimia, Anton Leykin, and Tomas Pajdla. Trifocal relative pose from lines at points. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(6):7870–7884, 2023.
  • [28] David Nister and Henrik Stewenius. A minimal solution to the generalised 3-point pose problem. Journal of Mathematical Imaging and Vision, 27(1):67–79, 2007.
  • [29] Ali Elqursh and Ahmed Elgammal. Line-based relative pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2011, pages 3049–3056, 2011.
  • [30] Yubin Kuang and Kalle Astrom. Pose estimation with unknown focal length using points, directions and lines. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2013, pages 529–536, 2013.
  • [31] Zuzana Kukelova, Joe Kileel, Bernd Sturmfels, and Tomas Pajdla. A clever elimination strategy for efficient minimal solvers. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, pages 4912–4921, 2017.
  • [32] Pedro Miraldo, Tiago Dias, and Srikumar Ramalingam. A minimal closed-form solution for multi-perspective pose estimation using points and lines. In Proceedings of the European Conference on Computer Vision, ECCV 2018, pages 474–490, 2018.
  • [33] Joe Kileel, Zuzana Kukelova, Tomas Pajdla, and Bernd Sturmfels. Distortion varieties. Foundations of Computational Mathematics, 18:1043–1071, 2018.
  • [34] Joe Kileel and Kathlén Kohn. Snapshot of algebraic vision. arXiv preprint arXiv:2210.11443, 2022.
  • [35] Tamara Kolda and Brett Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.
  • [36] Tommi Muller, Adriana Duncan, Eric Verbeke, and Joe Kileel. Algebraic constraints and algorithms for common lines in cryo-EM. Biological Imaging, pages 1–30, Published online 2024.
  • [37] Hongyi Fan, Joe Kileel, and Benjamin Kimia. On the instability of relative pose estimation and RANSAC’s role. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2022, pages 8935–8943, 2022.
  • [38] Rahul Mazumder, Trevor Hastie, and Robert Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research, 11:2287–2322, 2010.
  • [39] Nathan Halko, Per-Gunnar Martinsson, and Joel Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.
  • [40] David Tyler. A distribution-free M-estimator of multivariate scatter. Annals of Statistics, pages 234–251, 1987.
  • [41] Feng Yu, Teng Zhang, and Gilad Lerman. A subspace-constrained Tyler’s estimator and its applications to structure from motion. arXiv preprint arXiv:2404.11590, 2024.
  • [42] Christoph Strecha, Wolfgang Von Hansen, Luc Van Gool, Pascal Fua, and Ulrich Thoennessen. On benchmarking camera calibration and multi-view stereo for high resolution imagery. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2008, pages 1–8, 2008.
  • [43] Bingbing Zhuang, Loong-Fah Cheong, and Gim Hee Lee. Baseline desensitizing in translation averaging. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, June 2018.
  • [44] Laura Julia and Pascal Monasse. A critical review of the trifocal tensor estimation. In Proceedings of the Pacific Rim Symposium on Image and Video Technology, PSIVT 2017, Revised Selected Papers 8, pages 337–349. Springer, 2018.
  • [45] Rémi Pautrat, Iago Suárez, Yifan Yu, Marc Pollefeys, and Viktor Larsson. GlueStick: Robust image matching by sticking points and lines together. In Proceedings of the IEEE/CVF International Conference on Computer Vision, ICCV 2023, pages 9706–9716, 2023.
  • [46] Yunpeng Shi, Shaohan Li, Tyler Maunu, and Gilad Lerman. Scalable cluster-consistency statistics for robust multi-object matching. In Proceedings of the International Conference on 3D Vision, 3DV 2021, pages 352–360, 2021.
  • [47] Yunpeng Shi, Shaohan Li, and Gilad Lerman. Robust multi-object matching via iterative reweighting of the graph connection laplacian. Advances in Neural Information Processing Systems, 33:15243–15253, 2020.
  • [48] Shaohan Li, Yunpeng Shi, and Gilad Lerman. Fast, accurate and memory-efficient partial permutation synchronization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2022, pages 15735–15743, 2022.
  • [49] Daniel Barath and Jiří Matas. Graph-Cut RANSAC. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, pages 6733–6741, 2018.
  • [50] Shaohui Liu, Yifan Yu, Rémi Pautrat, Marc Pollefeys, and Viktor Larsson. 3d line mapping revisited. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, pages 21445–21455, 2023.
  • [51] Shaohan Li, Yunpeng Shi, and Gilad Lerman. Efficient detection of long consistent cycles and its application to distributed synchronization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, pages 5260–5269, 2024.
  • [52] Andrea Porfiri Dal Cin, Luca Magri, Federica Arrigoni, Andrea Fusiello, and Giacomo Boracchi. Synchronization of group-labelled multi-graphs. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, pages 6433–6443. IEEE, 2021.
  • [53] Joe Harris. Algebraic Geometry: A First Course, volume 133. Springer Science & Business Media, 1992.
  • [54] Ying Sun, Prabhu Babu, and Daniel Palomar. Regularized Tyler’s scatter estimator: Existence, uniqueness, and algorithms. IEEE Transactions on Signal Processing, 62(19):5143–5156, 2014.

Appendix A Appendix / supplemental material

A.1 Derivation of the trifocal tensor

To provide better intuition for the trifocal tensor, we briefly summarize its derivation from [1] and [21] in the general setup of uncalibrated cameras.

Let $P_i=K_iR_i[I,-t_i]$ be the form of the camera matrices $P_1,P_2,P_3$. Let $L$ be a line in the 3D world scene, and let $l_1,l_2,l_3$ be the corresponding projections in the images $I_1,I_2,I_3$ respectively. Each $l_i$ back-projects to a plane $\pi_i=P_i^Tl_i$, and since the $l_i$ correspond to the same $L$ in the 3D world scene, $\pi=[\pi_1,\pi_2,\pi_3]$ must be rank deficient and its kernel is generically spanned by $L$. Then,

$$\pi'=\begin{bmatrix}K_1^{-T}R_1&0\\ t_1^T&1\end{bmatrix}\pi=\begin{bmatrix}l_1&K_1^{-T}R_{12}K_2^Tl_2&K_1^{-T}R_{13}K_3^Tl_3\\ 0&(t_1-t_2)^TR_2^TK_2^Tl_2&(t_1-t_3)^TR_3^TK_3^Tl_3\end{bmatrix}=[\pi'_1,\pi'_2,\pi'_3]$$

will also be rank-deficient, implying that the columns of $\pi'$ are linearly dependent. This means that there exist $\alpha,\beta$ such that $\pi'_1=\alpha\pi'_2+\beta\pi'_3$. We can choose $\alpha=-(t_1-t_3)^TR_3^TK_3^Tl_3$, $\beta=(t_1-t_2)^TR_2^TK_2^Tl_2$, so that

$$(l_1)_i=l_2^T\big[K_2R_2(t_1-t_2)e_i^TK_1^{-T}R_{13}K_3^T\big]l_3-l_2^T\big[K_2R_{12}^TK_1^{-1}e_i(t_1-t_3)^TR_3^TK_3^T\big]l_3.$$

Then, the canonical trifocal tensor centered at camera 1 is defined as

$$T_i=K_2R_2(t_1-t_2)e_i^TK_1^{-T}R_{13}K_3^T-K_2R_{12}^TK_1^{-1}e_i(t_1-t_3)^TR_3^TK_3^T \qquad (8)$$

where $e_i\in\mathbb{R}^3$ is the $i$-th standard basis vector. The trifocal tensor is the $3\times3\times3$ tensor $\{T_1,T_2,T_3\}$, with the $T_i$'s stacked along the first mode. The line incidence relation is then $(l_1)_i=l_2^TT_il_3$. Other combinations of point and line incidence relations are also encoded by the trifocal tensor; see [1] for details. The construction for calibrated cameras is the same, just with $P_i$ in calibrated form.
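As a quick sanity check of this derivation, the sketch below (ours, not the authors' code; numpy assumed) instantiates (8) with $K_1=K_2=K_3=I$ and the convention $R_{1j}=R_1R_j^T$ used above, projects a random 3D line into the three views, and verifies that the transferred line matches $l_1$ up to scale:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_camera():
    # Calibrated camera P = R [I | -t]; we take K = I to keep Eq. (8) uncluttered.
    Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    R = Q * np.sign(np.linalg.det(Q))  # proper rotation
    t = rng.standard_normal(3)
    return R @ np.hstack([np.eye(3), -t[:, None]]), R, t

P1, R1, t1 = random_camera()
P2, R2, t2 = random_camera()
P3, R3, t3 = random_camera()

# Trifocal tensor centered at camera 1, Eq. (8), assuming R_1j = R_1 R_j^T.
R12, R13 = R1 @ R2.T, R1 @ R3.T
T = np.stack([np.outer(R2 @ (t1 - t2), R13[i]) - np.outer(R12[i], R3 @ (t1 - t3))
              for i in range(3)])

# Project a 3D line (through points X, Y) into the three images.
X = np.append(rng.standard_normal(3), 1.0)
Y = np.append(rng.standard_normal(3), 1.0)
l1, l2, l3 = (np.cross(P @ X, P @ Y) for P in (P1, P2, P3))

transferred = np.array([l2 @ T[i] @ l3 for i in range(3)])
print(np.linalg.norm(np.cross(transferred, l1)))  # ~0: matches l_1 up to scale
```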

A.2 Proof details for Theorem 1

We include a detailed calculation for the Tucker factorization of the block trifocal tensor. Recall that each individual trifocal tensor corresponding to the cameras $a,b,c$ can be calculated as

$$\begin{aligned}
T_{iqr}&=(-1)^{i+1}\det\begin{bmatrix}\sim a^i\\ b^q\\ c^r\end{bmatrix}=(-1)^{i+1}\det\begin{bmatrix}a^m\\ a^n\\ b^q\\ c^r\end{bmatrix}\\
&=(-1)^{i+1}\Bigg(\det\begin{bmatrix}a_{m3}&a_{m4}\\ a_{n3}&a_{n4}\end{bmatrix}(b_{q1}c_{r2}-b_{q2}c_{r1})+\det\begin{bmatrix}a_{m2}&a_{m4}\\ a_{n2}&a_{n4}\end{bmatrix}(-b_{q1}c_{r3}+b_{q3}c_{r1})\\
&\qquad+\det\begin{bmatrix}a_{m1}&a_{m4}\\ a_{n1}&a_{n4}\end{bmatrix}(b_{q2}c_{r3}-b_{q3}c_{r2})+\det\begin{bmatrix}a_{m2}&a_{m3}\\ a_{n2}&a_{n3}\end{bmatrix}(b_{q1}c_{r4}-b_{q4}c_{r1})\\
&\qquad+\det\begin{bmatrix}a_{m1}&a_{m3}\\ a_{n1}&a_{n3}\end{bmatrix}(-b_{q2}c_{r4}+b_{q4}c_{r2})+\det\begin{bmatrix}a_{m1}&a_{m2}\\ a_{n1}&a_{n2}\end{bmatrix}(b_{q3}c_{r4}-b_{q4}c_{r3})\Bigg)\\
&=\mathcal{P}(a)_{i6}(b_{q1}c_{r2}-b_{q2}c_{r1})+\mathcal{P}(a)_{i5}(-b_{q1}c_{r3}+b_{q3}c_{r1})+\mathcal{P}(a)_{i3}(b_{q2}c_{r3}-b_{q3}c_{r2})\\
&\qquad+\mathcal{P}(a)_{i4}(b_{q1}c_{r4}-b_{q4}c_{r1})+\mathcal{P}(a)_{i2}(-b_{q2}c_{r4}+b_{q4}c_{r2})+\mathcal{P}(a)_{i1}(b_{q3}c_{r4}-b_{q4}c_{r3})\\
&=\sum_{k=1}^{6}\mathcal{P}(a)_{ik}\sum_{w=1}^{4}b_{qw}\sum_{j=1}^{4}c_{rj}\,\mathcal{G}_{kwj}.
\end{aligned}$$

The last equality can be checked directly since $\mathcal{G}$ is sparse; note that the sign $(-1)^{i+1}$ is absorbed into $\mathcal{P}(a)$. For example, when $k=1$, $\mathcal{P}(a)_{i1}$ is, up to this sign, the determinant of the submatrix obtained by dropping the $i$-th row and keeping columns 1 and 2, namely $\det\begin{bmatrix}a_{m1}&a_{m2}\\ a_{n1}&a_{n2}\end{bmatrix}$. The only nonzero elements in the first horizontal slice of $\mathcal{G}$ are $\mathcal{G}(1,4,3)=-1$ and $\mathcal{G}(1,3,4)=1$. Then, the nonzero terms in the sum when $k=1$ are exactly $\mathcal{P}(a)_{i1}\sum_{w=1}^{4}b_{qw}\sum_{j=1}^{4}c_{rj}\mathcal{G}_{1wj}=\mathcal{P}(a)_{i1}(b_{q3}c_{r4}-b_{q4}c_{r3})$.

Then, since $\mathcal{P}$ stacks the $\mathcal{P}(P_i)$ and $\mathcal{C}$ stacks the camera matrices as in Theorem 1, each $ijk$ block of $T^n$ is computed exactly from the corresponding $i$, $j$, $k$ blocks of $\mathcal{P},\mathcal{C},\mathcal{C}$ respectively, by the calculation above.
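The calculation can be verified numerically. In the sketch below (ours; numpy assumed) we take $\mathcal{P}(a)_{ik}$ to be the $2\times2$ minor with the sign $(-1)^{i+1}$ absorbed, as discussed above, and build the sparse core $\mathcal{G}$ from the six coefficient patterns in the expansion:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
a, b, c = rng.standard_normal((3, 3, 4))       # three generic 3x4 camera matrices

pairs = list(combinations(range(4), 2))        # column pairs, ordered e1^e2, ..., e3^e4
signs = [+1, -1, +1, +1, -1, +1]               # Laplace-expansion signs, read off above
comps = [tuple(sorted(set(range(4)) - set(p))) for p in pairs]

def line_projection(P):
    # P(a)_{ik}: 2x2 minor with row i dropped, column pair k kept, sign (-1)^{i+1}.
    M = np.zeros((3, 6))
    for i in range(3):
        rows = [r for r in range(3) if r != i]
        for k, cols in enumerate(pairs):
            M[i, k] = (-1) ** i * np.linalg.det(P[np.ix_(rows, list(cols))])
    return M

# Sparse 6x4x4 core tensor: e.g. G[0, 2, 3] = 1, G[0, 3, 2] = -1 (cf. G(1,3,4), G(1,4,3)).
G = np.zeros((6, 4, 4))
for k, (u, v) in enumerate(comps):
    G[k, u, v], G[k, v, u] = signs[k], -signs[k]

# Direct determinant formula for T_iqr versus the factored form.
T = np.zeros((3, 3, 3))
for i in range(3):
    rows = [r for r in range(3) if r != i]
    for q in range(3):
        for r in range(3):
            T[i, q, r] = (-1) ** i * np.linalg.det(
                np.vstack([a[rows], b[q][None], c[r][None]]))

T_fac = np.einsum('ik,qw,rj,kwj->iqr', line_projection(a), b, c, G)
print(np.abs(T - T_fac).max())                 # ~0 (machine precision)
```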

A.3 Proof details for Proposition 1

Proof for (i).

We have

$$(T^n_{iii})_{wqr}=(-1)^{w+1}\det\begin{bmatrix}\sim P_i^w\\ P_i^q\\ P_i^r\end{bmatrix}=0,$$

since $P_i$ is a $3\times4$ matrix and the submatrix above always has two identical rows.

Proof for (ii): Consider the $wqr$ element of the $jii$ block of the block trifocal tensor, $T^n_{jii}$. It can be written as

$$(T^n_{jii})_{wqr}=(-1)^{w+1}\det\begin{bmatrix}\sim P_j^w\\ P_i^q\\ P_i^r\end{bmatrix}.$$

Thus, when $q=r$, clearly $(T^n_{jii})_{wqr}=0$, as we again have identical rows. When $q\neq r$, we first observe that $(T^n_{jii})_{wqr}=-(T^n_{jii})_{wrq}$, since the two determinants differ by a swap of two rows. Second,

$$(T^n_{jii})_{wqr}=(-1)^{w+1}\det\begin{bmatrix}\sim P_j^w\\ P_i^q\\ P_i^r\end{bmatrix}=(-1)^{w+1}\det\begin{bmatrix}\sim P_j^w\\ \sim P_i^m\end{bmatrix}$$

where $m\in\{1,2,3\}\setminus\{q,r\}$. This is exactly the bilinear relation from [1] defining the $(F_{ji})_{mw}$ element of the fundamental matrix, up to a possible negative sign.

Proof for (iii): By symmetry, it suffices to show this for the $T^n_{iij}$ blocks. The elements of the $T^n_{iij}$ blocks can be calculated as

$$(T^n_{iij})_{wqr}=(-1)^{w+1}\det\begin{bmatrix}\sim P_i^w\\ P_i^q\\ P_j^r\end{bmatrix}.$$

Elements are nonzero only when $w=q$, and they correspond to determinants of matrices with three rows from one $P_i$ and one row from $P_j$. By [1], these are exactly the elements of the epipoles. When $w=1$, the order of the rows of the determinant corresponding to camera $i$ is $(2,3,1)$; when $w=2$, the order is $(1,3,2)$ and there is a negative sign in front of the determinant; and when $w=3$, the order is $(1,2,3)$. Since the first and last cases are even permutations of the rows of $P_i$, and the second case is corrected by a negative sign, $(T^n_{iij})_{ww:}$ is exactly the epipole.

Proof for (iv): On a horizontal slice, the camera along the 1st mode is fixed, and blocks symmetric across the diagonal are computed from the same pair of cameras with the roles of the 2nd and 3rd modes swapped. This swap simply exchanges two rows in (1), which flips the sign of each element symmetric across the diagonal, implying skew symmetry.

Proof for (v): Now assume that we have a block trifocal tensor whose corresponding cameras are all calibrated. Let $\mathcal{P}$ be the line projection matrix, let $\mathcal{C}=[P_1,P_2,\dots,P_n]^T$ be the stacked camera matrix, and let $\mathcal{G}$ be the core tensor. The flattening in the 1st mode can be written as $T^n_{(1)}=\mathcal{P}\mathcal{G}_{(1)}(\mathcal{C}\otimes\mathcal{C})^T$, where $T^n_{(1)}$ is a $3n\times9n^2$ matrix. For the proof, we calculate the eigenvalues of $T^n_{(1)}(T^n_{(1)})^T$:

$$\begin{aligned}
T^n_{(1)}(T^n_{(1)})^T&=\mathcal{P}\mathcal{G}_{(1)}(\mathcal{C}\otimes\mathcal{C})^T(\mathcal{C}\otimes\mathcal{C})\mathcal{G}_{(1)}^T\mathcal{P}^T\\
&=\mathcal{P}\mathcal{G}_{(1)}(\mathcal{C}^T\otimes\mathcal{C}^T)(\mathcal{C}\otimes\mathcal{C})\mathcal{G}_{(1)}^T\mathcal{P}^T\\
&=\mathcal{P}\mathcal{G}_{(1)}(\mathcal{C}^T\mathcal{C}\otimes\mathcal{C}^T\mathcal{C})\mathcal{G}_{(1)}^T\mathcal{P}^T.
\end{aligned}$$

The second and third lines use two Kronecker product properties: $(A\otimes B)^T=A^T\otimes B^T$ and $(A\otimes B)(C\otimes D)=AC\otimes BD$, whenever $AC$ and $BD$ are defined.
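Both identities are standard and can be confirmed in two lines (our check; numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(5)
A, B = rng.standard_normal((4, 3)), rng.standard_normal((2, 5))
C, D = rng.standard_normal((3, 2)), rng.standard_normal((5, 4))
print(np.allclose(np.kron(A, B).T, np.kron(A.T, B.T)))                    # transpose rule
print(np.allclose(np.kron(A, B) @ np.kron(C, D), np.kron(A @ C, B @ D)))  # mixed-product rule
```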

We first calculate $\mathcal{C}^T\mathcal{C}\otimes\mathcal{C}^T\mathcal{C}$.

We assume that the cameras are centered at the origin, i.e., $\sum_{i=1}^n t_i=0$. Then we have

$$\mathcal{C}^T\mathcal{C}=\begin{bmatrix}nI_{3\times3}&-\sum_{i=1}^nt_i\\ -\sum_{i=1}^nt_i^T&\sum_{i=1}^n\|t_i\|^2\end{bmatrix}=\begin{bmatrix}nI_{3\times3}&0_{3\times1}\\ 0_{1\times3}&\sum_{i=1}^n\|t_i\|^2\end{bmatrix}, \qquad (9)$$

so that

$$\mathcal{C}^T\mathcal{C}\otimes\mathcal{C}^T\mathcal{C}=\begin{bmatrix}nI_{3\times3}\otimes\mathcal{C}^T\mathcal{C}&0_{3\times1}\otimes\mathcal{C}^T\mathcal{C}\\ 0_{1\times3}\otimes\mathcal{C}^T\mathcal{C}&\sum_{i=1}^n\|t_i\|^2\,\mathcal{C}^T\mathcal{C}\end{bmatrix}. \qquad (10)$$

We have an explicit form for $\mathcal{G}_{(1)}$:

$$\mathcal{G}_{(1)}=\begin{bmatrix}0&0&0&0&0&0&0&0&0&0&0&-1&0&0&1&0\\ 0&0&0&0&0&0&0&1&0&0&0&0&0&-1&0&0\\ 0&0&0&0&0&0&-1&0&0&1&0&0&0&0&0&0\\ 0&0&0&-1&0&0&0&0&0&0&0&0&1&0&0&0\\ 0&0&1&0&0&0&0&0&-1&0&0&0&0&0&0&0\\ 0&-1&0&0&1&0&0&0&0&0&0&0&0&0&0&0\end{bmatrix}. \qquad (11)$$

Let $X=\mathcal{C}^T\mathcal{C}$ and let $X_{ij}$ denote the $ij$ entry of $\mathcal{C}^T\mathcal{C}$. Let $a=\sum_{i=1}^n\|t_i\|^2$.

We first show that $\mathcal{G}_{(1)}(\mathcal{C}^T\mathcal{C}\otimes\mathcal{C}^T\mathcal{C})\mathcal{G}_{(1)}^T$ is diagonal by direct computation:

$$\begin{aligned}
&\mathcal{G}_{(1)}(\mathcal{C}^T\mathcal{C}\otimes\mathcal{C}^T\mathcal{C})\mathcal{G}_{(1)}^T\\
&=\begin{bmatrix}nX_{44}+aX_{33}&aX_{22}&-nX_{42}&aX_{31}&nX_{41}&0\\ -aX_{23}&nX_{44}+aX_{22}&nX_{43}&aX_{21}&0&nX_{41}\\ nX_{24}&-nX_{34}&nX_{33}+nX_{22}&0&-nX_{21}&-nX_{31}\\ aX_{13}&-aX_{12}&0&nX_{44}+aX_{11}&-nX_{43}&nX_{42}\\ nX_{14}&0&-nX_{12}&nX_{34}&nX_{33}+nX_{11}&-nX_{32}\\ 0&-nX_{24}&-nX_{13}&nX_{24}&-nX_{23}&nX_{22}+nX_{11}\end{bmatrix}\\
&=\begin{bmatrix}na+an&0&0&0&0&0\\ 0&na+an&0&0&0&0\\ 0&0&n^2+n^2&0&0&0\\ 0&0&0&na+an&0&0\\ 0&0&0&0&n^2+n^2&0\\ 0&0&0&0&0&n^2+n^2\end{bmatrix}
=\begin{bmatrix}2na&0&0&0&0&0\\ 0&2na&0&0&0&0\\ 0&0&2n^2&0&0&0\\ 0&0&0&2na&0&0\\ 0&0&0&0&2n^2&0\\ 0&0&0&0&0&2n^2\end{bmatrix}.
\end{aligned}$$
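This diagonal structure is easy to confirm numerically. A short check (ours; numpy assumed) with $X=\mathrm{diag}(n,n,n,a)$ from (9) and $\mathcal{G}_{(1)}$ copied from (11):

```python
import numpy as np

n, a = 7, 3.5                      # n cameras; a = sum_i ||t_i||^2 (any positive value)
X = np.diag([n, n, n, a])          # C^T C for centered calibrated cameras, Eq. (9)

# G_(1) copied from Eq. (11).
G1 = np.array([
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, -1, 0, 0],
    [0, 0, 0, 0, 0, 0, -1, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0],
    [0, -1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
], dtype=float)

M = G1 @ np.kron(X, X) @ G1.T
print(np.allclose(M, np.diag([2*n*a, 2*n*a, 2*n**2, 2*n*a, 2*n**2, 2*n**2])))  # True
```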

We then calculate the spectral decomposition of $T^n_{(1)}$. With a slight abuse of notation, let $P_i^k$ denote the $k$-th column of $P_i$. The $3n\times6$, rank-$6$ stacked line projection matrix has columns ordered according to

$$\begin{bmatrix}e_1\wedge e_2&e_1\wedge e_3&e_1\wedge e_4&e_2\wedge e_3&e_2\wedge e_4&e_3\wedge e_4\end{bmatrix},$$

and since the second row of $\mathcal{P}$ for each camera is $P_i^3\wedge P_i^1$, it holds that

$$\mathcal{P}=\begin{bmatrix}\vdots&\vdots&\vdots&\vdots&\vdots&\vdots\\ P_i^1\times P_i^2&P_i^1\times P_i^3&P_i^1\times P_i^4&P_i^2\times P_i^3&P_i^2\times P_i^4&P_i^3\times P_i^4\\ \vdots&\vdots&\vdots&\vdots&\vdots&\vdots\end{bmatrix}.$$

Equivalently, $\mathcal{P}$ stacks the cross products between the columns of the camera matrices. Let $\mathcal{P}=USV^T$ be the thin singular value decomposition of $\mathcal{P}$, so that $U$ is a $3n\times6$ orthonormal matrix, $S$ is a $6\times6$ diagonal matrix whose diagonal entries are all nonzero, and $V$ is a $6\times6$ orthonormal matrix.

Then,

$$\begin{aligned}
T^n_{(1)}(T^n_{(1)})^T&=\mathcal{P}\mathcal{G}_{(1)}(\mathcal{C}^T\mathcal{C}\otimes\mathcal{C}^T\mathcal{C})\mathcal{G}_{(1)}^T\mathcal{P}^T\\
&=U\big(SV^T\mathcal{G}_{(1)}(\mathcal{C}^T\mathcal{C}\otimes\mathcal{C}^T\mathcal{C})\mathcal{G}_{(1)}^TVS\big)U^T.
\end{aligned}$$

Since $V$ is orthonormal, $SV^T\mathcal{G}_{(1)}(\mathcal{C}^T\mathcal{C}\otimes\mathcal{C}^T\mathcal{C})\mathcal{G}_{(1)}^TVS$ is still a diagonal matrix. It remains to establish that three of its diagonal entries are the same.

For one camera, $\mathcal{P}^T\mathcal{P}$ equals

$$\begin{bmatrix}
1&0&P_i^2\cdot P_i^4&0&-P_i^1\cdot P_i^4&0\\
&1&P_i^3\cdot P_i^4&0&0&-P_i^1\cdot P_i^4\\
&&\|P_i^4\|^2-(P_i^1\cdot P_i^4)^2&0&-(P_i^1\cdot P_i^4)(P_i^4\cdot P_i^2)&-(P_i^1\cdot P_i^4)(P_i^4\cdot P_i^3)\\
&&&1&P_i^3\cdot P_i^4&-P_i^2\cdot P_i^4\\
&&&&\|P_i^4\|^2-(P_i^2\cdot P_i^4)^2&-(P_i^2\cdot P_i^4)(P_i^4\cdot P_i^3)\\
&&&&&\|P_i^4\|^2-(P_i^3\cdot P_i^4)^2
\end{bmatrix},$$

where the matrix is symmetric and we omit the redundant entries below the diagonal. For $n$ cameras,

$$\mathcal{P}^T\mathcal{P}=\begin{bmatrix}
n&0&\sum_iP_i^2\cdot P_i^4&0&-\sum_iP_i^1\cdot P_i^4&0\\
&n&\sum_iP_i^3\cdot P_i^4&0&0&-\sum_iP_i^1\cdot P_i^4\\
&&\sum_i\big(\|P_i^4\|^2-(P_i^1\cdot P_i^4)^2\big)&0&-\sum_i(P_i^1\cdot P_i^4)(P_i^4\cdot P_i^2)&-\sum_i(P_i^1\cdot P_i^4)(P_i^4\cdot P_i^3)\\
&&&n&\sum_iP_i^3\cdot P_i^4&-\sum_iP_i^2\cdot P_i^4\\
&&&&\sum_i\big(\|P_i^4\|^2-(P_i^2\cdot P_i^4)^2\big)&-\sum_i(P_i^2\cdot P_i^4)(P_i^4\cdot P_i^3)\\
&&&&&\sum_i\big(\|P_i^4\|^2-(P_i^3\cdot P_i^4)^2\big)
\end{bmatrix},$$

where $P_i^a\cdot P_i^b$ denotes the dot product between the $a$-th and $b$-th columns.

The SVD of $\mathcal{P}^T\mathcal{P}$ is $\mathcal{P}^T\mathcal{P}=HDH^T$, where $H$ is a $6\times6$ orthonormal matrix and $D$ is a $6\times6$ diagonal matrix. Since $\mathcal{P}^T\mathcal{P}$ contains an $nI_{3\times3}$ submatrix, we deduce that $n$ appears as an eigenvalue of $\mathcal{P}^T\mathcal{P}$ three times, using the determinant identity for block matrices. We check that this indeed holds by a computer calculation, generating random instances of the $P_i$'s and calculating the eigenvalues of $\mathcal{P}^T\mathcal{P}$, as sketched below.
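A minimal version of this computer check (our sketch, not the authors' code; numpy assumed) builds $\mathcal{P}$ from random calibrated cameras with centered locations and prints the spectrum of $\mathcal{P}^T\mathcal{P}$:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
n = 6

t = rng.standard_normal((n, 3))
t -= t.mean(axis=0)                      # centered camera locations

def calibrated_camera(ti):
    Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    R = Q * np.sign(np.linalg.det(Q))    # proper rotation
    return R @ np.hstack([np.eye(3), -ti[:, None]])

# Stack the 3x6 line projection blocks: columns are cross products of camera columns.
blocks = []
for ti in t:
    P = calibrated_camera(ti)
    blocks.append(np.column_stack(
        [np.cross(P[:, i], P[:, j]) for i, j in combinations(range(4), 2)]))
Pline = np.vstack(blocks)                # the 3n x 6 matrix P

print(np.sort(np.linalg.eigvalsh(Pline.T @ Pline)))
# Per the claim above, the eigenvalue n (= 6 here) appears three times.
```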

As a result, in the thin SVD of $\mathcal{P}$ we have $\mathcal{P}=USV^T$ with $S=\sqrt{D}$ and $V=H$. Then in

$$\begin{aligned}
T^n_{(1)}(T^n_{(1)})^T&=\mathcal{P}\mathcal{G}_{(1)}(\mathcal{C}^T\mathcal{C}\otimes\mathcal{C}^T\mathcal{C})\mathcal{G}_{(1)}^T\mathcal{P}^T\\
&=U\big(SV^T\mathcal{G}_{(1)}(\mathcal{C}^T\mathcal{C}\otimes\mathcal{C}^T\mathcal{C})\mathcal{G}_{(1)}^TVS\big)U^T,
\end{aligned}$$

we see that $SV^T\mathcal{G}_{(1)}(\mathcal{C}^T\mathcal{C}\otimes\mathcal{C}^T\mathcal{C})\mathcal{G}_{(1)}^TVS$ is a diagonal matrix in which three of the entries are the same. By the uniqueness of eigenvalues, this yields a spectral decomposition of $T^n_{(1)}(T^n_{(1)})^T$, so that three of the singular values of $T^n_{(1)}$ are equal. ∎
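Part (v) can likewise be confirmed end to end: the sketch below (ours; numpy assumed) assembles the block trifocal tensor of random calibrated, centered cameras from the determinant formula and inspects the singular values of its mode-1 flattening.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 4

t = rng.standard_normal((n, 3))
t -= t.mean(axis=0)                                    # centered locations

def calibrated_camera(ti):
    Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    R = Q * np.sign(np.linalg.det(Q))
    return R @ np.hstack([np.eye(3), -ti[:, None]])

P = [calibrated_camera(ti) for ti in t]

def trifocal(Pa, Pb, Pc):
    # (T_abc)_{wqr} = (-1)^{w+1} det([Pa minus row w; Pb row q; Pc row r]).
    T = np.zeros((3, 3, 3))
    for w in range(3):
        rest = np.delete(Pa, w, axis=0)
        for q in range(3):
            for r in range(3):
                T[w, q, r] = (-1) ** w * np.linalg.det(
                    np.vstack([rest, Pb[q][None], Pc[r][None]]))
    return T

Tn = np.zeros((3 * n, 3 * n, 3 * n))
for a in range(n):
    for b in range(n):
        for c in range(n):
            Tn[3*a:3*a+3, 3*b:3*b+3, 3*c:3*c+3] = trifocal(P[a], P[b], P[c])

s = np.linalg.svd(Tn.reshape(3 * n, -1), compute_uv=False)
print(np.round(s[:8], 8))   # six nonzero singular values, three of them equal
```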

A.4 Proof details for Theorem 2

Proof.

Note that blockwise multiplication by a rank-$1$ tensor with nonzero entries preserves multilinear rank, since it is a Tucker product by invertible diagonal matrices. Therefore, without loss of generality we may assume $\lambda_{i11}=\lambda_{1j1}=\lambda_{11k}=1$ for all $i,j,k\in\{2,3,\dots,n\}$. Below we will prove that $\lambda_{ijk}=c$ if exactly one of $i,j,k$ equals $1$, and $\lambda_{ijk}=c^2$ if none of $i,j,k$ equals $1$ and the indices are not all the same, for some constant $c\in\mathbb{R}^*$. This will immediately imply the theorem, because taking $\alpha=\beta=(1\ c\ c\ \dots\ c)$ and $\gamma=(\tfrac1c\ 1\ 1\ \dots\ 1)$ achieves $\lambda_{ijk}=\alpha_i\beta_j\gamma_k$ whenever $i,j,k$ are not all the same.

We consider the matrix flattenings $T^n_{(2)}$ and $(\lambda\odot_bT^n)_{(2)}$ in $\mathbb{R}^{3n\times9n^2}$ of the block trifocal tensor and its scaled counterpart, with rows corresponding to the second mode of the tensors. By Theorem 1 and the assumptions, the flattenings have matrix rank $4$, thus all of their $5\times5$ minors vanish. The argument consists of considering several carefully chosen $5\times5$ submatrices of $(\lambda\odot_bT^n)_{(2)}$ to prove the existence of a constant $c$ as above. Index the rows and columns of the flattenings by $(j,r)$ and $(iq,ks)$ respectively, for $i,j,k\in[n]$ and $q,r,s\in[3]$, so that, e.g., $((\lambda\odot_bT^n)_{(2)})_{(j,r),(iq,ks)}=\lambda_{ijk}(T^n_{ijk})_{qrs}$.

Step 1: The first submatrix of $(\lambda\odot_bT^n)_{(2)}$ we consider has column labels $(i1,11)$, $(i2,11)$, $(i3,11)$, $(i1,12)$, $(11,i1)$ and row labels $(1,1)$, $(1,2)$, $(1,3)$, $(i,1)$, $(i,2)$, where $i\in\{2,\dots,n\}$. Explicitly, it is

$$\begin{bmatrix}(T^n_{i11})_{111}&(T^n_{i11})_{211}&(T^n_{i11})_{311}&(T^n_{i11})_{112}&(T^n_{11i})_{111}\\ (T^n_{i11})_{121}&(T^n_{i11})_{221}&(T^n_{i11})_{321}&(T^n_{i11})_{122}&(T^n_{11i})_{121}\\ (T^n_{i11})_{131}&(T^n_{i11})_{231}&(T^n_{i11})_{331}&(T^n_{i11})_{132}&(T^n_{11i})_{131}\\ \lambda_{ii1}(T^n_{ii1})_{111}&\lambda_{ii1}(T^n_{ii1})_{211}&\lambda_{ii1}(T^n_{ii1})_{311}&\lambda_{ii1}(T^n_{ii1})_{112}&\lambda_{1ii}(T^n_{1ii})_{111}\\ \lambda_{ii1}(T^n_{ii1})_{121}&\lambda_{ii1}(T^n_{ii1})_{221}&\lambda_{ii1}(T^n_{ii1})_{321}&\lambda_{ii1}(T^n_{ii1})_{122}&\lambda_{1ii}(T^n_{1ii})_{121}\end{bmatrix},$$

which we abbreviate as

$$\begin{bmatrix}\ast&\ast&\ast&\ast&\ast\\ \ast&\ast&\ast&\ast&\ast\\ \ast&\ast&\ast&\ast&\ast\\ \lambda_{ii1}\ast&\lambda_{ii1}\ast&\lambda_{ii1}\ast&\lambda_{ii1}\ast&\lambda_{1ii}\ast\\ \lambda_{ii1}\ast&\lambda_{ii1}\ast&\lambda_{ii1}\ast&\lambda_{ii1}\ast&\lambda_{1ii}\ast\end{bmatrix}, \qquad (12)$$

with asterisks denoting the corresponding entries of $T^n_{(2)}$. As a function of $\lambda_{ii1},\lambda_{1ii}$, the determinant of (12) is a polynomial of degree $\leq2$, which must be divisible by $\lambda_{ii1}$ and by $\lambda_{ii1}-\lambda_{1ii}$ (because if $\lambda_{ii1}=0$ then the bottom two rows of (12) are clearly linearly dependent, and if $\lambda_{ii1}-\lambda_{1ii}=0$ we have a submatrix of $T^n_{(2)}$ with the bottom two rows scaled uniformly). So the determinant of (12) is a scalar multiple of $\lambda_{ii1}(\lambda_{ii1}-\lambda_{1ii})$. Note that the multiple is a polynomial function of the cameras $P_1$ and $P_i$. We claim that generically the multiple is nonzero; by polynomiality, it suffices to exhibit a single instance of (calibrated) cameras where the determinant of (12) does not vanish identically for all $\lambda_{ii1},\lambda_{1ii}$ (e.g., see [53]). We check that this indeed holds by a computer calculation, generating numerical instances of $P_1$ and $P_i$ randomly; a sketch of this check follows below. Thus the vanishing of the minor in (12) implies $\lambda_{ii1}(\lambda_{ii1}-\lambda_{1ii})=0$, whence $\lambda_{ii1}=\lambda_{1ii}$ since $\lambda_{ii1}\neq0$. An analogous calculation with $(\lambda\odot_bT^n)_{(3)}$ gives $\lambda_{i1i}=\lambda_{1ii}$.
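A minimal version of that computer check (our sketch, not the authors' code; numpy assumed) builds the blocks from the determinant formula of A.3 and evaluates the minor (12) at a few $\lambda$ values:

```python
import numpy as np

rng = np.random.default_rng(4)
P1, Pi = rng.standard_normal((2, 3, 4))    # generic cameras P_1, P_i

def trifocal(Pa, Pb, Pc):
    # (T_abc)_{wqr} = (-1)^{w+1} det([Pa minus row w; Pb row q; Pc row r]).
    T = np.zeros((3, 3, 3))
    for w in range(3):
        rest = np.delete(Pa, w, axis=0)
        for q in range(3):
            for r in range(3):
                T[w, q, r] = (-1) ** w * np.linalg.det(
                    np.vstack([rest, Pb[q][None], Pc[r][None]]))
    return T

Ti11, T11i = trifocal(Pi, P1, P1), trifocal(P1, P1, Pi)
Tii1, T1ii = trifocal(Pi, Pi, P1), trifocal(P1, Pi, Pi)

def step1_minor(lam_ii1, lam_1ii):
    # The 5x5 submatrix (12): rows (1,1),(1,2),(1,3),(i,1),(i,2);
    # columns (i1,11),(i2,11),(i3,11),(i1,12),(11,i1).
    qs = [(0, 0), (1, 0), (2, 0), (0, 1)]           # (q,s) of the first four columns
    M = np.zeros((5, 5))
    for r in range(3):                              # rows (1,r): unit lambda-scales
        M[r, :4] = [Ti11[q, r, s] for q, s in qs]
        M[r, 4] = T11i[0, r, 0]
    for r in range(2):                              # rows (i,r): scaled blocks
        M[3 + r, :4] = [lam_ii1 * Tii1[q, r, s] for q, s in qs]
        M[3 + r, 4] = lam_1ii * T1ii[0, r, 0]
    return np.linalg.det(M)

print(step1_minor(2.0, 2.0))   # ~0: divisible by (lam_ii1 - lam_1ii)
print(step1_minor(0.0, 5.0))   # ~0: divisible by lam_ii1
print(step1_minor(2.0, 3.0))   # generically nonzero
```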

Step 2: Next consider the submatrix of $(\lambda\odot_bT^n)_{(2)}$ with column labels $(j1,11)$, $(j2,11)$, $(j3,11)$, $(j1,12)$, $(11,j1)$ and row labels $(1,1)$, $(1,2)$, $(1,3)$, $(i,1)$, $(i,2)$, where $i,j\in\{2,\dots,n\}$ are distinct. It looks like

$$\begin{bmatrix}\ast&\ast&\ast&\ast&\ast\\ \ast&\ast&\ast&\ast&\ast\\ \ast&\ast&\ast&\ast&\ast\\ \lambda_{ji1}\ast&\lambda_{ji1}\ast&\lambda_{ji1}\ast&\lambda_{ji1}\ast&\lambda_{1ij}\ast\\ \lambda_{ji1}\ast&\lambda_{ji1}\ast&\lambda_{ji1}\ast&\lambda_{ji1}\ast&\lambda_{1ij}\ast\end{bmatrix}, \qquad (13)$$

with asterisks denoting entries of $T^n_{(2)}$. Similarly to the previous case, the determinant of (13) must be a scalar multiple of $\lambda_{ji1}(\lambda_{ji1}-\lambda_{1ij})$, where the scale depends polynomially on $P_1,P_i,P_j$. By a computer computation, we find that the scale is nonzero for random instances of cameras (alternatively, note the polynomial system in step 1 is a special case of the present one). Thus the scale is generically nonzero, hence $\lambda_{ji1}=\lambda_{1ij}$. An analogous calculation with $(\lambda\odot_bT^n)_{(3)}$ gives $\lambda_{i1j}=\lambda_{1ij}$.

Step 3: Consider the submatrix of $(\lambda\odot_bT^n)_{(2)}$ with column labels $(j1,11)$, $(j2,11)$, $(j3,11)$, $(j1,12)$, $(11,k1)$ and row labels $(1,1)$, $(1,2)$, $(1,3)$, $(i,1)$, $(i,2)$, for $i,j,k\in\{2,\dots,n\}$ distinct. It looks like

$$\begin{bmatrix}\ast&\ast&\ast&\ast&\ast\\ \ast&\ast&\ast&\ast&\ast\\ \ast&\ast&\ast&\ast&\ast\\ \lambda_{ji1}\ast&\lambda_{ji1}\ast&\lambda_{ji1}\ast&\lambda_{ji1}\ast&\lambda_{1ik}\ast\\ \lambda_{ji1}\ast&\lambda_{ji1}\ast&\lambda_{ji1}\ast&\lambda_{ji1}\ast&\lambda_{1ik}\ast\end{bmatrix}. \qquad (14)$$

The determinant of (14) is a scalar multiple of $\lambda_{ji1}(\lambda_{ji1}-\lambda_{1ik})$. By a direct computer computation as before, it is generically a nonzero multiple (alternatively, note the polynomial system in step 1 is a special case of the present one). We deduce $\lambda_{ji1}=\lambda_{1ik}$. An analogous calculation with $(\lambda\odot_bT^n)_{(3)}$ gives $\lambda_{i1j}=\lambda_{1kj}$.

In particular, combining with step 2 it follows that $\lambda_{1ij}=\lambda_{1ji}$, because $\lambda_{1ij}=\lambda_{k1j}=\lambda_{1kj}=\lambda_{ik1}=\lambda_{1ki}=\lambda_{j1i}=\lambda_{1ji}$. From this, step 1 and step 2, we have that the $\lambda$-scale does not depend on the ordering of its indices, provided there is a $1$ among the indices.

Step 4: Consider the submatrix of $(\lambda\odot_bT^n)_{(2)}$ with column labels $(j1,11)$, $(j2,11)$, $(j3,11)$, $(j1,12)$, $(11,i1)$ and row labels $(1,1)$, $(1,2)$, $(1,3)$, $(i,1)$, $(i,2)$, for $i,j\in\{2,\dots,n\}$ distinct. It looks like

$$\begin{bmatrix}\ast&\ast&\ast&\ast&\ast\\ \ast&\ast&\ast&\ast&\ast\\ \ast&\ast&\ast&\ast&\ast\\ \lambda_{ji1}\ast&\lambda_{ji1}\ast&\lambda_{ji1}\ast&\lambda_{ji1}\ast&\lambda_{1ii}\ast\\ \lambda_{ji1}\ast&\lambda_{ji1}\ast&\lambda_{ji1}\ast&\lambda_{ji1}\ast&\lambda_{1ii}\ast\end{bmatrix}. \qquad (15)$$

The determinant of (15) is a scalar multiple of $\lambda_{ji1}(\lambda_{ji1}-\lambda_{1ii})$. By a direct computer computation, it is generically a nonzero multiple (alternatively, note the polynomial system in step 1 is a special case of the present one). We deduce $\lambda_{ji1}=\lambda_{1ii}$.

Putting together what we know so far, all $\lambda$-scales with a single $1$-index agree. Indeed, this follows from $\lambda_{1ii}=\lambda_{ji1}=\lambda_{ij1}=\lambda_{1jj}$, so all $\lambda$-scales with a single $1$-index and two repeated indices agree, combined with $\lambda_{ji1}=\lambda_{1ii}$ and the last sentence of step 3. Let $c\in\mathbb{R}^*$ denote this common scale.

Step 5: Consider the submatrix of $(\lambda\odot_bT^n)_{(2)}$ with column labels $(11,i1)$, $(12,i1)$, $(13,i1)$, $(11,i2)$, $(i1,j1)$ and row labels $(1,1)$, $(1,2)$, $(1,3)$, $(i,1)$, $(i,2)$, for $i,j\in\{2,\dots,n\}$ distinct. It looks like

$$\begin{bmatrix}\ast&\ast&\ast&\ast&c\ast\\ \ast&\ast&\ast&\ast&c\ast\\ \ast&\ast&\ast&\ast&c\ast\\ c\ast&c\ast&c\ast&c\ast&\lambda_{iij}\ast\\ c\ast&c\ast&c\ast&c\ast&\lambda_{iij}\ast\end{bmatrix}. \qquad (16)$$

As a function of $c$ and $\lambda_{iij}$, the determinant of (16) is a scalar multiple of $c(c^2-\lambda_{iij})$ (the second factor is present because $\lambda_{iij}=c^2$ corresponds to scaling the bottom two rows and the rightmost column of a $5\times5$ submatrix of $T^n_{(2)}$ each by $c$, which preserves rank deficiency). By a direct computer computation, we find that the scale is nonzero for a random instance of $P_1,P_i,P_j$, therefore it is nonzero generically. It follows $c^2=\lambda_{iij}$. An analogous calculation with $(\lambda\odot_bT^n)_{(3)}$ gives $c^2=\lambda_{iji}$.

Step 6: Consider the submatrix of $(\lambda\odot_bT^n)_{(2)}$ with column labels $(11,i1)$, $(12,i1)$, $(13,i1)$, $(11,i2)$, $(j1,i1)$ and row labels $(1,1)$, $(1,2)$, $(1,3)$, $(i,1)$, $(i,2)$, for $i,j\in\{2,\dots,n\}$ distinct. It looks like

$$\begin{bmatrix}\ast&\ast&\ast&\ast&c\ast\\ \ast&\ast&\ast&\ast&c\ast\\ \ast&\ast&\ast&\ast&c\ast\\ c\ast&c\ast&c\ast&c\ast&\lambda_{jii}\ast\\ c\ast&c\ast&c\ast&c\ast&\lambda_{jii}\ast\end{bmatrix}. \qquad (17)$$

Similarly to the previous step, the determinant of (17) must be a scalar multiple of $c(c^2-\lambda_{jii})$. By a direct computer computation, it is generically a nonzero multiple. We deduce $c^2=\lambda_{jii}$.

Step 7: Consider the submatrix of $(\lambda\odot_bT^n)_{(2)}$ with column labels $(11,i1)$, $(12,i1)$, $(13,i1)$, $(11,i2)$, $(i1,k1)$ and row labels $(1,1)$, $(1,2)$, $(1,3)$, $(j,1)$, $(j,2)$, for $i,j,k\in\{2,\dots,n\}$ distinct. It looks like

$$\begin{bmatrix}\ast&\ast&\ast&\ast&c\ast\\ \ast&\ast&\ast&\ast&c\ast\\ \ast&\ast&\ast&\ast&c\ast\\ c\ast&c\ast&c\ast&c\ast&\lambda_{ijk}\ast\\ c\ast&c\ast&c\ast&c\ast&\lambda_{ijk}\ast\end{bmatrix}. \qquad (18)$$

The determinant of (18) is a scalar multiple of $c(c^2-\lambda_{ijk})$. By a direct computer computation, it is generically a nonzero multiple (alternatively, note the polynomial system in step 5 is a special case of the present one). We deduce $c^2=\lambda_{ijk}$.

At this point, by steps 5, 6 and 7 we have that all $\lambda$-scales with no $1$-indices and not all indices the same must equal $c^2$. Combined with the second paragraph of step 4, this shows $c$ satisfies the property announced at the start of the proof. Therefore the proof is complete. ∎

A.5 Implementation details

A.5.1 Estimating trifocal tensors from three fundamental matrices

Given three cameras $P_1,P_2,P_3$ and the corresponding fundamental matrices $F_{21},F_{31},F_{32}$, we can calculate the trifocal tensor $T_{ijk}$ using the following procedure detailed in [1]. Specifically, from $F_{21}$ calculate an initial estimate of the cameras $P_1',P_2'$. Then, $P_3^TF_{32}P_2'$ and $P_3^TF_{31}P_1'$ should be skew-symmetric $4\times4$ matrices. This gives 20 linear equations in the entries of $P_3$, which can be used to solve for $P_3$ and hence the trifocal tensor. Note that no geometric constraints are enforced when calculating $P_3$, so there is no guarantee on the quality of the estimate. A sketch of this procedure follows.
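The sketch below is ours, not the authors' code; numpy is assumed, along with the convention $x_2^TF_{21}x_1=0$ and the standard canonical camera pair from [1]:

```python
import numpy as np

def skew_part_rows(F, P):
    # Rows of the linear system forcing P3^T F P to be skew-symmetric:
    # (P3^T F P)_{ab} + (P3^T F P)_{ba} = 0 for all a <= b (10 equations).
    rows = []
    for a in range(4):
        for b in range(a, 4):
            coeff = np.zeros((3, 4))
            coeff[:, a] += F @ P[:, b]
            coeff[:, b] += F @ P[:, a]
            rows.append(coeff.ravel())   # P3 vectorized row-major
    return rows

def trifocal_from_fundamentals(F21, F31, F32):
    # Canonical pair from F21: P1' = [I | 0], P2' = [[e2]_x F21 | e2],
    # with e2 the epipole in image 2 (F21^T e2 = 0).
    e2 = np.linalg.svd(F21.T)[2][-1]
    e2x = np.array([[0, -e2[2], e2[1]],
                    [e2[2], 0, -e2[0]],
                    [-e2[1], e2[0], 0]])
    P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = np.hstack([e2x @ F21, e2[:, None]])
    # 20 linear equations in the 12 entries of P3; least-squares via SVD.
    A = np.asarray(skew_part_rows(F31, P1) + skew_part_rows(F32, P2))
    P3 = np.linalg.svd(A)[2][-1].reshape(3, 4)
    # Trifocal tensor from (P1', P2', P3): T_i = a_i b_4^T - a_4 b_i^T,
    # where a_i, b_i are the columns of P2 and P3 (Hartley & Zisserman).
    return np.stack([np.outer(P2[:, i], P3[:, 3]) - np.outer(P2[:, 3], P3[:, i])
                     for i in range(3)])
```

Since $P_3$ is found purely by least squares, without enforcing geometric constraints, the output inherits the caveat noted above.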

A.5.2 Higher-order regularized subspace-constrained Tyler’s estimator (HOrSTE) for EPFL

We describe the robust variant of SVD that we used for the EPFL experiments in Section 4. Numerically, it performs more stably and accurately than HOSVD-HT, yet it is an iterative procedure and each iteration requires an SVD of the $3n\times9n^2$ flattening. This becomes computationally expensive when $n$ and the number of iterations are large. However, since the EPFL datasets have fewer than 20 cameras each, the computational overhead is not too great.

In HOSVD, a low-dimensional subspace is estimated for each mode-$n$ flattening by applying the singular value decomposition and taking the $R_n$ leading left singular vectors. Tyler's M-estimator (TME) [40] is a robust covariance estimator for a $D$-dimensional dataset $\{x_i\}_{i=1}^N\subset\mathbb{R}^D$. It minimizes the objective function

$$\min_{\Sigma\in\mathbb{R}^{D\times D}}\ \frac{D}{N}\sum_{i=1}^{N}\log(x_i^T\Sigma^{-1}x_i)+\log\det(\Sigma) \qquad (19)$$

such that $\Sigma$ is positive definite and has trace 1. TME can be applied to robustly find an $R_n$-dimensional subspace by taking the $R_n$ leading eigenvectors of its covariance estimate. To compute the TME, [40] proposes an iterative algorithm, where

$$\Sigma^{(k)}=\sum_{i=1}^{N}\frac{x_ix_i^T}{x_i^T(\Sigma^{(k-1)})^{-1}x_i}\Big/\,\mathrm{tr}\Big(\sum_{i=1}^{N}\frac{x_ix_i^T}{x_i^T(\Sigma^{(k-1)})^{-1}x_i}\Big). \qquad (20)$$

TME does not exist when $D>N$, but a regularized TME has been proposed in [54]. The iterations become

$$\Sigma^{(k)}=\frac{1}{1+\alpha}\,\frac{D}{N}\sum_{i=1}^{N}\frac{x_ix_i^T}{x_i^T(\Sigma^{(k-1)})^{-1}x_i}+\frac{\alpha}{1+\alpha}I \qquad (21)$$

where $\alpha$ is a regularization parameter and $I$ is the $D\times D$ identity matrix. TME does not assume that the dimension $d$ of the subspace is predetermined. When $d$ is prespecified, [41] improves the TME estimator by incorporating this information into the algorithm, developing the subspace-constrained Tyler's estimator (STE). At each iteration, STE equalizes the trailing $D-d$ eigenvalues of the estimated covariance matrix, using a parameter $0<\gamma<1$ to shrink them. The STE iteration is summarized in the following 3 steps:

  1. Calculate the unnormalized TME, $Z^{(k)}=\sum_{i=1}^{N}x_ix_i^T/\big(x_i^T(\Sigma^{(k-1)})^{-1}x_i\big)$.

  2. Perform the eigendecomposition $Z^{(k)}=U^{(k)}S^{(k)}(U^{(k)})^T$, and set the trailing $D-d$ eigenvalues to $\gamma\sum_{i=d+1}^{D}\sigma_i/(D-d)$.

  3. Calculate $\Sigma^{(k)}=U^{(k)}S^{(k)}(U^{(k)})^T/\mathrm{tr}\big(U^{(k)}S^{(k)}(U^{(k)})^T\big)$, the normalized covariance matrix. Repeat steps 1–3 until convergence.

Similar to the regularized TME, STE can also be regularized so that it succeeds in situations with fewer inliers, which improves the robustness of the algorithm. Regularized STE differs from STE only in the first step, which is replaced by

  1*. Calculate the unnormalized regularized TME, $Z^{(k)}=\frac{1}{1+\alpha}\frac{D}{N}\sum_{i=1}^{N}\frac{x_ix_i^T}{x_i^T(\Sigma^{(k-1)})^{-1}x_i}+\frac{\alpha}{1+\alpha}I$.

We apply the regularized STE within the HOSVD framework, and call the resulting projection the higher-order regularized STE (HOrSTE). It is performed via the following steps (a code sketch follows the list):

  1. For each $n$, calculate the factor matrix $A_n$ as the $R_n$ leading eigenvectors of the covariance estimate returned by regularized STE applied to $T_{(n)}$.

  2. Set the core tensor $\mathcal{G}=T\times_1A_1^T\times_2A_2^T\cdots\times_NA_N^T$.
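Below is a minimal sketch (ours, not the authors' implementation; numpy assumed, parameter names illustrative) of the regularized STE iteration (steps 1*, 2, 3) and of step 1 of HOrSTE:

```python
import numpy as np

def regularized_ste(X, d, alpha=0.1, gamma=0.5, n_iter=200, tol=1e-10):
    # X: (N, D) data matrix whose rows are the x_i; d: target subspace dimension.
    N, D = X.shape
    Sigma = np.eye(D) / D
    for _ in range(n_iter):
        # Step 1*: unnormalized regularized TME update.
        w = np.einsum('ij,jk,ik->i', X, np.linalg.inv(Sigma), X)  # x_i^T Sigma^{-1} x_i
        Z = (D / N) / (1 + alpha) * (X.T @ (X / w[:, None])) \
            + alpha / (1 + alpha) * np.eye(D)
        # Step 2: shrink and equalize the trailing D - d eigenvalues.
        evals, U = np.linalg.eigh(Z)                 # ascending order
        evals, U = evals[::-1].copy(), U[:, ::-1]    # descending order
        evals[d:] = gamma * evals[d:].sum() / (D - d)
        # Step 3: normalize to trace one.
        Sigma_new = (U * evals) @ U.T
        Sigma_new /= np.trace(Sigma_new)
        if np.linalg.norm(Sigma_new - Sigma) < tol:
            return Sigma_new
        Sigma = Sigma_new
    return Sigma

def horste_factor(T_flat, R, **kw):
    # Step 1 of HOrSTE: the factor matrix holds the R leading eigenvectors of
    # the regularized-STE covariance of the mode-n flattening T_(n) (T_flat),
    # whose columns are treated as the data points.
    Sigma = regularized_ste(T_flat.T, d=R, **kw)
    _, U = np.linalg.eigh(Sigma)
    return U[:, ::-1][:, :R]
```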

A.6 Additional numerical results

In this section, we include comprehensive results for the rotation and translation errors for the EPFL and Photo Tourism experiments. Tables 1 and 2 contain all results for the EPFL datasets. Table 3 contains the location estimation errors for Photo Tourism, and Table 4 contains the rotation estimation errors for Photo Tourism. In Table 4, we only report the rotation errors of LUD on behalf of all the methods that we compared against, as they are mostly the same since they use the same rotation averaging method.

Table 1: EPFL synchronization location errors. $\bar{e}_t$ is the mean location error and $\hat{e}_t$ is the median location error. NRFM(LUD) is NRFM initialized with LUD and NRFM is randomly initialized. BATA(MPLS) is BATA initialized with MPLS.
Our LUD NRFM(LUD) NRFM BATA(MPLS)
Dataset $\bar{e}_t$ $\hat{e}_t$ $\bar{e}_t$ $\hat{e}_t$ $\bar{e}_t$ $\hat{e}_t$ $\bar{e}_t$ $\hat{e}_t$ $\bar{e}_t$ $\hat{e}_t$
FountainP11 0.008 0.007 0.91 0.54 0.75 0.46 3.37 3.03 1.12 1.01
HerzP8 0.02 0.02 5.06 5.06 4.37 3.42 4.24 3.14 5.04 5.03
HerzP25 4.70 4.68 7.75 8.00 6.20 5.82 8.85 8.38 7.77 8.41
EntryP10 0.05 0.02 3.08 3.02 1.34 1.11 7.63 7.43 2.90 2.58
CastleP19 9.64 5.80 4.58 4.04 3.37 3.02 15.81 15.43 5.77 5.62
CastleP30 11.00 11.33 4.27 3.72 3.24 2.75 16.54 17.04 4.23 3.26
Table 2: EPFL synchronization rotation errors. $\bar{e}_r$ is the mean rotation error in degrees and $\hat{e}_r$ is the median rotation error in degrees. BATA(MPLS) is BATA initialized with MPLS.
Our LUD BATA(MPLS)
Dataset $\bar{e}_r$ $\hat{e}_r$ $\bar{e}_r$ $\hat{e}_r$ $\bar{e}_r$ $\hat{e}_r$
FountainP11 0.09 0.08 0.05 0.05 0.06 0.05
HerzP8 0.12 0.12 0.33 0.34 0.44 0.39
HerzP25 2.01 1.11 0.18 0.19 0.26 0.23
EntryP10 0.15 0.11 0.25 0.25 0.27 0.25
CastleP19 56.24 11.71 0.24 0.22 0.27 0.25
CastleP30 38.84 4.58 0.13 0.13 0.19 0.15
Table 3: Translation errors for Photo Tourism. $n$ is the size after downsampling. Est. % is the ratio of observed blocks over the total number of blocks. $\bar{e}_t$ is the mean location error and $\hat{e}_t$ is the median location error. NRFM(L) is NRFM initialized with LUD and NRFM(R) is randomly initialized. The notation PR means that the dataset was further downsampled to match the two-view methods. BATA is BATA initialized with MPLS. We were not able to get results with MPLS for our subsampled Piccadilly dataset.
Our Approach NRFM(L) LUD NRFM(R) BATA
Dataset $n$ Est. % $\bar{e}_t$ $\hat{e}_t$ $\bar{e}_t$ $\hat{e}_t$ $\bar{e}_t$ $\hat{e}_t$ $\bar{e}_t$ $\hat{e}_t$ $\bar{e}_t$ $\hat{e}_t$
Piazza del Popolo 185 72.3 0.78 0.45 1.63 0.85 1.66 0.86 13.45 12.06 1.63 1.10
NYC Library 127 64.7 1.01 0.53 1.39 0.48 1.49 0.57 13.06 14.03 1.59 0.68
Ellis Island 194 70.3 9.56 7.73 19.31 16.97 20.71 17.96 26.08 26.38 23.63 22.50
Tower of London 130 34.1 4.15 2.66 3.26 2.49 3.54 2.51 49.99 47.33 2.70 2.26
Madrid Metropolis 190 35.9 18.93 15.53 1.91 1.19 1.94 1.20 31.48 24.02 3.33 1.72
Yorkminster 196 37.2 1.46 1.14 2.31 1.39 2.35 1.45 16.67 14.46 1.37 1.15
Alamo 224 94.3 0.62 0.28 0.53 0.31 0.53 0.31 10.04 7.68 0.55 0.33
Vienna Cathedral 197 97.8 0.73 0.33 2.96 1.64 3.15 1.79 16.08 14.76 6.16 2.18
Roman Forum(PR) 111 51.1 10.71 6.75 1.59 0.89 1.63 0.93 23.23 11.20 1.85 1.04
Notre Dame 214 96.6 0.57 0.34 0.38 0.21 0.38 0.21 6.87 4.75 1.02 0.26
Montreal N.D. 162 97.0 0.38 0.24 0.56 0.37 0.57 0.38 10.33 11.15 0.58 0.41
Union Square 144 28.6 5.64 3.99 4.31 3.76 4.85 4.38 9.59 6.69 5.77 4.83
Gendarmenmarkt 112 89.7 45.34 23.63 37.93 17.35 37.92 17.41 62.69 26.42 54.38 15.91
Piccadilly(PR) 169 55.4 0.73 0.39 3.68 1.90 3.71 1.93 13.55 13.34 - -
Table 4: Rotation errors for Photo Tourism. $N$ is the total number of cameras. $n$ is the size after downsampling. Est. % is the ratio of observed blocks over the total number of blocks. $\bar{e}_r$ is the mean rotation error and $\hat{e}_r$ is the median rotation error. The notation PR means that the dataset was further downsampled to match the two-view methods. We were not able to get results for our subsampled dataset for Piccadilly with MPLS.
Our Approach LUD MPLS
Dataset $N$ $n$ Est. % $\bar{e}_r$ $\hat{e}_r$ $\bar{e}_r$ $\hat{e}_r$ $\bar{e}_r$ $\hat{e}_r$ Our Runtime (s)
Piazza del Popolo 307 185 72.3 1.26 0.61 0.72 0.43 0.69 0.41 13531
NYC Library 306 127 64.7 2.80 1.58 1.16 0.61 1.19 0.57 4465
Ellis Island 223 194 70.3 4.61 1.11 1.16 0.50 0.99 0.49 13816
Tower of London 440 130 34.1 2.28 1.31 1.63 1.28 1.66 1.37 4242
Madrid Metropolis 315 190 35.9 28.85 4.60 1.27 0.61 1.54 1.15 11764
Yorkminster 410 196 37.2 2.33 1.97 1.34 1.09 1.89 1.04 13115
Alamo 564 224 94.3 1.10 0.76 1.07 0.68 1.09 0.68 17513
Vienna Cathedral 770 197 97.8 0.74 0.46 0.40 0.28 0.39 0.28 12499
Roman Forum(PR) 989 111 51.1 11.86 3.39 0.40 0.28 1.07 0.65 2162
Notre Dame 547 214 96.6 0.78 0.50 0.67 0.43 0.68 0.43 17430
Montreal N.D. 442 162 97.0 0.50 0.35 0.49 0.32 0.49 0.31 7241
Union Square 680 144 28.6 20.70 5.29 1.82 1.34 2.00 1.56 4355
Gendarmenmarkt 655 112 89.7 22.95 15.24 18.42 10.25 17.42 8.41 2432
Piccadilly(PR) 1000 169 55.4 2.01 0.96 6.12 2.95 - - 11230