
Smooth Mesh Estimation from Depth Data
using Non-Smooth Convex Optimization

Antoni Rosinol1, Luca Carlone1 1A. Rosinol and L. Carlone are with the Laboratory for Information & Decision Systems (LIDS), Massachusetts Institute of Technology, Cambridge, MA, USA, {arosinol,lcarlone}@mit.eduThis work was partially funded by ARL DCIST CRA W911NF-17-2-0181, and Lincoln Laboratory’s “Resilient Perception in Degraded Environments” program.
Abstract

Meshes are commonly used as 3D maps since they encode the topology of the scene while being lightweight. Unfortunately, 3D meshes are mathematically difficult to handle directly because of their combinatorial and discrete nature. Therefore, most approaches generate 3D meshes of a scene after fusing depth data using volumetric or other representations. Nevertheless, volumetric fusion remains computationally expensive both in terms of speed and memory. In this paper, we leapfrog these intermediate representations and build a 3D mesh directly from a depth map and the sparse landmarks triangulated with visual odometry. To this end, we formulate a non-smooth convex optimization problem that we solve using a primal-dual method. Our approach generates a smooth and accurate 3D mesh that substantially improves the state-of-the-art on direct mesh reconstruction while running in real-time.

Index Terms:
Mesh, Mapping, Vision-Based Navigation

I Introduction

Fast, lightweight, and accurate 3D representations of the scene are of utmost importance for computationally constrained robots, such as flying drones, or high-stakes applications, such as self-driving cars. Failing to map the scene densely and accurately in a timely manner can lead to collisions, which result in damaged equipment or, worse, fatalities. Being able to densely reconstruct the scene with a camera in real-time also opens up new applications in augmented reality, online real-life gaming, and search-and-rescue operations.

Therefore, in this work, we propose a 3D scene reconstruction method that leverages depth data (either from RGB-D or mono/stereo reconstruction) and recent advances in the field of convex variational optimization to build a 3D mesh that faithfully represents the scene while working in real-time.

Our work can be used with any available Visual Odometry (VO) pipeline. In our case, we use a Visual Inertial Odometry (VIO) pipeline, Kimera [1], which achieves very accurate pose estimates in real-time by tracking 2D features from frame to frame and using inertial information from an Inertial Measurement Unit (IMU).

Figure 1: We propose a 3D mesh reconstruction method from input depth maps (blue rays) coming from RGB-D or dense stereo matching, and the sparse landmarks from visual odometry (VO) (green rays). We formulate the problem as a variational convex optimization problem which we solve in real-time and to optimality. The cost function measures the fitness of the mesh with respect to the depth input (red arrows), and the visual odometry landmarks, as well as the smoothness of the mesh. The optimization is done over the inverse depth of the vertices of the mesh (orange arrows).

VO methods typically provide a sparse map of the scene. In particular, feature-based methods [2, 3, 4, 5] produce a sparse point cloud that is not directly usable for path planning or obstacle avoidance. In those cases, a denser map is built subsequently, e.g., by using (multi-view) stereo algorithms [6, 7]. Nevertheless, with current broad availability of depth information from either RGB-D, LIDAR, or stereo cameras, it is possible to fuse the depth online using volumetric approaches [8, 9]. Unfortunately, such volumetric approaches are memory intensive, and slow if they are not heavily parallelized using powerful GPUs.

Ideally, a map representation should be (i) lightweight to compute and store, and (ii) capable of representing the topology of the environment. In this work, we use a 3D mesh as our map representation (Fig. 1). A 3D mesh is lightweight, since it can represent piecewise-planar surfaces with very few parameters (compared to dense pointclouds or volumetric representations), while providing information about the topology of the scene through its connectivity information.

Recent approaches have tried to build the mesh over the set of sparse 3D landmarks triangulated by a VO/VIO pipeline [10, 11]. Nevertheless, these approaches fail to reconstruct an accurate mesh of the scene, since they discard most of the information present in the input data.

Contributions. In this paper, we propose a 3D mesh reconstruction algorithm that summarizes the information present in depth-maps and tracked 3D landmarks from visual odometry. In this way, we can build denser and more accurate maps than sparse approaches, while preserving real-time performance. We experimentally evaluate our approach on the EuRoC dataset [12] and the uHumans2 dataset [13, 8], and compare against competing approaches. Our evaluation shows that (i) the proposed approach produces a lightweight representation of the environment that captures the geometry of the scene, (ii) leveraging variational optimization enables real-time, accurate, and denser reconstructions.

II Related Work

While there exist a myriad of representations for 3D mapping, we focus our related work review on approaches using meshes. Despite the fact that volumetric approaches are typically used to build a mesh [14], and other authors have tried to mesh intermediate representations such as surfels [15], we focus on works that build a 3D mesh directly from input data either in the form of RGB images or RGB-D data.

Meshes were one of the first dense map representations used in the nineties when GPUs were not available. Early on, two different types of meshes were introduced: 3D tetrahedral meshes and 2D triangular meshes (embedded in 3D).

ProForma [16] was one of the first works to reconstruct a detailed map of the scene using a tetrahedral 3D mesh. Unfortunately, tetrahedral 3D meshes share similar costs and benefits with volumetric approaches, and hence they were initially mostly used in small-scale scenes [17, 16]. More recent works have used tetrahedral 3D meshes for real-time reconstruction of the scene [18, 19], but the quality and artifacts of the reconstructions remain a problem. In particular, reconstructing a manifold mesh remains challenging, and many authors have tried to enforce manifoldness with different degrees of success [20, 21]. While 3D tetrahedral meshes are becoming increasingly capable, they remain challenging to use, are computationally expensive, and lack robustness.

In this work, we focus instead on building 2D triangular meshes embedded in 3D space. Turk and Levoy [22] proposed Zippered Meshes, an early attempt to fuse depth information into 2D triangular meshes by ‘sewing’ them from frame to frame. In SLAM, several works have tried to build a 3D mesh directly by triangulating the features in the image, followed by a re-projection in 3D space using the depth estimated from triangulation. While very fast in practice, these approaches trade off quality and accuracy for speed. Their application has therefore mostly been geared towards fast obstacle avoidance and local path-planning, particularly for resource-constrained robots such as drones [23, 11, 10], and also ground robots [24, 25].

One of the major problems when reconstructing a 2D mesh from input data is its lack of robustness to outliers and of smoothness in the presence of noise. By leveraging recent insights in variational smoothing of dense depth maps, Greene and Roy [11] propose FLaME, a discrete formulation using a 3D mesh instead of dense depth maps, thereby achieving substantial computational savings while reconstructing a smooth 3D mesh. Since it does not use the dense depth maps readily available from RGB-D, FLaME suffers from inaccurate reconstructions. In the same vein, VITAMIN-E [26] shows that tracking a substantially higher number of 2D features, and thereby providing denser depth maps, improves the 3D reconstruction accuracy. It also uses the variational framework proposed in FLaME, but uses all tracked features as vertices of the 3D mesh, rather than summarizing the depth map into a simplified and coherent surface reconstruction. Finally, [10] proposed to smooth the 3D mesh by leveraging the least-squares optimization performed in VO/VIO, and enforcing structural regularities (such as planarity). While this approach allowed the reconstruction of a multi-frame 3D mesh in real-time, the accuracy remained low due to the sparsity of the features tracked in SLAM. To cope with the low accuracy of these reconstructions, both VITAMIN-E [26] and Rosinol et al. [10] fuse the depth into a volumetric representation [1, 13, 8]. Similarly, while Newcombe et al. [27] and Lovegrove [28] initially attempted to build 3D meshes from sparse features, they concluded that dense all-pixel methods with volumetric maps lead to far superior accuracy [29], despite the heavy use of GPUs.

In this work, we maintain the speed and low computational budget of early approaches, despite the use of a GPU, by re-using the feature tracks of the visual pipeline, but we use a dense depth map, from RGB-D or stereo, to build an improved and smoothed 3D mesh that summarizes all the depth information. The resulting 3D mesh optimally fits the dense depth maps, and results in an accurate and lightweight representation of the scene. To conclude our review, we note that, more recently, non-Euclidean deep learning approaches have started to work with 3D meshes and are becoming increasingly popular [30, 31, 32, 33]. Unfortunately, these approaches remain slow and require powerful GPUs for computation.

Figure 2: (Left) Textured 3D mesh reconstruction, from the presented approach, of a planar surface in the uHumans2 dataset [8]. (Right) Wireframe of the mesh, showing both interest points (around the painting) and Steiner points attached to the wall. We add Steiner points every 50 pixels, forming a square grid-like pattern.

III Mesh Optimization: Depth-Map Fusion

III-A Notation

We follow a notation similar to the one used in FLaME [11], since we use the same mathematical machinery. We denote by $\mathcal{V}$ the set of vertices of the 3D mesh, and by $\mathcal{E}$ the set of edges of the 3D mesh. For each edge $e=(i,j)\in\mathcal{E}$, we denote by $v^{i},v^{j}\in\mathcal{V}$ the associated vertices. The edges are directed from $i$ to $j$, but the direction is arbitrary and can be ignored. For each vertex $v\in\mathcal{V}$, we assign a smoothed inverse depth value that we denote $v_{\xi}$, and an auxiliary variable $\mathbf{w}\in\mathbb{R}^{2}$ that represents the 2D projection of a 3D normal, such that $v_{\mathbf{w}}:=(v_{w_{1}},v_{w_{2}})$. We encapsulate in $v_{\mathbf{x}}:=(v_{\xi},v_{\mathbf{w}})$ all the (primal) variables associated with a vertex $v$. We also denote by $\mathbb{v}_{\xi}$ the inverse depths of all the vertices of the mesh stacked into a column vector. $\langle\cdot,\cdot\rangle$ denotes the inner product between two vectors.

III-B Fitting the Data

It is well-known from the computer graphics literature [34, Section 13.5] that the inverse depth map $\mathbf{b}$ depends linearly on the inverse depths $\mathbb{v}_{\xi}$ of the vertices of the mesh:

\mathbf{b}=A\mathbb{v}_{\xi}

where the map $A$ encodes the barycentric coordinates of each pixel with respect to the triangle it belongs to. Here, $\mathbf{b}$ is a column vector with as many entries as there are pixels in the depth image, while $\mathbb{v}_{\xi}$ has as many entries as there are vertices in the mesh. Therefore, given a 2D mesh (in image space) with fixed topology, we can fit the inverse depths of the vertices so that the mesh best fits a given depth map in a least-squares sense. In other words, the loss function for mesh reconstruction could be:

\mathcal{L}_{\text{depth}}(\mathbb{v}_{\xi})=\|A\mathbb{v}_{\xi}-\mathbf{b}\|_{2}^{2}

where $\mathbf{b}$ is now the measured inverse depth, and $A\mathbb{v}_{\xi}$ is the estimated inverse depth from the mesh. The optimal value of $\mathbb{v}_{\xi}$ that solves $\min_{\mathbb{v}_{\xi}}\mathcal{L}_{\text{depth}}$ is simply given by the normal equations. Nevertheless, we are not just interested in fitting a 3D mesh to the depth map, but rather in summarizing the depth map into a smooth 3D surface. Moreover, we expect our depth map to potentially contain outlier measurements. Therefore, we want to formulate our loss function such that it is robust to outliers, and we want to penalize solutions that are not smooth surfaces.
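To make the linear depth-fitting step concrete, the following sketch builds the sparse interpolation matrix $A$ from per-pixel barycentric coordinates and solves the plain least-squares fit via the normal equations. It is a minimal illustration, assuming precomputed per-pixel triangle assignments (`bary`, `tri_idx` are hypothetical inputs, not part of the paper's code); the robust, smoothed formulation used in the paper replaces this step as described next.

```python
# Minimal sketch of the linear depth-fitting step (plain least-squares, no robustness).
# Assumes per-pixel barycentric coordinates `bary` (N_pix x 3) and the indices
# `tri_idx` (N_pix x 3) of the mesh vertices of the triangle covering each pixel.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def build_interpolation_matrix(bary, tri_idx, n_vertices):
    """A maps vertex inverse depths to per-pixel inverse depths: b_hat = A @ v_xi."""
    n_pix = bary.shape[0]
    rows = np.repeat(np.arange(n_pix), 3)
    cols = tri_idx.reshape(-1)
    vals = bary.reshape(-1)
    return sp.csr_matrix((vals, (rows, cols)), shape=(n_pix, n_vertices))

def fit_vertices_least_squares(A, b):
    """Solve min_v ||A v - b||_2^2 via the normal equations (A^T A) v = A^T b."""
    AtA = (A.T @ A).tocsc()
    Atb = A.T @ b
    return spla.spsolve(AtA, Atb)
```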

On the one hand, to reduce the influence of outliers, we can make use of the $\ell_{1}$ norm instead. In addition, we also want to quantify the difference between the estimated inverse depth $v_{\xi}$ and the inverse depth $v_{z}$ given by the visual odometry pipeline, which we denote by $\mathcal{L}_{\text{tracking}}$. In summary, our data loss $\mathcal{L}_{\text{data}}$ consists of two terms, $\mathcal{L}_{\text{depth}}$ and $\mathcal{L}_{\text{tracking}}$, defined as:

\mathcal{L}_{\text{depth}}(\mathbb{v}_{\xi})=\sum_{d\in\mathcal{D}}|\mathbf{a}_{d}\mathbb{v}_{\xi}-b_{d}| \qquad (1)
\mathcal{L}_{\text{tracking}}(\mathbb{v}_{\xi})=\sum_{v\in\mathcal{V}}\left|v_{\xi}-v_{z}\right|,

where $\mathcal{V}$ is the set of vertices of the mesh and $\mathcal{D}$ is the set of depth measurements. $\mathbf{a}_{d},b_{d}$ are the $d^{\text{th}}$ row of $A$ and entry of $\mathbf{b}$ associated with the $d^{\text{th}}$ depth measurement in $\mathcal{D}$. While [23, 11, 10] only used the sparse features from VO tracking to estimate the mesh (using Delaunay triangulation), we further add Steiner points [35, Section 14.1] in addition to the set of sparse features tracked by the VO. A Steiner point is an extra mesh vertex added to the optimization problem to achieve a better fit than would be possible using only the tracked VO features. On the other hand, to promote smooth 3D meshes, we can penalize deviations from piecewise affine solutions. We quantify this desire using the $\mathcal{L}_{\text{smooth}}$ term below.
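As an illustration of how the mesh vertices can be assembled, the sketch below combines hypothetical VO feature tracks (`vo_features` is an assumed $N\times 2$ array of pixel coordinates) with a regular grid of Steiner points and triangulates them with a 2D Delaunay triangulation via SciPy; the actual pipeline may differ in implementation details.

```python
# Minimal sketch of building the 2D mesh in image space: tracked VO features plus
# Steiner points on a regular grid every `step` pixels, triangulated with Delaunay.
import numpy as np
from scipy.spatial import Delaunay

def build_image_mesh(vo_features, width, height, step=50):
    us, vs = np.meshgrid(np.arange(0, width, step), np.arange(0, height, step))
    steiner = np.stack([us.ravel(), vs.ravel()], axis=1).astype(float)
    vertices = np.vstack([vo_features, steiner])   # VO landmarks + Steiner points
    tri = Delaunay(vertices)                       # 2D triangulation, fixed topology
    edges = set()
    for simplex in tri.simplices:                  # collect unique undirected edges
        for i in range(3):
            a, b = sorted((simplex[i], simplex[(i + 1) % 3]))
            edges.add((a, b))
    return vertices, tri.simplices, np.array(sorted(edges))
```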

We formulate this multi-objective optimization via scalarization, which introduces a free parameter $\lambda>0$ balancing the two objectives:

\mathcal{L}=\mathcal{L}_{\text{smooth}}+\lambda\mathcal{L}_{\text{data}},

where $\mathcal{L}_{\text{data}}=\mathcal{L}_{\text{depth}}+\mathcal{L}_{\text{tracking}}$. We could also balance the contribution of $\mathcal{L}_{\text{depth}}$ and $\mathcal{L}_{\text{tracking}}$ with another free parameter, but we avoid this for simplicity.

III-C Smoothing the Mesh

There are many possible choices for $\mathcal{L}_{\text{smooth}}$, but one that is particularly interesting is the second-order Total Generalized Variation ($\text{TGV}^{2}$) functional [36], because it promotes piecewise affine solutions. A non-local extension was proposed by [37] that allows encoding additional priors in the regularization term ($\text{NLTGV}^{2}$). Pinies et al. [38] showed that piecewise planarity in image space, afforded by $\text{NLTGV}^{2}$, translates to piecewise planarity in 3D space when optimizing over the inverse depth. Leveraging this insight, Greene and Roy proposed FLaME [11], which discretizes $\text{NLTGV}^{2}$ over 3D meshes (instead of dense inverse depth maps), thereby achieving substantial computational savings. Similarly to FLaME [11], we make use of the discrete $\text{NLTGV}^{2}$ to smooth our 3D mesh, which we now explain for completeness.

First, we need to introduce a key fact [38]: if two pixels $v_{\mathbf{u}}^{i}$ and $v_{\mathbf{u}}^{j}$ belong to the same planar surface in 3D, their inverse depth values $v^{i}_{\xi}$ and $v^{j}_{\xi}$ are constrained by the following relation:

v_{\xi}^{i}-v_{\xi}^{j}-\left\langle v_{\mathbf{w}}^{i},v_{\mathbf{u}}^{i}-v_{\mathbf{u}}^{j}\right\rangle=0,

where $v^{i}_{\mathbf{w}}$ is the normal vector of the surface at vertex $v^{i}$.
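A quick numerical check of this relation (a sketch, with made-up intrinsics $K$, plane normal $\mathbf{n}$, and offset $d$): for a plane $\mathbf{n}^{T}\mathbf{X}=d$ observed by a pinhole camera, the inverse depth is an affine function of the pixel coordinates, so the residual above vanishes exactly.

```python
# Numerical check (sketch): inverse depth is affine in pixel coordinates for a 3D
# plane, so xi_i - xi_j - <w, u_i - u_j> = 0 holds exactly. K, n, d are made up.
import numpy as np

K = np.array([[450.0, 0.0, 320.0],
              [0.0, 450.0, 240.0],
              [0.0, 0.0, 1.0]])
n, d = np.array([0.1, -0.2, 1.0]), 3.0       # plane: n . X = d (camera frame)
Kinv = np.linalg.inv(K)

def inv_depth(u):                            # xi(u) = (n^T K^{-1} [u, v, 1]) / d
    return n @ (Kinv @ np.array([u[0], u[1], 1.0])) / d

u_i, u_j = np.array([100.0, 80.0]), np.array([400.0, 300.0])
w = (n @ Kinv)[:2] / d                       # gradient of xi w.r.t. pixel coordinates
residual = inv_depth(u_i) - inv_depth(u_j) - w @ (u_i - u_j)
print(abs(residual) < 1e-12)                 # True: the planarity relation holds
```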

The $\text{NLTGV}^{2}$ smoothing term tries to penalize solutions that deviate from this relationship. Hence, the $\text{NLTGV}^{2}$ term over the mesh is given by:

\mathrm{NLTGV}^{2}(\xi)\approx\sum_{e\in\mathcal{E}}\left\|D_{e}\left(v_{\mathbf{x}}^{i},v_{\mathbf{x}}^{j}\right)\right\|_{1} \qquad (2)

Here $D_{e}$ is a linear operator that acts on the vertices $v_{\mathbf{x}}^{i}$ and $v_{\mathbf{x}}^{j}$ corresponding to edge $e$, where we recall that $v_{\mathbf{x}}:=(v_{\xi},v_{\mathbf{w}})$:

D_{e}\left(v_{\mathbf{x}}^{i},v_{\mathbf{x}}^{j}\right)=\left[\begin{array}{c}e_{\alpha}\left(v_{\xi}^{i}-v_{\xi}^{j}-\left\langle v_{\mathbf{w}}^{i},v_{\mathbf{u}}^{i}-v_{\mathbf{u}}^{j}\right\rangle\right)\\ e_{\beta}\left(v_{w_{1}}^{i}-v_{w_{1}}^{j}\right)\\ e_{\beta}\left(v_{w_{2}}^{i}-v_{w_{2}}^{j}\right)\end{array}\right],

where $v_{\mathbf{u}}^{i}$ and $v_{\mathbf{u}}^{j}$ are the 2D pixel projections of the 3D vertices $v^{i}$ and $v^{j}$ of the mesh. The auxiliary variables $v^{i}_{\mathbf{w}}$, $v^{j}_{\mathbf{w}}$ are related to the 2D projection of the 3D normal of the plane associated with vertices $v^{i}$ and $v^{j}$. Intuitively, the first row of $\|D_{e}(v_{\mathbf{x}}^{i},v_{\mathbf{x}}^{j})\|_{1}$ penalizes the deviation of the vertices from the plane going through them, while the second and third rows penalize adjacent vertices having different normals. Weights $e_{\alpha},e_{\beta}\geq 0$ are assigned to each edge, and control the influence of the smoothness term. Setting one of these weights to 0 cancels the enforcement of planarity between vertices of the mesh. As in [11], we set the edge weight $e_{\alpha}=1/\|v_{\mathbf{u}}^{i}-v_{\mathbf{u}}^{j}\|_{2}$, i.e., inversely proportional to the edge length in pixels, and $e_{\beta}=1$. It will be useful in what follows to express the operator $D_{e}$ as a matrix; in particular, we have:

D_{e}=\left[\begin{matrix}e_{\alpha}&e_{\alpha}(v_{u_{1}}^{j}-v_{u_{1}}^{i})&e_{\alpha}(v_{u_{2}}^{j}-v_{u_{2}}^{i})&-e_{\alpha}&0&0\\ 0&e_{\beta}&0&0&-e_{\beta}&0\\ 0&0&e_{\beta}&0&0&-e_{\beta}\end{matrix}\right] \qquad (3)

The operator $D_{e}$ is applied to the per-edge vector of primal variables, namely $[v_{\mathbf{x}}^{i},v_{\mathbf{x}}^{j}]=[v_{\xi}^{i},v_{w_{1}}^{i},v_{w_{2}}^{i},v_{\xi}^{j},v_{w_{1}}^{j},v_{w_{2}}^{j}]$.
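For reference, a small sketch that assembles the $3\times 6$ matrix of Eq. 3 for a single edge, with the edge weights chosen as described above; this is an illustrative re-implementation under the stated conventions, not the paper's released code.

```python
# Minimal sketch of the per-edge operator D_e of Eq. (3), acting on the stacked
# primal vector [xi_i, w1_i, w2_i, xi_j, w1_j, w2_j]. `u_i`, `u_j` are the pixel
# coordinates of the two vertices of the edge.
import numpy as np

def edge_operator(u_i, u_j, e_beta=1.0):
    e_alpha = 1.0 / np.linalg.norm(u_i - u_j)  # weight inversely prop. to edge length
    du = u_j - u_i
    D_e = np.array([
        [e_alpha, e_alpha * du[0], e_alpha * du[1], -e_alpha, 0.0, 0.0],
        [0.0,     e_beta,          0.0,              0.0,     -e_beta, 0.0],
        [0.0,     0.0,             e_beta,           0.0,      0.0,   -e_beta],
    ])
    return D_e

# Applied to the primal stack, the first row equals
# e_alpha * (xi_i - xi_j - <w_i, u_i - u_j>), i.e. the planarity residual of NLTGV^2.
```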

III-D Primer on Primal-Dual Optimization

The resulting loss function combining the data terms and the $\text{NLTGV}^{2}$ smoothness term is:

\begin{split}\mathcal{L}(v_{\mathbf{x}})&=\mathcal{L}_{\text{smooth}}(v_{\mathbf{x}})+\lambda\mathcal{L}_{\text{data}}(v_{\mathbf{x}})\\ &=\sum_{e\in\mathcal{E}}\left\|D_{e}\left(v_{\mathbf{x}}^{i},v_{\mathbf{x}}^{j}\right)\right\|_{1}+\lambda\left[\sum_{v\in\mathcal{V}}|v_{\xi}-v_{z}|+\sum_{d\in\mathcal{D}}|\mathbf{a}_{d}\mathbb{v}_{\xi}-b_{d}|\right]\end{split} \qquad (4)

Eq. 4 is a non-smooth convex optimization problem. The lack of smoothness is due to the use of non-differentiable $\ell_{1}$ norms. Fortunately, this type of problem can be solved using first-order primal-dual optimization [39].

Let us first briefly introduce the primal-dual optimization proposed by Chambolle and Pock [39]. The problem considered is the nonlinear primal problem:

\min_{x\in X}F(Kx)+G(x) \qquad (5)

where the map $K:X\rightarrow Y$ is a continuous linear operator, and $G:X\rightarrow[0,+\infty)$ and $F:Y\rightarrow[0,+\infty)$ are proper, convex, lower semi-continuous functions.

This nonlinear problem can be formulated as the following saddle-point problem [39]:

\min_{x\in X}\max_{y\in Y}\langle Kx,y\rangle+G(x)-F^{*}(y) \qquad (6)

where $F^{*}$ is the convex conjugate of $F$. From now on, we refer to the variable $x$ as the primal, and $y$ as the dual. Chambolle and Pock proposed to solve this saddle-point problem by performing descent steps on the primal, and ascent steps on the dual. In particular, for each iteration $n\geq 0$, the algorithm updates $x^{n},y^{n},\bar{x}^{n}$ as follows:

\left\{\begin{aligned} y^{n+1}&=\left(I+\sigma\partial F^{*}\right)^{-1}\left(y^{n}+\sigma K\bar{x}^{n}\right)&\text{(Dual ascent)}\\ x^{n+1}&=(I+\tau\partial G)^{-1}\left(x^{n}-\tau K^{*}y^{n+1}\right)&\text{(Primal descent)}\\ \bar{x}^{n+1}&=x^{n+1}+\theta\left(x^{n+1}-x^{n}\right)&\end{aligned}\right. \qquad (7)

where $\sigma,\tau>0$ are the dual and primal step sizes, and $\theta\in[0,1]$. $K^{*}$ is the adjoint of $K$. $(I+\sigma\partial F^{*})^{-1}$ and $(I+\tau\partial G)^{-1}$ are the resolvent operators, defined as:

x=(I+\tau\partial G)^{-1}(y)=\arg\min_{x}\left\{\frac{\|x-y\|^{2}}{2\tau}+G(x)\right\}

The main assumption of [39] is that the resolvent operators for $F$ and $G$ are easy to compute.
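For intuition, a generic sketch of the iteration in Eq. 7, where the resolvents of $F^{*}$ and $G$ are supplied as callables; the operator $K$, the step sizes, and the iteration count are placeholders, not the values used in the paper.

```python
# Minimal sketch of the generic Chambolle-Pock iteration of Eq. (7). `K` is a
# (sparse) matrix; `prox_Fstar` and `prox_G` are the resolvent operators of F* and G,
# which are assumed cheap to evaluate (the key assumption of [39]).
import numpy as np

def chambolle_pock(K, prox_Fstar, prox_G, x0, sigma, tau, theta=1.0, iters=200):
    x = x0.copy()
    x_bar = x0.copy()
    y = np.zeros(K.shape[0])
    for _ in range(iters):
        y = prox_Fstar(y + sigma * (K @ x_bar), sigma)   # dual ascent
        x_new = prox_G(x - tau * (K.T @ y), tau)          # primal descent
        x_bar = x_new + theta * (x_new - x)               # extra gradient step
        x = x_new
    return x
```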

As noted by Greene and Roy [11], the summation over the edges $\mathcal{E}$ and the summation over the vertices $\mathcal{V}$ in Eq. 4 can be encoded in $F(K\mathbf{x})$ and $G(\mathbf{x})$, respectively, in Eq. 5. It remains to be seen where the third term $|\mathbf{a}_{d}\mathbb{v}_{\xi}-b_{d}|$ fits.

III-E Finding the Optimal Mesh

On the one hand, we could consider the term $|\mathbf{a}_{d}\mathbb{v}_{\xi}-b_{d}|$ to be part of $G(\mathbf{x})$, but then we would need to find its proximal operator (the proximal operator is the resolvent of the subdifferential operator). Unfortunately, there is no known closed-form proximal operator for the $\ell_{1}$ norm of an affine function [40]. On the other hand, we can consider it part of $F(K\mathbf{x})$. In this case, we need to dualize the function with respect to $G(\mathbf{x})$. Unfortunately, doing so introduces as many dual variables as there are pixels in the image, which ideally we would like to avoid for computational reasons. We proceed with the second choice, and implement a fast GPU-based solver (Fig. 3).

The dual representation of $\|A\mathbb{v}_{\xi}-\mathbf{b}\|_{1}$ is given by (Appendix VI):

\max_{\mathbf{p}}\langle A\mathbb{v}_{\xi}-\mathbf{b},\mathbf{p}\rangle-\delta_{P}(\mathbf{p}),

where $\delta_{P}(\mathbf{p})$ is the convex conjugate of the $\ell_{1}$ norm, and corresponds to the indicator function of the set $P=\{\mathbf{p}:\|\mathbf{p}\|_{\infty}\leq 1\}$. Together with the dual representation of the $\text{NLTGV}^{2}$ term [11]:

\left\|D_{e}\left(v_{\mathbf{x}}^{i},v_{\mathbf{x}}^{j}\right)\right\|_{1}=\max_{e_{\mathbf{q}}}\left\langle D_{e}\left(v_{\mathbf{x}}^{i},v_{\mathbf{x}}^{j}\right),e_{\mathbf{q}}\right\rangle-\delta_{Q}\left(e_{\mathbf{q}}\right)

where each $e_{\mathbf{q}}\in\mathbb{R}^{3}$ is a dual variable associated with edge $e$, and $\delta_{Q}(e_{\mathbf{q}})$ is also the convex conjugate of the $\ell_{1}$ norm, we obtain the primal-dual formulation of our original problem:

\begin{aligned}\min_{v_{\mathbf{x}}}\max_{e_{\mathbf{q}},\mathbf{p}}\ &\sum_{e\in\mathcal{E}}\left\langle D_{e}\left(v_{\mathbf{x}}^{i},v_{\mathbf{x}}^{j}\right),e_{\mathbf{q}}\right\rangle-\delta_{Q}\left(e_{\mathbf{q}}\right)\\ &+\lambda\left[\sum_{v\in\mathcal{V}}\left|v_{\xi}-v_{z}\right|+\langle A\mathbb{v}_{\xi}-\mathbf{b},\mathbf{p}\rangle-\delta_{P}(\mathbf{p})\right]\end{aligned} \qquad (8)

Let us now formulate this primal-dual optimization problem under the formalism of Eq. 6. In particular, Eq. 8 can be cast as Eq. 6 by defining $F^{*}$ and $G$ as follows:

\begin{aligned}F^{*}(e_{\mathbf{q}},\mathbf{p})&=\sum_{e\in\mathcal{E}}\delta_{Q}(e_{\mathbf{q}})+\lambda\langle\mathbf{b},\mathbf{p}\rangle+\delta_{P}(\mathbf{p}),\\ G(\mathbf{v}_{\xi})&=\lambda\sum_{v\in\mathcal{V}}|v_{\xi}-v_{z}|.\end{aligned}

We now detail the resolvent operators $\left(I+\sigma\partial F^{*}\right)^{-1}$ and $(I+\tau\partial G)^{-1}$. Since $F^{*}$ is a sum of separable functions, we can first find the resolvent for $\mathbf{q}$ and then for $\mathbf{p}$. For the indicator function of a convex set $\delta_{Q}(\mathbf{q})$, the resolvent operator reduces to point-wise projectors onto unit $\ell_{\infty}$ balls. Similarly, for the sum of the indicator function of a convex set $\delta_{P}(\mathbf{p})$ and the linear map $\lambda\langle\mathbf{b},\mathbf{p}\rangle$, the resolvent operator remains a point-wise projector onto the unit $\ell_{\infty}$ ball, where the argument is offset by $\lambda\mathbf{b}$:

(q,p)=\left(I+\sigma\partial F^{*}\right)^{-1}(\tilde{q},\tilde{p})\Leftrightarrow\left\{\begin{aligned} q_{i}&=\frac{\tilde{q}_{i}}{\max\left(1,\left|\tilde{q}_{i}\right|\right)}\\ p_{j}&=\frac{\tilde{p}_{j}-\lambda b_{j}}{\max\left(1,\left|\tilde{p}_{j}-\lambda b_{j}\right|\right)}\end{aligned}\right.
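In code, this resolvent is a point-wise projection; a minimal sketch following the equations above (with the dual variables stored as NumPy arrays) is:

```python
# Minimal sketch of the resolvent of F*: point-wise projection onto the unit
# l_inf ball, with the p-block offset by lambda * b, as in the equation above.
import numpy as np

def prox_Fstar(q_tilde, p_tilde, b, lam):
    q = q_tilde / np.maximum(1.0, np.abs(q_tilde))   # project q onto ||.||_inf <= 1
    shifted = p_tilde - lam * b
    p = shifted / np.maximum(1.0, np.abs(shifted))   # project the shifted p likewise
    return q, p
```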

Similarly, the resolvent operator for $G(\mathbf{v}_{\xi})$ is given by the point-wise shrinkage operations:

v_{\xi}=(I+\tau\partial G)^{-1}(\tilde{v}_{\xi})\Leftrightarrow v_{\xi}=\left\{\begin{aligned} \tilde{v}_{\xi}-\tau\lambda&\text{ if }\ \tilde{v}_{\xi}-v_{z}>\tau\lambda\\ \tilde{v}_{\xi}+\tau\lambda&\text{ if }\ \tilde{v}_{\xi}-v_{z}<-\tau\lambda\\ v_{z}\quad\ &\text{ if }\ \left|\tilde{v}_{\xi}-v_{z}\right|\leq\tau\lambda\end{aligned}\right.
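The corresponding shrinkage can also be written point-wise; a minimal sketch, assuming $v_{z}$ is stored per vertex, is:

```python
# Minimal sketch of the resolvent of G: point-wise shrinkage (soft-thresholding)
# of the inverse depths towards the VO-tracked inverse depths v_z.
import numpy as np

def prox_G(xi_tilde, v_z, tau, lam):
    r = xi_tilde - v_z
    return np.where(r > tau * lam, xi_tilde - tau * lam,
           np.where(r < -tau * lam, xi_tilde + tau * lam, v_z))
```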

We now detail the primal descent and dual ascent steps. First, we need to define the linear operator $K$ for the dual ascent. The dual ascent over $\mathbf{q},\mathbf{p}$ is given by:

\left(\begin{aligned}&\mathbf{q}^{n+1}\\ &\mathbf{p}^{n+1}\end{aligned}\right)=\left(I+\sigma\partial F^{*}\right)^{-1}\left(\begin{aligned}&\mathbf{q}^{n}+\sigma D\mathbf{v}_{\mathbf{x}}\\ &\mathbf{p}^{n}+\sigma\lambda A\mathbf{v}_{\xi}\end{aligned}\right) \qquad (9)

where we use $D$ to represent the linear operator that stacks all of the $D_{e}$ operators acting on a per-edge basis, and $\sigma>0$ is the dual step size.

Let us now define the adjoint $K^{*}$ for the primal descent over $v_{\xi}$. Since we have formulated our linear operators as matrices $D_{e}$ and $A$, the adjoints are simply the transposes of these matrices: $D_{e}^{*}=D_{e}^{T}$ and $A^{*}=A^{T}$. We now further decompose $D_{e}^{*}$ to make the subsequent primal descent iteration explicit. $D_{e}^{*}$ operates on the incoming edges $\mathcal{N}_{in}(v)$ and outgoing edges $\mathcal{N}_{out}(v)$ of a vertex $v$, as noted by [11], and can therefore be decomposed as:

D_{e}^{*}\left(e_{\mathbf{q}}\right)=D_{e}^{T}e_{\mathbf{q}}=\left[\begin{array}{c}e_{\alpha}e_{q_{1}}\\ e_{\alpha}\left(v_{u_{1}}^{j}-v_{u_{1}}^{i}\right)e_{q_{1}}+e_{\beta}e_{q_{2}}\\ e_{\alpha}\left(v_{u_{2}}^{j}-v_{u_{2}}^{i}\right)e_{q_{1}}+e_{\beta}e_{q_{3}}\\ -e_{\alpha}e_{q_{1}}\\ -e_{\beta}e_{q_{2}}\\ -e_{\beta}e_{q_{3}}\end{array}\right]=\left[\begin{array}{c}D_{in}^{*}\left(e_{\mathbf{q}}\right)\\ D_{out}^{*}\left(e_{\mathbf{q}}\right)\end{array}\right]

where $D_{in}^{*}$ and $D_{out}^{*}$ partition $D_{e}^{*}$ into the first and last three rows, respectively. Furthermore, we need to make the primal updates on the inverse depths explicit, since the adjoint $A^{*}$ only affects these. Therefore, we use:

D_{e}^{*}\left(e_{\mathbf{q}}\right)=\left[\begin{array}{c}D_{in}^{*}\left(e_{\mathbf{q}}\right)\\ D_{out}^{*}\left(e_{\mathbf{q}}\right)\end{array}\right]=\left[\begin{array}{c}D_{in_{\xi}}^{*}\left(e_{\mathbf{q}}\right)\\ D_{in_{\mathbf{w}}}^{*}\left(e_{\mathbf{q}}\right)\\ D_{out_{\xi}}^{*}\left(e_{\mathbf{q}}\right)\\ D_{out_{\mathbf{w}}}^{*}\left(e_{\mathbf{q}}\right)\end{array}\right]
Figure 3: Graphical representation of the computation required to evaluate the primal contribution in Eq. 12 due to the $H\times W$ dual variable $\mathbf{p}^{n+1}$ (dark blue). Starting from the CPU (bottom right), we forward the current inverse depths $v_{\xi}^{n}$ of the mesh to the GPU (pink column). Then, we rasterize an interpolated inverse depth for the whole image (on the GPU) using the barycentric coordinates of each pixel (red, sparse). We subtract the measured depth (yellow) from the interpolated depth (green), and add the previous dual value $\mathbf{p}^{n}$ (light blue). The result is the next dual value $\mathbf{p}^{n+1}$ (dark blue). The primal contribution (dark pink) is then calculated by element-wise multiplication with the barycentric coordinates (red, sparse), and sent back to the CPU for a primal update. The process is repeated until convergence.

The descent equations for each $v_{\xi}$ are then given by:

v_{\xi}^{n+1}=(I+\tau\partial G)^{-1}\bigg{(}v_{\xi}^{n}-\tau\sum_{e\in\mathcal{N}_{in}(v)}D_{in_{\xi}}^{*}\left(e_{\mathbf{q}}^{n+1}\right)-\tau\sum_{e\in\mathcal{N}_{out}(v)}D_{out_{\xi}}^{*}\left(e_{\mathbf{q}}^{n+1}\right)-\tau\lambda[A^{*}\mathbf{p}^{n+1}]_{v}\bigg{)}, \qquad (10)

where $\tau>0$ is the primal step size, and $[\cdot]_{v}$ in $[A^{*}\mathbf{p}^{n+1}]_{v}$ extracts the row corresponding to vertex $v$. The descent equations for each $v_{\mathbf{w}}$ are given by:

v_{\mathbf{w}}^{n+1}=v_{\mathbf{w}}^{n}-\tau\sum_{e\in\mathcal{N}_{in}(v)}D_{in_{\mathbf{w}}}^{*}\left(e_{\mathbf{q}}^{n+1}\right)-\tau\sum_{e\in\mathcal{N}_{out}(v)}D_{out_{\mathbf{w}}}^{*}\left(e_{\mathbf{q}}^{n+1}\right). \qquad (11)

With the primal descent Eqs. 10 and 11 and dual ascent Eq. 9 in hand, we can now iteratively apply Eq. 7 until convergence.

In summary, our optimization proceeds as follows:

\left\{\begin{aligned} &\quad\quad\quad\underline{\text{Dual ascent:}}\\ \left(\begin{aligned}&\mathbf{q}^{n+1}\\ &\mathbf{p}^{n+1}\end{aligned}\right)&=\left(I+\sigma\partial F^{*}\right)^{-1}\left(\begin{aligned}&\mathbf{q}^{n}+\sigma D\mathbf{v}_{\mathbf{x}}\\ &\mathbf{p}^{n}+\sigma\lambda A\mathbf{v}_{\xi}\end{aligned}\right)\\ &\quad\quad\quad\underline{\text{Primal descent:}}\\ \left(\begin{aligned}&v_{\xi}^{n+1}\\ &v_{\mathbf{w}}^{n+1}\end{aligned}\right)&=\left(I+\tau\partial G\right)^{-1}\left(\begin{aligned}v_{\xi}^{n}&-\tau D^{*}_{\xi}e_{\mathbf{q}}^{n+1}-\tau\lambda[A^{*}\mathbf{p}^{n+1}]_{v}\\ v_{\mathbf{w}}^{n}&-\tau D^{*}_{\mathbf{w}}e_{\mathbf{q}}^{n+1}\end{aligned}\right)\\ &\quad\quad\quad\underline{\text{Extra gradient step:}}\\ \bar{v}^{n+1}_{\mathbf{x}}&=v_{\mathbf{x}}^{n+1}+\theta\left(v_{\mathbf{x}}^{n+1}-v_{\mathbf{x}}^{n}\right)\end{aligned}\right. \qquad (12)

and we implement the dual step in the GPU as illustrated in Fig. 3.
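Putting the pieces together, the following sketch composes the helpers from the previous snippets (edge_operator, prox_G, dual_pixel_step) into the per-frame iteration of Eq. 12. Step sizes, initialization, and the fixed iteration count are illustrative choices, not the paper's tuned values.

```python
# End-to-end sketch of the per-frame mesh optimization of Eq. (12), reusing the
# helpers sketched above. All parameters below are placeholders for illustration.
import numpy as np

def optimize_mesh(vertices_uv, edges, tri_idx, bary, b, v_z,
                  sigma=0.1, tau=0.1, lam=1.0, theta=1.0, iters=50):
    n_v = len(vertices_uv)
    xi = v_z.copy()                       # initialize inverse depths from VO / depth
    w = np.zeros((n_v, 2))                # per-vertex normal-related auxiliary variables
    xi_bar, w_bar = xi.copy(), w.copy()
    q = np.zeros((len(edges), 3))         # per-edge dual variables
    p = np.zeros(bary.shape[0])           # per-pixel dual variables
    D = [edge_operator(vertices_uv[i], vertices_uv[j]) for i, j in edges]
    for _ in range(iters):
        # --- dual ascent on q (per edge) and p (per pixel) ---
        grad_xi = np.zeros(n_v)
        grad_w = np.zeros((n_v, 2))
        for k, (i, j) in enumerate(edges):
            stack = np.array([xi_bar[i], *w_bar[i], xi_bar[j], *w_bar[j]])
            q_t = q[k] + sigma * (D[k] @ stack)
            q[k] = q_t / np.maximum(1.0, np.abs(q_t))   # project onto unit l_inf ball
            back = D[k].T @ q[k]                        # adjoint D_e^* applied to q
            grad_xi[i] += back[0]; grad_w[i] += back[1:3]
            grad_xi[j] += back[3]; grad_w[j] += back[4:6]
        p, contrib = dual_pixel_step(bary, tri_idx, xi_bar, b, p, sigma, lam, n_v)
        # --- primal descent on xi and w, then extra gradient step ---
        xi_new = prox_G(xi - tau * (grad_xi + lam * contrib), v_z, tau, lam)
        w_new = w - tau * grad_w
        xi_bar = xi_new + theta * (xi_new - xi)
        w_bar = w_new + theta * (w_new - w)
        xi, w = xi_new, w_new
    return xi, w
```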

IV Experimental Results

We benchmark the proposed approach on real and simulated datasets, and evaluate map estimation accuracy, as well as runtime.

The EuRoC dataset [12] provides visual and inertial data recorded from a micro aerial vehicle flying indoors. We use the Vicon Room (V) data, which is similar to an office room where walls, floor, and ceiling are visible, as well as other planar surfaces (boxes, stacked mattresses).

We also use the simulated uHumans2 dataset [8] to evaluate the accuracy of the mesh reconstruction with RGB-D data. The uHumans2 dataset is generated using a photorealistic Unity-based simulator, and provides RGB-D data and ground truth of the geometry of the scene. Using this dataset, we can also provide an ablation study on the effect that the number of Steiner points has on the reconstruction (without being influenced by stereo reconstruction errors). We use the ‘Apartment’ scene of the uHumans2 dataset.

Compared techniques. To assess the advantages of our proposed approach, we compare against the original FLaME formulation, as well as against VO pipelines. We also evaluate both with RGB-D data, using the uHumans2 dataset, and with dense stereo reconstruction, using the EuRoC dataset.

IV-A Evaluation in EuRoC

Figure 4: (First) Textured 3D mesh showing the proposed mesh reconstruction by using dense stereo depth estimation in the EuRoC dataset[12]. (Second) Wireframe of the mesh. (Third) Normal map showing the smoothness of the 3D mesh. (Fourth) Color-coded depth map, brighter colors correspond to closer depths.

The EuRoC dataset provides ground-truth 3D reconstructions of the scene for the Vicon Room (V) datasets. Therefore, we can directly evaluate the accuracy of the per-frame 3D mesh reconstructions. Nevertheless, the dataset was recorded with a stereo camera instead of an RGB-D camera. Hence, we perform dense stereo reconstruction (using OpenCV’s block-matching approach) to fit our 3D mesh.

Table I compares our approach with two VO pipelines, LSD-SLAM [41] and MLM [42], as well as FLaME [11], which is the work closest to ours. LSD-SLAM is a semi-dense VO pipeline, and as such, it only partially reconstructs the scene. While being semi-dense, LSD-SLAM remains practically unable to reconstruct the scene, staying below 25% in terms of accurate depth estimates. MLM is built on top of LSD-SLAM and was designed to densify LSD-SLAM's map; as such, it clearly achieves denser reconstructions than its counterpart (approximately 50% denser). FLaME further densifies the 3D reconstruction by using 3D meshes and a primal-dual formulation like ours. Their approach clearly improves over the MLM estimates (by 42% on average), but most importantly, provides topological information through the 3D mesh. Our approach further uses the depth data coming from RGB-D or stereo, and fuses this information into the 3D mesh, which further densifies the 3D reconstruction, while also preserving the topological properties of the scene. Fig. 4 shows the quality of the reconstruction in EuRoC.

TABLE I: Fraction of inverse depths per depth-map within 10 percent of ground-truth for LSD-SLAM[41], MLM[42], FLaME [11], and our approach. Our approach enables more accurate and denser reconstructions. In bold the best result.
EuRoC Accurate Inverse Depth Density [%]
Sequence LSD MLM FLaME Ours
V1_01 18.2 26.4 37.8 41.7
V1_02 18.3 27.9 40.5 44.8
V1_03 11.0 17.0 27.0 33.1
V2_01 25.9 39.3 48.1 52.9
V2_02 20.5 28.9 37.2 44.0
V2_03 11.5 19.0 28.9 34.8

IV-B Evaluation in uHumans2 Dataset

Using the RGB-D images of the uHumans2 dataset, we run and evaluate our approach with an increasing number of Steiner points. Table II shows that increasing the number of Steiner points has a clear positive effect on the quality of the depth estimates. The improvements are most noticeable when comparing the use of no Steiner points (No Steiner) against adding Steiner points every 100 pixels ($S=100$). Adding further Steiner points increases the accuracy up to around one Steiner point every 10 pixels. Note that the computation time increases with the number of points added, and beyond $S=50$ our approach might not be fast enough for real-time operation. Fig. 2 shows a reconstruction using the $S=50$ configuration.

TABLE II: Fraction of inverse depths per depth-map within 10 percent of ground-truth for our approach using an increasing number of Steiner points (one every $S$ pixels). The accuracy and density of our reconstructions increase with the number of Steiner points added. In bold the best result.
uHumans2 Accurate Inverse Depth Density [%] (Timing [ms])
Sequence Ours (No Steiner) Ours ($S=100$) Ours ($S=50$) Ours ($S=10$)
Apart_01 43.2 (9.1) 46.4 (10.9) 53.8 (12.4) 56.7 (41)
Apart_02 42.3 (10.2) 49.9 (11.9) 52.5 (13.4) 51.2 (49)
Apart_03 43.0 (9.8) 48.0 (11.8) 53.0 (13.3) 53.9 (43)

IV-C Timing

All timings were obtained using an Nvidia RTX 2080 Ti GPU and an Intel i9 CPU. The computation of the dual contribution takes less than 2.1 ms overall (worst-case timing recorded, including data transfer). Nevertheless, since this operation needs to be repeated several times until convergence, we record an average processing time of 13.2 ms. This includes performing the Delaunay triangulation, uploading depth and 2D mesh coordinates to the GPU, solving the dual problem, sending the primal contribution to the CPU, optimizing, and repeating the process until convergence (for 6 iterations). Ultimately, the bottleneck in terms of computation comes from Kimera-VIO, which takes 20 ms per frame, despite being an optimized VIO pipeline with its frontend and backend working in parallel.

V Conclusion

By leveraging the triangulated landmarks from visual odometry and the depth from RGB-D data (or stereo), we present a 3D reconstruction algorithm capable of building a smooth, compact, and accurate 3D mesh of the scene. The proposed algorithm shows that dense depth fusion into a consistent 3D mesh is not only possible, but also computationally fast.

Finally, while the results presented are promising, we are not yet fusing the 3D mesh into a consistent multi-frame scene representation. Therefore, a promising line for future work is to use the per-frame 3D mesh to achieve multi-view 3D scene reconstruction. With the accuracy and speed afforded by the presented approach, it becomes possible to build a map without relying on intermediate volumetric representations.

References

  • [1] A. Rosinol, M. Abate, Y. Chang, and L. Carlone, “Kimera: an open-source library for real-time metric-semantic localization and mapping,” in IEEE Intl. Conf. on Robotics and Automation (ICRA), 2020, arXiv preprint arXiv: 1910.02490, (video), (code), (pdf).
  • [2] A. Mourikis and S. Roumeliotis, “A multi-state constraint Kalman filter for vision-aided inertial navigation,” in IEEE Intl. Conf. on Robotics and Automation (ICRA), April 2007, pp. 3565–3572.
  • [3] S. Leutenegger, S. Lynen, M. Bosse, R. Siegwart, and P. Furgale, “Keyframe-based visual-inertial slam using nonlinear optimization,” Intl. J. of Robotics Research, no. 3, pp. 314–334, 2015.
  • [4] C. Forster, L. Carlone, F. Dellaert, and D. Scaramuzza, “On-manifold preintegration for real-time visual-inertial odometry,” IEEE Trans. Robotics, vol. 33, no. 1, pp. 1–21, 2017, arxiv preprint: 1512.02363, (pdf), technical report GT-IRIM-CP&R-2015-001.
  • [5] R. Mur-Artal and J. Tardós, “Visual-inertial monocular SLAM with map reuse,” ArXiv preprint:1610.05949, 2016.
  • [6] T. Schöps, J. L. Schönberger, S. Galliani, T. Sattler, K. Schindler, M. Pollefeys, and A. Geiger, “A multi-view stereo benchmark with high-resolution images and multi-camera videos,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [7] S. Pillai, S. Ramalingam, and J. Leonard, “High-performance and tunable stereo reconstruction,” in IEEE Intl. Conf. on Robotics and Automation (ICRA), 2016.
  • [8] A. Rosinol, A. Violette, M. Abate, N. Hughes, Y. Chang, J. Shi, A. Gupta, and L. Carlone, “Kimera: from SLAM to spatial perception with 3D dynamic scene graphs,” arXiv preprint arXiv: 2101.06894, 2021, (pdf).
  • [9] R. Newcombe, A. Davison, S. Izadi, P. Kohli, O. Hilliges, J. Shotton, D. Molyneaux, S. Hodges, D. Kim, and A. Fitzgibbon, “KinectFusion: Real-time dense surface mapping and tracking,” in IEEE and ACM Intl. Sym. on Mixed and Augmented Reality (ISMAR), 2011, pp. 127–136.
  • [10] A. Rosinol, T. Sattler, M. Pollefeys, and L. Carlone, “Incremental Visual-Inertial 3D Mesh Generation with Structural Regularities,” in IEEE Intl. Conf. on Robotics and Automation (ICRA), 2019, (pdf), (web). [Online]. Available: https://www.mit.edu/%7Earosinol/research/struct3dmesh.html
  • [11] W. N. Greene and N. Roy, “Flame: Fast lightweight mesh estimation using variational smoothing on delaunay graphs,” in 2017 IEEE International Conference on Computer Vision (ICCV).   IEEE, 2017, pp. 4696–4704.
  • [12] M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. Achtelik, and R. Siegwart, “The EuRoC micro aerial vehicle datasets,” Intl. J. of Robotics Research, 2016.
  • [13] A. Rosinol, A. Gupta, M. Abate, J. Shi, and L. Carlone, “3D dynamic scene graphs: Actionable spatial perception with places, objects, and humans,” in Robotics: Science and Systems (RSS), 2020, (pdf), (video).
  • [14] B. Curless and M. Levoy, “A volumetric method for building complex models from range images,” in SIGGRAPH, 1996, pp. 303–312.
  • [15] T. Schöps, T. Sattler, and M. Pollefeys, “Surfelmeshing: Online surfel-based mesh reconstruction,” IEEE transactions on pattern analysis and machine intelligence, vol. 42, no. 10, pp. 2494–2507, 2019.
  • [16] Q. Pan, G. Reitmayr, and T. Drummond, “ProFORMA: Probabilistic feature-based on-line rapid model acquisition,” in British Machine Vision Conference (BMVC), 2009.
  • [17] O. D. Faugeras, E. L. Bras-Mehlman, and J. D. Boissonnat, “Representing stereo data with the Delaunay triangulation,” Artif. Intell., vol. 44, no. 1-2, pp. 41–87, 1990.
  • [18] E. Piazza, A. Romanoni, and M. Matteucci, “Real-time cpu-based large-scale 3d mesh reconstruction,” arXiv preprint arXiv:1801.05230, 2018.
  • [19] M. Lhuillier, “Surface reconstruction from a sparse point cloud by enforcing visibility consistency and topology constraints,” Computer Vision and Image Understanding, vol. 175, pp. 52–71, 2018.
  • [20] M. Lhuillier and S. Yu, “Manifold surface reconstruction of an environment from sparse structure-from-motion data,” Computer Vision and Image Understanding, vol. 117, no. 11, pp. 1628–1644, 2013.
  • [21] A. Bódis-Szomorú, H. Riemenschneider, and L. Van Gool, “Efficient edge-aware surface mesh reconstruction for urban scenes,” Computer Vision and Image Understanding, vol. 157, pp. 3–24, 2017.
  • [22] G. Turk and M. Levoy, “Zippered polygon meshes from range images,” in Proceedings of the 21st annual conference on Computer graphics and interactive techniques, 1994, pp. 311–318.
  • [23] L. Teixeira and M. Chli, “Real-time mesh-based scene estimation for aerial inspection,” in IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS).   IEEE, 2016, pp. 4863–4869.
  • [24] S. Pütz, T. Wiemann, J. Sprickerhof, and J. Hertzberg, “3D navigation mesh generation for path planning in uneven terrain,” IFAC-PapersOnLine, vol. 49, no. 15, pp. 212–217, 2016.
  • [25] F. Ruetz, E. Hernández, M. Pfeiffer, H. Oleynikova, M. Cox, T. Lowe, and P. Borges, “Ovpc mesh: 3d free-space representation for local ground vehicle navigation,” in 2019 International Conference on Robotics and Automation (ICRA).   IEEE, 2019, pp. 8648–8654.
  • [26] M. Yokozuka, S. Oishi, S. Thompson, and A. Banno, “VITAMIN-E: visual tracking and mapping with extremely dense feature points,” CoRR, vol. abs/1904.10324, 2019.
  • [27] R. Newcombe and A. Davison, “Live dense reconstruction with a single moving camera,” in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR).   IEEE, 2010, pp. 1498–1505.
  • [28] S. Lovegrove, “Parametric dense visual slam,” Ph.D. dissertation, 2012.
  • [29] R. Newcombe, S. Lovegrove, and A. Davison, “DTAM: Dense tracking and mapping in real-time,” in Intl. Conf. on Computer Vision (ICCV), Barcelona, Spain, Nov 2011, pp. 2320–2327.
  • [30] F. Milano, A. Loquercio, A. Rosinol, D. Scaramuzza, and L. Carlone, “Primal-dual mesh convolutional neural networks,” in Conference on Neural Information Processing Systems (NeurIPS), 2020. [Online]. Available: https://github.com/MIT-SPARK/PD-MeshNet
  • [31] R. Hanocka, A. Hertz, N. Fish, R. Giryes, S. Fleishman, and D. Cohen-Or, “MeshCNN: a network with an edge,” ACM Trans. Graph., vol. 38, no. 4, pp. 1–12, 2019.
  • [32] M. Bloesch, T. Laidlow, R. Clark, S. Leutenegger, and A. J. Davison, “Learning meshes for dense visual slam,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 5855–5864.
  • [33] Q. Feng and N. Atanasov, “Mesh reconstruction from aerial images for outdoor terrain mapping using joint 2d-3d learning,” arXiv preprint arXiv:2101.01844, 2021.
  • [34] J. D. Foley, F. D. Van, A. Van Dam, S. K. Feiner, J. F. Hughes, and J. Hughes, Computer graphics: principles and practice.   Addison-Wesley Professional, 1996, vol. 12110.
  • [35] M. De Berg, M. Van Kreveld, M. Overmars, and O. Schwarzkopf, “Computational geometry,” in Computational geometry.   Springer, 1997, pp. 1–17.
  • [36] K. Bredies, K. Kunisch, and T. Pock, “Total generalized variation,” SIAM Journal on Imaging Sciences, vol. 3, no. 3, pp. 492–526, 2010.
  • [37] R. Ranftl, K. Bredies, and T. Pock, “Non-local total generalized variation for optical flow estimation,” in European Conference on Computer Vision.   Springer, 2014, pp. 439–454.
  • [38] P. Pinies, L. M. Paz, and P. Newman, “Dense mono reconstruction: Living with the pain of the plain plane,” in 2015 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2015, pp. 5226–5231.
  • [39] A. Chambolle and T. Pock, “A first-order primal-dual algorithm for convex problems with applications to imaging,” Journal of mathematical imaging and vision, vol. 40, no. 1, pp. 120–145, 2011.
  • [40] A. Beck, First-order methods in optimization.   SIAM, 2017.
  • [41] J. Engel, T. Schöps, and D. Cremers, “LSD-SLAM: Large-scale direct monocular SLAM,” in European Conf. on Computer Vision (ECCV), 2014, pp. 834–849.
  • [42] W. N. Greene, K. Ok, P. Lommel, and N. Roy, “Multi-level mapping: Real-time dense monocular slam,” in 2016 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2016, pp. 833–840.
  • [43] S. Boyd and L. Vandenberghe, Convex optimization.   Cambridge University Press, 2004.

VI Appendix

Let us show that the following dual representation holds:

\|A\bm{\xi}-\mathbf{b}\|_{1}=\max_{\mathbf{p}\in\mathbb{R}^{n}}\langle A\bm{\xi}-\mathbf{b},\mathbf{p}\rangle-\delta_{P}(\mathbf{p}),

where $\bm{\xi},\mathbf{b},\mathbf{p}\in\mathbb{R}^{n}$, $A\in\mathbb{R}^{n\times n}$, and $\delta_{P}(\mathbf{p})$ is the indicator function of the convex set $P=\{\mathbf{p}:\|\mathbf{p}\|_{\infty}\leq 1\}$. We assume in what follows that $A$ is nonsingular.

VI-1 Proposition [43, Section 3.3]

Since $A$ is nonsingular, the conjugate of $g(\mathbf{x})=f(A\mathbf{x}+\mathbf{b})$ is:

g^{*}(\mathbf{x})=f^{*}\left(A^{-T}\mathbf{x}\right)-\mathbf{b}^{T}A^{-T}\mathbf{x},

with $\mathbf{dom}~g^{*}=A^{T}\mathbf{dom}~f^{*}$, where $f^{*}$ is the conjugate of $f$.

VI-2 Proposition [43, Section 3.3]

The convex conjugate of the $\ell_{1}$ norm is the indicator function $\delta_{Y}(\mathbf{y})$ of the convex set $Y=\{\mathbf{y}:\|\mathbf{y}\|_{\infty}\leq 1\}$.

Using Propositions 1 and 2, we have that the convex conjugate of $g(\mathbf{x})=\|A\mathbf{x}-\mathbf{b}\|_{1}=f(A\mathbf{x}-\mathbf{b})$ is:

g^{*}(\mathbf{x})=\delta_{Y}(A^{-T}\mathbf{x})+\mathbf{b}^{T}A^{-T}\mathbf{x},

where $Y=\{A^{-T}\mathbf{x}:\|A^{-T}\mathbf{x}\|_{\infty}\leq 1\}$, and $\mathbf{dom}~g^{*}=A^{T}\mathbf{dom}~f^{*}=\{\mathbf{x}:\|A^{-T}\mathbf{x}\|_{\infty}\leq 1\}$. Let us take the convex conjugate of $g^{*}$; by definition, we have:

g^{**}(\mathbf{y})=\max_{\mathbf{x}}\langle\mathbf{y},\mathbf{x}\rangle-\left[\delta_{Y}(A^{-T}\mathbf{x})+\mathbf{b}^{T}A^{-T}\mathbf{x}\right]

Taking $\mathbf{x}^{\prime}=A^{-T}\mathbf{x}$, we have:

\begin{aligned} g^{**}(\mathbf{y})&=\max_{\mathbf{x}^{\prime}}\langle\mathbf{y},A^{T}\mathbf{x}^{\prime}\rangle-\left[\delta_{X^{\prime}}(\mathbf{x}^{\prime})+\mathbf{b}^{T}\mathbf{x}^{\prime}\right]\\ &=\max_{\mathbf{x}^{\prime}}\langle A\mathbf{y}-\mathbf{b},\mathbf{x}^{\prime}\rangle-\delta_{X^{\prime}}(\mathbf{x}^{\prime}),\end{aligned}

where $X^{\prime}=\{\mathbf{x}^{\prime}:\|\mathbf{x}^{\prime}\|_{\infty}\leq 1\}$.

Since $g(\mathbf{x})$ is a proper, lower semi-continuous, and convex function, we have $g^{**}=g$ by the Fenchel–Moreau theorem. Therefore,

g^{**}(\bm{\xi})=g(\bm{\xi})=\|A\bm{\xi}-\mathbf{b}\|_{1}=\max_{\mathbf{p}}\langle A\bm{\xi}-\mathbf{b},\mathbf{p}\rangle-\delta_{P}(\mathbf{p}),

with $P=\{\mathbf{p}:\|\mathbf{p}\|_{\infty}\leq 1\}$.
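A quick numerical sanity check of this dual representation (a sketch with random data, for illustration only): the maximum over the unit $\ell_{\infty}$ ball is attained at $\mathbf{p}=\operatorname{sign}(A\bm{\xi}-\mathbf{b})$ and recovers the $\ell_{1}$ norm.

```python
# Numerical sanity check (sketch) of the dual representation derived above:
# ||A xi - b||_1 equals max over ||p||_inf <= 1 of <A xi - b, p>, attained at sign(.).
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
xi = rng.standard_normal(5)
b = rng.standard_normal(5)

r = A @ xi - b
primal = np.abs(r).sum()            # ||A xi - b||_1
p_star = np.sign(r)                 # maximizer within { p : ||p||_inf <= 1 }
dual = r @ p_star                   # <A xi - b, p*>
print(np.isclose(primal, dual))     # True
```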