
Smooth Mesh Estimation from Depth Data
using Non-Smooth Convex Optimization

Antoni Rosinol1, Luca Carlone1 1A. Rosinol and L. Carlone are with the Laboratory for Information & Decision Systems (LIDS), Massachusetts Institute of Technology, Cambridge, MA, USA, {arosinol,lcarlone}@mit.eduThis work was partially funded by ARL DCIST CRA W911NF-17-2-0181, and Lincoln Laboratory’s “Resilient Perception in Degraded Environments” program.
Abstract

Meshes are commonly used as 3D maps since they encode the topology of the scene while being lightweight. Unfortunately, 3D meshes are mathematically difficult to handle directly because of their combinatorial and discrete nature. Therefore, most approaches generate 3D meshes of a scene after fusing depth data using volumetric or other representations. Nevertheless, volumetric fusion remains computationally expensive both in terms of speed and memory. In this paper, we leapfrog these intermediate representations and build a 3D mesh directly from a depth map and the sparse landmarks triangulated with visual odometry. To this end, we formulate a non-smooth convex optimization problem that we solve using a primal-dual method. Our approach generates a smooth and accurate 3D mesh that substantially improves the state-of-the-art on direct mesh reconstruction while running in real-time.

Index Terms:
Mesh, Mapping, Vision-Based Navigation

I Introduction

Fast, lightweight, and accurate 3D representations of the scene are of utmost importance for computationally constrained robots, such as flying drones, or high-stakes applications, such as self-driving cars. Failing to map the scene densely and accurately in a timely manner can lead to collisions, which result in damaged equipment or, worse, fatalities. Being able to densely reconstruct the scene with a camera in real-time also opens up new applications in augmented reality, online real-life gaming, and search-and-rescue operations.

Therefore, in this work, we propose a 3D scene reconstruction method that leverages depth data (either from RGB-D or mono/stereo reconstruction) and recent advances in the field of convex variational optimization to build a 3D mesh that faithfully represents the scene while working in real-time.

Our work can be used with any available Visual Odometry (VO) pipeline. In our case, we use a Visual Inertial Odometry (VIO) pipeline, Kimera [1], which achieves very accurate pose estimates in real-time by tracking 2D features from frame to frame and using inertial information from an Inertial Measurement Unit (IMU).

Figure 1: We propose a 3D mesh reconstruction method from input depth maps (blue rays) coming from RGB-D or dense stereo matching, and the sparse landmarks from visual odometry (VO) (green rays). We formulate the problem as a variational convex optimization problem which we solve in real-time and to optimality. The cost function measures the fitness of the mesh with respect to the depth input (red arrows), and the visual odometry landmarks, as well as the smoothness of the mesh. The optimization is done over the inverse depth of the vertices of the mesh (orange arrows).

VO methods typically provide a sparse map of the scene. In particular, feature-based methods [2, 3, 4, 5] produce a sparse point cloud that is not directly usable for path planning or obstacle avoidance. In those cases, a denser map is built subsequently, e.g., by using (multi-view) stereo algorithms [6, 7]. Nevertheless, with current broad availability of depth information from either RGB-D, LIDAR, or stereo cameras, it is possible to fuse the depth online using volumetric approaches [8, 9]. Unfortunately, such volumetric approaches are memory intensive, and slow if they are not heavily parallelized using powerful GPUs.

Ideally, a map representation should be (i) lightweight to compute and store, and (ii) capable of representing the topology of the environment. In this work, we use a 3D mesh as our map representation (Fig. 1). A 3D mesh is lightweight, since it can represent piecewise-planar surfaces with very few parameters (compared to dense pointclouds or volumetric representations), while providing information about the topology of the scene through its connectivity information.

Recent approaches have tried to build the mesh over the set of sparse 3D landmarks triangulated by a VO/VIO pipeline [10, 11]. Nevertheless, these approaches fail to reconstruct an accurate mesh of the scene, since they discard most of the information present in the input data.

Contributions. In this paper, we propose a 3D mesh reconstruction algorithm that summarizes the information present in depth-maps and tracked 3D landmarks from visual odometry. In this way, we can build denser and more accurate maps than sparse approaches, while preserving real-time performance. We experimentally evaluate our approach on the EuRoC dataset [12] and the uHumans2 dataset [13, 8], and compare against competing approaches. Our evaluation shows that (i) the proposed approach produces a lightweight representation of the environment that captures the geometry of the scene, (ii) leveraging variational optimization enables real-time, accurate, and denser reconstructions.

II Related Work

While there exist a myriad of representations for 3D mapping, we focus our related work review on approaches using meshes. Despite the fact that volumetric approaches are typically used to build a mesh [14], and other authors have tried to mesh intermediate representations such as surfels [15], we focus on works that build a 3D mesh directly from input data either in the form of RGB images or RGB-D data.

Meshes were one of the first dense map representations used in the nineties when GPUs were not available. Early on, two different types of meshes were introduced: 3D tetrahedral meshes and 2D triangular meshes (embedded in 3D).

ProForma [16] was one of the first works to reconstruct a detailed map of the scene using a tetrahedral 3D mesh. Unfortunately, tetrahedral 3D meshes share similar costs and benefits with volumetric approaches, and hence they were initially mostly used in small-scale scenes [17, 16]. More recent works have used tetrahedral 3D meshes for real-time reconstruction of the scene [18, 19], but the quality and artifacts of the reconstructions remain a problem. In particular, reconstructing a manifold mesh remains challenging, and many authors have tried to enforce manifoldness with different degrees of success [20, 21]. While 3D tetrahedral meshes are becoming increasingly capable, they remain challenging to use, are computationally expensive, and lack robustness.

In this work, we focus instead on building 2D triangular meshes embedded in 3D space. Turk and Levoy [22] proposed Zippered Meshes, an early attempt to fuse depth information into 2D triangular meshes by ‘sewing’ them from frame to frame. In SLAM, several works have tried to build a 3D mesh directly by triangulating the features in the image, followed by a re-projection in 3D space using the depth estimated from triangulation. While very fast in practice, these approaches trade off quality and accuracy for speed. Their application has therefore mostly been geared towards fast obstacle avoidance and local path-planning, particularly for resource-constrained robots such as drones [23, 11, 10], and also ground robots [24, 25].

One of the major problems when reconstructing a 2D mesh from input data is its lack of robustness to outliers and of smoothness in the presence of noise. By leveraging recent insights in variational smoothing of dense depth maps, Greene and Roy [11] propose FLaME, a discrete formulation using a 3D mesh instead of dense depth maps, thereby achieving substantial computational savings while reconstructing a smooth 3D mesh. Since it does not use the dense depth maps readily available from RGB-D, FLaME suffers from inaccurate reconstructions. In the same vein, VITAMIN-E [26] shows that tracking a substantially higher number of 2D features, and thereby providing denser depth maps, improves the 3D reconstruction accuracy. It also uses the variational framework proposed in FLaME, but uses all tracked features as vertices of the 3D mesh, rather than summarizing the depth map into a simplified and coherent surface reconstruction. Finally, [10] proposed to smooth the 3D mesh by leveraging the least-squares optimization performed in VO/VIO, and enforcing structural regularities (such as planarity). While this approach allowed the reconstruction of a multi-frame 3D mesh in real-time, the accuracy remained low due to the sparsity of the features tracked in SLAM. To cope with the low accuracy of these reconstructions, both VITAMIN-E [26] and Rosinol et al. [10] fuse the depth into a volumetric representation [1, 13, 8]. Similarly, while Newcombe et al. [27] and Lovegrove [28] initially attempted to build 3D meshes from sparse features, they concluded that dense all-pixel methods with volumetric maps lead to far superior accuracy [29], despite the heavy use of GPUs.

In this work, we maintain the speed and low computational budget of early approaches, despite the use of a GPU, by re-using the feature tracks of the visual pipeline, but we use a dense depth map, from RGB-D or stereo, to build an improved and smoothed 3D mesh that summarizes all the depth information. The resulting 3D mesh optimally fits the dense depth maps, and results in an accurate and lightweight representation of the scene. To conclude our review, we note that, more recently, non-Euclidean deep learning approaches have started to work with 3D meshes and are becoming increasingly popular [30, 31, 32, 33]. Unfortunately, these approaches remain slow and require powerful GPUs for computation.

Figure 2: (Left) Textured 3D mesh reconstruction, from the presented approach, of a planar surface in the uHumans2 dataset [8]. (Right) Wireframe of the mesh, showing both interest points (around the painting) and Steiner points attached to the wall. We add Steiner points every 50 pixels, forming a square grid-like pattern.

III Mesh Optimization: Depth-Map Fusion

III-A Notation

We follow a notation similar to the one used in FLaME [11], since we use the same mathematical machinery. We denote by $\mathcal{V}$ the set of vertices of the 3D mesh, and by $\mathcal{E}$ the set of edges of the 3D mesh. For each edge $e=(i,j)\in\mathcal{E}$, we denote by $v^{i},v^{j}\in\mathcal{V}$ the associated vertices. The edges are directed from $i$ to $j$, but the direction is arbitrary and can be ignored. For each vertex $v\in\mathcal{V}$, we assign a smoothed inverse depth value that we denote $v_{\xi}$, and an auxiliary variable $\mathbf{w}\in\mathbb{R}^{2}$ that represents the 2D projection of a 3D normal, such that $v_{\mathbf{w}}:=(v_{w_{1}},v_{w_{2}})$. We encapsulate in $v_{\mathbf{x}}:=(v_{\xi},v_{\mathbf{w}})$ all the (primal) variables associated with a vertex $v$. We also denote by $\mathbb{v}_{\xi}$ the inverse depths of all the vertices of the mesh stacked into a column vector. $\langle\cdot,\cdot\rangle$ denotes the inner product between two vectors.

III-B Fitting the Data

It is well-known from the computer graphics literature [34, Section 13.5] that the inverse depth map $\mathbf{b}$ depends linearly on the inverse depths $\mathbb{v}_{\xi}$ of the vertices of the mesh:

\mathbf{b}=A\mathbb{v}_{\xi}

where the map $A$ encodes the barycentric coordinates of each pixel with respect to the triangle it belongs to. Here, $\mathbf{b}$ is a column vector with as many entries as there are pixels in the depth image, while $\mathbb{v}_{\xi}$ has as many entries as there are vertices in the mesh. Therefore, given a 2D mesh (in image space) with fixed topology, we can fit the inverse depths of the vertices so that the mesh best fits a given depth map in a least-squares sense. In other words, the loss function for mesh reconstruction could be:

\mathcal{L}_{\text{depth}}(\mathbb{v}_{\xi})=\|A\mathbb{v}_{\xi}-\mathbf{b}\|_{2}^{2}

where $\mathbf{b}$ is now the measured inverse depth, and $A\mathbb{v}_{\xi}$ is the estimated inverse depth from the mesh. The optimal value of $\mathbb{v}_{\xi}$ that solves $\min_{\mathbb{v}_{\xi}}\mathcal{L}_{\text{depth}}$ is simply given by the normal equations. Nevertheless, we are not just interested in fitting a 3D mesh to the depth map, but rather in summarizing the depth map into a smooth 3D surface. Moreover, we expect our depth map to potentially contain outlier measurements. Therefore, we want to formulate our loss function such that it is robust to outliers, and we want to penalize solutions that are not smooth surfaces.
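To make the linear depth-fitting step concrete, the following sketch builds the sparse interpolation matrix $A$ from per-pixel barycentric coordinates and solves the plain least-squares fit via the normal equations. It is a minimal illustration, assuming precomputed per-pixel triangle assignments (`bary`, `tri_idx` are hypothetical inputs, not part of the paper's code); the robust, smoothed formulation used in the paper replaces this step as described next.

```python
# Minimal sketch of the linear depth-fitting step (plain least-squares, no robustness).
# Assumes per-pixel barycentric coordinates `bary` (N_pix x 3) and the indices
# `tri_idx` (N_pix x 3) of the mesh vertices of the triangle covering each pixel.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def build_interpolation_matrix(bary, tri_idx, n_vertices):
    """A maps vertex inverse depths to per-pixel inverse depths: b_hat = A @ v_xi."""
    n_pix = bary.shape[0]
    rows = np.repeat(np.arange(n_pix), 3)
    cols = tri_idx.reshape(-1)
    vals = bary.reshape(-1)
    return sp.csr_matrix((vals, (rows, cols)), shape=(n_pix, n_vertices))

def fit_vertices_least_squares(A, b):
    """Solve min_v ||A v - b||_2^2 via the normal equations (A^T A) v = A^T b."""
    AtA = (A.T @ A).tocsc()
    Atb = A.T @ b
    return spla.spsolve(AtA, Atb)
```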

On the one hand, to reduce the influence of outliers, we can make use of the $\ell_{1}$ norm instead. In addition, we also want to quantify the difference between the estimated inverse depth $v_{\xi}$ and the inverse depth $v_{z}$ given by the visual odometry pipeline, which we denote by $\mathcal{L}_{\text{tracking}}$. In summary, our data loss $\mathcal{L}_{\text{data}}$ consists of two terms, $\mathcal{L}_{\text{depth}}$ and $\mathcal{L}_{\text{tracking}}$, defined as:

\mathcal{L}_{\text{depth}}(\mathbb{v}_{\xi})=\sum_{d\in\mathcal{D}}|\mathbf{a}_{d}\mathbb{v}_{\xi}-b_{d}| \qquad (1)
\mathcal{L}_{\text{tracking}}(\mathbb{v}_{\xi})=\sum_{v\in\mathcal{V}}\left|v_{\xi}-v_{z}\right|,

where $\mathcal{V}$ is the set of vertices of the mesh and $\mathcal{D}$ is the set of depth measurements. $\mathbf{a}_{d},b_{d}$ are the $d^{\text{th}}$ row of $A$ and entry of $\mathbf{b}$ associated with the $d^{\text{th}}$ depth measurement in $\mathcal{D}$. While [23, 11, 10] only used the sparse features from VO tracking to estimate the mesh (using Delaunay triangulation), we further add Steiner points [35, Section 14.1] in addition to the set of sparse features tracked by the VO. A Steiner point is an extra mesh vertex added to the optimization problem to achieve a better fit than would be possible using only the tracked VO features. On the other hand, to promote smooth 3D meshes, we can penalize deviations from piecewise affine solutions. We quantify this desire using the $\mathcal{L}_{\text{smooth}}$ term below.
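As an illustration of how the mesh vertices can be assembled, the sketch below combines hypothetical VO feature tracks (`vo_features` is an assumed $N\times 2$ array of pixel coordinates) with a regular grid of Steiner points and triangulates them with a 2D Delaunay triangulation via SciPy; the actual pipeline may differ in implementation details.

```python
# Minimal sketch of building the 2D mesh in image space: tracked VO features plus
# Steiner points on a regular grid every `step` pixels, triangulated with Delaunay.
import numpy as np
from scipy.spatial import Delaunay

def build_image_mesh(vo_features, width, height, step=50):
    us, vs = np.meshgrid(np.arange(0, width, step), np.arange(0, height, step))
    steiner = np.stack([us.ravel(), vs.ravel()], axis=1).astype(float)
    vertices = np.vstack([vo_features, steiner])   # VO landmarks + Steiner points
    tri = Delaunay(vertices)                       # 2D triangulation, fixed topology
    edges = set()
    for simplex in tri.simplices:                  # collect unique undirected edges
        for i in range(3):
            a, b = sorted((simplex[i], simplex[(i + 1) % 3]))
            edges.add((a, b))
    return vertices, tri.simplices, np.array(sorted(edges))
```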

We formulate this multi-objective optimization via scalarization, which introduces a free parameter $\lambda>0$ balancing the two objectives:

\mathcal{L}=\mathcal{L}_{\text{smooth}}+\lambda\mathcal{L}_{\text{data}},

where $\mathcal{L}_{\text{data}}=\mathcal{L}_{\text{depth}}+\mathcal{L}_{\text{tracking}}$. We could also balance the contribution of $\mathcal{L}_{\text{depth}}$ and $\mathcal{L}_{\text{tracking}}$ with another free parameter, but we avoid this for simplicity.

III-C Smoothing the Mesh

There are many possible choices for $\mathcal{L}_{\text{smooth}}$, but one that is particularly interesting is the second-order Total Generalized Variation ($\text{TGV}^{2}$) functional [36], because it promotes piecewise affine solutions. A non-local extension was proposed by [37] that allows encoding additional priors in the regularization term ($\text{NLTGV}^{2}$). Pinies et al. [38] showed that piecewise planarity in image space, afforded by $\text{NLTGV}^{2}$, translates to piecewise planarity in 3D space when optimizing over the inverse depth. Leveraging this insight, Greene and Roy proposed FLaME [11], which discretizes $\text{NLTGV}^{2}$ over 3D meshes (instead of dense inverse depth maps), thereby achieving substantial computational savings. Similarly to FLaME [11], we make use of the discrete $\text{NLTGV}^{2}$ to smooth our 3D mesh, which we now explain for completeness.

First, we need to introduce a key fact [38]: if two pixels $v_{\mathbf{u}}^{i}$ and $v_{\mathbf{u}}^{j}$ belong to the same planar surface in 3D, their inverse depth values $v^{i}_{\xi}$ and $v^{j}_{\xi}$ are constrained by the following relation:

v_{\xi}^{i}-v_{\xi}^{j}-\left\langle v_{\mathbf{w}}^{i},v_{\mathbf{u}}^{i}-v_{\mathbf{u}}^{j}\right\rangle=0,

where $v^{i}_{\mathbf{w}}$ is the normal vector of the surface at vertex $v^{i}$.
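A quick numerical check of this relation (a sketch, with made-up intrinsics $K$, plane normal $\mathbf{n}$, and offset $d$): for a plane $\mathbf{n}^{T}\mathbf{X}=d$ observed by a pinhole camera, the inverse depth is an affine function of the pixel coordinates, so the residual above vanishes exactly.

```python
# Numerical check (sketch): inverse depth is affine in pixel coordinates for a 3D
# plane, so xi_i - xi_j - <w, u_i - u_j> = 0 holds exactly. K, n, d are made up.
import numpy as np

K = np.array([[450.0, 0.0, 320.0],
              [0.0, 450.0, 240.0],
              [0.0, 0.0, 1.0]])
n, d = np.array([0.1, -0.2, 1.0]), 3.0       # plane: n . X = d (camera frame)
Kinv = np.linalg.inv(K)

def inv_depth(u):                            # xi(u) = (n^T K^{-1} [u, v, 1]) / d
    return n @ (Kinv @ np.array([u[0], u[1], 1.0])) / d

u_i, u_j = np.array([100.0, 80.0]), np.array([400.0, 300.0])
w = (n @ Kinv)[:2] / d                       # gradient of xi w.r.t. pixel coordinates
residual = inv_depth(u_i) - inv_depth(u_j) - w @ (u_i - u_j)
print(abs(residual) < 1e-12)                 # True: the planarity relation holds
```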

The $\text{NLTGV}^{2}$ smoothing term tries to penalize solutions that deviate from this relationship. Hence, the $\text{NLTGV}^{2}$ term over the mesh is given by:

\mathrm{NLTGV}^{2}(\xi)\approx\sum_{e\in\mathcal{E}}\left\|D_{e}\left(v_{\mathbf{x}}^{i},v_{\mathbf{x}}^{j}\right)\right\|_{1} \qquad (2)

Here $D_{e}$ is a linear operator that acts on the vertices $v_{\mathbf{x}}^{i}$ and $v_{\mathbf{x}}^{j}$ corresponding to edge $e$, where we recall that $v_{\mathbf{x}}:=(v_{\xi},v_{\mathbf{w}})$:

D_{e}\left(v_{\mathbf{x}}^{i},v_{\mathbf{x}}^{j}\right)=\left[\begin{array}{c}e_{\alpha}\left(v_{\xi}^{i}-v_{\xi}^{j}-\left\langle v_{\mathbf{w}}^{i},v_{\mathbf{u}}^{i}-v_{\mathbf{u}}^{j}\right\rangle\right)\\ e_{\beta}\left(v_{w_{1}}^{i}-v_{w_{1}}^{j}\right)\\ e_{\beta}\left(v_{w_{2}}^{i}-v_{w_{2}}^{j}\right)\end{array}\right],

where $v_{\mathbf{u}}^{i}$ and $v_{\mathbf{u}}^{j}$ are the 2D pixel projections of the 3D vertices $v^{i}$ and $v^{j}$ of the mesh. The auxiliary variables $v^{i}_{\mathbf{w}}$, $v^{j}_{\mathbf{w}}$ are related to the 2D projection of the 3D normal of the plane associated with vertices $v^{i}$ and $v^{j}$. Intuitively, the first row of $\|D_{e}(v_{\mathbf{x}}^{i},v_{\mathbf{x}}^{j})\|_{1}$ penalizes the deviation of the vertices from the plane going through them, while the second and third rows penalize adjacent vertices having different normals. Weights $e_{\alpha},e_{\beta}\geq 0$ are assigned to each edge, and control the influence of the smoothness term. Setting one of these weights to 0 cancels the enforcement of planarity between vertices of the mesh. As in [11], we set the edge weight $e_{\alpha}=1/\|v_{\mathbf{u}}^{i}-v_{\mathbf{u}}^{j}\|_{2}$, i.e., inversely proportional to the edge length in pixels, and $e_{\beta}=1$. It will be useful in what follows to express the operator $D_{e}$ as a matrix; in particular, we have:

D_{e}=\left[\begin{matrix}e_{\alpha}&e_{\alpha}(v_{u_{1}}^{j}-v_{u_{1}}^{i})&e_{\alpha}(v_{u_{2}}^{j}-v_{u_{2}}^{i})&-e_{\alpha}&0&0\\ 0&e_{\beta}&0&0&-e_{\beta}&0\\ 0&0&e_{\beta}&0&0&-e_{\beta}\end{matrix}\right] \qquad (3)

The operator $D_{e}$ is applied to the per-edge vector of primal variables, namely $[v_{\mathbf{x}}^{i},v_{\mathbf{x}}^{j}]=[v_{\xi}^{i},v_{w_{1}}^{i},v_{w_{2}}^{i},v_{\xi}^{j},v_{w_{1}}^{j},v_{w_{2}}^{j}]$.
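For reference, a small sketch that assembles the $3\times 6$ matrix of Eq. 3 for a single edge, with the edge weights chosen as described above; this is an illustrative re-implementation under the stated conventions, not the paper's released code.

```python
# Minimal sketch of the per-edge operator D_e of Eq. (3), acting on the stacked
# primal vector [xi_i, w1_i, w2_i, xi_j, w1_j, w2_j]. `u_i`, `u_j` are the pixel
# coordinates of the two vertices of the edge.
import numpy as np

def edge_operator(u_i, u_j, e_beta=1.0):
    e_alpha = 1.0 / np.linalg.norm(u_i - u_j)  # weight inversely prop. to edge length
    du = u_j - u_i
    D_e = np.array([
        [e_alpha, e_alpha * du[0], e_alpha * du[1], -e_alpha, 0.0, 0.0],
        [0.0,     e_beta,          0.0,              0.0,     -e_beta, 0.0],
        [0.0,     0.0,             e_beta,           0.0,      0.0,   -e_beta],
    ])
    return D_e

# Applied to the primal stack, the first row equals
# e_alpha * (xi_i - xi_j - <w_i, u_i - u_j>), i.e. the planarity residual of NLTGV^2.
```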

III-D Primer on Primal-Dual Optimization

The resulting loss function combining the data terms and the $\text{NLTGV}^{2}$ smoothness term is:

\begin{split}\mathcal{L}(v_{\mathbf{x}})&=\mathcal{L}_{\text{smooth}}(v_{\mathbf{x}})+\lambda\mathcal{L}_{\text{data}}(v_{\mathbf{x}})\\ &=\sum_{e\in\mathcal{E}}\left\|D_{e}\left(v_{\mathbf{x}}^{i},v_{\mathbf{x}}^{j}\right)\right\|_{1}+\lambda\left[\sum_{v\in\mathcal{V}}|v_{\xi}-v_{z}|+\sum_{d\in\mathcal{D}}|\mathbf{a}_{d}\mathbb{v}_{\xi}-b_{d}|\right]\end{split} \qquad (4)

Eq. 4 is a non-smooth convex optimization problem. The lack of smoothness is due to the use of non-differentiable $\ell_{1}$ norms. Fortunately, this type of problem can be solved using first-order primal-dual optimization [39].

Let us first briefly introduce the primal-dual optimization proposed by Chambolle and Pock [39]. The problem considered is the nonlinear primal problem:

\min_{x\in X}F(Kx)+G(x) \qquad (5)

where the map $K:X\rightarrow Y$ is a continuous linear operator, and $G:X\rightarrow[0,+\infty)$ and $F:Y\rightarrow[0,+\infty)$ are proper, convex, lower semi-continuous functions.

This nonlinear problem can be formulated as the following saddle-point problem [39]:

\min_{x\in X}\max_{y\in Y}\langle Kx,y\rangle+G(x)-F^{*}(y) \qquad (6)

where $F^{*}$ is the convex conjugate of $F$. From now on, we refer to the variable $x$ as the primal, and $y$ as the dual. Chambolle and Pock proposed to solve this saddle-point problem by performing descent steps on the primal, and ascent steps on the dual. In particular, for each iteration $n\geq 0$, the algorithm updates $x^{n},y^{n},\bar{x}^{n}$ as follows:

\left\{\begin{aligned} y^{n+1}&=\left(I+\sigma\partial F^{*}\right)^{-1}\left(y^{n}+\sigma K\bar{x}^{n}\right)&\text{(Dual ascent)}\\ x^{n+1}&=(I+\tau\partial G)^{-1}\left(x^{n}-\tau K^{*}y^{n+1}\right)&\text{(Primal descent)}\\ \bar{x}^{n+1}&=x^{n+1}+\theta\left(x^{n+1}-x^{n}\right)&\end{aligned}\right. \qquad (7)

where $\sigma,\tau>0$ are the dual and primal step sizes, and $\theta\in[0,1]$. $K^{*}$ is the adjoint of $K$. $(I+\sigma\partial F^{*})^{-1}$ and $(I+\tau\partial G)^{-1}$ are the resolvent operators, defined as:

x=(I+\tau\partial G)^{-1}(y)=\arg\min_{x}\left\{\frac{\|x-y\|^{2}}{2\tau}+G(x)\right\}

The main assumption of [39] is that the resolvent operators for $F$ and $G$ are easy to compute.
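For intuition, a generic sketch of the iteration in Eq. 7, where the resolvents of $F^{*}$ and $G$ are supplied as callables; the operator $K$, the step sizes, and the iteration count are placeholders, not the values used in the paper.

```python
# Minimal sketch of the generic Chambolle-Pock iteration of Eq. (7). `K` is a
# (sparse) matrix; `prox_Fstar` and `prox_G` are the resolvent operators of F* and G,
# which are assumed cheap to evaluate (the key assumption of [39]).
import numpy as np

def chambolle_pock(K, prox_Fstar, prox_G, x0, sigma, tau, theta=1.0, iters=200):
    x = x0.copy()
    x_bar = x0.copy()
    y = np.zeros(K.shape[0])
    for _ in range(iters):
        y = prox_Fstar(y + sigma * (K @ x_bar), sigma)   # dual ascent
        x_new = prox_G(x - tau * (K.T @ y), tau)          # primal descent
        x_bar = x_new + theta * (x_new - x)               # extra gradient step
        x = x_new
    return x
```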

As noted by Greene and Roy [11], the summation over the edges $\mathcal{E}$ and the summation over the vertices $\mathcal{V}$ in Eq. 4 can be encoded in $F(K\mathbf{x})$ and $G(\mathbf{x})$, respectively, in Eq. 5. It remains to be seen where the third term $|\mathbf{a}_{d}\mathbb{v}_{\xi}-b_{d}|$ fits.

III-E Finding the Optimal Mesh

On the one hand, we could consider the term $|\mathbf{a}_{d}\mathbb{v}_{\xi}-b_{d}|$ to be part of $G(\mathbf{x})$, but then we would need to find its proximal operator (the proximal operator is the resolvent of the subdifferential operator). Unfortunately, there is no known closed-form proximal operator for the $\ell_{1}$ norm of an affine function [40]. On the other hand, we can consider it part of $F(K\mathbf{x})$. In this case, we need to dualize the function with respect to $G(\mathbf{x})$. Unfortunately, doing so introduces as many dual variables as there are pixels in the image, which ideally we would like to avoid for computational reasons. We proceed with the second choice, and implement a fast GPU-based solver (Fig. 3).

The dual representation of $\|A\mathbb{v}_{\xi}-\mathbf{b}\|_{1}$ is given by (Appendix VI):

\max_{\mathbf{p}}\langle A\mathbb{v}_{\xi}-\mathbf{b},\mathbf{p}\rangle-\delta_{P}(\mathbf{p}),

where $\delta_{P}(\mathbf{p})$ is the convex conjugate of the $\ell_{1}$ norm, and corresponds to the indicator function of the set $P=\{\mathbf{p}:\|\mathbf{p}\|_{\infty}\leq 1\}$. Together with the dual representation of the $\text{NLTGV}^{2}$ term [11]:

\left\|D_{e}\left(v_{\mathbf{x}}^{i},v_{\mathbf{x}}^{j}\right)\right\|_{1}=\max_{e_{\mathbf{q}}}\left\langle D_{e}\left(v_{\mathbf{x}}^{i},v_{\mathbf{x}}^{j}\right),e_{\mathbf{q}}\right\rangle-\delta_{Q}\left(e_{\mathbf{q}}\right)

where each $e_{\mathbf{q}}\in\mathbb{R}^{3}$ is a dual variable associated with edge $e$, and $\delta_{Q}(e_{\mathbf{q}})$ is also the convex conjugate of the $\ell_{1}$ norm, we obtain the primal-dual formulation of our original problem:

\begin{aligned}\min_{v_{\mathbf{x}}}\max_{e_{\mathbf{q}},\mathbf{p}}\ &\sum_{e\in\mathcal{E}}\left\langle D_{e}\left(v_{\mathbf{x}}^{i},v_{\mathbf{x}}^{j}\right),e_{\mathbf{q}}\right\rangle-\delta_{Q}\left(e_{\mathbf{q}}\right)\\ &+\lambda\left[\sum_{v\in\mathcal{V}}\left|v_{\xi}-v_{z}\right|+\langle A\mathbb{v}_{\xi}-\mathbf{b},\mathbf{p}\rangle-\delta_{P}(\mathbf{p})\right]\end{aligned} \qquad (8)

Let us now formulate this primal-dual optimization problem under the formalism of Eq. 6. In particular, Eq. 8 can be cast as Eq. 6 by defining $F^{*}$ and $G$ as follows:

\begin{aligned}F^{*}(e_{\mathbf{q}},\mathbf{p})&=\sum_{e\in\mathcal{E}}\delta_{Q}(e_{\mathbf{q}})+\lambda\langle\mathbf{b},\mathbf{p}\rangle+\delta_{P}(\mathbf{p}),\\ G(\mathbf{v}_{\xi})&=\lambda\sum_{v\in\mathcal{V}}|v_{\xi}-v_{z}|.\end{aligned}

We now detail the resolvent operators $\left(I+\sigma\partial F^{*}\right)^{-1}$ and $(I+\tau\partial G)^{-1}$. Since $F^{*}$ is a sum of separable functions, we can first find the resolvent for $\mathbf{q}$ and then for $\mathbf{p}$. For the indicator function of a convex set $\delta_{Q}(\mathbf{q})$, the resolvent operator reduces to point-wise projectors onto unit $\ell_{\infty}$ balls. Similarly, for the sum of the indicator function of a convex set $\delta_{P}(\mathbf{p})$ and the linear map $\lambda\langle\mathbf{b},\mathbf{p}\rangle$, the resolvent operator remains a point-wise projector onto the unit $\ell_{\infty}$ ball, where the argument is offset by $\lambda\mathbf{b}$:

(q,p)=\left(I+\sigma\partial F^{*}\right)^{-1}(\tilde{q},\tilde{p})\Leftrightarrow\left\{\begin{aligned} q_{i}&=\frac{\tilde{q}_{i}}{\max\left(1,\left|\tilde{q}_{i}\right|\right)}\\ p_{j}&=\frac{\tilde{p}_{j}-\lambda b_{j}}{\max\left(1,\left|\tilde{p}_{j}-\lambda b_{j}\right|\right)}\end{aligned}\right.
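In code, this resolvent is a point-wise projection; a minimal sketch following the equations above (with the dual variables stored as NumPy arrays) is:

```python
# Minimal sketch of the resolvent of F*: point-wise projection onto the unit
# l_inf ball, with the p-block offset by lambda * b, as in the equation above.
import numpy as np

def prox_Fstar(q_tilde, p_tilde, b, lam):
    q = q_tilde / np.maximum(1.0, np.abs(q_tilde))   # project q onto ||.||_inf <= 1
    shifted = p_tilde - lam * b
    p = shifted / np.maximum(1.0, np.abs(shifted))   # project the shifted p likewise
    return q, p
```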

Similarly, the resolvent operator for $G(\mathbf{v}_{\xi})$ is given by the point-wise shrinkage operations:

v_{\xi}=(I+\tau\partial G)^{-1}(\tilde{v}_{\xi})\Leftrightarrow v_{\xi}=\left\{\begin{aligned} \tilde{v}_{\xi}-\tau\lambda&\text{ if }\ \tilde{v}_{\xi}-v_{z}>\tau\lambda\\ \tilde{v}_{\xi}+\tau\lambda&\text{ if }\ \tilde{v}_{\xi}-v_{z}<-\tau\lambda\\ v_{z}\quad\ &\text{ if }\ \left|\tilde{v}_{\xi}-v_{z}\right|\leq\tau\lambda\end{aligned}\right.
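The corresponding shrinkage can also be written point-wise; a minimal sketch, assuming $v_{z}$ is stored per vertex, is:

```python
# Minimal sketch of the resolvent of G: point-wise shrinkage (soft-thresholding)
# of the inverse depths towards the VO-tracked inverse depths v_z.
import numpy as np

def prox_G(xi_tilde, v_z, tau, lam):
    r = xi_tilde - v_z
    return np.where(r > tau * lam, xi_tilde - tau * lam,
           np.where(r < -tau * lam, xi_tilde + tau * lam, v_z))
```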

We now detail the primal descent and dual ascent steps. First, we need to define the linear operator $K$ for the dual ascent. The dual ascent over $\mathbf{q},\mathbf{p}$ is given by:

\left(\begin{aligned}&\mathbf{q}^{n+1}\\ &\mathbf{p}^{n+1}\end{aligned}\right)=\left(I+\sigma\partial F^{*}\right)^{-1}\left(\begin{aligned}&\mathbf{q}^{n}+\sigma D\mathbf{v}_{\mathbf{x}}\\ &\mathbf{p}^{n}+\sigma\lambda A\mathbf{v}_{\xi}\end{aligned}\right) \qquad (9)

where we use $D$ to represent the linear operator that stacks all of the $D_{e}$ operators acting on a per-edge basis, and $\sigma>0$ is the dual step size.

Let us now define the adjoint $K^{*}$ for the primal descent over $v_{\xi}$. Since we have formulated our linear operators as matrices $D_{e}$ and $A$, the adjoints are simply the transposes of these matrices: $D_{e}^{*}=D_{e}^{T}$ and $A^{*}=A^{T}$. We now further decompose $D_{e}^{*}$ to make the subsequent primal descent iteration explicit. $D_{e}^{*}$ operates on the incoming edges $\mathcal{N}_{in}(v)$ and outgoing edges $\mathcal{N}_{out}(v)$ of a vertex $v$, as noted by [11], and can therefore be decomposed as:

D_{e}^{*}\left(e_{\mathbf{q}}\right)=D_{e}^{T}e_{\mathbf{q}}=\left[\begin{array}{c}e_{\alpha}e_{q_{1}}\\ e_{\alpha}\left(v_{u_{1}}^{j}-v_{u_{1}}^{i}\right)e_{q_{1}}+e_{\beta}e_{q_{2}}\\ e_{\alpha}\left(v_{u_{2}}^{j}-v_{u_{2}}^{i}\right)e_{q_{1}}+e_{\beta}e_{q_{3}}\\ -e_{\alpha}e_{q_{1}}\\ -e_{\beta}e_{q_{2}}\\ -e_{\beta}e_{q_{3}}\end{array}\right]=\left[\begin{array}{c}D_{in}^{*}\left(e_{\mathbf{q}}\right)\\ D_{out}^{*}\left(e_{\mathbf{q}}\right)\end{array}\right]

where $D_{in}^{*}$ and $D_{out}^{*}$ partition $D_{e}^{*}$ into the first and last three rows, respectively. Furthermore, we need to make the primal updates on the inverse depths explicit, since the adjoint $A^{*}$ only affects these. Therefore, we use:

D_{e}^{*}\left(e_{\mathbf{q}}\right)=\left[\begin{array}{c}D_{in}^{*}\left(e_{\mathbf{q}}\right)\\ D_{out}^{*}\left(e_{\mathbf{q}}\right)\end{array}\right]=\left[\begin{array}{c}D_{in_{\xi}}^{*}\left(e_{\mathbf{q}}\right)\\ D_{in_{\mathbf{w}}}^{*}\left(e_{\mathbf{q}}\right)\\ D_{out_{\xi}}^{*}\left(e_{\mathbf{q}}\right)\\ D_{out_{\mathbf{w}}}^{*}\left(e_{\mathbf{q}}\right)\end{array}\right]
Figure 3: Graphical representation of the computation required to evaluate the primal contribution in Eq. 12 due to the $H\times W$ dual variable $\mathbf{p}^{n+1}$ (dark blue). Starting from the CPU (bottom right), we forward the current inverse depths $v_{\xi}^{n}$ of the mesh to the GPU (pink column). Then, we rasterize an interpolated inverse depth for the whole image (on the GPU) using the barycentric coordinates of each pixel (red, sparse). We subtract the measured depth (yellow) from the interpolated depth (green), and add the previous dual value $\mathbf{p}^{n}$ (light blue). The result is the next dual value $\mathbf{p}^{n+1}$ (dark blue). The primal contribution (dark pink) is then calculated by element-wise multiplication with the barycentric coordinates (red, sparse), and sent back to the CPU for a primal update. The process is repeated until convergence.

The descent equations for each $v_{\xi}$ are then given by:

v_{\xi}^{n+1}=(I+\tau\partial G)^{-1}\bigg{(}v_{\xi}^{n}-\tau\sum_{e\in\mathcal{N}_{in}(v)}D_{in_{\xi}}^{*}\left(e_{\mathbf{q}}^{n+1}\right)-\tau\sum_{e\in\mathcal{N}_{out}(v)}D_{out_{\xi}}^{*}\left(e_{\mathbf{q}}^{n+1}\right)-\tau\lambda[A^{*}\mathbf{p}^{n+1}]_{v}\bigg{)}, \qquad (10)

where $\tau>0$ is the primal step size, and $[\cdot]_{v}$ in $[A^{*}\mathbf{p}^{n+1}]_{v}$ extracts the row corresponding to vertex $v$. The descent equations for each $v_{\mathbf{w}}$ are given by:

v_{\mathbf{w}}^{n+1}=v_{\mathbf{w}}^{n}-\tau\sum_{e\in\mathcal{N}_{in}(v)}D_{in_{\mathbf{w}}}^{*}\left(e_{\mathbf{q}}^{n+1}\right)-\tau\sum_{e\in\mathcal{N}_{out}(v)}D_{out_{\mathbf{w}}}^{*}\left(e_{\mathbf{q}}^{n+1}\right). \qquad (11)

With the primal descent Eqs. 10 and 11 and dual ascent Eq. 9 in hand, we can now iteratively apply Eq. 7 until convergence.

In summary, our optimization proceeds as follows:

\left\{\begin{aligned} &\quad\quad\quad\underline{\text{Dual ascent:}}\\ \left(\begin{aligned}&\mathbf{q}^{n+1}\\ &\mathbf{p}^{n+1}\end{aligned}\right)&=\left(I+\sigma\partial F^{*}\right)^{-1}\left(\begin{aligned}&\mathbf{q}^{n}+\sigma D\mathbf{v}_{\mathbf{x}}\\ &\mathbf{p}^{n}+\sigma\lambda A\mathbf{v}_{\xi}\end{aligned}\right)\\ &\quad\quad\quad\underline{\text{Primal descent:}}\\ \left(\begin{aligned}&v_{\xi}^{n+1}\\ &v_{\mathbf{w}}^{n+1}\end{aligned}\right)&=\left(I+\tau\partial G\right)^{-1}\left(\begin{aligned}v_{\xi}^{n}&-\tau D^{*}_{\xi}e_{\mathbf{q}}^{n+1}-\tau\lambda[A^{*}\mathbf{p}^{n+1}]_{v}\\ v_{\mathbf{w}}^{n}&-\tau D^{*}_{\mathbf{w}}e_{\mathbf{q}}^{n+1}\end{aligned}\right)\\ &\quad\quad\quad\underline{\text{Extra gradient step:}}\\ \bar{v}^{n+1}_{\mathbf{x}}&=v_{\mathbf{x}}^{n+1}+\theta\left(v_{\mathbf{x}}^{n+1}-v_{\mathbf{x}}^{n}\right)\end{aligned}\right. \qquad (12)

and we implement the dual step in the GPU as illustrated in Fig. 3.
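Putting the pieces together, the following sketch composes the helpers from the previous snippets (edge_operator, prox_G, dual_pixel_step) into the per-frame iteration of Eq. 12. Step sizes, initialization, and the fixed iteration count are illustrative choices, not the paper's tuned values.

```python
# End-to-end sketch of the per-frame mesh optimization of Eq. (12), reusing the
# helpers sketched above. All parameters below are placeholders for illustration.
import numpy as np

def optimize_mesh(vertices_uv, edges, tri_idx, bary, b, v_z,
                  sigma=0.1, tau=0.1, lam=1.0, theta=1.0, iters=50):
    n_v = len(vertices_uv)
    xi = v_z.copy()                       # initialize inverse depths from VO / depth
    w = np.zeros((n_v, 2))                # per-vertex normal-related auxiliary variables
    xi_bar, w_bar = xi.copy(), w.copy()
    q = np.zeros((len(edges), 3))         # per-edge dual variables
    p = np.zeros(bary.shape[0])           # per-pixel dual variables
    D = [edge_operator(vertices_uv[i], vertices_uv[j]) for i, j in edges]
    for _ in range(iters):
        # --- dual ascent on q (per edge) and p (per pixel) ---
        grad_xi = np.zeros(n_v)
        grad_w = np.zeros((n_v, 2))
        for k, (i, j) in enumerate(edges):
            stack = np.array([xi_bar[i], *w_bar[i], xi_bar[j], *w_bar[j]])
            q_t = q[k] + sigma * (D[k] @ stack)
            q[k] = q_t / np.maximum(1.0, np.abs(q_t))   # project onto unit l_inf ball
            back = D[k].T @ q[k]                        # adjoint D_e^* applied to q
            grad_xi[i] += back[0]; grad_w[i] += back[1:3]
            grad_xi[j] += back[3]; grad_w[j] += back[4:6]
        p, contrib = dual_pixel_step(bary, tri_idx, xi_bar, b, p, sigma, lam, n_v)
        # --- primal descent on xi and w, then extra gradient step ---
        xi_new = prox_G(xi - tau * (grad_xi + lam * contrib), v_z, tau, lam)
        w_new = w - tau * grad_w
        xi_bar = xi_new + theta * (xi_new - xi)
        w_bar = w_new + theta * (w_new - w)
        xi, w = xi_new, w_new
    return xi, w
```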

IV Experimental Results

We benchmark the proposed approach on real and simulated datasets, and evaluate map estimation accuracy, as well as runtime.

The EuRoC dataset [12] provides visual and inertial data recorded from a micro aerial vehicle flying indoors. We use the Vicon Room (V) data, which is similar to an office room where walls, floor, and ceiling are visible, as well as other planar surfaces (boxes, stacked mattresses).

We also use the simulated uHumans2 dataset [8] to evaluate the accuracy of the mesh reconstruction with RGB-D data. The uHumans2 dataset is generated using a photorealistic Unity-based simulator, and provides RGB-D data and ground truth of the geometry of the scene. Using this dataset, we can also provide an ablation study on the effect that the number of Steiner points has on the reconstruction (without being influenced by stereo reconstruction errors). We use the ‘Apartment’ scene of the uHumans2 dataset.

Compared techniques. To assess the advantages of our proposed approach, we compare against the original FLaME formulation, as well as against VO pipelines. We also evaluate both with RGB-D data, using the uHumans2 dataset, and with dense stereo reconstruction, using the EuRoC dataset.

IV-A Evaluation in EuRoC

Figure 4: (First) Textured 3D mesh showing the proposed mesh reconstruction by using dense stereo depth estimation in the EuRoC dataset[12]. (Second) Wireframe of the mesh. (Third) Normal map showing the smoothness of the 3D mesh. (Fourth) Color-coded depth map, brighter colors correspond to closer depths.

The EuRoC dataset provides ground-truth 3D reconstructions of the scene for the Vicon Room (V) datasets. Therefore, we can directly evaluate the accuracy of the per-frame 3D mesh reconstructions. Nevertheless, the dataset was recorded with a stereo camera instead of an RGB-D camera. Hence, we perform dense stereo reconstruction (using OpenCV’s block-matching approach) to fit our 3D mesh.

Table I compares our approach with two VO pipelines, LSD-SLAM [41] and MLM [42], as well as FLaME [11], which is the work closest to ours. LSD-SLAM is a semi-dense VO pipeline, and as such, it only partially reconstructs the scene. While being semi-dense, LSD-SLAM remains practically unable to reconstruct the scene, staying below 25% in terms of accurate depth estimates. MLM is built on top of LSD-SLAM and was designed to densify LSD-SLAM's map; as such, it clearly achieves denser reconstructions than its counterpart (approximately 50% denser). FLaME further densifies the 3D reconstruction by using 3D meshes and a primal-dual formulation like ours. Their approach clearly improves over the MLM estimates (by 42% on average), but most importantly, provides topological information through the 3D mesh. Our approach further uses the depth data coming from RGB-D or stereo, and fuses this information into the 3D mesh, which further densifies the 3D reconstruction, while also preserving the topological properties of the scene. Fig. 4 shows the quality of the reconstruction in EuRoC.

TABLE I: Fraction of inverse depths per depth-map within 10 percent of ground-truth for LSD-SLAM[41], MLM[42], FLaME [11], and our approach. Our approach enables more accurate and denser reconstructions. In bold the best result.
EuRoC Accurate Inverse Depth Density [%]
Sequence LSD MLM FLaME Ours
V1_01 18.2 26.4 37.8 41.7
V1_02 18.3 27.9 40.5 44.8
V1_03 11.0 17.0 27.0 33.1
V2_01 25.9 39.3 48.1 52.9
V2_02 20.5 28.9 37.2 44.0
V2_03 11.5 19.0 28.9 34.8

IV-B Evaluation in uHumans2 Dataset

Using the RGB-D images of the uHumans2 dataset, we run and evaluate our approach with an increasing number of Steiner points. Table II shows that increasing the number of Steiner points has a clear positive effect on the quality of the depth estimates. The improvements are most noticeable when comparing the use of no Steiner points (No Steiner) against adding Steiner points every 100 pixels ($S=100$). Adding further Steiner points increases the accuracy up to around one Steiner point every 10 pixels. Note that the computation time increases with the number of points added, and beyond $S=50$ our approach might not be fast enough for real-time operation. Fig. 2 shows a reconstruction using the $S=50$ configuration.

TABLE II: Fraction of inverse depths per depth-map within 10 percent of ground-truth for our approach using an increasing number of Steiner points (one every $S$ pixels). The accuracy and density of our reconstructions increase with the number of Steiner points added. In bold the best result.
uHumans2 Accurate Inverse Depth Density [%] (Timing [ms])
Sequence Ours (No Steiner) Ours ($S=100$) Ours ($S=50$) Ours ($S=10$)
Apart_01 43.2 (9.1) 46.4 (10.9) 53.8 (12.4) 56.7 (41)
Apart_02 42.3 (10.2) 49.9 (11.9) 52.5 (13.4) 51.2 (49)
Apart_03 43.0 (9.8) 48.0 (11.8) 53.0 (13.3) 53.9 (43)

IV-C Timing

All timings were obtained using an Nvidia RTX 2080 Ti GPU and an Intel i9 CPU. The computation of the dual contribution takes less than 2.1 ms overall (worst-case timing recorded, including data transfer). Nevertheless, since this operation needs to be repeated several times until convergence, we record an average processing time of 13.2 ms. This includes performing the Delaunay triangulation, uploading depth and 2D mesh coordinates to the GPU, solving the dual problem, sending the primal contribution to the CPU, optimizing, and repeating the process until convergence (for 6 iterations). Ultimately, the bottleneck in terms of computation comes from Kimera-VIO, which takes 20 ms per frame, despite being an optimized VIO pipeline with its frontend and backend working in parallel.

V Conclusion

By leveraging the triangulated landmarks from visual odometry and the depth from RGB-D data (or stereo), we present a 3D reconstruction algorithm capable of building a smooth, compact, and accurate 3D mesh of the scene. The proposed algorithm shows that dense depth fusion into a consistent 3D mesh is not only possible, but also computationally fast.

Finally, while the results presented are promising, we are not yet fusing the 3D mesh into a consistent multi-frame scene representation. Therefore, a promising line for future work is to use the per-frame 3D mesh to achieve multi-view 3D scene reconstruction. With the accuracy and speed afforded by the presented approach, it becomes possible to build a map without relying on intermediate volumetric representations.

References

  • [1] A. Rosinol, M. Abate, Y. Chang, and L. Carlone, “Kimera: an open-source library for real-time metric-semantic localization and mapping,” in IEEE Intl. Conf. on Robotics and Automation (ICRA), 2020, arXiv preprint arXiv: 1910.02490, (video), (code), (pdf).
  • [2] A. Mourikis and S. Roumeliotis, “A multi-state constraint Kalman filter for vision-aided inertial navigation,” in IEEE Intl. Conf. on Robotics and Automation (ICRA), April 2007, pp. 3565–3572.
  • [3] S. Leutenegger, S. Lynen, M. Bosse, R. Siegwart, and P. Furgale, “Keyframe-based visual-inertial slam using nonlinear optimization,” Intl. J. of Robotics Research, no. 3, pp. 314–334, 2015.
  • [4] C. Forster, L. Carlone, F. Dellaert, and D. Scaramuzza, “On-manifold preintegration for real-time visual-inertial odometry,” IEEE Trans. Robotics, vol. 33, no. 1, pp. 1–21, 2017, arxiv preprint: 1512.02363, (pdf), technical report GT-IRIM-CP&R-2015-001.
  • [5] R. Mur-Artal and J. Tardós, “Visual-inertial monocular SLAM with map reuse,” ArXiv preprint:1610.05949, 2016.
  • [6] T. Schöps, J. L. Schönberger, S. Galliani, T. Sattler, K. Schindler, M. Pollefeys, and A. Geiger, “A multi-view stereo benchmark with high-resolution images and multi-camera videos,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [7] S. Pillai, S. Ramalingam, and J. Leonard, “High-performance and tunable stereo reconstruction,” in IEEE Intl. Conf. on Robotics and Automation (ICRA), 2016.
  • [8] A. Rosinol, A. Violette, M. Abate, N. Hughes, Y. Chang, J. Shi, A. Gupta, and L. Carlone, “Kimera: from SLAM to spatial perception with 3D dynamic scene graphs,” arXiv preprint arXiv: 2101.06894, 2021, (pdf).
  • [9] R. Newcombe, A. Davison, S. Izadi, P. Kohli, O. Hilliges, J. Shotton, D. Molyneaux, S. Hodges, D. Kim, and A. Fitzgibbon, “KinectFusion: Real-time dense surface mapping and tracking,” in IEEE and ACM Intl. Sym. on Mixed and Augmented Reality (ISMAR), 2011, pp. 127–136.
  • [10] A. Rosinol, T. Sattler, M. Pollefeys, and L. Carlone, “Incremental Visual-Inertial 3D Mesh Generation with Structural Regularities,” in IEEE Intl. Conf. on Robotics and Automation (ICRA), 2019, (pdf), (web). [Online]. Available: https://www.mit.edu/%7Earosinol/research/struct3dmesh.html
  • [11] W. N. Greene and N. Roy, “Flame: Fast lightweight mesh estimation using variational smoothing on delaunay graphs,” in 2017 IEEE International Conference on Computer Vision (ICCV).   IEEE, 2017, pp. 4696–4704.
  • [12] M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. Achtelik, and R. Siegwart, “The EuRoC micro aerial vehicle datasets,” Intl. J. of Robotics Research, 2016.
  • [13] A. Rosinol, A. Gupta, M. Abate, J. Shi, and L. Carlone, “3D dynamic scene graphs: Actionable spatial perception with places, objects, and humans,” in Robotics: Science and Systems (RSS), 2020, (pdf), (video).
  • [14] B. Curless and M. Levoy, “A volumetric method for building complex models from range images,” in SIGGRAPH, 1996, pp. 303–312.
  • [15] T. Schöps, T. Sattler, and M. Pollefeys, “Surfelmeshing: Online surfel-based mesh reconstruction,” IEEE transactions on pattern analysis and machine intelligence, vol. 42, no. 10, pp. 2494–2507, 2019.
  • [16] Q. Pan, G. Reitmayr, and T. Drummond, “ProFORMA: Probabilistic feature-based on-line rapid model acquisition,” in British Machine Vision Conference (BMVC), 2009.
  • [17] O. D. Faugeras, E. L. Bras-Mehlman, and J. D. Boissonnat, “Representing stereo data with the Delaunay triangulation,” Artif. Intell., vol. 44, no. 1-2, pp. 41–87, 1990.
  • [18] E. Piazza, A. Romanoni, and M. Matteucci, “Real-time cpu-based large-scale 3d mesh reconstruction,” arXiv preprint arXiv:1801.05230, 2018.
  • [19] M. Lhuillier, “Surface reconstruction from a sparse point cloud by enforcing visibility consistency and topology constraints,” Computer Vision and Image Understanding, vol. 175, pp. 52–71, 2018.
  • [20] M. Lhuillier and S. Yu, “Manifold surface reconstruction of an environment from sparse structure-from-motion data,” Computer Vision and Image Understanding, vol. 117, no. 11, pp. 1628–1644, 2013.
  • [21] A. Bódis-Szomorú, H. Riemenschneider, and L. Van Gool, “Efficient edge-aware surface mesh reconstruction for urban scenes,” Computer Vision and Image Understanding, vol. 157, pp. 3–24, 2017.
  • [22] G. Turk and M. Levoy, “Zippered polygon meshes from range images,” in Proceedings of the 21st annual conference on Computer graphics and interactive techniques, 1994, pp. 311–318.
  • [23] L. Teixeira and M. Chli, “Real-time mesh-based scene estimation for aerial inspection,” in IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS).   IEEE, 2016, pp. 4863–4869.
  • [24] S. Pütz, T. Wiemann, J. Sprickerhof, and J. Hertzberg, “3D navigation mesh generation for path planning in uneven terrain,” IFAC-PapersOnLine, vol. 49, no. 15, pp. 212–217, 2016.
  • [25] F. Ruetz, E. Hernández, M. Pfeiffer, H. Oleynikova, M. Cox, T. Lowe, and P. Borges, “Ovpc mesh: 3d free-space representation for local ground vehicle navigation,” in 2019 International Conference on Robotics and Automation (ICRA).   IEEE, 2019, pp. 8648–8654.
  • [26] M. Yokozuka, S. Oishi, S. Thompson, and A. Banno, “VITAMIN-E: visual tracking and mapping with extremely dense feature points,” CoRR, vol. abs/1904.10324, 2019.
  • [27] R. Newcombe and A. Davison, “Live dense reconstruction with a single moving camera,” in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR).   IEEE, 2010, pp. 1498–1505.
  • [28] S. Lovegrove, “Parametric dense visual slam,” Ph.D. dissertation, 2012.
  • [29] R. Newcombe, S. Lovegrove, and A. Davison, “DTAM: Dense tracking and mapping in real-time,” in Intl. Conf. on Computer Vision (ICCV), Barcelona, Spain, Nov 2011, pp. 2320–2327.
  • [30] F. Milano, A. Loquercio, A. Rosinol, D. Scaramuzza, and L. Carlone, “Primal-dual mesh convolutional neural networks,” in Conference on Neural Information Processing Systems (NeurIPS), 2020. [Online]. Available: https://github.com/MIT-SPARK/PD-MeshNet
  • [31] R. Hanocka, A. Hertz, N. Fish, R. Giryes, S. Fleishman, and D. Cohen-Or, “MeshCNN: a network with an edge,” ACM Trans. Graph., vol. 38, no. 4, pp. 1–12, 2019.
  • [32] M. Bloesch, T. Laidlow, R. Clark, S. Leutenegger, and A. J. Davison, “Learning meshes for dense visual slam,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 5855–5864.
  • [33] Q. Feng and N. Atanasov, “Mesh reconstruction from aerial images for outdoor terrain mapping using joint 2d-3d learning,” arXiv preprint arXiv:2101.01844, 2021.
  • [34] J. D. Foley, F. D. Van, A. Van Dam, S. K. Feiner, J. F. Hughes, and J. Hughes, Computer graphics: principles and practice.   Addison-Wesley Professional, 1996, vol. 12110.
  • [35] M. De Berg, M. Van Kreveld, M. Overmars, and O. Schwarzkopf, “Computational geometry,” in Computational geometry.   Springer, 1997, pp. 1–17.
  • [36] K. Bredies, K. Kunisch, and T. Pock, “Total generalized variation,” SIAM Journal on Imaging Sciences, vol. 3, no. 3, pp. 492–526, 2010.
  • [37] R. Ranftl, K. Bredies, and T. Pock, “Non-local total generalized variation for optical flow estimation,” in European Conference on Computer Vision.   Springer, 2014, pp. 439–454.
  • [38] P. Pinies, L. M. Paz, and P. Newman, “Dense mono reconstruction: Living with the pain of the plain plane,” in 2015 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2015, pp. 5226–5231.
  • [39] A. Chambolle and T. Pock, “A first-order primal-dual algorithm for convex problems with applications to imaging,” Journal of mathematical imaging and vision, vol. 40, no. 1, pp. 120–145, 2011.
  • [40] A. Beck, First-order methods in optimization.   SIAM, 2017.
  • [41] J. Engel, T. Schöps, and D. Cremers, “LSD-SLAM: Large-scale direct monocular SLAM,” in European Conf. on Computer Vision (ECCV), 2014, pp. 834–849.
  • [42] W. N. Greene, K. Ok, P. Lommel, and N. Roy, “Multi-level mapping: Real-time dense monocular slam,” in 2016 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2016, pp. 833–840.
  • [43] S. Boyd and L. Vandenberghe, Convex optimization.   Cambridge University Press, 2004.

VI Appendix

Let us show that the following dual representation holds:

\|A\bm{\xi}-\mathbf{b}\|_{1}=\max_{\mathbf{p}\in\mathbb{R}^{n}}\langle A\bm{\xi}-\mathbf{b},\mathbf{p}\rangle-\delta_{P}(\mathbf{p}),

where $\bm{\xi},\mathbf{b},\mathbf{p}\in\mathbb{R}^{n}$, $A\in\mathbb{R}^{n\times n}$, and $\delta_{P}(\mathbf{p})$ is the indicator function of the convex set $P=\{\mathbf{p}:\|\mathbf{p}\|_{\infty}\leq 1\}$. We assume in what follows that $A$ is nonsingular.

VI-1 Proposition [43, Section 3.3]

Since $A$ is nonsingular, the conjugate of $g(\mathbf{x})=f(A\mathbf{x}+\mathbf{b})$ is:

g^{*}(\mathbf{x})=f^{*}\left(A^{-T}\mathbf{x}\right)-\mathbf{b}^{T}A^{-T}\mathbf{x},

with $\mathbf{dom}~g^{*}=A^{T}\mathbf{dom}~f^{*}$, where $f^{*}$ is the conjugate of $f$.

VI-2 Proposition [43, Section 3.3]

The convex conjugate of the $\ell_{1}$ norm is the indicator function $\delta_{Y}(\mathbf{y})$ of the convex set $Y=\{\mathbf{y}:\|\mathbf{y}\|_{\infty}\leq 1\}$.

Using Propositions 1 and 2, we have that the convex conjugate of $g(\mathbf{x})=\|A\mathbf{x}-\mathbf{b}\|_{1}=f(A\mathbf{x}-\mathbf{b})$ is:

g^{*}(\mathbf{x})=\delta_{Y}(A^{-T}\mathbf{x})+\mathbf{b}^{T}A^{-T}\mathbf{x},

where $Y=\{A^{-T}\mathbf{x}:\|A^{-T}\mathbf{x}\|_{\infty}\leq 1\}$, and $\mathbf{dom}~g^{*}=A^{T}\mathbf{dom}~f^{*}=\{\mathbf{x}:\|A^{-T}\mathbf{x}\|_{\infty}\leq 1\}$. Let us take the convex conjugate of $g^{*}$; by definition, we have:

g^{**}(\mathbf{y})=\max_{\mathbf{x}}\langle\mathbf{y},\mathbf{x}\rangle-\left[\delta_{Y}(A^{-T}\mathbf{x})+\mathbf{b}^{T}A^{-T}\mathbf{x}\right]

Taking $\mathbf{x}^{\prime}=A^{-T}\mathbf{x}$, we have:

\begin{aligned} g^{**}(\mathbf{y})&=\max_{\mathbf{x}^{\prime}}\langle\mathbf{y},A^{T}\mathbf{x}^{\prime}\rangle-\left[\delta_{X^{\prime}}(\mathbf{x}^{\prime})+\mathbf{b}^{T}\mathbf{x}^{\prime}\right]\\ &=\max_{\mathbf{x}^{\prime}}\langle A\mathbf{y}-\mathbf{b},\mathbf{x}^{\prime}\rangle-\delta_{X^{\prime}}(\mathbf{x}^{\prime}),\end{aligned}

where $X^{\prime}=\{\mathbf{x}^{\prime}:\|\mathbf{x}^{\prime}\|_{\infty}\leq 1\}$.

Since $g(\mathbf{x})$ is a proper, lower semi-continuous, and convex function, we have $g^{**}=g$ by the Fenchel–Moreau theorem. Therefore,

g^{**}(\bm{\xi})=g(\bm{\xi})=\|A\bm{\xi}-\mathbf{b}\|_{1}=\max_{\mathbf{p}}\langle A\bm{\xi}-\mathbf{b},\mathbf{p}\rangle-\delta_{P}(\mathbf{p}),

with $P=\{\mathbf{p}:\|\mathbf{p}\|_{\infty}\leq 1\}$.
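A quick numerical sanity check of this dual representation (a sketch with random data, for illustration only): the maximum over the unit $\ell_{\infty}$ ball is attained at $\mathbf{p}=\operatorname{sign}(A\bm{\xi}-\mathbf{b})$ and recovers the $\ell_{1}$ norm.

```python
# Numerical sanity check (sketch) of the dual representation derived above:
# ||A xi - b||_1 equals max over ||p||_inf <= 1 of <A xi - b, p>, attained at sign(.).
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
xi = rng.standard_normal(5)
b = rng.standard_normal(5)

r = A @ xi - b
primal = np.abs(r).sum()            # ||A xi - b||_1
p_star = np.sign(r)                 # maximizer within { p : ||p||_inf <= 1 }
dual = r @ p_star                   # <A xi - b, p*>
print(np.isclose(primal, dual))     # True
```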