
Depth360: Self-supervised Learning for Monocular Depth Estimation using Learnable Camera Distortion Model

Noriaki Hirose1 and Kosuke Tahara1 1Noriaki Hirose and Kosuke Tahara are with Toyota Central R&D Labs., Inc., 41-1, Yokomichi, Nagakute, Aichi, 480-1192, Japan hirose@mosk.tytlabs.co.jp
Abstract

Self-supervised monocular depth estimation has been widely investigated to estimate depth images and relative poses from RGB images. This framework is promising because the depth and pose networks can be trained from time-sequence images alone, without the need for ground truth depth or poses.

In this work, we estimate the depth around a robot (360° view) using time-sequence spherical camera images from a camera whose parameters are unknown. We propose a learnable axisymmetric camera model that can handle distorted spherical camera images composed of two fisheye images, as well as pinhole camera images. In addition, we train our models with ground truth depth images generated by a photo-realistic simulator to provide supervision. Moreover, we introduce loss functions that provide floor constraints to reduce artifacts caused by reflective floor surfaces. We demonstrate the efficacy of our method using spherical camera images from the GO Stanford dataset and pinhole camera images from the KITTI dataset, and compare our method's performance with that of baseline methods in learning the camera parameters.

I INTRODUCTION

Accurately estimating three-dimensional (3D) information of objects and structures in the environment is essential for autonomous navigation and manipulation [1, 2]. Self-supervised monocular depth estimation without ground truth (GT) depth images is one of the most popular approaches for obtaining 3D information [3, 4, 5, 6, 7]. However, self-supervised monocular depth estimation has several limitations: it requires the camera parameters, it cannot estimate the real scale of the depth image, and it performs poorly for highly reflective objects. These limitations reduce the amount of data available for training and cause critical artifacts in robotics applications.

This paper proposes a novel self-supervised monocular depth estimation approach for spherical camera images. We propose a learnable axisymmetric camera model to handle images from a camera whose parameters are unknown. Because this camera model is applicable to highly distorted images, such as spherical camera images composed of two fisheye images, we can obtain 360° 3D point clouds all around a robot from only one camera.

In addition to self-supervised learning with real images, we rendered many pairs of spherical RGB images and their corresponding depth images from a photo-realistic robot simulator. In training, we mixed these rendered images with real images to achieve sim2real transfer in an attempt to provide scaling for the estimated depth. Moreover, we introduce additional cost functions to improve the accuracy of depth estimation for reflective floor areas. We provide supervision for the estimated depth images from the future and past robot footprints, which are obtained from the reference velocities recorded during data collection. Our main contributions are summarized as follows:

Figure 1: Estimated depth images from 256×128 spherical camera images from the GO Stanford dataset and 640×128 pinhole camera images from the KITTI dataset. A spherical image was constructed using two fisheye images. The gray area in the spherical image is a common mask to exclude the corresponding pixels for training. Our method can estimate depth from the spherical image in [a] and the pinhole camera image in [b] using our learnable axisymmetric camera model.
Figure 2: Block diagram of our method. [a] Depth estimation for the real image of the GO Stanford dataset, [b] depth estimation for the simulation image with GT depth. The RGB image and GT depth are rendered via a 3D reconstructed environment using a Matterport scanner [8], [c] pose estimation for the “Trans.” in [a].
  • A novel learnable axisymmetric camera model capable of handling distorted images with unknown camera parameters,

  • Sim2real transfer using ground truth depth from a photo-realistic simulator to sharpen the estimated depth image and provide scaling,

  • Proposal of novel loss functions that use the robot footprints and trajectories to provide constraints against reflective floor surfaces.

In addition to these main contributions, we blend the front- and back-side fisheye images to reduce the occluded area for image prediction in self-supervised learning. As a result, our method can estimate depth images without large artifacts even from low-resolution spherical images.

Our method was trained and evaluated on the GO Stanford (GS) dataset [9, 10, 11], as depicted in Fig. 1[a], with time-sequence spherical camera images and the associated reference velocities recorded during dataset collection. Moreover, we separately evaluated our most important contribution, the learnable axisymmetric camera model, on the widely used KITTI dataset [12], as depicted in Fig. 1[b], to provide a comparison with other learnable camera models [13, 14]. We quantitatively and qualitatively evaluated our method on the GS and KITTI datasets.

II RELATED WORK

Monocular depth estimation using self-supervised learning has been widely investigated by several approaches that use deep learning [15, 16, 17, 6, 18, 19, 20, 4, 13, 21, 22]. Zhou et al. [3], and Vijayanarasimhan et al. [23] applied spatial transformer modules [24] based on the knowledge of structure from motion to achieve self-supervised learning from time-sequence images. Since the publication of [3] and [23], several subsequent studies have attempted to estimate accurate depth using different methods, including a modified loss function [25], an additional loss function to penalize geometric inconsistencies [5, 26, 13, 27, 7], a probabilistic approach to estimate the reliability of depth estimation [22, 28], masking dynamic objects [4, 13], and an entirely novel model structure [17, 29, 30]. We review the two categories most related to our method.

Fisheye camera image. Various approaches have attempted to estimate depth images from fisheye images. [31] proposed supervised learning with sparse GT from LiDAR. [32, 33, 34] leveraged multiple fisheye cameras to estimate a 360° depth image. Similarly, Kumar et al. proposed self-supervised learning approaches with a calibrated fisheye camera model [35] and semantic segmentation [36].

Learning camera model. Gordon et al. [13] proposed a self-supervised monocular depth estimation method that can learn the camera’s intrinsic parameters. Vasiljevic et al. [14] proposed a learnable neural ray surface (NRS) camera model for highly distorted images.

In contrast to the most related work [14], our learnable axisymmetric camera model is fully differentiable without the softmax approximation. Hence, end-to-end learning can be achieved without adjusting the hyperparameters during training. In addition, we provide supervision for the estimated depth images by using a photo-realistic simulator and robot trajectories from the dataset. As a result, our method can accurately estimate the depth from low-resolution spherical images.

III PROPOSED METHOD

Following the process in Fig. 2, we designed the following cost function to train the depth, pose, and camera networks:

J = J_{bimg} + \lambda_{d} J_{depth} + \lambda_{f} J_{floor} + \lambda_{p} J_{pose} + \lambda_{s} J_{sm}.   (1)

In $J_{bimg}$, we propose a learnable camera model to handle image sequences with unknown camera parameters. Our camera model can be trained without GT camera parameters through minimization of $J$ for self-supervised monocular depth estimation. The camera model has a learnable convex projection surface to deal with arbitrary distortion, e.g., a spherical camera image composed of two fisheye images. In contrast to the image loss of the baseline methods [3, 25], $J_{bimg}$ is an occlusion-aware image loss that uses our learnable camera model to leverage the 360° view around the robot provided by the spherical camera.

$J_{depth}$ penalizes the depth difference using the GT depth from the photo-realistic simulator. By penalizing $J_{depth}$ together with $J_{bimg}$, our model can achieve sim2real transfer and can thereby estimate accurate depths from real images. $J_{floor}$ is the proposed loss function that provides supervision for floor areas by using the robot's footprint and trajectory.

In addition to these major contributions, $J_{pose}$ penalizes the difference between the estimated pose and the GT pose calculated from the reference velocities in the dataset. $J_{sm}$ penalizes the discontinuity of the estimated depth using exactly the same objective as in [16, 25].

In the following sections, we first present $J_{bimg}$ and $J_{pose}$ in the overview of the training process. Then, we explain our camera model as the main contribution. Finally, we introduce $J_{depth}$ and $J_{floor}$ to improve the performance of depth estimation.

We define the robot and camera coordinates based on the global robot pose $X$, as shown in Fig. 3. $\Sigma_{X^{r}}$ is the base coordinate of the robot. In addition, $\Sigma_{X^{f}}$ and $\Sigma_{X^{b}}$ are the camera coordinates of the front- and back-side fisheye cameras in the spherical camera, respectively. The axis directions are shown in Fig. 3. Following the camera coordinate convention, we define the $y$ axis as downward and the $z$ axis as forward in $\Sigma_{X^{f}}$ and $\Sigma_{X^{b}}$. Additionally, $\Sigma_{X^{f}}$ and $\Sigma_{X^{b}}$ face opposite directions about their $y$ axes. We assume that the relative poses between the coordinates are known after measuring the height $h_{cam}$ and the offset $l_{cam}$ of the camera position.

Figure 3: Robot and spherical camera coordinates. $\Sigma_{X^{f}}$ and $\Sigma_{X^{b}}$ are the coordinates of the front- and back-side fisheye cameras in the spherical camera on the robot, respectively. $\Sigma_{X^{r}}$ denotes the robot coordinate at pose $X$.

III-A Overview

III-A1 Process of depth estimation

Fig. 2[a] presents the calculation of depth estimation for real images. Since there are no GT depth images, we employ a self-supervised learning approach using time-sequence images. Unlike previous approaches [3, 25], our spherical camera can capture both the front- and back-side of the robot. Hence, we propose a cost function that blends the front- and back-side images and thereby reduces the negative effects of occlusion.

We feed the front-side image $I_{A^{f}}$ and back-side image $I_{A^{b}}$ at robot pose $A$ into the depth network $f_{depth}()$ to estimate the corresponding depth images as $D_{A^{f}}, D_{A^{b}} = f_{depth}(I_{A^{f}}, I_{A^{b}})$. By back-projection $f_{backproj}()$ with our proposed camera model, we can obtain the corresponding point clouds $Q_{A^{f}}$ and $Q_{A^{b}}$ in the camera coordinates $\Sigma_{A^{f}}$ and $\Sigma_{A^{b}}$, respectively.

Q_{A^{f}} = f_{backproj}(D_{A^{f}}), \quad Q_{A^{b}} = f_{backproj}(D_{A^{b}})   (2)

In contrast to the baseline methods, our method predicts both the front- and back-side images from a single side image and blends them in $J_{bimg}$. Hence, "Trans." in Fig. 2[a] transforms the coordinates of the estimated point clouds as follows:

Q^{B^{f}}_{A^{f}} = T_{AB} Q_{A^{f}}, \quad Q^{B^{f}}_{A^{b}} = T_{AB} T^{-1}_{fb} Q_{A^{b}},   (3)
Q^{B^{b}}_{A^{f}} = T_{fb} T_{AB} Q_{A^{f}}, \quad Q^{B^{b}}_{A^{b}} = T_{fb} T_{AB} T^{-1}_{fb} Q_{A^{b}}.

Here, $Q^{Y^{\beta}}_{X^{\alpha}}$ denotes the point cloud on the coordinate $\Sigma_{Y^{\beta}}$ estimated from image $I_{X^{\alpha}}$. $T_{AB}$ is the estimated transformation matrix between $\Sigma_{A^{f}}$ and $\Sigma_{B^{f}}$. $T_{fb}$ is the known transformation matrix between $\Sigma_{X^{f}}$ and $\Sigma_{X^{b}}$. By projecting these point clouds with our learnable camera model, we can estimate four matrices $M^{B^{\beta}}_{A^{\alpha}}$,

M^{B^{\beta}}_{A^{\alpha}} = f_{proj}(Q^{B^{\beta}}_{A^{\alpha}})   (4)

where $\alpha=\{f,b\}$ and $\beta=\{f,b\}$. Following [24], we estimate $I_{A^{\alpha}}$ by sampling the pixel values of $I_{B^{\beta}}$ as $\hat{I}^{B^{\beta}}_{A^{\alpha}} = f_{sample}(M^{B^{\beta}}_{A^{\alpha}}, I_{B^{\beta}})$. Here, $\hat{I}^{B^{\beta}}_{A^{\alpha}}$ denotes the estimate of $I_{A^{\alpha}}$ obtained by sampling $I_{B^{\beta}}$. Note that we estimate four images from the combinations of $\alpha=\{f,b\}$ and $\beta=\{f,b\}$, as shown in Fig. 2[a]. We calculate the blended image loss $J_{bimg}$ to penalize the model during training.

J_{bimg} = \lambda_{1} J_{L1} + \lambda_{2} J_{SSIM},   (5)
J_{L1} = \sum_{\alpha\in\{f,b\}} f_{min}(M|I_{A^{\alpha}} - \hat{I}^{B^{f}}_{A^{\alpha}}|, M|I_{A^{\alpha}} - \hat{I}^{B^{b}}_{A^{\alpha}}|),
J_{SSIM} = \sum_{\alpha\in\{f,b\}} f_{min}(M d_{ssim}(I_{A^{\alpha}}, \hat{I}^{B^{f}}_{A^{\alpha}}), M d_{ssim}(I_{A^{\alpha}}, \hat{I}^{B^{b}}_{A^{\alpha}})).

Here, $f_{min}(\cdot,\cdot)$ selects the smaller value at each pixel and calculates the mean over all pixels. $d_{ssim}(\cdot,\cdot)$ is the pixel-wise structural similarity (SSIM) [37], following [25]. By selecting the smaller value in $f_{min}(\cdot,\cdot)$, we can equivalently select the non-occluded pixel value of $\hat{I}^{B^{f}}_{A^{\alpha}}$ or $\hat{I}^{B^{b}}_{A^{\alpha}}$ when calculating the L1 and SSIM terms [25]. $M$ is a mask that removes the pixels without RGB values and those of the robot itself, which are depicted in gray in Fig. 1[a].
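As a reference, the following is a minimal PyTorch-style sketch of the blended image loss for a single side $\alpha$; the `ssim_fn` helper, the weighting defaults, and the tensor shapes (B, 3, H, W) are assumptions, not the released implementation.

```python
# Hedged sketch of the blended image loss J_bimg (Eq. 5) for one side alpha.
import torch

def blended_image_loss(I_A, I_hat_from_Bf, I_hat_from_Bb, mask,
                       ssim_fn, lambda_1=0.85, lambda_2=0.15):
    """Pixel-wise minimum over the two reconstructions (front/back source)
    selects the non-occluded prediction before averaging."""
    l1_f = (mask * (I_A - I_hat_from_Bf).abs()).mean(1, keepdim=True)
    l1_b = (mask * (I_A - I_hat_from_Bb).abs()).mean(1, keepdim=True)
    J_L1 = torch.minimum(l1_f, l1_b).mean()

    ssim_f = mask * ssim_fn(I_A, I_hat_from_Bf)   # pixel-wise SSIM-based map
    ssim_b = mask * ssim_fn(I_A, I_hat_from_Bb)
    J_SSIM = torch.minimum(ssim_f, ssim_b).mean()

    return lambda_1 * J_L1 + lambda_2 * J_SSIM
```

Taking the pixel-wise minimum before averaging is what lets the loss ignore pixels that are occluded in one of the two source views.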

III-A2 Process of pose estimation

Fig. 2[c] shows the process used to estimate $T_{AB}$. Unlike previous monocular depth estimation approaches, we use the GT transformation matrix $\bar{T}_{AB}$ obtained by integrating the reference velocities $\{v_i, \omega_i\}_{i=0\cdots N_g-1}$ used to move between poses $A$ and $B$. $J_{pose}$ is designed as $J_{pose} = \sum(\bar{T}_{AB} - T_{AB})^2$.
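For illustration, a hedged sketch of how $\bar{T}_{AB}$ could be obtained by integrating the reference velocities is shown below; the unicycle model, the Euler integration scheme, and the sampling period `dt` are assumptions on our part.

```python
# Hedged sketch: integrating the teleoperation reference velocities
# (v_i, omega_i) into a planar pose, lifted to a 4x4 homogeneous matrix.
import numpy as np

def integrate_reference_velocities(velocities, dt=0.1):
    x, y, theta = 0.0, 0.0, 0.0
    for v, omega in velocities:            # unicycle model, Euler integration
        x += v * np.cos(theta) * dt
        y += v * np.sin(theta) * dt
        theta += omega * dt
    # Planar relative pose in the robot frame (x forward, y left).
    T = np.eye(4)
    T[0, 0], T[0, 1] = np.cos(theta), -np.sin(theta)
    T[1, 0], T[1, 1] = np.sin(theta),  np.cos(theta)
    T[0, 3], T[1, 3] = x, y
    return T
```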

Figure 4: Overview of our camera model. Our model projects and back-projects via $[x_i, y_i]$ on the $XY$ plane of the camera coordinate.
Figure 5: Projection surface of our learnable axisymmetric camera model. Our projection surface, composed of multiple line segments, is axisymmetric and convex upward so that the model is fully differentiable.

III-B Learnable axisymmetric camera model

Our camera model defines the relationship between the pixel position $[u_i, v_i]$ on the image coordinates and the corresponding 3D point $[X_i, Y_i, Z_i]$ on the camera coordinates. This mapping is written as $f_{backproj}()$ or $f_{proj}()$ in the training process. Since we formulate a differentiable model, all parameters in our model can be trained simultaneously along with the depth and pose networks.

Figure 4 shows an overview of the camera model. There are two individual processes: 1) $[u_i,v_i] \Leftrightarrow [x_i,y_i]$ (blue double arrow) for modeling the field of view (FoV) and the offset, similar to the camera intrinsic parameters, and 2) $[x_i,y_i] \Leftrightarrow [X_i,Y_i,Z_i]$ (green double arrow) for handling the camera distortion. These processes are connected at $[x_i,y_i]$ on the $XY$ plane of the camera coordinate. We explain each process in the following paragraphs. To accept arbitrarily sized images, the image coordinate $UV$ is normalized within $[-1, 1]$ and its origin is positioned at the center of the image.

III-B1 Part I: $[u_i,v_i] \Leftrightarrow [x_i,y_i]$

The mapping between $[x_i,y_i]$ and $[u_i,v_i]$ is defined by four independent parameters: $r_x, r_y$ for the FoV, and $o_x, o_y$ for the offset between $UV$ and $XY$. Assuming linear relationships, this part of $f_{backproj}()$ is written as follows,

[x_i, y_i]^{T} = R [u_i, v_i]^{T} + [o_x, o_y]^{T},   (6)

where $R = \mbox{diag}(1/r_x, 1/r_y)$. Owing to the linearity and regularity, defining the inverse transformation for $f_{proj}()$ as $R^{-1}([x_i, y_i]^{T} - [o_x, o_y]^{T})$ is straightforward.

III-B2 Part II: $[x_i,y_i] \Leftrightarrow [X_i,Y_i,Z_i]$

Next, we describe the process indicated by the green double arrow between $[x_i,y_i]$ and $[X_i,Y_i,Z_i]$ in Fig. 4, which models the distortion. First, we design the learnable projection surface (grey lines in Fig. 4). Then, we explain the computation procedures in $f_{backproj}()$ and $f_{proj}()$.

Projection surface

Figure 5 shows the details of our projection surface, which is axisymmetric around the $Z$ axis and convex upward. Unlike the baseline camera models [38, 39], our projection surface is modeled as a linear interpolation of a discrete surface to effectively train the camera model in self-supervised learning, inspired by Bhat et al. [40]. Note that Bhat et al. [40] discretize the estimated depth itself and interpolate it in a supervised learning architecture for explicit depth estimation, unlike our approach.

This projection surface defines a unique mapping between the angle of the incident light and the radial position of the projected point on the $XY$ plane, which reflects the distortion property of the camera. The projection surface on the $WZ$ plane is defined as consecutive line segments through the points $[b_i, h_i]_{i=0\cdots N_b}$ in Fig. 5[b]. Here, the $W$ axis is defined as the radial direction from the origin towards $[X_i, Y_i, 0]$, as shown in Fig. 5[a]. Because the parameters $[b_i, h_i]_{i=0\cdots N_b}$ are normalized to stabilize the training process, the projection surface on the $XY$ plane can be depicted as a unit circle centered at the origin, as shown in Fig. 5[a]. The constraints $\Delta h_i/\Delta b_i > 0$ and $\Delta b_i > 0$ ensure the convex upward shape. These constraints are indispensable for a fully differentiable model, as shown in b) Back-projection and c) Projection. Here, $\Delta h_i = \tilde{h}_{i-1} - \tilde{h}_i$ and $\Delta b_i = \tilde{b}_{i-1} - \tilde{b}_i$, where $\tilde{x}$ indicates the variable $x$ before normalization.

Our camera network $f_{cam}()$ estimates all parameters in our camera model as follows:

\{\Delta b_i, \Delta h_i/\Delta b_i\}_{i=1\cdots N_b}, r_x, r_y, o_x, o_y = f_{cam}(f_{img})   (7)

where $f_{img}$ is the image feature from the depth encoder. In $f_{cam}()$, we apply a sigmoid function at the last layer to ensure $\Delta h_i/\Delta b_i > 0$ and $\Delta b_i > 0$. By simple algebra with $\tilde{h}_{N_b} = 0.0$ and $\tilde{b}_{N_b} = 1.0$, we can obtain $[\tilde{b}_i, \tilde{h}_i]_{i=0\cdots N_b-1}$. By normalizing to stabilize the training process, we obtain $h_i = \tilde{h}_i / \sum_{k=0}^{N_b}\tilde{h}_k$ and $b_i = \tilde{b}_i / \sum_{k=0}^{N_b}\tilde{b}_k$.
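A minimal sketch of recovering the normalized surface parameters $(b_i, h_i)$ from the outputs of (7) is given below; it literally follows the stated recursion ($\Delta b_i = \tilde{b}_{i-1} - \tilde{b}_i$ with $\tilde{b}_{N_b}=1$, $\tilde{h}_{N_b}=0$), and the batch-free tensor shapes are assumptions.

```python
# Hedged sketch: camera-network outputs -> normalized surface points (b_i, h_i).
import torch

def build_projection_surface(delta_b, slope):
    """delta_b, slope: positive tensors of length Nb (sigmoid outputs)."""
    delta_h = slope * delta_b                    # Δh_i = (Δh_i/Δb_i) · Δb_i
    # Accumulate from the boundary point (b̃_Nb, h̃_Nb) = (1, 0) backwards,
    # so index j holds the sum of the increments with larger indices.
    b_tilde = 1.0 + torch.flip(torch.cumsum(torch.flip(delta_b, [0]), 0), [0])
    h_tilde = torch.flip(torch.cumsum(torch.flip(delta_h, [0]), 0), [0])
    b_tilde = torch.cat([b_tilde, torch.ones(1)])    # append b̃_Nb = 1
    h_tilde = torch.cat([h_tilde, torch.zeros(1)])   # append h̃_Nb = 0
    # Normalization used to stabilize training (Sec. III-B2).
    return b_tilde / b_tilde.sum(), h_tilde / h_tilde.sum()
```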

Back-projection

To calculate $[X_i, Y_i, Z_i]$ from the estimated depth $Z_i$ at $[x_i, y_i]$, we first calculate $[w_i, z_i]$, which is the intersection point on the projection surface. The line of the $j$-th line segment of the projection surface on the $WZ$ plane can be expressed as:

Z = \frac{h_{j-1}-h_{j}}{b_{j-1}-b_{j}} W + \frac{h_{j-1}b_{j}-h_{j}b_{j-1}}{b_{j-1}-b_{j}} = \alpha_{j} W + \beta_{j}.   (8)

To achieve a fully differentiable process, we calculate the intersection points between all lines of the projection surface and the vertical line $W = w_i~(= \sqrt{x_i^2 + y_i^2})$. Then, we select the minimum height among all intersections as $z_i$, because the projection surface is upwardly convex.

z_i = \min(\{\alpha_j w_i + \beta_j\}_{j=1\cdots N_b})   (9)

Here, the min function is differentiable. Note that searching for the corresponding line segment by element-wise comparison, instead of the above calculation, is not differentiable. Based on $z_i$, we can obtain $X_i = \frac{Z_i}{z_i} x_i$ and $Y_i = \frac{Z_i}{z_i} y_i$.
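A minimal PyTorch sketch of this differentiable back-projection, assuming flattened tensors, is shown below; the helper that computes the line coefficients $\alpha_j, \beta_j$ of (8) is included, and the function names are illustrative.

```python
# Hedged sketch of the differentiable back-projection (Eqs. 8-9).
import torch

def segment_coefficients(b, h):
    """alpha_j, beta_j of Eq. (8) from the surface points (b_j, h_j), j = 0..Nb."""
    alpha = (h[:-1] - h[1:]) / (b[:-1] - b[1:])
    beta = (h[:-1] * b[1:] - h[1:] * b[:-1]) / (b[:-1] - b[1:])
    return alpha, beta

def back_project(x, y, Z, alpha, beta):
    w = torch.sqrt(x**2 + y**2)                 # radial position on the XY plane
    # Evaluate every segment line at W = w and take the minimum height:
    # the surface is convex upward, so the minimum is the true intersection.
    z_candidates = alpha[None, :] * w[:, None] + beta[None, :]   # (N, Nb)
    z = z_candidates.min(dim=1).values          # differentiable min
    X = Z / z * x
    Y = Z / z * y
    return torch.stack([X, Y, Z], dim=-1)
```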

Projection

Similarly to back-projection, we first calculate the intersection point $[w_i, z_i]$ and then derive $[x_i, y_i]$ from $[X_i, Y_i, Z_i]$. The line between the origin and $[X_i, Y_i, Z_i]$ can be expressed as $Z = \frac{Z_i}{W_i} W = \gamma_i W$. As in the back-projection process, the minimum value over all intersections on the $Z$ axis gives $z_i$. Hence, $z_i$ and $w_i$ can be derived as follows:

z_i = \min\left(\left\{\frac{\gamma_i \beta_j}{\gamma_i - \alpha_j}\right\}_{j=1\cdots N_b}\right), \quad w_i = z_i/\gamma_i.   (10)

Thus, $[x_i, y_i] = [\frac{w_i}{W_i} X_i, \frac{w_i}{W_i} Y_i]$, where $W_i = \sqrt{X_i^2 + Y_i^2}$.

In both back-projection and projection, we clamp $z_i$ between $h_0$ and $h_{N_b}$, and we treat points $[x_i, y_i]$ outside the unit circle on the $XY$ plane as out-of-view points that are not considered.
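A corresponding sketch of the projection of (10), with the clamping described above, is given below; the tensor shapes and the handling of out-of-view points are assumptions.

```python
# Hedged sketch of the differentiable projection (Eq. 10).
import torch

def project(X, Y, Z, alpha, beta, h):
    W = torch.sqrt(X**2 + Y**2)
    gamma = Z / W                               # slope of the viewing ray
    z_candidates = gamma[:, None] * beta[None, :] / (gamma[:, None] - alpha[None, :])
    z = z_candidates.min(dim=1).values
    z = torch.clamp(z, min=float(h[-1]), max=float(h[0]))  # keep z within [h_Nb, h_0]
    w = z / gamma
    x = w / W * X
    y = w / W * Y
    # Points with x^2 + y^2 > 1 fall outside the unit circle and are treated
    # as out of view by the sampling mask.
    return x, y
```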

III-C Floor loss

Highly reflective floor surfaces in indoor environments can often cause significant artifacts in depth estimation. Our floor loss provides geometric supervision for floor areas. Assuming that the camera is horizontally mounted on the robot and its height is known, the floor loss is constructed from two components: $J_{floor} = J_{fcl} + J_{lbl}$.

III-C1 Footprint consistency loss $J_{fcl}$

The robot footprint is almost horizontally flat, and its height is obtained from the camera mounting position. Using the method shown below, we obtain the GT depth $\{\bar{D}_{A^{\alpha}}\}_{\alpha=\{f,b\}}$ only around the robot footprint within $\pm M_r$ steps and provide the supervision as follows:

J_{fcl} = \sum_{\alpha\in\{f,b\}} \frac{1}{N_m} \sum_{i=1}^{N_m} M_f |\bar{D}_{A^{\alpha}} - D_{A^{\alpha}}|,   (11)

where $M_f$ is a mask that removes the pixels without GT values in $\bar{D}_{A^{\alpha}}$, and $N_m$ is the number of pixels with GT depth. $\bar{D}_{A^{\alpha}}$ can be derived as:

\bar{D}_{A^{\alpha}}(x^{\alpha}_j, y^{\alpha}_j) = Z^{\alpha}_j,   (12)

where $[x^{\alpha}_j, y^{\alpha}_j] = f_{proj}(T_{r\alpha} Q_{foot}[j])$ and $Z^{\alpha}_j$ is the $Z$-axis value of $T_{r\alpha} Q_{foot}[j]$. Here, $Q_{foot}$ is the point cloud of the robot footprint within $\pm M_r$ steps on $\Sigma_{X^{r}}$. To obtain $Q_{foot}$, we calculate the local robot positions using the teleoperator's velocities within $\pm M_r$ steps. $T_{r\alpha}$ is the known transformation matrix from $\Sigma_{X^{r}}$ to $\Sigma_{X^{\alpha}}$. By assigning all points into $\bar{D}_{A^{\alpha}}$ using (12), we obtain $\bar{D}_{A^{\alpha}}$ to calculate $J_{fcl}$. Note that $\bar{D}_{A^{\alpha}}$ is a sparse matrix; pixels without a GT value in $\bar{D}_{A^{\alpha}}$ are excluded by $M_f$.
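A hedged sketch of $J_{fcl}$ follows: the footprint points are projected with $f_{proj}$, scattered into a sparse GT depth map, and compared with the estimate. Rounding to the nearest pixel and the assumption that `fproj` returns normalized image coordinates in [-1, 1] are ours.

```python
# Hedged sketch of the sparse footprint GT depth and J_fcl (Eqs. 11-12).
import torch

def footprint_consistency_loss(D_est, Q_foot_cam, fproj, H, W):
    """D_est: (H, W) estimated depth; Q_foot_cam: (N, 3) footprint points
    already transformed into the camera frame (T_ra · Q_foot)."""
    x, y = fproj(Q_foot_cam)                 # normalized coords in [-1, 1]
    u = ((x + 1) * 0.5 * (W - 1)).round().long().clamp(0, W - 1)
    v = ((y + 1) * 0.5 * (H - 1)).round().long().clamp(0, H - 1)
    D_gt = torch.zeros(H, W)
    D_gt[v, u] = Q_foot_cam[:, 2]            # Z value of each footprint point
    mask = (D_gt > 0).float()                # keep only pixels with GT depth
    n_valid = mask.sum().clamp(min=1.0)
    return (mask * (D_gt - D_est).abs()).sum() / n_valid
```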

III-C2 Lower boundary loss $J_{lbl}$

In indoor scenes, floor areas often reflect ceiling lights. This can make it appear as if there are holes in the floor in the estimated depth image. To provide a lower boundary for the height of the estimated point clouds, we observe two key points: 1) the camera is horizontally mounted on the robot, and 2) most objects around the robot are higher than the floor. One exception would be anything that is downstairs, which would be lower than the floor. However, such occurrences are rare in the dataset, because teleoperation around stairs is dangerous and ill-advised.

To provide the constraint on the $Y$-axis (= height) value of the estimated point clouds, $J_{lbl}$ is given as follows:

J_{lbl} = \sum_{\alpha\in\{f,b\}} \frac{1}{N} \sum_{i=1}^{N} \max(0.0, Y_{A^{\alpha}} - h_{cam}),   (13)

where $Q_{A^{\alpha}} = [X_{A^{\alpha}}, Y_{A^{\alpha}}, Z_{A^{\alpha}}]$. $J_{lbl}$ penalizes $Y_{A^{\alpha}}$ values that are larger (= lower) than the floor height ($= h_{cam}$).
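A minimal sketch of $J_{lbl}$ for one point cloud is shown below, assuming an (N, 3) tensor in the camera frame with the $y$ axis pointing downward.

```python
# Minimal sketch of the lower boundary loss J_lbl (Eq. 13).
import torch

def lower_boundary_loss(Q, h_cam):
    Y = Q[:, 1]                                 # downward height of each point
    # Penalize points estimated below the floor plane (Y > h_cam).
    return torch.clamp(Y - h_cam, min=0.0).mean()
```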

III-D Sim2real transfer with $J_{depth}$

In conjunction with self-supervised learning using time-sequence real images, we use the GT depth images $\{\bar{D}_{s^{f}}, \bar{D}_{s^{b}}\}$ from a photo-realistic robot simulator [41, 42], as shown in Fig. 2[b]. Although there is an appearance gap between real and simulated images, the GT depth from the simulator helps the model understand the 3D geometry of the environment from the image. Here, $J_{depth} = \sum_{\alpha\in\{f,b\}} d_{depth}(\bar{D}_{s^{\alpha}}, D_{s^{\alpha}})$, where $D_{s^{f}}, D_{s^{b}} = f_{depth}(I_{s^{f}}, I_{s^{b}})$. $I_{s^{f}}$ and $I_{s^{b}}$ are the front- and back-side fisheye images from the simulator, respectively. We employ the same metric $d_{depth}()$ as the baseline method [43] to measure the depth differences. To achieve sim2real transfer, we simultaneously penalize $J_{depth}$ and $J_{bimg}$.
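Since $d_{depth}$ follows [43], a hedged sketch of a DenseDepth-style depth loss is shown below: a point-wise L1 term, an L1 term on depth gradients, and an SSIM term. The exact weighting and the `ssim_fn` helper are assumptions rather than the authors' released code.

```python
# Hedged sketch of a DenseDepth-style depth metric d_depth, following [43].
import torch

def depth_loss(D_gt, D_est, ssim_fn, lam=0.1):
    l_pix = (D_gt - D_est).abs().mean()
    dx_gt, dy_gt = D_gt.diff(dim=-1), D_gt.diff(dim=-2)
    dx_e, dy_e = D_est.diff(dim=-1), D_est.diff(dim=-2)
    l_grad = (dx_gt - dx_e).abs().mean() + (dy_gt - dy_e).abs().mean()
    l_ssim = ssim_fn(D_gt, D_est).mean()        # e.g. (1 - SSIM) / 2
    return lam * l_pix + l_grad + l_ssim
```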

Figure 6: Examples of measured 3D reconstructed environments [8] and rendered spherical images [41, 42]. [a] The office and laboratory from our company’s buildings for training and validation, [b] traditional Japanese house and art gallery for testing.

IV EXPERIMENT

IV-A Dataset

Our method was mainly evaluated using the GO Stanford (GS) dataset with time-sequence spherical camera images. In addition, we used the KITTI dataset with pinhole camera images for comparison with the baseline methods, which attempt to learn the camera parameters.

IV-A1 GO Stanford dataset

We used the GS dataset [9, 10, 11] with time-sequence spherical camera images (256×128) and the reference velocities, which were collected by teleoperating a TurtleBot 2 with a Ricoh THETA S. The GS dataset contains 10.3 hours of data from twelve buildings on the Stanford University campus. To train our networks, we used a training dataset from eight buildings, following [11].

In addition to the real images of the GS dataset, we collected pairs of simulator images and GT depth images for $J_{depth}$. We scanned 12 floors (e.g., office rooms, meeting rooms, laboratories) in our company buildings and ten environments (e.g., a traditional Japanese house, an art gallery, a fitness gym) for the simulator using a Matterport Pro2, as shown in Fig. 6. We separated them into groups: eight floors in our company buildings for training, four floors in our company buildings for validation, and the ten environments outside our company for testing. For training, we rendered 10,000 data points from each environment using interactive Gibson [41, 42]. In addition, we collected 1,000 data points for testing from the ten test environments.

Data collection of the GT depth for a real spherical image is a challenging task. Hence, in the quantitative analysis, we used the GT depth from the simulator. To evaluate the generalization performance, our test environments were not taken from our company buildings. Examples of the domain gap between training and testing are shown in Fig. 6. In the qualitative analysis, we used both real and simulated images.

TABLE I: Evaluation of depth estimation from the GO Stanford dataset. For the three leftmost metrics, smaller values are better; for the rightmost metric, higher values are better. "SGT" denotes the GT depth from the simulator. The bold values indicate the best results. All methods except † were evaluated without median scaling because scaling is learned from the SGT. ‡ used the half-sphere camera model shown in (14) and (15).
Method  SGT  Abs-Rel  Sq-Rel  RMSE  δ<1.25
monodepth2 [25] ‡ †  0.586  1.890  1.123  0.439
–with SGT [25, 43] ‡  0.228  0.162  0.535  0.664
Alhashim et al. [43]  0.203  0.154  0.542  0.698
Our method (full)  0.198  0.143  0.525  0.711
–wo $J_{depth}$ †  0.377  0.335  0.829  0.433
–wo our cam. model ‡  0.248  0.179  0.548  0.646
–wo $J_{lbl}$  0.233  0.161  0.532  0.665
–wo $J_{fcl}$  0.216  0.151  0.530  0.684
–wo blending in $J_{bimg}$  0.205  0.147  0.530  0.702
–wo $J_{pose}$  0.200  0.145  0.532  0.698
TABLE II: Evaluation of depth estimation on the KITTI raw dataset. For the leftmost three metrics, smaller values are better; for the rightmost metric, higher values are better. The bold values indicate the best results. All the methods in this table employ a pretrained ResNet-18 for their encoder. All values were calculated after median scaling. † was trained on a mixed dataset with the KITTI, Cityscape, bike, and GO Stanford datasets. The others were trained only on KITTI.
Method  Camera  Abs-Rel  Sq-Rel  RMSE  δ<1.25
Gordon et al. [13]  known  0.129  0.982  5.23  0.840
Gordon et al. [13]  learned  0.128  0.959  5.23  0.845
NRS [44]  known  0.137  0.987  5.337  0.830
NRS [44]  learned  0.134  0.952  5.264  0.832
monodepth2 (original)  known  0.115  0.903  4.864  0.877
with our cam. model  known  0.115  0.915  4.848  0.878
with our cam. model  learned  0.113  0.885  4.829  0.878
with our cam. model †  learned  0.109  0.826  4.702  0.884

IV-A2 KITTI dataset

To evaluate our proposed camera model against the baseline methods, we employed the KITTI raw dataset [12] for the evaluation of depth estimation. Similar to the baseline methods, we separated the KITTI raw dataset via the Eigen split [45] into 40,000 images for training, 4,000 images for validation, and 697 images for testing. To compare with the baseline methods, we employed the widely used 640×192 image size as input.

In addition, we used the KITTI odometry dataset with ground truth poses for the evaluation of pose estimation. It is known that the KITTI raw dataset for depth estimation partially includes test images from the KITTI odometry dataset. Hence, following the baseline methods, we trained our models with sequences 00 to 08 and conducted testing on sequences 09 and 10.

Figure 7: Estimated depth images from 256×128 spherical camera images (GS dataset) provided by the baseline methods and our method (full). The top two rows represent the simulator image and the bottom two rows represent the real image. Yellow rectangles in the RGB image highlight the floor surface with reflected ceiling lights. White dashed rectangles on depth estimations highlight the artifacts.

IV-B Training

In training with the GS dataset, we randomly selected 12 real images from the training dataset as $\{I_{A^{\alpha}}\}_{\alpha\in\{f,b\}}$. Then, we randomly selected $\{I_{B^{\alpha}}\}_{\alpha\in\{f,b\}}$ within $\pm 5 (= N_g)$ steps. In addition, we selected 12 simulator images $\{I_{s^{\alpha}}\}_{\alpha\in\{f,b\}}$ and the GT depths $\{\bar{D}_{s^{\alpha}}\}_{\alpha\in\{f,b\}}$ for $J_{depth}$.

To define the transformation matrices $T_{fb}$, $T_{rf}$, and $T_{rb}$, we measured $h_{cam}$ and $l_{cam}$ in Fig. 3 as 0.57 and 0.12 meters, respectively. Additionally, we set $N_b = 32$ for our camera model. The weighting factors for the loss function $J$ were set as $\lambda_1 = 0.85$, $\lambda_2 = 0.15$, $\lambda_s = 0.001$, $\lambda_f = 0.1$, and $\lambda_d = \lambda_p = 1.0$. $\lambda_1$, $\lambda_2$, and $\lambda_s$ were exactly the same as in previous studies. We determined only $\lambda_f$ by trial and error.

The robot footprint shape was defined as a circle with a diameter of 0.5 m. The point cloud of the footprint was set as 1,400 points inside the circle. The number of steps for the robot footprints was $M_r = 5$ for $J_{fcl}$.

The network structures of $f_{depth}()$ and $f_{pose}()$ were exactly the same as those of monodepth2 [25]. In addition, $f_{cam}()$ was designed with three convolutional layers with ReLU activations and two fully connected layers with a sigmoid function to estimate the camera parameters. We used the Adam optimizer with a learning rate of 0.0001 and trained for 40 epochs.

During training, we iteratively calculated $J$ and derived the gradient to update all the models. Hence, we could simultaneously penalize $J_{depth}$ from the simulator and the other terms from the real images to achieve sim2real transfer.

For the KITTI dataset, we employed the source code of monodepth2 [25] and replaced the camera model with our proposed camera model. The other settings were exactly the same as those of monodepth2, to focus the experimentation on our camera model.

IV-C Evaluation of depth estimation

IV-C1 Quantitative Analysis

Table I shows the ablation study of our method and the results of three baseline methods for comparison. We trained the following baseline methods with the same dataset.

monodepth2 [25]: We applied the following half-sphere model [46] to monodepth2 [25], instead of the pinhole camera model, and trained the depth and pose networks. This model assumes that the $UV$ coordinate coincides with the $XY$ coordinate and that the projection surface is a half-sphere.

• Back-projection: $(u_i, v_i, Z_i) \rightarrow (X_i, Y_i)$,

X_i = \frac{Z_i}{\sqrt{1-u_i^2-v_i^2}} x_i, \quad Y_i = \frac{Z_i}{\sqrt{1-u_i^2-v_i^2}} y_i   (14)

• Projection: $(X_i, Y_i, Z_i) \rightarrow (u_i, v_i)$,

u_i = \frac{X_i}{\sqrt{X_i^2+Y_i^2+Z_i^2}}, \quad v_i = \frac{Y_i}{\sqrt{X_i^2+Y_i^2+Z_i^2}}   (15)
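A minimal sketch of this half-sphere baseline model, valid for normalized coordinates inside the unit circle, is shown below; the function names are illustrative.

```python
# Minimal sketch of the half-sphere baseline model of Eqs. (14)-(15),
# assuming u, v are normalized image coordinates with u^2 + v^2 < 1.
import torch

def halfsphere_backproject(u, v, Z):
    s = Z / torch.sqrt(1.0 - u**2 - v**2)
    return s * u, s * v, Z                      # (X, Y, Z); UV == XY here

def halfsphere_project(X, Y, Z):
    r = torch.sqrt(X**2 + Y**2 + Z**2)
    return X / r, Y / r                         # (u, v)
```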

monodepth2 with sim. GT [25, 43]: We added $J_{depth}$ to the cost function of the above baseline to train the models.

Alhashim et al. [43]: We trained the depth network by minimizing $J_{depth}$, which is the same cost function as in [43].

In the quantitative analysis, we evaluate the estimated depth using common metrics. "Abs-Rel," "Sq-Rel," and "RMSE" are calculated as the means of the following values:

  • Abs-Rel : $|D_{gt}-\hat{D}|/D_{gt}$

  • Sq-Rel : $(D_{gt}-\hat{D})^{2}/D_{gt}$

  • RMSE : $((D_{gt}-\hat{D})^{2})^{\frac{1}{2}}$

Here, $D_{gt}$ is the ground truth of the estimated depth image $\hat{D}$. The remaining metric is the ratio of pixels that satisfy $\delta < 1.25$, where $\delta = \max(D_{gt}/\hat{D}, \hat{D}/D_{gt})$.
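For reference, a minimal NumPy sketch of these metrics is shown below, assuming flattened arrays of valid GT and predicted depths.

```python
# Minimal sketch of the evaluation metrics listed above.
import numpy as np

def depth_metrics(D_gt, D_pred):
    abs_rel = np.mean(np.abs(D_gt - D_pred) / D_gt)
    sq_rel = np.mean((D_gt - D_pred) ** 2 / D_gt)
    rmse = np.sqrt(np.mean((D_gt - D_pred) ** 2))
    delta = np.maximum(D_gt / D_pred, D_pred / D_gt)
    a1 = np.mean(delta < 1.25)                  # ratio satisfying δ < 1.25
    return abs_rel, sq_rel, rmse, a1
```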

From Table I, we can observe that our method significantly outperforms all baseline methods. In addition, we confirmed the advantages of our proposed components via the ablation study. The use of the GT depth from the simulator and the learnable axisymmetric camera model were particularly effective. Even though the proposed method is evaluated without scaling by the GT depth, it outperforms the other methods evaluated with scaling. This suggests that our method learns the correct scale via $J_{depth}$ with the GT depth from the simulator.

In Table II, we present the quantitative results for the KITTI dataset. Similar to our method, the baseline methods (shown in Table II) learned the camera model. All methods presented in Table II used ResNet-18 for their depth network to allow for fair comparisons.

In our method with known camera parameters, we set $\{h_q\}_{q=0\cdots N_b}$, $r_x$, $r_y$, $o_x$, and $o_y$ as constant values corresponding to the GT camera intrinsic parameters. $\{b_q\}_{q=0\cdots N_b}$ was designed with equal intervals between 0.0 and 1.0. Our method improved the accuracy of depth estimation by learning the camera model. In addition, our method outperformed all baseline methods, including the original monodepth2.

Moreover, we trained our models by mixing the KITTI, Cityscape [47], bike [5], and GS [11] datasets to evaluate their ability to handle various cameras, and evaluated them on KITTI images. During training, all images were aligned to KITTI's image size by center cropping. For the GS dataset, we used the front-side fisheye images. Our method showed improved performance when datasets from various cameras with various distortions were added, as shown at the bottom of Table II.

IV-C2 Qualitative Analysis

Figure 7 shows the estimated depth images from simulated and real images of the GS dataset. The depth images estimated by monodepth2 are blurred. This is caused by the small size of the input image and the camera model's error. The GT depth from the simulator sharpens the depth images estimated from simulated images in monodepth2 with sim. GT. However, there are many artifacts, particularly on the reflective floor. The method of Alhashim et al. shows errors in the depth estimated from real images; it fails at sim2real transfer because the depth network was trained only on simulated images. In contrast, our method (rightmost column) can accurately estimate depth images for both real and simulated images while reducing these artifacts. Additional examples are provided in the supplemental videos.

Finally, we present the depth images of KITTI in Fig. 1 [b]. Our method can handle the pinhole camera image and estimate an accurate depth image without camera calibration.

IV-D Evaluation of pose estimation

We evaluate our pose network using the KITTI odometry dataset with ground truth poses. Table III shows the mean and standard deviation of the absolute trajectory error over five-frame snippets in the test dataset, following the baseline methods. Although Gordon et al. [13] with a known camera model shows a clear advantage, their performance deteriorates when the camera parameters are learned. In addition, our method, which learns our camera model, shows a clear improvement over the original monodepth2 with known camera intrinsic parameters.

TABLE III: Evaluation of pose estimation on the KITTI odometry dataset. The mean and standard deviation of the absolute trajectory error (ATE) over five-frame snippets are calculated for sequences 09 and 10, respectively. The bold value indicates the better result between "known" and "learned".
Method  Camera  Sequence 09  Sequence 10
Gordon et al. [13]  known  0.009 ± 0.0015  0.008 ± 0.011
Gordon et al. [13]  learned  0.0120 ± 0.0016  0.010 ± 0.010
NRS [44]  known
NRS [44]  learned  0.0150 ± 0.0301  0.0103 ± 0.0073
monodepth2 (original)  known  0.017 ± 0.008  0.015 ± 0.010
with our cam. model  learned  0.0134 ± 0.0068  0.0134 ± 0.0084

V CONCLUSIONS

We proposed a novel learnable axisymmetric camera model for self-supervised monocular depth estimation. In addition, we proposed to supervise the estimated depth using the GT depth from the photo-realistic simulator. By mixing real and simulator images during training, we can achieve a sim2real transfer in depth estimation. Additionally, we proposed loss functions to provide the constraints for the floor to reduce artifacts that may result from reflective floors. The effectiveness of our method was quantitatively and qualitatively validated using the GS and KITTI datasets.

VI ACKNOWLEDGMENT

We thank Kazutoshi Sukigara, Kota Sato, Yuichiro Matsuda, and Yasuaki Tsurumi for measuring 3D environments to collect pairs of simulator images and GT depth images.

References

  • [1] S. Thrun, “Probabilistic robotics,” Communications of the ACM, vol. 45, no. 3, pp. 52–57, 2002.
  • [2] J. Biswas and M. Veloso, “Depth camera based indoor mobile robot localization and navigation,” in 2012 IEEE International Conference on Robotics and Automation.   IEEE, 2012, pp. 1697–1702.
  • [3] T. Zhou et al., “Unsupervised learning of depth and ego-motion from video,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1851–1858.
  • [4] V. Casser et al., “Depth prediction without the sensors: Leveraging structure for unsupervised learning from monocular videos,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 8001–8008.
  • [5] R. Mahjourian et al., “Unsupervised learning of depth and ego-motion from monocular video using 3d geometric constraints,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5667–5675.
  • [6] Z. Yang et al., “Every pixel counts: Unsupervised geometry learning with holistic 3d motion understanding,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 0–0.
  • [7] N. Hirose et al., “Plg-in: Pluggable geometric consistency loss with wasserstein distance in monocular depth estimation,” in 2021 International Conference on Robotics and Automation (ICRA).   IEEE, 2021.
  • [8] https://matterport.com, (accessed August 20, 2021).
  • [9] N. Hirose et al., “Gonet: A semi-supervised deep learning approach for traversability estimation,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2018, pp. 3044–3051.
  • [10] ——, “Vunet: Dynamic scene view synthesis for traversability estimation using an rgb camera,” IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 2062–2069, 2019.
  • [11] ——, “Deep visual mpc-policy learning for navigation,” IEEE Robotics and Automation Letters, vol. 4, no. 4, pp. 3184–3191, 2019.
  • [12] A. Geiger et al., “Vision meets robotics: The kitti dataset,” International Journal of Robotics Research (IJRR), 2013.
  • [13] A. Gordon et al., “Depth from videos in the wild: Unsupervised monocular depth learning from unknown cameras,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 8977–8986.
  • [14] I. Vasiljevic et al., “Neural ray surfaces for self-supervised learning of depth and ego-motion,” in 2020 International Conference on 3D Vision (3DV).   IEEE, 2020, pp. 1–11.
  • [15] R. Garg et al., “Unsupervised cnn for single view depth estimation: Geometry to the rescue,” in European conference on computer vision.   Springer, 2016, pp. 740–756.
  • [16] C. Godard et al., “Unsupervised monocular depth estimation with left-right consistency,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 270–279.
  • [17] V. Guizilini et al., “Packnet-sfm: 3d packing for self-supervised monocular depth estimation,” arXiv preprint arXiv:1905.02693, vol. 5, 2019.
  • [18] C. Wang et al., “Learning depth from monocular videos using direct methods,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2022–2030.
  • [19] Y. Chen et al., “Self-supervised learning with geometric constraints in monocular video: Connecting flow, depth, and camera,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 7063–7072.
  • [20] V. R. Kumar et al., “Unrectdepthnet: Self-supervised monocular depth estimation using a generic framework for handling common camera distortion models,” arXiv preprint arXiv:2007.06676, 2020.
  • [21] V. Guizilini et al., “Robust semi-supervised monocular depth estimation with reprojected distances,” in Conference on Robot Learning, 2020, pp. 503–512.
  • [22] M. Poggi et al., “On the uncertainty of self-supervised monocular depth estimation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • [23] S. Vijayanarasimhan et al., “Sfm-net: Learning of structure and motion from video,” arXiv preprint arXiv:1704.07804, 2017.
  • [24] M. Jaderberg et al., “Spatial transformer networks,” in Advances in neural information processing systems, 2015, pp. 2017–2025.
  • [25] C. Godard et al., “Digging into self-supervised monocular depth estimation,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 3828–3838.
  • [26] Y. Zou et al., “Df-net: Unsupervised joint learning of depth and flow using cross-task consistency,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 36–53.
  • [27] X. Luo et al., “Consistent video depth estimation,” arXiv preprint arXiv:2004.15021, 2020.
  • [28] N. Hirose et al., “Variational monocular depth estimation for reliability prediction,” in 2021 International Conference on 3D Vision (3DV).   IEEE, 2021, pp. 637–647.
  • [29] G. Yang et al., “Transformer-based attention networks for continuous pixel-wise prediction,” arXiv preprint arXiv:2103.12091, 2021.
  • [30] J. H. Lee et al., “From big to small: Multi-scale local planar guidance for monocular depth estimation,” arXiv preprint arXiv:1907.10326, 2019.
  • [31] V. R. Kumar et al., “Monocular fisheye camera depth estimation using sparse lidar supervision,” in 2018 21st International Conference on Intelligent Transportation Systems (ITSC).   IEEE, 2018, pp. 2853–2858.
  • [32] C. Won et al., “Sweepnet: Wide-baseline omnidirectional depth estimation,” in 2019 International Conference on Robotics and Automation (ICRA).   IEEE, 2019, pp. 6073–6079.
  • [33] R. Komatsu et al., “360 depth estimation from multiple fisheye images with origami crown representation of icosahedron,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2020, pp. 10 092–10 099.
  • [34] Z. Cui et al., “Real-time dense mapping for self-driving vehicles using fisheye cameras,” in 2019 International Conference on Robotics and Automation (ICRA).   IEEE, 2019, pp. 6087–6093.
  • [35] V. R. Kumar et al., “Fisheyedistancenet: Self-supervised scale-aware distance estimation using monocular fisheye camera for autonomous driving,” in 2020 IEEE international conference on robotics and automation (ICRA).   IEEE, 2020, pp. 574–581.
  • [36] ——, “Syndistnet: Self-supervised monocular fisheye camera distance estimation synergized with semantic segmentation for autonomous driving,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 61–71.
  • [37] Z. Wang et al., “Image quality assessment: from error visibility to structural similarity,” IEEE transactions on image processing, vol. 13, no. 4, pp. 600–612, 2004.
  • [38] V. Usenko et al., “The double sphere camera model,” in 2018 International Conference on 3D Vision (3DV).   IEEE, 2018, pp. 552–560.
  • [39] J. Fang et al., “Self-supervised camera self-calibration from video,” arXiv preprint arXiv:2112.03325, 2021.
  • [40] S. F. Bhat et al., “Adabins: Depth estimation using adaptive bins,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 4009–4018.
  • [41] F. Xia et al., “Interactive gibson benchmark: A benchmark for interactive navigation in cluttered environments,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 713–720, 2020.
  • [42] ——, “Gibson env v2: Embodied simulation environments for interactive navigation,” Stanford University, Tech. Rep., 2019.
  • [43] I. Alhashim and P. Wonka, “High quality monocular depth estimation via transfer learning,” arXiv preprint arXiv:1812.11941, 2018.
  • [44] I. Vasiljevic et al., “Neural ray surfaces for self-supervised learning of depth and ego-motion,” in 2020 International Conference on 3D Vision (3DV), 2020, pp. 1–11.
  • [45] D. Eigen et al., “Depth map prediction from a single image using a multi-scale deep network,” in Advances in neural information processing systems, 2014, pp. 2366–2374.
  • [46] J. Courbon et al., “A generic fisheye camera model for robotic applications,” in 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems.   IEEE, 2007, pp. 1683–1688.
  • [47] M. Cordts et al., “The cityscapes dataset for semantic urban scene understanding,” in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.